Channel: Data Science, Analytics and Big Data discussions - Latest topics

How to parse a keyword that is in a sentence using NLTK?


@premsheth wrote:

Hi friends,

I am trying to parse CVs from PDF files.
My steps are as follows:
1) Convert the PDF to text and split it into a list of sentences
2) Extract the experience segment, using a list of experience-segment keywords
3) Extract the company name, position title and duration
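
Step 2 above could be sketched roughly as follows. This is only an illustration under assumptions: the heading lists and the function name `extract_experience_segment` are hypothetical and not from my real code, and it assumes each section heading sits on its own line.

```python
# Hypothetical heading keywords; a real list would be longer.
EXPERIENCE_HEADINGS = ["work experience", "employment history", "professional experience"]
OTHER_HEADINGS = ["education", "skills", "projects", "certifications"]


def extract_experience_segment(lines):
    """Return the lines from the experience heading up to the next known heading."""
    segment = []
    inside = False
    for line in lines:
        key = line.strip().lower()
        if key in EXPERIENCE_HEADINGS:
            inside = True   # start collecting at the experience heading
        elif inside and key in OTHER_HEADINGS:
            break           # the next section heading ends the segment
        if inside:
            segment.append(line.strip())
    return segment


cv_lines = [
    "Summary",
    "Work Experience",
    "Software Engineer",
    "January 2009 to Present",
    "Education",
    "BSc Computer Science",
]
print(extract_experience_segment(cv_lines))
```

The heading itself is kept as the first element, which matches the shape of the work_segment list shown further down.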

Now I have a problem extracting the company name and position title:
work_segment = ['Work Experience', 'Software Engineer', 'Digital, Data and Technology Services (DDTS), Department for Environment, Food & Rural Affairs (Defra)', 'January 2009 to Present']

Now I want to parse the company name from the work_segment list. I also have a list of company names. Here, Digital, Data and Technology Services (DDTS) is the company name, and it is included in my company names list.
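
One possible way to do this lookup, sketched under assumptions: match each known company name against the raw text of every line instead of against POS-filtered words, so that commas, "and" and brackets in the name are preserved. The function `find_company` and the second entry in `company_names` are hypothetical placeholders.

```python
import re


def find_company(text, company_names):
    """Return the longest known company name that occurs in text, or None."""
    # Try longer names first so a full name wins over a shorter name it contains.
    for name in sorted(company_names, key=len, reverse=True):
        # re.escape keeps characters such as "(" and "&" literal;
        # IGNORECASE makes the comparison case-insensitive.
        if re.search(re.escape(name), text, flags=re.IGNORECASE):
            return name
    return None


work_segment = [
    'Work Experience',
    'Software Engineer',
    'Digital, Data and Technology Services (DDTS), Department for Environment, Food & Rural Affairs (Defra)',
    'January 2009 to Present',
]
# Hypothetical stand-in for the real company list.
company_names = ['Digital, Data and Technology Services (DDTS)', 'Acme Ltd']

companies = [m for m in (find_company(line, company_names) for line in work_segment) if m]
print(companies)
```

The same function could be reused for position titles by passing the title keyword list instead of the company list.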

I tried to use the Stanford library so that it could give me an Organisation, Location or Person tag, but it was not working.

I also tried the following code. It works fine for some PDFs, but sometimes it does not work, even when the company name and position title are included in the lists.

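import re
from collections import defaultdict

import nltk

# work_segment, key_words and company_names are lists defined elsewhere.
work_experience = defaultdict(dict)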
def extract_work_experience():
    noun_phrases = []
    comp = []
    pos_tit = []
    date = []
    title = []
    compan = []

    for i,text in enumerate(work_segment):
        lines = nltk.word_tokenize(text)
        tags = nltk.pos_tag(lines)

        nouns = [word for word,pos in tags if(pos == "NN" or pos == "NNP" )]
        company = " ".join(nouns)
        comp.append(company)
        
        print(tags)


        # Use a separate name here so the `title` result list is not overwritten.
        title_words = [word for word,pos in tags if(pos == "NNP" or pos == 'NN' )]
        ti = " ".join(title_words)
        pos_tit.append(ti)
    print(pos_tit)
    #print(comp)
    
    for pos in pos_tit:
        print("=== extracted NNP ====")
        print(pos)
        for tit in key_words:
            print("======Title list =====")
            print(tit)
            if pos.lower() == tit.lower():
                title.append(pos)
                print(title)
            elif tit.lower() in pos.lower():
                title.append(tit)
                
    #print(title)
                
    for c in comp:
        #print(c)
        for co in company_names:
            #print(co)
            if co.lower() == c.lower():
                compan.append(c)
            # Lowercase both sides, and use elif so an exact match is not added twice.
            elif co.lower() in c.lower():
                compan.append(co)

                
    for text in work_segment:
        matches = re.findall(r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})',text)
        #print(matches)
        if matches:
            #print(len(matches))
            if len(matches) == 1:
                ty = {'start_year':matches[0],'end_year':"Present"}
            else:
                ty = {'start_year':matches[0],'end_year':matches[1]}
            date.append(ty)
    
    print(date)
    print(title)
    print(compan)
    
    # Build one record per position. zip() stops at the shortest list,
    # so a missing company or date cannot raise an IndexError.
    for i, (t, c, d) in enumerate(zip(title, compan, date), start=1):
        work_experience["employer" + str(i)].update({
            'company': c,
            'Position_title': t,
            'Time duration': d,
        })
    
    e = dict(work_experience)
    
    return e 

Questions:

  1. How can I parse the company name if it is included in my company names list?
  2. I used NLTK to tag the words and tried to parse all words tagged 'NN' or 'NNP'. How can I get only a run of consecutive words that have the same tag?
    For example, I tagged the words from a sentence:
    [('Digital', 'NNP'), (',', ','), ('Data', 'NNP'), ('and', 'CC'), ('Technology', 'NNP'), ('Services', 'NNP'), ('(', '('), ('DDTS', 'NNP'), (')', ')'), (',', ','), ('Department', 'NNP'), ('for', 'IN'), ('Environment', 'NNP'), (',', ','), ('Food', 'NNP'), ('&', 'CC'), ('Rural', 'NNP'), ('Affairs', 'NNPS'), ('(', '('), ('Defra', 'NNP'), (')', ')')]
    [('January', 'NNP'), ('2009', 'CD'), ('to', 'TO'), ('Present', 'VB')]
    Now I want to parse only [('Digital', 'NNP'), (',', ','), ('Data', 'NNP'), ('and', 'CC'), ('Technology', 'NNP'), ('Services', 'NNP')], not all of the 'NNP' tags.
    How can I do that?
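
To make the second question concrete, here is a minimal sketch of the grouping I have in mind, written in plain Python over the tagged output above. It assumes that a name is a run of 'NNP'/'NNPS' words that may be joined by commas or conjunctions, and that any other tag (a bracket, a preposition) ends the run. The names `NAME_TAGS`, `LINK_TAGS` and `proper_noun_chunks` are hypothetical, not NLTK API.

```python
NAME_TAGS = {"NNP", "NNPS"}   # tags that can be part of a name
LINK_TAGS = {",", "CC"}       # tags allowed between name words


def proper_noun_chunks(tagged):
    """Split (word, tag) pairs into runs of proper nouns joined by commas or conjunctions."""
    chunks, current = [], []
    for word, tag in tagged:
        # A link token is only kept when it follows a name word.
        if tag in NAME_TAGS or (tag in LINK_TAGS and current):
            current.append((word, tag))
        else:
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    # Drop link tokens left dangling at the end of a run.
    for chunk in chunks:
        while chunk and chunk[-1][1] in LINK_TAGS:
            chunk.pop()
    return chunks


tagged = [('Digital', 'NNP'), (',', ','), ('Data', 'NNP'), ('and', 'CC'),
          ('Technology', 'NNP'), ('Services', 'NNP'), ('(', '('), ('DDTS', 'NNP'),
          (')', ')'), (',', ','), ('Department', 'NNP'), ('for', 'IN'),
          ('Environment', 'NNP'), (',', ','), ('Food', 'NNP'), ('&', 'CC'),
          ('Rural', 'NNP'), ('Affairs', 'NNPS'), ('(', '('), ('Defra', 'NNP'), (')', ')')]

chunks = proper_noun_chunks(tagged)
print(chunks[0])
```

With this input the first chunk is the six-token run from 'Digital' to 'Services', because the opening bracket before 'DDTS' ends the run. The later chunks (for example 'DDTS' and 'Department') would still need to be filtered, perhaps against the company names list.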

If anyone has any ideas or answers, I would appreciate it.
Thanks in advance.

Posts: 1

Participants: 1
