@premsheth wrote:
Hi friends,
I am trying to parse CVs from PDF files.
My steps are as follows:
1) Convert the PDF to text and split it into a list of sentences.
2) Extract the experience segment using a list of experience-section keywords.
3) Extract the company name, position title and duration.
Now I have a problem extracting the company name and position title. For example:
work_segment = ['Work Experience', 'Software Engineer', 'Digital, Data and Technology Services (DDTS), Department for Environment, Food & Rural Affairs (Defra)', 'January 2009 to Present']
Now I want to parse the company name from the work_segment list. I also have a list of company names. Here, Digital, Data and Technology Services (DDTS) is the company name, and it is included in my company names list.
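To make the goal concrete, here is a minimal sketch of the kind of lookup I have in mind: a case-insensitive substring check of each known company name against the raw lines of the segment, without any tagging. The helper name `find_known_companies` and the entry 'Acme Ltd' are only placeholders for illustration, and this assumes the name appears in the CV spelled the same way as in my list.

```python
def find_known_companies(segment, known_companies):
    """Return the known company names that occur in any line of the segment."""
    found = []
    # Try longer names first so a full name is reported before a shorter
    # name that happens to be contained in it.
    for name in sorted(known_companies, key=len, reverse=True):
        needle = name.lower()
        for line in segment:
            # Case-insensitive substring match on the untokenized line.
            if needle in line.lower() and name not in found:
                found.append(name)
                break
    return found


work_segment = [
    'Work Experience',
    'Software Engineer',
    'Digital, Data and Technology Services (DDTS), Department for '
    'Environment, Food & Rural Affairs (Defra)',
    'January 2009 to Present',
]
# 'Acme Ltd' is a made-up entry that should not match.
company_names = ['Digital, Data and Technology Services (DDTS)', 'Acme Ltd']

print(find_known_companies(work_segment, company_names))
```

Matching on the raw line keeps the commas and parentheses that are lost once only the 'NN' and 'NNP' words are joined back together, which may be why a comparison against tagged words misses names like this one.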
I tried to use the Stanford library so that it could give me Organisation, Location or Person tags, but it did not work.
I also tried the following code. It works fine for some PDFs, but sometimes it does not work when company names and position titles are also included in the lists.
import re
from collections import defaultdict

import nltk

# work_segment, key_words and company_names are lists defined elsewhere.
work_experience = defaultdict(dict)

def extract_work_experience():
    noun_phrases = []
    comp = []
    pos_tit = []
    date = []
    title = []
    compan = []
    for i, text in enumerate(work_segment):
        lines = nltk.word_tokenize(text)
        tags = nltk.pos_tag(lines)
        nouns = [word for word, pos in tags if (pos == "NN" or pos == "NNP")]
        company = " ".join(nouns)
        comp.append(company)
        print(tags)
        title = [word for word, pos in tags if (pos == "NNP" or pos == 'NN')]
        ti = " ".join(title)
        pos_tit.append(ti)
    print(pos_tit)
    #print(comp)
    for pos in pos_tit:
        print("=== extracted NNP ====")
        print(pos)
        for tit in key_words:
            print("======Title list =====")
            print(tit)
            if pos.lower() == tit.lower():
                title.append(pos)
                print(title)
            elif tit.lower() in pos.lower():
                title.append(tit)
                #print(title)
    for c in comp:
        #print(c)
        for co in company_names:
            #print(co)
            if co.lower() == c.lower():
                compan.append(c)
            if co.lower() in c:
                compan.append(co)
    for text in work_segment:
        matches = re.findall(r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})', text)
        #print(matches)
        if matches:
            #print(len(matches))
            if len(matches) == 1:
                ty = {'start_year': matches[0], 'end_year': "Present"}
            else:
                ty = {'start_year': matches[0], 'end_year': matches[1]}
            date.append(ty)
    print(date)
    print(title)
    print(compan)
    for i in range(0, len(title)):
        vars()["employer" + str(i)] = {}
        vars()["employer" + str(i)]['company'] = compan[i]
        vars()["employer" + str(i)]['Position_title'] = title[i]
        vars()["employer" + str(i)]['Time duration'] = date[i]
        #print(vars()["employer" + str(i)])
        work_experience["employer" + str(i+1)].update(vars()["employer" + str(i)])
    e = dict(work_experience)
    return e
Questions:
- How can I parse the company name if it is included in my company names list?
- I used NLTK to tag the words and tried to parse all 'NN' and 'NNP' tagged words. Now, how can I get only a certain run of words that have the same tag?
For example, I tagged the words from two of the sentences:
[('Digital', 'NNP'), (',', ','), ('Data', 'NNP'), ('and', 'CC'), ('Technology', 'NNP'), ('Services', 'NNP'), ('(', '('), ('DDTS', 'NNP'), (')', ')'), (',', ','), ('Department', 'NNP'), ('for', 'IN'), ('Environment', 'NNP'), (',', ','), ('Food', 'NNP'), ('&', 'CC'), ('Rural', 'NNP'), ('Affairs', 'NNPS'), ('(', '('), ('Defra', 'NNP'), (')', ')')]
[('January', 'NNP'), ('2009', 'CD'), ('to', 'TO'), ('Present', 'VB')]
Now I just want to parse only
[('Digital', 'NNP'), (',', ','), ('Data', 'NNP'), ('and', 'CC'), ('Technology', 'NNP'), ('Services', 'NNP')]
and not all of the 'NNP' tags.
How can I do that? If anyone has any ideas or answers, I would appreciate it.
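One direction I am considering, as a rough sketch only: take the leading run of tokens whose tags are noun tags or connectors (a comma or 'CC') and stop at the first other tag, such as the opening parenthesis. The function name `leading_name_chunk` and the two tag sets are my own assumptions, and the tagged list is hard-coded from the output shown earlier so the snippet runs without NLTK.

```python
from itertools import takewhile

# Assumed tag sets: tags that can be part of a name, and tags that may join
# name words together (a comma or a coordinating conjunction).
NAME_TAGS = {"NN", "NNP", "NNPS"}
JOIN_TAGS = {",", "CC"}


def leading_name_chunk(tagged):
    """Return the leading run of (word, tag) pairs that look like one name."""
    allowed = NAME_TAGS | JOIN_TAGS
    # Keep tokens from the start until the first tag outside the allowed set.
    chunk = list(takewhile(lambda pair: pair[1] in allowed, tagged))
    # Drop connectors left dangling at the end of the run.
    while chunk and chunk[-1][1] in JOIN_TAGS:
        chunk.pop()
    return chunk


tagged = [('Digital', 'NNP'), (',', ','), ('Data', 'NNP'), ('and', 'CC'),
          ('Technology', 'NNP'), ('Services', 'NNP'), ('(', '('),
          ('DDTS', 'NNP'), (')', ')'), (',', ','), ('Department', 'NNP'),
          ('for', 'IN'), ('Environment', 'NNP')]

print(leading_name_chunk(tagged))
```

On this input the run stops at the '(' token, so it gives the six pairs I want. It would not help for a line where the company name is followed directly by other proper nouns with no parenthesis or other tag in between, so it is only a partial idea.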
Thanks in advance.