Resume Parser: Code4Goal - Coding Contest

This document describes a resume parser solution that extracts information from resumes in various formats (e.g. PDF, DOC, HTML) and organizes it into a JSON format. The parser works by applying rules defined in a dictionary to each line of the resume. The rules are used to extract things like contact information, work history, skills and social media profiles. The solution is implemented as a Node.js CLI app that takes resume files as input and outputs the extracted JSON data.

Resume parser

Solution for Code4Goal - Coding Contest

Authored and maintained by Lizurchik Alexey, 2015

The Problem
Companies often struggle to sort through the large volumes of CVs / Resumes they receive for
advertised job roles. To cut the time spent sorting and to give them a consistent benchmark for
comparing candidates, you have been tasked with helping to solve this problem.

Contest
Develop a parser that can process CVs / Resumes in Word (.doc or .docx), RTF, TXT, PDF, or
HTML format and extract the necessary information into a predefined JSON format. If a
CV / Resume contains any social media profile links, the solution should also parse the
public social profile web pages and organize that data in JSON format (e.g. LinkedIn public
profile, GitHub, etc.).

Solution
This Resume parser can run through an unlimited number of Resumes and extract the relevant
information from them. With a full-featured installation it supports most of the commonly used
formats provided by textract:

• HTML
• PDF
• DOC
• RTF
• DOCX
• XLS
• PPTX
• DXF
• PNG
• JPG
• GIF
• application/javascript
• All text/* mime-types

Pre-Requirements
The current solution was tested on Windows 7 x64 Ultimate (with the babun shell), but it should
also run on OS X and Linux. The application depends heavily on the text extraction library textract.

Fast install
The project is a Node.js CLI application with a few dependencies. If you already have Node.js
installed, you can just clone this repo and run npm install:

git clone [email protected]:likerRr/code4goal-resume-parser.git

npm install

Step-by-step fresh installation

• First, go to the Node.js site, then download and set it up for your platform

• Then clone this repo: git clone [email protected]:likerRr/code4goal-resume-parser.git

• Run npm install in a terminal from the project's root folder to set up the dependencies

• At this point the application will work, but by default it supports only the .TXT and .HTML
text formats. For broader format coverage you should install at least .PDF (and .DOC)
support. Here are the instructions from the textract README file:

  o PDF extraction requires pdftotext to be installed (link in the textract README)

  o DOC extraction requires catdoc to be installed (link in the textract README), unless you
  are on OS X, in which case textutil (installed by default) is used

  o DOCX extraction requires unzip to be available (e.g. sudo apt-get install unzip
  on Ubuntu)

Please note that installing support for every format is not required, though it is preferable. In my
case, I could not get catdoc set up for .DOC files under Windows 7, so I tested only the .TXT,
.HTML, and .PDF formats, but the remaining formats should work as well :)

Run
When installation is finished, it's time to run the application. Put some Resume files into the
/public directory (it already contains 3 for testing) and run node app.js in a terminal from the
project's root. You can then find the results in the /compiled folder (each file there contains the
JSON string of the parsed data).

Execution is presented as a dialog between an HR manager, who has a lot of Resumes to work
through, and ParseBoy, who volunteered to help. I thought it should have some fun.

How it works
The parser is built around a dictionary of rules that describe how to handle a Resume file. All
rules live in the /src/dictionary.js file, which holds a JavaScript object with the
following structure:

{
titles: {},
profiles: [],
inline: {},
regular: {}
}

The values under each of these keys (titles, profiles, inline, regular) are converted to regular
expressions, which are applied under specific conditions:

• titles - fires on each row of the file. If a line matches a title, the parser captures all the text
between that title and the next one, excluding the title line itself. For example, suppose we have
this dictionary file:

{
  titles: {
    // values are the words that may signal this key in the Resume
    objective: ['objective', 'objectives'],
    summary: ['summary'],
  }
}

And suppose the Resume text is:

OBJECTIVE

Seeking a challenging position to use my software Web development and process optimization
skills.

SUMMARY

I worked on a wide range of products including building advanced dynamic multi language web
sites, internal and external API's, well as creating new internal workflows.

If we now run the application, it will go through the following Application Loop (AL):

• Strip unnecessary \n, \r, \t characters from the Resume file and trim all lines

• Compile the rules to regular expressions

• Split the file into lines, delimited by \n

• Check each line for a match against each title rule

• When a match is found, capture the text between the current title and the next title, or
until EOF

• Save the captured text (if any) under the title key (objective and/or summary)

So, following this loop, we end up with the following JSON file:

{
  objective: 'Seeking a challenging position to use my software Web development and process optimization skills.',
  summary: 'I worked on a wide range of products including building advanced dynamic multi language web sites, internal and external API\'s, well as creating new internal workflows.'
}
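
The titles pass described above can be sketched roughly as follows. This is a minimal illustration, not the code from /src/dictionary.js or the application itself: the function names compileTitles and parseTitles are hypothetical, and matching a title only when it fills a whole line (case-insensitively, with an optional trailing colon) is an assumption made here to keep body text from being mistaken for a heading.

```javascript
// Hypothetical dictionary fragment, mirroring the `titles` example above.
const titles = {
  objective: ['objective', 'objectives'],
  summary: ['summary'],
};

// Compile each title's aliases into one case-insensitive, whole-line regex.
// Assumption: aliases are plain words, so they are not regex-escaped here.
function compileTitles(dict) {
  return Object.entries(dict).map(([key, aliases]) => ({
    key,
    regex: new RegExp('^(?:' + aliases.join('|') + ')\\s*:?$', 'i'),
  }));
}

// Walk the lines: a title line opens a section, the next title (or EOF) closes it.
function parseTitles(text, dict) {
  const rules = compileTitles(dict);
  const lines = text
    .replace(/\r/g, '')
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const result = {};
  let current = null;
  for (const line of lines) {
    const rule = rules.find((r) => r.regex.test(line));
    if (rule) {
      current = rule.key; // start collecting under this title key
      result[current] = '';
    } else if (current) {
      result[current] += (result[current] ? ' ' : '') + line;
    }
  }
  return result;
}

const sample = [
  'OBJECTIVE',
  '',
  'Seeking a challenging position.',
  '',
  'SUMMARY',
  '',
  'I worked on a wide range of products.',
].join('\n');

console.log(parseTitles(sample, titles));
```

Text that appears before the first recognised title is simply ignored in this sketch; the real parser may treat it differently.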

• profiles - fires on each row of the file. If a profile rule is an array, its first element is the
name of the profile (the URL to look for) and its second element is a handler. If a profile rule is
just a string, the parser only tries to find a matching URL, without parsing the page. Example:

profiles: [
  // assumes cheerio and a download(url, callback) helper are in scope
  ['github.com', function(url, Resume, profilesWatcher) {
    download(url, function(data, err) {
      if (data) {
        // scrape the public profile fields from the downloaded page
        var $ = cheerio.load(data),
          fullName = $('.vcard-fullname').text(),
          location = $('.octicon-location').parent().text(),
          mail = $('.octicon-mail').parent().text(),
          link = $('.octicon-link').parent().text(),
          clock = $('.octicon-clock').parent().text(),
          company = $('.octicon-organization').parent().text();

        Resume.addObject('github', {
          name: fullName,
          location: location,
          email: mail,
          link: link,
          joined: clock,
          company: company
        });
      } else {
        // no early return here, so the counter below is updated on failure too
        console.log(err);
      }
      // mark this profile as finished
      profilesWatcher.inProgress--;
    });
  }],
  // plain string: only the matching URL is recorded
  'stackoverflow.com'
],

It looks rather big, but it is very flexible.

Here we can see that profiles contains two rules: github.com with a callback, and
stackoverflow.com. When a profile rule enters the Application Loop (AL) and has a valid callback,
the parser requests the profile page from the Internet and parses the data on that page according
to the rules in the callback. It then places all the data into the Resume object under the given key
(github in our case). If the rule is just a string and it matches a row in the AL, the parser simply
puts the profile link under the profile key in the Resume object.
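
The difference between the two kinds of profile rule can be illustrated with a small synchronous sketch. Everything here is an assumption made for illustration: applyProfileRules and escapeRegExp are hypothetical names, the URL pattern is a guess at what a reasonable match looks like, the handler is a stand-in that does not download anything, and storing plain links in a profiles array is only one possible layout.

```javascript
// Hypothetical mixed rule list: a [pattern, handler] pair and a plain string.
const profiles = [
  // Stand-in handler: a real one would download and scrape the page.
  ['github.com', (url, resume) => { resume.github = { link: url }; }],
  'stackoverflow.com',
];

// Escape regex metacharacters so "github.com" matches literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Apply every profile rule to one line of the resume.
function applyProfileRules(line, rules, resume) {
  for (const rule of rules) {
    const [domain, handler] = Array.isArray(rule) ? rule : [rule, null];
    const regex = new RegExp(
      '(?:https?://)?(?:www\\.)?' + escapeRegExp(domain) + '/[^\\s]*',
      'i'
    );
    const match = line.match(regex);
    if (!match) continue;
    if (typeof handler === 'function') {
      handler(match[0], resume); // rule with callback: let it fill the resume
    } else {
      resume.profiles = resume.profiles || [];
      resume.profiles.push(match[0]); // plain string rule: just keep the link
    }
  }
}

const resume = {};
applyProfileRules('GitHub: https://github.com/example-user', profiles, resume);
applyProfileRules('SO: stackoverflow.com/users/1/someone', profiles, resume);
console.log(resume);
```

In the real application the handler is asynchronous, which is why the original example decrements profilesWatcher.inProgress when it finishes.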

• inline - fires on each row of the file. The rule is converted to a regular expression that
captures all the data following the matched expression:

expr+":?[\\s]*(.*)"

Example:

inline: {
skype: 'skype'
},

Text:

skype: sweet-liker

The result will be a skype key with the value sweet-liker in the Resume object. So the dictionary
can be extended to cover simple single-line data, e.g. an address, a first name, or whatever.

Note that these rules are unreliable, because they can pick up unrelated text from the context,
e.g. "I don't have a skype, but I have IM". After parsing that string, the Resume data will contain
the key skype with the value but I have IM. So use them at your own risk.
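
A minimal sketch of how an inline rule could be compiled and applied, using the expression shown above. The helper names compileInline and parseInline are hypothetical, and the case-insensitive flag and the trimming of the captured value are assumptions rather than confirmed behaviour of the parser.

```javascript
// Build the inline regex from a rule expression, following expr + ":?[\\s]*(.*)".
function compileInline(expr) {
  return new RegExp(expr + ':?[\\s]*(.*)', 'i'); // the "i" flag is an assumption
}

// Apply every inline rule to one line and collect whatever follows the keyword.
function parseInline(line, inlineRules) {
  const found = {};
  for (const [key, expr] of Object.entries(inlineRules)) {
    const match = line.match(compileInline(expr));
    if (match) found[key] = match[1].trim();
  }
  return found;
}

const inline = { skype: 'skype' };

console.log(parseInline('skype: sweet-liker', inline)); // { skype: 'sweet-liker' }

// The rule also fires here, which is the false positive described above.
console.log(parseInline("I don't have a skype, but I have IM", inline));
```

Because the pattern is not anchored to the start of the line, any line that merely mentions the keyword produces a value, which is the reason for the warning above.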

• regular - fires on the full text of the file. It simply searches for the first match of each
regular expression, e.g.:

regular: {
  name: [
    /([A-Z][a-z]*)(\s[A-Z][a-z]*)/
  ],
  email: [
    /([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})/
  ],
  phone: [
    /((?:\+?\d{1,3}[\s-])?\(?\d{2,3}\)?[\s.-]?\d{3}[\s.-]\d{4,5})/
  ]
}

The parser will try to find name, email, and phone using the given expressions.
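
As a rough sketch of the regular pass, the snippet below runs the name and email expressions from the example against the whole text and keeps the first hit for each key. The function name parseRegular and the choice to store the full match (rather than a capture group) are assumptions made for illustration.

```javascript
// Name and email expressions copied from the `regular` example above.
const regular = {
  name: [/([A-Z][a-z]*)(\s[A-Z][a-z]*)/],
  email: [/([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})/],
};

// Search the whole text once per key and keep the first expression that matches.
function parseRegular(text, rules) {
  const found = {};
  for (const [key, expressions] of Object.entries(rules)) {
    for (const regex of expressions) {
      const match = text.match(regex);
      if (match) {
        found[key] = match[0]; // assumption: keep the full matched string
        break;
      }
    }
  }
  return found;
}

const text = 'John Doe\nemail: john.doe@example.com';
console.log(parseRegular(text, regular));
```

Note that the name expression takes the first two capitalised words anywhere in the text, so it only works when the name appears before any other capitalised pair.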

Generic format
This solution has no fixed JSON output format, because a key is filled in only if its rule in the
dictionary matches. The full set of data that may be extracted from a Resume can therefore have the
following format:

{
objective: '',
summary: '',
technology: '',
experience: '',
education: '',
skills: '',
languages: '',
cources: '',
projects: '',
links: '',
contacts: '',
positions: '',
profiles: '',
awards: '',
honors: '',
additional: '',
certification: '',
interests: '',
github: {
name: '',
location: '',
email: '',
link: '',
joined: '',
company: ''
},
linkedin: {
summary: '',
name: '',
positions: [],
languages: [],
skills: [],
educations: [],
volunteering: [],
volunteeringOpportunities: []
},
skype: '',
name: '',
email: '',
phone: ''
}
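
To illustrate why the output has no fixed shape, here is a small sketch of a container that only holds keys for which something was found. The class is hypothetical: addObject is named after the call in the profiles example, but its merge behaviour, the addKey method, and the toJSON hook are all assumptions made here.

```javascript
// Hypothetical container for parsed data; only matched rules add keys.
class Resume {
  constructor() {
    this.parts = {};
  }

  // Store a plain string value, skipping empty results.
  addKey(key, value) {
    if (typeof value === 'string' && value.trim()) {
      this.parts[key] = value.trim();
    }
  }

  // Store (or extend) a nested object such as a scraped profile.
  addObject(key, obj) {
    this.parts[key] = { ...(this.parts[key] || {}), ...obj };
  }

  // JSON.stringify calls this, so only collected keys are serialised.
  toJSON() {
    return this.parts;
  }
}

const parsed = new Resume();
parsed.addKey('name', 'John Doe');
parsed.addKey('skype', ''); // nothing found, so no key is created
parsed.addObject('github', { name: 'John Doe' });

console.log(JSON.stringify(parsed, null, 2));
```

With this approach two Resumes can produce JSON files with different sets of keys, which matches the behaviour described above.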

Extending
All the 'action' happens in the dictionary.js file. For now it contains only the basic rules that I
came across while developing this solution, but it is very flexible (although a bit complicated) and
extensible. Just add your rule alongside the existing ones, following the main principles, and enjoy!

Vocabulary
• Resume object - the place where all parsed data is saved. After the whole document has been
parsed, it is stringified to JSON and saved into the /compiled folder.

• AL - Application Loop:

  o Strip unnecessary \n, \r, \t characters from the Resume file and trim all lines

  o Compile the rules to regular expressions (under the hood)

  o Split the file into lines, delimited by \n

  o Check each line for a match against each title rule

  o When a match is found, capture the text between the current title and the next title, or
  until EOF

  o Save the captured text (if any) under the title key (objective and/or summary)

In action
