Automated Resume Parsing: A Natural Language Processing Approach
2023 7th International Conference on Computation System and Information Technology for Sustainable Solutions (CSITSS) | 979-8-3503-4314-4/23/$31.00 ©2023 IEEE | DOI: 10.1109/CSITSS60515.2023.10334236

Thatavarthi Giri Sougandh
Dept. of Computer Science and Engineering
Amrita School of Computing, Bengaluru
Amrita Vishwa Vidyapeetham, India
[email protected]

Sai Snehith K
Dept. of Computer Science and Engineering
Amrita School of Computing, Bengaluru
Amrita Vishwa Vidyapeetham, India
[email protected]

Nithish Sagar Reddy
Dept. of Computer Science and Engineering
Amrita School of Computing, Bengaluru
Amrita Vishwa Vidyapeetham, India
[email protected]

Meena Belwal
Dept. of Computer Science and Engineering
Amrita School of Computing, Bengaluru
Amrita Vishwa Vidyapeetham, India
[email protected]

Authorized licensed use limited to: Gannon University. Downloaded on December 13,2024 at 04:10:43 UTC from IEEE Xplore. Restrictions apply.

Abstract— Resume parsing is the extraction of critical information
from resumes, such as contact information, skills, education, and job
experience. In this work, we propose a resume parser that integrates
two methodologies: a Named Entity Recognition (NER) model approach
and a Keyword and Pattern Matching model approach using Regular
Expressions (Regex), both built on NLP libraries and methods. The NER
model uses NLP libraries to recognize and categorize named entities,
such as names, phone numbers, and email addresses, in the resume
text. To provide a thorough profile overview, it also organizes the
sections on skills, education, and job experience. In addition, the
Keyword and Pattern Matching model uses Regex and pre-established
rules to extract specific information, such as job titles, company
names, and years of experience. The parser uses NLP-based approaches
to increase accuracy and performance, allowing it to handle various
resume formats and deliver reliable results. Performance assessments
show how well the parser extracts crucial information, even from
resumes with various layouts and formats. With its use of NLP
libraries and methodologies, together with its accuracy and
processing speed, the parser appears to be a useful tool for
processing large numbers of resumes in real-world applications.

Keywords— Natural language processing (NLP), Named Entity
Recognition (NER), pattern matching, Regular Expressions.

I. INTRODUCTION

Companies receive a large number of resumes for each job opening in
today's competitive job market. Checking and collecting these resumes
manually is a time-consuming, error-prone, and non-scalable process.
Automated resume parsing systems have become an essential tool for
recruiters and HR specialists to handle these issues. The term
"resume parsing" describes the structured extraction of relevant data
from resumes, such as contact information, skills, education, and
work experience. It makes possible faster candidate evaluation,
better talent acquisition, and better hiring decisions. Employers and
job seekers may both profit substantially from automated resume
processing. By automatically retrieving and structuring candidate
information, it speeds up the resume screening process for companies,
enabling hiring managers to concentrate on assessing qualified
candidates. In addition, resume parsing systems offer a standard
structure for storing and comparing candidate data, facilitating
effective candidate database management and application tracking.

Resume processing systems are useful for job seekers as well. These
systems make sure that their resumes are correctly processed and that
the right skills and qualifications are highlighted. Candidates
improve their chances of being shortlisted by formatting their CV
data. Additionally, resume parsing tools let job seekers tailor their
resumes to specific job requirements so that they stand out in a
crowded field. This improves candidates' visibility to potential
employers and enables them to present their qualifications
effectively.

To identify and categorize named entities, including names, contact
information, skills, education, and work experience, the NER model
makes use of machine learning algorithms and NLP approaches. The
Keyword and Pattern Matching model complements the NER method by
extracting specific information based on structured patterns and
relevant keywords, using Regex and defined rules. This combined
technique aims for high efficiency and precision while processing
resumes and offers customization possibilities. The parser's
performance in accurately extracting information from resumes with
various layouts and formats has been demonstrated through an
assessment using distinct resume datasets. A resume parser is a
useful tool for handling large quantities of resumes in real-world
situations, saving time and effort while supporting informed hiring
decisions. The parser's accuracy and processing speed enable
recruiters to quickly find the most appropriate candidates.

The paper is structured as follows. A review of related work on
resume parsing is provided in Section 2. The methodology for our
resume parser is presented in Section 3, which describes the NER
model approach and the Keyword and Pattern Matching model approach in
detail. The output of the models and the evaluation results are shown
in Section 4, which also highlights the efficiency and performance of
our system. Section 5 concludes the work, summarising the most
important findings and suggesting new paths for future research on
automated resume processing.

II. LITERATURE SURVEY

Resume parsing has been studied in multiple research papers that
propose various techniques to address the difficulties of extracting
appropriate data from resumes. The methodology outlined by Amit
Pimpalkar et al. in [1] involves several phases: text extraction from
files, pre-processing to remove unnecessary components, feature
extraction, label encoding, construction of a resume classification
model using machine learning algorithms, and ranking based on the
extracted information. To determine the effectiveness of the method,
performance metrics including precision, recall, F1, and accuracy
were used.

Shujaat Hussain et al. concentrated on an input and output pipeline
for resume parsing [2]. Their work highlights the value of extracting
text from resumes regardless of font styles, sizes, or document
layouts. Text block classification is seen as an essential step,
since resumes often have hierarchical structures with related topics.
For classifying text blocks, machine learning methods and neural
network models are considered, with the suggested pipeline relying on
the Boolean Naive Bayes algorithm. This probabilistic model enables
the classification of various resume information blocks, such as
education, skills, experience, hobbies, and personal data.

A CV parser model that extracts information from resumes posted on
job portals was suggested by Pandey et al. in [3]. The concept
involves the digitization of manually supplied resumes and the
extraction of relevant details, including educational history and
personal information. Recruiters use this information to make
decisions once it has been retrieved. The CV parsing method reduces
bottlenecks and improves application management systems.

Singhal et al. specified a model that was developed in three phases
[4]. The initial phase focuses on cleansing the data and extracting
text from PDF resumes. The skills in the dataset are converted into
tokens, which are then vectorized. In the following phase, the
K-Means clustering technique is used to group the skills and predict
the category or rank for new resumes. The program then identifies the
top candidates for recruiters by comparing resumes with job
requirements.

Ishtiyaq et al. focused on developing Named Entity Recognition (NER)
as a part of Information Extraction to categorize words and phrase
segments into pre-defined entities such as GPE, CARDINAL, and EVENT
in [5]. The study involves scraping news articles on possible viral
outbreaks. To accomplish this, the article uses the "en_core_web_sm"
NER model from the SpaCy library, which evaluates the input text and
classifies words into various entities. The entities of interest,
namely GPE, CARDINAL, and EVENT, are then filtered and arranged using
keywords based on relevancy. Compared with standard methods such as
summarization, the NER methodology performs better at detecting
relevant elements related to potential viral outbreaks.

Sabareesh et al. described their work on extracting medical
terminology such as symptoms, diseases, and substances using Named
Entity Recognition (NER) in [18]. The BioCreative V Chemical Disease
Relation (BC5CDR) dataset is used in the study, together with
additional medical phrases collected from web pages. The BC5CDR
collection contains annotated text data for chemicals and diseases;
however, it lacks symptom labels. As a result, the symptoms were
manually labelled by modifying the dataset. The SpaCy package was
used to train the NER models on this labelled data, with
architectures including Tok2Vec, NER, and pre-trained transformer
models such as BERT and RoBERTa. To reliably recognise medical
entities, these models were fine-tuned and put into a custom NER
pipeline. The research shows how the NER approach, combined with
manual annotation and transformer models, efficiently finds relevant
medical information in text descriptions provided by patients during
their interactions with doctors.

These studies highlight the significance of text extraction,
pre-processing, feature extraction, machine learning techniques, and
evaluation measures in resume parsing. The proposed approaches
provide useful insights into dealing with various resume formats,
enhancing accuracy, and enabling informed decision-making during the
recruitment process.

III. METHODOLOGY

The developed resume parser extracts different kinds of information
from a given resume. It uses two extraction techniques, one based on
a Named Entity Recognition (NER) model and the other on Keyword and
Pattern Matching using Regular Expressions (Regex). This section
provides an overview of the technologies employed and the overall
flow of the study.

A. Technologies used:

a. SpaCy: The SpaCy library has been used in our study for its strong
natural language processing (NLP) capabilities. SpaCy provides
pre-trained models for tasks such as tokenization, part-of-speech
tagging, and entity recognition. The English language model offered
by SpaCy is loaded to carry out the NLP tasks.

b. Python: Python and its libraries are used to create the resume
parser models because they offer frameworks for data processing, text
manipulation,
and machine learning. Python was chosen for our work because of its
simplicity and broad ecosystem.

c. Pandas: Pandas is a popular library for handling and analyzing
data. It enabled us to check the columns and make sure that the
information provided was consistent.

d. Regular Expressions (Regex): Regular expressions were essential
for pattern matching and extracting particular data from the resume
text. Regex patterns were used to detect components such as email
addresses, mobile numbers, educational backgrounds, experience
sections, and language proficiency.

e. YAML: YAML (YAML Ain't Markup Language) is used as the
configuration file format. The expected columns and other parameters
required for data processing and validation are defined in YAML,
which helped ensure the integrity and structure of the provided data.

f. PDFMiner: The PDFMiner library was used to extract text from PDF
documents. It goes through the PDF pages and converts them into plain
text, which made it possible to process resumes stored as PDF files.

g. docx2txt: This library was used to extract text from Microsoft
Word documents (DOC and DOCX). It simplified the conversion of a
document's content to plain text for further processing.

B. Role of Natural Language Processing in Resume Parser:

Both the NER-based extraction model and the Keyword and Pattern
Matching (Regex-based) extraction model of our resume parser rely
heavily on Natural Language Processing (NLP). NLP tools, which allow
us to analyse and manipulate human language, make it possible to
extract useful data from resumes. The SpaCy library is the main NLP
tool used. The NLP features offered by SpaCy include entity
recognition, part-of-speech tagging, and tokenization. To carry out
these tasks, we use SpaCy's pre-trained models, in particular the
English language model. The SpaCy NER model has been trained to
identify and categorize entities of interest, including names, phone
numbers, skills, grades, job titles, companies, and experience. NLP
methods are used to analyse the resume content, identify the relevant
entities, and extract the needed information.

In the NER-based extraction strategy, the NER model is first trained
on a dataset of annotated resumes in which entities are manually
labeled. NLP techniques are used to analyze and learn patterns from
the labeled data so that the model can make predictions on resume
texts it has not yet seen. The NER model's accuracy in detecting and
extracting particular entities from resumes is improved by its
capacity to capture the context and semantics of the text.

Fig 1: Keyword Cloud of Resume attribute values

In Figures 1 and 2, a tag cloud is used to visually represent the
extracted text corpus and the attributes that are to be extracted
from each resume; the size of each word corresponds to the frequency
of that word in the corpus. In this study, certain fields, such as
experience, years of experience, and the location of the candidate's
place of employment, occur very frequently, so these words appear in
larger fonts.

Fig 2: Keyword Cloud of Resume attributes

In the Keyword and Pattern Matching (Regex-based) extraction model,
NLP approaches are used in combination with regular expressions
(Regex) to extract information from the resume text. Regex patterns
are useful for pattern matching, while NLP methods add an
understanding of the context of the text, which makes them a
complement to Regex-based extraction. For example, we use
part-of-speech tagging and NLP tokenization to identify noun phrases
and other linguistic patterns that might help with the extraction
process. The text is first preprocessed using NLP techniques to
expand contractions, remove stopwords, and lemmatize words, which
increases the precision of the Regex-based extraction.

C. Flow of our study:

This work can be divided into the following steps:

a. Reading YAML Configuration: Model processing begins with reading
the YAML configuration file, which contains information such as the
expected column names and the file type. This step ensured that the
input data matched the predefined structure.

b. Preprocessing the Text: Text preprocessing was carried out before
any extraction techniques were applied. The text was lemmatized,
converted to lowercase, contractions were expanded, and special
characters and line breaks were removed. Preprocessing improved the
precision of the subsequent extraction stages.

c. NER-based Extraction Model Approach:

i. Dataset Preparation: A collection of annotated resume data was
used in which the entities of interest were manually labeled. A
training set and a testing set were created from the dataset.

ii. Training the NER Model: SpaCy's NER model was trained using the
training set. The model was tuned to identify particular entities,
including names, phone numbers, skills, grades, job titles,
employers, and experience.

iii. Model Evaluation: The trained model's performance was assessed
on the testing set to gauge its effectiveness and to check that it
correctly identified the intended entities.

d. Keyword and Pattern Matching from Regex-based Extraction Approach:
To extract particular information from the resume text, regular
expressions and pattern-matching approaches were used in combination
with the NER-based extraction. Regex patterns were developed to
recognize and extract items such as email addresses, mobile numbers,
educational backgrounds, experience sections, and language
proficiency.

e. Combining and Filtering: After the outputs of both models were
obtained, the collected data was combined and any overlapping items
were removed. Empty or meaningless values were also eliminated,
leaving a collection of distinct entities as the final set of
retrieved information.

f. Displaying the Results: The extracted information is finally
shown, with the exception of the "name" and "phone number" values,
which were already shown individually.

Fig 3: Flow chart of the Resume Parser for extracting Details.

Input for the model is either a PDF or DOC/DOCX file, from which the
text is extracted and pre-processed to train the NER model to
recognise various entities such as Name, Phone Number, Designation,
Years of Experience, Work Location, Email ID, Skills, and other
details of the candidate. The working flow of the resume parser using
both models is shown in Fig. 3.

The proposed resume parser combines Regex pattern matching with
extraction based on NER models. The NER model, which was trained on
annotated resume data, made possible the accurate identification of
entities including names, phone numbers, skills, grades,
designations, companies of employment, and experience. The Regex
patterns complemented the NER model by extracting additional
information such as email addresses, mobile numbers, qualifications
for employment, and language proficiency.

By integrating the advantages of both techniques, the study produced
a robust and effective resume parser capable of extracting crucial
information from resumes. The study was successful in part because of
the tools used, including SpaCy, Python, Pandas, Regular Expressions,
YAML, PDFMiner, and docx2txt. Our approach enabled efficient entity
extraction and filtering using Regex patterns, as well as successful
NER model preparation, training, and assessment. The collected
information offers helpful insights for processing and analyzing
resumes, enabling a variety of applications in hiring and talent
management.

IV. RESULTS

Basic information including name, email, mobile number, skills,
education details, job experience, languages known, and CGPA is
correctly identified and retrieved by the parser. Overall, the resume
parser is a reliable and efficient tool for automatically extracting
and comparing essential data from resumes.
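
The Regex-based extraction and the combining and filtering steps
described in Section III can be sketched roughly as follows. This is
a minimal illustration under stated assumptions, not the authors'
implementation: the paper does not publish its expressions or code,
so the three patterns, the function names extract_with_regex and
combine_and_filter, the field labels, and the sample text are all
hypothetical, and the NER model output is replaced by a hand-written
dictionary.

```python
import re

# Hypothetical patterns; the paper does not list its actual expressions.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+?\d{1,3}[\s-]?)?\d{10}\b")
CGPA_RE = re.compile(r"CGPA\s*[:\-]?\s*(\d{1,2}(?:\.\d{1,2})?)", re.IGNORECASE)


def extract_with_regex(text):
    """Return a dict of fields found by simple pattern matching."""
    cgpa = CGPA_RE.search(text)
    return {
        "Email": EMAIL_RE.findall(text),
        "Mobile": [m.strip() for m in PHONE_RE.findall(text)],
        "CGPA": [cgpa.group(1)] if cgpa else [],
    }


def combine_and_filter(ner_entities, regex_entities):
    """Merge both outputs, dropping duplicates and empty values."""
    merged = {}
    for source in (ner_entities, regex_entities):
        for label, values in source.items():
            bucket = merged.setdefault(label, [])
            for value in values:
                value = value.strip()
                # Case-insensitive de-duplication keeps the first spelling seen.
                if value and value.lower() not in (v.lower() for v in bucket):
                    bucket.append(value)
    # Remove labels that ended up with no values at all.
    return {label: values for label, values in merged.items() if values}


sample = "Jane Doe\nEmail: jane.doe@example.com\nMobile: +91 9876543210\nCGPA: 8.7\n"
# Stand-in for the NER model output (hypothetical labels and values).
ner_output = {"Name": ["Jane Doe"], "Email": ["jane.doe@example.com"], "Skills": [""]}
result = combine_and_filter(ner_output, extract_with_regex(sample))
print(result)
```

In this sketch the email address found by both sources is kept only
once and the empty "Skills" entry is dropped, which mirrors the
removal of overlapping and empty values in step e. Real resumes would
need considerably more tolerant patterns than the ones assumed here.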

Fig 1. Terminal Output of Entities Extracted from a Resume

Fig 2. Terminal output of Entities Extracted from another resume

The difference in the output is due to the difference in the formats
of the resumes. The above output is also shown through the GUI
implementation, as in Fig. 3.

Fig 3. GUI Implementation of parsing a resume

The dictionary in the terminal output is then converted into a
dataframe and displayed in a structured format.

Up to 5 resumes can be compared in the GUI implementation. The number
of resumes to compare is chosen from a dropdown; a minimum of 2 and a
maximum of 5 resumes can be compared. Fig 4 shows the comparison of 3
resumes.

Fig 4. GUI Implementation of comparison of resume

V. CONCLUSION

To extract crucial data from resumes, our resume parser uses two
separate approaches, the NER Model and Regex-based extraction.
Throughout the research, we make use of Natural Language Processing
(NLP) tools to analyse and manipulate human language.

The NER Model technique uses SpaCy, an NLP library, to develop a
model that can identify and categorize entities including names,
skills, phone numbers, grades, designations, companies, and
experience. Because it was trained on annotated resumes, this model
can extract information from unseen resume texts. NLP approaches
improve the understanding of the text's context, which increases the
precision of entity recognition. The Regex-based extraction approach
uses regular expressions together with NLP methods to extract
information. To increase the precision of pattern matching, NLP
techniques including tokenization, part-of-speech tagging, and
preprocessing are used. We extract information from resumes by
combining the advantages of Regex and NLP.

Through our study, we have shown how important NLP is for precise and
effective resume processing. NLP methods make it easier to understand
and process natural language, which makes it possible to extract
useful information from resumes. The combination of the NER Model and
Regex-based extraction provides a reliable method to handle various
resume formats and extract important facts. By putting these
techniques to use, we can automatically extract crucial information
from resumes, including names, phone numbers, skills, grades,
designations, companies, and experience. This automation reduces the
time and effort required for manual processing and increases the
productivity of resume screening and analysis in personnel management
software.

VI. REFERENCES

[1] A. Pimpalkar, A. Lalwani, R. Chaudhari, M. Inshall, M. Dalwani
and T. Saluja, "Job Applications Selection and Identification: Study
of Resumes with Natural Language Processing and Machine Learning,"
2023 IEEE International Students' Conference on Electrical,
Electronics and Computer Science (SCEECS), Bhopal, India, 2023,
pp. 1-5, doi: 10.1109/SCEECS57921.2023.10063010.

[2] H. Sajid et al., "Resume Parsing Framework for E-recruitment,"
2022 16th International Conference on Ubiquitous Information
Management and Communication (IMCOM), Seoul, Korea, Republic of,
2022, pp. 1-8, doi: 10.1109/IMCOM53663.2022.97217

[3] Das, Papiya, Bhaswati Sahoo, and Manjusha Pandey. "A review on
text analytics process with a CV parser model." In 2018 3rd
International Conference for Convergence in Technology (I2CT),
pp. 1-7. IEEE, 2018.

[4] Sharma, Anushka, Smiti Singhal, and Dhara Ajudia. "Intelligent
Recruitment System Using NLP." In 2021 International Conference on
Artificial Intelligence and Machine Vision (AIMV), pp. 1-5. IEEE,
2021.

[5] Bhardwaj, Bhavya, Syed Ishtiyaq Ahmed, J. Jaiharie, R. Sorabh
Dadhich, and M. Ganesan. "Web scraping using summarization and named
entity recognition (NER)." In 2021 7th international conference on
advanced computing and communication systems (ICACCS), vol. 1,
pp. 261-265. IEEE, 2021.

[6] Mohanty, Saswat, Anshuman Behera, Sushruta Mishra, Ahmed
Alkhayyat, Deepak Gupta and Vandana Sharma. "Resumate: A Prototype to
Enhance Recruitment Process with NLP based Resume Parsing." 2023 4th
International Conference on Intelligent Engineering and Management
(ICIEM) (2023): 1-6.

[7] "NLP based Extraction of Relevant Resume using Machine Learning."
International Journal of Innovative Technology and Exploring
Engineering (2020): n. pag.

[8] Kinge, Bhushan, Shrinivas Mandhare, Pranali Chavan and S. M.
Chaware. "Resume Screening using Machine Learning and NLP: A proposed
system." International Journal of Scientific Research in Computer
Science, Engineering and Information Technology (2022): n. pag.

[9] Reddy, D. Jagan Mohan, Sirisha Regella and Srinivasa Reddy
Seelam. "Recruitment Prediction using Machine Learning." 2020 5th
International Conference on Computing, Communication and Security
(ICCCS) (2020): 1-4.

[10] Surendiran, B, Tejus Paturu, Harsha Vardhan Chirumamilla and
Maruprolu Naga Raju Reddy. "Resume Classification Using ML
Techniques." 2023 International Conference on Signal Processing,
Computation, Electronics, Power and Telecommunication (IConSCEPT)
(2023): 1-5.

[11] "Differential Hiring using a Combination of NER and Word
Embedding." International Journal of Recent Technology and
Engineering (2020): n. pag.

[12] M, Spoorthi, Indu Priya B, Meghana Kuppala, Vaishnavi Sunilkumar
Karpe and Divya Dharavath. "Automated Resume Classification System
Using Ensemble Learning." 2023 9th International Conference on
Advanced Computing and Communication Systems (ICACCS) 1 (2023):
1782-1785.

[13] Bhatia, Vedant, Prateek Rawat, Ajit Kumar and Rajiv Ratn Shah.
"End-to-End Resume Parsing and Finding Candidates for a Job
Description using BERT." ArXiv abs/1910.03089 (2019): n. pag.

[14] Wang, Yan, Yacine Allouache and Christian Joubert. "A Staffing
Recommender System based on Domain-Specific Knowledge Graph." 2021
Eighth International Conference on Social Network Analysis,
Management and Security (SNAMS) (2021): 1-6.

[15] Wang, Yan and Capgemini. "Analysing CV Corpus for Finding
Suitable Candidates using Knowledge Graph and BERT." (2021).

[16] Kumaran, V. Senthil and A. Sankar. "Towards an automated system
for intelligent screening of candidates for recruitment using
ontology mapping (EXPERT)." Int. J. Metadata Semant. Ontologies 8
(2013): 56-64.

[17] Zaroor, Abeer, Mohammed Maree and Muath N Sabha. "JRC: A Job
Post and Resume Classification System for Online Recruitment." 2017
IEEE 29th International Conference on Tools with Artificial
Intelligence (ICTAI) (2017): 780-787.

[18] Rajathi, S., Reddi Tharun Kumar, Sripathi Vamsi Krishna, S.
Sabareesh, and Pandi Eswar Chand Ethihas. "Named Entity
Recognition-based Hospital Recommendation." In 2023 2nd International
Conference on Vision Towards Emerging Trends in Communication and
Networking Technologies (ViTECoN), pp. 1-6. IEEE, 2023.

[19] R. Ramachandran, S. Jayachandran and V. Das, "A Novel Method for
Text Summarization and Clustering of Documents," 2022 IEEE 3rd Global
Conference for Advancement in Technology (GCAT), Bangalore, India,
2022, pp. 1-6, doi: 10.1109/GCAT55367.2022.9972037.

[20] N. Raj, S. Thomas and V. G, "Open Information Extraction System
For Extracting Relations in Legal Documents," 2022 IEEE 3rd Global
Conference for Advancement in Technology (GCAT), Bangalore, India,
2022, pp. 1-8, doi: 10.1109/GCAT55367.2022.9971995.