Automated Resume Parsing: A Natural Language Processing Approach
2023 7th International Conference on Computation System and Information Technology for Sustainable Solutions (CSITSS) | 979-8-3503-4314-4/23/$31.00 ©2023 IEEE | DOI: 10.1109/CSITSS60515.2023.10334236
Meena Belwal
Dept. of Computer Science and Engineering
Amrita School of Computing
Bengaluru, Amrita Vishwa Vidyapeetham, India
[email protected]
Abstract: Extracting critical information from resumes, such as contact information, skills, education, and job experience, requires resume parsing. In this work, we propose a resume parser that integrates two methodologies: a Named Entity Recognition (NER) model approach and a Keyword and Pattern Matching model approach using Regular Expressions (Regex), both built on NLP libraries and methods. The NER model uses NLP libraries to recognize and categorize named entities (such as names, phone numbers, and email addresses) in the resume text. To provide a thorough profile overview, it also organizes the sections on skills, education, and job experience. In addition, the Keyword and Pattern Matching model uses Regex and pre-established rules to extract specific information, such as job titles, company names, and years of experience. Our resume parser uses NLP-based approaches to increase accuracy and performance, allowing it to handle various resume formats and deliver trustworthy results. Performance assessments show how well the parser extracts crucial information, even from resumes with various layouts and formats. Owing to its use of NLP libraries and methodologies, together with its accuracy and processing speed, our resume parser appears to be a useful tool for processing large numbers of resumes in real-world applications.

Keywords: Natural language processing (NLP), Named Entity Recognition (NER), pattern matching, Regular Expressions.

I. INTRODUCTION

Companies receive a large number of resumes for each job opening in today's competitive job market. Checking and collating these resumes manually is time-consuming, error-prone, and does not scale. Automated resume parsing systems have become an essential tool for recruiters and HR specialists to handle these issues. The term "resume parsing" describes the structured extraction of relevant data from resumes, such as contact information, skills, education, and work experience. It makes possible faster candidate evaluation, better talent acquisition, and better hiring decisions. Employers and job seekers may both profit substantially from automated resume processing. By automatically retrieving and structuring candidate information, it speeds up the resume screening process for companies, enabling hiring managers to concentrate on assessing competent candidates. In addition, resume parsing systems offer a standard structure for storing and comparing candidate data, facilitating effective candidate database management and application monitoring.

Resume processing systems are useful for job seekers as well. These systems make sure that their resumes are correctly processed and that the right skills and qualifications are highlighted. Candidates improve their chances of being shortlisted for job openings by formatting their CV data. Additionally, resume parsing tools let job seekers tailor their applications to specific job requirements so that they stand out in a crowded field. This improves candidates' visibility to potential employers and enables them to present their qualifications effectively.

To correctly identify and categorize named entities, including names, contact information, skills, education, and work experience, the NER model makes use of machine learning algorithms and NLP approaches. By extracting precise information based on structured patterns and relevant keywords using Regex and defined rules, the Keyword and Pattern Matching model complements the NER method. This combined technique provides efficiency and precision while processing resumes and offers customization possibilities. The parser's performance in accurately extracting information from resumes with various layouts and formats has been demonstrated through an assessment using distinct resume datasets. A resume parser is a useful tool for handling large quantities of resumes in real-world situations, saving time and effort while promoting informed hiring decisions. The parser's accuracy and processing speed enable recruiters to quickly find the most appropriate candidates.

The paper is structured as follows. A thorough analysis of relevant work on the topic of resume parsing is provided in Section 2. The methodology for our
Authorized licensed use limited to: Gannon University. Downloaded on December 13,2024 at 04:10:43 UTC from IEEE Xplore. Restrictions apply.
resume parser work is presented in Section 3, which goes into depth about the NER model approach and the Keyword and Pattern Matching model approach. The output of the models and the evaluation results are shown in Section 4, which also highlights the efficiency and performance of our system. The work is concluded in Section 6, which summarises the most important findings and suggests new paths for future research on automated resume processing.

II. LITERATURE SURVEY

The study of resume parsing has gained importance through multiple research papers that have proposed various techniques to address the difficulties in extracting appropriate data from resumes. The methodology outlined in [1] by Amit Pimpalkar et al. involves several phases, including text extraction from files, pre-processing to remove unnecessary components, feature extraction, label encoding, construction of a resume classification model using machine learning algorithms, and ranking based on the extracted information. To determine the effectiveness of the method, performance evaluation metrics including precision, recall, F1, and accuracy have been used.

Shujaat Hussain et al. have concentrated on an input and output pipeline for resume parsing [2]. Their work highlights the value of text extraction from resumes regardless of font styles, sizes, or document layouts. Text block categorization is seen as an essential step since resumes often have hierarchical structures with associated subjects. For classifying text blocks, machine learning methods and neural network models are considered, with the suggested pipeline relying on the Boolean Naive Bayes algorithm. This probabilistic model enables the categorization of various resume information blocks, such as education, skills, experience, hobbies, and personal data.

Singhal et al. specified a model that was developed in three phases [4]. The initial phase focuses on cleansing the data and extracting text from PDF resumes. The skills in the dataset are converted into tokens, which are then vectorized. The K-Means clustering technique is used in the following phase to group the skills and predict the category or rank for new resumes. The program then identifies the top candidates for recruiters by comparing resumes with job requirements.

Ishtiyaq et al. focused on developing Named Entity Recognition (NER) as a part of Information Extraction to categorize words and phrase segments into pre-defined entities such as GPE, CARDINAL, and EVENT in [5]. The study involves scraping news articles on possible viral outbreaks. To accomplish this, the article uses the SpaCy library's "en_core_web_sm" NER model, which evaluates the input text and classifies words into various entities. The entities of interest, namely GPE, CARDINAL, and EVENT, are then filtered and arranged using keywords based on relevancy. When compared to standard methods such as summarization, the new NER methodology outperforms them in detecting relevant elements related to potential viral outbreaks.

Sabareesh et al. described their work for extracting medical terminology such as symptoms, diseases, and substances using Named Entity Recognition (NER) in [20]. The Bio-Creative Chemical Disease Relation (BC5CDR) dataset is used in the study, coupled with extra medical phrases acquired from online pages. The BC5CDR collection contains annotated text data for chemicals and disorders, but it lacks symptom labels. As a result, the symptoms were manually labelled in the paper by modifying the dataset. The SpaCy package was used to train the NER models on this labelled data, with architectures such as Tok2Vec, NER, and pre-trained transformer models such as BERT and RoBERTa. To reliably recognise medical entities, these models were fine-tuned and put into a custom NER pipeline. The research shows how the NER approach, combined with manual annotation and transformer models, efficiently finds relevant medical information from text descriptions provided by patients during their interactions with doctors.

These studies highlight the significance of text extraction, pre-processing, feature extraction, machine learning techniques, and assessment measures in resume parsing. The approaches proposed provide useful insights into dealing with various resume forms, enhancing accuracy, and allowing informed decision-making during the recruitment process.

A. Technologies used:

a. SpaCy: The SpaCy library has been used in our study since it has strong natural language processing (NLP) capabilities. For applications like tokenization, part-of-speech tagging, and entity identification, SpaCy provides pre-trained models. The English language model offered by SpaCy is loaded to carry out the NLP tasks.

b. Python: Python and its libraries are used to create resume parser models because they offer frameworks for data processing, text manipulation,
and machine learning. Python was chosen for our work because of its simplicity and broad ecosystem.

c. Pandas: Pandas is a popular library for handling and analyzing data. It enabled us to check the columns and make sure that the information provided was consistent.

[...] accuracy in detecting and extracting particular entities from resumes is improved by its capacity to comprehend the context and semantics of the text.
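
The column check mentioned under Pandas can be illustrated with a small sketch. It is written with the standard library only, so that it runs without extra dependencies; the helper name `check_columns`, the expected column names, and the sample records are all hypothetical and are not taken from the paper.

```python
def check_columns(records, expected_columns):
    """Return (missing, unexpected) column names across a list of dict records."""
    expected = set(expected_columns)
    missing, unexpected = set(), set()
    for record in records:
        keys = set(record)
        missing |= expected - keys       # expected columns absent from this record
        unexpected |= keys - expected    # columns not part of the expected schema
    return sorted(missing), sorted(unexpected)


# Hypothetical schema and sample records (assumptions for illustration only)
EXPECTED = ["name", "email", "skills"]
records = [
    {"name": "A. Kumar", "email": "a.kumar@example.com", "skills": "Python"},
    {"name": "B. Rao", "skills": "SQL", "hobbies": "chess"},
]

missing, unexpected = check_columns(records, EXPECTED)
print("missing:", missing)
print("unexpected:", unexpected)
```

A check of this kind flags records that deviate from the expected structure before they reach the extraction models.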
a. Reading YAML Configuration: Model processing begins with reading the YAML configuration file, which contains information such as the likely column names and the file type. This step ensured that the input data matched the predefined structure.
c. NER-based Extraction Model Approach:

i. Dataset Preparation: A collection of annotated resume data was used in which entities of interest were manually labeled. A training set and a testing set were created from the dataset.

ii. Training the NER Model: SpaCy's NER model was trained using the training set. The model was tuned to identify particular entities, including names, phone numbers, skills, grades, and job titles, as well as employers and experiences.

iii. Model Evaluation: The trained model's performance was assessed on the testing set to gauge its effectiveness and verify that it correctly identified the intended entities.

d. Keyword and Pattern Matching from Regex-based Extraction Approach: To extract particular information from the resume text, regular expressions and pattern-matching approaches were used in combination with the NER-based extraction. Regex patterns were developed to recognize and extract items like email addresses, mobile numbers, educational backgrounds, experience sections, and language proficiency.

e. Combining and Filtering: After receiving the outputs from both models, the collected data was combined while making sure that any overlapping items were removed. Additionally, empty or irrelevant values were eliminated, leaving just a collection of distinct entities as the final set of retrieved information.

f. Displaying the Results: The extracted information is finally shown, with the exception of the "name" and "phone number" values, which were already shown individually.

Input for the model is either a PDF or DOC/DOCX file, from which the text is extracted and pre-processed to train the NER model to recognise various entities like Name, Phone Number, Designation, Years of Experience, Work Location, Email ID, Skills, and other details of the candidate. The working flow of the resume parser using both models is shown in Fig. 3.

The proposed resume parser in this research combines regex pattern matching with extraction based on NER models. Accurate identification of items including names, phone numbers, skills, grades, designations, companies of employment, and experiences was made possible by the NER model, which was trained on annotated resume data. By extracting additional information like email addresses, mobile numbers, qualifications for employment, and language proficiency, the regex patterns complemented the NER model.

By integrating the advantages of both techniques, the proposed approach produced a resume parser capable of precisely extracting crucial information from resumes. The study was successful in part because of the use of tools including SpaCy, Python, Pandas, Regular Expressions, YAML, PDFMiner, and docx2txt. Our approach enabled efficient entity extraction and filtering using regex patterns, as well as successful NER model preparation, training, and assessment. The collected information offers helpful insights for processing and analyzing resumes, enabling a variety of applications in the sphere of hiring and talent management.

IV. RESULTS

Basic information including name, email, mobile number, skills, education details, job experience, languages known, and CGPA is correctly identified and retrieved by the parser. Overall, the resume parser is a reliable and efficient tool for automatically extracting and comparing essential data from resumes.
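
As an illustration of the regex-based extraction in step d., the following minimal sketch pulls an email address, mobile number, CGPA, years of experience, and languages from plain resume text. The patterns, the field names, the `extract_fields` helper, and the sample resume are assumptions made for this example and are not the rules used in the paper. The phone pattern in particular assumes a 10-digit Indian mobile number with an optional "+91" prefix.

```python
import re

# Hypothetical patterns (assumptions, not the paper's actual rules)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+91[\s-]?)?\b[6-9]\d{9}\b")
CGPA_RE = re.compile(r"CGPA\s*[:\-]?\s*(\d{1,2}(?:\.\d{1,2})?)", re.IGNORECASE)
YEARS_RE = re.compile(r"(\d+(?:\.\d+)?)\+?\s*years?", re.IGNORECASE)
LANG_RE = re.compile(r"^Languages?(?:\s+known)?\s*:\s*(.+)$",
                     re.IGNORECASE | re.MULTILINE)


def extract_fields(text):
    """Extract a few resume fields from plain text; absent fields are omitted."""
    fields = {}
    email = EMAIL_RE.search(text)
    if email:
        fields["email"] = email.group(0)
    phone = PHONE_RE.search(text)
    if phone:
        fields["phone"] = phone.group(0)
    cgpa = CGPA_RE.search(text)
    if cgpa:
        fields["cgpa"] = float(cgpa.group(1))
    years = YEARS_RE.search(text)
    if years:
        fields["years_experience"] = float(years.group(1))
    languages = LANG_RE.search(text)
    if languages:
        # Split the comma-separated list and drop empty pieces
        fields["languages"] = [p.strip() for p in languages.group(1).split(",")
                               if p.strip()]
    return fields


# Invented sample resume text
SAMPLE = """Priya Sharma
Email: priya.sharma@example.com | Mobile: +91 9876543210
Education: B.Tech in Computer Science, CGPA: 8.7
Experience: 3 years as Software Engineer
Languages: English, Hindi, Kannada
"""

fields = extract_fields(SAMPLE)
for key, value in fields.items():
    print(f"{key}: {value}")
```

Rule-based patterns of this kind work well for fields with a fixed surface form, which is why they pair naturally with a learned NER model for free-form fields such as names and designations.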
The dictionary printed in the terminal output is then converted into a dataframe and displayed in a structured format.
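
The combining and filtering step (e.) and the tabular display can be sketched as follows. This is a minimal illustration using only the standard library; the helper names `merge_entities` and `to_rows` and the sample values are hypothetical, and the paper itself reports using a Pandas dataframe for the final display.

```python
def _as_list(values):
    """Normalise a field value (None, str, or iterable of str) to a list."""
    if values is None:
        return []
    if isinstance(values, str):
        return [values]
    return list(values)


def merge_entities(ner_output, regex_output):
    """Union both outputs per field, dropping blanks and case-insensitive duplicates."""
    merged = {}
    for source in (ner_output, regex_output):
        for field, values in source.items():
            bucket = merged.setdefault(field, [])
            seen = {v.casefold() for v in bucket}
            for value in _as_list(values):
                cleaned = value.strip()
                if cleaned and cleaned.casefold() not in seen:
                    bucket.append(cleaned)
                    seen.add(cleaned.casefold())
    # Remove fields that ended up with no values at all
    return {field: vals for field, vals in merged.items() if vals}


def to_rows(entities):
    """Flatten the merged dictionary into (field, value) rows for display."""
    return [(field, ", ".join(values)) for field, values in entities.items()]


# Invented outputs standing in for the NER model and the regex extractor
ner_output = {
    "name": ["Priya Sharma"],
    "skills": ["Python", "SQL", " "],
    "email": ["priya.sharma@example.com"],
    "designation": [],
}
regex_output = {
    "email": "Priya.Sharma@example.com",
    "phone": "+91 9876543210",
    "skills": ["sql", "NLP"],
}

merged = merge_entities(ner_output, regex_output)
rows = to_rows(merged)
width = max(len(field) for field, _ in rows)
for field, value in rows:
    print(f"{field.ljust(width)} | {value}")
```

With Pandas available, the same rows could be passed to `pandas.DataFrame(rows, columns=["field", "value"])` to obtain a dataframe instead of the printed table.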
VI. REFERENCES

[1] A. Pimpalkar, A. Lalwani, R. Chaudhari, M. Inshall, M. Dalwani and T. Saluja, "Job Applications Selection and Identification: Study of Resumes with Natural Language Processing and Machine Learning," 2023 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS), Bhopal, India, 2023, pp. 1-5, doi: 10.1109/SCEECS57921.2023.10063010.

[2] H. Sajid et al., "Resume Parsing Framework for E-recruitment," 2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM), Seoul, Korea, Republic of, 2022, pp. 1-8, doi: 10.1109/IMCOM53663.2022.97217

[3] Das, Papiya, Bhaswati Sahoo, and Manjusha Pandey. "A review on text analytics process with a CV parser model." In 2018 3rd International Conference for Convergence in Technology (I2CT), pp. 1-7. IEEE, 2018.

[4] Sharma, Anushka, Smiti Singhal, and Dhara Ajudia. "Intelligent Recruitment System Using NLP." In 2021 International Conference on Artificial Intelligence and Machine Vision (AIMV), pp. 1-5. IEEE, 2021.

[5] Bhardwaj, Bhavya, Syed Ishtiyaq Ahmed, J. Jaiharie, R. Sorabh Dadhich, and M. Ganesan. "Web scraping using summarization and named entity recognition (NER)." In 2021 7th international conference on advanced computing and communication systems (ICACCS), vol. 1, pp. 261-265. IEEE, 2021.

[6] Mohanty, Saswat, Anshuman Behera, Sushruta Mishra, Ahmed Alkhayyat, Deepak Gupta and Vandana Sharma. “Resumate: A Prototype to Enhance Recruitment Process with NLP based Resume Parsing.” 2023 4th International Conference on Intelligent Engineering and Management (ICIEM) (2023): 1-6.

[7] “NLP based Extraction of Relevant Resume using Machine Learning.” International Journal of Innovative Technology and Exploring Engineering (2020): n. pag.

[8] Kinge, Bhushan, Shrinivas Mandhare, Pranali Chavan and S. M. Chaware. “Resume Screening using Machine Learning and NLP: A proposed system.” International Journal of Scientific Research in Computer Science, Engineering and Information Technology (2022): n. pag.

[...] International Conference on Advanced Computing and Communication Systems (ICACCS) 1 (2023): 1782-1785.

[13] Bhatia, Vedant, Prateek Rawat, Ajit Kumar and Rajiv Ratn Shah. “End-to-End Resume Parsing and Finding Candidates for a Job Description using BERT.” ArXiv abs/1910.03089 (2019): n. pag.

[14] Wang, Yan, Yacine Allouache and Christian Joubert. “A Staffing Recommender System based on Domain-Specific Knowledge Graph.” 2021 Eighth International Conference on Social Network Analysis, Management and Security (SNAMS) (2021): 1-6.

[15] Wang, Yan and Capgemini. “Analysing CV Corpus for Finding Suitable Candidates using Knowledge Graph and BERT.” (2021).

[16] Kumaran, V. Senthil and A. Sankar. “Towards an automated system for intelligent screening of candidates for recruitment using ontology mapping (EXPERT).” Int. J. Metadata Semant. Ontologies 8 (2013): 56-64.

[17] Zaroor, Abeer, Mohammed Maree and Muath N Sabha. “JRC: A Job Post and Resume Classification System for Online Recruitment.” 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI) (2017): 780-787.

[18] Rajathi, S., Reddi Tharun Kumar, Sripathi Vamsi Krishna, S. Sabareesh, and Pandi Eswar Chand Ethihas. "Named Entity Recognition-based Hospital Recommendation." In 2023 2nd International Conference on Vision Towards Emerging Trends in Communication and Networking Technologies (ViTECoN), pp. 1-6. IEEE, 2023.

[19] R. Ramachandran, S. Jayachandran and V. Das, "A Novel Method for Text Summarization and Clustering of Documents," 2022 IEEE 3rd Global Conference for Advancement in Technology (GCAT), Bangalore, India, 2022, pp. 1-6, doi: 10.1109/GCAT55367.2022.9972037.

[20] N. Raj, S. Thomas and V. G, "Open Information Extraction System For Extracting Relations in Legal Documents," 2022 IEEE 3rd Global Conference for Advancement in Technology (GCAT), Bangalore, India, 2022, pp. 1-8, doi: 10.1109/GCAT55367.2022.9971995.