ITM Web of Conferences 44, 03011 (2022) https://doi.org/10.1051/itmconf/20224403011
ICACC-2022
Abstract. With the onset of the epidemic, everything has moved online, and individuals have been compelled to work from home. There is a need to automate the hiring process in order to enhance efficiency and reduce manual labour that can be done electronically. If resume categorization were done online, it would significantly reduce paperwork and human error. The recruiting process has several steps, but the first is resume categorization and verification. Automating this first stage would greatly assist the interview process by speeding up applicant selection. Classification of resumes will be performed using machine learning algorithms such as Naïve Bayes, Random Forest, and SVM, which will aid in the extraction of skills and group diverse capabilities under appropriate job profile classes. While the skills are being extracted, an appropriate job profile can be retrieved from the categorised and pre-processed data and shown on the interviewer's screen. During video interviews, this will aid the interviewer in the selection of candidates.
* Corresponding author: [email protected]
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution
License 4.0 (http://creativecommons.org/licenses/by/4.0/).
a meaningful class in which publications are extremely likely to have similar subjects is needed to overcome the restrictions. The suggested method uses K-means Clustering [2] and the Latent Dirichlet Allocation (LDA) scheme [2] to extract representative keywords from the abstract of each publication and subject.

Importance of Automation: the lack of automated systems for medical institutions and hospitals has caused a surge in the development of automation for hospital systems [3]. Natural Language Processing (NLP) [11] benefits society as a whole, as handling text-based information manually is very difficult.
Preprocessing of data using TF-IDF Vectorization: A well-developed categorization system that can group research papers into relevant classes based on their subjects [4] helps in finding relevant data in the most time-efficient manner. The suggested approach retrieves representative keywords from each paper's and topic's abstract. Then, using the term frequency-inverse document frequency (TF-IDF) values [4] of each article, the K-means clustering method [4] is used to group the entire collection into sets of papers with comparable themes.
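To make this pipeline concrete, the following minimal sketch (using scikit-learn; the sample abstracts and the cluster count are illustrative assumptions, not values from [4]) vectorizes abstracts with TF-IDF and clusters them with K-means:

# A minimal sketch of the TF-IDF + K-means grouping described above,
# using scikit-learn; the sample abstracts and the cluster count are
# illustrative assumptions, not values from [4].
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

abstracts = [
    "Deep learning methods for image classification.",
    "Clustering research publications by topic.",
    "Support vector machines for text categorization.",
]

# Convert each abstract into a TF-IDF weighted term vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

# Group the papers into clusters with comparable themes.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels)  # cluster id assigned to each abstract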
Modelling of semi-structured documents to fetch job postings from resumes: Matching semi-structured resumes with positions in a large real-world collection is a tough challenge. Experiments by W. Bruce Croft, Xing Yi, and James Allan [5] reveal that the SRM technique yielded encouraging results and outperformed traditional unstructured relevance models on the first try; they also compared the suggested system's efficiency and efficacy to those of state-of-the-art online recruiting methods. Their system uses machine learning to match job postings and resumes over huge data sets. In our work, which is less complex than previous papers, text mining is used to match the data extracted from the resume against the keywords recorded in the database and categorize it under each job category.
3 Methodology

This section describes the methodology and concepts that facilitate the building of a classification model capable of classifying resumes and displaying the output with a suitable job profile for the candidate. The system works in the phases given below.
3.1 Data Gathering

Data gathering includes the collection of datasets from various websites such as kaggle.com, glassdoor.com and indeed.com. The datasets are unclassified and unstructured; the data will be cleaned, classified, and stored in "25_cleaned_job_descriptions.csv", including some parts from Kaggle and some from glassdoor.com and indeed.com. 70% of the data is used for training and the remaining 30% is used for testing.
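As an illustration of this split, the sketch below uses scikit-learn's train_test_split; the file name comes from this section, while the exact column labels in the CSV are an assumption:

# A minimal sketch of the 70/30 split using scikit-learn; the file
# name comes from this section, while the exact column labels in the
# CSV are an assumption.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("25_cleaned_job_descriptions.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["description"],   # job-description text
    df["query"],         # job-profile label
    test_size=0.30,      # 30% held out for testing
    random_state=42,
)
print(len(X_train), len(X_test))  # roughly 7,000 / 3,000 entries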
3.2 Data Cleanup

The dataset, containing a huge number of records, is still very rough and unclassified. Data cleaning is done by removing blank spaces from the data, converting all text to lowercase to avoid confusion, and removing stop words. Stop words are words which do not play an important role in sentence formation, such as "are", "we", "is", etc. The cleaned data is stored in a separate dataset containing 10,000 entries with two main classes, "query" and "description".
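A minimal sketch of these cleanup steps might look as follows; the NLTK English stop-word list is one possible choice, as the paper does not fix a specific list:

# A minimal sketch of the cleanup steps: trim blank space, lowercase
# the text, and drop stop words. The NLTK English stop-word list is
# one possible choice; the paper does not fix a specific list.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_text(text):
    text = text.strip().lower()  # remove surrounding blank space, lowercase
    # Drop words like "are", "we", "is" that carry little meaning.
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(clean_text("  We are looking for a Data Scientist  "))
# -> "looking data scientist"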
3.3 Tokenization

In this step, each entry in the corpus, i.e., each entry in the document, is broken down into a set of words. To begin the tokenization process, we look for concepts or words that make up a character sequence. This is significant because we will be able to deduce meaning from the original text sequence using these terms. Tokenization is the process of separating large chunks of text into smaller pieces known as tokens; this is accomplished by deleting or isolating characters such as whitespace and punctuation. By tokenizing, we can obtain information such as the number of words in a text, the frequency of a specific term in the text, and much more. Tokenization can be done in a variety of ways, such as using the Natural Language Toolkit (NLTK) or the spaCy library. It is a required step for subsequent text processing such as stop word removal, stemming, and lemmatization.
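For instance, a minimal tokenization sketch with NLTK (one of the libraries named above; spaCy would serve equally well) could look like this:

# A minimal tokenization sketch with NLTK; the sentence is an
# illustrative example.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)

text = "Tokenization splits large chunks of text into smaller tokens."
tokens = word_tokenize(text)
print(tokens)                # the individual word and punctuation tokens
print(len(tokens))           # number of tokens in the text
print(tokens.count("text"))  # frequency of a specific term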
3.4 Stemming and Lemmatization

It is common to see a single English word employed in a variety of ways in different phrases according to its grammatical role. "Describe", "describing" and "described", for example, are all various tenses of the same verb. This necessitates reducing all changed or derived versions of a word to its primary stem or base, so that derivationally related terms with comparable meanings are not treated as distinct from one another. Stemming and lemmatization share this goal but take different approaches to achieving it.

"Stemming is the mechanism of reducing inflected or derived words to their word root, or stem. It is a crude heuristic process that involves chopping off the ends of words to achieve this objective, and often includes the removal of derivational affixes [7]." Stemmers are rule-based algorithms that analyse a given word under a variety of scenarios and then decide how to shorten it based on a list of recognized suffixes. It is worth noting that the root generated after stemming may not be the same as the word's morphological root, and stemming is prone to both under- and over-stemming. The Porter, Snowball, and Lancaster stemmers are some commonly used examples.
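The contrast between the two approaches can be seen in a short NLTK sketch; the example words come from the paragraph above, and tagging them as verbs (pos="v") for the lemmatizer is an illustrative choice:

# A minimal sketch contrasting stemming and lemmatization with NLTK;
# the example words come from the paragraph above.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["describe", "describing", "described"]:
    # Stemming chops suffixes by rule, so the stem "describ" is not a
    # real word; lemmatization maps each form to its dictionary entry.
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))
# describe   describ describe
# describing describ describe
# described  describ describe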
5 Results

The dataset has been taken from various websites such as indeed.com, glassdoor.com and kaggle.com. These datasets are unstructured, so preprocessing has been performed, including tokenization, stop word removal, and stemming and lemmatization with POS tagging. The output of preprocessing is shown in Fig. 3, whose corpus consists of 3 main columns and 10,000 rows: "Query", displaying all the job profiles (Fig. 4); "Description", containing the raw text (Fig. 5); and "text_final", displaying the final set of preprocessed words (Fig. 6). Fig. 4, Fig. 5, and Fig. 6 are zoomed views of Fig. 3.

Fig. 5. Description column from Fig. 3.
Data after being cleaned is stored in a new dataset containing 10,000 entries with two main categories, "query" and "description". Fig. 8 depicts the 25 main job classes that are being evaluated: 'Artificial Intelligence', 'Big Data Engineer', 'Business Analyst', 'Business Intelligence Analyst', 'Cloud Architect', 'Cloud Services Developer', 'Data Analyst', 'Data Architect', 'Data Engineer', 'Data Quality Manager', 'Data Scientist', 'Data Visualization Expert', 'Data Warehousing', 'Data and Analytics Manager', 'Database Administrator', 'Deep Learning', 'Full Stack Developer', 'IT Consultant', 'IT Systems Administrator', 'Information Security Analyst', 'Machine Learning', 'Network Architect', 'Statistics', 'Technical Operations', 'Technology Integration'.

The confusion matrix for the Naïve Bayes classification model is shown in Fig. 10, depicting the true positive, true negative, false positive and false negative values that have been detected. A diagonal line of dark-colored boxes can be observed, which shows the true positive values for each label using Naïve Bayes classification.
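A confusion matrix of this kind can be produced with a sketch along the following lines; it assumes TF-IDF features over the "text_final" column and the 70/30 split from Section 3.1, with scikit-learn's Multinomial Naïve Bayes standing in for the paper's Naïve Bayes model:

# A minimal sketch of training a Naive Bayes model and plotting a
# confusion matrix like Fig. 10; the file and column names follow the
# dataset described in Section 3, the remaining choices are
# illustrative.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import ConfusionMatrixDisplay

df = pd.read_csv("25_cleaned_job_descriptions.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text_final"], df["query"], test_size=0.30, random_state=42)

# TF-IDF features are fitted on the training split only.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

model = MultinomialNB().fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)

# A dark diagonal indicates correctly classified samples for each of
# the 25 job classes.
ConfusionMatrixDisplay.from_predictions(y_test, y_pred,
                                        xticks_rotation="vertical")
plt.tight_layout()
plt.show()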
Fig. 11. Confusion matrix for SVM classification

The confusion matrix for Random Forest is shown in Fig. 12, depicting the true positive, true negative, false positive and false negative values that have been detected. A diagonal line of dark-colored boxes can be observed, which shows the true positive values for each label using Random Forest classification.

Fig. 12. Confusion matrix for Random Forest classification

Conclusion and Future Work

The concept of classification is grasped by resume classification, and classification models have been built using numerous techniques. This resume categorization platform will make the e-recruitment process more efficient and user-friendly. This approach will assist businesses and save time throughout the recruitment process.

Future work includes combining these classification models with a frontend resume portal, where the user will be able to upload their resume and resume classification can run in the backend. From the results and discussions, we can conclude that the various algorithms have given different accuracy scores, with Random Forest achieving the highest accuracy among the three. Also, the confusion matrix of Random Forest is well suited for all the observed classes.

Thereby, resume classification is achieved by preprocessing the data, cleaning the data, and applying various classification algorithms.

This research can also be extended with other video-interviewing features such as facial recognition, voice-to-text generation and voice analysis. Building a plugin that lets people include these resume classification models in their own projects is also one of the future goals of this research paper and project work.