
ITM Web of Conferences 44, 03011 (2022)    https://doi.org/10.1051/itmconf/20224403011
ICACC-2022

Resume Classification using various Machine Learning Algorithms

Riya Pal1,*, Shahrukh Shaikh2, Swaraj Satpute3 and Sumedha Bhagwat4
1,2,3 Ramrao Adik Institute of Technology, Nerul, Navi Mumbai, India
4 D.Y. Patil Deemed to be University, Ramrao Adik Institute of Technology, Nerul, Navi Mumbai, India
* Corresponding author: [email protected]

© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).

Abstract. With the onset of the pandemic, everything has gone online, and individuals have been compelled to work from home. There is a need to automate the hiring process in order to enhance efficiency and reduce the manual labour that can be done electronically. If resume categorization were done online, it would significantly reduce paperwork and human error. The recruiting process has several steps, but the first is resume categorization and verification. Automating this first stage would greatly assist the interview process by speeding up applicant selection. Classification of resumes is performed using machine learning algorithms such as Naïve Bayes, Random Forest, and SVM, which aid in the extraction of skills and group diverse capabilities under appropriate job profile classes. While the skills are being extracted, an appropriate job profile may be retrieved from the categorised and pre-processed data and shown on the interviewer's screen. During video interviews, this will aid the interviewer in the selection of candidates.

1 Introduction

Interviews are becoming time-consuming affairs. Employees are required to travel to locations and conduct interviews, and it is difficult to manually remember each and every aspect of a candidate or of the interview process. In many instances, artificial intelligence systems assist us in simplifying such things.

Using the conventional method of recruitment, an organization's HR department invites individuals, based on their resumes, to an interview for a specific position. The HR department manually evaluates a candidate's skills against their résumé to determine whether he or she is qualified for the position. HR staff conduct interviews, and the panel plays a significant role in determining who is the best applicant for the post. They examine not just the candidate's talents but also his or her personality. Maintaining resumes and profiles of all candidates becomes a very tedious job for big mass-recruitment companies, because they provide employment in bulk and thus maintaining or storing data physically is not possible.

Machine Learning enables a computer to be trained to follow specific instructions again and again, making human life easier. The most common usage of machine learning is the classification of objects. In machine learning, iteration is important because models are exposed to new data and adapt accordingly. Machine learning models learn from previous results and computation to produce correct and reliable decisions. In statistics, classification is a supervised learning concept in which data is segregated into similar groups depending upon various factors and observations.

A resume is one of the most important necessities when it comes to the selection of a candidate for any job. When the company's hiring team receives the resume of a candidate, the skills, if extracted using automation, will save a tremendous amount of time for the hiring team, as they no longer need to sit and read through each and every word. But for this, a dataset first needs to be scraped or assembled, and preprocessing has to be done on that dataset. Once the preprocessing of words is done, the data needs to be classified into different classes of job profile in accordance with the skillset, using various machine learning algorithms.

2 Literature Survey

Finding the need for resume classification: As humans, everyone is bound to make mistakes, and storing and sharing physical copies safely is very inconvenient and inefficient. To overcome these drawbacks, the need for resume classification increases significantly, and for this purpose artificial intelligence tools and machine learning algorithms [1] have been used widely.

Categorization of similar data together: With the advancement of computer and information technologies, a large number of research papers have been published both online and offline, and as new study topics continue to emerge, users are having a difficult time discovering and classifying their relevant research articles. A classification method that can cluster research articles into a meaningful class, in which publications are extremely likely to have similar subjects, is needed to overcome these restrictions. The suggested method uses K-means clustering [2] and the Latent Dirichlet Allocation (LDA) scheme [2] to extract representative keywords from the abstract and subjects of each publication.
Importance of automation: The lack of automated systems for medical institutions and hospitals has caused a surge in the development of automation for hospital systems [3]. Natural Language Processing (NLP) [11] benefits society as a whole, as handling text-based information manually is very difficult.

Preprocessing of data using TF-IDF vectorization: A well-developed categorization system that can group research papers into relevant classes based on their subjects [4] helps in finding relevant data in the most time-efficient manner. The suggested approach retrieves representative keywords from each paper's and topic's abstract. Then, using the term frequency-inverse document frequency (TF-IDF) values [4] of each article, the K-means clustering method [4] is used to partition the entire set of papers into groups of research papers with comparable themes.

Modelling of semi-structured documents to fetch job postings from resumes: Matching semi-structured resumes with positions in a large real-world collection is a tough challenge. Experiments by Xing Yi, James Allan and W. Bruce Croft [5] reveal that the SRM technique yielded encouraging results and outperformed traditional unstructured relevance models on the first try. They furthermore compared the suggested system's efficiency and efficacy to those of state-of-the-art online recruiting methods: a system that utilizes machine learning to match job postings and resumes over huge data sets. In our work, the approach is less complex than in previous papers: using text mining, the data extracted from the resume is matched with the keywords recorded in the database and categorized under each job category.

3 Methodology

This section describes the methodology and concepts that facilitate the building of a classification model capable of resume classification and of displaying a suitable job profile for the candidate. The system works in the phases given below.

3.1 Data Gathering

Data gathering includes the collection of datasets from various websites like kaggle.com, glassdoor.com and indeed.com. The datasets are unclassified and unstructured; the data is cleaned, classified, and stored in "25_cleaned_job_descriptions.csv", including some parts from Kaggle and some from glassdoor.com and indeed.com. 70% of the data is used for training and the remaining 30% is used for testing.
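A minimal sketch of this 70/30 split, assuming scikit-learn and the "query"/"description" columns described in Section 3.2 (the file name comes from this section; everything else is illustrative):

# Sketch of the 70/30 train/test split described above.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("25_cleaned_job_descriptions.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["description"],  # raw job-description text
    df["query"],        # job-profile label
    test_size=0.30,     # 30% held out for testing
    random_state=42,
)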
3.2 Data Cleanup

The dataset, containing a huge number of records, is still very rough and unclassified. Data cleaning is done by removing any blank spaces from the data, then changing all the text to lowercase to avoid confusion, and finally removing stop words from the data. Stop words are those words which don't play an important role in sentence formation, such as "are", "we", "is", etc. The cleaned data is stored in a separate dataset containing 10,000 entries with two main columns, "query" and "description".
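A minimal sketch of this cleanup step, assuming NLTK's English stop word list:

# Sketch of the cleanup described above: strip blanks, lowercase, drop stop words.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    words = text.strip().lower().split()  # remove blank space, lowercase
    return " ".join(w for w in words if w not in stop_words)  # drop "are", "we", "is", ...

print(clean_text("  We are looking for a Data Scientist "))  # -> "looking data scientist"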
3.3 Tokenization

In this step, each entry in the corpus, i.e., each entry in the document, is broken down into a set of words. To begin the tokenization process, we look for the concepts or words that make up a character sequence. This is significant because we will be able to deduce meaning from the original text sequence using these terms. Tokenization is the process of separating large chunks of text into smaller pieces known as tokens, accomplished by deleting or isolating characters such as whitespace and punctuation. Tokens are the individual words that phrases are divided into when paragraphs are tokenized. By doing tokenization we may obtain information such as the number of words in a text, the frequency of a specific term in the text, and much more. Tokenization can be done in a variety of ways, such as by utilising the Natural Language Toolkit (NLTK), the spaCy library, and so on. Tokenization is a required step for subsequent text processing such as stop word removal, stemming, and lemmatization.
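A minimal sketch with NLTK's word tokenizer (spaCy would work equally well):

# Sketch of tokenization with NLTK, as mentioned above.
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")

tokens = word_tokenize("Junior data scientist to solve business problems.")
print(tokens)           # ['Junior', 'data', 'scientist', 'to', 'solve', 'business', 'problems', '.']
print(len(tokens))      # number of tokens in the entry
print(Counter(tokens))  # frequency of each specific term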
3.4 Stemming and Lemmatization

It is common to see a single English word employed in a variety of different ways in different phrases based on its grammatical role. "Describe", "describing" and "described", for example, are all various tenses of the same verb. This necessitates the reduction of all changed or derived versions of a word to its primary stem or base, so that these derivationally related terms with comparable meanings are not deemed distinct from one another. Both stemming and lemmatization have this same goal but take different approaches to achieve it.

"Stemming is the mechanism of reducing inflected or derived words to their word root, or stem. It is a crude heuristic process that involves chopping off the ends of words to achieve this objective, and often includes the removal of derivational affixes [7]." Stemmers are rule-based algorithms that analyse a certain word under a variety of scenarios and then decide how to shorten it based on a list of recognized suffixes. It is worth noting that the root generated after stemming may not be the same as the word's morphological root, and stemming is prone to under- and over-stemming.
The Porter stemmer, Snowball stemmer, and Lancaster stemmer are some common stemming algorithms. Lemmatization, on the other hand, is the process of accurately reducing words to root words using a language dictionary. Lemmatization, as opposed to stemming, which merely chops down tokens by basic pattern matching, is a more sophisticated technique that employs language vocabulary and morphological study of words to provide linguistically proper lemmas. This implies that lemmatization makes use of context information and may thus distinguish between words with various meanings based on parts of speech. For the English language, our system uses the NLTK python package's WordNet Lemmatizer (based on the WordNet database).

The concept of stemming and lemmatization can be understood by a simple example. The word "loving" will turn into "lov" after stemming, which has no meaning, while if lemmatized, "loving" will turn into "love", which has a proper meaning and use case.
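A small sketch reproducing this example with NLTK. Note that it is the aggressive Lancaster stemmer that yields "lov" (the gentler Porter stemmer would actually give "love" here), and the WordNet lemmatizer needs the verb part of speech to return "love":

# Sketch of the "loving" example above with NLTK.
import nltk
from nltk.stem import LancasterStemmer, WordNetLemmatizer

nltk.download("wordnet")

print(LancasterStemmer().stem("loving"))                 # 'lov'  - not a real word
print(WordNetLemmatizer().lemmatize("loving", pos="v"))  # 'love' - a proper dictionary word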
3.5 Parts of Speech (POS) Tagging

POS tagging is the process of associating grammatical information with a word depending on its context and relationship to other words in the sentence [8]. According to its usage in the phrase, the part-of-speech tag identifies whether the word is a noun, pronoun, verb, adjective, or other. These tags must be assigned in order to grasp the right meaning of a phrase and to create knowledge bases for character recognition. This procedure is not as straightforward as mapping each word to a fixed part-of-speech tag, because a word may take a distinct part of speech depending on the context in which it is used. For example, "writing" is a verb in the statement "I am writing an essay," yet "building" is a noun in the line "I stay in the tallest building in the entire town." POS tagging is a supervised learning approach that analyses information such as the preceding word, the following word, whether the initial letter is capitalised, and so on, to label the words after tokenization. It is also known as grammatical tagging.
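A minimal sketch with NLTK's default tagger, illustrating the "writing" example:

# Sketch of POS tagging with NLTK's averaged-perceptron tagger.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("I am writing an essay")
print(nltk.pos_tag(tokens))
# [('I', 'PRP'), ('am', 'VBP'), ('writing', 'VBG'), ('an', 'DT'), ('essay', 'NN')]
# 'writing' is tagged VBG, i.e. a verb (gerund), exactly as discussed above.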
3.6 TF-IDF Vectorization

TF-IDF stands for Term Frequency – Inverse Document Frequency; it tells us how important a word is within the set of words in the dataset and assigns a TF-IDF value indicating the importance of the word as per its frequency.

The term frequency is the number of occurrences of a word in a sentence divided by the total number of words in the sentence (1). The inverse document frequency is the log of the total number of sentences in the document divided by the number of sentences that actually contain the word (2). Multiplying the TF and IDF values gives the final TF-IDF score for each word (3). Fig. 1 depicts a graph of TF-IDF vectorization: words are placed in ascending order of importance, giving an inverse growth curve.

TF = (number of occurrences of the word in a sentence) / (total number of words in the sentence)    (1)

IDF = log((total number of sentences in the document) / (number of sentences that contain the word))    (2)

TF-IDF = TF × IDF    (3)

Fig. 1. Graph of TF-IDF vectorization.

This marks all the important sets of words related to jobs and skill sets with the highest term-frequency values, which can then be used to fetch important words and train the various algorithms on them.
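A minimal sketch of this vectorization step with scikit-learn (note that TfidfVectorizer uses a smoothed variant of equations (1)-(3), not these exact formulas):

# Sketch of TF-IDF vectorization over a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "junior data scientist solve business problem",
    "cloud architect design cloud service",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # one row per document, one column per term

print(vectorizer.get_feature_names_out())  # vocabulary discovered in the corpus
print(tfidf.toarray().round(2))            # TF-IDF weight of each term per document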
3.7 Applying Classification Algorithms to the Dataset

The following classification algorithms have been used for classification and model training.

Naïve Bayes is a classification algorithm that produces a probabilistic output, i.e. whether an event is going to occur or not given a set of conditions. The training data is fitted on the NB classifier. Then, labels are predicted for the validation dataset, and accuracy is obtained using the accuracy_score function. This classification model gave an accuracy of around 45% and did not give the expected results.

The Support Vector Machine (SVM) algorithm classifies data by drawing a hyperplane between two or more classes of items. The hyperplane which best separates the items is considered the ideal output. The workflow for SVM is similar to that of Naïve Bayes. This classification model gave an accuracy of 60%, which proved to be better than the Naïve Bayes model.

Random Forest is a classification algorithm that works on the principle of decision trees. It takes many decision trees as input and returns the majority output over all the input decision trees. The training data is fitted on the RF classifier, labels are predicted for the validation dataset, and accuracy is obtained using the accuracy_score function. The random forest model gave an accuracy of 70% and made the most correct predictions in the comparison.

The comparison between the Naïve Bayes, SVM and random forest models on accuracy, precision, recall and F1 score is given in Table 1.
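A minimal sketch of this training loop with scikit-learn, assuming the TF-IDF features and labels from the earlier sketches; the specific estimators and parameters are illustrative, since the paper does not state which variants were used:

# Sketch: fit each classifier on TF-IDF training features, predict on the
# validation split, and report accuracy with accuracy_score.
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

models = {
    "Naive Bayes": MultinomialNB(),
    "SVM": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=2500),  # 2,500 trees, as in Section 5
}

for name, model in models.items():
    model.fit(X_train_tfidf, y_train)          # X_train_tfidf: TF-IDF matrix of training text
    predictions = model.predict(X_test_tfidf)  # predict labels for the validation split
    print(name, accuracy_score(y_test, predictions))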

4 Methodology Flowchart

The methodology is important to understand, and hence it becomes necessary to make a flowchart for easy understanding of the flow of the system. Fig. 2 shows the flowchart of the system.

Fig. 2. Proposed flowchart for the system.

First, the dataset needs to be scraped, which can be done using various websites like indeed.com, glassdoor.com and kaggle.com. Once the dataset has been scraped, the data is preprocessed by doing proper stemming and lemmatization, removing stop words and filler words, and storing the relevant words in a separate column for further processing. After the preprocessing of data is done, the words which have occurred the most need to be put into a term-frequency document using TF-IDF vectorization. Once the previous steps are done, the cleaned data is ready for training and testing classification models built with machine learning algorithms. Analysing the models by their accuracy, and forming the confusion matrix for each, is required for a better understanding of the results.

5 Results
The dataset has been taken from various websites like indeed.com, glassdoor.com and kaggle.com. These datasets are unstructured, and thus preprocessing has been done, which includes tokenization, removal of stop words, and stemming and lemmatization with POS tagging. The output of preprocessing is shown in Fig. 3, whose corpus consists of 3 main columns and 10,000 rows: "Query", displaying all the job profiles (Fig. 4); "Description", containing the raw text (Fig. 5); and "text_final", displaying the main set of preprocessed words (Fig. 6). The zoomed versions of Fig. 3 are shown in Fig. 4, Fig. 5, and Fig. 6.

Fig. 3. Output of preprocessing of data.

Fig. 4. Query column from Fig. 3.

Fig. 5. Description column from Fig. 3.

Fig. 6. text_final column from Fig. 3.

After preprocessing, all the words in the dataset are given TF-IDF values in order to arrange the
word matrix in ascending order of their importance. The result of TF-IDF vectorization, showing the terms whose term frequency is more than 5000, is given in Fig. 7, with output like: 'job': 2456, 'description': 1221, 'junior': 2480, 'data': 1134, 'scientist': 3984, 'ibm': 2175, 'work': 4959, 'part': 3216, 'team': 4448, 'solve': 4182, 'problem': 3498, 'use': 4763, 'technique': 4455, 'apply': 266, 'scientific': 3983, 'method': 2816, 'business': 578, 'scenario': 3971, 'clean': 734, 'prepare': 3445, 'statistical': 4286, 'machine': 2667, 'learn': 2556, 'model': 2877, 'variety': 4798, 'analytical': 215, 'algorithm': 165, 'build': 570, …

Fig. 7. Output of TF-IDF vectorization.

After being cleaned, the data forms a new dataset containing 10,000 entries with two main categories, "query" and "description". Fig. 8 depicts the 25 main job classes that are being evaluated, which are: 'Artificial Intelligence', 'Big Data Engineer', 'Business Analyst', 'Business Intelligence Analyst', 'Cloud Architect', 'Cloud Services Developer', 'Data Analyst', 'Data Architect', 'Data Engineer', 'Data Quality Manager', 'Data Scientist', 'Data Visualization Expert', 'Data Warehousing', 'Data and Analytics Manager', 'Database Administrator', 'Deep Learning', 'Full Stack Developer', 'IT Consultant', 'IT Systems Administrator', 'Information Security Analyst', 'Machine Learning', 'Network Architect', 'Statistics', 'Technical Operations', 'Technology Integration'.

Fig. 8. Classification of jobs.

After the data is cleaned and arranged, classification of data into various job profiles and skill sets can be achieved using various machine learning classification algorithms. The developed systems show excellent true predictions when the confusion matrix is taken into consideration. The diagrams below depict the confusion matrices of the various classifiers, where true positive values are those where the model predicted the correct job profile as the expected value. Fig. 9 shows the confusion matrix over all the job classes.

Fig. 9. Confusion matrix of job classes.

The confusion matrix for the Naïve Bayes classification model is shown in Fig. 10, depicting the true positive, true negative, false positive and false negative values that have been detected. A diagonal line of dark colored boxes can be observed, which shows the true positive values for each label using Naïve Bayes classification.

Fig. 10. Confusion matrix for Naïve Bayes classification.

The confusion matrix for the SVM classification model is shown in Fig. 11, depicting the true positive, true negative, false positive and false negative values that have been detected. A diagonal line of dark colored boxes can be observed, which shows the true positive values for each label using SVM classification.
Fig. 11. Confusion matrix for SVM classification.

The confusion matrix for random forest is shown in Fig. 12, depicting the true positive, true negative, false positive and false negative values that have been detected. A diagonal line of dark colored boxes can be observed, which shows the true positive values for each label using random forest classification.

Fig. 12. Confusion matrix for random forest classification.
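A minimal sketch of how such confusion matrices can be produced with scikit-learn, assuming a fitted model and the predictions from the earlier training sketch:

# Sketch: plot a confusion matrix like Figs. 9-12; the dark diagonal
# corresponds to the true positives for each job class.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, predictions)
ConfusionMatrixDisplay(cm, display_labels=model.classes_).plot(xticks_rotation=90)
plt.show()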

The end result of the confusion matrices and classification models shows that random forest, with 2500 decision trees taken into consideration, is best suited for the classification of resume data. Table 1 shows the results and comparison of the various classification algorithms.

Table 1. Comparison of various classification algorithms.

Algorithm       Accuracy (%)   Precision   Recall   F1 Score
Naïve Bayes     45             0.521       0.452    0.448
SVM             60             0.598       0.597    0.594
Random Forest   70             0.687       0.683    0.678

Despite the average accuracy being 70%, the confusion matrix shows excellent results. Accuracy is not the only measure that needs to be taken into consideration while building a model; the confusion matrix plays an equally important part, showing how many true positive values have been returned. As per the comparison in Table 1, random forest has the highest accuracy and the best confusion matrix.

The random forest accuracy can be improved by increasing the number of trees or by increasing the size of the data. In our case, we took 2,500 trees; increasing that number would increase the accuracy of the model, but at the same time training would get slower. Another technique that can be applied for improvement is hyperparameter tuning of the random forest classifier.
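A minimal sketch of such tuning via cross-validated grid search; the parameter grid here is illustrative, not the one used in the paper:

# Sketch: hyperparameter tuning of the random forest with GridSearchCV.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [500, 1000, 2500],  # more trees: higher accuracy, slower training
    "max_depth": [None, 20, 50],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, n_jobs=-1)
search.fit(X_train_tfidf, y_train)  # TF-IDF features from the earlier sketches
print(search.best_params_, search.best_score_)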

Conclusion and Future Work

The concept of classification is grasped through resume classification, and classification models have been built using numerous techniques. This resume categorization platform will make the e-recruitment process more efficient and user-friendly, assisting businesses and saving time throughout the recruitment process.

Future work includes combining these classification models with a frontend resume portal, where the user will be able to upload their resume while resume classification works in the backend. From the results and discussions, we can conclude that the various algorithms have given different accuracy scores, with random forest having the highest accuracy among all three. The confusion matrix of random forest is also well suited for all the observed classes.

Thereby, resume classification is achieved through preprocessing of data, cleaning of data, and the application of various classification algorithms.

This research can also be extended with the combination of other video-interviewing features such as facial recognition, voice-to-text generation and voice analysis. Building a plugin for people to include resume classification models in their own projects is also one of the future goals of this research and project work.
References

1. B. Balci, D. Saadati, D. Shiferaw, Handwritten Text Recognition using Deep Learning, Stanford University (2017).
2. S. W. Kim, J. M. Gil, Research paper classification systems based on TF-IDF and LDA schemes, Hum. Cent. Comput. Inf. Sci. 9, 30 (2019).
3. C. Friedman, G. Hripcsak, Natural language processing and its future in medicine, Acad. Med., August (1999).
4. J. Ramos, Using TF-IDF to Determine Word Relevance in Document Queries, Rutgers University.
5. Xing Yi, James Allan, W. Bruce Croft, Matching Resumes and Jobs Based on Relevance Models, Center for Intelligent Information Retrieval, Department of Computer Science, 140 Governor's Drive, SIGIR (2017).
6. Aseel B. Kmail, Mohammed Maree, Mohammed Belkhatir, Saadat M. Alhashmi, An Automatic Online Recruitment System based on Exploiting Multiple Semantic Resources and Concept-relatedness Measures, IEEE 27th International Conference on Tools with Artificial Intelligence (2015).
7. A. G. Jivani, A comparative study of stemming algorithms, Int. J. Comp. Tech. Appl. (2011).
8. A. Gelbukh, Computational Linguistics and Intelligent Text Processing, Berlin, Heidelberg: Springer Berlin Heidelberg (2014).
9. S. Laumer, A. Eckhardt, Help to find the needle in a haystack: integrating recommender systems in an IT supported staff recruitment system, in Proceedings of the Special Interest Group on Management Information Systems' 47th Annual Conference on Computer Personnel Research (2009).
10. A. Zaroor, M. Maree, Muath Sabha, JRC: a job post and resume classification system for online recruitment, International Conference on Tools with Artificial Intelligence (2017).
11. S. Bird, E. Klein, E. Loper, Natural Language Processing with Python, Beijing: O'Reilly (2009).
12. S. B. Kim, H. C. Rim, D. S. Yook, H. S. Lim, Effective Methods for Improving Naive Bayes Text Classifiers (2002).
13. F. Dernoncourt, J. Y. Lee, PubMed 200k RCT: a dataset for sequential sentence classification in medical abstracts (2017).
14. D. E. Johnson, F. J. Oles, T. Zhang, T. Goetz, A decision-tree-based symbolic rule induction system for text categorization, IBM Systems Journal, September (2002).
15. S. Sanyal, N. Ghosh, S. Hazra, S. Adhikary, Resume Parser with Natural Language Processing, IJESC (2007).
16. X. Chen, X. Lin, Big Data Deep Learning: Challenges and Perspectives, IEEE (2014).
17. Y. Bao, N. Ishii, Combining Multiple kNN Classifiers for Text Categorization by Reducts.
18. Johnson Kolluri, Shaik Razia, Soumya Ranjan Nayak, Text Classification Using Machine Learning and Deep Learning Models, International Conference on Artificial Intelligence in Manufacturing & Renewable Energy (ICAIMRE) (2019).
19. S. Brindha, K. Prabha, S. Sukumaran, A survey on classification techniques for text mining, 3rd International Conference on Advanced Computing and Communication Systems (ICACCS) (2016).
