1 Technical Seminar Report


Visvesvaraya Technological University

BELAGAVI, KARNATAKA.

A TECHNICAL SEMINAR REPORT ON

“An Automated Resume Screening System Using
Natural Language Processing and Similarity”

Submitted to Visvesvaraya Technological University in partial
fulfillment of the requirement for the award of Bachelor of
Engineering in Computer Science and Engineering.

Submitted by:

SRIVATHSA ACHARYA R 4JN19CS105

Under the Guidance of

Chakrapani D S, B.E., M. Tech
Assistant Professor, Dept. of CS&E
JNNCE, Shivamogga.

Department of Computer Science & Engineering
J.N.N. College of Engineering
Shivamogga - 577 204
2023
National Education Society (R.)

CERTIFICATE
This is to certify that the Technical Seminar report entitled

“An Automated Resume Screening System Using
Natural Language Processing and Similarity”
Submitted by:

SRIVATHSA ACHARYA R 4JN19CS105

student of 8th semester B.E., under supervision and guidance, towards
the partial fulfillment of the requirement for the award of the Bachelor of Engineering
degree in Computer Science and Engineering of Visvesvaraya Technological
University, Belagavi during the year 2021.

Signature of Guide
Chakrapani D S., B.E., M. Tech.
Asst. Professor, Dept. of CS&E

Signature of Coordinators
Dr. Ravindra S., M.Tech., Ph. D.
Assoc. Prof., Dept. of CS&E
Dr. Sankhya N Nayak., M.Tech., Ph. D.
Assoc. Prof., Dept. of CS&E

Signature of HOD
Dr. K.M. Poornima. M.Tech., Ph. D.
Prof. & Head, Dept. of CS&E
ABSTRACT
A typical job posting on the Internet receives a massive number of
applications within a short window of time. Manually filtering out the resumes is not
practically possible as it takes a lot of time and incurs huge costs that the hiring companies
cannot afford to bear. In addition, manual screening is not fair, as many
suitable profiles do not get the consideration they deserve. The proposed system
uses Natural Language Processing to extract relevant information like skills, education,
experience, etc. from the unstructured resumes and hence creates a summarised form of
each application. With all the irrelevant information removed, the task of screening is
simplified and recruiters are able to better analyse each resume in less time. After this text
mining process is completed, the proposed solution employs a vectorisation model and uses
cosine similarity to match each resume with the job description. The calculated ranking
scores can then be utilised to determine best-fitting candidates for that particular job
opening.

ACKNOWLEDGMENT

The credit for the successful completion of the technical seminar goes to the
people who were a constant source of knowledge and who offered timely suggestions
and instructions.

First of all, I wish to express my earnest thanks and respect to my project
guide Chakrapani D S, Assistant Professor, Department of Computer Science &
Engineering, who has been my motivator and source of inspiration.

I would like to thank our beloved Coordinators Dr. Ravindra S and Dr. Sankhya N
Nayak, Department of Computer Science & Engineering for their support.

I would like to thank our beloved Professor and Head Dr. K. M. Poornima,
Department of Computer Science & Engineering for allowing me to take up this technical
seminar.

I am very much grateful to our respected principal, Dr. K. Nagendra Prasad for his
encouragement and providing an excellent working environment in our college.

Finally, I thank our teaching and non-teaching staff, classmates and all who have
helped me directly or indirectly in the successful completion of this technical seminar.
I would also like to thank JNNCE for providing a stage to show my talent.

SRIVATHSA ACHARYA R
4JN19CS105

Table of Contents
Abstract
Acknowledgement
Table of Contents
List of Figures
Chapter 1: Introduction
Chapter 2: Technology
2.1: Natural Language Processing
2.2: Similarity Technology
Chapter 3: Methodology
3.1: Information Extraction
3.2: Content Based Candidate Recommendation
Chapter 4: Applications & Advantages
4.1: Applications
4.2: Advantages
Chapter 5: Conclusion
Bibliography

List of Figures

Fig. No. Name of Figure

2.1 Natural Language Processing Pipeline
2.2 Example of Similarity in NLP
3.1 JSON Output Generated by the System
3.2 Architecture Diagram of the System

AN AUTOMATED RESUME SCREENING SYSTEM USING NLP AND SIMILARITY

Chapter 1
INTRODUCTION

With the rapid increase in internet connectivity, there has been a change in the
recruitment process of all major companies. With the help of online job postings on various
job portals and websites, recruiters are able to attract a wide variety of people for their
openings. Though e-recruitment has provided convenience and savings for both
recruiters and applicants, it has also created new challenges. Large companies and recruitment
agencies often receive thousands of resumes every day. This situation is aggravated
by the higher mobility of workers and by periods of economic distress,
when many people are looking for jobs. With less than 5% of applicants to be selected from
these applications, it is impractical for recruiters to manually go through each and every
resume for this limited number of openings.

Another problem faced by the organizations is that there is no one standard resume
format used by these applicants. People come from varied fields of profession and have
different backgrounds. Each one of them has had different types of education, has worked
on different projects and thus has a unique style of presenting his/her credentials in the
resume. Resumes are unstructured documents that come in various file formats (.pdf, .doc,
.docx, .jpg, .txt etc.) and their content is not written according to standard formats or
templates. This means reading resumes is not simple and thus recruiters spend a large
amount of time going through the resumes for selecting the right candidates.

Many job portals and external websites emerged to reduce this difficulty of
handling unstructured and diverse resumes. These require candidates to manually fill in all
the information from their resume in a structured online form, thus creating
candidate metadata. The problem with this approach is that it requires redundant effort on
the part of the candidates, who often fail to provide complete information in these
templates.

The traditional keyword-based search functionality is insufficient to match
candidates with the job description. It relies only on the presence of certain
required keywords and ignores natural language semantics such as synonyms,
word combinations, and the contextual meaning of the content present in the resume.

Dept. of CS&E, 2022-2023

In order to get better results for the resume shortlisting, it is necessary to investigate
more efficient approaches to candidate and job description matching. The proposed solution
will choose the best fitting candidates for a specific opportunity by relating the main
features of the applicants’ profile with the requirements defined in the job description. The
system works in two main phases:

In the first phase, all relevant candidate information like skills, work experience,
years of education, certifications, etc. is extracted from the unstructured text in the resumes.
The system uses Natural Language Processing to parse these relevant qualification details
and then creates a summarized version of each resume irrespective of the order of content
or the file format. With all the extraneous and irrelevant details removed, evaluators
can quickly look at the summarized form and analyze the credentials
of the candidates.

The second phase of the system involves ranking the resumes based on the
similarity of their content with the given job description. The documents are represented as
vectors using the Vector Space Model, and similarity measures like cosine similarity are then
used to determine which resumes best fit the particular job. In the end,
a ranked list of applicants is obtained.


Chapter 2

TECHNOLOGY

2.1 Natural language processing:


Natural language processing (NLP) is a subfield of computer science and artificial
intelligence that focuses on the interaction between computers and human language. It
involves the use of algorithms and computational techniques to analyze, understand, and
generate natural language text. NLP is a rapidly growing field that has numerous practical
applications, including machine translation, sentiment analysis, speech recognition, and
text classification.

NLP techniques can be broadly classified into two categories: rule-based and
statistical. Rule-based techniques involve the use of manually crafted rules and grammars
to analyze and process natural language text. Statistical techniques, on the other hand, rely
on machine learning algorithms and statistical models to learn patterns and relationships
from large amounts of data.

Some of the popular NLP techniques include tokenization, part-of-speech tagging,
named entity recognition, sentiment analysis, and machine translation. Tokenization
involves breaking up text into smaller units such as words or phrases. Part-of-speech
tagging involves assigning a part of speech (e.g., noun, verb, adjective) to each word in a
sentence. Named entity recognition involves identifying and categorizing named entities
such as people, organizations, and locations. Sentiment analysis involves determining the
sentiment or emotion expressed in a piece of text, while machine translation involves
automatically translating text from one language to another.

Fig 2.1: Natural Language Processing Pipeline


2.2 Similarity technology:

Similarity technology refers to a set of computational techniques that are used to
compare and measure the similarity between two or more objects. In the context of natural
language processing, similarity techniques are used to compare and measure the similarity
between two texts, such as a job description and a resume.

There are various similarity techniques used in NLP, including string similarity, semantic
similarity, and lexical similarity.

String similarity measures the similarity between two texts based on the overlap
between their character sequences. Semantic similarity, on the other hand, measures the
similarity between two texts based on their meaning and context. This is typically done by
representing the texts as vectors in a high-dimensional space and measuring the distance or
angle between them. Lexical similarity measures the similarity between two texts based on
the overlap between their words and phrases.

Some of the popular similarity measures used in NLP include cosine similarity,
Jaccard similarity, and edit distance. Cosine similarity measures the cosine of the angle
between two vectors and is commonly used to measure semantic similarity between texts.
Jaccard similarity measures the similarity between two sets and is commonly used to
measure lexical similarity between texts. Edit distance measures the minimum number of
operations (such as insertions, deletions, and substitutions) required to transform one text
into another.
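
As a rough illustration of the lexical and string measures described above, the following Python sketch computes Jaccard similarity over word sets and edit distance (Levenshtein distance) over characters. The function names and the whitespace-based word splitting are assumptions made for this example; cosine similarity is treated in Section 3.2.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Size of the intersection of the two word sets divided by the size of their union."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def edit_distance(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning source into target."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, char_s in enumerate(source, start=1):
        current = [i]
        for j, char_t in enumerate(target, start=1):
            substitution_cost = 0 if char_s == char_t else 1
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + substitution_cost))   # substitution
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(jaccard_similarity("python java sql", "python sql docker"))
    print(edit_distance("kitten", "sitting"))
```

Jaccard similarity ignores how often a word occurs, while edit distance works at the character level and is therefore better suited to catching spelling variants than to comparing whole documents.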

Fig 2.2: Example of similarity in NLP


Chapter 3

METHODOLOGY
The methodology can be divided into two phases, namely, the information
extraction phase and the content-based candidate recommendation phase, each of which
involves several sub-techniques.

3.1 Information Extraction:

The first phase of the proposed system involves information extraction using
Natural Language Processing. The information in the resumes is not present in a structured
format. It contains noise, inconsistencies and irrelevant bits of data that are of no use to
the recruiters. The objective is to derive relevant keywords from the unstructured textual
data in the resume without any manual effort. Using techniques like
Tokenization, Stemming, POS Tagging, Named Entity Recognition, etc., the system
obtains important job-related content (skills, experience, education, etc.) from the uploaded
candidate resumes. The result is a summarized version of each resume in JSON format,
which can be easily used for further processing in the next phase of this resume
screening system.

1. Tokenization: Tokenization identifies the terms or words that make up
a character sequence. This is important because meaning is derived from the original text
through these words. Tokenization divides large chunks of text
into smaller parts called tokens, typically by splitting on or isolating characters such as
whitespace and punctuation. It can be performed in multiple
ways, for example with the Natural Language Toolkit (NLTK) or the spaCy library.
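
The idea can be sketched with a regular expression from the Python standard library. This is only an illustrative tokenizer, not the one used by NLTK or spaCy; the pattern (which tries to keep skill names such as "C++" or "Node.js" in one piece) and the sample sentence are assumptions made for this example.

```python
import re

# Hypothetical pattern: a word that may contain inner dots or hyphens and may end
# in '+' or '#' (so "C++", "C#" and "Node.js" survive), or a plain/decimal number.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:[.\-][A-Za-z0-9]+)*[+#]*|\d+(?:\.\d+)?")


def tokenize(text: str) -> list[str]:
    """Split text into tokens, dropping whitespace and stray punctuation."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    sample = "Skilled in Python, C++ and Node.js; 3.5 years at ACME Corp."
    print(tokenize(sample))
```

A naive split on whitespace would leave punctuation attached to words ("Python,"), while a pattern that keeps only letters would break "C++" apart, which is why resume-oriented tokenizers usually need some domain-specific rules.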

2. Stemming and lemmatization: Stemming is the mechanism of reducing inflected or
derived words to their word root, or stem. It is a crude heuristic process that
chops off the ends of words to achieve this objective, and often includes the removal of
derivational affixes. Stemmers are rule-based algorithms that test a word
against a range of conditions and then, based on a list of known suffixes, decide how to cut it
down. Lemmatization is the process of using a language dictionary to perform an
accurate reduction to root words. Unlike stemming, which simply cuts off tokens by simple
pattern matching, lemmatization is a more careful approach that uses language vocabulary
and morphological analysis of words to give linguistically correct lemmas. This means
lemmatization uses knowledge of context and can therefore differentiate between
words that have different meanings based on parts of speech.
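
The contrast can be illustrated with a deliberately tiny sketch. Both the suffix list and the lemma table below are hypothetical toy data invented for this example; real stemmers (such as the Porter stemmer) and dictionary-based lemmatizers use far larger rule sets and vocabularies.

```python
# Toy suffix list, checked in order; a real stemmer has many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

# Toy dictionary keyed by (word, part of speech); entries are illustrative only.
LEMMA_TABLE = {
    ("managed", "VERB"): "manage",
    ("managing", "VERB"): "manage",
    ("studies", "NOUN"): "study",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
}


def crude_stem(word: str) -> str:
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str, pos: str) -> str:
    """Look the word up together with its part of speech; fall back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)


if __name__ == "__main__":
    for word in ("managed", "studies", "meeting"):
        print(word, "->", crude_stem(word))
    print(lemmatize("meeting", "NOUN"), lemmatize("meeting", "VERB"))
```

The stemmer produces non-words such as "manag" and "studi", whereas the lemmatizer returns dictionary forms and distinguishes the noun "meeting" from the verb form of "meet".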

3. Parts of speech (POS) tagging: It is a process of assigning grammatical information to
a word based on its context and its relationship with other words in the sentence. The part-
of-speech tag specifies whether the word is a noun, pronoun, verb, adjective, etc. according
to its usage in the sentence. It is important to assign these tags so as to understand the
correct meaning of a sentence and for building knowledge graphs for named entity
recognition.

4. Chunking: Chunking is a process that aims to add more structure to sentences by
grouping short phrases with parts of speech tags. Because parts of speech tags alone cannot
give information about the structure of the sentence or the actual meaning of the text,
chunking combines parts of speech tags with regular expressions to give a result as a set of
chunk tags like Noun Phrase (NP), Verb Phrase (VP), etc.
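
A minimal sketch of how POS tags and regular expressions combine in chunking is given below. The input is assumed to be already tagged (the tags are written by hand in Penn Treebank style), and the noun-phrase rule "optional determiner, any adjectives, one or more nouns" is an assumption chosen for this example rather than the grammar used by the system.

```python
import re

# Map each POS tag to a single character so a regex can run over the tag sequence.
TAG_CODES = {"DT": "D", "JJ": "J", "NN": "N", "NNS": "N", "NNP": "N"}

# Hypothetical noun-phrase rule: optional determiner, any adjectives, one or more nouns.
NP_RULE = re.compile(r"D?J*N+")


def noun_phrase_chunks(tagged_tokens: list[tuple[str, str]]) -> list[str]:
    """Return the noun phrases (NP chunks) found in a POS-tagged sentence."""
    tag_string = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    chunks = []
    for match in NP_RULE.finditer(tag_string):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        chunks.append(" ".join(words))
    return chunks


if __name__ == "__main__":
    # Hand-tagged sample sentence; a POS tagger would normally produce these tags.
    sentence = [("Developed", "VBD"), ("a", "DT"), ("scalable", "JJ"), ("web", "NN"),
                ("application", "NN"), ("using", "VBG"), ("Python", "NNP")]
    print(noun_phrase_chunks(sentence))
```

Because each tag is encoded as exactly one character, positions in the tag string line up with token positions, so a regex match can be mapped straight back to the words it covers.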

5. Named entity recognition: Named Entity Recognition is an information extraction
technique which extracts relevant information by classifying chunks of unorganized text
into predefined categories like names of persons, companies, contact info, educational
credentials, and skills.
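
As a simplified stand-in for a trained named entity recognizer, the sketch below uses regular expressions and a small skill vocabulary to pull a few fields out of resume text and emit them as JSON. The patterns, the vocabulary, the field names and the sample resume are all assumptions made for illustration; they do not reproduce the system's actual output format.

```python
import json
import re

# Hypothetical skill vocabulary; a real system would use a much larger curated list.
SKILL_VOCABULARY = {"python", "java", "sql", "machine learning", "docker"}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d -]{8,}\d")


def extract_entities(resume_text: str) -> dict:
    """Classify pieces of unstructured resume text into a few predefined categories."""
    lowered = resume_text.lower()
    email = EMAIL_RE.search(resume_text)
    phone = PHONE_RE.search(resume_text)
    skills = sorted(skill for skill in SKILL_VOCABULARY
                    if re.search(rf"\b{re.escape(skill)}\b", lowered))
    return {
        "email": email.group() if email else None,
        "phone": phone.group() if phone else None,
        "skills": skills,
    }


if __name__ == "__main__":
    sample_resume = (
        "Jane Doe\n"
        "Email: jane.doe@example.com | Phone: +91 98765 43210\n"
        "Skills: Python, SQL, Machine Learning\n"
    )
    print(json.dumps(extract_entities(sample_resume), indent=2))
```

Rule-based extraction of this kind works for rigid fields such as e-mail addresses, whereas names, organizations and job titles generally need a statistical NER model.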


Figure 3.1: JSON output generated by the system after the completion of the information
extraction process on a sample resume.

3.2 Content Based Candidate Recommendation:

The second phase of the proposed system aims to build a content-based
recommendation system that utilizes the extracted entities from phase 1 to recommend the
most appropriate resumes for the given job description. The system employs concepts like
vectorization, weighting techniques like TF-IDF, and similarity
measures like cosine similarity to calculate the similarity among the contents of the
documents.

1. Vectorization: Representing documents in a vector space model is called vectorization.
It is the process of turning a document into a numerical vector. An important reason for
performing vectorization is that most machine learning models require the input to be
numerical vectors rather than strings. A common way of vectorizing text is to map every
possible word to a specific integer. Given an array as large as the vocabulary, every word
maps to a unique slot in the array, and the value at that index is the number of times the
word occurs in the document.
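
The word-to-index mapping described above can be sketched in a few lines of Python. The helper names and the two sample documents are assumptions for this example, and words are split on whitespace only to keep the sketch short.

```python
def build_vocabulary(documents: list[str]) -> dict[str, int]:
    """Assign every distinct word a unique integer index, in order of first appearance."""
    vocabulary: dict[str, int] = {}
    for document in documents:
        for word in document.lower().split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def vectorize(document: str, vocabulary: dict[str, int]) -> list[int]:
    """Turn a document into a count vector with one slot per vocabulary word."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        index = vocabulary.get(word)
        if index is not None:  # words outside the vocabulary are ignored
            vector[index] += 1
    return vector


if __name__ == "__main__":
    docs = ["python sql python", "java sql"]
    vocab = build_vocabulary(docs)
    print(vocab)
    print([vectorize(doc, vocab) for doc in docs])
```

Every document is mapped to a vector of the same length, which is what allows a resume and a job description to be compared numerically.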

2. TF-IDF: TF-IDF stands for “Term Frequency – Inverse Document Frequency”. The TF-
IDF weight is often used in text mining. It was originally developed for information
retrieval and document search. The weight is a numerical measure of how
important a term is with respect to a document in a collection or corpus. The importance
increases proportionally to the frequency of a word within the document but is offset by the
number of documents that contain the word. The TF-IDF value for a term in a document is
calculated by multiplying two different metrics:

$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

• Term Frequency: It measures how frequently a word occurs in each document in the
corpus. Since a word may occur more often in lengthy documents than in shorter
ones, this frequency needs to be normalized. A normalized term frequency
is calculated by dividing the number of times a term appears in a document by the total
number of terms in that document:

$$\mathrm{TF}(t, d) = \frac{\mathrm{freq}(t, d)}{\sum_{i=1}^{n} \mathrm{freq}(t_i, d)}$$

Here, freq (t, d) is the count of the instances of the term t in document d, TF (t, d) is the
proportion of the count of term t in document d, and n is the number of distinct terms in
document d.

• Inverse Document Frequency: It measures how informative a word is across all documents
in the corpus. In other words, this metric indicates how rare or common a word is
across the corpus. It weighs down the terms that occur more often while scaling up the
rare terms. The terms that appear in most of the documents have an IDF value close
to 0, while the rare terms have a high IDF:

$$\mathrm{IDF}(t) = \log\left(\frac{N}{\mathrm{count}(t)}\right)$$

Here, N is the number of distinct documents in the corpus and count (t) is the number of
documents in the corpus in which the term t is present.
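
The two formulas above translate almost directly into code. The following sketch assumes documents are already tokenized into lists of words, uses the natural logarithm (the base is not fixed by the text), and returns an IDF of 0 for a term that appears in no document, which is a convention chosen here to avoid division by zero.

```python
import math
from collections import Counter


def term_frequency(term: str, document: list[str]) -> float:
    """Count of the term in the document divided by the total number of terms in it."""
    return Counter(document)[term] / len(document)


def inverse_document_frequency(term: str, corpus: list[list[str]]) -> float:
    """log(N / count(t)), where count(t) is the number of documents containing the term."""
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        return 0.0  # assumed convention for unseen terms
    return math.log(len(corpus) / containing)


def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    """Product of the normalized term frequency and the inverse document frequency."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)


if __name__ == "__main__":
    corpus = [["python", "sql", "python"], ["java", "sql"], ["sql", "docker"]]
    for term in ("python", "sql"):
        print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In this toy corpus "sql" occurs in every document, so its IDF is log(3/3) = 0 and it contributes nothing to the ranking, while "python" occurs in only one document and receives a positive weight.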

3. Cosine Similarity: A similarity measure is a metric that determines how alike two
objects are. Cosine similarity measures how similar two documents are
regardless of their size. It captures the orientation of the document vectors in an
N-dimensional space, where each dimension corresponds to a feature of the documents.
Mathematically, it is the cosine of the angle θ between the two vectors:

$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert}$$

Here, $\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$ is the dot product of the two vectors
and $\lVert \vec{a} \rVert$ is the Euclidean length of a vector. Using
this formula, the cosine similarity between all pairs of elements can be calculated.
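
A minimal sketch of the ranking step is shown below, assuming the job description and the resumes have already been converted to vectors of equal length (for example TF-IDF vectors). The function names, the candidate labels and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the vectors divided by the product of their Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_a * norm_b)


def rank_resumes(job_vector: list[float],
                 resume_vectors: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score every resume against the job description and sort by descending score."""
    scores = {name: cosine_similarity(job_vector, vector)
              for name, vector in resume_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    job = [1.0, 1.0, 0.0]
    resumes = {"A": [2.0, 2.0, 0.0], "B": [1.0, 0.0, 1.0], "C": [0.0, 0.0, 3.0]}
    for name, score in rank_resumes(job, resumes):
        print(name, round(score, 3))
```

Resume A points in the same direction as the job vector even though its values are twice as large, so it receives the maximum score; this is the sense in which cosine similarity is independent of document size.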

Figure 3.2: Architecture Diagram of the System


Chapter 4

APPLICATIONS AND ADVANTAGES

4.1 Applications:
An automated resume screening system using natural language processing and
similarity has several potential applications, including:

• Efficient recruitment process: An automated system can process a large volume of
resumes in a short period of time, enabling recruiters to identify qualified candidates faster
and with less effort. By leveraging natural language processing (NLP) techniques, the
system can analyze resumes and extract relevant information, such as work experience,
skills, and education.

• Candidate ranking: By using similarity and matching algorithms, the system can
compare the job requirements with the candidate's resume and assign a ranking score. The
system can use various metrics, such as keyword frequency, job title matching, and
education level to rank the resumes and present them to the recruiter in order of relevance.

• Eliminating bias: An automated system can eliminate human biases that may influence
the selection process, such as gender, age, or race. By using objective criteria and
algorithms, the system can ensure a fair and unbiased selection process.

• Customization: The system can be customized to match the specific requirements of the
job posting or the organization's culture. By using machine learning techniques, the system
can learn from the recruiter's feedback and improve the selection process over time.

• Scalability: An automated system can scale up or down depending on the organization's
needs, enabling recruiters to handle large volumes of resumes during peak recruitment
periods or smaller volumes during off-seasons. By leveraging cloud-based infrastructure,
the system can be easily deployed and managed without significant upfront costs.


4.2 Advantages:
An automated resume screening system using natural language processing
and similarity has several advantages over traditional manual methods of resume screening.
It can improve the efficiency and accuracy of the screening process, saving time and
reducing the risk of human error. Here are some of the advantages of using such a system:

• Improved efficiency: An automated resume screening system can process resumes much
faster than a human could, allowing for a larger volume of resumes to be screened in a
shorter amount of time.

• Increased accuracy: By using natural language processing and similarity techniques, an
automated resume screening system can accurately identify the most relevant candidates
based on their skills and experience.

• Flexibility: Automated resume screening systems are highly adaptable and can be
customized to fit the specific needs of a company or organization. They can be programmed
to search for certain keywords or phrases that are important to a particular job or industry.

• Consistency: An automated system will consistently apply the same screening criteria to
each resume, ensuring a fair and objective screening process.

• Improved diversity: By removing potential biases that may exist in manual screening
processes, an automated system can help to improve diversity in the hiring process.

• Cost-effective: Automated resume screening systems can be cost-effective for
companies, as they can reduce the need for manual labor and potentially improve the quality
of hires.


Chapter 5

CONCLUSION

In conclusion, an automated resume screening system using natural
language processing and similarity techniques is a promising area of research that has the
potential to revolutionize the hiring process for companies and organizations. By leveraging
the power of machine learning and natural language processing, it is possible to develop
more accurate and efficient resume screening techniques that can identify the most qualified
candidates based on their skills and experience. However, there are still many challenges
that need to be addressed in this field, including the development of more sophisticated
algorithms that can handle large volumes of resumes, the integration of additional data
sources for a more comprehensive analysis, and the evaluation of performance under
realistic conditions. Despite these challenges, an automated resume screening system using
natural language processing and similarity techniques represents an exciting opportunity
for researchers and practitioners alike to push the boundaries of what is possible in the field
of talent acquisition.


BIBLIOGRAPHY

[1] Singh, A., Rose, C., Visweswariah, K., Chenthamarakshan, V. and Kambhatla, N.,
2010, October. PROSPECT: a system for screening candidates for recruitment. In
Proceedings of the 19th ACM international conference on Information and
knowledge management, 659-668.

[2] Salton, G., Wong, A. and Yang, C.S., 1975. A vector space model for automatic
indexing. Communications of the ACM, 18(11), 613-620.

[3] Kumaran, V.S. and Sankar, A., 2013. Towards an automated system for intelligent
screening of candidates for recruitment using ontology mapping (EXPERT).
International Journal of Metadata, Semantics and Ontologies, 8(1), 56-64.

[4] Jabri, S., Dahbi, A., Gadi, T. and Bassir, A., 2018, April. Ranking of text documents
using TF-IDF weighting and association rules mining. In 2018 4th International
Conference on Optimization and Applications (ICOA), 1-6. IEEE.

[5] Huang, A., 2008, April. Similarity measures for text document clustering. In
Proceedings of the Sixth New Zealand Computer Science Research Student Conference
(NZCSRSC2008), Christchurch, New Zealand, 4, 9-56.

[6] Faliagka, E., Ramantas, K., Tsakalidis, A. and Tzimas, G., 2012, May. Application
of machine learning algorithms to an online recruitment system. In Proc.
International Conference on Internet and Web Applications and Services.
