Technical Seminar Report
BELAGAVI, KARNATAKA.
Submitted by:
CERTIFICATE
This is to certify that the Technical Seminar report entitled "An Automated Resume Screening System using NLP and Similarity"
Chakrapani D S., B.E., M. Tech. Dr. Ravindra S., M.Tech., Ph. D. Dr. Sankhya N Nayak., M.Tech., Ph. D.
Asst. Professor, Dept. of CS&E Assoc. Prof., Dept. of CS&E Assoc. Prof., Dept. of CS&E
Signature of HOD
ACKNOWLEDGMENT
Credit for the successful completion of this technical seminar goes to everyone who
offered a constant source of knowledge, timely suggestions and guidance.
First of all, I wish to express my earnest thanks and affectionate respect to my
guide, Chakrapani D S, Assistant Professor, Department of Computer Science &
Engineering, who has been a constant motivator and source of inspiration.
I would like to thank our beloved Coordinators Dr. Ravindra S and Dr. Sankhya N
Nayak, Department of Computer Science & Engineering for their support.
I would like to thank our beloved Professor and Head Dr. K. M. Poornima,
Department of Computer Science & Engineering for allowing me to take up this technical
seminar.
I am deeply grateful to our respected Principal, Dr. K. Nagendra Prasad, for his
encouragement and for providing an excellent working environment in our college.
Finally, I thank the teaching and non-teaching staff, my classmates and all who have
helped directly or indirectly in the successful completion of this technical seminar, and
I thank JNNCE College for providing a platform to present my work.
SRIVATHSA ACHARYA R
4JN19CS105
Table of Contents
Abstract i
Acknowledgement ii
Table of Contents iii
List of Figures iv
Chapter 1: Introduction 1
Chapter 2: Technology 3
2.1: Natural Language Processing 3
2.2: Similarity Technology 4
Chapter 3: Methodology 5
3.1: Information Extraction 5
3.2: Content Based Candidate Recommendation 7
Chapter 4: Applications & Advantages 10
4.1: Applications 10
4.2: Advantages 11
Chapter 5: Conclusion 12
Bibliography 13
List of Figures
Figure 3.1: JSON output generated by the system after the completion of the information extraction process on a sample resume
AN AUTOMATED RESUME SCREENING SYSTEM USING NLP AND SIMILARITY
Chapter 1
INTRODUCTION
With the rapid increase in internet connectivity, there has been a change in the
recruitment process of all major companies. With the help of online job postings in various
job portals and websites, recruiters are able to attract a wide variety of people for their
openings. Though e-recruitment has provided convenience and savings for both the
recruiters and the applicants, some new challenges have arisen. Large companies and recruitment
agencies often receive thousands of resumes every day. The situation is further
aggravated by the higher mobility of workers and by periods of economic distress,
when many more people are looking for jobs. With fewer than 5% of applicants selected
for these limited openings, it is impractical for recruiters to manually go through each
and every resume.
Another problem faced by the organizations is that there is no one standard resume
format used by these applicants. People come from varied fields of profession and have
different backgrounds. Each one of them has had different types of education, has worked
on different projects and thus has a unique style of presenting his/her credentials in the
resume. Resumes are unstructured documents that come in various file formats (.pdf, .doc,
.docx, .jpg, .txt etc.) and their content is not written according to standard formats or
templates. This means reading resumes is not simple and thus recruiters spend a large
amount of time going through the resumes for selecting the right candidates.
Many job portals and external websites have emerged to reduce the difficulty of
handling unstructured, diverse resumes. They require candidates to manually enter all
the information from their resume into a structured online form, thereby creating
candidate metadata. The problem with this approach is that it demands redundant effort
from the candidates, who often fail to fill in complete information in these
templates.
In order to get better results for the resume shortlisting, it is necessary to investigate
more efficient approaches to candidate and job description matching. The proposed solution
will choose the best fitting candidates for a specific opportunity by relating the main
features of the applicants’ profile with the requirements defined in the job description. The
system works in two main phases:
In the first phase, all relevant candidate information like skills, work experience,
years of education, certifications, etc. is extracted from the unstructured text in the resumes.
The system uses Natural Language Processing to parse these relevant qualification details
and then creates a summarized version of each resume irrespective of the order of content
or the file format. With all the extraneous and irrelevant details removed, the evaluator
can quickly look at the summarized form and analyze the credentials of each candidate.
The second phase of the system involves ranking the resumes based on the
similarity of their content with the given job description. The documents are represented as
vectors using Vector Space Model and then similarity measures like cosine similarity are
used to measure which set of resumes are the best fitting for the particular job. In the end,
a ranked list of applicants is obtained.
Chapter 2
TECHNOLOGY
2.1 Natural Language Processing
Natural Language Processing (NLP) techniques can be broadly classified into two
categories: rule-based and statistical. Rule-based techniques use manually crafted rules
and grammars to analyze and process natural language text. Statistical techniques, on
the other hand, rely on machine learning algorithms and statistical models to learn
patterns and relationships from large amounts of data.
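As a concrete illustration of the rule-based flavour, a single hand-crafted rule can extract one feature such as years of experience. The regular expression below is a hypothetical example written for this sketch, not part of the proposed system:

```python
import re

# A hand-crafted rule for one feature: "N years of experience".
# The pattern is an illustrative assumption, not a complete grammar.
EXPERIENCE_RULE = re.compile(r"(\d+)\+?\s*years?\s+of\s+experience",
                             re.IGNORECASE)

def extract_experience(text: str) -> list[int]:
    """Return every 'N years of experience' figure the rule matches."""
    return [int(m) for m in EXPERIENCE_RULE.findall(text)]

print(extract_experience("Candidate with 5 years of experience in Java "
                         "and 2+ years of experience with AWS"))  # → [5, 2]
```

A statistical system would instead learn such patterns from labelled examples rather than having them written by hand.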
2.2 Similarity Technology
String similarity measures the similarity between two texts based on the overlap
between their character sequences. Semantic similarity, on the other hand, measures the
similarity between two texts based on their meaning and context; this is typically done
by representing the texts as vectors in a high-dimensional space and measuring the
distance or angle between them. Lexical similarity measures the similarity between two
texts based on the overlap between their words and phrases.
Some of the popular similarity measures used in NLP include cosine similarity,
Jaccard similarity, and edit distance. Cosine similarity measures the cosine of the angle
between two vectors and is commonly used to measure semantic similarity between texts.
Jaccard similarity measures the similarity between two sets and is commonly used to
measure lexical similarity between texts. Edit distance measures the minimum number of
operations (such as insertions, deletions, and substitutions) required to transform one text
into another.
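Two of these measures translate directly into short Python functions. The word-set version of Jaccard and the dynamic-programming edit distance below are standard textbook formulations, shown only for illustration:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity on word sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    turning a into b (Levenshtein distance), via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

print(jaccard_similarity("python developer", "senior python developer"))  # ≈ 0.667
print(edit_distance("kitten", "sitting"))  # → 3
```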
Chapter 3
METHODOLOGY
The methodology can be viewed as two phases, namely the information
extraction phase and the content-based candidate recommendation phase, each of which
involves several sub-techniques.
3.1 Information Extraction
The first phase of the proposed system performs information extraction using
Natural Language Processing. The information in resumes is not present in a structured
format; it contains noise, inconsistencies and irrelevant bits of data that are of no use to
the recruiters. The objective is to derive relevant keywords from the unstructured textual
data in the resume without any manual crawling effort. Using techniques like
Tokenization, Stemming, POS Tagging, Named Entity Recognition, etc., the system
obtains important job-related content (skills, experience, education, etc.) from the uploaded
candidate resumes. The result is a summarized version of each resume in a JSON format
which can be easily used for further processing tasks in the next phase of this resume
screening system.
1. Tokenization: The tokenization process identifies the terms or words that make up
a character sequence. This matters because it is through these words that we can derive
meaning from the original text. Tokenization divides large chunks of text into smaller
parts called tokens by removing or isolating characters such as whitespace and
punctuation. Tokenization can be performed in multiple ways, for example with the
Natural Language Toolkit (NLTK) or the spaCy library.
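As a minimal sketch, whitespace-and-punctuation tokenization can be approximated with one regular expression; real tokenizers such as NLTK's word_tokenize or spaCy's handle many more edge cases:

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into word tokens, isolating punctuation characters.
    A simplified stand-in for NLTK's or spaCy's tokenizers."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Skilled in Python, Java and SQL."))
# → ['Skilled', 'in', 'Python', ',', 'Java', 'and', 'SQL', '.']
```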
Figure 3.1: JSON output generated by the system after the completion of the information
extraction process on a sample resume.
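A toy sketch of how such a JSON summary might be produced is shown below. The skill and degree vocabularies here are hypothetical stand-ins for the NER models and taxonomies a real parser would rely on:

```python
import json
import re

# Hypothetical vocabularies; a real system would use trained NER models
# and a far larger taxonomy.
SKILLS = {"python", "java", "sql", "machine learning", "nlp"}
DEGREES = {"b.e.", "b.tech", "m.tech", "ph.d."}

def summarize_resume(text: str) -> str:
    """Reduce unstructured resume text to a small JSON summary."""
    lowered = text.lower()
    summary = {
        "skills": sorted(s for s in SKILLS if s in lowered),
        "education": sorted(d for d in DEGREES if d in lowered),
        "experience_years": [int(y) for y in
                             re.findall(r"(\d+)\s*years?", lowered)],
    }
    return json.dumps(summary, indent=2)

print(summarize_resume("B.E. graduate, 3 years in Python, SQL and NLP"))
```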
2. TF-IDF: TF-IDF stands for “Term Frequency – Inverse Document Frequency”. The
TF-IDF weight is often used in text mining and was originally developed for
information retrieval and document search. It is a numerical measure of how important
a term is with respect to a document in a collection or corpus. The importance increases
proportionally with the frequency of a word within the document but is offset by the
number of documents that contain the word. The TF-IDF value for a term in a document
is calculated by multiplying two metrics, as shown in the equation below.

TF-IDF(t, d) = TF(t, d) × IDF(t)
• Term Frequency: It measures how frequently a word occurs in a document. Since a
word may occur more often in a lengthy document than in a shorter one, this frequency
needs to be normalized. A normalized term frequency is calculated by dividing the
number of times a term appears in a document by the total number of terms in that
document, as shown in the equation below.

TF(t, d) = freq(t, d) / n

Here, freq(t, d) is the count of the instances of the term t in document d, TF(t, d) is the
resulting proportion, and n is the total number of terms in document d.
• Inverse Document Frequency: It measures how important a word is across all the
documents in the corpus; in other words, it indicates how rare or common a word is in
the corpus. It weighs down the terms that occur very often while scaling up the rare
terms. Terms that appear in most of the documents have an IDF value close to 0, while
rare terms have a high IDF, as shown in the equation below.

IDF(t) = log(N / count(t))
Here, N is the number of distinct documents in the corpus and count (t) is the number of
documents in the corpus in which the term t is present.
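Taken together, the TF and IDF equations above can be transcribed directly into Python. This is a pedagogical sketch; production systems usually rely on a library implementation such as scikit-learn's TfidfVectorizer:

```python
import math
from collections import Counter

def tf(term: str, doc: list[str]) -> float:
    """TF(t, d) = freq(t, d) / n, with n the total number of terms in d."""
    return Counter(doc)[term] / len(doc)

def idf(term: str, corpus: list[list[str]]) -> float:
    """IDF(t) = log(N / count(t)); assumes the term occurs in the corpus."""
    return math.log(len(corpus) / sum(term in doc for doc in corpus))

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF(t, d) = TF(t, d) * IDF(t)."""
    return tf(term, doc) * idf(term, corpus)

corpus = [["python", "sql", "python"], ["java", "sql"], ["python", "nlp"]]
print(round(tf_idf("python", corpus[0], corpus), 4))  # → 0.2703
```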
3. Cosine Similarity: It measures the similarity between two documents as the cosine
of the angle between their vectors a and b, as shown in the equation below.

cos(a, b) = (a · b) / (|a| |b|)

Here, a · b = Σ aᵢbᵢ = a₁b₁ + a₂b₂ + ... + aₙbₙ is the dot product of the two vectors, and
|a| and |b| are their magnitudes. Using this formula, the cosine similarity between all
pairs of documents can be calculated.
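The cosine formula translates into a few lines of Python. This sketch works on plain lists; in practice the vectors would be the TF-IDF vectors from the previous step:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = (a · b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

resume_vec = [1.0, 2.0, 0.0]
job_vec = [1.0, 1.0, 1.0]
print(round(cosine_similarity(resume_vec, job_vec), 4))  # → 0.7746
```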
Chapter 4
APPLICATIONS & ADVANTAGES
4.1 Applications:
An automated resume screening system using natural language processing and
similarity has several potential applications, including:
• Candidate ranking: By using similarity and matching algorithms, the system can
compare the job requirements with the candidate's resume and assign a ranking score. The
system can use various metrics, such as keyword frequency, job title matching, and
education level to rank the resumes and present them to the recruiter in order of relevance.
• Eliminating bias: An automated system can eliminate human biases that may influence
the selection process, such as gender, age, or race. By using objective criteria and
algorithms, the system can ensure a fair and unbiased selection process.
• Customization: The system can be customized to match the specific requirements of the
job posting or the organization's culture. By using machine learning techniques, the system
can learn from the recruiter's feedback and improve the selection process over time.
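The candidate-ranking application above can be sketched end to end using the TF-IDF and cosine measures from Chapter 3. The resumes here are toy token lists, and a real pipeline would first run the information extraction phase:

```python
import math
from collections import Counter

def tfidf_vector(doc, vocab, corpus):
    """TF-IDF vector of a tokenized document over a fixed vocabulary."""
    counts = Counter(doc)
    vec = []
    for term in vocab:
        df = sum(term in d for d in corpus)       # document frequency
        idf = math.log(len(corpus) / df) if df else 0.0
        vec.append((counts[term] / len(doc)) * idf)
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(resumes, job):
    """Sort candidate names by cosine similarity to the job description."""
    corpus = list(resumes.values()) + [job]
    vocab = sorted({t for d in corpus for t in d})
    job_vec = tfidf_vector(job, vocab, corpus)
    scores = {name: cosine(tfidf_vector(doc, vocab, corpus), job_vec)
              for name, doc in resumes.items()}
    return sorted(scores, key=scores.get, reverse=True)

resumes = {"alice": ["python", "nlp", "sql"],
           "bob": ["java", "spring", "sql"]}
job = ["python", "nlp"]
print(rank_candidates(resumes, job))  # → ['alice', 'bob']
```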
4.2 Advantages:
An automated resume screening system using natural language processing
and similarity has several advantages over traditional manual methods of resume screening.
It can improve the efficiency and accuracy of the screening process, saving time and
reducing the risk of human error. Here are some of the advantages of using such a system:
• Improved efficiency: An automated resume screening system can process resumes much
faster than a human could, allowing for a larger volume of resumes to be screened in a
shorter amount of time.
• Flexibility: Automated resume screening systems are highly adaptable and can be
customized to fit the specific needs of a company or organization. They can be programmed
to search for certain keywords or phrases that are important to a particular job or industry.
• Consistency: An automated system will consistently apply the same screening criteria to
each resume, ensuring a fair and objective screening process.
• Improved diversity: By removing potential biases that may exist in manual screening
processes, an automated system can help to improve diversity in the hiring process.
Chapter 5
CONCLUSION
Manually screening the thousands of unstructured resumes that recruiters receive
is impractical, and the proposed system automates this task in two phases. In the first
phase, Natural Language Processing techniques such as tokenization, POS tagging and
named entity recognition extract the relevant candidate details (skills, experience,
education, etc.) into a summarized JSON form, irrespective of the resume's file format
or layout. In the second phase, the summarized resumes and the job description are
represented as TF-IDF vectors in a Vector Space Model, and cosine similarity is used
to produce a ranked list of the best-fitting candidates. The result is a faster, more
consistent and less biased screening process than manual review.
BIBLIOGRAPHY
[1] Singh, A., Rose, C., Visweswariah, K., Chenthamarakshan, V. and Kambhatla, N.,
2010, October. PROSPECT: a system for screening candidates for recruitment. In
Proceedings of the 19th ACM international conference on Information and
knowledge management, 659-668.
[2] Salton, G., Wong, A. and Yang, C.S., 1975. A vector space model for automatic
indexing. Communications of the ACM, 18(11), 613-620.
[3] Kumaran, V.S. and Sankar, A., 2013. Towards an automated system for intelligent
screening of candidates for recruitment using ontology mapping (EXPERT).
International Journal of Metadata, Semantics and Ontologies, 8(1), 56-64.
[4] Jabri, S., Dahbi, A., Gadi, T. and Bassir, A., 2018, April. Ranking of text documents
using TF-IDF weighting and association rules mining. In 2018 4th International
Conference on Optimization and Applications (ICOA), 1-6. IEEE.
[5] Huang, A., 2008, April. Similarity measures for text document clustering. In
Proceedings of the Sixth New Zealand Computer Science Research Student
Conference (NZCSRSC 2008), Christchurch, New Zealand, 4, 9-56.
[6] Faliagka, E., Ramantas, K., Tsakalidis, A. and Tzimas, G., 2012, May. Application
of machine learning algorithms to an online recruitment system. In Proc.
International Conference on Internet and Web Applications and Services.