
Report 12

The document is a minor project report from Tribhuvan University's National College of Engineering on 'Automated Resume Screening Using Natural Language Processing.' It outlines the project's aim to improve recruitment efficiency by automating the resume screening process using AI algorithms, achieving an accuracy of 95.14%. The report includes acknowledgments, a certificate of approval, and a detailed abstract explaining the methodology and objectives of the project.

TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING
NATIONAL COLLEGE OF ENGINEERING

A
MINOR PROJECT REPORT
ON
“AUTOMATED RESUME SCREENING USING
NATURAL LANGUAGE PROCESSING”

SUBMITTED BY:
PRASAMSHA PANDAY (23219)
PRATIK PANDE (23221)
SABIN PYAKUREL (23228)
SUDIP GHIMIRE (23233)

SUBMITTED TO:
DEPARTMENT OF ELECTRONICS & COMPUTER
ENGINEERING

LALITPUR, NEPAL
MARCH, 2025
Certificate Of Approval
This is to certify that the work carried out by Mrs. Prasamsha Panday, Mr. Pratik
Pande, Mr. Sabin Pyakurel and Mr. Sudip Ghimire for the project entitled
“Automated Resume Screening Using Natural Language Processing”, for the award of
the degree of Bachelor of Computer Engineering of the Institute of Engineering, is
based upon authentic work. We have pleasure in forwarding their project.
The project was carried out under our supervision, and all the materials included,
as well as the software product, are the result of their yearlong authentic work effort.

Er. Umesh Kant Ghimire
(External Examiner)

Er. Suroj Burlakoti
Department of Electronics and
Computer Engineering
National College of Engineering
Talchhikhel, Lalitpur
(Project Coordinator)

Er. Suroj Burlakoti
Department of Electronics and
Computer Engineering
National College of Engineering
Talchhikhel, Lalitpur
(Head of Department/Project Supervisor)

COPYRIGHT ©
The author has agreed that the library of National College of Engineering may
make this report freely available for inspection. Moreover, the author has agreed
that permission for the extensive copying of this project report for scholarly
purposes may be granted by the supervisor who supervised the project work recorded
herein or, in his absence, the Head of the Department where the project was
conducted. It is understood that recognition will be given to the author of the report
and to the Department of National College Of Engineering in any use of the
material in this report. Copying or publishing or any other use of this material for
financial gain without the approval of the department and the author’s written
permission is strictly forbidden. Requests for permission to copy or to make any
use of the material in this report, in whole or in part, should be addressed to:

Head of the Department


Department of Electronics and Computer Engineering
National College Of Engineering

Acknowledgments
We would like to express our sincere gratitude to all those who have contributed
to the completion of this project.
First and foremost, we would like to thank our project supervisor and Head
of Department, Er. Suroj Burlakoti, for his invaluable guidance, continuous
support, valuable insights and encouragement throughout the course of this project.
We would like to sincerely extend our heartfelt gratitude to the students, teachers,
and staff of the Department of Electronics and Computer Engineering at National
College of Engineering for their unwavering support, which has been instrumental
in the progress of our project.
Lastly, we would like to express our deep appreciation to everyone from the BCT-
078 batch of National College of Engineering for their continuous encouragement
and valuable suggestions, which greatly contributed to the successful completion
of our project.

Prasamsha Panday NCE078BCT026


Pratik Pande NCE078BCT028
Sabin Pyakurel NCE078BCT036
Sudip Ghimire NCE078BCT041

MARCH, 2025

Abstract
Automated resume screening using Natural Language Processing (NLP) refers to
the use of AI-driven algorithms to analyze job applicants’ resumes in an automated
fashion. In today’s competitive job market, hiring has become a challenging and
time-consuming process, especially when it comes to reviewing a large number of
resumes. Traditional manual resume screening methods are inefficient and can
introduce unintentional bias.
This project, “Automated Resume Screening Using NLP”, aims to simplify and
speed up recruitment by providing a web-based system where job seekers can
upload resumes and employers can list job openings. The system applies Natural
Language Processing (NLP) to extract key details such as skills, work experience,
education, and job titles. A Random Forest classifier is used to categorize resumes
into relevant job fields, while cosine similarity ranks resumes based on how well
they match job descriptions. This lets recruiters quickly identify the most
suitable candidates. The system also promotes fairness by focusing only on
job-related information, eliminating biases in evaluation. With a classification
accuracy of 95.14%, the system proves to be a reliable tool for improving
efficiency, accuracy, and fairness in the hiring process.

Keywords: Automated Resume Screening, Natural Language Processing, Cosine
Similarity, Resume Ranking, Recruitment Automation, Random Forest Classifier

Contents

Certificate Of Approval
COPYRIGHT
Acknowledgements
Abstract
List of Figures
List of Abbreviations

1 Introduction
1.1 Background
1.2 Problem statements
1.3 Objectives
1.4 Scope

2 Literature Review

3 Related Theory

4 System Analysis
4.1 Requirement specification
4.1.1 Functional Requirements
4.1.2 Non-Functional Requirements

5 Methodology
5.1 System Design
5.2 Activity Diagram
5.3 Sequence Diagram
5.4 Data Collection
5.5 Data Preprocessing
5.6 Vector Generation Using Word Embeddings
5.7 Model Training: Random Forest Classifier
5.8 Model Evaluation
5.9 Similarity Calculation
5.10 Frontend and Backend Development
5.11 Tools Used

6 Results and Discussion
6.1 Results
6.2 Data collection
6.3 Data Preprocessing
6.4 Vector Generation using word Embedding
6.5 Analysis of Classification Report
6.6 Analysis of Confusion Matrix
6.7 Analysis of the Ranked Resume

7 Conclusion and Future Enhancements
7.1 Conclusion
7.2 Future Enhancements

References

A APPENDIX
A.1
A.2
A.3
List of Figures

5.1 System Flow Diagram
5.2 Activity Diagram of the System
5.3 Sequence Diagram of the System
5.4 Training of Model

6.1 In-use dataset for the system
6.2 Pre-processed dataset
6.3 Vector embeddings of each word
6.4 Classification Report of the Model
6.5 Classification Report Visualization
6.6 Confusion Matrix of the Model
6.7 Ranking Resumes

A.1 Front page
A.2 Login page
A.3 Job Seeker View
A.4 Job Application Access Page
A.5 Job Provider View

List of Abbreviations

AI Artificial Intelligence
BERT Bidirectional Encoder Representations from Transformers
CBOW Continuous Bag Of Words
CV Curriculum Vitae
FP False Positive
FN False Negative
HR Human Resources
LSTM Long Short-Term Memory
ML Machine Learning
NER Named Entity Recognition
NLP Natural Language Processing
PDF Portable Document Format
TN True Negative
TP True Positive
UI User Interface
VSM Vector Space Model
IT Information Technology
TF-IDF Term Frequency-Inverse Document Frequency

1. Introduction

1.1 Background
Finding and hiring qualified employees is a critical function within Human
Resources (HR), especially in large and ever-changing job markets. Every month,
millions of individuals enter the workforce, creating a high volume of applications
for each open position. Such a high volume can make it difficult to efficiently
identify the best candidates.
One of the main challenges HR departments face is time and efficiency. Resumes
come in a variety of formats, making it time-consuming and prone to errors to
manually screen and shortlist applicants. Effectively evaluating resumes also re-
quires a deep understanding of the specific skills and experience needed for the
role, which can be inconsistent within HR teams. This creates a situation where
qualified candidates might be overlooked due to inefficient screening processes,
while HR departments spend excessive time sifting through applications.

1.2 Problem statements


In large and dynamic job markets, HR departments face significant challenges in
efficiently and accurately identifying the best candidates from a vast pool of
applicants. Manual resume selection and shortlisting is time-consuming and
error-prone due to the diversity in resume formats and the varying levels of
understanding within HR teams regarding the specific skills and experience required
for different roles. This inefficiency creates a risk of overlooking qualified candidates
and of spending excessive time and resources on the screening process. An effective
solution is needed to streamline resume screening and improve its accuracy, so
that the most qualified candidates are identified promptly.

1.3 Objectives
The general objective of the project is to design an employee selection system using
a classification algorithm. The specific objectives are:

• To design and develop a web application to screen resumes effectively and
efficiently for the IT sector.

• To automate the application selection process.

1.4 Scope
This project aims to transform how recruitment works by automating the process
of screening and ranking resumes. It is expected to make a big difference in areas
like the IT sector, where candidates apply for many different job positions. By using
machine learning and natural language processing, it helps employers quickly find the
best candidates while saving time and effort. The system also promotes fairness by
giving all job seekers an equal chance and ensuring resumes match job descriptions
more accurately. Ultimately, it will improve communication between employers
and candidates, making the hiring process smoother and more effective.
The scope of this project is currently limited to processing resumes written in
English and following standard resume formats. Additionally, the screening process
mainly focuses on matching keywords and phrases that are most relevant to the job
descriptions provided by employers. While this approach is effective for structured
and clearly defined resumes, future improvements could make the system more
versatile by supporting different formats and multiple languages, making it even
more inclusive and user-friendly.

2. Literature Review
The volume of job applications has increased exponentially with the advent of
online job portals, necessitating the use of automated systems to manage and pri-
oritize candidate profiles. Resume ranking programs, leveraging advancements in
artificial intelligence and machine learning, offer promising solutions to streamline
this process. By employing algorithms capable of analyzing and evaluating re-
sumes based on predefined criteria, these programs aim to enhance the efficiency
and accuracy of candidate selection processes. This literature review explores the
evolution, methodologies, challenges, and advancements in resume ranking sys-
tems, providing insights into their effectiveness and potential impact on modern
recruitment practices.
Recent research demonstrates the effectiveness of machine learning and related
methodologies for ranking resumes through various innovative approaches. Chirag
Darwani [1] employed various Named Entity Recognition (NER) [2] approaches to
assess similarity between categorized resume data and job requirements.
Techniques included rule-based algorithms, regular expressions, and
Bidirectional-LSTM with Conditional Random Field algorithms. The spaCy module,
pre-trained on resume samples, identified entities like names, phone numbers, and
educational institutions. A content-based recommendation system utilized
vectorization, TF-IDF, and cosine similarity measures to rank resumes based on their
fit for job requirements. Vectorization transformed text into the numerical vectors
required by machine learning models. TF-IDF scores reflected term importance,
while cosine similarity computed similarity between resume and job query vectors.
The system used the Vector Space Model (VSM) to represent resumes and job
descriptions, facilitating similarity calculations and candidate ranking. Performance
testing with Software Developer Engineer resumes validated candidate rankings
based on cosine similarity scores.
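
The TF-IDF and cosine similarity ranking described above can be sketched in a few
lines of Python. This is a minimal illustration and not the implementation used in
[1]: the whitespace tokenizer, the unsmoothed weighting tf × log(N / df), and the
function names tfidf_vectors and rank_resumes are all assumptions made for the
example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_resumes(job_description, resumes):
    """Return (resume index, score) pairs, best match first."""
    vectors = tfidf_vectors([job_description] + resumes)
    job_vec, resume_vecs = vectors[0], vectors[1:]
    scores = [cosine_similarity(job_vec, vec) for vec in resume_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy data invented for the illustration.
    job = "python developer with django experience"
    resumes = [
        "python developer django rest api",
        "graphic designer photoshop illustrator",
        "java developer spring experience",
    ]
    for index, score in rank_resumes(job, resumes):
        print(index, round(score, 3))
```

A resume that shares no terms with the job description receives a score of 0, and
terms that occur in every document receive zero weight, which is why library
implementations usually add smoothing to the IDF term.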
Dan Jurafsky and James H. Martin [3] explore a variety of topics related to natural
language processing. They discuss text preprocessing methods such as tokenization
and stemming, as well as n-gram language models and vector embeddings, and
offer a thorough introduction to neural networks. The book also dives into text
classification, covering multi-class classifiers like Naive Bayes, along with more
advanced techniques like sequence labeling and machine translation. It offers
in-depth insights into both foundational and contemporary methods utilized in NLP.
Leo Breiman [4] compared the performance of two multiclass classifiers (Random
Forest Classifier and Naive Bayes Classifier) and found that the Random Forest
Classifier performed better on a test dataset with non-linear relationships among
the features, with an error rate of 2% against 6.2% for the Naive Bayes Classifier.
However, this did not hold for another test dataset in which the features were
largely independent. In that case, the Naive Bayes Classifier achieved a lower
error rate, 1%, than the Random Forest Classifier.
Natalia Vanetik [5] used text preprocessing to remove numbers and convert text
to lowercase, followed by BERT-based extractive summarization for job vacancies
using bert-base-uncased. Summaries were limited to 10 sentences, a length determined
by the elbow method. Text representation involved converting resumes and
vacancies into numeric vectors. The cosine similarity computed between these vectors
gave the match scores, which were sorted to produce the final ranking.
Pradeep Kumar Roy [6] created a system to minimize the cost of hiring new
candidates for job positions in a company. The work focused on three major
problems in this process:

• Picking the right candidates from the applicants

• Making sense of their CVs

• Finding out if the candidate is fit for the job role

They applied various NLP techniques for text preprocessing and TF-IDF for
vectorization, and performed classification with several machine learning
algorithms: Random Forest with 38.9 percent accuracy, Multinomial Naïve Bayes
with 44.39 percent, Logistic Regression with 62.4 percent, and Linear Support
Vector Machine Classifier, which obtained the highest accuracy at 78.53 percent.
Dr. Sandeep Tayal [7] and his team explored the utilization of Machine Learning
(ML) and Natural Language Processing (NLP) in automating the resume screening
process. They employed various NLP techniques like named entity recognition and
part-of-speech tagging, coupled with ML classifiers such as K-Nearest Neighbors
and Support Vector Machines, and proposed a system that enhances the precision
of candidate selection while significantly reducing time and effort.
Chengguang Gan, Qinghao Zhang and Tatsunori Mori [8] presented a new
approach using Large Language Models (LLMs) to make resume screening faster
and more efficient. The system was able to summarize and score resumes from
a large dataset while also using LLM agents to assist with decision-making. To
test the system, real resumes were collected and a simulated screening process
was conducted. The results showed that the method is 11 times faster than
manual screening. Additionally, by fine-tuning the LLMs, the F1 score improved to
87.73% during the resume classification step.
Asmita Deshmukh and Anjali Raut [9] utilized NLP techniques and a cosine
distance matrix for efficient screening. The process
involved pre-processing, embedding generation with S-BERT, cosine similarity
calculation, and ranking based on scores. In the evaluation on a dataset of 223
resumes, the system demonstrated a screening speed of 0.233 seconds per resume
and an accuracy of 90%, effectively identifying relevant resumes. This automated
screening system significantly reduced manual workload and enhanced accuracy,
streamlining hiring processes and making them more efficient and accessible.
Vidita Jagwani et al. [10] proposed a method for resume rating using Latent
Dirichlet Allocation (LDA) and entity detection with spaCy. The method first
extracts key entities like education, experience, and skills from resumes using
spaCy’s Named Entity Recognition (NER). Then, the LDA model assigns topic
probabilities to these entities to rate the resume. The paper also includes a detailed
analysis of entity detection with spaCy’s NER and reports the evaluation metrics.
Using LDA, the system breaks down resumes into latent topics and extracts
meaningful semantic representations. Focusing on a content-driven approach rather
than just structure or keyword matching, the model achieved 77% accuracy when
considering only skills and an overall 82% accuracy when all attributes (such as
college name, work experience, degree, and skills) were considered.
Shradha Pujari [11] presented a resume screening system that uses NLP and Python.
The system has two main parts: a data pre-processing component and a machine
learning component. The data pre-processing component cleans and prepares the
resume data for machine learning. The machine learning component trains a model
to predict the job category for each resume. The system was tested on a real-world
dataset of resumes. The results showed that the system achieved 99% accuracy
on the test set, indicating that NLP can be effectively used to screen resumes for
job openings.
Building on these works, we integrated Word2Vec embeddings [12] with a Random
Forest Classifier, aiming to improve resume parsing capabilities. Word2Vec
embeddings were employed to capture semantic relationships between words,
enriching the system’s understanding of resume content. The Random Forest
Classifier utilized these embeddings to classify resumes based on extracted features
such as skills, experiences, and project details. Additionally, emphasis was placed on
user interface (UI) design to provide an intuitive and efficient experience for
recruiters and hiring managers. This integration aimed not only to improve accuracy
in resume parsing but also to enhance usability through a well-designed
interface, addressing both technical and user-centric aspects of the project.

3. Related Theory
a) Random Forest Classifier
A Random Forest Classifier is an ensemble learning method used for classification
tasks. It builds multiple decision trees during training and outputs the class chosen
by the majority vote of the individual trees. Each tree is trained independently on
a random subset of the data and features, and the predictions of all trees are
aggregated to improve accuracy and reduce overfitting. The method is robust,
handles both numerical and categorical data, and is less prone to overfitting than
a single decision tree. It is widely used for classification, regression, and feature
selection tasks.
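
The bootstrap-and-vote idea can be illustrated with a small pure-Python sketch.
It is a heavily simplified stand-in for a real Random Forest, not the model trained
in this project: each "tree" here is a depth-1 decision stump fitted on one randomly
chosen feature, whereas a full implementation grows deeper trees and samples a
feature subset at every split. The names train_stump, train_forest and predict are
hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Fit a depth-1 tree on one feature: pick the threshold with the fewest errors."""
    majority = Counter(labels).most_common(1)[0][0]
    # Fallback: no split at all, every row is predicted as the majority class.
    best = (float("inf"), majority, majority)
    best_errors = sum(1 for y in labels if y != majority)
    values = sorted(set(row[feature] for row in rows))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(1 for y in left if y != left_label)
        errors += sum(1 for y in right if y != right_label)
        if errors < best_errors:
            best_errors, best = errors, (threshold, left_label, right_label)
    return (feature,) + best


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) indices with replacement.
        indices = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump(
            [rows[i] for i in indices], [labels[i] for i in indices], feature))
    return forest


def predict(forest, row):
    """Return the class that receives the most votes across all trees."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy two-feature data invented for the illustration.
    rows = [[1, 2], [2, 1], [1.5, 2.5], [2.5, 1.5], [3, 3], [2, 2],
            [7, 8], [8, 7], [9, 9], [7.5, 8.5], [8.5, 7.5], [8, 8]]
    labels = ["A"] * 6 + ["B"] * 6
    forest = train_forest(rows, labels)
    print(predict(forest, [0, 0]), predict(forest, [10, 10]))
```

Because every tree sees a different bootstrap sample and feature, a single poorly
fitted tree is outvoted by the rest, which is the source of the robustness described
above.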
b) Cosine Similarity
In data analysis, cosine similarity is a measure of similarity between two non-zero
vectors defined in an inner product space. Cosine similarity is the cosine of the
angle between the vectors; that is, it is the dot product of the vectors divided by
the product of their lengths. It follows that the cosine similarity does not depend
on the magnitudes of the vectors, but only on their angle. The cosine similarity
always belongs to the interval [-1,1].
For example, two proportional vectors have a cosine similarity of 1, two orthogonal
vectors have a similarity of 0, and two opposite vectors have a similarity of -1. In
some contexts, the component values of the vectors cannot be negative, in which
case the cosine similarity is bounded in [0,1].
For example, in information retrieval and text mining, each word is assigned a
different coordinate and a document is represented by the vector of the numbers
of occurrences of each word in the document. Cosine similarity then gives a useful
measure of how similar two documents are likely to be, in terms of their subject
matter, and independently of the length of the documents.
The technique is also used to measure cohesion within clusters in the field of data
mining.
One advantage of cosine similarity is its low complexity, especially for sparse vec-
tors: only the non-zero coordinates need to be considered.
Other names for cosine similarity include Orchini similarity and Tucker coefficient
of congruence; the Otsuka–Ochiai similarity is cosine similarity applied to binary
data.
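
In symbols, for two non-zero vectors A and B the definition above reads
cos(θ) = (A · B) / (‖A‖ ‖B‖). The short Python sketch below checks the stated
properties on small vectors and on word-count vectors of two toy documents. The
helper count_vector and the example vocabulary are assumptions introduced only
for this illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def count_vector(text, vocabulary):
    """Represent a document by the number of occurrences of each vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    print(cosine_similarity([1, 2], [2, 4]))    # proportional vectors
    print(cosine_similarity([1, 0], [0, 1]))    # orthogonal vectors
    print(cosine_similarity([1, 2], [-1, -2]))  # opposite vectors

    # Document length does not matter: repeating a text leaves the angle unchanged.
    vocabulary = ["python", "django", "java"]
    short_doc = count_vector("python django", vocabulary)
    long_doc = count_vector("python django python django", vocabulary)
    print(cosine_similarity(short_doc, long_doc))
```

Since word counts are never negative, similarities between such document vectors
always fall in [0, 1].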
c) Machine Learning
Machine learning is a field of artificial intelligence (AI) that allows computers to
learn without being explicitly programmed. Machine learning algorithms use data
to learn how to perform tasks such as classification, prediction, and clustering.
ML algorithms are mathematical models that use datasets in the form of text,
audio, images and videos to help the machine learn, improving its performance
with each iteration.
ML algorithms can be used to perform a variety of tasks like:
1. Classification: This is the task of assigning a label to an input. For example, a
machine learning algorithm could be used to classify images as either cats or dogs.
2. Prediction: This is the task of predicting a future value based on past data.
For example, a machine learning algorithm could be used to predict the weather
or the stock market.
3. Clustering: This is the task of grouping similar data together. For example, a
machine learning algorithm could be used to group customers together based on
their buying behavior.
Some of the exciting applications of ML technology are fraud detection, spam fil-
tering, medical diagnosis, self-driving cars, recommendation systems, etc.
d) Word2Vec
Word2Vec is a widely used technique in natural language processing (NLP) for
learning vector representations of words. Developed by Tomas Mikolov and his
team at Google in 2013, Word2Vec maps words to dense vectors in a continuous
vector space, enabling the model to capture semantic relationships and
contextual meanings of words.
The core idea behind Word2Vec is to map each word to a vector in a high-
dimensional space where semantically similar words are positioned close to each
other. This mapping allows the model to capture meaningful relationships be-
tween words based on their usage in large text corpora. For example, words with
similar meanings or functions, such as “king” and “queen”, will have vector rep-
resentations that are close in this space.
Word2Vec employs two primary model architectures to generate these word
vectors: the Continuous Bag of Words (CBOW) model and the Skip-Gram model.
The CBOW model predicts a target word based on its surrounding context words.
For instance, given the context words “the”, “cat”, and “on”, CBOW might
predict the target word “mat”. On the other hand, the Skip-Gram model works in
reverse; it uses a target word to predict the surrounding context words. For
example, given the target word “cat”, Skip-Gram would attempt to predict context
words like “the”, “on”, and “mat”.
During training, Word2Vec uses a neural network to adjust the word vectors such
that the probability of predicting the correct context words (or target words) is
maximized. This process ensures that the learned vectors reflect meaningful rela-
tionships and similarities between words, making them useful for a variety of NLP
tasks, such as text classification, sentiment analysis, and machine translation. By
representing words in a continuous vector space, Word2Vec provides a powerful
tool for understanding and processing natural language.
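
The difference between the two architectures is easiest to see in the training
examples each one is given. The sketch below only generates those examples; it does
not train any vectors. The sentence, the window size of 2 and the function names
cbow_pairs and skipgram_pairs are assumptions chosen for illustration.

```python
def cbow_pairs(tokens, window=2):
    """CBOW examples: (list of context words, target word)."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of position i, excluding the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Skip-Gram examples: one (target word, context word) pair per context word."""
    return [
        (target, context_word)
        for context, target in cbow_pairs(tokens, window)
        for context_word in context
    ]


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    print(cbow_pairs(tokens)[2])
    print([pair for pair in skipgram_pairs(tokens) if pair[0] == "sat"])
```

For the word “sat”, CBOW receives the whole context at once and must predict
“sat”, while Skip-Gram receives “sat” and is asked to predict each context word
separately.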
e) Natural Language Processing (NLP)

• Stemming:
Stemming is a process that reduces words to a root form by stripping
suffixes. This technique helps in standardizing words and reducing
dimensionality in text analysis. For example, the words “running” and
“runs” are both reduced to “run”. Stemming algorithms, such as the
Porter Stemmer and Snowball Stemmer, apply heuristic rules to remove
common suffixes, so they do not always produce actual words. For instance,
“fishing” and “fished” are both stemmed to “fish”, but the Porter Stemmer
reduces “happily” to “happili”.
Example: “running” → “run”
“happily” → “happili”

• Lemmatization:
Lemmatization is a more sophisticated approach than stemming, focusing
on reducing words to their base or dictionary form, called a lemma. Unlike
stemming, lemmatization considers the context and the part of speech to
ensure the resulting lemma is a valid word. For example, “better” is
lemmatized to “good”, and “running” is lemmatized to “run”. Lemmatization
often uses lexical databases like WordNet for accuracy.
Example: “running” → “run”
“better” → “good”

• Tokenization:
Tokenization involves breaking down text into smaller units, such as words,
phrases, or sentences. This process is essential for many NLP tasks as it
simplifies the text into manageable pieces. Tokenization can be word-level
(breaking text into words) or sentence-level (breaking text into sentences).
Example: “Hello world!” → [“Hello”, “world!”]

• Stop Word Removal:
Stop word removal involves filtering out common words that are deemed to
have little significance for text analysis, such as “the”, “is”, “in”, etc.
Removing stop words helps in focusing on the more meaningful words in the text.
Example: “The quick brown fox” → [“quick”, “brown”, “fox”]
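
The tokenization, stop word removal and stemming steps above can be sketched in
plain Python as follows. This is a toy illustration under stated assumptions: the
stop word list is a tiny hand-picked set, and naive_stem is a hypothetical
suffix-stripping helper, far cruder than the Porter or Snowball stemmers.
Lemmatization is left out because it needs a lexical database such as WordNet.

```python
# Tiny illustrative stop word list; real lists contain far more entries.
STOP_WORDS = {"the", "is", "in", "a", "an", "and", "of", "to"}


def tokenize(text):
    """Word-level tokenization by splitting on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop word list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


def naive_stem(word):
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Undo consonant doubling such as "runn" -> "run" (but keep "ll", "ss", "zz").
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


if __name__ == "__main__":
    print(tokenize("Hello world!"))
    print(remove_stop_words(tokenize("The quick brown fox")))
    print([naive_stem(word) for word in ["running", "runs", "fishing", "fished"]])
```

Whitespace splitting keeps punctuation attached to words (“world!”), which is why
practical tokenizers also separate punctuation marks into their own tokens.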

4. System Analysis

4.1 Requirement specification

4.1.1 Functional Requirements

a) Resume Ingestion:
- The system must allow users to upload resumes in various formats (PDF, DOCX).
- The system should extract relevant textual information from the uploaded re-
sumes.
b) Job Description Ingestion:
- The system must allow users to input or upload job descriptions.
- The system should extract relevant textual information from the job descrip-
tions.
c) Feature Extraction:
- The system must convert resumes and job descriptions into numerical vectors
using word embeddings.
d) Similarity Calculation:
- The system must calculate the cosine similarity between the job description and
each resume.
e) Classification with Random Forest Classifier:
- The system must classify the ranked resumes into respective job categories such
as “Software Developer”, “Robotics Engineer” and so on.
f) User Interface:
- The system must provide a user-friendly interface for users to upload resumes
and job descriptions.
- The system must display the ranked and classified resumes with relevant details.
g) Report Generation:
- The system should generate reports summarizing the ranking and classification
results.

4.1.2 Non-Functional Requirements

a) Performance:
- The system should process and rank resumes within a reasonable time frame
(e.g., within a few minutes for a batch of 100 resumes).
b) Scalability:
- The system should be able to handle large volumes of resumes and job descrip-
tions without significant degradation in performance.
c) Accuracy:
- The system should maintain high accuracy in both ranking and classification,
with precision and recall metrics above a specified threshold.
d) Usability:
- The user interface should be intuitive and easy to use for individuals with basic
computer skills.
e) Security:
- The system should ensure that all uploaded resumes and job descriptions are
securely stored and processed, maintaining data privacy.
f) Compatibility:
- The system should be compatible with various browsers and devices.

5. Methodology

5.1 System Design

Figure 5.1: System Flow Diagram

This flowchart represents a job-matching system. Job seekers upload resumes, and
job providers post job descriptions through a user interface. The data is stored
in respective databases and processed using text preprocessing and Word2Vec
embeddings. Cosine similarity ranks resumes based on relevance, while a Random
Forest model categorizes job postings. The system helps match candidates to
suitable jobs efficiently.

5.2 Activity Diagram

Figure 5.2: Activity Diagram of the System

This activity diagram represents a job processing system. It starts with user
interaction, where job providers post and manage jobs, while job seekers view jobs
and upload resumes. The system stores job descriptions and resumes in respective
databases. These are processed through text preprocessing, followed by ranking
resumes and matching job categories. The process ensures efficient job matching
between job providers and job seekers.

5.3 Sequence Diagram

Figure 5.3: Sequence Diagram of the System

This sequence diagram represents the job processing system’s workflow. Users log
in, select a job, or upload a resume through the frontend (Django). The backend
stores the data and sends it for processing. Machine learning (Jupyter) applies
text preprocessing, converts text using Word2Vec/FastText, and classifies resumes
using a Random Forest classifier. The processed data is stored, and job matches
with scores are sent back to the frontend. Finally, users see the matching job
results.

The methodology is divided into several key stages: data collection, data
preprocessing, feature extraction, model training, and evaluation.

5.4 Data Collection


The first step involved creating a custom dataset for the resume screening system.
Resumes for different roles were collected to form a diverse dataset. A total of
720 resumes belonging to 8 job categories, namely Data Scientist, Cloud Engineer,
DevOps Engineer, Software Developer, Machine Learning Engineer, Robotics
Engineer, Cybersecurity Specialist, and Graphics Designer, were tabulated to form
the dataset.

5.5 Data Preprocessing


After creating the dataset, the next step was to preprocess the textual data to
make it suitable for analysis. This step included the following techniques:

• Tokenization: The text was tokenized into individual words or phrases to
break the content into discrete units that could be analyzed further.

• Stopword Removal: Common words such as “the”, “and”, and “is”, which do
not contribute meaningful information, were removed.

• Lemmatization: Words were reduced to their base or root form (e.g., “run-
ning” to “run”) to ensure consistency and improve analysis.

These preprocessing steps helped clean the data and reduce noise, preparing the
resumes for feature extraction.
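The three preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual code: the `STOPWORDS` set, the `LEMMAS` lookup table, and the `preprocess` function are hypothetical stand-ins for the full stopword list and lemmatizer that an NLP library would normally provide.

```python
import re

# Hypothetical, deliberately tiny resources (assumptions for illustration only);
# a real pipeline would load a full stopword list and a proper lemmatizer.
STOPWORDS = {"the", "and", "is", "a", "an", "in", "of", "to", "with"}
LEMMAS = {"running": "run", "models": "model", "built": "build"}


def preprocess(text: str) -> list[str]:
    """Tokenize, drop stopwords, and map each token to its base form."""
    # Tokenization: lowercase the text and split it into word-like units.
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    # Stopword removal: discard common words that carry little meaning.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Lemmatization: a simple dictionary lookup stands in for a real lemmatizer.
    return [LEMMAS.get(t, t) for t in tokens]


print(preprocess("Built and is running the ML models in Python"))
# ['build', 'run', 'ml', 'model', 'python']
```

The output of such a function is a cleaned token list per resume, which is the form the embedding step expects.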

5.6 Vector Generation Using Word Embeddings


To convert the textual content of the resumes and the job description into
numerical features that could be used by the Random Forest classifier, word
embeddings were applied. A pre-trained Word2Vec model was used to represent each
word in the resume as a vector. These embeddings were chosen because they
capture semantic meaning and relationships between words, helping the model
understand context and job-related terms in resumes.
This process converted each resume into a numerical vector that captured the
essential information and context from the text, which was then used as input for
the machine learning model.
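The report does not state how the per-word vectors are combined into a single resume vector. One common choice, assumed here purely for illustration, is to average the vectors of all words that appear in the embedding vocabulary. The `EMBEDDINGS` table below is a hypothetical three-dimensional toy; pre-trained Word2Vec vectors are much larger.

```python
# Hypothetical toy embeddings (assumption): real Word2Vec vectors have far
# more dimensions and come from a pre-trained model, not a hand-written dict.
EMBEDDINGS = {
    "python": [0.9, 0.1, 0.0],
    "docker": [0.1, 0.8, 0.1],
    "model": [0.7, 0.2, 0.1],
}


def document_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all word vectors.
    return [sum(component) / len(known) for component in zip(*known)]


vec = document_vector(["python", "model", "unknownword"], EMBEDDINGS)
print([round(v, 2) for v in vec])
```

Out-of-vocabulary tokens are simply skipped in this sketch, so a resume made entirely of unknown words maps to the zero vector.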

5.7 Model Training: Random Forest Classifier

Figure 5.4: Training of Model

The next step was to train a machine learning model using the processed and
vectorized data. A Random Forest classifier was used as the classification model
to predict the job category of each resume. During model training, the following
steps were carried out:

• Data Splitting: The dataset was divided into training and validation sets
using an 80-20 split. The training set (80 percent) was used to train the
model, while the validation set (20 percent) was kept aside to evaluate the
model’s performance on unseen data.

• Model Training: The Random Forest classifier was trained on the feature
vectors of the resumes, with each vector labeled according to its correspond-
ing job category.
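The 80-20 split described above can be sketched in plain Python as follows. The `split_dataset` helper, the seed value, and the placeholder resume tuples are assumptions made for illustration; in practice a library routine would normally perform this step.

```python
import random


def split_dataset(samples, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the samples and split off a validation subset."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder data (assumption): 720 (resume_id, category_index) pairs, 8 categories.
resumes = [(f"resume_{i}", i % 8) for i in range(720)]
train, val = split_dataset(resumes)
print(len(train), len(val))  # 576 144
```

With 720 resumes this yields 576 training samples and 144 validation samples. A plain shuffle does not guarantee that every category is equally represented in both subsets; a stratified split would be needed for that.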

5.8 Model Evaluation


After training the Random Forest classifier, the model’s performance was
evaluated using the validation set. During the process, we measured the following
metrics:
a) Confusion Matrix
The confusion matrix is a table used in machine learning to evaluate the
performance of a classification model. It compares the predicted classes of the
model with the actual classes in the dataset. The matrix has rows and columns
representing the actual and predicted classes respectively, and contains four main
components:
i) TP (True Positive) is the number of instances correctly predicted as positive.
ii) FP (False Positive) is the number of instances incorrectly predicted as positive.
iii) TN (True Negative) is the number of instances correctly predicted as negative.
iv) FN (False Negative) is the number of instances incorrectly predicted as negative.
For a multi-class problem, these four counts are computed per class, treating that
class as positive and all other classes as negative.
b) Accuracy
Accuracy is a performance metric used in classification tasks to measure the
overall correctness of the model’s predictions. It represents the proportion of
correctly classified instances out of the total number of instances in the dataset.
Mathematically, accuracy is calculated using this formula:

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5.1}
\]

c) Recall
Recall, also known as sensitivity or true positive rate, is a performance metric
used in binary classification tasks. It measures the proportion of actual positive
instances that are correctly identified by the model.
Mathematically, recall is calculated using the formula:

\[
\text{Recall} = \frac{TP}{TP + FN} \tag{5.2}
\]

d) Precision
Precision is a performance metric used in binary classification tasks that
measures the proportion of correctly predicted positive instances out of all
instances predicted as positive by the model.

\[
\text{Precision} = \frac{TP}{TP + FP} \tag{5.3}
\]
e) F1-score
The F1 score is a performance metric commonly used in binary classification
tasks, which considers both precision and recall to provide a balanced measure of a
model’s performance. It is the harmonic mean of precision and recall, emphasizing
the balance between the two metrics.

\[
\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5.4}
\]
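
The formulas (5.1) to (5.4) are stated for the binary case, whereas the classifier here has eight classes. A usual way to apply them, assumed in this sketch, is one-vs-rest: for each class, that class is treated as positive and all others as negative. The 3-class matrix `M` below is hypothetical example data, not the project's results, and the function names are illustrative.

```python
def per_class_metrics(matrix, k):
    """Metrics for class k, where matrix[i][j] counts actual class i predicted as j."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # actual k, predicted as something else
    fp = sum(row[k] for row in matrix) - tp  # predicted k, actually something else
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "precision": precision, "recall": recall, "f1": f1}


def accuracy(matrix):
    """Overall accuracy: correctly classified samples (the diagonal) over all samples."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
M = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
print(per_class_metrics(M, 0))
print(accuracy(M))
```

For class 0 of this toy matrix, TP = 8, FN = 2, FP = 1 and TN = 19, which gives a recall of 8/10 and a precision of 8/9.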

5.9 Similarity Calculation


After transforming the job description and each resume within the job category
into vectors, every resume vector is compared with the corresponding job descrip-
tion vector to compute a match score using Cosine Similarity. Subsequently, the
resumes are ranked based on these match scores.
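
The match-score computation and ranking described above can be sketched as follows. The vectors and resume names are hypothetical toy values, and `cosine_similarity` and `rank_resumes` are illustrative helper names rather than functions from the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_resumes(job_vector, resume_vectors):
    """Return (name, score) pairs sorted from best to worst match."""
    scores = {name: cosine_similarity(job_vector, vec)
              for name, vec in resume_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical 3-dimensional vectors standing in for real embedding vectors.
job = [1.0, 0.0, 1.0]
candidates = {
    "resume_a": [1.0, 0.0, 1.0],
    "resume_b": [0.0, 1.0, 0.0],
    "resume_c": [1.0, 1.0, 0.0],
}
for name, score in rank_resumes(job, candidates):
    print(name, round(score, 2))
```

Because cosine similarity depends only on the direction of the vectors, a long resume and a short one with the same vocabulary mix receive the same score.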

5.10 Frontend and Backend Development


To facilitate the resume screening process, we created a frontend interface using
HTML and CSS, where users can upload resumes and employers can publish job
vacancies along with job descriptions. The backend of the system was built using
Django, which handled data processing, storage, and communication between the
frontend and the machine learning model.

5.11 Tools Used


• HTML & CSS: HTML and CSS have been used for designing the front-end
display of the website, ensuring a user-friendly interface for interacting with
the resume screening system.

• Django: Django, a Python framework, has been used for backend development,
handling the server-side logic, database interactions, and integration
of the NLP model to process resumes.

• Jupyter Notebook: Jupyter Notebook has been used for data preprocessing,
including cleaning and preparing the resumes for model training by
performing tasks like tokenization, stop word removal, and lemmatization.

• TensorFlow: TensorFlow has been utilized for training and evaluating the
NLP model, which has helped in automatically analyzing and scoring
resumes based on the job requirements.

6. Results and Discussion

6.1 Results
The resume screening system developed in this project was deployed and tested
to check its efficiency and effectiveness. The project succeeded in ranking
resumes automatically, helping job recruiters reduce the tiring work of evaluating
resumes manually. The Random Forest classification model used in the application
was, for the most part, able to classify the resumes correctly into their
respective job categories. In addition, the system provided a platform for job
recruiters to post job vacancies and for job seekers to apply for those vacancies.
The following are the results obtained after performing the various steps involved
in developing the system:

6.2 Data Collection


To train and test the Random Forest classifier, a dataset of 720 resumes was
collected, distributed approximately equally across the eight job categories.

Figure 6.1: In-use dataset for the system

6.3 Data Preprocessing
In the next step, the dataset was preprocessed. During this process, tokenization,
stemming, stop word removal, and lemmatization were performed, resulting in
the following outcome:

Figure 6.2: Pre-processed dataset

6.4 Vector Generation Using Word Embeddings


The pre-trained Word2Vec model was utilized to represent each word as a vector
embedding, as shown below:

Figure 6.3: Vector embeddings of each word

6.5 Analysis of Classification Report


The dataset was split into training (80%) and testing (20%) sets using
train_test_split. For the classification report, the testing set was used: the
model predicted the job category of each resume in the test dataset, and these
predictions were compared with the actual labels. Based on these predictions,
metrics such as accuracy, precision, recall, and F1-score were calculated, as
shown below:

Figure 6.4: Classification Report of the Model

Figure 6.5: Classification Report Visualization

The majority of the classes have a precision of 1 or close to 1, which indicates
that the model assigns resumes to the correct job categories. The exception is
Data Scientist, which had the lowest precision of 0.875. This was due to the
overlap in skills with Cloud Engineer, particularly in cloud computing, and with
DevOps Engineer in machine-learning-related operations. Similarly, recall was 1
or close to 1 for the majority of classes, meaning the model identified most of
the true instances of each class in the test dataset. However, Robotics Engineer
and Cloud Engineer showed lower recall (0.894 and 0.9, respectively): a few actual
Robotics Engineer resumes were classified as Software Developer and Machine
Learning Engineer, whereas some actual Cloud Engineer resumes were classified as
Cybersecurity Specialist and Data Scientist. The F1-score of every class lies
between 0.9 and 0.97, which indicates that the model performs well across all
classes, with a good balance between precision and recall. Overall, the model
achieved an accuracy of 95.14%, meaning it predicts the actual class most of the
time.
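
As a rough consistency check on the figures reported above, the short calculation below works out what a 95.14% accuracy implies for a 20% test split of 720 resumes. The count of correct predictions is inferred from the reported percentage and is not stated in the report itself.

```python
# Figures taken from the report: 720 resumes, 20% held out, 95.14% accuracy.
TOTAL_RESUMES = 720
TEST_FRACTION = 0.20
REPORTED_ACCURACY = 95.14  # percent

test_size = round(TOTAL_RESUMES * TEST_FRACTION)

# Inferred, not reported: the whole-number count of correct predictions
# whose accuracy is closest to the reported value.
correct = round(REPORTED_ACCURACY / 100 * test_size)
implied_accuracy = round(100 * correct / test_size, 2)

print(test_size, correct, implied_accuracy)  # 144 137 95.14
```

Under these assumptions, the reported accuracy corresponds to 137 of 144 test resumes classified correctly, that is, 7 misclassifications across all eight classes.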

6.6 Analysis of Confusion Matrix

Figure 6.6: Confusion Matrix of the Model

Two Cloud Engineer samples and two Robotics Engineer samples were misclassified
into different classes. This can be explained by the similarity in working
areas among Cloud Engineers, Cybersecurity Specialists, and Data Scientists in
cloud computing. Similarly, Robotics Engineers, Machine Learning Engineers, and
Software Developers all work in the field of Artificial Intelligence, which can
cause the model to confuse them. Cybersecurity Specialists, Graphics Designers,
and Software Developers have no False Negatives (FN): all 17 Cybersecurity
Specialist samples, all 11 Graphics Designer samples, and all 21 Software
Developer samples were correctly identified, giving those classes a recall of 1.
This leads to an important observation: while most classes exhibit a small degree
of misclassification (FN and FP), the overall predictions remain highly accurate.
The high diagonal values indicate that the model successfully identifies most sam-
ples correctly, with only minor confusion in closely related classes (e.g., Cloud
Engineer-Data Scientist, Machine Learning Engineer-Robotics Engineer).

6.7 Analysis of the Ranked Resume


The following snapshot shows the model prediction along with the ranking of
resumes based on the match score between each resume and the job description,
calculated using cosine similarity:

Figure 6.7: Ranking Resumes

The figure shows that the candidate named “Jane Smith” was found to be more
suitable for the job of Data Scientist, with a match score of 0.92. Similarly,
the other candidates were ranked on the basis of their match scores within the
job category predicted by the Random Forest classifier.

7. Conclusion and Future Enhancements

7.1 Conclusion
The Resume Screening system developed using NLP, cosine similarity, and the
Random Forest algorithm provides an efficient and automated solution for resume
screening in recruitment processes. By automating the tedious task of evaluating
resumes, it allows recruiters to focus on high-value tasks such as interviewing
and final decision-making. The system is capable of extracting key data from
resumes, calculating similarity scores with job descriptions, and ranking
candidates accordingly, significantly reducing the manual effort involved in
candidate selection. Through the integration of machine learning models, the
system offers not only efficiency but also accuracy, helping to prioritize the
most qualified candidates. Despite facing challenges in handling varied resume
formats and optimizing the algorithms, the project has demonstrated the potential
of leveraging NLP and machine learning to enhance recruitment processes.
In conclusion, the project successfully automates the resume screening process
while maintaining a relatively high accuracy of 95.14%, supporting its
applicability in IT fields without compromising the quality of the selection
process.

7.2 Future Enhancements


While the Resume Screening system developed using NLP and the Cosine Simi-
larity algorithm is functional, several enhancements can be made to improve its
efficiency and overall user experience. Some potential future enhancements in-
clude:

• Integration of Additional Factors: Resumes are currently ranked by their
similarity to job descriptions using cosine similarity. In the future, more
factors can be considered when looking for the best candidate, such as
past hiring records, salary demand, and so on.

• Support for More File Formats: In the future, the system could support
additional file formats beyond DOCX and PDF, such as TXT, RTF, and
ODT, making it more versatile for different user needs.

References
[1] C. Daryani. An automated resume screening system using natural language
processing and similarity. Ethics and Information Technology, 2020.

[2] J. Devlin. Pre-training of deep bidirectional transformers for language
understanding. 2018.

[3] D. Jurafsky and J. H. Martin. Speech and language processing: An
introduction to natural language processing, computational linguistics, and speech
recognition with language models. 2025.

[4] L. Breiman. Random forests. Machine Learning, 2001.

[5] N. Vanetik. NLP-based screening for IT job vacancies. 2023.

[6] P. K. Roy. A machine learning approach for automation of resume
recommendation system. Procedia Computer Science, 2020.

[7] D. S. Tayal. Resume screening using machine learning. International
Journal of Scientific Research in Computer Science Engineering and Information
Technology, 2024.

[8] C. Gan, Q. Zhang, and T. Mori. Application of LLM agents in recruitment.
Journal of Information Processing, 2024.

[9] A. Deshmukh, A. Raut, and H. V. P. Mandal. Enhanced resume screening
for smart hiring using sentence-bidirectional encoder representations from
transformers (S-BERT). 2024.

[10] V. Jagwani, S. Meghani, K. Pai, and S. Dhage. Resume evaluation through
latent Dirichlet allocation and natural language processing for effective
candidate selection. 2023.

[11] S. Pujari. Resume screening with natural language processing in Python.
2023.

[12] T. Mikolov. Efficient estimation of word representations in vector space. 2013.

Appendix A. APPENDIX

A.1

Figure A.1: Front page

Figure A.2: Login page

A.2

Figure A.3: Job Seeker View

Figure A.4: Job Application Access Page

A.3

Figure A.5: Job Provider View
