b.e-cse-batchno-57
By
SCHOOL OF COMPUTING
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC | 12B Status by UGC | Approved by AICTE
JEPPIAAR NAGAR, RAJIV GANDHI SALAI,
CHENNAI - 600119
APRIL - 2023
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of ABIRAAMI.S
(39110012) and ABINAYA.R (39110010) who carried out the Project Phase-2
entitled “SCREENING AND RANKING RESUME USING STACKED MODEL”
under my supervision from January 2023 to April 2023.
Internal Guide
Dr. S. PRINCE MARY, M.E., Ph.D.
I, ABIRAAMI.S (Reg. No: 39110012), hereby declare that the Project Phase-2
Report entitled “SCREENING AND RANKING RESUME USING STACKED
MODEL” done by me under the guidance of Dr. S. Prince Mary, M.E., Ph.D.
is submitted in partial fulfillment of the requirements for the award of Bachelor
of Engineering degree in Computer Science and Engineering.
DATE: 20.04.2023
ABIRAAMI.S
PLACE: Chennai SIGNATURE OF THE CANDIDATE
ACKNOWLEDGEMENT
I wish to express my thanks to all Teaching and Non-teaching staff members of the
Department of Computer Science and Engineering who were helpful in many
ways for the completion of the project.
ABSTRACT
TABLE OF CONTENTS
Chapter No.    TITLE    Page No.
ABSTRACT v
LIST OF ABBREVIATIONS ix
1 INTRODUCTION 1
2 LITERATURE SURVEY 4
3 REQUIREMENTS ANALYSIS 8
5.1 Algorithms 19
7.1 Conclusion 29
APPENDIX 33
A SOURCE CODE 33
B SCREENSHOTS 46
C RESEARCH PAPER 49
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
Every day, new research is carried out by organizations in many fields. From the IT sector to the medical field, progress is made continuously, and this progress creates new employment opportunities all over the world. The hiring process across the world is largely the same: the candidate first prepares a resume, and most of the recruitment process is based on that resume. Most organizations ask candidates to send their resumes via e-mail. Once the e-mails are received, the next task is to sort the resumes according to the requirements of the role. This sorting is usually done manually, which is time consuming.
Manual sorting is also error prone, and its efficiency is extremely low. It may give a chance to a candidate whose credentials are not up to the mark for the organization, or it may miss a candidate who is an excellent fit for the organization. Hence, a system that is intelligent enough to sort all these documents accurately and on time is needed, which is exactly our aim. Machine learning for recruiting is the application of machine learning, such as the learning or problem solving that a computer can do, to the recruitment function. This technology is designed to streamline or automate parts of the recruiting workflow, especially repetitive, high-volume tasks: for example, software that applies machine learning to resumes to auto-screen candidates, or software that performs sentiment analysis on job descriptions to identify potentially biased language. The current recruitment process is tedious and time consuming, forcing candidates to enter all their skills and information manually, and the HR team requires more manpower to scrutinize the candidates' resumes. This motivated us to build a solution that is more flexible and automated.
Recruitment is a 200-billion-dollar business. It deals with hiring the best-fit candidates with the relevant skills for a given job profile from an immensely large pool of candidates. If a company has a job opening for a position, scores of candidates mail their resumes to the company to apply for that opening. In the hiring process, the first task for any recruiter is to screen the resumes of all the job applicants. Any company with a job opening for a particular position will have its mail inbox bombarded with thousands of emails from aspiring job applicants every single day. For any recruiter, selecting the prospective candidates for that position from such a large pool is very tedious. It is an extremely daunting task for the recruiters of a company to manually go through thousands of resumes and select the most appropriate candidates for the job. Out of the thousands of resumes submitted to the company for a given job posting, about 75% do not showcase the relevant skills required for the job profile.
Due to this, recruiters quite often find it arduous to narrow down the most appropriate candidates from a large applicant pool. In recent years, more than 50,000 e-recruitment sites have been developed. The developers of these online recruitment sites have used various approaches to identify prospective candidates for a given job profile of a company. Some of these sites have employed classification techniques that classify candidate resumes into various categories for every job posting given by every company. In these approaches, every candidate's resume is matched against every job posting on the recruitment site. The aim of these recruitment sites is to show candidates the postings to which they are the best fit. The techniques used by these sites have achieved high accuracy and precision, but one of their major disadvantages is time complexity. If every candidate's resume is matched against every job posting on the online recruitment site, the time taken to acquire the results is very high.
The world of Artificial Intelligence (AI) and Machine Learning (ML) has grown significantly in recent times. The availability of large amounts of data, brought about by technological advances that have made the internet cheap and accessible to previously unreached regions of the world, has contributed to a great increase in the performance of ML models. Software companies around the world exploit these advances in ML to drive automation and increase their productivity in areas that relied mostly on manual human labor. The approach discussed in this project uses machine learning to train a model on a dataset for a particular type of job position. It is also proposed to use section-based segmentation for data extraction using Natural Language Processing (NLP).
CHAPTER 2
LITERATURE SURVEY
This problem statement has been extensively studied over the past five years by researchers and recruitment companies in a bid to create a solution, and their solutions vary widely in how they analyze resumes.
Abdul Samad Ebrahim Yahya [1] presented a review of the tools and the methodology needed to give a clear understanding of how NLP and deep learning are associated during training. Efficiency and execution in NLP are both improved by Part-of-Speech Tagging (POST), Morphological Analysis, Named Entity Recognition (NER), Semantic Role Labeling (SRL), Syntactic Parsing, and Coreference Resolution. Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and the interplay among Dense Vectors (DV), the Window Approach (WA), and Multitask Learning (MTL) are discussed as characteristics of deep learning.
Garima Bhardwaj [4] conducted a study among 115 HR professionals at various IT companies in the Delhi/NCR region. A multiple regression method was used to test the hypothesis and confirmed a positive relationship between the two factors, establishing that increased use of AI at work results in better HR functional performance. AI also has a significant relationship with innovativeness and with ease of use, which reflects that AI affects HR through innovation and ease of use.
Huaizheng Zhang [5] developed a method for automatic resume quality assessment (RQA). Since there was no public dataset for model training and evaluation, the authors built a dataset for RQA by collecting around 10K resumes provided by a private resume management company. By investigating the dataset, they identified factors or features that could be useful to discriminate good resumes from bad ones, e.g., the consistency between different parts of a resume.
Muath Sabha [6] presented a hybrid approach that employs conceptual-based classification of resumes and job postings and automatically ranks the candidate resumes that fall under each category against their corresponding job offers. In this context, they exploit an integrated knowledge base for carrying out the classification task and experimentally demonstrate, using a real-world recruitment dataset, promising precision results compared to conventional machine-learning-based resume classification approaches.
Pradeep Kumar Roy [8] suggested that an automated way of resume classification and matching could really ease the tedious process of fair screening and shortlisting; it would certainly expedite candidate selection and decision making. Such a system can work with a large number of resumes by first classifying them into the right categories using different classifiers; once classification has been done, the top candidates can be ranked as per the job description using content-based recommendation, using cosine similarity and k-NN to identify the CVs that are nearest to the provided job description.
Sanjay Revanna [9] proposed that the top applicants can be rated using content-based recommendation, which uses cosine similarity to find the curricula vitae most comparable to the supplied job description, and that a KNN algorithm can be used to pick and rank curricula vitae (CVs) against job descriptions in large quantities.
2.1 INFERENCES FROM THE LITERATURE SURVEY
The existing systems have models with no way to improve themselves over time; the models are trained only once. They use machine learning algorithms that tend to plateau in performance when run over a large dataset. The applications can only be used effectively by the programmer. Each code block is interdependent, so every change requires disrupting the entire code, which may disrupt the flow.
CHAPTER 3
REQUIREMENTS ANALYSIS
Due to this, recruiters quite often find it arduous to narrow down the most appropriate candidates from a large applicant pool. In recent years, more than 50,000 e-recruitment sites have been developed. The developers of these online recruitment sites have used various approaches to identify prospective candidates for a given job profile of a company. Some of these sites have employed classification techniques that classify candidate resumes into various categories for every job posting given by every company. In these approaches, every candidate's resume is matched against every job posting on the recruitment site. The aim of these recruitment sites is to show candidates the postings to which they are the best fit. The techniques used by these sites have achieved high accuracy and precision, but one of their major disadvantages is time complexity. If every candidate's resume is matched against every job posting on the online recruitment site, the time taken to acquire the results is very high.
The world of Artificial Intelligence (AI) and Machine Learning (ML) has grown significantly in recent times. The availability of large amounts of data, brought about by technological advances that have made the internet cheap and accessible to previously unreached regions of the world, has contributed to a great increase in the performance of ML models. Software companies around the world exploit these advances in ML to drive automation and increase their productivity in areas that relied mostly on manual human labor. The approach discussed in this project uses machine learning to train a model on a dataset for a particular type of job position. It is also proposed to use section-based segmentation for data extraction using Natural Language Processing (NLP). In order to improve the time efficiency of the web application, a candidate's resume will only be matched to those job openings in which they are interested and to which they have applied, which will, in turn, reduce the time complexity. Besides, the results of the resume matching of all the candidates who have applied for a job opening will be visible only to the recruiter of that particular company. This is done with the aim of relieving the recruiters of any company from the long and tedious task of viewing and analyzing thousands of candidates' resumes. In this intelligent approach, recruiters will be given the option to view the candidates' resumes, and they will get the results of the best candidates suitable for the required job position.
3.2 SYSTEM USE CASE
This use case diagram is a graphic depiction of the interactions among the elements of the Resume Builder. It represents the methodology used in system analysis to identify, clarify, and organize the system requirements of the Resume Builder. The main actors of the Resume Builder in this use case diagram are Dataset, User, and Algorithm, which perform different types of use cases such as Manage Resume, Manage Skills, Manage Job, Manage Formats, Manage Resume Types, Manage Users, and Full Resume Builder Operations. The major elements of the UML use case diagram of the Resume Builder are shown in Fig. 3.2.
3.3 SOFTWARE REQUIREMENTS SPECIFICATION DOCUMENT
CHAPTER 4
4.1.2 TF-IDF
At this stage, a dynamic script for the TF-IDF approach is written. Term frequency-inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection. TF-IDF produces word frequency scores that emphasize terms that are more interesting, i.e., common within a text but not across texts, without delving into the arithmetic. The TF-IDF vectorizer tokenizes texts, learns the vocabulary and the inverse document frequency weightings, and allows new documents to be encoded. It provides information on word frequency in the documents. The higher the TF-IDF score of a term, computed from its term frequency and inverse document frequency, the more relevant the term is in a document.
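As an illustration of this step, the following minimal sketch uses scikit-learn's TfidfVectorizer; the project's own TF-IDF script appears in Appendix A, and the sample texts here are invented for demonstration:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Illustrative documents; in the application these would be the cleaned resume texts.
documents = [
    "python machine learning engineer with nlp experience",
    "java backend developer spring microservices",
    "data scientist python statistics machine learning",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)  # rows = documents, columns = vocabulary terms

# Show the five highest-weighted terms of the first document.
terms = vectorizer.get_feature_names_out()
row = tfidf_matrix[0].toarray().ravel()
for idx in np.argsort(row)[::-1][:5]:
    print(terms[idx], round(row[idx], 3))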
Latent Dirichlet Allocation (LDA), a tool and technique for topic modeling, classifies or categorizes the text into topics per document and words per topic; these are modeled based on Dirichlet distributions and processes. LDA has been used in the application for the following functions: discovering the hidden themes in the data, classifying the data into the discovered themes, and using the classification to organize, summarize, and search the documents. The application then calculates a score for each candidate's resume according to the job posting they have applied for. According to the score each candidate's resume receives, a rank list is made, with candidates receiving higher scores placed above candidates receiving lower scores.
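A minimal sketch of the LDA topic-modeling step described above is given below, in line with the gensim-based script in Appendix A; the tokenized resumes and the number of topics are illustrative assumptions:
import gensim
from gensim import corpora

# Illustrative tokenized resumes; the application builds these from the cleaned resume text.
tokenized_resumes = [
    ["python", "machine", "learning", "nlp", "pandas"],
    ["java", "spring", "backend", "microservices"],
    ["python", "statistics", "data", "visualization"],
]

id2word = corpora.Dictionary(tokenized_resumes)
corpus = [id2word.doc2bow(text) for text in tokenized_resumes]

lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus, id2word=id2word, num_topics=2, random_state=100, passes=10)

# Dominant topic, its contribution, and its keywords for each resume.
for i, bow in enumerate(corpus):
    topic_num, prob = sorted(lda_model.get_document_topics(bow),
                             key=lambda x: x[1], reverse=True)[0]
    keywords = ", ".join(word for word, _ in lda_model.show_topic(topic_num))
    print(i, topic_num, round(prob, 3), keywords)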
This diagram is simply a description of all the entities that have been incorporated into the system. It represents the relations between each of them and involves a sequence of decision-making processes and steps. It can be seen as a visual of the whole process and its implementation. All functional correspondences are explained in Fig. 4.2.
Fig 4.2 – Architecture Diagram
Anaconda is free and open-source. This means you can use it without spending any money.
In the data science sector, Anaconda is an industry staple. It is open-source too,
which has made it widely popular. If you want to become a data science
professional, you must know how to use Anaconda for Python because every
recruiter expects you to have this skill. It is a must-have for data science.
It has more than 1500 Python and R data science packages, so you don’t
face any compatibility issues while collaborating with others. For example,
suppose your colleague sends you a project which requires packages called A and
B but you only have package A. Without having package B, you wouldn’t be able
to run the project. Anaconda mitigates the chances of such errors. You can easily
collaborate on projects without worrying about any compatibility issues. It gives
you a seamless environment that simplifies deploying projects. You can deploy
any project with just a few clicks and commands while managing the rest.
Anaconda has a thriving community of data scientists and machine learning
professionals who use it regularly. If you encounter an issue, chances are, the
community has already answered the same. On the other hand, you can also ask
people in the community about the issues you face there, it’s a very helpful
community ready to help new learners. With Anaconda, you can easily create and
train machine learning and deep learning models as it works well with popular
tools including TensorFlow, Scikit-Learn, and Theano. You can create
visualizations by using Bokeh, Holoviews, Matplotlib, and Datashader while using
Anaconda.
You can replace <your_environment_name> with the name of your environment.
After entering this command, conda will ask whether you want to proceed, to which you should reply with y: Proceed ([y]/n)?
On the other hand, if you want to create an environment with a particular version
of Python, you should use the following command:
conda create -n <your_environment_name> python=3.6
Similarly, if you want to create an environment with a particular package, you can
use the following command:
conda create -n <your_environment_name> pack_name
Here, you can replace pack_name with the name of the package you want to use.
If you have a .yml file, you can use the following command to create a new Conda
environment based on that file:
conda env create -n <your_environment_name> -f <file_name>.yml
An existing Conda environment can also be exported to a .yml file.
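If needed, this can be done with the standard conda export command (the file name here is only an example):
conda env export > environment.yml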
Activating an Environment
You can activate a Conda environment by using the following command:
conda activate <environment_name>
You should activate the environment before you start working in it. Also, replace <environment_name> with the name of the environment you want to activate. On the other hand, if you want to deactivate an environment, use the following command:
conda deactivate
Updating Packages in an Environment
If you want to update the packages present in a particular Conda
environment, you should use the following command:
conda update --all
The above command will update all the packages present in the environment.
However, if you want to update a package to a certain version, you will need to
use the following command:
conda install <package_name>=<version>
Deleting an Environment
Sometimes, you don’t need to add a new environment but remove one. In
such cases, you must know how to delete a Conda environment, which you can do
so by using the following command:
conda env remove --name <env_name>
The above command would delete the Conda environment right away.
CHAPTER 5
IMPLEMENTATION DETAILS
5.1 ALGORITHMS
We applied the following algorithms to solve this problem:
Natural Language Processing (NLP)
Term Frequency-Inverse Document Frequency (TF-IDF)
Latent Dirichlet Allocation (LDA)
doorstep of the office in the hope of getting called for an interview. But now the scenario is totally different. The immense increase in internet connectivity has enabled changes in the recruitment process of companies. With the various job boards and e-recruitment portals, both recruiters and candidates have gained some benefits and convenience, but there are a few challenges as well. Companies that earlier received 10 to 50 resumes on a daily basis are now flooded with thousands of resumes, and it is practically impossible and cost-ineffective to go through this many resumes. Also, there is no standard resume format that candidates use. People from different backgrounds, industries and job profiles have their own way of presenting details on their resume, which makes screening a daunting task for recruiters. To overcome this, companies and websites started seeking candidates' details through online forms, but these also fail to do justice to the process as the forms are very generic in nature. All these challenges have been substantially addressed by the Applicant Tracking System (ATS), which is based on Natural Language Processing.
INFORMATION EXTRACTION
As we know, there is no standard format for resumes, so inconsistencies and irrelevant data are inevitable. The objective of information extraction is to extract relevant information or keywords from the unstructured data in the resume without any human intervention. Using techniques like tokenisation, stemming and POS tagging, as shown in Fig. 4.4, the system can obtain relevant job-related information from the uploaded candidate resumes. A summary of each resume is created in JSON format that can easily be used for further processing in the next phase of this resume screening system.
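A hedged sketch of what this extraction step can look like with spaCy is shown below; the JSON field names are assumptions for illustration, not the exact schema used by the system:
import json
import spacy

nlp = spacy.load("en_core_web_sm")

def summarize_resume(raw_text):
    # Tokenize, drop stop words/punctuation, lemmatize, and keep noun keywords.
    doc = nlp(raw_text)
    tokens = [t.lemma_.lower() for t in doc if not (t.is_stop or t.is_punct or t.is_space)]
    keywords = sorted({t.text for t in doc if t.tag_ in ("NN", "NNP")})
    return {"token_count": len(tokens), "keywords": keywords}

summary = summarize_resume("Worked as a Python developer building NLP pipelines in Chennai.")
print(json.dumps(summary, indent=2))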
TOKENISATION
The process of identifying the terms or words that make up a character sequence is called tokenization. It involves dividing a large text into smaller parts called tokens; characters such as whitespace and punctuation are removed in this process. Through tokenization, we can obtain information such as the number of words in a text and the frequency of a particular word in the text. Tokenization is an essential step for further text processing such as stop-word removal, lemmatization and stemming.
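A small illustration of this step with NLTK, mirroring the tokenize() helper in Appendix A (punctuation removed, then the text split into word tokens), is given below:
import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data needed once by word_tokenize

text = "Experienced Python developer; worked on NLP, ML and data pipelines."
cleaned = re.sub(r"[^\w\s]", "", text)   # strip punctuation
tokens = word_tokenize(cleaned)
print(tokens)       # the word tokens
print(len(tokens))  # number of words in the text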
building data graphs for named entity recognition. For example, in the sentence "I am building a website", "building" is a verb, but in the sentence "I live in the tallest building of Paris", "building" is a noun.
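The "building" example above can be checked directly with spaCy's part-of-speech tagger (assuming the en_core_web_sm model is installed):
import spacy

nlp = spacy.load("en_core_web_sm")
for sentence in ("I am building a website", "I live in the tallest building of Paris"):
    doc = nlp(sentence)
    for token in doc:
        if token.text == "building":
            print(sentence, "->", token.pos_)  # expected: VERB for the first, NOUN for the second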
appears in a document. Then, there are ways to adjust the frequency, for example by the length of a document or by the raw frequency of the most frequent word in a document. The inverse document frequency measures how common or rare a word is across the entire document set. The closer it is to 0, the more common the word is. This metric can be calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm. So, if the word is very common and appears in many documents, this number will approach 0; otherwise, it will be larger. Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.
Machine learning with natural language faces one major hurdle: its algorithms usually deal with numbers, and natural language is text. So we need to transform that text into numbers, a step otherwise known as text vectorization. It is a fundamental step in the process of machine learning for analyzing data, and different vectorization algorithms will drastically affect the end results, so you need to choose one that will deliver the results you are hoping for. Once you have transformed words into numbers, in a way that machine learning algorithms can understand, the TF-IDF score can be fed to algorithms such as Naive Bayes and Support Vector Machines, greatly improving the results of more basic methods like word counts. Simply put, a word vector represents a document as a list of numbers, with one for each possible word of the corpus. Vectorizing a document means taking the text and creating one of these vectors, and the numbers of the vector somehow represent the content of the text. TF-IDF gives us a way to associate each word in a document with a number that represents how relevant that word is in that document. Then, documents with similar, relevant words will have similar vectors, which is what we are looking for in a machine learning algorithm.
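As a hedged sketch of this idea, TF-IDF vectors can be fed to a Naive Bayes classifier with scikit-learn; the texts and labels below are made up for illustration:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "python pandas numpy machine learning",
    "java spring hibernate backend",
    "deep learning tensorflow nlp python",
    "microservices java rest api",
]
labels = ["data", "backend", "data", "backend"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())  # vectorize, then classify
model.fit(texts, labels)
print(model.predict(["python nlp transformers"]))  # expected: ['data']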
At this stage, a dynamic script for the TF-IDF approach is written. Term frequency-inverse document frequency is a numerical statistic intended to reflect how important a word is to a document in a collection. The TF-IDF vectorizer tokenizes texts, learns the vocabulary and the inverse document frequency weightings, and allows new documents to be encoded. It provides information on word frequency in the documents. The higher the TF-IDF score of a term, computed from its term frequency and inverse document frequency, the more relevant the term is in a document.
Latent Dirichlet Allocation (LDA), a tool and technique for topic modeling, classifies or categorizes the text into topics per document and words per topic; these are modeled based on Dirichlet distributions and processes. As described in Chapter 4, LDA is used in the application to discover the hidden themes in the data, classify the data into those themes, and use that classification to organize, summarize, and search the documents.
The application then calculates a score for each candidate's resume according to the job posting they have applied for. According to the score each candidate's resume receives, a rank list is made, with candidates receiving higher scores placed above candidates receiving lower scores. By displaying the resumes in order of relevance to the position, the technique ranks CVs according to their match with the job description, making the recruiter's job easy. Customized options for the job description are provided in our web application, and slider options are present for a better user experience. The web application also includes a bar chart that shows the statistics. The resume screening system replaces ineffective manual screening, ensuring that no candidate is overlooked.
5.3 PROPOSED SYSTEM
Scores can then be given to the resumes, and they can be ranked from highest match to lowest match. This ranking is made visible only to the company recruiter, who is interested in selecting the best candidates from a large pool. This is done with the aim of relieving the recruiters of any company from the long and tedious task of viewing and analyzing thousands of candidate resumes. The calculated ranking scores can then be used to determine the best-fitting candidates for that particular job opening. Since the dynamic model leverages NLP, it gives the output instantly. While going through all these pipelines, it scores each resume and gives accurate output with high efficiency and precision. It works the following way.
Resumes from the dataset are parsed to remove white spaces, numbers, and stop words like "and", "or", etc. TF-IDF vectorization is then applied to convert the words in the resumes to vectors. The text in the job description is also converted to vectors using the TF-IDF vectorizer. Cosine distance is computed to measure the similarity between each resume and the job description provided, and then the Mong String algorithm is applied to identify the resumes that closely match the JD provided by the recruiters. The main contributions of our work can be summarized as follows: the use of our framework is not limited to a single field of application and is useful for many other real-world applications; it provides a ranking-based approach after filtering; and it provides a framework to highlight the skills in a resume.
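A minimal sketch of this ranking step is given below, using scikit-learn's TF-IDF vectorizer and cosine similarity; the resume and job-description texts are illustrative, and the project's full pipeline is listed in Appendix A:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

resumes = [
    "python developer nlp machine learning pandas",
    "civil engineer autocad site supervision",
    "data analyst sql python visualization",
]
job_description = "looking for a python data role with nlp and machine learning skills"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(resumes + [job_description])

# Cosine similarity of the job description (last row) against every resume.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
for idx, score in sorted(enumerate(scores), key=lambda x: x[1], reverse=True):
    print("Resume", idx, "score", round(float(score), 3))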
CHAPTER 6
Using the KNeighbors algorithm, we obtained an accuracy of 98%, as shown in Fig. 6.1. At first glance, Linear SVC and XGBoost provided the highest accuracy and a favorable confusion matrix. Both were able to correctly forecast the role for a specific job description given to the system.
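A hedged sketch of this classification experiment is shown below: TF-IDF features with a KNeighborsClassifier evaluated on a held-out split. The dataset file name and its 'Resume'/'Category' columns are assumptions for illustration; the 98% figure above comes from the project's own experiments:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("resume_dataset.csv")  # assumed columns: 'Resume', 'Category'
X = TfidfVectorizer(stop_words="english", max_features=1500).fit_transform(df["Resume"])
y = df["Category"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))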
Fig. 6.2 Confusion matrix
The top 50 most frequently used words described each job description efficiently. These words helped train our model and provided satisfactory results. We plotted a word cloud to visualize the most commonly occurring words, presented in Fig. 7.3.
CHAPTER 7
SUMMARY
7.1 CONCLUSION
Our algorithm was successfully able to screen and shortlist the best candidates with the help of NLP. Highly accurate results were obtained by using Latent Dirichlet Allocation to display the best shortlisted resumes on the web UI. The web application was successfully able to order the resumes by intelligently reading job descriptions as input and comparing the resumes that fall into the category of the given job descriptions. Organizations receive a large number of applications for each employment opening. Finding the right candidate's application in a sea of resumes is a time-consuming endeavor for any company these days. The classification of candidates' resumes is a laborious, time-consuming, and resource-intensive process. To address this problem, we created an automated machine-learning-based algorithm that recommends suitable applicant resumes to HR based on the job description provided. The suggested methodology has two stages: first, it classifies resumes into various groups; second, it suggests resumes based on their resemblance to the job description. If an industry produces a high number of resumes, the proposed approach can be used to create an industry-specific model.
be based on the applicant's skills, ensuring that the resumes uploaded by the applicant are genuine and that the applicant is truly knowledgeable about the skills. Using NLP (Natural Language Processing) and ML (Machine Learning) to rank the resumes according to the given constraints, this intelligent system ranks resumes of any format according to the constraints or requirements provided by the client company. We will take the bulk of input resumes from the client company, and the client company will also provide the requirements and constraints according to which the resumes should be ranked by our system. Besides the information provided by the resume, we are going to read the candidate's social profiles (such as LinkedIn and GitHub), which will give us more genuine information about that candidate. The application automates the task of project allocation, thereby eliminating the tedious and redundant affair of opening and analyzing the resumes manually by the HR team of the organization.
REFERENCES
[2] Anushka Sharma, Smiti Singhal, Dhara Ajudia, "Intelligent Recruitment System Using NLP", International Conference on Artificial Intelligence and Machine Vision (AIMV), 2021.
[3] Fahad, SK Ahmed, and Abdul Samad Ebrahim Yahya, "Inflectional Review of Deep Learning on Natural Language Processing", 2018 International Conference on Smart Computing and Electronic Enterprise (ICSEE), IEEE, 2018.
[6] Mujtaba, Dena F., and Nihar R. Mahapatra, "Ethical Considerations in AI-Based Recruitment", 2019 IEEE International Symposium on Technology and Society (ISTAS), IEEE, 2019.
[7] Pradeep Kumar Roy, Sarabjeet Singh Chowdhary, Rocky Bhatia, "A Machine Learning Approach for Automation of Resume Recommendation System", Procedia Computer Science, Volume 167, Pages 2318-2327, 2020.
[8] Rajath V, Riza Tanaz Fareed, Sharadadevi Kaganurmath, "Resume Classification and Ranking Using KNN and Cosine Similarity", International Journal of Engineering Research & Technology, Volume 10, Issue 08, August 2021.
[10] Yong Luo, Huaizheng Zhang, Yongjie Wang, Yonggang Wen, Xinwen Zhang, "ResumeNet: A Learning-Based Framework for Automatic Resume Quality Assessment", 2018 IEEE International Conference on Data Mining (ICDM), December 2018.
APPENDIX
A. SOURCE CODE
# Imports required by the code below (Streamlit UI, plotting, LDA, word cloud).
import pandas as pd
import streamlit as st
from PIL import Image
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import gensim
from gensim import corpora
from wordcloud import WordCloud

image = Image.open('Images//logo.png')
st.image(image, use_column_width=True)
st.title("Resume Screening")

# Reading the CSV files prepared by the fileReader.py
Resumes = pd.read_csv('Resume_Data.csv')
Jobs = pd.read_csv('Job_Data.csv')

############# JOB DESCRIPTION CODE ##############
# Checking for multiple Job Descriptions.
# If more than one Job Description is available, the user is asked to select one.
if len(Jobs['Name']) <= 1:
    st.write(
        "There is only 1 Job Description present. It will be used to create scores.")
else:
    st.write("There are ", len(Jobs['Name']),
             "Job Descriptions available. Please select one.")
align='left'))])
fig.update_layout(width=800, height=500)
st.write(fig)
st.markdown("---")
Ranked_resumes['Rank'] = pd.DataFrame(
    [i for i in range(1, len(Ranked_resumes['Scores'])+1)])
st.write(fig1)
st.markdown("---")
fig2 = px.bar(Ranked_resumes,
              x=Ranked_resumes['Name'], y=Ranked_resumes['Scores'],
              color='Scores',
              color_continuous_scale='haline', title="Score and Rank Distribution")
# fig.update_layout(width=700, height=700)
st.write(fig2)
st.markdown("---")


# Splits each cleaned resume string into word lists for the LDA corpus;
# the 'def' line here is reconstructed from the call below.
def get_list_of_words(document):
    Document = []
    for a in document:
        raw = a.split(" ")
        Document.append(raw)
    return Document


document = get_list_of_words(Resumes['Cleaned'])
id2word = corpora.Dictionary(document)
corpus = [id2word.doc2bow(text) for text in document]

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word, num_topics=6, random_state=100,
                                            update_every=3, chunksize=100, passes=50,
                                            alpha='auto', per_word_topics=True)
def format_topics_sentences(ldamodel, corpus):
    sent_topics_df = []
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df.append(
                    [i, int(topic_num), round(prop_topic, 4)*100, topic_keywords])
            else:
                break
    return sent_topics_df


cloud = WordCloud(background_color='white',
                  width=2500,
                  height=1800,
                  max_words=10,
                  colormap='tab10',
                  collocations=False,
                  color_func=lambda *args, **kwargs: cols[i],
                  prefer_horizontal=1.0)
topics = lda_model.show_topics(formatted=False)
for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    plt.gca().imshow(cloud)
    plt.gca().set_title('Topic ' + str(i), fontdict=dict(size=16))
    plt.gca().axis('off')

plt.subplots_adjust(wspace=0, hspace=0)
plt.axis('off')
plt.margins(x=0, y=0)
plt.tight_layout()
st.pyplot(plt)
st.markdown("---")

df_topic_sents_keywords = format_topics_sentences(
    ldamodel=lda_model, corpus=corpus)
df_some = pd.DataFrame(df_topic_sents_keywords, columns=[
    'Document No', 'Dominant Topic', 'Topic % Contribution', 'Keywords'])
df_some['Names'] = Resumes['Name']
df = df_some
st.markdown("## Topic Modeling with Resume using LDA ")
st.markdown(
    "Using LDA to divide the text into a number of useful topics and creating a cluster of matching topic resumes.")

fig3 = px.sunburst(df, path=['Dominant Topic', 'Names'], values='Topic % Contribution',
                   color='Dominant Topic', color_continuous_scale='viridis', width=800,
                   height=800, title="Topic Distribution Graph")
st.write(fig3)
st.write("With a Match Score of :", Ranked_resumes.iloc[indx-1, 6])
fig = go.Figure(data=[go.Table(
    header=dict(values=["Resume"],
                fill_color='#f0a500',
                align='center', font=dict(color='white', size=16)),
    cells=dict(values=[str(value)],
               fill_color='#f4f4f4',
               align='left'))])
fig.update_layout(width=800, height=1200)
st.write(fig)
st.markdown("---")
import spacy
import Distill

try:
    nlp = spacy.load('en_core_web_sm')
except OSError:  # spacy.load raises OSError when the language model is not installed
    print("Spacy's English Language Modules aren't present.\n"
          "Install them by doing:\n python -m spacy download en_core_web_sm")


def _base_clean(text):
    """
    Takes in text read by the parser file and then does the text cleaning.
    """
    text = Distill.tokenize(text)
    text = Distill.remove_stopwords(text)
    text = Distill.remove_tags(text)
    text = Distill.lemmatize(text)
    return text
def _reduce_redundancy(text):
    """
    Takes in text that has been cleaned by _base_clean and uses a set to remove
    repeating words, keeping only a single occurrence of each needed word.
    """
    return list(set(text))
def _get_target_words(text):
    """
    Takes in text and uses Spacy Tags on it, to extract the relevant Noun, Proper
    Noun words that contain words related to tech and JD.
    """
    target = []
    sent = " ".join(text)
    doc = nlp(sent)
    for token in doc:
        if token.tag_ in ['NN', 'NNP']:
            target.append(token.text)
    return target


def Cleaner(text):
    sentence = []
    sentence_cleaned = _base_clean(text)
    sentence.append(sentence_cleaned)
    sentence_reduced = _reduce_redundancy(sentence_cleaned)
    sentence.append(sentence_reduced)
    sentence_targetted = _get_target_words(sentence_reduced)
    sentence.append(sentence_targetted)
    return sentence
import re

import nltk
import spacy
from nltk.corpus import stopwords        # needed by stopwords.words() below
from nltk.tokenize import word_tokenize  # needed by tokenize() below

# Define English stopwords
stop_words = stopwords.words('english')

# Load the spacy module and create an nlp object.
# This needs the spacy en model to be present on the system.
nlp = spacy.load('en_core_web_sm')


# Process to remove stopwords from a file; takes an optional_word list
# for words that are not in the stop words but that the user wants deleted.
def tokenize(text):
    # Removes any useless punctuation from the text
    text = re.sub(r'[^\w\s]', '', text)
    return word_tokenize(text)
def lemmatize(text):
    # the input to this function is a list
    str_text = nlp(" ".join(text))
    lemmatized_text = []
    for word in str_text:
        lemmatized_text.append(word.lemma_)
    return lemmatized_text


def _to_string(List):
    # the input parameter must be a list
    string = " "
    return string.join(List)
import os
import pandas as pd
import textract as tx  # assumed: 'tx' below refers to the textract library
import Cleaner         # project module defining Cleaner() above
import tf_idf          # project module providing do_tfidf()

resume_dir = "Data/Resumes/"
job_desc_dir = "Data/JobDesc/"
resume_names = os.listdir(resume_dir)
job_description_names = os.listdir(job_desc_dir)
document = []
text = tx.process(resume_directory+res, encoding='ascii')
text = str(text, 'utf-8')
temp.append(text)
placeholder.append(temp)
return placeholder
def get_cleaned_words(document):
    for i in range(len(document)):
        raw = Cleaner.Cleaner(document[i][1])
        document[i].append(" ".join(raw[0]))
        document[i].append(" ".join(raw[1]))
        document[i].append(" ".join(raw[2]))
        sentence = tf_idf.do_tfidf(document[i][3].split(" "))
        document[i].append(sentence)
    return document


Doc = get_cleaned_words(document)

Database = pd.DataFrame(document, columns=[
    "Name", "Context", "Cleaned", "Selective", "Selective_Reduced", "TF_Based"])
Database.to_csv("Resume_Data.csv", index=False)
job_document = read_jobdescriptions(job_description_names, job_desc_dir)
Jd = get_cleaned_words(job_document)
jd_database = pd.DataFrame(Jd, columns=[
    "Name", "Context", "Cleaned", "Selective", "Selective_Reduced", "TF_Based"])
jd_database.to_csv("Job_Data.csv", index=False)
B. SCREENSHOTS
We have built the model and obtained high accuracy.
We have built a web application with the algorithm that gave the best performance; it has an accuracy of 98.3%.
A pie chart is plotted.