Iocl1 Internship
By:
Nilakhya Mandita Bordoloi
Computer Science and Engineering,
Jorhat Engineering College
Organization:
Indian Oil Corporation Ltd., Guwahati Refinery
Abstract
This project report details the development of a Resume Screening Website created during
a summer internship at IOCL Guwahati Refinery under the Information Systems
Department. The website accepts resumes in both text and PDF formats,
categorizes them based on their content, and recommends appropriate job roles
to the resume owners.
The system parses essential information such as the name, personal details, skills,
and other details from the resumes. The frontend was built with HTML, CSS, and
JavaScript, and the backend with the Flask framework.
For machine learning, we identified Logistic Regression as the most suitable
model for categorization, while the Random Forest Classifier was chosen for job role
recommendations. The project uses a TF-IDF Vectorizer for feature
extraction. Additionally, Natural Language Processing (NLP) techniques and regular
expressions were employed to parse and analyze the resume content.
The system aims to streamline the resume screening process, providing efficient
categorization and job role recommendations based on the extracted information.
Acknowledgements
We would like to express our deepest gratitude to our internship guide, Mr. Jon Jonak
Phukan, Chief Manager, Information Systems Department, IOCL Guwahati Refinery, for
his invaluable guidance, support, and encouragement throughout this project. His
insights and expertise were crucial in navigating the complexities of developing the Resume
Screening Website.
We also extend our sincere thanks to the entire Information Systems Department
team at IOCL Guwahati Refinery for providing us with the necessary resources and a
conducive environment for our project work.
Additionally, we are grateful to the management and staff of IOCL Guwahati
Refinery for offering us this internship opportunity, which has been an invaluable learning
experience.
Finally, we would like to thank our families and friends for their unwavering support
and encouragement throughout the internship period.
Contents
Abstract
Acknowledgements
1 Introduction
1.1 Background
1.2 Purpose
1.3 Scope
1.4 Objectives
2 Project Description
2.1 Problem Statement
2.2 Solution Overview
2.3 Project Scope
3 Technical Stack
3.1 Web Development
3.1.1 HTML (HyperText Markup Language)
3.1.2 CSS (Cascading Style Sheets)
3.1.3 JavaScript
3.1.4 Flask
3.2 Machine Learning
3.2.1 Logistic Regression
3.2.2 Random Forest Classifier
3.3 Natural Language Processing (NLP)
3.4 Platforms and Tools
3.4.1 Python
3.4.2 Visual Studio Code (VS Code)
3.4.3 Port Number
3.4.4 Jupyter Notebook
3.4.5 Python Libraries
4 Development Process
4.1 Data Collection and Preprocessing
4.1.1 Clean Resume Data
4.1.2 Job Dataset with Features
4.2 Integration with Flask and NLP
4.3 Web Development
4.4 Testing and Implementation
5 Implementation Details
5.1 Integration with Flask and NLP
5.2 Frontend Development
5.3 Testing and Validation
5.4 Deployment
Chapter 1
Introduction
1.1 Background
In the digital age, organizations receive an overwhelming number of resumes for every
job opening, making the recruitment process increasingly complex and time-consuming.
Traditional methods of resume screening, which often involve manual review by human
resources personnel, can be inefficient, inconsistent, and prone to bias. This challenge has
driven the development of automated systems that leverage advancements in technology
to streamline and enhance the hiring process.
Our project, developed during a summer internship at IOCL Guwahati Refinery under
the Information Systems Department, addresses this challenge by creating a Resume
Screening Website. This website utilizes web development technologies, machine learning,
and natural language processing (NLP) to automate the resume screening process,
providing an efficient and scalable solution for organizations.
1.2 Purpose
The primary purpose of the resume screening system is to automate the initial stages
of the recruitment process by categorizing resumes based on predefined criteria and
recommending suitable job roles to applicants. By doing so, the system aims to reduce the
time and effort required for manual resume screening, increase the accuracy of candidate
selection, and minimize potential biases in the evaluation process.
The system extracts and analyzes key information from resumes, such as personal
details, educational background, work experience, and skills. It then matches this
information against job descriptions and criteria to categorize candidates and provide job
recommendations. This automated process not only speeds up the recruitment cycle but
also ensures a more objective assessment of candidates.
1.3 Scope
The scope of the project encompasses the following components:
• Resume Parsing: The system accepts resumes in both text and PDF formats,
extracting essential information using NLP techniques and regular expressions.
• Categorization: Utilizing machine learning models, the system categorizes resumes
based on various criteria such as skills, experience, and qualifications.
• Web Interface: A user-friendly web interface developed using HTML, CSS, and
JavaScript, with Flask as the backend framework, allows users to upload resumes
and view results.
1.4 Objectives
The key objectives of the project are:
Chapter 2
Project Description
• Categorization: The system uses a Logistic Regression model to categorize resumes
based on predefined criteria, such as skill sets, industry experience, and educational
qualifications. This model was selected for its simplicity and effectiveness
in binary and multi-class classification tasks.
• Web Interface: A user-friendly web interface, built using HTML, CSS, and
JavaScript, with Flask as the backend framework, allows users to upload resumes
and view categorization and job recommendation results. The interface is designed
to be intuitive and accessible, ensuring a seamless user experience.
• Testing and validation of the system to ensure accuracy, reliability, and scalability.
By automating the resume screening process, this system aims to reduce the workload
on human resources personnel, minimize biases in candidate evaluation, and improve the
overall efficiency of the recruitment process.
Chapter 3
Technical Stack
3.1.3 JavaScript
JavaScript is a versatile programming language used to add interactivity and dynamic
features to web pages. It allows for the manipulation of HTML and CSS elements,
making it possible to create responsive user interfaces, handle events, and communicate
with backend services asynchronously using AJAX.
3.1.4 Flask
Flask is a lightweight web framework for Python, designed to be simple yet flexible. It
is used for developing the backend of the Resume Screening Website. Flask provides a
robust platform for handling requests, routing, and integration with machine learning
models. It also supports templating, making it easy to render dynamic web pages.
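As a rough illustration of how a Flask route can receive an uploaded resume and hand its text to a model, consider the minimal sketch below. The route name `/upload`, the form field `resume`, and the `categorize` helper are hypothetical and are not taken from the project's code; the helper is a keyword stub standing in for the trained model, and PDF text extraction is left out.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def categorize(text: str) -> str:
    """Hypothetical stand-in for the trained categorization model."""
    return "Information Technology" if "python" in text.lower() else "Other"


@app.route("/upload", methods=["POST"])
def upload_resume():
    # Expect a multipart form upload with a file field named "resume" (assumed name).
    uploaded = request.files.get("resume")
    if uploaded is None or uploaded.filename == "":
        return jsonify(error="no resume uploaded"), 400
    # Plain-text resumes only in this sketch; PDF text extraction is omitted.
    text = uploaded.read().decode("utf-8", errors="ignore")
    return jsonify(category=categorize(text))
```

In a real application the stub would be replaced by a call to the fitted vectorizer and classifier, and the JSON response could instead be rendered through a template.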
3.2 Machine Learning
The system employs two key machine learning models: Logistic Regression for
categorization and Random Forest Classifier for job recommendations. Both models are
implemented using the Python programming language and related libraries.
• Named Entity Recognition (NER): Detecting and classifying named entities
in text, such as names, dates, and organizations.
• Regular Expressions: Used for pattern matching within text to extract specific
information.
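The pattern-matching idea above can be sketched with Python's standard `re` module. The patterns, the `KNOWN_SKILLS` list, and the `parse_resume` function are illustrative assumptions, not the expressions used in the project; real resumes would need broader patterns.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: optional "+", a digit, 8 to 13 digits/spaces/hyphens, a closing digit.
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,13}\d")
# Hypothetical skill vocabulary; a real system would use a much larger list.
KNOWN_SKILLS = ("python", "java", "sql", "flask", "machine learning")


def parse_resume(text: str) -> dict:
    """Extract the first email, the first phone number, and known skills."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    lowered = text.lower()
    # Word boundaries keep "java" from matching inside "javascript".
    skills = [s for s in KNOWN_SKILLS if re.search(rf"\b{re.escape(s)}\b", lowered)]
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
        "skills": skills,
    }
```

Fields such as the candidate's name are harder to capture with fixed patterns, which is where Named Entity Recognition is typically the better fit.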
Chapter 4
Development Process
Figure 4.1: Data Analysis - Clean Resume Data
• Random Row Dropping: Due to the large size of the dataset, some rows were
dropped randomly to manage data processing efficiently.
• Feature Extraction: Similar to the Clean Resume Data, text data from job
descriptions and features was vectorized.
• Model Selection: The Random Forest Classifier was chosen for its robustness and
capability to handle a diverse set of features. The model achieved an accuracy of
100% on the evaluation data. A perfect score of this kind can indicate overfitting
rather than a genuinely perfect fit, so it should be confirmed on unseen data.
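The random row dropping step can be sketched as a reproducible subsample. The function name `drop_rows_randomly`, the `keep_fraction` parameter, and the seed are assumptions for illustration; the report does not state how many rows were dropped or which tool was used.

```python
import random


def drop_rows_randomly(rows: list, keep_fraction: float, seed: int = 42) -> list:
    """Return a random subset of rows, preserving their original order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep_count = max(1, round(len(rows) * keep_fraction)) if rows else 0
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept_indices = sorted(rng.sample(range(len(rows)), keep_count))
    return [rows[i] for i in kept_indices]
```

Fixing the seed matters here: without it, each run trains on a different subset, which makes accuracy figures hard to compare between runs.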
Figure 4.2: Data Analysis - Job Dataset with Features
Chapter 5
Implementation Details
• NLP and Regular Expressions: These were employed to parse important information
from the resumes, such as the candidate’s name, contact information, skills,
and work experience. This information was crucial for both categorization and job
recommendation processes.
• Flask Framework: Flask was chosen for its simplicity and flexibility, enabling
efficient routing and handling of user interactions with the web application.
5.4 Deployment
The application was deployed locally on Flask’s built-in development server, listening
on a configurable port (5000 by default). This setup allowed for easy testing and
demonstration of the system’s capabilities.
Chapter 6
6.1 Results
6.1.1 Categorization Model
The categorization model was developed using the "Clean Resume Data" dataset, which
contained resumes categorized into 24 distinct job categories. The key findings are as
follows:
• Data Preprocessing: The dataset was resampled to balance the class distribution,
and duplicates were removed to ensure data quality. The resumes were vectorized
using the TF-IDF Vectorizer.
• Model Evaluation: Various models were evaluated using Grid Search
Cross-Validation, including SVM, Random Forest Classifier, Logistic Regression, and
Decision Tree.
• Best Performing Model: Logistic Regression emerged as the best-performing
model with an accuracy of 84.75%. The optimal parameters identified were:
– C: 15
– Penalty: l1
• Model Development: The Logistic Regression model was finalized and imple-
mented with the TF-IDF Vectorizer. This model effectively categorized resumes
into their respective job categories.
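The steps listed above (TF-IDF vectorization, grid search, and an l1-penalized Logistic Regression) can be sketched as follows, assuming scikit-learn was the library used, which the report does not state explicitly. The tiny corpus, the two labels, and the grid values are made up for illustration and are not the "Clean Resume Data" dataset or the project's actual search space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus for illustration only; not the project's dataset.
texts = [
    "python flask developer rest api",
    "java spring backend developer",
    "python machine learning engineer",
    "recruitment payroll employee relations",
    "hiring onboarding hr policies",
    "talent acquisition payroll hr",
] * 3
labels = ["IT", "IT", "IT", "HR", "HR", "HR"] * 3

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # The l1 penalty needs a solver that supports it, such as liblinear or saga.
    ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
])

# Assumed grid: only C is searched here, including the reported optimum of 15.
param_grid = {"clf__C": [1, 15]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, round(search.best_score_, 4))
print(search.predict(["python backend developer"]))
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is refit inside each cross-validation fold, so the held-out fold does not influence the features.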
Figure 6.1: Categorization Model Performance
6.2 Discussion
6.2.1 Model Performance
The results from the categorization model show that Logistic Regression performed well
in classifying resumes into 24 different categories, reaching an accuracy of 84.75%.
The chosen parameters (C: 15, Penalty: l1) shape this result: C controls the
regularization strength (in the common convention, a larger C means weaker
regularization), and the l1 penalty drives the weights of uninformative features to
zero, which limits model complexity. The
performance of other models like SVM and Decision Tree was also evaluated, but Logistic
Regression provided the best balance between accuracy and interpretability.
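To make the role of the two parameters concrete, the binary form of the l1-penalized logistic regression objective can be written as below. This is the standard textbook formulation with labels in {-1, +1}; the symbols w, b, x_i, and y_i are introduced here for illustration, and the report does not state how the 24-class case was handled (one-vs-rest or multinomial).

```latex
\min_{w,\,b} \; \lVert w \rVert_1
  + C \sum_{i=1}^{n} \log\!\bigl(1 + \exp\bigl(-y_i \,(x_i^{\top} w + b)\bigr)\bigr),
\qquad y_i \in \{-1, +1\}
```

With C = 15 the data-fit term is weighted 15 times more heavily than the penalty term, so the model is only lightly regularized, while the l1 norm still allows individual TF-IDF feature weights to become exactly zero.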
For the recommendation model, the Random Forest Classifier achieved a perfect
accuracy of 100%.
• Data Imbalance: Even after resampling, some categories in the "Clean Resume
Data" dataset had fewer samples, which may impact model performance.
• Large Dataset Handling: The "Job Dataset with Features" was large, and random
row dropping was necessary to manage processing. This may have resulted in
Figure 6.3: Website
• Model Overfitting: The perfect accuracy achieved by the Random Forest Classifier
raises concerns about potential overfitting. Further validation on unseen data
would be needed to confirm the model’s generalization capability.
• Enhanced Data Collection: Collecting more diverse and balanced datasets could
improve model performance and generalization.
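One way to start investigating the overfitting concern noted above is to check whether identical rows appear on both sides of the train/test split, since that is a common cause of a perfect score. Whether this happened in the project is not known from the report; the sketch below only shows the check. The function names and the (text, label) row format are assumptions.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle rows with a fixed seed and split them into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


def leaked_fraction(train_rows, test_rows):
    """Fraction of test rows whose (text, label) pair also appears in train."""
    if not test_rows:
        return 0.0
    seen = set(train_rows)
    return sum(row in seen for row in test_rows) / len(test_rows)


def deduplicate(rows):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(rows))
```

If `leaked_fraction` is high, deduplicating before splitting and re-measuring accuracy gives a more trustworthy estimate of how the model behaves on unseen data.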
The project successfully developed a resume screening website utilizing web development,
machine learning, and natural language processing techniques. The categorization
and recommendation models demonstrated strong performance, with the Logistic Regression
model achieving an accuracy of 84.75% and the Random Forest Classifier reaching
a perfect accuracy of 100%. The implementation in Flask, combined with effective data
preprocessing and model training, resulted in a functional and accurate system.
Additional enhancements and real-world testing could further validate and refine the
system’s capabilities.
Figure 6.4: Categorization, recommended job, and parsed information
Chapter 7
7.1 Conclusion
The project successfully developed a comprehensive resume screening website that
integrates web development technologies with advanced machine learning and natural
language processing techniques. This system was designed to process and categorize resumes
and recommend suitable job roles based on the analyzed data.
7.2 Future Work
While the project has achieved its primary objectives, several areas for future
enhancement and exploration can further refine and expand the system’s capabilities:
• Multilingual Support: Adding support for multiple languages can make the system
accessible to a global audience. This would involve developing language-specific
models and incorporating translation capabilities for resume and job description
analysis.
• Bias and Fairness: Addressing potential biases in the machine learning models
is essential. Implementing fairness audits and bias mitigation techniques can help
ensure equitable treatment of all applicants and prevent discriminatory practices.
7.4 Links
7.4.1 GitHub Repository