Application of NLP For Information Extraction From Unstructured Documents
Bachelor of Technology
In
Submitted by
Karrem Praneeth Reddy : 2111CS020348
COLLEGE CERTIFICATE
This is to certify that this is a bonafide record of the application development entitled “Application
of NLP for Information Extraction from Unstructured Documents” submitted by K. Praneeth
(2111CS020348), P. Pranay (2111CS020349), T. Pranay (2111CS020350), V. Pranay
(2111CS020351), and G. Praneeth Sai Saran (2111CS020353) of B.Tech IV year I semester,
Department of CSE (AI&ML), during the year 2024-25. The results embodied in the report have not
been submitted to any other university or institute for the award of any degree or diploma.
DEAN CSE(AI&ML)
EXTERNAL EXAMINER
ACKNOWLEDGEMENT
We sincerely thank our Dean, Dr. ThayyabaKhatoon, for her constant support and motivation at all
times. A special acknowledgement goes to a friend who encouraged us from backstage. Last but not
least, our sincere appreciation goes to our families, who have been tolerant and understanding of our
moods and have extended timely support.
We would like to express our gratitude to all those who extended their support and suggestions to
help us come up with this application. Special thanks to our guide, Ms. Maddi Sri V.S. Suneeta,
whose help, stimulating suggestions, and encouragement assisted us throughout the course of the
project development.
Abstract
The increasing interest in data has led to significant investments in tools that can analyze and extract
useful information from various sources. However, when it comes to applicant tracking systems
(ATS) that gather information from candidates' resumes and job descriptions, most approaches are
still rule-based and do not fully utilize modern techniques. This is surprising because, although the
content of these documents may vary, their structure is usually quite similar. In this paper, we
introduce a Natural Language Processing (NLP) pipeline designed to extract structured information
from a wide range of textual documents, with a focus on those used in applicant tracking systems,
such as resumes and job postings. The pipeline employs several NLP techniques, including document
classification, segmentation, and text extraction. To classify the documents, we use algorithms such
as Support Vector Machines (SVM) and XGBoost, which help in accurately identifying the type of
document based on its content. After classification, the documents are divided into sections using
methods such as chunking, regular expressions, and Part-of-Speech (POS) tagging. These techniques
allow us to identify and focus on the most important parts of the document. Finally, we use Named
Entity Recognition (NER), regular expressions, and pattern matching to extract relevant information
from each section. The structured data obtained can be used to improve processes such as document
organization, scoring, matching, and form auto-filling, making ATS more efficient and effective for
both job seekers and employers.
CONTENTS
CHAPTER 1: INTRODUCTION
 1.1 Problem Definition
 1.2 Objective of the Project
 1.3 Scope & Limitations of the Project
CHAPTER 2: LITERATURE SURVEY
CHAPTER 3: ANALYSIS
 3.1 Project Planning and Research
 3.2 Software Requirement Specification
  3.2.1 Software Requirements
  3.2.2 Hardware Requirements
 3.3 Model Selection and Architecture
CHAPTER 4: DESIGN
 4.1 Introduction
 4.2 UML Diagram
 4.3 Dataset Description
 4.4 Data Preprocessing Techniques
 4.5 Methods & Algorithms
CHAPTER 5: DEPLOYMENT AND RESULTS
 5.1 Introduction
 5.2 Source Code
 5.3 Model Implementation and Training
 5.4 Model Evaluation Metrics
 5.5 Model Deployment: Testing and Validation
 5.6 Results
CHAPTER 6: CONCLUSION
 6.1 Project Conclusion
 6.2 Future Scope
CHAPTER 1
1. INTRODUCTION
• Information Extraction (IE) identifies and extracts relevant information from unstructured
documents, converting it into a structured format suitable for storage, processing, and
retrieval.
• Extracting information from unstructured documents is more complex than from structured
ones due to the variability in formats and the need to identify specific types of information.
• The research specifically targets CVs and job vacancy details within the IT field, aiming to
streamline the recruitment process by automating the selection of suitable candidates.
• The extraction methods will focus on critical details from CVs, including personal
information, educational background, and work experience, while job vacancies will reveal
job positions, required skills, responsibilities, and educational qualifications.
• By developing effective extraction methods, the approach aims to ease the manual
recruitment process, making it more efficient and less time-consuming for recruiters.
• With the increasing volume of data available online, effective information extraction
techniques are essential for managing and utilizing this data in various applications,
including recruitment.
1.1 Problem Definition
The recruitment process often involves handling a vast number of unstructured documents, such
as CVs and job descriptions, which can be time-consuming and challenging to process manually.
Extracting relevant information from these documents is complicated due to their varied formats
and the need to identify specific types of information. This leads to inefficiencies and potential
errors in candidate selection. Consequently, there is a pressing need for an automated system that
can accurately parse these documents, categorize them, and extract essential details such as
personal information, educational background, work experience, and job requirements. By
implementing an effective information extraction approach, the recruitment process can be
streamlined, ultimately saving time and resources while improving the accuracy of candidate
selection.
Key Components (a possible target representation is sketched after this list):
• Personal information
• Educational background
• Work experience
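A minimal sketch of how these key components might be represented as a target schema in Python is shown below; the class and field names are illustrative assumptions, not part of the project's actual design.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical target schema for the information extracted from a CV.
# Field names are illustrative assumptions, not the project's actual schema.

@dataclass
class Education:
    degree: str
    institution: str
    year: str

@dataclass
class WorkExperience:
    title: str
    organization: str
    duration: str

@dataclass
class CandidateProfile:
    name: str = ""
    email: str = ""
    phone: str = ""
    education: List[Education] = field(default_factory=list)
    experience: List[WorkExperience] = field(default_factory=list)
    skills: List[str] = field(default_factory=list)
```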
1.2 Objective of the Project
Technical Implementation:
The objective of the project is to develop an efficient and automated Natural Language Processing
(NLP) pipeline for the extraction of structured information from unstructured documents,
specifically targeting CVs and job vacancy descriptions in the IT field. The project aims to:
• Implement a custom NLP pipeline using techniques such as document classification,
segmentation, and Named Entity Recognition (NER) to accurately identify and extract relevant
information.
• Enhance the recruitment process by automating the extraction of key details, including personal
information, educational qualifications, work experience, and required skills from CVs and job
postings.
• Improve the accuracy and efficiency of candidate selection by minimizing manual processing
time and reducing the risk of human error.
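A minimal sketch of how such a pipeline could be organized is shown below; the function names and stages are illustrative assumptions based on the steps described in this report (classification, segmentation, entity extraction), not the project's actual implementation.

```python
# Illustrative pipeline skeleton; each stage is a placeholder for the
# techniques described in this report (assumed structure, not the real code).

def classify_document(text: str) -> str:
    """Return 'cv', 'job-vacancy-detail', or 'others' (e.g. via SVM/XGBoost)."""
    raise NotImplementedError

def segment_document(text: str) -> dict:
    """Split a document into sections (e.g. education, experience, skills)."""
    raise NotImplementedError

def extract_entities(section_text: str) -> dict:
    """Extract entities from a section (e.g. via NER, regex, pattern matching)."""
    raise NotImplementedError

def run_pipeline(text: str) -> dict:
    doc_type = classify_document(text)
    sections = segment_document(text)
    extracted = {name: extract_entities(body) for name, body in sections.items()}
    return {"type": doc_type, "sections": extracted}
```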
Insight Provision:
• Provide insights that can assist in document maintenance, scoring, matching, and auto-filling
forms, thereby facilitating a more streamlined recruitment workflow.
• Explore the applicability of the developed methods to other domains and types of unstructured
documents in the future.
• Develop a user-friendly interface that allows recruiters and HR professionals to easily interact
with the system, making it accessible for users with varying levels of technical expertise.
• Establish performance metrics to evaluate the effectiveness of the NLP pipeline, ensuring that
the system meets the desired accuracy and efficiency standards for information extraction tasks.
1.3 Scope & Limitations of the Project
1. Targeted Document Types: Focusing primarily on CVs and job vacancy descriptions
within the IT sector, allowing for specialized extraction techniques tailored to these document
types.
3. Integration with Recruitment Systems: Integrating the developed NLP pipeline with
existing recruitment systems and applicant tracking systems to streamline workflows and improve
data processing.
4. Evolving Language and Formats: The project may face challenges in adapting to
evolving language usage, terminology, and document formats, which could require ongoing
updates to the NLP models.
CHAPTER 2
2. LITERATURE SURVEY
A literature survey for the project on the application of NLP for information extraction from
unstructured documents would include an exploration of existing research and methodologies in
related areas. Here's a structured overview:
1. Information Extraction in NLP
Key Papers:
2. CV Parsing Techniques
Key Papers:
Key Papers:
Key Papers:
CHAPTER 3
3.1 Project Planning And Research
• Define project scope to extract structured information from unstructured
documents, focusing on CVs and job vacancy details.
• Research and explore adaptive NLP models for continuous improvement and
optimization in processing diverse document types.
3.2 Software Requirement Specification
IDE: Jupyter Notebook, PyCharm, or Visual Studio Code for writing, debugging, and testing code.
Machine Learning: Scikit-Learn (for splitting data, preprocessing, and evaluation metrics).
Data Visualization: Matplotlib, Seaborn (for visualizing data distribution and model results).
Database Management: MySQL or SQLite (if handling large datasets that require efficient
querying and storage).
Documentation: Jupyter Notebook or Markdown files for documenting code and project findings.
Deployment Platform: Flask or FastAPI (if building a web interface for model deployment).
3.3 Model Selection And Architecture
Model Selection
• Document Classification:
• Evaluation Metrics:
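As a hedged illustration of how candidate classifiers could be compared for document classification, a small scikit-learn sketch follows. The report mentions SVM, Naive Bayes, and Random Forest as the algorithms tried; the documents, labels, and the TF-IDF representation below are placeholders assumed for illustration.

```python
# Illustrative model-selection sketch: compare candidate classifiers mentioned
# in this report (SVM, Naive Bayes, Random Forest) on TF-IDF features using
# cross-validated accuracy. The documents and labels below are tiny placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

documents = [
    "experienced python developer with a degree in computer science",
    "seeking a java engineer, 3+ years experience, bachelor's required",
    "software engineer cv: skills include sql, machine learning, nlp",
    "job vacancy: frontend developer, responsibilities include react apps",
    "certificate of completion awarded for attending the workshop",
    "news article about the latest trends in information technology",
]
labels = ["cv", "job-vacancy-detail", "cv", "job-vacancy-detail", "others", "others"]

candidates = {
    "SVM": make_pipeline(TfidfVectorizer(), LinearSVC()),
    "NaiveBayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "RandomForest": make_pipeline(TfidfVectorizer(), RandomForestClassifier()),
}

for name, model in candidates.items():
    scores = cross_val_score(model, documents, labels, cv=2, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.2f}")
```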
Architecture
CHAPTER 4
4.1 Introduction
The introduction of the paper discusses the growing interest surrounding data and the significant
investments made to implement statistical methods and extract analytics from various sources. It
highlights the limitations of traditional applicant tracking systems, which often rely on rule-based
methods and fail to leverage contemporary techniques for retrieving valuable information from
candidates' CVs and job descriptions. To address this challenge, the paper proposes the
implementation of a Natural Language Processing (NLP) pipeline designed to extract structured
information from a diverse range of textual documents, specifically focusing on CVs and job
vacancy information in the Information Technology (IT) field. This approach aims to automate the
recruitment process by efficiently extracting key information such as personal details, educational
background, and work experience from CVs, as well as job position and required skills from job
vacancies. The authors emphasize the importance of developing methods that can accurately
identify and extract relevant information, thereby enhancing the efficiency of document
maintenance and scoring in recruitment contexts.
4.3 Data Set Descriptions
The dataset described in the paper consists of a collection of documents used for
training and testing the models for document classification and information
extraction. Here are the details regarding the dataset:
A total of 1402 documents were used for training the classification model.
Document Types:
Others: 200 documents (which include various documents like news articles and
training certificates)
Training Data: 75% of the documents were used for training the model.
Testing Data: The remaining 25% were used for testing the model's performance.
Preprocessing:
Tokenization
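A hedged sketch of the tokenization step and the 75/25 train/test split described above is shown below; the sample documents, the simple regex tokenizer, and the use of scikit-learn's split helper are assumptions for illustration (a library such as NLTK or spaCy could equally provide the tokenization).

```python
# Hedged sketch of the preprocessing and 75/25 split described above.
# `documents` and `labels` are placeholders for the project's own data.
import re
from sklearn.model_selection import train_test_split

def tokenize(text: str) -> list:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

documents = ["John Doe, Python developer since 2019.", "Hiring a data engineer in Hyderabad."]
labels = ["cv", "job-vacancy-detail"]

tokenized = [tokenize(doc) for doc in documents]

# 75% of the documents for training, the remaining 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    tokenized, labels, test_size=0.25, random_state=42
)
```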
4.4 Data Preprocessing Techniques
Data Cleaning:
• Handling Missing Values: Identifying and imputing missing values using mean, median, or
other techniques. Alternatively, rows with too many missing values can be dropped if they
are few.
• Outlier Detection: Detecting and handling outliers, especially in numeric columns like lung
function metrics (FEV1, FVC) and walk test measurements, using techniques like Z-score
or IQR filtering.
Data Transformation:
Feature Engineering:
• Age Binning: Utilizing the "AGEquartiles" feature to represent age as categorical bins or
creating custom bins based on age ranges.
• Combining Features: Creating new features by combining existing ones, such as deriving
an index or score from multiple quality-of-life indicators (e.g., CAT, HAD, SGRQ) to get a
composite health score.
• Handling Imbalanced Classes: If the severity levels of COPD (such as "SEVERE" and
"VERY SEVERE") are imbalanced, using techniques like Synthetic Minority Oversampling
Technique (SMOTE) or under-sampling to balance class distribution.
Splitting Data:
• Train-Test Split: Dividing the dataset into training and test sets to evaluate the model's
performance on unseen data.
• Cross-Validation: Implementing k-fold cross-validation during model training to ensure
robustness and reduce variance in model performance.
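As a hedged illustration of the train-test split and k-fold cross-validation mentioned above, a short scikit-learn sketch follows; the feature matrix, labels, and the logistic-regression placeholder model are assumptions, not the project's actual data or classifier.

```python
# Illustrative train/test split and k-fold cross-validation on placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # placeholder feature matrix
y = rng.integers(0, 2, size=100)   # placeholder binary labels

# Hold out a test set for the final evaluation on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# k-fold cross-validation on the training portion to check robustness.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=cv)
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```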
4.5 Methods And Algorithms
1. Document Classification:
• Algorithms: Support Vector Machines (SVM) and XGBoost were implemented for
classifying documents into three classes: CV, job-vacancy-detail, and others.
• Training Data: A total of 10,670 documents were used for training, with an
accuracy of 98.7% achieved using the SVM model.
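A minimal sketch of how such a three-class document classifier could be trained with scikit-learn is shown below; the training texts are placeholders, and the TF-IDF + LinearSVC combination is an assumption about the feature representation, which the report does not specify.

```python
# Hedged sketch of a three-class document classifier (cv / job-vacancy-detail / others).
# The texts below are placeholders; TF-IDF features with a linear SVM are assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "curriculum vitae of a python developer, b.tech in computer science",
    "we are hiring a backend engineer, required skills: java, sql",
    "training certificate awarded for completing an nlp workshop",
]
train_labels = ["cv", "job-vacancy-detail", "others"]

classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(train_texts, train_labels)

print(classifier.predict(["job opening for a machine learning engineer"]))
```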
2. NLP Techniques:
• Part of Speech (POS) Tagging: Used to identify the grammatical parts of words in
the text.
• Skills Pattern Matching Component: Extracts different skills from the text.
• Word Embedding Component: Extracts the embedding value of words for further
processing.
• Conditional Random Field (CRF): A probabilistic graphical model used for training the NER
model that tags entities in CVs.
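As a hedged illustration of the skills pattern-matching component, a small sketch using spaCy's PhraseMatcher is shown below; the report does not name the matching library, so spaCy and the sample skills list are assumptions for illustration.

```python
# Hedged sketch of a skills pattern-matching component using spaCy's PhraseMatcher.
# spaCy is an assumption (the report does not name the library); the skills list
# is a small illustrative sample.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # a trained pipeline (e.g. en_core_web_sm) would also add POS tags
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

skills = ["python", "machine learning", "sql", "natural language processing"]
matcher.add("SKILL", [nlp.make_doc(s) for s in skills])

doc = nlp("Worked on SQL pipelines and machine learning models in Python.")
found = sorted({doc[start:end].text.lower() for _, start, end in matcher(doc)})
print(found)  # ['machine learning', 'python', 'sql']
```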
CHAPTER 5
5.1 Introduction
Information Extraction (IE) is a crucial process that identifies and extracts relevant information
from unstructured documents, transforming it into a structured format suitable for storage,
processing, and retrieval. Extracting information from unstructured documents presents greater
challenges compared to structured ones due to variability in formats and the necessity of
pinpointing specific types of information. This research specifically targets CVs and job vacancy
details within the IT field, aiming to streamline the recruitment process by automating the selection
of suitable candidates. As the volume of data available online continues to grow, effective
information extraction techniques are essential for managing and utilizing this data efficiently in
recruitment.
5.3 Model Implementation and Training
The model implementation described in the document involves several key components. Firstly, for
document categorization, a Support Vector Machine (SVM) model was utilized to classify
documents into three categories: CV, job-vacancy-detail, and others, achieving an impressive
accuracy of 98.7%. Other algorithms such as Naive Bayes and Random Forest were tested, but
SVM outperformed them. Secondly, the segmentation of CVs was accomplished using a
GaussianNB classifier to identify section titles, allowing for the effective splitting of CVs into
different parts based on their content. For Named Entity Recognition (NER), a custom Stanford
NER model was developed using the Conditional Random Field (CRF) algorithm, trained on
approximately 350 CVs. This model employed various tags, including PER (person), LOC
(location), DATE, ORG (organization), DESIG (designation), and others, to accurately extract
relevant information from the segmented CVs. Overall, these implementations facilitated structured
information extraction from unstructured documents effectively.
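The report states that the NER model was trained with the Conditional Random Field algorithm on approximately 350 CVs. As a hedged, library-agnostic illustration of the kind of token features a CRF tagger typically consumes, a small feature-extraction sketch is shown below; the feature set is an assumption, not the one used in the project.

```python
# Illustrative token-feature extraction of the kind commonly fed to a CRF tagger
# (e.g. via sklearn-crfsuite or Stanford NER feature templates). The exact
# feature set used in the project is not specified; this is an assumption.
def token_features(tokens, i):
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sentence = ["John", "Doe", "worked", "at", "Acme", "Corp", "since", "2019"]
X = [token_features(sentence, i) for i in range(len(sentence))]
# Corresponding gold tags (PER, ORG, DATE, O, ...) would be supplied during training.
print(X[0])
```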
• "PERSON": Contains a list of names identified as people: "John Doe" and "Jane Smith".
• "ORG": An empty list, suggesting no organizations were identified in the source data.
"contact_info" represents contact details, but both "phones" and "emails" are empty lists, indicating
no phone numbers or email addresses were found.
Overall: This JSON snippet provides a structured representation of people, a date, locations, and
the lack of contact details. This type of format is often used for data extraction and analysis.
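For reference, the structure described above might look roughly like the following, rendered here as a Python dictionary; the field layout is an assumption based on the description, and the DATE and LOC values are placeholders since the report does not show them.

```python
# Hypothetical shape of the extracted-entity output described above.
extracted = {
    "PERSON": ["John Doe", "Jane Smith"],
    "ORG": [],
    "DATE": ["<date found in the source text>"],        # placeholder value
    "LOC": ["<locations found in the source text>"],    # placeholder value
    "contact_info": {"phones": [], "emails": []},
}
```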
5.4 Model Evaluation Metrics
Accuracy:
The model's overall accuracy, calculated as the ratio of correct predictions to total predictions,
stands at approximately 90%.
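A hedged sketch of how accuracy, computed as the ratio of correct predictions to total predictions, could be obtained with scikit-learn is shown below; the true and predicted labels are placeholders, not the project's results.

```python
# Illustrative evaluation: overall accuracy plus a per-class report.
# y_true and y_pred are placeholder label lists.
from sklearn.metrics import accuracy_score, classification_report

y_true = ["cv", "cv", "job-vacancy-detail", "others", "cv", "others"]
y_pred = ["cv", "cv", "job-vacancy-detail", "cv", "cv", "others"]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
print(classification_report(y_true, y_pred))
```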
5.5 Model Deployment: Testing And Validation
Pre-Deployment Testing:
Validation:
Post-Deployment Monitoring:
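Chapter 3 lists Flask or FastAPI as possible deployment platforms; as a hedged illustration, a minimal Flask sketch of an extraction endpoint follows. The route, payload format, and the `run_pipeline` helper are illustrative assumptions, not the project's actual service.

```python
# Minimal Flask deployment sketch (assumed, not the project's actual service).
# `run_pipeline` is a placeholder for the NLP pipeline described in this report.
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_pipeline(text: str) -> dict:
    return {"type": "unknown", "sections": {}}  # placeholder implementation

@app.route("/extract", methods=["POST"])
def extract():
    payload = request.get_json(force=True)
    result = run_pipeline(payload.get("text", ""))
    return jsonify(result)

if __name__ == "__main__":
    app.run(debug=True)
```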
5.6 Results
CHAPTER 6