Application of NLP for Information Extraction from Unstructured Documents
A project report submitted in partial fulfillment of the requirements for the award of the degree of

Bachelor of Technology

in

COMPUTER SCIENCE AND ENGINEERING (AI&ML)

Submitted by
Karrem Praneeth Reddy : 2111CS020348

Pawar Pranay : 2111CS020349

Tekumatla Pranay : 2111CS020350

Vadde Pranay : 2111CS020351

Goli Praneeth Sai Saran : 2111CS020353

Under the guidance of

Ms. Maddi Sri V.S. Suneeta

COLLEGE CERTIFICATE

This is to certify that this is a bonafide record of the application development entitled “Application
of NLP for Information Extraction from Unstructured Documents” submitted by K. Praneeth
(2111CS020348), P. Pranay (2111CS020349), T. Pranay (2111CS020350), V. Pranay
(2111CS020351), and G. Praneeth Sai Saran (2111CS020353) of B.Tech IV year I semester, Department
of CSE (AI&ML), during the year 2024-25. The results embodied in the report have not been
submitted to any other university or institute for the award of any degree or diploma.

PROJECT GUIDE: Ms. Maddi Sri V.S. Suneeta

HEAD OF THE DEPARTMENT: Dr. Sujith Das

DEAN, CSE (AI&ML): Dr. Thayyaba Khatoon

EXTERNAL EXAMINER

ACKNOWLEDGEMENT

We sincerely thank our Dean, Dr. Thayyaba Khatoon, for her constant support and motivation. A
special acknowledgement goes to a friend who enthused us from the backstage. Last but not least,
our sincere appreciation goes to our families, who have been tolerant and understanding of our
moods and have extended timely support.

We would like to express our gratitude to all those who extended their support and suggestions to
help us build this application. Special thanks to our guide, Ms. Maddi Sri V.S. Suneeta, whose help,
stimulating suggestions, and encouragement assisted us throughout the course of project
development.

Abstract

The increasing interest in data has led to significant investments in developing tools that can analyze and
extract useful information from various sources. However, when it comes to applicant tracking
systems (ATS) that gather information from candidates' resumes and job descriptions, most approaches are
still rule-based and do not fully utilize modern techniques. This is surprising because, although the
content of these documents may vary, their structure is usually quite similar. In this paper, we introduce a
Natural Language Processing (NLP) pipeline designed to extract structured information from a wide
range of textual documents, with a focus on those used in applicant tracking systems, such as resumes and
job postings. The pipeline employs several NLP techniques, including document classification,
segmentation, and text extraction. To classify the documents, we use algorithms like Support Vector
Machines (SVM) and XGBoost, which help in accurately identifying the type of document based on its
content. After classification, the documents are divided into different sections using methods such as
chunking, regular expressions, and Part-of-Speech (POS) tagging. These techniques allow us to identify
and focus on the most important parts of the document. Finally, we use tools like Named Entity
Recognition (NER), regular expressions, and pattern matching to extract relevant information from each
section. The structured data obtained can be used to improve various processes, such as document
organization, scoring, matching, and auto-filling forms, making applicant tracking systems more efficient
and effective for both job seekers and employers.

CONTENTS

CHAPTER NO    TITLE

1   INTRODUCTION
    1.1 Problem Definition
    1.2 Objective of the Project
    1.3 Scope of the Project

2   LITERATURE REVIEW

3   ANALYSIS
    3.1 Project Planning and Research
    3.2 Software Requirement Specification
        3.2.1 Software Requirement
        3.2.2 Hardware Requirement
    3.3 Model Selection and Architecture

4   DESIGN
    4.1 Introduction
    4.2 UML Diagram
    4.3 Dataset Description
    4.4 Data Preprocessing Techniques
    4.5 Methods & Algorithms

5   DEPLOYMENT AND RESULTS
    5.1 Introduction
    5.2 Source Code
    5.3 Model Implementation and Training
    5.4 Model Evaluation Metrics
    5.5 Model Deployment: Testing and Validation
    5.6 Results

6   CONCLUSION
    6.1 Project Conclusion
    6.2 Future Scope

CHAPTER 1

1. INTRODUCTION

• Information Extraction (IE) identifies and extracts relevant information from unstructured
documents, converting it into a structured format suitable for storage, processing, and
retrieval.

• Extracting information from unstructured documents is more complex than from structured
ones due to the variability in formats and the need to identify specific types of information.

• The research specifically targets CVs and job vacancy details within the IT field, aiming to
streamline the recruitment process by automating the selection of suitable candidates.

• The extraction methods will focus on critical details from CVs, including personal
information, educational background, and work experience, while job vacancies will reveal
job positions, required skills, responsibilities, and educational qualifications.

• By developing effective extraction methods, the approach aims to ease the manual
recruitment process, making it more efficient and less time-consuming for recruiters.

• With the increasing volume of data available online, effective information extraction
techniques are essential for managing and utilizing this data effectively in various
applications, including recruitment.

1.1 Problem Definition

The recruitment process often involves handling a vast number of unstructured documents, such
as CVs and job descriptions, which can be time-consuming and challenging to process manually.
Extracting relevant information from these documents is complicated due to their varied formats
and the need to identify specific types of information. This leads to inefficiencies and potential
errors in candidate selection. Consequently, there is a pressing need for an automated system that
can accurately parse these documents, categorize them, and extract essential details such as
personal information, educational background, work experience, and job requirements. By
implementing an effective information extraction approach, the recruitment process can be
streamlined, ultimately saving time and resources while improving the accuracy of candidate selection.

Key Components:

1. Custom spaCy Pipeline Components (a sketch of such a pipeline follows this list):

• Category Component (document type identification)

• Segmentation Component (CV section separation)

• Profile NER Parsing Component (name/address extraction)

2. Document Processing Features:

• Document Categorization (using SVM with 98.7% accuracy)

• CV Segmentation (using a GaussianNB classifier)

• Named Entity Recognition (NER) using the CRF algorithm

• Information extraction for:

  • Personal information

  • Educational background

  • Work experience

3. Technical Implementation:

• Machine Learning algorithms (SVM, GaussianNB)

• Natural Language Processing techniques
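The report does not reproduce the pipeline code itself, so the following is a minimal, illustrative sketch of how custom components such as the category and segmentation components could be registered in a spaCy v3 pipeline. The component names, the `doc._.category` and `doc._.sections` extensions, and the keyword rules are assumptions for illustration only; in the actual system the category decision comes from a trained SVM/XGBoost classifier and the segmentation from a GaussianNB model.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Custom extensions to hold the document type and its sections (illustrative names).
Doc.set_extension("category", default=None)
Doc.set_extension("sections", default=None)

@Language.component("category_component")
def category_component(doc):
    # Placeholder rule: a trained SVM/XGBoost classifier would be called here instead.
    text = doc.text.lower()
    doc._.category = "cv" if "experience" in text or "education" in text else "other"
    return doc

@Language.component("segmentation_component")
def segmentation_component(doc):
    # Placeholder segmentation: split the CV on a few assumed section titles.
    sections, current = {}, "profile"
    for line in doc.text.splitlines():
        title = line.strip().lower()
        if title in {"education", "experience", "skills"}:
            current = title
            sections[current] = []
        else:
            sections.setdefault(current, []).append(line)
    doc._.sections = {k: "\n".join(v) for k, v in sections.items()}
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("category_component")
nlp.add_pipe("segmentation_component")

doc = nlp("John Doe\nEducation\nB.Tech, CSE\nExperience\nSoftware Engineer, 2021-2023")
print(doc._.category, list(doc._.sections))
```

Each component receives the `Doc`, annotates it through custom extensions, and passes it on, which is how the category, segmentation, NER parsing, and skill-matching components described above can be chained.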


1.2 Objective of the Project

The objective of the project is to develop an efficient and automated Natural Language Processing
(NLP) pipeline for the extraction of structured information from unstructured documents,
specifically targeting CVs and job vacancy descriptions in the IT field. The project aims to:

1. Custom NLP Pipeline Implementation

Implement a custom NLP pipeline using techniques such as document classification, segmentation,
and Named Entity Recognition (NER) to accurately identify and extract relevant information.

2. Recruitment Process Enhancement

Enhance the recruitment process by automating the extraction of key details, including personal
information, educational qualifications, work experience, and required skills from CVs and job
postings.

3. Accuracy and Efficiency Improvement

Improve the accuracy and efficiency of candidate selection by minimizing manual processing time
and reducing the risk of human error.

4. Insight Provision

Provide insights that can assist in document maintenance, scoring, matching, and auto-filling
forms, thereby facilitating a more streamlined recruitment workflow.

5. Future Applicability Exploration

Explore the applicability of the developed methods to other domains and types of unstructured
documents in the future.

6. User-Friendly Interface Development

Develop a user-friendly interface that allows recruiters and HR professionals to easily interact with
the system, making it accessible for users with varying levels of technical expertise.

7. Performance Metrics Evaluation

Establish performance metrics to evaluate the effectiveness of the NLP pipeline, ensuring that the
system meets the desired accuracy and efficiency standards for information extraction tasks.

1.3 Scope & Limitations of the project

Scope of the Project

1. Targeted Document Types: Focusing primarily on CVs and job vacancy descriptions
within the IT sector, allowing for specialized extraction techniques tailored to these document
types.

2. NLP Techniques Utilization: Utilizing advanced NLP techniques, including machine learning
algorithms and deep learning models, to enhance the accuracy of information extraction.

3. Integration with Recruitment Systems: Integrating the developed NLP pipeline with
existing recruitment systems and applicant tracking systems to streamline workflows and improve
data processing.

4. Scalability: Designing the system to be scalable, enabling it to handle a growing volume of
documents as recruitment needs increase.

Limitations of the Project

1. Domain-Specific Focus: The current implementation is primarily focused on the IT sector,
which may limit its applicability to other fields or industries where document formats and
information requirements differ.

2. Dependence on Document Quality: The effectiveness of the NLP pipeline may be impacted
by the quality and consistency of the input documents, as unstructured formats can vary
significantly.

3. Training Data Constraints: The performance of machine learning models is dependent on
the quality and quantity of training data. Limited or biased datasets may lead to suboptimal
extraction results.

4. Evolving Language and Formats: The project may face challenges in adapting to
evolving language usage, terminology, and document formats, which could require ongoing
updates to the NLP models.

CHAPTER 2

2. LITERATURE SURVEY

A literature survey for the project on the application of NLP for information extraction from
unstructured documents would include an exploration of existing research and methodologies in
related areas. Here’s a structured overview:

1. Information Extraction in NLP

Key Papers:

• “Natural Language Processing for Information Extraction” (arXiv, 2018): Discusses various
NLP techniques for effective information extraction from diverse text sources.

2. CV Parsing Techniques

Key Papers:

• “Application of Machine Learning Algorithms to an Online Recruitment System”
(International Conference on Internet and Web Applications and Services, 2015):
Investigates different machine learning approaches for improving online recruitment
efficiency, including CV parsing.

3. Named Entity Recognition (NER) in Healthcare and Other Domains

Key Papers:

• “FoodIE: A Rule-based Named-entity Recognition Method for Food Information
Extraction” (Proceedings of the International Conference on Pattern Recognition
Applications and Methods, 2019): Highlights a rule-based approach for NER,
demonstrating its effectiveness in specific domains.

4. Machine Learning and Deep Learning in Information Extraction

Key Papers:

• “Deep Q-Learning for Medical Decision Making” (Proceedings of the IEEE International
Conference on Healthcare Informatics, 2020): Discusses the potential of deep learning
techniques in decision-making tasks, relevant to information extraction methodologies.

CHAPTER 3
3.1 Project Planning And Research
• Define project scope to extract structured information from unstructured
documents, focusing on CVs and job vacancy details.

• Develop objectives for implementing an NLP pipeline for document
classification, segmentation, and information extraction.
• Review existing NLP techniques and models such as Support Vector
Machines (SVM), XGBoost, and Named Entity Recognition (NER) for
effective information extraction.
• Analyze the application of machine learning algorithms and NLP techniques
in recruitment systems for improving CV parsing and job matching.
• Collect data from CVs and job vacancies related to the IT field, ensuring a
diverse set of documents for training and testing.
• Schedule implementation: data preparation (Weeks 1-2), model design
(Weeks 3-5), training/testing (Weeks 6-8), deployment (Weeks 9-10).

• Research and explore adaptive NLP models for continuous improvement and
optimization in processing diverse document types.

3.2 Software Requirement Specification

3.2.1 Software Requirement


Programming Language: Python (for data processing, model development, and implementation).

IDE: Jupyter Notebook, PyCharm, or Visual Studio Code for writing, debugging, and testing code.

Data Processing: Pandas, NumPy (for data handling and preprocessing).

Machine Learning: Scikit-Learn (for splitting data, preprocessing, and evaluation metrics).

Deep Learning: TensorFlow or PyTorch (for implementing deep learning models such as word
embeddings).

Data Visualization: Matplotlib, Seaborn (for visualizing data distribution and model results).

Database Management: MySQL or SQLite (if handling large datasets that require efficient
querying and storage).

Operating System: Windows, macOS, or Linux, depending on compatibility and resource
availability.

Version Control: Git (for tracking code changes and collaboration).

Documentation: Jupyter Notebook or Markdown files for documenting code and project findings.

Deployment Platform: Flask or FastAPI (if building a web interface for model deployment).

3.2.2 Hardware Requirement


• Processor: Multi-core processor, ideally Intel i5 or AMD Ryzen 5 and above, to handle
complex computations efficiently.
• RAM: Minimum 8 GB, recommended 16 GB or higher, for handling large datasets and
model training without lag.
• GPU: NVIDIA GPU (e.g., GTX 1060 or higher) or equivalent, especially for deep learning
models, as it speeds up model training tasks.
• Storage: Minimum 256 GB SSD, recommended 512 GB or more, for faster data access and
storage, especially if dealing with large document datasets.
• Cooling System: A good cooling system to prevent overheating during extensive model
training sessions, which can be resource-intensive.
• Network Connectivity: Stable internet connection for downloading dependencies,
libraries, and datasets, as well as for cloud-based model training if applicable.

3.3 Model Selection And Architecture

Model Selection

• Document Classification (see the sketch after this list):

  Support Vector Machines (SVM): Chosen for its high accuracy (98.7%) in classifying
  documents into categories (CV, job vacancy details, and others).

  XGBoost: An alternative model considered for classification tasks.

• Segmentation:

  Gaussian Naive Bayes (GaussianNB): Used for segmenting CVs based on identified titles to
  categorize different sections such as personal information, experience, and education.

• Text Extraction Techniques:

  Regular Expressions (regex): Employed for pattern matching and extracting specific
  information from text segments.

  CSV Parsing: For extracting predefined lists of skills, nationalities, and languages from CVs.

• Evaluation Metrics:

  Confusion matrix and classification report generated to assess the performance of the NER
  model and other classifiers.
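The report does not include the classification code itself, so below is a minimal sketch of the document-classification step using TF-IDF features fed to a linear SVM in scikit-learn, one common way to realize the choice described above. The toy training texts and labels are invented for illustration; the reported 98.7% accuracy comes from the authors' own, much larger dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny invented corpus: real training would use the labelled CV / job-vacancy / other documents.
train_texts = [
    "Work experience: software engineer. Education: B.Tech in CSE. Skills: Python, SQL.",
    "We are hiring a backend developer. Responsibilities include building REST APIs.",
    "Local council approves new budget for road maintenance next year.",
]
train_labels = ["cv", "job-vacancy-detail", "others"]

classifier = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("svm", LinearSVC()),
])
classifier.fit(train_texts, train_labels)

print(classifier.predict(["Education: MSc. Experience: data analyst at a bank. Skills: Excel."]))
```

An XGBoost classifier could be swapped into the same pipeline in place of `LinearSVC` to compare the two models on identical features.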

Architecture

CHAPTER 4

4.1 Introduction
The introduction of the paper discusses the growing interest in data and the significant
investments made to implement statistical methods and extract analytics from various sources. It
highlights the limitations of traditional applicant tracking systems, which often rely on rule-based
methods and fail to leverage contemporary techniques for retrieving valuable information from
candidates' CVs and job descriptions. To address this challenge, the paper proposes the
implementation of a Natural Language Processing (NLP) pipeline designed to extract structured
information from a diverse range of textual documents, specifically focusing on CVs and job
vacancy information in the Information Technology (IT) field. This approach aims to automate the
recruitment process by efficiently extracting key information such as personal details, educational
background, and work experience from CVs, as well as job position and required skills from job
vacancies. The authors emphasize the importance of developing methods that can accurately
identify and extract relevant information, thereby enhancing the efficiency of document
maintenance and scoring in recruitment contexts.

4.2 UML Diagram

4.3 Data Set Descriptions

The dataset described in the paper consists of a collection of documents used for
training and testing the models for document classification and information
extraction. Here are the details regarding the dataset:

Total Number of Documents:

A total of 1402 documents were used for training the classification model.

Document Types:

The documents were categorized into three classes:

CVs: 590 documents

Job Vacancy Details: 512 documents

Others: 200 documents (which include various documents like news articles and
training certificates)

Training and Testing Split:

Training Data: 75% of the documents were used for training the model.

Testing Data: The remaining 25% were used for testing the model's performance.

Preprocessing:

The training data underwent preprocessing steps, which included:

• Tokenization

• Removing stopwords and unwanted characters (such as punctuation, emails, and bullet points)
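As a concrete illustration of the preprocessing steps listed above, the following sketch uses spaCy's English tokenizer to tokenize a line of CV text and drop stopwords, punctuation, emails, and bullet characters. The exact cleaning rules used by the authors are not reproduced in the report, so the rules here are assumptions.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer plus English language defaults (stopword list, etc.)

def preprocess(text: str) -> list[str]:
    """Tokenize and drop stopwords, punctuation, emails, and bullet characters."""
    doc = nlp(text)
    return [
        tok.text.lower()
        for tok in doc
        if not (tok.is_stop or tok.is_punct or tok.is_space
                or tok.like_email or tok.text in {"•", "-"})
    ]

print(preprocess("• Worked as a Software Engineer at Example Corp (contact: john@example.com)"))
# e.g. ['worked', 'software', 'engineer', 'example', 'corp', 'contact']
```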

4.4 Data Preprocessing Techniques

Data Cleaning:

• Handling Missing Values: Identifying and imputing missing values using mean, median, or
other techniques. Alternatively, rows with too many missing values can be dropped if they
are few.
• Outlier Detection: Detecting and handling outliers, especially in numeric columns like lung
function metrics (FEV1, FVC) and walk test measurements, using techniques like Z-score
or IQR filtering.

Data Transformation:

• Scaling and Normalization: Applying Min-Max scaling or standardization to numerical


features such as FEV1, FVC, and walk test results to ensure uniform feature distribution.
• Encoding Categorical Variables: Converting categorical variables (e.g., Gender, COPD
Severity) into numerical format using one-hot encoding or label encoding.

Feature Engineering:

• Age Binning: Utilizing the "AGEquartiles" feature to represent age as categorical bins or
creating custom bins based on age ranges.
• Combining Features: Creating new features by combining existing ones, such as deriving
an index or score from multiple quality-of-life indicators (e.g., CAT, HAD, SGRQ) to get a
composite health score.

Balancing the Dataset:

• Handling Imbalanced Classes: If the severity levels of COPD (such as "SEVERE" and
"VERY SEVERE") are imbalanced, using techniques like Synthetic Minority Oversampling
Technique (SMOTE) or under-sampling to balance class distribution.

Splitting Data:

• Train-Test Split: Dividing the dataset into training and test sets to evaluate the model's
performance on unseen data.
• Cross-Validation: Implementing k-fold cross-validation during model training to ensure
robustness and reduce variance in model performance.
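The splitting code is not shown in the report; as a generic sketch of the train-test split and k-fold cross-validation mentioned above, here is a scikit-learn example. The feature matrix and labels are random placeholders standing in for the document features and class labels.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

# Placeholder feature matrix (e.g. TF-IDF vectors) and labels for 40 documents.
rng = np.random.default_rng(0)
X = rng.random((40, 20))
y = rng.integers(0, 3, size=40)  # three document classes

# 75/25 split, as described for the classification dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LinearSVC().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# 3-fold cross-validation on the training portion to check robustness.
print("cv accuracy:", cross_val_score(LinearSVC(), X_train, y_train, cv=3).mean())
```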

4.5 Methods And Algorithms
1. Document Classification:

• Algorithms: Support Vector Machines (SVM) and XGBoost were implemented for
classifying documents into three classes: CV, job-vacancy-detail, and others.

• Training Data: A total of 10,670 documents were used for training, with an
accuracy of 98.7% achieved using the SVM model.

2. NLP Techniques:

• Document Segmentation: Segmentation of CVs into different sections based on titles to
identify personal information, experience, education, etc.

• Named Entity Recognition (NER): Used to extract relevant information from segmented
sections of the CV.

• Part of Speech (POS) Tagging: Used to identify the grammatical parts of words in
the text.

3. Custom spaCy Pipeline Components:

• Category Component: Identifies the document type.

• Segmentation Component: Segments the CV into different sections.

• Experience and Education NER Parsing Component: Extracts information regarding
experience and education.

• Skills Pattern Matching Component: Extracts different skills from the text.

• Word Embedding Component: Extracts the embedding value of words for further
processing.

4. Conditional Random Field (CRF):

• Usage: A probabilistic graphical model used for training the NER model for tagging
entities in CVs.
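The CRF training code itself is not reproduced in the report; the sketch below shows the general shape of training a CRF tagger with the sklearn-crfsuite package, one common CRF implementation (the authors describe a custom Stanford NER model in Section 5.3). The two toy sentences, the feature function, and the PER/DESIG/ORG/O tags are illustrative assumptions.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def word_features(tokens, i):
    """Simple per-token features: the word itself, its shape, and its neighbours."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Two invented training sentences with word-level entity tags.
sents = [
    (["John", "Doe", "worked", "as", "Software", "Engineer", "at", "Acme"],
     ["PER", "PER", "O", "O", "DESIG", "DESIG", "O", "ORG"]),
    (["Jane", "Smith", "is", "a", "Data", "Scientist", "at", "Initech"],
     ["PER", "PER", "O", "O", "DESIG", "DESIG", "O", "ORG"]),
]
X_train = [[word_features(toks, i) for i in range(len(toks))] for toks, _ in sents]
y_train = [tags for _, tags in sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

test = ["Alice", "Brown", "worked", "as", "Data", "Engineer", "at", "Acme"]
print(crf.predict([[word_features(test, i) for i in range(len(test))]]))
```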
CHAPTER 5
5.1 Introduction
Information Extraction (IE) is a crucial process that identifies and extracts relevant information
from unstructured documents, transforming it into a structured format suitable for storage,
processing, and retrieval. Extracting information from unstructured documents presents greater
challenges compared to structured ones due to variability in formats and the necessity of
pinpointing specific types of information. This research specifically targets CVs and job vacancy
details within the IT field, aiming to streamline the recruitment process by automating the selection
of suitable candidates. As the volume of data available online continues to grow, effective
information extraction techniques are essential for managing and utilizing this data efficiently in
recruitment.

5.2 Source Code

5.3 Model Implementation and Training
The model implementation described in the document involves several key components. Firstly, for
document categorization, a Support Vector Machine (SVM) model was utilized to classify
documents into three categories: CV, job-vacancy-detail, and others, achieving an impressive
accuracy of 98.7%. Other algorithms such as Naive Bayes and Random Forest were tested, but
SVM outperformed them. Secondly, the segmentation of CVs was accomplished using a
GaussianNB classifier to identify section titles, allowing for the effective splitting of CVs into
different parts based on their content. For Named Entity Recognition (NER), a custom Stanford
NER model was developed using the Conditional Random Field (CRF) algorithm, trained on
approximately 350 CVs. This model employed various tags, including PER (person), LOC
(location), DATE, ORG (organization), DESIG (designation), and others, to accurately extract
relevant information from the segmented CVs. Overall, these implementations facilitated structured
information extraction from unstructured documents effectively.
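The segmentation code is not included in the report; as an illustration of how a GaussianNB classifier could flag CV lines that act as section titles, here is a small sketch with hand-crafted numeric features (word count, casing, trailing colon). The features and training lines are assumptions, not the authors' actual feature set.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def line_features(line: str) -> list[float]:
    """Numeric features that tend to distinguish section titles from body text."""
    words = line.split()
    return [
        float(len(words)),                        # titles are usually short
        float(line.strip().endswith(":")),        # titles often end with a colon
        float(line.isupper() or line.istitle()),  # titles are often upper- or title-case
    ]

train_lines = [
    "EDUCATION", "Work Experience:", "Skills",
    "Developed a REST API in Python for an internal tool.",
    "B.Tech in Computer Science, 2021, with a CGPA of 8.5.",
]
is_title = [1, 1, 1, 0, 0]

clf = GaussianNB().fit(np.array([line_features(l) for l in train_lines]), is_title)

for line in ["PROJECTS", "Maintained CI pipelines and wrote unit tests for the parser."]:
    print(line, "->", "title" if clf.predict([line_features(line)])[0] else "body")
```

Lines predicted as titles become section boundaries, and the text between consecutive titles is handed to the corresponding NER or pattern-matching component.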

5.4 Model Evaluation Metrics


Classification Report:

This is a JSON representation of structured data.

"entities" represents a list of various entities categorized by their type.

• "PERSON": Contains a list of names identified as people: "John Doe" and "Jane Smith".

• "DATE": Contains a single date: "September 15, 2023".

• "ORG": An empty list, suggesting no organizations were identified in the source data.

• "GPE": (Geopolitical Entities) Lists locations: "OpenAI", "San Francisco", and


"California".

"contact_info" represents contact details, but both "phones" and "emails" are empty lists, indicating
no phone numbers or email addresses were found.

Overall: This JSON snippet provides a structured representation of people, a date, locations, and
the lack of contact details. This type of format is often used for data extraction and analysis.

Accuracy:

The model's overall accuracy, calculated as the ratio of correct predictions to total predictions,
stands at approximately 90%.
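The JSON output described above can be produced by combining a pretrained NER model with regular expressions for contact details. The following is a hedged sketch using spaCy's en_core_web_sm model (assumed to be installed) and simple, illustrative regex patterns; it is not the exact code behind the report's output.

```python
import json
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def extract(text: str) -> dict:
    """Collect named entities by type plus regex-matched phones and emails."""
    doc = nlp(text)
    entities = {"PERSON": [], "DATE": [], "ORG": [], "GPE": []}
    for ent in doc.ents:
        if ent.label_ in entities:
            entities[ent.label_].append(ent.text)
    return {
        "entities": entities,
        "contact_info": {
            "phones": PHONE_RE.findall(text),
            "emails": EMAIL_RE.findall(text),
        },
    }

sample = ("John Doe and Jane Smith met at OpenAI in San Francisco, "
          "California on September 15, 2023.")
print(json.dumps(extract(sample), indent=2))
```

Because the sample text contains no phone number or email address, the "phones" and "emails" lists come back empty, matching the behaviour described above.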
5.5 Model Deployment: Testing And Validation
Pre-Deployment Testing:

• Unit Testing: Verify individual components for expected functionality.


• Integration Testing: Ensure that different modules work together properly.

Validation:

• Cross-Validation: Use techniques like k-fold cross-validation to evaluate model performance
on unseen data.
• A/B Testing: Compare the new model with the existing one in a controlled environment.

Post-Deployment Monitoring:

• Real-Time Monitoring: Track model performance in production for drift or degradation.


• Feedback Loop: Implement mechanisms to gather user feedback and retrain the model
accordingly.

5.6 Results

CHAPTER 6

6.1 Project Conclusion


The conclusion of the project states that the authors have demonstrated an efficient and accurate
method for structured-information extraction from textual documents. This capability is achieved
through the application of Natural Language Processing (NLP) techniques, combined with Machine
Learning (ML) and Deep Learning (DL) models. The research specifically references the use of
CVs and job vacancy information, which are commonly utilized in applicant tracking systems. The
results indicate that the system provides high evaluation metrics and improved execution times for
information extraction. The authors suggest that such optimized and accurate systems could be
beneficial in various fields, including research publications and job portals. They also acknowledge
the need for future adaptations of their methods to accommodate changes in requirements or new
types of data, and they express interest in exploring additional techniques for information extraction
from unstructured documents across different domains.

6.2 Future Scope


The future scope of the project includes the possibility of adapting and enhancing the current
methods to accommodate changing requirements or new types of data that may arise. The authors
express interest in exploring additional techniques, such as using the BERT model for embedding,
investigating document similarity instead of token similarity, and expanding their approach to
extract information from unstructured documents in other domains. This indicates a commitment
to evolving their system to improve its applicability and effectiveness in various contexts beyond
the current focus on CVs and job vacancies.
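As a pointer toward the BERT-based embedding and document-similarity direction mentioned above, the following sketch uses the sentence-transformers package to compare a CV against a job description at the document level rather than token by token. The model name and the example texts are illustrative assumptions, not part of the original work.

```python
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small BERT-style sentence encoder

cv_text = "Software engineer with 3 years of Python, Flask, and SQL experience."
job_text = "Hiring a backend developer skilled in Python and relational databases."

# Encode whole documents and compare with cosine similarity.
cv_emb, job_emb = model.encode([cv_text, job_text], convert_to_tensor=True)
score = util.cos_sim(cv_emb, job_emb).item()
print(f"document similarity: {score:.2f}")  # higher scores suggest a better CV/vacancy match
```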
