Chapter Two
This chapter provides a comprehensive review of previous research in three main areas: text categorization, the K-Nearest Neighbors (KNN) algorithm, and the Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction technique. Reviewing these developments situates the present work in its field and identifies the research gaps that this study seeks to address.
1. The distance between the new text and every text in the training dataset is calculated using a distance measure such as Euclidean distance or cosine similarity.
2. The K nearest neighbors of the new text are identified.
3. The new text is assigned the class that holds the majority among its neighbors.
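The steps above can be sketched in Python as follows. This is a minimal illustration only: the bag-of-words dictionary vectors, the `cosine_similarity` helper, and the toy training examples are assumptions made for the sketch, not part of the reviewed studies.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # a, b: sparse vectors as dicts mapping term -> weight
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query, training, k=3):
    # training: list of (vector, label) pairs
    # Step 1: rank all training texts by similarity to the query
    ranked = sorted(training,
                    key=lambda pair: cosine_similarity(query, pair[0]),
                    reverse=True)
    # Step 2: keep the K nearest neighbors
    neighbors = ranked[:k]
    # Step 3: majority vote among the neighbors' labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data for illustration
training = [({"cheap": 1, "pills": 1}, "spam"),
            ({"meeting": 1, "agenda": 1}, "ham"),
            ({"cheap": 1, "offer": 1}, "spam")]
print(knn_classify({"cheap": 1, "pills": 1}, training, k=3))  # prints "spam"
```

Note that with an even K or tied votes, `most_common` breaks ties arbitrarily; practical implementations often use an odd K or distance-weighted voting.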
2.2.3 Advantages and Disadvantages of KNN
Advantages:
Easy to implement and understand.
Does not require an explicit training phase, unlike neural networks.
Can be used with nonlinear data.
Disadvantages:
Computationally expensive on large datasets, since the distance to every training point must be calculated for each prediction.
Performance may be affected by noise in the data.
Choosing the appropriate K significantly impacts accuracy.
2.2.4 Applications of KNN in Text Classification
KNN has been used in many studies of text classification because of its simplicity and effectiveness.
TF (Term Frequency): Measures the number of times a word appears within a document.
IDF (Inverse Document Frequency): Reduces the impact of very common words across
documents, increasing the importance of rare words.
The TF-IDF weight of each word w in a document d is calculated using the equation:
TF-IDF(w, d) = TF(w, d) × IDF(w)
where:
TF(w, d) = term frequency (how often the word w appears in document d).
IDF(w) = inverse document frequency (how important the word w is across all documents).
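The equation above can be sketched in Python. This sketch assumes two common conventions not stated in the text: TF as the raw count normalized by document length, and IDF(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w; other weighting variants (raw counts, smoothed IDF) exist.

```python
import math
from collections import Counter

def tf(word, doc):
    # doc: list of tokens; TF = count of word / length of document
    return Counter(doc)[word] / len(doc)

def idf(word, corpus):
    # corpus: list of tokenized documents; IDF = log(N / document frequency)
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(word, doc, corpus):
    # Combine the two factors: TF-IDF(w, d) = TF(w, d) * IDF(w)
    return tf(word, doc) * idf(word, corpus)

# Hypothetical two-document corpus for illustration
corpus = [["the", "cat"], ["the", "dog"]]
print(tf_idf("cat", corpus[0], corpus))  # > 0: "cat" is rare, so it is weighted up
print(tf_idf("the", corpus[0], corpus))  # 0.0: "the" occurs in every document
```

As the example shows, a word that appears in every document gets an IDF of log(N/N) = 0, so its TF-IDF weight vanishes, which is exactly how the scheme suppresses very common words.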