
Chapter Two: Theory Fundamentals

This chapter aims to provide a comprehensive review of previous research related to three
main areas: text categorization, the K-Nearest Neighbors algorithm (KNN), and the Term
Frequency-Inverse Document Frequency (TF-IDF) feature extraction technique. The review situates previous developments in this field and identifies the research gaps that the present work seeks to address.

2.1 Text Classification


2.1.1 Definition of Text Classification
Text classification is the process of assigning one or more predefined categories to a given
text based on its content. It is widely used in various applications, such as news
classification, sentiment analysis, and spam detection.

2.1.2 The Importance of Text Classification

Text classification techniques have become essential due to the vast amount of text
available on the internet. Automatic classification allows text data to be organized
efficiently, facilitating information retrieval and decision-making based on automated text
analysis.

2.1.3 Common Methods in Text Classification


There are several methods used for text classification, including:

Rule-Based Methods: These rely on manually crafted rules to define classes.

Statistical Methods: These rely on analyzing word distributions and frequencies, such as
Naïve Bayes.

Machine Learning Methods: These include KNN, SVM, and Decision Trees, which rely on
trained models to classify texts.

Deep Learning Methods: These include CNN, RNN, and Transformer models, which rely on
neural networks to process text.

2.2 K-Nearest Neighbors (KNN) Algorithm
2.2.1 Definition of KNN
The KNN algorithm is one of the simplest and most popular supervised machine learning
algorithms. This algorithm is based on the neighbor principle, whereby any new text is
classified based on the majority class of its K nearest neighbors in the training dataset.
2.2.2 KNN Working Principle
The algorithm works as follows:

1. The distance between the new text and all texts in the training dataset is calculated
using a distance measure such as Euclidean distance or cosine similarity.
2. The K nearest neighbors of the new text are identified.
3. The new text is assigned the majority class among those neighbors.
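The three steps above can be sketched in plain Python; the toy vectors and labels below are illustrative stand-ins for vectorized documents, not data from any real study:

```python
from collections import Counter
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length numeric vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_classify(query, training_vectors, labels, k=3):
    # Step 1: similarity between the new text and every training text.
    sims = [cosine_similarity(query, v) for v in training_vectors]
    # Step 2: indices of the K most similar (nearest) neighbors.
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # Step 3: majority vote among the neighbors' classes.
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Toy vectors standing in for vectorized documents.
train = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0.2], [0, 0.8, 0.3]]
labels = ["sports", "sports", "politics", "politics"]
print(knn_classify([0.95, 0.05, 0], train, labels, k=3))  # "sports"
```

With cosine similarity, "nearest" means "most similar", so the candidates are sorted in descending order; with Euclidean distance the sort would be ascending instead.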
2.2.3 Advantages and Disadvantages of KNN
Advantages:
Easy to implement and understand.
Does not require complex training like neural networks.
Can be used with nonlinear data.
Disadvantages:
Requires intensive computations when using large datasets, as the distance is calculated for
each point.
Performance may be affected by noise in the data.
Choosing the appropriate K significantly impacts accuracy.
2.2.4 Applications of KNN in Text Classification
KNN has been used in many studies for text classification due to its simplicity and
effectiveness. Some applications include:

Classifying news according to topic (sports, politics, economics, etc.).
Sentiment analysis in product reviews or tweets.
Classifying emails into spam and non-spam.
2.3 TF-IDF Feature Extraction Technique
2.3.1 Definition of TF-IDF
TF-IDF (Term Frequency - Inverse Document Frequency) is a technique used in natural
language processing (NLP) to convert text into a numerical representation based on the
importance of words within documents.

2.3.2 How TF-IDF Works


This technique relies on two metrics:

TF (Term Frequency): Measures how often a word appears within a document.
IDF (Inverse Document Frequency): Reduces the weight of words that are common across
many documents, increasing the importance of rare words.

The TF-IDF score of a word w in a document d is calculated using the equation:

TF-IDF(w, d) = TF(w, d) × IDF(w)

where:
TF(w, d) = term frequency (how often w appears in document d).
IDF(w) = inverse document frequency (how rare w is across all documents), commonly
computed as log(N / df(w)), where N is the total number of documents and df(w) is the
number of documents containing w.
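A minimal sketch of this calculation, using raw counts for TF and the logarithmic IDF above (libraries such as scikit-learn apply a smoothed variant, so exact values will differ):

```python
import math

def tf_idf(word, doc, corpus):
    # TF(w, d): raw count of the word in the document (a list of tokens).
    tf = doc.count(word)
    # IDF(w): log of total documents over documents containing the word.
    df = sum(1 for d in corpus if word in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

docs = [
    "the match ended in a draw".split(),
    "the market closed higher".split(),
    "the team won the match".split(),
]
# "the" appears in every document, so IDF = log(3/3) = 0 and its weight vanishes.
print(tf_idf("the", docs[2], docs))    # 0.0
# "match" appears in only two of three documents, so it keeps a positive weight.
print(tf_idf("match", docs[0], docs))  # log(3/2) ≈ 0.405
```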

2.3.3 Advantages of TF-IDF


It assigns higher weight to more important words, improving classification performance.
It reduces the impact of unhelpful common words such as conjunctions.
It is widely used in information retrieval and search engines.
2.3.4 Use of TF-IDF in Text Classification
TF-IDF is used to extract text features and convert them into numerical representations that can
be used with classification algorithms such as KNN. Studies have shown that this technique
provides good accuracy when used with various classifiers.
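As an illustration of such a pipeline, the sketch below combines scikit-learn's TfidfVectorizer with KNeighborsClassifier; the tiny corpus and labels are invented for the example and do not come from the studies discussed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "the team won the final match",
    "a thrilling game of football",
    "stocks fell as markets reacted",
    "the central bank raised rates",
]
labels = ["sports", "sports", "economics", "economics"]

# TF-IDF turns raw text into weighted vectors; KNN then votes among
# the 3 most similar training documents (cosine similarity).
model = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=3, metric="cosine"),
)
model.fit(texts, labels)
print(model.predict(["the match was a great game"])[0])  # "sports"
```

Fitting the vectorizer and the classifier inside one pipeline ensures the same vocabulary and IDF weights learned from the training texts are applied to new texts at prediction time.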
