Chapter Two
This chapter provides a comprehensive review of previous research in three main areas: text categorization, the K-Nearest Neighbors (KNN) algorithm, and the Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction technique. Reviewing these developments situates the present work in its field and identifies the research gaps that this study seeks to address.
1. The distance between the new text and every text in the training dataset is calculated using a distance measure such as Euclidean distance or cosine similarity.
2. The K nearest neighbors of the new text are identified.
3. The new text is assigned the class that holds the majority among its neighbors.
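The steps above can be sketched in Python as follows. This is a minimal illustration only: the bag-of-words dictionary vectors, the `cosine_similarity` helper, and the toy training examples are assumptions made for the sketch, not part of the reviewed studies.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # a, b: sparse vectors as dicts mapping term -> weight
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query, training, k=3):
    # training: list of (vector, label) pairs
    # Step 1: rank all training texts by similarity to the query
    ranked = sorted(training,
                    key=lambda pair: cosine_similarity(query, pair[0]),
                    reverse=True)
    # Step 2: keep the K nearest neighbors
    neighbors = ranked[:k]
    # Step 3: majority vote among the neighbors' labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data for illustration
training = [({"cheap": 1, "pills": 1}, "spam"),
            ({"meeting": 1, "agenda": 1}, "ham"),
            ({"cheap": 1, "offer": 1}, "spam")]
print(knn_classify({"cheap": 1, "pills": 1}, training, k=3))  # prints "spam"
```

Note that with an even K or tied votes, `most_common` breaks ties arbitrarily; practical implementations often use an odd K or distance-weighted voting.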
2.2.3 Advantages and Disadvantages of KNN
Advantages:
Easy to implement and understand.
Does not require an explicit training phase, unlike neural networks.
Can be used with nonlinear data.
Disadvantages:
Computationally expensive on large datasets, since the distance to every training point must be calculated for each prediction.
Performance may be affected by noise in the data.
Choosing the appropriate K significantly impacts accuracy.
2.2.4 Applications of KNN in Text Classification
KNN has been used in many studies of text classification because of its simplicity and effectiveness.
TF (Term Frequency): Measures the number of times a word appears within a document.
IDF (Inverse Document Frequency): Reduces the impact of very common words across
documents, increasing the importance of rare words.
The TF-IDF weight of each word w in a document d is calculated using the equation:
TF-IDF(w, d) = TF(w, d) × IDF(w)
where:
TF(w, d) = term frequency (how often the word w appears in document d).
IDF(w) = inverse document frequency (how important the word w is across all documents).
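The equation above can be sketched in Python. This sketch assumes two common conventions not stated in the text: TF as the raw count normalized by document length, and IDF(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w; other weighting variants (raw counts, smoothed IDF) exist.

```python
import math
from collections import Counter

def tf(word, doc):
    # doc: list of tokens; TF = count of word / length of document
    return Counter(doc)[word] / len(doc)

def idf(word, corpus):
    # corpus: list of tokenized documents; IDF = log(N / document frequency)
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(word, doc, corpus):
    # Combine the two factors: TF-IDF(w, d) = TF(w, d) * IDF(w)
    return tf(word, doc) * idf(word, corpus)

# Hypothetical two-document corpus for illustration
corpus = [["the", "cat"], ["the", "dog"]]
print(tf_idf("cat", corpus[0], corpus))  # > 0: "cat" is rare, so it is weighted up
print(tf_idf("the", corpus[0], corpus))  # 0.0: "the" occurs in every document
```

As the example shows, a word that appears in every document gets an IDF of log(N/N) = 0, so its TF-IDF weight vanishes, which is exactly how the scheme suppresses very common words.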