IDTA For NLP
Coursework 2
School of Computing, University of Portsmouth
Module M33147
Table of Contents
Introduction
Task 1: Pre-process the textual data to remove potential noise
Task 2: Using the bag-of-words/terms representation
Task 3: Perform classification using a BERT-based model
Task 4: Perform topic detection
AI Statement
I confirm that I used AI tools to assist in revising the English language of my coursework to ensure
clarity, coherence, and academic integrity.
Introduction
This report presents a detailed analysis of a sentiment-labeled dataset containing 1,000 sentences from Amazon
product reviews. These sentences are split evenly between 500 positive (scored as 1) and 500 negative (scored as
0) sentiments (Figure 1). The analysis is structured into four main tasks. The first task focuses on preparing the
text for analysis by removing punctuation and numbers, eliminating stop words, and adjusting the text case.
Additionally, I applied lemmatization and stemming techniques to refine the data further. In the second task, I
tested five key classification algorithms using both the CountVectorizer (Bag of Words) and TF-IDF
representations. The third task involved using a BERT-based model for classifying sentiments. The fourth task
explored topic detection through two different modeling techniques. The dataset overview is shown in Figure 2.
Task 1: Pre-process the textual data to remove potential noise
In this section, I focus on preparing the text data for further processing and analysis. This systematic preparation
minimizes noise and variability in the dataset, thereby enhancing the reliability of the analyses performed in later
stages. The following steps are undertaken to ensure the data is clean and standardized:
1. Text Cleaning:
1.1. Removing Punctuation and Numbers: Punctuation and numbers often do not contribute to the
meaning of the text in sentiment analysis and other language processing tasks.
1.2. Converting Text to Lowercase: Text data often contains variations in case (uppercase and lowercase)
which can lead to the same word being treated as different tokens (e.g., "House" vs. "house"). Converting
all text to lowercase ensures that all instances of a word are recognized as the same token.
In this step, as observed in Figure 3, I utilized the translate() function to remove punctuation and numbers.
Additionally, I employed the lower() function to convert all text to lowercase.
Figure 3: Removing punctuation and numbers and converting text to lowercase
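A minimal sketch of this cleaning step (illustrative only; the function and variable names are my own, not taken from the coursework code):

```python
import string

def clean_text(text):
    # Drop every punctuation mark and digit, then lowercase,
    # mirroring the translate()/lower() steps described above.
    table = str.maketrans("", "", string.punctuation + string.digits)
    return text.translate(table).lower()

clean_text("Great phone, 5 stars!")  # → 'great phone  stars'
```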
2. Removing Stopwords: Common words that typically do not add meaningful information to the text, such as
"and", "the", and "is", are removed. Stopwords are extremely common in English but usually don't carry significant
meaning relative to the specific analysis objectives. In tasks like sentiment analysis, and keyword extraction, the
emphasis is often on capturing the essence or themes of texts. Removing stopwords helps highlight and elevate
the importance of more descriptive and contextually relevant words in the text.
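In practice this step would use a full stop-word list such as NLTK's stopwords.words('english'); a self-contained sketch with a small illustrative list is:

```python
# Small illustrative stop-word set; NLTK's English list is much larger.
STOPWORDS = {"and", "the", "is", "a", "an", "it", "of", "to", "in", "this"}

def remove_stopwords(tokens):
    # Keep only tokens that are not stop words (case-insensitive).
    return [t for t in tokens if t.lower() not in STOPWORDS]

remove_stopwords("the phone is great and the sound is good".split())
# → ['phone', 'great', 'sound', 'good']
```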
3. Lemmatization:
3.1. WordNet Lemmatization: WordNet lemmatization uses the extensive WordNet database, which
encodes lexical relationships between words. I used NLTK to download the database. The results of applying
this method are shown in Figure 6.
3.2. Lemmatization Using Part of Speech (POS) Tagging: This involves assigning part-of-speech tags to
words before lemmatization, to ensure that words are correctly reduced to their base form based on their
usage in the sentence. Combining POS tagging with lemmatization allows for a more effective approach to
preparing text for various NLP applications, resulting in higher-quality data processing. The results of applying
this method are shown in Figure 7.
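A key detail here is that nltk.pos_tag returns Penn Treebank tags, while WordNetLemmatizer expects a coarse POS code, so a small mapping function is typically used in between. A sketch of that mapping (defaulting to noun, which matches the lemmatizer's own default):

```python
def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD', 'JJ') to the coarse POS
    code expected by NLTK's WordNetLemmatizer: 'a' adjective,
    'v' verb, 'r' adverb, 'n' noun (the default)."""
    if tag.startswith("J"):
        return "a"
    if tag.startswith("V"):
        return "v"
    if tag.startswith("R"):
        return "r"
    return "n"

# Typical use: lemmatizer.lemmatize(word, pos=treebank_to_wordnet(tag))
```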
4. Stemming:
Porter stemmer: The Porter stemmer applies a series of rules sequentially
to strip suffixes from words. The results of applying this method are shown in Figure 8.
Krovetz stemmer: This stemming algorithm reduces words to their root
forms, which may be closer to their correct lemma than the output of other stemmers. By
condensing words to their root forms, stemming reduces the complexity of the text data and
decreases the size of the vocabulary that needs to be processed. The results of applying this
method are shown in Figure 9.
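Assuming NLTK is available, the Porter stemmer can be applied token by token as below (the Krovetz stemmer is not part of NLTK and requires a separate package, so it is not shown):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Each token is reduced by Porter's sequential suffix-stripping rules.
print([stemmer.stem(t) for t in ["working", "phones", "connection"]])
# → ['work', 'phone', 'connect']
```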
Task 2: Using the bag-of-words/terms representation
In this task, the following five classification models were used to classify the positive and negative sentiments in the
dataset.
1. Logistic Regression
2. Decision Tree
3. Support Vector Machine (SVM)
4. Naive Bayes
5. k Nearest Neighbour
In the following subsections I explain the results of applying these classification methods. In this report I
applied both TF-IDF and Bag of Words (BoW) representations.
The Bag of Words model represents text data by counting the frequency of words within the
documents.
TF-IDF stands for Term Frequency-Inverse Document Frequency, a statistical measure used to evaluate
how important a word is to a document in a collection or corpus.
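Both representations can be produced with scikit-learn's CountVectorizer and TfidfVectorizer; a sketch on a toy corpus (not the report's dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the phone works great", "the sound is great", "waste of money"]

bow = CountVectorizer()        # raw term counts (Bag of Words)
X_bow = bow.fit_transform(docs)

tfidf = TfidfVectorizer()      # counts reweighted by inverse document frequency
X_tfidf = tfidf.fit_transform(docs)

print(X_bow.shape)  # one row per document, one column per vocabulary term → (3, 9)
```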
1. Logistic Regression
Applying TF-IDF
The accuracy is 82%, which is a good result, and the other
evaluation metrics support this.
Applying BOW
The accuracy is 83%, again a good result, supported by the
other evaluation metrics.
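As an illustration of the TF-IDF plus Logistic Regression setup, a sketch on a tiny invented corpus (the actual coursework uses the 1,000 Amazon sentences):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in corpus: 1 = positive, 0 = negative.
texts = ["great phone, works well", "love the sound quality",
         "terrible battery, waste of money", "very disappointing product"]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["terrible waste of money"]))  # → [0]
```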
2. Decision Tree
Applying TF-IDF
Applying BOW
3. Support Vector Machine (SVM)
Applying TF-IDF
Applying BOW
4. Naive Bayes
Applying TF-IDF
Applying BOW
5. k Nearest Neighbour
Applying TF-IDF
Figure 18 presents the results of sentiment analysis using KNN
with TF-IDF vectorization.
The accuracy is 82%, which is a good result, and the other
evaluation metrics support this.
Applying BOW
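The five models above can be trained and compared in a single loop; a sketch on a toy corpus (scores on the real dataset are those reported in Table 1):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

texts = ["great phone", "works great", "love it", "good sound",
         "waste of money", "very bad", "poor quality", "do not buy"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = CountVectorizer().fit_transform(texts)

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": LinearSVC(),
    "Naive Bayes": MultinomialNB(),
    "kNN": KNeighborsClassifier(n_neighbors=3),
}
# Fit each model and report training accuracy; a real evaluation
# would use a held-out test split, as in the report.
scores = {name: m.fit(X, labels).score(X, labels) for name, m in models.items()}
print(scores)
```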
6. Comparison
Analyzing the performance of the various classification models, as depicted in Table 1, provides a basis for
comparing the applied models on sentiment analysis tasks. I divided this comparison into the following
subsections:
BoW vs TF-IDF
A comparison between BoW and TF-IDF vectorizations in applied ML models reveals several insights and
potential limitations. While BoW often shows slightly higher accuracy in some models like Logistic Regression
and Decision Trees, this might be attributed to its simplicity and direct approach in treating term frequencies.
However, this method can overlook the importance of less frequent but potentially more impactful words, which
TF-IDF emphasizes by scaling down the impact of frequently occurring words in the dataset.
TF-IDF generally performs better in models that benefit from a good understanding of term importance across
documents, such as Naive Bayes and KNN, where the weighting helps differentiate between relevant and
irrelevant features more effectively. This suggests that while BoW might be useful for achieving high accuracy
in a straightforward manner, it could lead to a superficial understanding of text content, potentially missing deeper
textual nuances that TF-IDF captures. Moreover, the higher AUC scores associated with TF-IDF in some models
indicate better overall performance in distinguishing between classes, particularly in probabilistic and distance-
based models.
So, while BoW is computationally simpler and may occasionally outperform TF-IDF in raw accuracy, TF-IDF's
approach to handling term significance often makes it a more robust choice, especially in complex NLP tasks that
require a deeper semantic understanding of the text.
Classification Method
Logistic Regression shows strong performance with relatively high accuracy and AUC scores across both
vectorization techniques. Decision Trees display moderate effectiveness, with their performance generally lower
than Logistic Regression. SVMs perform comparably well, particularly with BoW vectorization, suggesting that
their capacity to find a hyperplane that maximally separates classes is effective. Naive Bayes shows a promising
balance between accuracy and AUC with TF-IDF, potentially due to its assumption of feature independence and
its probabilistic foundation, which aligns well with the weight-based nature of TF-IDF. kNN's performance is
notably lower in the context of BoW, with a drastic reduction in both accuracy and AUC scores, underscoring its
dependence on a suitable distance metric and its susceptibility to the curse of dimensionality.
Task 3: Perform classification using a BERT-based model
In this task I applied a BERT-based model; the results are shown in Figure 20.
As observed in Figure 20, despite a significant reduction in the training loss, which approaches zero, the accuracy
reached only 66%, which is not satisfactory compared to the classification algorithms used in Task 2; it is lower
than almost all of them, indicating weaker performance. A near-zero training loss combined with low accuracy
suggests the model is overfitting the small training set.
By comparing these results with Table 1, we can say the BERT model underperformed compared to traditional
classifiers like Logistic Regression, Naive Bayes, and SVM, particularly when using TF-IDF and Bag-of-Words. This
indicates that, for this specific dataset, simpler models with feature engineering (TF-IDF/BoW) are more effective
than fine-tuning a BERT model, potentially due to data size or preprocessing. The performance gap suggests
further optimization of BERT, such as better fine-tuning or using a larger, more balanced dataset, may be
necessary to achieve competitive results.
Task 4: Perform topic detection
In this task, I focused on applying two models for topic detection, namely Latent Semantic Analysis (LSA) and Latent
Dirichlet Allocation (LDA).
Latent Semantic Analysis (LSA) is a technique that reduces text data dimensionality using singular value
decomposition. It uncovers latent semantic structures by projecting terms and documents into a lower-
dimensional space, helping to resolve issues of synonymy and polysemy, and enhancing the understanding of
underlying thematic content. Figure 21 shows the results of applying LSA to the Amazon dataset.
Topic 0: The highest weight in this topic is 0.754*"1", which indicates that positive sentiment dominates this topic.
Considering the other words, we can conclude: "The phone works great and the headset sound is good, so they are
good products."
Topic 1: Here, 0.841*"0" + -0.494*"1" shows that the sentiment is negative. It can be read as: the phone is
disappointing and buying it is a waste of money.
Topic 2: The weight 0.914*"phone" shows the topic relates to the phone. The other values in this topic show
no dominant subject, so the sentiment is mixed. One inference: the headset sound is good for the ear.
Topic 3: The two weights 0.827*"work" + 0.333*"great" show the direction, but there are both negative and positive
sentiments in this topic, so the sentiment is mixed. One possible inference on the positive side: the headset sound is
good and comfortable for the ear.
Topic 4: In this topic, -0.763*"great" has a strong negative influence and is clearly linked to the topic, while some
words contribute positively, so the sentiment is mixed.
Topic 5: In this topic, 0.622*"use" shows the strongest positive influence, highlighting a significant link to the identified
topic. Some words contribute positively and others negatively, so the sentiment is mixed.
Topic 6: In this topic, -0.570*"sound", -0.406*"quality", and -0.380*"headset" all make strong negative contributions. On
the other hand, 0.144*"battery" and 0.184*"product" add positively, suggesting a review that criticizes the product,
particularly its sound and headset quality.
Topic 7: Words like "ear", "one", "sound", and "comfort" contribute positively, while words such as "use" and "good"
contribute negatively. This reveals mixed feelings.
Topic 8: This review expresses dissatisfaction with aspects like comfort, fit, and headsets, yet it provides positive feedback on
sound quality and battery life.
Topic 9: Terms such as "product", "one", "ear", "quality", "price", and "recommend" have positive impacts,
while words like "battery", "headset", "life", and "long" contribute negatively. This suggests positive sentiment
towards the product overall, but with issues noted concerning battery life and headset functionality.
LDA: Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a generative statistical model that assumes documents are composed of
multiple topics. It infers the hidden topic structure by assigning a mixture of topics to each document and a
distribution of words to each topic. This approach is widely used to discover the abstract themes that permeate
a large corpus of texts, making it effective for organizing, understanding, and summarizing large datasets in natural
language processing. Figure 22 shows the results of applying LDA to the Amazon dataset.
Topic 1: Emphasizes quality and effort in usage, with key words such as "work", "great", "device", and "battery."
The sentiment is positive, highlighting good performance and quality.
Topic 2: Discusses product quality and user satisfaction, using terms like "phone", "product", "quality", "good",
and "bad." The sentiment is mixed.
Topic 3: Centers around product functionality and satisfaction, mentioning "phone", "great", "quality", and
"good." The sentiment is positive, focusing on functional benefits and satisfaction.
Topic 4: Relates to product appearance and usability concerns, featuring words like "make", "headset", "look",
and "easy." The sentiment is mixed, combining positive aspects of appearance with potential negative usability
issues.
Topic 5: Focuses on the product's value for money, highlighting "good", "battery", "price", and "really." The
sentiment leans negative.
Topic 6: Pertains to product appreciation and love, with "product", "great", "make", and "love." The sentiment
is positive, reflecting strong affection for the product.
Topic 7: Deals with economic aspects and efficiency, including "phone", "work", "money", and "waste." The
sentiment is mixed, with positive notes on work and efficiency but negative implications with "waste."
Topic 8: Tackles product reception and functionality, using "good", "sound", "work", "don't", and "reception."
The sentiment is negative, focusing on problems with reception and sound quality.
Topic 9: Involves product usability and lifestyle integration, with terms like "headset," "use," "great," and "fit."
The sentiment is mixed, with positives on fit and usability tempered by potential negatives associated with
"headset" issues.
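The topics above come from an LDA model; a corresponding sketch with scikit-learn's LatentDirichletAllocation on a toy corpus (illustrative only; the report's gensim-style output would list per-topic word weights):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["great phone great battery", "love the sound quality",
        "waste of money bad phone", "battery life is too short",
        "good price good product", "headset does not fit well"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# LDA infers a topic mixture per document and a word distribution per topic.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is a per-document topic mixture

print(doc_topics.shape)  # → (6, 3)
```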
Comparison LSA vs LDA
In analysing Amazon reviews, both LSA and LDA reveal different aspects of customer opinions and product details.
LSA is great for digging into complex word relationships and helps clarify meanings in product features. On the
other hand, LDA provides a broader look, capturing main themes across various topics like work use and cost,
offering a bigger picture of what customers think and how they feel about products. LSA focuses more on specific
details, while LDA shows general trends and overall sentiments in the reviews.