IDTA For NLP
Coursework 2
School of Computing, University of Portsmouth
Module M33147
Table of Contents
Introduction
Task 1: Pre-process the textual data to remove potential noise
Task 2: Using the bag-of-words/terms representation
Task 3: Perform classification using a BERT-based model
Task 4: Perform topic detection
AI Statement
I confirm that I used AI tools to assist in revising the English language of my coursework to ensure
clarity, coherence, and academic integrity.
Introduction
This report presents a detailed analysis of a sentiment-labeled dataset containing 1,000 sentences from Amazon
product reviews. These sentences are split evenly between 500 positive (scored as 1) and 500 negative (scored as
0) sentiments (Figure 1). The analysis is structured into four main tasks. The first task focuses on preparing the
text for analysis by removing punctuation and numbers, eliminating stop words, and adjusting the text case.
Additionally, I applied lemmatization and stemming techniques to refine the data further. In the second task, I
tested five key classification algorithms using both the CountVectorizer (Bag of Words) and TF-IDF
representations. The third task involved using a BERT-based model for classifying sentiments. The fourth task
explored topic detection through two different modeling techniques. The dataset overview is shown in Figure 2.
Task 1: Pre-process the textual data to remove potential noise
In this section, I focus on preparing the text data for further processing and analysis. This systematic preparation
minimizes noise and variability in the dataset, thereby enhancing the reliability of the analyses performed in later
stages. The following steps are undertaken to ensure the data is clean and standardized:
1. Text Cleaning:
1.1. Removing Punctuation and Numbers: Punctuation and numbers often do not contribute to the
meaning of the text in sentiment analysis and other language processing tasks.
1.2. Converting Text to Lowercase: Text data often contains variations in case (uppercase and lowercase)
which can lead to the same word being treated as different tokens (e.g., "House" vs. "house"). Converting
all text to lowercase ensures that all instances of a word are recognized as the same token.
In this step, as observed in Figure 3, I utilized the translate() function to remove punctuation and numbers.
Additionally, I employed the lower() function to convert all text to lowercase.
Figure 3: Removing punctuation and numbers and converting text to lowercase
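A minimal sketch of this cleaning step (illustrative only; the function and variable names are my own, not taken from the coursework code):

```python
import string

def clean_text(text):
    # Drop every punctuation mark and digit, then lowercase,
    # mirroring the translate()/lower() steps described above.
    table = str.maketrans("", "", string.punctuation + string.digits)
    return text.translate(table).lower()

clean_text("Great phone, 5 stars!")  # → 'great phone  stars'
```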
2. Removing Stopwords: Common words that typically do not add meaningful information to the text, such as
"and", "the", and "is", are removed. Stopwords are extremely common in English but usually don't carry significant
meaning relative to the specific analysis objectives. In tasks like sentiment analysis, and keyword extraction, the
emphasis is often on capturing the essence or themes of texts. Removing stopwords helps highlight and elevate
the importance of more descriptive and contextually relevant words in the text.
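In practice this step would use a full stop-word list such as NLTK's stopwords.words('english'); a self-contained sketch with a small illustrative list is:

```python
# Small illustrative stop-word set; NLTK's English list is much larger.
STOPWORDS = {"and", "the", "is", "a", "an", "it", "of", "to", "in", "this"}

def remove_stopwords(tokens):
    # Keep only tokens that are not stop words (case-insensitive).
    return [t for t in tokens if t.lower() not in STOPWORDS]

remove_stopwords("the phone is great and the sound is good".split())
# → ['phone', 'great', 'sound', 'good']
```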
3. Lemmatization:
3.1. WordNet Lemmatization: WordNet lemmatization uses the extensive WordNet database, which
encodes lexical relationships between words. I used NLTK to download the database. The results of applying
this method are shown in Figure 6.
3.2. Lemmatization Using Part of Speech (POS) Tagging: This involves assigning part-of-speech tags to
words before lemmatization, to ensure that words are correctly reduced to their base form based on their
usage in the sentence. Combining POS tagging with lemmatization allows for a more effective approach to
preparing text for various NLP applications, resulting in higher-quality data processing. The results of applying
this method are shown in Figure 7.
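A key detail here is that nltk.pos_tag returns Penn Treebank tags, while WordNetLemmatizer expects a coarse POS code, so a small mapping function is typically used in between. A sketch of that mapping (defaulting to noun, which matches the lemmatizer's own default):

```python
def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD', 'JJ') to the coarse POS
    code expected by NLTK's WordNetLemmatizer: 'a' adjective,
    'v' verb, 'r' adverb, 'n' noun (the default)."""
    if tag.startswith("J"):
        return "a"
    if tag.startswith("V"):
        return "v"
    if tag.startswith("R"):
        return "r"
    return "n"

# Typical use: lemmatizer.lemmatize(word, pos=treebank_to_wordnet(tag))
```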
4. Stemming:
Porter stemmer: The Porter stemmer applies a series of rules sequentially
to strip suffixes from words. The results of applying this method are shown in Figure 8.
Krovetz stemmer: This stemming algorithm reduces words to their root
forms, which may be closer to their correct lemma than the output of other stemmers. By
condensing words to their root forms, stemming reduces the complexity of the text data and
decreases the size of the vocabulary that needs to be processed. The results of applying this
method are shown in Figure 9.
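Assuming NLTK is available, the Porter stemmer can be applied token by token as below (the Krovetz stemmer is not part of NLTK and requires a separate package, so it is not shown):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Each token is reduced by Porter's sequential suffix-stripping rules.
print([stemmer.stem(t) for t in ["working", "phones", "connection"]])
# → ['work', 'phone', 'connect']
```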
Task 2: Using the bag-of-words/terms representation
In this task, the following five classification models were used to classify the positive and negative sentiments in the
dataset.
1. Logistic Regression
2. Decision Tree
3. Support Vector Machine (SVM)
4. Naive Bayes
5. k Nearest Neighbour
In the following subsections I explain the results of applying these classification methods. In this report I
applied both TF-IDF and Bag of Words (BoW) representations.
The Bag of Words model represents text data by counting the frequency of words within the
documents.
TF-IDF stands for Term Frequency-Inverse Document Frequency, a statistical measure used to evaluate
how important a word is to a document in a collection or corpus.
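Both representations can be produced with scikit-learn's CountVectorizer and TfidfVectorizer; a sketch on a toy corpus (not the report's dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the phone works great", "the sound is great", "waste of money"]

bow = CountVectorizer()        # raw term counts (Bag of Words)
X_bow = bow.fit_transform(docs)

tfidf = TfidfVectorizer()      # counts reweighted by inverse document frequency
X_tfidf = tfidf.fit_transform(docs)

print(X_bow.shape)  # one row per document, one column per vocabulary term → (3, 9)
```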
1. Logistic Regression
Applying TF-IDF
The accuracy is 82%, which is a good result, and the other
evaluation metrics support this.
Applying BOW
The accuracy is 83%, again a good result, supported by the
other evaluation metrics.
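As an illustration of the TF-IDF plus Logistic Regression setup, a sketch on a tiny invented corpus (the actual coursework uses the 1,000 Amazon sentences):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in corpus: 1 = positive, 0 = negative.
texts = ["great phone, works well", "love the sound quality",
         "terrible battery, waste of money", "very disappointing product"]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["terrible waste of money"]))  # → [0]
```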
2. Decision Tree
Applying TF-IDF
Applying BOW
3. Support Vector Machine (SVM)
Applying TF-IDF
Applying BOW
4. Naive Bayes
Applying TF-IDF
Applying BOW
5. k Nearest Neighbour
Applying TF-IDF
Figure 18 presents the results of sentiment analysis using KNN
with TF-IDF vectorization.
The accuracy is 82%, which is a good result, and the other
evaluation metrics support this.
Applying BOW
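The five models above can be trained and compared in a single loop; a sketch on a toy corpus (scores on the real dataset are those reported in Table 1):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

texts = ["great phone", "works great", "love it", "good sound",
         "waste of money", "very bad", "poor quality", "do not buy"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = CountVectorizer().fit_transform(texts)

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": LinearSVC(),
    "Naive Bayes": MultinomialNB(),
    "kNN": KNeighborsClassifier(n_neighbors=3),
}
# Fit each model and report training accuracy; a real evaluation
# would use a held-out test split, as in the report.
scores = {name: m.fit(X, labels).score(X, labels) for name, m in models.items()}
print(scores)
```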
6. Comparison
Analyzing the performance of the various classification models, as depicted in Table 1, provides a basis for
comparing the applied models on sentiment analysis tasks. I divided this comparison into the following
subsections:
BoW vs TF-IDF
A comparison between BoW and TF-IDF vectorizations in applied ML models reveals several insights and
potential limitations. While BoW often shows slightly higher accuracy in some models like Logistic Regression
and Decision Trees, this might be attributed to its simplicity and direct approach in treating term frequencies.
However, this method can overlook the importance of less frequent but potentially more impactful words, which
TF-IDF emphasizes by scaling down the impact of frequently occurring words in the dataset.
TF-IDF generally performs better in models that benefit from a good understanding of term importance across
documents, such as Naive Bayes and KNN, where the weighting helps differentiate between relevant and
irrelevant features more effectively. This suggests that while BoW might be useful for achieving high accuracy
in a straightforward manner, it could lead to a superficial understanding of text content, potentially missing deeper
textual nuances that TF-IDF captures. Moreover, the higher AUC scores associated with TF-IDF in some models
indicate better overall performance in distinguishing between classes, particularly in probabilistic and distance-
based models.
So, while BoW is computationally simpler and may occasionally outperform TF-IDF in raw accuracy, TF-IDF's
approach to handling term significance often makes it a more robust choice, especially in complex NLP tasks that
require a deeper semantic understanding of the text.
Classification Method
Logistic Regression shows strong performance with relatively high accuracy and AUC scores across both
vectorization techniques. Decision Trees display moderate effectiveness, with their performance generally lower
than Logistic Regression. SVMs perform comparably well, particularly with BoW vectorization, suggesting that
their capacity to find a hyperplane that maximally separates classes is effective. Naive Bayes shows a promising
balance between accuracy and AUC with TF-IDF, potentially due to its assumption of feature independence and
its probabilistic foundation, which aligns well with the weight-based nature of TF-IDF. kNN's performance is
notably lower in the context of BoW, with a drastic reduction in both accuracy and AUC scores, underscoring its
dependence on a suitable distance metric and its susceptibility to the curse of dimensionality.
Task 3: Perform classification using a BERT-based model
In this task I applied a BERT-based model; the results are shown in Figure 20.
As observed in Figure 20, despite a significant reduction in the training loss, which approaches zero, the accuracy
reached only 66%, which is not satisfactory compared to the classification algorithms used in Task 2; it is lower
than almost all of them, indicating weaker performance. A near-zero training loss combined with low accuracy
suggests the model is overfitting the small training set.
By comparing these results with Table 1, we can say the BERT model underperformed compared to traditional
classifiers like Logistic Regression, Naive Bayes, and SVM, particularly when using TF-IDF and Bag-of-Words. This
indicates that, for this specific dataset, simpler models with feature engineering (TF-IDF/BoW) are more effective
than fine-tuning a BERT model, potentially due to data size or preprocessing. The performance gap suggests
further optimization of BERT, such as better fine-tuning or using a larger, more balanced dataset, may be
necessary to achieve competitive results.
Task 4: Perform topic detection
In this task, I focused on applying two models for topic detection, namely Latent Semantic Analysis (LSA) and Latent
Dirichlet Allocation (LDA).
Latent Semantic Analysis (LSA) is a technique that reduces text data dimensionality using singular value
decomposition. It uncovers latent semantic structures by projecting terms and documents into a lower-
dimensional space, helping to resolve issues of synonymy and polysemy, and enhancing the understanding of
underlying thematic content. Figure 21 shows the results of applying LSA to the Amazon dataset.
Topic 0: The highest weight in this topic is 0.754*"1", which indicates that positive sentiment dominates this topic.
Considering the other words, we can conclude: "The phone works great and the headset sound is good, so they are
good products."
Topic 1: Here, 0.841*"0" + -0.494*"1" shows that the sentiment is negative. It can be read as: the phone is
disappointing and buying it is a waste of money.
Topic 2: The weight 0.914*"phone" shows the topic relates to the phone. The other values in this topic show
no dominant subject, so the sentiment is mixed. One inference: the headset sound is good for the ear.
Topic 3: The two weights 0.827*"work" + 0.333*"great" show the direction, but there are both negative and positive
sentiments in this topic, so the sentiment is mixed. One possible inference on the positive side: the headset sound is
good and comfortable for the ear.
Topic 4: In this topic, -0.763*"great" has a strong negative influence and is clearly linked to the topic, while some
words contribute positively, so the sentiment is mixed.
Topic 5: In this topic, 0.622*"use" shows the strongest positive influence, highlighting a significant link to the identified
topic. Some words contribute positively and others negatively, so the sentiment is mixed.
Topic 6: In this topic, -0.570*"sound", -0.406*"quality", and -0.380*"headset" all make strong negative contributions. On
the other hand, 0.144*"battery" and 0.184*"product" add positively, suggesting a review that criticizes the product,
particularly its sound and headset quality.
Topic 7: Words like "ear", "one", "sound", and "comfort" contribute positively, while words such as "use" and "good"
contribute negatively. This reveals mixed feelings.
Topic 8: This review expresses dissatisfaction with aspects like comfort, fit, and headsets, yet it provides positive feedback on
sound quality and battery life.
Topic 9: Terms such as "product", "one", "ear", "quality", "price", and "recommend" have positive impacts,
while words like "battery", "headset", "life", and "long" contribute negatively. This suggests positive sentiment
towards the product overall, but with issues noted concerning battery life and headset functionality.
LDA: Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a generative statistical model that assumes documents are composed of
multiple topics. It infers the hidden topic structure by assigning a mixture of topics to each document and a
distribution of words to each topic. This approach is widely used to discover the abstract themes that permeate
a large corpus of texts, making it effective for organizing, understanding, and summarizing large datasets in natural
language processing. Figure 22 shows the results of applying LDA to the Amazon dataset.
Topic 1: Emphasizes quality and effort in usage, with key words such as "work", "great", "device", and "battery."
The sentiment is positive, highlighting good performance and quality.
Topic 2: Discusses product quality and user satisfaction, using terms like "phone", "product", "quality", "good",
and "bad." The sentiment is mixed.
Topic 3: Centers around product functionality and satisfaction, mentioning "phone", "great", "quality", and
"good." The sentiment is positive, focusing on functional benefits and satisfaction.
Topic 4: Relates to product appearance and usability concerns, featuring words like "make", "headset", "look",
and "easy." The sentiment is mixed, combining positive aspects of appearance with potential negative usability
issues.
Topic 5: Focuses on the product's value for money, highlighting "good", "battery", "price", and "really." The
sentiment leans negative.
Topic 6: Pertains to product appreciation and love, with "product", "great", "make", and "love." The sentiment
is positive, reflecting strong affection for the product.
Topic 7: Deals with economic aspects and efficiency, including "phone", "work", "money", and "waste." The
sentiment is mixed, with positive notes on work and efficiency but negative implications with "waste."
Topic 8: Tackles product reception and functionality, using "good", "sound", "work", "don't", and "reception."
The sentiment is negative, focusing on problems with reception and sound quality.
Topic 9: Involves product usability and lifestyle integration, with terms like "headset," "use," "great," and "fit."
The sentiment is mixed, with positives on fit and usability tempered by potential negatives associated with
"headset" issues.
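The topics above come from an LDA model; a corresponding sketch with scikit-learn's LatentDirichletAllocation on a toy corpus (illustrative only; the report's gensim-style output would list per-topic word weights):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["great phone great battery", "love the sound quality",
        "waste of money bad phone", "battery life is too short",
        "good price good product", "headset does not fit well"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# LDA infers a topic mixture per document and a word distribution per topic.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is a per-document topic mixture

print(doc_topics.shape)  # → (6, 3)
```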
Comparison LSA vs LDA
In analysing Amazon reviews, both LSA and LDA reveal different aspects of customer opinions and product details.
LSA is great for digging into complex word relationships and helps clarify meanings in product features. On the
other hand, LDA provides a broader look, capturing main themes across various topics like work use and cost,
offering a bigger picture of what customers think and how they feel about products. LSA focuses more on specific
details, while LDA shows general trends and overall sentiments in the reviews.