Assignment 1 Groupwork C0927405 C0928791
Analytics 01
ASSIGNMENT 1
Prashansa Thapa
Student ID: C0927405
1. Text Preprocessing
1.1. Loading Dataset
The dataset was loaded from the sentimentdataset.csv file using pandas.read_csv().
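A minimal sketch of this step (assuming the CSV file sits in the working directory):

import pandas as pd

df = pd.read_csv('sentimentdataset.csv')  # load the raw dataset
print(df.shape)                           # quick sanity check on rows/columns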
1.2. Standardization
The dataset contained more sentiment labels than just Positive, Negative, and
Neutral, so all labels were mapped into these three primary sentiment classes
using a mapping dictionary. This step ensures that the labeling scheme is uniform.
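A sketch of the mapping step; the specific label variants below are illustrative assumptions, since the full dictionary is not reproduced in this report:

# Hypothetical label variants; the actual dictionary covered every label in the data
label_map = {
    'Joy': 'Positive', 'Excitement': 'Positive',
    'Sadness': 'Negative', 'Anger': 'Negative',
    'Boredom': 'Neutral',
}
df['Sentiment'] = df['Sentiment'].str.strip().replace(label_map)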
1.3. Cleaning
Regular expressions (re.sub()) were used to remove excess whitespace,
punctuation, and special characters, ensuring that only meaningful text remained.
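For example, a pattern along these lines drops special characters and collapses whitespace:

import re

text = "Great   day!!!  #blessed"
text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # remove punctuation and special characters
text = re.sub(r'\s+', ' ', text).strip()    # collapse excess whitespace
print(text)  # -> Great day blessed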
1.4. Tokenization
NLTK's word_tokenize() was used to tokenize each text entry, dividing the text
into distinct words. This is necessary for subsequent processing, including
lemmatization and stopword elimination.
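A minimal example (the punkt tokenizer model must be downloaded once):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)  # one-time tokenizer download
print(word_tokenize("Enjoying a beautiful day at the park!"))
# -> ['Enjoying', 'a', 'beautiful', 'day', 'at', 'the', 'park', '!']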
1.5. Stopword Removal
The NLTK stopwords list was used to remove frequent words that don't add much
meaning, like "the," "is," and "and." By reducing noise, this stage increases the
accuracy of the model.
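In isolation, the step looks like this:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))
tokens = ['enjoying', 'a', 'beautiful', 'day', 'at', 'the', 'park']
print([w for w in tokens if w not in stop_words])
# -> ['enjoying', 'beautiful', 'day', 'park']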
1.6. Lemmatization
Words were reduced to their base forms using WordNetLemmatizer(). For
example, 'cars' became 'car'; when part-of-speech tags are supplied, 'running'
maps to 'run' and 'better' to 'good'. This process ensures that variations of a
word are treated as the same feature.
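A short sketch showing the part-of-speech dependence noted above:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet', quiet=True)
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('cars'))              # -> car   (noun is the default POS)
print(lemmatizer.lemmatize('running', pos='v'))  # -> run   (needs the verb POS tag)
print(lemmatizer.lemmatize('better', pos='a'))   # -> good  (needs the adjective POS tag)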
1.7. Content handling
1.7.1. Abbreviations and Slang
A simple replacement step expanded common abbreviations (such as "u" to
"you"), because social media posts frequently use casual language.
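A sketch with a hypothetical slang dictionary; only "u" -> "you" is taken from the report, the other entries are assumptions:

slang_map = {'u': 'you', 'r': 'are', 'gr8': 'great'}  # 'r'/'gr8' are assumed entries
text = "u r gr8"
print(' '.join(slang_map.get(w, w) for w in text.split()))  # -> you are great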
1.7.2. Emoji
To keep the text consistent, emojis were eliminated using regular expressions.
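One common way to do this (the exact Unicode ranges used in our notebook may differ):

import re

emoji_pattern = re.compile(
    '[\U0001F300-\U0001FAFF\U00002600-\U000027BF]',  # assumed emoji/symbol ranges
    flags=re.UNICODE)
print(emoji_pattern.sub('', 'Just finished an amazing workout! 💪').strip())
# -> Just finished an amazing workout!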
2. Feature Extraction
2.1. Bag of Words (BoW)
The cleaned text was transformed into a matrix of token counts using
CountVectorizer(). Individual word frequencies were recorded using unigrams.
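In outline (Processed_Text is the cleaned column produced in Section 1):

from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer()  # unigram token counts by default
X_bow = bow_vectorizer.fit_transform(df['Processed_Text'])
print(X_bow.shape)  # (number of posts, vocabulary size)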
2.2. TF-IDF
TfidfVectorizer() combined term frequency with inverse document frequency to
convert the text into weighted numerical features, using the default parameters.
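The corresponding sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()  # default parameters, as described above
X_tfidf = tfidf_vectorizer.fit_transform(df['Processed_Text'])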
2.3. Comparison
While TF-IDF emphasized key terms by lessening the influence of frequently used
words, BoW supplied raw frequency counts.
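A toy example makes the contrast visible: BoW counts repeats directly, while TF-IDF down-weights 'day', which appears in every document:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["good day", "bad day", "good good day"]
print(CountVectorizer().fit_transform(docs).toarray())
# [[0 1 1]
#  [1 1 0]
#  [0 1 2]]        columns: bad, day, good
print(TfidfVectorizer().fit_transform(docs).toarray().round(2))
# 'day' receives the lowest weight in every row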
3.3. SVM Model
SVC() was trained on both feature sets. It performed exceptionally well with
TF-IDF because it concentrated on distinctive terms that separate sentiments
effectively: words such as 'amazing' for Positive and 'terrible' for Negative
helped the SVM draw sharper decision boundaries.
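With a linear kernel, the per-term weights can be inspected to see which words pull a decision one way or the other; a sketch using the variable names from the appendix code:

import numpy as np

coefs = svm_model.coef_                      # one row per class pair (one-vs-one)
coefs = coefs.toarray() if hasattr(coefs, 'toarray') else np.asarray(coefs)
terms = np.array(tfidf_vectorizer.get_feature_names_out())
top = np.argsort(coefs[0])[-5:]              # five most positive weights for one pair
print(terms[top])                            # e.g. strongly sentiment-laden words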
3.6. Observations
Based on our modelling, the SVM model with TF-IDF outperformed Naive Bayes
because it focused on significant terms, while Naive Bayes performed better
with BoW because it relies on raw frequency patterns.
4. Comparison
[Comparison table: Naive Bayes and SVM performance with BoW vs. with TF-IDF]
Overall: SVM with TF-IDF achieved the best performance, using weighted term
importance to separate the sentiment classes effectively.
5. Challenges Faced
5.1. Class Imbalance
The uneven sentiment distribution in the dataset (328 Positive, 271 Neutral,
133 Negative) affected the models' performance.
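The skew is visible in the label counts; one standard mitigation, shown here only as a sketch and not applied in the run above, is class weighting:

print(df['Sentiment'].value_counts())  # Positive 328, Neutral 271, Negative 133

# Weight classes inversely to their frequency so the minority Negative class
# is not drowned out during training
from sklearn.svm import SVC
svm_balanced = SVC(kernel='linear', class_weight='balanced')
svm_balanced.fit(X_train_tfidf, y_train)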
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset
df = pd.read_csv('sentimentdataset.csv')
df.head()

                                               Text Sentiment
0         Enjoying a beautiful day at the park! ...  Positive
1            Traffic was terrible this morning. ...  Negative
2          Just finished an amazing workout! 💪 ...  Positive
3  Excited about the upcoming weekend getaway! ...  Positive
4   Trying out a new recipe for dinner tonight. ...   Neutral

df['Sentiment'].value_counts()

Sentiment
Positive    328
Neutral     271
Negative    133
Name: count, dtype: int64
# Text Preprocessing
stop_words = set(stopwords.words('english'))   # build once, not per call
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    text = text.lower()                                            # lowercase
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # strip punctuation safely
    tokens = word_tokenize(text)                                   # split into words
    tokens = [w for w in tokens if w not in stop_words]            # drop stopwords
    tokens = [lemmatizer.lemmatize(w) for w in tokens]             # reduce to base forms
    return " ".join(tokens)
df['Processed_Text'] = df['Text'].apply(preprocess_text)
# Feature Extraction (Section 2): X_bow, X_tfidf, and y are built as follows
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(df['Processed_Text'])
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['Processed_Text'])
y = df['Sentiment']

# Train-Test Split (the shared random_state keeps the split indices identical)
X_train_bow, X_test_bow, y_train, y_test = train_test_split(
    X_bow, y, test_size=0.2, random_state=42)
X_train_tfidf, X_test_tfidf, _, _ = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42)
# Model Training
nb_model = MultinomialNB()
nb_model.fit(X_train_bow, y_train)
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_tfidf, y_train)
# Predictions
y_pred_nb = nb_model.predict(X_test_bow)
y_pred_svm = svm_model.predict(X_test_tfidf)
# Evaluation
print("Naïve Bayes Performance:")
print(classification_report(y_test, y_pred_nb))
print("SVM Performance:")
print(classification_report(y_test, y_pred_svm))
# Confusion Matrix
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(confusion_matrix(y_test, y_pred_nb), annot=True, fmt='d',
            cmap='Blues', ax=axes[0])
axes[0].set_title("Naïve Bayes Confusion Matrix")
sns.heatmap(confusion_matrix(y_test, y_pred_svm), annot=True, fmt='d',
            cmap='Blues', ax=axes[1])
axes[1].set_title("SVM Confusion Matrix")
plt.show()