
Natural Language Processing and Social Media Analytics 01

ASSIGNMENT 1

Prashansa Thapa (Student ID: C0927405)
Jyoti Prakash Uprety (Student ID: C0928791)
Table of Contents
1. Text Preprocessing
   1.1. Loading Dataset
   1.2. Standardization
   1.3. Cleaning
   1.4. Tokenization
   1.5. Stopword Removal
   1.6. Lemmatization
   1.7. Content Handling
        1.7.1. Abbreviations and Slang
        1.7.2. Emoji
        1.7.3. Lowercasing and Extra Whitespace
2. Feature Extraction
   2.1. Bag of Words (BoW)
   2.2. TF-IDF
   2.3. Comparison
3. Model Training and Performance Evaluation
   3.1. Data Splitting
   3.2. Naïve Bayes Model
   3.3. SVM Model
   3.4. Evaluation Metrics
   3.5. Confusion Matrix
   3.6. Observations
4. Comparison
5. Challenges Faced
   5.1. Class Imbalance
   5.2. Feature Limitation
   5.3. Noisy Data
6. Recommendation for Improvement
1. Text Preprocessing
1.1. Loading Dataset
The dataset was loaded from the sentimentdataset.csv file using pandas.read_csv().
1.2. Standardization
The dataset contained many sentiment labels beyond Positive, Negative, and Neutral, so every label was mapped to one of these three primary sentiment classes using a mapping dictionary. This step ensures that the labeling scheme is uniform.
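A condensed sketch of this step follows; the complete mapping dictionary appears in the code listing at the end of this report.

# Condensed sketch; the full mapping dictionary is in the appendix code.
sentiment_mapping = {
    'Joy': 'Positive', 'Gratitude': 'Positive',
    'Grief': 'Negative', 'Frustration': 'Negative',
    'Curiosity': 'Neutral',
    # ... remaining fine-grained labels mapped the same way
}

df['Sentiment'] = df['Sentiment'].str.strip().str.capitalize()
# Any label not covered by the mapping defaults to 'Neutral'
df['Sentiment'] = df['Sentiment'].map(sentiment_mapping).fillna('Neutral')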
1.3. Cleaning
Regular expressions (re.sub()) were used to remove excess whitespace, punctuation, and special characters, so that only relevant text remained.
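A minimal sketch of this cleaning step (the exact patterns are illustrative, not necessarily the ones used in the appendix code):

import re

def clean_text(text):
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)   # drop punctuation/specials
    text = re.sub(r"\s+", " ", text).strip()      # collapse extra whitespace
    return text

clean_text("Enjoying   a beautiful day!!! @ the park :)")
# -> 'Enjoying a beautiful day the park'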
1.4. Tokenization
NLTK's word_tokenize() was used to tokenize each text entry, splitting it into individual words. This is necessary for subsequent steps such as stopword removal and lemmatization.
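For example (requires the 'punkt' resource downloaded in the appendix code):

from nltk.tokenize import word_tokenize

word_tokenize("Traffic was terrible this morning")
# -> ['Traffic', 'was', 'terrible', 'this', 'morning']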
1.5. Stopword Removal
The NLTK stopwords list was used to remove frequent words that carry little meaning, such as "the," "is," and "and." Removing this noise improves the model's accuracy.
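A small sketch of the filter; building the stopword set once keeps the lookup fast:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ['traffic', 'was', 'terrible', 'this', 'morning']
[t for t in tokens if t not in stop_words]
# -> ['traffic', 'terrible', 'morning']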
1.6. Lemmatization
Words were reduced to their base forms using WordNetLemmatizer(); for example, 'dogs' became 'dog'. This process ensures that inflected variants of a word are treated as the same feature. Note that WordNetLemmatizer() treats words as nouns by default, so forms such as 'running' and 'better' are only reduced to 'run' and 'good' when an explicit POS tag is supplied.
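A short illustration of that POS behavior:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('dogs')              # -> 'dog'  (noun is the default POS)
lemmatizer.lemmatize('running', pos='v')  # -> 'run'
lemmatizer.lemmatize('better', pos='a')   # -> 'good'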
1.7. Content handling
1.7.1. Abbreviations and Slang
Common abbreviations were expanded (for example, "u" to "you"), since social media posts frequently use informal language.
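A sketch of this substitution with a hypothetical abbreviation dictionary (the actual list of replacements depends on the corpus):

abbreviations = {'u': 'you', 'r': 'are', 'lol': 'laughing out loud'}

def expand_abbreviations(text):
    # Replace a token if it is a known abbreviation, else keep it as-is
    return " ".join(abbreviations.get(word, word) for word in text.split())

expand_abbreviations("u r late lol")
# -> 'you are late laughing out loud'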
1.7.2. Emoji
To keep the text consistent, emojis were eliminated using regular expressions.
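One common approach (an assumption on our part, not necessarily the exact pattern used here) is to match emoji code-point ranges and delete them:

import re

emoji_pattern = re.compile(
    "[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\U00002600-\U000027BF]+",
    flags=re.UNICODE,
)
emoji_pattern.sub("", "Just finished an amazing workout! 💪")
# -> 'Just finished an amazing workout! '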

1.7.3. Lowercasing and Extra Whitespace
To keep the dataset consistent, the text was converted to lowercase and extraneous whitespace was removed.

2. Feature Extraction
2.1. Bag of Words (BoW)
The cleaned text was transformed into a matrix of token counts using
CountVectorizer(). Individual word frequencies were recorded using unigrams.
2.2. TF-IDF
TfidfVectorizer() used term frequency and inverse document frequency to convert text into numerical features. Default parameters were applied, producing weighted feature vectors.
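For comparison, the same toy documents under TfidfVectorizer() with scikit-learn defaults:

# With sklearn defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row
# is L2-normalized, so corpus-wide words are down-weighted.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good day", "bad day", "good good food"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(X.toarray().round(2))
# Within each row, the common 'day'/'good' weigh less than the rarer 'bad'/'food'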
2.3. Comparison
While TF-IDF emphasized key terms by lessening the influence of frequently used
words, BoW supplied raw frequency counts.

3. Model Training and Performance Evaluation

3.1. Data Splitting
train_test_split() was used to split the dataset into 80% training and 20% testing sets.
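A minimal sketch of the split, assuming the feature matrix X_tfidf and labels y from the appendix code; the stratify argument is an optional addition, not used in the original code:

from sklearn.model_selection import train_test_split

# stratify=y (our addition, not in the appendix code) keeps the
# Positive/Neutral/Negative ratios identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42, stratify=y)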
3.2. Naïve Bayes Model
We trained MultinomialNB() on both the BoW and TF-IDF representations. Naïve Bayes' probabilistic nature made classification based on word-frequency distributions effective.

3.3. SVM Model
SVC() was trained on both feature sets. It performed especially well with TF-IDF because the weighting highlighted distinctive terms that separate sentiments; terms such as 'amazing' for Positive and 'terrible' for Negative helped the SVM form more distinct decision boundaries.

3.4. Evaluation Metrics
Accuracy, Precision, Recall, and F1-score were computed using classification_report() to evaluate the models' performance.
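These metrics follow their standard per-class definitions in terms of true positives (TP), false positives (FP), and false negatives (FN):

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]

Accuracy is the fraction of all test samples classified correctly.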
3.5. Confusion Matrix
The classification results were summarized with confusion_matrix(), and Seaborn heatmaps visualized patterns in both correct and incorrect predictions.

3.6. Observations
In our experiments, the SVM performed best with TF-IDF because it exploited the significant terms that the weighting highlights, while Naïve Bayes performed better with BoW because it relies on raw frequency patterns.

4. Comparison

Naïve Bayes with BoW: performed well because it relied on high-frequency terms and was probabilistic, making it well suited to simpler text patterns.

Naïve Bayes with TF-IDF: accuracy declined because the down-weighting of frequently used terms conflicted with its probability assumptions.

SVM with BoW: performed well enough but could not exploit word importance.

SVM with TF-IDF: achieved the highest accuracy and F1-score among all models by identifying sentiment-critical terms through the TF-IDF significance scores.

Overall: The best performance was achieved by SVM with TF-IDF, which used weighted term importance to successfully distinguish between sentiments.

5. Challenges Faced
5.1. Class Imbalance
Uneven sentiment distributions in the dataset affected the models' performance.

5.2. Feature Limitation
Unigrams alone limited the contextual interpretation of sentiments.

5.3. Noisy Data
Slang, emojis, and informal language in social media posts impacted model accuracy.

6. Recommendation for Improvement

Several key aspects must be addressed to enhance the sentiment analysis model. First, oversampling methods such as SMOTE can address class imbalance by ensuring the model sees enough samples from each sentiment class, which will improve generalization. Next, custom tokenizers and more sophisticated text-cleaning techniques can improve noise handling by properly processing the emojis, slang, and abbreviations that are prevalent in social media data. Finally, incorporating N-grams to capture word sequences and applying POS tagging to help the model understand grammatical relationships and context can improve feature engineering and, ultimately, the accuracy of sentiment classification. A sketch of two of these ideas follows.
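A hedged sketch of bigram features plus SMOTE oversampling, assuming the df['Processed_Text'] column from the appendix code; SMOTE requires the separate imbalanced-learn package:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# N-grams: unigrams + bigrams capture short word sequences
tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 2))
X = tfidf_ngrams.fit_transform(df['Processed_Text'])
y = df['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Oversample the training split only, to avoid leaking test information
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
# Every sentiment class now matches the majority class's training count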

import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv('sentimentdataset.csv')

df.head(2)

   Unnamed: 0.1  Unnamed: 0                                       Text Sentiment
0             0           0  Enjoying a beautiful day at the park! ...  Positive
1             1           1     Traffic was terrible this morning. ...  Negative

             Timestamp       User Platform           Hashtags  Retweets  Likes Country
0  2023-01-15 12:30:00    User123  Twitter      #Nature #Park      15.0   30.0     USA
1  2023-01-15 08:45:00  CommuterX  Twitter  #Traffic #Morning       5.0   10.0  Canada

   Year  Month  Day  Hour
0  2023      1   15    12
1  2023      1   15     8

# Keep only relevant columns
df = df[['Text', 'Sentiment']]
# Define sentiment mapping
sentiment_mapping = {
    'Positive': 'Positive', 'Negative': 'Negative', 'Neutral': 'Neutral',
    'Happy': 'Positive', 'Happiness': 'Positive', 'Joy': 'Positive',
    'Love': 'Positive', 'Excitement': 'Positive', 'Admiration': 'Positive',
    'Affection': 'Positive', 'Awe': 'Positive', 'Surprise': 'Positive',
    'Adoration': 'Positive', 'Anticipation': 'Positive', 'Calmness': 'Positive',
    'Kind': 'Positive', 'Pride': 'Positive', 'Hope': 'Positive',
    'Empowerment': 'Positive', 'Compassion': 'Positive', 'Tenderness': 'Positive',
    'Elation': 'Positive', 'Euphoria': 'Positive', 'Contentment': 'Positive',
    'Serenity': 'Positive', 'Gratitude': 'Positive', 'Fulfillment': 'Positive',
    'Reverence': 'Positive', 'Enthusiasm': 'Positive', 'Satisfaction': 'Positive',
    'Accomplishment': 'Positive', 'Wonder': 'Positive', 'Optimism': 'Positive',
    'Friendship': 'Positive', 'Success': 'Positive', 'Adventure': 'Positive',
    'Celebration': 'Positive', 'Creativity': 'Positive', 'Freedom': 'Positive',
    'Hopeful': 'Positive', 'Inspired': 'Positive', 'Zest': 'Positive',
    'Proud': 'Positive', 'Mindfulness': 'Positive',
    'Sad': 'Negative', 'Sadness': 'Negative', 'Fear': 'Negative',
    'Anger': 'Negative', 'Disgust': 'Negative', 'Disappointed': 'Negative',
    'Bitter': 'Negative', 'Shame': 'Negative', 'Despair': 'Negative',
    'Grief': 'Negative', 'Loneliness': 'Negative', 'Jealousy': 'Negative',
    'Resentment': 'Negative', 'Frustration': 'Negative', 'Boredom': 'Negative',
    'Anxiety': 'Negative', 'Helplessness': 'Negative', 'Envy': 'Negative',
    'Regret': 'Negative', 'Melancholy': 'Negative', 'Bitterness': 'Negative',
    'Heartbreak': 'Negative', 'Betrayal': 'Negative', 'Suffering': 'Negative',
    'Isolation': 'Negative', 'Darkness': 'Negative', 'Exhaustion': 'Negative',
    'Desolation': 'Negative', 'Desperation': 'Negative', 'Loss': 'Negative',
    'Heartache': 'Negative', 'Hopelessness': 'Negative', 'Hate': 'Negative',
    'Bad': 'Negative',
    'Indifference': 'Neutral', 'Confusion': 'Neutral', 'Numbness': 'Neutral',
    'Ambivalence': 'Neutral', 'Curiosity': 'Neutral', 'Reflection': 'Neutral',
    'Determination': 'Neutral', 'Sympathy': 'Neutral', 'Miscalculation': 'Neutral',
    'Obstacle': 'Neutral', 'Pressure': 'Neutral', 'Renewed Effort': 'Neutral',
    # 'Acceptance' originally appeared twice ('Positive', then 'Neutral');
    # Python keeps the last occurrence, so it is listed once here as 'Neutral'.
    'Acceptance': 'Neutral', 'Tranquility': 'Neutral', 'Observation': 'Neutral',
}

# Standardizing Sentiment column before mapping
df['Sentiment'] = df['Sentiment'].str.strip().str.capitalize()

# Replace Sentiment values with mapped values
df['Sentiment'] = df['Sentiment'].map(sentiment_mapping).fillna('Neutral')

# Display updated dataframe
print(df.head())

# Check new sentiment distribution
print(df['Sentiment'].value_counts())

Text Sentiment
0 Enjoying a beautiful day at the park! ... Positive
1 Traffic was terrible this morning. ... Negative
2 Just finished an amazing workout! 💪 ... Positive
3 Excited about the upcoming weekend getaway! ... Positive
4 Trying out a new recipe for dinner tonight. ... Neutral
Sentiment
Positive 328
Neutral 271
Negative 133
Name: count, dtype: int64

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to C:\Users\Jyoti Prakash
[nltk_data]     Uprety\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Jyoti Prakash
[nltk_data] Uprety\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Jyoti Prakash
[nltk_data] Uprety\AppData\Roaming\nltk_data...
[nltk_data] Package wordnet is already up-to-date!

True

# Text Preprocessing
stop_words = set(stopwords.words('english'))  # build the stopword set once
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    text = text.lower()                                 # lowercase
    text = re.sub(f"[{string.punctuation}]", "", text)  # strip punctuation
    tokens = word_tokenize(text)                        # tokenize
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(tokens)

df['Processed_Text'] = df['Text'].apply(preprocess_text)

# Feature Extraction (BoW and TF-IDF)
vectorizer_bow = CountVectorizer()
vectorizer_tfidf = TfidfVectorizer()
X_bow = vectorizer_bow.fit_transform(df['Processed_Text'])
X_tfidf = vectorizer_tfidf.fit_transform(df['Processed_Text'])
y = df['Sentiment']

# Train-Test Split (the shared random_state keeps the two splits row-aligned)
X_train_bow, X_test_bow, y_train, y_test = train_test_split(
    X_bow, y, test_size=0.2, random_state=42)
X_train_tfidf, X_test_tfidf, _, _ = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42)

# Model Training: Naïve Bayes on BoW features, linear SVM on TF-IDF features
nb_model = MultinomialNB()
nb_model.fit(X_train_bow, y_train)
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_tfidf, y_train)

# Predictions
y_pred_nb = nb_model.predict(X_test_bow)
y_pred_svm = svm_model.predict(X_test_tfidf)

# Evaluation
print("Naïve Bayes Performance:")
print(classification_report(y_test, y_pred_nb))
print("SVM Performance:")
print(classification_report(y_test, y_pred_svm))

# Confusion Matrix
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(confusion_matrix(y_test, y_pred_nb), annot=True, fmt='d',
cmap='Blues', ax=axes[0])
axes[0].set_title("Naïve Bayes Confusion Matrix")
sns.heatmap(confusion_matrix(y_test, y_pred_svm), annot=True, fmt='d',
cmap='Blues', ax=axes[1])
axes[1].set_title("SVM Confusion Matrix")
plt.show()

Naïve Bayes Performance:
              precision    recall  f1-score   support

    Negative       0.65      0.73      0.69        30
     Neutral       0.70      0.57      0.63        54
    Positive       0.78      0.86      0.82        63

    accuracy                           0.73       147
   macro avg       0.71      0.72      0.71       147
weighted avg       0.73      0.73      0.72       147

SVM Performance:
              precision    recall  f1-score   support

    Negative       0.76      0.43      0.55        30
     Neutral       0.64      0.67      0.65        54
    Positive       0.72      0.84      0.77        63

    accuracy                           0.69       147
   macro avg       0.71      0.65      0.66       147
weighted avg       0.70      0.69      0.68       147
