Twitter Sentiment Analysis

Uploaded by vikas Yadav

SENTIMENT ANALYSIS

➢ A machine learning technique for extracting emotions from text using Natural Language Processing (NLP)
➢ By training ML models on examples of emotion-bearing text, machines learn to detect sentiment
automatically, without human input

Ø The dataset contains approximately 2000 different (scraped) tweets with the following attributes:

• 'id' : unique 19-digit id for each tweet
• 'created_at' : date & time of each tweet (or retweet)
• 'text' : tweet details/description
• 'location' : origin of tweet

Ø ‘sentiment’ : the target column, created for each tweet using the lexicon-based library TextBlob
(https://textblob.readthedocs.io/en/dev/), with values ‘neutral’, ‘positive’ or ‘negative’
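A minimal sketch of the labelling step. TextBlob's `TextBlob(text).sentiment.polarity` returns a float in [-1, 1]; the threshold of 0 used below is an assumption — the deck does not state the cut-offs actually used.

```python
# Hypothetical mapping from a TextBlob-style polarity score to the
# three sentiment labels used in this deck. The zero threshold is an
# assumption, not taken from the original analysis.

def label_from_polarity(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# With TextBlob installed, this would be applied per tweet, e.g.:
#   from textblob import TextBlob
#   label = label_from_polarity(TextBlob(tweet_text).sentiment.polarity)
```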
Feature Engineering & Exploratory Data Analysis

Ø New features (day_of_week & hour) were created using the ‘created_at’ datetime attribute

Ø day_of_week
• All weekdays (except Tuesdays) have a majority of positive tweets, while weekends & Tuesdays have a majority of neutral tweets

Ø hour
• The majority of tweets are in the late afternoons & evenings (peaking after 3PM)
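The datetime feature extraction above can be sketched with pandas (column names follow the deck; the sample timestamps are illustrative):

```python
import pandas as pd

# Toy frame standing in for the scraped tweets.
df = pd.DataFrame({"created_at": ["2020-03-02 15:47:00", "2020-03-07 09:10:00"]})
df["created_at"] = pd.to_datetime(df["created_at"])

# Derive the two new features from the datetime attribute.
df["day_of_week"] = df["created_at"].dt.day_name()  # e.g. 'Monday'
df["hour"] = df["created_at"].dt.hour               # 0-23
```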
Ø New features were created by counting the number of positive & negative words in each tweet. The lists of positive & negative
words are borrowed from this study: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

Ø number_positive_words
• Tweets with 1, 2, 3 or 4 number_positive_words are majorly of 'positive' sentiment, as expected

Ø number_negative_words
• Tweets with 2, 3 or 4 number_negative_words are majorly of 'negative' sentiment, as expected
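A sketch of the word-count features. The tiny lexicons below are illustrative stand-ins for the full Hu & Liu opinion lexicon cited above:

```python
# Illustrative stand-in lexicons; the real analysis uses the much larger
# positive/negative word lists from the Hu & Liu study.
POSITIVE = {"good", "great", "love", "thank"}
NEGATIVE = {"bad", "sorry", "low", "flat"}

def count_lexicon_words(text: str, lexicon: set) -> int:
    """Count tokens of `text` that appear in `lexicon` (case-insensitive)."""
    return sum(1 for token in text.lower().split() if token in lexicon)

tweet = "Great service, thank you"
n_pos = count_lexicon_words(tweet, POSITIVE)  # number_positive_words -> 2
n_neg = count_lexicon_words(tweet, NEGATIVE)  # number_negative_words -> 0
```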
Text Pre-processing

• Conversion to lowercase & tokenization
• Removing non-alphanumeric characters like ‘@’ or ‘#’
• Stemming: examples of words being cut down to a root (irrespective of tense), e.g. please & pleased both become pleas
• Lemmatization: examples of words being reduced to their root, e.g. markets becomes market & ads becomes ad
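The first two steps can be sketched with the standard library alone (stemming and lemmatization would typically use NLTK's PorterStemmer and WordNetLemmatizer, omitted here to keep the sketch dependency-free):

```python
import re

def preprocess(text: str) -> list:
    """Lowercase, strip non-alphanumeric characters like '@' or '#', tokenize."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation/symbols
    return text.split()

tokens = preprocess("Thanks @CIBC! #banking")
# -> ['thanks', 'cibc', 'banking']
```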
Ø Word cloud before (left) & after (right) text pre-processing

• 'cibc', 'co', 'bank', 'thank' are some of the common word occurrences in tweets after text pre-processing
• ‘https' was a common occurrence in the original tweets but is no longer common after text pre-processing

Ø The ‘location’ attribute is too unclean and will be dropped from the analysis

Ø ‘sentiment’

• There is imbalance in the sentiment attribute; most tweets have a positive sentiment,
followed by neutral sentiment
Machine Learning
Ø Label encoding of the target (1: ’positive’, 0: ’neutral’ & -1: ’negative’) and one-hot encoding of categorical columns like ‘day_of_week’

Ø TF-IDF vectorizer chosen for text feature engineering to convert text (after the necessary pre-processing) into a numerical matrix
• This gives importance to ‘rare’ words in tweets which are not common across all other tweets
• Use stop_words = ‘english’ to remove common English words like ‘if’, ‘but’, ‘or’, ‘an’, ‘the’ etc.
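A minimal sketch of the vectorization step with scikit-learn (the toy corpus is illustrative; note that `stop_words` expects the lowercase string 'english'):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "great new product launch",
    "sorry the branch is closed",
    "the market is flat",
]

# stop_words='english' drops common words like 'the', 'is' before weighting.
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(corpus)  # sparse document-term matrix, one row per tweet
```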

Ø RandomOverSampler to account for target imbalance (specifically, the number of negative tweets being fewer than
positive & neutral ones)
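The deck uses imblearn's RandomOverSampler; a dependency-free sketch of the same idea (duplicate minority-class samples until every class matches the majority count):

```python
import random

def random_oversample(X, y, seed=0):
    """Resample (X, y) so every class has as many samples as the majority class."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_max = max(len(samples) for samples in by_class.values())
    X_out, y_out = [], []
    for label, samples in by_class.items():
        # Keep all originals, then draw random duplicates to reach n_max.
        resampled = samples + [rng.choice(samples) for _ in range(n_max - len(samples))]
        X_out.extend(resampled)
        y_out.extend([label] * n_max)
    return X_out, y_out
```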

Ø Choice of popular ML algorithms for text data: Multinomial Naive Bayes, Linear Support Vector Classifier, Random
Forest Classifier & XGBoost

Ø Choice of metric: ‘F1’ chosen to balance precision & recall across each of the target classes
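The deck just says ‘F1’; averaging it across the three classes (macro F1) is one natural reading, sketched here with scikit-learn on illustrative labels:

```python
from sklearn.metrics import f1_score

# Toy true/predicted labels using the deck's encoding (-1, 0, 1).
y_true = [1, 1, 0, -1, 0, 1]
y_pred = [1, 0, 0, -1, 0, 1]

# 'macro' computes F1 per class, then takes the unweighted mean,
# so each sentiment class counts equally regardless of its size.
score = f1_score(y_true, y_pred, average="macro")
```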

Ø ML Workflow
• Split dataset into training & testing set
• Perform cross validation on training set to choose best (base)
vectorizer & ML algorithm
• Perform Grid Search Cross Validation on training set using
vectorizer & chosen ML algorithm to get best hyperparameters
• Use the final classifier to make predictions on testing set

https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
5-fold Cross-Validation Results
• Best scores were with Multinomial Naive Bayes & XGBoost.
Multinomial Naive Bayes takes the least time to fit
& predict, while XGBoost takes the most time

• Random Forest is easier to interpret with the
class_feature_importance built into the module, with little
compromise in performance score & time

Hyperparameter Tuning – GridSearchCV – Random Forest

• Build a pipeline to choose the best hyperparameters for both the
TF-IDF Vectorizer & the chosen ML algorithm (Random Forest)
using a GridSearchCV object
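A sketch of that pipeline with scikit-learn. The parameter grid below is hypothetical — the deck does not list the actual search space:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Pipeline so GridSearchCV can tune vectorizer and model together.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Hypothetical grid; step names prefix the parameter names.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "rf__n_estimators": [50, 100],
}

search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=3)
# search.fit(train_texts, train_labels) then predict on the test split.
```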
Final Classifier

Ø Performance on the (unseen) testing dataset

• The final classifier achieves an F1 score of ~65%
for negative sentiments and >=75% for positive & neutral
sentiments

Ø Feature Importance
• The feature-importance words & tweet sentiments make
somewhat intuitive sense, lending confidence to the
explainability of the final model!

[Feature-importance plots: negative sentiments (-1), neutral sentiments (0), positive sentiments (1)]


Summary & Conclusion

Ø TextBlob was used to assign sentiment labels to tweets. There was inherently some imbalance in the sentiment classes of the dataset;
the negative : neutral : positive ratio was 0.15 : 0.43 : 0.40

Ø Data cleaning & visualization were followed by text analytics (removing non-alphanumeric characters, tokenizing sentences, lemmatization,
removing stop words & performing vectorization using TF-IDF). This was followed by random oversampling to handle the target imbalance

Ø Four different base models were fitted to the training set. The best cross-validated scores were achieved with Multinomial NB &
XGBoost; however, Random Forest was chosen for hyperparameter tuning as it is easier to interpret with the class feature importances, &
with little compromise in score & computational time

Ø A pipeline was then built for TF-IDF & Random Forest, enabling tuning for the best hyperparameters. Post tuning, performance
improved to an F1 score of ~0.65 for negative sentiment & >0.75 for neutral & positive sentiments on unseen test data

Ø For the final model, feature importance was identified for all 3 classes (negative, neutral & positive tweets)
• number_negative_words had high feature importance for predicting negative sentiments & number_positive_words had high
feature importance for predicting positive sentiments
• The words of importance associated with negative tweets were 'flat', 'mortgage', 'single', 'low', 'sorry', 'close' etc.
• The words of importance associated with positive tweets were 'new', 'thank', 'wood', 'latest', 'game', 'love', 'home',
'great', 'good'
• The words of importance associated with neutral tweets were 'http', 'cibc', 'company', 'recruit', 'reaffirm', 'provide'
The feature-importance words & tweet sentiments make somewhat intuitive sense.
