Twitter Sentiment Analysis
➢ Machine learning tool to extract emotions from text using Natural Language Processing (NLP)
➢ By training ML tools with examples of emotions in text, machines automatically learn how to
detect sentiment without human input
Ø ‘sentiment’ : a target column is created for each tweet using the lexicon-based library TextBlob
(https://textblob.readthedocs.io/en/dev/), with values ‘neutral’, ‘positive’ or ‘negative’
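A minimal sketch of this labeling step, assuming the common convention of thresholding TextBlob's polarity score at zero (the exact cut-offs used in the project are not stated):

```python
# Sketch: deriving the 'sentiment' target from TextBlob polarity.
# The zero thresholds are an assumption, not necessarily the project's choice.
from textblob import TextBlob

def label_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity  # in [-1.0, 1.0]
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

print(label_sentiment("Great service, thank you!"))  # -> positive
```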
Feature Engineering & Exploratory Data Analysis
Ø New features (day_of_week & hour) were created using the ‘created_at’ datetime attribute (see the sketch after the observations below)
Ø day_of_week
• All weekdays (except Tuesdays) have a majority of positive tweets, while weekends & Tuesdays have a majority of neutral tweets
Ø hour
• The majority of tweets are posted in the late afternoons & evenings (peaking after 3 PM)
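A short sketch of how these two features can be derived with pandas (the toy frame and timestamps are illustrative):

```python
import pandas as pd

# Toy frame standing in for the tweet dataset; column name 'created_at' is from the slides
df = pd.DataFrame({"created_at": ["2020-03-02 15:41:00", "2020-03-07 09:10:00"]})
df["created_at"] = pd.to_datetime(df["created_at"])
df["day_of_week"] = df["created_at"].dt.day_name()  # e.g. 'Monday'
df["hour"] = df["created_at"].dt.hour               # 0-23
```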
Ø New features were created by counting the number of positive & negative words in each tweet (see the counting sketch below). The lists of positive & negative
words are borrowed from this study: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Ø number_positive_words
• Tweets with 1, 2, 3 or 4 positive words are predominantly of ‘positive’ sentiment, as expected
Ø number_negative_words
• Tweets with 2, 3 or 4 negative words are predominantly of ‘negative’ sentiment, as expected
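A hedged sketch of the counting step; the tiny word sets below are stand-ins for the full Hu & Liu lexicons linked above:

```python
import pandas as pd

# Stand-in lexicons; the real lists come from the Hu & Liu resource linked above
positive_words = {"good", "great", "love", "thank"}
negative_words = {"bad", "sorry", "flat", "low"}

df = pd.DataFrame({"text": ["thank you for the great service", "sorry rates are low"]})

def count_hits(text, lexicon):
    # count how many tokens of the tweet appear in the given lexicon
    return sum(token in lexicon for token in text.lower().split())

df["number_positive_words"] = df["text"].apply(count_hits, lexicon=positive_words)
df["number_negative_words"] = df["text"].apply(count_hits, lexicon=negative_words)
```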
Text Pre-processing
Ø Conversion to lowercase & tokenization
Ø Removing non-alphanumeric characters like ‘@’ or ‘#’
Ø Stemming
• Examples of words being cut down to a root irrespective of tense: ‘please’ & ‘pleased’ both become ‘pleas’
Ø Lemmatization
• Examples of words being reduced to a dictionary root: ‘markets’ becomes ‘market’ & ‘ads’ becomes ‘ad’
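A sketch of these pre-processing steps using NLTK; the library choice and exact regex are assumptions, not necessarily the project's implementation:

```python
import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# tokenizer & lemmatizer data (newer NLTK versions also need 'punkt_tab')
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(tweet):
    tweet = tweet.lower()                       # conversion to lowercase
    tweet = re.sub(r"[^a-z0-9\s]", " ", tweet)  # drop '@', '#' & other non-alphanumerics
    return nltk.word_tokenize(tweet)            # tokenization

tokens = preprocess("@CIBC Markets & ads look great! #banking")
print([stemmer.stem(t) for t in tokens])        # 'markets' -> 'market', 'banking' -> 'bank'
print([lemmatizer.lemmatize(t) for t in tokens])  # 'markets' -> 'market', 'ads' -> 'ad'
```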
Ø Word Cloud before (left) & after (right) text pre-processing
• 'cibc', 'co', 'bank', 'thank' are some of the most common words in tweets after text pre-processing
• ‘https' was a common occurrence in the original tweets but is no longer common after text pre-processing
Ø TF-IDF vectorizer chosen for text feature engineering, converting text (after the pre-processing above) into a numerical matrix
• This gives importance to ‘rare’ words in a tweet that are not common across all other tweets
• Use stop_words='english' to remove common English words like ‘if’, ‘but’, ‘or’, ‘an’, ‘the’ etc.
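A minimal sketch of the vectorization step with scikit-learn (toy tweets for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words='english' drops common words like 'if', 'but', 'or', 'an', 'the'
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(["cibc launches a new app", "thank you cibc"])
print(X.shape)  # (n_tweets, vocabulary_size) sparse matrix
```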
Ø RandomOverSampler used to handle target imbalance (specifically, negative tweets being fewer than positive &
neutral ones)
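A sketch of the oversampling step with imbalanced-learn; the toy arrays and random_state are illustrative:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler

X_train = np.array([[0.1], [0.2], [0.3], [0.4], [0.5]])
y_train = np.array(["positive", "positive", "neutral", "neutral", "negative"])

ros = RandomOverSampler(random_state=42)  # random_state is an assumption
X_res, y_res = ros.fit_resample(X_train, y_train)
# every class is now as frequent as the majority class
```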
Ø Choice of popular ML algorithms for text data: Multinomial Naive Bayes, Linear Support Vector Classifier, Random
Forest Classifier & XGBoost
Ø Choice of metric : ‘F1’ chosen to balance precision & recall across each of the target classes
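For reference, a class-weighted F1 across the three sentiments can be computed as below; the averaging mode ('weighted') is an assumption:

```python
from sklearn.metrics import f1_score

y_true = ["positive", "neutral", "negative", "positive"]
y_pred = ["positive", "neutral", "positive", "positive"]
print(f1_score(y_true, y_pred, average="weighted"))  # weights each class by its support
```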
Ø ML Workflow
• Split dataset into training & testing set
• Perform cross validation on training set to choose best (base)
vectorizer & ML algorithm
• Perform Grid Search Cross Validation on training set using
vectorizer & chosen ML algorithm to get best hyperparameters
• Use the final classifier to make predictions on the testing set (a code sketch follows the citation below)
https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
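A compact sketch of the full workflow on toy data; the hyperparameter grid, fold count & toy tweets are assumptions:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

texts = ["thank you cibc", "love the new app", "great service",
         "sorry about the outage", "rates are flat", "mortgage rates so low",
         "cibc reports earnings", "company provides update", "cibc reaffirms guidance"]
labels = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3

# 1) split dataset into training & testing sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=1/3, stratify=labels, random_state=42)

# 2-3) grid-search the TF-IDF + classifier pipeline on the training set
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", RandomForestClassifier(random_state=42))])
grid = GridSearchCV(pipe, {"clf__n_estimators": [50, 100]},
                    scoring="f1_weighted", cv=2)
grid.fit(X_train, y_train)

# 4) use the final classifier on the held-out test set
print(grid.best_params_, grid.score(X_test, y_test))
```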
5-fold Cross Validation Results
• Best scores were with Multinomial Naive Bayes & XGBoost.
Multinomial Naive Bayes takes the least time to fit
& predict, while XGBoost takes the most
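A sketch of the base-model comparison via cross-validation (folds reduced to fit the toy data; the project used 5; XGBoost omitted here since it requires integer-encoded labels):

```python
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

texts = ["thank you cibc", "love the new app", "great service",
         "sorry about the outage", "rates are flat", "mortgage rates so low",
         "cibc reports earnings", "company provides update", "cibc reaffirms guidance"]
labels = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3

X = TfidfVectorizer().fit_transform(texts)
for model in (MultinomialNB(), LinearSVC(), RandomForestClassifier(random_state=42)):
    scores = cross_val_score(model, X, labels, cv=3, scoring="f1_weighted")
    print(type(model).__name__, round(scores.mean(), 3))
```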
Ø Feature Importance
• The feature-importance words & tweet sentiments make
somewhat intuitive sense, lending confidence in the
explainability of the final model!
Summary
Ø TextBlob was used to assign sentiment labels to tweets. There was inherently some class imbalance in the dataset; the
negative : neutral : positive ratio was 0.15 : 0.43 : 0.40
Ø Data cleaning & visualization were followed by text analytics (removing non-alphanumeric characters, tokenizing sentences, lemmatization,
removing stop words & TF-IDF vectorization), then random oversampling to handle target imbalance
Ø Four different base models were fitted to the training set. The best cross-validated scores were achieved with Multinomial NB &
XGBoost; however, Random Forest was chosen for hyperparameter tuning as it is easier to interpret via per-class feature importances,
with little compromise in score & computational time
Ø A pipeline was then built for TF-IDF & Random Forest, enabling tuning for the best hyperparameters. Post tuning, performance
improved to an F1 score of ~0.65 for negative sentiment & >0.75 for neutral & positive sentiments on unseen test data
Ø For the final model, feature importance was identified for all 3 classes (negative, neutral & positive tweets)
• number_negative_words had high feature importance for predicting negative sentiments & number_positive_words had high
feature importance for predicting positive sentiments
• The words of importance associated with negative tweets were 'flat', 'mortgage', 'single', 'low', 'sorry', 'close' etc.
• The words of importance associated with positive tweets were 'new', 'thank', 'wood', 'latest', 'game', 'love', 'home',
'great', 'good'
• The words of importance associated with neutral tweets were 'http', 'cibc', 'company', 'recruit', 'reaffirm', 'provide'
The feature-importance words & tweet sentiments make somewhat intuitive sense.
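As an illustration, overall feature importances can be mapped back to the TF-IDF vocabulary as below; note this yields model-wide rankings, while the per-class lists above come from the project's own analysis:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy corpus standing in for the pre-processed tweets
texts = ["thank you cibc great service", "sorry rates are flat and low",
         "cibc reaffirms guidance", "love the new app",
         "mortgage close sorry", "company provides update"]
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = RandomForestClassifier(random_state=42).fit(X, labels)

# rank vocabulary terms by the forest's (overall) importance scores
top = np.argsort(clf.feature_importances_)[::-1][:5]
print(vec.get_feature_names_out()[top])
```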