Twitter Sentiment Analysis
The objective of this project is to perform sentiment analysis on Twitter data using machine
learning techniques, specifically logistic regression and Naive Bayes classification algorithms.
The focus is on distinguishing between positive and negative sentiments expressed in tweets. To
achieve this, the project incorporates preprocessing steps including stemming for text cleaning.
By analyzing this data, the aim is to develop models that accurately classify tweet sentiment,
enabling valuable insights into public opinion, customer sentiment, and trends across topics
discussed on Twitter. Ultimately, the project seeks to contribute to the understanding of
sentiment dynamics in social media and provide a tool for businesses, researchers, and
individuals to gauge public sentiment effectively.
DATA COLLECTION AND PREPROCESSING
Twitter API: Used the Twitter API to access and collect a large volume of tweets related to the desired topic.
Keyword Filtering: Filtered the collected tweets based on specific keywords or hashtags relevant to the analysis.
Language Detection: Identified the language of each tweet and removed tweets not in the desired language.
Text Cleaning: Removed unnecessary characters, such as punctuation and special symbols, from the tweet text.
Tokenization: Split the tweet text into individual words or tokens for further analysis.
Stopword Removal: Removed common words, such as 'and', 'the', and 'is', that carry little meaning for sentiment analysis.
Normalisation: Converted all words to lowercase and applied stemming or lemmatization to reduce words to their base form.
Data Sampling: Selected a representative sample of the collected dataset for analysis to reduce computational requirements.
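The cleaning, tokenization, stopword-removal, and normalisation steps above can be sketched as a single pipeline. This is an illustrative sketch only: the stopword list is a small stand-in for a full list (e.g. NLTK's), and `naive_stem` is a crude suffix stripper standing in for a proper Porter stemmer.

```python
import re

# Small illustrative stopword list; the real pipeline would use a fuller
# set such as NLTK's English stopwords.
STOPWORDS = {"and", "the", "is", "a", "an", "to", "of", "in", "it", "this"}

def naive_stem(word):
    # Crude suffix stripping for illustration; Porter stemming is more robust.
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(tweet):
    tweet = re.sub(r"http\S+|@\w+|#", "", tweet)   # strip URLs, mentions, '#'
    tweet = re.sub(r"[^a-zA-Z\s]", "", tweet)      # drop punctuation/symbols
    tokens = tweet.lower().split()                 # normalise case + tokenise
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]
```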
FEATURE ENGINEERING
1. Stemming: Reduces words to their base form (e.g., "running" -> "run"). This conflates related word forms and reduces the number of features.
2. Regular Expressions: Define patterns to identify specific elements (e.g., hashtags, emoticons). Useful for removing irrelevant information or creating sentiment features (e.g., positive emoticons).
3. TF-IDF: Assigns weights to words based on their importance in a document and their rarity across the corpus, focusing the model on informative words for sentiment analysis.
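The TF-IDF weighting described above can be sketched in a few lines. This is the basic term-frequency times inverse-document-frequency scheme; library implementations such as scikit-learn's TfidfVectorizer add smoothing and normalisation on top of it.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenised documents.

    Basic scheme: tf = count / doc length, idf = log(N / doc frequency).
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: one count per doc
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights
```

A term appearing in every document (e.g. "movie" below) gets weight zero, while rarer, more informative terms get positive weight.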
Train-Test Split
1. Train-Test Split: Divides the data into training and testing sets for machine learning.
2. Model Training: The training set teaches the model patterns and relationships in the data.
3. Model Evaluation: The unseen test set assesses the model's ability to generalize to new data.
4. Overfitting Prevention: Helps avoid models that memorize the training data but fail on unseen data.
5. Validation Technique: A crucial step in ensuring robust model performance.
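The split described above can be sketched as follows; this is a minimal stand-in for a library routine such as scikit-learn's train_test_split, with a fixed seed so the split is reproducible.

```python
import random

def train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle indices, then carve off the last test_size fraction as the
    held-out test set."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    rng.shuffle(indices)
    cut = int(len(X) * (1 - test_size))
    train_idx, test_idx = indices[:cut], indices[cut:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test
```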
Model Evaluation
1. Naive Bayes
2. Logistic Regression
Advantages of Naive Bayes
Naive Bayes is simple and efficient, making it computationally inexpensive.
It performs well with high-dimensional data and is less prone to overfitting.
It can handle both binary and multi-class classification problems.
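The Naive Bayes idea can be sketched as a minimal multinomial classifier over token lists with Laplace smoothing. This is illustrative only; the project's actual model would typically come from a library such as scikit-learn.

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        # Log prior: log P(class) estimated from label frequencies.
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)
            self.vocab.update(doc)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}

    def predict(self, doc):
        v = len(self.vocab)
        scores = {}
        for c in self.classes:
            # Log posterior up to a constant: prior + sum of log likelihoods.
            score = self.priors[c]
            for token in doc:
                score += math.log(
                    (self.counts[c][token] + 1) / (self.totals[c] + v))
            scores[c] = score
        return max(scores, key=scores.get)
```

The smoothing term keeps unseen words from zeroing out a class's probability, which is part of why Naive Bayes holds up well on high-dimensional, sparse text data.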
Naive Bayes
The model achieved an accuracy of 0.81 on the training data and 0.74 on the test data.
Logistic Regression
The model achieved an accuracy of 0.83 on the training data and 0.77 on the test data.
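The logistic regression step can be sketched as plain gradient descent on the logistic loss. This is an illustrative toy implementation; in practice one would use a library implementation such as scikit-learn's LogisticRegression.

```python
import math

def train_logreg(X, y, lr=0.5, epochs=200):
    """Binary logistic regression trained with per-example gradient descent.

    X is a list of feature vectors, y a list of 0/1 labels.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid: predicted P(y=1)
            err = p - yi                     # gradient of the logistic loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if z >= 0 else 0
```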
Result Analysis
[Bar chart comparing model accuracies.] Logistic regression slightly outperformed Naive Bayes, with a test accuracy of 0.77 versus 0.74.