Twitter Sentiment Analysis
With the growth of web technology, a huge volume of data is available to internet users, and
more is generated every day. The Internet has become a platform for online learning, exchanging
ideas, and sharing opinions. Social networking sites such as Twitter, Facebook, and Google+ are
rapidly gaining popularity because they allow people to share and express their views on topics,
hold discussions with different communities, and post messages across the world. There has been
a lot of work in the field of sentiment analysis of Twitter data. This survey focuses mainly on
sentiment analysis of Twitter data, which helps analyze the information in tweets, where opinions
are highly unstructured and heterogeneous and are positive, negative, or, in some cases, neutral.
In this paper, we provide a survey and a comparative analysis of existing techniques for opinion
mining, including machine learning and lexicon-based approaches, together with evaluation
metrics. We examine Twitter data streams using machine learning algorithms such as Naive Bayes,
Maximum Entropy, and Support Vector Machines. We also discuss general challenges and
applications of sentiment analysis on Twitter.
The Internet has changed the way people express their views and opinions. This is now done
mainly through blog posts, online forums, product review websites, social media, and similar
channels. Millions of people use social networking sites such as Facebook, Twitter, and Google
Plus to express their emotions and opinions and to share views about their daily lives. Online
communities provide an interactive medium where consumers inform and influence others through
forums. Social media generates a large volume of sentiment-rich data in the form of tweets,
status updates, blog posts, comments, reviews, and so on. Moreover, social media gives businesses
a platform to connect with their customers for advertising. People depend to a great extent on
user-generated online content for decision making. For example, someone who wants to buy a
product or use a service will first look up its reviews online and discuss it on social media
before making a decision. The amount of user-generated content is too vast for an ordinary user
to analyze, so the analysis needs to be automated, and various sentiment analysis techniques are
widely used for this purpose.
1. Introduction:
The proliferation of social media platforms like Twitter has resulted in a huge amount of user-
generated data. Analyzing this data can provide insights into public opinion, brand perception,
market trends, or even political sentiments. Sentiment analysis involves processing text data to
determine the sentiment behind it, classifying it as positive, negative, or neutral. In this project,
we aim to develop a sentiment analysis system for Twitter using machine learning techniques.
2. Problem Statement:
With the explosive growth of social media, platforms like Twitter have become a vital source for
capturing public sentiment on various topics ranging from products and services to political
events and social issues. However, the sheer volume and unstructured nature of this data make it
challenging to extract meaningful insights. Companies, governments, and organizations often
struggle to manually analyze and understand public opinion in real-time, missing opportunities to
respond to trends or public concerns.
This project aims to address the challenge of automatically analyzing sentiments from Twitter
data. The system will classify tweets into positive, negative, or neutral categories using machine
learning techniques. The solution will help businesses, political organizations, or individuals
make data-driven decisions based on public opinion by providing real-time sentiment analysis on
Twitter data. Additionally, the system will visualize sentiment trends, providing a clearer
understanding of the sentiment landscape around specific topics, events, or brands.
Key challenges include:
o Handling short, noisy, and unstructured text data with informal language, emojis, and slang.
o Effectively preprocessing and cleaning tweet data for analysis.
o Selecting and optimizing machine learning models for accurate sentiment classification.
o Ensuring the system can process real-time data efficiently and present visual insights on
sentiment trends.
3. Objectives:
4. Scope:
5. Literature Survey:
Twitter is one of the most widely used microblogging platforms, generating millions of
tweets daily. Its concise 280-character limit forces users to express opinions in short
texts, making it a challenging yet popular source for sentiment analysis. The real-time
nature of Twitter data has been explored in numerous studies. For example, Pak and
Paroubek (2010) proposed a method to automatically collect a corpus of Twitter
messages and apply sentiment analysis, demonstrating Twitter's potential as a social
opinion resource.
Naive Bayes: One of the simplest algorithms used in text classification, Naive Bayes has
been applied successfully in sentiment classification tasks due to its efficiency with large
datasets (Manning et al., 2008).
Support Vector Machines (SVM): Introduced by Cortes and Vapnik (1995), SVM is
known for its robustness in text classification tasks. Pang et al. (2002) reported that
SVM tended to outperform Naive Bayes when classifying the sentiment of movie reviews,
which is commonly attributed to its search for a maximum-margin boundary between
sentiment classes.
Logistic Regression: Another popular technique for text classification, Logistic
Regression, has been successfully applied in sentiment analysis due to its simplicity and
interpretability (Ng and Jordan, 2001).
Text preprocessing is a critical step in sentiment analysis. Bird et al. (2009) describe
the use of the Natural Language Toolkit (NLTK) for preprocessing tasks such as
tokenization, stop word removal, and stemming/lemmatization. These techniques are
important for reducing noise in tweet data, which often contains informal language,
emojis, and misspellings.
Real-time sentiment analysis has been applied in various domains, including market
analysis, political campaigns, and customer service. Bollen et al. (2011) explored the
application of Twitter sentiment analysis for predicting stock market trends, while
O'Connor et al. (2010) compared it with political polling data. Both studies suggest
that Twitter sentiment is related to real-world outcomes, making it a potentially valuable
signal for businesses and organizations.
Short Texts: Tweets are brief, which can make it difficult to capture sufficient context for
sentiment classification.
Noisy Data: Twitter contains a high amount of informal language, slang, hashtags,
emojis, and misspellings, requiring extensive preprocessing.
Sarcasm and Irony: Detecting sarcasm is a major challenge in sentiment analysis.
Traditional algorithms often fail to understand the underlying sentiment in sarcastic
tweets (Riloff et al., 2013).
6. Methodology:
Data Collection:
o Tweets will be collected using the Twitter API based on specific keywords or
hashtags.
o The Tweepy library can be used to interact with the Twitter API for data
collection.
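The collection step could be sketched as below. This is a minimal illustration, not the project's actual collector: it assumes Tweepy v4 and API access with a bearer token, and the helper names `build_search_query` and `fetch_tweets` are hypothetical. The `-is:retweet` and `lang:` operators belong to the Twitter API v2 search syntax; whether they are available depends on the access level.

```python
def build_search_query(keywords, hashtags=(), lang="en", exclude_retweets=True):
    """Build a Twitter API v2 search query from keywords and hashtags (hypothetical helper)."""
    # Quote multi-word keywords so they are matched as phrases.
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    # Accept hashtags with or without the leading '#'.
    terms += [h if h.startswith("#") else f"#{h}" for h in hashtags]
    if not terms:
        raise ValueError("at least one keyword or hashtag is required")
    query = "(" + " OR ".join(terms) + ")"
    if exclude_retweets:
        query += " -is:retweet"
    if lang:
        query += f" lang:{lang}"
    return query


def fetch_tweets(bearer_token, query, limit=100):
    """Return the text of recent tweets matching the query (assumes Tweepy v4 is installed)."""
    import tweepy  # third-party; imported lazily so the query builder works without it

    client = tweepy.Client(bearer_token=bearer_token)
    # The recent-search endpoint accepts between 10 and 100 results per request.
    response = client.search_recent_tweets(
        query=query,
        max_results=min(max(limit, 10), 100),
        tweet_fields=["created_at", "lang"],
    )
    return [tweet.text for tweet in (response.data or [])]


print(build_search_query(["iphone", "battery life"], ["apple"]))
```

Calling `fetch_tweets("<your bearer token>", build_search_query(["iphone"]))` would then return a list of tweet texts, provided the credentials are valid.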
Data Preprocessing:
o Tweets will be preprocessed by converting to lowercase, removing punctuation,
special characters, links, stop words, and applying tokenization.
o Natural Language Processing (NLP) techniques will be used for text processing,
including stemming and lemmatization.
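The cleaning steps listed above can be illustrated with the standard library alone. The sketch below is an assumption-laden example: the stop-word list is a tiny illustrative set, the function name `preprocess_tweet` is hypothetical, and stemming or lemmatization (for example with NLTK) is deliberately left out to keep the example self-contained.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); a real project would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and",
              "in", "on", "it", "this", "that", "i", "my", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
MENTION_RE = re.compile(r"@\w+")                # @user mentions
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_tweet(text):
    """Lowercase, strip links, mentions and ASCII punctuation, tokenize, drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Removing punctuation also strips '#', so a hashtag keeps only its word.
    text = text.translate(PUNCT_TABLE)
    tokens = text.split()  # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess_tweet("I LOVE this phone!!! Check https://example.com @shop #happy"))
# ['love', 'phone', 'check', 'happy']
```

Mentions are removed here as well, which goes slightly beyond the steps named above; whether to keep them is a design choice that depends on the analysis.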
Feature Extraction:
o The cleaned tweet text will be converted into numerical features using techniques
such as Bag of Words (BoW) or TF-IDF (Term Frequency-Inverse Document
Frequency).
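To make the feature extraction step concrete, the following sketch computes TF-IDF vectors from tokenized tweets using the textbook definition tf(t, d) * log(N / df(t)). It is an illustration only: the function names are hypothetical, and libraries such as scikit-learn use smoothed variants of this formula, so their numbers will differ.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of all distinct tokens; its order defines the feature columns."""
    return sorted({tok for doc in docs for tok in doc})


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents."""
    vocab = build_vocabulary(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty tweets
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


docs = [["good", "phone"], ["bad", "phone"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors)
```

In this toy corpus the word "phone" occurs in every tweet, so its IDF is log(1) = 0 and it contributes nothing, while "good" and "bad" carry all the weight.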
Model Building:
o A machine learning model (e.g., Naive Bayes, Logistic Regression, SVM) will be
trained on a labeled dataset of tweets.
o A labeled dataset like Sentiment140 or manually labeled tweets will be used to
train the model.
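As an illustration of the model-building step, here is a minimal multinomial Naive Bayes classifier with Laplace smoothing, written from scratch so that it runs without third-party libraries. The class name and the four-tweet training set are made up for the example; a real model would be trained on a labeled corpus such as the one mentioned above, typically with a library implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over token lists, with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # number of documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {tok for counts in self.word_counts.values() for tok in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / n_docs)  # log prior
            total = sum(self.word_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            for tok in tokens:
                if tok in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (illustrative only).
train_docs = [["love", "phone"], ["great", "service"],
              ["hate", "phone"], ["terrible", "service"]]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesClassifier().fit(train_docs, train_labels)
print(model.predict(["love", "service"]))  # positive
```

The same interface works unchanged with a third "neutral" label, since the classifier simply scores every label it saw during training.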
Model Evaluation:
o The model will be evaluated based on a test dataset using accuracy, precision,
recall, and F1-score.
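The evaluation metrics named above can be computed per class from the true and predicted labels. The sketch below uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of the two); the function name `evaluate` and the sample labels are illustrative.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, per-class report) for two equal-length label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return accuracy, report


y_true = ["pos", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neu"]
accuracy, report = evaluate(y_true, y_pred)
print(accuracy)  # 0.75
print(report["pos"])
```

Reporting the metrics per class matters for tweet data, where one class often dominates and accuracy alone can hide poor performance on the others.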
Visualization:
o Sentiment results will be presented using visualizations such as pie charts, bar
graphs, or time-series graphs to showcase the distribution of sentiments.
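A possible way to prepare the data behind such charts is sketched below. The aggregation functions use only the standard library; the optional `plot_distribution` helper assumes matplotlib is installed. All names, timestamps, and labels are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def sentiment_distribution(labels):
    """Share of each sentiment label, suitable for a pie or bar chart."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def daily_sentiment_counts(records):
    """Group (ISO timestamp, label) pairs by day, suitable for a time-series chart."""
    series = defaultdict(Counter)
    for timestamp, label in records:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        series[day][label] += 1
    return {day: dict(counts) for day, counts in sorted(series.items())}


def plot_distribution(distribution, path="sentiment_pie.png"):
    """Save a pie chart of the distribution (assumes matplotlib is installed)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.pie(list(distribution.values()), labels=list(distribution.keys()),
           autopct="%1.1f%%")
    ax.set_title("Sentiment distribution")
    fig.savefig(path)
    plt.close(fig)


records = [("2024-01-01T10:00:00", "pos"),
           ("2024-01-01T12:00:00", "neg"),
           ("2024-01-02T09:00:00", "pos")]
print(sentiment_distribution([label for _, label in records]))
print(daily_sentiment_counts(records))
```

Keeping the aggregation separate from the plotting makes it straightforward to feed the same numbers into a dashboard instead of static images.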
8. Expected Outcome:
A trained sentiment analysis model capable of classifying tweets into positive, negative,
or neutral categories.
A dashboard for visualizing the sentiment distribution and trends over time.
Insights into public opinion and trends based on the sentiment analysis results.
9. Future Scope:
10. References:
[1] A. Pak and P. Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining", In
Proceedings of the Seventh Conference on International Language Resources and Evaluation, 2010,
pp. 1320-1326.
[2] R. Parikh and M. Movassate, "Sentiment Analysis of User-Generated Twitter Updates using Various
Classification Techniques", CS224N Final Report, 2009.
[3] A. Go, R. Bhayani, and L. Huang, "Twitter Sentiment Classification Using Distant Supervision",
Stanford University, Technical Paper, 2009.
[4] L. Barbosa and J. Feng, "Robust Sentiment Detection on Twitter from Biased and Noisy Data", COLING
2010: Poster Volume, pp. 36-44.
[5] A. Bifet and E. Frank, "Sentiment Knowledge Discovery in Twitter Streaming Data", In Proceedings of the
13th International Conference on Discovery Science, Berlin, Germany: Springer, 2010, pp. 1-15.
[6] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau, "Sentiment Analysis of Twitter Data", In
Proceedings of the ACL 2011 Workshop on Languages in Social Media, 2011, pp. 30-38.
[7] D. Davidov and A. Rappoport, "Enhanced Sentiment Learning Using Twitter Hashtags and Smileys",
COLING 2010: Poster Volume, Beijing, August 2010, pp. 241-249.
[8] P.-W. Liang and B.-R. Dai, "Opinion Mining on Social Media Data", IEEE 14th International Conference
on Mobile Data Management, Milan, Italy, June 3-6, 2013, pp. 91-96, ISBN: 978-1-494673-6068-5,
http://doi.ieeecomputersociety.org/10.1109/MDM.2013.
[9] P. Gamallo and M. Garcia, "Citius: A Naive-Bayes Strategy for Sentiment Analysis on English
Tweets", 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin, Ireland, Aug 23-24,
2014, pp. 171-175.