ProjectFinalReport 2copies
1. Introduction
Twitter is a popular social networking site where millions of people tweet
about topics such as society, politics, sports, and entertainment. The
standard syntax used in tweets involves hashtags, retweets, and user
mentions. Hashtags are words or phrases prefixed with “#”. A user mention refers to
another Twitter user (a person, company, or brand) by placing
the “@” symbol before their username. Tweets thus help people understand
how others feel about ongoing events, government policies, sports tournaments, and so on.
Brands can analyse tweets to learn people’s sentiments towards their products.
The main motivation for Twitter trend analysis is to identify recent trends
across the world using big data and machine learning techniques. This helps to analyze what has
happened in the past and what may happen in the future. It also helps to track customer trends and
interests, especially what customers like, how they behave, and how these change over
time.
Sentiment analysis has gained significant importance in recent years due to the explosive
growth of social media and the vast amount of user-generated content available. It allows
businesses, researchers, and decision-makers to gain a deeper understanding of the public's
perception and sentiment towards products, services, brands, or even broader topics such as
social issues and political events.
The proposed methodology consists of collecting static and
real-time tweets from Twitter and performing trend analysis on both.
The tweets are first pre-processed for further analysis. Various machine
learning techniques are then applied to the static and real-time tweets to
analyse the trends.
Sentiment analysis helps users judge whether the information about a
product is satisfactory before they acquire it. The pre-processed tweets are run through
different machine learning algorithms, which reveal the polarity of the tweets. The
algorithms used were support vector regression, decision tree, random forest, and
multinomial logistic regression, of which the decision tree showed the highest accuracy.
1.3. Objectives
2. Literature Review
Sentiment analysis of Twitter data is under constant study within the research
community, as its applications have a large influence on how different industries operate
today. The main challenges for this type of analysis are the variation in language and the
complex structure of the extracted data.
As stated in the section above, sentiment analysis can be applied to politics. Tumasjan et al.
recognized its benefits for elections and used it to predict the results of the 2009
German federal election. They extracted approximately 100,000 tweets
mentioning the political parties of that time and region, and analysed the tweets to derive
their sentiment. For this they used the text-analysis software Linguistic Inquiry and
Word Count (LIWC2007), which derives sentiment from textual analysis.
The results of this analysis were very close to the actual results of the
election. Another interesting study was carried out by Dr Rajiv and his
colleagues. They applied sentiment analysis in a new way, using it to improve
responses in crisis situations. They collected data from 2014
about a flood that occurred in Kashmir. Their data set consisted of
almost 8490 tweets, to which a naïve Bayes classification technique was applied. Their
research showed that applying sentiment analysis in such crises could help the
government save lives.
The system aims to classify sentiment polarity from text, distinguishing between positive,
negative, and neutral sentiments. This classification can occur at the sentence level, document
level, or even at the entity and aspect level. The software requirements specification (SRS)
emphasizes the need to preprocess Twitter data effectively, ensuring the data is prepared for
training using machine learning techniques. This involves eliminating unnecessary noise, such as
slang, abbreviations, URLs, and special characters, thereby reducing the dataset's size while
maintaining data integrity. After preprocessing the data, to obtain strong performance and
accurate results, we used a transformer model from deep learning together with the VADER
lexicon. The SRS also highlights the importance
of real-time tracking of relevant Twitter content pertaining to a brand. By continuously monitoring
brand mentions, the system can perform analysis as topics or issues emerge, enabling brands to
stay proactive and detect anomalies. This functionality empowers brands to enhance customer
engagement and deliver better experiences worldwide. To visualize the overall results, we used
different types of graphs from several Python libraries. The sentiment analysis project's SRS
combines the objectives of accurate sentiment classification, efficient data preprocessing, high
performance, real-time tracking, and intuitive visualization, thereby facilitating the development
of a powerful and comprehensive sentiment analysis solution.
3.1 Methodology
Dataset Collection
A major part of solving any problem with machine learning is obtaining a proper dataset for
training the model. This means gathering or identifying data that correlates
with the outcomes the system should predict. To find the polarity of tweets, we rely on
natural language processing. Acquiring the dataset is the first step in machine learning:
to build and develop machine learning models, you must first acquire the relevant dataset. This
dataset comprises data gathered from multiple, disparate sources, which are then
combined in a proper format. Dataset formats differ according to use case. For
instance, a business dataset will be entirely different from a medical dataset: a business
dataset contains relevant industry and business data, whereas a medical dataset includes
healthcare-related data. You can also create a dataset by collecting data via different Python
APIs. Once the dataset is ready, it must be stored in a file format such as CSV, HTML, or XLSX.
Collection of Dataset
The first step is the collection of the dataset, which is made up of data collected
from various sources and then integrated into a single, consistently formatted
dataset.
Data Pre-processing
The adjustments applied to the data before feeding it to the algorithm are referred to as
pre-processing. Data pre-processing is a technique for converting raw data into a clean data
set. Whenever data is acquired from various sources, it arrives in raw
format, which is not suitable for direct analysis.
Data preparation also covers exploratory data analysis and visualization, which precede
statistical modelling and help confirm that the data contains no obvious errors. In short, data
pre-processing is the process of transforming raw data into an understandable format.
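As a minimal sketch of the kind of cleaning described above, the following function strips URLs, user mentions, the hashtag symbol, and special characters from a tweet using only Python's standard `re` module. The function name `clean_tweet` and the specific cleaning rules are illustrative assumptions, not the project's actual pipeline.

```python
import re


def clean_tweet(text: str) -> str:
    """Return a lowercased tweet with URLs, mentions, and symbols removed.

    Hypothetical helper for illustration; the rules below are assumptions.
    """
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)                    # drop user mentions
    text = text.replace("#", " ")                        # keep the hashtag word, drop '#'
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)          # drop special characters
    return re.sub(r"\s+", " ", text).strip().lower()     # collapse whitespace


print(clean_tweet("Loving the #WTCFinal! @ICC https://t.co/abc"))
```

Slang and abbreviation handling would need a lookup table and is left out of this sketch.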
Data Visualization
Data visualization is the graphical representation of information and
data. By using visual elements like charts, graphs, and maps, data visualization techniques
provide an accessible way to see and understand trends, outliers, and patterns in data. Data
visualization provides an important suite of tools for gaining a qualitative understanding. This
is helpful when exploring a dataset to learn what it contains, and
can help with identifying patterns, corrupt data, outliers, and much more.
Dataset Splitting
Data splitting divides data into two or more subsets. Typically, with a two-part
split, one part is used to train the model and the other to evaluate or test it. Data splitting is
an important aspect of data science, particularly for creating models based on data. This technique
helps ensure that data models, and the processes that use them (such as machine
learning), are accurate. In this project, 70% of the data is used for training and 30% for testing.
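The 70/30 split can be sketched with the standard library alone. The function name `split_dataset`, the fixed seed, and the choice to shuffle before splitting are assumptions made for illustration. The report does not say how the split was performed.

```python
import random


def split_dataset(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)          # reproducible shuffle
    cut = round(len(shuffled) * train_fraction)    # index where the test part starts
    return shuffled[:cut], shuffled[cut:]


train, test = split_dataset(range(100))
print(len(train), len(test))
```

Shuffling first avoids a split that follows the order in which the tweets were collected. The fixed seed keeps the split repeatable between runs.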
Modelling
Modelling means training a machine learning algorithm to predict the labels
from the features, tuning it for the business need, and validating it on holdout data. A machine
learning model is built by learning and generalizing from training data, then applying that acquired
knowledge to new, previously unseen data to make predictions. We used the
sentiment module of the Natural Language Toolkit (NLTK) library to determine the polarity of tweets.
A further goal is to develop a tool that compares a given piece of text to a predefined reference
text. To achieve this goal, the project uses natural language processing techniques to analyze and
compare the texts at a semantic level. The script splits the predefined text and the user text into
lists of words, and then iterates through the words in the user text. If a word is found in the
predefined text, the score is increased by 1. The script then calculates the percentage of similarity
by dividing the total score by the length of the predefined text and multiplying the result by 100.
Finally, the script prints the percentage of similarity.
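The comparison steps described above can be sketched as follows. The function name `similarity_percentage` and the lowercasing step are assumptions. The scoring rule (one point per user word found in the reference, divided by the reference length in words) follows the description.

```python
def similarity_percentage(reference_text: str, user_text: str) -> float:
    """Percentage of word overlap between a user text and a reference text."""
    reference_words = reference_text.lower().split()
    user_words = user_text.lower().split()
    if not reference_words:          # avoid dividing by zero on an empty reference
        return 0.0
    score = 0
    for word in user_words:
        if word in reference_words:  # one point per user word found in the reference
            score += 1
    return score / len(reference_words) * 100


print(similarity_percentage("the quick brown fox", "the fox jumps"))
```

This measure is lexical rather than semantic: synonyms score zero, and a user text that repeats a matching word can push the result above 100%.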
3.2 Algorithms
A support vector machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks.
Its primary goal is to find an optimal hyperplane that separates data points into different classes
with the maximum margin. The margin is the distance between the hyperplane and the nearest
data points from each class. SVM can handle linear and non-linear data by using different
kernel functions to map the data into a higher-dimensional space.
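To make the notion of margin concrete, the sketch below computes the distance from each point to a given hyperplane w·x + b = 0 and takes the smallest one. It does not train an SVM. The weights, bias, and points are made-up values, and the helper names are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points):
    """Smallest absolute distance from any point to the hyperplane."""
    return min(abs(signed_distance(w, b, p)) for p in points)


w, b = [3.0, 4.0], 0.0                # illustrative hyperplane, not a trained model
points = [[3.0, 4.0], [-1.0, -1.0]]   # one point on each side
print(margin(w, b, points))
```

An SVM searches for the `w` and `b` that make this smallest distance as large as possible while keeping the classes on opposite sides.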
A recurrent neural network (RNN) is a type of neural network architecture designed to handle sequential data, where
the order of data points matters. Unlike traditional feedforward neural networks, RNNs have
connections between nodes that form directed cycles, allowing them to retain information
about previous inputs and make decisions based on the context. RNNs utilize recurrent
connections to propagate information through time, which enables them to process sequences
of varying lengths. This makes them suitable for tasks like natural language processing, speech
recognition, and time series analysis.
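The recurrence can be illustrated with a single scalar hidden state, updated as h_t = tanh(w_x * x_t + w_h * h_(t-1) + b). This is a toy sketch with hand-picked weights, not the network used in the project. The function name and default values are assumptions.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit RNN over a sequence and return every hidden state."""
    h = 0.0                                    # initial hidden state
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)   # new state depends on the previous one
        states.append(h)
    return states


# The same values in a different order end in different final states.
print(rnn_forward([1.0, 0.0])[-1])
print(rnn_forward([0.0, 1.0])[-1])
```

Because each state feeds into the next, the loop handles sequences of any length with the same three parameters.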
Logistic Regression:
Logistic Regression is a statistical model used for binary classification problems, where the
goal is to predict a binary outcome (e.g., true/false, yes/no). Despite its name, logistic
regression is a classification algorithm rather than a regression algorithm. It estimates the
probability of the binary outcome using a logistic function (also called the sigmoid function),
which maps any real-valued input into a value between 0 and 1. The algorithm learns the
optimal coefficients for the input features to maximize the likelihood of the observed data.
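A short sketch of the prediction step: a weighted sum of the features is passed through the sigmoid to obtain a probability between 0 and 1. The weights below are made-up values rather than learned coefficients, and `predict_proba` is a hypothetical helper name.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)                  # avoids overflow for large negative z
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


print(predict_proba([1.5, -2.0], 0.1, [1.0, 0.5]))
```

Training would adjust `weights` and `bias` to maximize the likelihood of the labelled data. A threshold such as 0.5 then turns the probability into a class.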
4. DESIGN
Fig. 1 shows the system architecture of the proposed sentiment analysis system and the
workflow of the model. First, the model is trained using a dataset containing positive and negative
tweets. Feature extraction is the process of transforming raw data into numerical features that can
be processed while preserving the information in the original data set. After the feature extraction phase,
sentiment analysis is performed on the pre-processed dataset using different machine learning algorithms.
Finally, the system outputs the classified tweets with their sentiment scores.
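One common form of feature extraction for text is a bag-of-words count vector. The report does not state which representation was used, so the sketch below is an assumption chosen for illustration, with hypothetical helper names.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct word in the documents to a column index."""
    vocab = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(vocab)}


def vectorize(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0) for word in vocabulary]  # dict keeps index order


vocabulary = build_vocabulary(["good phone", "bad phone"])
print(vectorize("good good phone", vocabulary))
```

Each tweet becomes a fixed-length list of numbers that a classifier can consume. Words outside the vocabulary are ignored.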
Python:
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code
readability through the use of significant indentation. Python is dynamically typed and garbage-collected. It
supports multiple programming paradigms, including structured (particularly procedural),
object-oriented, and functional programming.
Machine Learning:
Machine learning is a growing technology that enables computers to learn automatically from past
data. It uses various algorithms to build mathematical models and make
predictions from historical data or currently available information.
Sentiment Analysis
The sentiment analysis process was applied to the preprocessed tweets to gain insights into the
sentiment expressed in the textual content. Two methods, VADER and Transformers, were utilized
in this analysis to capture different aspects of sentiment. The following is an overview of these
methods and how they were incorporated into the sentiment analysis process:
4. Combined Sentiment Score and Label: To derive the final sentiment analysis results, the
sentiment scores and labels obtained from VADER and Transformers were combined. The
combined sentiment score was calculated by taking the average of the compound score from
VADER and the sentiment score from Transformers. The combined sentiment label was
determined based on this combined score. If the combined score was equal to or greater than 0,
the sentiment label was considered positive; otherwise, it was considered negative.
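The combination rule above can be sketched as a small function. It assumes the Transformers output has already been mapped to a signed score in [-1, 1], for example by negating the confidence of a negative prediction. The report does not say how this mapping was done, and `combine_sentiment` is a hypothetical name.

```python
def combine_sentiment(vader_compound: float, transformer_score: float):
    """Average the two scores and label the result.

    Assumes both inputs lie in [-1, 1]; a score of exactly 0 counts as positive,
    following the rule described in the text.
    """
    combined = (vader_compound + transformer_score) / 2
    label = "positive" if combined >= 0 else "negative"
    return combined, label


print(combine_sentiment(0.6, -0.2))
print(combine_sentiment(-0.8, 0.2))
```

With this rule there is no neutral label: any tweet whose two scores cancel out is reported as positive.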
Dataset 1: Redbull Racing
The dataset on Redbull Racing consists of 100 text snippets related to the Redbull Racing team and their
activities. The dataset provides insights into various aspects of Redbull Racing, such as partnerships,
content creation, web news, motorsports, and other related topics. Each snippet in the dataset is labelled
with sentiment polarity, indicating whether the sentiment expressed is positive, negative, or neutral.
"Ysten Labs SUI Network partners with Red Bull Racing team" - Sentiment: Positive
"Do you see your favourite content creators running HR races and want to get in on the action but are a
bit nervous to start? Fear not! I've got some tips and advice on how to get started in this piece I wrote
for..." - Sentiment: Positive
"Today in web, your daily dose of web news: Getty Images launching second NFT collection, SUI collabs
with Redbull Racing, Gamestop joins forces with Telos, OpenAI CTO falls to a Twitter hack, Magic
raises $1 million in funding led by PayPal Ventures" - Sentiment: Positive
"The greatest days in motorsports sure impressed this year, and how about that Penske power!" -
Sentiment: Positive
"Friday's recap: Soon to be integrated with partners with the racing team, United States dollar-pegged
stablecoin to launch a new native version of USDC on..." - Sentiment: Positive
This dataset provides a diverse range of text snippets, allowing for sentiment analysis and further
exploration of Redbull Racing-related topics. The sentiment labels can be used to train models or analyze
sentiment trends and patterns related to Redbull Racing.
Dataset 2: Inflation
The dataset on inflation contains 100 text snippets discussing various aspects of inflation, such as its
causes, effects, economic implications, government policies, and related topics. Each snippet in the
dataset is labelled with sentiment polarity, indicating whether the sentiment expressed is positive,
negative, or neutral.
Dataset 3: Moto Edge 40
The dataset on Moto Edge 40 consists of 100 text snippets related to the Moto Edge 40 smartphone
model. The snippets may include discussions about the phone's features, specifications, user reviews,
comparisons with other models, and more. Each snippet in the dataset is labelled with sentiment polarity,
indicating whether the sentiment expressed is positive, negative, or neutral.
Dataset 4: IPL Final
The dataset on IPL Final includes 100 text snippets related to the Indian Premier League (IPL) cricket
tournament's final match. The snippets may cover various aspects of the final, such as team performances,
player performances, match highlights, fan reactions, and more. Each snippet in the dataset is labelled
with sentiment polarity, indicating whether the sentiment expressed is positive, negative, or neutral.
These datasets can be used for sentiment analysis, text classification, trend analysis, or any other relevant
analysis related to the respective topics. The sentiment labels enable the identification of sentiment
patterns and trends within the datasets, aiding in understanding public opinion and sentiment towards the
specific topics of interest.
5.4.Interface Implementation:
The sentiment analysis project incorporates a user-friendly interface using Gradio, allowing users to input
the tweet dataset file and custom word dictionary file. The interface seamlessly integrates the sentiment
analysis functionality, generating sentiment analysis output files ("sentiment_analysis_tweets.csv" and
"sentiment_analysis_score.csv") while providing visualizations such as bar graphs, pie charts, and
histograms to visualize the sentiment distribution of the analyzed tweets. This interface enhances the
usability and accessibility of the sentiment analysis tool, enabling users to perform sentiment analysis
tasks efficiently and gain insights into the sentiment patterns within the dataset.
System Requirements
Hardware interface:
Software Requirements:
Dataset: CSV
5.6.Screenshots
OUTPUTS:
Fig 5. Bar Plot (WTC Final) Fig 6. Scatter Plot (WTC Final)
The screenshots above give a detailed comparison of the model's output. Fig. 1 shows the output as a pie
chart, where the blue portion indicates positive tweets and the orange portion negative tweets.
Fig. 2 shows a histogram of the overall sentiment distribution. Fig. 3 is an area chart
showing the range over which the sentiment scores are distributed. Fig. 4 is a bubble chart
that represents positive tweets with green dots and negative tweets with red dots. Fig. 5 shows a simple
bar chart for a broad view of the output. Fig. 6 shows a scatter plot, which likewise represents
positive sentiment scores with green dots and negative scores with red dots.
6. Model Testing
Table of Accuracy:
Here is the table of accuracy results with the method used for each model:
Test Data 1: “I Used oneplus 10r my experience was good but Nowadays oneplus going down.”
Negative: 0.2613, Neutral: 0.0010
Negative: 0.7047, Neutral: 0.0557
Negative: 0.0344, Neutral: 0.8820
Test Data 2: “whine all you want but ChatGPT is genuinely useful for us students. lol.”
Negative: 0.2613, Neutral: 0.0010
Negative: 0.2072, Neutral: 0.1854
Neutral: 0.9015
Note: The accuracy for every model is different and may require further.
Here we compared different NLP and deep learning methods on two test inputs. Table
1 shows the accuracy on Test Data 1 and Table 2 on Test Data 2. From the above comparison we
can say that the deep learning transformer model gives more accurate results than
RNN, SVM, and logistic regression.
The sentiment analysis was performed on the preprocessed tweets using the combined approach of
VADER and Transformers. The results provide valuable insights into the sentiment expressed in the
collected tweets. The following summarizes the key findings and analysis of the sentiment analysis
results:
1. Distribution of Sentiment Labels: Visualizations such as bar graphs, pie charts, and histograms
were employed to showcase the distribution of sentiment labels derived from the sentiment
analysis. The bar graph displays the count of positive, negative, and neutral sentiment labels,
allowing us to understand the overall sentiment distribution. The pie chart provides a visual
representation of the proportion of each sentiment label, highlighting the dominant sentiment.
Additionally, the histogram depicts the distribution of the combined sentiment scores, enabling
us to identify the sentiment trends across the dataset.
2. Sentiment Patterns and Trends: Analyzing the sentiment analysis results revealed several
notable findings and trends. By examining the distribution of sentiment labels, we observed the
dominant sentiment prevailing within the collected tweets related to the given topic. Furthermore,
analyzing the combined sentiment scores allowed us to identify the intensity and polarization of
sentiment. This analysis shed light on whether the sentiment expressed in the tweets was
predominantly positive, negative, or neutral and provided insights into the overall sentiment
tendencies within the dataset.
3. Insights into Tweet Sentiment: The sentiment analysis results offer valuable insights into the
sentiment of the collected tweets related to the given topic. Through the sentiment labels and
combined sentiment scores, we gained a deeper understanding of the public opinion and
sentiment surrounding the topic. By examining the positive and negative sentiment labels, we
identified the aspects or factors that elicit positive or negative sentiment among Twitter users.
These insights can contribute to understanding public sentiment, guiding decision-making, and
identifying areas for improvement or further investigation related to the given topic.
The visualizations and analysis of the sentiment analysis results provide a comprehensive overview
of the sentiment expressed in the collected tweets. By examining the distribution of sentiment labels
and combined sentiment scores, notable patterns and trends can be identified, allowing for a deeper
understanding of public sentiment regarding the given topic. The insights gleaned from the sentiment
analysis results serve as a valuable resource for decision-makers, researchers, and stakeholders
seeking to comprehend the public opinion landscape and make data-driven decisions.
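The counts behind the bar graph and the proportions behind the pie chart can be computed in a few lines before plotting. This is a sketch with made-up labels. The function name `label_distribution` is hypothetical and no plotting library is assumed.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, percentage)} for a list of sentiment labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total * 100) for label, n in counts.items()}


labels = ["positive", "positive", "negative", "positive"]  # illustrative data
for label, (count, percent) in label_distribution(labels).items():
    print(f"{label}: {count} tweets ({percent:.1f}%)")
```

The counts feed a bar chart directly, and the percentages give the pie chart slices.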
6.2 Validation
To validate our project work, we conducted a comparison of our sentiment analysis model with various
existing models available on the internet. We utilized the same topics from Twitter and gathered results
from different models. Here are the outcomes:
The Twitter-roBERTa-base sentiment analysis model yielded a positive sentiment rate of 78.66%.
The Monkey Learn sentiment analysis model produced a positive sentiment rate of 82.2%.
Through this evaluation, our model demonstrated superior performance, surpassing the accuracy of the
other models. These results indicate the effectiveness and reliability of our sentiment analysis approach.
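Table 3. Confidence scores on the WTC Final dataset
Model                   Confidence score
VADER                   62.0%
MonkeyLearn             54.9%
VADER+TRANSFORMERS      81.2%
Table 4. Confidence scores on the Android 13 dataset
Model                   Confidence score
VADER                   56.0%
MonkeyLearn             41.8%
VADER+TRANSFORMERS      79.6%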
Tables 3 and 4 present the comparative analysis results for two different
datasets: the WTC Final dataset and the Android 13 dataset, respectively. The confidence scores of three
sentiment analysis models, namely VADER, MonkeyLearn, and VADER+TRANSFORMERS, are
reported for each dataset.
For the WTC Final dataset, the VADER model achieved a confidence score of 62.0%, while
MonkeyLearn obtained a slightly lower score of 54.9%. On the other hand, the combined approach of
VADER+TRANSFORMERS demonstrated a significantly higher confidence score of 81.2%. These
results indicate that VADER+TRANSFORMERS outperformed both VADER and MonkeyLearn in
analysing sentiment for the WTC Final dataset.
Moving on to the Android 13 dataset, the VADER model attained a confidence score of 56.0%.
MonkeyLearn, on the other hand, achieved a lower score of 41.8%. Similar to the previous dataset, the
VADER+TRANSFORMERS approach showcased superior performance with a confidence score of
79.6%. Hence, once again, VADER+TRANSFORMERS emerged as the top-performing model for
sentiment analysis in the context of the Android 13 dataset.
7. Conclusion
In conclusion, this project aimed to analyze the sentiment expressed in tweets related to a given topic
using a combined approach of VADER and Transformers sentiment analysis methods. The key
findings and outcomes of the project are summarized below:
1. Key Findings: The sentiment analysis of the collected tweets provided valuable insights into the
sentiment expressed by Twitter users regarding the given topic. By analyzing the distribution of
sentiment labels and combined sentiment scores, we gained a comprehensive understanding of
the overall sentiment tendencies within the dataset. The visualizations showcased the prevalence
of positive, negative, and neutral sentiments, allowing us to identify sentiment patterns and
trends.
2. Effectiveness and Limitations: The sentiment analysis approach employed in this project
utilizing both VADER and Transformers demonstrated effectiveness in capturing and analyzing
sentiment in the tweets. The incorporation of a custom word dictionary enhanced the sentiment
analysis by refining the sentiment scores generated by VADER. However, it is important to
acknowledge the limitations of sentiment analysis, such as the reliance on textual data and the
inherent challenges in accurately capturing nuanced sentiment. While the approach provided
valuable insights, there may be cases where the analysis falls short in fully capturing the
complexity and context of sentiment expressed in the tweets.
3. Areas of Improvement and Future Work: There are several potential areas of improvement
and avenues for future work in this project. Firstly, expanding the dataset by collecting more
tweets or exploring multiple data sources can provide a broader understanding of public
sentiment. Additionally, fine-tuning the sentiment analysis models, incorporating domain-
specific lexicons, or exploring advanced natural language processing techniques may further
improve the accuracy and granularity of sentiment analysis results.