
Department of Electrical and Computer Engineering

North South University

Senior Design Project


Sentiment Analysis on Twitter Data

Md. Redwan Hossan ID # 1610322042


Mohammad Burhan Uddin ID # 1812673642
Al Nur Istiak ID # 1711989642

Faculty Advisor:

M. Rashedur Rahman
Professor, ECE Department
Abstract

Introduction

Literature Review

Technical Description and Implementation

Recurrent Neural Network (RNN)

LSTM

Dataset

Lexicon-based sentiment analysis

Tweepy

Results and Discussion

Conclusion

References

Abstract

This project focuses on Opinion Mining, also known as Sentiment Analysis, applied to the
collected dataset. The analysis techniques include machine learning, deep neural
networks, a lexicon-based approach, and non-machine-learning naïve approaches.

The advent of social networks has opened the possibility of having access to
massive numbers of blogs, recommendations, and reviews. The challenge is to extract the
polarity from these data, which is the task of opinion mining or sentiment analysis. It
has vast implications for automation and Artificial Intelligence based applications.
Every minute, people post different types of opinions on various online
platforms, and analyzing those opinions manually is very difficult. To analyze
opinions automatically, researchers have had to devise automated systems.
Nowadays, different types of approaches are used for analyzing such data. Among
them, machine learning based approaches are the most popular, but other
approaches, such as lexicon-based ones, are also used in this area.

This research will be conducted on real-time tweet data from the social media
platform Twitter. The project needs access to real-time user tweets to analyze
their polarity. For accessing Twitter data in real time, the project will use
Tweepy, an open-source Python library for the Twitter API.

Our targeted goal is the implementation and integration of a sentiment analysis
pipeline into an ongoing open-source cross-media analysis system. The chat-room
cleaning, NLP, and sentiment analyzer are all part of the pipeline. We examine
two main categories of sentiment analysis methodologies, lexicon-based and
machine learning approaches, before integrating them. We are mostly interested
in determining which strategy is best for detecting sentiment in forum
discussion postings.

We would also employ the Apache-Hadoop framework with its lexicon-based
sentiment prediction algorithm and the Stanford CoreNLP package with the
Recursive Neural Tensor Network (RNTN) model to conduct our studies. The
lexicon-based algorithm uses a sentiment dictionary containing words annotated
with sentiment labels and other basic lexical features, while the latter is
trained on the Sentiment Treebank.

User reviews can be analyzed using sentiment analysis to detect positive, negative,
and neutral information. Researchers have developed many methodologies for
Sentiment Analysis; typically, a single machine learning algorithm is used.
This project leverages Twitter sentiment data to locate aspect phrases in each
review, identify parts of speech, and apply classification algorithms to
determine each review's positivity, negativity, and neutrality scores. Vector
representations can be used to compute a variety of vector-based features and
to run a series of studies demonstrating their efficacy.

Introduction
People nowadays tend to share their opinions through various social media
platforms. Twitter, among those platforms, is a place where users share their
thoughts on recent subjects of interest. Everyone has their own point of view,
which they convey through features like tweeting on a subject, retweeting
someone's tweet, or posting something related to a trending topic, oftentimes
using the trending hashtags. Our target is to find the topics trending at any
given moment by going through the trending hashtags and then gather tweets in a
specific language, English for our project. We would then run sentiment
analysis on the collected tweets by employing the Apache-Hadoop framework with
its lexicon-based sentiment prediction algorithm and the Stanford CoreNLP
package with the Recursive Neural Tensor Network (RNTN) model. The
lexicon-based approach uses a sentiment dictionary containing words annotated
with sentiment labels and other basic lexical features, while the latter
approach is trained on the Sentiment Treebank, which contains 215,154 phrases
labeled using Amazon Mechanical Turk. We would later compare the results of
these two methods in terms of accuracy and determine the better of the two.

For the users of our product, there would be an interface where a user may
choose a country from several options; upon choosing, the interface would
display the hashtags trending in that country. The user may then select a
hashtag and would be presented with a fixed number of tweets on that topic from
the chosen country. Upon completing sentiment analysis on the tweets, the
system would show the results accumulated from both the DNN and lexicon-based
approaches, alongside a word cloud consisting of the keywords used for
sentiment analysis. Real-time Twitter data collection would be conducted using
Tweepy, an open-source Python library for the Twitter API.

We have observed that many sensational topics have arisen in different parts of
the world over time, and people have expressed their positions, opinions, and
sentiments on social media. It is essential to have a better understanding of
the sentiment related to these sensitive trending topics. On Twitter, users
tend to leave concise statements, which is why we chose this platform: these
statements are our key assets for conducting sentiment analysis.

Literature Review
In recent times, we have seen or read about Twitter data sentiment analysis
using CNN, LSTM, or bi-directional LSTM models, and these models have achieved
success. Most of these approaches targeted a single particular subject, for
example COVID-19 or product reviews. We aim at no particular subject or topic;
rather, we would like to give users diverse options to choose from. Users can
pinpoint their point of interest through parameters such as selecting a country
and then confirming a trending hashtag from that country that best matches
their interest. We would also like to compare the results obtained through both
the Long Short-Term Memory model and the lexicon-based approach for sentiment
prediction.

Technical Description and Implementation

Recurrent Neural Network(RNN)
A Recurrent Neural Network, or RNN, is a neural network in which the output of
a module often feeds back into a module of the same type, which is where the
name comes from. (Its tree-structured cousin, the recursive neural network, is
the basis of the RNTN model used later in this project.) The idea of building
neural networks from smaller neural network "modules" that can be composed
together is not very commonly used in general, but it has been very successful
in NLP operations. RNNs are good at modeling sequence data because of the
concept of sequential memory. Sequential memory is the mechanism that makes it
easier for our brains to recognize sequence patterns; an RNN captures this
abstract concept and replicates it through its structure. Like a feed-forward
neural network, an RNN consists of three layers: an input layer, a hidden
layer, and an output layer. In addition, it has a looping mechanism that acts
as a highway allowing information to flow from one step to the next. This
information is referred to as the hidden state, which is a representation of
the previous inputs.

An unrolled recurrent neural network.

Here, a module A looks at some input x_t and outputs a value h_t. A loop allows
information to be passed from one step of the network to the next. This
chain-like attribute indicates that recurrent neural networks are deeply
related to sequences and lists: an RNN can be thought of as multiple copies of
the same network, each passing a message to a successor.
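As a minimal sketch of this recurrence (the weight names and shapes here are illustrative, not from the report), each step combines the current input with the previous hidden state, and unrolling the network is just repeated application of the same step:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the new hidden state combines
    the current input x_t with the previous hidden state h_prev."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Unroll the RNN over a sequence of inputs."""
    h = h0
    states = []
    for x_t in xs:  # the same weights are reused at every time step
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states
```

Note how only the hidden state changes from step to step; the weight matrices are shared across the whole sequence.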

However, this sort of model comes with its own unique shortcomings. One such
shortcoming is what we call short-term memory, which is due to the way it sends
information from one step to another. The main culprit behind short-term memory
is the vanishing gradient problem. It arises from backpropagation through time,
the procedure used to train and optimize an RNN by fine-tuning the weights of
the network based on the error rate, or loss.

Training a neural network has three major steps:

1. It does a forward pass and makes predictions.

2. It compares the predictions with the ground truth using a loss function.
This outputs an error value, E, an estimate of how badly the network is
performing.

3. Using the error value, it performs backpropagation, calculating a gradient
for each node in the network. The gradient is the value used to adjust the
network's internal weights, allowing the network to learn; bigger gradients
lead to bigger changes.

While doing backpropagation, each node in a layer calculates its gradient using
the gradient of the layer processed just before it in the backward pass. If the
adjustments in that layer are small, the adjustments in the current layer will
be even smaller, causing the gradients to shrink exponentially as they
propagate backward.

Gradient update rule: new weight = weight - learning rate * gradient

The earliest layers thus fail to do any learning, as their internal weights are
barely adjusted due to the extremely small gradients. To eliminate this
shortcoming, we will use a special kind of RNN: the Long Short-Term Memory
(LSTM) model.
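A toy numeric illustration (not part of the report's experiments) makes the shrinkage concrete: backpropagation through time multiplies the gradient by a factor at every step, and factors below one shrink it exponentially:

```python
def backprop_gradient(initial_grad: float, factor: float, steps: int) -> float:
    """Scale a gradient by the same per-step factor, as backpropagation
    through time does when unrolling an RNN over many time steps."""
    grad = initial_grad
    for _ in range(steps):
        grad *= factor  # each earlier time step sees a further-scaled gradient
    return grad

# With a per-step factor of 0.5, after 20 steps the gradient is about one
# millionth of its original size, so under the update rule
# new_weight = weight - learning_rate * gradient the earliest weights barely move.
print(backprop_gradient(1.0, 0.5, 20))
```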

LSTM
LSTM is an evolved version of the recurrent neural network, created as a
solution to the short-term memory problem that arises during backpropagation.
LSTMs have internal mechanisms called gates that can regulate the flow of
information. These gates can learn which data in a sequence is important to
keep or throw away; by doing this, the network learns to use relevant
information to make predictions.

In this model, words are transformed into machine-readable vectors. The network
then processes the sequence of vectors one by one. While processing, it passes
the previous hidden state to the next step of the sequence; the hidden state
acts as the neural network's memory, holding information on previous data the
network has seen. First, let's see how the hidden state is calculated.

The following diagram indicates the initial state of a hidden layer.

First, the input and the previous hidden state are combined to form a vector. This
vector has information about the current input and previous inputs.

The vector goes through the tanh activation, and the output is the new hidden
state.

The tanh activation is used to help regulate the values flowing through the
network. This function squashes values so that they always lie between -1 and
1. Without it, some values could grow exponentially, making other values seem
insignificant; tanh keeps such values in check.

An LSTM has the same control flow as a recurrent network: it processes data
sequentially, passing information along as it propagates forward. The
differences between a classic RNN and an LSTM are the operations within the
LSTM cells, which allow the LSTM to keep or forget information.

The core concepts of the LSTM are the cell state and its various gates.
Illustrated below is an overview of the cell state and its gates.

The cell state works as a transport highway that transfers relevant information
all the way down the sequence chain. Information from earlier time steps can
thus be carried all the way to the last time step, reducing the effects of
short-term memory.

The gates are small neural networks that decide which information is allowed in
the cell state; they learn during training what information is relevant to keep
or forget. These gates contain sigmoid activations. The sigmoid activation is
similar to tanh, but it squashes values between zero and one, where 1
represents "completely keep this" and 0 represents "completely get rid of this".

Three gates regulate the information flow: the forget gate, the input gate, and
the output gate.

Firstly, the forget gate decides which information should be kept and which
thrown away. Information from the previous hidden state and from the current
input is passed through the sigmoid function, producing values between zero and
one: closer to zero indicates forget, closer to one indicates keep.

To update the cell state, we have the input gate. First, we pass the previous
hidden state and the current input to a sigmoid function, which determines
which values will be updated by squashing them between zero and one. We also
pass the hidden state and the current input to the tanh function to squash the
values between -1 and 1. The tanh output then gets multiplied by the sigmoid
output, so the sigmoid output decides which information to keep from the tanh
output.

Now there is enough information to calculate the current cell state. First, the
previous cell state gets multiplied by the forget vector. Then we take the
output from the input gate and perform pointwise addition, which updates the
cell state to its new values.

Lastly, we have the output gate, which decides what the next hidden state will
be. First, we pass the previous hidden state and the current input through the
sigmoid function. Then we pass the newly modified cell state through the tanh
function. By multiplying the tanh output with the sigmoid output, we get the
new hidden state.

The new cell state and the new hidden state are then carried to the next time step.
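The gate computations described above can be sketched as a single LSTM cell step in NumPy (the weight layout and names are illustrative, not the actual trained model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W maps the concatenated [h_prev, x_t] to each gate."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])  # forget gate: what to drop from the cell state
    i = sigmoid(W["i"] @ z + b["i"])  # input gate: which candidate values to write
    g = np.tanh(W["g"] @ z + b["g"])  # candidate values, squashed to [-1, 1]
    o = sigmoid(W["o"] @ z + b["o"])  # output gate: what to expose as hidden state
    c_t = f * c_prev + i * g          # pointwise update of the cell state
    h_t = o * np.tanh(c_t)            # new hidden state
    return h_t, c_t
```

The `c_t` and `h_t` returned here are exactly the two values carried to the next time step.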

Dataset
The Twitter dataset to be used for training and validating our model was
collected from Kaggle. It is the sentiment140 dataset, which contains 1,600,000
tweets extracted using the Twitter API. The tweets have been annotated (0 =
negative, 4 = positive) and can be used to detect sentiment. The dataset
contains the following six fields:

target: the polarity of the tweet (0 = negative, 4 = positive)

ids: the ID of the tweet

date: the date of the tweet

flag: the query; if there is no query, this value is NO_QUERY

user: the user that tweeted

text: the text of the tweet

Lexicon-based sentiment analysis
We would employ the Apache-Hadoop framework with its lexicon-based sentiment
prediction algorithm. In this approach, we calculate the emotional orientation
of a document from the semantic orientation of the words in the document. This
idea is supported by a dictionary of words in which each word has its own
polarity, strength, and semantic orientation. Hadoop provides a framework for
the collection, storage, retrieval, management, and distributed processing of
huge volumes of data. A simple sentiment analysis process, consisting of the
following steps, is shown below.

After labelling each word, we can obtain an overall sentiment score by first
counting the numbers of positive and negative words in a sequence, then
subtracting the number of negative words from the number of positive words, and
finally dividing the result by the total number of words:

Sentiment Score (StSc) = (number of positive words - number of negative words) / total number of words
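A minimal sketch of this scoring rule (the tiny word lists here are illustrative placeholders, not the project's sentiment dictionary):

```python
# Hypothetical mini-dictionary; a real lexicon would be far larger.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "sad"}

def sentiment_score(text: str) -> float:
    """StSc = (positive count - negative count) / total word count."""
    words = text.lower().split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

# 1 positive word, 2 negative words, 8 words total -> (1 - 2) / 8
print(sentiment_score("great phone but terrible battery and bad camera"))  # -0.125
```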

Tweepy
Tweepy is an easy-to-use Python library for accessing the Twitter API. To use
this library, we first need a developer account. We have already applied for
our account to be upgraded to a developer account; it takes roughly a week for
Twitter to approve the upgrade. A developer account is essential because
working with Tweepy requires credentials from Twitter, such as the API keys and
the access token. Working without a developer account often leads to an account
ban.

Results and Discussion


We initially coded in Jupyter Notebook, but after encountering problems
presenting the notebooks in class, we decided to move our work to Google Colab,
where we will continue and finish it.

First, we import the necessary libraries, such as TensorFlow, NumPy, pandas, and NLTK.

Then we mount Google Drive on Colab, since mounting the Drive allows any code
in the notebook to access any files in it, and our dataset is already stored
there.

We then read the dataset from Google Drive and display the first few entries.

We see that the columns have no names, so for simplicity we name each column
according to the information it conveys and display the outcome.

Through our study, we noticed that not all columns are significant for
sentiment analysis, so we remove columns such as id, date, query, and user id
to get better results. Sentiments were denoted as zero or four, where zero
indicates negative and four indicates positive. For further simplification, we
then replaced 0 and 4 with "negative" and "positive" respectively and displayed
the outputs.

Then we counted the total numbers of positive and negative sentiment texts,
finding 800,000 of each, and plotted a graph of the values.

Within the following cell, we begin the data sanitization stage. Stop words are
words that occur frequently in sentences, make the text heavier, and carry
little importance for the analysis, so they should be excluded from the input.
A stemmer is used to reduce inflected words to their root forms. Using regular
expressions, we removed words starting with the @ symbol, hyperlinks, and
numbers, and displayed the outcome.
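The sanitization steps described here can be sketched as follows (the stop-word list is a tiny illustrative subset; the project uses NLTK's full list and a stemmer):

```python
import re

# Illustrative subset of stop words; NLTK provides the full list.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of"}

def clean_tweet(text: str) -> str:
    """Strip @mentions, hyperlinks, and numbers, then drop stop words."""
    text = re.sub(r"@\w+", "", text)          # drop @mentions
    text = re.sub(r"https?://\S+", "", text)  # drop hyperlinks
    text = re.sub(r"\d+", "", text)           # drop numbers
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)

print(clean_tweet("@user The battery is great https://example.com 2022"))
# battery great
```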

Conclusion
We have done our background study on this topic and learned NLP fundamentals
that were new to all of the group members. So far we have completed the data
sanitization part; starting next semester, we will work on tokenization and pad
sequencing. We will have a Twitter developer account by then and will start
working with the APIs alongside.

References
1. Apoorv A., Boyi X., Ilia V., and Owen R., "Sentiment Analysis of Twitter Data"
(no date found).

2. Preslav Nakov, Zornitsa Kozareva, Alan Ritter, Sara Rosenthal, Veselin
Stoyanov, and Theresa Wilson. SemEval-2013 Task 2: Sentiment Analysis in
Twitter. 2013.

3. Woldemariam, Y. (2016). Sentiment analysis in a cross-media analysis
framework. 2016 IEEE International Conference on Big Data Analysis (ICBDA).
doi: 10.1109/icbda.2016.7509790

4. Fan, X., Li, X., Du, F., Li, X., & Wei, M. (2016). Apply word vectors for
sentiment analysis of APP reviews. 2016 3rd International Conference on
Systems and Informatics (ICSAI). doi: 10.1109/icsai.2016.7811108

5. Vanaja, S., & Belwal, M. (2018). Aspect-Level Sentiment Analysis on
E-Commerce Data. 2018 International Conference on Inventive Research in
Computing Applications (ICIRCA). doi: 10.1109/icirca.2018.8597286

6. Porntrakoon, P., & Moemeng, C. (2018). Thai Sentiment Analysis for
Consumer's Review in Multiple Dimensions Using Sentiment Compensation
Technique (SenseComp). 2018 15th International Conference on Electrical
Engineering/Electronics, Computer, Telecommunications and Information
Technology (ECTI-CON). doi: 10.1109/ecticon.2018.8619892

7. Bashri, M., & Kusumaningrum, R. (2017). Sentiment analysis using Latent
Dirichlet Allocation and topic polarity wordcloud visualization. 2017 5th
International Conference on Information and Communication Technology
(ICoICT). doi: 10.1109/icoict.2017.8074651

8. Rojas-Barahona, L. (2016). Deep learning for sentiment analysis. Language
and Linguistics Compass, 10(12), 701-719. doi: 10.1111/lnc3.12228

9. Ding, J., Le, Z., Zhou, P., Wang, G., & Shu, W. (2009). An Opinion-Tree
Based Flexible Opinion Mining Model. 2009 International Conference on Web
Information Systems and Mining. doi: 10.1109/wism.2009.38
