CSE499A Report
Faculty Advisor:
M. Rashedur Rahman
Professor, ECE Department
Abstract
Introduction
Literature Review
LSTM
Dataset
Tweepy
Conclusion
References
Abstract
This project focuses on Opinion Mining, also known as Sentiment Analysis, on the collected dataset. The analysis techniques include machine learning, deep neural networks, lexicon-based approaches, and non-machine-learning naïve approaches.
The advent of social networks has opened the possibility of access to massive volumes of blogs, recommendations, and reviews. The challenge is to extract the polarity from these data, which is the task of opinion mining or sentiment analysis. It has vast implications for automation and Artificial Intelligence based applications.
Every minute, people post opinions of many kinds on various online platforms. Analyzing those opinions manually is very difficult, so researchers have had to develop automated systems. Nowadays different types of approaches are used to analyze such data. Among them, machine learning based approaches are the most popular, but other approaches, such as lexicon-based ones, are also used in this area.
This research will be conducted on real-time tweet data from the social media platform Twitter. The project needs access to real-time user tweets to analyze their polarity. To access Twitter data in real time, the project will use Tweepy, an open-source Python library for the Twitter API.
We would also employ the Apache Hadoop framework with a lexicon-based sentiment prediction algorithm and the Stanford CoreNLP package with the Recursive Neural Tensor Network (RNTN) model to conduct our studies. The lexicon-based approach uses a sentiment dictionary containing words annotated with sentiment labels and other basic lexical features, while the latter is trained on the Stanford Sentiment Treebank.
User reviews can be analyzed with sentiment analysis to detect positive, negative, and neutral information. Researchers have developed many methodologies for Sentiment Analysis; typically a single machine learning algorithm is used. This project leverages Twitter sentiment data to locate aspect phrases in each review, identify parts of speech, and apply classification algorithms to determine each review's positivity, negativity, and neutrality score. Vector representations can be used to compute a variety of vector-based features and to run a series of studies demonstrating their efficacy.
Introduction
People nowadays tend to share their opinions through various social media platforms. Among these, Twitter is a place where users share their thoughts on recent subjects of interest. Everyone has a point of view, and they convey it through features such as tweeting on a subject, retweeting someone's tweet, or posting something related to a trending topic, oftentimes using the trending hashtags. Our target is to find the topics trending at the moment by going through the trending hashtags and then gather tweets in a specific language (English for our project). We would then run sentiment analysis on the collected tweets, employing the Apache Hadoop framework with a lexicon-based sentiment prediction algorithm and the Stanford CoreNLP package with the Recursive Neural Tensor Network (RNTN) model. The lexicon-based approach uses a sentiment dictionary containing words annotated with sentiment labels and basic lexical features, while the latter approach is trained on the Stanford Sentiment Treebank, which contains 215,154 phrases labeled using Amazon Mechanical Turk. We would later compare the two methods' results in terms of accuracy and determine the better of the two. For the user of our product there would be an
interface where users may choose a country from several options; upon choosing, the interface would display the hashtags trending in that country. The user may choose one such hashtag and would then be presented with a fixed number of tweets on that topic from the chosen country. Upon completing sentiment analysis on the tweets, the system would show the results accumulated from both the DNN and lexicon-based approaches, alongside a word cloud consisting of the keywords used for sentiment analysis. Real-time Twitter data collection would be conducted using Tweepy, an open-source Python library for the Twitter API.
We have observed that many sensational topics have arisen in different parts of the world over time, and people have expressed their sides, opinions, and sentiments on social media. It is essential to have a better understanding of the sentiment around these sensitive trending topics. We chose Twitter because its users tend to leave concise statements, and these statements are our key assets for conducting sentiment analysis.
Literature Review
In recent times, we have seen Twitter sentiment analysis performed using CNN, LSTM, or bidirectional LSTM models, with considerable success. Most of these approaches target one particular subject, for example, Covid-19 or product reviews. We aim at no particular subject or topic; rather, we would like to give users diverse options to choose from. Users can pinpoint their interest through parameters such as selecting a country and then confirming a trending hashtag from that country that best matches their interest. We would also like to compare the results obtained through the Long Short-Term Memory model and the lexicon-based approach for sentiment prediction.
Recurrent Neural Network (RNN)
A Recurrent Neural Network, or RNN, is so named because the output of a module feeds back into a module of the same type. The idea of building neural networks from smaller neural network "modules" that can be composed together is not very commonly used, but it has been very successful in NLP. RNNs are good at modeling sequence data because of the concept of sequential memory, the mechanism that makes it easier for our brains to recognize sequence patterns; RNNs replicate this abstract concept through their structure. Like a feed-forward neural network, an RNN consists of three layers: an input layer, a hidden layer, and an output layer. It also has a looping mechanism that acts as a highway allowing information to flow from one step to the next. This information is referred to as the hidden state, which is a representation of the previous inputs.
In the usual unrolled picture, a module A looks at some input x_t and outputs a value h_t, and a loop allows information to be passed from one step of the network to the next. This chain-like structure indicates that recurrent neural networks are deeply related to sequences and lists: an RNN can be thought of as multiple copies of the same network, each passing a message to a successor.
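As a minimal sketch of this computation (in NumPy, with toy dimensions and random weights of our own choosing, not taken from any particular implementation), a single RNN step combines the current input with the previous hidden state:

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # One vanilla RNN step: mix the current input with the previous
    # hidden state, then squash the result with tanh.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, 4))   # toy weights: 4-dim inputs, 3-dim hidden state
W_h = rng.normal(size=(3, 3))
b = np.zeros(3)

h = np.zeros(3)                        # initial hidden state
for x_t in rng.normal(size=(5, 4)):    # a sequence of five input vectors
    h = rnn_step(x_t, h, W_x, W_h, b)  # h carries information forward

Because the same weights W_x and W_h are reused at every step, the loop is exactly the "multiple copies of the same network" described above.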
However, this sort of model comes with its own shortcomings. One is what we call short-term memory, which stems from the way information is passed from one step to the next. The main culprit behind short-term memory is the vanishing gradient problem of RNNs, which is caused by
backpropagation through time, the procedure used to train and optimize the network by fine-tuning its weights based on the error rate, or loss. During backpropagation, each layer calculates its gradient with respect to the gradient flowing in from the layer after it. If that incoming gradient is small, the current layer's gradient will be even smaller, so gradients shrink exponentially as they propagate backward.
Gradient update rule: new weight = weight − learning rate × gradient
The earlier layers thus fail to learn, as their internal weights are barely adjusted by the extremely small gradients. To eliminate this shortcoming, we will use a special kind of RNN: the Long Short-Term Memory (LSTM) model.
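The effect is easy to illustrate numerically. In the sketch below, the per-step shrink factor of 0.4 is a made-up number purely for illustration:

# Purely illustrative: suppose each step back in time scales the
# gradient by a hypothetical factor of 0.4.
learning_rate = 0.01
gradient = 1.0
for _ in range(10):              # ten time steps of backpropagation
    gradient *= 0.4              # the gradient shrinks at every step
print(gradient)                  # ~1.0e-04
print(learning_rate * gradient)  # ~1.0e-06: the weight barely moves

After only ten steps the update is about a millionth of the learning rate, which is why the earliest time steps contribute almost nothing to learning.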
LSTM
LSTM is an evolved version of the recurrent neural network, created as a solution to the short-term memory problem that arises during backpropagation. LSTMs have internal mechanisms called gates that regulate the flow of information. These gates learn which data in a sequence is important to keep or throw away, and by doing so the network learns to use relevant information to make predictions.
In this model, words are first transformed into vectors that are machine-readable. The network then processes the sequence of vectors one by one. While processing, it passes the
previous hidden state to the next step of the sequence, and the hidden state acts as the network's memory: it holds information on data the network has seen before. The hidden state is calculated as follows. First, the input and the previous hidden state are combined to form a vector that holds information about the current input and the previous inputs. This vector then goes through the tanh activation, and the output is the new hidden state.
The tanh activation helps regulate the values flowing through the network by squashing them to always lie between −1 and 1. Without it, some values could grow exponentially and make other values seem insignificant.
An LSTM has the same control flow as a classic recurrent network: it processes data sequentially, passing information on as it propagates forward. The difference lies in the operations within the LSTM cell, which allow the LSTM to keep or forget information. The core concepts of the LSTM are the cell state and its various gates, described below.
The cell state works as a transport highway that carries relevant information all the way down the sequence chain. Because information from earlier time steps can reach the last time step, the effects of short-term memory are reduced.
The gates are small neural networks that decide which information is allowed into the cell state; during training they learn which information is relevant to keep or forget. The gates contain sigmoid activations. A sigmoid activation is similar to tanh, but it squashes values to between 0 and 1, where 1 represents "completely keep this" and 0 represents "completely get rid of this". Three gates regulate the flow of information: the forget gate, the input gate, and the output gate.
First, the forget gate decides which information should be kept and which thrown away. Information from the previous hidden state and the current input is passed through the sigmoid function, producing values between 0 and 1: values closer to 0 mean forget, values closer to 1 mean keep.
To update the cell state we have the input gate. The previous hidden state and the current input are passed to a sigmoid function, which determines which values will be updated by squashing them to between 0 and 1. The hidden state and the current input are also passed to the tanh function, which squashes the values to between −1 and 1. The tanh output is then multiplied by the sigmoid output, so the sigmoid output decides which information to keep from the tanh output.
There is now enough information to calculate the current cell state. First, the previous cell state is multiplied pointwise by the forget vector. Then the output of the input gate is added pointwise, which updates the cell state to its new values.
Lastly, the output gate decides what the next hidden state will be. The previous hidden state and the current input are passed to the sigmoid function, and the newly updated cell state is passed to the tanh function. Multiplying the tanh output by the sigmoid output gives the new hidden state.
The new cell state and the new hidden state are then carried to the next time step.
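The following sketch puts the three gates together in a single LSTM step. It is a minimal NumPy illustration with toy dimensions and a stacked weight matrix of our own devising, not production code or any library's internals:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps [h_prev; x_t] to the stacked pre-activations of the
    # forget gate, input gate, candidate values, and output gate.
    v = np.concatenate([h_prev, x_t])   # combine hidden state and input
    n = len(h_prev)
    z = W @ v + b
    f = sigmoid(z[0:n])                 # forget gate: what to discard
    i = sigmoid(z[n:2 * n])             # input gate: what to update
    g = np.tanh(z[2 * n:3 * n])         # candidate values in [-1, 1]
    o = sigmoid(z[3 * n:4 * n])         # output gate: what to expose
    c_t = f * c_prev + i * g            # new cell state (pointwise ops)
    h_t = o * np.tanh(c_t)              # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
W = rng.normal(size=(12, 7))   # 4 * hidden_size rows, hidden + input cols
b = np.zeros(12)
h, c = np.zeros(3), np.zeros(3)
for x_t in rng.normal(size=(5, 4)):    # a sequence of five input vectors
    h, c = lstm_step(x_t, h, c, W, b)

Note how the cell state c is only ever scaled by the forget gate and added to pointwise; this is the "transport highway" that lets information survive many time steps.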
Dataset
The Twitter dataset used for training and validating our model is the Sentiment140 dataset, collected from Kaggle. It contains 1,600,000 tweets extracted using the Twitter API, annotated for sentiment (0 = negative, 4 = positive), so they can be used to detect sentiment. The dataset contains the following 6 fields: the sentiment label, the tweet id, the date, the query, the user, and the tweet text.
Lexicon-based sentiment analysis
We would employ the Apache Hadoop framework with a lexicon-based sentiment prediction algorithm. In this approach, we try to calculate the emotional orientation of a document from the semantic orientation of the words it contains. The idea rests on a dictionary of words in which each word has its own polarity, strength, and semantic orientation. Hadoop provides a framework that allows the collection, storage, retrieval, management, and distributed processing of huge amounts of data. A simple sentiment analysis process works as follows.
After labelling each word, we can compute an overall sentiment score by counting the positive and negative words in a sequence, subtracting the number of negative words from the number of positive words, and dividing the result by the total number of words:

Sentiment Score (StSc) = (number of positive words − number of negative words) / total number of words
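A minimal sketch of this scoring in Python (the tiny positive and negative word sets are hypothetical stand-ins for a real sentiment dictionary):

def sentiment_score(tokens, positive_words, negative_words):
    # StSc = (#positive - #negative) / total words, as defined above.
    pos = sum(1 for w in tokens if w in positive_words)
    neg = sum(1 for w in tokens if w in negative_words)
    return (pos - neg) / len(tokens) if tokens else 0.0

positive = {"good", "great", "love"}      # illustrative lexicon only
negative = {"bad", "terrible", "hate"}
print(sentiment_score("i love this great phone".split(), positive, negative))
# -> 0.4  (2 positive, 0 negative, 5 words)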
Tweepy
Tweepy is an easy-to-use Python library for accessing the Twitter API. To use this library we first need a developer account. We have already applied to have our account upgraded to a developer account, which takes Twitter roughly a week. A developer account is essential because working with Tweepy requires keys from Twitter, such as the API keys and access token, and working without one often leads to an account ban.
First, we import the necessary libraries, such as TensorFlow, NumPy, pandas, and NLTK.
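A minimal version of our import cell (the exact list in the notebook may grow as the project progresses):

import re

import numpy as np
import pandas as pd
import tensorflow as tf
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")   # the stop-word list is not bundled by default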
Then we mount Google Drive on Colab, which allows any code in the notebook to access files in our Google Drive, where the dataset is already stored.
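The mounting itself is a single call:

from google.colab import drive

# Prompts for authorization, then exposes the drive under /content/drive.
drive.mount("/content/drive")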
We then read the dataset from Google Drive and display the first few entries. Since the columns have no index, we name each column according to the information it conveys, for simplicity, and display the outcome.
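A sketch of this cell, assuming the file sits at a hypothetical path in our Drive (the raw Sentiment140 CSV ships without a header row and uses Latin-1 encoding):

import pandas as pd

columns = ["sentiment", "id", "date", "query", "user", "text"]
df = pd.read_csv("/content/drive/MyDrive/sentiment140.csv",
                 encoding="latin-1", header=None, names=columns)
df.head()   # display the first few entries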
Through our study, we noticed that not all columns are significant for sentiment analysis, so we remove columns such as id, date, query, and user to get better results. Sentiments are denoted as 0 or 4, where 0 indicates negative and 4 indicates positive; we then replace 0 and 4 with "negative" and "positive".
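Continuing from the dataframe above, a sketch of the two steps:

# Keep only what the analysis needs: the label and the tweet text.
df = df.drop(columns=["id", "date", "query", "user"])

# Replace the numeric labels with readable ones.
df["sentiment"] = df["sentiment"].replace({0: "negative", 4: "positive"})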
We then counted the positive and negative sentiment texts, found 800,000 of each, and plotted a graph of the values.
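Continuing from the previous cell, a minimal version of the count and the plot:

import matplotlib.pyplot as plt

counts = df["sentiment"].value_counts()   # 800,000 of each class
counts.plot(kind="bar", title="Class distribution")
plt.show()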
In the following cell, we begin the data sanitization stage. Stop words are words that occur very frequently in sentences, making the text heavier while adding little to the analysis, so they should be excluded from the input. A stemmer reduces inflected words to their root forms. Using regular expressions, we remove words starting with the @ symbol, hyperlinks, and numbers, and display the outcome.
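A sketch of the sanitization cell, reusing the imports shown earlier (the exact regular expression here is ours for illustration; the notebook's may differ in detail):

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_text(text):
    # Strip @mentions, hyperlinks, and numbers, then drop stop words
    # and stem each remaining word to its root form.
    text = re.sub(r"@\w+|https?://\S+|www\.\S+|\d+", " ", text.lower())
    words = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return " ".join(words)

df["text"] = df["text"].apply(clean_text)
df.head()   # display the cleaned outcome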
Conclusion
We have completed our background study on this topic and learned NLP fundamentals that were new to all of us. We have finished the data sanitization stage, and starting next semester we will work on tokenizing and padding sequences. By then we will have a Twitter developer account and will start working with the APIs alongside.
References
1. Agarwal, A., Xie, B., Vovsha, I., & Rambow, O. (n.d.). Sentiment Analysis of Twitter Data.
4. Fan, X., Li, X., Du, F., Li, X., & Wei, M. (2016). Apply word vectors for sentiment analysis of APP reviews. 2016 3rd International Conference on Systems and Informatics (ICSAI). doi: 10.1109/icsai.2016.7811108
7. Bashri, M., & Kusumaningrum, R. (2017). Sentiment analysis using Latent Dirichlet Allocation and topic polarity wordcloud visualization. 2017 5th International Conference on Information and Communication Technology (ICoICT). doi: 10.1109/icoict.2017.8074651
9. Ding, J., Le, Z., Zhou, P., Wang, G., & Shu, W. (2009). An Opinion-Tree Based Flexible Opinion Mining Model. 2009 International Conference on Web Information Systems and Mining. doi: 10.1109/wism.2009.38