Twitter Sentiment Analysis Using Deep Learning
Introduction
Twitter Sentiment Analysis means using text mining techniques to determine the sentiment of a piece of text (here, a tweet) as positive, negative, or neutral. Also called Opinion Mining, it is primarily used for analyzing conversations, opinions, and shared views (all in the form of tweets) to inform business strategy, political analysis, and assessments of public opinion. Sentiment analysis is often used to identify trends in the content of tweets, which are then analyzed by machine learning algorithms. It is a crucial tool in the field of social media marketing because it can be used to predict the behavior of a user's online persona, and it can be applied to analyze the sentiment of a given post or of any given topic. In fact, it is one of the most popular tools in social media marketing.
Text understanding is a significant problem to solve. One approach is to rank the importance of sentences within the text and then generate a summary of the text based on the important sentences.
These systems do not depend on manually crafted rules, but on machine learning techniques such as classification. A classifier used for sentiment analysis is an automatic system that must be fed sample text before it can return a category, e.g., positive, negative, or neutral. Urgent issues often arise, and they must be addressed immediately. A complaint on Twitter, for instance, could quickly escalate into a PR crisis if it goes viral. While it would be difficult for a human team to spot a crisis before it happens, it is easy for machine learning tools to identify these situations in real time.
Patterns can be extracted by analyzing the frequency distribution of parts of speech (either individually or together with other parts of speech) in a particular class of labeled tweets. Twitter-specific features are more informal; they relate to how people express themselves on online social platforms and compress their sentiments into the limited space of 140 characters offered by Twitter. They include Twitter hashtags, retweets, word capitalization, word lengthening, question marks, the presence of URLs in tweets, exclamation marks, internet emoticons, and internet shorthand/slang.
Literature Review
Sentiment analysis in the domain of micro-blogging is a relatively new research topic, so there is still plenty of room for further research in this area. A decent amount of prior work has been done on sentiment analysis of user reviews, web blogs/articles, and phrase-level sentiment analysis. These differ from Twitter mainly because of the limit of 140 characters per tweet, which forces users to compress their opinions into very short text. The best results have been achieved in sentiment classification using supervised learning techniques such as Naive Bayes and Support Vector Machines, but the manual labeling required for the supervised approach is very expensive. Some work has been done on unsupervised and semi-supervised approaches, and there is plenty of room for improvement.
Various researchers have tested new classification features and techniques, often comparing their results against a baseline performance. There is a need for formal comparisons of these results across different features and classification techniques, in order to select the most effective features and classification techniques for specific applications. A common approach is to use unigrams as features: each unigram is assigned a preset prior polarity, and the final polarity of the text is computed by summing the prior polarities of the individual unigrams. This is a very simplistic assumption, but it appears to perform fairly well. The prior polarity of a word is positive if the word is mostly used with positive connotations (for example, "sweet"), and negative if the word is mostly associated with negative connotations (for example, "evil"). The model can also include degrees of polarity, i.e., how strongly a word is indicative of a specific class. A word like "wonderful" carries strong subjective positive polarity, while "decent" may have a positive prior polarity but only weak subjectivity.
1 Problem Statement
Twitter is a popular social networking website where members create and interact with messages known as “tweets”. This serves as a means for individuals to express their thoughts and feelings about different subjects. Various parties, such as consumers and marketers, have performed sentiment analysis on such tweets to gather insights into products or to conduct market analysis. Furthermore, with recent advancements in machine learning algorithms, the accuracy of sentiment analysis predictions has improved. In this report, I conduct sentiment analysis on “tweets” using various machine learning algorithms and attempt to classify the polarity of each tweet as either positive or negative. If a tweet has both positive and negative elements, the more dominant sentiment is picked as the final label.
I used a dataset from Kaggle which was crawled and labeled positive/negative. The data comes with emoticons, usernames, and hashtags, which need to be processed and converted into a standard form. I also need to extract useful features from the text, such as unigrams and bigrams, which are a form of representation of the “tweet”.
I used various machine learning algorithms to conduct sentiment analysis using the extracted features. However, relying on individual models did not give high accuracy, so I picked the top few models to generate a model ensemble. Ensembling is a meta-learning technique in which different classifiers are combined in order to improve prediction accuracy. Finally, I report my experimental results and findings at the end.
2 Data Description
The data is given in the form of comma-separated values (CSV) files with tweets and their corresponding sentiments. The training dataset is a CSV file of the form tweet_id,sentiment,tweet, where tweet_id is a unique integer identifying the tweet, sentiment is the positive/negative label, and tweet is the text of the tweet.
| | Total | Unique | Average | Max | Positive | Negative |
| --- | --- | --- | --- | --- | --- | --- |
| Tweets | 800000 | - | - | - | 400312 | 399688 |
| User Mentions | 393392 | - | 0.4917 | 12 | - | - |
| Emoticons | 6797 | - | 0.0085 | 5 | 5807 | 990 |
| URLs | 38698 | - | 0.0484 | 5 | - | - |
| Unigrams | 9823554 | 181232 | 12.279 | 40 | - | - |
| Bigrams | 9025707 | 1954953 | 11.28 | - | - | - |

Table 1: Statistics of the pre-processed training dataset.
The words and emoticons contribute to predicting the sentiment, but URLs and references to people don’t. Therefore, URLs and references can be ignored. The words are also a mixture of misspelled words, extra punctuation, and words with many repeated letters. The tweets, therefore, have to be pre-processed to standardize the dataset.
The provided training and test datasets have 800000 and 200000 tweets respectively. A preliminary statistical analysis of the contents of the datasets, after preprocessing as described in section 3.1, is shown in tables 1 and 2.
3.1.1 URL
Users often share hyperlinks to other webpages in their tweets. Any particular URL is not important for text classification as it would lead to very sparse features. Therefore, we replace all the URLs in tweets with the word URL. The regular expression used to match URLs is ((www\.[\S]+)|(https?://[\S]+)).
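As a concrete illustration, a minimal Python sketch of this substitution (the function name is illustrative, not from the original code):

```python
import re

# Regex from above: matches www.-prefixed or http(s):// URLs.
URL_REGEX = re.compile(r'((www\.[\S]+)|(https?://[\S]+))')

def replace_urls(tweet):
    # Replace every URL with the placeholder token URL.
    return URL_REGEX.sub('URL', tweet)

print(replace_urls('misses Swimming Class. http://plurk.com/p/12nt0b'))
# -> misses Swimming Class. URL
```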
Emoticon(s) Type Regex Replacement
:), : ), :-), (:, ( :, (-:, :’) Smile (:\s?\)|:-\)|\(\s?:|\(-:|:\’\)) EMO_POS
:D, : D, :-D, xD, x-D, XD, X-D Laugh (:\s?D|:-D|x-?D|X-?D) EMO_POS
;-), ;), ;-D, ;D, (;, (-; Wink (;-?\)|;-?D|\(-?;) EMO_POS
<3, :* Love (<3|:\*) EMO_POS
:-(, : (, :(, ):, )-: Sad (:\s?\(|:-\(|\)\s?:|\)-:) EMO_NEG
:,(, :’(, :"( Cry (:,\(|:\’\(|:"\() EMO_NEG
Table 3: List of emoticons matched by our method.
3.1.3 Emoticon
Users often use a number of different emoticons in their tweets to convey different emotions. It is
impossible to exhaustively match all the different emoticons used on social media as the number
is ever increasing. However, we match some common emoticons which are used very frequently.
We replace the matched emoticons with either EMO_POS or EMO_NEG depending on whether it is
conveying a positive or a negative emotion. A list of all emoticons matched by our method is given
in table 3.
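A sketch of the emoticon substitution in Python, using the regexes from table 3 (the wink pattern and the helper name are assumptions consistent with the emoticons listed):

```python
import re

# (regex, replacement) pairs following table 3.
EMOTICONS = [
    (r'(:\s?\)|:-\)|\(\s?:|\(-:|:\'\))', 'EMO_POS'),  # Smile
    (r'(:\s?D|:-D|x-?D|X-?D)', 'EMO_POS'),            # Laugh
    (r'(;-?\)|;-?D|\(-?;)', 'EMO_POS'),               # Wink (assumed pattern)
    (r'(<3|:\*)', 'EMO_POS'),                         # Love
    (r'(:\s?\(|:-\(|\)\s?:|\)-:)', 'EMO_NEG'),        # Sad
    (r'(:,\(|:\'\(|:"\()', 'EMO_NEG'),                # Cry
]

def replace_emoticons(tweet):
    # Tokens are padded with spaces; extra whitespace is collapsed later.
    for pattern, token in EMOTICONS:
        tweet = re.sub(pattern, ' ' + token + ' ', tweet)
    return tweet

print(replace_emoticons('hii come talk to me i got candy :)'))
# -> hii come talk to me i got candy  EMO_POS
```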
3.1.4 Hashtag
Hashtags are unspaced phrases prefixed by the hash symbol (#), frequently used to mention a trending topic on Twitter. We replace each hashtag with the same word without the hash symbol. For example, #hello is replaced by hello. The regular expression used to match hashtags is #(\S+).
3.1.5 Retweet
Retweets are tweets which have already been sent by someone else and are shared by other users.
Retweets begin with the letters RT. We remove RT from the tweets as it is not an important feature
for text classification. The regular expression used to match retweets is \brt\b.
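The hashtag and retweet steps together, as a sketch (this assumes the tweet has already been lower-cased, which the normalized examples in table 4 suggest):

```python
import re

def strip_hashtags_and_rt(tweet):
    # #windows -> windows : keep the word, drop the hash symbol.
    tweet = re.sub(r'#(\S+)', r'\1', tweet)
    # Drop the standalone rt token left by retweets.
    tweet = re.sub(r'\brt\b', '', tweet)
    return tweet.strip()

print(strip_hashtags_and_rt('rt sometimes you gotta hate #windows updates'))
# -> sometimes you gotta hate windows updates
```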
After applying tweet level pre-processing, we processed individual words of tweets as follows.
• Convert 2 or more letter repetitions to 2 letters. Some people send tweets like I am sooooo happpppy, adding multiple characters to emphasize certain words. We handle such tweets by converting them to I am soo happy.
• Remove - and ’. This is done to handle words like t-shirt and their’s by converting them to the more general forms tshirt and theirs.
• Check if the word is valid and accept it only if it is. We define a valid word as a word which begins with a letter, with successive characters being letters, numbers, or one of dot (.) and underscore (_).
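A minimal sketch of these word-level steps (the validity regex is inferred from the description above; the helper names are illustrative):

```python
import re

# A valid word starts with a letter, then letters, digits, '.' or '_' (inferred).
VALID_WORD = re.compile(r'^[a-zA-Z][a-zA-Z0-9._]*$')

def preprocess_word(word):
    # Reduce runs of 3+ repeated characters to exactly 2: sooooo -> soo.
    word = re.sub(r'(.)\1+', r'\1\1', word)
    # Strip hyphens and apostrophes: t-shirt -> tshirt, their's -> theirs.
    return word.replace('-', '').replace("'", '')

words = 'i am sooooo happpppy'.split()
cleaned = [preprocess_word(w) for w in words]
print(' '.join(w for w in cleaned if VALID_WORD.match(w)))
# -> i am soo happy
```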
Some example tweets from the training dataset and their normalized versions are shown in table
4.
3.2.1 Unigrams
Probably the simplest and most commonly used features for text classification are the presence of single words or tokens in the text. We extract single words from the training dataset and create a frequency distribution of these words. A total of 181232 unique words are extracted from the dataset.
| Raw | Normalized |
| --- | --- |
| misses Swimming Class. http://plurk.com/p/12nt0b | misses swimming class URL |
| @98PXYRochester HEYYYYYYYYY!! its Fer from Chile again | USER_MENTION heyy its fer from chile again |
| Sometimes, You gotta hate #Windows updates. | sometimes you gotta hate windows updates |
| @Santiago_Steph hii come talk to me i got candy :) | USER_MENTION hii come talk to me i got candy EMO_POS |
| @bolly47 oh no :’( r.i.p. your bella | USER_MENTION oh no EMO_NEG r.i.p your bella |

Table 4: Example tweets from the dataset and their normalized versions.
Out of these words, most of the words at the end of the frequency spectrum are noise and occur too few times to influence classification. We therefore use only the top N words to create our vocabulary, where N is 15000 for sparse vector classification and 90000 for dense vector classification. The frequency distribution of the top 20 words in our vocabulary is shown in figure 1. We can observe in figure 2 that the frequency distribution follows Zipf’s law, which states that in a large sample of words, the frequency of a word is inversely proportional to its rank in the frequency table. This can be seen by the fact that a linear trendline with a negative slope fits the plot of log(Frequency) vs. log(Rank). The equation of the trendline shown in figure 2 is log(Frequency) = −0.78 log(Rank) + 13.31.
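The trendline fit can be reproduced with a couple of lines of NumPy (a sketch; the counts below are an ideal Zipf curve standing in for the real frequency distribution, which yields a slope of exactly −1 rather than the −0.78 observed on the data):

```python
import numpy as np

# Stand-in frequencies: an ideal Zipf curve over 15000 ranks.
counts = np.array([9823554 / (r + 1) for r in range(15000)])
rank = np.arange(1, len(counts) + 1)

# Fit log(Frequency) = a*log(Rank) + b; Zipf's law predicts a close to -1.
a, b = np.polyfit(np.log(rank), np.log(counts), deg=1)
print(f'slope = {a:.2f}, intercept = {b:.2f}')
```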
3.2.2 Bigrams
Bigrams are word pairs which occur in succession in the corpus. These features are a good way to model negation in natural language, as in the phrase "This is not good". A total of 1954953 unique bigrams were extracted from the dataset. Out of these, most of the bigrams at the end of the frequency spectrum are noise and occur too few times to influence classification. We therefore use only the top 10000 bigrams to create our vocabulary. The frequency distribution of the top 20 bigrams in our vocabulary is shown in figure 3.
Figure 2: Unigrams frequencies follow Zipf’s Law.
3.3.1 Sparse Vector Representation
Depending on whether or not we are using bigram features, the sparse vector representation of each tweet is either of length 15000 (when considering only unigrams) or 25000 (when considering unigrams and bigrams). Each unigram (and bigram) is given a unique index depending on its rank. The feature vector for a tweet has a positive value at the indices of the unigrams (and bigrams) present in that tweet and zero elsewhere, which is why the vector is sparse. The positive value at those indices depends on the feature type we specify, which is one of presence and frequency.
• presence In the case of presence feature type, the feature vector has a 1 at indices of
unigrams (and bigrams) present in a tweet and 0 elsewhere.
• frequency In the case of frequency feature type, the feature vector has a positive integer at
indices of unigrams (and bigrams) which is the frequency of that unigram (or bigram) in the
tweet and 0 elsewhere. A matrix of such term-frequency vectors is constructed for the entire
training dataset and then each term frequency is scaled by the inverse-document-frequency of
the term (idf) to assign higher values to important terms. The inverse-document-frequency
of a term t is defined as

idf(t) = log((1 + n_d) / (1 + df(d, t))) + 1

where n_d is the total number of documents and df(d, t) is the number of documents in which the term t occurs.
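This is the same smoothed idf that scikit-learn's TfidfTransformer implements, so the scaling step can be sketched as follows (toy matrix for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term-frequency matrix: 3 "tweets" over a 4-term vocabulary.
tf = csr_matrix(np.array([[2, 0, 1, 0],
                          [1, 1, 0, 0],
                          [0, 2, 0, 3]]))

# smooth_idf=True gives idf(t) = log((1 + n_d) / (1 + df(t))) + 1, as above.
tfidf = TfidfTransformer(smooth_idf=True)
print(tfidf.fit_transform(tf).toarray().round(3))
```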
Handling Memory Issues: when dealing with sparse vector representations, the feature vector for each tweet is of length 25000 and the total number of tweets in the training set is 800000, which means allocating memory for a matrix of size 800000 × 25000. Assuming 4 bytes are required to represent each float value in the matrix, this matrix needs 8 × 10^10 bytes (≈ 75 GB) of memory, which is far greater than the memory available in common notebooks. To tackle this issue, we used the scipy.sparse.lil_matrix data structure provided by SciPy, which is a memory-efficient linked-list-based implementation of sparse matrices. In addition, we used Python generators wherever possible instead of keeping the entire dataset in memory.
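A sketch of this pattern (extract_indices is a hypothetical helper that maps a tweet to the vocabulary indices it contains):

```python
from scipy.sparse import lil_matrix

NUM_TWEETS, VOCAB_SIZE = 800000, 25000

def tweet_stream(path):
    # Generator: yields one tweet at a time instead of loading the file into memory.
    with open(path) as f:
        for line in f:
            yield line.strip()

def build_feature_matrix(path, extract_indices):
    # lil_matrix stores only the non-zero entries, so 800000 x 25000 fits in RAM.
    X = lil_matrix((NUM_TWEETS, VOCAB_SIZE), dtype='float32')
    for row, tweet in enumerate(tweet_stream(path)):
        for idx in extract_indices(tweet):
            X[row, idx] = 1.0  # presence feature type
    return X
```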
3.4 Classifiers
3.4.1 Naive Bayes
Naive Bayes is a simple model which can be used for text classification. In this model, the class ĉ is assigned to a tweet t, where

ĉ = argmax_c P(c | t)

P(c | t) ∝ P(c) ∏_{i=1}^{n} P(f_i | c)

In the formula above, f_i represents the i-th of the n total features. P(c) and P(f_i | c) can be obtained through maximum likelihood estimates.
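With sparse feature vectors this is a few lines in scikit-learn; a toy sketch (the real model is trained on the full 800000-tweet matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import MultinomialNB

# Toy presence features: 4 "tweets" over a 5-word vocabulary; 1 = positive.
X = csr_matrix(np.array([[1, 1, 0, 0, 0],
                         [1, 0, 1, 0, 0],
                         [0, 0, 0, 1, 1],
                         [0, 1, 0, 1, 1]]))
y = np.array([1, 1, 0, 0])

clf = MultinomialNB()  # estimates P(c) and P(f_i|c) via smoothed maximum likelihood
clf.fit(X, y)
print(clf.predict(X))  # -> [1 1 0 0]
```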
3.4.2 Maximum Entropy
Maximum Entropy Classifier model is based on the Principle of Maximum Entropy. The main idea
behind it is to choose the most uniform probabilistic model that maximizes the entropy, with given
constraints. Unlike Naive Bayes, it does not assume that features are conditionally independent
of each other. So, we can add features like bigrams without worrying about feature overlap. In
a binary classification problem like the one we are addressing, it is the same as using Logistic
Regression to find a distribution over the classes. The model is represented by
P_ME(c | d, λ) = exp[Σ_i λ_i f_i(c, d)] / Σ_{c′} exp[Σ_i λ_i f_i(c′, d)]
Here, c is the class, d is the tweet and λ is the weight vector. The weight vector is found by
numerical optimization of the lambdas so as to maximize the conditional probability.
3.4.5 XGBoost
XGBoost is a form of gradient boosting algorithm which produces a prediction model that is an ensemble of weak prediction decision trees. We use the ensemble of K models by adding their outputs in the following manner:

ŷ_i = Σ_{k=1}^{K} f_k(x_i),  f_k ∈ F

where F is the space of trees, x_i is the input and ŷ_i is the final output. We attempt to minimize the following loss function:

L(Φ) = Σ_i l(ŷ_i, y_i) + Σ_k Ω(f_k),  where Ω(f) = γT + (1/2)λ‖w‖²

and Ω is the regularisation term.
3.4.6 SVM
SVM, also known as support vector machine, is a non-probabilistic binary linear classifier. For a training set of points (x_i, y_i), where x is the feature vector and y is the class, we want to find the maximum-margin hyperplane that divides the points with y_i = 1 from those with y_i = −1. The equation of the hyperplane is

w · x − b = 0

and we maximize the margin by solving

max_{w, γ} γ  subject to  ∀i: γ ≤ y_i (w · x_i + b)
4 Experiments
We perform experiments using various different classifiers. Unless otherwise specified, we use 10% of the training dataset for validation of our models to check against overfitting, i.e., we use 720000 tweets for training and 80000 tweets for validation. For Naive Bayes, Maximum Entropy,
Decision Tree, Random Forest, XGBoost, SVM and Multi-Layer Perceptron we use sparse vector
representation of tweets. For Recurrent Neural Networks and Convolutional Neural Networks we
use the dense vector representation.
4.1 Baseline
For a baseline, we use a simple positive and negative word counting method to assign sentiment to a given tweet. We use the Opinion Dataset of positive and negative words to classify tweets. In cases where the numbers of positive and negative words are equal, we assign positive sentiment. Using this baseline model, we achieve a classification accuracy of 63.48% on the Kaggle public leaderboard.
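A minimal sketch of this baseline (the two word sets below are tiny illustrative stand-ins for the opinion word lists):

```python
# Tiny stand-ins for the positive/negative opinion word lists.
POSITIVE = {'good', 'happy', 'sweet', 'love', 'wonderful'}
NEGATIVE = {'bad', 'sad', 'hate', 'evil', 'terrible'}

def baseline_sentiment(tweet):
    words = tweet.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    # Ties (including zero counts) are assigned positive sentiment.
    return 'positive' if pos >= neg else 'negative'

print(baseline_sentiment('sometimes you gotta hate windows updates'))  # negative
```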
than floats. We also observed that the addition of bigram features improves the accuracy. We obtain a best validation accuracy of 79.68% using Naive Bayes with presence features of unigrams and bigrams. A comparison of accuracies obtained on the validation set using different features is shown in table 5.
For a binary classification problem, Logistic Regression is essentially the same as Maximum Entropy. So, we implemented a sequential Logistic Regression model using Keras, with a sigmoid activation function, binary cross-entropy loss, and the Adam optimizer, achieving better performance than the NLTK implementation. Using frequency and presence features we get almost the same accuracies, but the performance is slightly better when we use unigrams and bigrams together. The best accuracy achieved was 81.52%. A comparison of accuracies obtained on the validation set using different features is shown in table 5.
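A sketch of such a model (Keras 2-style API; the input dimension matches the unigram+bigram sparse vectors, and the training call is shown commented since it needs the feature matrices):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Logistic regression = one sigmoid unit over the 25000-dim feature vector.
model = Sequential([Dense(1, activation='sigmoid', input_shape=(25000,))])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10)
```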
4.6 XGBoost
We also attempted tackling the problem with the XGBoost classifier. We set the maximum tree depth to 25; this parameter is used to control over-fitting, as a high value might result in the model learning relations that are specific to the training data. Since XGBoost is an algorithm that utilises an ensemble of weaker trees, it is important to tune the number of estimators used. We found that setting this value to 400 gave the best result. The best result was 78.72, which came from the configuration of presence with Unigrams + Bigrams.
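The corresponding configuration with the xgboost package (a sketch; all other hyperparameters are left at their defaults):

```python
from xgboost import XGBClassifier

# max_depth=25 and n_estimators=400, as tuned above.
clf = XGBClassifier(max_depth=25, n_estimators=400)
# clf.fit(X_train, y_train)        # sparse presence unigram+bigram features
# print(clf.score(X_val, y_val))
```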
4.7 SVM
We utilise the SVM classifier available in sklearn. We set the C term to 0.1; C is the penalty parameter of the error term, i.e., it controls the influence of misclassification on the objective function. We run SVM with both Unigrams as well as Unigrams + Bigrams, and we run both configurations with frequency and presence features. The best result was 81.55, which came from the configuration of frequency and Unigrams + Bigrams.
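A sketch with scikit-learn (LinearSVC is an assumption here, chosen because a linear kernel scales to sparse data of this size):

```python
from sklearn.svm import LinearSVC

clf = LinearSVC(C=0.1)  # C: penalty parameter of the error term
# clf.fit(X_train, y_train)        # tf-idf (frequency) unigram+bigram features
# print(clf.score(X_val, y_val))
```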
| Algorithm | Presence: Unigrams | Presence: Unigrams+Bigrams | Frequency: Unigrams | Frequency: Unigrams+Bigrams |
| --- | --- | --- | --- | --- |
| Naive Bayes | 78.16 | 79.68 | 77.52 | 79.38 |
| Max Entropy | 79.96 | 81.52 | 79.7 | 81.5 |
| Decision Tree | 68.1 | 68.01 | 67.82 | 67.78 |
| Random Forest | 76.54 | 77.21 | 76.16 | 77.14 |
| XGBoost | 77.56 | 78.72 | 77.42 | 78.32 |
| SVM | 79.54 | 81.11 | 79.83 | 81.55 |
| MLP | 80.1 | 81.7 | 80.15 | 81.35 |

Table 5: Accuracies (%) obtained on the validation set using different features.
The output of the final layer is a single value which we pass through the sigmoid non-linearity to squash it into the range [0, 1]. The sigmoid function is defined by f(z) = 1 / (1 + exp(−z)). The output from the neural network gives the probability P(positive | tweet), i.e., the probability of the tweet's sentiment being positive. At the prediction step, we round off the probability values to convert them to class labels 0 (negative) and 1 (positive). The architecture of the model is shown in the figure; red hidden layers represent layers with sigmoid non-linearity. We trained our model using binary cross-entropy loss with the Adam weight update scheme. We also conducted experiments using SGD + Momentum weight updates and found that it takes too long to converge. We ran our model up to 20 epochs, after which it began to overfit. We used the sparse vector representation of tweets for training. We found that the presence of bigram features significantly improved the accuracy.
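A minimal Keras sketch consistent with this description (the hidden-layer width of 500 is an assumption; the report does not restate it here):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(500, activation='sigmoid', input_shape=(25000,)),  # assumed hidden width
    Dense(1, activation='sigmoid'),                          # P(positive | tweet)
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Class labels are obtained by rounding: (model.predict(X) >= 0.5).astype(int)
```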
Figure 5: Neural Network Architecture with 1 Conv Layer.
Each tweet is padded with zeros at the end until its length is equal to max_length, which is a parameter we tweak in our experiments.
We trained our model using binary cross-entropy loss with the Adam weight update scheme. We also conducted experiments using SGD + Momentum weight updates and found that it takes much longer (≈100 epochs) to reach a validation accuracy equivalent to Adam's. We ran our model up to 10 epochs. Using the Adam weight update scheme, the model converges very fast (≈4 epochs) and begins to overfit badly after that. We therefore use models from the 3rd or 4th epoch for our results. We tried four different CNN architectures, which are as follows.
• 1-Conv-NN: As the name suggests, this is an architecture with 1 convolution layer. We perform temporal convolution with a kernel size of 3 and zero padding. After the convolution layer, we apply the relu activation function (defined as f(x) = max(0, x)) and then perform Global Max Pooling over time to reduce the dimensionality of the data. We pass the output of the Global Max Pool layer to a fully-connected layer, which then outputs a single value that is passed through the sigmoid activation function to convert it into a probability. We also added dropout layers after the embedding layer and the fully-connected layer to regularize our network and prevent it from overfitting. We use a tweet max_length of 20 in this network with a vocabulary of 80000 words. The complete architecture of the network is embedding_layer (80001×200) → dropout(0.2) → conv_1 (500 filters) → relu → global_maxpool → dense(500) → relu → dropout(0.2) → dense(1) → sigmoid as shown in figure 5. Green layers indicate relu activation while red indicates sigmoid.
• 2-Conv-NN: In this architecture we increased the vocabulary from 80000 to 90000 words. We also increased the dropout after the embedding layer to 0.4 and that after the fully-connected layer to 0.5 to further regularize the network and prevent overfitting. We changed the number of filters in the first convolution layer to 600 and added another convolution layer with 300 filters after it. We also replaced the Global MaxPool layer with a Flatten layer, as we believed some features of the input tweets were lost while max pooling. We also increased the number of units in the fully-connected layer to 600. All of these changes allowed the network to learn and regularize better, thereby improving the validation accuracy. The complete architecture of the network is embedding_layer (90001×200) → dropout(0.4) → conv_1 (600 filters) → relu → conv_2 (300 filters) → relu → flatten → dense(600) → relu → dropout(0.5) → dense(1) → sigmoid as shown in figure 6.
• 3-Conv-NN: In this architecture we added another convolution layer with 150 filters after the second convolution layer, i.e., the architecture is the same as 2-Conv-NN with conv_3 (150 filters) → relu inserted after the second convolution layer, as shown in figure 7.
Figure 7: Neural Network Architecture with 3 Conv Layers.
• 4-Conv-NN: In this architecture we added another convolution layer with 75 filters after the third convolution layer. We also increased the max_length of the tweets to 40, given that the length of the largest tweet in our pre-processed dataset is about 40 words. The complete architecture of the network is embedding_layer (90001×200) → dropout(0.4) → conv_1 (600 filters) → relu → conv_2 (300 filters) → relu → conv_3 (150 filters) → relu → conv_4 (75 filters) → relu → flatten → dense(600) → relu → dropout(0.5) → dense(1) → sigmoid as shown in figure 8.
We notice that each successive CNN model is better than the previous one, with 1-Conv-NN, 2-Conv-NN, 3-Conv-NN and 4-Conv-NN achieving accuracies of 82.40, 82.76, 82.95 and 83.34 respectively on the Kaggle public leaderboard.
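For illustration, a minimal Keras sketch of the 1-Conv-NN architecture described above (all sizes are taken from the description; Keras 2-style API, data wiring omitted):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Dropout, Conv1D,
                                     GlobalMaxPooling1D, Dense)

VOCAB, EMBED_DIM, MAX_LENGTH = 80001, 200, 20

model = Sequential([
    Embedding(VOCAB, EMBED_DIM, input_length=MAX_LENGTH),  # embedding_layer (80001x200)
    Dropout(0.2),
    Conv1D(500, kernel_size=3, padding='same', activation='relu'),  # conv_1, zero padding
    GlobalMaxPooling1D(),
    Dense(500, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid'),  # probability of positive sentiment
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```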
• Embeddings Seeded with GloVe: In these models, we use a word vector dimension of 200 and seed the embeddings with the GloVe word vectors provided by the StanfordNLP group. The word embeddings are fine-tuned during the course of training. We follow the embedding layer with an LSTM layer, which is followed by a fully-connected layer with relu activation. Finally, the output is a single value with sigmoid activation. We add dropouts of 0.4 and 0.5 after the embedding layer and the penultimate layer respectively to further regularize the network.
| LSTM Units | Dense Units | max_length | Loss | Embedding Initialization | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 100 | 32 | 40 | MSE | Random | 79.8% |
| 100 | 32 | 40 | BCE | Random | 82.2% |
| 50 | 32 | 40 | MSE | Random | 78.96% |
| 50 | 32 | 40 | BCE | Random | 81.97% |
| 100 | 600 | 20 | BCE | GloVe | 82.7% |
| 128 | 64 | 40 | BCE | GloVe | 83.0% |

Table 6: Comparison of different LSTM models. MSE is mean squared error and BCE is binary cross entropy.
We experimented with different values of LSTM and fully-connected units, and the results are summarized in table 6. The architecture of our best performing LSTM-NN is shown in figure 9. We experimented with both the Adam optimizer and SGD with momentum for training our networks, and found that Adam worked better and converged faster. We trained our models using mean_squared_error and binary_cross_entropy losses, and found that binary_cross_entropy worked better than mean_squared_error, which is expected given our binary classification problem. The results from the various LSTM models are summarized in table 6. We obtain a best accuracy of 83.0% among the different LSTM models.
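A sketch of the best LSTM configuration from table 6 (128 LSTM units, 64 dense units, max_length 40, BCE loss; the GloVe matrix is shown as a random placeholder, and the vocabulary size is an assumption):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

VOCAB, EMBED_DIM, MAX_LENGTH = 90001, 200, 40

# Placeholder: in practice each row is the GloVe vector for that vocabulary word.
glove_weights = np.random.normal(size=(VOCAB, EMBED_DIM)).astype('float32')

model = Sequential([
    Embedding(VOCAB, EMBED_DIM, input_length=MAX_LENGTH,
              weights=[glove_weights], trainable=True),  # seeded, then fine-tuned
    Dropout(0.4),
    LSTM(128),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```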
4.11 Ensemble
In a quest to further improve accuracy, we developed a simple ensemble model. We first extract 600-dimensional feature vectors for each tweet from the penultimate layer of our best performing 4-Conv-NN model, so each tweet is now represented by a 600-dimensional feature vector. We classify these vectors using a linear SVM model with C=1. We then take the majority vote of the predictions from the following 5 models.
1. LSTM-NN
2. 4-Conv-NN
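Majority voting itself is a one-liner once the per-model predictions are collected; a sketch with illustrative 0/1 prediction arrays:

```python
import numpy as np

# One row of 0/1 predictions per model (5 models, 4 example tweets).
predictions = np.array([
    [1, 0, 1, 1],   # e.g., LSTM-NN
    [1, 0, 0, 1],   # e.g., 4-Conv-NN
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
])

# With 5 voters, the majority label is 1 iff at least 3 models predict 1.
ensemble = (predictions.sum(axis=0) >= 3).astype(int)
print(ensemble)  # -> [1 0 1 1]
```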
5 Conclusion
5.1 Summary of achievements
The provided tweets were a mixture of words, emoticons, URLs, hashtags, user mentions, and symbols. Before training, we pre-processed the tweets to make them suitable for feeding into the models. We
implemented several machine learning algorithms, namely Naive Bayes, Maximum Entropy, Decision Tree, Random Forest, XGBoost, SVM, Multi-Layer Perceptron, Recurrent Neural Networks, and Convolutional Neural Networks, to classify the polarity of tweets. We used two types of features, unigrams and bigrams, and observed that augmenting the feature vector with bigrams improved the accuracy. Once extracted, the features were represented as either sparse or dense vectors. We observed that the presence feature type in the sparse vector representation performed better than frequency.
Neural methods performed better than the other classifiers in general. Our best LSTM model achieved an accuracy of 83.0% on Kaggle, while the best CNN model achieved 83.34%. The model which used features from our best CNN model and classified them with an SVM performed slightly better than the CNN alone. Finally, we used an ensemble method taking a majority vote over the predictions of 5 of our best models, achieving an accuracy of 83.58%.
5.2 Future directions
Handling emotion ranges: we can improve and train our models to handle a range of sentiments. Tweets do not always carry positive or negative sentiment; at times they may have no sentiment, i.e., they are neutral. Sentiment can also have gradations: the sentence "this is good" is positive, but the sentence "this is extraordinary" is somewhat more positive than the first. We could therefore classify sentiment in ranges, say from −2 to +2.
Using symbols: During our pre-processing, we discard most of the symbols like commas, full-stops, and
exclamation marks. These symbols may be helpful in assigning sentiment to a sentence.
For our feature-based approach, our analysis reveals that the most important features are those that combine the prior polarity of words with their part-of-speech tags. We tentatively conclude that sentiment analysis of Twitter data is not very different from sentiment analysis of other types of text. In future work, we will explore richer linguistic analyses, for example parsing, semantic analysis, and topic modelling.
Analysing Positive vs. Negative: this is a binary classification task with two classes of sentiment polarity, positive and negative. We used a balanced dataset of 1709 instances for each class, so the chance baseline is 50%.
For all of these experiments we used Support Vector Machines (SVM) and report averaged 5-fold cross-validation test results. We tune the C parameter for SVM using an embedded 5-fold cross-validation on the training data of each fold: for each fold, we first run 5-fold cross-validation only on the training data of that fold for different values of C, pick the setting that yields the best cross-validation error, and use that C to determine the test error for that fold. As usual, the reported accuracy is the average over the five folds.
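This nested scheme maps directly onto scikit-learn primitives; a sketch on toy data (the C grid is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-in for the balanced dataset.
X, y = make_classification(n_samples=600, n_features=50, random_state=0)

# Inner 5-fold CV picks C per outer fold; outer 5-fold CV gives the test estimate.
inner = GridSearchCV(LinearSVC(), param_grid={'C': [0.01, 0.1, 1, 10]}, cv=5)
scores = cross_val_score(inner, X, y, cv=5)
print(f'accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```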