Uppsala University

IT 16 047
Degree project, 30 credits
August 2016
Using social media and machine learning to predict financial performance of a company
Sepehr Forouzani
Master Programme in Computer Science
Abstract
Using social media and machine learning to predict financial performance of a company
Sepehr Forouzani

Social media have recently become one of the most popular forms of communication for a large number of people. The texts and posts shared on social media are widely used by researchers to analyze, study, and relate to various fields. In this master thesis, sentiment analysis has been performed on posts about two companies that are shared on Twitter, and machine learning algorithms have been used to predict the financial performance of these companies.

Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH-enheten
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Telefax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Supervisor: Lisa Kaati
Subject reviewer: Micheal Ashcroft
Examiner: Edith Ngai
UPTEC 16 047
Contents
1 Introduction
1.1 Objectives
1.2 Method
2 Related Work
3 Background theory
3.1 Social media
3.2 Sentiment analysis
3.3 Financial performance
3.4 Data collection
3.4.1 Feature Vectors
3.5 Machine learning
3.5.1 Classification Algorithms
3.5.2 Data balancing
3.5.3 Feature selection
4 Implementation
4.1 Financial Performance Predictor design
4.2 Financial Performance Predictor Implementation
4.2.1 Collecting data
4.2.2 Feature vectors creation
5 Experiments and Results
5.1 Dataset
5.2 Quarterly reports
5.3 Dictionaries
5.4 Weka
5.5 Experiments
5.5.1 Experiments with the regular dictionary
5.5.2 Experiments using the financial dictionary
6 Discussion
7 Conclusion
8 Future work
List of Figures
1 The methodology
2 Sentiment Analysis methods [18]
3 Machine learning workflow [34]
4 Steps toward financial prediction
5 The format of a feature vector.
List of Tables
1 The datasets used in the experiments.
2 The companies' performance based on the ROA.
3 The two different dictionaries and some example words.
4 Confusion matrix
5 The results for experiment 1 using the TW_BMW dataset.
8 The results for experiment 3 using the TW_VW dataset.
10 The results for experiment 4 using the TW_VW dataset.
1 Introduction
Nowadays, media, and social media in particular, are considered a big data source by researchers because of the large number of people communicating and sharing their ideas, feelings, knowledge, and personal opinions about various topics at any time. During the last ten years, Twitter and Facebook have emerged as the most popular social networking websites. Facebook has 1.59 billion monthly users and Twitter has 332 million active users [6]. Data from social media provide a unique opportunity for social scientists, economists, and statisticians to understand individuals and human behavioral patterns that affect different areas such as finance [4]. For example, recent research on financial performance prediction using opinion and sentiment analysis of posts shared on social media indicates that it may be possible to predict a company's stock value [5].
The data available on social media is enormous, unstructured, and contains a lot of irrelevant information, so it is impossible for individuals to read and analyze all of it manually. To make the best use of data from social media, statistical and data mining techniques need to be applied [7].
Customers' opinions about products and services are always a concern for most large and medium-sized companies. Social media is one of the most widely used sources of data about customers' opinions toward a certain company [8]. Most companies use different methods and techniques to find out what customers think about their services and products. However, relating the data extracted from social media about customers' opinions to correlated aspects of a company, such as productivity, profitability, financial performance, and economics, is not always possible [21]. For example, if a firm improves productivity by downsizing, profitability might be endangered if customer satisfaction depends on the company's services [23]. Research [1] has shown that there is a relation between the opinion and sentiment about a company and its stock price. However, to the best of our knowledge, there are no studies that investigate the relation between the sentiment of tweets and the financial performance of companies.
In this master thesis, we investigate the correlation between the sentiment of tweets in which a certain company is mentioned in a hashtag and the financial performance of that company.
1.1 Objectives
The overall objective of this thesis project is to investigate the relation between sentiment extracted from social media and the financial performance of automotive companies. The goal is to predict the financial performance of a company based on what people write about the company on Twitter. This results in the following more specific objectives:
• Develop techniques for sentiment analysis of data from Twitter with respect to a specific company.
• Use machine learning to train a model that predicts the financial performance of a company.
• Develop a prototype tool for the proposed method.
1.2 Method
The work in this thesis is done through five steps, as illustrated in Figure 1.
Figure 1: The methodology
In the first step, the problem and the objectives of the research are defined. In the second step, a literature review is done. The literature study focuses on reviewing related work as well as gaining knowledge about the techniques that will be used in the project.
In the third step, the experiment setups and configurations are designed and data is collected.
In the fourth step, a prototype tool is developed in order to collect, prepare, and analyze data. The analysis is based on mood and sentiment word lists. For the machine learning components in this project, the Weka data mining tool [39] is used. In the fifth step, the results are evaluated by measuring the accuracy of the performance prediction.
2 Related Work
In this chapter, related work on sentiment analysis methods and on financial prediction using mood and sentiment analysis is reviewed.
In [1] the authors collected public tweets posted by approximately 2.7 million users. Each tweet has an identifier, a publishing time, a submission type, and a text of at most 140 characters. To make the data suitable for analysis, stop words (topic-independent words that are the most common in a language) and punctuation are removed, and the text is then filtered by phrases such as "I feel", "i am feeling", "I'm", "Im", "I am", and "makes me", because such phrases express the author's mood state. In the next stage, they use the OpinionFinder (OF) tool [13] for sentiment analysis. To measure the polarity of a sentence, that is, whether it is positive or negative, OF takes a text (e.g. a large number of tweets) and uses the OF lexicon to determine the percentage of positive versus negative sentiment in the text. To measure the mood of a text, they use an algorithm called Google-Profile of Mood States (GPOMS), which measures mood along six dimensions: calm, alert, sure, vital, kind, and happy.
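The preprocessing described for [1] can be illustrated with a short sketch. It is a minimal sketch, not the code used in [1]: the stop-word list is a small hypothetical one, and mood phrases are matched case-insensitively as whole words. As a design choice, the mood filter is checked before punctuation is stripped, so that "I'm" is still recognized as written.

```python
import re
import string

# Hypothetical stop-word list for illustration; the list used in [1] is not given here.
STOP_WORDS = {"a", "an", "the", "and", "or", "to", "of", "is", "it", "so"}
# Mood phrases from the text above, matched case-insensitively as whole words.
MOOD_PHRASES = ("i feel", "i am feeling", "i'm", "im", "i am", "makes me")


def expresses_mood(tweet):
    """Return True if the tweet contains at least one mood phrase."""
    text = tweet.lower()
    return any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in MOOD_PHRASES)


def clean(tweet):
    """Lower-case the tweet, strip punctuation and drop stop words."""
    no_punct = tweet.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.split() if w not in STOP_WORDS]


tweets = ["I feel great about the market!", "Stocks fell today.", "This rally makes me happy"]
mood_tweets = [clean(t) for t in tweets if expresses_mood(t)]
print(mood_tweets)
```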
To normalize the time series and make the OF and GPOMS results comparable, the authors of [1] use z-scores computed from a local mean and standard deviation. The authors also use the econometric technique of Granger causality analysis [19] to investigate the relation between public mood and changes in the closing values of the stock market. The Granger causality analysis indicates a predictive relation between certain mood categories and the closing price of the stock market.
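The local z-score normalization can be sketched as follows. This is a minimal sketch, assuming a symmetric window of k observations on each side of every point (truncated at the ends of the series) and the population standard deviation; the exact window used in [1] is not stated in this summary.

```python
from statistics import mean, pstdev


def local_zscores(series, k):
    """Z-score each value against the mean and standard deviation of a local window
    of up to k values on each side (the window is truncated at the series ends)."""
    scores = []
    for t, x in enumerate(series):
        window = series[max(0, t - k): t + k + 1]
        sd = pstdev(window)
        # A constant window has no spread; report 0 instead of dividing by zero.
        scores.append(0.0 if sd == 0 else (x - mean(window)) / sd)
    return scores


# Hypothetical daily mood scores.
print(local_zscores([1, 2, 3], 1))  # [-1.0, 0.0, 1.0]
```

Because every series is expressed in units of its own local standard deviation, OF and GPOMS values become comparable even though their raw scales differ.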
In [3] the authors used machine learning and social media to predict how successful a movie will be. To measure the success of a movie, the authors used return on investment (ROI), which is a profitability metric, and they applied binary and multi-class classification algorithms such as support vector machines (SVM), multilayer perceptron (MLP), decision trees (J48), random forest, and LogitBoost to predict success. The results show that random forest was the best classifier, with an accuracy of almost 84%.
In [12] the authors investigate the possibility of predicting sales of electronic devices using social media. They analyze the sentiment of Twitter comments about a certain product before the product is released, using semi-supervised recursive autoencoders to predict the sentiment distribution. A semi-supervised recursive autoencoder is an artificial neural network whose goal is to learn an encoding of a set of data, typically for the purpose of dimensionality reduction. In sentiment analysis, semi-supervised recursive autoencoders are used to learn semantic vector representations of phrases [20]. After running sentiment analysis, the total number of comments, the number of positive comments, the total number of retweeted comments, and the number of retweeted positive comments are extracted and used as features in their model. In the experiments, their model showed 35% accuracy in predicting iPad 3 sales, while linear regression showed 58% accuracy, which is still too low for a practical model.
In [2] the authors use Artificial Neural Networks (ANN), Support Vector Machines (SVM), and Relevance Vector Machines (RVM) to predict daily returns for an FX carry basket. A currency basket is a portfolio of selected currencies with different weightings. An FX carry basket, made of a long position in high-yielding currencies versus a short position in low-yielding ones, is a common asset for fund managers and speculative traders. It was found that, in general, the committee of networks was much more effective at predicting five-day returns than one-day returns, and the optimal configuration was chosen on this basis.
In [9] it is stated that the word lists generally used to measure the sentiment of a text are not accurate for finance-related texts. To illustrate this, the authors of [9] reviewed the negative words extracted from 10-K reports (annual reports that contain a summary of a company's financial performance [15]) based on the Harvard dictionary [14], and found that almost seventy-five percent of the words counted as negative are not negative in a financial context. They therefore developed a new word dictionary that reflects the tone of financial texts with higher accuracy. The authors used a bag-of-words approach (treating a text as a bag of its words, regardless of grammar and word order) to produce vectors of words and word counts, and modified one of the most common term weighting schemes to adjust it for document length.
In [10] the authors develop an automated method for sentiment classification. They use a classifier based on multinomial Naive Bayes to determine the positive, negative, and neutral sentiment of a document. They also propose a technique that can be used to determine the sentiment of documents in any language. In their method, TreeTagger [16] (a language-independent part-of-speech tagger) is used for part-of-speech tagging, and the differences in the distribution of tags between positive, negative, and neutral texts are observed. For feature extraction, they use N-grams as binary features together with the frequency of keywords. Unigrams, bigrams, and trigrams are used in the experiments, and the authors state that performance is best when bigrams are used.
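Binary N-gram features of the kind used in [10] can be sketched as below. The vocabulary is hypothetical; in [10] it would be built from the training corpus, and the whitespace tokenization is a simplifying assumption.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams in tokens, each joined by a space."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def binary_features(text, vocabulary, n):
    """Map a text to a 0/1 vector: 1 if the vocabulary n-gram occurs in the text."""
    present = ngrams(text.lower().split(), n)
    return [1 if gram in present else 0 for gram in vocabulary]


# Hypothetical bigram vocabulary for illustration.
vocabulary = ["not good", "very good"]
print(binary_features("Not good at all", vocabulary, 2))  # [1, 0]
```

Bigrams such as "not good" keep a negation attached to the word it modifies, which unigram features cannot do.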
In [11] four classes of mood (calm, happy, alert, and kind) are used, and a text is categorized into these four classes using an analysis tool. The tool uses a word list based on the Profile of Mood States (POMS) questionnaire [17], where the different POMS states are mapped to the four mood states using static correlation rules. The authors also filtered a set of tweets down to emotion-specific texts using words such as "feel", "makes me", "I'm", and "I am". They use a new cross validation method called k-fold sequential cross validation to train the model, and the model showed 75.56% accuracy in predicting stock market movements. They tried four different learning algorithms to learn and study the correlation between mood and market: linear regression, logistic regression, support vector machines (SVMs), and self-organizing fuzzy neural networks (SOFNN). They conclude that SOFNN performed better than the other algorithms.
3 Background theory
3.1 Social media
The tools and platforms that enable users to interact and exchange information in different forms, such as text, pictures, and video, are called social media [24]. There are a number of different types of social media, for example blogs, discussion boards, and networking platforms such as Facebook and Twitter. Twitter is one of the most popular social media services. It enables users to publish and share texts of at most 140 characters, called tweets, and to use hashtags ("#") to relate their tweets to a specific topic, person, or company. Several companies and business strategists consider social media an important arena and are constantly trying to find ways to increase their profitability using social media [25].
3.2 Sentiment analysis
Sentiment analysis uses natural language processing and information extraction with the goal of determining whether the writer's feeling is positive, negative, or neutral [27]. Sentiment analysis is often used as a component in opinion mining when the goal is to analyze sentiment and attitudes [28]. There are a number of methods that can be used to classify the sentiment of a text. A list of methods is shown in Figure 2.
Figure 2: Sentiment Analysis methods [18]
In this thesis, the dictionary-based approach is used for sentiment analysis.
3.3 Financial performance
Financial analysts and investors most often focus on return on equity (ROE) as the primary metric for measuring company performance. Many executives focus heavily on this metric as well, believing that it is the one that gets the most attention from the investor community. ROE is calculated by dividing net income by shareholder's equity:
$$\text{Return on Equity} = \frac{\text{Net Income}}{\text{Shareholder's Equity}} \tag{1}$$
Shareholder's equity is the equity of a company as divided among the individual shareholders of the company's stock [48]. Using ROE as a performance metric also has some shortcomings. For example, companies can artificially maintain a good ROE by increasing debt leverage and through stock buybacks funded by accumulated cash. Therefore, other metrics such as return on assets (ROA) can be used instead of ROE. ROA directly considers the assets that are used to support business activities, and it determines whether a company is able to generate a sufficient return on its assets rather than simply showing a robust return on sales [29]. ROA is an indicator of a company's profitability relative to its total assets [31]; it captures the fundamentals of a company's performance in a general way by looking at both income statement performance and the assets required to run a business [22]. ROA is therefore a good metric for measuring how well a company generates income from its assets. ROA is calculated by dividing a company's earnings by its total assets and is expressed as a percentage. ROA is sometimes referred to as "return on investment". ROA is calculated using the following formula:
$$\text{Return on Assets} = \frac{\text{Net Income}}{\text{Total Assets}} \tag{2}$$
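Equations (1) and (2) can be illustrated numerically. The minimal sketch below uses hypothetical figures, not taken from any company report, and also shows the shortcoming mentioned above: a share buyback funded with cash shrinks equity and assets by the same amount, which inflates ROE much more than ROA.

```python
def return_on_equity(net_income, shareholders_equity):
    """Equation (1), expressed as a percentage."""
    return 100.0 * net_income / shareholders_equity


def return_on_assets(net_income, total_assets):
    """Equation (2), expressed as a percentage."""
    return 100.0 * net_income / total_assets


# Hypothetical figures in one currency unit: net income 2, equity 40, total assets 200.
print(return_on_equity(2.0, 40.0), return_on_assets(2.0, 200.0))  # 5.0 1.0

# A buyback of 20 funded with cash lowers equity to 20 and total assets to 180:
# ROE doubles to 10.0 while ROA only rises to about 1.11.
print(return_on_equity(2.0, 20.0), return_on_assets(2.0, 180.0))
```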
3.4 Data collection
Data collection and dataset creation are the first steps in creating a statistical model using machine learning. The dataset is commonly divided into three subsets: a training set, a validation set, and a test set. The training set is used to train the statistical model, the validation set is used to estimate how well the trained model performs while it is being tuned (for example, when choosing its parameters), and the test set is used to measure the final performance of the model.
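A three-way split can be sketched as below. The 70/15/15 proportions and the random shuffling are assumptions chosen for illustration; the thesis itself relies on Weka's cross validation rather than a fixed split.

```python
import random


def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=1):
    """Shuffle a copy of items and cut it into training, validation and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]  # the remainder is used for training
    return train, val, test


train, val, test = train_val_test_split(range(20))
print(len(train), len(val), len(test))  # 14 3 3
```

Random shuffling is only one choice; for time-ordered data such as quarterly results, a chronological split (training on earlier periods, testing on later ones) is often preferred.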
3.4.1 Feature Vectors
A feature vector is the way an object is represented in machine learning and pattern recognition. Feature vectors are n-dimensional vectors, where each vector represents one object. A numeric representation of the features (variables) facilitates statistical analysis, and many machine learning algorithms therefore require numerical features.
3.5 Machine learning
Machine learning is a field of computer science that studies and explores ways of making algorithms find patterns or learn how to perform certain tasks from data. In this thesis, machine learning is used to predict the performance of a company. Figure 3 shows the workflow of the machine learning process used in this thesis.
Figure 3: Machine learning workflow [34]
In the first step (data ingestion), the data is collected and stored in a database. The data is then cleaned and/or transformed, and divided into two sets: a training set and a test set. In the next step, a mathematical model is built from the training set and then tested against the test set.
To improve the results, the user can decide, once the model has produced results, to create or choose different data and feature vectors (data representation).
There are three categories of machine learning, based on the nature of the learning:
• Supervised learning: the computer receives a set of inputs and their related outputs from a teacher. The goal is to find a general model that maps inputs to outputs.
• Unsupervised learning: the computer finds structure in the input data without any input from a teacher.
• Reinforcement learning: the computer interacts with an environment to achieve a goal, learning from the feedback (rewards) it receives rather than from a teacher.
3.5.1 Classification Algorithms
The task of a classification algorithm is to assign new observations to the correct category among a set of known categories. The classifier estimates the category of new data based on model parameters learned from the training data. Different classification algorithms use different methods and variables, and therefore several classification algorithms can be applied to the data in order to find the most suitable and efficient one [30]. In this section, the classification algorithms used in the project are reviewed.
Random Forest [35] is an ensemble of bagged decision trees that combines bootstrap sampling of the data with a form of attribute bagging (at each split, only a random subset of the features is considered). A decision tree is a directed series of decisions, based on the values of input variables, that culminates in a classification of the target variable. Bagging (bootstrap aggregating) is a method of combining multiple predictors: it draws a bootstrap sample from the training set and trains a predictor on that sample, once for each predictor. A bootstrap sample is a sample drawn with replacement from the training data. Bagging is designed to improve the stability and accuracy of machine learning algorithms. Random forests also provide a simple means of analyzing feature importance, and the resulting score is known as the variable importance score. In random forest, it is not necessary to set aside a separate test set to get an unbiased estimate of the error, since each tree is built from a different bootstrap sample of the original data and the instances left out of that sample (the out-of-bag instances) can be used to estimate the error.
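The two sources of randomness in a random forest can be sketched as follows. This is a minimal sketch, not Weka's RandomForest: it only draws the bootstrap sample, the resulting out-of-bag indices, and a random feature subset; tree induction itself is omitted.

```python
import random


def bootstrap_sample(n, rng):
    """Draw n instance indices with replacement (a bootstrap sample) and return
    them together with the out-of-bag indices, i.e. those never drawn."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(n_features, m, rng):
    """Attribute bagging: pick m distinct feature indices to consider at a split."""
    return rng.sample(range(n_features), m)


rng = random.Random(1)
in_bag, oob = bootstrap_sample(100, rng)
# A tree would be trained on in_bag and its error estimated on oob.
features = random_feature_subset(10, 3, rng)
```

On average, about 1/e (roughly 37%) of the instances end up out-of-bag for each tree, which is what makes the built-in error estimate possible.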
Naive Bayes [33] is a probabilistic classifier that uses Bayes' theorem with the assumption that the features are independent (the occurrence of one feature does not affect the probability of the others). Naive Bayes computes the probability $p(c \mid \mathbf{x})$ that an instance represented by the feature vector $\mathbf{x} = (x_1, \ldots, x_n)$ belongs to the class $c$. Using Bayes' theorem, this conditional probability can be written as:
$$p(c \mid \mathbf{x}) = \frac{p(c)\, p(\mathbf{x} \mid c)}{p(\mathbf{x})} \tag{3}$$
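Under the independence assumption, the likelihood factorizes as $p(\mathbf{x} \mid c) = \prod_{i=1}^{n} p(x_i \mid c)$, so each factor can be estimated separately from the training data. This makes Naive Bayes fast to train, and it is therefore useful when training time is important.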
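Equation (3) with the independence assumption can be sketched for word-count data. This is a minimal multinomial Naive Bayes with Laplace smoothing, chosen here because it matches bag-of-words features; it is an illustration only, and Weka's NaiveBayes, used in the experiments, by default models numeric attributes with normal distributions instead. The training data is hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count class frequencies and per-class word frequencies from tokenized docs."""
    priors = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return priors, word_counts, vocab, len(labels)


def predict_nb(model, tokens):
    """Pick the class c maximising log p(c) + sum_i log p(x_i | c).

    p(x) in equation (3) is the same for every class, so it is dropped.
    Word probabilities use Laplace (add-one) smoothing.
    """
    priors, word_counts, vocab, n_docs = model
    best_class, best_score = None, -math.inf
    for c, count in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(count / n_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy data: tokenized texts labelled over- or under-performance.
docs = [["good", "profit"], ["great", "growth"], ["bad", "loss"], ["weak", "loss"]]
labels = ["over", "over", "under", "under"]
model = train_nb(docs, labels)
print(predict_nb(model, ["profit", "growth"]))  # over
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when many features are multiplied together.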
AdaBoost [32] stands for adaptive boosting, and it is based on the idea that finding many weak models is easier than finding one accurate model. Boosting is an approach to creating prediction rules with high accuracy by combining weak models and rules that have low prediction accuracy. Boosting generates a sequence of base models and then decides on a final estimate of the target variable by aggregating the estimates of the base models. AdaBoost trains its weak classifiers sequentially, increasing the weights of the training instances that the previous classifiers misclassified, and the final estimate is a weighted vote of the weak classifiers. Similar to the random forest algorithm, AdaBoost also provides a variable importance estimate, but in a different way: in AdaBoost, the more informative variables are used more often, and the less informative features are barely used.
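The reweighting loop of AdaBoost can be sketched with decision stumps as weak classifiers. This is a minimal sketch of discrete AdaBoost for labels in {-1, +1}; it does not reproduce Weka's AdaBoostM1 (for example its weight-threshold option), and the toy data is hypothetical.

```python
import math


def train_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump.
    A stump predicts `polarity` if row[feature] >= threshold, else -polarity."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                preds = [pol if row[j] >= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def stump_predict(stump, row):
    _, j, thr, pol = stump
    return pol if row[j] >= thr else -pol


def adaboost(X, y, rounds=10):
    """Discrete AdaBoost for labels in {-1, +1}; returns a list of (alpha, stump)."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)  # guard against log(0) for a perfect stump
        if err >= 0.5:
            break  # the weak learner is no better than chance
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weights of misclassified instances, lower the others.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of the weak classifiers."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# No single stump can label 1, 2, 3 as +1, -1, +1, but three boosted stumps can.
X, y = [[1], [2], [3]], [1, -1, 1]
ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, row) for row in X])  # [1, -1, 1]
```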
Cross validation [42] partitions the original data into a training set and a test set in order to train and evaluate a model. In k-fold cross validation, the original data is divided into k subsamples (folds). One subsample is selected as the test set, and the remaining k − 1 subsamples are used as the training set for the model. The process is repeated k times so that each subsample is used exactly once as the test set, and the k results are then averaged (or otherwise combined) to produce a single estimate.
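The fold bookkeeping of k-fold cross validation can be sketched as below. The sketch assumes the instances have already been shuffled; Weka's evaluation additionally stratifies the folds so that each fold keeps roughly the class proportions of the whole dataset, which this sketch does not do.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross validation.

    Indices 0..n-1 are split into k contiguous folds whose sizes differ by at
    most one; each fold is the test set exactly once.
    """
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
print([len(test) for _, test in folds])  # [4, 3, 3]
```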
3.5.2 Data balancing
If the numbers of instances in the classes of a dataset differ greatly, the dataset is called imbalanced. To counter the issues caused by imbalanced data, methods such as over-sampling (creating new instances of a class) and under-sampling (removing instances of a class) have been proposed. The Synthetic Minority Oversampling TEchnique (SMOTE) [36] is an over-sampling algorithm that creates synthetic instances of the minority class (the class with fewer instances), and it can be combined with under-sampling of the majority class. In SMOTE, the k nearest minority-class neighbors of a minority instance are found, and, depending on the required amount of over-sampling, neighbors are chosen among them. A synthetic sample is then created with the following steps:
• Take the difference between the feature vector of the instance and that of the chosen nearest neighbor.
• Multiply this difference by a random number between 0 and 1.
• Add the result to the feature vector of the instance, which gives a new data point on the line segment between the two instances.
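The three steps above can be sketched as follows. This is a simplified variant, not Weka's SMOTE filter: the base instance is picked at random for each synthetic sample, whereas the original algorithm iterates over every minority instance and creates a fixed number of samples for each. The minority-class data is hypothetical.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=1):
    """Create n_synthetic points, each on the line segment between a minority-class
    instance and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority instances (Euclidean).
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(p, x))[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # random value in [0, 1)
        # Difference to the neighbour, scaled by gap, added to the instance.
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nn)])
    return synthetic


# Hypothetical minority class with three 2-D feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_synthetic=4, k=2)
```

Because every synthetic point lies between two real minority instances, SMOTE widens the minority region instead of duplicating existing points, as plain random over-sampling would.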
3.5.3 Feature selection
The process of selecting the subset of features that should be used to construct the model is called feature selection. In machine learning and statistics, the process is also called variable selection. There are various ways to perform feature selection. For example, information gain (IG) identifies the most important features using the formula:
$$IG(T, a) = H(T) - H(T \mid a) \tag{4}$$
where:
T is the set of training examples,
a is the index of a feature,
H(·) is the entropy function (entropy is a measure of the randomness of a variable, and it measures the level of impurity in a group of examples), so that H(T | a) is the conditional entropy of T given the value of feature a.
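Equation (4) can be computed directly for a nominal feature. This minimal sketch assumes base-2 entropy and nominal (categorical) feature values; numeric features such as word counts would first have to be discretized. The data is hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """H(T) = -sum over classes c of p_c * log2(p_c)."""
    n = len(labels)
    return -sum((m / n) * math.log2(m / n) for m in Counter(labels).values())


def information_gain(rows, labels, a):
    """IG(T, a) = H(T) - H(T | a) for the nominal feature at index a."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[a], []).append(label)
    # H(T | a): entropy of each value's subset, weighted by the subset's size.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical data: feature 0 separates the classes, feature 1 does not.
rows = [["pos", "x"], ["pos", "y"], ["neg", "x"], ["neg", "y"]]
labels = ["over", "over", "under", "under"]
print(information_gain(rows, labels, 0), information_gain(rows, labels, 1))  # 1.0 0.0
```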
4 Implementation
In this chapter, the design and implementation of the financial performance predictor (FPP) are described.
4.1 Financial Performance Predictor design
The financial performance predictor (FPP) is a prototype tool for predicting companies' financial performance using machine learning. The way FPP is used is shown in Figure 4.
Figure 4: Steps toward financial prediction
The first step is to collect relevant data; in this thesis, we use data from Twitter. To detect the sentiment of a tweet or a group of tweets, we use the bag-of-words method. The bag-of-words method focuses on words, or in some cases sets of words (strings of words), regardless of the context of the sentence. We use a list of words (a dictionary) in which every word is attached to a sentiment, either positive or negative. In the experiments, we have used two different dictionaries: one developed for financial purposes and one more general. The second step is to count the number of occurrences of each dictionary word in the extracted tweets. The result is combined with the ROA for the corresponding time period and included in the feature vectors. In the fourth step, machine learning algorithms are applied to the feature vectors to train a model that predicts whether the ROA increases or decreases based on the sentiment of the tweets. The classification algorithms that we have used to train the model are Random Forest, Naive Bayes, and AdaBoost.
4.2 Financial Performance Predictor Implementation
Various programming languages and tools are used in the implementation of the FPP.
4.2.1 Collecting data
To download tweets, a web scraper was written in the Python programming language. In the first step, a web search query is made using a Python library called selenium [49]. In the second step, the HTML content of the results page is obtained from the page source of the browser driver.
In the third step, a Python library called beautifulsoup [41] is used to parse the HTML source and extract the required data.
In the last step, the tweets are saved as a comma-separated values (CSV) file and then stored in a MySQL database to ease data management.
4.2.2 Feature vectors creation
In this thesis, a program for creating feature vectors is written in Java. The program uses the word dictionaries and counts the number of occurrences of each dictionary word in the tweets. The result is stored in a vector. The format of a feature vector is shown in Figure 5.
Figure 5: The format of a feature vector.
The class variable is the company's performance. Its value is 1 in case of over-performance and 0 in case of under-performance.
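The feature vector creation can be sketched in Python (the thesis program itself is written in Java). This is a minimal sketch under stated assumptions: tokenization into lower-case runs of letters and apostrophes, single-word dictionary entries only, and a layout of word counts followed by the class variable; the dictionary and tweets are hypothetical.

```python
import re
from collections import Counter


def feature_vector(tweets, dictionary, over_performed):
    """Count each dictionary word over a group of tweets and append the class
    variable (1 = over-performance, 0 = under-performance)."""
    counts = Counter()
    for tweet in tweets:
        # Assumed tokenization: lower-case runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", tweet.lower()))
    return [counts[word] for word in dictionary] + [1 if over_performed else 0]


# Hypothetical three-word dictionary and tweets for one quarter.
dictionary = ["good", "happy", "hate"]
tweets = ["Good car, good price", "I hate traffic", "happy"]
print(feature_vector(tweets, dictionary, over_performed=True))  # [2, 1, 1, 1]
```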
5 Experiments and Results

In this section, the experimental setup and the results are described. The results are further analyzed in Section 6.
5.1 Dataset
Two datasets are used for the experiments. The first dataset, denoted TW_BMW, contains tweets in which BMW is either mentioned or used in a hashtag (#BMW). The second dataset, TW_VW, contains tweets in which Volkswagen is either mentioned or used in a hashtag (#Volkswagen). The two datasets are described in Table 1.
Table 1: The datasets used in the experiments.

Dataset   Description                    Number of tweets   Time period
TW_BMW    Tweets related to BMW          677596             2007-2015
TW_VW     Tweets related to Volkswagen   151648             2012-2015
An example of a negative tweet from TW_BMW is:
"BMW is ruining the M-division brand by releasing crap like the "X6 M" - http://tinyurl.com/cb2nq7"
An example of a positive tweet from the same dataset is:
"Track drive reveals excellent balance of the 2015 BMW 228i - Torque News http://bit.ly/1xk4xj7 - #BMW"
An example of a neutral tweet (neither positive nor negative) from the same dataset is:
"mclaren should come back later in the race when ferrari and bmw have to use the hard tyres hopefully, anyway"
The sentiment of each tweet is determined by counting the occurrences of positive and negative words. If a tweet contains more positive words than negative words, its sentiment is considered positive; if it contains more negative words than positive words, its sentiment is considered negative. If a tweet contains equal numbers of positive and negative words, its sentiment is considered neutral.
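The counting rule above translates directly into a few lines of Python. The word lists below are tiny hypothetical examples chosen to match the example tweets; they are not entries taken from the LIWC-based or Loughran-McDonald dictionaries.

```python
def tweet_sentiment(tokens, positive, negative):
    """Label a tokenized tweet as positive, negative or neutral by comparing the
    number of positive and negative dictionary words it contains."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


# Hypothetical mini word lists for illustration only.
positive = {"excellent", "good"}
negative = {"crap", "ruining"}
tweet = "bmw is ruining the m-division brand by releasing crap"
print(tweet_sentiment(tweet.split(), positive, negative))  # negative
```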
5.2 Quarterly reports
To obtain the value of return on assets (ROA) for each quarter, BMW quarterly reports (10-Q reports) are downloaded from [44] and Volkswagen quarterly reports are downloaded from [45]. The value of ROA is not explicitly stated in the quarterly reports, and it is therefore calculated manually from the total income and the total assets. Table 2 shows the performance of BMW and Volkswagen in different quarters of the year.
Table 2: The companies' performance based on the ROA.

Year  Quarter    BMW            Volkswagen
2015  Quarter 1  Over-perform   Under-perform
2015  Quarter 2  Over-perform   Over-perform
2015  Quarter 3  Under-perform  Under-perform
2013  Quarter 1  Under-perform  Under-perform
2013  Quarter 2  Over-perform   Under-perform
2013  Quarter 3  Under-perform  Over-perform
2011  Quarter 1  Over-perform   -
2011  Quarter 2  Over-perform   -
2009  Quarter 1  Under-perform  -
2009  Quarter 3  Under-perform  -
2008  Quarter 1  Under-perform  -
2008  Quarter 3  Under-perform  -

5.3 Dictionaries
We have used two different dictionaries to determine the sentiment of tweets. The first dictionary (called the regular dictionary) is inspired by the positive and negative emotion categories of the tool Linguistic Inquiry and Word Count (LIWC) [37]. The second dictionary (called the financial dictionary) is the Loughran-McDonald master dictionary [38]. The Loughran-McDonald master dictionary is an extension of the 2of12inf wordlist with the addition of words that appear in companies' annual reports. The 2of12inf is a wordlist from SCOWL (Spell Checker Oriented Word Lists) and Friends, consisting of English words that are useful for creating high-quality word lists for spell checkers [43].
Table 3: The two different dictionaries and some example words.

Dictionary            Category            Example words
Regular dictionary    Positive emotions   happy, pretty, good
                      Negative emotions   hate, worthless, enemy, hurt
Financial dictionary  Positive emotions   best, achieve, able
                      Negative emotions   abandoned, misprice, untrusted

Table 3 shows some sample words from the two dictionaries we have used.
5.4 Weka
All experiments are done using Weka [39]. Weka has a collection of data mining algorithms, predictive modeling and visualization tools, and a graphical user interface for easy access to its functions.
Three different classification algorithms are used in our experiments: Random Forest, Naive Bayes, and AdaBoost. The Information Gain feature selection method has been used with the Naive Bayes classifier. For data balancing, the SMOTE algorithm [36] is used together with the Weka Randomize filter, which shuffles the order of the instances. The default Weka settings used for each algorithm are:
• Random Forest: Number of Trees = 100, Seed = 1.
• AdaBoost: Number of Iterations = 10, Seed = 1, Weight Threshold = 100.
• SMOTE: Nearest Neighbors = 5, Percentage (percentage of SMOTE instances to create) = 100, Random Seed = 1.
5.5 Experiments
We performed four experiments to investigate whether a company's performance can be predicted from public opinion extracted from social media. For each relevant time period, a number of feature vectors are created from the datasets, and a variable describing whether the company was under-performing or over-performing (relative to the previous quarter) is added. The experiments differ in the number of feature vectors created from each dataset and in the features used (which dictionary); all experiments use the same classifier setup.
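The performance variable compares each quarter with the previous one. The following is a minimal sketch under stated assumptions: ROA is taken as net income divided by total assets, and an unchanged ROA is treated as under-performing. The function names and figures are hypothetical and are not taken from the BMW or Volkswagen reports.

```python
from typing import List, Tuple


def roa(net_income: float, total_assets: float) -> float:
    """Return on assets (ROA) = net income / total assets."""
    return net_income / total_assets


def performance_labels(quarters: List[Tuple[float, float]]) -> List[int]:
    """Label every quarter after the first: 1 = over-perform, 0 = under-perform.

    A quarter over-performs when its ROA is higher than the previous quarter's;
    an equal ROA counts as under-performing (assumption).
    """
    values = [roa(ni, ta) for ni, ta in quarters]
    return [1 if cur > prev else 0 for prev, cur in zip(values, values[1:])]


# Hypothetical (net income, total assets) pairs for four consecutive quarters.
quarters = [(1.2, 100.0), (1.5, 102.0), (1.1, 101.0), (1.4, 99.0)]
print(performance_labels(quarters))  # prints [1, 0, 1]
```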
The results for the different classifiers are presented as confusion matrices showing the number of true positives, false negatives, true negatives and false positives, laid out as in Table 4.
                | Predicted negative | Predicted positive
Actual negative | True Neg. (TN)     | False Pos. (FP)
Actual positive | False Neg. (FN)    | True Pos. (TP)
Table 4: Confusion matrix
To evaluate the results, we use accuracy, precision, recall and F-score, all of which can be derived from the confusion matrix.
Accuracy is defined as:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
precision as:
$$\text{Precision} = \frac{TP}{TP + FP}$$
recall as:
$$\text{Recall} = \frac{TP}{TP + FN}$$
and F-score (the harmonic mean of precision and recall) as:
$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
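As a check on the four definitions, the sketch below computes them directly from a confusion matrix. It reads the random forest matrix for the BMW tweets in Table 5 (rows 10 3 and 4 10) as TP FN / FP TN, treating over-perform as the positive class. This reading is an assumption, and it is the one that reproduces the reported figures. `evaluation_metrics` is a hypothetical helper.

```python
def evaluation_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, precision, recall and F-score from a 2x2 confusion matrix.

    Assumes tp + fp > 0 and tp + fn > 0 so that no division by zero occurs.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Random forest on the BMW tweets in Table 5, over-perform as the positive class.
scores = evaluation_metrics(tp=10, fn=3, fp=4, tn=10)
print({name: round(value, 3) for name, value in scores.items()})
```

The result matches the values reported for random forest in Table 5 (74.07%, 0.714, 0.769 and 0.74).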
5.5.1 Experiments with the regular dictionary
Experiment 1: Combined tweets

In the first experiment, all tweets published during each quarter are combined, and one feature vector representing that quarter is created. The words in the regular dictionary are used as features, together with a variable representing the total sentiment of the tweets and a variable indicating whether the company was over-performing or under-performing during that quarter.
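Merging a quarter's tweets into one vector amounts to summing dictionary word counts per calendar quarter. The sketch below assumes naive whitespace tokenization and calendar quarters. `quarterly_vectors` is hypothetical, and the total-sentiment and performance variables described above would be appended to each vector.

```python
from collections import Counter, defaultdict
from datetime import date


def quarter_of(day: date) -> tuple:
    """Map a date to (year, quarter), e.g. 2014-05-01 -> (2014, 2)."""
    return (day.year, (day.month - 1) // 3 + 1)


def quarterly_vectors(tweets, vocabulary):
    """Combine all tweets of each quarter into one word-count vector.

    tweets: iterable of (publication date, text); vocabulary: ordered list of
    dictionary words, one feature per word.
    """
    words = set(vocabulary)
    counts = defaultdict(Counter)
    for published, text in tweets:
        # Naive whitespace tokenization; a real pipeline would also strip punctuation.
        counts[quarter_of(published)].update(t for t in text.lower().split() if t in words)
    return {q: [c[w] for w in vocabulary] for q, c in sorted(counts.items())}


tweets = [(date(2014, 1, 5), "good car"),
          (date(2014, 2, 9), "I hate traffic"),
          (date(2014, 4, 1), "good good service")]
print(quarterly_vectors(tweets, ["good", "hate"]))
```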
In the experiment, a model was trained and evaluated on 27 instances
using 10-fold cross validation.
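With only 27 instances, 10-fold cross-validation leaves each test fold with two or three quarters. The following is a minimal, unstratified sketch of the fold split. Weka's cross-validation by default also stratifies the folds by class for a nominal class, which this sketch omits. `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n: int, k: int = 10, seed: int = 1):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_indices(27))
print([len(test) for _, test in splits])  # prints [3, 3, 3, 3, 3, 3, 3, 2, 2, 2]
```

Every instance is tested exactly once, and the reported accuracy is computed over all ten test folds together.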
Table 5 shows the results for experiment 1 using three different classifiers and the TW_BMW dataset.
Table 5: The results for experiment 1 using the TW_BMW dataset.
Dataset | Classifier    | Over-perform | Under-perform | Accuracy | Precision | Recall | F-Score
TW_BMW  | Random Forest | 10           | 3             | 74.07%   | 0.714     | 0.769  | 0.74
        |               | 4            | 10            |          |           |        |
TW_BMW  | Naive Bayes   | 10           | 3             | 74.07%   | 0.714     | 0.769  | 0.74
        |               | 4            | 10            |          |           |        |
TW_BMW  | AdaBoost      | 9            | 4             | 62.96%   | 0.6       | 0.692  | 0.64
        |               | 6            | 8             |          |           |        |
Experiment 2: Combined tweets and changes in sentiment
In the second experiment, all tweets published during each quarter are combined and the total sentiment is specified, but the feature vectors are created from the changes in sentiment from one quarter to the next. The words in the regular dictionary are used as features, together with a variable representing the total sentiment of the tweets and a variable indicating whether the company was over-performing or under-performing during that quarter. In the experiment, a model was trained and evaluated on 27 instances using 10-fold cross validation.

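Table 6: The results for experiment 2 using the TW_BMW dataset.
Dataset | Classifier    | Over-perform | Under-perform | Accuracy | Precision | Recall | F-Score
TW_BMW  | Random Forest | …            | …             | …        | …         | …      | …
        |               | 11           | 2             |          |           |        |
TW_BMW  | Naive Bayes   | 13           | 1             | 77.77%   | 0.722     | 0.929  | 0.8
        |               | 5            | 8             |          |           |        |
TW_BMW  | AdaBoost      | 10           | 4             | 66.66%   | 0.667     | 0.714  | 0.688
        |               | 5            | 8             |          |           |        |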
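Experiment 2 replaces each quarter's word counts with their change from the previous quarter. The sketch below assumes the quarterly vectors are given in chronological order. How the first quarter is handled is also an assumption: because it has no predecessor, it is simply dropped here. `change_vectors` is hypothetical.

```python
from typing import List


def change_vectors(vectors: List[List[int]]) -> List[List[int]]:
    """Element-wise difference of each quarter's vector from the previous quarter's.

    The first quarter has no predecessor, so one fewer vector is returned.
    """
    return [[cur - prev for prev, cur in zip(p, c)] for p, c in zip(vectors, vectors[1:])]


# Hypothetical word-count vectors for three consecutive quarters.
print(change_vectors([[3, 1, 5], [4, 1, 2], [6, 0, 2]]))  # prints [[1, 0, -3], [2, -1, 0]]
```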
Experiment 3: One feature vector per 100 tweets

In experiment 3, one feature vector is created per 100 tweets, and the Y variable of each feature vector is assigned based on the publication time of its tweets. The Y variable (the value to be predicted) is zero if the company is under-performing and one if the company is over-performing.
In this experiment, the data is balanced using the SMOTE algorithm and the Randomize filter [47]. The Randomize filter randomly shuffles the order of the instances passed through it and is used to prevent over-fitting to the order of the instances.
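Creating one instance per 100 tweets and labelling it by publication time could look like the sketch below. It assumes that chunks never span two quarters and that a final partial chunk of fewer than 100 tweets is dropped; the thesis does not say how either case is handled. `chunked_instances` is hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def chunked_instances(tweets, labels_by_quarter, size=100):
    """Group tweets into chunks of `size` and label each chunk by its quarter.

    tweets: iterable of (publication date, text);
    labels_by_quarter: {(year, quarter): 0 for under-perform, 1 for over-perform}.
    """
    by_quarter = defaultdict(list)
    for published, text in sorted(tweets):
        by_quarter[(published.year, (published.month - 1) // 3 + 1)].append(text)
    instances = []
    for quarter, texts in sorted(by_quarter.items()):
        # Assumption: chunks stay within one quarter and a final partial chunk is dropped.
        for start in range(0, len(texts) - size + 1, size):
            instances.append((texts[start:start + size], labels_by_quarter[quarter]))
    return instances


# Hypothetical data: 250 tweets in Q1 2014 and 120 tweets in Q2 2014.
tweets = [(date(2014, 1, 1) + timedelta(days=i % 80), f"tweet {i}") for i in range(250)]
tweets += [(date(2014, 4, 1) + timedelta(days=i % 80), f"tweet {i}") for i in range(120)]
instances = chunked_instances(tweets, {(2014, 1): 1, (2014, 2): 0})
print([label for _, label in instances])  # prints [1, 1, 0]
```

The dictionary word counts of each chunk would then form the feature vector for that instance.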

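Table 7: The results for experiment 3 using the TW_BMW dataset.
Dataset | Classifier    | Over-perform | Under-perform | Accuracy | Precision | Recall | F-Score
TW_BMW  | Random Forest | …            | …             | …        | …         | …      | …
        |               | 698          | 2968          |          |           |        |
TW_BMW  | Naive Bayes   | 1947         | 925           | 68.09%   | 0.626     | 0.678  | 0.65
        |               | 1161         | 2505          |          |           |        |
TW_BMW  | AdaBoost      | 1136         | 1736          | 62.05%   | 0.604     | 0.396  | 0.478
        |               | 745          | 2921          |          |           |        |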
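Table 8: The results for experiment 3 using the TW_VW dataset.
Dataset | Classifier    | Over-perform | Under-perform | Accuracy | Precision | Recall | F-Score
TW_VW   | Random Forest | 604          | 194           | 86.17%   | 0.953     | 0.757  | 0.842
        |               | 30           | 792           |          |           |        |
TW_VW   | Naive Bayes   | 567          | 231           | 77.22%   | 0.804     | 0.711  | 0.752
        |               | 138          | 684           |          |           |        |
TW_VW   | AdaBoost      | 448          | 350           | 60.86%   | 0.612     | 0.561  | 0.584
        |               | 284          | 538           |          |           |        |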
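SMOTE balances the classes by creating new minority-class instances on the line segments between existing ones [36], and the Randomize filter then shuffles the instance order. The sketch below is written from scratch and mirrors the settings listed in Section 5.4: 5 nearest neighbours, 100% new instances and seed 1. It assumes numeric feature vectors and illustrates the technique only. It is not Weka's SMOTE filter, so its random draws will differ from Weka's. `smote` and `randomize` are hypothetical names.

```python
import math
import random


def smote(minority, percentage=100, k=5, seed=1):
    """Synthetic Minority Over-sampling (SMOTE) for numeric feature vectors.

    For every minority instance, pick one of its k nearest minority neighbours
    and create a new point at a random position on the segment between them.
    Only multiples of 100 are supported for `percentage` in this sketch.
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda y: math.dist(x, y))[:k]
        for _ in range(percentage // 100):
            nn = rng.choice(neighbours)
            gap = rng.random()  # random value in [0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nn)])
    return synthetic


def randomize(instances, seed=1):
    """Return the instances in a random order (analogue of a shuffling filter)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical 2-feature data: 6 minority (label 1) and 12 majority (label 0) instances.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 1.9], [1.1, 2.1]]
majority = [[5.0 + 0.1 * i, 5.0 - 0.1 * i] for i in range(12)]
synthetic = smote(minority)
data = randomize([(x, 0) for x in majority] + [(x, 1) for x in minority + synthetic])
print(len(synthetic), sum(label for _, label in data))  # prints 6 12
```

With percentage = 100, each minority instance contributes one synthetic neighbour. Here this doubles the minority class from 6 to 12 and balances it against the 12 majority instances.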
5.5.2 Experiments using the financial dictionary
Experiment 4: One feature vector per 100 tweets
In the fourth experiment, one feature vector is created per 100 tweets, and the Y variable of each feature vector is assigned based on the publication time of its tweets. The financial dictionary is used for the features.
In this experiment, the SMOTE and Randomize algorithms are used to balance the data instances.

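Table 9: The results for experiment 4 using the TW_BMW dataset.
Dataset | Classifier    | Over-perform | Under-perform | Accuracy | Precision | Recall | F-Score
TW_BMW  | Random Forest | …            | …             | …        | …         | …      | …
        |               | 442          | 1390          |          |           |        |
TW_BMW  | Naive Bayes   | 1110         | 326           | 71.54%   | 0.79      | 0.67   | 0.724
        |               | 604          | 1228          |          |           |        |
TW_BMW  | AdaBoost      | 268          | 1168          | 60.67%   | 0.594     | 0.93   | 0.724
        |               | 117          | 1715          |          |           |        |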
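Table 10: The results for experiment 4 using the TW_VW dataset.
Dataset | Classifier    | Over-perform | Under-perform | Accuracy | Precision | Recall | F-Score
TW_VW   | Random Forest | 1304         | 492           | 83.03%   | 0.914     | 0.726  | 0.808
        |               | 122          | 1702          |          |           |        |
TW_VW   | Naive Bayes   | 1110         | 686           | 65.85%   | 0.669     | 0.618  | 0.64
        |               | 550          | 1274          |          |           |        |
TW_VW   | AdaBoost      | 1629         | 167           | 57.59%   | 0.544     | 0.907  | 0.678
        |               | 1368         | 456           |          |           |        |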
6 Discussion
In the first experiment, one feature vector was created for each quarter, which gives 27 data instances in total. The small number of data instances may be one reason why the accuracy is lower than in the other experiments. In the second experiment, the differences in word counts from the previous quarter are used as features instead of the word counts themselves. With these features, the prediction accuracy dropped for the random forest classifier, while the other classifiers improved slightly. A possible reason for the lower random forest accuracy is that sentiment features should not be expressed relative to other feature vectors. In the third and fourth experiments, one feature vector is created per 100 tweets and the datasets are balanced, and the prediction accuracy improves. This could be due to the much larger and balanced set of instances.
In all experiments except experiment 2, random forest was the most accurate classifier (in experiment 1 it is tied with Naive Bayes). The best result, an accuracy of 86.17%, was obtained with random forest in the third experiment. In that experiment, every 100 tweets from the TW_VW dataset were combined into one feature vector, and the words of the regular dictionary were used as features.
Random forest builds many decision trees, each of which chooses the best variable from a random subset of the feature vector at every node split, and combines their votes. This lets it take relations between variables into account and tends to give higher accuracy. The data used to train the random forest classifier was also balanced, which may have helped produce a more accurate classification model.
7 Conclusion
Customers' opinions about products and services are a concern for most large and medium-sized companies because they affect the companies' financial performance. Social media is one of the most widely used sources of data about customers' opinions of a company. We have presented a machine learning approach to predicting the financial performance of two companies from tweets about them. We use two different sets of features based on two different sentiment analysis dictionaries. Three classification algorithms (Random forest, Naive Bayes and AdaBoost) are used to find the best model for predicting changes in Return on Assets (ROA) from one quarter to the next. Our experiments show that tweets can predict, with an accuracy of up to 86.17%, whether a company will over-perform or under-perform in the upcoming quarter of the year. However, more research on other companies is needed to find the best achievable prediction accuracy.
8 Future work
In this thesis, the sentiment of tweets and the changes in ROA from one quarter to the next have been used to predict the financial performance of a company. Changes in ROA are not the only measure of a company's financial performance. Other metrics, such as internal rate of return (IRR), cash-flow return on investment (CFROI), discounted cash flow (DCF) and return on equity (ROE), could also be used, and it would be interesting to investigate whether these metrics can be predicted as well.
We focused on Twitter in this work, but there are many other online forums and social media platforms that may have a greater effect on a company's performance, or that may reflect the opinions of a company's users better than Twitter does. A direction for future work would be to investigate other forms of social media and how well they can predict the performance of a company.
In this work we used a bag-of-words method to detect the sentiment of a text. There are many other sentiment analysis methods that could be used to find the sentiment of a text.
In this work the features that we considered consist of word counts only.
There might be many other factors that are important in predicting the
performance of a company. An obvious direction for future work is to extend
the set of features and to do more experiments on different data and on
different companies.
References
[1] Johan Bollen, Huina Mao, Xiaojun Zeng (2011). Twitter mood predicts the stock market. Journal of Computational Science 2, 1–8.
[2] Tristan Fletcher, Fabian Redpath and Joe D'Alessandro (2009). Machine Learning in FX Carry Basket Prediction. Proceedings of the International Conference of Financial Engineering, vol. 2, pages 1371-1375.
[3] Michael T. Lash and Kang Zhao (2016). Early Predictions of Movie Success: the Who, What, and When of Profitability. Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI).
[4] Harald Schoen, Daniel Gayo-Avello, P. Takis Metaxas, Eni Mustafaraj, Markus Strohmaier (2013). The Power of Prediction with Social Media. Computer Science Faculty Scholarship, Wellesley College.
[5] Sheng Yu and Subhash Kak (2012). A Survey of Prediction Using Social Media. Department of Computer Science, Oklahoma State University.
[6] Statistics Portal. http://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/
[7] Reza Zafarani, Mohammad Ali Abbasi, Huan Liu (2014). Social Media Mining. Cambridge University Press.
[8] Marta Zembik (2014). Social media as a source of knowledge for customers and enterprises. Online Journal of Applied Knowledge Management, Volume 2, Issue 2.
[9] Tim Loughran and Bill McDonald (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance, Vol. LXVI, No. 1.
[10] Alexander Pak, Patrick Paroubek. Twitter as a Corpus for Sentiment Analysis and Opinion Mining. In LREC, Vol. 10, pp. 1320–1326.
[11] Mittal and Goel (2012). Stock Prediction Using Twitter Sentiment Analysis. Project report.
[12] Sahar Nassirpour, Parnian Zargham, Reza Nasiri Mahalati (2012). Electronic Devices Sales Prediction Using Social Media Sentiment Analysis. Project report, Stanford University.
[13] Opinion Finder. http://mpqa.cs.pitt.edu/opinionfinder/
[14] Harvard IV-4 dictionary. http://www.wjh.harvard.edu/~inquirer/homecat.htm
[15] Definition of '10-K'. http://www.investopedia.com/terms/1/10-k.asp
[16] TreeTagger. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
[17] Douglas M. McNair, Maurice Lorr, and Leo F. Droppleman (1971). Manual for the Profile of Mood States. San Diego, CA: Educational and Industrial Testing Service.
[18] Walaa Medhat, Ahmed Hassan, Hoda Korashy (2014). Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal.
[19] C. W. J. Granger (1969). Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica, Vol. 37, No. 3 (Aug., 1969), pp. 424-438.
[20] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, Christopher D. Manning (2011). Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151-161.
[21] Jan A. Eklöf, Peter Hackl, Anders Westlund (2009). On measuring interactions between customer satisfaction and financial results. Total Quality Management, pages 514-522.
[22] Return on Assets. http://www.investopedia.com/terms/r/returnonassets.asp
[23] Eugene W. Anderson, Claes Fornell, Roland T. Rust (1997). Customer Satisfaction, Productivity, and Profitability: Differences Between Goods and Services. Marketing Science, pages 129-145.
[24] Dan Zarrella (2009). The social media marketing book. O'Reilly Media, Inc.
[25] Andreas M. Kaplan, Michael Haenlein (2009). Users of the world, unite! The challenges and opportunities of Social Media. ESCP Europe, 79 Avenue de la République, F-75011 Paris, France.
[26] Weka Data Mining. http://weka.wikispaces.com/
[27] Subhabrata Mukherjee (2012). Sentiment analysis. Indian Institute of Technology, Bombay, Department of Computer Science and Engineering.
[28] Bing Liu (2012). Sentiment analysis and opinion mining. Morgan & Claypool Publishers.
[29] John Hagel III, John Seely Brown and Lang Davison (2010). The Best Way to Measure Company Performance. https://hbr.org/2010/03/the-best-way-to-measure-compan
[30] Karina Gibert, Miquel Sànchez-Marrè, Víctor Codina (2010). Principles of Accounting. International Environmental Modelling and Software Society (iEMSs).
[31] Belverd E. Needles, Marian Powers, Susan V. (2014). Principles of Accounting. South-Western Cengage Learning.
[32] Yoav Freund, Robert E. Schapire (1996). Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the Thirteenth International Conference.
[33] Stuart Russell, Peter Norvig (2003). Artificial Intelligence: A Modern Approach. Prentice Hall. ISBN 978-0137903955.
[34] Carol McDonald (2015). Parallel and Iterative Processing for Machine Learning Recommendations with Spark. https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark
[35] Lior Rokach, Oded Maimon (2008). Data mining with decision trees: theory and applications. World Scientific Pub Co Inc. ISBN 978-9812771711.
[36] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W. Philip Kegelmeyer (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, pages 321-357.
[37] Linguistic Inquiry and Word Count. http://liwc.wpengine.com/
[38] 2014 Master Dictionary. http://www3.nd.edu/~mcdonald/Word_Lists.html
[39] Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka/index.html
[40] Stephen V. Stehman (1997). Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of Environment, pages 77-89.
[41] Beautiful Soup Documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
[42] Sylvain Arlot (2004). A survey of cross-validation procedures for model selection. Journal of Machine Learning Research, pages 1089-1105.
[43] Release 4 of the 12dicts word lists. http://wordlist.aspell.net/12dicts-readme-r4/
[44] BMW Quarterly Reports. https://www.bmwgroup.com/en/investor-relations/financial-reports.html
[45] Volkswagen Quarterly Reports. http://quicktake.morningstar.com/stocknet/secdocuments.aspx?symbol=vlkay
[46] SCOWL (And Friends) wordlist. http://wordlist.aspell.net/
[47] Class Randomize. http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/instance/Randomize.html
[48] Shareholders' Equity. http://www.investopedia.com/terms/s/shareholdersequity.asp
[49] Selenium with Python. http://selenium-python.readthedocs.io/