FULLTEXT01 Uppsala Uni
Degree project, 30 credits
August 2016
Sepehr Forouzani
Master Programme in Computer Science
Abstract
Using social media and machine learning to predict
financial performance of a company
Sepehr Forouzani
Homepage:
http://www.teknat.uu.se/student
2 Related Work
3 Background theory
3.1 Social media
3.2 Sentiment analysis
3.3 Financial performance
3.4 Data collection
3.4.1 Feature Vectors
3.5 Machine learning
3.5.1 Classification Algorithms
3.5.2 Data balancing
3.5.3 Feature selection
4 Implementation
4.1 Financial Performance Predictor design
4.2 Financial Performance Predictor Implementation
4.2.1 Collecting data
4.2.2 Feature vectors creation
5.4 Weka
5.5 Experiments
5.5.1 Experiments with the regular dictionary
5.5.2 Experiments using the financial dictionary
6 Discussion
7 Conclusion
8 Future work
List of Figures
1 The methodology
2 Sentiment Analysis methods [18]
3 Machine learning workflow [34]
4 Steps toward financial prediction
5 The format of a feature vector
List of Tables
1 The datasets used in the experiments
2 The companies' performance based on the ROA
3 The two different dictionaries and some example words
4 Confusion matrix
5 The results for experiment 1 using the TW_BMW dataset
6 The results for experiment 2 using the TW_BMW dataset
7 The results for experiment 3 using the TW_BMW dataset
8 The results for experiment 3 using the TW_VW dataset
9 The results for experiment 4 using the TW_BMW dataset
10 The results for experiment 4 using the TW_VW dataset
1 Introduction
Nowadays media, and in particular social media, is considered a major data
source by researchers, because a large number of people use it to communicate
and share their ideas, feelings, knowledge, and personal opinions about various
topics at any time. During the last ten years, Twitter and Facebook have
emerged as the most popular social networking websites. Facebook has
1.59 billion monthly users and Twitter has 332 million active users [6].
Data from social media gives social scientists, economists, and statisticians
a unique opportunity to understand individuals and human behavioral patterns
that affect different areas such as finance [4]. For example, recent research
on financial performance prediction using opinion and sentiment analysis of
posts shared on social media indicates that it may be possible to predict a
company's stock value [5].
The data available on social media is enormous and unstructured, and it
contains a lot of irrelevant information; it is therefore impossible for
individuals to read and analyze all of the data manually. To make the best
use of data from social media, statistical and data mining techniques need to
be applied [7].
Customers' opinions about products and services are always a concern for
most large and medium-sized companies. Social media is one of the most
widely used sources of data about customers' opinions toward a certain
company [8]. Most companies use different methods and techniques to find out
what customers think about their services and products. However, relating the
customer opinions extracted from social media to related aspects of a company,
such as productivity, profitability, financial performance and economics, is
not always possible [21]. For example, if a firm improves productivity by
downsizing, profitability might be endangered if customer satisfaction depends
on the company's services [23]. Research [1] has shown that there is a
relation between opinion and sentiment about a company and its stock price.
However, to the best of our knowledge, there are no studies that focus on
investigating the relation between the sentiment of tweets and the financial
performance of companies.
In this master thesis we will investigate the correlation between the sen-
timent of tweets where a certain company is mentioned in a hashtag and the
financial performance of that company.
1.1 Objectives
The overall objective of this thesis project is to investigate the relation
between sentiment extracted from social media and the financial performance
of automotive companies. The goal is to predict the financial performance of
a company based on what people write about the company on Twitter. This
results in the following more specific objectives:
• Use machine learning and train a model to predict the financial
performance of a company
1.2 Method
The work in this thesis is done through five steps, as illustrated in Figure 1.
Figure 1: The methodology
In the first step, the problem and the objectives of the research are defined.
In the second step, a literature review is done. The literature study focuses
on reviewing related work as well as gaining knowledge about the techniques
that will be used in the project.
In the third step, the experiment setups and configurations are designed
and data is collected.
In the fourth step, a prototype tool is developed in order to collect, prepare
and analyze data. The analysis is based on mood and sentiment word lists.
For the machine learning components in this project the Weka data mining
tool [39] is used. In the fifth step, the results are evaluated by measuring the
accuracy of the performance prediction.
2 Related Work
In this chapter, work related to sentiment analysis methods and to financial
prediction using mood and sentiment analysis is reviewed.
In [1] the authors collect public tweets posted by approximately 2.7
million users. Each tweet has an identifier, a publishing time, a submission
type and a text of at most 140 characters. To make the data suitable for
analysis, stop-words (topic-independent words that are most common in a
language) and punctuation are removed, and the text is then filtered by
expressions such as "I feel", "i am feeling", "I'm", "Im", "I am", and
"makes me", because those expressions state their author's mood. In the next
stage they use the OpinionFinder (OF) tool [13] for sentiment analysis. To
measure the polarity of a text in terms of being positive or negative, OF
takes a text (e.g. a large number of tweets) and uses the OF lexicon to
determine the proportion of positive versus negative sentiment in the text.
To measure the mood of a text they use an algorithm called Google-Profile of
Mood States (GPOMS). GPOMS measures the mood of a text along six different
dimensions: calm, alert, sure, vital, kind, and happy.
To normalize the time series and make the OF and GPOMS results comparable,
the authors of [1] use the z-score, a statistical measure based on the local
mean and standard deviation. The authors also use the econometric technique
of Granger causality analysis [19] to investigate the relation between public
mood and changes in the closing value of the stock market. The Granger
causality analysis indicates that there is a predictive relation between
certain mood categories and the closing price of the stock market.
In [3] the authors used machine learning and social media to predict
how successful a movie will be. To measure the success of a movie the
authors used return on investment (ROI), which is a profitability metric, and
they applied binary and multi-class classification algorithms such as support
vector machines (SVM), multilayer perceptron (MLP), decision trees (J48),
random forest and the LogitBoost algorithm to predict the success. The results
show that random forest was the best classifier, with an accuracy of almost
84%.
In [12] the authors investigate the possibility of predicting market sales
of electronic devices using social media. They analyze the sentiment of
Twitter comments about a certain product before the product is released,
using semi-supervised recursive autoencoders to predict the sentiment
distribution. A semi-supervised recursive autoencoder is an artificial neural
network whose goal is to learn an encoding of a set of data, typically for
the purpose of dimensionality reduction. In sentiment analysis,
semi-supervised recursive autoencoders are used to learn semantic vector
representations of phrases [20]. After running the sentiment analysis, the
total number of comments, the number of positive comments, the total number
of re-tweeted comments and the number of re-tweeted positive comments are
extracted and used as features in their model. In the experiments their model
showed 35% accuracy in predicting iPad3 sales, while linear regression showed
58% accuracy on the same task, which is still too low for use as a practical
model.
In [2] the authors use Artificial Neural Networks (ANN), Support
Vector Machines (SVM) and Relevance Vector Machines (RVM) to predict
daily returns for an FX carry basket. A currency basket is a portfolio of
selected currencies with different weightings. An FX carry basket, which
consists of a long position in high-yielding currencies against a short
position in low-yielding ones, is a common asset for fund managers and
speculative traders. It was found that, in general, the committee of networks
was much more effective at predicting five-day returns than one-day returns,
and it was on this basis that the optimal configuration was used.
In [9] it is stated that general-purpose word lists for measuring the
sentiment of a text are not accurate enough for measuring the sentiment
of finance-related texts. To illustrate this, the authors of [9] reviewed
the negative words extracted from 10-K reports (annual reports that
contain a summary of a company's financial performance [15]) based on the
Harvard dictionary [14] and found that almost seventy-five percent of the
words counted as negative are not negative in a financial context. They
therefore developed a new word dictionary which reflects the tone of
financial texts with higher accuracy. The authors used a bag-of-words
approach (treating a text as a bag of its words, regardless of grammar and
word order) to produce vectors of words and word counts, and modified one of
the most common term weighting schemes to adjust it for document length.
In [10] the authors develop an automated method for sentiment
classification. They use a classifier based on multinomial Naive Bayes to
determine whether the sentiment of a document is positive, negative or
neutral. They also propose a technique that can be used to determine the
sentiment of documents in any language. In their method, TreeTagger [16]
(a language-independent part-of-speech tagger) is used for part-of-speech
tagging, and the differences in the distributions of tags between positive,
negative and neutral texts are observed. For feature extraction they used
n-grams as binary features and the frequency of keywords. Unigrams, bigrams,
and trigrams are used in the experiments, and the authors state that
performance is best when bigrams are used.
In [11] four classes of mood (calm, happy, alert and kind) are used, and
a text is categorized into these four classes using an analysis tool. The tool
uses a word list based on the Profile of Mood States (POMS) questionnaire
[17], where the different POMS states are mapped onto the four mood states
using static correlation rules. The authors also filtered a set of tweets
down to emotion-specific texts using expressions such as "feel", "makes me",
"I'm", and "I am". In this work the authors use a new cross-validation method
called k-fold sequential cross validation to train the model, and the model
showed 75.56% accuracy in predicting stock market movements. They tried four
different learning algorithms to learn and study the correlation between mood
and market: linear regression, logistic regression, support vector machines
(SVMs), and self-organizing fuzzy neural networks (SOFNN). The conclusion is
that SOFNN performed better than the other algorithms.
3 Background theory
3.1 Social media
Tools and platforms that enable users to interact and exchange information
in different forms, such as text, pictures and video, are called social
media [24]. There are a number of different types of social media, for
example blogs, discussion boards and networking platforms such as Facebook and
Twitter. Twitter is one of the most popular social media services. It enables
users to publish and share texts of at most 140 characters, called tweets,
and to use hashtags ("#") to relate their tweets to a specific topic, person
or company. Several companies and business strategists consider social media
an important arena, and they are constantly trying to find ways to increase
their profitability using social media [25].
3.2 Sentiment analysis
3.3 Financial performance
Financial analysts and investors most often focus on return on
equity (ROE) as the primary metric for measuring a company's performance.
Many executives focus heavily on this metric as well, believing that it is the
one that gets the most attention from the investor community. ROE
is calculated by dividing the net income by the shareholders' equity.
\[
\text{Return on Equity} = \frac{\text{Net Income}}{\text{Shareholders' Equity}} \tag{1}
\]
\[
\text{Return on Assets} = \frac{\text{Net Income}}{\text{Total Assets}} \tag{2}
\]
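As a small illustration of equations (1) and (2), the two ratios can be computed as follows. This is only a sketch: the function names and the figures are hypothetical and are not taken from the reports used in this thesis.

```python
def return_on_equity(net_income: float, shareholders_equity: float) -> float:
    """Equation (1): net income divided by shareholders' equity."""
    if shareholders_equity == 0:
        raise ValueError("shareholders' equity must be non-zero")
    return net_income / shareholders_equity


def return_on_assets(net_income: float, total_assets: float) -> float:
    """Equation (2): net income divided by total assets."""
    if total_assets == 0:
        raise ValueError("total assets must be non-zero")
    return net_income / total_assets


# Hypothetical figures (in millions), chosen only to illustrate the arithmetic.
roe = return_on_equity(net_income=50.0, shareholders_equity=400.0)
roa = return_on_assets(net_income=50.0, total_assets=1000.0)
print(f"ROE = {roe:.3f}, ROA = {roa:.3f}")
```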
3.4 Data collection
Data collection and dataset creation are the first steps in creating
a statistical model using machine learning. The dataset is commonly divided
into three subsets: a training set, a validation set and a test set. The
training set is used to train the statistical model, the validation set is
used to estimate how well the model is trained and the test set is used to
measure the performance of the model.
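The three-way division described above could be sketched as follows. The 60/20/20 proportions, the fixed seed and the function name are assumptions made for illustration; the thesis does not prescribe them.

```python
import random


def split_dataset(instances, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the instances and cut them into train, validation and test sets."""
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Shuffling before cutting avoids a split that follows the collection order, which matters when the instances are ordered by time or by class.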
Figure 3: Machine learning workflow [34]
In the first step (data ingestion) the data is collected and stored in a
database. After collecting the data, the data is cleaned and/or transformed.
The data is divided into two sets: a training set and a testing set. In the
next step a mathematical model is built based on the training set and then
the model will be tested against the testing set.
To improve the results, the user can decide, after results are produced
from the model, to create or choose different data and feature vectors (data
presentation style).
There are three categories of machine learning, based on the nature of the
learning.
acts with an environment to achieve the goal without any help from a
teacher.
does not affect the probability of others). Naive Bayes computes the
probability p(c|x) that an instance represented by the feature vector
x = (x_1, ..., x_n) belongs to the class c. Using Bayes' theorem, this
conditional probability can be written as:
\[
p(c \mid x) = \frac{p(c)\, p(x \mid c)}{p(x)} \tag{3}
\]
Naive Bayes is useful when the time needed to train the model is important.
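Equation (3) can be made concrete with a small numerical sketch. Under the independence assumption, p(x|c) is the product of the per-feature likelihoods, and p(x) is obtained by summing p(c)p(x|c) over the classes. The class names, the two binary features and all probabilities are hypothetical and serve only to show the arithmetic.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods, x):
    """Return p(c|x) for every class c using Bayes' theorem.

    priors:      {class: p(c)}
    likelihoods: {class: [{value: p(x_i = value | c)}, ...]}, one dict per feature
    x:           tuple of observed feature values
    """
    # Numerator of equation (3): p(c) * p(x|c), with p(x|c) factorised per feature.
    joint = {
        c: priors[c] * prod(likelihoods[c][i][value] for i, value in enumerate(x))
        for c in priors
    }
    evidence = sum(joint.values())  # denominator p(x)
    return {c: joint[c] / evidence for c in joint}


# Hypothetical model: feature 0 = "contains a positive word",
# feature 1 = "contains a negative word" (1 = yes, 0 = no).
priors = {"over": 0.5, "under": 0.5}
likelihoods = {
    "over": [{1: 0.8, 0: 0.2}, {1: 0.3, 0: 0.7}],
    "under": [{1: 0.4, 0: 0.6}, {1: 0.6, 0: 0.4}],
}
posterior = naive_bayes_posterior(priors, likelihoods, x=(1, 0))
print(posterior)
```

For x = (1, 0) the numerators are 0.5 · 0.8 · 0.7 = 0.28 and 0.5 · 0.4 · 0.4 = 0.08, so the posterior of the class "over" is 0.28 / 0.36.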
AdaBoost [32] stands for adaptive boosting, and it assumes that finding
many weak models is easier than finding one accurate model. Boosting is an
approach to creating prediction rules with high accuracy by combining weak
models and rules that have low prediction accuracy. AdaBoost generates a
sequence of weak classifiers (base models) and then decides a final estimate
of the target variable by aggregating the estimates of the base models.
Similar to the random forest algorithm, AdaBoost also has a variable
importance estimation, but it works in a different way: in AdaBoost the more
informative variables are used more often, and the less informative variables
are barely used.
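The boosting idea can be sketched with one-dimensional threshold rules (decision stumps) as weak classifiers. This is a minimal illustration and not the Weka implementation used in the experiments: the choice of stumps, the labels in {-1, +1} instead of {0, 1}, the toy data and all function names are assumptions.

```python
from math import exp, log


def stump_predict(x, threshold, polarity):
    """Weak rule: predict `polarity` if x >= threshold, otherwise the opposite."""
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Train `rounds` stumps, re-weighting the instances after each one."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * log((1 - error) / error)     # vote of this weak classifier
        ensemble.append((alpha, threshold, polarity))
        # Misclassified instances gain weight, correctly classified ones lose weight.
        weights = [w * exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Aggregate the weighted votes of all weak classifiers."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

Each single stump misclassifies at least two of the six instances, but the weighted vote of three stumps classifies all of them correctly, which is the point of combining weak models.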
Cross validation [42] creates a training set and a test set by partitioning
the original data, with the goal of training and evaluating the model. In
k-fold cross validation the original data is divided into k subsamples.
One subsample is selected as the test set and the remaining k − 1
subsamples are used as the training set for the model. The same process is
repeated k times (folds), so that each subsample is used exactly once as the
test set, and the results are then averaged or combined to produce a single
estimate.
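The k-fold procedure can be sketched as an index generator plus an averaging loop. The round-robin assignment of instances to folds and the function names are assumptions for illustration; Weka's own partitioning may differ.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    # Deal the indices round-robin into k folds of near-equal size.
    folds = [list(range(i, n_instances, k)) for i in range(k)]
    for held_out, test_indices in enumerate(folds):
        train_indices = [idx for f, fold in enumerate(folds)
                         if f != held_out for idx in fold]
        yield train_indices, test_indices


def cross_validate(n_instances, k, evaluate):
    """Average the score of evaluate(train_indices, test_indices) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_instances, k)]
    return sum(scores) / len(scores)


# 27 instances with k = 10, the setting used when a model is evaluated
# on 27 instances with 10-fold cross validation.
splits = list(k_fold_splits(27, 10))
print([len(test) for _, test in splits])
```

Here `evaluate` stands for any function that trains a model on the training indices and returns its score on the test indices.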
3.5.2 Data balancing
where:
T is the set of training examples,
a is the index of a feature,
H() is the entropy function (entropy is a measure of the randomness of a
variable; it measures the level of impurity in a group of examples).
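To make the role of the entropy function concrete, the following sketch computes the entropy of a set of class labels and the information gain of a single feature. It assumes the standard definition of information gain, IG(T, a) = H(T) − H(T | a); the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(T): impurity of a group of examples, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(T) minus the weighted entropy of the groups induced by the feature."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    conditional = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - conditional


labels = [1, 1, 0, 0]
print(information_gain([1, 1, 0, 0], labels))  # feature separates the classes perfectly
print(information_gain([1, 0, 1, 0], labels))  # feature carries no information
```

A feature with high information gain splits the examples into purer groups, which is why features can be ranked by this value during feature selection.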
4 Implementation
In this chapter the design and implementation of the financial performance
predictor (FPP) is described.
Figure 4: Steps toward financial prediction
The first step is to collect relevant data; in this thesis we use data from
Twitter. To detect the sentiment of a tweet or a group of tweets, we use the
bag-of-words method. The bag-of-words method focuses on the words, or in some
cases sets of words (a string of words), regardless of the context of the
sentence. We use a list of words from a dictionary in which every word is
attached to a sentiment; the words are either positive or negative. In the
experiments we have used two different dictionaries, one that is developed
for financial purposes and one that is more general. The second step is to
count the number of occurrences in the extracted tweets of each word present
in the dictionaries. The result is combined with the ROA for the corresponding
time period and included in the feature vectors. In the fourth step, machine
learning algorithms are applied to the feature vectors to train a model
that predicts whether the ROA increases or decreases based on the sentiment
of the tweets. The classification algorithms that we have used to train the
model are Random Forest, Naive Bayes and AdaBoost.
In this thesis a program for creating feature vectors is written in Java. The
program uses the word dictionaries and counts the number of occurrences of
each dictionary word in the tweets. The result is stored in a vector. The
format of a feature vector is shown in Figure 5.
Figure 5: The format of a feature vector.
The class variable is the company's performance. The value of the class
variable is 1 in case of over-performance and 0 in case of under-performance.
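The construction of a feature vector can be sketched as follows. This is a Python illustration and not the Java program written for the thesis; the tokenisation rule, the function name, the four dictionary words and the two tweets are assumptions.

```python
import re
from collections import Counter


def build_feature_vector(tweets, dictionary_words, over_performed):
    """Count each dictionary word over all tweets and append the class variable."""
    token_counts = Counter()
    for tweet in tweets:
        # Assumed tokenisation: lower-case runs of letters and apostrophes.
        token_counts.update(re.findall(r"[a-z']+", tweet.lower()))
    counts = [token_counts[word] for word in dictionary_words]
    return counts + [1 if over_performed else 0]  # 1 = over-performance, 0 = under-performance


# Hypothetical dictionary excerpt and tweets.
dictionary_words = ["excellent", "good", "bad", "loss"]
tweets = ["Excellent balance, good handling", "Bad quarter, bad outlook"]
vector = build_feature_vector(tweets, dictionary_words, over_performed=True)
print(vector)
```

The order of `dictionary_words` fixes the position of every feature, so all feature vectors built from the same dictionary are comparable.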
5.1 Dataset
Two datasets are used for the experiments. The first dataset, denoted
TW_BMW, contains tweets where BMW is either mentioned or used in a
hashtag (#BMW). The second dataset, denoted TW_VW, contains tweets where
Volkswagen is either mentioned or used in a hashtag (#Volkswagen). The
two datasets are described in Table 1.
An example of a positive tweet from the same dataset is:
"Track drive reveals excellent balance of the 2015 BMW 228i - Torque
News http://bit.ly/1xk4xj7 - #BMW"
An example of a neutral tweet (neither positive nor negative) from the
same dataset is:
"mclaren should come back later in the race when ferrari and bmw have
to use the hard tyres hopefully, anyway"
The sentiment of each tweet is determined by counting the occurrences
of positive and negative words. If a tweet contains more positive words than
negative words, the sentiment is considered positive; if it contains more
negative words than positive words, the sentiment is considered negative. If a
tweet contains the same number of positive and negative words, the sentiment
is considered to be neutral.
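The counting rule above can be sketched in a few lines. The word lists are small hypothetical excerpts, not the dictionaries used in the experiments, and the tokenisation rule is an assumption.

```python
import re


def tweet_sentiment(tweet, positive_words, negative_words):
    """Label a tweet by comparing its counts of positive and negative words."""
    tokens = re.findall(r"[a-z']+", tweet.lower())  # assumed tokenisation
    positive = sum(token in positive_words for token in tokens)
    negative = sum(token in negative_words for token in tokens)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"  # equal counts, including no sentiment words at all


# Hypothetical word lists for illustration only.
positive_words = {"excellent", "good"}
negative_words = {"bad", "poor"}

print(tweet_sentiment("Track drive reveals excellent balance of the 2015 BMW 228i",
                      positive_words, negative_words))
print(tweet_sentiment("mclaren should come back later in the race",
                      positive_words, negative_words))
```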
To obtain the value of return on assets (ROA) for each quarter, BMW
quarterly reports (10-Q reports) are downloaded from [44] and Volkswagen
quarterly reports are downloaded from [45]. The value of ROA is not explicitly
mentioned in the quarterly reports, and it is therefore calculated manually
using the total income and the total assets. Table 2 shows the performance
of BMW and Volkswagen in different quarters of the year.
5.3 Dictionaries
Table 2: The companies' performance based on the ROA.
(LIWC) [37]. The second dictionary (called the financial dictionary) is the
Loughran-McDonald master dictionary [38]. The Loughran-McDonald master
dictionary is an extension of the 2of12inf word list that adds words
appearing in companies' annual reports. 2of12inf is a word list from SCOWL
(Spell Checker Oriented Word Lists) and Friends, consisting of English words
that are useful for creating high-quality word lists for spell checkers [43].
Table 3 shows some sample words from the two different dictionaries we
have used.
5.4 Weka
All experiments are done using Weka [39]. Weka provides a collection of data
mining algorithms, predictive modeling and visualization tools, and a
graphical user interface for easy access to its functions.
Three different classification algorithms are used in our experiments:
Random Forest, Naive Bayes and AdaBoost. The Information Gain feature
selection method is used for the Naive Bayes classifier. For data balancing,
the SMOTE algorithm [36] and the Weka Randomize filter are used. The default
settings for each algorithm in Weka are:
• AdaBoost: Number of Iterations = 10, Seed = 1, Weight Threshold = 100.
5.5 Experiments
                    Predicted class
                    Negative           Positive
Actual   Negative   True Neg. (TN)     False Pos. (FP)
class    Positive   False Neg. (FN)    True Pos. (TP)
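Accuracy is defined as:
\[
\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}
\]
recall as:
\[
\text{recall} = \frac{TP}{TP + FN}
\]
and F-score (to measure a test's accuracy) as:
\[
\text{F-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]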
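The measures can be computed directly from the four cells of the confusion matrix. In this sketch precision is taken to be TP / (TP + FP), its standard definition, which is an assumption here; the counts in the example are hypothetical and are not results from the experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0  # standard definition (assumed)
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0  # no true positives at all
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Hypothetical counts for illustration.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```

With these counts the accuracy is 85/100, the precision 40/50, the recall 40/45 and the F-score 80/95.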
Table 5: The results for experiment 1 using the TW_BMW dataset.
using the changes of sentiment from one quarter to another. The words in
the regular dictionary are used as features together with a variable
representing the total sentiment of the tweets and a variable that indicates
whether the company was over-performing or under-performing during a specific
quarter of the year. In the experiment, a model was trained and evaluated on
27 instances using 10-fold cross validation.
of feature vectors are assigned based on their published time. The Y variable
(value to be predicted) is zero if the company is under-performing and one
if the company is over-performing.
In this experiment the data is balanced using the SMOTE algorithm and the
randomize algorithm [47]. The randomize algorithm randomly shuffles the
order of the instances passed through it and is used to prevent over-fitting.
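The idea behind SMOTE can be sketched as follows: a synthetic minority instance is placed at a random position on the line segment between a minority instance and one of its nearest minority neighbours. This is a simplified illustration and not the Weka filter used in the experiments; the random choice of the base instance, the parameter defaults, the function name and the toy data are assumptions.

```python
import random


def smote(minority, n_synthetic, k=5, seed=1):
    """Create n_synthetic new minority instances by interpolation."""
    if len(minority) < 2:
        raise ValueError("at least two minority instances are needed")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base instance (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position between base (0) and neighbour (1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class with three instances and two features.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote(minority, n_synthetic=4, k=2)
print(synthetic)
```

Because the new instances are interpolated rather than copied, the minority class grows without exact duplicates of existing feature vectors.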
5.5.2 Experiments using the financial dictionary
Experiment 4: One feature vector per 100 tweets. In the fourth experiment
one feature vector is created per 100 tweets, and the Y variables of the
feature vectors are assigned based on their published time.
In this experiment, SMOTE and the randomize algorithm are used to balance
the data instances.
6 Discussion
In the first experiment one feature vector was created for each quarter of the
year, which means 27 data instances in total. The low number of data instances
can be one of the reasons that the accuracy is lower compared to the other
experiments. In the second experiment, instead of counting words and using
the counts as features, the differences in word counts from the previous
quarter are used; the prediction accuracy dropped for the random forest
algorithm, while it showed a small improvement for the other classifiers. A
reason for the low accuracy of the random forest classifier could be that the
sentiment in a feature vector should not be expressed in relation to other
feature vectors. In the third and fourth experiments, one feature vector is
created per 100 tweets and the datasets are balanced, and the prediction
accuracy improves. This could be due to the balanced number of instances.
Across all of the experiments except experiment 2, the most accurate
classifier was random forest. Its best result came from the third
experiment, with 86.17% accuracy when 100 tweets from the TW_VW dataset were
combined into one feature vector and the words of the regular dictionary were
used as features.
The best results were obtained when using random forest. Random forest
ranks the variables in the feature vector, and also takes the relations
between variables into account while splitting nodes, in order to produce
higher accuracy. The data used to train the random forest classifier was
balanced, and therefore a more accurate classification model could be
produced.
7 Conclusion
Customers' opinions about products and services are always a concern for most
large and medium-sized companies because they affect the company's
financial performance. Social media is one of the most widely used sources of
data about customers' opinions toward a certain company. We have presented
a machine learning approach toward predicting two companies' financial
performance using tweets from Twitter that are related to them. We use two
different sets of features based on two different sentiment analysis
dictionaries. Three different classification algorithms (Random Forest, Naive
Bayes and AdaBoost) are used to find the best model to predict changes in
Return on Assets (ROA) from one quarter to the next. Our experiments show
that tweets can predict, with an accuracy of 86.17%, whether a company will
over-perform or under-perform in the upcoming quarter of the year. However,
more research on various companies needs to be done in order to establish
what prediction accuracy can be achieved.
8 Future work
In this thesis, the sentiment of tweets and changes in ROA from one quarter
of a year to another have been used to predict the financial performance of
a company. Changes in ROA are not the only indicator of the financial
performance of a company. There are many other variables and metrics, such
as internal rate of return (IRR), cash-flow return on investment (CFROI),
discounted cash flow (DCF) and return on equity (ROE), that could also be
used, and it would be interesting to investigate the possibility of
predicting these metrics as well.
We focused on Twitter in this work, but there are many other online
forums and social media that may have more effect on companies' performance,
or reflect the opinions of a company's users better than Twitter. A
direction for future work would be to investigate other forms of social media
and how well they can predict the performance of a company.
In this work we used a bag-of-words method to detect the sentiment of a
text. There are many other sentiment analysis methods that can be used to
find the sentiment of a text.
In this work the features that we considered consist of word counts only.
There might be many other factors that are important in predicting the
performance of a company. An obvious direction for future work is to extend
the set of features and to do more experiments on different data and on
different companies.
References
[1] Johan Bollen, Huina Mao, Xiaojun Zeng (2011) Twitter mood predicts
the stock market Journal of Computational Science 2, 1–8
[2] Tristan Fletcher, Fabian Redpath and Joe D'Alessandro (2009) Machine
Learning in FX Carry Basket Prediction Proceedings of the International
Conference of Financial Engineering, vol. 2, page 1371-1375.
[3] Michael T. Lash and Kang Zhao (2016). Early Predictions of Movie Suc-
cess: the Who, What, and When of Profitability Artificial Intelligence
(cs.AI); Social and Information Networks (cs.SI).
[5] Sheng Yu and Subhash Kak (2012) A Survey of Prediction Using Social
Media Department of Computer Science, Oklahoma State University.
[7] Reza Zafarani, Mohammad Ali Abbasi, Huan Liu (2014) Social Media
Mining Cambridge University.
[8] Marta Zembik (2014) Social media as a source of knowledge for customers
and enterprises Online Journal of Applied Knowledge Management, Vol-
ume 2, Issue 2
[9] Tim Loughran and Bill McDonald (2011) When Is a Liability Not a Lia-
bility? Textual Analysis, Dictionaries, and 10-Ks The Journal of Finance,
Vol. LXVI, NO. 1
[11] Mittal and Goel (2012). Stock Prediction Using Twitter Sentiment Anal-
ysis Project report.
[12] Sahar Nassirpour, Parnian Zargham, Reza Nasiri Mahalati (2012). Elec-
tronic Devices Sales Prediction Using Social Media Sentiment Analysis
Project report Stanford university.
[15] Definition of '10-K' http://www.investopedia.com/terms/1/10-k.asp
[17] Douglas M. McNair, Maurice Lorr, and Leo F. Droppleman (1971). Man-
ual for the Profile of Mood States San Diego, CA: Educational and In-
dustrial Testing Service.
[18] Walaa Medhat, Ahmed Hassan, Hoda Korashy (2014). Sentiment anal-
ysis algorithms and applications: A survey Ain Shams Engineering Jour-
nal.
[23] Eugene W. Anderson, Claes Fornell, Ronald T. Rust (1997). Customer
Satisfaction, Productivity, and Profitability: Differences Between Goods
and Services Marketing Science Pages 129-145.
[24] Dan Zarrella. (2009). The social media marketing book. O'Reilly Media,
Inc.
[25] Andreas M. Kaplan, Michael Haenlein (2009). Users of the world, unite!
The challenges and opportunities of Social Media ESCP Europe, 79 Avenue
de la République, F-75011 Paris, France.
[28] Bing Liu. (2012). Sentiment analysis and opinion mining. Claypool Pub-
lishers.
[29] John Hagel III, John Seely Brown and Lang Davison. (2010). The
Best Way to Measure Company Performance https://hbr.org/2010/
03/the-best-way-to-measure-compan
[32] Yoav Freund, Robert E. Schapire. (1996). Experiments with a New
Boosting Algorithm Machine Learning: Proceedings of the Thirteenth
International Conference.
[33] Russell Stuart, Norvig Peter. (2003). Artificial Intelligence: A Modern
Approach. Prentice Hall. ISBN 978-0137903955.
[34] Carol McDonald. (2015). Parallel and Iterative Processing for Machine
Learning Recommendations with Spark https://www.mapr.com/blog/
parallel-and-iterative-processing-machine-learning-recommendations-spark
[35] Rokach, Lior; Maimon, O. (2008). Data mining with decision trees: the-
ory and applications. World Scientific Pub Co Inc. ISBN 978-9812771711.
[44] BMW Quarterly Reports https://www.bmwgroup.com/en/
investor-relations/financial-reports.html