Research Paper_Group2

Predicting Stock Market Movements Through Daily News Headlines Sentiment
Analysis: US Stock Market
Yubo Bi1*, Hanting Liu2, Ruiyang Wang3, Shiyou Li4
1 Business School, University of New South Wales Sydney, Sydney, NSW, 2052, Australia
2 School of Information Science and Technology, Xiamen University, Xiamen, 361005, China
3 Nanjing Foreign Language School, Nanjing, 210018, China
4 Shanghai Pinghe School, Shanghai, 201206, China
* Corresponding author. Email: [email protected]
Abstract
This study leverages features extracted from daily world news headlines between August
2008 and June 2016 to predict the next day’s Dow Jones Industrial Average (DJIA) index
movement. Returns of other related indices, commodity price changes and trading volume data
are used to improve the prediction accuracy. The predictive models are based
on three machine learning algorithms: Random Forest, Support Vector Machine and
Naïve Bayes. A majority vote over the predictions of the three models is
also conducted. The results show that: (1) general daily world news headlines are only
weakly correlated with the next day’s DJIA index movement; (2) combining the sentiment
implied in the news with indicators from technical analysis, such as past return or trading
volume, helps to improve the prediction performance; (3) Naïve Bayes shows the
strongest ability among the standalone models to predict the next day’s index change,
while majority voting across the predictive models can further improve the prediction accuracy.
Keywords: Stock Market Prediction, Daily News Headlines, Machine Learning
1 Introduction
Stock market prediction has long been a challenge for investors, researchers and
financial analysts, and increasingly complex financial markets raise both the
sophistication and the amount of effort that this pursuit requires. Nonetheless, the topic
remains a continuous endeavor for stock market investors and financial institutions seeking to
maximize their investment returns. According to the efficient market hypothesis (Fama,
1970), current stock prices in markets with weak form, semi-strong form and
strong form efficiency already reflect past price trends, all public information, and
all information both public and private, respectively (Malkiel, 1989). This implies that in
a financial market with weak form efficiency, analyzing the past trend of stock returns is
meaningless, because history is no indication of the future. Most recent stock price studies
follow one of two methods: fundamental analysis, which predicts stock prices
by analyzing the underlying businesses and forecasting their future performance,
and technical analysis, which predicts future stock prices based on past and present price
trends (Nti, Adekoya and Weyori, 2019; Picasso, Merello, Ma, Oneto and Cambria, 2019).
However, stock price movements are affected not only by past trends or correlation
with the financial markets, but also by news, investor comments, current events and company
announcements. According to Paramanik and Singhal (2020), negative news has a
significant impact on the Indian stock market. According to Moller and Reichmann
(2021), in European Central Bank monetary policy announcements, the sentiment implied in
forward-looking answers to questions during press conferences significantly affects
Euro stock market returns. According to Suleman (2010), positive sentiment derived from
political news has a positive impact on the markets, while negative sentiment
has a negative impact. In addition, publicly listed companies are required by regulatory
institutions to disclose any information that might affect their stock price; such
disclosures are known as company announcements. According to Ratku, Feuerriegel and
Neumann (2014), company announcements on different topics have an impact on the
company's stock price.
Nowadays, the development of the internet influences not only the technical aspects of
computer communications but also society as a whole, including electronic commerce
and community operations (Internet Society, 2021). One of the most significant benefits
offered by the evolution of the internet is information gathering. The internet speeds up the
spread of information through society, which in turn speeds up the
reaction of stock prices. The internet also offers a large quantity of information for
more comprehensive analysis of stock prices. As suggested by behavioral finance,
theoretical methods for predicting stock prices are subject to several sources of noise,
because not all investors are rational. As a result, information from social
networking platforms can offer more effective indicators for predicting stock price movements
(Chen, Cai, Lai and Xie, 2016; Li, Chan, Ou and Ruifeng, 2017). Because of such
hard-to-quantify factors that influence stock prices, researchers have started using machine
learning techniques and sentiment analysis of text to predict stock prices (Khedr et
al., 2017; Houlihan and Creamer, 2019). Moreover, the evolution of Natural Language
Processing makes it possible to investigate the impact of news on stock
market movements (Chahine & Malhotra, 2018), although this powerful technique is still
rarely applied in the field of finance.
In this research, we derive the sentiment and topics implied by each daily news
headline for the purpose of predicting the next day’s DJIA index movement. We also
incorporate the past price trend and trading volume of S&P500, NASDAQ and Gold as
variables to test whether they can improve the prediction results. Finally, we explore
whether a specific standalone model can outperform an aggregate model in stock market
movement prediction.
2 Literature Review
NLP based Sentiment analysis
In the realm of data science, NLP (natural language processing) is a theory-motivated
machine learning technique concerned with transforming between human language
and computer languages. Since the inception of NLP in the 1950s, research has
focused on tasks such as machine translation, information retrieval, text
summarization, information extraction, topic modeling, and more recently, opinion mining
(Paramanik & Singhal, 2020). NLP techniques are essential for understanding the
opinions or emotions implied or expressed in a text, which is a current research focus
in the field of data science (Kanakaraj et al., 2015). Within the scope of NLP, sentiment
analysis, a term used interchangeably with opinion mining throughout this document, is a
widely used method for classifying the subjective statements in a text. It uses NLP to
collect and examine the opinion or sentiment contained in text and
classifies textual data into categories, usually positive, negative, and neutral sentiment
(Khedr and Yaseen, 2017). Kanakaraj and Guddeti (2017) proposed a system that gathers
textual data from the social network Twitter and extracts features from the tweets by
leveraging NLP techniques. To increase the prediction accuracy, WordNet and WordNet
Word Sense Disambiguation synsets are added to the feature vector. Previous studies
of sentiment analysis have all had to deal with the complexity of raw data. Text
preprocessing is a crucial step in sentiment analysis, since it prepares
unstructured data for feature extraction or information retrieval (Gonçalves and
Quaresma, 2005). Numerous previous approaches build on sentiment
analysis techniques to predict stock market trends. Some use press releases
or press conference transcripts to predict the next day's price change, while others
depend on sentiment extracted from social media sites such as Twitter, Reddit, or
Facebook.
News Sentiment Analysis
The study by Khedr et al. (2017) leverages financial news related to stock
markets, company news and financial reports. The textual financial news, along with
features extracted from historical stock prices, is used to predict the future behavior of the
stock market using sentiment analysis. The prediction model uses Naïve Bayes as well as
the k-nearest neighbor (K-NN) technique and achieved an accuracy score of 89.80%.
Cambria and White (2014) find that market participants pick up on any revealed
cues that allow them to draw inferences about the possible future path of monetary policy.
The study is mainly based on word lists containing both positive and negative
words that were deliberately selected for that particular study. Furthermore, based on
grammatical and syntactical cues, the application of VADER allows adjustment of the
sentiment scores of the text. In addition, three heuristics based on social media text are
identified that generalize to intensifiers, contrastive conjunctions, and negations.
A relatively up-to-date and easy-to-adapt approach was therefore proposed. Paramanik
and Singhal (2020) use the NRC word-emotion association lexicon to construct sentiment
indices. They also propose an augmented asymmetric GARCH model
that captures the dominance of two traders’ contradictory sentiments revealed in the text. Costola et
al. (2020) indicated in their study that the variance of the sentiment and the volume of the
news sources for Reuters and MarketWatch are negatively associated with market returns,
suggesting that an increase in uncertainty results in an adverse impact on the stock
market. Shah et al. (2018) developed a dictionary-based sentiment analysis model and a
sentiment analysis dictionary for the financial.
Social Media Sentiment Analysis
Batra and Daudpota (2018) identified a positive relationship between people's
opinions or attitudes and market movement. They perform sentiment analysis on tweets
related to Apple products. The textual data are extracted from StockTwits, and the
sentiment scores are calculated with an SVM algorithm and then categorized as bullish or
bearish, namely positive or negative. The presented model has an accuracy of 76.65% in
stock prediction. Nguyen and Shirai (2015) introduced a new topic model, TSLDA, with a
new feature that captures topics and their sentiments simultaneously.
The accuracy of the model in this study is better than LDA and JST-based methods
by 6.43% and 6.07%, respectively. The results indicate that incorporating
sentiment data from social media has a positive influence on predicting future stock behavior. Sul
et al. (2016) analyzed the relationship between the sentiment in tweets about a specific
firm from users with fewer than 171 followers (the median in the sample) and the stock's
returns on the next trading day, as well as over the next 10 days and 20 days. The findings
suggest a trading strategy that can produce around 15% annual
return.
3 Data
News data and stock data were drawn from the period from August 12, 2008, to June 30,
2016. Figure 1 shows the number of different types of data collected.
Figure 1 Dataset count (This chart represents the aggregate counts of our data
sources. Index return, commodity price change and trading volume data correspond
to 1,986 trading days. The news dataset contains a total of 49,643 news headlines.)
News data:
Roughly 50 thousand daily news headlines, posted between August 12, 2008, and June 30,
2016, were collected from the Reddit World News Channel, the largest forum on Reddit
that contains the latest news articles posted by redditors. Notably, US news and
internal American politics are barred from being posted in this subreddit. Headlines are
ranked by Reddit users’ votes, and only the top 25 headlines are considered for a
single date.
DJIA index data:
Stock data is represented by Dow Jones Industrial Average (DJIA) between August
12, 2008, and June 30, 2016, sourced from Yahoo Finance, a website that provides
financial news, data and commentary including stock quotes, press releases,
financial reports and original content.
Other related data:
Historical price data and trading volume for S&P 500, NASDAQ and Gold are
obtained from the FactSet platform (https://www.factset.com/).
According to Houlihan and Creamer (2019), messages from StockTwits were successfully
used to predict future asset price movements. In this research, we consider more general
news topics that are not related to any specific stock.
General news may affect the trading actions of both rational and irrational investors
(Hirshleifer, 2001), which in turn may make it more promising for predicting stock
market movements. According to Deng et al. (2011), leveraging indicators from both
news sentiment and technical analysis can improve the overall results when
predicting stock price movements for three Japanese companies listed on a US stock
exchange. In this research, however, we consider the US financial market as a
whole and use a longer period, corresponding to 1,986 trading days, rather than the
two years used in previous research.
4 Methodology
The goal of the proposed model is to predict stock market trends by predicting whether the
DJIA index rises or falls. The proposed model combines the analysis of the
most popular headlines from Reddit with historical prices and compares different
approaches to boost the accuracy of the prediction result. To achieve this,
the following steps are carried out, as depicted in Figure 2. 1) First, we gather the text of
daily news headlines from Reddit as well as DJIA index, other related indices and
commodity price data from the aforementioned sources. 2) The text is preprocessed
using the Natural Language Toolkit (NLTK) to filter it for feature extraction. The 25
highest-voted headlines of each day are combined into one text per day. 3) Two different
approaches are used to extract features from the text. 4) Classification algorithms are
applied to construct predictive models. The selected algorithms are
Random Forest (RF), Naive Bayes (NB), Support Vector Machine (SVM), and a Voting
Classifier, which conducts majority voting. 5) Lastly, after the predictive models
are constructed and validated, we compare the results obtained with the two feature
extraction methods.
Figure 2 Process Flowchart (This chart shows the flow of combining the analysis of the
most popular headlines from Reddit and the historical prices and comparing different
approaches to boost the accuracy of the prediction result)
4.1 Dataset Pre-processing
DJIA Index
Since our aim is to classify the movement direction of the overall stock
market, we treat the task as binary classification and label the DJIA index data according to the
movement direction. Figure 3 shows the number of days on which the index decreases and the
number of days on which it increases or remains stable; the two classes are relatively evenly
distributed.
• ‘1’ when the DJIA Adj Close value increased or stayed the same.
• ‘0’ when the DJIA Adj Close value decreased.
Figure 3 Label Count (This chart shows the number of days on which the index decreases and
the number of days on which it increases or remains stable)
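The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's actual code: the function name `label_movements` and the Adj Close values are hypothetical, and the first day is dropped because it has no previous close to compare against.

```python
def label_movements(adj_close):
    """Return one label per day after the first:
    1 if Adj Close rose or stayed the same, 0 if it fell."""
    return [1 if today >= yesterday else 0
            for yesterday, today in zip(adj_close, adj_close[1:])]

# Hypothetical Adj Close values, for illustration only.
adj_close = [11782.35, 11642.47, 11532.96, 11615.93, 11615.93]
labels = label_movements(adj_close)
print(labels)  # [0, 0, 1, 1]
```

Note that an unchanged close is labeled ‘1’, matching the rule that "stayed the same" belongs to the positive class.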
Other Related Index and Commodity price Data Pre-Processing
The classification task is a time series prediction problem, since the movement
of the overall stock market can be affected by the news headlines from the previous day
and even the day before that. Features such as past return and trading volume can have a
similar impact on the stock market. The underlying issue is therefore
that the data gathered on a given day are aligned only with the stock
market movement on that same day. To remove this misalignment, we shift the headlines in
the dataset by one day, so that each day's headlines are used to predict the market trend
of the next day. In addition, the return and trading volume of both DJIA and the other
related indices for the current day, the day before and two days before are calculated
and assigned to t, t-1 and t-2. Pre-processing the related indices and generating suitable
features involves two steps: 1) calculate the percentage return of each trading day; 2) normalize the
trading volume of each trading day.
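The shifting and lagging described above can be illustrated with plain Python lists. This is a sketch under assumptions: the helper names `lagged_rows` and `pair_headlines_with_next_day` are hypothetical, the headline strings are placeholders, and the return series is read off the DJIA columns of Appendix 1.

```python
def lagged_rows(series, n_lags=2):
    """Return one tuple (t, t-1, ..., t-n_lags) for each day with enough history."""
    return [tuple(series[i - k] for k in range(n_lags + 1))
            for i in range(n_lags, len(series))]

def pair_headlines_with_next_day(headlines, labels):
    """Pair the headlines of day d with the movement label of day d + 1."""
    return list(zip(headlines[:-1], labels[1:]))

# DJIA daily returns in percent, reconstructed from Appendix 1 (oldest first).
djia_returns = [2.64, 0.45, -1.18, -0.86, 0.73, 0.42, -1.55]
rows = lagged_rows(djia_returns)
print(rows[0])  # (-1.18, 0.45, 2.64), i.e. Day1 in Appendix 1

# Placeholder headline texts and labels, for illustration only.
pairs = pair_headlines_with_next_day(["h1", "h2", "h3"], [1, 0, 1])
print(pairs)  # [('h1', 0), ('h2', 1)]
```

The last day's headlines have no next-day label and are dropped, as is the first day's label, which has no preceding headlines.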
For generating the rate of return, the formula is shown below.
$$\text{Rate of Return}\,(r) = \frac{\text{Closing Value} - \text{Initial Value}}{\text{Initial Value}} \times 100\% \tag{1}$$
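As a worked example of Equation (1), the short sketch below computes a percentage return. The function name and the input values are hypothetical, and which price serves as the "Initial Value" (for example, the previous day's close) is an assumption left to the caller.

```python
def rate_of_return(initial_value, closing_value):
    """Percentage return as in Equation (1)."""
    return (closing_value - initial_value) / initial_value * 100.0

# Hypothetical prices: a move from 100.00 to 98.82 is a return of about -1.18%.
r = rate_of_return(100.0, 98.82)
print(round(r, 2))  # -1.18
```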
The trading volume data do not follow a Gaussian distribution, which Gaussian Naive Bayes
classification assumes, hence the data should be rescaled so that its magnitude mostly lies
between 0 and 1. We normalized the trading volume data by calculating the percentage change
in trading volume relative to the previous day. The results are further divided by 10 to
mitigate the large volatility of trading volume changes and ensure consistency of scale.
$$\%\ \text{change in Trading Volume} = \frac{\text{Trading Volume}_{t} - \text{Trading Volume}_{t-1}}{\text{Trading Volume}_{t-1}} \Big/ 10 \tag{2}$$
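Equation (2) can be sketched as follows. The function name and the volume figures are hypothetical and serve only to show the arithmetic: the day-over-day relative change is computed first and then divided by 10.

```python
def normalized_volume_change(volumes, divisor=10.0):
    """Relative change in volume versus the previous day, divided by `divisor`,
    following Equation (2). The first day has no predecessor and is dropped."""
    return [(current - previous) / previous / divisor
            for previous, current in zip(volumes, volumes[1:])]

# Hypothetical trading volumes, for illustration only.
volumes = [200.0, 210.0, 189.0]
changes = normalized_volume_change(volumes)
print([round(c, 4) for c in changes])  # [0.005, -0.01]
```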
Textual Data Pre-Processing
Features can be extracted from the combined headline text only
after it has been pre-processed. The textual data undergo the following processes:
1) Text Cleaning: all upper-case letters in the text are converted to lower case.
2) Removing Null Values and Meaningless Letters: the letter ‘b’ at the head of each
text is removed, as are null values.
3) Removing Stop-words (where applicable): using the stop-word list
in the Natural Language Toolkit (NLTK), each word in the text is compared against
the list. Words that carry little meaning in the documents, such
as the, a and of, are removed to reduce the number of features and improve
performance.
4) Lemmatizing: lemmatization with WordNet considers the context, identifies
the base form of each word and converts the word to that base form, while stemming
strips affixes to reduce a word to its root. Compared to stemming,
lemmatization looks at the surrounding text to determine a given word’s part of
speech. Generally speaking, both stemming and lemmatizing are applicable in
data pre-processing. However, stemming may produce
a stem that is not a valid word and therefore cannot be assigned a
sentiment score by VADER. Lemmatizing rather than stemming
proved to be the better choice for pre-processing in this case.
5) Dictionary Check: we checked the news texts against a dictionary to remove any
meaningless or misspelled words.
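Steps 1) to 3) above can be sketched with the Python standard library alone. This is an illustrative approximation, not the study's pipeline: the stop-word set below is a tiny hypothetical stand-in for NLTK's list, the function name `clean_headline` is made up, and lemmatization and the dictionary check (steps 4 and 5) are omitted because they need external resources.

```python
import re

# Tiny illustrative stop-word set; the study uses NLTK's list instead.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "as"}

def clean_headline(raw):
    """Lower-case a headline, strip a leading b'...' wrapper, drop stop-words."""
    if raw is None:                      # null values yield no tokens
        return []
    # Remove the byte-string residue: b"..." or b'...' around the text.
    text = re.sub(r'''^b(["'])(.*)\1$''', r"\2", raw.strip())
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

# Hypothetical raw headline in the b"..." form described in step 2.
tokens = clean_headline('b"Georgia downs two Russian warplanes as countries move to brink of war"')
print(tokens)
# ['georgia', 'downs', 'two', 'russian', 'warplanes', 'countries', 'move', 'brink', 'war']
```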
4.2 Feature Extraction
Two different approaches are applied to extract features from the textual data in the dataset:
the first adopts Count Vectorizer, and the second calculates sentiment scores.
Count Vectorizer
Count Vectorizer transforms a given text into a vector of counts, based on the frequency
of each word that occurs in the text, and it can also pre-process the text
before generating the vector representation. As a result, Count Vectorizer generates a
vectorized matrix in which each unique word is represented by a column and each document by a row.
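To make the count matrix concrete, the sketch below builds one by hand with the standard library. It is a simplified stand-in for what a count vectorizer produces, not scikit-learn's implementation; the function name `count_vectorize` and the two example documents are hypothetical.

```python
from collections import Counter

def count_vectorize(documents):
    """Return (vocabulary, matrix): one column per unique word, one row per document."""
    tokenized = [document.lower().split() for document in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Hypothetical combined daily texts, for illustration only.
docs = ["oil prices fall", "oil prices rise as oil demand grows"]
vocabulary, matrix = count_vectorize(docs)
print(vocabulary)  # ['as', 'demand', 'fall', 'grows', 'oil', 'prices', 'rise']
print(matrix)      # [[0, 0, 1, 0, 1, 1, 0], [1, 1, 0, 1, 2, 1, 1]]
```

With roughly 2,000 daily documents the vocabulary, and therefore the number of columns, becomes large, which is one reason stop-word removal is applied beforehand.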
Calculate Sentiment Scores
We used VADER (Valence Aware Dictionary and sEntiment Reasoner) to
extract features by calculating the sentiment scores implied in the headlines. VADER is a
sentiment analysis toolkit that is specifically attuned to text sourced from
social media. One benefit of VADER is that it indicates both the polarity, namely
positive or negative, and the intensity of emotion. VADER relies on a sentiment
lexicon, a list of lexical features that are labeled according to their
semantic orientation, combined with grammatical and syntactical rules. The sentiment score
of each headline is calculated by summing
the scores of each VADER-dictionary-listed word in the sentence, producing scores for the
positive, negative and neutral categories, and deriving a compound sentiment score.
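The idea of a lexicon-based compound score can be illustrated with a toy example. This is not VADER itself: the lexicon and its valence values are hypothetical, the rule-based adjustments (intensifiers, negations and so on) are left out, and the normalization `s / sqrt(s^2 + alpha)` with `alpha = 15` is an assumption modeled on how VADER's compound score is commonly described.

```python
import math

# Hypothetical word valences, for illustration only (not the VADER lexicon).
TOY_LEXICON = {"war": -2.9, "crisis": -3.1, "peace": 2.5, "win": 2.8}

def compound_score(headline, lexicon=TOY_LEXICON, alpha=15.0):
    """Sum the valences of listed words and squash the sum into (-1, 1)."""
    total = sum(lexicon.get(token, 0.0) for token in headline.lower().split())
    return total / math.sqrt(total * total + alpha)

def daily_average(headlines):
    """Average compound score over one day's headlines (as in Appendix 3)."""
    scores = [compound_score(headline) for headline in headlines]
    return sum(scores) / len(scores)

negative = compound_score("war deepens the crisis")
neutral = compound_score("markets open on monday")
average = daily_average(["war deepens the crisis", "markets open on monday"])
```

A headline with no listed words scores exactly 0, which matches the many 0.00 entries visible in Appendix 3.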
4.3 Proposed Model:
After pre-processing and feature extraction, two prediction models that share
the same classification algorithms were constructed, corresponding to the two
feature extraction approaches. The original dataset was split into train, test, and
validation sets: the train set holds 68% of the data, the test set 17% and the validation
set 15%. The first prediction model uses only Count Vectorizer to transform the text into a
matrix. The second, integrated prediction model combines the sentiment score of each daily
news headline with other features to investigate their influence on whether the stock market
rises or falls.
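The 68/17/15 split can be sketched as below. The text does not state whether the split was chronological or shuffled; the sketch assumes a chronological split, which is a common choice for time series, and the function name `chronological_split` is hypothetical.

```python
def chronological_split(rows, train_frac=0.68, test_frac=0.17):
    """Split rows in order into train, test and validation parts.
    Assumption: rows are sorted by date and are not shuffled."""
    n_train = int(len(rows) * train_frac)
    n_test = int(len(rows) * test_frac)
    train = rows[:n_train]
    test = rows[n_train:n_train + n_test]
    validation = rows[n_train + n_test:]   # the remainder, about 15%
    return train, test, validation

# One placeholder entry per trading day in the dataset (1,986 days).
days = list(range(1986))
train, test, validation = chronological_split(days)
print(len(train), len(test), len(validation))  # 1350 337 299
```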
1) Random Forest (RF)
Random forest is a combination of multiple decision trees. A single decision tree often
overfits. Given a total of m features, random forest randomly selects k
features at a time to build n decision trees. Because a random forest is composed of numerous
decision trees, each tree receives a random subset of the data and features and predicts an
outcome. The forest collects all the prediction results and uses the class with the highest
number of votes as the final prediction.

Figure 4 Demonstration of Random Forest
2) Gaussian Naïve Bayes (GNB)
On account of the simplicity and speed of the Naïve Bayes classifier in text classification, it
can be adopted to predict the polarity of text. The scikit-learn library provides several Naïve
Bayes classification algorithm classes, including Gaussian, Multinomial, and Bernoulli.
Among them, we choose Gaussian, which performs classification by
assuming the features are continuous with a Gaussian likelihood.
$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}} \exp\left(-\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}}\right) \tag{3}$$
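Equation (3) and the resulting classification rule can be sketched directly. This is a minimal illustration, not the scikit-learn implementation: the per-class means, variances and priors are hypothetical numbers, and the two features stand for an average sentiment score and a previous-day return.

```python
import math

def gaussian_likelihood(x, mean, variance):
    """Gaussian density of Equation (3) for one feature value."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return coefficient * math.exp(-((x - mean) ** 2) / (2.0 * variance))

def predict_class(features, class_stats, priors):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    best_label, best_score = None, -math.inf
    for label, stats in class_stats.items():
        score = math.log(priors[label])
        for value, (mean, variance) in zip(features, stats):
            score += math.log(gaussian_likelihood(value, mean, variance))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical (mean, variance) per feature and class, for illustration only.
class_stats = {1: [(0.05, 0.04), (0.2, 1.0)],
               0: [(-0.25, 0.04), (-0.3, 1.0)]}
priors = {1: 0.5, 0: 0.5}
prediction = predict_class((-0.3, -0.5), class_stats, priors)
print(prediction)  # 0
```

Summing log likelihoods across features is where the "naïve" conditional independence assumption enters.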
3) Support Vector Machine (SVM)
SVM is a two-class linear classifier in the feature space. In the SVM classification process,
each data sample is converted to a point in n-dimensional space, where n equals the
number of features extracted during preprocessing and the value of each feature is a particular
coordinate. SVM performs classification by finding the hyper-plane that
best separates the two classes. As output, an optimal hyper-plane that classifies the test
data is obtained by learning from the training data.

Figure 5 Support Vector Machine Diagram
4) Voting Classifier
We expect that combining the decisions of different models can improve the overall
performance of the proposed model. For each sample, the Voting Classifier collects the
predictions of the individual models in parallel and returns the class
predicted by the majority of them.
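Hard majority voting over the three models can be sketched as follows. The prediction lists are hypothetical, and the function name `majority_vote` is made up for this illustration; with three binary voters a tie cannot occur.

```python
from collections import Counter

def majority_vote(*model_predictions):
    """For each sample, return the label predicted by most of the models."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]

# Hypothetical per-day predictions from the three models, for illustration only.
rf_pred = [1, 0, 1, 1]
svm_pred = [1, 1, 1, 1]
nb_pred = [0, 0, 1, 0]
combined = majority_vote(rf_pred, svm_pred, nb_pred)
print(combined)  # [1, 0, 1, 1]
```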
5) Grid Search CV for model tuning
Grid Search CV is applied to find the best parameters for each classification model, which
are then used in our models to obtain better performance. Given a
specified range for each parameter, Grid Search CV tries the parameter combinations step by
step and returns the combination with the highest accuracy on the validation data. Grid
Search CV consists of two parts: grid search for the best
parameters and k-fold cross-validation.
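The two parts of Grid Search CV can be illustrated with a hand-written sketch. It is a simplified stand-in for scikit-learn's `GridSearchCV`, not the study's configuration: the model is a toy one-dimensional k-nearest-neighbour classifier, the data are hypothetical, and the only tuned parameter is the number of neighbours.

```python
def knn_predict(train_x, train_y, x, n_neighbors):
    """Predict 1 if at least half of the n_neighbors closest training labels are 1."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))
    votes = sum(label for _, label in nearest[:n_neighbors])
    return 1 if 2 * votes >= n_neighbors else 0

def cross_val_accuracy(xs, ys, n_neighbors, n_folds):
    """Fraction of held-out samples classified correctly across all folds."""
    correct = 0
    for fold in range(n_folds):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % n_folds != fold]
        held_out = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % n_folds == fold]
        train_x = [x for x, _ in train]
        train_y = [y for _, y in train]
        correct += sum(knn_predict(train_x, train_y, x, n_neighbors) == y
                       for x, y in held_out)
    return correct / len(xs)

def grid_search_cv(xs, ys, candidates, n_folds=5):
    """Try every candidate value and keep the one with the best CV accuracy."""
    scores = {k: cross_val_accuracy(xs, ys, k, n_folds) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical 1-D feature (e.g. average sentiment) and labels, for illustration only.
xs = [-2.0, 1.0, -1.5, 1.5, -1.0, 2.0, -0.5, 0.5, -0.2, 0.2]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 1, 0]
best_k, best_score = grid_search_cv(xs, ys, candidates=[1, 3, 5])
```

Each candidate is scored only on samples that were held out from training in that fold, which is what keeps the selected parameter from being tuned to the training data alone.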
4.4 Performance Evaluation Criteria
The classification models constructed during the implementation stage were
evaluated by calculating precision, recall, F1 score, and accuracy. Results are
summarized in the form of a confusion matrix, as Table 1 shows. True positives (TP) and true
negatives (TN) correspond to correctly predicted positive and negative changes in the
next day’s index movement, while false positives (FP) and false negatives (FN) represent
incorrectly predicted samples.
                 Actual 0    Actual 1
Predicted 0      TN          FN
Predicted 1      FP          TP
Table 1 Confusion matrix for TP, TN, FP, FN
The explanations and equations for the 4 performance evaluation criteria are given
below.
1) Precision
Precision is the proportion of samples predicted as positive that are actually
positive.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$
2) Recall
Recall is the proportion of actual positive samples that are correctly predicted as
positive.
$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$
3) F1 score
The F1 score is the harmonic mean of precision and recall. β is used to balance the
weights of Precision and Recall in the F-score calculation. β is generally set to 1.
$$F_{\beta}\ \text{Score} = (1 + \beta^{2}) \times \frac{\text{Precision} \times \text{Recall}}{\beta^{2} \times \text{Precision} + \text{Recall}} \tag{6}$$
4) Accuracy
Accuracy is the proportion of all predictions that are correct, covering both
positive and negative changes.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$
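The four criteria can be computed from confusion-matrix counts as sketched below. The counts are hypothetical, the function name is made up for this illustration, and the F-score uses the standard form with β² weighting precision in the denominator, which reduces to the usual F1 score when β = 1.

```python
def classification_metrics(tp, tn, fp, fn, beta=1.0):
    """Precision, recall, F-beta and accuracy from confusion-matrix counts.
    Assumes at least one predicted positive and one actual positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=60, tn=30, fp=40, fn=20)
print({name: round(value, 2) for name, value in metrics.items()})
# {'precision': 0.6, 'recall': 0.75, 'f_score': 0.67, 'accuracy': 0.6}
```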
5 Results
Table 2 Predictive model performance based on Count Vectorizer

Algorithms          Precision   Recall   F1 Score   Accuracy
Random Forest       0.55        0.54     0.54       0.54
SVM                 0.28        0.53     0.37       0.53
Naïve Bayes         0.56        0.55     0.55       0.55
Voting Classifier   0.59        0.57     0.55       0.57

Table 3 Predictive model performance based on sentiment score, past return and trading volume

Algorithms          Precision   Recall   F1 Score   Accuracy
Random Forest       0.55        0.55     0.52       0.55
SVM                 0.28        0.53     0.37       0.53
Naïve Bayes         0.58        0.57     0.57       0.57
Voting Classifier   0.61        0.58     0.53       0.58

This section presents and discusses the prediction results of this study. Tables 2 and 3
show the performance (Precision, Recall, F1 Score and Accuracy) of the three
predictive models and the Voting Classifier used in this research for the two feature
extraction methods.
DJIA index movement prediction based on Count Vectorizer shows 54% accuracy for
Random Forest, 53% for Support Vector Machine and 55% for Naïve Bayes. This result
suggests that the specific words contained in the daily world news headlines do not have
much predictive power for the next day’s DJIA index movement. However, by combining
the news sentiment with the past return and trading volume of DJIA, S&P500, NASDAQ and
gold, the prediction accuracy improves to 55% for Random Forest and 57% for Naïve
Bayes. This result suggests that leveraging both sentiment and variables from technical
analysis can improve the prediction of the next day’s DJIA index movement.
The prediction accuracy of SVM, in contrast, remains unchanged at 53% regardless of the
input features. In addition, the precision (28%), recall (53%), F1 Score (37%) and
accuracy (53%) of SVM are consistently lower than those of the other predictive models for
both methods. This low performance is caused by its inability to predict a negative
change in the DJIA index, which means SVM is the least suitable for our input features. By
contrast, Naïve Bayes shows the strongest performance among the standalone models, with the
highest precision, recall, F1 score and accuracy in both methods.
The Voting Classifier, which conducts a majority vote over the prediction
results of the three models, further improves precision, recall and
accuracy in both methods, although its F1 score only matches Naïve Bayes in Table 2 (0.55)
and falls below it in Table 3 (0.53 versus 0.57). Thus, despite the strong performance of the
Naïve Bayes predictive model, the overall stock market prediction outcome can still be improved by
combining decisions from different classifiers and reducing the variance of estimation errors
(Kim, Min and Han, 2006).
6 Conclusion
In conclusion, this research uses daily world news headlines to predict whether the next
day’s DJIA index will increase or decrease. In addition, we
incorporate other related stock index and commodity price data, namely S&P500,
NASDAQ and gold, to improve the overall prediction outcome. Based on the specific words
contained in each daily news headline, the prediction results show only a weak correlation
between the DJIA index and the daily world news headlines. Based on the sentiment
implied in each daily news headline together with features from technical analysis, the
predictive power improves. The Naïve Bayes predictive model shows the
strongest predictive power among the standalone models under both feature extraction
methods. Furthermore, the prediction accuracy can be improved by conducting a majority vote
over the three predictive models used in this research.
Nonetheless, the overall prediction accuracy remains relatively low despite the
improvements. We therefore conclude that the daily world news headlines are
weakly correlated with the next day’s movement direction of the DJIA index. There are
several possible reasons. Firstly, the news in our dataset is drawn from countries
worldwide rather than being focused on the US. In addition, the news
covers widely dispersed topics, which results in substantial noise. Secondly, the derived
sentiments are largely negative or neutral, which adds to the
prediction difficulty.
Thus, future research will be based on information that is focused more specifically on
the financial markets, such as posts on StockTwits. Moreover, this research only
shows a low level of correlation between daily world news headlines and the next day’s
index movement. The US stock market may absorb information quickly once
news occurs or is released, so the prediction results may improve if the
prediction horizon is shortened.
Appendices
Appendix1 DJIA, S&P500, NASDAQ and Gold return
DJIA(t) DJIA(t-1) DJIA(t-2) S&P500(t) S&P500(t-1) S&P500(t-2) NASDAQ(t) NASDAQ(t-1) NASDAQ(t-2) Gold(t) Gold(t-
Day1 -1.18% 0.45% 2.64% -1.21% 0.69% 2.39% -0.01% 0.78% 2.45% -1.62% -4.23
Day2 -0.86% -1.18% 0.45% -0.29% -1.21% 0.69% 0.05% -0.01% 0.78% 2.08% -1.62
Day3 0.73% -0.86% -1.18% 0.55% -0.29% -1.21% 1.15% 0.05% -0.01% -2.04% 2.08
Day4 0.42% 0.73% -0.86% 0.41% 0.55% -0.29% -0.35% 1.15% 0.05% -2.75% -2.04
Day5 -1.55% 0.42% 0.73% -1.51% 0.41% 0.55% -1.27% -0.35% 1.15% 1.74% -2.75
Note: This table shows the daily percentage return for DJIA, S&P500, NASDAQ and Gold. For each trading day, the percentage return
day and two days ago are also incorporated which is denoted as t-1 and t-2. Today's return is denoted as t. This table only use the fi
days for illustration.
Appendix2 DJIA, S&P500, NASDAQ trading volumes after normalization
DJIA(t) DJIA(t-1) DJIA(t-2) S&P500(t) S&P500(t-1) S&P500(t-2) NASDAQ(t) NASDAQ(t-1) NASDAQ(t-2)
Day1 -0.0066 -0.0084 -0.0022 -0.0071 0.0095 -0.0030 -0.0064 0.0128 0.0002
Day2 0.0009 -0.0066 -0.0084 0.0002 -0.0071 0.0095 -0.0007 -0.0064 0.0128
Day3 -0.0089 0.0009 -0.0066 -0.0114 0.0002 -0.0071 -0.0087 -0.0007 -0.0064
Day4 -0.0018 -0.0089 0.0009 -0.0055 -0.0114 0.0002 -0.0086 -0.0087 -0.0007
Day5 -0.0058 -0.0018 -0.0089 -0.0100 -0.0055 -0.0114 -0.0160 -0.0086 -0.0087
Note: This table shows the daily trading volume of DJIA, S&P500 and NASDAQ after normalization. This table
shows only the first 5 trading days for illustration. Trading volume for today, the previous day and two days ago are denoted as t,
t-1 and t-2 respectively.
Appendix3 Sentiment Score
Day1 Day2 Day3 Day4 Day5

Top1 0.03 -0.76 0.20 -0.59 -0.92
Top2 0.00 -0.81 -0.60 0.00 -0.46
Top3 -0.54 -0.44 0.68 0.42 -0.38
Top4 -0.61 -0.42 -0.87 0.44 0.00
Top5 0.00 -0.51 -0.61 -0.32 -0.76
Top6 -0.69 0.36 -0.64 -0.74 -0.34
Top7 -0.60 0.51 0.69 0.32 -0.38
Top8 -0.60 0.00 -0.18 -0.56 -0.36
Top9 0.34 0.49 -0.68 0.10 -0.85
Top10 -0.76 0.40 -0.34 -0.05 0.00
Top11 -0.80 0.40 0.00 -0.32 0.13
Top12 -0.59 -0.23 0.00 -0.05 -0.74
Top13 -0.86 -0.25 0.56 0.09 0.00
Top14 0.00 -0.30 0.18 -0.23 -0.60
Top15 0.54 0.00 0.00 -0.30 0.51
Top16 0.00 0.00 -0.62 -0.03 0.00
Top17 0.00 0.00 0.00 -0.64 0.00
Top18 0.08 0.00 -0.74 0.00 -0.74
Top19 -0.60 -0.54 -0.32 -0.44 0.00
Top20 -0.59 -0.03 0.00 -0.27 0.00
Top21 -0.60 0.00 -0.44 0.42 0.00
Top22 0.53 0.49 -0.60 -0.74 0.00
Top23 0.00 -0.57 0.18 -0.30 -0.05
Top24 0.00 -0.42 -0.70 0.00 0.34
Top25 0.00 -0.34 0.71 0.00 -0.51
Average -0.25 -0.12 -0.17 -0.15 -0.24
Note: This table shows the sentiment score for the Top1 to Top25 daily news headlines as well as
the average sentiment score for that trading day. This table shows only the first 5 trading days for
illustration.
References
Batra, R., & Daudpota, S. M. (2018). Integrating StockTwits with sentiment analysis for
better prediction of stock price movement. 2018 International Conference on
Computing, Mathematics and Engineering Technologies (ICoMET).
https://doi.org/10.1109/icomet.2018.8346382
Cambria, E., & White, B. (2014). Jumping NLP curves: A review of natural language
processing research. IEEE Computational Intelligence Magazine, 9(2), 48-57.
Chahine, S., & Malhotra, N. (2018). Impact of social media strategies on stock price: the
case of Twitter. European Journal Of Marketing, 52(7/8), 1526-1549. doi:
10.1108/ejm-10-2017-0718
Chen, W., Cai, Y., Lai, K., & Xie, H. (2016). A topic-based sentiment analysis model
to predict stock market price movement using Weibo mood. Web Intelligence, 14(4),
287-300. doi: 10.3233/web-160345
Costola, M., Nofer, M., Hinz, O., & Pelizzon, L. (2020). Machine Learning Sentiment
Analysis, Covid-19 News and Stock Market Reactions. SSRN Electronic Journal.
https://doi.org/10.2139/ssrn.3690922
Day, M.-Y., & Lee, C.-C. (2016). Deep learning for financial sentiment analysis on finance
news providers. 2016 IEEE/ACM International Conference on Advances in Social
Networks Analysis and Mining (ASONAM).
https://doi.org/10.1109/asonam.2016.7752381
Deng, S., Mitsubuchi, T., Shioda, K., Shimada, T., & Sakurai, A. (2011). Combining
Technical Analysis with Sentiment Analysis for Stock Price Prediction. 2011 IEEE
Ninth International Conference on Dependable, Autonomic and Secure Computing,
pp. 800-807, doi: 10.1109/DASC.2011.138.
Fama, E. (1970). Efficient Capital Markets: A Review of Theory and Empirical Work. The
Journal Of Finance, 25(2), 383. doi: 10.2307/2325486
Gonçalves, T., & Quaresma, P. (2005). Evaluating preprocessing techniques in a text
classification problem. São Leopoldo, RS, Brasil: SBC-Sociedade Brasileira de
Computação.
Hirshleifer, D. (2001). Investor Psychology and Asset Pricing. SSRN Electronic Journal.
doi: 10.2139/ssrn.265132
Houlihan, P., & Creamer, G. (2019). Leveraging Social Media to Predict Continuation and
Reversal in Asset Prices. Computational Economics, 57(2), 433-453. doi:
10.1007/s10614-019-09932-9
Internet Society. (2021, June 1). Brief History of the Internet.
https://www.internetsociety.org/internet/history-internet/brief-history-internet/.
Kanakaraj, M., & Guddeti, R. M. R. (2015, March). NLP based sentiment analysis on
Twitter data using ensemble classifiers. In 2015 3rd International Conference on
Signal Processing, Communication and Networking (ICSCN) (pp. 1-5). IEEE.
Khan, M. T., Durrani, M., Ali, A., Inayat, I., Khalid, S., & Khan, K. H. (2016). Sentiment
analysis and the complex natural language. Complex Adaptive Systems Modeling,
4(1), 1-19.
Khedr, A. E., Salama, S. E., & Yaseen, N. (2017). Predicting Stock Market Behavior using
Data Mining Technique and News Sentiment Analysis. International Journal of
Intelligent Systems and Applications, 9(7), 22–30.
https://doi.org/10.5815/ijisa.2017.07.03
Kim, M., Min, S., & Han, I. (2006). An evolutionary approach to the combination of multiple
classifiers to predict a stock price index. Expert Systems With Applications, 31(2),
241-247. doi: 10.1016/j.eswa.2005.09.020
Li, B., Chan, K., Ou, C., & Ruifeng, S. (2017). Discovering public sentiment in social media
for predicting stock movement of publicly listed companies. Information Systems, 69,
81-92. doi: 10.1016/j.is.2016.10.001
Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis lectures on human
language technologies, 5(1), 1-167.
Malkiel, B. G. (1989). Efficient Market Hypothesis. In J. Eatwell, M. Milgate, & P. Newman (Eds.),
Finance. The New Palgrave. Palgrave Macmillan, London.
https://doi.org/10.1007/978-1-349-20213-3_13
Minh, D. L., Sadeghi-Niaraki, A., Huy, H. D., Min, K., & Moon, H. (2018). Deep learning
approach for short-term stock trends prediction based on two-stream gated recurrent
unit network. IEEE Access, 6, 55392-55404.
Möller, R., & Reichmann, D. (2021). ECB language and stock returns – A textual analysis
of ECB press conferences. The Quarterly Review Of Economics And Finance, 80,
590-604. doi: 10.1016/j.qref.2021.04.003
Nguyen, T. H., & Shirai, K. (2015, July). Topic modeling based sentiment analysis on
social media for stock market prediction. In Proceedings of the 53rd Annual Meeting
of the Association for Computational Linguistics and the 7th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers)
(pp. 1354-1364).
Nti, I., Adekoya, A., & Weyori, B. (2019). A systematic review of fundamental and technical
analysis of stock market predictions. Artificial Intelligence Review, 53(4), 3007-3057.
doi: 10.1007/s10462-019-09754-z
Paramanik, R., & Singhal, V. (2020). Sentiment Analysis of Indian Stock Market Volatility.
Procedia Computer Science, 176, 330-338. doi: 10.1016/j.procs.2020.08.035
Picasso, A., Merello, S., Ma, Y., Oneto, L., & Cambria, E. (2019). Technical analysis and
sentiment embeddings for market trend prediction. Expert Systems With
Applications, 135, 60-70. doi: 10.1016/j.eswa.2019.06.014
Ratku, A., Feuerriegel, S., & Neumann, D. (2014). Analysis of How Underlying Topics in
Financial News Affect Stock Prices Using Latent Dirichlet Allocation. SSRN
Electronic Journal. doi: 10.2139/ssrn.2529457
Shah, D., Isah, H., & Zulkernine, F. (2018). Predicting the Effects of News Sentiments on
the Stock Market. 2018 IEEE International Conference on Big Data (Big Data),
4705-4708. doi: 10.1109/BigData.2018.8621884
Sul, H. K., Dennis, A. R., & Yuan, L. (2017). Trading on twitter: Using social media
sentiment to predict stock returns. Decision Sciences, 48(3), 454-488.
Suleman, M. T. (2010). Stock Market Reaction to Good and Bad Political News. SSRN
Electronic Journal. https://doi.org/10.2139/ssrn.1713804

