Stock Prediction With Sentiment
Senior Project
Nirdesh Bhandari
Earlham College
801 National Rd W
Richmond, Indiana
[email protected]
entire news section. Before I dive into the details of my program, I will be going over some previous research in the next section.
2. PREVIOUS RESEARCH
Traditionally, sentiments and opinions have been collected using surveys and polls. While these are fairly reliable, they are highly dependent on the sample size. There has always been tremendous potential for research, competitive analysis, and market data collection in being able to recognize human opinion from online documents, chat rooms, and news articles. While humans can easily recognize the text in these documents, a similar understanding of textual context and common linguistic occurrences proves difficult for a machine.

On the other hand, machine-learning-based sentiment analysis techniques such as Artificial Neural Nets (ANN), Random Forests, Support Vector Machines (SVM), Naïve Bayes, Multi-Kernel Gaussian Processes (MKGP), and XGBoost Regressors (XGB) are also used to classify sentiments. However, given the complex nature of linguistic identification, machine-learning approaches rarely attain more than 70% accuracy [10].
Similar work has been done by [5], who collected Apple Inc. stock information and news related to Apple Inc. over a time span of three years. They gathered news from Google as well as Yahoo Finance. Using a sentiment detection algorithm, they collected sentiments over a preprocessed corpus of news articles. Their sentiment classifier employed a bag-of-words approach in which they counted the number of positive and negative words in each piece. Then, using several models of SVM, Random Forest, and Naive Bayes, they obtained accuracy scores ranging from 75% to 90% on the test dataset. Likewise, [6] employed a similar method to [5] on Indian stock companies over a ten-year span and obtained accuracy scores ranging from 47% to 75% on the test dataset for SVM, KNN, and Naive Bayes. They also employed a bag-of-words approach.
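As an aside, the bag-of-words counting described above is easy to make concrete. The following toy sketch is my own illustration, not the cited authors' code; the two word lists are tiny placeholders for a real sentiment lexicon.

```python
# Toy bag-of-words sentiment: score = (#positive - #negative) / #tokens.
# The two word sets are placeholders for a real sentiment lexicon.
POSITIVE = {"gain", "growth", "profit", "surge", "record"}
NEGATIVE = {"loss", "decline", "crash", "fear", "lawsuit"}

def bow_score(text: str) -> float:
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return (pos - neg) / len(tokens)

print(bow_score("Apple posted record profit despite lawsuit worries"))
```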
3. PROGRAM DESIGN
For this project, I decided to build a sentiment analysis tool that retrieves, processes, evaluates, visualizes, and tests newspaper articles. The goal of my project was to find the correlation between stock prices and the overall market sentiment. Using various sentiment dictionaries to look at the overall sentiment of a news article, the program graphs the sentiments over a certain timescale and then compares them to the different stock market indexes. Figure 1 gives the overall software architecture of my project. The first step involves data collection using the Guardian API (http://open-platform.theguardian.com/). The second step is to collect data for the stock market indices from Yahoo Finance (https://finance.yahoo.com/) for the same time range. The next step involves processing the data and creating a workable data frame. Then, the data frame is fed to a sentiment analyzer, which iterates over the articles and returns their sentiment values. The final step involves the visualization of the data along with tests for cross-correlation and a simple Random Forest prediction. As of writing this paper, the source code for my program is hosted on GitHub (https://github.com/nirdesh1995/CS488_Project).

Figure 1: Software Architecture
3.1. Data Collection
The first task of this project was collecting and compiling a newspaper corpus on which to run sentiment analysis. While there were corpora of historical texts and emails available on the Internet, not many online archives provided articles in a chronological structure with well-organized metadata. Furthermore, digital forms of newspapers would require text extraction and often had copyright issues.

I decided to compile my own corpus of news articles. The two helpful APIs I found for this purpose were the NY Times API and the Guardian API. I used the Guardian API to compile a corpus of 1.75 million articles. While the Guardian API was free for developers and offered structured data, it would only return the URLs of the articles. So I used Python to run daily search queries, open each article, and create a compilation of articles for each day. My data was organized in JSON files, one for each day from the 1st of January 2000 through 2016. I picked the Guardian API because the data returned contained the title, date, section ID, web publication date, text publication date, and other helpful structured metadata.
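For illustration, a daily query loop along these lines could compile the per-day JSON files. The endpoint and parameters follow the Guardian's public content API, but the page size, key handling, and omitted pagination are simplifying assumptions rather than the project's exact code.

```python
import json
import requests

API_KEY = "test"  # the Guardian offers a rate-limited developer key
SEARCH_URL = "https://content.guardianapis.com/search"

def compile_day(day: str) -> None:
    """Fetch one day's articles and save them as a per-day JSON file."""
    params = {
        "api-key": API_KEY,
        "from-date": day,        # e.g. "2000-01-01"
        "to-date": day,
        "page-size": 50,         # real code would also page through results
        "show-fields": "body",   # request the article text, not just the URL
    }
    results = requests.get(SEARCH_URL, params=params).json()["response"]["results"]
    with open(f"{day}.json", "w") as f:
        json.dump(results, f)

compile_day("2000-01-01")
```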
As for the stock data, I decided to use Yahoo Finance and download its historical end-of-day closing prices for key stock indices to create a pickle file. The Yahoo Finance API also connects to the Pandas (http://pandas.pydata.org/) package in Python, which makes it convenient to gather information on various stock prices. For the stock indices, I picked the Standard & Poor's 500, the Dow Jones Industrial Average, the NASDAQ Composite, the Wilshire 5000 Total Market Index, and the Russell 2000 Index. I believe that these five indices are sufficient to evaluate various sectors of the economy with respect to what appears in newspaper articles. The program I designed simply takes in the start and end dates for which to collect the news articles and stock values.
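A sketch of this step, assuming the pandas-datareader Yahoo connector that was the common Pandas route at the time (Yahoo has since changed its endpoints, so newer code often swaps in the yfinance package); the ticker symbols are Yahoo's codes for the five indices.

```python
import pandas as pd
from pandas_datareader import data as web

# Yahoo Finance ticker symbols for the five indices named above.
INDICES = {
    "S&P 500": "^GSPC",
    "Dow Jones Industrial Average": "^DJI",
    "NASDAQ Composite": "^IXIC",
    "Wilshire 5000": "^W5000",
    "Russell 2000": "^RUT",
}

def download_closes(start: str, end: str) -> pd.DataFrame:
    # One column of end-of-day closing prices per index.
    closes = {
        name: web.DataReader(symbol, "yahoo", start, end)["Close"]
        for name, symbol in INDICES.items()
    }
    frame = pd.DataFrame(closes)
    frame.to_pickle("stocks.pkl")  # cache to a pickle file, as described above
    return frame

download_closes("2000-01-01", "2016-12-31")
```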
As for the stock data, I noticed that there were gaps in the time series, since stock markets tend to close on weekends and some public holidays. Given that I was dealing with time series data in which response lags in variables were expected, I decided to interpolate values for the missing days instead of removing them from my analysis. For this, I used the built-in interpolate function of Pandas to evenly space out the values between the gaps. While interpolating with only closing values isn't always the ideal solution, the evenly spaced averaged-price approach is better than not looking at those dates at all. Since stocks are responsive to news articles and news is published throughout the weekend, my intuition was that not factoring in the weekends would throw off predictions.

Furthermore, when the data frames for the stocks and the polarity values were combined, I noticed that some sections had missing values. These missing values create problems in the visualization and analysis portions later on. Since the missing polarities were occasional, I adopted the simple solution of backfilling the data frame, whereby any empty entry takes the value of the next non-empty entry.
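Both fixes are one-liners in Pandas. A minimal sketch, using illustrative closing prices and polarity values:

```python
import numpy as np
import pandas as pd

# Trading days only: the weekend (Jan 8-9, 2000) is missing from the index.
idx = pd.to_datetime(["2000-01-06", "2000-01-07", "2000-01-10"])
stocks = pd.DataFrame({"close": [1403.45, 1441.47, 1457.60]}, index=idx)

# Reindex to every calendar day, then linearly interpolate the gaps so
# weekends and holidays get evenly spaced values between known closes.
daily = stocks.resample("D").asfreq().interpolate(method="linear")

# Occasional missing polarity values are backfilled: each empty entry
# takes the value of the next non-empty entry.
polarity = pd.Series([0.12, np.nan, np.nan, 0.30, -0.05],
                     index=pd.date_range("2000-01-06", periods=5))
polarity = polarity.bfill()

print(daily)
print(polarity)
```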
The sentiment polarity scores were in the range of -1 to 1, where -1 represented a highly negative article while 1 represented a very positive article. These scores were then aggregated and stored in a Pandas data frame along with the headline of the article, the word count, and the section the article belonged to. Because of the large size of the corpus, I compiled these polarity-value data frames in yearly chunks of news articles. Each year took roughly 20 minutes to run through the analysis tool.
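The source's footnotes point to TextBlob (http://textblob.readthedocs.io/en/dev/), the vaderSentiment package (https://github.com/cjhutto/vaderSentiment), and NLTK (http://www.nltk.org/) as the sentiment tooling. As one hedged example of building the data frame described above, here is a sketch using VADER, whose compound score falls in the same -1 to 1 range; the article records are placeholders standing in for the per-day JSON files.

```python
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Placeholder records; the real program reads these from the daily JSON files.
articles = [
    {"date": "2000-01-06", "headline": "Markets rally on strong jobs data",
     "section": "business", "text": "Stocks surged as hiring beat forecasts."},
    {"date": "2000-01-07", "headline": "Oil worries weigh on shares",
     "section": "business", "text": "Shares slipped amid fear of supply cuts."},
]

rows = []
for a in articles:
    # VADER's compound score is normalized to the [-1, 1] range used above.
    score = analyzer.polarity_scores(a["text"])["compound"]
    rows.append({"date": a["date"], "headline": a["headline"],
                 "section": a["section"],
                 "word_count": len(a["text"].split()),
                 "polarity": score})

frame = pd.DataFrame(rows).set_index("date")
print(frame)
```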
3.4. Visualization
Visualizing the cross-correlation results would give me a quicker view of the data than going over the numbers in a large table. I therefore opted for a heat map, using the Seaborn visualization package to display the cross-correlation results. Because the correlations I was interested in were the ones between the polarity values and the stock indices, I set a maximum value of 0.3 for the heat map. This makes it easier to see changes in the smaller correlation values between stocks and polarities, since the larger stock-stock correlations would otherwise simply occupy the extreme end of the spectrum. Figure 4 shows one of the heat maps I obtained.
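A minimal sketch of the capped heat map described above, using random placeholder series in place of the real combined data frame:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Placeholder combined frame: two index price series plus a polarity series.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "SP500": rng.normal(size=300).cumsum(),
    "NASDAQ": rng.normal(size=300).cumsum(),
    "polarity": rng.normal(scale=0.2, size=300),
})

# Capping the color scale at vmax=0.3 keeps the near-1.0 stock-stock
# correlations from washing out the small stock-polarity correlations.
sns.heatmap(frame.corr(), vmax=0.3, annot=True, cmap="coolwarm")
plt.show()
```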
3.5. Analysis