Paper 4
MINING
Shila Jawale (Guide) Shweta Yeshwant Nimje
Abstract—Stock market is a very volatile space. similar terms for prediction markets are decision
Accurately predicting the changes in the stock prices markets, future ideas, virtual markets, informative
may prove exceedingly profitable to the investors and markets and predictive markets[1]. Every second the
assist them in making smarter decisions. This research
market prices rise or fall that means changing constantly.
subject uses Twitter sentiment analysis to obtain the
overall sentiment of the users towards the company in Therefore, it becomes difficult to predict and invest in
question which ideally leads to the changes in the stock the market. There are different techniques determined to
market prices. This study attempts to implement a analyze the rise and fall of stocks. Stock means owning
data mining technique called Random Forest the shares of the company. If company ownership is
Algorithm and use the same with the twitter sentiment divided in 100 parts and we are the investor purchasing
score of the company to accurately predict the
one part which is equal to one share then we own one
fluctuations in the stock market.
percent of that company [1]
Keywords—Data mining, stock, Random Forest,
Twitter sentiment analysis. Data mining is the extraction of useful and trivial patterns
or knowledge from large data sets. Alternative names for
data mining are knowledge discovery from data (KDD),
I. INTRODUCTION
knowledge extraction, pattern analysis, business
Predicting stock prices has been a popular topic for intelligence. Whereas, plain search in goggle engine or
literature survey. Still the research is being carried out to query firing on relational database is not data mining.
There are some domains of data mining such as machine
find the best way to get money through stock market
learning, cognitive learning, statistics, algorithms, pattern
activity. Overall, the aim is to predict the future. The
recognition and virtualization. Files, databases and other
repositories consist of huge amounts of data, hence it is
B. Twitter Sentiment Analysis

Consumers usually express their sentiments on public forums and social network sites such as Facebook and Twitter. Data collected in real time yields accurate results for stock prediction. However, opinions, feelings and comments are written in slang or in a disorganized manner, and manual analysis of such data is virtually impossible. Python libraries such as Tweepy support text preparation, sentiment detection, sentiment classification and, finally, presentation of the output. Specifically, this pipeline eliminates irrelevant data and extracts the text relevant to the area of study. Sentiment analysis is carried out at different levels, handling negation, lemmas, etc. Sentiment classification is the next important step, in which texts are classified into groups such as good, bad, positive, negative, like and dislike. Python also allows the results to be presented as a line graph, bar chart or pie chart.

Twitter is a popular social network where users share messages called tweets, and it allows us to mine the data of any user through the Twitter API or Tweepy. The data will be tweets extracted from the user. The first step is to obtain the consumer key, consumer secret, access key and access secret, which are easily available to every user from the Twitter developer portal; these keys let the API authenticate the client. Tweepy is one of the libraries that can be installed using pip, and it provides a convenient Cursor interface to iterate through different types of objects. We can thus retrieve the tweets related to the provided keywords of the company; the result includes user_id, tweet text, date, time, retweets and likes, along with sentiments.
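As an illustration of the preparation, detection and classification steps described above, here is a minimal plain-Python sketch. The word lists and the zero threshold are hypothetical stand-ins for a real sentiment lexicon or classifier, not the exact method used in this study.

```python
import re

# Hypothetical mini-lexicon; a real pipeline would use a full
# sentiment lexicon or a trained classifier instead.
POSITIVE = {"good", "great", "profit", "gain", "buy"}
NEGATIVE = {"bad", "loss", "drop", "sell", "crash"}

def clean_tweet(text):
    """Remove hashtags, mentions, special characters and extra whitespace."""
    text = re.sub(r"[#@]\w+", " ", text)          # hashtags and mentions
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # special characters
    return re.sub(r"\s+", " ", text).strip().lower()

def sentiment_score(text):
    """Score = (#positive words - #negative words) / #words."""
    words = clean_tweet(text).split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

def classify(score):
    """Map a numeric score to a positive/negative/neutral label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify(sentiment_score("Great quarter, #TCS stock is a buy!")))  # positive
```

Cleaning and scoring are kept as separate functions so that the same cleaned text can also feed the classification and presentation steps.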
Data collection is the initial step of any project, and collecting the right data is an important aspect: data as collected are never ready for an algorithm to be applied directly, and collecting data relevant to the project makes the process easier. We have collected data from the NSE websites of different companies. Initially, we analyze the dataset according to the model and then predict the results accurately.

For our stocks we used the Random Forest algorithm. Random Forests are based on ensemble learning techniques. Ensemble literally means a group or set, which in this case is a set of decision trees that together are called a random forest. The reliability of an ensemble model is higher than the accuracy of the individual models, as it compiles the results from the individual models and produces a final result. Features are chosen automatically using a process known as bootstrap aggregating, or bagging: from the set of features available in the dataset, a number of training subsets are drawn at random.
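The bagging-and-vote idea can be sketched in a few lines of plain Python. The dataset, labels and the deliberately trivial "model" (a stump that simply predicts the majority label of its bootstrap sample) are all illustrative; a real random forest trains full decision trees on each sample.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw a sample of the same size as `data`, with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Stand-in 'model': predict the majority label of its own sample."""
    labels = [label for _, label in sample]
    return Counter(labels).most_common(1)[0][0]

def ensemble_predict(predictions):
    """Majority vote over the individual models' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
# Toy dataset: (feature, label) pairs, mostly labelled "up".
data = [(i, "up") for i in range(8)] + [(i, "down") for i in range(3)]

# Each bootstrap sample yields one model; the forest is their vote.
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(ensemble_predict(models))
```

Because each model sees a different random resample, the individual predictions are partly decorrelated, which is why the vote is more reliable than any single model.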
In principle, the random forest consists of many deep but uncorrelated decision trees built upon different samples of the data. The process of constructing a random forest is simple. For each decision tree, we first randomly generate a subset as a sample from the original dataset. Then we grow a decision tree on this sample to its maximum depth S_d. Meanwhile, the sp features tried at each split are selected at random from the p available features. After repeating the procedure numerous times with the original dataset, O decision trees are generated. The final output is an ensemble of all the decision trees, and classification is conducted via a majority vote. The computational complexity can be estimated as

O(O · sp · n_ins · log n_ins)    (1)

where n_ins represents the number of instances in the training dataset. Three parameters must be tuned to check the robustness of the random forest on classification: the number of trees O, the maximum depth S_d and the number of features sp tried at each split. We set the maximum depth S_d to be unlimited, so that nodes are expanded until all leaves are pure or until all leaves contain fewer than two samples. Regarding the feature subsampling, we typically choose sp = √p. The influence of the number of trees on the classification accuracy and the out-of-sample performance is then systematically investigated.

Figure 3 shows the output of a regression model on the training and test datasets of stocks; this model gives an accuracy of around 85%. To increase the accuracy, we turned to Twitter sentiment analysis. Between 700 and 900 data points were collected as historical stock data, and Twitter data were collected for the same period of time. While the Twitter data are available for every day in the given period, the stock values obtained from Yahoo! Finance are (understandably) absent for weekends and other holidays when the market is closed. To complete the data, we approximated the missing values: if the stock value on a given day is x and the next accessible data point is y, with n days in between, we estimate the missing value on the first day after x to be (y+x)/2, and then repeatedly apply the same approach until all the holes are filled.

First we retrieve the tweets and clean them; cleaning includes removing all hashtags, unnecessary spaces and tabs, and all special characters. Applying sentiment analysis over the cleaned tweets then yields the respective sentiment scores, which appear as the percentages of positive and negative tweets in the result. Once the tweets were collected and their polarity decided, the next step was to collect data from the stock exchange market, which was done via Yahoo! Finance. We considered the closing price column as our target, so we joined the tweets from Twitter with the stock prices on each particular date. Figure 2 shows the dataset along with the sentiment values of the tweets (here we have taken the example of the company TCS).
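The hole-filling rule above can be written out directly. In this sketch (an illustration, not the study's code), `None` marks a missing trading day, and the series is assumed to begin and end with known prices.

```python
def fill_missing(values):
    """Fill None gaps left to right: each missing day becomes the average
    of the previous (possibly just-filled) value and the next known value.
    Assumes the first and last entries are known prices."""
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            # Find the next known value y to the right.
            j = i + 1
            while filled[j] is None:
                j += 1
            # (y + x)/2, where x is the value just before the hole.
            filled[i] = (filled[i - 1] + filled[j]) / 2
    return filled

# Friday close 100, Monday close 104; Saturday and Sunday are missing.
print(fill_missing([100, None, None, 104]))  # [100, 102.0, 103.0, 104]
```

Note that repeatedly averaging in this way makes each successive filled day drift toward the next known price, rather than interpolating linearly.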
VIII. CONCLUSION
IX. REFERENCES