PREDICTION ON STOCKS USING DATA MINING

Shila Jawale (Guide), Shweta Yeshwant Nimje, Ritesh Mayya, Mirza Nauman Ali Baig

Department of Information Technology
Datta Meghe College of Engineering
Airoli, India

Abstract—The stock market is a very volatile space. Accurately predicting changes in stock prices may prove exceedingly profitable to investors and assist them in making smarter decisions. This research uses Twitter sentiment analysis to obtain the overall sentiment of users towards the company in question, which is expected to be reflected in changes in its stock price. This study attempts to implement a data mining technique called the Random Forest algorithm and to use it together with the Twitter sentiment score of the company to accurately predict fluctuations in the stock market.

Keywords—Data mining, stock, Random Forest, Twitter sentiment analysis.

I. INTRODUCTION

Predicting stock prices has been a popular topic in the literature. Research is still being carried out to find the best way to make money through stock market activity. Overall, the aim is to predict the future. Similar terms for prediction markets are decision markets, idea futures, virtual markets, information markets and predictive markets [1]. Market prices rise or fall every second, that is, they change constantly. Therefore, it becomes difficult to predict the market and invest in it. Different techniques have been devised to analyze the rise and fall of stocks. Holding stock means owning shares of a company. If the ownership of a company is divided into 100 parts and an investor purchases one part, which is equal to one share, then the investor owns one percent of that company [1].

Data mining is the extraction of useful and non-trivial patterns or knowledge from large data sets. Alternative names for data mining are knowledge discovery from data (KDD), knowledge extraction, pattern analysis and business intelligence. By contrast, a plain search in the Google search engine or a query fired on a relational database is not data mining. Data mining draws on several domains, such as machine learning, cognitive learning, statistics, algorithms, pattern recognition and visualization. Files, databases and other repositories consist of huge amounts of data, hence it is

Electronic copy available at: https://ssrn.com/abstract=3565311


necessary to develop a powerful tool for the analysis and interpretation of data and for extracting interesting knowledge to facilitate decision making [2]. Some of the functionalities of data mining are the discovery of concept or class descriptions, associations and correlations, classification, prediction, clustering, trend analysis, outlier and deviation analysis, and similarity analysis [3].

Sentiment analysis is the process of determining people's attitudes, opinions, evaluations, appraisals and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [4]. Basically, it is a person's judgment or evaluation of some topic, or the polarity of a document. A basic job of sentiment analysis is to collect all the required data, which may be a single sentence, a whole paragraph or a line from the respective tweets, and to analyze its positivity or negativity for a better result.

II. PROBLEM DEFINITION

Stock markets are incredibly large and their behavior is hard to grasp. There is too much variation in stock outcomes. People's main aim in the stock market is to make a profit by buying or selling stocks, but because stock prices go up and down many times over time, it becomes difficult to trade in them. Thus there is a need for stock prediction. However, because of this large volatility, the market is considered too unpredictable for predictions to be reliable. Stock market values vary because of many factors, such as historical data, tweets, news, the reputation of the company, natural calamities, global financial disturbances and many more.

Investing in a strong stock at a bad moment may have catastrophic consequences, while investing at a good time will produce better income. Stockholders face this trading issue because they do not fully grasp which stocks to purchase at a particular time or which stocks to sell to get effective outcomes. We therefore tried to overcome this problem by using a regressor algorithm and Twitter sentiment analysis to predict stock movements as far as possible.

III. PROPOSED SYSTEM

The solution proposed in this project is to use Twitter sentiment analysis to predict the rise or fall of the price of a stock. This is done by fetching raw historical data of the stock along with the most recent tweets related to that company. These tweets are analyzed using text mining. For example, if words such as good, great or any other positive words are used along with the name of the company, then the result is positive; otherwise the result is negative. This information is processed using the Random Forest algorithm, after which we get many features along with a positive and a negative feature. These positive and negative features are selected and classified so that we can get the overall result, which may be positive or negative. This helps the investor make an intelligent decision.

For our proposed system, we first collected the historical data from the internet via Yahoo! Finance, which provides original financial reports, useful historical financial data and stock data. We used the Python language for our problem statement; Python has a library named yfinance, which offers a reliable way to download the historical data of the stocks of a particular company. Tweets are then retrieved through Tweepy, a Python library for the Twitter API, which can easily retrieve full information about particular tweets, for example the tweet text, the ID of the user who tweeted, the date and time of the tweet, the location, and the likes and retweets for that tweet. Applying sentiment analysis to the tweets yields sentiment values; combining these values with the result obtained from the algorithm applied to the historical data gives the prediction. Figure 1 shows the flow of our system.

Fig 1. System Architecture
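The keyword rule described in the proposed system (a tweet counts as positive when it contains a positive word, and as negative otherwise) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word list, the function names `tweet_is_positive` and `positive_share`, and the sample tweets are hypothetical and are not taken from the paper, which names only "good" and "great" as examples.

```python
import re

# Hypothetical word list: the paper names only "good" and "great" as examples.
POSITIVE_WORDS = {"good", "great", "strong", "excellent"}


def tweet_is_positive(tweet: str) -> bool:
    """Return True if the tweet contains at least one positive word."""
    words = set(re.findall(r"[a-z]+", tweet.lower()))
    return not words.isdisjoint(POSITIVE_WORDS)


def positive_share(tweets) -> float:
    """Fraction of tweets classified as positive (0.0 for an empty list)."""
    if not tweets:
        return 0.0
    return sum(tweet_is_positive(t) for t in tweets) / len(tweets)


# Made-up sample tweets about one company, for illustration only.
sample_tweets = [
    "TCS posts great quarterly results",
    "Not sure about TCS this week",
    "Good outlook for TCS",
]
print(positive_share(sample_tweets))
```

A rule this simple ignores negation (for example "not good" would still count as positive), which is one reason a dedicated sentiment library may be preferable in practice.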



IV. METHODOLOGIES

A. Random Forest Algorithm

The Random Forest algorithm is used for classification as well as regression. Here it is used for stock market prediction. The algorithm is flexible enough to handle missing values, and it is less prone to overfitting than a single decision tree. As stock prices are volatile in nature, predicting them is quite challenging. As the name suggests, this algorithm creates a forest whose main parameter is the number of trees. The algorithm works by selecting random samples from a given dataset. Next, it constructs a decision tree for every sample. Then it gets the prediction result from every decision tree. After that, the predicted results are put to a vote. Finally, the most-voted prediction becomes the predicted output. Of the dataset we obtained from Yahoo! Finance, 80% was used to train the model and 20% to test it. Thus, the basic approach is to learn the patterns and relationships from the training set and reproduce them on the test data.

B. Twitter Sentiment Analysis

Consumers usually express their sentiments on public forums such as the social network sites Facebook and Twitter. Data collected in real time can improve the accuracy of stock predictions. Opinions, feelings and comments are all expressed in slang or in a disorganized manner. Manual analysis of such data is virtually impossible. Python libraries and Tweepy allow text preparation, sentiment detection, sentiment classification, and finally presentation of the output. Specifically, text preparation eliminates the irrelevant data and extracts the text relevant to the area of study from the data. Sentiment analysis is done at different levels, such as negation, lemmas etc. Sentiment classification is the next important step, where groups are classified as good, bad, positive, negative, like or dislike. Python allows the data to be represented using a line graph, bar chart or pie chart.

V. MODULE IDENTIFICATION

A. Data Collection

Data collection is the initial step of any project. Collecting the right data is an important aspect. The data collected are never ready for an algorithm to be applied directly. Collecting data relevant to the project makes the process easier. We have collected data from NSE websites for different companies. Initially, we analyze the dataset according to the model and predict the results accurately.

B. Pre Processing

Data preprocessing means converting raw data into efficient and useful data. The processes involved are data cleaning, data transformation and data reduction. In data cleaning, missing and noisy data are eliminated. Next comes data transformation, where data normalization and discretization are carried out. Data reduction aims to increase storage efficiency and reduce the amount of data stored.

C. Training the machine/data score

In the data mining process, training on the data plays an important role in getting an accurate prediction. For the algorithm, we used a stock market dataset containing parameters such as Date, Open price, Close price, High price, Low price, Adjusted close and Volume. Each dataset belongs to a particular company. We used the Yahoo! Finance market data downloader to retrieve the data; "yfinance" aims to offer a reliable, threaded, and Pythonic way to download historical market data from Yahoo! Finance.

Twitter is a popular social network where users share messages called tweets. Twitter allows us to mine the data of any user using the Twitter API through Tweepy. The data will be tweets extracted from the user. The first thing to do is to get the consumer key, consumer secret, access key and access secret from the Twitter developer platform, which are readily available to each user. These keys allow the API to authenticate. Tweepy is a library that can be installed using pip. Tweepy provides the convenient Cursor interface to iterate through different types of objects. Thus we could retrieve the tweets related to the provided keywords of the company; the result includes user_id, Tweets, date, time, retweets and likes, along with Sentiments.

Fig. 2 Sentiment Scores of Tweets

VI. EXPERIMENTAL RESULTS

For our stocks, we used the Random Forest algorithm. Random forests are based on ensemble learning techniques. An ensemble literally means a group or set, which in this case is a set of decision trees that together are called a random forest. The accuracy of ensemble models is generally higher than that of individual models, because an ensemble compiles the results from the individual models and produces a final result. The training subsets are generated automatically using a process known as bootstrap aggregating, or bagging. From the dataset, a number of training subsets are created by random sampling with replacement. This means that the same record may be repeated in different training subsets at the same time.
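Bagging and the voting step can be illustrated with a minimal Python sketch. It assumes the usual formulation of bagging, in which training records are drawn with replacement and each tree casts one vote; the function names, the stand-in records and the three example votes are hypothetical and are not the authors' implementation.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) records with replacement, so a record can repeat."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the label predicted by the most trees (first seen wins a tie)."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)          # fixed seed so the sketch is repeatable
rows = list(range(10))           # stand-in for ten training records
subsets = [bootstrap_sample(rows, rng) for _ in range(3)]  # one subset per tree

tree_votes = ["rise", "fall", "rise"]  # hypothetical outputs of three trees
print(subsets)
print(majority_vote(tree_votes))
```

Each subset has the same size as the original data but usually contains duplicates, which is what lets the individual trees differ from one another.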
In the training data set, stocks are divided into N classes based on the forward excess returns of each stock. The trained RF model is then used in the subsequent trading period to predict the class probabilities for each stock. We construct our random forest model without modifying the algorithm, as it is believed that the original RF has enough capacity to handle large numbers of variables in datasets and gives unbiased estimates for real-world classification problems, including those in finance.
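One possible way to form such classes is sketched below. The paper does not say how the excess return or the class boundaries are defined, so this sketch assumes that the excess return is measured against the average return of all stocks and that the classes are equal-sized groups by rank; the function name and the tickers "A" to "D" are made up for illustration.

```python
def label_by_excess_return(forward_returns, n_classes):
    """Map each stock to a class 0..n_classes-1 by rank of its excess return.

    forward_returns: dict of ticker -> forward return over the next period.
    The benchmark is assumed to be the mean return across all stocks.
    """
    benchmark = sum(forward_returns.values()) / len(forward_returns)
    excess = {ticker: r - benchmark for ticker, r in forward_returns.items()}
    ranked = sorted(excess, key=excess.get)  # lowest excess return first
    return {
        ticker: (rank * n_classes) // len(ranked)
        for rank, ticker in enumerate(ranked)
    }


# Hypothetical forward returns for four made-up tickers.
returns = {"A": 0.05, "B": -0.02, "C": 0.01, "D": 0.03}
labels = label_by_excess_return(returns, n_classes=2)
print(labels)
```

With two classes, the lower half of the ranking receives label 0 and the upper half label 1, and these labels would serve as the training targets for the classifier.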
In principle, the random forest consists of many deep but uncorrelated decision trees built upon different samples of the data. The process of constructing a random forest is simple. For each decision tree, we first randomly generate a subset as a sample from the original dataset. Then, we grow a decision tree with this sample to its maximum depth $S$. Meanwhile, the $s_p$ features used at each split are selected at random from the $p$ features. After repeating the procedure numerous times with the original dataset, $O$ decision trees are generated. The final output is an ensemble of all decision trees, and the classification is conducted via a majority vote. The computational complexity can be simply estimated as

$$\mathcal{O}\big(O \, (p \cdot n_{ins} \cdot \log n_{ins})\big) \qquad (1)$$

where $n_{ins}$ represents the number of instances in the training dataset. Three parameters must be tuned to check the robustness of the RF on classification, i.e., the number of trees $O$, the maximum depth $S$ and the number of features $s_p$ for each split. We set the maximum depth $S$ to be unlimited, so that the nodes are expanded until all leaves are pure or until all leaves contain fewer than two samples. Regarding the feature subsampling, we typically choose $s_p = \sqrt{p}$. The influence of the number of trees on the classification accuracy and the out-of-sample performance is then systematically investigated.

Fig. 3 Output of Random Forest Algorithm

Figure 3 shows the output of a regression model with the training and test datasets of stocks. This model gives an accuracy of around 85%; to increase the accuracy further, we turned to Twitter sentiment analysis. Between 700 and 900 records were collected as historical stock data. Twitter data were also collected for the same period of time. While the Twitter data is available for all days in the given period, the stock values obtained using Yahoo! Finance were (understandably) absent for weekends and other holidays when the market is closed. In order to complete this data, we approximated the missing values. If the stock value on a given day is x and the next available data point is y, with n days missing in between, we estimate the value on the first missing day after x as (y+x)/2 and then apply the same approach repeatedly until all the gaps are filled.

First, we retrieve the tweets and clean them; cleaning the tweets includes removing all hashtags, unnecessary spaces and tabs, and all special characters. Applying sentiment analysis to the tweets then gives the respective sentiment scores; these scores appear as percentages, i.e. the positive and negative share of the tweets. Once the tweets had been collected and their polarity decided, the next step was to collect data from the stock exchange market. Data was collected via Yahoo! Finance. We considered the closing price column as our target, and thus we combined the tweets from Twitter and the stock price on each particular date. Figure 2 shows the dataset along with the sentiment values of the tweets (here we have taken the example of the company TCS).
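The gap-filling rule for non-trading days described above can be written out as a short Python sketch. It reads the rule as follows: each missing day is set to the average of the value just before it (known or already filled) and the next known value. The function name `fill_gaps`, the use of `None` for missing days and the sample prices are assumptions made for illustration.

```python
from typing import List, Optional


def fill_gaps(prices: List[Optional[float]]) -> List[Optional[float]]:
    """Fill missing days (None) with (x + y) / 2, working left to right.

    x is the value on the previous day (known or already filled) and
    y is the next known value. Gaps at the very start or end stay None
    because one of the two neighbours is unavailable.
    """
    filled = list(prices)
    for i in range(len(filled)):
        if filled[i] is not None:
            continue
        if i == 0 or filled[i - 1] is None:
            continue  # no earlier value x to start from
        y = next((p for p in prices[i + 1:] if p is not None), None)
        if y is None:
            continue  # no later value y to average with
        filled[i] = (filled[i - 1] + y) / 2
    return filled


# Friday close 100.0, weekend missing, Monday close 108.0 (made-up numbers).
print(fill_gaps([100.0, None, None, 108.0]))
```

For the sample above, the first missing day becomes (100 + 108) / 2 = 104 and the second becomes (104 + 108) / 2 = 106, so the filled values approach the next known price step by step.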



TABLE 1
Sentiment values

Further analyzing our data, we arrived at the result in Figure 5, which shows a high percentage of positive tweets resulting in a rise in the stock price of that company.

Fig. 4 Pie Chart of Sentiments

VII. SCOPE OF THE PROJECT

We have investigated the causative relation between user sentiments, as measured from a large-scale collection of tweets from twitter.com, and the stock values. Our results show, firstly, that public mood can indeed be captured from large-scale Twitter feeds by means of simple sentiment analysis. Our results agree in part with earlier work, but there are some major differences as well. Firstly, our results show a better correlation of the positive, negative and neutral dimensions with the NSE values, unlike other work, which showed a high correlation with only the neutral mood dimension.

In future work, we would like to test and apply a model of economic growth for stock market prediction and examine how economic growth models can impact stock market prediction. We would also like to conduct a comparative study of deep learning classifiers and severe learning classifiers centered on the parameters used for stock market modeling using a feature reduction algorithm.

It is possible that a higher correlation could be obtained if the actual mood is studied. It may be hypothesized that people's mood indeed affects their investment decisions, hence the correlation.

VIII. CONCLUSION

The solution proposed in this paper is to use Twitter sentiment analysis to predict the rise or fall of the price of a stock. This is done by fetching raw historical data of the stock along with the most recent tweets related to that company. These tweets are analyzed using text mining. For example, if words such as good, great or any other positive words are used along with the name of the company, then the result is positive; otherwise the result is negative. This information is processed using the Random Forest algorithm, after which we get many features along with a positive and a negative feature. These positive and negative features are selected and classified so that we can get the overall result, which may be positive or negative. This helps the investor make an intelligent decision.

IX. REFERENCES

[1] Kute, Shyam, and Sunil Tamhankar. "A survey on stock market prediction techniques." International Journal of Science and Research (2013).

[2] Khedr, Ayman E., and Nagwa Yaseen. "Predicting stock market behaviour using data mining technique and news sentiment analysis." International Journal of Intelligent Systems and Applications 9.7 (2017): 22.

[3] Maini, Sahaj Singh, and K. Govinda. "Stock market prediction using data mining techniques." 2017 International Conference on Intelligent Sustainable Systems (ICISS). IEEE, 2017.

[4] Sharma, Ashish, Dinesh Bhuriya, and Upendra Singh. "Survey of stock market prediction using machine learning approach." 2017 International conference of Electronics, Communication and Aerospace Technology (ICECA). Vol. 2. IEEE, 2017.

[5] Alostad, Hana, and Hasan Davulcu. "Directional prediction of stock prices using breaking news on Twitter." 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. IEEE, 2015.

[6] Mankar, Tejas, et al. "Stock Market Prediction based on Social Sentiments using Machine Learning." 2018 International Conference on Smart City and Emerging Technology (ICSCET). IEEE, 2018.

[7] Navale, G. S., et al. "Prediction of stock market using data mining and artificial intelligence." International Journal of Engineering Science 6539 (2016).
