Stock Market Price Prediction Using Sentiment Analysis
SENTIMENT ANALYSIS
BACHELOR OF TECHNOLOGY
IN
By
Kritik Verma (171009)
OF
December 2020
TABLE OF CONTENTS
DECLARATION
ACKNOWLEDGEMENT
LIST OF FIGURES
ABSTRACT
CHAPTER – 1: INTRODUCTION
1.4 Procedure
2.1.2 Conclusion
2.2.2 Conclusion
3.1 Reliance
3.1.1 Results
3.1.2 Conclusion
3.2.1 Results
3.2.2 Conclusion
3.3 Wipro
3.3.1 Results
3.3.2 Conclusion
3.4 Infosys
3.4.1 Results
3.4.2 Conclusion
4.1 TextBlob
4.2 Code
4.3 Results
5.2 Code
5.3 Results
CHAPTER – 6: CONCLUSION
REFERENCES
DECLARATION
We hereby declare that the work reported in the B.Tech Project Report entitled “Stock Market
Price Prediction Using Sentiment Analysis” submitted at Jaypee University of Information
Technology, Waknaghat, India is an authentic record of our work carried out under the supervision
of Dr Naveen Jaglan. We have not submitted this work elsewhere for any other degree or diploma.
171009 171016
This is to certify that the above statement made by the candidates is correct to the best of my
knowledge.
Dr Naveen Jaglan
Date:
ACKNOWLEDGEMENT
We would like to thank God for guiding us throughout our academic journey and to acknowledge
our project supervisor, Dr Naveen Jaglan, for his undying support, priceless motivation and
guidance throughout the project duration. Moreover, we extend our sincere gratitude to all the
faculty members and non-teaching staff of the Department of Electronics and Communication
Engineering for their contribution towards the success of this work.
The role our friends played during the entire period cannot go unmentioned either. Thank you all
for your moral support and encouragement. We are deeply honoured and indebted to you all.
To our families, we appreciate the support you have given us throughout our academic journey.
This quest has not been easy, but you have always stood solidly by our side.
Thank you.
LIST OF ACRONYMS AND ABBREVIATIONS
OS Operating System
AI Artificial Intelligence
K-NN K-Nearest Neighbor
ML Machine Learning
IT Information Technology
LIST OF FIGURES
Figure 1.3: Electoral College projections for 2020 US elections using sentiment analysis
Figure 3.12: Line Graph for Sentiment Analysis on Infosys
ABSTRACT
The intent of this project is to create a stock market prediction model based on current trends in
the marketplace, historical data and the general sentiment of the public as expressed through
different media outlets.
Financial market data is generated in enormous volume and changes every second. The stock
market is a complex and challenging system in which people either gain money or lose their entire
life savings. In this work, an attempt is made to predict the stock market trend. Two models are
built, one for daily prediction and the other for monthly prediction. Since we already have the
data, we can train supervised models and also test them. The model with the best accuracy can
then be chosen.
Sentiment analysis is the contextual mining of text that identifies and extracts subjective
information from source material, helping a business to understand the social sentiment towards
its brand, product or service while monitoring online conversations. Sentiment analysis (or
opinion mining) uses natural language processing and machine learning to interpret and classify
emotions in subjective data.
Up to 70% accuracy is observed using supervised machine learning algorithms on the daily
prediction model. The monthly prediction model tries to evaluate whether there is any similarity
between the trends of any two months. Evaluation shows that the trend of one month is only weakly
correlated with the trend of another month.
CHAPTER – 1
INTRODUCTION
In simple words, a stock exchange is an open marketplace where we can buy and sell the shares of
publicly traded companies; in India it is regulated by SEBI.
Not many people prefer to invest their money in the stock market because of the high risk
involved, but those who dare can get good to excellent returns. In India alone, where the
population is 1.3 billion, there are only 18 million investors in the equity market.
Before jumping blindly into investing, people tend to check various statistics, graphs, charts, etc.
in order to get a good read on where to invest and where not to. General public sentiment plays a
huge role in whether a stock will perform well or not. It is not the only factor that affects the
final result, but it can be considered a major one. In this project we focus only on sentiment
analysis and how it affects the final prediction of a stock.
Sentiment analysis, or opinion classification, falls into the broad category of text
classification tasks, where you are supplied with a phrase or a list of phrases and your classifier
is supposed to tell whether the sentiment behind it is positive, negative or neutral. Sometimes
the third class is left out to keep it a binary classification problem. In recent tasks, sentiments
like "somewhat positive" and "somewhat negative" are also being considered.
As sentiment analysis has improved over the last few decades, so have its applications. Sentiment
analysis [1] is now being used for everything from specific product marketing to anti-social
behaviour recognition. Nowadays there are many media through which the general public can express
their emotions, such as Facebook, Twitter, YouTube and other smaller websites. In earlier days
news channels were the only medium; now everyone can share their point of view.
Sentiment analysis is not limited to gauging emotions about the stock market; it can be applied to
any topic, such as elections. Nowadays every political party has an account on social media, so it
is a very important tool in elections, since in this digital age voices can reach farther through it.
As more and more users post about the products and services they use, or express their political
and religious views on the internet, these posts become valuable information: from them an analyst
can predict the opinion of the general public, which enables targeted advertising that can turn
out to be extremely profitable [2].
Fig 1.3: Electoral College projections for 2020 US elections using sentiment analysis[3]
For the purpose of data visualization the matplotlib library is used, which helps in showing a
2-D plot for a given number of days.
Fig 1.5: Sigmoid Function used in Logistic Regression
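The sigmoid in Fig 1.5 maps any real-valued score to a value between 0 and 1, which logistic
regression reads as a class probability. A minimal sketch follows; the function names and the 0.5
decision threshold are assumptions made for illustration, not taken from the project code.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_class(z: float, threshold: float = 0.5) -> int:
    """Return 1 (for example, an 'up' trend) when the probability reaches the threshold."""
    return 1 if sigmoid(z) >= threshold else 0
```

A score of zero sits exactly on the decision boundary, and scores far from zero saturate towards
0 or 1.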
Naive Bayes classifiers work well because they are faster than many other classifiers, and they
are used in many areas; for example, Gmail uses one for spam mail filtering, and they are also
used for search engine optimization (SEO).
Two distinct models have been built to predict the stock market trend. The first model predicts
the stock market trend for the next day (daily prediction model) by taking all available data on a
daily basis as input. The second model predicts the stock market trend for the next month
(monthly prediction model) by taking the available data on a monthly basis.
The first contribution of the proposed work is that a few features have been derived from the
available historical data by using statistics. One of the statistical parameters considered is
the correlation between the trend of a day and the volume of stock traded on the same day. The
volume-traded feature in the historical data reflects both bought and sold stocks for each day.
When the trend is up, the volume traded may indicate sold shares; similarly, when the trend is
down, the volume traded may reflect shares bought by traders. This feature has been combined with
the trend of that day to determine whether the volume was sold or bought by traders. A large
traded volume has a positive effect if and only if the shares are bought by traders. The
assumption is that shares are being bought when the traded volume is high and the trend is down.
If the traded volume is high and the trend is up, shares are being sold to take profit. One more
statistical parameter is computed by considering the up/down pattern of the past n days. These
features are generated for both the training and the testing dataset.
The prediction model is then built on the training dataset. Another contribution of this work is
the monthly prediction model, in which the trend of the whole month is computed from historical
data. Input to the model is given month-wise. The prediction for month m of year y depends on
months m−1, m−2, ... of year y. Here the assumption is that the trend of month m in year y will
follow the trend of some other month in the same year.
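The daily features described above can be sketched as follows. This is only an illustration
under stated assumptions: the function name, the dictionary keys, the default n = 3 and the sign
convention (positive for volume assumed bought, negative for volume assumed sold) are all
hypothetical choices, not the original implementation.

```python
from typing import Dict, List


def daily_features(closes: List[float], volumes: List[float], n: int = 3) -> List[Dict[str, float]]:
    """Derive per-day features from closing prices and traded volumes."""
    rows: List[Dict[str, float]] = []
    trends: List[int] = []  # +1 when the close rose versus the previous day, otherwise -1
    for i in range(1, len(closes)):
        trend = 1 if closes[i] > closes[i - 1] else -1
        # Up/down pattern of the previous n days (today is not yet in `trends`).
        ups_last_n = sum(1 for t in trends[-n:] if t == 1)
        rows.append({
            "trend": trend,
            # Assumption from the text: volume on a down day is treated as bought
            # (positive), volume on an up day as sold (negative).
            "signed_volume": -trend * volumes[i],
            "ups_last_n": ups_last_n,
        })
        trends.append(trend)
    return rows
```

Each returned row corresponds to one trading day after the first, so it can be paired with the
next day's trend as the training label.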
On the dataset considered, the boosted decision tree performs better than Support Vector Machine
and Logistic Regression.
The stock market follows a random walk, which implies that the best prediction you can have about
tomorrow's value is today's value. Because of the huge fluctuations in the market, developing an
accurate model is a considerable task, and because of these fluctuations a person who invests
hard-earned money hoping for modest gains can lose faith in the market. Stock prices are very
dynamic and can change in the blink of an eye because of the nature of the financial market, known
factors (the previous day's closing price, the P/E ratio and so forth) and unknown factors (such
as election results, rumours and so on). Attempts have been made to find a machine learning model
that makes this prediction easier for people. Research projects in this field are organised
around three main points. The targeted price change can occur in less than a minute, tomorrow or
some time in the coming week, or over months. The set of stocks can range from a small selection
of particular stocks, to the stocks of one industry, to all stocks in general. The data used for
prediction can come from a number of different sources, such as international news outlets,
economic outlets, the general sentiment of the public towards the company, or historical stock
price data.
The main target is to obtain the future values of the stocks, or to understand the dynamic and
volatile nature of the market, or to understand the market trend. In the stock market prediction
model there is a dummy prediction and a real-time prediction. In the dummy prediction a set of
rules is defined and the future values are calculated using the average price, whereas in the
real-time prediction it is compulsory to use the internet and observe the current prices of
different shares.
Part 1:
This step covers downloading the data from the internet. We are predicting the financial market
value of a stock, so the share values up to the closing date are downloaded from the website.
Part 2:
In the next step the data values of the stock are converted into a CSV (Comma-Separated Values)
file so that they load easily into the algorithm.
Part 3:
In the next step the GUI is opened, and when we click on the SVM button it shows a window from
which we select the stock dataset file.
Part 4:
After the stock dataset file is selected from the folder, it shows graphs of the stock before
mapping and the stock after mapping.
Part 5:
In the next step the algorithm calculates the log2c and log2g values that minimise the error, so
that it predicts the graph for the dataset values efficiently.
Part 6:
In the final step the algorithm displays the predicted value graph of the selected stock, which
shows the original value and the predicted value of the stock.
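The search in Part 5 can be pictured as a grid search over the SVM parameters C and gamma on a
log2 scale. The sketch below is hypothetical: the grid ranges, the function names and the
stand-in error function are assumptions, and in practice `cv_error` would train an SVM and return
its cross-validation error.

```python
import math
from typing import Callable, Iterable, Tuple


def grid_search_log2(
    cv_error: Callable[[float, float], float],
    log2c_values: Iterable[int] = range(-5, 16, 2),   # assumed grid for log2(C)
    log2g_values: Iterable[int] = range(-15, 4, 2),   # assumed grid for log2(gamma)
) -> Tuple[int, int, float]:
    """Return (log2c, log2g, error) for the grid point with the smallest error."""
    log2g_list = list(log2g_values)
    best = None
    for log2c in log2c_values:
        for log2g in log2g_list:
            err = cv_error(2.0 ** log2c, 2.0 ** log2g)
            if best is None or err < best[2]:
                best = (log2c, log2g, err)
    return best


def toy_error(c: float, gamma: float) -> float:
    """Stand-in for a real cross-validation error; smallest at C = 2**3, gamma = 2**-5."""
    return (math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2


best_log2c, best_log2g, best_err = grid_search_log2(toy_error)
```

Searching in powers of two covers several orders of magnitude with few evaluations; a finer grid
around the best point can follow if needed.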
2.2.2 Conclusion
In this project, we proposed the use of data collected from different global financial markets
with machine learning algorithms to predict stock index movements. The SVM algorithm works on the
large dataset collected from different global financial markets. Also, SVM does not exhibit a
problem of overfitting. Various machine learning based models are proposed for predicting the
daily trend of market stocks. Numerical results suggest high efficiency. A practical trading model
is built upon our well-trained predictor. The model generates higher profit compared to the
selected benchmarks.
CHAPTER – 3
SENTIMENT ANALYSIS OF DIFFERENT COMPANIES
3.1 Reliance
Reliance Industries Limited (RIL) was started by the late Dhirubhai Ambani with his
brother-in-law. Reliance started with the manufacturing of women's clothes and then expanded into
oil, telecom, news, finance, energy, e-commerce, etc. The company is now managed by Mr Mukesh
Ambani. People have trusted this company because it is a family business and has given exceptional
returns. Reliance Industries holds a leading market position and market share in India, which is
considered its greatest strength. Its business network is not limited to India; it does business
across more than five continents. In the Indian market it has very few competitors. Reliance
accounts for 16% of the Nifty 50, so a movement in the Reliance share alone can have a significant
effect on the whole stock market.
3.1.1 Results
Fig 3.2: Bar Graph for Sentiment Analysis on Reliance
Moreover, Reliance Industries on Friday reported a 15 percent year-on-year drop in consolidated
net profit for the July-September quarter, as the COVID-19 pandemic hit its key petrochemicals and
oil refining businesses. In the second quarter, its consolidated net profit stood at Rs 9,567
crore, compared with Rs 11,262 crore in the year-ago quarter [7]. This is certainly not a great
sign, as the stock price of the company has also gone down.
3.2.1 Results
3.2.2 Conclusion
After observing the above results we can conclude that the general public sentiment, as derived
by running sentiment analysis on 50 Twitter users on 29th November 2020, leans towards the
positive side of the spectrum, which is a good sign. This type of public sentiment towards the
company can have a strongly positive impact on its stock.
The Q1 profits of TCS fell 14 percent, as did the profits of many other companies, and the Q2
profits declined by 7 percent, which is relatively better than our previous company, i.e.,
Reliance. A decline in profits is not a good sign for any company, but despite these declines the
company still manages to hold a positive public opinion in the market, which in the end helps the
company's image and its stock.
3.3 Wipro
Wipro Limited was started by Mr Azim Premji's father before Independence. It was a vegetable oil
company and expanded into the IT industry in 1981. It is headquartered in Bangalore, Karnataka,
India. In 2013, Wipro demerged its non-IT businesses and formed the privately held Wipro
Enterprises.
The company was incorporated on 29 December 1945 in Amalner, Maharashtra by Mohamed Premji as
"Western India Palm Refined Oil Limited", later abbreviated to "Wipro". It was initially set up as
a manufacturer of vegetable and refined oils in Amalner, Maharashtra, British India, under the
trade names Kisan, Sunflower and Camel.
In 1966, after Mohamed Premji's death, his son Azim Premji took over Wipro as its chairman at the
age of 21.
Wipro's initial public offering was in 1946. Wipro's equity shares are listed on the Bombay Stock
Exchange, where it is a constituent of the BSE SENSEX index, and on the National Stock Exchange
of India, where it is a constituent of the S&P CNX Nifty. The American Depositary Shares of the
company have been listed on the NYSE since October 2000.
3.3.1 Results
Fig 3.7: Pie Chart for Sentiment Analysis on Wipro
3.3.2 Conclusion
After observing the above results we can conclude that the general public sentiment, as derived
by running sentiment analysis on 50 Twitter users on 29th November 2020, leans towards the neutral
side of the spectrum. This is not ideal, since every company's main focus is on increasing its
profits and a neutral public sentiment does little to help that. This type of public sentiment
can have a somewhat negative impact on the company's stock, but since the last two companies that
we worked on were much bigger than Wipro, it might not affect Wipro as negatively as one might
expect. The neutral sentiment might even work in its favour.
If we look at the net profit that Wipro made this year amid the COVID-19 pandemic, it is quite
commendable, as it made a profit of 1-3 percent this year. This might work in the company's
favour, and the public sentiment, which at the moment is relatively neutral, might become more
positive as time passes.
3.4 Infosys
Infosys Limited is an Indian multinational corporation that provides business consulting,
information technology and outsourcing services. The company is headquartered in Bangalore,
Karnataka, India. Infosys is the second-largest Indian IT company after Tata Consultancy Services
by 2017 revenue figures and the 596th largest public company in the world based on revenue. On 29
March 2019, its market capitalisation was $46.52 billion. The credit rating of the company is A−
(rating by Standard and Poor's).
Infosys was started by eight people without a computer. It did not make any profit for seven
years, until 1989; after liberalization allowed foreign companies to do business in India, it too
started generating profit. In 1993 it was listed on the stock exchange and made many employees
millionaires. The management of Infosys has been rocky, which might be the reason it is still far
behind TCS in market capitalization. Recently Infosys has been acquiring many companies in
America, showing a global expansion. Its training centre in Mysore is world renowned. Infosys is
expanding into many new fields such as IoT, Blockchain, etc., but its main revenue is still
generated from customer services.
During this pandemic, when MNCs like Accenture were laying off employees, Infosys showed loyalty
towards its employees and also gave promotions this October, which shows how strong the company
is from the inside.
3.4.1 Results
3.4.2 Conclusion
After observing the above results we can conclude that the general public sentiment, as derived
by running sentiment analysis on 50 Twitter users on 29th November 2020, leans towards the neutral
side of the spectrum, which can be considered acceptable. This type of public sentiment towards
the company can have a relatively neutral impact on its stock.
CHAPTER - 4
4.1 TextBlob
TextBlob is a library made for Python (either 2 or 3) to process the vast amounts of data that
are produced in textual format. The advantage of using TextBlob is that it offers a relatively
simple API, i.e., an Application Programming Interface, for common natural language processing
tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification,
translation, and more.
In this project we use this library to perform sentiment analysis on the data that we retrieve
from Twitter about various companies. Sentiment analysis is easy to use in this library through
the built-in sentiment property, which returns a named tuple of the form Sentiment(polarity,
subjectivity). Polarity is a value from -1 to 1 that indicates how far the sentiment of the text
leans towards the negative or the positive end of the spectrum, with values near 0 meaning that
the text is neutral and does not express any specific sentiment. Subjectivity is a value from 0
to 1 that indicates how much of the text is personal opinion rather than factual information.
TextBlob also provides a Naive Bayes classifier (NaiveBayesAnalyzer) that groups text into either
positive or negative polarity. This classifier has been trained on a corpus of movie reviews and
applies what it has learned from those reviews to new text. The default sentiment property, which
returns polarity and subjectivity, uses the lexicon-based PatternAnalyzer instead.
4.2 Code
The second step is to initialize all the API keys and tokens provided by the Twitter Developer
site in order to access the tweets we need; we then authenticate those keys and tokens using the
tweepy library.
Next we retrieve the tweets that we need by using the Cursor class in the tweepy library. Here we
provide the query and the language for our search and also the length of the dataset that we
require. Then we convert the retrieved data into a readable tabular format that gives us the date
of the tweet in one column and the text in the next one.
Next, we clean the data using the regular expression library, commonly imported as ‘re’. Here we
create two functions: one to strip hashtags, URLs, retweet markers, etc. from the text, and the
other to remove any unwanted emojis. After this cleaning process the text is in a format that is
much easier for the TextBlob library to process, and the sentiment analysis gives more accurate
results.
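The two cleaning functions might look like the sketch below. The exact patterns used in the
project are not shown in the report, so the regular expressions, the emoji ranges and the function
names here are assumptions.

```python
import re

# Assumed set of emoji code point ranges; real tweets may need a wider set.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)


def clean_text(text: str) -> str:
    """Strip the retweet marker, URLs, @mentions and '#' signs from a tweet."""
    text = re.sub(r"^RT\s+", "", text)         # leading retweet marker
    text = re.sub(r"https?://\S+", "", text)   # URLs
    text = re.sub(r"@\w+:?", "", text)         # @mentions, with an optional trailing colon
    text = text.replace("#", "")               # keep the hashtag word, drop the '#'
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace


def remove_emojis(text: str) -> str:
    """Remove emoji characters and tidy the spacing they leave behind."""
    return re.sub(r"\s+", " ", EMOJI_PATTERN.sub("", text)).strip()
```

Keeping the hashtag word while dropping the '#' preserves terms such as company names that often
carry the sentiment-bearing context.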
Fig 4.5: Sentiment Analysis
Next we perform the main sentiment analysis step using the sentiment property of the TextBlob
library. We create two functions, subjectivity() and polarity(), and two new columns of the same
names to hold the values that we get from the sentiment property. We then add another column that
names the sentiment based on the score received, and finally we build another dataframe that
shows the whole output in an orderly fashion.
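Naming the sentiment from the score can be sketched as below. With TextBlob, the polarity would
come from `TextBlob(text).sentiment.polarity`; to stay self-contained this sketch starts from
hypothetical (polarity, subjectivity) pairs, and the zero cut-off and label names are assumptions.

```python
from typing import List, Tuple


def label_sentiment(polarity: float) -> str:
    """Name the sentiment from a polarity score in [-1, 1], assuming a cut-off at zero."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


# Hypothetical (polarity, subjectivity) pairs, shaped like TextBlob's Sentiment tuple.
scored: List[Tuple[float, float]] = [(0.35, 0.60), (0.0, 0.0), (-0.20, 0.45)]

# One output row per tweet: polarity, subjectivity and the derived label.
table = [(pol, subj, label_sentiment(pol)) for pol, subj in scored]
```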
Fig 4.6: Clubbing the data
Further, we club the data together, since Twitter gives us many tweets for the same day. We
create another dataframe based on the general sentiment on each particular day.
Finally we apply standard scaling and label encoding, split the data for training and testing,
fit the model and predict, and see how the model scores on the testing data.
4.3 Results
As we can see in the figure above, our model is not performing well. We get an R2 score in the
negative range, which is bad in and of itself, and we also get very unrealistic Mean Absolute
Error and Mean Squared Error values. This is because the dataset we have is extremely small. To
get reliable results we need at least 500-1000 rows, but because of Twitter's guidelines we can
only get data for the past 10 days. This restricts us to a very small dataset, which is not
suitable for making any sort of prediction.
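The three measures mentioned above can be computed by hand, which also shows why R2 can be
negative: it drops below zero whenever the predictions are worse than always predicting the mean
of the true values. The numbers below are hypothetical and are not the project's results.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R2 = 1 - SS_res / SS_tot; negative when the model is worse than the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical closing prices and poor predictions from a model trained on a few rows.
actual = [100.0, 102.0, 101.0, 103.0]
predicted = [110.0, 95.0, 108.0, 96.0]

mae = mean_absolute_error(actual, predicted)  # 7.75
mse = mean_squared_error(actual, predicted)   # 61.75
r2 = r2_score(actual, predicted)              # about -48.4
```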
CHAPTER – 5
5.2 Code
Fig 5.2: Reading the Scraped Data
For this part of the project we had to increase our dataset substantially, since it so far
comprised just 10 rows of data from Twitter. We therefore performed web scraping on the financial
news site moneycontrol.com and scraped the headlines containing news about TCS for the past 3
years, dating back to 2017. After scraping the data we saved it in CSV format and then cleaned it
further.
Data cleaning in this case was straightforward, as the headlines contained only proper text,
unlike the tweets, which included hashtags, emojis, retweets, etc.
We only needed to extract the date from the column that held the time, date and source of the
news, and we did so with a simple regular expression search. After that we renamed the columns
and converted the Date column to datetime format. We then dropped some duplicate rows and deleted
the unnecessary columns.
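A regular expression search of this kind could look like the following sketch. The layout of the
scraped column is not shown in the report, so the sample string, the pattern and the function name
are hypothetical.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed layout: "<time> | <day> <Mon> <year> | <source>", e.g. "10.45 am | 29 Nov 2020 | ...".
DATE_PATTERN = re.compile(r"\b(\d{1,2} [A-Za-z]{3} \d{4})\b")


def extract_date(meta: str) -> Optional[date]:
    """Pull the first 'DD Mon YYYY' date out of a combined time/date/source string."""
    match = DATE_PATTERN.search(meta)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d %b %Y").date()


sample = "10.45 am | 29 Nov 2020 | Moneycontrol News"  # hypothetical scraped value
parsed = extract_date(sample)
```

Returning None for rows without a recognisable date makes it easy to drop or inspect them before
converting the column to datetime format.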
Now we perform sentiment analysis on our dataset, in much the same way as we did with TextBlob.
The difference is that we get a much better polarity score when we use the NLTK library. Running
the first line gives us the polarity scores in the form of dictionaries, from which we extract
the compound values by running the second line. We then make another column named sentiment based
on the score in our polarity column. Finally, we drop the headlines and polarity columns, as they
are no longer needed.
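The dictionaries described above match the output of NLTK's VADER analyzer, whose
`SentimentIntensityAnalyzer().polarity_scores(text)` returns the keys neg, neu, pos and compound;
the report does not name the analyzer, so this is an assumption. The sketch below works on
hypothetical score dictionaries, and the cut-off of 0.05 on either side of zero is also an
assumed value.

```python
from typing import Dict, List


def compound_label(scores: Dict[str, float], cutoff: float = 0.05) -> str:
    """Turn a polarity-score dictionary into a label using its 'compound' value."""
    compound = scores["compound"]
    if compound >= cutoff:
        return "Positive"
    if compound <= -cutoff:
        return "Negative"
    return "Neutral"


# Hypothetical score dictionaries for three headlines.
headline_scores: List[Dict[str, float]] = [
    {"neg": 0.0, "neu": 0.58, "pos": 0.42, "compound": 0.64},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.37, "neu": 0.63, "pos": 0.0, "compound": -0.48},
]

polarity_column = [s["compound"] for s in headline_scores]        # the extracted compound values
sentiment_column = [compound_label(s) for s in headline_scores]   # the derived sentiment column
```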
Fig 5.5: Clubbing the Data
Further, we club the data together, as we have multiple news items for a single day. We create
another dataframe based on the general sentiment on each particular day by applying a mode
function in a loop.
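Taking the mode per day can be sketched with the standard library as follows. The (date, label)
row layout and the function name are assumptions; on a tie, this version keeps the label that
appears first for that day.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def daily_mode(rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Collapse (date, sentiment) rows into one most-frequent sentiment per date."""
    by_day: Dict[str, List[str]] = defaultdict(list)
    for day, label in rows:
        by_day[day].append(label)
    # most_common(1) returns the label with the highest count; ties keep first-seen order.
    return {day: Counter(labels).most_common(1)[0][0] for day, labels in sorted(by_day.items())}


# Hypothetical labelled headlines.
news = [
    ("2020-11-27", "Positive"),
    ("2020-11-27", "Positive"),
    ("2020-11-27", "Negative"),
    ("2020-11-28", "Neutral"),
]
per_day = daily_mode(news)
```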
Now we split our data into training and testing sets, apply label encoding to change our strings
to numbers, and use standard scaling to scale the data for better results. We then apply various
regression models.
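The preprocessing above can be illustrated without any external library. The helper names, the
20% test fraction and the chronological (unshuffled) split are assumptions; the report does not
state how the split was made. The scaler statistics are taken from the training part only, so
that no information from the test rows leaks into training.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def label_encode(labels: Sequence[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct string to an integer, in sorted order of the class names."""
    mapping = {name: code for code, name in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def chronological_split(rows: Sequence, test_fraction: float = 0.2) -> Tuple[list, list]:
    """Keep the earliest rows for training and the latest rows for testing."""
    cut = len(rows) - int(round(len(rows) * test_fraction))
    return list(rows[:cut]), list(rows[cut:])


def standard_scale(train: Sequence[float], test: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Scale to zero mean and unit variance using statistics from the training data only."""
    mu = mean(train)
    sigma = pstdev(train) or 1.0  # guard against a constant column
    return [(v - mu) / sigma for v in train], [(v - mu) / sigma for v in test]


codes, mapping = label_encode(["Positive", "Negative", "Neutral", "Positive"])
train_rows, test_rows = chronological_split(list(range(10)))
train_scaled, test_scaled = standard_scale([1.0, 2.0, 3.0], [4.0])
```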
First, we apply a simple linear regression model to our training and testing dataset. As we can
see, we are getting a very high R2 score, which can mean either that our model is perfect in
every sense, which is quite impossible, or that our model is overfitting. Taken at face value,
though, the results are satisfactory.
The second model is the Ridge Regression model, which gives us a result that is not very
different. We might get a better result by tweaking the value of alpha, which by default is set
to 1. As of now, however, this model too is overfitting our data.
Lastly, we have the Lasso Regression model, which gives us a promising result. Here the R2 score
sits at a more plausible level, suggesting that the model is not overfitting.
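The role of alpha in Ridge and Lasso can be seen in a one-feature sketch with no intercept, where
both penalised fits have closed forms. This is a textbook illustration under the objectives
written in the comments, not the project's code; libraries may scale the loss differently, so
alpha values are not directly comparable across implementations.

```python
from typing import Sequence


def fit_slope(x: Sequence[float], y: Sequence[float], alpha: float = 0.0, penalty: str = "none") -> float:
    """Fit y ~ w * x for centred data and return the slope w."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    if penalty == "ridge":
        # Minimises sum((y - w*x)^2) + alpha * w^2: the slope is shrunk towards zero.
        return sxy / (sxx + alpha)
    if penalty == "lasso":
        # Minimises 0.5 * sum((y - w*x)^2) + alpha * |w|: soft-thresholding,
        # which sets the slope exactly to zero once alpha is large enough.
        shrunk = max(abs(sxy) - alpha, 0.0) / sxx
        return shrunk if sxy >= 0 else -shrunk
    return sxy / sxx  # ordinary least squares


x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]

ols = fit_slope(x, y)                                     # 2.0
ridge = fit_slope(x, y, alpha=1.0, penalty="ridge")       # 4 / 3
lasso = fit_slope(x, y, alpha=1.0, penalty="lasso")       # 1.5
lasso_big = fit_slope(x, y, alpha=5.0, penalty="lasso")   # 0.0
```

Ridge only shrinks coefficients, while Lasso can remove a feature entirely, which is one reason
the two can behave differently on the same data.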
5.3 Results
As we can see, there is a substantial difference in the values of our performance measures
compared with the Twitter dataset code. We can attribute this, first, to the larger dataset that
has been gathered, which provides much more data for our models to train on, and, secondly, to
the Natural Language Toolkit, which does an exceptional job of detecting the sentiment of the
given text. We are getting an R2 score of 0.99, which is unrealistic and shows that our model is
overfitting, but the Lasso Regression model gives us a more credible R2 score of 0.86, which can
be considered a good fit for our data.
CHAPTER – 6
CONCLUSION
Nowadays, Data Science, Machine Learning and Artificial Intelligence are gaining a lot of traction
and are considered to be the hottest subjects for the coming decades. A huge amount of research
work is going on in these fields and a lot of it has got into the markets already such as the
recommendation systems.
Just like the recommendation systems that take into account the reviews written by a person, and
the use of natural language processing in this process, we have also used natural language
processing in a new light, i.e., for the financial market.
In the first code we use the TextBlob library for sentiment analysis on the data collected from
Twitter and then predict the future trend of the stock prices and how they will fluctuate.
In the second code we use a much larger and better dataset of news headlines collected over the
years, and also a much larger language processing library, the Natural Language Toolkit. We then
fit three different models to our data and finally get some reliable results.
REFERENCES
[1] Yazhi Gao, Wenge Rong, Yikang Shen, Zhang Xiong, “Convolutional Neural Network based
sentiment analysis using Adaboost combination”, International Joint Conference on Neural
Networks (IJCNN), 2016, Pg 1333-1338.
[2] Y. Kim and S. Myaeng, “Opinion analysis based on lexical clues and their expansion”, In
Proceedings of 6th NTCIR Evaluation Workshop, 2007.
[3] https://twitter.com/JumptuitNow/status/1323447352661856256?s=20
[4] Aparna Nayak, M. M. Manohara Pai, Radhika M. Pai, “Prediction Models for Indian Stock
Market”, Twelfth International Multi-Conference on Information Processing, 2016, Pg 441-449.
[5] V Kranthi Sai Reddy, “Stock Market Prediction Using Machine Learning”, International
Research Journal of Engineering and Technology (IRJET), Vol. 5, Issue 10.
[7] https://www.theweek.in/news/biz-tech/2020/10/30/covid-19-impact-reliance-q2-net-profit-falls-15-from-year-ago.html