Stock Market Analysis Using MapReduce and PySpark
Spark is a leading tool in the Hadoop ecosystem. Hadoop MapReduce can only be used for batch processing and cannot operate on real-time data. Spark can run stand-alone or on top of the Hadoop framework to leverage big data and perform real-time analytics in a distributed computing environment. It supports many kinds of complex analysis, including machine learning, business intelligence, streaming, and batch processing. Because it performs computations in memory, Spark can be up to 100 times faster than the Hadoop MapReduce framework for large-scale data processing.
The big-data era has forced us to think not only about fast, capable data-storage and processing frameworks but also about platforms for implementing machine learning (ML) algorithms, which have applications in many domains. With so many ML tools available, deciding which one can perform analysis and implement ML algorithms efficiently is a daunting task. Fortunately, Spark provides a flexible platform for a number of machine learning tasks, including classification, regression, optimization, clustering, and dimensionality reduction.
III. STOCK MARKET
a) STOCK MARKET INTRODUCTION
The stock market is nothing but a collection of markets and exchanges where activities such as buying, selling, and issuance of shares of publicly-held companies occur. These financial activities are conducted through institutionalized formal exchanges or over-the-counter (OTC) marketplaces.
● Whether values plummet or rise, stock prices fluctuate on a day-to-day basis.
Figure 2: These are the top 10 stocks with maximum and minimum volatility.
www.irjmets.com @International Research Journal of Modernization in Engineering, Technology and Science
[1789]
e-ISSN: 2582-5208
International Research Journal of Modernization in Engineering Technology and Science
( Peer-Reviewed, Open Access, Fully Refereed International Journal )
Volume:03/Issue:07/July-2021 Impact Factor- 5.354 www.irjmets.com
VII. MODEL ARCHITECTURE FLOWCHART
This shows that, like the convolutional network, the LSTM is very prone to overfitting: as the number of epochs increases, training accuracy rises while validation accuracy stays constant or validation loss increases.
Our next step is to choose the attributes that are necessary and drop those we do not need. We selected variables based on the output of the correlation matrix. The correlation coefficient is best suited to variables that show a linear relationship with each other, which can be checked with a scatter plot. With its help, we found the required variables: "LABEL", "Open", "High", "Low", "Close", "Volume", "Interest rate", "ExchangeRate", "VIX", "Gold", "Oil", and "TEDSpread". We selected them using the data.loc[] function.
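A sketch of this selection step with pandas, assuming the data has been collected into a DataFrame named data (the values and the extra "Noise" column are hypothetical):

```python
import pandas as pd

# Hypothetical frame standing in for the collected dataset.
data = pd.DataFrame({
    "LABEL": [1, 0], "Open": [10.0, 10.5], "High": [11.0, 10.9],
    "Low": [9.8, 10.1], "Close": [10.6, 10.3], "Volume": [1000, 1200],
    "Noise": [0.1, 0.2],   # an attribute the correlation matrix rules out
})

# Keep only the required variables, as chosen from the correlation matrix.
selected = ["LABEL", "Open", "High", "Low", "Close", "Volume"]
data = data.loc[:, selected]
```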
The next step is to split the data into training and testing sets. We implement this with the randomSplit() function, splitting the data in a ratio of 7:3.
To improve preprocessing, we divide the attributes into two sets: numCols, containing numerical values, and catCols, containing categorical variables. We apply one-hot encoding to catCols, which helps us represent category information better. Most machine learning algorithms do not work well with raw categorical variables, so it is better to encode them. A one-hot encoded entry consists of 0s and 1s, which puts the categories into a standard form the model can use whenever it is required.
Label encoding converts categorical values into numbers so that the machine can read the data, and it is an important step in preprocessing structured data. In label encoding, we map the categories to integer values between 0 and (number of classes - 1). All of the selected values were converted into numerical values with the LabelEncoder() function, except those that already consisted of numerical values.
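A sketch with scikit-learn's LabelEncoder (the category values are hypothetical):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical labels for illustration.
labels = ["bull", "bear", "bear", "bull", "flat"]

le = LabelEncoder()
# fit_transform maps each category to an integer in [0, number of classes - 1];
# classes receive indices in sorted order: bear=0, bull=1, flat=2.
encoded = le.fit_transform(labels)  # → [1, 0, 0, 1, 2]
```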
StringIndexer helps provide label indices for machine learning columns: given an input column of string values, it assigns each distinct string a numeric index.
To feed the columns into the machine learning algorithms, the entire column list needs to be combined into a single vector value. This is helpful for merging various features into one vector element and is mainly done to train ML models such as SVM, random forest, decision tree, etc. It is done using the
VectorAssembler() function. We create a new column named VectorAssembler_feature and store the output in it.
All of the provided input features are converted into a single object using the DenseVector function. The data frame has 11 columns, and the features are now ready to be given as input to the machine learning algorithm.
h) MACHINE LEARNING MODEL:
Machine learning algorithms are provided by Spark through its MLlib library. We fitted a decision tree regressor to the training data set. Here the aim is to find the label values that determine whether the stock is increasing or decreasing at the end of a particular day. We chose this model because its results were better than those of the other models we compared, and it also reduces overfitting to the training variables. We obtained 90% accuracy with the decision tree algorithm.
i) MODEL EVALUATION:
In this work, ML and big data tools have been used for better stock market analysis, because the stock market is often uncertain. For evaluating the model, we used two metrics. The first is the mean absolute error (MAE), a measure of the average absolute difference between paired observations (predicted and actual values), expressed in the same units as the data; it helps measure the accuracy of predictions for continuous variables. Our model's MAE is 1.024, which is a good score.
The second metric is R-squared, the proportion of the variance in the dependent variable that is explained by the independent variables. An R-squared above 60% is considered good; our model scores 70%.
X. CORRELATION ANALYSIS OF PRICE AND SENTIMENTS
After training and testing the model, we obtain coefficients that tell us how the stock market price is related to the headlines and to the historical price of the particular stock. Positive coefficients indicate that the two factors are directly related, while negative coefficients indicate that they are inversely related. This correlation model helps us predict the stock price for new data.
XI. LIMITATIONS OF THE STUDY
● One of the major limitations of this approach is the constantly changing nature of stock market prices. Although the model predicts the next prices, there are additional external factors that affect stock prices.
● Selecting stocks through technical analysis and algorithmic signals to decide where to invest carries risk; without proper risk management, investors may lose money.
XII. FUTURE SCOPE OF STUDY
● Potential improvements can be made to the data collection and analysis method.
● Future research can explore potential improvements such as using more refined data, different time frames, and more accurate algorithms on new datasets.
● A real-time trading model with live streaming data could be built and upgraded to calculate total returns on investments in real time; this would improve the efficiency and accuracy of the model.
XIII. CONCLUSION
● In the stock volatility analysis, zero, being the bottom of the numerical range, is a very special value of volatility: volatility is zero only if the price never changes. The top 10 minimum-volatility stocks mentioned above are those whose prices barely change, since their volatility is closest to zero.
● Higher volatility means that a security's value can potentially be spread over a larger range of values, so the top 10 maximum-volatility stocks are those whose prices can change significantly in either direction over a short time period. Lower volatility means that a security's value does not fluctuate dramatically and tends to be steadier.
● In this work, ML and big data tools have been used for better stock market analysis, because the stock market is often uncertain. With this, we can help investors avoid significant financial losses.