Awan Et Al. (2021)
Awan Et Al. (2021)
Awan Et Al. (2021)
DOI:10.32604/cmc.2021.014253
Article
1
School of Computing, Faculty of Engineering, University Teknologi Malaysia, Johor, Malaysia
2
Department of Software Engineering, University of Management and Technology, Lahore, Pakistan
3
Collage of Business, Abu Dhabi University, Abu Dhabi, United Arab Emirates
4
Oxford Center for Islamic Studies, The University of Oxford, Oxford, UK
5
The University of Liverpool Management School, The University of Liverpool, Liverpool, UK
6
Department of Computer Engineering, National University of Technology, Islamabad, Pakistan
7
School of Computing, UTM Big Data Centre, Universiti Teknologi Malaysia, Johor, Malaysia
*
Corresponding Author: Mazhar Javed Awan. Email: [email protected]
Received: 09 September 2020; Accepted: 21 December 2020
Abstract: Big data is the collection of large datasets from traditional and
digital sources to identify trends and patterns. The quantity and variety of
computer data are growing exponentially for many reasons. For example,
retailers are building vast databases of customer sales activity. Organizations
are working on logistics financial services, and public social media are sharing
a vast quantity of sentiments related to sales price and products. Challenges of
big data include volume and variety in both structured and unstructured data.
In this paper, we implemented several machine learning models through Spark
MLlib using PySpark, which is scalable, fast, easily integrated with other tools,
and has better performance than the traditional models. We studied the stocks
of 10 top companies, whose data include historical stock prices, with MLlib
models such as linear regression, generalized linear regression, random forest,
and decision tree. We implemented naive Bayes and logistic regression classi-
fication models. Experimental results suggest that linear regression, random
forest, and generalized linear regression provide an accuracy of 80%–98%.
The experimental results of the decision tree did not well predict share price
movements in the stock market.
1 Introduction
Big data consists of massive structured and unstructured information that can be manipulated
by machine models, unlike earlier structures such as relational databases [1]. The increasing
importance of big data springs from the common use of new technology and smart devices such
This work is licensed under a Creative Commons Attribution 4.0 International License,
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
2570 CMC, 2021, vol.67, no.2
as smartphones and tablets. Big data analytics help to manage large amounts of data generated
by different companies and can be used to predict trends [2].
Spark provides a basic machine learning library with tools such as scalability, design, and
languages such as Scala, R, and Python. To use machine learning in a sequential environment
has many weaknesses, such as long execution time with large datasets, task-dependency, non-
scalability, and limited memory. Spark provides tools to address these problems, and it facilitates
data engineering and science [3]. Every investor needs to predict stock prices, which helps them to
make correct decisions to buy or sell. Forecasting the market is the main challenge to investors.
Information such as from news and social media can be used to predict the market [4].
External factors such as social media [5] and blogs can be used to analyze a company’s
performance. It is difficult to accurately analyze a massive amount of data. Forecasting the stock
market requires a system, which can be built using machine learning algorithms. Machine learning
models and other data mining techniques like time series analysis helps to forecast stock prices.
The prediction process includes many factors and a huge amount of data; hence, we propose big
data techniques to predict stock prices and trends [6]. Fig. 1 shows how investors manually predict
the movement of stock prices using social media.
Most investors analyze market and company information before buying stocks. Market-related
information is available on social media and from news, blogs, and companies’ customer reviews.
Many investors analyze such information to predict the movements of stock prices. The manual
analysis of this information is prone to errors [7]. Accuracy is essential in data analysis, but factors
such as computation speed are also important. External elements such as financial broadcasts
and social media require machine learning to predict market prices [8]. Information sources
provide unstructured data, which is not directly used in machine learning; hence, preprocessing is
required [9].
We present models to predict market trends. Machine learning algorithms have different
approaches. All are applied in the Spark framework using Python. Databricks platforms are also
used. We discuss both structured and unstructured datasets.
Machine models such as linear regression, decision tree, random forest, and generalized
linear regression are applied to the structure of datasets. The main inputs are closing prices and
adjusted closing prices, which are used to predict the behavior of stock prices. Logistical regression
CMC, 2021, vol.67, no.2 2571
and naive Bayes models are applied to unstructured datasets such as messages and reviews of
customers and investors.
We compare results to predict the intrinsic values of share prices using the above models.
We use existing and new techniques of the machine learning classification model with modern
platforms Databricks and PySpark with Python. We use a 10-year history dataset to predict stock
price movements. We include daily news and social media (Yahoo! Finance, Twitter) datasets for
use in sentiment analysis to predict the intrinsic values of share prices.
The remainder of this paper is organized as follows. Section 1 discusses the background and
previous work. Section 2 discusses the proposed model and defines the datasets (structured
and unstructured) and Spark machine learning libraries. Section 3 discusses the performance
and results of the models and compares their results. Section 4 discusses our conclusions and
future work.
2 Related Work
Forecasting is a challenge as investors seek to accurately predict market values, and many
models have been proposed. Every researcher tries to accurately predict the market. Stock market
prediction is based on various techniques related to structured and unstructured data [10]. Florian
et al. [11] argued that investors consider customers’ views and historical data to predict future
movements of share prices.
Khan [12] described the process of collecting messages and texture from Twitter’s data to pre-
dict share price movements, and argued that stop-words and comments affect the results because
Twitter data are unstructured. Before applying a new model, stop-words should be removed for
better prediction of price movements [13]. Sentiment analysis was applied to stock data to predict
stock price movements [14]. Our investigation depends on Twitter data, and findings suggest that
naive Bayes is a 50% accurate index to predict stock price movements, which is inadequate. It has
many missing values because the stock market is closed during the weekend, which affects the
accuracy of the results [15].
Hong mentioned that scientific analysis applies to historical data where mathematical
approaches are used to forecast the movements of the stock market [16]. Cakra et al. [17] studied
the use of historical information and sentiment analysis to predict share price movements, but did
not achieve a good result. Researchers classified the system into two parts: the first including data
collected from newspapers and company profiles, and the second applying naive Bayes algorithms
to process and tokenize data [18]. Their prediction of stock market price movements showed 86%
correct values.
Peng [19] analyzed one million messages from different markets and applied text classification
to describe the sentiment score of messages related to companies. Preda et al. [20] defined the
problem for the wholesaler by predicting the expected value of the share prices and stock market
indices. The proposed model relies on a sub-model that eliminates the errors associated with
spam data.
Misra et al. [21] discussed linear classification with support vector machine (SVM) and
artificial neural networks to accurately forecast the market’s daily movements. An artificial neural
network (ANN) and SVM are commonly applied in handling heterogeneous data and finance
data. Rechenthin, Street, and Srinivasan proposed a system to predict the U.S. oil market
(USO) stock chatter, which can examine the USO index to predict the true value of the stocks.
Yahoo! Finance has information of all opening and closing prices of the USO. Machine learning
2572 CMC, 2021, vol.67, no.2
approaches to Spark, Hadoop, and regression methods have been applied in the literature to
predict the movements of stock prices. Spark was used to predict real-time data because Hadoop
with MapReduce cannot work correctly in a real-time system [22].
Seif et al. [23] proposed a model to forecast markets using news and other external factors.
Machine algorithms and other neural network techniques were applied to collect data to predict
accurate values. The naive Bayes technique, SVM, and text mining were used to finalize the
dataset to predict the stock market index [23]. This system collects positive and negative views,
and historical stocks’ opening and closing price information to increase the forecasting of stock
market values. The proposed system predicts heterogeneous data such as news, social media, and
historical movements of share prices. The researchers stated that results of the k-nearest neighbors
(KNN) algorithm used to examine the relationship between social media news and the movement
of the stock prices are significant [24].
In this article, we describe artificial intelligence, natural network, and machine learning
algorithms for predicting movements of stock prices. This article divides a method into two
parts: prediction and classification. Prediction techniques like ANN, CNN, Naive Bayes, NN, and
Digital Signature Standard (DSS) are used to predict a model. The second part describes some
classification methods like filtering, fuzzy-based optimization, and KNN methods applied to some
datasets to evaluate the value of the stock market. The neural network method of DSS is used
in the hybrid model to predict the movements of stock prices. Yang et al. [25] proposed a model
to forecast stock price movements with big data, with data obtained from social media.
Yang et al. proposed a model to forecast the market value with big data. The researchers
collected data from real-time processing using social media [25]. Before applying the model, they
removed stop-words and processed the data to forecast the market index. To predict stock price
movements, they employed machine models such as SVM, decision trees (DT), and random
forest (RF) to forecast the value of the market. We have employed an integrated Support Vector
Machine (SVM) with the latest square to forecast the daily movements of the stock prices were
used to forecast a daily value of the stock market index and used article swarm optimization.
In our study, we rely on different sources of data—both structured and unstructured data—to
predict the future movements of stock prices. We have employed several models that produced
better results compared with the models applied in the literature.
3 Proposed Model
We propose a model to help investors decide which shares to buy and sell. The model aims
to predict stock price movements.
This model applies to historical data, Twitter, and news related to different companies. The
historical news of different companies’ datasets is used to predict the future values of stocks.
3.1 Datasets
We use two types of datasets. First, we collect the historical datasets of companies in the last
15 years. The second dataset is collected from news, blogs, Twitter, Yahoo! Finance, and reviews
and messages about different companies, along with sources such as Google Dataset Search and
Kaggle.com.
CMC, 2021, vol.67, no.2 2573
The adjusted closing price (Adj-Close) is the price at the end of the trading day or session
adjusted for companies’ actions such as stock dividends or stock splits. Researchers may use the
Adj-Close value to examine expected movements of share prices. Many articles use Adj-Close to
forecast the market’s exact value. Tab. 2 shows a historical dataset with features of structured
data [27].
The X value is a better solution to fill the missing value in the movement of the market. A
high-low price percentage is an important formula to forecast the stock market. Eq. (1) is used
to compensate for missing values. Eq. (2) helps to find the low- and high-percentage movement
of the market:
LH Percentage = (Low Prices − High Prices)/Low Prices (2)
Both formulae apply the ML-lib model to predict the future values of stock prices. Fig. 3
shows the general flow of our model using machine learning techniques.
3.2.1 Preprocessing
We apply some text processing techniques to unstructured datasets such as social media
and news. We cannot use unstructured datasets in machine models; hence, techniques such as
tokenization and text processing are employed to remove spam. In Databricks, we used Spark with
Python to process datasets. A dataset can be read using a Spark context. Datasets are converted to
resilient distributed datasets (RDDs), a Spark structure that allows data to be divided into clusters.
3.3.1 Preprocessing
We use Spark with Python to process datasets in Databricks. A dataset can be read in a
Spark context. Datasets are converted to resilient distributed datasets (RDDs), a Spark structure
by which data can be divided into clusters. We can then convert the RDD to a vector function.
We construct data frames consisting of two-column labels and messages. Message columns may
be news or social-media text relevant to the stock market.
3.3.2 Tokenization
Tokenization splits text into tokens to which classification models apply. We apply a function
to a dataset after removing unnecessary words.
2576 CMC, 2021, vol.67, no.2
A similar concept can be used in LR to find the exact value for Spark. The LR model depends
on the supervised machine. It forecasts the value of the stock price. The model targets values
based on self-directed or dependent changeable values. The LR model forecasts prices that depend
on independent values. This model can be applied to different companies’ datasets to forecast the
future values of stock prices. The result of Apple (AAPL) is presented in Fig. 5.
CMC, 2021, vol.67, no.2 2577
Eq. 4 shows the general formulation of the GLR model. In this article, we used this model
with Spark to predict the movements of stock prices. We have applied this model to 10 different
companies. The results produced by this model are more accurate than the results produced by
the leaner regression model. The GLR results are presented in Fig. 7.
2578 CMC, 2021, vol.67, no.2
Figure 10: Graph shows the AAPL stock message panel result
help to identify the best model that predicts the movement of the stock prices. Tab. 3 presents
the Apple (APPL) stock price movements using all of the models.
The results show that Spark MLlib models using big data produce more accurate results than
the other models, and that LR and GLR are more accurate than other models. We apply these
techniques to larger historical datasets, and we obtain better accuracy. Tab. 4 presents the results
of all models that produce accurate results.
We can apply the naive Bayes and logistic regression classifier models to news and messages
to predict the movement of stock prices. Naive Bayes gives approximately 60%–70% accuracy, and
so both models give close accuracy. Nevertheless, the logistics classifier model is better to predict
the value of stock price movements, as it provides 60% to 80% accuracy.
Fig. 12 shows the results of the text classifier model (Naive Bayes and logistic regression),
when applied to news and messages about APPL stock. Tab. 5 presents the results for the naïve
Bayes and logistic models.
Tab. 4 presents the results of the NB and Logistical LR models when applied to different
datasets. The data are collected from the message panel and analyzed using sentiment analy-
sis. The NB model achieves better accuracy than the logistical model with Spark MLlib using
the Databricks platform. By applying the above techniques, investors can build accurate predic-
tions of stock price movements. The machine learning model is mostly used to predict stock
price movements.
5 Conclusion
We employed several machine learning models to predict stock price movements through the
Spark big data framework. We used Spark MLlib to predict stock price movements We applied
Machine learning libraries on historical data for 10 companies The results indicate that linear
regression, random forest, and generalized linear regression produced more accurate results than
the decision tree model. Naive Bayes and logistics regression applied to the texture of data results
show approximately 77% to 80% accuracy ratios. We suggest employing deep learning models
through LSTM for future research.
2582 CMC, 2021, vol.67, no.2
Funding Statement: The researchers did not receive a specific grant from funding agencies in the
public, commercial, or not-for-profit sectors.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding
the present study.
References
[1] M. M. Hasan, J. Popp and J. Oláh, “Current landscape and influence of big data on finance,” Journal
of Big Data, vol. 7, pp. 1–17, 2020.
[2] X. Liu, “Analyzing the impact of user-generated content on B2B Firms’ stock performance: Big data
analysis with machine learning methods,” Industrial Marketing Management, vol. 86, pp. 30–39, 2020.
[3] R. T. Whitman, B. G. Marsh, M. B. Park and G. E. Hoel, “Distributed spatial and spatio-temporal join
on apache spark,” ACM Transactions on Spatial Algorithms and Systems, vol. 5, no. 1, pp. 1–28, 2019.
[4] S. Coyne, P. Madiraju and J. Coelho, “Forecasting stock prices using social media analysis,” in 2017
IEEE 15th Int. Conf. on Dependable, Autonomic and Secure Computing, 15th Int. Conf. on Pervasive
Intelligence and Computing, 3rd Int. Conf. on Big Data Intelligence and Computing and Cyber Science and
Technology Congress (DASC/PiCom/DataCom/CyberSciTech), Orlando, pp. 1031–1038, 2017.
[5] J. Peiran, A. Veiga and A. Walther, “Social media, news media and the stock market,” Journal of
Economic Behavior & Organization, vol. 176, pp. 63–90, 2020.
[6] M. C. Angadi and A. P. Kulkarni, “Time series data analysis for stock market prediction using data
mining techniques with R,” International Journal of Advance Research Computer Science, vol. 6, no. 6,
pp. 104–108, 2015.
[7] R. Batra and S. M. Daudpota, “Integrating stocktwits with sentiment analysis for better prediction
of stock price movement,” in 2018 Int. Conf. on Computing, Mathematics and Engineering Technologies,
Sukkur, Pakistan, pp. 1–5, 2018.
[8] D. T.Selin and O. Tas, “Social media sentiment in international stock returns and trading activity,”
Journal of Behavioral Finance, vol. 1, pp. 1–14, 2020.
[9] S. A. Bogle and W. D. Potter, “SentAMaL—a sentiment analysis machine learning stock predictive
model,” in Proc. on the Int. Conf. on Artificial Intelligence, pp. 610, 2015.
[10] D. P. Gandhmal and K. Kumar, “Systematic analysis and review of stock market prediction tech-
niques,” Computer Science Reviews, vol. 34, pp. 100190, 2019.
[11] K. Florian, F Manz and D. Schiereck, “The conditional stock market response to banks’ distressed
asset sales on CDS availability,” Applied Economics, vol. 52, pp. 1–13, 2020.
[12] W. Khan, M. A. Ghazanfar, M. A. Azam, A. Karami and H. K. Alyoubi, et al., “Stock market
prediction using machine learning classifiers and social media, news,” Journal of Ambient Intelligence and
Humanized Computing, vol. 1, pp. 1–24, 2020.
[13] S. Kalra and J. S. Prasad, “Efficacy of news sentiment for stock market prediction,” in 2019 Int. Conf.
on Machine Learning, Big Data, Cloud and Parallel Computing, Faridabad, India, pp. 491–496, 2019.
[14] V. S. Pagolu, K. N. Reddy, G. Panda and B. Majhi, “Sentiment analysis of Twitter data for predicting
stock market movements,” in 2016 Int. Conf. on Signal Processing, Communication, Power and Embedded
System, Paralakhemundi, pp. 1345–1350, 2016.
[15] J. Kordonis, S. Symeonidis and A. Arampatzis, “Stock price forecasting via sentiment analysis on
Twitter,” in Proc. of the 20th Pan-Hellenic Conf. on Informatics, Patras, pp. 1–6, 2016.
[16] S. Jeon, B. Hong and V. Chang, “Pattern graph tracking-based stock price prediction using big data,”
Future Generation Computer Systems, vol. 80, pp. 171–187, 2018.
[17] Y. E. Cakra and B. D. Trisedya, “Stock price prediction using linear regression based on sentiment
analysis,” in 2015 Int. Conf. on Advanced Computer Science and Information Systems, Depok, pp. 147–
154, 2015.
CMC, 2021, vol.67, no.2 2583
[18] D. K. Kirange and R. R. Deshmukh, “Sentiment analysis of news headlines for stock price pre-
diction,” COMPUSOFT: An International Journal of Advanced Computer Technology, vol. 5, no. 3,
pp. 2080–2084, 2016.
[19] Z. Peng, “Stocks analysis and prediction using big data analytics,” in 2019 Int. Conf. on Intelligent
Transportation, Big Data & Smart City, Changsha, pp. 309–312, 2019.
[20] S. Preda, S. V. Oprea and A. Bâra, “PV forecasting using support vector machine learning in a big
data analytics context,” Symmetry (Basel), vol. 10, no. 12, pp. 748, 2018.
[21] P. Misra and S. Chaurasia, “Data-driven trend forecasting in stock market using machine learning
techniques,” Journal of Information Technology Research, vol. 13, no. 1, pp. 130–149, 2020.
[22] M. Rechenthin, W. N. Street and P. Srinivasan, “Stock chatter: Using stock sentiment to predict price
direction,” Algorithmic Finance, vol. 2, no. 3–4, pp. 169–196, 2013.
[23] M. M. Seif and E. M. R. Hamed, “Stock market real time recommender model using apache spark
framework,” In Hassanien, A., Tolba, M., Elhoseny, M. and Mostafa, M. (Eds.) The International Con-
ference on Advanced Machine Learning Technologies and Applications (AMLTA2018), AMLTA, Advances
in Intelligent Systems and Computing, vol. 723. Cham: Springer, 2018.
[24] W. Wang, K. Y. Ho, W. M. R. Liu and K. T. Wang, “The relation between news and stock price
jump: An analysis based on neural network,” in 20th International Congress on Modeling and Simulation,
Adelaide, Australia, pp. 2506–2512, 2013.
[25] R. Yang, L. Yu, Y. Zhao, Y. Yu, H. Xu et al., “Big data analytics for financial market volatility forecast
based on support vector machine,” International Journal of Information Management, vol. 50, pp. 452–
462, 2020.
[26] Yahoo, “No Title.” p. yahoo finance.com. [Online]. Available: http://finance.yahoo.com, 2020.
[27] D. Search, “No title.” [Online]. Available: https://datasetsearch.research.google.com/, 2020.