
Prediction of S&P 500 Index Based On Regression

Zhang Daping, Qin Shihan, Shi Ziyue

Abstract: With the development of modern finance, more and more researchers have attempted to predict the financial market, although its volatility, non-linearity, and poor controllability make it difficult to forecast. In this paper, regression methods, including linear regression and ridge regression, are implemented to predict the S&P 500 Index. First, a data set containing one year of S&P 500 closing prices is collected from Yahoo Finance. Then, regression models based on linear and ridge regression are trained for prediction. Last, RMSE and the F2 score are used as evaluation metrics to compare the models. It can be concluded that the multiple linear regression and ridge regression models performed better than the other models, which can provide a reference for investment decisions and financial market analysis.

Keywords: Linear Regression, Ridge Regression, S&P 500 Index (SPX)

I. Introduction
At the end of the 20th century, with the rapid development of computer technology, some financiers began to use computer algorithms to simulate and predict economic development, design economic models, and construct stock portfolios. With the advent of the big-data era in the 21st century, the data and information available to people began to grow exponentially, and model-based algorithms came to play an indispensable role in economic prediction, helping investors avoid stock market risks as much as possible and obtain greater returns at relatively small cost.
Regression models for stock data analysis and recommendation can draw on different data sources and different kinds of content, and through further analysis act as "multiple eyes and ears" on stock trading information. This overcomes the shortcomings of the traditional trading process, in which little information is acquired, acquisition takes a long time, and trading judgments are one-sided and arbitrary [1]. At the same time, people with basic accounting knowledge can accurately grasp the logic by which the current market prices stocks within an industry sector. By using different regression models, we provide a scientific measurement mechanism for real-time prediction of stock trends, real-time monitoring of stock information, and real-time evaluation of individual stocks. It is worth noting that such models are highly time-sensitive and are not stable across different market trends.

The Python language is widely used for financial and economic problems, advanced mathematics, time series, and related topics, and can be applied to artificial intelligence, big-data statistics and forecasting, website development, and other fields. It is a free, open-source programming language developed by the Dutch programmer Guido van Rossum [2]. Regression analysis mainly explores the correlation between independent and dependent variables, and builds a regression model from this relationship to predict the future trend of the dependent variable. Common regression models include linear regression, ridge regression, logistic regression, and stepwise regression. This paper uses Python to train linear regression, multiple linear regression, and ridge regression models, and compares them to find the models best suited to the current dataset.

Among Python model libraries, scikit-learn (sklearn) is one of the most common, providing methods for regression, dimensionality reduction, classification, and other machine learning tasks, and it is a popular tool for machine learning practice. In this paper, when using sklearn to build a regression model, the main steps are as follows (a short code sketch follows the list):
(1) Determine the independent and dependent variables according to the data set
and prediction target;
(2) Select the appropriate features and explain the selected features in-depth;
(3) Clean the data set, filter the usable data, and standardize or normalize the different data as appropriate;
(4) Estimate model parameters, establish linear regression, multiple linear
regression, ridge regression and other regression models;
(5) Test different regression models;
(6) Use different regression models to make predictions on data and compare the
results.
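
To make these steps concrete, the following is a minimal sketch of the workflow in scikit-learn. The file name and column names (Num, Volume, VIX, Close) are illustrative assumptions based on the dataset described later, not the authors' actual code.

```python
# A minimal sketch of the six steps above using scikit-learn.
# The CSV path and column names are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

# (1) determine independent variables X and dependent variable y
df = pd.read_csv("spx.csv")                      # hypothetical file
X = df[["Num", "Volume", "VIX"]]                 # (2) chosen features
y = df["Close"]

# (3) basic cleaning: drop rows with missing values
mask = X.notna().all(axis=1) & y.notna()
X, y = X[mask], y[mask]

# (4) estimate parameters for the regression models
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)
models = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}
for name, model in models.items():
    model.fit(X_train, y_train)                  # (5) test each model
    pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(name, rmse)                            # (6) compare results
```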

II. Literature review


Research has demonstrated the use of different regression techniques and their applications in multiple areas. They have long been known as evolving and effective prediction tools in economics, engineering, and chemistry. Multiple linear, ridge, gradient boosting, and other regression techniques have been applied in financial forecasting cases such as house price prediction, which require validly processed data sets to identify the right time to purchase specific houses.

Previous works in finance conduct experiments with multiple regression methods, fitting models to the requirements of each situation. Some predictions use several regression models, such as multiple linear, ridge, LASSO, and AdaBoost regression [3], and then compare their final results using error functions, aiming for the best prediction accuracy. Further studies in stock markets also demonstrate the validity of regression techniques in this industry alongside many other machine learning algorithms. A study by Srinath Ravikumar and Prasad Saraf showed that prediction based on historical data can use regression algorithms to predict a company's closing price, and classification algorithms to predict whether the stock price will decrease or increase the next day [4]. The combination of regression and classification has since been proposed as an application.

Beyond the financial industry, regression techniques are also of great significance in recognition problems in engineering and the social sphere. In speech recognition, kernel ridge regression performs well on sensitivity measures and time complexity compared with other machine learning algorithms [5]. Kernel ridge regression is also used in hyperspectral image classification, an engineering application. In that area, spectral-spatial ridge linear regression for hyperspectral image classification sometimes cannot handle linearly inseparable data sets, so the work of Chunhui Zhao, Wu Liu, and others introduced kernel ridge regression and shared subspace learning [6].

During the research process, most works go through a data collection stage; for finance-related papers this often involves mining historical raw data. Next comes data pre-processing, in which null values and different data types are detected and selected according to the requirements. For linear regression, non-linear and linearly inseparable data sets are reprocessed or removed. The data are then separated into training and testing sets; different train-test percentages also yield different results. After the separation, researchers fit the data to the models and record, compare, and analyze the results.

For studies of stock markets, Srinath Ravikumar and Prasad Saraf described the development of deep learning models applied to stock markets. Regression models have long been useful in price prediction, and studies also tend to choose appropriate factors affecting the stock price and take them as variables. Support vector machine algorithms are also applied to stock price prediction, and showed better accuracy than other machine learning algorithms in the study by Pushkar Khanal and Shree Raj Shakya [7].

In this paper, we continue this line of study, using linear regression, multiple linear regression, and ridge regression to predict stock prices and compare the results.

III. Methodology
Simple linear regression:
Linear regression is a machine learning algorithm based on supervised learning. It models the target value from the independent variable, and its main goal is to find the relationship between the variables and the predicted value; different regression models vary according to the relationship between the dependent and independent variables they consider and the number of independent variables used. In simple linear regression [8], the target variable (Y) depends on a single independent variable (X), and the model establishes a linear relationship between these two variables. This technique is commonly used in predictive analytics and time series models, and to discover causal relationships between variables. Curves or lines are fitted to the data points, with the goal of minimizing the distance from the curve to the points. A linear regression model attempts to fit a line between the independent variable (X) and the dependent variable (Y), and the equation of this line can be expressed as:

$Y = a + bX$

where $a$ and $b$ are model parameters called regression coefficients: $a$ is the Y-intercept of the line (the value of Y when X = 0), and $b$ is the slope of Y with respect to X. Finding the best value of $b$ means finding the best-fit line. When we finally use the model to make predictions, it predicts the value of the dependent variable Y for an input value of the independent variable X.

For linear regression algorithms, there are mainly the following steps:
(1) Read in the given dataset as the initial input;
(2) Obtain the objective (loss) function: to better measure the results, we quantify an objective function so that the model can be optimized during the solving process. We define the loss function as:

$L(a, b) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a + b x_i)\right)^2$

This is the average squared distance between the predicted value and the true value, also called the mean squared error (MSE).
(3) To minimize the loss function we can use the least squares method, whose closed-form solution is:

$b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \quad a = \bar{y} - b\bar{x}$

(4) Alternatively, perform gradient descent on the loss function and update the parameters iteratively:

$a \leftarrow a - \alpha\frac{\partial L}{\partial a}, \quad b \leftarrow b - \alpha\frac{\partial L}{\partial b}$

where $\alpha$ is the learning rate. A small code sketch of these steps follows.
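
As an illustration of steps (2)-(4), here is a minimal NumPy sketch of minimizing the MSE loss by gradient descent on synthetic data; the learning rate and iteration count are arbitrary choices, not values from the paper.

```python
# Gradient descent for simple linear regression on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, 100)   # true intercept 2, slope 3

a, b, lr = 0.0, 0.0, 0.1                      # parameters and step size
for _ in range(5000):
    err = (a + b * x) - y                     # residuals of current fit
    grad_a = 2 * np.mean(err)                 # dL/da of the MSE loss
    grad_b = 2 * np.mean(err * x)             # dL/db of the MSE loss
    a -= lr * grad_a                          # gradient descent updates
    b -= lr * grad_b

print(a, b)   # should approach the true intercept 2 and slope 3
```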

Simple linear regression has the advantages of simplicity, interpretability, and wide availability, so it is often the first method considered for a problem. But because of that very simplicity, its scope is narrow: in most real-world cases the data do not satisfy the assumptions of the linear regression model at all, or the model struggles to produce useful results.

Multiple linear regression:

Multiple linear regression is a statistical procedure that examines the degree of association between a set of independent variables and a dependent variable. In contrast to simple linear regression, there are many predictors, and the value of the dependent variable Y is calculated from the values of the predictors. When the predicted quantity y is affected by multiple explanatory variables x at the same time, and each explanatory variable is linearly related to y, the multiple linear regression model can be used for analysis and prediction. For example, if the target value depends on n independent variables, the regression equation fits a regression hyperplane in n-dimensional space. Establishing a multiple linear regression model involves three main steps [9]:

(1) The multiple regression model is established according to the formula:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

which expresses a multiple linear regression model for the dependent variable y and the independent variables $x_1, \ldots, x_k$.
(2) Parameter estimation of regression models is usually done by least squares estimation, that is, finding the regression constant and regression coefficients that minimize the sum of squared deviations:

$\min_{\beta_0, \ldots, \beta_k} \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$
(3) The goodness-of-fit of the model is evaluated. Since different multiple linear regression models do not necessarily contain the same number of variables, the more common evaluation index is the coefficient of determination adjusted for degrees of freedom, which can be expressed as:

$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$

where n is the sample size and k is the number of independent variables in the model; (n - 1) and (n - k - 1) are the degrees of freedom of the total sum of squares and the residual sum of squares, respectively [9]. The coefficient obtained from this formula takes a value between 0 and 1: the closer to 1, the greater the explanatory power of the independent variables, and the closer to 0, the weaker it is. A small helper for this statistic is sketched below.
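
A minimal implementation of the adjusted coefficient of determination above might look as follows; it assumes the plain R² is computed with sklearn's r2_score.

```python
# Adjusted R^2 as defined above; assumes sklearn for plain R^2.
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, k):
    """Adjusted R^2 for n samples and k independent variables."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# toy usage: 5 samples, a model with 2 independent variables
print(adjusted_r2([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.9, 5.1], k=2))
```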

Ridge Regression
Ridge regression is also known as regularized linear regression, and in statistics as Tikhonov regularization [10]. It estimates the coefficients of multiple-regression models in situations where the independent variables are highly correlated. Here we also need to introduce another concept: multicollinearity.

Multicollinearity occurs when one predictor variable in a multiple regression model can be predicted from the others. In this situation, multicollinearity indicates correlation between the independent variables in the modeled data. Multicollinearity does not affect the overall predictive ability or reliability of the model, but it changes the results for individual predictors; hence it may lead to invalid individual-predictor estimates and non-identifiable parameters.

Collinearity is a linear association between two variables: if $X_1$ and $X_2$ have an exact linear relationship between them, they are perfectly collinear and satisfy

$X_{2i} = \lambda_0 + \lambda_1 X_{1i}$

for all observations i. The same holds for multicollinearity, where the number of variables increases from 2 to k; that is, for all observations i,

$\lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} = 0$

with the $\lambda_j$ not all zero. A quick rank check for such exact dependence is sketched below.
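
As a quick illustration (with toy data, not the paper's dataset), exact collinearity can be detected by checking the rank of the design matrix:

```python
# Detecting perfect (multi)collinearity via matrix rank.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 5.0 + 2.0 * x1                           # exactly collinear with x1
X = np.column_stack([np.ones(4), x1, x2])     # design matrix [1, x1, x2]

# rank < number of columns means an exact linear dependence exists
print(np.linalg.matrix_rank(X), X.shape[1])   # prints "2 3"
```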

Since multicollinearity leads to invalid predictor results, ridge regression is particularly suited to reducing the problem of multicollinearity in linear regression. In general, the method improves efficiency in parameter estimation problems in exchange for a tolerable amount of bias.

The following equations display and explain the detailed Tikhonov regularization (ridge regression) steps.

First, suppose that for a matrix A and a vector b we wish to find a vector x such that

$Ax = b$

The standard method is ordinary least squares (OLS) linear regression, but sometimes no x satisfies the equation, or more than one solution exists. In these circumstances we have an overdetermined or underdetermined system of equations. The standard approach to solving an overdetermined system of linear equations is linear least squares, which seeks to minimize the residual

$\|Ax - b\|_2^2$

where $\|\cdot\|_2$ is the Euclidean norm.

In order to give preference to a particular solution with desirable properties, a regularization term can be included in the minimization:

$\|Ax - b\|_2^2 + \|\Gamma x\|_2^2$

In some cases the Tikhonov matrix $\Gamma$ is chosen as a scalar multiple of the identity matrix, which is also known as L2 regularization. This regularization improves the conditioning of the problem, enabling a numerical solution. An explicit solution, denoted $\hat{x}$, is given by

$\hat{x} = (A^T A + \Gamma^T \Gamma)^{-1} A^T b$

For $\Gamma = 0$ this reduces to the unregularized least squares solution, provided $(A^T A)^{-1}$ exists.
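
A minimal NumPy sketch of this closed-form solution, assuming the common choice $\Gamma = \sqrt{\alpha}\,I$ so that $\Gamma^T \Gamma = \alpha I$:

```python
# Closed-form Tikhonov (ridge) solution with Gamma = sqrt(alpha) * I.
import numpy as np

def ridge_solve(A, b, alpha=1.0):
    """Return x_hat = (A^T A + alpha * I)^(-1) A^T b."""
    n_features = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(n_features), A.T @ b)

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # overdetermined system
b = np.array([1.0, 2.0, 2.5])
print(ridge_solve(A, b, alpha=0.1))
```

Using np.linalg.solve instead of explicitly inverting the matrix is the standard, numerically safer way to evaluate this expression.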

IV. Experiment
Dataset
Many factors influence the price of a stock [11], such as the company's financial situation, the macroeconomic environment, and even the personal behavior of its executives, which make it changeable and hard to predict [12]. A stock index, which aggregates the stocks of many companies, is therefore more predictable. The Standard and Poor's 500, also known as the S&P 500, is an index that tracks the stock prices of 500 of the largest companies in the United States. In the experiment, historical data from 12/27/2021 to 12/23/2022 are extracted from Yahoo Finance [13], a total of 253 valid trading days excluding holidays. As shown in Table 1, the original data include the date, closing price, trading volume, and VIX index for each trading day, ranked from the newest day to the oldest.
TABLE 1 Head of Dataset
Preprocessing
Daily fluctuations in the stock price do not affect the long-term price, so only the closing price is used as the prediction target. First, a column Num is generated to represent the index of each daily record. Then null values are checked and the thousands separators are removed from all data. Since the original data are in 'object' format, they are converted to float for convenience. The line graph for the whole target year is shown in Fig. 1. The original dataset is split into a training set and a test set at a ratio of 8:2, i.e., 202 and 51 samples respectively. Finally, a column with the VIX index is added to improve model performance, as sketched below.
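
A sketch of these preprocessing steps in pandas follows; the file name and column labels are assumptions about the Yahoo Finance export, not the authors' exact code.

```python
# Preprocessing sketch: index column, null check, separator removal,
# type conversion, and the 8:2 train/test split described above.
import pandas as pd

df = pd.read_csv("spx_2022.csv")                  # hypothetical file
df["Num"] = range(len(df))                        # index of daily data
df = df.dropna()                                  # remove null values
for col in ["Close", "Volume", "VIX"]:
    # strip thousands separators and cast 'object' columns to float
    df[col] = df[col].astype(str).str.replace(",", "").astype(float)

split = int(len(df) * 0.8)                        # 8:2 train/test split
train, test = df.iloc[:split], df.iloc[split:]
print(len(train), len(test))                      # 202 and 51 for 253 rows
```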

Fig. 1: S&P 500 Index

Evaluation
Two evaluation metrics, root mean square error (RMSE) and the F2 score, are used to evaluate the models. In terms of RMSE, multiple linear regression and ridge regression are clearly much better than simple linear regression, while the two better models perform similarly to one another. The F2 score tells the same story: both of the better models achieve scores above 0.8, indicating a good fit between model and data. A metric-computation sketch follows Table 2.
TABLE 2 Experiment Results

Model    RMSE            F2
Linear   185.11484053    0.62502278
Multi    105.83258720    0.88270912
Ridge    105.84222311    0.88268776
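
For reference, the two metrics can be computed as sketched below. Since the reported "F2" values lie between 0 and 1 and are described as measuring model-data fit, the sketch assumes they correspond to the coefficient of determination (sklearn's r2_score); this interpretation is an assumption, not stated in the paper.

```python
# Metric sketch: RMSE via sklearn, and the "F2" score assumed here
# to be the coefficient of determination (r2_score).
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    score = r2_score(y_true, y_pred)          # assumed "F2" score
    return rmse, score
```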

Meanwhile, the prediction results are visualized and compared to the actual data, as shown in Fig. 2 and Fig. 3. Consistent with the evaluation results, the graphs of multiple linear regression and ridge regression are closer to the actual market movement. We can also see that all the predictions lag the actual market to some degree. This corresponds to the financial theory that the market cannot be predicted exactly [14], and it is unavoidable that regression methods fall behind the actual data. Nevertheless, the result is still meaningful in suggesting the possible trend of the future market, and the gap between prediction and reality can be narrowed by training on more historical data and adopting more comprehensive models.
Fig. 2: Performance of multiple linear regression
Fig. 3: Performance of ridge regression

V. Conclusion
In this paper, the closing price of the S&P 500 is predicted by several regression-based models: linear regression, multiple linear regression, and ridge regression. From the results we can conclude that the multiple linear regression model performs better than the others under the RMSE and F2 evaluations. The proposed model shows high performance in predicting the S&P 500, which could contribute to the analysis of financial market trends. To focus on the comparison between regression methods, only a few variables and one year of data are used for the dataset. In future work, data over a longer range and more indicators can be used to further improve performance.
Reference:
[1] Zhang Xiaolei, Chen Lei, “Design and Implementation of Popular Stock Analysis
and Recommendation System Based on Linear Regression”, Nov.2022
[2] He Xiaonian, Duan Fenghua, “Case Analysis of Linear Regression Based on
Python”, No.11.2022
[3] House Price Prediction Using Regression Techniques: A Comparative Study
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8882834
[4] Prediction of Stock Prices using Machine Learning (Regression,Classification)
Algorithms. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9154061
[5] Kernel Ridge Regression method applied to speech recognition problem: a novel
approach
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7043378
[6] Hyperspectral Image Classification via Spectral–Spatial Shared Kernel Ridge
Regression
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8721560
[7] Pushkar Khanal, Shree Raj Shakya “Analysis and Prediction of Stock Prices of
Nepal using different Machine Learning Algorithms” Department of Mechanical
Engineering, Pulchowk campus, Institute of Engineering, Tribhuvan University,
Nepal
[8] Sudhir Panda, Biswajit Purkayastha, Dolly Das, Manomita Chakraborty, Saroj
Kumar Biswas,” Health Insurance Cost Prediction Using Regression Models “,
(COM-IT-CON), 26-27 May 2022
[9] Wei Zheng, Yunfeng Dong, Can Huang, Di Yang, "The Impact of Employment on
Fiscal Revenue Based on Multiple Linear Regression Model", 03 2022
[10] Ahn, J. J., Byun, H. W., Oh, K. J., & Kim, T. Y. (2012). Using ridge regression
with genetic algorithm to enhance real estate appraisal forecasting. Expert Systems
With Applications, 39(9), 8369–8379. https://doi.org/10.1016/j.eswa.2012.01.183
[11] Phayung Meesad & Risul Islam Rasel, “Predicting Stock Market Price Using
Support Vector Regression”, p. 6
[12] A. Altan and S. Karasu, “THE EFFECT OF KERNEL VALUES IN SUPPORT
VECTOR MACHINE TO FORECASTING PERFORMANCE OF FINANCIAL
TIME SERIES”, The Journal of Cognitive Systems Vol. 4, No. 1, 2019, pp.20
[13] “S&P 500 historical data,” Yahoo! Finance, 27-Dec-2022. [Online]. Available:
https://finance.yahoo.com/quote/%5EGSPC/history/. [Accessed: 27-Dec-2022].
[14] S. Karasu, et al., “Prediction of Solar Radiation based on Machine Learning
Methods”, The Journal of Cognitive Systems, Vol. 2, No. 1, pp. 16-20, 2017.
