
COMP1901 Project 2 Research

Stock Selection: HK0017 New World Development Company Limited


Introduction:
Mathematical economics came into being after the Italian mathematician Giovanni Ceva creatively applied mathematical reasoning to economic questions in 1711. In the 21st century, the development of computer technology has greatly enhanced our ability to process statistical data, and it is no longer a fantasy to predict the trend of the financial market, to a certain extent, by combining probabilistic and statistical methods with computer programs. This article uses trading data of New World Development Company to briefly describe some simple stock forecasting methods from the perspectives of computer science, statistics and economics.
Classical Data Mining Algorithms
Linear Regression Method:
Linear regression analysis is the simplest and most commonly used analysis method. It rests on three mathematical ideas: first, the regression line of a set of samples must pass through the center (mean point) of the sample; second, a curve can be approximated by a straight line over a short interval; third, the line is found by the method of least squares.
One-variable linear regression computes, from a set of training samples, the straight line that minimizes the squared distance between the data points and the line.
The idea of linear regression is to first assume a straight line

    ŷ = θ0 + θ1·x

Substituting each value of the feature X into it gives the corresponding ŷi. The loss can then be defined as the sum of the squared differences between ŷi and yi:

    L(θ0, θ1) = Σi (ŷi − yi)²

The problem is thus transformed into finding the θ0 and θ1 that make L smallest.
(Figure: three-dimensional surface of the residual sum of squares as a function of θ0 and θ1.)
The remaining work is solved by the mathematical method of least squares.
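To make the least-squares step concrete, here is a minimal NumPy sketch of the one-variable case; the short price array is made-up illustration data, not the New World Development series.

```python
import numpy as np

# Toy stand-in for a training set: x = day index, y = closing price.
x = np.arange(10, dtype=float)
y = np.array([10.2, 10.4, 10.3, 10.6, 10.8, 10.7, 11.0, 11.1, 11.3, 11.2])

# Closed-form least squares: minimize sum((theta0 + theta1*x - y)^2).
theta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta0 = y.mean() - theta1 * x.mean()   # the fitted line passes through the sample center

y_hat = theta0 + theta1 * x
loss = np.sum((y_hat - y) ** 2)         # residual sum of squares L(theta0, theta1)
print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3f}, loss = {loss:.4f}")
```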


The advantages and disadvantages of linear fitting are obvious. On the one hand, it requires little computation, the method is simple, and it fits fairly uniformly distributed samples well; on the other hand, its applicable range is relatively narrow, and for very large or unevenly distributed samples linear regression analysis is less effective.

The prediction function that comes with Excel is in fact a linear regression prediction. We use Excel's built-in function to draw the regression line (figure omitted). Verifying with Orange at the same time, we obtain the results shown in the figure.
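For readers who prefer code to a screenshot, the same kind of check can be reproduced with scikit-learn, the library Orange builds on. The sketch below is only an illustration: the file name new_world_close.csv, the column name Close and the five-day lag window are assumptions rather than the exact setup used in the original experiment.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Assumed input: a CSV with a 'Close' column of daily closing prices.
close = pd.read_csv("new_world_close.csv")["Close"].to_numpy()

# Lag features: predict today's close from the previous 5 closes.
lags = 5
X = np.column_stack([close[i:len(close) - lags + i] for i in range(lags)])
y = close[lags:]

# Time-ordered split: first 80% for training, last 20% for testing.
split = int(0.8 * len(y))
model = LinearRegression().fit(X[:split], y[:split])
pred = model.predict(X[split:])

print("MAE:", mean_absolute_error(y[split:], pred))
print("R2 :", r2_score(y[split:], pred))
```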

Neural Network Method

Artificial neural networks are models inspired by the structure and function of biological neural networks. They are a type of pattern-matching method, usually used for regression and classification problems, but they actually form a huge subfield containing hundreds of algorithms for many types of problem.
Algorithm advantages
1. High classification accuracy and an extremely strong learning ability.
2. Strong robustness and fault tolerance to noisy data.
3. The ability to form associations and to approximate any non-linear relationship.
Algorithm shortcomings
1. Neural networks have many parameters, weights and thresholds.
2. The training is a black-box process; intermediate results cannot be observed.
3. The learning process is relatively long and may fall into a local minimum.
The result after implementing it with Orange is:
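As a rough counterpart of the Orange neural-network widget, here is a minimal scikit-learn sketch; the MLP architecture, the lag features and the file and column names are illustrative assumptions, not the report's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score

close = pd.read_csv("new_world_close.csv")["Close"].to_numpy()  # assumed column name
lags = 5
X = np.column_stack([close[i:len(close) - lags + i] for i in range(lags)])
y = close[lags:]
split = int(0.8 * len(y))

# A small multilayer perceptron; standardizing the inputs helps it converge.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("MAE:", mean_absolute_error(y[split:], pred), "R2:", r2_score(y[split:], pred))
```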

SVM Method
SVM (Support Vector Machine) is fundamentally a two-class model, although after modification it can also be used for multi-class classification. Support vector machines can be divided into two categories, linear-kernel and non-linear. The main idea is to find a hyperplane in the feature space that best separates the data samples, maximizing the margin between the hyperplane and the nearest samples.
Algorithm advantages
1. Works for machine learning on small samples.
2. Can solve non-linear problems.
3. Has no local-minimum problem (compared with algorithms such as neural networks) and handles high-dimensional data sets well.
4. Strong generalization ability.
Algorithm shortcomings
1. The high-dimensional mapping of the kernel function is hard to interpret, especially for the radial basis function.
2. Sensitive to missing data.
Similarly, we use Orange to test the SVM method.
The results of the operation are:
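A minimal support-vector-regression sketch in the same style is shown below; the RBF kernel and the hyperparameters C and epsilon are illustrative choices, and the input file and lag features are assumed as before.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score

close = pd.read_csv("new_world_close.csv")["Close"].to_numpy()  # assumed column name
lags = 5
X = np.column_stack([close[i:len(close) - lags + i] for i in range(lags)])
y = close[lags:]
split = int(0.8 * len(y))

# RBF-kernel support vector regression; SVMs are sensitive to feature scale,
# so the inputs are standardized first.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("MAE:", mean_absolute_error(y[split:], pred), "R2:", r2_score(y[split:], pred))
```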

KNN Method
The k-nearest-neighbors classification algorithm is one of the simplest methods in data-mining classification.
The principle is: if most of the k nearest samples to a sample in the feature space belong to a certain category, then the sample also belongs to that category and shares the characteristics of the samples in that category.
Algorithm advantages:
1. Simple, easy to understand and implement; no parameters to estimate and no training phase;
2. Suitable for classifying rare events;
3. Especially suitable for multi-class problems (multi-modal objects, objects with multiple category labels), where kNN can perform better than SVM.
Algorithm disadvantages:
1. Needs a lot of space to store the known instances, and the computational complexity is high.
Because of this complexity, we again use Orange for the implementation.
The results of the operation are:
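The corresponding k-nearest-neighbors sketch is shown below; k = 5 and the lag-feature setup are again illustrative assumptions rather than the Orange defaults used in the report.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, r2_score

close = pd.read_csv("new_world_close.csv")["Close"].to_numpy()  # assumed column name
lags = 5
X = np.column_stack([close[i:len(close) - lags + i] for i in range(lags)])
y = close[lags:]
split = int(0.8 * len(y))

# Predict today's close as the average of the k most similar past days
# (similarity measured on the 5 previous closes).
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("MAE:", mean_absolute_error(y[split:], pred), "R2:", r2_score(y[split:], pred))
```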
Random Forest Method
The decision tree method builds a tree from the actual values of the attributes in the data, and a random forest is an ensemble of such decision trees trained on random subsets of the samples and features. Decision trees are often trained for classification and regression problems; they are usually fast and accurate and are among the most popular models in machine learning.
Algorithm advantages
1. Decision trees are easy to understand and explain, can be visualized and analyzed, and rules are easy to extract.
2. They can process nominal and numerical data at the same time.
3. When testing on a data set, the running speed is relatively fast.
4. Decision trees extend well to large databases, and their size is independent of the size of the database.
Algorithm shortcomings
1. Missing data are difficult to deal with.
2. Prone to overfitting (random forests mitigate this by averaging many trees).
3. Correlations between attributes in the data set are ignored.
4. When the ID3 algorithm calculates information gain, the result is biased toward attributes with more values.
The results displayed by Orange are:
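A minimal random-forest sketch in the same style follows; the number of trees and the lag features are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

close = pd.read_csv("new_world_close.csv")["Close"].to_numpy()  # assumed column name
lags = 5
X = np.column_stack([close[i:len(close) - lags + i] for i in range(lags)])
y = close[lags:]
split = int(0.8 * len(y))

# An ensemble of 200 decision trees; averaging their predictions reduces overfitting.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("MAE:", mean_absolute_error(y[split:], pred), "R2:", r2_score(y[split:], pred))
```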

Analysis

(Figures: fitted curves for the Linear Regression, kNN, SVM, Neural Network and Random Forest models; grey is the simulated curve, blue is the original stock curve.)
By comparing parameters such as MAE and R² and studying the fitted curves of the various models, we found that the linear regression model performs best among these models. We therefore use the linear regression model for further calculations:
10-day moving average:                              0.832
5-day moving average:                               0.502
10-day forecast:                                    0.637
5-day forecast:                                     0.502
Machine learning based on previous price:           0.488
Machine learning based on previous price change %:  0.442
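For reference, moving-average baselines like those in the table can be computed in a few lines of pandas. The sketch below is only illustrative: the metric behind the numbers above is not specified in the report, and the file and column names are assumptions.

```python
import pandas as pd

close = pd.read_csv("new_world_close.csv")["Close"]  # assumed column name

# Naive forecast: today's predicted price is the mean of the previous k closes.
for k in (5, 10):
    forecast = close.rolling(window=k).mean().shift(1)  # shift so only past data is used
    mae = (forecast - close).abs().mean()
    print(f"{k}-day moving average, MAE = {mae:.3f}")
```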
At the same time, there is an unexplained but noteworthy phenomenon: when the original stock curve fluctuates drastically, the accuracy of most forecasting models drops significantly.

(Figure: orange is the linear regression method.)

Mathematical Economics Algorithms


Markov Chain Method based on the Efficient Market Hypothesis
The Efficient Market Hypothesis (EMH) was proposed and developed by the famous American economist Eugene Fama in 1970. The hypothesis holds that in a stable stock market all valuable information is already fully reflected in the current stock price, so it is impossible for investors to obtain excess profits above the market average by analyzing past prices. The stock price described by this theory is therefore memoryless in the statistical sense, and stock trading can be idealized as a Markov process.
The stock trend evolves according to a discrete-time Markov chain, each step of which is a random change of state. A discrete-time Markov chain is a sequence of random variables X1, X2, X3, … that satisfies the Markov property: the transition probability depends only on the current state and has nothing to do with earlier states. In terms of probabilities, this is expressed as:
Pr(Xn+1 = x | X1 = x1, X2 = x2, …, Xn = xn) = Pr(Xn+1 = x | Xn = xn)
The matrix P(m, m+n) = (Pij(m, m+n)) composed of the transition probabilities is called the transition probability matrix of the Markov chain.
When the transition probability Pij depends only on i, j and the time interval n, the transition probabilities are said to be stationary and the chain is said to be time-homogeneous. The Markov chains usually discussed are homogeneous chains.
It can be seen that the distribution of Xn+1 depends only on that of Xn, so only the previous state is needed to determine the probability distribution of the current state, and the independence from earlier history is satisfied. A Markov-chain class can be implemented in Python as follows (because of the page limit, only part of the code is shown):
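The following minimal sketch illustrates the idea, assuming the chain is defined over three discretized daily states (down, flat, up); the actual class used in the project may well differ.

```python
import numpy as np

class SimpleMarkovChain:
    """First-order Markov chain over discretized daily price moves."""

    def __init__(self, n_states=3):
        self.n_states = n_states
        self.transition = np.full((n_states, n_states), 1.0 / n_states)

    def fit(self, states):
        # Count observed transitions i -> j, then normalize each row.
        counts = np.zeros((self.n_states, self.n_states))
        for i, j in zip(states[:-1], states[1:]):
            counts[i, j] += 1
        row_sums = counts.sum(axis=1, keepdims=True)
        self.transition = np.divide(
            counts, row_sums,
            out=np.full_like(counts, 1.0 / self.n_states),
            where=row_sums > 0,
        )
        return self

    def next_state_distribution(self, current_state):
        # Pr(X_{n+1} | X_n) -- depends only on the current state.
        return self.transition[current_state]


def discretize(prices, threshold=0.005):
    """Map daily returns to states: 0 = down, 1 = flat, 2 = up."""
    returns = np.diff(prices) / prices[:-1]
    return np.digitize(returns, [-threshold, threshold])


# Toy usage with synthetic prices standing in for the real series:
prices = 100 * np.exp(np.cumsum(np.random.normal(0, 0.01, 500)))
chain = SimpleMarkovChain().fit(discretize(prices))
print(chain.transition)
print(chain.next_state_distribution(2))  # tomorrow's state distribution given "up" today
```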
It must be pointed out that even with a much longer implementation, the resulting Markov chain is still quite crude; owing to space limitations, a large number of matrix operations are not listed here.

Importing the relevant stock data of New World Development Co., Ltd., we get:

Machine learning based on previous price:           0.323
Machine learning based on previous price change %:  0.286

Compared with the classical methods above, the Markov chain method can undoubtedly eliminate the interference of a lot of useless data, so it performs better in terms of both accuracy and running speed.

The theoretical basis of the Markov chain is very simple: the state of a system at the next moment depends only on its state at the current moment and has nothing to do with any earlier moment.

A Markov process in which both time and state are discrete is called a Markov chain, denoted as

{Xn = X(n), n = 1, 2, 3, …}

It can be seen as the result of successive observations of a discrete-state Markov process on the time set T = {0, 1, 2, …}. In practice, when this is extended to a Gaussian hidden Markov model, the model can be divided into four types according to the covariance-matrix type available to the observer:

spherical: in each hidden Markov state, all feature components of the observable state vector share the same variance. The off-diagonal elements of the corresponding covariance matrix are 0 and the diagonal values are all equal, giving the spherical characteristic. This is the simplest Gaussian distribution PDF.

diag: in each hidden Markov state, the observable state vector uses a diagonal covariance matrix. The off-diagonal elements of the corresponding covariance matrix are 0, and the diagonal values need not be equal. diag is the default type in hmmlearn.

full: in each hidden Markov state, the observable state vector uses a complete covariance matrix; all elements of the corresponding covariance matrix may be non-zero.

tied: all hidden Markov states share the same complete covariance matrix.
Among the four PDF types, spherical, diag and full represent three different Gaussian probability density functions, while tied can be regarded as a particular realization of the hidden Markov chain. full is the most expressive but requires enough data for a reasonable parameter estimation; spherical is the simplest and is usually used when data are insufficient or the hardware platform is limited; diag is a compromise between the two. In use, the appropriate type should be selected according to the correlation between the different features of the observable state vector. In this example, the full model is used.
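To illustrate the "full" type in code, here is a minimal hmmlearn sketch; the two-feature observation vector (log return and a crude volatility proxy), the three hidden regimes and the synthetic prices are assumptions made for illustration only, and hmmlearn must be installed separately.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Assumed observation vector per day: [log return, absolute log return].
rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500)))
log_returns = np.diff(np.log(prices))
volatility = np.abs(log_returns)              # crude stand-in for daily range
X = np.column_stack([log_returns, volatility])

# Three hidden regimes, one full covariance matrix per state (the 'full' type above).
model = GaussianHMM(n_components=3, covariance_type="full", n_iter=200, random_state=0)
model.fit(X)

hidden_states = model.predict(X)              # most likely regime for each day
print("Transition matrix:\n", model.transmat_)
print("Last 10 inferred regimes:", hidden_states[-10:])
```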
In fact, the finite-dimensional distributions of a Markov chain model are determined only by the initial distribution and the transition probabilities. This brevity hides many idealized assumptions: participants in the financial market are not completely "rational economic agents". With the introduction of the concepts of speculative bubbles and the momentum effect at the end of the 20th century, the efficient market hypothesis and the methods built on it began to be questioned.
Hurst Exponent Method based on Fractal Market Theory
The concept of a fractal was first proposed by Benoit Mandelbrot to describe irregular geometric characteristics. Tests based on the Lyapunov exponent and the fractal dimension show that capital markets exhibit chaotic behavior. With the development of nonlinear dynamics, a new perspective based on chaos and fractal theory provides us with a new way to predict stock trends.
For a discrete time series such as stock prices, the way the range of its fluctuations over a period changes with the size of the time span can often reveal the characteristics of the series.
The Hurst exponent H describes this time memory: it measures how the fluctuation range of a long series changes with the time span, namely

    R(n) / S(n) = A · n^H

where n is the number of observation points in the time series, representing the size of the time span; R(n) is the range of variation of these n observation points; and S(n) is their standard deviation. Standardizing R(n) by S(n) gives R(n)/S(n), the range rescaled by the standard deviation, called the rescaled range; A is a constant; H is the Hurst exponent.
There are three cases for the Hurst exponent:
1. If H = 0.5, the time series can be described by a random walk;
2. If 0.5 < H ≤ 1, the series shows persistence (continuity), which implies long-term memory;
3. If 0 ≤ H < 0.5, the series shows anti-persistence, that is, a mean-reverting process.
In other words, as long as H ≠ 0.5, the time-series data can be described by biased Brownian motion (fractional Brownian motion).
The calculation of the Hurst exponent is more complicated, but it can still be implemented in Python.
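A minimal rescaled-range (R/S) sketch is shown below; the log-return preprocessing and the doubling window sizes are illustrative choices and may differ from the procedure actually used for the report's figure.

```python
import numpy as np

def rescaled_range(segment):
    """R/S statistic of one segment of returns."""
    deviations = segment - segment.mean()
    cumulative = np.cumsum(deviations)
    r = cumulative.max() - cumulative.min()   # range of cumulative deviations
    s = segment.std()                         # standard deviation of the segment
    return r / s if s > 0 else np.nan

def hurst_exponent(prices, min_window=8):
    """Estimate H from the slope of log(R/S) against log(n)."""
    returns = np.diff(np.log(prices))
    sizes, rs_values = [], []
    window = min_window
    while window <= len(returns) // 2:
        chunks = [returns[i:i + window]
                  for i in range(0, len(returns) - window + 1, window)]
        rs_values.append(np.nanmean([rescaled_range(c) for c in chunks]))
        sizes.append(window)
        window *= 2
    # log(R/S) = log(A) + H*log(n): the slope is the Hurst exponent.
    slope, _ = np.polyfit(np.log(sizes), np.log(rs_values), 1)
    return slope

# Toy usage with synthetic prices standing in for the real closing prices:
prices = 100 * np.exp(np.cumsum(np.random.normal(0, 0.01, 1000)))
print("Estimated Hurst exponent:", round(hurst_exponent(prices), 3))
```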

After the program runs successfully, we obtain the Hurst exponent of New World Development Co., Ltd. over this period as:

Investment Strategy
1. Using machine learning algorithms for short-term prediction
This article provides several methods for short-term forecasting of stock prices using machine learning and mathematical economics. Among them, the linear regression model and the Markov chain model give the best prediction results here, but this does not settle the pros and cons of the models in other situations; in fact, different models suit different types of stock movement. Guided by the advantages and disadvantages described above, select an appropriate model to analyze and predict the stock price, then buy when the predicted price rises and sell when it falls.
2. Use the Hurst exponent for long-term forecasts
When 0.5 < H < 1.0, the fractal characteristics of the stock curve can be used for long-term prediction. According to the fractal market hypothesis, under normal circumstances a stock shows similar trend characteristics over different time scales. For example, the stock of New World Development Co., Ltd. went through a rise followed by a fall from July to August, and the same kind of pattern actually occurs over many shorter periods (a week or even a day). This self-similarity helps to judge roughly where the current trend sits within the familiar pattern, and hence whether to buy or sell.
3. Portfolio Investment Theory
The American economist Markowitz first proposed portfolio investment theory in 1952. The theory points out that the return of a portfolio of several securities is the weighted average of the returns of those securities, but the risk is not the weighted average of their risks, so a portfolio can reduce non-systematic risk. One can therefore use the various methods above to make short-term forecasts for a variety of stocks, use the Hurst exponent to judge the reliability of each forecast, and finally allocate the amount invested according to that reliability.

Summary
This article uses the five machine learning algorithms available in Excel and Orange (linear regression, neural network, SVM, kNN and decision tree/random forest) and two algorithms from mathematical economics (the Markov chain and the Hurst exponent), taking the stock of New World Development Co., Ltd. during 2017-2020 as an example, to analyze and predict stock trends, and puts forward some related investment strategies. However, owing to the limits of the author's level, the delivery time, the article length and the available computing power, some parts of the explanation have had to be skipped.
Stock trends were long considered unpredictable, but with the development of computer science and economics it is no longer impossible to find regularities in these seemingly random-walk stock prices. The application of machine learning in the field of economics still has a lot of room for exploration.

References
[1] H. Zhang, A. C. Berg, M. Maire and J. Malik, "SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition," 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006, pp. 2126-2136, doi: 10.1109/CVPR.2006.301.
[2] J. Wang, Z. Zhang, S. Guo, "Forecasting stock indices with back propagation neural network," Expert Systems with Applications, Volume 38, Issue 11, 2011, pp. 14346-14355, ISSN 0957-4174.
[3] Mankiw, G. N. (2021). Principles of Economics (7th ed.). Cengage Learning.
[4] Shiller, R. J. (2016). Irrational Exuberance: Revised and Expanded Third Edition (3rd ed.). Princeton University Press.
[5] Masís, S. (2021). Interpretable Machine Learning with Python. Van Haren Publishing.
[6] Theobald, O. (2021). Machine Learning for Absolute Beginners: A Plain English Introduction (Third Edition). Independently published.
