COMP1901 Research Project
The prediction function built into Excel is in fact linear-regression prediction.
We use Excel's built-in trendline function to draw the regression line:
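Excel's linear trendline and FORECAST functions fit an ordinary least-squares line; as a rough Python equivalent (numpy assumed, with placeholder data rather than the actual stock prices), one could write:

```python
# Minimal sketch of the least-squares line behind Excel's trendline.
# The day/price arrays are placeholders, not the real data set.
import numpy as np

days = np.arange(10, dtype=float)                    # time index
prices = 100 + 0.5 * days + np.random.default_rng(0).normal(0, 0.3, 10)

slope, intercept = np.polyfit(days, prices, 1)       # least-squares fit
print(slope, intercept)
print(slope * 10 + intercept)                        # predict the next day
```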
SVM Method
SVM (Support Vector Machine) is, in its basic form, a two-class classification model; after modification it can also be used for multi-class classification problems. Support vector machines can be divided into two categories: those with a linear kernel and those with a nonlinear kernel. The main idea is to find a hyperplane in the feature space that best separates the data samples, by maximizing the distance from the nearest samples to this hyperplane.
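As a minimal sketch of this idea (scikit-learn assumed; the toy points below are placeholders, not our stock data), a linear-kernel SVM can be fitted as follows:

```python
# Fit a maximum-margin (linear-kernel) SVM on toy two-class data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],   # class 0 samples
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])  # class 1 samples
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)    # linear kernel: separating hyperplane
clf.fit(X, y)

print(clf.support_vectors_)          # the samples that define the margin
print(clf.predict([[4.0, 4.0]]))     # classify a new point
```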
Algorithm advantages
1. Works well with small training samples.
2. Can solve non-linear problems (via kernel functions).
3. Has no local-minimum problem (compared with algorithms such as neural networks).
4. Handles high-dimensional data sets well.
5. Strong generalization ability.
Algorithm shortcomings
1. The high-dimensional mapping of the kernel function is hard to interpret, especially for the radial basis function.
2. Sensitive to missing data.
Similarly, we use Orange to test the SVM method.
The results of the operation are:
kNN Method
The k-nearest-neighbour (kNN) classification algorithm is one of the simplest methods in data-mining classification technology.
The principle is: if most of the k nearest samples to a sample in the feature space belong to a certain category, then the sample also belongs to that category and shares the characteristics of the samples in that category.
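A minimal sketch of this majority-vote rule (scikit-learn assumed; the data are placeholders):

```python
# kNN classifies a point by a majority vote among its k nearest samples.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0], [1.2], [1.4], [5.0], [5.2], [5.4]])  # 1-D feature
y = np.array([0, 0, 0, 1, 1, 1])                          # class labels

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 nearest samples
knn.fit(X, y)                              # no real training, just storage

print(knn.predict([[1.3], [5.1]]))         # majority vote among 3 neighbours
```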
Algorithm advantages:
1. Simple, easy to understand, easy to implement, no need to estimate parameters, no training;
2. Suitable for classifying rare events;
3. Especially suitable for multi-class problems (multi-modal objects with multiple category labels), where kNN can perform better than SVM.
Algorithm disadvantages:
Needs a lot of space to store known instances, and the algorithm complexity is high.
Because of the high complexity of this algorithm, we again use Orange to implement it.
The results of the operation are:
Random Forest Method
A random forest is an ensemble of decision trees. The decision tree method is constructed from the actual values of the attributes in the data. Decision trees are often trained for classification and regression problems; they are usually fast and accurate, and are among the most popular methods in machine learning.
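As a rough sketch (scikit-learn assumed; the data below are synthetic placeholders), a random forest built from such decision trees can be fitted as follows:

```python
# Fit a random forest (an ensemble of decision trees) on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 3 made-up numeric attributes
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.1, size=100)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:5]))                  # in-sample predictions
print(forest.feature_importances_)            # which attributes the trees used
```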
Algorithm advantages
1. The decision tree is easy to understand and explain, can be visualized and analyzed, and it is
easy to extract rules.
2. It can process nominal and numerical data at the same time.
3. When testing the data set, the running speed is relatively fast.
4. The decision tree can be well extended to large databases, and its size is independent of the size
of the database.
Algorithm shortcomings
1. It is difficult to deal with missing data.
2. Prone to overfitting.
3. Ignore the correlation of attributes in the data set.
4. When the ID3 algorithm calculates information gain, the result is biased towards attributes with more distinct values.
The results displayed by Orange are:
Analysis
By comparing metrics such as MAE and R2 and studying the fitting curves of the various models, we found that among these models the linear regression model performs best. So we use the linear regression model for further calculations:
10-day moving average: 0.832
5-day moving average: 0.502
10-day forecast: 0.637
5-day forecast: 0.502
Machine learning based on previous price: 0.488
Machine learning based on previous price change %: 0.442
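For reference, the MAE and R2 metrics used in this comparison can be computed as follows (scikit-learn assumed; the actual and predicted values below are placeholders):

```python
# How MAE and R2 are computed for a set of predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

actual = np.array([10.1, 10.3, 10.2, 10.6, 10.8])
predicted = np.array([10.0, 10.4, 10.3, 10.5, 10.9])

print(mean_absolute_error(actual, predicted))  # average absolute error
print(r2_score(actual, predicted))             # 1.0 means a perfect fit
```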
At the same time, there are some unexplained but noteworthy phenomena: when the original stock curve fluctuates drastically, the accuracy of most forecasting models drops significantly.
Importing the relevant stock data of New World Development Co., Ltd., we can get:
Compared with the previous four classical methods, the Markov chain method can filter out the interference of a lot of irrelevant data. Therefore, the Markov method performs better in terms of both accuracy and running speed.
The theoretical basis of a Markov chain is very simple: the state of a system at the next moment depends only on its state at the present moment and has nothing to do with any earlier moment.
A Markov process in which both time and state are discrete is called a Markov chain. The Markov property is denoted as
P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P(X_{n+1} = j | X_n = i).
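As a minimal sketch of applying this to daily price movements (the price list below is a placeholder, not the actual data), a two-state up/down transition matrix can be estimated as follows:

```python
# Estimate a two-state (up/down) Markov transition matrix from daily closes.
import numpy as np

prices = np.array([10.0, 10.2, 10.1, 10.4, 10.3, 10.5, 10.6, 10.4])
states = (np.diff(prices) > 0).astype(int)   # 1 = up day, 0 = down day

counts = np.zeros((2, 2))
for s, t in zip(states[:-1], states[1:]):    # count state -> next-state moves
    counts[s, t] += 1

transition = counts / counts.sum(axis=1, keepdims=True)
print(transition)        # row i: P(next state = j | current state = i)

# One-step prediction: most likely next state given today's state
today = states[-1]
print("up" if transition[today].argmax() == 1 else "down")
```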
In rescaled range (R/S) analysis, the range R of the cumulative deviations from the mean is rescaled by the standard deviation S; the ratio R/S is called the rescaled range and satisfies R/S = A * n^H, where A is a constant, n is the number of observations, and H is the Hurst exponent.
There are three cases for the Hurst exponent:
1. If H = 0.5, the time series can be described by a random walk;
2. If 0.5 < H ≤ 1, it indicates persistence, which implies a time series with long-term memory;
3. If 0 ≤ H < 0.5, it indicates anti-persistence, that is, a mean-reverting process.
In other words, as long as H ≠ 0.5, the time series data can be described by biased Brownian motion (fractional Brownian motion).
The calculation of the Hurst exponent is more involved, but it can still be implemented in Python; one possible approach is sketched below.
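This sketch uses the scaling of the standard deviation of lagged differences to estimate H (numpy assumed; the random-walk series is a placeholder for the real closing prices):

```python
# Simple Hurst-exponent estimate from the scaling of lagged differences.
import numpy as np

def hurst(series, max_lag=20):
    """Estimate H from how the std of lag-q differences grows with q."""
    lags = range(2, max_lag)
    tau = [np.std(series[lag:] - series[:-lag]) for lag in lags]
    # For fractional Brownian motion, std of lag-q differences ~ q**H,
    # so the log-log slope is an estimate of H.
    slope, _ = np.polyfit(np.log(list(lags)), np.log(tau), 1)
    return slope

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(size=1000)) + 100.0   # random-walk placeholder
print(hurst(prices))   # expect a value near 0.5 for a random walk
```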
After the program runs successfully, we obtain the Hurst exponent of New World Development Co., Ltd. over this period as:
Investment Strategy
1. Using machine learning algorithms for short-term prediction
This article provides several methods for short-term forecasting of stock prices using machine learning and mathematical-economics methods. Among them, the linear regression model and the Markov chain model give the best prediction results, but this does not mean they are superior in every situation; in fact, different models are suited to different types of stock movements. Guided by the advantages and disadvantages described in this article, one can select an appropriate model to analyse and predict the stock price, then buy when the predicted price rises and sell when it falls, as sketched below.
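A minimal sketch of this rule (the prices and model predictions below are placeholders):

```python
# Hold the stock only on days when the model predicts the price will rise.
import numpy as np

prices = np.array([10.0, 10.2, 10.1, 10.5, 10.4, 10.7])
predicted_next = np.array([10.2, 10.1, 10.4, 10.4, 10.8, 10.6])  # model output

signal = np.where(predicted_next > prices, 1, 0)   # 1 = hold the stock
returns = np.diff(prices) / prices[:-1]
strategy_returns = signal[:-1] * returns           # earn only while holding
print((1 + strategy_returns).prod() - 1)           # cumulative return
```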
2. Use the Hurst exponent for long-term forecasts
When 0.5 < Hurst < 1.0, the fractal characteristics of the stock curve can be used for long-term prediction. According to the fractal market hypothesis, under normal circumstances stocks show similar trend characteristics over different time periods. For example, the stock of New World Development Co., Ltd. first rose and then fell between July and August, and this kind of pattern also appears over many shorter periods (such as a week or even a day). Investors can use this self-similarity to judge roughly where the current trend sits within the similar range, and then decide whether to buy or sell.
3. Portfolio Investment Theory
The American economist Markowitz first proposed portfolio investment theory in 1952. The theory points out that the return of a portfolio of several securities is the weighted average of the returns of those securities, but the risk is not the weighted average of their risks, so a portfolio can reduce non-systematic risk. One can therefore use the methods above to make short-term forecasts for a variety of stocks, use the Hurst exponent to gauge the reliability of each forecast, and finally allocate the investment amounts according to that reliability, as in the sketch below.
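A minimal numerical sketch of the diversification effect (the return series below are randomly generated placeholders):

```python
# Portfolio return equals the weighted average of asset returns, but the
# portfolio risk (std) is generally lower than the weighted average of risks.
import numpy as np

rng = np.random.default_rng(0)
r_a = rng.normal(0.001, 0.02, 250)            # daily returns of stock A (fake)
r_b = rng.normal(0.001, 0.03, 250)            # daily returns of stock B (fake)
w = np.array([0.6, 0.4])                      # portfolio weights

portfolio = w[0] * r_a + w[1] * r_b
print(portfolio.mean(), w[0] * r_a.mean() + w[1] * r_b.mean())  # equal
print(portfolio.std(), w[0] * r_a.std() + w[1] * r_b.std())     # std is lower
```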
Summary
This article is based on the five machine learning algorithms (linear regression, neural network, SVM, kNN, and decision tree) available in Excel and Orange, together with two methods from the field of mathematical economics (the Markov chain and the Hurst exponent). Taking the stock of New World Development Co., Ltd. during 2017-2020 as an example, it analyses and predicts stock trends and puts forward some relevant stock investment strategies. However, due to the limitations of the author's level, the delivery time, the article length, and computing capabilities, some parts of the explanation had to be skipped.
Stock trends were long considered unpredictable, but with the development of computer science and economics it is no longer impossible to find regularities in these seemingly random-walk stock prices. The application of machine learning in the field of economics still has a lot of room for exploration.