
COMP1901 Project 2 Research

Stock Selection: HK0017 New World Development Company Limited


Introduction:
Mathematical economics came into being after the Italian mathematician Giovanni Ceva creatively applied mathematical reasoning to economic questions in 1711. In the 21st century, the development of computer technology has greatly enhanced our ability to process statistical data, and it is no longer a fantasy to predict the trend of the financial market, to a certain extent, by combining probabilistic and statistical methods with computer programs. This article uses trading data of New World Development Company to briefly describe some simple stock forecasting methods from the perspectives of computer science, statistics and economics.
Classical Data Mining Algorithms
Linear Regression Method:
Linear regression analysis is the simplest and most commonly used analysis method. It rests on three mathematical ideas: first, the regression line of a set of samples must pass through the center (mean point) of the sample; second, a curve can be approximated by a straight line over a short interval; third, the line is found by the method of least squares.
One-variable linear regression computes, from a set of training samples, the straight line that minimizes the squared distance between the data points and the line.
The idea of linear regression is to first assume a straight line

    ŷ = θ0 + θ1·x

Substituting each value of the feature X into it gives the corresponding ŷi. The loss can then be defined as the sum of the squared differences between ŷi and yi:

    L(θ0, θ1) = Σi (ŷi − yi)²

The problem is thus transformed into finding the θ0 and θ1 that make L smallest.
(Figure: three-dimensional surface of the residual sum of squares as a function of θ0 and θ1.)
The remaining work is solved by the mathematical method of least squares.
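To make the least-squares step concrete, here is a minimal NumPy sketch of the one-variable case; the short price array is made-up illustration data, not the New World Development series.

```python
import numpy as np

# Toy stand-in for a training set: x = day index, y = closing price.
x = np.arange(10, dtype=float)
y = np.array([10.2, 10.4, 10.3, 10.6, 10.8, 10.7, 11.0, 11.1, 11.3, 11.2])

# Closed-form least squares: minimize sum((theta0 + theta1*x - y)^2).
theta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta0 = y.mean() - theta1 * x.mean()   # the fitted line passes through the sample center

y_hat = theta0 + theta1 * x
loss = np.sum((y_hat - y) ** 2)         # residual sum of squares L(theta0, theta1)
print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3f}, loss = {loss:.4f}")
```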


The advantages and disadvantages of linear fitting are obvious. On the one hand, it requires little computation, the method is simple, and it fits fairly uniformly distributed samples well; on the other hand, its applicable range is relatively narrow, and for very large or unevenly distributed samples linear regression analysis is less effective.

The prediction function that comes with Excel is in fact a linear regression prediction. We use Excel's built-in function to draw the regression line (figure omitted). Verifying with Orange at the same time, we obtain the results shown in the figure.
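For readers who prefer code to a screenshot, the same kind of check can be reproduced with scikit-learn, the library Orange builds on. The sketch below is only an illustration: the file name new_world_close.csv, the column name Close and the five-day lag window are assumptions rather than the exact setup used in the original experiment.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Assumed input: a CSV with a 'Close' column of daily closing prices.
close = pd.read_csv("new_world_close.csv")["Close"].to_numpy()

# Lag features: predict today's close from the previous 5 closes.
lags = 5
X = np.column_stack([close[i:len(close) - lags + i] for i in range(lags)])
y = close[lags:]

# Time-ordered split: first 80% for training, last 20% for testing.
split = int(0.8 * len(y))
model = LinearRegression().fit(X[:split], y[:split])
pred = model.predict(X[split:])

print("MAE:", mean_absolute_error(y[split:], pred))
print("R2 :", r2_score(y[split:], pred))
```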

Neural Network Method

Artificial neural networks are models inspired by the structure and function of biological neural networks. They are a type of pattern-matching method, usually used for regression and classification problems, but they actually form a huge subfield containing hundreds of algorithms for many types of problem.
Algorithm advantages
1. High classification accuracy and an extremely strong learning ability.
2. Strong robustness and fault tolerance to noisy data.
3. The ability to form associations and to approximate any non-linear relationship.
Algorithm shortcomings
1. Neural networks have many parameters, weights and thresholds.
2. The training is a black-box process; intermediate results cannot be observed.
3. The learning process is relatively long and may fall into a local minimum.
The result after implementing it with Orange is:
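As a rough counterpart of the Orange neural-network widget, here is a minimal scikit-learn sketch; the MLP architecture, the lag features and the file and column names are illustrative assumptions, not the report's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score

close = pd.read_csv("new_world_close.csv")["Close"].to_numpy()  # assumed column name
lags = 5
X = np.column_stack([close[i:len(close) - lags + i] for i in range(lags)])
y = close[lags:]
split = int(0.8 * len(y))

# A small multilayer perceptron; standardizing the inputs helps it converge.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("MAE:", mean_absolute_error(y[split:], pred), "R2:", r2_score(y[split:], pred))
```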

SVM Method
SVM (Support Vector Machine) is fundamentally a two-class model, although after modification it can also be used for multi-class classification. Support vector machines can be divided into two categories, linear-kernel and non-linear. The main idea is to find a hyperplane in the feature space that best separates the data samples, maximizing the margin between the hyperplane and the nearest samples.
Algorithm advantages
1. Works for machine learning on small samples.
2. Can solve non-linear problems.
3. Has no local-minimum problem (compared with algorithms such as neural networks) and handles high-dimensional data sets well.
4. Strong generalization ability.
Algorithm shortcomings
1. The high-dimensional mapping of the kernel function is hard to interpret, especially for the radial basis function.
2. Sensitive to missing data.
Similarly, we use Orange to test the SVM method.
The results of the operation are:
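A minimal support-vector-regression sketch in the same style is shown below; the RBF kernel and the hyperparameters C and epsilon are illustrative choices, and the input file and lag features are assumed as before.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score

close = pd.read_csv("new_world_close.csv")["Close"].to_numpy()  # assumed column name
lags = 5
X = np.column_stack([close[i:len(close) - lags + i] for i in range(lags)])
y = close[lags:]
split = int(0.8 * len(y))

# RBF-kernel support vector regression; SVMs are sensitive to feature scale,
# so the inputs are standardized first.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("MAE:", mean_absolute_error(y[split:], pred), "R2:", r2_score(y[split:], pred))
```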

KNN Method
The k-nearest-neighbors classification algorithm is one of the simplest methods in data-mining classification.
The principle is: if most of the k nearest samples to a sample in the feature space belong to a certain category, then the sample also belongs to that category and shares the characteristics of the samples in that category.
Algorithm advantages:
1. Simple, easy to understand and implement; no parameters to estimate and no training phase;
2. Suitable for classifying rare events;
3. Especially suitable for multi-class problems (multi-modal objects, objects with multiple category labels), where kNN can perform better than SVM.
Algorithm disadvantages:
1. Needs a lot of space to store the known instances, and the computational complexity is high.
Because of this complexity, we again use Orange for the implementation.
The results of the operation are:
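The corresponding k-nearest-neighbors sketch is shown below; k = 5 and the lag-feature setup are again illustrative assumptions rather than the Orange defaults used in the report.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, r2_score

close = pd.read_csv("new_world_close.csv")["Close"].to_numpy()  # assumed column name
lags = 5
X = np.column_stack([close[i:len(close) - lags + i] for i in range(lags)])
y = close[lags:]
split = int(0.8 * len(y))

# Predict today's close as the average of the k most similar past days
# (similarity measured on the 5 previous closes).
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("MAE:", mean_absolute_error(y[split:], pred), "R2:", r2_score(y[split:], pred))
```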
Random Forest Method
The decision tree method builds a tree from the actual values of the attributes in the data, and a random forest is an ensemble of such decision trees trained on random subsets of the samples and features. Decision trees are often trained for classification and regression problems; they are usually fast and accurate and are among the most popular models in machine learning.
Algorithm advantages
1. Decision trees are easy to understand and explain, can be visualized and analyzed, and rules are easy to extract.
2. They can process nominal and numerical data at the same time.
3. When testing on a data set, the running speed is relatively fast.
4. Decision trees extend well to large databases, and their size is independent of the size of the database.
Algorithm shortcomings
1. Missing data are difficult to deal with.
2. Prone to overfitting (random forests mitigate this by averaging many trees).
3. Correlations between attributes in the data set are ignored.
4. When the ID3 algorithm calculates information gain, the result is biased toward attributes with more values.
The results displayed by Orange are:
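A minimal random-forest sketch in the same style follows; the number of trees and the lag features are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

close = pd.read_csv("new_world_close.csv")["Close"].to_numpy()  # assumed column name
lags = 5
X = np.column_stack([close[i:len(close) - lags + i] for i in range(lags)])
y = close[lags:]
split = int(0.8 * len(y))

# An ensemble of 200 decision trees; averaging their predictions reduces overfitting.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("MAE:", mean_absolute_error(y[split:], pred), "R2:", r2_score(y[split:], pred))
```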

Analysis

(Figures: fitted curves for the Linear Regression, kNN, SVM, Neural Network and Random Forest models; grey is the simulated curve, blue is the original stock curve.)
By comparing parameters such as MAE and R² and studying the fitted curves of the various models, we found that the linear regression model performs best among these models. We therefore use the linear regression model for further calculations:
10-day moving average:                              0.832
5-day moving average:                               0.502
10-day forecast:                                    0.637
5-day forecast:                                     0.502
Machine learning based on previous price:           0.488
Machine learning based on previous price change %:  0.442
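For reference, moving-average baselines like those in the table can be computed in a few lines of pandas. The sketch below is only illustrative: the metric behind the numbers above is not specified in the report, and the file and column names are assumptions.

```python
import pandas as pd

close = pd.read_csv("new_world_close.csv")["Close"]  # assumed column name

# Naive forecast: today's predicted price is the mean of the previous k closes.
for k in (5, 10):
    forecast = close.rolling(window=k).mean().shift(1)  # shift so only past data is used
    mae = (forecast - close).abs().mean()
    print(f"{k}-day moving average, MAE = {mae:.3f}")
```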
At the same time, there is an unexplained but noteworthy phenomenon: when the original stock curve fluctuates drastically, the accuracy of most forecasting models drops significantly.

(Figure: orange is the linear regression method.)

Mathematical Economics Algorithms


Markov Chain Method based on the Efficient Market Hypothesis
The Efficient Market Hypothesis (EMH) was proposed and developed by the famous American economist Eugene Fama in 1970. The hypothesis holds that in a stable stock market all valuable information is already fully reflected in the current stock price, so it is impossible for investors to obtain excess profits above the market average by analyzing past prices. The stock price described by this theory is therefore memoryless in the statistical sense, and stock trading can be idealized as a Markov process.
The stock trend evolves according to a discrete-time Markov chain, each step of which is a random change of state. A discrete-time Markov chain is a sequence of random variables X1, X2, X3, … that satisfies the Markov property: the transition probability depends only on the current state and has nothing to do with earlier states. In terms of probabilities, this is expressed as:
Pr(Xn+1 = x | X1 = x1, X2 = x2, …, Xn = xn) = Pr(Xn+1 = x | Xn = xn)
The matrix P(m, m+n) = (Pij(m, m+n)) composed of the transition probabilities is called the transition probability matrix of the Markov chain.
When the transition probability Pij depends only on i, j and the time interval n, the transition probabilities are said to be stationary and the chain is said to be time-homogeneous. The Markov chains usually discussed are homogeneous chains.
It can be seen that the distribution of Xn+1 depends only on that of Xn, so only the previous state is needed to determine the probability distribution of the current state, and the independence from earlier history is satisfied. A Markov-chain class can be implemented in Python as follows (because of the page limit, only part of the code is shown):
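The following minimal sketch illustrates the idea, assuming the chain is defined over three discretized daily states (down, flat, up); the actual class used in the project may well differ.

```python
import numpy as np

class SimpleMarkovChain:
    """First-order Markov chain over discretized daily price moves."""

    def __init__(self, n_states=3):
        self.n_states = n_states
        self.transition = np.full((n_states, n_states), 1.0 / n_states)

    def fit(self, states):
        # Count observed transitions i -> j, then normalize each row.
        counts = np.zeros((self.n_states, self.n_states))
        for i, j in zip(states[:-1], states[1:]):
            counts[i, j] += 1
        row_sums = counts.sum(axis=1, keepdims=True)
        self.transition = np.divide(
            counts, row_sums,
            out=np.full_like(counts, 1.0 / self.n_states),
            where=row_sums > 0,
        )
        return self

    def next_state_distribution(self, current_state):
        # Pr(X_{n+1} | X_n) -- depends only on the current state.
        return self.transition[current_state]


def discretize(prices, threshold=0.005):
    """Map daily returns to states: 0 = down, 1 = flat, 2 = up."""
    returns = np.diff(prices) / prices[:-1]
    return np.digitize(returns, [-threshold, threshold])


# Toy usage with synthetic prices standing in for the real series:
prices = 100 * np.exp(np.cumsum(np.random.normal(0, 0.01, 500)))
chain = SimpleMarkovChain().fit(discretize(prices))
print(chain.transition)
print(chain.next_state_distribution(2))  # tomorrow's state distribution given "up" today
```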
It must be pointed out that even with a much longer implementation, the resulting Markov chain is still quite crude; owing to space limitations, a large number of matrix operations are not listed here.

Importing the relevant stock data of New World Development Co., Ltd., we get:

Machine learning based on previous price:           0.323
Machine learning based on previous price change %:  0.286

Compared with the classical methods above, the Markov chain method can undoubtedly eliminate the interference of a lot of useless data, so it performs better in terms of both accuracy and running speed.

The theoretical basis of the Markov chain is very simple: the state of a system at the next moment depends only on its state at the current moment and has nothing to do with any earlier moment.

A Markov process in which both time and state are discrete is called a Markov chain, denoted as

{Xn = X(n), n = 1, 2, 3, …}

It can be seen as the result of successive observations of a discrete-state Markov process on the time set T = {0, 1, 2, …}. In practice, when this is extended to a Gaussian hidden Markov model, the model can be divided into four types according to the covariance-matrix type available to the observer:

spherical: in each hidden Markov state, all feature components of the observable state vector share the same variance. The off-diagonal elements of the corresponding covariance matrix are 0 and the diagonal values are all equal, giving the spherical characteristic. This is the simplest Gaussian distribution PDF.

diag: in each hidden Markov state, the observable state vector uses a diagonal covariance matrix. The off-diagonal elements of the corresponding covariance matrix are 0, and the diagonal values need not be equal. diag is the default type in hmmlearn.

full: in each hidden Markov state, the observable state vector uses a complete covariance matrix; all elements of the corresponding covariance matrix may be non-zero.

tied: all hidden Markov states share the same complete covariance matrix.
Among the four PDF types, spherical, diag and full represent three different Gaussian probability density functions, while tied can be regarded as a particular realization of the hidden Markov chain. full is the most expressive but requires enough data for a reasonable parameter estimation; spherical is the simplest and is usually used when data are insufficient or the hardware platform is limited; diag is a compromise between the two. In use, the appropriate type should be selected according to the correlation between the different features of the observable state vector. In this example, the full model is used.
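To illustrate the "full" type in code, here is a minimal hmmlearn sketch; the two-feature observation vector (log return and a crude volatility proxy), the three hidden regimes and the synthetic prices are assumptions made for illustration only, and hmmlearn must be installed separately.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Assumed observation vector per day: [log return, absolute log return].
rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500)))
log_returns = np.diff(np.log(prices))
volatility = np.abs(log_returns)              # crude stand-in for daily range
X = np.column_stack([log_returns, volatility])

# Three hidden regimes, one full covariance matrix per state (the 'full' type above).
model = GaussianHMM(n_components=3, covariance_type="full", n_iter=200, random_state=0)
model.fit(X)

hidden_states = model.predict(X)              # most likely regime for each day
print("Transition matrix:\n", model.transmat_)
print("Last 10 inferred regimes:", hidden_states[-10:])
```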
In fact, the finite-dimensional distributions of a Markov chain model are determined only by the initial distribution and the transition probabilities. This brevity hides many idealized assumptions: participants in the financial market are not completely "rational economic agents". With the introduction of the concepts of speculative bubbles and the momentum effect at the end of the 20th century, the efficient market hypothesis and the methods built on it began to be questioned.
Hurst Exponent Method based on Fractal Market Theory
The concept of a fractal was first proposed by Benoit Mandelbrot to describe irregular geometric characteristics. Tests based on the Lyapunov exponent and the fractal dimension show that capital markets exhibit chaotic behavior. With the development of nonlinear dynamics, a new perspective based on chaos and fractal theory provides us with a new way to predict stock trends.
For a discrete time series such as stock prices, the way the range of its fluctuations over a period changes with the size of the time span can often reveal the characteristics of the series.
The Hurst exponent H describes this time memory: it measures how the fluctuation range of a long series changes with the time span, namely

    R(n) / S(n) = A · n^H

where n is the number of observation points in the time series, representing the size of the time span; R(n) is the range of variation of these n observation points; and S(n) is their standard deviation. Standardizing R(n) by S(n) gives R(n)/S(n), the range rescaled by the standard deviation, called the rescaled range; A is a constant; H is the Hurst exponent.
There are three cases for the Hurst exponent:
1. If H = 0.5, the time series can be described by a random walk;
2. If 0.5 < H ≤ 1, the series shows persistence (continuity), which implies long-term memory;
3. If 0 ≤ H < 0.5, the series shows anti-persistence, that is, a mean-reverting process.
In other words, as long as H ≠ 0.5, the time-series data can be described by biased Brownian motion (fractional Brownian motion).
The calculation of the Hurst exponent is more complicated, but it can still be implemented in Python.
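A minimal rescaled-range (R/S) sketch is shown below; the log-return preprocessing and the doubling window sizes are illustrative choices and may differ from the procedure actually used for the report's figure.

```python
import numpy as np

def rescaled_range(segment):
    """R/S statistic of one segment of returns."""
    deviations = segment - segment.mean()
    cumulative = np.cumsum(deviations)
    r = cumulative.max() - cumulative.min()   # range of cumulative deviations
    s = segment.std()                         # standard deviation of the segment
    return r / s if s > 0 else np.nan

def hurst_exponent(prices, min_window=8):
    """Estimate H from the slope of log(R/S) against log(n)."""
    returns = np.diff(np.log(prices))
    sizes, rs_values = [], []
    window = min_window
    while window <= len(returns) // 2:
        chunks = [returns[i:i + window]
                  for i in range(0, len(returns) - window + 1, window)]
        rs_values.append(np.nanmean([rescaled_range(c) for c in chunks]))
        sizes.append(window)
        window *= 2
    # log(R/S) = log(A) + H*log(n): the slope is the Hurst exponent.
    slope, _ = np.polyfit(np.log(sizes), np.log(rs_values), 1)
    return slope

# Toy usage with synthetic prices standing in for the real closing prices:
prices = 100 * np.exp(np.cumsum(np.random.normal(0, 0.01, 1000)))
print("Estimated Hurst exponent:", round(hurst_exponent(prices), 3))
```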

After the program runs successfully, we obtain the Hurst exponent of New World Development Co., Ltd. over this period as:

Investment Strategy
1. Using machine learning algorithms for short-term prediction
This article provides several methods for short-term forecasting of stock prices using machine learning and mathematical economics. Among them, the linear regression model and the Markov chain model give the best prediction results here, but this does not settle the pros and cons of the models in other situations; in fact, different models suit different types of stock movement. Guided by the advantages and disadvantages described above, select an appropriate model to analyze and predict the stock price, then buy when the predicted price rises and sell when it falls.
2. Use the Hurst exponent for long-term forecasts
When 0.5 < H < 1.0, the fractal characteristics of the stock curve can be used for long-term prediction. According to the fractal market hypothesis, under normal circumstances a stock shows similar trend characteristics over different time scales. For example, the stock of New World Development Co., Ltd. went through a rise followed by a fall from July to August, and the same kind of pattern actually occurs over many shorter periods (a week or even a day). This self-similarity helps to judge roughly where the current trend sits within the familiar pattern, and hence whether to buy or sell.
3. Portfolio Investment Theory
The American economist Markowitz first proposed portfolio investment theory in 1952. The theory points out that the return of a portfolio of several securities is the weighted average of the returns of those securities, but the risk is not the weighted average of their risks, so a portfolio can reduce non-systematic risk. One can therefore use the various methods above to make short-term forecasts for a variety of stocks, use the Hurst exponent to judge the reliability of each forecast, and finally allocate the amount invested according to that reliability.

Summary
This article uses the five machine learning algorithms available in Excel and Orange (linear regression, neural network, SVM, kNN and decision tree/random forest) and two algorithms from mathematical economics (the Markov chain and the Hurst exponent), taking the stock of New World Development Co., Ltd. during 2017-2020 as an example, to analyze and predict stock trends, and puts forward some related investment strategies. However, owing to the limits of the author's level, the delivery time, the article length and the available computing power, some parts of the explanation have had to be skipped.
Stock trends were long considered unpredictable, but with the development of computer science and economics it is no longer impossible to find regularities in these seemingly random-walk stock prices. The application of machine learning in the field of economics still has a lot of room for exploration.

References
[1] H. Zhang, A. C. Berg, M. Maire and J. Malik, "SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition," 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006, pp. 2126-2136, doi: 10.1109/CVPR.2006.301.
[2] J. Wang, Z. Zhang, S. Guo, "Forecasting stock indices with back propagation neural network," Expert Systems with Applications, Volume 38, Issue 11, 2011, pp. 14346-14355, ISSN 0957-4174.
[3] Mankiw, G. N. (2021). Principles of Economics (7th ed.). Cengage Learning.
[4] Shiller, R. J. (2016). Irrational Exuberance: Revised and Expanded Third Edition (3rd ed.). Princeton University Press.
[5] Masís, S. (2021). Interpretable Machine Learning with Python. Van Haren Publishing.
[6] Theobald, O. (2021). Machine Learning for Absolute Beginners: A Plain English Introduction (Third Edition). Independently published.
