Forecasting Hong Kong Hang Seng Index Stock Price Movement Using Social Media Data Analysis
September 2017
Statement of Authorship
Except where reference is made in the text of this dissertation, this dissertation
contains no material published elsewhere or extracted in whole or in part from a
dissertation presented by me for another degree or diploma.
No other person’s work has been used without due acknowledgement in the main text
of the dissertation.
This dissertation has not been submitted for the award of any other degree or diploma
in any other tertiary institution.
___________________________
Dated: 20/04/2018
Acknowledgements
After eight months of intensive research and development, this note of thanks marks
the completion of my dissertation and of my master's programme. I have learned,
implemented and gained a great deal during this period of intensive study and writing,
not only in mathematics and science but also in my personal attitude, training and
knowledge.
I would like to express my sincere gratitude to Prof. CHAN Chun Chung Keith, whose
work showed me how the Hang Seng Index can be studied through comparative social
media data analysis, and suggested that this new trend in data science for prediction
is a quest for our times. In addition, I thank the Hong Kong Polytechnic University
for permission to use copyrighted documentation, originally published publicly, in my
dissertation.
Abstract
Intelligent, high-tech algorithms are the latest step in the evolution of stock price
movement prediction, and Social Media Data Analysis (SMeDA) builds on that earlier
research. Hong Kong social media data are publicly available as text and are well
suited to collection. Predicting the movement of the Hang Seng Index (HSI) from the
views that stock analysts express on financial and economic social media websites has
been an intriguing field of research. The emotional tone of their text expresses their
perspective on the future movement of stock prices. The thesis of this dissertation is
to examine the credibility and reference value of stock analysts' movement
predictions.
This dissertation applies text and data analysis, including supervised machine
learning algorithms, to features extracted from social media articles, and analyses
the correlation between stock price prediction and stock commentary, compared with
prediction from historical stock data. The level of credibility of stock analysts'
text can roughly indicate the development trend of the HSI and can serve the public as
a reference (Appendix V and Appendix VI provide the program code of all algorithms and
the features of the raw data used by the algorithms).
Keywords: Social Media Data Analysis, Hang Seng Index, stock analysts' prediction
movement.
Table of Contents
Statement of Authorship
Acknowledgements
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
    1.1 Aims and Objectives
    1.2 Research Process
    1.3 Contribution and Originality
    1.4 Dissertation Overview
Chapter 2 Background
    2.1 General Terms and Concepts
        2.1.1 Period of Investment
        2.1.2 Hang Seng Index (HSI)
        2.1.3 Social Media Mining Analysis
        2.1.4 Data Mining
        2.1.5 Text Mining – Text Analysis
        2.1.6 Summary
    2.2 Technical Background & Description
        2.2.1 Time Series Similarity Analysis
        2.2.2 Learning Algorithms
        2.2.3 Text Processing and Understanding
        2.2.4 Classification algorithms
        2.2.5 Text Summarization
        2.2.6 Text Analysis
        2.2.7 Performance Evaluation
    2.3 Movement Performance Index (MPI) Research
    2.4 Summary
Chapter 3 Design Approach
    3.1 Analysis Structure and Design
    3.2 Data Acquisition
    3.3 Data Preparation
    3.4 Data Pre-Processing & Munging
        3.4.1 Bag-of-words model
        3.4.2 Topic Modelling
    3.5 Social Media Analysis Methodologies
        3.5.1 Analysis SMeDA
        3.5.2 Data mining
    3.6 Feature Selection
    3.7 Feature Extraction
    3.8 Summary
Chapter 4 Experimental Framework
    4.1 Basic Experiment
        4.1.1 Features
        4.1.2 Prediction Score and Movement
    4.2 Summary
Chapter 5 Analysis Result and Evaluation
    5.1 Basic Form
    5.2 Data Analysis Result
        5.2.1 Regression Linear
    5.3 Social Media Analysis Result
        5.3.1 Stop words and stemming word
        5.3.2 Support Vector Machine (SVM)
        5.3.3 Multinomial Naïve Bayesian (MNB)
        5.3.3 Random Forest (RF)
        5.3.4 eXtreme Gradient Boosting (Xgboost)
    5.4 Comparison on Prediction Movement Result
    5.5 Goal-Based Trading Experiment
    5.6 Issues Influencing HSI Prediction Result
Chapter 6 Conclusion
Chapter 7 Further Research & Limitation
Selected Bibliography
Appendix I – The relationship between SP500 and HSI
Appendix II – Technical Indicators
Appendix III – Feature Importance SVM
Appendix IV – Feature Importance Naïve Bayes
Appendix V – Algorithm using Python Programming Coding
Appendix VI – Feature of raw data for algorithm
List of Figures
Figure 1. Venn diagram describing the intersection of disciplines
Figure 9. Hang Seng Index (HSI) basic trend from Sep 2017 to Feb 2018
Figure 10. Prediction future trend by regression linear until 8th February 2018
Figure 16. Buy Hong Kong 50 statistics record on 16th February 2018
List of Tables
Table 1. 2017 Hang Seng Index Constituents
Table 2. Part of Speech (POS) tagging list
Table 3. Stock attribute description
Table 4. Top 10 positive features for linear SVM
Table 5. Top 10 negative features for linear SVM
Table 6. Top 20 negative features for linear Multinomial Naïve Bayesian
Table 7. Features importance Random Forest
Table 8. Features importance eXtreme Gradient Boosting
Table 9. Accuracy result in analysts’ prediction and algorithm prediction
Table 10. Prediction movement result with actual price
Chapter 1 Introduction
Analysis of the current stock market is discussed through many channels, such as
magazines, television and media applications. A number of people have started to
browse social media sites, which act as repositories of answers to all kinds of
opinion questions, in place of the traditional practice of standing in a bank or at
HKEX to receive information for transactions. Since financial social media sites
became a general alternative communication channel, analysts have used them to analyse
and predict the trend of particular stocks and indexes on the basis of fundamentals
and technical indicators, such as the Simple Moving Average (SMA), the Relative
Strength Index (RSI) and the Price-to-Earnings ratio (P/E ratio), which are the
reference techniques commonly used to forecast future trend movement. It has become a
popular phenomenon for investors to express their opinions in this way (Alassiri, Mud,
& Ghazali, 2014). However, some investors mostly follow the market atmosphere, while
some experienced investors and analysts stay close to the technical trend of stocks
and indexes, predict the next year's stock turnover from annual reports, or even use
data mining algorithms to forecast the trend.
Ever since the work of Visa et al. in 1999 and of Back et al. on the performance of
data and text mining, which found that “Text bears more diverse information than dry
numbers do”, researchers have carried out text investigation in combination with the
study of companies’ financial ratios (Visa et al., 2004). Text mining techniques,
which deal effectively with enormous amounts of quantitative and qualitative data,
have become a new trend. Text in social media is an important and valuable text mining
asset for understanding, identifying and exploring nuggets of knowledge and algorithm
results.
This research continues and builds on the evolution of financial text mining, namely
social media data analysis (SMeDA) through text analysis, together with data mining
methods that use analysts’ prediction reports in social media to forecast the
movement, and it presents the prediction graphics and statistics.
However, reports of financial text analysis in Hong Kong, Asia, that forecast the
future performance of the Hang Seng Index (HSI) are few, and the number of cases
studied has been very limited. Social media text analysis combining data and text, on
the other hand, has been largely neglected.
The present study examines the reliability and availability of a large body of text
data and historical stock price data for forecasting the Hang Seng Index (HSI).
Official social media reports widely accept that experienced financial market analysts
and investors regularly express their views on the market outlook, including trend,
volatility range, resistance position and so on. Given that all the expressions and
opinions released on social media sites can be collected relatively easily, the aim of
this dissertation is to determine the expression value (positive or negative) of the
textual part of each report and to demonstrate the prototype-matching approach to
predicting HSI performance.
1.1 Motivation
Everyone who invests money in the market aims to make a profit. Some investors mostly
follow the market atmosphere, while some experienced investors and analysts stay close
to technical trends or indicators of the index, such as the relative strength index
(RSI), the simple moving average (SMA) or other technical indicators derived by data
mining methods. Moreover, some of them watch the Hang Seng evening futures, the
American Depositary Receipt (ADR) index or the Dow Jones index to predict whether the
HSI will go up or down the next day. In addition, stock analysts regularly post their
points of view and trend predictions for the HSI on social media, such as aastock and
etnet, as a reference for the public. The overlap of these three factors gives rise to
the research idea of “Social Media Data Analysis” (SMeDA).
Although social media analysis has become a popular research topic, little work has
been done on the financial aspect of patterns and predictions in analysts’ content,
and such study has not become any easier to carry out (Li, Chan, Ou, & Reuifeng,
2017).
1.2 Aims and Objectives
The aim of this dissertation is to determine and critically examine the credibility of
financial stock analysts’ content and predictions about the Hang Seng Index (HSI), and
to forecast the future movement and trend of the stock market by data mining and text
mining analysis, in order to give notice of predicted movements and help avoid a
financial crisis.
To achieve this, stock analysts’ social media content (stock market prices, volume,
and analysts’ comments from social media such as news, blogs or video channels), their
HSI predictions and historical HSI data are processed by data mining and text mining
to forecast index movement. A two-stage architecture, which uses the features from
text analysis and the integration of numeric and textual data, produces the
experimental results as a data science and graphical analysis report. Data and text
mining algorithms evaluate the performance of the experiments.
In this dissertation, data mining and text mining learning algorithms and techniques
use price factors to predict index movement performance. The mining algorithms, such
as text analysis, time series analysis and moving averages, carry out the estimation
process that discovers predicted movements and hidden patterns and behaviours, and are
used to forecast the movement performance index (MPI).
Although there have been a number of interdisciplinary studies on data mining and text
mining, this dissertation is an original in-depth study of text mining applied to text
from social media in relation to investors’ desire to invest, and it objectively
evaluates the performance of analysts’ perspectives so as to forecast stock price
movement accurately. The features from Hong Kong social media are found to give the
highest prediction accuracy, which may be linked to the facts of the current market.
Analysts on social media comment on the market outlook, atmosphere and future
perspectives. Their predictions of stock prices need to be weighed together with
content analysis in order to evaluate their degree of credibility. This estimation
methodology will be the first dissertation to work out stock price movement in Hong
Kong while using analysts and investors as references.
This dissertation is an effort to serve as a financial social media text analysis
based on text mining and data mining algorithms, and to advance the current state of
finance with respect to the collection and analysis of index numbers and text comments
by financial analysts and experienced investors.
Chapter 2 Background
To discuss the current state of social media mining for social media data analysis
(SMeDA), it is useful first to look at social media research in Hong Kong as well as
machine learning and mining algorithms. As these two fields are immense and an
overview could easily grow into a work of hundreds of pages, this chapter focuses only
on those theories and algorithms that are actively cited by recent social media data
analysis (SMeDA) and data mining publications.
2.1 General Terms and Concepts
2.1.1 Period of Investment
Every investment period reflects a trading strategy and a form of financial investment
behaviour, and understanding it helps in understanding stock market volume and
investors’ psychology and behaviour. There is no doubt that trading, whether short
term or long term, aims to make a profit. The term of trading is defined differently
according to the time period.
a) Short term
The period of short-term investment is generally a few days to a month. The
potential net profit and investment risk can fluctuate as sharply as a roller
coaster, so short-term investment can make money faster but can also lose
money quickly (The ASPIRA Association, 2017).
b) Medium to long term
Less control is also an advantage of long-term investment. Even if the price
goes down, it will climb up again, which reflects the business cycle, so high
profit growth is not a consideration for medium to long-term investment.
2.1.2 Hang Seng Index (HSI)
One of the earliest stock indexes in Hong Kong is the Hang Seng Index (HSI), which has
been the most widely quoted indicator of the performance of the Hong Kong stock market
since 24 November 1969. According to Investopedia, an index is a statistical measure
that reflects change in an economy or a securities market, representing stock trading
volume (buying and selling) over a period of time. The HSI has no fundamental
analysis, such as an organization profile or industry relevance, because it is an
economic index and reference indicator for Hong Kong rather than an actual listed
company. Therefore all economic, livelihood and organizational information and news
releases have some impact on the stability of the Hang Seng Index, which has a fixed
number of 50 constituents known as blue chips (Table 1).
Those blue chips influence the value of the Hang Seng Index (HSI) according to their
weightings. Hang Seng Bank reviews the performance of the Hang Seng Index constituent
stocks quarterly and decides whether to exclude risky stocks or include promising ones
(Hang Seng Indexes Company Limited, 2017). For example, according to the Hang Seng
Index report of October 2017, Hang Seng Indexes Company Limited released its index
review results: Cathay Pacific (0293) and Kunlun Energy (0135) were removed from the
blue-chip list, and Sunny Optical (2382) and Country Garden (2007) were added to the
HSI constituents. That decision is based on the expected future development of the
stocks and affects the Hang Seng Index (HSI) through the weightings.
The HSI is important as a research subject because Hong Kong has built a
comprehensive, credible and impartial financial reputation in Asia, which is highly
positive for its financial and economic market (Investor Education Centre, 2014).
Viewed as a microcosm of the economy, the Hang Seng Index (HSI) is an economic
benchmark that estimates and reflects the overall health and performance of a sector
in Hong Kong and is a basis for measuring investment returns from the stock market
(Investor Education Centre, 2014).
Table 1. 2017 Hang Seng Index Constituents
Code  ISIN          Stock name       Industry                   Share type                     Weighting (%)
941   HK0941009539  China Mobile     Telecommunications         Red Chip                       5.39
1398  CNE1000003G1  ICBC             Financials                 H Share                        5.12
2318  CNE1000003X6  Ping An          Financials                 H Share                        4.00
3988  CNE1000001Z5  Bank of China    Financials                 H Share                        3.46
1     KYG217651051  CKH Holdings     Conglomerates              HK Ordinary                    3.00
388   HK0388045442  HKEx             Financials                 HK Ordinary                    2.85
2628  CNE1000002L3  China Life       Financials                 H Share                        2.15
883   HK0883013259  CNOOC            Energy                     Red Chip                       2.13
1113  KYG2177B1014  CK Asset         Properties & Construction  HK Ordinary                    1.87
16    HK0016000132  SHK Ppt          Properties & Construction  HK Ordinary                    1.86
2     HK0002007356  CLP Holdings     Utilities                  HK Ordinary                    1.68
386   CNE1000002Q2  Sinopec Corp     Energy                     H Share                        1.64
823   HK0823032773  Link REIT        Properties & Construction  HK Ordinary                    1.63
11    HK0011000095  Hang Seng Bank   Financials                 HK Ordinary                    1.58
2388  HK2388011192  BOC Hong Kong    Financials                 HK Ordinary                    1.54
175   KYG3777B1032  Geely Auto       Consumer Goods             Other HK-listed Mainland Co.   1.45
27    HK0027032686  Galaxy Ent       Consumer Services          HK Ordinary                    1.40
3     HK0003000038  HK & China Gas   Utilities                  HK Ordinary                    1.39
857   CNE1000003W8  PetroChina       Energy                     H Share                        1.20
2018  KYG2953R1149  AAC Tech         Information Technology     Other HK-listed Mainland Co.   1.18
688   HK0688002218  China Overseas   Properties & Construction  Red Chip                       1.09
6     HK0006000050  Power Assets     Utilities                  HK Ordinary                    1.05
1928  KYG7800X1079  Sands China Ltd  Consumer Services          HK Ordinary                    1.00
4     HK0004000045  Wharf Holdings   Properties & Construction  HK Ordinary                    0.96
66    HK0066009694  MTR Corporation  Consumer Services          HK Ordinary                    0.90
288   KYG960071028  WH Group         Consumer Goods             HK Ordinary                    0.84
17    HK0017000149  New World Dev    Properties & Construction  HK Ordinary                    0.77
267   HK0267001375  CITIC            Conglomerates              Red Chip                       0.74
762   HK0000049939  China Unicom     Telecommunications         Red Chip                       0.74
1109  KYG2108Y1052  China Res Land   Properties & Construction  Red Chip                       0.72
1088  CNE1000002R0  China Shenhua    Energy                     H Share                        0.71
12    HK0012000102  Henderson Land   Properties & Construction  HK Ordinary                    0.68
2319  KYG210961051  Mengniu Dairy    Consumer Goods             Red Chip                       0.67
1044  KYG4402L1510  Hengan Int'l     Consumer Goods             Other HK-listed Mainland Co.   0.62
3328  CNE100000205  Bankcomm         Financials                 H Share                        0.58
23    HK0023000190  Bank of E Asia   Financials                 HK Ordinary                    0.57
1038  BMG2178K1009  CKI Holdings     Utilities                  HK Ordinary                    0.50
83    HK0083000502  Sino Land        Properties & Construction  HK Ordinary                    0.48
151   KYG9431R1039  Want Want China  Consumer Goods             Other HK-listed Mainland Co.   0.45
19    HK0019000162  Swire Pacific A  Conglomerates              HK Ordinary                    0.43
101   HK0101000591  Hang Lung Ppt    Properties & Construction  HK Ordinary                    0.41
992   HK0992009065  Lenovo Group     Information Technology     Red Chip                       0.37
144   HK0144000764  China Mer Port   Industrials                Red Chip                       0.34
836   HK0836012952  China Res Power  Utilities                  Red Chip                       0.32
135   BMG5320C1082  Kunlun Energy    Energy                     Red Chip                       0.26
293   HK0293001514  Cathay Pac Air   Consumer Services          HK Ordinary                    0.12
Total                                                                                          100
Other factors also influence stock market performance. Index indicators such as the
Relative Strength Index (RSI), Moving Average Convergence / Divergence (MACD) and
Exponential Moving Average (EMA), as well as the daily index price, daily trading
volume, international currency exchange rates, HKEX policy and strategy, and even the
hot topic of the current year, namely Northern China investor volume, all influence
the performance of the HSI (Yu, Ng, Wong, Chu, & Chan, 2010). Hence the stock market
is complex, and many different circumstances affect the prediction of stock price
movement.
These factors often create a price gap, which is described by “gap theory” in the
stock market. The theory concerns a stock or index price that rises or falls sharply
on the chart the next day, leaving a large distance from the previous day's price, and
how such a gap is created, interpreted and exploited. A gap can be caused by many
factors, such as global news, a cycle of financial pressure (including an irregular
RSI) or stock market arrangements, much like the determinants of the HSI described
above.
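The gap idea can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the procedure used in this dissertation: a gap is taken here to be an opening price that differs from the previous close by at least a chosen fraction, and the function name `find_price_gaps`, the `threshold` parameter and the sample prices are all hypothetical.

```python
def find_price_gaps(bars, threshold=0.01):
    """Return (day_index, direction, relative_change) for each detected gap.

    bars: list of (open, close) tuples in chronological order.
    threshold: minimum relative jump between the previous close and the
               current open (an assumed definition of a "gap").
    """
    gaps = []
    for day in range(1, len(bars)):
        prev_close = bars[day - 1][1]
        today_open = bars[day][0]
        change = (today_open - prev_close) / prev_close
        if abs(change) >= threshold:
            direction = "up" if change > 0 else "down"
            gaps.append((day, direction, round(change, 4)))
    return gaps


# Made-up (open, close) prices, for illustration only
bars = [(100.0, 101.0), (101.2, 102.0), (104.5, 104.0), (101.0, 100.5)]
gaps = find_price_gaps(bars, threshold=0.02)
for day, direction, change in gaps:
    print(f"day {day}: gap {direction} of {change:.2%}")
```

With the sample data, day 2 opens well above the previous close and day 3 opens well below it, so both are reported as gaps.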
2.1.3 Social Media Mining Analysis
Social media break down the barrier between traditional media and the virtual world
and change how people interact and communicate with the public. Social media are a
unique source of data for novel data mining techniques because they provide
well-organized user-generated content with rich social relations. Social media mining
is the process of representing, analysing and extracting hidden patterns from message
data. It draws on theories and methodologies from several disciplines, including data
science, computer science, data mining, machine learning and text mining. It combines
different tools to represent and measure social media big data, yield meaningful
patterns, and help bridge the gap between theoretical models and measurement tools.
Social media enable us to connect and interact with each other anywhere and at any
time through one-to-many communication channels such as blogs, videos, slideshows and
podcasts, allowing us to observe human behaviour on an unprecedented scale through a
new lens. This social media lens provides golden opportunities to understand
individuals at scale and to mine human behavioural patterns.
In traditional data mining, historical data are used to produce results with an
algorithm. However, user-generated content is also a key source that cannot be ignored
in the analysis.
2.1.4 Data Mining
Data mining is the idea of mining knowledge from big data, providing data analysis
that extracts knowledge patterns. Going beyond simple statistical analysis, data
mining supplies the essential concepts for understanding data in knowledge discovery
in databases (KDD). The resulting patterns and behaviours are used to describe
concepts such as stock price movement behaviours and interesting patterns. The data
are analysed with methodologies such as association rules, which are measured by their
support within the data, classification, clustering and time series analysis. In
general, mining concepts benefit sales and marketing by helping to understand customer
behaviour, but in the financial industry they are a reference standard that
programmers and analysts use to predict future stock movement. Various algorithms and
techniques are used to establish the credibility and reliability of the data and to
obtain accurate predictions, in order to provide a relevant reference that helps the
public and stock analysts understand the current market status.
Supervised learning and unsupervised learning are the basic categories of data mining
algorithm. Supervised learning predicts the value of a class attribute, whereas
unsupervised learning has no class attribute and instead finds similar instances in
the dataset, from which it can identify significant patterns. Supervised learning
includes classification and regression, for example Multinomial Naïve Bayes and Linear
Regression. Unsupervised learning is mainly clustering, which requires a distance
measure such as Euclidean distance.
2.1.5 Text Mining – Text Analysis
Stock price analysis is already commonly used to predict future performance: investors
constantly hear stock analysts explain the development trend and outlook of the
market. Stock microblogs, news and official stock websites, such as aastock and Yahoo
Finance, provide a simple way to obtain data from social media and the stock market.
It is also easy to collect a significant proportion of stock market news and
advertisements, which can serve as data for text analysis (Degutls & Novickyte, 2014).
The data should of course be filtered so that the most important original sentence
segments are captured by topic for analysis. Those opinions and segments are good
reference examples for predicting stock movement. Text analysis is another branch of
machine learning that identifies, extracts and characterizes the content of a
sentence-level text unit, and it is sometimes referred to as opinion mining. However,
researchers and the public have often confused text with opinion. According to
Merriam-Webster’s Collegiate Dictionary, “text” is a message that conveys an attitude
or thought prompted by feeling about a matter. Opinion, by contrast, judges a case or
matter, takes a point of view, discusses, evaluates and makes a statement. All such
text analysis is generally called text mining. There have been many successful cases
of text analysis in areas such as political science, policy making, psychology and
sociology.
During text analysis, an analysis model such as Twitter Investor Text (TIS) can
compute a weighted opinion score from all the words in a sentence, reflecting the
credibility and representativeness of the statement.
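As a rough illustration of word-based weighted scoring, the following Python sketch averages the weights of the known words in a sentence. It is a generic lexicon approach and not the TIS model itself; the word list `WORD_WEIGHTS`, its weights and the function names are hypothetical and chosen only for the example.

```python
import re

# Hypothetical word weights, invented for illustration only
WORD_WEIGHTS = {
    "rise": 1.0, "rally": 1.0, "rebound": 0.5,
    "fall": -1.0, "drop": -1.0, "resistance": -0.5,
}


def score_text(text, weights=WORD_WEIGHTS):
    """Average the weights of the known words in the text (0.0 if none match)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    matched = [weights[token] for token in tokens if token in weights]
    if not matched:
        return 0.0
    return sum(matched) / len(matched)


def label(score):
    """Map a numeric score to a positive / negative / neutral expression value."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


comment = "HSI may rally and rise above resistance"
print(score_text(comment), label(score_text(comment)))
```

Averaging over matched words keeps long and short comments comparable; a real model would also need to handle negation, Chinese-language text and domain-specific vocabulary, which this sketch omits.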
This research bases its text analysis on social media, an approach called “Social
Media Data Analysis” (SMeDA). It is an effective and comprehensive measurement that
uses data mining to collect enormous amounts of data about professional individuals
and groups so as to measure their interactions, discover patterns and understand human
behaviour. The methodology for formulating and estimating text scores, and the theory
of text analysis, are discussed in a later section.
2.1.6 Summary
This dissertation aims to forecast the future performance of the HSI by mining
algorithms, but it also reflects the psychology, social cognition and emotions behind
investors’ interests and behaviour, and future financial market prediction (TMFP).
This is also why researchers must use advanced algorithms to predict stock movement,
evaluate the risk of financial crisis and safeguard the health of the financial
market. Data mining and text mining are respectively able to predict future price
movement from the predicted prices and from the prediction text published by financial
stock analysts.
2.2 Technical Background & Description
This section describes the data mining and text mining algorithms used to predict the
index trend from a combination of sequential chart patterns and behaviours.
2.2.1 Time Series Similarity Analysis
Index prices, and the text scores extracted from textual data, can both be represented
as time series. This subsection presents several similarity measurements.
Linear Regression
The basic concept of linear regression is to calculate the functional and statistical
relation between two variables, in which the independent variable X is the predictor
(explanatory variable) and Y is the response. A scatter diagram presents the relation
between variable Y and variable X. To make a prediction, the response Y is assumed to
vary with the predictor X (Figure 2). The linear regression line is one of the
prediction measurements used to forecast a value: the slope and intercept are calculated
to produce the prediction.
𝑌̂𝑖 = 𝑏0 + 𝑏1 𝑋𝑖 (2.1)
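As an illustration of Eq. 2.1, the slope b1 and intercept b0 can be estimated by ordinary
least squares. The sketch below assumes least squares as the fitting method (the text
does not name one); the function names and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict(b0, b1, x):
    """Evaluate Eq. 2.1 for a new value of the predictor."""
    return b0 + b1 * x


# Hypothetical data: trading-day number (X) and index level (Y).
days = [1, 2, 3, 4, 5]
levels = [100.0, 102.0, 104.0, 106.0, 108.0]
b0, b1 = fit_line(days, levels)
next_level = predict(b0, b1, 6)
```

With these made-up values the fitted line rises by two points per day, so the forecast
for day 6 continues the same straight-line trend.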
Euclidean distance
This measurement is named after the famous mathematician Euclid, popularly referred to
as the father of geometry. It is a distance measure, widely used in unsupervised
learning, that gives the length of the straight line connecting two points; here the
points are two time series of equal length. The formula is given in Eq.2.2, where 𝑛 is
the number of dimensions in the data, the index 𝑖 runs from one up to 𝑛, and 𝑠 and 𝑞 are
the two points in n-dimensional space (Batista, Wang, & Keogh, 2012).
$$d(s, q) = \sqrt{\sum_{i=1}^{n} (s_i - q_i)^2} \tag{2.2}$$
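A minimal sketch of Eq. 2.2 for two equal-length series follows. The function name and
the sample values are illustrative assumptions.

```python
import math


def euclidean_distance(s, q):
    """Euclidean distance between two equal-length series (Eq. 2.2)."""
    if len(s) != len(q):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))


# Hypothetical two-point series forming a 3-4-5 right triangle.
d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```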
Short Time Series (STS) distance
Möller-Levet et al. treat each time series as a piecewise linear function and compare
the slopes of the two series by summing the squared differences between them.
Mathematically, the STS distance between two time series $x_a$ and $y_b$ is defined as
$$d_{STS} = \sqrt{\sum_{k=1}^{P} \left( \frac{y_{b(k+1)} - y_{bk}}{t_{(k+1)} - t_k} - \frac{x_{a(k+1)} - x_{ak}}{t_{(k+1)} - t_k} \right)^2} \tag{2.4}$$
In Eq.2.4, $t_k$ is the time point of the data points $x_{ak}$ and $y_{bk}$. The series
are z-standardized to remove the effect of scale.
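The slope comparison and the z-standardization step can be sketched as follows. This is
an illustrative reading of Eq. 2.4, not the author's implementation; the function names
and the sample series are assumptions, and the standardization uses the population
standard deviation of a non-constant series.

```python
import math


def z_standardize(series):
    """Rescale a non-constant series to zero mean and unit standard deviation."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series))
    return [(v - mean) / std for v in series]


def sts_distance(x, y, t):
    """Short Time Series distance between x and y sampled at times t."""
    if not (len(x) == len(y) == len(t)):
        raise ValueError("x, y and t must have equal length")
    total = 0.0
    for k in range(len(t) - 1):
        dt = t[k + 1] - t[k]
        slope_x = (x[k + 1] - x[k]) / dt
        slope_y = (y[k + 1] - y[k]) / dt
        total += (slope_y - slope_x) ** 2
    return math.sqrt(total)


# Hypothetical series: y has the same shape as x but twice the scale.
t = [0.0, 1.0, 2.0]
x = [0.0, 1.0, 2.0]
y = [0.0, 2.0, 4.0]
raw = sts_distance(x, y, t)
scaled = sts_distance(z_standardize(x), z_standardize(y), t)
```

Before standardization the two series differ only because of scale; after it their
slopes coincide and the distance drops to (numerically) zero, which is the effect the
text describes.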
K-means
K-means is a basic partitional clustering technique in which each cluster is associated
with a center point (centroid) and each point is assigned to the cluster with the
closest centroid. Closeness can be measured by Euclidean distance or by the other
similarity measures given in subsection 2.2.1.
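The assign-then-update loop of K-means can be sketched in a few lines. The version below
is a simplified illustration: it uses Euclidean distance, and it takes the first k
points as initial centroids, which is an assumption made to keep the example
deterministic (practical implementations usually initialize randomly).

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, k, max_iterations=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive initialization (assumption)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (9.0, 9.0), (8.8, 9.2)]
labels, centroids = k_means(points, k=2)
```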
Hybrid Kohonen Self-Organizing Map
This is a type of Self-Organizing Map (SOM), an unsupervised and non-parametric neural
network that feeds input patterns into a hidden layer in order to forecast HSI
performance. Each neuron, the basic unit, is assigned a weight. The neuron in the hidden
layer that is closest to the input becomes the winning neuron, so clusters of neurons in
the layer identify regions of the pattern space, which are used to predict the stock
price objectively (Sap & Mohebi, 2017).
Many text and data mining feature extraction techniques can be applied to the collected
data. In machine learning terminology, features are unique properties of the data,
usually numeric or categorical in nature. Categorical features can be encoded as binary
features, one for each category in the list, using a process called one-hot encoding.
Selected and extracted features are fed into machine learning techniques to learn
patterns, which are then applied to new data points to gain insights. Each algorithm
takes numeric vectors as input and is usually a mathematical optimization that
minimizes a loss or error. Feature extraction is therefore the technique of
transforming textual data and extracting numeric features from it. There are many
algorithms for estimating the weight of words in a text and processing the data to
obtain the features.
Sentence Tokenization
Sentence tokenization is the process of splitting a text corpus into sentences, which
act as the first level of tokens; the different word forms are then reduced to tokens.
The related methods can be divided into the Porter Stemming Algorithm, Part of Speech
(POS) tagging, Chunking, and Chinking.
noun and verb form; an adjective form, which is “increasing”; and an adverb form, which
is “increasingly”. These words carry the same meaning. To reduce the feature dimension
and unify the output, such words should be reduced to the same root, “increase”, while
the original words are kept behind the scenes to guarantee data integrity.
In Eq.2.5, 𝑊(𝑡, 𝑑) is the weighted score of a term 𝑡 in a document 𝑑, 𝑠𝑓(𝑡, 𝑑) is the
frequency of the term 𝑡 in a stored social media data document 𝑑, and 𝑁 is the total
number of samples to be analyzed. The performance of this model is evaluated with
cross-validation. The built-in scikit-learn implementation is the general and standard
way to implement it.
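To make the weighting concrete, the sketch below computes a TF-IDF style score in plain
Python. It assumes the common form W(t, d) = sf(t, d) × log(N / df(t)), where df(t), the
number of documents containing t, is a name introduced here for illustration; the exact
form of Eq. 2.5 may differ. scikit-learn's TfidfVectorizer offers a built-in
implementation that uses a smoothed variant of the inverse document frequency.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # df(t): number of documents in which term t appears at least once.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append(
            {t: f * math.log(n_docs / doc_freq[t]) for t, f in term_freq.items()}
        )
    return weights


# Hypothetical tokenized analyst comments.
docs = [["hsi", "rise", "rise"], ["hsi", "fall"], ["hsi", "rise"]]
w = tf_idf(docs)
```

A term that occurs in every document, such as "hsi" here, receives a weight of zero,
while a term confined to one document, such as "fall", receives the highest weight per
occurrence.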
Term frequency, the number of times a word occurs, plays an important role in
weighting. However, Term Presence (TP), represented as a Boolean value in a vector
(True, or a vector value of 1, when the term is present), yields better performance than
term frequency in text analysis (Pang, Lee, & Vaithyanathan, 2002).
Negation
Negation must be considered when processing a sentence. The word “not” is a simple
example of a word that inverts meaning. Researchers handle the negation problem by
appending the tag “NOT” to terms negated by “no” or “do not”. For instance, in “I do not
expect HSI to be raised.” the extracted term will be “expect_NOT” instead of “expect”
(Das & Chen, 2007).
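The tagging idea can be sketched as below. This is a simplified illustration rather
than the exact procedure of Das and Chen: it assumes that only the single token
following a negation cue is tagged, and the cue list is a hypothetical minimum.

```python
NEGATION_CUES = {"not", "no"}  # illustrative cue list (assumption)


def mark_negation(tokens):
    """Drop each negation cue and append "_NOT" to the token that follows it."""
    marked = []
    negate_next = False
    for token in tokens:
        if token.lower() in NEGATION_CUES:
            negate_next = True
            continue
        if negate_next:
            marked.append(token + "_NOT")
            negate_next = False
        else:
            marked.append(token)
    return marked


result = mark_negation("I do not expect HSI to be raised".split())
# result == ['I', 'do', 'expect_NOT', 'HSI', 'to', 'be', 'raised']
```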
Bag-of-Words Model
This is a classic model used in NLP that represents a document as a term frequency
vector. Although syntax and term ordering are lost, the major message is retained in the
term frequency vector.
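The difference between a term frequency vector and a term presence vector can be seen in
a short sketch. The helper names and the two tokenized sentences are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of every distinct token across all documents."""
    return sorted({token for tokens in documents for token in tokens})


def term_frequency_vector(tokens, vocabulary):
    """Bag-of-words vector: how often each vocabulary term occurs."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def term_presence_vector(tokens, vocabulary):
    """Term presence vector: 1 if the term occurs at all, otherwise 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


docs = [["index", "expected", "rise", "rise"], ["index", "expected", "fall"]]
vocab = build_vocabulary(docs)  # ['expected', 'fall', 'index', 'rise']
tf_vectors = [term_frequency_vector(d, vocab) for d in docs]
tp_vectors = [term_presence_vector(d, vocab) for d in docs]
```

Word order is discarded in both vectors; the only difference is that the repeated
"rise" counts twice under term frequency and once under term presence.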
occurrences of term t in training documents from class (c), including multiple
occurrences. Naïve Bayes is a popular supervised learning algorithm for prediction and
classification tasks; here it classifies text as either positive or negative based on
TF-IDF features. It is “naïve” in assuming that every feature is independent of the
others. The features are written as a vector {𝑥1 , 𝑥2 , … , 𝑥𝑛 } and 𝑦 is the class
variable. Bayes’ theorem gives the probability of 𝑦 given the features, as shown in
Eq. 2.6.
$$P(y \mid x_1, x_2, \ldots, x_n) = \frac{P(y) \times P(x_1, x_2, \ldots, x_n \mid y)}{P(x_1, x_2, \ldots, x_n)} \tag{2.6}$$
Under the naïve assumption,
$P(x_i \mid y, x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(x_i \mid y)$ for every 𝑖
from 1 to 𝑛. In simple terms, Eq. 2.6 can be read as
$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$
where the evidence 𝑃(𝑥1 , 𝑥2 , … , 𝑥𝑛 ) is constant for a given input.
In this model, the features are conditionally independent of each other given the class
variable that is to be predicted. Because 𝑍 = 𝑝(𝑥) depends only on the features and is
therefore a constant, the model can be represented as:
$$P(y \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z} P(y) \prod_{i=1}^{n} P(x_i \mid y) \tag{2.7}$$
The naïve Bayes classifier combines Eq.2.7 with the maximum a posteriori (MAP) decision
rule. The classifier assigns the class label prediction
$$\hat{y} = \underset{k \in \{1, 2, \ldots, K\}}{\operatorname{argmax}} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k) \tag{2.8}$$
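The MAP rule can be illustrated with a small multinomial naïve Bayes classifier written
from scratch. The sketch works in log space to avoid numerical underflow and applies
Laplace (add-one) smoothing, which is an assumption added here so that unseen
term-class pairs do not produce zero probabilities. The training sentences and labels
are made up.

```python
import math
from collections import Counter, defaultdict


def train_nb(documents, labels):
    """Count class frequencies and per-class term frequencies."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, term_counts, vocabulary


def predict_nb(tokens, model):
    """Return the class with the highest posterior (MAP decision rule)."""
    class_counts, term_counts, vocabulary = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n_docs)  # log prior, log P(C_k)
        total_terms = sum(term_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace-smoothed likelihood P(x_i | C_k) (smoothing is an assumption).
            p = (term_counts[label][token] + 1) / (total_terms + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
train_docs = [
    ["index", "rise"],
    ["strong", "rebound", "rise"],
    ["index", "fall"],
    ["weak", "drop", "fall"],
]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_nb(train_docs, train_labels)
prediction = predict_nb(["index", "rebound"], model)
```

The constant Z from Eq. 2.7 is never computed, since it is the same for every class and
does not change which class attains the maximum.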
class variable-related conditional feature distributions, and it also links to a
single-dimension distribution.
In the equation, if the result of $g(\vec{x})$ is greater than or equal to 1, it will
classify to class
The total margin is computed by Eq.2.10; minimizing the term maximizes the
separability.
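$$\frac{1}{\lVert \vec{\omega} \rVert} = \vec{w}^{T} \vec{x} + w_0 \tag{2.10}$$
$$\vec{\omega} = \sum_{i=0}^{N} \lambda_i y_i \vec{x}_i \tag{2.11}$$
$$\sum_{i=0}^{N} \lambda_i y_i = 0 \tag{2.12}$$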
The SVM formulation above is adapted for regression to obtain Support Vector Regression.
2.2.5 Text Summarization
Readers who dislike reading tend to have short attention spans when faced with a long
article or a large document, so they get bored and miss the important messages. Text
summarization is therefore an extremely important concept for helping readers extract
the key theme of the information.
[Diagram: nodes 𝛼, 𝜃, 𝑍, W, 𝛽 and 𝜑, with plates K, N and M]
Figure 4. LDA plate notation
Figure 4 shows the plate notation for the LDA model, in which K is the number of topics;
N is the number of words in a document; M is the number of documents; 𝛼 is the
Dirichlet-prior concentration parameter of the per-document topic distribution; 𝛽 is
the same parameter of the per-topic word distribution; 𝜑(k) is the word distribution
for topic k; 𝜃(i) is the topic distribution for document i; 𝑍(i,j) is the topic
assignment for W(i,j); and W(i,j) is the j-th word in the i-th document. 𝜑 and 𝜃 follow
Dirichlet distributions; 𝑍 and W follow multinomial distributions.
2.2.6 Text Analysis
Dictionary
The dictionary-based method is a simple technique that stores and generates a word list
for text analysis. New synonyms and antonyms are added to the word list, in which each
word has an associated credibility level. However, some mood words within specific
domains are difficult to find, which is a major weakness of the dictionary-based method
and affects its accuracy.
1) Score each investor comment using the POMS function given in Eq. 2.9. Each case 𝑐 is
denoted by its term set 𝑤. The POMS emotion adjectives for mood dimension 𝑖 are denoted
𝑝𝑖.
3) Integrate the emotion vectors by date, denoted 𝑚𝑑; 𝜃𝑚𝑑 [𝑖, 𝑘] represents the mood over
a period of k days.
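$$m_d = \frac{\sum_{\forall c \in T_d} \hat{m}}{\lVert T_d \rVert} \tag{2.15}$$
$$\theta_{m_d}[i, k] = [m_i, m_{i+1}, \ldots, m_{i+k}] \tag{2.16}$$
$$\tilde{\theta}_{m_d}[i, k] = [\tilde{m}_i, \tilde{m}_{i+1}, \ldots, \tilde{m}_{i+k}] \tag{2.18}$$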
Lydia
Lydia builds time series by counting text emotion values, including the positive and
negative words that appear in relation to the parties concerned.
$$\text{Polarity} = \frac{p - n}{p + n} \tag{2.19}$$
$$\text{Subjectivity} = \frac{p + n}{N} \tag{2.20}$$
Eq.2.19 and 2.20 calculate the text emotion, where 𝑝 and 𝑛 are the positive and negative
values and 𝑁 is the total number of emotion values.
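Both measures are simple ratios, as the sketch below shows. Returning 0.0 when a
denominator is zero is an assumption made for the example, and the counts are
hypothetical.

```python
def polarity(p, n):
    """Eq. 2.19: balance of positive (p) against negative (n) values."""
    return (p - n) / (p + n) if (p + n) else 0.0


def subjectivity(p, n, total):
    """Eq. 2.20: share of all values (total) that carry emotion."""
    return (p + n) / total if total else 0.0


# Hypothetical counts: 6 positive and 2 negative values out of 20 in total.
pol = polarity(6, 2)
subj = subjectivity(6, 2, 20)
```

Polarity ranges from -1 (only negative values) to 1 (only positive values), while
subjectivity ranges from 0 to 1.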
Cross-validation (statistics)
Cross-validation, also called rotation estimation, is a statistical model validation
technique that estimates how well the result of a learning algorithm generalizes. Given
a prediction goal, it estimates how accurately a predictive model will perform,
generally using K subsamples and K iterations to split the data set into training and
testing data. The measurement reported by K-fold cross-validation is the average of the
values computed in the loop.
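The K-fold loop can be sketched as follows. The folds are taken as contiguous blocks
without shuffling, which is a simplification; the "majority class" model and the labels
are hypothetical stand-ins for a real classifier and data set.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(samples, targets, k, fit, score):
    """Average the score over k train/test splits."""
    scores = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [targets[i] for i in train])
        scores.append(
            score(model, [samples[i] for i in test], [targets[i] for i in test])
        )
    return sum(scores) / k


def fit_majority(train_samples, train_targets):
    """Toy model: always predict the most frequent training label."""
    return Counter(train_targets).most_common(1)[0][0]


def accuracy(model, test_samples, test_targets):
    return sum(1 for t in test_targets if t == model) / len(test_targets)


samples = list(range(10))
targets = ["up"] * 7 + ["down"] * 3
mean_accuracy = cross_validate(samples, targets, 5, fit_majority, accuracy)
```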
2.3 Movement Performance Index (MPI) Research
Technical indicators are needed to analyze stock price movements. Several studies have
combined soft computing technology with technical analysis in stock analysis and
achieved a high rate of prediction accuracy. Many indicator parameters show the
strength of price movement performance; one of them is the “Relative Strength Index”
(RSI). It is an important indicator for understanding resistance and upside potential.
The RSI calculations are shown in Appendix II and are divided into 10-day, 20-day,
50-day and other trend time spans. RSI is also a minor technical indicator that helps
people understand the short-term and long-term direction of HSI movement.
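For orientation, a widely used form of the indicator is RSI = 100 − 100 / (1 + RS),
where RS is the average gain divided by the average loss over the chosen period. The
sketch below uses simple averages over the last `period` price changes; this is an
assumption, and the calculation in Appendix II may differ (for example by using
smoothed averages). The closing prices are made up.

```python
def rsi(closes, period=10):
    """Simple-average RSI over the last `period` price changes."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    changes = [
        closes[i] - closes[i - 1]
        for i in range(len(closes) - period, len(closes))
    ]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        return 100.0  # no losses in the window
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


# Hypothetical closing prices: four gains and one loss of equal size.
value = rsi([10.0, 11.0, 12.0, 11.0, 12.0, 13.0], period=5)
```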
2.4 Summary
Although many analysts believe that RSI is a good reference for predicting stock
movement and share it with the public, it still takes time for investors to understand,
analyze and digest the information. In this dissertation, every social media report of
an analyst’s prediction comment is the main source for social media analysis, which
includes both data analysis and text analysis.
The bag-of-words model, multinomial Naïve Bayes and Support Vector Machines (SVM) are
used in social media classification tasks. The targets of the stock price forecasting
models are assigned according to future stock price movement.
To discover similarity in text and behavior patterns, Latent Dirichlet Allocation (LDA)
is used. It is an estimation algorithm based on word frequency that discovers
probability distributions over latent topics, and it can be used for topic modelling.
Mahajan et al. (2008) used LDA to identify the characteristics and understand the impact
of financial news events, and reported an average directional accuracy of 60%.
Learning algorithms, data mining and text mining techniques, and evaluation methods are
the research methodologies described in this chapter; they are used to obtain the
experimental results in chapter 4. This dissertation considers several financial
aspects, each handled with a specific equation. The next chapter presents the
principles and framework of the system’s design approach.
Chapter 3 Design Approach
User-generated content, consumer-generated media, and professional researchers’ opinion
content from stock or economic media weblogs, online broadcasts, streams, official web
communities etc. are the platforms from which information about the Hang Seng Index
(HSI) is received (Zeng, Chen, Lusch, & Li, 2010).
This chapter presents the design principles and approach for implementing and
evaluating each module of the HSI financial forecasting system: (1) the description of
the numerical and textual data used in the research; (2) preprocessing and munging of
the features for weighted data estimation, and other calculations on the index price
and indicators from social media used for forecasting; (3) other sub-parts giving
details of the methodologies for content analysis and for data and text extraction.
3.1 Methodology
[Figure 5: flow diagram with the boxes “Social Media Collection (financial sites, news
reports, radio etc.)”, “HSI Historical Data”, “Data Pre-processing” and “Index Movement
Prediction”]
Figure 5 shows the overall design of the analysis procedure. The structure can logically
be seen as three main processes leading to the movement prediction. The two main inputs,
text and numerical HSI data, can be processed at the same time and are pre-processed to
filter out the data that is meaningful for analysis. At the end of both the textual and
numerical data processes, the algorithm selected to estimate from the processed data is
the most essential part for evaluating the experimental results.
3.2 Data Acquisition
This dissertation analyzes data from social media. The data must be collected from
diverse public social media, which here refers to channels that interact with other
people by announcing or posting information to the public and that provide an
instrument of communication, such as radio, news reports and social networking sites.
Two essential types of data are collected: the historical data of the HSI, and the
analysts’ prediction comments from social media. All the data are stored in text files
in .csv format.
The data are collected from Hong Kong financial and stock social media sites. The social
media used in this dissertation are published by:
(1) Financial and stock: Aastock, Yahoo! Finance, Quamnet, Etnet etc.
(2) E-news: Appledaily, Singtao, Mingpao, Oncc, Takungpo etc.
(3) Radio from site: RTHK, Metroradio
All information received about Hang Seng Index (HSI) predictions is stored in a common
record structure (analyst or company name, position (if necessary), information release
date, content of prediction, prediction date, index prediction and reference link) by
establishing a common database with uniform information and structure (Dascalaki,
Droustsa, Gaglia, Kontoyiannidis, & Balaras, 2010). Along the same lines, members of the
public who organize, monitor and analyze the HSI index price and the statements of
analysts’ and investors’ opinions could use this analysis and prediction as a reference
for understanding prediction accuracy and credibility.
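One possible shape for such a record, written to .csv with the Python standard library,
is sketched below. The class name, the field names and every field value other than the
company name and the 29,000-point target are hypothetical placeholders, not collected
data.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class PredictionRecord:
    analyst: str           # analyst or company name
    position: str          # position, if necessary
    release_date: str      # date the information was released
    content: str           # content of the prediction
    prediction_date: str   # date the prediction refers to
    index_prediction: int  # predicted index level in points
    reference_link: str


def write_records(records, handle):
    """Write records as CSV rows with a header line."""
    names = [f.name for f in fields(PredictionRecord)]
    writer = csv.DictWriter(handle, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


# Placeholder values for illustration only.
record = PredictionRecord(
    analyst="Morgan Stanley",
    position="",
    release_date="YYYY-MM-DD",
    content="Target of the HSI raised to 29,000 points",
    prediction_date="YYYY-MM-DD",
    index_prediction=29000,
    reference_link="https://example.com/report",
)
buffer = io.StringIO(newline="")  # stands in for a .csv file on disk
write_records([record], buffer)
buffer.seek(0)
rows = list(csv.DictReader(buffer))
```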
For the experiment, over 1,000 financial and stock articles or social media reports
carrying HSI comments and predicted prices in Hong Kong were collected for data and text
mining analysis, over a period of one year from January 2017. In addition, the
historical HSI data collected amounts to around 248 trading days from the year 2017,
plus the updated 2018 index trading records, from Yahoo! Finance.
All the social media news items are filtered by manually selected keyword sentences
about Hang Seng Index (HSI) prediction, so some unrelated information may also be
retrieved. Social media news reports a large amount of stock market news, but capturing
the important predictions is the direct way to do data science. For example: “Morgan
Stanley issued a report update the basic situation of the major Asian stock markets
index, raising the target of the "basic situation" of the HSI to 29,000 points.” The
main information to store is “Morgan Stanley”, the financial company; “HSI”, which
means “Hang Seng Index”; and “29,000 points”, the prediction target point. There are
many Chinese stock-market proper noun keywords, like “The most cattle”, which is
commonly understood as “the maximum increase in stock price”. The idea is to examine
each sentence of a social media case in order to investigate the credibility of the
analysts’ predictions and measure the level of HSI prediction performance.
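Pulling the three items out of such a sentence can be sketched with a lookup list and a
regular expression. The lookup tables, the pattern and the abridged example sentence are
assumptions made for illustration; the keyword filtering in this dissertation was done
manually.

```python
import re

KNOWN_SOURCES = ["Morgan Stanley"]  # hypothetical list of companies and analysts
INDEX_ALIASES = {"HSI": "Hang Seng Index", "Hang Seng Index": "Hang Seng Index"}
# A number such as "29,000" or "29000" followed by the word "points".
TARGET_PATTERN = re.compile(r"(\d{1,3}(?:,\d{3})+|\d+)\s+points")


def extract_prediction(sentence):
    """Return the source, index name and target points found in a sentence."""
    source = next((name for name in KNOWN_SOURCES if name in sentence), None)
    index = next(
        (full for alias, full in INDEX_ALIASES.items() if alias in sentence), None
    )
    match = TARGET_PATTERN.search(sentence)
    target = int(match.group(1).replace(",", "")) if match else None
    return {"source": source, "index": index, "target_points": target}


# Abridged version of the example sentence above.
info = extract_prediction(
    "Morgan Stanley issued a report raising the target of the HSI to 29,000 points."
)
```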
Figure 6. Job for usage machine learning statistics in 2016.
To deal with the different operations and analyses of text and data, the textual and
numerical data are processed and parsed into easy-to-interpret formats through steps
such as data cleaning, aggregation and feature selection.
The raw Hang Seng Index (HSI) data often has to be manipulated into the correct format
for analysis. Some noisy records are removed, for example when a typhoon signal number
eight or above or a black rainstorm warning is suddenly issued, in which case HKEX’s
market system may close early depending on the time of day (HKEX, 2017). Data cleaning
is the process of dealing with missing data, garbage values and NULLs to keep the data
consistent and under control.
Feature selection can involve removing content irrelevant to the HSI from a large social
media report. This approach saves data processing time and speeds up the analysis
procedures.
Raw text in social media reports written in Chinese is translated into English to form a
prediction case and entered into Excel. All cases of raw text are converted into word
vectors. There are three main processes, implemented in Python with the nltk library
(which provides the Porter stemming algorithm):
There are two notions of stop word. Under the first, a stop word is a word at which
analysis stops: for example, words that are typically used sarcastically may be treated
as stop words, because analysts are unwilling to continue analyzing a sentence whose
actual meaning may be the opposite of its literal one. Under the second notion, stop
words are words that are simply pulled out and ignored because they are not meaningful
for understanding the sentence. The following is the stop word list, which is removed
from all analysts’ predictions:
{'s', 'themselves', 'their', 'wouldn', 'should', 'couldn', 'theirs', 'being', 'when',
'where', 'his', 'why', 'didn', 'has', "haven't", 'this', 'those', 'd', 'o', 'shan', 'such', 'to',
'not', 'below', 'can', 'off', 'hers', 'was', 'my', 'here', 'do', 'there', 'won', "couldn't", 'at',
"should've", 'isn', 'mightn', 'having', 'into', 'himself', 'while', 'before', 'then', 'don',
'weren', 'if', 'doesn', 'through', 'who', 'ours', 'on', 'ma', 'or', 'further', 'been', 'with',
'after', 'only', 'doing', 'him', 'does', 'you', 'yours', 'herself', 't', 'just', "won't",
'between', 'down', 'once', "doesn't", 'them', 'we', 'during', 'is', 'from', 'itself',
'because', "mightn't", 'both', 'm', 'wasn', 'up', "she's", 'any', 'the', 'more', 'ain', 'she',
'yourselves', 'which', 'these', "it's", 'did', "hadn't", 'haven', 'had', 'again', 'it', 'mustn',
'our', 'were', 'as', "you're", 'll', "weren't", 've', 're', 'no', "shouldn't", 'needn', "aren't",
'too', "wasn't", 'be', 'myself', 'so', 'few', 'over', "you'd", 'have', 'out', 'how', 'will',
'her', 'all', 'ourselves', 'until', 'an', 'they', 'same', 'are', 'he', 'own', 'hadn', 'by', 'about',
'yourself', 'in', 'than', 'other', "that'll", 'y', "hasn't", 'above', 'shouldn', 'whom',
"isn't", "you'll", "didn't", 'a', 'your', 'of', "needn't", 'that', 'aren', "don't", "wouldn't",
'most', 'me', 'am', 'what', 'i', 'and', 'each', 'but', "shan't", 'for', 'very', "mustn't",
"you've", 'its', 'nor', 'under', 'now', 'hasn', 'against', 'some'}
All English stop words are predefined by nltk in Python. Removing stop words is useful
for any sort of database: pulling out such words and articles saves a lot of processing
time.
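Stop word removal itself is a simple filter, as sketched below with a small subset of
the list above so that the example runs without nltk installed. In nltk the full list
is available as `nltk.corpus.stopwords.words("english")`. Lower-casing each token before
the lookup is a design choice made here; without it, capitalized tokens such as “The”
would survive the filter.

```python
# A small subset of the stop word list shown above (illustrative only).
STOP_WORDS = {
    "the", "is", "to", "of", "in", "a", "an", "and",
    "it", "he", "i", "will", "be", "has", "have", "that",
}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The index is expected to test the resistance of 24000 points".split()
filtered = remove_stop_words(tokens)
# filtered == ['index', 'expected', 'test', 'resistance', '24000', 'points']
```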
Table 2. Part of Speech (POS) tagging list
Tag    Description
CC     coordinating conjunction
CD     cardinal digit
DT     determiner
EX     existential there (like "there is"; think of it like "there exists")
FW     foreign word
IN     preposition/subordinating conjunction
JJ     adjective
JJR    adjective, comparative
JJS    adjective, superlative
LS     list marker
MD     modal
NN     noun, singular (e.g. 'desk')
NNS    noun, plural
NNP    proper noun, singular
NNPS   proper noun, plural
PDT    predeterminer
POS    possessive ending
PRP    personal pronoun
PRP$   possessive pronoun
RB     adverb
RBR    adverb, comparative
RBS    adverb, superlative
RP     particle
TO     to
UH     interjection
VB     verb, base form
VBD    verb, past tense
VBG    verb, gerund/present participle
VBN    verb, past participle
VBP    verb, singular present, non-3rd person
VBZ    verb, 3rd person singular present
WDT    wh-determiner
WP     wh-pronoun
WP$    possessive wh-pronoun
WRB    wh-adverb
As a result, the tagger creates tuples of each word and its part of speech instead of
writing them all out. This part of pre-processing confirms that all words are correctly
tagged with their part of speech for further analysis.
Chunking groups words into meaningful phrases. Most commonly, text is chunked into noun
phrases: a noun together with the modifiers around it, which form a descriptive group
of words in the sentence. This is called “shallow parsing”. The method uses regular
expressions over part-of-speech tags to group adjacent words into a chunk. The process
chunks the important words and separates them from the rest of the sentence.
Modifiers
+ = match 1 or more
? = match 0 or 1 repetitions.
* = match 0 or MORE repetitions
. = Any character except a new line
This model extracts features from the root forms, using mathematical structures to
reduce the feature dimensionality and remove terms. Latent semantic indexing can be
omitted because the bag-of-words model, as a TF-IDF vector, already provides the weight
scores. Latent Dirichlet Allocation (LDA) is adopted to represent the similarity of a
combination of topics.
# The LdaModel class, from gensim’s ldamodel module, is used as the LDA library.
# scikit-learn is used for the final implementation of the LDA-based topic model.
To apply a suitable text mining algorithm, it is necessary to identify and understand
previous work on content and predicted prices from social media data collection, and to
grasp how the data can be used for analysis. The analysis approaches are illustrated in
this section.
3.5.1.1 Dictionary
SMeDA for price movement prediction involves defining and recognizing the “degree level
of personal emotion”. A dictionary is an essential index and plays a crucial role in
data science when text mining is used for analysis, because it distinguishes polarity
and estimates its breadth. Establishing a stock-domain dictionary results in more
accurate prediction of stock price movement than using an everyday-language dictionary.
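[Figure: flow diagram of dictionary development, from text and NLP sentence processing,
through deleting unrelated words and extracting HSI emotion words, to counting stock up
and down by word, calculating polarity, defining positive or negative words, and
building the text dictionary]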
The procedure for developing the dictionary is divided into two phases. The first phase
removes the extracted sentences that are not related to the HSI topic. The second phase
selects and extracts emotion words for analysis. Each word is given a polarity by
counting the number of times the stock went up and down, and finally the score is
calculated and the word expression is identified and defined as positive or negative.
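$$\text{Word}(i, j) = \begin{cases} 1 & \text{if Doc}(j) \text{ includes word}(i) \text{ and the next-day price after Doc}(j) \text{ is up} \\ 0 & \text{otherwise} \end{cases} \tag{3.1}$$
$$\text{Word}(i).\text{NumDocs} = \text{the total number of articles that include word}(i) \tag{3.2}$$
$$\text{Word}(i).\text{SentiScore} = \frac{\sum_{j=1}^{n} \text{Word}(i, j)}{\text{Word}(i).\text{NumDocs}} \tag{3.3}$$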
In Eq.3.1 to 3.3, the resulting score is between 0 and 1. The minimum, 0, is fully
negative, while 1 is fully positive. The score is calculated as the average score over
all words.
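Eq. 3.1 to 3.3 amount to asking, for each word, what fraction of the documents
containing it were followed by a price rise. A minimal sketch under that reading is
given below; the function name, the tokenized documents and the next-day outcomes are
hypothetical.

```python
def senti_scores(documents, next_day_up):
    """documents: list of token lists; next_day_up: one bool per document."""
    num_docs = {}  # Eq. 3.2: number of documents that include the word
    up_docs = {}   # sum over documents of Word(i, j) from Eq. 3.1
    for tokens, went_up in zip(documents, next_day_up):
        for word in set(tokens):  # count each word once per document
            num_docs[word] = num_docs.get(word, 0) + 1
            if went_up:
                up_docs[word] = up_docs.get(word, 0) + 1
    # Eq. 3.3: share of the word's documents that preceded a price rise.
    return {word: up_docs.get(word, 0) / count for word, count in num_docs.items()}


docs = [["rebound", "strong"], ["rebound", "risk"], ["risk", "drop"]]
up = [True, True, False]
scores = senti_scores(docs, up)
```

In this made-up example "rebound" scores 1.0 (fully positive), "drop" scores 0.0 (fully
negative) and "risk" falls in between at 0.5.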
3.5.1.2 Polarity and Subjectivity
After the scores are extracted, the topics modeled by LDA have no standard guidance for
polarity. The words of each topic are therefore filtered for relevance to the HSI, and
Eq.3.1 to 3.3 are used to obtain the polarity score.
3.5.1.4 Score
Given a set of social media data from the listed financial and economic media and
websites over a period of time, apart from the predicted index price in the sentences,
the most useful thing to discover is what is reflected in the data at different levels
of metaphor. For this reason SMeDA cannot be ignored: its task is to provide effective
text mining algorithms that extract the words or phrases from the media that are most
reflective of the authors.
SMeDA records and analyzes a sequence of words taken from social media, such as “Hang
Seng Index is expected to rise 3764 points in the next six months”, where “Hang”,
“Seng”, “Index”, “is”, “expected”, “to”, “rise”, “3764”, “points”, “in”, “the”, “next”,
“six”, “months” are the individual words. To classify all the sentences expressed in
the received media by stock commentators or analysts, SMeDA sets up and stores a list of
words that reflect a variety of different degrees, so that the score can be analyzed by
degree level.
3.5.2 Data mining
Yahoo! Finance and aastock contain the historical index prices of the Hang Seng Index
(HSI) listed in the exchange from the year 2017. The data collected covers around 248
days plus the 2018 index trading records, including the values of the attributes “Open”,
“High”, “Low”, “Close”, “Adj Close” and “Volume”. These attributes are considered part
of the analysis. The historical numeric HSI values of each attribute are processed by
general data mining algorithms, such as linear regression and moving average, using the
closing price of the trading day.
On the other side, the data collected from social media stock analysts consisted of
continuous numeric values and textual values for all the attributes selected and
recorded. Data transformation was applied by generalizing the data to a higher-level
concept, so that all the values became discrete. The criterion for transforming the
numeric values of each attribute into discrete values depends on the previous day’s
adjusted closing price of the Hang Seng Index (HSI).
If the adjusted closing value was greater than the values of the attributes open and min
for the same trading day, the numeric values of the attributes were replaced by the
value “Positive (Up)”, shown as a blue signal. If the values of the attributes mentioned
above were less than the previous day’s open, min and max, the numeric values of the
attributes were replaced by “Negative (Down)”, shown as a red signal. If the values of
those attributes were equal to the previous value, they were replaced by the value
“Equal (-)”, shown as a grey signal (Table 3).
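One way to read this transformation is to compare each attribute of the current day
with the previous day's adjusted close and to label it accordingly. The sketch below
follows that reading, which is an assumption because the rule in the text can be
interpreted in more than one way; the attribute names follow the Yahoo! Finance columns
and the prices are made up.

```python
ATTRIBUTES = ["Open", "High", "Low", "Close", "Adj Close"]


def discretize(value, reference):
    """Map a numeric value to a discrete label relative to a reference price."""
    if value > reference:
        return "Positive"
    if value < reference:
        return "Negative"
    return "Equal"


def discretize_days(rows):
    """rows: list of dicts of numeric prices, oldest first.

    Returns one dict of labels per day, starting from the second day.
    """
    labelled = []
    for previous, current in zip(rows, rows[1:]):
        reference = previous["Adj Close"]
        labelled.append(
            {name: discretize(current[name], reference) for name in ATTRIBUTES}
        )
    return labelled


# Hypothetical prices for two consecutive trading days.
rows = [
    {"Open": 100.0, "High": 100.0, "Low": 100.0, "Close": 100.0, "Adj Close": 100.0},
    {"Open": 101.0, "High": 103.0, "Low": 99.0, "Close": 100.0, "Adj Close": 100.0},
]
labels = discretize_days(rows)
```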
Table 3. Stock attribute description
Attribute    Description                                  Possible Value
Open         Current day open price of the stock          Positive, Negative, Equal
Min          Current day minimum price of the stock       Positive, Negative, Equal
Max          Current day maximum price of the stock       Positive, Negative, Equal
Close        Current day close price of the stock         Positive, Negative, Equal
Adj Close    Current day adjustment price of the stock    Positive, Negative, Equal
As described in the previous section, the basic selection follows two paths according to
the analysis structure and design:
(1) SMeDA: To deal with the potential over-fitting problem, it is necessary to reduce
the amount of text in each article to the description of the main HSI content only.
Otherwise a classification algorithm tends to overfit the training data; reducing the
text improves the data structure and the accuracy of data analysis, and is also
related to dimensionality reduction. Hsu et al.’s work claimed that high-level
features might give a more promising result than features.
(2) HSI historical data analysis: using the past index price data since 2017 to make a
general movement prediction with algorithms.
Features are unique and meaningful. Many feature extraction techniques can be applied
to identify the features in a dataset. Generally, each record in the dataset is a row
and the various features are the columns. Feature extraction starts from an initial set
of measured text and numeric data and builds derived values, which is also related to
dimensionality reduction (Scikit-learn developers, 2017). scikit-learn (sklearn) is the
main library used to extract features in a format supported by machine learning
algorithms from all stock information, such as text and images, converting it into
numerical features. The analysis loads data and text features from the dictionary.
3.8 Summary
Chapter 4 Experimental Framework
Following the selections and methodologies of the design approach in chapter 3, this
chapter shows the workflow of the experiments, which serves as the baseline experiment.
The experiments process the collected data, including historical stock data and social
media data. The data analysis is implemented in Python to achieve the aims of the
dissertation. The whole procedure uses Python script files (.py), each containing a list
of commands, to produce the analysis results as graphics and/or tables.
4.1.1 Features
BOW models: The collected textual social media data are converted into feature vectors.
Tokens outside the stop word list, together with high-frequency tokens, remain in the
feature vector for the subsequent algorithms.
4.1.2 Prediction Score and Movement
The prediction score from the social media collection is related to stock movement
performance (SMP) and is used to understand the word properties that identify future HSI
performance. SVM and MNB are required for the measurement.
4.2 Summary
Chapter 5 Analysis Result and Evaluation
In figure 8, the basic HSI trend is a foundation graphic showing the movement from 2017
up to the present. The format shows the open, high, low and close of the index on every
HSI trading day. Each HSI price bar is colored by comparison with the previous day’s
closing price: if the current closing price is greater than the previous closing price,
the bar is green, representing “positive”; if the current closing price is less than the
previous closing price, the bar is red, representing “negative”. Finally, if the current
closing price is equal to the previous closing price, the bar is grey, representing
“equal”.
Figure 9. Hang Seng Index (HSI) basic trend from Sep 2017 to Feb 2018
The graphic also shows the 10-day, 20-day, 50-day, 100-day and 150-day simple moving
averages (SMA), which show the general predicted trend from the short term to the long
term.
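A simple moving average is the mean of the most recent `window` closing prices,
recomputed as the window slides forward by one day. A minimal sketch with made-up
prices follows.

```python
def simple_moving_average(values, window):
    """Return the SMA for every position with a full window of data."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[end - window:end]) / window
        for end in range(window, len(values) + 1)
    ]


# Hypothetical closing prices with a 3-day window.
sma = simple_moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
# sma == [2.0, 3.0, 4.0]
```

A longer window, such as 150 days, reacts more slowly to recent prices and therefore
traces the long-term trend, while a 10-day window follows short-term movement.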
5.2 Data Analysis Result
Figure 10. Prediction future trend by regression linear until 8th February 2018
The aim of this part is to predict HSI movement until 8th February 2018 by linear
regression on the historical HSI price data. The result demonstrates that the index
movement is positive, although the HSI dropped in the short term from the peak of 33,000
to 29,500. A technical rebound back to 31,000 points or even higher is expected in the
future. After appropriate adjustments, figure 9 shows that the HSI will continue to grow
to 34,000 in the long term.
5.3 Social Media Analysis Result
Stop word removal and stemming are important pre-processing steps that make the tokens
in the data set uniform. For the algorithms to be highly effective, words with no
significant semantic content in a sentence must be removed to speed up text
pre-processing. The data set therefore has less redundancy and gives more meaningful
results in the algorithms.
['Short-term', 'index', 'expected', 'test', 'resistance', '24000', 'points', '.', 'The', 'index',
'expected', 'greater', 'chance', 'downswing', ',', '24000', 'points', 'larger', 'resistance',
',', 'short-term', 'test', '23500', 'level', 'support', '.', 'He', 'pointed', 'together',
'analysts', 'starting', 're-raise', 'future', 'earnings', 'forecast', 'emerging', 'market',
'enterprises', ',', 'I', 'believe', 'cyclical', 'economic', 'recovery', 'begun', 'continue',
'next', 'two', 'years', '.', 'Therefore', ',', 'target', 'price', 'HSI', '2017', '2018', '28000',
'points', '32000', 'Point', ',', ',', 'forecast', 'earnings', '14', 'times', '15', 'times', '.',
.
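The stop-word step can be sketched as follows. The stop-word list here is a tiny hypothetical sample (real lists are much longer), the input sentence is a hypothetical reconstruction, and stemming is left out to keep the example free of external libraries.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used in the study)
STOP_WORDS = {"the", "is", "to", "a", "of", "at", "in", "and", "that", "will"}


def tokenize(text):
    """Split text into word tokens (hyphens kept) and '.' or ',' punctuation tokens."""
    return re.findall(r"[A-Za-z0-9-]+|[.,]", text)


def remove_stop_words(tokens):
    """Drop tokens whose lower-case form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


# Hypothetical input sentence, used only for illustration
sentence = "Short-term index is expected to test the resistance at 24000 points."
tokens = remove_stop_words(tokenize(sentence))
print(tokens)
```

Punctuation is kept as separate tokens, as in the token list above.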
5.3.2 Support Vector Machine (SVM)
The linear SVM is used to identify positive and negative tokens; the sign and size of
each token's coefficient correspond to the HSI movement. With SVM feature selection,
the records are sorted by coefficient and the top 10 positive and top 10 negative
features are selected.
The results for the various training sets are shown in figure 11, table 4 and table 5;
the Fisher score behaved identically to the correlation coefficient. The average
coefficient of the top positive feature across those training sets is 0.930668. In
contrast, the average coefficient of the top negative feature across those training
sets is -0.9254995. The accuracy of the linear SVM with feature selection is 88.89%.
These positive and negative tokens are signals that help investors or analysts
understand the HSI movement trend (Appendix III provides a list of the positive and
negative values of all words).
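The selection of the strongest positive and negative tokens can be sketched as a sort over the coefficients. The coefficient values below are a small sample from Appendix III; the helper `top_features` is hypothetical and only illustrates the ranking step, not the SVM training itself.

```python
# Sample of token coefficients taken from Appendix III
coefficients = {
    "quamnet": 0.720, "expect": 0.636, "hsi": 0.619, "chance": 0.602,
    "mpfinance": -0.130, "stocks": -0.137, "consolidate": -0.145,
    "hit": -0.283, "rise": -0.315,
}


def top_features(coefs, k):
    """Return the k largest positive and the k most negative (token, coefficient) pairs."""
    ranked = sorted(coefs.items(), key=lambda item: item[1])  # ascending order
    negative = [(w, c) for w, c in ranked if c < 0][:k]
    positive = [(w, c) for w, c in reversed(ranked) if c > 0][:k]
    return positive, negative


positive, negative = top_features(coefficients, k=3)
print(positive)
print(negative)
```

With a fitted linear model, the same ranking would be applied to the learned weight vector, with one weight per token.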
5.3.3 Multinomial Naïve Bayesian (MNB)
MNB assumes that word occurrences are conditionally independent of each other, given
the class of the sentence. In Table 6, the tokens with the greatest impact indicate
that HSI is heading downward, especially “today” and the social media source names
“mpfinance” and “mingpao”. These tokens give a negative signal for the HSI movement
(Appendix IV provides a list of the MNB values of all words).
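The conditional independence assumption can be made concrete with a small multinomial naive Bayes sketch using Laplace smoothing. The training sentences, labels and function names are hypothetical and serve only to show how per-token log-likelihoods add up to a class score.

```python
import math
from collections import Counter


def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = sorted({w for d in docs for w in d})
    counts = {c: Counter() for c in set(labels)}
    for d, c in zip(docs, labels):
        counts[c].update(d)
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in counts}
    log_like = {}
    for c, cnt in counts.items():
        total = sum(cnt.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((cnt[w] + alpha) / total) for w in vocab}
    return log_prior, log_like


def predict(doc, log_prior, log_like):
    """Pick the class with the highest summed log score; unseen words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical tokenized commentary and movement labels
docs = [["index", "expected", "rise"], ["support", "rebound"],
        ["fluctuate", "today"], ["today", "resistance"]]
labels = ["up", "up", "down", "down"]

log_prior, log_like = train_mnb(docs, labels)
print(predict(["today", "fluctuate"], log_prior, log_like))
```

Because the class score is a sum of per-token terms, a token such as "today" that appears mostly in "down" sentences pulls every sentence containing it towards "down".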
Both SVM and MNB yield a coefficient value for each “feature”. Generally, SVM is used
most of the time for text classification and is better at analyzing full-length
content, while MNB can be interpreted as a linear model and is good at analyzing
snippets or short documents. Generative classifiers such as naive Bayes have also been
reported to outperform discriminative classifiers when few training cases are
available (Ng & Jordan, 2002).
Both MNB and SVM are therefore appropriate and strong baseline classification
algorithms for analyzing each text case and obtaining positive and negative feature
values from analyst commentary, which give a signal on the HSI stock price movement.
Random forest (RF) is often used for prediction; here, its result also gives the
feature importance, which indicates which word variables have the most effect in the
model. The predictive power (importance) of each feature in the random forest is under
0.05 (Table 7). The random forest forecast is based on the data set of prediction
articles and their content.
Figure 12. Feature Importance Random Forest
The RF feature selection used for splitting depends on impurity reduction, which is
biased towards variables with many categories. Correlated word features can be used as
predictors if an analyst's article contains many correlated features.
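Impurity reduction for a single split can be sketched with the Gini index. This is a generic illustration of the criterion, not the dissertation's random forest; the documents, labels and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def impurity_reduction(docs, labels, word):
    """Gini decrease obtained by splitting the documents on presence of `word`."""
    left = [y for d, y in zip(docs, labels) if word in d]
    right = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Hypothetical token sets and movement labels
docs = [{"support", "rebound"}, {"support", "level"},
        {"resistance", "today"}, {"today", "level"}]
labels = ["up", "up", "down", "down"]

print(impurity_reduction(docs, labels, "support"))
print(impurity_reduction(docs, labels, "level"))
```

A word that separates the classes cleanly yields a large reduction, while a word that appears equally in both classes yields none. A forest typically aggregates such reductions over all splits and trees to obtain a feature importance.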
5.3.4 eXtreme Gradient Boosting (Xgboost)
XGBoost provides an importance score that indicates how useful or valuable each
feature was in the construction of the boosted decision trees within the model. The
more an attribute is used to make key decisions in the decision trees, the higher its
relative importance. The importance is calculated explicitly for each attribute in the
dataset, allowing attributes to be ranked and compared with each other.
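One simple way to turn "how often an attribute is used" into a score is to count the splits per feature and normalise the counts. The sketch below assumes this split-count view of importance; boosting libraries also offer gain-based variants, and the tree contents here are hypothetical.

```python
from collections import Counter


def split_count_importance(trees):
    """Share of all splits that use each feature, ordered from most to least used."""
    counts = Counter(feature for tree in trees for feature in tree)
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.most_common()}


# Each hypothetical tree is represented only by the features used at its split nodes
trees = [["level", "market", "level"], ["level", "points"], ["market", "level"]]
importance = split_count_importance(trees)
print(importance)
```

Normalising by the total number of splits makes the scores sum to one, so attributes can be ranked and compared directly.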
Figure 13. Feature Importance XGBoost
In figure 13, features are automatically named according to their index in the input
array (X); the top 10 run from "level" to "market". After manually mapping these
indices to the token names, the plot shows that "level" has the highest importance
(0.057279237) and "market" (0.035799522) has the lowest importance among the top 10
features.
5.4 Comparison on Prediction Movement Result
Table 9 lists the movement predicted by the four algorithms and the movement predicted
in the analysts' articles, based on the text and the predicted price, for 54 stock
commentary records; the trained classifiers are applied to this test set. The
movements predicted by the analysts and by the algorithms are compared with the actual
index price movement. Each sentence has been processed with stop-word removal and
stemming, and then classified with SVM, NB, RF and XGBoost to forecast the future
movement of HSI. The actual and predicted movements are closely related, which
supports the forecasts made in the stock commentary. “Up” means that the market went
upward and “down” means that the market went downward.
The accuracy of the analysts' predicted movement against the actual HSI movement in
these 54 records is 68.52%, that is, more than half of the predictions agreed with the
actual index movement; the SVM and XGBoost predictions have the same accuracy of
64.81%.
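Accuracy here is the share of records in which the predicted movement equals the actual movement. The sketch below uses hypothetical label lists; it also checks the arithmetic that, with 54 records, 68.52% corresponds to 37 matching records and 64.81% to 35.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Hypothetical movement labels, used only for illustration
predicted = ["up", "up", "down", "up", "up", "down"]
actual = ["up", "down", "down", "up", "up", "up"]
acc = accuracy(predicted, actual)

# Percentages implied by 37 and 35 matches out of 54 records
analyst_pct = round(37 / 54 * 100, 2)
svm_pct = round(35 / 54 * 100, 2)
print(acc, analyst_pct, svm_pct)
```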
remain above the day low of
mingpao mpfinance the hsi is expected to fluctuate between and today | up up down up up up
aastock kwok ka yiu managing director at china goldjoy asset management ltd index is expected to stabilize at the level of points the market outlook effectively further challenges resistance | up up up up up up
aastock kwok ka yiu managing director at china goldjoy asset management ltd the upward trend is expected to continue hong kong stocks the index will further test the level the bottom support at around | up up up up up up
mingpao mpfinance hong kong stocks are expected to try another points | up up up up up down
mingpao mpfinance the hsi is expected to fluctuate between and today | up up down up up up
mingpao mpfinance the hsi is expected to fluctuate between and today | up up down up up down
aastock kwok ka yiu managing director at china goldjoy asset management ltd short term consolidation is expected to maintain the pattern the index continued at to points on the level | up up up up up up
sina pong po lam paul managing director of pegasus fund managers limited small and medium sized stocks will benefit us stocks continue to rise but investors will be worried about the bubble burst in the united states or next year will be the transfer of funds to hong kong i believe hong kong stocks have the opportunity to rise above points next year | up up up up up up
mingpao mpfinance the hsi is expected to fluctuate between and today | up up down up up down
hkej lam ka kei hang seng index is expected within the short term will still be to from the distance | up up up up up up
mingpao mpfinance the hsi is expected to go up or down today at | down down down down down up
aastock chik yiu fai head of research department of bright smart securities commodities group limited hong kong stocks are expected to easily break above today but will have some resistance at the previous high of | up up up up up down
mingpao mpfinance the hang seng index is expected to consolidate from a high of to | up up up up up down
aastock kwok ka yiu managing director at china goldjoy asset management ltd the short term is still expected to continue the pattern of consolidation the index remained at points to points level | up up up up up up
hkej sam chi yung senior strategist at south china financial holdings ltd market outlook is still powerful and then break up or higher level is absolutely capable of seeing | up up up up up up
aastock kwok ka yiu managing director at china goldjoy asset management ltd the index is expected to continue to up and down between and | up up up up up up
etnet chan fung chu today or is likely to challenge upward points mark | up up up up up up
mingpao mpfinance the hsi is expected to fluctuate between and today | up up down up up up
mingpao mpfinance the hsi is expected to fluctuate between and today | up up down up up up
sina missing it is expected that the hsi will continue to consolidate at the level of about on the eve of the christmas holidays | up up up down up up
mingpao mpfinance the hsi is expected to hover around to today | up down down down down up
aastock kwok ka yiu managing director at china goldjoy asset management ltd the index is expected to hold at to | up up up up up up
mingpao mpfinance the hsi is expected to fluctuate between and today | up up down up up down
quamnet yu kwan lung independent stock commentator the hang seng index is forecast to fluctuate between and in the coming few sessions | up up up up up down
aastock kwok ka yiu managing director at china goldjoy asset management ltd index or test points support point level to become short term rebound resistance | up up up up up up
mingpao mpfinance hong kong stocks are expected to go up or down today at points | up up up up up down
mingpao mpfinance the hang seng index is expected to rebound today and is expected to go up and down at to | down down up up down down
mingpao mpfinance hong kong stocks are expected to go up or down today at to | up up up up up up
mingpao mpfinance the hsi is expected to fluctuate between and today | up up down up up up
aastock kwok ka yiu managing director at china goldjoy asset management ltd the index is expected to further test the level and the underlying support moves up to | up up up up up down
aastock kwok ka yiu managing director at china goldjoy asset management ltd the index is still expected to consolidate in the short term maintaining the to level | up up up up up up
mingpao mpfinance the hsi is expected to fluctuate between and today | up up down up up up
mingpao mpfinance the hsi is expected to fluctuate between and today | up up down up up down
aastock kwok ka yiu managing director at china goldjoy asset management ltd short term external situation is not yet clear the short term index continues to be in a consolidation pattern maintained at to points on the level | up up up up up up
mingpao mpfinance the hsi is expected to hit the daily level of today | down down down down down down
mingpao mpfinance the hsi is expected to go up and down at to | up down down down down down
etnet shek kang chuen arthur head of research of hong kong economic times keep steadily from to | down up up up up down
aastock kwok ka yiu managing director at china goldjoy asset management ltd the index is expected to consolidate ahead of the short term and maintain its level at to | up up up up up up
aastock kwok ka yiu managing director at china goldjoy asset management ltd expected to maintain the index at to points on the level | up up up down down up
etnet tang sing hing kenny chairman of the hong kong institute of financial analysts and professional commentators limited expected today th is expected to challenge points | up up up up up down
mingpao mpfinance hang seng index is expected to test today or on trial | up up up up up down
aastock kwok ka yiu managing director at china goldjoy asset management ltd the index is expected to maintain its stable pattern on the last trading day of continuing from to points | up up up up up up
mingpao mpfinance the hsi is expected to go up and down at to today | up up down up up up
mingpao mpfinance the hsi is expected to test the level today or on the day | down down down down down up
quamnet yu kwan lung independent stock commentator as the investment climate downturn a basket of hang seng index constituents is sold out and the hang seng index is forecast to fluctuate between and in the next few trading days | up up up up up down
aastock kwok ka yiu managing director at china goldjoy asset management ltd expect the index to test a psychological barrier of points after the elimination of the selling pressure the index is still expected to test points repeatedly | up up up up up up
mingpao mpfinance the hsi is expected to fluctuate between and today | up up down up up down
quamnet yu kwan lung independent stock commentator the hang seng index is forecast to fluctuate between and in the coming few sessions | up up up up up up
etnet chong chi ho chief financial capital limited hong kong stocks are expected to further test the level today | up up up up up up
finet missing missing the hang seng index trend repeatedly at the level of contention | up down up up up up
mingpao mpfinance the hsi is expected to hit the daily level at today | down down down down down up
mingpao mpfinance the hsi is expected to fluctuate between and today | up up down up up up
5.5 Goal-Based Trading Experiment
To test the authenticity of the predicted HSI trend, a trading experiment was carried
out on Plus500 (http://www.plus500.com), a well-known international online CFD
trading platform that offers trading instruments including crypto, indices, forex and
commodities.
For the actual trading experiment, investing in the Hang Seng Index Future (Hong Kong
50 HSI) is a direct way to test the prediction result, with the aim of making a
profit.
An HSI future is traded as a contract: every contract has an expiry date, and a
premium is charged on each trading day for which the investor holds the contract up to
the expiry date. In this case each index point is worth HKD$50. “Buy” means betting on
a rise and “Sold” means betting on a drop, so after a “Buy” each point that HSI rises
gains HKD$50, and each point that it falls loses HKD$50.
The trading experiments are divided into short term (one-day trading) and medium term
(at least a month); the confidence level of this experiment is 85%.
According to figure 10, the trading record shows a purchase of Hong Kong 50 (HSI) in
“Buy” mode at an opening rate of 29,315 on 9th February 2018. The purchase amounted to
2,250 contracts on a rise in HSI, with an expiry date of 23rd February 2018.
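The profit rule of HKD$50 per index point can be sketched as a small function. The entry level of 29,315 is taken from the trading record; the exit level and the helper name `pnl_hkd` are hypothetical, and the daily premium charge is deliberately ignored.

```python
POINT_VALUE_HKD = 50  # each index point is worth HKD$50 per contract


def pnl_hkd(entry, exit_level, contracts=1, side="buy"):
    """Profit or loss in HKD; a "buy" gains when the index rises, a "sell" when it falls."""
    direction = 1 if side == "buy" else -1
    return (exit_level - entry) * direction * POINT_VALUE_HKD * contracts


# Entry level from the trading record; the exit level is hypothetical
gain = pnl_hkd(29315, 29415, contracts=1, side="buy")
loss = pnl_hkd(29315, 29415, contracts=1, side="sell")
print(gain, loss)
```

A 100-point rise therefore earns HKD$5,000 per contract on the "Buy" side and costs the same amount on the opposite side.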
5.6 Issues Influencing HSI Prediction Result
Because the Hang Seng Index (HSI) includes a number of blue chips and is highly
correlated with the global stock market, the following news and information will
affect the price of HSI.
1) Global Issue
The policy of international trade, such as trade tariffs, influences the stock market
atmosphere between countries. When the news is favourable, stock prices tend to rise
internationally; otherwise, it causes high risk in the market.
According to CNN Politics, U.S. President Donald Trump hit China with tariffs,
withdrawing nearly 60 billion U.S. dollars from China (Diamond, 2018). This heightened
concerns about a global trade war and was a negative factor that shocked index prices
in the global financial market.
The other issue is the debt crisis of 2011, which put high pressure on world stock
markets. In Europe, the market was worried that debts might have to be restructured
and that the European debt problem might spread to larger countries such as Italy,
Spain and France. Both Italian and Spanish bond yields rose, and the market was
concerned that France might lose its highest sovereign debt rating. European countries
failed to reach a consensus on a mitigation plan for the debt issue, leading to a drop
in the stock market. As of the end of 2011, the FTSE index, DAX index and CAC index
had fallen by 5.6%, 14.7% and 17.0% respectively, while the stock markets of the five
European countries known as PIIGS (Portugal, Ireland, Italy, Greece and Spain) fell by
between 13.1% and 51.9%, with only Irish stocks rising, by 0.6% (Securities and
Futures Commission (SFC), 2012). Because of the debt issue, the Asian region's economy
was dragged down by Europe's sovereign debt problems, and regional stock markets
generally fell.
In Hong Kong, the European sovereign debt problems, the lowering of the US credit
rating and the possibility that the Mainland would introduce macro tightening measures
all damaged market sentiment. As of the end of 2011, the Hang Seng Index and the Hang
Seng State-owned Enterprises Index had fallen by 20.0% and 21.7% respectively from the
end of 2010 (Securities and Futures Commission (SFC), 2012).
3) Transaction Volume
High transaction volume is one of the forces lifting the Hang Seng Index (HSI).
Southbound inflows through Stock Connect into the Hong Kong stock market amounted to
around HK$339.9 billion in 2017 (Securities and Futures Commission (SFC), Research
Paper No. 62: A Review of the Global and Local Securities Markets in 2017, 2018).
4) Corporate Performance Report
Because the Hang Seng Index (HSI) includes 50 constituent companies, their quarterly
and overall performance reports are released every season. These companies' results
influence the short-term and long-term performance of HSI.
For instance, as the global market was optimistic about corporate profits in 2017,
coupled with the improvement of the basic economic factors, the global market made
good progress. In particular, US stocks hit a record high, and the pace of economic
recovery began to accelerate. In addition to the United States, major markets such as
Germany, the United Kingdom, South Korea and India all recorded historical highs in
2017. The US dollar continued to weaken, depreciating by 10% in 2017, which is another
factor that promoted continued capital inflows into Hong Kong and other emerging
markets (Securities and Futures Commission (SFC), Research Paper No. 62: A Review of
the Global and Local Securities Markets in 2017, 2018).
All in all, many internal and external issues and sudden news items are released and
published, and they inevitably influence the predicted stock price movement and the
atmosphere of the stock market. Hence, an investor or analyst must keep up to date
with these issues and information as well as with the predicted stock movement trend.
The movement predictions obtained from text and data analysis serve as reference
information under the prevailing macro conditions.
Chapter 6 Conclusion
The present-day stock market is characterized by the growing strength of influencing
factors. In this work, a corpus has been built that can be used to investigate the
importance of text analysis and to estimate and verify the predicted movement. The
text analysis algorithms showed that some words in analysts' prediction articles on
social media are indeed important; in particular, some encouraging and short-term
words, such as “support” or “limited expected”, turn out to be positive features.
From the results of 4 experiments on various text data sets, the average accuracy of
the four algorithms (SVM, MNB, RF and XGBoost) and of the analysts' predictions is
around 60%, yet the most insightful finding is that the accuracy of HSI prediction by
financial stock analysts is over 68%.
Overall, the average agreement of the algorithms' predictions with the actual result
is more than 35%. Social media stock price movement prediction therefore has a strong
reference value for the public.
Chapter 7 Further Research & Limitation
Further research can advance this dissertation and extend its scope and time span. It
can be divided into a research section and an experiment section. These innovations
could later be developed into commercial system software.
1) Advanced Research
To achieve a global text analysis of the stock market, the major indexes should be
covered, such as the Dow Jones index, the Deutsche Börse index and the SSE in China,
because these global indexes influence the performance of other global, regional and
national indexes (Galina, Iryna, & Sergii, 2015).
2) High-Tech Experiment
Selected Bibliography
[1] Alassiri, A., Mud, M., & Ghazali, R. (2014, February). Usage of Social
Networking Sites and Technological Impact on the Interaction Enabling
Features. International Journal of Humanities and Social Science, 4(4), 46-61.
[2] Batista, G., Wang, X., & Keogh, E. (2012). A Complexity-Invariant Distance
Measure for Time Series. University of California, 12.
[3] BlackRock Investment Institute. (2018). Weekly Commentary, March 12, 2018.
[4] Butler, M. (2009). An Artificial Intelligence Approach to Financial Forecasting
using Improved Data Representation, Multi-objective Optimization, and Text
Mining. Halifax, Nova Scotia: Dalhousie University.
[5] Campbell, W., Dagli, C., & Weinstein, C. (2013). Social Network Analysis with
Content and Graphs. Lincoln Laboratory Journal, 20(1), 62-80.
[6] Cooper, J. (2012). Hit and Run Trading: The Short-Term Stock Traders' Bible.
Wiley Trading.
[7] Das, S., & Chen, M. (2007). Yahoo! for amazon: Sentiment extraction from small
talk on the web. Management Science, 53(9), 1375-1388.
[8] Dascalaki, E., Droustsa, K., Gaglia, A., Kontoyiannidis, S., & Balaras, C. (2010,
February 17). Data Collection and analysis of the building stock and its energy
performance - An example for Hellenic buildings. ELSEVIER, 1231-1237.
[9] Day, M.-Y. (2014). Data Mining - Classification and Prediction. Taiwan:
Department of Information Management, Tamkang University. Retrieved from
http://slideplayer.com/slide/7027794/
[10] Degutis, A., & Novickyte, L. (2014). The Efficient Market Hypothesis: A Critical
Review of Literature and Methodology. EKONOMIKA, 93(2), 8-23.
[11] Diamond, J. (2018, March 23). Trump hits China with tariffs, heightening
concerns of global trade war. Retrieved from CNN politics:
https://edition.cnn.com/2018/03/22/politics/donald-trump-china-tariffs-trade-war/index.html
[12] Flugel, A. (2016, July). Burtch Works Flash Survey: SAS vs R vs Python! (B. W.
Recruiting, Ed.) Retrieved from
https://www.burtchworks.com/files/2016/07/SAS-vs-R-vs-Python-2016_webinar-PDF-deck.pdf
[13] Galina, A., Iryna, S., & Sergii, K. (2015). Analysis of the Global Stock Market
Trends. Journal of Finance and Economics, 4(3), pp. 67-71. Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.880.8378&rep=rep1&type=pdf
[14] Garefalakis, A., Dimitras, A., Koemtzopoulos, D., & Spinthiropoulos, K. (2011).
Determinant Factors of Hong Kong Stock Market. International Research
Journal of Finance and Economics(62), 50-60.
[15] Gera, M., & Goel, S. (2015, March). Data Mining - Techniques, Methods and
Algorithms: A Review on Tools and their Validity. International Journal of
Computer Applications, 113(18), 22-29.
[16] Hang Seng Indexes Company Limited. (2017). Hang Seng Indexes Report. Hong
Kong: Hang Seng Indexes. Retrieved from
http://www.hsi.com.hk/HSI-Net/static/revamp/contents/en/dl_centre/factsheets/FS_HSIe.pdf
[17] HKEX. (2017, April 20). Severe Weather Arrangements. Retrieved from
http://www.hkex.com.hk/Services/Trading-hours-and-Severe-Weather-Arrangements/Severe-Weather-Arrangements/Clearing-and-Settlement?sc_lang=en
[18] Ingason, A., Helgadóttir, S., Loftsson, H., & Rögnvaldsson, E. (2008). A Mixed
Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities
(HOLI). Springer-Verlag Berlin Heidelberg , 205-216.
[19] Investor Education Centre. (2014, December 2). edb.gov.hk. Retrieved from
Understanding the importance of Hang Seng Index:
http://www.edb.gov.hk/attachment/en/curriculum-development/kla/technology-edu/resources/business-edu/CDI020150246_Handout_2_20141202_Importance_of_HSI.pdf
[20] Irfan, R., King, C., Grages, D., Ewen, S., Khan, S., Madani, S., . . . Li, H. (2004).
A Survey on Text Mining in Social Networks. Cambridge University Press, 00:0,
1-24.
[21] Jiang, K., Ediger, D., Corley, C., Farber, R., & Reynolds, W. (2010). Massive
Social Network Analysis: Mining Twitter for Social Good. 39th International
Conference on Parallel Processing, (pp. 583-593). USA.
[22] Kannan, S., Sekar, S., Sathik, M., & Arumugam, P. (2010, March 17). Financial
Stock Market Forecast using Data Mining Techniques. Proceedings of the
International MultiConference of Engineers and Computer Scientists.
[23] Kirkos, E., & Manolopoulos, Y. (2004). Data Mining in Finance and
Accounting: A Review of Current Research Trends. Department of Accounting,
Technological Educational Institution of Thessaloniki, Greece.
[24] Kloptchenko, A., Back, B., Vanharanta, H., Eklund, T., Karlsson, J., & Visa, A.
(2002). Combining data and text mining techniques for analyzing financial
reports. Eighth Americas Conference on Information Systems, 20-28.
[25] Kovalerchuk, B. (2015). Data Mining For Financial Applications. USA: Central
Washington University.
[26] Li, B., Chan, K., Ou, C., & Reuifeng, S. (2017, February 2). Discovering public
sentiment in social media for predicting stock movement of publicly listed
companies. Elsevier, 81-92. Retrieved 9 7, 2017
[27] Li, Y., Li, X., & Wang, H. (2016, November 29). Based on Multiple Scales
Forecasting Stock Price with a Hybrid Forecasting System. American Journal of
Industrial and Business Management, 927-940.
[28] Mahajan, A., Dey, L., & Haque, S. (2008). Mining financial news for major
events and their impacts on the market. WI-IAT '08. IEEE/WIC/ACM
International Conference on, 1, 423-426.
[29] Mak, K., Ho, T., & Ting, S. (2011). A Financial Data Mining Model for
Extracting Customer Behavior. The Hong Kong Polytechnic University,
Department of Industrial and Systems Engineering. Hong Kong: Convoy
Financial Services Holdings Limited. Retrieved July 23, 2011
[30] Marwala, L. (2017). Forecasting the Stock Market Index Using Artificial
Intelligence Techniques. Johannesburg: Faculty of Engineering and the Built
Environment, University of the Witwatersrand.
[31] Moldovan, D. (2011). Business Intelligence: Data Mining in Finance. Babeş-
Bolyai University of Cluj-Napoca, Faculty of Economics and Business
Administration. Rome: Babeş-Bolyai University of Cluj-Napoca.
[32] Momani, G., & Alsharari, M. (2012). Impact of Economic Factors on the Stock
Prices at Amman Stock Market (1992-2010). International Journal of Economics
and Finance, 4(1), 151-159.
[33] Ng, A., & Jordan, M. (2002). On Discriminative vs. Generative classifiers: A
comparison of logistic regression and naive Bayes. Stanford University.
[34] Olowe, M., Gaber, M., & Stahl, F. (2013). A Survey of Data Mining Techniques
for Social Network Analysis. UK: School of Computing Science and Digital
Media, Robert Gordon.
[35] Ozer, P. (2008). Data Mining Algorithms for Classification. Netherlands:
Radboud University Nijmegen.
[36] Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment
Classification using Machine Learning. Association for Computational
Linguistics, 10, 79-86.
[37] Powers, D. (2007). Evaluation: From Precision, Recall and F-Factor to ROC,
Informedness, Markedness & Correlation. Australia: School of Informatics and
Engineering, Flinders University of South Australia.
[38] Puget, J.-F. (2016, December 23). KDnuggets. Retrieved from
http://www.kdnuggets.com/2017/01/most-popular-language-machine-learning-
data-science.html
[39] Radaideh, Q., Assaf, A., & Alnagi, E. (2013). Predicting Stock Prices Using Data
Mining Techniques. The International Arab Conference on Information
Technology. of Computer Information Systems, Faculty of Information
Technology and Computer Science, Yarmouk University, Irbid.
[40] Ravisankar, P., Ravi, V., Rao, G., & Bose, I. (2011). Detection of financial
statement fraud and feature selection using data mining techniques. Elsevier B.V.,
491-500.
[41] Sap, M., & Mohebi, E. (2017, December 4). Hybrid Self Organizing Map for
Overlapping Clusters. International Journal of Signal Processing, Image
Processing and Pattern Recognition, 11-20.
[42] Sawant, A., & Chawan, P. (2013). Comparison of Data Mining Techniques used
for Financial Data Analysis. International Journal of Emerging Technology and
Advanced Engineering, 112-116.
[43] Sawant, A., & Chawan, P. (2013, May). Study of Data Mining Techniques used
for Financial Data Analysis. International Journal of Engineering Science and
Innovative Technology (IJESIT), II(3), 503-509.
[44] Scikit-learn developers. (2017). Feature extraction. Retrieved from scikit-learn:
http://scikit-learn.org/stable/modules/feature_extraction.html
[45] Securities and Futures Commission (SFC). (2012). Research Paper 50: Review
of Global and Hong Kong Securities Market in 2011. Hong Kong: Securities and
Futures Commission (SFC).
[46] Securities and Futures Commission (SFC). (2018). Research Paper No. 62: A
Review of the Global and Local Securities Markets in 2017. Hong Kong.
[47] Soumya, S., & Deepika, N. (2016, January). Data Mining With Predictive
Analytics for Financial Applications. International Journal of Scientific
Engineering and Applied Science (IJSEAS), II(1), 310-317.
[48] The ASPIRA Association. (2017). Short-Term and Long-Term Investments
Options. Investments: Resources for Reaching the American Dream.
[49] The ASPIRA Association. (2017). Short-Term and Long-Term Investments
Options. Washington: Investments: Resources for Reaching the American
Dream. Retrieved from
https://www.aspira.org/sites/default/files/Inv_Fac_M5_V2.pdf
[50] Tsui, D. (2014). Predicting Stock Price Movement Using Social Media Analysis.
California: Stanford University.
[51] Visa, A., Kloptchenko, A., Eklund, T., Karlsson, J., Back, B., & Vanharanta, H.
(2004, March). Combining data and text mining techniques for analysing
financial reports. 12, pp. 29-41.
[52] Von, V. (2014). The Value of Social Media for Predicting Stock Returns –
Preconditions, Instruments and Performance Analysis. Germany: Technische
Universitat Darmstadt.
[53] Yakushev, A., & Mityagin, S. (2014). Social networks mining for analysis and
modeling drugs usage. ELSEVIER, 29, 2462-2471.
[54] Yu, K., Ng, H., Wong, W., Chu, K., & Chan, K. (2010). An Empirical Study of
the Impact of Intellectual Capital Performance on Business Performance. The 7th
International Conference on Intellectual Capital, Knowledge Management &
Organisational Learning, The Hong Kong Polytechnic University (pp. 1-11).
Hong Kong: The University of Hong Kong, HKSAR.
[55] Zeng, D., Chen, H., Lusch, R., & Li, S. (2010, November). Social Media
Analytics and Intelligence. IEEE Computer Society, 13-16.
Appendix I – The relationship between SP500 and HSI
Many investors and analysts use the relationship between SP500 and HSI to predict the
next day's index price. According to Macroaxis (2018), the pair cross-correlation
between SP500 and HSI was 0.87 over the 30 trading days from January 17, 2018 to
February 16, 2018.
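A correlation coefficient of this kind can be computed as a Pearson correlation between two aligned series. The sketch below is a generic illustration with hypothetical index levels; it does not reproduce the 0.87 figure quoted above and is not the data provider's method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_var = sum((x - x_mean) ** 2 for x in xs)
    y_var = sum((y - y_mean) ** 2 for y in ys)
    return cov / math.sqrt(x_var * y_var)


# Hypothetical daily index levels, used only for illustration
sp500 = [2700.0, 2720.0, 2690.0, 2750.0, 2735.0]
hsi = [31000.0, 31200.0, 30900.0, 31500.0, 31300.0]
r = pearson(sp500, hsi)
print(round(r, 2))
```

The coefficient lies between -1 and 1; values near 1 mean that the two indexes tended to move in the same direction over the window.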
Pair Volatility
Assuming a 30-trading-day horizon, SP500 is expected to generate 0.95 times more
return on investment than Hang Seng. However, SP500 is 1.05 times less risky than Hang
Seng. SP500 trades at about -0.11 of its potential returns per unit of risk, while
Hang Seng is currently generating about -0.13 per unit of risk. An investment of
280,256 in SP500 on January 17, 2018, sold today, would lose 10,393, giving up 3.71%
of portfolio value over 30 days.
Appendix II – Technical Indicators
A technical indicator is a series of data points that are derived by applying a formula
to the price data of a security. Price data includes any combination of the open, high,
low or close over a period of time. Some indicators may use only the closing prices,
while others incorporate volume and open interest into their formulas. A technical
indicator offers a different perspective from which to analyze the price action. Some,
such as moving averages, are derived from simple formulas and the mechanics are
relatively easy to understand.
A simple moving average (SMA) is an indicator that calculates the average price of a
security over a specified number of periods. In Hong Kong, the 10-day, 20-day,
50-day, 100-day and 150-day SMAs are commonly used to represent short-term and
long-term stock price movement. For example, to calculate the 10-day SMA, assume
that 12 days of historical HSI data are available and the next day's movement is to be
forecast. The most recent 10 days are averaged, and the result is taken as the forecast
price value for the next day (𝑥̂). The following is a reference equation:
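\[ \hat{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad n = 10 \]
where $\sum_{i=1}^{10} x_i$ is the sum of the prices $x_i$ over the 10 days used in the
example above to calculate the 10-day SMA, and $n$ is the total number of days by
which that sum is divided.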
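The 10-day SMA example can be sketched in a few lines. The 12 closing values below are placeholders chosen for easy arithmetic, not actual HSI data, and the function name `simple_moving_average` is hypothetical.

```python
def simple_moving_average(prices, window=10):
    """Average of the most recent `window` prices, used as the next-day forecast."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(prices) < window:
        raise ValueError("not enough history for the requested window")
    recent = prices[-window:]          # keep only the latest `window` observations
    return sum(recent) / window


# 12 days of placeholder closing prices (assumed values for illustration)
closes = [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111]

forecast = simple_moving_average(closes, window=10)
print(forecast)   # 106.5, the mean of the last 10 values (102 ... 111)
```

Only the last 10 of the 12 observations enter the average; the two oldest values are ignored.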
Appendix III – Feature Importance SVM
Rank feat coeff Rank feat coeff Rank feat coeff
15 rebound resistance 0.754 117 arthur head 0.130 219 open higher -0.102
16 quamnet 0.720 118 chuen 0.130 220 director -0.115
17 term market 0.673 119 kenny chairman 0.124 221 mpfinance -0.130
18 expect 0.636 120 kenny 0.124 222 etnet sam -0.136
19 hsi 0.619 121 professional 0.124 223 stocks -0.137
20 missing hsi 0.617 122 professional commentators 0.124 224 consolidate -0.145
21 chance 0.602 123 etnet tang 0.124 225 financial holdings -0.148
22 expected remain 0.582 124 institute financial 0.124 226 holdings -0.148
23 outlook 0.577 125 analysts professional 0.124 227 index continue -0.157
24 market outlook 0.577 126 sing 0.124 228 rebound -0.157
43 management 0.369 145 lung 0.100 247 mark -0.240
44 ho 0.362 146 lung independent 0.100 248 stocks expected -0.251
45 high year 0.341 147 capital 0.099 249 missing -0.253
46 kwong 0.335 148 yu 0.097 250 stock -0.261
47 institute 0.334 149 expected fluctuate 0.096 251 research hang -0.274
48 challenge 0.331 150 ka 0.093 252 hit -0.283
49 mpfinance hang 0.328 151 economic times 0.074 253 wai -0.293
50 chan 0.324 152 times 0.072 254 etnet kwong -0.314
51 outlook expected 0.320 153 commentator hang 0.071 255 rise -0.315
52 points level 0.314 154 expected challenge 0.070 256 management expected -0.320
53 expected continue 0.313 155 limited believe 0.069 257 end -0.327
54 week 0.308 156 kwok 0.064 258 test level -0.337
57 index expected 0.278 159 chong 0.033 261 managing director -0.353
58 chairman 0.274 160 financial capital 0.033 262 near -0.357
59 high points 0.269 161 capital limited 0.033 263 repeated -0.363
60 test points 0.264 162 ho chief 0.033 264 index -0.364
61 month 0.263 163 chi ho 0.033 265 aastock chik -0.382
62 expected open 0.263 164 chong chi 0.033 266 index opened -0.404
63 mpfinance hong 0.252 165 chief financial 0.033 267 opened -0.404
64 level points 0.249 166 etnet chong 0.033 268 performance -0.407
66 kong economic 0.231 168 daily 0.032 270 expected hover -0.432
67 hong 0.230 169 expect index 0.023 271 fluctuate coming -0.436
68 hong kong 0.230 170 limited hsi 0.023 272 point -0.441
69 kong 0.230 171 management market 0.018 273 today points -0.463
70 higher 0.228 172 kwok ka 0.011 274 expected hit -0.470
71 economic 0.226 173 goldjoy 0.011 275 resistance points -0.486
72 open 0.221 174 ka yiu 0.011 276 management short -0.488
73 hkej 0.215 175 china goldjoy 0.011 277 test today -0.488
74 asset 0.215 176 director china 0.009 278 expected maintain -0.504
75 asset management 0.215 177 yiu managing 0.009 279 trend -0.505
76 short term 0.210 178 goldjoy asset 0.009 280 good -0.509
84 stock commentator 0.187 186 seng 0.002 288 coming -0.672
85 strategist south 0.186 187 seng index 0.002 289 break -0.684
86 yung 0.186 188 financial -0.004 290 aastock -0.714
87 sam 0.186 189 department -0.005 291 continue level -0.733
88 south china 0.186 190 research department -0.005 292 points points -0.747
89 senior strategist 0.186 191 hit daily -0.030 293 level today -0.750
90 senior 0.186 192 challenge resistance -0.035 294 missing hong -0.772
91 sam chi 0.186 193 fluctuate points -0.038 295 resistance -0.783
92 south 0.186 194 yiu -0.040 296 market expected -0.872
93 yung senior 0.186 195 expected opportunity -0.041 297 fall -0.896
94 china financial 0.186 196 support near -0.044 298 expected -0.909
95 chi yung 0.186 197 maintain -0.047 299 challenging -0.970
96 bright 0.184 198 maintained -0.055 300 contention -0.972
97 bright smart 0.184 199 maintained points -0.055 301 freeman securities -1.021
98 smart 0.184 200 francis -0.059 302 freeman -1.021
99 smart securities 0.184 201 kwok sze -0.059 303 morning -1.046
100 strategist 0.182 202 sze chi -0.059 304 highs -1.084
101 term index 0.178 203 chi francis -0.059 305 consolidation -1.126
102 hold 0.171 204 sze -0.059 306 hsi expected -1.258
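One way to read the table: in a linear classifier the decision score is a weighted sum of the feature values plus an intercept, so a positive coefficient pushes a commentary towards the positive class and a negative one pushes it away. The sketch below uses four coefficients copied from the table; the feature values of the example document and the zero intercept are assumptions made for illustration, and treating class 1 as an upward movement follows the target definition in Appendix V.

```python
# A few coefficients copied from the table above
coefficients = {
    "quamnet": 0.720,
    "expect": 0.636,
    "fall": -0.896,
    "hsi expected": -1.258,
}


def decision_score(features, coef, intercept=0.0):
    """Linear decision score: intercept + sum of coefficient * feature value."""
    return intercept + sum(coef.get(term, 0.0) * value
                           for term, value in features.items())


# Hypothetical TF-IDF weights for one commentary (assumed values)
document = {"expect": 0.5, "fall": 0.5}

score = decision_score(document, coefficients)
label = 1 if score > 0 else 0   # 1 = upward, 0 = downward (assumed class coding)
print(round(score, 3), label)
```

Here the negative weight of "fall" outweighs the positive weight of "expect", so the score is below zero and the sketch assigns the downward class.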
Appendix IV – Feature Importance Naïve Bayes
Rank feat coeff Rank feat coeff Rank feat coeff
1 today -4.104 103 stock -6.149 205 rise -6.651
2 mpfinance -4.110 104 management expected -6.168 206 sina -6.652
3 mingpao -4.111 105 level points -6.179 207 expected consolidate -6.654
4 mingpao mpfinance -4.111 106 points points -6.189 208 etnet kwong -6.656
5 expected -4.131 107 range -6.201 209 high year -6.672
6 hsi -4.136 108 high points -6.219 210 yung senior -6.675
7 mpfinance hsi -4.163 109 expected remain -6.226 211 senior strategist -6.675
8 hsi expected -4.175 110 financial -6.229 212 south china -6.675
9 expected fluctuate -4.209 111 expect -6.236 213 senior -6.675
10 fluctuate -4.210 112 market expected -6.252 214 sam chi -6.675
11 fluctuate today -4.215 113 expected hover -6.260 215 south -6.675
12 points -4.527 114 outlook -6.262 216 chi yung -6.675
13 index -4.717 115 market outlook -6.262 217 sam -6.675
14 yiu -4.885 116 management short -6.262 218 strategist south -6.675
15 aastock -4.900 117 believe -6.262 219 china financial -6.675
16 director -4.924 118 expected open -6.271 220 yung -6.675
17 china -4.937 119 stock commentator -6.278 221 chief -6.675
21 asset management -4.972 123 management market -6.281 225 financial holdings -6.687
22 asset -4.972 124 yu -6.291 226 rebound resistance -6.692
23 china goldjoy -4.983 125 remain level -6.295 227 challenge resistance -6.694
24 goldjoy -4.983 126 points support -6.295 228 chan -6.700
25 ka yiu -4.983 127 support near -6.335 229 range points -6.702
26 kwok ka -4.983 128 higher -6.352 230 term index -6.707
27 aastock kwok -4.990 129 hkej -6.353 231 days -6.717
28 director china -4.990 130 today points -6.378 232 breakthrough -6.717
29 goldjoy asset -4.990 131 kwong -6.379 233 capital -6.724
30 yiu managing -4.990 132 expected challenge -6.380 234 commentator coming -6.733
31 managing director -4.993 133 resistance support -6.390 235 wai -6.733
32 managing -4.993 134 support points -6.393 236 continue points -6.734
33 level -5.029 135 target -6.399 237 temporarily -6.735
34 hang -5.124 136 limited expected -6.405 238 today fall -6.735
35 seng -5.124 137 month -6.409 239 good -6.736
36 seng index -5.124 138 hover points -6.410 240 wong -6.747
37 hang seng -5.124 139 management expect -6.417 241 open today -6.750
38 index expected -5.158 140 quamnet yu -6.422 242 commentator hang -6.757
39 test -5.208 141 lung -6.422 243 hit -6.760
40 missing -5.248 142 kwan lung -6.422 244 daily -6.762
41 hong kong -5.268 143 kwan -6.422 245 open higher -6.777
42 hong -5.268 144 lung independent -6.422 246 economic times -6.778
43 kong -5.268 145 yu kwan -6.422 247 expected opportunity -6.782
44 market -5.303 146 singtao -6.427 248 highs -6.783
45 support -5.309 147 forecast -6.432 249 sze chi -6.784
46 stocks -5.367 148 coming -6.434 250 sze -6.784
47 term -5.392 149 outlook expected -6.436 251 kwok sze -6.784
48 expected today -5.420 150 expected maintain -6.448 252 francis -6.784
49 kong stocks -5.422 151 economic -6.456 253 chi francis -6.784
50 short -5.424 152 kwong man -6.472 254 new high -6.825
51 short term -5.442 153 man bun -6.472 255 new -6.825
52 continue -5.499 154 man -6.472 256 financial capital -6.833
53 research -5.541 155 bun -6.472 257 chief financial -6.833
54 expected test -5.558 156 appledaily -6.473 258 chi ho -6.833
55 limited -5.561 157 director head -6.485 259 chong -6.833
56 management index -5.571 158 end -6.485 260 etnet chong -6.833
57 points today -5.572 159 test resistance -6.490 261 capital limited -6.833
58 head research -5.573 160 opportunity test -6.492 262 ho chief -6.833
59 head -5.573 161 term market -6.493 263 chong chi -6.833
60 securities -5.608 162 limited hsi -6.505 264 narrow -6.843
61 mpfinance hang -5.690 163 fall -6.513 265 repeated -6.845
62 etnet -5.702 164 expect index -6.515 266 etnet sam -6.846
63 year -5.713 165 break -6.515 267 maintain pattern -6.863
64 bright smart -5.714 166 oncc -6.517 268 research hang -6.867
65 smart -5.714 167 chance -6.522 269 limited believe -6.886
66 smart securities -5.714 168 kong economic -6.526 270 consolidation -6.888
67 bright -5.714 169 rebound -6.543 271 missing hong -6.891
69 stocks expected -5.751 171 hold -6.549 273 analysts professional -6.897
70 challenge -5.771 172 consolidate -6.557 274 kenny chairman -6.897
71 group -5.806 173 investment -6.566 275 commentators limited -6.897
72 research department -5.833 174 point -6.569 276 hing kenny -6.897
73 department -5.833 175 kgi executive -6.572 277 hing -6.897
74 chik -5.834 176 kgi -6.572 278 chairman hong -6.897
75 fai -5.834 177 executive director -6.572 279 professional commentators -6.897
76 group limited -5.836 178 executive -6.572 280 institute financial -6.897
77 expected continue -5.838 179 bun kgi -6.572 281 kenny -6.897
78 missing missing -5.850 180 cnfol -6.575 282 professional -6.897
79 yiu fai -5.863 181 fluctuate points -6.575 283 etnet tang -6.897
80 department bright -5.863 182 times -6.577 284 kong institute -6.897
81 fai head -5.863 183 resistance points -6.580 285 tang sing -6.897
82 securities commodities -5.863 184 mark -6.582 286 tang -6.897
83 commodities group -5.863 185 ying -6.594 287 sing -6.897
84 chik yiu -5.863 186 index continue -6.595 288 sing hing -6.897
85 commodities -5.863 187 sessions -6.605 289 financial analysts -6.897
86 hover -5.866 188 coming sessions -6.605 290 fluctuate coming -6.913
87 high -5.882 189 chairman -6.610 291 performance -6.925
88 test points -5.915 190 institute -6.616 292 maintained -6.949
89 aastock chik -5.943 191 shek -6.618 293 maintained points -6.949
90 test level -5.954 192 shek kang -6.618 294 continue level -6.958
91 mpfinance hong -5.966 193 etnet shek -6.618 295 hit daily -6.959
92 opportunity -6.026 194 chuen -6.618 296 expected hit -6.973
93 near -6.050 195 chuen arthur -6.618 297 test today -6.984
94 maintain -6.058 196 kang -6.618 298 level today -6.996
95 open -6.065 197 kang chuen -6.618 299 trend -7.015
96 chi -6.072 198 research hong -6.618 300 challenging -7.023
97 points level -6.099 199 arthur head -6.618 301 index opened -7.077
98 week -6.122 200 arthur -6.618 302 opened -7.077
99 remain -6.128 201 missing hsi -6.644 303 contention -7.096
100 pattern -6.128 202 analysts -6.644 304 morning -7.124
101 level support -6.129 203 limited hong -6.645 305 freeman securities -7.321
102 quamnet -6.148 204 hovering -6.648 306 freeman -7.321
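All coefficients in this table are negative, which is consistent with their being log-probabilities of a term given a class. The code that builds `var_imp_NB` is not shown in Appendix V, so this reading is an assumption. Under that assumption, the sketch below shows how a multinomial Naïve Bayes model turns term counts into smoothed log-probabilities (scikit-learn's `MultinomialNB` exposes such values as `feature_log_prob_`). The counts are made up and the function name is hypothetical.

```python
import math


def feature_log_prob(term_counts, alpha=1.0):
    """Smoothed log P(term | class) for one class of a multinomial Naive Bayes model.

    term_counts: total count (or weight) of each term over the documents of the class.
    alpha: additive (Laplace) smoothing constant.
    """
    total = sum(term_counts.values()) + alpha * len(term_counts)
    return {term: math.log((count + alpha) / total)
            for term, count in term_counts.items()}


# Hypothetical term counts for one class (assumed values, not from the dataset)
counts = {"today": 30, "expected": 28, "freeman": 0}

log_probs = feature_log_prob(counts)
ranked = sorted(log_probs, key=log_probs.get, reverse=True)
for term in ranked:
    print(term, round(log_probs[term], 3))
```

Because every probability is below 1, every logarithm is negative; frequent terms sit near the top of the ranking (closer to zero) and rare terms at the bottom, which matches the ordering seen in the table.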
Appendix V – Algorithm using Python Programming Coding
import pandas as pd
import numpy as np
from datetime import timedelta
from sklearn.model_selection import KFold, train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import re
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
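def cleanData(text):
    # Lower-case the commentary text
    text = text.lower()
    # Drop the possessive "'s"
    text = re.sub(r"\'s", " ", text)
    # Replace every non-letter character (digits, punctuation) with a space
    text = re.sub(r"[^A-Za-z]", " ", text)
    return text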
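As a quick check of what cleanData does, the sketch below repeats the function and applies it to a made-up sentence. The sample text is hypothetical and not drawn from the dataset.

```python
import re


def cleanData(text):
    text = text.lower()                      # lower-case everything
    text = re.sub(r"\'s", " ", text)         # drop the possessive "'s"
    text = re.sub(r"[^A-Za-z]", " ", text)   # non-letters become spaces
    return text


sample = "HSI's support at 30,000!"          # hypothetical commentary snippet
cleaned = cleanData(sample)
tokens = cleaned.split()                     # split on whitespace to inspect the words
print(tokens)                                # ['hsi', 'support', 'at']
```

Digits and punctuation are removed entirely, so price levels mentioned in a commentary do not survive as features; only the surrounding words do.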
# Read the Excel file containing the stock commentary
df = pd.read_excel("data/Stock Commentary and Prediction.xlsx")
# Previous day settlement price shifted by one is that day's closing price
stock['Closing_Price'] = stock['Prev. Day Settlement Price'].shift(-1)
stock = stock[['Date', 'Open', 'Closing_Price']]
print(stock.head(5))
# Convert date to datetime & merge with stock to get the opening value at the
# stock commentary date
stock['Date'] = pd.to_datetime(stock['Date'])
stock = stock.rename(columns = {'Date':'DateOfComentary'})
# Convert date to datetime & merge with stock to get the closing value at the
# stock prediction date
stock = stock.rename(columns = {'DateOfComentary':'DateOfPrediction'})
df = pd.merge(df, stock, on='DateOfPrediction', how='left')
print(df.head(5))
df.dropna(axis=0, inplace=True)
df = df.reset_index(drop=True)
df.columns.values[5] = 'Open'
df.columns.values[6] = 'Closing_Price'
print(df.head(5))
# Drop rows whose HSI prediction is missing (.loc replaces the deprecated .ix indexer)
df = df.loc[df['HSIPrediction'] != 'missing', :].reset_index(drop=True)
# Target variable: whether the actual market movement is upward (1) or downward (0)
df['movement'] = (df['Closing_Price'] > df['Open']).astype(int)
# target column
target = ['movement']
columns = ['text', 'pred_movement']
y = df[target]
X = df[columns]
tfidf.fit(X_train['text'].values) # Fit TF-IDF on training set
train_vect = tfidf.transform(X_train['text'].values) # Transform training set
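The configuration of the `tfidf` object is not shown in this listing, so the following is only a minimal sketch of what a TF-IDF transform computes. It uses unigrams and the smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation, which is the formulation documented as the default of scikit-learn's `TfidfVectorizer`. The function name `tfidf_vectors` and the two sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF rows) for whitespace-tokenised docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)                       # raw term counts in this document
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])       # scale the row to unit length
    return vocab, rows


docs = ["hsi expected rise", "hsi expected fall"]  # made-up cleaned commentaries
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print([round(v, 3) for v in vectors[0]])
```

Terms that occur in every document ("hsi", "expected") receive a lower weight than terms that distinguish one document from another ("rise", "fall").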
algo = []
acc = []
f1= []
algo.append('SVC')
acc.append(accuracy_score(y_test, pred)* 100)
f1.append(f1_score(y_test, pred))
model_NB = MultinomialNB()
model_NB.fit(train_vect, y_train)
algo.append('NB')
acc.append(accuracy_score(y_test, pred)* 100)
f1.append(f1_score(y_test, pred))
model_RF = RandomForestClassifier()
model_RF.fit(train_vect, y_train)
algo.append('RF')
acc.append(accuracy_score(y_test, pred)* 100)
f1.append(f1_score(y_test, pred))
print("Accuracy for Random Forest: ", accuracy_score(y_test, pred)*100)
print("F1 score for Random Forest: ", f1_score(y_test, pred))
model_XGB = xgb.XGBClassifier()
model_XGB.fit(train_vect, y_train)
algo.append('XGBoost')
acc.append(accuracy_score(y_test, pred)* 100)
f1.append(f1_score(y_test, pred))
X_test.to_csv("prediction_NB_SVM_XGB_RF.csv", index=False)
var_imp_RF_XGB = pd.DataFrame()
var_imp_RF_XGB['feat'] = tfidf.get_feature_names()
var_imp_RF_XGB['importance_RF'] = model_RF.feature_importances_[0:-1]
var_imp_RF_XGB['importance_XGBM'] = model_XGB.feature_importances_[0:-1]
var_imp_RF_XGB.to_csv('feature_importance_RF_XGB.csv', index=False)
# Storing feature importance for Linear SVC (for text based features only)
var_imp_SVC = pd.DataFrame()
var_imp_SVC['feat'] = tfidf.get_feature_names()
# Coefficients indicate the importance of each variable for predictions
var_imp_SVC['coeff'] = model_SVC.coef_[0,:-1]
var_imp_SVC = var_imp_SVC.sort_values('coeff', ascending=False)
# Plotting
plt.bar(algo, acc)
plt.title("Accuracy for different algorithms")
plt.ylabel('% Accuracy')
plt.show()
plt.bar(algo, f1)
plt.title("f1 score for different algorithms")
plt.ylabel('F1 score')
plt.show()
var_imp_RF_XGB = var_imp_RF_XGB.sort_values(by='importance_RF',
ascending=False)
plt.bar(var_imp_RF_XGB['feat'].head(10),
var_imp_RF_XGB['importance_RF'].head(10))
plt.xticks(rotation=45, size=6)
plt.title('Feature Importance Random Forest')
plt.ylabel('importance')
plt.show()
var_imp_RF_XGB = var_imp_RF_XGB.sort_values(by='importance_XGBM',
ascending=False)
plt.bar(var_imp_RF_XGB['feat'].head(10),
var_imp_RF_XGB['importance_XGBM'].head(10))
plt.xticks(rotation=45, size=6)
plt.title('Feature Importance XGBoost')
plt.ylabel('importance')
plt.show()
plt.bar(var_imp_SVC['feat'].head(10), var_imp_SVC['coeff'].head(10))
plt.bar(var_imp_SVC['feat'].tail(10), var_imp_SVC['coeff'].tail(10))
plt.xticks(rotation=45, size=6)
plt.title('Feature Importance SVC')
plt.ylabel('importance')
plt.show()
plt.bar(var_imp_NB['feat'].head(20), var_imp_NB['coeff'].head(20))
plt.xticks(rotation=45, size=6)
plt.title('Feature Importance Naive Bayes')
plt.ylabel('importance')
plt.show()
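The listing above records two metrics per model, accuracy and F1 score, through scikit-learn's `accuracy_score` and `f1_score`. For reference, the sketch below computes both from first principles for a binary target. The label vectors are made up for illustration, and the function names `accuracy` and `f1` are hypothetical stand-ins, not the library functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0                      # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = upward movement, 0 = downward movement
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]

acc_demo = accuracy(y_true_demo, y_pred_demo)
f1_demo = f1(y_true_demo, y_pred_demo)
print(round(acc_demo * 100, 1), round(f1_demo, 3))
```

Accuracy treats both classes alike, whereas F1 focuses on the positive (upward) class, so the two can diverge when upward and downward days are unevenly represented.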
Appendix VI – Feature of raw data for algorithm