
Forecasting Hong Kong Hang Seng Index Stock Price Movement Using Social Media Data Analysis

Author: WAI TAK LAU Intelligence and high-tech algorithms are the evolution of predicting stock price movement to inherit the historical researches - Social Media Data Analysis (SMeDA). It is perfect to collect Hong Kong social media data in the public text. Hang Seng Index (HSI) of stock price prediction movement based on the stock analysts expressed on financial and economic social media website has been an intriguing field of research ... ...


Forecasting Hong Kong Hang Seng Index Stock

Price Movement Using Social Media Data Analysis

LAU, Wai Tak

Master of Science in Information System

The Hong Kong Polytechnic University

September 2017
Statement of Authorship
Except where reference is made in the text of this dissertation, this dissertation
contains no material published elsewhere or extracted in whole or in part from a
dissertation presented by me for another degree or diploma.

No other person’s work has been used without due acknowledgement in the main text
of the dissertation.

This dissertation has not been submitted for the award of any other degree or diploma
in any other tertiary institution.

___________________________

Name: LAU, Wai Tak

Dated: 20/04/2018

Acknowledgements
After eight months of intensive research and development, this note of thanks marks
the completion of my master's dissertation. I have learned, implemented and gained a
great deal during this period of intensive study and writing, not only in mathematics
and science, but also in my personal attitude, training and knowledge.

I would like to express my sincere gratitude to Prof. CHAN Chun Chung Keith for his
assistance. His work suggested to me the topic of comparative social media data
analysis in relation to the Hang Seng Index, and showed me that data science is a new
trend in prediction and a quest for our times. In addition, I thank The Hong Kong
Polytechnic University for permission to use copyrighted documentation, originally
published publicly, in my dissertation.

Abstract
Intelligent, high-tech algorithms are the next step in the evolution of stock price
movement prediction, building on earlier research; Social Media Data Analysis (SMeDA)
is one such approach. Hong Kong social media data are well suited to collection
because they are available as public text. Predicting the movement of the Hang Seng
Index (HSI) from the views that stock analysts express on financial and economic
social media websites has been an intriguing field of research. The emotional tone of
their text conveys their perspective on future stock price movement. This dissertation
examines the credibility and reference value of stock analysts' movement predictions.

This paper applies text and data analysis, including supervised machine learning
algorithms, to features extracted from social media articles, and analyzes the
correlation between stock commentary and stock price prediction, compared with
prediction from historical stock data. The results suggest that the level of
credibility of stock analysts' text can roughly indicate the development trend of the
HSI and can serve the public as a reference (Appendix V and Appendix VI provide the
program code for all algorithms and the raw-data features used by the algorithms).

Keywords: Social Media Data Analysis, Hang Seng Index, stock analysts’ prediction
movement.

Table of Contents
Statement of Authorship
Acknowledgements
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Aims and Objectives
1.2 Research Process
1.3 Contribution and Originality
1.4 Dissertation Overview
Chapter 2 Background
2.1 General Terms and Concepts
2.1.1 Period of Investment
2.1.2 Hang Seng Index (HSI)
2.1.3 Social Media Mining Analysis
2.1.4 Data Mining
2.1.5 Text Mining – Text Analysis
2.1.6 Summary
2.2 Technical Background & Description
2.2.1 Time Series Similarity Analysis
2.2.2 Learning Algorithms
2.2.3 Text Processing and Understanding
2.2.4 Classification algorithms
2.2.5 Text Summarization
2.2.6 Text Analysis
2.2.7 Performance Evaluation
2.3 Movement Performance Index (MPI) Research
2.4 Summary
Chapter 3 Design Approach
3.1 Analysis Structure and Design
3.2 Data Acquisition
3.3 Data Preparation
3.4 Data Pre-Processing & Munging
3.4.1 Bag-of-words model
3.4.2 Topic Modelling
3.5 Social Media Analysis Methodologies
3.5.1 Analysis SMeDA
3.5.2 Data mining
3.6 Feature Selection
3.7 Feature Extraction
3.8 Summary
Chapter 4 Experimental Framework
4.1 Basic Experiment
4.1.1 Features
4.1.2 Prediction Score and Movement
4.2 Summary
Chapter 5 Analysis Result and Evaluation
5.1 Basic Form
5.2 Data Analysis Result
5.2.1 Regression Linear
5.3 Social Media Analysis Result
5.3.1 Stop words and stemming word
5.3.2 Support Vector Machine (SVM)
5.3.3 Multinomial Naïve Bayesian (MNB)
5.3.3 Random Forest (RF)
5.3.4 eXtreme Gradient Boosting (Xgboost)
5.4 Comparison on Prediction Movement Result
5.5 Goal-Based Trading Experiment
5.6 Issues Influencing HSI Prediction Result
Chapter 6 Conclusion
Chapter 7 Further Research & Limitation
Selected Bibliography
Appendix I – The relationship between SP500 and HSI
Appendix II – Technical Indicators
Appendix III – Feature Importance SVM
Appendix IV - Feature Importance Naïve Bayes
Appendix V - Algorithm using Python Programming Coding
Appendix VI - Feature of raw data for algorithm

List of Figures
Figure 1. Venn diagram describing the intersection of disciplines
Figure 2. Model - linear regression
Figure 3. Alignment of values in Euclidean distance
Figure 4. LDA plate notation
Figure 5. Analysis structure and design
Figure 6. Job for usage machine learning statistics in 2016
Figure 7. Development of text dictionary
Figure 8. Structure of F1 score
Figure 9. Hang Seng Index (HSI) basic trend from Sep 2017 to Feb 2018
Figure 10. Prediction future trend by regression linear until 8th February 2018
Figure 11. Feature Importance SVM
Figure 12. Feature Importance Random Forest
Figure 13. Feature Importance XGBoost
Figure 14. F1 score for different algorithms
Figure 15. Plus500 Trading platform statistics
Figure 16. Buy Hong Kong 50 statistics record on 16th February 2018

List of Tables
Table 1. 2017 Hang Seng Index Constituents
Table 2. Part of Speech (POS) tagging list
Table 3. Stock attribute description
Table 4. Top 10 positive features for linear SVM
Table 5. Top 10 negative features for linear SVM
Table 6. Top 20 negative features for linear Multinomial Naïve Bayesian
Table 7. Features importance Random Forest
Table 8. Features importance eXtreme Gradient Boosting
Table 9. Accuracy result in analysts’ prediction and algorithm prediction
Table 10. Prediction movement result with actual price

Chapter 1 Introduction
Analysis of the current stock market is discussed through multiple channels, such as
magazines, television and media applications. Many people have started to browse
social media sites, which act as repositories of answers to all kinds of opinion
questions, replacing the traditional practice of standing in a bank or at HKEX to
obtain information for transactions. Since financial social media sites became an
alternative communication channel, they have been used to analyze and predict the
trend of particular stocks and indexes, drawing on the fundamental background of a
stock and on technical indicators such as the Simple Moving Average (SMA), the
Relative Strength Index (RSI) and the Price-to-Earnings ratio (P/E ratio) as reference
techniques for forecasting future movement. Expressing opinions there has become a
popular phenomenon among investors (Alassiri, Mud, & Ghazali, 2014). However, some
investors simply follow market sentiment, while some experienced investors and
analysts follow the technical trend of a stock or index closely, predict next year's
turnover from annual reports, or even use data mining algorithms to forecast the trend.

To increase the accuracy and efficiency of investment, it is both necessary and highly
valuable to collect and analyze the text of messages on social media, so that social
media data analysis can support current prediction methods and techniques. Millions of
continually updated messages about stocks and indexes are uploaded to these sites. As
a result, some researchers have started to extract and analyze this massive amount of
social media content, rather than collecting only historical stock or index prices, to
predict the future trend.

Ever since the work of Visa et al. in 1999 and of Back et al. on the performance of
data and text mining, which observed that “Text bears more diverse information than
dry numbers do”, researchers have combined text investigation with the study of
companies’ financial ratios (Visa et al., 2004). Text mining techniques, which deal
effectively with enormous amounts of quantitative and qualitative data, have become a
new trend. Text in social media is an important and valuable text mining asset for
understanding, identifying and exploring nuggets of knowledge and algorithmic results.

This research builds on that work through an evolution of financial text mining,
namely social media data analysis (SMeDA), which combines text analysis with data
mining methods and uses analysts’ prediction reports on social media to forecast index
movement, presenting the predictions as graphics and statistics.

Data science, and data mining in particular, presents an excellent opportunity to
investigate the prediction of future stock and index performance, especially for a
representative Asian index, the Hang Seng Index (HSI).

However, reports of financial text analysis that forecast the future performance of
the Hang Seng Index (HSI) in Hong Kong are few, and the number of cases studied has
been very limited. Social media text analysis that combines data and text mining, in
particular, has been largely neglected.

The present study examines the reliability and availability of a large body of text
data and historical stock price data for forecasting the Hang Seng Index (HSI). It is
widely accepted, including in official social media reports, that experienced
financial market analysts and investors regularly express their views on the market
outlook, including the trend, volatility range and resistance levels. Given that the
opinions released on social media sites can be collected relatively easily, the aim of
this dissertation is to determine the expression value (positive or negative) of the
text of these reports and to demonstrate a prototype-matching approach to predicting
HSI performance.
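
As a rough illustration of what a positive or negative expression value means, the
following minimal sketch labels a comment by counting words from two small lexicons.
It is not the method used in this dissertation, which relies on supervised learning
algorithms; the word lists and the function name `polarity` are hypothetical and
chosen only for illustration.

```python
import re

# Hypothetical word lists for illustration only; they are not taken from the dissertation.
POSITIVE = {"rise", "rally", "gain", "bullish", "rebound"}
NEGATIVE = {"fall", "drop", "loss", "bearish", "plunge"}


def polarity(comment: str) -> str:
    """Label a comment 'positive', 'negative' or 'neutral' by counting lexicon hits."""
    tokens = re.findall(r"[a-z]+", comment.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("Analysts expect the HSI to rally and rebound"))
print(polarity("The index may plunge after a bearish week"))
```

A lexicon count ignores negation and context, which is one reason trained classifiers
are normally preferred for this task.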

1.1 Motivation

Everyone who invests money in the market aims to make a profit. Some investors simply
follow market sentiment, while some experienced investors and analysts closely follow
the technical trend or indicators of the index, such as the relative strength index
(RSI), the simple moving average (SMA) or other technical indicators derived by data
mining methods. Moreover, some of them watch Hang Seng evening futures, the American
Depositary Receipt (ADR) index or the Dow Jones index to predict whether the HSI will
go up or down the next day. In addition, stock analysts regularly post their views and
predicted HSI trends on social media, such as aastock and etnet, as a reference for
the public. The overlap of these three factors gives rise to the research idea of
“Social Media Data Analysis” (SMeDA).

Figure 1. Concept for being a topic

Although social media analysis has become a popular research topic, not much work has
been done on the financial aspect, in particular on studying patterns and predictions
in analysts’ content, even though such content has never been easier to process (Li,
Chan, Ou, & Reuifeng, 2017).

1.2 Aims and Objectives

The aim of this dissertation is to determine and critically examine the credibility of
financial stock analysts’ content and predictions regarding the Hang Seng Index (HSI),
and to forecast the future movement and trend of the stock market through data mining
and text mining analysis, in order to give notice of predicted movements and help
avoid financial crises.

To achieve this, stock analysts’ social media content (stock market prices, volume,
and analysts’ comments from social media such as news, blogs or video channels),
together with their HSI predictions and historical HSI data, will be processed by data
mining and text mining to forecast index movement. A two-stage architecture, which
first uses features from text analysis and then integrates numeric and textual
features, produces the experimental results in the form of data science and graphical
analysis reports. Data and text mining algorithms are used to evaluate the performance
of the experiments.

1.3 Research Process

In this dissertation, data mining and text mining learning algorithms and techniques
that utilize price factors are used to predict index movement performance. Mining
algorithms here refers to estimation processes that discover predicted movements and
hidden patterns and behaviors; techniques such as text analysis, time series analysis
and moving averages are used to forecast the movement performance index (MPI).

1.3 Contribution and Originality

Although there have been a number of interdisciplinary studies on data mining and text
mining, this dissertation is an original, in-depth study of text mining and text
analysis of social media in relation to investors’ desire to invest. It objectively
evaluates the performance of analysts’ perspectives in order to forecast stock price
movement accurately. The features from Hong Kong social media are found to give the
highest prediction accuracy, which may be linked to the facts of the current market.
Analysts’ social media posts comment on the market outlook, sentiment and future
perspectives. When their perspectives are used to predict the stock price, content
analysis is needed to evaluate their degree of credibility. This dissertation will be
the first to apply this estimation methodology to stock price movement in Hong Kong
and to support its use by analysts and investors as a reference.

1.4 Dissertation Overview

This dissertation is an effort to provide a financial social media text analysis based
on text mining and data mining algorithms, and to advance the current state of finance
with respect to the collection and analysis of index values and text comments by
financial analysts and experienced investors.

The general organization of this dissertation is as follows. Chapter 2 introduces the
important terms and key concepts of SMeDA and data analysis, covering both financial
index movements and the machine learning, data mining and text mining techniques used
in the financial community. Chapter 3 surveys the current research and the common
themes of the archiving methods used in the design approach to forecasting stock price
movement as of this writing. I then discuss in detail the work that I have done in
this area, including the social media data collection methodologies, data mining, text
mining and text analysis. Chapter 4 discusses the experimental framework. In Chapter
5, on analysis results and evaluation, I present four contributions to EDM analysis of
score matrices and the influence of topic information on that analysis. Finally, the
general conclusion shows the level of credibility of stock analysts’ predictions of
stock price movement and discusses potential avenues for future work.

Chapter 2 Background

To discuss the current state of social media mining, and of social media data analysis
(SMeDA) in particular, it is useful to look first at social media research in Hong
Kong and at machine learning and mining algorithms. As these two fields are immense
and an overview could easily grow into a work of hundreds of pages, this chapter
focuses only on those theories and algorithms that are actively cited in recent social
media data analysis (SMeDA) and data mining publications.

This chapter is organized as follows. To give a clear understanding of the general
terms and concepts, Section 2.1 defines all key terms and concepts. Section 2.2
discusses the technical background in text mining, text analysis and data mining.
Section 2.3 gives the research background on stock market movement, and Section 2.4
summarizes the background information.

2.1 General Terms and Concepts

2.1.1 Period of Investment

The investment period is a trading strategy choice and a form of financial investment
behaviour, and it helps in understanding stock market volume and investors’ psychology
and behaviour. There is no doubt that, from short-term to long-term trading, the aim
is to make a profit. Trading terms are defined differently according to the time
period.

a) Short term
A short-term investment generally lasts from a few days to a month. The
potential net profit and the investment risk can fluctuate as sharply as a
roller coaster, so short-term investment can make money faster but can also
lose money quickly (The ASPIRA Association, 2017).

b) Medium to long term
Medium to long-term investment offers a high degree of stability: investors
aim to earn regular interest or wait for a potential price increase to gain
profit over a long period. It also lets investors build savings and grow their
funds (The ASPIRA Association, 2017); the period of a medium to long-term
investment is therefore generally measured in years.

Needing less control is another advantage of long-term investment. Even if the
price goes down, it tends to climb again over the business cycle, so
high-growth profit is not the main consideration for medium to long-term
investment.

2.1.2 Hang Seng Index (HSI)

One of the earliest stock indexes in Hong Kong is the Hang Seng Index (HSI), which has
been the most widely quoted indicator of the performance of the Hong Kong stock market
since 24 November 1969. Investopedia states that an index is a statistical measure
that reflects change in an economy or a securities market, representing trading
(buying and selling) over a period of time. The HSI has no fundamental analysis of its
own, such as an organization profile or industry relevance, because it is an economic
index and reference indicator for Hong Kong rather than an actual listed company.
Instead, economic, livelihood and corporate information and news releases all have
some impact on the stability of the Hang Seng Index through its fixed number of 50
constituent stocks, the blue chips (Table 1).

These blue chips influence the value of the Hang Seng Index (HSI) according to their
weightings. Hang Seng Bank reviews the performance of the Hang Seng Index constituent
stocks quarterly and decides whether to exclude risky stocks or include promising ones
(Hang Seng Indexes Company Limited, 2017). For example, according to the index review
results released by Hang Seng Indexes Company Limited in October 2017, Cathay Pacific
(0293) and Kunlun Energy (0135) were removed from the blue-chip list and replaced in
the HSI constituents by Sunny Optical (2382) and Country Garden (2007). Such decisions
are based on the expected future development of the stocks and affect the Hang Seng
Index (HSI) through their weightings.

The Hang Seng Index (HSI) is an important research subject because Hong Kong has
achieved a comprehensive, credible and impartial financial reputation in Asia, which
is a strong positive for a financial and economic market (Investor Education Centre,
2014). As a microcosm of the economy, the Hang Seng Index (HSI) is an economic
benchmark that estimates and reflects the overall health and performance of the market
in Hong Kong, and a basis for measuring investment returns from the stock market
(Investor Education Centre, 2014).

Table 1. 2017 Hang Seng Index Constituents.

Stock Code | ISIN Code | Company Name | Industry Classification | Share Type | Weighting (%)
700 | KYG875721634 | Tencent | Information Technology | Other HK-listed Mainland Co. | 10.75
5 | GB0005405286 | HSBC Holdings | Financials | HK Ordinary | 10.01
939 | CNE1000002H1 | CCB | Financials | H Share | 8.44
1299 | HK0000069689 | AIA | Financials | HK Ordinary | 7.94
941 | HK0941009539 | China Mobile | Telecommunications | Red Chip | 5.39
1398 | CNE1000003G1 | ICBC | Financials | H Share | 5.12
2318 | CNE1000003X6 | Ping An | Financials | H Share | 4.00
3988 | CNE1000001Z5 | Bank of China | Financials | H Share | 3.46
1 | KYG217651051 | CKH Holdings | Conglomerates | HK Ordinary | 3.00
388 | HK0388045442 | HKEx | Financials | HK Ordinary | 2.85
2628 | CNE1000002L3 | China Life | Financials | H Share | 2.15
883 | HK0883013259 | CNOOC | Energy | Red Chip | 2.13
1113 | KYG2177B1014 | CK Asset | Properties & Construction | HK Ordinary | 1.87
16 | HK0016000132 | SHK Ppt | Properties & Construction | HK Ordinary | 1.86
2 | HK0002007356 | CLP Holdings | Utilities | HK Ordinary | 1.68
386 | CNE1000002Q2 | Sinopec Corp | Energy | H Share | 1.64
823 | HK0823032773 | Link REIT | Properties & Construction | HK Ordinary | 1.63
11 | HK0011000095 | Hang Seng Bank | Financials | HK Ordinary | 1.58
2388 | HK2388011192 | BOC Hong Kong | Financials | HK Ordinary | 1.54
175 | KYG3777B1032 | Geely Auto | Consumer Goods | Other HK-listed Mainland Co. | 1.45
27 | HK0027032686 | Galaxy Ent | Consumer Services | HK Ordinary | 1.40
3 | HK0003000038 | HK & China Gas | Utilities | HK Ordinary | 1.39
857 | CNE1000003W8 | PetroChina | Energy | H Share | 1.20
2018 | KYG2953R1149 | AAC Tech | Information Technology | Other HK-listed Mainland Co. | 1.18
688 | HK0688002218 | China Overseas | Properties & Construction | Red Chip | 1.09
6 | HK0006000050 | Power Assets | Utilities | HK Ordinary | 1.05
1928 | KYG7800X1079 | Sands China Ltd | Consumer Services | HK Ordinary | 1.00
4 | HK0004000045 | Wharf Holdings | Properties & Construction | HK Ordinary | 0.96
66 | HK0066009694 | MTR Corporation | Consumer Services | HK Ordinary | 0.90
288 | KYG960071028 | WH Group | Consumer Goods | HK Ordinary | 0.84
17 | HK0017000149 | New World Dev | Properties & Construction | HK Ordinary | 0.77
267 | HK0267001375 | CITIC | Conglomerates | Red Chip | 0.74
762 | HK0000049939 | China Unicom | Telecommunications | Red Chip | 0.74
1109 | KYG2108Y1052 | China Res Land | Properties & Construction | Red Chip | 0.72
1088 | CNE1000002R0 | China Shenhua | Energy | H Share | 0.71
12 | HK0012000102 | Henderson Land | Properties & Construction | HK Ordinary | 0.68
2319 | KYG210961051 | Mengniu Dairy | Consumer Goods | Red Chip | 0.67
1044 | KYG4402L1510 | Hengan Int'l | Consumer Goods | Other HK-listed Mainland Co. | 0.62
3328 | CNE100000205 | Bankcomm | Financials | H Share | 0.58
23 | HK0023000190 | Bank of E Asia | Financials | HK Ordinary | 0.57
1038 | BMG2178K1009 | CKI Holdings | Utilities | HK Ordinary | 0.50
83 | HK0083000502 | Sino Land | Properties & Construction | HK Ordinary | 0.48
151 | KYG9431R1039 | Want Want China | Consumer Goods | Other HK-listed Mainland Co. | 0.45
19 | HK0019000162 | Swire Pacific A | Conglomerates | HK Ordinary | 0.43
101 | HK0101000591 | Hang Lung Ppt | Properties & Construction | HK Ordinary | 0.41
992 | HK0992009065 | Lenovo Group | Information Technology | Red Chip | 0.37
144 | HK0144000764 | China Mer Port | Industrials | Red Chip | 0.34
836 | HK0836012952 | China Res Power | Utilities | Red Chip | 0.32
135 | BMG5320C1082 | Kunlun Energy | Energy | Red Chip | 0.26
293 | HK0293001514 | Cathay Pac Air | Consumer Services | HK Ordinary | 0.12
Total | | | | | 100
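
To make the effect of the weightings in Table 1 concrete, the sketch below estimates
how much a single constituent's price move contributes to the index return. It is a
first-order approximation that assumes the published weighting applies at the time of
the move and ignores capping and free-float adjustments; the function name
`index_contribution` and the 2% move are assumptions chosen for illustration.

```python
# Weightings (%) for three constituents, copied from Table 1.
weights = {"Tencent": 10.75, "HSBC Holdings": 10.01, "Cathay Pac Air": 0.12}


def index_contribution(weight_pct: float, stock_return_pct: float) -> float:
    """Approximate contribution, in percentage points, of one stock's return to the index return."""
    return weight_pct / 100.0 * stock_return_pct


# Compare the effect of the same hypothetical 2% move in a heavy and a light constituent.
for name, weight in weights.items():
    contribution = index_contribution(weight, 2.0)
    print(f"{name}: a 2% move contributes about {contribution:.4f} percentage points")
```

Under this approximation the same 2% move in Tencent shifts the index roughly ninety
times more than it does in Cathay Pacific, which is why news about the heaviest
constituents matters most for the HSI.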

Determinant factors of HSI performance


Apart from the released information described above, the S&P 500 (SP500) is a major
related stock market that positively affects Hang Seng Index (HSI) performance,
because stock markets are integrated across financial markets (Garefalakis, Dimitras,
Koemtzopoulos, & Spinthiropoulos, 2011); the correlation coefficient between the two
is shown in Appendix I. To some extent, therefore, international news, such as the UK
Brexit referendum in 2016 or the U.S. presidential election in 2016, also affects
global stock price trends.
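
A correlation coefficient between two indexes is usually computed on their daily
returns rather than on raw price levels. The sketch below shows one way to do this
with the Pearson coefficient; the closing prices are made-up numbers for illustration
only and are not the data behind Appendix I, and the helper names are assumptions.

```python
from math import sqrt


def pct_returns(prices):
    """Day-over-day simple returns of a price series."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Made-up closing prices for illustration only.
sp500 = [2700.0, 2710.0, 2695.0, 2720.0, 2735.0]
hsi = [30000.0, 30150.0, 29900.0, 30300.0, 30450.0]

r = pearson(pct_returns(sp500), pct_returns(hsi))
print(f"correlation of daily returns: {r:.3f}")
```

Because the two markets trade in different time zones, a real comparison would also
need to decide whether to align same-day or previous-day S&P 500 returns with the HSI.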

Other factors also influence stock market performance. Index indicators such as the
Relative Strength Index (RSI), Moving Average Convergence/Divergence (MACD) and
Exponential Moving Average (EMA), as well as the daily index price, daily trading
volume, international currency exchange rates, HKEX policy and strategy, and even the
current year's hot spot, the trading volume of investors from northern China, are all
factors that influence the performance of the HSI (Yu, Ng, Wong, Chu, & Chan, 2010).
Hence, the stock market is complex, and many different circumstances affect the
prediction of stock price movement.

Together, the above factors often create a price gap, which is described by "gap theory" in the stock market. The theory describes a stock or index price that rises or falls sharply on the chart on the next trading day, leaving a large distance from the previous day's price; traders then try to interpret and exploit the gap. This phenomenon can be caused by many factors, such as global news, a cycle of financial pressure (including an irregular RSI) or stock market arrangements, much like the determinant factors that influence the HSI described above.
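
The gap described above can be flagged with a short sketch. It assumes that a gap is measured as the relative difference between a day's opening price and the previous day's close; the function name `find_gaps`, the 1% threshold and the sample prices are illustrative assumptions, not values from this dissertation.

```python
def find_gaps(opens, closes, threshold=0.01):
    """Return (day_index, relative_change) for every day whose open
    differs from the previous close by at least `threshold` (assumed 1%)."""
    gaps = []
    for i in range(1, len(opens)):
        change = (opens[i] - closes[i - 1]) / closes[i - 1]
        if abs(change) >= threshold:
            gaps.append((i, change))
    return gaps


# Toy prices (not real HSI data): day 1 opens about 2% above the prior close.
toy_opens = [100.0, 103.0, 102.5]
toy_closes = [101.0, 102.0, 104.0]
print(find_gaps(toy_opens, toy_closes))
```

A positive change marks a gap up and a negative change a gap down; the threshold would need tuning for real index data.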

2.1.3 Social Media Mining Analysis

Social media breaks down the boundary between traditional media and the virtual world and changes how information is communicated to and from the public. Social media data are unique and call for novel data mining techniques, because they consist of user-generated content with rich social relations. Social media mining is the process of representing, analyzing and extracting hidden patterns from message data. It draws on theories and methodologies from several disciplines, including data science, computer science, data mining, machine learning and text mining. It combines different tools to represent and measure social media big data, produces meaningful patterns, and helps to bridge the gap between theoretical models and measurement tools.

Social media enables us to be connected and interact with each other anywhere and
anytime, like a blog, video, slideshow, podcast as a one-to-many communication
channel – allowing us to observe human behavior in an unprecedented scale with a
new lens. This social media lens provides us with golden opportunities to understand
individuals at scale and to mine human behavioral patterns.

In traditional data mining, historical data are analyzed by an algorithm to produce a result. However, user-generated content is also a key source of information that cannot be ignored in the analysis.

2.1.4 Data Mining

Data mining is the idea of extracting knowledge from big data by analyzing the data to discover knowledge patterns. Going beyond simple statistical analysis, data mining is the essential step of knowledge discovery in databases (KDD). The resulting patterns and behaviors are used to describe concepts, such as stock price movement behaviors and other interesting patterns. The methodologies used to analyze the data include association rules, which measure support within the data, as well as classification, clustering and time series analysis. In general, data mining benefits sales and marketing by helping to understand customer behavior; in the financial industry, it is a reference standard that programmers and analysts use to predict future stock movement. Various algorithms and techniques are used to establish the credibility and reliability of the data and to obtain accurate predictions, which give the public and stock analysts a relevant reference for understanding the current market status.

Supervised learning and unsupervised learning are the two basic categories of data mining algorithms. Supervised learning predicts the value of a class attribute. Unsupervised learning has no class attribute; it groups similar instances in the dataset and can still find and identify significant patterns. Supervised learning includes classification and regression, for example Multinomial Naïve Bayes and Linear Regression. Unsupervised learning mainly consists of clustering algorithms, which require a distance measure such as the Euclidean distance.

2.1.5 Text Mining – Text Analysis

Stock price analysis is already widely used to predict future performance; investors regularly hear stock analysts explain the development trend and outlook of the market. Stock microblogs, news and stock websites, such as aastock and Yahoo Finance, provide a simple way to obtain data from social media and the stock market. It is also easy to collect a significant proportion of stock market news and advertisements, which may be included in the data collected for text analysis (Degutls & Novickyte, 2014). The data should, of course, be filtered so that the most important sentence segments are captured in their original sequence and analyzed by topic. Those opinions and segments are good references for predicting stock movement. Text analysis is another machine learning approach that identifies, extracts and characterizes the content of a sentence or text unit; it is sometimes referred to as opinion mining. However, there has been considerable confusion among researchers and the public about the difference between text and opinion. According to Merriam-Webster's Collegiate Dictionary, "text" is defined as a message that conveys an attitude or thought prompted by feeling about a case. An opinion, in contrast, is a judgment, point of view, discussion, evaluation or statement formed about a case or matter. All such text analysis is generally called text mining. Text analysis has been applied successfully in many areas, such as political science, policy making, psychology and sociology.

During text analysis, an analysis model such as Twitter Investor Text (TIS) can compute a weighted opinion score from all the words in a sentence, which reflects the credibility and representativeness of the statement.

This research applies text analysis to social media, an approach called "Social Media Data Analysis" (SMeDA). It is an effective and comprehensive measurement that uses data mining to collect a large amount of data about professional individuals and groups in order to measure their interactions, discover patterns and understand human behavior. The methodology for formulating and estimating the text scores, and the theory behind it, are discussed in a later section.

2.1.6 Summary

This dissertation aims to forecast the future performance of the HSI with mining algorithms, but it also reflects the psychological, social-cognitive and emotional aspects of investor behavior and interest, as well as the future financial market prediction (TMFP).

Figure 1. Venn diagram describing the intersection of disciplines

This is also the reason why researchers must use advanced algorithms to predict stock prices, evaluate the risk of a financial crisis and safeguard the health of the financial market. Data mining and text mining are each able to predict future price movement, based respectively on the predicted prices and on the prediction text published by financial stock analysts.

2.2 Technical Background & Description

This subsection describes the data mining and text mining algorithms used to predict the index trend based on a combination of sequential chart patterns and behaviors.

2.2.1 Time Series Similarity Analysis

Index prices and the text scores extracted from textual data can both be represented as time series. This subsection presents several similarity measurements.

Linear Regression
The basic concept of linear regression is to estimate the functional and statistical relation between two variables, where the independent variable X is the predictor (explanatory variable) and Y is the response. A scatter diagram presents the relation between variable Y and variable X. To make a prediction, the response Y is assumed to vary with the predictor X (Figure 2). The linear regression line is one way to forecast a value: the slope and intercept are calculated and then used to produce the prediction.

$$\hat{Y}_i = b_0 + b_1 X_i \tag{2.1}$$

In Eq. 2.1, $\hat{Y}_i$ is the predicted value for observation $i$, $b_0$ is the regression intercept, $b_1$ is the regression slope and $X_i$ is the value of the predictor for observation $i$.

Figure 2. Model -linear regression
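
As a minimal sketch of Eq. 2.1, the slope and intercept can be estimated by ordinary least squares. Least squares is an assumption here, since the text does not state which estimator is used, and the function names and toy data are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of the intercept b0 and slope b1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict(b0, b1, x):
    """Eq. 2.1: predicted response for predictor value x."""
    return b0 + b1 * x


# Toy data lying exactly on y = 1 + 2x.
b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1, predict(b0, b1, 5))
```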

Euclidean distance
This measurement is named after Euclid, the famous mathematician popularly referred to as the father of geometry. It is a distance measure used in unsupervised learning that gives the length of the straight line connecting two points; here the two points are two time series of equal length. The formula is given in Eq. 2.2, where $n$ is the number of dimensions in the data, the index $t$ runs from 1 to $n$, and $s$ and $q$ are the two points in $n$-dimensional space (Batista, Wang, & Keogh, 2012).

$$d(s, q) = \sqrt{\sum_{t=1}^{n} (s_t - q_t)^2} \tag{2.2}$$

Figure 3. alignment of values in euclidean distance
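
A minimal sketch of Eq. 2.2 for two equal-length series follows; the function name and the sample values are illustrative assumptions.

```python
import math


def euclidean_distance(s, q):
    """Euclidean distance between two time series of equal length."""
    if len(s) != len(q):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```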

Pearson’s correlation coefficient


Pearson's correlation coefficient $r$ (PCC), also called the "Pearson product-moment correlation", is available in modern software packages for data presentation and curve fitting. It is the covariance of two numerical vectors divided by the product of their standard deviations, where $X$ and $Y$ are the two vectors in Eq. 2.3 and $\bar{X}$ and $\bar{Y}$ are their means.

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \tag{2.3}$$
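
The coefficient in Eq. 2.3 can be sketched directly from its definition, with the means used in both the numerator and the denominator. The function name and the toy vectors are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

The function divides by zero when either vector is constant, so a real pipeline would need to guard against flat series.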

Short Time Series (STS) distance
Treating each time series as a piecewise linear function, Möller-Levet et al. compare two time series by summing the squared differences between their slopes. Mathematically, the STS distance between two time series $x_a$ and $y_b$ is defined as

$$d_{STS} = \sqrt{\sum_{k=1}^{P} \left( \frac{y_{b(k+1)} - y_{bk}}{t_{(k+1)} - t_k} - \frac{x_{a(k+1)} - x_{ak}}{t_{(k+1)} - t_k} \right)^2} \tag{2.4}$$

In Eq. 2.4, $t_k$ is the time point for data points $x_{ak}$ and $y_{bk}$. The series are z-standardized to remove the effect of scale.
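
A minimal sketch of the STS distance, assuming that each slope is taken between consecutive time points. The z-standardization step is left out, and the function name and sample series are illustrative assumptions.

```python
import math


def sts_distance(x, y, t):
    """Square root of the summed squared slope differences of two series
    sampled at the same time points t."""
    total = 0.0
    for k in range(len(t) - 1):
        dt = t[k + 1] - t[k]
        slope_x = (x[k + 1] - x[k]) / dt
        slope_y = (y[k + 1] - y[k]) / dt
        total += (slope_y - slope_x) ** 2
    return math.sqrt(total)


# Two series with constant slopes 1 and 2 over three time points.
print(sts_distance([0.0, 1.0, 2.0], [0.0, 2.0, 4.0], [0.0, 1.0, 2.0]))
```

Because only slopes are compared, two series that differ by a constant offset have a distance of zero.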

2.2.2 Learning Algorithms

Most machine learning algorithms used in data mining and text mining are either supervised or unsupervised. Supervised algorithms iteratively make predictions on the training data and are corrected until they achieve an acceptable level of performance, while unsupervised algorithms model the underlying structure or distribution of the data. For example, K-means, GHSOM and LDA are unsupervised machine learning algorithms, while SVM and SOFNN are supervised machine learning algorithms.

K-means
K-means is a basic partitional clustering technique in which each cluster is associated with a center point (centroid) and each point is assigned to the cluster with the closest centroid. Closeness can be measured by the Euclidean distance or by the other similarity measures given in subsection 2.2.1.

The algorithm minimizes the distance between each point and the centroid of the cluster to which it is assigned.
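
The assign-and-update loop of K-means can be sketched as below, assuming Euclidean distance and caller-supplied starting centroids. The function names, the iteration cap and the toy points are illustrative assumptions.

```python
import math


def assign(points, centroids):
    """Index of the closest centroid for every point (Euclidean distance)."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until centroids stop moving."""
    centroids = [tuple(float(v) for v in c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


toy_points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
print(kmeans(toy_points, [(0, 0), (10, 10)]))
```

The result depends on the starting centroids, so practical implementations usually run several random initializations.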

Hybrid Kohonen Self-Organizing Map
This is a type of Self-Organizing Map (SOM), an unsupervised, non-parametric neural network that feeds patterns into a hidden layer of neurons to forecast HSI performance. Each neuron, the basic unit, is assigned a weight. For each input, the neuron closest to the input becomes the winning neuron; the map can therefore identify the pattern space as clusters of neurons in the layer and predict the stock price objectively (Sap & Mohebi, 2017).

2.2.3 Text Processing and Understanding

Many feature extraction techniques from text and data mining can be applied to the collected data. In machine learning terminology, features are unique attributes, which are usually either numeric values or categorical values. Categorical features can be encoded as binary features, one for each category in the list, using a process called one-hot encoding. The selected and extracted features are fed into machine learning algorithms to learn patterns, which are then applied to new data points to gain insights. Each algorithm takes numeric vectors as input and is usually a mathematical optimization that minimizes loss and error. Feature extraction is therefore the technique of transforming textual data and extracting numeric features from it. Many algorithms exist to estimate the weight of each word in a text and to process the data into features.

Sentence Tokenization
Sentence tokenization is the process of splitting a text corpus into sentences, which act as the first level of tokens; later steps reduce different word forms into tokens. The related methods include the Porter stemming algorithm, Part-of-Speech (POS) tagging, chunking and chinking.

Porter Stemming Algorithm


An article usually contains many tenses and derived words. For instance, "increase" has both a noun form and a verb form, an adjective form, "increasing", and an adverb form, "increasingly". These words carry the same meaning. To reduce the feature dimension and unify the output, such words should be reduced to the same root, "increase", while the actual words are kept in the underlying data to guarantee data integrity.

The Porter stemming algorithm is a rule-based stemmer that reduces word forms to a common stem, while lemmatization aims to remove only inflectional endings and to return the base or dictionary form of a word, known as the lemma. For example, the word 'decrease' can be recognized as 'drop' or 'flow' by lemmatizing algorithms but not by stemming algorithms (Ingason, Helgadóttir, Loftsson, & Rögnvaldsson, 2008).

Textual Formulation and Notification


To highlight the importance of a word within a sentence or article, Term Frequency-Inverse Document Frequency (TF-IDF) is a feature weighting approach from information retrieval and Natural Language Processing (NLP) that points out the important words in a document. The tf-idf scores are computed across the different documents in the collection. The more frequently a word appears in a document, the more informative it is and the stronger its relation to that document, while the inverse document frequency factor lowers the weight of words that appear in many documents. This process is given in Eq. 2.5.

$$W(t, d) = \frac{sf(t, d) \times \log\left(\frac{N}{n_t} + 0.01\right)}{\sqrt{\sum_{t \in d} \left[ sf(t, d) \times \log\left(\frac{N}{n_t} + 0.01\right) \right]^2}} \tag{2.5}$$

In Eq. 2.5, $W(t, d)$ is the weighted score of term $t$ in document $d$, $sf(t, d)$ is the frequency of term $t$ in the stored social media document $d$, $N$ is the total number of documents sampled for analysis and $n_t$ is the number of documents that contain term $t$. The performance of this model is evaluated with cross-validation. The implementation built into scikit-learn is a general and standard way to apply it.
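
The weighting in Eq. 2.5 can be sketched in plain Python. The sketch reads the denominator as a cosine normalization (square root of the sum of squared weights) and uses the natural logarithm; both are assumptions, as are the function name and the toy documents. It is not the scikit-learn implementation, whose formula differs in detail.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """documents: list of token lists. Returns one {term: weight} dict per
    document, normalized so the squared weights of a document sum to 1."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weighted = []
    for tokens in documents:
        counts = Counter(tokens)
        raw = {
            term: count * math.log(n_docs / doc_freq[term] + 0.01)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weighted.append({t: (v / norm if norm else 0.0) for t, v in raw.items()})
    return weighted


toy_docs = [["hsi", "rise", "rise"], ["hsi", "fall"]]
for weights in tfidf_weights(toy_docs):
    print(weights)
```

In the toy corpus "hsi" appears in every document and therefore receives a weight close to zero, while "rise" and "fall" dominate their documents.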

Term frequency, the number of times a word occurs, plays an important role in weighting. However, Term Presence (TP), which is represented as a Boolean value in a vector (True, or a score of 1, when the term occurs), yields better performance than term frequency in text analysis (Pang, Lee, & Vaithyanathan, 2002).

Part-Of-Speech (POS) Tagging


POS tagging is word-category disambiguation: each word in a text is marked up with its part of speech in order to reduce ambiguity, for example before lemmatization. Classifying and labeling words with POS tags annotates the words and depicts their parts of speech, which is helpful when a programmer later uses the same annotated text in NLP-based applications, because the words can be filtered by particular parts of speech and that information can be used in the analysis.

Negation
Negation must be considered when processing a sentence. Generally, the word "not" is a simple example of a word that inverts meaning. Researchers append the tag "NOT" to the term that follows "no" or "do not" to solve the negation problem. For instance, in "I do not expect HSI to be raised." the term extracted will be "expect_NOT" instead of "expect" (Das & Chen, 2007).
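
A minimal sketch of this negation tagging follows. It assumes that only the first token after a negation word is tagged, as in the example above; the negation word list, the tokenizer and the function name are illustrative assumptions rather than the exact rule of Das and Chen.

```python
import re

NEGATION_WORDS = {"not", "no", "never"}  # assumed cue list


def mark_negation(sentence):
    """Lower-case and tokenize the sentence, then append "_NOT" to the
    token that directly follows a negation word."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    marked = []
    negate_next = False
    for token in tokens:
        if token in NEGATION_WORDS:
            negate_next = True
            marked.append(token)
        elif negate_next:
            marked.append(token + "_NOT")
            negate_next = False
        else:
            marked.append(token)
    return marked


print(mark_negation("I do not expect HSI to be raised."))
```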

2.2.4 Classification algorithms

Bag-of-Words Model
It is a classic model used in NLP that represents a document as a term frequency vector. Although syntax and term order are lost, the main message is retained in the term frequency vector.
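
The sketch below builds term frequency vectors for a small corpus, together with the Boolean term presence vectors mentioned earlier. The function name and the toy documents are illustrative assumptions.

```python
from collections import Counter


def bag_of_words(documents):
    """documents: list of token lists. Returns the sorted vocabulary, the
    term frequency vectors and the 0/1 term presence vectors."""
    vocabulary = sorted({token for tokens in documents for token in tokens})
    frequency = []
    for tokens in documents:
        counts = Counter(tokens)
        frequency.append([counts[term] for term in vocabulary])
    presence = [[1 if count else 0 for count in row] for row in frequency]
    return vocabulary, frequency, presence


vocab, tf, tp = bag_of_words([["hsi", "rise", "rise"], ["hsi", "fall"]])
print(vocab)
print(tf)
print(tp)
```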

Multinomial Naïve Bayesian


It is an explicit model, a variant of Naïve Bayes, that works with word counts and adjusts the underlying calculations accordingly. It estimates the conditional probability of a word given a class as the relative frequency of term $t$ in documents belonging to that class. This variant considers the number of occurrences of term $t$ in the training documents from class $c$, including multiple occurrences. Naïve Bayes is a popular supervised learning algorithm for prediction and classification tasks; here it classifies text as either positive or negative based on TF-IDF features. "Naïve" refers to the assumption that every feature is independent of the others. The features are written as a feature vector $\{x_1, x_2, \dots, x_n\}$ and $y$ is the class variable. Bayes' theorem gives the probability of the occurrence of $y$, as presented in Eq. 2.6.

$$P(y \mid x_1, x_2, \dots, x_n) = \frac{P(y) \times P(x_1, x_2, \dots, x_n \mid y)}{P(x_1, x_2, \dots, x_n)} \tag{2.6}$$

The naïve assumption is that $P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)$ for every $i$ from 1 to $n$. In simple terms, this can be written as

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$

where the evidence $P(x_1, x_2, \dots, x_n)$ is constant for a given input.

In this model, the features are conditionally independent of each other given the class variable that is to be predicted. With $Z = p(x)$, a scaling constant that depends only on the features, the posterior can be represented as:

$$P(y \mid x_1, x_2, \dots, x_n) = \frac{1}{Z} P(y) \prod_{i=1}^{n} P(x_i \mid y) \tag{2.7}$$

The naïve Bayes classifier combines Eq. 2.7 with the maximum a posteriori (MAP) decision rule. The classifier assigns the predicted class label $\hat{y}$ as

$$\hat{y} = \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k) \tag{2.8}$$

The classifier in Eq. 2.8 is used for classification tasks such as spam filtering and multi-class document classification, and can be applied in many cases. It is fast compared with other classifiers and works well even when little training data is available. The reason is that Naïve Bayes decouples the class-conditional feature distributions, so that each one can be estimated independently as a one-dimensional distribution.
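
A minimal sketch of a multinomial Naïve Bayes classifier with the MAP decision rule follows. It works in log space and uses add-one (Laplace) smoothing, which is an assumption not stated in the text; the function names, class labels and toy training documents are also illustrative.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(term_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + alpha) / total) for t in vocabulary
        }
    return log_prior, log_likelihood


def predict_class(tokens, log_prior, log_likelihood):
    """MAP rule: pick the class with the highest log posterior score.
    Tokens outside the training vocabulary are ignored."""
    scores = {}
    for c in log_prior:
        known = [t for t in tokens if t in log_likelihood[c]]
        scores[c] = log_prior[c] + sum(log_likelihood[c][t] for t in known)
    return max(scores, key=scores.get)


toy_docs = [["hsi", "rise", "rally"], ["index", "rise", "gain"],
            ["hsi", "fall", "drop"], ["index", "fall", "loss"]]
toy_labels = ["up", "up", "down", "down"]
prior, likelihood = train_multinomial_nb(toy_docs, toy_labels)
print(predict_class(["hsi", "rise"], prior, likelihood))
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long documents.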

Support Vector Machines (SVMs)


It is an "off-the-shelf" supervised learning algorithm that takes the training data and outputs a hyperplane for classification and regression, leaving the maximum margin from both classes. It can also find natural clusterings of data groups and map new data to those groups. The equation of the hyperplane for classification is listed below:

$$g(\vec{x}) = \vec{\omega}^{T} \vec{x} + \omega_0 \tag{2.9}$$

In this equation, if $g(\vec{x})$ is greater than or equal to 1, the point is classified as class 1; if $g(\vec{x})$ is less than or equal to $-1$, it is classified as class 2.

The total margin is computed from Eq. 2.10; minimizing $\lVert \vec{\omega} \rVert$ maximizes the separability.

$$\frac{1}{\lVert \vec{\omega} \rVert} = \vec{\omega}^{T} \vec{x} + \omega_0 \tag{2.10}$$

In Eq. 2.10, $\vec{\omega}$ is the value to be minimized when calculating the classification score. The minimization is solved with the Karush-Kuhn-Tucker (KKT) conditions, using Lagrange multipliers $\lambda_i$. Eqs. 2.11 and 2.12 follow from Eq. 2.10 and give the minimizing value:

$$\vec{\omega} = \sum_{i=0}^{N} \lambda_i y_i \vec{x}_i \tag{2.11}$$

$$\sum_{i=0}^{N} \lambda_i y_i = 0 \tag{2.12}$$

The same formulation is used by the SVM for regression, known as Support Vector Regression.
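
The decision rule and the weight vector expansion above can be illustrated with a short sketch. It does not train an SVM: the multipliers are supplied by hand for a two-point toy problem, and the function names and values are illustrative assumptions.

```python
def weight_vector(multipliers, labels, vectors):
    """Weight vector as the sum of multiplier * label * support vector."""
    dim = len(vectors[0])
    return [
        sum(lam * y * x[d] for lam, y, x in zip(multipliers, labels, vectors))
        for d in range(dim)
    ]


def decision_value(w, w0, x):
    """g(x) = w^T x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(w, w0, x):
    """Class 1 if g(x) >= 1, class 2 if g(x) <= -1, None inside the margin."""
    g = decision_value(w, w0, x)
    if g >= 1:
        return 1
    if g <= -1:
        return 2
    return None


# Toy problem: support vectors (1, 1) with label +1 and (-1, -1) with label -1.
# The multipliers 0.25 and 0.25 satisfy sum(lambda_i * y_i) = 0.
w = weight_vector([0.25, 0.25], [1, -1], [(1, 1), (-1, -1)])
print(w, classify(w, 0.0, (2, 2)), classify(w, 0.0, (-3, 0)))
```

In practice the multipliers come from solving the constrained optimization problem, for which an established library would normally be used.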

Random Forest Classification


Random forest is a popular classification method for text data analysis that categorizes text documents by building a set of classification trees; the algorithm is simple and delivers prominent classification performance on high-dimensional data. Text data contributes a large number of terms as features during the forest procedure. The prediction comes with a confidence level that measures how much the predictions from the different trees agree. For further analysis, the observations with the highest variability of prediction can be identified to understand the behaviour of the text.

eXtreme Gradient Boosting Mining (XGBM)


XGBoost stands for eXtreme Gradient Boosting. It is a supervised learning algorithm, which means that it uses a mathematical model to make predictions on text. Its main feature is an implementation of gradient boosted decision trees designed for high execution speed and high model performance. Because of its efficiency and accuracy, XGBoost is widely used in Kaggle competitions, where it achieves high rankings; it supports parallel computation and can run on a cluster, and it is available on GitHub.

2.2.5 Text Summarization

Most people who do not enjoy reading have short attention spans when reading a long article or a large document; they get bored and miss the important messages. Therefore, text summarization is an extremely important concept that helps readers extract the key themes of the information.

Latent Dirichlet Allocation (LDA)


In topic modeling, Latent Dirichlet Allocation (LDA) is a generative probabilistic model similar to the probabilistic latent semantic indexing model, except that the latent topics have a Dirichlet prior over them.

Figure 4. LDA plate notation (nodes α, θ, Z, W, β, φ; plates K, N, M)

Figure 4 shows the plate notation for the LDA model, where $K$ is the number of topics; $N$ is the number of words in a document; $M$ is the number of documents; $\alpha$ is the Dirichlet-prior concentration parameter of the per-document topic distribution; $\beta$ is the same parameter of the per-topic word distribution; $\varphi(k)$ is the word distribution for topic $k$; $\theta(i)$ is the topic distribution for document $i$; $Z(i,j)$ is the topic assignment for $W(i,j)$; and $W(i,j)$ is the $j$-th word in the $i$-th document. $\varphi$ and $\theta$ follow Dirichlet distributions, while $Z$ and $W$ follow multinomial distributions.

2.2.6 Text Analysis

Dictionary
The dictionary-based method is a simple technique that generates and stores a word list for text analysis. New synonyms and antonyms are added to the word list, and each word has an associated credibility level. However, some mood words within specific domains are difficult to find, which is a major weakness of the dictionary-based method and limits its accuracy.

Profile of Mood States (POMS)


It is a psychological rating scale that estimates mood states quickly owing to the simplicity of the test. Bollen et al. stated that using public mood to predict stock prices achieved an accuracy of 86.7%. This text analysis estimation can be applied to online or written text. The process is as follows:

1) Score each investor comment using the POMS function given in Eq. 2.13. Each case $c$ is represented by its set of terms $w$. The POMS emotion adjectives for mood dimension $i$ are denoted $p_i$.

$$P(c) \to m \in \mathbb{R}^{6} = [\lVert w \cap p_1 \rVert, \lVert w \cap p_2 \rVert, \dots, \lVert w \cap p_6 \rVert] \tag{2.13}$$

2) Normalize the emotion vector.

$$\hat{m} = \frac{m}{\lVert m \rVert} \tag{2.14}$$

3) Aggregate the emotion vectors by date, where $T_d$ is the set of cases on date $d$ and $m_d$ is the mood vector for that date; $\theta_{m_d}[i, k]$ represents the mood over a period of $k$ days.

$$m_d = \frac{\sum_{\forall c \in T_d} \hat{m}}{\lVert T_d \rVert} \tag{2.15}$$

$$\theta_{m_d}[i, k] = [m_i, m_{i+1}, \dots, m_{i+k}] \tag{2.16}$$

4) Normalize the mood vectors with z-scores, using the mean $\bar{x}$ and standard deviation $\sigma$ of the window.

$$\tilde{m}_i = \frac{\hat{m}_i - \bar{x}(\theta[i, \pm k])}{\sigma(\theta[i, \pm k])} \tag{2.17}$$

$$\tilde{\theta}_{m_d}[i, k] = [\tilde{m}_i, \tilde{m}_{i+1}, \dots, \tilde{m}_{i+k}] \tag{2.18}$$

Lydia
Lydia creates time series that count the emotion values of text, including the positive and negative words that appear in relation to the parties mentioned.

$$\text{Polarity} = \frac{p - n}{p + n} \tag{2.19}$$

$$\text{Subjectivity} = \frac{p + n}{N} \tag{2.20}$$

Eqs. 2.19 and 2.20 calculate the text emotion, where $p$ and $n$ are the positive and negative values and $N$ is the total number of emotion values.
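
The two ratios can be sketched as small functions. Returning 0.0 when a denominator is zero is an assumption made here for robustness, and the counts in the example are toy values.

```python
def polarity(positive, negative):
    """(p - n) / (p + n); 0.0 when no emotion words are found (assumed)."""
    total = positive + negative
    return (positive - negative) / total if total else 0.0


def subjectivity(positive, negative, total_values):
    """(p + n) / N; 0.0 when N is zero (assumed)."""
    return (positive + negative) / total_values if total_values else 0.0


# Toy counts: 6 positive and 2 negative values out of 16 in total.
print(polarity(6, 2), subjectivity(6, 2, 16))
```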

2.2.7 Performance Evaluation

Cross-validation (statistics)
Cross-validation, also called rotation estimation, is a statistical model validation technique that estimates how well the result of a learning algorithm generalizes. Its goal is to estimate how accurately a predictive model will perform. It generally splits the data set into K subsamples and runs K iterations, in each of which one subsample serves as the testing data and the rest as the training data. The measurement reported by K-fold cross-validation is the average of the values computed in the loop.
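
A minimal sketch of the K-fold loop, assuming contiguous folds without shuffling. The function names and the `evaluate` callback, which stands in for training and scoring a model, are illustrative assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


for train_idx, test_idx in k_fold_indices(5, 2):
    print(train_idx, test_idx)
```

For time-ordered index data, contiguous folds still allow later observations to appear in the training set of an earlier fold, so a forward-chaining split may be more appropriate.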

2.3 Movement Performance Index (MPI) Research

Technical indicators are needed to analyze stock price movements. Several studies have combined soft computing technology with technical analysis in stock analysis and obtained a high prediction rate. Many indicator parameters show the strength of price movement performance; one of them is the Relative Strength Index (RSI). It is an important indicator for understanding resistance and upside potential. The RSI calculations are shown in Appendix II and are divided into 10-day, 20-day, 50-day and other trend time spans. RSI also helps people to understand the short-term and long-term movement of the HSI.

Evolutionary Algorithms (EA) can be used to find ideal parameters for technical indicators. The Moving Average Convergence and Divergence (MACD) indicator displays the relationship between moving average prices and is used together with the Relative Strength Index (RSI) to generate buy and sell signals for investors' reference (Li, Li, & Wang, 2016). The aim of this method is to maximize the yield and to minimize the transaction cost and trend risk.
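
A minimal sketch of the EMA and MACD calculations follows. The 12/26/9 periods are the conventional defaults and are an assumption, since the text gives no parameters; seeding the EMA with the first value is also an assumption, and the function names and toy series are illustrative.

```python
def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1),
    seeded with the first value (assumed)."""
    alpha = 2.0 / (span + 1)
    smoothed = [values[0]]
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


def macd(closes, fast=12, slow=26, signal=9):
    """Return the MACD line (fast EMA minus slow EMA), its signal line
    and the histogram (MACD line minus signal line)."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


# Toy, steadily rising series (not real HSI data).
toy_closes = [float(i) for i in range(1, 61)]
line, sig, hist = macd(toy_closes)
print(line[-1], sig[-1], hist[-1])
```

A MACD line crossing above its signal line is commonly read as a buy signal and a cross below as a sell signal.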

2.4 Summary

Although many analysts believe that RSI is a good reference for predicting stock movement and share it with the public, investors still need time to understand, analyze and digest the information. In this dissertation, the social media reports of analysts' prediction comments are the main source for social media analysis, which includes both data analysis and text analysis.

The bag-of-words model, multinomial Naïve Bayes and Support Vector Machines (SVM) are used in social media classification tasks. The targets in the stock price forecasting models are assigned according to the future stock price movement.

To discover similarities in text and behavior patterns, Latent Dirichlet Allocation (LDA) is an estimation algorithm based on word frequency that discovers probability distributions over latent topics and can be used for topic modelling. Mahajan et al. (2008) used LDA to identify the characteristics and understand the impact of financial news events, and reported an average directional accuracy of 60%.

Learning algorithms, data mining and text mining techniques, and evaluation methods are the research methodologies presented in this chapter; they are used to obtain the experimental results in Chapter 4. This dissertation considers several financial aspects and uses specific equations for each. The next chapter presents the principles and framework of the system's design approach.

Chapter 3 Design Approach
User-generated content, consumer-generated media and professional opinion content from researchers on stock or economic weblogs, online broadcasts, streams and official web communities are the platforms that supply part of the information about the Hang Seng Index (HSI) (Zeng, Chen, Lusch, & Li, 2010).

This chapter presents the design principles and the approach to implementing and evaluating each module of the HSI financial forecasting system: (1) the description of the numerical and textual data used in the research; (2) preprocessing and munging, covering feature weighting and other calculations on index prices and indicators from social media for forecasting; and (3) the details of the methodologies for content analysis and for data and text extraction.

3.1 Methodology

Figure 5. Analysis structure and design (flowchart boxes: Social Media Collection (financial sites, news reports, radio etc.); HSI Historical Data; NLP Text Pre-processing; Data Pre-processing; Text Dictionary; Text Analysis; Data Analysis Mining; Text Score Rule; Index Movement Prediction; Result: the relationship between algorithms and HSI price)

Figure 5 shows the overall design of the analysis structure and procedures. The structure can logically be seen as three main processes for achieving the movement prediction. The two main streams, text data and numerical HSI data, can be processed at the same time and are pre-processed to filter out the data that is meaningful for analysis. At the end of both the textual and numerical data processes, the selected algorithms are applied to the processed data, which is the most essential part of evaluating the experimental results.

3.2 Data Acquisition

This dissertation analyzes data from social media. The data must be collected from diverse published social media, which here refers to channels that interact with other people by announcing or posting information to the public, such as radio, news reports and social networking sites.

Two essential types of data are collected: the historical data of the HSI, and the analysts' prediction comments from social media. All the data are stored in text files in .csv format.

The data are collected from Hong Kong financial and stock social media sites. The social media used in this dissertation are published by:
(1) Financial and stock: Aastock, Yahoo! Finance, Quamnet, Etnet etc.
(2) E-news: Appledaily, Singtao, Mingpao, Oncc, Takungpo etc.
(3) Radio from site: RTHK, Metroradio

All received information about Hang Seng Index (HSI) predictions is stored with a common set of fields (analyst or company name, position (if necessary), release date, content of prediction, prediction date, index prediction and reference link) in a common database with a uniform information structure (Dascalaki, Droustsa, Gaglia, Kontoyiannidis, & Balaras, 2010). Along the same lines, members of the public who organize, monitor and analyze the HSI price and the opinions of analysts and investors could use this analysis and prediction as a reference for understanding prediction accuracy and credibility.
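
One way to hold the fields listed above is a small record type written out as CSV. This is a sketch only: the class name, the field names and the empty placeholder values are hypothetical and do not describe the actual file layout used in this dissertation.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class PredictionRecord:
    # Hypothetical field names mirroring the fields listed in the text.
    analyst_or_company: str
    position: str
    release_date: str
    content: str
    prediction_date: str
    index_prediction: float
    reference_link: str


def to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    names = [f.name for f in fields(PredictionRecord)]
    writer = csv.DictWriter(buffer, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Fields that are unknown here are left empty rather than invented.
sample = PredictionRecord("Morgan Stanley", "", "", "HSI target raised", "", 29000.0, "")
print(to_csv([sample]))
```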

For the experiment, over 1,000 financial and stock articles or reports containing HSI comments and predicted prices were collected from social media in Hong Kong over a period of one year starting in January 2017 for data and text mining analysis. In addition, the historical HSI data collected cover around 248 trading days in 2017, plus updated 2018 trading records, obtained from Yahoo! Finance.

3.3 Data Preparation

As all the social media news items are filtered by manually selected keyword sentences about Hang Seng Index (HSI) predictions, some unrelated information may also be retrieved. Social media carry a large amount of stock market news, but capturing the important predictions is the direct way to carry out the data science. For example, "Morgan Stanley issued a report update the basic situation of the major Asian stock markets index, raising the target of the "basic situation" of the HSI to 29,000 points." The main information to store is "Morgan Stanley", the financial company; "HSI", which means "Hang Seng Index"; and "29,000 points", the prediction target. There are many Chinese stock-market terms used as keywords, such as "The most cattle", commonly understood as "the stock with the maximum price increase". The idea is to examine each sentence of a social media case in order to investigate the credibility of the analysts' predictions and to measure the level of HSI prediction performance.
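
The extraction of the target point from such a sentence can be sketched with a regular expression. The pattern, which looks for a number followed by the word "points", and the function name are illustrative assumptions; real reports would need a richer set of patterns.

```python
import re

# A number with optional thousands separators, followed by "points".
TARGET_PATTERN = re.compile(r"(\d{1,3}(?:,\d{3})+|\d+)\s*points")


def extract_target(sentence):
    """Return the predicted index level as an int, or None if absent."""
    match = TARGET_PATTERN.search(sentence)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


report = ('Morgan Stanley issued a report update the basic situation of the '
          'major Asian stock markets index, raising the target of the '
          '"basic situation" of the HSI to 29,000 points.')
print(extract_target(report), "HSI" in report)
```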

Following the data collection methodology, data preparation is carried out for each social media source. The forecasting analysis is implemented mainly in the Python programming language. The numeric and text data are stored in an Excel file with structured and well-defined attributes and are processed by Python for further development and analysis with the selected algorithms (Figure 6). Python has a vibrant community and a growing, evolving set of libraries for data management, analytical processing and visualization that apply to each step of the data science process. Over 52% of data scientists use Python for data science, as it is open source software and has been a trending tool since 2016 (Flugel, 2016). In addition, Python is currently a leading machine learning language, with a sharply increasing share of matching jobs in the IT industry, followed by Java, R and C++. Therefore, Python is the primary programming language used in this dissertation (Puget, 2016).

Figure 6. Job for usage machine learning statistics in 2016.

3.4 Data Pre-Processing & Munging

To deal with the different operations needed to analyze text and data, the textual and numerical data are processed and parsed into easy-to-interpret formats through steps such as data cleaning, aggregation and feature selection.

The raw Hang Seng Index (HSI) data often has to be manipulated into the correct format for the analysis. Some noisy records are removed, for example those from days when typhoon signal No. 8 or above or a black rainstorm warning was issued suddenly, since HKEX’s market system may then close early depending on the time of day (HKEX, 2017). Data cleaning is the process of dealing with missing data, garbage values and NULLs in order to keep the data consistent and under control.

One such transformation is aggregation, which generally results in data with less variability and may help with the analysis in the long term. For instance, daily index price figures may have many spurious changes. Aggregating values into weekly or monthly index price figures results in a smoother data representation.
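As a minimal sketch of aggregation, the function below averages daily closing values per ISO week using only the standard library. The closing values are made up for illustration, and the name `weekly_mean_close` is hypothetical; in a pandas-based workflow the same step is usually written with `resample`.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekly_mean_close(daily):
    """Average daily closes per ISO (year, week); `daily` maps date -> close."""
    buckets = defaultdict(list)
    for day, close in daily.items():
        iso_year, iso_week, _ = day.isocalendar()
        buckets[(iso_year, iso_week)].append(close)
    return {week: mean(values) for week, values in sorted(buckets.items())}


# Illustrative (made-up) closing values, not actual HSI data.
daily_close = {
    date(2018, 1, 2): 30500.0,
    date(2018, 1, 3): 30700.0,
    date(2018, 1, 4): 30600.0,
    date(2018, 1, 8): 31000.0,
    date(2018, 1, 9): 30800.0,
}
weekly = weekly_mean_close(daily_close)
print(weekly)
```

The five daily values collapse into two weekly values, which is the smoothing effect described above.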

Feature selection can involve removing content that is irrelevant to the HSI from a large social media report. This approach saves data processing time and speeds up the analysis procedures.

3.4.1 Bag-of-words model

Raw text from Chinese-language social media reports is translated into English to form a prediction case and entered into Excel. Every case of raw text is then converted into word vectors. The three main processes are implemented in Python, with the Porter stemming algorithm taken from the associated nltk library:

1) Use the Porter stemming algorithm to convert words to their root forms.
2) Delete the root words that appear in the stop-list.
3) Remove words with a frequency of less than 30 from every case of analyst’s prediction.
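The three steps above can be sketched as one small pipeline. This is an illustration rather than the dissertation's script: the stemmer is passed in as a function (by default it leaves tokens unchanged, so the example runs without nltk), and `build_vocabulary`, `to_vector` and the toy stop-list are hypothetical. With nltk installed, `nltk.stem.PorterStemmer().stem` could be supplied as `stem`.

```python
from collections import Counter


def build_vocabulary(cases, stop_words, stem=lambda w: w, min_count=30):
    """Stem tokens, drop stop-listed roots, keep roots seen at least min_count times."""
    stemmed_cases = []
    for case in cases:
        roots = [stem(token.lower()) for token in case.split()]
        stemmed_cases.append([r for r in roots if r not in stop_words])
    counts = Counter(root for case in stemmed_cases for root in case)
    vocab = sorted(root for root, n in counts.items() if n >= min_count)
    return vocab, stemmed_cases


def to_vector(roots, vocab):
    """Bag-of-words count vector of one case over the kept vocabulary."""
    counts = Counter(roots)
    return [counts[root] for root in vocab]


cases = [
    "the index is expected to rise",
    "the index is expected to fall",
    "index may rise today",
]
stop_words = {"the", "is", "to"}  # toy stop-list for the example
# min_count=2 instead of 30 only because the toy corpus is tiny.
vocab, stemmed = build_vocabulary(cases, stop_words, min_count=2)
vectors = [to_vector(case, vocab) for case in stemmed]
print(vocab, vectors)
```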

However, before the above analysis can be carried out, the pre-processing step of removing “Stop Words” is necessary. The pre-processing tools of the NLTK module do not themselves generate any insight from the data, but they make it possible to pull text apart and tag it, for example with Part of Speech (POS) tags.

There are two notions of stop word. In the first, a stop word is a word at which analysis stops: when it is encountered, the remaining words are left unprocessed. Words that are typically used sarcastically could be treated this way, since analysts may not want to continue analyzing a sentence whose actual meaning may be the opposite of its literal one. In the second notion, stop words are words that are simply pulled out and ignored because they are not meaningful and do not help in understanding the sentence. The following is the stop word list, which is removed from every analyst’s prediction:

{'s', 'themselves', 'their', 'wouldn', 'should', 'couldn', 'theirs', 'being', 'when',
'where', 'his', 'why', 'didn', 'has', "haven't", 'this', 'those', 'd', 'o', 'shan', 'such', 'to',
'not', 'below', 'can', 'off', 'hers', 'was', 'my', 'here', 'do', 'there', 'won', "couldn't", 'at',
"should've", 'isn', 'mightn', 'having', 'into', 'himself', 'while', 'before', 'then', 'don',
'weren', 'if', 'doesn', 'through', 'who', 'ours', 'on', 'ma', 'or', 'further', 'been', 'with',
'after', 'only', 'doing', 'him', 'does', 'you', 'yours', 'herself', 't', 'just', "won't",
'between', 'down', 'once', "doesn't", 'them', 'we', 'during', 'is', 'from', 'itself',
'because', "mightn't", 'both', 'm', 'wasn', 'up', "she's", 'any', 'the', 'more', 'ain', 'she',
'yourselves', 'which', 'these', "it's", 'did', "hadn't", 'haven', 'had', 'again', 'it', 'mustn',
'our', 'were', 'as', "you're", 'll', "weren't", 've', 're', 'no', "shouldn't", 'needn', "aren't",
'too', "wasn't", 'be', 'myself', 'so', 'few', 'over', "you'd", 'have', 'out', 'how', 'will',
'her', 'all', 'ourselves', 'until', 'an', 'they', 'same', 'are', 'he', 'own', 'hadn', 'by', 'about',
'yourself', 'in', 'than', 'other', "that'll", 'y', "hasn't", 'above', 'shouldn', 'whom',
"isn't", "you'll", "didn't", 'a', 'your', 'of', "needn't", 'that', 'aren', "don't", "wouldn't",
'most', 'me', 'am', 'what', 'i', 'and', 'each', 'but', "shan't", 'for', 'very', "mustn't",
"you've", 'its', 'nor', 'under', 'now', 'hasn', 'against', 'some'}

All English stop words are predefined by nltk in Python. Removing stop words is useful for any sort of text database: pulling out articles and other filler words can save a lot of processing time.

Part of Speech (POS) Tagging


POS tagging labels every single word in a sentence with its part of speech, which establishes the word property before the score is estimated. There are 35 part-of-speech categories (Table 2), which are stored in the tagging package.

Table 2. Part of Speech (POS) tagging list
Tag | Description | Tag | Description
CC | coordinating conjunction | PRP$ | possessive pronoun
CD | cardinal digit | RB | adverb
DT | determiner | RBR | adverb, comparative
EX | existential there (like "there is"; think of it like "there exists") | RBS | adverb, superlative
FW | foreign word | RP | particle
IN | preposition/subordinating conjunction | TO | to
JJ | adjective | UH | interjection
JJR | adjective, comparative | VB | verb, base form
JJS | adjective, superlative | VBD | verb, past tense
LS | list marker | VBG | verb, gerund/present participle
MD | modal | VBN | verb, past participle
NN | noun, singular ('desk') | VBP | verb, singular present, non-3rd person
NNS | noun, plural | VBZ | verb, 3rd person singular present
NNP | proper noun, singular | WDT | wh-determiner
NNPS | proper noun, plural | WP | wh-pronoun
PDT | predeterminer | WP$ | possessive wh-pronoun
POS | possessive ending | WRB | wh-adverb
PRP | personal pronoun | |

As a result, the tagger creates tuples of each word and its part of speech instead of writing them all out by hand. This part of the pre-processing confirms that every word carries the correct part-of-speech tag for the further analysis.

Chunking groups words into meaningful phrases. Most commonly, text is chunked into noun phrases: a noun together with the modifiers around it, forming a descriptive group of words within the sentence. This is also called “shallow parsing”. The method uses regular expressions over the POS tags to group adjacent words into a chunk. In this way the important words can be chunked and then separated from the rest of the sentence.

Modifiers
+ = match 1 or more repetitions
? = match 0 or 1 repetitions
* = match 0 or more repetitions
. = any character except a new line
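To show how these modifiers act on POS tags, the sketch below chunks noun phrases with a plain regular expression over a tag string. It is a simplified stand-in for a chunk parser, not the dissertation's code: the tags are assigned by hand for the example, and `chunk_noun_phrases` is a hypothetical helper.

```python
import re


def chunk_noun_phrases(tagged):
    """Group (word, tag) tuples into noun-phrase chunks using a regex over the tags."""
    # Encode the tag sequence as "<DT><JJ><NN>..." so that a regex can scan it.
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Optional determiner (?), any number of adjectives (*), one or more nouns (+);
    # "." lets JJ also match JJR/JJS and NN also match NNS/NNP/NNPS.
    pattern = re.compile(r"(<DT>)?(<JJ.?>)*(<NN.?.?>)+")
    chunks = []
    for match in pattern.finditer(tag_string):
        start = tag_string.count("<", 0, match.start())  # index of first token
        end = tag_string.count("<", 0, match.end())      # index after last token
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


# Hand-tagged example sentence (tags are assumptions, not tagger output).
tagged = [
    ("the", "DT"), ("short", "JJ"), ("term", "NN"), ("resistance", "NN"),
    ("is", "VBZ"), ("expected", "VBN"), ("around", "IN"),
    ("24000", "CD"), ("points", "NNS"),
]
chunks = chunk_noun_phrases(tagged)
print(chunks)
```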

3.4.2 Topic Modelling

This model extracts features from the root forms and uses mathematical structures to reduce the feature dimensionality and remove terms. Latent semantic indexing can be omitted because, in the Bag-of-words model, the TF-IDF vector already supplies the weight scores. Latent Dirichlet Allocation (LDA) is adopted to represent each document as a combination of similar topics.

The LdaModel class, which belongs to gensim’s ldamodel module, is used from the LDA library. scikit-learn provides the final implementation of the LDA-based topic model.

3.5 Social Media Analysis Methodologies

3.5.1 Analysis SMeDA

To apply a suitable text mining algorithm, it is necessary to identify and understand the previous work on content and predicted prices from social media data collection, in order to grasp how the data can be used for analysis. The analysis approaches are illustrated in this section.

3.5.1.1 Dictionary

SMeDA for predicting price movement involves defining and recognizing the “degree level of personal emotion”. A dictionary is an essential index and plays a crucial role in data science as the basis for text mining analysis: it distinguishes polarity and estimates its breadth. Establishing a stock domain dictionary yields a higher accuracy in predicting stock price movement than a daily-life dictionary.

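[Figure 7: flowchart. Text is processed word by word with NLP, unusable words are deleted, HSI sentences are extracted and emotion words are selected; polarity is then assigned by text word, stock up and down movements are counted, the score is calculated to define each word as positive or negative, and the text dictionary is produced.]

Figure 7. Development of text dictionary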

The procedure for developing the dictionary is divided into two phases. The first phase consists of removing the extracted sentences that are not related to the HSI topic, then selecting and extracting emotion words for analysis. In the second phase, every word is assigned a polarity, the total numbers of stock rises and falls are counted, and finally the score is calculated to identify and define the word expression.

\[
\mathrm{Word}(i,j) =
\begin{cases}
1 & \text{if } \mathrm{Doc}(j) \text{ includes } \mathrm{word}(i) \text{ and the next-day price after } \mathrm{Doc}(j) \text{ is up} \\
0 & \text{otherwise}
\end{cases}
\tag{3.1}
\]

\[
\mathrm{Word}(i).\mathrm{NumDocs} = \text{total number of articles that include } \mathrm{word}(i) \tag{3.2}
\]

\[
\mathrm{Word}(i).\mathrm{SentiScore} = \frac{\sum_{j=1}^{n} \mathrm{Word}(i,j)}{\mathrm{Word}(i).\mathrm{NumDocs}} \tag{3.3}
\]

In Eq. 3.1 to 3.3, n is the number of articles and the sum runs over the articles j. The resulting score lies between 0 and 1: the minimum of 0 is fully negative, while 1 is fully positive. The overall score is calculated as the average score over all words.
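A minimal sketch of Eq. 3.1 to 3.3 is given below, reading the sum in Eq. 3.3 as running over the articles. The articles and next-day movements are made up for illustration, and `senti_scores` is a hypothetical name.

```python
def senti_scores(docs, next_day_up):
    """Per word: share of the articles containing it that preceded a price rise."""
    num_docs = {}  # Word(i).NumDocs, Eq. 3.2
    up_docs = {}   # sum over articles j of Word(i, j), Eq. 3.1
    for words, went_up in zip(docs, next_day_up):
        for word in set(words):  # count each article once per word
            num_docs[word] = num_docs.get(word, 0) + 1
            up_docs[word] = up_docs.get(word, 0) + (1 if went_up else 0)
    # Eq. 3.3: SentiScore in [0, 1], where 0 is fully negative and 1 fully positive.
    return {word: up_docs[word] / num_docs[word] for word in num_docs}


# Made-up tokenised articles and whether the index rose on the following day.
docs = [["rise", "support"], ["fall", "resistance"], ["rise", "resistance"], ["rise"]]
next_day_up = [True, False, True, False]
scores = senti_scores(docs, next_day_up)
print(scores)
```

In this toy data "support" scores 1.0, "fall" scores 0.0 and "resistance" scores 0.5, matching the interpretation of the score range given above.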

3.5.1.2 Polarity and Subjectivity

Polarity expresses whether the emotion is positive, negative or neutral; subjectivity characterizes the article for analyzing its context. The equations of polarity and subjectivity in the Lydia style are listed in Eq. 2.19 and Eq. 2.20. The graphic overview of the dictionary is given in subsection 3.5.1.1, where the text words generate the score and the category, either “positive” or “negative”.

After the score is extracted, the topics modeled by LDA have no standard guidance for polarity; the document words of each topic can be filtered to the topics related to the HSI, and Eq. 3.1 to 3.3 are then used to obtain the polarity score.

3.5.1.4 Score

Given a set of social media data collected over time from the listed financial and economic media and websites, the most useful thing to discover, apart from the predicted index price in the sentences, is what is reflected in the data, in particular the different levels of metaphor. For this reason SMeDA cannot be ignored: its task is to provide effective text mining algorithms that extract the words or phrases from the media that best reflect the authors.

SMeDA records and analyzes a sentence from social media as a sequence of words. For example, “Hang Seng Index is expected to rise 3764 points in the next six months” is represented as “Hang”, “Seng”, “Index”, “is”, “expected”, “to”, “rise”, “3764”, “points”, “in”, “the”, “next”, “six”, “months”, each referring to a different word. To classify all the sentences expressed in the collected media by stock commentators or analysts, SMeDA sets up and stores a list of words that reflect a variety of different degrees; the score can therefore be analyzed by degree level.

3.5.2 Data mining

Yahoo! Finance and aastock contain the historical index prices of the Hang Seng Index (HSI) from the year 2017. The data collected covers around 248 trading days plus the 2018 index trading records, including the attributes “Open”, “High”, “Low”, “Close”, “Adj Close” and “Volume”. These attributes are considered part of the analysis. The historical numeric HSI values for each attribute are processed by general data mining algorithms, such as linear regression and moving averages, using the closing price on each trading day.

In the other part, the attribute values selected and recorded from social media stock analysts were continuous numeric values and textual values. Data transformation was applied by generalizing the data to a higher-level concept, so that all the values became discrete. The criterion for transforming the numeric values of each attribute into discrete values depends on the previous day's adjusted closing price of the Hang Seng Index (HSI).

If the adjusted closing value was greater than the open and min values for the same trading day, the numeric values of the attributes were replaced by the value “Positive (Up)”, shown as a blue signal. If the values mentioned above were less than the previous day's open, min and max values, the numeric values of the attributes were replaced by “Negative (Down)”, shown as a red signal. If the values were equal to the previous value, they were replaced by the value “Equal (-)”, shown as a grey signal (Table 3).
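The discretization can be illustrated with the sketch below. It assumes a simplified reading of the rule in which every attribute of the day is compared with the previous day's adjusted close; the values are made up, and `label` and `label_day` are hypothetical names.

```python
def label(value, reference):
    """Map a numeric value to a discrete movement label relative to a reference."""
    if value > reference:
        return "Positive"
    if value < reference:
        return "Negative"
    return "Equal"


def label_day(day, previous_adj_close):
    """Discretize every attribute of one trading day (simplified assumption:
    each attribute is compared with the previous day's adjusted close)."""
    return {name: label(value, previous_adj_close) for name, value in day.items()}


# Made-up values for one trading day, not actual HSI data.
day = {"Open": 30100.0, "Min": 29950.0, "Max": 30400.0,
       "Close": 30250.0, "Adj Close": 30000.0}
labels = label_day(day, previous_adj_close=30000.0)
print(labels)
```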

Table 3. Stock attribute description
Attribute Description Possible Value
Open Current day open price of the stock Positive, Negative, Equal
Min Current day minimum price of the stock Positive, Negative, Equal
Max Current day maximum price of the stock Positive, Negative, Equal
Close Current day close price of the stock Positive, Negative, Equal
Adj Close Current day adjustment price of the stock Positive, Negative, Equal

3.6 Feature Selection

Following the analysis structure and design of the previous sections, feature selection takes two routes:
(1) SMeDA: To deal with the potential over-fitting problem, it is necessary to reduce the amount of text in each article to the description of the main HSI content only. Otherwise a classification algorithm tends to over-fit the training data; reducing the text improves the data structure and the accuracy of the data analysis, and is also related to dimensionality reduction. Hsu et al.’s work claimed that high-level features might give a more promising result than low-level features.
(2) HSI historical data analysis: the past index price data since 2017 is used by the algorithms to make a general prediction of movement.

3.7 Feature Extraction

A feature is a unique and meaningful property of the data. Many feature-extraction techniques can be applied to identify the features in a dataset. Generally, the records of a dataset are the rows and the various features are the columns. Feature extraction starts from an initial set of measured text and numeric data and builds derived values; it is also related to dimensionality reduction (Scikit-learn developers, 2017). Sklearn is the main library used: it extracts features in a format supported by machine learning algorithms from stock information such as text and images, converting them into numerical features. The analysis loads the data and the text features from the dictionary.

3.8 Summary

The principles, implementation and methodologies of the design approach for each module have been discussed in this chapter. The most essential element is the dictionary, which is the main basis for applying text-mining algorithms, so that historical data can be combined with current articles by analysts for further data analysis to predict the stock price movement. The next chapter describes the experimental setup.

Chapter 4 Experimental Framework
Following the design approach, selections and methodologies of Chapter 3, this chapter presents the workflow of the experiments, which serves as the baseline experiment.

4.1 Basic Experiment

The results of data collection, comprising the historical stock data and the social media data, are processed first. The data analysis is implemented in Python to achieve the aims of the dissertation. The whole procedure uses Python text files (.py) that store Python scripts, i.e. lists of commands that produce the analysis results as graphics and/or tables.

Forecasting the stock market exactly a few days ahead is impossible (Cooper, 2012). The most effective approach is to predict the general direction patterns based on the data mining and text mining data. The confidence level of the experiment is 85%.

4.1.1 Features

Technical Indicators: The basic indicators of HSI performance illustrated in the next chapter, and their transforms, are adopted as features.

Machine learning algorithm: Linear regression is one of the best-known prediction algorithms; it uses historical data to project future movement.

BOW models: The textual data collected from social media is converted into vector features. Tokens that are outside the stop list and have high frequencies are retained in the feature vector for the subsequent algorithms.

4.1.2 Prediction Score and Movement

The prediction score from the social media collection is related to stock movement performance (SMP); it helps to understand the word properties that identify future HSI performance. SVM and MNB are the main algorithms required for this measurement.

The measurement algorithms above report an F1 score to measure prediction or test accuracy. The F1 score uses both the precision (p) and the recall (r) of the test set:

\[
F_\beta = (1 + \beta^2) \cdot \frac{p \cdot r}{(\beta^2 \cdot p) + r} \tag{4.1}
\]

With β = 1, the F1 score is the harmonic mean of precision and recall (Figure 8). It is built from the counts of true positives, false positives and false negatives and ranges from 0 to 1, with a higher score indicating a better prediction.

Figure 8. Structure of F1 score (Day, 2014)

The text analysis by SVM lists a ranking of negative and positive tokens (Powers, 2007). MNB, RF and Xgboost can also calculate token values. Apart from the prediction score on text, the predicted movement of these algorithms is compared with the actual HSI result to understand the accuracy of the prediction.
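As a worked illustration of Eq. 4.1, the function below computes the score from confusion counts using the standard F-beta definition, in which β is squared. The counts are made up, and `f_beta` is a hypothetical helper rather than the dissertation's code.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Made-up counts: 30 correct "up" calls, 10 false alarms, 10 missed rises.
score = f_beta(tp=30, fp=10, fn=10)
print(score)
```

With precision and recall both at 0.75, the harmonic mean is also 0.75; the score falls towards 0 as either of the two falls.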

4.2 Summary

The comprehensive experiment workflow provides a clear guideline for the programming experiment. This chapter follows on from Chapters 2 and 3 in order to forecast the HSI stock price movement.

Chapter 5 Analysis Result and Evaluation

5.1 Basic Form

In Figure 9, the basic HSI trend is the foundation chart showing the movement from 2017 up to the present. The chart shows the open, high, low and close of the index on every HSI trading day. Each price bar is coloured by comparing the day's close with the previous day's close: if the current closing price is greater than the previous closing price, the bar is green, representing “positive”; if the current closing price is less than the previous closing price, the bar is red, representing “negative”; and if the current closing price is equal to the previous closing price, the bar is grey, representing “equal”.

Figure 9. Hang Seng Index (HSI) basic trend from Sep 2017 to Feb 2018

The chart also shows the 10-day, 20-day, 50-day, 100-day and 150-day simple moving averages (SMA), which indicate the general trend from the short term to the long term.
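For reference, a simple moving average can be computed as sketched below. The closing values are made up, and `simple_moving_average` is a hypothetical helper, not the plotting code behind the chart.

```python
def simple_moving_average(closes, window):
    """Return the SMA series; the first value covers closes[0:window]."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, close in enumerate(closes):
        running += close
        if i >= window:
            running -= closes[i - window]  # drop the value leaving the window
        if i >= window - 1:
            averages.append(running / window)
    return averages


closes = [10.0, 11.0, 12.0, 13.0, 14.0]  # made-up closing values
sma3 = simple_moving_average(closes, 3)
print(sma3)
```

A longer window (for example 150 days) reacts more slowly to new prices, which is why it is read as the long-term trend.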

5.2 Data Analysis Result

5.2.1 Regression Linear

Figure 10. Prediction future trend by regression linear until 8th February 2018

The aim of this part is to predict the HSI movement up to 8th February 2018 by linear regression on the historical HSI price data. The result shows a positive index movement, although in the short term the HSI has dropped from its peak of 33,000 to 29,500. A technical rebound back to 31,000 points or even higher is expected. After appropriate adjustments, Figure 10 shows that in the long term the HSI will continue to grow, up to 34,000.
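The idea of a linear regression trend can be sketched with an ordinary least-squares line fitted to closing prices indexed by day. This is an illustration with made-up numbers, not the model that produced Figure 10, and `fit_line` is a hypothetical helper.

```python
def fit_line(ys):
    """Least-squares slope and intercept for y against x = 0, 1, ..., n - 1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


closes = [100.0, 102.0, 104.0, 106.0]  # made-up closing values
slope, intercept = fit_line(closes)
forecast = intercept + slope * len(closes)  # extrapolate one day ahead
print(slope, intercept, forecast)
```

A positive slope corresponds to the "positive" long-term movement described above; the extrapolation says nothing about short-term drops around the line.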

5.3 Social Media Analysis Result

5.3.1 Stop words and stemming word

Stop word removal and stemming are important pre-processing algorithms that make the tokens uniform across the data set. For the algorithms to be effective, words with no significant semantic content in a sentence must be removed, which also speeds up text pre-processing. As a result, data redundancy in the data set is reduced and the algorithms produce more meaningful results.

['Short-term', 'index', 'expected', 'test', 'resistance', '24000', 'points', '.', 'The', 'index',
'expected', 'greater', 'chance', 'downswing', ',', '24000', 'points', 'larger', 'resistance',
',', 'short-term', 'test', '23500', 'level', 'support', '.', 'He', 'pointed', 'together',
'analysts', 'starting', 're-raise', 'future', 'earnings', 'forecast', 'emerging', 'market',
'enterprises', ',', 'I', 'believe', 'cyclical', 'economic', 'recovery', 'begun', 'continue',
'next', 'two', 'years', '.', 'Therefore', ',', 'target', 'price', 'HSI', '2017', '2018', '28000',
'points', '32000', 'Point', ',', ',', 'forecast', 'earnings', '14', 'times', '15', 'times', '.',
.

'He', 'pointed', 'together', 'analysts', 'starting', 're-raise', 'future', 'earnings', 'forecast',


'emerging', 'market', 'enterprises', ',', 'I', 'believe', 'cyclical', 'economic', 'recovery',
'begun', 'continue', 'next', 'two', 'years', '.', 'Therefore', ',', 'target', 'price', 'HSI',
'2017', '2018', '28000', 'points', '32000', 'Point', ',', ',', 'forecast', 'earnings', '14',
'times', '15', 'times', '.', 'The', 'basic', 'target', 'Hang', 'Seng', 'Index', 'next', 'year',
'23800', ',', '6.5', '%', 'current', 'level', '.', 'Overall', ',', 'Hong', 'Kong', 'stocks',
'expected', 'fluctuate', 'significantly', 'next', 'year', ',', 'HSI', 'ranging', '18300',
'25800', '.', 'Morgan', 'Stanley', 'expects', 'Hang', 'Seng', 'Index', 'rose', '30500',
'points', '.'] ….

5.3.2 Support Vector Machine (SVM)

The linear SVM is used to identify positive and negative tokens; the resulting positive and negative coefficients correspond to the HSI. With feature selection applied, the SVM features are sorted by coefficient and the top 10 positive and top 10 negative features are selected.

Figure 11. Feature Importance SVM

The results for the various training sets are shown in Figure 11, Table 4 and Table 5; the Fisher score performed identically to the correlation coefficient. The average coefficient of the top positive feature across the training tokens is 0.930668. In contrast, the average coefficient of the top negative feature across the training tokens is -0.9254995. The accuracy of the linear SVM with feature selection is 88.89%. These positive and negative tokens are signals that help investors or analysts understand the HSI movement trend (Appendix III provides a list of all positive and negative word values).
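Selecting the top positive and negative features from a set of coefficients can be sketched as follows. The six coefficients are taken from Tables 4 and 5, while `top_features` is a hypothetical helper; how the coefficients were obtained from the trained SVM is not shown here.

```python
def top_features(coefficients, k):
    """Return the k most positive and k most negative (feature, coefficient) pairs."""
    ranked = sorted(coefficients.items(), key=lambda item: item[1])
    negative = [(feat, coef) for feat, coef in ranked[:k] if coef < 0]
    positive = [(feat, coef) for feat, coef in ranked[::-1][:k] if coef > 0]
    return positive, negative


# Single-token coefficients as reported in Tables 4 and 5.
coefficients = {
    "singtao": 1.069987, "range": 1.005172, "believe": 0.914917,
    "consolidation": -1.126413, "highs": -1.083767, "fall": -0.896254,
}
positive, negative = top_features(coefficients, k=2)
print(positive, negative)
```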

Table 4. Top 10 positive features for linear SVM


Rank feat coeff Rank feat coeff
1 singtao 1.069987 6 believe 0.914917
2 range 1.005172 7 year 0.866025
3 support 0.998440 8 opportunity 0.852649
test
4 limited 0.968164 9 continue 0.852507
expected points
5 temporarily 0.930996 10 term 0.847823

Table 5. Top 10 negative features for linear SVM


Rank feat coeff Rank feat coeff
1 hsi expected -1.258017 6 freeman -1.020983
2 consolidation -1.126413 7 contention -0.972060
3 highs -1.083767 8 challenging -0.969969
4 morning -1.045890 9 expected -0.908875
5 freeman securities -1.020983 10 fall -0.896254

5.3.3 Multinomial Naïve Bayesian (MNB)

Multinomial Naïve Bayes estimates likelihoods from token counts; the aim of the following result is to understand the token features and to obtain a brief prediction of the future. The accuracy of MNB with feature selection is 88.89%.

Word occurrences are assumed to be conditionally independent of one another given the class of the sentence. In Table 6, these tokens are associated with a downward HSI, especially “today” and the social media names “mpfinance” and “mingpao”. These tokens give a negative signal for HSI movement (Appendix IV provides a list of the MNB values of all words).
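The negative values in Table 6 are consistent with log-probabilities of tokens given a class, although the dissertation does not state this explicitly, so it is treated here as an assumption. The sketch below shows how such smoothed log-likelihoods arise from token counts; the tokenised cases and the name `token_log_likelihoods` are made up for illustration.

```python
import math
from collections import Counter


def token_log_likelihoods(class_cases, vocabulary, alpha=1.0):
    """Laplace-smoothed log P(token | class) from the token counts of one class."""
    counts = Counter(token for case in class_cases for token in case)
    total = sum(counts[token] for token in vocabulary) + alpha * len(vocabulary)
    return {token: math.log((counts[token] + alpha) / total) for token in vocabulary}


# Made-up tokenised cases of the "down" class and a small shared vocabulary.
down_cases = [["hsi", "expected", "fluctuate", "today"], ["hsi", "fall", "today"]]
vocabulary = ["expected", "fall", "fluctuate", "hsi", "rise", "today"]
log_likelihoods = token_log_likelihoods(down_cases, vocabulary)
print(log_likelihoods)
```

Every value is negative because it is the logarithm of a probability below 1; tokens that are frequent in the class have values closer to 0.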

Table 6. Top 20 negative features for linear Multinomial Naïve Bayesian


Rank feat coeff Rank feat coeff
1 today -4.104449 11 fluctuate today -4.215065
2 mpfinance -4.110087 12 points -4.527014
3 mingpao -4.111158 13 index -4.716769
4 mingpao mpfinance -4.111158 14 yiu -4.884947
5 expected -4.130641 15 aastock -4.899981
6 hsi -4.135670 16 director -4.924233
7 mpfinance hsi -4.162952 17 china -4.937234
8 hsi expected -4.174566 18 kwok -4.949888
9 expected fluctuate -4.208969 19 management -4.950483
10 fluctuate -4.210240 20 ka -4.969320

The results of both SVM and MNB assign each “feature” a coefficient value. Generally, SVM is used most of the time for text classification and is better at analyzing full-length content, while MNB can be interpreted as a linear model that is good at analyzing snippets or short documents. It has also been stated that MNB is better than SVM with a small number of training cases (Ng & Jordan, 2002).

Basically, both MNB and SVM are appropriate and strong baseline classification algorithms for analyzing every text case: they yield positive and negative feature values from analyst commentary, which in turn provide a signal on HSI price movement.

5.3.3 Random Forest (RF)

Table 7. Features importance Random Forest


Feat Importance_RF Feat Importance_RF
mpfinance 0.044293104 expected 0.016031654
mingpao 0.032746819 expected fluctuate 0.014106095
points 0.029200343 yiu 0.013567787
hsi expected 0.02764265 research 0.012273888
index 0.02355802 hang 0.012259709
aastock 0.020926599 limited hsi 0.011614551
today 0.020675591 goldjoy 0.011568717
level 0.019877035 market expected 0.011093856
hsi 0.016439794 freeman securities 0.010886402

Random forest (RF) is often used for prediction; here, the result gives the feature importances, which indicate which word variables have the most effect in this model. The importance values (predictive power) of all features in the random forest are under 0.05 (Table 7). The random forest forecast is based on the data set of prediction articles and their content.

Figure 12. Feature Importance Random Forest

RF feature selection is based on the features used for splitting; because it depends on impurity reduction, it is biased towards variables with many categories. Correlated word features can serve as predictors if an analyst's article contains many correlated features.

5.3.4 eXtreme Gradient Boosting (Xgboost)

Xgboost provides an importance score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions in the decision trees, the higher its relative importance. The importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared with each other.

Table 8. Features importance eXtreme Gradient Boosting


Feat Importance_XGBM Feat Importance_XGBM
level 0.057279237 hsi expected 0.035799522
points 0.050119333 resistance 0.035799522
support 0.045346063 market 0.035799522
index 0.040572792 aastock 0.03341289
today 0.040572792 consolidation 0.028639618
chi 0.040572792 hang 0.026252983
expected 0.03818616 year 0.026252983

Figure 13. Feature Importance XGBoost

In Figure 13, features are automatically named according to their index in the input array (X). After manually mapping these indices to the token names, the top 10 features run from "level" to "market": the plot shows that "level" has the highest importance (0.057279237) and "market" (0.035799522) has the lowest importance among the top 10 features.

5.4 Comparison on Prediction Movement Result

Figure 14. F1 score for different algorithms

Table 9 gives the accuracy of the movement predicted by the four algorithms and of the movement predicted by the analysts' articles, based on the text and predicted price, for 54 stock commentary records; the trained classifiers are applied to this test set. The data set is used to train the classifiers, and the movements predicted by the analysts and by the algorithms are compared with the actual index price movement. Each sentence has been processed with stop word removal and stemming, and SVM, MNB, RF and Xgboost are then used to forecast the future movement of the HSI. The actual and predicted movements are closely related, which supports the forecasts derived from the stock commentary. A result of “up” means the market went upward and “down” means the market went downward.

The accuracy of the analysts' predicted movement against the actual HSI in these 54 records is 68.52%, i.e. more than half of the predictions matched the actual index; SVM and Xgboost have the same accuracy of 64.81%.

Table 9. Accuracy result in analysts’ prediction and algorithm prediction


Algorithm Prediction Accuracy Accuracy %
Prediction movement by analyst 37 / 54 68.52%
Support Vector Machine (SVM) 35 / 54 64.81%
Multinomial Naïve Bayesian (MNB) 31 / 54 57.41%
Random Forest (RF) 32 / 54 59.26%
eXtreme Gradient Boosting (Xgboost) 35 / 54 64.81%
Integrated all result in corresponding to Actual Trend 20 / 54 37.04%
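The accuracy figures in Table 9 are the share of records whose predicted movement equals the actual movement. A minimal sketch, with made-up label lists and a hypothetical helper name `accuracy`:

```python
def accuracy(predicted, actual):
    """Return (hits, total, percentage) of predictions matching the actual movement."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits, len(actual), 100.0 * hits / len(actual)


# Made-up movement labels for four records.
predicted = ["up", "up", "down", "up"]
actual = ["up", "down", "down", "up"]
hits, total, percent = accuracy(predicted, actual)
print(hits, total, percent)

# The same arithmetic applied to the analysts' row of Table 9 (37 of 54 records).
analyst_percent = round(100 * 37 / 54, 2)
print(analyst_percent)
```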

Table 10. Prediction movement result with actual price


Prediction_
Prediction Prediction Prediction Prediction
Text movement Actual
_SVM _MNB _RF _XGB
by analyst
aastock kwok ka
yiu managing
director at china
goldjoy asset
management ltd
the short term up up up up up up
resistance is
expected to be
around and
below the
support at
aastock chik yiu
fai head of
research
department of
bright smart
up up up down up up
securities
commodities
group limited
the short term is
still expected to

55
remain above
the day low of

mingpao
mpfinance the
hsi is expected
up up down up up up
to fluctuate
between and
today
aastock kwok ka
yiu managing
director at china
goldjoy asset
management ltd
index is
expected to
stabilize at the up up up up up up
level of
points the
market outlook
effectively
further
challenges
resistance
aastock kwok ka
yiu managing
director at china
goldjoy asset
management ltd
the upward trend
is expected to
up up up up up up
continue hong
kong stocks the
index will
further test the
level the bottom
support at
around
mingpao
mpfinance
hong kong
stocks are up up up up up down
expected to try
another
points
mingpao
mpfinance the
up up down up up up
hsi is expected
to fluctuate

56
between and
today

mingpao
mpfinance the
hsi is expected
up up down up up down
to fluctuate
between and
today
aastock kwok ka
yiu managing
director at china
goldjoy asset
management ltd
short term
consolidation is
up up up up up up
expected to
maintain the
pattern the
index continued
at to
points on the
level
sina pong po
lam paul
managing
director of
pegasus fund
managers
limited small
and medium
sized stocks will
benefit us stocks
continue to rise
but investors
will be worried up up up up up up
about the bubble
burst in the
united states or
next year will be
the transfer of
funds to hong
kong i believe
hong kong
stocks have the
opportunity to
rise above
points next year

57
mingpao
mpfinance the
hsi is expected
up up down up up down
to fluctuate
between and
today
hkej lam ka kei
hang seng index
is expected
within the short up up up up up up
term will still be
to from the
distance
mingpao
mpfinance the
hsi is expected down down down down down up
to go up or
down today at
aastock chik yiu
fai head of
research
department of
bright smart
securities
commodities
group limited
hong kong up up up up up down
stocks are
expected to
easily break
above today
but will have
some resistance
at the previous
high of
mingpao
mpfinance the
hang seng index
up up up up up down
is expected to
consolidate from
a high of to
aastock kwok ka yiu managing director at china goldjoy asset management ltd the short term is still expected to continue the pattern of consolidation the index remained at points to points level | up | up | up | up | up | up
hkej sam chi yung senior strategist at south china financial holdings ltd market outlook is still powerful and then break up or higher level is absolutely capable of seeing | up | up | up | up | up | up
aastock kwok ka yiu managing director at china goldjoy asset management ltd the index is expected to continue to up and down between and | up | up | up | up | up | up
etnet chan fung chu today or is likely to challenge upward points mark | up | up | up | up | up | up
mingpao mpfinance the hsi is expected to fluctuate between and today | up | up | down | up | up | up
mingpao mpfinance the hsi is expected to fluctuate between and today | up | up | down | up | up | up
sina missing it is expected that the hsi will continue to consolidate at the level of about on the eve of the christmas holidays | up | up | up | down | up | up
mingpao mpfinance the hsi is expected to hover around to today | up | down | down | down | down | up
aastock kwok ka yiu managing director at china goldjoy asset management ltd the index is expected to hold at to | up | up | up | up | up | up
mingpao mpfinance the hsi is expected to fluctuate between and today | up | up | down | up | up | down
quamnet yu kwan lung independent stock commentator the hang seng index is forecast to fluctuate between and in the coming few sessions | up | up | up | up | up | down
aastock kwok ka yiu managing director at china goldjoy asset management ltd index or test points support point level to become short term rebound resistance | up | up | up | up | up | up
mingpao mpfinance hong kong stocks are expected to go up or down today at points | up | up | up | up | up | down
mingpao mpfinance the hang seng index is expected to rebound today and is expected to go up and down at to | down | down | up | up | down | down
mingpao mpfinance hong kong stocks are expected to go up or down today at to | up | up | up | up | up | up
mingpao mpfinance the hsi is expected to fluctuate between and today | up | up | down | up | up | up
aastock kwok ka yiu managing director at china goldjoy asset management ltd the index is expected to further test the level and the underlying support moves up to | up | up | up | up | up | down
aastock kwok ka yiu managing director at china goldjoy asset management ltd the index is still expected to consolidate in the short term maintaining the to level | up | up | up | up | up | up
mingpao mpfinance the hsi is expected to fluctuate between and today | up | up | down | up | up | up
mingpao mpfinance the hsi is expected to fluctuate between and today | up | up | down | up | up | down
aastock kwok ka yiu managing director at china goldjoy asset management ltd short term external situation is not yet clear the short term index continues to be in a consolidation pattern maintained at to points on the level | up | up | up | up | up | up
mingpao mpfinance the hsi is expected to hit the daily level of today | down | down | down | down | down | down
mingpao mpfinance the hsi is expected to go up and down at to | up | down | down | down | down | down
etnet shek kang chuen arthur head of research of hong kong economic times keep steadily from to | down | up | up | up | up | down
aastock kwok ka yiu managing director at china goldjoy asset management ltd the index is expected to consolidate ahead of the short term and maintain its level at to | up | up | up | up | up | up
aastock kwok ka yiu managing director at china goldjoy asset management ltd expected to maintain the index at to points on the level | up | up | up | down | down | up
etnet tang sing hing kenny chairman of the hong kong institute of financial analysts and professional commentators limited expected today th is expected to challenge points | up | up | up | up | up | down
mingpao mpfinance hang seng index is expected to test today or on trial | up | up | up | up | up | down
aastock kwok ka yiu managing director at china goldjoy asset management ltd the index is expected to maintain its stable pattern on the last trading day of continuing from to points | up | up | up | up | up | up
mingpao mpfinance the hsi is expected to go up and down at to today | up | up | down | up | up | up
mingpao mpfinance the hsi is expected to test the level today or on the day | down | down | down | down | down | up
quamnet yu kwan lung independent stock commentator as the investment climate downturn a basket of hang seng index constituents is sold out and the hang seng index is forecast to fluctuate between and in the next few trading days | up | up | up | up | up | down
aastock kwok ka yiu managing director at china goldjoy asset management ltd expect the index to test a psychological barrier of points after the elimination of the selling pressure the index is still expected to test points repeatedly | up | up | up | up | up | up
mingpao mpfinance the hsi is expected to fluctuate between and today | up | up | down | up | up | down
quamnet yu kwan lung independent stock commentator the hang seng index is forecast to fluctuate between and in the coming few sessions | up | up | up | up | up | up
etnet chong chi ho chief financial capital limited hong kong stocks are expected to further test the level today | up | up | up | up | up | up
finet missing missing the hang seng index trend repeatedly at the level of contention | up | down | up | up | up | up
mingpao mpfinance the hsi is expected to hit the daily level at today | down | down | down | down | down | up
mingpao mpfinance the hsi is expected to fluctuate between and today | up | up | down | up | up | up

5.5 Goal-Based Trading Experiment

To test whether the predicted HSI trend holds in practice, a trading experiment was
carried out on Plus500 (http://www.plus500.com), a well-known international online CFD
trading platform for investors that provides trading instruments including crypto,
indices, forex and commodities.

For the actual trading experiment, the Hang Seng Index Future (listed as Hong Kong
50) was traded, because it tests the prediction result directly. The aim of the
experiment is to make a profit.

An HSI future is traded in the form of a contract. Every contract has an expiry date,
and a premium charge applies on each trading day that the investor holds the contract
up to the expiry date. In this case each index point is worth HKD$50. A “Buy” position
bets that the index will rise and a “Sell” position bets that it will drop, so a “Buy”
position gains HKD$50 for each point the HSI rises and loses HKD$50 for each point it
falls.
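
The per-point rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `futures_pnl`, the entry and exit levels, and the single-contract position are hypothetical, and the premium charge and any platform-specific position sizing are ignored.

```python
def futures_pnl(entry_level, exit_level, side="buy", point_value=50.0, contracts=1):
    """Profit or loss in HKD for an index future, ignoring premiums and fees.

    point_value is the HKD value of one index point (HKD$50 in the text).
    side is "buy" (profits when the index rises) or "sell" (profits when it falls).
    """
    if side not in ("buy", "sell"):
        raise ValueError("side must be 'buy' or 'sell'")
    points = exit_level - entry_level
    if side == "sell":
        points = -points  # a sell position gains when the index drops
    return points * point_value * contracts


# Hypothetical levels: the index moves up 120 points.
print(futures_pnl(30000, 30120, side="buy"))   # buy position gains
print(futures_pnl(30000, 30120, side="sell"))  # sell position loses the same amount
```

The two calls show that the same 120-point rise is a gain for a “Buy” position and an equal loss for a “Sell” position.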

The trading experiments are divided into short term (one-day trading) and medium term
(at least a month). The confidence level of this experiment is 85%.

Figure 15. – Plus500 Trading platform statistics

According to Figure 10, the platform statistics record a purchase of Hong Kong 50
(HSI) in “Buy” mode at an opening rate of 29,315 on 9th February 2018. The position
betting on a rise in the HSI amounts to 2,250 contracts, with an expiry date of
23rd February 2018.

The limit is set at 34,000 points, at which level the position will be sold
automatically. As of 21 February 2018, the current rate of the HSI had risen to
31,365 points, an increase of 6.99 percent. The resulting net profit is
HKD$453,899.00, a net profit percentage of 90.78%.

Figure 16. – Buy Hong Kong 50 statistics records on 16th February 2018

This goal-based trading experiment followed the linear regression prediction algorithm
and the stock analysts’ predictions for February, and it produced a reasonable and
expected result.
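
As a quick check of the arithmetic, the 6.99 percent rise follows from the opening rate of 29,315 and the later rate of 31,365 quoted in this section. The sketch below only reproduces that percentage; the helper name `percent_change` is an assumption for illustration, and the reported net profit is not recomputed because it depends on platform position sizing and charges that are not given here.

```python
def percent_change(old_level, new_level):
    """Percentage change from old_level to new_level."""
    return (new_level - old_level) / old_level * 100.0


opening_rate = 29315  # HSI rate when the position was opened
current_rate = 31365  # HSI rate as of 21 February 2018

print(round(percent_change(opening_rate, current_rate), 2))
```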

5.6 Issues Influencing HSI Prediction Result

Because the Hang Seng Index (HSI) includes a number of blue-chip stocks and has a high
correlation coefficient with the global stock market, the following kinds of news and
information affect its price.

1) Global Issue
International trade policy, such as trade tariffs, influences the stock market
atmosphere between countries. A favourable development tends to lift stock
prices internationally. Otherwise, it creates a high risk for the
market.

According to CNN Politics, US President Donald Trump hit China with tariffs
on nearly 60 billion U.S. dollars of Chinese imports
(Diamond, 2018). This heightened concerns of a global trade war
and was a negative factor that shocked index prices in the global financial
market.

Another example is the debt issue of 2011, which put world stock markets under
heavy pressure. In Europe, the market worried that debts might have to be
restructured and that the European debt problem might spread to larger
countries such as Italy, Spain and France. Both Italian and Spanish bond yields
rose, and the market was concerned that France might lose its highest
sovereign debt rating. European countries failed to reach a consensus on
the mitigation plan for the debt issue, leading to a drop in the stock market. As
of the end of 2011, the FTSE index, DAX index and CAC index had fallen by 5.6%,
14.7% and 17.0% respectively, while the stock markets of the five PIIGS
countries (Portugal, Ireland, Italy, Greece and Spain) had fallen by between
13.1% and 51.9%, with the exception of Irish stocks, which rose 0.6%
(Securities and Futures Commission (SFC), 2012). Because of the debt issue, the
Asian region’s economy was dragged down by Europe’s sovereign debt problems,
and regional stock markets generally fell.

In Hong Kong, the European sovereign debt problems, the lowering of the US credit
rating and the possibility that the Mainland would introduce macroeconomic
austerity measures all damaged market sentiment. As of the end of 2011, the Hang
Seng Index and the Hang Seng State-owned Enterprises Index had fallen by 20.0% and
21.7% respectively from the end of 2010 (Securities and Futures Commission (SFC),
2012).

2) Economic Indicators and Releases


The release of economic indexes is one of the factors that influence the index
price on the current or the next day. There are many economic indexes, such as
CPI, PMI and GDP. According to Momani & Alsharari (2012), the three indicators
that most influence stock market prices are the gross domestic product (GDP), the
unemployment rate and the rate of inflation. When indicators are positive for the
economy, or exceed their expected values, the stock market is likely to
grow. Although positive indicators cannot guarantee positive stock price
performance, historical precedents exist. When all the indicators show positive
growth, many local businesses are growing strongly, which presages a bull market
(Momani & Alsharari, 2012).

3) Transaction Volume
High transaction volume is one of the forces that lift the Hang Seng Index (HSI).
Southbound inflows into the Hong Kong stock market through Stock Connect
amounted to around HK$339.9 billion in 2017 (Securities
and Futures Commission (SFC), Research Paper No. 62: A Review of the
Global and Local Securities Markets in 2017, 2018).

4) Corporate Performance Report
The Hang Seng Index (HSI) includes 50 constituent companies, and each of them
releases seasonal and overall performance reports every season. These reports
influence the performance of the HSI in both the short term and the long
term.

For instance, as the global market is optimistic about corporate profits in 2017,
coupled with the improvement of the basic economic factors, the global market
has made good progress. In particular, US stocks hit a record high, and the
pace of economic recovery began to accelerate. In addition to the United States,
major markets such as Germany, United Kingdom, South Korea and India all
recorded historical highs in 2017. The US dollar continued to weaken,
depreciating by 10% in 2017, which is another factor that promotes continued
capital inflows into Hong Kong and other emerging markets (Securities and
Futures Commission (SFC), Research Paper No. 62: A Review of the Global
and Local Securities Markets in 2017, 2018).

All in all, many internal and external issues arise, and news can be released
suddenly, all of which influence the predicted stock price movement and the
atmosphere of the stock market. Hence, an investor or analyst must keep up to date
with these issues and information as well as with the predicted movement trend. The
predictions produced by text and data analysis should be treated as a reference
under the prevailing macro conditions.

Chapter 6 Conclusion
The present-day stock market is characterised by a strong strengthening of its
influencing role. This work has built up a corpus that can be used to
investigate the importance of text analysis and to estimate and verify predicted
movements. The text analysis algorithms showed that some of the words analysts use
in their social media predictions are indeed important; in particular, some
encouraging and short-term words, such as “support” or “limited expected”, came out
as positive features.

Across the 4 experiments on various text data sets, the average accuracy of the four
algorithms (SVM, MNB, RF and xgboost) and of the analysts' predictions is around
60%. The most insightful finding is that the accuracy of the financial stock
analysts’ HSI predictions is over 68%.
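
For readers who want to recompute such figures, accuracy here can be read as the share of days on which the predicted direction matches the actual direction. The following is a minimal sketch under that assumption; the function name `direction_accuracy` and the label lists are hypothetical and are not taken from the experiment data.

```python
def direction_accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


# Hypothetical labels for four trading days.
predicted = ["up", "up", "down", "up"]
actual = ["up", "down", "down", "up"]
print(direction_accuracy(predicted, actual))
```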

Overall, the algorithms' predictions correspond with the actual result more than 35%
of the time on average. These results suggest that social media stock price movement
prediction has a strong reference value for the public.

Chapter 7 Further Research & Limitation
Further research would extend the scope and time span of this dissertation and
address its limitations. It can be divided into a research part and an
experiment part. The resulting innovations could be developed into commercial
system software.

1) Advanced Research

To achieve a global text analysis of the stock market, the major indexes should be
included, such as the Dow Jones index, the Deutsche Börse index and the SSE in
China. These are global indexes that influence the performance of other global,
regional and national indexes (Galina, Iryna, & Sergii, 2015).

Analysing all of the major indexes would make the analysis unique and
comprehensive, and it might uncover hidden correlations that cannot be found by
receiving and analysing only the primary data source and technical indicators.

2) High-Tech Experiment

The current experiment is a text analysis of analysts' social media posts,
which forecasts a predicted trend or movement. To come close to a business
system, it is necessary to build a web-based system with AI analysis, to which,
apart from the basic indicators, mining algorithms can also be added.
Such a system would integrate traditional concepts and theories with
today's innovations in AI and mining techniques.

Selected Bibliography
[1] Alassiri, A., Mud, M., & Ghazali, R. (2014, February). Usage of Social
Networking Sites and Technological Impact on the Interaction Enabling
Features. International Journal of Humanities and Social Science, 4(4), 46-61.
[2] Batista, G., Wang, X., & Keogh, E. (2012). A Complexity-Invariant Distance
Measure for Time Series. University of California, 12.
[3] Blackrock Investment Institute. (2018). Weekly Commentary, March 12, 2018.
[4] Butler, M. (2009). An Artificial Intelligence Approach to Financial Forecasting
using Improved Data Representation, Multi-objective Optimization, and Text
Mining. Halifax, Nova Scotia: Dalhousie University.
[5] Campbell, W., Dagli, C., & Weinstein, C. (2013). Social Network Analysis with
Content and Graphs. Lincoln Laboratory Journal, 20(1), 62-80.
[6] Cooper, J. (2012). Hit and Run Trading: The Short-Term Stock Traders' Bible.
Wiley Trading.
[7] Das, S., & Chen, M. (2007). Yahoo! for amazon: Sentiment extraction from small
talk on the web. Management Science, 53(9), 1375-1388.
[8] Dascalaki, E., Droustsa, K., Gaglia, A., Kontoyiannidis, S., & Balaras, C. (2010,
February 17). Data Collection and analysis of the building stock and its energy
performance - An example for Hellenic buildings. ELSEVIER, 1231-1237.
[9] Day, M.-Y. (2014). Data Mining - Classification and Prediction. Taiwan:
Department of Information Management, Tamkang University. Retrieved from
http://slideplayer.com/slide/7027794/
[10] Degutis, A., & Novickyte, L. (2014). The Efficient Market Hypothesis: A Critical
Review of Literature and Methodology. EKONOMIKA, 93(2), 8-23.
[11] Diamond, J. (2018, March 23). Trump hits China with tariffs, heightening
concerns of global trade war. Retrieved from CNN politics:
https://edition.cnn.com/2018/03/22/politics/donald-trump-china-tariffs-trade-war/index.html

[12] Flugel, A. (2016, July). Burtch Works Flash Survey: SAS vs R vs Python! (B. W.
Recruiting, Ed.) Retrieved from
https://www.burtchworks.com/files/2016/07/SAS-vs-R-vs-Python-2016_webinar-PDF-deck.pdf
[13] Galina, A., Iryna, S., & Sergii, K. (2015). Analysis of the Global Stock Market
Trends. Journal of Finance and Economics, 4(3), pp. 67-71. Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.880.8378&rep=rep1&type=pdf
[14] Garefalakis, A., Dimitras, A., Koemtzopoulos, D., & Spinthiropoulos, K. (2011).
Determinant Factors of Hong Kong Stock Market. International Research
Journal of Finance and Economics(62), 50-60.
[15] Gera, M., & Goel, S. (2015, March). Data Mining - Techniques, Methods and
Algorithms: A Review on Tools and their Validity. International Journal of
Computer Applications, 113(18), 22-29.
[16] Hang Seng Indexes Company Limited. (2017). Hang Seng Indexes Report. Hong
Kong: Hang Seng Indexes. Retrieved from
http://www.hsi.com.hk/HSI-Net/static/revamp/contents/en/dl_centre/factsheets/FS_HSIe.pdf
[17] HKEX. (2017, April 20). Severe Weather Arrangements. Retrieved from
http://www.hkex.com.hk/Services/Trading-hours-and-Severe-Weather-Arrangements/Severe-Weather-Arrangements/Clearing-and-Settlement?sc_lang=en
[18] Ingason, A., Helgadóttir, S., Loftsson, H., & Rögnvaldsson, E. (2008). A Mixed
Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities
(HOLI). Springer-Verlag Berlin Heidelberg , 205-216.
[19] Investor Education Centre. (2014, December 2). edb.gov.hk. Retrieved from
Understanding the importance of Hang Seng Index:
http://www.edb.gov.hk/attachment/en/curriculum-development/kla/technology-edu/resources/business-edu/CDI020150246_Handout_2_20141202_Importance_of_HSI.pdf

[20] Irfan, R., King, C., Grages, D., Ewen, S., Khan, S., Madani, S., . . . Li, H. (2004).
A Survey on Text Mining in Social Networks. Cambridge University Press, 00:0,
1-24.
[21] Jiang, K., Ediger, D., Corley, C., Farber, R., & Reynolds, W. (2010). Massive
Social Network Analysis: Mining Twitter for Social Good. 39th International
Conference on Parallel Processing, (pp. 583-593). USA.
[22] Kannan, S., Sekar, S., Sathik, M., & Arumugam, P. (2010, March 17). Financial
Stock Market Forecast using Data Mining Techniques. Proceedings of the
International MultiConference of Engineers and Computer Scientists.
[23] Kirkos, E., & Manolopoulos, Y. (2004). Data Mining in Finance and
Accounting: A Review of Current Research Trends. Department of Accounting,
Technological Educational Institution of Thessaloniki, Greece.
[24] Kloptchenko, A., Back, B., Vanharanta, H., Eklund, T., Karlsson, J., & Visa, A.
(2002). Combining data and text mining techniques for analyzing financial
reports. Eighth Americas Conference on Information Systems, 20-28.
[25] Kovalerchuk, B. (2015). Data Mining For Financial Applications. USA: Central
Washington University.
[26] Li, B., Chan, K., Ou, C., & Reuifeng, S. (2017, February 2). Discovering public
sentiment in social media for predicting stock movement of publicly listed
companies. Elsevier, 81-92. Retrieved 9 7, 2017
[27] Li, Y., Li, X., & Wang, H. (2016, November 29). Based on Multiple Scales
Forecasting Stock Price with a Hybrid Forecasting System. American Journal of
Industrial and Business Management, 927-940.
[28] Mahajan, A., Dey, L., & Haque, S. (2008). Mining financial news for major
events and their impacts on the market. WI-IAT '08. IEEE/WIC/ACM
International Conference on, 1, 423-426.
[29] Mak, K., Ho, T., & Ting, S. (2011). A Financial Data Mining Model for
Extracting Customer Behavior. The Hong Kong Polytechnic University,
Department of Industrial and Systems Engineering. Hong Kong: Convoy
Financial Services Holdings Limited. Retrieved July 23, 2011

[30] Marwala, L. (2017). Forecasting the Stock Market Index Using Artificial
Intelligence Techniques. Johannesburg: Faculty of Engineering and the Built
Environment, University of the Witwatersrand.
[31] Moldovan, D. (2011). Business Intelligence: Data Mining in Finance. Babeş-
Bolyai University of Cluj-Napoca, Faculty of Economics and Business
Administration. Rome: Babeş-Bolyai University of Cluj-Napoca.
[32] Momani, G., & Alsharari, M. (2012). Impact of Economic Factors on the Stock
Prices at Amman Stock Market (1992-2010). International Journal of Economics
and Finance, 4(1), 151-159.
[33] Ng, A., & Jordan, M. (2002). On Discriminative vs. Generative classifiers: A
comparison of logistic regression and naive Bayes. Stanford University.
[34] Olowe, M., Gaber, M., & Stahl, F. (2013). A Survey of Data Mining Techniques
for Social Network Analysis. UK: School of Computing Science and Digital
Media, Robert Gordon.
[35] Ozer, P. (2008). Data Mining Algorithms for Classification. Netherlands:
Radboud University Nijmegen.
[36] Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment
Classification using Machine Learning. Association for Computational
Linguistics, 10, 79-86.
[37] Powers, D. (2007). Evaluation: From Precision, Recall and F-Factor to ROC,
Informedness, Markedness & Correlation. Australia: School of Informatics and
Engineering, Flinders University of South Australia.
[38] Puget, J.-F. (2016, December 23). KDnuggets. Retrieved from
http://www.kdnuggets.com/2017/01/most-popular-language-machine-learning-data-science.html
[39] Radaideh, Q., Assaf, A., & Alnagi, E. (2013). Predicting Stock Prices Using Data
Mining Techniques. The International Arab Conference on Information
Technology. of Computer Information Systems, Faculty of Information
Technology and Computer Science, Yarmouk University, Irbid.

[40] Ravisankar, P., Ravi, V., Rao, G., & Bose, I. (2011). Detection of financial
statement fraud and feature selection using data mining techniques. Elsevier B.V.,
491-500.
[41] Sap, M., & Mohebi, E. (2017, December 4). Hybrid Self Organizing Map for
Overlapping Clusters. International Journal of Signal Processing, Image
Processing and Pattern Recognition, 11-20.
[42] Sawant, A., & Chawan, P. (2013). Comparison of Data Mining Techniques used
for Financial Data Analysis. International Journal of Emerging Technology and
Advanced Engineering, 112-116.
[43] Sawant, A., & Chawan, P. (2013, May). Study of Data Mining Techniques used
for Financial Data Analysis. International Journal of Engineering Science and
Innovative Technology (IJESIT), II(3), 503-509.
[44] Scikit-learn developers. (2017). Feature extraction. Retrieved from scikit-learn:
http://scikit-learn.org/stable/modules/feature_extraction.html
[45] Securities and Futures Commission (SFC). (2012). Research Paper 50: Review
of Global and Hong Kong Securities Market in 2011. Hong Kong: Securities and
Futures Commission (SFC).
[46] Securities and Futures Commission (SFC). (2018). Research Paper No. 62: A
Review of the Global and Local Securities Markets in 2017. Hong Kong.
[47] Soumya, S., & Deepika, N. (2016, January). Data Mining With Predictive
Analytics for Financial Applications. International Journal of Scientific
Engineering and Applied Science (IJSEAS), II(1), 310-317.
[48] The ASPIRA Association. (2017). Short-Term and Long-Term Investments
Options. Investments: Resources for Reaching the American Dream.
[49] The ASPIRA Association. (2017). Short-Term and Long-Term Investments
Options. Washington: Investments: Resources for Reaching the American
Dream. Retrieved from
https://www.aspira.org/sites/default/files/Inv_Fac_M5_V2.pdf
[50] Tsui, D. (2014). Predicting Stock Price Movement Using Social Media Analysis.
California: Stanford University.

[51] Visa, A., Kloptchenko, A., Eklund, T., Karlsson, J., Back, B., & Vanharanta, H.
(2004, March). Combining data and text mining techniques for analysing
financial reports. 12, pp. 29-41.
[52] Von, V. (2014). The Value of Social Media for Predicting Stock Returns –
Preconditions, Instruments and Performance Analysis. Germany: Technische
Universitat Darmstadt.
[53] Yakushev, A., & Mityagin, S. (2014). Social networks mining for analysis and
modeling drugs usage. ELSEVIER, 29, 2462-2471.
[54] Yu, K., Ng, H., Wong, W., Chu, K., & Chan, K. (2010). An Empirical Study of
the Impact of Intellectual Capital Performance on Business Performance. The 7th
International Conference on Intellectual Capital, Knowledge Management &
Organisational Learning, The Hong Kong Polytechnic University (pp. 1-11).
Hong Kong: The University of Hong Kong, HKSAR.
[55] Zeng, D., Chen, H., Lusch, R., & Li, S. (2010, November). Social Media
Analytics and Intelligence. IEEE Computer Society, 13-16.

Appendix I – The relationship between SP500 and HSI

Many investors and analysts use the relationship between SP500 and HSI to predict
the next day's index price. According to Macroaxis (2018), the pair cross-correlation
between SP500 and HSI was 0.87 over the 30 trading days from January 17, 2018 to
February 16, 2018.
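
A correlation coefficient of this kind is commonly computed as a Pearson correlation between two series. The sketch below assumes that definition; the exact method used by the cited source is not stated here, and the function name `pearson` and the sample series are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical closing levels for two indexes over five days.
index_a = [2700.0, 2712.0, 2695.0, 2680.0, 2701.0]
index_b = [31000.0, 31150.0, 30900.0, 30850.0, 31020.0]
print(round(pearson(index_a, index_b), 2))
```

A value close to 1 means the two series tend to move in the same direction; a value close to -1 means they tend to move in opposite directions.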

Pair Volatility
Assuming a 30-trading-day horizon, SP500 is expected to generate 0.95 times more
return on investment than Hang Seng. However, SP500 is 1.05 times less risky than
Hang Seng. It trades at about -0.11 of its potential returns per unit of risk, while
Hang Seng is currently generating about -0.13 per unit of risk. An investment of
280,256 in SP500 on January 17, 2018, sold at the end of the horizon, would have lost
10,393, or 3.71% of portfolio value over 30 days.

Appendix II – Technical Indicators

A technical indicator is a series of data points that are derived by applying a formula
to the price data of a security. Price data includes any combination of the open, high,
low or close over a period of time. Some indicators may use only the closing prices,
while others incorporate volume and open interest into their formulas. A technical
indicator offers a different perspective from which to analyze the price action. Some,
such as moving averages, are derived from simple formulas and the mechanics are
relatively easy to understand.

A simple moving average (SMA) is an indicator that calculates the average price of a
security over a specified number of periods. In Hong Kong, the 10-day, 20-day,
50-day, 100-day and 150-day SMA are generally used to represent short-term to
long-term stock price movement. For example, to calculate the 10-day SMA, assume the
HSI has 12 days of historical data and the aim is to forecast the next day's
movement. The most recent 10 days are averaged, and the result is the forecast price
for the next day ($\hat{x}$). The reference equation is:

\[
\hat{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad n = 10
\]

To calculate the forecast value, sum the most recent $n$ historical stock prices,
$\sum_{i=1}^{n} x_i$ (with $n = 10$ for the 10-day SMA in the example above), and
divide by the total number of days $n$.
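
The 10-day SMA example can be sketched as follows. The function name `sma_forecast` and the 12 closing values are hypothetical; only the most recent 10 values enter the average, as in the example above.

```python
def sma_forecast(prices, window=10):
    """Forecast the next value as the simple moving average of the last `window` prices."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(prices) < window:
        raise ValueError("not enough history for the requested window")
    recent = prices[-window:]  # keep only the most recent `window` observations
    return sum(recent) / window


# Hypothetical 12 days of closing levels; the first two are outside the 10-day window.
closes = [30100, 30250, 30400, 30380, 30510, 30620,
          30590, 30700, 30810, 30790, 30900, 31010]
print(sma_forecast(closes, window=10))
```

Changing `window` to 20, 50, 100 or 150 gives the longer-term averages mentioned above, provided enough history is available.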

Appendix III – Feature Importance SVM

Rank | feat | coeff | Rank | feat | coeff | Rank | feat | coeff
1 | singtao | 1.070 | 103 | forecast | 0.168 | 205 | group | -0.075
2 | range | 1.005 | 104 | missing missing | 0.162 | 206 | group limited | -0.076
3 | support | 0.998 | 105 | resistance support | 0.156 | 207 | sina | -0.077
4 | limited expected | 0.968 | 106 | hover | 0.154 | 208 | chik | -0.079
5 | temporarily | 0.931 | 107 | management expect | 0.136 | 209 | fai | -0.079
6 | believe | 0.915 | 108 | aastock kwok | 0.133 | 210 | securities commodities | -0.080
7 | year | 0.866 | 109 | kang | 0.130 | 211 | department bright | -0.080
8 | opportunity test | 0.853 | 110 | shek kang | 0.130 | 212 | yiu fai | -0.080
9 | continue points | 0.853 | 111 | chuen arthur | 0.130 | 213 | chik yiu | -0.080
10 | term | 0.848 | 112 | etnet shek | 0.130 | 214 | fai head | -0.080
11 | hover points | 0.800 | 113 | arthur | 0.130 | 215 | commodities | -0.080
12 | fluctuate today | 0.780 | 114 | kang chuen | 0.130 | 216 | commodities group | -0.080
13 | points | 0.761 | 115 | shek | 0.130 | 217 | narrow | -0.082
14 | oncc | 0.757 | 116 | research hong | 0.130 | 218 | opportunity | -0.095
15 | rebound resistance | 0.754 | 117 | arthur head | 0.130 | 219 | open higher | -0.102
16 | quamnet | 0.720 | 118 | chuen | 0.130 | 220 | director | -0.115
17 | term market | 0.673 | 119 | kenny chairman | 0.124 | 221 | mpfinance | -0.130
18 | expect | 0.636 | 120 | kenny | 0.124 | 222 | etnet sam | -0.136
19 | hsi | 0.619 | 121 | professional | 0.124 | 223 | stocks | -0.137
20 | missing hsi | 0.617 | 122 | professional commentators | 0.124 | 224 | consolidate | -0.145
21 | chance | 0.602 | 123 | etnet tang | 0.124 | 225 | financial holdings | -0.148
22 | expected remain | 0.582 | 124 | institute financial | 0.124 | 226 | holdings | -0.148
23 | outlook | 0.577 | 125 | analysts professional | 0.124 | 227 | index continue | -0.157
24 | market outlook | 0.577 | 126 | sing | 0.124 | 228 | rebound | -0.157
25 | remain | 0.560 | 127 | kong institute | 0.124 | 229 | fluctuate | -0.158
26 | kong stocks | 0.542 | 128 | tang | 0.124 | 230 | high | -0.163
27 | continue | 0.529 | 129 | tang sing | 0.124 | 231 | test resistance | -0.165
28 | points support | 0.528 | 130 | hing kenny | 0.124 | 232 | new | -0.172
29 | commentator coming | 0.525 | 131 | hing | 0.124 | 233 | new high | -0.172
30 | appledaily | 0.521 | 132 | commentators | 0.124 | 234 | management index | -0.173
31 | days | 0.510 | 133 | chairman hong | 0.124 | 235 | sessions | -0.175
32 | expected today | 0.510 | 134 | commentators limited | 0.124 | 236 | coming sessions | -0.175
33 | remain level | 0.496 | 135 | sing hing | 0.124 | 237 | support points | -0.181
34 | pattern | 0.491 | 136 | financial analysts | 0.124 | 238 | expected consolidate | -0.186
35 | level support | 0.472 | 137 | bun | 0.124 | 239 | limited | -0.187
36 | today | 0.449 | 138 | man | 0.124 | 240 | trading | -0.188
37 | today fall | 0.445 | 139 | kwong man | 0.124 | 241 | mingpao mpfinance | -0.196
38 | analysts | 0.439 | 140 | man bun | 0.124 | 242 | mingpao | -0.196
39 | maintain pattern | 0.427 | 141 | yu kwan | 0.100 | 243 | ying | -0.209
40 | etnet | 0.423 | 142 | kwan | 0.100 | 244 | level | -0.220
41 | limited hong | 0.378 | 143 | quamnet yu | 0.100 | 245 | points today | -0.230
42 | cnfol | 0.375 | 144 | kwan lung | 0.100 | 246 | target | -0.230
43 | management | 0.369 | 145 | lung | 0.100 | 247 | mark | -0.240
44 | ho | 0.362 | 146 | lung independent | 0.100 | 248 | stocks expected | -0.251
45 | high year | 0.341 | 147 | capital | 0.099 | 249 | missing | -0.253
46 | kwong | 0.335 | 148 | yu | 0.097 | 250 | stock | -0.261
47 | institute | 0.334 | 149 | expected fluctuate | 0.096 | 251 | research hang | -0.274
48 | challenge | 0.331 | 150 | ka | 0.093 | 252 | hit | -0.283
49 | mpfinance hang | 0.328 | 151 | economic times | 0.074 | 253 | wai | -0.293
50 | chan | 0.324 | 152 | times | 0.072 | 254 | etnet kwong | -0.314
51 | outlook expected | 0.320 | 153 | commentator hang | 0.071 | 255 | rise | -0.315
52 | points level | 0.314 | 154 | expected challenge | 0.070 | 256 | management expected | -0.320
53 | expected continue | 0.313 | 155 | limited believe | 0.069 | 257 | end | -0.327
54 | week | 0.308 | 156 | kwok | 0.064 | 258 | test level | -0.337
55 | expected test | 0.302 | 157 | market | 0.056 | 259 | chi | -0.351
56 | director head | 0.285 | 158 | research | 0.045 | 260 | managing | -0.353
57 | index expected | 0.278 | 159 | chong | 0.033 | 261 | managing director | -0.353
58 | chairman | 0.274 | 160 | financial capital | 0.033 | 262 | near | -0.357
59 | high points | 0.269 | 161 | capital limited | 0.033 | 263 | repeated | -0.363
60 | test points | 0.264 | 162 | ho chief | 0.033 | 264 | index | -0.364
61 | month | 0.263 | 163 | chi ho | 0.033 | 265 | aastock chik | -0.382
62 | expected open | 0.263 | 164 | chong chi | 0.033 | 266 | index opened | -0.404
63 | mpfinance hong | 0.252 | 165 | chief financial | 0.033 | 267 | opened | -0.404
64 | level points | 0.249 | 166 | etnet chong | 0.033 | 268 | performance | -0.407
65 | hovering | 0.244 | 167 | chief | 0.032 | 269 | breakthrough | -0.418
66 | kong economic | 0.231 | 168 | daily | 0.032 | 270 | expected hover | -0.432
67 | hong | 0.230 | 169 | expect index | 0.023 | 271 | fluctuate coming | -0.436
68 | hong kong | 0.230 | 170 | limited hsi | 0.023 | 272 | point | -0.441
69 | kong | 0.230 | 171 | management market | 0.018 | 273 | today points | -0.463
70 | higher | 0.228 | 172 | kwok ka | 0.011 | 274 | expected hit | -0.470
71 | economic | 0.226 | 173 | goldjoy | 0.011 | 275 | resistance points | -0.486
72 | open | 0.221 | 174 | ka yiu | 0.011 | 276 | management short | -0.488
73 | hkej | 0.215 | 175 | china goldjoy | 0.011 | 277 | test today | -0.488
74 | asset | 0.215 | 176 | director china | 0.009 | 278 | expected maintain | -0.504
75 | asset management | 0.215 | 177 | yiu managing | 0.009 | 279 | trend | -0.505
76 | short term | 0.210 | 178 | goldjoy asset | 0.009 | 280 | good | -0.509
77 | short | 0.209 | 179 | kgi executive | 0.008 | 281 | test | -0.512
78 | china | 0.200 | 180 | bun kgi | 0.008 | 282 | wong | -0.524
79 | head | 0.196 | 181 | executive director | 0.008 | 283 | investment | -0.583
80 | head research | 0.196 | 182 | kgi | 0.008 | 284 | open today | -0.614
81 | independent | 0.187 | 183 | executive | 0.008 | 285 | range points | -0.621
82 | independent stock | 0.187 | 184 | hang seng | 0.002 | 286 | securities | -0.629
83 | commentator | 0.187 | 185 | hang | 0.002 | 287 | mpfinance hsi | -0.630
84 | stock commentator | 0.187 | 186 | seng | 0.002 | 288 | coming | -0.672
85 | strategist south | 0.186 | 187 | seng index | 0.002 | 289 | break | -0.684
86 | yung | 0.186 | 188 | financial | -0.004 | 290 | aastock | -0.714
87   sam                         0.186   189  department                 -0.005   291  continue level             -0.733
88   south china                 0.186   190  research department        -0.005   292  points points              -0.747
89   senior strategist           0.186   191  hit daily                  -0.030   293  level today                -0.750
90   senior                      0.186   192  challenge resistance       -0.035   294  missing hong               -0.772
91   sam chi                     0.186   193  fluctuate points           -0.038   295  resistance                 -0.783
92   south                       0.186   194  yiu                        -0.040   296  market expected            -0.872
93   yung senior                 0.186   195  expected opportunity       -0.041   297  fall                       -0.896
94   china financial             0.186   196  support near               -0.044   298  expected                   -0.909
95   chi yung                    0.186   197  maintain                   -0.047   299  challenging                -0.970
96   bright                      0.184   198  maintained                 -0.055   300  contention                 -0.972
97   bright smart                0.184   199  maintained points          -0.055   301  freeman securities         -1.021
98   smart                       0.184   200  francis                    -0.059   302  freeman                    -1.021
99   smart securities            0.184   201  kwok sze                   -0.059   303  morning                    -1.046
100  strategist                  0.182   202  sze chi                    -0.059   304  highs                      -1.084
101  term index                  0.178   203  chi francis                -0.059   305  consolidation              -1.126
102  hold                        0.171   204  sze                        -0.059   306  hsi expected               -1.258
Appendix IV - Feature Importance Naïve Bayes
Rank feat                        coeff   Rank feat                        coeff   Rank feat                        coeff
1    today                      -4.104   103  stock                      -6.149   205  rise                       -6.651
2    mpfinance                  -4.110   104  management expected        -6.168   206  sina                       -6.652
3    mingpao                    -4.111   105  level points               -6.179   207  expected consolidate       -6.654
4    mingpao mpfinance          -4.111   106  points points              -6.189   208  etnet kwong                -6.656
5    expected                   -4.131   107  range                      -6.201   209  high year                  -6.672
6    hsi                        -4.136   108  high points                -6.219   210  yung senior                -6.675
7    mpfinance hsi              -4.163   109  expected remain            -6.226   211  senior strategist          -6.675
8    hsi expected               -4.175   110  financial                  -6.229   212  south china                -6.675
9    expected fluctuate         -4.209   111  expect                     -6.236   213  senior                     -6.675
10   fluctuate                  -4.210   112  market expected            -6.252   214  sam chi                    -6.675
11   fluctuate today            -4.215   113  expected hover             -6.260   215  south                      -6.675
12   points                     -4.527   114  outlook                    -6.262   216  chi yung                   -6.675
13   index                      -4.717   115  market outlook             -6.262   217  sam                        -6.675
14   yiu                        -4.885   116  management short           -6.262   218  strategist south           -6.675
15   aastock                    -4.900   117  believe                    -6.262   219  china financial            -6.675
16   director                   -4.924   118  expected open              -6.271   220  yung                       -6.675
17   china                      -4.937   119  stock commentator          -6.278   221  chief                      -6.675
18   kwok                       -4.950   120  commentator                -6.278   222  trading                    -6.676
19   management                 -4.950   121  independent                -6.278   223  ho                         -6.684
20   ka                         -4.969   122  independent stock          -6.278   224  holdings                   -6.687
21   asset management           -4.972   123  management market          -6.281   225  financial holdings         -6.687
22   asset                      -4.972   124  yu                         -6.291   226  rebound resistance         -6.692
23   china goldjoy              -4.983   125  remain level               -6.295   227  challenge resistance       -6.694
24   goldjoy                    -4.983   126  points support             -6.295   228  chan                       -6.700
25   ka yiu                     -4.983   127  support near               -6.335   229  range points               -6.702
26   kwok ka                    -4.983   128  higher                     -6.352   230  term index                 -6.707
27   aastock kwok               -4.990   129  hkej                       -6.353   231  days                       -6.717
28   director china             -4.990   130  today points               -6.378   232  breakthrough               -6.717
29   goldjoy asset              -4.990   131  kwong                      -6.379   233  capital                    -6.724
30   yiu managing               -4.990   132  expected challenge         -6.380   234  commentator coming         -6.733
31   managing director          -4.993   133  resistance support         -6.390   235  wai                        -6.733
32   managing                   -4.993   134  support points             -6.393   236  continue points            -6.734
33   level                      -5.029   135  target                     -6.399   237  temporarily                -6.735
34   hang                       -5.124   136  limited expected           -6.405   238  today fall                 -6.735
35   seng                       -5.124   137  month                      -6.409   239  good                       -6.736
36   seng index                 -5.124   138  hover points               -6.410   240  wong                       -6.747
37   hang seng                  -5.124   139  management expect          -6.417   241  open today                 -6.750
38   index expected             -5.158   140  quamnet yu                 -6.422   242  commentator hang           -6.757
39   test                       -5.208   141  lung                       -6.422   243  hit                        -6.760
40   missing                    -5.248   142  kwan lung                  -6.422   244  daily                      -6.762
41   hong kong                  -5.268   143  kwan                       -6.422   245  open higher                -6.777
42   hong                       -5.268   144  lung independent           -6.422   246  economic times             -6.778
43   kong                       -5.268   145  yu kwan                    -6.422   247  expected opportunity       -6.782
44   market                     -5.303   146  singtao                    -6.427   248  highs                      -6.783
45   support                    -5.309   147  forecast                   -6.432   249  sze chi                    -6.784
46   stocks                     -5.367   148  coming                     -6.434   250  sze                        -6.784
47   term                       -5.392   149  outlook expected           -6.436   251  kwok sze                   -6.784
48   expected today             -5.420   150  expected maintain          -6.448   252  francis                    -6.784
49   kong stocks                -5.422   151  economic                   -6.456   253  chi francis                -6.784
50   short                      -5.424   152  kwong man                  -6.472   254  new high                   -6.825
51   short term                 -5.442   153  man bun                    -6.472   255  new                        -6.825
52   continue                   -5.499   154  man                        -6.472   256  financial capital          -6.833
53   research                   -5.541   155  bun                        -6.472   257  chief financial            -6.833
54   expected test              -5.558   156  appledaily                 -6.473   258  chi ho                     -6.833
55   limited                    -5.561   157  director head              -6.485   259  chong                      -6.833
56   management index           -5.571   158  end                        -6.485   260  etnet chong                -6.833
57   points today               -5.572   159  test resistance            -6.490   261  capital limited            -6.833
58   head research              -5.573   160  opportunity test           -6.492   262  ho chief                   -6.833
59   head                       -5.573   161  term market                -6.493   263  chong chi                  -6.833
60   securities                 -5.608   162  limited hsi                -6.505   264  narrow                     -6.843
61   mpfinance hang             -5.690   163  fall                       -6.513   265  repeated                   -6.845
62   etnet                      -5.702   164  expect index               -6.515   266  etnet sam                  -6.846
63   year                       -5.713   165  break                      -6.515   267  maintain pattern           -6.863
64   bright smart               -5.714   166  oncc                       -6.517   268  research hang              -6.867
65   smart                      -5.714   167  chance                     -6.522   269  limited believe            -6.886
66   smart securities           -5.714   168  kong economic              -6.526   270  consolidation              -6.888
67   bright                     -5.714   169  rebound                    -6.543   271  missing hong               -6.891
68   resistance                 -5.735   170  strategist                 -6.545   272  commentators               -6.897
69   stocks expected            -5.751   171  hold                       -6.549   273  analysts professional      -6.897
70   challenge                  -5.771   172  consolidate                -6.557   274  kenny chairman             -6.897
71   group                      -5.806   173  investment                 -6.566   275  commentators limited       -6.897
72   research department        -5.833   174  point                      -6.569   276  hing kenny                 -6.897
73   department                 -5.833   175  kgi executive              -6.572   277  hing                       -6.897
74   chik                       -5.834   176  kgi                        -6.572   278  chairman hong              -6.897
75   fai                        -5.834   177  executive director         -6.572   279  professional commentators  -6.897
76   group limited              -5.836   178  executive                  -6.572   280  institute financial        -6.897
77   expected continue          -5.838   179  bun kgi                    -6.572   281  kenny                      -6.897
78   missing missing            -5.850   180  cnfol                      -6.575   282  professional               -6.897
79   yiu fai                    -5.863   181  fluctuate points           -6.575   283  etnet tang                 -6.897
80   department bright          -5.863   182  times                      -6.577   284  kong institute             -6.897
81   fai head                   -5.863   183  resistance points          -6.580   285  tang sing                  -6.897
82   securities commodities     -5.863   184  mark                       -6.582   286  tang                       -6.897
83   commodities group          -5.863   185  ying                       -6.594   287  sing                       -6.897
84   chik yiu                   -5.863   186  index continue             -6.595   288  sing hing                  -6.897
85   commodities                -5.863   187  sessions                   -6.605   289  financial analysts         -6.897
86   hover                      -5.866   188  coming sessions            -6.605   290  fluctuate coming           -6.913
87   high                       -5.882   189  chairman                   -6.610   291  performance                -6.925
88   test points                -5.915   190  institute                  -6.616   292  maintained                 -6.949
89   aastock chik               -5.943   191  shek                       -6.618   293  maintained points          -6.949
90   test level                 -5.954   192  shek kang                  -6.618   294  continue level             -6.958
91   mpfinance hong             -5.966   193  etnet shek                 -6.618   295  hit daily                  -6.959
92   opportunity                -6.026   194  chuen                      -6.618   296  expected hit               -6.973
93   near                       -6.050   195  chuen arthur               -6.618   297  test today                 -6.984
94   maintain                   -6.058   196  kang                       -6.618   298  level today                -6.996
95   open                       -6.065   197  kang chuen                 -6.618   299  trend                      -7.015
96   chi                        -6.072   198  research hong              -6.618   300  challenging                -7.023
97   points level               -6.099   199  arthur head                -6.618   301  index opened               -7.077
98   week                       -6.122   200  arthur                     -6.618   302  opened                     -7.077
99   remain                     -6.128   201  missing hsi                -6.644   303  contention                 -7.096
100  pattern                    -6.128   202  analysts                   -6.644   304  morning                    -7.124
101  level support              -6.129   203  limited hong               -6.645   305  freeman securities         -7.321
102  quamnet                    -6.148   204  hovering                   -6.648   306  freeman                    -7.321
Appendix V - Algorithm using Python Programming Coding
import pandas as pd
import numpy as np
from datetime import timedelta
from sklearn.model_selection import KFold, train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import re
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# Function to handle missing values
def handle_missing(df):
    # If date of prediction is missing, drop the row
    df = df.dropna(subset=['DateOfPrediction'])
    # Put 'missing' as value in cells with no data
    df = df.fillna('missing')
    return df

# Lower-case the text and replace possessives and non-letter characters with spaces
def cleanData(text):
    text = text.lower()
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"[^A-Za-z]", " ", text)
    return text

# Read excel file having stock commentary
df = pd.read_excel("data/Stock Commentary and Prediction.xlsx")

# Remove unnamed columns and website link column
df.drop(['Unnamed: 8', 'Unnamed: 9', 'WebsiteLink'], axis=1, inplace=True)

# Handle missing values
df = handle_missing(df)

# Check whether datetime values in DateOfPrediction are in the same format
df['date_len'] = df['DateOfPrediction'].apply(lambda x: len(str(x)))

# Drop values with no dates mentioned
df = df.loc[df['date_len'] > 9, :]

# Convert date of prediction & commentary into datetime columns
df['DateOfPrediction'] = pd.to_datetime(df['DateOfPrediction'])
df['DateOfComentary'] = pd.to_datetime(df['DateOfComentary'])

# Import historical stock prices
stock = pd.read_csv("data/stock_price.csv")
stock = stock.sort_values(by='Date', ascending=True)

# Previous day settlement price shifted by one is that day's closing price
stock['Closing_Price'] = stock['Prev. Day Settlement Price'].shift(-1)
stock = stock[['Date', 'Open', 'Closing_Price']]
print(stock.head(5))

# Convert date to datetime & merge with stock to get opening value at stock commentary date
stock['Date'] = pd.to_datetime(stock['Date'])
stock = stock.rename(columns={'Date': 'DateOfComentary'})

df = pd.merge(df, stock, on='DateOfComentary', how='left')

# Rename the key & merge with stock again to get closing value at stock prediction date
stock = stock.rename(columns={'DateOfComentary': 'DateOfPrediction'})
df = pd.merge(df, stock, on='DateOfPrediction', how='left')

print(df.head(5))

df.dropna(axis=0, inplace=True)
df = df.reset_index(drop=True)

df['days_gap'] = (df['DateOfPrediction'] - df['DateOfComentary']).dt.days

print(df.loc[df['days_gap'] > 90, :].shape)
df.drop(['Closing_Price_x', 'Open_y', 'date_len', 'DateOfPrediction',
         'DateOfComentary'], axis=1, inplace=True)

# changed row no. 9 in excel wrt dateofcommentary

# Keep the open on the commentary date and the close on the prediction date
df = df.rename(columns={'Open_x': 'Open', 'Closing_Price_y': 'Closing_Price'})

print(df.head(5))

df = df.loc[df['HSIPrediction'] != 'missing', :].reset_index(drop=True)

# Target variable: whether the actual movement in the market is upward or downward
df['movement'] = (df['Closing_Price'] > df['Open']).astype(int)

# Predicted movement by analyst
df['pred_movement'] = (df['HSIPrediction'] > df['Open']).astype(int)

# Columns to use for prediction
df['text'] = (df['SocialMedia'].astype(str) + " "
              + df['NameOfStockAnalyztOrSocial Media'].astype(str) + " "
              + df['Position_(If_necessary)'].astype(str) + " "
              + df['CommentsOfHSI(ENG)'].astype(str))
df['text'] = df['text'].apply(cleanData)

# Target column
target = ['movement']
columns = ['text', 'pred_movement']

y = df[target]
X = df[columns]

# TF-IDF - convert text into vectors
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=5, stop_words='english')

# Split dataset into train, test - stratify the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y,
                                                    random_state=23, shuffle=True)

tfidf.fit(X_train['text'].values)  # Fit TF-IDF on training set
train_vect = tfidf.transform(X_train['text'].values)  # Transform training set

# Combine pred movement column with vectorized features obtained using TF-IDF on text
train_vect = np.column_stack((train_vect.toarray(), X_train['pred_movement']))

# Fit LinearSVC model on training set
model_SVC = LinearSVC()
model_SVC.fit(train_vect, y_train)

# Preparing test set for predictions
test_vect = tfidf.transform(X_test['text'].values)  # Use TF-IDF fitted on training set
# Combine TF-IDF output and pred movement
test_vect = np.column_stack((test_vect.toarray(), X_test['pred_movement']))

# Predict using LinearSVC fitted on train set
pred = model_SVC.predict(test_vect)
X_test['prediction_SVC'] = pred

algo = []
acc = []
f1 = []

print("Accuracy for Linear SVC: ", accuracy_score(y_test, pred) * 100)
print("F1 score for Linear SVC: ", f1_score(y_test, pred))

algo.append('SVC')
acc.append(accuracy_score(y_test, pred) * 100)
f1.append(f1_score(y_test, pred))

# Fit Multinomial Naive Bayes model on training set
model_NB = MultinomialNB()
model_NB.fit(train_vect, y_train)

# Predict using Multinomial Naive Bayes fitted on train set
pred = model_NB.predict(test_vect)
X_test['prediction_NB'] = pred
X_test['actual'] = y_test

print("Accuracy for Multinomial Naive Bayes: ", accuracy_score(y_test, pred) * 100)
print("F1 score for Multinomial Naive Bayes: ", f1_score(y_test, pred))

algo.append('NB')
acc.append(accuracy_score(y_test, pred) * 100)
f1.append(f1_score(y_test, pred))

# Fit Random Forest model on training set
model_RF = RandomForestClassifier()
model_RF.fit(train_vect, y_train)

# Predict using Random Forest fitted on train set
pred = model_RF.predict(test_vect)
X_test['prediction_RF'] = pred
X_test['actual'] = y_test

algo.append('RF')
acc.append(accuracy_score(y_test, pred) * 100)
f1.append(f1_score(y_test, pred))

print("Accuracy for Random Forest: ", accuracy_score(y_test, pred) * 100)
print("F1 score for Random Forest: ", f1_score(y_test, pred))

# Fit XGBoost model on training set
model_XGB = xgb.XGBClassifier()
model_XGB.fit(train_vect, y_train)

# Predict using XGBoost fitted on train set
pred = model_XGB.predict(test_vect)
X_test['prediction_XGB'] = pred
X_test['actual'] = y_test

print("Accuracy for Xgboost: ", accuracy_score(y_test, pred) * 100)
print("F1 score for Xgboost: ", f1_score(y_test, pred))

algo.append('XGBoost')
acc.append(accuracy_score(y_test, pred) * 100)
f1.append(f1_score(y_test, pred))

X_test.to_csv("prediction_NB_SVM_XGB_RF.csv", index=False)

# Storing feature importance for Random Forest and XGBoost (text based features only;
# the last column, pred_movement, is dropped)
var_imp_RF_XGB = pd.DataFrame()
# get_feature_names_out() replaces get_feature_names() from scikit-learn 1.0 onwards
var_imp_RF_XGB['feat'] = tfidf.get_feature_names_out()
var_imp_RF_XGB['importance_RF'] = model_RF.feature_importances_[0:-1]
var_imp_RF_XGB['importance_XGBM'] = model_XGB.feature_importances_[0:-1]
var_imp_RF_XGB.to_csv('feature_importance_RF_XGB.csv', index=False)

# Storing feature importance for Linear SVC (for text based features only)
var_imp_SVC = pd.DataFrame()
# get_feature_names_out() replaces get_feature_names() from scikit-learn 1.0 onwards
var_imp_SVC['feat'] = tfidf.get_feature_names_out()
# Coefficients indicate importance of variable for predictions
var_imp_SVC['coeff'] = model_SVC.coef_[0, :-1]
var_imp_SVC = var_imp_SVC.sort_values('coeff', ascending=False)

print(" Top 10 positive features for linear SVM")
print(var_imp_SVC.head(10))

print(" Top 10 negative features for linear SVM")
print(var_imp_SVC.tail(10).sort_values(by='coeff', ascending=True))

var_imp_SVC.to_csv("feature_importance_SVC_" + ".csv", index=False)

# Storing feature importance for Multinomial NB
var_imp_NB = pd.DataFrame()
var_imp_NB['feat'] = tfidf.get_feature_names_out()
# Log-probability of each feature given class 1 (the values older scikit-learn
# releases exposed as coef_) indicates importance of variable for predictions
var_imp_NB['coeff'] = model_NB.feature_log_prob_[1, :-1]
var_imp_NB = var_imp_NB.sort_values('coeff', ascending=False)
print("Top 20 features for Multinomial NB")
print(var_imp_NB.head(20))
var_imp_NB.to_csv("feature_importance_NB_" + ".csv", index=False)

# Plotting
plt.bar(algo, acc)
plt.title("Accuracy for different algorithms")
plt.ylabel('% Accuracy')
plt.show()

plt.bar(algo, f1)
plt.title("f1 score for different algorithms")
plt.ylabel('F1 score')
plt.show()

var_imp_RF_XGB = var_imp_RF_XGB.sort_values(by='importance_RF', ascending=False)
plt.bar(var_imp_RF_XGB['feat'].head(10),
        var_imp_RF_XGB['importance_RF'].head(10))
plt.xticks(rotation=45, size=6)
plt.title('Feature Importance Random Forest')
plt.ylabel('importance')
plt.show()

var_imp_RF_XGB = var_imp_RF_XGB.sort_values(by='importance_XGBM', ascending=False)
plt.bar(var_imp_RF_XGB['feat'].head(10),
        var_imp_RF_XGB['importance_XGBM'].head(10))
plt.xticks(rotation=45, size=6)
plt.title('Feature Importance XGBoost')
plt.ylabel('importance')
plt.show()

plt.bar(var_imp_SVC['feat'].head(10), var_imp_SVC['coeff'].head(10))
plt.bar(var_imp_SVC['feat'].tail(10), var_imp_SVC['coeff'].tail(10))
plt.xticks(rotation=45, size=6)
plt.title('Feature Importance SVC')
plt.ylabel('importance')
plt.show()

plt.bar(var_imp_NB['feat'].head(20), var_imp_NB['coeff'].head(20))
plt.xticks(rotation=45, size=6)
plt.title('Feature Importance Naive Bayes')
plt.ylabel('importance')
plt.show()

Appendix VI - Feature of raw data for algorithm
