Equity Research Report-Driven Investment Strategy
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2021.3067691, IEEE Access
ABSTRACT This research proposes and examines an investment strategy that combines natural language processing of equity research reports published in the Korean financial market with machine learning algorithms for binary classification. First, we extract the part-of-speech tags from each report using KoNLPy and Mecab. Then, we define 33 features as the input variables and perform binary classification on the price direction of the stocks recommended in the reports using various machine learning algorithms. We investigate the model performance in detail by dividing the entire period into three sub-periods: pre-COVID-19 for the sideways market, COVID-19 for the crashing market, and post-COVID-19 for the extremely bullish market. We confirm that the random forest is the best classifier for all periods, so we use the stocks it predicts as positive in the test set as the investment universe for monthly re-balancing and buy-and-hold investment. The proposed strategy shows a significantly higher return on investment than the benchmarks during the pre-COVID-19 and COVID-19 periods, and a comparable return during the post-COVID-19 period.
INDEX TERMS Finance, natural language processing, stock markets, equity research reports, binary classification, investment strategy
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
P. Cho et al.: Equity research report-driven investment strategy in Korea using binary classification on stock price direction
various deep learning algorithms such as Artificial Neural Network [22]–[24] and Recurrent Neural Network [25]–[27]. Depending on how the features of the text are extracted, they may not reflect the movement of the stock price well. Hence, the extraction of features representing the characteristics of natural language is an important topic in NLP. Feature engineering methods include linguistic features [28], [29], keyword extraction [30], data reduction with a generative probabilistic model [31], and word embedding with n-grams [28], [32], TF-IDF [33], ensemble models [34], and deep learning [35].

Furthermore, there have been many studies on deriving investment strategies using NLP. Several studies have analyzed market sentiment information using a machine-learning algorithm to construct a portfolio. Such studies either design a neural network with an ensemble of evolving clustering and LSTM [36], propose a new follow-the-loser portfolio strategy from stock micro-blog posts using a semi-supervised learning method [37], or establish a trading strategy from new sentiment data using learning-to-rank algorithms [38]. Recently, a portfolio investment strategy that considers a shareholders' confidence index by combining the existing random forest with sentiment analysis [39] and an investment strategy that encodes external information from financial news using reinforcement learning [40] have also been proposed.

However, there have been limited efforts to establish an investment strategy based on the NLP of equity research reports published in the Korean financial market. Therefore, in this paper, we focus on analyzing the reports through NLP and investigate whether the induced information can be utilized for an investment strategy. First, the NLP elements are derived by quantifying the structure of each report in the form of part-of-speech (POS). Then, using the NLP elements as input features, we construct a binary classification model that predicts whether the stock recommended in a report produces a positive or negative return. The model with the best classification performance is selected by applying several machine learning algorithms. Finally, we propose an investment strategy that buys the stocks that the selected classifier predicts to yield a positive future return. To show the superiority of the proposed investment strategy, we compare its investment returns against two benchmarks: investing in all the stocks recommended by the reports, and the market index. In addition, to investigate whether the proposed investment strategy performs consistently under various market conditions, the investment returns of different periods are analyzed separately.

II. FRAMEWORK OF INVESTMENT STRATEGY
A. EQUITY RESEARCH REPORT
This study utilizes 34,780 equity research reports on stocks traded in Korean financial markets, published from 2019-01-01 to 2020-06-12. Analysts at securities firms write the reports and provide them in Portable Document Format (PDF) on the firm's website. Each report recommends one stock at a time, providing information on the target price, the current status and direction of the underlying company's business, and the degree of recommendation on the stock. Note that the number of reports varies across months, as summarized in Table 1.

A total of 1,118 stocks were recommended within the 34,780 research reports, and the reports are concentrated on a limited number of stocks, as presented in Figure 1. Specifically, the top 400 stocks account for 90% of all reports. One reason for such concentration could be investors' preference for stocks with high market capitalization or for thematic investing, which can promote readability and click rates on reports in favor of analysts. Due to the low diversity among the reports, many investors doubt the reports' utility in predicting the future returns of the underlying stock. However, it is also true that most of the reports are carefully written through sufficient research, consisting of the analyst's sentimental but reasonable opinions on the company. In this context, we assume that valuable reports with rich information are written based on clear facts, so their composition and even sentence structure could differ from those of unhelpful reports with limited value. First, we define a report as valuable if it recommends a stock that yields a positive return in the near future, and as unhelpful if the recommended stock yields a negative return in the near future. Then, we conduct a binary classification based on NLP and various machine learning algorithms to distinguish the composition of valuable reports.

B. FEATURE ENGINEERING WITH NLP
We utilize NLP to define the features for binary classification. First, the contents of each report written in Korean are divided into morpheme units. In English, meaning is decomposed based on spacing, whereas Korean text can be divided into morphemes containing two or more meanings without spacing. To analyze the Korean language, we employ KoNLPy [41], a Python package for NLP of the Korean language, and Mecab [42], a POS tagger that tags each morpheme with one of 43 detailed POS tags. However, 43 POS tags divide the sentence in too much detail and yield many features, which can cause overfitting. Therefore, in this study, the top 10 most used NLP elements are integrated and selected as summarized in Table 2.

Based on the ten selected POS, we utilize each POS frequency as a feature for the binary classification. Then, we create eight additional features that can represent the characteristics of an equity research report: Number of morphemes (subwords), Average number of morphemes per sentence (mean_subwords_per_sentence), Standard deviation of morphemes per sentence (std_subwords_per_sentence), Number of sentences ending with da (da), Number of sentences (sentence), Number of paragraphs (paragraph), Number of pages (page), Number of pages with words (page_with_word). Note that we use optical character recognition (OCR) to count the number of pages with words since there exist pages with only tables or pictures. In Korean, a perfect sentence
Month              Jan-2019  Feb-2019  Mar-2019  Apr-2019  May-2019  Jun-2019  Jul-2019  Aug-2019  Sep-2019
Number of reports  600       470       350       607       685       303       602       556       280

Month              Oct-2019  Nov-2019  Dec-2019  Jan-2020  Feb-2020  Mar-2020  Apr-2020  May-2020  Jun-2020
Number of reports  644       702       179       468       492       364       628       593       173
ends with da; otherwise, the sentence has omitted elements. A sentence that does not end with da typically conveys only some financial terms and provides limited implications for investment. Therefore, we assume that da can be a feature representing an equity research report's characteristics. Figure 2a shows the distributions of the ten selected POS and the additional features, all of which are skewed. Therefore, we apply a log-transformation to all variables, as illustrated in Figure 2b. As a result, we obtain variables whose distributions are close to the normal distribution, which is commonly used in machine learning for binary classification.

In addition, we include 15 different ratios based on the selected POS and additional features as follows: Noun ratio (noun/subwords), Adjective ratio (adjective/subwords), Verb ratio (verb/subwords), Adverb ratio (adverb/subwords), Postposition ratio (postposition/subwords), Determiner ratio (determiner/subwords), Ending ratio (ending/subwords), Number ratio (number/subwords), English ratio (english/subwords), Others ratio (others/subwords), Ratio of da in sentences (da/sentence), Changes in number of morphemes (std_subwords_per_sentence/mean_subwords_per_sentence), Morphemes per page (subwords/page), Morphemes per sentence (subwords/sentence), Ratio of pages with words (page_with_word/page). Note that we also apply the log-transformation to the ratios. Finally, we apply Min-max scaling to all 33 features to finalize the data pre-processing for the binary classification.

C. BINARY CLASSIFICATION & INVESTMENT STRATEGY
We propose a binary classification based on the pre-processed NLP-driven features, which predicts whether the stock suggested in the equity report will show a positive or negative
return in the future. Specifically, we utilize five well-known models. First, we employ the k-Nearest Neighbors (k-NN) classifier. The k-NN algorithm hinges on the assumption that similar data points are located close to each other [43]. Therefore, it calculates the distance between the test data and the input, which can be obtained as follows:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \tag{1}$$

where p and q refer to the data points that have coordinates of (p_1, p_2, ..., p_n) and (q_1, q_2, ..., q_n) in n dimensions, respectively.

Secondly, we utilize logistic regression with the sigmoid function as follows [44]:

$$\mathrm{cost}(W) = \frac{1}{m} \sum c(H(x), y) \tag{2}$$

$$c(H(x), y) = \begin{cases} -\log(H(x)), & y = 1 \\ -\log(1 - H(x)), & y = 0 \end{cases} \tag{3}$$

$$H(x) = \frac{1}{1 + e^{-(Wx + b)}} \tag{4}$$

where H(x), W, and b correspond to the sigmoid function, weight, and bias, respectively, and m is the number of training samples. For y = 1, the value of the cost function decreases as H(x) approaches 1 and increases as H(x) approaches 0; the opposite holds for y = 0.

Thirdly, we utilize the decision tree, which analyzes and represents patterns between data as a combination of possible rules and is built top-down from the root node [45]. To build a decision tree, we use the entropy of an area, which for data points belonging to m categories can be calculated as follows:

$$\mathrm{Entropy} = -\sum_{k=1}^{m} p_k \log_2(p_k) \tag{5}$$

where p_k refers to the percentage of the data points belonging to category k. The tree is trained to increase the homogeneity of each area and to reduce the impurity or uncertainty as much as possible; this reduction is called information gain.

Fourthly, we utilize the random forest. Since the decision tree is prone to overfitting, we employ an ensemble model that generates multiple decision trees and votes on each tree's classification results. It is obtained through bagging, which builds each decision tree from data sampled with replacement from the entire training data [46].

Lastly, we utilize gradient boosting, an ensemble model that produces a robust classifier by combining weak classifiers, typically decision trees [47]. It uses gradient descent: the loss function is differentiated with respect to the parameter to obtain the slope, and the parameter is adjusted so that the loss decreases. The loss function and its gradient are expressed as follows.

$$L(y_i, f(x_i)) = \frac{1}{2}(y_i - f(x_i))^2 \tag{6}$$

$$\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} = \frac{\partial \left[\frac{1}{2}(y_i - f(x_i))^2\right]}{\partial f(x_i)} = f(x_i) - y_i \tag{7}$$

where L refers to the loss function. The negative gradient is therefore the residual y_i - f(x_i).

For the experiment, we divide the data into train (70%) and test (30%) sets. We ensure that the partitioned data carry equivalent distributional characteristics in terms of the number of equity research reports per month as well as the number of reports per stock. Although many prediction problems in financial time series split the in-sample and out-of-sample data by time, our model can utilize random sampling since its explanatory variables do not depend on time. For 50 different random seeds, we compare the classification performances of the five models for different times after the report's release. Based on the model with the best performance, we simulate the backtesting with monthly re-balancing and
simple buy-and-hold investment strategies for different investment horizons using the positively predicted stocks in the test set. Then, we compare the investment performance with other benchmarks. A step-by-step scenario of the proposed investment strategy is illustrated in Figure 3.

III. EMPIRICAL RESULTS AND DISCUSSIONS
A. BINARY CLASSIFICATION PERFORMANCE
As previously stated, we utilize five machine learning models to predict the price direction of the stock recommended in each equity research report, using the report's NLP elements as the features. Table 3 summarizes the hyper-parameters for each model. We compare the binary classification performance of each model for different times in the future in terms of prediction accuracy and area under the receiver operating characteristic curve (AUC). Note that this research's main objective is to examine whether the equity research report's NLP elements can be used to construct an investment strategy. Therefore, based on these two simple measures, we select the model with the highest classification performance, analyze its classification results in detail using the precision, recall, and F1-score, and utilize it to establish an investment strategy.

The models predict the direction of the stock at 30, 60, 90, 120, 150, and 180 trading days after the report's release. We refer to this horizon as the prediction time. We consider the equity research reports published from 2019-01-01 to 2020-06-12. During this time, the Korean financial market experienced a sideways period with low volatility (2019-01-01 to 2020-01-20), a collapsing period due to the outbreak of COVID-19 (2020-01-21 to 2020-03-29), and a soaring period with an extremely bullish market (2020-03-30 to 2020-06-30). Specifically, we divide the periods based on the highest and lowest points of KOSPI200, the representative financial market index of Korea, within the entire period. In this way, the classification performance can be evaluated for different market conditions.

First, the average classification performances of each model for 50 different random seeds are summarized in Table 4. According to the results, the accuracy and AUC tend to increase as the prediction time increases for all models. This implies that a higher return can be expected when an investment strategy is established based on the prediction results for a long investment horizon. We choose the random forest as the primary classification model since it shows the highest accuracy and AUC for all prediction times.

The detailed classification performance of the random forest is summarized in Table 5. Compared to the accuracy and AUC, the F1-score is low and invariant across prediction times, which reduces the utility of the prediction model. Specifically, the low F1-score is caused by the relatively low recall. Note that the precision shares the same pattern as the accuracy and AUC. However, such a result does not affect the random forest's utility since the proposed investment strategy only utilizes the positively predicted stocks, whose future return is expected to be positive. A classification model with high precision but low recall in a binary classification indicates relatively fewer false positives than false negatives. In this context, the stocks predicted to be positive are likely to move in the actual positive direction, although the model cannot detect all stocks with a positive direction. Therefore, we can infer that an investment strategy based on the stocks predicted to have positive returns can produce a high profit.

We further investigate how the random forest classification varies in different market conditions, as summarized in Table 6. For the reports published during the sideways period, the accuracy increases as the investment period increases, but the AUC remains around 0.5. Hence, the corresponding investment strategy is expected to produce only a slight advantage over investing in all reports. During the collapsing period, both AUC and F1-score increase as the prediction time increases. Hence, the corresponding investment strategy is expected to produce a high profit over investing in all reports. During the soaring period, we observe a high accuracy and F1-score but relatively low AUC values. The high accuracy and F1-score arise because the target variable is biased toward the positive direction during the recovery from the COVID-19 pandemic shock. Therefore, the corresponding investment strategy is expected to show no significantly different profit compared to investing in all reports.

B. FEATURE IMPORTANCE
Prior to applying the binary classification to the investment strategy, we investigate the feature importance based on the random forest results. The average importance of each NLP element in the random forest is summarized in Figure 4. For the total period in Figure 4a, the most significant feature is the English ratio. Note that a low feature importance indicates no significant influence on predicting the direction of the stock price. Specifically, based on the median of the English ratio, the average investment return at a prediction time of 180 days for all reports with an English ratio lower than the median is -2.3%, whereas that for all reports with an English ratio higher than the median is 7.8%, which yields a difference of 10.1 percentage points in return. This implies that a report with a relatively high English ratio can be expected to show a positive return compared to a report with a low one. Likewise, the noun ratio, the second most crucial variable, shows a 7.2% difference in investment return based on the median. In this context, among the top 15 features showing high importance, we find that the NLP-elements that positively affect the investment return are English ratio, subwords per page, page word, and subwords. For most other NLP-elements, the lower the value, the higher the investment return. Interestingly, most of the ratios of NLP-elements show higher feature importance than the selected POS and additional features in Figure 3.

Furthermore, we examine the feature importance of NLP elements for different market conditions in Figures 4b, 4c, and 4d. Analogous to the total period, the ratios of NLP-elements show high feature importance in all periods. Therefore, we can conclude that the ratios of NLP-elements play a more important role than basic NLP elements regardless of market
FIGURE 3: Proposed investment framework using NLP-driven features from the equity research report
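The pre-processing stage of this framework (ratio features, log-transformation, Min-max scaling) can be sketched in a few lines of Python. This is only an illustrative sketch: the paper does not state which logarithm variant is used, so `log1p` (which keeps zero counts finite) is an assumption, and the function names and the four sample reports are hypothetical.

```python
import math

def log_transform(values):
    """Apply log(1 + x); log1p is an assumption, chosen so that zero counts stay finite."""
    return [math.log1p(v) for v in values]

def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def ratio(numerators, denominators):
    """Element-wise ratio such as noun / subwords; 0.0 when the denominator is 0."""
    return [n / d if d else 0.0 for n, d in zip(numerators, denominators)]

# Hypothetical counts for four reports (illustrative values, not the paper's data).
noun = [120, 340, 95, 610]
subwords = [400, 900, 310, 1500]

noun_ratio = ratio(noun, subwords)

# Each feature is log-transformed first and then scaled to [0, 1].
features = {
    "noun": min_max_scale(log_transform(noun)),
    "subwords": min_max_scale(log_transform(subwords)),
    "noun_ratio": min_max_scale(log_transform(noun_ratio)),
}
```

In a real pipeline the minimum and maximum used for scaling would typically be taken from the training set only and reused for the test set; the sketch scales one list at a time for brevity.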
TABLE 4: Average classification performances of each machine learning algorithms for different prediction times
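The two selection measures reported in Table 4, accuracy and AUC, can be computed as in the following sketch. It uses the rank interpretation of AUC (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one); the helper names, the 0.5 threshold, and the toy labels and scores are hypothetical and are not taken from the paper's experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: 1 = positive future return, 0 = negative future return.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.5, 0.2]            # hypothetical classifier scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed decision threshold of 0.5

acc_value = accuracy(y_true, y_pred)   # 4 of 6 labels match
auc_value = auc(y_true, scores)        # 7 of 9 pairs are ranked correctly
```

An AUC near 0.5 means the scores rank positives and negatives no better than chance, which is why accuracy alone can look favorable when the target variable is imbalanced.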
TABLE 5: Average precision and recall of random forest for different prediction times
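The relationship between precision, recall, and F1-score behind Table 5 can be illustrated with a small sketch. The labels below are hypothetical and only chosen to show a classifier with fewer false positives than false negatives, that is, higher precision than recall.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and their harmonic mean (F1); 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical labels: six stocks actually rose, four fell.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)   # 3, 1, 3, 3
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here one false positive against three false negatives gives a precision of 0.75 and a recall of 0.5. For a strategy that buys only the positively predicted stocks, precision is the quantity that determines how many purchased stocks actually rise.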
TABLE 6: Average classification performance of random forest for different market periods
(a) Total period (2019-01-01 to 2020-06-30)
(b) Sideways period (2019-01-01 to 2020-01-20)
(c) Collapsing period (2020-01-21 to 2020-03-29)
(d) Soaring period (2020-03-30 to 2020-06-30)
FIGURE 4: Feature importance for different market periods
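The median-split comparison used to interpret a feature's importance (the average return of reports above the feature's median versus those below it) can be sketched as follows. The function name and the sample values are hypothetical, and assigning reports that sit exactly on the median to the lower group is an assumption, since the text only distinguishes "lower" and "higher" than the median.

```python
from statistics import mean, median

def median_split_return_gap(feature_values, returns):
    """Return (mean return above median, mean return at or below median, gap)."""
    m = median(feature_values)
    high = [r for f, r in zip(feature_values, returns) if f > m]
    low = [r for f, r in zip(feature_values, returns) if f <= m]  # ties go low (assumption)
    if not high or not low:
        raise ValueError("both sides of the median need at least one report")
    high_mean, low_mean = mean(high), mean(low)
    return high_mean, low_mean, high_mean - low_mean

# Hypothetical English ratios and realized returns for six reports.
english_ratio = [0.01, 0.02, 0.05, 0.08, 0.10, 0.12]
realized_return = [-0.04, -0.01, 0.00, 0.05, 0.09, 0.10]

high_mean, low_mean, gap = median_split_return_gap(english_ratio, realized_return)
```

A positive gap indicates that higher feature values go together with higher returns in the sample; it describes an association in the data rather than a causal effect of the feature.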
conditions except for the determiner ratio and ending ratio. Also, subwords per page and number ratio show high feature importance regardless of market conditions. Note that the higher the subwords per page, the higher the investment return, while the lower the number ratio, the higher the investment return.

C. INVESTMENT PERFORMANCE
Finally, we perform the backtesting of two investment strategies, using the positively predicted stocks in the test set from the 50 random seeds as the investment universe. The first strategy is monthly re-balancing. First, we take a long position on the positively predicted stocks in the test set with equal weight. Then, after a month, we sell all the stocks purchased and repeat the process of taking a long position. Figure 5 shows the monthly average cumulative rate of return. The proposed strategy is shown as a blue line, and the monthly cumulative returns of KOSPI200 and of all stocks recommended in the equity research reports in the test set are provided as benchmarks with sky blue and gray lines, respectively. Note that the vertical lines indicate three standard deviations of the cumulative returns for each month. The result shows that the proposed strategy outperforms the returns of the other benchmarks. Besides, the strategy of buying all the stocks recommended by the reports slightly exceeds the KOSPI index, which supports some degree of reliability of the equity research reports in recommending stocks.

To compensate for the limitation of the cumulative return, the average return on investment in different market conditions based on a buy-and-hold strategy is summarized in Table 7 for investment horizons from 30 days to 180 days. For the total period, the proposed investment strategy yields significantly higher returns than the benchmark that invests in all stocks recommended by the reports, for all investment horizons. Also, the difference in returns between the two investment strategies increases as the investment period increases. During the sideways period, the proposed investment strategy shows slightly better returns than the benchmark. However, for the equity research reports published in the sideways period, the long-term investment horizon includes the collapsing period. Despite the sharp decline in the market, the proposed strategy does not record negative returns except for the investment horizon of 120 trading days, which is very encouraging. During the collapsing period, it yields significantly higher returns than the benchmark for all investment horizons. In particular, since the long-term investment horizon includes the soaring period, the proposed investment strategy can be considered able to detect stocks whose prices will rise rapidly during the recovery of a financial market after a market crash. Finally, during the soaring period, the presented model shows an investment return similar to that of the benchmark.

IV. CONCLUSIONS
Throughout this research, we explore the possibility of developing an investment framework using a binary classification based on NLP-elements of the equity research report. To the best of our knowledge, this is the first attempt to utilize the NLP-elements of equity research reports in Korea to establish investment strategies. Therefore, this research's novelty lies in showing that NLP-elements of the equity research report can be integrated into stock investment. In the experiments, the random forest shows the best classification performance, with an AUC higher than 0.5 during the sideways period and the collapsing period. Therefore, we select the random forest as the binary classification algorithm. Then, we perform the backtesting based on the classification results for monthly re-balancing and buy-and-hold over different investment horizons. As a result, we confirm that the proposed investment strategy generates higher returns than the benchmark during the sideways period and the collapsing period. In an extreme bull market, selecting stocks with a high expected return does not make much of a difference since any stock an investor chooses will yield a high return. However, an investment strategy that helps select stocks with a high future return during sideways or bearish markets has a significant implication for real-world investment practice. Therefore, for further research, we plan to utilize various portfolio theories to construct efficient investment strategies, rather than simple buy-and-hold, using the positively predicted stocks from the binary classification.

REFERENCES
[1] P. C. Tetlock, M. Saar-Tsechansky, and S. Macskassy, “More than words: Quantifying language to measure firms’ fundamentals,” The Journal of Finance, vol. 63, no. 3, pp. 1437–1467, 2008.
[2] R. P. Schumaker and H. Chen, “A discrete stock price prediction engine based on financial news,” Computer, vol. 43, no. 1, pp. 51–56, 2010.
[3] R. P. Schumaker, “Analyzing parts of speech and their impact on stock price,” Communications of the IIMA, vol. 10, no. 3, p. 1, 2010.
[4] F. Z. Xing, E. Cambria, and R. E. Welsch, “Natural language based financial forecasting: a survey,” Artificial Intelligence Review, vol. 50, no. 1, pp. 49–73, 2018.
[5] J. Si, A. Mukherjee, B. Liu, Q. Li, H. Li, and X. Deng, “Exploiting topic based twitter sentiment for stock prediction,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 24–29, 2013.
[6] J. Smailović, M. Grčar, N. Lavrač, and M. Žnidaršič, “Stream-based active learning for sentiment analysis in the financial domain,” Information Sciences, vol. 285, pp. 181–203, 2014.
[7] G. Ranco, D. Aleksovski, G. Caldarelli, M. Grčar, and I. Mozetič, “The effects of twitter sentiment on stock price returns,” PloS one, vol. 10, no. 9, p. e0138441, 2015.
[8] A. Papana, C. Kyrtsou, D. Kugiumtzis, and C. Diks, “Detecting causality in non-stationary time series using partial symbolic transfer entropy: evidence in financial data,” Computational Economics, vol. 47, no. 3, pp. 341–365, 2016.
[9] A. Tafti, R. Zotti, and W. Jank, “Real-time diffusion of information on twitter and the financial markets,” PloS one, vol. 11, no. 8, p. e0159226, 2016.
[10] R. Akita, A. Yoshihara, T. Matsubara, and K. Uehara, “Deep learning for stock prediction using numerical and textual information,” in 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), pp. 1–6, IEEE, 2016.
[11] S. R. Das and M. Y. Chen, “Yahoo! for amazon: Sentiment extraction from small talk on the web,” Management Science, vol. 53, no. 9, pp. 1375–1388, 2007.
[12] T. H. Nguyen and K. Shirai, “Topic modeling based sentiment analysis on social media for stock market prediction,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and
8 VOLUME X, XXXX
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2021.3067691, IEEE Access
P. Cho et al.: Equity research report-driven investment strategy in Korea using binary classification on stock price direction
TABLE 7: Investment returns of predicted & all stocks from the equity research reports for different investment horizons
the 7th International Joint Conference on Natural Language Processing similarity,” Expert Systems with Applications, vol. 118, pp. 411–424,
(Volume 1: Long Papers), pp. 1354–1364, 2015. 2019.
[13] W. Antweiler and M. Z. Frank, “Is all that talk just noise? the information [21] B. Weng, M. A. Ahmed, and F. M. Megahed, “Stock market one-day ahead
content of internet stock message boards,” The Journal of finance, vol. 59, movement prediction using disparate data sources,” Expert Systems with
no. 3, pp. 1259–1294, 2004. Applications, vol. 79, pp. 153–163, 2017.
[14] F. Li, “The information content of forward-looking statements in cor- [22] S. K. Khatri and A. Srivastava, “Using sentimental analysis in prediction
porate filings—a naïve bayesian machine learning approach,” Journal of of stock market investment,” in 2016 5th International Conference on
Accounting Research, vol. 48, no. 5, pp. 1049–1102, 2010. Reliability, Infocom Technologies and Optimization (Trends and Future
[15] N. Jegadeesh and D. Wu, “Word power: A new approach for content Directions)(ICRITO), pp. 566–569, IEEE, 2016.
analysis,” Journal of financial economics, vol. 110, no. 3, pp. 712–729, [23] X. Zhang, S. Qu, J. Huang, B. Fang, and P. Yu, “Stock market predic-
2013. tion via multi-source multiple instance learning,” IEEE Access, vol. 6,
[16] A. H. Huang, A. Y. Zang, and R. Zheng, “Evidence on the information pp. 50720–50728, 2018.
content of text in analyst reports,” The Accounting Review, vol. 89, no. 6, [24] M. Shastri, S. Roy, and M. Mittal, “Stock price prediction using artificial
pp. 2151–2180, 2014. neural model: an application of big data,” EAI Endorsed Transactions on
[17] X. Li, H. Xie, L. Chen, J. Wang, and X. Deng, “News impact on stock Scalable Information Systems, vol. 6, no. 20, 2019.
price return via sentiment analysis,” Knowledge-Based Systems, vol. 69, [25] J. Li, H. Bu, and J. Wu, “Sentiment-aware stock market prediction: A deep
pp. 14–23, 2014. learning method,” in 2017 international conference on service systems and
[18] F. Xu and V. Keelj, “Collective sentiment mining of microblogs in 24- service management, pp. 1–6, IEEE, 2017.
hour stock price movement prediction,” in 2014 IEEE 16th Conference on [26] M. Kraus and S. Feuerriegel, “Decision support from financial disclosures
Business Informatics, vol. 2, pp. 60–67, IEEE, 2014. with deep neural networks and transfer learning,” Decision Support Sys-
[19] Y. Xie and H. Jiang, “Stock market forecasting based on text min- tems, vol. 104, pp. 38–48, 2017.
ing technology: A support vector machine method,” arXiv preprint [27] M.-Y. Chen, C.-H. Liao, and R.-P. Hsieh, “Modeling public mood and
arXiv:1909.12789, 2019. emotion: Stock market trend prediction with anticipatory computing ap-
[20] W. Long, L. Song, and Y. Tian, “A new graphic kernel method of stock proach,” Computers in Human Behavior, vol. 101, pp. 402–408, 2019.
price trend prediction based on financial news semantic and structural [28] A. Onan, “An ensemble scheme based on language function analysis and