Comparing Clustering Algorithms Using Financial Time-Series Data
Volume: 3, Issue: 2
Page: 146-166
2019 International Journal of Science and Business
1. Introduction
Time-series data are observations collected at successive points in time, forming a continuous
sequence such as daily stock prices, daily currency exchange rates, daily temperatures, or
cancer growth rates. Time-series data now play an important role in research across
various disciplines, such as bioinformatics, robotics, medicine, chemistry, gesture recognition,
speech recognition, tracking, finance, biometrics, astronomy, manufacturing, etc. Data mining
of time-series data is therefore an active topic: it analyzes and drills into time-series data to
uncover hidden relationships, models, or insights, drawing on the principles of mathematics,
statistics, databases, pattern recognition, and machine learning (such as association rules,
classification, prediction, clustering, anomaly detection, and visualization). This paper studies
clustering analysis by comparing 3 clustering algorithm scenarios on various time-series
datasets (cryptocurrency, exchange rate currency, the Shanghai Stock Exchange, and the Stock
Exchange of Thailand 50), using Dynamic Time Warping (DTW) as the distance measure. DTW
still has a drawback: its dynamic calculation is expensive, so it takes a long time to compute
and cannot easily be sped up. After clustering, cluster evaluation is used to judge the quality
of each clustering algorithm scenario and to indicate which scenario suits each dataset.
2. Related work
Recently, big data, machine learning, and AI have attracted interest in many industries. The
growing amount of digital data has in turn increased the role of data analysis, which is
reflected in the continuously rising number of studies on clustering algorithms in past years
(Liao T. W. (2005), Rokach, L. (2009), Ling H E et al. (2007), Nayak J, Naik B and Behera H S.
(2015), Sasirekha, K. and Baby, P. (2013) and Rui, X., and D. Wunsch. (2005)). Time-series
clustering in particular is reviewed in depth, including the theory behind each processing
step, by Aghabozorgi, S. et al. (2015). The process of time-series clustering consists of
distance measurement, time-series prototype, a clustering algorithm, and cluster evaluation.
For distance measurement in time-series clustering, Berndt DJ and Clifford J (1994) proposed
dynamic time warping (DTW) to find patterns in time-series data, while time-series clustering
by approximate prototypes was proposed by Ville Hautamäki et al. (2008) to decide the cluster
representative. Another important point of this research is cluster validity indices (CVIs),
which are explained by Arbelaitz, O. et al. (2013). To carry out this research, R software was
used following the manual by Sardá-Espinosa (2018). Because of the strong interest in
time-series clustering, there are several applied studies on this topic. Tsay, R. S. (2010)
performed cluster analysis on time-series data of the American unemployment rate.
Niennattrakul, V., & Ratanamahatana, C. A. (2006) applied time-series clustering to compare
the efficiency of clustering multimedia data with a time-series representation method against
traditional processing, and demonstrated the benefit of the time-series representation
method. Saikhamwong N and Rimcharoen S. (2002) applied cluster analysis to stock data.
3. Time-series clustering
Time-series clustering consists of several parts: distance measurement, time-series
prototype, clustering algorithm, and clustering evaluation. Table 1 gives an overview of each
step of time-series clustering. This section briefly describes the basic time-series clustering
techniques used in this work. In this paper, the time-series data are of 2 types, equal length
and non-equal length. For each type, we compare clustering algorithms in 3 scenarios. The
cryptocurrency data and the Shanghai Stock Exchange (SSE 50) data have non-equal lengths,
so the algorithms used to cluster them are hierarchical clustering, partitional clustering with
k-medoid, and partitional clustering with k-shape. The other data type, exchange rate
currency and the Stock Exchange of Thailand (SET 50), has equal lengths and uses hierarchical
clustering, partitional clustering with k-medoid, and partitional clustering with TADPole. To
judge whether these clustering algorithm scenarios are suitable for each dataset, we use
clustering evaluation approaches such as the Silhouette index, COP index, DB index, DB* index,
and CH index. The flow chart of this research framework is shown in Figure 1.
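To make the partitional k-medoid scenario more concrete, the following is a minimal Python sketch of an alternating k-medoids procedure on a precomputed distance matrix. It is an illustration only and not the R implementation used in this study: the function names `assign` and `k_medoids`, the naive initialisation with the first k series, and the toy absolute-difference distances are all assumptions. In the experiments, the matrix would instead hold DTW distances between series.

```python
def assign(dist, medoids):
    """Label every object with the position of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed symmetric distance matrix."""
    medoids = list(range(k))  # assumption: naive start with the first k objects
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster became empty
                new_medoids.append(medoids[c])
                continue
            # the new medoid minimises the total distance to its cluster members
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy data: six one-dimensional "series" forming two obvious groups.
points = [1, 2, 3, 11, 12, 13]
dist = [[abs(p - q) for q in points] for p in points]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

Because the procedure only needs pairwise distances, the same sketch works for unequal-length series as long as the distance measure supports them.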
Fuzzy partitions
- Soft Rand Index
- Soft Adjusted Rand Index
- Soft Variation of Information
- Soft Normalized Mutual Information
Figure 2. Comparing Dynamic Time Warping distance measurement and Euclidean distance
measurement
Time-series data have features that differ from generic data and cause several problems. One
is very high dimensionality, which makes indexing hard (it is difficult to identify the locations
of data in a high-dimensional space). Another comes from the Dynamic Time Warping technique
itself: its dynamic calculation leads to long computation times and cannot easily be sped up.
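As a rough illustration of how DTW differs from Euclidean distance, and of why it is costly, the sketch below implements the textbook dynamic-programming form of DTW in plain Python. It is not the implementation used in this study; the absolute-difference local cost, the absence of a warping window, and the example series are assumptions. The nested loops make the cost proportional to the product of the two series lengths, which is the source of the long calculation time.

```python
import math


def dtw_distance(x, y):
    """Unconstrained DTW with absolute difference as the local cost."""
    n, m = len(x), len(y)
    # cost[i][j] = minimal cumulative cost of aligning x[:i] with y[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y repeats
                cost[i][j - 1],      # y advances, x repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def euclidean_distance(x, y):
    """Point-by-point distance; only defined for equal-length series."""
    if len(x) != len(y):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape, delayed, one point longer
c = [0.0, 0.0, 1.0, 2.0, 1.0]       # same shape, delayed, equal length

print(dtw_distance(a, b))        # DTW accepts unequal lengths
print(dtw_distance(a, c))
print(euclidean_distance(a, c))  # Euclidean penalises the delay more heavily
```

In this toy case DTW absorbs most of the time shift by warping the time axis, while the Euclidean distance compares the series strictly point by point.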
Shape-based distance (SBD)
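The Shape-Based Distance (SBD) was introduced as a component of the k-Shape clustering
algorithm. It is based on the cross-correlation with coefficient normalization ($NCC_c$)
sequence between two series and is sensitive to scale, so normalizing the data beforehand is
suggested. The distance formula is defined as:
$$SBD(x, y) = 1 - \max_{w} \frac{CC_w(x, y)}{\|x\|_2 \, \|y\|_2}$$
where $CC_w(x, y)$ is the cross-correlation of $x$ and $y$ at shift $w$, the fraction is the
$NCC_c$ sequence, and $\|\cdot\|_2$ is the $\ell_2$ norm.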
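A minimal Python sketch of the shape-based distance follows, assuming the usual definition of SBD as one minus the maximum of the coefficient-normalized cross-correlation over all shifts. The function names and the example series are hypothetical, and the cross-correlation is computed directly rather than with the FFT that efficient implementations typically use.

```python
import math


def z_normalize(x):
    """Rescale a series to zero mean and unit variance (suggested before SBD)."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    if std == 0:
        return [0.0] * len(x)
    return [(v - mean) / std for v in x]


def cross_correlation(x, y, shift):
    """Sum of x[i + shift] * y[i] over the overlapping positions."""
    total = 0.0
    for i in range(len(y)):
        j = i + shift
        if 0 <= j < len(x):
            total += x[j] * y[i]
    return total


def sbd(x, y):
    """Shape-based distance: 1 - max over shifts of the normalized cross-correlation."""
    norm = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    if norm == 0:
        return 1.0  # assumption: treat an all-zero series as uncorrelated
    shifts = range(-(len(y) - 1), len(x))
    best = max(cross_correlation(x, y, s) for s in shifts)
    return 1.0 - best / norm


x = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
y = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]  # the same bump, one step earlier
print(sbd(x, y))                          # close to 0: same shape up to a shift
print(sbd(z_normalize(x), z_normalize([10 * v for v in x])))  # scale removed
```

Values near 0 indicate very similar shapes; the distance is bounded between 0 and 2.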
Table 4.
Silhouette index
The Silhouette index is an internal validation measure of consistency within clusters. Some
algorithms need the k-value in advance, so the silhouette is used to choose the k-value, which
controls the number of clusters in a dataset; it reflects the relationship between the objects
in the dataset. The k-value with the maximum silhouette coefficient is preferred.
Definition: Suppose the dataset $D$ of $n$ objects is partitioned into $k$ clusters
$C_1, \dots, C_k$. For each object $o \in D$, we compute $a(o)$ as the average distance between
$o$ and all other objects in the cluster to which $o$ belongs. Likewise, $b(o)$ is the minimum
average distance from $o$ to all clusters to which $o$ does not belong. Formally, suppose
$o \in C_i$ ($1 \le i \le k$); then
$$a(o) = \frac{\sum_{o' \in C_i,\, o' \ne o} dist(o, o')}{|C_i| - 1}$$
and
$$b(o) = \min_{C_j:\, 1 \le j \le k,\, j \ne i} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}$$
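The following Python sketch computes silhouette values from a precomputed distance matrix, using the standard silhouette coefficient $s(o) = (b(o) - a(o)) / \max\{a(o), b(o)\}$ built from the two quantities defined above. The function name, the convention of scoring singleton clusters as 0, and the toy data are assumptions made for illustration.

```python
def silhouette_scores(dist, labels):
    """Per-object silhouette values; requires at least two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumption: singleton clusters score 0
            continue
        # a(o): average distance to the other members of the own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b(o): smallest average distance to any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


points = [1, 2, 3, 11, 12, 13]
dist = [[abs(p - q) for q in points] for p in points]

good = silhouette_scores(dist, [0, 0, 0, 1, 1, 1])
bad = silhouette_scores(dist, [0, 1, 0, 1, 0, 1])
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print(mean_good, mean_bad)
```

Averaging the per-object values for each candidate k and keeping the k with the largest average is one common way to select the number of clusters.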
COP index
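The COP index was first proposed for use together with a cluster hierarchy post-processing
algorithm, but it can also be used as an ordinary cluster validity index. It is a ratio-type index,
where cohesion is estimated by the distance from the points in a cluster to its centroid, and
separation is based on the furthest-neighbor distance. It is defined as:
$$COP(C) = \frac{1}{N} \sum_{c_k \in C} |c_k| \, \frac{intra_{COP}(c_k)}{inter_{COP}(c_k)}$$
where
$$intra_{COP}(c_k) = \frac{1}{|c_k|} \sum_{x_i \in c_k} d(x_i, \bar{c}_k), \qquad inter_{COP}(c_k) = \min_{x_i \notin c_k} \max_{x_j \in c_k} d(x_i, x_j)$$
Here $N$ is the number of objects, $c_k$ is a cluster with centroid $\bar{c}_k$, and $d$ is the
distance measure.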
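As a hedged illustration of the COP index, the sketch below follows the commonly cited definition: the size-weighted average over clusters of intra-cluster cohesion (mean distance to the centroid) divided by separation (the smallest furthest-neighbor distance from an outside point). Euclidean distance and arithmetic-mean centroids are assumptions chosen to keep the example self-contained; with time-series, a DTW-based distance and prototype would take their place.

```python
import math


def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cop_index(points, labels):
    """COP index (lower is better); requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for lab, members in clusters.items():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # cohesion: mean distance from the cluster members to the centroid
        intra = sum(euclid(p, centroid) for p in members) / len(members)
        # separation: for each outside point take its furthest member,
        # then keep the smallest of those distances
        outsiders = [p for p, l in zip(points, labels) if l != lab]
        inter = min(max(euclid(o, m) for m in members) for o in outsiders)
        total += len(members) * intra / inter
    return total / len(points)


points = [[1.0], [2.0], [3.0], [11.0], [12.0], [13.0]]
cop_good = cop_index(points, [0, 0, 0, 1, 1, 1])
cop_bad = cop_index(points, [0, 1, 0, 1, 0, 1])
print(cop_good, cop_bad)
```

A compact, well-separated partition yields a small ratio, so smaller COP values indicate better clusterings.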
4. Experiment result
4.1 Dataset
This paper experiments with 4 datasets: the cryptocurrency dataset, the SSE 50 dataset, the
exchange rate currency dataset, and the SET 50 dataset. The structure of each dataset is shown
in Table 5.
Table 5. Dataset structure detail
Dataset                   Length                    No. Currency / Stock   No. Data Points
Cryptocurrency, Zone A    1,866 (Unequal length)    154                    234,949
Cryptocurrency, Zone B    1,096 (Unequal length)    297                    130,663
Cryptocurrency, Zone C    522 (Unequal length)      1,192                  265,547
SSE50                     2,435 (Unequal length)    43                     99,353
Exchange rate currency    92 (Equal length)         146                    13,432
SET50                     244 (Equal length)        50                     12,200
Cryptocurrency dataset
The cryptocurrency dataset contains the historical closing prices of all cryptocurrencies from
the Kaggle website (www.kaggle.com/jessevent/all-crypto-currencies/home). The whole dataset
has 631,159 observations; the variables used are currency, date, and closing price, and the
dataset covers the period from 28 April 2013 to 21 May 2018. Because each cryptocurrency has
a different length, we split the cryptocurrencies into 3 zones: zone A has 154 cryptocurrencies
with a length of 1,866 and 234,949 data points, zone B has 297 cryptocurrencies with a length
of 1,096 and 130,663 data points, and zone C has 1,192 cryptocurrencies with a length of 522
and 265,547 data points. The time-series of each zone are shown in Figure 3.
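The zone figures can be cross-checked with a few lines of arithmetic. The snippet below uses only the per-zone counts listed in Table 5; the variable names are illustrative.

```python
# (number of cryptocurrencies, number of data points) per zone, from Table 5
zones = {
    "A": (154, 234_949),
    "B": (297, 130_663),
    "C": (1_192, 265_547),
}

total_series = sum(n for n, _ in zones.values())
total_points = sum(p for _, p in zones.values())
print(total_series, total_points)
```

The three zones together account for all 631,159 observations of the cryptocurrency dataset.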
In zone A, the hierarchical and partitional with k-medoid scenarios resulted in 2 clusters as
the best k-value, while partitional with k-shape resulted in 4 clusters. The series and centroid
results of the hierarchical scenario, the partitional with k-medoid scenario, and the
partitional with k-shape scenario are shown in Figure 10, Figure 11, and Figure 12
respectively.
Figure 7. Silhouette index of each algorithm scenario for each number of clusters k,
cryptocurrency zone A
Figure 8. Silhouette index of each algorithm scenario for each number of clusters k,
cryptocurrency zone B
Figure 9. Silhouette index of each algorithm scenario for each number of clusters k,
cryptocurrency zone C
Figure 10. The series (left) and centroid (right) result of the hierarchical scenario of
cryptocurrency time-series, zone A
Figure 11. The series (left) and centroid (right) result of partitional with the k-medoid
scenario of cryptocurrency time-series, zone A
Figure 12. The series (left) and centroid (right) result of partitional with the k-shape scenario
of cryptocurrency time-series, zone A
In zone B, the hierarchical and partitional with k-shape scenarios resulted in 2 clusters as the
best k-value, while partitional with k-medoid resulted in 4 clusters. The series and centroid
results of the hierarchical scenario, the partitional with k-medoid scenario, and the
partitional with k-shape scenario are shown in Figure 13, Figure 14, and Figure 15
respectively.
Figure 13. The series (left) and centroid (right) result of the hierarchical scenario of
cryptocurrency time-series, zone B
Figure 14. The series (left) and centroid (right) result of partitional with the k-medoid
scenario of cryptocurrency time-series, zone B
Figure 15. The series (left) and centroid (right) result of partitional with the k-shape scenario
of cryptocurrency time-series, zone B.
In zone C, the hierarchical, partitional with k-shape, and partitional with k-medoid scenarios
all resulted in 2 clusters as the best k-value. The series and centroid results of the
hierarchical scenario, the partitional with k-medoid scenario, and the partitional with k-shape
scenario are shown in Figure 16, Figure 17, and Figure 18 respectively.
Figure 16. The series (left) and centroid (right) result of the hierarchical scenario of
cryptocurrency time-series, zone C
Figure 17. The series (left) and centroid (right) result of partitional with the k-medoid
scenario of cryptocurrency time-series, zone C
Figure 18. The series (left) and centroid (right) result of partitional with the k-shape scenario
of cryptocurrency time-series, zone C
For SSE 50, the series and centroid results of the hierarchical scenario, the partitional with
k-medoid scenario, and the partitional with k-shape scenario are shown in Figure 20,
Figure 21, and Figure 22 respectively.
Figure 19. Silhouette index of each algorithm scenario for each number of clusters k,
Shanghai Stock Exchange 50 Index (SSE 50)
Figure 20. The series (left) and centroid (right) result of hierarchical scenario of SSE 50
Figure 21. The series (left) and centroid (right) result of partitional with k-medoid scenario of
SSE 50
Figure 22. The series (left) and centroid (right) result of partitional with the k-shape scenario
of SSE 50
Figure 23. Silhouette index of each algorithm scenario for each number of clusters k,
exchange rate currency
Figure 24. The series (left) and centroid (right) result of hierarchical scenario of the exchange
rate currency
Figure 25. The series (left) and centroid (right) result of partitional with the k-medoid
scenario of the exchange rate currency
Figure 26. The series (left) and centroid (right) result of partitional with TADPole scenario of
the exchange rate currency
Figure 27. Silhouette index of each algorithm scenario for each number of clusters k, SET 50
Figure 28. The series (left) and centroid (right) result of the hierarchical scenario of SET 50
Figure 29. The series (left) and centroid (right) result of partitional with the k-medoid
scenario of SET 50
Figure 30. The series (left) and centroid (right) result of partitional with TADPole scenario of
SET 50
Tables 8, 9 and 10 report clustering evaluation values that point in the same direction: the
hierarchical scenario is the most effective for the unequal-length datasets, cryptocurrency and
SSE 50 time-series. For exchange rate currency and SET 50, which are equal-length
time-series, the evaluation values are shown in Table 11 and Table 12 respectively. The most
effective algorithm for exchange rate currency is partitional with TADPole, and for SET 50 it is
partitional with k-medoid.
5. Conclusion
As summarized in Table 13, we have demonstrated time-series clustering on several kinds of
financial time-series. We use 4 time-series datasets, which split into 2 data types (equal
and unequal length). The experiment compares time-series clustering using 3 clustering
algorithm scenarios for each dataset and evaluates each clustering algorithm using 5 indices
to assess its validity. According to the results, the hierarchical algorithm is the most effective
for the unequal-length cryptocurrency and SSE 50 series. On the other hand, the partitional
algorithm is the most effective for the equal-length exchange rate currency and SET 50 series.
Looking further into the source of this result, some properties of the unequal-length data
might contribute to it, such as the fact that the original data are very large in both the time
period and the number of series. When we process and cluster such data, it is hard to bring
them to equal length: SSE 50 includes 43 stocks over a period of 10 years, while the
cryptocurrency data includes 1,643 currencies over 5 years (Table 5). Therefore, the
hierarchical algorithm is suitable for larger and highly dynamic datasets. On the other hand,
SET 50 includes 50 stocks over a fixed period of only 1 year, while the exchange rate currency
data in this study includes 146 currencies from around the world within a 3-month period.
With such small and much less variable datasets, a partitional clustering algorithm is more
suitable. We hope our research will motivate researchers to study time-series clustering
further and to apply it in much wider areas, including biometric clustering and the clustering
of exchange rates and stocks to manage trading portfolios.
Acknowledgement
First, I would like to thank my thesis advisor, Professor Li Jin Yu of the School of
Mathematics at China University of Mining and Technology. The door to Prof. Li's office was
always open whenever I ran into trouble or had a question about my research. His open mind
allowed me to follow my curiosity and pushed this work further by steering me in the right
direction and providing constructive comments. I would also like to acknowledge Mr. Pakorn
Leewasuthorn, my forever friend, and Ms. Zhu Lin, my senior apprentice sister and educational
companion in the faculty, as the second readers of this thesis. I am gratefully indebted to them
for their very valuable comments on this thesis. I would also like to acknowledge Jiangsu
Province and China University of Mining and Technology, which supported me with a
scholarship and this valuable opportunity to study in China. I would also like to acknowledge
my senior apprentices Zhu Lin, Zhao Xing Wei, and Xu Yuan Yuan; my Chinese classmates Wang
Pei, Dong Min Jie, Yang Jia Hui, and Cheng Li; Aommy, Jo Jo, and Maylo; and all my Chinese and
international friends who gave me lasting friendship, emotional support, and care, and made
my life in China meaningful. Finally, I must express my very profound gratitude to my parents
and older brothers, my grandma, and my aunt for providing me with unfailing support, freedom,
and continuous encouragement throughout my years of study and through the process of
researching and writing this thesis. This accomplishment would not have been possible without
them. Thank you.
References
Ville Hautamäki, Pekka Nykänen, & Pasi Fränti. (2008). Time-series Clustering by
Approximate Prototypes. International Conference on Pattern Recognition. IEEE.
Niennattrakul, V., & Ratanamahatana, C. A. (2006). Clustering Multimedia Data Using Time
Series. International Conference on Hybrid Information Technology. IEEE.
Gullo, F., Ponti, G., Tagarelli, A., Tradigo, G., & Veltri, P. (2012). A time series approach for
clustering mass spectrometry data. Journal of Computational Science, 3(5), 344-355.
Izakian, H., Pedrycz, W., & Jamal, I. (2015). Fuzzy clustering of time series data using dynamic
time warping distance. Engineering Applications of Artificial Intelligence, 39, 235-244.
Liao, T. W. (2005). Clustering of time series data - a survey. Pattern Recognition, 38(11),
1857-1874.
Wang, W., & Zhang, Y. (2007). On fuzzy cluster validity indices. Fuzzy Sets and Systems,
158(19), 2095-2117.
Berndt, D. J., & Clifford, J. (1994). Using dynamic time warping to find patterns in time series.
In KDD workshop, 10(1), 359-370.
Rokach, L., & Maimon, O. (2005). Clustering methods. Data mining and knowledge
discovery handbook. Springer US, 321-352.
Begum, N., Ulanova, L., Wang, J., & Keogh, E. (2015). Accelerating dynamic time warping
clustering with a novel admissible pruning strategy.
Ratanamahatana, C., Keogh, E., Bagnall, A. J., & Lonardi, S. (2005). A Novel Bit Level Time
Series Representation with Implication of Similarity Search and Clustering. Pacific-Asia
Conference on Advances in Knowledge Discovery & Data Mining.
Han, J. (2005). Data Mining: Concepts and Techniques.
Paparrizos, J., & Gravano, L. (2016). K-shape: efficient and accurate clustering of time series.
ACM SIGMOD Record, 45(1), 69-76.
Aghabozorgi, S., Shirkhorshidi, A. S., & Wah, T. Y. (2015). Time-series clustering - A decade
review. Elsevier Science Ltd.
Tak Chung Fu, F. L. Chung, Robert Wing Pong Luk, & Vincent T. Y. Ng. (2001). Flexible
time series pattern matching based on perceptually important points. JT Conference on
Artificial Intelligence Workshop, 1-7.
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Jesús M. Pérez, & Iñigo Perona. (2013). An
extensive comparative study of cluster validity indices. Pattern Recognition, 46(1),
243-256.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193-218.
Sardá-Espinosa. (2018). Comparing Time-Series Clustering Algorithms in R using the
dtwclust Package. Retrieved from
https://cran.r-project.org/web/packages/dtwclust/vignettes/dtwclust.pdf
Tsay, R. S. (2010). Multivariate Time Series Analysis and Its Applications. Analysis of
Financial Time Series, Second Edition. John Wiley & Sons, Inc.
Rokach, L. (2009). A survey of clustering algorithms. Data Mining & Knowledge Discovery
Handbook, 16(3), 269-298.
Ling H E, Ling-Da W U, & Yi-Chao C. (2007). Survey of Clustering Algorithms in Data Mining.
Application Research of Computers, 24(1), 10-13.
Nayak, J., Naik, B., & Behera, H. S. (2015). Fuzzy C-Means (FCM) Clustering Algorithm: A Decade
Review from 2000 to 2014. Computational Intelligence in Data Mining, 2(1).
Sasirekha, K., & Baby, P. (2013). Agglomerative hierarchical clustering. Electronic Design,
17-17.
Rui, X., & D. Wunsch. (2005). Survey of clustering algorithms. IEEE Transactions on Neural
Networks, 16(3), 645-678.
Saikhamwong, N., & Rimcharoen, S. (2002). K-Mean Clustering of the Stock Exchange of Thailand
50 (SET50) for Portfolio Diversification. 11th International Conference on e-Business
(iNCEB 2013).