Majumdar and Laha - 2020 - Clustering and classification of time series using
PII: S0957-4174(20)30676-X
DOI: https://doi.org/10.1016/j.eswa.2020.113868
Reference: ESWA 113868
Please cite this article as: S. Majumdar and A.K. Laha, Clustering and classification of time series
using topological data analysis with applications to finance. Expert Systems With Applications
(2020), doi: https://doi.org/10.1016/j.eswa.2020.113868.
This is a PDF file of an article that has undergone enhancements after acceptance, such as the
addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive
version of record. This version will undergo additional copyediting, typesetting and review before it
is published in its final form, but we are providing this version to give early visibility of the article.
Please note that, during the production process, errors may be discovered which could affect the
content, and all legal disclaimers that apply to the journal pertain.
Clustering and classification of time series using topological data analysis with applications to finance
Sourav Majumdar
[email protected]
Arnab Kumar Laha∗
[email protected]
Abstract: We explore tools from topological data analysis (TDA), such as persistent homology and time delay embedding, for analyzing time-series data. We present a new clustering method, SOM-TDA, and a new classification method, RF-TDA, based on TDA. Using SOM-TDA we examine the topological similarities and dissimilarities of some well-known time-series models used in finance. We also use RF-TDA to examine whether topological features can be used to distinguish between time series models using simulated data. The performance of RF-TDA on the classification task is compared
Journal Pre-proof
1 Introduction
In this article, we study the problem of financial time series classification. Given a collection of time series and labels such that each series has a unique label, we are interested in learning from the training data and using it to predict the label of a new unlabelled time series. We utilise the method of topological data analysis (TDA) to perform time series classification. This is based on the
complexes from the point cloud at various resolutions. One then computes Betti numbers for each of these complexes. A k-dimensional Betti number counts k-dimensional holes, and we track the change of the Betti numbers across the sequence to identify which topological features persist. Topological features are robust, in that small perturbations of the data cause only small changes in the output of the method. We describe the method more formally and in further detail in section 2.1 below.
In this article, we analyse univariate time series, which are one dimensional; to compute meaningful topological features we require a higher dimensional point cloud. We use the method of time delay embedding to construct a higher dimensional point cloud from a univariate time series. We then extract topological summaries of this point cloud and perform classification using them.
Since Takens' theorem (see Theorem 2.1 below) connects the time delay reconstruction to the attractor space of the underlying dynamical system, our technique works with these attractors to perform classification (see section 2.2 for more details).
Time delay embedding methods have been applied to time series analysis for a long time. de Silva et al., 2012 first considered analysing the delay map using persistent homology. Pereira and de Mello, 2015 used persistent homology to perform time series clustering by computing summary statistics from the persistence diagrams, such as the mean of the birth and death times. Perea and Harer, 2015 prove a convergence theorem for quantifying periodicity in time series of trigonometric polynomials using persistent homology. Truong, 2017 studies a high frequency financial time series: using TDA, they perform a time delay embedding, reduce it to 3 dimensions using principal component analysis (PCA) and compute persistent homology; they then aim to show, by computing various measures, that it is topologically different from quantum noise. Umeda, 2017 proposes a TDA based method for classifying time series, using a different approach than the one presented in this paper: they compute various summary statistics from the persistence diagram of the delay embedding and then use a convolutional neural network for classification. Gidea and Katz,
2018 analyse crashes in financial time series by computing landscapes and comparing their L2 norms. Gidea, Goldsmith, et al., 2018 additionally use these L2 distances between persistence landscapes to perform k-means clustering. Goel et al., 2020 apply a clustering scheme based on norms of TDA landscapes to perform portfolio selection. Kim et al., 2018 prove that the bottleneck distance between the persistence diagrams of the original point cloud and the point cloud obtained after applying PCA is 0. They consider a similar algorithm for featurization of time series, but ours uses different methods for parameter selection and focuses on application to time series classification and clustering. Some other non-financial applications of TDA to time series data include Berwald and Gidea, 2014; Lum et al., 2013; Perea, Deckard, et al., 2015.
In this article, we present two new methods: SOM-TDA for clustering time series and RF-TDA for classifying them. We examine both these techniques using extensive simulated data and derive useful insights. Further, we illustrate the proposed classification technique using real-life financial time series data of different stocks. We have taken the labels of the time series to be the sectors to which they belong, as we anticipate that the price movements of a stock may depend on the sector to which it belongs. In fact, in this work we show that stock price movements can be used for predicting the sector to which a stock belongs. We believe this could lead to a greater theoretical understanding of stock price formation and the role of sector-induced topology in it.
As mentioned above, we also show the efficacy of RF-TDA on simulated data arising from various stochastic processes. We aim to understand topological similarities and differences between different state space models such as AR, MA, ARMA,
methods. In subsection 3.4 we examine the performance of RF-TDA and the other methods in the context of multi-class classification using the simulated data. In section 4 we consider NSE stock price data and demonstrate the performance of RF-TDA and the other methods in time series classification by performing two experiments on separate datasets. We report the results of experiments 1 and 2 in subsections 4.1 and 4.2 respectively. In subsection 4.3 we analyse the performance of the methods in experiments 1 and 2. In subsection 4.4 we perform multi-class classification using the methods considered here on all the classes of both experiments. In section 5 we conclude the article and discuss further possible research directions.
2 Background

2.1 Topological data analysis

Topological data analysis is based on the theory of persistent homology; see Edelsbrunner, Letscher, et al., 2002; Zomorodian and Carlsson, 2005. This is a multi-resolution version of simplicial homology theory; see Chapters 1 and 2 of Munkres, 2018 for a reference on simplicial homology. For a more detailed exposition of persistent homology refer to Edelsbrunner and Harer, 2010.
Consider a point cloud X. The first step in TDA is to construct a simplicial complex over the point cloud. There are several systematic ways to do this, one of them being the Rips complex. To construct a Rips complex Rε(X), one creates a neighborhood of radius ε > 0 around each point in the point cloud; if the neighborhoods of two points intersect, we create an edge between those two points. We observe that for ε → ∞ all points would have an edge between each other, and for ε → 0 the complex would have no edges. A filtration F of a complex K is a collection of nested subcomplexes ∅ = K0 ⊂ · · · ⊂ Kn = K. One creates a filtration F of complexes by varying ε from suitably small to large values; Rε′(X) is a subcomplex of Rε′′(X) for ε′ < ε′′.
be 0 over non-p-dimensional simplices. Cp(K) is an additive free abelian group defined to be the collection of p-dimensional chains. There is then a boundary operator ∂p : Cp(K) → Cp−1(K), which is a homomorphism (Munkres, 2018, pp. 28).
Definition 2.2. The kernel of ∂p : Cp(K) → Cp−1(K) is called the group of p-cycles and is denoted Zp(K). The image of ∂p+1 : Cp+1(K) → Cp(K) is called the group of p-boundaries and is denoted by Bp(K). The pth homology group of K is defined by

Hp(K) = Zp(K)/Bp(K).

The rank of the pth homology group is called the pth Betti number (Munkres, 2018, pp. 30). One then computes the k-dimensional Betti numbers over each of these complexes, which represent the number of k-dimensional holes in the complex. 0-dimensional Betti numbers denote the number of connected components, 1-dimensional Betti numbers denote the number of loops in the complex, and 2-dimensional Betti numbers count voids in the complex; higher dimensional Betti numbers are harder to interpret visually.
The filtration F induces a homomorphism over the k-dimensional homology groups across the various complexes. This allows us to measure when a certain class is born in the filtration and when it disappears; these times are called the birth time and death time respectively, i.e. the indices of the filtration at which a certain feature appears and disappears. While computing persistent homology we are often more interested in how long a certain topological feature persists in the filtration than in the precise Betti numbers. This is because a longer persistence (the difference between death time and birth time) gives an indication of robustness, i.e. that the feature under consideration may actually be present in the topological space from which the data is obtained. The final output of the TDA procedure is a multiset, for each dimension, consisting of birth and death times. This is called a persistence diagram. It is plotted by drawing birth times on the x-axis and death times on the y-axis; since the death time of a class succeeds its birth time, points further away from the y = x line denote relatively robust features. Persistence diagrams form a metric space under the Wasserstein metric (see Mileyko et al., 2011 for further discussion of theoretical aspects of persistence diagrams).
TDA is well suited to noisy data, as the persistence diagram is known to be stable under perturbation (Cohen-Steiner et al., 2007): for small errors, the change in the distance between persistence diagrams is less than the perturbation.
The persistence diagram is not a suitable summary for statistical analysis because of its multiset structure. It is also known that persistence diagrams do not have a unique Fréchet mean (Mileyko et al., 2011). An alternative statistical summary is the persistence landscape (Bubenik, 2015). In this
article, we consider the mean persistence landscape. For each birth-death pair (b, d) in a persistence diagram Dgm we associate a functional

β_(b,d)(t) = t − b for t ∈ [b, (b + d)/2],  and  d − t for t ∈ [(b + d)/2, d].
The persistence landscape is defined as λk(t), the kth largest value among all β_(b,d)(t). In this article we set k = 1. Persistence landscapes form a Banach space (Bubenik, 2015) and have a unique mean. One can compute Lp norms, which is not possible with persistence diagrams. We show a sample output from TDA in Figure 1.
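The definition of λ1 can be evaluated directly from a list of birth-death pairs; a small pure-Python sketch (the diagram below is made up for illustration):

```python
def landscape_lambda1(pairs, t):
    """First persistence landscape lambda_1(t): the largest of the tent
    functions beta_(b,d)(t) over all birth-death pairs (b, d)."""
    best = 0.0
    for b, d in pairs:
        mid = (b + d) / 2.0
        if b <= t <= mid:
            best = max(best, t - b)   # rising edge of the tent
        elif mid < t <= d:
            best = max(best, d - t)   # falling edge of the tent
    return best

dgm = [(0.0, 2.0), (1.0, 1.5)]  # a made-up persistence diagram
print(landscape_lambda1(dgm, 1.0))  # 1.0, the peak of the (0, 2) tent
```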
The features from TDA can be interpreted to give an idea of the possible shape of the underlying data. A longer gap between birth and death provides strong evidence for the presence of the particular topological feature. Consider the example illustrated in Figure 1. The Betti numbers of S³ are β0 = 1, β1 = 0, β2 = 0, β3 = 1. In the example, data has been sampled from S³ and TDA has been applied to it. We see in the figure that after features of dimensions 1 and 2 have died off, features of the true dimension 3 appear. These birth and death times of various features in TDA provide us an understanding of the topology of the data.
2.2 Time delay embedding

TDA is useful when the input data is a point cloud in more than one dimension. Since we consider univariate time series data, we use a method by which one may embed the time series in a higher dimensional space. For a given series X(t), t = 1, · · · , n, one can construct a delay map Y(t) = [X(t), X(t − τ), X(t − 2τ), · · · , X(t − dτ)], where τ is the lag parameter and d is the dimension parameter. Takens (Takens, 1981) shows that such a reconstruction, for appropriate values of τ and d, is topologically equivalent to the attractor space of the dynamical system generating the series.
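The delay map is straightforward to construct; a minimal sketch (parameter values here are illustrative, not the cross-validated ones used later in the paper):

```python
def delay_embed(x, d, tau):
    """Delay map: Y(t) = [x(t), x(t - tau), ..., x(t - d*tau)] for every
    index t at which all d*tau lags are available."""
    return [[x[t - i * tau] for i in range(d + 1)] for t in range(d * tau, len(x))]

embedded = delay_embed(list(range(10)), d=2, tau=3)
print(embedded[0], len(embedded))  # [6, 3, 0] 4
```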
We state the theorem in the form given in Perea, 2019.

Theorem 2.1 (Takens embedding theorem, 1981). Let M be a smooth, compact manifold. Let τ > 0 be a real number and let d ≥ 2 dim(M) be an integer. Let Φ be an observation function on the dynamical system. Then for generic Φ ∈ C²(R × M, M) and F ∈ C²(M, R) and ϕp(t), the delay map ϕ : p → (ϕp(0), ϕp(τ), · · · , ϕp(dτ)) is an embedding. (Generic here refers to ϕ, F being open and dense in the C¹ topology.)
article, we use this insight and proceed to use the Takens embedding theorem to reconstruct the time series for further work. A similar view is taken in the works of Goel et al., 2020; Kim et al., 2018; Umeda, 2017.
Time delay embedding requires the parameters d and τ. To the best of our knowledge, there is no known optimal way to estimate d and τ, though there are several heuristics for doing so. For the time lag τ, most rules are based on analysing the autocorrelation function or the auto mutual information function; for example, in one approach the lag is chosen as the first value at which the ACF falls below 1/e. There are several methods to estimate the dimension d, such as singular value decomposition (Broomhead and King, 1986), the method of false nearest neighbors (Kennel et al., 1992), etc. One can always choose a very large dimension, but it may not be efficient to do so. In this paper we estimate the minimum dimension using Cao's algorithm (Cao, 1997).
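The 1/e rule for τ can be sketched as follows (pure Python on a deterministic toy series; the paper's actual pipeline is in R and uses Cao's algorithm for d):

```python
import math

def acf(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return cov / var

def choose_tau(x, max_lag=50):
    """Smallest lag at which the ACF first falls below 1/e."""
    for lag in range(1, max_lag + 1):
        if acf(x, lag) < 1.0 / math.e:
            return lag
    return max_lag  # fallback if the ACF never drops below 1/e

x = [math.cos(0.1 * t) for t in range(500)]
print(choose_tau(x))
```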
We show in Figure 2 a 3-dimensional embedding of a stock.
2.3 Stochastic processes

In the next section we test our proposed methods against simulated data. We briefly describe the stochastic processes used for simulation below for easy reference.
1. Autoregressive process: The autoregressive process model AR(p) is a weakly stationary stochastic process given by

X_t = c + Σ_{i=1}^{p} a_i X_{t−i} + ε_t    (2)
process, where p is the order of the GARCH terms and q is the order of the ARCH terms,

σ_t² = α_0 + Σ_{i=1}^{q} α_i ε²_{t−i} + Σ_{i=1}^{p} β_i σ²_{t−i}    (6)

Here α_1, · · · , α_q, β_1, · · · , β_p are the model parameters (Shumway and Stoffer, 2017, pp. 253).
4. Autoregressive fractionally integrated moving average process model (ARFIMA) (Granger and Joyeux, 1980): Let d be the difference operator such that dX_t = X_t − X_{t−1}. So,

(1 − d)^m = Σ_{i=0}^{∞} (m choose i) (−1)^i d^i = Σ_{i=0}^{∞} (−1)^i [m! / ((m − i)! i!)] d^i
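The processes above can be simulated in a few lines; the following pure-Python sketch (illustrative parameters, Gaussian innovations, zero initial values — not the paper's simulation code) generates an AR(p) path, a GARCH(1,1) path, and the binomial coefficients of (1 − d)^m used in fractional differencing:

```python
import random

def simulate_ar(coeffs, c=0.0, n=200, sigma=1.0, seed=0):
    """AR(p) path: X_t = c + sum_i a_i X_{t-i} + eps_t, eps_t ~ N(0, sigma^2).
    The p pre-sample values are set to zero."""
    rng = random.Random(seed)
    p = len(coeffs)
    x = [0.0] * p
    for _ in range(n):
        eps = rng.gauss(0.0, sigma)
        x.append(c + sum(a * x[-1 - i] for i, a in enumerate(coeffs)) + eps)
    return x[p:]

def simulate_garch11(alpha0, alpha1, beta1, n=500, seed=1):
    """GARCH(1,1) returns: sigma_t^2 = alpha0 + alpha1*eps_{t-1}^2
    + beta1*sigma_{t-1}^2, started from the unconditional variance."""
    rng = random.Random(seed)
    eps, sigma2 = 0.0, alpha0 / (1.0 - alpha1 - beta1)
    out = []
    for _ in range(n):
        sigma2 = alpha0 + alpha1 * eps ** 2 + beta1 * sigma2
        eps = (sigma2 ** 0.5) * rng.gauss(0.0, 1.0)
        out.append(eps)
    return out

def frac_diff_coeffs(m, k):
    """First k+1 coefficients of (1 - d)^m, i.e. (-1)^i * C(m, i), via the
    recursion c_i = c_{i-1} * (i - 1 - m) / i; valid for fractional m too."""
    coeffs = [1.0]
    for i in range(1, k + 1):
        coeffs.append(coeffs[-1] * (i - 1 - m) / i)
    return coeffs

print(len(simulate_ar([0.5], n=300)))        # 300
print(len(simulate_garch11(0.1, 0.1, 0.8)))  # 500
print(frac_diff_coeffs(2, 3))                # [1.0, -2.0, 1.0, 0.0]
```

The same recursion with a fractional m (e.g. m = 0.4) yields the slowly decaying weights characteristic of ARFIMA's long memory.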
Let T1, T2 be two time series of lengths n, m respectively. Let D be the pointwise distance matrix between T1 and T2. A path in the matrix D is a collection of pairs {(i, j) : i ∈ T1, j ∈ T2} subject to certain warping conditions; see Bagnall et al., 2017 for details. The objective is to find the length of the shortest path beginning at (1, 1) and terminating at (n, m). This length is the DTW distance between the two time series.
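The dynamic-programming recursion behind this shortest path can be sketched as:

```python
def dtw_distance(t1, t2):
    """DTW distance via the standard DP over the pointwise distance matrix,
    accumulating the cheapest warping path from (1, 1) to (n, m)."""
    n, m = len(t1), len(t2)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(t1[i - 1] - t2[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

print(dtw_distance([0, 1, 2], [0, 1, 2]))     # 0.0
print(dtw_distance([0, 0, 1, 2], [0, 1, 2]))  # 0.0 -- warping absorbs the repeat
```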
2.6 Discrete wavelet transform

Another method we consider for comparison purposes is the discrete wavelet transform; see Batal and Hauskrecht, 2009. Wavelet transforms bypass the periodicity requirements of the Fourier transform.
We use the Haar wavelets and apply the Haar decomposition to the time series, using the resulting Haar coefficients as features for classification. Let x be a time series of length n. If n ≠ 2^l, l ∈ N, then we pad zeros to the end of the time series so that its length is a power of two. The Haar coefficients are calculated as (Batal and Hauskrecht, 2009)

d_{l,i} = (1/√2)(s_{l−1,2i} − s_{l−1,2i+1})    (9)

s_{l,i} = (1/√2)(s_{l−1,2i} + s_{l−1,2i+1})    (10)

where l = 1, · · · , log₂ n are called the levels of the time series, i = 1, · · · , n/2^l and s_{0,i} = x_i. d_1, · · · , d_l are called the level coefficients and s_{log₂ n, 0} is called the scaling coefficient. We create a vector of all level coefficients and the scaling coefficient to use as features.
Let X be a collection of k time series. Thus X = {x_i : i ∈ {1, . . . , k}}, where each x_i is a time series of length n_i. The time series clustering problem is to divide the k time series into s < k groups based on some notion of homogeneity of the time series in each group. In this article, we present a method for clustering time series based on their topological similarity.
Now suppose each time series x_i is associated with a label y_i, and let Y = {y_j : j ∈ {1, . . . , k}} be the collection of all these labels. Assume the number of distinct labels in Y is ℓ. The time series classification problem is to predict the label of a time series whose label is unknown, based on the information derived from a training set of labelled time series, i.e. the set of pairs P = {(x_i, y_i) : x_i ∈ X, y_i ∈ Y}.
Let w ∈ N and assume that the length of each time series is an integral multiple of w. We break a given series into several sub-series of equal length w. This is helpful because TDA is computationally expensive. Another advantage of dividing the series into several sub-series is to potentially allow for different time-delay embeddings, which may be needed, for example, if a change point is present. A single time-delay embedding for a long time series implies the belief that the entire data was generated by a single dynamical system, which may not always be the case (Ang and Timmermann, 2012). We give all the sub-series of a given series the same label as was given to the parent series.
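The windowing step can be sketched as follows (the window size w is illustrative; the paper selects W∗ by cross-validation):

```python
def split_windows(x, w):
    """Split a series into consecutive sub-series of equal length w; the
    series length is assumed to be an integral multiple of w."""
    assert len(x) % w == 0, "length must be a multiple of the window size"
    return [x[i:i + w] for i in range(0, len(x), w)]

def label_windows(x, y, w):
    """Every sub-series inherits the label y of its parent series."""
    return [(sub, y) for sub in split_windows(x, w)]

parts = split_windows(list(range(1200)), 100)
print(len(parts), len(parts[0]))  # 12 100
```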
We do a time delay embedding for each of these sub-series following the process discussed in section 2.2. Since the resulting embedding may have a very high dimension, we reduce the computational expense by performing PCA. We then apply TDA on this embedding and compute its one dimensional
the overall accuracy. This process is repeated 10 times, keeping each group once as the test group. The PCA dimension and window size are then chosen to be the values (D∗, W∗) for which the overall accuracy is maximised. The persistence landscapes of dimension D∗ and subseries of size W∗ are then used for clustering and classification of the time series. We use the R package TDA (Fasy et al., 2014) for all TDA-related computations. This is shown as Algorithm 1.
We generate 30 time series, each of length 1200, from each of the 14 processes given in Table 1. As can be seen from the table, the parameters of the 30 time series of each kind are randomly drawn from the specified distributions. We split the data into training and test
simulated series by the method described above. To select the values of the window size and PCA dimension we perform a 10-fold cross validation. We observe in Table 2 that the cross-validation accuracy is maximised for window size 100 and PCA dimension 2.
Algorithm 1 Generating persistence landscape features of time series
• Input: The collection of time series X.
• Parameters: Window size (W∗) and PCA dimension (D∗).
• Output: Persistence landscapes of the time series.
1: Divide each time series x ∈ X into subseries of length W∗ obtained from cross-validation.
2: For each subseries, generate the time delay embedding.
3: Reduce the time delay embedding of each subseries to D∗ dimensions by PCA.
4: Generate the Rips filtration over these reduced embeddings and compute persistence diagrams for each filtration.
5: Compute the persistence landscape from these diagrams corresponding to each subseries.
3.1 Clustering time series

For clustering, we use the generated persistence landscapes and apply SOM on them. We use a 3 × 3 grid of hexagonal topology. Each landscape is mapped to a node. We then perform hierarchical clustering of the nodes using the Manhattan metric with 4 centers to obtain time series clusters. We refer to this method as SOM-TDA (Algorithm 2).
Algorithm 2 SOM-TDA
• Input: Persistence landscape features for each subseries from Algorithm 1.
• Parameters: Grid dimension (m × n) and topology.
We also notice that processes with one order of differencing are in clusters 1 and 3, whereas processes with two orders of differencing are in cluster 4, which possibly indicates the effect of differencing on topology.
Clustering time series models is useful for model selection. A search over a set of candidate models, which may have unknown parameters estimated from the data, is first conducted. An information criterion such as AIC is then used to choose the "best model" among the candidates. This exercise can be computationally expensive if the number of candidate models is large. However, using SOM-TDA clusters obtained based on topological similarity, we can narrow the search space down to representative members of the clusters. Moreover, since the clustering has been done over members of several different model classes, this may lead to the creation of better ensembles.
We use the persistence landscapes as features for time series classification, with the random forest algorithm as the classifier. We will henceforth refer to this algorithm as RF-TDA (Algorithm 3). Further, we examine the performance of the proposed method, RF-TDA, using extensive simulation.
We examine the discriminatory power of the TDA algorithm by considering several cases.

Algorithm 3 RF-TDA
• Input: The training data consists of the persistence landscape features for each subseries from Algorithm 1 and the labels Y of their parent series. The test data consists of the persistence landscape features of the subseries.
• Parameters: Number of trees and number of variables tried at each split for the random forest.
• Output: Predicted labels on the test data.
1: Train a random forest on the training data.
2: Predict using the trained random forest model on the test data.

The results on the training data are shown in Table 4. For instance,
since an AR(1) process can be written as an MA(∞) process, the better discrimination of the AR(1) and MA(1) processes by the RF-TDA algorithm suggests the possibility of topologically different attractor spaces. We also note that AR(1) and MA(1) appeared in different clusters in section 3.1; they have also been discriminated well. Note that for pairs like (AR(2), MA(2)), (ARMA(1,2), ARMA(2,1)) and (ARCH, GARCH) the classification accuracy, although decent, is not that high. We also saw in section 3.1 that they were members of the same clusters, meaning they are topologically similar and hence probably could not be discriminated well. We also note the high accuracy in classifying ARFIMA(1,2,1) against ARFIMA(2,1,2), which could be due to the different differencing orders inducing different topologies, as was also observed in the SOM-TDA results in the previous section.
The results on the test data are shown in Table 5. There does not appear to be any over-fitting worthy of concern, since the performance on the test data is similar to that on the training data.
To examine the effectiveness of RF-TDA, we compare it against three other methods based on approaches used for time series classification in the literature, namely the Discrete Wavelet Transform (DWT), the Maximal Overlap Discrete Wavelet Transform (MODWT) and Dynamic Time Warping (DTW), described in the previous section. We use DWT features in the random forest algorithm for classification, and henceforth refer to this as RF-DWT. We also use MODWT features along with random forest, and call this RF-MODWT. DTW being a similarity measure, we use it in conjunction with the k-Nearest Neighbor method for classification; we call this algorithm knn-DTW. We also apply a similarity measure based on Euclidean distances between time series, used along with the k-Nearest Neighbor method; we refer to this method as knn-Euclidean.
To give an indication of the computational expense involved in each of the methods, we report the time to generate the features for a dataset with 48 time series of length 1200 each. All computations were performed on an Ubuntu 18 machine with 32 GB RAM and 6 cores using the R language. With a window size of 120 and PCA dimension 5, TDA takes 31.6 minutes to generate the features. Features from DWT take 12 seconds and from MODWT 32 seconds; DTW features take 1 minute to generate. TDA is computationally more expensive than the other methods considered here.
We report the results in Table 6. We note that the ensemble is the best performing method among all considered except RF-MODWT.
We also compare the median overall scores of the methods considered using hypothesis testing. We apply a one-tailed two-sample Monte-Carlo permutation test (see Dwass, 1957) with

H0: {Median of the alternate method's overall accuracy ≥ Median of RF-TDA's overall accuracy}

and,

We report the p-values for both hypotheses in Table 14, and we observe that both hypotheses have very high p-values, indicating that they cannot be rejected.
the two classes which are assigned the highest and second-highest probabilities by the random forest algorithm. We consider the prediction to be accurate if one of the predicted classes is the real class of the data. We report the performance of our method in Table 16. We see that RF-TDA has a median accuracy of 38%, while RF-DWT and RF-MODWT have a median accuracy of 61% on the test set. It must be noted that this dataset has 13 classes, so a random classifier with a prediction set of two will achieve an accuracy of only 15.4%.
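The top-two evaluation rule can be sketched as follows (class names and probabilities below are made up for illustration):

```python
def top2_accuracy(prob_rows, true_labels):
    """Fraction of cases where the true label is among the two classes
    assigned the highest predicted probabilities."""
    hits = 0
    for probs, truth in zip(prob_rows, true_labels):
        top2 = sorted(probs, key=probs.get, reverse=True)[:2]
        hits += truth in top2
    return hits / len(true_labels)

rows = [{"IT": 0.5, "Banks": 0.3, "Pharma": 0.2},
        {"IT": 0.1, "Banks": 0.6, "Pharma": 0.3}]
print(top2_accuracy(rows, ["Banks", "Pharma"]))  # 1.0
```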
NSE publishes sectoral indices weighted by free-float market capitalisation. For details on their index methodology refer to (Methodology Document of NIFTY Sectoral Index Series n.d.). We obtain data across sectors for the 6 largest constituents, by index proportion, of the NSE sectoral indices for banks, pharmaceuticals, information technology and public banks. These are among the largest sectors by market capitalisation and are actively traded. The chosen stocks are given below, with their NSE symbols in parentheses.
1. Banks-HDFC (HDFCBANK), Axis (AXISBANK), Kotak Mahindra (KOTAKBANK), ICICI (ICICIBANK), Indus Ind (INDUSINDBK), Federal (FEDERALBNK)
2. Pharmaceuticals-Sun (SUNPHARMA), Cipla (CIPLA), Divi (DIVISLAB), Dr Reddy (DRREDDY), Lupin (LUPIN), Biocon (BIOCON)
3. IT-TCS (TCS), Tech Mahindra (TECHM), HCL (HCLTECH), Infosys (INFY), WIPRO (WIPRO), Hexaware (HEXAWARE)
4. Public Banks-State Bank of India (SBIN), Bank of Baroda (BANKBARODA), Punjab National Bank (PNB), Canara Bank (CANBK), Bank of India (BANKINDIA), Union Bank (UNIONBANK)
As mentioned earlier, the label of each of these time series is the sector to which the stock belongs. The training period data was chosen to be from 1 January 2014 to 31 December 2018, while the test period data was from 1 January 2019 to 1 November 2019. We work with the log return series of the stocks; we see no discernible pattern in their plots.
We then apply RF-TDA and generate the persistence landscapes. The PCA dimension was found after 10-fold cross validation to be 8 and the window size 100. The results are shown in Table 7. From the persistence landscape plots we also note the difference in the persistence landscapes for each window of the time series. The sector-wise differences in the persistence landscapes, such as Public Banks having more dispersed peaks compared to Pharma (see Figures 3
besides Federal Bank and WIPRO, which are fitted with an ARFIMA model. We use the auto.arima function of the R library forecast (Hyndman et al., 2020) to fit ARIMA models to the stock price data. We fit GARCH and ARFIMA models to the dataset by selecting the model with the lowest AIC, using the R library tseries (Trapletti and Hornik, 2019) to fit GARCH models and the library arfima (Veenstra, 2012) to fit ARFIMA models. For GARCH we vary the model order within [0, 4], fit models and select the one with the lowest AIC; for ARFIMA, we vary the AR, MA and difference parameters within [0, 4] and again select the fitted model with the lowest AIC. It is seen that the AIC values of the GARCH models are generally lower than those of the ARIMA or ARFIMA models.
We then perform classification on this dataset using RF-TDA, RF-DWT, RF-MODWT, knn-DTW and knn-Euclidean. We report the results for RF-TDA for the first set in Tables 10 and 11. We note the very high overall accuracy on the training dataset, with accuracy upwards of 90% for all considered cases. In the cases involving the Pharma sector an overall accuracy upwards of 97% is observed, which points to the presence of some topological feature(s) in the pharma time series that is (are) distinct from the time series of the other sectors, facilitating the classification. There is also no indication of overfitting, since we observe high accuracy on the test dataset as well. Barring the case of Pharma-Public banks, we observe that RF-TDA classifies well with high accuracy; we note an accuracy rate of more than 80% in some cases.
We report the RF-DWT results for the first set in Tables 10 and 11 and the knn-DTW and knn-Euclidean results in Table 11. RF-TDA outperforms the other methods considered in terms of both overall and class-wise accuracy. RF-DWT performs decently on the training set, but in none of the cases does it cross an overall accuracy of 90%. On the test data we observe that, although RF-DWT performs decently, it is outperformed by RF-TDA in four of the six cases. RF-DWT and RF-MODWT perform similarly. RF-TDA has a higher median accuracy than RF-DWT and RF-MODWT. The methods based on similarity measures, i.e. knn-DTW and knn-Euclidean, perform poorly, with knn-DTW performing better than knn-Euclidean. knn-DTW is outperformed by RF-TDA on all comparisons.
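For reference, the dynamic time warping distance that underlies knn-DTW admits a short dynamic-programming sketch (a textbook implementation for illustration, not the code used in these experiments):

```python
def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) dynamic programme for the DTW distance
    between two numeric sequences, with absolute difference as local cost."""
    n, m = len(x), len(y)
    INF = float("inf")
    # cost[i][j] = DTW distance between the prefixes x[:i] and y[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of insertion, deletion, or match
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]

# A time-shifted copy of a series is at zero DTW distance, whereas the
# point-wise comparison used by knn-Euclidean would penalise the shift.
a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 3, 2, 1, 0]   # same shape, delayed by one step
```

This invariance to local time warps is what distinguishes knn-DTW from knn-Euclidean in the comparisons above.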
Journal Pre-proof
1. Automobiles-Bajaj Auto (BAJAJ-AUTO), Eicher Motors (EICHERMOT), Hero MotoCorp (HEROMOTOCO), Mahindra & Mahindra (M&M), Maruti Suzuki (MARUTI), Tata Motors (TATAMOTORS)
2. Consumer Durables-Bata India (BATAINDIA), Havells (HAVELLS), Rajesh Exports (RAJESHEXPO), Titan (TITAN), Voltas (VOLTAS), Whirlpool (WHIRLPOOL)
3. Realty-DLF (DLF), Godrej Properties (GODREJPROP), Indiabulls Real Estate (IBREALEST), Oberoi Realty (OBEROIRLTY), Phoenix Mills (PHOENIXLTD), Prestige Group (PRESTIGE)
4. FMCG-Britannia (BRITANNIA), Dabur (DABUR), Godrej Consumer Products (GODREJCP), Hindustan Unilever (HINDUNILVR), ITC (ITC), Nestle India (NESTLEIND)
5. Oil and Gas-Bharat Petroleum (BPCL), Gas Authority of India (GAIL), Indian Oil (IOC), Oil and Natural Gas Corporation (ONGC), Petronet (PETRONET), Reliance Industries (RELIANCE)
6. Media-Inox (INOXLEISUR), Network 18 (NETWORK18), PVR (PVR), Sun TV (SUNTV), TV 18 (TV18BRDCST), Zee Entertainment (ZEEL)

For the second dataset, the PCA dimension after 10-fold cross-validation was found to be 5 and the window size 120.
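The windowing step referred to above, splitting each price series into overlapping subseries before features are computed, can be sketched as follows (an illustrative Python fragment; the stride of 60 is an arbitrary choice for the example, not a value taken from the experiments):

```python
def sliding_windows(series, window=120, stride=1):
    """Return the list of length-`window` subseries of `series`,
    moving the window forward by `stride` points each time."""
    if window > len(series):
        return []
    return [series[start:start + window]
            for start in range(0, len(series) - window + 1, stride)]

prices = list(range(300))            # stand-in for a daily closing-price series
subseries = sliding_windows(prices, window=120, stride=60)
```

Each resulting subseries is then treated as one observation of the class to which the stock belongs.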
We report the time series models fitted to this dataset in Table 9. We find that all the time series are most closely fit by GARCH among the models considered in Section 3, and the results are consistent with the first experiment in Section 4.1.
We apply the methods considered to the second dataset, which is larger in terms of pairwise comparisons. We report the results in Tables 12 and 13. Here RF-DWT and RF-MODWT have higher median accuracy than RF-TDA on the training set, but RF-TDA's performance is pulled down by a few poor pairwise comparisons, such as many of those involving Automobiles. The high training accuracy of RF-DWT and RF-MODWT may also be an indication of overfitting, since Table 13 shows that they have lower overall median accuracy than RF-TDA on the test data. RF-TDA performs excellently on many of the pairwise comparisons, with several 100% accuracy results, and has the highest overall median accuracy among all the methods. RF-TDA outperforms knn-DTW and knn-Euclidean in most of the pairwise comparisons.
or more accuracy 10 times. We also compare the median overall scores of the methods considered using hypothesis testing. We apply a one-tailed two-sample Monte-Carlo permutation test of the null hypothesis H0 that the median overall accuracy of RF-TDA equals that of the competing method, against the one-sided alternative that it is higher, and of the analogous null hypothesis H0* for the median class-wise accuracy.
We report the p-values for both hypotheses in Table 15. We observe that both H0 and H0* are rejected at the 5% level of significance for the comparisons with RF-DWT, knn-DTW and knn-Euclidean, indicating that the performance of RF-TDA on these two real-life tests is significantly better than that of these three methods. For the comparison of RF-TDA with RF-MODWT, we observe that the p-values for testing both H0 and H0* are greater than 0.05; thus we cannot reject either hypothesis.
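A minimal sketch of such a one-tailed two-sample Monte-Carlo permutation test for a difference in medians follows (illustrative Python; the accuracy values are made-up stand-ins, and the exact statistic used in the paper may differ):

```python
import random
import statistics

def perm_test_median(x, y, n_resamples=999, seed=0):
    """One-tailed two-sample Monte-Carlo permutation test:
    H0: median(x) = median(y) against H1: median(x) > median(y).
    Returns the Monte-Carlo p-value with the usual +1 correction
    (cf. the randomization tests of Dwass, 1957)."""
    rng = random.Random(seed)
    observed = statistics.median(x) - statistics.median(y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                      # random relabelling under H0
        diff = (statistics.median(pooled[:len(x)])
                - statistics.median(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)

# Clearly separated accuracy scores should give a small p-value.
scores_a = [0.93, 0.92, 0.97, 0.96, 0.97, 0.96]
scores_b = [0.81, 0.89, 0.88, 0.75, 0.69, 0.74]
p = perm_test_median(scores_a, scores_b)
```

Because the relabelling is done under the null, a p-value below the chosen level indicates that the observed gap in medians is unlikely to arise by chance.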
4.4 Multi-class Classification

We perform a multi-class classification using all the classes of the data in Experiment 1. As discussed in Section 3.4, we provide a prediction set containing the two classes with the highest probabilities in the random forest. We report the performance of our method in Table 17. We observe that RF-TDA has nearly 100% accuracy on the training set. On the test set, RF-TDA has a median accuracy of 75%, whereas RF-DWT and RF-MODWT have a median accuracy of 67%. It must be noted that this dataset has four classes, so a random classifier with a prediction set of two will predict with an accuracy of only 50%.
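The prediction-set evaluation just described can be sketched as follows (illustrative Python; `probs` stands in for the class-probability vectors produced by the random forest):

```python
def top2_prediction_set(probs):
    """Return the indices of the two classes with highest predicted probability."""
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    return set(ranked[:2])

def prediction_set_accuracy(prob_rows, labels):
    """Fraction of observations whose true label lands in the top-2 set.
    A uniform random classifier scores 2/k for k classes
    (50% for k = 4, about 33% for k = 6)."""
    hits = sum(label in top2_prediction_set(row)
               for row, label in zip(prob_rows, labels))
    return hits / len(labels)

rows = [
    [0.50, 0.30, 0.15, 0.05],   # true class 1: in top-2 {0, 1} -> hit
    [0.10, 0.20, 0.60, 0.10],   # true class 0: top-2 is {2, 1} -> miss
    [0.25, 0.25, 0.40, 0.10],   # true class 2: in top-2        -> hit
]
acc = prediction_set_accuracy(rows, labels=[1, 0, 2])
```

The 2/k baseline is the figure quoted in the text for the random classifier with a prediction set of two.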
Next, we perform a multi-class classification using all the classes of the data in Experiment 2. We report the performance of our method in Table 18. On the test set, RF-TDA has a median accuracy of 58.5%, while RF-DWT achieves 67% and RF-MODWT 58.5%. It must be noted that this dataset has six classes, so a random classifier with a prediction set of two will predict with an accuracy of only 33.33%.
5 Conclusion

In this paper, two TDA-based methods for clustering and classification are presented. In Section 3.1, we apply SOM-TDA to cluster time series models by their topological similarity. Its potential applications in model selection and ensemble formation are also discussed. In Section 3.2 we report the performance
of RF-TDA for classification on the simulated dataset. We also observe that the clustering output from Section 3.1 allows us to explain the performance of RF-TDA. In Section 3.3, we find that the performance of RF-MODWT and of the ensemble of the four methods is better than that of the other methods on the simulated data. However, in the two experiments with real data reported in Section 4, the performance of RF-TDA is superior to that of the other methods considered in this paper. This indicates that real-life data possibly has features that are not completely captured by the time-series models from which the simulated data is generated in Section 3. We conjecture that the RF-TDA method proposed in this paper captures these features better, leading to its better performance in these experiments. In this context it may be noted that RF-MODWT is also a competitive method that may be used in conjunction with RF-TDA, or an ensemble of these two methods may be considered. Through this work we provide evidence that stock price movements are sector dependent, which may be useful for further research in finance.

In future work we intend to explore the theoretical aspects of this method. We intend to study and characterize the stochastic processes and dynamical systems for which a classifier based on topological features works best. Another future project is to explore the application of SOM-TDA to model selection and compare its performance with other model selection frameworks. We also anticipate further applications of RF-TDA to financial time series classification in other settings. A question that could be explored here is: for which suitable labels, such as sectors, can a financial time series be characterized by its topological features? A causal explanation for the presence or lack of such features remains to be provided, and needs to be explored.
Acknowledgements

The authors thank the editor and the anonymous reviewers for their helpful comments on an earlier version of the paper, which have led to its improvement.
References

Ang, A., & Timmermann, A. (2012). Regime changes and financial markets. Annual Review of Financial Economics, 4(1), 313–337.

Bagnall, A., Lines, J., Bostrom, A., Large, J., & Keogh, E. (2017). The great time series classification bake off: A review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3), 606–660.

Batal, I., & Hauskrecht, M. (2009). A supervised time series feature extraction technique using DCT and DWT. In 2009 International Conference on Machine Learning and Applications. IEEE.

Berndt, D. J., & Clifford, J. (1994). Using dynamic time warping to find patterns in time series. In KDD Workshop, Seattle, WA.

Berwald, J., & Gidea, M. (2014). Critical transitions in a model of a genetic regulatory system. Mathematical Biosciences & Engineering, 11(4), 723–740.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

Broomhead, D. S., & King, G. P. (1986). Extracting qualitative dynamics from experimental data. Physica D: Nonlinear Phenomena, 20(2-3), 217–236.

Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. The Journal of Machine Learning Research, 16(1), 77–102.

Cao, L. (1997). Practical method for determining the minimum embedding dimension of a scalar time series. Physica D: Nonlinear Phenomena, 110(1-2), 43–50.

Cohen-Steiner, D., Edelsbrunner, H., & Harer, J. (2007). Stability of persistence diagrams. Discrete & Computational Geometry, 37(1), 103–120.

de Silva, V., Skraba, P., & Vejdemo-Johansson, M. (2012). Topological analysis of recurrent systems. In Workshop on Algebraic Topology and Machine Learning, NIPS.

Dwass, M. (1957). Modified randomization tests for nonparametric hypotheses. The Annals of Mathematical Statistics, 181–187.

Edelsbrunner, H., & Harer, J. (2010). Computational topology: An introduction. American Mathematical Society.
Granger, C. W. J., & Joyeux, R. (1980). An introduction to long-memory time series models and fractional differencing. Journal of Time Series Analysis, 1(1), 15–29.
Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., O'Hara-Wild, M., Petropoulos, F., Razbash, S., Wang, E., & Yasmeen, F. (2020). forecast: Forecasting functions for time series and linear models. R package version 8.11. http://pkg.robjhyndman.com/forecast

Kennel, M. B., Brown, R., & Abarbanel, H. D. (1992). Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A, 45(6), 3403.

Kim, K., Kim, J., & Rinaldo, A. (2018). Time series featurization via topological data analysis: An application to cryptocurrency trend forecasting. arXiv preprint arXiv:1812.02987.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.

Lum, P. Y., Singh, G., Lehman, A., Ishkanov, T., Vejdemo-Johansson, M., Alagappan, M., Carlsson, J., & Carlsson, G. (2013). Extracting insights from the shape of complex data using topology. Scientific Reports, 3, 1236.

Maharaj, E. A., & Alonso, A. M. (2007). Discrimination of locally stationary time series using wavelets. Computational Statistics & Data Analysis, 52(2), 879–895.

Methodology document of NIFTY sectoral index series. (n.d.). Retrieved March
Pereira, C. M., & de Mello, R. F. (2015). Persistent homology for time series and spatial data clustering. Expert Systems with Applications, 42(15-16), 6026–6038.

Shumway, R. H., & Stoffer, D. S. (2017). Time series analysis and its applications: With R examples. Springer.

Takens, F. (1981). Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence, Warwick 1980. Springer.

Trapletti, A., & Hornik, K. (2019). tseries: Time series analysis and computational finance. R package version 0.10-47. https://CRAN.R-project.org/package=tseries

Truong, P. (2017). An exploration of topological properties of high-frequency one-dimensional financial time series data using TDA (Doctoral dissertation). KTH Royal Institute of Technology.

Umeda, Y. (2017). Time series classification via topological data analysis. Information and Media Technologies, 12, 228–239.

Veenstra, J. Q. (2012). Persistence and anti-persistence: Theory and software (Doctoral dissertation). Western University.

Yin, H. (2008). The self-organizing maps: Background, theories, extensions and applications. In Computational Intelligence: A Compendium. Springer.

Zhao, X., Barber, S., Taylor, C. C., & Milan, Z. (2018). Classification tree methods for panel data using wavelet-transformed time series. Computational Statistics & Data Analysis, 127, 204–216.

Zomorodian, A., & Carlsson, G. (2005). Computing persistent homology. Discrete & Computational Geometry, 33(2), 249–274.
Process          Coefficients
AR(1)            a1 ~ Uniform(0.5, 0.99)
AR(2)            a1, a2 ~ Uniform(0.1, 0.4)
MA(1)            b1 ~ Uniform(0.5, 0.99)
MA(2)            b1 ~ Uniform(0.5, 0.99)
ARMA(2,1)        a1, a2, b1 ~ Uniform(0.2, 0.4)
ARMA(1,2)        a1, b1, b2 ~ Uniform(0.25, 0.45)
ARCH(1)          α0, α1 ~ Uniform(0.0001, 0.9999)
GARCH(1,1)       α0 ~ Uniform(0.0001, 0.9999); α1, β1 ~ Uniform(0.0001, 0.4999)
GARCH(2,1)       α0 ~ Uniform(0.0001, 0.9999); α1, β1, β2 ~ Uniform(0.0001, 0.3332)
ARIMA(0,2,1)     b1 ~ Uniform(−1, 1)
ARIMA(1,2,1)     a1, b1 ~ Uniform(−1, 1)
ARFIMA(1,2,1)    a1, b1 ~ Uniform(−1, 1); h ~ Uniform(0, 1)
ARFIMA(2,1,2)    a1, a2, b1, b2 ~ Uniform(−1, 1); h ~ Uniform(0, 1)
Process          Node 1  Node 2  Node 3  Node 4  Node 5  Node 6  Node 7  Node 8  Node 9  Total  Cluster
Centers          (1.5,0.87) (2.5,0.87) (3.5,0.87) (1,1.73) (2,1.73) (3,1.73) (1.5,2.59) (2.5,2.59) (3.5,2.59)
AR(1)            38      31      13      57      31      10      41      39      4       264    1
AR(2)            10      41      55      14      31      66      4       11      32      264    2
MA(1)            8       44      55      18      40      69      7       9       14      264    2
MA(2)            19      30      61      8       37      58      13      9       29      264    2
ARCH(1)          19      25      20      17      28      47      14      41      53      264    3
GARCH(1,1)       8       16      39      14      20      53      5       22      87      264    3
GARCH(2,1)       10      19      24      11      19      57      12      34      78      264    3
ARMA(2,1)        39      49      40      29      30      16      30      28      3       264    1
ARMA(1,2)        38      43      40      29      34      17      27      33      3       264    1
ARIMA(1,1,1)     31      28      25      19      30      21      16      29      65      264    3
ARIMA(0,2,1)     1       1       11      3       9       39      2       15      183     264    4
ARFIMA(1,2,1)    1       1       13      1       7       39      2       10      190     264    4
ARFIMA(2,1,2)    31      42      25      25      40      23      25      45      8       264    1
Node-wise total  253     370     421     245     356     515     198     325     749
Class I         Class II        RF-TDA (I/II/overall)  RF-DWT (I/II/overall)  RF-MODWT (I/II/overall)
AR(1)           MA(1)           0.8 / 0.76 / 0.78      0.77 / 0.83 / 0.8      0.74 / 0.86 / 0.8
AR(2)           MA(2)           0.56 / 0.52 / 0.54     0.89 / 0.9 / 0.89      0.88 / 0.85 / 0.87
ARMA(1,2)       ARMA(2,1)       0.48 / 0.49 / 0.48     0.08 / 0.05 / 0.06     0.04 / 0.02 / 0.03
ARCH(1)         GARCH(1,1)      0.65 / 0.63 / 0.64     0.57 / 0.76 / 0.66     0.55 / 0.77 / 0.66
GARCH(1,1)      GARCH(2,1)      0.59 / 0.53 / 0.56     0.75 / 0.48 / 0.62     0.71 / 0.45 / 0.58
ARIMA(0,2,1)    ARIMA(1,2,1)    0.61 / 0.39 / 0.5      0.66 / 0.63 / 0.64     0.98 / 0.99 / 0.99
ARFIMA(1,2,1)   ARFIMA(2,1,2)   0.89 / 0.93 / 0.91     0.98 / 0.99 / 0.98     0.99 / 0.98 / 0.98

Table 4: RF-TDA, RF-DWT and RF-MODWT performance (Class I / Class II / overall accuracy) on the training set of simulated data. Each class contains 264 observations.
Class I         Class II        RF-TDA (I/II/overall)  RF-DWT (I/II/overall)  RF-MODWT (I/II/overall)
AR(1)           MA(1)           0.85 / 0.82 / 0.83     0.86 / 0.83 / 0.85     0.79 / 0.85 / 0.82
AR(2)           MA(2)           0.39 / 0.58 / 0.48     0.86 / 0.98 / 0.92     0.88 / 0.95 / 0.92
ARMA(1,2)       ARMA(2,1)       0.53 / 0.56 / 0.55     0.61 / 0.76 / 0.68     0.73 / 0.71 / 0.72
ARCH(1)         GARCH(1,1)      0.67 / 0.55 / 0.61     0.79 / 0.59 / 0.69     0.74 / 0.67 / 0.7
GARCH(1,1)      GARCH(2,1)      0.38 / 0.44 / 0.41     0.2 / 0.41 / 0.3       0.58 / 0.42 / 0.5
ARIMA(0,2,1)    ARIMA(1,2,1)    0.59 / 0.2 / 0.39      0.58 / 0.27 / 0.42     0.92 / 0.86 / 0.89
ARFIMA(1,2,1)   ARFIMA(2,1,2)   0.71 / 0.97 / 0.84     0.76 / 1 / 0.88        0.79 / 1 / 0.89

Table 5: RF-TDA, RF-DWT and RF-MODWT performance (Class I / Class II / overall accuracy) on the test set of simulated data.
3    61.41%   60.75%   66.21%   65.42%
4    63.24%   61.59%   65.65%   67.36%
5    60.73%   63.15%   64.43%   66.39%
6    59.33%   66.01%   74.73%   67.64%
7    57.77%   65.44%   88.39%   67.08%
8    58.74%   64.85%   95.59%   64.72%

Table 7: 10-fold cross-validation results on the NSE stock price data used in Experiment 1 for the period 1 January 2014 to 31 December 2018.
Stock                  Sector        ARFIMA model   AIC      ARIMA model  AIC      GARCH model  AIC
Canara Bank            Public Banks  (1,1,2,-0.99)  5268.41  (0,1,0)      8752.17  (1,3)        2758.66
Punjab National Bank   Public Banks  (2,2,2,-0.73)  3398.56  (1,1,0)      6883.13  (1,3)        2290.5
Bank of India          Public Banks  (4,2,2,-0.64)  3909.56  (0,1,0)      7397.03  (4,1)        2331.42
Indian Bank            Public Banks  (3,3,2,-0.02)  4618.83  (2,1,2)      8103.77  (1,4)        2696.21

Table 8: Time series models fit on the NSE stock price data used in Experiment 1 for the period 1 January 2014 to 31 December 2018.
Stock          Sector             ARFIMA model   AIC       ARIMA model  AIC       GARCH model  AIC
BAJAJ-AUTO     Auto               (1,1,1,-0.95)  22714     (0,1,0)      12410.94  (1,4)        8927.17
EICHERMOT      Auto               (1,2,3,-0.99)  27606.58  (0,1,1)      18204.16  (4,3)        14716.89
HEROMOTOCO     Auto               (3,2,3,-0.44)  23091.63  (0,1,1)      12840.04  (1,4)        9352.59
M&M            Auto               (3,1,3,-0.13)  19410.15  (0,1,1)      9352.61   (1,4)        5863.12
TATAMOTORS     Auto               (1,1,1,-0.68)  18175     (1,1,1)      8840.97   (1,4)        5359.01
MARUTI         Auto               (1,2,3,-0.68)  24307.91  (0,1,1)      14320.07  (4,3)        10833.42
HAVELLS        Consumer Durables  (4,1,4,-0.31)  17911.01  (2,1,2)      8560.73   (4,3)        5077.23
VOLTAS         Consumer Durables  (2,1,4,-0.99)  17806.48  (0,1,1)      8706.69   (4,3)        5222.42
RAJESHEXPO     Consumer Durables  (2,1,2,-0.91)  18296.67  (0,1,1)      9224.74   (4,4)        5730.23
WHIRLPOOL      Consumer Durables  (2,1,2,-0.1)   19958.3   (2,1,2)      10651.71  (4,3)        7168.25
TITAN          Consumer Durables  (3,1,3,-0.99)  18513.24  (0,1,1)      9302.34   (2,4)        5815.73
BATAINDIA      Consumer Durables  (3,1,3,-0.94)  19206.23  (0,1,0)      9560.04   (1,4)        6076.32
PETRONET       Oil and Gas        (4,1,2,-0.93)  15649.09  (0,1,2)      6228.78   (4,3)        2733.8
RELIANCE       Oil and Gas        (2,1,2,-0.21)  19202.54  (2,1,2)      9365.36   (3,4)        5882.34
ONGC           Oil and Gas        (3,1,1,-0.86)  16280.02  (0,1,3)      6723.19   (1,4)        3235.42
IOC            Oil and Gas        (4,1,3,-0.99)  15310.17  (0,1,0)      5997.3    (4,3)        2507.68
BPCL           Oil and Gas        (3,1,4,-0.12)  17613.31  (0,1,1)      8301.91   (2,4)        4812.58
GAIL           Oil and Gas        (1,1,0,-0.95)  15361.62  (1,1,2)      5701.66   (1,4)        2216
DLF            Realty             (4,1,3,-0.58)  15877.77  (0,1,0)      7386.11   (1,4)        3903.39
OBEROIRLTY     Realty             (3,1,4,-0.48)  17613.01  (0,1,0)      8773.98   (1,4)        5269.87
IBREALEST      Realty             (2,1,2,-0.07)  14669.01  (0,1,0)      7063.83   (4,4)        3580.64
PHOENIXLTD     Realty             (3,1,2,-0.46)  18156.09  (0,1,0)      8984.45   (1,4)        5487.92
GODREJPROP     Realty             (2,1,3,-0.99)  18025.98  (3,1,2)      9184.49   (4,3)        5674.59
PRESTIGE       Realty             (4,1,4,-0.81)  16702.88  (1,1,2)      7951.93   (1,4)        4469.41
SUNTV          Media              (0,1,0,-0.99)  18785.19  (0,1,0)      10086.94  (3,4)        6604.86
ZEEL           Media              (3,1,2,-0.9)   18310.66  (1,1,1)      8454.39   (1,4)        4963.98
INOXLEISUR     Media              (4,1,2,-0.99)  16590.94  (0,1,0)      7500.1    (1,4)        4014.61
TV18BRDCST     Media              (3,1,3,-0.16)  12401.78  (2,1,2)      3688.04   (1,4)        205.56
PVR            Media              (3,1,2,-0.82)  20315.5   (3,1,3)      11014.02  (1,4)        7532.46
NETWORK18      Media              (4,0,4,-0.29)  12905.83  (0,1,0)      4406.5    (1,4)        920.75
GODREJCP       FMCG               (3,1,2,-0.99)  18614.79  (0,1,0)      9214.85   (1,4)        5719.67
HINDUNILVR     FMCG               (2,1,3,-0.07)  20312.31  (0,1,1)      10042.02  (1,4)        6556.47

Table 9: Time series models fit on the NSE stock price data used in Experiment 2 for the period 1 January 2014 to 31 December 2018.
Class I   Class II       RF-TDA (I/II/overall)    RF-DWT (I/II/overall)    RF-MODWT (I/II/overall)
Banks     Public Banks   0.89 / 0.97 / 0.93       0.75 / 0.86 / 0.81       0.71 / 0.75 / 0.73
IT        Public Banks   0.97 / 0.93 / 0.92       0.83 / 0.94 / 0.89       0.68 / 0.75 / 0.72
Pharma    Public Banks   0.96 / 0.99 / 0.97       0.85 / 0.9 / 0.88        0.72 / 0.88 / 0.8
Banks     Pharma         1 / 0.96 / 0.98          0.74 / 0.76 / 0.75       0.72 / 0.6 / 0.66
Pharma    IT             0.96 / 0.99 / 0.97       0.71 / 0.67 / 0.69       0.85 / 0.94 / 0.9
Banks     IT             0.97 / 0.94 / 0.96       0.74 / 0.74 / 0.74       0.83 / 0.89 / 0.86
Median accuracy          0.965 / 0.965 / 0.965    0.745 / 0.81 / 0.78      0.72 / 0.815 / 0.765

Table 10: The performance of RF-TDA, RF-DWT and RF-MODWT (Class I / Class II / overall accuracy) on the training set of NSE stock price data used in Experiment 1 for the period 1 January 2014 to 31 December 2018. The number of observations in each class is 72.
Class I   Class II       RF-TDA (I/II/overall)    RF-DWT (I/II/overall)    RF-MODWT (I/II/overall)
Banks     Public Banks   0.62 / 0.62 / 0.62       0.83 / 0.5 / 0.67        0.33 / 0.75 / 0.54
IT        Public Banks   0.92 / 0.62 / 0.77       0.83 / 0.5 / 0.67        0.25 / 0.91 / 0.58
Pharma    Public Banks   0.33 / 0.75 / 0.54       0.67 / 0.83 / 0.75       0.91 / 0.5 / 0.7
Banks     Pharma         0.83 / 0.83 / 0.83       0.42 / 0.67 / 0.54       0.5 / 0.5 / 0.5
Pharma    IT             1 / 0.75 / 0.88          0.5 / 0.58 / 0.54        0.75 / 0.58 / 0.66
Banks     IT             0.92 / 0.69 / 0.81       0.33 / 0.75 / 0.54       0.83 / 0.25 / 0.54
Median accuracy          0.875 / 0.72 / 0.79      0.58 / 0.625 / 0.60      0.625 / 0.54 / 0.56

Table 11: The performance of RF-TDA, RF-DWT and RF-MODWT (Class I / Class II / overall accuracy) on the test set of NSE stock price data of Experiment 1 for the period 1 January 2019 to 1 November 2019. The number of observations in each class is 12.
Class I       Class II       RF-TDA (I/II/overall)    RF-DWT (I/II/overall)    RF-MODWT (I/II/overall)
Auto          Con Dur        0.6 / 0.55 / 0.58        0.88 / 0.9 / 0.89        0.7 / 0.78 / 0.74
Auto          Oil and Gas    0.5 / 0.62 / 0.56        0.88 / 0.88 / 0.88       0.87 / 0.87 / 0.87
Auto          Realty         0.87 / 0.77 / 0.82       0.8 / 0.88 / 0.84        0.73 / 0.87 / 0.8
Auto          Media          0.67 / 0.65 / 0.66       0.75 / 0.73 / 0.74       0.75 / 0.73 / 0.74
Auto          FMCG           0.45 / 0.57 / 0.51       0.75 / 0.65 / 0.7        0.73 / 0.6 / 0.67
Con Dur       Oil and Gas    0.25 / 0.68 / 0.47       0.73 / 0.55 / 0.64       0.88 / 0.7 / 0.79
Con Dur       Realty         0.77 / 0.67 / 0.72       0.67 / 0.7 / 0.68        0.7 / 0.7 / 0.7
Con Dur       Media          0.6 / 0.57 / 0.58        0.8 / 0.67 / 0.73        0.85 / 0.47 / 0.66
Con Dur       FMCG           0.62 / 0.7 / 0.66        0.77 / 0.67 / 0.72       0.7 / 0.72 / 0.71
Oil and Gas   Realty         0.9 / 0.77 / 0.83        0.62 / 0.78 / 0.7        0.63 / 0.83 / 0.73
Oil and Gas   Media          0.62 / 0.8 / 0.71        0.65 / 0.75 / 0.7        0.67 / 0.73 / 0.7
Oil and Gas   FMCG           0.6 / 0.72 / 0.66        0.68 / 0.7 / 0.69        0.68 / 0.7 / 0.69
Realty        Media          0.7 / 0.47 / 0.58        0.47 / 0.75 / 0.61       0.48 / 0.82 / 0.65
Realty        FMCG           0.85 / 0.8 / 0.83        0.65 / 0.78 / 0.72       0.67 / 0.77 / 0.72
Media         FMCG           0.8 / 0.73 / 0.77        0.7 / 0.5 / 0.6          0.72 / 0.45 / 0.58
Median accuracy              0.62 / 0.68 / 0.66       0.73 / 0.73 / 0.7        0.7 / 0.73 / 0.71

Table 12: The performance of RF-TDA, RF-DWT and RF-MODWT (Class I / Class II / overall accuracy) on the training set of NSE stock price data of Experiment 2 for the period 1 January 2019 to 1 November 2019. The number of observations in each class is 12.
Class I       Class II       RF-TDA (I/II/overall)    RF-DWT (I/II/overall)    RF-MODWT (I/II/overall)
Auto          Con Dur        0.17 / 0 / 0.08          0.83 / 0.83 / 0.83       0.67 / 0.5 / 0.58
Auto          Oil and Gas    0.33 / 0.83 / 0.58       0.83 / 0.67 / 0.75       0.83 / 0.83 / 0.83
Auto          Realty         0.5 / 1 / 0.75           0.83 / 0.67 / 0.75       0.83 / 0.67 / 0.75
Auto          Media          0.17 / 1 / 0.58          0.67 / 0.83 / 0.75       0.83 / 0.67 / 0.75
Auto          FMCG           1 / 0.83 / 0.92          0.83 / 0.5 / 0.67        0.67 / 0.5 / 0.58
Con Dur       Oil and Gas    0 / 0.67 / 0.33          0.83 / 1 / 0.92          1 / 0.67 / 0.83
Con Dur       Realty         1 / 0.83 / 0.92          0.17 / 1 / 0.58          1 / 0.5 / 0.75
Con Dur       Media          1 / 1 / 1                0.17 / 0.83 / 0.5        1 / 0.5 / 0.75
Con Dur       FMCG           0.17 / 1 / 0.58          0.67 / 0.67 / 0.67       0.33 / 0.83 / 0.58
Oil and Gas   Realty         0.83 / 0.67 / 0.75       0.67 / 1 / 0.83          0.5 / 0.83 / 0.67
Oil and Gas   Media          1 / 0.5 / 0.75           0.83 / 0.5 / 0.67        0.5 / 0.83 / 0.67
Oil and Gas   FMCG           0.67 / 0.83 / 0.75       0.83 / 0.67 / 0.75       0.67 / 0.67 / 0.67
Realty        Media          0.17 / 0.5 / 0.33        0.67 / 0.67 / 0.67       0.33 / 1 / 0.67
Realty        FMCG           1 / 1 / 1                0.67 / 0.5 / 0.58        0.67 / 0.5 / 0.58
Media         FMCG           1 / 1 / 1                0.83 / 0.33 / 0.58       0.67 / 0.33 / 0.5
Median accuracy              0.67 / 0.83 / 0.75       0.83 / 0.67 / 0.67       0.67 / 0.67 / 0.67

Table 13: The performance of RF-TDA, RF-DWT and RF-MODWT (Class I / Class II / overall accuracy) on the test set of NSE stock price data of Experiment 2 for the period 1 January 2019 to 1 November 2019. The number of observations in each class is 12.
Method          H0       H0*
RF-DWT          0.627    0.625
knn-DTW         0.509    0.554
knn-Euclidean   0.568    0.463
RF-MODWT        0.839    0.842

Table 14: p-values for tests of hypotheses H0 and H0* about the performance of RF-TDA using results on simulated data.
Method          H0       H0*
RF-DWT          0.0364   0.0969
knn-DTW         0.0019   0.0479
knn-Euclidean   0.0316   0.0492
RF-MODWT        0.175    0.1239

Table 15: p-values for tests of hypotheses H0 and H0* about the performance of RF-TDA using results on real data in Experiments 1 and 2 taken together.
Class          RF-TDA     RF-TDA   RF-MODWT   RF-MODWT   RF-DWT     RF-DWT
               training   test     training   test       training   test
Banks          0.94       0.5      0.76       0.67       0.76       0.67
Pharma         0.99       0.75     0.76       0.67       0.76       0.67
IT             1          1        0.85       0.75       0.85       0.75
Public Banks   0.97       0.75     0.96       0.5        0.96       0.5

Table 17: Class-wise training and test accuracy of RF-TDA, RF-MODWT and RF-DWT for the multi-class classification in Experiment 1.
Class          RF-TDA     RF-TDA   RF-MODWT   RF-MODWT   RF-DWT     RF-DWT
               training   test     training   test       training   test
Auto           0.67       0.17     0.67       0.67       0.68       0.83
Con Dur        0.6        0.17     0.6        0.33       0.65       0.83
FMCG           0.73       0.67     0.73       0.67       0.73       0.67
Media          0.77       0.5      0.77       0.5        0.75       0.17
Oil and Gas    0.38       0.83     0.38       0.33       0.38       0.5

Table 18: Class-wise training and test accuracy of RF-TDA, RF-MODWT and RF-DWT for the multi-class classification in Experiment 2.
Figure 1: (a) Projection of points sampled from S^3; (b) persistence diagram of the points sampled from S^3; (c) persistence landscape of the points sampled from S^3.
Figure 2: (a) The time series of Indian Bank from 1 January 2014 to 31 December 2018; (b) time delay reconstruction of the time series.
Figure 3: Persistence landscapes of the Pharma constituents Cipla, Divi, Sun and Dr. Reddy, generated for the period 1 January 2014 to 31 December 2018. The line types denote the landscapes of the subseries generated with window size 100.
Figure 4: Persistence landscapes of the Public Banks constituents Bank of Baroda, Canara Bank, Punjab National Bank and State Bank of India, generated for the period 1 January 2014 to 31 December 2018. The line types denote the landscapes of the subseries generated with window size 100.
Highlights
New methods for time series clustering (SOM-TDA) and classification (RF-TDA).
RF-TDA outperforms other methods on the classification task.
Dependence of stock price movements on sectors in NSE is revealed using RF-TDA.
Credit Author Statement
Arnab Kumar Laha: Conceptualization, Methodology, Formal Analysis, Writing - Review and Editing,
Supervision
Conflict of Interest Statement
Declaration of interests
☒ The authors declare that they have no known competing financial interests or personal relationships
that could have appeared to influence the work reported in this paper.
☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: