
Journal Pre-proof

Clustering and classification of time series using topological data analysis


with applications to finance

Sourav Majumdar, Arnab Kumar Laha

PII: S0957-4174(20)30676-X
DOI: https://doi.org/10.1016/j.eswa.2020.113868
Reference: ESWA 113868

To appear in: Expert Systems With Applications

Received date : 20 March 2020


Revised date : 12 July 2020
Accepted date : 7 August 2020

Please cite this article as: S. Majumdar and A.K. Laha, Clustering and classification of time series using topological data analysis with applications to finance. Expert Systems With Applications (2020), doi: https://doi.org/10.1016/j.eswa.2020.113868.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the
addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive
version of record. This version will undergo additional copyediting, typesetting and review before it
is published in its final form, but we are providing this version to give early visibility of the article.
Please note that, during the production process, errors may be discovered which could affect the
content, and all legal disclaimers that apply to the journal pertain.

© 2020 Published by Elsevier Ltd.



Clustering and classification of time series using topological data analysis with applications to finance

Sourav Majumdar
[email protected]

Arnab Kumar Laha∗
[email protected]

Indian Institute of Management Ahmedabad, India

Abstract
In this paper, we propose new methods for time series classification and clustering. These methods are based on techniques of Topological Data Analysis (TDA), such as persistent homology and time delay embedding, for analyzing time-series data. We present a new clustering method, SOM-TDA, and a new classification method, RF-TDA, based on TDA. Using SOM-TDA we examine the topological similarities and dissimilarities of some well-known time-series models used in finance. We also use RF-TDA to examine whether topological features can be used to distinguish between time series models using simulated data. The performance of RF-TDA on the classification task is compared against three other classification methods. We also consider an application of RF-TDA to financial time series classification using real-life price data of stocks belonging to different sectors. RF-TDA is seen to perform quite well in the two experiments based on real-life stock-price data. This implies that the topological features of the time series of stock prices in the different sectors are not identical and have distinctive features that can be discerned through the use of TDA. We also briefly consider multi-class classification using RF-TDA.

Keywords— Persistent homology; Time delay embedding; Takens theorem; Random Forest; Self-Organizing Maps

∗Corresponding author

1 Introduction

In this article, we study the problem of financial time series classification. Given a collection of time series and labels such that each series has a unique label, we are interested in learning from the training data and using it to predict the label of a new unlabelled time series. We utilise the method of topological data analysis (TDA) to perform time series classification. This is based on the recently developed framework of persistent homology. The objective of TDA is to obtain information from a point cloud about its topological features, such as connectedness, loops, voids and other higher-dimensional analogues. This information is summarised by a persistence diagram or other transforms of it. The only requirement in TDA is the existence of a pairwise distance on the point cloud, which in the present case is the time delay embedding of the time series (see section 2.2 for details). This is done by constructing a sequence of simplicial complexes from the point cloud at various resolutions. One then computes Betti numbers for each of these complexes. A k-dimensional Betti number measures a k-dimensional hole, and we track the Betti numbers across the sequence to identify which topological features persist. Topological features are robust, in that small perturbations of the data cause only small changes in the output of the method. We describe the method more formally and in further detail in section 2.1 below.

In this article, we analyse univariate time series, which are one-dimensional; to compute meaningful topological features we require a higher-dimensional point cloud. We use the method of time delay embedding to construct a higher-dimensional point cloud from a univariate time series. We then extract topological summaries of this point cloud and perform classification using them. Since the Takens theorem (see Theorem 2.1 below) connects the time delay reconstruction to the attractor space of the underlying dynamical system, our technique works with these attractors to perform classification (see section 2.2 for more details).

Time delay embedding methods have been applied to time series analysis for a long time. de Silva et al., 2012 first considered analysing the delay map using persistent homology. Pereira and de Mello, 2015 used persistent homology to perform time series clustering by computing summary statistics from the persistence diagrams, such as the mean of the birth and death times. Perea and Harer, 2015 prove a convergence theorem for quantifying periodicity in time series of trigonometric polynomials using persistent homology. Truong, 2017 studies a high-frequency financial time series: using TDA, they perform a time delay embedding, reduce it to 3 dimensions using principal component analysis (PCA) and compute persistent homology. They then aim to show, by computing various measures, that it is topologically different from quantum noise. Umeda, 2017 proposes a TDA-based method for classifying time series, using a different approach from the one presented in this paper: they compute various summary statistics from the persistence diagram of the delay embedding and then use a convolutional neural network for classification. Gidea and Katz,

2018 analyse crashes in financial time series by computing persistence landscapes and comparing their L2 norms. Gidea, Goldsmith, et al., 2018 additionally use these L2 distances between persistence landscapes to perform k-means clustering. Goel et al., 2020 apply a clustering scheme based on norms of TDA landscapes to perform portfolio selection. Kim et al., 2018 prove that the bottleneck distance between the persistence diagrams of a point cloud and its PCA-reduced version is 0. They consider a similar algorithm for featurization of time series, but ours uses different methods for parameter selection and focuses on application to time series classification and clustering. Some other non-financial applications of TDA to time series data include Berwald and Gidea, 2014; Lum et al., 2013; Perea, Deckard, et al., 2015.

In this article, we present two new methods, SOM-TDA and RF-TDA, one for clustering time series and the other for classifying time series, respectively. We examine both these techniques using extensive simulated data and derive useful insights. Further, we illustrate the proposed classification technique using real-life financial time series data of different stocks. We have taken the labels of the time series to be the sectors to which they belong, as we anticipate that the price movements of a stock may depend on the sector to which it belongs. In fact, in this work we show that stock price movements can be used for predicting the sector to which a stock belongs. We believe this could lead to a greater theoretical understanding of stock price formation and the role of sector-induced topology in it.

As mentioned above, we also show the efficacy of RF-TDA on simulated data arising from various stochastic processes. We aim to understand topological similarities and differences in different state space models like AR, MA, ARMA, ARIMA, ARCH, GARCH and ARFIMA. Obtaining such an understanding would have an application in reducing the search space for model choice. The models are clustered based on their topology, and a suitable ensemble model could be constructed from these clusters.

For simulated data, we find that an ensemble of the methods considered has the best performance in time series classification by overall accuracy. We cluster simulated data using the clustering scheme SOM-TDA. We observe that the clustering output can be used to explain the classification performance of RF-TDA. On the NSE stock price data, we find that RF-TDA has the highest accuracy among all methods considered.
The remainder of the article is structured as follows. In section 2 we review the theoretical concepts and algorithms on which the considered methods are based; we also describe the time series models that we use for simulation. In section 3 we describe in detail the methods we propose. In subsection 3.1 we demonstrate the clustering performance of SOM-TDA on the simulated data; here we also discuss potential applications of this method to model selection. In subsection 3.2 we consider the classification problem on the simulated dataset, where we predict the time series model from the data. In subsection 3.3 we compare the classification performance of RF-TDA with the other considered methods. In subsection 3.4 we examine the performance of RF-TDA and the other methods in the context of multi-class classification using the simulated data. In section 4 we consider NSE stock price data and demonstrate the performance of RF-TDA and the other methods in time series classification by performing two experiments on separate datasets. We report the results of Experiments 1 and 2 in subsections 4.1 and 4.2 respectively. In subsection 4.3 we analyse the performance of the methods in Experiments 1 and 2. In subsection 4.4 we perform multi-class classification using the methods considered here on all the classes of both experiments. In section 5 we conclude the article and discuss further possible research directions.

2 Background

2.1 Topological data analysis

Topological data analysis is based on the theory of persistent homology; see Edelsbrunner, Letscher, et al., 2002; Zomorodian and Carlsson, 2005. This is a multi-resolution version of simplicial homology theory; see Chapters 1 and 2 of Munkres, 2018 for a reference on simplicial homology. For a more detailed exposition of persistent homology refer to Edelsbrunner and Harer, 2010.

The set of vectors $\{x_0, \cdots, x_n\}$ in a vector space $V$ is said to be affinely independent if $\sum_{i=0}^{n} c_i = 0$ and $\sum_{i=0}^{n} c_i x_i = 0$ together imply that $c_0 = \cdots = c_n = 0$. Let $\{y_0, \cdots, y_n\} \subseteq V$; then $S$ is said to be a convex set in $V$ if it is of the form $S = \{y : y = \sum_{i=0}^{n} a_i y_i,\ a_i \geq 0\ \forall\, i \in \{0, \cdots, n\} \text{ and } \sum_{i=0}^{n} a_i = 1\}$. An $n$-simplex is the smallest convex set that contains $n + 1$ affinely independent points.

Definition 2.1. A simplicial complex $K$ is a collection of simplices in $\mathbb{R}^n$ such that:

1. For a simplex $S \in K$, a simplex $S'$ spanned by a subset of the affinely independent points spanning $S$ also lies in $K$.

2. The intersection of any two simplices of $K$ is spanned by a subset of the affinely independent points of each simplex.

Consider a point cloud $X$. The first step in TDA is to construct a simplicial complex over the point cloud. There are several systematic ways to do this, one of them being the Rips complex. To construct a Rips complex $L_\epsilon(X)$, one creates a neighborhood of radius $\epsilon > 0$ around each point in the point cloud. Then, if the neighborhoods of two points intersect, we create an edge between those two points. We observe that for $\epsilon \to \infty$ all points would have an edge between each other, and for $\epsilon \to 0$ the complex would have no edges. A filtration $F$ of a complex $K$ is a collection of nested subcomplexes $\phi = K_0 \subset \cdots \subset K_n = K$. One creates a filtration $F$ of complexes by varying $\epsilon$ from suitably small to large values: $L_{\epsilon'}(X)$ would be a subcomplex of $L_{\epsilon''}(X)$ for $\epsilon' < \epsilon''$.


A $p$-dimensional chain $c_p$ maps a $p$-dimensional simplex in $K$ to a field $F$, such that the sum of chains over opposite orientations is 0 and $c_p$ is defined to be 0 over non-$p$-dimensional simplices. $C_p(K)$, the collection of $p$-dimensional chains, is an additive free abelian group. Then there is a boundary operator $\partial_p : C_p(K) \to C_{p-1}(K)$, which is a homomorphism (Munkres, 2018, p. 28).

Definition 2.2. The kernel of $\partial_p : C_p(K) \to C_{p-1}(K)$ is called the group of $p$-cycles and is denoted $Z_p(K)$. The image of $\partial_{p+1} : C_{p+1}(K) \to C_p(K)$ is called the group of $p$-boundaries and is denoted $B_p(K)$. The $p$th homology group of $K$ is defined by

$$H_p(K) = Z_p(K)/B_p(K) \qquad (1)$$

The rank of the $k$-dimensional homology group $H_k$ is the $k$-dimensional Betti number (Munkres, 2018, p. 30). One then computes the $k$-dimensional Betti numbers over each of these complexes, which represent the number of $k$-dimensional holes in the complex. The 0-dimensional Betti number counts the number of connected components, the 1-dimensional Betti number counts the number of loops in the complex, and the 2-dimensional Betti number counts voids in the complex; higher-dimensional Betti numbers are harder to interpret visually.
The filtration $F$ induces a homomorphism over the $k$-dimensional homology groups across the various complexes. This allows us to measure when a certain class is born in the filtration and when it disappears; these times are denoted the birth time and death time respectively, i.e. the indices of the filtration at which a certain feature appears and disappears. While computing persistent homology we are often more interested in how long a certain topological feature persists in the filtration than in the precise Betti numbers. This is because a longer persistence (the difference of death time and birth time) indicates robustness, i.e. that the feature under consideration may actually be present in the topological space from which the data is obtained. The final output of the TDA procedure is a multiset, for each dimension, consisting of birth and death times. This is called a persistence diagram. It is plotted by drawing birth times on the x-axis and death times on the y-axis; since the death time of a class succeeds its birth time, points further away from the line $y = x$ denote relatively robust features. Persistence diagrams form a metric space under the Wasserstein metric (see Mileyko et al., 2011 for further discussion of theoretical aspects of persistence diagrams).
TDA is suitable for noisy data, as the persistence diagram is known to be stable under perturbation (Cohen-Steiner et al., 2007): for small errors, the change in the distance between persistence diagrams is bounded by the size of the perturbation.
The persistence diagram is not a suitable summary for statistical analysis because of its multiset structure. It is also known that persistence diagrams do not have a unique Fréchet mean (Mileyko et al., 2011). An alternative statistical summary is the persistence landscape (Bubenik, 2015). In this article, we consider the mean persistence landscape. With each birth-death pair $(b, d)$ in a persistence diagram $\mathrm{Dgm}$ we associate a functional

$$\beta_{(b,d)}(t) = \begin{cases} t - b & t \in [b, \frac{b+d}{2}] \\ d - t & t \in [\frac{b+d}{2}, d] \end{cases}$$

The persistence landscape is defined as $\lambda_k(t)$, the $k$th largest value among all $\beta_{(b,d)}(t)$. In this article we set $k = 1$. Persistence landscapes form a Banach space (Bubenik, 2015). They have a unique mean, and one can compute $L^p$ norms, which is not possible with persistence diagrams. We show a sample output from TDA in Figure 1.
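The functional above translates directly into code. The following is our own minimal transcription (hypothetical helper names, not the paper's implementation): each birth-death pair contributes a "tent" function peaking at the midpoint of its interval, and $\lambda_k(t)$ is the $k$th largest tent value at $t$:

```python
def tent(b, d, t):
    """The functional beta_{(b,d)}(t): a tent rising on [b, (b+d)/2],
    falling on [(b+d)/2, d], and 0 elsewhere."""
    if b <= t <= (b + d) / 2:
        return t - b
    if (b + d) / 2 <= t <= d:
        return d - t
    return 0.0

def landscape(pairs, t, k=1):
    """lambda_k(t): the k-th largest tent value over all birth-death pairs."""
    vals = sorted((tent(b, d, t) for (b, d) in pairs), reverse=True)
    return vals[k - 1] if k <= len(vals) else 0.0

pairs = [(0.0, 2.0), (1.0, 3.0)]
print(landscape(pairs, 1.0))   # 1.0: the tent of (0, 2) peaks at t = 1
```

With $k = 1$, as used in this article, the landscape is simply the pointwise maximum of the tents.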
The features from TDA can be interpreted to give an idea about the possible shape of the underlying data. A longer interval between birth and death provides strong evidence for the presence of the particular topological feature. Consider the example illustrated in Figure 1. The Betti numbers of $S^3$ are $\beta_0 = 1$, $\beta_1 = 0$, $\beta_2 = 0$, $\beta_3 = 1$. In the example, data has been sampled from $S^3$ and TDA has been applied to it. We see in the figure that after features of dimensions 1 and 2 have died off, features of the true dimension 3 appear. These birth and death times of the various features provide an understanding of the topology of the data.
lP
2.2 Time delay embedding

TDA is useful when the input data is a point cloud in more than one dimension. Since we consider univariate time series data, we use a method by which one may embed the time series in a higher-dimensional space. For a given series $X(t)$, $t = 1, \cdots, n$, one can construct a delay map $Y(t) = [X(t), X(t-\tau), X(t-2\tau), \cdots, X(t-d\tau)]$, where $\tau$ is the lag parameter and $d$ is the dimension parameter. Takens (Takens, 1981) shows that such a reconstruction, for appropriate values of $\tau$ and $d$, is topologically equivalent to the attractor space of the dynamical system generating the series.
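A minimal sketch of the delay map (our own illustration; function and variable names are hypothetical): each valid time index $t$ becomes one point with coordinates $[X(t), X(t-\tau), \cdots, X(t-d\tau)]$:

```python
def delay_embedding(x, tau, d):
    """Delay map of a univariate series x: for each t where all lags
    exist, emit the point [x[t], x[t - tau], ..., x[t - d*tau]]."""
    return [
        [x[t - i * tau] for i in range(d + 1)]
        for t in range(d * tau, len(x))
    ]

x = [0, 1, 2, 3, 4, 5]
print(delay_embedding(x, tau=2, d=2))   # [[4, 2, 0], [5, 3, 1]]
```

Each point has $d + 1$ coordinates, so a series of length $n$ yields a point cloud of $n - d\tau$ points in $\mathbb{R}^{d+1}$.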
We state the theorem in the form given in Perea, 2019.

Theorem 2.1 (Takens embedding theorem, 1981). Let $M$ be a smooth, compact manifold. Let $\tau > 0$ be a real number and let $d \geq 2\dim(M)$ be an integer. Let $\Phi$ be an observation function on the dynamical system. Then for generic $\Phi \in C^2(\mathbb{R} \times M, M)$ and $F \in C^2(M, \mathbb{R})$ and $\varphi_p(t)$, we have that the delay map $\varphi : p \to (\varphi_p(0), \varphi_p(\tau), \cdots, \varphi_p(d\tau))$ is an embedding. (Generic here refers to $\varphi, F$ being open and dense in the $C^1$ topology.)

This result is applicable to deterministic dynamical systems; however, this may not be the case for dynamical systems with stochasticity. In fact, the notion of an attractor is quite different in a random dynamical system. While financial stock price data is generally modelled as a stochastic process, it is known that certain deterministic dynamical systems in fact have characteristics similar to a stochastic system (see the logistic map in Goodson, 2016). In this article, we use this insight and proceed to use the Takens embedding theorem to reconstruct the time series for further work. A similar view is taken in the works of Goel et al., 2020; Kim et al., 2018; Umeda, 2017.
Time delay embedding requires the parameters $d$ and $\tau$. To the best of our knowledge, there is no known optimal way to estimate $d$ and $\tau$, though there are several heuristics through which one may do this. For the time lag $\tau$, most rules are based on analysing the autocorrelation function or the auto mutual information function; for example, in one approach the lag is chosen as the value at which the ACF first falls below $1/e$. There are several methods to estimate the dimension $d$, such as singular value decomposition (Broomhead and King, 1986) and the method of false nearest neighbors (Kennel et al., 1992). One can always choose a very large dimension, but it may not be efficient to do so. In this paper we estimate the minimum dimension using Cao's algorithm (Cao, 1997).

We show in Figure 2 a 3-dimensional embedding of a stock.

2.3 Stochastic processes

In the next section we test our proposed methods against simulated data. We briefly describe the stochastic processes used for simulation below for easy reference.
1. Autoregressive process: The autoregressive process model AR(p) is a weakly stationary stochastic process given by

$$X_t = c + \sum_{i=1}^{p} a_i X_{t-i} + \epsilon_t \qquad (2)$$

where $a_1, \cdots, a_p$ are parameters of the model, $c$ is a constant, and $\epsilon_t \sim$ White Noise (Shumway and Stoffer, 2017, p. 77).
2. Moving average process: The moving average process model MA(q) is a weakly stationary stochastic process given by

$$X_t = c^* + \epsilon_t + \sum_{i=1}^{q} b_i \epsilon_{t-i} \qquad (3)$$

where $b_1, \cdots, b_q$ are parameters of the model, $c^*$ is a constant, and $\epsilon_t, \cdots, \epsilon_{t-q} \sim$ White Noise (Shumway and Stoffer, 2017, p. 77).
3. (Generalized) autoregressive conditional heteroskedasticity process: Let there be a time series following an AR(p) process whose errors can be represented as

$$\epsilon_t = \sigma_t a_t \qquad (4)$$

where $a_t \sim$ White Noise and $\sigma_t$ is the time-dependent standard deviation. In the ARCH(p) process, $\{\sigma_t\}$ is modelled as

$$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{p} \alpha_i \epsilon_{t-i}^2 \qquad (5)$$

Here $\alpha_0, \alpha_1, \cdots, \alpha_p$ are model parameters. If the time series follows an ARMA(p,q) process, then the error is instead modelled by a GARCH(p,q) process, where p is the order of the GARCH terms and q is the order of the ARCH terms:

$$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^{p} \beta_i \sigma_{t-i}^2 \qquad (6)$$

Here $\alpha_1, \cdots, \alpha_q, \beta_1, \cdots, \beta_p$ are model parameters (Shumway and Stoffer, 2017, p. 253).
4. Autoregressive fractionally integrated moving average process model (ARFIMA) (Granger and Joyeux, 1980): Let $d$ be the backshift operator, $dX_t = X_{t-1}$, so that $(1-d)X_t = X_t - X_{t-1}$ is the first difference. Then

$$(1-d)^m = \sum_{i=0}^{\infty} (-1)^i \binom{m}{i} d^i = \sum_{i=0}^{\infty} (-1)^i \frac{m!}{(m-i)!\,i!}\, d^i \qquad (7)$$

In the ARFIMA case, $m = m_0 + h$, where $m_0$ is a non-negative integer and $h \in (-1, 1)$. The ARFIMA(p,m,q) model is specified as

$$\left(1 - \sum_{i=1}^{p} a_i d^i\right)(1-d)^m X_t = \left(1 + \sum_{j=1}^{q} b_j d^j\right)\epsilon_t \qquad (8)$$

For $h = 0$, ARFIMA reduces to the autoregressive integrated moving average model (ARIMA).
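For concreteness, the two simplest of these processes, Eqs. (2) and (3), can be simulated in a few lines. This is our own illustrative sketch with Gaussian white noise and hypothetical parameter values, not the simulation setup used later in the paper:

```python
import random

def simulate_ar(a, c, n, seed=0):
    """Simulate AR(p): X_t = c + sum_i a_i * X_{t-i} + eps_t (Eq. 2)."""
    rng = random.Random(seed)
    p = len(a)
    x = [0.0] * p                      # zero initial conditions
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        x.append(c + sum(a[i] * x[-1 - i] for i in range(p)) + eps)
    return x[p:]                       # drop the initial zeros

def simulate_ma(b, c, n, seed=0):
    """Simulate MA(q): X_t = c* + eps_t + sum_i b_i * eps_{t-i} (Eq. 3)."""
    rng = random.Random(seed)
    q = len(b)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n + q)]
    return [c + eps[t] + sum(b[i] * eps[t - 1 - i] for i in range(q))
            for t in range(q, n + q)]

series = simulate_ar([0.5], c=0.0, n=1200)   # e.g. an AR(1) path of length 1200
print(len(series))   # 1200
```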
2.4 Self-Organizing Maps

The self-organizing map (SOM) is a neural network model proposed by Kohonen (Kohonen, 1982). In this model a higher-dimensional vector is mapped onto a two-dimensional lattice, where each unit is called a neuron. One advantage of self-organizing maps is that they preserve topology if there is a sufficiently large number of nodes (Kohonen, 1982). Self-organizing maps have been used in both supervised and unsupervised settings.

In this article we use self-organizing maps to perform clustering of time series. Each vector in the dataset is mapped to a node, and these nodes are then used as cluster identifiers. The reader may refer to Yin, 2008 for a review of SOM and further details.
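The core SOM update can be sketched from scratch as follows. This is a toy version of our own (hypothetical names; a real analysis would use a mature library implementation): the best-matching node and its grid neighbours are pulled toward each training vector, with the learning rate and neighbourhood radius decaying over time:

```python
import math, random

def train_som(data, rows, cols, iters=500, lr0=0.5, seed=0):
    """Minimal SOM: each grid node holds a weight vector; the best-matching
    unit (BMU) and its grid neighbours move toward each sampled vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = {(r, c): [rng.random() for _ in range(dim)]
         for r in range(rows) for c in range(cols)}
    for t in range(iters):
        x = data[rng.randrange(len(data))]
        lr = lr0 * (1 - t / iters)                         # decaying learning rate
        sigma = max(rows, cols) / 2 * (1 - t / iters) + 0.5  # shrinking radius
        bmu = min(w, key=lambda n: sum((w[n][k] - x[k]) ** 2 for k in range(dim)))
        for n in w:   # Gaussian neighbourhood update on the grid
            g = math.exp(-((n[0] - bmu[0]) ** 2 + (n[1] - bmu[1]) ** 2)
                         / (2 * sigma ** 2))
            for k in range(dim):
                w[n][k] += lr * g * (x[k] - w[n][k])
    return w

def map_to_node(w, x):
    """Cluster identifier of x: the grid node with the closest weight vector."""
    return min(w, key=lambda n: sum((w[n][k] - x[k]) ** 2 for k in range(len(x))))
```

After training, `map_to_node` plays the role described above: each input vector is assigned to a node, and the nodes serve as cluster identifiers.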

2.5 Dynamic time warping

Dynamic time warping (DTW) (Berndt and Clifford, 1994) is a time series similarity measure. It is considered a benchmark method in the time series classification literature; see Bagnall et al., 2017. DTW allows us to measure the distance between two time series even when they are of unequal length.

Let $T_1, T_2$ be two time series of lengths $n, m$ respectively. Let $D$ be the pointwise distance matrix between $T_1$ and $T_2$. A path in the matrix $D$ is a collection of pairs $\{(i, j) : i \in T_1, j \in T_2\}$ subject to certain warping conditions; see Bagnall et al., 2017 for details. The objective is to find the length of the shortest path beginning at $(1, 1)$ and terminating at $(n, m)$. This length is the DTW distance between the two time series.
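The shortest warping path is found by dynamic programming over the matrix $D$. Below is our own minimal sketch (not the benchmark implementation from the literature), using the absolute difference as the pointwise distance:

```python
def dtw(t1, t2):
    """DTW distance between series of lengths n and m via dynamic
    programming: cost[i][j] is the cheapest warp of prefixes of length i, j."""
    n, m = len(t1), len(t2)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(t1[i - 1] - t2[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in t1 only
                                 cost[i][j - 1],      # step in t2 only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))   # 0.0: warping absorbs the repeated 2
```

Note that the two series need not have equal length, which is the property highlighted above.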

2.6 Discrete wavelet transform

Another method we consider for comparison purposes is the discrete wavelet transform; see Batal and Hauskrecht, 2009. Wavelet transforms bypass the periodicity requirements of the Fourier transform.

We use Haar wavelets, apply the Haar decomposition to the time series, and use the resulting Haar coefficients as features for classification. Let $x$ be a time series of length $n$. If $n \neq 2^l$, $l \in \mathbb{N}$, then we pad zeros at the end of the time series so that its length is a power of two. The Haar coefficients are calculated as (Batal and Hauskrecht, 2009)

$$d_{l,i} = \frac{1}{\sqrt{2}}\,(s_{l-1,2i} - s_{l-1,2i+1}) \qquad (9)$$

$$s_{l,i} = \frac{1}{\sqrt{2}}\,(s_{l-1,2i} + s_{l-1,2i+1}) \qquad (10)$$

where $l = 1, \cdots, \log_2 n$ are called the levels of the time series, $i = 1, \cdots, \frac{n}{2^l}$, and $s_{0,i} = x_i$. $d_1, \cdots, d_l$ are called the level coefficients and $s_{\log_2 n, 0}$ is called the scaling coefficient. We create a vector of all level coefficients and the scaling coefficient to use as time series features for our problem.
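Equations (9) and (10) can be applied level by level. The sketch below (our own, with a hypothetical function name) pads the series with zeros to a power-of-two length and collects all level coefficients plus the scaling coefficient:

```python
import math

def haar_features(x):
    """Haar DWT features: zero-pad x to a power-of-two length, then
    repeatedly split into detail (d, Eq. 9) and smooth (s, Eq. 10) parts."""
    n = 1
    while n < len(x):
        n *= 2
    s = list(x) + [0.0] * (n - len(x))
    details = []
    while len(s) > 1:
        d = [(s[2 * i] - s[2 * i + 1]) / math.sqrt(2) for i in range(len(s) // 2)]
        s = [(s[2 * i] + s[2 * i + 1]) / math.sqrt(2) for i in range(len(s) // 2)]
        details.extend(d)
    return details + s   # all level coefficients plus the scaling coefficient

feats = haar_features([4.0, 2.0, 1.0, 3.0])
print(feats)   # approximately [sqrt(2), -sqrt(2), 1.0, 5.0]
```

A length-$n$ input yields $n$ coefficients in total (after padding), so the feature vector has the same size as the padded series.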


We also use an undecimated discrete wavelet transform called the Maximal Overlap Discrete Wavelet Transform (MODWT). MODWT aims to bypass sensitivity issues of the DWT, such as the dependence on the starting point of the time series, and it also overcomes the length requirement of the DWT. For further details see Maharaj and Alonso, 2007; Percival and Walden, 2000; Zhao et al., 2018.
2.7 Random Forest

Random Forest (Breiman, 2001) is an improvement on decision tree learning. Random Forests have become popular because they overcome an important problem of decision trees, namely overfitting the training data. Random Forests differ from decision trees in two major aspects. First, bootstrap aggregation is used to generate a simple random sample (with replacement) of the training data for each of the $n$ trees, and for classification a majority vote over the trees gives the final output. Second, each tree uses a random subset of the features. This helps to avoid correlation between the trees, leading to results that are more generalizable.
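The two randomisation ingredients described above can be illustrated in isolation. This is our own sketch of bootstrap sampling, feature subsetting and majority voting (hypothetical names; it is not a full tree learner and not the implementation used in the paper):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Simple random sample with replacement, same size as the data:
    the training set handed to one tree."""
    return [data[rng.randrange(len(data))] for _ in data]

def feature_subset(n_features, rng):
    """Random subset of feature indices considered by one tree
    (roughly sqrt(n_features), a common default for classification)."""
    k = max(1, int(n_features ** 0.5))
    return rng.sample(range(n_features), k)

def majority_vote(predictions):
    """Final classification output: most common label among the trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
print(majority_vote(["A", "B", "A"]))            # A
print(len(bootstrap_sample([1, 2, 3, 4], rng)))  # 4
```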


3 Method & Experiment

Let $X$ be a collection of $k$ time series. Thus, $X = \{x_i : i \in \{1, \ldots, k\}\}$ where each $x_i$ is a time series of length $n_i$. The time series clustering problem is to divide the $k$ time series into $s$, $s < k$, groups based on some notion of homogeneity of the time series in each group. In this article, we present a method for clustering time series based on their topological similarity.

Now suppose each time series $x_i$ is associated with a label $y_i$, and let $Y$ be the collection of all these labels. Assume that the number of distinct labels in $Y = \{y_j : j \in \{1, \ldots, k\}\}$ is $\ell$. The time series classification problem is to predict the label of a time series whose label is unknown, based on information derived from a training set of labelled time series, i.e. from the set of pairs $P = \{(x_i, y_i) : x_i \in X, y_i \in Y\}$.
Let $w \in \mathbb{N}$, and assume that the length of each time series is an integral multiple of $w$. We break a given series into several sub-series of equal length $w$. This is helpful because TDA is computationally expensive. Another advantage of dividing the series into several sub-series is that it potentially allows for different time-delay embeddings, which may be appropriate, for example, if a change point is present. A single time-delay embedding for a long time series implies the belief that the entire data was generated by a single dynamical system, which may not always be the case (Ang and Timmermann, 2012). We give all the sub-series of a given series the same label as the parent series.
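The windowing and label-propagation step can be sketched as follows (our own illustration with hypothetical names):

```python
def windows_with_labels(series_list, labels, w):
    """Split each labelled series into consecutive length-w sub-series;
    every sub-series inherits the label of its parent series.
    (Any remainder shorter than w is dropped; the text assumes the
    length is an integral multiple of w, so normally nothing is lost.)"""
    xs, ys = [], []
    for x, y in zip(series_list, labels):
        for i in range(0, len(x) - len(x) % w, w):
            xs.append(x[i:i + w])
            ys.append(y)
    return xs, ys

xs, ys = windows_with_labels([[0.0] * 300], ["AR"], 100)
print(len(xs), ys)   # 3 ['AR', 'AR', 'AR']
```

Each sub-series then goes through its own time-delay embedding in the next step.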
We perform a time delay embedding for each of these sub-series following the process discussed in section 2.2. Since the resulting embedding may have a very high dimension, we reduce the computational expense by performing PCA. We then apply TDA to this embedding and compute its one-dimensional persistence landscape. We choose the appropriate number of PCA dimensions and the window size $w$ using 10-fold cross-validation, varying $D$ over a range of values $[d_1, d_2]$ and $w$ over $\{w_1, \cdots, w_n\}$. Towards this, we first generate persistence landscapes for each dimension $D \in [d_1, d_2]$ for every sub-series of length $w \in \{w_1, \cdots, w_n\}$. Then, for each dimension and window size, we divide the obtained landscapes randomly into 10 groups. The model is trained on the landscapes in 9 groups, and the landscapes in the remaining group are used for evaluating the performance metric, which for this problem is chosen to be the overall accuracy. This process is repeated 10 times, keeping each group once as the test group. The PCA dimension and window size are then chosen to be the values $(D^*, W^*)$ for which the overall accuracy is maximised. The persistence landscapes of dimension $D^*$ and sub-series of size $W^*$ are then used for clustering and classification of the time series. We use the R package TDA (Fasy et al., 2014) for all TDA-related computations. This is shown as Algorithm 1.
335 tda (Fasy et al., 2014) for all TDA related computations. This is shown as
336 Algorithm 1.
We generate 30 time series, each of length 1200, from each of the 14 processes given in Table 1. As can be seen from the table, the 30 time series of each kind are generated using parameter values randomly drawn from the specified distributions. We split the data into training and test sets in the proportion of 80% and 20% respectively.

We generate one-dimensional persistence landscapes for each of these simulated series by the method described above. To select the window size and PCA dimension we perform a 10-fold cross-validation. We observe in Table 2 that the cross-validation accuracy is maximised for window size 100 and PCA dimension 2.

pro
Algorithm 1 Generating persistence landscape features of time series
• Input: The collection of time series X.
• Parameters: Window size(W ∗ ) and PCA dimension(D∗ ).
• Output: Persistence landscapes of time series
1: Divide each time series, x ∈ X, into it’s subseries of length W ∗ obtained from
cross-validation.
2: For each subseries generate the time delay embedding.
re-
3: Reduce the time delay embedding of each subseries to D ∗ dimensions by PCA.
4: Generate Rips filtration over these reduced embeddings and compute persistence
diagrams for each filtration.
5: Compute the persistence landscape from these diagrams corresponding to each
subseries.
lP
3.1 Clustering time series

For clustering, we take the generated persistence landscapes and apply SOM to them. We use a 3 × 3 grid with hexagonal topology. Each landscape is mapped to a node. We then perform hierarchical clustering of the nodes using the Manhattan metric with 4 centers to obtain time series clusters. We refer to this method as SOM-TDA (Algorithm 2).

Algorithm 2 SOM-TDA
• Input: Persistence landscape features for each subseries from Algorithm 1.
• Parameters: Grid dimension (m × n) and topology.
• Output: Cluster of each subseries.
1: Train a SOM with the given grid dimension and topology, using the persistence landscapes as input.
2: The trained SOM maps each subseries to a node on the m × n grid.
3: Perform hierarchical clustering on the nodes to obtain clusters.
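Steps 2–3 of SOM-TDA can be sketched as below, assuming an already-trained SOM codebook; the SOM training itself is omitted, and the random arrays are stand-ins for the real landscape features.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
codebook = rng.normal(size=(9, 16))      # 3x3 SOM grid of (assumed) trained node vectors
landscapes = rng.normal(size=(40, 16))   # stand-in landscape features, one row per subseries

# Step 3: hierarchical clustering of the nodes (Manhattan metric, cut into 4 clusters)
Z = linkage(codebook, method="complete", metric="cityblock")
node_cluster = fcluster(Z, t=4, criterion="maxclust")

# Step 2: each subseries maps to its best-matching node and inherits that node's cluster
bmu = np.abs(landscapes[:, None, :] - codebook[None, :, :]).sum(axis=2).argmin(axis=1)
series_cluster = node_cluster[bmu]
print(sorted(set(series_cluster.tolist())))  # cluster labels drawn from 1..4
```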

We compare the simulated processes and obtain clusters using SOM-TDA. We perform clustering on the training dataset using the SOM-TDA method described in Algorithm 2 and report the node-wise distribution of each process in Table 3. We notice that the ARCH and GARCH processes land in one cluster, namely cluster 3, while the AR(2), MA(1) and MA(2) processes form another. We also notice that processes with one order of differencing fall in clusters 1 and 3, whereas processes with two orders of differencing fall in cluster 4, which possibly reflects the effect of differencing on topology.
Clustering time series models is useful for model selection. Typically, a search is conducted over a set of candidate models whose unknown parameters are estimated from the data, and an information criterion such as AIC is then used to choose the "best" model among the candidates. This exercise can be computationally expensive when the number of candidate models is large. Using SOM-TDA clusters obtained from topological similarity, however, we can narrow the search space to representative members of the clusters. Moreover, since the clustering is done over members of several different model classes, it may lead to the creation of better ensembles.

3.2 Classifying time series

We use the persistence landscapes as features for time series classification, with the random forest algorithm as the classifier. We henceforth refer to this method as RF-TDA (Algorithm 3). We examine the performance and discriminatory power of RF-TDA through extensive simulation, considering several cases.

Algorithm 3 RF-TDA
• Input: The training data are the persistence landscape features of each subseries from Algorithm 1 together with the labels Y of their parent series. The test data are the persistence landscape features of the test subseries.
• Parameters: Number of trees and number of variables tried at each split of the random forest.
• Output: Predicted labels on the test data.
1: Train a random forest on the training data.
2: Predict on the test data using the trained random forest model.

The results on the training data are shown in Table 4. Since an AR(1) process can be written as an MA(∞) process, the good discrimination of the AR(1) and MA(1) processes by RF-TDA suggests the possibility of topologically different attractor spaces. We also note that AR(1) and MA(1) appeared in different clusters in section 3.1, and they are discriminated well here. For pairs like (AR(2), MA(2)), (ARMA(1,2), ARMA(2,1)) and (ARCH, GARCH), the classification accuracy, although decent, is not as high. We saw in section 3.1 that these were members of the same clusters, i.e. topologically similar, which probably explains why they could not be discriminated as well. We also note the high accuracy in classifying ARFIMA(1,2,1) against ARFIMA(2,1,2), which could be due to the different differencing orders inducing different topologies, as was also observed in the SOM-TDA results above.

The results on the test data are shown in Table 5. There does not appear to be any over-fitting worthy of concern, since the performance on the test data is similar to that on the training data.

3.3 Comparison with other methods

To examine the effectiveness of RF-TDA, we compare it against three other methods based on approaches used for time series classification in the literature, namely the Discrete Wavelet Transform (DWT), the Maximal Overlap Discrete Wavelet Transform (MODWT) and Dynamic Time Warping (DTW), described in the previous section. We use DWT features in the random forest algorithm for classification, and henceforth refer to this as RF-DWT. We likewise use MODWT features with a random forest, and call this RF-MODWT. DTW being a similarity measure, we use it in conjunction with the k-nearest-neighbor method for classification, and call this algorithm knn-DTW. We also apply a similarity measure based on Euclidean distances between time series together with the k-nearest-neighbor method, and refer to this as knn-Euclidean.
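For reference, the DTW similarity underlying knn-DTW can be sketched with the classic dynamic program; this is an illustrative implementation on toy series, not the one used in the experiments.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible alignment moves
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])  # same shape as x, shifted in time
print(dtw_distance(x, y), np.abs(x - y[:5]).sum())  # DTW 0.0 vs pointwise 4.0
```

The example shows why DTW can be preferable to a pointwise Euclidean comparison: the time-shifted copy is at DTW distance zero.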
To give an indication of the computational expense involved in each method, we report the time taken to generate features on a dataset of 48 time series of length 1200 each. All computations were performed on an Ubuntu 18 machine with 32 GB RAM and 6 cores using the R language. With a window size of 120 and PCA dimension 5, TDA takes 31.6 minutes to generate its features; the DWT features take 12 seconds, the MODWT features 32 seconds, and the DTW features 1 minute. TDA is thus computationally more expensive than the other methods considered here.
The performance of RF-DWT is shown in Tables 4 and 5. RF-DWT performs as well as or better than RF-TDA in most cases on this simulated dataset; Table 5 shows that, apart from the comparison of the GARCH processes, it is as good as or better than RF-TDA. RF-MODWT has the best performance among all the methods considered here.
We note that RF-TDA outperforms knn-DTW and knn-Euclidean in all the cases considered, and that knn-DTW and knn-Euclidean perform poorly on the comparison of the GARCH processes. The performance of knn-DTW and knn-Euclidean is reported in Table 5.
Next, we create an ensemble of RF-TDA, knn-DTW, RF-DWT and RF-MODWT based on proportional voting and check its performance. We perform 10-fold cross-validation for each method on the training data as described earlier and take the mean cross-validation accuracy as the weight assigned to that method. The ensemble prediction for each unit is the class with the highest sum of weights of the methods predicting that class. For example, if RF-TDA, RF-DWT and RF-MODWT, with weights w1, w2 and w3, predict class c1 for a time series and knn-DTW, with weight w4, predicts class c2, then c1 is chosen as the ensemble prediction if w1 + w2 + w3 > w4, and c2 otherwise.
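The proportional-voting rule can be sketched as follows; the weights shown are illustrative, not the cross-validation accuracies obtained in the study.

```python
def ensemble_predict(predictions, weights):
    """Weighted vote: predictions maps method -> predicted class,
    weights maps method -> mean cross-validation accuracy."""
    totals = {}
    for method, cls in predictions.items():
        totals[cls] = totals.get(cls, 0.0) + weights[method]
    # the class with the largest total weight wins
    return max(totals, key=totals.get)

# Illustrative weights and per-series predictions (not the paper's values)
weights = {"RF-TDA": 0.78, "RF-DWT": 0.80, "RF-MODWT": 0.87, "knn-DTW": 0.55}
preds = {"RF-TDA": "c1", "RF-DWT": "c1", "RF-MODWT": "c1", "knn-DTW": "c2"}
print(ensemble_predict(preds, weights))  # c1 wins: 0.78 + 0.80 + 0.87 > 0.55
```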

We report the results in Table 6. We note that the ensemble is the best performing method among all those considered except RF-MODWT.
We also compare the median overall scores of the methods considered using hypothesis testing. We apply a one-tailed two-sample Monte-Carlo permutation test (see Dwass, 1957) with

H0: {Median of the alternate method's overall accuracy ≥ Median of RF-TDA's overall accuracy}

and

H0∗: {Median of RF-TDA's mis-classification error ≥ Median of the alternate method's mis-classification error}.

We report the p-values for both hypotheses in Table 14 and observe that both have very high p-values, indicating that neither hypothesis can be rejected.
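A Monte-Carlo permutation test of this kind can be sketched as follows. The accuracy values below are toy numbers chosen so that H0 is rejected, and the implementation is an illustrative version of the test rather than the exact code used in the study.

```python
import numpy as np

def perm_test(a, b, n_perm=10_000, seed=0):
    """One-tailed Monte-Carlo permutation test of H0: median(a) >= median(b).

    The statistic is median(a) - median(b); the p-value is the fraction of
    random relabelings whose statistic is at least as small as the observed one.
    """
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = np.median(a) - np.median(b)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = np.median(pooled[: len(a)]) - np.median(pooled[len(a):])
        if stat <= observed:
            count += 1
    return count / n_perm

# Toy overall accuracies: the "alternate" method sits clearly below "RF-TDA",
# so H0 (alternate >= RF-TDA) should be rejected with a small p-value.
alt = [0.55, 0.60, 0.58, 0.52, 0.61, 0.57]
tda = [0.80, 0.85, 0.78, 0.90, 0.82, 0.88]
p = perm_test(alt, tda)
print(p < 0.05)  # True
```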

3.4 Multi-class Classification

In this subsection we report the performance of RF-TDA on the task of multi-class classification and compare it with that of RF-DWT and RF-MODWT. As is common in multi-class classification problems, the prediction accuracy of all three methods is not good, so we adopt a prediction-set approach. The prediction set consists of the two classes assigned the highest and second-highest probabilities by the random forest, and we consider a prediction accurate if one of the two predicted classes is the real class. We report the performance in Table 16. RF-TDA has a median accuracy of 38%, while RF-DWT and RF-MODWT have a median accuracy of 61% on the test set. It should be noted that this dataset has 13 classes, so a random classifier with a prediction set of two would achieve an accuracy of only 15.4%.
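The prediction-set accuracy described above can be sketched as follows; the probability matrix is a toy stand-in for the random forest's class probabilities.

```python
import numpy as np

def prediction_set_accuracy(probs, labels, k=2):
    """Fraction of cases whose true label is among the k highest-probability classes."""
    topk = np.argsort(probs, axis=1)[:, -k:]   # column indices of the k largest probs
    hits = [label in row for row, label in zip(topk, labels)]
    return sum(hits) / len(labels)

# Toy class-probability matrix (rows: cases, columns: 4 classes), not real output
probs = np.array([
    [0.10, 0.50, 0.30, 0.10],   # true class 2 is 2nd-ranked -> counted correct
    [0.40, 0.30, 0.20, 0.10],   # true class 3 is 4th-ranked -> miss
    [0.05, 0.15, 0.35, 0.45],   # true class 3 is 1st-ranked -> correct
])
print(prediction_set_accuracy(probs, labels=[2, 3, 3]))  # 2/3
```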

4 Financial time series classification

In finance, the Fama-French three-factor model (Fama and French, 1993) is a well-known characterization of the factors that describe stock returns. However, to the best of our knowledge, no model proposes a dependence of stock returns on sectors. We demonstrate here the predictive power of RF-TDA, which may suggest the presence of sector-specific topological features that could help explain differences in the behaviour of stock returns across sectors.
4.1 Experiment One

The National Stock Exchange (NSE) is one of the largest stock exchanges in India. NSE publishes sectoral indices weighted by free-float market capitalisation; for details of the index methodology see (Methodology Document of NIFTY Sectoral Index Series, n.d.). We obtain data across sectors for the 6 largest constituents, by index proportion, of the NSE sectoral indices for banks, pharmaceuticals, information technology and public banks. These are among the largest sectors by market capitalisation and are actively traded. The chosen stocks are listed below, with their NSE symbols in parentheses.
1. Banks: HDFC (HDFCBANK), Axis (AXISBANK), Kotak Mahindra (KOTAKBANK), ICICI (ICICIBANK), IndusInd (INDUSINDBK), Federal (FEDERALBNK)
2. Pharmaceuticals: Sun (SUNPHARMA), Cipla (CIPLA), Divi (DIVISLAB), Dr Reddy (DRREDDY), Lupin (LUPIN), Biocon (BIOCON)
3. IT: TCS (TCS), Tech Mahindra (TECHM), HCL (HCLTECH), Infosys (INFY), Wipro (WIPRO), Hexaware (HEXAWARE)
4. Public Banks: State Bank of India (SBIN), Bank of Baroda (BANKBARODA), Punjab National Bank (PNB), Canara Bank (CANBK), Bank of India (BANKINDIA), Union Bank (UNIONBANK)
As mentioned earlier, the label of each of these time series is the sector to which the stock belongs. The training period was chosen to be 1 January 2014 to 31 December 2018, and the test period 1 January 2019 to 1 November 2019. We work with the log return series of the stocks; no discernible pattern can be seen in the series from their plots.
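The log-return construction is standard; a minimal sketch, with illustrative prices rather than actual NSE data:

```python
import numpy as np

# Illustrative closing prices; the study itself uses NSE stock data.
prices = np.array([100.0, 102.0, 101.0, 104.0])

# Log returns: r_t = log(p_t) - log(p_{t-1})
log_returns = np.diff(np.log(prices))

# Log returns telescope: their sum is the log of the gross return over the period
print(np.isclose(log_returns.sum(), np.log(prices[-1] / prices[0])))  # True
```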
We then apply RF-TDA and generate the persistence landscapes. After 10-fold cross-validation the PCA dimension was found to be 8 and the window size 100. The results are shown in Table 7. From the persistence landscape plots we also note the differences between the landscapes of the individual windows of a time series, as well as sector-wise differences; for example, Public Banks have more dispersed peaks than Pharma (see Figures 3 and 4).


For the random forest classifier, the number of trees is 500 and the number of variables tried at each split is 31. It should be noted that the variables here are indices of the filtration, realized as points on the discretization of the persistence landscapes; each persistence landscape was discretized into 1000 points.
We want to examine whether any of the models considered in section 3 fits the dataset under consideration, and whether there is any sector-wise similarity in the fitted models. For this purpose we compute the AIC values of the fitted models to find the best model; the results are reported in Table 8. We observe that all stocks are fitted with GARCH models of various orders, except Federal Bank and WIPRO, which are fitted with ARFIMA models. We use the auto.arima function of the R library forecast (Hyndman et al., 2020) to fit ARIMA models to the stock price data, the library tseries (Trapletti and Hornik, 2019) to fit GARCH models, and the library arfima (Veenstra, 2012) to fit ARFIMA models. For GARCH we vary the model orders within [0, 4] and select the fitted model with the lowest AIC; for ARFIMA we vary the AR, MA and difference parameters within [0, 4] and likewise select the model with the lowest AIC. The AIC values of the GARCH models are generally lower than those of the ARIMA or ARFIMA models.
511 models, we select the model fitted with least AIC. It is seen that AIC values of
512 GARCH models are generally lower than that of ARIMA or ARFIMA models.
513 We then perform classification on this dataset using RF-TDA, RF-DWT,
514 RF-MODWT, knn-DTW and knn-Euclidean. We report the results for RF-
515 TDA for the first set in Tables 10 and 11. We note the very high overall
516 accuracy on the training dataset. We see accuracy upwards of 90% for all con-
517

518

519

520
re-
sidered cases. We see that the cases involving the Pharma sector an overall
accuracy upwards of 97% is observed, which points to presence of some topo-
logical feature(s) present in the pharma time series which is(are) distinct from
the time series of other sectors facilitating the classification. There is also no
521 indication of overfitting since we observe high accuracy on the test dataset as
522 well. Barring the case of Pharma-Public banks we observe that RF-TDA clas-
lP
523 sifies well with high accuracy. We note an accuracy rate of more than 80% in
524 some cases.
525 We report the RF-DWT results for the first set in Tables 10 and 11 and the
526 knn-DTW and knn-Euclidean results in Table 11. RF-TDA outperforms the
527 other methods considered in terms of overall accuracy and classwise accuracy.
528 RF-DWT performs decently on the training set, but on none of the cases does
rna

529 it cross an overall accuracy of 90%. On the test data we observe that although
530 RF-DWT performs decently it is outperformed by RF-TDA in four out of the
531 six cases. RF-DWT and RF-MODWT perform similarly. We note that RF-
532 TDA has a higher median accuracy than RF-DWT and RF-MODWT. Methods
533 based on similarity measures i.e. knn-DTW and knn-Euclidean perform poorly
534 with knn-DTW performing better than knn-Euclidean. We note that knn-DTW
535 is outperformed on all comparisons by RF-TDA.
4.2 Experiment two

A reviewer of an earlier draft of this paper suggested carrying out a larger study. We therefore conduct a second experiment with 6 additional sectors, namely automobiles, consumer durables, realty, FMCG, oil & gas and media. The stocks in each of these sectors are chosen following the same process described in subsection 4.1. The details are given below.

1. Automobiles: Bajaj Auto (BAJAJ-AUTO), Eicher Motors (EICHERMOT), Hero Motor (HEROMOTOCO), Mahindra & Mahindra (M&M), Maruti (MARUTI), Tata Motors (TATAMOTORS)

2. Consumer Durables: Bata India (BATAINDIA), Havells (HAVELLS), Rajesh Exports (RAJESHEXPO), Titan (TITAN), Voltas (VOLTAS), Whirlpool (WHIRLPOOL)
3. Realty: DLF (DLF), Godrej Properties (GODREJPROP), Indiabulls Real Estate (IBREALEST), Oberoi Realty (OBEROIRLTY), Phoenix Mills (PHOENIXLTD), Prestige Group (PRESTIGE)
4. FMCG: Britannia (BRITANNIA), Dabur (DABUR), Godrej Consumer Products (GODREJCP), Hindustan Unilever (HINDUNILVR), ITC (ITC), Nestle India (NESTLEIND)
5. Oil and Gas: Bharat Petroleum (BPCL), Gas Authority of India (GAIL), Indian Oil (IOC), Oil and Natural Gas Corporation (ONGC), Petronet (PETRONET), Reliance Industries (RELIANCE)
6. Media: Inox (INOXLEISUR), Network 18 (NETWORK18), PVR (PVR), Sun TV (SUNTV), TV 18 (TV18BRDCST), Zee Entertainment (ZEEL)

For the second dataset, the PCA dimension after 10-fold cross-validation was found to be 5 and the window size 120.
We report the time series models fitted to this dataset in Table 9. We find that all the time series fit GARCH most closely among the models considered in section 3, and we observe that the results are consistent with the first experiment in section 4.1.

We apply the methods considered to the second dataset, which is bigger in terms of pairwise comparisons, and report the results in Tables 12 and 13. Here we note that RF-DWT and RF-MODWT have higher median accuracy than RF-TDA on the training set, but RF-TDA's performance is pulled down by a few poor pairwise comparisons, such as many of those involving Automobiles. The high training accuracy of RF-DWT and RF-MODWT may also indicate overfitting, since Table 13 shows that they have lower overall median accuracy than RF-TDA on the test data. RF-TDA performs excellently on many of the pairwise comparisons, with several 100% accuracy results, and has the highest overall median accuracy among all the methods. RF-TDA outperforms knn-DTW and knn-Euclidean in most of the pairwise comparisons.
4.3 Analysis of performance

An examination of the performance of all the methods on the two datasets reveals that, on the test set, out of 21 comparisons RF-TDA had 75% or more accuracy 14 times, RF-DWT 7 times, RF-MODWT 6 times, knn-DTW 5 times and knn-Euclidean 4 times. Combining performance across both datasets, RF-TDA has a median test-set accuracy of 75%, higher than all the other methods considered. RF-TDA performed similarly well on both datasets and had the highest median accuracy across methods for each: on the first dataset it had 75% or more accuracy in 4 of the 6 comparisons, and on the second dataset in 10 of the 15 comparisons. We also compare the median overall scores of the methods using hypothesis testing, applying a one-tailed two-sample Monte-Carlo permutation test with

H0: {Median of the alternate method's overall accuracy ≥ Median of RF-TDA's overall accuracy}

and

H0∗: {Median of RF-TDA's mis-classification error ≥ Median of the alternate method's mis-classification error}.

We report the p-values for both hypotheses in Table 15. We observe that both H0 and H0∗ are rejected at the 5% level of significance for the comparisons with RF-DWT, knn-DTW and knn-Euclidean, indicating that the performance of RF-TDA on these two real-life tests is significantly better than that of these three methods. For the comparison of RF-TDA and RF-MODWT, both p-values for testing H0 and H0∗ are greater than 0.05, so we cannot reject either hypothesis.
4.4 Multi-class Classification

We perform multi-class classification using all the classes of the data in Experiment 1. As discussed in section 3.4, we use a prediction set consisting of the two classes with the highest probabilities in the random forest. We report the performance in Table 17. RF-TDA has nearly 100% accuracy on the training set; on the test set RF-TDA has a median accuracy of 75%, whereas RF-DWT and RF-MODWT have a median accuracy of 67%. It should be noted that this dataset has 4 classes, so a random classifier with a prediction set of two would achieve an accuracy of only 50%.

Next, we perform multi-class classification using all the classes of the data in Experiment 2, reporting the performance in Table 18. RF-TDA has a median accuracy of 58.5% on the test set, while that of RF-DWT is 67% and that of RF-MODWT is 58.5%. This dataset has six classes, so a random classifier with a prediction set of two would achieve an accuracy of only 33.33%.

5 Conclusion

In this paper, two TDA-based methods, one for clustering and one for classification, are presented. In section 3.1 we apply SOM-TDA to cluster time series models by their topological similarity; its potential applications in model selection and ensemble formation are also discussed. In section 3.2 we report the performance of RF-TDA for classification on the simulated dataset and observe that the clustering output from section 3.1 helps explain that performance. In section 3.3 we find that RF-MODWT and the ensemble of the four methods perform better than the other methods on the simulated data. However, in the two experiments with real data reported in section 4 the performance of RF-TDA is superior to that of the other methods considered in this paper. This indicates that real-life data possibly has features that are not completely captured by the time-series models from which the simulated data is generated in section 3; we conjecture that RF-TDA captures these features better, leading to its better performance in these experiments. In this context it may be noted that RF-MODWT is also a competitive method that may be used in conjunction with RF-TDA, or an ensemble of these two methods may be considered. Through this work we provide evidence that stock price movements are sector dependent, which may be useful for further research in finance.

In future work we intend to explore the theoretical aspects of this method, and to study and characterize the stochastic processes and dynamical systems for which a classifier based on topological features works best. Another future project is to explore the application of SOM-TDA to model selection and compare its performance with other model selection frameworks. We also anticipate further applications of RF-TDA to financial time series classification in other settings. A question that could be explored is: for which suitable labels, like sectors, can a financial time series be characterized by its topological features? A causal explanation for the presence or lack of such features remains to be provided.
Acknowledgements

The authors thank the editor and the anonymous reviewers for their helpful comments on an earlier version of the paper, which have led to its improvement.
References

Ang, A., & Timmermann, A. (2012). Regime changes and financial markets. Annu. Rev. Financ. Econ., 4(1), 313–337.
Bagnall, A., Lines, J., Bostrom, A., Large, J., & Keogh, E. (2017). The great time series classification bake off: A review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3), 606–660.
Batal, I., & Hauskrecht, M. (2009). A supervised time series feature extraction technique using DCT and DWT. In 2009 International Conference on Machine Learning and Applications. IEEE.
Berndt, D. J., & Clifford, J. (1994). Using dynamic time warping to find patterns in time series. In KDD Workshop. Seattle, WA.
Berwald, J., & Gidea, M. (2014). Critical transitions in a model of a genetic regulatory system. Mathematical Biosciences & Engineering, 11(4), 723–740.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Broomhead, D. S., & King, G. P. (1986). Extracting qualitative dynamics from experimental data. Physica D: Nonlinear Phenomena, 20(2-3), 217–236.
Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. The Journal of Machine Learning Research, 16(1), 77–102.
Cao, L. (1997). Practical method for determining the minimum embedding dimension of a scalar time series. Physica D: Nonlinear Phenomena, 110(1-2), 43–50.
Cohen-Steiner, D., Edelsbrunner, H., & Harer, J. (2007). Stability of persistence diagrams. Discrete & Computational Geometry, 37(1), 103–120.
de Silva, V., Skraba, P., & Vejdemo-Johansson, M. (2012). Topological analysis of recurrent systems. In Workshop on Algebraic Topology and Machine Learning, NIPS.
Dwass, M. (1957). Modified randomization tests for nonparametric hypotheses. The Annals of Mathematical Statistics, 181–187.
Edelsbrunner, H., & Harer, J. (2010). Computational topology: An introduction. American Mathematical Soc.
Edelsbrunner, H., Letscher, D., & Zomorodian, A. (2002). Topological persistence and simplification. Discrete & Computational Geometry, 28, 511–533.
Fama, E. F., & French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33, 3–56.
Fasy, B. T., Kim, J., Lecci, F., & Maria, C. (2014). Introduction to the R package TDA. arXiv preprint arXiv:1411.1830.
Gidea, M., Goldsmith, D., Katz, Y. A., Roldan, P., & Shmalo, Y. (2018). Topological recognition of critical transitions in time series of cryptocurrencies. Available at SSRN 3202721.
Gidea, M., & Katz, Y. (2018). Topological data analysis of financial time series: Landscapes of crashes. Physica A: Statistical Mechanics and its Applications, 491, 820–834.
Goel, A., Pasricha, P., & Mehra, A. (2020). Topological data analysis in investment decisions. Expert Systems with Applications, 113222.
Goodson, G. R. (2016). Chaotic dynamics: Fractals, tilings, and substitutions. Cambridge University Press.
Granger, C. W., & Joyeux, R. (1980). An introduction to long-memory time series models and fractional differencing. Journal of Time Series Analysis, 1(1), 15–29.
Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., O'Hara-Wild, M., Petropoulos, F., Razbash, S., Wang, E., & Yasmeen, F. (2020). forecast: Forecasting functions for time series and linear models. R package version 8.11. http://pkg.robjhyndman.com/forecast
Kennel, M. B., Brown, R., & Abarbanel, H. D. (1992). Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A, 45(6), 3403.
Kim, K., Kim, J., & Rinaldo, A. (2018). Time series featurization via topological data analysis: An application to cryptocurrency trend forecasting. arXiv preprint arXiv:1812.02987.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.
Lum, P. Y., Singh, G., Lehman, A., Ishkanov, T., Vejdemo-Johansson, M., Alagappan, M., Carlsson, J., & Carlsson, G. (2013). Extracting insights from the shape of complex data using topology. Scientific Reports, 3, 1236.
Maharaj, E. A., & Alonso, A. M. (2007). Discrimination of locally stationary time series using wavelets. Computational Statistics & Data Analysis, 52(2), 879–895.
Methodology document of NIFTY sectoral index series. (n.d.). Retrieved March 5, 2020, from https://archives.nseindia.com/content/indices/Method_Nifty_Sectoral.pdf
Mileyko, Y., Mukherjee, S., & Harer, J. (2011). Probability measures on the space of persistence diagrams. Inverse Problems, 27(12), 124007.
Munkres, J. R. (2018). Elements of algebraic topology. CRC Press.
Percival, D. B., & Walden, A. T. (2000). Wavelet methods for time series analysis (Vol. 4). Cambridge University Press.
Perea, J. A. (2019). Topological time series analysis. Notices of the American Mathematical Society, 66(5).
Perea, J. A., Deckard, A., Haase, S. B., & Harer, J. (2015). SW1PerS: Sliding windows and 1-persistence scoring; discovering periodicity in gene expression time series data. BMC Bioinformatics, 16(1), 257.
Perea, J. A., & Harer, J. (2015). Sliding windows and persistence: An application of topological methods to signal analysis. Foundations of Computational Mathematics, 15(3), 799–838.
Pereira, C. M., & de Mello, R. F. (2015). Persistent homology for time series and spatial data clustering. Expert Systems with Applications, 42(15-16), 6026–6038.
Shumway, R. H., & Stoffer, D. S. (2017). Time series analysis and its applications: With R examples. Springer.
Takens, F. (1981). Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence, Warwick 1980. Springer.
Trapletti, A., & Hornik, K. (2019). tseries: Time series analysis and computational finance. R package version 0.10-47. https://CRAN.R-project.org/package=tseries
Truong, P. (2017). An exploration of topological properties of high-frequency one-dimensional financial time series data using TDA (Doctoral dissertation). KTH Royal Institute of Technology.
Umeda, Y. (2017). Time series classification via topological data analysis. Information and Media Technologies, 12, 228–239.
Veenstra, J. Q. (2012). Persistence and anti-persistence: Theory and software (Doctoral dissertation). Western University.
Yin, H. (2008). The self-organizing maps: Background, theories, extensions and applications. In Computational Intelligence: A Compendium. Springer.
Zhao, X., Barber, S., Taylor, C. C., & Milan, Z. (2018). Classification tree methods for panel data using wavelet-transformed time series. Computational Statistics & Data Analysis, 127, 204–216.
Zomorodian, A., & Carlsson, G. (2005). Computing persistent homology. Discrete & Computational Geometry, 33(2), 249–274.



Process          Coefficients
AR(1)            a1 ∼ Uniform(0.5, 0.99)
AR(2)            a1, a2 ∼ Uniform(0.1, 0.4)
MA(1)            b1 ∼ Uniform(0.5, 0.99)
MA(2)            b1, b2 ∼ Uniform(0.5, 0.99)
ARMA(2,1)        (a1, a2, b1) ∼ Uniform(0.2, 0.4)
ARMA(1,2)        (a1, b1, b2) ∼ Uniform(0.25, 0.45)
ARCH(1)          (α0, α1) ∼ Uniform(0.0001, 0.9999)
GARCH(1,1)       α0 ∼ Uniform(0.0001, 0.9999), α1, β1 ∼ Uniform(0.0001, 0.4999)
GARCH(2,1)       α0 ∼ Uniform(0.0001, 0.9999), α1, β1, β2 ∼ Uniform(0.0001, 0.3332)
ARIMA(0,2,1)     b1 ∼ Uniform(−1, 1)
ARIMA(1,2,1)     (a1, b1) ∼ Uniform(−1, 1)
ARFIMA(1,2,1)    (a1, b1) ∼ Uniform(−1, 1), h ∼ Uniform(0, 1)
ARFIMA(2,1,2)    (a1, a2, b1, b2) ∼ Uniform(−1, 1), h ∼ Uniform(0, 1)

Table 1: Processes and parameters used for the simulation study

PCA dim \ Window size    50        100       120
2                        60%       63.12%    61.78%
3                        56.31%    62.34%    61.99%
4                        53.05%    58.62%    59.3%
5                        52.03%    52.32%    53.68%
6                        51.84%    52.35%    52.05%
7                        51.74%    52%       51.91%
8                        51.71%    52.14%    51.89%

Table 2: 10-fold cross-validation results on the simulated dataset

Process          Node 1  Node 2  Node 3  Node 4  Node 5  Node 6  Node 7  Node 8  Node 9  Total  Cluster
Centers          (1.5,0.87) (2.5,0.87) (3.5,0.87) (1,1.73) (2,1.73) (3,1.73) (1.5,2.59) (2.5,2.59) (3.5,2.59)
AR(1)            38      31      13      57      31      10      41      39      4       264    1
AR(2)            10      41      55      14      31      66      4       11      32      264    2
MA(1)            8       44      55      18      40      69      7       9       14      264    2
MA(2)            19      30      61      8       37      58      13      9       29      264    2
ARCH(1)          19      25      20      17      28      47      14      41      53      264    3
GARCH(1,1)       8       16      39      14      20      53      5       22      87      264    3
GARCH(2,1)       10      19      24      11      19      57      12      34      78      264    3
ARMA(2,1)        39      49      40      29      30      16      30      28      3       264    1
ARMA(1,2)        38      43      40      29      34      17      27      33      3       264    1
ARIMA(1,1,1)     31      28      25      19      30      21      16      29      65      264    3
ARIMA(0,2,1)     1       1       11      3       9       39      2       15      183     264    4
ARFIMA(1,2,1)    1       1       13      1       7       39      2       10      190     264    4
ARFIMA(2,1,2)    31      42      25      25      40      23      25      45      8       264    1
Node-wise total  253     370     421     245     356     515     198     325     749

Table 3: Distribution of each process on the different SOM nodes


                               RF-TDA             RF-DWT             RF-MODWT
Class I        Class II        I     II    All    I     II    All    I     II    All
AR(1)          MA(1)           0.8   0.76  0.78   0.77  0.83  0.8    0.74  0.86  0.8
AR(2)          MA(2)           0.56  0.52  0.54   0.89  0.9   0.89   0.88  0.85  0.87
ARMA(1,2)      ARMA(2,1)       0.48  0.49  0.48   0.08  0.05  0.06   0.04  0.02  0.03
ARCH(1)        GARCH(1,1)      0.65  0.63  0.64   0.57  0.76  0.66   0.55  0.77  0.66
GARCH(1,1)     GARCH(2,1)      0.59  0.53  0.56   0.75  0.48  0.62   0.71  0.45  0.58
ARIMA(0,2,1)   ARIMA(1,2,1)    0.61  0.39  0.5    0.66  0.63  0.64   0.98  0.99  0.99
ARFIMA(1,2,1)  ARFIMA(2,1,2)   0.89  0.93  0.91   0.98  0.99  0.98   0.99  0.98  0.98
(I = Class I accuracy, II = Class II accuracy, All = overall accuracy)

Table 4: RF-TDA, RF-DWT and RF-MODWT performance on the training set of simulated data. Each class contains 264 observations.

Class I Class RF- RF- RF- RF- RF- RF- RF- RF- RF-
II TDA TDA TDA DWT DWT DWT MO- MO- MO-

of
Class Class Over- Class Class Over- DWT DWT DWT
I II all I II all Class Class Over-
accu- accu- accu- accu- accu- accu- I II all
racy racy racy racy racy racy accu- accu- accu-

pro
racy racy racy
AR MA 0.85 0.82 0.83 0.86 0.83 0.85 0.79 0.85 0.82
(1) (1)
AR MA 0.39 0.58 0.48 0.86 0.98 0.92 0.88 0.95 0.92
(2) (2)
ARMA ARMA 0.53 0.56 0.55 0.61 0.76 0.68 0.73 0.71 0.72
(1,2) (2,1)
ARCH
(1)
GAR-
CH
(1,1)
GAR- GAR-
0.67

0.38
re-
0.55

0.44
0.61

0.41
0.79

0.2
0.59

0.41
0.69

0.3
0.74

0.58
0.67

0.42
0.7

0.5
CH CH
(1,1) (2,1)
lP
ARIMA ARIMA 0.59 0.2 0.39 0.58 0.27 0.42 0.92 0.86 0.89
(0,2,1) (1,2,1)
ARF- ARF- 0.71 0.97 0.84 0.76 1 0.88 0.79 1 0.89
IMA IMA
(1,2,1) (2,1,2)
rna

Table 5: RF-TDA, RF-DWT, knn-DTW, knn-Euclidean and RF-MODWT perfor-


mance on the test set of simulated data. Each class contains 66 observations.

Class I        Class II        Class I accuracy  Class II accuracy  Overall accuracy
AR(1)          MA(1)           0.92              0.80               0.86
AR(2)          MA(2)           0.91              0.91               0.91
ARMA(1,2)      ARMA(2,1)       0.68              0.72               0.70
ARCH(1)        GARCH(1,1)      0.82              0.53               0.67
GARCH(1,1)     GARCH(2,1)      0.24              0.47               0.36
ARIMA(0,2,1)   ARIMA(1,2,1)    0.61              0.35               0.48
ARFIMA(1,2,1)  ARFIMA(2,1,2)   0.76              1                  0.88

Table 6: Performance of the ensemble of RF-TDA, RF-DWT, RF-MODWT and knn-DTW on the test set of simulated data. Each class contains 66 observations.
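The ensemble of Table 6 combines the labels predicted by the four base classifiers. The combination rule can be sketched as a majority vote; the tie-breaking rule below (prefer the earliest-listed classifier's vote) is an assumption, as this excerpt does not spell it out.

```python
import numpy as np
from collections import Counter

def ensemble_predict(predictions):
    # predictions: one array of predicted labels per base classifier, all of
    # the same length (here: RF-TDA, RF-DWT, RF-MODWT and knn-DTW).
    predictions = np.asarray(predictions)
    out = []
    for votes in predictions.T:  # the classifiers' votes on one case
        counts = Counter(votes)
        top = max(counts.values())
        # tie-break: keep the tied label voted by the earliest-listed classifier
        winner = next(v for v in votes if counts[v] == top)
        out.append(winner)
    return np.array(out)

# Four classifiers, two cases; the first case is a 2-2 tie, broken in
# favour of the first classifier's vote.
labels = ensemble_predict([[0, 1], [0, 0], [1, 0], [1, 1]])
```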


PCA dim \ Window size    50        75        100       120
2                        56.07%    60.09%    61.29%    61.67%
3                        61.41%    60.75%    66.21%    65.42%
4                        63.24%    61.59%    65.65%    67.36%
5                        60.73%    63.15%    64.43%    66.39%
6                        59.33%    66.01%    74.73%    67.64%
7                        57.77%    65.44%    88.39%    67.08%
8                        58.74%    64.85%    95.59%    64.72%

Table 7: 10-fold cross-validation results on the NSE stock price data used in Experiment 1 for the period 1 January 2014 to 31 December 2018

Stock                 Label         ARFIMA         AIC       ARIMA    AIC       GARCH  AIC
HDFC                  Bank          (4,4,2,-0.11)  4780.15   (1,1,2)  8275.99   (1,4)  3360.46
Axis                  Bank          (1,2,2,-0.99)  5574.24   (0,1,0)  9062.64   (1,3)  3192.57
ICICI                 Bank          (3,1,2,-0.99)  4047.73   (0,1,0)  7534.15   (1,4)  2957.1
Kotak Mahindra        Bank          (3,3,2,-0.99)  6156.69   (0,1,1)  9643.55   (1,4)  3445.97
Indus Ind             Bank          (4,3,2,-0.89)  7081.06   (0,1,0)  10572.25  (1,3)  3480.72
Federal               Bank          (4,2,2,-0.7)   1257.78   (0,1,0)  4745.73   (1,4)  2363.86
Sun                   Pharma        (3,1,2,-0.99)  6539.21   (0,1,0)  10027.2   (1,4)  2976.24
Dr. Reddy             Pharma        (4,3,2,-0.95)  9776.09   (1,1,0)  13261.92  (1,4)  3696.9
Cipla                 Pharma        (1,1,2,-0.99)  5563.79   (0,1,0)  9048.82   (1,4)  3045.33
Divi                  Pharma        (4,4,3,-0.43)  7431.53   (0,1,1)  10920.77  (1,4)  3504.99
Piramal               Pharma        (4,3,2,-0.86)  8919.96   (2,1,4)  12404.49  (1,4)  3586.86
Biocon                Pharma        (4,4,2,-0.11)  3374.54   (0,1,0)  6874.34   (1,3)  2800.36
TCS                   IT            (2,2,2,-0.99)  7400.14   (0,1,0)  10885.85  (1,4)  3604.51
Infosys               IT            (2,2,2,-0.99)  5215.55   (2,1,2)  8703.56   (1,4)  3192.58
HCL                   IT            (2,1,2,-0.99)  6642.47   (2,1,1)  10130.47  (1,4)  3330.06
Tech Mahindra         IT            (4,4,2,-0.81)  5642.45   (0,1,0)  9132.44   (1,4)  3186.56
WIPRO                 IT            (1,1,2,-0.99)  2636      (0,1,0)  6123.03   (1,4)  2785.64
Hexaware              IT            (3,4,2,0.03)   4779.9    (0,1,2)  8267.98   (1,4)  2898.68
SBI                   Public Banks  (1,2,2,-0.14)  4142.37   (0,1,0)  7628.05   (1,4)  2844.47
Bank of Baroda        Public Banks  (3,1,2,-0.99)  3541.77   (0,1,0)  7027.55   (1,3)  2448.66
Canara Bank           Public Banks  (1,1,2,-0.99)  5268.41   (0,1,0)  8752.17   (1,3)  2758.66
Punjab National Bank  Public Banks  (2,2,2,-0.73)  3398.56   (1,1,0)  6883.13   (1,3)  2290.5
Bank of India         Public Banks  (4,2,2,-0.64)  3909.56   (0,1,0)  7397.03   (4,1)  2331.42
Indian Bank           Public Banks  (3,3,2,-0.02)  4618.83   (2,1,2)  8103.77   (1,4)  2696.21

Table 8: Time series models fit on the NSE stock price data used in Experiment 1 for the period 1 January 2014 to 31 December 2018
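The model choices in Tables 8 and 9 are driven by the Akaike information criterion, AIC = 2k − 2 log L̂, computed for each fitted model; a lower AIC indicates a better trade-off between fit and parameter count. As a self-contained illustration of the criterion (not the paper's actual ARFIMA/ARIMA/GARCH fits, which would use a standard time-series package), the sketch below fits an AR(p) model by least squares and returns its Gaussian AIC.

```python
import numpy as np

def fit_ar_aic(x, p):
    # Least-squares AR(p) fit: x_t regressed on its first p lags.
    n = len(x)
    y = x[p:]
    X = np.column_stack([x[p - k : n - k] for k in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = np.mean(resid ** 2)
    # Concentrated Gaussian log-likelihood of the residuals
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    k = p + 1  # p AR coefficients plus the innovation variance
    return coef, 2 * k - 2 * loglik
```

Comparing the AIC of competing fits on the same series is the comparison made across the columns of Tables 8 and 9.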


Stock        Sector             ARFIMA         AIC       ARIMA    AIC       GARCH  AIC
BAJAJ-AUTO   Auto               (1,1,1,-0.95)  22714     (0,1,0)  12410.94  (1,4)  8927.17
EICHERMOT    Auto               (1,2,3,-0.99)  27606.58  (0,1,1)  18204.16  (4,3)  14716.89
HEROMOTOCO   Auto               (3,2,3,-0.44)  23091.63  (0,1,1)  12840.04  (1,4)  9352.59
M&M          Auto               (3,1,3,-0.13)  19410.15  (0,1,1)  9352.61   (1,4)  5863.12
TATAMOTORS   Auto               (1,1,1,-0.68)  18175     (1,1,1)  8840.97   (1,4)  5359.01
MARUTI       Auto               (1,2,3,-0.68)  24307.91  (0,1,1)  14320.07  (4,3)  10833.42
HAVELLS      Consumer Durables  (4,1,4,-0.31)  17911.01  (2,1,2)  8560.73   (4,3)  5077.23
VOLTAS       Consumer Durables  (2,1,4,-0.99)  17806.48  (0,1,1)  8706.69   (4,3)  5222.42
RAJESHEXPO   Consumer Durables  (2,1,2,-0.91)  18296.67  (0,1,1)  9224.74   (4,4)  5730.23
WHIRLPOOL    Consumer Durables  (2,1,2,-0.1)   19958.3   (2,1,2)  10651.71  (4,3)  7168.25
TITAN        Consumer Durables  (3,1,3,-0.99)  18513.24  (0,1,1)  9302.34   (2,4)  5815.73
BATAINDIA    Consumer Durables  (3,1,3,-0.94)  19206.23  (0,1,0)  9560.04   (1,4)  6076.32
PETRONET     Oil and Gas        (4,1,2,-0.93)  15649.09  (0,1,2)  6228.78   (4,3)  2733.8
RELIANCE     Oil and Gas        (2,1,2,-0.21)  19202.54  (2,1,2)  9365.36   (3,4)  5882.34
ONGC         Oil and Gas        (3,1,1,-0.86)  16280.02  (0,1,3)  6723.19   (1,4)  3235.42
IOC          Oil and Gas        (4,1,3,-0.99)  15310.17  (0,1,0)  5997.3    (4,3)  2507.68
BPCL         Oil and Gas        (3,1,4,-0.12)  17613.31  (0,1,1)  8301.91   (2,4)  4812.58
GAIL         Oil and Gas        (1,1,0,-0.95)  15361.62  (1,1,2)  5701.66   (1,4)  2216
DLF          Realty             (4,1,3,-0.58)  15877.77  (0,1,0)  7386.11   (1,4)  3903.39
OBEROIRLTY   Realty             (3,1,4,-0.48)  17613.01  (0,1,0)  8773.98   (1,4)  5269.87
IBREALEST    Realty             (2,1,2,-0.07)  14669.01  (0,1,0)  7063.83   (4,4)  3580.64
PHOENIXLTD   Realty             (3,1,2,-0.46)  18156.09  (0,1,0)  8984.45   (1,4)  5487.92
GODREJPROP   Realty             (2,1,3,-0.99)  18025.98  (3,1,2)  9184.49   (4,3)  5674.59
PRESTIGE     Realty             (4,1,4,-0.81)  16702.88  (1,1,2)  7951.93   (1,4)  4469.41
SUNTV        Media              (0,1,0,-0.99)  18785.19  (0,1,0)  10086.94  (3,4)  6604.86
ZEEL         Media              (3,1,2,-0.9)   18310.66  (1,1,1)  8454.39   (1,4)  4963.98
INOXLEISUR   Media              (4,1,2,-0.99)  16590.94  (0,1,0)  7500.1    (1,4)  4014.61
TV18BRDCST   Media              (3,1,3,-0.16)  12401.78  (2,1,2)  3688.04   (1,4)  205.56
PVR          Media              (3,1,2,-0.82)  20315.5   (3,1,3)  11014.02  (1,4)  7532.46
NETWORK18    Media              (4,0,4,-0.29)  12905.83  (0,1,0)  4406.5    (1,4)  920.75
GODREJCP     FMCG               (3,1,2,-0.99)  18614.79  (0,1,0)  9214.85   (1,4)  5719.67
HINDUNILVR   FMCG               (2,1,3,-0.07)  20312.31  (0,1,1)  10042.02  (1,4)  6556.47
NESTLEIND    FMCG               (4,1,4,-0.09)  25069.25  (0,1,0)  14963.62  (1,4)  11473.02
DABUR        FMCG               (1,1,1,-0.99)  17296.17  (1,1,0)  7282.27   (1,4)  3790.46
BRITANNIA    FMCG               (4,1,4,-0.99)  21341.27  (0,1,0)  11542.71  (4,4)  8049.34
ITC          FMCG               (0,1,1,-0.99)  16984.19  (0,1,0)  6846.98   (1,4)  3358.54

Table 9: Time series models fit on the NSE stock price data used in Experiment 2 for the period 1 January 2014 to 31 December 2018

                        RF-TDA                RF-DWT                RF-MODWT
Class I  Class II       I      II     All     I      II     All     I      II     All
Banks    Public Banks   0.89   0.97   0.93    0.75   0.86   0.81    0.71   0.75   0.73
IT       Public Banks   0.97   0.93   0.92    0.83   0.94   0.89    0.68   0.75   0.72
Pharma   Public Banks   0.96   0.99   0.97    0.85   0.9    0.88    0.72   0.88   0.8
Banks    Pharma         1      0.96   0.98    0.74   0.76   0.75    0.72   0.6    0.66
Pharma   IT             0.96   0.99   0.97    0.71   0.67   0.69    0.85   0.94   0.9
Banks    IT             0.97   0.94   0.96    0.74   0.74   0.74    0.83   0.89   0.86
Median                  0.965  0.965  0.965   0.745  0.81   0.78    0.72   0.815  0.765
(I = Class I accuracy, II = Class II accuracy, All = overall accuracy)

Table 10: The performance of RF-TDA, RF-DWT and RF-MODWT on the training set of NSE stock price data used in Experiment 1 for the period 1 January 2014 to 31 December 2018. The number of observations in each class is 72.

                        RF-TDA                RF-DWT                RF-MODWT
Class I  Class II       I      II     All     I      II     All     I      II     All
Banks    Public Banks   0.62   0.62   0.62    0.83   0.5    0.67    0.33   0.75   0.54
IT       Public Banks   0.92   0.62   0.77    0.83   0.5    0.67    0.25   0.91   0.58
Pharma   Public Banks   0.33   0.75   0.54    0.67   0.83   0.75    0.91   0.5    0.7
Banks    Pharma         0.83   0.83   0.83    0.42   0.67   0.54    0.5    0.5    0.5
Pharma   IT             1      0.75   0.88    0.5    0.58   0.54    0.75   0.58   0.66
Banks    IT             0.92   0.69   0.81    0.33   0.75   0.54    0.83   0.25   0.54
Median                  0.875  0.72   0.79    0.58   0.625  0.60    0.625  0.54   0.56
(I = Class I accuracy, II = Class II accuracy, All = overall accuracy)

Table 11: The performance of RF-TDA, RF-DWT and RF-MODWT on the test set of NSE stock price data of Experiment 1 for the period 1 January 2019 to 1 November 2019. The number of observations in each class is 12.

                           RF-TDA              RF-DWT              RF-MODWT
Class I      Class II      I     II    All     I     II    All     I     II    All
Auto         Con Dur       0.6   0.55  0.58    0.88  0.9   0.89    0.7   0.78  0.74
Auto         Oil and Gas   0.5   0.62  0.56    0.88  0.88  0.88    0.87  0.87  0.87
Auto         Realty        0.87  0.77  0.82    0.8   0.88  0.84    0.73  0.87  0.8
Auto         Media         0.67  0.65  0.66    0.75  0.73  0.74    0.75  0.73  0.74
Auto         FMCG          0.45  0.57  0.51    0.75  0.65  0.7     0.73  0.6   0.67
Con Dur      Oil and Gas   0.25  0.68  0.47    0.73  0.55  0.64    0.88  0.7   0.79
Con Dur      Realty        0.77  0.67  0.72    0.67  0.7   0.68    0.7   0.7   0.7
Con Dur      Media         0.6   0.57  0.58    0.8   0.67  0.73    0.85  0.47  0.66
Con Dur      FMCG          0.62  0.7   0.66    0.77  0.67  0.72    0.7   0.72  0.71
Oil and Gas  Realty        0.9   0.77  0.83    0.62  0.78  0.7     0.63  0.83  0.73
Oil and Gas  Media         0.62  0.8   0.71    0.65  0.75  0.7     0.67  0.73  0.7
Oil and Gas  FMCG          0.6   0.72  0.66    0.68  0.7   0.69    0.68  0.7   0.69
Realty       Media         0.7   0.47  0.58    0.47  0.75  0.61    0.48  0.82  0.65
Realty       FMCG          0.85  0.8   0.83    0.65  0.78  0.72    0.67  0.77  0.72
Media        FMCG          0.8   0.73  0.77    0.7   0.5   0.6     0.72  0.45  0.58
Median                     0.62  0.68  0.66    0.73  0.73  0.7     0.7   0.73  0.71
(I = Class I accuracy, II = Class II accuracy, All = overall accuracy)

Table 12: The performance of RF-TDA, RF-DWT and RF-MODWT on the training set of NSE stock price data of Experiment 2 for the period 1 January 2019 to 1 November 2019. The number of observations in each class is 12.

                           RF-TDA              RF-DWT              RF-MODWT
Class I      Class II      I     II    All     I     II    All     I     II    All
Auto         Con Dur       0.17  0     0.08    0.83  0.83  0.83    0.67  0.5   0.58
Auto         Oil and Gas   0.33  0.83  0.58    0.83  0.67  0.75    0.83  0.83  0.83
Auto         Realty        0.5   1     0.75    0.83  0.67  0.75    0.83  0.67  0.75
Auto         Media         0.17  1     0.58    0.67  0.83  0.75    0.83  0.67  0.75
Auto         FMCG          1     0.83  0.92    0.83  0.5   0.67    0.67  0.5   0.58
Con Dur      Oil and Gas   0     0.67  0.33    0.83  1     0.92    1     0.67  0.83
Con Dur      Realty        1     0.83  0.92    0.17  1     0.58    1     0.5   0.75
Con Dur      Media         1     1     1       0.17  0.83  0.5     1     0.5   0.75
Con Dur      FMCG          0.17  1     0.58    0.67  0.67  0.67    0.33  0.83  0.58
Oil and Gas  Realty        0.83  0.67  0.75    0.67  1     0.83    0.5   0.83  0.67
Oil and Gas  Media         1     0.5   0.75    0.83  0.5   0.67    0.5   0.83  0.67
Oil and Gas  FMCG          0.67  0.83  0.75    0.83  0.67  0.75    0.67  0.67  0.67
Realty       Media         0.17  0.5   0.33    0.67  0.67  0.67    0.33  1     0.67
Realty       FMCG          1     1     1       0.67  0.5   0.58    0.67  0.5   0.58
Media        FMCG          1     1     1       0.83  0.33  0.58    0.67  0.33  0.5
Median                     0.67  0.83  0.75    0.83  0.67  0.67    0.67  0.67  0.67
(I = Class I accuracy, II = Class II accuracy, All = overall accuracy)

Table 13: The performance of RF-TDA, RF-DWT and RF-MODWT on the test set of NSE stock price data of Experiment 2 for the period 1 January 2019 to 1 November 2019. The number of observations in each class is 12.

Method         H0      H0*
RF-DWT         0.627   0.625
knn-DTW        0.509   0.554
knn-Euclidean  0.568   0.463
RF-MODWT       0.839   0.842

Table 14: p-values for tests of hypotheses H0 and H0* about the performance of RF-TDA using results on simulated data

Method         H0      H0*
RF-DWT         0.0364  0.0969
knn-DTW        0.0019  0.0479
knn-Euclidean  0.0316  0.0492
RF-MODWT       0.175   0.1239

Table 15: p-values for tests of hypotheses H0 and H0* about the performance of RF-TDA using results on real data in Experiments 1 and 2 taken together
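Tables 14 and 15 report p-values for hypotheses H0 and H0* about whether RF-TDA outperforms each competing method. The test statistic is defined earlier in the paper and not repeated in this excerpt; as an illustrative stand-in only, a one-sided paired permutation test on matched per-task accuracies can be sketched as follows.

```python
import numpy as np

def paired_permutation_pvalue(acc_a, acc_b, n_perm=10000, seed=0):
    # One-sided paired permutation test of "method A is no better than B",
    # applied to accuracies matched task by task. Under the null, each paired
    # difference is symmetric about zero, so its sign can be flipped at random.
    rng = np.random.default_rng(seed)
    d = np.asarray(acc_a) - np.asarray(acc_b)
    obs = d.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(d)))
    perm = (signs * d).mean(axis=1)
    return float((perm >= obs).mean())
```

A small p-value would indicate that A's average advantage over B is unlikely under the null of no difference.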
Class            RF-TDA     RF-TDA   RF-MODWT   RF-MODWT   RF-DWT     RF-DWT
                 training   test     training   test       training   test
GARCH (2,1)      0.3        0.23     0.7        0.52       0.65       0.61
ARIMA (1,1,1)    0.2        0.08     0.65       0.79       0.44       0.67
ARMA (1,2)       0.34       0.23     0.06       0.298      0.07       0.17
MA (2)           0.28       0.47     0.58       0.71       0.52       0.61
ARMA (2,1)       0.32       0.38     0.07       0.32       0.06       0.29
GARCH (1,1)      0.21       0.18     0.48       0.38       0.58       0.52
AR (1)           0.48       0.44     0.26       0.32       0.29       0.52
MA (1)           0.34       0.29     0.17       0.58       0.24       0.56
ARCH (1)         0.22       0.39     0.63       0.86       0.65       0.85
ARIMA (0,2,1)    0.56       0.59     0.99       0.92       0.98       0.91
ARFIMA (1,2,1)   0.68       0.64     0.99       0.82       0.99       0.8
ARFIMA (2,1,2)   0.29       0.38     0.82       0.94       0.78       0.95
AR (2)           0.27       0.26     0.35       0.61       0.37       0.48
(columns report training and test class accuracies)

Table 16: Performance of RF-TDA, RF-DWT and RF-MODWT on multi-class classification using top-2 prediction sets on the simulated data described in Section 3
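Tables 16-18 use top-2 prediction sets: a case counts as correct when the true class is among the two classes ranked highest by the classifier's predicted probabilities (e.g. a random forest's class-probability votes). A sketch of the metric follows; the handling of probability ties is an assumption.

```python
import numpy as np

def top2_accuracy(proba, y_true):
    # proba: (n_cases, n_classes) predicted class probabilities.
    # A case is a hit when its true class is among the two classes
    # with the highest predicted probability.
    proba = np.asarray(proba)
    top2 = np.argsort(proba, axis=1)[:, -2:]
    hits = [y in row for y, row in zip(y_true, top2)]
    return float(np.mean(hits))
```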


Class         RF-TDA     RF-TDA   RF-MODWT   RF-MODWT   RF-DWT     RF-DWT
              training   test     training   test       training   test
Banks         0.94       0.5      0.76       0.67       0.76       0.67
Pharma        0.99       0.75     0.76       0.67       0.76       0.67
IT            1          1        0.85       0.75       0.85       0.75
Public Banks  0.97       0.75     0.96       0.5        0.96       0.5
(columns report training and test class accuracies)

Table 17: Performance of RF-TDA, RF-DWT and RF-MODWT on multi-class classification using top-2 prediction sets on sector-wise NSE stock price data used in Experiment 1 in Section 4.1

Class        RF-TDA     RF-TDA   RF-MODWT   RF-MODWT   RF-DWT     RF-DWT
             training   test     training   test       training   test
Auto         0.67       0.17     0.67       0.67       0.68       0.83
Con Dur      0.6        0.17     0.6        0.33       0.65       0.83
FMCG         0.73       0.67     0.73       0.67       0.73       0.67
Media        0.77       0.5      0.77       0.5        0.75       0.17
Oil and Gas  0.38       0.83     0.38       0.33       0.38       0.5
Realty       0.38       1        0.38       0.67       0.48       0.67
(columns report training and test class accuracies)

Table 18: Performance of RF-TDA, RF-DWT and RF-MODWT on multi-class classification using top-2 prediction sets on sector-wise NSE stock price data used in Experiment 2 in Section 4.2

Figure 1: a) The projection of points sampled from S^3; b) the persistence diagram of the points sampled from S^3; c) the persistence landscape of the points sampled from S^3

Figure 2: a) The time series of Indian Bank from 1 January 2014 to 31 December 2018; b) the time-delay reconstruction of the time series
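The reconstruction in Figure 2(b) is a time-delay embedding: the scalar series is mapped to points (x_t, x_{t+τ}, ..., x_{t+(m−1)τ}). A sketch is below; the embedding dimension m and delay τ used for the figure are not stated in this excerpt, so the defaults are illustrative.

```python
import numpy as np

def time_delay_embedding(x, dim=2, delay=1):
    # Map a scalar series into dim-dimensional delay vectors:
    # row t is (x_t, x_{t+delay}, ..., x_{t+(dim-1)*delay}).
    x = np.asarray(x)
    n = len(x) - (dim - 1) * delay
    return np.column_stack([x[i * delay : i * delay + n] for i in range(dim)])
```

With dim=2 and delay=1 this produces the point cloud of consecutive pairs plotted in Figure 2(b).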


Figure 3: Persistence landscapes of the Pharma constituents Cipla, Divi, Sun and Dr. Reddy, generated for the period 1 January 2014 to 31 December 2018. The line types denote the landscapes of the subseries generated with window size 100
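The curves in Figures 1, 3 and 4 are persistence landscapes: for a persistence diagram of (birth, death) pairs, the k-th landscape function at t is the k-th largest value of max(0, min(t − b, d − t)). A minimal evaluation sketch:

```python
import numpy as np

def landscape(diagram, k, t):
    # Evaluate the k-th persistence landscape function at t for a persistence
    # diagram given as (birth, death) pairs: each pair contributes a "tent"
    # of height max(0, min(t - b, d - t)), and the k-th largest tent is kept.
    diagram = np.asarray(diagram, dtype=float)
    tents = np.maximum(0.0, np.minimum(t - diagram[:, 0], diagram[:, 1] - t))
    tents = np.sort(tents)[::-1]
    return tents[k - 1] if k <= len(tents) else 0.0
```

Evaluating the first few landscape functions on a grid of t values yields the piecewise-linear curves shown in the figures.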

Figure 4: Persistence landscapes of the Public Banks constituents Bank of Baroda, Canara Bank, Punjab National Bank and State Bank of India, generated for the period 1 January 2014 to 31 December 2018. The line types denote the landscapes of the subseries generated with window size 100
Highlights


- New methods for time series clustering (SOM-TDA) and classification (RF-TDA).
- RF-TDA outperforms other methods on the classification task.
- Dependence of stock price movements on sectors in NSE is revealed using RF-TDA.

Credit Author Statement


Sourav Majumdar: Conceptualization, Methodology, Software, Formal Analysis, Investigation, Data Curation, Writing - Original Draft

Arnab Kumar Laha: Conceptualization, Methodology, Formal Analysis, Writing - Review and Editing, Supervision

Conflict of Interest Statement

Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:

