Abstract
Mining sequential data is an old topic that has been revived in the last decade, due to the increasing availability of sequential
datasets. Most works in this field are centred on the definition and use of a distance (or, at least, a similarity measure) between
sequences of elements. A measure called Dynamic Time Warping (DTW) seems to be currently the most relevant for a large panel
of applications. This article is about the use of DTW in data mining algorithms, and focuses on the computation of an average of a
set of sequences. Averaging is an essential tool for the analysis of data. For example, the K-means clustering algorithm repeatedly
computes such an average, and needs to provide a description of the clusters it forms. Averaging is here a crucial step, which must
be sound in order to make algorithms work accurately. When dealing with sequences, especially when sequences are compared
with DTW, averaging is not a trivial task.
Starting with existing techniques developed around DTW, the article suggests an analysis framework to classify averaging
techniques. It then proceeds to study the two major questions raised by the framework. First, we develop a global technique for
averaging a set of sequences. This technique is original in that it avoids using iterative pairwise averaging. It is thus insensitive to
ordering effects. Second, we describe a new strategy to reduce the length of the resulting average sequence. This has a favourable
impact on performance, but also on the relevance of the results. Both aspects are evaluated on standard datasets, and the evaluation
shows that they compare favourably with existing methods. The article ends by describing the use of averaging in clustering.
The last section also introduces a new application domain, namely the analysis of satellite image time series, where data mining
techniques provide an original approach.
Key words: sequence analysis, time series clustering, dynamic time warping, distance-based clustering, time series averaging,
DTW Barycenter Averaging, global averaging, satellite image time series
method to represent in one object, information from a set of objects. Algorithms like K-medoids use the medoid of a set of objects. Others, like K-means, need to average a set of objects by finding a mean of the set. If these "consensus" representations are easily definable for objects in Euclidean space, this is much more difficult for sequences under DTW. Finding a consensus representation of a set of sequences is even described by [20], chapter 14, as the Holy Grail. In this section, we first introduce the problem of finding a consensus sequence from a set of sequences, inspired by theories developed in computational biology. Then, we make the link between these theories on strings and methods currently used to average sequences of numbers under Dynamic Time Warping.

To simplify explanations, we use the term coordinate to designate an element, point, or component of a sequence. Without loss of generality, we consider that sequences contain T coordinates that are one-dimensional data. We denote by A(β) a sequence of length β. In the following, we consider a set S = {S1, · · ·, SN} of N sequences from which we want to compute a consensus sequence C.

Since no information on the length of the average sequence is available, the search cannot be limited to sequences of a given length, so all possible values of t have to be considered. Note that the sequences of S have a fixed length T. C hence has to fulfill:

∀t ∈ [1, +∞[, ∀X ∈ E^t :   Σ_{n=1}^{N} DTW²(C, Sn) ≤ Σ_{n=1}^{N} DTW²(X, Sn)    (3)

This definition relies in fact on the theory of Steiner trees; C is called a Steiner sequence [20]. Note that the sum in Equations (2) and (3) is often called the Within Group Sum of Squares (WGSS), or the discrepancy distance in [21]. We will also use the simpler term inertia, used in most works on clustering.
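To make the criterion concrete, the following minimal Python sketch (ours, not the authors' code) evaluates this WGSS for a candidate consensus sequence C against a set S, using a textbook dynamic-programming DTW; the function names, the squared-Euclidean local cost and the toy data are illustrative assumptions.

import numpy as np

def dtw(a, b):
    # Textbook O(|a|*|b|) dynamic-programming DTW between two 1-D sequences,
    # with a squared Euclidean local cost between coordinates.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def wgss(C, S):
    # Within Group Sum of Squares (inertia) of the candidate average C,
    # written as in Equation (3): sum over n of DTW^2(C, Sn).
    return sum(dtw(C, Sn) ** 2 for Sn in S)

# Toy usage: three shifted sine waves, with the coordinate-wise mean as candidate.
S = [np.sin(np.linspace(0, 6, 60) + s) for s in (0.0, 0.3, 0.6)]
C = np.mean(S, axis=0)
print("WGSS of the coordinate-wise mean:", wgss(C, S))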
3.2. Exact solutions to the Steiner sequence problem

As shown in [22], when the objects considered are simple points in an α-dimensional space, the minimization problem corresponding to Equation (2) can be solved by using the properties of the arithmetic mean. As the notion of arithmetic mean is not easily extendable to semi-pseudometric spaces (i.e., spaces induced by semi-pseudometrics like DTW), we need to detail this Steiner sequence problem, i.e., the problem of finding an average sequence. To solve this problem, there are two closely related families of methods.

The first one consists in computing the global multiple alignment [23] of the N sequences of S. This multiple alignment is computable by extending DTW to align N sequences. For instance, instead of computing DTW by comparing three values in a square, one has to compare seven values in a cube for three sequences. This idea can be generalized by computing DTW in an N-dimensional hypercube. Given this global alignment, C can be found by averaging the multiple alignment column by column. However, this method presents two major difficulties that prevent its use. First, the multiple alignment process takes Θ(T^N) operations [24], and is not tractable for more than a few sequences. Second, the global length of the multiple alignment can be on the order of T^N, and requires unrealistic amounts of memory.

The second family of methods consists in searching through the solution space, keeping the solutions that minimize the sum (Equation 2). In the discrete case (i.e., sequences of characters or of symbolic values), as the alphabet is generally finite, this scan is easy. However, as the length of the global alignment is potentially T^N − 2T, scanning the space takes Θ(T^N) operations. In the continuous case (i.e., sequences of numbers or of number vectors), scanning all solutions is impossible. Nevertheless, we explain in the next paragraph how this search can be guided towards potential solutions, even though this strategy also exhibits exponential complexity.

In fact, we need a method to generate all potential solutions. Each solution corresponds to a coupling between C and each sequence of S. As all coordinates of each sequence of S must be associated to at least one coordinate of C, we have to generate all possible groupings between coordinates of each sequence of S and coordinates of C. Each coordinate of C must be associated to a non-empty, non-overlapping set of coordinates of sequences of S. To generate all possible groupings, we can consider that each sequence is split into as many subsequences as there are coordinates in C. Thus, the first coordinate of C will be associated to the coordinates appearing in the first subsequences of the sequences of S, the second coordinate to coordinates appearing in the second subsequences of the sequences of S, and so on. Then, for C(p) (C of length p), the N sequences of S have to be cut into p parts. There are (T choose p) ways to cut a sequence of length T into p subsequences. Thus, for N sequences there are (T choose p)^N possibilities. The time complexity of this scan is therefore Θ((T choose p)^N).

One can note in Equation (3) that p > T does not make sense: it would mean that sequences have to be split into more subsequences than they contain coordinates. Thus, we can limit the search to p ∈ [1, T]. But even with this tighter bound, the overall computation remains intractable.

3.3. Approximating the exact solution

As we explained in the previous subsection, finding the average sequence is deeply linked to the multiple alignment problem. Unfortunately, 30 years of well-motivated research did not provide any exact scalable algorithm, neither for the multiple alignment problem nor for the consensus sequence problem. Given a sequence standing for the solution, we cannot even check whether the potential solution is optimal, because we rely on a subset of the search space. A number of heuristics have been developed to solve this problem (see [25–29] for examples). We present in this subsection the most common family of methods used to approximate the average sequence: iterative pairwise averaging. We will also link this family to existing averaging methods for DTW.

Iterative pairwise averaging consists in iteratively merging two average sequences of two subsets of sequences into a single average sequence of the union of those subsets. The simplest strategy computes an average of two sequences and iteratively incorporates one sequence into the average sequence. Differences between existing methods lie in the order in which the merges are done, and in the way they compute an average of two sequences.

3.3.1. Ordering schemes

Tournament scheme. The simplest and most obvious averaging ordering consists in pairwise averaging sequences following a tournament scheme. That way, N/2 average sequences are created at the first step. Then those N/2 sequences are, in turn, pairwise averaged into N/4 sequences, and so on, until one sequence is obtained. In this approach, the averaging method (between two sequences) is applied N − 1 times.

Ascendant hierarchical scheme. A second approach consists in averaging first the two sequences whose distance (DTW) is minimal over all pairs of sequences. This works like Ascendant Hierarchical Clustering, computing a distance matrix before each average computation. In that way, the averaging method is also called N − 1 times. In addition, one has to take into account the time required to compute the distance matrix N times.

3.3.2. Computing the average sequence

Regarding the way to compute an average from two sequences under DTW, most methods use the associations (coupling) computed with DTW.

One coordinate by association. Starting from a coupling between two sequences, the average sequence is built using the center of each association. Each coordinate of the average sequence will thus be the center of one association created by DTW. The main problem of this technique is that the resulting mean can grow substantially in length, because up to |A| + |B| − 2 associations can be created between two sequences A and B.

One coordinate by connected component. Considering that the coupling (created by DTW) between two sequences forms a graph, the idea is to associate each connected component of this graph to a coordinate of the resulting mean, usually taken as the barycenter of this component. Contrary to previous methods, the length of the resulting mean can decrease. The resulting length will be between 2 and min(|A|, |B|).
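As an illustration of the "one coordinate by association" computation (a sketch under our own naming, not the authors' implementation), the following Python code backtracks the DTW dynamic-programming matrix to recover the coupling between two sequences and then takes the center of each association; as noted above, the resulting mean is generally longer than either input.

import numpy as np

def dtw_path(a, b):
    # DTW between 1-D sequences a and b; returns the optimal coupling as a
    # list of index pairs (i, j), recovered by backtracking the cost matrix.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    i, j, path = n, m, []
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        candidates = []
        if i > 1 and j > 1:
            candidates.append((cost[i - 1, j - 1], i - 1, j - 1))
        if i > 1:
            candidates.append((cost[i - 1, j], i - 1, j))
        if j > 1:
            candidates.append((cost[i, j - 1], i, j - 1))
        _, i, j = min(candidates)   # follow the cheapest predecessor
    path.append((0, 0))
    path.reverse()
    return path

def pair_average_by_association(a, b):
    # One output coordinate per association of the coupling, taken as the
    # center of the two associated coordinates; the result is generally
    # longer than either input sequence.
    return np.array([(a[i] + b[j]) / 2.0 for i, j in dtw_path(a, b)])

# Toy usage: the average of two shifted sine waves grows in length.
a = np.sin(np.linspace(0, 6, 50))
b = np.sin(np.linspace(0, 6, 50) + 0.4)
print(len(a), len(b), "->", len(pair_average_by_association(a, b)))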
3.4. Existing algorithms

The different ordering schemes and average computations just described are combined in the DTW literature to make up algorithms. The two main averaging methods for DTW are presented below.

NonLinear Alignment and Averaging Filters. NLAAF was introduced in [30] and rediscovered in [15]. This method uses the tournament scheme and the one coordinate by association averaging method. Its main drawback lies in the growth of its resulting mean. As stated earlier, each use of the averaging method can almost double the length of the average sequence. The entire NLAAF process could produce, over all sequences, an average sequence of length N × T. As classical datasets comprise thousands of sequences made up of on the order of a hundred coordinates, simply storing the resulting mean could be impossible. This length problem is moreover worsened by the complexity of DTW, which grows bilinearly with the lengths of the sequences. That is why NLAAF is generally used in conjunction with a process reducing the length of the mean, leading to a loss of information and thus to an unsatisfactory approximation.

Prioritized Shape Averaging. PSA was introduced in [21] to resolve shortcomings of NLAAF. This method uses the ascendant hierarchical scheme and the one coordinate by connected component averaging method. Although this hierarchical averaging method aims at preventing the error from propagating too much, the length of the average sequences remains a problem. If one alignment (with DTW) between two sequences leads to two connected components (i.e., the associations form two hand-held fans), the overall resulting mean will be composed of only two coordinates. Obviously, such a sequence cannot represent a full set of potentially long sequences. This is why the authors proposed to replicate each coordinate of the average sequence as many times as there were associations in the corresponding connected component. However, this repetition of coordinates causes the problem already observed with NLAAF, by potentially doubling the number of coordinates of each intermediate average sequence. To alleviate this problem, the authors suggest using a process to reduce the length of the resulting mean.
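For concreteness, here is a minimal sketch (our illustration, with assumed names) of the tournament ordering scheme that NLAAF relies on, parameterised by an arbitrary pairwise averaging function such as the association-based one sketched in Section 3.3.2; the naive coordinate-wise pair average used below is only a placeholder.

import numpy as np

def tournament_average(sequences, pair_average):
    # Tournament scheme: average sequences two by two (N -> N/2 -> N/4 -> ...)
    # until a single sequence remains; pair_average is any two-sequence mean.
    current = list(sequences)
    while len(current) > 1:
        nxt = [pair_average(current[k], current[k + 1])
               for k in range(0, len(current) - 1, 2)]
        if len(current) % 2 == 1:
            nxt.append(current[-1])   # an odd leftover is carried to the next round
        current = nxt
    return current[0]

# Placeholder pairwise mean (coordinate-wise, equal lengths); in practice a
# DTW-based pair average such as pair_average_by_association would be used.
S = [np.sin(np.linspace(0, 6, 60) + s) for s in (0.0, 0.2, 0.4, 0.6)]
mean = tournament_average(S, lambda a, b: (a + b) / 2.0)
print(mean.shape)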
3.5. Motivation

We have seen that most of the works on averaging sets of sequences can be analyzed along two dimensions: first, the way they consider the individual sequences when averaging, and second, the way they compute the elements of the resulting sequences. These two characteristics have proved useful to classify the existing averaging techniques. They are also useful angles under which new solutions can be elaborated.

Regarding the averaging of individual sequences, the main shortcoming of all existing methods is their use of pairwise averaging. When computing the mean of N sequences by pairwise averaging, the order in which sequences are taken influences the quality of the result, because neither NLAAF nor PSA are associative functions. Pairwise averaging strategies are intrinsically sensitive to the order, with no guarantee that a different order would lead to the same result. Local averaging strategies like PSA or NLAAF may let an initial approximation error propagate throughout the averaging process. If the averaging process has to be repeated (e.g., during K-means iterations), the effects may dramatically alter the quality of the result. This is why a global approach is desirable, where sequences would be averaged all together, with no sensitivity to their order of consideration. The obvious analogy to a global method is the computation of the barycenter of a set of points in a Euclidean space. Section 4 follows this line of reasoning in order to introduce a global averaging strategy suitable for DTW, and provides empirical evidence of its superiority over existing techniques.

The second dimension along which averaging techniques can be classified is the way they select the elements of the mean. We have seen that a naive use of the DTW-computed associations may lead to some sort of "overfit", with an average covering almost every detail of every sequence, whereas simpler and smoother averages could well provide a better description of the set of sequences. Moreover, long and detailed averages have a strong impact on further processing. Here again, iterative algorithms like K-means are especially at risk: every iteration may lead to a longer average, and because the complexity of DTW is directly related to the length of the sequences involved, later iterations will take longer than earlier ones. In such cases, unconstrained averaging will not only lead to an inadequate description of clusters, it will also cause a severe performance degradation. This negative effect of sequence averaging is well known, and corrective actions have been proposed. Section 5 builds on our averaging strategy to suggest new ways of shortening the average.

4. A new averaging method for DTW

To solve the problems of existing pairwise averaging methods, we introduce a global averaging strategy called DTW Barycenter Averaging (DBA). This section first defines the new averaging method and details its complexity. Then DBA is compared to NLAAF and PSA on standard datasets [19]. Finally, the robustness and the convergence of DBA are studied.

4.1. Definition of DBA

DBA stands for DTW Barycenter Averaging. It is a heuristic strategy, designed as a global averaging method. DBA consists in iteratively refining an initially (potentially arbitrary) average sequence, in order to minimize its squared distance (DTW) to the averaged sequences.

Let us provide an intuition of the mechanism of DBA. The aim is to minimize the sum of squared DTW distances from the average sequence to the set of sequences. This sum is formed by single distances between each coordinate of the average sequence and the coordinates of sequences associated to it. Thus, the contribution of one coordinate of the average sequence to the total sum of squared distances is actually a sum of Euclidean distances between this coordinate and the coordinates of sequences associated to it during the computation of DTW.
Note that a coordinate of one of the sequences may contribute to the new position of several coordinates of the average. Conversely, any coordinate of the average is updated with contributions from one or more coordinates of each sequence. In addition, minimizing this partial sum for each coordinate of the average sequence is achieved by taking the barycenter of this set of coordinates. The principle of DBA is to compute each coordinate of the average sequence as the barycenter of its associated coordinates of the set of sequences. Thus, each coordinate will minimize its part of the total WGSS in order to minimize the total WGSS. The updated average sequence is defined once all barycenters are computed.

Technically, for each refinement, i.e., for each iteration, DBA works in two steps:

1. Computing DTW between each individual sequence and the temporary average sequence to be refined, in order to find associations between coordinates of the average sequence and coordinates of the set of sequences.

2. Updating each coordinate of the average sequence as the barycenter of the coordinates associated to it during the first step.
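The sketch below (our Python rendering, not the authors' code; names and data are illustrative) spells out these two steps for one-dimensional sequences: the warping path of each DTW computation provides the associations, and each coordinate of the refined average is the barycenter of the coordinates associated to it.

import numpy as np

def dtw_path(a, b):
    # DTW between 1-D sequences; returns the optimal warping path as index pairs.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = (a[i - 1] - b[j - 1]) ** 2 + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    i, j, path = n, m, []
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        candidates = []
        if i > 1 and j > 1:
            candidates.append((cost[i - 1, j - 1], i - 1, j - 1))
        if i > 1:
            candidates.append((cost[i - 1, j], i - 1, j))
        if j > 1:
            candidates.append((cost[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
    path.append((0, 0))
    return path

def dba_iteration(C, S):
    # Step 1: align every sequence of S with the current average C and collect,
    # for each coordinate of C, the coordinates of S associated to it.
    assoc = [[] for _ in range(len(C))]
    for seq in S:
        for i, j in dtw_path(C, seq):
            assoc[i].append(seq[j])
    # Step 2: move each coordinate of C to the barycenter of its associations.
    return np.array([np.mean(values) for values in assoc])

# Toy usage: initialise with an element of S and refine a few times.
S = [np.sin(np.linspace(0, 6, 60) + s) for s in (0.0, 0.3, 0.6)]
C = S[0].copy()
for _ in range(5):
    C = dba_iteration(C, S)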
Let S = {S1, · · ·, SN} be the set of sequences to be averaged, let C = ⟨C1, . . . , CT⟩ be the average sequence at iteration i, and let C′ = ⟨C′1, . . . , C′T⟩ be the update of C at iteration i + 1, whose coordinates we want to find. In addition, each coordinate of the average sequence is defined in an arbitrary vector space E (usually a Euclidean space):

∀t ∈ [1, T], Ct ∈ E    (4)

We consider the function assoc, which links each coordinate of the average sequence to one or more coordinates of the sequences of S. This function is computed during the DTW computation between C and each sequence of S. The t-th coordinate of the average sequence, C′t, is then defined as:

C′t = barycenter(assoc(Ct))    (5)

where

barycenter{X1, . . . , Xα} = (X1 + · · · + Xα) / α    (6)

(the addition of the Xi is vector addition). Algorithm 5 details the complete DBA computation.

Then, by computing DTW again between the average sequence and all sequences of S, the associations created by DTW may change. As it is impossible to anticipate how these associations will change, we propose to make C converge iteratively.
Figure 2 shows four iterations (i.e., four updates) of DBA on an example with two sequences.

As a summary, the proposed averaging method for Dynamic Time Warping is a global approach that can average a set of sequences all together. The update of the average sequence between two iterations is independent of the order in which the individual sequences are used to compute their contribution to the update in question. Figure 3 shows an example of an average sequence computed with DBA on one dataset from [19]. This figure shows that DBA preserves the ability of DTW to identify time shifts.

Figure 2: DBA iteratively adjusting the average of two sequences.
Algorithm 5 DBA
Require: C = ⟨C1, . . . , CT′⟩ the initial average sequence
Require: S1 = ⟨s11, . . . , s1T⟩ the 1st sequence to average
  ...
Require: Sn = ⟨sn1, . . . , snT⟩ the nth sequence to average
  Let T be the length of the sequences
  Let assocTab be a table of size T′ containing in each cell the set of coordinates associated to one coordinate of C
  Let m[T, T] be a temporary DTW (cost, path) matrix
  assocTab ← [∅, . . . , ∅]
  for seq in S do
    m ← DTW(C, seq)
    i ← T′
    j ← T
    while i ≥ 1 and j ≥ 1 do
      assocTab[i] ← assocTab[i] ∪ {seqj}
      (i, j) ← second(m[i, j])
    end while
  end for
  for i = 1 to T′ do
    C′i ← barycenter(assocTab[i])   {see Equation 6}
  end for
  return C′

4.2. Initialization and Convergence

The DBA algorithm starts with an initial average and refines it by minimizing its WGSS with respect to the sequences it averages. This section examines the effect of the initialisation and the rate of convergence.

Initialization. There are two major factors to consider when priming the iterative refinement:

• first, the length of the starting average sequence,

• and second, the values of its coordinates.

Regarding the length of the initial average, we have seen in Section 3.2 that its upper bound is T^N, but that such a length cannot reasonably be used. However, the inherent redundancy of the data lets one expect that a much shorter sequence can be adequate. We will observe in Section 4.5 that a length of around T (the length of the sequences to average) performs well.

Regarding the values of the initial coordinates, it is theoretically impossible to determine the optimal values, otherwise the whole averaging process would become useless. In methods that require an initialisation, e.g., K-means clustering, a large number of heuristics have been developed. We will describe in Section 4.5 experiments with the most frequent techniques: first, a randomized choice, and second, using an element of the set of sequences to average. We will show empirically that the latter gives an efficient initialisation.

Convergence. As explained previously, DBA is an iterative process. It is necessary, once the average is computed, to update it several times. This has the property of letting DTW refine its associations. It is important to note that at each iteration, the inertia can only decrease, since the new average sequence is closer (under DTW) to the elements it averages. If the update does not modify the alignment of the sequences, then Huygens' theorem applies: the barycenters composing the average sequence get closer to the coordinates of S. Otherwise, if the alignment is modified, it means that DTW has found a better alignment, with a smaller inertia (which thus decreases in that case also). We therefore have a guarantee of convergence. Section 4.6 details some experiments that quantify this convergence.

4.3. Complexity study

This section details the time complexity of DBA. Each iteration of the iterative process is divided into two parts:

1. Computing DTW between each individual sequence and the temporary (i.e., current) average sequence, to find associations between its coordinates and the coordinates of the sequences.

2. Updating the mean according to the associations just computed.

Finding associations. The aim of Step 1 is to determine the set of associations between each coordinate of C and the coordinates of the sequences of S. We therefore have to compute DTW once per sequence to average, that is, N times. The complexity of DTW is Θ(T²). The complexity of Step 1 is therefore Θ(N · T²).
Updating the mean. After Step 1, each coordinate Ct of the average sequence has a set of coordinates {p1, . . . , pαt} associated to it. The process of updating C consists in updating each coordinate of the average sequence as the barycenter of this set of coordinates. Since the average sequence is associated to N sequences, its T coordinates are, overall, associated to N · T coordinates, i.e., all coordinates of the sequences of S. The update step thus has a time complexity of Θ(N · T).

Overall complexity. Because DBA is an iterative process, let us note I the number of iterations. The time complexity of the averaging process of N sequences, each one containing T coordinates, is thus:

Θ(DBA) = Θ(I · (N · T² + N · T)) = Θ(I · N · T²)    (7)

Figure 3: An example of the DBA averaging method on one cluster from the "Trace" dataset from [19]. (a) A cluster of the Trace dataset. (b) The average sequence of the cluster.
Comparison with PSA and NLAAF. To compute an average se- [...] DTW between these two sequences, which has a time com- [...] -ever, after having computed this average sequence, it has to be shortened to the length of the averaged sequences. The classi- [...] Θ(T + 2T + 3T + · · · + T²) = Θ(T³). The computation of the aver- [...] -quires: [...] -quences, it has at least to compute a dissimilarity matrix, which [...]

[Figure: one panel per dataset — 50words, Adiac, Beef, CBF, Coffee, ECG200, FaceFour, Lighting2, OliveOil, SwedishLeaf, Trace, Two patterns.]

Evaluating an average sequence is not a trivial task. No [...] in Section 3 that many meanings are covered by the "average" [...]
Experimental settings. To make these experiments reproducible, we provide here the details about our experimental settings:

• all programs are implemented in Java and run on a Core 2 Duo processor running at 2.4 GHz with 3 GB of RAM;

• the distance used between two coordinates of sequences is the squared Euclidean distance. As the square function is strictly increasing on positive numbers, and because we only use comparisons between distances, it is unnecessary to compute square roots. The same optimization has been used in [31], and seems rather common;

• we compare DBA to NLAAF on all datasets from [19] and to PSA using the results given by [21];

• as we want to test the capacity of DBA to minimize the WGSS, and because we do not focus on supervised methods, we put all sequences from both train and test datasets together;

• for each set of sequences under consideration, we report the inertia under DBA and other averaging techniques. To provide a quantitative evaluation, we indicate the ratio between the inertia with respect to the average computed by DBA and those computed by NLAAF and/or PSA (see the tables below). To provide an overall evaluation, the text also sometimes mentions geometric averages of these ratios (see the sketch after this list).
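As a small illustration of this reporting convention (ours, using the first ratios of Table 1 as example values), the geometric average of per-dataset inertia ratios can be computed as follows:

import numpy as np

# Per-dataset ratios inertia(DBA) / inertia(NLAAF); here the first rows of Table 1.
ratios = np.array([0.52, 0.81, 0.32, 0.94, 0.76])
geometric_average = float(np.exp(np.mean(np.log(ratios))))
print(f"geometric average ratio: {geometric_average:.0%}")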
Dataset             NLAAF      DBA        DBA/NLAAF
50words             11.98      6.21       52 %
Adiac               0.21       0.17       81 %
Beef                29.90      9.50       32 %
CBF                 14.25      13.34      94 %
Coffee              0.72       0.55       76 %
ECG200              11.34      6.95       61 %
FaceAll             17.77      14.73      83 %
FaceFour            34.46      24.87      72 %
Fish                1.35       1.02       75 %
GunPoint            7.24       2.46       34 %
Lighting2           194.07     77.57      40 %
Lighting7           48.25      28.77      60 %
OliveOil            0.018261   0.018259   100 %
OSULeaf             53.03      22.69      43 %
SwedishLeaf         2.50       2.21       88 %
Synthetic control   9.71       9.28       96 %
Trace               1.65       0.92       56 %
Two patterns        9.19       8.66       94 %
Wafer               54.66      30.40      56 %
Yoga                40.07      37.27      93 %

Table 1: Comparison of intraclass inertia under DTW between NLAAF and DBA.

Dataset             NLAAF      DBA        DBA/NLAAF
50words             51 642     26 688     52 %
Adiac               647        470        73 %
Beef                3 154      979        31 %
CBF                 21 306     18 698     88 %
Coffee              61.60      39.25      64 %
ECG200              2 190      1 466      67 %
FaceAll             72 356     63 800     88 %
FaceFour             6 569     3 838      58 %
Fish                658        468        71 %
GunPoint            1 525      600        39 %
Lighting2           25 708     9 673      38 %
Lighting7           14 388     7 379      51 %
OliveOil            2.24       1.83       82 %
OSULeaf             30 293     12 936     43 %
SwedishLeaf         5 590      4 571      82 %
Synthetic control   17 939     13 613     76 %
Trace               22 613     4 521      20 %
Two patterns        122 471    100 084    82 %
Wafer               416 376    258 020    62 %
Yoga                136 547    39 712     29 %

Table 3: Comparison of global dataset inertia under DTW between NLAAF and DBA.
Dataset             PSA     DBA     DBA/PSA
Beef                25.65   9.50    37 %
Coffee              0.72    0.55    76 %
ECG200              9.16    6.95    76 %
FaceFour            33.68   24.87   74 %
Synthetic control   10.97   9.28    85 %
Trace               1.66    0.92    56 %

Table 2: Comparison of intraclass inertia under DTW between PSA and DBA.
Intraclass inertia comparison. In this first set of experiments, we compute an average for each class in each dataset. Table 1 shows the global WGSS obtained for each dataset. We notice that, for all datasets, DBA reduces/improves the intraclass inertia. The geometric average of the ratios shown in Table 1 is 65 %.

Table 2 shows a comparison between the results of DBA and PSA. Here again, for all results published in [21], DBA outperforms PSA, with a geometric average of inertia ratios of 65 %. Actually, such decreases of inertia show that the older averaging methods could not be seriously considered for machine learning use.

Global dataset inertia. In the previous paragraph, we computed an average for each class in each dataset. In this paragraph, the goal is to test the robustness of DBA with more data variety. Therefore, we average all sequences of each dataset. This means that only one average sequence is computed for a whole dataset. That way, we compute the global dataset inertia under DTW with NLAAF and DBA, to compare their capacity to summarize mixed data.

As can be seen in Table 3, DBA systematically reduces/improves the global dataset inertia, with a geometric average ratio of 56 %. This means that DBA not only performs better than NLAAF (Table 1), but is also more robust to diversity.

4.5. Impact of initialisation

DBA is deterministic once the initial average sequence C is chosen. It is thus important to study the impact of the choice of the initial mean on the results of DBA. When used with K-means, this choice must be made at each iteration of the algorithm, for example by taking as the initialisation the average sequence C obtained at the previous iteration. However, DBA is not limited to this context.

We have seen in Section 4.2 that two aspects of initialisation have to be evaluated empirically: first, the length of the initial average sequence, and second, the values of its coordinates. We have designed three experiments on some of the datasets from [19]. Because these experiments require heavy computations, we have not repeated the computation on all datasets.

1. The first experiment starts with randomly generated sequences, of lengths varying from 1 to 2T, where T is the length of the sequences in the dataset. Once the length is chosen, the coordinates are generated randomly from a normal distribution of mean zero and variance one. The curves in Figure 5 show the variation of inertia with the length. A small sketch of the initialisation strategies is given below.
graph, the goal is to test robustness of DBA with more data
length.
variety. Therefore, we average all sequences of each dataset.
50
T T
NLAAF
Random sequences Random sequences DBA
Random sequence of length T (100 runs) Random sequence of length T (100 runs)
Sequence from the dataset (100 runs) Sequence from the dataset (100 runs)
10
40
Dataset
NLAAF DBA
Best score Worst score
Inertia
Inertia
30
10 1
Beef 26.7 13.2
Coffee 0.69 0.66
20 ECG200 9.98 8.9
Gun Point 7.2 3.1
100 200 300 400 500
0,1
100 200 300 Lighting7 45 32
Length of the mean Length of the mean 10
Inertia
700
10 0,9
0,8
600
0,7
Normalized inertia
0,6
500
Inertia
0,5
10
200 400 600 800 50 100 150 200 250 0,4
400
Length of the mean Length of the mean
0,3
0,2
0 200
2 4 6 8 10 2 4 6 8 10
100 Iteration Iteration
T T
Random sequences
Random sequence of length T (100 runs)
Sequence from the dataset (100 runs)
Random sequences
Random sequence of length T (100 runs)
Sequence from the dataset (100 runs)
(a) Average convergence of the (b) Example of an uneven con-
10
iterative process over 50 clus- vergence (cluster 4).
ters.
Inertia
Inertia
10
Figure 7: Convergence of the iterative process for the 50words dataset.
1
phenomenon. For instance, a music can be sampled with different bit rates.
50
30
20
Figure 9: Average sequence is drawn at the bottom and one sequence of the set
is drawn at the top.
10