
Article in Pattern Recognition, March 2011 · DOI: 10.1016/j.patcog.2010.09.013 · Source: DBLP


A global averaging method for dynamic time warping, with applications to clustering

François Petitjeana,b,c,∗, Alain Ketterlina , Pierre Gançarskia,b


a University of Strasbourg, 7 rue René Descartes, 67084 Strasbourg Cedex, France
b LSIIT – UMR 7005, Pôle API, Bd Sébastien Brant, BP 10413, 67412 Illkirch Cedex, France
c Centre National d'Études Spatiales, 18 avenue Edouard Belin, 31401 Toulouse Cedex 9, France

Abstract
Mining sequential data is an old topic that has been revived in the last decade, due to the increasing availability of sequential
datasets. Most works in this field are centred on the definition and use of a distance (or, at least, a similarity measure) between
sequences of elements. A measure called Dynamic Time Warping (DTW) seems to be currently the most relevant for a large panel
of applications. This article is about the use of DTW in data mining algorithms, and focuses on the computation of an average of a
set of sequences. Averaging is an essential tool for the analysis of data. For example, the K-means clustering algorithm repeatedly
computes such an average, and needs to provide a description of the clusters it forms. Averaging is here a crucial step, which must
be sound in order to make algorithms work accurately. When dealing with sequences, especially when sequences are compared
with DTW, averaging is not a trivial task.
Starting with existing techniques developed around DTW, the article suggests an analysis framework to classify averaging
techniques. It then proceeds to study the two major questions raised by the framework. First, we develop a global technique for
averaging a set of sequences. This technique is original in that it avoids using iterative pairwise averaging. It is thus insensitive to
ordering effects. Second, we describe a new strategy to reduce the length of the resulting average sequence. This has a favourable
impact on performance, but also on the relevance of the results. Both aspects are evaluated on standard datasets, and the evaluation
shows that they compare favourably with existing methods. The article ends by describing the use of averaging in clustering.
The last section also introduces a new application domain, namely the analysis of satellite image time series, where data mining
techniques provide an original approach.
Key words: sequence analysis, time series clustering, dynamic time warping, distance-based clustering, time series averaging,
DTW Barycenter Averaging, global averaging, satellite image time series

1. Introduction

Time series data have started to appear in several application domains, like biology, finance, multimedia, image analysis, etc. Data mining researchers and practitioners are thus adapting their techniques and algorithms to this kind of data. In exploratory data analysis, a common way to deal with such data consists in applying clustering algorithms. Clustering, i.e., the unsupervised classification of objects into groups, is often an important first step in a data mining process. Several extensive reviews of clustering techniques in general have been published [1–4], as well as a survey on time series clustering [5]. Given a suitable similarity measure between sequential data, most classical learning algorithms are readily applicable.

A similarity measure on time series data (also referred to as sequences hereafter) is more difficult to define than on classical data, because the order of elements in the sequences has to be considered. Accordingly, experiments have shown that the traditional Euclidean distance metric is not an accurate similarity measure for time series. A similarity measure called Dynamic Time Warping (DTW) has been proposed [6, 7]. Its relevance was demonstrated in various applications [8–13].

Given this similarity measure, many distance-based learning algorithms can be used (e.g., hierarchical or centroid-based ones). However, many of them, like the well-known K-means algorithm, or even Ascendant Hierarchical Clustering, also require an averaging method, and highly depend on the quality of this averaging. Time series averaging is not a trivial task, mostly because it has to be consistent with the ability of DTW to realign sequences over time. Several attempts at defining an averaging method for DTW have been made, but they provide an inaccurate notion of average [14], and perturb the convergence of such clustering algorithms [15]. That is mostly why several time series clustering attempts prefer to use the K-medoids algorithm instead (see [16–18] for examples combining DTW and the K-medoids algorithm). Throughout this article, and without loss of generality, we sometimes use the example of the K-means algorithm, because of its intensive use of the averaging operation, and because of its applicability to large datasets.

∗ Corresponding author – LSIIT, Pôle API, Bd Sébastien Brant, BP 10413, 67412 Illkirch Cedex, France – Tel.: +33 3 68 85 45 78 – Fax: +33 3 68 85 44 55
Email addresses: [email protected] (François Petitjean), [email protected] (Alain Ketterlin), [email protected] (Pierre Gançarski)

Article accepted for publication in Pattern Recognition, accessible at doi:10.1016/j.patcog.2010.09.013


In this article, we propose a novel method for averaging a set of sequences under DTW. The proposed method avoids the drawbacks of other techniques, and is designed to be used, among others, in similarity-based methods (e.g., K-means) to mine sequential datasets. Section 2 introduces the DTW similarity measure on sequences. Then Section 3 considers the problem of finding a consensus sequence from a set of sequences, providing theoretical background and reviewing existing methods. Section 4 introduces the proposed averaging method, called Dtw Barycenter Averaging (DBA). It also describes experiments on standard datasets from the UCR time series classification and clustering archive [19], in order to compare our method to existing averaging methods. Then, Section 5 looks deeper into the sufficient number of points to accurately represent a set of sequences. Section 6 describes experiments conducted to demonstrate the applicability of DBA to clustering, by detailing experiments carried out with the K-means algorithm on standard datasets as well as on an application domain, namely satellite image time series. Finally, Section 7 concludes the article and presents some further work.

2. Dynamic Time Warping (DTW)

In this section, we recall the definition of the Euclidean distance and of the DTW similarity measure. Throughout this section, let A = ⟨a1, . . . , aT⟩ and B = ⟨b1, . . . , bT⟩ be two sequences, and let δ be a distance between elements (or coordinates) of sequences.

Euclidean distance. This distance is commonly accepted as the simplest distance between sequences. The distance between A and B is defined by:

D(A, B) = sqrt( δ(a1, b1)² + · · · + δ(aT, bT)² )

Unfortunately, this distance does not correspond to the common understanding of what a sequence really is, and cannot capture flexible similarities. For example, X = ⟨a, b, a, a⟩ and Y = ⟨a, a, b, a⟩ are different according to this distance, even though they represent similar trajectories.

Dynamic Time Warping. DTW is based on the Levenshtein distance (also called edit distance) and was introduced in [6, 7], with applications in speech recognition. It finds the optimal alignment (or coupling) between two sequences of numerical values, and captures flexible similarities by aligning the coordinates inside both sequences. The cost of the optimal alignment can be recursively computed by:

D(Ai, Bj) = δ(ai, bj) + min{ D(Ai−1, Bj−1), D(Ai, Bj−1), D(Ai−1, Bj) }   (1)

where Ai is the subsequence ⟨a1, . . . , ai⟩. The overall similarity is given by D(A|A|, B|B|) = D(AT, BT).

Unfortunately, a direct implementation of this recursive definition leads to an algorithm with exponential cost in time. Fortunately, the fact that the overall problem exhibits overlapping subproblems allows for the memoization of partial results in a matrix, which makes the minimal-weight coupling computation a process that costs |A| · |B| basic operations. This measure thus has a time and a space complexity of O(|A| · |B|).

DTW is able to find the optimal global alignment between sequences and is probably the most commonly used measure to quantify the dissimilarity between sequences [9–13]. It also provides an overall real number that quantifies similarity. An example DTW alignment of two sequences can be found in Figure 1: it shows the alignment of points taken from two sinusoids, one being slightly shifted in time. The numerical result computed by DTW is the sum of the heights¹ of the associations. Alignments at both extremities of Figure 1 show that DTW is able to correctly re-align one sequence with the other, a process which, in this case, highlights similarities that the Euclidean distance is unable to capture. Algorithm 1 details the computation.

Algorithm 1 DTW
Require: A = ⟨a1, . . . , aS⟩
Require: B = ⟨b1, . . . , bT⟩
  Let δ be a distance between coordinates of sequences
  Let m[S, T] be the matrix of couples (cost, path)
  m[1, 1] ← ( δ(a1, b1), (0, 0) )
  for i ← 2 to S do
    m[i, 1] ← ( first(m[i − 1, 1]) + δ(ai, b1), (i − 1, 1) )
  end for
  for j ← 2 to T do
    m[1, j] ← ( first(m[1, j − 1]) + δ(a1, bj), (1, j − 1) )
  end for
  for i ← 2 to S do
    for j ← 2 to T do
      minimum ← minVal( m[i − 1, j], m[i, j − 1], m[i − 1, j − 1] )
      m[i, j] ← ( first(minimum) + δ(ai, bj), second(minimum) )
    end for
  end for
  return m[S, T]

Algorithm 2 first
Require: p = (a, b) : couple
  return a

Algorithm 3 second
Require: p = (a, b) : couple
  return b

¹ In fact, the distance δ(ai, bj) computed in Equation 1 is the distance between two coordinates, without considering the time distance between them.
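As a concrete illustration, the quadratic-time, memoized formulation of Algorithm 1 can be transcribed into a few lines of Python. This is a sketch for illustration only, not the authors' reference implementation: it returns only the cost and omits the path component of the matrix, and it assumes an absolute-difference ground distance δ.

```python
# Illustrative cost-only DTW (sketch of Algorithm 1, 0-indexed).
# cost[i][j] memoizes the optimal alignment cost of A[:i+1] with B[:j+1].
def dtw(A, B, delta=lambda x, y: abs(x - y)):
    S, T = len(A), len(B)
    INF = float("inf")
    cost = [[INF] * T for _ in range(S)]
    cost[0][0] = delta(A[0], B[0])
    for i in range(1, S):                       # first column
        cost[i][0] = cost[i - 1][0] + delta(A[i], B[0])
    for j in range(1, T):                       # first row
        cost[0][j] = cost[0][j - 1] + delta(A[0], B[j])
    for i in range(1, S):
        for j in range(1, T):
            # Equation (1): local cost plus the best of the three predecessors.
            cost[i][j] = delta(A[i], B[j]) + min(
                cost[i - 1][j - 1], cost[i][j - 1], cost[i - 1][j])
    return cost[S - 1][T - 1]
```

On the example above, encoding X = ⟨a, b, a, a⟩ and Y = ⟨a, a, b, a⟩ numerically (e.g., a → 0, b → 1), this yields a DTW cost of 0, whereas the Euclidean distance between X and Y is strictly positive.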
Algorithm 4 minVal
Require: v1, v2, v3 : couple
  if first(v1) ≤ min( first(v2), first(v3) ) then
    return v1
  else if first(v2) ≤ first(v3) then
    return v2
  else
    return v3
  end if

Figure 1: Two 1D sequences aligned with Dynamic Time Warping. Coordinates of the top and bottom sequences have been respectively computed by cos(t) and cos(t + α). For visualization purposes, the top sequence is drawn vertically shifted.

3. Related work

In the context of classification, many algorithms require a method to represent, in one object, information from a set of objects. Algorithms like K-medoids use the medoid of a set of objects. Some others, like K-means, need to average a set of objects, by finding a mean of the set. If these "consensus" representations are easily definable for objects in the Euclidean space, this is much more difficult for sequences under DTW. Finding a consensus representation of a set of sequences is even described by [20], chapter 14, as the Holy Grail. In this section, we first introduce the problem of finding a consensus sequence from a set of sequences, inspired by theories developed in computational biology. Then, we make the link between these theories on strings and the methods currently used to average sequences of numbers under Dynamic Time Warping.

To simplify explanations, we use the term coordinate to designate an element, or point, or component of a sequence. Without loss of generality, we consider that sequences contain T coordinates that are one-dimensional data. We note A(β) a sequence of length β. In the following, we consider a set S = {S1, · · · , SN} of N sequences from which we want to compute a consensus sequence C.

3.1. The consensus sequence problem

As we focus on DTW, we will only detail the consensus sequence problem from the edit distance side. (The problem for DTW is almost the same, and we will detail the differences in the next subsection.) The term consensus is subjective, and depends on the needs. In the context of sequences, this term is used with three meanings: the longest common subsequence of a set, the medoid sequence of the set, or the average sequence of the set.

The longest common subsequence generally makes it possible to visualize a summary of a set of sequences. It is however generally not used in algorithms, because the resulting common subsequence does not cover the whole data.

The two other concepts refer to a more formal definition, corresponding to the sequence in the center of the set of sequences. We have to clarify what the notion of center means. The commonly accepted definition is the object minimizing the sum of squared distances to the objects of the set. When the center must be found in the dataset, it is called the "medoid sequence". Otherwise, when the search space of the center is not restricted, the most widely used term is "average sequence".

As our purpose is the definition of a center minimizing the sum of squared distances to the sequences of a set, we focus on the definition of an average sequence when the corresponding distance is DTW.

Definition. Let E be the space of the coordinates of sequences. By a minor abuse of notation, we use E^T to designate the space of all sequences of length T. Given a set of sequences S = {S1, · · · , SN}, the average sequence C(T) must fulfill:

∀X ∈ E^T,  Σ_{n=1}^{N} DTW²(C(T), Sn) ≤ Σ_{n=1}^{N} DTW²(X, Sn)   (2)

Since no information on the length of the average sequence is available, the search cannot be limited to sequences of a given length, so all possible values for T have to be considered. Note that the sequences of S have a fixed length T. C hence has to fulfill:

∀t ∈ [1, +∞[, ∀X ∈ E^t,  Σ_{n=1}^{N} DTW²(C, Sn) ≤ Σ_{n=1}^{N} DTW²(X, Sn)   (3)

This definition in fact relies on the theory of Steiner trees; C is called a Steiner sequence [20]. Note that the sum in Equations (2) and (3) is often called the Within Group Sum of Squares (WGSS), or discrepancy distance in [21]. We will also use the simpler term inertia, used in most works on clustering.

3.2. Exact solutions to the Steiner sequence problem

As shown in [22], when the considered objects are simple points in an α-dimensional space, the minimization problem corresponding to Equation (2) can be solved by using the property of the arithmetic mean. As the notion of arithmetic mean is not easily extendable to semi-pseudometric spaces (i.e., spaces induced by semi-pseudometrics like DTW), we need to detail this
Steiner sequence problem, i.e., the problem of finding an average sequence. To solve this problem, there are two close families of methods.

The first one consists in computing the global multiple alignment [23] of the N sequences of S. This multiple alignment is computable by extending DTW to align N sequences. For instance, instead of computing DTW by comparing three values in a square, one has to compare seven values in a cube for three sequences. This idea can be generalized by computing DTW in an N-dimensional hypercube. Given this global alignment, C can be found by averaging the multiple alignment column by column. However, this method presents two major difficulties that prevent its use. First, the multiple alignment process takes Θ(T^N) operations [24], and is not tractable for more than a few sequences. Second, the global length of the multiple alignment can be on the order of T^N, and requires unrealistic amounts of memory.

The second family of methods consists in searching through the solution space, keeping the solutions minimizing the sum (Equation 2). In the discrete case (i.e., sequences of characters or of symbolic values), as the alphabet is generally finite, this scan is easy. However, as the length of the global alignment is potentially T^N − 2T, scanning the space takes Θ(T^N) operations. In the continuous case (i.e., sequences of numbers or of number vectors), scanning all solutions is impossible. Nevertheless, we will explain in the next paragraph how this search can be guided towards potential solutions, even though this strategy also exhibits exponential complexity.

In fact, we need a method to generate all potential solutions. Each solution corresponds to a coupling between C and each sequence of S. As all coordinates of each sequence of S must be associated to at least one coordinate of C, we have to generate all possible groupings between coordinates of each sequence of S and coordinates of C. Each coordinate of C must be associated to a non-empty, non-overlapping set of coordinates of the sequences of S. To generate all possible groupings, we can consider that each sequence is split into as many subsequences as there are coordinates in C. Thus, the first coordinate of C will be associated to the coordinates appearing in the first subsequences of the sequences of S, the second coordinate to the coordinates appearing in the second subsequences, and so on. Then, for C(p) (C of length p), the N sequences of S have to be cut into p parts. There are (T choose p) ways to cut a sequence of length T into p subsequences. Thus, for N sequences, there are (T choose p)^N possibilities. The time complexity of this scan is therefore in Θ((T choose p)^N).

One can note in Equation (3) that p > T does not make sense. It would mean that sequences have to be split into more subsequences than they contain coordinates. Thus, we can limit the search conditions of C to p ∈ [1, T]. But even with this tighter bound, the overall computation remains intractable.

3.3. Approximating the exact solution

As we explained in the previous subsection, finding the average sequence is deeply linked to the multiple alignment problem. Unfortunately, 30 years of well-motivated research did not provide any exact scalable algorithm, neither for the multiple alignment problem, nor for the consensus sequence problem. Given a sequence standing for the solution, we cannot even check whether this potential solution is optimal, because we rely on a subset of the search space. A number of heuristics have been developed to solve this problem (see [25–29] for examples). We present in this subsection the most common family of methods used to approximate the average sequence: iterative pairwise averaging. We will also link this family to existing averaging methods for DTW.

Iterative pairwise averaging consists in iteratively merging two average sequences of two subsets of sequences into a single average sequence of the union of those subsets. The simplest strategy computes an average of two sequences and iteratively incorporates one sequence into the average sequence. Differences between existing methods are the order in which the merges are done, and the way they compute an average of two sequences.

3.3.1. Ordering schemes

Tournament scheme. The simplest and most obvious averaging ordering consists in pairwise averaging the sequences following a tournament scheme. That way, N/2 average sequences are created at the first step. Then those N/2 sequences are, in turn, pairwise averaged into N/4 sequences, and so on, until one sequence is obtained. In this approach, the averaging method (between two sequences) is applied N times.

Ascendant hierarchical scheme. A second approach consists in first averaging the two sequences whose distance (DTW) is minimal over all pairs of sequences. This works like Ascendant Hierarchical Clustering, computing a distance matrix before each average computation. In that way, the averaging method is also called N − 1 times. In addition, one has to take into account the time required to compute the distance matrix N times.

3.3.2. Computing the average sequence

Regarding the way to compute an average of two sequences under DTW, most methods use the associations (coupling) computed by DTW.

One coordinate by association. Starting from a coupling between two sequences, the average sequence is built using the center of each association. Each coordinate of the average sequence will thus be the center of an association created by DTW. The main problem of this technique is that the resulting mean can grow substantially in length, because up to |A| + |B| − 2 associations can be created between two sequences A and B.

One coordinate by connected component. Considering that the coupling (created by DTW) between two sequences forms a graph, the idea is to associate each connected component of this graph to a coordinate of the resulting mean, usually taken as the barycenter of this component. Contrary to the previous method, the length of the resulting mean can decrease. The resulting length will be between 2 and min(|A|, |B|).
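To make the tournament scheme and the "one coordinate by association" rule concrete, here is an illustrative Python sketch. It is our simplified reconstruction under assumed details (an absolute-difference ground distance, midpoint averaging of each association, first-minimum tie-breaking in the backtracking), not the exact method of any published algorithm.

```python
# Sketch: tournament-scheme pairwise averaging with one coordinate
# per DTW association (simplified, scalar sequences).
def dtw_path(A, B):
    # DTW with backtracking: returns the coupling as a list of (i, j) pairs.
    S, T = len(A), len(B)
    INF = float("inf")
    cost = [[INF] * (T + 1) for _ in range(S + 1)]
    cost[0][0] = 0.0
    for i in range(1, S + 1):
        for j in range(1, T + 1):
            cost[i][j] = abs(A[i - 1] - B[j - 1]) + min(
                cost[i - 1][j - 1], cost[i][j - 1], cost[i - 1][j])
    path, (i, j) = [], (S, T)
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i, j - 1), (i - 1, j)],
                   key=lambda p: cost[p[0]][p[1]])
    path.append((0, 0))
    return path[::-1]

def average_pair(A, B):
    # "One coordinate by association": each association contributes the
    # midpoint of its two coordinates, so the result can grow in length.
    return [(A[i] + B[j]) / 2.0 for i, j in dtw_path(A, B)]

def tournament_average(sequences):
    # Pairwise-average level by level until a single sequence remains.
    seqs = list(sequences)
    while len(seqs) > 1:
        nxt = [average_pair(seqs[k], seqs[k + 1])
               for k in range(0, len(seqs) - 1, 2)]
        if len(seqs) % 2:            # odd sequence carried to the next round
            nxt.append(seqs[-1])
        seqs = nxt
    return seqs[0]
```

Note how average_pair returns one coordinate per cell of the warping path, i.e., up to close to |A| + |B| coordinates, which illustrates the length-growth problem mentioned above; the result of tournament_average also depends on the order of the input set, which is precisely the sensitivity discussed in the next subsections.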
3.4. Existing algorithms

The different ordering schemes and average computations just described are combined in the DTW literature to make up algorithms. The two main averaging methods for DTW are presented below.

NonLinear Alignment and Averaging Filters. NLAAF was introduced in [30] and rediscovered in [15]. This method uses the tournament scheme and the one coordinate by association averaging method. Its main drawback lies in the growth of its resulting mean. As stated earlier, each use of the averaging method can almost double the length of the average sequence. The entire NLAAF process could produce, over all sequences, an average sequence of length N × T. As classical datasets comprise thousands of sequences made up of on the order of a hundred coordinates, simply storing the resulting mean could be impossible. This length problem is moreover worsened by the complexity of DTW, which grows bi-linearly with the lengths of the sequences. That is why NLAAF is generally used in conjunction with a process reducing the length of the mean, leading to a loss of information and thus to an unsatisfactory approximation.

Prioritized Shape Averaging. PSA was introduced in [21] to resolve shortcomings of NLAAF. This method uses the ascendant hierarchical scheme and the one coordinate by connected component averaging method. Although this hierarchical averaging method aims at preventing the error from propagating too much, the length of the average sequences remains a problem. If one alignment (with DTW) between two sequences leads to two connected components (i.e., the associations form two hand-held fans), the overall resulting mean will be composed of only two coordinates. Obviously, such a sequence cannot represent a full set of potentially long sequences. This is why the authors proposed to replicate each coordinate of the average sequence as many times as there were associations in the corresponding connected component. However, this repetition of coordinates causes the problem already observed with NLAAF, by potentially doubling the number of coordinates of each intermediate average sequence. To alleviate this problem, the authors suggest using a process in order to reduce the length of the resulting mean.

3.5. Motivation

We have seen that most of the works on averaging sets of sequences can be analyzed along two dimensions: first, the way they consider the individual sequences when averaging, and second, the way they compute the elements of the resulting sequences. These two characteristics have proved useful to classify the existing averaging techniques. They are also useful angles under which new solutions can be elaborated.

Regarding the averaging of individual sequences, the main shortcoming of all existing methods is their use of pairwise averaging. When computing the mean of N sequences by pairwise averaging, the order in which sequences are taken influences the quality of the result, because neither NLAAF nor PSA are associative functions. Pairwise averaging strategies are intrinsically sensitive to the order, with no guarantee that a different order would lead to the same result. Local averaging strategies like PSA or NLAAF may let an initial approximation error propagate throughout the averaging process. If the averaging process has to be repeated (e.g., during K-means iterations), the effects may dramatically alter the quality of the result. This is why a global approach is desirable, where sequences would be averaged all together, with no sensitivity to their order of consideration. The obvious analogy to a global method is the computation of the barycenter of a set of points in a Euclidean space. Section 4 follows this line of reasoning in order to introduce a global averaging strategy suitable for DTW, and provides empirical evidence of its superiority over existing techniques.

The second dimension along which averaging techniques can be classified is the way they select the elements of the mean. We have seen that a naive use of the DTW-computed associations may lead to some sort of "overfit", with an average covering almost every detail of every sequence, whereas simpler and smoother averages could well provide a better description of the set of sequences. Moreover, long and detailed averages have a strong impact on further processing. Here again, iterative algorithms like K-means are especially at risk: every iteration may lead to a longer average, and because the complexity of DTW is directly related to the length of the sequences involved, later iterations will take longer than earlier ones. In such cases, unconstrained averaging will not only lead to an inadequate description of clusters, it will also cause a severe performance degradation. This negative effect of sequence averaging is well known, and corrective actions have been proposed. Section 5 builds on our averaging strategy to suggest new ways of shortening the average.

4. A new averaging method for DTW

To solve the problems of existing pairwise averaging methods, we introduce a global averaging strategy called Dtw Barycenter Averaging (DBA). This section first defines the new averaging method and details its complexity. Then DBA is compared to NLAAF and PSA on standard datasets [19]. Finally, the robustness and the convergence of DBA are studied.

4.1. Definition of DBA

DBA stands for Dtw Barycenter Averaging. It is a heuristic strategy, designed as a global averaging method. DBA is an averaging method which consists in iteratively refining an initially (potentially arbitrary) average sequence, in order to minimize its squared distance (DTW) to the averaged sequences.

Let us provide an intuition on the mechanism of DBA. The aim is to minimize the sum of squared DTW distances from the average sequence to the set of sequences. This sum is formed by single distances between each coordinate of the average sequence and the coordinates of sequences associated to it. Thus, the contribution of one coordinate of the average sequence to the total sum of squared distances is actually a sum of Euclidean distances between this coordinate and the coordinates of sequences
associated to it during the computation of DTW. Note that a coordinate of one of the sequences may contribute to the new position of several coordinates of the average. Conversely, any coordinate of the average is updated with contributions from one or more coordinates of each sequence. In addition, minimizing this partial sum for each coordinate of the average sequence is achieved by taking the barycenter of this set of coordinates. The principle of DBA is to compute each coordinate of the average sequence as the barycenter of its associated coordinates of the set of sequences. Thus, each coordinate will minimize its part of the total WGSS in order to minimize the total WGSS. The updated average sequence is defined once all barycenters are computed.

Technically, for each refinement, i.e., for each iteration, DBA works in two steps:

1. Computing DTW between each individual sequence and the temporary average sequence to be refined, in order to find associations between coordinates of the average sequence and coordinates of the set of sequences.
2. Updating each coordinate of the average sequence as the barycenter of the coordinates associated to it during the first step.

Let S = {S1, · · · , SN} be the set of sequences to be averaged, let C = ⟨C1, . . . , CT⟩ be the average sequence at iteration i, and let C′ = ⟨C′1, . . . , C′T⟩ be the update of C at iteration i + 1, of which we want to find the coordinates. In addition, each coordinate of the average sequence is defined in an arbitrary vector space E (usually a Euclidean space):

∀t ∈ [1, T], Ct ∈ E   (4)

We consider the function assoc, which links each coordinate of the average sequence to one or more coordinates of the sequences of S. This function is computed during the DTW computation between C and each sequence of S. The t-th coordinate of the average sequence, C′t, is then defined as:

C′t = barycenter( assoc(Ct) )   (5)

where

barycenter{X1, . . . , Xα} = (X1 + · · · + Xα) / α   (6)

(the addition of the Xi is the vector addition). Algorithm 5 details the complete DBA computation.

Then, by computing DTW again between the average sequence and all sequences of S, the associations created by DTW may change. As it is impossible to anticipate how these associations will change, we propose to make C converge iteratively. Figure 2 shows four iterations (i.e., four updates) of DBA on an example with two sequences.

As a summary, the proposed averaging method for Dynamic Time Warping is a global approach that can average a set of sequences all together. The update of the average sequence between two iterations is independent of the order in which the individual sequences are used to compute their contribution to the update in question. Figure 3 shows an example of an average sequence computed with DBA, on one dataset from [19]. This figure shows that DBA preserves the ability of DTW to identify time shifts.

Figure 2: DBA iteratively adjusting the average of two sequences.
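The two refinement steps just described can be turned into a compact Python sketch. This is illustrative only, not the authors' reference code: it assumes scalar sequences, an absolute-difference ground distance, a fixed number of iterations, and an inlined DTW with backtracking; the helper names (dtw_path, dba_update, inertia) are ours.

```python
# Illustrative DBA sketch (scalar sequences).
def dtw_path(A, B):
    # DTW by dynamic programming; returns (cost, coupling).
    S, T = len(A), len(B)
    INF = float("inf")
    cost = [[INF] * (T + 1) for _ in range(S + 1)]
    cost[0][0] = 0.0
    for i in range(1, S + 1):
        for j in range(1, T + 1):
            cost[i][j] = abs(A[i - 1] - B[j - 1]) + min(
                cost[i - 1][j - 1], cost[i][j - 1], cost[i - 1][j])
    path, (i, j) = [], (S, T)
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i, j - 1), (i - 1, j)],
                   key=lambda p: cost[p[0]][p[1]])
    path.append((0, 0))
    return cost[S][T], path[::-1]

def dba_update(C, sequences):
    # Step 1: collect, for each coordinate of C, the coordinates of the
    # sequences that DTW associates to it (the assoc function of Eq. 5).
    assoc = [[] for _ in C]
    for seq in sequences:
        _, path = dtw_path(C, seq)
        for i, j in path:
            assoc[i].append(seq[j])
    # Step 2: move each coordinate of C to the barycenter (Equation 6)
    # of its associated coordinates.
    return [sum(vals) / len(vals) for vals in assoc]

def inertia(C, sequences):
    # WGSS of Equations (2)-(3): sum of squared DTW distances to the set.
    return sum(dtw_path(C, s)[0] ** 2 for s in sequences)

def dba(sequences, n_iterations=10):
    C = list(sequences[0])     # initialisation: one element of the set
    for _ in range(n_iterations):
        C = dba_update(C, sequences)   # inertia never increases here
    return C
```

Each call to dba_update performs one refinement: one DTW pass per sequence to build the associations, then one barycenter update per coordinate of the average; inertia can be used to monitor the decrease of the WGSS across iterations.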
Algorithm 5 DBA
Require: C = ⟨C_1, . . . , C_T′⟩ the initial average sequence
Require: S_1 = ⟨s_11, . . . , s_1T⟩ the 1st sequence to average
    ...
Require: S_n = ⟨s_n1, . . . , s_nT⟩ the nth sequence to average
Let T be the length of sequences
Let assocTab be a table of size T′ containing in each cell a set of coordinates associated to each coordinate of C
Let m[T, T] be a temporary DTW (cost, path) matrix
assocTab ← [∅, . . . , ∅]
for seq in S do
    m ← DTW(C, seq)
    i ← T′
    j ← T
    while i ≥ 1 and j ≥ 1 do
        assocTab[i] ← assocTab[i] ∪ seq_j
        (i, j) ← second(m[i, j])
    end while
end for
for i = 1 to T′ do
    C′_i = barycenter(assocTab[i])   {see Equation 6}
end for
return C′

4.2. Initialization and Convergence

The DBA algorithm starts with an initial average and refines it by minimizing its WGSS with respect to the sequences it averages. This section examines the effect of the initialisation and the rate of convergence.

Initialization. There are two major factors to consider when priming the iterative refinement:
• first, the length of the starting average sequence,
• and second, the values of its coordinates.
Regarding the length of the initial average, we have seen in Section 3.2 that its upper bound is T^N, but that such a length cannot reasonably be used. However, the inherent redundancy of the data lets one expect that a much shorter sequence can be adequate. We will observe in Section 4.5 that a length of around T (the length of the sequences to average) performs well.
Regarding the values of the initial coordinates, it is theoretically impossible to determine the optimal values, otherwise the whole averaging process would become useless. In methods that require an initialisation, e.g., K-means clustering, a large number of heuristics have been developed. We will describe in Section 4.5 experiments with the most frequent techniques: first, a randomized choice, and second, using an element of the set of sequences to average. We will show empirically that the latter gives an efficient initialisation.

Convergence. As explained previously, DBA is an iterative process. It is necessary, once the average is computed, to update it several times. This lets DTW refine its associations. It is important to note that, at each iteration, inertia can only decrease, since the new average sequence is closer (under DTW) to the elements it averages. If the update does not modify the alignment of the sequences, Huygens' theorem applies: the barycenters composing the average sequence get closer to the coordinates of S. Otherwise, if the alignment is modified, it means that DTW has found a better alignment with a smaller inertia (which thus also decreases in that case). We therefore have a guarantee of convergence. Section 4.6 details some experiments that quantify this convergence.

4.3. Complexity study

This section details the time complexity of DBA. Each iteration of the iterative process is divided into two parts:
1. Computing DTW between each individual sequence and the temporary (i.e., current) average sequence, to find associations between its coordinates and the coordinates of the sequences.
2. Updating the mean according to the associations just computed.

Finding associations. The aim of Step 1 is to determine the set of associations between each coordinate of C and coordinates of sequences of S. Therefore we have to compute DTW once per sequence to average, that is N times. The complexity of DTW is Θ(T²). The complexity of Step 1 is therefore Θ(N · T²).
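To make the two steps concrete, here is a minimal one-iteration sketch in Python (the paper's experiments are implemented in Java); `dtw_path` and `dba_update` are our illustrative names, and univariate sequences with a squared Euclidean point cost are assumed:

```python
def dtw_path(c, s):
    """Compute the DTW cost matrix between c and s and return the optimal
    warping path as a list of (i, j) index pairs."""
    n, m = len(c), len(s)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (c[i - 1] - s[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    # Backtrack from (n, m), as in the while loop of Algorithm 5.
    path, i, j = [], n, m
    while i >= 1 and j >= 1:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda p: cost[p[0]][p[1]])
    return path

def dba_update(c, sequences):
    """One DBA iteration (Step 1 + Step 2): associate the coordinates of
    every sequence to coordinates of c, then move each coordinate of c to
    the barycenter of its associations (Equations 5 and 6)."""
    assoc = [[] for _ in c]
    for s in sequences:
        for i, j in dtw_path(c, s):
            assoc[i].append(s[j])
    return [sum(a) / len(a) if a else c[k] for k, a in enumerate(assoc)]
```

Iterating `dba_update` until the inertia stabilises reproduces the convergence behaviour discussed above; a fixed point is reached, for example, when the set to average contains a single sequence.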
Updating the mean. After Step 1, each coordinate C_t of the average sequence has a set of coordinates {p_1, . . . , p_αt} associated to it. The process of updating C consists in updating each coordinate of the average sequence as the barycenter of this set of coordinates. Since the average sequence is associated to N sequences, its T coordinates are, overall, associated to N · T coordinates, i.e., all coordinates of sequences of S. The update step will thus have a time complexity of Θ(N · T).

Overall complexity. Because DBA is an iterative process, let us note I the number of iterations. The time complexity of the averaging process of N sequences, each one containing T coordinates, is thus:

    Θ(DBA) = Θ(I · (N · T² + N · T)) = Θ(I · N · T²)    (7)

Figure 3: An example of the DBA averaging method on one cluster from the "Trace" dataset from [19]. (a) A cluster of the Trace dataset. (b) The average sequence of the cluster.
Comparison with PSA and NLAAF. To compute an average sequence from two sequences, PSA and NLAAF need to compute DTW between these two sequences, which has a time complexity of Θ(T²). Then, to compute the temporary average sequence, PSA and NLAAF require Θ(T) operations. However, after having computed this average sequence, it has to be shortened to the length of the averaged sequences. The classical reduction process used is Uniform Scaling, which requires Θ(T + 2T + 3T + · · · + T²) = Θ(T³). The computation of the average sequence of two sequences thus requires Θ(T³ + T² + T) = Θ(T³). The overall NLAAF averaging of a set of N sequences then requires:

    Θ((N − 1) · (T³ + T² + T)) = Θ(N · T³)    (8)

Moreover, as PSA uses a hierarchical strategy to order sequences, it has at least to compute a dissimilarity matrix, which requires Θ(N² · T²) operations. The overall PSA averaging of a set of N sequences then requires:

    Θ((N − 1) · (T³ + T² + T) + N² · T²) = Θ(N · T³ + N² · T²)    (9)

As I ≪ T, the time complexity of DBA is thus smaller than those of PSA and NLAAF.

4.4. Experiments on standard datasets

Evaluating an average sequence is not a trivial task. No ground truth of the expected sequence is available, and we saw in Section 3 that many meanings are covered by the "average" (or consensus) notion. Most experimental and theoretical works use the WGSS to quantify the relative quality of an averaging technique. Thus, to assess the performance of DBA by comparison with existing averaging methods, we compare DBA to NLAAF and PSA in terms of WGSS over datasets from the UCR classification/clustering archive [19] (see Figure 4).

Figure 4: Sample instances from the test datasets. One time series from each class is displayed for each dataset (50words, Adiac, Beef, CBF, Coffee, ECG200, FaceAll, FaceFour, Fish, GunPoint, Lighting2, Lighting7, OliveOil, OSULeaf, SwedishLeaf, Synthetic control, Trace, Two patterns, Wafer, Yoga).
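The asymptotic comparison of Equations 7, 8 and 9 can be sanity-checked with a back-of-the-envelope operation count; the function names are ours and only the dominant terms are kept.

```python
def dba_ops(I, N, T):
    """Dominant term of Equation 7: I iterations of N DTW computations."""
    return I * N * T ** 2

def nlaaf_ops(N, T):
    """Dominant term of Equation 8: N - 1 pairwise averagings, each Θ(T³)."""
    return N * T ** 3

def psa_ops(N, T):
    """Dominant terms of Equation 9: pairwise averagings plus the
    dissimilarity matrix required by the hierarchical strategy."""
    return N * T ** 3 + N ** 2 * T ** 2
```

With, say, I = 10 iterations over N = 100 sequences of length T = 300, DBA needs orders of magnitude fewer operations than NLAAF or PSA, matching the I ≪ T argument above.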
Let us briefly recall what NLAAF and PSA are. NLAAF works by placing each coordinate of the average sequence of two sequences at the center of each association created by DTW. PSA associates each connected component of the graph (formed by the coupling between two sequences) to a coordinate of the average sequence. Moreover, to average N sequences, it uses a hierarchical method that averages the closest sequences first.

Experimental settings. To make these experiments reproducible, we provide here the details of our experimental settings:

• all programs are implemented in Java and run on a Core 2 Duo processor running at 2.4 GHz with 3 GB of RAM;

• the distance used between two coordinates of sequences is the squared Euclidean distance. As the square function is strictly increasing on positive numbers, and because we only use comparisons between distances, it is unnecessary to compute square roots. The same optimization has been used in [31], and seems rather common;

• sequences have been normalized with Z-score: for each sequence, the mean x̄ and standard deviation σ of the coordinate values are computed, and each coordinate y_i is replaced by:

    y′_i = (y_i − x̄) / σ    (10)

• we compare DBA to NLAAF on all datasets from [19], and to PSA using the results given by [21];

• as we want to test the capacity of DBA to minimize the WGSS, and because we do not focus on supervised methods, we put all sequences from both train and test datasets together;

• for each set of sequences under consideration, we report the inertia under DBA and the other averaging techniques. To provide a quantitative evaluation, we indicate the ratio between the inertia with respect to the average computed by DBA and those computed by NLAAF and/or PSA (see the tables below). To provide an overall evaluation, the text also sometimes mentions geometric averages of these ratios.
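The Z-score normalisation of Equation 10 can be sketched as follows; the function name is ours, and the population standard deviation is assumed:

```python
from math import sqrt

def z_normalise(seq):
    """Replace every coordinate y_i of a sequence by (y_i - mean) / std
    (Equation 10), so the sequence has zero mean and unit variance."""
    mean = sum(seq) / len(seq)
    std = sqrt(sum((y - mean) ** 2 for y in seq) / len(seq))
    return [(y - mean) / std for y in seq]
```

After normalisation, every sequence has mean 0 and variance 1, which keeps the squared Euclidean point distances comparable across sequences.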
Intraclass inertia comparison. In this first set of experiments, we compute an average for each class in each dataset. Table 1 shows the global WGSS obtained for each dataset. We notice that, for all datasets, DBA reduces/improves the intraclass inertia. The geometric average of the ratios shown in Table 1 is 65 %.
Table 2 shows a comparison between the results of DBA and PSA. Here again, for all results published in [21], DBA outperforms PSA, with a geometric average of inertia ratios of 65 %. Actually, such decreases of inertia show that the older averaging methods could not seriously be considered for machine learning use.

Table 1: Comparison of intraclass inertia under DTW between NLAAF and DBA.

Dataset             NLAAF       DBA      DBA/NLAAF
50words             11.98       6.21       52 %
Adiac                0.21       0.17       81 %
Beef                29.90       9.50       32 %
CBF                 14.25      13.34       94 %
Coffee               0.72       0.55       76 %
ECG200              11.34       6.95       61 %
FaceAll             17.77      14.73       83 %
FaceFour            34.46      24.87       72 %
Fish                 1.35       1.02       75 %
GunPoint             7.24       2.46       34 %
Lighting2          194.07      77.57       40 %
Lighting7           48.25      28.77       60 %
OliveOil          0.018261   0.018259     100 %
OSULeaf             53.03      22.69       43 %
SwedishLeaf          2.50       2.21       88 %
Synthetic control    9.71       9.28       96 %
Trace                1.65       0.92       56 %
Two patterns         9.19       8.66       94 %
Wafer               54.66      30.40       56 %
Yoga                40.07      37.27       93 %

Table 2: Comparison of intraclass inertia under DTW between PSA and DBA.

Dataset              PSA       DBA      DBA/PSA
Beef                25.65      9.50       37 %
Coffee               0.72      0.55       76 %
ECG200               9.16      6.95       76 %
FaceFour            33.68     24.87       74 %
Synthetic control   10.97      9.28       85 %
Trace                1.66      0.92       56 %

Global dataset inertia. In the previous paragraph, we computed an average for each class in each dataset. In this paragraph, the goal is to test the robustness of DBA with more data variety. Therefore, we average all sequences of each dataset. This means that only one average sequence is computed for a whole dataset. That way, we compute the global dataset inertia under DTW with NLAAF and DBA, to compare their capacity to summarize mixed data.
As can be seen in Table 3, DBA systematically reduces/improves the global dataset inertia, with a geometric average ratio of 56 %. This means that DBA not only performs better than NLAAF (Table 1), but is also more robust to diversity.

Table 3: Comparison of global dataset inertia under DTW between NLAAF and DBA.

Dataset             NLAAF       DBA      DBA/NLAAF
50words             51 642    26 688       52 %
Adiac                  647       470       73 %
Beef                 3 154       979       31 %
CBF                 21 306    18 698       88 %
Coffee               61.60     39.25       64 %
ECG200               2 190     1 466       67 %
FaceAll             72 356    63 800       88 %
FaceFour             6 569     3 838       58 %
Fish                   658       468       71 %
GunPoint             1 525       600       39 %
Lighting2           25 708     9 673       38 %
Lighting7           14 388     7 379       51 %
OliveOil              2.24      1.83       82 %
OSULeaf             30 293    12 936       43 %
SwedishLeaf          5 590     4 571       82 %
Synthetic control   17 939    13 613       76 %
Trace               22 613     4 521       20 %
Two patterns       122 471   100 084       82 %
Wafer              416 376   258 020       62 %
Yoga               136 547    39 712       29 %

4.5. Impact of initialisation

DBA is deterministic once the initial average sequence C is chosen. It is thus important to study the impact of the choice of the initial mean on the results of DBA. When used with K-means, this choice must be made at each iteration of the algorithm, for example by taking as the initialisation the average sequence C obtained at the previous iteration. However, DBA is not limited to this context.
We have seen in Section 4.2 that two aspects of initialisation have to be evaluated empirically: first, the length of the initial average sequence, and second, the values of its coordinates. We have designed three experiments on some of the datasets from [19]. Because these experiments require heavy computations, we have not repeated the computation on all datasets.

1. The first experiment starts with randomly generated sequences, of lengths varying from 1 to 2T, where T is the length of the sequences in the dataset. Once the length is chosen, the coordinates are generated randomly with a normal distribution of mean zero and variance one. The curves on Figure 5 show the variation of inertia with the length.
2. Because the previous experiment shows that the optimal inertia is attained with an initial sequence of length in the order of T, we have repeated the computation 100 times with different, randomly generated initial sequences of length T: the goal of this experiment is to measure how stable this heuristic is. Green triangles on Figure 5 show the inertias obtained with the different random initialisations of length T.
3. Because priming DBA with a sequence of length T seems to be an adequate choice, we have tried to replace the randomly generated sequence with one drawn (randomly) from the dataset. We have repeated this experiment 100 times. Red triangles on Figure 5 show the inertias obtained with the different initial sequences from the dataset.

Our experiments on the choice of the initial average sequence lead to two main conclusions. First, an initial average of length T (the length of the data sequences) is the most appropriate: it almost always leads to the minimal inertia. Second, randomly choosing an element of the dataset leads to the least inertia in almost all cases. Using some data to prime an iterative algorithm is part of the folklore; DBA is another case where it performs well. We have used this strategy in all our experiments with DBA.

Moreover, in order to compare the impact of the initialisation on DBA and on NLAAF, we performed 100 computations of the average on a few datasets (because of computing times), choosing a random sequence from the dataset as the initialisation of DBA. As NLAAF is sensitive to the initialisation too, the same method was followed, in order to compare results. Figure 6 presents the mean and standard deviation of the final inertia. The results of DBA are not only better than those of NLAAF (shown on the left part of Figure 6), but the best inertia obtained by NLAAF is even worse than the worst inertia obtained by DBA (see the table on the right of Figure 6).

Figure 5: Impact of different initialisation strategies on DBA, on the (a) 50words, (b) Adiac, (c) Beef, (d) CBF, (e) Coffee and (f) ECG200 datasets. Note that the inertia is displayed on a logarithmic scale.

Figure 6: Effect of initialisation on NLAAF and DBA (mean and standard deviation of the final inertia). The accompanying table compares the best score of NLAAF with the worst score of DBA:

Dataset      NLAAF (best score)   DBA (worst score)
Beef               26.7                13.2
Coffee              0.69                0.66
ECG200              9.98                8.9
Gun Point           7.2                 3.1
Lighting7          45                  32

4.6. Convergence of the iterative process

As explained previously, DBA is an iterative process. It is necessary, once the average is computed, to update it several times. This has the property of letting DTW refine its associations. Figure 7(a) presents the average convergence of the iterative process on the 50words dataset. This dataset is diverse enough (it contains 50 different classes) to test the robustness of the convergence of DBA.
Besides the overall shape of the convergence curve in Figure 7(a), it is important to note that, in some cases, the convergence can be uneven (see Figure 7(b) for an example). Even if this case is somewhat unusual, one has to keep in mind that DTW makes nonlinear distortions, which cannot be predicted. Consequently, the convergence of DBA, based on alignments, cannot always be smooth.

Figure 7: Convergence of the iterative process for the 50words dataset. (a) Average convergence of the iterative process over 50 clusters. (b) Example of an uneven convergence (cluster 4).
5. Optimization of the mean by length shortening

We mentioned in Section 3.4 that algorithms such as NLAAF need to reduce the length of a sequence. Actually, this problem is more general and concerns the scaling of a sequence under time warping. Many applications working with subsequences, or even with different resolutions², require a method to uniformly scale a sequence to a fixed length. This kind of method is generally called Uniform Scaling; further details about its inner workings can be found in [31].
Unfortunately, the use of Uniform Scaling is not always coherent in the context of DTW, which computes non-uniform distortions. To avoid the use of Uniform Scaling with DTW, as done in [15, 21, 31], we propose here a new approach specifically designed for DTW. It is called Adaptive Scaling, and aims at reducing the length of a sequence with respect to one or more other sequences. In this section, we first recall the definition of Uniform Scaling, then we detail the proposed approach, and finally its complexity is studied and discussed.

5.1. Uniform Scaling

Uniform Scaling is a process that reduces the length of a sequence with regard to another sequence. Let A and B be two sequences. Uniform Scaling finds the prefix A_sub of A such that, scaled up to B, DTW(A_sub, B) is minimal. The subsequence A_sub is defined by:

    A_sub = argmin_{i∈[1,T]} { DTW(A_{1,i}, B) }    (11)

Uniform Scaling has two main drawbacks: one is directly linked to the method itself, and one is linked to its use with DTW. First, since Uniform Scaling considers a prefix of the sequence (i.e., a subsequence), the representativeness of the resulting mean under such a reduction process can be questioned. Second, Uniform Scaling is a uniform reduction process, whereas DTW makes non-uniform alignments.

5.2. Adaptive Scaling

We propose to make the scaling adaptive. The idea of the proposed Adaptive Scaling process is to answer the following question: "How can one remove a point from a sequence, such that the distance to another sequence does not increase much?" To answer this question, Adaptive Scaling works by merging the two closest successive coordinates.
To explain how Adaptive Scaling works, let us start with a simple example. If two consecutive coordinates are identical, they can be merged: DTW is able to stretch the resulting coordinate and so recover the original sequence. This fusion process is illustrated in Figure 8. Note that in this example, DTW gives the same score in Figure 8(a) as in Figure 8(b).

Figure 8: Illustration of the length reduction process. (a) Alignment of two sequences: the sequence below contains two identical coordinates. (b) Alignment of two sequences: the sequence below contains only one coordinate.

This article focuses on finding an average sequence consistent with DTW. The performance of DBA has been demonstrated on an average sequence of length arbitrarily fixed to T. In this context, the question is to know whether this average sequence can be shortened without making the mean less representative (i.e., without increasing inertia). We show in the first example that the constraint on inertia is respected. Even if Uniform Scaling could be used to reduce the length of the mean, an adaptive scaling gives better results, because DTW is able to make deformations on the time axis. Adaptive Scaling is described in Algorithm 6.

Algorithm 6 Adaptive Scaling
Require: A = ⟨A_1, . . . , A_T⟩
while Need to reduce the length of A do
    (i, j) ← successive coordinates with minimal distance
    merge A_i with A_j
end while
return A

Let us now explain how the inertia can also decrease by using Adaptive Scaling. Figure 9 illustrates the example used below. Imagine that the next-to-last coordinate C_{T−1} of the average sequence is perfectly aligned with the last α coordinates of the N sequences of S. In this case, the last coordinate C_T of the average sequence must still be, at least, linked to all last coordinates of the N sequences of S. Therefore, as the next-to-last coordinate was (in this example) perfectly aligned, aligning the last coordinate will increase the total inertia. This is why Adaptive Scaling is not only able to shorten the average sequence, but also to reduce the inertia. Moreover, by checking the evolution of inertia after each merge, we can control this length reduction process, and so guarantee the improvement of inertia. Thus, given the resulting mean of DBA, coordinates of the average sequence can be successively merged as long as inertia decreases.

² The resolution in this case is the number of samples used to describe a phenomenon. For instance, a piece of music can be sampled at different bit rates.
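For illustration, here is a minimal sketch contrasting a uniform rescaling with the adaptive merge of Algorithm 6. Both functions are our simplifications: the nearest-index resampling is only one possible variant of Uniform Scaling, and replacing the merged pair by its mean is our choice (the paper does not specify how merged coordinates are combined).

```python
def uniform_scale(seq, new_len):
    """Uniformly resample seq to new_len (nearest-index variant):
    every output position maps back to a proportionally placed input."""
    T = len(seq)
    return [seq[i * T // new_len] for i in range(new_len)]

def adaptive_scaling(seq, target_len):
    """Adaptive Scaling sketch: repeatedly merge the two closest
    successive coordinates (Algorithm 6) until target_len is reached;
    the merged pair is replaced by its mean."""
    a = list(seq)
    while len(a) > target_len:
        # Index of the pair of successive coordinates with minimal distance.
        i = min(range(len(a) - 1), key=lambda k: abs(a[k] - a[k + 1]))
        a[i:i + 2] = [(a[i] + a[i + 1]) / 2.0]
    return a
```

In the paper the stopping criterion is not a fixed target length but the evolution of the inertia: merging continues only as long as the WGSS under DTW decreases.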
Figure 9: The average sequence is drawn at the bottom and one sequence of the set is drawn at the top.

Figure 10: Adaptive Scaling: benefits of the reduction process for NLAAF, compared with NLAAF with Uniform Scaling and with PSA, on the Beef, Coffee, ECG200, FaceFour, OliveOil, Synthetic control and Trace datasets.

5.3. Experiments

Table 4 gives the scores obtained by using Adaptive Scaling after the DBA process on various datasets. It shows that Adaptive Scaling alone always reduces the intraclass inertia, with a geometric average ratio of 84 %. Furthermore, the resulting average sequences are much shorter, by almost two thirds. This gives an interesting indication of the minimum length necessary for representing a time behaviour.
In order to demonstrate that Adaptive Scaling is not only designed for DBA, Figure 10 shows its performance as a reduction process in NLAAF. Adaptive Scaling is here used to reduce the length of a temporary pairwise average sequence (see Section 3.4). Figure 10 shows that Adaptive Scaling used in NLAAF leads to scores similar to the ones achieved by PSA.

Table 4: Comparison of intraclass inertias and lengths of the resulting means with or without the Adaptive Scaling (AS) process.

                      Intraclass inertia        Length of the mean
Dataset               DBA        DBA+AS         DBA     DBA+AS
50words               6.21        6.09          270       151
Adiac                 0.17        0.16          176       162
CBF                  13.34       12.11          128        57
FaceAll              14.73       14.04          131        95
Fish                  1.02        0.98          463       365
GunPoint              2.46        2.0           150        48
Lighting2            77.57       72.45          637       188
Lighting7            28.77       26.97          319       137
OliveOil           0.018259     0.01818         570       534
OSULeaf              22.69       21.96          427       210
SwedishLeaf           2.21        2.07          128        95
Two patterns          8.66        6.99          128        59
Wafer                30.40       17.56          152        24
Yoga                 37.27       11.57          426       195
Beef                  9.50        9.05          470       242
Coffee                0.55        0.525         286       201
ECG200                6.95        6.45           96        48
FaceFour             24.87       21.38          350       201
Synthetic control     9.28        8.15           60        33
Trace                 0.92        0.66          275       108

5.4. Complexity

Adaptive Scaling (AS) consists in merging the two closest successive coordinates in the sequence. If we know ahead of time the number K of coordinates that must be merged, for example when using AS for NLAAF to be scalable, it requires a time complexity of Θ(K · T), and is thus tractable.
One could however need to guarantee that Adaptive Scaling does not merge too many coordinates. That is why we suggest controlling the dataset inertia, by computing it and stopping the Adaptive Scaling process if inertia increases. Unfortunately, computing the dataset inertia under DTW takes Θ(N · T²); this complexity may prevent its use in some cases.
Nevertheless, we give here some interesting cases where the use of Adaptive Scaling could be beneficial, because having shorter sequences means spending less time computing DTW. In databases, the construction of indexes is an active research domain. The aim of indexes is to represent data in a better way while remaining fast to query. Adaptive Scaling could be used here, because it correctly represents the data and reduces the DTW complexity of further queries. The construction time of the index is here negligible compared to the millions of potential queries. Another example is (supervised or unsupervised) learning, where the learning time is often negligible compared to the time spent in classification. Adaptive Scaling could be very useful in such contexts, and can in this case be seen as an optimization process more than as an alternative to Uniform Scaling.

6. Application to clustering

Even if DBA can be used in association with DTW outside the context of K-means (and more generally outside the context of clustering), it is interesting to test the behaviour of DBA with K-means because this algorithm makes heavy use of averaging. Most clustering techniques with DTW use the K-medoids algorithm, which does not require any computation of an average [15–18]. However, K-medoids has some problems related to its use of the notion of median: K-medoids is not idempotent, which means that its results can oscillate. Thus, although DTW is one of the most used similarity measures on time series, it was not possible to use it reliably with well-known clustering algorithms.
To estimate the capacity of DBA to summarize clusters, we have tested its use with the K-means algorithm. We present here different tests on standard datasets and on a satellite image time series. Here again, the results of DBA are compared to those obtained with NLAAF.
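The K-means loop used here can be sketched compactly: DTW for the assignment step, and a DBA-style barycenter update for the centers. This is our own self-contained sketch (all names are ours), assuming univariate sequences, a squared Euclidean point cost, and, for simplicity, initialisation with the first k sequences (the paper places the initial centers randomly).

```python
def dtw(c, s, with_path=False):
    """DTW between sequences c and s; returns the total cost, or the
    optimal warping path as (i, j) index pairs if with_path is True."""
    n, m = len(c), len(s)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = (c[i - 1] - s[j - 1]) ** 2 + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    if not with_path:
        return cost[n][m]
    path, i, j = [], n, m
    while i >= 1 and j >= 1:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda p: cost[p[0]][p[1]])
    return path

def dba_update(center, seqs):
    """One DBA iteration: move every coordinate of the center to the
    barycenter of the coordinates DTW associates with it."""
    assoc = [[] for _ in center]
    for s in seqs:
        for i, j in dtw(center, s, with_path=True):
            assoc[i].append(s[j])
    return [sum(a) / len(a) if a else center[k]
            for k, a in enumerate(assoc)]

def kmeans_dtw(seqs, k, n_iter=5, dba_iter=3):
    """K-means under DTW: assign each sequence to its closest center,
    then refine each center with a few DBA updates."""
    centers = [list(s) for s in seqs[:k]]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for s in seqs:
            best = min(range(k), key=lambda c: dtw(centers[c], s))
            clusters[best].append(s)
        for c in range(k):
            for _ in range(dba_iter):
                if clusters[c]:
                    centers[c] = dba_update(centers[c], clusters[c])
    return centers
```

On two well-separated groups of sequences, the two centers settle near the respective group averages after a few iterations.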
Intracluster inertia
DBA
Dataset NLAAF DBA NLAAF
50words 5 920 3 503 59 % ···
Adiac 86 84 98 %
Beef 393 274 70 %
CBF 12 450 11 178 90 % 1 2 ··· n−1 n
Coffee 39.7 31.5 79 %
ECG200 1 429 950 66 % Figure 11: Extract of the Satellite Image Time Serie of Kalideos used. c CNES
2009 – Distribution Spot Image
FaceAll 34 780 29 148 84 %
FaceFour 3 155 2 822 89 %
Fish 221 324 147 %
GunPoint 408 180 44 % 6.2. On satellite image time series
Lighting2 16 333 6 519 40 %
We have applied the K-means algorithm with DTW and DBA
Lighting7 6 530 3 679 56 %
OliveOil 0.55 0.80 146 %
in the domain of satellite image time series analysis. In this
OSULeaf 13 591 7 213 53 % domain, each dataset (i.e., sequence of images) provides thou-
SwedishLeaf 2 300 1 996 87 % sands of relatively short sequences. This kind of data is the op-
Synthetic control 5 993 5 686 95 % posite of sequences commonly used to validate time sequences
Trace 387 203 52 % analysis. Thus, in addition to evaluate our approach on small
Two patterns 45 557 40 588 89 % datasets of long sequences, we test our method on large datasets
Wafer 157 507 108 336 69 % of short sequences.
Yoga 73 944 24 670 33 % Our data are sequences of numerical values, corresponding
to radiometric values of pixels from a Satellite Image Time Se-
Table 5: Comparison of intracluster inertia under DTW between NLAAF and ries (SITS). For every pixel, identified by its coordinates (x, y),
DBA
and for a sequence of images hI1 , . . . , In i, we define a sequence
as hI1 (x, y), . . . , In (x, y)i. That means that a sequence is identi-
fied by coordinates x and y of a pixel (not used in measuring
time series. Here again, result of DBA are compared to those similarity), and that the values of its coordinates are the vectors
obtained with NLAAF. of radiometric values of this pixel in each image. Each dataset
contains as many sequences as there are pixels in one image.
6.1. On UCR datasets We have tested DBA on one SITS of size 450×450 pixels,
Table 5 shows, for each dataset, the global WGSS resulting and of length 35 (corresponding to images sensed between 1990
from a K-means clustering. Since K-means requires initial cen- and 2006). The whole experiment thus deals with 202, 500
ters, we place randomly as many centers as there are classes in sequences of length 35 each, and each coordinate is made of
each dataset. As shown in the table, DBA outperforms NLAAF three radiometric values. This SITS is provided by the Kalideos
in all cases except for Fish and OliveOil datasets. Including database [32] (see Figure 11 for an extract).
these exceptions, the inertia is reduced with a geometric aver- We have applied the K-means algorithm on this dataset, with
age of 72 %. a number of clusters of ten or twenty, chosen arbitrarily. Then
Let us try to explain the seemingly negative results that appear in Table 5. First, on OliveOil, the inertia over the whole dataset is very low (i.e., all sequences are almost identical; see Figure 4), which makes it difficult to obtain meaningful results. The other peculiar dataset is Fish. We have seen in Section 4.4 that DBA outperforms NLAAF provided it has meaningful clusters to start with. Here, however, the K-means algorithm minimizes inertia by grouping elements around centroids; if the clusters to be identified are not organized around centroids, the algorithm may converge to a poor local minimum. We therefore attribute this exceptional behaviour on Fish to non-centroid clusters. We have shown in Section 4.4 that, when sequences are averaged per class, DBA outperforms all other results, even on these two datasets. This means that the two seemingly negative results are linked to the K-means algorithm itself, which converges to a worse local minimum even though the averaging method is better.

We computed the sum of intraclass inertia after five or ten iterations. Table 6 summarizes the results obtained with the different parameterizations.

We can note that the scores (to be minimized) are always ordered as NLAAF > DBA > DBA+AS, which confirms the expected behaviour of DBA and Adaptive Scaling. As expected, Adaptive Scaling significantly reduces the score of DBA. Even if the improvement seems less marked than on the synthetic data, it remains better than NLAAF.

Let us try to explain why the results of DBA are close to those of NLAAF here. One could argue that the improvement shrinks when clusters are close to each other, but the most likely explanation is that, on such short sequences, DTW has little material to work on, so the alignments it finds early have little chance to change over the successive iterations. In fact, the shorter the sequences are, the closer DTW is to the Euclidean distance. Moreover, NLAAF makes fewer errors when there is no re-alignment between sequences.
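The link between sequence length and warping freedom can be seen directly in the dynamic-programming recurrence. The following is a minimal textbook sketch of DTW on scalar sequences (the absolute-difference local cost is an assumption of this sketch, not necessarily the variant used in our experiments):

```python
def dtw(s, t):
    """Textbook dynamic-programming DTW between two scalar sequences."""
    n, m = len(s), len(t)
    INF = float("inf")
    # D[i][j] = cost of the best warping of s[:i] onto t[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])  # local distance (assumed)
            D[i][j] = cost + min(D[i - 1][j],       # insertion
                                 D[i][j - 1],       # deletion
                                 D[i - 1][j - 1])   # match
    return D[n][m]
```

On length-one sequences only the diagonal move is available, so dtw([a], [b]) reduces to |a - b|; more generally, the shorter the sequences, the fewer re-alignment opportunities the recurrence offers.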
Thus, when sequences are short, NLAAF makes fewer errors; for that reason, even though DBA is again better than NLAAF, the improvement is smaller.

Nb seeds   Nb iterations   Inertia (NLAAF)   Inertia (DBA)   Inertia (DBA and AS)
10         5               2.82 × 10^7       2.73 × 10^7     2.59 × 10^7
20         5               2.58 × 10^7       2.52 × 10^7     2.38 × 10^7
10         10              2.79 × 10^7       2.72 × 10^7     2.58 × 10^7
20         10              2.57 × 10^7       2.51 × 10^7     2.37 × 10^7

Table 6: Comparison of the intracluster inertia of K-means under different parameterizations. The distance used is DTW; the averaging methods are NLAAF, DBA, and DBA followed by Adaptive Scaling (AS).

Let us now explain why Adaptive Scaling is so useful here. In a SITS, several evolutions can be considered as random perturbations. The mean does not need to represent these perturbations, and shorter means are sometimes better, because they can represent a perturbed constant subsequence by a single coordinate. This is actually frequent in a SITS: a river, for example, can stay almost unchanged over the whole series, so that one or two coordinates are sufficient to represent the evolution of such an area.

From a thematic point of view, having an average for each cluster of radiometric evolution sequences highlights and describes typical ground evolutions. For instance, the experiment just described provides a cluster representing a typical urban-growth behaviour (appearance of new buildings and urban densification), illustrated in Figure 12. Combining DTW with DBA has not only led to the extraction of urbanizing zones, but has also provided a symbolic description of this particular behaviour. Using the Euclidean distance instead of DTW, or any other averaging technique instead of DBA, led to inferior results. The Euclidean distance was expected to fail, because urbanization has gone faster in some zones than in others, and because the data sampling is non-uniform. The other averaging methods also failed to produce meaningful results, probably because of the intrinsic difficulty of the data (various sensors, irregular sampling, etc.), which leads to objects that are difficult to separate. In such cases, less precise averaging tends to blur cluster boundaries.

Figure 12: The average of one of the clusters produced by K-means on the satellite image time series. This sequence corresponds to the thematic behaviour of urban growth (construction of buildings, roads, etc.). The three rectangles correspond to three phases: vegetation or bare soil, followed by new roofs and roads, followed by damaged and dusty roofs and roads.

7. Conclusion

The DTW similarity measure is probably the most used and useful tool for analysing sets of sequences. Unfortunately, its applicability to data analysis was limited because it had no suitable associated averaging technique, and several attempts have been made to fill this gap. This article proposes a way to classify these averaging methods; this "interpretive lens" makes it possible to understand where existing techniques could be improved. In light of this contextualization, this article defines a global averaging method, called Dtw Barycenter Averaging (DBA). We have shown that DBA achieves better results on all tested datasets, and that its behaviour is robust.

The length of the average sequence is not trivial to choose: it has to be as short as possible, but also sufficiently long to represent the data it covers. This article therefore also introduces a technique for shortening the average sequence, called Adaptive Scaling. This process is shown to shorten the average sequence in a way suited to DTW and to the data, and also to improve its representativity.

Having a sound averaging algorithm lets us apply clustering techniques to time series data. Our results again show a significant improvement in cluster inertia compared to other techniques, which certainly increases the usefulness of clustering.

Many application domains now provide time-based data and need data mining techniques able to handle large volumes of such data. DTW provides a good similarity measure for time series, and DBA complements it with an averaging method. Taken together, they constitute a useful foundation for developing new data mining systems for time series. For instance, satellite imagery has started to produce satellite image time series containing millions of sequences of multi-dimensional radiometric data, and we have briefly described preliminary experiments in this domain.

We believe this work opens up a number of research directions. First, because it is, as far as we know, the first global approach to the problem of averaging a set of sequences, it raises interesting questions on the topology of the space of sequences, and on how the mean relates to the individual sequences.

Regarding Dtw Barycenter Averaging proper, there are still a few aspects to be studied. One is the choice of the initial sequence when the sequences to be averaged do not have the same length. We have also provided only an empirical analysis of the rate of convergence of the averaging process; more theoretical or empirical work is needed to derive a robust strategy able to adjust the number of iterations to perform. Another direction is the study of the distribution of the coordinates contributing to each coordinate of the average sequence. Finally, the averaging of very small datasets with DBA could be a limitation that should be studied.
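To make the update step of DBA concrete, here is a minimal sketch of one iteration on scalar sequences (the absolute-difference local cost and the tie-breaking of the backtracking are assumptions of this sketch): each coordinate of the current average becomes the barycenter of all the coordinates that DTW associates with it.

```python
from statistics import mean

def dtw_path(s, t):
    """DTW with path recovery: returns the list of aligned index pairs."""
    n, m = len(s), len(t)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from the final cell to recover the optimal warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min(((i - 1, j - 1), (i - 1, j), (i, j - 1)),
                   key=lambda p: D[p[0]][p[1]])
    return list(reversed(path))

def dba_iteration(average, sequences):
    """One DBA update: each coordinate of `average` is replaced by the
    barycenter of the coordinates aligned to it over all sequences."""
    assoc = [[] for _ in average]
    for seq in sequences:
        for i, j in dtw_path(average, seq):
            assoc[i].append(seq[j])
    return [mean(a) if a else c for a, c in zip(assoc, average)]
```

Iterating dba_iteration from an initial sequence is the overall scheme; how many iterations to perform is precisely the open question discussed above.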
Adaptive Scaling has important implications on performance and relevance. Because of its adaptive nature, it ensures that average sequences have "the right level of detail" on the appropriate sequence segments. It currently considers only the coordinates of the average sequence; incorporating the averaged sequences as well might lead to a more precise scaling, but would require more computation time. Finding the right balance between cost and precision requires further investigation.

When combining DBA with Adaptive Scaling, e.g., when building reduced average sequences, we have often noticed that they provide short summaries that are at the same time easy to visualize and truly representative of the underlying phenomenon. For instance, the process of length reduction builds the average sequence around the major states of the data; it thus provides a sampling of the dataset built from the data themselves. Exploiting and extending this property is a promising research direction.
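The length-reduction idea behind these summaries can be illustrated with a deliberately simplified sketch: collapse runs of adjacent, nearly constant coordinates into a single value. The fixed tolerance used below is a hypothetical criterion introduced for illustration only; Adaptive Scaling itself decides which coordinates to merge adaptively, with respect to DTW and to the data.

```python
def shorten(seq, tol):
    """Merge runs of adjacent coordinates that stay within `tol` of the
    run's last value, replacing each run by its mean (simplified,
    threshold-based stand-in for Adaptive Scaling)."""
    runs, current = [], [seq[0]]
    for x in seq[1:]:
        if abs(x - current[-1]) <= tol:
            current.append(x)
        else:
            runs.append(current)
            current = [x]
    runs.append(current)
    return [sum(run) / len(run) for run in runs]
```

For instance, shorten([5.0, 5.5, 4.5, 20.0], 1.0) represents the perturbed constant segment by a single coordinate and returns [5.0, 20.0], in the spirit of the river example above.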