identical, but one is shifted slightly along the time axis, then Euclidean distance may consider them to be very different from each other. Dynamic time warping (DTW) was introduced [11] to overcome this limitation and give intuitive distance measurements between time series by ignoring both global and local shifts in the time dimension.

Problem Formulation. The dynamic time warping problem is stated as follows: Given two time series X and Y, of lengths |X| and |Y|,

X = x1, x2, ..., xi, ..., x|X|
Y = y1, y2, ..., yj, ..., y|Y|

construct a warp path

W = w1, w2, ..., wK

where K is the length of the warp path and the kth element of the warp path is

wk = (i, j)

where i is an index from time series X, and j is an index from time series Y. The warp path must start at the beginning of each time series at w1 = (1, 1) and finish at the end of both time series at wK = (|X|, |Y|). This ensures that every index of both time series is used in the warp path. There is also a constraint on the warp path that forces i and j to be monotonically increasing in the warp path, which is why the lines representing the warp path in Figure 1 do not overlap. Every index of each time series must be used. Stated more formally:

wk = (i, j), wk+1 = (i′, j′)    where i ≤ i′ ≤ i + 1 and j ≤ j′ ≤ j + 1

The optimal warp path is the minimum-distance warp path, where the distance of a warp path W is

Dist(W) = ∑_{k=1}^{K} Dist(wki, wkj)

Dist(W) is the distance (typically Euclidean distance) of warp path W, and Dist(wki, wkj) is the distance between the two data points (one from X and one from Y) in the kth element of the warp path.

Figure 2. A cost matrix with the minimum-distance warp path traced through it.

The cost matrix and warp path in Figure 2 are for the same two time series shown in Figure 1. The warp path is W = {(1,1), (2,1), (3,1), (4,2), (5,3), (6,4), (7,5), (8,6), (9,7), (9,8), (9,9), (9,10), (10,11), (10,12), (11,13), (12,14), (13,15), (14,15), (15,15), (16,16)}. If the warp path passes through a cell D(i, j) in the cost matrix, it means that the ith point in time series X is warped to the jth point in time series Y. Notice that where there are vertical sections of the warp path, a single point in time series X is warped to multiple points in time series Y, and the opposite is also true where the warp path is a horizontal line. Since a single point may map to multiple points in the other time series, the time series do not need to be of equal length. If X and Y were identical time series, the warp path through the matrix would be a straight diagonal line.

To find the minimum-distance warp path, every cell of the cost matrix must be filled. The rationale behind using a dynamic programming approach to this problem is that since the value at D(i, j) is the minimum warp distance of two time series of lengths i and j, if the minimum warp distances are already known for all
slightly smaller portions of that time series that are a single data point away from lengths i and j, then the value at D(i, j) is the minimum distance of all possible warp paths for time series that are one data point smaller than i and j, plus the distance between the two points xi and yj. Since the warp path must either be incremented by one or stay the same along the i and j axes, the distances of the optimal warp paths one data point smaller than lengths i and j are contained in the matrix at D(i−1, j), D(i, j−1), and D(i−1, j−1). So the value of a cell in the cost matrix is:

D(i, j) = Dist(i, j) + min[D(i−1, j), D(i, j−1), D(i−1, j−1)]

The warp path to D(i, j) must pass through one of those three grid cells, and since the minimum possible warp path distance is already known for them, all that is needed is to simply add the distance of the current two points to the smallest one. Since this equation determines the value of a cell in the cost matrix by using the values in other cells, the order that they are evaluated in is very important. The cost matrix is filled one column at a time from the bottom up, from left to right, as depicted in Figure 3.

Figure 3. The order that the cost matrix is filled.

After the entire matrix is filled, a warp path must be found from D(1, 1) to D(|X|, |Y|). The warp path is actually calculated in reverse order, starting at D(|X|, |Y|). A greedy search is performed that evaluates cells to the left, down, and diagonally to the bottom-left. Whichever of these three adjacent cells has the smallest value is added to the beginning of the warp path found so far, and the search continues from that cell. The search stops when D(1, 1) is reached.
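To make the recurrence and the greedy traceback concrete, the following is a minimal Python sketch (an illustration for this section, not the authors' Java implementation); it assumes one-dimensional time series and uses the absolute difference between points as Dist(i, j):

def dtw(x, y):
    """Standard DTW: fill the |X| by |Y| cost matrix with
    D(i, j) = Dist(i, j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)),
    then greedily trace the warp path back from D(|X|, |Y|) to D(1, 1).
    Returns (warp distance, warp path as 1-based (i, j) pairs)."""
    n, m = len(x), len(y)
    INF = float("inf")
    # D[0][0] is a virtual start cell; all other border cells stay infinite.
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):          # one column at a time ...
        for j in range(1, m + 1):      # ... filled from the bottom up
            dist = abs(x[i - 1] - y[j - 1])   # assumed 1-D point distance
            D[i][j] = dist + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Greedy traceback: repeatedly step to the smallest of the three
    # adjacent cells (left, down, diagonally bottom-left).
    path, (i, j) = [(n, m)], (n, m)
    while (i, j) != (1, 1):
        _, (i, j) = min((D[i - 1][j - 1], (i - 1, j - 1)),
                        (D[i - 1][j],     (i - 1, j)),
                        (D[i][j - 1],     (i, j - 1)))
        path.append((i, j))
    path.reverse()
    return D[n][m], path

For example, dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]) warps the repeated 2 onto a single point of the shorter series and returns a distance of 0.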
Complexity of DTW. The time and space complexity of the DTW algorithm is easy to determine. Each cell in the |X| by |Y| cost matrix is filled exactly once, and each cell is filled in constant time. This yields both a time and space complexity of |X| by |Y|, which is O(N²) if N = |X| = |Y|. The quadratic space complexity is particularly prohibitive because memory requirements are in the terabyte range for time series containing only 177,000 measurements. A linear space-complexity implementation of the DTW algorithm is possible by only keeping the current and previous columns in memory as the cost matrix is filled from left to right (see Figure 3). By only retaining two columns at any one time, the optimal warp distance between the two time series can be determined. However, it is not possible to reconstruct the warp path between these two time series, because the information required to calculate the warp path is thrown away with the discarded columns. This is not a problem if only the distance between two time series is required, but applications that find corresponding regions between time series [14] or merge time series together [1][3] require the warp path to be found.
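A sketch of the two-column, linear-space variant follows (again illustrative; only the distance survives, since the columns needed to rebuild the path are discarded):

def dtw_distance_linear_space(x, y):
    """DTW distance in O(|Y|) space: keep only the previous and the
    current column of the cost matrix while filling left to right.
    The warp path itself cannot be recovered afterwards."""
    n, m = len(x), len(y)
    INF = float("inf")
    prev = [0.0] + [INF] * m           # column 0 of the cost matrix
    for i in range(1, n + 1):
        curr = [INF] * (m + 1)
        for j in range(1, m + 1):
            dist = abs(x[i - 1] - y[j - 1])
            curr[j] = dist + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr                    # earlier columns are thrown away
    return prev[m]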
2.2 Speeding up Dynamic Time Warping
The quadratic time and space complexity of DTW creates the need for methods to speed up dynamic time warping. The methods used to make DTW faster fall into three categories:

1) Constraints – Limit the number of cells that are evaluated in the cost matrix.
2) Data Abstraction – Perform DTW on a reduced representation of the data.
3) Indexing – Use lower bounding functions to reduce the number of times DTW must be run during time series classification or clustering.

Constraints are widely used to speed up DTW. Two of the most commonly used constraints are the Sakoe-Chiba Band [13] and the Itakura Parallelogram [4], which are shown in Figure 4.

Figure 4. Two constraints: a Sakoe-Chiba Band (left) and an Itakura Parallelogram (right), both with a width of 5.

The shaded areas in Figure 4 are the cells of the cost matrix that are filled in by the DTW algorithm for each constraint. The width of each shaded area, or window, is specified by a parameter. When constraints are used, the DTW algorithm finds the optimal warp path through the constraint window. However, the globally optimal warp path will not be found if it is not entirely inside the window. Using constraints speeds up DTW by a constant factor, but the DTW algorithm is still O(N²) if the size of the input window is a function of the length of the input time series. Constraints work well in domains where the optimal warp path is expected to be close to a linear warp and passes through the cost matrix diagonally in a relatively straight line. Constraints work poorly if the time series are of events that start and stop at radically different times, because the warp path can stray very far from a linear warp and nearly the entire cost matrix must be evaluated to find the optimal warp path.
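The band constraint is easy to add to the earlier sketch: restrict the inner loop to a window of cells around the matrix diagonal. The handling of unequal lengths below is an illustrative assumption (the text does not specify it):

def dtw_sakoe_chiba(x, y, window):
    """DTW distance filling only cells within `window` of the diagonal
    (a Sakoe-Chiba band). A warp path outside the band can never be
    found, so the result is optimal only within the constraint."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        center = i * m // n                      # diagonal, scaled to |Y|
        for j in range(max(1, center - window),
                       min(m, center + window) + 1):
            dist = abs(x[i - 1] - y[j - 1])
            D[i][j] = dist + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

With window set to 0 and equal-length inputs, only the diagonal is evaluated, which matches the later observation (Section 4.1.2) that a radius-0 band generalizes to Euclidean distance.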
Data abstraction speeds up the DTW algorithm by running DTW on a reduced representation of the data [2][9]. The left side of Figure 5 shows a full-resolution cost matrix for which a minimum-distance warp path must be found. Rather than running the DTW algorithm on the full resolution (1/1) cost matrix, the time series are reduced in size to make the number of cells in the cost matrix more manageable. A warp path is found for the lower-resolution time series and is mapped back to the full resolution cost matrix.

Figure 5. Speeding up DTW by data abstraction.

The result is that DTW is sped up by a large constant factor, but the algorithm still runs in O(N²) time and space. Obviously, the warp distance that is calculated between the two time series becomes increasingly inaccurate as the level of abstraction increases. Projecting the lower resolution warp path to the full resolution usually creates a warp path that is far from optimal, because even if the optimal warp path actually passes through the low-resolution cell, projecting the warp path to the higher resolution ignores local variations in the warp path that can be very significant.

Indexing uses lower-bounding functions to reduce the number of times DTW needs to be run for certain tasks, such as clustering a set of time series or finding the time series that is most similar to a given time series [6][10]. Indexing significantly speeds up many DTW applications by reducing the number of times DTW is run, but it does not speed up the actual DTW algorithm.

Our FastDTW algorithm uses ideas from both the constraints and data abstraction categories. Using a combination of both overcomes many limitations of using either method individually, and yields an algorithm that is O(N) in both time and space.
3. APPROACH
The multilevel approach that FastDTW uses is inspired by the multilevel approach used for graph bisection [5]. Graph bisection is the task of splitting a graph into roughly equal portions, such that the sum of the edges that would be broken is as small as possible. Efficient and accurate algorithms exist for small graphs, but for large graphs the solutions found are typically far from optimal. A multilevel approach can be used to find the optimal solution for a small graph, and then repeatedly expand the graph and "fix" the pre-existing solution for the slightly larger problem. A multilevel approach works well if a large problem is difficult to solve all at once, but partial solutions can effectively be refined at different levels of resolution. The dynamic time warping problem can also be solved with a multilevel approach. Our FastDTW algorithm uses the multilevel approach and is able to find an accurate warp path in linear time and space.

3.1 FastDTW Algorithm
The FastDTW algorithm uses a multilevel approach with three key operations:
1) Coarsening – Shrink a time series into a smaller time series that represents the same curve as accurately as possible with fewer data points.
2) Projection – Find a minimum-distance warp path at a lower resolution, and use that warp path as an initial guess for a higher resolution's minimum-distance warp path.
3) Refinement – Refine the warp path projected from a lower resolution through local adjustments of the warp path.

Coarsening reduces the size (or resolution) of a time series by averaging adjacent pairs of points. The resulting time series is a factor of two smaller than the original time series. Coarsening is run several times to produce many different resolutions of the time series. Projection takes a warp path calculated at a lower resolution and determines what cells in the next higher resolution time series the warp path passes through. Since the resolution is increasing by a factor of two, a single point in the low-resolution warp path will map to at least four points at the higher resolution (possibly >4 if |X| ≠ |Y|). This projected path is then used as a heuristic during solution refinement to find a warp path at the higher resolution. Refinement finds the optimal warp path in the neighborhood of the projected path, where the size of the neighborhood is controlled by the radius parameter.
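A minimal sketch of the coarsening operation (the reduceByHalf() used later in Figure 7) follows; the averaging of adjacent pairs is as described above, while the handling of an odd-length series is an assumption, since the text does not specify it:

def reduce_by_half(ts):
    """Coarsen a time series by averaging adjacent pairs of points,
    yielding a series roughly half as long."""
    half = [(ts[k] + ts[k + 1]) / 2.0 for k in range(0, len(ts) - 1, 2)]
    if len(ts) % 2 == 1:
        half.append(ts[-1])    # assumption: keep a leftover odd point
    return half

Applying this repeatedly produces the 1/2, 1/4, 1/8, ... resolutions used by the multilevel approach.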
Standard dynamic time warping (DTW) is an O(N²) algorithm because every cell in the cost matrix must be filled to ensure an optimal answer is found, and the size of the matrix grows quadratically with the size of the time series. In the multilevel approach, the cost matrix is only filled in the neighborhood of the path projected from the previous resolution. Since the length of the warp path grows linearly with the size of the input time series, the multilevel approach is an O(N) algorithm.

The FastDTW algorithm first uses coarsening to create all of the resolutions that will be evaluated. Figure 6 shows the four resolutions (1/8, 1/4, 1/2, and 1/1) that are created when running the FastDTW algorithm on the time series that were previously used in Figures 1 and 2. The standard DTW algorithm is run to find the optimal warp path for the lowest resolution time series. This lowest resolution warp path is shown in the left of Figure 6. After the warp path is found for the lowest resolution, it is projected to the next higher resolution. In Figure 6, the projection of the warp path from a resolution of 1/8 is shown as the heavily shaded cells at 1/4 resolution.

Figure 6. The four different resolutions evaluated during a complete run of the FastDTW algorithm.

To refine the projected path, a constrained DTW algorithm is run with the very specific constraint that only cells in the projected warp path are evaluated. This will find the optimal warp path through the area of the warp path that was projected from the lower resolution. However, the entire optimal warp path may not be contained within the projected path. To increase the chances of finding the optimal solution, there is a radius parameter that controls the additional number of cells on each side of the projected path that will also be evaluated when refining the warp path. In Figure 6, the radius parameter is set to 1. The cells included during warp path refinement due to the radius are lightly shaded. Once the warp path is refined at the 1/4 resolution, that warp path is projected to the 1/2 resolution, expanded by a radius of 1, and refined again. Finally, the warp path is projected to the full resolution (1/1) matrix in Figure 6. The projection is expanded by the radius and refined one last time. This refined warp path is the output of the algorithm.
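The projection and radius expansion can be sketched as a function that turns a low-resolution warp path into the set of higher-resolution cells that the constrained DTW run is allowed to fill. This is one plausible reading of the ExpandedResWindow() operation in Figure 7 below; the boundary handling is an assumption:

def expanded_res_window(low_res_path, n, m, radius):
    """Map each low-resolution warp path cell (i, j) onto the block of
    cells it covers at the doubled resolution (at least a 2x2 block),
    then grow the projected region by `radius` cells on every side.
    Returns the set of 1-based (i, j) cells to evaluate."""
    window = set()
    for (i, j) in low_res_path:
        for a in (2 * i - 1, 2 * i):              # rows covered by i
            for b in (2 * j - 1, 2 * j):          # columns covered by j
                for da in range(-radius, radius + 1):
                    for db in range(-radius, radius + 1):
                        cell = (a + da, b + db)
                        if 1 <= cell[0] <= n and 1 <= cell[1] <= m:
                            window.add(cell)
    return window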
Notice that the warp path found by the FastDTW algorithm in Figure 6 is the optimal warp path that was found by the standard DTW in Figure 2. However, FastDTW only evaluated the shaded cells, while DTW evaluates all of the cells in the cost matrix. FastDTW evaluated 4+16+44+100=164 cells at all resolutions, while DTW evaluates all 256 (16²) cells. This increase in efficiency is not very significant for this small problem, especially considering the overhead of creating all four resolutions. However, the number of cells that FastDTW evaluates scales linearly with the length of the time series, while DTW always evaluates N² cells (if both time series are of length N). FastDTW scales linearly because the width of the path through the matrix that is being evaluated is constant at all resolutions.

The example in Figure 6 finds the optimal warp path, but the FastDTW algorithm is not guaranteed to always find a warp path that is optimal. However, the path found is usually very close to optimal. The larger the value of the radius parameter, the more accurate the warp path will be. If the radius parameter is set to be as large as one of the input time series, then FastDTW generalizes to the DTW algorithm (optimal but O(N²)). The accuracy of FastDTW using different settings for the radius parameter will be demonstrated in Section 4.

The pseudocode for the FastDTW algorithm is shown in Figure 7. The input to the algorithm is two time series and the radius parameter. The output of FastDTW is a warp path and the distance between the two time series along that warp path. Line 2 determines the minimum length of a time series at the lowest resolution. This size is dependent on the radius parameter, and determines the smallest possible resolution size for which decreasing the resolution further would be pointless, because full dynamic time warping would need to be calculated at more than one resolution.

FastDTW has a straightforward recursive implementation. The base case is when one of the input time series has a length less than minTSsize. For the base case, the algorithm simply returns the result of the standard DTW algorithm. The recursive case has three main steps. First, two new lower-resolution time series are created that have half as many points as the input time series (coarsening). This is performed by lines 17-18 in Figure 7. Next, a low resolution path is found for the coarsened time series (lines 20-21) and projected to a higher resolution (lines 23-25). This projected path is also expanded by radius cells to create a search window that will be passed to a constrained version of the DTW algorithm that only evaluates the cells in the search window (line 27). The constrained DTW algorithm refines the warp path that was projected from the lower resolution. The result of this refinement is then returned.

Function FastDTW()
Input:  X – a TimeSeries of length |X|
        Y – a TimeSeries of length |Y|
        radius – distance to search outside of the projected
                 warp path from the previous resolution
                 when refining the warp path
Output: 1) A minimum-distance warp path between X and Y
        2) The warped path distance between X and Y

 1| // The min size of the coarsest resolution.
 2| Integer minTSsize = radius+2
 3|
 4| IF (|X| ≤ minTSsize OR |Y| ≤ minTSsize)
 5| {
 6|    // Base Case: for a very small time series run
 7|    // the full DTW algorithm.
 8|    RETURN DTW(X, Y)
 9| }
10| ELSE
11| {
12|    // Recursive Case: Project the warp path from
13|    // a coarser resolution onto the current
14|    // resolution. Run DTW only along
15|    // the projected path (and also 'radius' cells
16|    // from the projected path).
17|    TimeSeries shrunkX = X.reduceByHalf()
18|    TimeSeries shrunkY = Y.reduceByHalf()
19|
20|    WarpPath lowResPath =
21|        FastDTW(shrunkX, shrunkY, radius)
22|
23|    SearchWindow window =
24|        ExpandedResWindow(lowResPath, X, Y,
25|                          radius)
26|
27|    RETURN DTW(X, Y, window)
28| }

Figure 7. The FastDTW algorithm.

The execution of the FastDTW algorithm repeatedly runs lines 17-18 as recursive calls to lower resolutions are made by line 21. This creates multiple resolutions until the base case is reached (line 8). The base case is executed only a single time, and afterwards lines 23-27 are executed for each recursive call (or resolution) on the stack.
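Putting the pieces together, the recursion of Figure 7 can be sketched in Python by reusing dtw(), reduce_by_half(), and expanded_res_window() from the earlier sketches; dtw_constrained() below is the window-restricted DTW that line 27 calls. All of this is an illustrative reconstruction, not the authors' Java implementation:

def fast_dtw(x, y, radius):
    """Sketch of Figure 7: coarsen, recurse, project, expand, refine."""
    min_ts_size = radius + 2
    if len(x) <= min_ts_size or len(y) <= min_ts_size:
        return dtw(x, y)                                  # base case
    shrunk_x = reduce_by_half(x)                          # lines 17-18
    shrunk_y = reduce_by_half(y)
    _, low_res_path = fast_dtw(shrunk_x, shrunk_y, radius)    # lines 20-21
    window = expanded_res_window(low_res_path,
                                 len(x), len(y), radius)      # lines 23-25
    return dtw_constrained(x, y, window)                      # line 27

def dtw_constrained(x, y, window):
    """DTW that fills only the cells listed in `window`, then performs
    the same greedy traceback as the unconstrained sketch."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if (i, j) in window:
                dist = abs(x[i - 1] - y[j - 1])
                D[i][j] = dist + min(D[i - 1][j], D[i][j - 1],
                                     D[i - 1][j - 1])
    path, (i, j) = [(n, m)], (n, m)
    while (i, j) != (1, 1):
        _, (i, j) = min((D[i - 1][j - 1], (i - 1, j - 1)),
                        (D[i - 1][j],     (i - 1, j)),
                        (D[i][j - 1],     (i, j - 1)))
        path.append((i, j))
    path.reverse()
    return D[n][m], path

For instance, fast_dtw(x, y, radius=1) reproduces the radius-1 run illustrated in Figure 6. A production implementation would use a banded array rather than a Python set for the search window; the sketch favors clarity over constant factors.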
Next, we will provide a theoretical analysis of FastDTW based on time and space complexity.

Time Complexity of FastDTW. To simplify the calculations, we will assume that the two full-resolution time series X and Y are both of length N. All analysis will be performed on worst-case behavior.

The number of cells in the cost matrix that are filled by FastDTW in a single resolution is equal to the number of cells in the projected warp path and any other cells within radius (denoted as r in the rest of this analysis to save space) cells away from the projected path. The worst case, a straight diagonal projected warp path, is depicted in Figure 8.
In this worst case, the number of cells filled at a single resolution of length N is

Number of cells filled at one resolution = N(4r + 3)   [1]

Equation 2 gives the lengths of the time series at all of the resolutions:

∑_{res=0}^{∞} N/2^res = N + N/2 + N/2² + N/2³ + N/2⁴ + ⋯   [2]

Therefore, the number of cells evaluated at all resolutions is (combining Equations 1 and 2)

∑_{res=0}^{∞} (N/2^res)(4r + 3) = N(4r + 3) + (N/2)(4r + 3) + (N/2²)(4r + 3) + ⋯   [3]

The series in Equation 3 is very similar to the series

∑_{res=0}^{∞} 1/2^res = 1 + 1/2 + 1/2² + 1/2³ + 1/2⁴ + ⋯ = 2   [4]

Multiplying Equation 4 by Equation 1 yields

N(4r + 3) + (N/2)(4r + 3) + (N/2²)(4r + 3) + ⋯ = 2N(4r + 3)   [5]

Since the sequence in Equation 5 is identical to the sequence in Equation 3, the number of cells evaluated at all resolutions is

Total number of cells filled = 2N(4r + 3)   [6]

In addition to the number of cells calculated, there is also time complexity for creating the coarser resolutions and determining the warp path by tracing through the matrix. The time complexity needed to create the resolutions is proportional to the number of points in all of the resolutions, which is the series in Equation 2. The solution of Equation 2 is obtained by multiplying Equation 4 by N, which yields 2N. Since multiple resolutions of both time series must be created, 2N is multiplied by two to get the final time complexity:

Time to create all resolutions = 4N   [7]

The time complexity needed to trace the warp path back through a matrix is measured by the length of the warp path. A resolution containing N points has a warp path of length 2N in the worst case (N is the best case, for a diagonal line). Multiplying Equation 4 by 2N gives the worst-case length of all warp paths added together from every resolution:

Length of all warp paths = 4N   [8]

Adding Equations 6, 7, and 8 gives the total worst-case time complexity:

FastDTW time complexity = 2N(4r + 3) + 4N + 4N = N(8r + 14)   [9]

Space Complexity of FastDTW. The space needed to hold all of the coarser resolutions of both time series is

Space to store all coarser resolutions = 2N   [10]

The space complexity of the cost matrix is the maximum size cost matrix that is created, which is the full resolution matrix. The number of cells in that matrix is Equation 1:

Space of cost matrix = N(4r + 3)   [11]

The space complexity of storing the warp path is equal to the longest warp path that can exist at full resolution. If the warp path traces the perimeter of the cost matrix, then the length of that path will be 2N:

Space complexity of storing the warp path = 2N   [12]

Adding Equations 10, 11, and 12 gives the total worst-case space complexity of

FastDTW space complexity = N(4r + 7)   [13]

which is also O(N) if r (radius) is a small (< N) constant value.
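The geometric-series argument above can be condensed into one line of LaTeX; the second identity combines the totals from Equations 6-8 as reconstructed above. As a worked check against the Figure 6 example, N = 16 and r = 1 give a worst-case bound of 2·16·(4·1+3) = 224 filled cells, comfortably above the 164 cells actually evaluated:

\[
\sum_{res=0}^{\infty} \frac{N}{2^{res}}\,(4r+3)
  = N(4r+3)\sum_{res=0}^{\infty}\frac{1}{2^{res}}
  = 2N(4r+3),
\qquad
\underbrace{2N(4r+3)}_{\text{cells}}
  + \underbrace{4N}_{\text{coarsening}}
  + \underbrace{4N}_{\text{traceback}}
  = N(8r+14) = O(N).
\]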
4. EMPIRICAL EVALUATION
The goal of this evaluation is to demonstrate the efficiency and accuracy of the FastDTW algorithm on a wide range of time series data sets. To ensure reproducibility, all datasets and algorithms used in this evaluation can be found online at "http://cs.fit.edu/~pkc/FastDTW/". This evaluation will first demonstrate the accuracy of the FastDTW algorithm and will then empirically verify its linear time complexity.

4.1 Accuracy of FastDTW
4.1.1 Procedures and Criteria
The accuracy of an approximate DTW algorithm can be measured by determining how much the approximate warp path distance differs from the optimal warp path distance. The error of an approximate DTW algorithm, such as our FastDTW algorithm, is calculated by the following equation:

Error of a warp path = (approxDist − optimalDist) / optimalDist × 100   [14]

If the DTW algorithm finds a warp path with a distance equal to the optimal warp path distance, then there is zero error. The optimal warp path distance can be found by running the standard DTW algorithm. The error of a warp path will always be ≥ 0% (because optimalDist is never larger than approxDist) and can exceed 100% if the distance of the approximate warp path is more than double the optimal distance.
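Equation 14 is the only accuracy metric used below; as a one-line sketch (the names are illustrative):

def warp_path_error(approx_dist, optimal_dist):
    """Equation 14: percent error of an approximate warp path distance
    relative to the optimal distance found by standard DTW."""
    return (approx_dist - optimal_dist) / optimal_dist * 100.0

For example, warp_path_error(25.0, 10.0) returns 150.0, an error above 100% because the approximate distance is more than double the optimal distance.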
The FastDTW algorithm is evaluated against two other existing approximate DTW algorithms: Sakoe-Chiba bands and data abstraction. Sakoe-Chiba bands (see the left side of Figure 4) constrain the DTW algorithm to only evaluate a specified radius away from a linear warp within the cost matrix. Itakura Parallelograms (see the right side of Figure 4) are not evaluated because, for a given radius, a band will always find a warp path equal to or better than that of the parallelogram. This is because the parallelogram constraint is a subset of the band constraint. The data abstraction DTW algorithm used in this evaluation first samples the data, and then runs the standard DTW algorithm to find a warp path on the sampled data. This warp path is then projected to the full resolution, as previously shown in Figure 5.

The radius parameter performs a similar function for all three algorithms. It expands the region of the cost matrix searched from an initial "guess". For bands, the initial guess is a linear warp. For data abstraction, it is the projected warp path from the sampled data, and for FastDTW it is the projected warp path from the previous resolution. Each algorithm will be run with multiple radius parameters on a wide range of data sets.

All three algorithms (FastDTW, bands, and data abstraction) are only being evaluated based on accuracy in this section. However, care has been taken to ensure that the time each algorithm requires to execute is similar for the same radius. The data abstraction algorithm is made O(N) by sampling the data down to √N points before performing quadratic time warping (O((√N)²) = O(N)). All three algorithms evaluate roughly the same number of cells in the cost matrix for any particular radius. FastDTW has some overhead for evaluating previous resolutions, and data abstraction has overhead for running standard DTW on the sampled time series. However, all three algorithms are linear with respect to the length of the input time series, and the number of cells evaluated for a given radius does not differ by more than a power of two for any pair of algorithms.

The time series data sets used to evaluate the accuracy of the FastDTW algorithm include very similar data sets that are from the same domain, and dissimilar data sets that are from different domains. Both types of data are used to show that FastDTW works well on a wide range of data, regardless of the similarity or characteristics of the time series. Dynamic time warping is most frequently used to compare the similarity between time series, so it is likely that the majority of time series that are compared are similar and from the same domain. However, very dissimilar time series are also evaluated to ensure that the approximate FastDTW algorithm works well when warping two time series that do not share common features. The accuracy of each DTW algorithm is measured on three groups of data:

1) Random – 990 time warps between 45 time series from different domains (eeg, random walk, earthquake, speech, tide, etc.). The average length is 1128 points.
2) Trace – 10,900 time warps between 200 time series. The Trace domain contains 4 classes that simulate instrumentation failure in a nuclear power plant. All time series have a length of 275 points.
3) Gun – 10,900 time warps between 200 time series. The Gun domain contains 2 classes, with 100 time series of a gun being drawn from a holster and 100 time series of a gun being pointed. All time series have a length of 151 points.

All data sets used in this evaluation were obtained from the UCR Time Series Data Mining Archive and are publicly available [7]. Each algorithm and group of data is also run multiple times with the following settings for the radius parameter: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, and 30. For a given algorithm, group of data, and radius, the average error of all possible warp paths between time series in the group is recorded.

4.1.2 Results and Analysis
The FastDTW algorithm is very accurate for all three groups of data that it was tested on. FastDTW has an error of only 19.2% to 0.0%, depending on the value of the radius parameter. For all algorithms, the error decreases as the radius parameter increases. However, FastDTW converges to 0% error much faster than the other two algorithms. A summary of the results for several radius settings is contained in Table 1.

Table 1. Average error of the three algorithms at selected radius values (errors of the 3 groups of data are averaged).

                          radius
                  0        1       10      20     30
FastDTW        19.2%     8.6%    1.5%    0.8%   0.6%
Abstraction   983.3%   547.9%    6.5%    2.8%   1.8%
Band         2749.2%  2385.7%  794.1%  136.8%   9.3%

Table 1 shows the average error for all three algorithms over all three test cases, when run with the radius set to 0, 1, 10, 20, and 30. FastDTW has a small amount of error for all radius settings, and begins to approach 0% error when radius is set at or above 10. Data abstraction is inaccurate for small radius values, but begins to be reasonably accurate when run with larger radius settings. The band algorithm is very inaccurate for all radius settings except for 30.
Figure 9. Accuracy of FastDTW compared to Bands and Data Abstraction. The top figure's y-axis is 0%-100% and the bottom figure's y-axis is 0%-10%.
The accuracy of each algorithm on the different groups of data is displayed in Figure 9. In Figure 9, the x-axis is the radius parameter used, and the y-axis is the error of the tested algorithm. Each of the 9 lines is a combination between the three algorithms and the three groups of data sets. The FastDTW algorithm curves are solid lines, data abstraction curves are dotted lines, and band curves are dashed lines. The three groups of data can be identified by the shape of the markers on the curves. Round markers are used on curves using Random data, triangle markers are for the Trace data, and square markers are for the Gun data.

The three solid lines at the bottom of Figure 9 are the error curves for FastDTW on all three groups of data. The error is small for all three lines, meaning that the accuracy of FastDTW is not affected very much by the characteristics or similarity of the input time series. FastDTW is significantly more accurate than the other two methods when the radius parameter is set to small values. When the radius parameter is larger, the abstraction method begins to approach the accuracy of FastDTW. However, FastDTW was always at least 2-3 times more accurate than abstraction in our experiments.

The three dotted Abstraction lines all have large errors for small radius values, but converge to less than 5% error on all data sets as the radius is increased to 30. This is due to the previously stated problem of the projected warp path being close to a near-optimal solution, but not taking local variations of the warp path into account. Abstraction does perform reasonably well if the radius is increased to at least 10. The ability of data abstraction to locally refine its projected path within the neighborhood of radius cells is not a part of the original algorithm, and is introduced in this paper. The run-time of the original data abstraction algorithm is the same as our improved implementation when using a radius of 0, which has a very large average error of 983.3% over the three groups of data used in this evaluation.

Data abstraction is inaccurate (500-1000% error) for small radius settings because it blindly projects the warp path from a sampled time series onto a full resolution cost matrix. This projection may be "in the neighborhood" of a near-optimal warp path, but it fails to take into consideration any local variation in the warp path that is obscured by sampling. Local variations in the warp path can have a huge impact on the accuracy of a warp path. Increasing the radius setting (which is not part of the original data abstraction algorithm; it is introduced in this paper) can make it rather accurate, because this begins to adjust the warp path to cover local variations. However, the accuracy is still worse than FastDTW for a given radius, because FastDTW projects the "neighborhood" of the near-optimal warp path from the previous resolution in several small steps rather than a single large step.

Bands can only have good results if a near-optimal warp path is entirely contained within radius cells from a linear warp. When bands are used with a radius of 0, and the two time series are of equal length, the band method generalizes to Euclidean distance, which is a notoriously inaccurate similarity measure for time series [15]. A slight misalignment between the two time series being warped can cause a very large amount of error in the warp path.
The three dashed Band lines all have errors greater than 100% (as high as 7225%) for small radius values, and converge very slowly to 0% error as the radius increases. Band performs best on the random data because if two time series have almost nothing in common, an arbitrary warp path probably has a warp path distance that is not significantly different from the minimum-distance or maximum-distance warp paths. The other two groups of data are data sets in a similar domain, which means that the optimal warp distance can be very small. Due to the way that error is calculated in Equation 14, if the optimal warp distance is very small, then the potential error can be very large, because the optimal warp distance is the denominator of a fraction. The Band approach on the Trace data group has extremely poor accuracy because the time series contain events that are shifted in time, and bands only work well if a near-optimal warp path exists that is close to a linear warp. The Gun data group also does not work very well with the Band algorithm, which is surprising since the time series seem to be reasonably in phase with each other (near a linear warp).
4.2 Efficiency
4.2.1 Procedures and Criteria
The efficiency of the FastDTW algorithm will be measured in seconds, with respect to the length of the input time series, and compared to the standard DTW algorithm. The FastDTW algorithm will be run with the radius parameter set to 0, 20, and 100 over a range of varying-length time series. The data sets used are synthetic data sets of a single period of a sine wave with Gaussian noise inserted. Only the lengths of the time series are significant, because the shape of the time series has little significance on the run-time of either algorithm. The lengths of the time series evaluated vary from 10 to 150,000.

The standard DTW algorithm used in this evaluation is the linear-space implementation that only retains the last two columns of the cost matrix. If the standard DTW implementation is used, the test machine runs out of memory when the length of the time series exceeds ~3,000. The FastDTW algorithm is implemented as described in this paper, except that the cost matrix is filled using secondary storage if the lengths of the time series grow so large that the number of cells in the search window is larger than can fit into main memory. Both algorithms are implemented in Java, and the runtime is measured using the system clock on a machine with minimal background processes running.
the runtime is measured using the system clock on a machine with points, FastDTW becomes the more efficient algorithm. For small
minimal background processes running. time series it makes more sense to use the DTW algorithm rather
than FastDTW. The FastDTW algorithm is not significantly faster
4.2.2 Results and Analysis (and possibly a little slower) than DTW for small time series, and
The FastDTW algorithm was significantly faster than the standard DTW is guaranteed to always find the optimal warp path.
DTW algorithm for all but the smallest time series. FastDTW is However, for large time series, the quadratic time complexity
50 to 150 times faster than standard DTW (using radius values of becomes prohibitive.
Execution Time of FastDTW on Large Time Series
5000
DTW
4500
FastDTW (radius=100)
4000 FastDTW (radius=20)
Time (seconds)
100000
DTW
FastDTW (radius=100)
FastDTW (radius=20)
1000
Time (seconds)
FastDTW (radius=0)
10
0
1,000 10,000 100,000 1,000,000
Length of Time Series
Figure 11. The eficiency of FastDTW and DTW on large time series. The top figure is scaled normally, and the bottom figure has
log-log scaling.
The full results, using radius values of 0, 20, and 100 on time series ranging in length from 10 to 200,000, are shown in Figure 11. In Figure 11, the y-axis is the execution time and the x-axis is the length of the time series. The two graphs in Figure 11 are two views of the same data. The top graph is scaled normally, and the bottom graph has log-log scaling. Looking at the top graph, it is immediately obvious that the time complexity of DTW is much greater than that of FastDTW. DTW has a steep quadratic curve, while all three FastDTW curves are approximately straight lines. In the log-scaled graph at the bottom of Figure 11, the three curves of FastDTW can be viewed more easily. The radius parameter increases the execution time of FastDTW by a constant factor, which is why the three FastDTW lines seem to be converging on the log-scaled graph as the length of the time series increases. The constant factor difference between them gets less significant as the length of the time series increases.
In Section 3.1, we proved theoretically that the FastDTW algorithm is O(N). Using the empirical data in Figure 11, the equation of the FastDTW curve with a radius of 100 is

y = 0.00000001x² + 0.001x − 0.7337

The coefficient of the squared term is very small, and the linear term seems to be the most significant term in the equation, which would empirically confirm that the FastDTW algorithm is O(N). However, since the values for x are so large, the squared term actually dominates the equation when x > 100,000. The reason for this slight super-linearity is that when the cells being filled in the search window will not fit into main memory, they must be saved to the disk. Writing the cells to the disk can be performed in linear time. However, when reading the cells from the random-access file to construct a warp path, reading individual non-sequential cells from the disk cannot be performed in linear time. Larger time series create larger swap files, which require the disk head to move further to perform each random-access read operation. In other words, the number of cells in the cost matrix that must be filled/read is linear with respect to the length of the time series. So the algorithm is O(N), but the implementation is not quite O(N) for large time series when the entire search window will not fit into main memory.

5. CONCLUDING REMARKS
In this paper we introduced the FastDTW algorithm, a linear and accurate approximation of dynamic time warping (DTW). FastDTW uses a multilevel approach that recursively projects a warp path from a coarser resolution to the current resolution and refines it. While the quadratic time and space complexity of DTW has limited its use to only the smallest time series data sets, FastDTW can be run on much larger data sets. FastDTW is an order of magnitude faster than DTW, and it also complements existing indexing methods that speed up time series similarity search and classification.
Our theoretical and empirical analysis showed that FastDTW has a linear time and space complexity. Empirical results have also shown that FastDTW is accurate when warping both similar and dissimilar time series. With a radius of only 1, FastDTW had an average error of 8.6%, and increasing the radius to 20 lowers the error to under 1%. FastDTW's accuracy was compared to two existing methods, Data Abstraction and Sakoe-Chiba Bands, and was found to be far more accurate than either approach when using small radius values. FastDTW's solutions also always approached zero error (the optimal warp path) with smaller radius values than the other two methods. An additional contribution of this paper is demonstrating how to apply the refinement portion of the FastDTW algorithm to the Data Abstraction approximate DTW algorithm. Doing so increased the accuracy of Data Abstraction by more than 100-fold in our evaluation with a radius of only 10.

The main limitation of the FastDTW algorithm is that it is an approximate algorithm and is not guaranteed to find the optimal solution (although it very often does), which matters if a problem requires optimal warp paths to be found. Future work will look into increasing the accuracy of FastDTW. Possibilities to increase the accuracy of FastDTW include changing the step size (the magnitude of the resolution change) between resolutions, and evaluating search algorithms to guide the search during the refinement step rather than simply expanding the search window in both directions.
6. REFERENCES
[1] Abdulla, W., D. Chow & G. Sin. Cross-words Reference Template for DTW-based Speech Recognition Systems. In Proc. IEEE TENCON, Bangalore, India, 2003.
[2] Chu, S., E. Keogh, D. Hart & M. Pazzani. Iterative Deepening Dynamic Time Warping for Time Series. In Proc. of the Second SIAM Intl. Conf. on Data Mining. Arlington, Virginia, 2002.
[3] Gupta, L., D. Molfese, R. Tammana & P. Simos. Nonlinear Alignment and Averaging for Estimating the Evoked Potential. In IEEE Transactions on Biomedical Engineering, vol. 43, no. 4, pp. 346-356, 1996.
[4] Itakura, F. Minimum Prediction Residual Principle Applied to Speech Recognition. In IEEE Trans. Acoustics, Speech, and Signal Proc., vol. ASSP-23, pp. 52-72, 1975.
[5] Karypis, G., R. Aggarwal, V. Kumar & S. Shekhar. Multilevel Hypergraph Partitioning: Application in VLSI Domain. In Design Automation Conf., pp. 526-530. Anaheim, California, 1997.
[6] Keogh, E. Exact Indexing of Dynamic Time Warping. In VLDB, pp. 406-417. Hong Kong, China, 2002.
[7] Keogh, E. & T. Folias. The UCR Time Series Data Mining Archive [http://www.cs.ucr.edu/~eamonn/TSDMA/index.html]. Riverside, CA, University of California – Computer Science and Engineering Department, 2002.
[8] Keogh, E. & M. Pazzani. Derivative Dynamic Time Warping. In Proc. of the First SIAM Intl. Conf. on Data Mining, Chicago, Illinois, 2001.
[9] Keogh, E. & M. Pazzani. Scaling up Dynamic Time Warping for Datamining Applications. In Proc. of the Sixth ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp. 285-289. Boston, Massachusetts, 2000.
[10] Kim, S., S. Park & W. Chu. An Index-based Approach for Similarity Search Supporting Time Warping in Large Sequence Databases. In Proc. of the 17th Intl. Conf. on Data Engineering, pp. 607-614. Heidelberg, Germany, 2001.
[11] Kruskall, J. & M. Liberman. The Symmetric Time Warping Problem: From Continuous to Discrete. In Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, pp. 125-161. Addison-Wesley Publishing Co., Reading, Massachusetts, 1983.
[12] Ratanamahatana, C. & E. Keogh. Making Time-series Classification More Accurate Using Learned Constraints. In Proc. of the SIAM Intl. Conf. on Data Mining, pp. 11-22. Lake Buena Vista, Florida, 2004.
[13] Sakoe, H. & S. Chiba. Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Trans. Acoustics, Speech, and Signal Proc., vol. ASSP-26, 1978.
[14] Salvador, S. Learning States for Detecting Anomalies in Time Series. Master's Thesis, CS-2004-05, Dept. of Computer Sciences, Florida Institute of Technology, 2004.
[15] Vlachos, M., G. Kollios & D. Gunopulos. Discovering Similar Multidimensional Trajectories. In Proc. of the 18th Intl. Conf. on Data Engineering, pp. 673-684. San Jose, California, 2002.