A Practical Time-Series Tutorial With MATLAB
Michalis Vlachos
IBM T.J. Watson Research Center
Hawthorne, NY, 10532
I am definitely smarter than her, but I am not a time-series person, per se.
I wonder what I gain from this tutorial.
Disclaimer
I am not affiliated with Mathworks in any way
but I do like using Matlab a lot
since it makes my life easier
Timeline of tutorial
Matlab introduction
I will try to convince you that Matlab is cool
Brief introduction to its many features
Time-series with Matlab
Introduction
Time-Series Representations
Distance Measures
Lower Bounding
Clustering/Classification/Visualization
Applications
Let's not escape into mathematics. Let's stick with reality.
Oh... buy my books too!
Seasonality
..or about complex mathematical formulas
Michael Crichton
John Tukey
Matlab
Interpreted Language
Easy code maintenance (code is very compact)
Very fast array/vector manipulation
Support for OOP
Easy plotting and visualization
Easy Integration with other Languages/OSs
Interact with C/C++, COM Objects, DLLs
Built-in Java support (and compiler)
Ability to make executable files
Multi-Platform Support (Windows, Mac, Linux)
Extensive number of Toolboxes
Image, Statistics, Bioinformatics, etc
Cleve Moler
Students
Batch processing of files
No more incomprehensible perl code!
Easy visualization
It's cheap! (for students at least)
Starting up Matlab
Matlab Environment
Command Window:
- type commands
- load scripts
Workspace:
Loaded Variables/Types/Size
Populating arrays
Plot sinusoid function
a = [0:0.3:2*pi] % generate values from 0 to 2pi (with step of 0.3)
b = cos(a) % evaluate cos at every value contained in array [a]
plot(a,b) % plot a (x-axis) against b (y-axis)
Related:
linspace(-100,100,15); % generate 15 values between -100 and 100
Array Access
Access array elements
>> a(1)
ans =
     0
>> a(1:3)
ans =
         0    0.3000    0.6000
2D Arrays
Can access whole columns or rows
a =
     1     2     3
     4     5     6
>> a(1,:) % Row-wise access
>> a(2,2)
>> a(:,1) % Column-wise access
ans =
     1
     4
A good listener is not only popular everywhere, but after a while he gets to know something. -- Wilson Mizner
Column-wise computation
For arrays greater than 1D, functions such as max, mean and sort operate column-by-column
>> a = [1 2 3; 3 2 1]
a =
     1     2     3
     3     2     1
>> max(a)
ans =
     3     2     3
>> mean(a)
ans =
    2.0000    2.0000    2.0000
>> sort(a)
ans =
     1     2     1
     3     2     3
Concatenating arrays
Column-wise or row-wise
>> a = [1 2 3];
>> b = [4 5 6];
>> c = [a b]
c =
     1     2     3     4     5     6
>> a = [1 2 3];
>> b = [4 5 6];
>> c = [a; b]
c =
     1     2     3
     4     5     6
>> a = [1;2];
>> b = [3;4];
>> c = [a b]
c =
     1     3
     2     4
>> a = [1;2];
>> b = [3;4];
>> c = [a; b]
c =
     1
     2
     3
     4
Initializing arrays
Create array of ones [ones]
>> a = ones(1,3)
a =
     1     1     1
>> a = ones(2,2)*5
a =
     5     5
     5     5
>> a = ones(1,3)*inf
a =
   Inf   Inf   Inf
>> a = zeros(3,1) + [1 2 3]' % transpose so that both operands are 3x1
a =
     1
     2
     3
reshape(X,[M,N]): [M,N] matrix of columnwise version of X
repmat(X,[M,N]): make [M,N] tiles of X
>> repmat(a,2,1)
ans =
     1     2     3
     1     2     3
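A small sketch of both functions on toy inputs (the variable names x, r and t are only illustrative). It shows that reshape fills the result column by column and that repmat stacks copies of its input:

```matlab
x = 1:6;                      % toy row vector
r = reshape(x, [2 3]);        % filled column by column: [1 3 5; 2 4 6]
t = repmat([1 2 3], 2, 1);    % 2 x 1 tiling of a row: [1 2 3; 1 2 3]
disp(r);
disp(t);
```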
>> a = [1 3 2 5];
>> a(end-1)
ans =
     2
>> length(a) % Length = 4
ans =
     4
>> size(a) % rows = 1, columns = 4
ans =
     1     4
Iris:
virginica/versicolor/setosa
4 dimensions, 3 species
Petal length & width, sepal length & width
meas (150x4 array):
Holds 4D measurements
...
'versicolor'
'versicolor'
'versicolor'
'versicolor'
'versicolor'
'virginica'
'virginica'
'virginica'
'virginica'
...
idx_setosa
...
1
1
1
0
0
0
...
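A logical index such as idx_setosa can be built with strcmp. The sketch below uses a tiny hand-made stand-in for the iris variables (the four rows of meas and species here are toy values, not the real 150-row dataset) so that it runs without any toolbox:

```matlab
% Toy stand-in for the iris variables (assumed values, 4 rows only)
species = {'setosa'; 'setosa'; 'versicolor'; 'virginica'};
meas = [5.1 3.5 1.4 0.2;
        4.9 3.0 1.4 0.2;
        7.0 3.2 4.7 1.4;
        6.3 3.3 6.0 2.5];

idx_setosa = strcmp(species, 'setosa');   % logical vector: 1 where the label matches
setosa = meas(idx_setosa, 1:3);           % rows of that species, first 3 dimensions
disp(idx_setosa');
```

The same pattern gives idx_virginica and idx_versicolor by changing the label string.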
The world is governed more by appearances rather than realities --Daniel Webster
scatter3
>> setosa = meas(idx_setosa,[1:3]);
>> virgin = meas(idx_virginica,[1:3]);
>> versi = meas(idx_versicolor,[1:3]);
>> scatter3(setosa(:,1), setosa(:,2), setosa(:,3)); % plot in blue circles by default
>> hold on;
>> scatter3(virgin(:,1), virgin(:,2), virgin(:,3), 'rs'); % red [r] squares [s] for these
>> scatter3(versi(:,1), versi(:,2), versi(:,3), 'gx'); % green x's
[Figure: 3D scatter plot of the three species]
Zoom in
Create line
Create Arrow
Select Object
Add text
Computers are useless. They can only give you answers. -- Pablo Picasso
Other Styles:
The result
>> title('My measurements (\epsilon/\pi)')
>> ylabel('Imaginary Quantity')
>> xlabel('Month of 2005')
Saving Figures
Matlab lets you save figures (.fig) for later processing
You can always put off for tomorrow what you can do today. -Anonymous
Exporting Figures
Export to:
emf, eps, jpg, etc
Matlab code:
% extract to color eps
print -depsc myImage.eps; % from command-line
print(gcf,'-depsc','myImage') % using variable as name
colormap
bars
time = [100 120 80 70]; % our data
h = bar(time); % get handle
cmap = [1 0 0; 0 1 0; 0 0 1; .5 0 1]; % colors
colormap(cmap); % create colormap
cdata = [1 2 3 4]; % assign colors
set(h,'CDataMapping','direct','CData',cdata);
[Figure: 3D bar chart of data; the colormap is a 64 x 3 array of RGB values]
data = [ 10 8 7; 9 6 5; 8 6 4; 6 5 4; 6 3 2; 3 2 1];
bar3([1 2 3 5 6 7], data);
c = colormap(gray); % get colors of colormap
c = c(20:55,:); % get some colors
colormap(c); % new colormap
data = [1:10];
data = repmat(data,10,1); % create data
surface(data,'FaceColor',[1 1 1], 'Edgecolor', [0 0 1]); % plot data
view(3); grid on; % change viewpoint and put axis lines
Creating .m files
Standard text files
Script: A series of Matlab commands (no input/output arguments)
Functions: Programs that accept input and return output
M editor
The following script will create:
An array with 10 random walk vectors
Will save them under text files: 1.dat, ..., 10.dat
Sample Script
myScript.m
cumsum(A)
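The script itself was shown as an image. The following is a minimal sketch of what such a script could look like, not the original: the walk length (100), the use of randn for the steps, and writing into tempdir are all assumptions.

```matlab
% Hypothetical reconstruction of myScript.m (sizes and output folder are assumed)
numWalks = 10;                          % 10 random walk vectors
len = 100;                              % assumed length of each walk
A = cumsum(randn(len, numWalks));       % cumulative sum of random steps, one walk per column

outDir = tempdir;                       % assumed output folder
for i = 1:numWalks
    w = A(:, i);                        % i-th random walk
    fname = fullfile(outDir, sprintf('%d.dat', i));
    save(fname, 'w', '-ascii');         % plain-text file: 1.dat, ..., 10.dat
end
```

Each file can be read back with load, which returns the walk as a column vector.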
Functions in .m scripts
function dataN = zNorm(data) % declaration line reconstructed; the name zNorm is assumed
% Help Text (help function_name)
% Function Body
if (nargin<1), % check parameters
    error('Not enough arguments');
end
data = data - mean(data); % subtract mean
data = data/std(data); % divide by std
dataN = data;
Cell Arrays
Cells that hold other Matlab arrays
Let's read the files of a directory
>> f = dir('*.dat') % list the .dat files
f =
15x1 struct array with fields:
    name
    date
    bytes
    isdir
for i=1:length(f),
    a{i} = load(f(i).name); % read file contents into cell i
    N = length(a{i});
    plot3([1:N], a{i}(:,1), a{i}(:,2), ...
        'r-', 'Linewidth', 1.5);
    grid on;
    pause;
    cla;
end
Struct Array: each element f(i) has the fields name, date, bytes, isdir (e.g. f(1).name)
Reading/Writing Files
Load/Save are faster than C style I/O operations
But fscanf, fprintf can be useful for file formatting
or reading non-Matlab files
fid = fopen('fischer.txt', 'wt');
for i=1:length(species),
    fprintf(fid, '%6.4f %6.4f %6.4f %6.4f %s\n', meas(i,:), species{i});
end
fclose(fid);
Output file:
0.1  1.1052
0.2  1.2214
0.3  1.3499
0.4  1.4918
0.5  1.6487
0.6  1.8221
0.7  2.0138
Flow Control/Loops
if (else/elseif) , switch
Check logical conditions
while
Execute statements an indefinite number of times (as long as a condition holds)
for
Execute statements a fixed number of times
break, continue
return
Return execution to the invoking function
Life is pleasant. Death is peaceful. It's the transition that's troublesome. -- Isaac Asimov
For-Loop or vectorization?
clear all;
tic;
for i=1:50000
    a(i) = sin(i);
end
toc
elapsed_time =
    5.0070

clear all;
a = zeros(1,50000); % pre-allocate the output
tic;
for i=1:50000
    a(i) = sin(i);
end
toc
elapsed_time =
    0.1400

clear all;
tic;
i = [1:50000];
a = sin(i); % vectorized
toc;
elapsed_time =
    0.0200
Matlab Profiler
Find which portions of code take up most of the execution time
Identify bottlenecks
Vectorize offending code
Hints & Tips
There is always an easier (and faster) way
Typically there is a specialized function for what you want to achieve
Debugging
Beware of bugs in the above code; I have only proved it correct, not tried it
-- D. Knuth
Full control over variables and execution path
F10: step, F11: step in (visit functions, as well)
Matlab Toolboxes
You can buy many specialized toolboxes from Mathworks
Image Processing, Statistics, Bio-Informatics, etc
Wavelets
http://www.math.rutgers.edu/~ojanen/wavekit/
Speech Processing
http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
Bayesian Networks
http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
I've had a wonderful evening. But this wasn't it.
Google groups
comp.soft-sys.matlab
You can find *anything* here
Someone else had the same
problem before you!
What is a time-series
Definition: A sequence of measurements over time
ECG
Medicine
Stock Market
Meteorology
Geology
Astronomy
Chemistry
Biometrics
Robotics
Example measurements over time: 64.0, 62.8, 62.0, 66.0, 62.0, 32.0, 86.4, ..., 21.6, 45.2, 43.2, 53.0, 43.2, 42.8, 43.2, 36.4, 16.9, 10.0
[Figure: example time-series: Sunspot, Earthquake]
[Figure: Color Histogram treated as a Time-Series (Cluster 2)]
Applications (Shapes)
Recognize type of leaf based on its shape
Ulmus carpinifolia
Acer platanoides
Salix fragilis
Tilia
Quercus robur
MOCAP data
...my precious...
Applications (Video)
Video-tracking / Surveillance
Visual tracking of body features (2D time-series)
Sign Language recognition (3D time-series)
Video Tracking of body feature
over time (Athens1, Athens2)
Easy visualization
Many built-in functions
Specialized Toolboxes
Becoming sufficiently familiar with something is a substitute for understanding it.
Linear Scan:
Objective: Compare the query with all sequences in DB and return the k most similar sequences to the query.
Database with time-series: Medical sequences, Images, etc
Sequence Length: 100-1000pts
DB Size: 1 TByte
Distances of the candidates to the query: D = 10.2, D = 11.8, D = 17, D = 22
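A linear scan can be sketched in a few lines. The database DB, the query q and k below are toy values chosen for illustration; each row of DB is assumed to hold one sequence, and Euclidean distance is assumed as the measure:

```matlab
% Toy database: each row is one sequence (assumed layout)
DB = [0 0 0 0;
      1 1 1 1;
      5 5 5 5;
      0 1 0 1];
q = [0 0.2 0 1];                 % toy query
k = 2;                           % number of matches to return

% Euclidean distance of the query to every sequence in the database
d = sqrt(sum((DB - repmat(q, size(DB,1), 1)).^2, 2));

[dSorted, order] = sort(d);      % ascending distances
nearest = order(1:k);            % indices of the k most similar sequences
disp(nearest');
```

Every sequence is touched once, which is why the scan becomes expensive for a database that does not fit in memory.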
Hierarchical Clustering
Very generic & powerful tool
Provides visual data grouping
Pairwise distances: D1,1, D2,1, ..., DM,N
Z = linkage(D);
H = dendrogram(Z);
Partitional Clustering
Faster than hierarchical clustering
Typically provides suboptimal solutions (local minima)
Performs poorly in high dimensions
K-Means Algorithm:
See: kmeans
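The algorithm steps did not survive on this slide. As a rough sketch of the usual procedure (assign every point to its nearest center, recompute the centers, repeat until nothing changes), assuming toy 2-D points and hand-picked initial centers rather than the toolbox kmeans function:

```matlab
% Toy 2-D points forming two obvious groups (assumed data)
X = [0.10 0.10; 0.20 0.10; 0.15 0.20;
     0.80 0.90; 0.90 0.80; 0.85 0.85];
k = 2;
C = X([1 4], :);                         % initial centers, picked by hand

for iter = 1:20
    % squared distance of every point to every center
    D = zeros(size(X,1), k);
    for j = 1:k
        delta = X - repmat(C(j,:), size(X,1), 1);
        D(:, j) = sum(delta.^2, 2);
    end
    [~, label] = min(D, [], 2);          % assignment step

    Cnew = C;
    for j = 1:k                          % update step
        if any(label == j)
            Cnew(j,:) = mean(X(label == j, :), 1);
        end
    end
    if isequal(Cnew, C), break; end      % converged
    C = Cnew;
end
disp(label');
```

The result depends on the initial centers, which is the local-minimum issue mentioned above.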
K-Means Demo
Original sequences
Compressed sequences
Clustering space
Classification
Typically classification can be made easier if we have clustered the objects
Class A
Class B
[Figure: Elfs and Hobbits plotted by Height and Hair Length]
Example
What do we need?
1. Define Similarity
2. Search fast
Dimensionality Reduction
(compress data)
All models are wrong, but some are useful
Notion of Similarity I
The solution to any time-series problem boils down to a proper definition of *similarity*
Notion of Similarity II
Similarity depends on the features we consider
(i.e. how we will describe or compress the sequences)
Metric
Euclidean Distance
Correlation
Non-Metric
Time Warping
LCSS
Properties
Positivity: $d(x,y) \ge 0$ and $d(x,y) = 0$ if $x = y$
Symmetry: $d(x,y) = d(y,x)$
Triangle Inequality: $d(x,z) \le d(x,y) + d(y,z)$
If any of these is not obeyed then the distance is a non-metric
Triangle Inequality
Triangle Inequality: $d(x,z) \le d(x,y) + d(y,z)$
Metric distance
functions can exploit
the triangle inequality
to speed-up search
Intuitively, if:
- x is similar to y and,
- y is similar to z, then,
- x is similar to z too.
Assume:
d(Q,bestMatch) = 20
and
d(Q,B) = 150
Precomputed pairwise distances between the database sequences (symmetric, 0 on the diagonal): 20, 110, 90
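The pruning argument can be checked numerically. In the sketch below, d(Q,bestMatch) = 20 and d(Q,B) = 150 come from the slide, while the assignment of the precomputed distances to d(B,A) = 20 and d(B,C) = 90 is an assumption made for illustration. From the triangle inequality, d(Q,X) >= |d(Q,B) - d(B,X)|, so any X whose bound exceeds the best match so far can be skipped without computing d(Q,X):

```matlab
dQ_best = 20;                      % d(Q,bestMatch), from the slide
dQ_B    = 150;                     % d(Q,B), from the slide
dB      = [20 90];                 % assumed precomputed distances d(B,A) and d(B,C)

lowerBound = abs(dQ_B - dB);       % lower bounds on d(Q,A) and d(Q,C)
canSkip    = lowerBound > dQ_best; % true: cannot beat the best match so far
disp(lowerBound);
disp(canSkip);
```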
Bat similar to batman
Batman similar to man
Man similar to bat??
Matching flexibility
Robustness to outliers
Stretching in time/space
Support for different sizes/lengths
Speeding-up search can be difficult
Euclidean Distance
Most widely used distance measure
Definition: $L_2 = \sqrt{\sum_{i=1}^{n} (a[i] - b[i])^2}$
A: DxM matrix (M sequences, each of length D)
B: DxN matrix
Result is MxN matrix of distances D1,1, D2,1, ..., DM,N
aa=sum(a.*a); bb=sum(b.*b); ab=a'*b;
d = sqrt(repmat(aa',[1 size(bb,2)]) + repmat(bb,[size(aa,2) 1]) - 2*ab);
average value of A
average value of B
a = a - mean(a);
a = a ./ std(a);
Solution: Allow for compression & decompression in time
Dynamic Time-Warping
First used in speech recognition
for recognizing words spoken at
different speeds
---Maat--llaabb-------------------
----Mat-lab--------------------------
Euclidean distance
T1 = [1, 1, 2, 2]
T2 = [1, 2, 2, 2]
d = 1
One-to-one linear alignment
Warping distance
T1 = [1, 1, 2, 2]
T2 = [1, 2, 2, 2]
d = 0
One-to-many non-linear alignment
$c(i,j) = D(A_i,B_j) + \min\{\, c(i-1,j-1),\; c(i-1,j),\; c(i,j-1) \,\}$
Recursive equation
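The recursive equation translates directly into a dynamic-programming table. This is a minimal unconstrained sketch, assuming the squared difference as the point cost D and the toy sequences T1 = [1 1 2 2] and T2 = [1 2 2 2]; the table is padded with one extra row and column of Inf so that the boundary needs no special cases:

```matlab
A = [1 1 2 2];                 % T1
B = [1 2 2 2];                 % T2
n = numel(A);
m = numel(B);

c = inf(n+1, m+1);             % c(i+1,j+1) holds the cost of aligning A(1:i) with B(1:j)
c(1,1) = 0;
for i = 1:n
    for j = 1:m
        cost = (A(i) - B(j))^2;                          % D(A_i, B_j), assumed squared difference
        c(i+1,j+1) = cost + min([c(i,j), c(i,j+1), c(i+1,j)]);
    end
end
dtwDist    = sqrt(c(n+1,m+1));            % warping distance
euclidDist = sqrt(sum((A - B).^2));       % one-to-one alignment, for comparison
fprintf('DTW = %g, Euclidean = %g\n', dtwDist, euclidDist);
```

Filling the whole table costs O(n*m); restricting j to a band around i is what reduces the work when a warping window is used.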
[Figure: clustering dendrogram of 20 sequences using Euclidean Distance]
We now fill only a small portion of the array
Minimum Bounding Envelope (MBE)
Warping Length
Disadvantages of DTW:
A. All points are matched
B. Outliers can distort distance
C. One-to-many mapping
Advantages of LCSS:
A. Outlying values not matched
B. Distance/Similarity distorted less
C. Constraints in time & space
Ignore majority of noise
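A minimal sketch of LCSS for real-valued sequences, assuming two illustrative thresholds that are not given on the slide: epsilon (how close two values must be to match, the constraint in space) and delta (how far apart in time two matched points may be). The toy sequence A contains one outlier, which is simply left unmatched:

```matlab
A = [1 2 3 100 4 5];           % toy sequence with one outlier
B = [1 2 3 4 5];
epsilon = 0.5;                 % assumed matching threshold in space
delta   = 2;                   % assumed matching window in time
n = numel(A);
m = numel(B);

L = zeros(n+1, m+1);           % L(i+1,j+1): LCSS length of A(1:i) and B(1:j)
for i = 1:n
    for j = 1:m
        if abs(A(i) - B(j)) <= epsilon && abs(i - j) <= delta
            L(i+1,j+1) = L(i,j) + 1;                    % points match
        else
            L(i+1,j+1) = max(L(i,j+1), L(i+1,j));       % skip one point
        end
    end
end
lcssLen = L(n+1,m+1);
sim = lcssLen / min(n, m);     % similarity normalized by the shorter length
fprintf('LCSS = %d, similarity = %g\n', lcssLen, sim);
```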
Dataset        Method      Time (sec)   Accuracy
(unlabeled)    Euclidean   34           20%
(unlabeled)    DTW         237          80%
(unlabeled)    LCSS        210          100%
ASL            Euclidean   2.2          33%
ASL            DTW         9.1          44%
ASL            LCSS        8.2          46%
ASL+noise      Euclidean   2.1          11%
ASL+noise      DTW         9.3          15%
ASL+noise      LCSS        8.3          31%
Method      Complexity   Elastic Matching   One-to-one Matching   Noise Robustness
Euclidean   O(n)         -                  yes                   -
DTW         O(n*δ)       yes                -                     -
LCSS        O(n*δ)       yes                -                     yes
This DB can typically fit in memory
[Figure: sequences A, B, C and the query in feature space (Feature 1, Feature 2)]
$D_{LB}(a,b) \le D_{true}(A,B)$
simplified query
original DB
Answer Superset
Verify against original DB
Final Answer set
Lower Bounds    True Distance
4.6399          46.7790
37.9032         108.8856
19.5174         113.5873
72.1846         104.5062
67.1436         119.4087
78.0920         120.0066
70.9273         111.6011
63.7253         119.0635
1.4121          17.2540
BestSoFar
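The lower bounds and BestSoFar are used together in a filter-and-refine search: candidates are visited in order of their lower bound, the true distance is computed only while the bound is still below BestSoFar, and the search stops as soon as it is not. A minimal sketch, assuming toy data and a deliberately simple lower bound (the Euclidean distance on the first k values only, which can never exceed the full distance):

```matlab
DB = [0 0 0 0;                 % toy database, one sequence per row (assumed)
      1 1 1 1;
      5 5 5 5;
      0 1 0 1];
q = [0 0.2 0 1];               % toy query
k = 2;                         % "compressed" form: keep the first k values (assumed)

% lower bound for every sequence, computed on the compressed form only
lb = sqrt(sum((DB(:,1:k) - repmat(q(1:k), size(DB,1), 1)).^2, 2));
[~, order] = sort(lb);         % most promising candidates first

bestSoFar = inf;
bestIdx = 0;
numFull = 0;                   % how many true distances were computed
for t = 1:numel(order)
    i = order(t);
    if lb(i) >= bestSoFar
        break;                 % no remaining candidate can beat the best match
    end
    dTrue = sqrt(sum((DB(i,:) - q).^2));   % verify against the original data
    numFull = numFull + 1;
    if dTrue < bestSoFar
        bestSoFar = dTrue;
        bestIdx = i;
    end
end
fprintf('best match: %d (distance %g), true distances computed: %d\n', bestIdx, bestSoFar, numFull);
```

Because the bound never overestimates the true distance, the early exit cannot discard the real nearest neighbor.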
DFT
DWT
SVD
APCA
PAA
PLA
Fourier Decomposition
Decompose a time-series into sum of sine waves
DFT:
IDFT:
Every signal can be represented as a superposition of sines and cosines (alas nobody believes me)
fa = fft(a); % Fourier decomposition
fa(5:end) = 0; % keep first 4 coefficients (low frequencies), zero out the rest
reconstr = real(ifft(fa)); % reconstruct signal (real part only)
X(f)                  x(n)
-0.3633               -0.4446
-0.6280 + 0.2709i     -0.9864
-0.4929 + 0.0399i     -0.3254
-1.0143 + 0.9520i     -0.6938
 0.7200 - 1.0571i     -0.1086
-0.0411 + 0.1674i     -0.3470
-0.5120 - 0.3572i      0.5849
 0.9860 + 0.8043i      1.5927
-0.3680 - 0.1296i     -0.9430
-0.0517 - 0.0830i     -0.3037
-0.9158 + 0.4481i     -0.7805
 1.1212 - 0.6795i     -0.1953
 0.2667 + 0.1100i     -0.3037
 0.2667 - 0.1100i      0.2381
 1.1212 + 0.6795i      2.8389
-0.9158 - 0.4481i     -0.7046
-0.0517 + 0.0830i     -0.5529
-0.3680 + 0.1296i     -0.6721
 0.9860 - 0.8043i      0.1189
-0.5120 + 0.3572i      0.2706
-0.0411 - 0.1674i     -0.0003
 0.7200 + 1.0571i      1.3976
-1.0143 - 0.9520i     -0.4987
-0.4929 - 0.0399i     -0.2387
-0.6280 - 0.2709i     -0.7588
Fourier Decomposition
How much space do we gain by compressing random walk data?
[Figure: Error and Energy Percentage plotted against the number of Coefficients]
Fourier Decomposition
Which coefficients are important?
We can measure the energy of each coefficient
$\mathrm{Energy} = \mathrm{Real}(X(f_k))^2 + \mathrm{Imag}(X(f_k))^2$
Most of data-mining research uses first k coefficients:
Easy to index
fa = fft(a); % Fourier decomposition
N = length(a); % how many?
fa = fa(1:ceil(N/2)); % keep first half only
mag = 2*abs(fa).^2; % calculate energy
Usage of the coefficients with highest energy:
Believed to be difficult to index
% z-normalization
fa = fft(a);
maxInd = ceil(length(a)/2); % until the middle
N = length(a);
% energy of a
for ind=2:maxInd,
    fa_N = fa; % copy fourier
    fa_N(ind+1:N-ind+1) = 0; % zero out unused
    r = real(ifft(fa_N)); % reconstruction
end
X(f): keep the first coefficients and their conjugates at the end of the array; ignore the ones in the middle
Euclidean distance between x and y: 120.9051
fx = fft(x)/sqrt(length(x));
fy = fft(y)/sqrt(length(x));
euclid_Freq = sqrt(sum(abs(fx - fy).^2));
Keeping 10 coefficients the distance is:
115.5556 < 120.9051
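The property behind this comparison can be verified on any pair of sequences: with the 1/sqrt(N) normalization the distance computed on all Fourier coefficients equals the distance in the time domain, and a distance computed on a subset of the coefficients can only be smaller. A short sketch with two deterministic toy sequences (assumed, not the x and y of the slide):

```matlab
x = cumsum(sin(1:64));                 % toy sequences (assumed data)
y = cumsum(cos(1:64));

euclid_Time = sqrt(sum((x - y).^2));   % distance in the time domain

fx = fft(x)/sqrt(length(x));           % normalized Fourier coefficients
fy = fft(y)/sqrt(length(y));
euclid_Freq = sqrt(sum(abs(fx - fy).^2));

k = 10;                                % keep only the first k coefficients
lowerBound = sqrt(sum(abs(fx(1:k) - fy(1:k)).^2));

fprintf('time: %g, frequency: %g, first %d coefficients: %g\n', ...
        euclid_Time, euclid_Freq, k, lowerBound);
```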
Fourier Decomposition
O(nlogn) complexity
Tried and tested
Hardware implementations
Many applications:
compression
smoothing
periodicity detection
Not good approximation for bursty signals
Not good approximation for signals with flat and busy sections (requires many coefficients)
X = [9,7,3,5]
Haar = [6,2,1,-1]
c = 6 = (9+7+3+5)/4
c + d00 = 6+2 = 8 = (9+7)/2
c - d00 = 6-2 = 4 = (3+5)/2
etc
See also: wavemenu
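The numbers above can be reproduced with a few lines of averaging and differencing. This is a minimal sketch of the unnormalized Haar decomposition for a sequence whose length is a power of two; it does not use the Wavelet Toolbox, and the variable names are only illustrative:

```matlab
X = [9 7 3 5];
approx = X;                    % current approximation
coeffs = [];                   % detail coefficients, coarsest first

while numel(approx) > 1
    avg    = (approx(1:2:end) + approx(2:2:end)) / 2;   % pairwise averages
    detail = (approx(1:2:end) - approx(2:2:end)) / 2;   % pairwise half-differences
    coeffs = [detail coeffs];  % finer details go to the right
    approx = avg;
end
haar = [approx coeffs];        % [c d00 d10 d11]
disp(haar);
```

For X = [9 7 3 5] the first pass gives the averages [8 4] and details [1 -1]; the second pass gives c = 6 and d00 = 2.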
Wavelets in Matlab
Specialized Matlab interface
for wavelets
See also: wavemenu
% length of sequence
% assume it's integer
% break in segments
% average segments
% expand segments
% make row
Example: N=8, segLen = 2 (numCoeff segments)
avg = 1.5 3.5 5.5 7.5
data = 1.5 1.5 3.5 3.5 5.5 5.5 7.5 7.5
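Only the comments of the PAA code survived. The sketch below pairs each of them with one plausible line; it is a reconstruction under assumptions, not the original listing. The input s = 1:8 and numCoeff = 4 are assumed because they reproduce the values N = 8, segLen = 2 and the averages shown above:

```matlab
s = 1:8;                               % assumed input sequence
numCoeff = 4;                          % assumed number of segments

N = length(s);                         % length of sequence
segLen = N / numCoeff;                 % assume it's integer
sN = reshape(s, segLen, numCoeff);     % break in segments (one segment per column)
avg = mean(sN, 1);                     % average segments
data = repmat(avg, segLen, 1);         % expand segments
data = data(:)';                       % make row
disp(avg);
disp(data);
```

avg is the compressed representation (numCoeff numbers), while data is the piecewise-constant approximation at the original length.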
PAA: Segments of equal size
APCA: Segments of variable size
Wavelet Decomposition
O(n) complexity
Hierarchical structure
Progressive transmission
Better localization
Good for bursty signals
Many applications:
compression
periodicity detection
Most data-mining research still utilizes Haar wavelets because of their simplicity.
O(nlogn) complexity for bottom up algorithm
Incremental computation possible
Provable error bounds
Applications for:
Image / signal simplification
Trend detection
Visually not very smooth or pleasing.
We need 2 numbers (x,y) for every point
Now we can describe each point with 1 number, their projection on the line
A: M sequences, each of length n
[U,S,V] = svd(A)
[Figure: eigenwave 0, ..., eigenwave 3, eigenwave 4]
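A sketch of how the decomposition can be used for dimensionality reduction, assuming the sequences are stored as the rows of A (the slide does not say whether rows or columns are used) and that k = 2 components are kept. Projecting on the first k columns of V gives k numbers per sequence; distances between the projections never exceed the original distances:

```matlab
t = 1:8;
A = [sin(t/2);                         % M = 4 toy sequences of length n = 8 (assumed data)
     cos(t/2);
     t/8;
     sin(t/2) + 0.1*cos(t)];

[U, S, V] = svd(A);
k = 2;                                 % assumed number of components to keep
Ak = A * V(:, 1:k);                    % M x k reduced representation
Ar = Ak * V(:, 1:k)';                  % reconstruction back to M x n

s = diag(S);                           % singular values
errFro = norm(A - Ar, 'fro');          % reconstruction error
dFull = norm(A(1,:) - A(2,:));         % distance between two original sequences
dRed  = norm(Ak(1,:) - Ak(2,:));       % distance between their reduced forms
fprintf('error = %g, full distance = %g, reduced distance = %g\n', errFro, dFull, dRed);
```

The reconstruction error equals the energy of the discarded singular values, which is the sense in which the reduction is optimal for Euclidean distance.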
Optimal dimensionality reduction in Euclidean distance sense
SVD is a very powerful tool in many domains:
Websearch (PageRank)
Cannot be applied for just one sequence. A set of sequences is required.
Addition of a sequence in database requires recomputation
Very costly to compute.
Time: $\min\{ O(M^2 n),\; O(M n^2) \}$
Space: $O(Mn)$
M sequences of length n
Symbolic Approximation
Assign a different symbol based on range of values
Find ranges either from data histogram or uniformly
[Figure: a time-series divided into the value ranges a, b, c and converted to the string baabccbc]
You can find an implementation here:
http://www.ise.gmu.edu/~jessica/sax.htm
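The two steps described above (pick value ranges, assign one symbol per range) can be sketched as follows. This is a simplified illustration, not the implementation behind the link: it assumes uniform ranges between the minimum and maximum value, a three-letter alphabet, and segment averages as the values to be symbolized.

```matlab
s = [0.1 0.2 0.9 1.0 0.5 0.55 0.0 0.05];   % toy sequence (assumed data)
numSeg = 4;                                % assumed number of segments
alphabet = 'abc';                          % assumed alphabet

segLen = numel(s) / numSeg;                % assume it's integer
avg = mean(reshape(s, segLen, numSeg), 1); % one average per segment

% uniform value ranges between min and max
edges = linspace(min(s), max(s), numel(alphabet) + 1);

sym = repmat(alphabet(1), 1, numSeg);      % start with the lowest symbol
for j = 2:numel(alphabet)
    sym(avg >= edges(j)) = alphabet(j);    % promote segments above each boundary
end
disp(sym);
```

Choosing the boundaries from the data histogram instead of uniformly only changes how edges is computed.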
Symbolic Approximations
Linear complexity
After symbolization many tools from bioinformatics can be used
Markov models
Suffix-Trees, etc
Number of regions (alphabet length) can affect the quality of result
Multidimensional Time-Series
Ari, are you sure the world is not 1D?
Aristotle
Multidimensional MBRs
Find bounding rectangles that completely contain a trajectory,
given some optimization criteria (e.g. minimize volume)
1993: Fourier is good
2000: PAA!
2001: ... than PAA!
2004: Chebyshev is better than APCA!
2005: The future is symbolic!
Comparisons
Let's see how tight the lower bounds are on a variety of 65 datasets
Average Lower Bound
A. No approach is better on all datasets
B. Best coeff. techniques can offer tighter bounds
C. Choice of compression depends on application
Time Comparisons
We will use DTW (and the corresponding LBs) for recognition of hand-written digits/shapes.
$LCSS(MBE_Q, A) \ge LCSS(Q, A)$
Query
Indexed Sequence
44 points
6 points
Sim. = 50/77 ≈ 0.65
Word annotation:
1. Extract words from document
2. Extract image features
3. Annotate a subset of words
4. Classify remaining words
Features:
[Figure: Feature Value plotted against Column]
PKDD 2005
Porto
Weblog of user requests over time
Priceline
Privacy preserving
Google Zeitgeist
Requests (1 year span, Jan to Dec)
Query: xbox
Query: ps2
Query: Bach
Matching results I
Query = Lance Armstrong
LeTour
Tour De France
[Figure: requests over time, 2000 to 2002]
Matching results II
Query = Christmas
Periodic Matching
Frequency: F(x), F(y)
Ignore Phase / Keep important components
Calculate Distance
$D_1 = \| F(x^+) - F(y^+) \|$
Periodogram
$D_2 = \| F(x^+) - F(y^+) \|$
[Figure: periodograms for the queries cinema, stock, easter, christmas]
Harry Potter 1 (Movie)
Harry Potter 1 (DVD)
Burst Detection
Burst detection is similar to anomaly detection.
Create distribution of values (e.g. Gaussian model)
Any value that deviates from the observed distribution (e.g. by more than 3 std) can be considered a burst.
Valentine's Day
Mother's Day
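A minimal sketch of this rule, assuming a toy sequence of request counts with one obvious spike and treating only values above the mean as bursts:

```matlab
% Toy daily request counts with a single spike (assumed data)
x = [10 11 9 10 12 10 11 9 10 60 10 11 9 10 12 10 11 9 10 11];

mu    = mean(x);                  % Gaussian model of the values
sigma = std(x);

isBurst  = x > mu + 3*sigma;      % more than 3 std above the mean
burstIdx = find(isBurst);         % positions of the bursts
disp(burstIdx);
```

On long sequences the mean and std are often computed over a moving window instead of the whole sequence, so that the model follows slow changes in the baseline.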
Query-by-burst
To perform query-by-burst we can perform the following steps:
1.
2.
3.
Query-by-burst Results
Queries             Matches
www.nhc.noaa.gov    Tropical Storm
Pentagon attack     Nostradamus prediction
Cheap gifts         Scarfs
Periodic Measure
Incorrect Grouping
[Figure: dendrogram of 36 sequences]
PRICELINE:
Stock value dropped
NICE SYSTEMS:
Stock value increased
(provider of air traffic
control systems)
Conclusion
The traditional shape matching measures cannot address all time-series matching problems and applications.
Structural distance measures can provide more flexibility.
There are many other exciting time-series problems that haven't been covered in this tutorial:
Anomaly Detection
Rule Discovery
etc
I don't want to achieve immortality through my work... I want to achieve it through not dying.