Time Series Analysis With Matlab Tutorials
Disclaimer
Feel free to use any of the following slides for educational purposes, however kindly acknowledge the source. We would also like to know how you have used these slides, so please send us emails with comments or suggestions.
The goal of this tutorial is to show you that time-series research (or research in general) can be made fun, when it involves visualizing ideas, that can be achieved with concise programming. Matlab enables us to do that.
Will I be able to use MATLAB right away after the tutorial? I am definitely smarter than her, but I am not a time-series person, per se. I wonder what I will gain from this tutorial.
We are not affiliated with Mathworks in any way, but we do like using Matlab a lot, since it makes our lives easier. Errors and bugs are most likely contained in this tutorial. We might be responsible for some of them.
Overview
PART A: The Matlab programming environment
PART B: Basic mathematics: introduction / geometric intuition; coordinates and transforms; quantized representations; non-Euclidean distances
PART C: Similarity search and applications: introduction; representations; distance measures; lower bounding; clustering/classification/visualization; applications
The greatest value of a picture is that it forces us to notice what we never expected to see. -- John Tukey
Matlab
Interpreted language; easy code maintenance (code is very compact); very fast array/vector manipulation; support for OOP.
Easy plotting and visualization.
Easy integration with other languages/OSs: interact with C/C++, COM objects, DLLs; built-in Java support (and compiler); ability to make executable files.
Multi-platform support (Windows, Mac, Linux).
Extensive number of toolboxes: Image, Statistics, Bioinformatics, etc.
Cleve Moler
Mathworks is still privately owned. Used in >3,500 universities, with >500,000 users worldwide. 2005 revenue: >$350M; 2005 employees: 1,400+.
Pricing: starts from $1,900 (commercial use), ~$100 (Student Edition).
Money is better than poverty, if only for financial reasons.
Matlab 7.3
R2006b, Released on Sept 1 2006
Distributed computing; better support for large files; new Optimization Toolbox; Matlab Builder for Java (create Java classes from Matlab).
Tutorial | Time-Series with Matlab
Personally, I'm always ready to learn, although I do not always like being taught. -- Sir Winston Churchill
Starting up Matlab
Commands like: cd, pwd, mkdir
Matlab Environment
For navigation it is easier to just copy/paste the path from Explorer, e.g.: cd c:\documents\
Workspace: Loaded Variables/Types/Size
Populating arrays
Plot sinusoid function
a = [0:0.3:2*pi]   % generate values from 0 to 2*pi (with step of 0.3)
b = cos(a)         % evaluate cos at the positions contained in array a
plot(a,b)          % plot a (x-axis) against b (y-axis)
Array Access
Access array elements
>> a(1)
ans =
     0
>> a(1:3)
ans =
     0    0.3000    0.6000
2D Arrays
Can access whole columns or rows
A good listener is not only popular everywhere, but after a while he gets to know something. Wilson Mizner
Column-wise computation
For arrays greater than 1D, all computations happen column-by-column
>> a = [1 2 3; 3 2 1]
a =
     1     2     3
     3     2     1
>> mean(a)
ans =
     2     2     2
>> max(a)
ans =
     3     2     3
>> sort(a)
ans =
     1     2     1
     3     2     3
Concatenating arrays
Column-wise or row-wise
Initializing arrays
Create array of ones [ones]
>> a = ones(1,3)
a =
     1     1     1
>> a = ones(2,2)*5
a =
     5     5
     5     5
For a 1x4 array: rows = 1, columns = 4, length = 4.
Iris dataset: 4 dimensions (petal length & width, sepal length & width), 3 species: virginica / versicolor / setosa.
meas (150x4 array): Holds 4D measurements
... 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'virginica' 'virginica' 'virginica' 'virginica
idx_setosa
... 1 1 1 0 0 0
... species (150x1 cell array): Holds name of species for the specific measurement
An array of zeros and ones indicating the positions where the keyword setosa was found
The world is governed more by appearances than by realities. -- Daniel Webster
scatter3
Zoom in
Create line
Computers are useless. They can only give you answers.
Create Arrow
>> grid on;      % show grid on axis
>> rotate3d on;  % rotate with mouse
Select Object
Add text
Other Styles:
If this represents a year's worth of measurements of an imaginary quantity, we will change: the x-axis annotation to months; the axis labels; put a title in the figure; and include some Greek letters in the title, just for fun.
The result
Saving Figures
Matlab allows you to save figures (.fig) for later processing.
.fig files can later be opened through Matlab:
>> xlabel('Month of 2005')
You can always put off until tomorrow what you can do today. -- Anonymous
Exporting Figures
Matlab code:
% export to color EPS
print -depsc myImage.eps          % from the command line
print(gcf, '-depsc', 'myImage')   % using a variable as name
colormap
bars
time = [100 120 80 70];                 % our data
h = bar(time);                          % get handle
cmap = [1 0 0; 0 1 0; 0 0 1; .5 0 1];   % colors
colormap(cmap);                         % create colormap
cdata = [1 2 3 4];                      % assign colors
set(h,'CDataMapping','direct','CData',cdata);
data = [10 8 7; 9 6 5; 8 6 4; 6 5 4; 6 3 2; 3 2 1];
bar3([1 2 3 5 6 7], data);
c = colormap(gray);   % get colors of colormap
c = c(20:55,:);       % get some colors
colormap(c);          % new colormap
Creating .m files
Standard text files
Script: a series of Matlab commands (no input/output arguments).
Function: a program that accepts input and returns output.
The value at position x-y of the array indicates the height of the surface
data = [1:10]; data = repmat(data,10,1); % create data surface(data,'FaceColor',[1 1 1], 'Edgecolor ', [0 0 1]); % plot data 'Edgecolor', view(3); grid on; % change viewpoint and put axis lines
The following script will create an array with 10 random walk vectors.
Sample Script
Recall cumsum: for A = [1 2 3 4 5], cumsum(A) = [1 3 6 10 15].
a = cumsum(randn(100,10));       % 10 random walk vectors of length 100
for i = 1:size(a,2),             % number of columns
    data = a(:,i);
    fname = [num2str(i) '.dat']; % a string is a vector of characters!
    save(fname, 'data', '-ASCII'); % save each column in a text file
end
Write this in the M editor.
Functions in .m scripts
When we need to: organize our code; frequently change parameters in our scripts.
Anatomy of a function declaration: keyword, output argument, function name, input argument.
Cell Arrays
Cells that hold other Matlab arrays. Let's read the files of a directory:
>> f = dir('*.dat')   % read directory contents
f = 15x1 struct array with fields:
    name
    date
    bytes
    isdir
>> for i = 1:length(f),
       a{i} = load(f(i).name);
       N = length(a{i});
       plot3([1:N], a{i}(:,1), a{i}(:,2), 'r-', 'Linewidth', 1.5);
       grid on; pause; cla;
   end
Struct array: access fields as f(1).name, f(2).name, ...
function dataN = zNorm(data)
% ZNORM z-normalization of a vector:
% subtract mean and divide by std
if (nargin < 1),     % check parameters
    error('Not enough arguments');
end
data = data - mean(data);   % subtract mean
data = data / std(data);    % divide by std
dataN = data;
Reading/Writing Files
Load/save are faster than C-style I/O operations, but fscanf and fprintf can be useful for file formatting or for reading non-Matlab files.
fid = fopen('fischer.txt', 'wt');
for i = 1:length(species),
    fprintf(fid, '%6.4f %6.4f %6.4f %6.4f %s\n', meas(i,:), species{i});
end
fclose(fid);
Flow Control/Loops
if (else/elseif) , switch
Check logical conditions
while
Execute statements an indefinite number of times (while a condition holds)
for
Execute statements a fixed number of times
Output file:
Life is pleasant. Death is peaceful. It's the transition that's troublesome. -- Isaac Asimov
For-Loop or vectorization?
clear all;
tic;
for i = 1:50000
    a(i) = sin(i);
end
toc
elapsed_time = 5.0070
Pre-allocate arrays that store output results: no need for Matlab to resize every time. Functions are faster than scripts (they are compiled into pseudocode). Load/save are faster than Matlab I/O functions. After v. 6.5 of Matlab there is for-loop vectorization (in the interpreter).
Matlab Profiler
Find which portions of code take up most of the execution time Identify bottlenecks Vectorize offending code
clear all;
a = zeros(1,50000);   % pre-allocate
tic;
for i = 1:50000
    a(i) = sin(i);
end
toc
elapsed_time = 0.1400
elapsed_time = 0.0200
Hints & Tips
There is always an easier (and faster) way
Typically there is a specialized function for what you want to achieve
Debugging
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald Knuth
Not as frequently required as in C/C++ Set breakpoints, step, step in, check variables values
Set breakpoints
Debugging
Either this man is dead or my watch has stopped.
Full control over variables and the execution path. F10: step; F11: step in (visits functions as well).
azimuth = [50:100 99:-1:50];   % azimuth range of values
for k = 1:length(azimuth),
    plot3(1:length(a), a(:,1), a(:,2), 'r', 'Linewidth', 2);
    grid on;
    view(azimuth(k), 30);      % change viewpoint
    M(k) = getframe;           % save the frame
end
movie(M,20);                   % play movie 20 times
See also: movie2avi
Matlab Toolboxes
You can buy many specialized toolboxes from Mathworks: Image Processing, Statistics, Bio-Informatics, etc. There are many equivalent free toolboxes too. SVM toolbox:
http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox/
I've had a wonderful evening. But this wasn't it.
Mathworks archived webinars: http://www.mathworks.com/company/events/archived_webinars.html?fp
Wavelets
http://www.math.rutgers.edu/~ojanen/wavekit/
Google groups: comp.soft-sys.matlab. You can find *anything* here; someone else had the same problem before you!
Speech Processing
http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
Bayesian Networks
http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
Overview of Part B
1. Introduction and geometric intuition
2. Coordinates and transforms: Fourier transform (DFT); wavelet transform (DWT); incremental DWT; principal components (PCA); incremental PCA
3. Quantized representations: piecewise quantized / symbolic; vector quantization (VQ) / K-means
4. Non-Euclidean distances: dynamic time warping (DTW)
Eighty percent of success is showing up.
What is a time-series
Definition: a sequence of measurements over time.
Medicine, stock market, meteorology, geology, astronomy, chemistry, biometrics, robotics.
Example (ECG): 64.0, 62.8, 62.0, 66.0, 62.0, 32.0, 86.4, ...
Applications
Images
Shapes
Motion capture
Acer platanoides
Earthquake
Time-Series
Salix fragilis
more to come
Time Series
A time series is a sequence of values over time, e.g. x = (3, 8, 4, 1, 9, 6): the values x1, ..., x6 plotted against time.
Mean
Definition: mu = (1/N) * sum_{t=1..N} x_t
Variance
Definition: sigma^2 = (1/N) * sum_{t=1..N} (x_t - mu)^2
From now on, we will generally assume zero mean (mean normalization: x_t <- x_t - mu).
From now on, we will generally assume unit variance (variance normalization: x_t <- x_t / sigma).
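These two normalizations together are the usual z-normalization. As a quick illustration (in NumPy here, although the tutorial's own code is Matlab; the function name `znorm` is just our choice):

```python
import numpy as np

def znorm(x):
    # subtract the mean, then divide by the (population) standard deviation
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

z = znorm([2.0, 4.0, 6.0, 8.0])
# z has mean 0 and standard deviation 1
```

After this step, comparing two series by Euclidean distance is equivalent to comparing their shapes, regardless of offset and scale.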
Variance = Length
Variance of a zero-mean series: sigma^2 = ||x||^2 / N, so (up to scaling) the variance is the squared length of the vector x.
Correlation = Angle
Correlation of normalized series: rho = x . y, the cosine of the angle between the two vectors; the residual is the part of y not explained by the slope on x.
Cosine law: ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, so that for normalized series the Euclidean distance is a function of the correlation alone.
Example (exchange rates): the scatter of FRF vs. CAD has rho = -0.23; FRF vs. BEF has rho = 0.99.
Ergodicity
Example
Assume I eat chicken at the same restaurant every day. Question: how often is the food good?
Answer one:
Answer two:
Stationarity
Example
Ergodicity is a common and fundamental assumption, but sometimes it can be wrong. Example: the total number of murders this year is 5% of the population; if I live 100 years, then I will commit about 5 murders, and if I live 60 years, I will commit about 3 murders: non-ergodic! Such ergodicity assumptions on population ensembles are commonly called racism.
Autocorrelation
Definition: r(tau) = E[ x_t * x_{t+tau} ] (for a zero-mean, unit-variance series).
It is well-defined if and only if the series is (weakly) stationary, and depends only on the lag tau, not on the time t.
Time-domain coordinates
A series is a vector, written coordinate-by-coordinate in the time-domain basis, e.g.
x = (-0.5)*e1 + (-2)*e2 + 2*e3 + 3.5*e4 + 1.5*e5 + ...
Orthonormal basis
Set of N vectors, { e1, e2, ..., eN }:
Normal: ||ei|| = 1, for all 1 ≤ i ≤ N
Orthogonal: ei . ej = 0, for i ≠ j
Note that the coefficients xi w.r.t. the basis { e1, ..., eN } are the corresponding similarities of x to each basis vector/series: xi = x . ei.
Orthonormal bases
The time-domain basis is a trivial tautology: each coefficient is simply the value at one time instant.
Basic concepts: series = vector. In another orthonormal basis the same series has different coefficients, e.g. x = 5.6*e1' + (-2.2)*e2' + 2.8*e3' + ...
Frequency
One cycle every 20 time units (the period).
Why is the period 20, and not 8, 10, or 40?
It's not 8, because its similarity (projection) to a period-8 series (of the same length) is zero: x . (period-8 series) = 0.
It's not 40, because its similarity (projection) to a period-40 series (of the same length) is zero.
It's not 10, because its similarity (projection) to a period-10 series (of the same length) is zero.
... and so on.
Frequency
To find the period, we compared the time series with sinusoids of many different periods Therefore, a good description (or basis) would consist of all these sinusoids This is precisely the idea behind the discrete Fourier transform
The coefficients capture the similarity (in terms of amplitude and phase) of the series with sinusoids of different periods
Technical details:
We have to ensure we get an orthonormal basis Real form: sines and cosines at N/2 different frequencies Complex form: exponentials at N different frequencies
Fourier transform
Real form: x_t = a_0 + sum_{k=1..N/2} [ a_k cos(2*pi*f_k*t) + b_k sin(2*pi*f_k*t) ], where f_k = k/N is the k-th frequency.
The pair of bases at frequency f_k are the cosine and sine at that frequency; sqrt(a_k^2 + b_k^2) and arctan(b_k/a_k) are the amplitude and phase, respectively; a_0 is the zero-frequency (mean) component.
Complex form: the equations become easier to handle if we allow the series x_t and the Fourier coefficients X_k to take complex values:
X_k = (1/sqrt(N)) * sum_{t=0..N-1} x_t * exp(-2*pi*i*k*t/N)
Matlab note: fft omits the 1/sqrt(N) scaling factor and is not unitary; however, ifft includes a 1/N scaling factor, so always ifft(fft(x)) == x.
Fourier transform: Example
The GBP exchange-rate series reconstructed from its 3, 5, 10, and 20 largest frequencies: keeping more frequencies gives a closer reconstruction.
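A reconstruction like this one can be sketched in a few lines. This NumPy version (an illustration, not the tutorial's code; `dft_topk` is our own name) keeps only the k largest-magnitude DFT coefficients and inverts:

```python
import numpy as np

def dft_topk(x, k):
    # keep only the k largest-magnitude Fourier coefficients of x
    X = np.fft.fft(x)
    if k < len(x):
        X[np.argsort(np.abs(X))[:len(x) - k]] = 0  # zero the smallest ones
    return np.fft.ifft(X).real

t = np.arange(128)
x = np.sin(2 * np.pi * t / 16) + 0.1 * np.random.randn(128)
x_smooth = dft_topk(x, 8)  # coarse, smoothed reconstruction
```

Keeping more coefficients can only decrease the L2 reconstruction error (by Parseval's theorem), which is what the successive panels illustrate.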
No single cycle, because the series isn't exactly similar to any single-period series of the same length.
What if we examined, e.g., eight values at a time? Then we can only compare with periods up to eight.
The results may be different for each group (window).
Wavelets
Intuition
Main idea
Use small windows for small periods
Remove high-frequency component, then
Repeat recursively
Technical details
Need to ensure we get an orthonormal basis
Wavelets: Intuition
Time-frequency tilings: Fourier and DCT resolve frequency only; the STFT uses fixed-size windows (uniform tiles in time and frequency); wavelets use small windows at high frequencies and large windows at low frequencies (tiles whose scale varies).
Wavelet transform: Pyramid algorithm
At each level, the signal is split by a high-pass filter (detail coefficients w_i) and a low-pass filter (smooth coefficients v_i); the low-pass output is then processed recursively:
x  -> high pass -> w1;   x  -> low pass -> v1
v1 -> high pass -> w2;   v1 -> low pass -> v2
v2 -> high pass -> w3;   v2 -> low pass -> v3
...
Wavelet transforms: General form
Other filter examples: Haar (Daubechies-1), Daubechies-2, Daubechies-3.
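For the Haar filters, the pyramid algorithm is just pairwise (scaled) sums and differences. A minimal NumPy sketch (our own illustration, assuming the length is a power of two; `haar_dwt` is a name we chose):

```python
import numpy as np

def haar_dwt(x):
    # full Haar pyramid: returns [w1, w2, ..., v_last]
    v = np.asarray(x, dtype=float)
    levels = []
    while len(v) > 1:
        pairs = v.reshape(-1, 2)
        w = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # high pass: details
        v = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)  # low pass: smooth
        levels.append(w)
    levels.append(v)  # final smooth (approximation) coefficient
    return levels

coeffs = haar_dwt([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
```

Because the basis is orthonormal, the transform preserves energy (the sum of squares of the coefficients equals that of the original values).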
Wavelets: Example
Wavelet coefficients of the GBP series: detail levels W1-W6 (D1-D6) and the smooth approximation V6 (A6), shown for the Haar and Daubechies-3 bases.
Wavelets: Example
Multi-resolution analysis of the GBP series (Haar and Daubechies-3): the series decomposes into orthogonal detail components, Di . Dj = 0 for i ≠ j, plus the smooth approximation A6.
Other wavelets
Only scratching the surface: wavelet packets (all possible (binary) tilings), best-basis transform, ...
More on wavelets
Signal representation and compressibility: quality (% energy) vs. compression (% coefficients) for the time-domain, FFT, Haar, and DB3 representations.
Overcomplete wavelet transform (ODWT), aka maximum-overlap wavelets (MODWT), aka shift-invariant wavelets.
Further reading:
1. Donald B. Percival, Andrew T. Walden, Wavelet Methods for Time Series Analysis, Cambridge Univ. Press, 2006.
2. Gilbert Strang, Truong Nguyen, Wavelets and Filter Banks, Wellesley College, 1996.
3. Tao Li, Qi Li, Shenghuo Zhu, Mitsunori Ogihara, A Survey of Wavelet Applications in Data Mining, SIGKDD Explorations, 4(2), 2002.
More wavelets
Keeping the highest coefficients minimizes the total error (L2 distance). Other coefficient selection/thresholding schemes exist for different error metrics (e.g., maximum per-instant error, or L1 distance).
Typically use Haar bases.
Further reading:
1. Minos Garofalakis, Amit Kumar, Wavelet Synopses for General Error Metrics, ACM TODS, 30(4), 2005.
2. Panagiotis Karras, Nikos Mamoulis, One-pass Wavelet Synopses for Maximum-Error Metrics, VLDB 2005.
Wavelets: Incremental estimation
When a new value arrives, only the wavelet coefficients along the path from the new leaf to the root of the coefficient tree need to be updated (a post-order traversal).
Forward transform: O(1) amortized time per value (constant factor: the filter length).
Inverse transform: same complexity.
Fourier and wavelets are the most prevalent and successful descriptions of time series. Next, we will consider collections of M time series, each of length N.
What is the series that is most similar to all series in the collection? What is the second most similar, and so on
23
( 0)
CAD
AUD
BEF
u2
U2
0 -0.05 0.05 0 -0.05 0.05 0 -0.05 500 1000 Time 1500 2000 2500
50
SEK
40
GBP 2 0 -2
FRF
u4
U4
u3
CAD
U3
30
AUD
2 0 -2 FRF
DEM
i,2
JPY
20
ESP
0.05
2 0 -2
U1
NZL
NLG
NZL
0
2 0 -2 NLG
CHF
2 0 -2
ESP
-10
SEK
-20
JPY
2 0 -2
CHF
GBP
-30
-20
-10
10
i,1
20
30
40
50
60
X = U Λ V^T
X is the M x N matrix whose M rows are the time series x(1), ..., x(M), each of length N. Keeping the k largest singular values: U (M x k) holds the coefficients of each series, Λ = diag(λ1, ..., λk) is k x k, and the rows v1, ..., vk of V^T (k x N) form an orthonormal basis for the measurements (rows).
Further reading:
1. Ian T. Jolliffe, Principal Component Analysis (2nd ed), Springer, 2002.
2. Gilbert Strang, Linear Algebra and Its Applications (4th ed), Brooks Cole, 2005.
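The decomposition itself is one library call. A hypothetical NumPy sketch on synthetic data (20 noisy scalings of one common pattern; everything here, including the data, is our own illustration), keeping k = 1 component:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100)
pattern = np.sin(2 * np.pi * t / 25)
# 20 series, each a random multiple of the pattern plus noise
X = np.outer(rng.uniform(0.5, 2.0, 20), pattern)
X = X + 0.05 * rng.standard_normal(X.shape)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 1
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]  # best rank-k approximation of X
```

The rows of Vt play the role of the basis v1, ..., vk; here, a single component already captures almost all of the energy.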
Why kernels? We no longer have explicit coordinates, and the objects do not even need to be numeric. But we can still talk about distances and angles, and many algorithms rely on just these two concepts; for arbitrary similarities, we can still find the eigendecomposition of the similarity matrix.
Multidimensional scaling (MDS): maps arbitrary metric data into a low-dimensional space (example: MDS map of the exchange-rate series, with currencies such as CAD, AUD, JPY, SEK, ESP, GBP, NZL placed by similarity).
Further reading:
1. Bernhard Schölkopf, Alexander J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, MIT Press, 2001.
2. M. Ghil, et al., Advanced Spectral Methods for Climatic Time Series, Rev. Geophys., 40(1), 2002.
Principal components: Incremental estimation
Recap: X (M x N) = U (M x k) * Λ (k x k, diagonal) * V^T (k x N).
Principal components: Incremental estimation example
Consider two temperature series x(1) and x(2) (values around 20°C-30°C) and plot each pair of simultaneous measurements as a point (x(1)_t, x(2)_t). The first few pairs, and the other pairs too, lie (approximately) on a line. We then need only O(M) numbers for the slope, plus one number for each measurement pair (its offset on the line = its principal-component coefficient).
To update the line for each new value: project onto the current line, estimate the error, then rotate the line in the direction of the error and in proportion to its magnitude. This takes O(M) time per new value.
Principal components: Incremental update
The line is the first principal component (PC) direction. This line is optimal: it minimizes the sum of squared projection errors.
For each new point x_t and for j = 1, ..., k:
  y_j := v_j^T x_t                  (projection onto v_j)
  σ_j^2 <- σ_j^2 + y_j^2            (energy, proportional to the j-th eigenvalue)
  e_j := x_t - y_j v_j              (error)
  v_j <- v_j + (1/σ_j^2) y_j e_j    (update estimate)
  x_t <- x_t - y_j v_j              (repeat with the remainder)
O(Mk) space (total) and time (per tuple), i.e., independent of the number of points, linear in the number of streams (M) and in the number of principal components (k).
Further reading:
1. Sudipto Guha, Dimitrios Gunopulos, Nick Koudas, Correlating synchronous and asynchronous data streams, KDD 2003.
2. Spiros Papadimitriou, Jimeng Sun, Christos Faloutsos, Streaming Pattern Discovery in Multiple Time-Series, VLDB 2005.
3. Matthew Brand, Fast Online SVD Revisions for Lightweight Recommender Systems, SDM 2003.
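The update rule sketched above can be written out for a single component. This toy NumPy version is our own simplification (one PC only, with an added re-normalization of v for numerical stability, which the slide's rule does not include); it tracks the dominant direction of a 2-D stream:

```python
import numpy as np

def update_pc(v, energy, x):
    # one incremental-PC step for a new point x
    y = float(v @ x)          # projection onto current direction
    energy = energy + y * y   # running energy (~ eigenvalue estimate)
    e = x - y * v             # error (residual)
    v = v + (y / energy) * e  # rotate toward the error, in proportion
    return v / np.linalg.norm(v), energy

rng = np.random.default_rng(1)
true_dir = np.array([1.0, 1.0]) / np.sqrt(2)
v, energy = np.array([1.0, 0.0]), 1e-3
for _ in range(500):
    x = rng.standard_normal() * true_dir + 0.01 * rng.standard_normal(2)
    v, energy = update_pc(v, energy, x)
# v now points (up to sign) along true_dir
```

Each step costs O(M) per component, matching the complexity claim above.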
Piecewise constant (APCA)
Within each window we previously sought fairly complex patterns (sinusoids, wavelets, etc.). Next, we allow any window size, but constrain the pattern within each window to the simplest possible: its mean.
Example: APCA approximations of the GBP series with k = 10, 21, and 41 segments.
Further reading:
1. Kaushik Chakrabarti, Eamonn Keogh, Sharad Mehrotra, Michael Pazzani, Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases, TODS, 27(2), 2002.
k/h-segmentation
Again, divide the series into k segments (variable length). For each segment, choose one of h quantization levels to represent all of its points; now the segment values m_j can take only h (h ≤ k) possible values. This combines quantization of the values with segmentation of time based on those quantization levels (more in the next part). APCA = k/k-segmentation (h = k).
Further reading:
1. Aristides Gionis, Heikki Mannila, Finding Recurrent Sources in Sequences, Recomb 2003.
K-means
Partitions the time series x(1), ..., x(M) into k groups I_j, 1 ≤ j ≤ k. All time series in the j-th group are represented by their centroid, m_j. The objective is to choose the m_j so as to minimize the overall squared distortion, sum_j sum_{i in I_j} ||x(i) - m_j||^2. (In 1-D on the values, with a contiguity requirement, this yields APCA.)
Algorithm:
1. Start with an arbitrary cluster assignment.
2. Compute centroids.
3. Re-assign to clusters based on the new centroids.
4. Repeat from (2), until no improvement.
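The four steps translate almost directly into code. A small NumPy sketch (our own simplified version: each row is one series, and initialization picks evenly spaced rows instead of the usual random choice):

```python
import numpy as np

def kmeans(X, k, iters=100):
    # initialize with k evenly spaced rows (simplistic; usually random)
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # step 3: assign each series to its nearest centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 2: re-compute centroids (keep the old one if a cluster empties)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # step 4: stop when no improvement
            break
        centroids = new
    return labels, centroids

# two well-separated groups of short "series"
X = np.vstack([np.zeros((5, 4)), 10.0 * np.ones((5, 4))])
labels, centroids = kmeans(X, 2)
```

As noted above, k-means only finds a local minimum of the distortion; the result depends on initialization.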
K-means: Example
Clustering the exchange-rate series (in the space of their first two PC coefficients) with k = 2 and with k = 4: the centroids summarize groups of similar currencies.
Further reading:
1. Hongyuan Zha, Xiaofeng He, Chris H.Q. Ding, Ming Gu, Horst D. Simon, Spectral Relaxation for K-means Clustering, NIPS 2001.
Euclidean path: i = j always; it ignores the off-diagonal cells of the warping matrix.
Dynamic time-warping: Fast estimation
Cell (i, j) of the warping matrix aligns the prefixes x[1:i] and y[1:j]; moving off the diagonal stretches x or shrinks y.
Create a lower-bounding distance on a coarser granularity, either at a single scale or at multiple scales (more in part 3).
Further reading:
1. Eamonn J. Keogh, Exact Indexing of Dynamic Time Warping, VLDB 2002.
2. Stan Salvador, Philip Chan, FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space, TDM 2004.
3. Yasushi Sakurai, Masatoshi Yoshikawa, Christos Faloutsos, FTW: Fast Similarity Under the Time Warping Distance, PODS 2005.
Timeline of part C
Introduction; Time-Series Representations; Distance Measures; Lower Bounding; Clustering/Classification/Visualization; Applications
Applications (Shapes)
Recognize type of leaf based on its shape
Ulmus carpinifolia
Acer platanoides
Salix fragilis
Tilia
Quercus robur
Color histogram / shape contour of each leaf image converted to a time-series; similar species then cluster together.
Special thanks to A. Ratanamahatana & E. Keogh for the leaf video.
Applications (Video)
Video-tracking / surveillance; visual tracking of body features (2D time-series); sign language recognition (3D time-series).
Video Tracking of body feature over time (Athens1, Athens2)
Becoming sufficiently familiar with something is a substitute for understanding it.
Linear Scan
Objective: compare the query with all sequences in the DB and return the k most similar sequences to the query.
Database with time-series: medical sequences, images, etc. Sequence length: 100-1000 points; DB size: 1 TByte.
Example distances of DB sequences to the query: D = 10.2, D = 11.8, D = 17, D = 22.
Hierarchical Clustering
Very generic & powerful tool; provides visual grouping of the data.
Pairwise distances
Partitional Clustering
Faster than hierarchical clustering; typically provides suboptimal solutions (local minima); performance degrades in high dimensions.
K-Means Demo
K-Means Algorithm:
1. Initialize k clusters (k specified by user) randomly.
2. Repeat until convergence:
   a. Assign each object to the nearest cluster center.
   b. Re-estimate cluster centers.
See: kmeans
Classification
Typically classification can be made easier if we have clustered the objects: compress the original sequences, cluster the compressed sequences into classes (e.g., Class A and Class B) in the clustering space, then project the query into the new space and find its closest cluster.
Example: classifying Elves vs. Hobbits using two features, height and hair length.
What do we need?
1. Define similarity.
2. Search fast: dimensionality reduction (compress the data).
Notion of Similarity I
The solution to any time-series problem boils down to a proper definition of *similarity*.
All models are wrong, but some are useful.
Notion of Similarity II
Similarity depends on the features we consider (i.e. how we will describe or compress the sequences)
Triangle Inequality
Triangle inequality: d(x,z) ≤ d(x,y) + d(y,z).
Metric distance functions can exploit the triangle inequality to speed up search. Suppose we have already computed d(Q,B) = 150, and we know the pairwise distances between the DB sequences:
      A    B    C
A     0   20  110
B    20    0   90
C   110   90    0
Then, since d(A,B) = 20: d(Q,A) ≥ d(Q,B) - d(B,A) = 150 - 20 = 130. So, if our best match so far is closer than 130, we don't have to retrieve A from disk.
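The pruning logic of this example, in a few lines of Python (the distances are the ones above; `best_so_far` is an assumed value for the best match found so far):

```python
# pairwise distances between the DB sequences (from the table above)
d = {('A', 'B'): 20, ('A', 'C'): 110, ('B', 'C'): 90}
d.update({(y, x): v for (x, y), v in list(d.items())})  # make symmetric

d_QB = 150          # already computed: distance from query Q to B
best_so_far = 120   # assumed: best match distance found so far

# triangle inequality: d(Q,A) >= d(Q,B) - d(B,A)
lower_bound_QA = d_QB - d[('B', 'A')]   # 150 - 20 = 130
skip_A = lower_bound_QA > best_so_far   # True: no need to read A from disk
```

The bound is free to compute, so every pruned sequence saves one full distance computation (and one disk read).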
Euclidean Distance
Most widely used distance measure
Definition: L2 = sqrt( sum_{i=1..n} (a[i] - b[i])^2 )
All pairwise distances can be computed with one matrix formula: ||A-B||^2 = ||A||^2 + ||B||^2 - 2*A.B, where A is a DxM matrix holding M sequences of length D.
aa = sum(a.*a);  bb = sum(b.*b);  ab = a'*b;
d  = sqrt(repmat(aa', [1 size(bb,2)]) + repmat(bb, [size(aa,2) 1]) - 2*ab);
a = a - mean(a);    % zero mean
a = a ./ std(a);    % unit std (z-normalization)
Dynamic Time-Warping
First used in speech recognition, for recognizing words spoken at different speeds:
---Maat--llaabb-------------------
The same idea works equally well for generic time-series data.
Euclidean distance: T1 = [1, 1, 2, 2], T2 = [1, 2, 2, 2], d = 1
One-to-one linear alignment
----Mat-lab--------------------------
Warping distance: T1 = [1, 1, 2, 2], T2 = [1, 2, 2, 2], d = 0
One-to-many non-linear alignment
Euclidean Distance
[Dendrogram: 20 sequences clustered hierarchically under Euclidean distance.]
Recursive equation: c(i,j) = D(A_i, B_j) + min{ c(i-1, j-1), c(i-1, j), c(i, j-1) }
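The recursive equation fills a dynamic-programming table; a small Python sketch (squared point-wise cost assumed) reproduces the slide's example, where the warping distance is 0 while the Euclidean distance is 1:

```python
import numpy as np

def dtw(A, B):
    """DTW via the slide's recursion:
    c(i,j) = D(A_i, B_j) + min{c(i-1,j-1), c(i-1,j), c(i,j-1)}."""
    n, m = len(A), len(B)
    c = np.full((n + 1, m + 1), np.inf)
    c[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (A[i - 1] - B[j - 1]) ** 2          # per-cell cost D(A_i, B_j)
            c[i, j] = cost + min(c[i - 1, j - 1],      # diagonal (match)
                                 c[i - 1, j],          # stay on B_j
                                 c[i, j - 1])          # stay on A_i
    return c[n, m]

# Slide example: one-to-many alignment gives warping distance 0,
# while the one-to-one Euclidean alignment gives distance 1.
T1 = [1, 1, 2, 2]
T2 = [1, 2, 2, 2]
```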
[Dendrogram: the same 20 sequences clustered under Dynamic Time Warping.]
Restricting the warping path helps to: A. speed up execution; B. avoid extreme (degenerate) matchings; C. improve clustering/classification accuracy.
Camera Mouse
A
We now fill only a small portion of the array.
Warping Length
A. Outlying values not matched B. Distance/Similarity distorted less C. Constraints in time & space
Method comparison (columns correspond to three datasets, one of them ASL):

Method      Time (sec)          Accuracy
Euclidean   34 / 2.2 / 2.1      20% / 33% / 11%
DTW         237 / 9.1 / 9.3     80% / 44% / 15%
LCSS        210 / 8.2 / 8.3     100% / 46% / 31%
Feature 1
Objective: instead of comparing the query to the original sequences (Linear Scan / LS), let's compare the query to simplified versions of the DB time-series.
A B C
One can also organize the low-dimensional points into a hierarchical index structure. In this tutorial we will not go over indexing techniques.
Question: when searching the original space, we are guaranteed to find the best match. Does this still hold (and under which circumstances) in the new compressed space?
[Figure: the query is simplified in the same way, and the simplified query is compared against the simplified DB sequences.]
Lower Bounds vs. True Distance (BestSoFar is updated as true distances are computed):

Lower Bound    True Distance
 4.6399         46.7790
37.9032        108.8856
19.5174        113.5873
72.1846        104.5062
67.1436        119.4087
78.0920        120.0066
70.9273        111.6011
63.7253        119.0635
 1.4121         17.2540
Fourier Decomposition
Decompose a time-series into sum of sine waves
DFT: X(f) = 1/sqrt(n) * sum_{t=0..n-1} x(t) e^(-j*2*pi*t*f/n), f = 0, ..., n-1
IDFT: x(t) = 1/sqrt(n) * sum_{f=0..n-1} X(f) e^(j*2*pi*t*f/n), t = 0, ..., n-1
Every signal can be represented as a superposition of sines and cosines (alas, nobody believes me)
Common representations: DFT, DWT, SVD, APCA, PAA, PLA
Fourier Decomposition
How much space do we gain by compressing random-walk data?
fa = fft(a);                % Fourier decomposition
fa(5:end) = 0;              % keep only the first few (low-frequency) coefficients
reconstr = real(ifft(fa));  % reconstruct signal

Life is complex: it has both real and imaginary parts.
[Plots: reconstruction error (left) and energy percentage retained (right) vs. number of coefficients kept.]
Fourier Decomposition
Which coefficients are important? We can measure the energy of each coefficient: Energy(f_k) = Real(X(f_k))^2 + Imag(X(f_k))^2
Most data-mining research uses the first k coefficients: good for random-walk signals (e.g. stock market); easy to index; not good for general signals.
Using the coefficients with the highest energy: good for all types of signals; believed to be difficult to index, but CAN be indexed using metric trees.
fa = fft(a);            % Fourier decomposition
N  = length(a);         % how many?
fa = fa(1:ceil(N/2));   % keep first half only
mag = 2*abs(fa).^2;     % calculate energy
[Figure: keep the first half of the complex DFT coefficients; the second half can be ignored, since for a real signal it holds the complex conjugates of the first half.]
for ind = 2:maxInd
    fa = fft(a); fa(ind:end) = 0;         % keep first ind-1 coefficients
    r  = real(ifft(fa));                  % reconstruct signal
    plot(r, 'r', 'LineWidth', 2); hold on; plot(a, 'k');
    title(['Reconstruction using ' num2str(ind-1) ' coefficients']);
    set(gca, 'plotboxaspectratio', [3 1 1]); axis tight
    pause;                                % wait for key
    cla;                                  % clear axis
    energy(ind-1) = sum(r.^2);            % energy of reconstruction
    error(ind-1)  = sum(abs(r-a).^2);     % reconstruction error
end
E     = sum(a.^2) * ones(1, maxInd-1);    % energy of a
error = E - energy;                       % (equivalent, by Parseval)
ratio = energy ./ E;
subplot(1,2,1); plot(1:maxInd-1, error, 'r', 'LineWidth', 1.5);  % left plot
subplot(1,2,2); plot(1:maxInd-1, ratio, 'b', 'LineWidth', 1.5);  % right plot
If we keep only some of the coefficients, their sum of squared differences always underestimates (i.e. lower bounds) the true Euclidean distance:
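This lower-bounding property follows from Parseval's theorem: with an orthonormal scaling of the DFT, the distance over any subset of coefficients can never exceed the true Euclidean distance. A quick NumPy check on synthetic random-walk data (illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.cumsum(rng.standard_normal(128))   # random-walk signal
b = np.cumsum(rng.standard_normal(128))

true_dist = np.linalg.norm(a - b)

# Dividing by sqrt(n) makes the DFT distance-preserving (Parseval),
# so keeping only a subset of coefficients can only shrink the distance.
fa = np.fft.fft(a) / np.sqrt(len(a))
fb = np.fft.fft(b) / np.sqrt(len(b))

k = 5
lb = np.linalg.norm(fa[:k] - fb[:k])      # lower bounds true_dist
```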
Fourier Decomposition
+ O(n log n) complexity; tried and tested; hardware implementations; many applications: compression, smoothing, periodicity detection
- Not a good approximation for bursty signals; not a good approximation for signals with flat and busy sections (requires many coefficients)
Fourier is good for smooth, random-walk data, but not for bursty or flat data.
Wavelets in Matlab
Specialized Matlab interface for wavelets
etc
See also: wavemenu
Piecewise Aggregate Approximation (PAA):

N      = length(s);                    % length of sequence
segLen = N / numCoeff;                 % assume it's an integer
sN     = reshape(s, segLen, numCoeff); % break in segments
avg    = mean(sN);                     % average each segment
data   = repmat(avg, segLen, 1);       % expand segments
data   = data(:);                      % make column vector

Example: N = 8, segLen = 2, numCoeff = 4:
s    = [1 2 3 4 5 6 7 8]
sN   = [1 3 5 7
        2 4 6 8]
avg  = [1.5 3.5 5.5 7.5]
data = [1.5 1.5 3.5 3.5 5.5 5.5 7.5 7.5]
Not all Haar/PAA coefficients are equally important. Intuition: keep the ones with the highest energy. This leads to segments of variable length. APCA is good for bursty signals. PAA requires 1 number per segment; APCA requires 2: [value, length].
E.g. 10 bits for a sequence of 1024 points
APCA
[Figure: APCA example on s = [1 2 3 4 5 6 7 8] using segments of variable length; each segment stores a (value, length) pair.]
Wavelet Decomposition
+ O(n) complexity; hierarchical structure; progressive transmission; better localization; good for bursty signals
Most data-mining research still utilizes Haar wavelets because of their simplicity.
+ O(n log n) complexity for the bottom-up algorithm; incremental computation possible; provable error bounds; applications: image/signal simplification, trend detection
[Figure: SVD of the data matrix (M sequences, each of length n) yields "eigenwaves"; projecting on the first eigenwave describes each point with one number, its projection on the line.]
A linear combination of the eigenwaves can produce any sequence in the database.
+ Optimal dimensionality reduction in the Euclidean distance sense. SVD is a very powerful tool in many domains: web search (PageRank), etc.
- Cannot be applied to just one sequence; a set of sequences is required. Adding a sequence to the database requires recomputation. Very costly to compute: time min{O(M^2 n), O(Mn^2)}, space O(Mn), for M sequences of length n.
Symbolic Approximation
Assign a different symbol based on the range of values. Find the ranges either from the data histogram or uniformly.
Symbolic Approximations
+ Linear complexity. After symbolization, many tools from bioinformatics can be used: Markov models, suffix trees, etc.
- The number of regions (alphabet length) can affect the quality of the result.
[Figure: a time-series quantized into three value ranges a/b/c yields the string "baabccbc".]
You can find an implementation here: http://www.ise.gmu.edu/~jessica/sax.htm
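A minimal Python sketch of the uniform-range variant (the histogram-based variant would instead place breakpoints at data quantiles; the function name and alphabet are illustrative, not the SAX implementation linked above):

```python
import numpy as np

def symbolize(x, alphabet='abc'):
    """Map each value of x to a symbol, using uniform ranges over [min, max]."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), len(alphabet) + 1)
    # digitize against the interior breakpoints; clip keeps the maximum
    # value inside the last symbol's range
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, len(alphabet) - 1)
    return ''.join(alphabet[i] for i in idx)
```

For example, symbolize([0, 1, 2, 5, 9]) maps the low values to 'a', 5 to 'b' and 9 to 'c'.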
Multidimensional Time-Series
Gaining momentum lately. Applications to mobile trajectories, sensor networks, epidemiology, etc.
Ari, are you sure the world is not 1D?
Multidimensional MBRs
Find bounding rectangles that completely contain a trajectory, given some optimization criterion (e.g. minimize the total volume).
Aristotle
On my income tax 1040 it says "Check this box if you are blind." I wanted to put a check mark about three inches away. - Tom Lehrer
[Timeline: multidimensional MBR techniques, 1993-2005.]
Comparisons
Let's see how tight the lower bounds are on a variety of 65 datasets.
Average Lower Bound
A. No approach is better on all datasets. B. Best-coefficient techniques can offer tighter bounds. C. The choice of compression depends on the application.
Note: similar results were also reported by Keogh in SIGKDD '02.
PART II: Time Series Matching Lower Bounding the DTW and LCSS
MBE(Q)
LB = sqrt(sum([[A > U].*[A-U]; [A < L].*[L-A]].^2));   % one Matlab command!

U and L are the upper and lower envelopes of the query (its MBE). The LB of Zhu and Shasha approximates the MBE and the sequence using PAA. However, this representation is uncompressed; both the MBE and the DB sequence can be compressed using any of the previously mentioned techniques. LB = 25.41
Q A
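The same envelope bound can be written in NumPy (the function name and the tiny example are illustrative):

```python
import numpy as np

def lb_envelope(A, U, L):
    """Distance from candidate sequence A to the query's envelope:
    only the parts of A sticking out above U or below L contribute."""
    A, U, L = (np.asarray(v, dtype=float) for v in (A, U, L))
    above = np.where(A > U, A - U, 0.0)
    below = np.where(A < L, L - A, 0.0)
    return np.sqrt(np.sum(above ** 2 + below ** 2))

# Toy example: the middle point exceeds U by 2, the last is below L by 2.
lb = lb_envelope([0, 5, -3], [1, 3, 1], [-1, -1, -1])
```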
Time Comparisons
We will use DTW (and the corresponding LBs) for recognition of hand-written digits/shapes.
Lower-bounding approaches for DTW will typically yield at least an order of magnitude speed improvement compared to the naive approach. Let's compare the 3 LB approaches:
Accuracy: using DTW we can achieve recognition above 90%. Running time: runTime LB_Warp < runTime LB_Zhu < runTime LB_Keogh. Pruning power: for some queries, LB_Warp can examine up to 65 times fewer sequences.
Word annotation:
1. Extract words from document. 2. Extract image features. 3. Annotate a subset of words. 4. Classify remaining words.
Sim.=50/77 = 0.64
[Plot: feature value vs. column (0-400) for the two compared word profiles.]
Features:
44 points
6 points
PART II: Time Series Analysis Test Case and Structural Similarity Measures
[Plot: monthly data for Porto, Jan-Dec.]
Google Zeitgeist
[Plots: weekly request volumes, Jan-Dec, for queries such as "priceline", "ps2", "xbox", and "elvis".]
The data is smooth and highly periodic, so we can use Fourier decomposition. Instead of using the first Fourier coefficients, we can use the best ones. Let's see how the approximation looks:
Using the best coefficients provides a very high-quality approximation of the original time-series.
Matching results I
Query = Lance Armstrong
Matching results II
Query = Christmas
[Plots: weekly demand 2000-2002; the top matches include "LeTour".]
Knn4: Christmas coloring books Knn8: Christmas baking Knn12: Christmas clipart
Periodic Matching
Compute the frequency spectrum, ignore the phase, keep the important components, and calculate the distance.
F ( x), F ( y )
cinema
Periodogram
D = || F(x+) - F(y+) ||, where x+ denotes the retained (important) spectral components
stock
easter
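The pipeline (periodogram, ignore phase, compare) can be sketched as follows, assuming plain Euclidean distance on normalized periodograms (an illustrative simplification of the measure on the slide):

```python
import numpy as np

def periodogram_distance(x, y):
    """Euclidean distance between normalized periodograms. |FFT|^2
    discards phase, so time-shifted periodic patterns still match."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    px = np.abs(np.fft.rfft(x - x.mean())) ** 2   # periodogram of x
    py = np.abs(np.fft.rfft(y - y.mean())) ** 2
    px = px / px.sum()                            # normalize total power
    py = py / py.sum()
    return np.linalg.norm(px - py)

t = np.arange(64)
x = np.sin(2 * np.pi * t / 8)     # periodic pattern
y = np.roll(x, 3)                 # same pattern, shifted in time
z = np.sin(2 * np.pi * t / 16)    # different period
```

The shifted copy y matches x almost exactly, while z, with a different period, stays far away.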
Query "elvis": burst in demand on August 16, the anniversary of Elvis Presley's death.
[Periodograms over weeks 1-52 for queries such as "christmas".]
Burst Detection
Burst detection is similar to anomaly detection: build a distribution of the values (e.g. a Gaussian model); any value that deviates strongly from the observed distribution (e.g. by more than 3 standard deviations) can be considered a burst.
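A minimal sketch of this Gaussian-model burst detector (the demand series and the injected mid-August spike are synthetic, illustrative data):

```python
import numpy as np

def detect_bursts(x, n_std=3.0):
    """Flag values deviating more than n_std standard deviations
    from the mean (a simple Gaussian model of normal demand)."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()) > n_std * x.std()

rng = np.random.default_rng(0)
demand = rng.normal(10, 1, 365)   # a year of ordinary daily demand
demand[228] += 20                 # inject one large spike (around mid August)
bursts = np.flatnonzero(detect_bursts(demand))
```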
Query-by-burst
To perform query-by-burst we can follow these steps: 1. find the burst regions in the given query; 2. represent the query bursts as time segments; 3. find which sequences in the DB have overlapping burst regions.
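The three steps can be sketched as follows (function names and the toy burst masks are illustrative):

```python
def burst_segments(flags):
    """Turn a boolean burst mask into a list of [start, end) time segments."""
    segs, start = [], None
    for i, f in enumerate(flags):
        if f and start is None:
            start = i
        elif not f and start is not None:
            segs.append((start, i))
            start = None
    if start is not None:
        segs.append((start, len(flags)))
    return segs

def overlaps(segs_a, segs_b):
    """True if any burst region of one sequence overlaps one of the other."""
    return any(s1 < e2 and s2 < e1
               for s1, e1 in segs_a
               for s2, e2 in segs_b)

q  = burst_segments([0, 0, 1, 1, 0, 0, 0, 1, 0])   # query's burst mask
db = burst_segments([0, 0, 0, 1, 1, 0, 0, 0, 0])   # one DB sequence's mask
```

Here the query's first burst segment overlaps the DB sequence's burst segment, so the DB sequence would be reported as a match.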
Query-by-burst Results

Query               Match
Pentagon attack     Nostradamus prediction
www.nhc.noaa.gov    Tropical Storm
Cheap gifts         Scarfs
Periodic Measure
Incorrect Grouping
Conclusion
The traditional shape-matching measures cannot address all time-series matching problems and applications; structural distance measures can provide more flexibility. There are many other exciting time-series problems that haven't been covered in this tutorial, e.g. anomaly detection.
I don't want to achieve immortality through my work... I want to achieve it through not dying. -- Woody Allen
NICE SYSTEMS: Stock value increased (provider of air traffic control systems)