Tutorial | Time-Series with Matlab
Disclaimer
Feel free to use any of the following slides for educational purposes, however kindly acknowledge the source. We would also like to know how you have used these slides, so please send us emails with comments or suggestions.
Hands-On Time-Series Analysis with Matlab
Michalis Vlachos and Spiros Papadimitriou
IBM T.J. Watson Research Center
About this tutorial
Disclaimer
The goal of this tutorial is to show you that time-series research (or research in general) can be made fun when it involves visualizing ideas, which can be achieved with concise programming. Matlab enables us to do that.
Will I be able to use this MATLAB right away after the tutorial?
I am definitely smarter than her, but I am not a time-series person, per se. I wonder what I gain from this tutorial.
We are not affiliated with Mathworks in any way, but we do like using Matlab a lot since it makes our lives easier. Errors and bugs are most likely contained in this tutorial. We might be responsible for some of them.
What this tutorial is NOT about
Moving averages
Autoregressive models
Forecasting/Prediction
Stationarity
Seasonality
Overview
PART A: The Matlab programming environment
PART B: Basic mathematics
  Introduction / geometric intuition
  Coordinates and transforms
  Quantized representations
  Non-Euclidean distances
PART C: Similarity Search and Applications
  Introduction
  Representations
  Distance Measures
  Lower Bounding
  Clustering/Classification/Visualization
  Applications
Why does anyone need Matlab?
Matlab enables efficient Exploratory Data Analysis (EDA)
Science progresses through observation -- Isaac Newton
Isaac Newton
PART A: Matlab Introduction
The greatest value of a picture is that it forces us to notice what we never expected to see -- John Tukey
John Tukey
Matlab
Interpreted language
  Easy code maintenance (code is very compact)
  Very fast array/vector manipulation
  Support for OOP
Easy plotting and visualization
Easy integration with other languages/OSs
  Interact with C/C++, COM objects, DLLs
  Built-in Java support (and compiler)
  Ability to make executable files
  Multi-platform support (Windows, Mac, Linux)
Extensive number of toolboxes
  Image, Statistics, Bioinformatics, etc
History of Matlab (MATrix LABoratory)
The most important thing in the programming language is the name. I have recently invented a very good name and now I am looking for a suitable language. -- D. Knuth
Programmed by Cleve Moler as an interface for EISPACK & LINPACK
1957: Moler goes to Caltech. Studies numerical analysis
1961: Goes to Stanford. Works with G. Forsythe on Laplacian eigenvalues.
1977: First edition of Matlab; 2000 lines of Fortran, 80 functions (now more than 8000 functions)
1979: Met with Jack Little in Stanford. Started working on porting it to C
1984: Mathworks is founded
Video:[Link]
Cleve Moler
Current State of Matlab/Mathworks
Matlab, Simulink, Stateflow
Matlab version 7.3, R2006b
Used in a variety of industries: aerospace, defense, computers, communication, biotech
Mathworks is still privately owned
Used in >3,500 universities, with >500,000 users worldwide
2005 revenue: >350 M.
2005 employees: 1,400+
Pricing: starts from 1900$ (commercial use), ~100$ (Student Edition)
Money is better than poverty, if only for financial reasons
Matlab 7.3
R2006b, Released on Sept 1 2006
Distributed computing
Better support for large files
New optimization Toolbox
Matlab builder for Java: create Java classes from Matlab
Demos, Webinars in Flash format ([Link] html)
Who needs Matlab?
R&D companies for easy application deployment
Professors
  Lab assignments
  Matlab allows focus on algorithms not on language features
Students
  Batch processing of files (no more incomprehensible perl code!)
  Great environment for testing ideas: quick coding of ideas, then porting to C/Java etc
  Easy visualization
  It's cheap! (for students at least)
Personally I'm always ready to learn, although I do not always like being taught. -- Sir Winston Churchill
Starting up Matlab
Dos/Unix like directory navigation
Commands like: cd, pwd, mkdir
For navigation it is easier to just copy/paste the path from explorer
E.g.: cd 'c:\documents\'
Matlab Environment
Command Window: - type commands - load scripts
Workspace: Loaded Variables/Types/Size
Help contains a comprehensive introduction to all functions
Excellent demos and tutorial of the various features and toolboxes
Starting with Matlab
Everything is arrays
Manipulation of arrays is faster than regular manipulation with for-loops
a = [1 2 3 4 5 6 7 9 10] % define an array
Populating arrays
Plot sinusoid function
a = [0:0.3:2*pi] % generate values from 0 to 2pi (with step of 0.3)
b = cos(a) % evaluate cos at the positions contained in array [a]
plot(a,b) % plot a (x-axis) against b (y-axis)
Related: linspace(-100,100,15); % generate 15 values between -100 and 100
Array Access
Access array elements
>> a(1)
ans =
     0
>> a(1:3)
ans =
         0    0.3000    0.6000
Set array elements
>> a(1) = 100
>> a(1:3) = [100 100 100]
2D Arrays
Can access whole columns or rows
Let's define a 2D array
>> a = [1 2 3; 4 5 6]
a =
     1     2     3
     4     5     6
>> a(2,2)
ans =
     5
Row-wise access
>> a(1,:)
ans =
     1     2     3
Column-wise access
>> a(:,1)
ans =
     1
     4
A good listener is not only popular everywhere, but after a while he gets to know something. Wilson Mizner
Column-wise computation
For arrays greater than 1D, all computations happen column-by-column
>> a = [1 2 3; 3 2 1]
a =
     1     2     3
     3     2     1
>> mean(a)
ans =
    2.0000    2.0000    2.0000
>> max(a)
ans =
     3     2     3
>> sort(a)
ans =
     1     2     1
     3     2     3
Concatenating arrays
Column-wise or row-wise
Row next to row
>> a = [1 2 3];
>> b = [4 5 6];
>> c = [a b]
c =
     1     2     3     4     5     6
Column next to column
>> a = [1;2];
>> b = [3;4];
>> c = [a b]
c =
     1     3
     2     4
Row below row
>> a = [1 2 3];
>> b = [4 5 6];
>> c = [a; b]
c =
     1     2     3
     4     5     6
Column below column
>> a = [1;2];
>> b = [3;4];
>> c = [a; b]
c =
     1
     2
     3
     4
Initializing arrays
Create array of ones [ones]
>> a = ones(1,3)
a =
     1     1     1
>> a = ones(2,2)*5;
a =
     5     5
     5     5
>> a = ones(1,3)*inf
a =
   Inf   Inf   Inf
Create array of zeroes [zeros]
Good for initializing arrays
>> a = zeros(1,4)
a =
     0     0     0     0
>> a = zeros(3,1) + [1 2 3]'
a =
     1
     2
     3
Reshaping and Replicating Arrays
Changing the array shape [reshape] (eg, for easier column-wise computation)
>> a = [1 2 3 4 5 6]'; % make it into a column
>> reshape(a,2,3)
ans =
     1     3     5
     2     4     6
reshape(X,[M,N]): [M,N] matrix of columnwise version of X
Replicating an array [repmat]
>> a = [1 2 3];
>> repmat(a,1,2)
ans =
     1     2     3     1     2     3
>> repmat(a,2,1)
ans =
     1     2     3
     1     2     3
repmat(X,[M,N]): make [M,N] tiles of X
Useful Array functions
Last element of array [end]
>> a = [1 3 2 5];
>> a(end)
ans =
     5
>> a(end-1)
ans =
     2
Length of array [length]
>> a = [1 3 2 5];
>> length(a)
ans =
     4
Dimensions of array [size]
>> [rows, columns] = size(a)
rows = 1
columns = 4
Find a specific element [find] **
>> a = [1 3 2 5 10 5 2 3];
>> b = find(a==2)
b =
     3     7
Sorting [sort] ***
>> a = [1 3 2 5];
>> [s,i]=sort(a)
s =
     1     2     3     5
i =
     1     3     2     4
i indicates the index where each element of s came from in a
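The index vector returned by sort is what makes it useful for time-series work: it lets a second array be reordered in the same way as the first. The following is a minimal sketch; the labels cell array is a made-up companion array, not part of the original slides.

```matlab
% Sort one array and reorder a companion array the same way.
values = [1 3 2 5];            % data to sort (same numbers as the slide)
labels = {'a', 'b', 'c', 'd'}; % hypothetical labels, one per value

[s, i] = sort(values);         % s = sorted values, i = original positions
sortedLabels = labels(i);      % apply the same permutation to the labels

% values(i) reproduces s, so i maps sorted positions back to the source
isConsistent = isequal(values(i), s);
```

Indexing with i is the general pattern: any array with one entry per element of values can be brought into the sorted order this way.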
strcmp, scatter, hold on
Visualizing Data and Exporting Figures
Use Fisher's Iris dataset
>> load fisheriris
4 dimensions, 3 species
Petal length & width, sepal length & width
Iris: virginica/versicolor/setosa
meas (150x4 array): Holds 4D measurements
species (150x1 cell array): Holds name of species for the specific measurement (..., 'versicolor', 'versicolor', 'virginica', 'virginica', ...)
Visualizing Data (2D)
>> idx_setosa = strcmp(species, 'setosa'); % rows of setosa data
>> idx_virginica = strcmp(species, 'virginica'); % rows of virginica
>> setosa = meas(idx_setosa,[1:2]);
>> virgin = meas(idx_virginica,[1:2]);
>> scatter(setosa(:,1), setosa(:,2)); % plot in blue circles by default
>> hold on;
>> scatter(virgin(:,1), virgin(:,2), 'rs'); % red[r] squares[s] for these
idx_setosa (..., 1, 1, 1, 0, 0, 0, ...): An array of zeros and ones indicating the positions where the keyword 'setosa' was found
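The zeros-and-ones array produced by strcmp is a logical index, and it can be used directly to pick rows. Here is a small sketch on a made-up four-row miniature of the species/meas pair, so it runs without the fisheriris data.

```matlab
% Hypothetical miniature of the species/meas pair from the Iris data.
species = {'setosa'; 'virginica'; 'setosa'; 'versicolor'};
meas    = [5.1 3.5; 6.3 3.3; 4.9 3.0; 7.0 3.2];

idx = strcmp(species, 'setosa');  % logical column: [1; 0; 1; 0]
rows = meas(idx, :);              % keep only the rows where idx is true
nSetosa = sum(idx);               % number of matching rows
```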
The world is governed more by appearances rather than realities --Daniel Webster
scatter3
Visualizing Data (3D)
>> idx_setosa = strcmp(species, 'setosa'); % rows of setosa data
>> idx_virginica = strcmp(species, 'virginica'); % rows of virginica
>> idx_versicolor = strcmp(species, 'versicolor'); % rows of versicolor
>> setosa = meas(idx_setosa,[1:3]);
>> virgin = meas(idx_virginica,[1:3]);
>> versi = meas(idx_versicolor,[1:3]);
>> scatter3(setosa(:,1), setosa(:,2),setosa(:,3)); % plot in blue circles by default
>> hold on;
>> scatter3(virgin(:,1), virgin(:,2),virgin(:,3), 'rs'); % red[r] squares[s] for these
>> scatter3(versi(:,1), versi(:,2),versi(:,3), 'gx'); % green x's
Changing Plots Visually
Toolbar: Zoom out, Zoom in, Create line, Create Arrow, Select Object, Add text
Computers are useless. They can only give you answers.
>> grid on; % show grid on axis
>> rotate3d on; % rotate with mouse
Changing Plots Visually
Add titles
Add labels on axis
Change tick labels
Add grids to axis
Change color of line
Change thickness/linestyle etc
Changing Plots Visually (Example)
Change color and width of a line
A Right click C
Changing Plots Visually (Example)
The result
Other Styles:
[Figure: two line plots drawn in different styles]
Changing Figure Properties with Code
GUIs are easy, but sooner or later we realize that coding is faster
>> a = cumsum(randn(365,1)); % random walk of 365 values
If this represents a year's worth of measurements of an imaginary quantity, we will change:
x-axis annotation to months
Axis labels
Put title in the figure
Include some Greek letters in the title just for fun
Real men do it command-line --Anonymous
Changing Figure Properties with Code
Axis annotation to months
>> axis tight; % irrelevant but useful...
>> xx = [Link];
>> set(gca,'xtick',xx)
>> set(gca,'xticklabel',['Jan'; ...
'Feb';'Mar'])
The result
Changing Figure Properties with Code
Axis labels and title
>> title('My measurements (\epsilon/\pi)')
>> ylabel('Imaginary Quantity')
>> xlabel('Month of 2005')
Other latex examples: \alpha, \beta, e^{-\alpha} etc
Saving Figures
Matlab allows you to save figures (.fig) for later processing
.fig files can later be opened through Matlab
You can always put-off for tomorrow, what you can do today. -Anonymous
Exporting Figures
Export to: emf, eps, jpg, etc
Exporting figures (code)
You can also achieve the same result with Matlab code
Matlab code:
% extract to color eps
print -depsc [Link]; % from command-line
print(gcf,'-depsc',myImage) % using variable as name
Visualizing Data - 2D Bars
time = [100 120 80 70]; % our data
h = bar(time); % get handle
cmap = [1 0 0; 0 1 0; 0 0 1; .5 0 1]; % colors
colormap(cmap); % create colormap
cdata = [1 2 3 4]; % assign colors
set(h,'CDataMapping','direct','CData',cdata);
[Figure: bar chart with bars 1 to 4, each drawn in its own colormap color]
Visualizing Data - 3D Bars
data = [ 10 8 7; 9 6 5; 8 6 4; 6 5 4; 6 3 2; 3 2 1];
bar3([1 2 3 5 6 7], data);
c = colormap(gray); % get colors of colormap
c = c(20:55,:); % get some colors
colormap(c); % new colormap
[Figure: 3D bar chart of data, shaded with rows taken from the 64-row gray colormap]
Visualizing Data - Surfaces
The value at position x-y of the array indicates the height of the surface
data = [1:10];
data = repmat(data,10,1); % create data
surface(data,'FaceColor',[1 1 1], 'Edgecolor', [0 0 1]); % plot data
view(3); grid on; % change viewpoint and put axis lines
[Figure: surface plot of the 10x10 data array]
Creating .m files
Standard text files
Script: A series of Matlab commands (no input/output arguments)
Functions: Programs that accept input and return output
cumsum, num2str, save
Creating .m files
The following script will create:
An array with 10 random walk vectors
Will save them under text files: [Link], ..., [Link]
cumsum example: A = 1 2 3 4 5, cumsum(A) = 1 3 6 10 15
Sample Script (myScript.m, double click to open it in the M editor)
a = cumsum(randn(100,10)); % 10 random walk data of length 100
for i=1:size(a,2), % number of columns
  data = a(:,i);
  fname = [num2str(i) '.dat']; % a string is a vector of characters!
  save(fname, 'data','-ASCII'); % save each column in a text file
end
Write this in the M editor and execute by typing the name on the Matlab command line
[Figure: A random walk time-series]
Functions in .m scripts
When we need to:
Organize our code
Frequently change parameters in our scripts
function dataN = zNorm(data)
% ZNORM zNormalization of vector
% subtract mean and divide by std
if (nargin<1), % check parameters
  error('Not enough arguments');
end
data = data - mean(data); % subtract mean
data = data/std(data); % divide by std
dataN = data;
First line: keyword (function), output argument (dataN), function name (zNorm), input argument (data)
Comment lines after the first line: Help Text (help function_name)
Remaining lines: Function Body
function [a,b] = myFunc(data, x, y) % pass & return more arguments
See also: varargin, varargout
Cell Arrays
Cells that hold other Matlab arrays
Let's read the files of a directory
>> f = dir('*.dat') % list the .dat files of the directory
f =
15x1 struct array with fields:
    name
    date
    bytes
    isdir
for i=1:length(f),
  a{i} = load(f(i).name);
  N = length(a{i});
  plot3([1:N], a{i}(:,1), a{i}(:,2), ...
    'r-', 'Linewidth', 1.5);
  grid on;
  pause;
  cla;
end
Struct Array: each element of f (1, 2, 3, 4, 5, ...) holds the fields name, date, bytes, isdir, e.g. f(1).name
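The zNorm function shown earlier can be tried without creating a .m file by writing the same two steps as an anonymous function. This is only a sketch; the name zNormFn and the example series are assumptions made for illustration.

```matlab
% z-normalization written as an anonymous function (no .m file needed).
zNormFn = @(x) (x - mean(x)) / std(x);

x  = [3 8 4 1 9 6];   % arbitrary example series
xn = zNormFn(x);

m = mean(xn);         % close to 0
s = std(xn);          % close to 1
```

A function file is still the better choice once argument checks (nargin) or help text are needed, as in zNorm itself.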
Flow Control/Loops
if (else/elseif), switch
Check logical conditions
while
Execute statements an indefinite number of times (for as long as a condition holds)
for
Execute statements a fixed number of times
break, continue
return
Return execution to the invoking function
Life is pleasant. Death is peaceful. It's the transition that's troublesome. --Isaac Asimov
Reading/Writing Files
Load/Save are faster than C style I/O operations
But fscanf, fprintf can be useful for file formatting or reading non-Matlab files
fid = fopen('[Link]', 'wt');
for i=1:length(species),
  fprintf(fid, '%6.4f %6.4f %6.4f %6.4f %s\n', meas(i,:), species{i});
end
fclose(fid);
Elements are accessed column-wise (again)
x = 0:.1:1; y = [x; exp(x)];
fid = fopen('[Link]','w');
fprintf(fid,'%6.2f %12.8f\n',y);
fclose(fid);
Output file:
0 1
0.1 1.1052
0.2 1.2214
0.3 1.3499
0.4 1.4918
0.5 1.6487
0.6 1.8221
0.7 2.0138
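The column-wise order in which fprintf consumes a matrix is easy to check with sprintf, which formats the same way but returns a string instead of writing to a file. A minimal sketch with a made-up 2x3 matrix:

```matlab
y = [1 2 3; 4 5 6];              % 2 rows, 3 columns
s = sprintf('%d %d\n', y);       % format is reused until y is exhausted
% y is read as 1 4 2 5 3 6, so each output line holds one COLUMN of y
expected = sprintf('1 4\n2 5\n3 6\n');
sameText = strcmp(s, expected);
```

This is why the file-writing example stacks x and exp(x) as rows: each column of y then becomes one line of output.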
tic, toc, clear all
For-Loop or vectorization?
clear all;
tic;
for i=1:50000
  a(i) = sin(i);
end
toc
elapsed_time = 5.0070

clear all;
a = zeros(1,50000);
tic;
for i=1:50000
  a(i) = sin(i);
end
toc
elapsed_time = 0.1400

clear all;
tic;
i = [1:50000];
a = sin(i);
toc;
elapsed_time = 0.0200

Pre-allocate arrays that store output results
  No need for Matlab to resize every time
Functions are faster than scripts
  Compiled into pseudocode
Load/Save faster than Matlab I/O functions
After v. 6.5 of Matlab there is for-loop vectorization (interpreter)
Vectorization helps, but it is often not obvious how to achieve it
Matlab Profiler
Find which portions of code take up most of the execution time
  Identify bottlenecks
  Vectorize offending code
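As one illustration of turning a loop into array operations, the sketch below computes the Euclidean distance from a query to every row of a small matrix, first with a pre-allocated loop and then in vectorized form. The data and variable names are made up for the example; no timings are claimed, since they depend on the machine and Matlab version.

```matlab
data  = [1 2 3; 4 5 6; 7 8 9; 1 0 1];  % four short series, one per row
query = [1 2 2];

% Loop version: one distance per iteration, output pre-allocated
dLoop = zeros(size(data,1), 1);
for i = 1:size(data,1)
    dLoop(i) = sqrt(sum((data(i,:) - query).^2));
end

% Vectorized version: replicate the query and work on whole arrays
diffs = data - repmat(query, size(data,1), 1);
dVec  = sqrt(sum(diffs.^2, 2));        % sum along each row
```

Both versions give the same distances; the vectorized one replaces the loop with repmat and a sum along dimension 2.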
Time not important, only life important. --The Fifth Element
Hints & Tips
There is always an easier (and faster) way
Typically there is a specialized function for what you want to achieve
Learn vectorization techniques by peeking at the actual Matlab files:
edit [fname], e.g.
edit mean
edit princomp
Matlab Help contains many vectorization examples
Debugging
Beware of bugs in the above code; I have only proved it correct, not tried it -- D. Knuth
Not as frequently required as in C/C++
Set breakpoints, step, step in, check variable values
Full control over variables and execution path
F10: step, F11: step in (visit functions, as well)
Either this man is dead or my watch has stopped.
Advanced Features 3D modeling/Volume Rendering
Very easy volume manipulation and rendering
Advanced Features Making Animations (Example)
Create animation by changing the camera viewpoint
azimuth = [50:100 [Link]-1:50]; % azimuth range of values
for k = 1:length(azimuth),
  plot3(1:length(a), a(:,1), a(:,2), 'r', 'Linewidth',2);
  grid on;
  view(azimuth(k),30); % change to the new viewpoint
  M(k) = getframe; % save the frame
end
movie(M,20); % play movie 20 times
See also: movie2avi
[Figure: the same 3D plot seen from three different azimuth angles]
Advanced Features GUIs
Built-in Development Environment
Buttons, figures, Menus, sliders, etc
Several Examples in Help
Directory listing
Address book reader
GUI with multiple axis
Advanced Features Using Java
Matlab is shipped with Java Virtual Machine (JVM)
Access Java API (eg I/O or networking)
Import Java classes and construct objects
Pass data between Java objects and Matlab variables
Advanced Features Using Java (Example)
Stock Quote Query
Connect to Yahoo server [Link] objectId=4069&objectType=file
disp('Contacting YAHOO server using ...');
disp(['url = [Link](' urlString ')']);
end;
url = [Link](urlString);
try
  stream = openStream(url);
  ireader = [Link](stream);
  breader = [Link](ireader);
  connect_query_data = 1; % connect made
catch
  connect_query_data = -1; % could not connect case
  disp(['URL: ' urlString]);
  error(['Could not connect to server. It may be unavailable. Try again later.']);
  stockdata = {};
  return;
end
Matlab Toolboxes
You can buy many specialized toolboxes from Mathworks
Image Processing, Statistics, Bio-Informatics, etc
There are many equivalent free toolboxes too:
SVM toolbox: [Link]
Wavelets: [Link]
Speech Processing: [Link]
Bayesian Networks: [Link]
In case I get stuck
help [command] (on the command line), e.g. help fft
Menu: help -> matlab help
Excellent introduction on various topics
Matlab webinars: [Link]
Google groups: [Link]
You can find *anything* here
Someone else had the same problem before you!
I've had a wonderful evening. But this wasn't it
PART B: Mathematical notions
Eighty percent of success is showing up.
Overview of Part B
1. Introduction and geometric intuition
2. Coordinates and transforms
   Fourier transform (DFT)
   Wavelet transform (DWT)
   Incremental DWT
   Principal components (PCA)
   Incremental PCA
3. Quantized representations
   Piecewise quantized / symbolic
   Vector quantization (VQ) / K-means
4. Non-Euclidean distances
   Dynamic time warping (DTW)
What is a time-series
Definition: A sequence of measurements over time
Medicine, Stock Market, Meteorology, Geology, Astronomy, Chemistry, Biometrics, Robotics
64.0 62.8 62.0 66.0 62.0 32.0 86.4 ... 21.6 45.2 43.2 53.0 43.2 42.8 43.2 36.4 16.9 10.0
[Figure: example series: ECG, Sunspot, Earthquake]
Applications
Images
[Figure: an image, its color histogram, and the resulting time-series]
Shapes
[Figure: leaf outlines (Acer platanoides, Salix fragilis) shown with their time-series]
Motion capture
... more to come
Time Series
[Figure: the values x1 ... x6 plotted against time]
x = (3, 8, 4, 1, 9, 6)
Sequence of numeric values
Finite: N-dimensional vectors/points
Infinite: Infinite-dimensional vectors
Mean
Definition:
From now on, we will generally assume zero mean (mean normalization):
Variance
Definition:
or, if zero mean, then
From now on, we will generally assume unit variance (variance normalization):
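The formulas for these two slides did not survive in this text. The block below writes out the standard definitions they refer to, as a sketch under one assumption: the variance uses the 1/N convention (Matlab's std divides by N-1 by default; std(x,1) gives the 1/N version). The symbols \mu_x and \sigma_x are chosen here for illustration.

```latex
% Mean and mean normalization (zero mean)
\mu_x = \frac{1}{N}\sum_{t=1}^{N} x_t ,
\qquad x_t \leftarrow x_t - \mu_x

% Variance (1/N convention assumed), and its form for a zero-mean series
\sigma_x^2 = \frac{1}{N}\sum_{t=1}^{N} (x_t - \mu_x)^2 ,
\qquad \sigma_x^2 = \frac{1}{N}\sum_{t=1}^{N} x_t^2 \quad (\mu_x = 0)

% Variance normalization (unit variance)
x_t \leftarrow \frac{x_t}{\sigma_x}
```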
Mean and variance
[Figure: a series annotated with its mean and its variance]
Why and when to normalize
Intuitively, the notion of "shape" is generally independent of
Average level (mean)
Magnitude (variance)
Unless otherwise specified, we normalize to zero mean and unit variance
Variance = Length
Variance of zero-mean series:
Length of N-dimensional vector (L2-norm):
So that:
[Figure: the vector x and its length ||x|| in the (x1, x2) plane]
Covariance and correlation
Definition
or, if zero mean and unit variance, then
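The "Variance = Length" idea can be checked numerically: for a zero-mean series, the L2-norm equals sqrt(N) times the standard deviation, provided the 1/N convention is used for the variance (an assumption here, since the slide's formulas are not reproduced in this text).

```matlab
x = [3 8 4 1 9 6];
x = x - mean(x);            % zero-mean version of the series
N = length(x);

sigma  = std(x, 1);         % second argument 1 selects the 1/N convention
len    = norm(x);           % L2-norm: sqrt(sum(x.^2))
scaled = sqrt(N) * sigma;   % matches len for a zero-mean series
```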
Correlation and similarity
How strong is the linear relationship between xt and yt?
For normalized series,
[Figure: two scatter plots with slope and residual marked: CAD against FRF (correlation -0.23) and BEF against FRF (correlation 0.99)]
Correlation = Angle
Correlation of normalized series:
Cosine law:
So that:
[Figure: the vectors x and y and their inner product x.y]
Correlation and distance
For normalized series,
i.e., correlation and squared Euclidean distance are linearly related.
[Figure: the vectors x and y, the distance ||x-y|| and the inner product x.y]
Ergodicity
Example
Assume I eat chicken at the same restaurant every day and
Question: How often is the food good?
Answer one:
Answer two:
Answers are equal => ergodic
If the chicken is usually good, then my guests today can safely order other things.
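The linear relation between correlation and squared Euclidean distance can be verified on two small series. The sketch assumes normalization with the 1/N variance, in which case the relation reads ||x-y||^2 = 2N(1 - rho); with a different normalization the constant changes but the relation stays linear. The series x and y are made up.

```matlab
x = [3 8 4 1 9 6];
y = [2 9 5 1 7 7];
N = length(x);

% Normalize to zero mean and unit variance (1/N convention)
xn = (x - mean(x)) / std(x, 1);
yn = (y - mean(y)) / std(y, 1);

rho  = (xn * yn') / N;        % correlation as a scaled inner product
d2   = sum((xn - yn).^2);     % squared Euclidean distance
pred = 2 * N * (1 - rho);     % linear function of the correlation

c = corrcoef(x, y);           % built-in check: c(1,2) equals rho
```

Because of this relation, ranking normalized series by Euclidean distance and ranking them by correlation give the same order.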
Ergodicity
Example
Ergodicity is a common and fundamental assumption, but sometimes can be wrong:
Total number of murders this year is 5% of the population
If I live 100 years, then I will commit about 5 murders, and if I live 60 years, I will commit about 3 murders
=> non-ergodic!
Such ergodicity assumptions on population ensembles are commonly called "racism."
Stationarity
Example
Is the chicken quality consistent?
Last week:
Two weeks ago:
Last month:
Last year:
Answers are equal => stationary
Autocorrelation
Definition:
Is well-defined if and only if the series is (weakly) stationary
Depends only on lag, not time t
Time-domain coordinates
[Figure: the series with values -0.5, 4, 1.5, -2, 2, 6, 3.5, 1 written as a sum of one spike per time instant]
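The autocorrelation definition itself is not reproduced in this text, so the sketch below uses one common estimator (the biased sample autocorrelation of a normalized series, dividing by N at every lag). That choice, the test signal and the variable names are assumptions made for illustration.

```matlab
t = 0:199;
x = sin(2*pi*t/20);              % periodic series, period 20
x = (x - mean(x)) / std(x, 1);   % zero mean, unit variance
N = length(x);

maxLag = 40;
acf = zeros(1, maxLag + 1);
for lag = 0:maxLag
    % average product of the series with a copy shifted by 'lag'
    acf(lag + 1) = sum(x(1:N-lag) .* x(1+lag:N)) / N;
end
% acf(1) is lag 0 and equals 1; the peak at lag 20 reveals the period
```

Plotting acf against 0:maxLag shows a strong negative value at half the period (lag 10) and a peak again at lag 20.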
Time-domain coordinates
x = -0.5 e1 + 4 e2 + 1.5 e3 + (-2) e4 + 2 e5 + 6 e6 + 3.5 e7 + 1 e8
(the coefficients are x1, x2, ..., x8)
Orthonormal basis
Set of N vectors, { e1, e2, ..., eN }
Normal: ||ei|| = 1, for all 1 <= i <= N
Orthogonal: ei.ej = 0, for i ~= j
Describe a Cartesian coordinate system
Preserve length (aka. "Parseval theorem")
Preserve angles (inner-product, correlations)
Orthonormal basis
Note that the coefficients xi w.r.t. the basis { e1, ..., eN } are the corresponding "similarities" of x to each basis vector/series:
[Figure: the example series (-0.5, 4, 1.5, -2, 2, 6, 3.5, 1) written as -0.5 e1 + 4 e2 + ...]
Orthonormal bases
The time-domain basis is a trivial tautology:
Each coefficient is simply the value at one time instant
What other bases may be of interest? Coefficients may correspond to:
Frequency (Fourier)
Time/scale (wavelets)
Features extracted from series collection (PCA)
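The properties of an orthonormal basis can be tried out on the eight-value example series. The sketch below takes an arbitrary orthonormal basis (the Q factor of a QR decomposition of magic(8), chosen only because it is easy to generate) and checks that the coefficients are inner products, that the series is recovered from them, and that length is preserved.

```matlab
x = [-0.5 4 1.5 -2 2 6 3.5 1]';   % the example series, as a column
N = length(x);

% Some orthonormal basis: the Q factor of a QR decomposition is orthogonal
% for any square input matrix (magic(N) is an arbitrary choice)
[Q, R] = qr(magic(N));

c    = Q' * x;                    % coefficients: inner product with each basis vector
xRec = Q * c;                     % weighted sum of basis vectors recovers x

lenTime  = norm(x);               % length in the time-domain basis
lenBasis = norm(c);               % same length in the new basis
```

Replacing Q with eye(N) gives the time-domain basis, where c is simply x itself.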
Frequency domain coordinates
Preview
[Figure: the example series written as a weighted sum of sinusoids, with coefficients 5.6, -2.2, 2.8, -4.9, -3, 0.05]
Time series geometry
Summary
Basic concepts:
Series / vector
Mean: "average level"
Variance: "magnitude/length"
Correlation: "similarity", "distance", "angle"
Basis: "Cartesian coordinate system"
Time series geometry
Preview: Applications
The quest for the right basis
Compression / pattern extraction
Filtering
Similarity / distance
Indexing
Clustering
Forecasting
Periodicity estimation
Correlation
Overview
1. Introduction and geometric intuition
2. Coordinates and transforms
   Fourier transform (DFT)
   Wavelet transform (DWT)
   Incremental DWT
   Principal components (PCA)
   Incremental PCA
3. Quantized representations
   Piecewise quantized / symbolic
   Vector quantization (VQ) / K-means
4. Non-Euclidean distances
   Dynamic time warping (DTW)
Frequency
Frequency and time
One cycle every 20 time units (period)
Why is the period 20?
It's not 8, because its similarity (projection) to a period-8 series (of the same length) is zero.
[Figure: the series multiplied (.) with a period-8 sinusoid gives 0]
Frequency and time
Why is the cycle 20?
It's not 10, because its similarity (projection) to a period-10 series (of the same length) is zero.
[Figure: the series multiplied (.) with a period-10 sinusoid gives 0]
It's not 40, because its similarity (projection) to a period-40 series (of the same length) is zero.
[Figure: the series multiplied (.) with a period-40 sinusoid gives 0]
and so on
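The projection argument can be reproduced in a few lines. The sketch below builds a period-20 sinusoid and takes its inner product with sinusoids of period 8, 10, 20 and 40. The length of 80 is an assumption, chosen so that every tested period fits a whole number of times.

```matlab
N = 80;                          % length is a multiple of every period tested
t = 0:N-1;
x = sin(2*pi*t/20);              % one cycle every 20 time units

periods = [8 10 20 40];
proj = zeros(size(periods));
for k = 1:length(periods)
    s = sin(2*pi*t/periods(k));  % candidate sinusoid of the same length
    proj(k) = x * s';            % inner product = similarity (projection)
end
% proj is (numerically) zero except at period 20, where it equals N/2
```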
Frequency
Fourier transform - Intuition
To find the period, we compared the time series with sinusoids of many different periods
Therefore, a good "description" (or basis) would consist of all these sinusoids
This is precisely the idea behind the discrete Fourier transform
The coefficients capture the similarity (in terms of amplitude and phase) of the series with sinusoids of different periods
Technical details:
We have to ensure we get an orthonormal basis
Real form: sines and cosines at N/2 different frequencies
Complex form: exponentials at N different frequencies
Fourier transform
Real form
For odd-length series,
The pair of bases at frequency fk are
plus the zero-frequency (mean) component
Fourier transform
Real form: Amplitude and phase
Observe that, for any fk, we can write
where
are the amplitude and phase, respectively.
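The equation for the amplitude and phase form is missing from this text. The identity it most likely refers to is the standard one below, given as a sketch: the coefficient names a_k and b_k and the phase symbol \theta_k are assumptions, and the slide may use a different sign convention for the phase.

```latex
% A cosine/sine pair at frequency f_k rewritten as one shifted cosine
a_k \cos(2\pi f_k t) + b_k \sin(2\pi f_k t)
  = r_k \cos(2\pi f_k t - \theta_k)

% Amplitude and phase in terms of the two coefficients
r_k = \sqrt{a_k^2 + b_k^2},
\qquad
\theta_k = \operatorname{atan2}(b_k, a_k)
```

The identity follows from expanding the right-hand side with a_k = r_k \cos\theta_k and b_k = r_k \sin\theta_k.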
Fourier transform
Real form: Amplitude and phase
It is often easier to think in terms of the amplitude rk and the phase at each frequency k, e.g.,
[Figure: example sinusoid plotted over 80 time units]
Fourier transform
Complex form
The equations become easier to handle if we allow the series and the Fourier coefficients Xk to take complex values:
Matlab note: fft omits the scaling factor and is therefore not unitary; however, ifft includes a 1/N scaling factor, so ifft(fft(x)) == x always holds (up to floating-point rounding).
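The scaling remark can be checked directly. In the sketch below, fft followed by ifft returns the original series, and dividing the fft output by sqrt(N) makes the transform length-preserving (Parseval). The example series is the eight-value one used on the coordinate slides.

```matlab
x = [-0.5 4 1.5 -2 2 6 3.5 1];
N = length(x);

X    = fft(x);                  % unscaled DFT
xRec = real(ifft(X));           % ifft applies the 1/N factor

Xu = X / sqrt(N);               % rescaled so the transform is unitary
energyTime = sum(x.^2);
energyFreq = sum(abs(Xu).^2);   % equal to energyTime (Parseval)
```

The real() call only strips the tiny imaginary parts that rounding can leave behind.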
Fourier transform
Example
[Figure: the GBP series approximated with 1, 2, 3, 5, 10 and 20 frequencies]
Other frequency-based transforms
Discrete Cosine Transform (DCT)
Matlab: dct / idct
Modified Discrete Cosine Transform (MDCT)
Overview
1. Introduction and geometric intuition
2. Coordinates and transforms
   Fourier transform (DFT)
   Wavelet transform (DWT)
   Incremental DWT
   Principal components (PCA)
   Incremental PCA
3. Quantized representations
   Piecewise quantized / symbolic
   Vector quantization (VQ) / K-means
4. Non-Euclidean distances
   Dynamic time warping (DTW)
Frequency and time
e.g.,
[Figure: projections of the series onto sinusoids of period 20, period 10, etc.]
What is the cycle now?
No single cycle, because the series isn't exactly similar with any series of the same length.
Frequency and time
Fourier is successful for summarization of series with a few, stable periodic components
However, content is "smeared" across frequencies when there are
Frequency shifts or jumps, e.g.,
Discontinuities (jumps) in time, e.g.,
Frequency and time
If there are discontinuities in time/frequency or frequency shifts, then we should seek an alternate "description" or basis
Main idea: Localize bases in time
Short-time Fourier transform (STFT)
Discrete wavelet transform (DWT)
Frequency and time
Intuition
What if we examined, e.g., eight values at a time?
Can only compare with periods up to eight.
Results may be different for each group (window)
Frequency and time
Intuition
Can adapt to localized phenomena
Fixed window: short-window Fourier (STFT)
How to choose window size?
Variable windows: wavelets
Wavelets
Intuition
Main idea
Use small windows for small periods
Remove high-frequency component, then
Use larger windows for larger periods
Twice as large
Repeat recursively
Technical details
Need to ensure we get an orthonormal basis
Wavelets
Intuition
[Figure: the series analyzed across time and scale (frequency)]
Wavelets
Intuition: Tiling time and frequency
[Figure: three tilings of the time (horizontal) and frequency (vertical) plane: Fourier, DCT, ...; STFT; Wavelets]
19
Tutorial | Time-Series with Matlab
Tutorial | Time-Series with Matlab
Wavelet transform
Pyramid algorithm
[Figure: pyramid algorithm. Starting from the input x, each level applies a high-pass filter to obtain the wavelet coefficients (w1, w2, w3) and a low-pass filter to obtain the smooth (v1, v2, v3); the smooth is fed to the next level.]
Wavelet transforms
General form
A high-pass / low-pass filter pair
  Example: pairwise difference / average (Haar)
  In general: Quadrature Mirror Filter (QMF) pair
  Orthogonal spans, which cover the entire space
  Additional requirements to ensure orthonormality of overall transform
Use to recursively analyze into top / bottom half of frequency band
Wavelet transforms
Other filters: examples
  Haar (Daubechies-1)
  Daubechies-2
  Daubechies-3
  Daubechies-4
Going down the list: better frequency isolation, worse time localization
[Figure: for each filter, the wavelet filter, or mother filter (high-pass), and the scaling filter, or father filter (low-pass)]
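The pyramid algorithm with the Haar pair (pairwise difference / average) can be sketched without any toolbox. This is a minimal illustration under stated assumptions: the toy input, the number of levels, and the sign convention of the difference are choices made here, not taken from the slides. The 1/sqrt(2) scaling is what makes the overall transform orthonormal.

```matlab
x = [8 6 2 4 5 7 9 1]';          % toy input; length must be a power of two
levels = 3;                      % number of pyramid levels (assumed)
v = x;                           % current smooth (low-pass output)
w = cell(levels, 1);             % wavelet (detail) coefficients per level
for l = 1:levels
    odd  = v(1:2:end);
    even = v(2:2:end);
    w{l} = (odd - even) / sqrt(2);   % high pass: scaled pairwise difference
    v    = (odd + even) / sqrt(2);   % low pass: scaled pairwise average
end
coeffs = [v; vertcat(w{end:-1:1})];  % [v3; w3; w2; w1]
% Orthonormality check: the transform preserves energy
energyIn  = sum(x.^2);
energyOut = sum(coeffs.^2);
```

With the Wavelet Toolbox installed, `wavedec(x, 3, 'haar')` is the corresponding built-in call, although its sign and boundary conventions may differ from this sketch.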
Wavelets
Example
[Figure: Wavelet coefficients (GBP, Haar): the GBP series with coefficients W1-W6 and V6. Multi-resolution analysis (GBP, Haar): the GBP series with details D1-D6 and approximation A6.]
Wavelets
Example
[Figure: Wavelet coefficients (GBP, Daubechies-3) and multi-resolution analysis (GBP, Daubechies-3) of the GBP series.]
Wavelets
Example
[Figure: Multi-resolution analysis (GBP, Haar) and multi-resolution analysis (GBP, Daubechies-3): details D1-D6 and approximation A6]
Analysis levels are orthogonal, $D_i \cdot D_j = 0$, for $i \neq j$
Haar analysis: simple, piecewise constant
Daubechies-3 analysis: less artifacting
Wavelets
Matlab
Wavelet GUI: wavemenu
Single level: dwt / idwt
Multiple level: wavedec / waverec
  wmaxlev
Wavelet bases: wavefun
Other wavelets
Only scratching the surface
Wavelet packets
  All possible tilings (binary)
  Best-basis transform
Overcomplete wavelet transform (ODWT), aka. maximum-overlap wavelets (MODWT), aka. shift-invariant wavelets
Further reading:
1. Donald B. Percival, Andrew T. Walden, Wavelet Methods for Time Series Analysis, Cambridge Univ. Press, 2006.
2. Gilbert Strang, Truong Nguyen, Wavelets and Filter Banks, Wellesley College, 1996.
3. Tao Li, Qi Li, Shenghuo Zhu, Mitsunori Ogihara, A Survey of Wavelet Applications in Data Mining, SIGKDD Explorations, 4(2), 2002.
More on wavelets
Signal representation and compressibility
[Figure: Partial energy (GBP) and partial energy (Light): quality (% energy) vs. compression (% coefficients) for Time, FFT, Haar and DB3]
More wavelets
Keeping the highest coefficients minimizes total error (L2-distance)
Other coefficient selection/thresholding schemes for different error metrics (e.g., maximum per-instant error, or L1-distance)
  Typically use Haar bases
Further reading:
1. Minos Garofalakis, Amit Kumar, Wavelet Synopses for General Error Metrics, ACM TODS, 30(4), 2005.
2. Panagiotis Karras, Nikos Mamoulis, One-pass Wavelet Synopses for Maximum-Error Metrics, VLDB 2005.
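Why keeping the largest coefficients minimizes the L2 error can be seen from Parseval's identity: for an orthonormal transform, the squared error equals the energy of the discarded coefficients. The sketch below uses a unitary DFT as the orthonormal transform (an assumption made here to avoid toolbox dependencies; the slides discuss wavelets), with an arbitrary test signal and k.

```matlab
N = 256; k = 10;                           % assumed sizes
t = (0:N-1)';
x = sin(2*pi*t/32) + 0.5*sin(2*pi*t/8) + 0.1*cos(2*pi*t/5);
X = fft(x) / sqrt(N);                      % unitary (orthonormal) DFT
[~, order] = sort(abs(X), 'descend');
Xk = zeros(N, 1);
Xk(order(1:k)) = X(order(1:k));            % keep the k largest coefficients
xk = ifft(Xk) * sqrt(N);                   % reconstruction (may be complex if a
                                           % conjugate pair is split by the cut)
errTime  = sum(abs(x - xk).^2);            % squared L2 error in the time domain
errCoeff = sum(abs(X(order(k+1:end))).^2); % energy of the discarded coefficients
```

Since the error is exactly the discarded energy, any other choice of k coefficients discards at least as much energy and so cannot do better.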
Wavelets
Incremental estimation
[Figure: wavelet coefficients computed incrementally as new values arrive, in the order of a post-order traversal of the coefficient tree]
Wavelets
Incremental estimation
Forward transform:
  O(1) time (amortized); constant factor: filter length
  O(log N) buffer space (total)
  Post-order traversal of wavelet coefficient tree
Inverse transform:
  Same complexity
  Pre-order traversal of wavelet coefficient tree
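The incremental forward transform can be sketched for the Haar filter with one buffered value per level. This is an illustrative sketch, not the authors' implementation: the toy input, the buffer size `maxLevels`, and the use of NaN as an "empty slot" marker are assumptions made here (so the input itself must not contain NaN).

```matlab
x = [8 6 2 4 5 7 9 1];              % values arriving one at a time (toy data)
maxLevels = 16;                     % buffer size; enough for 2^16 values (assumed)
pending = nan(maxLevels, 1);        % one buffered smooth per level: O(log N) space
emitted = [];                       % rows: [level, wavelet coefficient], in emission order
for t = 1:numel(x)
    v = x(t);
    l = 1;
    % Carry upward while a partner is already waiting at this level
    while ~isnan(pending(l))
        w = (pending(l) - v) / sqrt(2);     % Haar high pass
        v = (pending(l) + v) / sqrt(2);     % Haar low pass
        emitted(end+1, :) = [l, w];         %#ok<AGROW>
        pending(l) = NaN;
        l = l + 1;
    end
    pending(l) = v;                 % wait for the next partner at level l
end
```

The carry pattern is the same as incrementing a binary counter, which is where the amortized O(1) time per value comes from; the emitted levels follow the post-order traversal of the coefficient tree.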
Time series collections
Overview
Fourier and wavelets are the most prevalent and successful "descriptions" of time series.
Next, we will consider collections of M time series, each of length N.
  What is the series that is "most similar" to all series in the collection?
  What is the second "most similar", and so on...
Time series collections
Some notation:
  values at time t, $x_t$
  i-th series, $x^{(i)}$
Principal Component Analysis
Example
[Figure: Exchange rates (vs. USD) over time: AUD, BEF, CAD, FRF, DEM, JPY, NLG, NZL, ESP, SEK, CHF, GBP]
[Figure: Principal components 1-4 (u1, u2, u3, u4) over time]
Cumulative contribution of the components: u1 = 48%, + u2 (33%) = 81%, + u3 (11%) = 92%, + u4 (4%) = 96%
"Best" basis: { u1, u2, u3, u4 }
$x^{(2)} = 49.1 u_1 + 8.1 u_2 + 7.8 u_3 + 3.6 u_4 + \epsilon_1$
Principal component analysis
First two principal components
Coefficients of each time series w.r.t. basis { u1, u2, u3, u4 }:
[Figure: scatter plot of the first two coefficients of each exchange-rate series]
Principal Component Analysis
Matrix notation: Singular Value Decomposition (SVD)
$X = U \Sigma V^T$
  $X = [x^{(1)}\; x^{(2)}\; \cdots\; x^{(M)}]$: time series (columns)
  $U = [u_1\; u_2\; \cdots\; u_k]$: basis for time series
  $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_k)$: scaling factors
  $V^T$, with rows $v_1, v_2, \dots, v_k$: basis for measurements (rows)
  $\Sigma V^T$: coefficients w.r.t. basis in U (columns)
Principal component analysis
Properties: Singular Value Decomposition (SVD)
V are the eigenvectors of the covariance matrix $X^T X$, since $X^T X = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T$
U are the eigenvectors of the Gram (inner-product) matrix $X X^T$, since $X X^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T$
Further reading:
1. Ian T. Jolliffe, Principal Component Analysis (2nd ed), Springer, 2002.
2. Gilbert Strang, Linear Algebra and Its Applications (4th ed), Brooks Cole, 2005.
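Both eigenvector properties are easy to verify numerically with `svd`. The sketch below uses small synthetic series as columns of X (the sizes and signals are assumptions for illustration only) and checks that V diagonalizes X'X and U diagonalizes XX', with the squared singular values as eigenvalues.

```matlab
N = 50; M = 4;                          % assumed sizes: M series of length N
t = (1:N)';
X = [sin(t/5), cos(t/7), 0.5*sin(t/5) + 0.1*cos(t/3), t/N];   % synthetic columns
[U, S, V] = svd(X, 'econ');             % X = U*S*V'
sigma2 = diag(S).^2;                    % squared singular values

reconErr = norm(X - U*S*V', 'fro');
errV = norm(X'*X*V - V*diag(sigma2), 'fro');   % V: eigenvectors of X'X
errU = norm(X*X'*U - U*diag(sigma2), 'fro');   % U: eigenvectors of XX'
energy = cumsum(sigma2) / sum(sigma2);  % fraction of energy in the first k components
```

The `energy` vector plays the role of the cumulative percentages quoted for the exchange-rate example: it shows how many components are needed for a given reconstruction accuracy.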
Kernels and KPCA
What are kernels?
  Usual definition of inner product w.r.t. vector coordinates is $x \cdot y = \sum_i x_i y_i$
  However, other definitions that preserve the fundamental properties are possible
Why kernels?
  We no longer have explicit "coordinates"
    Objects do not even need to be numeric
  But we can still talk about distances and angles
  Many algorithms rely just on these two concepts
Further reading:
1. Bernhard Schölkopf, Alexander J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, MIT Press, 2001.
Multidimensional scaling (MDS)
Kernels are still "Euclidean" in some sense
  We still have a Hilbert (inner-product) space, even though it may not be the space of the original data
For arbitrary similarities, we can still find the eigendecomposition of the similarity matrix
  Multidimensional scaling (MDS)
  Maps arbitrary metric data into a low-dimensional space
[Figure: Exchange rates mapped by MDS: NZL, CAD, AUD, SEK, ESP, GBP, FRF, BEF, DEM, NLG, CHF, JPY]
Principal components
Matlab
pcacov
princomp
[U, S, V] = svd(X)
[U, S, V] = svds(X, k)
PCA on sliding windows
Empirical orthogonal functions (EOF), aka. Singular Spectrum Analysis (SSA)
If the series is stationary, then it can be shown that
  The eigenvectors of its autocovariance matrix are the Fourier bases
  The principal components are the Fourier coefficients
Further reading:
1. M. Ghil, et al., Advanced Spectral Methods for Climatic Time Series, Rev. Geophys., 40(1), 2002.
Principal components
Incremental estimation
PCA via SVD on $X \in \mathbb{R}^{N \times M}$, recap:
  Singular values $\Sigma \in \mathbb{R}^{k \times k}$ (diagonal)
    Energy / reconstruction accuracy
  Left singular vectors $U \in \mathbb{R}^{N \times k}$
    Basis for time series
    Eigenvectors of Gram matrix $XX^T$
  Right singular vectors $V \in \mathbb{R}^{M \times k}$
    Basis for measurements' space
    Eigenvectors of covariance matrix $X^TX$
Principal components
Incremental estimation
[Figure: the decomposition $X = U \Sigma V^T$ annotated with the recap items: singular values, left singular vectors U (basis for time series), right singular vectors V (basis for measurements' space)]
Principal components
Incremental estimation: Example
[Figure: first series x(1), between 20°C and 30°C; first three values highlighted among the other values]
Principal components
Incremental estimation: Example
[Figure: first series x(1) and second series x(2), between 20°C and 30°C; first three values highlighted among the other values]
Principal components
Incremental estimation: Example
Correlations:
Let's take a closer look at the first three measurement-pairs
[Figure: scatter plot of Series x(1) vs. Series x(2), 20°C to 30°C]
Principal components
Incremental estimation: Example
First three lie (almost) on a line in the space of measurement-pairs
  O(M) numbers for the slope, and
  One number for each measurement-pair (offset on line = PC)
[Figure: scatter plot of Series x(1) vs. Series x(2) with the fitted line; the offset along the line is the principal component]
Principal components
Incremental estimation: Example
Other pairs also follow the same pattern: they lie (approximately) on this line
Principal components
Incremental estimation: Example
For each new point
  Project onto current line
  Estimate error
[Figure: a new value in the Series x(1) vs. Series x(2) plane and its error relative to the current line]
Principal components
Incremental estimation: Example (update)
For each new point
  Project onto current line
  Estimate error
  Rotate line in the direction of the error and in proportion to its magnitude
O(M) time
Principal components
Incremental estimation: Example (update)
For each new point
  Project onto current line
  Estimate error
  Rotate line in the direction of the error and in proportion to its magnitude
Principal components
Incremental estimation: Example
The "line" is the first principal component (PC) direction
This line is optimal: it minimizes the sum of squared projection errors
Principal components
Incremental estimation: Update equations
For each new point $x_t$ and for $j = 1, \dots, k$:
  $y_j := v_j^T x_t$  (proj. onto $v_j$)
  $\sigma_j^2 \leftarrow \sigma_j^2 + y_j^2$  (energy $\propto$ j-th eigenval.)
  $e_j := x_t - y_j v_j$  (error)
  $v_j \leftarrow v_j + (1/\sigma_j^2)\, y_j e_j$  (update estimate)
  $x_t \leftarrow x_t - y_j v_j$  (repeat with remainder)
[Figure: $x_t$, its projection $y_1$ onto $v_1$, the error $e_1$, and the updated $v_1$]
Principal components
Incremental estimation: Complexity
O(Mk) space (total) and time (per tuple), i.e.,
  Independent of # points
  Linear w.r.t. # streams (M)
  Linear w.r.t. # principal components (k)
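The update loop can be sketched as follows. This is a minimal illustration under stated assumptions: the two synthetic "temperature" streams, the initial direction, and the small initial energy value are choices made here, and no forgetting factor is used. It is a sketch of the project / estimate error / rotate steps, not the authors' code.

```matlab
M = 2; k = 1;                       % two streams, one principal direction (assumed)
T = 2000;
t = (1:T)';
s = 25 + 5*sin(2*pi*t/200);         % hidden temperature-like signal (synthetic)
Xs = [s, 0.5*s];                    % both streams follow the same pattern
V = eye(M, k);                      % initial guess for the directions v_j
sigma2 = 1e-3 * ones(k, 1);         % small initial energy estimates
for n = 1:T
    x = Xs(n, :)';                  % new point: one value per stream
    for j = 1:k
        y = V(:, j)' * x;                        % project onto v_j
        sigma2(j) = sigma2(j) + y^2;             % update energy estimate
        e = x - y * V(:, j);                     % projection error
        V(:, j) = V(:, j) + (y / sigma2(j)) * e; % rotate v_j toward the error
        x = x - y * V(:, j);                     % remainder for the next direction
    end
end
v1 = V(:, 1) / norm(V(:, 1));       % normalized estimate of the first PC direction
```

Each new point costs O(Mk) operations and only V and sigma2 are stored, so the cost does not depend on how many points have been seen.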
Principal components
Incremental estimation: Applications
Incremental PCs (measurement space)
  Incremental tracking of correlations
  Forecasting / imputation
  Change detection
Further reading:
1. Sudipto Guha, Dimitrios Gunopulos, Nick Koudas, Correlating synchronous and asynchronous data streams, KDD 2003.
2. Spiros Papadimitriou, Jimeng Sun, Christos Faloutsos, Streaming Pattern Discovery in Multiple Time-Series, VLDB 2005.
3. Matthew Brand, Fast Online SVD Revisions for Lightweight Recommender Systems, SDM 2003.
Piecewise constant (APCA)
So far our "windows" were pre-determined
  DFT: Entire series
  STFT: Single, fixed window
  DWT: Geometric progression of windows
Within each window we sought fairly complex patterns (sinusoids, wavelets, etc.)
Next, we will allow any window size, but constrain the pattern within each window to the simplest possible (mean)
Piecewise constant
Example
[Figure: series approximated by APCA with k=10, k=21 and k=41 segments]
Piecewise constant (APCA)
Divide series into k segments with endpoints $t_1 < t_2 < \dots < t_{k+1}$
  Constant length: PAA
  Variable length: APCA
Represent all points within one segment with their average $m_j$, $1 \le j \le k$, thus minimizing $\sum_{j=1}^{k} \sum_{t_j \le t < t_{j+1}} (x_t - m_j)^2$
Single-level Haar smooths, if $t_{j+1} - t_j = 2^{\ell}$, for all $1 \le j \le k$
Further reading:
1. Kaushik Chakrabarti, Eamonn Keogh, Sharad Mehrotra, Michael Pazzani, Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases, TODS, 27(2), 2002.
Piecewise constant
Example
[Figure: APCA (k=10); APCA (k=21) / Haar (level 7, 21 coeffs); APCA (k=41) / Haar (level 6, 41 coeffs)]
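The constant-length case (PAA) reduces to a reshape and a column mean. The sketch below is illustrative: the toy series and k are assumptions, and the series length is assumed to be divisible by k. Variable-length APCA needs an additional segmentation step that is not shown here.

```matlab
x = [1 3 2 4 10 12 11 9 5 5 6 4]';   % toy series (length divisible by k)
k = 3;                               % number of equal-length segments (assumed)
segLen = numel(x) / k;
m = mean(reshape(x, segLen, k), 1)'; % one mean per segment
xhat = kron(m, ones(segLen, 1));     % piecewise-constant reconstruction
sse = sum((x - xhat).^2);            % squared error, minimized by the segment means
```

For this toy series the segment means are 2.5, 10.5 and 5, and the squared error is 12.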
Piecewise constant
Example
[Figure: APCA (k=10); APCA (k=21) / Haar (level 7, 21 coeffs); APCA / Haar (top-21 out of 7 levels)]
Piecewise constant
Example
[Figure: APCA (k=10); APCA (k=21) / Haar (level 7, 21 coeffs); APCA (k=15) / Daubechies-3 (level 7, 15 coeffs)]
k/h-segmentation
Again, divide the series into k segments (variable length)
For each segment choose one of h quantization levels to represent all points
  Now, $m_j$ can take only $h \le k$ possible values
APCA = k/k-segmentation (h = k)
Further reading:
1. Aristides Gionis, Heikki Mannila, Finding Recurrent Sources in Sequences, Recomb 2003.
Symbolic aggregate approximation (SAX)
Quantization of values
Segmentation of time based on these quantization levels
More in next part
K-means / Vector quantization (VQ)
APCA considers one time series and
  Groups time instants
  Approximates them via their (scalar) mean
Vector Quantization / K-means applies to a collection of M time series (of length N)
  Groups time series
  Approximates them via their (vector) mean
K-means
Partitions the time series $x^{(1)}, \dots, x^{(M)}$ into k groups, $I_j$, for $1 \le j \le k$.
All time series in the j-th group, $1 \le j \le k$, are represented by their centroid, $m_j$.
Objective is to choose $m_j$ so as to minimize the overall squared distortion,
$D := \sum_{j=1}^{k} \sum_{i \in I_j} \| x^{(i)} - m_j \|^2$
1-D on values + contiguity requirement: APCA
K-means
Objective implies that, given $I_j$, for $1 \le j \le k$,
$m_j = \frac{1}{|I_j|} \sum_{i \in I_j} x^{(i)}$
i.e., $m_j$ is the vector mean of all points in cluster j.
[Figure: two clusters of points with centroids m1 and m2]
K-means
1. Start with arbitrary cluster assignment.
2. Compute centroids.
3. Re-assign to clusters based on new centroids.
4. Repeat from (2), until no improvement.
Finds local optimum of D.
Matlab: [idx, M] = kmeans(X, k)
K-means
Example
[Figure: Exchange rates plotted by their first two PC coefficients, with the K-means clusters and centroids for k=2 and k=4]
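The four steps can be written out directly, without the Statistics Toolbox. This is a minimal sketch on toy data: the points, the initial assignment, and the iteration cap are assumptions, objects are stored as columns here (the toolbox `kmeans` expects them as rows), and empty clusters are not handled.

```matlab
% Columns of X are the objects to cluster (toy 2-D data)
X = [0.0 0.2 0.1 5.0 5.2 5.1;
     0.0 0.1 0.3 5.0 4.9 5.2];
k = 2;
idx = [1 2 1 2 1 2];                     % step 1: arbitrary initial assignment
for iter = 1:100
    % Step 2: centroids are the vector means of each cluster
    Mc = zeros(size(X, 1), k);
    for j = 1:k
        Mc(:, j) = mean(X(:, idx == j), 2);
    end
    % Step 3: re-assign each object to its nearest centroid
    d2 = zeros(k, size(X, 2));
    for j = 1:k
        diffs = X - repmat(Mc(:, j), 1, size(X, 2));
        d2(j, :) = sum(diffs.^2, 1);
    end
    [~, newIdx] = min(d2, [], 1);
    % Step 4: stop when the assignment no longer changes
    if isequal(newIdx, idx), break; end
    idx = newIdx;
end
D = sum(min(d2, [], 1));                 % overall squared distortion
```

A different initial assignment can lead to a different local optimum of D, which is why the toolbox function offers multiple restarts.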
K-means in other coordinates
An orthonormal transform (e.g., DFT, DWT, PCA) preserves distances.
K-means can be applied in any of these "coordinate systems".
Can transform data to speed up distance computations (if N large)
K-means and PCA
Further reading:
1. Hongyuan Zha, Xiaofeng He, Chris H.Q. Ding, Ming Gu, Horst D. Simon, Spectral Relaxation for K-means Clustering, NIPS 2001.
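Distance preservation is quick to confirm for the DFT, provided it is scaled to be unitary (Matlab's `fft` is not normalized by default). The signals below are arbitrary assumptions used only for the check.

```matlab
N = 64;
t = (0:N-1)';
x = sin(2*pi*t/16);
y = cos(2*pi*t/8) + 0.1*t/N;
F = @(z) fft(z) / sqrt(N);            % unitary DFT (orthonormal transform)
dTime = norm(x - y);                  % Euclidean distance in the time domain
dFreq = norm(F(x) - F(y));            % same distance in DFT coordinates
```

Because the two distances agree, clustering in the transformed coordinates gives the same result; the speed-up comes from keeping only a few coefficients, which then gives an approximation.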
Dynamic time warping (DTW)
So far we have been discussing shapes via various kinds of "features" or "descriptions" (bases)
However, the series were always fixed
Dynamic time warping:
  Allows local deformations (stretch/shrink)
  Can thus also handle series of different lengths
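Dynamic time warping (DTW)
Euclidean (L2) distance is
$d^2(x, y) = \sum_{i=1}^{N} (x_i - y_i)^2$
or, recursively,
$d^2(x_{1:i}, y_{1:i}) = (x_i - y_i)^2 + d^2(x_{1:i-1}, y_{1:i-1})$
Dynamic time warping distance is
$d^2_{DTW}(x_{1:i}, y_{1:j}) = (x_i - y_j)^2 + \min\{ d^2_{DTW}(x_{1:i-1}, y_{1:j}),\; d^2_{DTW}(x_{1:i}, y_{1:j-1}),\; d^2_{DTW}(x_{1:i-1}, y_{1:j-1}) \}$
where $x_{1:i}$ is the subsequence $(x_1, \dots, x_i)$
Dynamic time warping (DTW)
Each cell c = (i,j) is a pair of indices whose corresponding values will be compared, $(x_i - y_j)^2$, and included in the sum for the distance
Euclidean path:
  i = j always
  Ignores off-diagonal cells
[Figure: grid of cells with x[1:i] on one axis and y[1:j] on the other; one side of the diagonal corresponds to shrink x / stretch y, the other to stretch x / shrink y]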
Dynamic time warping (DTW)
DTW allows any path
Examine all paths:
[Figure: cell (i, j) is reached from (i-1, j), (i-1, j-1) or (i, j-1), on the grid of x[1:i] vs. y[1:j]; shrink x / stretch y, stretch x / shrink y]
Standard dynamic programming to fill in table; top-right cell contains final result
Dynamic time-warping
Fast estimation
Standard dynamic programming: $O(N^2)$
Envelope-based technique
  Introduced by [Keogh 2000 & 2002]
  Multi-scale, wavelet-like technique and formalism by [Salvador et al. 2004] and, independently, by [Sakurai et al. 2005]
Further reading:
1. Eamonn J. Keogh, Exact Indexing of Dynamic Time Warping, VLDB 2002.
2. Stan Salvador, Philip Chan, FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space, TDM 2004.
3. Yasushi Sakurai, Masatoshi Yoshikawa, Christos Faloutsos, FTW: Fast Similarity Under the Time Warping Distance, PODS 2005.
Dynamic time warping
Fast estimation: Summary
Create lower-bounding distance on coarser granularity, either at
  Single scale
  Multiple scales
Use to prune search space
[Figure: coarse grid over the table of x[1:i] vs. y[1:j]]
Non-Euclidean metrics
More in part 3
PART C: Similarity Search and Applications
Timeline of part C
  Introduction
  Time-Series Representations
  Distance Measures
  Lower Bounding
  Clustering/Classification/Visualization
  Applications
Applications (Image Matching)
Many types of data can be converted to time-series
[Figure: Image → Color Histogram → Time-Series, for images grouped into Cluster 1 and Cluster 2]
Applications (Shapes)
Recognize type of leaf based on its shape
[Figure: leaves of Ulmus carpinifolia, Acer platanoides, Salix fragilis, Tilia, Quercus robur]
Convert perimeter into a sequence of values
Special thanks to A. Ratanamahatana & E. Keogh for the leaf video.
Applications (Motion Capture)
Motion-Capture (MOCAP) Data (Movies, Games)
  Track position of several joints over time
  3*17 joints = 51 parameters per frame
"MOCAP data... my precious..."
Applications (Video)
Video-tracking / Surveillance
  Visual tracking of body features (2D time-series)
  Sign Language recognition (3D time-series)
Video Tracking of body feature over time (Athens1, Athens2)
Time-Series and Matlab
Time-series can be represented as vectors or arrays
Fast vector manipulation
  Most linear operations (e.g., Euclidean distance, correlation) can be trivially vectorized
Easy visualization
Many built-in functions
Specialized Toolboxes
PART II: Time Series Matching
Introduction
"Becoming sufficiently familiar with something is a substitute for understanding it."
Basic Data-Mining problem
Today's databases are becoming too large. Search is difficult. How can we overcome this obstacle?
Basic structure of data-mining solution:
  Represent data in a new format
  Search few data in the new representation
  Examine even fewer original data
  Provide guarantees about the search results
  Provide some type of data/result visualization
Basic Time-Series Matching Problem
Database with time-series:
  Medical sequences
  Images, etc.
  Sequence length: 100-1000 pts
  DB size: 1 TByte
Linear Scan:
  Objective: Compare the query with all sequences in DB and return the k most similar sequences to the query.
[Figure: a query and database sequences ranked by distance: D = 7.3, D = 10.2, D = 11.8, D = 17, D = 22]
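A linear scan is the baseline that the later techniques try to beat. The sketch below is illustrative only: the toy database, the query and k are assumptions, and sequences are stored as columns.

```matlab
% Toy database: each column is one sequence (synthetic, for illustration)
t = linspace(0, 2*pi, 100)';
DB = [sin(t), cos(t), sin(2*t), sin(t) + 0.05*cos(5*t), t/(2*pi)];
q = sin(t) + 0.01;                     % query
k = 2;                                 % number of matches to return
% Euclidean distance from the query to every sequence (linear scan)
dists = sqrt(sum((DB - repmat(q, 1, size(DB, 2))).^2, 1));
[sortedD, order] = sort(dists);
nearest = order(1:k);                  % indices of the k most similar sequences
```

Every sequence is touched once, so the cost grows linearly with the database size; for a terabyte-scale database this is what makes compressed representations and lower bounds attractive.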
What other problems can we solve?
Clustering: Place time-series into "similar" groups
Classification: To which group is a time-series most "similar"?
[Figure: a query sequence and candidate groups marked "?"]
Hierarchical Clustering
Very generic & powerful tool
Provides visual data grouping
Pairwise distances: matrix with entries D1,1, D2,1, ..., DM,N
1. Merge objects with smallest distance
2. Reevaluate distances
3. Repeat process
Z = linkage(D); H = dendrogram(Z);
Partitional Clustering
Faster than hierarchical clustering
Typically provides suboptimal solutions (local minima)
Not good performance for high dimensions
K-Means Algorithm:
1. Initialize k clusters (k specified by user) randomly.
2. Repeat until convergence
  1. Assign each object to the nearest cluster center.
  2. Re-estimate cluster centers.
See: kmeans
K-Means Demo
[Figure: K-means iterations on 2-D points]
K-Means Clustering for Time-Series
So how is k-Means applied to time-series, which are high-dimensional?
Perform k-Means on a compressed dimensionality
[Figure: original sequences → compressed sequences → clustering space]
Classification
Typically classification can be made easier if we have clustered the objects
Project query in the new space and find its closest cluster
So, query Q is more similar to class B
[Figure: clusters Class A and Class B in the clustering space, with the projected query Q]
Nearest Neighbor Classification
We need not perform clustering before classification. We can classify an object based on the class majority of its nearest neighbors/matches.
Example
[Figure: Elves and Hobbits plotted by Hair Length vs. Height, both on a scale of 1 to 10]
What do we need?
1. Define Similarity
2. Search fast
  Dimensionality Reduction (compress data)
Notion of Similarity I
Solution to any time-series problem boils down to a proper definition of *similarity*
Similarity is always subjective (i.e., it depends on the application)
PART II: Time Series Matching
Similarity/Distance functions
"All models are wrong, but some are useful"
Notion of Similarity II
Similarity depends on the features we consider (i.e., how we will describe or compress the sequences)
Metric and Non-metric Distance Functions
Distance functions
  Metric: Euclidean Distance, Correlation
  Non-Metric: Time Warping, LCSS
Properties
  Positivity: $d(x,y) \ge 0$ and $d(x,y) = 0$ if $x = y$
  Symmetry: $d(x,y) = d(y,x)$
  Triangle Inequality: $d(x,z) \le d(x,y) + d(y,z)$
If any of these is not obeyed then the distance is a non-metric
Triangle Inequality
Triangle Inequality: $d(x,z) \le d(x,y) + d(y,z)$
[Figure: triangle with vertices x, y, z]
Metric distance functions can exploit the triangle inequality to speed-up search
Intuitively, if:
  - x is similar to y and,
  - y is similar to z, then,
  - x is similar to z too.
Triangle Inequality (Importance)
Triangle Inequality: $d(x,z) \le d(x,y) + d(y,z)$
[Figure: query Q and database sequences A, B, C]
Precomputed pairwise distances:
      A    B    C
  A   0   20  110
  B  20    0   90
  C 110   90    0
Assume: d(Q,bestMatch) = 20 and d(Q,B) = 150
Then, since d(A,B) = 20:
  $d(Q,A) \ge d(Q,B) - d(B,A)$
  $d(Q,A) \ge 150 - 20 = 130$
So we don't have to retrieve A from disk
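The pruning rule can be written as a small check on precomputed distances. The sketch uses the numbers from the slide; the variable names and the loop over candidates are illustrative assumptions.

```matlab
% Pairwise distances between database objects A, B, C (values from the slide)
Dpre = [  0  20 110;
         20   0  90;
        110  90   0];
dQB  = 150;          % distance from query Q to B, already computed
best = 20;           % best match found so far, d(Q, bestMatch)
iB = 2;              % index of B
canPrune = false(1, 3);
for i = [1 3]        % candidates A and C
    lowerBound = abs(dQB - Dpre(iB, i));   % d(Q,X) >= |d(Q,B) - d(B,X)|
    canPrune(i) = lowerBound > best;       % cannot beat the best match
end
```

Here A is pruned with a lower bound of 130 and C with a lower bound of 60, so neither needs to be retrieved from disk. This reasoning is only valid for metric distances.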
Non-Metric Distance Functions
Bat similar to batman
Batman similar to man
Man similar to bat??
Matching flexibility
  Robustness to outliers
  Stretching in time/space
  Support for different sizes/lengths
Speeding-up search can be difficult
Euclidean Distance
Most widely used distance measure
Definition: $L_2 = \sqrt{\sum_{i=1}^{n} (a[i] - b[i])^2}$
L2 = sqrt(sum((a-b).^2)); % return Euclidean distance
[Figure: two sequences of length 100 compared point by point]
Euclidean Distance (Vectorization)
Question: If I want to compare many sequences to each other do I have to use a for-loop?
Answer: No, one can use the following equation to perform matrix computations only
||A-B|| = sqrt( ||A||^2 + ||B||^2 - 2*A.B )
  A: DxM matrix (M sequences of length D)
  B: DxN matrix
  Result is MxN matrix with entries D1,1, D2,1, ..., DM,N
aa = sum(a.*a); bb = sum(b.*b); ab = a'*b;
d = sqrt(repmat(aa',[1 size(bb,2)]) + repmat(bb,[size(aa,2) 1]) - 2*ab);
Data Preprocessing (Baseline Removal)
[Figure: sequences A and B with the average value of A and the average value of B marked]
a = a - mean(a);
Data Preprocessing (Rescaling)
a = a ./ std(a);
Dynamic Time-Warping (Motivation)
Euclidean distance cannot compensate for small distortions in the time axis.
[Figure: sequences A, B, C]
According to Euclidean distance B is more similar to A than to C
Solution: Allow for compression & decompression in time
Dynamic Time-Warping
First used in speech recognition for recognizing words spoken at different speeds
---Maat--llaabb------------------
----Mat-lab--------------------------
Same idea can work equally well for generic time-series data
Dynamic Time-Warping (how does it work?)
The intuition is that we copy an element multiple times so as to achieve a better matching
Euclidean distance (one-to-one linear alignment):
  T1 = [1, 1, 2, 2]
  T2 = [1, 2, 2, 2]
  d = 1
Warping distance (one-to-many non-linear alignment):
  T1 = [1, 1, 2, 2]
  T2 = [1, 2, 2, 2]
  d = 0
Dynamic Time-Warping (implementation)
It is implemented using dynamic programming. Create an array that stores all solutions for all possible subsequences.
Recursive equation:
$c(i,j) = D(A_i, B_j) + \min\{ c(i-1, j-1),\; c(i-1, j),\; c(i, j-1) \}$
Dynamic Time-Warping (Examples)
So does it work better than Euclidean? Well yes! After all it is more costly..
[Figure: groupings of 20 sequences from the MIT arrhythmia database under Dynamic Time Warping and under Euclidean Distance]
Dynamic Time-Warping (Can we speed it up?)
Complexity is O(n^2). We can reduce it to O(δn), where δ is the width of the allowed warping window, simply by restricting the warping path.
We now fill only a small portion of the array.
[Figure: sequence A with its Minimum Bounding Envelope (MBE)]
Dynamic Time-Warping (restricted warping)
The restriction of the warping path helps:
A. Speed up execution
B. Avoid extreme (degenerate) matchings
C. Improve clustering/classification accuracy
[Figure: classification accuracy against warping length for the Camera-Mouse and Australian Sign Language datasets]
10% warping is adequate.
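A restricted warping path can be sketched by filling only the cells within a band around the diagonal. The sequences below are synthetic, the band is symmetric with half-width w (10% of the length, following the remark above), and both sequences are assumed to have the same length; all of these are illustrative choices.

```matlab
% Banded DTW sketch (assumptions: equal lengths, squared point cost).
t = linspace(0, 2*pi, 50);
a = sin(t);
b = sin(t - 0.3);                    % same shape, slightly shifted in time
n = numel(a);
w = ceil(0.10 * n);                  % warping window: 10% of the length

c = inf(n+1, n+1);
c(1,1) = 0;
for i = 1:n
    for j = max(1, i-w):min(n, i+w)  % only cells inside the band are filled
        cost = (a(i) - b(j))^2;
        c(i+1,j+1) = cost + min([c(i,j), c(i,j+1), c(i+1,j)]);
    end
end

dtwBand = sqrt(c(n+1,n+1));
euclid  = sqrt(sum((a - b).^2));
filled  = nnz(isfinite(c)) - 1;      % number of cells actually computed
```

Because the diagonal always lies inside the band, dtwBand can never exceed the Euclidean distance, and at most n*(2w+1) of the n^2 cells are filled.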
Longest Common Subsequence (LCSS)
With Time Warping, extreme values (outliers) can destroy the distance estimates. The LCSS model can offer more resilience to noise and impose spatial constraints too.
Matching within time and in space: everything that is outside the bounding envelope can never be matched.
[Figure: two sequences with the matched regions marked "match"; the majority of the noise is ignored]
Longest Common Subsequence (LCSS)
LCSS is more resilient to noise than DTW.
Disadvantages of DTW:
A. All points are matched
B. Outliers can distort distance
C. One-to-many mapping
Advantages of LCSS:
A. Outlying values not matched
B. Distance/Similarity distorted less
C. Constraints in time & space
Longest Common Subsequence (Implementation)
Similar dynamic programming solution as DTW, but now we measure similarity, not distance.
Can also be expressed as distance.
Distance Measure Comparison
Dataset        Method      Time (sec)   Accuracy
Camera-Mouse   Euclidean   34           20%
Camera-Mouse   DTW         237          80%
Camera-Mouse   LCSS        210          100%
ASL            Euclidean   2.2          33%
ASL            DTW         9.1          44%
ASL            LCSS        8.2          46%
ASL+noise      Euclidean   2.1          11%
ASL+noise      DTW         9.3          15%
ASL+noise      LCSS        8.3          31%
LCSS offers enhanced robustness under noisy conditions.
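The dynamic programming solution for LCSS can be sketched as follows. The thresholds epsilon (matching in space) and delta (matching in time), the example values, and the normalization by the shorter length are assumptions made for illustration; the slides do not fix them.

```matlab
% LCSS sketch (assumptions: epsilon/delta thresholds, made-up sequences).
a = [1 2 3 4 100 5 6];       % 100 is an outlier
b = [1 2 3 4 5 6 7];
epsilon = 0.5;               % values match if they differ by less than epsilon
delta   = 2;                 % ...and their positions differ by at most delta
n = numel(a);  m = numel(b);

L = zeros(n+1, m+1);         % L(i+1,j+1) = LCSS of a(1:i) and b(1:j)
for i = 1:n
    for j = 1:m
        if abs(a(i) - b(j)) < epsilon && abs(i - j) <= delta
            L(i+1,j+1) = L(i,j) + 1;                  % match: extend the subsequence
        else
            L(i+1,j+1) = max(L(i,j+1), L(i+1,j));     % no match: skip one point
        end
    end
end

lcssLen = L(n+1,m+1);
sim  = lcssLen / min(n, m);  % similarity in [0,1]
dist = 1 - sim;              % the same quantity expressed as a distance
```

Here the outlier is simply left unmatched: six of the seven points match, so sim = 6/7.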
Distance Measure Comparison (Overview)
Method      Complexity   Elastic Matching   One-to-one Matching   Noise Robustness
Euclidean   O(n)         no                 yes                   no
DTW         O(n*δ)       yes                no                    no
LCSS        O(n*δ)       yes                yes                   yes
PART II: Time Series Matching - Lower Bounding
Basic Time-Series Problem Revisited
Objective: Instead of comparing the query to the original sequences (Linear Scan/LS), let's compare the query to simplified versions of the DB time-series.
This DB can typically fit in memory.
Compression - Dimensionality Reduction
Project all sequences into a new space, and search this space instead (e.g. project time-series from 100-D space to 2-D space).
[Figure: sequences A, B, C and the query plotted as points on the axes Feature 1 and Feature 2]
One can also organize the low-dimensional points into a hierarchical index structure. In this tutorial we will not go over indexing techniques.
Question: When searching the original space it is guaranteed that we will find the best match. Does this hold (or under which circumstances) in the new compressed space?
Concept of Lower Bounding
You can guarantee similar results to Linear Scan in the original dimensionality, as long as you provide a Lower Bounding (LB) function (in low dim.) to the original distance (high dim.). This is the GEMINI (GEneric Multimedia INdexIng) framework.
So, for projection from high dim. (N) to low dim. (n): A → a, B → b, etc.
$D_{LB}(a,b) \le D_{true}(A,B)$
Example: find everything within range of 1 from A.
[Figure: points A to F in 2D. Projection onto the X-axis produces a false alarm (not a problem). Projection on some other axis produces a false dismissal (bad!).]
Generic Search using Lower Bounding
[Figure: the simplified query is compared against the simplified DB, which returns an answer superset; the superset is verified against the original DB with the original query to obtain the final answer set]
Lower Bounding Example
[Figure, shown over four slides: a query compared against nine database sequences, first through the lower bounds and then through the true distances]
Sequence   Lower Bound   True Distance
1          4.6399        46.7790
2          37.9032       108.8856
3          19.5174       113.5873
4          72.1846       104.5062
5          67.1436       119.4087
6          78.0920       120.0066
7          70.9273       111.6011
8          63.7253       119.0635
9          1.4121        17.2540 (BestSoFar)
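The BestSoFar idea can be sketched as a nearest-neighbor search that computes the true distance only when the lower bound is still below the best match found so far. Everything in this example is an illustrative assumption: the database is synthetic, and the lower bound is the Euclidean distance over the first d coordinates (a projection onto a few axes), which can never exceed the full Euclidean distance.

```matlab
% 1-NN search with lower-bound pruning (synthetic data, hypothetical setup).
t  = linspace(0, 1, 64);
DB = zeros(20, 64);
for i = 1:20
    DB(i,:) = sin(2*pi*i*t/4) + 0.1*i;        % 20 synthetic sequences
end
q = DB(7,:) + 0.05*cos(2*pi*3*t);             % query: perturbed copy of sequence 7
d = 8;                                        % size of the low-dimensional projection

best = inf;  bestIdx = 0;  numTrue = 0;
for i = 1:size(DB,1)
    lb = norm(q(1:d) - DB(i,1:d));            % cheap lower bound
    if lb < best                              % otherwise the sequence is pruned
        trueDist = norm(q - DB(i,:));         % expensive true distance
        numTrue  = numTrue + 1;
        if trueDist < best
            best = trueDist;  bestIdx = i;    % new BestSoFar
        end
    end
end

% Linear scan for comparison: it must return the same answer.
[scanDist, scanIdx] = min(sqrt(sum((DB - repmat(q, size(DB,1), 1)).^2, 2)));
```

Since the bound never overestimates, pruning cannot discard the true nearest neighbor; numTrue counts how many true distances were actually needed.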
Lower Bounding the Euclidean distance
There are many dimensionality reduction (compression) techniques for time-series data. The following ones can be used to lower bound the Euclidean distance:
DFT, DWT, SVD, APCA, PAA, PLA
[Figure by Eamonn Keogh, Time-Series Tutorial: the same sequence approximated with each of the six techniques]
Fourier Decomposition
Decompose a time-series into a sum of sine waves.
DFT: $X(f_k) = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}$
IDFT: $x(n) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} X(f_k)\, e^{j 2\pi k n / N}$
"Every signal can be represented as a superposition of sines and cosines" (...alas, nobody believes me)
Fourier Decomposition
Example: a sequence x(n) of length 25 and its Fourier coefficients X(f).
x(n): -0.4446, -0.9864, -0.3254, -0.6938, -0.1086, -0.3470, 0.5849, 1.5927, -0.9430, -0.3037, -0.7805, -0.1953, -0.3037, 0.2381, 2.8389, -0.7046, -0.5529, -0.6721, 0.1189, 0.2706, -0.0003, 1.3976, -0.4987, -0.2387, -0.7588
X(f): -0.3633, -0.6280 + 0.2709i, -0.4929 + 0.0399i, -1.0143 + 0.9520i, 0.7200 - 1.0571i, -0.0411 + 0.1674i, -0.5120 - 0.3572i, 0.9860 + 0.8043i, -0.3680 - 0.1296i, -0.0517 - 0.0830i, -0.9158 + 0.4481i, 1.1212 - 0.6795i, 0.2667 + 0.1100i, 0.2667 - 0.1100i, 1.1212 + 0.6795i, -0.9158 - 0.4481i, -0.0517 + 0.0830i, -0.3680 + 0.1296i, 0.9860 - 0.8043i, -0.5120 + 0.3572i, -0.0411 - 0.1674i, 0.7200 + 1.0571i, -1.0143 - 0.9520i, -0.4929 - 0.0399i, -0.6280 - 0.2709i
Fourier Decomposition
How much space do we gain by compressing random walk data?
[Figure: reconstruction using 1 coefficient]
1 coeff > 60% of energy
10 coeff > 90% of energy
fa = fft(a);                  % Fourier decomposition
fa(6:end-4) = 0;              % keep first 5 coefficients (low frequencies) and their conjugates
reconstr = real(ifft(fa));    % reconstruct signal
"Life is complex, it has both real and imaginary parts."
[Figures: reconstructions using 2, 7 and 20 coefficients, followed by plots of the reconstruction error and of the energy percentage against the number of coefficients]
Fourier Decomposition
Which coefficients are important? We can measure the energy of each coefficient:
$Energy = Real(X(f_k))^2 + Imag(X(f_k))^2$
Most of data-mining research uses the first k coefficients:
  Good for random walk signals (e.g. stock market)
  Easy to index
  Not good for general signals
Usage of the coefficients with the highest energy:
  Good for all types of signals
  Believed to be difficult to index
  CAN be indexed using metric trees
fa = fft(a);                % Fourier decomposition
N = length(a);              % how many?
fa = fa(1:ceil(N/2));       % keep first half only
mag = 2*abs(fa).^2;         % calculate energy
Code for Reconstructed Sequence
a = load('[Link]');
a = (a-mean(a))/std(a);            % z-normalization
fa = fft(a);
maxInd = ceil(length(a)/2);        % until the middle
N = length(a);
energy = zeros(maxInd-1, 1);
E = sum(a.^2);                     % energy of a
for ind=2:maxInd,
    fa_N = fa;                     % copy fourier
    fa_N(ind+1:N-ind+1) = 0;       % zero out unused
    r = real(ifft(fa_N));          % reconstruction
    plot(r, 'r','LineWidth',2); hold on;
    plot(a,'k');
    title(['Reconstruction using ' num2str(ind-1) ' coefficients']);
    set(gca,'plotboxaspectratio', [3 1 1]);
    axis tight
    pause;                         % wait for key
    cla;                           % clear axis
end
[Figure: the column of X(f) values, with the first coefficient shown as 0; the first half of the coefficients is marked "keep" and the conjugate second half is marked "Ignore"]
Code for Plotting the Error
a = load('[Link]');
a = (a-mean(a))/std(a);            % z-normalization
fa = fft(a);
maxInd = ceil(length(a)/2);        % until the middle
N = length(a);
energy = zeros(maxInd-1, 1);
E = sum(a.^2);                     % energy of a
for ind=2:maxInd,                  % the loop start is the same as above
    fa_N = fa;                     % copy fourier
    fa_N(ind+1:N-ind+1) = 0;       % zero out unused
    r = real(ifft(fa_N));          % reconstruction
    energy(ind-1) = sum(r.^2);             % energy of reconstruction
    error(ind-1) = sum(abs(r-a).^2);       % error
end
E = ones(maxInd-1, 1)*E;
error = E - energy;
ratio = energy ./ E;
subplot(1,2,1);                    % left plot
plot([1:maxInd-1], error, 'r', 'LineWidth',1.5);
subplot(1,2,2);                    % right plot
plot([1:maxInd-1], ratio, 'b', 'LineWidth',1.5);
Lower Bounding using Fourier coefficients
Parseval's Theorem states that energy in the frequency domain equals the energy in the time domain:
$\sum_{n=0}^{N-1} |x(n)|^2 = \sum_{k=0}^{N-1} |X(f_k)|^2$
or, that Euclidean distance is preserved:
$\|x - y\|^2 = \|X - Y\|^2$
If we just keep some of the coefficients, their sum of squares always underestimates (i.e. lower bounds) the Euclidean distance:
$\sum_{k=0}^{K-1} |X(f_k) - Y(f_k)|^2 \le \|x - y\|^2$, for $K \le N$
Lower Bounding using Fourier coefficients - Example
x = cumsum(randn(100,1));
y = cumsum(randn(100,1));
euclid_Time = sqrt(sum((x-y).^2));           % 120.9051
fx = fft(x)/sqrt(length(x));                 % note the normalization
fy = fft(y)/sqrt(length(x));
euclid_Freq = sqrt(sum(abs(fx - fy).^2));    % 120.9051
Keeping 10 coefficients the distance is: 115.5556 < 120.9051
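The "keeping 10 coefficients" step is not spelled out on the slide; one way it might be computed is sketched below. The variable names lbFirst and lbSym are hypothetical. The second bound relies on the conjugate symmetry of the spectrum of a real signal, so each non-DC coefficient can be counted twice without overshooting.

```matlab
% Lower bounds from the first k Fourier coefficients (illustrative sketch).
x = cumsum(randn(100,1));
y = cumsum(randn(100,1));
N = length(x);
fx = fft(x)/sqrt(N);                          % unitary normalization, as on the slide
fy = fft(y)/sqrt(N);

euclidTime = sqrt(sum((x - y).^2));
euclidFreq = sqrt(sum(abs(fx - fy).^2));      % equal to euclidTime (Parseval)

k = 10;                                       % number of coefficients kept (k < N/2)
dk = abs(fx(1:k) - fy(1:k)).^2;               % squared differences of kept coefficients
lbFirst = sqrt(sum(dk));                      % plain lower bound
lbSym   = sqrt(dk(1) + 2*sum(dk(2:k)));       % tighter: count conjugate twins as well
```

Both values are guaranteed not to exceed the true Euclidean distance, whatever random walks are drawn.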
Fourier Decomposition
Advantages:
  O(n log n) complexity
  Tried and tested
  Hardware implementations
  Many applications: compression, smoothing, periodicity detection
Disadvantages:
  Not a good approximation for bursty signals
  Not a good approximation for signals with flat and busy sections (requires many coefficients)
Wavelets - Why do they exist?
Similar concept with Fourier decomposition.
Fourier coefficients represent global contributions, wavelets are localized.
Fourier is good for smooth, random walk data, but not for bursty data or flat data.
Wavelets (Haar) - Intuition
Wavelet coefficients still represent an inner product (projection) of the signal with some basis functions. These functions have lengths that are powers of two (full sequence length, half, quarter, etc.).
Haar coefficients: {c, d00, d10, d11, ...}
An arithmetic example:
X = [9,7,3,5]
Haar = [6,2,1,-1]
c = 6 = (9+7+3+5)/4
c + d00 = 6+2 = 8 = (9+7)/2
c - d00 = 6-2 = 4 = (3+5)/2
etc.
Wavelets in Matlab
Specialized Matlab interface for wavelets.
See also: wavemenu
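The arithmetic example can be reproduced without any toolbox by repeatedly taking pairwise averages and half-differences. This sketch uses the same averaging convention as the example above, which is an unnormalized Haar transform; it differs by scaling factors from the orthonormal transform that wavedec returns.

```matlab
% Haar decomposition by pairwise averaging (length must be a power of two).
X = [9 7 3 5];

s = X;  details = [];
while numel(s) > 1
    avg = (s(1:2:end) + s(2:2:end)) / 2;   % coarser approximation
    dif = (s(1:2:end) - s(2:2:end)) / 2;   % detail needed to undo the averaging
    details = [dif details];               % coarser details go in front
    s = avg;
end
haar = [s details];                        % {c, d00, d10, d11}

% Reconstruction: expand level by level with c+d and c-d.
r = haar(1);  pos = 2;
while pos <= numel(haar)
    d  = haar(pos:pos+numel(r)-1);
    up = zeros(1, 2*numel(r));
    up(1:2:end) = r + d;
    up(2:2:end) = r - d;
    pos = pos + numel(r);
    r = up;
end
```

For X = [9,7,3,5] this yields haar = [6 2 1 -1], and the reconstruction r returns the original sequence.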
Code for Haar Wavelets
a = load('[Link]');
a = (a-mean(a))/std(a);                      % z-normalization
maxlevels = wmaxlev(length(a),'haar');
[Ca, La] = wavedec(a,maxlevels,'haar');
% Plot coefficients and MRA
for level = 1:maxlevels
    cla;
    subplot(2,1,1); plot(detcoef(Ca,La,level)); axis tight;
    title(sprintf('Wavelet coefficients Level %d',level));
    subplot(2,1,2); plot(wrcoef('d',Ca,La,'haar',level)); axis tight;
    title(sprintf('MRA Level %d',level));
    pause;
end
% Top-20 coefficient reconstruction
[Ca_sorted, Ca_sortind] = sort(abs(Ca));     % sort by magnitude, not signed value
Ca_top20 = Ca;
Ca_top20(Ca_sortind(1:end-20)) = 0;          % zero all but the 20 largest
a_top20 = waverec(Ca_top20,La,'haar');
figure; hold on;
plot(a, 'b'); plot(a_top20, 'r');
PAA (Piecewise Aggregate Approximation)
Also featured as Piecewise Constant Approximation.
Represent time-series as a sequence of segments.
Essentially a projection of the Haar coefficients in time.
[Figure: PAA reconstruction using 1 coefficient]
[Figures: PAA reconstructions using 2, 4, 8, 16 and 32 coefficients]
PAA Matlab Code
function data = paa(s, numCoeff)
% PAA(s, numcoeff)
% s: sequence vector (Nx1 or 1xN)
% numCoeff: number of PAA segments
% data: PAA sequence (Nx1)
N = length(s);                         % length of sequence
segLen = N/numCoeff;                   % assume it's integer
sN = reshape(s, segLen, numCoeff);     % break in segments (one per column)
avg = mean(sN, 1);                     % average segments (also works when segLen is 1)
data = repmat(avg, segLen, 1);         % expand segments
data = data(:);                        % make column
PAA Matlab Code - step by step
Example with N = 8, numCoeff = 4, segLen = 2:
s    = 1 2 3 4 5 6 7 8
sN   = [1 3 5 7; 2 4 6 8]          (reshape: one segment per column)
avg  = 1.5 3.5 5.5 7.5             (mean of each column)
data = 1.5 1.5 3.5 3.5 5.5 5.5 7.5 7.5   (repmat, then data(:))
The slides also show a row-vector variant (s and data of size 1xN) in which the last line makes a row instead of a column.
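The steps of the PAA function can be run inline as a quick check, and the segment averages can then serve as a lower bound of the Euclidean distance. The scaling by sqrt(segLen) is the standard PAA lower-bounding distance; the sequence lengths and the number of segments below are arbitrary illustrative choices.

```matlab
% PAA on the worked example, computed inline (no separate function file).
s = (1:8)';  numCoeff = 4;
segLen = length(s)/numCoeff;
avg  = mean(reshape(s, segLen, numCoeff), 1);     % one average per segment
data = repmat(avg, segLen, 1);  data = data(:);   % expanded PAA sequence

% PAA as a lower bound of the Euclidean distance between two sequences.
x = cumsum(randn(64,1));
y = cumsum(randn(64,1));
numSeg = 8;  len = length(x)/numSeg;              % 8 segments of 8 points
px = mean(reshape(x, len, numSeg), 1);
py = mean(reshape(y, len, numSeg), 1);
lbPAA  = sqrt(len) * norm(px - py);               % distance in the reduced space
euclid = norm(x - y);                             % distance in the original space
```

The first part reproduces avg = [1.5 3.5 5.5 7.5]; in the second part lbPAA never exceeds euclid, which is what makes PAA usable for pruning.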
APCA (Adaptive Piecewise Constant Approximation)
Not all Haar/PAA coefficients are equally important.
Intuition: keep the ones with the highest energy.
Segments of variable length.
APCA is good for bursty signals.
PAA requires 1 number per segment, APCA requires 2: [value, length]. The length needs e.g. 10 bits for a sequence of 1024 points.
[Figure: PAA with segments of equal size compared to APCA with segments of variable size]
Wavelet Decomposition
Advantages:
  O(n) complexity
  Hierarchical structure
  Progressive transmission
  Better localization
  Good for bursty signals
  Many applications: compression, periodicity detection
Disadvantages:
  Most data-mining research still utilizes Haar wavelets because of their simplicity.
Piecewise Linear Approximation (PLA)
Approximate a sequence with multiple linear segments.
First such algorithms appeared in cartography for map approximation.
Many implementations:
  Optimal
  Greedy Bottom-Up
  Greedy Top-down
  Genetic, etc.
You can find a bottom-up implementation here: [Link]
Piecewise Linear Approximation (PLA)
Advantages:
  O(n log n) complexity for the bottom-up algorithm
  Incremental computation possible
  Provable error bounds
  Applications for: image/signal simplification, trend detection
Disadvantages:
  Visually not very smooth or pleasing.
Singular Value Decomposition (SVD)
SVD attempts to find the optimal basis for describing a set of multidimensional points.
Objective: Find the axes (directions) that better describe the data variance.
[Figure: a 2D point cloud. We need 2 numbers (x,y) for every point. After finding the new axis we can describe each point with 1 number, its projection on the line. The last panel shows the new axis and the position of the points after projection and rotation.]
Singular Value Decomposition (SVD)
Each time-series is essentially a multidimensional point.
Objective: Find the eigenwaves (basis) whose linear combination describes best the sequences. Eigenwaves are data-dependent.
Factoring of the data array (M sequences, each of length n) into 3 matrices:
$A_{M \times n} = U_{M \times r}\, \Sigma_{r \times r}\, (V_{n \times r})^{T}$
[U,S,V] = svd(A)
[Figure: eigenwave 0, eigenwave 1, eigenwave 3, eigenwave 4]
A linear combination of the eigenwaves can produce any sequence in the database.
Code for SVD / PCA
A = cumsum(randn(100,10));
% z-normalization
A = (A-repmat(mean(A),size(A,1),1))./repmat(std(A),size(A,1),1);
[U,S,V] = svd(A,0);
% Plot relative energy
figure; plot(cumsum(diag(S).^2)/norm(diag(S))^2);
set(gca, 'YLim', [0 1]); pause;
% Top-3 eigenvector reconstruction
A_top3 = U(:,1:3)*S(1:3,1:3)*V(:,1:3)';
% Plot original and reconstruction
figure;
for i = 1:10
    cla;
    subplot(2,1,1); plot(A(:,i));
    title('Original'); axis tight;
    subplot(2,1,2); plot(A_top3(:,i));
    title('Reconstruction'); axis tight;
    pause;
end
Singular Value Decomposition
For M sequences of length n:
Advantages:
  Optimal dimensionality reduction in the Euclidean distance sense
  SVD is a very powerful tool in many domains: Websearch (PageRank)
Disadvantages:
  Cannot be applied for just one sequence. A set of sequences is required.
  Addition of a sequence in the database requires recomputation.
  Very costly to compute. Time: min{ O(M^2 n), O(M n^2) }. Space: O(Mn)
Symbolic Approximation
Assign a different symbol based on range of values.
Find ranges either from the data histogram or uniformly.
[Figure: a sequence of about 120 points split into the value ranges a, b and c, which produces the string baabccbc]
You can find an implementation here: [Link]
Symbolic Approximations
Advantages:
  Linear complexity
  After symbolization many tools from bioinformatics can be used: Markov models, Suffix-Trees, etc.
Disadvantages:
  Number of regions (alphabet length) can affect the quality of the result
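A uniform-range symbolization can be sketched as follows. This is not the linked implementation: the eight input values are made up (chosen so that the output matches the string shown on the slide), and each value is mapped directly to a symbol, whereas the figure appears to summarize longer segments first.

```matlab
% Uniform-range symbolization sketch (illustrative values, alphabet of 3).
x = [0.5 0.1 0.2 0.6 0.9 1.0 0.5 0.8];   % e.g. 8 segment averages (made up)
nSym = 3;                                 % alphabet length: a, b, c

edges = linspace(min(x), max(x), nSym+1); % uniform value ranges
idx = ones(size(x));                      % everything starts in the lowest range
for k = 2:nSym
    idx(x >= edges(k)) = k;               % move up for each boundary that is crossed
end
symbols = char('a' + idx - 1);            % 1 -> 'a', 2 -> 'b', 3 -> 'c'
```

Replacing the uniform edges with quantiles of the data would give the histogram-based variant mentioned on the slide.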
Multidimensional Time-Series
Catching momentum lately.
Applications for mobile trajectories, sensor networks, epidemiology, etc.
Let's see how to approximate 2D trajectories with Minimum Bounding Rectangles.
[Cartoon, Aristotle: "Ari, are you sure the world is not 1D?"]
Multidimensional MBRs
Find bounding rectangles that completely contain a trajectory, given some optimization criteria (e.g. minimize volume).
On my income tax 1040 it says "Check this box if you are blind." I wanted to put a check mark about three inches away. - Tom Lehrer
Comparison of different Dim. Reduction Techniques
So which dimensionality reduction is the best?
1993: "Fourier is good"
2000: "PAA!"
2001: "APCA is better than PAA!"
2004: "Chebyshev is better than APCA!"
2005: "The future is symbolic!"
Absence of proof is no proof of absence. - Michael Crichton
Comparisons
Let's see how tight the lower bounds are on a variety of 65 datasets.
[Figure: two panels, Average Lower Bound and Median Lower Bound]
A. No approach is better on all datasets
B. Best coeff. techniques can offer tighter bounds
C. Choice of compression depends on application
Note: similar results also reported by Keogh in SIGKDD02
PART II: Time Series Matching - Lower Bounding the DTW and LCSS
Lower Bounding the Dynamic Time Warping
Recent approaches use the Minimum Bounding Envelope for bounding the DTW:
  Create the Minimum Bounding Envelope (MBE) of the query Q
  Calculate the distance between the MBE of Q and any sequence A
  One can show that: $D(MBE(Q), A) \le DTW(Q, A)$
[Figure: query Q, its envelope MBE(Q) with upper bound U and lower bound L, and sequence A]
LB = sqrt(sum([[A > U].* [A-U]; [A < L].* [L-A]].^2));
One Matlab command! (A, U and L are column vectors.)
However, this representation is uncompressed. Both MBE and the DB sequence can be compressed using any of the previously mentioned techniques.
Lower Bounding the Dynamic Time Warping
LB by Keogh: approximate MBE and sequence using MBRs. LB = 13.84
LB by Zhu and Shasha: approximate MBE and sequence using PAA. LB = 25.41
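The envelope-based bound can be checked against a window-constrained DTW on a small synthetic example. The envelope construction (running maximum and minimum of Q over a window of half-width w), the 10% window and the sequences themselves are assumptions made for this sketch; the LB line is the one-command expression from the slide.

```matlab
% Envelope lower bound versus banded DTW (synthetic column vectors).
n = 60;  t = linspace(0, 2*pi, n)';
Q = sin(t);                           % query
A = sin(t - 0.4) + 0.2*cos(5*t);      % database sequence
w = ceil(0.10 * n);                   % warping window

U = zeros(n,1);  L = zeros(n,1);
for i = 1:n
    win  = max(1, i-w):min(n, i+w);
    U(i) = max(Q(win));               % upper envelope of Q
    L(i) = min(Q(win));               % lower envelope of Q
end

LB = sqrt(sum([(A > U).*(A - U); (A < L).*(L - A)].^2));   % distance of A to MBE(Q)

c = inf(n+1, n+1);  c(1,1) = 0;       % DTW restricted to the same window
for i = 1:n
    for j = max(1, i-w):min(n, i+w)
        c(i+1,j+1) = (Q(i) - A(j))^2 + min([c(i,j), c(i,j+1), c(i+1,j)]);
    end
end
dtwBand = sqrt(c(n+1,n+1));
```

Only the parts of A that fall outside the envelope contribute to LB, so LB stays at or below dtwBand and can be used to skip the expensive DTW computation.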
Lower Bounding the Dynamic Time Warping
An even tighter lower bound can be achieved by warping the MBE approximation against any other compressed signal. LB_Warp = 29.05
Lower Bounding approaches for DTW will typically yield at least an order of magnitude speed improvement compared to the naive approach.
Time Comparisons
We will use DTW (and the corresponding LBs) for recognition of hand-written digits/shapes. Let's compare the 3 LB approaches:
Accuracy: Using DTW we can achieve recognition above 90%.
Running Time: runTime LB_Warp < runTime LB_Zhu < runTime LB_Keogh
Pruning Power: For some queries LB_Warp can examine up to 65 times fewer sequences.
Upper Bounding the LCSS
Since LCSS measures similarity and similarity is the inverse of distance, to speed up LCSS we need to upper bound it.
$LCSS(MBE_Q, A) \ge LCSS(Q, A)$
[Figure: indexed sequence and query, annotated with 44 points and 6 points; Sim. = 50/77 = 0.64]
LCSS Application - Image Handwriting
Library of Congress has 54 million manuscripts (20TB of text).
Increasing interest for automatic transcribing.
Word annotation:
1. Extract words from document
2. Extract image features
3. Annotate a subset of words
4. Classify remaining words
Features:
- Black pixels / column
- Ink-paper transitions / column, etc.
[Figure: a word from the George Washington Manuscript and its feature value plotted against the column index]
LCSS Application - Image Handwriting
Utilized 2D time-series (2 features).
Returned 3-Nearest Neighbors of following words.
Classification accuracy > 70%
PART II: Time Series Analysis - Test Case and Structural Similarity Measures
Analyzing Time-Series Weblogs
PKDD 2005, Porto
Weblog of user requests over time
[Figure: examples from Google Zeitgeist and Priceline]
Weblog Data Representation
Record: we can aggregate information, e.g. number of requests per day for each keyword.
Capture trends and periodicities
Privacy preserving
[Figure: requests per day from Jan to Dec for the query "Spiderman". May 2002: Spiderman 1 was released in theaters.]
Finding similar patterns in query logs
We can find useful patterns and correlations in the user demand patterns, which can be useful for:
  Search engine optimization
  Recommendations
  Advertisement pricing (e.g. keyword more expensive in the popular months)
[Figure: requests from Jan to Dec for the queries "ps2" and "xbox"] Game consoles are more popular closer to Christmas.
[Figure: requests from Jan to Dec for the query "elvis"] Burst on Aug. 16th, death anniversary of Elvis.
Matching of Weblog data
Use Euclidean distance to match time-series. But which dimensionality reduction technique to use? Let's look at the data:
[Figure: requests over a 1 year span for the query "Bach" and for the query "stock market"]
The data is smooth and highly periodic, so we can use Fourier decomposition. Instead of using the first Fourier coefficients we can use the best ones. Let's see how the approximation will look:
First Fourier Coefficients vs Best Fourier Coefficients
Using the best coefficients provides a very high quality approximation of the original time-series.
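The difference between the first and the best coefficients can be illustrated on a synthetic periodic signal. The signal (a weekly plus a 91-day component over 364 days), the choice k = 2 and all variable names are assumptions for this sketch; "best" here means largest magnitude, and the conjugate twin of every kept coefficient is kept as well so that the reconstruction stays real.

```matlab
% First-k versus best-k Fourier coefficients (synthetic periodic data).
n = 364;  t = (0:n-1)';
x = sin(2*pi*t/7) + 0.5*sin(2*pi*t/91);   % weekly and seasonal periodicity
k = 2;                                     % number of coefficients to keep

fx  = fft(x);
pos = (2:n/2)';                            % positive frequencies (index 1 is DC)

keepFirst = pos(1:k);                      % the k lowest frequencies
[~, order] = sort(abs(fx(pos)), 'descend');
keepBest  = pos(order(1:k));               % the k strongest frequencies

fFirst = zeros(n,1);  idx = [1; keepFirst; n - keepFirst + 2];
fFirst(idx) = fx(idx);                     % DC + kept coefficients + conjugates
fBest  = zeros(n,1);  idx = [1; keepBest;  n - keepBest  + 2];
fBest(idx)  = fx(idx);

errFirst = norm(x - real(ifft(fFirst)));   % reconstruction error, first k
errBest  = norm(x - real(ifft(fBest)));    % reconstruction error, best k
```

On this signal the two strongest coefficients capture both periodicities, so errBest is essentially zero, while the two lowest frequencies carry almost none of the energy.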
Matching results I
Query = Lance Armstrong
[Figure: requests from 2000 to 2002 for the query and its matches "LeTour" and "Tour De France"]
Matching results II
Query = Christmas
[Figure: requests from 2000 to 2002 for the query and its matches]
Knn4: Christmas coloring books
Knn8: Christmas baking
Knn12: Christmas clipart
Knn20: Santa Letters
Finding Structural Matches
The Euclidean distance cannot distill all the potentially useful information in the weblog data. Some data are periodic, while others are bursty. We will attempt to provide similarity measures that are based on periodicity and burstiness.
Query "cinema": weekly periodicity. Peak of period every Friday.
Query "Elvis": burst in demand on 16th August, death anniversary of Elvis Presley.
Periodic Matching
1. Frequency: compute $F(x)$, $F(y)$
2. Ignore phase / keep important components (periodogram): $F(x^+) = \arg\max_k \|F(x)\|$, $F(y^+) = \arg\max_k \|F(y)\|$
3. Calculate distance: $D_1 = \|F(x^+) - F(y^+)\|$, $D_2 = \|F(x^+) \cdot F(y^+)\|$
[Figure: periodograms of the queries "cinema", "stock", "easter" and "christmas"]
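The three steps of periodic matching can be sketched as below. This is a simplified reading of the slide, not the authors' exact measure: the periodogram is taken as the squared magnitude of the Fourier coefficients, the k largest values of each sequence are kept independently, and the distance is the Euclidean distance between the truncated periodograms. The test signals and k are made up.

```matlab
% Periodogram-based distance sketch (synthetic signals, hypothetical k).
n = 210;  t = (0:n-1)';
x = sin(2*pi*t/7);             % weekly period
y = sin(2*pi*t/7 + 1);         % same period, different phase
z = sin(2*pi*t/30);            % monthly period
k = 3;                         % number of periodogram components kept

S = [x y z];
P = abs(fft(S)).^2 / n;        % periodogram of every column (phase is discarded)
P = P(2:n/2+1, :);             % positive frequencies only
for col = 1:size(P,2)
    [~, order] = sort(P(:,col), 'descend');
    P(order(k+1:end), col) = 0;            % keep the k most important components
end

dxy = norm(P(:,1) - P(:,2));   % periodic distance between x and y
dxz = norm(P(:,1) - P(:,3));   % periodic distance between x and z
euclidXY = norm(x - y);        % Euclidean distance is sensitive to the phase shift
```

x and y differ only in phase, so their periodic distance is practically zero even though their Euclidean distance is not, while x and z are clearly separated.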
Matching Results with Periodic Measure
Now we can discover more flexible matches. We observe a clear separation between seasonal and periodic sequences.
Matching Results with Periodic Measure
Compute pairwise periodic distances and do a mapping of the sequences on 2D using Multi-dimensional scaling (MDS).
Matching Based on Bursts
Another method of performing structural matching can be achieved using burst features of sequences.
Burst feature detection can be useful for:
  Identification of important events
  Query-by-burst
[Figure: 2002 demand for "Harry Potter", with bursts at Harry Potter 1 (Movie), Harry Potter 1 (DVD) and Harry Potter 2 (November 15 2002)]
Burst Detection
Burst detection is similar to anomaly detection.
Create a distribution of values (e.g. gaussian model).
Any value that deviates from the observed distribution (e.g. more than 3 std) can be considered as a burst.
[Figure: a demand curve over one year with bursts at Valentine's Day and Mother's Day]
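The 3-std rule can be sketched in a few lines. The daily demand curve below is synthetic, and a single global mean and standard deviation are used for the gaussian model; the tutorial may well use a different variant (for instance a moving average), so treat this as one possible reading.

```matlab
% Burst detection sketch: flag values more than 3 std above the mean.
x = 10 + sin(2*pi*(1:365)'/7);          % synthetic daily demand with weekly rhythm
x(320:325) = x(320:325) + 50;           % injected burst (e.g. a release date)

mu = mean(x);  sigma = std(x);          % gaussian model of the observed values
isBurst = x > mu + 3*sigma;             % points that deviate by more than 3 std

d = diff([0; isBurst; 0]);              % +1 where a burst starts, -1 after it ends
starts = find(d == 1);
ends   = find(d == -1) - 1;
bursts = [starts ends];                 % one row [start end] per burst region
```

The result is a compact list of time segments, which is the representation used for query-by-burst.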
Query-by-burst
To perform query-by-burst we can perform the following steps:
1. Find burst regions in the given query
2. Represent query bursts as time segments
3. Find which sequences in the DB have overlapping burst regions
Query-by-burst Results
Query             Match
Pentagon attack   Nostradamus prediction
[Link]            Tropical Storm
Cheap gifts       Scarfs
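Step 3, finding sequences with overlapping burst regions, can be sketched with bursts stored as [start end] rows. The query segment, the three database entries and the overlap score (number of shared time points) are hypothetical choices for illustration.

```matlab
% Query-by-burst sketch: overlap between burst segments (made-up segments).
q  = [320 325];                                  % burst regions of the query
db = {[318 322; 100 105], [10 20], [326 330]};   % burst regions of 3 DB sequences

overlap = zeros(numel(db), 1);
for i = 1:numel(db)
    B = db{i};
    for r = 1:size(B,1)
        for s = 1:size(q,1)
            lo = max(q(s,1), B(r,1));            % start of the intersection
            hi = min(q(s,2), B(r,2));            % end of the intersection
            overlap(i) = overlap(i) + max(0, hi - lo + 1);
        end
    end
end
matches = find(overlap > 0);                     % sequences sharing burst time
```

Only the first sequence shares time points with the query burst (days 320 to 322); sorting by overlap would rank the matches when several sequences qualify.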
Structural Similarity Measures
Periodic similarity achieves high clustering/classification accuracy in ECG data.
[Figure: two dendrograms of 36 ECG sequences, one built with DTW and one with the Periodic Measure; some branches are annotated "Incorrect Grouping"]
Structural Similarity Measures
Periodic similarity is a very powerful visualization tool.
[Figure: dendrogram of the following sequences]
  Random Walk (two sequences)
  Sunspots: 1869 to 1990
  Sunspots: 1749 to 1869
  Great Lakes (Ontario)
  Great Lakes (Erie)
  Power Demand: April-June (Dutch)
  Power Demand: Jan-March (Dutch)
  Power Demand: April-June (Italian)
  Power Demand: Jan-March (Italian)
  Random (two sequences)
  Video Surveillance: Eamonn, no gun
  Video Surveillance: Eamonn, gun
  Video Surveillance: Ann, no gun
  Video Surveillance: Ann, gun
  Koski ECG: fast 2
  Koski ECG: fast 1
  Koski ECG: slow 2
  Koski ECG: slow 1
  MotorCurrent: healthy 2
  MotorCurrent: healthy 1
  MotorCurrent: broken bars 2
  MotorCurrent: broken bars 1
Structural Similarity Measures
Burst correlation can provide useful insights for understanding which sequences are related/connected.
Applications for:
  Gene Expression Data
  Stock market data (identification of causal chains of events)
Query: Which stocks exhibited trading bursts during the 9/11 attacks?
  PRICELINE: Stock value dropped
  NICE SYSTEMS: Stock value increased (provider of air traffic control systems)
Conclusion
The traditional shape matching measures cannot address all time-series matching problems and applications. Structural distance measures can provide more flexibility.
There are many other exciting time-series problems that haven't been covered in this tutorial:
  Anomaly Detection
  Frequent pattern Discovery
  Rule Discovery
  etc.
"I don't want to achieve immortality through my work... I want to achieve it through not dying."