0% found this document useful (0 votes)

154 views26 pages

100 Time Series Data Mining Questions With Answers

This document provides examples of time series queries that can be answered using tools like the Matrix Profile. It contains 100 questions about patterns, similarities, differences, and unusual events in time series data. The questions are accompanied by step-by-step code solutions. The code and data to reproduce the examples are available online. The author welcomes additional questions and suggestions to further the work.

Uploaded by

if05041736

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

154 views26 pages

100 Time Series Data Mining Questions With Answers

Uploaded by

if05041736

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 26

100 Time Series Data Mining Questions

(with answers!)
Keogh’s Lab (with friends)

Dear Reader: This document offers examples of time series questions/queries, expressed in intuitive natural language, that
can be answered using simple tools, like the Matrix Profile, and related tools such as MASS.
We show the step-by-step solutions. In most cases, the solutions require just a handful of lines of code.
As you may have noticed, we are not at 100 yet! This is a long term work-in-progress. We welcome suggestions and
“donations” of questions.
The code and data is here: www.cs.ucr.edu/~eamonn/HundredQuestions.zip
Corrections and suggestions to [email protected]

In a handful of cases, we report timing results. These examples were made on an old machine, were optimized for simplicity, not speed. In any case the timing will
become dated with Moore’s Law. In addition, we are constantly optimizing our code. We only mean to produce relative numbers for your instruction. Please do
not report the absolute numbers, run the experiments yourself, with the most optimized code available.

0 2000 4000 6000 8000 10000 12000

This task is trivial with Mueen’s MASS code…

>> load robot_dog.txt , load carpet_query.txt % load the data
>> dist = findNN(robot_dog ,carpet_query ); % compute a distance profile
>> [val loc] = min(dist); % find location of match
>> disp(['The best matching subsequence starts at ',num2str(loc)])
The best matching subsequence starts at 7479

7450 7500 7550 7600

Below we plot the 16 best matches. Note that they all occur during the The best match, shown in context
carpet walking period. This entire process takes about 1/1000th of a second.

0 2000 4000 6000 8000 10000 12000

The dataset is an hour of EOG (eye movement) data of a sleeping patient, sampled at 100
Are there any repeated patterns in my data? Hz. It looks very noisy, it is not obvious that there is any repeated structure…

0 350,000

Let us run the Matrix Profile, looking for four-second long motifs…
>> load eog_sample.mat
>> [matrixProfile profileIndex, motifIndex, discordIndex] = interactiveMatrixProfileVer3_website(eog_sample, 400);

The code takes a while to fully converge, but in just a few seconds, we see some stunningly well conserved motifs…

Having found the motifs, we can ask, what are

they? A quick glance at a paper by Noureddin
et. al. locates a very similar pattern (with
some time warping) called eye-blink-
artifact.
Figure 1.(f) of
Noureddun et al Motif 2
Motif 1 Motif 2
Four seconds Four seconds

Note that there may be more examples of each motif. We should take one
of the above, and use MASS to find the top 100 neighbors… See Have we
ever seen a pattern that looks just like this?. We can also adjust the range
parameter r inside the motif extraction code.
What are the three most unusual days in this three month long dataset?
0 500 1000 1500 2000 2500 3000 3500

The datasets is Taxi demand, in New York City, in the last three months of the year.
We choose 100 datapoints, which is about two days long (the exact values do not matter much here).

>> load taxi_3_months.txt

>> [matrixProfile, profileIndex, motifIndex, discordIndex] = interactiveMatrixProfileVer3_website(taxi_3_months ,100);

The code pops up the matrix profile tool, and one second later, we are done! The three most unusual days
correspond to the three highest values of the matrix profile (i.e. the discords), but what are they?
• The highest value corresponds to Thanksgiving
• We find a secondary peak around Nov 6th, what could it be? Daylight Saving Time! The clock going backwards one hour,
gives an apparent doubling of taxi load.
• We find a tertiary peak around Oct 13th, what could it be? Columbus Day! Columbus Day is largely ignored in much of
America, but still a big deal in NY, with its large Italian American community.

1 3600

Thanksgiving
Daylight Saving Time
Columbus Day
1 3600
ice
Is there any pattern that is common to these two time series? ice
queen
0 0.5 1 1.5 2 2.5
4
10
Lets assume that the common pattern is 3 seconds, or 300 datapoints long.
Let us concatenate the two time series, and smooth them (just for visualization purposes, we don’t really need to)
Now let us find the top motif, but insist that one motif comes before 24289, and one after…
>> load('Queen_vs_Ice.mat’)
>> whos
Name Size Bytes Class Attributes
mfcc_queen 1x24289 194312 double
mfcc_vanilla_ice 1x23095 184760 double
>> interactiveMatrixProfileAB(smooth([mfcc_queen , mfcc_vanilla_ice]), 300, 24289); % This will spawn this plot ->

The top join motif shows a highly conserved pattern.

It is the famous bass line from Under Pressure by Queen

which was plagiarized by Vanilla Ice.

The top join motif

The concept for this example comes from Dr. Diego Furtado Silva.
How do these two time series differ in terms of alignment?

The data are two motifs discovered in the song of a bird1, which we converted to MFCC. Let us load the data, and look
at the DTW alignment.

>> load green.txt

>> load cyan.txt
>> DTW(green', cyan' ,1 ); % the ‘1’ is just to force the plot

The DTW alignment clearly indicates where the differences lie, in the variability of the timing of a single note, about
2/3rds of the through the snippet. This example is trivial to see, but in more complex processes, this visual analysis can
be very fruitful. 1 1200
See Multifractal analysis reveals music-like dynamic structure in songbird rhythms, by Tina Roeske et al.

alignment

The second occurrence of this note happens much later,

1https://www.xeno-canto.org/415294
given how well the rest of the song snippet is preserved
Find the most conserved pattern that happens at least once every two days in this dataset
0 40000 80000 120000 160000

The question is a little underspecified, as the length for the conserved patterns was not given. Let us try two hours, which is about
800 data points.
The full 20,000 datapoints represents about 14 days of electrical demand data for a house in the U.K. Thus we first need to divide it
into approximate 2 day chunks.
>> load TwoWeekElectrical
>> seven_two_day_chunks = divide_data(T);
Now we just need to call the consensus motif code.
>> consensus_motifs = consensusMotifs(seven_two_day_chunks,800); % 800 is the length of subsequence

(probable) hair dryer

The code returns the seven time series below. Note that the basic
pattern is highly conserved, given how noisy the data is.
The similarity between the items can be better seen if we cluster the (probable) electric kettle
time series with a single linkage dendrogram.

The most conserved pattern

The dataset is 3 years of Italian power
If you had to summarize this long time series with just two shorter examples, what would they be? demand data which represents the hourly
electrical power demand of a small Italian city
for 3 years beginning on Jan 1st 1995.

Jan/1/1995 May/31/1998

We just need to call Time Series Snippets algorithm…

>> load('ItalianPowerDemand.mat’)
>> [fraction,snippet,snippetidx]= snippetfinder(data(:,4),2,200,30);
It will pop open three windows, which are snippet 1, snippet 2 and the regime bar.
We searched for the top-2 snippets of length 200. This was our quick “eyeballing” guess as to the length of a
week, but it is actually about 8.3 days. Note that the snippets are not align to start at the same day of the
week (this is a trivial constraint to add if desired).
What makes the snippets different? (tentative answer)
In the winter, people go home after work (and turn on
Sunday Sunday heaters/appliances). In the summer, people do more
leisure activities after work and don’t return home until
it is cooler.

Snippet 1 8 days 8 days

Snippet 2

We obtain the “regime bar,” which tells us which snippet “explains” which region of data. As it
happens, Snippets seem to represent summer and winter regimes respectively.

Snippet 1 Snippet 2
Jan/1/1995 May/31/1998
Are there any patterns that appear as time reversed versions of themselves in my data?
0 1000 2000 3000 4000 5000

Lets us load the data, and concatenate it to itself, after flipping left to right.
We can then search for a join motif, that spans 5046, the length of the original time series.
If we find a good join motif, it means that the conserved pattern is time reversed!
>> load('mfcc.mat’)
>> length(mfcc1(1,:))
ans = 5046
>> interactiveMatrixProfileAB(([mfcc1(1,:)'; flipud(mfcc1(1,:)')]), 150, 5046); % This will spawn this plot ->

The top join motif shows a highly conserved pattern.

Why would a pattern occur time reversed?

“The most extraordinary of all canonic movements from this

MFC8”Symphony
No. 47

time is of course from Symphony No. 47. Here Haydn writes out
only one reprise of a two-reprise form, and the performer must
0 21:02
minutes:seconds
play the music ‘backward’ the second time around”.
The data is the 1st MFCC of this piece of music.
14:53
14:16
1 150

0 40 0 40
seconds seconds
The top join motif

al roverso

1 150
When does the regime change in this time series?
Arterial Blood Pressure Healthy Pig.. …internal bleeding induced
0 15000

In this dataset, at time stamp 7,500, bleeding was induced in an otherwise healthy pig. This changes the pig’s APB measurement, but
only very slightly. Could we find the location of the change, if we were not told it? Moreover, can we do this with no domain knowledge?
In other words, can we detect regime changes in time series?
>> TS = load('PigInternalBleedingDatasetArtPressureFluidFilled_100_7501.txt');
>> CAC = RunSegmentation(TS, SL); %SL is the length of subsequence
>> plot(CAC,'c’)
>> [~, loc] = min(CAC) %value of loc is 7460 which is the approximation of exact value 7500
Here, we choose SL to be 100, approximately the length of one period of arterial pressure (or the period of whatever repeated patterns you have
in your data), however, up to half or twice that value would work just as well. The output curve, the CAC, minimizes at just the right place.
How does it do it? In brief, if we examine the pointers in the Matrix Profile Index, we will find that very few will cross over the location of a regime
change (most healthy beats have a nearest neighbor that is another healthy beat, most “bleeding” beats have a nearest neighbor that is another
“bleeding” beat), it is this lack of pointers that cross over the regime change that is what the CAC is measuring.

CAC
0.5
The minimum value of the CAC suggests the location of the regime change
0
0 15000

100

40
0 5000 10000 15000
How can I compare these time series of different lengths?

If you have data that are different lengths, you could make them the same length (using truncation or interpolation) or use DTW. However, for
some datasets, that would be a very bad idea. To see why, consider text instead of time series for a moment.
For example, to find the similarity between (Lisa, Lisabeth), truncation of the second half of Lisabeth works well. However, to find the
similarity between (Beth, Lisabeth), truncation of the second half of Lisabeth is clear wrong.
One trick to solve this issue is the Mpdist, a distance measure that automatically solves the above dilemma, by only comparing the most
similar parts of the sequence. Below we demonstrate it on the Y-axis of the time series recording the location of the tip of a pen as it writes
six girls names.
>> load('TSs’)
>> MPdist_Clustering(TSs)

(We also have done this Lisa

experiment on synthetic
data to make this result Lisabeth
even clearer)
Beth
Note the that names in our example piecewise match.
However, there are differences, for example due to the Anne
capitalization of ‘b’ in Beth vs.Lisabeth
b t
e Maryanne
h
B t Mary
e MPdist
h
Are there any patterns that repeat in my data, but at two distinct lengths?
Voltage reading

See also “Is there any pattern that is common to these two time series?”

We can solve this with a quick and dirty trick. The code interactiveMatrixProfileAB(T,m,crossover) searches time
series T for a motif of length m, such that one of the motif pair occurs before crossover and one occurs after crossover.
We can take a time series and append it to a rescaled copy itself, setting the to the length of the original time series. Now when we
find motifs, we are finding one at the original scale, and one at the rescaled size.
In this case, I want to know if any of my insect behaviors happens at length 5,000 and at 10,000, so I type…
>> load insectvolts.mat % load some insect epg data
>> interactiveMatrixProfileAB([insectvolts ; insectvolts(1:2:end)], 5000, length(insectvolts)); % search the appended data

No need to let it converge, after a few seconds we have our answer…

Two motifs in the rescaled space Two motifs in the original, true space
Note you can do this for non
integer rescaling. Matlab will warn
you, but it is defined and allowed. This behavior took 20 seconds

Note that the bottom motif is

discovered in spite of having a lot This behavior took 40 seconds
of noise in one of the occurrences.

Note that the dimensionality of This behavior took 20 seconds

the motifs is 5,000! This would
have been unthinkable before the
Matrix Profile.
This behavior took 40 seconds
Have we ever seen a multidimensional pattern that looks just like this? Pressure
MagX
0 250,000

I have 262,144 data points that record a penguin’s orientation (MagX) and the water/air pressure as he hunts for fish.
Question: Does he ever change his bearing leftwards as he reaches the apex of his dive?
This is easy to describe as a multidimensional search. The apex of a dive is just an approximately parabolic shape. I can
create this with query_pressure = zscore([[-500:500].^2]*-1)’; it looks like this
I can create bearing leftwards with a straight rising line, like this query_MagX = zscore([[-500:500]])’; It looks like this

We have seen elsewhere in this document how to search for a 1D pattern. For this 2D case, all we
have to do is add the two distance profiles together, before we find the minimum value.
Note that the best match location in 2D is different to either of the 1D queries.
We can do this for 3D or 4D…
However, there are some caveats. In brief, it almost never makes sense to do multidimensional
time series search in more than 3 or 4D. See Matrix Profile VI: M. Yeh ICDM 2017 and “Weighting” B. Hu, ICDM
2013. In addition, in some cases we may want to weight the dimensions differently, even though
they are both z-normalized Euclidean Distance. What are the periodic
bumps? They are
wingstokes as the bird
“flies” underwater

load penguintest.mat
figure;, hold on;

query_pressure = zscore([[-500:500].^2]*-1)';
dist_p = MASS_V2(penguintest(:,1),query_pressure);
query_MagX = zscore([[-500:500]])';
dist_m = MASS_V2(penguintest(:,2),query_MagX);
[val,loc] = min([dist_m + dist_p]); % find best match location in 2D
plot(zscore(penguintest(loc:loc+length(query_MagX),2)),'color',[0.85 0.32 0.09])
plot(zscore(query_MagX),'m')
plot(zscore(penguintest(loc:loc+length(query_pressure),1)),’b’)
Best match in 2D space
plot(zscore(query_pressure),'g')
title(['Best matching sequence, pressure/MagX, is at ', num2str(loc)]) 0 1000
How do I quickly search this long dataset for this pattern, if an approximate search is acceptable?
As shown elsewhere in this document, exact search is surprisingly fast under Euclidean Distance. However, let us suppose that you want to do even faster
search, and you are willing to do an approximate search (but want a high quality answer). A simple trick is to downsample both the data and the query by
the same amount (in the below, by 1 in 64) and search the downsampled data. If the data has low intrinsic dimensionality, this will typically give you very
good results.
Let us build a long dataset, with 67,108,864 datapoints, and a long query, with 8,192 datapoints

A full exact search takes 12.4 seconds, an approximate search takes 0.24 seconds, and produces (at least in
this example) almost exactly the same answer. The answer is just slightly shifted in time.

How well this will work for you depends on the intrinsic dimensionality of your data.
rng('default') % set seed for reproducibility
data= cumsum(randn(1,2^26)); % make data
query= cumsum(randn(1,2^13)); % make a query
tic
dist = MASS_V2(data ,query );
[val,loc] = min(dist); % find best match location exact
hold on
plot(zscore(data(loc:loc+length(query)))) exact
plot(zscore(query),'r')
title(['Exact best matching sequence is at ', num2str(loc)])
disp(['Exact best matching sequence is at ', num2str(loc)])
toc

figure;
downsampled_data = data(1:64:end); % create a downsampled version of the data
downsampled_query = query(1:64:end); % create a downsampled version of the query
tic
dist = MASS_V2(downsampled_data ,downsampled_query );
approximate
[val,loc] = min(dist);
hold on
plot(zscore(data((loc*64):(loc*64)+length(query)))) % multiply the 'loc' by 64 to index correctly
plot(zscore(query),'r')
title(['Approx best matching sequence is at ', num2str(loc*64)])
disp(['Approx best matching sequence is at ', num2str(loc*64)])
toc

Exact best matching sequence is at 61726727, Elapsed time is 12.40 seconds.

approximate
Approx best matching sequence is at 61726784, Elapsed time is 0.240 seconds.
see also “How do I quickly search this
How can I optimize similarity search in a long time series? long dataset for this pattern, if an
approximate search is acceptable?”
Suppose you want to find a query inside a long time series, say of length 67,000,000.

First trick: MASS (and several other FFT and DWT ideas) have their best case when the data length is a power of two, so pad the data to make it a power of
two (padding with zeros works fine).

Second trick: MASS V3 is a piecewise version of MASS that performs better when the size of the pieces are well aligned with the hardware. You need to
tune a single parameter, but the parameter can only be a power of two, so you can search over say 210 to 220. Once you find a good value, you can
hardcode it for your machine.
rng('default') % Set seed for reproducibility
data= cumsum(randn(1,67000000)); % make data
query= cumsum(randn(1,2^13)); % make a long query
tic
dist = MASS_V2(data ,query );
If you run this code, it will output…
[val,loc] = min(dist); % find best match location
hold on Best matching sequence is at 32463217
plot(zscore(data(loc:loc+length(query))))
plot(zscore(query),'r') Elapsed time is 14.30 seconds.
disp(['Best matching sequence is at ', num2str(loc)])
toc After padding: Best matching sequence is at 32463217
figure Elapsed time is 12.31 seconds.
data = [data zeros(1,2^nextpow2(67000000) -67000000)]; % pad data to
tic % next pow of 2 MASS V3 & padding: Best matching sequence is at 32463217
dist = MASS_V2(data ,query );
[val,loc] = min(dist); % find best match location Elapsed time is 5.82 seconds.
hold on
plot(zscore(data(loc:loc+length(query))))
plot(zscore(query),'r') Note that it outputs the exact same
disp(['After padding: Best matching sequence is at ', num2str(loc)])
toc answer, regardless of the optimizations,
figure but it is fast, then faster, then super fast.
tic
dist = MASS_V3(data ,query, 2^16 );
[val,loc] = min(dist); % find best match location
hold on
plot(zscore(data(loc:loc+length(query))))
plot(zscore(query),'r')
disp(['MASS V3 & padding: Best matching sequence is at ', num2str(loc)])
toc
What is most likely to happen next? <other>
We are monitoring What happens next?
With 1%
Can we do “predictive text” for time series? It is meant to “Boiler 179” probability
be very “light weight”, no model building, no training, no With
tweaking. 99.99%
probability
It is not predicting the value, only the shape, of things to With 4%
come. probability
Such predictions can justify themselves, like this... 5 min ago Now 5 min ago Now 5 min in future
We predicted this pattern because the last two times we saw a 24-hour prefix
that looked like the last day (May 2, June 9), it was followed by this shape.

>> data = load('power_data.txt’);

This dataset is one year of electrical power demand in an Dutch facility city. The key >> sub_len = 150;
features for our demo are weekends look different and there are some national holidays. >> predicting_tool(data, sub_len);
Here our prediction is good for 24 hours, then it is poor. (contrast left figure) With just a little more experience, we can predict weekends. How? It is
Up to this point, we have seen too few weekends to make good predictions…. very subtle, but on Fridays some folk leave early, and the power demand scales down a little
(contrast right figure, later in the year) faster in the afternoon….

2nd week of Jan 1st week of Feb

What is the right length for motifs in this dataset? See also Are there any repeated patterns in my data?
This is a very interesting question, which more than most, deserves a long explanation. However, to be brief and pragmatic. Let us revisit the
EOG dataset. Recall that we choose 4 seconds as the motif length, which I happen to know (from reading papers on the topic) is a good choice.
• >> load eog_sample.mat
• >> [MP profileIndex, motifIndex, discordIndex] = interactiveMatrixProfileVer3_website(eog_sample, 400);

Let us look at the Matrix Profile, and the top motif we find (bottom left). The results seem to make sense.
However, suppose in contrast that we knew nothing about the domain, and had chosen a motif length that was much too long, say length 3,000
(bottom right). How could we know that we had picked a length that was too long? There are two clues:
• Obviously, the motifs themselves will be less well conserved visually.
• The Matrix Profile itself offers useful clues. It tells us how “specially well conserved” the motif is ( min(MP)) relative to the average subsequence (
mean(MP)). As the ratio of these two numbers is approaches zero, it suggests a stunningly well conserved motif in the midst of others unconserved
data. However, as the ratio of these two numbers is approaches one, it suggest that the “motif” is no better conserved than we would expect by random
chance. In practice, we rarely compute these ratios, as is visually obvious that the MP looks “flat”.

60 >> min(MP)/mean(MP) ans = 0.290 >> min(MP)/mean(MP) ans = 0.756 60

40 40

20 Here the MP looks “flat” 20

0 0

The motifs are well The motif are not well

conserved visually.. conserved visually..

1 400 1 3000
I need to find motifs faster! Part I

Part of the solution might be to use GPUs, see [a][b].

Moreover, it is important to understand, we almost never need to compute the Matrix Profile to completion, the anytime
SCRIMP++ converges so fast, that in general we just run it 1% (or less) of convergence.
Nevertheless, sometimes you might want to compute the converged Matrix Profile. There is a faster algorithm for this. It exploits
some of the ideas in [b], and it exploits the fact that it does not need to waste the overhead needed to make anytime updates,
to achieve about an order of magnitude speedup.
For consistency with our other tools, when the fast code finish, it pops open the same plot.

To make this compatible with 2016 MATLAB, we replace the

built-in “maxk” with a third party version called “maxk1”
(by Salam Ismaeel). If you have later versions of MATLAB, you
may wish to undo this change.

Why are there very slightly different results? The value of the exclusion zone, and of ‘r’, the radius [a] https://www.cs.ucr.edu/~eamonn/ten_quadrillion.pdf
were different here. See elsewhere to understand how these effect the motifs returned. [b] https://www.cs.ucr.edu/~eamonn/public/GPU_Matrix_profile_VLDB_30DraftOnly.pdf
See also, How do I quickly search this long dataset for this pattern, if an approximate search is acceptable?
I need to find motifs faster! Part II

Part of the solution might be to use GPUs, see [a][b].

Many datasets are oversampled. For example, the insectvolts dataset that accompanies these notes is greatly oversampled.
By downsampling, all algorithms have some speedup, but for motif discovery, that speed up is most dramatic.
You need to remember to downsample the query length by the same factor. For example, in the below we want to find motifs of
five seconds, in a 100 Hz dataset. So we should use..
• eog_sample(1:1:end) with a motif length of 500. This in the original data, or..
(500 is the original motif length, note that in
• eog_sample(1:2:end) with a motif length of 250. or… every case, X times Y = 500)
• eog_sample(1:5:end) with a motif length of 100. or…

Below, we obtain speedup by downsampling, and get essentially the same results. However, it is important to note that in both
cases we found the motifs well before even preSHRIMP finished. In this case, in a few seconds.
For greatly oversampled datasets, this simple trick can get you speedups of 2 or 3 orders of magnitude!

Note: These are the times for preSHRIMP only

[a] https://www.cs.ucr.edu/~eamonn/ten_quadrillion.pdf
[b] https://www.cs.ucr.edu/~eamonn/public/GPU_Matrix_profile_VLDB_30DraftOnly.pdf
Have we ever seen a pattern that looks just like this, but possibly at a different length?
Voltage reading

In our insect data, a basic feeding primitive looks like this: , we can model it with something like: [1:600].^0.2
We have a theory that a certain higher level behavior will result in “A long primitive, followed by a shorter and smaller primitive,
followed by another long primitive”, like this.. , we can model it with: [[[1:600].^0.2] [[1:300].^0.2] [[1:600].^0.2]]
However, we don’t know how long the whole pattern could be…
The function to the right can solve this problem. It simply This looks like a lot of code, but most of it is for plotting
brute forces a MASS test for all lengths within a range function [] = uniform_scaling_search(TAG, QUERY)
figure; % Spawn a blank figure

(here 100 to 300%) at a given step size (here 5%). subplot(4,1,1);

plot(TAG,'g')
% Plot panel 1
% Plot the TAG/time series in green
title(['This is the time series, of length ',num2str(length(TAG))]);
There may be faster techniques, but MASS is so fast, they subplot(4,1,2); % Plot panel 2
hold on;
may not be worth bothering with. title(['This is the QUERY, of length ',num2str(length(QUERY)), ', rescaled versions in gray']);
for i = 110:10:300 % Plot the rescaled queries

A critical trick is to normalize the comparisons at end

plot(QUERY(1:100/i:end),'color',[0.5 0.5 0.5])

plot(QUERY,'LineWidth',2,'color','r'); % Plot the query

different lengths (see Appendix). subplot(4,1,4); % Plot panel 4
hold on;
best_match_val = inf;
for i = 100:5:300 % Loop over all scalings
NewQUERY = (QUERY(1:100/i:end));
>> load insectvolts.mat % load some insect epg data distprofile = MASS_V3(TAG, NewQUERY, 1024);
>> query = ([[[1:600].^0.2] [[1:300].^0.2] [[1:600].^0.2]]); [val,loc] = min(distprofile);
val = val * 1/sqrt(i); % This normalizaiton step is critical see Appendix
>> uniform_scaling_search(smooth(insectvolts,10), query); if val < best_match_val % Record the best scaling
best_match_val = val;
best_match_loc = loc;
best_match_scale = i;
end
end

Match at original length does not plot(zscore(QUERY(1:100/best_match_scale:end)),'LineWidth',2,'color','r'); % Plot bestscaled match

plot(zscore(TAG(best_match_loc:best_match_loc+length(QUERY(1:100/best_match_scale:end))-1)),'g');
look very similar. However, at 285% set(gca,'Xlim',[1 length(QUERY(1:100/best_match_scale:end))]);
title(['The best match is found when we rescale to ',num2str(best_match_scale),'%']);
the match is very good.
subplot(4,1,3); % Plot panel 3
hold on;
distprofile = MASS_V3(TAG, QUERY, 1024); % Compute the distance profile
[val,loc] = min(distprofile); % Find were the best match was
plot(zscore(QUERY),'LineWidth',2,'color','r'); % Plot the query
Even with this unoptimized approach. We can search plot(zscore(TAG(loc:loc+length(QUERY)-1)),'g');
set(gca,'Xlim',[1 length(QUERY(1:100/best_match_scale:end))]);
two hours of data (at 100hz), with a long query (upto title(['This is the best match, at original length'])

4000 to 4,500 datapoints), in well under 10 seconds end

1
How can I know which of these two classification approaches is best for time series?
This is a very tricky question. A sophisticated statistical test is probably See arxiv.org/abs/1810.07758 for more details
2.2
the answer, but beyond the scope of this document. Key
Texas SharpShooter plot
However, here we consider a simple way to visualize the answer. 84
We expected We expected
Many papers have essentially said “we tested on many datasets, we are

Actual Accuracy Gain

to do worse, an
2 but we did improvement 104
better on some, so we are useful sometimes”. However, it is not useful to better. and we got it! 100
88
be better sometimes, unless you know in advance you are going to be
better! We expected
We expected
to do better,
to do worse,
but actually
In brief, after computing the strawman/baseline (here, the Euclidean 1.8 and we did.
did worse.

distance) we then compute the expected improvement we would get 99

Expected Accuracy Gain
using the proposed algorithm (here cDTW, learning any parameters and 86
settings on just the training data), then compute the actual improvement

Actual Accuracy Gain

1.6 66
obtained (using these now hardcoded parameters and settings). We then 87 89
plot a point for each dataset, using the expected and actual improvement
(either of which could be less than one) 5
60
As the key shows, there are four possible outcomes for each dataset, we 1.4 3 101
19
would prefer to be in a yellow region (ideally, the upper right region). 18
102 115
Let us answer the question for 1NN-ED vs, 1NN-DTW. We did all the 20 4
118 68
9
48
experiments and saved them into a file called texas_plot_2018.csv. 1.2 40
125
62
126
39
10 85 82
12
>> result_file = texas_plot_2018.csv 1
38 61
116
67
37 51 97 34
44
114
11298 50
47 58 95
>> plot_texas_sharpshooter(result_file) 113 4971 117
26 41 11053
94
103
108
121
17 35
63 27
64 11 120
105
15
90
42 122
111
This plot show forceful evidence that cDTW is superior to ED. This is 1 46
7
72
14
30
59
119
128
55
107
21
57
92
13 76 29 36
246 25 23106 73
265
16
22
123
109
31
80
54
28
91 38
83
74
52
2170
776975 124
96 43 33 45 56
unsurprising, since cDTW subsumes ED as a special case. The rare 78 79 93
127
8
examples where cDTW is worse is because we learned an unsuitable
warping window*.
0.8
0 0.5 1 1.5 2 2.5
Expected Accuracy Gain
*For example, 8 is BeetleFy, with just 20 train and 20 test instances. Here we expected to do a little better, but we did a little worse.
In contrast, for 66 (LargeKitchenAppliances) we had 375 train and 375 test instances, and where able to more accurately predict the warping window that gives a large improvement.
Normal
Are there any evolving patterns in this dataset (time series chains) Breathing

2.017 2.018 2.019 2.02 2.021

1000
500
0
-500
-1000
0 2000 4000 6000 8000 10000 12000 14000 16000

This is a dataset of respiration from a sleep study. Each breath appears to be about 360 data points long. So lets search for time series
chains of length 360…
>> load respiration.mat
>> TSC1_demo(respiration , 360);
The algorithm finds the highlighted chains below.
1000
500
0
-500
-1000
0 2000 4000 6000 8000 10000 12000 14000 16000

Let us zoom in on the chains, to better see what is going on…

Note the increasing “gulp” artifact that happens between cycles. Also note that it begins to happen earlier and earlier in
the cycle. What does this mean?
Here is the (lightly edited) annotation of Dr. Gregory Mason (LA BioMed/UCLA) an expert on cardiopulmonary interactions.
“The gulps are attempts to inspire against an obstruction coming the back of the tongue. The large signals are from the machine which do
not necessarily reach the patient, the small gulps are pathologic attempts to breathe. Why does it increase? With each successive breath
the patient tries harder to inspire. It finally is 'synchronized' and you don't see the small patient signal, and this event cycles over and
over. The cycling is best seen without treatment if one looks up "crescendo snoring," a hallmark of obstructive sleep apnea.”
Please consider “donating” question
• Even better if you can donate data
• Even better if you can donate an answer!
Appendices
• Below are miscellaneous appendices
In some of our examples we needed to compare the
Euclidian distances of pair of time series, that are of Which of these three pairs
different lengths. is closest?

Under mild assumptions,

How should we normal to compensate for the length we might claim that they
differences? are all equally similar.
1 10,000
Time Series Length
We could:
1) Not normalize at all, but this strongly biases us to short 3 0.001 3 0.05
patterns. Divide by length normalized
2) Normalize by dividing my the length of the time series.
This seems to make sense, but it strongly biases us towards
long patterns. Divide by reciprocal of square
root of length normalized
3) Normalize by the reciprocal of square root of length of the 2 2

time series. Without explanation (here) we claim that this is

correct.

In the figures on the bottom right, we compare the three 1 1

above ideas on slightly nosily sine waves of different Raw Euclidean Distance Raw Euclidean Distance
lengths (shown top right). 0 0
1,000 10,000 1,000 10,000
Time Series Motif Length Time Series Motif Length

As you can see, the “Normalize by the reciprocal of square

root of length of the time series” approach is invariant to
the length of the patterns.

AST1
No ratings yet
AST1
84 pages
qt8xc8s9df Nosplash
No ratings yet
qt8xc8s9df Nosplash
36 pages
Lecture 1 - Time Series Fundamentals - Introduction
No ratings yet
Lecture 1 - Time Series Fundamentals - Introduction
61 pages
Lecture 1
No ratings yet
Lecture 1
55 pages
Da Lab Exp 7,8,9,10,11,12
No ratings yet
Da Lab Exp 7,8,9,10,11,12
32 pages
Ijbidm070405 Anh
No ratings yet
Ijbidm070405 Anh
22 pages
03 Matrix
No ratings yet
03 Matrix
112 pages
T25. Forecasting Big Time Series - Theory and Practice
No ratings yet
T25. Forecasting Big Time Series - Theory and Practice
166 pages
Matrix Profile Tutorial Part1
No ratings yet
Matrix Profile Tutorial Part1
110 pages
TT02 Data, Methods, and Scenarios
No ratings yet
TT02 Data, Methods, and Scenarios
44 pages
DM Trend Seminar
No ratings yet
DM Trend Seminar
34 pages
Mining Rare Temporal Pattern in Time Series
No ratings yet
Mining Rare Temporal Pattern in Time Series
15 pages
MatrixProfile A Time Series
No ratings yet
MatrixProfile A Time Series
10 pages
Association Rule Mining
No ratings yet
Association Rule Mining
191 pages
Data Mining
No ratings yet
Data Mining
44 pages
Feature-Based Time-Series Analysis
No ratings yet
Feature-Based Time-Series Analysis
28 pages
DMDW Notes Unit 3
No ratings yet
DMDW Notes Unit 3
9 pages
Anomaly Detection
No ratings yet
Anomaly Detection
51 pages
Biological Data Science Lecture3
No ratings yet
Biological Data Science Lecture3
23 pages
Motion Code
No ratings yet
Motion Code
20 pages
Motion Code Arxiv
No ratings yet
Motion Code Arxiv
17 pages
Unlocking Online Insights: LSTM Exploration and Transfer Learning Prospects
No ratings yet
Unlocking Online Insights: LSTM Exploration and Transfer Learning Prospects
14 pages
Datamining and Analytics Unit V
No ratings yet
Datamining and Analytics Unit V
102 pages
Planning Kopia
No ratings yet
Planning Kopia
4 pages
4540 17 PDF
No ratings yet
4540 17 PDF
274 pages
FULLTEXT02
No ratings yet
FULLTEXT02
63 pages
TSIndexing
No ratings yet
TSIndexing
64 pages
l8 Signal Extraction
No ratings yet
l8 Signal Extraction
80 pages
2013TelgarskyRastislav-DominantFrequencyExtractionarXiv1306 0103
No ratings yet
2013TelgarskyRastislav-DominantFrequencyExtractionarXiv1306 0103
13 pages
DM Unit-5
No ratings yet
DM Unit-5
27 pages
Defense
No ratings yet
Defense
91 pages
Anirban CMI StatFin 2019 II
No ratings yet
Anirban CMI StatFin 2019 II
92 pages
Hctsa Manual
No ratings yet
Hctsa Manual
103 pages
Kshape
No ratings yet
Kshape
49 pages
Eigenbehaviors - Identifying Structure in Routine
No ratings yet
Eigenbehaviors - Identifying Structure in Routine
16 pages
Unsupervised Learning Algorithm 1
No ratings yet
Unsupervised Learning Algorithm 1
3 pages
Interactive and Dynamic Graphics For Data Analysis
No ratings yet
Interactive and Dynamic Graphics For Data Analysis
169 pages
Machine Learning Introduction Presentation
No ratings yet
Machine Learning Introduction Presentation
35 pages
XMPP Server Demo Scenario
No ratings yet
XMPP Server Demo Scenario
28 pages
CQF Brochure
No ratings yet
CQF Brochure
24 pages
Preventing Stuck DS (Drill String) Preventing Blow Out Mud Mudlogger Geologist Detect Oil and Gas Detect A Kick
No ratings yet
Preventing Stuck DS (Drill String) Preventing Blow Out Mud Mudlogger Geologist Detect Oil and Gas Detect A Kick
1 page
Market Reaction To The News Is The Most Important Not The News Itself Bullish Engulfing Pattern As Support
No ratings yet
Market Reaction To The News Is The Most Important Not The News Itself Bullish Engulfing Pattern As Support
1 page
Bej1906 004r2a0 PDF
No ratings yet
Bej1906 004r2a0 PDF
35 pages
ETP Specs Review
No ratings yet
ETP Specs Review
3 pages
Discovering Deformable Motifs in Continuous Time Series Data
No ratings yet
Discovering Deformable Motifs in Continuous Time Series Data
7 pages
Liu 2015
No ratings yet
Liu 2015
8 pages
Online Discovery and Maintenance of Time Series Motifs: Abdullah Mueen Mueen@cs - Ucr.edu Eamonn Keogh Eamonn@cs - Ucr.edu
No ratings yet
Online Discovery and Maintenance of Time Series Motifs: Abdullah Mueen Mueen@cs - Ucr.edu Eamonn Keogh Eamonn@cs - Ucr.edu
10 pages
A Review On Time Series Data Mining
100% (1)
A Review On Time Series Data Mining
18 pages
Pattern Matching With Acceleration Data: Pramod Vemulapalli
No ratings yet
Pattern Matching With Acceleration Data: Pramod Vemulapalli
29 pages
Market Impact and Alpha (Jan 2017)
No ratings yet
Market Impact and Alpha (Jan 2017)
1 page
Time Series
No ratings yet
Time Series
29 pages
Big Data Exercieses
No ratings yet
Big Data Exercieses
6 pages
Data Mining-Mining Time Series Data
0% (1)
Data Mining-Mining Time Series Data
7 pages
Engineering Applications of Artificial Intelligence: Tak-Chung Fu
No ratings yet
Engineering Applications of Artificial Intelligence: Tak-Chung Fu
18 pages
Time Series Motif Discovery: Dimensions and Applications: Abdullah Mueen
No ratings yet
Time Series Motif Discovery: Dimensions and Applications: Abdullah Mueen
8 pages
Near-Neighbor Search in Pattern Distance Spaces
No ratings yet
Near-Neighbor Search in Pattern Distance Spaces
5 pages
JHDSS CourseDependencies
No ratings yet
JHDSS CourseDependencies
2 pages
1.1. Using Pre-Written Code
No ratings yet
1.1. Using Pre-Written Code
2 pages
YanchangZhao Refcard Data Mining
No ratings yet
YanchangZhao Refcard Data Mining
3 pages
Statement of Accomplishment: Welly Tambunan
No ratings yet
Statement of Accomplishment: Welly Tambunan
1 page
Time - Series - in - Brief
No ratings yet
Time - Series - in - Brief
11 pages
Data Mining
No ratings yet
Data Mining
22 pages
Optimal Multi-Scale Patterns in Time Series Streams: Spiros Papadimitriou Philip S. Yu
No ratings yet
Optimal Multi-Scale Patterns in Time Series Streams: Spiros Papadimitriou Philip S. Yu
12 pages
SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model
No ratings yet
SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model
11 pages
SPMF - A Java Open-Source Data Mining Library
No ratings yet
SPMF - A Java Open-Source Data Mining Library
1 page
Research
No ratings yet
Research
2 pages
Assumption-Free Anomaly Detection in Time Series: Figure 1. A Snapshot of The Anomaly Detection Tool
No ratings yet
Assumption-Free Anomaly Detection in Time Series: Figure 1. A Snapshot of The Anomaly Detection Tool
4 pages
Data Driven Market Making Via Model Free Learning
No ratings yet
Data Driven Market Making Via Model Free Learning
8 pages
Bag of Features
No ratings yet
Bag of Features
7 pages
USACO Training
No ratings yet
USACO Training
12 pages
Oliver and The Game - Topological Sort & Algorithms Practice Problems - HackerEarth
No ratings yet
Oliver and The Game - Topological Sort & Algorithms Practice Problems - HackerEarth
5 pages