100 Time Series Data Mining Questions With Answers
(with answers!)
Keogh’s Lab (with friends)
Dear Reader: This document offers examples of time series questions/queries, expressed in intuitive natural language, that
can be answered using simple tools, like the Matrix Profile, and related tools such as MASS.
We show the step-by-step solutions. In most cases, the solutions require just a handful of lines of code.
As you may have noticed, we are not at 100 yet! This is a long term work-in-progress. We welcome suggestions and
“donations” of questions.
The code and data are here: www.cs.ucr.edu/~eamonn/HundredQuestions.zip
Corrections and suggestions to [email protected]
In a handful of cases, we report timing results. These examples were run on an old machine and were optimized for simplicity, not speed. In any case, the timings will
become dated with Moore’s Law. In addition, we are constantly optimizing our code. We only mean to give relative numbers for your instruction. Please do
not report the absolute numbers; instead, run the experiments yourself, with the most optimized code available.
[Figure: the eog_sample time series; x-axis 0 to 350,000.]
Let us run the Matrix Profile, looking for four-second long motifs…
>> load eog_sample.mat
>> [matrixProfile, profileIndex, motifIndex, discordIndex] = interactiveMatrixProfileVer3_website(eog_sample, 400);
The code takes a while to fully converge, but in just a few seconds, we see some stunningly well conserved motifs…
Note that there may be more examples of each motif. We could take one of the above, and use MASS to find the top 100 neighbors;
see “Have we ever seen a pattern that looks just like this?”. We can also adjust the range parameter r inside the motif extraction code.
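For readers without the interactive MATLAB tool, the idea behind the call above can be sketched in plain Python. This is a brute-force illustration, not the authors' code: the function names, the toy data and the exclusion zone of m // 2 are all assumptions made for this example.

```python
import math
import random

def znorm(x):
    """Z-normalize a list (zero mean, unit standard deviation)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd if sd > 0 else 0.0 for v in x]

def matrix_profile(ts, m):
    """Brute force: distance to, and index of, each subsequence's nearest neighbor."""
    subs = [znorm(ts[i:i + m]) for i in range(len(ts) - m + 1)]
    excl = m // 2  # assumed exclusion zone, to skip trivial matches
    mp, idx = [math.inf] * len(subs), [-1] * len(subs)
    for i, a in enumerate(subs):
        for j, b in enumerate(subs):
            if abs(i - j) <= excl:
                continue
            d = math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
            if d < mp[i]:
                mp[i], idx[i] = d, j
    return mp, idx

# Toy data: uniform noise with the same 16-point pattern planted at 20 and 70.
rng = random.Random(0)
ts = [rng.random() for _ in range(120)]
pattern = [math.sin(2 * math.pi * k / 16) for k in range(16)]
ts[20:36] = pattern
ts[70:86] = pattern

mp, idx = matrix_profile(ts, 16)
motif_a = min(range(len(mp)), key=mp.__getitem__)  # lowest matrix profile value
motif_b = idx[motif_a]                             # its nearest neighbor
print(motif_a, motif_b, mp[motif_a])
```

The top motif is simply the pair of subsequences with the lowest matrix profile value; here that is the planted pair at 20 and 70. This quadratic loop is only meant for toy inputs.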
What are the three most unusual days in this three month long dataset?
[Figure: the taxi demand time series; x-axis 0 to 3500.]
The dataset is taxi demand in New York City, over the last three months of the year.
We choose a subsequence length of 100 datapoints, which is about two days (the exact value does not matter much here).
The code pops up the matrix profile tool, and one second later, we are done! The three most unusual days
correspond to the three highest values of the matrix profile (i.e. the discords), but what are they?
• The highest value corresponds to Thanksgiving.
• We find a secondary peak around Nov 6th. What could it be? Daylight Saving Time! The clock going backwards one hour
gives an apparent doubling of taxi load.
• We find a tertiary peak around Oct 13th. What could it be? Columbus Day! Columbus Day is largely ignored in much of
America, but it is still a big deal in NY, with its large Italian American community.
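The discord idea can be sketched in plain Python on a toy series. This is not the authors' tool: the function names, the integer "daily cycle", the three planted anomalies and the rule of keeping picks at least 2m apart are all assumptions made for this illustration.

```python
import math

def znorm(x):
    """Z-normalize a list (zero mean, unit standard deviation)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd if sd > 0 else 0.0 for v in x]

def matrix_profile(ts, m):
    """Brute force: distance from each subsequence to its nearest neighbor."""
    subs = [znorm(ts[i:i + m]) for i in range(len(ts) - m + 1)]
    mp = []
    for i, a in enumerate(subs):
        best = math.inf
        for j, b in enumerate(subs):
            if abs(i - j) > m // 2:  # assumed exclusion zone
                best = min(best, math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b))))
        mp.append(best)
    return mp

def top_discords(mp, k, excl):
    """Greedily pick the k highest matrix profile values, more than excl apart."""
    picked = []
    for i in sorted(range(len(mp)), key=lambda i: -mp[i]):
        if all(abs(i - p) > excl for p in picked):
            picked.append(i)
        if len(picked) == k:
            break
    return picked

m = 12
ts = [abs(t % m - 6) for t in range(360)]   # a perfectly regular cycle of period 12
anomalies = [70, 180, 290]                  # where we plant three unusual "days"
ts[70:75] = [0] * 5                         # a flat lull
ts[180:184] = [2 * v for v in ts[180:184]]  # a doubled load
ts[290:296] = [v + 4 for v in ts[290:296]]  # a level shift

mp = matrix_profile(ts, m)
discords = top_discords(mp, 3, 2 * m)
print(sorted(discords))
```

Every unaffected subsequence has an exact twin one period away, so its matrix profile value is zero; only windows that touch a planted anomaly stand out, and each of the three picks lands within one subsequence length of a different anomaly.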
[Figure: the taxi demand data and its matrix profile, x-axes 1 to 3600, with the three peaks labeled Thanksgiving, Daylight Saving Time and Columbus Day.]
Is there any pattern that is common to these two time series?
[Figure: the two time series, labeled “queen” and “ice”; x-axis 0 to 2.5 × 10^4.]
Let's assume that the common pattern is 3 seconds, or 300 datapoints, long.
Let us concatenate the two time series, and smooth them (just for visualization purposes, we don’t really need to).
Now let us find the top motif, but insist that one occurrence comes before 24289 (the length of the first time series), and one after…
>> load('Queen_vs_Ice.mat')
>> whos
  Name                  Size         Bytes     Class     Attributes
  mfcc_queen            1x24289      194312    double
  mfcc_vanilla_ice      1x23095      184760    double
>> interactiveMatrixProfileAB(smooth([mfcc_queen, mfcc_vanilla_ice]), 300, 24289); % This will spawn the interactive plot
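The crossover constraint can be sketched in plain Python as follows. This is a brute-force illustration under stated assumptions: `join_motif` is a hypothetical name, the data are synthetic, and we assume that subsequences spanning the crossover point are simply skipped (the authors' tool may treat them differently).

```python
import math
import random

def znorm(x):
    """Z-normalize a list (zero mean, unit standard deviation)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd if sd > 0 else 0.0 for v in x]

def join_motif(ts, m, crossover):
    """Closest pair with one subsequence wholly before crossover and one at or after it."""
    left = range(0, crossover - m + 1)
    right = range(crossover, len(ts) - m + 1)
    subs = {i: znorm(ts[i:i + m]) for i in list(left) + list(right)}
    best = (math.inf, -1, -1)
    for i in left:
        for j in right:
            d = math.sqrt(sum((p - q) ** 2 for p, q in zip(subs[i], subs[j])))
            if d < best[0]:
                best = (d, i, j)
    return best

# Two unrelated noisy series that share one 12-point pattern.
rng = random.Random(1)
a = [rng.random() for _ in range(80)]
b = [rng.random() for _ in range(60)]
shared = [math.sin(k / 2) + 0.1 * k for k in range(12)]
a[30:42] = shared
b[10:22] = shared

dist, i, j = join_motif(a + b, 12, len(a))
print(i, j - len(a), dist)  # location in a, location in b, distance
```

Subtracting the crossover from the second index converts it back to a position in the second series. In this toy case the search recovers the planted locations, 30 in the first series and 10 in the second.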
The data are two motifs discovered in the song of a bird [1], which we converted to MFCC. Let us load the data, and look
at the DTW alignment.
The DTW alignment clearly indicates where the differences lie: in the variability of the timing of a single note, about
two-thirds of the way through the snippet. This example is trivial to see, but in more complex processes, this visual analysis can
be very fruitful.
[Figure: the DTW alignment of the two motifs; x-axis 1 to 1200.]
[1] See Multifractal analysis reveals music-like dynamic structure in songbird rhythms, by Tina Roeske et al.
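For readers who want to see what a DTW alignment is, here is a minimal textbook sketch in plain Python. It is not the code used for the bird song example: the function name `dtw`, the squared point-wise cost, the lack of any warping window and the two tiny sequences are all choices made for this illustration.

```python
def dtw(a, b):
    """Return (cost, path) for classic DTW with a squared point-wise cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] is the cost of the best alignment of a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = c + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    # Walk back from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return D[n][m], path[::-1]

# The second sequence holds one "note" (the value 1) for an extra sample.
a = [0, 1, 2, 1, 0]
b = [0, 1, 1, 2, 1, 0]
cost, path = dtw(a, b)
print(cost, path)
```

The path pairs a[1] with both b[1] and b[2], which is exactly the kind of local timing difference the alignment plot makes visible.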
The question is a little underspecified, as the length of the conserved patterns was not given. Let us try two hours, which is about
800 data points.
The full 20,000 datapoints represent about 14 days of electrical demand data for a house in the U.K. Thus we first need to divide it
into approximately two-day chunks.
>> load TwoWeekElectrical
>> seven_two_day_chunks = divide_data(T);
Now we just need to call the consensus motif code.
>> consensus_motifs = consensusMotifs(seven_two_day_chunks,800); % 800 is the length of subsequence
[Figure: the time series, spanning Jan/1/1995 to May/31/1998.]
We obtain the “regime bar,” which tells us which snippet “explains” which region of the data. As it
happens, the two snippets seem to represent summer and winter regimes, respectively.
[Figure: the regime bar, showing Snippet 1 and Snippet 2 over Jan/1/1995 to May/31/1998.]
Are there any patterns that appear as time reversed versions of themselves in my data?
[Figure: the time series; x-axis 0 to 5000.]
Let us load the data, and concatenate it to itself, after flipping it left to right.
We can then search for a join motif that spans 5046, the length of the original time series.
If we find a good join motif, it means that the conserved pattern is time reversed!
>> load('mfcc.mat')
>> length(mfcc1(1,:))
ans = 5046
>> interactiveMatrixProfileAB(([mfcc1(1,:)'; flipud(mfcc1(1,:)')]), 150, 5046); % This will spawn the interactive plot
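The flip-and-join trick, including the step of mapping the match back to a position in the original series, can be sketched in plain Python. The toy data, the planted phrase and the variable names are assumptions made for this example; it is not the authors' code.

```python
import math
import random

def znorm(x):
    """Z-normalize a list (zero mean, unit standard deviation)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd if sd > 0 else 0.0 for v in x]

# Toy data: noise, a phrase at 10, and the same phrase played backwards at 60.
rng = random.Random(2)
ts = [rng.random() for _ in range(100)]
m = 16
phrase = [math.sin(k / 3) + 0.05 * k for k in range(m)]
ts[10:26] = phrase
ts[60:76] = phrase[::-1]

n = len(ts)
joined = ts + ts[::-1]  # the series followed by its mirror image
subs = {s: znorm(joined[s:s + m])
        for s in list(range(n - m + 1)) + list(range(n, 2 * n - m + 1))}

best = (math.inf, -1, -1)
for i in range(n - m + 1):             # starts in the original half
    for j in range(n, 2 * n - m + 1):  # starts in the flipped half
        d = math.sqrt(sum((p - q) ** 2 for p, q in zip(subs[i], subs[j])))
        if d < best[0]:
            best = (d, i, j)

dist, i, j = best
reversed_at = 2 * n - j - m  # where the flipped match starts in the original series
print(i, reversed_at, dist)
```

A window starting at j in the flipped half covers original indices 2n - j - m up to 2n - j - 1, read backwards, which is why the last step is needed. Here the forward phrase at 10 is matched with its reversed copy at 60.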
“… time is of course from Symphony No. 47. Here Haydn writes out only one reprise of a two-reprise form, and the performer must
play the music ‘backward’ the second time around”.
The data is the 1st MFCC of this piece of music.
[Figure: the full recording (0 to 21:02, minutes:seconds) and the top join motif: two 40-second snippets (1 to 150 datapoints each), at 14:16 and 14:53, one labeled “al roverso”.]
When does the regime change in this time series?
[Figure: arterial blood pressure of a healthy pig, then with internal bleeding induced; x-axis 0 to 15000.]
In this dataset, at time stamp 7,500, bleeding was induced in an otherwise healthy pig. This changes the pig’s ABP measurement, but
only very slightly. Could we find the location of the change, if we were not told it? Moreover, can we do this with no domain knowledge?
In other words, can we detect regime changes in time series?
>> TS = load('PigInternalBleedingDatasetArtPressureFluidFilled_100_7501.txt');
>> SL = 100; % SL is the subsequence length
>> CAC = RunSegmentation(TS, SL);
>> plot(CAC,'c')
>> [~, loc] = min(CAC) % loc is 7460, a close approximation to the true value, 7500
Here, we choose SL to be 100, approximately the length of one period of arterial pressure (or the period of whatever repeated patterns you have
in your data); however, anything from half to twice that value would work just as well. The output curve, the CAC, reaches its minimum at just the right place.
How does it do it? In brief, if we examine the pointers in the Matrix Profile Index, we will find that very few cross over the location of a regime
change (most healthy beats have a nearest neighbor that is another healthy beat, and most “bleeding” beats have a nearest neighbor that is another
“bleeding” beat). It is this lack of pointers crossing over the regime change that the CAC measures.
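The pointer-counting idea can be sketched in plain Python. This is a simplified illustration, not the authors' RunSegmentation code: the function names, the synthetic two-regime series, the parabola 2p(n - p)/n used as the idealized arc count, and the choice to ignore 5m points at each end are assumptions made for this example.

```python
import math

def znorm(x):
    """Z-normalize a list (zero mean, unit standard deviation)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd if sd > 0 else 0.0 for v in x]

def nn_index(ts, m):
    """Index of each subsequence's nearest neighbor (brute force)."""
    subs = [znorm(ts[i:i + m]) for i in range(len(ts) - m + 1)]
    idx = []
    for i, a in enumerate(subs):
        best, arg = math.inf, -1
        for j, b in enumerate(subs):
            if abs(i - j) > m // 2:  # assumed exclusion zone
                d = sum((p - q) ** 2 for p, q in zip(a, b))
                if d < best:
                    best, arg = d, j
        idx.append(arg)
    return idx

def corrected_arc_curve(idx, m):
    """Count nearest-neighbor pointers crossing each position, relative to chance."""
    n = len(idx)
    arcs = [0] * n
    for i, j in enumerate(idx):
        for p in range(min(i, j) + 1, max(i, j)):
            arcs[p] += 1
    cac = [1.0] * n
    for p in range(5 * m, n - 5 * m):   # the edges are unreliable, so leave them at 1
        ideal = 2.0 * p * (n - p) / n   # crossings expected if pointers were random
        cac[p] = min(arcs[p] / ideal, 1.0)
    return cac

m = 8
regime1 = [abs(t % m - 4) for t in range(120)]  # triangle wave
regime2 = [t % m for t in range(120)]           # sawtooth
ts = regime1 + regime2                          # the regime changes at index 120

cac = corrected_arc_curve(nn_index(ts, m), m)
loc = min(range(len(cac)), key=cac.__getitem__)
print(loc)
```

Every triangle window has its nearest neighbor among the triangle windows, and every sawtooth window among the sawtooth windows, so almost no pointers cross the boundary and the curve bottoms out at or shortly before index 120.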
[Figure: the CAC (y-axis 0 to 0.5, x-axis 0 to 15000), shown with the time series. The minimum value of the CAC suggests the location of the regime change.]
How can I compare these time series of different lengths?
If you have data that are of different lengths, you could make them the same length (using truncation or interpolation) or use DTW. However, for
some datasets, that would be a very bad idea. To see why, consider text instead of time series for a moment.
For example, to find the similarity between (Lisa, Lisabeth), truncation of the second half of Lisabeth works well. However, to find the
similarity between (Beth, Lisabeth), truncation of the second half of Lisabeth is clearly wrong.
One trick to solve this issue is MPdist, a distance measure that automatically solves the above dilemma by only comparing the most
similar parts of the sequences. Below we demonstrate it on the Y-axis of the time series recording the location of the tip of a pen as it writes
six girls' names.
>> load('TSs')
>> MPdist_Clustering(TSs)
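To make the (Beth, Lisabeth) intuition concrete, here is a brute-force sketch of an MPdist-style measure in plain Python. It is not the MPdist_Clustering code: the function names, the random toy series, and the use of the k-th smallest value with k set to 5% of the combined length are assumptions made for this illustration.

```python
import math
import random

def znorm(x):
    """Z-normalize a list (zero mean, unit standard deviation)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd if sd > 0 else 0.0 for v in x]

def join_profile(a, b, m):
    """For every subsequence of a, the distance to its nearest neighbor in b."""
    subs_b = [znorm(b[j:j + m]) for j in range(len(b) - m + 1)]
    out = []
    for i in range(len(a) - m + 1):
        s = znorm(a[i:i + m])
        out.append(min(math.sqrt(sum((p - q) ** 2 for p, q in zip(s, t)))
                       for t in subs_b))
    return out

def mpdist(a, b, m, fraction=0.05):
    """The k-th smallest value of the two join profiles taken together."""
    both = sorted(join_profile(a, b, m) + join_profile(b, a, m))
    k = math.ceil(fraction * (len(a) + len(b)))
    return both[min(k, len(both)) - 1]

rng = random.Random(3)
lisabeth = [rng.random() for _ in range(70)]
beth = lisabeth[30:]                       # the second part of the longer series
other = [rng.random() for _ in range(40)]  # an unrelated series

d_contained = mpdist(beth, lisabeth, 8)
d_other = mpdist(beth, other, 8)
print(d_contained, d_other)
```

Because only the best-matching parts count, the shorter series is at distance zero from the longer series that contains it, wherever it sits, and further away from the unrelated one. No truncation or interpolation is needed.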
See also “Is there any pattern that is common to these two time series?”
We can solve this with a quick and dirty trick. The code interactiveMatrixProfileAB(T,m,crossover) searches time
series T for a motif of length m, such that one of the motif pair occurs before crossover and one occurs after crossover.
We can take a time series and append a rescaled copy of itself to it, setting crossover to the length of the original time series. Now when we
find motifs, we are finding one at the original scale, and one at the rescaled size.
In this case, I want to know if any of my insect behaviors happens at length 5,000 and at 10,000, so I type…
>> load insectvolts.mat % load some insect epg data
>> interactiveMatrixProfileAB([insectvolts ; insectvolts(1:2:end)], 5000, length(insectvolts)); % search the appended data
[Figure: the two motifs in the rescaled space, and the same two motifs in the original, true space. This behavior took 20 seconds.]
Note you can do this for non-integer rescaling. Matlab will warn you, but it is defined and allowed.
I have 262,144 data points that record a penguin’s orientation (MagX) and the water/air pressure as he hunts for fish.
Question: Does he ever change his bearing leftwards as he reaches the apex of his dive?
This is easy to describe as a multidimensional search. The apex of a dive is just an approximately parabolic shape. I can
create this (an inverted parabola) with query_pressure = zscore([[-500:500].^2]*-1)';
I can create bearing leftwards with a straight rising line, like this: query_MagX = zscore([[-500:500]])';
We have seen elsewhere in this document how to search for a 1D pattern. For this 2D case, all we
have to do is add the two distance profiles together, before we find the minimum value.
Note that the best match location in 2D is different to either of the 1D queries.
We can do this for 3D or 4D…
However, there are some caveats. In brief, it almost never makes sense to do multidimensional
time series search in more than 3 or 4D. See Matrix Profile VI (M. Yeh, ICDM 2017) and “Weighting” (B. Hu, ICDM
2013). In addition, in some cases we may want to weight the dimensions differently, even though
they are both z-normalized Euclidean Distance.
What are the periodic bumps? They are wingstrokes as the bird “flies” underwater.
load penguintest.mat
figure; hold on;
query_pressure = zscore([[-500:500].^2]*-1)'; % inverted parabola: the apex of a dive
dist_p = MASS_V2(penguintest(:,1),query_pressure);
query_MagX = zscore([[-500:500]])'; % rising line: bearing leftwards
dist_m = MASS_V2(penguintest(:,2),query_MagX);
[val,loc] = min(dist_m + dist_p); % find best match location in 2D
plot(zscore(penguintest(loc:loc+length(query_MagX)-1,2)),'color',[0.85 0.32 0.09])
plot(zscore(query_MagX),'m')
plot(zscore(penguintest(loc:loc+length(query_pressure)-1,1)),'b')
plot(zscore(query_pressure),'g')
title(['Best matching sequence, pressure/MagX, is at ', num2str(loc)])
[Figure: the best match in 2D space; x-axis 0 to 1000.]
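The same "add the distance profiles" step can be sketched without MASS, using a brute-force distance profile in plain Python. The function names, the two synthetic channels and the planted locations are assumptions made for this example; only the idea of summing the per-dimension profiles comes from the text above.

```python
import math
import random

def znorm(x):
    """Z-normalize a list (zero mean, unit standard deviation)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd if sd > 0 else 0.0 for v in x]

def distance_profile(ts, query):
    """Z-normalized Euclidean distance from the query to every window of ts."""
    q, m = znorm(query), len(query)
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(q, znorm(ts[i:i + m]))))
            for i in range(len(ts) - m + 1)]

def argmin(xs):
    return min(range(len(xs)), key=xs.__getitem__)

m = 16
query_pressure = [-(k - 7.5) ** 2 for k in range(m)]  # inverted parabola
query_magx = [float(k) for k in range(m)]             # rising line

rng = random.Random(4)
pressure = [rng.random() for _ in range(120)]
magx = [rng.random() for _ in range(120)]
pressure[20:36] = query_pressure  # a dive apex with no turn
pressure[50:66] = query_pressure  # a dive apex...
magx[50:66] = query_magx          # ...with a turn at the same time
magx[10:26] = query_magx          # a turn with no dive

dist_p = distance_profile(pressure, query_pressure)
dist_m = distance_profile(magx, query_magx)
combined = [p + q for p, q in zip(dist_p, dist_m)]  # add the profiles, then minimize

best_p, best_m, best_2d = argmin(dist_p), argmin(dist_m), argmin(combined)
print(best_p, best_m, best_2d)
```

Each 1D search reports its own first perfect match (20 and 10), while the summed profile is only minimized at 50, the one place where both shapes occur together.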
How do I quickly search this long dataset for this pattern, if an approximate search is acceptable?
As shown elsewhere in this document, exact search is surprisingly fast under Euclidean Distance. However, let us suppose that you want to do even faster
search, and you are willing to do an approximate search (but want a high quality answer). A simple trick is to downsample both the data and the query by
the same amount (in the below, by 1 in 64) and search the downsampled data. If the data has low intrinsic dimensionality, this will typically give you very
good results.
Let us build a long dataset, with 67,108,864 datapoints, and a long query, with 8,192 datapoints
A full exact search takes 12.4 seconds, an approximate search takes 0.24 seconds, and produces (at least in
this example) almost exactly the same answer. The answer is just slightly shifted in time.
How well this will work for you depends on the intrinsic dimensionality of your data.
rng('default') % set seed for reproducibility
data = cumsum(randn(1,2^26)); % make data
query = cumsum(randn(1,2^13)); % make a query
tic
dist = MASS_V2(data, query);
[val,loc] = min(dist); % find best match location, exact
hold on
plot(zscore(data(loc:loc+length(query)-1)))
plot(zscore(query),'r')
title(['Exact best matching sequence is at ', num2str(loc)])
disp(['Exact best matching sequence is at ', num2str(loc)])
toc
figure;
downsampled_data = data(1:64:end); % create a downsampled version of the data
downsampled_query = query(1:64:end); % create a downsampled version of the query
tic
dist = MASS_V2(downsampled_data, downsampled_query);
[val,loc] = min(dist); % find best match location, approximate
loc = (loc-1)*64 + 1; % map the downsampled index back to the original (1-based) index
hold on
plot(zscore(data(loc:loc+length(query)-1)))
plot(zscore(query),'r')
title(['Approx best matching sequence is at ', num2str(loc)])
disp(['Approx best matching sequence is at ', num2str(loc)])
toc
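The downsample-then-search trick can also be sketched in plain Python on a small random walk. The function names and the toy sizes are assumptions made for this example, and the search is brute force rather than MASS. Note that Python indexes from zero, so the coarse location maps back with a plain multiplication.

```python
import math
import random

def znorm(x):
    """Z-normalize a list (zero mean, unit standard deviation)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd if sd > 0 else 0.0 for v in x]

def distance_profile(ts, query):
    """Z-normalized Euclidean distance from the query to every window of ts."""
    q, m = znorm(query), len(query)
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(q, znorm(ts[i:i + m]))))
            for i in range(len(ts) - m + 1)]

def argmin(xs):
    return min(range(len(xs)), key=xs.__getitem__)

# A small random walk, and a query cut out of it at a known location.
rng = random.Random(5)
data, level = [], 0.0
for _ in range(2048):
    level += rng.random() - 0.5
    data.append(level)
query = data[1000:1128]

exact_loc = argmin(distance_profile(data, query))

step = 8  # keep 1 point in 8
coarse_loc = argmin(distance_profile(data[::step], query[::step]))
approx_loc = coarse_loc * step  # map back to the original (0-based) index
print(exact_loc, approx_loc)
```

In this toy case the query starts at a multiple of the step, so the coarse search works on an exact slice of the downsampled data and returns the same location as the exact search, on 1/8 of the points. In general the approximate answer may be shifted by a few samples, as noted above.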
First trick: MASS (and several other FFT and DWT ideas) have their best case when the data length is a power of two, so pad the data to make it a power of
two (padding with zeros works fine).
Second trick: MASS V3 is a piecewise version of MASS that performs better when the size of the pieces is well aligned with the hardware. You need to
tune a single parameter, but the parameter can only be a power of two, so you can search over, say, 2^10 to 2^20. Once you find a good value, you can
hardcode it for your machine.
rng('default') % set seed for reproducibility
data = cumsum(randn(1,67000000)); % make data
query = cumsum(randn(1,2^13)); % make a long query
tic
dist = MASS_V2(data, query);
[val,loc] = min(dist); % find best match location
hold on
plot(zscore(data(loc:loc+length(query)-1)))
plot(zscore(query),'r')
disp(['Best matching sequence is at ', num2str(loc)])
toc
figure
data = [data zeros(1,2^nextpow2(67000000) - 67000000)]; % pad data to next pow of 2
tic
dist = MASS_V2(data, query);
[val,loc] = min(dist); % find best match location
hold on
plot(zscore(data(loc:loc+length(query)-1)))
plot(zscore(query),'r')
disp(['After padding: Best matching sequence is at ', num2str(loc)])
toc
figure
tic
dist = MASS_V3(data, query, 2^16);
[val,loc] = min(dist); % find best match location
hold on
plot(zscore(data(loc:loc+length(query)-1)))
plot(zscore(query),'r')
disp(['MASS V3 & padding: Best matching sequence is at ', num2str(loc)])
toc
If you run this code, it will output…
Best matching sequence is at 32463217
Elapsed time is 14.30 seconds.
After padding: Best matching sequence is at 32463217
Elapsed time is 12.31 seconds.
MASS V3 & padding: Best matching sequence is at 32463217
Elapsed time is 5.82 seconds.
Note that it outputs the exact same answer, regardless of the optimizations, but it is fast, then faster, then super fast.
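The padding arithmetic from the first trick is easy to check by hand. The short Python sketch below mirrors the 2^nextpow2(...) expression in the MATLAB code; the helper names are hypothetical and the five-point list is a stand-in for real data.

```python
def next_pow2(n):
    """Smallest power of two that is greater than or equal to n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

def pad_to_pow2(data):
    """Append zeros until the length is a power of two."""
    return data + [0.0] * (next_pow2(len(data)) - len(data))

padded = pad_to_pow2([3.0, 1.0, 4.0, 1.0, 5.0])
target = next_pow2(67_000_000)
print(len(padded), target, target - 67_000_000)
```

For the 67,000,000-point example, the padded length is 2^26 = 67,108,864, so only 108,864 zeros are appended. Windows that overlap the padding are not real data, so a best match reported there (if it ever happens) should be ignored.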
What is most likely to happen next?
Can we do “predictive text” for time series? It is meant to be very “light weight”: no model building, no training, no tweaking.
It is not predicting the value, only the shape, of things to come.
[Figure: We are monitoring “Boiler 179”. What happens next? Candidate continuations of the last five minutes (5 min ago, now, 5 min in future) are shown, labeled “with 1% probability”, “with 99.99% probability” and “with 4% probability”, plus an <other> option.]
Such predictions can justify themselves, like this...
We predicted this pattern because the last two times we saw a 24-hour prefix
that looked like the last day (May 2, June 9), it was followed by this shape.
Let us look at the Matrix Profile, and the top motif we find (bottom left). The results seem to make sense.
However, suppose in contrast that we knew nothing about the domain, and had chosen a motif length that was much too long, say length 3,000
(bottom right). How could we know that we had picked a length that was too long? There are two clues:
• Obviously, the motifs themselves will be less well conserved visually.
• The Matrix Profile itself offers useful clues. It tells us how “specially well conserved” the motif is (min(MP)) relative to the average subsequence
(mean(MP)). As the ratio of these two numbers approaches zero, it suggests a stunningly well conserved motif in the midst of otherwise unconserved
data. However, as the ratio of these two numbers approaches one, it suggests that the “motif” is no better conserved than we would expect by random
chance. In practice, we rarely compute these ratios, as it is visually obvious that the MP looks “flat”.
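The min(MP) / mean(MP) ratio described above is a one-liner once a matrix profile is in hand. The plain Python sketch below computes it for a noise series with and without a planted motif; the function names, the brute-force matrix profile and the toy data are assumptions made for this illustration.

```python
import math
import random

def znorm(x):
    """Z-normalize a list (zero mean, unit standard deviation)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd if sd > 0 else 0.0 for v in x]

def matrix_profile(ts, m):
    """Brute force: distance from each subsequence to its nearest neighbor."""
    subs = [znorm(ts[i:i + m]) for i in range(len(ts) - m + 1)]
    mp = []
    for i, a in enumerate(subs):
        best = math.inf
        for j, b in enumerate(subs):
            if abs(i - j) > m // 2:  # assumed exclusion zone
                best = min(best, math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b))))
        mp.append(best)
    return mp

def conservation_ratio(mp):
    """min(MP) / mean(MP): near zero for a well conserved motif."""
    return min(mp) / (sum(mp) / len(mp))

rng = random.Random(6)
noise = [rng.random() for _ in range(100)]
planted = list(noise)
shape = [math.sin(2 * math.pi * k / 12) for k in range(12)]
planted[15:27] = shape
planted[60:72] = shape

ratio_planted = conservation_ratio(matrix_profile(planted, 12))
ratio_noise = conservation_ratio(matrix_profile(noise, 12))
print(ratio_planted, ratio_noise)
```

With an exact planted motif the ratio is zero; for pure noise the best "motif" is only somewhat better than the average subsequence, so the ratio is well above zero.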
[Figure: two panels, with x-axes 1 to 400 (bottom left) and 1 to 3000 (bottom right), and y-axes 0 to 40.]
I need to find motifs faster! Part I
Why are there very slightly different results? The value of the exclusion zone, and of ‘r’, the radius,
were different here. See elsewhere to understand how these affect the motifs returned.
[a] https://www.cs.ucr.edu/~eamonn/ten_quadrillion.pdf
[b] https://www.cs.ucr.edu/~eamonn/public/GPU_Matrix_profile_VLDB_30DraftOnly.pdf
See also, How do I quickly search this long dataset for this pattern, if an approximate search is acceptable?
I need to find motifs faster! Part II
Below, we obtain speedup by downsampling, and get essentially the same results. However, it is important to note that in both
cases we found the motifs well before even preSHRIMP finished. In this case, in a few seconds.
For greatly oversampled datasets, this simple trick can get you speedups of 2 or 3 orders of magnitude!
[a] https://www.cs.ucr.edu/~eamonn/ten_quadrillion.pdf
[b] https://www.cs.ucr.edu/~eamonn/public/GPU_Matrix_profile_VLDB_30DraftOnly.pdf
Have we ever seen a pattern that looks just like this, but possibly at a different length?
In our insect data (voltage readings), a basic feeding primitive can be modeled with something like: [1:600].^0.2
We have a theory that a certain higher level behavior will result in “A long primitive, followed by a shorter and smaller primitive,
followed by another long primitive”. We can model it with: [[[1:600].^0.2] [[1:300].^0.2] [[1:600].^0.2]]
However, we don’t know how long the whole pattern could be…
The function below can solve this problem. It simply brute forces a MASS test for all lengths within a range.
This looks like a lot of code, but most of it is for plotting.
function [] = uniform_scaling_search(TAG, QUERY)
figure; % Spawn a blank figure
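Since only the first lines of the MATLAB function survive here, the following plain Python sketch shows the brute-force idea under stated assumptions. It is not the authors' uniform_scaling_search: the nearest-sample rescaling, the division by the window length (so that distances at different lengths are comparable) and the toy data are choices made for this illustration.

```python
import math
import random

def znorm(x):
    """Z-normalize a list (zero mean, unit standard deviation)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd if sd > 0 else 0.0 for v in x]

def rescale(q, length):
    """Stretch or shrink q to the given length by picking the nearest earlier sample."""
    n = len(q)
    return [q[k * n // length] for k in range(length)]

def uniform_scaling_search(ts, query, lengths):
    """Try every candidate length; return (distance, length, location) of the best match."""
    best = (math.inf, -1, -1)
    for length in lengths:
        q = znorm(rescale(query, length))
        for i in range(len(ts) - length + 1):
            s = znorm(ts[i:i + length])
            # Divide by the length so that long and short matches are comparable.
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)) / length)
            if d < best[0]:
                best = (d, length, i)
    return best

query = [k ** 0.2 for k in range(1, 21)]  # a 20-point "primitive"
rng = random.Random(7)
ts = [rng.random() for _ in range(150)]
ts[40:70] = rescale(query, 30)            # the primitive, stretched to 30 points

dist, best_len, best_loc = uniform_scaling_search(ts, query, range(20, 41, 2))
print(best_len, best_loc, dist)
```

The search recovers both the location (40) and the scale (30 points) of the stretched primitive, even though the query itself is only 20 points long.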
[Figure: the respiration time series; y-axis -1000 to 1000, x-axis 0 to 16000.]
This is a dataset of respiration from a sleep study. Each breath appears to be about 360 data points long. So let's search for time series
chains of length 360…
>> load respiration.mat
>> TSC1_demo(respiration , 360);
The algorithm finds the highlighted chains below.
[Figure: the same time series with the discovered chains highlighted; y-axis -1000 to 1000, x-axis 0 to 16000.]
Note the increasing “gulp” artifact that happens between cycles. Also note that it begins to happen earlier and earlier in
the cycle. What does this mean?
Here is the (lightly edited) annotation of Dr. Gregory Mason (LA BioMed/UCLA) an expert on cardiopulmonary interactions.
“The gulps are attempts to inspire against an obstruction coming the back of the tongue. The large signals are from the machine which do
not necessarily reach the patient, the small gulps are pathologic attempts to breathe. Why does it increase? With each successive breath
the patient tries harder to inspire. It finally is 'synchronized' and you don't see the small patient signal, and this event cycles over and
over. The cycling is best seen without treatment if one looks up "crescendo snoring," a hallmark of obstructive sleep apnea.”
Please consider “donating” a question
• Even better if you can donate data
• Even better if you can donate an answer!
Appendices
• Below are miscellaneous appendices
In some of our examples we needed to compare the Euclidean distances of pairs of time series that are of different lengths.
… above ideas on slightly noisy sine waves of different lengths (shown top right).
[Figure: “Which of these three pairs is closest?” Two panels plot Raw Euclidean Distance against Time Series Motif Length (1,000 to 10,000).]