Matrix Profile Tutorial Part1
• NSF IIS-1161997 II
• NSF 544969
• CNS 1544969
• SHF-1527127
• AFRL FA9453-17-C-0024
Any errors or controversial statements are due solely to Mueen and Keogh
Disclaimer:
Time series is an inherently visual domain, and we exploit that fact in this tutorial.
We therefore keep formal notations and proofs to an absolute minimum.
If you want them, you can read the relevant papers [a]
---
All the datasets used in this tutorial are freely available, and all experiments are reproducible.
[a] www.cs.ucr.edu/~eamonn/MatrixProfile.html
If you enjoy this tutorial, please check out our other tutorials:
www.cs.ucr.edu/~eamonn/public/SDM_How_to_do_Research_Keogh.pdf
www.cs.unm.edu/~mueen/DTW.pdf
Outline
Act 1
• Our Fundamental Assumption
• What is the (MP) Matrix Profile?
• Properties of the MP
• Developing a Visual Intuition for MP
• Basic Algorithms
• MP Motif Discovery
• MP Time Series Chains
• MP Anomaly Discovery
• MP Joins (self and AB)
• MP Semantic Segmentation
• From Domain Agnostic to Domain Aware: The Annotation Vector (A simple way to use domain knowledge to adjust your results)
• The “Matrix Profile and ten lines of code is all you need” philosophy.
• Break
Act 2
• Background on time series mining
• Similarity Measures
• Normalization
• Distance Profile
• Definition and Trivial Approach
• Just-in-time Normalization
• The MASS Algorithm
• Weighted Distance Profile
• Distance Profile with Gaps
• Matrix Profile
• STAMP
• STOMP
• GPU-STOMP
• Open problems to solve
Fundamental Assumption: Conservation is Key
[Figure: motif 1 and motif 2]
|T| = n = 3,000
Note that for most time series data mining tasks, we are not interested in any global properties of the time series; we are only interested in small local subsequences of this length, m.
These subsequences might be about the length of individual heartbeats (for ECGs), individual days (for social media behavior), individual words (for speech analysis), etc.
m = 100
The matrix profile at the ith location records the distance of the subsequence in T, at the ith location, to its nearest
neighbor under z-normalized Euclidean Distance.
For example, in the below, the subsequence starting at 921 happens to have a distance of 177.0 to its nearest
neighbor (wherever it is).
[Figure: matrix profile value 177 at location 921]
Another example. In the below, the subsequence starting at 378 happens to have a distance of 34.1 to its nearest neighbor (wherever it is).
[Figure: matrix profile value 34.1 at location 378; x-axis 0 to 3000]
For the rest of this tutorial….
The Matrix Profile is always shown in blue.
The matrix profile index (MPI) contains integers that are used as pointers. As a practical matter, even a signed 32-bit integer will let us have an MP of length 2,147,483,647, over one year of data at 60Hz. (A signed 64-bit integer gives us almost five billion years at 60Hz.)
In the following slides we won’t bother to show the matrix profile index, but be aware it exists,
and it allows us to find the nearest neighbor to any subsequence in constant time.
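As a minimal sketch of the definitions above (not the fast algorithms covered later in the tutorial), the brute-force code below computes a matrix profile and its index. It is written in plain Python for illustration; the function names, the toy series, and the exclusion zone of m/2 around each subsequence are our own assumptions.

```python
import math

def znorm(x):
    """Z-normalize a sequence (zero mean, unit standard deviation)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    if sd == 0:
        return [0.0] * len(x)  # constant subsequence: an assumed convention
    return [(v - mu) / sd for v in x]

def matrix_profile(T, m):
    """Brute-force self-join. Returns (MP, MPI) for subsequence length m."""
    n_sub = len(T) - m + 1
    subs = [znorm(T[i:i + m]) for i in range(n_sub)]
    excl = m // 2  # assumed exclusion zone, so a subsequence cannot match itself
    mp = [math.inf] * n_sub
    mpi = [-1] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) <= excl:
                continue
            d = math.dist(subs[i], subs[j])
            if d < mp[i]:
                mp[i], mpi[i] = d, j
    return mp, mpi

# Toy series: the shape [0, 1, 3, 1] occurs at locations 0 and 10
T = [0, 1, 3, 1, 0, 0, 0, 2, 7, 1, 0, 1, 3, 1, 0, 0]
mp, mpi = matrix_profile(T, 4)
print(mpi[0], mp[0])
```

Once both vectors exist, looking up the nearest neighbor of subsequence i is just mpi[i], which is the constant-time lookup mentioned above.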
An interesting exception: the two smallest values in the MP must have the same value, and their pointers must be mutual (each points to the other). This is the classic time series motif.
[Figure: excerpt of the matrix profile index: 1373 1375 1389 … 368 378 378 234 … 2000 2001 2002 2003 2003]
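A small sketch of how the top motif can be read off the two vectors, in Python for illustration. The function name and the toy values are hypothetical; the only idea taken from the slide is that the smallest MP value and its pointer identify a mutually pointing pair.

```python
def top_motif(mp, mpi):
    """Return (i, j): the location of the smallest MP value and its nearest neighbor."""
    i = min(range(len(mp)), key=lambda k: mp[k])
    return i, mpi[i]

# Hypothetical toy profile and index, for illustration only
mp = [2.1, 0.4, 1.9, 2.5, 0.4, 2.2]
mpi = [4, 4, 0, 1, 1, 2]

i, j = top_motif(mp, mpi)
is_mutual = (mpi[j] == i)   # the motif pair should point at each other
same_value = (mp[i] == mp[j])
print(i, j, is_mutual, same_value)
```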
Why is it called the Matrix Profile?
Construct a distance matrix of all pairs of subsequences of length m.
Key:
• Small distances are blue
• Large distances are red
• Dark stripe is excluded
How to “read” a Matrix Profile
Where you see relatively low values, you know that the subsequence in the original time
series must have (at least one) relatively similar subsequence elsewhere in the data (such
regions are “motifs” or reoccurring patterns)
Where you see relatively high values, you know that the subsequence in the original time series must be unique in its shape (such areas are “discords” or anomalies). In fact, the highest point is exactly the definition of Time Series Discord, perhaps the best anomaly detector for time series*
* Vipin Kumar performed an extensive empirical evaluation and noted that “..on 19 different publicly available data sets, comparing 9 different techniques (time
series discords) is the best overall technique.”. V. Chandola, D. Cheboli, V. Kumar. Detecting Anomalies in a Time Series Database. UMN TR09-004
How to “read” a Matrix Profile: Synthetic Motif Example
Where you see relatively low values, you know that the subsequence in the original time
series must have (at least one) relatively similar subsequence elsewhere in the data.
In fact, the lowest points must be a tying pair, and correspond exactly to the classic definition of time series motifs.
The corresponding subsequence in the raw data at this location must have at least one similar subsequence somewhere.
How to “read” a Matrix Profile:
Now that we understand what a Matrix Profile is, and we have some practice
interpreting them on synthetic data, let us spend the next five minutes to see
some examples on real data.
Note that we will typically create algorithms that use the Matrix Profile,
without actually having humans look at it.
Nevertheless, in many exploratory time series data mining tasks, just looking at
the Matrix Profile can give us unexpected and actionable insights.
Ready to begin?
Taxi Example: Part I
Given a long time series, where should you examine carefully?
The problem is called “Attention Prioritization”; a group at Stanford is working on this [a].
However, we think that the Matrix Profile can be used for this “for free”.
Below is the data, the hourly average of the number of NYC taxi passengers over 75 days in Fall of 2014.
Let's compute the Matrix Profile for it. We choose a subsequence length corresponding to two days…. (next slide)
[a] http://futuredata.stanford.edu/ASAP/extended.pdf
Taxi Example: Part II
• The highest value corresponds to Thanksgiving (the uniqueness of Thanksgiving was the only thing
the Stanford Team noted)
• We find a secondary peak around Nov 6th, what could it be? Daylight Saving Time! The clock going
backwards one hour, gives an apparent doubling of taxi load.
• We find a tertiary peak around Oct 13th, what could it be? Columbus Day! Columbus Day is largely
ignored in much of America, but still a big deal in NY, with its large Italian American community.
Taxi Example: Part III
The top motif is a typical work week, starting from Tuesday
Italy Power Demand (1995 to 1998)
[Figure: power demand with the weekend marked; x-axis 0 to 160]
The Taxi example was easy to solve by manual inspection of the raw data, but with just an order of magnitude more data, the problem becomes much harder. Let's try a similar, but larger example, Italian Power Demand 1995 to 1998.
Note that the matrix profile is very low on average: most weeks are similar to the previous week (persistence) or the same week in a different year (history).
All the high values can be explained by Italian holidays, most of which fall on different days in consecutive years.
Here the subsequence length was set to 150, but we still find these anomalies if we halve or triple that length.
Motif discovery can often surprise you.
While it is clear that this time series is not random, we did not expect the motifs to be so well conserved or repeated so many times. There is evidence of a vocabulary, and maybe even a grammar…
[Figure: motif 1, motif 2 and motif 3; x-axis 0 to 200, scale bar 2 seconds]
Seismology
If we see low values in the MP of a seismograph, it means there must have been a repeated earthquake.
Repeated earthquakes can happen decades apart.
Many fundamental problems in seismology, including the discovery of foreshocks, aftershocks, triggered earthquakes, swarms, volcanic activity and induced seismicity, can be reduced to the discovery of these repeated patterns.
[Figure: x-axis 0 to 1,000,000]
[Figure: regions 12,749,475 to 14,249,474 bp and 622,725 to 2,122,724 bp, with Zoom-In; x-axis 0 to 60,000]
*“much of the Y (Chimp chromosome) consists of lengthy, highly similar repeat units, or ‘amplicons’”
*J. Hughes et al., “Chimpanzee and human Y chromosomes are remarkably divergent in structure” Nature 463, (2010).
Music
[Figure: matrix profile of a song with the repeated lyrics “let it be, let it be, yeah let it be / And there will be an answer, let it …”; discord at 1m54 ({instrumental bridge}); x-axis Time (s), 0 to 180]
Summary
We need a parameter R.
1 < R < (small number, say 3)
Let's make R = 2 for now.
Next slide…
We ran time series chain discovery on the dataset. The only thing we tell it is the
length of the subsequence to use (about one heartbeat long).
[Figure: pressure trace (mmHg, 0 to 60) with Zoom In; tilt begins; x-axis 0 to 5000]
As the chain progresses, the depth of the dicrotic notch decreases….
[Figure labels: systolic decline, dicrotic notch, dicrotic runoff]
[Figure: 2004 to 2014; x-axis 0 to 500 weeks; Thanksgiving and Xmas marked]
[Figure: pressure, with Zoom-In; x-axis 0 to 18 seconds]
*Williams, C.L. et al. Muscle energy stores and stroke rates of emperor penguins: implications for muscle metabolism and dive
performance. Physiological and Biochemical Zoology.85.2(2011):120-133 Photo by Paul J. Ponganis
• There are literally 100’s of time series anomaly detectors.
• However, many claim that Time Series Discords is among the best.
“..on 19 different publicly available data sets, comparing 9 different techniques (time series discords) is the best overall technique among all techniques.” Vipin Kumar* (ACM SIGKDD 2012 Innovation Award Winner)
• This is good news for us, because if you compute the matrix profile, you have the discords “for free”. In fact, you have all the top K-discords, for any K.
• Why are discords so effective? (our subjective opinion)
• They make no assumptions about the data (so no wrong assumptions).
• They don’t need to learn a bunch of parameters; with no parameters to fit, it is hard to overfit.
• There is one pathological (but fixable) case where they don’t work (next slide)
*https://www.cs.umn.edu/research/technical_reports/view/09-004
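To illustrate the “for free” claim, here is a small Python sketch that extracts the top K discords from an already computed matrix profile. The function name, the toy profile and the exclusion radius are our own assumptions; the exclusion simply avoids reporting near-overlapping locations as separate discords.

```python
def top_k_discords(mp, k, excl):
    """Pick the k highest MP values, skipping locations within excl of one already chosen."""
    order = sorted(range(len(mp)), key=lambda i: mp[i], reverse=True)
    chosen = []
    for i in order:
        if len(chosen) == k:
            break
        if all(abs(i - c) > excl for c in chosen):
            chosen.append(i)
    return chosen

# Hypothetical toy matrix profile, for illustration only
mp = [1.0, 1.2, 5.0, 4.8, 1.1, 0.9, 3.0, 1.0]
discords = top_k_discords(mp, 2, 1)
print(discords)
```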
The twin freak problem (see next slide)
The definition of a discord is: The subsequence D that has the maximum distance from its (non-trivial match) nearest neighbor.
This is the discord. It is far from its nearest neighbor. Let us say it was caused by a valve being stuck one day..
The twin freak problem
The definition of a discord is: The subsequence D that has the maximum distance from its (non-trivial match) nearest neighbor.
..but suppose that the anomaly happened twice? Once on Monday, once on Friday…
The problem is that it is no longer the discord, under our classic definition ;-(
1) The Golden Batch: Here we have two time series that we think should be about the same. But when we join them, there is a join discord, a subsequence that appears only in A, but not in B. But why? (spoken word example below)
Assume we have two time series TA and TB ... Note that they can be of different lengths.
|TA| = 1,000    |TB| = 2,000
As before, we are not interested in any global properties of the time series; we are only interested in small local subsequences of this length, m.
These subsequences might be about the length of individual heartbeats (for ECGs), individual days (for social media behavior), individual words (for speech analysis), etc.
[Figure: TA and TB, each with a subsequence of length m = 100 marked]
For every subsequence in TA, we look for its nearest neighbor in TB.
The Matrix Profile at the ith location records the distance of the subsequence in TA, at the ith location, to its nearest neighbor in TB, under z-normalized Euclidean Distance.
The Matrix Profile is almost the same length as TA; it is shorter by just m − 1.
For example, in the below, the subsequence of length 100 starting at 362 happens to have a distance of 1.24 to its nearest neighbor (wherever it is) in TB.
However, the Matrix Profile does not tell us the location of the nearest neighbor in TB. To store this information, we can create another companion sequence, called a matrix profile index.
The green arrow points from the subsequence of TA starting at 362 to its nearest neighbor in TB. The nearest neighbor is located at 359 of TB.
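A brute-force Python sketch of the AB-join just described, for illustration only. The function names and the toy data are assumptions; note that no exclusion zone is needed here, because the neighbors come from a different time series.

```python
import math

def znorm(x):
    """Z-normalize a sequence (zero mean, unit standard deviation)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    if sd == 0:
        return [0.0] * len(x)  # constant subsequence: an assumed convention
    return [(v - mu) / sd for v in x]

def ab_join(TA, TB, m):
    """For each subsequence of TA: distance to, and location of, its nearest neighbor in TB."""
    subs_b = [znorm(TB[j:j + m]) for j in range(len(TB) - m + 1)]
    mp, mpi = [], []
    for i in range(len(TA) - m + 1):
        a = znorm(TA[i:i + m])
        dists = [math.dist(a, b) for b in subs_b]
        j = min(range(len(dists)), key=dists.__getitem__)
        mp.append(dists[j])
        mpi.append(j)
    return mp, mpi

# Toy data: the peak at location 2 of TA reappears, scaled and offset, at location 5 of TB
TA = [0, 0, 1, 4, 1, 0, 0, 0]
TB = [5, 5, 5, 5, 5, 7, 13, 7, 5, 5, 5, 5]
mp, mpi = ab_join(TA, TB, 3)
print(len(mp), mpi[2])
```

The result has |TA| − m + 1 entries, one per subsequence of TA, and the z-normalization is what lets the scaled and offset copy in TB match.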
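This is J_TATB
[Figure: TA (x-axis 0 to 1000) and TB (x-axis 0 to 2000); distance 1.24 between location 362 of TA and location 359 of TB]
The green arrow points from the subsequence of TB starting at 1340 to its nearest neighbor in TA. The nearest neighbor is located at 395 of TA.
This is J_TBTA
[Figure: TA (x-axis 0 to 1000) and TB (x-axis 0 to 2000); distance 1.05 between location 1340 of TB and location 395 of TA (zoom in)]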
Music I (join case)
Can you see any common structure between the two time series below?
Hint, it is probably about this length
[Figure: two time series; x-axis 0 to 20,000]
Music II (join case)
The data is the 2nd MFCC of two songs, Under Pressure and Ice Ice Baby
[Figure: Queen-Bowie and Vanilla Ice; x-axis up to 20,000]
A zoom-in of the best conserved region between the two time series (the similarity join)
[Figure: Queen-Bowie and Vanilla Ice; x-axis 0 to 500]
I
In the previous example we asked you to find “common structure between the two time series”. Now I am going to ask you the opposite question.
What is different between the two time series?
Hint, it can't be the regions in the matching boxes, since they have matches…
[Figure: UK and US; x-axis 0 to 800]
II
Closest Match: ED = 2.8
USA version : Harry had been on the Gryffindor House Quidditch team ever since his first year at Hogwarts and owned one of the best racing
DNA (join case)
We consider two strains of Legionella bacteria, L. pneumophila Paris and L. pneumophila Lens, which consist of 3,503,504 and 3,345,567 bp respectively. We
[Figure: L. pneumophila Paris and L. pneumophila Lens; x-axes 0 to 3,000,000 and 0 to 200,000]
Laura Gomez-Valero et al. Comparative and functional genomics of Legionella identified eukaryotic like proteins as key players in host–
Time Series Semantic Segmentation
Sometimes the system we are monitoring changes regimes; can we detect such changes?
[Figure: TiltECG: ..lying horizontal, tilting begins …]
FLOSS: Matrix Profile Segmentation
What do we want in a Semantic Segmentation Algorithm?
Key Observation
Recall that the Matrix Profile Index has pointers (arrows, arcs)
that point to the nearest neighbor of each subsequence.
So, if we slide across the Matrix Profile Index, and count how
many arrows cross each particular point, we expect to find few
that span the change of behavior.
If we use the sliding arc count to produce an arc-curve, we find
it is near zero at the point of system transition. This low value
signals the location of the system change.
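The arc counting can be sketched in a few lines of Python. This is only an illustration: the function name and the toy index are hypothetical, and we assume an arc between locations i and j “crosses” every location k with min(i, j) ≤ k < max(i, j).

```python
def arc_curve(mpi):
    """For each location k, count the nearest-neighbor arcs with min(i, j) <= k < max(i, j)."""
    n = len(mpi)
    marks = [0] * (n + 1)
    for i, j in enumerate(mpi):
        lo, hi = min(i, j), max(i, j)
        marks[lo] += 1   # the arc starts covering locations at lo
        marks[hi] -= 1   # and stops covering them at hi
    curve, running = [], 0
    for k in range(n):
        running += marks[k]
        curve.append(running)
    return curve

# Hypothetical index with two regimes: locations 0-3 point among themselves, as do 4-7
mpi = [2, 3, 0, 1, 6, 7, 4, 5]
ac = arc_curve(mpi)
print(ac)
```

In this toy case no arc spans the boundary between locations 3 and 4, so the count drops to zero there.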
[Figure: arc-curve; y-axis 0 to 1500, x-axis 0 to 5000. The arc count here is almost zero!]
There is one flaw. The arc-curve tends to be low near the beginning and end of the time series, just because there are fewer arcs that could cross at those locations.
What we can do is calculate what the arc-curve would look like if there was no system transition, and use that to correct the arc-curve.
If there was no system transition structure, the arc-curve would be an inverted parabola, with a height ½ the time series length.
Let's try this, next slide…
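A sketch of the correction in Python, under stated assumptions. An inverted parabola that is zero at both ends and has height n/2 at the midpoint can be written as 2·i·(n − i)/n; dividing the arc counts by it and capping at 1 is our reading of “use that to correct the arc-curve”. The function name, the `edge` parameter (locations at each end that are simply set to 1) and the toy counts are hypothetical.

```python
def corrected_arc_curve(ac, edge):
    """Divide arc counts by an idealized inverted parabola of height n/2, capped at 1."""
    n = len(ac)
    cac = [1.0] * n   # locations within `edge` of either end stay at 1 (assumed convention)
    for i in range(edge, n - edge):
        ideal = 2.0 * i * (n - i) / n   # 0 at both ends, n/2 at the midpoint
        if ideal > 0:
            cac[i] = min(ac[i] / ideal, 1.0)
    return cac

# Hypothetical arc counts with a regime change between locations 3 and 4
ac = [2, 4, 2, 0, 2, 4, 2, 0]
cac = corrected_arc_curve(ac, 1)
change_point = cac.index(min(cac))
print(change_point)
```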
[Figure: Empirical and Theoretical arc-curves; y-axis 0 to 2500]
The corrected arc-curve minimizes in the right place, better!
[Figure: corrected arc-curve, y-axis 0 to 1, x-axis 0 to 5000, shown again after noise was added. We added a lot of noise]
FLOSS is very robust to its only
parameter
The CAC has a single parameter, the subsequence length m.
But we can typically change it by an order of magnitude, and get very good results.
One individual Great Barbet sings…, ….another takes over…, …yet another takes over
[Figure: MFCC Space, GreatBarbet2_50_1900_3700.txt; x-axis 0 to 5000]
This dataset was hand annotated by an entomologist. The insect changes its feeding behavior at about time 1,800.
[Figure: Asian citrus psyllid (Diaphorina citri), InsectEPG2_50_1800.txt; x-axis 0 to 12000]
Pulsus Paradoxus is often visually apparent in the SP02 trace. Here we deliberately ignore this fact, and look only in the ECG trace, which is normally considered as not predictive of Pulsus Paradoxus.
Note that the clinician that annotated this data was in the room at the time and may have had access to information that is simply not available in this signal.
Pulsus paradoxus (PP), also paradoxic pulse or paradoxical pulse, is an abnormally large decrease in systolic blood pressure and pulse wave amplitude during inspiration.
See also https://www.youtube.com/watch?v=7AXIYQK5BBM
[Figure: PulsusParadoxusECG2_30_10000.txt; x-axis 0 to 18000]
Summary for Time Series Segmentation
The Matrix Profile allows a simple algorithm, FLUSS/FLOSS, for time series segmentation.
• It has been tested on the largest and most diverse collection of time series ever considered
for this problem, and in spite of (or perhaps, because of) its simplicity, it is state-of-the-art.
Better than rival methods, and better than humans (details offline).
From Domain Agnostic to Domain
Aware*
• The great strength of the MP is that it is domain agnostic. A single black box algorithm works for taxi demand, seismology, DNA, power demand, heartbeats, music, bird vocalizations....
• However, in a handful of cases, there is a need to, or some utility in,
incorporating some domain knowledge/constraints.
• There is a simple, generic and elegant way to do this, using the
Annotation Vector (AV).
• In the following slides we will show you the annotation vector in the
context of motif discovery, but you can use it with any MP algorithm.
• We will begin by showing you some examples of spurious motifs that can be discovered in particular domains, then we will show you how the AV mitigates them.
* Hoang Anh Dau and Eamonn Keogh. Matrix Profile V: A Generic Technique to Incorporate Domain Knowledge into Motif Discovery. KDD'17, Halifax, Canada.
Motivating Example 1: Stop-word Motif Bias
… wave is just a calibration signal, sent when the sensor has weak contact with the skin. It is a frequent, but spurious motif.
A snippet of ECG data from the LTAF-71 Database. The top motifs come from regions of the calibration signal because they are much more similar than the motifs discovered if we search only data that contains true ECGs.
Motivating Example 2: Simplicity Bias
Euclidean Distance has a bias toward simple shapes.
“Pairs of complex objects, even those which subjectively may seem very similar to the human eye, tend to be further apart under (Euclidean) distance than pairs of simple objects.” [1]
[Figure: Top-1 Motif and Motion Artifact; x-axis 1 to 8000]
A snippet of ECG time series in which two motion artifacts were deliberately introduced by the attending physician.
[1] Batista et al. “CID: an efficient complexity-invariant distance for time series.” Data Mining and Knowledge Discovery, 2014
Motivating Example 3:
Actionability Bias
In many cases a domain expert wants to find not simply the best
motif, but regularities in the data which are exploitable or actionable
in some domain specific ways.
“I want to find motifs in this web-click data, preferably occurring
on or close to the weekend.”
“I want to find motifs in this oil pressure data, but they would be
more useful if they end with a rising trend.”
If AV = 0: raises the MP value in order to remove the subsequence from the potential motif pool.
If AV = 1: retains the original MP value, to allow the motifs that best balance the fidelity of conservation with the user’s constraints to rise to the top.
This only leaves the question: how do we create such an AV for our domain of interest?
Key Claim: For most problems, a domain expert can design an appropriate AV with 5 minutes of introspection, and implement it in 2 or 3 lines of code or an Excel script.
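A short Python sketch of one way to apply an AV to a matrix profile. The exact correction formula is not given on this slide; here we assume the corrected value is MP + (1 − AV) × max(MP), which raises the value fully where AV is 0 and leaves it unchanged where AV is 1. The function name and toy values are hypothetical.

```python
def correct_matrix_profile(mp, av):
    """Raise MP values where the annotation vector is low (assumed formula, see lead-in)."""
    top = max(mp)
    return [d + (1.0 - a) * top for d, a in zip(mp, av)]

# Hypothetical toy values: location 0 has the lowest MP value but is vetoed by AV = 0
mp = [0.5, 3.0, 0.8, 2.0]
av = [0.0, 1.0, 1.0, 0.5]
cmp_ = correct_matrix_profile(mp, av)
best = cmp_.index(min(cmp_))
print(cmp_, best)
```

After the correction, the motif search runs unchanged on the corrected profile; in this toy case the best location moves from 0 to 2.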
Case study: Stop-word motif bias
[Figure: (top) stop-word motif; (middle) distance profile with threshold, and the extended exclusion zone for each data point below threshold; (bottom) annotation vector, y-axis 0 to 1, x-axis 0 to 3000. Also shown: original MP and corrected MP, x-axis 1 to 3000]
(top) We annotated a single stop-word from the LTAF-71 dataset. (middle) The stop-word distance profile to the entire dataset was thresholded to create an exclusion zone, which was used to create an AV (bottom).
By correcting the MP to bias away from stop-word motifs, we can discover medically meaningful motifs.
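The thresholding step can be sketched as follows, in Python for illustration. The function name, the toy distance profile, the threshold and the zone width are all hypothetical; the distance profile itself (distance from the annotated stop-word to every subsequence) is assumed to be already computed.

```python
def stop_word_av(distance_profile, threshold, zone):
    """AV is 0 within `zone` of any location whose distance to the stop-word is below threshold."""
    n = len(distance_profile)
    av = [1] * n
    for i, d in enumerate(distance_profile):
        if d < threshold:
            for k in range(max(0, i - zone), min(n, i + zone + 1)):
                av[k] = 0
    return av

# Hypothetical distance profile: the stop-word reoccurs near locations 2 and 7
dp = [9, 8, 1, 9, 9, 9, 9, 2, 9, 9]
av = stop_word_av(dp, 3, 1)
print(av)
```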
Case study: Actionability bias (i)
Suppressing motion artifacts
[Figure: Functional near-infrared spectroscopy (fNIRS) data, 690 nm intensity (subset of record fNIRS3); x-axis 0 to 12000]
A snippet of fNIRS searched for motifs of length 600. The motifs correspond to an atypical region, which (using external data, see Fig. 7 below) we know is due to a sensor artifact.
How to make the AV
• Slide a window of length m across the acceleration time series.
• Compare the STD of each subsequence with the mean of all the subsequences’ STDs, and assign the corresponding AV value to be either 0 or 1.
[Figure: fNIRS data and acceleration time series]
The synchronization between the fNIRS data and accompanying accelerometer data.
[Figure: acceleration time series, STD vector, mean of STD vector, and AV vector]
Points above the mean of all subsequences’ standard deviation are well aligned with regions of motion artifacts. The corresponding AV values for these points are 0, and 1 for the rest.
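The two steps above fit in a few lines. The Python sketch below is illustrative only: the function name and toy acceleration values are hypothetical, and we assume the population standard deviation is an acceptable choice of “STD”.

```python
import statistics

def motion_av(accel, m):
    """AV is 0 where a window's standard deviation exceeds the mean of all window STDs, else 1."""
    stds = [statistics.pstdev(accel[i:i + m]) for i in range(len(accel) - m + 1)]
    mean_std = sum(stds) / len(stds)
    return [0 if s > mean_std else 1 for s in stds]

# Hypothetical acceleration trace with a burst of motion in the middle
accel = [0, 0, 0, 0, 5, -5, 5, 0, 0, 0, 0]
av = motion_av(accel, 3)
print(av)
```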
Case study: Actionability bias (ii)
Suppressing motion artifacts
[Figure: motifs discovered with the classic approach, and the original matrix profile; x-axis 0 to 400]
(top to bottom) Motifs in fNIRS data discovered using classic motif search tend to be spurious motion artifacts,
[Figure: time series and complexity estimation; x-axis 1 to 60000]
The complexity measure shown in parallel to the raw data. We simply normalize this complexity vector to be in range [0 - 1] to obtain the final AV.
A visual intuition of the complexity estimation of three time series subsequences of different complexity levels.
Summary of the last ten minutes: annotation vector
Most of the time, the plain vanilla MP is going to be all you need to find motifs/discords/chains etc. for your data.
In some cases, you may get spurious results. That is to say, mathematically
correct results, but not what you want/need/expect for your domain.
In those cases, you can just invent a simple function to suppress the spurious
motifs, code it up as an annotation vector in a handful of lines of code, use it to
“correct” the MP, and then run the motifs/discords/chains algorithms as before.
Once you invent an AV, say AVdiesel_engine or AVTurkish_folk_music, you can reuse it on
similar datasets, share it with a friend, publish it etc.
The “Matrix Profile and ten lines of code is all you need” philosophy
Key Idea:
• We should think of the Matrix Profile as a black box, a primitive.
• As we will later see, in most cases we can think of it as being obtained
essentially for free.
• We claim that given this primitive, and at most ten lines of additional code, we can reproduce the results of hundreds of papers.
• This suggests that other people may be able to take this primitive, add ten lines of code, and do amazing things that have not occurred to us. We look forward to seeing what you come up with!
• In the meantime, let's see an example: with ten lines of additional code, we can reproduce the results of a published paper….
Motifs under Uniform Scaling
[Figure: the two imbedded examples; x-axis 1 to 10,048]
We took two exemplars from the same class from the MALLET dataset, and imbedded them into a random walk dataset. Even without the color-coded clue brushed onto the data by the Matrix Profile discovery tool, the repeated pattern is visually obvious.
Suggestion: Toggle back and forth with last slide
We stretched the left half of the time series by just 5%, and now the pair of imbedded patterns are no longer the top-1 motif, an unexpected and disquieting result.
5% stretching means the shapes begin to go out of phase, accumulating more and more error…
[Figure: 100% and 105% versions; x-axes 1 to 10,048 and 1 to 10,313]
*D.Yankov, et al (2007). Detecting Motifs Under Uniform Scaling. SIGKDD 2007.
This issue is easy to fix with our “Matrix Profile and ten lines of code is all you need” philosophy.
For example. Suppose you suspect that there are motifs in your dataset, that differ in length by 164%
Take the original dataset T, and copy a stretched version of it into T2, simply by using:
T2 = T(1: 100/164: end); % Unofficial matlab way to resample
Now call:
[JMP, JMPindex] = computeMatrixProfileJoin(T,T2,500);
The resulting Matrix Profile will discover the motifs with the appropriate uniform scaling invariance.
We did this for the electric power demand example below…
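For readers not using MATLAB, the resampling step can be sketched in Python as below. This is an illustration under our own assumptions: the function name is hypothetical, and it uses integer arithmetic to pick the nearest lower source index, which approximates the fractional-step indexing in the snippet above.

```python
def stretch(T, percent):
    """Return a copy of T stretched to roughly percent% of its length (nearest lower index)."""
    n_out = (len(T) - 1) * percent // 100 + 1
    return [T[k * 100 // percent] for k in range(n_out)]

T = list(range(10))
T2 = stretch(T, 164)    # the 164% example from the text
T3 = stretch(T, 200)    # doubling makes the index pattern easy to see
print(len(T2), T3[:5])
```

The stretched copy would then be joined against the original T, as in the computeMatrixProfileJoin call above.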
• There are some slides below, they are mostly back-up and bonus slides
End! www.cs.ucr.edu/~eamonn/MatrixProfile.html
[Figure: Euclidean Distance, Euclidean Distance / Length, and Euclidean Distance * Sqrt(1/Length), for the original length and for data downsampled 1 in 2, 1 in 3, 1 in 4 and 1 in 5]
See also: Hoang Anh Dau and Eamonn Keogh. Matrix Profile V: A Generic Technique to Incorporate Domain Knowledge into Motif Discovery. KDD'17, Halifax, Canada
Let's start by making a test dataset
In a smoothed random walk of length 50,000, we imbedded the reverse of one Mallet-6 at location 10,000, and the reverse of a different Mallet-6 at location 40,000.
We imbedded one Mallet-2 at location 15,000, and a different Mallet-2 at location 25,000, and yet another Mallet-6 at location 35,000.
Then we added noise to the entire thing:
TAG = (TAG + randn(size(TAG))/4)
[Figure: TAG; x-axis 0 to 5 (×10^4). Patterns taken from UCI Mallet]
[Figure: two runs of the motif discovery tool, each showing the input time series with the best-so-far motifs color coded, and the best-so-far corrected matrix profile (x-axis ×10^4)]
TAG: Pure motif search
Left panel: The best-so-far 1st motifs are located at 5503 (green) and 14051 (cyan). The best-so-far 3rd motifs are located at 19769 (green) and 36016 (cyan).
We did not find the imbedded motifs ;-(
Right panel: The best-so-far 1st motifs are located at 14988 (green) and 24987 (cyan). The best-so-far 3rd motifs are located at 37759 (green) and 47461 (cyan).
Zoom-in of part of MP
As you can see below, the true motif is low in the MP, just not quite low enough.
That means if we can just nudge the relevant section down a little (or equivalently, nudge everything else up), we would find the right motifs.
[Figure: zoom-in of the MP, y-axis 0 to 20; two patterns, x-axis 1 to 500]
What we need is a function that recognizes that one of these patterns is too simple to be of interest.
The complexity function might be too strong, and might push too hard for more complex motifs, even if they are
not really similar.
However, we can control its strength.
The dilution_factor is a number greater than or equal to zero. If it is zero, there is no dilution. If it is large enough, say over 40, we begin to degenerate to classic motif search.
At least for this problem, values in the range 2 to 16 work great.
% Makes annotation vector that favors complexity
% Dau Hoang Anh and Eamonn Keogh
% [annotationVector] = make_AV_complexity(data, subsequenceLength);
% Output: annotationVector: annotation vector (vector)
% Input: data: input time series (vector)
% subsequenceLength: motif length (scalar)
%
function [AV] = make_AV_complexity(data, subsequenceLength)
data = zscore(data); % data is a row vector
profile_length = length(data) - subsequenceLength + 1;
AV = zeros(profile_length,1);
for j = 1: profile_length
AV(j) = check_complexity(data(j:j+subsequenceLength-1));
end
end
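The helper check_complexity is not defined on this slide. The sketch below, in Python rather than MATLAB, assumes it is the complexity estimate from the CID paper cited earlier (the square root of the summed squared differences between consecutive points), and rescales the result to the range [0, 1] as described for the final AV. Both choices, and all names here, are our assumptions rather than the authors' code.

```python
import math

def complexity(sub):
    """Assumed complexity estimate: root of summed squared differences of consecutive points."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(sub, sub[1:])))

def make_av_complexity(data, m):
    """Annotation vector that favors complex subsequences, rescaled to [0, 1]."""
    raw = [complexity(data[j:j + m]) for j in range(len(data) - m + 1)]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [1.0] * len(raw)   # no variation in complexity: nothing to suppress
    return [(c - lo) / (hi - lo) for c in raw]

# Hypothetical data: flat at first, then rapidly oscillating
data = [0, 0, 0, 0, 1, -1, 1, -1]
av = make_av_complexity(data, 4)
print([round(v, 3) for v in av])
```

Because the final rescaling is min-max, multiplying the whole input by a constant does not change the AV, so the global z-scoring step of the MATLAB version is omitted here.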
Music Revisited 1
The MP is a useful tool for various music analysis tasks
Eagles – Hotel California
The MP can be used to create arc plots, giving a good visualization of the music structure
Yes, there are two paths you can go by, but in the long run
The plot is a histogram of the MPindex. The values record how many times a subsequence was considered NN of some other subsequence. The subsequence that maximizes this plot was used as the audio thumbnail
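The histogram idea can be sketched in Python as follows. The function name and the toy index are hypothetical; ties are broken in favor of whichever location appears first in the index, which is an arbitrary choice on our part.

```python
from collections import Counter

def thumbnail_location(mpi):
    """Return the location most often chosen as a nearest neighbor, and the full histogram."""
    counts = Counter(mpi)
    best = max(counts, key=counts.get)
    return best, counts

# Hypothetical matrix profile index, for illustration only
mpi = [3, 3, 0, 1, 3, 2]
best, counts = thumbnail_location(mpi)
print(best, counts[best])
```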
• http://bdh-rd.bne.es/viewer.vm?id=0000012553&page=1
• The elephant 1: Elephants via some Paintings of the Mughal Era, http://ranasafvi.com/mughal-elephants/