ASDM C02 Clustering
Andrés M. Alonso
1 Department of Statistics, UC3M
2 Institute Flores de Lemus
ASDM - C02
June 25 – 29, 2018, Boadilla del Monte
Outline
1 Introduction
Connectivity-based clustering
These algorithms connect “objects” to form "clusters" based on their distance/similarity.
A cluster can be described by the maximum distance needed to connect parts of the cluster.
At different distances, different clusters will form, which can be represented using a dendrogram.
[Figure: dendrogram with leaves BS, PR, BR, ES, FP, AC, NL, CH, NY, LW, KW, NN, WL, NW; distance axis from 0 to 1600]
Centroid-based clustering
Clusters are represented by a central
“object”, which may not necessarily
be a member of the data set.
k-means
k-medoids or PAM
(Model) Distribution-based clustering
The clustering model most closely related to statistics is based on distribution models.
Clusters can then easily be defined as objects belonging most likely to the same distribution/model.
[Figure: plot with labels uk, lu, nl, se, cy, ie, es, at, fr, lv, no, be, ro, hu, de, si, fi; vertical axis from 0.0 to 0.4]
Density-based clustering
Clusters are defined as areas of
higher density than the remainder of
the data set.
Objects in sparse areas are usually
considered to be noise and border
points.
The problem
Liao, T.W. (2005) Clustering of time series data-a survey, Pattern Recognition, 38,
1857–1874.
Aghabozorgi, S., Shirkhorshidi, A.S. and Wah, T.Y. (2015) Time-series clustering
– A decade review. Information Systems 53 16–38.
Starting point
To choose a metric to assess the dissimilarity between two time
series.
approach:
$D(X_i, X_j) = d(X_i - X_j)$,
[Figure: series labelled "Word YES", horizontal axis 0 to 8000]
be aligned.
[Figure: four panels, horizontal axis 0 to 9000, vertical axis −0.6 to 0.8]
Datafile <yesnot.xls>
A.M. Alonso, L. Cayuelas and A. Justel Time series clustering
Introduction
Introduction
Raw data clustering
Time series clustering by features
Autocorrelation clustering
Model based time series clustering
Spectral domain clustering
Time series clustering by dependence
Extreme value clustering
of the series.
[Figure: two time series panels, horizontal axis 0 to 1000; the values 48.735, 51.472, 0 and 51.669 are printed in the figure]
Autocorrelation clustering
But, in this case, autocorrelation functions are a “good” clustering criterion:
[Figure: sample autocorrelation functions in four panels, lags 0 to 20]
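A minimal sketch of an autocorrelation-based dissimilarity, assuming the Euclidean distance between the first sample autocorrelations as the metric (the slides do not fix a metric at this point). The function names, the maximum lag of 10 and the simulated AR(1) series are illustrative assumptions.

```python
import random

def sample_acf(x, max_lag):
    """Sample autocorrelations rho_1, ..., rho_max_lag of a series."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(1, max_lag + 1)]

def acf_distance(x, y, max_lag=10):
    """Euclidean distance between the sample ACF vectors of x and y."""
    rx, ry = sample_acf(x, max_lag), sample_acf(y, max_lag)
    return sum((a - b) ** 2 for a, b in zip(rx, ry)) ** 0.5

def simulate_ar1(phi, n, rng):
    """AR(1) series x_t = phi * x_{t-1} + e_t with standard normal e_t."""
    x, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        x.append(prev)
    return x

rng = random.Random(0)
a1 = simulate_ar1(0.9, 1000, rng)   # two series from the same model
a2 = simulate_ar1(0.9, 1000, rng)
b1 = simulate_ar1(-0.5, 1000, rng)  # one series from a different model

d_same = acf_distance(a1, a2)
d_diff = acf_distance(a1, b1)
print(d_same, d_diff)
```

Series generated by the same model should end up much closer to each other than to the series from the other model, even though their raw values differ.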
$T_{l,m}^{(j)} = l \sum_{k=1}^{m} (\hat{\rho}_{X_j,k} - \hat{\rho}_{Y_j,k})^2$,
[Figure: two panels, horizontal axis 0 to 250, vertical axis 0 to 25]
Alonso, A.M. and Maharaj, E.A. (2006) Comparison of time series using
subsampling, Computational Statistics and Data Analysis, 50,
2589–2599.
Datafile <BME.xls>
[Figure: dendrogram with vertical axis 1 − p-value; leaves ordered 3, 7, 2, 6, 1, 5]
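and
$\lambda_Y(\omega) = \sum_{k=-\infty}^{\infty} \gamma_{Y,k} \exp(-ik\omega)$
and
$F_Y(\omega_j) = \sum_{i=1}^{j} I_Y(\omega_i) \big/ \sum_{i=1}^{m} I_Y(\omega_i)$,
where $\omega_i = 2\pi i/n$, $I_X(\cdot)$ is the periodogram, and $m = \lceil (n-1)/2 \rceil$.
We can use the following test statistics:
$D_m = \sup_{\omega} |F_X(\omega) - F_Y(\omega)|$ or $W_m = \int_0^{\pi} (F_X(\omega) - F_Y(\omega))^2 \, d\bar{F}(\omega)$.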
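As an illustration of the cumulative periodogram comparison, the sketch below computes the normalized cumulative periodogram at the Fourier frequencies and the statistic D_m, taking the supremum over those frequencies. The function names and the simulated series are assumptions made for the example; the periodogram is computed with a direct (slow) Fourier sum, and its scaling constant cancels in the ratio.

```python
import cmath
import math
import random

def periodogram(x):
    """Periodogram at w_i = 2*pi*i/n for i = 1, ..., m with m = ceil((n-1)/2)."""
    n = len(x)
    mean = sum(x) / n
    m = math.ceil((n - 1) / 2)
    values = []
    for i in range(1, m + 1):
        w = 2 * math.pi * i / n
        s = sum((x[t] - mean) * cmath.exp(-1j * w * t) for t in range(n))
        values.append(abs(s) ** 2 / (2 * math.pi * n))
    return values

def cumulative_periodogram(x):
    """F(w_j) = sum_{i<=j} I(w_i) / sum_{i<=m} I(w_i)."""
    p = periodogram(x)
    total = sum(p)
    acc, out = 0.0, []
    for v in p:
        acc += v
        out.append(acc / total)
    return out

def d_statistic(x, y):
    """D_m: largest absolute gap between the two cumulative periodograms."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    fx, fy = cumulative_periodogram(x), cumulative_periodogram(y)
    return max(abs(a - b) for a, b in zip(fx, fy))

rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
smooth, prev = [], 0.0
for e in noise:               # AR(1) with coefficient 0.9 driven by the same noise
    prev = 0.9 * prev + e
    smooth.append(prev)

d_noise_vs_smooth = d_statistic(noise, smooth)
print(d_noise_vs_smooth)
```

The persistent series concentrates its spectral mass at low frequencies, so its cumulative periodogram rises much faster than that of white noise and the statistic is large.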
defined on $\{x : 1 + \xi(x - \mu)/\sigma > 0\}$, where $-\infty < \mu < \infty$, $\sigma > 0$ and $-\infty < \xi < \infty$.
The three parameters $\mu$, $\sigma$ and $\xi$ are the location, scale and shape parameters, respectively, where $\xi$ determines the three extreme value types.
When $\xi < 0$, $\xi > 0$ or $\xi = 0$, the GEV distribution is the negative Weibull, the Fréchet or the Gumbel distribution, respectively.
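The role of the shape parameter can be checked numerically. The sketch below assumes the standard GEV distribution function $G(x) = \exp\{-[1 + \xi(x-\mu)/\sigma]^{-1/\xi}\}$, with the Gumbel form $\exp\{-\exp[-(x-\mu)/\sigma]\}$ at $\xi = 0$; the formula itself does not survive in the extracted slide text, and the function name `gev_cdf` is hypothetical.

```python
import math

def gev_cdf(x, mu=0.0, sigma=1.0, xi=0.0):
    """Standard GEV distribution function (assumed form, see lead-in)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    if xi == 0.0:
        # Gumbel case: the support is the whole real line
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:
        # Outside the support {x : 1 + xi (x - mu) / sigma > 0}:
        # below the lower endpoint when xi > 0 (Frechet), above the
        # upper endpoint when xi < 0 (negative Weibull)
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

for shape in (-0.5, 0.0, 0.5):
    print(shape, [round(gev_cdf(v, xi=shape), 4) for v in (-3, 0, 3)])
```

With $\xi = 0.5$ the value at $x = -3$ is 0 because the point lies below the lower endpoint, while with $\xi = -0.5$ the value at $x = 3$ is 1 because the point lies above the upper endpoint.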
[Figure: three series in separate panels, horizontal axis 0 to 5500, vertical axis 0 to 40]
Datafile <SpainTemperature.xls>
GEV estimates <SpainTemperatureEstimates.xls>
Introduction
Introduction
Time series clustering by features
Forecast density clustering
Model based time series clustering
Multivariate models with cluster structure
Time series clustering by dependence
$d(X, Y) = (\Xi_X - \Xi_Y)' \Sigma_{\Xi}^{-1} (\Xi_X - \Xi_Y)$,
[Figure: log-mortality rates in two panels, 1970 to 2000]
[Figure: two panels; axis labels: Mean Absolute Prediction Error, MAPE reduction, Number of considered models, Prediction horizon]
consider:
the models;
the last available observation;
the future values.
[Figure: series with the point labelled "Present", horizontal axis 0 to 70]
where p = 1, 2.
3 Finally, we use classical clustering procedures that allow distances as inputs.
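One family of classical procedures that only needs a distance matrix is agglomerative hierarchical clustering. The following is a minimal sketch, not the implementation used in the slides; the function name, the toy distance matrix and the choice of stopping at a fixed number of clusters are assumptions.

```python
def agglomerative(dist, n_clusters, linkage="complete"):
    """Merge clusters bottom-up from a symmetric distance matrix.

    dist is a list of lists; returns a list of clusters, each a sorted
    list of row indices. Supported linkages: "single" and "complete".
    """
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    combine = max if linkage == "complete" else min
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Distance between clusters a and b under the chosen linkage
                d = combine(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Toy distance matrix: absolute differences between five scalar features
features = [0.0, 1.0, 1.5, 10.0, 11.0]
dist = [[abs(u - v) for v in features] for u in features]
print(agglomerative(dist, 2, linkage="complete"))
```

Any of the dissimilarities between series discussed in these slides can be placed in `dist`; the clustering step never looks at the raw series.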
$X_t = m(\mathbf{X}_{t-1}) + \varepsilon_t$,
where
$\{\varepsilon_t\}$ is an i.i.d. sequence
$\mathbf{X}_{t-1}$ is a d-dimensional vector of known lagged variables
$m(\cdot)$ is assumed to be a smooth function but it is not restricted to any pre-specified parametric model.
for p = 1, 2.
Clustering step
[Figure: yearly series, 1960 to 1998, for Australia, Austria, Belgium, Canada, China, Cyprus, Denmark, Finland, France, Greece, Hungary, Ireland, Italy, Japan, Luxembourg, Malta, Netherlands, Norway, Poland, Portugal, Spain, Sweden, United Kingdom and United States; vertical axis 0 to 40]
[Figure: three dendrograms whose leaves are the 24 countries, identified by their three-letter codes]
Dynamic Factor Models can deal with large sets of time series.
Engle and Watson (1981), Peña and Box (1987), Forni et al. (2000), Bai and Ng (2002), Peña and Poncela (2006), Hallin and Liska (2007), Alonso et al. (2011), Lam and Yao (2012), Forni et al. (2015, 2016, 2017).
where
$f_{0t} = (f_{01t}, \ldots, f_{0r_0t})'$ is an $r_0$-dimensional vector of common factors, $P_0$ is an $m \times r_0$ factor loading matrix and $k$ is the number of clusters.
$f_{it} = (f_{i1t}, \ldots, f_{ir_it})'$ is an $r_i$-dimensional vector of group-specific factors corresponding to the $i$th cluster and $P_i$ is the $m \times r_i$ factor loading matrix of these specific factors. The columns of the matrix $P_i$ are of the form $(0, \ldots, 0, p_{j1}, \ldots, p_{jm_i}, 0, \ldots, 0)$, for $j = 1, \ldots, r_i$.
Ando, T. and Bai J. (2016) Panel data models with grouped factor
structure under unknown group membership, Journal of Applied
Econometrics, 31, 163–191.
Ando, T. and Bai J. (2017) Clustering huge number of financial
time series: A panel data approach with high-dimensional
predictor and factor structures, Journal of the American
Statistical Association, in press.
[Figure: log-mortality rates, 1910 to 2010, in two panels; annotations in the first panel mark the Spanish influenza pandemic and the Civil War period]
Note that
$|B_k| = |R(x)_k| \, |R(y)_k - R(y, x)_k R(x)_k^{-1} R(x, y)_k|$
Dependent series
The models for the three populations are:
1 AR(1): $X_t^{(1,i)} = 0.9 X_{t-1}^{(1,i)} + \epsilon_t^{(1,i)}$ with $i = 1, 2, \ldots, 5$.
2 AR(1): $X_t^{(2,i)} = 0.2 X_{t-1}^{(2,i)} + \epsilon_t^{(2,i)}$ with $i = 1, 2, \ldots, 5$.
3 AR(1): $X_t^{(3,i)} = 0.2 X_{t-1}^{(3,i)} + \epsilon_t^{(3,i)}$ with $i = 1, 2, \ldots, 5$.
That is, the second and the third models have the same autocorrelation structure.
The five scenarios differ in the dependence structure of the innovations. In the following, we present the autocorrelation matrices of $(\epsilon_t^{(1,1)}, \epsilon_t^{(1,2)}, \ldots, \epsilon_t^{(3,5)})$.
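The three populations can be simulated as follows. This sketch assumes mutually independent standard normal innovations, which is only one possible dependence structure and may not coincide with any of the five scenarios; the series length, burn-in and function names are also assumptions.

```python
import random

# AR(1) coefficients of the three populations
PHI = {1: 0.9, 2: 0.2, 3: 0.2}

def simulate_population(phi, n_series, n_obs, rng, burn_in=200):
    """Simulate n_series independent AR(1) paths with coefficient phi."""
    series = []
    for _ in range(n_series):
        x, prev = [], 0.0
        for t in range(n_obs + burn_in):
            prev = phi * prev + rng.gauss(0.0, 1.0)
            if t >= burn_in:      # discard the burn-in period
                x.append(prev)
        series.append(x)
    return series

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    return sum(dev[t] * dev[t + 1] for t in range(n - 1)) / sum(d * d for d in dev)

rng = random.Random(42)
data = {(g, i): s
        for g, phi in PHI.items()
        for i, s in enumerate(simulate_population(phi, 5, 500, rng), start=1)}

# Average lag-1 autocorrelation within each population
avg_r1 = {g: sum(lag1_autocorr(data[(g, i)]) for i in range(1, 6)) / 5 for g in PHI}
print(avg_r1)
```

With independent innovations, populations 2 and 3 cannot be told apart by their autocorrelations; only dependence between the innovations can separate them.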
[Figure: circular graphs with nodes 1 to 15, one graph per panel]
and
$Sim(C, C') = k^{-1} \sum_{i=1}^{k} \max_{1 \le j \le k'} Sim(C_i, C'_j)$.
The closer to one the index, the higher is the agreement between the
two partitions.
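The index can be computed directly from two partitions. The sketch below assumes the usual pairwise similarity $Sim(C_i, C'_j) = 2|C_i \cap C'_j| / (|C_i| + |C'_j|)$, whose definition does not survive in the extracted text; the function names and the example partitions are illustrative.

```python
def pair_similarity(a, b):
    """Assumed pairwise similarity: 2 |A n B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

def cluster_similarity(true_clusters, found_clusters):
    """Average, over the true clusters, of the best match among the found clusters."""
    k = len(true_clusters)
    return sum(max(pair_similarity(c, d) for d in found_clusters)
               for c in true_clusters) / k

# Three true populations of five series each
truth = [range(1, 6), range(6, 11), range(11, 16)]
# A two-cluster solution that merges the second and third populations
merged = [range(1, 6), range(6, 16)]

print(cluster_similarity(truth, truth))
print(cluster_similarity(truth, merged))
```

Note that the index is not symmetric in its arguments: the first argument plays the role of the reference partition.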
Introduction Introduction
Time series clustering by features A dissimilarity measure based on mutual dependency
Model based time series clustering The clustering procedure
Time series clustering by dependence Case-studies with real data
[Figure: five pairs of dendrograms whose leaves are the series 1 to 15]
Main conclusions
The results of the univariate methods are similar and do not change much across linkage methods.
Notice that here a Gavrilov index around 0.667 corresponds to approximately separating the first population from the third one in scenarios D.2, D.4 and D.5.
For scenarios D.3, D.4 and D.5, where there are some “strong” clusters, complete linkage with either multivariate measure improves on the univariate measures.
For all scenarios, single linkage with RD is preferable to the other alternatives considered.
[Figure: log(MR) series, 1950 to 2000]
Lee-Carter model
It is a well-known model which looks at the dependence between
mortality time series. It relates the mortality rates by age with a single
unobservable factor:
$\ln(MR_{x,t}) = a_x + b_x k_t + \varepsilon_{x,t}$,
$k_t = c + k_{t-1} + \eta_t$
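A small simulation helps to see how the parameters of this model can be recovered. The sketch below is an illustration under assumptions: it uses the common identification constraints $\sum_x b_x = 1$ and $\sum_t k_t = 0$, fits the rank-one term by alternating least squares instead of an explicit singular value decomposition, and all parameter values and names are made up for the example.

```python
import random

def fit_lee_carter(log_mr, n_iter=50):
    """Fit ln(MR[x][t]) = a[x] + b[x] * k[t] with sum(b) = 1 and sum(k) = 0."""
    n_age, n_year = len(log_mr), len(log_mr[0])
    a = [sum(row) / n_year for row in log_mr]       # age-specific means
    z = [[log_mr[x][t] - a[x] for t in range(n_year)] for x in range(n_age)]
    b = [1.0 / n_age] * n_age                       # starting value
    k = [0.0] * n_year
    for _ in range(n_iter):
        # Alternating least squares for the rank-one approximation of z
        bb = sum(v * v for v in b)
        k = [sum(b[x] * z[x][t] for x in range(n_age)) / bb for t in range(n_year)]
        kk = sum(v * v for v in k)
        b = [sum(z[x][t] * k[t] for t in range(n_year)) / kk for x in range(n_age)]
    s = sum(b)                                      # rescale so that sum(b) = 1
    return a, [v / s for v in b], [v * s for v in k]

# Simulated data: 10 ages, 50 years, random walk with drift c = -1 for k_t
rng = random.Random(7)
n_age, n_year, c = 10, 50, -1.0
a_true = [-8.0 + 0.5 * x for x in range(n_age)]
b_true = [(x + 1) / 55.0 for x in range(n_age)]     # sums to one
k_true, level = [], 0.0
for _ in range(n_year):
    level = c + level + rng.gauss(0.0, 0.5)
    k_true.append(level)
log_mr = [[a_true[x] + b_true[x] * k_true[t] + rng.gauss(0.0, 0.01)
           for t in range(n_year)] for x in range(n_age)]

a_hat, b_hat, k_hat = fit_lee_carter(log_mr)
c_hat = (k_hat[-1] - k_hat[0]) / (n_year - 1)       # mean of the differences of k_t
print(max(abs(u - v) for u, v in zip(b_hat, b_true)), c_hat)
```

Because the fitted $k_t$ is centred, the fitted $a_x$ absorbs $b_x$ times the mean of the true factor; the drift estimate is unaffected since it depends only on differences of $k_t$.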
[Figure: factor series, 1950 to 2000, with panels titled "First factor loadings" and "Second factor loadings" plotted against ages 0 to 100]
[Figure: comparison by age (0 to 100) of the one factor model, the two factors model and the factors + cluster model, with Cluster 1 and Cluster 2 marked; detail for ages 55 to 95]
We observe improvements at ages where the two-factor model is worse than the one-factor model.
[Figure: detail for ages 20 to 40]
But also at ages where the two-factor model is better than the one-factor model.
[Figure: two series, horizontal axis 0 to 900]
There are three clusters:
Sleeping hours
home.
[Figure: the 24 hours of the day placed on a circle]
[Figure: series with horizontal axis 0 to 25]