
Centro de Investigación en Matemáticas, A.C.

Detection of changes in time series: a frequency domain approach

Thesis

Submitted in fulfillment of the requirements for the degree of:

Doctor of Science with Orientation in Probability and Statistics

Presented by:

Carolina de Jesús Euán Campos

Advisor:

Dr. Joaquín Ortega Sánchez

Guanajuato, Guanajuato, México

August 16, 2016
Members of the Jury:

President: Dra. Graciela María de los Dolores González Farías, CIMAT
Secretary: Dr. Rolando José Biscay Lirio, CIMAT
Member: Dr. Hernando Ombao, University of California, Irvine
Member: Dr. Gabriel Rodríguez Yam, Universidad Autónoma de Chapingo
Member and thesis advisor: Dr. Joaquín Ortega Sánchez, CIMAT
Special reader: Dr. Pedro César Alvarez Esteban, Universidad de Valladolid

Advisor:

Dr. Joaquín Ortega Sánchez

Candidate:

M.Sc. Carolina de Jesús Euán Campos


To my parents, Bolivar and Maria Concepción.

To Israel M Hdz.
Acknowledgements
I would like to express my deepest gratitude to my advisor, Dr. Joaquín Ortega, for his patience, suggestions and guidance during the last four years.

I am grateful to my external examiner, Prof. Hernando Ombao, for his advice and suggestions on this project. I wish to thank Prof. Pedro C. Álvarez Esteban for several fruitful conversations during the last years and his comments on this thesis.

A special note of thanks is also extended to the members of the oral defense committee, Dra. Graciela González, Dr. Rolando Biscay and Dr. Gabriel Rodríguez, for the time they dedicated to commenting and making suggestions on the thesis.

I thank CIMAT for the facilities to carry out this research and CONACYT for the PhD scholarship granted.
Contents

1 Introduction . . . 1
  1.1 Previous Results . . . 2
    1.1.1 Spectral theory for a stationary process . . . 2
    1.1.2 Change point detection . . . 4
    1.1.3 Spectral theory for a locally stationary process . . . 6
    1.1.4 Time Series Clustering . . . 9

2 Total Variation Distance . . . 13
  2.1 TV distance and the Wasserstein distance . . . 14
  2.2 TV distance to compare spectra . . . 16
  2.3 Distribution of the TV distance between estimated spectra . . . 19
    2.3.1 Estimation of $d_{TV}$ . . . 19
    2.3.2 Asymptotic distribution of $\hat{d}_{TV}$ . . . 22
    2.3.3 Approximation of the distribution of $\hat{d}_{TV}$ . . . 31
    2.3.4 Bootstrapping . . . 32
  2.4 Simulation Study . . . 35
    2.4.1 Rate of convergence . . . 36
    2.4.2 Significance level and power of the test . . . 49
  2.5 Discussion . . . 54

3 Clustering Methods . . . 55
  3.1 TV distance in a clustering method . . . 57
  3.2 Hierarchical spectral merger (HSM) method . . . 59
  3.3 TV distance and other dissimilarity measures . . . 63
    3.3.1 Simulation of a process based on the spectral density . . . 63
    3.3.2 Comparative study . . . 65
  3.4 Detection of transitions between spectra . . . 73
    3.4.1 Simulation of transitions between two spectra . . . 74
    3.4.2 Detection of transitions . . . 75
  3.5 Unknown number of clusters . . . 80
  3.6 Discussion . . . 86

4 Applications to Data . . . 87
  4.1 Ocean wave analysis . . . 87
    4.1.1 Data description . . . 88
    4.1.2 Results using the TV distance as a similarity measure . . . 89
  4.2 Clustering of EEG data . . . 97
    4.2.1 Data description . . . 98
    4.2.2 Results using the HSM method . . . 100

Appendices . . . 109

A R Codes . . . 111
  A.1 Computing the TV distance . . . 111
  A.2 Methods . . . 113

B Effect of Sampling Frequency . . . 117
  B.1 Discrete Fourier Transform . . . 117
Chapter 1

Introduction

In time series analysis, the stationarity assumption is fundamental. However,


for real applications this assumption is not always satisfied, due to changes in
the process over time. A change point τ is a time point when the probability
distribution of a time series changes. A process $\{X_t\}$ can change in different ways, for example through a change in mean or a change in variance (see for instance panels (a) and (b) in Figure 1.1). These cases have been studied by many researchers, and different statistical tests have been developed, some of which are based on sample moments and computational algorithms.

Figure 1.1: Processes that change in (a) mean, (b) variance and (c) spectra.

Our interest lies in considering changes in spectra (see panel (c) in Figure
1.1). A change in spectra means a change in the waveforms of the signal, and it is very important in many applications. For example, during a storm sea waves become higher and slower, which has many implications for the construction of maritime structures. Another example is the study of brain signals recorded by electroencephalograms; in this case the activation of a brain region shows up as faster oscillations of the signal (i.e., more energy transferred in shorter periods of time).
shorter periods of time). In both cases, the understanding of where and how
these changes happen is relevant. A large part of the literature on spectral
analysis is based on the stationarity assumption for the process. However, in
some cases, we need to carry out spectral analysis for processes that are not
stationary. There are several points of view from which an analysis of this sort can be approached. The most frequently used is the detection of change
points in the process. The main assumption of this approach is that a process
{Xt } changes in k specific time points, τ1 , τ2 , . . . , τk , where both the number
and location of the change points are unknown and the process is assumed
to be stationary between them. Another approach is to model the process
{Xt } as a locally stationary process.
This project considers, as an alternative to the change point approach,
the use of clustering methods for time series to identify periods or segments
that have similar spectra. Time series clustering has captured the attention
of many researchers in the past few years.

1.1 Previous Results


1.1.1 Spectral theory for a stationary process
Let us consider the following definition of stationarity.

Definition 1.1. A time series $\{X(t), t \in \mathbb{R}\}$ is said to be stationary if, for all $t$, $s$ and $r$ in $\mathbb{R}$: (i) $E|X(t)|^2 < \infty$, (ii) $E(X(t)) = m$, and (iii) $\mathrm{cov}(X(t+s), X(t+r))$ does not depend on $t$.

We will denote by $\gamma(h) = \mathrm{Cov}(X(t), X(t+h))$ the covariance function of the stationary time series $\{X(t)\}$.

The basis of time series spectral analysis is formed by Herglotz's theorem and the Spectral Representation Theorem; the proofs can be found in different sources, such as Brockwell and Davis (2006) and Shumway and Stoffer (2011).
Theorem 1.1 (Herglotz's Theorem). A complex-valued function $\gamma(\cdot)$ defined on the integers is non-negative definite, i.e., $\sum_{i,j=1}^{n} a_i \gamma(i-j) a_j \ge 0$ for all positive integers $n$ and all vectors $a \in \mathbb{R}^n$, if and only if
$$\gamma(h) = \int_{-1/2}^{1/2} e^{i2\pi\omega h}\, dF(\omega), \qquad (1.1)$$
for all $h = 0, \pm 1, \ldots$, where $F(\cdot)$ is a right-continuous, non-decreasing, bounded function on $[-1/2, 1/2]$ and $F(-1/2) = 0$.

The function $F$ is called the spectral distribution function of $\gamma$, and if $F(\omega) = \int_{-1/2}^{\omega} f(\nu)\, d\nu$, then $f$ is called the spectral density of $\gamma$. In terms of the time series $X(t)$, if $\gamma(\cdot)$ is absolutely summable, the spectral density is the Fourier transform of the covariance function, i.e.,
$$f(\omega) = \sum_{h=-\infty}^{\infty} \gamma(h)\, e^{-i2\pi\omega h}, \qquad -\frac{1}{2} \le \omega \le \frac{1}{2}.$$

Theorem 1.2 (The Spectral Representation Theorem). If $\{X_t\}$ is a stationary sequence with mean zero and spectral distribution $F$, then there exists a right-continuous orthogonal-increment process $\{Z(\omega)\}$ such that

(i) $E|Z(\omega) - Z(-1/2)|^2 = F(\omega)$,

(ii) $X_t = \int_{-1/2}^{1/2} e^{i2\pi\omega t}\, dZ(\omega)$.

The spectral representation can be interpreted as follows: "any stationary time series can be looked at as a sum of infinitely many cosine and sine waveforms with random coefficients" [Shumway and Stoffer (2011)]. At lag $h = 0$ one gets $\gamma(0) = \mathrm{Var}(X_t) = \int_{-1/2}^{1/2} f(\omega)\, d\omega$. Thus, the time series variance is decomposed over the frequency domain, where the spectrum at frequency $\omega$ can be roughly interpreted as the variance contributed by the oscillation in a narrow frequency band around $\omega$.
The spectral analysis of a time series can be approached from a nonparametric or a parametric point of view. For example, in the case of the parametric ARMA(p,q) model, the spectral density has the closed form
$$f(\omega) = \sigma^2 \frac{|\theta(e^{i2\pi\omega})|^2}{|\phi(e^{i2\pi\omega})|^2},$$
where $\phi$ and $\theta$ are the $p$th and $q$th degree polynomials $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ and $\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q$.
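As an illustration, the closed form above can be evaluated numerically. The following R sketch (illustrative, not part of the thesis code of Appendix A) computes the ARMA spectral density on a frequency grid; the AR(2) coefficients used as an example are those of the simulation study in Chapter 2.

    # Sketch: evaluate f(w) = sigma^2 |theta(e^{i2pi w})|^2 / |phi(e^{i2pi w})|^2.
    arma_spec <- function(omega, phi = numeric(0), theta = numeric(0), sigma2 = 1) {
      z <- exp(1i * 2 * pi * omega)
      phi_z <- rep(1 + 0i, length(omega))    # phi(z) = 1 - phi_1 z - ... - phi_p z^p
      for (j in seq_along(phi)) phi_z <- phi_z - phi[j] * z^j
      theta_z <- rep(1 + 0i, length(omega))  # theta(z) = 1 + theta_1 z + ... + theta_q z^q
      for (j in seq_along(theta)) theta_z <- theta_z + theta[j] * z^j
      sigma2 * Mod(theta_z)^2 / Mod(phi_z)^2
    }
    omega <- seq(0, 0.5, length.out = 512)
    f_ar2 <- arma_spec(omega, phi = c(-0.5, -0.6))  # AR(2) with (phi1, phi2) = (-0.5, -0.6)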
Based on the stationarity assumption, a natural nonparametric estimator for the spectral density is the periodogram which, for a stationary and centered time series $X_t$, $t = 1, \ldots, T$, is defined at the fundamental Fourier frequencies $\omega_k = k/T$, $k = 1, \ldots, n$ with $n = \lfloor (T-1)/2 \rfloor$, as
$$I(\omega_k) = T^{-1} \left| \sum_{t=1}^{T} x_t\, e^{-it2\pi\omega_k} \right|^2 = \sum_{|h|<T} \hat{\gamma}(h)\, e^{-ih2\pi\omega_k},$$
where $\hat{\gamma}$ is the sample autocovariance function.

Figure 1.2: Estimation of the spectral density; the true density is the dashed red curve. (a) Parzen window used in the lag window estimator. (b) Estimator using the periodogram. (c) Estimator using the smoothed lag window.

This estimator is asymptotically unbiased, but its variance does not go to 0 as $T$ increases. For this reason it is common to smooth the periodogram; we will work with the smoothed lag window estimator,
$$\hat{f}(\omega) = \sum_{|h|<a} \beta(h/a)\, \hat{\gamma}(h)\, e^{-i2\pi\omega h}, \qquad (1.2)$$

where $\beta(x)$ is an even, piecewise continuous function of $x$ satisfying the conditions: 1) $\beta(0) = 1$, 2) $|\beta(x)| \le 1$ for all $x$, and 3) $\beta(x) = 0$ for $|x| > 1$. Figure 1.2 shows an example of these estimators. The spectral density is symmetric, i.e., $f(\omega) = f(-\omega)$; hence we use the one-sided spectrum, defined as $f^*(\omega) = 2f(\omega)$ with $0 < \omega \le 1/2$.
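As a concrete illustration of these estimators, the following R sketch computes the raw and a smoothed periodogram for a simulated AR(2) series; spec.pgram and arima.sim are standard stats functions, and the Daniell smoothing spans chosen here are only illustrative (the rest of the thesis uses the lag window estimator (1.2) with a Parzen window).

    set.seed(1)
    x   <- arima.sim(model = list(ar = c(-0.5, -0.6)), n = 1000)
    raw <- spec.pgram(x, taper = 0, detrend = FALSE, plot = FALSE)    # periodogram
    smo <- spec.pgram(x, spans = c(11, 11), taper = 0, plot = FALSE)  # smoothed version
    # raw$spec fluctuates around the true spectrum and its variance does not
    # shrink as T grows; smo$spec averages neighboring frequencies and is consistent.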

1.1.2 Change point detection


An approach to studying the changes in a process is to consider that the changes occur at specific time points and that between two change points the process is stationary, i.e., to treat the process as piecewise stationary.

Let $\{X_t\}$ be a process that changes at $K$ time points, $\tau_1, \tau_2, \ldots, \tau_K$, where the number of change points and their locations are unknown. Let $f_k(\omega)$ be the spectrum of the $k$th segment, i.e., the spectrum of $\{X_t\}$ for $\tau_k \le t < \tau_{k+1}$. Davis et al. (2006) developed a methodology called Auto-PARM, whose main idea is to fit an AR model to each piece. The minimum description length principle is applied to find the "best" combination of the number of segments, the lengths of the segments, and the orders of the piecewise AR processes. The estimates are strongly consistent; however, approximating a nonstationary time series by piecewise parametric AR models may not be reasonable in some applications.
Another parametric framework is the Detection of Changes by Penalized Contrasts (DCPC), developed by Lavielle (1999) and collaborators. They considered a sequence of real random variables $\{X_t\}_{t=1,\ldots,n}$ and assumed that the distribution of the process depends on a parameter $\theta$ that changes abruptly at some unknown instants $\{\tau_i, 1 \le i \le K\}$, where $K$ is also unknown. To estimate both $K$ and the change points $\{\tau_i, 1 \le i \le K\}$, they used a penalized contrast function of the form $J(t, y) + \beta\, \mathrm{pen}(t)$, where the contrast function is defined as
$$J(t, y) = \sum_{k=1}^{K} C(X_{\tau_{k-1}}, \ldots, X_{\tau_k}),$$
$\mathrm{pen}(t)$ is a penalization term, and $\beta$ is a tuning parameter. The contrast is the sum of local contrast functions over the segments $[\tau_{k-1}, \tau_k]$. The contrast function on each segment depends on the estimator $\hat{\theta}(X_{t_{k-1}}, \ldots, X_{t_k})$ calculated on the $k$th segment of $t$, which is defined as the solution of a minimization problem for a function $U$. A particular case of this methodology has been studied in Lavielle and Ludeña (2000). They considered a parametric spectral density $f(\omega; \theta)$ and chose the function $U$ as the Whittle log likelihood between the periodogram and the parametric model, so that $\hat{\theta}_k$ is the Whittle estimator in the $k$th segment. Finally, the contrast function $J$ is taken as a weighted average of the Whittle log likelihoods of the segments. Simulation studies show that the DCPC method performs well when $K$ is known. When $K$ is unknown, it is not clear how to determine the penalization term, and this choice is made subjectively in most cases.
The methodologies mentioned above treat change point detection as a minimization problem. Another approach to the change point problem is as a statistical hypothesis test, as follows: we would like to test, for a given segment $j$, the hypothesis
$$H_0: f_j(\omega) = f_{j+1}(\omega)\ \forall \omega \quad \text{vs} \quad H_A: \exists\, \omega_0 \text{ such that } f_j(\omega_0) \neq f_{j+1}(\omega_0).$$

Dette and Paparoditis (2009) and Dette and Hildebrandt (2012) studied this hypothesis test. The test statistic they propose is a functional of the Euclidean norm of the difference between the spectra, integrated over $[-\pi, \pi)$. To establish a rejection region they propose two options: 1) it can be proved that the test statistic is asymptotically normally distributed, or 2) a bootstrap procedure based on a Wishart distribution can be used. The power of the test is good in the examples shown in the paper. The asymptotic distribution depends on the smoothing of the periodogram.
Two other possible statistics are studied in Jentsch and Pauly (2012), with special interest in the case of time series of unequal lengths ($n_1 < n_2$). The statistics are based on the periodogram,
$$T_n^{(1)} = \frac{1}{n_2} \sum_{j=1}^{n_2} c_j \left( I_{n_1,X}(\omega_j) - I_{n_2,Y}(\omega_j) \right)^2$$
and
$$T_n^{(2)} = \frac{1}{n_2} \sum_{j=1}^{n_2} c_j \log^2\left( \frac{I_{n_1,X}(\omega_j)}{I_{n_2,Y}(\omega_j)} \right) \mathbf{1}_{\{I_{n_1,X}(\omega_j)\, I_{n_2,Y}(\omega_j) \neq 0\}}.$$

In both cases an asymptotic distribution is obtained. The first statistic converges to a random variable which is a linear combination of a sequence of independent double-exponentially distributed random variables. The second statistic converges to a random variable which is a linear combination of a sequence of independent standard logistic random variables. The asymptotic distribution of $T_n^{(1)}$ depends on unknown values of the spectra, which need to be estimated; hence an asymptotically exact test cannot be applied. The asymptotic distribution of $T_n^{(2)}$ has the advantage of being distribution-free under $H_0$, but the power of the test is low. As a promising way to improve the performance of these statistics, the authors propose, as future work, the use of integrated or smoothed periodograms as estimators of the spectrum.
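For concreteness, a minimal R sketch of the two statistics, restricted to equal sample sizes and constant weights $c_j = 1$ (the general unequal-length weighting of the paper is omitted), is:

    Tn_stats <- function(x, y) {
      Ix <- spec.pgram(x, taper = 0, detrend = FALSE, plot = FALSE)$spec
      Iy <- spec.pgram(y, taper = 0, detrend = FALSE, plot = FALSE)$spec
      n  <- length(Ix)
      ok <- Ix > 0 & Iy > 0                    # indicator appearing in T_n^(2)
      c(T1 = sum((Ix - Iy)^2) / n,
        T2 = sum(log(Ix[ok] / Iy[ok])^2) / n)
    }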

1.1.3 Spectral theory for a locally stationary process


The spectral analysis of locally stationary processes was developed by Dahlhaus (1997).

Definition 1.2. A sequence $X_{t,T}$ $(t = 1, \ldots, T)$ is called locally stationary with transfer function $A^0$ and trend $\mu$ if there exists a representation
$$X_{t,T} = \mu\left(\frac{t}{T}\right) + \int_{-\pi}^{\pi} \exp(i\lambda t)\, A^0_{t,T}(\lambda)\, d\xi(\lambda), \qquad (1.3)$$
where the following holds.

(i) $\xi(\lambda)$ is a stochastic process on $[-\pi, \pi]$ with $\overline{\xi(\lambda)} = \xi(-\lambda)$ and
$$\mathrm{cum}\{d\xi(\lambda_1), \ldots, d\xi(\lambda_k)\} = \eta\left(\sum_{j=1}^{k} \lambda_j\right) g_k(\lambda_1, \ldots, \lambda_k)\, d\lambda_1 \cdots d\lambda_k,$$
where $\mathrm{cum}\{\cdots\}$ denotes the cumulant of $k$th order, $g_1 = 0$, $g_2 = 1$, $|g_k(\lambda_1, \ldots, \lambda_k)| \le M_k$ (constant) for all $k$, and $\eta(\lambda) = \sum_{j=-\infty}^{\infty} \delta(\lambda + 2\pi j)$ is the $2\pi$-periodic extension of the Dirac delta function.

(ii) There exists a constant $K$ and a $2\pi$-periodic function $A: [0,1] \times \mathbb{R} \to \mathbb{C}$ with $A(u, -\lambda) = \overline{A(u, \lambda)}$ and
$$\sup_{t,\lambda} \left| A^0_{t,T}(\lambda) - A\left(\frac{t}{T}, \lambda\right) \right| \le K T^{-1}$$
for all $T$; $A(u, \lambda)$ and $\mu(u)$ are assumed to be continuous in $u$.

From the definition and other results (see Dahlhaus, 2011) it can be shown that for $u_0 \in [0,1]$ there exists a stationary process $\tilde{X}_t(u_0)$ such that
$$|X_{t,T} - \tilde{X}_t(u_0)| = O_p\left( \left| \frac{t}{T} - u_0 \right| + \frac{1}{T} \right),$$
which justifies the name "locally stationary process". $X_{t,T}$ has a unique time-varying spectral density which is, locally, the same as the spectral density of $\tilde{X}_t(u)$. Furthermore, it has, locally, the same auto-covariance, since $\mathrm{cov}(X_{[uT],T}, X_{[uT]+k,T}) = c(u,k) + O(T^{-1})$ uniformly in $u$ and $k$, where $c(u,k)$ is the covariance function of $\tilde{X}_t(u)$. This justifies taking $c(u,k)$ as the local covariance function of $X_{t,T}$ at time $u = t/T$, and it suggests estimating local spectra with rolling windows.
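A simple illustration of the rolling-window idea (a sketch only, not the estimators studied by Dahlhaus) is the following R function, which computes local smoothed periodograms over windows of length n:

    rolling_spectra <- function(x, n = 256, step = 64) {
      starts <- seq(1, length(x) - n + 1, by = step)
      lapply(starts, function(s)
        spec.pgram(x[s:(s + n - 1)], spans = 9, taper = 0, plot = FALSE))
    }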
A more formal estimation method for the time-varying spectral density was proposed by Dahlhaus (2000), using an approximation to the Gaussian likelihood. The proposed quasi-likelihood is a generalization of the Whittle likelihood for a stationary process. The generalization is obtained using the localized periodogram or preperiodogram, which uses only the pairs $X_{[t+0.5-k/2]}, X_{[t+0.5+k/2]}$ to estimate the covariance of lag $k$ at time $t$. He studied parametric time-varying models and proved asymptotic properties for the resulting estimator.
Another methodology to estimate the time-varying spectrum was developed by Ombao et al. (2005). In this case, they work within the SLEX framework (smooth localized complex exponentials, a collection of orthogonal bases). They built a family of multivariate models that characterize the time-varying spectra; to select the best model in this family, a penalized log energy is used. The resulting method is flexible, computationally efficient, and easy to interpret.

The study of locally stationary processes has been a topic of research for many years. One of the challenges in this area is building hypothesis tests for this family of processes. Sergides and Paparoditis (2008) established a general framework for hypothesis testing in the multivariate case. The main idea is to compare the spectra of the processes to answer a specific scientific question, for example: are the spectra equal? are the time series uncorrelated? They assume a parametric model and use the $L^2$ norm for the test statistic. A bootstrap method is used to establish the rejection criterion. The hypotheses in their examples are simple; a more complicated hypothesis could be harder to study within this framework.
Another important scientific question is whether a parametric model is a good option. Preuss et al. (2013) investigated the problem of testing semiparametric hypotheses in locally stationary processes. The best parametric estimator under the null hypothesis is chosen by minimizing a local version of the Kullback-Leibler divergence,
$$L(u, \theta) = \frac{1}{4\pi} \int_{-\pi}^{\pi} \left( \log g(\theta, \lambda) + \frac{I_N(u, \lambda)}{g(\theta, \lambda)} \right) d\lambda,$$
between the parametric family and the periodogram. To establish the rejection criterion, an asymptotic distribution of the test statistic and a bootstrap method are used. In general, the bootstrap method shows better results than the asymptotic distribution.

Change point detection and the locally stationary approach are not completely separate. Last and Shumway (2008) studied the problem of detecting abrupt changes in a piecewise locally stationary time series. The goal was to segment a nonstationary time series into locally stationary segments. The proposed method consists of comparing, with the symmetric Kullback-Leibler divergence, the right and left spectra at time $t$: given a time $t$, compute the estimated spectral density $\hat{f}_L(t, \lambda)$ from $x_{t-n+1}, \ldots, x_t$ and $\hat{f}_R(t, \lambda)$ from $x_{t+1}, \ldots, x_{t+n}$, then obtain
$$D_1(t) = \frac{1}{n} \sum_{\lambda} \left( \frac{\hat{f}_L(t,\lambda)}{\hat{f}_R(t,\lambda)} - \frac{\hat{f}_R(t,\lambda)}{\hat{f}_L(t,\lambda)} \right).$$
Next, find the maximum of $D_1(t)$, attained at $t_i$ say, and take
$$D_2(t) = \begin{cases} D_1(t) & \text{if } t \in A^c, \\ 0 & \text{if } t \in A, \end{cases}$$
where $A = [t_i - n, t_i - 1] \cup [t_i + 1, t_i + n]$. Repeat until the values of $D$ are sufficiently "small". To decide this critical value, the asymptotic distribution or bootstrap methods are used.
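A hedged R sketch of the statistic $D_1(t)$, using smoothed periodograms as one possible choice of local spectral estimator (not necessarily the authors' exact estimator), is:

    D1 <- function(x, t, n) {
      fL <- spec.pgram(x[(t - n + 1):t],   spans = 9, taper = 0, plot = FALSE)$spec
      fR <- spec.pgram(x[(t + 1):(t + n)], spans = 9, taper = 0, plot = FALSE)$spec
      mean(fL / fR - fR / fL)  # average of the symmetric ratio over frequencies
    }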
Depending on the specific problem, one prefers one approach over the other. For example, in the study of ocean waves the changes are slow, so methods designed to detect abrupt changes usually give poor results. Also, there can be transition periods between stationary states, which can be modelled within the locally stationary framework.

1.1.4 Time Series Clustering


In general, clustering is a procedure whereby a set of unlabeled data is divided into groups so that members of the same group are similar, while members of distinct groups differ as much as possible. The problem of clustering when the data points are time series has received a lot of attention in recent years. Liao (2005) gives a review of the field up to 2005; a more recent review of time series clustering can be found in Caiado et al. (2015).

There exists a large variety of applications in different fields. Lachiche et al. (2005) developed a method for fMRI data in which the area between the variations of the signals around their means was used as a similarity measure, together with a Growing Neural Gas (GNG) clustering algorithm. Other examples are: the identification of similar physicochemical properties of amino acid sequences (Savvides et al., 2008), the detection of groups of stocks sharing synchronous time evolutions with a view towards portfolio optimization (Basalto and De Carlo, 2006), the identification of geographically homogeneous regions based on similarities in the temporal dynamics of weather patterns (Bengtsson and Cavanaugh, 2008), and the grouping of similar river flow time series for regional classification (Corduas, 2011), to name a few.
According to Liao (2005), there are three approaches to time series clustering: methods based on the comparison of raw data; feature-based methods, where the similarity between time series is gauged through features extracted from the data; and methods based on parameters of models adjusted to the data. We are interested in the second group, using the spectral density of the corresponding time series as the principal feature: each time series is characterized by its spectral density, and the groups produced consist of signals with similar spectral densities. The problem of detecting changes is thus transformed into an equivalent problem: finding similarities in the behavior of time series.

We will use the clustering method to identify periods with similar spectral density and hence consider them as a single longer stationary period. However, the proposal is more general and could also be used to identify similarities in space instead of time.

The rest of the thesis is organized as follows. In Chapter 2 the concept of the total variation (TV) distance is introduced. The TV distance is used to quantify the similarity between spectral densities. Previous approaches have in common the use of a similarity measure between spectra; the most frequently used are the $L^2$ norm and the Kullback-Leibler divergence. However, it will be shown that these may not be adequate measures to detect small changes. In Chapter 2 the properties of the TV distance between estimated spectra and its capacity to detect small changes are established.

In Chapter 3 the proposed methodologies, which belong to the time series clustering approach, are given. Two different proposals are considered: the TV distance in a hierarchical clustering method, and the Hierarchical Spectral Merger (HSM) method. In both cases the feature of interest for clustering is the spectral density, and the TV distance is used as the measure of similarity. However, there are important differences in the clustering procedures that make the two methods distinct. Simulation studies show that the first proposal is more convenient for detecting slow changes, while the HSM method has the advantage of an accurate procedure to choose the number of clusters, based on the distribution of the TV distance.

Finally, in Chapter 4 we present applications to two different case studies. The first is related to the study of ocean waves, where the main goal is the identification of stationary periods; we use the first proposal in this case. The second belongs to the neuroscience framework, in particular the analysis of electroencephalogram (EEG) data, where the main goal is the detection of activation or anomalies in any channel during the resting state of a motor skill experiment. The HSM method is used in this example.
Chapter 2

Total Variation Distance

To detect changes in spectra, we need a quantity that gauges the extent of similarity between two spectral densities. Our proposal is to use the total variation (TV) distance as a similarity measure to compare spectral densities. The TV distance is defined in general for any two probability measures; we shall adapt it to the case of spectral densities. A statistical application of the TV distance was proposed by Alvarez-Esteban et al. (2012), where a hypothesis test to compare two probability densities is developed.
Definition 2.1. Let $(X, \mathcal{A})$ be any measurable space. The total variation distance $d_{TV}$ between two probability measures $P$ and $Q$ on $X$ is defined as
$$d_{TV}(P, Q) = \sup_{A \in \mathcal{A}} |P(A) - Q(A)|.$$

In our case, we will use the total variation distance on the real line, so $X = \mathbb{R}$ and $\mathcal{A} = \mathcal{B}(\mathbb{R})$, where $\mathcal{B}(\mathbb{R})$ is the class of Borel sets on the real line.

An important property of the TV distance is that it is bounded between 0 and 1; this can easily be deduced from the definition. A value of 1 is attained if $P$ and $Q$ have disjoint supports. This property is very useful for interpreting distances: values close to 1 mean that the two measures are quite different, while values close to 0 mean that they are very similar, almost equal. In terms of spectral densities, if the TV distance is equal to 1 then the spectral contents of the two signals are completely different, i.e., they do not share a common frequency band.
If $P$ and $Q$ have density functions $f$ and $g$ (typically with respect to the Lebesgue measure $\mu$), the TV distance between them can be computed using the following expression:
$$d_{TV}(P, Q) = 1 - \int \min(f, g)\, d\mu.$$

Figure 2.1: The TV distance measures the similarity between the two densities. The blue shaded area, which is equal to the pink one, is the value of the TV distance.

This equation helps to interpret the TV distance graphically. If two densities $f$ and $g$ have TV distance equal to $1 - \alpha$, this means that they share a common area of size $\alpha$. Figure 2.1 illustrates the case of two density functions: the area of the pink (or blue) region represents the TV distance. Both colored regions represent the non-common part of the density functions, while the white area under the curves is the common part.
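The computation suggested by this expression is direct once both densities are evaluated on a common grid. The following R sketch (an illustration; the thesis implementation is in Appendix A) uses a simple Riemann sum:

    tv_dist <- function(f, g, dx) {
      f <- f / sum(f * dx)       # normalize both densities to integrate to 1
      g <- g / sum(g * dx)
      1 - sum(pmin(f, g)) * dx   # d_TV = 1 - integral of min(f, g)
    }
    x <- seq(-6, 6, by = 0.01)
    tv_dist(dnorm(x, 0, 1), dnorm(x, 1, 1), dx = 0.01)  # about 0.38 for N(0,1) vs N(1,1)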

2.1 TV distance and the Wasserstein distance


The total variation distance has an important interpretation in the framework of contamination models (Alvarez-Esteban et al., 2012) that can be extended to the spectral analysis of time series. To interpret this notion of contamination in the case of spectral densities, we first define the general concept of a contamination model.

A contamination model for a probability $P_0$ consists of assuming that $P_0$ cannot be observed because of the presence of noise, or contamination, of level $\varepsilon \in (0,1)$, which follows a probability law $N$, so that we observe
$$P = (1 - \varepsilon) P_0 + \varepsilon N.$$

If one considers only two measures, the similarity between them is defined as follows.

Definition 2.2. Two probability measures $P$ and $Q$ on the same sample space are $\alpha$-similar if there exist probability measures $\lambda$, $P'$, and $Q'$ such that
$$P = (1 - \varepsilon_1)\lambda + \varepsilon_1 P', \qquad Q = (1 - \varepsilon_2)\lambda + \varepsilon_2 Q', \qquad (2.1)$$
with $0 \le \varepsilon_i \le \alpha$, $i = 1, 2$.

Smaller values of $\alpha$ correspond to more similar probability measures. Before we connect this concept with the total variation distance, we consider the case of measures on the real line and the definition of the Wasserstein distance.
Definition 2.3. Given $\alpha \in (0,1)$, we define the set of $\alpha$-trimmed versions of $P$ by
$$\mathcal{R}_\alpha(P) := \left\{ Q \in \mathcal{P} : Q \ll P,\ \frac{dQ}{dP} \le \frac{1}{1-\alpha} \right\},$$
where $\mathcal{P}$ denotes the set of Borel probability measures on $\mathbb{R}$. Equivalently,
$$\mathcal{R}_\alpha(P) := \left\{ Q \in \mathcal{P} : Q \ll P,\ Q(A) \le \frac{1}{1-\alpha} P(A) \text{ for all } A \in \mathcal{B}(\mathbb{R}) \right\}.$$
Definition 2.4. The Wasserstein distance between two probability measures $P, Q \in \mathcal{P}$ is defined as
$$W_2^2(P, Q) = \inf\left\{ \int \|x - y\|^2\, \mu(dx, dy),\ \mu \in M(P, Q) \right\}, \qquad (2.2)$$
where $M(P, Q)$ is the set of probability measures on $\mathbb{R} \times \mathbb{R}$ with marginals $P$ and $Q$.
The Wasserstein distance is related to the transportation problem: it is the minimum cost of "transporting" the mass of $P$ to $Q$. If the measures are the same, then the transportation cost is equal to zero. On the real line, it can be computed as
$$W_2^2(P, Q) = \int_0^1 |F_P^{-1}(u) - F_Q^{-1}(u)|^2\, du,$$
with $F_P^{-1}$ and $F_Q^{-1}$ the quantile functions of $P$ and $Q$.


Now, if we have two contaminated probability measures $P$ and $Q$, we could trim a portion $\alpha$ from $P$ and $Q$ to make them more similar. If we continue trimming, we would like to know the "best" level of trimming, that is, the level that makes $P$ and $Q$ equal. From Proposition 2 in Alvarez-Esteban et al. (2012), it follows that $W_2^2(\mathcal{R}_\alpha(P), \mathcal{R}_\alpha(Q)) > 0$ if and only if $d_{TV}(P, Q) > \alpha$, with $P$ and $Q$ as in (2.1). So the total variation distance is the minimal level of trimming required to make $P$ and $Q$ equal, in the sense that $W_2^2(\mathcal{R}_{\alpha}(P), \mathcal{R}_{\alpha}(Q)) = 0$.
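For reference, on the real line the Wasserstein distance can be approximated directly from the quantile representation above; a minimal R sketch for two samples is:

    w2_dist <- function(x, y, m = 1000) {
      u <- (1:m - 0.5) / m       # grid on (0, 1)
      sqrt(mean((quantile(x, u) - quantile(y, u))^2))
    }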

In the case of spectral densities, the estimation of the spectrum is noisy. In terms of a model, the signals $X_1(t)$ and $X_2(t)$ are observed with uncorrelated noises $N_1(t)$ and $N_2(t)$, i.e., we observe
$$X_i^O(t) = X_i(t) + N_i(t), \qquad i = 1, 2,$$
where the unobserved process (the true signal) is contaminated by the noise $N_i(t)$. This carries over to the spectrum in the following sense:
$$f_i^O(\omega) = f^{X_i}(\omega) + f^{N_i}(\omega), \qquad i = 1, 2.$$
This causes the estimates to differ, even in the case of two processes with the same spectral density, $f^{X_1}(\omega) = f^{X_2}(\omega)$. So the total variation distance quantifies the level of similarity between two spectra, in the sense of (2.1).

2.2 TV distance to compare spectra


Our principal goal is to use the TV distance as a similarity measure between spectral densities. Since a spectral density does not necessarily integrate to 1, we must normalize the densities before computing the TV distance. This means that the TV distance will be able to identify changes in the distribution of the energy, but not necessarily changes in the total energy. However, the detection of changes based on the TV distance can be done together with, or after, another method that detects changes in the total energy, if required.

The TV distance between spectra is a metric on the following space. Let $\mathcal{M}$ be the set of equivalence classes defined on the space of spectral density functions as follows: $f_1$ and $f_2$ are in the same equivalence class ($f_1 \sim f_2$) if there exists a constant $c > 0$ such that $f_1(\omega) = c f_2(\omega)$ for almost all $\omega$. The non-negativity, symmetry and subadditivity properties are obtained by restricting the TV distance to this space. The identity of indiscernibles is satisfied on $\mathcal{M}$, since
$$d_{TV}(f, g) = 0 \iff \exists\, c > 0 \text{ such that } f(\omega) = c g(\omega)\ \forall \omega \iff f, g \in [f].$$

There exist many important probability metrics used by statisticians and probabilists; Gibbs and Su (2002) present a summary of inequalities between them that can be useful in practice. One of these metrics, also bounded between 0 and 1 like the TV distance, is the Hellinger distance ($d_H$), defined as
$$d_H(f, g) = \left( \int (\sqrt{f} - \sqrt{g})^2 \right)^{1/2}.$$
This metric is related to the TV distance through the inequalities
$$\frac{d_H^2}{2} \le d_{TV} \le d_H,$$
so it could produce similar results when one compares two density functions. This distance is not considered in the rest of this thesis; however, it remains an option to explore. A possible disadvantage of the Hellinger distance is the lack of interpretation in terms of the spectral densities: in the frequency domain approach, the spectral density and the log spectral density have a physical interpretation, while the interpretation of the square root of the density is not clear.
Chapter 1 described some of the distances used to compare spectral densities. Two of the most frequently employed are the $L^2$ norm and the Kullback-Leibler (KL) divergence. The KL divergence is not symmetric, but there exists a symmetric version (SKL). Recall that, if $f$ and $g$ are two density functions,
$$d_{L^2}(f, g) = \left( \int (f - g)^2 \right)^{1/2}, \qquad d_{KL}(f, g) = \int f \log\left( \frac{f}{g} \right),$$
and $d_{SKL}(f, g) = (d_{KL}(f, g) + d_{KL}(g, f))/2$.

Figure 2.2: (a) Spectra with different peak frequencies but close supports. (b) Spectra with disjoint supports. (c) Spectra with close peak frequencies and similar supports but different dispersions.

We illustrate with an example some advantages of the TV distance over $L^2$ and SKL. Consider two unimodal spectra, as presented in Figure 2.2. We look at three different cases. Case 1: the first spectrum peaks at 10 Hz (black continuous curve) and the second peaks at 15 Hz. Case 2: the first spectrum peaks at 5 Hz, the second peaks at 30 Hz, and the supports are disjoint. Case 3: the first spectrum peaks at 15 Hz, the second peaks at 16 Hz, and they have different dispersions.

For each case, we compute the TV distance, the $L^2$ distance, and the SKL divergence; Table 2.1 shows these values. When the spectra are different, as in Case 1, we would expect all distances to be "big", and indeed all the considered distance values are big enough to distinguish between the spectra. In Case 2 the spectra are completely different, since they have disjoint supports. This example shows one of the disadvantages of the SKL divergence: it cannot be computed in this case. The TV distance presents no problem, and its value is equal to one, which indicates that the spectra are completely different. On the other hand, the $L^2$ distance has a value comparable to that of Case 1, even though in this case the densities have disjoint supports. In Case 3 the spectra are different, but it could be difficult to conclude that from the $L^2$ and SKL values; the difference is clearer using the TV distance. A more exhaustive simulation exercise comparing these distances is performed in Chapter 3.
Distance   Case 1   Case 2   Case 3
TV         0.686    1        0.232
$L^2$      0.402    0.5      0.146
SKL        1.413    NaN      0.141

Table 2.1: Values of the TV distance, $L^2$ norm and SKL divergence between the spectra plotted in Figure 2.2.
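The style of the comparison in Table 2.1 can be reproduced with the following R sketch. The Gaussian-shaped "spectra" below are illustrative stand-ins, since the exact spectral shapes behind the table are not reproduced here, so the numbers will differ from those of the table.

    w <- seq(0.01, 50, by = 0.01); dw <- 0.01
    f <- dnorm(w, mean = 10, sd = 2); f <- f / sum(f * dw)
    g <- dnorm(w, mean = 15, sd = 2); g <- g / sum(g * dw)
    c(TV  = 1 - sum(pmin(f, g)) * dw,
      L2  = sqrt(sum((f - g)^2) * dw),
      SKL = 0.5 * sum(f * log(f / g) + g * log(g / f)) * dw)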

The TV distance has a finite range; however, we need to establish a statistical notion of "big". This is important because, even in the case of two samples with the same spectral representation, the estimated spectra have a TV distance not equal to zero. We would like to choose a threshold for the TV distance between estimated spectra to decide whether the samples were generated from the same spectral density or not, so that the probability of a type I error is controlled at some level $\alpha$. In addition, the procedure to choose this threshold must have enough power to detect when the true spectra are different. The next sections deal with the distribution of the TV distance between estimated spectral densities.

2.3 Distribution of the TV distance between estimated spectra

Let $X_1(t)$ and $X_2(t)$ be two time series with spectra $f^{X_1}$, $f^{X_2}$, and normalized spectra defined as
$$f_N^{X_i} = \frac{f^{X_i}}{\int_{-1/2}^{1/2} f^{X_i}(\omega)\, d\omega}, \qquad i = 1, 2.$$
Then the TV distance between the normalized spectra is
$$d_{TV}(f_N^{X_1}, f_N^{X_2}) = 1 - \int_{-1/2}^{1/2} \min\left(f_N^{X_1}(\omega), f_N^{X_2}(\omega)\right) d\omega = \frac{1}{2} \int_{-1/2}^{1/2} \left|f_N^{X_1}(\omega) - f_N^{X_2}(\omega)\right| d\omega. \qquad (2.3)$$

2.3.1 Estimation of $d_{TV}$

At this point $d_{TV}$, defined in (2.3), is not a random quantity, because it is based on the true (though unknown) spectral densities. The next step is to consider a numerical approximation of $d_{TV}$.
Window                     Asymptotic Variance
Rectangular or Truncated   $\frac{2a}{T} f^2(\omega)$
Bartlett or Triangular     $\frac{2a}{3T} f^2(\omega)$
Daniell                    $\frac{a}{T} f^2(\omega)$
Blackman-Tukey             $\frac{2a}{T}(1 - 4b + 6b^2) f^2(\omega)$
Parzen                     $\frac{151}{280}\frac{a}{T} f^2(\omega)$

Table 2.2: Asymptotic variance for different windows. $T$ is the length of the time series and $a$ is the bandwidth.

Uncertainty is then added when $f^{X_1}$ and $f^{X_2}$ are unknown, since we have to use estimates of them. We apply the trapezoid method to get a numerical approximation of equation (2.3). The trapezoid rule, which approximates the area under a curve using trapezoids instead of rectangles, is given by the following formula:
$$\int_b^c d(x)\, dx \approx \frac{c-b}{n} \left( \frac{d(b) + d(c)}{2} + \sum_{k=1}^{n-1} d\left( b + \frac{k(c-b)}{n} \right) \right), \qquad (2.4)$$
where $n$ is the number of elements in the partition of the interval $[b, c]$. We could choose another numerical approximation, and the procedure to obtain the asymptotic distribution would be similar.

For real data sets, $f^{X_1}$ and $f^{X_2}$ are not known and have to be estimated. As mentioned before, the raw periodogram is not mean-square consistent, because its variance does not decrease even when the length of the time series increases. So, we choose the lag window estimator (smoothed periodogram) defined in (1.2), with a Parzen window of width $a$. A lag window estimator can be rewritten as a spectral average estimator, i.e., the properties of the lag window estimators are similar to those of spectral averages.

The Parzen window is defined as
$$\beta(x) = \begin{cases} 1 - 6|x|^2 + 6|x|^3, & \text{if } |x| < \frac{1}{2}, \\ 2(1-|x|)^3, & \text{if } \frac{1}{2} \le |x| \le 1, \\ 0, & \text{otherwise.} \end{cases}$$
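In R, the Parzen window can be written directly from this definition (a sketch; the thesis's own implementation is inside spec.parzen, Appendix A):

    parzen <- function(x) {
      ax <- abs(x)
      ifelse(ax < 0.5, 1 - 6 * ax^2 + 6 * ax^3,
             ifelse(ax <= 1, 2 * (1 - ax)^3, 0))
    }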

A possible criterion for choosing a window is to compare the asymptotic variance of the resulting lag window estimator. Table 2.2 shows the asymptotic variance for different windows; the details can be found in Brockwell and Davis (2006). The Parzen window has a smaller variance than the Rectangular, Bartlett, and Daniell windows. However, the Blackman-Tukey window with parameter $b$ could have a smaller variance (depending on $b$) than the Parzen window. The Parzen window is included in many computational packages; in particular, it is implemented in the function spec.parzen of the HSMClust toolbox in R that we developed (see Appendix A).
Finally, to normalize the estimated spectra, we use $\int_{-1/2}^{1/2} \hat{f}(\omega)\, d\omega = \hat{\gamma}(0)$. Our final estimator of the spectrum is
$$\hat{f}_N^{X_i}(\omega) = \frac{\hat{f}^{X_i}(\omega)}{\hat{\gamma}^{X_i}(0)}. \qquad (2.5)$$

Using the estimator (2.5) and the numerical approximation (2.4), we can write an estimator of the TV distance, $\hat{d}_{TV}$, as follows:
$$\hat{d}_{TV} = \frac{1}{2T} \sum_{k=1}^{T} \left| \hat{f}_N^{X_1}\left( \frac{k}{T} - \frac{1}{2} \right) - \hat{f}_N^{X_2}\left( \frac{k}{T} - \frac{1}{2} \right) \right|, \qquad (2.6)$$
where $T$ is the time series length.

Remarks. 1) To obtain equation (2.6) we take $b = -1/2$, $c = 1/2$ and $n = T$; we also use the symmetry of $\hat{f}_N^{X_i}$.
2) We assume that $T$ is even; if $T$ is odd we consider $n = 2\lfloor T/2 \rfloor$, so that the frequencies correspond to $\omega_k = k/T$, the fundamental Fourier frequencies. If $X_1$ and $X_2$ have different lengths, but one is a multiple of the other (to ensure that there are enough common Fourier frequencies), we consider the smaller $T$.
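Putting (2.5) and (2.6) together, a self-contained R sketch of the estimator is the following; it reuses the parzen() function above and the stats function acf(), and is only an illustration of the computation (the thesis implementation is spec.parzen in the HSMClust toolbox, Appendix A).

    # Normalized Parzen lag window estimate on the grid omega_k = k/T - 1/2.
    fhat_N <- function(x, a) {
      g  <- acf(x, lag.max = a, type = "covariance", plot = FALSE)$acf[, 1, 1]
      Tn <- length(x)
      omega <- (1:Tn) / Tn - 1/2
      f <- sapply(omega, function(w) {
        h <- 1:a
        g[1] + 2 * sum(parzen(h / a) * g[h + 1] * cos(2 * pi * w * h))
      })
      pmax(f, 0) / g[1]          # normalize by gamma_hat(0), as in (2.5)
    }
    # Estimator (2.6) of the TV distance between two series.
    dtv_hat <- function(x1, x2, a) {
      Tn <- min(length(x1), length(x2))
      sum(abs(fhat_N(x1[1:Tn], a) - fhat_N(x2[1:Tn], a))) / (2 * Tn)
    }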

2.3.2 Asymptotic distribution of $\hat{d}_{TV}$

We would like to find the asymptotic distribution of (2.6) under $H_0$, so we can establish a critical value for the following test:
$$H_0: f^{X_1}(\omega) = f^{X_2}(\omega)\ \forall \omega \quad \text{vs} \quad H_A: \exists\, \omega \text{ such that } f^{X_1}(\omega) \neq f^{X_2}(\omega).$$
Under $H_0$, we can write $\hat{d}_{TV}$ as follows. Let $f(\omega) = f^{X_1}(\omega) = f^{X_2}(\omega)$; then
$$\hat{d}_{TV} = \frac{1}{2T} \sum_{k=1}^{T} f(\omega_k) |D_{T,k}| = \sum_{k=1}^{T} c_{T,k} |D_{T,k}|, \qquad (2.7)$$
where
$$c_{T,k} = \frac{f(\omega_k)}{2T}, \qquad \omega_k = \frac{k}{T} - \frac{1}{2}, \qquad D_{T,k} = \frac{\hat{f}_N^{X_1}(\omega_k) - \hat{f}_N^{X_2}(\omega_k)}{f(\omega_k)}.$$
To find the distribution of $\hat{d}_{TV}$, we use the asymptotic convergence of $\hat{f}_N^{X_i}$ and the following property: if $\{X_t\}$ is a time series with $\mathrm{Var}(X_t) = \sigma^2$ and $t = 1, 2, \ldots, T$, then
$$\hat{\gamma}(0) \xrightarrow{P} \sigma^2 \qquad (2.8)$$
as $T \to \infty$.
Additionally, we are going to use the following lemma and the Central Limit Theorem for triangular arrays.

Theorem 2.1. Let $\{(X_{n,j}, 1 \le j \le n), n \ge 1\}$ be a triangular array of row-wise independent random variables with mean zero. Set, for $n \ge 1$, $S_n = \sum_{j=1}^n X_{n,j}$, $s_{n,j}^2 = E[X_{n,j}^2]$, and $s_n^2 = \mathrm{Var}(S_n) = \sum_{j=1}^n s_{n,j}^2$. If the Lindeberg condition
$$\lim_{n \to \infty} \sum_{j=1}^n \frac{1}{s_n^2} \int_{|X_{n,j}| > \varepsilon s_n} X_{n,j}^2\, dP = 0$$
is satisfied, then
$$\frac{S_n}{s_n} \xrightarrow{w} N(0, 1).$$

Lemma 2.1. If the Lyapunov condition is satisfied, namely there exists a $\delta > 0$ such that
$$\frac{1}{s_n^{2+\delta}} \sum_{j=1}^n E|X_{n,j}|^{2+\delta} \xrightarrow{n \to \infty} 0,$$
then the Lindeberg condition is satisfied.



The proofs of the theorem and the lemma can be found in Brockwell and Davis (2006), Chapter 6, or in Gnedenko and Kolmogorov (1968).

We shall now consider two processes $X_1(t)$ and $X_2(t)$ that satisfy the following assumption.

Assumption A1. Suppose that $X_1(t)$ and $X_2(t)$ are independent stationary processes with mean $\mu$, variance $\sigma^2$, absolutely summable covariance functions, and continuous spectral densities $f^{X_i}$ with the first three spectral moments finite.
Lemma 2.2. Let $\hat{f}_L(\omega) = (2\pi)^{-1} \sum_{|h| \le a} \beta(h/a)\, \hat{\gamma}(h)\, e^{-ih\omega}$, where $\beta$ is a taper function, $a$ is the bandwidth, $T$ is the length of the time series $X_t$, $\hat{\gamma}(h) = \frac{1}{T} \sum_{t=1}^{T-|h|} (X_t - \bar{X})(X_{t+|h|} - \bar{X})$ is the sample covariance function, and $\bar{X} = \frac{1}{T} \sum_{t=1}^{T} X_t$. Under Assumption A1, if $a \to \infty$ when $T \to \infty$ and $a/T \to 0$, then
$$\sqrt{\frac{T}{a}} \left( \hat{f}_L(\omega) - f(\omega) \right) \xrightarrow{w} N\left( 0,\ f^2(\omega) \int_{-1}^{1} \beta^2(u)\, du \right). \qquad (2.9)$$
The proof can be found in Brillinger (1981), Section 5.6.

Before stating the normal approximation theorem we introduce some notation:
$$\Sigma^2 = \frac{2}{\sigma^4} \int_{-1}^{1} \beta^2(u)\, du, \qquad \mu_{T,k} = c_{T,k}\, \Sigma \sqrt{\frac{2a}{\pi T}} = f(\omega_k)\, \Sigma \sqrt{\frac{a}{2\pi T^3}},$$
$$s_{T,k}^2 = c_{T,k}^2\, \Sigma^2 \left( 1 - \frac{2}{\pi} \right) \frac{a}{T}, \qquad s_T^2 = \sum_{k=1}^{T} s_{T,k}^2.$$

Theorem 2.2 (Normal Approximation). Suppose Assumption A1 holds. Then, under $H_0$,
$$\frac{1}{s_T} \sum_{k=1}^{T} \left( c_{T,k} |D_{T,k}| - \mu_{T,k} \right) \xrightarrow{w} N(0, 1) \qquad (2.10)$$
when $T \to \infty$ and $a/T \to 0$.

Proof. Let $\hat{f}_N^{X_i}$ be the normalized lag window estimator using the time series $X_i(t)$ with bandwidth $a$, $T$ the length of the observed time series, $\sigma^2 = \mathrm{Var}(X_i(t))$, $i = 1, 2$, and $\hat{d}_{TV}$ as in (2.6). Let $Z_{T,k}$ and $Z_{T,k}^*$, $k = 1, \ldots, T$, be independent Gaussian random variables with parameters
$$\mu = \frac{1}{\sigma^2} \qquad \text{and} \qquad \sigma_{a,T}^2 = \frac{a}{T \sigma^4} \int_{-1}^{1} \beta^2(u)\, du.$$

Step 1. First, we will show that $\hat{d}_{TV} - \sum_{k=1}^{T} c_{T,k} |Z_{T,k} - Z_{T,k}^*|$ converges in probability to zero.
Notice that $c_{T,k} \ge 0$. Without loss of generality, we can assume that $c_{T,k} > 0$ (otherwise the corresponding summands are zero). Now, under $H_0$,
$$\left| \hat{d}_{TV} - \sum_{k=1}^{T} c_{T,k} |Z_{T,k} - Z_{T,k}^*| \right| = \left| \sum_{k=1}^{T} c_{T,k} \left( |D_{T,k}| - |Z_{T,k} - Z_{T,k}^*| \right) \right| \le \sum_{k=1}^{T} c_{T,k} \left|\, |D_{T,k}| - |Z_{T,k} - Z_{T,k}^*| \,\right| \le \sum_{k=1}^{T} c_{T,k} \left| D_{T,k} - (Z_{T,k} - Z_{T,k}^*) \right|.$$
Then,
$$0 \le \left| \hat{d}_{TV} - \sum_{k=1}^{T} c_{T,k} |Z_{T,k} - Z_{T,k}^*| \right| \le \sum_{k=1}^{T} c_{T,k} \left| D_{T,k} - (Z_{T,k} - Z_{T,k}^*) \right|. \qquad (2.11)$$

It is sufficient to show that the sum on the right in (2.11) converges in probability to zero. Let $\varepsilon > 0$; then
$$P\left( \sum_{k=1}^{T} c_{T,k} \left| D_{T,k} - (Z_{T,k} - Z^*_{T,k}) \right| > \varepsilon \right) \le \sum_{k=1}^{T} P\left( c_{T,k} \left| D_{T,k} - (Z_{T,k} - Z^*_{T,k}) \right| > \varepsilon \right)$$
$$= \sum_{k=1}^{T} P\left( \left| D_{T,k} - (Z_{T,k} - Z^*_{T,k}) \right| > \frac{\varepsilon}{c_{T,k}} \right) = \sum_{k=1}^{T} P\left( \left| \left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} - Z_{T,k} \right) - \left( \frac{\hat{f}^{X_2}_N(\omega_k)}{f(\omega_k)} - Z^*_{T,k} \right) \right| > \frac{\varepsilon}{c_{T,k}} \right).$$

Now, for each $k$, applying Chebyshev's inequality, we have
$$P\left( \left| \left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} - Z_{T,k} \right) - \left( \frac{\hat{f}^{X_2}_N(\omega_k)}{f(\omega_k)} - Z^*_{T,k} \right) \right| > \frac{\varepsilon}{c_{T,k}} \right) \le \frac{c_{T,k}^2}{\varepsilon^2}\, E\left( \left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} - Z_{T,k} \right) - \left( \frac{\hat{f}^{X_2}_N(\omega_k)}{f(\omega_k)} - Z^*_{T,k} \right) \right)^2.$$
Then,
$$E\left( \left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} - Z_{T,k} \right) - \left( \frac{\hat{f}^{X_2}_N(\omega_k)}{f(\omega_k)} - Z^*_{T,k} \right) \right)^2 = E\left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} - Z_{T,k} \right)^2 - 2\, E\left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} - Z_{T,k} \right) E\left( \frac{\hat{f}^{X_2}_N(\omega_k)}{f(\omega_k)} - Z^*_{T,k} \right) + E\left( \frac{\hat{f}^{X_2}_N(\omega_k)}{f(\omega_k)} - Z^*_{T,k} \right)^2. \qquad (2.12)$$
Now, we compute each term:
$$E\left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} - Z_{T,k} \right)^2 = \mathrm{Var}\left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} - Z_{T,k} \right) + \left[ E\left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} - Z_{T,k} \right) \right]^2. \qquad (2.13)$$

Combining (2.8) and (2.9) in Lemma 2.2,
$$E\left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} \right) \longrightarrow \frac{1}{\sigma^2} \qquad \text{and} \qquad \sqrt{\frac{T}{a}}\, \mathrm{Var}\left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} \right) \longrightarrow \frac{1}{\sigma^4} \int_{-1}^{1} \beta^2(u)\, du.$$
Since $Z_{T,k}$ is independent of $X_1$,
$$E\left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} - Z_{T,k} \right) \longrightarrow 0 \qquad \text{and} \qquad \sqrt{\frac{T}{a}}\, \mathrm{Var}\left( \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} - Z_{T,k} \right) \longrightarrow \frac{1}{\sigma^4} \int_{-1}^{1} \beta^2(u)\, du < \infty. \qquad (2.14)$$
The same proof works for $X_2$ and $Z_{T,k}^*$. Denote
$$G_{T,k} = \frac{\hat{f}^{X_1}_N(\omega_k)}{f(\omega_k)} - Z_{T,k} \qquad \text{and} \qquad G^*_{T,k} = \frac{\hat{f}^{X_2}_N(\omega_k)}{f(\omega_k)} - Z^*_{T,k}.$$
Substituting this notation and using the previous inequalities, we get
$$\sum_{k=1}^{T} P\left( |G_{T,k} - G^*_{T,k}| > \frac{\varepsilon}{c_{T,k}} \right) \le \sum_{k=1}^{T} \frac{c_{T,k}^2}{\varepsilon^2}\, E\left( G_{T,k} - G^*_{T,k} \right)^2 = \sum_{k=1}^{T} \frac{c_{T,k}^2}{\varepsilon^2} \left( E[G_{T,k}^2] - 2 E[G_{T,k}] E[G^*_{T,k}] + E[(G^*_{T,k})^2] \right)$$
$$= \sum_{k=1}^{T} \frac{c_{T,k}^2}{\varepsilon^2} \left( \mathrm{Var}[G_{T,k}] + E^2[G_{T,k}] - 2 E[G_{T,k}] E[G^*_{T,k}] + \mathrm{Var}[G^*_{T,k}] + E^2[G^*_{T,k}] \right).$$
Notice that the moments of $G_{T,k}$ and $G^*_{T,k}$ are equal and have the same value for all $k$. So, if $k_0$ denotes a fixed value of $k$,
$$\sum_{k=1}^{T} P\left( |G_{T,k} - G^*_{T,k}| > \frac{\varepsilon}{c_{T,k}} \right) \le \left( \sum_{k=1}^{T} \frac{c_{T,k}^2}{\varepsilon^2} \right) 2\, \mathrm{Var}[G_{T,k_0}] = \frac{2}{\varepsilon^2} \sum_{k=1}^{T} \frac{f^2(\omega_k)}{4T^2} \sqrt{\frac{a}{T}} \sqrt{\frac{T}{a}}\, \mathrm{Var}[G_{T,k_0}]$$
$$= \frac{a^{1/2}}{2\, T^{3/2} \varepsilon^2} \left( \frac{1}{T} \sum_{k=1}^{T} f^2(\omega_k) \right) \sqrt{\frac{T}{a}}\, \mathrm{Var}[G_{T,k_0}].$$

Since the first three spectral moments are assumed finite,
$$\frac{1}{T} \sum_{k=1}^{T} f^2(\omega_k) \xrightarrow{T \to \infty} \int_{-1/2}^{1/2} f^2(\omega)\, d\omega. \qquad (2.15)$$
Using (2.15) and (2.14), we get that
$$\sum_{k=1}^{T} P\left( |G_{T,k} - G^*_{T,k}| > \frac{\varepsilon}{c_{T,k}} \right) \longrightarrow 0 \qquad (2.16)$$
when $T \to \infty$. The convergence in (2.16) and the bound in (2.11) prove that $\hat{d}_{TV} - \sum_{k=1}^{T} c_{T,k} |Z_{T,k} - Z^*_{T,k}|$ converges in probability to zero.
Step 2. Now we show that
$$\frac{1}{s_T} \sum_{k=1}^{T} \left( c_{T,k} |Z_{T,k} - Z^*_{T,k}| - \mu_{T,k} \right) \xrightarrow{w} N(0, 1).$$
We have a triangular array, so we would like to use Theorem 2.1. Let $T$ and $k$ be fixed; then
$$Z_{T,k} - Z^*_{T,k} \sim N\left( 0,\ \frac{a}{T} \Sigma^2 \right), \qquad |Z_{T,k} - Z^*_{T,k}| \sim HN\left( \frac{a}{T} \Sigma^2 \right), \qquad (2.17)$$
where $\Sigma^2 = \frac{2}{\sigma^4} \int_{-1}^{1} \beta^2(u)\, du$ and $HN$ denotes the Half-Normal distribution. The HN distribution is a particular case of the folded normal (see Leone et al., 1961), obtained when the mean is equal to zero. This distribution is used when the measurements are Gaussian but only their absolute values are considered. Figure 2.3 plots examples of the density and distribution functions.

Figure 2.3: Probability density and distribution functions of the Half Normal (HN) distribution for different values of the standard deviation. (a) Density function. (b) Distribution function.
Using properties of this distribution,
$$E(|Z_{T,k} - Z^*_{T,k}|) = \Sigma \sqrt{\frac{2a}{\pi T}}, \qquad (2.18)$$
$$\mathrm{Var}(|Z_{T,k} - Z^*_{T,k}|) = \Sigma^2 \left( 1 - \frac{2}{\pi} \right) \frac{a}{T}. \qquad (2.19)$$
Let $Y_{T,k} = c_{T,k} |Z_{T,k} - Z^*_{T,k}| - \mu_{T,k}$, where
$$\mu_{T,k} = c_{T,k}\, \Sigma \sqrt{\frac{2a}{\pi T}} = m_1 f(\omega_k) \sqrt{\frac{a}{T^3}}$$
and $m_1$ is a constant. Then $\{(Y_{T,k}, 1 \le k \le T), T \ge 1\}$ is an independent triangular array with mean zero and variance
$$s_{T,k}^2 = c_{T,k}^2\, \Sigma^2 \left( 1 - \frac{2}{\pi} \right) \frac{a}{T} = m_2 f^2(\omega_k) \frac{a}{T^3},$$

where $m_2$ is a constant. Now we need to verify the Lindeberg condition. In fact, we will verify the Lyapunov condition, since it implies the Lindeberg condition.

Lyapunov condition: there exists a $\delta > 0$ such that
$$\frac{1}{s_T^{2+\delta}} \sum_{k=1}^{T} E|Y_{T,k}|^{2+\delta} \xrightarrow{T \to \infty} 0.$$
Let $\delta = 1$. Using the density of the Half Normal distribution, and writing $\sigma_{HN}^2 = \frac{a}{T}\Sigma^2$ for its scale parameter,
$$E|Y_{T,k}|^3 = c_{T,k}^3 \int_0^{\infty} \left| u - \Sigma \sqrt{\frac{2a}{\pi T}} \right|^3 \sqrt{\frac{2}{\pi}}\, \frac{1}{\sigma_{HN}}\, e^{-u^2 / 2\sigma_{HN}^2}\, du$$
$$= c_{T,k}^3 \int_0^{\Sigma \sqrt{2a/\pi T}} \left( \Sigma \sqrt{\frac{2a}{\pi T}} - u \right)^3 \sqrt{\frac{2}{\pi}}\, \frac{1}{\sigma_{HN}}\, e^{-u^2 / 2\sigma_{HN}^2}\, du + c_{T,k}^3 \int_{\Sigma \sqrt{2a/\pi T}}^{\infty} \left( u - \Sigma \sqrt{\frac{2a}{\pi T}} \right)^3 \sqrt{\frac{2}{\pi}}\, \frac{1}{\sigma_{HN}}\, e^{-u^2 / 2\sigma_{HN}^2}\, du. \qquad (2.20)$$

To compute (2.20), we use integration by parts and properties of the exponential function. Since we are interested in the limit, we denote by $m_i$, $i = 3, 4, 5$, terms that do not depend on $a$, $T$ or $k$. Then
$$E|Y_{T,k}|^3 = c_{T,k}^3 \left[ m_3 \left( \frac{a}{T} \right)^{3/2} + m_4 \left( \frac{a}{T} \right)^{3/2} \Phi\left( \Sigma \sqrt{\frac{2a}{\pi T}} \right) \right] = \frac{1}{T^3} f^3(\omega_k) \left[ m_3 \left( \frac{a}{T} \right)^{3/2} + m_4 \left( \frac{a}{T} \right)^{3/2} \Phi\left( \Sigma \sqrt{\frac{2a}{\pi T}} \right) \right]$$
$$= f^3(\omega_k)\, \frac{1}{T} \left[ m_3 \frac{a^{3/2}}{T^{7/2}} + m_4 \frac{a^{3/2}}{T^{7/2}}\, \Phi\left( \Sigma \sqrt{\frac{2a}{\pi T}} \right) \right]. \qquad (2.21)$$

Then,
$$s_T^3 = m_5 \frac{a^{3/2}}{T^{9/2}} \left( \sum_{k=1}^{T} f^2(\omega_k) \right)^{3/2} = m_5 \frac{a^{3/2}}{T^3} \left( \frac{1}{T} \sum_{k=1}^{T} f^2(\omega_k) \right)^{3/2}.$$
So,
$$\frac{1}{s_T^3} \sum_{k=1}^{T} E|Y_{T,k}|^3 = \frac{\dfrac{1}{T} \displaystyle\sum_{k=1}^{T} f^3(\omega_k) \left[ m_3 \dfrac{1}{T^{1/2}} + m_4 \dfrac{1}{T^{1/2}}\, \Phi\left( \Sigma \sqrt{\dfrac{2a}{\pi T}} \right) \right]}{m_5 \left( \dfrac{1}{T} \displaystyle\sum_{k=1}^{T} f^2(\omega_k) \right)^{3/2}}. \qquad (2.22)$$

Notice that
$$\Phi\left( \Sigma \sqrt{\frac{2a}{\pi T}} \right) \xrightarrow{T \to \infty} \frac{1}{2}, \qquad m_3 \frac{1}{T^{1/2}} \xrightarrow{T \to \infty} 0, \qquad m_4 \frac{1}{T^{1/2}} \xrightarrow{T \to \infty} 0,$$
so
$$m_3 \frac{1}{T^{1/2}} + m_4 \frac{1}{T^{1/2}}\, \Phi\left( \Sigma \sqrt{\frac{2a}{\pi T}} \right) \xrightarrow{T \to \infty} 0.$$
Since $\omega_k = k/T - 1/2$ and the first three spectral moments are assumed finite,
$$\frac{1}{T} \sum_{k=1}^{T} f^3(\omega_k) \xrightarrow{T \to \infty} \int_{-1/2}^{1/2} f^3(\omega)\, d\omega, \qquad \frac{1}{T} \sum_{k=1}^{T} f^2(\omega_k) \xrightarrow{T \to \infty} \int_{-1/2}^{1/2} f^2(\omega)\, d\omega.$$
Finally, we get the Lyapunov condition:
$$\frac{1}{s_T^3} \sum_{k=1}^{T} E|Y_{T,k}|^3 \xrightarrow{T \to \infty} 0.$$
So,
$$\frac{1}{s_T} \sum_{k=1}^{T} Y_{T,k} = \frac{1}{s_T} \sum_{k=1}^{T} \left( c_{T,k} |Z_{T,k} - Z^*_{T,k}| - \mu_{T,k} \right) \xrightarrow{w} N(0, 1)$$
when $T \to \infty$.

Finally, we conclude from Step 1 and Step 2 that $\hat{d}_{TV}$ converges to the same distribution as $\sum_{k=1}^{T} c_{T,k} |Z_{T,k} - Z^*_{T,k}|$; that is, $\hat{d}_{TV}$ is asymptotically Normal, with the same parameters as $\sum_{k=1}^{T} c_{T,k} |Z_{T,k} - Z^*_{T,k}|$.

Remark. We could directly use the distribution of $\xi_{T,k} = |Z_{T,k} - Z^*_{T,k}|$ instead of two Gaussian variables; however, to make the second step more intuitive, we preferred to work with two Gaussian variables.

2.3.3 Approximation of the distribution of $\hat{d}_{TV}$

Theorem 2.2 gives an asymptotic distribution for $\hat{d}_{TV}$; however, for finite $T$ we need an alternative way to approximate this distribution. Following the proof of the theorem, we will use as an approximation the distribution of
$$\sum_{k=1}^{T} c_{T,k} |Z_{T,k} - Z^*_{T,k}|,$$
where $Z_{T,k}, Z^*_{T,k}$ are independent Gaussian variables as before.

A similar approximation can be obtained by considering chi-squared variables instead. Using the chi-squared approximation of the smoothed periodogram,
$$\frac{\hat{f}^{X_i}(\omega_k)}{f(\omega_k)} \overset{\cdot}{\sim} \frac{\chi^2_{2L_h}}{2L_h}, \qquad (2.23)$$
where $L_h = \dfrac{T}{a \int_{-1}^{1} \beta^2(u)\, du}$, the chi-squared approximation is
$$\sum_{k=1}^{T} c_{T,k} |Z_{T,k} - Z^*_{T,k}|,$$
where $c_{T,k} = \frac{f(\omega_k)}{2T}$ and $Z_{T,k}$, $Z^*_{T,k}$, $k = 1, \ldots, T$, are now independent random variables with distribution
$$Z_k \sim \frac{1}{\sigma^2}\, \frac{\chi^2_{2L_h}}{2L_h}, \qquad \text{with } L_h = \frac{T}{a \int_{-1}^{1} \beta^2(u)\, du}.$$
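In practice, a critical value can be obtained from either approximation by Monte Carlo. The R sketch below does this for the chi-squared version, assuming the spectrum passed in has already been normalized (so the $\sigma^2$ factor drops out) and that a Parzen window is used, for which $\int_{-1}^{1} \beta^2(u)\, du = 151/280$ (cf. Table 2.2).

    crit_dtv <- function(fN_omega, Tn, a, B = 5000, alpha = 0.05) {
      Lh <- Tn / (a * 151 / 280)     # Parzen: integral of beta^2 is 151/280
      cT <- fN_omega / (2 * Tn)      # c_{T,k} with the normalized spectrum
      sim <- replicate(B, {
        z1 <- rchisq(Tn, df = 2 * Lh) / (2 * Lh)
        z2 <- rchisq(Tn, df = 2 * Lh) / (2 * Lh)
        sum(cT * abs(z1 - z2))
      })
      quantile(sim, 1 - alpha)       # reject H0 when dtv_hat exceeds this value
    }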

2.3.4 Bootstrapping

As an alternative to the asymptotic distribution, one can obtain a critical value for the statistic $\hat{d}_{TV}$ based on a bootstrap procedure. If $X(t)$ is a linear process, i.e.,
$$X(t) = \sum_{j=-\infty}^{\infty} \psi_j W(t - j),$$
where $W(t)$ is white noise, it can be proved (see Bloomfield, 1976) that the periodogram of $X(t)$ satisfies
$$I^X(\omega_k) \approx G(\psi, \omega_k)\, I^W(\omega_k),$$
where $G(\psi, \omega) = \left| \sum_j \psi_j e^{-2\pi i \omega j} \right|^2$ and $I^W(\cdot)$ is the periodogram of the white noise $W(t)$. Moreover, if the white noise has variance equal to one, then $G(\psi, \omega)$ is equal to the spectral density of $X$. So, the observed periodogram consists of the spectral density of $X$ multiplied by the periodogram of a white noise process. A natural way to get a replicate of the observed spectral density is therefore to multiply the density $f$ by the estimated periodogram of white noise. We will explain this in more detail when the consistency of the bootstrap estimator is proved. This proposal is motivated by the method presented in Kreiss and Paparoditis (2015).

Algorithm:

1. From $X_1(t)$ and $X_2(t)$, estimate $\hat{f}_N^{X_1}(\omega)$ and $\hat{f}_N^{X_2}(\omega)$.

2. Under $H_0$, take $\hat{f}_N(\omega) = \dfrac{\hat{f}_N^{X_1}(\omega) + \hat{f}_N^{X_2}(\omega)}{2}$.

3. Draw $Z(1), \ldots, Z(T) \sim N(0,1)$ i.i.d. random variables, then estimate $\hat{f}_N^Z(\omega)$, also using the lag window estimator.

4. The bootstrap spectral density will be $\hat{f}_N^B(\omega) = \hat{f}_N(\omega)\, \hat{f}_N^Z(\omega)$.

5. Repeat steps 3 and 4 and estimate $\hat{d}_{TV}$ using the bootstrap spectral densities, i.e.,
$$\hat{d}_{TV}^B = \hat{d}_{TV}(\hat{f}_N^{B_1}, \hat{f}_N^{B_2}),$$
where $\hat{f}_N^{B_i}$, $i = 1, 2$, are two bootstrap spectral densities using different replicates of the process $Z(\cdot)$. A sketch of this procedure in R is given below.
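A compact R sketch of this algorithm, reusing the fhat_N() and dtv_hat() sketches from Section 2.3.1 (an illustration under the same assumptions, not the thesis code), is:

    boot_dtv <- function(x1, x2, a, B = 500) {
      Tn <- min(length(x1), length(x2))
      f0 <- (fhat_N(x1[1:Tn], a) + fhat_N(x2[1:Tn], a)) / 2  # step 2, under H0
      replicate(B, {
        fb1 <- f0 * fhat_N(rnorm(Tn), a)                     # steps 3 and 4
        fb2 <- f0 * fhat_N(rnorm(Tn), a)
        fb1 <- fb1 / mean(fb1)                               # renormalize
        fb2 <- fb2 / mean(fb2)
        sum(abs(fb1 - fb2)) / (2 * Tn)                       # step 5
      })
    }
    # quantile(boot_dtv(x1, x2, a), 0.95) gives a bootstrap critical value.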

In order to have a good approximation of the distribution of $\hat{d}_{TV}$ by $\hat{d}_{TV}^B$, we need to show that $\hat{f}_N^B(\omega)$ is a consistent estimator.

Consistency of the bootstrap estimator for the spectral density. Suppose that $X(t)$ is a linear process, i.e., $X(t) = \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j}$, where $Z_t$ is white noise with variance equal to 1, and assume that $\sum_{j=-\infty}^{\infty} |\psi_j| |j|^{1/2} < \infty$ and $E Z_1^4 < \infty$.
From Theorem 10.3.1 in Brockwell and Davis (2006), we know that
$$I^X(\omega_k) = |\psi(e^{-i\omega_k})|^2\, I^Z(\omega_k) + R_T(\omega_k), \qquad (2.24)$$
where $\psi(e^{-i\lambda}) = \sum_{j=-\infty}^{\infty} \psi_j e^{-i\lambda j}$ and $\max_{\omega_k \in [0, \pi]} E|R_T(\omega_k)|^2 = O(T^{-1})$.
In addition, $X$ is a linear filter of $Z$, so
$$f^X(\omega) = |\psi(e^{-i\omega})|^2 f^Z(\omega).$$
Because $Z(t)$ is white noise with variance one, $f^Z(\omega) = 1$. Then,
$$f^X(\omega) = |\psi(e^{-i\omega})|^2. \qquad (2.25)$$
Substituting (2.25) into (2.24), we obtain
$$I^X(\omega_k) = f^X(\omega_k)\, I^Z(\omega_k) + R_T(\omega_k). \qquad (2.26)$$

Define the functional $J$ as $J(f)(\omega) = f_N^X(\omega)\, \hat{f}_N^Z(\omega)$, where $\hat{f}^Z$ is the lag window estimator of the spectral density of Gaussian standard white noise. Since we are interested in bootstrapping the lag window estimator instead of just the periodogram, we now study the behavior of $J(f)(\omega)$. We shall consider the equivalent representation of a lag window estimator as an averaged periodogram. If
$$\hat{f}_L(\omega) = \sum_{|h| \le a} \beta(h/a)\, \hat{\gamma}(h)\, e^{-i2\pi h\omega}$$
is the lag window estimator with bandwidth $a$, then we can approximate $\hat{f}_L$ as
$$\hat{f}_L(\omega) \approx \sum_{|j| < \lfloor (T-1)/2 \rfloor} \beta^T(\omega_j)\, I(\omega_j),$$
where $\beta^T(\omega) = \frac{2\pi}{T} \sum_{|h| \le a} \beta(h/a)\, e^{-i2\pi h\omega}$ and $I(\omega_j)$ is the periodogram (see Brockwell and Davis, 2006). To prove the consistency of the bootstrap, it is necessary to assume that $\beta$ decreases exponentially and to consider that $f_N^X$ is smooth. This is valid when one uses the Parzen window, but it is not necessarily true in other cases.
Then,
$$J(f)(\omega_k) = f_N^X(\omega_k)\, \hat{f}_N^Z(\omega_k) \approx f_N^X(\omega_k) \sum_j \beta^T(\omega_j)\, I^Z(\omega_{k+j}) = \sum_j \beta^T(\omega_j)\, f_N^X(\omega_k)\, I^Z(\omega_{k+j}) \approx \sum_j \beta^T(\omega_j)\, f_N^X(\omega_{k+j})\, I^Z(\omega_{k+j}) \qquad (2.27)$$
$$= \sum_j \beta^T(\omega_j) \left( I^X(\omega_{k+j}) - R_T(\omega_{k+j}) \right) \qquad (2.28)$$
$$= \sum_j \beta^T(\omega_j)\, I^X(\omega_{k+j}) - \sum_j \beta^T(\omega_j)\, R_T(\omega_{k+j}).$$
So, applying (2.26), we can rewrite (2.27) as (2.28), and
$$J(f) = \hat{f}_N^X(\omega_k) - \tilde{R}_T(\omega_k),$$
where $\tilde{R}_T(\omega_k) = \sum_j \beta^T(\omega_j)\, R_T(\omega_{k+j})$.
To prove consistency of the bootstrap procedure, it is enough to show that
$$\max_{\omega_k \in [0, \pi]} |J(f)(\omega_k) - \hat{f}_N^X(\omega_k)| \xrightarrow{P} 0. \qquad (2.29)$$

Considering the previous equations, we have

|J(f ) − fˆNX (ω)| = |R̃T (ωk )|.

From (2.24) we have

max E|RT (ωk )|2 = O(T −1 ) ⇒ max E|RT (ωk )|2 = o(1)
ωk ∈[0,π] ωk ∈[0,π]
2
⇒ E|RT (ωk )| = o(1), ∀ωk
P
⇒ RT (ωk ) −→ 0, ∀ωk
P
⇒ max |RT (ωk )| −→ 0.
ωk ∈[0,π]

Hence,

max_{ω_k∈[0,π]} |R̃_T(ω_k)| = max_{ω_k∈[0,π]} | Σ_j β_T(ω_j) R_T(ω_{k+j}) |
                            ≤ Σ_j β_T(ω_j) max_{ω_k∈[0,π]} |R_T(ω_{k+j})| → 0 in probability.    (2.30)

We conclude from (2.30) that (2.29) holds and the bootstrap estimator of the spectral density is consistent. The consistency of the bootstrap approximation for the distribution of d̂_TV then follows, since d̂_TV is a continuous function of the spectral density.

2.4 Simulation Study


The general setting of the simulation study is based on comparing (under H0 )
the empirical distributions between 1) values of dˆT V , obtained by simulating
time series X1 (t) and X2 (t) and computing the TV distance between the
normalized estimated spectra, and 2) values of the Normal, Chi-square
and/or Bootstrap approximations obtained by drawing from the variables
Zk and Zk∗ .
We consider two cases for Xi(t), i = 1, 2: Case 1, an AR(1) process with φ = 0.5; and Case 2, an AR(2) process with (φ1, φ2) = (−0.5, −0.6). Figure 2.4 shows the spectra for these processes.
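For concreteness, one replicate of d̂_TV under H0 could be generated as in the following sketch, which uses the illustrative lag window helper above; arima.sim and the normalization by the sum on the grid are conventions of this sketch, not the thesis code.

# One replicate of d_TV under H0, Case 1 (two AR(1) series with phi = 0.5).
set.seed(1)
T  <- 5000
a  <- floor(T^(3/4))                      # bandwidth a = T^{(p-1)/p}, p = 4
x1 <- arima.sim(list(ar = 0.5), n = T)
x2 <- arima.sim(list(ar = 0.5), n = T)    # same spectrum, so H0 holds
f1 <- lag_window_spec(x1, a); f2 <- lag_window_spec(x2, a)
p1 <- f1 / sum(f1); p2 <- f2 / sum(f2)    # normalized spectra
dTV_hat <- 0.5 * sum(abs(p1 - p2))
# Case 2 replaces ar = 0.5 by ar = c(-0.5, -0.6) for the AR(2) model.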
Figure 2.4: Spectra used in the simulation study to draw X_i(t), i = 1, 2 (left: AR(1) spectrum; right: AR(2) spectrum; x-axis: frequency in Hz).

2.4.1 Rate of convergence


Notice that the asymptotic convergence requires that a/T → 0. However, the theorem does not specify how to choose a. In the simulation study, we consider a = T^{(p−1)/p} with p = 2, 3, 4, 5. Then a/T = T^{−1/p} and the condition is satisfied.
The sample sizes considered are T = 1000, 2000, 5000 and 10000, with replicates drawn to obtain the empirical distribution in each case. Our goal is to study the rate of convergence, but we are also interested in observing whether the approximations are sensitive to the choice of a.
First, we shall focus on the Normal and Chi-square approximations, assuming the true spectral density is known. Figures 2.5, 2.6, 2.7 and 2.8 show a nonparametric estimate of the density of each sample: the red density was estimated using the values of the TV distance, the green density using the values of the normal approximation, and the black dashed density using the values of the Chi-square approximation.
Although the spectra are very different, the results are similar in both examples. We observe a good approximation in the cases p ≥ 3, and when p = 4 the densities are almost identical for T = 10000. In the case p = 2 the convergence is slower and larger values of T are needed in order to have a good approximation. In the other cases, a good approximation is observed for T ≥ 5000.
It seems that the best choice for a could be T^{4/5}, “best” in the sense that the convergence is faster.
Figure 2.5: Simulation results for AR(1), T = 1000, 2000, 5000, 10000, a = T^{(p−1)/p} (panels: (a) p = 2, (b) p = 3).
Figure 2.6: Simulation results for AR(1), T = 1000, 2000, 5000, 10000, a = T^{(p−1)/p} (panels: (a) p = 4, (b) p = 5).
Figure 2.7: Simulation results for AR(2), T = 1000, 2000, 5000, 10000, a = T^{(p−1)/p} (panels: (a) p = 2, (b) p = 3).
Figure 2.8: Simulation results for AR(2), T = 1000, 2000, 5000, 10000, a = T^{(p−1)/p} (panels: (a) p = 4, (b) p = 5).

These results show that we should have at least 5000 points to get a good approximation. However, sometimes it is not possible to obtain that much data. We explore the case of a “small” sample size and propose a modification to get a better approximation. “Small” is relative to the time series setting, because we cannot get a good estimate of the spectral density if T is too small. We consider the cases T = 1000 and T = 2000.
First, we would like to verify whether the convergence is faster with a better choice of the bandwidth a. Consider the same AR(1) and AR(2) processes as before but with a = 100, 200, 300, and 400. Figures 2.9 and 2.10 show the results for each case. Increasing the value of a approximates the dispersion of d̂_TV better; however, a bias appears.

Why does this happen? Consider the Normal approximation. Since Z_k and Z_k^* are not correlated, |Z_k − Z_k^*| is approximately distributed as

HN((a/T) Σ²),    (2.31)

where Σ² = (4/σ⁴) ∫_{−1}^{1} β²(u) du and HN denotes the Half-Normal distribution. Using properties of this distribution,

E(|Z_k − Z_k^*|) = √(2a/(πT)) Σ,
Var(|Z_k − Z_k^*|) = Σ² (a/T)(1 − 2/π).

So, under H0 and using that ∫ f(ω) dω = σ²,

2 E(d̃_TV) = σ² Σ √(2a/(πT)) = K₁ √(a/T),    (2.32)
4 Var(d̃_TV) = ∫ f²(ω) dω · Σ² (1 − 2/π)(a/T) = K₂ (a/T),

where d̃_TV denotes the normal approximation of d̂_TV.


So, the factor a/T, together with the first two spectral moments, determines the dispersion and the mean of the normal approximation. Both quantities go to zero when a/T goes to zero. However, for T fixed, when a gets closer to T we increase the dispersion but also the mean, and a bias appears. This is the phenomenon we observe in the simulations.
Figure 2.9: Simulation results for AR(1) with small sample size and different values of the bandwidth, T = 1000, 2000, a = 100, 200, 300, 400.
Figure 2.10: Simulation results for AR(2) with small sample size and different values of the bandwidth, T = 1000, 2000, a = 100, 200, 300, 400.
We would like to increase the dispersion but not the mean, and we would also like to know how much we should increase it. To prove the convergence of the smoothed periodogram the following inequality is used (Brockwell and Davis, 2006),

(Σ_j β_T²(ω_j))^{−1} Var(f̂(ω)) ≤ f²(ω) + (Σ_j β_T(ω_j))^{−1} [o((Σ_j β_T(ω_j))²) + c₂ (2a+1)/T],

where c₂ is a constant and Σ_{|j|<a} β_T²(ω_j) ≈ (a/T) ∫_{−1}^{1} β²(u) du. This inequality and
(2.32) motivate the following proposal.
Consider a transformation of the random variable d̃_TV, which approximates d̂_TV by the following function,

(1 + (2a+1)/T) d̃_TV − ((2a+1)/T) E(d̃_TV).    (2.33)
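In R, the transformation (2.33) can be applied to a sample of draws from the normal (or Chi-square) approximation as in the following sketch; estimating E(d̃_TV) by the sample mean of the draws is an assumption of this sketch.

# Transformation (2.33) applied to draws d_tilde of the approximation,
# for sample size T and bandwidth a; E(d_tilde) estimated by mean(d_tilde).
transform_dTV <- function(d_tilde, a, T) {
  c1 <- (2 * a + 1) / T
  (1 + c1) * d_tilde - c1 * mean(d_tilde)
}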

The results obtained are presented in Figures 2.11 and 2.12. The transformed approximations, for values of a larger than 300, capture the right dispersion of the distribution and reduce the bias. However, the approximations are not completely accurate. This is to be expected, since we have “small” values of T.
Bootstrapping. Now, we explore the approximation using the bootstrap procedure. The simulation setting in this case is T = 1000, 2000 and a = 100, 150, 200, 250. We consider the same AR(1) and AR(2) processes as before. In this case we take one pair of samples [X1(t), X2(t)] and then draw bootstrap samples based on them. Finally, we compare the bootstrap density with the density obtained for d̂_TV from different replicates of [X1(t), X2(t)]. Figures 2.13 and 2.14 show the results.
The bootstrap density is a good approximation to the d̂_TV density. It does not depend on a, in the sense that for any value of a the approximations are very close to the empirical density. The performance of the bootstrap is equally precise for both processes.
Figure 2.11: Results using the transformed values for AR(1), T = 1000, 2000, a = 100, 200, 300, 400.
Figure 2.12: Results using the transformed values for AR(2), T = 1000, 2000, a = 100, 200, 300, 400.
Figure 2.13: Results using bootstrap for AR(1), T = 1000, 2000, a = 100, 150, 200, 250.
Figure 2.14: Results using bootstrap for AR(2), T = 1000, 2000, a = 100, 150, 200, 250.

2.4.2 Significance level and power of the test


Given two time series, X1 (t) and X2 (t), we establish the following hypothesis
test.
Hypothesis:

H0: f^{X1}(ω) = f^{X2}(ω) ∀ω   vs   HA: ∃ω such that f^{X1}(ω) ≠ f^{X2}(ω).

Test statistic:

d̂_TV = (1/2T) Σ_{k=1}^{T} | f̂_N^{X1}(k/T − 1/2) − f̂_N^{X2}(k/T − 1/2) |.

In either case, to determine the rejection region or to compute the p-value, we need an approximation of the distribution of the test statistic d̂_TV. We use one of the following: 1) the Normal approximation, 2) the Transformed Normal approximation, 3) the Chi-square approximation, 4) the Transformed Chi-square approximation, or 5) Bootstrapping.
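Putting the pieces together, a bootstrap version of the test could look like the sketch below; boot_dTV and lag_window_spec are the illustrative helpers defined earlier, and the grid-based normalization is again an assumption of the sketch.

# Sketch of the test: bootstrap critical value and p-value for d_TV.
tv_test <- function(x1, x2, a = 300, alpha = 0.05, B = 1000) {
  f1 <- lag_window_spec(x1, a); f2 <- lag_window_spec(x2, a)
  p1 <- f1 / sum(f1); p2 <- f2 / sum(f2)
  d_obs  <- 0.5 * sum(abs(p1 - p2))          # observed statistic
  d_boot <- boot_dTV(x1, x2, B = B, a = a)   # bootstrap null distribution
  list(statistic = d_obs,
       critical  = unname(quantile(d_boot, 1 - alpha)),
       p.value   = mean(d_boot >= d_obs))
}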
We will approximate the distribution of our test statistic, so the first thing we would like to verify is the significance level. In other words, for a fixed critical value c_α^i we want to verify how close P(d̂_TV > c_α^i) is to α, where i = 1, . . . , 5 indicates which approximation is being used. We also want to study the power of the test.
Significance Level. To explore the first property we shall consider the AR(2) process with peak frequency at 0.3 Hz, (φ1, φ2) = (−0.47, −0.6). We use T = 1000, 2000, 5000 and a = 100, 200, 300, 400, 500, with 1000 replicates to approximate the distributions and 1000 replicates to study the test performance. The pairs (T, a) are specified in Table 2.3.
The simulation procedure is the following.

• We draw two time series from the AR(2) process with the same parameters.

• We estimate f̂_N^{X1}, f̂_N^{X2} and the “true” spectrum as f̂_N = (f̂_N^{X1} + f̂_N^{X2})/2.

• We estimate the quantiles c_α^i using: i = 1, the Normal approximation; i = 2, the Transformed Normal approximation; i = 3, the Chi-square approximation; i = 4, the Transformed Chi-square approximation; and i = 5, Bootstrapping.

• Finally, we compute d̂_TV and compare it with c_α^i.
Under H0

  T     a    α     Normal  Transf. Normal  Chi-square  Transf. Chi-square  Bootstrap
  1000  100  0.01  0.134   0.117           0.143       0.117               0.012
  1000  100  0.05  0.185   0.164           0.189       0.169               0.050
  1000  100  0.1   0.213   0.189           0.223       0.202               0.100
  1000  200  0.01  0.075   0.043           0.086       0.048               0.012
  1000  200  0.05  0.126   0.074           0.147       0.086               0.050
  1000  200  0.1   0.158   0.110           0.178       0.135               0.097
  2000  200  0.01  0.128   0.101           0.133       0.107               0.009
  2000  200  0.05  0.174   0.149           0.187       0.159               0.042
  2000  200  0.1   0.209   0.183           0.226       0.201               0.077
  2000  300  0.01  0.119   0.072           0.133       0.085               0.010
  2000  300  0.05  0.166   0.133           0.180       0.147               0.056
  2000  300  0.1   0.191   0.164           0.206       0.183               0.106
  5000  300  0.01  0.202   0.193           0.215       0.200               0.014
  5000  300  0.05  0.253   0.240           0.262       0.252               0.053
  5000  300  0.1   0.273   0.265           0.292       0.277               0.096
  5000  400  0.01  0.186   0.161           0.199       0.175               0.010
  5000  400  0.05  0.232   0.214           0.252       0.230               0.056
  5000  400  0.1   0.260   0.245           0.279       0.262               0.104
  5000  500  0.01  0.167   0.129           0.188       0.159               0.007
  5000  500  0.05  0.218   0.191           0.234       0.210               0.050
  5000  500  0.1   0.243   0.226           0.268       0.248               0.102

Table 2.3: Proportion of rejections of H0 using the different approximations to the distribution of d̂_TV.


Table 2.3 shows the proportion of times that the null hypothesis is rejected using the critical value associated with each approximation. Among the theoretical approximations, the transformed versions have proportions closer to α than the non-transformed ones. As expected, for the theoretical approximations the value of a influences the proportion of rejections: bigger values of T need bigger values of a. On the other hand, the bootstrap procedure outperforms the rest in all cases. The proportion of rejections using the bootstrap procedure is almost equal to α, and is not influenced by the choice of a.
Power. Now, we draw time series from the AR(2) process but with different parameters. Figure 2.15 shows the spectrum for X1 (the continuous black curve); for X2 we use three different cases (the dotted curves). So, we fix the spectrum for X1 and use one of the others for X2. They are very close, and we would like to see how often the test detects the difference as the spectra get closer.
Figure 2.15: Spectra under HA (AR(2) spectra with peak frequencies 0.27, 0.28, 0.29 and 0.3 Hz; x-axis: frequency in Hz).

We use T = 1000, a = 200 (Table 2.4) and T = 2000, a = 300 (Table 2.5), with 1000 replicates to approximate the distributions and 1000 replicates to study the test performance.

In all cases, the power decreases when the spectra are closer. The Chi-square approximation has the largest power. If we compare the power of the theoretical and the transformed approximations, the transformation does not improve the power. This is a consequence of the underestimation of the dispersion of the distribution of d̂_TV.
The power in the bootstrap case is close to that of the theoretical approximations when the spectrum of X2 has its peak frequency at 0.27 Hz. When the spectra are closer the power decreases, but it is still acceptable (around 0.7) when the peak frequency is at 0.28 Hz.

The comparison between the power of the bootstrap procedure and that of the asymptotic approximations is not completely fair, since the significance level obtained with the asymptotic distributions is bigger than the nominal level. Since the bootstrap procedure is the only option that preserves the significance level, it is the best option in practice, even though the power can be low.
Under HA, T = 1000, a = 200

  Peak freq. of X2   α     Normal  Transf. Normal  Chi-square  Transf. Chi-square  Bootstrap
  0.27               0.01  0.996   0.988           0.996       0.990               0.965
  0.27               0.05  0.997   0.996           0.997       0.996               0.991
  0.27               0.1   0.997   0.997           0.998       0.997               0.997
  0.28               0.01  0.758   0.658           0.774       0.682               0.467
  0.28               0.05  0.825   0.761           0.839       0.777               0.695
  0.28               0.1   0.847   0.810           0.875       0.832               0.792
  0.29               0.01  0.236   0.146           0.257       0.159               0.068
  0.29               0.05  0.319   0.235           0.342       0.259               0.177
  0.29               0.1   0.367   0.305           0.416       0.326               0.276

Table 2.4: Proportion of rejections of H0 under HA, T = 1000 and a = 200.

Under HA, T = 2000, a = 300

  Peak freq. of X2   α     Normal  Transf. Normal  Chi-square  Transf. Chi-square  Bootstrap
  0.27               0.01  0.951   0.964           1.000       1.000               0.902
  0.27               0.05  0.973   0.982           1.000       1.000               0.966
  0.27               0.1   0.983   0.986           1.000       1.000               0.984
  0.28               0.01  0.517   0.576           0.991       0.993               0.332
  0.28               0.05  0.637   0.690           0.999       1.000               0.584
  0.28               0.1   0.701   0.746           1.000       1.000               0.709
  0.29               0.01  0.112   0.150           0.898       0.915               0.037
  0.29               0.05  0.248   0.297           0.980       0.987               0.253
  0.29               0.1   0.199   0.242           0.966       0.971               0.161

Table 2.5: Proportion of rejections of H0 under HA, T = 2000 and a = 300.


Figure 2.16: Proportion of rejections of H0 at level α = 0.05 using the different approximations to the distribution of d̂_TV, for (a) T = 1000, a = 200 and (b) T = 2000, a = 300. The x-axis represents the peak frequency (Tp) of the spectrum used to draw X2(t); when Tp = 0.3, X1(t) and X2(t) have the same spectral density.

2.5 Discussion
In comparison with other similarity measures, the total variation distance has some desirable properties. Its intuitive and easy interpretation is one of them. Also, contamination models give an interpretation of the distance as a level of similarity. It is important to note that we use the total variation distance to compare continuous functions, since the total variation distance is not useful for comparing discrete with continuous functions.
We explored the statistical properties of the estimator of the total variation distance, d̂_TV. Two approximations of its distribution were proposed, using Gaussian or Chi-squared variables, and a transformation of them was introduced for the case of small samples. In the simulation study, the tests based on these distributions had a bigger significance level than the nominal α. The transformations gave a significance level closer to α; however, they are not sufficiently precise for “small” T.
As an alternative, we proposed a bootstrap algorithm, and the results are very good. The bootstrap outperforms the asymptotic methods and its significance level is almost equal to α. It has the limitation of low power when the spectral densities are very close. In general, the bootstrap procedure is the best option to approximate the distribution of d̂_TV under the null hypothesis.
The theory developed here can be extended to the multivariate case. Another possible extension is to consider distances between some operator of the spectral density, such as the first or second derivative of the spectra, which would be useful in some applications.
Chapter 3

Clustering Methods

Our main goal is to detect changes in spectra and the previous chapter
explores the proposal of considering the TV distance as a similarity measure
between spectra. As was mentioned in the introduction, several methods for
detecting instantaneous breaks in time series have been proposed, but they
do not produce good results when the changes are slow. In this situation
it is convenient to change the point of view from detecting change-points to
determining time intervals during which the spectra are similar, in the sense
that their TV distance is small. If one considers that time series that have
similar spectral densities also share similar properties, one could think about
them as a group. Taking this into account, clustering methods are a natural
approach. Clustering based on spectral densities will be intuitive in many
applications.
In general, clustering is a procedure whereby a set of unlabeled data
is divided into groups so that members of the same group are similar, while
members of different groups differ as much as possible. Our goal is to develop
a method that produces groups or clusters consisting of time series having
similar spectral representation.
The subject of time series clustering is an active research area with
applications in many fields. Frequently, finding similarity between time
series plays a central role in the applications. In fact, time series clustering
problems arise in a natural way in a wide variety of fields, including
economics, finance, medicine, ecology, environmental studies, engineering,
and many others. This is not an easy task, since it requires a notion
of similarity between time series. Liao (2005) and Caiado et al. (2015) review the field, and Montero and Vilar (2014) present an R


package (TSclust) for time series clustering with a wide variety of alternative
procedures. According to Liao (2005), there are three approaches to time
series clustering: methods based on the comparison of raw data, feature-
based methods, where the similarity between time series is gauged through
features extracted from the data, and methods based on parameters from
models adjusted to the data.
The first approach, comparison of raw data, will be very complicated
when we have long time series, since it becomes a computational problem.
The third approach, based on parameters, is one of the most frequently used,
however, it has the limitation of considering a specific parametric model.
Our proposals are feature-based and the spectral density of the time series
is considered the central feature for classification purposes. The resulting
clusters will be similar in the sense that the time series in a cluster will have
similar spectral density. This will have an interpretation depending on the
application, Chapter 4 presents two different cases and the interpretation for
each one.
To build a clustering method the first question is how to measure the
similarity between spectral densities. We propose the use of the total
variation distance as a measure of similarity. Then, we need a clustering
algorithm, and we use a hierarchical algorithm with classical linkage functions
as our first proposal.
However, hierarchical clustering algorithms with linkage functions (such
as complete, average, Ward, and so on) are based on geometrical ideas
where the distances between new clusters and old ones are computed by
a linear combination of the distance of their members, which may not be
meaningful for clustering time series since these linear combinations may not
have a meaning in terms of the spectral densities. So, our second proposal
considers a new clustering algorithm, which takes advantage of the spectral
theory. We propose the Hierarchical Spectral Merger algorithm, which is
a modification of the classical hierarchical algorithms. The main difference
is the consideration of a new representative, i.e. a new estimation of the
spectral density for an updated cluster. This is intuitive and the updated
spectral estimates are smoother, less noisy and hence give better estimates
of the TV distance.
We explain each proposal in detail and compare through simulation
studies their performance.

3.1 TV distance in a clustering method


There are two general families of clustering algorithms: partitioning and
hierarchical. Among partitioning algorithms, K-means and K-medoids are the two most representative, and for hierarchical clustering algorithms, the main examples are agglomerative with single-linkage or complete-linkage
(Xu and Wunsch, 2005).
We consider the hierarchical algorithm because it accepts as input the
distances between objects. The algorithm starts with a dissimilarity matrix
and builds each cluster giving preference to the closest candidates. Distances
between new and old clusters are required during the algorithm and they are
calculated using the link functions. In complete link clustering, the distance is
equal to the longest distance from any member of one cluster to any member
of the other cluster. In average link clustering, the distance is equal to the
average distance from any member of one cluster to any member of the other
cluster.

Let Xi = (Xi(1), . . . , Xi(T)), i = 1, . . . , N, be a set of time series. The first proposed procedure is as follows.
Step 1. Estimate the spectral density of each time series using the smoothed periodogram.
Step 2. Compute the dissimilarity matrix using the TV distance between the normalized spectra.
Step 3. Apply the hierarchical algorithm to the dissimilarity matrix, with the complete or average link function.
Step 4. As a result, a clustering dendrogram is obtained for the data set, in which the distance between groups is represented by the length of the segments; one can decide to cut the dendrogram according to a fixed value of the distance or to the desired number of clusters.
This method can be applied using the agnes function of the R package cluster, as sketched below.
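A minimal sketch of Steps 1-4 in R, using hclust from base R (agnes can be used in the same way); lag_window_spec is the illustrative spectral estimator sketched in Chapter 2, and X is assumed to hold one time series per column.

# TV dissimilarity matrix between the normalized estimated spectra.
tv_dist_matrix <- function(X, a = 100) {
  specs <- apply(X, 2, function(x) { f <- lag_window_spec(x, a); f / sum(f) })
  n <- ncol(specs)
  D <- matrix(0, n, n)
  for (i in 1:(n - 1)) for (j in (i + 1):n)
    D[i, j] <- D[j, i] <- 0.5 * sum(abs(specs[, i] - specs[, j]))
  as.dist(D)
}

fit    <- hclust(tv_dist_matrix(X), method = "complete")  # or "average"
groups <- cutree(fit, k = 2)   # cut the dendrogram at the desired k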

Example 3.1. Consider two different AR(2) models with spectra concentrated at 0.05 Hz and 0.06 Hz, respectively. We simulate three time series from each process, each consisting of 1000 points with a sampling frequency of 1 Hz, with X1, X3, X5 coming from the first process and X2, X4, X6 from the second. Figure 3.1(a) shows the estimated spectra for each series. We compute the dissimilarity matrix with the TV distance; the values, shown in Figure 3.1(b), are not big since the spectra are close.

Figure 3.1: (a) Estimated spectra for Example 3.1; (b) dissimilarity matrix computed using the TV distance; (c) clustering result using either the complete or average link function.

Figure 3.2: Dendrograms obtained for Example 3.1 using (a) the complete link function (agglomerative coefficient 0.68) and (b) the average link function (agglomerative coefficient 0.64).

When we apply the hierarchical algorithm with the complete and average link functions, we obtain the dendrogram plots in Figures 3.2(a) and (b). If we consider two groups, Figure 3.1(c) shows in different colors (red and black) which series belongs to each cluster. The method is able to recover the original groups. Note that the new distances obtained using the complete link are bigger than those obtained with the average link, so it is possible that for detecting small changes the complete link gives better results.
In the simulation study, we will consider the case where the number of groups, k, is unknown.

3.2 Hierarchical spectral merger (HSM) method


As an alternative to the previous proposal, a new time series clustering
algorithm was developed. The Hierarchical Spectral Merger (HSM) method
uses the TV distance as a dissimilarity measure as before but new clustering
procedures are introduced. The algorithms proposed are a modification of the
usual agglomerative hierarchical procedure, taking advantage of the spectral
point of view for the analysis of time series.
The hierarchical spectral merger algorithm has two versions: the first, known as the single version, updates the spectral estimate of a cluster from the concatenation of its time series; the second, known as the average version, updates the spectral estimate of a cluster as a weighted average of the spectral estimates obtained from each signal in the cluster.

Hierarchical Spectral Merger Algorithm. Let Xi = (Xi (1), . . . , Xi (T ))


be a set of time series, i = 1, . . . , N . The procedure starts with N clusters,
each cluster being a single signal.
Step 1. Estimate the spectral density for each cluster using the smoothed periodogram; each cluster will then be represented by a common spectral density fj(ω), j = 1, . . . , k (the number of clusters).
Step 2. Compute the TV distance between their spectra.
Step 3. Find the two clusters that have lowest TV distance, save this value
as a characteristic.
Step 4. Merge the signals in the two closest clusters and replace the two
clusters by this new one.
Step 5. Repeat Steps 1-4 until there is only one cluster left.
The characteristic saved in Step 3 represents the “cost” of joining two
clusters, i.e., having k − 1 clusters vs k clusters.

Algorithm:

1. Initial clusters: C = {Ci}, Ci = Xi, i = 1, . . . , N. Dissimilarity matrix with entries
   Dij = d(Ci, Cj) := d_TV(f̂i, f̂j), where f̂i is the spectrum estimated from the signals in cluster i.
2. for k in 1 : N − 1
3.   (ik, jk) = argmin_{ij} Dij; mink = min_{ij} Dij   # find the closest clusters
4.   Cnew = C_{ik} ∪ C_{jk}                            # join the closest clusters
5.   Dnew = D \ {rows and columns ik, jk}              # delete rows and columns ik, jk
6.   for j in 1 : N − k − 1
7.     Dnew_{(N−k)j} = Dnew_{j(N−k)} = d_TV(Cnew, Cj)  # compute new distances
8.   end
9.   D = Dnew; C = (C \ {C_{ik}, C_{jk}}) ∪ {Cnew}     # new matrix D and new clusters
10. end

Table 3.1: Hierarchical merger algorithm proposed, using the total variation distance and the estimated spectra.
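The single version of the algorithm can be sketched in R as follows; this is an illustration only (the HSMClust package provides the full implementation), reusing the illustrative lag_window_spec helper from Chapter 2.

# Sketch of the single version of the HSM algorithm.
hsm_single <- function(X, a = 100) {
  clusters <- lapply(seq_len(ncol(X)), function(i) X[, i])
  cost <- numeric(0)                       # "cost" of each merge (Step 3)
  while (length(clusters) > 1) {
    specs <- lapply(clusters,
                    function(s) { f <- lag_window_spec(s, a); f / sum(f) })
    k <- length(specs)
    D <- matrix(Inf, k, k)
    for (i in 1:(k - 1)) for (j in (i + 1):k)
      D[i, j] <- 0.5 * sum(abs(specs[[i]] - specs[[j]]))
    ij   <- which(D == min(D), arr.ind = TRUE)[1, ]      # closest pair
    cost <- c(cost, min(D))
    merged   <- c(clusters[[ij[1]]], clusters[[ij[2]]])  # concatenate signals
    clusters <- c(clusters[-ij], list(merged))
  }
  cost   # a large jump in these values suggests stopping before that merge
}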

If a significantly large value is observed, then it may be reasonable to choose k clusters instead of k − 1. When two clusters merge, there are two options: (1) in the single version, we concatenate the signals and compute the smoothed periodogram of the concatenated signal; (2) in the average version, we take the weighted average of the estimated spectra of the signals in the cluster as the new estimated spectrum.
Both algorithms compute the TV distance between the new cluster
and the old clusters based on updated estimated spectra, which is the
main difference with classical hierarchical algorithms. While a hierarchical
algorithm has a dissimilarity matrix of size N ×N during the whole algorithm,
the proposed method reduces this size to (N − k) × (N − k) at the k-th
iteration. Table 3.1 gives a summary of the hierarchical merger algorithm.
Example 3.2. To illustrate the HSM method, consider two different AR(2) models, both with spectra concentrated at 10 Hz; however, one also contains power at 21 Hz while the other has power at 40 Hz.

Figure 3.3: Estimated spectra. (a) Different colors correspond to different time series. (b) Red spectra are from the AR(2) model with activity in the alpha (8-12 Hz) and beta (12-30 Hz) bands; black spectra are from the AR(2) model with activity in the alpha and gamma (30-50 Hz) bands.

We simulate three time
series for each process, 10 seconds of each one with a sampling frequency
of 100 Hz (t = 1, . . . , 1000). Figure 3.3(a) shows the estimated spectra for
each series and Figure 3.3(b) shows by different colors (red and black) which
one belongs to the first or second process. If we only look at the spectra,
it is hard to recognize the number of clusters and their memberships. We
probably could not identify some cases, like the red and purple spectra.
The dynamics of the HSM method is shown in Figure 3.4. We start with
six clusters; at the first iteration we find the closest spectra, represented in
Figure 3.4(a) with the same color (red). After the first iteration we merge
these time series and get 5 estimated spectra, one per cluster; Figure 3.4(b) shows the estimated spectra, where the new cluster is represented by the dashed red curve. We can follow the procedure in Figures 3.4(c), (d), (e) and (f). In the end, the proposed clustering algorithm reaches the correct solution: Figures 3.4(g) and 3.3(b) coincide. Also, the estimated spectra for the two clusters, Figure 3.4(h), are better than any of the initial spectra, and we can identify the dominant frequency bands for each cluster.
We developed the HSMClust package, written in R, which implements the proposed clustering method. The package can be downloaded from http://ucispacetime.wix.com/spacetime#!project-a/cxl2. The principal function is called HSM; it executes the HSM method given a matrix X that contains the signals by column. HSMClust also has some other useful functions. One of them is the Sim.Ar function, which draws an AR(2) process reparametrized as a function of the norm of the roots of its characteristic polynomial (M) and the peak frequency of the spectrum (η); these parameters determine the shape of the spectral density (peak and sparseness).

Figure 3.4: Dynamics of the hierarchical merger algorithm. (a), (c), (e) and (g) show the clustering process for the spectra; (b), (d), (f) and (h) show the evolution of the estimated spectra, which improve when we merge the series in the same cluster.


3.3 TV distance and other dissimilarity measures

In this section we study the performance of the proposed methods and compare them with methods based on other distances. First, we explain the simulation methods based on the spectrum that will be used in the experiments. Then, we present the results of the experiments, assuming that the number of clusters is known. Finally, we explore the case of an unknown number of clusters and possible criteria to choose the number of clusters.

3.3.1 Simulation of a process based on the spectral density

There exist several methods to simulate a time series based on its spectral density; usually they depend on a model for the spectral density. We consider two of them: one based on a parametric spectral density and one based on AR(2) processes.

Simulation based on a parametric family of spectral densities. There exist several parametric families of spectra of frequent use in Oceanography, which have an interpretation in terms of the behavior of sea waves (Ochi, 1998). Motivated by the applications presented in the next chapter, we simulate time series (Gaussian processes) using spectra from several of these families. This methodology is already implemented by Brodtkorb et al. (2011) in the WAFO toolbox for MATLAB.
WAFO has a routine for the simulation of (transformed) Gaussian processes and their derivatives, using the circulant embedding of the covariance matrix proposed by Dietrich and Newsam (1997). More

traditional spectral simulation methods (FFT) are also implemented; see the WAFO tutorial for more details.
An example of a group of parametric densities is the JONSWAP (Joint North Sea Wave Project) spectral family. This is a parametric family of spectral densities which is frequently used in Oceanography, given by the formula

S(ω) = (g²/ω⁵) exp(−5ω_p⁴/(4ω⁴)) γ^{exp(−(ω−ω_p)²/(2ω_p²s²))},

where g is the acceleration of gravity; s = 0.07 if ω ≤ ω_p and s = 0.09 otherwise; ω_p = 2π/Tp; and γ = exp(3.484(1 − 0.1975(0.036 − 0.0056 Tp/√Hs) Tp⁴/Hs²)). The parameters of the model are the significant wave height Hs, defined as 4 times the standard deviation of the time series, and the spectral peak period Tp, the period corresponding to the modal frequency of the spectrum. This spectral family was developed empirically after analysis of data collected during the Joint North Sea Wave Observation Project, JONSWAP (Hasselmann et al., 1973). It is a reasonable model for wind-generated seas when 3.6√Hs ≤ Tp ≤ 5√Hs.
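A direct R transcription of this density is sketched below, for ω > 0; the overall scale that matches a given Hs is handled by WAFO's routines and is omitted here, so this sketch only reproduces the shape of the spectrum.

# JONSWAP spectral density as written above (sketch; w in rad/s, w > 0).
jonswap <- function(w, Hs, Tp, g = 9.81) {
  wp  <- 2 * pi / Tp
  s   <- ifelse(w <= wp, 0.07, 0.09)
  gam <- exp(3.484 * (1 - 0.1975 * (0.036 - 0.0056 * Tp / sqrt(Hs)) *
                        Tp^4 / Hs^2))
  (g^2 / w^5) * exp(-5 * wp^4 / (4 * w^4)) *
    gam^exp(-(w - wp)^2 / (2 * wp^2 * s^2))
}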

Simulation based on AR(2) processes. Consider the second-order autoregressive model, defined as

Zt = φ1 Zt−1 + φ2 Zt−2 + εt,    (3.1)

where εt is a white noise process. The characteristic polynomial of this model is φ(z) = 1 − φ1 z − φ2 z². The roots of the polynomial equation indicate the properties of the oscillations. If the roots, denoted z01 and z02, are complex-valued, then they have to be complex conjugates, i.e., z01 = z̄02. These roots have a polar representation

|z01| = |z02| = M,   arg(z01) = 2πη/Fs,    (3.2)

where Fs denotes the sampling frequency; M is the amplitude or magnitude of the root (M > 1 for causality); and η is the frequency index. The spectrum of the AR(2) process with the above polynomial roots will have peak frequency at η. The peak becomes broader as M → ∞, and it becomes narrower as M → 1⁺.
Figure 3.5: Top: spectra of the AR(2) process with different peak frequencies, η = 10, 21, 40. Bottom: realizations of the corresponding AR(2) processes.
Then, given (η, M, Fs) we take

φ1 = 2 cos(ω0)/M   and   φ2 = −1/M²,    (3.3)

where ω0 = 2πη/Fs. If one computes the roots of the characteristic polynomial with the coefficients in (3.3), they satisfy (3.2).
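In R, the map (3.3) and a draw from the resulting process can be sketched as follows; this mirrors what the Sim.Ar function of HSMClust does, but it is only an illustration.

# AR(2) coefficients from (eta, M, Fs) via (3.3), then one simulated series.
sim_ar2 <- function(T, eta, M, Fs) {
  w0 <- 2 * pi * eta / Fs
  arima.sim(list(ar = c(2 * cos(w0) / M, -1 / M^2)), n = T)
}
z <- sim_ar2(1000, eta = 10, M = 1.1, Fs = 100)  # oscillates near 10 Hz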
To illustrate the type of oscillatory patterns that can be observed in time series from processes with the corresponding spectra, we plot in Figure 3.5 the spectra (top) for different values of η, with M = 1.1 and Fs = 100 Hz, and the generated time series (bottom). Larger values of η give rise to faster oscillations of the signal.

3.3.2 Comparative study


Pértega Díaz and Vilar (2010) proposed two simulation tests to compare
the performance of several dissimilarity measures for time series clustering.
Our goal in this section is to compare the performance of the TV distance
with those that were based on the spectral density and had good results
in Pértega and Vilar’s experiments. In addition, we use the distance based

on the cepstral coefficients (the Fourier coefficients of the expansion of the logarithm of the estimated periodogram), which was used in Maharaj and D’Urso (2011), and the symmetric version of the Kullback-Leibler divergence.

Let I_X(ω_k) = T^{−1} |Σ_{t=1}^{T} X_t e^{−iω_k t}|² be the periodogram of time series X at frequencies ω_k = 2πk/T, k = 1, . . . , n, with n = [(T − 1)/2], and let NI_X be the normalized periodogram, i.e., NI_X(λ_k) = I_X(ω_k)/γ̂_0^X, with γ̂_0^X the sample variance of time series X. The dissimilarity criteria in the frequency domain considered were the following (a sketch in R is given after the list):

• The Euclidean distance between the normalized estimated periodogram ordinates: d_NP(X, Y) = ( (1/n) Σ_k (NI_X(λ_k) − NI_Y(λ_k))² )^{1/2}.

• The Euclidean distance between the logarithms of the normalized estimated periodograms: d_LNP(X, Y) = ( (1/n) Σ_k (log NI_X(λ_k) − log NI_Y(λ_k))² )^{1/2}.

• The square of the Euclidean distance between the cepstral coefficients: d_CEP(X, Y) = Σ_{k=0}^{p} (θ_k^X − θ_k^Y)², where θ_0 = ∫_0^1 log I(λ) dλ and θ_k = ∫_0^1 log I(λ) cos(2πkλ) dλ.
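The first two criteria can be sketched in R as below; the TSclust package offers tested implementations, and the periodogram and normalization conventions here are simplifying assumptions (d_CEP additionally requires the cepstral coefficients of the log-periodogram and is omitted).

# Sketch of d_NP and d_LNP for two series x and y.
np_dists <- function(x, y) {
  nip <- function(x) {
    x <- x - mean(x)
    I <- Mod(fft(x))^2 / length(x)        # periodogram at Fourier frequencies
    n <- floor((length(x) - 1) / 2)
    I[2:(n + 1)] / mean(x^2)              # normalize by gamma_hat(0)
  }
  NIx <- nip(x); NIy <- nip(y)
  c(NP  = sqrt(mean((NIx - NIy)^2)),
    LNP = sqrt(mean((log(NIx) - log(NIy))^2)))
}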

These dissimilarity measures were compared with the TV distance used in a hierarchical algorithm and with the HSM method. In order to compare the distances, the simulation settings were the same in all cases: the estimator of the spectrum is the smoothed periodogram using a Parzen window with bandwidth equal to 100 points, and the clustering algorithm is hierarchical with the complete link function (similar results are obtained with the average link). For the HSM method, we write HSM 1 for the single version and HSM 2 for the average version.
To evaluate the rate of success, we consider the following index, which has already been used to compare different clustering procedures (Pértega Díaz and Vilar, 2010; Gavrilov et al., 2000). Let {C1, . . . , Cg} and {G1, . . . , Gk} be the set of the g real groups and a k-cluster solution, respectively. Then,

Sim(G, C) = (1/g) Σ_{i=1}^{g} max_{1≤j≤k} Sim(G_j, C_i),    (3.4)

where Sim(G_j, C_i) = 2|G_j ∩ C_i| / (|G_j| + |C_i|). Note that this similarity measure returns 0 if the two clusterings are completely dissimilar and 1 if they are the same.
Figure 3.6: Spectra used in the simulation study to compare the TV distance with other similarity measures, for (a) Experiment 1, (b) Experiment 2 and (c) Experiment 3. Each spectrum, with different color and line type, corresponds to a cluster.

A sketch of this index in R is given below.
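# Similarity index (3.4): c_lab holds the true group labels C_1,...,C_g,
# g_lab the labels of a k-cluster solution G_1,...,G_k (a sketch).
sim_index <- function(g_lab, c_lab) {
  mean(sapply(unique(c_lab), function(ci) {
    Ci <- which(c_lab == ci)
    max(sapply(unique(g_lab), function(gj) {
      Gj <- which(g_lab == gj)
      2 * length(intersect(Gj, Ci)) / (length(Gj) + length(Ci))
    }))
  }))
}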
In the comparative study, we replicate each simulation setting N times and compute the rate of success for each one. The mean of this index is shown in Tables 3.2, 3.3 and 3.4, and box plots of the values obtained are shown in Figures 3.7, 3.8 and 3.9.
We consider three different experiments. The first one is motivated by
the applications in Oceanography, where the differences between spectra
could be produced by a small change in the modal frequency. The second
experiment was designed to test if the proposals are able to distinguish
between a unimodal and a bimodal spectrum. Finally, the third one considers
models that are frequently used in the study of signals using a frequency
domain approach. For all the experiments, the lengths of the time series
were T = 500, 1000, and 2000.

• Experiment 1 is based on two different JONSWAP (Joint North Sea Wave Project) spectra. The spectra considered have significant wave height Hs equal to 3; the first has a peak period Tp of 3.6√Hs, while for the second Tp = 4.1√Hs. Figure 3.6(a) exhibits the JONSWAP spectra, showing that the curves are close to each other. Five series from each spectrum were simulated, and N = 500 replicates of this experiment were made. In this case the sampling frequency was set to 1.28 Hz, which is a common value for wave data recorded by sea buoys.
• Experiment 2 is based on the AR(2) process. Let Z_t^j be the j-th latent component, j = 1, 2, 3, each an AR(2) process with Mj = 1.1 for all j and peak frequency ηj = 0.1, 0.13, 0.16 for j = 1, 2, 3, respectively. Z_t^j represents a latent signal oscillating at a pre-defined band. Define the observed time series to be a mixture of these latent AR(2) processes (a sketch of this construction in R is given after the list):

  (X_t^1, . . . , X_t^K)ᵀ = E (Z_t^1, Z_t^2, Z_t^3)ᵀ + (ε_t^1, . . . , ε_t^K)ᵀ,    (3.5)

  where E is the K × 3 matrix with rows e_1ᵀ, . . . , e_Kᵀ, ε_t^j is Gaussian white noise, X_t^j is a signal with oscillatory behavior generated by the linear combination e_jᵀ Z_t, and K is the number of clusters. In this experiment K = 3, with e_1ᵀ = (1, 0, 0), e_2ᵀ = (0, 1, 0) and e_3ᵀ = (0, 1, 1), and the number of draws of each signal X_t^i is 5; Figure 3.6(b) plots the three different spectra. So, we have three clusters with five members each. For this experiment N = 1000 replicates were made, and the sampling frequency was set to 1 Hz.
• Experiment 3 considers time series with two additive components. The first component is an oscillating process with random amplitude, while the second is random noise with an autoregressive structure. The general model is

  X(t) = A cos(2πtω0) + B sin(2πtω0) + Z(t),

  where Z(t) is an AR(2) process with parameters (η, M), and A, B are independent Gaussian N(0, 1) random variables. We look at three different models, one per cluster:

  a) Model 1: ω0 = 0.3, η = 0.2, M = 1.3;
  b) Model 2: ω0 = 0.1, η = 0.18, M = 1.3;
  c) Model 3: ω0 = 0.25, η = 0.22, M = 1.3.

  For each model five time series were generated, with T = 500, 1000, 2000, and N = 1000 replicates of the experiment were made. Figure 3.6(c) presents the spectral densities of the AR(2) components and shows that they are close to each other. The sampling frequency was set to 1 Hz.
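As announced in the description of Experiment 2, a sketch of the mixture (3.5) in R follows; sim_ar2 is the illustrative generator sketched in Section 3.3.1, and sharing the latent processes across the five draws of each signal is one possible reading of the setup (re-drawing them per signal would be equally consistent with the text).

# Sketch of Experiment 2: K = 3 clusters of 5 signals each from model (3.5).
T <- 1000; Fs <- 1
Z <- sapply(c(0.10, 0.13, 0.16),
            function(eta) sim_ar2(T, eta, M = 1.1, Fs = Fs))  # latent AR(2)s
E <- rbind(c(1, 0, 0), c(0, 1, 0), c(0, 1, 1))                # e_1, e_2, e_3
X <- do.call(cbind, lapply(1:3, function(k)
  replicate(5, as.vector(Z %*% E[k, ]) + rnorm(T))))          # 15 series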

Figure 3.7: Box plots of the rate of success over the replicates under the simulation setting of Experiment 1, using the different distances, for (a) T = 500, (b) T = 1000 and (c) T = 2000 (a = 100 in all cases).

In all the experiments, we assume the number of clusters is known, since the purpose is to compare the dissimilarity measures and not the algorithm used to decide the number of clusters.
Tables 3.2, 3.3 and 3.4 give the mean values of the rate of success for Experiments 1, 2 and 3, respectively. Being a nonlinear function, the logarithm enhances differences when the values of the spectral densities are below 1, and has the opposite effect for values larger than 1. Hence, when the spectra are very close and their values are bigger than 1, it is more difficult to distinguish them using a logarithmic distance.

Experiment 1

  T     a    NP     LNP    CEP    TV     SKL    HSM1   HSM2
  500   100  0.979  0.772  0.597  0.988  0.994  0.989  0.988
  1000  100  0.998  0.851  0.825  0.999  0.999  0.999  0.999
  2000  100  1      0.932  0.908  1      1      1      1

Table 3.2: Mean values of the similarity index obtained using the different distances and the two proposed methods in Experiment 1. The number of replicates is N = 500.

Experiment 2

  T     a    NP     LNP    CEP    TV     SKL    HSM1   HSM2
  500   100  0.864  0.949  0.895  0.930  0.952  0.836  0.838
  1000  100  0.961  0.996  0.974  0.990  0.994  0.983  0.983
  2000  100  0.995  1      0.999  0.999  0.999  0.999  0.999

Table 3.3: Mean values of the similarity index obtained using the different distances and the two proposed methods in Experiment 2. The number of replicates is N = 1000.

Experiment 3

  T     a    NP     LNP    CEP    TV     SKL    HSM1   HSM2
  500   100  0.974  0.843  0.777  0.976  0.984  0.977  0.977
  1000  100  0.974  0.830  0.762  0.975  0.985  0.977  0.976
  2000  100  0.975  0.823  0.757  0.975  0.984  0.977  0.977

Table 3.4: Mean values of the similarity index obtained using the different distances and the two proposed methods in Experiment 3. The number of replicates is N = 500.

Figure 3.8: Box plots of the rate of success over the replicates under the simulation setting of Experiment 2, using the different distances, for (a) T = 500, (b) T = 1000 and (c) T = 2000 (a = 100 in all cases).

This can be seen in the results of Experiments 1 and 3, where the LNP and CEP distances have a smaller rate of success than the TV distance. In Experiment 1, for short series (T = 500) the best results correspond to SKL, followed closely by the TV distance, while for long series (T = 1000 and 2000) the methods that use the TV distance have a success rate close to one.
In Experiments 2 and 3 it is more difficult to discriminate between groups, since there are three clusters involved. In Experiment 2 the best results were obtained with the SKL distance, followed by the LNP, while the TV distance was not far behind. In Experiment 3, the best results were obtained with the SKL divergence, and the methods based on the TV distance were very close.

Figure 3.9: Box plots of the rate of success over the replicates under the simulation setting of Experiment 3, using the different distances, for (a) T = 500, (b) T = 1000 and (c) T = 2000 (a = 100 in all cases).

Figures 3.7, 3.8, and 3.9 show the box plots of the values of the rate of success obtained in each experiment. We can see from the box plots corresponding to Experiment 1 that the CEP distance has many values smaller than 0.9, even in the case T = 2000.
In Experiment 2, the HSM method did not perform well for small and medium-sized time series compared to the others; T = 2000 is needed for the HSM method to identify the clusters more precisely. The NP distance has the worst performance overall. In Experiment 3, the performance of all the distances does not improve significantly when we increase the length of the series; the LNP and CEP distances, instead of improving, get worse when we increase the value of T.
In general, the rates of success for the methods that use the TV distance are good; in some cases they give the best results, and when they do not, they are close to the best. The methods based on logarithms, such as the LNP and CEP, perform well in some cases but very poorly in others. The SKL has the best results in many cases; however, as mentioned in Chapter 2, this distance cannot be computed when two spectra have disjoint supports. In addition, methods that use logarithmic functions require more computational time than methods based directly on the spectra.
It is important to mention that these methods can be applied to big data, long time series or a large number of series. In general, the proposed methods are efficient in this sense. The computational complexity is O(n³T), where n is the number of time series to be clustered and T the length of each time series. This implies that the computational time does not increase exponentially, as it does with other methods.
Considering the properties of the TV distance and its performance when used as a dissimilarity measure in a clustering method, we consider the proposed procedures to be a good option for time series clustering.

3.4 Detection of transitions between spectra


In many instances, non-stationary time series present changes that do not
occur abruptly, but rather appear as transitions between stationary intervals.
In such cases, methods devised for the detection of changes usually produce
poor results. One example, which will be considered in detail in the next chapter, is the analysis of stationary periods for wave height data. The sea
surface is stationary only for a period of time, and when the environmental

conditions that produce it change, there is usually a slow transition, lasting


hours or days, to a new stationary state.
As an alternative to change-point detection, we propose looking at the
behavior of segments of time from a long time series, and using clustering
algorithms in order to detect periods having similar behavior. If these periods
are contiguous in time, then it is natural to consider them as part of a longer
stationary interval. This section is devoted to presenting a simulation method
for transitions, which is later used in a simulation experiment to evaluate the
performance of the procedure.

3.4.1 Simulation of transitions between two spectra


We propose a new method to simulate transitions between two stationary
periods of a time series. Our proposal is based on the definition of a Locally
Stationary Process.
From Definition 1.2, $X_{t,T}$ has a unique time-varying spectral density which is locally the same as the spectral density of $\tilde{X}_t(u)$. Furthermore, it has locally the same auto-covariance, since $\mathrm{cov}(X_{[uT],T}, X_{[uT]+k,T}) = c(u,k) + O(T^{-1})$ uniformly in u and k, where $c(u,k)$ is the covariance function of $\tilde{X}_t(u)$. This justifies taking $c(u,k)$ as the local covariance function of $X_{t,T}$ at time $u = t/T$.
We need a method that, given two different spectra, is able to simulate a
process that changes its spectrum from f1 to f2 during the transition period.
This can be reformulated equivalently in terms of the covariance functions
r1 and r2 .
Suppose that we have two independent processes $X_t^1$ and $X_t^2$, with covariance functions $r_1(h)$ and $r_2(h)$. Take
$$X_t = \sqrt{a(t)}\, X_t^1 + \sqrt{b(t)}\, X_t^2,$$
where a and b are slowly varying functions, $a(t) \approx a(t+h)$ and $b(t) \approx b(t+h)$ if h is small, with $a(0) = b(T) = 1$ and $a(T) = b(0) = 0$. Then it is easy to see that, for small values of h,
$$\mathrm{cov}(X_t, X_{t+h}) = \sqrt{a(t)\,a(t+h)}\, r_1(h) + \sqrt{b(t)\,b(t+h)}\, r_2(h) \approx a(t)\, r_1(h) + b(t)\, r_2(h).$$
So, $X_t$ is a process that has local covariance function $r_1$ for t near 0 and $r_2$ for t near T.
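A minimal sketch of this mixing construction in R follows, using two stationary AR(2) processes as stand-ins for processes with spectra f1 and f2; the AR coefficients are illustrative and not part of the thesis.

```r
# Sketch: transition process X_t = sqrt(a(t)) X_t^1 + sqrt(b(t)) X_t^2,
# with AR(2) surrogates in place of processes with prescribed sea-wave spectra.
set.seed(1)
n  <- 5000
x1 <- arima.sim(list(ar = c(0.9, -0.5)), n = n)  # process with covariance r1
x2 <- arima.sim(list(ar = c(0.2,  0.5)), n = n)  # process with covariance r2
a  <- 1 - (1:n) / n                              # slowly varying, a ~ 1 at the start, 0 at the end
b  <- 1 - a
x  <- sqrt(a) * x1 + sqrt(b) * x2                # starts with spectrum f1, ends with f2
```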
Figure 3.10: Simulation of a transition between two spectral densities. (a) JONSWAP spectra involved: the transition starts with Tp = 5 (blue curve) and finishes with Tp = 3.6 (red curve). (b) Estimated spectral densities from the simulated data during the transition.

Example 3.3. To test our method we have to produce data from a process that has a transition period, with a slow change from one spectrum to another. Take f1 and f2 as JONSWAP spectra with Hs = 1 in both cases and Tp = 5 and 3.6, respectively. We choose a(t) = 1 − t/T and b(t) = 1 − a(t), where T = 5 hours is the total observation time. Figure 3.10(a) shows the spectra involved in the transition, starting with the blue spectrum and finishing with the red spectrum. Figure 3.10(b) shows the estimated densities after we apply the algorithm; we can observe the form of the transition and how the process starts at f1 and finishes at f2.

3.4.2 Detection of transitions


Further simulation studies were carried out to assess the performance of the
clustering algorithm in the presence of transition periods. The main objective
was to gauge the performance when slow transitions between stationary
periods are present in a data set.
Experiment 4. The simulations were carried out using the JONSWAP and Torsethaugen families of spectra (see Torsethaugen, 1993; Torsethaugen and Haver, 2004). The latter is a family of bimodal spectra used in Oceanography, which accounts for the presence of swell and wind-generated waves, and was also developed to model spectra observed in North-Sea locations. In all cases, the significant wave height (Hs) was set to 1.
Figure 3.11: Elements of Experiment 4. (a) Spectra involved in the stationary periods: JONSWAP Tp = 3.6, JONSWAP Tp = 4.2 and Torsethaugen Tp = 5. (b) Sketch of the simulation sequence (100 points per period), alternating stationary periods and transitions.

The simulation has three stationary periods and two transitions between them; each transition lasts 3 hours from one stationary period to the next.
Stationary Period 1 - the simulated series starts with waves from a stationary period of 4 hours, from a JONSWAP spectrum with peak period Tp = 3.6;
Stationary Period 2 - the second stationary period corresponds to another 4 hours of waves drawn from a JONSWAP spectrum with Tp = 4.2; and
Stationary Period 3 - a third 4-hour stationary period, in this case from a bimodal family, a Torsethaugen spectrum with Tp = 5.0.
In this case, we simulate N = 1000 replicates and the sampling frequency
was set to 1.28 Hz.
Figure 3.11(b) shows a sketch of the simulation setting, where we get
one continuous signal. We start with the stationary period in red which
corresponds to the red spectrum in Figure 3.11(a), then a transition period
in gray color, and so on. Figure 3.11(a) plots the spectra involved in the
experiment for the stationary periods.
The test procedure is the following.

1. Each time series has 82944 time points, 4608 points per hour (18 hrs).

2. Data are divided into 30-minute intervals; each segment will be considered as an element to be clustered.

3. Then we have 36 segments, each of length T = 2304, and we apply the clustering methods to these segments (a sketch of this step is given after the list).
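A sketch of steps 2-3 in R, assuming `x` holds one simulated 82944-point record (for instance, produced by extending the mixing sketch above); the smoothing span is illustrative and stands in for the Parzen-window estimator used in the thesis.

```r
# Split the 18-hour record into 36 half-hour segments of 2304 points each
# and estimate a normalized spectrum for every segment.
seg.len  <- 2304
segments <- split(x, ceiling(seq_along(x) / seg.len))
spectra  <- lapply(segments, function(s) {
  sp <- spec.pgram(s, spans = 15, taper = 0, plot = FALSE)  # smoothed periodogram
  sp$spec / sum(sp$spec)                                    # normalize to unit mass
})
```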

First, we consider that there are just three genuine clusters, since
the transition periods, by definition, do not represent intervals with a
homogeneous behavior. We consider the TV distance in a hierarchical
algorithm with the complete link function and the HSM with the two possible
algorithms, HSM 1 and HSM 2.
Figure 3.12 shows the results obtained if we set the number of clusters to 3. Each plot shows, with the corresponding color (red, blue and black), the members assigned to the same cluster. For the three methods, the resulting clusters contain each one of the stationary periods: 0-4 hours, 7-11 hours and 14-18 hours. The beginnings of the transitions are mostly assigned to the previous stationary period (for example, the two segments from 4 to 5 hours are assigned to the same cluster as the first stationary period), while the ends of the transitions are assigned to the next stationary period. This is reasonable, since these are the most similar periods, respectively. The middle segments of the transitions seem to be assigned randomly to one of the two closest stationary periods.
It is interesting to observe that, in general, the elements in a cluster are
contiguous in time, even when no information about the time structure of
the series is included in the procedure, and the methods identify the changes
in the transition periods.
The problem of deciding whether the intervals close to the border belong
to a cluster or should be classified as transition periods requires a criterion
for deciding whether a given interval is “well classified” within a given cluster.
In Alvarez-Esteban et al. (2016a), we explore the use of the silhouette index,
proposed by Rousseeuw (1987), which gives a measure of the adequacy of
each point to its cluster.
Another approach that was also attempted was the use of trimming
procedures in the clustering process, as is considered in the work of Cuesta-
Albertos and Fraiman (2007) for functional data. In this context, the
spectral densities would be the functional data to be classified. The trimming
procedure “discards” a certain fraction of the information in the classification
process, in order to robustify the result, and it seems reasonable to consider
the trimmed information as data objects that do not fit properly within any
of the clusters. In consequence they could be labelled as transition periods.
An important shortcoming of this method is the long computation time it requires even for moderately sized samples, and therefore the difficulty in handling real-life data.
As an alternative, one could consider that there should be five clusters, three corresponding to the stationary periods and two to the transitions.
Figure 3.12: Members of groups 1 (red), 2 (blue), and 3 (black) when we consider only 3 clusters. (a) TV distance in a hierarchical algorithm (complete linkage). (b) HSM method with the single version. (c) HSM method with the average version.
Figure 3.13: Members of groups 1 (red), 3 (blue), and 5 (black), which correspond to each one of the stationary periods. (a) TV distance in a hierarchical algorithm (complete linkage). (b) HSM method with the single version. (c) HSM method with the average version.

However, the transitions can be difficult to identify as clusters, since there is not complete homogeneity between the elements of a transition.
Figure 3.13 shows the results with 5 clusters. Each plot shows, with the corresponding color (red, blue and black), the clusters assigned to the stationary periods; in gray we represent the two clusters that should correspond to the transitions. As we can see, none of the methods is able to recover the complete transitions. However, the results are reasonable, in the sense that the beginning of each transition is assigned to the previous stationary period and its end to the next stationary period.
The transition from a unimodal to a bimodal spectrum (transition 2) is more difficult to identify. The hierarchical method with the TV distance performs better on the transitions, keeping the two central segments of a transition in the same cluster more often than the HSM method; the HSM method seems to have more difficulty in identifying slow transitions.
In practice, it is important to keep in mind that the beginning and end of a transition will be hard to identify if the transition is too slow.

3.5 Unknown number of clusters


An important point to discuss is how to choose the number of clusters. There is no universally best criterion that works for all clustering methods; many researchers are still working on improving the existing criteria or on proposing new ones. Usually the criteria are not fully automatic and depend on the problem.
The general theory of clustering offers some options to decide the number of clusters. The criteria are usually of two types, internal and external. For an external criterion, we need some prior information about the true clusters; an internal criterion is usually more reasonable, since in real data analysis we do not always have such information.
Dunn's Index. Among the indices that admit a dissimilarity matrix as input, we have selected Dunn's index, defined as
$$V_D(k) = \min_{1 \le i \le k} \left\{ \min_{i+1 \le j \le k} \left( \frac{D(C_i, C_j)}{\max_{1 \le h \le k} \operatorname{diam}(C_h)} \right) \right\},$$
where k is the number of clusters, $D(C_i, C_j)$ is the distance between clusters $C_i$ and $C_j$, and $\operatorname{diam}(C_h)$ is the diameter of cluster $C_h$.
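A direct implementation of this index is straightforward. The sketch below assumes the common convention of taking D(Ci, Cj) as the minimum pairwise dissimilarity between clusters and diam(Ch) as the maximum within-cluster dissimilarity; `d` is a matrix of TV distances and `cl` a vector of integer cluster labels.

```r
# Dunn's index from a dissimilarity matrix d and integer cluster labels cl.
dunn.index <- function(d, cl) {
  d <- as.matrix(d)
  k <- max(cl)
  diam <- sapply(1:k, function(h) max(d[cl == h, cl == h]))   # cluster diameters
  sep  <- Inf
  for (i in 1:(k - 1)) for (j in (i + 1):k)
    sep <- min(sep, min(d[cl == i, cl == j]))                 # closest pair of clusters
  sep / max(diam)
}
```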

From the definition of $V_D$ it is clear that high values point to suitable values of k. However, the maximum value of $V_D(k)$ is not always the best choice, especially when the data contain clusters that are close to each other. This situation is common in random sea waves, where consecutive stationary periods can have similar characteristics. All these issues mean that choosing the “optimal” k is not an automatic process.
Since this is a well known criterion, we will not present any simulation in this case. In applications, the results obtained with this index are similar to those of other indices, such as the Davies-Bouldin index.
Test based on the distribution of $\hat{d}_{TV}$. Another procedure for deciding the number of clusters is based on the bootstrap algorithm proposed in Chapter 3. We will use this methodology to approximate the distribution of the total variation distance between two clusters. Note that, due to the hierarchical structure of the algorithms used in all the methods proposed, the test

H0 : k − 1 Clusters vs HA : k Clusters,

is equivalent to the test,

H0 : 1 Cluster vs HA : 2 Clusters.

This is because the (k − 1) clusters are built by joining two of the k clusters.
The distribution of the total variation distance between two clusters
depends on the clustering procedure. When using the HSM method we aim
to approximate the distribution of the distance between the mean spectra in
each cluster while for the hierarchical clustering with the TV distance, we
need to produce samples from each cluster to approximate the distribution
of the distance calculated through the link function.
The procedure for this test is the following:

• Run the clustering procedure, either the HSM method or hierarchical clustering with the average or complete linkage.

• Identify the two clusters that are joined to obtain the (k − 1) clusters.

• Take as the estimate of the common spectrum, f̂, the mean spectrum over all elements in both clusters.

• Simulate, via the bootstrap procedure, the spectra needed to compute the TV distance. We consider two cases:

Case 1. When using the HSM method, simulate two spectral densities from the common spectrum f and compute the TV distance between them. We repeat this procedure M times.

Case 2. When using hierarchical clustering with the TV distance, simulate two sets of spectral densities of sizes g1 and g2 from the common spectrum f, where gi is the number of members in cluster i = 1, 2 (the clusters to be joined). We compute the link function (complete or average) between these two sets of spectra using the TV distance.

• Run the test proposed in Chapter 3 with the bootstrap sample.

                          Experiment 1
Test                       α      Complete   Average   HSM
1 cluster vs 2 clusters    0.01   1          1         1
                           0.05   1          1         1
                           0.1    1          1         1
2 clusters vs 3 clusters   0.01   0.052      0.154     0.008
                           0.05   0.206      0.492     0.058
                           0.1    0.382      0.670     0.164

Table 3.5: Proportion of times that the null hypothesis is rejected, in Experiment 1. Complete corresponds to the TV distance in a hierarchical algorithm with the complete link function, and Average with the average link.

Remark. Notice that this test assumes that there exists a common spectrum f.
To explore the performance of our proposals, we used Experiments 1 and 2. We consider the TV distance to feed a hierarchical algorithm with two different link functions, average and complete; we also consider the HSM method with the hierarchical spectral merger algorithm. In this case, we use only 500 replicates for each experiment.
Tables 3.5 and 3.6 present the proportion of times that the null hypothesis is rejected; to reject, we use a bootstrap quantile of probability α. We do not expect the proportion of rejections to be equal to α, since when using the complete or average link these values are not direct observations of the TV distance; however, we expect a good performance. In general, the procedure may overestimate the number of clusters.
In Experiment 1 the true number of clusters is 2. From Table 3.5, we observe that all methods reject the hypothesis of one cluster at all the

                          Experiment 2
Test                       α      Complete   Average   HSM
2 clusters vs 3 clusters   0.01   0.968      1         0.25
                           0.05   1          1         0.924
                           0.1    1          1         0.998
3 clusters vs 4 clusters   0.01   0.072      0.18      0.002
                           0.05   0.228      0.924     0.050
                           0.1    0.376      0.998     0.106

Table 3.6: Proportion of times that the null hypothesis is rejected, in Experiment 2.

significance levels. This means that the procedure will not underestimate the number of clusters. For the test of 2 vs 3 clusters, the proportion of rejections is high when we use the average link function, except in the case of α = 0.01; with the complete link the results are better, and the best results are those of the HSM method.
In Experiment 2 the true number of clusters is 3. This is a more difficult case, since the spectra are very close. From Table 3.6, when testing 2 vs 3 clusters, we observe that the complete and average link functions do not underestimate the number of clusters. However, the HSM method cannot distinguish 3 clusters at level α = 0.01, although it can at higher levels. For testing 3 vs 4 clusters, the HSM method and the complete link perform better; again, a small value of α is necessary for the average link to have a reasonable performance.
Figure 3.14 shows the p-values obtained by comparing the value observed in each simulation with the bootstrap distribution. We confirm that underestimating the number of clusters has low probability, almost zero in some cases, for the three methods. When the number of clusters being tested is the correct one (2 in Experiment 1 and 3 in Experiment 2), the p-values are widely distributed in the case of the complete link and the HSM method; with the average link, the p-values are smaller compared to the other methods. In general, this test performs well when one uses the complete link or the HSM method.

Permutation Test. Considering the same testing problem, as an alternative for the average link function we can use a permutation test (see Rudolph, 1995).
Figure 3.14: P-values obtained in the tests for the number of clusters using bootstrap samples: (a) Experiment 1 (tests 1 vs 2 and 2 vs 3 clusters), (b) Experiment 2 (tests 2 vs 3 and 3 vs 4 clusters), for the complete link, average link and HSM method.
          Experiment 1                                   Experiment 2
Test                       α      Average    Test                       α      Average
1 cluster vs 2 clusters    0.01   0.848      2 clusters vs 3 clusters   0.01   0.848
                           0.05   1                                     0.05   0.994
                           0.1    1                                     0.1    0.996
2 clusters vs 3 clusters   0.01   0.002      3 clusters vs 4 clusters   0.01   0.022
                           0.05   0.006                                 0.05   0.032
                           0.1    0.182                                 0.1    0.218

Table 3.7: Proportion of times that the null hypothesis is rejected using the permutation test, in Experiments 1 and 2.

Let $G_1 = \{f_1^1, f_2^1, \ldots, f_{n_1}^1\}$ and $G_2 = \{f_1^2, f_2^2, \ldots, f_{n_2}^2\}$ be two clusters, where $f_{i_j}^j$ is the spectral density of the time series $X_{i_j}^j$, a member of cluster $G_j$, $j = 1, 2$, $i_j = 1, \ldots, n_j$. If the two clusters belong to one bigger cluster, $G_1 \cup G_2 \subseteq G$, we can take subsamples of these clusters as follows.

• Let $G^* = G_1 \cup G_2 = \{f_1^1, \ldots, f_{n_1}^1, f_1^2, \ldots, f_{n_2}^2\}$.

• Draw $n_1$ elements of $G^*$ at random, each with probability $1/(n_1 + n_2)$, and assign them to $G_1^*$.

• Take $G_2^* = G^* \setminus G_1^*$.

• Finally, compute the average link function between $G_1^*$ and $G_2^*$.

• Repeat this procedure M times.

Then, the permutation test takes the link function values computed between the subsampled clusters as a sample from the distribution of our statistic, and rejects the null hypothesis using the quantiles of this sample.
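A sketch of this permutation scheme in R, assuming `tv` is the matrix of pairwise TV distances and `idx1`, `idx2` index the members of the two clusters:

```r
# Permutation test for the average link between two clusters.
avg.link <- function(tv, i1, i2) mean(tv[i1, i2])

perm.test <- function(tv, idx1, idx2, M = 1000) {
  obs  <- avg.link(tv, idx1, idx2)
  pool <- c(idx1, idx2)
  perm <- replicate(M, {
    s1 <- sample(pool, length(idx1))        # random relabeling of the pooled members
    avg.link(tv, s1, setdiff(pool, s1))
  })
  mean(perm >= obs)                         # permutation p-value
}
```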
Table 3.7 shows, for both experiments, the proportion of times that we reject the null hypothesis using the average link function and the permutation test. We observe that the level of the test is improved; however, it loses some power.
Remark. This test is not useful when the complete link function is used: due to the hierarchical algorithm, the maximum between the original groups will always be greater than or equal to the maximum over any of the subsamples.

3.6 Discussion

The use of the TV distance as a dissimilarity measure for clustering has shown good results compared to other dissimilarity measures proposed in the literature. In some of the experiments in the simulation study we obtained the best rates of success, or values close to the best. In addition, the clusters generated by our proposal have an intuitive interpretation in terms of real application problems. In the case of transitions, it is still difficult to identify the beginning or end of a transition; however, the results are acceptable and a good approximation of the true clusters if we consider a transition as a cluster. The HSM method does not seem to be a good option for the detection of transitions.
The choice of the number of clusters will always be complicated. However, the proposed test is a promising option; in particular, the HSM method performs well when this test is used to choose the number of clusters.
We proposed the use of time series clustering methods to detect changes in spectra. However, the resolution of the detected change point will depend on the length of the time series, since a reasonable number of time points is needed to obtain a good estimate of the spectral density.
The proposed methods are general time series clustering methods; they can be used to identify similarities in time and/or space. Moreover, they can be used to cluster any set of time series where the goal is to find similarities in spectra. These methods were proposed in Alvarez-Esteban et al. (2016b) and Euán et al. (2015); further details and discussion can be found there.
Chapter 4

Applications to Data

In this chapter we present the analysis of two different applications, both of which are commonly studied using the spectral analysis of time series; the spectral density has many interpretations in each case. First, we consider an application to the analysis of ocean wave data, and then we present the analysis of brain signals.

4.1 Ocean wave analysis


Random processes have been used to model sea waves since the 1950’s,
starting with the work of Pierson (1955) and Longuet-Higgins (1957). Models
based on random processes have proved useful, allowing the study of many
wave features (see, e.g., Ochi, 1998). A class of models often used to study sea waves in deep water under standard conditions is that of stationary centered Gaussian processes (Aage et al., 1999; Ochi, 1998). The intuitive idea was to consider a linear model based on the superposition of infinitely many elementary waves of the form
$$\zeta_n = \operatorname{Re}\left(A_n e^{i(\lambda_n x + \mu_n y + \omega_n t)}\right),$$
where $A_n$ is a centered Gaussian variable, Re denotes the real part and (x, y) is a specific location.
The stationarity hypothesis allows the use of Fourier spectral analysis to
study the wave energy distribution as a function of frequency. In particular,
this spectral analysis is related to several features of interest, such as the
significant wave height (Hs ) or the dominant or peak period (Tp ), that can
be computed from the spectral distribution (see, e.g. Ochi, 1998).


Figure 4.1: One interval of the data set taken by Buoy 106. (a) Buoy at Waimea Bay, Hawaii. (b) A 30-minute wave record (centered), sampled at 1.28 Hz. (c) Estimated spectrum of the wave process.

Gaussian models, beyond being a good first-order approximation, allow explicit expressions for the distribution of objects of interest. An accurate description of the statistical characteristics of the wave climate in a given region is an important input for the design of marine structures and ships, and also for the design of wave energy converters.
However, stationarity only holds for short time intervals, and frequently the changes are not abrupt; these changes can be considered as transitions between stationary periods. Natural questions are: how long can we consider a sea state to be stationary? How long can a transition last? We are therefore interested in detecting stationary and transition periods.
Typically, stationary sea states last for some time (hours or days), and
then, due to changing weather conditions, sea currents, the presence of swell
or other reasons, change to a different state. The idea of our analysis is
to identify short stationary intervals which have similar behavior, in terms
of their spectral densities. If these intervals are contiguous in time, then
it is reasonable to assume that they constitute a single (longer) stationary
interval.

4.1.1 Data description


The data at a fixed point (x, y) are recorded by buoys located in the ocean. Usually, data are sampled at a frequency of 1.28 Hz, which is approximately 5 data points every 4 seconds.

Figure 4.2: Buoy 106, January 2003. Significant wave height (Hs) and modal frequency (wp) of each segment.

We use raw wave height time series obtained from the U.S. Coastal Data Information Program (CDIP) website. The data considered were measured by a moored buoy, Buoy number 106 (number 51201 for the National Data Buoy Center), which is located in Waimea Bay, Hawaii, at a water depth of 200 m. We consider the month of January 2003. Figure 4.1 shows the wave height time series corresponding to a 30-minute interval and the estimated spectral density.

4.1.2 Results using the TV distance as a similarity measure
The complete recorded data set considered corresponds to 92 hours. In
Oceanography it is usual to divide long wave height records into shorter
intervals of between 20 to 30 minutes, and then calculate the spectral
density. These intervals are considered to be short enough for the stationarity
assumption to hold, yet long enough to have a reasonably accurate estimation
of the spectrum. Based on these estimated spectra, we can compute the significant wave height and modal frequency of each spectrum, which will serve as a summary of the data.
The significant wave height (Hs) is the mean wave height of the highest third of the waves; under the Gaussian assumption it is equal to
$$H_s = 4\sqrt{m_0}, \qquad \text{where } m_0 = \int_{-\infty}^{\infty} f(\omega)\, d\omega.$$
The modal frequency, $\omega_p$, is the frequency at which a wave spectrum reaches its maximum; it is the inverse of the peak period. Figure 4.2 shows Hs (black line) and $\omega_p = 1/T_p$ (blue line) computed for each time segment. From this plot we get a description of the behavior of the waves recorded by Buoy 106. During the first 50 hours the significant wave height stays mainly below 2.5 m, and for a long interval it is below 2 m. Around 50 hours it rises to about 4.5 m and stays above 3 m for the rest of the period. The dominant frequency decreases from 0.073 to 0.06 Hz, approximately.
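A small sketch of these summaries in R, assuming `sp` is the output of `spec.pgram()` for one 30-minute segment (the one-sided estimate is used in place of the two-sided integral, and the frequency units depend on how the sampling rate was specified):

```r
# Summary statistics of an estimated wave spectrum.
m0 <- sum(sp$spec) * diff(sp$freq[1:2])  # zeroth spectral moment (Riemann sum)
Hs <- 4 * sqrt(m0)                       # significant wave height
wp <- sp$freq[which.max(sp$spec)]        # modal frequency (inverse of peak period)
```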
Our goal is to find the stationary intervals and also look at the changes
in spectra between the different intervals. The procedure to analyze the data
is the following:

• Divide the data into segments of 30 minutes, i.e., 2304 time points.

• Each segment is considered as a unit, so we apply the clustering procedure to 192 “time series” (one per segment).

• We use the TV distance between the smoothed estimated spectra (Parzen window with bandwidth a = 100) to feed a hierarchical clustering algorithm with the complete or average link function (a sketch follows this list).

• If two consecutive (in time) segments are in the same cluster, they will be considered part of a stationary period.
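A sketch of this pipeline in R, assuming `tv` is the 192 x 192 matrix of TV distances between the smoothed, normalized spectra:

```r
# Hierarchical clustering of the 30-minute segments with the TV distance.
hc <- hclust(as.dist(tv), method = "complete")  # or method = "average"
cl <- cutree(hc, k = 6)                         # 6 clusters for the complete link
runs <- rle(cl)   # consecutive segments with the same label form
                  # candidate stationary periods
```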

We get two different (but similar) results, one with the complete link and the other with the average link. Figure 4.3 shows the value of Dunn's index in each case. The “best” number of clusters should be the one where Dunn's index reaches its maximum; however, the maximum value is not always the best choice. The two highest values were considered and, after analyzing the results, it was observed that in most cases the second highest value gave the best clustering (see Alvarez-Esteban et al., 2016b). So, for the complete link function we choose 6 clusters and for the average link function we choose 5 clusters. Figures 4.4 and 4.5 show the resulting dendrograms in each case, together with the resulting branches when we cut the tree at 6 and 5 clusters, respectively.
As in the simulation study (Experiment 4 in Section 3.4.2), most of the members of a cluster are contiguous segments in time, even though the time structure plays no role in the clustering algorithm.

Figure 4.3: Dunn's index for Buoy 106, computed for 2 to 10 clusters with the complete and average link functions.

A shortcoming of hierarchical clustering algorithms is that, once an element has been assigned to a cluster, it cannot be reassigned to a different one, even if changes in the composition of the clusters indicate that it would have been better classified in a different cluster.
The silhouette index, proposed by Rousseeuw (1987), gives a measure of the adequacy of each point to its cluster. Let a(i) be the average distance or dissimilarity of point i to all the other elements within the same cluster, and let b(i) be the smallest average dissimilarity of i to any of the clusters to which i does not belong. Then the silhouette index of i is defined as
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}.$$
This index satisfies $-1 \le s(i) \le 1$ for all i; large positive values indicate that the element has been well classified, while negative values point to misclassification. As a consequence, the classification of intervals with a negative silhouette index was revised; only a few segments were reassigned.
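A sketch of one way to carry out this revision step in R, using the silhouette implementation of the cluster package; `tv` and `cl` are the TV dissimilarity matrix and cluster labels assumed above, and moving a segment to its silhouette "neighbor" cluster is an illustrative choice, not necessarily the rule used in the thesis.

```r
# Revise segments with negative silhouette width by moving them to their
# "neighbor" cluster, as reported by cluster::silhouette.
library(cluster)
sil <- silhouette(cl, dmatrix = as.matrix(tv))
bad <- which(sil[, "sil_width"] < 0)   # candidates for reassignment
cl[bad] <- sil[bad, "neighbor"]
```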
Figures 4.6 and 4.7 show the results of the clustering procedure for the average and complete linkage functions, respectively, after the correction using the silhouette index.
Figure 4.4: Dendrogram of Buoy 106 using the complete link function; the index is the number of the segment (each segment is a 30-minute recording). Top: complete dendrogram. Bottom: low branches when the dendrogram is cut at 6 clusters.
Figure 4.5: Dendrogram of Buoy 106 using the average link function; the index is the number of the segment (each segment is a 30-minute recording). Top: complete dendrogram. Bottom: low branches when the dendrogram is cut at 5 clusters.

In part (a), we show which segments belong to each cluster with vertical lines in different colors, i.e., each color represents one cluster, and time segments with vertical lines of the same color are members of the same cluster. In part (b), we show (with the corresponding color) the estimated spectra of all members in a cluster and, in black, the mean spectrum.
As we mentioned before, the clustering procedure captures the time structure in the data using only information about the TV distance between normalized spectral densities. In addition, using either the complete or average link, the members of a cluster have very similar spectra and the method is able to identify small differences between clusters; for example, it can discriminate between unimodal and bimodal spectra.
From Figure 4.6, we observe that from 0 to 27 hours almost all segments belong to Cluster 1 (black); just a few of the members of Cluster 1 are mixed with Cluster 2 (red). This is reasonable, since both spectra are unimodal and the modal frequencies are close, that of Cluster 2 being smaller than that of Cluster 1. The members of Clusters 3 (blue), 4 (cyan) and 5 (magenta) are more mixed (in time) than the members of the other clusters; these could be related to a transition between Clusters 1 and 2. Finally, Cluster 6 (green) has a bimodal spectrum; however, we cannot give a precise interpretation because it is close to the border, and one should look at the following intervals.
In the case of the average link, we choose one cluster less than in the complete link case. We observe in Figure 4.7 that the clusters between 28 and 45 hours (3 and 4 in the complete link case) merge into one cluster, Cluster 3 (blue) in this case. However, some members of Cluster 1 (black) appear between Clusters 4 (cyan) and 2 (red). On the other hand, the average linkage function seems to produce clusterings that are more homogeneous in time than those obtained using the complete link, although further research in this respect is needed. Since Cluster 5 in the first case (magenta, complete link) and Cluster 4 in the second case (cyan, average link) are located where Hs increases and wp is moving, we could consider this a transition period. A possible conclusion from this analysis is therefore that there are three stable periods, 1-27, 28-45 and 52-89 hours, and that the other intervals correspond to transition periods.
This methodology has been applied to longer data series. The results show that the method is able to detect stable intervals, during which the distribution of the energy as a function of frequency has similar patterns, and also allows the identification of unstable or transition periods. This analysis gives statistical characteristics for the duration of stationary intervals, which may vary for different periods of the year. The complete analysis can be found in Alvarez-Esteban et al. (2016a).
Figure 4.6: Clustering result using the complete link function and 6 clusters. (a) Cluster membership of the time segments, shown on the Hs trajectory. (b) Mean spectrum (black) and member spectra of each of the 6 clusters.

Figure 4.7: Clustering result using the average link function and 5 clusters. (a) Cluster membership of the time segments, shown on the Hs trajectory. (b) Mean spectrum (black) and member spectra of each of the 5 clusters.


4.2 Clustering of EEG data


Brain activity following stimulus presentation and during resting state are
often the result of highly coordinated responses of large numbers of neurons
both locally (within each region) and globally (across different brain regions).
Coordinated activity of neurons can give rise to oscillations which are
captured by electroencephalograms (EEG).
Spectral analysis of time series is a natural approach for studying EEG data because it identifies the frequency oscillations that dominate the signal. It has many applications in neuroscience because EEG signals can be seen as a superposition of components oscillating at different frequencies. The range of frequencies that can be observed in a signal depends on the sampling frequency, usually measured in Hertz (number of cycles per second). The conventional frequency bands are as follows: delta (0-4 Hz), theta (4-8 Hz), alpha (8-12 Hz), beta (12-30 Hz) and gamma (30-50 Hz).
The analysis of EEG data is different from the ocean waves study, since
we have multichannel data and multiple trials/epochs (replicates), and the
changes in brain signals can be abrupt. The HSM method will produce
clusters of EEG channels according to the similarity of their spectra. The
resulting clusters serve as a proxy for segmenting the brain cortical surface
since the EEGs capture neuronal activity over a locally distributed region on
the cortical surface.
We will analyze a data set from a motor skill experiment (see Wu et al., 2014). The original study investigated how measures of cortical network function acquired at rest using dense-array EEG predict the subsequent acquisition of a new motor skill. Using partial least squares regression (PLS), the authors found that the coherence with the region of the left primary motor area in resting EEG was a strong predictor of motor skill acquisition. We follow their interest in analyzing the resting state, in particular the study of the spectral profiles during rest.
Our goal is to cluster resting-state EEG signals that are spectrally synchronized, i.e., that show similar spectral profiles, from subjects in this study. The participants here are healthy subjects whose EEG clustering will serve as a “standard” to which the clustering of stroke patients (with severe motor impairment) will be compared.
Figure 4.8: Brain regions defined in Wu et al. (2014): Left/Right Prefrontal (L_Pf, R_Pf), Left/Right Dorsolateral Prefrontal (L_dPf, R_dPf), Left/Right Pre-motor (L_PMd, R_PMd), Supplementary Motor Area (SMA), anterior SMA (aSMA), posterior SMA (pSMA), Left/Right Primary Motor Region (L_M1, R_M1), Left/Right Parietal (L_Pr, R_Pr), Left/Right Lateral Parietal (L_latPr, R_latPr), Left/Right Medial Parietal (L_medPr, R_medPr), Left/Right Anterior Parietal (L_antPr, R_antPr). Gray squared channels do not belong to any of these regions; the light blue regions correspond to the right and left occipital areas, and the light green region to the central occipital area.

Some specific questions of interest are:
(1.) How many spectrally synchronized clusters are there during the resting state?
(2.) Does the number of clusters remain fixed across epochs during the entire resting state?
(3.) Does the cluster membership of the channels evolve across the entire resting state?

4.2.1 Data description


The EEG channels were grouped into 19 pre-defined regions of the brain, as specified in Wu et al. (2014): prefrontal (left-right), dorsolateral prefrontal (left-right), pre-motor (left-right), supplementary motor area (SMA), anterior SMA, posterior SMA, primary motor region (left-right), parietal (left-right), lateral parietal (left-right), medial parietal (left-right) and anterior parietal (left-right). Figure 4.8 shows the locations of these regions on the cortical surface. The number of channels for the EEG data is 256.
The data were recorded from a dense array surface using a 256-lead Hydrocel net. The complete data set is formed by 17 right-handed individuals who were between 18 and 30 years of age.

Figure 4.9: One-second recording (1000 points) of a brain signal and the estimated spectrum. (a) Channel in the primary motor region. (b) Channel in the prefrontal region.

During the EEG-Rest period, the participants were asked to hold still with the forearms resting on the anterior thigh and to direct their gaze at a fixation cross displayed on the computer monitor. Data were recorded at 1000 Hz using a high input impedance Net Amp 300 amplifier (Electrical Geodesics) and Net Station 4.5.3 software (Electrical Geodesics), and were preprocessed: the continuous EEG signal was low-pass filtered at 100 Hz, segmented into non-overlapping 1-second epochs, and detrended. The original number of channels (256) had to be reduced to 194 because of the presence of artifacts in channels that could not be corrected (e.g. loose leads).
Smoothing the periodogram curves. To determine a reasonable value for the smoothing bandwidth, we adapted the Gamma-deviance generalised cross validation (Gamma GCV) criterion in Ombao et al. (2001) to the multi-channel setting. We applied the Gamma GCV criterion to each channel for all epochs. The trajectories of the Gamma GCV differed greatly between channels, because this criterion depends on the shape of the estimated spectra; there is no common optimal bandwidth for all channels. A minimum appears at a = 80. From the spectral estimation point of view, one could select a = 80 over a = 100. However, in our simulations, choosing the smaller bandwidth resulted in selecting unnecessarily many clusters. The choice of a slightly larger bandwidth, a = 100, gave better overall results.

Figure 4.10: Minimum value obtained at the k-th step of the algorithm for each epoch.

4.2.2 Results using the HSM method

We analyze the EEG recordings from a subject identified as “BLAK”. A comparison with another subject can be found in Euán et al. (2015). The entire resting state for each subject consisted of 160 epochs (each a 1-second recording); each epoch has 194 channels with T = 1000 time points. The HSM method (with the average version) was applied to each epoch.
To determine a reasonable number of clusters for these particular data, we use the analogue of the elbow of the scree plot, which in this case is the trajectory of the minimum value of the TV distance (Figure 4.10); this empirical criterion was proposed in Euán et al. (2015). To find the elbow we use the numerical derivative of the curves, an effective visual tool for selecting the number of clusters: we identify the first value of K where the numerical derivative falls below a small threshold (here we used 0.01, based on empirical evidence from simulations). In most of the epochs this value was equal to 9.
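A sketch of this elbow rule, assuming `min.tv` stores the trajectory of minimum TV distances obtained at each merging step of the HSM algorithm for one epoch:

```r
# Elbow criterion: first index where the numerical derivative of the
# trajectory falls below the threshold 0.01.
d1 <- diff(min.tv)
K  <- which(abs(d1) < 0.01)[1]
```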
Now, we verify this criterion with a test based on the bootstrap approximation of the distribution of $\hat{d}_{TV}$. In this situation it is convenient to choose the same number of clusters in all epochs, to make them comparable, even if, in some cases, some clusters are close to each other. The following table shows the number of epochs where the null hypothesis (9 clusters) is rejected.

α                  0.01   0.05   0.1
Epochs rejected    0      0      2

There is no significant evidence to reject 9 clusters in any of the epochs, so we take 9 as the number of clusters for all epochs.
Even though the number of clusters remains constant across epochs, the formation of the clusters (i.e., location, spatial distribution, specific channel memberships) may vary across epochs. In this EEG analysis, the total number of epochs was divided into three different phases of the resting state: early (epochs 1 to 50), middle (epochs 51-110) and late (epochs 111-160).
In Figure 4.11, we show the “affinity matrix”, which records the proportion of epochs in which a pair of channels belongs to the same cluster. The (i, j) element of the affinity matrix is the proportion of epochs in which channels i and j are clustered together, regardless of how they cluster with other channels. In the lower left corner of the affinity matrix there are a few small red squares that represent channels that are always clustered together and completely separated from the rest.
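A sketch of the affinity computation, assuming `labels` is a hypothetical channels-by-epochs matrix whose column e contains the cluster labels of epoch e:

```r
# Affinity matrix: proportion of epochs in which each pair of channels
# is assigned to the same cluster.
affinity <- matrix(0, nrow(labels), nrow(labels))
for (e in seq_len(ncol(labels)))
  affinity <- affinity + outer(labels[, e], labels[, e], "==")
affinity <- affinity / ncol(labels)
```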
It is evident that the clustering evolved across the three phases. The affinity matrices for the early and late phases show darker red colors with a wider spread than that of the middle phase.
The next step in our analysis is to compare the clustering results across the different phases of the resting state. Since there are 50 or more epochs per phase, in order to present a summary of the clustering results for each phase we focus only on a “representative” clustering. Using the affinity matrices (Figure 4.11), we take as the representative clustering the 9 clusters whose members remain most of the time in the same cluster. The procedure to obtain these clusters was a hierarchical cluster analysis with the complete linkage applied to the affinity matrices (considering each matrix as a similarity matrix).
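A sketch of this step, turning the affinity (similarity) matrix of one phase into a dissimilarity and cutting the resulting tree at 9 clusters:

```r
# Representative clustering for one phase from its affinity matrix.
rep.cl <- cutree(hclust(as.dist(1 - affinity), method = "complete"), k = 9)
```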
Figure 4.12 shows the formation of the clusters (location, spatial distribution and specific channel membership) and the shape of the corresponding spectral densities, coded in different colors, for subject BLAK in each phase.
Figure 4.11: Affinity matrices for BLAK_REST1 (9 clusters per epoch): proportion of epochs in which channels i and j belong to the same cluster, for the segments of epochs 1-50, 51-110 and 111-160.
Figure 4.12: Clustering results for BLAK's resting state during different phases: early (epochs 1-50), middle (51-110) and late (111-160). (a) Distribution of clusters across the cortical surface. (b) Mean spectral estimates across epochs by cluster.

Comparing the early and middle phases of the resting state, we note that the formation of clusters during these phases was heavily influenced by specific bands: seven (out of nine) clusters were dominated by the theta and alpha bands, while the formation of the remaining two clusters was also influenced by the gamma band. During the late phase, the influence of the alpha band was reduced in some of the clusters, but the influence of the delta and beta bands increased. The increased power in the beta band is interesting: it suggests that this subject was engaged in some cognitive task (which could not have been a response to an experimental stimulus, but something self-induced). A study relating the beta band to attention disorders is Barry et al. (2010), which reports decreased levels of absolute beta and gamma power during resting state in children with attention-deficit hyperactivity disorder (ADHD), compared to healthy controls.
The formation of the clusters on the cortical surface varies across the three phases of the resting state. In the early phase, channels in the left pre-motor region belong to one cluster (green) and most of the channels in the prefrontal and right pre-motor regions belong to another cluster (purple). However, the clustering structure in these regions changes during the middle phase, where the channels in the pre-motor regions (which were originally clustered with the other, non pre-motor channels) are assigned back with the rest of the pre-motor channels (dark blue cluster). As we transition from the middle to the late phase, channels that were assigned to the right pre-motor region revert back to the channels of the prefrontal region. These changes in cluster assignment were not entirely unexpected, since many of these channels lie at the boundaries between the two anatomical regions.
Also, some channels which belonged to the yellow cluster during the early phase switched to the orange cluster during the middle phase of the resting state; in this switch, the alpha and beta bands played the key roles. The late phase of the resting state shows more changes. For example, three channels located in the right occipital region switched from the yellow to the brown cluster, due to an increase of power in the alpha band and a decrease in the gamma band. Another interesting change appears in the prefrontal region, where three of the purple-colored channels switched to the light blue cluster and a new cluster was formed. This lets the dark blue channels of the middle phase go back to the purple ones, while the underlying process that characterized the dark blue cluster completely changes its location.
While some channels displayed dynamic behavior across phases, some clusters, such as the red and black ones, showed consistent membership. The red cluster is characterized by the presence of the delta and theta bands and small activity in the gamma band, while the black cluster was dominated by the theta and alpha bands.
This subject had low improvement during the task compared with the others. It is not possible to say whether this has implications for the change in perceptual improvement, since a causal analysis was not performed, but the presence of beta activity could produce a difference in the improvement of the individuals during the task.
The clusters produced are consistent, for the most part, with the anatomically based parcellation of the cortical surface, and thus cluster formation based on the spectra of the EEGs can be used to recover the spatial structure of the underlying brain process.
In addition, the HSM method has been used to analyze epileptic seizure data. This recording captures the brain activity of a subject who suffered a spontaneous epileptic seizure while connected to the EEG. The recording is digitized at 100 Hz and is about 500 seconds long, providing a time series of length T = 50000. We analyzed the multichannel electroencephalograms when they exhibit “non-stationary” behavior; our goal was to analyze the changes in the clustering of the EEG signals before, during and after the epileptic seizure. Using the HSM method, we observe that mostly the lower frequencies are involved before and after the epileptic seizure (approximately 90 seconds post seizure). In contrast, immediately following seizure onset, the higher frequency bands dictated the clustering distribution of the channels. Moreover, immediately following the seizure onset but before the last subinterval, the channels were clustered similarly, but the clustering was heavily influenced by the beta and gamma frequency bands. The complete analysis can be found in Ombao et al. (2016).
Conclusions
In this thesis we proposed using the total variation (TV) distance as a
similarity measure in clustering methods to detect similarities in spectra.
First, we studied the theoretical properties of the TV distance between
estimated spectra. We considered two asymptotic approximations of the
distribution of the TV distance, a modified version for small sample sizes
and a bootstrap procedure. The asymptotic convergence depends strongly
on the choice of the bandwidth, while the bootstrap procedure gives a
better approximation for all bandwidth values. We established a hypothesis
test which is able to detect differences in spectra and explored its power.
When the TV distance is used in a clustering method, the results are
satisfactory. The rate of correct classification is close to one; in some cases
it outperforms other alternatives, and in others it is as good as competing
distances. The proposed methods are efficient and have shown good
results in the simulation experiments.
We have used the proposed methods to analyze two data sets from
different areas. The first analysis is related to the study of ocean waves,
where the interest is to find stationary intervals. This goal was achieved using
the TV distance in a hierarchical clustering algorithm, where segments belonging
to the same cluster and contiguous in time were considered a stationary period.
The second analysis is related to the study of brain signals; here we used the
HSM method to detect channels of a dense EEG array that were spectrally
synchronized. In both case studies the results are good.
In general, the proposed methods have shown good performance in
detecting similarities in spectra, and they can also be seen as methods to detect
changes. The methodologies do not require very high computational time
when analyzing long time series or several time series. Even though we
explored just two applications, the methods are not limited to those
problems and could be used in other areas.

The work performed in this project provides many possible directions for
future research. They include:

• Extension of the theoretical results to a multivariate case.

• Clustering algorithms that consider time dependency between
segments or replicates.

• A more automatic method to distinguish between transitions and stable
periods in the study of ocean waves.

• A windowed or weighted TV distance between different
frequency bands.

• Exploring other dissimilarity measures to perform a clustering method
that gives intuitive interpretations in the study of EEG data, for
example “block coherence”.

These problems will be studied in future research projects.


Appendices

Appendix A

R Codes

The methods developed in this project were implemented in R. We present
some of the relevant code here. The first methodology, the TV distance
in a clustering algorithm, can be applied using standard hierarchical
clustering tools, such as the hclust function in base R (package stats) or the
agnes function from the cluster package, available on CRAN. To execute the
second method, the Hierarchical Spectral Merger (HSM) method, we developed
the HSMClust toolbox in R.
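
As a quick preview, hierarchical clustering from a TV dissimilarity matrix
needs only base R. The following is a minimal sketch, assuming a symmetric
matrix S of pairwise TV distances with zero diagonal (built step by step in
Section A.2):

tree <- hclust(as.dist(S), method = "complete")  # agglomerative clustering
groups <- cutree(tree, k = 2)                    # cut the dendrogram into 2 clusters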

A.1 Computing the TV distance


The HSMClust toolbox has a function to estimate the spectral density using
the lag window estimator with a Parzen window. After the estimation
procedure, we compute the TV distance between two normalized spectral
densities using the function TVD.
TVD (Total variation distance)

Description:
Computes the total variation distance between f1 and f2 with respect
to the values w using the trapezoidal rule.

Usage: TVD(w, f1, f2)

Arguments:
w - Sorted vector of w values.

f1,f2 - Numeric vectors with the values of f1(w) and f2(w) which
are going to be compared.

*f1, f2 and w must have the same length. f1 and f2 must be
normalized functions.
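
For reference, the computation can be sketched in a few lines (an
illustrative implementation assumed for this document; the packaged TVD
function may differ in details):

tvd_sketch <- function(w, f1, f2) {
  # TV distance: 0.5 * integral of |f1(w) - f2(w)| dw, trapezoidal rule
  d <- abs(f1 - f2)
  0.5 * sum(diff(w) * (d[-1] + d[-length(d)]) / 2)
}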

spec.parzen (Smoothed periodogram using a Parzen window)

Description:
One-sided estimated spectrum using a lag window estimator with a
Parzen window.

Usage:
spec.parzen(x, a = 100, dt = 1, w0 = 10^(-5), wn = 1/(2 * dt), nn = 512)

Arguments
x - Time series.

a - Bandwidth value.

dt - Sampling interval. Also, dt=1/Fs where Fs is the sampling
frequency. Default value is 1.

w0,wn - Range of frequencies of interest. By default (10^-5, Fs/2),
where Fs is the sampling frequency.

nn - Number of evaluated frequencies in (w0,wn).

Value
A matrix of 2 columns and nn rows, where the first column corresponds
to the grid of frequencies and the second column corresponds to the
spectrum at those frequencies.
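
To fix ideas, a lag window estimator with a Parzen window can be sketched
as follows (an illustrative re-implementation under stated assumptions; the
packaged spec.parzen may differ in details such as normalization):

parzen_win <- function(u) {
  # Parzen lag window on [-1, 1]
  au <- abs(u)
  ifelse(au <= 0.5, 1 - 6*u^2 + 6*au^3,
         ifelse(au <= 1, 2*(1 - au)^3, 0))
}
spec_parzen_sketch <- function(x, a = 100, dt = 1, nn = 512) {
  # sample autocovariances up to lag a
  g <- acf(x, lag.max = a, type = "covariance", plot = FALSE)$acf
  w <- seq(10^(-5), 1/(2*dt), length.out = nn)   # frequency grid in Hz
  f <- sapply(w, function(wi)
    dt * (g[1] + 2 * sum(parzen_win((1:a)/a) * g[-1] *
                         cos(2*pi*wi*(1:a)*dt))))
  cbind(w, f)   # same layout as spec.parzen: frequencies and spectrum
}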
Examples.

##TVD between two normal densities


w<-seq(0,5,length=1000)
f1<-dnorm(w,2,.5)
f2<-dnorm(w,2.5,.5)
diss<-TVD(w,f1,f2)
plot(w,f1,type="l",lwd=2,col=2,main=paste("TVD =",round(diss,3)),
xlab="x",ylab="")
lines(w,f2,col=3,lwd=2)

##TVD between the normalized estimated spectra of two AR(2) processes

X1<-Sim.Ar(1000,12,1.01,100)   # simulate two AR(2)-type series
X2<-Sim.Ar(1000,15,1.01,100)
fest1<-spec.parzen(X1,a=300,dt=1/100)  # lag window estimates, Fs = 100 Hz
fest2<-spec.parzen(X2,a=300,dt=1/100)
# normalize each spectrum by the series variance before comparing
diss<-TVD(fest1[,1],fest1[,2]/var(X1),fest2[,2]/var(X2))
plot(fest1[,1],fest1[,2]/var(X1),type="l",lwd=2,col=2,
main=paste("TVD=",round(diss,3)),xlab="w (Hz)",ylab="",
ylim=c(0,max(fest1[,2]/var(X1),fest2[,2]/var(X2))))
lines(fest2[,1],fest2[,2]/var(X2),col=3,lwd=2)

A.2 Methods
We present the code related to the example in Chapter 3.

Example 1.
#################
set.seed(2786)
library(HSMClust)
normaliza<-function(f,w){
# normalize f so that it integrates to one over w (trapezoidal rule)
nor<-((w[2:length(w)]-w[1:(length(w)-1)])%*%
(f[2:length(w)]+f[1:(length(w)-1)])/2)
return(f/nor)
}
#################

# Simulated Data
M<-1.05
eta1<-.053
eta2<-.06
Time<-1000
k<-2
nk<-3
X<-matrix(0,nrow=Time,ncol=k*nk)
for(i in seq(1,k*nk,2))X[,i]<-Sim.Ar(Time,eta1,M)
for(i in seq(1,k*nk,2)+1)X[,i]<-Sim.Ar(Time,eta2,M)

# TV distance in a clustering algorithm

# 1- Compute the dissimilarity matrix


Fest_aux<-apply(scale(X,scale=FALSE),2,spec.parzen,a=100,dt=1,nn=512)
Fest<-Fest_aux[513:1024,]
w<-Fest_aux[1:512,1]
matplot(w,Fest,type="l",lwd=3,xlim=c(0,.2),xlab="w (Hz)",ylab="",col=1,
main="Estimated Spectra",lty=1)
FestMN<-apply(Fest,2,normaliza,w=w)
S<-matrix(0,k*nk,k*nk)
for(i in 1:(k*nk))for(j in i:(k*nk))S[i,j]<-TVD(w,FestMN[,i],FestMN[,j])
S[lower.tri(S)]<-t(S)[lower.tri(S)]

# 2- Execute the hierarchical algorithm

library(cluster)
library(dendroextras)
require(clv)
arbol<-agnes(S,diss=TRUE,method='complete',keep.diss=200)
arbol2<-agnes(S,diss=TRUE,method='average',keep.diss=200)

# 3- Results
clus<-slice(as.dendrogram(arbol),k=2)
clus2<-slice(as.dendrogram(arbol2),k=2)

# HSM Method

ClustHSM<-HSM(X)
cutk(ClustHSM,2)

The HSM method is implemented in the function HSM and we get k
clusters with the cutk function.

HSM (Hierarchical spectral merger algorithm)

Description:
Compute the hierarchical merger clustering algorithm or
the hierarchical spectral merger clustering algorithm
for a set of time series X.

Usage:
HSM(X, freq = 1, Merger = 1, par.spectrum = c(100, 1/(2 * dt), 512))

Arguments

X - Matrix of time series, the series should be located by column.

freq - Sampling Frequency. Default value is 1.

Merger - If Merger==1 (default), the algorithm will estimate the
new spectral density from the concatenated signals, in order to get
a better estimate of the original spectral density. If Merger==2,
the algorithm will estimate the new spectral density as the mean
spectrum over all time series in the cluster.

par.spectrum - A vector of length 3 with the parameters for
the estimation:
par.spectrum[1] = bandwidth value,
par.spectrum[2] = maximum evaluated frequency,
par.spectrum[3] = length of the grid of frequency values.

Value:
A HSM object with the following variables:
Diss.Matrix = Initial dissimilarity matrix.
min.value = trajectory of the minimum value.
Groups = list with the grouping structure at each step.
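
For intuition, the merging loop behind HSM can be outlined as follows (an
illustrative sketch assuming spectra are averaged at each merge, in the spirit
of Merger==2, and using the tvd_sketch function of Section A.1; the packaged
implementation differs in details and also stores the grouping structure):

hsm_sketch <- function(w, specs) {
  # specs: list of normalized spectra on the common grid w
  min.value <- numeric(0)
  while (length(specs) > 1) {
    k <- length(specs)
    D <- matrix(Inf, k, k)
    for (i in 1:(k - 1)) for (j in (i + 1):k)
      D[i, j] <- tvd_sketch(w, specs[[i]], specs[[j]])
    ij <- which(D == min(D), arr.ind = TRUE)[1, ]  # closest pair of clusters
    min.value <- c(min.value, min(D))              # trajectory of minima
    specs[[ij[1]]] <- (specs[[ij[1]]] + specs[[ij[2]]]) / 2  # merge spectra
    specs[[ij[2]]] <- NULL                         # one cluster less
  }
  min.value
}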

cutk (K groups from HSM)

Description:
Returns k groups from a HSM object.

Usage
cutk(Clust, kg = NA, alpha = NA)
Arguments

Clust - Output from HSM.

kg - Number of groups.

alpha - TVD value before the next clustering step.


Appendix B

Effect of Sampling Frequency

If a stationary process $X(t)$ with continuous spectral density $f(\omega)$
is sampled with a sampling frequency $F_s$, the observed sequence is

$$X_d(t) = \sum_{m=-\infty}^{\infty} X(t)\,\delta(t - m\,dt), \qquad \text{(B.1)}$$

where $dt = 1/F_s$ and $\delta(u)$ is the impulse function or Dirac delta
function, which satisfies

$$\delta(u) = 0 \ \text{ if } u \neq 0, \qquad \int \delta(u)\,du = 1.$$

Then, the spectral density of the discrete signal can be written as a folding
of the original spectral density,

$$f_d(\omega) = \sum_{m=-\infty}^{\infty} f(\omega - m/dt),$$

for $0 < \omega \leq F_s/2$.

Remark. Notice that we cannot observe the presence of frequencies larger
than $F_s/2$, since we need more than two sampling points per period to
observe a complete cycle.
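
A small numerical illustration of this folding effect (aliasing), assuming a
sinusoidal signal: with $F_s = 10$ Hz, an 8 Hz cosine is indistinguishable
from a 2 Hz cosine at the sample points, since $8 = 10 - 2$ folds below the
Nyquist frequency $F_s/2 = 5$ Hz.

Fs <- 10; dt <- 1/Fs
t  <- seq(0, 2, by = dt)
x8 <- cos(2*pi*8*t)   # 8 Hz, above the Nyquist frequency
x2 <- cos(2*pi*2*t)   # 2 Hz alias
max(abs(x8 - x2))     # zero up to numerical error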

B.1 Discrete Fourier Transform

The periodogram is the modulus of the Discrete Fourier Transform (DFT);
therefore, it is important to make some remarks on the estimation procedure
when the sampling frequency is different from one (see Mandal and Asif, 2007).
Consider the Fourier transform of (B.1), using the following calculation. Let
$\hat{X}_d(\omega)$ be the Fourier transform of the discrete signal,
$\hat{X}(\omega)$ the Fourier transform of the continuous signal $X(t)$, and
$\hat{g}(\omega)$ the Fourier transform of
$\sum_{m=-\infty}^{\infty} \delta(t - m\,dt)$. Then,

$$\begin{aligned}
\hat{X}_d(\omega) &= \hat{X}(\omega) * \hat{g}(\omega) \\
&= \hat{X}(\omega) * \frac{2\pi}{dt} \sum_{m=-\infty}^{\infty} \delta\left(\omega - m\,\frac{2\pi}{dt}\right) \qquad \text{(B.2)} \\
&= \frac{2\pi}{dt} \sum_{m=-\infty}^{\infty} \hat{X}(\omega) * \delta\left(\omega - m\,\frac{2\pi}{dt}\right) \\
&= \frac{2\pi}{dt} \sum_{m=-\infty}^{\infty} \int \hat{X}(\omega^*)\,\delta\left(\omega - \omega^* - m\,\frac{2\pi}{dt}\right) d\omega^* \\
&= \frac{2\pi}{dt} \sum_{m=-\infty}^{\infty} \hat{X}\left(\omega - m\,\frac{2\pi}{dt}\right). \qquad \text{(B.3)}
\end{aligned}$$

To obtain (B.2), we use the Fourier transform of the impulse train
$\sum_{m=-\infty}^{\infty} \delta(t - m\,dt)$, which is

$$\frac{2\pi}{dt} \sum_{m=-\infty}^{\infty} \delta\left(\omega - m\,\frac{2\pi}{dt}\right).$$

To obtain (B.3) we use a property of the Dirac function,

$$\int h(t)\,\delta(t - \tau)\,dt = h(\tau).$$

Finally, we get that

$$\hat{f}_d(\omega) = \frac{2\pi}{dt} \sum_{m=-\infty}^{\infty} f\left(\omega - m\,\frac{2\pi}{dt}\right).$$

Hence, when we use the Fourier transform of the signal (periodogram) to
estimate the spectral density, we need to consider a scaling factor of
$dt/2\pi$; that is, when the sampling interval is $dt$, the periodogram
should be based on $\frac{dt}{2\pi}\,\hat{X}_d(\omega)$.
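
As an illustration, a raw periodogram incorporating this scaling can be
sketched as below (an assumed form following the derivation above; built-in
routines such as spec.pgram use their own normalization conventions):

periodogram_dt <- function(x, dt = 1) {
  n <- length(x)
  X <- fft(x)                          # discrete Fourier transform
  I <- (dt / (2*pi)) * Mod(X)^2 / n    # modulus squared of the scaled DFT
  w <- (0:(n - 1)) / (n * dt)          # frequency grid in Hz
  keep <- w <= 1/(2*dt)                # keep frequencies up to Fs/2
  cbind(w[keep], I[keep])
}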

Bibliography

Aage, C., Allan, T., Carter, D., Lindgren, G., and Olagnon, M. (1999). Oceans from
Space: A textbook for offshore engineers and naval architects. Edition Ifremer.

Alvarez-Esteban, P. C., Euán, C., and Ortega, J. (2016a). Statistical analysis of
stationary intervals for random waves. In Proceedings of the 26th International
Offshore and Polar Engineering Conference (to appear).

Alvarez-Esteban, P. C., Euán, C., and Ortega, J. (2016b). Time series
clustering using the total variation distance with applications in Oceanography.
Environmetrics (to appear).

Alvarez-Esteban, P. C., Matrán, C., del Barrio, E., and Cuesta-Albertos, J. A.
(2012). Similarity of samples and trimming. Bernoulli, 18(2):606–634.

Barry, R., Clarke, A., Hajos, M., McCarthy, R., Selikowitz, M., and Dupuy,
F. (2010). Resting-state EEG gamma activity in children with attention-
deficit/hyperactivity disorder. Clinical Neurophysiology, 121(11):1871–1877.

Basalto, N. and De Carlo, F. (2006). Practical Fruits of Econophysics: Proceedings
of the Third Nikkei Econophysics Symposium, chapter Clustering financial time
series, pages 252–256. Springer Tokyo, Tokyo.

Bengtsson, T. and Cavanaugh, J. E. (2008). State-space discrimination and
clustering of atmospheric time series data based on Kullback information
measures. Environmetrics, 19(2):103–121.

Bloomfield, P. (1976). Fourier analysis of time series: an introduction. Wiley
Series in Probability and Statistics - Applied Probability and Statistics Section.
Wiley.

Brillinger, D. R. (1981). Time series: data analysis and theory. Holden-Day, Inc.,
Oakland, Calif., second edition.

Brockwell, P. J. and Davis, R. A. (2006). Time series: theory and methods.
Springer, New York. Reprint of the second (1991) edition.

Brodtkorb, P. A., Johannesson, P., Lindgren, G., Rychlik, I., Rydén, J., and Sjö,
E. (2011). WAFO - a Matlab toolbox for analysis of random waves and loads.
Mathematical Statistics, Centre for Mathematical Sciences, Lund University.

Caiado, J., Maharaj, E. A., and D'Urso, P. (2015). Handbook of Cluster
Analysis, chapter Time Series Clustering, pages 241–263. Chapman & Hall/CRC
Handbooks of Modern Statistical Methods. Taylor & Francis.

Corduas, M. (2011). Clustering streamflow time series for regional classification.
Journal of Hydrology, 407(1–4):73–80.

Cuesta-Albertos, J. A. and Fraiman, R. (2007). Impartial trimmed k-means for
functional data. Computational Statistics and Data Analysis, 51(10):4864–4877.

Dahlhaus, R. (1997). Fitting time series models to nonstationary processes. The
Annals of Statistics, 25(1):1–37.

Dahlhaus, R. (2000). A likelihood approximation for locally stationary processes.
The Annals of Statistics, 28(6):1762–1794.

Dahlhaus, R. (2011). Locally Stationary Processes. ArXiv e-prints.

Davis, R. A., Lee, T. C. M., and Rodriguez-Yam, G. A. (2006). Structural
break estimation for nonstationary time series models. Journal of the American
Statistical Association, 101(473):223–239.

Dette, H. and Hildebrandt, T. (2012). A note on testing hypotheses for stationary
processes in the frequency domain. Journal of Multivariate Analysis, 104(1):101–114.

Dette, H. and Paparoditis, E. (2009). Bootstrapping frequency domain tests in
multivariate time series with an application to comparing spectral densities.
Journal of the Royal Statistical Society: Series B (Statistical Methodology),
71(4):831–857.

Dietrich, C. R. and Newsam, G. N. (1997). Fast and exact simulation of stationary
Gaussian processes through circulant embedding of the covariance matrix. SIAM
Journal on Scientific Computing, 18(4):1088–1107.

Euán, C., Ombao, H., and Ortega, J. (2015). Spectral synchronicity in brain
signals. arXiv:1507.05018v1.

Euán, C., Ortega, J., and Alvarez-Esteban, P. C. (2014). Detecting stationary
intervals for random waves using time series clustering. In Proceedings of the
33rd International Conference on Ocean and Arctic Engineering, pages 1–7.
ASME.

Gavrilov, M., Anguelov, D., Indyk, P., and Motwani, R. (2000). Mining the stock
market: which measure is best? In Proceedings of the 6th ACM International
Conference on Knowledge Discovery and Data Mining, pages 487–496.

Gibbs, A. L. and Su, F. E. (2002). On choosing and bounding probability metrics.
International Statistical Review, 70(3):419–435.

Gnedenko, B. V. and Kolmogorov, A. N. (1968). Limit distributions for sums
of independent random variables. Translated from the Russian, annotated,
and revised by K. L. Chung. With appendices by J. L. Doob and P. L. Hsu.
Revised edition. Addison-Wesley Publishing Co., Reading, Mass.-London-Don
Mills, Ont.

Hasselmann, K., Barnett, T., Bouws, E., Carlson, H., Cartwright, D., Enke,
K., Ewing, J., Gienapp, H., Hasselmann, D., Kruseman, P., Meerburg,
A., Müller, P., Olbers, D., Richter, K., Sell, W., and Walden, H. (1973).
Measurements of wind-wave growth and swell decay during the Joint North Sea
Wave Project (JONSWAP). Deutschen Hydrographischen Zeitschrift 12, Deutsches
Hydrographisches Institut Hamburg.

Jentsch, C. and Pauly, M. (2012). A note on using periodogram-based distances for
comparing spectral densities. Statistics and Probability Letters, 82(1):158–164.

Kreiss, J.-P. and Paparoditis, E. (2015). Bootstrapping locally stationary processes.
Journal of the Royal Statistical Society. Series B. Statistical Methodology,
77(1):267–290.

Lachiche, N., Hommet, J., Korczak, J., and Braud, A. (2005). Neuronal clustering
of brain fMRI images. Pattern Recognition and Machine Intelligence: Lecture
Notes in Computer Science, 3776:300–305.

Last, M. and Shumway, R. (2008). Detecting abrupt changes in a piecewise locally
stationary time series. Journal of Multivariate Analysis, 99(2):191–214.

Lavielle, M. (1999). Detection of multiple changes in a sequence of dependent
variables. Stochastic Processes and their Applications, 83(1):79–102.

Lavielle, M. and Ludeña, C. (2000). The multiple change-points problem for the
spectral distribution. Bernoulli, 6(5):845–869.

Leone, F. C., Nelson, L. S., and Nottingham, R. B. (1961). The folded normal
distribution. Technometrics, 3(4):543–550.

Liao, T. W. (2005). Clustering of time series data – a survey. Pattern Recognition,
38:1857–1874.

Longuet-Higgins, M. S. (1957). The statistical analysis of a random, moving
surface. Philosophical Transactions of the Royal Society of London A:
Mathematical, Physical and Engineering Sciences, 249(966):321–387.

Maharaj, E. A. and D'Urso, P. (2011). Fuzzy clustering of time series in the
frequency domain. Information Sciences, 181(7):1187–1211.

Mandal, M. and Asif, A. (2007). Continuous and discrete time signals and systems.
Cambridge University Press, New York, first edition.

Montero, P. and Vilar, J. (2014). TSclust: An R package for time series clustering.
Journal of Statistical Software, 62(1).

Ochi, M. K. (1998). Ocean waves: the stochastic approach. Cambridge, U.K.; New
York: Cambridge University Press.

Ombao, H. and Van Bellegem, S. (2008). Evolutionary coherence of nonstationary
signals. IEEE Transactions on Signal Processing, 56(6):2259–2266.

Ombao, H., Schröder, A. L., Euán, C., Ting, C.-M., and Samdin, B. (2016).
Handbook of Neuroimaging Data Analysis, chapter Advanced topics for modeling
electroencephalograms (to appear), pages 567–621. Chapman & Hall/CRC
Handbooks of Modern Statistical Methods. Taylor & Francis.

Ombao, H., von Sachs, R., and Guo, W. (2005). SLEX analysis of multivariate
nonstationary time series. Journal of the American Statistical Association,
100(470):519–531.

Ombao, H. C., Raz, J. A., Strawderman, R. L., and von Sachs, R. (2001). A simple
generalised cross-validation method of span selection for periodogram smoothing.
Biometrika, 88(4):1186–1192.

Paparoditis, E. (2010). Validating stationarity assumptions in time series analysis
by rolling local periodograms. Journal of the American Statistical Association,
105(490):839–851.

Pértega Díaz, S. and Vilar, J. A. (2010). Comparing several parametric and
nonparametric approaches to time series clustering: a simulation study. Journal
of Classification, 27(3):333–362.

Pierson, W. J. (1955). Wind generated gravity waves. Volume 2 of Advances in
Geophysics, pages 93–178. Elsevier.

Preuss, P., Vetter, M., and Dette, H. (2013). Testing semiparametric hypotheses
in locally stationary processes. Scandinavian Journal of Statistics. Theory and
Applications, 40(3):417–437.

Priestley, M. B. (1981). Spectral analysis and time series. Vol. 1. Academic Press,
Inc. [Harcourt Brace Jovanovich, Publishers], London-New York. Univariate
series, Probability and Mathematical Statistics.

R Core Team (2014). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria.

Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation
and validation of cluster analysis. Journal of Computational and Applied
Mathematics, 20:53–65.

Rudolph, P. E. (1995). Permutation tests: A practical guide to resampling methods
for testing hypotheses. Biometrical Journal, 37(2):150–150.

Savvides, A., Promponas, V. J., and Fokianos, K. (2008). Clustering of biological
time series by cepstral coefficients based distances. Pattern Recognition,
41(7):2398–2412.

Sergides, M. and Paparoditis, E. (2008). Bootstrapping the local periodogram of
locally stationary processes. Journal of Time Series Analysis, 29(2):264–299.

Shumway, R. H. and Stoffer, D. S. (2011). Time series analysis and its applications.
With R examples. Springer, New York, third edition.

Torsethaugen, K. (1993). A two-peak wave spectrum model. In Proceedings of
the International Conference on Offshore Mechanics and Arctic Engineering
(OMAE), volume II, pages 175–180.

Torsethaugen, K. and Haver, S. (2004). Simplified double peak spectral model
for ocean waves. In Proceedings of the 14th International Offshore and Polar
Engineering Conference, pages 23–28.

Wu, J., Srinivasan, R., Kaur, A., and Cramer, S. C. (2014). Resting-state cortical
connectivity predicts motor skill acquisition. NeuroImage, 91:84–90.

Xu, R. and Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions
on Neural Networks, 16(3):645–678.
