Bayesian Inference On Change Point Problems

by

Xiang Xuan

Master of Science
in
Computer Science
Table of Contents

Abstract
List of Figures
Acknowledgements
1 Introduction
  1.1 Problem Statement
  1.2 Related Works
    1.2.1 Hidden Markov Models
    1.2.2 Reversible Jump MCMC
  1.3 Contribution
  1.4 Thesis Outline
Bibliography
Acknowledgements
I would like to thank all the people who gave me help and support throughout
my degree.
First of all, I would like to thank my supervisor, Professor Kevin Murphy, for his
constant encouragement and rewarding guidance throughout this thesis work.
Kevin introduced me to this interesting problem. I am grateful to Kevin for
giving me a Bayesian education and directing me towards research in machine
learning. I am amazed by Kevin's broad knowledge and many quick and brilliant
ideas.
Secondly, I would like to thank Professor Nando de Freitas for dedicating his
time and effort to reviewing my thesis.
Thirdly, I would like to extend my appreciation to my colleagues for their friend-
ship and help, and especially to the following people: Sohrab Shah, Wei-Lwun
Lu and Chao Yan.
Last but not least, I would like to thank my parents and my wife for their end-
less love and support.
XIANG XUAN
Chapter 1
Introduction
The data on two successive segments can differ in the following ways.
Figure 1.1 shows four examples of changes over successive segments. In all
four examples, there is one change point at location 50 (black vertical solid
line) which separates 100 observations into two segments. The top left panel
shows an example of different model orders: the 1st segment is a 2nd order
autoregressive model and the 2nd segment is a 4th order autoregressive model.
The top right panel shows an example of the same model with different parameters:
both segments are linear models, but the 1st segment has a negative slope while
the 2nd segment has a positive slope. The bottom left panel shows an example
of the same model with a different noise level: both segments are constant models
with mean 0, but the noise level (the standard deviation) of the 2nd segment
is three times as large as that of the 1st segment. The bottom right panel is
an example of a different correlation between two series. We can see that the
two series are positively correlated in the 1st segment, but negatively correlated
in the 2nd segment.
Figure 1.1: Examples of possible changes over successive segments. The
top left panel shows a change in AR model order. The top right panel shows
a change in parameters. The bottom left panel shows a change in noise level.
The bottom right panel shows a change in the correlation between two series.
The aim of change point problems is to make inference about the number
and locations of change points.
Reversible jump is an MCMC algorithm which has been used extensively for
change point problems. It starts from an initial set of change points. At each step,
it can make the following three kinds of moves:
• Birth move to add a new change point or split a segment into two,
1.3 Contribution
The contributions of this thesis are:
Chapter 2
One Dimensional Time Series
Fearnhead [13, 14, 16] proposed three algorithms (offline, online exact and online
approximate) to solve the change point problems under PPM. They all first
calculate the joint posterior distribution of the number and positions of change
points P (K, S1:K |Y1:N ) using dynamic programming, then sample change points
from this posterior distribution by perfect sampling [22].
After sampling change points, making inference on the models and their parameters
over segments is straightforward.
The offline algorithm and the online exact algorithm run in O(N^2), and the online
approximate algorithm runs in O(N).
We assume (2.3) only depends on the distance between two change points. We
let the probability mass function for the distance between two successive
change points s and t be g(|t − s|). Furthermore, we define the cumulative
distribution function for the distance as follows,
G(l) = \sum_{i=1}^{l} g(i)   (2.4)
and assume that g() is also the probability mass function for the position of
the first change point. In general, g() can be any arbitrary probability mass
function with the domain over 1, 2, · · · , N − 1. Then g() and G() imply a prior
distribution on the number and positions of change points.
For example, if we use the Geometric distribution as g(), then our model implies
a Binomial distribution for the number of change points and a Uniform distri-
bution for the locations of change points. To see that, let’s suppose there are N
data points and we use a Geometric distribution with parameter λ. We denote
P (Ci = 1) as the probability of location i being a change point. By default,
position 0 is always a change point. That is,
P (C0 = 1) = 1 (2.5)
First, we show that the distribution for the location of change points is Uniform
by induction. That is, ∀i = 1, . . . , N
P (Ci = 1) = λ (2.6)
When i = 1, we only have one case: positions 0 and 1 are both change points.
Hence the length of the segment is 1. We have,
P(C_1 = 1) = g(1) = λ
Suppose ∀ i ≤ k, we have
P (Ci = 1) = λ (2.7)
Then, summing over the position of the last change point before k + 1,

P(C_{k+1} = 1) = λ(1 − λ)^k + λ^2 \frac{1 − (1 − λ)^k}{1 − (1 − λ)} = λ(1 − λ)^k + λ(1 − (1 − λ)^k) = λ   (2.10)
By induction, this proves (2.6). Next we show that the number of change points
follows a Binomial distribution.
Let's consider each position as a trial with two outcomes, either being a change
point or not. By (2.6), we know the probability of being a change point is
the same at every position. Then we only need to show that the trials are independent.
That is, ∀i, j = 1, . . . , N and i ≠ j,
When i < j, it is true by default, since the future will not change the history. When
i > j, we show it by induction on j.
When j = i − 1, we only have one case: positions j and i are both change points.
Hence the length of the segment is 1. We have,
Suppose ∀ j ≥ k, we have
where
P(C_i = 1 | C_t = 1, C_{k−1} = 1) = P(C_i = 1 | C_t = 1) = \begin{cases} λ & \text{if } t < i \\ 1 & \text{if } t = i \end{cases}

P(C_t = 1 | C_{k−1} = 1) = g(t − k + 1) = λ(1 − λ)^{t−k}   (2.14)
By induction, this proves (2.11). Hence the number of change points follows a
Binomial distribution.
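As a quick illustration (not from the thesis), this claim can be checked by simulation: draw segment lengths from a Geometric(λ) distribution and count how many change points fall among N positions; the counts should behave like a Binomial(N, λ) variable. A minimal Python sketch:

```python
import numpy as np

def count_changepoints(N, lam, n_trials=50_000, seed=0):
    """Monte Carlo check: with Geometric(lam) segment lengths, each of the N
    positions should be a change point with probability lam, independently,
    so the number of change points should be Binomial(N, lam)."""
    rng = np.random.default_rng(seed)
    counts = np.empty(n_trials, dtype=int)
    for trial in range(n_trials):
        pos, k = 0, 0
        while True:
            pos += rng.geometric(lam)   # distance to the next change point
            if pos > N:
                break
            k += 1
        counts[trial] = k
    return counts

counts = count_changepoints(N=20, lam=0.1)
print(counts.mean(), 20 * 0.1)          # empirical mean vs Binomial mean N*lam
print(counts.var(), 20 * 0.1 * 0.9)     # empirical variance vs N*lam*(1-lam)
```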
We assume that P (Ys+1:t |q) can be efficiently calculated for all s, t and q, where
s < t. In practice, this requires either conjugate priors on θq which allow us to
work out the likelihood function analytically, or fast numerical routines which
are able to evaluate the required integral. In general, for any data and
model, as long as we can evaluate the likelihood function (2.17), we can use
Fearnhead's algorithms. Our extensions are mainly based on this observation.
Now let’s look at some examples.
Linear regression models are among the most widely used models. Here, we
assume
Y_{s+1:t} = Hβ + ε   (2.18)
where ε is Gaussian noise with covariance σ^2 I_n, σ^2 has an Inverse
Gamma distribution with parameters ν/2 and γ/2, and the j-th component of the
regression parameter, β_j, has a Gaussian distribution with mean 0 and variance
σ^2 δ_j^2. This hierarchical model is illustrated in Figure 2.1. For simplicity,
we write Y_{s+1:t} as Y and let n = t − s. We have the following,
P(Y | β, σ^2) = \frac{1}{(2π)^{n/2} σ^n} \exp\left(-\frac{1}{2σ^2}(Y − Hβ)^T I_n^{-1}(Y − Hβ)\right)   (2.19)

P(β | D, σ^2) = \frac{1}{(2π)^{q/2} σ^q |D|^{1/2}} \exp\left(-\frac{1}{2σ^2} β^T D^{-1} β\right)   (2.20)
P(σ^2 | ν, γ) = \frac{(γ/2)^{ν/2}}{Γ(ν/2)} (σ^2)^{-ν/2-1} \exp\left(-\frac{γ}{2σ^2}\right)   (2.21)
Now let

M = (H^T H + D^{-1})^{-1}
P = (I − H M H^T)
(∗) = Y^T Y − 2Y^T Hβ + β^T H^T Hβ + β^T D^{-1}β

Then

(∗) = β^T (H^T H + D^{-1})β − 2Y^T Hβ + Y^T Y
    = β^T M^{-1}β − 2Y^T H M M^{-1}β + Y^T Y
    = β^T M^{-1}β − 2Y^T H M M^{-1}β + Y^T H M M^{-1} M^T H^T Y − Y^T H M M^{-1} M^T H^T Y + Y^T Y

Using the fact that M = M^T,

(∗) = (β − M H^T Y)^T M^{-1}(β − M H^T Y) + Y^T Y − Y^T H M H^T Y
    = (β − M H^T Y)^T M^{-1}(β − M H^T Y) + Y^T P Y
    = (β − M H^T Y)^T M^{-1}(β − M H^T Y) + ||Y||_P^2
Hence

P(Y | β, σ^2) P(β | D, σ^2) ∝ \exp\left(-\frac{1}{2σ^2}\left((β − M H^T Y)^T M^{-1}(β − M H^T Y) + ||Y||_P^2\right)\right)

Integrating out β gives

P(Y | D, σ^2) = \frac{1}{(2π)^{n/2} σ^n} \cdot \frac{(2π)^{q/2} σ^q |M|^{1/2}}{(2π)^{q/2} σ^q |D|^{1/2}} \exp\left(-\frac{1}{2σ^2}||Y||_P^2\right)
            = \frac{1}{(2π)^{n/2} σ^n} \left(\frac{|M|}{|D|}\right)^{1/2} \exp\left(-\frac{1}{2σ^2}||Y||_P^2\right)   (2.23)

and

P(Y | D, σ^2) P(σ^2 | ν, γ) ∝ (σ^2)^{-n/2-ν/2-1} \exp\left(-\frac{γ + ||Y||_P^2}{2σ^2}\right)

So the posterior for σ^2 is still Inverse Gamma with parameters (n + ν)/2 and
(γ + ||Y||_P^2)/2.
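For concreteness, here is a small Python sketch (not the thesis code) of the resulting segment marginal likelihood, with both β and σ^2 integrated out analytically using the standard conjugate results above; the function name and interface are illustrative only.

```python
import numpy as np
from scipy.special import gammaln

def log_marglik_linreg(Y, H, delta2, nu=2.0, gamma=0.02):
    """Log marginal likelihood P(Y_{s+1:t}) for the conjugate linear model:
    Y = H beta + eps, eps ~ N(0, sigma^2 I), beta_j ~ N(0, sigma^2 delta_j^2),
    sigma^2 ~ Inverse-Gamma(nu/2, gamma/2).  Y: (n,), H: (n, q), delta2: (q,)."""
    n, q = H.shape
    D = np.diag(delta2)
    M = np.linalg.inv(H.T @ H + np.linalg.inv(D))        # M = (H'H + D^{-1})^{-1}
    YPY = Y @ Y - Y @ H @ M @ (H.T @ Y)                  # ||Y||_P^2 = Y'PY
    logdetM = np.linalg.slogdet(M)[1]
    logdetD = np.linalg.slogdet(D)[1]
    return (-0.5 * n * np.log(2 * np.pi)
            + 0.5 * (logdetM - logdetD)                  # from integrating out beta
            + 0.5 * nu * np.log(gamma / 2.0) - gammaln(nu / 2.0)
            + gammaln((n + nu) / 2.0)                    # from integrating out sigma^2
            - 0.5 * (n + nu) * np.log((gamma + YPY) / 2.0))
```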
We use the following notations: if Y is a vector, then Ys:t denotes the entries
from position s to t inclusive. If Y is a matrix, then Ys:t,: denotes the s-th row
to the t-th row inclusive, and Y:,s:t denotes the s-th column to the t-th column
inclusive.
By multiplying (2.27) and (2.28), and letting S_Y = \sum_i Y_i, we have,
(2.36)
where log(Γ(α)) and α log(β) can be pre-computed, and S_Y can use the rank
one update mentioned earlier.
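As an illustration of such a closed form, the Poisson likelihood with a conjugate Gamma(α, β) prior on the rate (the setting used later for the coal mining data) gives the following segment marginal. This is a hedged sketch based on the standard Poisson–Gamma result, not a transcription of (2.36).

```python
import numpy as np
from scipy.special import gammaln

def log_marglik_poisson(Y, alpha=1.66, beta=1.0):
    """Log marginal likelihood of a segment Y under a Poisson likelihood with a
    conjugate Gamma(alpha, beta) prior on the rate lambda (standard result)."""
    Y = np.asarray(Y, dtype=float)
    n, SY = len(Y), Y.sum()
    return (alpha * np.log(beta) - gammaln(alpha)   # terms that can be pre-computed
            + gammaln(alpha + SY)
            - (alpha + SY) * np.log(beta + n)
            - gammaln(Y + 1).sum())                 # sum over log(Y_i!)
```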
where \sum_i X_i = N.
There are many other possible basis functions as well (e.g., a Fourier basis). Here
we just want to point out that basis functions provide a way to extend
Fearnhead's algorithms (e.g., a kernel basis).
When s < N,

Q(s) = \sum_{t=s}^{N-1} \sum_{q} P(Y_{s:t}|q) π(q) Q(t+1) g(t−s+1) + \sum_{q} P(Y_{s:N}|q) π(q) (1 − G(N−s))   (2.43)

where π(q) is the prior probability of model q, and g() and G() are defined in
(2.4).
The reason behind (2.43) is that (dropping the explicit conditioning on a change
point at s − 1 for notational convenience)

Q(s) = \sum_{t=s}^{N-1} P(Y_{s:N}, \text{next change point is at } t) + P(Y_{s:N}, \text{no further change points})   (2.44)

and

P(Y_{s:N}, \text{next change point is at } t) = g(t−s+1) P(Y_{s:t} | s, t \text{ form a segment}) P(Y_{t+1:N} | \text{next change point is at } t)
= \sum_{q} P(Y_{s:t}|q) π(q) Q(t+1) g(t−s+1)
After we calculate Q(s) for all s = 1, . . . , N , we can simulate all change points
forward. To simulate one realisation, we do the following,
1. Set τ0 = 0, and k = 0.
4. If τk < N return to step (2); otherwise output the set of simulated change
points, τ1 , τ2 , . . . , τk .
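Putting the recursion (2.43) and the forward simulation together, a minimal Python sketch of the offline algorithm might look as follows. The helper log_seg(s, t), returning log Σ_q P(Y_{s:t}|q)π(q), the treatment of the final segment, and the convention G(0) = 0 are assumptions made here for illustration, not part of the thesis.

```python
import numpy as np

def offline_sampler(log_seg, N, g, G, n_samples=1, seed=0):
    """Backward recursion for Q(s) as in (2.43), then forward simulation of
    change points.  log_seg(s, t): log of sum_q P(Y_{s:t}|q) pi(q); g, G: pmf
    and cdf of segment lengths (G(0) = 0).  1-based indices; O(N^2) time."""
    rng = np.random.default_rng(seed)
    logQ = np.full(N + 2, -np.inf)
    logQ[N + 1] = 0.0                                   # empty tail contributes a factor of 1
    for s in range(N, 0, -1):                           # backward pass, eq (2.43)
        terms = [log_seg(s, t) + logQ[t + 1] + np.log(g(t - s + 1))
                 for t in range(s, N)]
        terms.append(log_seg(s, N) + np.log(max(1.0 - G(N - s), 1e-300)))
        logQ[s] = np.logaddexp.reduce(terms)

    samples = []
    for _ in range(n_samples):                          # forward simulation
        taus, tau = [], 0
        while tau < N:
            ts = np.arange(tau + 1, N + 1)              # candidate positions of the next change point
            logp = np.array(
                [log_seg(tau + 1, t) + logQ[t + 1] + np.log(g(t - tau)) if t < N
                 else log_seg(tau + 1, N) + np.log(max(1.0 - G(N - tau - 1), 1e-300))
                 for t in ts]) - logQ[tau + 1]
            p = np.exp(logp - np.logaddexp.reduce(logp))
            tau = int(rng.choice(ts, p=p))
            taus.append(tau)
        samples.append(taus[:-1])                       # final tau = N marks the end, not a change point
    return samples
```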
And if j = t − 1,

TP(C_t = j | C_{t−1} = i) = \frac{G(t−i−1) − G(t−i−2)}{1 − G(t−i−2)} = \frac{(1−λ)^{t−i−2} − (1−λ)^{t−i−1}}{(1−λ)^{t−i−2}} = λ
Let's define the filtering density P(C_t = j | Y_{1:t}) as: given data Y_{1:t}, the probability
that the last change point is at position j. The online algorithm will compute
P(C_t = j | Y_{1:t}) for all t, j, such that t = 1, 2, . . . , N and j = 0, 1, . . . , t − 1.
When t = 1, we have j = 0, hence
and

P(C_t = j | Y_{1:t−1}) = \sum_{i=0}^{t−2} TP(C_t = j | C_{t−1} = i) P(C_{t−1} = i | Y_{1:t−1})   (2.48)

w_t^j = \frac{P(Y_{j+1:t} | C_t = j)}{P(Y_{j+1:t−1} | C_t = j)} = \frac{\sum_q P(Y_{j+1:t}|q) π(q)}{\sum_q P(Y_{j+1:t−1}|q) π(q)}   (2.49)

When j = t − 1,

w_t^{t−1} = \sum_q P(Y_{t:t}|q) π(q)   (2.50)
Given the filtering densities, we can simulate change points backwards as follows,
1. Set τ0 = N, and k = 0.
2. Simulate τ_{k+1} from the filtering density P(C_{τ_k} | Y_{1:τ_k}), and set k = k + 1.
3. If τ_k > 0 return to step (2); otherwise output the set of simulated change
points backwards, τ_{k−1}, τ_{k−2}, . . . , τ_1.
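For illustration, here is a minimal Python sketch of the exact filtering recursion, specialised to the Geometric(λ) segment-length prior so that the transition probabilities above reduce to 1 − λ (keep the same last change point) and λ (new change point at t − 1). The helper log_seg(s, t) = log Σ_q P(Y_{s:t}|q)π(q) is an assumed interface, not thesis code.

```python
import numpy as np

def online_filter(log_seg, N, lam):
    """Exact online filter: returns filt[t][j] = P(C_t = j | Y_{1:t}) for
    j = 0, ..., t-1, using the prediction step (2.48) and weights (2.49)-(2.50)."""
    filt = [None] * (N + 1)
    filt[1] = np.array([1.0])                   # when t = 1, the last change point is at 0
    for t in range(2, N + 1):
        pred = np.empty(t)                      # P(C_t = j | Y_{1:t-1})
        pred[:t - 1] = (1.0 - lam) * filt[t - 1]
        pred[t - 1] = lam
        logw = np.empty(t)
        for j in range(t - 1):                  # w_t^j, eq (2.49)
            logw[j] = log_seg(j + 1, t) - log_seg(j + 1, t - 1)
        logw[t - 1] = log_seg(t, t)             # w_t^{t-1}, eq (2.50)
        post = pred * np.exp(logw - logw.max())
        filt[t] = post / post.sum()             # P(C_t = j | Y_{1:t})
    return filt
```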
We can also obtain an online Viterbi algorithm for calculating the maximum a
posteriori (MAP) estimate of the positions of change points and model parameters
for each segment, as follows. We define M_i to be the event that, given a change
point at time i, the MAP choice of change points and model parameters occurs
prior to time i.
For t = 1, . . . , n, i = 0, . . . , t − 1, and all q, we compute P_t(i, q), and

P_t^{MAP} = \max_{j,q} \frac{P_t(j, q)\, g(t−j)}{1 − G(t−j−1)}   (2.52)
At time t, the MAP estimates of C_t and the current model parameters are given
respectively by the values of j and q which maximize the right hand side of (2.52). Given a MAP
estimate of C_t, we can then calculate the MAP estimates of the change point
prior to C_t and the model parameters of that segment by the values of j and q
that maximized the right hand side of (2.52) at that time. This procedure can be repeated
to find the MAP estimates of all change points and model parameters.
Now let's look at a real data set which records British coal-mining disasters [17]
by year during the 112 year period from 1851 to 1962. This is a well-known
data set which has previously been studied by many researchers [6, 14, 18, 25]. Here Y_i is
the number of disasters in the i-th year, which follows a Poisson distribution with
parameter (rate) λ. Hence the natural conjugate prior is a Gamma distribution.
We set the hyperparameters α = 1.66 and β = 1, so that the prior mean
of λ equals the empirical mean (α/β = Ȳ = 1.66) and the prior strength β
is weak. Then we use Fearnhead's algorithms to analyse the data, and the
results are shown in Figure 2.5. The top left panel shows the raw data as
a Poisson sequence. The bottom left panel shows the posterior distribution on
the number of segments. It shows the most probable number of segments is
four. The bottom right panel shows the posterior probability of being a change
point at each location. Since we have four segments, we pick the three most
probable change points, at locations 41, 84 and 102, which correspond
to the years 1891, 1934 and 1952. The top right panel shows the resulting
segmentation (the red vertical lines) and the posterior estimates of the rate λ on
each segment (the red horizontal lines). On the four segments, the posterior rates
are roughly 3, 1, 1.5 and 0.5.
For example, if there are 1000 data points and we expect there are 10 seg-
ments, then we will set λ = 0.01. If we increase λ, we will encourage more
change points. We use the synthetic data Blocks as an example. Results ob-
tained by the online approximate algorithms under different values of λ are
shown in Figure 2.7. From top to bottom, the values of λ are: 0.5, 0.1, 0.01,
0.001 and 0.0002. We see that when λ = 0.5 (this prior says the expected segment
length is 2, which is too short), the result is clearly oversegmented. Under the
other values of λ, the results are fine.
Different likelihood functions have different hyperparameters. In linear regression,
we have hyperparameters ν and γ on the variance σ^2. Since ν represents
the strength of the prior, we normally set ν = 2 so that the prior is weak. Then
we can set γ to reflect our belief about how large the variance will be within each
segment. (Note: we parameterize the Inverse Gamma in terms of ν/2 and γ/2.) When
ν = 2, the mean does not exist, so we use the mode γ/(ν + 2) to set γ = 4σ^2, where σ^2
is the expected variance within each segment. For example, if we believe the variance
is 0.01, then we will set γ = 0.04. Now we show results from the synthetic data sets
Blocks and AR1 obtained by the online approximate algorithms under different
values of γ in Figure 2.8 and Figure 2.9. From top to bottom, the values of γ
are: 100, 20, 2, 0.2 and 0.04. We see that the data Blocks is very robust to the
choice of γ. For the data AR1, we see that when γ = 100, we only detect 3 change
points instead of 5. For other values of γ, the results are fine.
Figure 2.3: Results on synthetic data Blocks (1000 data points). The top
panel is the Blocks data set with true change points. The rest are the posterior
probability of being change points at each position and the number of segments
calculated by (from top to bottom) the offline, the online exact and the online
approximate algorithms. Results are generated by ’showBlocks’.
Figure 2.4: Results on synthetic data AR1 (1000 data points). The top panel is
the AR1 data set with true change points. The rest are the posterior probability
of being change points at each position and the number of segments calculated
by (from top to bottom) the offline, the online exact and the online approximate
algorithms. Results are generated by ’showAR1’.
Figure 2.5: Results on the coal mining disaster data. The top left panel shows
the raw data as a Poisson sequence. The bottom left panel shows the posterior
distribution on the number of segments. The bottom right panel shows the
posterior distribution of being change points at each position. The top right
panel shows the resulting segmentation and the posterior estimates of the rate λ on
each segment. Results are generated by 'showCoalMining'.
Figure 2.6: Results on the Tiger Woods data. The top left panel shows the raw
data as a Bernoulli sequence. The bottom left panel shows the posterior distribution
on the number of segments. The bottom right panel shows the posterior
distribution of being change points at each position. The top right panel shows
the resulting segmentation and the posterior estimates of the rate λ on each
segment. Results are generated by 'showTigerWoods'. This data comes from Tiger
Woods's official website http://www.tigerwoods.com/.
Figure 2.7: Results on synthetic data Blocks (1000 data points) under different
values of hyperparameter λ. The top panel is the Blocks data set with true
change points. The rest are the posterior probability of being change points at
each position and the number of segments calculated by the online approximate
algorithms under different values of λ. From top to bottom, the values of λ are:
0.5, 0.1, 0.01, 0.001 and 0.0002. A large value of λ will encourage more segments.
Results are generated by ’showBlocksLambda’.
Figure 2.8: Results on synthetic data Blocks (1000 data points) under different
values of hyperparameter γ. The top panel is the Blocks data set with true
change points. The rest are the posterior probability of being change points at
each position and the number of segments calculated by the online approximate
algorithms under different values of γ. From top to bottom, the values of γ
are: 100, 20, 2, 0.2 and 0.04. A large value of γ will allow higher variance within one
segment, and hence encourage fewer segments. Results are generated by
'showBlocksGamma'.
Figure 2.9: Results on synthetic data AR1 (1000 data points) under differ-
ent values of hyperparameter γ. The top panel is the AR1 data set with true
change points. The rest are the posterior probability of being change points
at each position and the number of segments calculated by the online approx-
imate algorithms under different values of γ. From top to bottom, the values
of γ are: 100, 20, 2, 0.2 and 0.04. A large value of γ will allow higher variance
within one segment, and hence encourage fewer segments. Results are generated by
'showAR1Gamma'.
Chapter 3
Multiple Dimensional Time Series
Under the independent model, the marginal likelihood of a segment factorises over dimensions,

P(Y_{s+1:t}) = \prod_{j=1}^{d} P(Y^j_{s+1:t})

where P(Y^j_{s+1:t}) is the marginal likelihood of the j-th dimension. Since
Y^j_{s+1:t} is one dimensional, P(Y^j_{s+1:t}) can be any likelihood function discussed in
the previous chapter.
The independent model is simple to use. Even when the independence assumption is
not valid, it can be used as an approximate model, similar to Naive Bayes, when
we cannot model the correlation structure across dimensions.
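In code, the independent model is just a sum of one-dimensional log marginal likelihoods; a minimal sketch (any of the one-dimensional marginals from Chapter 2 can be plugged in):

```python
def log_marglik_independent(Y_seg, log_marglik_1d):
    """Independent model: the segment marginal likelihood factorises over the d
    dimensions, so the log marginal is the sum of per-dimension log marginals.
    Y_seg: (n, d) array; log_marglik_1d: any 1D marginal from Chapter 2."""
    return sum(log_marglik_1d(Y_seg[:, j]) for j in range(Y_seg.shape[1]))
```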
Let’s use the following example to illustrate the importance of modeling corre-
lation structures. As shown in Figure 3.1, we have two series. The data on
three segments are generated from the following Gaussian distributions. On the
k-th segment,

Y_k ∼ N(0, Σ_k), \quad
Σ_1 = \begin{pmatrix} 1 & 0.75 \\ 0.75 & 1 \end{pmatrix}, \;
Σ_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \;
Σ_3 = \begin{pmatrix} 1 & −0.75 \\ −0.75 & 1 \end{pmatrix}
As a result, the marginal distribution on each dimension is N(0, 1), the same
over all three segments. Hence if we look at each dimension individually, we are
unable to identify any changes. However, if we consider the dimensions jointly, we
find that their correlation structure changes. For example, in the first segment,
they are positively correlated; in the second segment, they are independent; and in
the last segment, they are negatively correlated.
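This example is easy to reproduce; here is a small sketch (the segment length of 100 per segment is an assumption for illustration) that generates data of this form:

```python
import numpy as np

def make_2d_example(seg_len=100, seed=0):
    """Three segments with identical N(0,1) marginals on each dimension but a
    changing correlation structure (+0.75, 0, -0.75), as in the text."""
    rng = np.random.default_rng(seed)
    covs = [np.array([[1.0, 0.75], [0.75, 1.0]]),
            np.eye(2),
            np.array([[1.0, -0.75], [-0.75, 1.0]])]
    Y = np.vstack([rng.multivariate_normal(np.zeros(2), C, size=seg_len) for C in covs])
    return Y     # shape (3*seg_len, 2); change points after rows 100 and 200
```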
Y_{s+1:t} = Hβ + ε   (3.2)

A ∼ N(M_A, V, W)

P(A) = \frac{1}{(2π)^{mn/2}|V|^{n/2}|W|^{m/2}} \exp\left(-\frac{1}{2}\mathrm{trace}\left((A − M_A)^T V^{-1}(A − M_A) W^{-1}\right)\right)
P(Y | β, Σ) = \frac{1}{(2π)^{nd/2}|I_n|^{d/2}|Σ|^{n/2}} \exp\left(-\frac{1}{2}\mathrm{trace}\left((Y − Hβ)^T I_n^{-1}(Y − Hβ)Σ^{-1}\right)\right)   (3.3)

P(β | D, Σ) = \frac{1}{(2π)^{qd/2}|D|^{d/2}|Σ|^{q/2}} \exp\left(-\frac{1}{2}\mathrm{trace}\left(β^T D^{-1} β Σ^{-1}\right)\right)   (3.4)

P(Σ | N_0, Σ_0) = \frac{|Σ_0|^{N_0/2}}{Z(N_0, d)\, 2^{N_0 d/2}\, |Σ|^{(N_0+d+1)/2}} \exp\left(-\frac{1}{2}\mathrm{trace}\left(Σ_0 Σ^{-1}\right)\right)   (3.5)

where N_0 ≥ d and

Z(n, d) = π^{d(d−1)/4} \prod_{i=1}^{d} Γ((n+1−i)/2)
By (2.17), we have

M = (H^T H + D^{-1})^{-1}
P = (I − H M H^T)
(∗) = Y^T Y − 2Y^T Hβ + β^T H^T Hβ + β^T D^{-1}β

Then

(∗) = β^T (H^T H + D^{-1})β − 2Y^T Hβ + Y^T Y
    = β^T M^{-1}β − 2Y^T H M M^{-1}β + Y^T Y
    = β^T M^{-1}β − 2Y^T H M M^{-1}β + Y^T H M M^{-1} M^T H^T Y − Y^T H M M^{-1} M^T H^T Y + Y^T Y

Using the fact that M = M^T,

(∗) = (β − M H^T Y)^T M^{-1}(β − M H^T Y) + Y^T Y − Y^T H M H^T Y
    = (β − M H^T Y)^T M^{-1}(β − M H^T Y) + Y^T P Y

Hence

P(Y | β, Σ) P(β | D, Σ) ∝ \exp\left(-\frac{1}{2}\mathrm{trace}\left(\left((β − M H^T Y)^T M^{-1}(β − M H^T Y)\right)Σ^{-1} + Y^T P Y\, Σ^{-1}\right)\right)

So the posterior for β is still Matrix-Gaussian with mean M H^T Y and covariances M and Σ,

P(β | D, Σ) ∼ N(M H^T Y, M, Σ)   (3.6)
P(Y | D, Σ) P(Σ | N_0, Σ_0) ∝ \frac{1}{|Σ|^{n/2}} \frac{1}{|Σ|^{(N_0+d+1)/2}} \exp\left(-\frac{1}{2}\mathrm{trace}\left((Y^T P Y + Σ_0)Σ^{-1}\right)\right)
∝ \frac{1}{|Σ|^{(n+N_0+d+1)/2}} \exp\left(-\frac{1}{2}\mathrm{trace}\left((Y^T P Y + Σ_0)Σ^{-1}\right)\right)

so the posterior for Σ is still Inverse-Wishart,

P(Σ) ∼ IW(n + N_0, Y^T P Y + Σ_0)   (3.8)
\log P(Y_{s+1:t} | q) = \frac{N_0}{2}\log|Σ_0| − \frac{n+N_0}{2}\log|Y^T P Y + Σ_0| + \frac{d}{2}(\log|M| − \log|D|)
− \frac{nd}{2}\log(π) + \sum_{i=1}^{d}\log Γ((n+N_0+1−i)/2) − \sum_{i=1}^{d}\log Γ((N_0+1−i)/2)   (3.10)
To speed up, −\frac{nd}{2}\log(π), \frac{N_0}{2}\log|Σ_0|, −\frac{d}{2}\log|D|, \sum_{i=1}^{d}\log Γ((n+N_0+1−i)/2)
and \sum_{i=1}^{d}\log Γ((N_0+1−i)/2) can be pre-computed. At each iteration, M
and Y^T P Y can be computed by the following rank one updates,

H_{1:i+1,:}^T H_{1:i+1,:} = H_{1:i,:}^T H_{1:i,:} + H_{i+1,:}^T H_{i+1,:}
Y_{1:i+1,:}^T Y_{1:i+1,:} = Y_{1:i,:}^T Y_{1:i,:} + Y_{i+1,:}^T Y_{i+1,:}
H_{1:i+1,:}^T Y_{1:i+1,:} = H_{1:i,:}^T Y_{1:i,:} + H_{i+1,:}^T Y_{i+1,:}
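To make the bookkeeping concrete, here is a small Python sketch (not the thesis code) of a segment-statistics object that applies these rank one updates and evaluates the log marginal likelihood (3.10) as reconstructed above; D, N0 and Σ0 are the prior hyperparameters.

```python
import numpy as np
from scipy.special import gammaln

class SegmentStats:
    """Maintain H'H, Y'Y and H'Y for a growing segment via rank one updates,
    so that M and Y'PY do not need to be recomputed from scratch."""
    def __init__(self, D, N0, Sigma0):
        q, d = D.shape[0], Sigma0.shape[0]
        self.Dinv = np.linalg.inv(D)
        self.logdetD = np.linalg.slogdet(D)[1]
        self.N0, self.Sigma0, self.d, self.n = N0, Sigma0, d, 0
        self.HtH, self.YtY, self.HtY = np.zeros((q, q)), np.zeros((d, d)), np.zeros((q, d))

    def add_row(self, h_i, y_i):
        self.HtH += np.outer(h_i, h_i)          # H'H <- H'H + h_i h_i'
        self.YtY += np.outer(y_i, y_i)          # Y'Y <- Y'Y + y_i y_i'
        self.HtY += np.outer(h_i, y_i)          # H'Y <- H'Y + h_i y_i'
        self.n += 1

    def log_marglik(self):
        """Multivariate marginal likelihood (3.10) for the full covariance model."""
        n, d, N0 = self.n, self.d, self.N0
        M = np.linalg.inv(self.HtH + self.Dinv)
        YPY = self.YtY - self.HtY.T @ M @ self.HtY      # Y'PY = Y'Y - Y'H M H'Y
        i = np.arange(1, d + 1)
        return (0.5 * N0 * np.linalg.slogdet(self.Sigma0)[1]
                - 0.5 * (n + N0) * np.linalg.slogdet(YPY + self.Sigma0)[1]
                + 0.5 * d * (np.linalg.slogdet(M)[1] - self.logdetD)
                - 0.5 * n * d * np.log(np.pi)
                + gammaln((n + N0 + 1 - i) / 2).sum()
                - gammaln((N0 + 1 - i) / 2).sum())
```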
Figure 3.3: A simple example to show how to use a graph to represent correlation
structures. We first compute the precision matrix Λ. Then from Λ, if Λ_{ij} = 0,
there is no edge between node i and node j. "X" represents a non-zero entry
in the matrix.
by s = 0.1w at each step. This will generate a set of about 50 candidate
segmentations. We can repeat this for different settings of w and s. Then we run a
fast structure learning algorithm on each windowed segment, and sort the resulting
set of graphs by frequency of occurrence. Finally we pick the top M = 20 to
form the set of candidate graphs. We hope this set will contain the true graph
structures, or at least ones that are very similar.
As shown in Figure 3.6, suppose the vertical pink dotted lines are the true
segmentation. When we run a sliding window inside a true segment (e.g., the red
window), we hope the structure we learn from this window is similar to the
true structure of this segment. And when we shift the window one step (e.g.,
to the blue window), if it is still inside the same segment, we hope
the structure we learn is the same or at least very similar to the one we learned
in the previous window. Of course some windows will overlap two
segments (e.g., the black window); we know the structure we learn from such a
window represents neither segment. However, this brings no harm since
these "wrong" graph structures will later receive negligible posterior probability.
We can choose the number of graphs we want to consider based on how
3. compute the partial correlation coefficients by ρ_{ij} = \frac{Λ_{ij}}{\sqrt{Λ_{ii} Λ_{jj}}},
4. set edge G_{ij} = 0 if |ρ_{ij}| < θ for some threshold θ (e.g., θ = 0.2), as sketched below.
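A minimal sketch of this thresholding estimator for one windowed segment (the small ridge term added to keep the covariance invertible for short windows is an assumption made here, not part of the thesis):

```python
import numpy as np

def threshold_graph(Y_window, theta=0.2, ridge=1e-3):
    """Estimate a graph for one sliding window: empirical covariance ->
    precision matrix -> partial correlations -> thresholded adjacency."""
    d = Y_window.shape[1]
    Sigma = np.cov(Y_window, rowvar=False) + ridge * np.eye(d)
    Lam = np.linalg.inv(Sigma)                                   # precision matrix
    rho = Lam / np.sqrt(np.outer(np.diag(Lam), np.diag(Lam)))    # rho_ij = Lam_ij / sqrt(Lam_ii Lam_jj)
    G = (np.abs(rho) >= theta).astype(int)                       # keep edges with |rho_ij| >= theta
    np.fill_diagonal(G, 0)
    return G
```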
The thresholding method is simple and fast, but it may not give a good esti-
mator. We can also use the shrinkage method discussed in [23] to get a better
estimator of Σ, which helps regularize the problem when the segment is too
short and d is large.
If we further impose sparsity on the graph structure, we can use the convex
optimization techniques discussed in [2] to compute the MAP estimator for the
precision Λ under a prior that encourages many entries to go to 0. We first form
the penalised objective

\max_{Λ \succ 0}\; \log|Λ| − \mathrm{trace}(ΣΛ) − ρ\,||Λ||_1   (3.11)
where Σ is the empirical covariance, ||Λ||_1 = \sum_{ij}|Λ_{ij}|, and ρ > 0 is the
regularization parameter which controls the sparsity of the graph. We can then solve
(3.11) by block coordinate descent algorithms.
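As a sketch of this sparse alternative, an off-the-shelf ℓ1-penalised (graphical lasso) solver can be used to obtain the MAP precision matrix and the implied graph; the use of scikit-learn here and the choice of ρ are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def sparse_graph(Y_window, rho=0.1):
    """l1-penalised estimate of the precision matrix for one window; the graph
    is read off the support of the estimated precision (nonzero entries)."""
    Lam = GraphicalLasso(alpha=rho).fit(Y_window).precision_
    G = (np.abs(Lam) > 1e-8).astype(int)
    np.fill_diagonal(G, 0)
    return G
```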
We summarize the overall algorithm as follows,
1. Input: data Y1:N , hyperparameters λ for change point rate and θ for
likelihood functions, observation model obslik and parameter ρ for graph
structure learning methods.
6. G = estG(Ys , ρ) : s ∈ S1:K
7. end while
9. Output: the number of segments K, the set of segments S1:K , the model
inferences on each segment m1:K
After we get the set of candidate graphs, we need to be able to compute the marginal
likelihood for each graph. However, we can only do so for decomposable undirected
graphs. For non-decomposable graphs, we will use an approximation obtained
by adding the minimum number of edges needed to make them decomposable.
Given a decomposable graph, we will assume the following conjugate priors.
Compared with the multivariate linear regression models, everything is the
same, except that the prior on Σ is now a hyper-inverse-Wishart distribution, Σ ∼ HIW(b_0, Σ_0):
Y = Hβ + ε
ε ∼ N(0, I_n, Σ)
β ∼ N(0, D, Σ)
Since P(Y | β, Σ)P(β | D, Σ) is the same as before, integrating out β gives

P(Y | D, Σ) = \int P(Y | β, Σ) P(β | D, Σ)\, dβ   (3.14)
            = \frac{1}{(2π)^{nd/2}|Σ|^{n/2}} \left(\frac{|M|}{|D|}\right)^{d/2} \exp\left(-\frac{1}{2}\mathrm{trace}\left(Y^T P Y\, Σ^{-1}\right)\right)
where M = (H^T H + D^{-1})^{-1} and P = (I − H M H^T) are defined as before.
When P is positive definite, P has the Cholesky decomposition Q^T Q. (In
practice, we can let the prior D = c · I, where I is the identity matrix and c is a
scalar. By setting c to a proper value, we can make sure P is always positive
definite. This is very important in high dimensions.) Now let X = QY; then
(3.14) can be rewritten as follows,
P(Y | D, Σ) = \frac{1}{(2π)^{nd/2}|Σ|^{n/2}} \left(\frac{|M|}{|D|}\right)^{d/2} \exp\left(-\frac{1}{2}\mathrm{trace}\left(X^T X Σ^{-1}\right)\right)
            = \left(\frac{|M|}{|D|}\right)^{d/2} \frac{1}{(2π)^{nd/2}|I_n|^{d/2}|Σ|^{n/2}} \exp\left(-\frac{1}{2}\mathrm{trace}\left(X^T I_n^{-1} X Σ^{-1}\right)\right)
            = \left(\frac{|M|}{|D|}\right)^{d/2} P(X | Σ)   (3.15)
where P(X | Σ) ∼ N(0, I_n, Σ), or just P(X | Σ) ∼ N(0, Σ).
Since we use decomposable graphs, this likelihood P(X | Σ) can be decomposed
as follows [8],

P(X | Σ) = \frac{\prod_C P(X_C | Σ_C)}{\prod_S P(X_S | Σ_S)}   (3.16)
same holds for each S. As a result, the posterior of Σ is HIW(b_0 + n, Σ_0 + X^T X).
Hence, by integrating out Σ, we have,
Z(n, d) = π^{d(d−1)/4} \prod_{i=1}^{d} Γ((n+1−i)/2)

Σ_n = Σ_0 + Y^T P Y
b_n = b_0 + n

\log P(Y_{s+1:t} | q) = −\frac{nd}{2}\log(π) + \frac{d}{2}(\log|M| − \log|D|) + \log h(G, b_0, Σ_0) − \log h(G, b_n, Σ_n)   (3.19)
(3.19)
We will also use a similar rank one update as mentioned earlier. Also notice
that h(G, b, Σ) contains many local terms. When we evaluate h(G, b, Σ) over
different graphs, we cache all the local terms. Then, when we later meet the
same term, we do not need to re-evaluate it.
First, we revisit the 2D synthetic data mentioned at the beginning of this chapter.
We run it with all three models (independent, full and Gaussian
graphical models). We set the hyperparameters ν = 2 and γ = 2 on σ^2 for each
dimension in the independent model; the hyperparameters N_0 = d and Σ_0 = I on Σ
in the full model, where d is the number of dimensions (in this case d = 2) and
I is the identity matrix; and the hyperparameters b_0 = 1
and Σ_0 = I on Σ in the Gaussian graphical model. We know there are three
segments. In Figure 3.7, the raw data are shown in the first two rows. The
independent model thinks there is only one segment, since the posterior probability
of the number of segments is mainly at 1. Hence it detects no change
points. The other two models both think there are three segments. Both models
detect positions of change points that are close to the ground truth, with some
uncertainty.
Figure 3.7: Results on synthetic data 2D. The top two rows are the raw data. The
3rd row is the ground truth of change points. From the 4th row to the 6th row,
the results are the posterior distributions of being change points at each position
and the number of segments generated from the independent model, the full
model and the Gaussian graphical model respectively. Results are generated by
'show2D'.
Now let's look at a 10D synthetic data set. We generate the data in a similar way, but
this time we have 10 series, and we set the hyperparameters in a similar way.
To save space, we only show the first two dimensions since the rest are similar.
From Figure 3.8, we see that, as before, the independent model thinks there is only
one segment and hence detects no change point. The full model thinks there might
be one or two segments, and the change point position it detects is close to the
two ends, which is wrong. In this case, only the Gaussian graphical model thinks
there are three segments, and it detects positions that are very close to the ground
truth. Then, based on the segments estimated by the Gaussian graphical model, we
plot the posterior over all graph structures, P(G|Y_{s:t}), the true graph structure,
the MAP structure G^{MAP} = argmax_G P(G|Y_{s:t}), and the marginal edge
probability, P(G_{i,j} = 1|Y_{s:t}), computed using Bayesian model averaging (gray
squares represent edges about which we are uncertain), in the bottom two rows
of Figure 3.8. Note that we plot a graph structure by its adjacency matrix, such
that if nodes i and j are connected, then the corresponding entry is a black square;
otherwise it is a white square.
We plot all candidate graph structures selected by sliding windows in Figure 3.9.
In this case, we have 30 graphs in total. For the 1st segment, we notice that
the true graph structure is not included in the candidate list. As a result, GGM
picks the graphs that are closest to the true structure; in this case, there are
two graphs that are both close, hence we see some uncertainty over the edges.
On the 3rd segment, however, the true structure is included in the candidate list.
In this case, GGM can identify it correctly, and we see very little uncertainty in
the posterior probability. As mentioned earlier, the candidate list contains 30
graphs. Some are very different from the true structures, and might have been
generated by windows overlapping two segments. However, we find these graphs
only slow down the algorithm and do not hurt its results, because these "useless"
graphs all get very low posterior probabilities.
Finally, let's look at a 20D synthetic data set. We generate the data and
set the hyperparameters in a similar way. To save space, we only show the first
two dimensions since the rest are similar. From Figure 3.10, we see that, as before,
the independent model thinks there is only one segment and hence detects no change
point. The full model clearly oversegments the data. Again, only the Gaussian
graphical model segments the data correctly, and it detects positions that are very close to
the ground truth. Comparing the results from the 2D, 10D and 20D cases, we find
that the independent model fails to detect changes in all cases, since the changes are in
the correlation structure, which the independent model cannot capture. The
full model can detect changes in the low-dimensional case, but fails in the
high-dimensional cases, because the full model has more parameters to learn. The Gaussian graphical
model can detect changes in all cases, and it is more confident in the high-dimensional cases,
since the sparsity structure is more important in high dimensions.
Now let's look at two real data sets of U.S. portfolios, which record annually
rebalanced value-weighted monthly returns from July 1926 to December 2001, a
total of 906 months. The first data set has five industries (manufacturing,
utilities, shops, finance and other) and the second has thirty industries.
The first data set has been previously studied by Talih and Hengartner in [24].
We set the hyperparameters ν = 2 and γ = 2 on σ^2 for each dimension in the
independent model; N_0 = 5 and Σ_0 = I on Σ in
the full model; and b_0 = 1 and Σ_0 = I on Σ in the
Gaussian graphical model. The raw data are shown in the first five rows of
Figure 3.11. Talih's result is shown in the 6th row. From the 7th row to the
9th row, the results are from the independent model, the full model and GGM respectively.
We find that the full model and GGM give similar results: both think there
are roughly 7 segments, and they agree on 4 out of 6 change points. The independent
model seems to oversegment. In this problem, we do not know the
ground truth or the true graph structures. Note that only 2 of the change points we
discovered coincide with the results of Talih (namely 1959 and 1984). There are
many possible reasons for this. First, they assume a different prior over models
(the graph changes by one arc at a time between neighbouring segments); second,
they use reversible jump MCMC; third, their model requires the number of change
points to be pre-specified, and in this case the positions of change points are very
sensitive to the number of change points. We think their results could be
oversegmented. On this data, the sliding window approach generates 17 graphs. As usual, based
on the segmentation estimated by GGM, we plot the posterior over all 17 graphs
Finally, we analyse the honey bee dance data set used in [19, 20]. This consists
of the x and y coordinates of a honey bee, and its head angle θ, as it moves
around an enclosure, as observed by an overhead camera. Two examples of the
data, together with ground truth segmentations (created by human experts),
are shown in Figures 3.13 and 3.14. We also show the results of segmenting
this data using a first-order autoregressive AR(1) observation model, with either the
independent model or the full covariance model. We preprocessed the data by replacing θ with sin θ
and cos θ to overcome the discontinuity as the bee moves between −π and π. We
set the hyperparameters ν = 2 and γ = 0.02 on σ^2 for each dimension in the
independent model.
Figure 3.8: Results on synthetic data 10D. The top two rows are the first two
dimensions of the raw data. The 3rd row is the ground truth of change points. From
the 4th row to the 6th row, the results are the posterior distributions of being change
points at each position and the number of segments generated from the independent
model, the full model and the Gaussian graphical model respectively.
In the bottom 2 rows, we plot the posterior over all graph structures, P(G|Y_{s:t}),
the true graph structure, the MAP structure G^{MAP} = argmax_G P(G|Y_{s:t}), and
the marginal edge probability, P(G_{i,j} = 1|Y_{s:t}), on the 3 segments detected by the
Gaussian graphical model. Results are generated by 'show10D'.
Figure 3.9: Candidate list of graphs generated by sliding windows in the 10D data.
We plot each graph structure by its adjacency matrix, such that if nodes i and j
are connected, then the corresponding entry is a black square; otherwise it is a
white square. Results are generated by 'show10D'.
Figure 3.10: Results on synthetic data 20D. The top two rows are the first two
dimensions of the raw data. The 3rd row is the ground truth of change points. From
the 4th row to the 6th row, the results are the posterior distributions of being change
points at each position and the number of segments generated from the independent
model, the full model and the Gaussian graphical model respectively.
In the bottom 2 rows, we plot the posterior over all graph structures, P(G|Y_{s:t}),
the true graph structure, the MAP structure G^{MAP} = argmax_G P(G|Y_{s:t}), and
the marginal edge probability, P(G_{i,j} = 1|Y_{s:t}), on the 3 segments detected by the
Gaussian graphical model. Results are generated by 'show20D'.
Figure 3.11: Results on U.S. portfolios data of 5 industries. The top five rows
are the raw data representing annually rebalanced value-weighted monthly returns
in 5 industries from July 1926 to December 2001 (906 months in total).
The 6th row is the result by Talih. From the 7th row to the 9th row, the
results are the posterior distributions of being change points at each position
and the number of segments generated from the independent model, the full
model and the Gaussian graphical model respectively. In the bottom three
rows, we plot the posterior over all graph structures, P(G|Y_{s:t}), the MAP
graph structure G^{MAP} = argmax_G P(G|Y_{s:t}), and the marginal edge
probability, P(G_{i,j} = 1|Y_{s:t}). Results are generated by 'showPortofios'.
Figure 3.12: Results on U.S. portfolios data of 30 industries. We show the raw
data for the first two industries in the top two rows. The third row is the
posterior distribution of being change points at each position and the number
of segments generated from the Gaussian graphical model. In the fourth row,
we show the MAP graph structures in 3 consecutive regions detected by the
Gaussian graphical model. Results are generated by 'showPort30'.
Figure 3.13: Results on honey bee dance data 4. The top four rows are the raw
data representing the x, y coordinates of the honey bee and the sin and cos of its head
angle θ. The 5th row is the ground truth. The 6th row is the result from the independent
model and the 7th row is the result from the full covariance model. Results are
generated by 'showBees(4)'.
Figure 3.14: Results on honey bee dance data 6. The top four rows are the raw
data representing the x, y coordinates of the honey bee and the sin and cos of its head
angle θ. The 5th row is the ground truth. The 6th row is the result from the independent
model and the 7th row is the result from the full covariance model. Results are
generated by 'showBees(6)'.
Chapter 4
Bibliography
[1] Jim Albert and Patricia Williamson. Using model/data simulation to detect
streakiness. The American statistician, 55:41–50, 2001.
[3] Daniel Barry and J. A. Hartigan. Product partition models for change
point problems. The Annals of Statistics, 20(1):260–279, 1992.
[4] Daniel Barry and J. A. Hartigan. A Bayesian analysis for change point
problems. Journal of the American Statistical Association, 88(421):309–
319, 1993.
[5] S. C. Albright, J. Albert, H. S. Stern, and C. N. Morris. A statistical
analysis of hitting streaks in baseball. Journal of the American Statistical
Association, 88:1175–1196, 1993.
[9] Nicolas Chopin. Dynamic detection of change points in long time series.
The Annals of the Institute of Statistical Mathematics, 2006.
[12] Arnaud Doucet, Nando De Freitas, and Neil Gordon. Sequential Monte
Carlo methods in practice. Springer-Verlag, New York, 2001.
[13] Paul Fearnhead. Exact Bayesian curve fitting and signal segmentation.
IEEE Transactions on Signal Processing, 53:2160–2166, 2005.
[14] Paul Fearnhead. Exact and efficient Bayesian inference for multiple change-
point problems. Statistics and Computing, 16(2):203–213, 2006.
[15] Paul Fearnhead and Peter Clifford. Online inference for hidden Markov
models via particle filters. Journal of the Royal Statistical Society: Series
B, 65:887–899, 2003.
[16] Paul Fearnhead and Zhen Liu. Online inference for multiple change point
problems. 2005.
[18] Peter J. Green. Reversible jump Markov Chain Monte Carlo computation
and Bayesian model determination. Biometrika, 82(4):711–732, 1995.
[19] Sang Min Oh, James M. Rehg, Tucker Balch, and Frank Dellaert. Learn-
ing and inference in parametric switching linear dynamic systems. IEEE
International Conference on Computer Vision, 2:1161–1168, 2005.
[20] Sang Min Oh, James M. Rehg, and Frank Dellaert. Parameterized du-
ration modeling for switching linear dynamic systems. IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, 2:1694–
1700, 2006.
[22] Christian P. Robert and George Casella. Monte Carlo Statistical Methods.
Springer, New York, 2004.
[24] Makram Talih and Nicolas Hengartner. Structural learning with time-varying
components: Tracking the cross-section of financial time series.
Journal of the Royal Statistical Society: Series B, 67:321–341, 2005.
[26] Tae Young Yang. Bayesian binary segmentation procedure for detecting
streakiness in sports. Journal of the Royal Statistical Society: Series A,
167:627–637, 2004.