
Latent Variable Models and E-M algorithm

Statistical Foundations of AI and ML, Module 3

Adway Mitra
Center for Artificial Intelligence
Indian Institute of Technology Kharagpur

March 6, 2023



Generative Model for Clustering

▶ A generative model is a story about how the data was created


▶ We imagine that each of K clusters has a prototype
▶ Every data point is a “noisy version” of one prototype
▶ For any datapoint i,
▶ first the cluster index Zi is decided (Zi ∈ {1, 2, . . . , K })
▶ then the feature Xi is created, as a noisy version of the
selected cluster’s prototype



Generative Model for Clustering

▶ Assumption: each of the K clusters is represented by a prototype: $\{\theta_1, \theta_2, \ldots, \theta_K\}$
▶ For each datapoint i:
  ▶ Draw cluster index $Z_i \sim g$ ($g$: a distribution on the clusters)
  ▶ Draw feature vector $X_i \sim f(\theta_{Z_i})$ ($f$: a distribution on the observation space)
▶ We choose f and g according to the application (e.g. f can be Gaussian if our observations are real-valued); a sampling sketch follows below
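As a concrete illustration of this generative story, here is a minimal NumPy sketch that samples a 1-D dataset, assuming g is a categorical distribution over clusters and f is a Gaussian centred at the selected prototype. The function name sample_dataset and its arguments are illustrative, not part of the slides.

```python
import numpy as np

def sample_dataset(N, prototypes, weights, noise_std=1.0, seed=0):
    """Sample N points from the generative clustering story:
    Z_i ~ Categorical(weights), X_i ~ N(prototypes[Z_i], noise_std^2)."""
    rng = np.random.default_rng(seed)
    K = len(prototypes)
    Z = rng.choice(K, size=N, p=weights)                      # cluster indices
    X = prototypes[Z] + noise_std * rng.standard_normal(N)    # noisy prototypes
    return X, Z

# Example: 3 clusters with prototypes at -5, 0 and 4
X, Z = sample_dataset(500, np.array([-5.0, 0.0, 4.0]), [0.3, 0.5, 0.2])
```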



Inference and Estimation Problems

▶ Observed variables: X (our observed datapoints)


▶ Unknown variables: cluster assignments Z, cluster parameters θ
▶ Finding Z: the inference problem, $\text{prob}(Z \mid X, \theta)$
▶ Finding θ: the estimation problem, $\theta = \arg\max_\theta \text{prob}(Z, X, \theta)$
▶ Challenge: the two problems are linked together!
▶ We cannot estimate θ directly because of Z



Gaussian Mixture Model

▶ Each cluster $j \in \{1, \ldots, K\}$ is represented by a Gaussian distribution $\mathcal{N}(\mu_j, \sigma_j)$
▶ Each cluster has a probability $\pi_j$, with $\pi = [\pi_1, \ldots, \pi_K]$, $\pi_j \geq 0$, $\sum_{j=1}^{K} \pi_j = 1$
▶ Model parameters: $\{\mu_j, \sigma_j, \pi_j\}_{j=1}^{K}$
▶ For each datapoint i:
  ▶ Draw cluster index $Z_i \sim \text{Categorical}(\pi)$
  ▶ Draw feature vector $X_i \sim \mathcal{N}(\mu_{Z_i}, \sigma_{Z_i})$



Gaussian Mixture Model

▶ $X_i$ depends only on $Z_i$; $Z_i$ depends on nothing!
▶ Joint distribution:
  $$\text{prob}(Z_1, \ldots, Z_N, X_1, \ldots, X_N) = \prod_{i=1}^{N} \text{prob}(Z_i)\,\text{prob}(X_i \mid Z_i)$$
▶ $\text{prob}(Z_i) = \prod_{j=1}^{K} \pi_j^{I(Z_i = j)}$ ($I$: indicator function)
▶ $\text{prob}(X_i \mid Z_i) = \prod_{j=1}^{K} \left( \frac{1}{\sigma_j} \exp\left(-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}\right) \right)^{I(Z_i = j)}$
▶ Likelihood function:
  $$L(\mu, \sigma, \pi) = \prod_{i=1}^{N} \prod_{j=1}^{K} \left( \frac{\pi_j}{\sigma_j} \exp\left(-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}\right) \right)^{I(Z_i = j)}$$
▶ Log-likelihood:
  $$\log L = \sum_{i=1}^{N} \sum_{j=1}^{K} I(Z_i = j)\left( \log \pi_j - \log \sigma_j - \frac{(x_i - \mu_j)^2}{2\sigma_j^2} \right)$$



Gaussian Mixture Model

▶ $\{\mu^{MLE}, \sigma^{MLE}, \pi^{MLE}\} = \arg\max_{\mu, \sigma, \pi} L(\mu, \sigma, \pi)$
▶ Solve $\frac{\partial L}{\partial \mu} = 0$, $\frac{\partial L}{\partial \sigma} = 0$
▶ $\mu_j = \frac{\sum_{i=1}^{N} I(Z_i = j)\, x_i}{\sum_{i=1}^{N} I(Z_i = j)}$, i.e. the mean of the points in cluster j
▶ $\sigma_j^2 = \frac{\sum_{i=1}^{N} I(Z_i = j)\,(x_i - \mu_j)^2}{\sum_{i=1}^{N} I(Z_i = j)}$, i.e. the variance of the points in cluster j
▶ $\pi_j = \frac{\sum_{i=1}^{N} I(Z_i = j)}{\sum_{i=1}^{N} \sum_{j=1}^{K} I(Z_i = j)}$, i.e. the relative frequency of the points in cluster j
▶ Unfortunately we cannot compute these, as we do not know Z!



Expectation Maximization

▶ As we do not know $I(Z_i = j)$, we treat it as a random variable with distribution $p(Z_i \mid X)$
▶ We replace $I(Z_i = j)$ by its expected value, $\gamma_{ij} = E(I(Z_i = j))$
▶ As $I$ is binary, $E(I(Z_i = j)) = p(Z_i = j \mid X)$
▶ $p(Z_i = j \mid X) = p(Z_i = j \mid X_i) = \frac{p(X_i \mid Z_i = j)\, p(Z_i = j)}{p(X_i)} = \frac{p(X_i \mid Z_i = j)\, p(Z_i = j)}{\sum_{l=1}^{K} p(X_i \mid Z_i = l)\, p(Z_i = l)}$
▶ So, $\gamma_{ij} = \frac{\pi_j\, \mathcal{N}(x_i;\, \mu_j, \sigma_j)}{\sum_{l=1}^{K} \pi_l\, \mathcal{N}(x_i;\, \mu_l, \sigma_l)}$
▶ $\mu_j = \frac{\sum_{i=1}^{N} \gamma_{ij}\, x_i}{\sum_{i=1}^{N} \gamma_{ij}}$, $\sigma_j^2 = \frac{\sum_{i=1}^{N} \gamma_{ij}\,(x_i - \mu_j)^2}{\sum_{i=1}^{N} \gamma_{ij}}$, $\pi_j = \frac{\sum_{i=1}^{N} \gamma_{ij}}{\sum_{i=1}^{N} \sum_{j=1}^{K} \gamma_{ij}}$



Expectation Maximization

▶ We use an iterative algorithm (a code sketch follows below):

1. Initialize $\mu^0, \sigma^0, \pi^0$
2. Repeat:
   2.1 E-step: calculate $\gamma_{ij} = \frac{\pi_j^0\, \mathcal{N}(x_i;\, \mu_j^0, \sigma_j^0)}{\sum_{l=1}^{K} \pi_l^0\, \mathcal{N}(x_i;\, \mu_l^0, \sigma_l^0)}$
   2.2 M-step: re-estimate the parameters
   2.3 $\mu_j^1 = \frac{\sum_{i=1}^{N} \gamma_{ij}\, x_i}{\sum_{i=1}^{N} \gamma_{ij}}$, $\sigma_j^1 = \frac{\sum_{i=1}^{N} \gamma_{ij}\,(x_i - \mu_j)^2}{\sum_{i=1}^{N} \gamma_{ij}}$, $\pi_j^1 = \frac{\sum_{i=1}^{N} \gamma_{ij}}{N}$
3. If $(\mu^0, \sigma^0, \pi^0) \approx (\mu^1, \sigma^1, \pi^1)$, STOP
4. Else set $(\mu^0 = \mu^1, \sigma^0 = \sigma^1, \pi^0 = \pi^1)$ and GOTO 2
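The loop above translates almost directly into code. Below is a minimal sketch of E-M for a one-dimensional GMM (assuming Gaussian emissions as in the slides; the function name em_gmm_1d and variable names are illustrative), with convergence checked via the log-likelihood rather than the parameters themselves:

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, tol=1e-6, seed=0):
    """E-M for a 1-D Gaussian mixture, following steps 1-4 above.
    x is a 1-D NumPy array. Returns (mu, sigma, pi, gamma),
    where gamma[i, j] = p(Z_i = j | x_i)."""
    rng = np.random.default_rng(seed)
    N = len(x)
    # Step 1: initialize mu^0, sigma^0, pi^0
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std())
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Step 2.1 (E-step): responsibilities gamma_ij
        dens = pi * norm.pdf(x[:, None], mu, sigma)        # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # Steps 2.2-2.3 (M-step): re-estimate the parameters
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
        pi = Nk / N
        # Steps 3-4: stop when the log-likelihood no longer improves
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, sigma, pi, gamma
```

With X from the earlier sampling sketch, `em_gmm_1d(X, K=3)` should recover means close to the prototypes used for sampling.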



Expectation Maximization

▶ When the E-M algorithm converges, we obtain the final parameter estimates $(\mu^{EM}, \sigma^{EM}, \pi^{EM})$
▶ Compute the posterior distribution $p(Z_i = j \mid X_i) = \frac{\pi_j^{EM}\, \mathcal{N}(x_i;\, \mu_j^{EM}, \sigma_j^{EM})}{\sum_{l=1}^{K} \pi_l^{EM}\, \mathcal{N}(x_i;\, \mu_l^{EM}, \sigma_l^{EM})}$
▶ This gives soft clustering, instead of hard clustering as in K-means
▶ The mode of the distribution may be used as a hard cluster assignment (see the snippet below)
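Continuing the illustrative sketch above (the names are reused from em_gmm_1d, which is an assumption of that sketch, not of the slides), soft and hard assignments can be read off the responsibilities:

```python
# gamma[i, j] = p(Z_i = j | x_i): the soft clustering
mu, sigma, pi, gamma = em_gmm_1d(X, K=3)

# Hard assignment: mode of the posterior, comparable to K-means labels
labels = gamma.argmax(axis=1)
```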



Model Likelihood

▶ The likelihood of a model: $L(P) = \text{prob}(X)$, the joint distribution of the data according to the model
▶ If the model contains latent variables like Z, marginalize over them:
$$L(\mu, \sigma, \pi) = \text{prob}(X) = \prod_{i=1}^{N} \text{prob}(X_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} \text{prob}(X_i, Z_i = k)$$
$$= \prod_{i=1}^{N} \sum_{k=1}^{K} \text{prob}(X_i \mid Z_i = k)\,\text{prob}(Z_i = k) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}\right)$$



Comparing models

▶ Consider two different GMMs, with parameter sets $(\mu_a, \sigma_a, \pi_a)$ and $(\mu_b, \sigma_b, \pi_b)$
▶ They can be compared by their likelihoods
▶ $L(\mu_a, \sigma_a, \pi_a) > L(\mu_b, \sigma_b, \pi_b)$ implies that the first model fits the data better than the second
▶ Choosing K may also be guided by this approach (see the sketch below)
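As a hedged sketch of this comparison (reusing the illustrative em_gmm_1d from earlier), the marginal log-likelihood of each fitted model can be computed and compared. Note that the raw training likelihood always favours larger K, so in practice held-out likelihood or a penalized criterion is often used alongside it:

```python
import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, mu, sigma, pi):
    """log L = sum_i log sum_k pi_k N(x_i; mu_k, sigma_k)."""
    dens = pi * norm.pdf(x[:, None], mu, sigma)    # shape (N, K)
    return np.log(dens.sum(axis=1)).sum()

# Compare two candidate models (e.g. different K) by their log-likelihoods
ll_2 = gmm_log_likelihood(X, *em_gmm_1d(X, K=2)[:3])
ll_3 = gmm_log_likelihood(X, *em_gmm_1d(X, K=3)[:3])
```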



Latent Linear Gaussian Model

▶ $Z_i \sim \mathcal{N}(\mu_0, \Sigma_0)$, $X_i \sim \mathcal{N}(W Z_i + \mu, \Sigma)$
▶ Observations $X = \{X_1, \ldots, X_N\}$, parameters $\theta = \{W, \mu_0, \mu, \Sigma_0, \Sigma\}$, latents $Z = \{Z_1, \ldots, Z_N\}$
▶ Posterior on the latents: $p_\theta(Z_i \mid X_i) = \mathcal{N}(\mu_i, \Sigma_i)$, where
  $\Sigma_i = (\Sigma_0^{-1} + W^T \Sigma^{-1} W)^{-1}$ and $\mu_i = \Sigma_i \left( W^T \Sigma^{-1} (X_i - \mu) + \Sigma_0^{-1} \mu_0 \right)$
▶ Marginal: $p_\theta(X_i) = \mathcal{N}(W \mu_0 + \mu,\; \Sigma + W \Sigma_0 W^T)$
▶ Log-likelihood: $l(\theta) = \sum_{i=1}^{N} \log p_\theta(X_i)$
▶ Parameters can be estimated by maximum likelihood (a sketch of the posterior computation follows)
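A minimal NumPy sketch of the posterior and marginal computations above (names such as llg_posterior are illustrative; the formulas follow the slide, with Σ entering through its inverse):

```python
import numpy as np

def llg_posterior(x, W, mu, Sigma, mu0, Sigma0):
    """p(Z_i | X_i = x) = N(m_i, S_i) for Z_i ~ N(mu0, Sigma0), X_i ~ N(W Z_i + mu, Sigma)."""
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma0_inv = np.linalg.inv(Sigma0)
    S_i = np.linalg.inv(Sigma0_inv + W.T @ Sigma_inv @ W)          # posterior covariance
    m_i = S_i @ (W.T @ Sigma_inv @ (x - mu) + Sigma0_inv @ mu0)    # posterior mean
    return m_i, S_i

def llg_marginal(W, mu, Sigma, mu0, Sigma0):
    """Marginal p(X_i) = N(W mu0 + mu, Sigma + W Sigma0 W^T)."""
    return W @ mu0 + mu, Sigma + W @ Sigma0 @ W.T
```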



Mixture of Latent Linear Gaussians

▶ $Y_i \sim \text{Cat}(\pi)$, $Z_i \sim \mathcal{N}(\mu_0, \Sigma_0)$
▶ $X_i \mid Z_i, Y_i = k \sim \mathcal{N}(W_k Z_i + \mu_k, \Sigma_k)$
▶ Observations $X = \{X_1, \ldots, X_N\}$, latents $\{Z_1, \ldots, Z_N, Y_1, \ldots, Y_N\}$
▶ Parameters $\theta = \{\pi, W_1, \ldots, W_K, \mu_0, \mu_1, \ldots, \mu_K, \Sigma_0, \Sigma_1, \ldots, \Sigma_K\}$
▶ For simplicity, assume $\mu_0 = 0$, $\Sigma_0 = I$, $\Sigma_k = \Sigma$
▶ Log-likelihood: $l(\theta) = \sum_{i=1}^{N} \log p_\theta(X_i)$
▶ Expected complete log-likelihood: $E_{r,s}\left[\log p_\theta(X, Y, Z)\right]$, where $r_i(Y_i) = p_\theta(Y_i \mid X_i)$ and $s_i(Z_i) = p_\theta(Z_i \mid X_i, Y_i)$



E-M for Mixture of Latent Linear Gaussians
▶ E-step for Y: $r_i(Y_i = c;\, \theta^t) = p_{\theta^t}(Y_i = c \mid X_i) \propto \pi_c^t\, \mathcal{N}(X_i;\; \mu_c^t,\; W_c^t (W_c^t)^T + \Sigma^t)$
▶ E-step for Z: $s_i(Z_i \mid X_i, Y_i = c;\, \theta^t) = p_{\theta^t}(Z_i \mid X_i, Y_i = c) = \mathcal{N}(m_{ic}, \Sigma_{ic})$, where
  $\Sigma_{ic} = (I + (W_c^t)^T (\Sigma^t)^{-1} W_c^t)^{-1}$ and $m_{ic} = \Sigma_{ic}\, (W_c^t)^T (\Sigma^t)^{-1} (X_i - \mu_c^t)$
▶ M-step: estimate $\theta^{t+1}$ by maximizing the expected complete log-likelihood (with $\hat{W}_c = \{W_c, \mu_c\}$ and $q_i = Y_i$); the detailed update equations are not reproduced here



Hidden Markov Model

▶ Consider sequential observations x1 , x2 , . . . , xT


▶ Key assumption in GMM: all the data-points are independent
▶ For sequential applications, this may no longer be true!
▶ e.g. a long audio stream with many speakers
▶ The observation $x_t$ is likely to belong to the same speaker as $x_{t-1}$
▶ There may be a transition pattern from one speaker to another!



Hidden Markov Model

▶ Different values of Z indicate the state of the system (e.g. which speaker is talking)
▶ The system may have K states (decided by the user)
▶ The current state $Z_t$ depends on the previous states $Z_1, \ldots, Z_{t-1}$
▶ Instead of $\text{prob}(Z_t)$, we need $\text{prob}(Z_t \mid Z_{t-1}, \ldots, Z_1)$
▶ Markov assumption: the future is independent of the past, given the present!
▶ Markov model: $\text{prob}(Z_t \mid Z_{t-1}, \ldots, Z_1) = \text{prob}(Z_t \mid Z_{t-1})$
▶ New parameter instead of $\pi$: $A_{ij} = \text{prob}(Z_t = j \mid Z_{t-1} = i)$



Hidden Markov Model

▶ Each state $j \in \{1, \ldots, K\}$ is represented by the parameters $p_j$ of an emission distribution f
▶ Transition distribution from state i to state j: $A_{ij} = \text{prob}(Z_t = j \mid Z_{t-1} = i)$ (a $K \times K$ matrix)
▶ Each row of matrix A is a categorical probability distribution
▶ An initial state distribution $\pi$ (similar to GMM)
▶ $Z_1 \sim \text{Categorical}(\pi)$; $X_1 \sim f(p_{Z_1})$
▶ For each subsequent timestep t (a sampling sketch follows below):
  ▶ Draw state $Z_t \sim \text{Categorical}(A_{Z_{t-1}})$
  ▶ Draw feature vector $X_t \sim f(p_{Z_t})$
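A minimal sketch of sampling from this generative process, assuming Gaussian emissions (the function name sample_hmm and its defaults are illustrative; pi, A and means are NumPy arrays):

```python
import numpy as np

def sample_hmm(T, pi, A, means, emit_std=1.0, seed=0):
    """Z_1 ~ Cat(pi), Z_t ~ Cat(A[Z_{t-1}]), X_t ~ N(means[Z_t], emit_std^2)."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    Z = np.empty(T, dtype=int)
    Z[0] = rng.choice(K, p=pi)                 # initial state
    for t in range(1, T):
        Z[t] = rng.choice(K, p=A[Z[t - 1]])    # transition: row Z_{t-1} of A
    X = means[Z] + emit_std * rng.standard_normal(T)
    return X, Z
```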



Hidden Markov Model

▶ Common emission distributions: Categorical (discrete observations) or Gaussian (real-valued observations)
▶ $X_t$ depends on $Z_t$ only; $Z_t$ depends on $Z_{t-1}$ only
▶ Joint distribution:
  $$\text{prob}(X, Z) = \text{prob}(Z_1)\,\text{prob}(X_1 \mid Z_1) \prod_{t=2}^{T} \text{prob}(Z_t \mid Z_{t-1})\,\text{prob}(X_t \mid Z_t)$$
▶ Rearranging:
  $$\text{prob}(X, Z) = \text{prob}(Z_1) \times \prod_{t=2}^{T} \text{prob}(Z_t \mid Z_{t-1}) \times \prod_{t=1}^{T} \text{prob}(X_t \mid Z_t)$$
▶ $\text{prob}(Z_1)$: $\pi$ (initial state distribution); $\text{prob}(Z_t \mid Z_{t-1})$: A (transition distribution); $\text{prob}(X_t \mid Z_t)$: $f(p)$ (emission distribution)



Forward-Backward Algorithm

Inference problem: given $(\pi, A, p)$, find the posterior distribution $\text{prob}(Z_t \mid X_1, \ldots, X_T)$.

$$\begin{aligned}
\text{prob}(Z_t \mid X_1, \ldots, X_T) &\propto \text{prob}(Z_t, X_1, \ldots, X_T) \\
&= \text{prob}(Z_t, X_1, \ldots, X_t)\,\text{prob}(X_{t+1}, \ldots, X_T \mid Z_t, X_1, \ldots, X_t) \\
&= \text{prob}(Z_t, X_1, \ldots, X_t)\,\text{prob}(X_{t+1}, \ldots, X_T \mid Z_t) \\
&= \alpha_t(Z_t)\,\beta_t(Z_t)
\end{aligned}$$



Forward Algorithm

$$\begin{aligned}
\alpha_t(Z_t) &= \text{prob}(Z_t, X_1, \ldots, X_t) \\
&= \sum_{Z_{t-1}} \text{prob}(Z_t, Z_{t-1}, X_1, \ldots, X_t) \\
&= \sum_{Z_{t-1}} \text{prob}(Z_{t-1}, X_1, \ldots, X_{t-1})\,\text{prob}(Z_t, X_t \mid Z_{t-1}, X_1, \ldots, X_{t-1}) \\
&= \sum_{Z_{t-1}} \alpha_{t-1}(Z_{t-1})\,\text{prob}(Z_t, X_t \mid Z_{t-1}) \\
&= \sum_{Z_{t-1}} \alpha_{t-1}(Z_{t-1})\,\text{prob}(Z_t \mid Z_{t-1})\,\text{prob}(X_t \mid Z_t)
\end{aligned}$$

$$\alpha_1(Z_1) = \prod_{k=1}^{K} \big(\pi_k\, f(X_1; p_k)\big)^{I(Z_1 = k)}, \qquad \text{prob}(Z_t \mid Z_{t-1}) = \prod_{k,l=1}^{K,K} A_{kl}^{I(Z_{t-1} = k,\, Z_t = l)}$$

A code sketch of this recursion is given below.
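A sketch of the forward recursion in NumPy (illustrative names; obs_lik[t, k] stands for the emission likelihood $f(X_t; p_k)$, so the same code works for categorical or Gaussian emissions). In practice the α values shrink quickly, so scaled or log-space versions are used for long sequences:

```python
import numpy as np

def forward(obs_lik, pi, A):
    """alpha[t, k] = prob(Z_{t+1} = k, X_1, ..., X_{t+1}) with 0-based t
    (alpha[0] is the slide's alpha_1)."""
    T, K = obs_lik.shape
    alpha = np.zeros((T, K))
    alpha[0] = pi * obs_lik[0]                      # alpha_1(k) = pi_k f(X_1; p_k)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * obs_lik[t]  # sum over the previous state
    return alpha
```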



Backward Algorithm

$$\begin{aligned}
\beta_t(Z_t) &= \text{prob}(X_{t+1}, \ldots, X_T \mid Z_t) \\
&= \sum_{Z_{t+1}} \text{prob}(Z_{t+1}, X_{t+1}, \ldots, X_T \mid Z_t) \\
&= \sum_{Z_{t+1}} \text{prob}(Z_{t+1}, X_{t+1} \mid Z_t)\,\text{prob}(X_{t+2}, \ldots, X_T \mid Z_{t+1}, X_{t+1}, Z_t) \\
&= \sum_{Z_{t+1}} \text{prob}(Z_{t+1} \mid Z_t)\,\text{prob}(X_{t+1} \mid Z_{t+1})\,\text{prob}(X_{t+2}, \ldots, X_T \mid Z_{t+1}) \\
&= \sum_{Z_{t+1}} \text{prob}(Z_{t+1} \mid Z_t)\,\text{prob}(X_{t+1} \mid Z_{t+1})\,\beta_{t+1}(Z_{t+1})
\end{aligned}$$

$$\beta_{T-1}(Z_{T-1}) = \sum_{Z_T} \text{prob}(X_T, Z_T \mid Z_{T-1}) = \sum_{Z_T} \text{prob}(Z_T \mid Z_{T-1})\,\text{prob}(X_T \mid Z_T)$$

A code sketch of the backward recursion follows.
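A matching sketch of the backward recursion (same illustrative conventions as the forward sketch; the final β is initialised to 1, since the probability of an empty future is 1):

```python
import numpy as np

def backward(obs_lik, A):
    """beta[t, k] = prob(X_{t+2}, ..., X_T | Z_{t+1} = k) with 0-based t."""
    T, K = obs_lik.shape
    beta = np.zeros((T, K))
    beta[T - 1] = 1.0                                    # base case at the final step
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (obs_lik[t + 1] * beta[t + 1])     # sum over the next state
    return beta
```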
Parameter Estimation in HMM

Estimation problem: estimate the parameters $(\pi, A, p)$, even though we do not know Z.

$$\text{prob}(X, Z) = \text{prob}(Z_1)\,\text{prob}(X_1 \mid Z_1) \prod_{t=2}^{T} \text{prob}(Z_t \mid Z_{t-1})\,\text{prob}(X_t \mid Z_t)$$

$$L(\pi, A, p) = \prod_{k=1}^{K} \pi_k^{I(Z_1 = k)} \times \prod_{t=2}^{T} \prod_{k,l=1}^{K,K} A_{kl}^{I(Z_{t-1} = k,\, Z_t = l)} \times \prod_{t=1}^{T} f(X_t; p_{Z_t})$$

Replace $I(Z_1 = k)$ by $\gamma_1(k) = E(I(Z_1 = k))$, and $I(Z_{t-1} = k, Z_t = l)$ by $\xi_t(k, l) = E(I(Z_{t-1} = k, Z_t = l))$.



Baum-Welch Algorithm (E-M)

Input: sequence $\{X_1, \ldots, X_T\}$, emission parameters p.

1. Make initial estimates of the parameters $\pi^0, A^0$
2. Repeat:
   2.1 $\pi_k^1 = \gamma_k = \frac{\pi_k^0\, f(X_1; p_k)}{\sum_{l=1}^{K} \pi_l^0\, f(X_1; p_l)}$
   2.2 $A_{kl}^1 = \frac{\sum_{t=1}^{T-1} \xi_t(k, l)}{\sum_{t=1}^{T-1} \gamma_t(k)}$
   2.3 If $(\pi^0, A^0) \approx (\pi^1, A^1)$, STOP
   2.4 Else set $(\pi^0 = \pi^1, A^0 = A^1)$ and GOTO 2

$$\gamma_t(k) = \frac{\alpha_t(k)\,\beta_t(k)}{\sum_{i=1}^{K} \alpha_t(i)\,\beta_t(i)}, \qquad \xi_t(k, l) = \frac{\alpha_{t-1}(k)\, A_{kl}^0\, f(X_t; p_l)\, \beta_t(l)}{\sum_{i,j=1}^{K,K} \alpha_{t-1}(i)\, A_{ij}^0\, f(X_t; p_j)\, \beta_t(j)}$$

A sketch for computing $\gamma$ and $\xi$ from $\alpha$ and $\beta$ is given below.
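Given α and β from the forward and backward sketches above, γ and ξ can be computed as below (illustrative code; with 0-based indexing, xi[t] is the posterior of the transition from the state at step t to the state at step t+1):

```python
import numpy as np

def posteriors(alpha, beta, A, obs_lik):
    """gamma[t, k] = p(state k at step t | X);
    xi[t, k, l] = p(state k at step t, state l at step t+1 | X)."""
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi_t(k, l) is proportional to alpha_t(k) * A[k, l] * f(X_{t+1}; p_l) * beta_{t+1}(l)
    xi = alpha[:-1, :, None] * A[None, :, :] * (obs_lik[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    return gamma, xi
```

The update in step 2.2 can then be written, for example, as `A1 = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]`.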



Why does E-M work?

▶ Latent variables: Z; observed variables: X; model parameters: θ
▶ The complete-model log-likelihood $\log p_\theta(Z, X)$ cannot be evaluated because Z is latent
▶ Data log-likelihood: $l(\theta) = \log p_\theta(X) = \log \left( \sum_Z p_\theta(X, Z) \right)$
▶ By Jensen's inequality, $\log(E_q(Y)) \geq E_q(\log(Y))$ for any random variable Y and any distribution q
▶ Hence $l(\theta) = \log \left( \sum_Z q(Z)\, \frac{p_\theta(X, Z)}{q(Z)} \right) \geq \sum_Z q(Z) \log \left( \frac{p_\theta(X, Z)}{q(Z)} \right)$
▶ i.e. $l(\theta) \geq E_q\left[ \log \frac{p_\theta(X, Z)}{q(Z)} \right] = Q(\theta, q)$
▶ So for any arbitrary distribution q, $Q(\theta, q)$ is a lower bound on $l(\theta)$



Why does E-M work?

▶ Aim: estimate the parameters θ that maximize $l(\theta)$, which is analytically difficult
▶ Key idea: as an alternative, find a tight lower bound of $l(\theta)$ and maximize it
▶ Find the distribution q(Z) for which $Q(\theta, q)$ is as close to $l(\theta)$ as possible!
▶ $Q(\theta, q) = E_q\left[ \log \frac{p_\theta(X, Z)}{q(Z)} \right] = E_q\left[ \log \frac{p_\theta(Z \mid X)\, p_\theta(X)}{q(Z)} \right] = E_q\left[ \log \frac{p_\theta(Z \mid X)}{q(Z)} \right] + \log p_\theta(X)$
▶ Rearranging terms, we have $l(\theta) = Q(\theta, q) + KL\big(q(Z)\,\|\,p_\theta(Z \mid X)\big)$
▶ If $q(Z) = p_\theta(Z \mid X)$, the KL divergence is 0, i.e. $l(\theta) = Q(\theta, q)$ (a tight lower bound)



Why does E-M work?

▶ Now that we have found a tight lower bound, we need to estimate the parameters θ that maximize it
▶ But since we cannot maximize it directly, we do so iteratively!
▶ Current estimate of the parameters: $\theta^t$
▶ $q^t(Z) = p_{\theta^t}(Z \mid X)$, i.e. the conditional distribution of the latent variables w.r.t. the current estimate of the parameters
▶ This can be numerically evaluated using the model
▶ $Q(\theta, q^t) = E_{q^t}\left[ \log \frac{p_\theta(X, Z)}{q^t(Z)} \right] = E_{q^t}\left[ \log p_\theta(X, Z) \right] + C$
▶ Calculating this is equivalent to the E-step (calculating the expected complete likelihood w.r.t. the current parameter estimate)!



Why does E-M work?

▶ $\theta^{t+1} = \arg\max_\theta Q(\theta, q^t) = \arg\max_\theta E_{q^t}\left[ \log p_\theta(X, Z) \right]$
▶ This is equivalent to maximizing the expected complete likelihood (the M-step)
▶ $l(\theta^{t+1}) \geq Q(\theta^{t+1}, q^t)$ (Jensen's inequality)
▶ $Q(\theta^{t+1}, q^t) \geq Q(\theta^t, q^t)$ (as $\theta^{t+1}$ maximizes $Q(\theta, q^t)$)
▶ But $Q(\theta^t, q^t) = l(\theta^t)$ (tight lower bound)
▶ Combining, we have $l(\theta^{t+1}) \geq l(\theta^t)$, i.e. with each iteration the data likelihood increases (or stays the same)
▶ Hence, the E-M algorithm must converge to a maximum (local or global, depending on the initial values $\theta^0$)



General Framework of E-M

▶ Find an analytical expression for the complete log-likelihood of the model, $\log p_\theta(Z, X)$
▶ Find an analytical expression for the expected complete log-likelihood $E_q\left[ \log p_\theta(Z, X) \right]$, where the expectation is taken with respect to $q(Z) = p_\theta(Z \mid X)$, i.e. the posterior of the latent variables given the observations
▶ Initialize the parameters $\theta^0$
▶ E-step: numerically calculate the posterior of the latent variables under the current parameter estimate, $q^0(Z) = p_{\theta^0}(Z \mid X)$
▶ M-step: choose $\theta^1$ to maximize the expected complete log-likelihood $E_{q^0}\left[ \log p_\theta(Z, X) \right]$
▶ Repeat the E and M steps until θ converges (a generic skeleton is sketched below)
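A generic skeleton of this framework, as a hedged sketch (e_step and m_step are hypothetical model-specific callables supplied by the user, not functions defined in the slides):

```python
def em(x, theta0, e_step, m_step, n_iter=100):
    """Generic E-M loop: e_step(x, theta) returns q = p_theta(Z | X);
    m_step(x, q) returns argmax_theta of E_q[log p_theta(Z, X)]."""
    theta = theta0
    for _ in range(n_iter):
        q = e_step(x, theta)          # E-step: posterior of the latent variables
        new_theta = m_step(x, q)      # M-step: maximize the expected complete log-likelihood
        if new_theta == theta:        # in practice, a tolerance-based convergence check
            break
        theta = new_theta
    return theta
```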

