ML Lecture 19
I Method of moments
I LVMs: single topic model, Gaussian mixture model, multiview model
I Introduction to tensors
I MoM for LVMs using tensor decomposition techniques
⇒ Consistent learning algorithms!
Spectral Methods
Method of Moments
Tensors
Conclusion
Density Estimation: Learning from Data
S = {x1 , · · · , xn }
⇒ f̂ (density estimate)
Learning from Data: Gaussian
N(x; µ, σ²)
S = {x1 , · · · , xn }

E[x] = µ ≈ (1/n) Σ_{i=1}^n xi
E[x²] = σ² + µ² ≈ (1/n) Σ_{i=1}^n xi²

⇒ µ̂, σ̂²
Learning from Data: Method of Moments (Pearson, 1894)
f (x; θ1 , · · · , θk )
S = {x1 , · · · , xn }
E[x] = g1 (θ1 , · · · , θk ) ≈ (1/n) Σ_{i=1}^n xi
E[x²] = g2 (θ1 , · · · , θk ) ≈ (1/n) Σ_{i=1}^n xi²
⋮
E[x^k] = gk (θ1 , · · · , θk ) ≈ (1/n) Σ_{i=1}^n xi^k

⇒ θ̂1 , · · · , θ̂k
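To make the recipe concrete, here is a small numpy/scipy sketch (not from the lecture) that matches the first two empirical moments and solves the resulting moment equations numerically with scipy.optimize.fsolve; the choice of the Gamma(α, β) family and the name mom_estimate are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import fsolve

# Illustrative two-parameter family: Gamma(alpha, beta) with
#   g1(alpha, beta) = E[X]   = alpha * beta
#   g2(alpha, beta) = E[X^2] = alpha * beta**2 + (alpha * beta)**2
def mom_estimate(x):
    m1, m2 = np.mean(x), np.mean(x ** 2)            # empirical moments
    def moment_equations(theta):
        a, b = theta
        return [a * b - m1, a * b ** 2 + (a * b) ** 2 - m2]
    return fsolve(moment_equations, x0=[1.0, 1.0])  # solve g(theta) = empirical moments

rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=2.0, size=50_000)
print(mom_estimate(x))  # roughly [3.0, 2.0]
```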
Method of Moments: Gaussian distribution
I If X follows a normal distribution N(µ, σ²):

E[X] = µ
E[X²] = σ² + µ²

⇒ µ̂ = (1/n) Σ_{i=1}^n xi and σ̂² = (1/n) Σ_{i=1}^n xi² − µ̂²
⇒ Here MoM and ML estimators are equal but this is not always the
case (e.g. uniform distribution).
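A minimal numpy sketch of these closed-form estimators (the function name mom_gaussian is mine):

```python
import numpy as np

def mom_gaussian(x):
    """Method-of-moments estimates of (mu, sigma^2) from i.i.d. samples x."""
    m1 = np.mean(x)            # empirical E[X]
    m2 = np.mean(x ** 2)       # empirical E[X^2]
    return m1, m2 - m1 ** 2    # mu_hat, sigma2_hat

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)
print(mom_gaussian(x))  # approximately (2.0, 2.25)
```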
Method of Moments: Binomial distribution
I If X follows a binomial distribution B(k, p), then X = Σ_{i=1}^k Bi where each Bi follows a Bernoulli with parameter p. Hence,

E[X] = E[B1 + · · · + Bk ] = Σ_{i=1}^k E[Bi ] = kp
E[X²] = E[(B1 + · · · + Bk )²] = k²p² + kp(1 − p)
k̂ = m1²/(m1² + m1 − m2 ) → k as n → ∞
p̂ = (m1² + m1 − m2 )/m1 → p as n → ∞

where m1 = (1/n) Σ_i Xi and m2 = (1/n) Σ_i Xi².
I 0 ≤ p̂ ≤ 1 but k̂ may not be an integer.
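A small numpy sketch of the estimators above (the function name mom_binomial is mine); in practice k̂ would be rounded to the nearest integer.

```python
import numpy as np

def mom_binomial(x):
    """Method-of-moments estimates (k_hat, p_hat) for B(k, p) from samples x."""
    m1 = np.mean(x)               # empirical E[X] = kp
    m2 = np.mean(x ** 2)          # empirical E[X^2] = k^2 p^2 + kp(1 - p)
    k_hat = m1 ** 2 / (m1 ** 2 + m1 - m2)   # not an integer in general
    p_hat = (m1 ** 2 + m1 - m2) / m1
    return k_hat, p_hat

rng = np.random.default_rng(0)
x = rng.binomial(n=10, p=0.3, size=100_000)
print(mom_binomial(x))  # roughly (10, 0.3)
```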
Method of Moments: Multivariate case
In the multivariate case x ∈ Rd, higher-order moments are tensors, e.g. E[x ⊗ x ⊗ x] ∈ Rd×d×d.
Overview
Method of Moments
Tensors
Conclusion
Tensors
(a) Mode-1 (column) fibers: x:jk   (b) Mode-2 (row) fibers: xi:k   (c) Mode-3 (tube) fibers: xij:
fig. from [Kolda and Bader, Tensor decompositions and applications, 2009].
Tensors: Matricizations
I T ∈ Rd1 ×d2 ×d3 can be reshaped into a matrix, e.g. the mode-1 matricization T(1) ∈ Rd1 ×d2 d3 , whose columns are the mode-1 fibers of T .
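A small numpy sketch of matricization (the helper name unfold is mine; the column ordering produced by reshape may differ from the Kolda and Bader convention, but the ranks of the unfoldings are unaffected):

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricization: arrange the mode-n fibers of T as columns."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

T = np.arange(24).reshape(2, 3, 4)   # T in R^{2 x 3 x 4}
print(unfold(T, 0).shape)  # (2, 12)  ->  T_(1)
print(unfold(T, 1).shape)  # (3, 8)   ->  T_(2)
print(unfold(T, 2).shape)  # (4, 6)   ->  T_(3)
```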
Tensors: Multiplication with Matrices
ex: If T ∈ Rd1 ×d2 ×d3 and B ∈ Rm2 ×d2 , then T ×2 B ∈ Rd1 ×m2 ×d3 and

(T ×2 B)i1 ,i2 ,i3 = Σ_{k=1}^{d2} T i1 ,k,i3 Bi2 ,k   for all i1 ∈ [d1 ], i2 ∈ [m2 ], i3 ∈ [d3 ].
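The mode-n product is a one-line einsum; a sketch for the mode-2 case above (the function name is mine):

```python
import numpy as np

def mode2_product(T, B):
    """(T x_2 B)[i1, i2, i3] = sum_k T[i1, k, i3] * B[i2, k]."""
    return np.einsum('ikl,jk->ijl', T, B)

T = np.random.rand(2, 3, 4)
B = np.random.rand(5, 3)
print(mode2_product(T, B).shape)  # (2, 5, 4)
```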
Tensors are not easy...
Abstract. The idea that one might extend numerical linear algebra, the collection of matrix computational methods that form the workhorse of scientific and engineering computing, to numerical multilinear algebra, an analogous collection of tools involving hypermatrices/tensors, appears very promising and has attracted a lot of attention recently. We examine here the computational tractability of some core problems in numerical multilinear algebra. We show that tensor analogues of several standard problems that are readily computable in the matrix (i.e. 2-tensor) case are NP-hard. Our list here includes: determining the feasibility of a system of bilinear equations, determining an eigenvalue, a singular value, or the spectral norm of a 3-tensor, determining a best rank-1 approximation to a 3-tensor, determining the rank of a 3-tensor over R or C. Hence making tensor computations feasible is likely to be a challenge.
[Hillar and Lim, Most tensor problems are NP-hard, Journal of the ACM, 2013.]
Tensors vs. Matrices: Rank
I The rank of a matrix M is:
I the number of linearly independent columns of M
I the number of linearly independent rows of M
I the smallest integer R such that M can be written as a sum of R rank-one matrices:

M = Σ_{i=1}^R ui vi⊤ .
I The multilinear rank of a tensor T is a tuple of integers
(R1 , R2 , R3 ) where Rn is the number of linearly independent
mode-n fibers of T :
Rn = rank(T(n) )
I Tucker decomposition:
fig. from [Kolda and Bader, Tensor decompositions and applications, 2009].
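To make both notions concrete, here is a small numpy sketch (helper names are mine) that computes the multilinear rank from the matricizations and a Tucker decomposition via the higher-order SVD, one simple though not optimal construction:

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def multilinear_rank(T):
    """(R1, R2, R3): rank of each mode-n matricization of T."""
    return tuple(np.linalg.matrix_rank(unfold(T, n)) for n in range(T.ndim))

def hosvd(T):
    """Higher-order SVD: Tucker decomposition T = G x1 U1 x2 U2 x3 U3
    with orthogonal factor matrices Un and core tensor G."""
    Us = [np.linalg.svd(unfold(T, n), full_matrices=False)[0] for n in range(T.ndim)]
    G = np.einsum('abc,ai,bj,ck->ijk', T, Us[0], Us[1], Us[2])  # core tensor
    return G, Us

T = np.random.rand(4, 5, 6)
print(multilinear_rank(T))      # (4, 5, 6) for a generic tensor
G, Us = hosvd(T)
T_rec = np.einsum('ijk,ai,bj,ck->abc', G, Us[0], Us[1], Us[2])
print(np.allclose(T, T_rec))    # True: exact reconstruction without truncation
```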
Hardness results
Overview
Method of Moments
Tensors
Conclusion
Tensor Decomposition for Learning Latent Variable Models
Latent Variable Model: f(x) = Σ_{i=1}^k wi fi (x; µi )

S = {x1 , · · · , xn } ⊂ Rd

Structure in the Low Order Moments:
E[x ⊗ x] = g1 (Σ_{i=1}^k wi µi ⊗ µi )
E[x ⊗ x ⊗ x] = g2 (Σ_{i=1}^k wi µi ⊗ µi ⊗ µi )
Single Topic Model
I Documents modeled as bags of words:
I Vocabulary of d words
I k different topics
I ℓ words per document
I Documents are drawn as follows:
(1) Draw a topic h randomly with probability P[h = j] = wj for j ∈ [k]
(2) Draw ℓ words independently according to the distribution µh ∈ ∆d−1
I Using one-hot encoding for the words x1 , · · · , xℓ ∈ Rd in a
document, we have
E[x1 | h = j] = µj
E[x1 ⊗ x2 | h = j] = µj ⊗ µj
E[x1 ⊗ x2 ⊗ x3 | h = j] = µj ⊗ µj ⊗ µj
I Marginalizing over the topic h:

E[x1 ⊗ x2 ⊗ x3 ] = Σ_{j=1}^k P[h = j] E[x1 ⊗ x2 ⊗ x3 | h = j] = Σ_{j=1}^k wj µj ⊗ µj ⊗ µj

(and similarly E[x1 ⊗ x2 ] = Σ_{j=1}^k wj µj ⊗ µj ).
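A minimal sketch of how these moments could be estimated empirically; representing documents as lists of word indices and the function name are my assumptions, and with one-hot encodings the outer products reduce to index counts:

```python
import numpy as np

def topic_model_moments(docs, d):
    """Empirical estimates of E[x1 (x) x2] and E[x1 (x) x2 (x) x3] for the
    single topic model; docs is a list of word-index sequences of length >= 3.
    Words in a document are i.i.d. given the topic, so any three distinct
    positions can be used."""
    M2 = np.zeros((d, d))
    M3 = np.zeros((d, d, d))
    for doc in docs:
        w1, w2, w3 = doc[0], doc[1], doc[2]
        M2[w1, w2] += 1.0
        M3[w1, w2, w3] += 1.0
    return M2 / len(docs), M3 / len(docs)

docs = [[0, 2, 1, 2], [3, 3, 0], [1, 0, 2]]  # toy corpus, vocabulary size d = 4
M2_hat, M3_hat = topic_model_moments(docs, d=4)
```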
Mixture of Spherical Gaussians
I Mixture of k Gaussians with the same covariance σ²I:
(1) Draw a Gaussian h randomly with probability P[h = j] = wj for
j ∈ [k]
(2) Draw x from the multivariate normal N (µh , σ 2 I)
I Hence

M2 = E[x ⊗ x] − σ²I = Σ_{j=1}^k wj µj ⊗ µj

M3 = E[x ⊗ x ⊗ x] − σ² Σ_{i=1}^d (E[x] ⊗ ei ⊗ ei + ei ⊗ E[x] ⊗ ei + · · · ) = Σ_{j=1}^k wj µj ⊗ µj ⊗ µj
I Moreover, for the covariance (with µ̄ = E[x]) we have

S = E[(x − µ̄) ⊗ (x − µ̄)] = Σ_{j=1}^k wj (µj − µ̄) ⊗ (µj − µ̄) + σ²I

so σ² is the smallest eigenvalue of S. Indeed, let A = Σ_{j=1}^k wj (µj − µ̄) ⊗ (µj − µ̄). A is p.s.d. and has rank r ≤ k − 1 < d. Hence if U diagonalizes A we have UAU⊤ = D where D is diagonal with its first d − r diagonal entries equal to 0. The result follows from observing that USU⊤ = D + σ²I.
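A sketch of how the observation above can be turned into an estimator: σ² is read off as the smallest eigenvalue of the empirical covariance, then subtracted from the empirical second moment to form M2 (the function name and finite-sample details are my assumptions):

```python
import numpy as np

def spherical_gmm_M2(X):
    """X is an (n x d) array of samples from a mixture of spherical Gaussians
    with common variance sigma^2. Estimate sigma^2 as the smallest eigenvalue
    of the empirical covariance and form M2 = E[x (x) x] - sigma^2 I."""
    n, d = X.shape
    S = np.cov(X, rowvar=False, bias=True)   # empirical covariance
    sigma2_hat = np.linalg.eigvalsh(S)[0]    # eigenvalues in ascending order
    M2_hat = (X.T @ X) / n - sigma2_hat * np.eye(d)
    return sigma2_hat, M2_hat
```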
Mixture of Spherical Gaussians
I When each spherical Gaussian has its own variance σj², analogous formulas for M2 and M3 hold; see [Anandkumar et al., Tensor decompositions for learning latent variable models, JMLR 2014].
Overview
Method of Moments
Tensors
Conclusion
Tensor Decomposition for Learning Latent Variable Models
Latent Variable Model: f(x) = Σ_{i=1}^k wi fi (x; µi )

S = {x1 , · · · , xn } ⊂ Rd

Structure in the Low Order Moments:
E[x ⊗ x] = g1 (Σ_{i=1}^k wi µi ⊗ µi )
E[x ⊗ x ⊗ x] = g2 (Σ_{i=1}^k wi µi ⊗ µi ⊗ µi )
M̂2 ≈ Σ_{i=1}^k wi µi ⊗ µi
M̂3 ≈ Σ_{i=1}^k wi µi ⊗ µi ⊗ µi

?  ⇒  ŵi , µ̂i

Assumptions:
I k≤d
I µ1 , · · · , µk ∈ Rd are linearly independent
I w1 , · · · , wk ∈ R are strictly positive real numbers
Method of Moments with Tensor Decomposition
Tx := T •1 x = U Dx U⊤ (contracting the tensor along one mode with a vector x gives a matrix diagonalized in the same basis U for every x)
M3 = Σ_{i=1}^k wi µi ⊗ µi ⊗ µi
Overview:
1. Use M2 to transform the tensor M3 into an orthogonally decomposable tensor: i.e. find W ∈ Rk×d such that

T = M3 ×1 W ×2 W ×3 W = Σ_{i=1}^k w̃i µ̃i ⊗ µ̃i ⊗ µ̃i
Orthonormalization

M2 = Σ_{i=1}^k wi µi ⊗ µi ,   M3 = Σ_{i=1}^k wi µi ⊗ µi ⊗ µi

I M2 = Σ_{i=1}^k wi µi ⊗ µi = UDU⊤ (eigendecomposition of M2).
I W = D^{−1/2} U⊤ ∈ Rk×d and µ̃i = √wi Wµi ∈ Rk .
I We have µ̃i⊤ µ̃j = δij for all i, j, because
  I = W M2 W⊤ = W (Σ_{i=1}^k wi µi µi⊤) W⊤ = Σ_{i=1}^k µ̃i µ̃i⊤

⇒ T = M3 ×1 W ×2 W ×3 W = Σ_{i=1}^k (1/√wi) µ̃i ⊗ µ̃i ⊗ µ̃i .
I Since UU⊤ µi = µi for all i, we have W^+ µ̃i = √wi µi .
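A minimal numpy sketch of this whitening step (the function name and the use of the top-k eigenpairs of an empirical M̂2 are my assumptions; it presumes M2 is symmetric p.s.d. with at least k positive eigenvalues, as guaranteed by the assumptions above):

```python
import numpy as np

def whiten(M2, M3, k):
    """Build W = D^{-1/2} U^T from the top-k eigenpairs of M2 and return the
    whitened tensor T = M3 x1 W x2 W x3 W, a k x k x k tensor that is
    orthogonally decomposable under the model assumptions."""
    eigvals, eigvecs = np.linalg.eigh(M2)
    idx = np.argsort(eigvals)[::-1][:k]       # top-k eigenpairs
    D, U = eigvals[idx], eigvecs[:, idx]
    W = np.diag(1.0 / np.sqrt(D)) @ U.T       # W in R^{k x d}
    T = np.einsum('abc,ia,jb,kc->ijk', M3, W, W, W)
    return T, W
```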
Orthogonal Tensor Decomposition
M3 = Σ_{i=1}^k wi µi ⊗ µi ⊗ µi
Let θ0 ∈ Rk and suppose that |w̃1 · µ̃1⊤ θ0 | > |w̃j · µ̃j⊤ θ0 | > 0 for all j > 1.
For t = 1, 2, · · · , define

θt = (T •1 θt−1 •2 θt−1 ) / ‖T •1 θt−1 •2 θt−1 ‖   and   λt = T •1 θt •2 θt •3 θt

Then θt → µ̃1 and λt → w̃1 .
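A small numpy sketch of this power iteration on the whitened tensor T (the function name is mine; with a random start it converges, generically, to one of the pairs (w̃i, µ̃i) rather than to the first one specifically, and the remaining components can be recovered by deflation, i.e. repeating on T − λ θ⊗θ⊗θ, as in Anandkumar et al. 2014):

```python
import numpy as np

def tensor_power_iteration(T, n_iter=100, seed=0):
    """One eigenpair (lambda, theta) of an orthogonally decomposable k x k x k tensor T."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=T.shape[0])
    theta /= np.linalg.norm(theta)
    for _ in range(n_iter):
        theta = np.einsum('ijk,i,j->k', T, theta, theta)    # T •1 theta •2 theta
        theta /= np.linalg.norm(theta)
    lam = np.einsum('ijk,i,j,k->', T, theta, theta, theta)  # T •1 theta •2 theta •3 theta
    return lam, theta
```

In the noiseless setting the recovered pair maps back to the model parameters via wi = 1/λ² and µi = λ W^+ θ, consistent with the relation W^+ µ̃i = √wi µi above.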
Overview
Method of Moments
Tensors
Conclusion
Conclusion