Lecture 17: Method of Moments, Latent Variable Models,
and Tensor Decomposition Techniques

▶ Method of moments
▶ LVM: single topic model, Gaussian mixture model, multiview model
▶ Introduction to tensors
▶ MoM for LVMs using tensor decomposition techniques
⇒ Consistent learning algorithms!
Spectral Methods

▶ Spectral methods are an alternative to EM for learning latent variable
  models (e.g. HMMs in the previous lecture, single-topic / Gaussian
  mixture models in this one).
▶ Spectral methods usually achieve learning by extracting structure
  from observable quantities through eigen-decompositions / tensor
  decompositions.
▶ Advantages of spectral methods:
  ▶ computationally efficient,
  ▶ consistent,
  ▶ no local optima.
Overview

Method of Moments

Tensors

Structure in the Low-Order Moments of Latent Variable Models
    Single Topic Model
    Mixture of Spherical Gaussians

Method of Moments via Tensor Decomposition
    Jennrich's algorithm
    Tensor Power Method / (Simultaneous) Diagonalization

Conclusion
Density Estimation: Learning from Data

S = {x1, . . . , xn}  ⟶  f̂
Learning from Data: Gaussian

N(x; µ, σ^2)

S = {x1, . . . , xn}

      E[x]   = µ            ≈ (1/n) Σ_{i=1}^n xi
      E[x^2] = σ^2 + µ^2    ≈ (1/n) Σ_{i=1}^n xi^2

⟶  µ̂, σ̂^2
Learning from Data: Method of Moments (Pearson, 1894)

f(x; θ1, . . . , θk)

S = {x1, . . . , xn}

      E[x]    = g1(θ1, . . . , θk)  ≈ (1/n) Σ_{i=1}^n xi
      E[x^2]  = g2(θ1, . . . , θk)  ≈ (1/n) Σ_{i=1}^n xi^2
        ...
      E[x^k]  = gk(θ1, . . . , θk)  ≈ (1/n) Σ_{i=1}^n xi^k

⟶  θ̂1, . . . , θ̂k
Method of Moments: Gaussian distribution

▶ If X follows a normal distribution N(µ, σ^2):

      E[X]   = µ
      E[X^2] = σ^2 + µ^2

▶ If S = {X1, . . . , Xn} are i.i.d. from N(µ, σ^2), by the law of large
  numbers:

      µ̂    = (1/n) Σ_{i=1}^n Xi             →n  µ
      σ̂^2  = (1/n) Σ_{i=1}^n Xi^2 − µ̂^2     →n  σ^2

⇒ Here the MoM and ML estimators are equal, but this is not always the
  case (e.g. the uniform distribution).
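A quick NumPy sketch of these two moment estimators (all variable names here are illustrative, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)

# Simulated i.i.d. sample from N(mu, sigma^2)
mu_true, sigma_true, n = 2.0, 1.5, 100_000
X = rng.normal(mu_true, sigma_true, size=n)

# Method of moments: match the first two empirical moments
m1 = X.mean()           # estimates E[X] = mu
m2 = (X ** 2).mean()    # estimates E[X^2] = sigma^2 + mu^2

mu_hat = m1
sigma2_hat = m2 - m1 ** 2

print(mu_hat, np.sqrt(sigma2_hat))   # close to (2.0, 1.5) for large n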
Method of Moments: Binomial distribution

▶ If X follows a binomial distribution B(k, p), then X = Σ_{i=1}^k Bi
  where each Bi follows a Bernoulli with parameter p. Hence,

      E[X]   = E[B1 + · · · + Bk] = Σ_{i=1}^k E[Bi] = kp
      E[X^2] = E[(B1 + · · · + Bk)^2] = k^2 p^2 + kp(1 − p)

▶ If S = {X1, . . . , Xn} are i.i.d. from B(k, p), by the LLN:

      k̂ = m1^2 / (m1^2 + m1 − m2)    →n  k
      p̂ = (m1^2 + m1 − m2) / m1      →n  p

  where m1 = (1/n) Σ_i Xi and m2 = (1/n) Σ_i Xi^2.

▶ 0 ≤ p̂ ≤ 1 but k̂ may not be an integer.
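A matching sketch for the binomial case (a toy check under assumed parameter values):

import numpy as np

rng = np.random.default_rng(1)

# Simulated i.i.d. sample from B(k, p)
k_true, p_true, n = 10, 0.3, 100_000
X = rng.binomial(k_true, p_true, size=n)

m1 = X.mean()           # ≈ kp
m2 = (X ** 2).mean()    # ≈ k^2 p^2 + kp(1 - p)

k_hat = m1 ** 2 / (m1 ** 2 + m1 - m2)
p_hat = (m1 ** 2 + m1 - m2) / m1

print(k_hat, p_hat)     # close to (10, 0.3); note k_hat need not be an integer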
Method of Moments: Multivariate case

▶ What if the random variable x takes its values in R^d?

▶ Let's look at the multivariate normal. If x ∼ N(µ, Σ), the first
  and second moments are

      E[x] = µ   and   E[xx⊤] = Σ + µµ⊤

▶ What if we need higher order moments? The second order moment
  is E[xx⊤], but what is e.g. the third order moment?

      E[x ⊗ x ⊗ x]
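The third-order moment is a d × d × d array with entries E[xi xj xk]. A sketch of one way to compute its empirical estimate (assuming NumPy; names are illustrative):

import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 10_000
X = rng.normal(size=(n, d))   # n samples of a random vector in R^d

M1 = X.mean(axis=0)                            # E[x], shape (d,)
M2 = np.einsum("na,nb->ab", X, X) / n          # E[x x^T], shape (d, d)
M3 = np.einsum("na,nb,nc->abc", X, X, X) / n   # E[x ⊗ x ⊗ x], shape (d, d, d)

print(M1.shape, M2.shape, M3.shape)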
Overview

Method of Moments

Tensors

Structure in the Low-Order Moments of Latent Variable Models
    Single Topic Model
    Mixture of Spherical Gaussians

Method of Moments via Tensor Decomposition
    Jennrich's algorithm
    Tensor Power Method / (Simultaneous) Diagonalization

Conclusion
Tensors

M ∈ R^{d1×d2}                              T ∈ R^{d1×d2×d3}
Mij ∈ R for i ∈ [d1], j ∈ [d2]             T ijk ∈ R for i ∈ [d1], j ∈ [d2], k ∈ [d3]

▶ Outer product. If u ∈ R^{d1}, v ∈ R^{d2}, w ∈ R^{d3}:

      u ⊗ v = uv⊤ ∈ R^{d1×d2},             (u ⊗ v)i,j = ui vj
      u ⊗ v ⊗ w ∈ R^{d1×d2×d3},            (u ⊗ v ⊗ w)i,j,k = ui vj wk
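These definitions are easy to check numerically; a minimal sketch (assuming NumPy):

import numpy as np

u = np.array([1.0, 2.0])          # R^2
v = np.array([3.0, 4.0, 5.0])     # R^3
w = np.array([6.0, 7.0])          # R^2

uv = np.einsum("i,j->ij", u, v)          # u ⊗ v = u v^T, shape (2, 3)
uvw = np.einsum("i,j,k->ijk", u, v, w)   # u ⊗ v ⊗ w, shape (2, 3, 2)

assert np.allclose(uv, np.outer(u, v))
assert np.isclose(uvw[1, 2, 0], u[1] * v[2] * w[0])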


Tensors: mode-n fibers

▶ Matrices have rows and columns, tensors have fibers¹:

  [Figure: fibers of a 3rd-order tensor.
   (a) Mode-1 (column) fibers: x:jk   (b) Mode-2 (row) fibers: xi:k   (c) Mode-3 (tube) fibers: xij:]

¹ fig. from [Kolda and Bader, Tensor decompositions and applications, 2009].
Tensors: Matricizations

▶ T ∈ R^{d1×d2×d3} can be reshaped into a matrix as

      T(1) ∈ R^{d1×d2d3}
      T(2) ∈ R^{d2×d1d3}
      T(3) ∈ R^{d3×d1d2}

  [Figure: T and its mode-1 matricization T(1).]
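A sketch of the mode-n unfolding in NumPy (the helper name and the column ordering are one possible choice; Kolda and Bader fix a specific ordering of the remaining modes):

import numpy as np

def unfold(T, mode):
    """Mode-`mode` matricization: rows indexed by the chosen mode,
    columns by the remaining modes (one particular column ordering)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

T = np.arange(24.0).reshape(2, 3, 4)   # d1 = 2, d2 = 3, d3 = 4
print(unfold(T, 0).shape)   # (2, 12)
print(unfold(T, 1).shape)   # (3, 8)
print(unfold(T, 2).shape)   # (4, 6)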
Tensors: Multiplication with Matrices

AMB⊤ ∈ R^{m1×m2}                 T ×1 A ×2 B ×3 C ∈ R^{m1×m2×m3}

For vectors, we write T •n v = T ×n v⊤

ex: If T ∈ R^{d1×d2×d3} and B ∈ R^{m2×d2}, then T ×2 B ∈ R^{d1×m2×d3} and

      (T ×2 B)i1,i2,i3 = Σ_{k=1}^{d2} T i1,k,i3 Bi2,k    for all i1 ∈ [d1], i2 ∈ [m2], i3 ∈ [d3].
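A sketch of the mode-n product for a 3rd-order tensor (the helper is illustrative, written with np.einsum):

import numpy as np

def mode_n_product(T, M, mode):
    """Multiply a 3rd-order tensor T by a matrix M along `mode`:
    contracts the mode-th index of T with the second index of M."""
    subs = {0: "ajk,ia->ijk", 1: "iak,ja->ijk", 2: "ija,ka->ijk"}
    return np.einsum(subs[mode], T, M)

rng = np.random.default_rng(3)
d1, d2, d3, m2 = 2, 3, 4, 5
T = rng.normal(size=(d1, d2, d3))
B = rng.normal(size=(m2, d2))

print(mode_n_product(T, B, mode=1).shape)   # (2, 5, 4), i.e. R^{d1 x m2 x d3}

# For a vector v in R^{d2}, T •2 v = T ×2 v^T drops that mode:
v = np.ones(d2)
print(np.einsum("iak,a->ik", T, v).shape)   # (2, 4)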
Tensors are not easy...

MOST TENSOR PROBLEMS ARE NP-HARD
CHRISTOPHER J. HILLAR AND LEK-HENG LIM

Abstract. The idea that one might extend numerical linear algebra, the collection of matrix
computational methods that form the workhorse of scientific and engineering computing, to
numerical multilinear algebra, an analogous collection of tools involving hypermatrices/tensors,
appears very promising and has attracted a lot of attention recently. We examine here the
computational tractability of some core problems in numerical multilinear algebra. We show that
tensor analogues of several standard problems that are readily computable in the matrix (i.e.
2-tensor) case are NP-hard. Our list here includes: determining the feasibility of a system of
bilinear equations, determining an eigenvalue, a singular value, or the spectral norm of a 3-tensor,
determining a best rank-1 approximation to a 3-tensor, determining the rank of a 3-tensor over
R or C. Hence making tensor computations feasible is likely to be a challenge.

[Hillar and Lim, Most tensor problems are NP-hard, Journal of the ACM, 2013.]
Tensors vs. Matrices: Rank

▶ The rank of a matrix M is:
  ▶ the number of linearly independent columns of M
  ▶ the number of linearly independent rows of M
  ▶ the smallest integer R such that M can be written as a sum of R
    rank-one matrices:

        M = Σ_{i=1}^R ui vi⊤.

▶ The multilinear rank of a tensor T is a tuple of integers
  (R1, R2, R3) where Rn is the number of linearly independent
  mode-n fibers of T:

        Rn = rank(T(n))

▶ The CP rank of T is the smallest integer R such that T can be
  written as a sum of R rank-one tensors:

        T = Σ_{i=1}^R ui ⊗ vi ⊗ wi.
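A sketch computing the multilinear rank of a CP-rank-2 tensor from its three unfoldings (reusing the illustrative unfold helper from above):

import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

rng = np.random.default_rng(5)

# A tensor with CP rank 2: the sum of two rank-one tensors
u1, v1, w1 = rng.normal(size=4), rng.normal(size=5), rng.normal(size=6)
u2, v2, w2 = rng.normal(size=4), rng.normal(size=5), rng.normal(size=6)
T = np.einsum("i,j,k->ijk", u1, v1, w1) + np.einsum("i,j,k->ijk", u2, v2, w2)

multilinear_rank = tuple(np.linalg.matrix_rank(unfold(T, n)) for n in range(3))
print(multilinear_rank)   # (2, 2, 2) with probability one for random factors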
CP and Tucker decomposition

▶ CP decomposition²: [figure]

▶ Tucker decomposition: [figure]

² fig. from [Kolda and Bader, Tensor decompositions and applications, 2009].
Hardness results

▶ These problems are all NP-hard for tensors of order ≥ 3 in general:
  ▶ Compute the CP rank of a given tensor
  ▶ Find the best approximation with CP rank R of a given tensor
  ▶ Find the best approximation with multilinear rank (R1, · · · , Rp) of a
    given tensor (*)
  ▶ ...
▶ On the positive side:
  ▶ Computing the multilinear rank is easy, and efficient algorithms exist
    for (*).
  ▶ Under mild conditions, the CP decomposition is unique (modulo
    scaling and permutations).

⇒ Very relevant for model identifiability...
Overview

Method of Moments

Tensors

Structure in the Low-Order Moments of Latent Variable Models
    Single Topic Model
    Mixture of Spherical Gaussians

Method of Moments via Tensor Decomposition
    Jennrich's algorithm
    Tensor Power Method / (Simultaneous) Diagonalization

Conclusion
Tensor Decomposition for Learning Latent Variable Models

Latent Variable Model: f(x) = Σ_{i=1}^k wi fi(x; µi)

S = {x1, . . . , xn} ⊂ R^d

Structure in the Low Order Moments:
      E[x ⊗ x]      = g1( Σ_{i=1}^k wi µi ⊗ µi )
      E[x ⊗ x ⊗ x]  = g2( Σ_{i=1}^k wi µi ⊗ µi ⊗ µi )

Tensor Power Method  ⟶  ŵi, µ̂i
Single Topic Model

▶ Documents modeled as bags of words:
  ▶ Vocabulary of d words
  ▶ k different topics
  ▶ ℓ words per document
▶ Documents are drawn as follows:
  (1) Draw a topic h randomly with probability P[h = j] = wj for j ∈ [k]
  (2) Draw ℓ words independently according to the distribution µh ∈ ∆^{d−1}
⇒ Words are independent given the topic:

      x1   x2   · · ·   xℓ
Single Topic Model

▶ Documents modeled as bags of words:
  ▶ Vocabulary of d words
  ▶ k different topics
  ▶ ℓ words per document
▶ Documents are drawn as follows:
  (1) Draw a topic h randomly with probability P[h = j] = wj for j ∈ [k]
  (2) Draw ℓ words independently according to the distribution µh ∈ ∆^{d−1}
▶ Using one-hot encoding for the words x1, · · · , xℓ ∈ R^d in a
  document we have

      (E[x1])i = P[1st word = i]
      (E[x1 ⊗ x2])i,j = P[1st word = i, 2nd word = j]
        ...
      (E[x1 ⊗ x2 ⊗ · · · ⊗ xℓ])i1,··· ,iℓ = P[1st word = i1, 2nd word = i2, · · · , ℓ-th word = iℓ]
Single Topic Model

▶ Documents are drawn as follows:
  (1) Draw a topic h randomly with probability P[h = j] = wj for j ∈ [k]
  (2) Draw ℓ words independently according to the distribution µh ∈ ∆^{d−1}
▶ Using one-hot encoding for the words x1, · · · , xℓ ∈ R^d in a
  document we also have

      E[x1 | h = j] = µj
      E[x1 ⊗ x2 | h = j] = µj ⊗ µj
      E[x1 ⊗ x2 ⊗ x3 | h = j] = µj ⊗ µj ⊗ µj

  From which we can deduce

      E[x1 ⊗ x2]       = Σ_{j=1}^k wj µj ⊗ µj
      E[x1 ⊗ x2 ⊗ x3]  = Σ_{j=1}^k wj µj ⊗ µj ⊗ µj
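A small simulation checking these two identities empirically; every name here (w, mu, n_docs, ...) is illustrative, not from the lecture:

import numpy as np

rng = np.random.default_rng(6)
d, k, n_docs = 6, 2, 100_000

w = np.array([0.4, 0.6])                    # topic probabilities w_j
mu = rng.dirichlet(np.ones(d), size=k)      # rows: word distributions mu_j

# Draw (x1, x2, x3) for each document, as one-hot vectors
topics = rng.choice(k, size=n_docs, p=w)
words = np.array([rng.choice(d, size=3, p=mu[h]) for h in topics])
one_hot = np.eye(d)[words]                  # shape (n_docs, 3, d)

M2_hat = np.einsum("na,nb->ab", one_hot[:, 0], one_hot[:, 1]) / n_docs
M3_hat = np.einsum("na,nb,nc->abc", one_hot[:, 0], one_hot[:, 1], one_hot[:, 2]) / n_docs

M2 = sum(w[j] * np.einsum("a,b->ab", mu[j], mu[j]) for j in range(k))
M3 = sum(w[j] * np.einsum("a,b,c->abc", mu[j], mu[j], mu[j]) for j in range(k))
print(np.abs(M2_hat - M2).max(), np.abs(M3_hat - M3).max())   # both shrink as n_docs grows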
Mixture of Spherical Gaussians

▶ Mixture of k d-dimensional Gaussians (k ≤ d) with the same
  variance σ^2 I:
  (1) Draw a Gaussian h randomly with P[h = j] = wj for j ∈ [k]
  (2) Draw x from the multivariate normal N(µh, σ^2 I)
▶ The first three moments are:

      E[x]          = Σ_{j=1}^k wj µj
      E[x ⊗ x]      = σ^2 I + Σ_{j=1}^k wj µj ⊗ µj
      E[x ⊗ x ⊗ x]  = Σ_{j=1}^k wj µj ⊗ µj ⊗ µj
                      + σ^2 Σ_{i=1}^d ( E[x] ⊗ ei ⊗ ei + ei ⊗ E[x] ⊗ ei + ei ⊗ ei ⊗ E[x] )
Mixture of Spherical Gaussians

▶ Mixture of k Gaussians with the same variance σ^2 I:
  (1) Draw a Gaussian h randomly with probability P[h = j] = wj for j ∈ [k]
  (2) Draw x from the multivariate normal N(µh, σ^2 I)
▶ Hence

      M2 = E[x ⊗ x] − σ^2 I
         = Σ_{j=1}^k wj µj ⊗ µj

      M3 = E[x ⊗ x ⊗ x] − σ^2 Σ_{i=1}^d ( E[x] ⊗ ei ⊗ ei + ei ⊗ E[x] ⊗ ei + ei ⊗ ei ⊗ E[x] )
         = Σ_{j=1}^k wj µj ⊗ µj ⊗ µj

▶ How can we estimate σ^2?
Mixture of Spherical Gaussians

▶ Mixture of k Gaussians with the same variance σ^2 I.

▶ σ^2 is the smallest eigenvalue of the covariance matrix!

  proof: Let µ̄ = E[x] = Σ_j wj µj. We have

      S = E[(x − µ̄) ⊗ (x − µ̄)] = Σ_{j=1}^k wj (µj − µ̄) ⊗ (µj − µ̄) + σ^2 I

  Let A = Σ_{j=1}^k wj (µj − µ̄) ⊗ (µj − µ̄). A is p.s.d. and has rank
  r ≤ k − 1 < d. Hence if U diagonalizes A we have UAU⊤ = D
  where D is diagonal with its first d − r diagonal entries equal to 0.
  The result follows from observing that USU⊤ = D + σ^2 I.
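A quick numerical check of this fact on a simulated spherical mixture (a sketch; all parameter values are illustrative):

import numpy as np

rng = np.random.default_rng(7)
d, k, sigma, n = 10, 3, 0.5, 200_000

w = np.array([0.2, 0.3, 0.5])
mus = rng.normal(size=(k, d))

h = rng.choice(k, size=n, p=w)
X = mus[h] + sigma * rng.normal(size=(n, d))    # x ~ N(mu_h, sigma^2 I)

S = np.cov(X, rowvar=False)                     # empirical covariance
print(np.linalg.eigvalsh(S).min(), sigma ** 2)  # smallest eigenvalue ≈ sigma^2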
Mixture of Spherical Gaussians

▶ When each spherical Gaussian has its own variance σi^2 we have
  the following result:

Theorem (D. Hsu and S. Kakade, ITCS, 2013)

▶ The average variance σ̄^2 = Σ_{i=1}^k wi σi^2 is the smallest eigenvalue of
  the covariance matrix E[(x − E[x])(x − E[x])⊤].
▶ Let v be any unit-norm eigenvector corresponding to σ̄^2 and let

      m1 = E[ x (v⊤(x − E[x]))^2 ],    M2 = E[x ⊗ x] − σ̄^2 I,    and

      M3 = E[x ⊗ x ⊗ x] − Σ_{i=1}^d [ m1 ⊗ ei ⊗ ei + ei ⊗ m1 ⊗ ei + ei ⊗ ei ⊗ m1 ]

  where e1, · · · , ed is the coordinate basis of R^d. Then,

      m1 = Σ_{i=1}^k wi σi^2 µi,    M2 = Σ_{i=1}^k wi µi ⊗ µi,    and    M3 = Σ_{i=1}^k wi µi ⊗ µi ⊗ µi.
Structure in the Low-Order Moments of Latent Variable Models

▶ For single topic models and spherical Gaussian mixtures, we
  showed that the tensors M2 = Σ_{i=1}^k wi µi ⊗ µi and
  M3 = Σ_{i=1}^k wi µi ⊗ µi ⊗ µi can be expressed as functions of the
  2nd and 3rd order moments.
▶ Similar results can be shown for hidden Markov models, latent
  Dirichlet allocation, independent component analysis and
  multiview models³.
▶ M2 and M3 can be estimated from data; it now remains to
  recover the parameters wi, µi from M2 and M3.

³ see [Anandkumar et al., Tensor decompositions for learning latent variable
models, JMLR 2014].
Overview

Method of Moments

Tensors

Structure in the Low-Order Moments of Latent Variable Models
    Single Topic Model
    Mixture of Spherical Gaussians

Method of Moments via Tensor Decomposition
    Jennrich's algorithm
    Tensor Power Method / (Simultaneous) Diagonalization

Conclusion
Tensor Decomposition for Learning Latent Variable Models

Latent Variable Model: f(x) = Σ_{i=1}^k wi fi(x; µi)

S = {x1, . . . , xn} ⊂ R^d

Structure in the Low Order Moments:
      E[x ⊗ x]      = g1( Σ_{i=1}^k wi µi ⊗ µi )
      E[x ⊗ x ⊗ x]  = g2( Σ_{i=1}^k wi µi ⊗ µi ⊗ µi )

Tensor Power Method  ⟶  ŵi, µ̂i
Method of Moments with Tensor Decomposition

      M̂2 ≈ Σ_{i=1}^k wi µi ⊗ µi
      M̂3 ≈ Σ_{i=1}^k wi µi ⊗ µi ⊗ µi

          ?
        ⟶   ŵi, µ̂i

Assumptions:
▶ k ≤ d
▶ µ1, · · · , µk ∈ R^d are linearly independent
▶ w1, · · · , wk ∈ R are strictly positive real numbers
Method of Moments with Tensor Decomposition

▶ Under which conditions can we recover the weights wj and vectors
  µj for j ∈ [k] from M2 = Σ_j wj µj ⊗ µj?

  (i) If the µj are orthonormal and the wj are distinct, they are the unit
      eigenvectors of M2 and the weights are its eigenvalues.
      → We would still need to recover the signs of the µj ...
  (ii) Otherwise, this is not possible!
Method of Moments with Tensor Decomposition

▶ Under which conditions can we recover the weights wj and vectors
  µj for j ∈ [k] from M3 = Σ_j wj µj ⊗ µj ⊗ µj?

  → We can recover ±wj^{1/3} µj if the µj are linearly independent, using
    Jennrich's algorithm (this is sufficient for e.g. the single topic model).

  → For any vector v ∈ R^d we have

        M3 •1 v = Σ_{j=1}^k wj (v⊤µj) µj ⊗ µj = UΛU⊤,

    thus if the µj are orthonormal we can recover the µj as eigenvectors
    and the wj by solving the linear equation λj = wj (v⊤µj).
    (No more ambiguity for the signs of the µj since the wj are positive.)

  idea: Use M2 to whiten the tensor M3, then recover the
  parameters using eigen-decomposition or the tensor power method.
Jennrich's algorithm [Harshman, 1970]

Let T = Σ_{j=1}^k vj ⊗ vj ⊗ vj where the vj are linearly
independent (this implies k ≤ d).

▶ For any x ∈ R^d, we have

      Tx := T •1 x = U Dx U⊤

  where U = [v1, · · · , vk] ∈ R^{d×k} and Dx is the diagonal matrix with
  entries vj⊤x.

▶ If we draw two unit vectors x, y at random in R^d we have

      Tx (Ty)⁺ = U Dx (Dy)⁻¹ U⁺.

  By drawing x and y at random we ensure that, with probability one,
  ▶ Dy is invertible
  ▶ the diagonal entries of Dx (Dy)⁻¹ are distinct

▶ Since U has rank k we have U⁺U = I and the vj's can be
  recovered as eigenvectors of Tx (Ty)⁺ (up to the signs).
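A minimal NumPy sketch of Jennrich's algorithm on a noiseless symmetric tensor (illustrative only; a practical implementation would handle noise and match eigenvalues across projections):

import numpy as np

rng = np.random.default_rng(8)
d, k = 6, 3

# Build T = sum_j v_j ⊗ v_j ⊗ v_j with linearly independent columns v_j of V
V = rng.normal(size=(d, k))
T = np.einsum("aj,bj,cj->abc", V, V, V)

x = rng.normal(size=d); x /= np.linalg.norm(x)
y = rng.normal(size=d); y /= np.linalg.norm(y)

Tx = np.einsum("abc,a->bc", T, x)    # T •1 x = U D_x U^T
Ty = np.einsum("abc,a->bc", T, y)    # T •1 y = U D_y U^T

M = Tx @ np.linalg.pinv(Ty)          # = U D_x (D_y)^{-1} U^+
eigvals, eigvecs = np.linalg.eig(M)

# Keep the k eigenvectors with (numerically) nonzero eigenvalues:
# they are the v_j directions, recovered up to scale.
order = np.argsort(-np.abs(eigvals))[:k]
U_hat = np.real(eigvecs[:, order])

cos = np.abs((U_hat / np.linalg.norm(U_hat, axis=0)).T
             @ (V / np.linalg.norm(V, axis=0)))
print(np.round(cos, 3))   # each row has exactly one entry ≈ 1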
Tensor Power Method / (Simultaneous) Diagonalization

We want to solve the following system of equations in wi, µi:

      M2 = Σ_{i=1}^k wi µi ⊗ µi
      M3 = Σ_{i=1}^k wi µi ⊗ µi ⊗ µi

Overview:
1. Use M2 to transform the tensor M3 into an orthogonally
   decomposable tensor: i.e. find W ∈ R^{k×d} such that

      T = M3 ×1 W ×2 W ×3 W = Σ_{i=1}^k w̃i µ̃i ⊗ µ̃i ⊗ µ̃i

   where the µ̃i ∈ R^k are orthonormal.
2. Use (simultaneous) diagonalization or the tensor power method to
   recover the weights w̃i and vectors µ̃i.
3. Recover the original weights wi and vectors µi by 'reverting' the
   transformation from step 1.
Orthonormalization

      M2 = Σ_{i=1}^k wi µi ⊗ µi
      M3 = Σ_{i=1}^k wi µi ⊗ µi ⊗ µi

▶ M2 = Σ_{i=1}^k wi µi ⊗ µi = UDU⊤, the eigendecomposition of M2.

▶ Let W = D^{−1/2} U⊤ ∈ R^{k×d} and µ̃i = √wi Wµi ∈ R^k.

▶ We have µ̃i⊤µ̃j = δij for all i, j, because

      I = W M2 W⊤ = W ( Σ_{i=1}^k wi µi µi⊤ ) W⊤ = Σ_{i=1}^k µ̃i µ̃i⊤

  ⇒ T = M3 ×1 W ×2 W ×3 W = Σ_{i=1}^k (1/√wi) µ̃i ⊗ µ̃i ⊗ µ̃i.

▶ Since UU⊤µi = µi for all i we have W⁺µ̃i = √wi µi.
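A sketch of this whitening step, starting from exact M2 and M3 built from known parameters (every name is illustrative):

import numpy as np

rng = np.random.default_rng(9)
d, k = 8, 3

w = np.array([0.2, 0.3, 0.5])
mus = rng.normal(size=(k, d))     # rows mu_i, linearly independent w.p. 1

M2 = np.einsum("i,ia,ib->ab", w, mus, mus)
M3 = np.einsum("i,ia,ib,ic->abc", w, mus, mus, mus)

# Truncated eigendecomposition of M2 (rank k) and whitening matrix W
eigvals, eigvecs = np.linalg.eigh(M2)
U, D = eigvecs[:, -k:], eigvals[-k:]
W = (U / np.sqrt(D)).T            # W = D^{-1/2} U^T, shape (k, d)

T = np.einsum("abc,ia,jb,kc->ijk", M3, W, W, W)   # orthogonally decomposable, k x k x k

mu_tilde = np.sqrt(w)[:, None] * (W @ mus.T).T    # mu~_i = sqrt(w_i) W mu_i
print(np.round(mu_tilde @ mu_tilde.T, 6))         # ≈ identity: the mu~_i are orthonormal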
Orthogonal Tensor Decomposition

Using M2 we've reduced the problem of solving

      M2 = Σ_{i=1}^k wi µi ⊗ µi
      M3 = Σ_{i=1}^k wi µi ⊗ µi ⊗ µi

into the problem of finding an orthogonal decomposition of the tensor

      T = M3 ×1 W ×2 W ×3 W = Σ_{i=1}^k w̃i µ̃i ⊗ µ̃i ⊗ µ̃i.
Orthogonal Tensor Decomposition via Diagonalization

▶ We want to find the orthogonal decomposition

      T = Σ_{i=1}^k w̃i µ̃i ⊗ µ̃i ⊗ µ̃i ∈ R^{k×k×k}

  (where the µ̃i are unit-norm orthogonal vectors)

⇒ The w̃j's and µ̃j's can be recovered as eigenvalues/eigenvectors of any
  projection T •1 v:

▶ For any vector v we have

      T •1 v = Σ_{j=1}^k w̃j (v⊤µ̃j) µ̃j ⊗ µ̃j = UΛU⊤

  with U = [µ̃1 · · · µ̃k] and Λj,j = w̃j (v⊤µ̃j).

▶ UΛU⊤ is the eigendecomposition of T •1 v (since U⊤U = I).

▶ This may be sensitive to noise. Performing simultaneous
  diagonalization of several random projections is a more robust
  approach [Kuleshov et al., AISTATS 2015].
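Continuing the whitening sketch above (the snippet rebuilds the same quantities so it runs on its own), a minimal recovery step using a single random projection; as noted on the slide, simultaneous diagonalization of several projections is more robust:

import numpy as np

rng = np.random.default_rng(9)
d, k = 8, 3
w = np.array([0.2, 0.3, 0.5])
mus = rng.normal(size=(k, d))
M2 = np.einsum("i,ia,ib->ab", w, mus, mus)
M3 = np.einsum("i,ia,ib,ic->abc", w, mus, mus, mus)
eigvals, eigvecs = np.linalg.eigh(M2)
U, D = eigvecs[:, -k:], eigvals[-k:]
W = (U / np.sqrt(D)).T
T = np.einsum("abc,ia,jb,kc->ijk", M3, W, W, W)

# Diagonalize a random projection: T •1 v = U~ Λ U~^T with Λ_jj = w~_j (v^T mu~_j)
v = rng.normal(size=k)
lam, U_t = np.linalg.eigh(np.einsum("abc,a->bc", T, v))

w_tilde = lam / (U_t.T @ v)      # w~_j (any eigenvector sign flip cancels below)
W_pinv = U * np.sqrt(D)          # W^+ = U D^{1/2}, shape (d, k)

w_hat = 1.0 / w_tilde ** 2              # since w~_i = 1 / sqrt(w_i)
mu_hat = (w_tilde * (W_pinv @ U_t)).T   # mu_i = w~_i W^+ mu~_i, rows of mu_hat

# Up to a permutation of the components, (w_hat, mu_hat) matches (w, mus)
order = np.argsort(w_hat)
print(np.round(w_hat[order], 6))          # ≈ [0.2, 0.3, 0.5]
print(np.abs(mu_hat[order] - mus).max())  # ≈ 0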
Tensor Power Method

▶ Extension to orthogonal tensors of the power method (which
  computes the dominant eigenvector of a matrix):

Theorem (Anandkumar et al., JMLR, 2014)
Let T ∈ (R^d)^{⊗3} have an orthonormal decomposition

      T = Σ_{i=1}^k w̃i µ̃i ⊗ µ̃i ⊗ µ̃i.

Let θ0 ∈ R^d, and suppose that |w̃1 · µ̃1⊤θ0| > |w̃j · µ̃j⊤θ0| > 0 for all j > 1.
For t = 1, 2, · · · , define

      θt = (T •1 θt−1 •2 θt−1) / ‖T •1 θt−1 •2 θt−1‖    and    λt = T •1 θt •2 θt •3 θt

Then θt → µ̃1 and λt → w̃1.
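A sketch of this iteration on a small orthogonally decomposable tensor (single component, no deflation; in practice one deflates T and restarts to recover all k pairs):

import numpy as np

rng = np.random.default_rng(10)
k = 3

# Orthogonally decomposable tensor: orthonormal mu~_i (columns of Q), positive w~_i
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))
w_t = np.array([2.0, 1.0, 0.5])
T = np.einsum("i,ai,bi,ci->abc", w_t, Q, Q, Q)

theta = rng.normal(size=k)
theta /= np.linalg.norm(theta)
for _ in range(50):
    theta = np.einsum("abc,a,b->c", T, theta, theta)    # T •1 θ •2 θ
    theta /= np.linalg.norm(theta)
lam = np.einsum("abc,a,b,c->", T, theta, theta, theta)  # T •1 θ •2 θ •3 θ

# theta converges to one of the mu~_i (a column of Q), generically the one
# maximizing |w~_i mu~_i^T θ0|; lam converges to the corresponding w~_i.
print(lam, np.round(np.abs(Q.T @ theta), 4))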
Overview

Method of Moments

Tensors

Structure in the Low-Order Moments of Latent Variable Models
    Single Topic Model
    Mixture of Spherical Gaussians

Method of Moments via Tensor Decomposition
    Jennrich's algorithm
    Tensor Power Method / (Simultaneous) Diagonalization

Conclusion
Conclusion

▶ For a wide class of latent variable models, the method of moments
  can be implemented by exploiting the tensor structure in the low
  order moments.
▶ This approach relies on extracting an orthogonal decomposition of
  a symmetric 3rd order tensor.
▶ Although tensor decompositions are usually intractable, orthogonal
  decompositions can be computed efficiently (when the number of
  components is less than the dimension).
▶ The estimators obtained for the parameters of the LVM are
  consistent (in contrast with the EM estimator).
▶ Both sample complexity and computational complexity are
  polynomial.
