Multi-View Mixture Models
Daniel Hsu
Part 1. Multi-view mixture models
Unsupervised learning
Mixture model: h ∈ [k] := {1, 2, . . . , k} (hidden); x ∈ R^d (observed);
Pr[h = j] = w_j, and x | h ∼ P_h.
Multi-view mixture model: h ∈ [k] (hidden); views
x_1 ∈ R^{d_1}, x_2 ∈ R^{d_2}, . . . , x_ℓ ∈ R^{d_ℓ} (observed).
Multi-view assumption:
Views are conditionally independent given the component.
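To make the generative process concrete, here is a minimal sampling sketch (Python/numpy). The Gaussian choice for each view's conditional distribution and all numeric values are illustrative assumptions; the model itself only requires that the views be conditionally independent given h.

    import numpy as np

    rng = np.random.default_rng(0)

    k = 3                          # number of mixture components
    dims = [5, 4, 6]               # view dimensions d_1, d_2, d_3 (hypothetical)
    w = np.array([0.5, 0.3, 0.2])  # mixing weights Pr[h = j]
    # mus[v][j] = conditional mean of view v given h = j (arbitrary values)
    mus = [rng.normal(size=(k, d)) for d in dims]

    def sample():
        h = rng.choice(k, p=w)     # hidden component, never observed
        # Given h, the views are drawn independently of one another
        # (here: spherical Gaussians centered at the conditional means).
        return h, [rng.normal(loc=mu[h]) for mu in mus]

    h, (x1, x2, x3) = sample()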
Questions:
1. How do we estimate {w_j} and {μ_{v,j}} without observing h?
2. How many views ℓ are sufficient to learn with poly(k)
computational / sample complexity?
Some barriers to efficient estimation
[Figure: bar plots over outcomes 1–8 showing an output distribution
Pr[x_t = · | h_t = 1] ≈ 0.6 Pr[x_t = · | h_t = 2] + 0.4 Pr[x_t = · | h_t = 3].]
Making progress: discrete hidden Markov models
Hardness reductions create HMMs with degenerate output and
next-state distributions.
Part 2. Multi-view method-of-moments

Overview
Structure of moments
Uniqueness of decomposition
Computing the decomposition
Asymmetric views
The plan
Simpler case: exchangeable views

E[x_v | h = j] ≡ μ_j,  j ∈ [k], v ∈ [ℓ].
One-hot encoding:
x_v = e_i ⇔ the v-th word in the document is the i-th word in the vocabulary
(where e_i ∈ {0, 1}^d has a 1 in the i-th position and 0 elsewhere).
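In code, the encoding is a one-liner (Python/numpy; the vocabulary size here is hypothetical). With this encoding, E[x_v | h = j] = μ_j is exactly the vector of word probabilities under component j.

    import numpy as np

    d = 8              # vocabulary size (hypothetical)
    i = 3              # the v-th word of the document is word i of the vocabulary
    x_v = np.zeros(d)
    x_v[i] = 1.0       # one-hot vector e_i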
Non-degeneracy assumption: {μ_i} are linearly independent.
2nd moment: subspace spanned by conditional means

Pairs := E[x_1 ⊗ x_2] = Σ_{i=1}^k w_i μ_i ⊗ μ_i (by conditional independence).
However, {μ_i} is not generally determined by Pairs alone
(e.g., the {μ_i} are not necessarily orthogonal).
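A minimal sketch of the empirical moments (Python/numpy; the sample-matrix layout is an assumption). By conditional independence, the average of x_1 ⊗ x_2 converges to Σ_{i=1}^k w_i μ_i ⊗ μ_i, and similarly for triples.

    import numpy as np

    # X1, X2, X3: arrays of shape (n, d), one row per sample (views 1-3).
    def empirical_pairs(X1, X2):
        # Estimates Pairs = E[x_1 (x) x_2] = sum_i w_i mu_i mu_i^T.
        return X1.T @ X2 / X1.shape[0]

    def empirical_triples(X1, X2, X3):
        # Estimates Triples = E[x_1 (x) x_2 (x) x_3] as a d x d x d tensor.
        return np.einsum('na,nb,nc->abc', X1, X2, X3) / X1.shape[0]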
Must we look at higher-order moments?
3rd moment: (cross) skew maximizers

Triples := E[x_1 ⊗ x_2 ⊗ x_3] = Σ_{i=1}^k w_i μ_i ⊗ μ_i ⊗ μ_i.

Theorem. Each isolated local maximizer η* of

max_{η ∈ R^d} Triples(η, η, η)  s.t.  Pairs(η, η) ≤ 1

satisfies, for some i ∈ [k],

Pairs η* = √w_i μ_i   and   Triples(η*, η*, η*) = 1/√w_i.
(Substitute Pairs = Σ_{i=1}^k w_i μ_i ⊗ μ_i and Triples = Σ_{i=1}^k w_i μ_i ⊗ μ_i ⊗ μ_i.)

Variational analysis

max_{η ∈ R^d}  Σ_{i=1}^k w_i (η^T μ_i)^3   s.t.   Σ_{i=1}^k w_i (η^T μ_i)^2 ≤ 1

(Let θ_i := √w_i (η^T μ_i) for i ∈ [k].)

max_{η ∈ R^d}  Σ_{i=1}^k (1/√w_i) (√w_i η^T μ_i)^3   s.t.   Σ_{i=1}^k (√w_i η^T μ_i)^2 ≤ 1

max_{θ ∈ R^k}  Σ_{i=1}^k (1/√w_i) θ_i^3   s.t.   Σ_{i=1}^k θ_i^2 ≤ 1

(By non-degeneracy, η ↦ θ maps onto R^k.) The isolated local maximizers of the last problem are the coordinate vectors θ = e_i, each attaining objective value 1/√w_i; this is exactly the conclusion of the theorem.
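This analysis suggests an algorithm. Below is a hedged sketch (Python/numpy): whiten with Pairs so the constraint set becomes the unit sphere, then run tensor power iterations on the whitened Triples. This orthogonalize-then-iterate route is one standard way to compute the skew maximizers, not necessarily the exact procedure of the talk, and the function names are hypothetical.

    import numpy as np

    def whiten(pairs, k):
        # W with W^T Pairs W = I_k, from the top-k eigenpairs of Pairs.
        vals, vecs = np.linalg.eigh(pairs)
        vals, vecs = vals[-k:], vecs[:, -k:]
        return vecs / np.sqrt(vals)                 # shape (d, k)

    def skew_maximizer(T, n_restarts=10, n_iters=200, rng=None):
        # Tensor power iterations on the whitened k x k x k tensor
        # T = Triples(W, W, W); with exact moments this converges to some
        # v_i with objective value 1/sqrt(w_i).
        rng = rng or np.random.default_rng()
        best, best_val = None, -np.inf
        for _ in range(n_restarts):
            theta = rng.normal(size=T.shape[0])
            theta /= np.linalg.norm(theta)
            for _ in range(n_iters):
                theta = np.einsum('abc,b,c->a', T, theta, theta)
                theta /= np.linalg.norm(theta)
            val = float(np.einsum('abc,a,b,c->', T, theta, theta, theta))
            if val > best_val:
                best, best_val = theta, val
        return best, best_val

    # Usage: W = whiten(Pairs, k); T = np.einsum('abc,ai,bj,ck->ijk', Triples, W, W, W)
    # v, lam = skew_maximizer(T); then w_i = 1/lam**2 and
    # mu_i = lam * np.linalg.pinv(W.T) @ v; deflate via
    # T -= lam * np.einsum('a,b,c->abc', v, v, v) and repeat for the rest.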
Asymmetric views

Reduction: transform x_1 and x_2 to "look like" x_3 via linear transformations.
Asymmetric cross moments

E[C_{v→3} x_v | h = j] = μ_{3,j},

so C_{v→3} x_v behaves like x_3 (as far as our algorithm can tell).
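A sketch of this transformation (Python/numpy). The talk does not spell out C_{v→3}; the formula below, C_{1→3} := E[x_3 ⊗ x_2] E[x_1 ⊗ x_2]^† (and symmetrically for view 2), is the standard choice in the associated method-of-moments literature and should be read as an assumption here. After this step all three views have conditional means μ_{3,j}, so the exchangeable-views machinery applies.

    import numpy as np

    def cross_moment(Xa, Xb):
        # Empirical E[x_a (x) x_b] from sample matrices (n, d_a), (n, d_b).
        return Xa.T @ Xb / Xa.shape[0]

    def symmetrize_views(X1, X2, X3):
        # Linear maps C_{1->3}, C_{2->3} under which views 1 and 2 "look like"
        # view 3, i.e. E[C_{v->3} x_v | h = j] = mu_{3,j}.
        C13 = cross_moment(X3, X2) @ np.linalg.pinv(cross_moment(X1, X2))
        C23 = cross_moment(X3, X1) @ np.linalg.pinv(cross_moment(X2, X1))
        return X1 @ C13.T, X2 @ C23.T, X3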
Part 3. Some applications and open questions
Mixtures of axis-aligned Gaussians

[Diagram: observed coordinates x_1, x_2, . . . , x_n.]

Assumptions:
• non-degeneracy: the component means span a k-dimensional subspace.
• weak incoherence condition: the component means are not perfectly aligned with the coordinate axes (similar to the spreading condition of Chaudhuri-Rao, '08).

Then randomly partitioning the coordinates into ℓ ≥ 3 views guarantees (w.h.p.) that non-degeneracy holds in all ℓ views.
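A sketch of the random split (Python/numpy; names hypothetical). Because the Gaussians are axis-aligned, disjoint coordinate blocks are conditionally independent given the component, so the blocks genuinely act as views.

    import numpy as np

    def random_views(X, n_views=3, rng=None):
        # Randomly partition the n coordinates (columns of X) into n_views
        # disjoint groups; each group becomes one view of every sample.
        rng = rng or np.random.default_rng()
        perm = rng.permutation(X.shape[1])
        return [X[:, idx] for idx in np.array_split(perm, n_views)]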
Hidden Markov models and others

[Diagram: an HMM with hidden states h_1 → h_2 → h_3 emitting x_1, x_2, x_3 is mapped to a multi-view mixture with a single hidden variable h and views x_1, x_2, x_3.]
Other models:
1. Mixtures of Gaussians (Hsu-Kakade, ITCS’13)
2. HMMs (Anandkumar-Hsu-Kakade, COLT’12)
3. Latent Dirichlet Allocation
(Anandkumar-Foster-Hsu-Kakade-Liu, NIPS’12)
4. Latent parse trees (Hsu-Kakade-Liang, NIPS’12)
5. Independent Component Analysis
(Arora-Ge-Moitra-Sachdeva, NIPS’12; Hsu-Kakade, ITCS’13)
Bag-of-words clustering model

(μ_j)_i = Pr[see word i in document | document topic is j].

• Corpus: New York Times (from UCI), 300,000 articles.
• Vocabulary size: d = 102,660 words.
• Chose k = 50.
• For each topic j, show the top 10 words i.

[Tables: top 10 words for several example topics, etc.]
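Reading off a topic's top words is then a one-liner (Python/numpy; mu is the d × k matrix of estimated word probabilities and vocab is a hypothetical word list).

    import numpy as np

    def top_words(mu, vocab, j, n=10):
        # Indices i with the largest (mu_j)_i = Pr[word i | topic j].
        return [vocab[i] for i in np.argsort(-mu[:, j])[:n]]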
Some open questions

Pair up consecutive views via tensor products:

x̃_{1,2} := x_1 ⊗ x_2,  x̃_{3,4} := x_3 ⊗ x_4,  x̃_{5,6} := x_5 ⊗ x_6,  etc.?
Concluding remarks

Take-home messages:

• Power of multiple views: we can take advantage of diverse / non-redundant sources of information in unsupervised learning.
• Overcoming complexity barriers: some provably hard estimation problems become easy after ruling out "degenerate" cases.
• "Blessing of dimensionality" for estimators based on the method-of-moments.
Thanks!
(Co-authors: Anima Anandkumar, Dean Foster, Rong Ge, Sham
Kakade, Yi-Kai Liu, Matus Telgarsky)
http://arxiv.org/abs/1210.7559