Multi-View Mixture Models
Daniel Hsu
Part 1. Multi-view mixture models
Unsupervised learning
Mixture model: h ∈ [k] := {1, 2, . . . , k} (hidden); x ∈ R^d (observed);
Pr[h = j] = w_j, and x | h ∼ P_h.
Multi-view mixture model: h ∈ [k] (hidden); views
x_1 ∈ R^{d_1}, x_2 ∈ R^{d_2}, . . . , x_ℓ ∈ R^{d_ℓ} (observed).
Multi-view assumption:
Views are conditionally independent given the component.
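To make the generative process concrete, here is a minimal sampling sketch (Python/numpy). The Gaussian choice for each view's conditional distribution and all numeric values are illustrative assumptions; the model itself only requires that the views be conditionally independent given h.

    import numpy as np

    rng = np.random.default_rng(0)

    k = 3                          # number of mixture components
    dims = [5, 4, 6]               # view dimensions d_1, d_2, d_3 (hypothetical)
    w = np.array([0.5, 0.3, 0.2])  # mixing weights Pr[h = j]
    # mus[v][j] = conditional mean of view v given h = j (arbitrary values)
    mus = [rng.normal(size=(k, d)) for d in dims]

    def sample():
        h = rng.choice(k, p=w)     # hidden component, never observed
        # Given h, the views are drawn independently of one another
        # (here: spherical Gaussians centered at the conditional means).
        return h, [rng.normal(loc=mu[h]) for mu in mus]

    h, (x1, x2, x3) = sample()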
Questions:
1. How do we estimate {w_j} and {μ_{v,j}} without observing h?
2. How many views ℓ are sufficient to learn with poly(k)
computational / sample complexity?
Some barriers to efficient estimation
[Figure: bar plots over outcomes 1–8 showing an output distribution
Pr[x_t = · | h_t = 1] ≈ 0.6 Pr[x_t = · | h_t = 2] + 0.4 Pr[x_t = · | h_t = 3].]
Making progress: discrete hidden Markov models
Hardness reductions create HMMs with degenerate output and
next-state distributions.
Part 2. Multi-view method-of-moments

Overview
Structure of moments
Uniqueness of decomposition
Computing the decomposition
Asymmetric views
The plan
Simpler case: exchangeable views

E[x_v | h = j] ≡ μ_j,  j ∈ [k], v ∈ [ℓ].
One-hot encoding:
x_v = e_i ⇔ the v-th word in the document is the i-th word in the vocabulary
(where e_i ∈ {0, 1}^d has a 1 in the i-th position and 0 elsewhere).
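In code, the encoding is a one-liner (Python/numpy; the vocabulary size here is hypothetical). With this encoding, E[x_v | h = j] = μ_j is exactly the vector of word probabilities under component j.

    import numpy as np

    d = 8              # vocabulary size (hypothetical)
    i = 3              # the v-th word of the document is word i of the vocabulary
    x_v = np.zeros(d)
    x_v[i] = 1.0       # one-hot vector e_i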
Non-degeneracy assumption: {μ_i} are linearly independent.
2nd moment: subspace spanned by conditional means

Pairs := E[x_1 ⊗ x_2] = Σ_{i=1}^k w_i μ_i ⊗ μ_i (by conditional independence).
However, {μ_i} is not generally determined by Pairs alone
(e.g., the {μ_i} are not necessarily orthogonal).
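A minimal sketch of the empirical moments (Python/numpy; the sample-matrix layout is an assumption). By conditional independence, the average of x_1 ⊗ x_2 converges to Σ_{i=1}^k w_i μ_i ⊗ μ_i, and similarly for triples.

    import numpy as np

    # X1, X2, X3: arrays of shape (n, d), one row per sample (views 1-3).
    def empirical_pairs(X1, X2):
        # Estimates Pairs = E[x_1 (x) x_2] = sum_i w_i mu_i mu_i^T.
        return X1.T @ X2 / X1.shape[0]

    def empirical_triples(X1, X2, X3):
        # Estimates Triples = E[x_1 (x) x_2 (x) x_3] as a d x d x d tensor.
        return np.einsum('na,nb,nc->abc', X1, X2, X3) / X1.shape[0]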
Must we look at higher-order moments?
3rd moment: (cross) skew maximizers

Triples := E[x_1 ⊗ x_2 ⊗ x_3] = Σ_{i=1}^k w_i μ_i ⊗ μ_i ⊗ μ_i.

Theorem. Each isolated local maximizer η* of

max_{η ∈ R^d} Triples(η, η, η)  s.t.  Pairs(η, η) ≤ 1

satisfies, for some i ∈ [k],

Pairs η* = √w_i μ_i   and   Triples(η*, η*, η*) = 1/√w_i.
(Substitute Pairs = Σ_{i=1}^k w_i μ_i ⊗ μ_i and Triples = Σ_{i=1}^k w_i μ_i ⊗ μ_i ⊗ μ_i.)

Variational analysis

max_{η ∈ R^d}  Σ_{i=1}^k w_i (η^T μ_i)^3   s.t.   Σ_{i=1}^k w_i (η^T μ_i)^2 ≤ 1

(Let θ_i := √w_i (η^T μ_i) for i ∈ [k].)

max_{η ∈ R^d}  Σ_{i=1}^k (1/√w_i) (√w_i η^T μ_i)^3   s.t.   Σ_{i=1}^k (√w_i η^T μ_i)^2 ≤ 1

max_{θ ∈ R^k}  Σ_{i=1}^k (1/√w_i) θ_i^3   s.t.   Σ_{i=1}^k θ_i^2 ≤ 1

(By non-degeneracy, η ↦ θ maps onto R^k.) The isolated local maximizers of the last problem are the coordinate vectors θ = e_i, each attaining objective value 1/√w_i; this is exactly the conclusion of the theorem.
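This analysis suggests an algorithm. Below is a hedged sketch (Python/numpy): whiten with Pairs so the constraint set becomes the unit sphere, then run tensor power iterations on the whitened Triples. This orthogonalize-then-iterate route is one standard way to compute the skew maximizers, not necessarily the exact procedure of the talk, and the function names are hypothetical.

    import numpy as np

    def whiten(pairs, k):
        # W with W^T Pairs W = I_k, from the top-k eigenpairs of Pairs.
        vals, vecs = np.linalg.eigh(pairs)
        vals, vecs = vals[-k:], vecs[:, -k:]
        return vecs / np.sqrt(vals)                 # shape (d, k)

    def skew_maximizer(T, n_restarts=10, n_iters=200, rng=None):
        # Tensor power iterations on the whitened k x k x k tensor
        # T = Triples(W, W, W); with exact moments this converges to some
        # v_i with objective value 1/sqrt(w_i).
        rng = rng or np.random.default_rng()
        best, best_val = None, -np.inf
        for _ in range(n_restarts):
            theta = rng.normal(size=T.shape[0])
            theta /= np.linalg.norm(theta)
            for _ in range(n_iters):
                theta = np.einsum('abc,b,c->a', T, theta, theta)
                theta /= np.linalg.norm(theta)
            val = float(np.einsum('abc,a,b,c->', T, theta, theta, theta))
            if val > best_val:
                best, best_val = theta, val
        return best, best_val

    # Usage: W = whiten(Pairs, k); T = np.einsum('abc,ai,bj,ck->ijk', Triples, W, W, W)
    # v, lam = skew_maximizer(T); then w_i = 1/lam**2 and
    # mu_i = lam * np.linalg.pinv(W.T) @ v; deflate via
    # T -= lam * np.einsum('a,b,c->abc', v, v, v) and repeat for the rest.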
Asymmetric views

Reduction: transform x_1 and x_2 to "look like" x_3 via linear transformations.
Asymmetric cross moments

E[C_{v→3} x_v | h = j] = μ_{3,j},

so C_{v→3} x_v behaves like x_3 (as far as our algorithm can tell).
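A sketch of this transformation (Python/numpy). The talk does not spell out C_{v→3}; the formula below, C_{1→3} := E[x_3 ⊗ x_2] E[x_1 ⊗ x_2]^† (and symmetrically for view 2), is the standard choice in the associated method-of-moments literature and should be read as an assumption here. After this step all three views have conditional means μ_{3,j}, so the exchangeable-views machinery applies.

    import numpy as np

    def cross_moment(Xa, Xb):
        # Empirical E[x_a (x) x_b] from sample matrices (n, d_a), (n, d_b).
        return Xa.T @ Xb / Xa.shape[0]

    def symmetrize_views(X1, X2, X3):
        # Linear maps C_{1->3}, C_{2->3} under which views 1 and 2 "look like"
        # view 3, i.e. E[C_{v->3} x_v | h = j] = mu_{3,j}.
        C13 = cross_moment(X3, X2) @ np.linalg.pinv(cross_moment(X1, X2))
        C23 = cross_moment(X3, X1) @ np.linalg.pinv(cross_moment(X2, X1))
        return X1 @ C13.T, X2 @ C23.T, X3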
Part 3. Some applications and open questions
Mixtures of axis-aligned Gaussians

[Diagram: observed coordinates x_1, x_2, . . . , x_n.]

Assumptions:
• non-degeneracy: the component means span a k-dimensional subspace.
• weak incoherence condition: the component means are not perfectly aligned with the coordinate axes (similar to the spreading condition of Chaudhuri-Rao, '08).

Then randomly partitioning the coordinates into ℓ ≥ 3 views guarantees (w.h.p.) that non-degeneracy holds in all ℓ views.
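A sketch of the random split (Python/numpy; names hypothetical). Because the Gaussians are axis-aligned, disjoint coordinate blocks are conditionally independent given the component, so the blocks genuinely act as views.

    import numpy as np

    def random_views(X, n_views=3, rng=None):
        # Randomly partition the n coordinates (columns of X) into n_views
        # disjoint groups; each group becomes one view of every sample.
        rng = rng or np.random.default_rng()
        perm = rng.permutation(X.shape[1])
        return [X[:, idx] for idx in np.array_split(perm, n_views)]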
Hidden Markov models and others

[Diagram: an HMM with hidden states h_1 → h_2 → h_3 emitting x_1, x_2, x_3 is mapped to a multi-view mixture with a single hidden variable h and views x_1, x_2, x_3.]
Other models:
1. Mixtures of Gaussians (Hsu-Kakade, ITCS’13)
2. HMMs (Anandkumar-Hsu-Kakade, COLT’12)
3. Latent Dirichlet Allocation
(Anandkumar-Foster-Hsu-Kakade-Liu, NIPS’12)
4. Latent parse trees (Hsu-Kakade-Liang, NIPS’12)
5. Independent Component Analysis
(Arora-Ge-Moitra-Sachdeva, NIPS’12; Hsu-Kakade, ITCS’13)
Bag-of-words clustering model

(μ_j)_i = Pr[see word i in document | document topic is j].

• Corpus: New York Times (from UCI), 300,000 articles.
• Vocabulary size: d = 102,660 words.
• Chose k = 50.
• For each topic j, show the top 10 words i.

[Tables: top 10 words for several example topics, etc.]
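Reading off a topic's top words is then a one-liner (Python/numpy; mu is the d × k matrix of estimated word probabilities and vocab is a hypothetical word list).

    import numpy as np

    def top_words(mu, vocab, j, n=10):
        # Indices i with the largest (mu_j)_i = Pr[word i | topic j].
        return [vocab[i] for i in np.argsort(-mu[:, j])[:n]]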
Some open questions

Pair up consecutive views via tensor products:

x̃_{1,2} := x_1 ⊗ x_2,  x̃_{3,4} := x_3 ⊗ x_4,  x̃_{5,6} := x_5 ⊗ x_6,  etc.?
Concluding remarks

Take-home messages:

• Power of multiple views: we can take advantage of diverse / non-redundant sources of information in unsupervised learning.
• Overcoming complexity barriers: some provably hard estimation problems become easy after ruling out "degenerate" cases.
• "Blessing of dimensionality" for estimators based on the method-of-moments.
Thanks!
(Co-authors: Anima Anandkumar, Dean Foster, Rong Ge, Sham
Kakade, Yi-Kai Liu, Matus Telgarsky)
http://arxiv.org/abs/1210.7559