
Deep Learning Tutorial:

Linear Factor Models

Jen-Tzung Chien

Department of Electrical and Computer Engineering


Department of Computer Science
National Chiao Tung University, Hsinchu

September 19, 2016


Outline

• Introduction
• Probabilistic PCA and Factor Analysis
• Independent Component Analysis (ICA)
• Slow Feature Analysis
• Sparse Coding
• Manifold Interpretation of PCA


Outline

• Introduction
• Probabilistic PCA and Factor Analysis
• Independent Component Analysis (ICA)
• Slow Feature Analysis
• Sparse Coding
• Manifold Interpretation of PCA


Model with Latent Variables

• Many of the research frontiers in deep learning involve building a probabilistic model of the input, p_model(x)

• Such a model can, in principle, use probabilistic inference to predict any of the
variables in its environment given any of the other variables

• Many of these models also have latent variables h

• These latent variables provide another means of representing the data


Linear Factor Model

• A linear factor model is defined by the use of a stochastic, linear decoder function that generates x by adding noise to a linear transformation of h

• These models are interesting because they allow us to discover explanatory factors that have a simple joint distribution

• A linear factor model describes the data generation process as follows


– sample the explanatory factors h from a factorial distribution p(h) = ∏_i p(h_i)
– sample the real-valued observable variables given the factors:

  x = W h + b + noise
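A minimal NumPy sketch of this generative process (an illustration, not from the slides; the standard-normal prior, the isotropic Gaussian noise, and all shapes are assumptions made here for concreteness):

import numpy as np

rng = np.random.default_rng(0)
n_factors, n_obs = 3, 5
W = rng.normal(size=(n_obs, n_factors))   # factor loading matrix
b = rng.normal(size=n_obs)                # offset b

# sample the explanatory factors from a factorial prior p(h) = prod_i p(h_i)
h = rng.normal(size=n_factors)            # assumption: p(h_i) = N(0, 1)

# sample the observation given the factors: x = W h + b + noise
noise = 0.1 * rng.normal(size=n_obs)      # assumption: isotropic Gaussian noise
x = W @ h + b + noise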


Graphical Model


Outline

• Introduction
• Probabilistic PCA and Factor Analysis
• Independent Component Analysis (ICA)
• Slow Feature Analysis
• Sparse Coding
• Manifold Interpretation of PCA


Factor Analysis
• In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable
prior is just the unit variance Gaussian

h ∼ N (h; 0, I)

• The observed variables x_i are assumed to be conditionally independent given h

• The noise is assumed to be drawn from a diagonal covariance Gaussian distribution
– covariance matrix: ψ = diag(σ²), where σ² = [σ_1², σ_2², ..., σ_n²]

• The role of the latent variables is thus to capture the dependencies between the different observed variables x_i

  x ∼ N(x; b, W W⊤ + ψ)
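A short NumPy check of this marginal (a sketch with arbitrary parameters, not from the slides): sampling h ∼ N(0, I) and diagonal Gaussian noise, the empirical covariance of x should approach W W⊤ + ψ.

import numpy as np

rng = np.random.default_rng(1)
n_factors, n_obs, n_samples = 2, 4, 200_000
W = rng.normal(size=(n_obs, n_factors))
b = rng.normal(size=n_obs)
sigma2 = rng.uniform(0.1, 0.5, size=n_obs)             # per-dimension noise variances
psi = np.diag(sigma2)

H = rng.normal(size=(n_samples, n_factors))             # h ~ N(0, I)
noise = rng.normal(size=(n_samples, n_obs)) * np.sqrt(sigma2)
X = H @ W.T + b + noise                                 # x = W h + b + noise

empirical_cov = np.cov(X, rowvar=False)
print(np.abs(empirical_cov - (W @ W.T + psi)).max())    # should be close to 0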


Probabilistic PCA

• In order to cast PCA in a probabilistic framework, we can make a slight modification
– making the conditional variances σ_i² equal to each other

• This yields the marginal distribution: x ∼ N(x; b, W W⊤ + σ²I)

• This probabilistic PCA model takes advantage of the observation that most variations in the data can be captured by the latent variables h, up to some small residual reconstruction error σ²
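As a tooling aside (an assumption about scikit-learn, not something stated in the slides), sklearn's PCA exposes exactly this quantity: after fitting, the attribute noise_variance_ is the maximum-likelihood estimate of the shared residual variance σ² under the probabilistic PCA model.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10))   # toy data

pca = PCA(n_components=3).fit(X)
print(pca.noise_variance_)   # ML estimate of sigma^2 under probabilistic PCA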


Outline

• Introduction
• Probabilistic PCA and Factor Analysis
• Independent Component Analysis (ICA)
• Slow Feature Analysis
• Sparse Coding
• Manifold Interpretation of PCA


Independent Component Analysis

• ICA is an approach to modeling linear factors that seeks to separate an observed signal into many underlying signals that are scaled and added together to form the observed data

• The variant that is most similar to the other generative models we have
described here is a variant (Pham et al., 1992) that trains a fully parametric
generative model

• The prior distribution over the underlying factors, p(h), must be fixed ahead
of time by the user.

• The model then deterministically generates x = W h.


Motivation of ICA

• The motivation for this approach is that by choosing p(h) to be independent, we can recover underlying factors that are as close as possible to independent; ICA is commonly used to recover low-level signals that have been mixed together

• Each xi is one sensor’s observation of the mixed signals

• Each hi is one estimate of one of the original signals
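A small blind source separation sketch in this spirit, using scikit-learn's FastICA (one particular ICA variant, chosen here purely for convenience; the sources, mixing matrix, and settings are made up for illustration): each column of X plays the role of one sensor observation x_i, and each recovered component estimates one original signal h_i.

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(2 * t),                # source 1: sinusoid
                     np.sign(np.sin(3 * t)),       # source 2: square wave
                     rng.laplace(size=t.size)])    # source 3: non-Gaussian noise
A = rng.normal(size=(3, 3))                        # unknown mixing matrix
X = S @ A.T                                        # each column: one sensor's mixture

ica = FastICA(n_components=3, random_state=0)
H = ica.fit_transform(X)    # estimated sources (up to permutation and scaling)
A_hat = ica.mixing_         # estimated mixing matrix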


Non-Gaussian

• All variants of ICA require that p(h) be non-Gaussian

• This is because if p(h) is an independent prior with Gaussian components, then W is not identifiable

• In the maximum likelihood approach where the user explicitly specifies the distribution, a typical choice is to use p(h_i) = (d/dh_i) σ(h_i), where σ is the logistic sigmoid
– this density has larger peaks near 0 than the Gaussian distribution
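A quick numerical check of this claim (a sketch, not from the slides; the Gaussian is matched to the logistic distribution's variance π²/3 so that the peak heights are comparable): the derivative of the logistic sigmoid is the logistic density σ(h)(1 − σ(h)), whose peak at 0 exceeds that of the variance-matched Gaussian.

import numpy as np

def logistic_prior(h):
    s = 1.0 / (1.0 + np.exp(-h))   # logistic sigmoid sigma(h)
    return s * (1.0 - s)           # d/dh sigma(h): the standard logistic density

var = np.pi ** 2 / 3.0             # variance of the standard logistic distribution

def gaussian_same_var(h):
    return np.exp(-0.5 * h ** 2 / var) / np.sqrt(2 * np.pi * var)

print(logistic_prior(0.0))         # 0.25
print(gaussian_same_var(0.0))      # ~0.22, a lower peak at 0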


Variants of ICA

• Some add some noise in the generation of x rather than using a deterministic
decoder

• Most do not use the maximum likelihood criterion, but instead aim to make the elements of h = W⁻¹x independent from each other

• Many variants of ICA only know how to transform between x and h, but do
not have any way of representing p(h), and thus do not impose a distribution
over p(x)


Generalization of ICA

• ICA can be generalized to a nonlinear generative model


– Hyvärinen and Pajunen (1999) for the initial work on nonlinear ICA
– its successful use with ensemble learning by Roberts and Everson (2001)
and Lappalainen et al. (2000)

• Another nonlinear extension of ICA is the approach of nonlinear independent components estimation, or NICE (Dinh et al., 2014)

• Another generalization of ICA is to learn groups of features, with statistical dependence allowed within a group but discouraged between groups (Hyvärinen and Hoyer, 1999; Hyvärinen et al., 2001b)


Outline

• Introduction
• Probabilistic PCA and Factor Analysis
• Independent Component Analysis (ICA)
• Slow Feature Analysis
• Sparse Coding
• Manifold Interpretation of PCA


Slow Feature Analysis

• Slow feature analysis (SFA) is a linear factor model that uses information from
time signals to learn invariant features (Wiskott and Sejnowski, 2002)

• The idea is that the important characteristics of scenes change very slowly
compared to the individual measurements that make up a description of a
scene

• The slowness principle predates slow feature analysis and has been applied to
a wide variety of models (Hinton, 1989; Földiák, 1989; Mobahi et al., 2009;
Bergstra and Bengio, 2009)


Model with Slow Principle

• The slowness principle may be introduced by adding a term to the cost function of the form

  λ Σ_t L(f(x_{t+1}), f(x_t))

– λ is a hyperparameter determining the strength of the slowness regularization term
– t is the index into a time sequence of examples
– f is the feature extractor to be regularized
– L is a loss function measuring the distance between f(x_{t+1}) and f(x_t)
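A minimal sketch of this regularizer (illustrative assumptions: f is a linear feature extractor and L is the squared Euclidean distance):

import numpy as np

def slowness_penalty(X_seq, W, lam=1.0):
    # X_seq: (T, n_inputs) time sequence of examples; W: (n_inputs, n_features)
    F = X_seq @ W                                     # f(x_t) for every time step
    diffs = F[1:] - F[:-1]                            # f(x_{t+1}) - f(x_t)
    return lam * np.mean(np.sum(diffs ** 2, axis=1))  # L: squared Euclidean distance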


Objective Function

• The SFA algorithm (Wiskott and Sejnowski, 2002) consists of defining f(x; θ) to be a linear transformation, and solving the optimization problem

  min_θ E_t[(f(x^(t+1))_i − f(x^(t))_i)²]

– subject to the constraints

  E_t[f(x^(t))_i] = 0
  E_t[f(x^(t))_i²] = 1


Discussion about Constraints

• The constraint that the learned feature have zero mean is necessary to make
the problem have a unique solution

• The constraint that the features have unit variance is necessary to prevent the
pathological solution where all features collapse to 0

• To learn multiple features, we must also add the constraint

  ∀ i < j, E_t[f(x^(t))_i f(x^(t))_j] = 0

– this specifies that the learned features must be linearly decorrelated from
each other
– without this constraint, all of the learned features would simply capture the
one slowest signal
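Putting the objective and the three constraints together, a compact linear-SFA sketch (my own illustrative implementation, assuming a full-rank input covariance): whitening the centered data enforces the zero-mean, unit-variance, and decorrelation constraints, and the slowest features are the eigenvectors of the covariance of temporal differences with the smallest eigenvalues.

import numpy as np

def linear_sfa(X_seq, n_features):
    # X_seq: (T, n_inputs) time sequence; returns (W, mean) with f(x) = W.T @ (x - mean)
    mean = X_seq.mean(axis=0)
    Xc = X_seq - mean                          # enforce E_t[f(x^(t))_i] = 0
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    S = evecs / np.sqrt(evals)                 # whitening: unit variance, decorrelated
    Z = Xc @ S
    dZ = Z[1:] - Z[:-1]                        # temporal differences
    devals, devecs = np.linalg.eigh(np.cov(dZ, rowvar=False))  # ascending eigenvalues
    W = S @ devecs[:, :n_features]             # slowest directions first
    return W, mean

# usage sketch: W, mu = linear_sfa(X_seq, 2); features = (X_seq - mu) @ W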


Conclusion of SFA

• SFA is typically used to learn nonlinear features by applying a nonlinear basis expansion to x before running SFA

• A major advantage of SFA is that it is possible to theoretically predict which features SFA will learn, even in the deep, nonlinear setting

• Deep SFA has also been used to learn features for object recognition and pose
estimation (Franzius et al., 2008)


Outline

• Introduction
• Probabilistic PCA and Factor Analysis
• Independent Component Analysis (ICA)
• Slow Feature Analysis
• Sparse Coding
• Manifold Interpretation of PCA


Sparse Coding

• Sparse coding (Olshausen and Field, 1996) is a linear factor model that has
been heavily studied as an unsupervised feature learning and feature extraction
mechanism

• Sparse coding models typically assume that the linear factors have Gaussian noise with isotropic precision β

  p(x|h) = N(x; W h + b, (1/β) I)

• The distribution p(h) is chosen to be one with sharp peaks near 0 (Olshausen
and Field, 1996)
– factorized Laplace distributions
– Cauchy distributions
– factorized Student-t distributions


Encoder and Decoder

• Training sparse coding with maximum likelihood is intractable

• The training alternates between encoding the data and training the decoder
to better reconstruct the data given the encoding

• The encoder is an optimization algorithm that solves an optimization problem in which we seek the single most likely code value

  h* = arg max_h p(h|x)
     = arg max_h log p(h|x)
     = arg min_h λ||h||₁ + β||x − W h||₂²
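A sketch of this encoding step for a fixed dictionary W, using scikit-learn's Lasso to solve the equivalent L1-regularized least-squares problem (illustrative only: the dictionary and data are random, and the exact correspondence between λ, β and Lasso's alpha scaling is glossed over):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n_obs, n_factors = 20, 50
W = rng.normal(size=(n_obs, n_factors))        # fixed (already trained) dictionary
h_true = np.zeros(n_factors)
h_true[rng.choice(n_factors, 3, replace=False)] = rng.normal(size=3)
x = W @ h_true + 0.01 * rng.normal(size=n_obs)

# encoder: solve min_h lambda*||h||_1 + beta*||x - W h||_2^2 (up to Lasso's own scaling)
enc = Lasso(alpha=0.01, fit_intercept=False, max_iter=10_000)
h_star = enc.fit(W, x).coef_                   # most likely code h* for this x
print(np.count_nonzero(h_star))                # only a few active factors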


Advantages

• The sparse coding approach combined with the use of the non-parametric
encoder can in principle minimize the combination of reconstruction error and
log-prior better than any specific parametric encoder

• Another advantage is that there is no generalization error to the encoder

• For the vast majority of formulations of sparse coding models, where the
inference problem is convex, the optimization procedure will always find the
optimal code


Disadvantages

• The primary disadvantage of the non-parametric encoder is that it requires greater time to compute h given x

• It is not straightforward to back-propagate through the non-parametric encoder
– makes it difficult to pretrain a sparse coding model with an unsupervised criterion
– and then fine-tune it using a supervised criterion


Poor Samples


Outline

• Introduction
• Probabilistic PCA and Factor Analysis
• Independent Component Analysis (ICA)
• Slow Feature Analysis
• Sparse Coding
• Manifold Interpretation of PCA


Manifold Interpretation of PCA

• Linear factor models including PCA and factor analysis can be interpreted as
learning a manifold (Hinton et al., 1997)

• Probabilistic PCA defines a thin pancake-shaped region of high probability; PCA can be interpreted as aligning this pancake with a linear manifold in a higher-dimensional space


Encoder and Decoder

• Let the encoder be

  h = f(x) = W⊤(x − µ)

– The encoder computes a low-dimensional representation h of x

• We have a decoder computing the reconstruction

x̂ = g(h) = b + W h


Optimization

• The choices of linear encoder and decoder that minimize reconstruction error

  E[||x − x̂||²]

– correspond to using the same matrix W in both the encoder and the decoder
– µ = b = E[x]
– columns of W form an orthonormal basis which spans the same subspace as the principal eigenvectors of the covariance matrix

  C = E[(x − µ)(x − µ)⊤]


Discussions

• One can also show that the eigenvalue λ_i of C corresponds to the variance of x in the direction of eigenvector v^(i)

• If x ∈ R^D and h ∈ R^d with d < D, then the optimal reconstruction error is

  min E[||x − x̂||²] = Σ_{i=d+1}^{D} λ_i

• One can also show that the above solution can be obtained by maximizing the variances of the elements of h, under orthogonal W
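A NumPy sketch tying these statements together (random toy data; purely illustrative): build W from the top d principal eigenvectors of C, encode and decode as above, and check that the mean squared reconstruction error matches the sum of the discarded eigenvalues.

import numpy as np

rng = np.random.default_rng(5)
D, d, n = 6, 2, 100_000
X = rng.normal(size=(n, D)) @ rng.normal(size=(D, D))  # data with non-trivial covariance

mu = X.mean(axis=0)
C = np.cov(X - mu, rowvar=False)
evals, evecs = np.linalg.eigh(C)                  # ascending eigenvalues
order = np.argsort(evals)[::-1]                   # sort descending
evals, evecs = evals[order], evecs[:, order]

W = evecs[:, :d]                                  # principal eigenvectors as columns
H = (X - mu) @ W                                  # encoder: h = W.T (x - mu)
X_hat = mu + H @ W.T                              # decoder: x_hat = b + W h, with b = mu

err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(err, evals[d:].sum())                       # the two values should nearly agree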


References

