
Advanced Flow Models

Volodymyr Kuleshov

Cornell Tech

Lecture 8



Announcements

Congratulations on completing Assignment 1!


Assignment 2 will be out today and due in two weeks
Project proposals are due in about one week on Gradescope
Come see me during office hours (or schedule time with me) to get
feedback on your project ideas



Normalizing Flows: Motivation

Model families:
Autoregressive Models: $p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_{<i})$
Variational Autoencoders: $p_\theta(x) = \int p_\theta(x, z)\, dz$
Autoregressive models provide tractable likelihoods but no direct
mechanism for learning features
Variational autoencoders can learn feature representations (via latent
variables z) but have intractable marginal likelihoods
Key question: Can we design a latent variable model with tractable
likelihoods? Yes! Use normalizing flows.
Normalizing Flow Models: Definition
In a normalizing flow model, the mapping between $Z$ and $X$, given by
$f_\theta : \mathbb{R}^n \to \mathbb{R}^n$, is deterministic and invertible such that $X = f_\theta(Z)$ and
$Z = f_\theta^{-1}(X)$

We want to learn pX (x; θ) using the principle of maximum likelihood.


Using change of variables, the marginal likelihood p(x) is given by
$$p_X(x; \theta) = p_Z\big(f_\theta^{-1}(x)\big)\left|\det\frac{\partial f_\theta^{-1}(x)}{\partial x}\right|$$

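
To make the change-of-variables formula above concrete, here is a minimal NumPy sketch. It assumes a hypothetical invertible map $f_\theta(z) = Az + b$ and a standard-normal prior $p_Z$; the names `A`, `b`, `f`, and `log_px` are illustrative, not part of the lecture.

```python
import numpy as np

n = 3
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n)) + n * np.eye(n)   # well conditioned, hence invertible
b = rng.normal(size=n)

def f(z):                                     # forward map z -> x
    return A @ z + b

def log_px(x):
    z = np.linalg.solve(A, x - b)             # f^{-1}(x)
    log_pz = -0.5 * (z @ z + n * np.log(2 * np.pi))   # standard-normal prior
    return log_pz - np.linalg.slogdet(A)[1]   # log|det df^{-1}/dx| = -log|det A|

print(log_px(f(rng.normal(size=n))))
```
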


Normalizing Flow Models: Constructing f
We need to construct a density transformation that is:
Invertible, so that we can apply the change of variables formula.
Expressive, so that we can learn complex distributions.
Computationally tractable, so that we can optimize and evaluate it.
Computing likelihoods requires evaluating the determinant of an $n \times n$
Jacobian matrix, an expensive $O(n^3)$ operation!
Strategies:
1 Apply a sequence of $M$ simple invertible transformations, with $x \triangleq z_M$:

$$z_m := f_\theta^m \circ \cdots \circ f_\theta^1(z_0) = f_\theta^m\big(f_\theta^{m-1}(\cdots(f_\theta^1(z_0)))\big) \triangleq f_\theta(z_0)$$

The determinant of a composition equals the product of the determinants (see the code sketch below):

$$p_X(x; \theta) = p_Z\big(f_\theta^{-1}(x)\big)\prod_{m=1}^{M}\left|\det\frac{\partial (f_\theta^m)^{-1}(z_m)}{\partial z_m}\right|$$

2 Choose transformations for which the Jacobian matrix has special structure (e.g., it is diagonal).
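
A small sketch of strategy 1 above, under assumed toy layers: a stack of $M$ hypothetical invertible affine maps stands in for the simple transformations, and the per-layer log-determinants of the inverse maps accumulate additively in log space.

```python
import numpy as np

n, M = 3, 4
rng = np.random.default_rng(0)
# M hypothetical invertible affine layers f^m(z) = A_m z + b_m
layers = [(rng.normal(size=(n, n)) + n * np.eye(n), rng.normal(size=n))
          for _ in range(M)]

def log_px(x):
    logdet = 0.0
    for A, b in reversed(layers):            # undo the composition, last layer first
        x = np.linalg.solve(A, x - b)        # apply (f^m)^{-1}
        logdet -= np.linalg.slogdet(A)[1]    # add log|det d(f^m)^{-1}/dz_m|
    log_pz = -0.5 * (x @ x + n * np.log(2 * np.pi))
    return log_pz + logdet

print(log_px(rng.normal(size=n)))
```
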
Triangular Jacobian

Suppose we have the following vector-valued invertible mapping f :

$$x = (x_1, \ldots, x_n) = f(z) = (f_1(z), \ldots, f_n(z))$$

$$J = \frac{\partial f}{\partial z} = \begin{pmatrix} \frac{\partial f_1}{\partial z_1} & \cdots & \frac{\partial f_1}{\partial z_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_n}{\partial z_1} & \cdots & \frac{\partial f_n}{\partial z_n} \end{pmatrix}$$

Suppose $x_i = f_i(z)$ only depends on $z_{\le i}$. Then

$$J = \frac{\partial f}{\partial z} = \begin{pmatrix} \frac{\partial f_1}{\partial z_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ \frac{\partial f_n}{\partial z_1} & \cdots & \frac{\partial f_n}{\partial z_n} \end{pmatrix}$$

has lower triangular structure. Its determinant is the product of the diagonal entries, so it can be computed in linear time.

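
A brief sketch of why triangular structure helps: for a triangular Jacobian, $\log|\det J|$ is just the sum of the logs of the diagonal entries, an O(n) computation. `logabsdet_triangular` is an illustrative helper; the dense `slogdet` call is only a cross-check.

```python
import numpy as np

def logabsdet_triangular(J):
    # for triangular J, det(J) is the product of the diagonal entries
    return np.sum(np.log(np.abs(np.diag(J))))

rng = np.random.default_rng(0)
J = np.tril(rng.normal(size=(4, 4))) + 4.0 * np.eye(4)   # lower triangular example
print(logabsdet_triangular(J))        # O(n)
print(np.linalg.slogdet(J)[1])        # dense O(n^3) cross-check, same value
```
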


Lecture Outline

1 Flows With Triangular Jacobians


Nonlinear Independent Components Estimation (Dinh et al. 2014)
Real NVP (Dinh et al. 2017)
2 Autoregressive Flows
Masked Autoregressive Flow (Papamakarios et al., 2017)
Inverse Autoregressive Flow (Kingma et al., 2016)
3 Probability Distillation and Parallel Wavenet



Nonlinear Independent Components Estimation (NICE)

Nonlinear Independent Components Estimation (NICE; Dinh et al., 2014) is a flow-based model whose transformation

$$x = f_\theta^M \circ \cdots \circ f_\theta^1(z_0) = f_\theta^M\big(f_\theta^{M-1}(\cdots(f_\theta^1(z_0)))\big) \triangleq f_\theta(z_0)$$

is a composition of two types of layers:


1 Additive coupling layers
2 Rescaling layers



NICE: Additive Coupling Layers
An additive coupling layer has the following structure:
First, we partition the variables z into two disjoint subsets, say z1:d
and zd+1:n for any 1 ≤ d < n
We define the forward mapping z ↦ x as follows:
The first set of variables stays the same: x1:d = z1:d
The second set of variables undergoes an additive transformation:
xd+1:n = zd+1:n + mθ (z1:d )

mθ (·) is a DNN with params θ, d input units, and n − d output units


NICE: Additive Coupling Layers

Is this invertible? Yes!


Forward mapping z ↦ x:
x1:d = z1:d (identity transformation)
xd+1:n = zd+1:n + mθ (z1:d ) (mθ (·) is a neural network with parameters
θ, d input units, and n − d output units)
Inverse mapping x ↦ z, defined by $z = f^{-1}(x)$:
The first d dimensions are unchanged: z1:d = x1:d (identity
transformation)
The other dimensions are simply shifted (using the fact that the first
dimensions are unchanged): zd+1:n = xd+1:n − mθ (x1:d )

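
A minimal sketch of an additive coupling layer, with `m_theta` as a stand-in for the neural network $m_\theta$; any choice of `m_theta` leaves the layer exactly invertible.

```python
import numpy as np

n, d = 6, 3

def m_theta(z1d):                       # stand-in for the DNN m_theta: R^d -> R^(n-d)
    return np.tanh(z1d).sum() * np.ones(n - d)

def forward(z):                         # z -> x
    return np.concatenate([z[:d], z[d:] + m_theta(z[:d])])

def inverse(x):                         # x -> z
    return np.concatenate([x[:d], x[d:] - m_theta(x[:d])])

z = np.random.default_rng(0).normal(size=n)
assert np.allclose(inverse(forward(z)), z)   # exactly invertible for any m_theta
```
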


NICE: Additive Coupling Layers
Is the Jacobian tractable? Yes!
Forward mapping z ↦ x:
x1:d = z1:d (identity transformation)
xd+1:n = zd+1:n + mθ (z1:d ) (mθ (·) is a neural network with parameters
θ, d input units, and n − d output units)
Jacobian of forward mapping:

$$J = \frac{\partial x}{\partial z} = \begin{pmatrix} I_d & 0 \\ \frac{\partial x_{d+1:n}}{\partial z_{1:d}} & I_{n-d} \end{pmatrix}, \qquad \det(J) = 1$$

Observe that:
We have a volume preserving transformation since determinant is 1.
Inverse mapping can be computed for any m.
Determinant is independent of mθ , hence we can use any function!
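
A quick numerical sanity check (again with a stand-in for $m_\theta$) that the additive coupling layer is volume preserving: the determinant of its Jacobian comes out as 1 regardless of the coupling function.

```python
import numpy as np

n, d = 6, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(n - d, d))                     # stand-in for m_theta

def coupling(z):                                    # additive coupling forward map
    x = z.copy()
    x[d:] = z[d:] + np.tanh(W @ z[:d])
    return x

def numerical_jacobian(f, z, eps=1e-6):
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J[:, j] = (f(z + e) - f(z - e)) / (2 * eps)  # central differences
    return J

J = numerical_jacobian(coupling, rng.normal(size=n))
print(np.linalg.det(J))                             # ~ 1.0, volume preserving
```
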
NICE: Rescaling Layers
Rescaling layers in NICE are defined as follows:
Forward mapping z ↦ x:
$$x_i = s_i z_i$$
where $s_i > 0$ is the scaling factor for the i-th dimension.
Inverse mapping x ↦ z:
$$z_i = \frac{x_i}{s_i}$$
Jacobian of forward mapping:
$$J = \operatorname{diag}(s), \qquad \det(J) = \prod_{i=1}^{n} s_i$$

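
A minimal sketch of a rescaling layer: elementwise scaling with a diagonal Jacobian, so $\log|\det J|$ is the sum of the log scales. The scale vector `s` here is a fixed illustrative choice; in NICE it is a learned parameter.

```python
import numpy as np

s = np.array([0.5, 2.0, 1.5])                   # s_i > 0, one scale per dimension

def forward(z):                                 # x_i = s_i z_i
    return s * z, np.sum(np.log(s))             # log det J = sum_i log s_i

def inverse(x):                                 # z_i = x_i / s_i
    return x / s

x, logdet = forward(np.array([1.0, -1.0, 0.3]))
print(x, logdet, inverse(x))
```
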


Samples Generated via NICE

[Figures omitted: samples generated by trained NICE models]


Real-NVP: Non-Volume Preserving Extension of NICE
Forward mapping z ↦ x:
$x_{1:d} = z_{1:d}$ (identity transformation)
$x_{d+1:n} = z_{d+1:n} \odot \exp(\alpha_\theta(z_{1:d})) + \mu_\theta(z_{1:d})$
$\mu_\theta(\cdot)$ and $\alpha_\theta(\cdot)$ are both neural networks with parameters $\theta$, $d$ input units, and $n - d$ output units [$\odot$: elementwise product]
Inverse mapping x ↦ z:
$z_{1:d} = x_{1:d}$ (identity transformation)
$z_{d+1:n} = (x_{d+1:n} - \mu_\theta(x_{1:d})) \odot \exp(-\alpha_\theta(x_{1:d}))$
Jacobian of forward mapping:

$$J = \frac{\partial x}{\partial z} = \begin{pmatrix} I_d & 0 \\ \frac{\partial x_{d+1:n}}{\partial z_{1:d}} & \operatorname{diag}(\exp(\alpha_\theta(z_{1:d}))) \end{pmatrix}$$

$$\det(J) = \prod_{i=d+1}^{n} \exp\big(\alpha_\theta(z_{1:d})_i\big) = \exp\Big(\sum_{i=d+1}^{n} \alpha_\theta(z_{1:d})_i\Big)$$

Non-volume-preserving transformation in general, since the determinant can be less than or greater than 1
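
A minimal sketch of the affine coupling layer above, with `mu_theta` and `alpha_theta` as simple stand-ins for the two networks; the returned log-determinant is the sum of the `alpha_theta` outputs, matching the formula above.

```python
import numpy as np

n, d = 6, 3
rng = np.random.default_rng(0)
W_mu, W_alpha = rng.normal(size=(2, n - d, d))

mu_theta    = lambda z1d: W_mu @ z1d                 # stand-in for mu_theta network
alpha_theta = lambda z1d: np.tanh(W_alpha @ z1d)     # stand-in for alpha_theta network

def forward(z):                                      # returns x and log|det J|
    x2 = z[d:] * np.exp(alpha_theta(z[:d])) + mu_theta(z[:d])
    return np.concatenate([z[:d], x2]), np.sum(alpha_theta(z[:d]))

def inverse(x):
    z2 = (x[d:] - mu_theta(x[:d])) * np.exp(-alpha_theta(x[:d]))
    return np.concatenate([x[:d], z2])

z = rng.normal(size=n)
x, logdet = forward(z)
assert np.allclose(inverse(x), z)
```
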
Samples Generated via Real-NVP

[Figure omitted: samples generated by a trained Real-NVP model]


Latent Space Interpolations via Real-NVP

Using four validation examples $z^{(1)}, z^{(2)}, z^{(3)}, z^{(4)}$, define the interpolated $z$ as:

$$z = \cos\phi\,\big(z^{(1)}\cos\phi' + z^{(2)}\sin\phi'\big) + \sin\phi\,\big(z^{(3)}\cos\phi' + z^{(4)}\sin\phi'\big)$$

with the manifold parameterized by $\phi$ and $\phi'$.


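
A short sketch of this interpolation formula; `interpolate` is an illustrative helper, and in practice each interpolated latent would be decoded through the flow's forward map $f_\theta$ to produce an image.

```python
import numpy as np

def interpolate(z1, z2, z3, z4, phi, phi_prime):
    return (np.cos(phi) * (z1 * np.cos(phi_prime) + z2 * np.sin(phi_prime))
            + np.sin(phi) * (z3 * np.cos(phi_prime) + z4 * np.sin(phi_prime)))

rng = np.random.default_rng(0)
z1, z2, z3, z4 = rng.normal(size=(4, 8))        # four latent codes of dimension 8
z_interp = interpolate(z1, z2, z3, z4, phi=0.3, phi_prime=1.1)
# decoding z_interp through the flow's forward map would give an interpolated sample
print(z_interp.shape)
```
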
Lecture Outline

1 Flows With Triangular Jacobians


Nonlinear Independent Components Estimation (Dinh et al. 2014)
Real NVP (Dinh et al. 2017)
2 Autoregressive Flows
Masked Autoregressive Flow (Papamakarios et al., 2017)
Inverse Autoregressive Flow (Kingma et al., 2016)
3 Probability Distillation and Parallel Wavenet



Autoregressive Models as Flow Models

Consider a Gaussian autoregressive model:


$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i})$$

such that $p(x_i \mid x_{<i}) = \mathcal{N}\big(\mu_i(x_1, \ldots, x_{i-1}),\, \exp(\alpha_i(x_1, \ldots, x_{i-1}))^2\big)$.
µi (·) and αi (·) are neural networks for i > 1 and constants for i = 1.
Consider a sampler for this model:
Sample zi ∼ N (0, 1) for i = 1, · · · , n
Let x1 = exp(α1 )z1 + µ1 . Compute µ2 (x1 ), α2 (x1 )
Let x2 = exp(α2 )z2 + µ2 . Compute µ3 (x1 , x2 ), α3 (x1 , x2 )
Let x3 = exp(α3 )z3 + µ3 . ...
This defines an invertible transformation from z to x. Hence, this
type of autoregressive model can be interpreted as a flow!

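
A sketch of the sampler above, with `mu` and `alpha` as simple stand-ins for the per-dimension networks (they reduce to constants for $i = 1$, where $x_{<1}$ is empty).

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)

def mu(x_prev):                                 # stand-in for mu_i(x_1, ..., x_{i-1})
    return 0.1 * np.sum(x_prev)                 # equals a constant when x_prev is empty

def alpha(x_prev):                              # stand-in for alpha_i(x_1, ..., x_{i-1})
    return 0.05 * np.sum(np.tanh(x_prev))

z = rng.normal(size=n)                          # z_i ~ N(0, 1)
x = np.zeros(n)
for i in range(n):                              # sequential: x_i depends on x_{<i}
    x[i] = np.exp(alpha(x[:i])) * z[i] + mu(x[:i])
print(x)
```
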


Masked Autoregressive Flow (MAF)
A Masked Autoregressive Flow (MAF) is a normalizing flow model in
which the transformation f : Z → X implements this process:

Forward mapping from z ↦ x:


Let x1 = exp(α1 )z1 + µ1 . Compute µ2 (x1 ), α2 (x1 )
Let x2 = exp(α2 )z2 + µ2 . Compute µ3 (x1 , x2 ), α3 (x1 , x2 )
Computing the forward mapping (i.e. sampling) is sequential and
slow: O(n) time (it’s an autoregressive model)

Figure adapted from Eric Jang’s blog


Masked Autoregressive Flow (MAF)

Inverse mapping from x ↦ z:


Compute all µi , αi (can be done in parallel)
Let z1 = (x1 − µ1 )/ exp(α1 ) (scale and shift)
Let z2 = (x2 − µ2 )/ exp(α2 )
Let z3 = (x3 − µ3 )/ exp(α3 ) ...
Jacobian is lower triangular; determinant is computed efficiently
Inverse mapping (i.e., likelihood evaluation) is easy and parallelizable

Figure adapted from Eric Jang’s blog


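
A sketch contrasting the two directions of a MAF layer. Strictly lower-triangular masks on linear maps stand in for the masked autoregressive networks, so output $i$ only depends on inputs $< i$: sampling is sequential, while the inverse used for likelihoods is a single parallel pass.

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
mask = np.tril(np.ones((n, n)), k=-1)                 # output i sees inputs < i only
W_mu, W_alpha = rng.normal(size=(2, n, n)) * 0.3
mu_net    = lambda x: (W_mu * mask) @ x               # mu_i(x_{<i})
alpha_net = lambda x: np.tanh((W_alpha * mask) @ x)   # alpha_i(x_{<i})

def forward(z):                                       # sampling: O(n), sequential
    x = np.zeros(n)
    for i in range(n):
        x[i] = np.exp(alpha_net(x)[i]) * z[i] + mu_net(x)[i]
    return x

def inverse(x):                                       # likelihood: one parallel pass
    return (x - mu_net(x)) / np.exp(alpha_net(x))

z = rng.normal(size=n)
assert np.allclose(inverse(forward(z)), z)
```
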
Inverse Autoregressive Flow (IAF)
An Inverse Autoregressive Flow (IAF) is a model in which the transformation
f : X → Z is autoregressive in z (while an MAF is autoregressive in x):

Forward mapping from z ↦ x:


Sample zi ∼ N (0, 1) for i = 1, · · · , n
Compute all µi (z<i ), αi (z<i ) (can be done in parallel)
Let x1 = exp(α1 )z1 + µ1
Let x2 = exp(α2 )z2 + µ2 ...

Figure adapted from Eric Jang’s blog


Inverse Autoregressive Flow (IAF)

Inverse mapping from x ↦ z (sequential):


Let z1 = (x1 − µ1 )/ exp(α1 ). Compute µ2 (z1 ), α2 (z1 )
Let z2 = (x2 − µ2 )/ exp(α2 ). Compute µ3 (z1 , z2 ), α3 (z1 , z2 )
It’s fast to sample from an IAF, but slow to evaluate likelihoods and train
Note: Fast to evaluate likelihoods of a generated point (cache z1 , z2 , . . . , zn )

Figure adapted from Eric Jang’s blog


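
The same stand-in masked networks rearranged as an IAF layer: parameters now depend on $z_{<i}$, so the forward (sampling) direction is a single parallel pass and the inverse used for likelihood evaluation is sequential.

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
mask = np.tril(np.ones((n, n)), k=-1)
W_mu, W_alpha = rng.normal(size=(2, n, n)) * 0.3
mu_net    = lambda z: (W_mu * mask) @ z               # mu_i(z_{<i})
alpha_net = lambda z: np.tanh((W_alpha * mask) @ z)   # alpha_i(z_{<i})

def forward(z):                                       # sampling: one parallel pass
    return np.exp(alpha_net(z)) * z + mu_net(z)

def inverse(x):                                       # likelihood: O(n), sequential
    z = np.zeros(n)
    for i in range(n):
        z[i] = (x[i] - mu_net(z)[i]) / np.exp(alpha_net(z)[i])
    return z

z = rng.normal(size=n)
assert np.allclose(inverse(forward(z)), z)
```
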
IAF is Transpose of MAF

Figure: Inverse pass of MAF (left) vs. Forward pass of IAF (right)

Interchanging z and x in the inverse transformation of MAF gives the


forward transformation of IAF
Similarly, swapping z and x in the forward transform of MAF yields
the inverse transform of IAF
Figure adapted from Eric Jang’s blog
NICE and Real NVP as IAF

Note that NICE and Real NVP are special cases of the IAF framework.

However, their scale and shift statistics can be computed in a single pass, because they are a function of the partition that is not being transformed.
Therefore both sampling and posterior inference are fast.

Figure from Eric Jang’s blog


IAF vs. MAF

Both IAF and MAF are expressive models, with different computational tradeoffs:
MAF: fast likelihood evaluation, slow sampling
IAF: fast sampling, slow likelihood evaluation
MAF is better suited to MLE-based training and density estimation
IAF is better suited to real-time generation
Can we get the best of both worlds?



Lecture Outline

1 Flows With Triangular Jacobians


Nonlinear Independent Components Estimation (Dinh et al. 2014)
Real NVP (Dinh et al. 2017)
2 Autoregressive Flows
Masked Autoregressive Flow (Papamakarios et al., 2017)
Inverse Autoregressive Flow (Kingma et al., 2016)
3 Probability Distillation and Parallel Wavenet



Recall: WaveNet (van den Oord et al., 2016)

WaveNet is a state-of-the-art generative model for audio.

Because WaveNet is autoregressive, generating from it is slow, especially for audio sampled at tens of thousands of Hz.



Model Distillation

In probability density distillation, a student distribution is trained to minimize the KL divergence between the student (s) and the teacher (t):

$$D_{\mathrm{KL}}(s, t) = \mathbb{E}_{x \sim s}\big[\log s(x) - \log t(x)\big]$$

Evaluating and optimizing Monte Carlo estimates of this objective requires:
Samples x from the student model (e.g., IAF)
Density of x assigned by the student model
Density of x assigned by the teacher model (e.g., MAF)

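
A sketch of a Monte Carlo estimate of this objective. `student_sample`, `student_logpdf`, and `teacher_logpdf` are placeholders for an IAF student and a teacher density; the toy usage below uses two Gaussians whose true KL divergence is 0.125.

```python
import numpy as np

def kl_estimate(student_sample, student_logpdf, teacher_logpdf, num_samples=128):
    # Monte Carlo estimate of D_KL(student || teacher) using student samples
    xs = [student_sample() for _ in range(num_samples)]
    return float(np.mean([student_logpdf(x) - teacher_logpdf(x) for x in xs]))

# Toy usage: student N(0.5, 1), teacher N(0, 1); true KL is 0.125
rng = np.random.default_rng(0)
student_sample = lambda: rng.normal(0.5, 1.0)
student_logpdf = lambda x: -0.5 * ((x - 0.5) ** 2 + np.log(2 * np.pi))
teacher_logpdf = lambda x: -0.5 * (x ** 2 + np.log(2 * np.pi))
print(kl_estimate(student_sample, student_logpdf, teacher_logpdf, 2000))
```
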


Accelerating Wavenet

Fast WaveNet: Two-part training with a teacher and student model


1 Train a WaveNet teacher model efficiently via MLE
2 Fit a student model parameterized by IAF
Student IAF model allows for efficient sampling.
Student IAF can also evaluate density of its samples! (via caching)
Teacher can also quickly evaluate the density of samples from the
student, making distillation possible
3 At inference time, we use the student IAF model to generate audio.
This improves WaveNet speed by 1000x!



Summary of Normalizing Flow Models

Transform simple distributions into more complex distributions via change of variables
Normalizing Flows Pros:
Exact marginal likelihood p(x) is tractable to compute and optimize
Exact posterior inference p(z|x) is tractable
Normalizing Flows Cons:
Only works for continuous variables
The dimensionality of z and x must be the same (can pose
computational challenges).
Places important constraints on what model family we can use.
Strategies for constructing flows
Composition of simple bijections
Triangular Jacobian
Can be interpreted as a model with a certain autoregressive structure that influences the speed of forward and inverse sampling.

