
Residual Flows for Invertible Generative Modeling

Ricky T. Q. Chen1,3, Jens Behrmann2, David Duvenaud1,3, Jörn-Henrik Jacobsen1,3


University of Toronto1, University of Bremen2, Vector Institute3

Contributions

We build a powerful flow-based model based on the invertible residual network architecture (i-ResNet) (Behrmann et al., 2019). We:
1. Use a "Russian roulette" estimator to produce unbiased estimates of the log-density. This allows principled training as a flow-based model.
2. Formulate a gradient power series for computing the partial derivatives of the log determinant term in O(1) memory.
3. Motivate and investigate desiderata for Lipschitz-constrained activation functions that avoid gradient saturation.
4. Generalize i-ResNets to induced mixed norms and learnable norm orders.
Background: Invertible (Flow-based) Generative Models

Maximum likelihood estimation. To perform maximum likelihood with stochastic gradient descent, we require

    ∇_θ E_{x∼p_data(x)}[log p_θ(x)] = E_{x∼p_data(x)}[∇_θ log p_θ(x)]    (1)

Change of Variables. With an invertible transformation f, we can build a generative model

    z ∼ p(z),    x = f^{−1}(z).    (2)

Then the log-density of x is given by

    log p(x) = log p(f(x)) + log det(df(x)/dx).    (3)

Flow-based generative models can be
1. sampled, if (2) can be computed or approximated up to some precision.
2. trained using maximum likelihood, if (3) can be unbiasedly estimated.
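To make (2) and (3) concrete, here is a minimal sketch (not from the poster) that builds a toy flow from an invertible affine map, where the Jacobian determinant is available in closed form; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2

# Toy invertible transformation f(x) = W x + b with a well-conditioned W
# (hypothetical example; a real flow composes many learned invertible blocks).
W = np.eye(d) + 0.1 * rng.standard_normal((d, d))
b = rng.standard_normal(d)

def f(x):
    return W @ x + b

def f_inv(z):
    return np.linalg.solve(W, z - b)

def log_prob_base(z):
    # Standard normal base density p(z).
    return -0.5 * (z @ z + d * np.log(2 * np.pi))

def log_prob(x):
    # Change of variables (3): log p(x) = log p(f(x)) + log |det df/dx|.
    z = f(x)
    _, logabsdet = np.linalg.slogdet(W)
    return log_prob_base(z) + logabsdet

# Sampling as in (2): draw z ~ p(z), then x = f^{-1}(z).
z = rng.standard_normal(d)
x = f_inv(z)
print(log_prob(x))
```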
Background: Invertible Residual Networks (i-ResNets)

Residual networks are composed of simple transformations

    y = f(x) = x + g(x)    (4)

Behrmann et al. (2019) proved that if g is a contractive mapping, i.e. Lipschitz with constant strictly less than one, then the residual block transformation (4) is invertible.

Sampling. The inverse f^{−1} can be efficiently computed by the fixed-point iteration

    x^{(i+1)} = y − g(x^{(i)})    (5)

which converges geometrically by the Banach fixed-point theorem.
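The fixed-point inversion (5) can be sketched in a few lines. The residual block below is a toy stand-in, a tanh layer whose weight matrix is rescaled to spectral norm 0.9 so that g is contractive; it is an illustration, not the poster's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

# Toy residual block y = x + g(x) with Lip(g) <= 0.9 < 1:
# tanh is 1-Lipschitz, and the weight matrix is rescaled by its spectral norm.
W = rng.standard_normal((d, d))
W = 0.9 * W / np.linalg.norm(W, 2)

def g(x):
    return np.tanh(W @ x)

def f(x):
    return x + g(x)

def f_inverse(y, n_iters=50):
    # Fixed-point iteration x_{i+1} = y - g(x_i); converges since Lip(g) < 1.
    x = y.copy()
    for _ in range(n_iters):
        x = y - g(x)
    return x

x = rng.standard_normal(d)
y = f(x)
x_rec = f_inverse(y)
print(np.max(np.abs(x - x_rec)))  # close to machine precision
```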
Log-density. The change of variables can be applied to invertible residual networks:

    log p(x) = log p(f(x)) + tr( Σ_{k=1}^∞ (−1)^{k+1}/k · [J_g(x)]^k )    (6)

+ The trace can be efficiently estimated using the Skilling–Hutchinson estimator.
− The infinite sum can be estimated by truncating to a fixed n. However, this introduces bias equal to the remaining terms.

This results in the biased estimator:

    log p(x) ≈ log p(f(x)) + E_{v∼N(0,I)}[ Σ_{k=1}^n (−1)^{k+1}/k · v^T [J_g(x)]^k v ]    (7)

where Behrmann et al. (2019) chose n = 5, 10.
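A minimal sketch of the biased estimator (7): the log-determinant series is truncated at n terms and each trace is estimated with the Skilling–Hutchinson identity tr(A) = E_v[v^T A v]. The Jacobian is kept as an explicit matrix only so the result can be checked against an exact log-determinant; a real model would use Jacobian-vector products from automatic differentiation instead.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

# Toy Jacobian J_g of a contractive g (spectral norm < 1), kept explicit so the
# series can be compared with the exact log-determinant.
Jg = rng.standard_normal((d, d))
Jg = 0.5 * Jg / np.linalg.norm(Jg, 2)

def biased_logdet_estimate(Jg, n=10, n_samples=1000):
    # E_v[ sum_{k=1}^n (-1)^{k+1}/k * v^T Jg^k v ], with v ~ N(0, I).
    total = 0.0
    for _ in range(n_samples):
        v = rng.standard_normal(d)
        w = v.copy()
        est = 0.0
        for k in range(1, n + 1):
            w = Jg @ w                     # w = Jg^k v via repeated matvecs
            est += (-1) ** (k + 1) / k * (v @ w)
        total += est
    return total / n_samples

exact = np.linalg.slogdet(np.eye(d) + Jg)[1]   # log det(I + J_g)
print(exact, biased_logdet_estimate(Jg))
```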
Unbiased Log-density via "Russian Roulette" Estimator

Russian roulette estimator. To illustrate the idea, let ∆_k denote the k-th term of an infinite series, and suppose we always evaluate the first term and then flip a coin b ∼ Bernoulli(q) to determine whether we stop or continue evaluating the remaining terms. By reweighting the remaining terms by 1/(1−q), we obtain

    E_b[ ∆_1 + 1_{b=0} · (Σ_{k=2}^∞ ∆_k)/(1−q) ] = ∆_1 + (1−q)·(Σ_{k=2}^∞ ∆_k)/(1−q) + q·(0) = Σ_{k=1}^∞ ∆_k.

This unbiased estimator has probability q of being evaluated in finite time. We can obtain an estimator that is evaluated in finite time with probability one by applying this process infinitely many times to the remaining terms.

Residual Flows. Unbiased estimation of the log-density leads to our model:

    log p(x) = log p(f(x)) + E_{n,v}[ Σ_{k=1}^n (−1)^{k+1}/(k · P(N ≥ k)) · v^T [J_g(x)]^k v ]    (8)

where n ∼ p(N) and v ∼ N(0, I). We use a shifted geometric distribution for p(N) with an expected compute of 4 terms.
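The estimator in (8) combines the same power series with Russian roulette reweighting. The sketch below (illustrative, not the authors' code) draws n from a geometric distribution with support 1, 2, … as a stand-in for the shifted geometric mentioned above, and divides the k-th term by P(N ≥ k); averaging many single-sample estimates recovers the exact log-determinant.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

# Toy contractive Jacobian, explicit only so the estimate can be checked.
Jg = rng.standard_normal((d, d))
Jg = 0.5 * Jg / np.linalg.norm(Jg, 2)

def unbiased_logdet_estimate(Jg, q=0.5):
    # Single sample of the Russian roulette / randomized-truncation estimator:
    # draw n from a geometric distribution with support 1, 2, ...,
    # then reweight term k by 1 / P(N >= k) = 1 / (1 - q)^(k - 1).
    n = rng.geometric(q)                 # P(N = n) = (1 - q)^(n - 1) * q
    v = rng.standard_normal(d)
    w = v.copy()
    est = 0.0
    for k in range(1, n + 1):
        w = Jg @ w                       # Jg^k v via repeated matvecs
        p_geq_k = (1.0 - q) ** (k - 1)   # P(N >= k)
        est += (-1) ** (k + 1) / k * (v @ w) / p_geq_k
    return est

exact = np.linalg.slogdet(np.eye(d) + Jg)[1]
samples = [unbiased_logdet_estimate(Jg) for _ in range(20000)]
print(exact, np.mean(samples))           # sample mean approaches the exact value
```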
Memory-efficient Gradient Estimation

Neumann gradient series. For estimating (1), we can either
(i) estimate log p(x), then take the gradient, or
(ii) take the gradient, then estimate ∇_θ log p(x).

The first option uses a variable amount of memory depending on the sample of n:

    ∂/∂θ log det(df(x)/dx) = E_{n,v}[ Σ_{k=1}^n (−1)^{k+1}/k · ∂(v^T [J_g(x,θ)]^k v)/∂θ ]    (9)

With the second option, using a Neumann series, we obtain a constant memory cost:

    ∂/∂θ log det(df(x)/dx) = E_{n,v}[ ( Σ_{k=0}^n (−1)^k/P(N ≥ k) · v^T [J_g(x,θ)]^k ) ∂J_g(x,θ)/∂θ · v ]    (10)

Backward-in-forward. Since log det(df(x)/dx) is a scalar quantity, we can compute its gradient early and free up memory. This reduces the amount of memory by a factor equal to the number of residual blocks, with negligible cost.

                       MNIST            CIFAR-10†        CIFAR-10
                       ELU    LipSwish  ELU    LipSwish  ELU    LipSwish  Relative
Naïve Backprop         92.0   192.1     33.3   66.4      120.2  263.5     100%
Neumann Gradient       13.4   31.2      5.5    11.3      17.6   40.8      15.7%
Backward-in-Forward    8.7    19.8      3.8    7.4       11.5   26.1      10.3%
Both Combined          4.9    13.6      3.0    5.9       6.6    18.0      7.1%

Table: Memory usage (GB) per minibatch of 64 samples when computing n = 10 terms in the corresponding power series. †Uses immediate downsampling before any residual blocks.
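The Neumann form (10) rests on the identity ∂/∂θ log det(I + J_g) = tr((I + J_g)^{−1} ∂J_g/∂θ) together with (I + J_g)^{−1} = Σ_{k≥0} (−1)^k J_g^k for contractive J_g. The sketch below checks this numerically with a hypothetical one-parameter Jacobian J_g(θ) = θA; a real implementation would form these terms with automatic differentiation rather than explicit matrices.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4

# Hypothetical parameterization: J_g(theta) = theta * A with a fixed contractive A,
# so dJ_g/dtheta = A. Real models use autodiff; this is only a numerical check.
A = rng.standard_normal((d, d))
A = 0.8 * A / np.linalg.norm(A, 2)

def logdet(theta):
    return np.linalg.slogdet(np.eye(d) + theta * A)[1]

def neumann_gradient(theta, n_terms=50):
    # d/dtheta log det(I + J_g) = tr( (I + J_g)^{-1} dJ_g/dtheta )
    #                           = tr( sum_{k>=0} (-1)^k J_g^k  dJ_g/dtheta ).
    Jg, dJg = theta * A, A
    acc = np.zeros((d, d))
    Jk = np.eye(d)                      # J_g^0
    for k in range(n_terms):
        acc += (-1) ** k * Jk
        Jk = Jk @ Jg
    return np.trace(acc @ dJg)

theta, eps = 0.7, 1e-6
fd = (logdet(theta + eps) - logdet(theta - eps)) / (2 * eps)
print(fd, neumann_gradient(theta))       # the two gradients should agree closely
```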
LipSwish Activation for Enforcing Lipschitz Constraint

We motivate smooth and non-monotonic Lipschitz activation functions. This avoids "gradient saturation", which occurs if the 2nd derivative asymptotically approaches zero when the 1st derivative is close to one.

    LipSwish(x) = Swish(x)/1.1 = x · σ(βx)/1.1
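A minimal implementation sketch of LipSwish; parameterizing β through a softplus to keep it positive is an assumption here, not necessarily the poster's exact choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lipswish(x, beta_raw=0.5):
    # LipSwish(x) = x * sigmoid(beta * x) / 1.1, with beta = softplus(beta_raw) > 0
    # (the softplus parameterization is an assumption). Swish has maximum derivative
    # bounded by about 1.1, so dividing by 1.1 keeps the activation 1-Lipschitz.
    beta = np.log1p(np.exp(beta_raw))
    return x * sigmoid(beta * x) / 1.1

x = np.linspace(-5, 5, 11)
print(lipswish(x))
```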
Density Modeling

We are competitive with existing state-of-the-art flow-based models on density estimation when using uniform dequantization.

Model                              MNIST   CIFAR-10      ImageNet 32×32   ImageNet 64×64
Real NVP (Dinh et al., 2017)       1.06    3.49          4.28             3.98
Glow (Kingma & Dhariwal, 2018)     1.05    3.35          4.09             3.81
FFJORD (Grathwohl et al., 2019)    0.99    3.40          —                —
Flow++ (Ho et al., 2019)           —       3.29 (3.09)   — (3.86)         — (3.69)
i-ResNet (Behrmann et al., 2019)   1.05    3.45          —                —
Residual Flow (Ours)               0.97    3.29          4.02             3.78

Table: Results [bits/dim] on standard benchmark datasets for density estimation. In brackets are models that used "variational dequantization", which we don't compare against.

Ablation Experiments

[Figure: Bits/dim on CIFAR-10 vs. training epoch, comparing the i-ResNet biased train estimate and actual test value with the Residual Flow unbiased train estimate and actual test value.]

Training Setting            MNIST   CIFAR-10†   CIFAR-10
i-ResNet + ELU              1.05    3.45        3.66∼4.78
Residual Flow + ELU         1.00    3.40        3.32
Residual Flow + LipSwish    0.97    3.39        3.29

Table: Ablation results. †Uses immediate downsampling before any residual blocks.

Qualitative Samples

[Figure: Qualitative samples. Real (left) and random samples (right) from a model trained on 5-bit 64×64 CelebA. The most visually appealing samples were picked out of 5 random batches.]

Hybrid Modeling

Residual blocks are better building blocks for hybrid models than coupling blocks. Trained using a weighted maximum likelihood objective similar to (Nalisnick et al., 2019):

    E_{(x,y)∼p_data}[ λ log p(x) + log p(y|x) ]    (11)

                          MNIST                                 SVHN
                          λ=0      λ=1/D           λ=1          λ=0      λ=1/D           λ=1
Block Type                Acc↑     BPD↓    Acc↑    BPD↓  Acc↑   Acc↑     BPD↓    Acc↑    BPD↓  Acc↑
(Nalisnick et al., 2019)  99.33%   1.26    97.78%  —     —      95.74%   2.40    94.77%  —     —
Coupling                  99.50%   1.18    98.45%  1.04  95.42% 96.27%   2.73    95.15%  2.21  46.22%
+ 1×1 Conv                99.56%   1.15    98.93%  1.03  94.22% 96.72%   2.61    95.49%  2.17  46.58%
Residual                  99.53%   1.01    99.46%  0.99  98.69% 96.72%   2.29    95.79%  2.06  58.52%

Table: Comparison of residual vs. coupling blocks for the hybrid modeling task.
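A minimal sketch of the weighted objective (11), assuming a model that exposes hypothetical log_prob_x and log_prob_y_given_x hooks returning per-example values; λ = 0 gives a pure classifier, λ = 1 weights both terms equally, and λ = 1/D rescales log p(x) by the input dimensionality D, matching the three settings in the table above.

```python
import numpy as np

def hybrid_objective(model, x, y, lam):
    # Weighted maximum likelihood objective (11):
    #   E_{(x, y)} [ lam * log p(x) + log p(y | x) ]
    # `model.log_prob_x` and `model.log_prob_y_given_x` are hypothetical hooks.
    log_px = model.log_prob_x(x)               # shape: (batch,)
    log_py_x = model.log_prob_y_given_x(x, y)  # shape: (batch,)
    return np.mean(lam * log_px + log_py_x)

# Stand-in model whose densities are fabricated purely for illustration.
class DummyHybridModel:
    def log_prob_x(self, x):
        return -0.5 * np.sum(x ** 2, axis=1)   # placeholder log-density

    def log_prob_y_given_x(self, x, y):
        return np.zeros(len(y))                # placeholder log-likelihood

x = np.random.default_rng(5).standard_normal((8, 16))
y = np.zeros(8, dtype=int)
print(hybrid_objective(DummyHybridModel(), x, y, lam=1.0 / x.shape[1]))
```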
References

• Behrmann et al. "Invertible Residual Networks." (2019)
• Kahn. "Use of Different Monte Carlo Sampling Techniques." (1955)
• Beatson & Adams. "Efficient Optimization of Loops and Limits with Randomized Telescoping Sums." (2019)
• Nalisnick et al. "Hybrid Models with Deep and Invertible Features." (2019)
