Resampling Base Distributions of Normalizing Flows

Vincent Stimper¹,²    Bernhard Schölkopf¹    José Miguel Hernández-Lobato²

¹ Max Planck Institute for Intelligent Systems, Germany
² University of Cambridge, United Kingdom

Abstract

Normalizing flows are a popular class of models for approximating probability distributions. However, their invertible nature limits their ability to model target distributions whose support has a complex topological structure, such as Boltzmann distributions. Several procedures have been proposed to solve this problem, but many of them sacrifice invertibility and, thereby, tractability of the log-likelihood as well as other desirable properties. To address these limitations, we introduce a base distribution for normalizing flows based on learned rejection sampling, allowing the resulting normalizing flow to model complicated distributions without giving up bijectivity. Furthermore, we develop suitable learning algorithms using both maximization of the log-likelihood and optimization of the Kullback-Leibler divergence, and apply them to various sample problems, i.e. approximating 2D densities, density estimation of tabular data, image generation, and modeling Boltzmann distributions. In these experiments our method is competitive with or outperforms the baselines.

Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022, Valencia, Spain. PMLR: Volume 151. Copyright 2022 by the author(s).

1 INTRODUCTION

Inferring and approximating probability distributions is a central problem of unsupervised machine learning. A popular class of models for this task are normalizing flows (Tabak and Vanden-Eijnden, 2010; Tabak and Turner, 2013; Rezende and Mohamed, 2015), which are given by an invertible map transforming a simple base distribution such as a Gaussian to obtain a complex distribution matching our target. Normalizing flows have been applied successfully to a variety of problems, such as image generation (Dinh et al., 2015, 2017; Kingma and Dhariwal, 2018; Ho et al., 2019; Grcić et al., 2021), audio synthesis (van den Oord et al., 2018), variational inference (Rezende and Mohamed, 2015), semi-supervised learning (Izmailov et al., 2020) and approximating Boltzmann distributions (Noé et al., 2019; Wu et al., 2020; Wirnsberger et al., 2020) among others (Papamakarios et al., 2021). However, with respect to some performance measures they are still outperformed by autoregressive models (Chen et al., 2018; Parmar et al., 2018; Child et al., 2019), generative adversarial networks (GANs) (Gulrajani et al., 2017; Karras et al., 2019, 2020a,b), and diffusion based models (Sohl-Dickstein et al., 2015; Kingma et al., 2021). One reason for this is an architectural limitation. Due to their bijective nature, the normalizing flow transformation leaves the topological structure of the support of the base distribution unchanged and, since it is usually simple, there is a topological mismatch with the often complex target distribution (Cornish et al., 2020), thereby diminishing the modeling performance and even causing exploding inverses (Behrmann et al., 2021). Several solutions have been proposed, e.g. augmenting the space the model operates on (Huang et al., 2020), continuously indexing the flow layers (Cornish et al., 2020), and adding stochastic (Wu et al., 2020) or surjective layers (Nielsen et al., 2020). However, these approaches sacrifice the bijectivity of the flow transformation, which means in most cases that the model is no longer tractable, memory savings during training are no longer possible (Gomez et al., 2017), and the model is no longer a perfect encoder-decoder pair. Some work has been done on using multimodal base distributions (Izmailov et al., 2020; Ardizzone et al., 2020; Hagemann and Neumayer, 2021), but the intention was to do classification or solve inverse problems with flow-based models and not to capture the inherent multimodal nature of the target distribution. Papamakarios et al. (2017) took a mixture of Gaussians as base distribution and showed that this can improve the performance.
In this work, we develop a method to obtain a more expressive base distribution through learned accept/reject sampling (LARS) (Bauer and Mnih, 2019). It can be estimated jointly with the flow map by either maximum likelihood (ML) learning or minimizing the Kullback-Leibler (KL) divergence, matching the topological structure of the target's support. Moreover, we propose how the method can be scaled up to high dimensional datasets and demonstrate the effectiveness of our procedure on the tasks of learning 2D densities, estimating the density of tabular data, generating images, and the approximation of a 22 atom molecule's Boltzmann distribution.

Figure 1: Illustration of the architectural limitation of normalizing flows. (a) depicts the multimodal target distribution, (b) the Gaussian base distribution used, and (c) the learned real NVP model. The model's support has one connected component with a density filament between the modes.

2 BACKGROUND

2.1 Normalizing Flows

Let z be a random variable taking values in R^d, having the density pϕ(z) parameterized by ϕ. Furthermore, let Fθ : R^d → R^d be a bijective map parameterized by θ. We can compute the tractable density of the new random variable x := Fθ(z) with the change of variables formula

p(x) = pϕ(z) |det(JFθ(z))|^{−1},    (1)

where JFθ is the Jacobian matrix of Fθ. This way of constructing a complex probability distribution p(x) from a simple base distribution p(z) is called a normalizing flow. We can use them to approximate a target density p∗(x), which is done by optimizing a training objective. If the target density is unknown but samples from the corresponding distribution are available, we maximize the expected log-likelihood (LL) of the model

LL(θ, ϕ) = Ep∗(x)[log(p(x))].    (2)

Conversely, if the target density is given, we minimize the (reverse) KL divergence¹ (Papamakarios et al., 2021)

KLD(θ, ϕ) := Ep(x)[log p(x)] − Ep(x)[log p∗(x)],    (3)

or another difference measure for probability distributions such as the α-divergence (Hernández-Lobato et al., 2016).

To deal with high dimensional data, such as images, the multiscale architecture was introduced by Dinh et al. (2017). As sketched in Figure 5, at the first level, the entire input x is transformed by several flow layers F_1. The result is split up into two parts, h_1^(1) and h_1^(2). For images, this is typically done by first squeezing the image, i.e. reducing its height and width by a factor 2 and adding the surplus pixels as additional channels, and then splitting the resulting tensor along the channel dimension. h_1^(1) is immediately factored out in the density, while h_1^(2) is further transformed by the next set of flow layers F_2. The process is then repeated until a desired depth is reached. The full density for a multiscale architecture with n levels is given by

p(x) = ∏_{i=1}^{n} |det(JFi(h_{i−1}))| p(z_i),    (4)

where we set h_0 = x.

Normalizing flows can compete with other machine learning models on many benchmarks (Papamakarios et al., 2021). However, their performance is still impaired by an architectural weakness. The transformations defining a normalizing flow are invertible and such maps leave the topology of the sets they map unchanged (Runde, 2005). Consequently, the topological structure of the support of p(z) is the same as that of p(x). Usually, the base distribution is a Gaussian, which has only one mode, so its support consists of one connected component, but the target distribution might be multimodal with the density between the modes being close to zero or even numerically zero due to finite precision, so that the support consists of multiple disconnected components. As an exemplification, we fit a real-valued non-volume preserving (real NVP) flow model with 8 coupling layers to a multimodal target distribution, see Figure 1. The density of the trained model consists of one connected component covering the modes of the target, but connecting them via a density filament. Certain flow-based models, such as the residual flow (Behrmann et al., 2019; Chen et al., 2019), can only converge to the target if they become non-invertible due to the topological mismatch (Cornish et al., 2020), thereby causing unstable training behaviour (Behrmann et al., 2021).

¹ For simplicity, we will call it just KL divergence from now on.
Proposed solutions include increasing the model size significantly (Chen et al., 2019), but this increases the computational cost and memory demand while the stability issues persist. Training can be stabilized via a suitable regularization, but this reduces the performance (Behrmann et al., 2021). Other approaches are discussed in Section 5.

2.2 Learned accept/reject sampling

Learned accept/reject sampling (LARS) is a method to approximate a d-dimensional distribution q(z) by reweighting a proposal distribution π(z) through a learned acceptance function aϕ : R^d → [0, 1], where ϕ are the learned parameters (Bauer and Mnih, 2019). Given a sample z_i from π, we will accept it with a probability aϕ(z_i), otherwise we reject it and draw a new sample until we accept one of the proposed samples. The resulting distribution is given by

p∞(z) = π(z) aϕ(z) / Z;    Z := ∫ π(z) aϕ(z) dz.    (5)

In order to limit the computational cost caused by high rejection rates, Bauer and Mnih (2019) introduced a truncation parameter T ∈ N. If the first T − 1 samples from the proposal get rejected, we accept the T-th sample no matter the value of the learned acceptance probability. Through this intervention, we alter the final sampling distribution to become

pT(z) = (1 − αT) aϕ(z) π(z) / Z + αT π(z),    (6)

where αT := (1 − Z)^{T−1}, which reduces to (5) for T → ∞. The integral (5) defining Z is not tractable, so we cannot compute it directly. Instead, it is estimated via Monte Carlo sampling, i.e.

Z ≈ (1/S) Σ_{s=1}^{S} aϕ(z_s),    (7)

where z_s ∼ π(z), which needs to be recomputed in every training iteration, as parameter changes in aϕ cause a change in Z.

LARS was first used to create a more expressive prior for variational autoencoders (VAEs) (Kingma and Welling, 2014), making it closer to the aggregate posterior distribution, thereby bringing the approximate posterior distribution closer to the ground truth. The resampled priors are trained jointly with the likelihood and the approximate posterior via maximization of the evidence lower bound. Since this only requires evaluating the density of the prior at the data points, it is not even required to perform rejection sampling during training; therefore, the computational cost of training the whole model is only increased slightly.

3 METHOD

3.1 Resampled base distributions

In Section 2.1, we argued that the topological structure of the support of the base distribution equals that of the overall flow distribution. To avoid artefacts resulting from mismatches between them, we aim to make the latter closer to the former. Therefore, we resample the base distribution with LARS, i.e. use it as our proposal so that its density becomes (6). Since there are no restrictions on the acceptance function aϕ, we can use an arbitrarily complex neural network to model any desired topological structure. The resulting log-probability of the model is given by

log p(x) = log π(z) + log( αT + (1 − αT) aϕ(z) / Z ) − log |det JFθ(z)|,    (8)

where Fθ is the flow transformation, i.e. the composition of all flow layers, and z = Fθ^{−1}(x). In our case, the proposal is a Gaussian but it could be any other distribution or a more complicated model, such as a mixture of Gaussians or an autoregressive model. Depending on the application, aϕ will be a fully connected or a convolutional neural network, and details about how the architecture can be chosen are given in Appendix C.1. Since the evaluation of aϕ can be parallelized over the number of dimensions of the data d, we only add a constant computational overhead to our model. In contrast, autoregressive models scale linearly with d. We can sample from the model by performing LARS and propagating the accepted values through the flow map. The rejection rate, and hence the sampling speed, can be controlled via the truncation parameter T, which we set to 100 in our experiments unless otherwise stated, but also through adding Z to our loss function, which is discussed in Appendix C.2.

Usually, the base distribution of normalizing flows has mean and variance parameters being trained with the flow layer parameters. Our proposal is simply a standard normal distribution, i.e. a diagonal Gaussian with mean zero and variance one. Thereby, we ensure that the samples from the proposal, which are the input for the neural network representing the learned acceptance probability aϕ, come from a distribution which does not change during training. Instead, the mean and variance of the distribution can be altered after the resampling process by applying an affine flow layer with scale and shift being learnable parameters.

Note that while we retain the invertibility of the flow, the probability distribution (8) cannot be evaluated exactly since Z needs to be estimated via (7).
However, for large T the base distribution reduces to (5) and, hence, we are only off by a constant, meaning there would not be a bias when doing importance sampling, which is crucial for applications such as Boltzmann generators, see Section 4.4.
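To make the sampling scheme and the densities (5)-(8) concrete, the sketch below illustrates truncated learned accept/reject sampling in PyTorch. It is our own illustration under stated assumptions, not the reference implementation; the names acceptance_net, estimate_Z, sample_resampled_base, and log_prob_resampled_base are hypothetical.

    # A hedged sketch of LARS with truncation, following (5)-(8); not the
    # paper's released code. `acceptance_net` stands in for a_phi.
    import torch
    import torch.nn as nn

    d, T, S = 2, 100, 1024
    acceptance_net = nn.Sequential(nn.Linear(d, 256), nn.ReLU(),
                                   nn.Linear(256, 1), nn.Sigmoid())
    proposal = torch.distributions.Normal(torch.zeros(d), torch.ones(d))

    def estimate_Z(num_samples=S):
        # Monte Carlo estimate of Z = E_pi[a_phi(z)], cf. (7).
        z = proposal.sample((num_samples,))
        return acceptance_net(z).mean()

    def sample_resampled_base():
        # Accept z with probability a_phi(z); the T-th proposal is always accepted.
        for i in range(T):
            z = proposal.sample((1,))
            if i == T - 1 or torch.rand(1) < acceptance_net(z):
                return z.squeeze(0)

    def log_prob_resampled_base(z, Z):
        # log p_T(z) = log pi(z) + log(alpha_T + (1 - alpha_T) a_phi(z) / Z),
        # i.e. the base-distribution part of (8), with alpha_T = (1 - Z)^(T - 1).
        alpha_T = (1.0 - Z) ** (T - 1)
        a = acceptance_net(z.unsqueeze(0)).squeeze()
        return proposal.log_prob(z).sum() + torch.log(alpha_T + (1.0 - alpha_T) * a / Z)

A full model would subtract the log-determinant of the flow transformation from log_prob_resampled_base to obtain (8).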

3.2 Learning algorithms

The resampled base distribution can be trained jointly with the flow layers of our model. Both the expected LL and the KL divergence can be used as objectives. The former corresponds to maximizing (2), which is done via stochastic gradient descent. As done by Bauer and Mnih (2019), we sample from the proposal in each iteration to estimate the gradient of Z with respect to the parameters, see (7). To stabilize training, we estimate the value of the normalization constant by an exponential moving average, see Appendix A.1 for more details.

When the unnormalized target density p̂∗(x) is known, we can use the KL divergence (3) as our objective. However, because sampling from the base distribution includes an acceptance/rejection step, we cannot apply the reparameterization trick (Kingma and Welling, 2014) to obtain the gradients with respect to the model parameters. Instead, we derive an expression of the gradients of the KL divergence similar to that introduced by Grover et al. (2018).

Theorem 1. Let pϕ(z) be the base distribution of a normalizing flow, having parameters ϕ, and Fθ be the respective invertible mapping, depending on its parameters θ, such that the density of the model is

log(p(x)) = log(pϕ(z)) − log |det JFθ(z)|,    (9)

with x = Fθ(z). Then, the gradients of the KL divergence with respect to the parameters are given by

∇ϕ KLD(θ, ϕ) = Cov_{pϕ(z)}( ∇ϕ log pϕ(z), log(pϕ(z)) − log |det JFθ(z)| − log p̂∗(Fθ(z)) ),    (10)

∇θ KLD(θ, ϕ) = −E_{pϕ(z)}[ ∇θ( log p̂∗(Fθ(z)) + log |det JFθ(z)| ) ].    (11)

The proof is given in Appendix A.2. We will use (10) and (11) to compute the gradients of the KL divergence in our experiments and, thereby, demonstrate its effectiveness.
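As an illustration of how (10) and (11) can be turned into training code, the sketch below builds surrogate losses whose automatic-differentiation gradients match the two estimators. It is our own sketch; base_log_prob, flow_forward, and log_target are assumed callables rather than functions from the paper's codebase.

    def kld_surrogate_losses(z, base_log_prob, flow_forward, log_target):
        # z: PyTorch tensor of samples from the base distribution p_phi
        # (obtained with LARS), treated as fixed rather than reparameterized.
        z = z.detach()
        log_p_phi = base_log_prob(z)                  # depends on phi
        x, log_det = flow_forward(z)                  # depend on theta
        log_p_target = log_target(x)                  # unnormalized target density
        w = log_p_phi - log_det - log_p_target        # integrand of the KL divergence

        # Gradient w.r.t. theta, eq. (11): plain Monte Carlo average,
        # since z does not depend on theta.
        loss_theta = -(log_det + log_p_target).mean()

        # Gradient w.r.t. phi, eq. (10): the covariance form corresponds to a
        # score-function estimator with the batch mean of w as baseline;
        # w enters as a constant.
        weight = (w - w.mean()).detach()
        loss_phi = (weight * log_p_phi).mean()
        return loss_theta, loss_phi

The two losses can be summed and backpropagated together, since each only contributes gradients to its own set of parameters.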

Figure 2: Visualization of a feature map (axes: channels, height, width) when processing an image in a machine learning model. The unit which is used for factorization in our resampled base distribution is shown in red.

3.3 Application to multiscale architecture

LARS cannot be applied to very high dimensional distributions because we have to estimate Z and its gradients via Monte Carlo sampling and the number of samples needed grows exponentially with the number of dimensions (Bauer and Mnih, 2019). Although the base distribution of a normalizing flow must have the same number of dimensions as the target, we can reduce the number of dimensions significantly by factorization. Therefore, we extend the multiscale architecture, see Section 2.1 and (Dinh et al., 2017), by further subdividing the base distribution at each level into factors with less than 100 dimensions. First, we squeeze the feature map until the product of height and width is smaller than 100. Then, each channel is treated as a separate factor, see Figure 2. To reduce the complexity of the model, we use parameter sharing to express the distribution of the factors, i.e. there is one neural network per level with multiple outputs, each representing the acceptance probability aϕ for one channel. This also has the advantage that we can estimate the normalization constant and its gradient for all factors of one level in parallel by sampling from a Gaussian, passing the samples through the neural network, and computing the average for each output dimension separately. As mentioned in Section 3.1, the mean and variance are added via a constant coupling layer. Furthermore, the base distribution can be made class-conditional by making the mean and variance and/or aϕ dependent on the class. The latter can be efficiently achieved by adding more outputs to the neural network to have one value for aϕ per class and distribution if needed.
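The following sketch (ours, with hypothetical names; a fully connected network is used for simplicity, whereas the paper uses a CNN for images, see Appendix F) illustrates the parameter sharing described above: a single network per level outputs one acceptance probability per channel, so the normalization constants of all factors of a level can be estimated from one batch of Gaussian samples.

    import torch
    import torch.nn as nn

    channels, height, width, S = 48, 8, 8, 1024
    factor_dim = height * width                     # kept below 100 by squeezing
    acceptance_net = nn.Sequential(nn.Linear(factor_dim, 256), nn.ReLU(),
                                   nn.Linear(256, channels), nn.Sigmoid())

    def estimate_Z_per_channel(num_samples=S):
        # One Gaussian batch serves all channels: output column c is a_phi for
        # the factor of channel c, so averaging per column gives every Z_c at once.
        z = torch.randn(num_samples, factor_dim)
        return acceptance_net(z).mean(dim=0)        # shape: (channels,)

    def log_acceptance(feature_map):
        # feature_map: (channels, height, width); evaluate a_phi channel-wise.
        z = feature_map.reshape(channels, factor_dim)
        a_all = acceptance_net(z)                   # (channels, channels)
        # For the factor of channel c, use output c of the shared network.
        return torch.log(torch.diagonal(a_all))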
4 EXPERIMENTS

4.1 2D distributions

In this section, we aim to demonstrate that our method is indeed capable of modeling complicated distributions. Our code for all experiments is publicly available on GitHub at https://github.com/VincentStimper/resampled-base-flows.
Figure 3: Visualization of the real NVP densities as well as the learned resampled base distribution when approximating three 2D distributions with complex topological structure (rows: dual moon, circle of Gaussians, two rings; columns: target, real NVP with Gaussian base, real NVP with Gaussian mixture base, real NVP with resampled base, and the learned resampled base distribution). The models are trained via ML learning.

Table 1: KL divergences of the target distribution and the flow models which are trained to approximate the
three 2D distributions, shown in Figure 3, with ML learning. For each target distribution and flow architecture,
the model with the lowest KL divergence is marked in bold.

Flow architecture Real NVP Real NVP Real NVP Residual Residual Residual
Base distribution Gaussian Mixture Resampled Gaussian Mixture Resampled
Dual moon 1.83 1.80 1.77 1.82 1.80 1.76
Circle of Gaussians 0.090 0.060 0.043 0.045 0.042 0.039
Two rings 10.7 10.6 10.4 11.7 10.8 10.4

We start with simple 2D distributions having supports with various topological structures, i.e. a distribution with two modes, one with eight modes, and one with two rings, see Figure 3 and Table 8. We use both learning algorithms discussed in Section 3.2. To train our flows via ML, we draw samples from our distributions via rejection sampling. As flow architectures, we choose real NVP (Dinh et al., 2017) and residual flow (Behrmann et al., 2019; Chen et al., 2019) with 16 layers each. For each flow architecture, we train models with a Gaussian, a mixture of 10 Gaussians, and a resampled base distribution, having a Gaussian proposal and a neural network with 2 hidden layers with 256 hidden units each as well as a sigmoid output function as acceptance probability.

The densities of the trained real NVP and residual flow models are shown in Figures 3 and 9, respectively. With a Gaussian base distribution, the flows struggle to model the complex topological structure. For the trained real NVP models it is especially visible in Figure 3 that the density essentially consists of one connected component, since there are density filaments between the modes and the rings are not closed. The multimodal distributions can be fitted much better when using a mixture of Gaussians as base distribution, but especially the ring distribution can still not be represented properly. With a resampled base distribution the flow models the target distributions accurately without any artefacts. The base distributions assume the respective topological structure of the target while the flow transformation does the fine adjustment of the density. We also estimate the KL divergences of the target and the model distributions, which are listed in Table 1. In all cases the flow model with the resampled base distribution outperforms the respective baselines.
Table 2: LL on the test sets of the respective datasets of NSF, its CIF variant, and a NSF with a resampled
base distribution (RBD). The values are averaged over 3 runs each and the standard error is given as a measure
of uncertainty. The highest values within the confidence interval are marked in bold.

Method Power Gas Hepmass Miniboone


NSF 0.69 ± 0.00 13.01 ± 0.02 −14.30 ± 0.05 −10.68 ± 0.06
CIF-NSF 0.68 ± 0.01 13.08 ± 0.00 −13.83 ± 0.10 −9.93 ± 0.06
RBD-NSF (ours) 0.69 ± 0.01 13.29 ± 0.05 −14.02 ± 0.12 −9.45 ± 0.03

Moreover, we train real NVP models with Gaussian and resampled base distributions with the KL divergence using the gradient estimators derived in Theorem 1. The same architecture as for the models trained with ML learning is used and their resulting densities are shown in Figure 7. In addition, we also computed the KL divergences listed in Table 3. As for the previous experiments, the flow with the resampled base distribution clearly outperforms its baseline visually and quantitatively for all the three targets.

Table 3: KL divergences of the target distribution and the models which were trained using the KL divergence, shown in Figure 7. For each target distribution, the real NVP model with the lower KL divergence is marked in bold.

Base distribution      Gaussian   Resampled
Dual moon              1.844      1.839
Circle of Gaussians    0.167      0.122
Two rings              11.5       10.3

4.2 Tabular data

Next, we estimate the density of four tabular datasets from the UCI Machine Learning Repository (Dheeru and Taniskidou, 2022). We use the same preprocessing and training, validation, and test splits as Papamakarios et al. (2017), which have been adopted by others in the field (Durkan et al., 2019; Cornish et al., 2020). For each dataset, we train a Neural Spline Flow (NSF) (Durkan et al., 2019), its continuously indexed (CIF) variant (Cornish et al., 2020), and one with a resampled base distribution. The LL of the models is shown in Table 2. More details about the setup and the architecture as well as results for real NVP flows on the same datasets are given in Appendix E.

There is no significant performance difference between the three methods on the Power dataset. On Hepmass, the resampled base distribution achieves similar performance to CIF, while both are better than the vanilla NSF. For the Gas and Miniboone datasets, the flow with the resampled base distribution clearly outperforms its baselines. When using real NVP, the difference is even larger on all datasets but Miniboone, as can be seen in Table 9.

4.3 Image generation

To model images with our method, we train Glow (Kingma and Dhariwal, 2018) on the CIFAR-10 dataset (Krizhevsky, 2009). We use the multiscale architecture introduced in Section 3.3, where we compare a Gaussian with a respective resampled base distribution. As done by Kingma and Dhariwal (2018), we use 3 levels, but train models with 8, 16, and 32 layers per level with each base distribution, with more details provided in Appendix F. For each model architecture, we do three seeded training runs and report bits per dimension on the test set in Table 4.

Table 4: Bits per dimension on the test set of the Glow models with Gaussian and resampled base distribution trained on CIFAR-10. For each architecture, three seeded training runs were done, the reported bits per dimension values are averages over these runs, and the standard error is given as an uncertainty estimate. For each number of layers, the lowest value within the confidence interval is marked in bold.

Base distribution      Gaussian          Resampled
8 layers per level     3.403 ± 0.002     3.399 ± 0.001
16 layers per level    3.339 ± 0.001     3.332 ± 0.001
32 layers per level    3.283 ± 0.002     3.282 ± 0.001

The flow with the resampled base distribution outperforms the baseline when using 8 or 16 layers per level, while performing about equal with 32 layers. The difference is larger for smaller models, i.e. those where fewer layers are used, since models with many layers are already rather expressive. Using a more expressive base distribution also increases the model size and the training time, but this amounts only to 0.4-1.5% and 5-15%, respectively, versus a roughly linear increase with the number of layers. Hence, this can be a desirable trade-off, depending on the use case.
Figure 4: Marginal distribution of three dihedral angles of Alanine dipeptide. The ground truth was determined
with a MD simulation. The flow models are based on real NVP and were trained via ML.

Table 5: Quantitative comparison of the real NVP models approximating the Boltzmann distribution of Alanine
dipeptide trained via ML learning. The LL is evaluated on a test set obtained with a MD simulation. The KL
divergences of the 60 marginals were computed and the mean and median of them are reported. All results are
averages over 10 runs, the standard error is given, and the highest LL as well as the lowest KL divergences are marked
in bold.

Base distribution      Gaussian          Mixture           Gaussian          Resampled
Number of layers       16                16                19                16
LL (×10^2)             1.8096 ± 0.0002   1.8106 ± 0.0002   1.8109 ± 0.0001   1.8118 ± 0.0001
Mean KLD (×10^−3)      1.76 ± 0.08       8.23 ± 0.82       1.35 ± 0.03       1.12 ± 0.02
Median KLD (×10^−4)    5.20 ± 0.10       43.5 ± 6.0        4.63 ± 0.08       4.36 ± 0.05

4.4 Boltzmann generators

An important application of normalizing flows is the approximation of Boltzmann distributions. Given the atom coordinates x of a molecule, the likelihood of finding it in this state, i.e. the Boltzmann distribution, is proportional to e^{−u(x)}, where u denotes the energy of the system, which can be obtained through physical modeling. Usually, samples are drawn from this distribution through molecular dynamics (MD) simulations (Leimkuhler and Matthews, 2015). However, the sampling process can be greatly accelerated by approximating the Boltzmann distribution with a normalizing flow, called a Boltzmann generator, and then sampling from the flow model (Noé et al., 2019).

Here, we approximate the Boltzmann distribution of the 22 atom Alanine dipeptide, which has been used as a benchmark system in the machine learning literature (Wu et al., 2020; Campbell et al., 2021; Köhler et al., 2021). We use the coordinate transformation introduced by Noé et al. (2019), see also Appendix G.1, which incorporates the translational and rotational symmetry and reduces the number of dimensions from 66 to 60. Both ML learning and training using the KL divergence are used. For the former, we generate a training dataset through an MD simulation over 10^7 steps, keeping every 10th sample, resulting in a dataset with 10^6 samples. With the same procedure we generate a test set to evaluate all trained models.

With ML learning we train real NVP models having 16 layers with a Gaussian, a mixture of 10 Gaussians, and a resampled base distribution. Furthermore, we train another real NVP model with a Gaussian base, but having 19 layers, which has roughly the same number of parameters as the real NVP model with the resampled base. More details of the architecture and the training procedure are listed in Appendix G.2. The marginal distributions of three dihedral angles are shown in Figure 4.

Although we tried various methods of initializing the mixture of Gaussians, training it jointly with the flow turns out to be unstable, leading to a poor fit of the marginals, which is especially visible for γ3. Moreover, for two of the three angles, the 16-layered model with the Gaussian base distribution cannot represent the multimodal nature of the distribution accurately.
Increasing the number of layers to 19 improves the result, but even this model is clearly outperformed by the real NVP with a resampled base distribution. To compare the performance quantitatively, we computed the LL on the test set and estimated the KL divergence between the MD samples and the models of the marginals through histograms for all 60 dimensions and report the mean and median. All performance measures were averaged over 10 seeded runs and are shown in Table 5. The real NVP model with the resampled base distribution outperforms all the baselines. The improved performance comes at the cost of increased training time, i.e. 49% and 26%, and sampling time, i.e. by a factor of 4 and 1.8, for the real NVP and the residual flow models with the resampled base distribution, when compared to their Gaussian counterparts. A further analysis of the Ramachandran plots of the models is done in Appendix G.3. There, we also do a comparison to stochastic normalizing flows (Wu et al., 2020) and show the results of training residual flows with ML, whereby the model with the resampled base distribution outperforms the baselines as well.

Moreover, we used the KL divergence to train real NVP models with Gaussian, mixture of Gaussians, and resampled base distributions as well, having the same architecture as the real NVP models with 16 layers in the experiments above. This is a challenging task since, if samples from the model are too far away from the modes of the Boltzmann distribution, their gradients can be very high, making training unstable. However, it is important for the application of Boltzmann generators since the necessity of creating a dataset through other expensive sampling procedures diminishes their ability to reduce the overall computational time needed for sampling. Details of the model architectures and the training procedure are given in Appendix G.2. Although it involves rejection sampling, training the flow models with the resampled base distribution only took 15% longer than the baseline models. As can be seen in Table 15, the real NVP model with a resampled base outperforms those with a Gaussian and a mixture of Gaussians; however, they are still inferior to flows trained via ML.

5 DISCUSSION AND RELATED WORK

The main challenge we tackle in this work, i.e. that normalizing flows struggle to model distributions with supports having a complicated topological structure due to their invertible nature, has been addressed in several articles. Cornish et al. (2020) introduced a new set of variables for each flow layer, called continuous indices, which they used as additional input to the flow maps. Thereby, they relaxed the bijectivity of the transformation, leading to a better model performance. Huang et al. (2020) augmented the dataset by auxiliary dimensions before applying their normalizing flow model. Although the topological constraints are still present in the augmented space, the marginal distribution of interest can be arbitrarily complex. Wu et al. (2020); Nielsen et al. (2020) suggested adding sampling layers to the model. Hence, the topology of the support can be changed through the sampling process. Nielsen et al. (2020) also introduced surjective layers, which do not suffer from topological constraints and essentially combine VAEs with flow-based models. These approaches sacrifice the invertibility of the flow map, which has several disadvantages. First of all, the model is no longer a perfect autoencoder, i.e. the original datapoint cannot be fully recovered from its latent representation. Second, if the layers of the flow-based model are bijective, significant memory savings are possible (Gomez et al., 2017). Usually, when training layered models such as neural networks, the activations of each layer need to be stored in the forward pass because they are needed for gradient computation in the backward pass. However, if the layers are invertible, the activations of the forward pass can be recomputed in the backward pass by applying the inverse of the layer to the activations of the previous layers. Thereby, models can be made basically infinitely deep with a fixed memory budget. Thirdly, exact evaluation of the likelihood is no longer possible. To train the models, a bound needs to be derived, which is optimized instead of the actual likelihood. Our model does not make this sacrifice since only the base distribution is altered, but the transformation of the normalizing flow model is still invertible. On the other side, the base distribution itself cannot be evaluated exactly because its normalization constant is unknown. It can be estimated via Monte Carlo sampling, but its logarithm, appearing in the LL of the model, is biased. However, as discussed in Section 3.1, for a large truncation parameter T we are only off by a constant, so e.g. importance sampling could be done without a bias. Moreover, drawing samples from our model is less efficient as many samples from the proposal might get rejected before finally one is accepted and propagated through the flow.

An autoregressive base distribution was introduced by Bhattacharyya et al. (2020). While they only considered image generation, their entire model, i.e. including the base distribution, is tractable in contrast to ours. However, the computational cost of their models scales with the square root of the number of pixels, while ours is constant.
Izmailov et al. (2020); Ardizzone et al. (2020); Hagemann and Neumayer (2021) explored normalizing flows with a multimodal base distribution, in their case a mixture of Gaussians. However, their intention was to model data with multiple classes, thereby performing classification and solving inverse problems. Our model allows us to describe data with multiple classes as well through a conditional distribution, similar to the work of Dinh et al. (2017); Kingma and Dhariwal (2018), but it is also able to describe the complicated topological structure of the distribution of each class.

Bauer and Mnih (2019) used LARS successfully to create more expressive priors for VAEs, thereby boosting their performance. They demonstrated that the resampled prior can be learned jointly with the encoder and decoder by maximizing the evidence lower bound. In contrast, we showed that a resampled base distribution can be jointly trained with a normalizing flow transformation using both the LL and the KL divergence as an objective. For the latter we derived an expression of the gradient with reduced variance inspired by the work of Grover et al. (2018). Furthermore, Bauer and Mnih (2019) reported that they tried to fully factorize their resampled prior, which would allow them to scale to higher dimensional problems, but they were not able to beat the baseline of a VAE with a factorized Gaussian prior. We were successful by not fully factorizing our resampled base distribution, but defining factors for groups of variables. Moreover, combining LARS with the multiscale architecture of Dinh et al. (2017) and using a factorization similar to (Ma et al., 2019) allowed us to scale up our base distribution even further. The largest base distribution in our work, used in Glow to model the CIFAR-10 dataset, has 3072 dimensions, while the largest prior of Bauer and Mnih (2019) only had 100.

6 CONCLUSION

In this work, we introduced a base distribution for normalizing flows based on learned rejection sampling. We derived how it can be trained jointly with the flow layers by maximizing the expected LL or minimizing the KL divergence. This base distribution can assimilate the complex topological structure of a target and, thereby, overcome a structural weakness of normalizing flows. By applying our procedure to 2D distributions, tabular data, images, and Boltzmann distributions, we demonstrated that resampling the base distribution can improve their performance qualitatively and quantitatively.

Acknowledgements

We thank Matthias Bauer, Richard Turner, Andrew Campbell, Austin Tripp, and David Liu for the helpful discussions. José Miguel Hernández-Lobato acknowledges support from a Turing AI Fellowship under grant EP/V023756/1. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B; and by the Machine Learning Cluster of Excellence, EXC number 2064/1 - Project number 390727645.

References

Ardizzone, L., Mackowiak, R., Rother, C., and Köthe, U. (2020). Training Normalizing Flows with the Information Bottleneck for Competitive Generative Classification. In Advances in Neural Information Processing Systems 33.

Bauer, M. and Mnih, A. (2019). Resampled Priors for Variational Autoencoders. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 66–75.

Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D., and Jacobsen, J.-H. (2019). Invertible Residual Networks. In Proceedings of the 36th International Conference on Machine Learning, pages 573–582.

Behrmann, J., Vicol, P., Wang, K.-C., Grosse, R., and Jacobsen, J.-H. (2021). Understanding and Mitigating Exploding Inverses in Invertible Neural Networks. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, pages 1792–1800. PMLR.

Bhattacharyya, A., Mahajan, S., Fritz, M., Schiele, B., and Roth, S. (2020). Normalizing Flows With Multi-Scale Autoregressive Priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8415–8424.

Campbell, A., Chen, W., Stimper, V., Hernandez-Lobato, J. M., and Zhang, Y. (2021). A Gradient Based Strategy for Hamiltonian Monte Carlo Hyperparameter Optimization. In Proceedings of the 38th International Conference on Machine Learning, pages 1238–1248. PMLR.

Chen, R. T. Q., Behrmann, J., Duvenaud, D., and Jacobsen, J.-H. (2019). Residual Flows for Invertible Generative Modeling. In Advances in Neural Information Processing Systems, volume 32.

Chen, X., Mishra, N., Rohaninejad, M., and Abbeel, P. (2018). PixelSNAIL: An Improved Autoregressive Generative Model. In Proceedings of the 35th International Conference on Machine Learning.

Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers. arXiv:1904.10509 [cs, stat].

Cornish, R., Caterini, A. L., Deligiannidis, G., and Doucet, A. (2020). Relaxing Bijectivity Constraints with Continuously Indexed Normalising Flows. In Proceedings of the 37th International Conference on Machine Learning.
Dheeru, D. and Taniskidou, E. K. (2022). UCI machine learning repository. http://archive.ics.uci.edu/ml.

Dinh, L., Krueger, D., and Bengio, Y. (2015). NICE: Non-linear Independent Components Estimation. In 3rd International Conference on Learning Representations, Workshop Track Proceedings.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using Real NVP. In International Conference on Learning Representations.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. (2019). Neural Spline Flows. In Advances in Neural Information Processing Systems, volume 32.

Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. (2017). The Reversible Residual Network: Backpropagation Without Storing Activations. In Advances in Neural Information Processing Systems.

Grcić, M., Grubišić, I., and Šegvić, S. (2021). Densely connected normalizing flows. In Advances in Neural Information Processing Systems 34.

Grover, A., Gummadi, R., Lazaro-Gredilla, M., Schuurmans, D., and Ermon, S. (2018). Variational Rejection Sampling. In International Conference on Artificial Intelligence and Statistics, volume 84, pages 823–832.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. (2017). Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems, volume 30, pages 5767–5777.

Hagemann, P. and Neumayer, S. (2021). Stabilizing invertible neural networks using mixture models. Inverse Problems, 37(8):085002.

Hernández-Lobato, J. M., Li, Y., Rowland, M., Hernández-Lobato, D., Bui, T., and Turner, R. E. (2016). Black-box alpha-divergence Minimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 1511–1520.

Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. (2019). Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design. In Proceedings of the 36th International Conference on Machine Learning, pages 2722–2730.

Huang, C.-W., Dinh, L., and Courville, A. (2020). Augmented Normalizing Flows: Bridging the Gap Between Generative Flows and Latent Variable Models. arXiv:2002.07101 [cs, stat].

Izmailov, P., Kirichenko, P., Finzi, M., and Wilson, A. G. (2020). Semi-Supervised Learning with Normalizing Flows. In Proceedings of the 37th International Conference on Machine Learning, pages 4615–4630.

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. (2018). Averaging Weights Leads to Wider Optima and Better Generalization. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 876–885.

Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. (2020a). Training Generative Adversarial Networks with Limited Data. Advances in Neural Information Processing Systems, 33:12104–12114.

Karras, T., Laine, S., and Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020b). Analyzing and Improving the Image Quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119.

Kingma, D. P. and Ba, J. (2015). Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations.

Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative Flow with Invertible 1x1 Convolutions. In Advances in Neural Information Processing Systems.

Kingma, D. P., Salimans, T., Poole, B., and Ho, J. (2021). Variational Diffusion Models. arXiv:2107.00630 [cs, stat].

Kingma, D. P. and Welling, M. (2014). Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations.

Köhler, J., Krämer, A., and Noé, F. (2021). Smooth Normalizing Flows. In Advances in Neural Information Processing Systems 34.

Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto.

Leimkuhler, B. and Matthews, C. (2015). Molecular Dynamics With Deterministic and Stochastic Numerical Methods. Number 39 in Interdisciplinary Applied Mathematics. Springer.

Ma, X., Kong, X., Zhang, S., and Hovy, E. (2019). MaCow: Masked convolutional generative flow. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Nielsen, D., Jaini, P., Hoogeboom, E., Winther, O., and Welling, M. (2020). SurVAE Flows: Surjections to Bridge the Gap between VAEs and Flows. In Advances in Neural Information Processing Systems, volume 33.

Noé, F., Olsson, S., Köhler, J., and Wu, H. (2019). Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457).

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. (2021). Normalizing Flows for Probabilistic Modeling and Inference. Journal of Machine Learning Research, 22(57):1–64.

Papamakarios, G., Pavlakou, T., and Murray, I. (2017). Masked Autoregressive Flow for Density Estimation. In Advances in Neural Information Processing Systems, volume 30, pages 2338–2347.

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D. (2018). Image Transformer. In International Conference on Machine Learning, pages 4055–4064. PMLR.

Polyak, B. (1990). New stochastic approximation type procedures. Avtomatica i Telemekhanika, 7:98–107.

Rezende, D. J. and Mohamed, S. (2015). Variational Inference with Normalizing Flows. In Proceedings of the 32nd International Conference on Machine Learning.

Runde, V. (2005). A Taste of Topology. Universitext. Springer-Verlag, New York.

Ruppert, D. (1988). Efficient estimators from a slowly converging Robbins-Monro process. Technical report.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, pages 2256–2265. PMLR.

Tabak, E. G. and Turner, C. V. (2013). A Family of Nonparametric Density Estimation Algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164.

Tabak, E. G. and Vanden-Eijnden, E. (2010). Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233.

van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., Lockhart, E., Cobo, L. C., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D., and Hassabis, D. (2018). Parallel WaveNet: Fast High-Fidelity Speech Synthesis. In Proceedings of the 35th International Conference on Machine Learning.

Wirnsberger, P., Ballard, A. J., Papamakarios, G., Abercrombie, S., Racanière, S., Pritzel, A., Jimenez Rezende, D., and Blundell, C. (2020). Targeted free energy estimation via learned mappings. The Journal of Chemical Physics, 153(14):144112.

Wu, H., Köhler, J., and Noé, F. (2020). Stochastic normalizing flows. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 5933–5944. Curran Associates, Inc.
Supplementary Material:
Resampling Base Distributions of Normalizing Flows

A LEARNING ALGORITHMS

A.1 Estimating the normalization constant

To stabilize training, we use the exponential moving average to estimate the value of the normalization constant (Bauer and Mnih, 2019). In practice, this means that if Z_i is the current Monte Carlo estimate of the normalization constant, the exponential moving average ⟨Z⟩_i is computed by

⟨Z⟩_1 = Z_1,    (12)
⟨Z⟩_i = (1 − ϵ)⟨Z⟩_{i−1} + ϵ Z_i   for i > 1,    (13)

where ϵ is the decay parameter, which we set to 0.05 throughout this article. However, the gradients are estimated only with the current Monte Carlo estimate Z_i, because otherwise backpropagation through the entire history of ⟨Z⟩_i would be necessary, which would be computationally expensive and memory demanding.
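One possible realization of this update, shown below as a sketch rather than the authors' code, uses the moving average as the forward value while letting gradients flow only through the current estimate Z_i.

    import torch

    eps = 0.05          # decay parameter used throughout this article
    Z_avg = None        # running value of <Z>_i, kept without gradient history

    def smoothed_Z(Z_i):
        # Z_i: current Monte Carlo estimate of Z, eq. (7), with gradients attached.
        global Z_avg
        if Z_avg is None:
            Z_avg = Z_i.detach()                              # (12)
        else:
            Z_avg = (1 - eps) * Z_avg + eps * Z_i.detach()    # (13)
        # Forward value: <Z>_i; backward pass: gradient of Z_i only.
        return Z_i + (Z_avg - Z_i).detach()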

A.2 Gradient estimators of the Kullback-Leibler divergence

We repeat Theorem 1 as stated in the main text and supplement its proof.
Theorem 1. Let pϕ (z) be the base distribution of a normalizing flow, having parameters ϕ, and Fθ be the
respective invertible mapping, depending on its parameters θ, such that the density of the model is

log (p(x)) = log (pϕ (z)) − log |det JFθ (z)| , (14)

with x = Fθ(z). Then, the gradients of the KL divergence with respect to the parameters are given by

∇ϕ KLD(θ, ϕ) = Cov_{pϕ(z)}( log(pϕ(z)) − log |det JFθ(z)| − log p̂∗(Fθ(z)), ∇ϕ log pϕ(z) ),    (15)

∇θ KLD(θ, ϕ) = −E_{pϕ(z)}[ ∇θ( log |det JFθ(z)| + log p̂∗(Fθ(z)) ) ].    (16)

Proof. The KL divergence is defined as

KLD(θ, ϕ) := Ep(x) [log p(x)] − Ep(x) [log p∗ (x)] . (17)

By plugging in (14) into (17) we obtain

KLD(θ, ϕ) = Epϕ (z) [log pϕ (z) − log |det JFθ (z)| − log p̂∗ (Fθ (z))] . (18)

Computing the gradient of (18) with respect to θ is straightforward:

∇θ KLD(θ, ϕ) = ∇θ E_{pϕ(z)}[ log pϕ(z) − log |det JFθ(z)| − log p̂∗(Fθ(z)) ]
             = E_{pϕ(z)}[ ∇θ( log pϕ(z) − log |det JFθ(z)| − log p̂∗(Fθ(z)) ) ]    (19)
             = −E_{pϕ(z)}[ ∇θ( log |det JFθ(z)| + log p̂∗(Fθ(z)) ) ].

To get the gradient with respect to ϕ, we decompose (18) into two parts and consider their gradients separately. For the first part,

∇ϕ E_{pϕ(z)}[log pϕ(z)] = ∇ϕ ∫ pϕ(z) log pϕ(z) dz
                        = ∫ ∇ϕ( pϕ(z) log pϕ(z) ) dz
                        = ∫ ( ∇ϕ pϕ(z) + log pϕ(z) ∇ϕ pϕ(z) ) dz    (20)
                        = ∇ϕ ∫ pϕ(z) dz + ∫ pϕ(z) log pϕ(z) ∇ϕ log pϕ(z) dz
                        = E_{pϕ(z)}[ log pϕ(z) ∇ϕ log pϕ(z) ],

where the first term vanishes because ∫ pϕ(z) dz = 1. For the second part, let l_d := log |det JFθ(z)| + log p̂∗(Fθ(z)). Then

∇ϕ E_{pϕ(z)}[l_d] = ∇ϕ ∫ l_d pϕ(z) dz
                  = ∫ l_d ∇ϕ pϕ(z) dz    (21)
                  = ∫ l_d pϕ(z) ∇ϕ log pϕ(z) dz
                  = E_{pϕ(z)}[ l_d ∇ϕ log pϕ(z) ].

Using these two expressions, we obtain

∇ϕ KLD(θ, ϕ) = E_{pϕ(z)}[ ( log pϕ(z) − log |det JFθ(z)| − log p̂∗(Fθ(z)) ) ∇ϕ log pϕ(z) ]    (22)
             = Cov_{pϕ(z)}( log pϕ(z) − log |det JFθ(z)| − log p̂∗(Fθ(z)), ∇ϕ log pϕ(z) ).    (23)

When concluding (23) from (22) we used the well known identity

E_{pϕ(z)}[∇ϕ log pϕ(z)] = ∫ pϕ(z) ∇ϕ log pϕ(z) dz = ∫ ∇ϕ pϕ(z) dz = ∇ϕ ∫ pϕ(z) dz = 0,    (24)

which holds since ∫ pϕ(z) dz = 1.

B MULTISCALE ARCHITECTURE

As already mentioned in the main paper, Dinh et al. (2017) introduced the multiscale architecture for normalizing
flows to deal with high dimensional data such as images. As sketched in Figure 5, initially, the entire input x is
transformed by several flow layers. The result is split up into two parts, h_1^(1) and h_1^(2). Dinh et al. (2017) did this by first squeezing the image, i.e. reducing the height and width of the image by a factor 2 and adding the surplus pixels as additional channels, and then splitting the resulting tensor along the channel dimension. h_1^(1) is immediately factored out in the density, while h_1^(2) is further transformed by F_2. The process is then repeated until a desired depth is reached. The output of the last map, in Figure 5 it is F_4, is not split, but directly passed to its base distribution. The full density for a multiscale architecture with n levels is given by

p(x) = ∏_{i=1}^{n} |det(JFi(h_{i−1}))| p(z_i),

where we set h_0 = x.
[Figure 5: diagram of the four-level multiscale architecture. The input x is transformed by F_1; each F_i splits its output into h_i^(1), which is factored out as z_i, and h_i^(2), which is passed to F_{i+1}; the last map F_4 outputs z_4 directly.]

Figure 5: Multiscale architecture with four levels as introduced in (Dinh et al., 2017). First, the entire input x is
transformed by F1 . The result is then split up into two parts of which one of them is factored out immediately
and the other one is further processed by F2 . This process is repeated a few times until the desired depth is
reached. The input is drawn in blue, intermediate results are red, and the components of the final variable z are
yellow.

C LEARNED ACCEPTANCE PROBABILITY

C.1 Choosing the architecture

In order to get an impression of what the architecture of the neural network defining the acceptance probability a for LARS should look like, we did an ablation experiment on the Power UCI dataset. We left the flow architecture of a real NVP model constant but changed the number of hidden layers and units of the neural network representing a. The baseline model with a Gaussian base distribution achieved 0.330 ± 0.003 on the test set. When changing the number of hidden layers we used 512 hidden units, and when changing the number of hidden units we used 3 hidden layers.

Table 6: LL on the test set for different numbers of hidden layers of a, while leaving the number of hidden units constant at 512.

Hidden layers 1 3 5 7 9
LL 0.37 0.53 0.58 0.62 0.63

Table 7: LL on the test set for different numbers of hidden units of a, while leaving the number of hidden layers constant at 3.

Hidden units 32 128 512 2048 8192


LL 0.39 0.45 0.53 0.61 0.61

We see that both the number of hidden layers and the number of hidden units are important. The LL increases as we add more of either, with diminishing returns. However, note that especially increasing the number of hidden units increases the parameter count as well as the computational cost; hence, an application specific trade-off needs to be found.

C.2 Tuning the rejection rate

As discussed in the main text, the rejection rate of LARS can be controlled through the truncation parameter T. It sets a limit on how many subsequent proposals can be rejected in order to generate one sample.
Figure 6: LL on the test set and Z with respect to the hyperparameter λZ introduced in (26).

However, it does not tell us anything about the actual rejection rate determining the sampling speed, which might be lower. The expected number of accepted samples per sample from the proposal π is given by

E_{π(z)}[a(z)] = ∫ a(z) π(z) dz = Z,    (25)

which is equivalent to the normalization constant Z. Hence, if we increase Z we can decrease the rejection rate. We can simply do so by including it in our optimization, e.g. when doing ML we can instead minimize the loss

L = −Ep∗(x)[log p(x)] − λ_Z Z,    (26)

where λ_Z ∈ R+ is a positive hyperparameter.
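A minimal sketch of the loss (26), assuming a model that exposes a log-probability function and a Monte Carlo estimator of Z (the function names are placeholders, not the paper's API):

    def regularized_ml_loss(x, flow_log_prob, estimate_Z, lambda_Z=0.1):
        # Larger lambda_Z pushes Z towards one, i.e. fewer rejections,
        # at the price of some log-likelihood.
        return -flow_log_prob(x).mean() - lambda_Z * estimate_Z()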


In order to test this procedure, we trained 30 real NVP models with a resampled base distribution with different
values of λZ on the UCI Power dataset. The neural network representing the acceptance probability a had 3
hidden layers with 512 hidden units and we set T = 20. In Figure 6 we show the LL of the models on the test set
as well as Z depending on the hyperparameter λZ . We see that by increasing λZ we can trade off performance
in terms of LL with the expected number of LARS samples per sample from the proposal. When Z approaches
one, i.e. nearly all samples from the proposal get accepted, the LL drops to the value achieved by the flow with
a Gaussian base distribution being 0.330 ± 0.003, see Table 9.

D 2D DISTRIBUTIONS

The densities of the distributions used as sample targets in Section 4.1 are given in Table 8.

Table 8: Logarithm of the unnormalized densities of the target distributions used in Section 4.1.

Distribution           Unnormalized log density

Dual Moon              −(∥z∥ − 1)² / 0.08 − (|z_1| − 2)² / 0.18 + log( 1 + e^{−4 z_1 / 0.09} )

Circle of Gaussians    log Σ_{i=1}^{8} [ 9 / (2π(2 − √2)) · exp( −9 [ (z_1 − 2 sin(2πi/8))² + (z_2 − 2 cos(2πi/8))² ] / (4 − 2√2) ) ]

Two Rings              log Σ_{i=1}^{2} [ √(32/π) · e^{−32 (∥z∥ − i − 1)²} ]
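As an example, the "Two Rings" target can be implemented as follows; this is our own sketch based on the reconstructed expression above, so the constants should be checked against the released code.

    import math
    import torch

    def two_rings_log_density(z):
        # z: (batch, 2); mixture of two ring-shaped Gaussians at radii 2 and 3.
        r = torch.linalg.norm(z, dim=-1, keepdim=True)            # (batch, 1)
        i = torch.tensor([1.0, 2.0])                              # ring index
        comps = math.sqrt(32.0 / math.pi) * torch.exp(-32.0 * (r - i - 1.0) ** 2)
        return torch.log(comps.sum(dim=-1))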

All models approximating a 2D distribution use for each layer a fully connected network having 2 hidden layers with 32 hidden units each as parameter map or residual learning block, respectively.
Figure 7: Visualization of the densities when approximating three 2D distributions (dual moon, circle of Gaussians, two rings) with complex topological structure. Columns show the target, the real NVP model with a Gaussian base, and the real NVP model with a resampled base. The models were trained using the KL divergence.
Figure 8: Visualization of the learned base distributions of the real NVP flow models shown in Figure 3 (columns: Gaussian base, Gaussian mixture base, resampled base, and the acceptance probability of the resampled base; rows: dual moon, circle of Gaussians, two rings).
Figure 9: Visualization of the residual flow densities when approximating three 2D distributions with complex topological structure (rows: dual moon, circle of Gaussians, two rings; columns: target, residual flow with Gaussian base, Gaussian mixture base, and resampled base). The models were trained using ML learning and the corresponding base distributions are shown in Figure 10.
Figure 10: Visualization of the learned base distributions of the residual flow models shown in Figure 9 (columns: Gaussian base, Gaussian mixture base, resampled base, and the acceptance probability of the resampled base; rows: dual moon, circle of Gaussians, two rings).
The mixture of Gaussians base distributions are initialized by uniformly sampling the means in the hypercube
[−2.5, 2.5]^D and setting the covariance matrices to 0.5 · 1_D, where 1_D is the D-dimensional identity matrix.
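
A minimal sketch of such an initialization, assuming a diagonal-covariance Gaussian mixture implemented with torch.distributions, is given below; the class and its parameterization are illustrative, not the exact implementation used here.

```python
import math
import torch
import torch.nn as nn

class GaussianMixtureBase(nn.Module):
    # Diagonal Gaussian mixture base with the initialization described above (a sketch).
    def __init__(self, n_modes, dim):
        super().__init__()
        # Means drawn uniformly from the hypercube [-2.5, 2.5]^D.
        self.means = nn.Parameter(5.0 * torch.rand(n_modes, dim) - 2.5)
        # Diagonal variances initialized to 0.5, log-parameterized to stay positive.
        self.log_vars = nn.Parameter(math.log(0.5) * torch.ones(n_modes, dim))
        # Uniform mixture weights at initialization.
        self.mixture_logits = nn.Parameter(torch.zeros(n_modes))

    def _dist(self):
        mix = torch.distributions.Categorical(logits=self.mixture_logits)
        comp = torch.distributions.Independent(
            torch.distributions.Normal(self.means, torch.exp(0.5 * self.log_vars)), 1
        )
        return torch.distributions.MixtureSameFamily(mix, comp)

    def log_prob(self, z):
        return self._dist().log_prob(z)

    def sample(self, num_samples):
        return self._dist().sample((num_samples,))
```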
The models are trained on a computer with 6 Intel i5-9400F CPUs and an Nvidia GeForce RTX 2070 graphics
card. The Adam optimizer with a learning rate of 10^-3 is used. Training is done for 2 · 10^4 iterations with a
batch size of 1024.

E TABULAR DATA

In addition to the NSF models, we also trained real NVP models with a Gaussian, a mixture of Gaussians, and
a resampled base distribution on the four UCI datasets. The results are shown in Table 9.

Table 9: LL of real NVP models with different base distributions on the test sets of the respective datasets. The
values are averaged over 3 runs each and the standard error is given as a measure of uncertainty. Values that
are highest within the confidence intervals are marked in bold.

Base distribution Power Gas Hepmass Miniboone


Gaussian 0.330 ± 0.003 10.1 ± 0.1 −19.5 ± 0.1 −11.65 ± 0.05
Mixture 0.341 ± 0.001 9.9 ± 0.2 −19.5 ± 0.1 −11.49 ± 0.04
Resampled 0.560 ± 0.006 12.8 ± 0.1 −18.4 ± 0.1 −11.48 ± 0.01

Table 10: Details about datasets from the UCI machine learning repository, the architecture of the NSF models
as well as the resampled base distribution, and the training procedure.

Power Gas Hepmass Miniboone


Dimension 6 8 21 43
Train data points 1.6 · 10^6 8.5 · 10^5 3.2 · 10^5 3.0 · 10^4
Flow layers 10 10 10 10
Hidden layers flow maps 2 2 2 1
Hidden units flow maps 256 128 256 64
Hidden layers a 7 9 4 2
Hidden units a 512 512 512 128
Truncation parameter T 100 50 40 40
Dropout rate 0 0.1 0.3 0.3
Batch size 512 512 256 64
Learning rate 3 · 10^-4 4 · 10^-4 4 · 10^-4 3 · 10^-4

Table 11: Details about the architecture of the real NVP models used as well as the resampled base distribution,
and the training procedure.

Power Gas Hepmass Miniboone


Flow layers 16 16 16 16
Hidden layers flow maps 2 2 2 2
Hidden units flow maps 128 128 64 32
Hidden layers a 3 3 3 3
Hidden units a 512 512 512 256
Truncation parameter T 100 100 100 100
Dropout rate 0 0.1 0.2 0.2
Batch size 512 512 256 128
Learning rate 5 · 10^-4 5 · 10^-4 3 · 10^-4 3 · 10^-4

In all experiments regarding the UCI datasets, we use dropout during training both in the neural networks defining
the flow maps and in the acceptance probability function a of the resampled base distribution. Adamax is used as
an optimizer (Kingma and Ba, 2015). The experiments are run on machines with 36 Intel Xeon Platinum 9242
CPUs and 128 GB RAM. Further details on the datasets, the flow architecture, and the training procedure are
given in Table 10 and Table 11.

F IMAGE GENERATION
The parameter maps of the Glow models are convolutional neural networks (CNNs) with 3 layers, the first and
the last having a kernel size of 3 × 3 and the middle one a kernel size of 1 × 1. The number of channels of the
middle layer is 512 and those of the other layers are determined by the respective input and output. This is the same architecture
as used in (Kingma and Dhariwal, 2018).
To ensure that each factor of the base distribution has no more than 100 dimensions, we apply a squeeze
operation to the feature map of the first level before passing it to the base distribution. Therefore, each channel
has a maximum size of 8 × 8 = 64. A CNN with 4 layers, each having 32 channels and a kernel size of 3 × 3,
followed by a fully connected output layer, is used as acceptance function at each level. The convolutions of this CNN
are strided with a stride of 2 until the image size is 4 × 4. The normalization constants are updated with 2048
samples per iteration, and before evaluating our models we estimated them with 10^10 samples.
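
Since 10^10 samples do not fit into memory at once, such an estimate can be accumulated over batches; the sketch below assumes hypothetical `acceptance_fn` and `proposal` objects and is not the exact evaluation code.

```python
import torch

@torch.no_grad()
def estimate_Z(acceptance_fn, proposal, n_samples=10**10, batch_size=2**20):
    # Accumulate the Monte Carlo estimate Z = E_pi[a(z)] over many batches,
    # since drawing all samples at once would not fit into memory.
    total, n_done = 0.0, 0
    while n_done < n_samples:
        n = min(batch_size, n_samples - n_done)
        z = proposal.sample((n,))
        total += acceptance_fn(z).sum().item()
        n_done += n
    return total / n_samples
```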
Each model is trained for 10^6 iterations with the Adam optimizer and a learning rate of 10^-3. The learning
rate is warmed up linearly over 10^3 iterations and the batch size is 512. The models with 8, 16, and 32 layers
per level are trained in a distributed fashion on 1, 2, and 4 Nvidia Quadro RTX 5000 graphics cards, respectively. We
apply Polyak-Ruppert weight averaging (Polyak, 1990; Ruppert, 1988) with an update rate of 10^-3, i.e. an
exponential moving average of the model weights is computed, in order to improve the generalization performance
on the test set (Izmailov et al., 2018).
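
The weight averaging can be maintained alongside training with a few lines of PyTorch; the following is an illustrative sketch, not the exact implementation used in our code.

```python
import copy
import torch

def make_ema_model(model):
    # Create a frozen copy of the model that will hold the averaged weights.
    ema_model = copy.deepcopy(model)
    for p in ema_model.parameters():
        p.requires_grad_(False)
    return ema_model

@torch.no_grad()
def update_ema(ema_model, model, rate=1e-3):
    # Polyak-Ruppert style update: ema <- (1 - rate) * ema + rate * current weights.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(1.0 - rate).add_(p, alpha=rate)
```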

Table 12: Percentage increase in training time and model size when using a resampled instead of a Gaussian
base distribution for the models trained in Section 4.3.

Layers per level Training time Model size


8 4.7% 1.5%
16 15% 0.75%
32 9.1% 0.38%

G BOLTZMANN GENERATORS
G.1 Coordinate transformation


Figure 11: Illustration of molecular coordinates. The state of the molecule can be described through the Cartesian
coordinates, i.e. x, y, and z, of each of the four atoms A, B, C, and D. Alternatively, internal coordinates, i.e.
bond lengths, bond angles, and dihedral angles, can be used. Here, the bond length b is the distance between
atoms A and B, the bond angle φ is the angle between the bonds connecting B and C as well as C and D, and
the dihedral angle ψ is the angle between the planes spanned by A, B, and C as well as B, C, and D. We use a
combination of Cartesian and internal coordinates.

To simplify the approximation of Boltzmann distributions of complex molecules, a coordinate transformation was
introduced (Noé et al., 2019). Some of the Cartesian coordinates are mapped to their respective internal co-
ordinates, i.e. bond lengths, bond angles, and dihedral angles, which are illustrated in Figure 11. The internal
coordinates are normalized, with mean and standard deviation calculated on the training dataset generated
through MD, but a suitable experimental dataset could be used as well. Principal component analysis is applied
to the remaining Cartesian coordinates. Subsequently, the weights of all but the last six principal components
are used as coordinates. Thereby, six degrees of freedom are eliminated, corresponding to the three translational
and three rotational coordinates which leave the Boltzmann distribution invariant.
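
For the Cartesian block, the PCA step could be sketched with scikit-learn as follows; `cartesian_train` is a hypothetical array of the flattened Cartesian coordinates from the MD training data, and the exact implementation may differ.

```python
from sklearn.decomposition import PCA

def fit_cartesian_pca(cartesian_train):
    # cartesian_train: array of shape (n_frames, n_cartesian_dims) holding the
    # Cartesian coordinates that are kept in the mixed representation.
    pca = PCA(n_components=cartesian_train.shape[1])
    pca.fit(cartesian_train)
    return pca

def to_pca_coordinates(pca, cartesian):
    # Keep the weights of all but the last six principal components, which removes
    # the global translational and rotational degrees of freedom.
    return pca.transform(cartesian)[:, :-6]
```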

G.2 Setup of the experiments

All real NVP models trained via ML have a neural network with 2 hidden layers and 64 hidden units as the
parameter map at each coupling layer. Between the coupling layers, we apply an invertible linear transformation
which is learned together with the other parameters of the flow, similar to the invertible 1x1 convolutions introduced in
(Kingma and Dhariwal, 2018). The acceptance function of the resampled base distribution is a fully connected
neural network with 2 hidden layers having 256 hidden units each. At each training iteration, the normalization constant
Z is updated with 512 samples from the Gaussian proposal. Before evaluating our models, we
estimated Z with 10^10 samples. The residual flow models have 8 layers, each being a 2-layer fully
connected neural network with 64 hidden units, and the acceptance function of their resampled base distribution has 3 hidden layers with 512
hidden units. All models are trained for 5 · 10^5 iterations with the Adam optimizer (Kingma and Ba, 2015) and
a batch size of 512. The learning rate is set to 10^-3 and decreased to 10^-4 after 2.5 · 10^5 iterations. We also
apply Polyak-Ruppert weight averaging (Polyak, 1990; Ruppert, 1988) with an update rate of 10^-2. Each model is
trained and evaluated on a server with 16 Intel Xeon E5-2698 CPUs and an Nvidia GTX 980 GPU.
The real NVP models trained by minimizing the KL divergence have the same architecture as those in the
previous experiment. However, to improve the stability of the training process, the models are trained with
double precision numbers on 32 Intel Xeon E5-2698 CPUs each. 10^5 iterations are done with the Adam optimizer
with a learning rate of 10^-4, which is exponentially decayed by a factor of 0.5 every 2.5 · 10^4 iterations, as
sketched below.
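
This schedule corresponds to a step decay; a self-contained PyTorch sketch with a dummy model and loss is given below. It is purely illustrative, the real objective is the reverse KL divergence and the real model is the flow.

```python
import torch
import torch.nn as nn

# Dummy stand-ins so the schedule can be run in isolation.
model = nn.Linear(60, 60)
def dummy_loss(m):
    return m(torch.randn(8, 60)).pow(2).mean()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate every 2.5 * 10^4 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25000, gamma=0.5)

for iteration in range(10**5):
    loss = dummy_loss(model)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```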
The KL divergences were computed by drawing 10^6 samples from the model and estimating the respective
integrals with histograms.

G.3 Further results

As an additional performance metric to compare the models, we compute the Ramachandran plots, i.e. 2D
histograms of two dihedral angles. These plots are frequently used to analyse how proteins fold locally and are
hence of high importance for many applications. Some Ramachandran plots are shown in Figure 13. We also
estimate the KL divergences between the ground truth Ramachandran plot obtained from the MD test set and the
plots of the models by performing numerical integration with the histograms, as sketched below. The results are
given in Table 13, Table 14, and Table 15.
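
A possible numpy sketch of this histogram-based estimate for a pair of dihedral angles in [−π, π] is shown below; the bin count is an illustrative choice, not necessarily the one used in our evaluation.

```python
import numpy as np

def histogram_kl(samples_p, samples_q, bins=64, eps=1e-10):
    # Estimate KL(p || q) between two 2D distributions from samples via histograms,
    # e.g. for Ramachandran plots of two dihedral angles in [-pi, pi].
    rng = [[-np.pi, np.pi], [-np.pi, np.pi]]
    p, _, _ = np.histogram2d(samples_p[:, 0], samples_p[:, 1], bins=bins, range=rng)
    q, _, _ = np.histogram2d(samples_q[:, 0], samples_q[:, 1], bins=bins, range=rng)
    # Convert counts to probability mass per bin and avoid division by zero.
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return np.sum(p * np.log(p / q))
```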
We also evaluated the stochastic normalizing flow model trained by Wu et al. (2020) via ML on our metrics.
The median KL divergence of the marginals is 2.3 · 10^-3 while the mean is 2.6 · 10^-2, which is almost an order
of magnitude higher than the results of the models with a resampled base distribution. However, the stochastic
normalizing flow models the Ramachandran plot very well, with a KL divergence of only 2.4 · 10^-1. Note that
these results have to be taken with a grain of salt, since Wu et al. (2020) used an augmented normalizing flow
with fewer layers than we did. We tried to include their stochastic layers in our models but found training to be
very unstable in this setting.

Table 13: KL divergence between the Ramachandran plots of the MD simulation, serving as a ground truth, and
those of the real NVP models trained via ML learning. It was estimated based on a histogram computed from 10^6 samples.

Base distribution Gaussian Mixture Gaussian Resampled


Number of layers 16 16 19 16
KL divergence 4.79 ± 0.73 10.8 ± 7.3 2.26 ± 0.27 3.00 ± 0.36

Figure 12: Marginal distributions of three dihedral angles of Alanine dipeptide. The ground truth was determined
with an MD simulation. The flow models are based on the residual flow architecture and were trained via ML
learning.

Table 14: Quantitative comparison of the residual flow models approximating the Boltzmann distribution of
Alanine dipeptide trained via ML learning. The LL is evaluated on a test set obtained with an MD simulation.
The KL divergences of the 60 marginals were computed and their mean and median are reported. Moreover,
the KL divergences of the Ramachandran plots are listed. All results are averages over 10 runs, the standard
error is given, and the highest LL as well as the lowest KL divergences are marked in bold.

Base distribution Gaussian Mixture Resampled


LL (×10^2) 1.8048 ± 0.0002 1.8061 ± 0.0002 1.8144 ± 0.0002
Mean KLD marginals (×10^-3) 6.16 ± 0.17 31.5 ± 1.8 3.49 ± 0.15
Median KLD marginals (×10^-4) 5.21 ± 0.12 14.2 ± 5.2 4.67 ± 0.05
KLD Ramachandran plot 8.1 ± 2.2 25.4 ± 10.2 4.4 ± 0.9

Table 15: Quantitative comparison of the real NVP models approximating the Boltzmann distribution of Alanine
dipeptide trained via the KL divergence. The LL is evaluated on a test set obtained with an MD simulation. The
KL divergences of the 60 marginals were computed and their mean and median are reported. Moreover,
the KL divergences of the Ramachandran plots are listed. All results are averages over 10 runs, the standard
error is given, and the highest LL as well as the lowest KL divergences are marked in bold.

Base distribution Gaussian Mixture Resampled


LL (×10^2) −2.78 ± 0.07 −2.70 ± 0.04 −1.84 ± 0.13
Mean KLD marginals (×10^-1) 2.91 ± 0.05 2.98 ± 0.02 2.84 ± 0.07
Median KLD marginals (×10^-3) 4.75 ± 0.04 4.77 ± 0.03 4.66 ± 0.05
KLD Ramachandran plot 7.63 ± 0.18 16.6 ± 8.4 6.92 ± 0.37

(a) Real NVP, Gaussian base, 16 layers (b) Real NVP, Gaussian mixture base, 16 layers

(c) Real NVP, Gaussian base, 19 layers (d) Real NVP, Resampled base, 16 layers

(e) Ground truth (MD simulation)

Figure 13: Ramachandran plots of Alanine dipeptide. The flow models were trained via ML learning.
