
Training Energy-Based Normalizing Flow with

Score-Matching Objectives

Chen-Hao Chao¹, Wei-Fang Sun¹,², Yen-Chang Hsu³, Zsolt Kira⁴, and Chun-Yi Lee¹∗
¹ Elsa Lab, National Tsing Hua University, Hsinchu City, Taiwan
² NVIDIA AI Technology Center, NVIDIA Corporation, Santa Clara, CA, USA
³ Samsung Research America, Mountain View, CA, USA
⁴ Georgia Institute of Technology, Atlanta, GA, USA

Abstract
In this paper, we establish a connection between the parameterization of flow-based
and energy-based generative models, and present a new flow-based modeling ap-
proach called energy-based normalizing flow (EBFlow). We demonstrate that by
optimizing EBFlow with score-matching objectives, the computation of Jacobian
determinants for linear transformations can be entirely bypassed. This feature
enables the use of arbitrary linear layers in the construction of flow-based models
without increasing the computational time complexity of each training iteration
from O(D²L) to O(D³L) for an L-layered model that accepts D-dimensional inputs.
This makes the training of EBFlow more efficient than the commonly-adopted
maximum likelihood training method. In addition to the reduction in runtime, we
enhance the training stability and empirical performance of EBFlow through a
number of techniques developed based on our analysis of the score-matching meth-
ods. The experimental results demonstrate that our approach achieves a significant
speedup compared to maximum likelihood estimation while outperforming prior
methods by a noticeable margin in terms of negative log-likelihood (NLL).

1 Introduction
Parameter estimation for probability density functions (pdf) has been a major interest in the research
fields of machine learning and statistics. Given a D-dimensional random data vector x ∈ RD , the goal
of such a task is to estimate the true pdf px (·) of x with a function p(· ; θ) parameterized by θ. In the
studies of unsupervised learning, flow-based modeling methods (e.g., [1–4]) are commonly-adopted
for estimating px due to their expressiveness and broad applicability in generative tasks.
Flow-based models represent p(· ; θ) using a sequence of invertible transformations based on the
change of variable theorem, through which the intermediate unnormalized densities are re-normalized
by multiplying the Jacobian determinant associated with each transformation. In maximum likelihood
estimation, however, the explicit computation of the normalizing term may pose computational
challenges for model architectures that use linear transformations, such as convolutions [4, 5] and
fully-connected layers [6, 7]. To address this issue, several methods have been proposed in the
recent literature, which includes constructing linear transformations with special structures [8–12]
and exploiting special optimization processes [7]. Despite their success in reducing the training
complexity, these methods either require additional constraints on the linear transformations or rely
on biased estimation of the gradients of the objective.
Motivated by the limitations of the previous studies, this paper introduces an approach that reinterprets
flow-based models as energy-based models [13], and leverages score-matching methods [14–17] to

Corresponding author. Email: [email protected]

37th Conference on Neural Information Processing Systems (NeurIPS 2023).


optimize p(· ; θ) according to the Fisher divergence [14, 18] between px (·) and p(· ; θ). The proposed
method avoids the computation of the Jacobian determinants of linear layers during training, and
reduces the asymptotic computational complexity of each training iteration from O(D³L) to O(D²L)
for an L-layered model. Our experimental results demonstrate that this approach significantly
improves the training efficiency as compared to maximum likelihood training. In addition, we
investigate a theoretical property of Fisher divergence with respect to latent variables, and propose
a Match-after-Preprocessing (MaP) technique to enhance the training stability of score-matching
methods. Finally, our comparison on the MNIST dataset [19] reveals that the proposed method
exhibits significant improvements in comparison to our baseline methods presented in [17] and [7] in
terms of negative log-likelihood (NLL).

2 Background
In this section, we discuss the parameterization of probability density functions in flow-based and
energy-based modeling methods, and review a number of commonly-used training methods for them.

2.1 Flow-based Models

Flow-based models describe p_x(·) using a prior distribution p_u(·) of a latent variable u ∈ R^D and
an invertible function g = g_L ◦ · · · ◦ g_1, where g_i(· ; θ) : R^D → R^D, ∀i ∈ {1, · · · , L}, and g is
usually modeled as a neural network with L layers. Based on the change of variable theorem and the
multiplicative property of the determinant operation det(·), p(· ; θ) can be described as follows:

p(x; θ) = p_u(g(x; θ)) |det(J_g(x; θ))| = p_u(g(x; θ)) ∏_{i=1}^{L} |det(J_{g_i}(x_{i-1}; θ))|,    (1)

where x_i = g_i ◦ · · · ◦ g_1(x; θ), x_0 = x, J_g(x; θ) = ∂/∂x g(x; θ) represents the Jacobian of g with
respect to x, and J_{g_i}(x_{i-1}; θ) = ∂/∂x_{i-1} g_i(x_{i-1}; θ) represents the Jacobian of the i-th layer of g
with respect to x_{i-1}. This work concentrates on model architectures employing linear flows [20] to
design the function g. These model architectures primarily utilize linear transformations to extract
crucial feature representations, while also accommodating non-linear transformations that enable
efficient Jacobian determinant computation. Specifically, let S_l be the set of linear transformations
in g, and S_n = {g_i | i ∈ {1, · · · , L}} \ S_l be the set of non-linear transformations. The general
assumption of these model architectures is that ∏_{i=1}^{L} |det(J_{g_i})| in Eq. (1) can be decomposed
as ∏_{g_i ∈ S_n} |det(J_{g_i})| ∏_{g_i ∈ S_l} |det(J_{g_i})|, where ∏_{g_i ∈ S_n} |det(J_{g_i})| and
∏_{g_i ∈ S_l} |det(J_{g_i})| can be calculated within the complexity of O(D²L) and O(D³L), respectively.
Previous implementations
of such model architectures include Generative Flows (Glow) [4], Neural Spline Flows (NSF) [5],
and the independent component analysis (ICA) models presented in [6, 7].
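To make Eq. (1) concrete, the sketch below (our own PyTorch illustration, not code from the works cited above) accumulates the per-layer log |det J_{g_i}| terms for a toy two-layer flow: one unconstrained fully-connected layer (an element of S_l, whose log-determinant costs O(D³)) and one element-wise leaky-ReLU bijection (an element of S_n, whose diagonal Jacobian costs O(D)); a standard normal prior p_u is assumed.

import math
import torch

class ToyFlow(torch.nn.Module):
    # g = g_2(g_1(x)): an unconstrained fully-connected layer (S_l) followed by an
    # element-wise leaky-ReLU bijection (S_n).  Illustrative only.
    def __init__(self, dim):
        super().__init__()
        self.W = torch.nn.Parameter(torch.eye(dim) + 0.01 * torch.randn(dim, dim))
        self.b = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        h = x @ self.W.T + self.b                                    # g_1: linear layer
        logdet_l = torch.linalg.slogdet(self.W)[1]                   # O(D^3), shared by the batch
        u = torch.nn.functional.leaky_relu(h, negative_slope=0.1)    # g_2: element-wise bijection
        deriv = torch.where(h > 0, torch.ones_like(h), torch.full_like(h, 0.1))
        logdet_n = deriv.log().sum(dim=1)                            # diagonal Jacobian: O(D)
        return u, logdet_l + logdet_n

def log_prob(flow, x):
    # Change of variables (Eq. (1)) with a standard normal prior p_u.
    u, logdet = flow(x)
    log_pu = -0.5 * (u.pow(2) + math.log(2 * math.pi)).sum(dim=1)
    return log_pu + logdet

flow = ToyFlow(8)
print(log_prob(flow, torch.randn(16, 8)).shape)                      # torch.Size([16])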
Given the parameterization of p(· ; θ), a commonly used approach for optimizing θ is maximum
likelihood (ML) estimation, which involves minimizing the Kullback-Leibler (KL) divergence
D_KL[p_x(x) ∥ p(x; θ)] = E_{p_x(x)}[log(p_x(x) / p(x; θ))] between the true density p_x(x) and the
parameterized density p(x; θ). The ML objective L_ML(θ) is derived by removing the constant term
E_{p_x(x)}[log p_x(x)] with respect to θ from D_KL[p_x(x) ∥ p(x; θ)], and can be expressed as follows:

L_ML(θ) = E_{p_x(x)}[− log p(x; θ)].    (2)
The ML objective explicitly evaluates p(x; θ), which involves the calculation of the Jacobian
determinants of the layers in S_l. This indicates that certain model architectures containing
convolutional [4, 5] or fully-connected layers [6, 7] may encounter training inefficiency due to the
O(D³L) cost of evaluating ∏_{g_i ∈ S_l} |det(J_{g_i})|. Although a number of alternative methods
discussed in Section 3 can be adopted to reduce this computational cost, they either require additional
constraints on the linear transformations or rely on biased estimation of the gradients of the ML objective.

2.2 Energy-based Models

Energy-based models are formulated based on a Boltzmann distribution, which is expressed as the
ratio of an unnormalized density function to an input-independent normalizing constant. Specifically,
given a scalar-valued energy function E(· ; θ) : R^D → R, the unnormalized density function is
defined as exp(−E(x; θ)), and the normalizing constant Z(θ) is defined as the integration
∫_{x∈R^D} exp(−E(x; θ)) dx. The parameterization of p(· ; θ) is presented in the following equation:

p(x; θ) = exp(−E(x; θ)) Z⁻¹(θ).    (3)

Optimizing p(· ; θ) in Eq. (3) through directly evaluating L_ML in Eq. (2) is computationally infeasible,
since the computation requires explicitly calculating the intractable normalizing constant Z(θ). To
address this issue, a widely-used technique [13] is to reformulate ∂/∂θ L_ML(θ) as its sampling-based
variant ∂/∂θ L_SML(θ), which is expressed as follows:

L_SML(θ) = E_{p_x(x)}[E(x; θ)] − E_{sg(p(x;θ))}[E(x; θ)],    (4)

where sg(·) indicates the stop-gradient operator. Despite the fact that Eq. (4) prevents the calculation
of Z(θ), sampling from p(· ; θ) typically requires running a Markov Chain Monte Carlo (MCMC)
process (e.g., [21, 22]) until convergence, which can still be computationally expensive as it involves
evaluating the gradients of the energy function numerous times. Although several approaches [23, 24]
were proposed to mitigate the high computational costs involved in performing an MCMC process,
these approaches make use of approximations, which often cause training instabilities in high-
dimensional contexts [25].
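As a concrete reading of Eq. (4), the sketch below (ours) treats the model samples as constants via detach(), which plays the role of the stop-gradient sg(·); for EBFlow, x_model would be obtained through g⁻¹(u) with u ∼ p_u, as described later in Section 5.

import torch

def sml_loss(energy_fn, x_data, x_model):
    # Sampling-based ML surrogate (Eq. (4)): gradients flow only through the
    # energy values, not through the sampling process (stop-gradient).
    return energy_fn(x_data).mean() - energy_fn(x_model.detach()).mean()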
Another line of research proposed to optimize p(· ; θ) through minimizing the Fisher divergence
D_F[p_x(x) ∥ p(x; θ)] = E_{p_x(x)}[ (1/2) ∥∂/∂x log(p_x(x) / p(x; θ))∥² ] between p_x(x) and p(x; θ)
using the score-matching (SM) objective L_SM(θ) = E_{p_x(x)}[ (1/2) ∥∂/∂x E(x; θ)∥² − Tr(∂²/∂x² E(x; θ)) ] [14]
to avoid the explicit calculation of Z(θ) as well as the sampling process required in Eq. (4). Several
computationally efficient variants of L_SM, including sliced score matching (SSM) [16], finite-difference
sliced score matching (FDSSM) [17], and denoising score matching (DSM) [15], have been proposed.
SSM is derived directly based on L_SM with an unbiased Hutchinson's trace estimator [26]. Given
a random projection vector v ∈ R^D drawn from p_v and satisfying E_{p_v(v)}[vv^T] = I, the objective
function, denoted as L_SSM, is defined as follows:

L_SSM(θ) = E_{p_x(x)}[ (1/2) ∥∂E(x; θ)/∂x∥² ] − E_{p_x(x)p_v(v)}[ v^T (∂²E(x; θ)/∂x²) v ].    (5)
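A minimal autograd-based estimator of Eq. (5) is sketched below (our own code, not the released implementation); it uses one Rademacher projection per sample, which satisfies E[vv^T] = I, and assumes energy_fn maps a (B, D) batch to a (B,) vector of energies.

import torch

def ssm_loss(energy_fn, x):
    # Sliced score matching (Eq. (5)) for a scalar energy E(x; theta).
    x = x.detach().requires_grad_(True)
    e = energy_fn(x)                                                   # (B,)
    score = torch.autograd.grad(e.sum(), x, create_graph=True)[0]      # dE/dx, shape (B, D)

    v = (torch.randint(0, 2, x.shape, device=x.device).to(x.dtype) * 2 - 1)   # Rademacher projection
    # Hutchinson estimate of v^T (d^2 E / dx^2) v via a Hessian-vector product.
    hvp = torch.autograd.grad((score * v).sum(), x, create_graph=True)[0]
    trace_term = (hvp * v).sum(dim=1)                                  # v^T H v, shape (B,)

    return (0.5 * score.pow(2).sum(dim=1) - trace_term).mean()

# Usage with a small stand-in energy network (illustrative only).
net = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.Softplus(), torch.nn.Linear(64, 1))
energy_fn = lambda x: net(x).squeeze(-1)
loss = ssm_loss(energy_fn, torch.randn(32, 8))
loss.backward()   # gradients w.r.t. the parameters of net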

FDSSM is a parallelizable variant of L_SSM that adopts the finite difference method [27] to approximate
the gradient operations in the objective. Given a uniformly distributed random vector ε, it accelerates
the calculation by simultaneously forward passing E(x; θ), E(x + ε; θ), and E(x − ε; θ) as follows:

L_FDSSM(θ) = 2 E_{p_x(x)}[E(x; θ)] − E_{p_x(x)p_ξ(ε)}[E(x + ε; θ) + E(x − ε; θ)]
             + (1/8) E_{p_x(x)p_ξ(ε)}[(E(x + ε; θ) − E(x − ε; θ))²],    (6)

where p_ξ(ε) = U(ε ∈ R^D | ∥ε∥ = ξ), and ξ is a hyper-parameter that usually assumes a small value.
DSM approximates the true pdf through a surrogate that is constructed using the Parzen density
estimator p_σ(x̃) [28]. The approximated target p_σ(x̃) = ∫_{x∈R^D} p_σ(x̃|x) p_x(x) dx is defined based
on an isotropic Gaussian kernel p_σ(x̃|x) = N(x̃ | x, σ²I) with a variance σ². The objective L_DSM,
which excludes the Hessian term in L_SSM, is written as follows:

L_DSM(θ) = E_{p_x(x)p_σ(x̃|x)}[ (1/2) ∥ ∂E(x̃; θ)/∂x̃ + (x − x̃)/σ² ∥² ].    (7)
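Eq. (7) can likewise be estimated with a single noise draw per sample, as in the sketch below (ours; sigma plays the role of the kernel bandwidth σ):

import torch

def dsm_loss(energy_fn, x, sigma=0.1):
    # Denoising score matching (Eq. (7)): match -dE/dx~ to the score of the
    # Gaussian smoothing kernel, (x - x~) / sigma^2.
    x_tilde = (x + sigma * torch.randn_like(x)).requires_grad_(True)
    e = energy_fn(x_tilde)
    score = torch.autograd.grad(e.sum(), x_tilde, create_graph=True)[0]   # dE/dx~
    target = (x - x_tilde.detach()) / sigma ** 2
    return 0.5 * (score + target).pow(2).sum(dim=1).mean()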

To conclude, L_SSM is an unbiased objective that satisfies ∂/∂θ L_SSM(θ) = ∂/∂θ D_F[p_x(x) ∥ p(x; θ)] [16],
while L_FDSSM and L_DSM require careful selection of the hyper-parameters ξ and σ, since
∂/∂θ L_FDSSM(θ) = ∂/∂θ (D_F[p_x(x) ∥ p(x; θ)] + o(ξ)) [17] contains an approximation error o(ξ), and
p_σ in ∂/∂θ L_DSM(θ) = ∂/∂θ D_F[p_σ(x̃) ∥ p(x̃; θ)] may bear resemblance to p_x only for small σ [15].

3 Related Works
3.1 Accelerating Maximum Likelihood Training of Flow-based Models

A key focus in the field of flow-based modeling is to reduce the computational expense associated
with evaluating the ML objective [7–12, 29]. These acceleration methods can be classified into two
categories based on their underlying mechanisms.
Specially Designed Linear Transformations. A majority of the existing works [8–12, 29] have
attempted to accelerate the computation of Jacobian determinants in the ML objective by exploiting
linear transformations with special structures. For example, the authors in [8] proposed to constrain
the weights in linear layers as lower triangular matrices to speed up training. The authors in
[9, 10] proposed to adopt convolutional layers with masked kernels to accelerate the computation of
Jacobian determinants. The authors in [29] leveraged orthogonal transformations to bypass the direct
computation of Jacobian determinants. More recently, the authors in [12] proposed to utilize linear
operations with special butterfly structures [30] to reduce the cost of calculating the determinants.
Although these techniques avoid the O(D3 L) computation, they impose restrictions on the learnable
transformations, which potentially limits their capacity to capture complex feature representations, as
discussed in [7, 31, 32]. Our experimental findings presented in Appendix A.5 support this concept,
demonstrating that flow-based models with unconstrained linear layers outperform those with linear
layers restricted by lower / upper triangular weight matrices [8] or those using lower–upper (LU)
decomposition [4].
Specially Designed Optimization Process. To address the aforementioned restrictions, a recent
study [7] proposed the relative gradient method for optimizing flow-based models with arbitrary
linear transformations. In this method, the gradients of the ML objective are converted into their
relative gradients by multiplying themselves with W^T W, where W ∈ R^{D×D} represents the
weight matrix in a linear transformation. Since (∂/∂W log |det(W)|) W^T W = W, evaluating relative
gradients is more computationally efficient than calculating the standard gradients according to
∂/∂W log |det(W)| = (W^T)⁻¹. While this method reduces the training time complexity from
O(D³L) to O(D²L), a significant downside to this approach is that it introduces approximation
errors with a magnitude of o(W), which can escalate with the values of the weight matrix.
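For intuition, the sketch below applies the relative-gradient rule quoted above to a single linear layer trained by maximum likelihood under a standard normal prior; it is our own minimal illustration of the idea in [7] (derived from the stated identity (∂/∂W log |det W|) W^T W = W), not their released code.

import torch

D, B, lr = 16, 64, 1e-3
W = torch.eye(D) + 0.01 * torch.randn(D, D)
x = torch.randn(B, D)

u = x @ W.T                               # forward pass, u_i = W x_i
# Standard gradient of the batch NLL (1/B) * sum_i 0.5*||W x_i||^2 - log|det W|:
#   G = (1/B) * u^T x - (W^T)^{-1}        -> needs an O(D^3) matrix inverse each step.
# Relative gradient G @ (W^T W): since x^T W^T = u^T, the first term becomes
#   (1/B) * u^T (u W), and the inverse term becomes simply W:
rel_grad = (u.T @ (u @ W)) / B - W        # only O(B D^2) matrix products
W = W - lr * rel_grad                     # one relative-gradient descent step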

3.2 Training Flow-based Models with Score-Matching Objectives

The pioneering study [14] is the earliest attempt to train flow-based models by minimizing the
SM objective. Their results demonstrate that models trained using the SM loss are able to achieve
comparable or even better performance to those trained with the ML objective in a low-dimensional
experimental setup. More recently, the authors in [16] and [17] proposed two efficient variants of
the SM loss, i.e., the SSM and FDSSM objectives, respectively. They demonstrated that these loss
functions can be used to train a non-linear independent component estimation (NICE) [1] model
on high-dimensional tasks. While the training approaches of these works bear resemblance to ours,
our proposed method places greater emphasis on training efficiency. Specifically, they directly
implemented the energy function E(x; θ) in the score-matching objectives as − log p(x; θ), resulting
in a significantly higher computational cost compared to our method introduced in Section 4. In
Section 5, we further demonstrate that the models trained with the methods in [16, 17] yield less
satisfactory results in comparison to our approach.

4 Methodology
In this section, we introduce a new framework for reducing the training cost of flow-based models with
linear transformations, and discuss a number of training techniques for enhancing its performance.

4.1 Energy-Based Normalizing Flow

Instead of applying architectural constraints to reduce computational time complexity, we achieve the
same goal through adopting the training objectives of energy-based models. We name this approach as
Energy-Based Normalizing Flow (EBFlow). A key observation is that the parametric density function
of a flow-based model can be reinterpreted as that of an energy-based model through identifying

the input-independent multipliers in p(· ; θ). Specifically, p(· ; θ) can be explicitly factorized into an
unnormalized density and a corresponding normalizing term as follows:

p(x; θ) = p_u(g(x; θ)) ∏_{i=1}^{L} |det(J_{g_i}(x_{i-1}; θ))|
        = [ p_u(g(x; θ)) ∏_{g_i ∈ S_n} |det(J_{g_i}(x_{i-1}; θ))| ] [ ∏_{g_i ∈ S_l} |det(J_{g_i}(θ))| ] ≜ exp(−E(x; θ)) Z⁻¹(θ),    (8)

where the first bracketed factor is the unnormalized density and the second is the (input-independent)
reciprocal of the normalizing constant. The energy function E(· ; θ) and the normalizing constant Z(θ)
are accordingly selected as follows:

E(x; θ) ≜ − log( p_u(g(x; θ)) ∏_{g_i ∈ S_n} |det(J_{g_i}(x_{i-1}; θ))| ),    Z⁻¹(θ) = ∏_{g_i ∈ S_l} |det(J_{g_i}(θ))|.    (9)

The detailed derivations of Eqs. (8) and (9) are elaborated in Lemma A.11 of Section A.1.2. By iso-
lating the computationally expensive term in p(· ; θ) as the normalizing constant Z(θ), the parametric
pdf defined in Eqs. (8) and (9) becomes suitable for the training methods of energy-based models. In
the subsequent paragraphs, we discuss the training, inference, and convergence property of EBFlow.
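To make the factorization concrete, the sketch below (ours) evaluates the energy of Eq. (9) by accumulating log-determinants only over the non-linear layers; the layer interfaces (an alternating stack in which each non-linear layer returns its output together with its log |det J|) are assumptions of this sketch rather than the paper's API.

import torch

def ebflow_energy(x, linear_layers, nonlinear_layers, prior_log_prob):
    # Energy of Eq. (9): negative log of the prior density times the Jacobian
    # determinants of the non-linear layers only.  The determinants of the
    # linear layers form the input-independent constant Z(theta) and are
    # never computed here.
    log_det_n = torch.zeros(x.shape[0], device=x.device)
    h = x
    for linear, nonlinear in zip(linear_layers, nonlinear_layers):
        h = linear(h)           # element of S_l: no log-det needed for training
        h, ld = nonlinear(h)    # element of S_n: cheap (e.g., diagonal) log-det
        log_det_n = log_det_n + ld
    return -(prior_log_prob(h) + log_det_n)

During training, this energy is plugged directly into the objectives of Eqs. (5)-(7), so the constant Z(θ) never has to be evaluated.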
Training Cost. Based on the definition in Eqs. (8) and (9), the score-matching objectives specified in
Eqs. (5)-(7) can be adopted to prevent the Jacobian determinant calculation for the elements in S_l. As
a result, the training complexity can be significantly reduced to O(D²L), as the O(D³L) calculation
of Z(θ) is completely avoided. Such a design allows the use of arbitrary linear transformations in
the construction of a flow-based model without posing computational challenge during the training
process. This feature is crucial to the architectural flexibility of a flow-based model. For example,
fully-connected layers and convolutional layers with arbitrary padding and striding strategies can be
employed in EBFlow without increasing the training complexity. EBFlow thus exhibits an enhanced
flexibility in comparison to the related works that exploit specially designed linear transformations.
Inference Cost. Although the computational cost of evaluating the exact Jacobian determinants of
the elements in S_l still requires O(D³L) time, these operations only need to be computed once after
training and reused for subsequent inferences, since Z(θ) is a constant as long as θ is fixed. In cases
where D is extremely large and Z(θ) cannot be explicitly calculated, stochastic estimators such as
the importance sampling techniques (e.g., [33, 34]) can be used as an alternative to approximate Z(θ).
We provide a brief discussion of such a scenario in Appendix A.3.
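As a small illustration of this point, the helper below (ours; it assumes every linear layer exposes a square weight matrix as layer.weight) recovers log Z(θ) once from the converged weights so that exact log-likelihoods can be reported afterwards.

import torch

@torch.no_grad()
def log_norm_const(linear_layers):
    # log Z(theta) = -sum_i log|det W_i| over the linear layers (Eq. (9)),
    # computed once after training and reused for all subsequent inferences.
    log_z = 0.0
    for layer in linear_layers:
        log_z = log_z - torch.linalg.slogdet(layer.weight)[1]
    return log_z

# Exact test NLL afterwards:  E[E(x; theta)] + log Z(theta)
# nll = ebflow_energy(x_test, linear_layers, nonlinear_layers, prior_log_prob).mean() \
#       + log_norm_const(linear_layers)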
Asymptotic Convergence Property. Similar to maximum likelihood training, score-matching meth-
ods that minimize Fisher divergence have theoretical guarantees on their consistency [14, 16]. This
property is essential in ensuring the convergence accuracy of the parameters. Let N be the num-
ber of independent and identically distributed (i.i.d.) samples drawn from px to approximate the
expectation in the SM objective. In addition, assume that there exists a set of optimal parameters θ∗
such that p(x; θ∗ ) = px (x). Under the regularity conditions (i.e., Assumptions A.1-A.7 shown in
Appendix A.1.1), consistency guarantees that the parameters θ_N minimizing the SM loss converge
(in probability) to the optimal value θ*, i.e., θ_N → θ* in probability as N → ∞. In Appendix A.1.1,
we provide a formal description of this property based on [16] and derive the sufficient condition for
g and p_u to satisfy the regularity conditions (i.e., Proposition A.10).

4.2 Techniques for Enhancing the Training of EBFlow

As revealed in the recent studies [16, 17], training flow-based models with score-matching objectives
is challenging as the training process is numerically unstable and usually exhibits significant variances.
To address these issues, we propose to adopt two techniques: match after preprocessing (MaP) and
exponential moving average (EMA), which are particularly effective in dealing with the above issues
according to our ablation analysis in Section 5.3.
MaP. Score-matching methods rely on the score function −∂/∂x E(x; θ) to match ∂/∂x log p_x(x), which
requires backward propagation through each layer in g. This indicates that the training process
could be numerically sensitive to the derivatives of g. For instance, logit pre-processing layers
commonly used in flow-based models (e.g., [1, 4, 5, 7, 8, 35]) exhibit extremely large derivatives
near 0 and 1, which might exacerbate the above issue. To address this problem, we propose to
exclude the numerically sensitive layer(s) from the model and match the pdf of the pre-processed
variable during training. Specifically, let x_k ≜ g_k ◦ · · · ◦ g_1(x) be the pre-processed variable, where k
represents the index of the numerically sensitive layer. This method aims to optimize a parameterized
pdf p_k(· ; θ) ≜ p_u(g_L ◦ · · · ◦ g_{k+1}(· ; θ)) ∏_{i=k+1}^{L} |det(J_{g_i})| that excludes (g_k, · · · , g_1) through
minimizing the Fisher divergence between the pdf p_{x_k}(·) of x_k and p_k(· ; θ), by considering the (local)
behavior of D_F as presented in Proposition 4.1.
Proposition 4.1. Let p_{x_j} be the pdf of the latent variable x_j ≜ g_j ◦ · · · ◦ g_1(x) indexed by
j. In addition, let p_j(·) be a pdf modeled as p_u(g_L ◦ · · · ◦ g_{j+1}(·)) ∏_{i=j+1}^{L} |det(J_{g_i})|, where
j ∈ {0, · · · , L − 1}. It follows that:

D_F[p_{x_j} ∥ p_j] = 0 ⇔ D_F[p_x ∥ p_0] = 0, ∀j ∈ {1, · · · , L − 1}.    (10)
The derivation is presented in Appendix A.1.3. In Section 5.3, we validate the effectiveness of
the MaP technique on the score-matching methods formulated in Eqs. (5)-(7) through an ablation
analysis. Please note that MaP does not affect maximum likelihood training, since it always satisfies
D_KL[p_{x_j} ∥ p_j] = D_KL[p_x ∥ p_0], ∀j ∈ {1, · · · , L − 1}, as revealed in Lemma A.12.
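In practice, MaP amounts to applying the numerically sensitive layers as a fixed preprocessing step and evaluating the score-matching loss on x_k; the sketch below is our own illustration, reusing the ssm_loss sketch from Section 2.2 and an assumed energy function for the remaining layers g_{k+1}, ..., g_L.

import torch

def logit_preprocess(x, lam=1e-6):
    # g_1, ..., g_k: a logit pre-processing transform applied as a fixed
    # preprocessing step (illustrative; lam keeps x away from 0 and 1).
    x = lam + (1 - 2 * lam) * x
    return torch.log(x) - torch.log1p(-x)

# MaP training step (sketch): the loss is computed on the pre-processed
# variable x_k, so backpropagation never touches the logit layer.
#   x_k  = logit_preprocess(x_batch)             # x_batch in [0, 1]
#   loss = ssm_loss(rest_of_flow_energy, x_k)    # Eq. (5) on p_{x_k} vs. p_k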
EMA. In addition to the MaP technique, we have also found that the exponential moving average
(EMA) technique introduced in [36] is effective in improving the training stability. EMA enhances
the stability through smoothly updating the parameters based on θ̃ ← mθ̃ + (1 − m)θi at each training
iteration, where θ̃ is a set of shadow parameters [36], θi is the model’s parameters at iteration i, and
m is the momentum parameter. In our experiments presented in Section 5, we adopt m = 0.999 for
both EBFlow and the baselines.
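A minimal implementation of this update (our own sketch) keeps a frozen shadow copy of the model and applies the rule θ̃ ← m θ̃ + (1 − m) θ_i after every optimizer step:

import copy
import torch

class EMA:
    # Shadow parameters: theta~ <- m * theta~ + (1 - m) * theta_i.
    def __init__(self, model, m=0.999):
        self.m = m
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.m).add_(p, alpha=1 - self.m)

# ema = EMA(flow); call ema.update(flow) after each optimizer.step().
# Evaluation and sampling then use ema.shadow.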

5 Experiments
In the following experiments, we first compare the training efficiency of the baselines trained with
LML and EBFlow trained with LSML , LSSM , LFDSSM , and LDSM to validate the effectiveness of
the proposed method in Sections 5.1 and 5.2. Then, in Section 5.3, we provide an ablation analysis
of the techniques introduced in Section 4.2, and a performance comparison between EBFlow and a
number of related studies [7, 16, 17]. Finally, in Section 5.4, we discuss how EBFlow can be applied
to generation tasks. Please note that the performance comparison with [8–12, 29] is omitted, since
their methods only support specialized linear layers and are not applicable to the employed model
architecture [7] that involves fully-connected layers. The differences between EBFlow, the baseline,
and the related studies are summarized in Table A4 in the appendix. The sampling process involved
in the calculation of LSML is implemented by g −1 (u ; θ), where u ∼ pu . The transformation g(· ; θ)
for each task is designed such that Sl ̸= ϕ and Sn ̸= ϕ. For more details about the experimental
setups, please refer to Appendix A.2.

5.1 Density Estimation on Two-Dimensional Synthetic Examples

In this experiment, we examine the performance of EBFlow and its baseline on three two-dimensional
synthetic datasets. These data distributions are formed using Gaussian smoothing kernels to ensure
p_x(x) is continuous and the true score function ∂/∂x log p_x(x) is well defined. The model g(· ; θ) is
constructed using the Glow model architecture [4], which consists of actnorm layers, affine coupling
layers, and fully-connected layers. The performance is evaluated in terms of the KL divergence and the
Fisher divergence between p_x(x) and p(x; θ) using independent and identically distributed (i.i.d.)
testing sample points.

Figure 1: The visualized density functions on the Sine, Swirl, and Checkerboard datasets (columns:
True, Baseline (ML), EBFlow (SML), EBFlow (SSM), EBFlow (DSM), EBFlow (FDSSM)). The column
'True' illustrates the visualization of the true density functions.
Table 1 and Fig. 1 demonstrate the results of the above setting. The results show that the performance
of EBFlow trained with L_SSM, L_FDSSM, and L_DSM in terms of KL divergence is on par with that of
the models trained using L_SML, as well as the baselines trained using L_ML. These results validate the efficacy of
training EBFlow with score matching.

Table 1: The evaluation results in terms of KL-divergence and Fisher-divergence of the flow-based
models trained with LML , LSML , LSSM , LDSM , and LFDSSM on the Sine, Swirl, and Checkerboard
datasets. The results are reported as the mean and 95% confidence interval of three independent runs.
Dataset       | Metric                 | Baseline (ML)    | EBFlow (SML)    | EBFlow (SSM)    | EBFlow (DSM)    | EBFlow (FDSSM)
Sine          | Fisher Divergence (↓)  | 6.86 ± 0.73 e-1  | 6.65 ± 1.05 e-1 | 6.25 ± 0.84 e-1 | 6.66 ± 0.44 e-1 | 6.66 ± 1.33 e-1
              | KL Divergence (↓)      | 4.56 ± 0.00 e+0  | 4.56 ± 0.00 e+0 | 4.56 ± 0.01 e+0 | 4.57 ± 0.02 e+0 | 4.57 ± 0.01 e+0
Swirl         | Fisher Divergence (↓)  | 1.42 ± 0.48 e+0  | 1.42 ± 0.53 e+0 | 1.35 ± 0.10 e+0 | 1.34 ± 0.06 e+0 | 1.37 ± 0.07 e+0
              | KL Divergence (↓)      | 4.21 ± 0.00 e+0  | 4.21 ± 0.01 e+0 | 4.25 ± 0.04 e+0 | 4.22 ± 0.02 e+0 | 4.25 ± 0.08 e+0
Checkerboard  | Fisher Divergence (↓)  | 7.24 ± 11.50 e+1 | 1.23 ± 0.75 e+0 | 7.07 ± 1.93 e-1 | 7.03 ± 1.99 e-1 | 7.08 ± 1.62 e-1
              | KL Divergence (↓)      | 4.80 ± 0.02 e+0  | 4.81 ± 0.02 e+0 | 4.85 ± 0.05 e+0 | 4.82 ± 0.05 e+0 | 4.83 ± 0.03 e+0

Table 2: The evaluation results in terms of the performance (i.e., NLL and Bits/Dim) and the
throughput (i.e., Batch/Sec.) of the FC-based and CNN-based models trained with the baseline and
the proposed method on MNIST and CIFAR-10. Each result is reported in terms of the mean and
95% confidence interval of three independent runs after θ has converged. The throughput is measured
on NVIDIA Tesla V100 GPUs.

MNIST (D = 784)

FC-based (1.230 M parameters)
Method         | Baseline (ML) | EBFlow (SML) | EBFlow (SSM) | EBFlow (DSM) | EBFlow (FDSSM)
NLL (↓)        | 1092.4 ± 0.1  | 1092.3 ± 0.6 | 1092.8 ± 0.3 | 1099.2 ± 0.2 | 1104.1 ± 0.5
Bits/Dim (↓)   | 2.01 ± 0.00   | 2.01 ± 0.00  | 2.01 ± 0.00  | 2.02 ± 0.00  | 2.03 ± 0.00
Batch/Sec. (↑) | 8.00          | 12.27        | 33.11        | 66.67        | 130.21

CNN-based (0.027 M parameters)
Method         | Baseline (ML) | EBFlow (SML) | EBFlow (SSM) | EBFlow (DSM) | EBFlow (FDSSM)
NLL (↓)        | 1101.3 ± 1.3  | 1098.3 ± 6.6 | 1107.5 ± 1.4 | 1109.5 ± 2.4 | 1122.1 ± 3.1
Bits/Dim (↓)   | 2.03 ± 0.00   | 2.02 ± 0.01  | 2.03 ± 0.00  | 2.04 ± 0.00  | 2.06 ± 0.01
Batch/Sec. (↑) | 0.21          | 0.29         | 7.09         | 18.32        | 38.76

CIFAR-10 (D = 3,072)

FC-based (18.881 M parameters)
Method         | Baseline (ML)  | EBFlow (SML)  | EBFlow (SSM)   | EBFlow (DSM)  | EBFlow (FDSSM)
NLL (↓)        | 11912.9 ± 10.5 | 11915.6 ± 5.6 | 11917.7 ± 15.5 | 11940.0 ± 6.6 | 12347.8 ± 6.8
Bits/Dim (↓)   | 5.59 ± 0.00    | 5.60 ± 0.00   | 5.60 ± 0.01    | 5.61 ± 0.00   | 5.80 ± 0.00
Batch/Sec. (↑) | 5.05           | 7.35          | 29.85          | 57.14         | 62.50

CNN-based (0.241 M parameters)
Method         | Baseline (ML)  | EBFlow (SML)    | EBFlow (SSM)   | EBFlow (DSM)  | EBFlow (FDSSM)
NLL (↓)        | 11408.7 ± 26.7 | 11553.6 ± 151.7 | 11435.5 ± 12.0 | 11462.3 ± 7.9 | 11766.0 ± 36.8
Bits/Dim (↓)   | 5.36 ± 0.01    | 5.41 ± 0.07     | 5.37 ± 0.00    | 5.38 ± 0.00   | 5.54 ± 0.02
Batch/Sec. (↑) | 0.02           | 0.03            | 7.35           | 18.41         | 39.84

5.2 Efficiency Evaluation on the MNIST and CIFAR-10 Datasets

In this section, we inspect the influence of data dimension D on the training efficiency of flow-based
models. To provide a thorough comparison, we employ two types of model architectures and train
them on two datasets with different data dimensions: the MNIST [19] (D = 1 × 28 × 28) and
CIFAR-10 [37] (D = 3 × 32 × 32) datasets.
The first model architecture is exactly the same as that adopted by [7]. It is an architecture consisting
of two fully-connected layers and a smoothed leaky ReLU non-linear layer in between. The second
model is a parametrically efficient variant of the first model. It replaces the fully-connected layers
with convolutional layers and increases the depth of the model to six convolutional blocks. Between
every two convolutional blocks, a squeeze operation [2] is inserted to enlarge the receptive field. In the
following paragraphs, we refer to these models as ‘FC-based’ and ‘CNN-based’ models, respectively.
The performance of the FC-based and CNN-based models is measured using the negative log likeli-
hood (NLL) metric (i.e., Epx (x) [− log p(x; θ)]), which differs from the intractable KL divergence
by a constant. In addition, its normalized variant, the Bits/Dim metric [38], is also measured and
reported. The algorithms are implemented using PyTorch [39] with automatic differentiation [40],
and the runtime is measured on NVIDIA Tesla V100 GPUs. In the subsequent paragraphs, we assess
the models through scalability analysis, performance evaluation, and training efficiency examination.
Scalability. To demonstrate the scalability of KL-divergence-based (i.e., LML and LSML ) and Fisher-
divergence-based (i.e., LSSM , LDSM , and LFDSSM ) objectives used in EBFlow and the baseline
method, we first present a runtime comparison for different choices of the input data size D. The
results presented in Fig. 2 (a) reveal that Fisher-divergence-based objectives can be computed more
efficiently than KL-divergence-based objectives. Moreover, the sampling-based objective LSML used
in EBFlow, which excludes the calculation of Z(θ) in the computational graph, can be computed
slightly faster than LML adopted by the baseline.

[Figure 2; legend: Baseline (ML), EBFlow (SML), EBFlow (SSM), EBFlow (DSM), EBFlow (FDSSM). Panels plot Runtime (Sec./Batch) versus input dimension and Bits/Dim versus wall time (hours) for the FC-based and CNN-based models on MNIST and CIFAR-10.]
Figure 2: (a) A runtime comparison of calculating the gradients of different objectives for different
input sizes (D). The input sizes are (1, n, n) and (3, n, n), with the x-axis in the figures representing
n. In the format (c, h, w), the first value indicates the number of channels, while the remaining values
correspond to the height and width of the input data. The curves depict the evaluation results in terms
of the mean of three independent runs. (b) A comparison of the training efficiency of the FC-based
and CNN-based models evaluated on the validation set of MNIST and CIFAR-10. Each curve and
the corresponding shaded area depict the mean and confidence interval of three independent runs.

Figure 3: The norm of ∂/∂θ L_SSM(θ) of an FC-based model trained on the MNIST dataset, without
and with the MaP technique (gradient norm versus training steps). The curves and shaded area depict
the mean and 95% confidence interval of three independent runs.

Table 3: The results in terms of NLL of the FC-based and CNN-based models trained using SSM,
DSM, and FDSSM losses on MNIST. The performance is reported in terms of the means and 95%
confidence intervals of three independent runs.

FC-based
EMA | MaP | EBFlow (SSM)  | EBFlow (DSM)  | EBFlow (FDSSM)
    |     | 1757.5 ± 28.0 | 4660.3 ± 19.8 | 3267.0 ± 99.2
 ✓  |     | 1720.5 ± 0.8  | 4455.0 ± 1.6  | 3166.3 ± 17.3
 ✓  |  ✓  | 1092.8 ± 0.3  | 1099.2 ± 0.2  | 1104.1 ± 0.5

CNN-based
EMA | MaP | EBFlow (SSM)  | EBFlow (DSM)  | EBFlow (FDSSM)
    |     | 3518.0 ± 33.9 | 3170.0 ± 7.2  | 3593.3 ± 12.5
 ✓  |     | 3504.5 ± 2.4  | 3180.0 ± 2.9  | 3560.3 ± 1.7
 ✓  |  ✓  | 1107.5 ± 1.4  | 1109.5 ± 2.6  | 1122.1 ± 3.1

Performance. Table 2 demonstrates the performance of the FC-based and CNN-based models in
terms of NLL on the MNIST and CIFAR-10 datasets. The results show that the models trained with
Fisher-divergence-based objectives are able to achieve similar performance as those trained with
KL-divergence-based objectives. Among the Fisher-divergence-based objectives, the models trained
using LSSM and LDSM are able to achieve better performance in comparison to those trained using
LFDSSM . The runtime and performance comparisons above suggest that LSSM and LDSM can deliver
better training efficiency than LML and LSML , since the objectives can be calculated faster while
maintaining the models’ performance on the NLL metric.
Training Efficiency. Fig. 2 (b) presents the trends of NLL versus training wall time when LML , LSML ,
LSSM , LDSM , and LFDSSM are adopted as the objectives. It is observed that EBFlow trained with
SSM and DSM consistently attain better NLL in the early stages of the training. The improvement is
especially notable when both D and L are large, as revealed for the scenario of training CNN-based
models on the CIFAR-10 dataset. These experimental results provide evidence to support the use of
score-matching methods for optimizing EBFlow.

5.3 Analyses and Comparisons

Ablation Study. Table 3 presents the ablation results that demonstrate the effectiveness of the
EMA and MaP techniques. It is observed that EMA is effective in reducing the variances. In
addition, MaP significantly improves the overall performance. To further illustrate the influence
of the proposed MaP technique on the score-matching methods, we compare the optimization processes
with ∂/∂θ D_F[p_{x_k} ∥ p_k] and ∂/∂θ D_F[p_x ∥ p_0] = ∂/∂θ E_{p_{x_k}(x_k)}[ (1/2) ∥ (∂/∂x_k log(p_{x_k}(x_k)/p_k(x_k))) ∏_{i=1}^{k} J_{g_i} ∥² ]
(i.e., Lemma A.13) by depicting the norm of their unbiased estimators ∂/∂θ L_SSM(θ), calculated with
and without applying the MaP technique, in Fig. 3. It is observed that the magnitude of ∂/∂θ L_SSM(θ)
significantly decreases when MaP is incorporated into the training process. This could be attributed to
the fact that the calculation of ∂/∂θ D_F[p_{x_k} ∥ p_k] excludes the calculation of ∏_{i=1}^{k} J_{g_i} in ∂/∂θ D_F[p_x ∥ p_0],
which involves computing the derivatives of the numerically sensitive logit pre-processing layer.

Figure 4: A qualitative comparison between (a) our model (NLL=728) and (b) the model in [17]
(NLL=1,637) on the inverse generation task.

Figure 5: A qualitative demonstration of the FC-based model trained using L_DSM on the data
imputation task (masks: MNIST and KMNIST).
Comparison with Related Works. Table 4 compares the performance of our method with a number
of related works on the MNIST dataset. Our models trained with score-matching objectives using the
same model architecture exhibit improved performance in comparison to the relative gradient method [7].
In addition, when compared to the results in [16] and [17], our models deliver significantly improved
performance over them. Please note that the results of [7, 16, 17] presented in Table 4 are obtained
from their original papers.

Table 4: A comparison of performance and training complexity between EBFlow and a number of
related works [16, 7, 17] on the MNIST dataset.

             Method              | Complexity | NLL (↓)
D_KL-Based   Baseline (ML)       | O(D³L)     | 1092.4 ± 0.1
             EBFlow (SML)        | O(D³L)     | 1092.3 ± 0.6
             Relative Grad. [7]  | O(D²L)     | 1375.2 ± 1.4
D_F-Based    EBFlow (SSM)        | O(D²L)     | 1092.8 ± 0.3
             EBFlow (DSM)        | O(D²L)     | 1099.2 ± 0.2
             EBFlow (FDSSM)      | O(D²L)     | 1104.1 ± 0.5
             SSM [16]            | -          | 3355
             DSM [17]            | -          | 3398 ± 1343
             FDSSM [17]          | -          | 1647 ± 306

5.4 Application to Generation Tasks

The sampling process of EBFlow can be accomplished through the inverse function or an MCMC
process. The former is a typical generation method adopted by flow-based models, while the latter is
a more flexible sampling process that allows conditional generation without re-training the model. In
the following paragraphs, we provide detailed explanations and visualized results of these tasks.
Inverse Generation. One benefit of flow-based models is that g⁻¹ can be directly adopted as a
generator. While inverting the weight matrices in linear transformations typically demands a time
complexity of O(D³L), these inverse matrices only need to be computed once θ has converged, and can
then be reused for subsequent inferences. In this experiment, we adopt the Glow [4] model architecture
and train it using our method with LSSM on the MNIST dataset. We compare our visualized results
with the current best flow-based model trained using the score matching objective [17]. The results
of [17] are generated using their officially released code with their best setup (i.e., FDSSM). As
presented in Fig. 4, the results generated using our model demonstrate significantly better visual
quality than those of [17].
MCMC Generation. In comparison to the inverse generation method, the MCMC sampling process
is more suitable for conditional generation tasks such as data imputation due to its flexibility [41].
For the imputation task, a data vector x is separated into an observable part x_O and a masked part
x_M. The goal of imputation is to generate the masked part x_M based on the observable part x_O.
To achieve this goal, one can perform a Langevin MCMC process to update x_M according to the
gradient of the energy function ∂/∂x E(x; θ). Given a noise vector z sampled from N(0, I) and a
small step size α, the process iteratively updates x_M based on the following equation:

x_M^{(t+1)} = x_M^{(t)} − α ∂/∂x_M^{(t)} E(x_O, x_M^{(t)}; θ) + √(2α) z,    (11)

where x_M^{(t)} represents x_M at iteration t ∈ {1, · · · , T}, and T is the total number of iterations. MCMC
generation requires an overall cost of O(TD²L), which is potentially more economical than the O(D³L)
computation of the inverse generation method. Fig. 5 depicts the imputation results of the FC-based
model trained using L_DSM on the CelebA [42] dataset (D = 3 × 64 × 64). In this example, we
implement the masked part x_M using the data from the KMNIST [43] and MNIST [19] datasets.
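A sketch of this procedure is given below (our own illustration around an assumed energy_fn; mask equals 1 on x_M and 0 on x_O):

import torch

def langevin_impute(energy_fn, x, mask, steps=500, alpha=1e-4):
    # Langevin MCMC imputation (Eq. (11)); only the masked entries are updated.
    x = x.detach().clone()
    for _ in range(steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy_fn(x).sum(), x)[0]
        with torch.no_grad():
            noise = torch.randn_like(x)
            x = x - mask * (alpha * grad - (2 * alpha) ** 0.5 * noise)
    return x.detach()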

6 Conclusion
In this paper, we presented EBFlow, a new flow-based modeling approach that associates the
parameterization of flow-based and energy-based models. We showed that by optimizing EBFlow with
score-matching objectives, the computation of Jacobian determinants for linear transformations can
be bypassed, resulting in an improved training time complexity. In addition, we demonstrated that the
training stability and performance can be effectively enhanced through the MaP and EMA techniques.
Based on the improvements in both theoretical time complexity and empirical performance, our
method exhibits superior training efficiency compared to maximum likelihood training.

Acknowledgement
The authors gratefully acknowledge the support from the National Science and Technology Council
(NSTC) in Taiwan under grant number MOST 111-2223-E-007-004-MY3, as well as the financial
support from MediaTek Inc., Taiwan. The authors would also like to express their appreciation
for the donation of the GPUs from NVIDIA Corporation and NVIDIA AI Technology Center
(NVAITC) used in this work. Furthermore, the authors extend their gratitude to the National Center
for High-Performance Computing (NCHC) for providing the necessary computational and storage
resources.

References
[1] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear Independent Components Estimation,
2015.
[2] L. Dinh, J. N. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. In Proc. Int.
Conf. on Learning Representations (ICLR), 2016.
[3] G. Papamakarios, I. Murray, and T. Pavlakou. Masked Autoregressive Flow for Density
Estimation. In Proc. Conf. on Neural Information Processing Systems (NeurIPS), 2017.
[4] D. P. Kingma and P. Dhariwal. Glow: Generative Flow with Invertible 1x1 Convolutions. In
Proc. Conf. on Neural Information Processing Systems (NeurIPS), 2018.
[5] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Neural Spline Flows. In Proc. Conf.
on Neural Information Processing Systems (NeurIPS), 2019.
[6] A. Hyvärinen and E. Oja. Independent Component Analysis: Algorithms and Applications.
Neural Networks: the Official Journal of the International Neural Network Society, 13 4-5:411–
30, 2000.
[7] L. Gresele, G. Fissore, A. Javaloy, B. Schölkopf, and A. Hyvärinen. Relative Gradient Op-
timization of the Jacobian Term in Unsupervised Deep Learning. In Proc. Conf. on Neural
Information Processing Systems (NeurIPS), 2020.
[8] Y. Song, C. Meng, and S. Ermon. MintNet: Building Invertible Neural Networks with Masked
Convolutions. In Proc. Conf. on Neural Information Processing Systems (NeurIPS), 2019.
[9] E. Hoogeboom, R. v. d. Berg, and M. Welling. Emerging Convolutions for Generative Normal-
izing Flows. In Proc. Int. Conf. on Machine Learning (ICML), 2019.

[10] X. Ma and E. H. Hovy. MaCow: Masked Convolutional Generative Flow. In Proc. Conf. on
Neural Information Processing Systems (NeurIPS), 2019.
[11] Y. Lu and B. Huang. Woodbury Transformations for Deep Generative Flows. In Proc. Conf. on
Neural Information Processing Systems (NeurIPS), 2020.
[12] C. Meng, L. Zhou, K. Choi, T. Dao, and S. Ermon. ButterflyFlow: Building Invertible Layers
with Butterfly Matrices. In Proc. Int. Conf. on Machine Learning (ICML), 2022.
[13] Y. LeCun, S. Chopra, R. Hadsell, A. Ranzato, and F. J. Huang. A Tutorial on Energy-Based
Learning. 2006.
[14] A. Hyvärinen. Estimation of Non-Normalized Statistical Models by Score Matching. Journal
of Machine Learning Research (JMLR), 6(24):695–709, 2005.
[15] P. Vincent. A Connection between Score Matching and Denoising Autoencoders. Neural
computation, 23(7):1661–1674, 2011.
[16] Y. Song, S. Garg, J. Shi, and S. Ermon. Sliced Score Matching: A Scalable Approach to Density
and Score Estimation. In Proc. Conf. on Uncertainty in Artificial Intelligence (UAI), 2019.
[17] T. Pang, K. Xu, C. Li, Y. Song, S. Ermon, and J. Zhu. Efficient Learning of Generative Models
via Finite-Difference Score Matching. In Proc. Conf. on Neural Information Processing Systems
(NeurIPS), 2020.
[18] S. Lyu. Interpretation and Generalization of Score Matching. In Proc. Conf. on Uncertainty in
Artificial Intelligence (UAI), 2009.
[19] L. Deng. The MNIST Database of Handwritten Digit Images for Machine Learning Research.
IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[20] G. Papamakarios, E. T. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan.
Normalizing Flows for Probabilistic Modeling and Inference. Journal of Machine Learning
Research (JMLR), 22:57:1–57:64, 2019.
[21] G. O Roberts and R. L. Tweedie. Exponential Convergence of Langevin Distributions and Their
Discrete Approximations. Bernoulli, 2(4):341 – 363, 1996.
[22] G. O Roberts and J. S Rosenthal. Optimal Scaling of Discrete Approximations to Langevin Dif-
fusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–
268, 1998.
[23] G. E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence. Neural
Computation, 14:1771–1800, 2002.
[24] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood
gradient. In Proc. Int. Conf. on Machine Learning (ICML), 2008.
[25] Y. Du, S. Li, B. J. Tenenbaum, and I. Mordatch. Improved Contrastive Divergence Training of
Energy Based Models. In Proc. Int. Conf. on Machine Learning (ICML), 2021.
[26] M. F Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian
smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–
1076, 1989.
[27] A. Neumaier. Introduction to Numerical Analysis. Cambridge University Press, 2001.
[28] E. Parzen. On Estimation of a Probability Density Function and Mode. Annals of Mathematical
Statistics, 33:1065–1076, 1962.
[29] J. M. Tomczak and M. Welling. Improving Variational Auto-Encoders using Householder Flow.
ArXiv, abs/1611.09630, 2016.
[30] T. Dao, A. Gu, M. Eichhorn, A. Rudra, and C. Ré. Learning Fast Algorithms for Linear
Transforms Using Butterfly Factorizations. In Proc. Int. Conf. on Machine Learning (ICML),
2019.

[31] J. Behrmann, D. K. Duvenaud, and J.-H. Jacobsen. Invertible Residual Networks. In Proc. Int.
Conf. on Machine Learning (ICML), 2018.

[32] R. T. Q. Chen, J. Behrmann, D. K. Duvenaud, and J.-H. Jacobsen. Residual Flows for Invertible
Generative Modeling. In Proc. Conf. on Neural Information Processing Systems (NeurIPS),
2019.

[33] R. M. Neal. Annealed Importance Sampling. Statistics and Computing, 11:125–139, 1998.

[34] Y. Burda, R. B. Grosse, and R. Salakhutdinov. Accurate and Conservative Estimates of MRF
Log-Likelihood using Reverse Annealing. volume abs/1412.8566, 2015.

[35] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. K. Duvenaud. FFJORD:


Free-form Continuous Dynamics for Scalable Reversible Generative Models. In Int. Conf. on
Learning Representations (ICLR), 2018.

[36] Y. Song and S. Ermon. Improved Techniques for Training Score-Based Generative Models. In
Proc. Conf. on Neural Information Processing Systems (NeurIPS), 2020.

[37] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009.

[38] A. V. D. Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves.


Conditional Image Generation with PixelCNN Decoders. In Proc. Conf. on Neural Information
Processing Systems (NeurIPS), 2016.

[39] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,


N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani,
S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An Imperative Style,
High-Performance Deep Learning Library. In Proc. Conf. on Neural Information Processing
Systems (NeurIPS), 2019.

[40] A. Griewank and A. Walther. Evaluating Derivatives - Principles and Techniques of Algorithmic
Differentiation, Second Edition. In Frontiers in applied mathematics, 2000.

[41] A. M Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski. Plug & Play Generative
Networks: Conditional Iterative Generation of Images in Latent Space. In Proc. Int. Conf. on
Computer Vision and Pattern Recognition (CVPR), 2016.

[42] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep Learning Face Attributes in the Wild. In Proc. Int.
Conf. on Computer Vision (ICCV), December 2015.

[43] T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha. Deep Learning
for Classical Japanese Literature. In Proc. Conf. on Neural Information Processing Systems
(NeurIPS), 2018.

[44] T. Anderson Keller, Jorn W. T. Peters, Priyank Jaini, Emiel Hoogeboom, Patrick Forr’e, and
Max Welling. Self Normalizing Flows. In Proc. Int. Conf. on Machine Learning (ICML), 2020.

[45] C.-H. Chao, W.-F. Sun, B.-W. Cheng, Y.-C. Lo, C.-C. Chang, Y.-L. Liu, Y.-L. Chang, C.-P.
Chen, and C.-Y. Lee. Denoising Likelihood Score Matching for Conditional Score-based Data
Generation. In Proc. Int. Conf. on Learning Representations (ICLR), 2022.

[46] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. CoRR,
abs/1412.6980, 2014.

[47] I. Loshchilov and F. Hutter. Decoupled Weight Decay Regularization. In Proc. Int. Conf. on
Learning Representations (ICLR), 2017.

[48] M. Ning, E. Sangineto, A. Porrello, S. Calderara, and R. Cucchiara. Input Perturbation Reduces
Exposure Bias in Diffusion Models. In Proc. Int. Conf. on Machine Learning (ICML), 2023.

[49] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis,
J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Joze-
fowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah,
M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Va-
sudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. Software
available from tensorflow.org.
[50] J. Sohl-Dickstein. Two Equalities Expressing the Determinant of a Matrix in terms of Expecta-
tions over Matrix-Vector Products, 2020.
[51] K. P. Murphy. Probabilistic Machine Learning: Advanced Topics, page 811. MIT Press, 2023.
[52] M. Zhang, O. Key, P. Hayes, D. Barber, B. Paige, and F.-X. Briol. Towards Healing the Blindness
of Score Matching. Workshop on Score-Based Methods at Conf. on Neural Information
Processing Systems (NeurIPS), 2022.

A Appendix
A.1 Derivations

In the following subsections, we provide theoretical derivations. In Section A.1.1, we discuss
the asymptotic convergence properties as well as the assumptions of score-matching methods. In
Section A.1.2, we elaborate on the formulation of EBFlow (i.e., Eqs. (8) and (9)), and provide an
explanation of their interpretation. Finally, in Section A.1.3, we present a theoretical analysis of the KL
divergence and the Fisher divergence, and discuss the underlying mechanism behind the proposed MaP
technique.

A.1.1 Asymptotic Convergence Property of Score Matching


In this subsection, we provide a formal description of the consistency property of score matching.
The description follows [16] and the notations are replaced with those used in this paper.
The regularity conditions for p(· ; θ) are defined in Assumptions A.1∼A.7. In the following
paragraphs, the parameter space is defined as Θ. In addition, s(x; θ) ≜ ∂/∂x log p(x; θ) = −∂/∂x E(x; θ)
represents the score function. L̂_SM(θ) ≜ (1/N) Σ_{k=1}^{N} f(x_k; θ) denotes an unbiased estimator of
L_SM(θ), where f(x; θ) ≜ (1/2) ∥∂/∂x E(x; θ)∥² − Tr(∂²/∂x² E(x; θ)) = (1/2) ∥s(x; θ)∥² + Tr(∂/∂x s(x; θ)),
and {x_1, · · · , x_N} represents a collection of i.i.d. samples drawn from p_x. For notational simplicity,
we denote ∂h(x; θ) ≜ ∂/∂x h(x; θ) and ∂_i h_j(x; θ) ≜ ∂/∂x_i h_j(x; θ), where h_j(x; θ) denotes the j-th
element of h.
Assumption A.1. (Positiveness) p(x; θ) > 0 and p_x(x) > 0, ∀θ ∈ Θ, ∀x ∈ R^D.
Assumption A.2. (Regularity of the score functions) The parameterized score function s(x; θ)
and the true score function ∂/∂x log p_x(x) are both continuous and differentiable. In addition, their
expectations E_{p_x(x)}[s(x; θ)] and E_{p_x(x)}[∂/∂x log p_x(x)] are finite (i.e., E_{p_x(x)}[s(x; θ)] < ∞ and
E_{p_x(x)}[∂/∂x log p_x(x)] < ∞).
Assumption A.3. (Boundary condition) lim_{∥x∥→∞} p_x(x) s(x; θ) = 0, ∀θ ∈ Θ.
Assumption A.4. (Compactness) The parameter space Θ is compact.
Assumption A.5. (Identifiability) There exists a set of parameters θ* such that p_x(x) = p(x; θ*),
where θ* ∈ Θ, ∀x ∈ R^D.
Assumption A.6. (Uniqueness) θ ≠ θ* ⇔ p(x; θ) ≠ p(x; θ*), where θ, θ* ∈ Θ, x ∈ R^D.
Assumption A.7. (Lipschitzness of f) The function f is Lipschitz continuous w.r.t. θ, i.e.,
|f(x; θ_1) − f(x; θ_2)| ≤ L(x) ∥θ_1 − θ_2∥_2, ∀θ_1, θ_2 ∈ Θ, where L(x) represents a Lipschitz
constant satisfying E_{p_x(x)}[L(x)] < ∞.
Theorem A.8. (Consistency of a score-matching estimator [16]) The score-matching estimator
θ_N ≜ argmin_{θ∈Θ} L̂_SM is consistent, i.e.,

θ_N → θ* in probability, as N → ∞.

Assumptions A.1∼A.3 are the conditions that ensure ∂/∂θ D_F[p_x(x) ∥ p(x; θ)] = ∂/∂θ L_SM(θ).
Assumptions A.4∼A.7 lead to the uniform convergence property [16] of a score-matching estimator, which
gives rise to the consistency property. The detailed derivation can be found in Corollary 1 in [16]. In
the following Lemma A.9 and Proposition A.10, we examine the sufficient condition for g and p_u to
satisfy Assumption A.7.
Lemma A.9. (Sufficient condition for the Lipschitzness of f) The function f(x; θ) = (1/2) ∥s(x; θ)∥² +
Tr(∂/∂x s(x; θ)) is Lipschitz continuous if the score function s(x; θ) satisfies the following conditions:
∀θ, θ_1, θ_2 ∈ Θ, ∀i ∈ {1, · · · , D},

∥s(x; θ)∥_2 ≤ L_1(x),
∥s(x; θ_1) − s(x; θ_2)∥_2 ≤ L_2(x) ∥θ_1 − θ_2∥_2,
∥∂_i s(x; θ_1) − ∂_i s(x; θ_2)∥_2 ≤ L_3(x) ∥θ_1 − θ_2∥_2,

where L_1, L_2, and L_3 are Lipschitz constants satisfying E_{p_x(x)}[L_1(x)] < ∞, E_{p_x(x)}[L_2(x)] < ∞,
and E_{p_x(x)}[L_3(x)] < ∞.

Proof. The Lipschitzness of f can be guaranteed by ensuring the Lipschitzness of ∥s(x; θ)∥₂² and
Tr(∂s(x; θ)).

Step 1. (Lipschitzness of ∥s(x; θ)∥₂²)

∥s(x; θ_1)∥₂² − ∥s(x; θ_2)∥₂²
= s(x; θ_1)^T s(x; θ_1) − s(x; θ_2)^T s(x; θ_2)
= [s(x; θ_1)^T s(x; θ_1) − s(x; θ_1)^T s(x; θ_2)] + [s(x; θ_1)^T s(x; θ_2) − s(x; θ_2)^T s(x; θ_2)]
= s(x; θ_1)^T (s(x; θ_1) − s(x; θ_2)) + s(x; θ_2)^T (s(x; θ_1) − s(x; θ_2))
(i)   ≤ |s(x; θ_1)^T (s(x; θ_1) − s(x; θ_2))| + |s(x; θ_2)^T (s(x; θ_1) − s(x; θ_2))|
(ii)  ≤ ∥s(x; θ_1)∥_2 ∥s(x; θ_1) − s(x; θ_2)∥_2 + ∥s(x; θ_2)∥_2 ∥s(x; θ_1) − s(x; θ_2)∥_2
(iii) ≤ L_1(x) ∥s(x; θ_1) − s(x; θ_2)∥_2 + L_1(x) ∥s(x; θ_1) − s(x; θ_2)∥_2
(iii) ≤ 2 L_1(x) L_2(x) ∥θ_1 − θ_2∥_2,

where (i) is based on the triangle inequality, (ii) is due to the Cauchy–Schwarz inequality, and (iii) follows
from the listed assumptions.

Step 2. (Lipschitzness of Tr(∂s(x; θ)))

|Tr(∂s(x; θ_1)) − Tr(∂s(x; θ_2))| = |Tr(∂s(x; θ_1) − ∂s(x; θ_2))|
(i)   ≤ D ∥∂s(x; θ_1) − ∂s(x; θ_2)∥_2
(ii)  ≤ D √( Σ_i ∥∂_i s(x; θ_1) − ∂_i s(x; θ_2)∥₂² )
(iii) ≤ D √( D L_3²(x) ∥θ_1 − θ_2∥₂² )
= D √D L_3(x) ∥θ_1 − θ_2∥_2,

where (i) holds by Von Neumann's trace inequality, (ii) is due to the property ∥A∥_2 ≤ √( Σ_i ∥a_i∥₂² ),
where a_i is the i-th column vector of A, and (iii) holds by the listed assumptions.

Based on Steps 1 and 2, the Lipschitzness of f is guaranteed, since

|f(x; θ_1) − f(x; θ_2)| = | (1/2) ∥s(x; θ_1)∥² + Tr(∂/∂x s(x; θ_1)) − (1/2) ∥s(x; θ_2)∥² − Tr(∂/∂x s(x; θ_2)) |
= | (1/2) (∥s(x; θ_1)∥² − ∥s(x; θ_2)∥²) + Tr(∂/∂x s(x; θ_1)) − Tr(∂/∂x s(x; θ_2)) |
≤ (1/2) | ∥s(x; θ_1)∥² − ∥s(x; θ_2)∥² | + | Tr(∂/∂x s(x; θ_1)) − Tr(∂/∂x s(x; θ_2)) |
≤ L_1(x) L_2(x) ∥θ_1 − θ_2∥_2 + D √D L_3(x) ∥θ_1 − θ_2∥_2
= ( L_1(x) L_2(x) + D √D L_3(x) ) ∥θ_1 − θ_2∥_2.

Proposition A.10. (Sufficient condition for the Lipschitzness of f) The function f is Lipschitz
continuous if g(x; θ) has bounded first, second, and third-order derivatives, i.e., ∀i, j ∈ {1, · · · , D},
∀θ ∈ Θ,

∥J_g(x; θ)∥_2 ≤ l_1(x),  ∥∂_i J_g(x; θ)∥_2 ≤ l_2(x),  ∥∂_i ∂_j J_g(x; θ)∥_2 ≤ l_3(x),

and is smooth enough on Θ, i.e., ∀θ_1, θ_2 ∈ Θ:

∥g(x; θ_1) − g(x; θ_2)∥_2 ≤ r_0(x) ∥θ_1 − θ_2∥_2,
∥J_g(x; θ_1) − J_g(x; θ_2)∥_2 ≤ r_1(x) ∥θ_1 − θ_2∥_2,
∥∂_i J_g(x; θ_1) − ∂_i J_g(x; θ_2)∥_2 ≤ r_2(x) ∥θ_1 − θ_2∥_2,
∥∂_i ∂_j J_g(x; θ_1) − ∂_i ∂_j J_g(x; θ_2)∥_2 ≤ r_3(x) ∥θ_1 − θ_2∥_2.

In addition, it satisfies the following conditions:

∥J_g⁻¹(x; θ)∥_2 ≤ l_1′(x),  ∥∂_i J_g⁻¹(x; θ)∥_2 ≤ l_2′(x),
∥J_g⁻¹(x; θ_1) − J_g⁻¹(x; θ_2)∥_2 ≤ r_1′(x) ∥θ_1 − θ_2∥_2,
∥∂_i J_g⁻¹(x; θ_1) − ∂_i J_g⁻¹(x; θ_2)∥_2 ≤ r_2′(x) ∥θ_1 − θ_2∥_2,

where J_g⁻¹ represents the inverse matrix of J_g. Furthermore, the prior distribution p_u satisfies:

∥s_u(u)∥ ≤ t_1,  ∥∂_i s_u(u)∥ ≤ t_2,
∥s_u(u_1) − s_u(u_2)∥_2 ≤ t_3 ∥u_1 − u_2∥_2,
∥∂_i s_u(u_1) − ∂_i s_u(u_2)∥_2 ≤ t_4 ∥u_1 − u_2∥_2,

where s_u(u) ≜ ∂/∂u log p_u(u) is the score function of p_u. The Lipschitz constants listed above (i.e.,
l_1 ∼ l_3, r_0 ∼ r_3, l_1′ ∼ l_2′, and r_1′ ∼ r_2′) have finite expectations.

Proof. We show that the sufficient conditions stated in Lemma A.9 can be satisfied using the
conditions listed above.
Step 1. (Sufficient condition of ∥s(x; θ)∥2 ≤ L1 (x))
∂ ∂ ∂
Since ∥s(x; θ)∥2 = ∂x log pu (g(x; θ)) + ∂x log |det Jg (x; θ)| 2
≤ ∂x log pu (g(x; θ)) 2 +
∂ ∂
∂x log |det Jg (x; θ)| 2
, we first demonstrate that ∂x log pu (g(x; θ)) 2 and

∂x log |det Jg (x; θ)| 2
are both bounded.

(1.1) ∂x log pu (g(x; θ)) 2
is bounded:
∂ T
log pu (g(x; θ)) = (su (g(x; θ))) Jg (x; θ) ≤ ∥su (g(x; θ))∥2 ∥Jg (x; θ)∥2 ≤ t1 l1 (x).
∂x 2 2


(1.2) ∂x log |det Jg (x; θ)| is bounded:
∂ −1 ∂
log |det Jg (x; θ)| = |det Jg (x; θ)| |det Jg (x; θ)|
∂x ∂x
−1 ∂
= (det Jg (x; θ)) det Jg (x; θ)
∂x
(i) −1
= (det Jg (x; θ)) det Jg (x; θ)v(x; θ)
= ∥v(x; θ)∥ ,
where (i) is derived using Jacobi’s formula, and vi (x; θ) = Tr J−1

g (x; θ)∂i Jg (x; θ) .
s
X 2
∥v(x; θ)∥ = Tr J−1
g (x; θ)∂i Jg (x; θ)
i
s
(i) X 2
≤ D2 J−1
g (x; θ)∂i Jg (x; θ) 2
i
s
(ii) X 2 2
≤ D2 J−1
g (x; θ) 2
∥∂i Jg (x; θ)∥2
i
sX
(iii)

≤ D2 l12 (x)l22 (x)
i
√ ′
= D3 l1 (x)l2 (x),

16
where (i) holds by Von Neumann’s trace inequality, (ii) is due to the property of matrix norm, and
(iii) is follows from the listed assumptions.
Step 2. (Sufficient condition of the Lipschitzness of s(x; θ))
∂ ∂ ∂
Since s(x; θ) = ∂x log pu (g(x; θ)) + ∂x log |det Jg (x; θ)|, we demonstrate that ∂x log pu (g(x; θ))

and ∂x log |det Jg (x; θ)| are both Lipschitz continuous on Θ.

(2.1) Lipschitzness of ∂x log pu (g(x; θ)):
∂ ∂
log pu (g(x; θ1 )) − log pu (g(x; θ2 ))
∂x ∂x 2
T T
= (su (g(x; θ1 ))) Jg (x; θ1 ) − (su (g(x; θ2 ))) Jg (x; θ2 )
2
(i)
≤ ∥su (g(x; θ1 ))∥2 ∥Jg (x; θ1 ) − Jg (x; θ2 )∥2 + ∥su (g(x; θ1 )) − su (g(x; θ2 ))∥2 ∥Jg (x; θ2 )∥2
(ii)
≤ t1 r1 (x) ∥θ1 − θ2 ∥2 + t2 l1 (x) ∥g(x; θ1 ) − g(x; θ2 )∥2
(ii)
≤ t1 r1 (x) ∥θ1 − θ2 ∥2 + t2 l1 (x)r0 (x) ∥θ1 − θ2 ∥2
= (t1 r1 (x) + t2 l1 (x)r0 (x)) ∥θ1 − θ2 ∥2 ,
where (i) is obtained using a similar derivation to Step 1 in Lemma A.9, while (ii) follows from the
listed assumptions.

(2.2) Lipschitzness of ∂/∂x log |det J_g(x; θ)|:

Let M(i, x; θ) ≜ J_g^{-1}(x; θ) ∂_i J_g(x; θ). We first demonstrate that M is Lipschitz continuous:

∥M(i, x; θ_1) − M(i, x; θ_2)∥_2
= ∥J_g^{-1}(x; θ_1) ∂_i J_g(x; θ_1) − J_g^{-1}(x; θ_2) ∂_i J_g(x; θ_2)∥_2
≤^{(i)} ∥J_g^{-1}(x; θ_1)∥_2 ∥∂_i J_g(x; θ_1) − ∂_i J_g(x; θ_2)∥_2 + ∥J_g^{-1}(x; θ_1) − J_g^{-1}(x; θ_2)∥_2 ∥∂_i J_g(x; θ_2)∥_2
≤^{(ii)} l'_1(x) r_2(x) ∥θ_1 − θ_2∥_2 + l_2(x) r'_1(x) ∥θ_1 − θ_2∥_2
= (l'_1(x) r_2(x) + l_2(x) r'_1(x)) ∥θ_1 − θ_2∥_2,

where (i) is obtained by a derivation analogous to Step 1 in Lemma A.9, and (ii) holds by the listed assumptions.

The Lipschitzness of M leads to the Lipschitzness of ∂/∂x log |det J_g(x; θ)|, since:

∥∂/∂x log |det J_g(x; θ_1)| − ∂/∂x log |det J_g(x; θ_2)|∥_2
= ∥v(x; θ_1) − v(x; θ_2)∥_2
= √( Σ_i (Tr(M(i, x; θ_1)) − Tr(M(i, x; θ_2)))^2 )
= √( Σ_i (Tr(M(i, x; θ_1) − M(i, x; θ_2)))^2 )
≤^{(i)} √( Σ_i D^2 ∥M(i, x; θ_1) − M(i, x; θ_2)∥_2^2 )
≤^{(ii)} √( Σ_i D^2 (l'_1(x) r_2(x) + l_2(x) r'_1(x))^2 ∥θ_1 − θ_2∥_2^2 )
= √(D^3) (l'_1(x) r_2(x) + l_2(x) r'_1(x)) ∥θ_1 − θ_2∥_2,

where (i) holds by Von Neumann's trace inequality, and (ii) is due to the Lipschitzness of M.
Step 3. (Sufficient condition of the Lipschitzness of ∂_i s(x; θ))

∂_i s(x; θ) can be decomposed into (∂_i s_u(g(x; θ)))^T J_g(x; θ), (s_u(g(x; θ)))^T ∂_i J_g(x; θ), and ∂_i[v(x; θ)] as follows:

∂_i s(x; θ) = ∂_i[(s_u(g(x; θ)))^T J_g(x; θ)] + ∂_i[v(x; θ)]
= [(∂_i s_u(g(x; θ)))^T J_g(x; θ)] + [(s_u(g(x; θ)))^T ∂_i J_g(x; θ)] + ∂_i[v(x; θ)].

(3.1) The Lipschitzness of (∂_i s_u(g(x; θ)))^T J_g(x; θ) and (s_u(g(x; θ)))^T ∂_i J_g(x; θ) can be derived using proofs similar to that in Step 2.1:

∥(∂_i s_u(g(x; θ_1)))^T J_g(x; θ_1) − (∂_i s_u(g(x; θ_2)))^T J_g(x; θ_2)∥_2 ≤ (t_2 r_1(x) + t_4 r_0(x) l_1(x)) ∥θ_1 − θ_2∥_2,
∥(s_u(g(x; θ_1)))^T ∂_i J_g(x; θ_1) − (s_u(g(x; θ_2)))^T ∂_i J_g(x; θ_2)∥_2 ≤ (t_1 r_2(x) + t_3 r_0(x) l_2(x)) ∥θ_1 − θ_2∥_2.

(3.2) Lipschitzness of ∂_i[v(x; θ)]:

Note that ∂_i[v_j(x; θ)] = ∂_i Tr(M(j, x; θ)) = Tr(∂_i M(j, x; θ)). We first show that ∂_i M(j, x; θ) can be decomposed as:

∂_i M(j, x; θ) = ∂_i[J_g^{-1}(x; θ) ∂_j J_g(x; θ)] = ∂_i J_g^{-1}(x; θ) ∂_j J_g(x; θ) + J_g^{-1}(x; θ) ∂_i ∂_j J_g(x; θ).

The Lipschitz constant of ∂_i M equals (l'_2(x) r_2(x) + l_2(x) r'_2(x) + l'_1(x) r_3(x) + l_3(x) r'_1(x)) based on a derivation similar to that in Step 3.1. The Lipschitzness of ∂_i M(j, x; θ) leads to the Lipschitzness of ∂_i[v(x; θ)]:

∥∂_i[v(x; θ_1)] − ∂_i[v(x; θ_2)]∥_2
= √( Σ_j (Tr(∂_i M(j, x; θ_1)) − Tr(∂_i M(j, x; θ_2)))^2 )
= √( Σ_j (Tr(∂_i M(j, x; θ_1) − ∂_i M(j, x; θ_2)))^2 )
≤^{(i)} √( Σ_j D^2 ∥∂_i M(j, x; θ_1) − ∂_i M(j, x; θ_2)∥_2^2 )
≤^{(ii)} √( Σ_j D^2 (l'_2(x) r_2(x) + l_2(x) r'_2(x) + l'_1(x) r_3(x) + l_3(x) r'_1(x))^2 ∥θ_1 − θ_2∥_2^2 )
= √(D^3) (l'_2(x) r_2(x) + l_2(x) r'_2(x) + l'_1(x) r_3(x) + l_3(x) r'_1(x)) ∥θ_1 − θ_2∥_2,

where (i) holds by Von Neumann's trace inequality, and (ii) is due to the Lipschitzness of ∂_i M. Since Steps 1–3 establish exactly the sufficient conditions stated in Lemma A.9, the function f is Lipschitz continuous, which completes the proof.

A.1.2 Derivation of Eqs. (8) and (9)


Energy-based models are formulated based on the observation that any continuous pdf p(x; θ) can be expressed as a Boltzmann distribution exp(−E(x; θ)) Z^{-1}(θ) [13], where the energy function E(· ; θ) can be modeled as any scalar-valued continuous function. In EBFlow, the energy function E(x; θ) is selected as −log(p_u(g(x; θ)) Π_{g_i ∈ S_n} |det(J_{g_i}(x_{i−1}; θ))|) according to Eq. (9). This suggests that the normalizing constant Z(θ) = ∫ exp(−E(x; θ)) dx is equal to (Π_{g_i ∈ S_l} |det(J_{g_i}(θ))|)^{-1} according to Lemma A.11.

Lemma A.11.

(Π_{g_i ∈ S_l} |det(J_{g_i}(θ))|)^{-1} = ∫_{x ∈ R^D} p_u(g(x; θ)) Π_{g_i ∈ S_n} |det(J_{g_i}(x_{i−1}; θ))| dx.  (A1)

Proof.

1 = ∫_{x ∈ R^D} p(x; θ) dx
= ∫_{x ∈ R^D} p_u(g(x; θ)) Π_{g_i ∈ S_n} |det(J_{g_i}(x_{i−1}; θ))| Π_{g_i ∈ S_l} |det(J_{g_i}(θ))| dx
= Π_{g_i ∈ S_l} |det(J_{g_i}(θ))| ∫_{x ∈ R^D} p_u(g(x; θ)) Π_{g_i ∈ S_n} |det(J_{g_i}(x_{i−1}; θ))| dx.

By multiplying (Π_{g_i ∈ S_l} |det(J_{g_i}(θ))|)^{-1} on both sides of the equation, we arrive at the conclusion:

(Π_{g_i ∈ S_l} |det(J_{g_i}(θ))|)^{-1} = ∫_{x ∈ R^D} p_u(g(x; θ)) Π_{g_i ∈ S_n} |det(J_{g_i}(x_{i−1}; θ))| dx.

[Figure A1: a diagram relating the data pdf and the model pdf through point-wise density evaluation, annotated with the KL divergence (Lemma A.12) and the Fisher divergence (Lemma A.13) between them.]

Figure A1: An illustration of the relationship between the variables discussed in Proposition 4.1, Lemma A.12, and Lemma A.13. x represents a random vector sampled from the data distribution p_x. {g_i}_{i=1}^L is a series of transformations. x_j ≜ g_j ◦ · · · ◦ g_1(x), and p_{x_j} is its pdf. p_j(x_j) = p_u(g_L ◦ · · · ◦ g_{j+1}(x_j)) Π_{i=j+1}^L |det(J_{g_i})|, where p_u is a prior distribution. The properties of the KL divergence and the Fisher divergence presented in the last two rows are derived in Lemmas A.12 and A.13.

A.1.3 Theoretical Analyses of KL Divergence and Fisher Divergence

In this section, we provide formal derivations for Proposition 4.1, Lemma A.12, and Lemma A.13.
To ensure a clear presentation, we provide a visualization of the relationship between the variables
used in the subsequent derivations in Fig. A1.
Lemma A.12. Let p_{x_j} be the pdf of the latent variable x_j ≜ g_j ◦ · · · ◦ g_1(x) indexed by j. In addition, let p_j(·) be a pdf modeled as p_u(g_L ◦ · · · ◦ g_{j+1}(·)) Π_{i=j+1}^L |det(J_{g_i})|, where j ∈ {0, · · · , L − 1}. It follows that:

D_KL[p_{x_j} ∥ p_j] = D_KL[p_x ∥ p_0],  ∀j ∈ {1, · · · , L − 1}.  (A2)

Proof. The equivalence D_KL[p_x ∥ p_0] = D_KL[p_{x_j} ∥ p_j] holds for any j ∈ {1, · · · , L − 1} since:

D_KL[p_x ∥ p_0]
= E_{p_x(x)}[ log( p_x(x) / p_0(x) ) ]
= E_{p_x(x)}[ log( p_{x_j}(g_j ◦ · · · ◦ g_1(x)) Π_{i=1}^j |det(J_{g_i})| / ( p_u(g_L ◦ · · · ◦ g_1(x)) Π_{i=1}^L |det(J_{g_i})| ) ) ]
= E_{p_x(x)}[ log( p_{x_j}(g_j ◦ · · · ◦ g_1(x)) / ( p_u(g_L ◦ · · · ◦ g_1(x)) Π_{i=j+1}^L |det(J_{g_i})| ) ) ]
=^{(i)} E_{p_{x_j}(x_j)}[ log( p_{x_j}(x_j) / ( p_u(g_L ◦ · · · ◦ g_{j+1}(x_j)) Π_{i=j+1}^L |det(J_{g_i})| ) ) ]
= D_KL[p_{x_j} ∥ p_j],

where (i) is due to the property that E_{p_x(x)}[f ◦ g_j ◦ · · · ◦ g_1(x)] = E_{p_{x_j}(x_j)}[f(x_j)] for a given function f. Therefore, D_KL[p_{x_j} ∥ p_j] = D_KL[p_x ∥ p_0], ∀j ∈ {1, · · · , L − 1}.

Lemma A.13. Let p_{x_j} be the pdf of the latent variable x_j ≜ g_j ◦ · · · ◦ g_1(x) indexed by j. In addition, let p_j(·) be a pdf modeled as p_u(g_L ◦ · · · ◦ g_{j+1}(·)) Π_{i=j+1}^L |det(J_{g_i})|, where j ∈ {0, · · · , L − 1}. It follows that:

D_F[p_x ∥ p_0] = (1/2) E_{p_{x_j}(x_j)}[ ∥ ( ∂/∂x_j log( p_{x_j}(x_j) / p_j(x_j) ) ) Π_{i=1}^j J_{g_i} ∥^2 ],  ∀j ∈ {1, · · · , L − 1}.  (A3)

Proof. Based on the definition, the Fisher divergence between p_x and p_0 is written as:

D_F[p_x ∥ p_0]
= (1/2) E_{p_x(x)}[ ∥ ∂/∂x log( p_x(x) / p_0(x) ) ∥^2 ]
= (1/2) E_{p_x(x)}[ ∥ ∂/∂x log( p_{x_j}(g_j ◦ · · · ◦ g_1(x)) Π_{i=1}^j |det(J_{g_i})| / ( p_u(g_L ◦ · · · ◦ g_1(x)) Π_{i=1}^L |det(J_{g_i})| ) ) ∥^2 ]
= (1/2) E_{p_x(x)}[ ∥ ∂/∂x log( p_{x_j}(g_j ◦ · · · ◦ g_1(x)) / ( p_u(g_L ◦ · · · ◦ g_1(x)) Π_{i=j+1}^L |det(J_{g_i})| ) ) ∥^2 ]
= (1/2) E_{p_x(x)}[ ∥ ( ∂/∂(g_j ◦ · · · ◦ g_1(x)) log( p_{x_j}(g_j ◦ · · · ◦ g_1(x)) / ( p_u(g_L ◦ · · · ◦ g_1(x)) Π_{i=j+1}^L |det(J_{g_i})| ) ) ) ( ∂(g_j ◦ · · · ◦ g_1(x)) / ∂x ) ∥^2 ]
=^{(i)} (1/2) E_{p_x(x)}[ ∥ ( ∂/∂(g_j ◦ · · · ◦ g_1(x)) log( p_{x_j}(g_j ◦ · · · ◦ g_1(x)) / ( p_u(g_L ◦ · · · ◦ g_1(x)) Π_{i=j+1}^L |det(J_{g_i})| ) ) ) Π_{i=1}^j J_{g_i} ∥^2 ]
=^{(ii)} (1/2) E_{p_{x_j}(x_j)}[ ∥ ( ∂/∂x_j log( p_{x_j}(x_j) / ( p_u(g_L ◦ · · · ◦ g_{j+1}(x_j)) Π_{i=j+1}^L |det(J_{g_i})| ) ) ) Π_{i=1}^j J_{g_i} ∥^2 ]
= (1/2) E_{p_{x_j}(x_j)}[ ∥ ( ∂/∂x_j log( p_{x_j}(x_j) / p_j(x_j) ) ) Π_{i=1}^j J_{g_i} ∥^2 ],

where (i) is due to the chain rule, and (ii) is because E_{p_x(x)}[f ◦ g_j ◦ · · · ◦ g_1(x)] = E_{p_{x_j}(x_j)}[f(x_j)] for a given function f.

Remark A.14. Lemma A.13 implies that D_F[p_{x_j} ∥ p_j] ≠ D_F[p_x ∥ p_0] in general, as the latter contains an additional multiplier Π_{i=1}^j J_{g_i}, as shown below:

D_F[p_x ∥ p_0] = (1/2) E_{p_{x_j}(x_j)}[ ∥ ( ∂/∂x_j log( p_{x_j}(x_j) / p_j(x_j) ) ) Π_{i=1}^j J_{g_i} ∥^2 ],
D_F[p_{x_j} ∥ p_j] = (1/2) E_{p_{x_j}(x_j)}[ ∥ ∂/∂x_j log( p_{x_j}(x_j) / p_j(x_j) ) ∥^2 ].

Proposition 4.1. Let p_{x_j} be the pdf of the latent variable x_j ≜ g_j ◦ · · · ◦ g_1(x) indexed by j. In addition, let p_j(·) be a pdf modeled as p_u(g_L ◦ · · · ◦ g_{j+1}(·)) Π_{i=j+1}^L |det(J_{g_i})|, where j ∈ {0, · · · , L − 1}. It follows that:

D_F[p_{x_j} ∥ p_j] = 0 ⇔ D_F[p_x ∥ p_0] = 0,  ∀j ∈ {1, · · · , L − 1}.  (A4)

Proof. Based on Remark A.14, the following holds:

D_F[p_{x_j} ∥ p_j] = (1/2) E_{p_{x_j}(x_j)}[ ∥ ∂/∂x_j log( p_{x_j}(x_j) / p_j(x_j) ) ∥^2 ] = 0
⇔^{(i)} ∥ ∂/∂x_j log( p_{x_j}(x_j) / p_j(x_j) ) ∥^2 = 0
⇔^{(ii)} ∥ ( ∂/∂x_j log( p_{x_j}(x_j) / p_j(x_j) ) ) Π_{i=1}^j J_{g_i} ∥^2 = 0
⇔^{(i)} D_F[p_x ∥ p_0] = (1/2) E_{p_{x_j}(x_j)}[ ∥ ( ∂/∂x_j log( p_{x_j}(x_j) / p_j(x_j) ) ) Π_{i=1}^j J_{g_i} ∥^2 ] = 0,

where (i) and (ii) both result from the positiveness condition presented in Assumption A.1. Specifically, for (i), p_{x_j}(x_j) = p_x(g_1^{-1} ◦ · · · ◦ g_j^{-1}(x_j)) Π_{i=1}^j |det(J_{g_i^{-1}})| > 0, since p_x > 0 and Π_{i=1}^j |det(J_{g_i^{-1}})| = Π_{i=1}^j |det(J_{g_i})|^{-1} > 0. Meanwhile, (ii) holds since Π_{i=1}^j |det(J_{g_i})| > 0 and thus all of the singular values of Π_{i=1}^j J_{g_i} are non-zero.

A.2 Experimental Setups

In this section, we elaborate on the experimental setups and provide the detailed configurations
for the experiments presented in Section 5 of the main manuscript. The code implementation for
the experiments is provided in the following repository: https://github.com/chen-hao-chao/ebflow. Our code implementation is developed based on [7, 17, 44].

A.2.1 Experimental Setups for the Two-Dimensional Synthetic Datasets


Datasets. In Section 5.1, we present the experimental results on three two-dimensional synthetic datasets: Sine, Swirl, and Checkerboard. The Sine dataset is generated by sampling data points from the set {(4w − 2, sin(12w − 6)) | w ∈ [0, 1]}. The Swirl dataset is generated by sampling data points from the set {(−π√w cos(π√w), π√w sin(π√w)) | w ∈ [0, 1]}. The Checkerboard dataset is generated by sampling data points from the set {(4w − 2, t − 2s + ⌊4w − 2⌋ mod 2) | w ∈ [0, 1], t ∈ [0, 1], s ∈ {0, 1}}, where ⌊·⌋ is the floor function, and mod represents the modulo operation.
To establish p_x for all three datasets, we smooth a Dirac function using a Gaussian kernel. Specifically, we define the Dirac function as p̂(x̂) ≜ (1/M) Σ_{i=1}^M δ(x̂ − x̂^{(i)}), where {x̂^{(i)}}_{i=1}^M are M uniformly-sampled data points. The data distribution is defined as p_x(x) ≜ ∫ p̂(x̂) N(x | x̂, σ̂^2 I) dx̂ = (1/M) Σ_{i=1}^M N(x | x̂^{(i)}, σ̂^2 I). The closed-form expressions for p_x(x) and ∂/∂x log p_x(x) can be obtained using the derivation in [45]. In the experiments, M is set to 50,000, and σ̂ is fixed at 0.375 for all three datasets.
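A minimal NumPy sketch of this construction is shown below for reference. The function names (sample_sine, sample_swirl, sample_checkerboard, mixture_log_density_and_score) are ours, and the number of mixture centers is reduced from the M = 50,000 used in the experiments purely to keep the example light; the released repository may organize this differently.

```python
import numpy as np

def sample_sine(n, rng):
    # {(4w - 2, sin(12w - 6)) | w in [0, 1]}
    w = rng.uniform(0.0, 1.0, size=n)
    return np.stack([4 * w - 2, np.sin(12 * w - 6)], axis=1)

def sample_swirl(n, rng):
    # {(-pi*sqrt(w)*cos(pi*sqrt(w)), pi*sqrt(w)*sin(pi*sqrt(w))) | w in [0, 1]}
    r = np.pi * np.sqrt(rng.uniform(0.0, 1.0, size=n))
    return np.stack([-r * np.cos(r), r * np.sin(r)], axis=1)

def sample_checkerboard(n, rng):
    # {(4w - 2, t - 2s + floor(4w - 2) mod 2) | w, t in [0, 1], s in {0, 1}}
    w, t = rng.uniform(size=n), rng.uniform(size=n)
    s = rng.integers(0, 2, size=n)
    x1 = 4 * w - 2
    return np.stack([x1, t - 2 * s + np.floor(x1) % 2], axis=1)

def mixture_log_density_and_score(x, centers, sigma):
    """p_x(x) = (1/M) sum_i N(x | x_hat^(i), sigma^2 I); returns log p_x(x) and d/dx log p_x(x)."""
    diff = x[:, None, :] - centers[None, :, :]                      # shape (N, M, 2)
    log_comp = -np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2) \
               - np.log(2 * np.pi * sigma ** 2) - np.log(len(centers))
    log_p = np.logaddexp.reduce(log_comp, axis=1)                   # shape (N,)
    resp = np.exp(log_comp - log_p[:, None])                        # mixture responsibilities
    score = -(resp[:, :, None] * diff).sum(axis=1) / sigma ** 2     # closed-form score
    return log_p, score

rng = np.random.default_rng(0)
centers = sample_sine(5_000, rng)     # the paper uses M = 50,000 centers and sigma_hat = 0.375
log_p, score = mixture_log_density_and_score(sample_sine(256, rng), centers, sigma=0.375)
```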

Implementation Details. The model architecture of g(· ; θ) consists of ten Glow blocks [4]. Each
block comprises an actnorm [4] layer, a fully-connected layer, and an affine coupling layer. Table A2
provides the formal definitions of these operations. pu (·) is implemented as an isotropic Gaussian with
zero mean and unit variance. To determine the best hyperparameters, we perform a grid search over
the following optimizers, learning rates, and gradient clipping values based on the evaluation results
in terms of the KL divergence. The optimizers include Adam [46], AdamW [47], and RMSProp.
The learning rate and gradient clipping values are selected from (5e-3, 1e-3, 5e-4, 1e-4) and (None,
2.5, 10.0), respectively. Table A1 summarizes the selected hyperparameters. The optimization processes of the Sine and Swirl datasets require 50,000 training iterations for convergence, while that of the Checkerboard dataset requires 100,000 iterations. The batch size is fixed at 5,000 for all setups.

A.2.2 Experimental Setups for the Real-world Datasets

Datasets. The experiments presented in Section 5.2 are performed on the MNIST [19] and CIFAR-
10 [37] datasets. The training and test sets of MNIST and CIFAR-10 contain 50,000 and 10,000
images, respectively. The data are smoothed using the uniform dequantization method presented
in [1]. The observable parts (i.e., xO ) of the images in Fig. 5 are produced using the pre-trained
model in [48].
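For reference, uniform dequantization in the sense of [1] adds uniform noise to the discrete pixel intensities so that the data admit a continuous density. The snippet below is only a sketch of this preprocessing step; the exact scaling constants used in our implementation are not restated here.

```python
import torch

def uniform_dequantize(x_uint8: torch.Tensor) -> torch.Tensor:
    """Map 8-bit pixel values in {0, ..., 255} to continuous values in [0, 1)."""
    x = x_uint8.float()
    return (x + torch.rand_like(x)) / 256.0   # u ~ U[0, 1) smooths the discrete data
```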
Implementation Details. In Sections 5.2 and 5.4, we adopt three types of model architectures:
FC-based [7], CNN-based, and Glow [4] models. The FC-based model contains two fully-connected
layers and a smoothed leaky ReLU non-linearity [7] in between, which is identical to [7]. The CNN-
based model consists of three convolutional blocks and two squeezing operations [2], one inserted between each pair of consecutive blocks. Each convolutional block contains two convolutional layers and a smoothed
leaky ReLU in between. The Glow model adopted in Section 5.4 is composed of 16 Glow blocks.
Each of the Glow block consists of an actnorm [4] layer, a convolutional layer, and an affine coupling
layer. The squeezing operation is inserted between every eight blocks. The operations used in
these models are summarized in Table A2. The smoothness factor α of Smooth Leaky ReLU is
set to 0.3 and 0.6 for models trained on MNIST and CIFAR-10, respectively. The scaling and
transition functions s(· ; θ) and t(· ; θ) of the affine coupling layers are convolutional blocks with
ReLU activation functions. The prior distribution pu (·) is implemented as an isotropic Gaussian
with zero mean and unit variance. The FC-based and CNN-based models are trained with RMSProp
using a learning rate initialized at 1e-4 and a batch size of 100. The Glow model is trained with
an Adam optimizer using a learning rate initialized at 1e-4 and a batch size of 100. The gradient
clipping value is set to 500 during the training for the Glow model. The learning rate scheduler
MultiStepLR in PyTorch is used for gradually decreasing the learning rates. The hyper-parameters
{σ, ξ} used in DSM and FDSSM are selected based on a grid search over {0.05, 0.1, 0.5, 1.0}. The
selected {σ, ξ} are {1.0, 1.0} and {0.1, 0.1} for the MNIST and CIFAR-10 datasets, respectively.
The parameter m in EMA is set to 0.999. The algorithms are implemented using PyTorch [39].
The gradients w.r.t. x and θ are both calculated using the automatic differentiation tools [40] provided by
PyTorch [39]. The runtime is evaluated on Tesla V100 NVIDIA GPUs. In the experiments performed
on CIFAR-10 and CelebA using score-matching methods, the expected energy (i.e., E_{p_x(x)}[E(x; θ)]) is added as a regularization loss with a balancing factor fixed at 0.001 during the optimization
processes. The results in Fig. 2 (b) are smoothed with the exponential moving average function used
in Tensorboard [49], i.e., w × di−1 + (1 − w) × di , where w is set to 0.45 and di represents the
evaluation result at the i-th iteration.
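For completeness, the smoothing recursion above amounts to the following few lines (with w = 0.45); this mirrors TensorBoard's behavior and is only used for visualization.

```python
def ema_smooth(values, w=0.45):
    """TensorBoard-style smoothing: d_i <- w * d_{i-1} + (1 - w) * d_i."""
    smoothed, prev = [], values[0]
    for d in values:
        prev = w * prev + (1 - w) * d
        smoothed.append(prev)
    return smoothed
```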

Table A1: The hyper-parameters used in the two-dimensional synthetic example in Section 5.1.

Dataset       | Hyper-parameter | ML    | SML   | SSM   | DSM   | FDSSM
Sine          | Optimizer       | Adam  | AdamW | Adam  | Adam  | Adam
              | Learning Rate   | 5e-4  | 5e-4  | 1e-4  | 1e-4  | 1e-4
              | Gradient Clip   | 1.0   | None  | 1.0   | 1.0   | 1.0
Swirl         | Optimizer       | Adam  | Adam  | Adam  | Adam  | Adam
              | Learning Rate   | 5e-3  | 1e-4  | 1e-4  | 1e-4  | 1e-4
              | Gradient Clip   | None  | 10.0  | 10.0  | 10.0  | 2.5
Checkerboard  | Optimizer       | AdamW | AdamW | AdamW | AdamW | Adam
              | Learning Rate   | 1e-4  | 1e-4  | 1e-4  | 1e-4  | 1e-4
              | Gradient Clip   | 10.0  | 10.0  | 10.0  | 10.0  | 10.0

Table A2: The components of g(· ; θ) used in this paper. In this table, z and y are the output and the input of a layer, respectively. β and γ represent the mean and variance of an actnorm layer. w is a convolutional kernel, and w ⋆ y ≜ Ŵ y, where ⋆ is a convolutional operator, and Ŵ is a D × D matrix. W and b represent the weight and bias in a fully-connected layer. α is a hyper-parameter for adjusting the smoothness of smooth leaky ReLU. In the affine coupling layer, z and y are split into two parts {z_a, z_b} and {y_a, y_b}, respectively. s(· ; θ) and t(· ; θ) are the scaling and transition networks parameterized with θ. sig(y) = 1/(1 + exp(−y)) represents the sigmoid function. dim(·) represents the dimension of the input vector. y[i] represents the i-th element of vector y.

Layer                  | Function                                       | Log Jacobian Determinant                        | Set
actnorm [4]            | z = (y − β)/γ                                  | Σ_{i=1}^D log(1/γ[i])                           | S_l
convolutional          | z = w ⋆ y + b                                  | log |det(Ŵ)|                                    | S_l
fully-connected        | z = W y + b                                    | log |det(W)|                                    | S_l
smooth leaky ReLU [7]  | z = αy + (1 − α) log(1 + exp(y))               | Σ_{i=1}^D log(α + (1 − α) sig(y[i]))            | S_n
affine coupling [4]    | z_a = s(y_b ; θ) y_a + t(y_b ; θ), z_b = y_b   | Σ_{i=1}^{dim(y_b)} log s(y_b ; θ)[i]            | S_n
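To make the S_l / S_n distinction in Table A2 concrete, the following PyTorch sketch implements two representative layers: the element-wise smooth leaky ReLU, whose log Jacobian determinant depends on the input, and a fully-connected layer, whose log Jacobian determinant is input-independent. The class names and initialization are ours for illustration and do not necessarily match the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothLeakyReLU(nn.Module):
    """z = alpha * y + (1 - alpha) * log(1 + exp(y)); element-wise, belongs to S_n."""
    def __init__(self, alpha=0.3):
        super().__init__()
        self.alpha = alpha

    def forward(self, y):                       # y has shape (batch, D)
        z = self.alpha * y + (1 - self.alpha) * F.softplus(y)
        # dz/dy = alpha + (1 - alpha) * sigmoid(y) > 0, so the map is invertible.
        log_det = torch.log(self.alpha + (1 - self.alpha) * torch.sigmoid(y)).sum(dim=1)
        return z, log_det                       # log-det depends on the input y

class FullyConnected(nn.Module):
    """z = W y + b; its log Jacobian determinant log|det(W)| is input-independent (S_l)."""
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.eye(dim) + 0.01 * torch.randn(dim, dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, y):
        z = y @ self.weight.T + self.bias
        log_det = torch.slogdet(self.weight)[1]  # a single scalar shared by all samples
        return z, log_det.expand(y.shape[0])
```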

Table A3: The simulation results of Eq. (A6). The error rate is measured by |d_true − d_est| / |d_true|, where d_true and d_est represent the true and estimated Jacobian determinants, respectively.

                       | D = 50   | D = 100  | D = 200
Error Rate (M = 50)    | 0.004211 | 0.099940 | 0.355314
Error Rate (M = 100)   | 0.003503 | 0.034608 | 0.076239
Error Rate (M = 200)   | 0.002332 | 0.015411 | 0.011175

Results of the Related Works. The results of the relative gradient [7], SSM [16], and FDSSM [17] methods are directly obtained from their original papers, while the results of the DSM method are obtained from [17]. Please note that the results reported in [16] and [17] differ from each other even though they both adopt the NICE [1] model. Specifically, the SSM method achieves NLL = 3,355 in [16] and NLL = 6,234 in [17], while the DSM method achieves NLL = 4,363 in [16] and NLL = 3,398 in [17]. In Table 4, we report the results with the lower NLL.

A.3 Estimating the Jacobian Determinants using Importance Sampling

Importance sampling is a technique used to estimate integrals, which can be employed to approximate
the normalizing constant Z(θ) of an energy-based model. In this method, one selects a pdf q that has a simple closed form and can be easily sampled from. The normalizing constant can then be expressed as follows:
Z(θ) = ∫_{x ∈ R^D} exp(−E(x; θ)) dx = ∫_{x ∈ R^D} q(x) ( exp(−E(x; θ)) / q(x) ) dx
     = E_{q(x)}[ exp(−E(x; θ)) / q(x) ] ≈ (1/M) Σ_{j=1}^M exp(−E(x̂^{(j)}; θ)) / q(x̂^{(j)}),  (A5)

where {x̂^{(j)}}_{j=1}^M represents M i.i.d. samples drawn from q. According to Lemma A.11, the Jacobian determinants of the layers in S_l can be approximated using Eq. (A5) as follows:

(Π_{g_i ∈ S_l} |det(J_{g_i}(θ))|)^{-1} ≈ (1/M) Σ_{j=1}^M ( p_u(g(x̂^{(j)}; θ)) Π_{g_i ∈ S_n} |det(J_{g_i}(x̂_{i−1}^{(j)}; θ))| ) / q(x̂^{(j)}).  (A6)

Table A4: An overall comparison between EBFlow, the baseline method, the Relative Gradient method [7], and the methods that utilize specially designed linear layers [8–12, 29]. The notations ✓/✗ in the row 'Unbiased' represent whether the models are optimized according to an unbiased target. On the other hand, the notations ✓/✗ in the row 'Unconstrained' represent whether the models can be constructed with arbitrary linear transformations. (†) The approximation error o(ξ) of FDSSM is controlled by its hyper-parameter ξ. (‡) The error o(W) of the Relative Gradient method is determined by the values of a model's weights.

               | -------------------- KL-Divergence-Based -------------------- | ---------- Fisher-Divergence-Based ----------
               | Baseline (ML) | EBFlow (SML) | Relative Grad. | Special Linear | EBFlow (SSM) | EBFlow (DSM) | EBFlow (FDSSM)
Complexity     | O(D^3 L)      | O(D^3 L)     | O(D^2 L)       | O(D^2 L)       | O(D^2 L)     | O(D^2 L)     | O(D^2 L)
Unbiased       | ✓             | ✓            | ✗ (‡)          | ✓              | ✓            | ✗            | ✗ (†)
Unconstrained  | ✓             | ✓            | ✓              | ✗              | ✓            | ✓            | ✓

To validate this idea, we provide a simple simulation with p_u = N(0, I), q = N(0, I), g(x; W) = Wx, M ∈ {50, 100, 200}, and D ∈ {50, 100, 200} in Table A3. The results show that larger values of M lead to more accurate estimation of the Jacobian determinants. Typically, the choice of q is crucial to the accuracy of importance sampling. To obtain an accurate approximation, one can adopt the technique of annealed importance sampling (AIS) [33] or the Reverse AIS Estimator (RAISE) [34], which are commonly-adopted algorithms for effectively estimating Z(θ).
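A sketch of this simulation is given below, assuming p_u = q = N(0, I) and a single linear layer g(x; W) = Wx. The particular choice of W is ours, and we compare log-determinants (rather than the determinants used for the error rates in Table A3) purely for numerical stability.

```python
import math
import torch

def is_logdet_estimate(W, M=200):
    """Estimate log|det(W)| via Eq. (A6) with p_u = q = N(0, I) and g(x; W) = W x."""
    D = W.shape[0]
    x = torch.randn(M, D)                                  # x_hat^(j) ~ q = N(0, I)
    u = x @ W.T                                            # g(x_hat^(j); W)
    log_pu = -0.5 * (u ** 2).sum(dim=1) - 0.5 * D * math.log(2 * math.pi)
    log_q = -0.5 * (x ** 2).sum(dim=1) - 0.5 * D * math.log(2 * math.pi)
    # (1/M) * sum_j p_u(g(x_hat^(j))) / q(x_hat^(j)) approximates |det(W)|^(-1)
    log_inv_det = torch.logsumexp(log_pu - log_q, dim=0) - math.log(M)
    return -log_inv_det                                    # estimate of log|det(W)|

D = 100
W = torch.eye(D) + 0.05 * torch.randn(D, D)                # an illustrative, well-conditioned W
est, true = is_logdet_estimate(W, M=200), torch.slogdet(W)[1]
print(f"estimated log|det(W)|: {est.item():.4f}   true: {true.item():.4f}")
```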
Eq. (A6) can be interpreted as a generalization of the stochastic estimator presented in [50], where
the distributions pu and q are modeled as isotropic Gaussian distributions, and g is restricted as
a linear transformation. For further analysis of this concept, particularly in the context of
determinant estimation for matrices, we refer readers to Section I of [50], where a more sophisticated
approximation approach and the corresponding experimental findings are provided.

A.4 A Comparison among the Methods Discussed in this Paper

In Sections 2, 3, and 4, we discuss various methods for efficiently training flow-based models.
To provide a comprehensive comparison of these methods, we summarize their complexity and
characteristics in Table A4.

A.5 The Impacts of the Constraint of Linear Transformations on the Performance of a Flow-based Model

In this section, we examine the impact of the constraints of linear transformations on the performance of a flow-based model.

[Figure A2: four weight-matrix diagrams — Full Matrix (F), Lower Triangular Matrix (L), Upper Triangular Matrix (U), and Lower & Upper Triangular Matrices (LU) — with entries marked as either learnable weights or masked weights (not learnable).]

Figure A2: An illustration of the weight matrices in the F, L, U, and LU layers described in Section A.5.

Figure A3: Visualized marginal distributions of p_{x[i]} for i = 1, 2, 3, and 4.

A key distinction between constrained and unconstrained linear layers lies in how they model the correlation between the elements of a data vector. Constrained linear transformations, such as those used in the previous works [8–12, 29], impose predetermined correlations that are not learnable during the optimization process. For instance, masked linear layers [8–10] are constructed by masking either the upper or the lower triangular part of the weight matrix in a linear layer. In contrast, unconstrained linear layers have weight matrices that are fully learnable, making them more flexible than their constrained counterparts.
To demonstrate the influence of such constraints on the expressiveness of a model, we provide a performance comparison between flow-based models constructed using different types of linear layers. Specifically, we compare the performance of models constructed using linear layers with full matrices, lower triangular matrices, upper triangular matrices, and matrices formed as the product of a lower and an upper triangular matrix. These four types of linear layers are hereafter denoted as F, L, U, and LU, respectively, and the differences between them are depicted in Fig. A2. Furthermore, to highlight the performance discrepancy between these models, we construct the target distribution p_x based on an autoregressive relationship among the elements of the data vector x. Let x[i] denote the i-th element of x, and p_{x[i]} represent its associated pdf. x[i] is constructed according to the following equation:

x[i] = u[0],                                  if i = 1,
x[i] = tanh(u[i] × s) × (x[i−1] + d × 2^i),   if i ∈ {2, . . . , D},  (A7)

where u is sampled from an isotropic Gaussian, and s and d are coefficients controlling the shape of and the distance between the modes, respectively. In Eq. (A7), the function tanh(·) can be intuitively viewed as a smoothed variant of the function 2H(·) − 1, where H(·) represents the Heaviside step function. In this context, the value of (x[i−1] + d × 2^i) is multiplied by a factor close to either −1 or 1, which either preserves its sign or flips a positive number into a negative one. Fig. A3 depicts a number of examples of p_{x[i]} constructed using this method. By employing this approach to design p_x, where capturing p_{x[i]} is presumed to be more challenging than modeling p_{x[j]} for any j < i, we can inspect how the applied constraints impact performance. Inappropriately masking the linear layers, as in the U-type layer, is anticipated to result in degraded performance, similar to the anti-causal effect explained in [51].
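A sketch of the corresponding sampling procedure is provided below; the indices are zero-based, and the values of s and d are placeholders, since the exact coefficients used in our experiments are not restated here.

```python
import numpy as np

def sample_autoregressive(n, D=10, s=5.0, d=0.1, rng=None):
    """Draw n samples of x following Eq. (A7); s and d are illustrative values only."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal((n, D))          # u ~ N(0, I)
    x = np.empty((n, D))
    x[:, 0] = u[:, 0]                        # first coordinate: x[1] = u[0]
    for j in range(1, D):                    # remaining coordinates (paper index i = j + 1)
        x[:, j] = np.tanh(u[:, j] * s) * (x[:, j - 1] + d * 2 ** (j + 1))
    return x

samples = sample_autoregressive(10_000)
```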
In this experiment, we construct flow-based models using the smoothed leaky ReLU activation and different types of linear layers (i.e., F, L, U, and LU) with a dimensionality of D = 10. The models are optimized according to Eq. (2). The performance of these models is evaluated in terms of NLL, and the corresponding trends are depicted in Fig. A4. It is observed that the flow-based model built with the F-type layers achieves the lowest NLL, indicating the advantage of using unconstrained weight matrices in linear layers. In addition, there is a noticeable performance discrepancy between the models with L-type and U-type layers, indicating that imposing inappropriate constraints on linear layers may negatively affect the modeling ability of flow-based models. Furthermore, even when both L-type and U-type layers are adopted, as shown in the red curve in Fig. A4, the performance remains inferior to that of the models using F-type layers. This experimental evidence suggests that linear layers constructed based on matrix decomposition (e.g., [4, 9]) may not possess the same expressiveness as unconstrained linear layers.

[Figure A4: NLL versus training step (0–10,000) for the F, L, U, and LU models.]

Figure A4: The evaluation curves in terms of NLL of the flow-based models constructed with the F-type, L-type, U-type, and LU-type layers. The curves and shaded areas depict the mean and the 95% confidence interval of three independent runs.
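For completeness, the four weight parameterizations compared in this section can be instantiated along the following lines. This is only a sketch with our own initialization and naming; the released implementation may differ.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer z = W y + b, where W is built from (optionally masked) factors.
    'F': full learnable matrix; 'L'/'U': lower/upper triangular; 'LU': product of both."""

    def __init__(self, dim, layer_type="F"):
        super().__init__()
        self.layer_type = layer_type
        self.w1 = nn.Parameter(torch.eye(dim) + 0.01 * torch.randn(dim, dim))
        self.w2 = nn.Parameter(torch.eye(dim) + 0.01 * torch.randn(dim, dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.register_buffer("tril", torch.tril(torch.ones(dim, dim)))  # keeps diagonal and below
        self.register_buffer("triu", torch.triu(torch.ones(dim, dim)))  # keeps diagonal and above

    def weight(self):
        if self.layer_type == "F":
            return self.w1                               # all entries learnable
        if self.layer_type == "L":
            return self.w1 * self.tril                   # masked entries fixed at zero
        if self.layer_type == "U":
            return self.w1 * self.triu
        return (self.w1 * self.tril) @ (self.w2 * self.triu)   # 'LU'

    def forward(self, y):
        w = self.weight()
        return y @ w.T + self.bias, torch.slogdet(w)[1]  # output and log|det(W)|

layer = MaskedLinear(10, layer_type="LU")
z, log_det = layer(torch.randn(32, 10))
```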

A.6 Limitations and Discussions

We noticed that score-matching methods sometimes have difficulty capturing the relative weights of the individual modes within a multi-modal distribution. This deficiency is illustrated in Fig. A5 (a), where EBFlow fails to accurately capture the density of the Checkerboard dataset. This phenomenon bears resemblance to the blindness problem discussed in [52]. While the solution proposed in [52] has the potential to address this issue, their approach is not directly applicable to the flow-based architectures employed in this paper.


Figure A5: (a) Visualized examples of EBFlow trained with SSM, DSM, and FDSSM on the
Checkerboard dataset. (b) The samples generated by the Glow model at the 40-th training epoch. (c)
The samples generated by the Glow model at the 80-th training epoch.

In addition, we observed that the sampling quality of EBFlow occasionally experiences a significant
reduction during the training iterations. This phenomenon is illustrated in Fig. A5 (b) and (c), where
the Glow model trained using our approach demonstrates a decline in performance with extended
training periods. The underlying cause of this phenomenon remains unclear, and we consider it a
potential avenue for future investigation.
