
Fourier Basis Density Model

Alfredo De la Fuente, Google, New York, NY, 10011, USA, [email protected]
Saurabh Singh, Google Research, Mountain View, CA, 94043, USA, [email protected]
Johannes Ballé, Google Research, New York, NY, 10011, USA, [email protected]

arXiv:2402.15345v1 [cs.LG] 23 Feb 2024

Abstract—We introduce a lightweight, flexible, and end-to-end trainable probability density model parameterized by a constrained Fourier basis. We assess its performance at approximating a range of multi-modal 1D densities, which are generally difficult to fit. In comparison to the deep factorized model introduced in [1], our model achieves a lower cross entropy at a similar computational budget. In addition, we also evaluate our method on a toy compression task, demonstrating its utility in learned compression.

Index Terms—density estimation, Herglotz's theorem, Fourier basis, entropy model.

I. INTRODUCTION
Density estimation is a ubiquitous problem in statistics and machine learning. Given a set of i.i.d. samples from an unknown true distribution, we aim to find the parameters of a density model that best describes this target distribution. In particular, the Kullback–Leibler divergence (KLD) between the model and the true distribution is often used to measure the quality of fit. Within the field of lossy neural data compression [2], [3], the cross entropy, which is closely related to the KLD, is directly connected with the bit rate of the compression method. In this context, the model is also referred to as an entropy model. It is critical to be able to model arbitrary density functions in order to develop efficient learned compression systems.

To fit a density model to a limited set of samples, we need to assume the density belongs to a particular class of functions. Since there may not be prior knowledge about the target distribution, restricting the model to simple parametric distributions such as Gaussians, Laplacians, etc. would not generally satisfy the need to obtain a good fit. Non-parametric approaches, which generally have extensible families of parameters and can often guarantee fitting arbitrary functions in the asymptotic limit, include mixtures of Gaussians [4], mixtures of kernel functions [5], Parzen windows [6], and others. Here, we explore the use of Fourier series to model probability densities. Any function has a Fourier series expansion, albeit with a potentially infinite sequence of coefficients. Truncating this sequence restricts the series to smooth functions, which is a reasonable implicit bias for our purposes.

A different approach, based on modeling the cumulative distribution function (CDF) with a multi-layer perceptron (MLP), is the deep factorized probability (DFP) model [1]. The MLP is constrained to have strictly non-negative weights and specialized activation functions that guarantee monotonicity of the CDF. The model has proven quite popular (e.g., [7]), but there are questions regarding how general and parameter-efficient it is. The range of possible distributions that the DFP can model is hard to understand given the intricacies of its nonlinearities. In particular, empirical evidence suggests that the model struggles to approximate multi-modal distributions accurately.

Through a number of experiments, we analyze the properties of the proposed Fourier basis density model and how its performance can depend on the data distribution. Since neural compression models such as Nonlinear Transform Coding (NTC) [2] work by compressing one dimension at a time, our proposed model is directly applicable to such tasks.

II. MODEL DEFINITION

Truncated Fourier series (i.e., with all but the first N coefficients assumed zero) are a canonical way to represent smooth functions. Our aim is to represent a probability density function p(x) with x ∈ R as a Fourier series with a finite number of coefficients, and to find these coefficients using stochastic optimization (for example, by stochastic gradient descent). In what follows, we first construct a flexible periodic density model, and then extend it to the entire real line. Note that c^* and |c| denote the complex conjugate and magnitude, respectively, of a complex number c.

Let us begin with a probability density defined as p(x) ≡ f(x)/Z, where f(x) is a periodic (with period 2), real-valued, positive, smooth function and Z = \int_{-1}^{1} f(x)\, dx is the normalization constant. We represent f(x) in terms of its complex-valued truncated Fourier series coefficients c_n ∈ C:

    f(x) = \sum_{n=-N}^{N} c_n \exp(i n \pi x),    (1)

where i ≡ \sqrt{-1}. Conversely, we can write the coefficients as

    c_n = \frac{1}{2} \int_{-1}^{1} f(x) \exp(-i n \pi x)\, dx.    (2)

Note that, because f(x) is real-valued, the coefficients obey the symmetry c_n = c_{-n}^* for all n. Consequently, the negative frequencies n < 0 are redundant and need not be considered model parameters. Now, we desire a model that represents a flexible and valid probability density on R, so it needs to be 1) non-negative, 2) normalized, and 3) non-periodic, with support on the full domain R. We ensure this as follows.
A. Non-Negativity

To guarantee non-negativity, we consider Herglotz's theorem [8]. It states that the Fourier coefficients of a non-negative function form a positive semi-definite sequence. In other words, f(x) is non-negative if and only if c_n is a positive semi-definite sequence. A simple way to ensure this is to parameterize c_n as an autocorrelation sequence, i.e., for n = 0, ..., N:

    c_n = \sum_{k=0}^{N-n} a_k a_{k+n}^*,    (3)

where a_n ∈ C is an arbitrary sequence defined for n = 0, ..., N (and assumed zero otherwise). We can thus consider θ ≡ {a_n}_{n=0}^{N} as the parameters of the model to be fitted and guarantee, by plugging (3) into (1), that f(x) is always non-negative.
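For concreteness, a minimal JAX sketch of the parameterization in eq. (3) and the series evaluation in eq. (1) is given below. It is an illustrative sketch only (not the CoDeX implementation), and the function names are placeholders.

```python
import jax.numpy as jnp

def autocorrelation_coeffs(a):
    """c_n = sum_{k=0}^{N-n} a_k * conj(a_{k+n}) for n = 0..N, as in eq. (3)."""
    N = a.shape[0] - 1
    return jnp.stack([jnp.sum(a[: N + 1 - n] * jnp.conj(a[n:])) for n in range(N + 1)])

def eval_f(x, coeffs):
    """Unnormalized series f(x) of eq. (1), using the symmetry c_{-n} = conj(c_n)."""
    x = jnp.asarray(x)
    n = jnp.arange(1, coeffs.shape[0])
    phases = jnp.exp(1j * jnp.pi * n * x[..., None])
    # c_0 + sum_{n>=1} (c_n e^{i n pi x} + conj(c_n) e^{-i n pi x})
    return jnp.real(coeffs[0]) + 2.0 * jnp.sum(jnp.real(coeffs[1:] * phases), axis=-1)

# For any complex parameter vector `a`, plugging (3) into (1) yields f(x) >= 0:
# f_vals = eval_f(jnp.linspace(-1.0, 1.0, 1001), autocorrelation_coeffs(a))
```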
[Figure] Fig. 1: Model fit for a mixture of beta distributions. a) Density fit for a 64-term Fourier basis density model (best viewed on screen). b) KLD vs. number of parameters: the fit improves with an increasing number of parameters.

[Figure] Fig. 2: Model fit for a mixture of logit-normal distributions. a) Density fit, N = 64. b) KLD vs. number of parameters.

B. Normalization

To compute the normalization constant Z, note in (2) that the integral over one period of f(x) is 2c_0. Thus, if we limit the density to a single period, the normalization constant is available directly as Z = 2c_0. Using this normalizer, together with non-negativity and the symmetry of c_n, we can now define a valid density model on (−1, 1):

    p(x; \theta) = \frac{1}{2} + \mathrm{Re}\sum_{n=1}^{N} \frac{c_n}{c_0} \exp(i n \pi x),    (4)

where c_n is as defined in eq. (3). Note that the cumulative distribution function (CDF) P(x) also has a simple closed-form expression:

    P(x; \theta) = \frac{x}{2} + \mathrm{Re}\sum_{n=1}^{N} \frac{c_n}{\pi i n\, c_0} \exp(i n \pi x) + C(\theta),    (5)

where C is a function of the parameters that ensures P(−1) = 0 and P(1) = 1.
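Continuing the illustrative sketch above (same caveats: placeholder names, not the CoDeX API), eqs. (4) and (5) can be evaluated as follows; the constant C(θ) is obtained by anchoring the series at x = −1.

```python
import jax.numpy as jnp

def density_periodic(x, coeffs):
    """p(x; theta) on (-1, 1), eq. (4): f(x) normalized by Z = 2 c_0."""
    x = jnp.asarray(x)
    n = jnp.arange(1, coeffs.shape[0])
    phases = jnp.exp(1j * jnp.pi * n * x[..., None])
    c0 = jnp.real(coeffs[0])
    return 0.5 + jnp.sum(jnp.real(coeffs[1:] * phases), axis=-1) / c0

def cdf_periodic(x, coeffs):
    """P(x; theta) on [-1, 1], eq. (5), with C(theta) chosen so that P(-1) = 0."""
    x = jnp.asarray(x)
    n = jnp.arange(1, coeffs.shape[0])
    c0 = jnp.real(coeffs[0])
    terms = coeffs[1:] / (1j * jnp.pi * n * c0)
    series = jnp.sum(jnp.real(terms * jnp.exp(1j * jnp.pi * n * x[..., None])), axis=-1)
    series_at_minus_one = jnp.sum(jnp.real(terms * jnp.exp(-1j * jnp.pi * n)), axis=-1)
    return (x + 1.0) / 2.0 + series - series_at_minus_one
```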
C. Support on R

Lastly, to extend this model to the entire real line, we consider the mapping g : (−1, 1) → R, parameterized by a scaling s and an offset t:

    g(x; s, t) = s \cdot \tanh^{-1}(x) + t = \frac{s}{2} \ln\frac{1+x}{1-x} + t.    (6)

The CDF Q of the mapped variable can be written directly as

    Q(x; \theta, s, t) = P\bigl(g^{-1}(x; s, t); \theta\bigr),    (7)

with

    g^{-1}(x; s, t) = \tanh\Bigl(\frac{x-t}{s}\Bigr),    (8)

whereas for the density q, the derivative of g^{-1} needs to be taken into account:

    q(x; \theta, s, t) = p\bigl(g^{-1}(x; s, t); \theta\bigr)\, (g^{-1})'(x; s, t)
                       = p\Bigl(\tanh\frac{x-t}{s}; \theta\Bigr)\, \frac{1}{s}\,\mathrm{sech}^2\Bigl(\frac{x-t}{s}\Bigr).    (9)
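Under the same assumptions, the change of variables in eqs. (7)–(9) amounts to a short reuse of the periodic sketches above (again, illustrative code only).

```python
import jax.numpy as jnp

def density_real_line(x, coeffs, s=1.0, t=0.0):
    """q(x; theta, s, t) of eq. (9), reusing density_periodic from the sketch above."""
    u = jnp.tanh((x - t) / s)          # g^{-1}(x; s, t), eq. (8)
    dudx = (1.0 - u ** 2) / s          # (g^{-1})'(x) = sech^2((x - t)/s) / s
    return density_periodic(u, coeffs) * dudx

def cdf_real_line(x, coeffs, s=1.0, t=0.0):
    """Q(x; theta, s, t) of eq. (7)."""
    return cdf_periodic(jnp.tanh((x - t) / s), coeffs)
```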
D. Weight Regularization

In the presence of limited data, more than one choice of parameters may fit the data equally well, even for a truncated series. To express a preference towards smoother densities, we penalize the total squared variation of the unnormalized density f(x) with the following regularization loss:

    \mathcal{L}_{\mathrm{reg}}(\theta) \equiv \gamma \int_{-1}^{1} \Bigl|\frac{df(x)}{dx}\Bigr|^2 dx = \gamma \sum_{n=-N}^{N} 2\pi^2 n^2 |c_n|^2.    (10)

Here, γ > 0 is a hyperparameter specifying the weight of the regularization. This leads to an intuitive penalty on the frequency coefficients, where the coefficients for the higher frequencies are penalized more than the lower ones. The hyperparameter γ needs to be selected manually, but in our experiments the optimization outcome is quite robust to this choice. The regularization term appears to stabilize the training dynamics as well.

To establish the equality in eq. (10), first note that

    \frac{df(x)}{dx} = \sum_{n=-N}^{N} i n \pi c_n \exp(i n \pi x),    (11)

so that

    \int_{-1}^{1} \Bigl|\frac{df(x)}{dx}\Bigr|^2 dx = \pi^2 \sum_{n=-N}^{N} \sum_{m=-N}^{N} n m\, c_n c_m^* \int_{-1}^{1} e^{i(n-m)\pi x}\, dx.    (12)

Now recall that

    \int_{-1}^{1} \exp\bigl(i(n-m)\pi x\bigr)\, dx = \begin{cases} 0 & \text{if } n \neq m, \\ 2 & \text{if } n = m, \end{cases}    (13)

which immediately leads to eq. (10).
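A sketch of eq. (10) in the same illustrative style, using only the non-negative half of the spectrum that is actually stored:

```python
import jax.numpy as jnp

def regularization_loss(coeffs, gamma=1e-6):
    """Total squared variation penalty of eq. (10), from the coefficients c_0..c_N.

    The sum over n = -N..N in eq. (10) counts each |n| >= 1 term twice,
    since |c_{-n}| = |c_n|, and the n = 0 term vanishes.
    """
    n = jnp.arange(coeffs.shape[0])
    return gamma * 4.0 * jnp.pi ** 2 * jnp.sum(n ** 2 * jnp.abs(coeffs) ** 2)
```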
[Figure] Fig. 3: Model fit for a mixture of 3 Gaussians. a) Density fit for a 64-term Fourier basis density model (best viewed on screen). b) KLD vs. number of parameters: the fit improves with an increasing number of parameters.

[Figure] Fig. 5: a) Model fit for a mixture of K Gaussians (best viewed on screen). b) KLD between model and target as a function of K, for the deep factorized model (DFP) and the Fourier basis density model (FBM) on a fixed parameter budget (∼90 parameters).
[Figure] Fig. 4: Model fit for a mixture of Gaussians and Laplacians. a) Density fit, N = 64. b) KLD vs. number of parameters.

[Figure] Fig. 6: Model fit for a mixture of 25 Gaussians. a) Our proposed model captures most of the target distribution modes. b) In contrast, the deep factorized model covers the modes, but also assigns a lot of probability mass to other regions. (Panels: (a) Fourier basis density model, (b) deep factorized model.)

III. EXPERIMENTAL EVALUATION
In order to compare the density estimation and compression performance of our proposed model, we conducted a number of experiments using the CoDeX [9] library in the JAX framework [10].

A. Experimental Setup

We optimize the parameters of the density models by maximizing their log likelihood on samples from the target distributions, using the Adam optimizer [11] with a cosine decay schedule for the learning rate, initialized at 10^{-4}. The models were trained for 500 epochs of 500 steps each, with a batch size of 128 and a validation batch size of 2048. After hyperparameter tuning, we found the regularization penalty γ = 10^{-6} to work well. For the DFP model, we consider neural network architectures with three hidden layers of M units each, where M ∈ {5, 10, 20, 30}. Both models were trained using the same number of iterations, learning rate, and optimizer. Further, we set the scale s = 1 and offset t = 0, unless otherwise specified.

As evaluation benchmarks, we experimented with univariate multi-modal distributions expressed as mixtures of Gaussian, beta, logit-normal, and Laplacian distributions. For the compression task, we use the banana distribution from [2].
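A minimal training sketch of this setup is given below. It uses optax in place of whatever the CoDeX training code provides (which is not reproduced here); `autocorrelation_coeffs`, `density_real_line`, and `regularization_loss` are the illustrative helpers sketched in Section II, and the parameter layout (complex a_n stored as a real array, fixed s = 1 and t = 0) is an assumption.

```python
import jax
import jax.numpy as jnp
import optax

NUM_FREQS = 64   # N, the number of frequency terms
GAMMA = 1e-6     # regularization weight

def init_params(key):
    # Free parameters a_0..a_N, stored as real and imaginary parts.
    return 0.1 * jax.random.normal(key, (NUM_FREQS + 1, 2))

def loss_fn(params, batch):
    a = params[:, 0] + 1j * params[:, 1]
    coeffs = autocorrelation_coeffs(a)                   # eq. (3)
    q = density_real_line(batch, coeffs, s=1.0, t=0.0)   # eq. (9)
    nll = -jnp.mean(jnp.log(q + 1e-12))                  # cross-entropy estimate
    return nll + regularization_loss(coeffs, GAMMA)      # eq. (10)

schedule = optax.cosine_decay_schedule(init_value=1e-4, decay_steps=500 * 500)
optimizer = optax.adam(learning_rate=schedule)           # opt_state = optimizer.init(params)

@jax.jit
def train_step(params, opt_state, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state, loss
```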
B. Impact of the Number of Frequency Terms on Model Fit

Increasing the number of frequencies in our model affords us more expressive power, while also increasing the number of parameters. We studied this trade-off on 1) periodic distributions with support on [−1, 1] and 2) distributions on R.

For the periodic distributions, we explored a mixture of two beta distributions and a mixture of logit-normal distributions. Results are reported in Fig. 1 and Fig. 2. In the left pane we display the fit qualitatively, while the right pane reports the KLD w.r.t. the true distribution as a function of the number of parameters (frequencies). We notice that the fit improves significantly with the number of parameters, until reaching a point of diminishing returns, where adding extra parameters does not produce a substantial difference.

Similarly, for the distributions with support on the real line, we considered a mixture of three Gaussian distributions and a mixture of Gaussian and Laplacian distributions (Fig. 3 and 4, respectively). We observe a significant decrease in KLD as the number of model parameters is increased. By introducing the offset and scale terms in our model, we improve the quality of fit for asymmetric distributions.

C. Multi-Modal Density Fit with a Fixed Parameter Budget

We compared the performance of the Fourier basis density model against the deep factorized model at modeling highly multi-modal 1D distributions. We trained both models with a fixed parameter budget and compared the KLD with respect to the target distribution. First, we evaluated a mixture of K Gaussian distributions (for K ∈ {5, 10, 15, 20, 25}) randomly located in the range [−10, 10], with variances sampled inversely proportional to the number of components (Fig. 5a). We observe that our model consistently obtains much lower KLD values than the DFP model for each value of K (Fig. 5b), validating the parameter efficiency of our model. Fig. 6 shows that, for the same number of parameters, the Fourier basis density model captures the multi-modality of the distribution significantly better than the deep factorized model.
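To give a concrete picture of this protocol, the sketch below builds one such random target and estimates the KLD by Monte Carlo. The exact way variances and mixture weights are drawn in the paper is not fully specified, so the choices here (equal weights, variance 1/K) are illustrative only, and `model_log_density` stands in for any fitted model's log density.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def random_gaussian_mixture(key, K):
    """K Gaussian components with means in [-10, 10]; variance 1/K, equal weights."""
    locs = jax.random.uniform(key, (K,), minval=-10.0, maxval=10.0)
    scale = (1.0 / K) ** 0.5

    def log_prob(x):
        comp = -0.5 * ((x[..., None] - locs) / scale) ** 2 - jnp.log(scale * jnp.sqrt(2 * jnp.pi))
        return logsumexp(comp, axis=-1) - jnp.log(K)

    def sample(key, n):
        k1, k2 = jax.random.split(key)
        idx = jax.random.randint(k1, (n,), 0, K)
        return locs[idx] + scale * jax.random.normal(k2, (n,))

    return log_prob, sample

def kld_monte_carlo(key, log_prob, sample, model_log_density, n=100_000):
    """KLD(target || model) ~ E_{x ~ target}[log target(x) - log model(x)]."""
    x = sample(key, n)
    return jnp.mean(log_prob(x) - model_log_density(x))
```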
[Figure] Fig. 7: Parameter efficiency evaluation. Our model provides a significantly better fit in comparison to the two baseline models, the deep factorized model and the Gaussian mixture model, across two different parameter budget regimes (∼90 and ∼282 parameters). Accuracy of fit is measured in terms of KLD with respect to the target distribution, for a similar parameter budget for all three models. The target distribution is a heterogeneous mixture of Gaussian and Laplacian distributions.
Next, we extended the experiment to a mixture of 20 Laplace distributions and 20 Gaussian distributions with mean values randomly located in [−10, 10]. Variances are sampled proportional to the number of components, and mixture weights are randomly sampled. As in the previous experiment, we compared the performance of our model with the deep factorized model, in terms of KLD with respect to the target distribution. Moreover, we also include a comparison with a Gaussian mixture model whose number of components is chosen such that the total number of parameters is equivalent for the three models. Results are visualized in Fig. 7, both for ∼90 and ∼282 parameters. We observe a significant gap in KLD between the models fitting the target distribution. Moreover, we show that as we overparameterize the models (with respect to the actual number of parameters of the target distribution), our model achieves a remarkably lower KLD, with an order of magnitude improvement compared to the other models.

Fig. 8 provides a qualitative comparison of the density fit achieved by the various methods. We see that our model captures more modes of the target distribution than the other approaches, which explains its lower KLD. In contrast, the deep factorized model only roughly approximates the overall behavior of the highest modes of the distribution. Finally, even though the Gaussian mixture model produces a remarkable fit, capturing some of the modes of the target distribution, its optimization problem is complex and highly sensitive to the initialization of the parameters, leading to suboptimal fits for many of the modes.
D. Lossy Compression of Banana Distribution

A notable application of univariate density models is in nonlinear transform coding (NTC), where data is transformed to a latent space and each dimension of the latent representation is independently modeled and coded. To demonstrate the utility of our model, we evaluate our method on a compression task for the banana distribution introduced in [2] (Fig. 9). We follow [2] closely and simply swap out the entropy model used during training. In brief, the model consists of an encoder, a decoder, and an entropy model, which are jointly trained by minimizing the rate–distortion Lagrangian with respect to the model parameters θ, i.e., the loss is

    \mathcal{L}_{\mathrm{compress}}(\theta) = R(\theta) + \lambda D(\theta),    (14)

where R is the rate and D is the distortion. Further, λ is a hyperparameter determining the desired trade-off. Here, θ includes the parameters of the entropy model as well as the nonlinear transforms. We use squared error as the measure of distortion and a continuous, differentiable proxy for the discrete entropy as the measure of rate for the joint optimization during training. Once the model is trained, the latent space of the encoder can be uniformly scalar quantized for entropy coding. See [2] for details.

We use eq. (7) both to compute the probability within each quantization interval, in order to evaluate the discrete entropy, and to obtain a model of the density convolved with a unit-width uniform distribution, which serves as the differentiable proxy for entropy during training. For the encoder and decoder architectures, we use three-layer MLPs with 50 hidden units each and leaky ReLU as the activation function. We use a latent space dimension of 5, a learning rate of 10^{-3}, a batch size of 512 samples, 200 epochs of 2048 steps, and N = 20 frequency terms.

Fig. 9(a) plots the rate–distortion curves for the two approaches, using the deep factorized model (red) and the Fourier basis density model (black). The curves are produced by varying the value of the trade-off parameter λ. As expected, the Fourier model achieves a comparable (or slightly better) trade-off than the deep factorized model, for all choices of λ and for an equivalent parameter budget. Figs. 9(b) and 9(c) provide a qualitative comparison of the learned quantization bins and their representers for both models, with λ = 10. Note that the bins far away from the mode see very few samples and therefore are not accurate. Further, the number and size of the quantization bins are a function of the trade-off ratio λ.
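A sketch of this training objective is given below. Here `encoder`, `decoder`, and `latent_cdf` are placeholders for the learned transforms and the per-dimension entropy-model CDF of eq. (7), not the paper's actual code, and additive uniform noise is used as the usual differentiable stand-in for quantization.

```python
import jax
import jax.numpy as jnp

def rate_bits(y, latent_cdf):
    # Probability of each latent under its density convolved with a unit-width
    # uniform, computed from the CDF of eq. (7); -log2 gives the rate proxy.
    p = latent_cdf(y + 0.5) - latent_cdf(y - 0.5)
    return -jnp.log2(p + 1e-12)

def rd_loss(params, x, key, lam, encoder, decoder, latent_cdf):
    y = encoder(params, x)
    y_noisy = y + jax.random.uniform(key, y.shape, minval=-0.5, maxval=0.5)
    x_hat = decoder(params, y_noisy)
    rate = jnp.mean(jnp.sum(rate_bits(y_noisy, latent_cdf), axis=-1))   # R(theta)
    distortion = jnp.mean(jnp.sum((x - x_hat) ** 2, axis=-1))           # D(theta)
    return rate + lam * distortion                                      # eq. (14)
```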
IV. DISCUSSION

We propose a novel univariate density model, the Fourier basis density model, which is simple yet flexible and end-to-end trainable. Our experiments show that it provides a better fit for challenging multi-modal distributions than prevalent methods at a similar parameter budget, when trained with the same optimizer choices. Moreover, the preliminary results obtained on the toy compression task show the effectiveness of our flexible density model, in comparison to the deep factorized model, as a building block for trainable end-to-end neural compression models.
[Figure] Fig. 8: Qualitative comparison of model fit for budget-constrained models (∼90 parameters) for a multi-modal target distribution formed by a mixture of Laplacian and Gaussian distributions. Both the mixture of Gaussians and the Fourier basis density model fit most of the modes with precision, in comparison to the deep factorized model. Furthermore, the proposed model produces a better fit with the same number of parameters than the mixture of Gaussians, fitting one extra mode around x = 4.0 while sacrificing fit of the modes in the range x ∈ [−2.5, 0]. (Panels: (a) Gaussian mixture model, (b) deep factorized model, (c) Fourier basis density model.)

[Figure] Fig. 9: Rate–distortion comparison on the banana distribution. a) R-D curves plotted by varying the trade-off parameter λ over {1, 5, 10, 15, 20, 25, 30}. Our method (FBM) with 210 parameters for the entropy coder slightly outperforms the deep factorized model (DFP) with 215 parameters. b) and c) Quantization bins and bin centers learned by jointly optimizing rate–distortion using the Fourier basis entropy model and the deep factorized model, respectively, with a fixed λ = 10. The results are quite comparable, with the exception of "don't care" regions off the main ridge, where the data distribution has few samples and the model behavior thus does not affect performance.

REFERENCES

[1] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," 2018.
[2] J. Ballé, P. A. Chou, D. Minnen, S. Singh, N. Johnston, E. Agustsson, S. J. Hwang, and G. Toderici, "Nonlinear transform coding," 2020.
[3] Y. Yang, S. Mandt, and L. Theis, "An introduction to neural data compression," 2023.
[4] G. McLachlan and K. Basford, Mixture Models: Inference and Applications to Clustering, vol. 38, 1988.
[5] F. Bunea, A. Tsybakov, and M. H. Wegkamp, "Sparse density estimation with l1 penalties," in Annual Conference on Computational Learning Theory, 2007. [Online]. Available: https://api.semanticscholar.org/CorpusID:15265978
[6] E. Parzen, "On estimation of a probability density function and mode," The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962.
[7] D. Minnen, J. Ballé, and G. D. Toderici, "Joint autoregressive and hierarchical priors for learned image compression," Advances in Neural Information Processing Systems, vol. 31, 2018.
[8] P. Brockwell and R. Davis, Time Series: Theory and Methods, ser. Springer Series in Statistics. Springer New York, 2013.
[9] J. Ballé, S. J. Hwang, and E. Agustsson, "CoDeX: Learned data compression in JAX," 2022. [Online]. Available: http://github.com/google/codex
[10] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, "JAX: composable transformations of Python+NumPy programs," 2018. [Online]. Available: http://github.com/google/jax
[11] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2017.
