2024-Fourier Basis Density Model
arXiv:2402.15345v1 [cs.LG] 23 Feb 2024

Abstract—We introduce a lightweight, flexible and end-to-end trainable probability density model parameterized by a constrained Fourier basis. We assess its performance at ap- [...]

[...] but there are questions regarding how general and parameter-efficient it is. The range of possible distributions that the DFP can model is hard to understand given the intricacies of its [...]
[Figs. 1 and 2: density fit on [−1, 1] (left) and log KL divergence vs. number of parameters (right).]

To guarantee non-negativity, we consider Herglotz's theorem [8]. It states that the Fourier series of a non-negative function is positive semi-definite. In other words, f(x) is non-negative if its sequence of Fourier coefficients is positive semi-definite. [...] This allows us to guarantee, by plugging (3) into (1), that f(x) is always non-negative.
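As a minimal illustration (a sketch only; the paper's actual parameterization, eq. (3), is not reproduced in this excerpt), one standard way to obtain a positive semi-definite coefficient sequence, and hence a non-negative f, is to define f as the squared magnitude of a Fourier sum, so that its coefficients are the autocorrelation of an unconstrained complex vector a:

import jax.numpy as jnp

# Sketch: a trigonometric series that is non-negative by construction.
# The Fourier coefficients of |sum_n a_n exp(i n pi x)|^2 are the
# autocorrelation of `a`, a positive semi-definite sequence, consistent
# with Herglotz's theorem. Whether this matches the paper's eq. (3) is an
# assumption; `a` is a hypothetical free complex parameter vector, and the
# result is unnormalized (see Sec. II-B).
def nonneg_fourier(x, a):
    n = jnp.arange(a.shape[0])                       # frequencies 0, ..., N
    basis = jnp.exp(1j * jnp.pi * n * x[..., None])  # exp(i n pi x)
    root = jnp.sum(a * basis, axis=-1)               # sum_n a_n exp(i n pi x)
    return jnp.abs(root) ** 2                        # f(x) >= 0 everywhere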
B. Normalization

To compute the normalization constant Z, note in (2) that [...]
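The expression referred to here is omitted from this excerpt. Assuming f takes the truncated Fourier form used in eqs. (11)–(13) below, the orthogonality relation (13) with m = 0 pins down the integral of f, since every term except n = 0 vanishes:

\int_{-1}^{1} f(x)\, dx \;=\; \sum_{n=-N}^{N} c_n \int_{-1}^{1} e^{i n \pi x}\, dx \;=\; 2\, c_0 ,

which suggests the normalizer depends only on the zero-frequency coefficient c_0.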
[...] where C is a function of the parameters that ensures P(−1) = 0 and P(1) = 1.

C. Support on R

Lastly, to extend this model to the entire real line, we consider the mapping g : (−1, 1) → R, which is parameterized by a scaling s and an offset t:

g(x; s, t) = s \cdot \tanh^{-1}(x) + t = \frac{s}{2} \ln \frac{1 + x}{1 - x} + t.    (6)

The CDF Q of the mapped variable can be written directly as

Q(x; \theta, s, t) = P\left( g^{-1}(x; s, t); \theta \right)    (7)

with

g^{-1}(x; s, t) = \tanh\left( \frac{x - t}{s} \right),    (8)

whereas in the density q, the derivative of g needs to be taken into account:

q(x; \theta, s, t) = p\left( g^{-1}(x; s, t); \theta \right) \left( g^{-1} \right)'(x; s, t) = \frac{1}{s}\, p\left( \tanh \frac{x - t}{s}; \theta \right) \operatorname{sech}^2\left( \frac{x - t}{s} \right).    (9)
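For illustration, eqs. (8) and (9) translate directly into a change-of-variables computation. A minimal sketch, assuming p stands in for the normalized Fourier-basis density on (−1, 1) with parameters theta (a placeholder here):

import jax.numpy as jnp

# Map a density p on (-1, 1) onto the real line via g(x) = s * arctanh(x) + t.
# `p` is a hypothetical placeholder for the normalized Fourier-basis density.
def density_on_reals(x, theta, s=1.0, t=0.0):
    u = jnp.tanh((x - t) / s)                      # g^{-1}(x; s, t), eq. (8)
    jac = 1.0 / (s * jnp.cosh((x - t) / s) ** 2)   # (1/s) sech^2((x - t)/s)
    return p(u, theta) * jac                       # density q, eq. (9)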
[...] Here, γ > 0 is a hyperparameter specifying the weight of the regularization. This leads to an intuitive penalty on the frequency coefficients, where the coefficients for the higher frequencies are penalized more than the lower ones. The hyperparameter γ needs to be selected manually, but in our experiments the optimization outcome is quite robust to the choice. The regularization term appears to stabilize the training dynamics as well.

To establish the equality in eq. (10), first note that:

\frac{d f(x)}{d x} = \sum_{n=-N}^{N} i n \pi c_n \exp(i n \pi x)    (11)

\left\| \frac{d f(x)}{d x} \right\|^2 = \pi^2 \sum_{n=-N}^{N} \sum_{m=-N}^{N} n m\, c_n c_m^{*} \int_{-1}^{1} e^{i (n - m) \pi x}\, dx    (12)

Now recall that

\int_{-1}^{1} \exp(i (n - m) \pi x)\, dx = \begin{cases} 0 & \text{if } n \neq m \\ 2 & \text{if } n = m \end{cases}    (13)

which immediately leads to eq. (10).
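Carrying out the final step explicitly (eq. (10) itself is not reproduced in this excerpt, so the identity below is inferred from (12) and (13)): substituting (13) into (12) keeps only the diagonal terms n = m, giving

\left\| \frac{d f(x)}{d x} \right\|^2 = 2 \pi^2 \sum_{n=-N}^{N} n^2\, |c_n|^2 ,

which matches the earlier remark that higher-frequency coefficients are penalized more strongly, here quadratically in n.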
(a) density fit, N = 64   (b) KLD vs. # parameters
Fig. 3: Model fit for mixture of 3 Gaussians. a) Density plot for a 64 term Fourier basis density model (best viewed on screen). b) The fit improves with increasing number of parameters.

(a) density fit   (b) KLD vs. K
Fig. 5: a) Model fit for mixture of K Gaussians (best viewed on screen). b) KLD between model and target as a function of K, for deep factorized model (DFP) and Fourier basis density model (FBM) on a fixed parameter budget (∼ 90 parameters).
(a) density fit, N = 64   (b) KLD vs. # parameters
Fig. 4: Model fit for mixture of Gaussians and Laplacians.

(a) Fourier basis density model   (b) Deep factorized model
Fig. 6: Model fit for mixture of 25 Gaussians. a) Our proposed model captures most of the target distribution modes. b) In contrast, the deep factorized model covers the modes, but also assigns a lot of probability mass to other regions.

III. EXPERIMENTAL EVALUATION
In order to compare the density estimation and compression
performance of our proposed model, we conducted a number
of experiments using the CoDeX [9] library in the JAX framework [10].

A. Experimental Setup

We optimize the parameters of the density models by maximizing their log likelihood on samples from the target distributions, using the Adam optimizer [11], with a cosine decay schedule for the learning rate, initialized at 10⁻⁴. The models were trained for 500 epochs of 500 steps each, with a batch size of 128 and a validation batch size of 2048. After hyperparameter tuning, we found the regularization penalty γ = 10⁻⁶ to work well. For the DFP model, we consider neural network architectures with three hidden layers of M units each, where M ∈ {5, 10, 20, 30}. Both models were trained using the same number of iterations, learning rate and optimizer. Further, we set the scale s = 1 and offset t = 0, unless otherwise specified.

As evaluation benchmarks, we experimented with univariate multi-modal distributions expressed as mixtures of Gaussian, beta, logit-normal, and Laplacian distributions. For the compression task, we use the banana distribution from [2].
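This setup can be sketched roughly as follows (an illustrative sketch only, not the paper's CoDeX-based implementation; log_q, regularizer and params are hypothetical placeholders for the model's log-density, the frequency penalty of eq. (10), and the model parameters):

import jax
import jax.numpy as jnp
import optax

# Schematic training step mirroring Sec. III-A: Adam with a cosine-decayed
# learning rate (initial value 1e-4) and a negative log-likelihood loss plus
# the regularization penalty weighted by gamma = 1e-6.
schedule = optax.cosine_decay_schedule(init_value=1e-4, decay_steps=500 * 500)
optimizer = optax.adam(learning_rate=schedule)

def loss_fn(params, batch, gamma=1e-6):
    nll = -jnp.mean(log_q(batch, params))      # maximize log likelihood
    return nll + gamma * regularizer(params)   # placeholder penalty, cf. eq. (10)

@jax.jit
def train_step(params, opt_state, batch):      # batch: 128 samples from the target
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state, loss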
B. Impact of the Number of Frequency Terms on Model Fit

Increasing the number of frequencies in our model affords us more expressive power, while also increasing the number of parameters. We studied this trade-off on 1) periodic distributions on the support [−1, 1] and 2) distributions on R.

For periodic distributions, we explored a mixture of two beta distributions and a mixture of logit-normal distributions. Results are reported in Fig. 1 and Fig. 2. In the left pane we display the fit qualitatively, while the right pane reports the KLD w.r.t. the true distribution as a function of the number of parameters (frequencies). We notice that the fit improves significantly with the number of parameters, until reaching a point of diminishing returns, where the addition of extra parameters does not produce a substantial difference.

Similarly, for the distributions with support on the real line, we considered a mixture of three Gaussian distributions and a mixture of Gaussian and Laplacian distributions (Fig. 3 and 4, respectively). We observe a significant decrease in KLD as the number of model parameters is increased. By introducing the offset and scale terms in our model, we improve the quality of fit for asymmetric distributions.

C. Multi-Modal Density Fit with a Fixed Parameter Budget

We compared the performance of the Fourier basis density model against the deep factorized model at modeling highly multi-modal 1D distributions. We trained both models with a fixed parameter budget, and compared the KLD with respect to the target distribution. First, we evaluated a mixture of K Gaussian distributions (for K ∈ {5, 10, 15, 20, 25}) randomly located in the range [−10, 10] with variance sampled inversely proportional to the number of components (Fig. 5a). We observe that our model consistently obtains much lower KLD values compared to the DFP model for each value of K (Fig. 5b), validating the parameter efficiency of our model. Fig. 6 shows that for the same number of parameters, the Fourier basis density model is able to capture the multi-modality of the distribution significantly better than the deep factorized model.
[...] to a latent space and each dimension of the latent representation is independently modeled and coded. To demonstrate the utility of our model, we evaluate our method on a compression task for the banana distribution introduced in [2] (Fig. 9). We [...]

(a) Gaussian mixture model   (b) Deep factorized model   (c) Fourier basis density model
Fig. 8: Qualitative comparison of model fit for budget-constrained models (∼ 90 parameters) for a multi-modal target distribution formed by a mixture of Laplacian and Gaussian distributions. Both the mixture of Gaussians as well as the Fourier basis density model fit most of the modes with precision, in comparison to the deep factorized model. Furthermore, the proposed model produces a better fit with the same number of parameters as the mixture of Gaussians, by fitting one extra mode around x = 4.0 while sacrificing fit of the modes in the range x ∈ [−2.5, 0].
(a) Rate–distortion performance (b) Fourier basis density model (c) Deep factorized model
Fig. 9: Rate–distortion comparison. a) R-D curves plotted by varying the trade-off parameter λ over {1, 5, 10, 15, 20, 25, 30}.
Our method (FBM) with 210 parameters for the entropy coder slightly outperforms the deep factorized model (DFP) with 215
parameters. b) and c): Quantization bins and bin centers learned by jointly optimizing rate–distortion using the Fourier basis
entropy model and the deep factorized model, respectively, with a fixed λ = 10. The results are quite comparable, with the
exception of “don’t care” regions off the main ridge, where the data distribution has few samples and the model behavior thus
doesn’t have an effect on the performance.