Conditional Density Estimation With Neural Networks
ABSTRACT
Given a set of empirical observations, conditional density estimation aims to capture the sta-
tistical relationship between a conditional variable x and a dependent variable y by modeling their
conditional probability p(y|x). The paper develops best practices for conditional density estima-
tion for finance applications with neural networks, grounded on mathematical insights and empirical
evaluations. In particular, we introduce a noise regularization and data normalization scheme, alle-
viating problems with over-fitting, initialization and hyper-parameter sensitivity of such estimators.
We compare our proposed methodology with popular semi- and non-parametric density estimators,
underpin its effectiveness in various benchmarks on simulated and Euro Stoxx 50 data and show
its superior performance. Our methodology allows us to obtain high-quality estimators for statistical
expectations of higher moments, quantiles and non-linear return transformations, with very little
∗ denotes equal contribution
† The authors are with the Computational Risk and Asset Management Research Group at the Karlsruhe Institute of Technology (KIT), Germany. Reference to [email protected]
I. Introduction
A wide range of problems in econometrics and finance are concerned with describing the statis-
tical relationship between a vector of explanatory variables x and a dependent variable or vector y
of interest. While regression analysis aims to describe the conditional mean E[y|x], many problems
in risk and asset management require gaining insight about deviations from the mean and their
associated likelihood. The stochastic dependency of y on x can be fully described by modeling the
conditional probability density p(y|x). Inferring such a density function from a set of empirical
observations {(x_n, y_n)}_{n=1}^N is typically referred to as conditional density estimation (CDE).
We propose to use neural networks for estimating conditional densities. In particular, we discuss
two models in which a neural network controls the parameters of a Gaussian mixture. Namely,
these are the Mixture Density Network (MDN) by Bishop (1994) and the Kernel Mixture Network
(KMN) by Ambrogioni et al. (2017). When chosen expressive enough, such models can approximate
arbitrary conditional densities.
However, when combined with maximum likelihood estimation, this flexibility can result in over-
fitting and poor generalization beyond the training data. Addressing this issue, we develop a noise
regularization method for conditional density estimation. By adding small random perturbations
to the data during training, the conditional density estimate is smoothed and generalizes better.
In fact, we mathematically derive that adding noise during training is equivalent to penalizing the
second derivatives of the conditional log-probability. Graphically, the penalization punishes very
curved or even spiky density estimators in favor of smoother variants. Our experimental results
demonstrate the efficacy and importance of the noise regularization for attaining good out-of-sample
performance.
Moreover, we attend to further practical issues that arise due to different value ranges of the
training data. In this context, we introduce a simple data normalization scheme that fits the conditional density model on normalized data and, after training, transforms the density estimate so that it corresponds to the original data distribution. The normalization scheme makes the hyper-
parameters and initialization of the neural network based density estimator insensitive to differing
value ranges. Our empirical evaluations suggest that this increases the consistency of the training
results and significantly improves the estimator’s performance.
Aiming to compare our proposed approach against well-established CDE methods, we report a
comprehensive benchmark study on simulated densities as well as on EuroStoxx 50 returns. When
trained with noise regularization, both MDNs and KMNs are able to outperform previous standard
semi- and nonparametric conditional density estimators. Moreover, the results suggest that even
for small sample sizes, neural network based conditional density estimators can be an equal or
superior alternative to well established conditional kernel density estimators.
Our study adds to the econometric literature, which discusses two main approaches towards
CDE. The majority of financial research assumes that the conditional distribution follows a standard
parametric family (e.g. Gaussian) that captures the dependence of the distribution parameters on
x with a (partially) linear model. The widely used ARMA-GARCH time-series model (Engle, 1982;
Nelson and Cao, 1992) and many of its extensions (Glosten et al., 1993; Hansen et al., 1994; Sentana,
1995) fall into this category. However, inherent assumptions in many such models have been refuted
empirically later on (Harvey and Siddique, 1999; Jondeau and Rockinger, 2003). Other examples of this class of models are linear factor models (Fama and French, 1993; Carhart, 1997; Fama and French, 2015). Here, too, evidence for time variation in the betas of these factor models, as documented by Jagannathan and Wang (1996), Lewellen and Nagel (2006) or Gormsen and Jensen (2017), casts doubt on the actual existence of the stated linear relationships. Overall, it is unclear
to which degree the modelling restrictions are consistent with the actual mechanisms that generate
the empirical data and how much they bias the inference.
Another major strand of research approaches CDE from a nonparametric perspective, estimat-
ing the conditional density with kernel functions, centered in the data points (Hyndman et al., 1996;
Li and Racine, 2007). While kernel methods make few assumptions about functional relationships
and density shape, they typically suffer from poor generalization in the tail regions and from data
sparseness when dimensionality is high.
In contrast, CDE based on high-capacity function approximators such as neural networks has
received little attention in the econometric and finance community. Yet, they combine the global
generalization capabilities of parametric models with few restrictive assumptions regarding the
conditional density. Aiming to combine these two advantages, this work studies the use of neural
networks for estimating conditional densities. Overall, this paper establishes a sound framework
for fitting high-capacity conditional density models. Thanks to the presented noise regularization
and data normalization scheme, we are able to overcome common issues with neural network based
estimators and make the approach easy to use. The conditional density estimators are available as an open-source Python package (https://github.com/freelunchtheorem/Conditional_Density_Estimation).
II. Background
A. Density Estimation
Let X be a random variable with probability density function (PDF) p(x) defined over the
domain X . When investigating phenomena in the real world, the distribution of an observable
variable X is typically unknown. However, it is possible to observe realizations x_n ∼ p(x) of X. Given a collection D = {x_1, ..., x_N} of such observations, it is our aim to find a good estimate p̂(x) of the true density function p. First, we have to assess what a "good" estimate is. Throughout the density estimation literature (Bishop, 2006; Li and Racine, 2007; Shalizi, 2011), the two most popular ways of quantifying the goodness of a fitted distribution p̂ are the following:
1. The Integrated Mean Squared Error (IMSE) measures the squared distance between the true
density function and the estimate:
$$\mathrm{IMSE} = \int_{\mathcal{X}} \left|\hat{p}(x) - p(x)\right|^2 dx \qquad (1)$$
2. The Kullback-Leibler divergence / relative entropy measures the average log-likelihood ratio
between p(x) and p̂(x):
$$D_{KL}(p \,\|\, \hat{p}) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{\hat{p}(x)}\, dx \qquad (2)$$
Correspondingly, we aim to find a p̂ such that the selected error criterion is minimized.
In its most general form, density estimation aims to find the best p̂ among all possible PDFs
over the domain X , while only given a finite number of observations. Even in the simple case
X = R1 , this would require estimating infinitely many distribution parameters with a finite amount
of data, which is not feasible in practice. Hence, it is necessary to either restrict the space of
possible PDFs or to embed other assumptions into the density estimation. The kind of imposed
assumptions characterizes the distinction between the sub-fields of parametric and non-parametric
density estimation.
$$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \log \hat{p}_{\theta}(x_n) \qquad (5)$$
which is equivalent to minimizing the Kullback-Leibler divergence between the empirical data dis-
tribution (i.e. the weighted sum of point masses in the observations xn )
$$p_{\mathcal{D}}(x) = \frac{1}{N} \sum_{n=1}^{N} \delta(\|x - x_n\|) \qquad (6)$$
and the parametric distribution p̂θ :
$$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \log \hat{p}_{\theta}(x_n) \qquad (7)$$
$$\hat{p}(x) = \frac{1}{Nh} \sum_{n=1}^{N} \mathbb{1}(x_n \text{ is in the same bin as } x) \qquad (10)$$
In that, 1(·) denotes the indicator function and h the width or area of the bin. Though appeal-
ingly simple, histograms bear the major disadvantage of being discontinuous which hinders any
differentiation-based method. In addition, the density estimates are not centered around the query
x, making the estimates for queries that are close to the bin boundaries worse than queries close
to the bin center.
Kernel density estimators enjoy more popularity as they overcome the named disadvantages
of histograms. Such estimators simply replace the indicator function in (10) with a symmetric
density function K(z), the so-called kernel (Rosenblatt, 1956; Parzen, 1962). The resulting density
estimator for univariate distributions reads as follows:
$$\hat{p}(x) = \frac{1}{Nh} \sum_{n=1}^{N} K\!\left(\frac{x - x_n}{h}\right) \qquad (11)$$
Kernel density estimation (KDE) can be understood as placing a density function centered in each
data point xn and forming an equally weighted mixture of the N densities.
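To make this construction concrete, the following is a minimal NumPy sketch of a univariate kernel density estimator with Gaussian kernels, directly implementing (11) and (13). The bandwidth h is treated as a given input here, and all function names are illustrative rather than taken from the accompanying package.

```python
import numpy as np

def gaussian_kernel(z):
    """Standard normal density, cf. Eq. (13)."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def kde(x_query, x_train, h):
    """Univariate kernel density estimate p_hat(x), cf. Eq. (11).

    x_query: points at which the density is evaluated, shape (Q,)
    x_train: observed samples x_1, ..., x_N, shape (N,)
    h:       bandwidth, assumed to be chosen externally
    """
    z = (x_query[:, None] - x_train[None, :]) / h   # pairwise scaled distances, shape (Q, N)
    return gaussian_kernel(z).mean(axis=1) / h      # equally weighted mixture of N kernels

# usage: evaluate the density estimate of 1000 standard-normal samples on a grid
x_train = np.random.randn(1000)
grid = np.linspace(-4.0, 4.0, 200)
p_hat = kde(grid, x_train, h=0.3)
```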
Figure 1. Kernel density estimation with Gaussian kernels. The six red dashed curves depict the kernels, while the kernel density estimate p̂ is illustrated by the blue curve.
In the case of multivariate kernel density estimation, i.e. dim(X) = l > 1, the density can be estimated as
$$\hat{p}(x) = \prod_{j=1}^{l} \hat{p}(x^{(j)}) = \prod_{j=1}^{l} \frac{1}{N h^{(j)}} \sum_{n=1}^{N} K\!\left(\frac{x^{(j)} - x_n^{(j)}}{h^{(j)}}\right) \qquad (12)$$
In that, x(j) denotes the j-th element of the column vector x ∈ X ⊆ Rl and h(j) the bandwidth
corresponding to the j-th dimension.
One popular choice of K(·) is the Gaussian kernel:
$$K(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}} \qquad (13)$$
We now consider the distribution of a variable Y given x. Typically, Y is referred to as the dependent variable (i.e. explained variable) and X as the conditional (explanatory) variable. Given a dataset of observations D = {(x_n, y_n)}_{n=1}^N drawn from
the joint distribution (xn , yn ) ∼ p(x, y), the aim of conditional density estimation (CDE) is to find
an estimate p̂(y|x) of the true conditional density p(y|x).
In the context of conditional density estimation, the IMSE and DKL objectives are expressed
as expectation over p(x):
$$\mathrm{IMSE}_{y|x} = \int_{\mathcal{X}} \int_{\mathcal{Y}} \left|\hat{p}(y|x) - p(y|x)\right|^2 p(x)\, dy\, dx \qquad (14)$$
$$\mathbb{E}_{x \sim p(x)}\!\left[D_{KL}(p(y|x) \,\|\, \hat{p}(y|x))\right] = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(y|x) \log \frac{p(y|x)}{\hat{p}(y|x)}\, p(x)\, dy\, dx \qquad (15)$$
Similar to the unconditional case in (7) - (9), parametric maximum likelihood estimation following
from (15) can be expressed as
$$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \log \hat{p}_{\theta}(y_n | x_n) \qquad (16)$$
The nonparametric KDE approach, discussed in Section II.A.2 can be extended to the condi-
tional case. Typically, unconditional KDE is used to estimate both the joint density p̂(x, y) and
the marginal density p̂(x). Then, the conditional density estimate follows as the density ratio
$$\hat{p}(y|x) = \frac{\hat{p}(x, y)}{\hat{p}(x)} \qquad (17)$$
where both the numerator and the denominator are sums of kernel functions as in (12). For more
details on conditional kernel density estimation, we refer the interested reader to Li and Racine
(2007).
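The ratio construction in (17) can be sketched in a few lines for scalar x and y with Gaussian product kernels; the bandwidths h_x and h_y are assumed to be supplied, e.g. by the rule-of-thumb in (54) below. This is an illustrative sketch, not the reference implementation.

```python
import numpy as np

def _gauss(z):
    # standard normal density
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def ckde(y_query, x_query, y_train, x_train, hx, hy):
    """Conditional KDE p_hat(y|x) = p_hat(x, y) / p_hat(x), cf. Eq. (17),
    for univariate x and y with Gaussian product kernels."""
    kx = _gauss((x_query - x_train) / hx) / hx    # kernel weights in x, shape (N,)
    ky = _gauss((y_query - y_train) / hy) / hy    # kernel weights in y, shape (N,)
    return np.mean(kx * ky) / np.mean(kx)         # joint estimate divided by marginal estimate
```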
Gaussian through linear relationships (Engle, 1982; Hamilton, 1994).
Although the conditional distribution is Gaussian, autocorrelation of the variance makes it possible to account for volatility clustering (Mandelbrot, 1967) and kurtosis in the unconditional distribution.
Various generalizations of GARCH attempt to model asymmetric return distributions and negative
skewness (Nelson and Cao, 1992; Glosten et al., 1993; Sentana, 1995). Further work employs the
student-t distribution as conditional probability model (Bollerslev et al., 1987; Hansen et al., 1994)
and models the dependence of higher-order moments on the past (Gallant et al., 1991; Hansen
et al., 1994).
While the neural network based CDE approaches, which are presented in this paper, are also
parametric models, they make very few assumptions about the underlying relationships and den-
sity family. Both the relationship between the conditional variables and distribution parameters, as
well as the probability density itself are modelled with flexible function classes (i.e. neural network
and GMM). In contrast, traditional financial models impose strong assumptions such as linear re-
lationships and Gaussian conditional distributions. It is unclear to which degree such modelling
restrictions are consistent with the empirical data and how much they bias the inference.
Non-parametric CDE. A distinctly different line of work in econometrics aims to estimate
densities in a non-parametric manner. Originally introduced in Rosenblatt (1956); Parzen (1962),
KDE uses kernel functions to estimate the probability density at a query point, based on the
distance to all training points. In principle, kernel density estimators can approximate arbitrary
probability distributions and make no parametric assumptions about the shape of the density.
However, in practice, when data is finite, smoothing is required to achieve satisfactory general-
ization beyond the training data. The fundamental issue of KDE, commonly referred to as the
bandwidth selection problem, is choosing the appropriate amount of smoothing (Park et al., 1990;
Cao et al., 1994). Common bandwidth selection methods include rules-of-thumb (Silverman, 1982;
Sheather and Jones, 1991; Botev et al., 2010) and selectors based on cross-validation (Rudemo,
1982; Bowman, 1984; Hall et al., 1992).
In order to estimate conditional probabilities, previous work proposes to estimate both the joint and the marginal probability separately with KDE and then to compute the conditional probability as their ratio (Hyndman et al., 1996; Li and Racine, 2007). Alternatively, this can be interpreted as a
combination of kernel regression and kernel density estimation (De Gooijer and Zerom, 2003). Other
approaches combine non-parametric elements with parametric elements (Tresp, 2001; Sugiyama and
Takeuchi, 2010), forming semi-parametric conditional density estimators. Despite their theoretical
appeal, non-parametric density estimators suffer from the following drawbacks: First, they tend
to generalize poorly in regions where data is sparse which especially becomes evident in the tail
regions of the distribution. Second, their performance deteriorates quickly as the dimensionality
of the dependent variable increases. This phenomenon is commonly referred to as the ”curse of
dimensionality”.
CDE with neural networks: MDN, KMN, Normalizing Flows. The third line of work approaches conditional density estimation from a parametric perspective. However, in contrast
to parametric modelling in finance and econometrics, such methods use high-capacity function
approximators instead of strongly constrained parametric families. Our work builds upon the work
of Bishop (1994) and Ambrogioni et al. (2017), who propose to use a neural network to control
the parameters of a mixture density model. When both the neural network and the mixture of densities are chosen to be sufficiently expressive, any conditional probability distribution can be
approximated (Hornik, 1991; Li and Andrew Barron, 2000). Sarajedini et al. (1999) propose neural
networks that parameterize a generic exponential family distribution. However, this limits the
overall expressiveness of the conditional density estimator.
A recent trend in machine learning is the use of neural network based latent density models
(Mirza and Osindero, 2014; Sohn et al., 2015). Although such methods have proven successful
for estimating distributions of images, it is not possible to recover the PDF of such latent density
models. More promising in this sense are normalizing flows which use a sequence of parameterized
invertible maps to transform a simple distribution into more complex density functions (Rezende
and Mohamed, 2015; Dinh et al., 2017; Trippe and Turner, 2018). Since the PDF of normalizing
flows is tractable, this could be an interesting direction to supplement our work.
While neural network based density estimators make very few assumptions about the underlying density, they suffer from severe over-fitting when trained with the maximum likelihood objective. In order to counteract over-fitting, various regularization methods have been explored in the literature (Krogh and Hertz, 1992; Holmstrom and Koistinen, 1992; Webb, 1994; Srivastava
et al., 2014). However, these methods were developed with emphasis on regression and classification
problems. Our work focuses on the regularization of neural network based density estimators. In
that, we make use of the noise regularization framework (Webb, 1994; Bishop, 1995), discussing its
implications in the context of density estimation and empirically evaluating its efficacy.
Mixture Density Networks (MDNs) combine conventional neural networks with a mixture den-
sity model for the purpose of estimating conditional distributions p(y|x) (Bishop, 1994). In par-
ticular, the parameters of the unconditional mixture distribution p(y) are outputted by the neural
network, which takes the conditional variable x as input. The basic functioning of this framework
Figure 2. Illustration of a Mixture Density Network.
is illustrated in Figure 2. Given a mixture density with sufficiently many mixture components
and an expressive neural network that regresses into the parameter space of the density model,
MDNs can approximate arbitrary conditional distributions. The universal approximation property
of MDNs w.r.t. conditional densities follows directly from the universal function approximation
theorem for neural networks (Hornik, 1991) and the universal density approximation theorem for
mixture density models (Li and Andrew Barron, 2000).
For our purpose, we employ a Gaussian Mixture Model (GMM) with diagonal covariance ma-
trices as density model. The conditional density estimate p̂(y|x) follows as weighted sum of K
Gaussians
$$\hat{p}(y|x) = \sum_{k=1}^{K} w_k(x; \theta)\, \mathcal{N}\!\left(y \,\middle|\, \mu_k(x;\theta),\, \sigma_k^2(x;\theta)\right) \qquad (18)$$
wherein wk (x; θ) denote the weight, µk (x; θ) the mean and σk2 (x; θ) the variance of the k-th Gaussian
component. All the GMM parameters are governed by the neural network with parameters θ and
input x. It is possible to use a GMM with full covariance matrices Σ_k by having the neural network output the lower triangular entries of the respective Cholesky decompositions Σ_k^{1/2} (Tansey et al., 2016). However, we choose diagonal covariance matrices in order to avoid the quadratic increase in the neural network's output layer size as the dimensionality of Y increases. Assuming K mixture components and dim(Y) = l, the total number of neural network outputs is given by K(2l + 1) and thus grows only linearly in K and l.
The mixing weights w_k(x; θ) must resemble a multinomial distribution, i.e. it must hold that Σ_{k=1}^K w_k(x; θ) = 1 and w_k(x; θ) ≥ 0 ∀k. To satisfy these conditions, a softmax function is used:
$$w_k(x) = \frac{\exp(a_k^{w}(x))}{\sum_{i=1}^{K} \exp(a_i^{w}(x))} \qquad (19)$$
In that, a_k^w(x) ∈ R denote the logit scores emitted by the neural network. Similarly, the standard deviations σ_k(x) must be positive. To ensure that the respective neural network outputs satisfy the non-negativity constraint, an exponential non-linearity is applied:
$$\sigma_k(x) = \exp\!\left(a_k^{\sigma}(x)\right) \qquad (20)$$
Figure 3. Illustration of a Kernel Mixture Network.
Since the component means µk (x; θ) are not subject to such restrictions, we use a linear layer
without non-linearity for the respective output neurons.
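The parameterization described above can be sketched compactly. The following PyTorch snippet is an illustrative implementation of (18) and (19) with an exponential non-linearity for the standard deviations; it is not the authors' reference code, and the network size, K and all identifiers are placeholders. Its output layer has K + 2Kl = K(2l + 1) units, matching the count stated above.

```python
import torch
import torch.nn as nn

class MDN(nn.Module):
    """Mixture Density Network with K diagonal Gaussian components, cf. Eq. (18)."""
    def __init__(self, dim_x, dim_y, K=10, hidden=32):
        super().__init__()
        self.K, self.dim_y = K, dim_y
        self.body = nn.Sequential(nn.Linear(dim_x, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.logits = nn.Linear(hidden, K)             # a_k^w(x), fed into the softmax
        self.means = nn.Linear(hidden, K * dim_y)      # mu_k(x), no output non-linearity
        self.log_sigma = nn.Linear(hidden, K * dim_y)  # exponentiated to keep sigma_k(x) > 0

    def forward(self, x):
        h = self.body(x)
        w = torch.softmax(self.logits(h), dim=-1)                       # Eq. (19)
        mu = self.means(h).view(-1, self.K, self.dim_y)
        sigma = torch.exp(self.log_sigma(h)).view(-1, self.K, self.dim_y)
        return w, mu, sigma

    def log_prob(self, x, y):
        """log p_hat(y|x) for x of shape (batch, dim_x) and y of shape (batch, dim_y)."""
        w, mu, sigma = self(x)
        comp = torch.distributions.Normal(mu, sigma)
        log_comp = comp.log_prob(y.unsqueeze(1)).sum(-1)                # (batch, K), diagonal covariance
        return torch.logsumexp(torch.log(w) + log_comp, dim=-1)
```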
While MDNs resemble a purely parametric conditional density model, a closely related approach,
the Kernel Mixture Network (KMN), combines both non-parametric and parametric elements (Am-
brogioni et al., 2017). Similar to MDNs, a mixture density model of p̂(y) is combined with a neural
network which takes the conditional variable x as an input. However, the neural network only
controls the weights of the mixture components while the component centers and scales are fixed
w.r.t. x. Figuratively, one can imagine the neural network as choosing among a very large number of pre-existing kernel functions to build up the final combined density function. As is common in non-parametric density estimation, the components/kernels are placed in each of the training samples or a subset of the samples. For each of the kernel centers, one or multiple scale/bandwidth parameters σ_m are chosen. As for MDNs, we employ Gaussians as mixture components, wherein
the scale parameter directly coincides with the standard deviation. Figure 3 holds an illustration
of the KMN conditional density model.
Let K be the number of kernel centers µk and M the number of different kernel scales σm . The
KMN conditional density estimate reads as follows:
$$\hat{p}(y|x) = \sum_{k=1}^{K} \sum_{m=1}^{M} w_{k,m}(x; \theta)\, \mathcal{N}(y \,|\, \mu_k, \sigma_m^2) \qquad (21)$$
As previously, the weights wk,m must resemble a multinomial distribution. Hence, the output
non-linearity of the neural network is chosen as a softmax function. Ambrogioni et al. (2017)
propose to choose the kernel centers µk by subsampling the training data by recursively removing
each point yn that is closer than a constant δ to any of its predecessor points. This can be seen as
a naive form of clustering which depends on the ordering of the dataset. Instead, we suggest to use
a well-established clustering method such as K-means for selecting the kernel centers. The scales
of the Gaussian kernels can either be fixed or jointly trained with the neural network weights. In
practice, considering the scales as trainable parameters consistently improves the performance.
Overall, the KMN model is more restrictive than MDN as the locations and scales of the mixture
components are fixed during inference and cannot be controlled by the neural network. However,
due to the lower expressiveness of KMNs, they are less prone to over-fit than MDNs.
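A corresponding sketch of the KMN construction under the same caveats: kernel centers are chosen by K-means over the training targets (as suggested above, here via scikit-learn as an illustrative choice), the scales are trainable, and the network only outputs the K·M mixture weights.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def choose_centers(y_train, n_centers=50):
    """Select kernel centers mu_k by K-means clustering of the training targets."""
    km = KMeans(n_clusters=n_centers, n_init=10).fit(y_train.reshape(len(y_train), -1))
    return torch.as_tensor(km.cluster_centers_, dtype=torch.float32)      # (K, dim_y)

class KMN(nn.Module):
    """Kernel Mixture Network, cf. Eq. (21): the network only controls the weights."""
    def __init__(self, dim_x, centers, scales=(0.1, 0.5, 1.0), hidden=32):
        super().__init__()
        self.centers = centers                                            # fixed mu_k, shape (K, dim_y)
        self.log_scales = nn.Parameter(torch.log(torch.tensor(scales)))   # trainable sigma_m
        n_components = len(centers) * len(scales)
        self.net = nn.Sequential(nn.Linear(dim_x, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_components))

    def log_prob(self, x, y):
        """log p_hat(y|x) for x of shape (batch, dim_x) and y of shape (batch, dim_y)."""
        w = torch.softmax(self.net(x), dim=-1)                            # (batch, K*M) mixture weights
        sigma = torch.exp(self.log_scales)                                # (M,)
        comp = torch.distributions.Normal(self.centers.unsqueeze(1),      # (K, 1, dim_y)
                                          sigma.view(1, -1, 1))           # broadcast to (K, M, dim_y)
        log_comp = comp.log_prob(y.view(-1, 1, 1, y.shape[-1])).sum(-1)   # (batch, K, M)
        log_comp = log_comp.flatten(start_dim=1)                          # (batch, K*M)
        return torch.logsumexp(torch.log(w) + log_comp, dim=-1)
```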
In that, the negative log-likelihood in (22) is minimized through numerical optimization. Due
to its superior performance in non-convex optimization problems, we employ stochastic gradient
descent in conjunction with the adaptive learning rate method Adam (Kingma and Ba, 2015).
A central issue when training high capacity function approximators such as neural networks is
determining the optimal degree of complexity of the model. Models with too limited capacity may
not be able to sufficiently capture the structure of the data, inducing a strong restriction bias.
On the other hand, if a model is too expressive it is prone to over-fit the training data, resulting
in poor generalization. This problem can be regarded as finding the right balance when trading
off variance against inductive bias. There exist many techniques allowing to control the trade-off
between bias and variance, involving various forms of regularization and data augmentation. For
an overview of regularization techniques, the interested reader is referred to Kukačka et al.
(2017). To a large degree, the contemporary practice of machine learning can be viewed as the
art of carefully engineering the right inductive bias for the problem at hand. This means, using
prior domain knowledge to select the right regularization terms and data augmentation methods,
attempting to minimize the variance of the learning algorithm while not imposing biases that guide
the learner away from good hypotheses.
Adding noise to the data during training can be viewed as a form of data augmentation and regularization that biases towards smooth functions (Webb, 1994; Bishop, 1994). In the domain of finance, assuming a smooth return distribution is reasonable. Hence, it is desirable to embed an inductive bias towards smoothness into the learning procedure in order to reduce the variance. Specifically, we add small perturbations in the form of random vectors ξ ∼ q(ξ) to the data: x̃_n = x_n + ξ_x and ỹ_n = y_n + ξ_y. Further, we assume that the noise is zero centered as well as
identically and independently distributed among the dimensions, with standard deviation η:
$$\mathbb{E}_{\xi \sim q(\xi)}[\xi] = 0 \quad \text{and} \quad \mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi \xi^{\top}\right] = \eta^2 I \qquad (23)$$
Before discussing the particular effects of randomly perturbing the data when fitting a conditional
density model p̂θ (y|x), we first analyze noise regularization in a more general case. Let LD (D) be
a loss function over a set of data points D = {x1 , ..., xN }, which can be partitioned into a sum of
losses corresponding to each data point xn :
$$\mathcal{L}_{\mathcal{D}}(\mathcal{D}) = \sum_{n=1}^{N} \mathcal{L}(x_n) \qquad (24)$$
The loss L(xn + ξ), resulting from adding random perturbations can be approximated by a second
order Taylor expansion around xn
$$\mathcal{L}(x_n + \xi) = \mathcal{L}(x_n) + \xi^{\top} \nabla_x \mathcal{L}(x)\big|_{x_n} + \frac{1}{2}\, \xi^{\top} \nabla_x^2 \mathcal{L}(x)\big|_{x_n}\, \xi + \mathcal{O}(\xi^3) \qquad (25)$$
Assuming that the noise ξ is small in its magnitude, O(ξ³) is negligible. Using the assumption about ξ in (23), the expected loss can be written as
$$\mathbb{E}_{\xi \sim q(\xi)}\!\left[\mathcal{L}(x_n + \xi)\right] \approx \mathcal{L}(x_n) + \frac{1}{2}\,\mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi^{\top} \mathbf{H}^{(n)} \xi\right] = \mathcal{L}(x_n) + \frac{\eta^2}{2}\, \mathrm{tr}\!\left(\mathbf{H}^{(n)}\right) \qquad (26)$$
where L(x_n) is the loss without noise and H^(n) = ∇_x² L(x)|_{x_n} the Hessian of L w.r.t. x, evaluated at x_n. This result has been obtained earlier by Webb (1994) and Bishop (1994). See Appendix.C for derivations.
Previous work (Webb, 1994; Bishop, 1994; An, 1996) has introduced noise regularization for
regression and classification problems. However, to our best knowledge, noise regularization has not
been used in the context of parametric density estimation. In the following, we derive and analyze
the effect of noise regularization w.r.t. maximum likelihood estimation of conditional densities.
When concerned with maximum likelihood estimation of a conditional density pθ (y|x), the loss
function coincides with the negative conditional log-likelihood L(yn , xn ) = − log p(yn |xn ). Let the
standard deviation of the additive data noise ξx , ξy be ηx and ηy respectively. Maximum likelihood
estimation (MLE) with data noise is equivalent to minimizing the loss
$$\mathcal{L}(\mathcal{D}) \approx -\sum_{n=1}^{N} \log p_{\theta}(y_n|x_n) + \sum_{n=1}^{N} \frac{\eta_y^2}{2}\, \mathrm{tr}\!\left(\mathbf{H}_y^{(n)}\right) + \sum_{n=1}^{N} \frac{\eta_x^2}{2}\, \mathrm{tr}\!\left(\mathbf{H}_x^{(n)}\right) \qquad (27)$$
$$= -\sum_{n=1}^{N} \log p_{\theta}(y_n|x_n) - \frac{\eta_y^2}{2} \sum_{n=1}^{N}\sum_{j=1}^{m} \frac{\partial^2 \log p_{\theta}(y|x)}{\partial y^{(j)} \partial y^{(j)}}\bigg|_{y=y_n} - \frac{\eta_x^2}{2} \sum_{n=1}^{N}\sum_{j=1}^{l} \frac{\partial^2 \log p_{\theta}(y|x)}{\partial x^{(j)} \partial x^{(j)}}\bigg|_{x=x_n} \qquad (28)$$
Figure 4. MDN density estimates trained without noise regularization (noise_std = 0.00) and with noise regularization (noise_std = 0.05 and 0.20); each panel shows the true and the estimated probability density over y.
In that, the first term corresponds to the standard MLE objective while the other two terms
constitute a smoothness regularization. The second term in (27) penalizes large negative second
derivatives of the conditional log density estimate log pθ (y|x) w.r.t. y. As the MLE objective
pushes the density estimate towards high densities and strong concavity in the data points yn , the
regularization term counteracts this tendency to over-fit and overall smoothes the fitted distribution.
The third term penalizes large negative second derivatives w.r.t. the conditional variable x, thereby
regularizing the sensitivity of the density estimate on changes in the conditional variable. This
smoothes the functional dependency of pθ (y|x) on x. As stated previously, the intensity of the
smoothness regularization can be controlled through the standard deviation (ηx and ηy ) of the
perturbations.
Figure 4 illustrates the effect of the introduced noise regularization scheme on MDN density
estimates. Plain maximum likelihood estimation (left) leads to strong over-fitting, resulting in a
spiky distribution that generalizes poorly beyond the training data. In contrast, training with
noise regularization (center and right) results in smoother density estimates that are closer to the
true conditional density. In Section V.C, a comprehensive empirical evaluation demonstrates the
efficacy and importance of noise regularization.
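Operationally, the noise regularization amounts to perturbing each minibatch before evaluating the negative log-likelihood, so that fresh noise is drawn in every gradient step. A minimal sketch, assuming an estimator with a log_prob method such as the MDN/KMN sketches above; eta_x and eta_y play the role of η_x and η_y.

```python
import torch

def train_step(model, optimizer, x_batch, y_batch, eta_x=0.2, eta_y=0.1):
    """One Adam/SGD step of MLE with noise regularization: the minibatch is
    perturbed by zero-centered Gaussian noise, cf. Eqs. (23) and (27)."""
    x_noisy = x_batch + eta_x * torch.randn_like(x_batch)   # x_n + xi_x
    y_noisy = y_batch + eta_y * torch.randn_like(y_batch)   # y_n + xi_y
    loss = -model.log_prob(x_noisy, y_noisy).mean()         # negative conditional log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage (with the MDN sketch above, hypothetical data loader):
# model = MDN(dim_x=1, dim_y=1)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# for x_batch, y_batch in loader:
#     train_step(model, optimizer, x_batch, y_batch)
```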
In many applications of machine learning and econometrics, the value range of raw data varies
widely. Significant differences in scale and range among features can lead to poor performance
of many learning algorithms. When the initial distribution during training is statistically too far
away from the actual data distribution, the training converges only slowly or may fail entirely.
Moreover, many hyperparameters of learning algorithms are often influenced by the value range of
learning features and targets. For instance, the efficacy of noise regularization, introduced in the
previous section, is susceptible to varying data ranges. If the training data has a large standard
deviation, the noise regularization with η = 0.1 has little effect, whereas in the opposite case with
data being in a very narrow range, the same regularization may strongly bias the density estimate.
In order to circumvent these and many more issues that arise due to different value ranges of the
data, a common practice in machine learning is to normalize the data so that it exhibits zero mean
and unit variance (Sola and Sevilla, 1997; Grus, 2015). While this practice is straightforward for
classification and regression problems, such a transformation requires further consideration in the
context of density estimation. The remainder of this section, elaborates on how to properly perform
data normalization for estimating conditional densities. In that, we view the data normalization as
change of variable and derive the respective density transformations that are necessary to recover
an estimate of the original data distribution.
Let D = {(x_n, y_n)}_{n=1}^N be a dataset where the tuples (x_n, y_n) ∼ p(x, y) are drawn from a joint distribution with density function p : R^l × R^m → R_+. In order to normalize the data D, we estimate the mean µ̂ and standard deviation σ̂ along each data dimension and then subtract the mean from the data points and divide by the standard deviation:
$$\tilde{x}_n = \mathrm{diag}(\hat{\sigma}_x)^{-1}(x_n - \hat{\mu}_x), \qquad \tilde{y}_n = \mathrm{diag}(\hat{\sigma}_y)^{-1}(y_n - \hat{\mu}_y) \qquad (29)$$
The normalization operations in (29) are linear transformations of the data. Subsequently, the
conditional density model is fitted on the normalized data, resulting in the estimated PDF q̂θ (ỹ|x̃).
However, when performing inference, one is interested in an unnormalized density estimate
p̂θ (y|x), corresponding to the conditional data distribution p(y|x). Thus, we have to transform the
learned distribution q̂θ (ỹ|x̃) so that it corresponds to p(y|x). In that, both the transformations
x → x̃ and y → ỹ must be accounted for.
The former is straightforward: Since the neural network is trained to receive normalized inputs
x̃, it is sufficient to transform the original inputs x to x̃ = diag(σ̂x )−1 (x − µ̂x ) before feeding them
into the network at inference time. In order to account for the linear transformation of y, we have
to use the change of variable formula since the volume of the probability density is not preserved
if σ_y ≠ 1. The change of variable formula can be stated as follows.
THEOREM 1: Let Ỹ be a continuous random variable with probability density function q(ỹ), and
let Y = v(Ỹ ) be an invertible function of Ỹ with inverse Ỹ = v −1 (Y ). The probability density
function p(y) of Y is:
$$p(y) = q\!\left(v^{-1}(y)\right) \cdot \left|\frac{d}{dy}\, v^{-1}(y)\right| \qquad (30)$$
In that, d/dy v^{-1}(y) is the determinant of the Jacobian, which is vital for adjusting the volume of q(v^{-1}(y)) so that ∫ p(y) dy = 1. In case of the proposed data normalization scheme, v is a linear function with inverse
$$v^{-1}(y) = \mathrm{diag}(\hat{\sigma}_y)^{-1}(y - \hat{\mu}_y) \qquad (31)$$
and, together with (30), p̂θ follows as
$$\hat{p}_{\theta}(y|x) = \left|\mathrm{diag}(\hat{\sigma}_y)^{-1}\right| \hat{q}_{\theta}(\tilde{y}|\tilde{x}) = \frac{1}{\prod_{j=1}^{l} \hat{\sigma}_y^{(j)}}\; \hat{q}_{\theta}(\tilde{y}|\tilde{x}) \qquad (32)$$
The above equation provides a simple method for recovering the unnormalized density estimate
from the normalized mixture density q̂θ (ỹ|x̃).
Alternatively, we can directly recover the conditional mixture parameters corresponding to
pθ (y|x). Let (w̃k , µ̃k , diag(σ̃k )) be the conditional parameters of the GMM corresponding to q(ỹ|x̃).
Based on the change of variable formula, Theorem 2 provides a simple recipe for re-parameterizing
the GMM so that it reflects the unnormalized conditional density. As special case of Theorem
2, with Σ = diag(σ̃) and B = diag(σ̂y ), the transformed GMM corresponding to p̂θ (y|x) has the
following parameters:
$$w_k = \tilde{w}_k \qquad (33)$$
$$\mu_k = \hat{\mu}_y + \mathrm{diag}(\hat{\sigma}_y)\, \tilde{\mu}_k \qquad (34)$$
$$\sigma_k = \mathrm{diag}(\hat{\sigma}_y)\, \tilde{\sigma}_k \qquad (35)$$
In its general form, Theorem 2 states that an affine transformation z = a + Bx of a GMM-distributed random variable again follows a GMM:
$$p(z) = \sum_{k=1}^{K} w_k\, \mathcal{N}\!\left(a + B\mu_k,\; B \Sigma_k B^{\top}\right). \qquad (37)$$
Overall, the training process with data normalization includes the following steps:
1. Estimate empirical unconditional mean µ̂x , µ̂y and standard deviation σ̂x , σ̂y of training
data
2. Normalize the training data: {(xn , yn )} → {(x̃n , ỹn )}
x̃n = diag(σ̂x )−1 (xn − µ̂x ) , ỹn = diag(σ̂y )−1 (yn − µ̂y ), n = 1, ..., N
3. Fit the conditional density model q̂θ (ỹ|x̃) using the normalized data
4. Transform the estimated density back into the original data space to obtain p̂θ (y|x). This
can be done by either
(a) directly transforming the mixture density q̂θ with the change of variable formula in (32)
or
(b) transforming the mixture density parameters outputted by the neural network according
to (33)-(35)
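The steps above translate directly into code. The following NumPy sketch covers steps 1, 2 and 4(b): estimating the normalization statistics, normalizing the data, and mapping the GMM parameters predicted on normalized data back to the original scale via (33)-(35). Function names are illustrative.

```python
import numpy as np

def fit_normalization(x_train, y_train):
    """Step 1: empirical means and standard deviations per data dimension."""
    return {'mu_x': x_train.mean(axis=0), 'sigma_x': x_train.std(axis=0),
            'mu_y': y_train.mean(axis=0), 'sigma_y': y_train.std(axis=0)}

def normalize(x, y, stats):
    """Step 2: z-score the data before fitting q_theta(y_tilde | x_tilde)."""
    x_tilde = (x - stats['mu_x']) / stats['sigma_x']
    y_tilde = (y - stats['mu_y']) / stats['sigma_y']
    return x_tilde, y_tilde

def denormalize_gmm(w_tilde, mu_tilde, sigma_tilde, stats):
    """Step 4(b): transform the predicted GMM parameters back, cf. Eqs. (33)-(35)."""
    w = w_tilde                                         # weights are unchanged
    mu = stats['mu_y'] + stats['sigma_y'] * mu_tilde    # shift and rescale the means
    sigma = stats['sigma_y'] * sigma_tilde              # rescale the standard deviations
    return w, mu, sigma
```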
A. Methodology
A.1. Density Simulation
In order to benchmark the proposed conditional density estimators and run experiments that
aim to answer different sets of questions, several data generating models (simulators) are employed.
The density simulations allow us to generate unlimited amounts of data, and, more importantly,
compute the statistical distance between the true conditional data distribution and the density
estimate. The density simulations, introduced in the remainder of this section, are inspired by
financial models and exhibit properties of empirical return distributions, such as negative skewness
and excess kurtosis.
A.1.1. ARMAJump
The underlying data generating process for this simulator is an AR(1) model with a jump compo-
nent. A new realization xt of the time-series can be described as follows:
In that, c ∈ R is the long run mean of the AR(1) process and α ∈ R constitutes the autoregressive
factor, describing how fast the AR(1) time series returns to its long run mean c. Typically an
ARMA process is perturbed by Gaussian White Noise σt with standard deviation σ ∈ R+ . We
add a jump component, that occurs with probability p and is indicated by the Bernoulli distributed
binary variable zt . If a jump occurs, a negative shock of the same magnitude as c is accompanied by
Gaussian noise with three times higher standard deviation than normal. The dynamic is a discrete
version of the class of affine jump diffusion models, which are heavily used in bond and option
pricing. Here, for each time period t, the conditional density p(xt |xt−1 ) shall be predicted. Note
Figure 5. Conditional probability densities of the density simulations for several values of the conditional variable x: (a) EconDensity, (b) ArmaJump, (c) SkewNormal, (d) GaussianMixture.
that in this case y corresponds to xt . The conditional density follows as mixture of two Gaussians:
p(xt |xt−1 ) = (1 − p)N (xt |µ = c(1 − α) + αxt−1 , σ) + pN (xt |µ = α(xt−1 − c), 3σ) (39)
Figure 5b depicts the ARMAJump conditional probability density for the time-series parameters
c = 0.1, α = 0.2, p = 0.1, σ = 0.05. As can be seen in the depiction, the conditional distribution
has a negative skewness, resulting from the jump component.
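Since the conditional density (39) is itself a two-component Gaussian mixture, the simulator can be sketched in a few lines; the parameter values below are the ones quoted for Figure 5b, and the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

c, alpha, p, sigma = 0.1, 0.2, 0.1, 0.05   # parameter values quoted for Figure 5b

def arma_jump_step(x_prev, rng):
    """Draw x_t given x_{t-1}: AR(1) dynamics plus an occasional negative jump."""
    if rng.random() < p:                                   # jump occurs (z_t = 1)
        return alpha * (x_prev - c) + 3 * sigma * rng.standard_normal()
    return c * (1 - alpha) + alpha * x_prev + sigma * rng.standard_normal()

def arma_jump_density(x_t, x_prev):
    """Conditional density p(x_t | x_{t-1}) as a two-component mixture, cf. Eq. (39)."""
    return ((1 - p) * norm.pdf(x_t, loc=c * (1 - alpha) + alpha * x_prev, scale=sigma)
            + p * norm.pdf(x_t, loc=alpha * (x_prev - c), scale=3 * sigma))
```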
A.1.2. EconDensity
This simple, economically inspired, distribution has the following data generating process (x, y) ∼
p(x, y):
and is illustrated in Figure 5a. One can imagine x to represent financial market volatility, which
is always positive with rare large realizations. y can be an arbitrary variable that is explained by
volatility. We choose a non-linear relationship between x and y to check how the estimators can
cope with that. To make things more difficult, the relationship between x and y becomes more
blurry at high x realizations, as expressed in a heteroscedastic σy , that is rising with x. This reflects
the common behaviour of higher noise in the estimators in times of high volatility.
A.1.3. GaussianMixture
The joint distribution p(x, y) follows a GMM. We assume that x ∈ Rm and y ∈ Rl can be factorized,
i.e.
$$p(x, y) = \sum_{k=1}^{K} w_k\, \mathcal{N}(y|\mu_{y,k}, \Sigma_{y,k})\, \mathcal{N}(x|\mu_{x,k}, \Sigma_{x,k}) \qquad (44)$$
When x and y can be factorized as in (44), the conditional density p(y|x) can be expressed as:
$$p(y|x) = \sum_{k=1}^{K} W_k(x)\, \mathcal{N}(y|\mu_{y,k}, \Sigma_{y,k}) \qquad (45)$$
$$W_k(x) = \frac{w_k\, \mathcal{N}(x|\mu_{x,k}, \Sigma_{x,k})}{\sum_{j=1}^{K} w_j\, \mathcal{N}(x|\mu_{x,j}, \Sigma_{x,j})} \qquad (46)$$
For details and derivations we refer the interested reader to Guang Sung (2004) and Gilardi et al.
(2002). Figure 5d depicts the conditional density of a GMM with 5 components (i.e. K = 5) and 1-dimensional x and y (i.e. l = m = 1).
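The conditional weights W_k(x) in (46) are simply the mixture weights re-scaled by each component's likelihood of x. A short NumPy/SciPy sketch for the univariate case (l = m = 1), with illustrative function names:

```python
import numpy as np
from scipy.stats import norm

def conditional_gmm_weights(x, weights, mu_x, sigma_x):
    """W_k(x) per Eq. (46): re-weight the mixture by each component's likelihood of x."""
    lik = weights * norm.pdf(x, loc=mu_x, scale=sigma_x)   # w_k * N(x | mu_{x,k}, sigma_{x,k})
    return lik / lik.sum()

def conditional_gmm_density(y, x, weights, mu_x, sigma_x, mu_y, sigma_y):
    """p(y|x) per Eq. (45) for a factorized joint GMM with scalar x and y."""
    W = conditional_gmm_weights(x, weights, mu_x, sigma_x)
    return np.sum(W * norm.pdf(y, loc=mu_y, scale=sigma_y))
```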
A.1.4. SkewNormal
The data generating process (x, y) ∼ p(x, y) resembles a bivariate joint-distribution, wherein x ∈ R
follows a normal distribution and y ∈ R a conditional skew-normal distribution (Anděl et al.,
1984). The parameters (ξ, ω, α) of the skew normal distribution are functionally dependent on x.
Specifically, the functional dependencies are the following:
$$x \sim \mathcal{N}\!\left(\mu = 0,\; \sigma = \tfrac{1}{2}\right) \qquad (47)$$
$$\xi(x) = a \cdot x + b, \qquad a, b \in \mathbb{R} \qquad (48)$$
$$\omega(x) = c \cdot x^2 + d, \qquad c, d \in \mathbb{R} \qquad (49)$$
$$\alpha(x) = \alpha_{low} + \frac{1}{1 + e^{-x}} \cdot (\alpha_{high} - \alpha_{low}) \qquad (50)$$
$$y \sim \mathrm{SkewNormal}\!\left(\xi(x), \omega(x), \alpha(x)\right) \qquad (51)$$
Accordingly, the conditional probability density p(y|x) corresponds to the skew normal density
function:
$$p(y|x) = \frac{2}{\omega(x)}\, \mathcal{N}\!\left(\frac{y - \xi(x)}{\omega(x)}\right) \Phi\!\left(\alpha(x)\, \frac{y - \xi(x)}{\omega(x)}\right) \qquad (52)$$
In that, N (·) denotes the density, and Φ(·) the cumulative distribution function of the standard
normal distribution. The shape parameter α(x) controls the skewness and kurtosis of the distri-
bution. We set αlow = −4 and αhigh = 0, giving p(y|x) a negative skewness that decreases as x
increases. This distribution will allow us to evaluate the performance of the density estimators in
presence of skewness, a phenomenon that we often observe in financial market variables. Figure 5c
illustrates the conditional skew normal distribution.
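The density (52) coincides with the parameterization of scipy.stats.skewnorm (shape α, location ξ, scale ω), so the simulator can be sketched directly with it. Only α_low and α_high below are taken from the text; the values chosen for a, b, c, d are illustrative placeholders.

```python
import numpy as np
from scipy.stats import skewnorm

a, b, c, d = 0.5, 0.0, 1.0, 0.5          # illustrative values for the free parameters
alpha_low, alpha_high = -4.0, 0.0        # values stated in the text

def skew_params(x):
    """Functional dependence of (xi, omega, alpha) on x, cf. Eqs. (48)-(50)."""
    xi = a * x + b
    omega = c * x ** 2 + d
    alpha = alpha_low + (alpha_high - alpha_low) / (1.0 + np.exp(-x))
    return xi, omega, alpha

def sample(n, rng):
    """Draw (x, y) pairs from the SkewNormal simulator, cf. Eqs. (47) and (51)."""
    x = rng.normal(loc=0.0, scale=0.5, size=n)
    xi, omega, alpha = skew_params(x)
    y = skewnorm.rvs(alpha, loc=xi, scale=omega, random_state=rng)
    return x, y

def conditional_density(y, x):
    """p(y|x), cf. Eq. (52)."""
    xi, omega, alpha = skew_params(x)
    return skewnorm.pdf(y, alpha, loc=xi, scale=omega)
```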
B. Evaluation Metrics
In order to assess the goodness of the estimated conditional densities, we measure the statistical
distance between the estimate and the true conditional probability density corresponding to the
introduced simulators. In particular, following Bakshi et al. (2017), the Hellinger distance
$$D_H(p \,\|\, q) = \sqrt{\frac{1}{2} \int_{\mathcal{Y}} \left(\sqrt{p(y)} - \sqrt{q(y)}\right)^2 dy} \qquad (53)$$
is used as evaluation metric. We choose the Hellinger Distance over other popular statistical
divergences, because it is symmetric and constrained to values between 0 and 1. Thus, it can be
better interpreted than the Kullback-Leibler divergence. Since the training data is simulated from
a joint distribution p(x, y), but the density estimates p̂(y|x) are conditional, we have to evaluate
the statistical distance across different conditional values x. For that, we uniformly sample 10
values for x between the 10%- and 90%-percentile of p(x), compute the Hellinger distance between
the estimated and true conditional density and finally average the conditional statistical distances.
If the dimensionality of Y is 1, the Hellinger distance is approximated with numerical integration via Gaussian quadrature. If dim(Y) ≥ 2, the integral in (53) is estimated via Monte Carlo integration with importance sampling. For details regarding the Monte Carlo integration, we refer to Appendix.E.
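For a one-dimensional Y, the Hellinger distance (53) between a true and an estimated conditional density at a fixed x can be approximated by numerical integration. The sketch below uses scipy.integrate.quad for simplicity, whereas the evaluation described above uses Gaussian quadrature; function names are illustrative.

```python
import numpy as np
from scipy.integrate import quad

def hellinger_1d(p, q, lower=-np.inf, upper=np.inf):
    """Hellinger distance between two univariate densities, cf. Eq. (53).

    p, q: callables returning the density value at a scalar y.
    """
    integrand = lambda y: 0.5 * (np.sqrt(p(y)) - np.sqrt(q(y))) ** 2
    integral, _ = quad(integrand, lower, upper)
    return np.sqrt(integral)

# usage: distance between a true conditional density and an estimate at a fixed x
# dist = hellinger_1d(lambda y: true_density(y, x), lambda y: estimated_density(y, x))
```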
In all experiments, 5 different random seeds are used for the data simulation and density esti-
mation. The reported Hellinger distances are averages over 5 random seeds. The translucent areas
in the plots depict the standard deviation among the seeds.
[Figure: heatmaps of the Hellinger distance of MDN estimates on the EconDensity, ArmaJump, GaussianMixture and SkewNormal simulations for combinations of x-noise and y-noise standard deviations (None, 0.01, 0.02, 0.05, 0.1, 0.2).]
[Figure: Hellinger distance versus the number of training samples (200 to 5000).]
Figure 8. Effect of data normalization. Goodness of MDN / KMN density estimate, fitted
with and without data normalization. The colored graphs display the Hellinger distance between
estimated and true density, averaged over 5 seeds, and the translucent areas the respective standard deviation across varying sample sizes. While the EconDensity has an unconditional standard deviation of 1, the other two density simulations have a substantially lower unconditional volatility, ca. 0.08 and 0.05.
might be slow or fail entirely to find a good fit. Figure 8 illustrates this phenomenon and emphasizes
the practical importance of proper data normalization.
In case of the EconDensity simulation, the conditional standard deviation of the simulation
density and the initial density estimate are similar. Both density estimation with and without
data normalization yield quite similar results. Yet, the data normalization consistently reduces the
Hellinger distance. The ArmaJump and SkewNormal density simulators have substantially smaller
conditional standard deviations, i.e. 12 - 20 times smaller than the EconDensity. Without the data
normalization scheme, the initial KMN/MDN density estimates exhibit a large statistical distance
to the true conditional density. As a result, the numerical optimization is not able to sufficiently fit
the density within 1000 training epochs. As can be seen in Figure 8, the resulting density estimates
are substantially offset compared to the estimates with data normalization.
• Mixture Density Network (MDN): As introduced in Section IV.A.1. The MDN is trained
with data normalization and noise regularization (ηx = 0.2, ηy = 0.1). For more details
regarding the neural network and training, we refer the interested reader to Appendix.G.
• Kernel Mixture Network (KMN): As introduced in Section IV.A.2. The KMN is trained
with data normalization and noise regularization (ηx = 0.2, ηy = 0.1). For more details, see
Appendix.G.
• Conditional Kernel Density Estimation (CKDE): This non-parametric conditional
density approach estimates both the joint probability p̂(x, y) and the marginal probability
p̂(x) with KDE (see Section II.A.2). The conditional density estimate follows as the density ratio p̂(y|x) = p̂(x, y)/p̂(x). For selecting the bandwidths h_x and h_y of the kernels, the rule-of-thumb of Silverman (1982) is employed (a small helper function is sketched after this list):
$$h = 1.06\, \hat{\sigma}\, N^{-\frac{1}{4+d}} \qquad (54)$$
In that, N denotes the number of samples, σ̂ the empirical standard deviation and d the
dimensionality of the data. The rule-of-thumb assumes that the data follows a normal distri-
bution. If this assumption holds, the selected bandwidth h is proven to be optimal w.r.t. the
IMSE criterion.
• Conditional Kernel Density Estimation with bandwidth selection via cross-validation
(CKDE-CV): Similar to the CKDE above, but the bandwidth parameters h_x and h_y are de-
termined with leave-one-out maximum likelihood cross-validation. See Li and Racine (2007)
for further details about the cross-validation-based bandwidth selection.
• ε-Neighborhood kernel density estimation (NKDE): A non-parametric method that considers only a local subset of training points in an ε-neighborhood of the query x to form a kernel density estimate of p(y|x). The rule-of-thumb is used for bandwidth selection. We
refer the interested reader to Appendix.H.1 for details.
• Least-Squares Conditional Density Estimation (LSCDE): A semi-parametric estima-
tor that computes the conditional density as linear combination of kernels (Sugiyama and
Takeuchi, 2010).
$$\hat{p}_{\alpha}(y|x) \propto \alpha^{\top} \phi(x, y) \qquad (55)$$
Due to its restriction to linear combinations of Gaussian kernel functions φ, the optimal
parameters α w.r.t. the IMSE objective can be computed in closed form. However, at the
same time, the linearity assumption makes the estimator less expressive than the KMN or
MDN. See Appendix.I for details.
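As referenced in the CKDE item above, the rule-of-thumb bandwidth (54) reduces to a short helper; the per-dimension treatment below is one illustrative way to apply it.

```python
import numpy as np

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth h = 1.06 * sigma_hat * N^(-1/(4+d)), cf. Eq. (54).

    data: array of shape (N, d); returns one bandwidth per dimension.
    """
    n, d = data.shape
    return 1.06 * data.std(axis=0) * n ** (-1.0 / (4 + d))
```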
Figure 9 depicts the evaluation results for the described estimators across different density
simulations and number of training samples. Due to its limited modelling capacity, LSCDE yields
poor estimates in all three evaluation cases and shows only minor improvements as the number of
samples increases. CKDE consistently outperforms NKDE. This may be ascribed to the locality of
the considered data neighborhoods of the training points that NKDE exhibits, whereas CKDE is
able to fully use the available data. Unsurprisingly, the version of CKDE with bandwidth selection
through cross-validation always improves upon CKDE with the rule-of-thumb.
In the EconDensity evaluation, CKDE achieves lower statistical distances for small sample sizes.
However, the neural network based estimators KMN and MDN gain upon CKDE as the sample
size increases and achieve similar results in case of 6000 samples. In the other two evaluation cases,
KMNs and MDNs consistently outperform the other estimators. This demonstrates that even for
Figure 9. Hellinger distance between estimated and true conditional density across the number of training samples (500 to 5000) for the EconDensity, ArmaJump and SkewNormal simulations; compared estimators: MDN, KMN, LSCDE, CKDE, CKDE-CV and NKDE.
small sample sizes, neural network based conditional density estimators can be an equipollent or
even superior alternative to well established non-parametric CDEs.
• log ret risk free 1d: risk-free 1-day log return, computed based on the overnight index swap rate (OIS) with 1 day maturity. The OIS rate r_f is transformed as log(r_f / 365 + 1).
• SVIX: 30-day option implied volatility3 (Whaley, 1993)
• Mkt-RF 10-day risk: risk of market return factor; sum of squared market returns over the
last 10 trading days
• SMB 10-day risk: SMB factor risk; sum of squared factor returns over the last 10 days
• HML 10-day risk: HML factor risk; sum of squared factor returns over the last 10 days
• WML 10-day risk: WML factor risk; sum of squared factor returns over the last 10 days
Overall, the target variable is one-dimensional, i.e. y ∈ Y ⊆ R, whereas the conditional variable
x constitutes a 14-dimensional vector, i.e. x ∈ X ⊆ R14 .
B. Evaluation Methodology
In order to assess the goodness of the different density estimators, out-of-sample validation is
used. In particular, the available data is split in a training set which has a proportion of 80 %
and a validation set consisting of the remaining 20 % of data points. It is important to note that
this split is done without shuffling since the time-series data may not be i.i.d. Hence the validation
set Dval is a consecutive series of data, corresponding to the 633 most recent trading-days. The
conditional density estimators are fitted with the training data, while the validation set is left out
during the training or model selection process. The validation data is only used for computing the
following goodness-of-fit measures:
3 The option implied moments are computed based on options with maturity in 30 days. Since the days to maturity vary, linear interpolation of the option implied moments, corresponding to different numbers of days till maturity, is used to compute an estimate for a maturity of 30 days.
• Avg. log-likelihood: Average conditional log likelihood of validation data
$$\frac{1}{|\mathcal{D}_{val}|} \sum_{(x,y) \in \mathcal{D}_{val}} \log \hat{p}(y|x) \qquad (56)$$
• RMSE mean: Root-Mean-Square-Error (RMSE) between the realized log-return and the
mean of the estimated conditional distribution. The estimated conditional mean is defined
as the expectation of y under the distribution p̂(y|x):
$$\hat{\mu}(x) = \int_{\mathcal{Y}} y\, \hat{p}(y|x)\, dy \qquad (57)$$
• RMSE Std: RMSE between the realized deviation from the predicted mean µ̂(x) and the
standard deviation of the conditional density estimate. The estimated conditional standard
deviation is defined as
$$\hat{\sigma}(x) = \sqrt{\int_{\mathcal{Y}} (y - \hat{\mu}(x))^2\, \hat{p}(y|x)\, dy} \qquad (59)$$
For details on the estimated conditional moments and the approximation of the associated integrals,
we refer the interested reader to Appendix.F.
Calculating the average log-likelihood is a common way of evaluating the goodness of a density
estimate (Rezende and Mohamed, 2015; Tansey et al., 2016; Trippe and Turner, 2018). The better
the estimated conditional density approximates the true distribution, the higher the out-of-sample
likelihood in expectation. Only if the estimator generalizes well beyond the training data can it assign high conditional probabilities to the left-out validation data.
In finance, return distributions are often characterized by their centered moments. The RMSEs
w.r.t. mean and standard deviation provide a quantitative measure for the predictive accuracy and
consistency w.r.t. the predictive uncertainty. Overall, the training of the estimators and calculation
of the goodness measures is performed with 5 different seeds. The reported results are averages
over the 5 seeds, alongside the respective standard deviation.
Estimator          Avg. log-likelihood   RMSE mean (10^-2)   RMSE std (10^-2)
CKDE               3.3368 ± 0.0000       0.6924 ± 0.0000     0.8086 ± 0.0000
NKDE               3.1171 ± 0.0000       1.0681 ± 0.0000     0.5570 ± 0.0000
LSCDE              3.5072 ± 0.0021       0.7105 ± 0.0047     0.5451 ± 0.0029
MDN w/o noise      3.2797 ± 0.2058       0.5279 ± 0.0075     0.3185 ± 0.0048
KMN w/o noise      3.3578 ± 0.0653       0.5903 ± 0.0339     0.3673 ± 0.0107
MDN w/ noise       3.7991 ± 0.0142       0.5224 ± 0.0019     0.3171 ± 0.0034
KMN w/ noise       3.8010 ± 0.0142       0.5342 ± 0.0062     0.3287 ± 0.0034
VII. Conclusion
This paper studies the use of neural networks for conditional density estimation. Addressing the
problem of over-fitting, we introduce a noise regularization method that leads to smooth density
estimates and improved generalization. Moreover, a normalization scheme which makes the model’s
hyper-parameters insensitive to differing value ranges is proposed. Corresponding experiments
showcase the effectiveness and practical importance of the presented approaches. In a benchmark
study, we demonstrate that our training methodology endows neural network based CDE with a
better out-of-sample performance than previous semi- and non-parametric methods. Overall, this
work establishes a practical framework for the successful application of neural network based CDE
in areas such as econometrics. Based on the promising results, we are convinced that the proposed
method enhances the econometric toolkit and thus advocate further research in this direction.
While this paper focuses on CDE with mixture densities, a promising avenue for future research
could be the use of normalizing flows as parametric density representation.
Appendix
Appendix A. Additional Data Generating Processes
Appendix A.1. ARMA Jump Diffusion Model
The underlying model for this simulator is a non-linear non Gaussian ARMA jump diffusion model
introduced by Christoffersen et al. (2016):
$$dx_t = (r - 0.5 V_t - \xi \lambda_t)\,dt + \sqrt{V_t}\left(\sqrt{1-\rho^2}\, dW_t^1 + \rho\, dW_t^2\right) + q_t\, dN_t$$
$$dV_t = \kappa_V(\theta_V - V_t)\,dt + \gamma\, dL_t + \xi_V \sqrt{V_t}\, dW_t^2$$
$$dL_t = \kappa_L(\theta_L - L_t)\,dt + \xi_L \sqrt{L_t}\, dW_t^3$$
$$d\Psi_t = \kappa_\Psi(\theta_\Psi - \Psi_t)\,dt + \xi_\Psi \sqrt{\Psi_t}\, dW_t^4 \qquad (1)$$
$$\lambda_t = \Psi_t + \gamma_V V_t + \gamma_L L_t$$
$$q_t \sim \mathcal{N}(\theta, \delta^2)$$
$$\xi = e^{\theta + \frac{\delta^2}{2}} - 1$$
where x_t resembles log stock returns, V_t is the spot variance, L_t an illiquidity factor and Ψ_t an unknown latent factor. V_t, L_t and Ψ_t are referred to as jump parameters. A parameterization can be taken from Christoffersen et al. (2016), but generally these parameters influence the role of jumps and non-normality.
$$\mathcal{L}_{\mathcal{D}}(\mathcal{D}) = \sum_{n=1}^{N} \mathcal{L}(x_n) \qquad (2)$$
Also, let each xn be perturbed by a random noise vector ξ ∼ q(ξ) with zero mean and i.i.d.
elements, i.e.
$$\mathbb{E}_{\xi \sim q(\xi)}[\xi] = 0 \quad \text{and} \quad \mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi \xi^{\top}\right] = \eta^2 I \qquad (3)$$
The resulting loss L(xn + ξ) can be approximated by a second order Taylor expansion around xn
$$\mathcal{L}(x_n + \xi) = \mathcal{L}(x_n) + \xi^{\top} \nabla_x \mathcal{L}(x)\big|_{x_n} + \frac{1}{2}\, \xi^{\top} \nabla_x^2 \mathcal{L}(x)\big|_{x_n}\, \xi + \mathcal{O}(\xi^3) \qquad (4)$$
Assuming that the noise ξ is small in its magnitude, O(ξ³) may be neglected. The expected loss under q(ξ) follows directly from (4):
$$\mathbb{E}_{\xi \sim q(\xi)}\!\left[\mathcal{L}(x_n + \xi)\right] = \mathcal{L}(x_n) + \mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi^{\top} \nabla_x \mathcal{L}(x)\big|_{x_n}\right] + \frac{1}{2}\,\mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi^{\top} \nabla_x^2 \mathcal{L}(x)\big|_{x_n}\, \xi\right] \qquad (5)$$
$$= \mathcal{L}(x_n) + \mathbb{E}_{\xi \sim q(\xi)}[\xi]^{\top}\, \nabla_x \mathcal{L}(x)\big|_{x_n} + \frac{1}{2}\,\mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi^{\top} \nabla_x^2 \mathcal{L}(x)\big|_{x_n}\, \xi\right] \qquad (6)$$
$$= \mathcal{L}(x_n) + \frac{1}{2}\,\mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi^{\top} \mathbf{H}^{(n)}\, \xi\right] \qquad (7)$$
$$= \mathcal{L}(x_n) + \frac{1}{2}\,\mathbb{E}_{\xi \sim q(\xi)}\!\left[\sum_{j}\sum_{k} \xi_j \xi_k\, \frac{\partial^2 \mathcal{L}(x)}{\partial x^{(j)} \partial x^{(k)}}\bigg|_{x_n}\right] \qquad (8)$$
$$= \mathcal{L}(x_n) + \frac{1}{2}\sum_{j} \mathbb{E}_{\xi}\!\left[\xi_j^2\right] \frac{\partial^2 \mathcal{L}(x)}{\partial x^{(j)} \partial x^{(j)}}\bigg|_{x_n} + \frac{1}{2}\sum_{j}\sum_{k \neq j} \mathbb{E}_{\xi}\!\left[\xi_j \xi_k\right] \frac{\partial^2 \mathcal{L}(x)}{\partial x^{(j)} \partial x^{(k)}}\bigg|_{x_n} \qquad (9)$$
$$= \mathcal{L}(x_n) + \frac{\eta^2}{2} \sum_{j} \frac{\partial^2 \mathcal{L}(x)}{\partial x^{(j)} \partial x^{(j)}}\bigg|_{x_n} \qquad (10)$$
$$= \mathcal{L}(x_n) + \frac{\eta^2}{2}\, \mathrm{tr}\!\left(\mathbf{H}^{(n)}\right) \qquad (11)$$
In that, L(x_n) is the loss without noise and H^(n) = ∇_x² L(x)|_{x_n} the Hessian of L at x_n. With ξ_j we denote the elements of the column vector ξ.
LEMMA 1: Let x ∈ S ⊆ R^n be a random variable with density p(x) and let z = a + Bx with invertible B ∈ R^{n×n}. Then z has density
$$q(z) = \frac{1}{|B|}\, p\!\left(B^{-1}(z - a)\right), \qquad z \in \{a + Bx \mid x \in S\}. \qquad (12)$$
PROOF 1 (Proof of Lemma 1): The Lemma directly follows from the change of variable theorem
(see Bishop (2006), page 18).
THEOREM 3: Let x ∈ R^n be a continuous random variable following a Gaussian Mixture Model (GMM), this is x ∼ p(x) with
$$p(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(\mu_k, \Sigma_k). \qquad (13)$$
Then the affinely transformed random variable z = a + Bx, with a ∈ R^n and invertible B ∈ R^{n×n}, again follows a Gaussian Mixture Model with density
$$p(z) = \sum_{k=1}^{K} w_k\, \mathcal{N}\!\left(a + B\mu_k,\; B \Sigma_k B^{\top}\right). \qquad (14)$$
PROOF 2 (Proof of Theorem 3): With x ∈ R^n following a Gaussian Mixture Model, its probability density function can be written as
$$p(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(\mu_k, \Sigma_k) = \frac{1}{(2\pi)^{\frac{n}{2}}} \sum_{k=1}^{K} w_k\, \frac{\exp\!\left(-\frac{1}{2}(x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k)\right)}{|\Sigma_k|^{\frac{1}{2}}} \qquad (15)$$
Applying Lemma 1 to the transformation z = a + Bx yields
$$p(z) = \frac{1}{(2\pi)^{\frac{n}{2}}} \sum_{k=1}^{K} w_k\, \frac{\exp\!\left(-\frac{1}{2}(B^{-1}z - B^{-1}a - \mu_k)^{\top} \Sigma_k^{-1} (B^{-1}z - B^{-1}a - \mu_k)\right)}{|B|\, |\Sigma_k|^{\frac{1}{2}}} \qquad (16)$$
$$= \frac{1}{(2\pi)^{\frac{n}{2}}} \sum_{k=1}^{K} w_k\, \frac{\exp\!\left(-\frac{1}{2}(z - (a + B\mu_k))^{\top} (B^{-1})^{\top} \Sigma_k^{-1} B^{-1} (z - (a + B\mu_k))\right)}{|B|\, |\Sigma_k|^{\frac{1}{2}}} \qquad (17)$$
$$= \frac{1}{(2\pi)^{\frac{n}{2}}} \sum_{k=1}^{K} w_k\, \frac{\exp\!\left(-\frac{1}{2}(z - (a + B\mu_k))^{\top} (B \Sigma_k B^{\top})^{-1} (z - (a + B\mu_k))\right)}{|B \Sigma_k B^{\top}|^{\frac{1}{2}}} \qquad (18)$$
$$= \sum_{k=1}^{K} w_k\, \mathcal{N}\!\left(a + B\mu_k,\; B \Sigma_k B^{\top}\right)$$
In that, UΩ is a uniform distribution over the set Ω. In practice the expected value Ex∼UΩ [f (x)]
can be estimated by uniformly drawing samples x1 , ...xN from Ω and averaging the function values.
$$\int_{\Omega} f(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \qquad x_i \sim \mathcal{U}_{\Omega} \qquad (20)$$
By the weak law of large numbers, the sample average in (20) is a consistent estimator of the integral. In many interesting cases, Ω is unbounded. For instance, one might want to estimate the moments ∫_{R^m} x^n p(x) dx of a real-valued random variable X ∈ R^m with probability density function p(x). Since there is no straightforward way to obtain uniform samples over an unbounded set, the simple Monte-Carlo integration technique in (20) cannot be employed in such cases. Instead, one draws samples from a non-uniform proposal distribution Q with density function q and support {x | q(x) > 0, x ∈ R^m} = Ω. The previous expectation over the uniform distribution can be
reformulated as an expectation under Q:
$$\int_{\Omega} f(x)\, dx = \mathbb{E}_{x \sim Q}\!\left[\frac{u(x)}{q(x)}\, f(x)\right] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{q(x_i)}, \qquad x_i \sim Q \qquad (21)$$
In that, u(x) denotes the density function of the uniform distribution. When samples are drawn
from a proposal distribution Q, the evaluated function values f (xi ) have to be weighted by the
inverse of the density q(xi ). In our implementation, we use a student-t distribution as proposal
distribution.
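A sketch of the importance-sampling estimator (21) with a Student-t proposal as described above; the degrees of freedom and the location/scale of the proposal are illustrative choices, not the values used in the experiments.

```python
import numpy as np
from scipy.stats import t as student_t

def importance_sampling_integral(f, n_samples=100_000, df=5, loc=0.0, scale=1.0, rng=None):
    """Estimate the integral of f over the real line by importance sampling with a
    Student-t proposal, weighting each draw by 1/q(x_i), cf. Eq. (21)."""
    rng = np.random.default_rng() if rng is None else rng
    proposal = student_t(df, loc=loc, scale=scale)
    x = proposal.rvs(size=n_samples, random_state=rng)
    return np.mean(f(x) / proposal.pdf(x))

# usage: second moment of a standard normal, i.e. integrand f(x) = x^2 * phi(x)
# from scipy.stats import norm
# m2 = importance_sampling_integral(lambda x: x ** 2 * norm.pdf(x))
```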
Our implementation only supports estimating skewness and kurtosis for univariate target variables,
i.e. dim(Y) = 1. If dim(Y) = 1, the integral is approximated with numerical integration, using the
Gaussian quadrature with 10000 reference points, for which the density values are calculated. If
dim(Y) > 1, we use Monte-Carlo integration with 100,000 samples (see Appendix.E).
In case of the KMN and MDN, the conditional distribution is a GMM. Thus, we can directly calculate mean and covariance from the GMM parameters outputted by the neural network. The mean follows straightforwardly as the weighted sum of the Gaussian component centers μ_k(x; θ):
\[
\hat{\mu}(x) = \sum_{k=1}^{K} w_k(x; \theta)\, \mu_k(x; \theta) \tag{27}
\]
\[
\widehat{\mathrm{Cov}}(x) = \sum_{k=1}^{K} w_k(x; \theta) \left[ \big(\mu_k(x; \theta) - \hat{\mu}(x)\big)\big(\mu_k(x; \theta) - \hat{\mu}(x)\big)^\top + \mathrm{diag}\!\big(\sigma_k(x; \theta)^2\big) \right] \tag{28}
\]
wherein the outer product accounts for the covariance that arises from the different locations of the components and the diagonal matrix for the inherent variance of each Gaussian component.
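A small NumPy sketch of (27) and (28), assuming the mixture weights, component centers and diagonal standard deviations predicted by the network for a fixed x are given as arrays; the toy values at the bottom are purely illustrative.

```python
import numpy as np

# Sketch of Eqs. (27)-(28): mean and covariance of the conditional GMM for a
# fixed x. The arrays (weights, centers, scales) stand in for the mixture
# parameters w_k(x; theta), mu_k(x; theta), sigma_k(x; theta) predicted by
# the network; the toy numbers below are purely illustrative.
def gmm_mean_cov(weights, centers, scales):
    mean = np.sum(weights[:, None] * centers, axis=0)                 # Eq. (27)
    diffs = centers - mean                                            # mu_k - mu_hat
    between = np.einsum('k,ki,kj->ij', weights, diffs, diffs)         # outer products
    within = np.diag(np.sum(weights[:, None] * scales ** 2, axis=0))  # diag(sigma_k^2)
    return mean, between + within                                     # Eq. (28)

w = np.array([0.3, 0.7])
mu = np.array([[0.0, 1.0], [2.0, -1.0]])
sigma = np.array([[0.5, 0.5], [1.0, 0.2]])
print(gmm_mean_cov(w, mu, sigma))
```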
For estimating the conditional density p(y|x), ε-neighbor kernel density estimation (NKDE) employs standard kernel density estimation in a local ε-neighborhood around a query point (x, y) (Sugiyama and Takeuchi, 2010).
NKDE is similar to CKDE in that it places kernels on the training data points to estimate the conditional probability density. However, rather than estimating both the joint probability p(x, y) and the marginal probability p(x), NKDE forms a density estimate by only considering a local subset of the training samples {(x_i, y_i)}_{i ∈ I_{x,ε}}, where I_{x,ε} is the set of sample indices such that ||x_i − x||_2 ≤ ε. The estimated density can be expressed as
\[
p(y|x) = \sum_{j \in I_{x,\epsilon}} w_j \prod_{i=1}^{l} \frac{1}{h^{(i)}}\, K\!\left(\frac{y^{(i)} - y_j^{(i)}}{h^{(i)}}\right) \tag{29}
\]
wherein w_j is the weight of the j-th kernel and K(z) a kernel function. In our implementation, K is the density function of a standard normal distribution. The weights w_j can either be uniform, i.e. w_j = 1/|I_{x,ε}|, or proportional to the distance ||x_j − x||. The vector of bandwidths h = (h^{(1)}, ..., h^{(l)})^⊤ can be determined with the rule-of-thumb (see Equation 17), where the number of samples is replaced by the average number of ε-neighbors \bar{N} in the training data:
\[
\bar{N} = \frac{1}{N} \sum_{n=1}^{N} \left( |I_{x_n, \epsilon}| - 1 \right) \tag{30}
\]
Alternatively, the bandwidths may be selected via leave-one-out maximum likelihood cross-validation.
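A bare-bones sketch of the ε-neighbor estimate in (29) with uniform weights and a Gaussian kernel might look as follows; the training arrays, bandwidth vector and ε are illustrative placeholders rather than the tuned values used in our experiments.

```python
import numpy as np
from scipy.stats import norm

# Bare-bones sketch of the eps-neighbor KDE in Eq. (29) with uniform weights
# and a Gaussian kernel. X_train, Y_train, the bandwidth vector h and eps are
# illustrative placeholders, not the tuned values of our experiments.
def nkde_pdf(y, x, X_train, Y_train, h, eps=0.4):
    idx = np.where(np.linalg.norm(X_train - x, axis=1) <= eps)[0]  # I_{x, eps}
    if idx.size == 0:
        return 0.0
    w = np.full(idx.size, 1.0 / idx.size)                          # uniform weights
    # product kernel over the l components of y, for every neighbor j
    k = np.prod(norm.pdf((y - Y_train[idx]) / h) / h, axis=1)
    return float(np.sum(w * k))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 1))
Y = 0.5 * X + rng.normal(scale=0.3, size=(500, 1))
print(nkde_pdf(np.array([0.1]), np.array([0.0]), X, Y, h=np.array([0.2])))
```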
Here, α = (α_1, ..., α_K)^⊤ are the learned parameters and φ(x, y) = (φ_1(x, y), ..., φ_K(x, y))^⊤ are kernel functions such that φ_k(x, y) ≥ 0 for all (x, y) ∈ X × Y.
The parameters α ∈ R^K are learned by minimizing the integrated squared error
\[
J(\alpha) = \iint \big( \hat{p}_\alpha(y\,|\,x) - p(y\,|\,x) \big)^2\, p(x)\, dx\, dy. \tag{32}
\]
After having obtained α̂ = arg min_α J(α) through training, the conditional density estimate can be computed as
\[
\hat{p}_\alpha(y \,|\, x = \tilde{x}) = \frac{\hat{\alpha}^\top \phi(\tilde{x}, y)}{\int \hat{\alpha}^\top \phi(\tilde{x}, y)\, dy}. \tag{33}
\]
Sugiyama and Takeuchi (2010) propose to use a Gaussian kernel with bandwidth parameter σ, which is also the choice in our implementation:
\[
\phi_k(x, y) = \exp\!\left( - \frac{\lVert x - x_k \rVert^2 + \lVert y - y_k \rVert^2}{2\sigma^2} \right),
\]
where (x_k, y_k) are kernel centers selected from the training samples.
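With this Gaussian kernel, the normalizer in (33) is available in closed form, since ∫ exp(−‖y − y_k‖²/(2σ²)) dy = (2πσ²)^{l/2}. The sketch below evaluates (33) under that assumption; the kernel centers, weights α̂ and bandwidth σ are placeholders for the quantities obtained from the fit.

```python
import numpy as np

# Sketch of Eq. (33) for the Gaussian kernel phi_k(x, y); the kernel centers
# (Xc, Yc), the weights alpha and the bandwidth sigma are placeholders for
# the quantities obtained from the LSCDE fit.
def lscde_conditional_pdf(y, x, alpha, Xc, Yc, sigma):
    kx = np.exp(-np.sum((Xc - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    ky = np.exp(-np.sum((Yc - y) ** 2, axis=1) / (2.0 * sigma ** 2))
    numer = np.sum(alpha * kx * ky)                  # alpha^T phi(x, y)
    # closed-form y-integral of the Gaussian kernel: (2 pi sigma^2)^(l/2)
    l = Yc.shape[1]
    denom = np.sum(alpha * kx) * (2.0 * np.pi * sigma ** 2) ** (l / 2)
    return numer / denom
```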
REFERENCES
Ambrogioni, Luca, Umut Güçlü, Marcel A. J. van Gerven, and Eric Maris, 2017, The Kernel Mixture Network: A Nonparametric Method for Conditional Density Estimation of Continuous Random Variables.
An, Guozhong, 1996, The Effects of Adding Noise During Backpropagation Training on a Generalization Performance, Neural Computation 8, 643–674.
Anděl, Jiří, Ivan Netuka, and Karel Zvára, 1984, On Threshold Autoregressive Processes, Kybernetika 20, 89–106.
Bakshi, Gurdip, Xiaohui Gao Bakshi, and George Panayotov, 2017, A Theory of Dissimilarity
Bakshi, Gurdip S., Nikunj Kapadia, and Dilip B. Madan, 2003, Stock Return Characteristics, Skew
Laws, and the Differential Pricing of Individual Equity Options, Review of Financial Studies .
Bishop, Chris M., 1995, Training with Noise is Equivalent to Tikhonov Regularization, Neural
Computation 7, 108–116.
Bollerslev, Tim, 1987, A Conditionally Heteroskedastic Time Series Model for Speculative Prices and Rates of Return, The Review of Economics and Statistics 69, 542–547.
Botev, Z. I., J. F. Grotowski, and D. P. Kroese, 2010, Kernel density estimation via diffusion, The Annals of Statistics 38, 2916–2957.
Bowman, Adrian W., 1984, An alternative method of cross-validation for the smoothing of density estimates, Biometrika 71, 353–360.
Cao, Ricardo, Antonio Cuevas, and Wensceslao González Manteiga, 1994, A comparative study of
several smoothing methods in density estimation, Computational Statistics & Data Analysis 17,
153–176.
Carhart, Mark M., 1997, On Persistence in Mutual Fund Performance, The Journal of Finance 52,
57–82.
Christoffersen, Peter, Bruno Feunou, Yoontae Jeon, and Chayawat Ornthanalai, 2016, Time-varying
De Gooijer, Jan G, and Dawit Zerom, 2003, On Conditional Density Estimation, Technical report.
Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio, 2017, Density estimation using Real NVP, in ICLR.
Duin, R. P. W., 1976, On the Choice of Smoothing Parameters for Parzen Estimators of Probability Density Functions, IEEE Transactions on Computers.
Engle, Robert F., 1982, Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation, Econometrica 50, 987–1007.
Fama, Eugene F., and Kenneth R. French, 1993, Common risk factors in the returns on stocks and bonds, Journal of Financial Economics 33, 3–56.
Fama, Eugene F., and Kenneth R. French, 2015, A five-factor asset pricing model, Journal of Financial Economics 116, 1–22.
Gallant, A. Ronald, David Hsieh, and George Tauchen, 1991, On Fitting a Recalcitrant Series: The Pound/Dollar Exchange Rate, 1974–83.
Gilardi, Nicolas, Samy Bengio, and Mikhail Kanevski, 2002, Conditional Gaussian Mixture Models for Environmental Risk Mapping.
Glosten, Lawrence R., Ravi Jagannathan, and David E. Runkle, 1993, On the Relation between the Expected Value and the Volatility of the Nominal Excess Return on Stocks, The Journal of Finance 48, 1779–1801.
Gormsen, Niels Joachim, and Christian Skov Jensen, 2017, Conditional Risk, SSRN Electronic
Journal .
Grus, Joel, 2015, Data Science from Scratch.
Guang Sung, Hsi, 2004, Gaussian Mixture Regression and Classification, Ph.D. thesis.
Hall, Peter, J. S. Marron, and Byeong U. Park, 1992, Smoothed cross-validation, Probability Theory and Related Fields 92, 1–20.
Hamilton, James D., 1994, Time Series Analysis (Princeton University Press).
Hansen, Bruce E., 1994, Autoregressive Conditional Density Estimation, International Economic Review 35, 705–730.
Harvey, Campbell R., and Akhtar Siddique, 1999, Autoregressive Conditional Skewness, The Journal of Financial and Quantitative Analysis 34, 465–487.
Holmstrom, L., and P. Koistinen, 1992, Using additive noise in back-propagation training, IEEE Transactions on Neural Networks 3, 24–38.
Hornik, Kurt, 1991, Approximation capabilities of multilayer feedforward networks, Neural Net-
works 4, 251–257.
Hyndman, Rob J., David M. Bashtannyk, and Gary K. Grunwald, 1996, Estimating and Visualizing Conditional Densities, Journal of Computational and Graphical Statistics 5, 315–336.
Jagannathan, Ravi, and Zhenyu Wang, 1996, The Conditional CAPM and the Cross-Section of Expected Returns, The Journal of Finance 51, 3–53.
Jondeau, Eric, and Michael Rockinger, 2003, Conditional volatility, skewness, and kurtosis: existence, persistence, and comovements, Journal of Economic Dynamics and Control 27, 1699–1737.
Kingma, Diederik P., and Jimmy Ba, 2015, Adam: A Method for Stochastic Optimization, in
ICLR.
Krogh, Anders, and John A. Hertz, 1992, A Simple Weight Decay Can Improve Generalization, Technical report.
Kukačka, Jan, Vladimir Golkov, and Daniel Cremers, 2017, Regularization for Deep Learning: A Taxonomy, Technical report.
Lewellen, Jonathan, and Stefan Nagel, 2006, The conditional CAPM does not explain asset-pricing anomalies, Journal of Financial Economics 82, 289–314.
Li, Jonathan Q., and Andrew R. Barron, 2000, Mixture Density Estimation, in NIPS.
Li, Qi, and Jeffrey S. Racine, 2007, Nonparametric Econometrics: Theory and Practice (Princeton University Press).
Mandelbrot, Benoit, 1967, The Variation of Some Other Speculative Prices, The Journal of Business
40, 393–413.
Mirza, Mehdi, and Simon Osindero, 2014, Conditional Generative Adversarial Nets, Technical
report.
Nelder, J. A., and R. Mead, 1965, A simplex method for function minimization, The Computer Journal 7, 308–313.
Nelson, Daniel B., and Charles Q. Cao, 1992, Inequality Constraints in the Univariate GARCH Model, Journal of Business & Economic Statistics 10, 229–235.
Park, Byeong U., and J. S. Marron, 1990, Comparison of data driven bandwidth selectors, Journal of the American Statistical Association 85, 66–72.
Parzen, Emanuel, 1962, On Estimation of a Probability Density Function and Mode, The Annals of Mathematical Statistics 33, 1065–1076.
Pfeiffer, K. P., 1985, Stepwise variable selection and maximum likelihood estimation of smoothing factors of kernel functions for nonparametric discriminant functions evaluated by different
Rezende, Danilo Jimenez, and Shakir Mohamed, 2015, Variational Inference with Normalizing Flows, in ICML.
Rosenblatt, Murray, 1956, Remarks on Some Nonparametric Estimates of a Density Function, The Annals of Mathematical Statistics 27, 832–837.
Rudemo, Mats, 1982, Empirical Choice of Histograms and Kernel Density Estimators.
Salimans, Tim, and Diederik P. Kingma, 2016, Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, in Advances in Neural Information Processing Systems.
Sarajedini, A., R. Hecht-Nielsen, and P.M. Chau, 1999, Conditional probability density function
estimation with sigmoidal neural networks, IEEE Transactions on Neural Networks 10, 231–238.
Sentana, E., 1995, Quadratic ARCH Models, The Review of Economic Studies 62, 639–661.
Sheather, S. J., and M. C. Jones, 1991, A Reliable Data-Based Bandwidth Selection Method for
Kernel Density Estimation, Journal of the Royal Statistical Society 53, 683–690.
Silverman, B., 1982, On the estimation of a probability density function by the maximum penalized likelihood method, The Annals of Statistics 10, 795–810.
Sohn, Kihyuk, Honglak Lee, and Xinchen Yan, 2015, Learning Structured Output Representa-
tion using Deep Conditional Generative Models, in Advances in Neural Information Processing
Systems, 3483–3491.
Sola, J., and J. Sevilla, 1997, Importance of input data normalization for the application of neural
networks to complex industrial problems, IEEE Transactions on Nuclear Science 44, 1464–1468.
Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, 2014, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Journal of Machine Learning Research 15, 1929–1958.
Sugiyama, Masashi, and Ichiro Takeuchi, 2010, Conditional density estimation via Least-Squares Density Ratio Estimation.
Tansey, Wesley, Karl Pichotta, and James G. Scott, 2016, Better Conditional Density Estimation for Neural Networks, Technical report.
Trippe, Brian L., and Richard E. Turner, 2018, Conditional Density Estimation with Bayesian Normalising Flows, Technical report.
Webb, A. R., 1994, Functional approximation by feed-forward networks: a least-squares approach to generalization, IEEE Transactions on Neural Networks 5, 363–371.
Whaley, Robert E., 1993, Derivatives on Market Volatility: Hedging Is Long Overdue, The Journal
of Derivatives .