Conditional Density Estimation With Neural Networks
ABSTRACT
Given a set of empirical observations, conditional density estimation aims to capture the sta-
tistical relationship between a conditional variable x and a dependent variable y by modeling their
conditional probability p(y|x). The paper develops best practices for conditional density estima-
tion for finance applications with neural networks, grounded on mathematical insights and empirical
evaluations. In particular, we introduce a noise regularization and data normalization scheme, alle-
viating problems with over-fitting, initialization and hyper-parameter sensitivity of such estimators.
We compare our proposed methodology with popular semi- and non-parametric density estimators,
underpin its effectiveness in various benchmarks on simulated and Euro Stoxx 50 data and show
its superior performance. Our methodology allows us to obtain high-quality estimators for statistical
expectations of higher moments, quantiles and non-linear return transformations, with very little
∗ denotes equal contribution
† The authors are with the Computational Risk and Asset Management Research Group at the Karlsruhe Institute of Technology (KIT), Germany. Reference to [email protected]
I. Introduction
A wide range of problems in econometrics and finance are concerned with describing the statis-
tical relationship between a vector of explanatory variables x and a dependent variable or vector y
of interest. While regression analysis aims to describe the conditional mean E[y|x], many problems
in risk and asset management require gaining insight about deviations from the mean and their
associated likelihood. The stochastic dependency of y on x can be fully described by modeling the
conditional probability density p(y|x). Inferring such a density function from a set of empirical
observations {(x_n, y_n)}_{n=1}^N is typically referred to as conditional density estimation (CDE).
We propose to use neural networks for estimating conditional densities. In particular, we discuss
two models in which a neural network controls the parameters of a Gaussian mixture. Namely,
these are the Mixture Density Network (MDN) by Bishop (1994) and the Kernel Mixture Network
(KMN) by Ambrogioni et al. (2017). When chosen expressive enough, such models can approximate
arbitrary conditional densities.
However, when combined with maximum likelihood estimation, this flexibility can result in over-
fitting and poor generalization beyond the training data. Addressing this issue, we develop a noise
regularization method for conditional density estimation. By adding small random perturbations
to the data during training, the conditional density estimate is smoothed and generalizes better.
In fact, we mathematically derive that adding noise during training is equivalent to penalizing the
second derivatives of the conditional log-probability. Graphically, the penalization punishes very
curved or even spiky density estimators in favor of smoother variants. Our experimental results
demonstrate the efficacy and importance of the noise regularization for attaining good out-of-sample
performance.
Moreover, we attend to further practical issues that arise due to different value ranges of the
training data. In this context, we introduce a simple data normalization scheme that fits the conditional density model on normalized data and, after training, transforms the density estimate so that it corresponds to the original data distribution. The normalization scheme makes the hyper-
parameters and initialization of the neural network based density estimator insensitive to differing
value ranges. Our empirical evaluations suggest that this increases the consistency of the training
results and significantly improves the estimator’s performance.
Aiming to compare our proposed approach against well-established CDE methods, we report a
comprehensive benchmark study on simulated densities as well as on EuroStoxx 50 returns. When
trained with noise regularization, both MDNs and KMNs are able to outperform previous standard
semi- and nonparametric conditional density estimators. Moreover, the results suggest that even
for small sample sizes, neural network based conditional density estimators can be an equal or
superior alternative to well established conditional kernel density estimators.
Our study adds to the econometric literature, which discusses two main approaches towards
CDE. The majority of financial research assumes that the conditional distribution follows a standard
parametric family (e.g. Gaussian) that captures the dependence of the distribution parameters on
x with a (partially) linear model. The widely used ARMA-GARCH time-series model (Engle, 1982;
Nelson and Cao, 1992) and many of its extensions (Glosten et al., 1993; Hansen et al., 1994; Sentana,
1995) fall into this category. However, inherent assumptions in many such models have been refuted
empirically later on (Harvey and Siddique, 1999; Jondeau and Rockinger, 2003). Other examples of this class of models are linear factor models (Fama and French, 1993; Carhart, 1997; Fama and French, 2015). Here, too, evidence for time variation in the betas of these factor models, as documented by Jagannathan and Wang (1996), Lewellen and Nagel (2006) or Gormsen and Jensen (2017), casts doubt on the actual existence of the stated linear relationships. Overall, it is unclear
to which degree the modelling restrictions are consistent with the actual mechanisms that generate
the empirical data and how much they bias the inference.
Another major strand of research approaches CDE from a nonparametric perspective, estimat-
ing the conditional density with kernel functions, centered in the data points (Hyndman et al., 1996;
Li and Racine, 2007). While kernel methods make few assumptions about functional relationships
and density shape, they typically suffer from poor generalization in the tail regions and from data
sparseness when dimensionality is high.
In contrast, CDE based on high-capacity function approximators such as neural networks has
received little attention in the econometric and finance community. Yet, they combine the global
generalization capabilities of parametric models with few restrictive assumptions regarding the
conditional density. Aiming to combine these two advantages, this work studies the use of neural
networks for estimating conditional densities. Overall, this paper establishes a sound framework
for fitting high-capacity conditional density models. Thanks to the presented noise regularization
and data normalization scheme, we are able to overcome common issues with neural network based
estimators and make the approach easy to use. The conditional density estimators are available as an open-source Python package (https://github.com/freelunchtheorem/Conditional_Density_Estimation).
II. Background
A. Density Estimation
Let X be a random variable with probability density function (PDF) p(x) defined over the
domain X . When investigating phenomena in the real world, the distribution of an observable
variable X is typically unknown. However, it is possible to observe realizations x_n ∼ p(x) of X. Given a collection D = {x_1, ..., x_N} of such observations, it is our aim to find a good estimate p̂(x) of the true density function p. First, we have to assess what a "good" estimate is. Throughout the density estimation literature (Bishop, 2006; Li and Racine, 2007; Shalizi, 2011), the two most popular ways of quantifying the goodness of a fitted distribution p̂ are the following:
1. The Integrated Mean Squared Error (IMSE) measures the squared distance between the true
density function and the estimate:
$$\mathrm{IMSE} = \int_{\mathcal{X}} \left|\hat{p}(x) - p(x)\right|^2 dx \qquad (1)$$
2. The Kullback-Leibler divergence / relative entropy measures the average log-likelihood ratio
between p(x) and p̂(x):
$$D_{KL}(p \,\|\, \hat{p}) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{\hat{p}(x)}\, dx \qquad (2)$$
Correspondingly, we aim to find a p̂ such that the selected error criterion is minimized.
In its most general form, density estimation aims to find the best p̂ among all possible PDFs
over the domain X , while only given a finite number of observations. Even in the simple case
X = R1 , this would require estimating infinitely many distribution parameters with a finite amount
of data, which is not feasible in practice. Hence, it is necessary to either restrict the space of
possible PDFs or to embed other assumptions into the density estimation. The kind of imposed
assumptions characterizes the distinction between the sub-fields of parametric and non-parametric
density estimation.
$$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \log \hat{p}_{\theta}(x_n) \qquad (5)$$
which is equivalent to minimizing the Kullback-Leibler divergence between the empirical data dis-
tribution (i.e. the weighted sum of point masses in the observations xn )
$$p_{\mathcal{D}}(x) = \frac{1}{N} \sum_{n=1}^{N} \delta(\|x - x_n\|) \qquad (6)$$
and the parametric distribution p̂θ :
$$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \log \hat{p}_{\theta}(x_n) \qquad (7)$$
$$\hat{p}(x) = \frac{1}{Nh} \sum_{n=1}^{N} \mathbb{1}(x_n \text{ is in the same bin as } x) \qquad (10)$$
In that, 1(·) denotes the indicator function and h the width or area of the bin. Though appeal-
ingly simple, histograms bear the major disadvantage of being discontinuous which hinders any
differentiation-based method. In addition, the density estimates are not centered around the query
x, making the estimates for queries that are close to the bin boundaries worse than queries close
to the bin center.
Kernel density estimators enjoy more popularity as they overcome the named disadvantages
of histograms. Such estimators simply replace the indicator function in (10) with a symmetric
density function K(z), the so-called kernel (Rosenblatt, 1956; Parzen, 1962). The resulting density
estimator for univariate distributions reads as follows:
$$\hat{p}(x) = \frac{1}{Nh} \sum_{n=1}^{N} K\!\left(\frac{x - x_n}{h}\right) \qquad (11)$$
Kernel density estimation (KDE) can be understood as placing a density function centered in each
data point xn and forming an equally weighted mixture of the N densities.
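To make this construction concrete, the following is a minimal NumPy sketch of a univariate kernel density estimator with Gaussian kernels, directly implementing (11) and (13). The bandwidth h is treated as a given input here, and all function names are illustrative rather than taken from the accompanying package.

```python
import numpy as np

def gaussian_kernel(z):
    """Standard normal density, cf. Eq. (13)."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def kde(x_query, x_train, h):
    """Univariate kernel density estimate p_hat(x), cf. Eq. (11).

    x_query: points at which the density is evaluated, shape (Q,)
    x_train: observed samples x_1, ..., x_N, shape (N,)
    h:       bandwidth, assumed to be chosen externally
    """
    z = (x_query[:, None] - x_train[None, :]) / h   # pairwise scaled distances, shape (Q, N)
    return gaussian_kernel(z).mean(axis=1) / h      # equally weighted mixture of N kernels

# usage: evaluate the density estimate of 1000 standard-normal samples on a grid
x_train = np.random.randn(1000)
grid = np.linspace(-4.0, 4.0, 200)
p_hat = kde(grid, x_train, h=0.3)
```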
Figure 1. Kernel density estimation with Gaussian kernels. The six red dashed curves depict the kernels, while the kernel density estimate p̂ is illustrated by the blue curve.
In the case of multivariate kernel density estimation, i.e. dim(X) = l > 1, the density can be estimated as
$$\hat{p}(x) = \prod_{j=1}^{l} \hat{p}(x^{(j)}) = \prod_{j=1}^{l} \frac{1}{N h^{(j)}} \sum_{n=1}^{N} K\!\left(\frac{x^{(j)} - x_n^{(j)}}{h^{(j)}}\right) \qquad (12)$$
In that, x(j) denotes the j-th element of the column vector x ∈ X ⊆ Rl and h(j) the bandwidth
corresponding to the j-th dimension.
One popular choice of K(·) is the Gaussian kernel:
$$K(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}} \qquad (13)$$
We now consider the distribution of a variable Y given x. Typically, Y is referred to as the dependent variable (i.e. explained variable) and X as the conditional (explanatory) variable. Given a dataset of observations D = {(x_n, y_n)}_{n=1}^N drawn from
the joint distribution (xn , yn ) ∼ p(x, y), the aim of conditional density estimation (CDE) is to find
an estimate p̂(y|x) of the true conditional density p(y|x).
In the context of conditional density estimation, the IMSE and DKL objectives are expressed
as expectation over p(x):
$$\mathrm{IMSE}_{y|x} = \int_{\mathcal{X}} \int_{\mathcal{Y}} \left|\hat{p}(y|x) - p(y|x)\right|^2 p(x)\, dy\, dx \qquad (14)$$
$$\mathbb{E}_{x \sim p(x)}\!\left[D_{KL}(p(y|x) \,\|\, \hat{p}(y|x))\right] = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(y|x) \log \frac{p(y|x)}{\hat{p}(y|x)}\, p(x)\, dy\, dx \qquad (15)$$
Similar to the unconditional case in (7) - (9), parametric maximum likelihood estimation following
from (15) can be expressed as
$$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \log \hat{p}_{\theta}(y_n | x_n) \qquad (16)$$
The nonparametric KDE approach, discussed in Section II.A.2 can be extended to the condi-
tional case. Typically, unconditional KDE is used to estimate both the joint density p̂(x, y) and
the marginal density p̂(x). Then, the conditional density estimate follows as the density ratio
$$\hat{p}(y|x) = \frac{\hat{p}(x, y)}{\hat{p}(x)} \qquad (17)$$
where both the numerator and the denominator are sums of kernel functions as in (12). For more
details on conditional kernel density estimation, we refer the interested reader to Li and Racine
(2007).
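The ratio construction in (17) can be sketched in a few lines for scalar x and y with Gaussian product kernels; the bandwidths h_x and h_y are assumed to be supplied, e.g. by the rule-of-thumb in (54) below. This is an illustrative sketch, not the reference implementation.

```python
import numpy as np

def _gauss(z):
    # standard normal density
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def ckde(y_query, x_query, y_train, x_train, hx, hy):
    """Conditional KDE p_hat(y|x) = p_hat(x, y) / p_hat(x), cf. Eq. (17),
    for univariate x and y with Gaussian product kernels."""
    kx = _gauss((x_query - x_train) / hx) / hx    # kernel weights in x, shape (N,)
    ky = _gauss((y_query - y_train) / hy) / hy    # kernel weights in y, shape (N,)
    return np.mean(kx * ky) / np.mean(kx)         # joint estimate divided by marginal estimate
```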
Gaussian through linear relationships (Engle, 1982; Hamilton, 1994).
Although the conditional distribution is Gaussian, autocorrelation of the variance makes it possible to account for volatility clustering (Mandelbrot, 1967) and kurtosis in the unconditional distribution.
Various generalizations of GARCH attempt to model asymmetric return distributions and negative
skewness (Nelson and Cao, 1992; Glosten et al., 1993; Sentana, 1995). Further work employs the
student-t distribution as conditional probability model (Bollerslev et al., 1987; Hansen et al., 1994)
and models the dependence of higher-order moments on the past (Gallant et al., 1991; Hansen
et al., 1994).
While the neural network based CDE approaches, which are presented in this paper, are also
parametric models, they make very few assumptions about the underlying relationships and den-
sity family. Both the relationship between the conditional variables and distribution parameters, as
well as the probability density itself are modelled with flexible function classes (i.e. neural network
and GMM). In contrast, traditional financial models impose strong assumptions such as linear re-
lationships and Gaussian conditional distributions. It is unclear to which degree such modelling
restrictions are consistent with the empirical data and how much they bias the inference.
Non-parametric CDE. A distinctly different line of work in econometrics aims to estimate
densities in a non-parametric manner. Originally introduced in Rosenblatt (1956); Parzen (1962),
KDE uses kernel functions to estimate the probability density at a query point, based on the
distance to all training points. In principle, kernel density estimators can approximate arbitrary
probability distributions and make no parametric assumptions about the shape of the density.
However, in practice, when data is finite, smoothing is required to achieve satisfactory general-
ization beyond the training data. The fundamental issue of KDE, commonly referred to as the
bandwidth selection problem, is choosing the appropriate amount of smoothing (Park et al., 1990;
Cao et al., 1994). Common bandwidth selection methods include rules-of-thumb (Silverman, 1982;
Sheather and Jones, 1991; Botev et al., 2010) and selectors based on cross-validation (Rudemo,
1982; Bowman, 1984; Hall et al., 1992).
In order to estimate conditional probabilities, previous work proposes to estimate both the joint and the marginal probability separately with KDE and then to compute the conditional probability as their ratio (Hyndman et al., 1996; Li and Racine, 2007). Alternatively, this can be interpreted as a
combination of kernel regression and kernel density estimation (De Gooijer and Zerom, 2003). Other
approaches combine non-parametric elements with parametric elements (Tresp, 2001; Sugiyama and
Takeuchi, 2010), forming semi-parametric conditional density estimators. Despite their theoretical
appeal, non-parametric density estimators suffer from the following drawbacks: First, they tend
to generalize poorly in regions where data is sparse which especially becomes evident in the tail
regions of the distribution. Second, their performance deteriorates quickly as the dimensionality
of the dependent variable increases. This phenomenon is commonly referred to as the ”curse of
dimensionality”.
CDE with neural networks: MDN, KMN, Normalizing Flows. The third line of work approaches conditional density estimation from a parametric perspective. However, in contrast
to parametric modelling in finance and econometrics, such methods use high-capacity function
approximators instead of strongly constrained parametric families. Our work builds upon the work
of Bishop (1994) and Ambrogioni et al. (2017), who propose to use a neural network to control
the parameters of a mixture density model. When both the neural network and the mixture of densities are chosen to be sufficiently expressive, any conditional probability distribution can be
approximated (Hornik, 1991; Li and Andrew Barron, 2000). Sarajedini et al. (1999) propose neural
networks that parameterize a generic exponential family distribution. However, this limits the
overall expressiveness of the conditional density estimator.
A recent trend in machine learning is the use of neural network based latent density models
(Mirza and Osindero, 2014; Sohn et al., 2015). Although such methods have proven successful
for estimating distributions of images, it is not possible to recover the PDF of such latent density
models. More promising in this sense are normalizing flows which use a sequence of parameterized
invertible maps to transform a simple distribution into more complex density functions (Rezende
and Mohamed, 2015; Dinh et al., 2017; Trippe and Turner, 2018). Since the PDF of normalizing
flows is tractable, this could be an interesting direction to supplement our work.
While neural network based density estimators make very few assumptions about the underlying density, they suffer from severe over-fitting when trained with the maximum likelihood objective. In order to counteract over-fitting, various regularization methods have been explored in the literature (Krogh and Hertz, 1992; Holmstrom and Koistinen, 1992; Webb, 1994; Srivastava
et al., 2014). However, these methods were developed with emphasis on regression and classification
problems. Our work focuses on the regularization of neural network based density estimators. In
that, we make use of the noise regularization framework (Webb, 1994; Bishop, 1995), discussing its
implications in the context of density estimation and empirically evaluating its efficacy.
Mixture Density Networks (MDNs) combine conventional neural networks with a mixture den-
sity model for the purpose of estimating conditional distributions p(y|x) (Bishop, 1994). In par-
ticular, the parameters of the unconditional mixture distribution p(y) are outputted by the neural
network, which takes the conditional variable x as input. The basic functioning of this framework
Figure 2. Illustration of a Mixture Density Network.
is illustrated in Figure 2. Given a mixture density with sufficiently many mixture components
and an expressive neural network that regresses into the parameter space of the density model,
MDNs can approximate arbitrary conditional distributions. The universal approximation property
of MDNs w.r.t. conditional densities follows directly from the universal function approximation
theorem for neural networks (Hornik, 1991) and the universal density approximation theorem for
mixture density models (Li and Andrew Barron, 2000).
For our purpose, we employ a Gaussian Mixture Model (GMM) with diagonal covariance ma-
trices as density model. The conditional density estimate p̂(y|x) follows as weighted sum of K
Gaussians
$$\hat{p}(y|x) = \sum_{k=1}^{K} w_k(x; \theta)\, \mathcal{N}\!\left(y \,\middle|\, \mu_k(x;\theta),\, \sigma_k^2(x;\theta)\right) \qquad (18)$$
wherein wk (x; θ) denote the weight, µk (x; θ) the mean and σk2 (x; θ) the variance of the k-th Gaussian
component. All the GMM parameters are governed by the neural network with parameters θ and
input x. It is possible to use a GMM with full covariance matrices Σ_k by having the neural network output the lower triangular entries of the respective Cholesky decompositions Σ_k^{1/2} (Tansey et al., 2016). However, we choose diagonal covariance matrices in order to avoid the quadratic increase in the neural network's output layer size as the dimensionality of Y increases. Assuming K mixture components and dim(Y) = l, the total number of neural network outputs is given by K(2l + 1) and thus grows only linearly in K and l.
The mixing weights w_k(x; θ) must resemble a multinomial distribution, i.e. it must hold that Σ_{k=1}^K w_k(x; θ) = 1 and w_k(x; θ) ≥ 0 ∀k. To satisfy these conditions, a softmax function is used:
$$w_k(x) = \frac{\exp(a_k^{w}(x))}{\sum_{i=1}^{K} \exp(a_i^{w}(x))} \qquad (19)$$
In that, a_k^w(x) ∈ R denote the logit scores emitted by the neural network. Similarly, the standard deviations σ_k(x) must be positive. To ensure that the respective neural network outputs satisfy the non-negativity constraint, an exponential non-linearity is applied:
$$\sigma_k(x) = \exp\!\left(a_k^{\sigma}(x)\right) \qquad (20)$$
Figure 3. Illustration of a Kernel Mixture Network.
Since the component means µk (x; θ) are not subject to such restrictions, we use a linear layer
without non-linearity for the respective output neurons.
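The parameterization described above can be sketched compactly. The following PyTorch snippet is an illustrative implementation of (18) and (19) with an exponential non-linearity for the standard deviations; it is not the authors' reference code, and the network size, K and all identifiers are placeholders. Its output layer has K + 2Kl = K(2l + 1) units, matching the count stated above.

```python
import torch
import torch.nn as nn

class MDN(nn.Module):
    """Mixture Density Network with K diagonal Gaussian components, cf. Eq. (18)."""
    def __init__(self, dim_x, dim_y, K=10, hidden=32):
        super().__init__()
        self.K, self.dim_y = K, dim_y
        self.body = nn.Sequential(nn.Linear(dim_x, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.logits = nn.Linear(hidden, K)             # a_k^w(x), fed into the softmax
        self.means = nn.Linear(hidden, K * dim_y)      # mu_k(x), no output non-linearity
        self.log_sigma = nn.Linear(hidden, K * dim_y)  # exponentiated to keep sigma_k(x) > 0

    def forward(self, x):
        h = self.body(x)
        w = torch.softmax(self.logits(h), dim=-1)                       # Eq. (19)
        mu = self.means(h).view(-1, self.K, self.dim_y)
        sigma = torch.exp(self.log_sigma(h)).view(-1, self.K, self.dim_y)
        return w, mu, sigma

    def log_prob(self, x, y):
        """log p_hat(y|x) for x of shape (batch, dim_x) and y of shape (batch, dim_y)."""
        w, mu, sigma = self(x)
        comp = torch.distributions.Normal(mu, sigma)
        log_comp = comp.log_prob(y.unsqueeze(1)).sum(-1)                # (batch, K), diagonal covariance
        return torch.logsumexp(torch.log(w) + log_comp, dim=-1)
```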
While MDNs resemble a purely parametric conditional density model, a closely related approach,
the Kernel Mixture Network (KMN), combines both non-parametric and parametric elements (Am-
brogioni et al., 2017). Similar to MDNs, a mixture density model of p̂(y) is combined with a neural
network which takes the conditional variable x as an input. However, the neural network only
controls the weights of the mixture components while the component centers and scales are fixed
w.r.t. x. Figuratively, one can imagine the neural network as choosing among a very large number of pre-existing kernel functions to build up the final combined density function. As is common in non-parametric density estimation, the components/kernels are placed in each of the training samples or a subset of the samples. For each of the kernel centers, one or multiple scale/bandwidth parameters σ_m are chosen. As for MDNs, we employ Gaussians as mixture components, wherein
the scale parameter directly coincides with the standard deviation. Figure 3 holds an illustration
of the KMN conditional density model.
Let K be the number of kernel centers µk and M the number of different kernel scales σm . The
KMN conditional density estimate reads as follows:
$$\hat{p}(y|x) = \sum_{k=1}^{K} \sum_{m=1}^{M} w_{k,m}(x; \theta)\, \mathcal{N}(y \,|\, \mu_k, \sigma_m^2) \qquad (21)$$
As previously, the weights wk,m must resemble a multinomial distribution. Hence, the output
non-linearity of the neural network is chosen as a softmax function. Ambrogioni et al. (2017)
propose to choose the kernel centers µk by subsampling the training data by recursively removing
each point yn that is closer than a constant δ to any of its predecessor points. This can be seen as
a naive form of clustering which depends on the ordering of the dataset. Instead, we suggest to use
a well-established clustering method such as K-means for selecting the kernel centers. The scales
of the Gaussian kernels can either be fixed or jointly trained with the neural network weights. In
practice, considering the scales as trainable parameters consistently improves the performance.
Overall, the KMN model is more restrictive than MDN as the locations and scales of the mixture
components are fixed during inference and cannot be controlled by the neural network. However,
due to the lower expressiveness of KMNs, they are less prone to over-fit than MDNs.
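A corresponding sketch of the KMN construction under the same caveats: kernel centers are chosen by K-means over the training targets (as suggested above, here via scikit-learn as an illustrative choice), the scales are trainable, and the network only outputs the K·M mixture weights.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def choose_centers(y_train, n_centers=50):
    """Select kernel centers mu_k by K-means clustering of the training targets."""
    km = KMeans(n_clusters=n_centers, n_init=10).fit(y_train.reshape(len(y_train), -1))
    return torch.as_tensor(km.cluster_centers_, dtype=torch.float32)      # (K, dim_y)

class KMN(nn.Module):
    """Kernel Mixture Network, cf. Eq. (21): the network only controls the weights."""
    def __init__(self, dim_x, centers, scales=(0.1, 0.5, 1.0), hidden=32):
        super().__init__()
        self.centers = centers                                            # fixed mu_k, shape (K, dim_y)
        self.log_scales = nn.Parameter(torch.log(torch.tensor(scales)))   # trainable sigma_m
        n_components = len(centers) * len(scales)
        self.net = nn.Sequential(nn.Linear(dim_x, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_components))

    def log_prob(self, x, y):
        """log p_hat(y|x) for x of shape (batch, dim_x) and y of shape (batch, dim_y)."""
        w = torch.softmax(self.net(x), dim=-1)                            # (batch, K*M) mixture weights
        sigma = torch.exp(self.log_scales)                                # (M,)
        comp = torch.distributions.Normal(self.centers.unsqueeze(1),      # (K, 1, dim_y)
                                          sigma.view(1, -1, 1))           # broadcast to (K, M, dim_y)
        log_comp = comp.log_prob(y.view(-1, 1, 1, y.shape[-1])).sum(-1)   # (batch, K, M)
        log_comp = log_comp.flatten(start_dim=1)                          # (batch, K*M)
        return torch.logsumexp(torch.log(w) + log_comp, dim=-1)
```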
In that, the negative log-likelihood in (22) is minimized through numerical optimization. Due
to its superior performance in non-convex optimization problems, we employ stochastic gradient
descent in conjunction with the adaptive learning rate method Adam (Kingma and Ba, 2015).
A central issue when training high capacity function approximators such as neural networks is
determining the optimal degree of complexity of the model. Models with too limited capacity may
not be able to sufficiently capture the structure of the data, inducing a strong restriction bias.
On the other hand, if a model is too expressive it is prone to over-fit the training data, resulting
in poor generalization. This problem can be regarded as finding the right balance when trading
off variance against inductive bias. There exist many techniques allowing to control the trade-off
between bias and variance, involving various forms of regularization and data augmentation. For
an overview of regularization techniques, the interested reader is referred to Kukačka et al.
(2017). To a large degree, the contemporary practice of machine learning can be viewed as the
art of carefully engineering the right inductive bias for the problem at hand. This means, using
prior domain knowledge to select the right regularization terms and data augmentation methods,
attempting to minimize the variance of the learning algorithm while not imposing biases that guide
the learner away from good hypotheses.
Adding noise to the data during training can be viewed as a form of data augmentation and regularization that biases towards smooth functions (Webb, 1994; Bishop, 1994). In the domain of finance, assuming a smooth return distribution is reasonable. Hence, it is desirable to embed an inductive bias towards smoothness into the learning procedure in order to reduce the variance. Specifically, we add small perturbations in the form of random vectors ξ ∼ q(ξ) to the data: x̃_n = x_n + ξ_x and ỹ_n = y_n + ξ_y. Further, we assume that the noise is zero centered as well as
identically and independently distributed among the dimensions, with standard deviation η:
$$\mathbb{E}_{\xi \sim q(\xi)}[\xi] = 0 \quad \text{and} \quad \mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi \xi^{\top}\right] = \eta^2 I \qquad (23)$$
Before discussing the particular effects of randomly perturbing the data when fitting a conditional
density model p̂θ (y|x), we first analyze noise regularization in a more general case. Let LD (D) be
a loss function over a set of data points D = {x1 , ..., xN }, which can be partitioned into a sum of
losses corresponding to each data point xn :
$$\mathcal{L}_{\mathcal{D}}(\mathcal{D}) = \sum_{n=1}^{N} \mathcal{L}(x_n) \qquad (24)$$
The loss L(xn + ξ), resulting from adding random perturbations can be approximated by a second
order Taylor expansion around xn
$$\mathcal{L}(x_n + \xi) = \mathcal{L}(x_n) + \xi^{\top} \nabla_x \mathcal{L}(x)\big|_{x_n} + \frac{1}{2}\, \xi^{\top} \nabla_x^2 \mathcal{L}(x)\big|_{x_n}\, \xi + \mathcal{O}(\xi^3) \qquad (25)$$
Assuming that the noise ξ is small in its magnitude, O(ξ³) is negligible. Using the assumption about ξ in (23), the expected loss can be written as
$$\mathbb{E}_{\xi \sim q(\xi)}\!\left[\mathcal{L}(x_n + \xi)\right] \approx \mathcal{L}(x_n) + \frac{1}{2}\,\mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi^{\top} \mathbf{H}^{(n)} \xi\right] = \mathcal{L}(x_n) + \frac{\eta^2}{2}\, \mathrm{tr}\!\left(\mathbf{H}^{(n)}\right) \qquad (26)$$
where L(x_n) is the loss without noise and H^(n) = ∇_x² L(x)|_{x_n} the Hessian of L w.r.t. x, evaluated at x_n. This result has been obtained earlier by Webb (1994) and Bishop (1994). See Appendix.C for derivations.
Previous work (Webb, 1994; Bishop, 1994; An, 1996) has introduced noise regularization for
regression and classification problems. However, to our best knowledge, noise regularization has not
been used in the context of parametric density estimation. In the following, we derive and analyze
the effect of noise regularization w.r.t. maximum likelihood estimation of conditional densities.
When concerned with maximum likelihood estimation of a conditional density pθ (y|x), the loss
function coincides with the negative conditional log-likelihood L(yn , xn ) = − log p(yn |xn ). Let the
standard deviation of the additive data noise ξx , ξy be ηx and ηy respectively. Maximum likelihood
estimation (MLE) with data noise is equivalent to minimizing the loss
$$\mathcal{L}(\mathcal{D}) \approx -\sum_{n=1}^{N} \log p_{\theta}(y_n|x_n) + \sum_{n=1}^{N} \frac{\eta_y^2}{2}\, \mathrm{tr}\!\left(\mathbf{H}_y^{(n)}\right) + \sum_{n=1}^{N} \frac{\eta_x^2}{2}\, \mathrm{tr}\!\left(\mathbf{H}_x^{(n)}\right) \qquad (27)$$
$$= -\sum_{n=1}^{N} \log p_{\theta}(y_n|x_n) - \frac{\eta_y^2}{2} \sum_{n=1}^{N}\sum_{j=1}^{m} \frac{\partial^2 \log p_{\theta}(y|x)}{\partial y^{(j)} \partial y^{(j)}}\bigg|_{y=y_n} - \frac{\eta_x^2}{2} \sum_{n=1}^{N}\sum_{j=1}^{l} \frac{\partial^2 \log p_{\theta}(y|x)}{\partial x^{(j)} \partial x^{(j)}}\bigg|_{x=x_n} \qquad (28)$$
Figure 4. MDN density estimates trained without noise regularization (noise_std = 0.00) and with noise regularization (noise_std = 0.05 and 0.20); each panel shows the true and the estimated probability density over y.
In that, the first term corresponds to the standard MLE objective while the other two terms
constitute a smoothness regularization. The second term in (27) penalizes large negative second
derivatives of the conditional log density estimate log pθ (y|x) w.r.t. y. As the MLE objective
pushes the density estimate towards high densities and strong concavity in the data points yn , the
regularization term counteracts this tendency to over-fit and overall smoothes the fitted distribution.
The third term penalizes large negative second derivatives w.r.t. the conditional variable x, thereby
regularizing the sensitivity of the density estimate on changes in the conditional variable. This
smoothes the functional dependency of pθ (y|x) on x. As stated previously, the intensity of the
smoothness regularization can be controlled through the standard deviation (ηx and ηy ) of the
perturbations.
Figure 4 illustrates the effect of the introduced noise regularization scheme on MDN density
estimates. Plain maximum likelihood estimation (left) leads to strong over-fitting, resulting in a
spiky distribution that generalizes poorly beyond the training data. In contrast, training with
noise regularization (center and right) results in smoother density estimates that are closer to the
true conditional density. In Section V.C, a comprehensive empirical evaluation demonstrates the
efficacy and importance of noise regularization.
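Operationally, the noise regularization amounts to perturbing each minibatch before evaluating the negative log-likelihood, so that fresh noise is drawn in every gradient step. A minimal sketch, assuming an estimator with a log_prob method such as the MDN/KMN sketches above; eta_x and eta_y play the role of η_x and η_y.

```python
import torch

def train_step(model, optimizer, x_batch, y_batch, eta_x=0.2, eta_y=0.1):
    """One Adam/SGD step of MLE with noise regularization: the minibatch is
    perturbed by zero-centered Gaussian noise, cf. Eqs. (23) and (27)."""
    x_noisy = x_batch + eta_x * torch.randn_like(x_batch)   # x_n + xi_x
    y_noisy = y_batch + eta_y * torch.randn_like(y_batch)   # y_n + xi_y
    loss = -model.log_prob(x_noisy, y_noisy).mean()         # negative conditional log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage (with the MDN sketch above, hypothetical data loader):
# model = MDN(dim_x=1, dim_y=1)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# for x_batch, y_batch in loader:
#     train_step(model, optimizer, x_batch, y_batch)
```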
In many applications of machine learning and econometrics, the value range of raw data varies
widely. Significant differences in scale and range among features can lead to poor performance
of many learning algorithms. When the initial distribution during training is statistically too far
away from the actual data distribution, the training converges only slowly or may fail entirely.
Moreover, many hyperparameters of learning algorithms are often influenced by the value range of
learning features and targets. For instance, the efficacy of noise regularization, introduced in the
previous section, is susceptible to varying data ranges. If the training data has a large standard
deviation, the noise regularization with η = 0.1 has little effect, whereas in the opposite case with
data being in a very narrow range, the same regularization may strongly bias the density estimate.
In order to circumvent these and many more issues that arise due to different value ranges of the
data, a common practice in machine learning is to normalize the data so that it exhibits zero mean
and unit variance (Sola and Sevilla, 1997; Grus, 2015). While this practice is straightforward for
classification and regression problems, such a transformation requires further consideration in the
context of density estimation. The remainder of this section, elaborates on how to properly perform
data normalization for estimating conditional densities. In that, we view the data normalization as
change of variable and derive the respective density transformations that are necessary to recover
an estimate of the original data distribution.
Let D = {(x_n, y_n)}_{n=1}^N be a dataset where the tuples (x_n, y_n) ∼ p(x, y) are drawn from a joint distribution with density function p : R^l × R^m → R_+. In order to normalize the data D, we estimate the mean µ̂ and standard deviation σ̂ along each data dimension and then subtract the mean from the data points and divide by the standard deviation:
$$\tilde{x}_n = \mathrm{diag}(\hat{\sigma}_x)^{-1}(x_n - \hat{\mu}_x), \qquad \tilde{y}_n = \mathrm{diag}(\hat{\sigma}_y)^{-1}(y_n - \hat{\mu}_y) \qquad (29)$$
The normalization operations in (29) are linear transformations of the data. Subsequently, the
conditional density model is fitted on the normalized data, resulting in the estimated PDF q̂θ (ỹ|x̃).
However, when performing inference, one is interested in an unnormalized density estimate
p̂θ (y|x), corresponding to the conditional data distribution p(y|x). Thus, we have to transform the
learned distribution q̂θ (ỹ|x̃) so that it corresponds to p(y|x). In that, both the transformations
x → x̃ and y → ỹ must be accounted for.
The former is straightforward: Since the neural network is trained to receive normalized inputs
x̃, it is sufficient to transform the original inputs x to x̃ = diag(σ̂x )−1 (x − µ̂x ) before feeding them
into the network at inference time. In order to account for the linear transformation of y, we have
to use the change of variable formula since the volume of the probability density is not preserved
if σ_y ≠ 1. The change of variable formula can be stated as follows.
THEOREM 1: Let Ỹ be a continuous random variable with probability density function q(ỹ), and
let Y = v(Ỹ ) be an invertible function of Ỹ with inverse Ỹ = v −1 (Y ). The probability density
function p(y) of Y is:
$$p(y) = q\!\left(v^{-1}(y)\right) \cdot \left|\frac{d}{dy}\, v^{-1}(y)\right| \qquad (30)$$
In that, d/dy v^{-1}(y) is the determinant of the Jacobian, which is vital for adjusting the volume of q(v^{-1}(y)) so that ∫ p(y) dy = 1. In case of the proposed data normalization scheme, v is a linear function with inverse
$$v^{-1}(y) = \mathrm{diag}(\hat{\sigma}_y)^{-1}(y - \hat{\mu}_y) \qquad (31)$$
and, together with (30), p̂θ follows as
$$\hat{p}_{\theta}(y|x) = \left|\mathrm{diag}(\hat{\sigma}_y)^{-1}\right| \hat{q}_{\theta}(\tilde{y}|\tilde{x}) = \frac{1}{\prod_{j=1}^{l} \hat{\sigma}_y^{(j)}}\; \hat{q}_{\theta}(\tilde{y}|\tilde{x}) \qquad (32)$$
The above equation provides a simple method for recovering the unnormalized density estimate
from the normalized mixture density q̂θ (ỹ|x̃).
Alternatively, we can directly recover the conditional mixture parameters corresponding to
pθ (y|x). Let (w̃k , µ̃k , diag(σ̃k )) be the conditional parameters of the GMM corresponding to q(ỹ|x̃).
Based on the change of variable formula, Theorem 2 provides a simple recipe for re-parameterizing
the GMM so that it reflects the unnormalized conditional density. As special case of Theorem
2, with Σ = diag(σ̃) and B = diag(σ̂y ), the transformed GMM corresponding to p̂θ (y|x) has the
following parameters:
$$w_k = \tilde{w}_k \qquad (33)$$
$$\mu_k = \hat{\mu}_y + \mathrm{diag}(\hat{\sigma}_y)\, \tilde{\mu}_k \qquad (34)$$
$$\sigma_k = \mathrm{diag}(\hat{\sigma}_y)\, \tilde{\sigma}_k \qquad (35)$$
In its general form, Theorem 2 states that an affine transformation z = a + Bx of a GMM-distributed random variable again follows a GMM:
$$p(z) = \sum_{k=1}^{K} w_k\, \mathcal{N}\!\left(a + B\mu_k,\; B \Sigma_k B^{\top}\right). \qquad (37)$$
Overall, the training process with data normalization includes the following steps:
1. Estimate empirical unconditional mean µ̂x , µ̂y and standard deviation σ̂x , σ̂y of training
data
2. Normalize the training data: {(xn , yn )} → {(x̃n , ỹn )}
x̃n = diag(σ̂x )−1 (xn − µ̂x ) , ỹn = diag(σ̂y )−1 (yn − µ̂y ), n = 1, ..., N
3. Fit the conditional density model q̂θ (ỹ|x̃) using the normalized data
4. Transform the estimated density back into the original data space to obtain p̂θ (y|x). This
can be done by either
(a) directly transforming the mixture density q̂θ with the change of variable formula in (32)
or
(b) transforming the mixture density parameters outputted by the neural network according
to (33)-(35)
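The steps above translate directly into code. The following NumPy sketch covers steps 1, 2 and 4(b): estimating the normalization statistics, normalizing the data, and mapping the GMM parameters predicted on normalized data back to the original scale via (33)-(35). Function names are illustrative.

```python
import numpy as np

def fit_normalization(x_train, y_train):
    """Step 1: empirical means and standard deviations per data dimension."""
    return {'mu_x': x_train.mean(axis=0), 'sigma_x': x_train.std(axis=0),
            'mu_y': y_train.mean(axis=0), 'sigma_y': y_train.std(axis=0)}

def normalize(x, y, stats):
    """Step 2: z-score the data before fitting q_theta(y_tilde | x_tilde)."""
    x_tilde = (x - stats['mu_x']) / stats['sigma_x']
    y_tilde = (y - stats['mu_y']) / stats['sigma_y']
    return x_tilde, y_tilde

def denormalize_gmm(w_tilde, mu_tilde, sigma_tilde, stats):
    """Step 4(b): transform the predicted GMM parameters back, cf. Eqs. (33)-(35)."""
    w = w_tilde                                         # weights are unchanged
    mu = stats['mu_y'] + stats['sigma_y'] * mu_tilde    # shift and rescale the means
    sigma = stats['sigma_y'] * sigma_tilde              # rescale the standard deviations
    return w, mu, sigma
```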
A. Methodology
A.1. Density Simulation
In order to benchmark the proposed conditional density estimators and run experiments that
aim to answer different sets of questions, several data generating models (simulators) are employed.
The density simulations allow us to generate unlimited amounts of data, and, more importantly,
compute the statistical distance between the true conditional data distribution and the density
estimate. The density simulations, introduced in the remainder of this section, are inspired by
financial models and exhibit properties of empirical return distributions, such as negative skewness
and excess kurtosis.
A.1.1. ARMAJump
The underlying data generating process for this simulator is an AR(1) model with a jump compo-
nent. A new realization xt of the time-series can be described as follows:
In that, c ∈ R is the long run mean of the AR(1) process and α ∈ R constitutes the autoregressive
factor, describing how fast the AR(1) time series returns to its long run mean c. Typically an
ARMA process is perturbed by Gaussian White Noise σt with standard deviation σ ∈ R+ . We
add a jump component, that occurs with probability p and is indicated by the Bernoulli distributed
binary variable zt . If a jump occurs, a negative shock of the same magnitude as c is accompanied by
Gaussian noise with three times higher standard deviation than normal. The dynamic is a discrete
version of the class of affine jump diffusion models, which are heavily used in bond and option
pricing. Here, for each time period t, the conditional density p(xt |xt−1 ) shall be predicted. Note
Figure 5. Conditional probability densities of the density simulations for several values of the conditional variable x: (a) EconDensity, (b) ArmaJump, (c) SkewNormal, (d) GaussianMixture.
that in this case y corresponds to xt . The conditional density follows as mixture of two Gaussians:
p(xt |xt−1 ) = (1 − p)N (xt |µ = c(1 − α) + αxt−1 , σ) + pN (xt |µ = α(xt−1 − c), 3σ) (39)
Figure 5b depicts the ARMAJump conditional probability density for the time-series parameters
c = 0.1, α = 0.2, p = 0.1, σ = 0.05. As can be seen in the depiction, the conditional distribution
has a negative skewness, resulting from the jump component.
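Since the conditional density (39) is itself a two-component Gaussian mixture, the simulator can be sketched in a few lines; the parameter values below are the ones quoted for Figure 5b, and the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

c, alpha, p, sigma = 0.1, 0.2, 0.1, 0.05   # parameter values quoted for Figure 5b

def arma_jump_step(x_prev, rng):
    """Draw x_t given x_{t-1}: AR(1) dynamics plus an occasional negative jump."""
    if rng.random() < p:                                   # jump occurs (z_t = 1)
        return alpha * (x_prev - c) + 3 * sigma * rng.standard_normal()
    return c * (1 - alpha) + alpha * x_prev + sigma * rng.standard_normal()

def arma_jump_density(x_t, x_prev):
    """Conditional density p(x_t | x_{t-1}) as a two-component mixture, cf. Eq. (39)."""
    return ((1 - p) * norm.pdf(x_t, loc=c * (1 - alpha) + alpha * x_prev, scale=sigma)
            + p * norm.pdf(x_t, loc=alpha * (x_prev - c), scale=3 * sigma))
```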
A.1.2. EconDensity
This simple, economically inspired, distribution has the following data generating process (x, y) ∼
p(x, y):
and is illustrated in Figure 5a. One can imagine x to represent financial market volatility, which
is always positive with rare large realizations. y can be an arbitrary variable that is explained by
volatility. We choose a non-linear relationship between x and y to check how the estimators can
cope with that. To make things more difficult, the relationship between x and y becomes more
blurry at high x realizations, as expressed in a heteroscedastic σy , that is rising with x. This reflects
the common behaviour of higher noise in the estimators in times of high volatility.
A.1.3. GaussianMixture
The joint distribution p(x, y) follows a GMM. We assume that x ∈ Rm and y ∈ Rl can be factorized,
i.e.
$$p(x, y) = \sum_{k=1}^{K} w_k\, \mathcal{N}(y|\mu_{y,k}, \Sigma_{y,k})\, \mathcal{N}(x|\mu_{x,k}, \Sigma_{x,k}) \qquad (44)$$
When x and y can be factorized as in (44), the conditional density p(y|x) can be expressed as:
$$p(y|x) = \sum_{k=1}^{K} W_k(x)\, \mathcal{N}(y|\mu_{y,k}, \Sigma_{y,k}) \qquad (45)$$
$$W_k(x) = \frac{w_k\, \mathcal{N}(x|\mu_{x,k}, \Sigma_{x,k})}{\sum_{j=1}^{K} w_j\, \mathcal{N}(x|\mu_{x,j}, \Sigma_{x,j})} \qquad (46)$$
For details and derivations we refer the interested reader to Guang Sung (2004) and Gilardi et al.
(2002). Figure 5d depicts the conditional density of a GMM with 5 components (i.e. K = 5) and 1-dimensional x and y (i.e. l = m = 1).
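The conditional weights W_k(x) in (46) are simply the mixture weights re-scaled by each component's likelihood of x. A short NumPy/SciPy sketch for the univariate case (l = m = 1), with illustrative function names:

```python
import numpy as np
from scipy.stats import norm

def conditional_gmm_weights(x, weights, mu_x, sigma_x):
    """W_k(x) per Eq. (46): re-weight the mixture by each component's likelihood of x."""
    lik = weights * norm.pdf(x, loc=mu_x, scale=sigma_x)   # w_k * N(x | mu_{x,k}, sigma_{x,k})
    return lik / lik.sum()

def conditional_gmm_density(y, x, weights, mu_x, sigma_x, mu_y, sigma_y):
    """p(y|x) per Eq. (45) for a factorized joint GMM with scalar x and y."""
    W = conditional_gmm_weights(x, weights, mu_x, sigma_x)
    return np.sum(W * norm.pdf(y, loc=mu_y, scale=sigma_y))
```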
A.1.4. SkewNormal
The data generating process (x, y) ∼ p(x, y) resembles a bivariate joint-distribution, wherein x ∈ R
follows a normal distribution and y ∈ R a conditional skew-normal distribution (Anděl et al.,
1984). The parameters (ξ, ω, α) of the skew normal distribution are functionally dependent on x.
Specifically, the functional dependencies are the following:
$$x \sim \mathcal{N}\!\left(\mu = 0,\; \sigma = \tfrac{1}{2}\right) \qquad (47)$$
$$\xi(x) = a \cdot x + b, \qquad a, b \in \mathbb{R} \qquad (48)$$
$$\omega(x) = c \cdot x^2 + d, \qquad c, d \in \mathbb{R} \qquad (49)$$
$$\alpha(x) = \alpha_{low} + \frac{1}{1 + e^{-x}} \cdot (\alpha_{high} - \alpha_{low}) \qquad (50)$$
$$y \sim \mathrm{SkewNormal}\!\left(\xi(x), \omega(x), \alpha(x)\right) \qquad (51)$$
Accordingly, the conditional probability density p(y|x) corresponds to the skew normal density
function:
$$p(y|x) = \frac{2}{\omega(x)}\, \mathcal{N}\!\left(\frac{y - \xi(x)}{\omega(x)}\right) \Phi\!\left(\alpha(x)\, \frac{y - \xi(x)}{\omega(x)}\right) \qquad (52)$$
In that, N (·) denotes the density, and Φ(·) the cumulative distribution function of the standard
normal distribution. The shape parameter α(x) controls the skewness and kurtosis of the distri-
bution. We set αlow = −4 and αhigh = 0, giving p(y|x) a negative skewness that decreases as x
increases. This distribution will allow us to evaluate the performance of the density estimators in
presence of skewness, a phenomenon that we often observe in financial market variables. Figure 5c
illustrates the conditional skew normal distribution.
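The density (52) coincides with the parameterization of scipy.stats.skewnorm (shape α, location ξ, scale ω), so the simulator can be sketched directly with it. Only α_low and α_high below are taken from the text; the values chosen for a, b, c, d are illustrative placeholders.

```python
import numpy as np
from scipy.stats import skewnorm

a, b, c, d = 0.5, 0.0, 1.0, 0.5          # illustrative values for the free parameters
alpha_low, alpha_high = -4.0, 0.0        # values stated in the text

def skew_params(x):
    """Functional dependence of (xi, omega, alpha) on x, cf. Eqs. (48)-(50)."""
    xi = a * x + b
    omega = c * x ** 2 + d
    alpha = alpha_low + (alpha_high - alpha_low) / (1.0 + np.exp(-x))
    return xi, omega, alpha

def sample(n, rng):
    """Draw (x, y) pairs from the SkewNormal simulator, cf. Eqs. (47) and (51)."""
    x = rng.normal(loc=0.0, scale=0.5, size=n)
    xi, omega, alpha = skew_params(x)
    y = skewnorm.rvs(alpha, loc=xi, scale=omega, random_state=rng)
    return x, y

def conditional_density(y, x):
    """p(y|x), cf. Eq. (52)."""
    xi, omega, alpha = skew_params(x)
    return skewnorm.pdf(y, alpha, loc=xi, scale=omega)
```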
B. Evaluation Metrics
In order to assess the goodness of the estimated conditional densities, we measure the statistical
distance between the estimate and the true conditional probability density corresponding to the
introduced simulators. In particular, following Bakshi et al. (2017), the Hellinger distance
$$D_H(p \,\|\, q) = \sqrt{\frac{1}{2} \int_{\mathcal{Y}} \left(\sqrt{p(y)} - \sqrt{q(y)}\right)^2 dy} \qquad (53)$$
is used as evaluation metric. We choose the Hellinger Distance over other popular statistical
divergences, because it is symmetric and constrained to values between 0 and 1. Thus, it can be
better interpreted than the Kullback-Leibler divergence. Since the training data is simulated from
a joint distribution p(x, y), but the density estimates p̂(y|x) are conditional, we have to evaluate
the statistical distance across different conditional values x. For that, we uniformly sample 10
values for x between the 10%- and 90%-percentile of p(x), compute the Hellinger distance between
the estimated and true conditional density and finally average the conditional statistical distances.
If the dimensionality of Y is 1, the Hellinger distance is approximated with numerical integration via Gaussian quadrature. If dim(Y) ≥ 2, the integral in (53) is estimated via Monte Carlo integration with importance sampling. For details regarding the Monte Carlo integration, we refer to Appendix.E.
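For a one-dimensional Y, the Hellinger distance (53) between a true and an estimated conditional density at a fixed x can be approximated by numerical integration. The sketch below uses scipy.integrate.quad for simplicity, whereas the evaluation described above uses Gaussian quadrature; function names are illustrative.

```python
import numpy as np
from scipy.integrate import quad

def hellinger_1d(p, q, lower=-np.inf, upper=np.inf):
    """Hellinger distance between two univariate densities, cf. Eq. (53).

    p, q: callables returning the density value at a scalar y.
    """
    integrand = lambda y: 0.5 * (np.sqrt(p(y)) - np.sqrt(q(y))) ** 2
    integral, _ = quad(integrand, lower, upper)
    return np.sqrt(integral)

# usage: distance between a true conditional density and an estimate at a fixed x
# dist = hellinger_1d(lambda y: true_density(y, x), lambda y: estimated_density(y, x))
```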
In all experiments, 5 different random seeds are used for the data simulation and density esti-
mation. The reported Hellinger distances are averages over 5 random seeds. The translucent areas
in the plots depict the standard deviation among the seeds.
[Figure: heatmaps of the Hellinger distance of MDN estimates on the EconDensity, ArmaJump, GaussianMixture and SkewNormal simulations for combinations of x-noise and y-noise standard deviations (None, 0.01, 0.02, 0.05, 0.1, 0.2).]
[Figure: Hellinger distance versus the number of training samples (200 to 5000).]
Figure 8. Effect of data normalization. Goodness of MDN / KMN density estimate, fitted
with and without data normalization. The colored graphs display the Hellinger distance between
estimated and true density, averaged over 5 seeds, and the translucent areas the respective standard deviation across varying sample sizes. While the EconDensity has an unconditional standard deviation of 1, the other two density simulations have a substantially lower unconditional volatility, ca. 0.08 and 0.05.
might be slow or fail entirely to find a good fit. Figure 8 illustrates this phenomenon and emphasizes
the practical importance of proper data normalization.
In case of the EconDensity simulation, the conditional standard deviation of the simulation
density and the initial density estimate are similar. Both density estimation with and without
data normalization yield quite similar results. Yet, the data normalization consistently reduces the
Hellinger distance. The ArmaJump and SkewNormal density simulators have substantially smaller
conditional standard deviations, i.e. 12 - 20 times smaller than the EconDensity. Without the data
normalization scheme, the initial KMN/MDN density estimates exhibit a large statistical distance
to the true conditional density. As a result, the numerical optimization is not able to sufficiently fit
the density within 1000 training epochs. As can be seen in Figure 8, the resulting density estimates
are substantially offset compared to the estimates with data normalization.
• Mixture Density Network (MDN): As introduced in Section IV.A.1. The MDN is trained
with data normalization and noise regularization (ηx = 0.2, ηy = 0.1). For more details
regarding the neural network and training, we refer the interested reader to Appendix.G.
• Kernel Mixture Network (KMN): As introduced in Section IV.A.2. The KMN is trained
with data normalization and noise regularization (ηx = 0.2, ηy = 0.1). For more details, see
Appendix.G.
• Conditional Kernel Density Estimation (CKDE): This non-parametric conditional
density approach estimates both the joint probability p̂(x, y) and the marginal probability
p̂(x) with KDE (see Section II.A.2). The conditional density estimate follows as the density ratio p̂(y|x) = p̂(x, y)/p̂(x). For selecting the bandwidths h_x and h_y of the kernels, the rule-of-thumb of Silverman (1982) is employed (a small helper function is sketched after this list):
$$h = 1.06\, \hat{\sigma}\, N^{-\frac{1}{4+d}} \qquad (54)$$
In that, N denotes the number of samples, σ̂ the empirical standard deviation and d the
dimensionality of the data. The rule-of-thumb assumes that the data follows a normal distri-
bution. If this assumption holds, the selected bandwidth h is proven to be optimal w.r.t. the
IMSE criterion.
• Conditional Kernel Density Estimation with bandwidth selection via cross-validation
(CKDE-CV): Similar to the CKDE above, but the bandwidth parameters h_x and h_y are de-
termined with leave-one-out maximum likelihood cross-validation. See Li and Racine (2007)
for further details about the cross-validation-based bandwidth selection.
• ε-Neighborhood kernel density estimation (NKDE): A non-parametric method that considers only a local subset of training points in an ε-neighborhood of the query x to form a kernel density estimate of p(y|x). The rule-of-thumb is used for bandwidth selection. We
refer the interested reader to Appendix.H.1 for details.
• Least-Squares Conditional Density Estimation (LSCDE): A semi-parametric estima-
tor that computes the conditional density as linear combination of kernels (Sugiyama and
Takeuchi, 2010).
$$\hat{p}_{\alpha}(y|x) \propto \alpha^{\top} \phi(x, y) \qquad (55)$$
Due to its restriction to linear combinations of Gaussian kernel functions φ, the optimal
parameters α w.r.t. the IMSE objective can be computed in closed form. However, at the
same time, the linearity assumption makes the estimator less expressive than the KMN or
MDN. See Appendix.I for details.
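As referenced in the CKDE item above, the rule-of-thumb bandwidth (54) reduces to a short helper; the per-dimension treatment below is one illustrative way to apply it.

```python
import numpy as np

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth h = 1.06 * sigma_hat * N^(-1/(4+d)), cf. Eq. (54).

    data: array of shape (N, d); returns one bandwidth per dimension.
    """
    n, d = data.shape
    return 1.06 * data.std(axis=0) * n ** (-1.0 / (4 + d))
```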
Figure 9 depicts the evaluation results for the described estimators across different density
simulations and number of training samples. Due to its limited modelling capacity, LSCDE yields
poor estimates in all three evaluation cases and shows only minor improvements as the number of
samples increases. CKDE consistently outperforms NKDE. This may be ascribed to the locality of
the considered data neighborhoods of the training points that NKDE exhibits, whereas CKDE is
able to fully use the available data. Unsurprisingly, the version of CKDE with bandwidth selection
through cross-validation always improves upon CKDE with the rule-of-thumb.
In the EconDensity evaluation, CKDE achieves lower statistical distances for small sample sizes.
However, the neural network based estimators KMN and MDN gain upon CKDE as the sample
size increases and achieve similar results in case of 6000 samples. In the other two evaluation cases,
KMNs and MDNs consistently outperform the other estimators. This demonstrates that even for
Figure 9. Hellinger distance between estimated and true conditional density across the number of training samples (500 to 5000) for the EconDensity, ArmaJump and SkewNormal simulations; compared estimators: MDN, KMN, LSCDE, CKDE, CKDE-CV and NKDE.
small sample sizes, neural network based conditional density estimators can be an equipollent or
even superior alternative to well established non-parametric CDEs.
• log ret risk free 1d: risk-free 1-day log return, computed based on the overnight index swap rate (OIS) with 1 day maturity. The OIS rate r_f is transformed as log(r_f / 365 + 1).
• SVIX: 30-day option implied volatility3 (Whaley, 1993)
• Mkt-RF 10-day risk: risk of market return factor; sum of squared market returns over the
last 10 trading days
• SMB 10-day risk: SMB factor risk; sum of squared factor returns over the last 10 days
• HML 10-day risk: HML factor risk; sum of squared factor returns over the last 10 days
• WML 10-day risk: WML factor risk; sum of squared factor returns over the last 10 days
Overall, the target variable is one-dimensional, i.e. y ∈ Y ⊆ R, whereas the conditional variable
x constitutes a 14-dimensional vector, i.e. x ∈ X ⊆ R14 .
B. Evaluation Methodology
In order to assess the goodness of the different density estimators, out-of-sample validation is
used. In particular, the available data is split in a training set which has a proportion of 80 %
and a validation set consisting of the remaining 20 % of data points. It is important to note that
this split is done without shuffling since the time-series data may not be i.i.d. Hence the validation
set Dval is a consecutive series of data, corresponding to the 633 most recent trading-days. The
conditional density estimators are fitted with the training data, while the validation set is left out
during the training or model selection process. The validation data is only used for computing the
following goodness-of-fit measures:
3 The option implied moments are computed based on options with maturity in 30 days. Since the days to maturity vary, linear interpolation of the option implied moments, corresponding to different numbers of days till maturity, is used to compute an estimate for a maturity of 30 days.
• Avg. log-likelihood: Average conditional log likelihood of validation data
$$\frac{1}{|\mathcal{D}_{val}|} \sum_{(x,y) \in \mathcal{D}_{val}} \log \hat{p}(y|x) \qquad (56)$$
• RMSE mean: Root-Mean-Square-Error (RMSE) between the realized log-return and the
mean of the estimated conditional distribution. The estimated conditional mean is defined
as the expectation of y under the distribution p̂(y|x):
$$\hat{\mu}(x) = \int_{\mathcal{Y}} y\, \hat{p}(y|x)\, dy \qquad (57)$$
• RMSE Std: RMSE between the realized deviation from the predicted mean µ̂(x) and the
standard deviation of the conditional density estimate. The estimated conditional standard
deviation is defined as
$$\hat{\sigma}(x) = \sqrt{\int_{\mathcal{Y}} (y - \hat{\mu}(x))^2\, \hat{p}(y|x)\, dy} \qquad (59)$$
For details on the estimated conditional moments and the approximation of the associated integrals,
we refer the interested reader to Appendix.F.
Calculating the average log-likelihood is a common way of evaluating the goodness of a density
estimate (Rezende and Mohamed, 2015; Tansey et al., 2016; Trippe and Turner, 2018). The better
the estimated conditional density approximates the true distribution, the higher the out-of-sample
likelihood in expectation. Only if the estimator generalizes well beyond the training data can it assign high conditional probabilities to the left-out validation data.
In finance, return distributions are often characterized by their centered moments. The RMSEs
w.r.t. mean and standard deviation provide a quantitative measure for the predictive accuracy and
consistency w.r.t. the predictive uncertainty. Overall, the training of the estimators and calculation
of the goodness measures is performed with 5 different seeds. The reported results are averages
over the 5 seeds, alongside the respective standard deviation.
Estimator          Avg. log-likelihood   RMSE mean (10^-2)   RMSE std (10^-2)
CKDE               3.3368 ± 0.0000       0.6924 ± 0.0000     0.8086 ± 0.0000
NKDE               3.1171 ± 0.0000       1.0681 ± 0.0000     0.5570 ± 0.0000
LSCDE              3.5072 ± 0.0021       0.7105 ± 0.0047     0.5451 ± 0.0029
MDN w/o noise      3.2797 ± 0.2058       0.5279 ± 0.0075     0.3185 ± 0.0048
KMN w/o noise      3.3578 ± 0.0653       0.5903 ± 0.0339     0.3673 ± 0.0107
MDN w/ noise       3.7991 ± 0.0142       0.5224 ± 0.0019     0.3171 ± 0.0034
KMN w/ noise       3.8010 ± 0.0142       0.5342 ± 0.0062     0.3287 ± 0.0034
VII. Conclusion
This paper studies the use of neural networks for conditional density estimation. Addressing the
problem of over-fitting, we introduce a noise regularization method that leads to smooth density
estimates and improved generalization. Moreover, a normalization scheme which makes the model’s
hyper-parameters insensitive to differing value ranges is proposed. Corresponding experiments
showcase the effectiveness and practical importance of the presented approaches. In a benchmark
study, we demonstrate that our training methodology endows neural network based CDE with a
better out-of-sample performance than previous semi- and non-parametric methods. Overall, this
work establishes a practical framework for the successful application of neural network based CDE
in areas such as econometrics. Based on the promising results, we are convinced that the proposed
method enhances the econometric toolkit and thus advocate further research in this direction.
While this paper focuses on CDE with mixture densities, a promising avenue for future research
could be the use of normalizing flows as parametric density representation.
Appendix
Appendix A. Additional Data Generating Processes
Appendix A.1. ARMA Jump Diffusion Model
The underlying model for this simulator is a non-linear non Gaussian ARMA jump diffusion model
introduced by Christoffersen et al. (2016):
$$dx_t = (r - 0.5 V_t - \xi \lambda_t)\,dt + \sqrt{V_t}\left(\sqrt{1-\rho^2}\, dW_t^1 + \rho\, dW_t^2\right) + q_t\, dN_t$$
$$dV_t = \kappa_V(\theta_V - V_t)\,dt + \gamma\, dL_t + \xi_V \sqrt{V_t}\, dW_t^2$$
$$dL_t = \kappa_L(\theta_L - L_t)\,dt + \xi_L \sqrt{L_t}\, dW_t^3$$
$$d\Psi_t = \kappa_\Psi(\theta_\Psi - \Psi_t)\,dt + \xi_\Psi \sqrt{\Psi_t}\, dW_t^4 \qquad (1)$$
$$\lambda_t = \Psi_t + \gamma_V V_t + \gamma_L L_t$$
$$q_t \sim \mathcal{N}(\theta, \delta^2)$$
$$\xi = e^{\theta + \frac{\delta^2}{2}} - 1$$
where x_t resembles log stock returns, V_t is the spot variance, L_t an illiquidity factor and Ψ_t an unknown latent factor. V_t, L_t and Ψ_t are referred to as jump parameters. A parameterization can be taken from Christoffersen et al. (2016), but generally these parameters influence the role of jumps and non-normality.
$$\mathcal{L}_{\mathcal{D}}(\mathcal{D}) = \sum_{n=1}^{N} \mathcal{L}(x_n) \qquad (2)$$
Also, let each xn be perturbed by a random noise vector ξ ∼ q(ξ) with zero mean and i.i.d.
elements, i.e.
$$\mathbb{E}_{\xi \sim q(\xi)}[\xi] = 0 \quad \text{and} \quad \mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi \xi^{\top}\right] = \eta^2 I \qquad (3)$$
The resulting loss L(xn + ξ) can be approximated by a second order Taylor expansion around xn
$$\mathcal{L}(x_n + \xi) = \mathcal{L}(x_n) + \xi^{\top} \nabla_x \mathcal{L}(x)\big|_{x_n} + \frac{1}{2}\, \xi^{\top} \nabla_x^2 \mathcal{L}(x)\big|_{x_n}\, \xi + \mathcal{O}(\xi^3) \qquad (4)$$
Assuming that the noise ξ is small in its magnitude, O(ξ³) may be neglected. The expected loss under q(ξ) follows directly from (4):
$$\mathbb{E}_{\xi \sim q(\xi)}\!\left[\mathcal{L}(x_n + \xi)\right] = \mathcal{L}(x_n) + \mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi^{\top} \nabla_x \mathcal{L}(x)\big|_{x_n}\right] + \frac{1}{2}\,\mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi^{\top} \nabla_x^2 \mathcal{L}(x)\big|_{x_n}\, \xi\right] \qquad (5)$$
$$= \mathcal{L}(x_n) + \mathbb{E}_{\xi \sim q(\xi)}[\xi]^{\top}\, \nabla_x \mathcal{L}(x)\big|_{x_n} + \frac{1}{2}\,\mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi^{\top} \nabla_x^2 \mathcal{L}(x)\big|_{x_n}\, \xi\right] \qquad (6)$$
$$= \mathcal{L}(x_n) + \frac{1}{2}\,\mathbb{E}_{\xi \sim q(\xi)}\!\left[\xi^{\top} \mathbf{H}^{(n)}\, \xi\right] \qquad (7)$$
$$= \mathcal{L}(x_n) + \frac{1}{2}\,\mathbb{E}_{\xi \sim q(\xi)}\!\left[\sum_{j}\sum_{k} \xi_j \xi_k\, \frac{\partial^2 \mathcal{L}(x)}{\partial x^{(j)} \partial x^{(k)}}\bigg|_{x_n}\right] \qquad (8)$$
$$= \mathcal{L}(x_n) + \frac{1}{2}\sum_{j} \mathbb{E}_{\xi}\!\left[\xi_j^2\right] \frac{\partial^2 \mathcal{L}(x)}{\partial x^{(j)} \partial x^{(j)}}\bigg|_{x_n} + \frac{1}{2}\sum_{j}\sum_{k \neq j} \mathbb{E}_{\xi}\!\left[\xi_j \xi_k\right] \frac{\partial^2 \mathcal{L}(x)}{\partial x^{(j)} \partial x^{(k)}}\bigg|_{x_n} \qquad (9)$$
$$= \mathcal{L}(x_n) + \frac{\eta^2}{2} \sum_{j} \frac{\partial^2 \mathcal{L}(x)}{\partial x^{(j)} \partial x^{(j)}}\bigg|_{x_n} \qquad (10)$$
$$= \mathcal{L}(x_n) + \frac{\eta^2}{2}\, \mathrm{tr}\!\left(\mathbf{H}^{(n)}\right) \qquad (11)$$
In that, L(x_n) is the loss without noise and H^(n) = ∇_x² L(x)|_{x_n} the Hessian of L at x_n. With ξ_j we denote the elements of the column vector ξ.
LEMMA 1: Let x ∈ S ⊆ R^n be a random variable with density p(x) and let z = a + Bx with invertible B ∈ R^{n×n}. Then z has density
$$q(z) = \frac{1}{|B|}\, p\!\left(B^{-1}(z - a)\right), \qquad z \in \{a + Bx \mid x \in S\}. \qquad (12)$$
PROOF 1 (Proof of Lemma 1): The Lemma directly follows from the change of variable theorem
(see Bishop (2006), page 18).
THEOREM 3: Let x ∈ R^n be a continuous random variable following a Gaussian Mixture Model (GMM), this is x ∼ p(x) with
$$p(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(\mu_k, \Sigma_k). \qquad (13)$$
Then the affinely transformed random variable z = a + Bx, with a ∈ R^n and invertible B ∈ R^{n×n}, again follows a Gaussian Mixture Model with density
$$p(z) = \sum_{k=1}^{K} w_k\, \mathcal{N}\!\left(a + B\mu_k,\; B \Sigma_k B^{\top}\right). \qquad (14)$$
PROOF 2 (Proof of Theorem 3): With x ∈ R^n following a Gaussian Mixture Model, its probability density function can be written as
$$p(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(\mu_k, \Sigma_k) = \frac{1}{(2\pi)^{\frac{n}{2}}} \sum_{k=1}^{K} w_k\, \frac{\exp\!\left(-\frac{1}{2}(x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k)\right)}{|\Sigma_k|^{\frac{1}{2}}} \qquad (15)$$
Applying Lemma 1 to the transformation z = a + Bx yields
$$p(z) = \frac{1}{(2\pi)^{\frac{n}{2}}} \sum_{k=1}^{K} w_k\, \frac{\exp\!\left(-\frac{1}{2}(B^{-1}z - B^{-1}a - \mu_k)^{\top} \Sigma_k^{-1} (B^{-1}z - B^{-1}a - \mu_k)\right)}{|B|\, |\Sigma_k|^{\frac{1}{2}}} \qquad (16)$$
$$= \frac{1}{(2\pi)^{\frac{n}{2}}} \sum_{k=1}^{K} w_k\, \frac{\exp\!\left(-\frac{1}{2}(z - (a + B\mu_k))^{\top} (B^{-1})^{\top} \Sigma_k^{-1} B^{-1} (z - (a + B\mu_k))\right)}{|B|\, |\Sigma_k|^{\frac{1}{2}}} \qquad (17)$$
$$= \frac{1}{(2\pi)^{\frac{n}{2}}} \sum_{k=1}^{K} w_k\, \frac{\exp\!\left(-\frac{1}{2}(z - (a + B\mu_k))^{\top} (B \Sigma_k B^{\top})^{-1} (z - (a + B\mu_k))\right)}{|B \Sigma_k B^{\top}|^{\frac{1}{2}}} \qquad (18)$$
$$= \sum_{k=1}^{K} w_k\, \mathcal{N}\!\left(a + B\mu_k,\; B \Sigma_k B^{\top}\right)$$
In that, UΩ is a uniform distribution over the set Ω. In practice the expected value Ex∼UΩ [f (x)]
can be estimated by uniformly drawing samples x1 , ...xN from Ω and averaging the function values.
$$\int_{\Omega} f(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \qquad x_i \sim \mathcal{U}_{\Omega} \qquad (20)$$
By the weak law of large numbers, the sample average in (20) is a consistent estimator of the integral. In many interesting cases, Ω is unbounded. For instance, one might want to estimate the moments ∫_{R^m} x^n p(x) dx of a real-valued random variable X ∈ R^m with probability density function p(x). Since there is no straightforward way to obtain uniform samples over an unbounded set, the simple Monte-Carlo integration technique in (20) cannot be employed in such cases. Instead, one draws samples from a non-uniform proposal distribution Q with density function q and support {x | q(x) > 0, x ∈ R^m} = Ω. The previous expectation over the uniform distribution can be
reformulated as an expectation under Q:
$$\int_{\Omega} f(x)\, dx = \mathbb{E}_{x \sim Q}\!\left[\frac{u(x)}{q(x)}\, f(x)\right] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{q(x_i)}, \qquad x_i \sim Q \qquad (21)$$
In that, u(x) denotes the density function of the uniform distribution. When samples are drawn
from a proposal distribution Q, the evaluated function values f (xi ) have to be weighted by the
inverse of the density q(xi ). In our implementation, we use a student-t distribution as proposal
distribution.
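A sketch of the importance-sampling estimator (21) with a Student-t proposal as described above; the degrees of freedom and the location/scale of the proposal are illustrative choices, not the values used in the experiments.

```python
import numpy as np
from scipy.stats import t as student_t

def importance_sampling_integral(f, n_samples=100_000, df=5, loc=0.0, scale=1.0, rng=None):
    """Estimate the integral of f over the real line by importance sampling with a
    Student-t proposal, weighting each draw by 1/q(x_i), cf. Eq. (21)."""
    rng = np.random.default_rng() if rng is None else rng
    proposal = student_t(df, loc=loc, scale=scale)
    x = proposal.rvs(size=n_samples, random_state=rng)
    return np.mean(f(x) / proposal.pdf(x))

# usage: second moment of a standard normal, i.e. integrand f(x) = x^2 * phi(x)
# from scipy.stats import norm
# m2 = importance_sampling_integral(lambda x: x ** 2 * norm.pdf(x))
```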
Our implementation only supports estimating skewness and kurtosis for univariate target variables,
i.e. dim(Y) = 1. If dim(Y) = 1, the integral is approximated with numerical integration, using the
Gaussian quadrature with 10000 reference points, for which the density values are calculated. If
dim(Y) > 1, we use Monte-Carlo integration with 100,000 samples (see Appendix.E).
In case of the KMN and MDN, the conditional distribution is a GMM. Thus, we can directly calculate mean and covariance from the GMM parameters outputted by the neural network. The mean follows straightforwardly as the weighted sum of the Gaussian component centers μ_k(x; θ):
\[
\hat{\mu}(x) = \sum_{k=1}^{K} w_k(x; \theta)\, \mu_k(x; \theta) \tag{27}
\]
\[
\widehat{\mathrm{Cov}}(x) = \sum_{k=1}^{K} w_k(x; \theta) \left[ \big(\mu_k(x; \theta) - \hat{\mu}(x)\big)\big(\mu_k(x; \theta) - \hat{\mu}(x)\big)^\top + \mathrm{diag}\!\big(\sigma_k(x; \theta)^2\big) \right] \tag{28}
\]
wherein the outer product accounts for the covariance that arises from the different locations of the components and the diagonal matrix for the inherent variance of each Gaussian component.
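A small NumPy sketch of (27) and (28), assuming the mixture weights, component centers and diagonal standard deviations predicted by the network for a fixed x are given as arrays; the toy values at the bottom are purely illustrative.

```python
import numpy as np

# Sketch of Eqs. (27)-(28): mean and covariance of the conditional GMM for a
# fixed x. The arrays (weights, centers, scales) stand in for the mixture
# parameters w_k(x; theta), mu_k(x; theta), sigma_k(x; theta) predicted by
# the network; the toy numbers below are purely illustrative.
def gmm_mean_cov(weights, centers, scales):
    mean = np.sum(weights[:, None] * centers, axis=0)                 # Eq. (27)
    diffs = centers - mean                                            # mu_k - mu_hat
    between = np.einsum('k,ki,kj->ij', weights, diffs, diffs)         # outer products
    within = np.diag(np.sum(weights[:, None] * scales ** 2, axis=0))  # diag(sigma_k^2)
    return mean, between + within                                     # Eq. (28)

w = np.array([0.3, 0.7])
mu = np.array([[0.0, 1.0], [2.0, -1.0]])
sigma = np.array([[0.5, 0.5], [1.0, 0.2]])
print(gmm_mean_cov(w, mu, sigma))
```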
For estimating the conditional density p(y|x), ε-neighbor kernel density estimation (NKDE) employs standard kernel density estimation in a local ε-neighborhood around a query point (x, y) (Sugiyama and Takeuchi, 2010).
NKDE is similar to CKDE in that it places kernels on the training data points to estimate the conditional probability density. However, rather than estimating both the joint probability p(x, y) and the marginal probability p(x), NKDE forms a density estimate by only considering a local subset of the training samples {(x_i, y_i)}_{i ∈ I_{x,ε}}, where I_{x,ε} is the set of sample indices such that ||x_i − x||_2 ≤ ε. The estimated density can be expressed as
\[
p(y|x) = \sum_{j \in I_{x,\epsilon}} w_j \prod_{i=1}^{l} \frac{1}{h^{(i)}}\, K\!\left(\frac{y^{(i)} - y_j^{(i)}}{h^{(i)}}\right) \tag{29}
\]
wherein w_j is the weight of the j-th kernel and K(z) a kernel function. In our implementation, K is the density function of a standard normal distribution. The weights w_j can either be uniform, i.e. w_j = 1/|I_{x,ε}|, or proportional to the distance ||x_j − x||. The vector of bandwidths h = (h^{(1)}, ..., h^{(l)})^⊤ can be determined with the rule-of-thumb (see Equation 17), where the number of samples is replaced by the average number of ε-neighbors \bar{N} in the training data:
\[
\bar{N} = \frac{1}{N} \sum_{n=1}^{N} \left( |I_{x_n, \epsilon}| - 1 \right) \tag{30}
\]
Alternatively, the bandwidths may be selected via leave-one-out maximum likelihood cross-validation.
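A bare-bones sketch of the ε-neighbor estimate in (29) with uniform weights and a Gaussian kernel might look as follows; the training arrays, bandwidth vector and ε are illustrative placeholders rather than the tuned values used in our experiments.

```python
import numpy as np
from scipy.stats import norm

# Bare-bones sketch of the eps-neighbor KDE in Eq. (29) with uniform weights
# and a Gaussian kernel. X_train, Y_train, the bandwidth vector h and eps are
# illustrative placeholders, not the tuned values of our experiments.
def nkde_pdf(y, x, X_train, Y_train, h, eps=0.4):
    idx = np.where(np.linalg.norm(X_train - x, axis=1) <= eps)[0]  # I_{x, eps}
    if idx.size == 0:
        return 0.0
    w = np.full(idx.size, 1.0 / idx.size)                          # uniform weights
    # product kernel over the l components of y, for every neighbor j
    k = np.prod(norm.pdf((y - Y_train[idx]) / h) / h, axis=1)
    return float(np.sum(w * k))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 1))
Y = 0.5 * X + rng.normal(scale=0.3, size=(500, 1))
print(nkde_pdf(np.array([0.1]), np.array([0.0]), X, Y, h=np.array([0.2])))
```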
Here, α = (α_1, ..., α_K)^⊤ are the learned parameters and φ(x, y) = (φ_1(x, y), ..., φ_K(x, y))^⊤ are kernel functions such that φ_k(x, y) ≥ 0 for all (x, y) ∈ X × Y.
The parameters α ∈ R^K are learned by minimizing the integrated squared error
\[
J(\alpha) = \iint \big( \hat{p}_\alpha(y\,|\,x) - p(y\,|\,x) \big)^2\, p(x)\, dx\, dy. \tag{32}
\]
After having obtained α̂ = arg min_α J(α) through training, the conditional density estimate can be computed as
\[
\hat{p}_\alpha(y \,|\, x = \tilde{x}) = \frac{\hat{\alpha}^\top \phi(\tilde{x}, y)}{\int \hat{\alpha}^\top \phi(\tilde{x}, y)\, dy}. \tag{33}
\]
Sugiyama and Takeuchi (2010) propose to use a Gaussian kernel with bandwidth parameter σ, which is also the choice in our implementation:
\[
\phi_k(x, y) = \exp\!\left( - \frac{\lVert x - x_k \rVert^2 + \lVert y - y_k \rVert^2}{2\sigma^2} \right),
\]
where (x_k, y_k) are kernel centers selected from the training samples.
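With this Gaussian kernel, the normalizer in (33) is available in closed form, since ∫ exp(−‖y − y_k‖²/(2σ²)) dy = (2πσ²)^{l/2}. The sketch below evaluates (33) under that assumption; the kernel centers, weights α̂ and bandwidth σ are placeholders for the quantities obtained from the fit.

```python
import numpy as np

# Sketch of Eq. (33) for the Gaussian kernel phi_k(x, y); the kernel centers
# (Xc, Yc), the weights alpha and the bandwidth sigma are placeholders for
# the quantities obtained from the LSCDE fit.
def lscde_conditional_pdf(y, x, alpha, Xc, Yc, sigma):
    kx = np.exp(-np.sum((Xc - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    ky = np.exp(-np.sum((Yc - y) ** 2, axis=1) / (2.0 * sigma ** 2))
    numer = np.sum(alpha * kx * ky)                  # alpha^T phi(x, y)
    # closed-form y-integral of the Gaussian kernel: (2 pi sigma^2)^(l/2)
    l = Yc.shape[1]
    denom = np.sum(alpha * kx) * (2.0 * np.pi * sigma ** 2) ** (l / 2)
    return numer / denom
```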
REFERENCES
Ambrogioni, Luca, Umut Güçlü, Marcel A. J. van Gerven, and Eric Maris, 2017, The Kernel Mixture Network: A Nonparametric Method for Conditional Density Estimation of Continuous Random Variables.
An, Guozhong, 1996, The Effects of Adding Noise During Backpropagation Training on a Generalization Performance, Neural Computation 8, 643–674.
Anděl, Jiří, Ivan Netuka, and Karel Zvára, 1984, On Threshold Autoregressive Processes, Kybernetika 20, 89–106.
Bakshi, Gurdip, Xiaohui Gao Bakshi, and George Panayotov, 2017, A Theory of Dissimilarity
Bakshi, Gurdip S., Nikunj Kapadia, and Dilip B. Madan, 2003, Stock Return Characteristics, Skew
Laws, and the Differential Pricing of Individual Equity Options, Review of Financial Studies .
Bishop, Chris M., 1995, Training with Noise is Equivalent to Tikhonov Regularization, Neural
Computation 7, 108–116.
Bollerslev, Tim, 1987, A Conditionally Heteroskedastic Time Series Model for Speculative Prices and Rates of Return, The Review of Economics and Statistics 69, 542–547.
Botev, Z. I., J. F. Grotowski, and D. P. Kroese, 2010, Kernel density estimation via diffusion, The Annals of Statistics 38, 2916–2957.
Bowman, Adrian W., 1984, An alternative method of cross-validation for the smoothing of density estimates, Biometrika 71, 353–360.
Cao, Ricardo, Antonio Cuevas, and Wensceslao González Manteiga, 1994, A comparative study of
several smoothing methods in density estimation, Computational Statistics & Data Analysis 17,
153–176.
Carhart, Mark M., 1997, On Persistence in Mutual Fund Performance, The Journal of Finance 52,
57–82.
Christoffersen, Peter, Bruno Feunou, Yoontae Jeon, and Chayawat Ornthanalai, 2016, Time-varying
De Gooijer, Jan G, and Dawit Zerom, 2003, On Conditional Density Estimation, Technical report.
Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio, 2017, Density estimation using Real NVP, in ICLR.
Duin, R. P. W., 1976, On the Choice of Smoothing Parameters for Parzen Estimators of Probability Density Functions, IEEE Transactions on Computers.
Engle, Robert F., 1982, Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation, Econometrica 50, 987–1007.
Fama, Eugene F., and Kenneth R. French, 1993, Common risk factors in the returns on stocks and bonds, Journal of Financial Economics 33, 3–56.
Fama, Eugene F., and Kenneth R. French, 2015, A five-factor asset pricing model, Journal of Financial Economics 116, 1–22.
Gallant, A. Ronald, David Hsieh, and George Tauchen, 1991, On Fitting a Recalcitrant Series: The Pound/Dollar Exchange Rate, 1974–83.
Gilardi, Nicolas, Samy Bengio, and Mikhail Kanevski, 2002, Conditional Gaussian Mixture Models for Environmental Risk Mapping.
Glosten, Lawrence R., Ravi Jagannathan, and David E. Runkle, 1993, On the Relation between the Expected Value and the Volatility of the Nominal Excess Return on Stocks, The Journal of Finance 48, 1779–1801.
Gormsen, Niels Joachim, and Christian Skov Jensen, 2017, Conditional Risk, SSRN Electronic
Journal .
Grus, Joel, 2015, Data Science from Scratch.
Guang Sung, Hsi, 2004, Gaussian Mixture Regression and Classification, Ph.D. thesis.
Hall, Peter, J. S. Marron, and Byeong U. Park, 1992, Smoothed cross-validation, Probability Theory and Related Fields 92, 1–20.
Hamilton, James D., 1994, Time Series Analysis (Princeton University Press).
Hansen, Bruce E., 1994, Autoregressive Conditional Density Estimation, International Economic Review 35, 705–730.
Harvey, Campbell R., and Akhtar Siddique, 1999, Autoregressive Conditional Skewness, The Journal of Financial and Quantitative Analysis 34, 465–487.
Holmstrom, L., and P. Koistinen, 1992, Using additive noise in back-propagation training, IEEE Transactions on Neural Networks 3, 24–38.
Hornik, Kurt, 1991, Approximation capabilities of multilayer feedforward networks, Neural Net-
works 4, 251–257.
Hyndman, Rob J., David M. Bashtannyk, and Gary K. Grunwald, 1996, Estimating and Visualizing Conditional Densities, Journal of Computational and Graphical Statistics 5, 315–336.
Jagannathan, Ravi, and Zhenyu Wang, 1996, The Conditional CAPM and the Cross-Section of Expected Returns, The Journal of Finance 51, 3–53.
Jondeau, Eric, and Michael Rockinger, 2003, Conditional volatility, skewness, and kurtosis: existence, persistence, and comovements, Journal of Economic Dynamics and Control 27, 1699–1737.
Kingma, Diederik P., and Jimmy Ba, 2015, Adam: A Method for Stochastic Optimization, in
ICLR.
Krogh, Anders, and John A. Hertz, 1992, A Simple Weight Decay Can Improve Generalization, Technical report.
Kukačka, Jan, Vladimir Golkov, and Daniel Cremers, 2017, Regularization for Deep Learning: A Taxonomy, Technical report.
Lewellen, Jonathan, and Stefan Nagel, 2006, The conditional CAPM does not explain asset-pricing anomalies, Journal of Financial Economics 82, 289–314.
Li, Jonathan Q., and Andrew R. Barron, 2000, Mixture Density Estimation, in NIPS.
Li, Qi, and Jeffrey S. Racine, 2007, Nonparametric Econometrics: Theory and Practice (Princeton University Press).
Mandelbrot, Benoit, 1967, The Variation of Some Other Speculative Prices, The Journal of Business
40, 393–413.
Mirza, Mehdi, and Simon Osindero, 2014, Conditional Generative Adversarial Nets, Technical
report.
Nelder, J. A., and R. Mead, 1965, A simplex method for function minimization, The Computer Journal 7, 308–313.
Nelson, Daniel B., and Charles Q. Cao, 1992, Inequality Constraints in the Univariate GARCH Model, Journal of Business & Economic Statistics 10, 229–235.
Park, Byeong U., and J. S. Marron, 1990, Comparison of data driven bandwidth selectors, Journal of the American Statistical Association 85, 66–72.
Parzen, Emanuel, 1962, On Estimation of a Probability Density Function and Mode, The Annals of Mathematical Statistics 33, 1065–1076.
Pfeiffer, K. P., 1985, Stepwise variable selection and maximum likelihood estimation of smoothing factors of kernel functions for nonparametric discriminant functions evaluated by different
Rezende, Danilo Jimenez, and Shakir Mohamed, 2015, Variational Inference with Normalizing Flows, in ICML.
Rosenblatt, Murray, 1956, Remarks on Some Nonparametric Estimates of a Density Function, The Annals of Mathematical Statistics 27, 832–837.
Rudemo, Mats, 1982, Empirical Choice of Histograms and Kernel Density Estimators.
Salimans, Tim, and Diederik P. Kingma, 2016, Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, in Advances in Neural Information Processing Systems.
Sarajedini, A., R. Hecht-Nielsen, and P.M. Chau, 1999, Conditional probability density function
estimation with sigmoidal neural networks, IEEE Transactions on Neural Networks 10, 231–238.
Sentana, E., 1995, Quadratic ARCH Models, The Review of Economic Studies 62, 639–661.
Sheather, S. J., and M. C. Jones, 1991, A Reliable Data-Based Bandwidth Selection Method for
Kernel Density Estimation, Journal of the Royal Statistical Society 53, 683–690.
Silverman, B., 1982, On the estimation of a probability density function by the maximum penalized likelihood method, The Annals of Statistics 10, 795–810.
Sohn, Kihyuk, Honglak Lee, and Xinchen Yan, 2015, Learning Structured Output Representa-
tion using Deep Conditional Generative Models, in Advances in Neural Information Processing
Systems, 3483–3491.
Sola, J., and J. Sevilla, 1997, Importance of input data normalization for the application of neural
networks to complex industrial problems, IEEE Transactions on Nuclear Science 44, 1464–1468.
Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, 2014, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Journal of Machine Learning Research 15, 1929–1958.
Sugiyama, Masashi, and Ichiro Takeuchi, 2010, Conditional density estimation via Least-Squares Density Ratio Estimation.
Tansey, Wesley, Karl Pichotta, and James G. Scott, 2016, Better Conditional Density Estimation for Neural Networks, Technical report.
Trippe, Brian L., and Richard E. Turner, 2018, Conditional Density Estimation with Bayesian Normalising Flows, Technical report.
Webb, A. R., 1994, Functional approximation by feed-forward networks: a least-squares approach to generalization, IEEE Transactions on Neural Networks 5, 363–371.
Whaley, Robert E., 1993, Derivatives on Market Volatility: Hedging Is Long Overdue, The Journal
of Derivatives .