

PnL vs Long: Cross-Asset Performance Analytics

Strategy      Annualized Return (%)    Sharpe Ratio
X-A Long      4.80                     0.75
X-A Lasso     6.80                     0.85


$$h^* = \arg\min_{h \in H} \epsilon(h),$$

and the classifier obtained by minimizing the training error over $m$ samples as

$$\hat{h} = \arg\min_{h \in H} \hat{\epsilon}(h).$$

Then, with probability exceeding $1 - \delta$, we have

$$\epsilon(\hat{h}) \le \epsilon(h^*) + O\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right),$$

where $d$ denotes the VC dimension of $H$.

This implies that, for

$$\epsilon(\hat{h}) \le \epsilon(h^*) + 2\gamma$$

to hold with probability exceeding $1 - \delta$, it suffices that $m = O(d)$.

This reveals that the number of training samples must grow linearly with the VC dimension of the model (which tends to be comparable to the number of parameters).
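To see where this comes from, equate the excess-risk term in the bound to $2\gamma$ and solve for $m$, treating $\gamma$ and $\delta$ as fixed constants (a worked step added here for clarity):

$$\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}} \;\le\; 2\gamma \quad\Longleftrightarrow\quad m \;\ge\; \frac{1}{4\gamma^{2}}\left(d\log\frac{m}{d} + \log\frac{1}{\delta}\right),$$

so, up to the logarithmic factor and the constant hidden in the $O(\cdot)$, the required sample size grows linearly in $d$.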


Particle Filtering
Signal modelling and state inference given noisy observations naturally lead us to stochastic filtering and state-space modelling. Wiener provided a solution for a stationary underlying distribution. Kalman provided a solution for a non-stationary underlying distribution: the optimal linear filter (the first truly adaptive filter), based on assumptions of linearity and Gaussianity. Extensions try to overcome the limitations of the linear and Gaussian assumptions, but do not provide closed-form solutions to the distribution approximations required. Bayesian inference aims to elucidate sufficient variables which accurately describe the dynamics of the process being modeled. Stochastic filtering underlies Bayesian filtering and is an inverse statistical problem: we want to find the inputs, given the outputs (Chen 2003). The principal foundation of stochastic filtering lies in recursive Bayesian estimation, where we are essentially trying to compute the joint posterior. More formally, we want to recover the state variable $\mathbf{x}_t$ given the filtration $\mathcal{F}_t$, i.e. the data up to and including time $t$, essentially removing observation errors and computing the posterior distribution over the most recent state: $P(\mathbf{x}_t \mid \mathbf{y}_{0:t})$.

There are two key assumptions underlying the recursive Bayesian filter: (i) that the state process follows a first-order Markov process,

$$p(\mathbf{x}_i \mid \mathbf{x}_{0:i-1}, \mathbf{y}_{0:i-1}) = p(\mathbf{x}_i \mid \mathbf{x}_{i-1})$$

and (ii) that the observations are conditionally independent of past states and observations given the current state,

$$p(\mathbf{y}_i \mid \mathbf{x}_{0:i}, \mathbf{y}_{0:i-1}) = p(\mathbf{y}_i \mid \mathbf{x}_i)$$

From Bayes' rule, given $\Upsilon_i$ as the set of observations $\mathbf{y}_{0:i} := \{\mathbf{y}_0, \ldots, \mathbf{y}_i\}$, the conditional posterior density function (pdf) of $\mathbf{x}_i$ is defined as:

$$p(\mathbf{x}_i \mid \Upsilon_i) = \frac{p(\mathbf{y}_i \mid \mathbf{x}_i)\, p(\mathbf{x}_i \mid \Upsilon_{i-1})}{p(\mathbf{y}_i \mid \Upsilon_{i-1})}$$

In turn, the posterior density function $p(\mathbf{x}_i \mid \Upsilon_i)$ is defined by three key terms:

Prior: the knowledge of the model is described by the prior $p(\mathbf{x}_i \mid \Upsilon_{i-1})$:

$$p(\mathbf{x}_i \mid \Upsilon_{i-1}) = \int p(\mathbf{x}_i \mid \mathbf{x}_{i-1})\, p(\mathbf{x}_{i-1} \mid \Upsilon_{i-1})\, d\mathbf{x}_{i-1}$$

Likelihood: $p(\mathbf{y}_i \mid \mathbf{x}_i)$ essentially determines the observation noise.

Evidence: the denominator of the pdf involves an integral of the form

$$p(\mathbf{y}_i \mid \Upsilon_{i-1}) = \int p(\mathbf{y}_i \mid \mathbf{x}_i)\, p(\mathbf{x}_i \mid \Upsilon_{i-1})\, d\mathbf{x}_i$$

The calculation and/or approximation of these three terms is the basis of Bayesian filtering and inference.
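To make the recursion concrete, the sketch below runs the prediction and update steps on a discretized (grid) state space; the AR(1) transition and Gaussian observation densities are illustrative assumptions, not from the text.

```python
import numpy as np
from scipy.stats import norm

# Discretized (grid) Bayes filter: the prior integral becomes a sum over the
# grid and the Bayes update a pointwise multiply-and-normalize.
grid = np.linspace(-5.0, 5.0, 201)          # grid of state values x
dx = grid[1] - grid[0]

def transition(x_next, x_prev):             # assumed p(x_i | x_{i-1}): AR(1), Gaussian noise
    return norm.pdf(x_next, loc=0.9 * x_prev, scale=0.5)

def likelihood(y, x):                       # assumed p(y_i | x_i): Gaussian observation noise
    return norm.pdf(y, loc=x, scale=1.0)

posterior = norm.pdf(grid, 0.0, 1.0)        # p(x_0)
posterior /= posterior.sum() * dx

for y in [0.3, -0.1, 0.8]:                  # a stream of observations y_i
    # Prior: p(x_i | Y_{i-1}) = integral of p(x_i | x_{i-1}) p(x_{i-1} | Y_{i-1})
    prior = np.array([(transition(x, grid) * posterior).sum() * dx for x in grid])
    # Evidence: p(y_i | Y_{i-1}) = integral of p(y_i | x_i) p(x_i | Y_{i-1})
    evidence = (likelihood(y, grid) * prior).sum() * dx
    # Posterior: Bayes' rule
    posterior = likelihood(y, grid) * prior / evidence
```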

Particle filtering is a recursive stochastic filtering technique which provides a flexible approach to determining the posterior distribution of the latent variables given the observations. Simply put, particle filters provide online adaptive inference where the underlying dynamics are non-linear and non-Gaussian. The main advantage of sequential Monte Carlo methods⁵⁹ is that they do not rely on any local linearization or abstract functional approximation. This comes at the cost of increased computational expense, though given breakthroughs in computing technology and the related decline in processing costs, this is not considered a barrier except in extreme circumstances.

⁵⁹ For more information on Bayesian sampling, see Gentle (2003), Robert and Casella (2004), O'Hagan and Forster (2004), Rasmussen and Ghahramani (2003), Rue, Martino and Chopin (2009), Liu (2001), Skare, Bolviken and Holden (2003), Ionides (2008), Gelman and Hill (2007), Cook, Gelman and Rubin (2006), Gelman (2006, 2007). Techniques to improve Bayesian posterior simulations are covered in van Dyk and Meng (2001), Liu (2003), Roberts and Rosenthal (2001) and Brooks, Giudici and Roberts (2003). For adaptive MCMC, see Andrieu and Robert (2001), Andrieu and Thoms (2008) and Peltola, Marttinen and Vehtari (2012); for reversible jump MCMC, see Green (1995); for trans-dimensional MCMC, see Richardson and Green (1997) and Brooks, Giudici and Roberts (2003); for perfect-simulation MCMC, see Propp and Wilson (1996) and Fill (1998). For Hamiltonian Monte Carlo (HMC), see Neal (1994, 2011). The popular NUTS (No U-Turn Sampler) was introduced in Hoffman and Gelman (2014). For other extensions, see Girolami and Calderhead (2011), Betancourt and Stein (2011), Betancourt (2013a, 2013b), Romeel (2011), Leimkuhler and Reich (2004).

Monte Carlo approximation using particle methods calculates the expectation of the posterior density function by importance sampling (IS). The state space is partitioned into regions, which are filled with particles with respect to some probability measure: the higher this measure, the denser the particle concentration. Specifically, from earlier:

$$p(x_t \mid \mathbf{y}_{0:t}) = \frac{p(y_t \mid x_t)\, p(x_t \mid \mathbf{y}_{0:t-1})}{p(y_t \mid \mathbf{y}_{0:t-1})}$$

We approximate the state posterior through a test function $t(x_t)$ evaluated on samples $x_t^{(i)}$. To find the mean $\mathbb{E}[t(x_t)]$ under the state posterior $p(x_t \mid \mathbf{y}_{0:t})$ at time $t$, we would generate state samples $x_t^{(i)} \sim p(x_t \mid \mathbf{y}_{0:t})$. Though theoretically plausible, empirically we are unable to observe and sample directly from the state posterior. We therefore replace the state posterior by a proposal state distribution (importance distribution) $\pi$ which is proportional to the true posterior at every point: $\pi(x_t \mid \mathbf{y}_{0:t}) \propto p(x_t \mid \mathbf{y}_{0:t})$. We are thus able to sequentially sample independent and identically distributed draws from $\pi(x_t \mid \mathbf{y}_{0:t})$, giving us:

$$\mathbb{E}[t(x_t)] = \int t(x_t)\, \frac{p(x_t \mid \mathbf{y}_{0:t})}{\pi(x_t \mid \mathbf{y}_{0:t})}\, \pi(x_t \mid \mathbf{y}_{0:t})\, dx_t \;\approx\; \frac{\sum_{i=1}^{N} t\!\left(x_t^{(i)}\right) w_t^{(i)}}{\sum_{i=1}^{N} w_t^{(i)}}$$

As the number of draws $N$ increases, this average converges asymptotically (as $N \to \infty$) to the expectation under the true posterior, by the central limit theorem (Geweke 1989). This convergence is the primary advantage of sequential Monte Carlo methods, as they provide asymptotically consistent estimates of the true distribution $p(x_t \mid \mathbf{y}_{0:t})$ (Doucet & Johansen 2008).
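A toy sketch of the self-normalized estimator above, with an assumed Gaussian target (known only up to a constant) and a wider Gaussian proposal standing in for $\pi$; the distributions are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x) known only up to a constant; proposal pi(x) we can sample from.
log_p = lambda x: -0.5 * ((x - 1.0) / 0.5) ** 2      # unnormalized N(1, 0.5^2)
draws = rng.normal(0.0, 2.0, size=100_000)           # x^(i) ~ pi = N(0, 2^2)
log_pi = -0.5 * (draws / 2.0) ** 2                   # constants cancel in the ratio

w = np.exp(log_p(draws) - log_pi)                    # unnormalized importance weights
t = draws                                            # test function t(x) = x (posterior mean)
est = np.sum(t * w) / np.sum(w)                      # self-normalized IS estimate, ~1.0
print(est)
```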

IS allows us to sample from complex high-dimensional distributions, but its computational complexity grows linearly with each subsequent draw. To achieve fixed computational complexity we use sequential importance sampling (SIS). There are a number of critical issues with SIS; primarily, the variance of the estimates increases exponentially with $n$, leading to fewer and fewer non-zero importance weights. This problem is known as weight degeneracy. To alleviate this issue, states are resampled to retain the most pertinent contributors, essentially removing particles with low weights with a high degree of certainty (Gordon et al. 1993). Resampling addresses degeneracy by replacing particles with high weight by many particles with high inter-particle correlation (Chen 2003). The sequential importance resampling (SIR) algorithm is given in the mathematical box below:

Mathematical Box [Sequential Importance Resampling]

1. Initialization: for $i = 1, \ldots, N_p$, sample
$$\mathbf{x}_0^{(i)} \sim p(\mathbf{x}_0)$$
with weights $W_0^{(i)} = \frac{1}{N_p}$.

For $t \ge 1$:

2. Importance sampling: for $i = 1, \ldots, N_p$, draw samples
$$\tilde{\mathbf{x}}_t^{(i)} \sim p\!\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}^{(i)}\right)$$
and set $\tilde{\mathbf{x}}_{0:t}^{(i)} = \left(\mathbf{x}_{0:t-1}^{(i)}, \tilde{\mathbf{x}}_t^{(i)}\right)$.

3. Weight update: calculate the importance weights
$$W_t^{(i)} = p\!\left(\mathbf{y}_t \mid \tilde{\mathbf{x}}_t^{(i)}\right)$$

4. Normalize the weights:
$$\widetilde{W}_t^{(i)} = \frac{W_t^{(i)}}{\sum_{j=1}^{N_p} W_t^{(j)}}$$

5. Resampling: generate $N_p$ new particles $\mathbf{x}_t^{(i)}$ from the set $\{\tilde{\mathbf{x}}_t^{(i)}\}$ according to the importance weights $\widetilde{W}_t^{(i)}$.

6. Repeat from importance sampling step 2.
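A minimal sketch of this recursion in Python, under an assumed illustrative model (AR(1) state with Gaussian transition and observation noise) and multinomial resampling for brevity; the model and the function name are ours, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def sir_filter(ys, n_particles=1000):
    """SIR for the illustrative model x_t = 0.9 x_{t-1} + v_t, y_t = x_t + e_t."""
    x = rng.normal(0.0, 1.0, n_particles)                 # 1. x_0^(i) ~ p(x_0), W_0 = 1/N_p
    means = []
    for y in ys:
        x = 0.9 * x + rng.normal(0.0, 0.5, n_particles)   # 2. x~_t^(i) ~ p(x_t | x_{t-1}^(i))
        w = np.exp(-0.5 * (y - x) ** 2)                   # 3. W_t^(i) = p(y_t | x~_t^(i)), unit obs. noise
        w /= w.sum()                                      # 4. normalize weights
        means.append(w @ x)                               #    posterior-mean estimate at time t
        x = x[rng.choice(n_particles, n_particles, p=w)]  # 5. resample by the weights
    return np.array(means)                                # 6. loop repeats from step 2
```

Calling `sir_filter(ys)` on a series of observations returns the filtered posterior-mean path.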

Resampling retains the most pertinent particles; however, it destroys information by discounting the potential future descriptive ability of particles. It does not really prevent sample impoverishment; it simply excludes poor samples from the calculations, providing future stability at the cost of short-term increases in variance.

Our Adaptive Path Particle Filter⁶⁰ (APPF) leverages the descriptive ability of naively discarded particles in an adaptive evolutionary environment with a well-defined fitness function, leading to increased accuracy in the recursive Bayesian estimation of non-linear, non-Gaussian dynamical systems. We embed a generation-based adaptive particle-switching step into the particle filter weight update, using the transition prior as our proposal distribution. This enables us to make use of previously discarded particles $\boldsymbol{\psi}$ if their discriminatory power is higher than that of the current particle set:

$$W_t^{(i)} = \max\!\left(p\!\left(\mathbf{y}_t \mid \tilde{\mathbf{x}}_t^{(i)}\right),\, p\!\left(\mathbf{y}_t \mid \hat{\mathbf{x}}_t^{(i)}\right)\right) \quad \text{where } \hat{\mathbf{x}}_t^{(i)} \sim p\!\left(\mathbf{x}_t \mid \boldsymbol{\psi}_{t-1}^{(i)}\right) \text{ and } \hat{\mathbf{x}}_{0:t}^{(i)} = \left(\mathbf{x}_{0:t-1}^{(i)}, \hat{\mathbf{x}}_t^{(i)}\right)$$

Mathematical Box [Adaptive Path Particle Filter]

1. Initialization: for $i = 1, \ldots, N_p$, sample
$$\mathbf{x}_0^{(i)} \sim p(\mathbf{x}_0), \qquad \boldsymbol{\psi}_0^{(i)} \sim p(\mathbf{x}_0)$$
with weights $W_0^{(i)} = \frac{1}{N_p}$.

For $t \ge 1$:

2. Importance sampling: for $i = 1, \ldots, N_p$, draw samples
$$\tilde{\mathbf{x}}_t^{(i)} \sim p\!\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}^{(i)}\right), \quad \text{set } \tilde{\mathbf{x}}_{0:t}^{(i)} = \left(\mathbf{x}_{0:t-1}^{(i)}, \tilde{\mathbf{x}}_t^{(i)}\right)$$
and draw
$$\hat{\mathbf{x}}_t^{(i)} \sim p\!\left(\mathbf{x}_t \mid \boldsymbol{\psi}_{t-1}^{(i)}\right), \quad \text{set } \hat{\mathbf{x}}_{0:t}^{(i)} = \left(\mathbf{x}_{0:t-1}^{(i)}, \hat{\mathbf{x}}_t^{(i)}\right)$$

3. Weight update: calculate the importance weights
$$W_t^{(i)} = \max\!\left(p\!\left(\mathbf{y}_t \mid \tilde{\mathbf{x}}_t^{(i)}\right),\, p\!\left(\mathbf{y}_t \mid \hat{\mathbf{x}}_t^{(i)}\right)\right)$$
Evaluate: if $p\!\left(\mathbf{y}_t \mid \hat{\mathbf{x}}_t^{(i)}\right) > p\!\left(\mathbf{y}_t \mid \tilde{\mathbf{x}}_t^{(i)}\right)$, then set $\tilde{\mathbf{x}}_t^{(i)} = \hat{\mathbf{x}}_t^{(i)}$.

4. Normalize the weights:
$$\widetilde{W}_t^{(i)} = \frac{W_t^{(i)}}{\sum_{j=1}^{N_p} W_t^{(j)}}$$

5. Commit the pre-resample set of particles to memory: $\{\boldsymbol{\psi}_t^{(i)}\} = \{\tilde{\mathbf{x}}_t^{(i)}\}$.

6. Resampling: generate $N_p$ new particles $\mathbf{x}_t^{(i)}$ from the set $\{\tilde{\mathbf{x}}_t^{(i)}\}$ according to the importance weights $\widetilde{W}_t^{(i)}$.

7. Repeat from importance sampling step 2.

⁶⁰ More details on the theoretical underpinnings and formal justification of the APPF can be found in Hanif (2013) and Hanif and Smith (2012).
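A rough Python sketch of a single APPF iteration under our reading of the box above (the helper callables `transition` and `lik` are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

def appf_step(x_prev, psi_prev, y, transition, lik):
    """One APPF iteration over particle arrays x_prev and the memory set psi_prev."""
    x_tilde = transition(x_prev)                       # x~_t^(i) ~ p(x_t | x_{t-1}^(i))
    x_hat = transition(psi_prev)                       # x^_t^(i) ~ p(x_t | psi_{t-1}^(i))
    w = np.maximum(lik(y, x_tilde), lik(y, x_hat))     # W_t^(i) = max(...)
    x_tilde = np.where(lik(y, x_hat) > lik(y, x_tilde),
                       x_hat, x_tilde)                 # adaptive particle switching
    w /= w.sum()                                       # normalize weights
    psi = x_tilde.copy()                               # commit pre-resample set to memory
    idx = rng.choice(len(x_tilde), len(x_tilde), p=w)  # resample by importance weights
    return x_tilde[idx], psi

# Illustrative model pieces (assumptions, matching the SIR sketch earlier):
transition = lambda x: 0.9 * x + rng.normal(0.0, 0.5, x.size)
lik = lambda y, x: np.exp(-0.5 * (y - x) ** 2)
```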

Financial Example: Stochastic Volatility Estimation

Traditional measures of volatility are either market views or estimates from the past. Under such measures, the correct value for pricing derivatives cannot be known until the derivative has expired. As the volatility measure is not constant, not predictable and not directly observable, it is best modeled as a random variable (Wilmott 2007). Understanding the dynamics of the volatility process in tandem with the dynamics of the underlying asset, in the same timescale, enables us to measure the stochastic volatility process. However, modelling volatility as a stochastic process requires an observable volatility measure: this is the stochastic volatility estimation problem.

The Heston stochastic volatility model is among the most popular stochastic volatility models and is defined by the coupled two-dimensional stochastic differential equation:

$$dX(t)/X(t) = \sqrt{V(t)}\, dW_X(t)$$
$$dV(t) = \kappa\left(\theta - V(t)\right)dt + \varepsilon\sqrt{V(t)}\, dW_V(t)$$

where $\kappa, \theta, \varepsilon$ are strictly positive constants, and $W_X$ and $W_V$ are scalar Brownian motions in some probability measure; we assume that $dW_X(t) \cdot dW_V(t) = \rho\, dt$, where the correlation $\rho$ is some constant in $[-1, 1]$. $X(t)$ represents an asset price process and is assumed to be a martingale in the chosen probability measure. $V(t)$ represents the instantaneous variance of relative changes to $X(t)$ – the stochastic volatility⁶¹. The Euler discretization with full truncation⁶² of the model takes the form:

$$\ln \hat{X}(t + \Delta) = \ln \hat{X}(t) - \tfrac{1}{2}\hat{V}(t)^{+}\Delta + \sqrt{\hat{V}(t)^{+}}\, Z_X \sqrt{\Delta}$$
$$\hat{V}(t + \Delta) = \hat{V}(t) + \kappa\left(\theta - \hat{V}(t)^{+}\right)\Delta + \varepsilon\sqrt{\hat{V}(t)^{+}}\, Z_V \sqrt{\Delta}$$

where $\hat{X}$, the observed price process, and $\hat{V}$, the stochastic volatility process, are discrete-time approximations to $X$ and $V$, respectively, and where $Z_X$ and $Z_V$ are Gaussian random variables with correlation $\rho$. The operator $x^{+} = \max(x, 0)$ allows the process for $V$ to go below zero, whereupon it becomes deterministic with an upward drift $\kappa\theta$. To run the particle filters we need to calibrate the parameters $\kappa, \theta, \varepsilon$.

⁶¹ SV is modeled as a mean-reverting square-root diffusion with Ornstein-Uhlenbeck dynamics (a continuous-time analogue of the discrete-time first-order autoregressive process).
⁶² A critical problem with naive Euler discretization is that it allows the discrete process for $V$ to become negative with non-zero probability, which makes the computation of $\sqrt{\hat{V}}$ impossible.
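A sketch of the full-truncation scheme above as simulation code; the parameter values are illustrative, not calibrated:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_heston(x0=100.0, v0=0.04, kappa=2.0, theta=0.04, eps=0.3,
                    rho=-0.7, dt=1 / 252, n_steps=252):
    """Full-truncation Euler discretization of the Heston model."""
    log_x, v = np.log(x0), v0
    path = np.empty(n_steps)
    for t in range(n_steps):
        z_x = rng.standard_normal()
        z_v = rho * z_x + np.sqrt(1 - rho ** 2) * rng.standard_normal()  # corr(Z_X, Z_V) = rho
        v_plus = max(v, 0.0)                                             # x+ = max(x, 0)
        log_x += -0.5 * v_plus * dt + np.sqrt(v_plus) * z_x * np.sqrt(dt)
        v += kappa * (theta - v_plus) * dt + eps * np.sqrt(v_plus) * z_v * np.sqrt(dt)
        path[t] = np.exp(log_x)
    return path
```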

Experimental Results – S&P 500 Stochastic Volatility

To calibrate the stochastic volatility process for the S&P 500 Index, we ran a 10,000-iteration Markov chain Monte Carlo (MCMC) calibration to build an understanding of the price process (observation equation) and the volatility process (state equation).


We took the joint MAP (maximum a posteriori) estimate⁶³ of $\kappa$ and $\theta$ from our MCMC calibration, as per Chib et al. (2002). The Heston model stochastic volatility calibration for SPX can be seen in the first figure below, where the full truncation scheme can be seen forcing the SV process to remain positive; the associated parameter evolution can be seen in the second figure (Hanif & Smith 2013).

Of note, $\varepsilon$ is a small constant throughout. This is attributable to the fact that $\varepsilon$ represents the volatility of volatility: if it were large, we would not observe the coupling (trend/momentum) between and amongst securities in markets that we do.

Figure 110: Heston model SPX daily closing Stochastic Volatility calibration – 10,000 iteration MCMC, Jan '10 – Dec '12
Source: Hanif (2013), J.P. Morgan QDS.

Figure 111: Heston model SPX Parameter Estimates and Evolution – 10,000 iteration MCMC, Jan '10 – Dec '12
Source: Hanif (2013), J.P. Morgan QDS.

Given the price process, we estimate the latent stochastic volatility process using the SIR, MCMC-PF⁶⁴, PLA⁶⁵ and APPF particle filters, each run with N = 1,000 particles and systematic resampling⁶⁶. Results can be seen in the table and figure below. The APPF clearly provides more accurate estimates of the underlying stochastic volatility process than the other particle filters, with statistically significant improvements in estimation accuracy.

Figure 112: Heston model experimental results: RMSE mean and execution time in seconds

Particle Filter    RMSE       Exec. (s)
PF (SIR)           0.05282    3.79
MCMC-PF            0.05393    59.37
PLA                0.05317    21.30
APPF               0.04961    39.33

Source: Hanif (2013), J.P. Morgan QDS

⁶³ The MAP estimate is a Bayesian parameter estimation technique which takes the mode of the posterior distribution. It is unlike maximum likelihood based point estimates, which disregard the descriptive power of the MCMC process and the associated pdfs.
⁶⁴ The Markov chain Monte Carlo particle filter (MCMC-PF) attempts to reduce degeneracy by jittering particle locations, using Metropolis-Hastings to accept moves.
⁶⁵ The particle learning particle filter (PLA) performs an MCMC step after every 50 iterations.
⁶⁶ There are a number of resampling schemes that can be adopted. The three most common are systematic, residual and multinomial. Of these, multinomial is the most computationally efficient, though systematic resampling is the most commonly used and performs better in most, but not all, scenarios compared to the other schemes (Douc & Cappé 2005).


Figure 113: Heston model estimates for SPX – filter estimates (posterior means) vs. true state

Source: Hanif (2013), J.P.Morgan QDS

These results go some way toward showing that the selective pressure from our generation-gap and distribution-recombination method does not lead to premature convergence. We have implicitly included a number of approaches to handling premature convergence in dynamic optimization problems with evolutionary computation (Jin & Branke 2005). First, we generate diversity after a change by resampling. We maintain diversity throughout the run through the importance-sampling diffusion of the current- and past-generation particle sets. This generation-based approach enables the learning algorithm to maintain a memory, which in turn is the basis of Bayesian inference. And finally, our multi-population approach enables us to explore potentially unexplored regions of the search space.


Linear and Quadratic Discriminant Analysis

Learning algorithms can be classified as either discriminative or generative algorithms⁶⁷. In discriminative learning algorithms, one seeks to learn the input-to-output mapping directly; examples of this approach include Rosenblatt's Perceptron and logistic regression. Such discriminative learning algorithms model $p(y \mid x)$ directly. An alternative approach would be to learn $p(y)$ and $p(x \mid y)$ from the data and use Bayes' theorem to recover $p(y \mid x)$. Learning algorithms adopting this approach of modeling both $p(y)$ and $p(x \mid y)$ are called generative learning algorithms, as they equivalently learn the joint distribution $p(x, y)$ of the input and output processes.

Fitting Linear Discriminant Analysis to data with a shared covariance matrix, and Quadratic Discriminant Analysis to data with class-specific covariance matrices, yields the two graphs below.

Figure 114: Applying Linear and Quadratic Discriminant Analysis on Toy Datasets.

Source: J.P.Morgan Macro QDS
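A sketch of the comparison in Figure 114 using scikit-learn's implementations on synthetic toy data; the data-generating choices here are ours, for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(7)
# Two class-conditional Gaussians with a shared covariance matrix, as in the
# LDA panel; giving each class its own covariance would favor QDA instead.
cov = [[1.0, 0.3], [0.3, 1.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, 200)
X1 = rng.multivariate_normal([2.0, 2.0], cov, 200)
X = np.vstack([X0, X1])
y = np.r_[np.zeros(200), np.ones(200)]

lda = LinearDiscriminantAnalysis().fit(X, y)      # linear decision boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)   # quadratic boundary (per-class covariance)
print(lda.score(X, y), qda.score(X, y))           # in-sample accuracies
```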

⁶⁷ For discriminant analysis (linear, quadratic, flexible, penalized and mixture), see Hastie et al. (1994), Hastie et al. (1995), Tibshirani (1996b), Hastie et al. (1998) and Ripley (1996). Laplace's method for integration is described in Wong and Li (1992). Finite mixture models are covered by Bishop (2006), Stephens (2000a, 2000b), Jasra, Holmes and Stephens (2005), Papaspiliopoulos and Roberts (2008), Ishwaran and Zarepour (2002), Fraley and Raftery (2002), Dunson (2010a), Dunson and Bhattacharya (2010).


Mathematical Model for Generative Models like LDA and QDA

In Linear Discriminant Analysis or LDA (also called Gaussian Discriminant Analysis or GDA), we model $y \sim \mathrm{Bernoulli}(\phi)$, $x \mid y = 0 \sim N(\mu_0, \Sigma)$ and $x \mid y = 1 \sim N(\mu_1, \Sigma)$. Note that the means are different, but the covariance matrix is the same for the $y = 0$ and $y = 1$ cases. The joint log-likelihood is given by

$$\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} p\!\left(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma\right).$$

Standard optimization yields the maximum likelihood answer as

$$\phi = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 1\}, \quad \mu_0 = \frac{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 0\}}, \quad \mu_1 = \frac{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 1\}}, \quad \Sigma = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu_{y^{(i)}}\right)\left(x^{(i)} - \mu_{y^{(i)}}\right)^{T}.$$

The above procedure fits a linear hyperplane to separate the regions marked by the classes $y = 0$ and $y = 1$.
Other points to note are:
• If we assume $x \mid y = 0 \sim N(\mu_0, \Sigma_0)$ and $x \mid y = 1 \sim N(\mu_1, \Sigma_1)$, i.e. we assume different covariances for the two distributions, then we obtain a quadratic boundary and the consequent learning algorithm is called Quadratic Discriminant Analysis.
• If the data were indeed Gaussian, then it can be shown that, as the sample size increases, LDA asymptotically performs better than any other algorithm.
• It can be shown that logistic regression is more general than LDA/QDA; hence logistic regression will outperform LDA/QDA when the data is non-Gaussian (say, Poisson distributed).
• LDA with the covariance matrix restricted to a diagonal leads to the Gaussian Naïve Bayes model.
• LDA coupled with the Ledoit-Wolf shrinkage idea from portfolio management yields better results than plain LDA.
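The closed-form estimates above translate directly into code; a small NumPy sketch, assuming binary 0/1 labels in `y` and one example per row of `X`:

```python
import numpy as np

def fit_lda(X, y):
    """Closed-form MLEs for LDA: class prior, class means, pooled covariance."""
    m = len(y)
    phi = y.mean()                                     # P(y = 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0) # subtract each row's class mean
    sigma = centered.T @ centered / m                  # shared covariance Sigma
    # QDA would instead estimate a separate covariance from each class's rows.
    return phi, mu0, mu1, sigma
```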

A related algorithm is Naïve Bayes with Laplace correction. We describe it briefly below.

Naïve Bayes is a simple algorithm for text classification which works surprisingly well in practice in spite of its simplicity. We create a vector $x$ of length $|V|$, where $|V|$ is the size of the dictionary. We set $x_i = 1$ in the vector if the $i$-th word of the dictionary is present in the text; otherwise, we set it to zero. The "naïve" part of the Naïve Bayes title refers to the modeling assumption that the different $x_i$'s are independent given $y \in \{0, 1\}$. The model parameters are
• $y \sim \mathrm{Bernoulli}(\phi_y) \leftrightarrow \phi_y = P(y = 1)$,
• $x_i \mid y = 0 \sim \mathrm{Bernoulli}(\phi_{i|y=0}) \leftrightarrow \phi_{i|y=0} = P(x_i = 1 \mid y = 0)$, and
• $x_i \mid y = 1 \sim \mathrm{Bernoulli}(\phi_{i|y=1}) \leftrightarrow \phi_{i|y=1} = P(x_i = 1 \mid y = 1)$.

To calibrate the model, we maximize the logarithm of the joint likelihood of a training set of size $m$, $\mathcal{L}(\phi_y, \phi_{i|y=0}, \phi_{i|y=1}) = \log \prod_{i=1}^{m} p(x^{(i)}, y^{(i)})$. This yields the maximum likelihood answer as

$$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} \mathbf{1}\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 1\}}, \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^{m} \mathbf{1}\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 0\}}, \qquad \phi_y = \frac{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 1\}}{m}$$

Naïve Bayes as derived above is susceptible to 0/0 errors. To avoid these, an approximation known as Laplace smoothing is applied to restate the formulae as

$$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} \mathbf{1}\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 1\} + 2}, \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^{m} \mathbf{1}\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 0\} + 2}, \qquad \phi_y = \frac{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 1\} + 1}{m + 2}$$
Other points to note are:


• Naïve Bayes is easily generalizable to the multivariate case; the model there is also called the multivariate Bernoulli event model.
• It is common to discretize continuous-valued variables and apply Naïve Bayes instead of LDA and QDA.
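A compact sketch of the smoothed estimates above, assuming `X` is an m × |V| binary matrix (the helper names are ours, not a library API):

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """Bernoulli Naive Bayes with Laplace smoothing."""
    m = len(y)
    phi_y = (y.sum() + 1) / (m + 2)                               # P(y = 1), smoothed
    phi_j1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)   # P(x_j = 1 | y = 1)
    phi_j0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)   # P(x_j = 1 | y = 0)
    return phi_y, phi_j0, phi_j1

def predict_log_odds(x, phi_y, phi_j0, phi_j1):
    """log P(y=1|x) - log P(y=0|x) under the conditional-independence assumption."""
    ll1 = np.sum(x * np.log(phi_j1) + (1 - x) * np.log(1 - phi_j1)) + np.log(phi_y)
    ll0 = np.sum(x * np.log(phi_j0) + (1 - x) * np.log(1 - phi_j0)) + np.log(1 - phi_y)
    return ll1 - ll0   # classify as y = 1 when positive
```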

For the specific case of text classification, a multinomial event model can also be used. A text of length $n$ is represented by a vector $x = (x_1, \ldots, x_n)$, where $x_i = j$ if the $i$-th word in the text is the $j$-th word in the dictionary $V$. Consequently, $x_i \in \{1, \ldots, |V|\}$. The probability model is

$$y \sim \mathrm{Bernoulli}(\phi_y), \qquad \phi_{k|y=0} = P(x_i = k \mid y = 0), \qquad \phi_{k|y=1} = P(x_i = k \mid y = 1).$$

Further, denote each text $x^{(i)}$ in the training sample as a vector of $n_i$ words, $x^{(i)} = (x_1^{(i)}, \ldots, x_{n_i}^{(i)})$. Optimizing and including the Laplace smoothing term yields the answer as

$$\phi_y = \frac{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 1\} + 1}{m + 2}$$

$$\phi_{k|y=1} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} \mathbf{1}\{x_j^{(i)} = k \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{m} n_i\, \mathbf{1}\{y^{(i)} = 1\} + |V|}$$

$$\phi_{k|y=0} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} \mathbf{1}\{x_j^{(i)} = k \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{m} n_i\, \mathbf{1}\{y^{(i)} = 0\} + |V|}$$


Common Misconceptions around Big Data in Trading


Figure 115: Common misconceptions around the application of Big Data and Machine Learning to trading

Source: J.P.Morgan Macro QDS

1. Not Just Big, But Also Alternative: Data sources used are often new or less known, rather than just being 'Big' – the size of many commercial data sets is in gigabytes rather than petabytes. Keeping this in mind, we designate data sources in this report as Big/Alternative instead of just Big.

2. Not High Frequency Trading: Machine Learning is not related to High Frequency Trading. Sophisticated
techniques can be and are used on intraday data; however, as execution speed increases, our ability to use
computationally heavy algorithms actually decreases significantly due to time constraints. On the other hand,
Machine Learning can be and is profitably used on many daily data sources.

3. Not Unstructured Alone: Big Data is not a synonym for unstructured data. There is a substantial amount of data that is structured in tables with numeric or categorical entries. The unstructured portion is larger; but a caveat to keep in mind is that even the latest AI schemes do not pass tests corresponding to the Winograd schema. This reduces the chance that processing large bodies of text (as opposed to just tweets, social messages and small, self-contained blog posts) can lead to clear market insight.

4. Not new data alone: While the principal advantage does arise from access to newer data sources, substantial progress has been made in computational techniques as well. This progress ranges from simple improvements, like the adoption of the Bayesian paradigm, to more advanced ones, like the re-discovery of artificial neural networks and their subsequent incorporation into Deep Learning.


5. Not always non-linear: Many techniques are linear or quasi-linear in the parameters being estimated; later in this report, we illustrate examples of these, including logistic regression (linear) and kernelized support vector machines (quasi-linear). Many others stem from easy extensions of linear models into the non-linear domain. It is erroneous to assume that Machine Learning deals exclusively with non-linear models, though non-linear models certainly dominate much of the recent literature on the topic.

6. Not always black box: Some Machine Learning techniques are packaged as black-box algorithms, i.e. they use data not only to calibrate model parameters, but also to deduce the generic parametric form of the model and even to choose the input features. However, we note that Machine Learning subsumes a wide variety of models that range from the interpretable (like binary trees) to the semi-interpretable (like support vector machines) to the more black-box (like neural nets).


Provenance of Data Analysis Techniques


To understand Big Data analysis techniques as used in investment processes, we find it useful to track their origin and place
them in one of the four following categories:

a. ‘Statistical Learning’ from Statistics;


b. ‘Machine/Deep Learning’ and ‘Artificial Intelligence’ from Computer Science;
c. ‘Time Series Analysis’ from Econometrics; and
d. ‘Signal Processing’ from Electrical Engineering.

This classification is useful in many data science applications, where we often have to put together tools and algorithms
drawn from these diverse disciplines. We have covered Machine Learning in detail in this report. In this section, we briefly
describe the other three segments.

Figure 116: Provenance of tools employed in modern financial data analysis

Source: J.P.Morgan Macro QDS

Statistical Learning from Statistics

Classical Statistics arose from the need to collect representative samples from large populations. Research in statistics led to the development of rigorous analysis techniques that concentrated initially on small data sets drawn from either agriculture or industry. As data sizes increased, statisticians focused on the data-driven approach and on computational aspects. Such numerical modeling of ever-larger data sets with the aim of detecting patterns and trends is called 'Statistical Learning'. Both the theory and the toolkit of statistical learning find heavy application in modern data science. For example, one can use Principal Component Analysis (PCA) to uncover uncorrelated factors of variation behind any yield curve. Such analysis typically reveals that much of the movement in yield curves can be explained through just three factors: a parallel shift, a change in slope and a change in convexity. Attributing yield-curve changes to PCA factors enables an analyst to isolate sectors within the yield curve that have cheapened or richened beyond what was expected from their traditional weighting on the factors. This knowledge is used in both the initiation and the closing of relative value opportunities.
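A sketch of this yield-curve decomposition with scikit-learn, on synthetic curve changes built from level, slope and convexity loadings (the data construction is ours, for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a T x K matrix of daily yield-curve changes (K tenors);
# in practice this would be actual curve data.
rng = np.random.default_rng(4)
tenors = np.array([0.25, 1, 2, 5, 7, 10, 20, 30])
level = np.ones_like(tenors)                              # parallel-shift loading
slope = (tenors - tenors.mean()) / tenors.std()           # slope loading
convex = slope ** 2 - (slope ** 2).mean()                 # convexity loading
factor_moves = rng.standard_normal((500, 3)) * [5.0, 2.0, 0.5]   # bp moves per factor
changes = (factor_moves @ np.vstack([level, slope, convex])
           + 0.1 * rng.standard_normal((500, len(tenors))))      # idiosyncratic noise

pca = PCA(n_components=3).fit(changes)
print(pca.explained_variance_ratio_)          # three factors explain almost everything
residual = changes - pca.inverse_transform(pca.transform(changes))
# Large residuals flag tenors that have cheapened/richened beyond the factor model.
```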


Techniques drawn from statistics include tools from the frequentist domain, Bayesian analysis, statistical learning and compressed sensing. The simplest tools still used in practice, like OLS/ANOVA and polynomial fits, were borrowed from frequentists, even if they are often posed in a Bayesian framework nowadays. Other frequentist tools in use include null hypothesis testing, bootstrap estimation, distribution fitting, goodness-of-fit tests, tests for independence and homogeneity, the Q-Q plot and the Kolmogorov-Smirnov test. As discussed elsewhere in the report, much analysis has moved to the Bayesian paradigm. The choice of prior family (conjugate, Zellner's g, Jeffreys), the estimation of hyperparameters and the associated MCMC simulations draw from this literature. Even simple Bayesian techniques like Naïve Bayes with Laplace correction continue to find use in practical applications. The statistical learning literature has a substantial intersection with Machine Learning research. A simple example arises from Bayesian regularization of ordinary linear regression, leading to the Ridge and Lasso regression models. Another example lies in the ensemble learning methods of bagging/boosting, which enable weak learners to be combined into strong ones. Compressed sensing arose from research on sparse matrix reconstruction, with initial applications to the reconstruction of sub-sampled images. Viewing compressed sensing as L1-norm minimization leads to robust portfolio construction.
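For instance, the two regularized regressions mentioned above differ only in their penalty (a Gaussian prior / L2 penalty for Ridge, a Laplace prior / L1 penalty for Lasso), which a short scikit-learn sketch on synthetic sparse data makes visible; the data and penalty strengths are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 20))
beta = np.zeros(20)
beta[:3] = [1.5, -2.0, 0.7]                               # sparse true coefficients
y = X @ beta + 0.5 * rng.standard_normal(200)

print(Ridge(alpha=1.0).fit(X, y).coef_.round(2))          # shrinks all coefficients
print(Lasso(alpha=0.1).fit(X, y).coef_.round(2))          # zeros out most coefficients
```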

Time Series Analysis from Econometrics

Time-series analysis refers to the analytical toolkit used by econometricians for the specific analysis of financial data. When the future evolution of an asset return depended on its own past values in a linear fashion, the return time-series was said to follow an auto-regressive (AR) process. Certain other variables could be represented as a smoothed average of noise-like terms and were called moving average (MA) processes. The Box-Jenkins approach developed in the 1970s used correlations and other statistical tests to classify and study such auto-regressive moving average (ARMA) processes. To model the observation that volatility in financial markets often occurs in bursts, new models for processes with time-varying volatility were introduced under the rubric of GARCH (Generalized Auto-Regressive Conditional Heteroskedasticity) models. In financial economics, the technique of the Impulse Response Function (IRF) is often used to discern the impact of changing one macroeconomic variable (say, the Fed funds rate) on other macroeconomic variables (like inflation or GDP growth). In this primer, we make occasional use of these techniques in pre-processing steps before employing Machine Learning or statistical learning algorithms. However, we do not describe the details of any time-series technique, as they are not specific to Big Data analysis and, further, many are already well known to traditional quantitative researchers.

Signal Processing from Electrical Engineering

Signal processing arose from attempts by electrical engineers to efficiently encode and decode speech transmissions. Signal processing techniques focus on recovering signals submerged in noise, and have been employed in quantitative investment strategies since the 1980s. By letting the beta coefficient in a linear regression evolve across time, we get the popular Kalman filter, which was used widely in pairs trading strategies. The Hidden Markov Model (HMM) posited the existence of latent states evolving as a Markov chain (i.e., the future evolution of the system depends only on the current state, not past states) underlying the observed price and return behavior. Such HMMs find use in regime-change models as well as in high-frequency trend-following strategies. Signal processing engineers analyze the frequency content of their signals and try to isolate specific frequencies through the use of frequency-selective filters. Such filters – e.g. a low-pass filter discarding higher-frequency noise components – are used as a pre-processing step before feeding the data through a Machine Learning model. In this primer, we describe only a small subset of the signal processing techniques that find widespread use in the context of Big Data analysis.
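A minimal sketch of the time-varying beta idea mentioned above: a scalar Kalman filter tracking a random-walk beta (the noise variances q and r are assumed tuning parameters, not from the text):

```python
import numpy as np

def kalman_beta(x, y, q=1e-4, r=1e-2):
    """Track beta_t in y_t = beta_t * x_t + e_t, with beta_t a random walk."""
    beta, p = 0.0, 1.0                       # state estimate and its variance
    out = np.empty(len(y))
    for t in range(len(y)):
        p += q                               # predict: beta_t = beta_{t-1} + w_t, w ~ N(0, q)
        k = p * x[t] / (x[t] ** 2 * p + r)   # Kalman gain for observation y_t
        beta += k * (y[t] - x[t] * beta)     # update with the innovation
        p *= (1.0 - k * x[t])                # posterior variance
        out[t] = beta
    return out
```

In a pairs-trading setting, `x` and `y` would be the two legs' returns, and the filtered `beta` path gives the evolving hedge ratio.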

One can further classify signal processing tools as arising from either discrete-time signal processing or statistical signal processing. Discrete-time signal processing deals with the design of frequency-selective finite/infinite impulse response (FIR/IIR) filter banks using Discrete Fourier Transform (DFT) or Z-transform techniques. The use of FFT (an efficient algorithm for DFT computation) analysis to design an appropriate Chebyshev/Butterworth filter is common. The trend-fitting Hodrick-Prescott filter tends to find more space in financial analysis than in signal processing papers. Techniques for speech signal processing like the Hidden Markov Model, alongside the eponymous Viterbi algorithm, are used to model a latent process as a Markov chain. From statistical signal processing, we get a variety of tools for estimation and detection. Sometimes studied under the rubric of decision theory, these include Maximum Likelihood/Maximum A-Posteriori/Minimum Mean-Square Error (ML/MAP/MMSE) estimators. Non-Bayesian estimators include von Neumann or minimax estimators. Besides the Karhunen-Loeve expansion (with an expert use illustration in the digital communication literature), quants borrow

practical tools like the ROC (Receiver Operating Characteristic) curve. Theoretical results like the Cramer-Rao Lower Bound provide justification for the use of practical techniques through asymptotic consistency/convergence proofs. Machine Learning borrows Expectation Maximization from this literature and makes extensive use of it to find ML parameters for complicated statistical models. Statistical signal processing is also the source of the Kalman (extended/unscented) and particle filters used in quantitative trading.


A Brief History of Big Data Analysis


While the focus on Big Data is new, the search for new and quicker information has been a permanent feature of investing.
We can track this evolution through four historical anecdotes.

a. The need to reduce the latency of receiving information provided the first thrust. The story of Nathan Rothschild using carrier pigeons in June 1815 to learn the outcome of the Battle of Waterloo and go long the London bourse is often cited in this respect.
b. The second thrust came from systematically collecting and analyzing "big" data. In the first half of the 20th century, Benjamin Graham and other investors collected accounting ratios of firms on a systematic basis, and developed the ideas of Value Investing from them.
c. The third thrust came from locating new data that was either hard or costly to collect. Sam Walton – the founder of Walmart – used to fly in his helicopter over parking lots to evaluate his real estate investments in the early '50s.
d. The fourth thrust came from using technological tools to accomplish the above objectives of quickly securing hard-to-track data. In the 1980s, Marc Rich – the founder of Glencore – used binoculars to locate oil ships/tankers and relayed the gleaned insight using satellite phones.

Understanding the historical evolution above helps explain the alternative data available today to the investment professional. Carrier pigeons have long given way to computerized networks. Data screened from accounting statements have become standardized inputs to investments; aggregators such as Bloomberg and FactSet disseminate these widely, removing the need to collect them manually as the early value investors did. Instead of flying over parking lots in a helicopter, we can procure the same data from companies like Orbital Insight, which use neural networks to process imagery from low-earth-orbit satellites. And finally, instead of binoculars and satellite phones, we have firms like CargoMetrics that locate oil ships along maritime pathways through satellites and use such information to trade commodities and currencies.

In this primer, we refer to our data sets as big/alternative data. Here, Big Data refers to large data sets, which can include financial time-series such as tick-level order book information, often marked by the three Vs of volume, velocity and variety. Alternative data refers to data – typically, but not necessarily, non-financial – that has received less attention from market participants and yet has potential utility in predicting future returns for some financial assets. Alternative data stands differentiated from traditional data, by which we refer to standard financial data like daily market prices, company filings and management reports.

The notion of Big Data and the conceptual toolkit of data-driven models are not new to financial economics. As early as 1920, Wesley Mitchell established the National Bureau of Economic Research to collect data on a large scale about the US economy. Using the data sets collected, researchers attempted to statistically uncover the patterns inherent in the data rather than formulaically deriving a theory and then fitting the data to it. This statistical, a-theoretical approach using novel data sets serves as a clear precursor to modern Machine Learning research on Big/Alternative data sets. In 1930, such statistical analysis led to the claim of wave patterns in macroeconomic data by Simon Kuznets, who was awarded the Nobel Memorial Prize in Economic Sciences (hereafter, 'Economics Nobel') in 1971. Similar claims of economic waves based on statistical analysis were made later by Kitchin, Juglar and Kondratiev. The same era also saw the dismissal of both the a-theoretical/statistical and the theoretical/mathematical models by John Maynard Keynes (a claim seconded by Hayek later), who saw social phenomena as incompatible with strict formulation via either mathematical theorization or statistical formulation. Yet, ironically, it was Keynesian models that led to the next round of large-scale data collection (growing up to hundreds of thousands of prices and quantities across time) and analysis (up to hundreds of thousands of equations). The first Economics Nobel was awarded, precisely for the application of Big Data, to Jan Tinbergen (shared with fellow econometrician Ragnar Frisch) for his comprehensive national models for the Netherlands, the United Kingdom and the United States. Lawrence Klein (Economics Nobel, 1980) formulated the first global large-scale macroeconomic model; the LINK project spun off from his work at Wharton continues to be used to date for forecasting purposes. The most influential critique of such models – based on past correlations rather than formal theory – was made by Robert Lucas (Economics Nobel, 1995), who argued for the re-establishment of theory to account for the evolution of empirical correlations triggered by policy changes. Even the Bayesian paradigm, through which a researcher can systematically update his or her prior beliefs based on streaming evidence, was brought into macroeconometrics in an influential article by Chris Sims (Economics Nobel, 2011) [Sims (1980)].


Apart from employment of new, large data sets, econometricians have also advanced the modern data analysis toolkit in a
significant manner. Recognizing the need to account for auto-correlations in predictor as well as predicted variables, the
Box-Jenkins approach was pioneered in the 1970s. Further, statistical properties of financial time-series tend to evolve with
time. To account for such time-varying variance (termed ‘heteroskedasticity’) and fat tails in asset returns, new models such
as ARCH (invented in Engle (1982), winning Robert Engle the Economics Nobel in 2003) and GARCH were developed;
and these continue to be widely used by investment practitioners.

A similar historical line of ups and downs can be traced in the computer science community for the development of modern Deep Learning; academic historical overviews are given in Bengio (2009), LeCun et al. (2015) and Schmidhuber (2015). An early paper in 1949 by the Canadian neuropsychologist Donald Hebb – see Hebb (1949) – related learning within the human brain to the formation of synapses (think: linking mechanism) between neurons (think: basic computing unit). A simple calculating model for a neuron was suggested in 1943 by McCulloch and Pitts – see McCulloch-Pitts (1943) – which computed a weighted average of the input, and then returned one if the average was above a threshold, and zero otherwise.

Figure 117: The standard McCulloch-Pitts model of neuron

Source: J.P.Morgan Macro QDS

In 1958, the psychologist Frank Rosenblatt built the first modern neural network model, called the Perceptron, and showed that the weights in the McCulloch-Pitts model could be calibrated using the available data set; in essence, he had invented what we now call a learning algorithm. The Perceptron model was designed for image recognition purposes and implemented in hardware, thus serving as a precursor to the modern GPU units used in image signal processing. The learning rule was further refined through the work of Widrow-Hoff (1960), which calibrated the parameters by minimizing the difference between the actual, pre-known output and the reconstructed one. Even today, Rosenblatt's Perceptron and the Widrow-Hoff rule continue to find a place in the Machine Learning curriculum. These results spurred the first wave of excitement about Artificial Intelligence, which ended abruptly in 1969 when the influential MIT theorist Marvin Minsky wrote a scathing critique in the book titled "Perceptrons" [Minsky-Papert (1969)]. Minsky pointed out that perceptrons as defined by Rosenblatt can never replicate a simple structure like the XOR function, defined as 1⊕1 = 0⊕0 = 0 and 1⊕0 = 0⊕1 = 1. This critique ushered in what is now called the first AI Winter.

The first breakthroughs happened in the 1970s [Werbos (1974), an aptly titled PhD thesis, "Beyond regression: New tools for prediction and analysis…"], though they gained popularity only in the 1980s [Rumelhart et al. (1986)]. The older neural models had a simple weighted average followed by a piece-wise linear thresholding function. Newer models began to have multiple layers of neurons interconnected to each other, and further replaced the simple threshold function (which returned


one if the average exceeded the threshold and zero otherwise) with a non-linear, smooth function (now called an activation function). The intermediate layers of neurons, hidden between the input and the output layers of neurons, served to uncover new features from the data. These models, which could theoretically implement any function including the XOR⁶⁸, used regular high-school calculus to calibrate the parameters (viz., the weights on the links between neurons); the technique itself is now called backpropagation. Readers familiar with numerical analysis can think of backpropagation (using 'gradient descent') as an extension of the simple Newton's algorithm for iteratively solving equations. Variants of gradient descent remain the workhorse for training neural networks today. The first practical application of neural networks to massive data sets arose in 1989, when researchers at AT&T Bell Labs used data from the US Postal Service to decipher hand-written zip code information; see LeCun et al. (1989).
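To make the backpropagation and XOR points concrete, a minimal two-layer network in NumPy (the architecture, learning rate and iteration count are our illustrative choices; depending on the random initialization, more iterations may be needed):

```python
import numpy as np

rng = np.random.default_rng(6)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.], [1.], [1.], [0.]])               # XOR truth table

sig = lambda z: 1.0 / (1.0 + np.exp(-z))             # smooth activation function
W1, b1 = rng.standard_normal((2, 4)), np.zeros(4)    # hidden layer uncovers new features
W2, b2 = rng.standard_normal((4, 1)), np.zeros(1)

for _ in range(5000):                                # backpropagation via gradient descent
    h = sig(X @ W1 + b1)                             # forward pass, hidden layer
    out = sig(h @ W2 + b2)                           # forward pass, output layer
    d_out = (out - y) * out * (1 - out)              # chain rule at the output
    d_h = (d_out @ W2.T) * h * (1 - h)               # chain rule propagated backwards
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(0)

print(out.round(2))                                  # approximately [[0], [1], [1], [0]]
```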

The second AI winter arose more gradually in the early 1990s. Calibrating the weights of the interconnections in a multi-layer neural network was not only time-consuming, it was found to be error-prone as the number of hidden layers increased [Schmidhuber (2015)]. Meanwhile, competing techniques from outside the neural network community started to make their impression (as reported in LeCun (1995)); in this report, we survey two of the most prominent of these, namely Support Vector Machines and Random Forests. These techniques quickly eclipsed neural networks, and as funding declined rapidly, active research continued only in select groups in Canada and the United States.

The second AI winter ended in 2006, when Geoffrey Hinton's research group at the University of Toronto demonstrated that a multi-layer neural network could be efficiently trained using a greedy, layer-wise pre-training strategy [Hinton et al. (2006)]. While Hinton's original analysis focused on a specific type of neural network called the Deep Belief Network, other researchers quickly extended it to many other types of multi-layer neural networks. This launched a new renaissance in Machine Learning that continues to date and is profiled in detail in this primer.

⁶⁸ For the universality claim, see Hornik et al. (1989).


References

Abayomi, K., Gelman, A., and Levy, M. (2008), “Diagnostics for multivariate imputations”, Applied Statistics 57, 273–291.

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A. I. (1995), "Fast discovery of association rules", Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Cambridge, MA.

Agresti, A. (2002), “Categorical Data Analysis”, second edition, New York: Wiley.

Akaike, H. (1973), “Information theory and an extension of the maximum likelihood principle”, Second International
Symposium on Information Theory, 267–281.

Amit, Y. and Geman, D. (1997), “Shape quantization and recognition with randomized trees”, Neural Computation 9: 1545–
1588.

Anderson, T. (2003), "An Introduction to Multivariate Statistical Analysis", 3rd ed., Wiley, New York.

Andrieu, C., and Robert, C. (2001), “Controlled MCMC for optimal sampling”, Technical report, Department of
Mathematics, University of Bristol.

Andrieu, C., and Thoms, J. (2008), "A tutorial on adaptive MCMC", Statistics and Computing 18, 343–373.

Ba, J., Mnih, V., & Kavukcuoglu, K. (2014), “Multiple object recognition with visual attention”, arXiv preprint
arXiv:1412.7755.

Babb, Tim, “How a Kalman filter works, in pictures”, Available at link.

Banerjee, A., Dunson, D. B., and Tokdar, S. (2011), “Efficient Gaussian process regression for large data sets”, Available at
link.

Barbieri, M. M., and Berger, J. O. (2004), “Optimal predictive model selection”, Annals of Statistics 32, 870–897.

Barnard, J., McCulloch, R. E., and Meng, X. L. (2000), “Modeling covariance matrices in terms of standard deviations and
correlations with application to shrinkage”. Statistica Sinica 10,1281–1311.

Bartlett, P. and Traskin, M. (2007), “Adaboost is consistent, in B. Schälkopf”, J. Platt and T. Hoffman (eds), Advances in
Neural Information Processing Systems 19, MIT Press, Cambridge, MA, 105-112.

Bell, A. and Sejnowski, T. (1995), “An information-maximization approach to blind separation and blind deconvolution”,
Neural Computation 7: 1129–1159.

Bengio, Y (2009), “Learning deep architectures for AI”, Foundations and Trends in Machine Learning, Vol 2:1.

Bengio, Y., Courville, A., & Vincent, P. (2013), “Representation learning: A review and new perspectives”, IEEE
transactions on pattern analysis and machine intelligence, 35(8), 1798-1828

Bengio, Y., Goodfellow, I. J., & Courville, A. (2015), “Deep Learning”. Nature, 521, 436-444.

Berry, S., M., Carlin, B. P., Lee, J. J., and Muller, P. (2010), “Bayesian Adaptive Methods for Clinical Trials”, London:
Chapman & Hall.

Betancourt, M. J. (2013), “Generalizing the no-U-turn sampler to Riemannian manifolds”, Available at link.

Betancourt, M. J., and Stein, L. C. (2011), “The geometry of Hamiltonian Monte Carlo”, Available at link.

Bigelow, J. L., and Dunson, D. B. (2009), “Bayesian semiparametric joint models for functional predictors”, Journal of the
American Statistical Association 104, 26–36.

Biller, C. (2000), “Adaptive Bayesian regression splines in semiparametric generalized linear models”, Journal of
Computational and Graphical Statistics 9, 122–140.

Bilmes, Jeff (1998), "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", Available at link.

Bishop, C. (1995), “Neural Networks for Pattern Recognition”, Clarendon Press, Oxford.

Bishop, C. (2006), "Pattern Recognition and Machine Learning", Springer, New York.

Blei, D., Ng, A., and Jordan, M. (2003), “Latent Dirichlet allocation”, Journal of Machine Learning Research 3, 993–1022.

Bollerslev, T. (1986), "Generalized autoregressive conditional heteroskedasticity", Journal of Econometrics, Vol. 31(3), 307–327.

Bradlow, E. T., and Fader, P. S. (2001), “A Bayesian lifetime model for the “Hot 100” Billboard songs”, Journal of the
American Statistical Association 96, 368–381.

Breiman, L. (1992), “The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction
error”, Journal of the American Statistical Association 87: 738–754.

Breiman, L. (1996a), “Bagging predictors”, Machine Learning 26: 123–140.

Breiman, L. (1996b), “Stacked regressions”, Machine Learning 24: 51–64.

Breiman, L. (1998), “Arcing classifiers (with discussion)”, Annals of Statistics 26: 801–849.

Breiman, L. (1999), “Prediction games and arcing algorithms”, Neural Computation 11(7): 1493–1517.

Breiman, L. (2001), “Random Forests”, Journal of Machine Learning, Vol 45(1), 5-32. Available at link.

Breiman, L. and Spector, P. (1992), “Submodel selection and evaluation in regression: the X-random case”, International
Statistical Review 60: 291–319.

Brooks, S. P., Giudici, P., and Roberts, G. O. (2003), "Efficient construction of reversible jump MCMC proposal distributions (with discussion)", Journal of the Royal Statistical Society B 65, 3–55.

Bruce, A. and Gao, H. (1996), “Applied Wavelet Analysis with S-PLUS”, Springer, New York.

Bühlmann, P. and Hothorn, T. (2007), “Boosting algorithms: regularization, prediction and model fitting (with discussion)”,
Statistical Science 22(4): 477–505.

Burges, C. (1998), “A tutorial on support vector machines for pattern recognition”, Knowledge Discovery and Data Mining
2(2): 121–167.


Carvalho, C. M., Lopes, H. F., Polson, N. G., and Taddy, M. A. (2010), “Particle learning for general mixtures”, Bayesian
Analysis 5, 709–740.

Chen, S. S., Donoho, D. and Saunders, M. (1998), “Atomic decomposition by basis pursuit”, SIAM Journal on Scientific
Computing 20(1): 33–61.

Chen, Z. (2003), "Bayesian filtering: From Kalman filters to particle filters, and beyond", Technical report, Adaptive Systems Lab, McMaster University.

Cherkassky, V. and Mulier, F. (2007), “Learning from Data (2nd Edition)”, Wiley, New York.

Chib, S et al. (2002), “Markov chain Monte Carlo methods for stochastic volatility models”, Journal of Econometrics
108(2):281–316.

Chipman, H., George, E. I., and McCulloch, R. E. (1998), “Bayesian CART model search (with discussion)”, Journal of the
American Statistical Association 93, 935–960.

Chui, C. (1992), “An Introduction to Wavelets”, Academic Press, London.

Clemen, R. T. (1996), “Making Hard Decisions”, second edition. Belmont, Calif.: Duxbury Press.

Clyde, M., DeSimone, H., and Parmigiani, G. (1996), “Prediction via orthogonalized model mixing”, Journal of the
American Statistical Association 91, 1197–1208.

Comon, P. (1994), “Independent component analysis—a new concept?”, Signal Processing 36: 287–314.

Cook, S., Gelman, A., and Rubin, D. B. (2006), “Validation of software for Bayesian models using posterior quantiles”,
Journal of Computational and Graphical Statistics 15, 675–692.

Cox, D. and Wermuth, N. (1996), “Multivariate Dependencies: Models, Analysis and Interpretation”, Chapman and Hall,
London.

Cseke, B., and Heskes, T. (2011), “Approximate marginals in latent Gaussian models”, Journal of Machine Learning
Research 12, 417–454.

Daniels, M. J., and Kass, R. E. (1999), “Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical
models”, Journal of the American Statistical Association 94, 1254-1263.

Daniels, M. J., and Kass, R. E. (2001), “Shrinkage estimators for covariance matrices”, Biometrics 57, 1173–1184.

Dasarathy, B. (1991), “Nearest Neighbor Pattern Classification Techniques”, IEEE Computer Society Press, Los Alamitos,
CA.

Daubechies, I. (1992), “Ten Lectures on Wavelets”, Society for Industrial and Applied Mathematics, Philadelphia, PA.

Denison, D. G. T., Holmes, C. C., Mallick, B. K., and Smith, A. F. M. (2002), “Bayesian Methods for Nonlinear
Classification and Regression”, New York: Wiley.

Dietterich, T. (2000a), “Ensemble methods in machine learning”, Lecture Notes in Computer Science 1857: 1–15.

Dietterich, T. (2000b), “An experimental comparison of three methods for constructing ensembles of decision trees:
bagging, boosting, and randomization”, Machine Learning 40(2): 139–157.

DiMatteo, I., Genovese, C. R., and Kass, R. E. (2001), “Bayesian curve-fitting with free-knot splines”, Biometrika 88,
1055–1071.

Dobra, A., Tebaldi, C., and West, M. (2003), “Bayesian inference for incomplete multi-way tables”, Technical report,
Institute of Statistics and Decision Sciences, Duke University.

Donoho, D. and Johnstone, I. (1994), “Ideal spatial adaptation by wavelet shrinkage”, Biometrika 81: 425–455.

Douc, R. & Cappé, O. (2005), “Comparison of resampling schemes for particle filtering”, In Proceedings of the International Symposium on Image and Signal Processing and Analysis (ISPA 2005).

Doucet, A & Johansen, A (2008), “A tutorial on particle filtering and smoothing: Fifteen years later”.

Duda, R., Hart, P. and Stork, D. (2000), “Pattern Classification” (2nd Edition), Wiley, New York.

Dunson, D. B. (2005), “Bayesian semiparametric isotonic regression for count data”, Journal of the American Statistical
Association 100, 618–627.

Dunson, D. B. (2009), “Bayesian nonparametric hierarchical modeling”, Biometrical Journal 51, 273–284.

Dunson, D. B. (2010a), “Flexible Bayes regression of epidemiologic data”, In Oxford Handbook of Applied Bayesian
Analysis, ed. A. O’Hagan and M. West. Oxford University Press.

Dunson, D. B. (2010b), “Nonparametric Bayes applications to biostatistics”, In Bayesian Non-parametrics, ed. N. L. Hjort,
C. Holmes, P. Muller, and S. G. Walker. Cambridge University Press.

Dunson, D. B., and Bhattacharya, A. (2010), “Nonparametric Bayes regression and classification through mixtures of product kernels”, In Bayesian Statistics 9, ed. J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West, 145–164. Oxford University Press.

Dunson, D. B., and Taylor, J. A. (2005), “Approximate Bayesian inference for quantiles”, Journal of Nonparametric
Statistics 17, 385–400.

Edwards, D. (2000), “Introduction to Graphical Modelling”, 2nd Edition, Springer, New York.

Efron, B. and Tibshirani, R. (1993), “An Introduction to the Bootstrap”, Chapman and Hall, London.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004), “Least angle regression (with discussion)”, Annals of Statistics
32(2): 407–499.

Ekster, G (2014), “Finding and using unique datasets by hedge funds”, Hedge Week Article published on 3/11/2014.

Ekster, G (2015), “Driving investment process with alternative data”, White Paper by Integrity Research.

Elliott, R.J., Van Der Hoek, J. and Malcolm, W.P. (2005), “Pairs trading”, Quantitative Finance, 5(3), 271-276. Available at link.

Engle, R (1982), “Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom
inflation”, Econometrica, Vol 50 (4), 987-1008.

Evgeniou, T., Pontil, M. and Poggio, T. (2000), “Regularization networks and support vector machines”, Advances in
Computational Mathematics 13(1): 1–50.

Fan, J. and Gijbels, I. (1996), “Local Polynomial Modelling and Its Applications”, Chapman and Hall, London.

Faragher, R (2012), “Understanding the Basis of the Kalman Filter via a Simple and Intuitive Derivation”.

Fill, J. A. (1998), “An interruptible algorithm for perfect sampling”, Annals of Applied Probability 8, 131–162.

Flury, B. (1990), “Principal points”, Biometrika 77: 33–41.

Fraley, C., and Raftery, A. E. (2002), “Model-based clustering, discriminant analysis, and density estimation”, Journal of
the American Statistical Association 97, 611–631.

Frank, I. and Friedman, J. (1993), “A statistical view of some chemometrics regression tools (with discussion)”,
Technometrics 35(2): 109–148.

Freund, Y. (1995), “Boosting a weak learning algorithm by majority”, Information and Computation 121(2): 256–285.

Freund, Y. and Schapire, R. (1996b), “Game theory, on-line prediction and boosting”, Proceedings of the Ninth Annual
Conference on Computational Learning Theory, Desenzano del Garda, Italy, 325–332.

Friedman, J. (1994b), “An overview of predictive learning and function approximation”, in V. Cherkassky, J. Friedman and H. Wechsler (eds), From Statistics to Neural Networks, Vol. 136 of NATO ASI Series F, Springer, New York.

Friedman, J. (1999), “Stochastic gradient boosting”, Technical report, Stanford University.

Friedman, J. (2001), “Greedy function approximation: A gradient boosting machine”, Annals of Statistics 29(5): 1189–
1232.

Friedman, J. and Hall, P. (2007), “On bagging and nonlinear estimation”, Journal of Statistical Planning and Inference 137: 669–683.

Friedman, J. and Popescu, B. (2008), “Predictive learning via rule ensembles”, Annals of Applied Statistics, to appear.

Friedman, J., Hastie, T. and Tibshirani, R. (2000), “Additive logistic regression: a statistical view of boosting (with discussion)”, Annals of Statistics 28: 337–407.

Gelfand, A. and Smith, A. (1990), “Sampling based approaches to calculating marginal densities”, Journal of the American Statistical Association 85: 398–409.

Gelman, A. (2005), “Analysis of variance: why it is more important than ever (with discussion)”, Annals of Statistics 33, 1–53.

Gelman, A. (2006b), “The boxer, the wrestler, and the coin flip: a paradox of robust Bayesian inference and belief
functions”, American Statistician 60, 146–150.

Gelman, A. (2007a), “Struggles with survey weighting and regression modeling (with discussion)”, Statistical Science 22,
153–188.

Gelman, A. (2007b), “Discussion of ‘Bayesian checking of the second levels of hierarchical models’ by M. J. Bayarri and M. E. Castellanos”, Statistical Science 22, 349–352.

Gelman, A., and Hill, J. (2007), “Data Analysis Using Regression and Multilevel/Hierarchical Models”, Cambridge
University Press.

Gelman, A., Carlin, J., Stern, H. and Rubin, D. (1995), “Bayesian Data Analysis”, CRC Press, Boca Raton, FL.

Gelman, A., Chew, G. L., and Shnaidman, M. (2004), “Bayesian analysis of serial dilution assays”, Biometrics 60, 407–417.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. and Rubin, D. B. (2013), “Bayesian Data Analysis”, third edition, CRC Press.

Gentle, J. E. (2003), “Random Number Generation and Monte Carlo Methods”, second edition. New York: Springer.

George, E. I., and McCulloch, R. E. (1993), “Variable selection via Gibbs sampling”, Journal of the American Statistical
Association 88, 881–889.

Gershman, S. J., Hoffman, M. D., and Blei, D. M. (2012), “Nonparametric variational inference”, In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland.

Gersho, A. and Gray, R. (1992), “Vector Quantization and Signal Compression”, Kluwer Academic Publishers, Boston,
MA.

Geweke, J (1989), “Bayesian inference in econometric models using Monte Carlo integration”, Econometrica: Journal of the
Econometric Society, 1317–1339.

Gilovich, T., Griffin, D., and Kahneman, D. (2002), “Heuristics and Biases: The Psychology of Intuitive Judgment”,
Cambridge University Press.

Girolami, M., and Calderhead, B. (2011), “Riemann manifold Langevin and Hamiltonian Monte Carlo methods (with
discussion)”, Journal of the Royal Statistical Society B 73, 123–214.

Girosi, F., Jones, M. and Poggio, T. (1995), “Regularization theory and neural network architectures”, Neural Computation
7: 219–269.

Gneiting, T. (2011), “Making and evaluating point forecasts”, Journal of the American Statistical Association 106, 746–762.

Gordon, A. (1999), “Classification (2nd edition)”, Chapman and Hall/CRC Press, London.

Gordon, N et al. (1993), “Novel approach to nonlinear/non-Gaussian Bayesian state estimation”, In Radar and Signal
Processing, IEE Proceedings F, vol. 140, 107–113. IET.

Graves, A. (2013), “Generating sequences with recurrent neural networks”, arXiv preprint arXiv:1308.0850.

Graves, A., & Jaitly, N. (2014), “Towards End-To-End Speech Recognition with Recurrent Neural Networks”, In ICML
(Vol. 14, 1764-1772).

Green, P. and Silverman, B. (1994), “Nonparametric Regression and Generalized Linear Models: A Roughness Penalty
Approach”, Chapman and Hall, London.

Green, P. J. (1995), “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination”,
Biometrika 82, 711–732.

Greenland, S. (2005), “Multiple-bias modelling for analysis of observational data”, Journal of the Royal Statistical Society
A 168, 267–306.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015), “DRAW: A recurrent neural network for
image generation”, arXiv preprint arXiv:1502.04623.

Groves, R. M., Dillman, D. A., Eltinge, J. L., and Little, R. J. A., eds. (2002), “Survey Nonresponse”, New York: Wiley.

Hall, P. (1992), “The Bootstrap and Edgeworth Expansion”, Springer, New York.

Hanif, A. & Smith, R. (2012), “Generation Based Path-Switching in Sequential Monte Carlo Methods”, IEEE Congress on Evolutionary Computation (CEC), 2012, pages 1–7. IEEE.

Hanif, A & Smith, R (2013), “Stochastic Volatility Modeling with Computational Intelligence Particle Filters”, Genetic and
Evolutionary Computation Conference (GECCO), ACM.

Hanif, A (2013), “Computational Intelligence Sequential Monte Carlos for Recursive Bayesian Estimation”, PhD Thesis,
Intelligent Systems Group, UCL.

Hannah, L., and Dunson, D. B. (2011), “Bayesian nonparametric multivariate convex regression”, Available at link.

Hastie, T. (1984), “Principal Curves and Surfaces”, PhD thesis, Stanford University.

Hastie, T. and Stuetzle, W. (1989), “Principal curves”, Journal of the American Statistical Association 84(406): 502–516.

Hastie, T. and Tibshirani, R. (1990), “Generalized Additive Models”, Chapman and Hall, London.

Hastie, T. and Tibshirani, R. (1996a), “Discriminant adaptive nearest neighbor classification”, IEEE Pattern Recognition
and Machine Intelligence 18: 607–616.

Hastie, T. and Tibshirani, R. (1996b), “Discriminant analysis by Gaussian mixtures”, Journal of the Royal Statistical Society
Series B. 58: 155–176.

Hastie, T. and Tibshirani, R. (1998), “Classification by pairwise coupling”, Annals of Statistics 26(2): 451–471.

Hastie, T., Buja, A. and Tibshirani, R. (1995), “Penalized discriminant analysis”, Annals of Statistics 23: 73–102.

Hastie, T., Taylor, J., Tibshirani, R. and Walther, G. (2007), “Forward stagewise regression and the monotone lasso”,
Electronic Journal of Statistics 1: 1–29.

Hastie, T., Tibshirani, R. and Buja, A. (1994), “Flexible discriminant analysis by optimal scoring”, Journal of the American
Statistical Association 89: 1255–1270.

Hastie, T; Tibshirani, R and Friedman, J (2013), “The elements of statistical learning”, 2nd edition, Springer. Available at
link.

Hazelton, M. L., and Turlach, B. A. (2011), “Semiparametric regression with shape-constrained penalized splines”,
Computational Statistics and Data Analysis 55, 2871–2879.

Hebb, D. O. (1949), “The organization of behavior: a neuropsychological theory”, Wiley and Sons, New York.

Heskes, T., Opper, M., Wiegerinck, W., Winther, O., and Zoeter, O. (2005), “Approximate inference techniques with
expectation constraints”, Journal of Statistical Mechanics: Theory and Experiment, P11015.

Hinton, GE and Salakhutdinov, RR (2006), “Reducing the dimensionality of data with neural networks”, Science 313
(5786), 504-507.

Hinton, GE; Osindero, S and Teh, Y-W (2006), “A fast learning algorithm for deep belief nets”, Neural Computation.

Ho, T. K. (1995), “Random decision forests”, in M. Kavavaugh and P. Storms (eds), Proc. Third International Conference
on Document Analysis and Recognition, Vol. 1, IEEE Computer Society Press, New York, 278–282.

Hodges, J. S., and Sargent, D. J. (2001), “Counting degrees of freedom in hierarchical and other richly parameterized
models”, Biometrika 88, 367–379.

Hoerl, A. E. and Kennard, R. (1970), “Ridge regression: biased estimation for nonorthogonal problems”, Technometrics 12:
55–67.

Hoff, P. D. (2007), “Extending the rank likelihood for semiparametric copula estimation”, Annals of Applied Statistics 1,
265–283.

Hornik, K; Stinchcombe, M; White, H (1989), “Multilayer feedforward networks are universal approximators”, Neural Networks, Vol 2 (5), 359-366.

Hubert, L and Arabie, P (1985), “Comparing partitions”, Journal of Classification.

Hyvärinen, A. and Oja, E. (2000), “Independent component analysis: algorithms and applications”, Neural Networks 13:
411–430.

Imai, K., and van Dyk, D. A. (2005), “A Bayesian analysis of the multinomial probit model using marginal data augmentation”, Journal of Econometrics 124, 311–334.

Ionides, E. L. (2008), “Truncated importance sampling”, Journal of Computational and Graphical Statistics, 17(2), 295-311.

Ishwaran, H., and Zarepour, M. (2002), “Dirichlet prior sieves in finite normal mixtures”, Statistica Sinica 12, 941–963.

Jaakkola, T. S., and Jordan, M. I. (2000), “Bayesian parameter estimation via variational methods”, Statistics and Computing 10, 25–37.

Jackman, S. (2001), “Multidimensional analysis of roll call data via Bayesian simulation: identification, estimation,
inference and model checking”, Political Analysis 9, 227–241.

James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013), “An Introduction to Statistical Learning”, Springer Texts in
Statistics.

Jasra, A., Holmes, C. C., and Stephens, D. A. (2005), “Markov chain Monte Carlo methods and the label switching problem
in Bayesian mixture modeling”, Statistical Science 20, 50–67.

Jiang, W. (2004), “Process consistency for Adaboost”, Annals of Statistics 32(1): 13–29.

Jin, Y & Branke, J (2005), “Evolutionary optimization in uncertain environments-a survey”, Evolutionary Computation,
IEEE Transactions on 9(3):303–317.

Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. (1999), “Introduction to variational methods for graphical models”,
Machine Learning 37, 183–233.

Kadanoff, L. P (1966), “Scaling laws for Ising models near Tc”, Physics 2, 263.

Kalman, R.E. (1960), “A New Approach to Linear Filtering and Prediction Problems”, J. Basic Eng 82(1), 35-45.

Karpathy, A. (2015), “The unreasonable effectiveness of recurrent neural networks”, Andrej Karpathy blog.

Kaufman, L. and Rousseeuw, P. (1990), “Finding Groups in Data: An Introduction to Cluster Analysis”, Wiley, New York.

Kearns, M. and Vazirani, U. (1994), “An Introduction to Computational Learning Theory”, MIT Press, Cambridge, MA.

Kitchin, Rob (2015), “Big Data and Official Statistics: Opportunities, Challenges and Risks”, Statistical Journal of IAOS
31, 471-481.

Kittler, J., Hatef, M., Duin, R. and Matas, J. (1998), “On combining classifiers”, IEEE Transaction on Pattern Analysis and
Machine Intelligence 20(3): 226–239.

Kleinberg, E. M. (1996), “An overtraining-resistant stochastic modeling method for pattern recognition”, Annals of
Statistics 24: 2319–2349.

Kleinberg, E.M. (1990), “Stochastic discrimination”, Annals of Mathematical Artificial Intelligence 1: 207–239.

Kohavi, R. (1995), “A study of cross-validation and bootstrap for accuracy estimation and model selection”, International
Joint Conference on Artificial Intelligence (IJCAI), Morgan Kaufmann, 1137–1143.

Kohonen, T. (1989), “Self-Organization and Associative Memory (3rd edition)”, Springer, Berlin.

Kohonen, T. (1990), “The self-organizing map”, Proceedings of the IEEE 78: 1464–1479.

Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Paatero, A. and Saarela, A. (2000), “Self-organization of a massive document collection”, IEEE Transactions on Neural Networks 11(3): 574–585. Special Issue on Neural Networks for Data Mining and Knowledge Discovery.

Koller, D. and Friedman, N. (2007), “Structured Probabilistic Models”, Stanford Bookstore Custom Publishing.
(Unpublished Draft).

Krishnamachari, R. T (2015), "MIMO Systems under Limited Feedback: A Signal Processing Perspective", LAP Publishing.

Krishnamachari, R. T and Varanasi, M. K. (2014), "MIMO Systems with quantized covariance feedback", IEEE
Transactions on Signal Processing, 62(2), Pg 485-495.

Krishnamachari, R. T and Varanasi, M. K. (2013a), "Interference alignment under limited feedback for MIMO interference
channels", IEEE Transactions on Signal Processing, 61(15), Pg. 3908-3917.

Krishnamachari, R. T and Varanasi, M. K. (2013b), "On the geometry and quantization of manifolds of positive semi-
definite matrices", IEEE Transactions on Signal Processing, 61 (18), Pg 4587-4599.

Krishnamachari, R. T and Varanasi, M. K. (2009), "Distortion-rate tradeoff of a source uniformly distributed over the
composite P_F(N) and the composite Stiefel manifolds", IEEE International Symposium on Information Theory.

Krishnamachari, R. T and Varanasi, M. K. (2008a), "Distortion-rate tradeoff of a source uniformly distributed over positive
semi-definite matrices", Asilomar Conference on Signals, Systems and Computers.

Krishnamachari, R. T and Varanasi, M. K. (2008b), "Volume of geodesic balls in the complex Stiefel manifold", Allerton
Conference on Communications, Control and Computing.

Krishnamachari, R. T and Varanasi, M. K. (2008c), "Volume of geodesic balls in the real Stiefel manifold", Conference on
Information Science and Systems.

Kuhn, M. (2008), “Building Predictive Models in R Using the caret Package”, Journal of Statistical Software, Vol 28(5), 1-
26. Available at link.

Kurenkov, A (2015), “A ‘brief’ history of neural nets and Deep Learning”, Parts 1-4 available at link.

Laney, D (2001), “3D data management: Controlling data volume, velocity and variety”, META Group (then Gartner), File 949.

Lauritzen, S. (1996), “Graphical Models”, Oxford University Press.

Leblanc, M. and Tibshirani, R. (1996), “Combining estimates in regression and classification”, Journal of the American
Statistical Association 91: 1641–1650.

LeCun, Y., Bengio, Y., & Hinton, G. (2015), “Deep Learning”, Nature, 521(7553), 436-444.

LeCun, Y; Boser, B; Denker, J; Henderson, D; Howard, R; Hubbard, W; Jackel, L (1989), “Backpropagation Applied to
Handwritten Zip Code Recognition”, Neural Computation , Vol 1(4), 541-551.

LeCun, Y; Jackel, L.D; Bottou, L; Brunot, A; Cortes, C; Denker, J.S.; Drucker, H; Guyon, I; Muller, U.A; Sackinger,E;
Simard, P and Vapnik, V (1995), “Comparison of learning algorithms for handwritten digit recognition”, in Fogelman, F.
and Gallinari, P. (Eds), International Conference on Artificial Neural Networks, 53-60, EC2 & Cie, Paris.

Leimkuhler, B., and Reich, S. (2004), “Simulating Hamiltonian Dynamics”, Cambridge University Press.

Leonard, T., and Hsu, J. S. (1992), “Bayesian inference for a covariance matrix”, Annals of Statistics 20, 1669–1696.

Levesque, HJ; Davis, E and Morgenstern, L (2011), “The Winograd schema challenge”, The Thirteenth International Conference on Principles of Knowledge Representation and Reasoning.

Little, R. J. A., and Rubin, D. B. (2002), “Statistical Analysis with Missing Data”, second edition. New York: Wiley.

Liu, C. (2003), “Alternating subspace-spanning resampling to accelerate Markov chain Monte Carlo simulation”, Journal of
the American Statistical Association 98, 110–117.

Liu, C. (2004), “Robit regression: A simple robust alternative to logistic and probit regression”, In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, ed. A. Gelman and X. L. Meng, 227–238. New York: Wiley.

Liu, C., and Rubin, D. B. (1995), “ML estimation of the t distribution using EM and its extensions, ECM and ECME”, Statistica Sinica 5, 19–39.

Liu, C., Rubin, D. B., and Wu, Y. N. (1998), “Parameter expansion to accelerate EM: The PX-EM algorithm”, Biometrika
85, 755–770.

Liu, J. (2001), “Monte Carlo Strategies in Scientific Computing”, New York: Springer.

Liu, J., and Wu, Y. N. (1999), “Parameter expansion for data augmentation”, Journal of the American Statistical
Association 94, 1264–1274.

Loader, C. (1999), “Local Regression and Likelihood”, Springer, New York.

Lugosi, G. and Vayatis, N. (2004), “On the Bayes-risk consistency of regularized boosting methods”, Annals of Statistics 32(1): 30–55.

MacQueen, J. (1967), “Some methods for classification and analysis of multivariate observations”, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds. L. M. LeCam and J. Neyman, University of California Press, 281–297.

Madigan, D. and Raftery, A. (1994), “Model selection and accounting for model uncertainty using Occam’s window”,
Journal of the American Statistical Association 89: 1535–46.

Manning, C. D (2015), “Computational linguistics and Deep Learning”, Computational Linguistics, Vol 41(4), 701-707,
MIT Press.

Mardia, K., Kent, J. and Bibby, J. (1979), “Multivariate Analysis”, Academic Press.

Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. (2012), “Approximate Bayesian computational methods”, Statistics
and Computing 22, 1167–1180.

Martin, A. D., and Quinn, K. M. (2002), “Dynamic ideal point estimation via Markov chain Monte Carlo for the U.S.
Supreme Court, 1953–1999”, Political Analysis 10, 134–153.

Mason, L., Baxter, J., Bartlett, P. and Frean, M. (2000), “Boosting algorithms as gradient descent”, Advances in Neural Information Processing Systems 12: 512–518.

McCulloch, W.S and Pitts, W. H (1943), “A logical calculus of the ideas immanent in nervous activity”, Bulletin of Mathematical Biophysics, Vol 5, 115-133.

Mease, D. and Wyner, A. (2008), “Evidence contrary to the statistical view of boosting (with discussion)”, Journal of
Machine Learning Research 9: 131–156.

Mehta, P and Schwab, D. J. (2014), “An exact mapping between the variational renormalization group and Deep Learning”, Manuscript posted on arXiv at link.

Meir, R. and Rätsch, G. (2003), “An introduction to boosting and leveraging”, in S. Mendelson and A. Smola (eds), Lecture Notes in Computer Science, Advanced Lectures in Machine Learning, Springer, New York.

Meng, X. L. (1994a), “On the rate of convergence of the ECM algorithm”, Annals of Statistics 22, 326–339.

Meng, X. L., and Pedlow, S. (1992), “EM: A bibliographic review with missing articles”, In Proceedings of the American
Statistical Association, Section on Statistical Computing, 24–27.

Meng, X. L., and Rubin, D. B. (1991), “Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm”, Journal of the American Statistical Association 86, 899–909.

Meng, X. L., and Rubin, D. B. (1993), “Maximum likelihood estimation via the ECM algorithm: A general framework”, Biometrika 80, 267–278.

Meng, X. L., and van Dyk, D. A. (1997), “The EM algorithm—an old folk-song sung to a fast new tune (with discussion)”,
Journal of the Royal Statistical Society B 59, 511–567.

Minka, T. (2001), “Expectation propagation for approximate Bayesian inference”, In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, ed. J. Breese and D. Koller, 362–369.

Minsky, M and Papert, S. A (1969), “Perceptrons”, MIT Press (latest edition, published in 1987).

Murray, J. S., Dunson, D. B., Carin, L., and Lucas, J. E. (2013), “Bayesian Gaussian copula factor models for mixed data”,
Journal of the American Statistical Association.

Neal, R. (1996), “Bayesian Learning for Neural Networks”, Springer, New York.

Neal, R. and Hinton, G. (1998), “A view of the EM algorithm that justifies incremental, sparse, and other variants”, in Learning in Graphical Models, M. Jordan (ed.), Kluwer Academic Publishers, Dordrecht, 355–368.

Neal, R. M. (1994), “An improved acceptance procedure for the hybrid Monte Carlo algorithm”, Journal of Computational Physics 111, 194–203.

Neal, R. M. (2011), “MCMC using Hamiltonian dynamics”, In Handbook of Markov Chain Monte Carlo, ed. S. Brooks, A. Gelman, G. L. Jones, and X. L. Meng, 113–162. New York: Chapman & Hall.

Neelon, B., and Dunson, D. B. (2004), “Bayesian isotonic regression and trend analysis”, Biometrics 60, 398–406.

Nelder, J. A. (1994), “The statistics of linear models: back to basics”, Statistics and Computing 4, 221–234.

O’Connell, Jared and Højsgaard, Søren (2011), “Hidden Semi Markov Models for Multiple Observation Sequences: The
mhsmm Package for R”, Journal of Statistical Software, 39(4). Available at link.

O’Hagan, A., and Forster, J. (2004), “Bayesian Inference”, second edition. London: Arnold.

Ohlssen, D. I., Sharples, L. D., and Spiegelhalter, D. J. (2007), “Flexible random-effects models using Bayesian semi-
parametric models: Applications to institutional comparisons”, Statistics in Medicine 26, 2088–2112.

Ormerod, J. T., and Wand, M. P. (2012), “Gaussian variational approximate inference for generalized linear mixed models”,
Journal of Computational and Graphical Statistics 21, 2–17.

Osborne, M., Presnell, B. and Turlach, B. (2000a), “A new approach to variable selection in least squares problems”, IMA
Journal of Numerical Analysis 20: 389–404.

Osborne, M., Presnell, B. and Turlach, B. (2000b), “On the lasso and its dual”, Journal of Computational and Graphical Statistics 9: 319–337.

Pace, R. K. and Barry, R. (1997), “Sparse spatial autoregressions”, Statistics and Probability Letters 33: 291–297.

Papaspiliopoulos, O., and Roberts, G. O. (2008), “Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models”, Biometrika 95, 169–186.

Park, M. Y. and Hastie, T. (2007), “l1-regularization path algorithm for generalized linear models”, Journal of the Royal
Statistical Society Series B 69: 659–677.

Park, T., and Casella, G. (2008), “The Bayesian lasso”, Journal of the American Statistical Association 103, 681–686.

Pati, D., and Dunson, D. B. (2011), “Bayesian closed surface fitting through tensor products”, Technical report, Department of Statistics, Duke University.

Pearl, J. (2000), “Causality: Models, Reasoning and Inference”, Cambridge University Press.

Peltola, T., Marttinen, P., and Vehtari, A. (2012), “Finite Adaptation and Multistep Moves in the Metropolis-Hastings Algorithm for Variable Selection in Genome-Wide Association Analysis”, PLoS One 7(11): e49445.

Petris, Giovanni, Petrone, Sonia and Campagnoli, Patrizia (2009), “Dynamic Linear Models with R”, Springer.

Propp, J. G., and Wilson, D. B. (1996), “Exact sampling with coupled Markov chains and applications to statistical
mechanics”, Random Structures Algorithms 9, 223–252.
