Bayesian analysis and Markov chain Monte
Carlo simulation
Medova, E.A.
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta},$$
where $\theta$ is the vector of unknown parameters governing our model, $p(\theta)$ is the prior density function of $\theta$ and $x$ is a sample drawn from the true underlying distribution with sampling density $p(x \mid \theta)$ that we model. Thus the posterior distribution for $\theta$ takes into account both our prior distribution for $\theta$ and the observed data $x$.
A conjugate prior family is a class of densities $\{p_i(\theta)\}$ which has the feature that, given the sampling density $p(x \mid \theta)$, the posterior density $p_i(\theta \mid x)$ also belongs to the class. The name arises because we say that the prior $p_i(\theta)$ is conjugate to the sampling density considered as a likelihood function $p(x \mid \theta)$ for $\theta$ given $x$. The concept of conjugate prior, as well as the term, was introduced by Raiffa and Schlaifer [14].
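As a concrete illustration (ours, not from the original text), the beta family is conjugate to the Bernoulli likelihood: a Beta(a, b) prior on the success probability combined with k successes in n trials yields a Beta(a + k, b + n - k) posterior. A minimal sketch in Python:

    import numpy as np

    # Beta(a, b) prior on the success probability theta of a Bernoulli model.
    # The Bernoulli/binomial likelihood is conjugate to this prior: the
    # posterior is Beta(a + k, b + n - k) for k successes in n trials.
    a, b = 2.0, 2.0                    # prior hyper-parameters (assumed values)
    x = np.array([1, 0, 1, 1, 0, 1])   # observed Bernoulli sample
    k, n = x.sum(), len(x)

    a_post, b_post = a + k, b + n - k  # conjugate update
    print(f"posterior: Beta({a_post}, {b_post})")
    print(f"posterior mean: {a_post / (a_post + b_post):.3f}")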
After obtaining a posterior distribution for the parameters we can compute various quantities of
interest such as integrals of the form
$$\int\!\!\int f(y)\, g(y; \theta)\, p(\theta \mid x)\, dy\, d\theta, \qquad (1)$$
where f is some arbitrary function and g is the probability density function describing a related
parametric model. In general, because we are not assuming independence between the individual parameters, this integral is difficult to compute, especially when there are many parameters.
This is the situation in which Markov chain Monte Carlo (MCMC) simulation is most
commonly used.
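To make (1) concrete, here is a sketch (with assumed, illustrative choices of $f$, $g$ and the posterior) of how the integral is estimated by averaging over posterior draws $\theta^1, \ldots, \theta^N$ such as those MCMC produces:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical posterior draws of a scalar parameter theta (e.g. from MCMC).
    theta_draws = rng.normal(1.0, 0.2, size=5000)

    # g(y; theta) taken here to be a normal density with mean theta (an
    # illustrative choice); f is an arbitrary function of interest.
    def f(y):
        return y ** 2

    y_draws = rng.normal(theta_draws, 1.0)  # one y ~ g(.; theta) per draw

    estimate = f(y_draws).mean()            # Monte Carlo estimate of (1)
    print(f"estimate of integral (1): {estimate:.3f}")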
The distinguishing feature of MCMC is that the random samples of the integrand in (1) are
correlated, whereas in conventional Monte Carlo methods such samples are statistically
independent. The goal of MCMC methods is to construct an ergodic Markov chain that converges quickly to its stationary distribution, which is the required posterior density, and from whose sample path functionals such as (1) can be estimated.
One can broadly categorize the use of MCMC methods as Bayesian or non-Bayesian. Non-Bayesian MCMC methods are used to compute quantities that depend on a distribution from a statistical model that is non-parametric. In a Bayesian application we consider a parametric model for the problem of interest. We assume some prior distribution on the parameters and try to compute quantities of interest that involve the posterior distributions. This approach remains suitable even when the data is sparse, for example in extreme value applications [10].
There are many different types of MCMC algorithms. The two most basic and widely used are
the Metropolis-Hastings algorithm and the Gibbs sampler, which we now review.
Metropolis-Hastings algorithm
The Metropolis-Hastings algorithm [11, 8, 4] has been used extensively in physics but was little known to others until Müller [12] and Tierney [19] expounded its value to statisticians. The algorithm is extremely powerful and versatile: it has been included in a list of top 10 algorithms [5] and even claimed to be "most likely the most powerful algorithm of all time" [1].
The Metropolis-Hastings algorithm can draw samples from any target probability density $\pi$ for the uncertain parameters $\theta$, requiring only that this density can be evaluated at each $\theta$. The algorithm makes use of a proposal density $q(\theta^t, \theta')$, which depends on the current state $\theta^t$ of the chain, to generate each new proposed parameter sample $\theta'$. The proposal $\theta'$ is accepted as the next state of the chain ($\theta^{t+1} := \theta'$) with acceptance probability $\alpha(\theta^t, \theta')$ and rejected otherwise. It is the specification of this acceptance probability that allows us to generate a Markov chain with the desired target stationary density $\pi$. The Metropolis-Hastings algorithm can thus be seen as a generalized form of acceptance/rejection sampling, with values drawn from approximate distributions which are corrected in order that they behave asymptotically as random observations from the target distribution.
The algorithm in step-by-step form is as follows:
a) Given the current position $\theta^t$ of our Markov chain, generate a new value $\theta'$ from the proposal density $q$ (see below).
b) Compute the acceptance probability

$$\alpha(\theta^t, \theta') := \min\left(1,\; \frac{\pi(\theta')\, q(\theta', \theta^t)}{\pi(\theta^t)\, q(\theta^t, \theta')}\right). \qquad (2)$$

c) With probability $\alpha(\theta^t, \theta')$ accept the proposal and set $\theta^{t+1} := \theta'$; otherwise reject it and set $\theta^{t+1} := \theta^t$.
Under suitable regularity conditions, as $t \to \infty$,

$$P(\theta^t \in B) \to \pi(B)$$

for all suitably (Borel) measurable sets $B$.
We need to specify a starting point $\theta^0$, which may be chosen at random (and often is). We should also specify a burn-in period to allow the chain to reach equilibrium. By this we mean that we discard the first $n$ values of the chain in order to reduce the possibility of bias caused by the choice of the starting value $\theta^0$.
The proposal distribution should be a distribution that is easy to sample from. It is also desirable to choose its density $q$ to be close or similar to the target density $\pi$, as this will increase the acceptance rate and the efficiency of the algorithm.
We only need to know the target density function $\pi$ up to proportionality; that is, we do not need to know its normalizing constant, since this cancels in the calculation (2) of the acceptance function $\alpha$.
The choice of the burn-in period remains somewhat of an art and is currently an active area of research. One can simply use the "eyeballing" technique, which merely involves inspecting visual outputs of the chain to see whether or not it has reached equilibrium.
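The following minimal Python sketch (ours; the standard normal target, uniform proposal spread and burn-in length are illustrative assumptions) implements steps a)-c) for a random walk proposal and discards a burn-in period:

    import numpy as np

    rng = np.random.default_rng(1)

    def target_unnorm(theta):
        """Target density known only up to proportionality (here: standard normal)."""
        return np.exp(-0.5 * theta ** 2)

    def metropolis(n_iter=10_000, spread=1.0, theta0=0.0, burn_in=1_000):
        chain = np.empty(n_iter)
        theta, accepted = theta0, 0
        for t in range(n_iter):
            proposal = theta + rng.uniform(-spread, spread)  # symmetric proposal
            # Symmetric q cancels in (2): accept with prob min(1, pi(p)/pi(theta)).
            if rng.random() < target_unnorm(proposal) / target_unnorm(theta):
                theta, accepted = proposal, accepted + 1
            chain[t] = theta
        print(f"acceptance rate: {accepted / n_iter:.2f}")
        return chain[burn_in:]                               # discard burn-in

    samples = metropolis()
    print(f"posterior mean estimate: {samples.mean():.3f}")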
When the proposal density is symmetric, i.e. $q(\theta^t, \theta') = q(\theta', \theta^t)$ (the original Metropolis algorithm), the computation of the acceptance function is significantly faster. In this case, from (2) a proposal is accepted with probability $\min\{1, \pi(\theta')/\pi(\theta^t)\}$, i.e. according to its likelihood $\pi(\theta')$ relative to that of $\pi(\theta^t)$ (as originally suggested by Ulam for acceptance/rejection sampling).
In the random walk chain the proposal is $\theta' := \theta^t + z$, where the increment $z$ is drawn from a density $f$ with corresponding distribution $F$, so that $q(\theta^t, \theta') = f(\theta' - \theta^t)$. Note that since this proposal density $q$ is symmetric, the acceptance function is of the simple Metropolis form described above. Common choices for $q$ are the multivariate normal, multivariate $t$ or the uniform distribution on the unit sphere.
If $q(\theta^t, \theta') := q(\theta')$, then the candidate observation is drawn independently of the current state of the chain (the independence sampler). Note however that the state $\theta^{t+1}$ of the chain at $t+1$ does depend on the previous state $\theta^t$, because the acceptance function $\alpha(\theta^t, \theta')$ depends on $\theta^t$.
In the random walk chain we only need to specify the spread of $q$, i.e. a maximum for $|\theta' - \theta^t|$ at a single step. In the independence sampler we need to specify both the spread and the location of $q$. Choosing the spread of $q$ is also something of an art. If the spread is large, many of the candidates will be far from the current value. They will therefore have a low probability of being accepted, and the chain may remain stuck at a particular value for many iterations. This can be especially problematic for multi-modal distributions; some of the modes may then not be explored properly by the chain. On the other hand, if the spread is small, the chain will take longer to traverse the support of the density and low probability regions will be under-sampled. The research reported in [13] suggests an optimal acceptance rate of around 0.25 for the random walk chain. In the case of the independence sampler it is important [2] to ensure that the tails of $q$ dominate those of $\pi$, otherwise the chain may get stuck in the tails of the target density. This requirement is similar to that in importance sampling.
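For the independence sampler the proposal terms in (2) no longer cancel. A sketch of a single accept/reject step, using scipy densities and a heavy-tailed t proposal whose tails dominate a normal target (both distributional choices are illustrative assumptions):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)

    target = stats.norm(0.0, 1.0)  # target density pi (normal)
    proposal = stats.t(df=3)       # heavy-tailed proposal q dominates pi's tails

    def independence_step(theta):
        cand = proposal.rvs(random_state=rng)
        # alpha = min(1, pi(cand) q(theta) / (pi(theta) q(cand))), from (2).
        ratio = (target.pdf(cand) * proposal.pdf(theta)) / \
                (target.pdf(theta) * proposal.pdf(cand))
        return cand if rng.random() < ratio else theta

    theta, chain = 0.0, []
    for _ in range(5000):
        theta = independence_step(theta)
        chain.append(theta)
    print(f"sample mean: {np.mean(chain):.3f}")  # approx 0 for this target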
Multiple-block updates
When the number of dimensions is large it can be difficult to choose the proposal density $q$ so that the algorithm converges sufficiently rapidly. In such cases it is helpful to break up the space into smaller blocks and to construct a Markov chain for each of these smaller blocks [8]. Suppose that we split $\theta$ into two blocks $(\theta_1, \theta_2)$ and let $q_1(\theta_1^t, \theta_1' \mid \theta_2^t)$ and $q_2(\theta_2^t, \theta_2' \mid \theta_1^t)$ be the proposal densities for each block. We then break each iteration of the Metropolis-Hastings algorithm into two steps, and at each step we update the corresponding block. To update block 1 we use the acceptance function given by
$$\alpha_1\big((\theta_1^t, \theta_2^t), \theta_1'\big) := \min\left(1,\; \frac{\pi(\theta_1' \mid \theta_2^t)\, q_1(\theta_1', \theta_1^t \mid \theta_2^t)}{\pi(\theta_1^t \mid \theta_2^t)\, q_1(\theta_1^t, \theta_1' \mid \theta_2^t)}\right), \qquad (3)$$

and similarly, to update block 2 given the updated first block $\theta_1^{t+1}$, we use

$$\alpha_2\big((\theta_1^{t+1}, \theta_2^t), \theta_2'\big) := \min\left(1,\; \frac{\pi(\theta_2' \mid \theta_1^{t+1})\, q_2(\theta_2', \theta_2^t \mid \theta_1^{t+1})}{\pi(\theta_2^t \mid \theta_1^{t+1})\, q_2(\theta_2^t, \theta_2' \mid \theta_1^{t+1})}\right). \qquad (4)$$
If the blocks each consist of just a single variable, the resulting algorithm is commonly called the single-update Metropolis-Hastings algorithm. Suppose that in the single-update algorithm each of the full conditionals $\pi(\theta_i \mid \theta_{\sim i})$ of the target distribution can be directly sampled from. Then we would naturally choose $q_i(\theta_i \mid \theta_{\sim i}) := \pi(\theta_i \mid \theta_{\sim i})$, since all candidates will then be accepted with probability 1. This special case is the well-known Gibbs sampler [2].
Gibbs sampler
Gibbs sampling is applicable in general when the joint parameter distribution is not known explicitly but the conditional distribution of each parameter given the others is known. Let $P(\theta) = P(\theta_1, \ldots, \theta_k)$ denote the joint parameter distribution and let $p(\theta_i \mid \theta_{\sim i})$ denote the conditional density of the $i$-th component $\theta_i$ given the other $k-1$ components, where we assume that we know how to sample directly from each $p(\theta_i \mid \theta_{\sim i})$. The algorithm begins by picking an arbitrary starting value $\theta^0 = (\theta_1^0, \ldots, \theta_k^0)$. It then samples from the conditional densities $p(\theta_i \mid \theta_{\sim i})$ for $i = 1, \ldots, k$ successively as follows:
$$\theta_1^1 \sim p(\theta_1 \mid \theta_2^0, \theta_3^0, \ldots, \theta_k^0), \quad \theta_2^1 \sim p(\theta_2 \mid \theta_1^1, \theta_3^0, \ldots, \theta_k^0), \quad \ldots, \quad \theta_k^1 \sim p(\theta_k \mid \theta_1^1, \theta_2^1, \ldots, \theta_{k-1}^1).$$
This completes a transition from $\theta^0$ to $\theta^1$ and, continued, generates a sample path $\theta^0, \theta^1, \ldots, \theta^t, \ldots$ of a Markov chain whose stationary distribution is $P$.
In many cases we can use the Gibbs sampler, which is significantly faster to compute than the more general Metropolis-Hastings algorithm. In order to use Gibbs, however, we must know how to sample directly from the conditional posterior distribution of each parameter, i.e. $p(\theta_i \mid \theta_{\sim i}, x)$, where $x$ represents the data to time $t$.
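A minimal Gibbs sketch for a case where the full conditionals are known exactly: for a bivariate standard normal target with correlation $\rho$, the conditionals are $\theta_1 \mid \theta_2 \sim N(\rho\theta_2, 1-\rho^2)$ and symmetrically for $\theta_2$ (an illustrative example, not the operational risk model below):

    import numpy as np

    rng = np.random.default_rng(2)
    rho = 0.8              # correlation of the bivariate normal target (assumed)

    def gibbs(n_iter=10_000, burn_in=1_000):
        chain = np.empty((n_iter, 2))
        th1, th2 = 0.0, 0.0  # arbitrary starting value theta^0
        for t in range(n_iter):
            # Sample each component from its full conditional given the other:
            th1 = rng.normal(rho * th2, np.sqrt(1 - rho ** 2))
            th2 = rng.normal(rho * th1, np.sqrt(1 - rho ** 2))
            chain[t] = th1, th2
        return chain[burn_in:]

    samples = gibbs()
    print("sample correlation:", np.corrcoef(samples.T)[0, 1].round(3))  # ~0.8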
Capital allocation for extreme operational risk
It is assumed that losses beyond the $u_{VaR}$ level belong to the operational risk category. In most cases, due to overlaps between risk types, a detailed analysis of operational loss data is required to support the assumption that the $u_{VaR}$ level approximately equals the unexpected loss threshold. This approach to capital allocation for operational risk takes into account large but rare operational losses. It is naturally based on extreme value theory (EVT) [6, 7] and focusses on tail events, modelling the worst-case losses as characterized by loss maxima over regular observation periods.
According to regulatory requirements [20], operational risk capital calculation requires two distributions: a severity distribution of loss values and a frequency distribution of loss occurrences. In the approach described here a unified asymptotic model known as the peaks over threshold (POT) model [15, 16, 18] is applied. It is based on an asymptotic theory of extremes and a point process representation of exceedances over a threshold. The following is assumed.
Given an i.i.d. sequence of random losses $X_1, \ldots, X_n$ drawn from some distribution, we are interested in the distribution of the excess $Y := X - u$ over the threshold $u$. The distribution of excesses is given by the conditional distribution function in terms of the tail of the underlying distribution function $F$ as

$$F_u(y) := P(X - u \le y \mid X > u) = \frac{F(u + y) - F(u)}{1 - F(u)} \qquad \text{for } 0 \le y < \infty. \qquad (5)$$
For a wide class of underlying distributions $F$ the excess distribution $F_u$ converges, as the threshold $u$ is raised, to the generalized Pareto distribution (GPD)

$$G_{\xi,\beta}(y) = \begin{cases} 1 - \left(1 + \dfrac{\xi y}{\beta}\right)^{-1/\xi} & \xi \ne 0 \\[1ex] 1 - \exp\left(-\dfrac{y}{\beta}\right) & \xi = 0, \end{cases} \qquad (6)$$

where $y \in [0, \infty)$ for $\xi \ge 0$ and $y \in [0, -\beta/\xi]$ for $\xi < 0$.
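Equation (6) transcribes directly into code; the following sketch (function and parameter names are ours) evaluates the GPD distribution function:

    import numpy as np

    def gpd_cdf(y, xi, beta):
        """Generalized Pareto distribution function G_{xi,beta} of equation (6).

        For xi >= 0 the support is [0, infinity); for xi < 0 the caller should
        restrict y to [0, -beta/xi].
        """
        y = np.asarray(y, dtype=float)
        if xi == 0.0:
            return 1.0 - np.exp(-y / beta)
        return 1.0 - (1.0 + xi * y / beta) ** (-1.0 / xi)

    # Example: heavy-tailed case xi > 0.
    print(gpd_cdf([1.0, 10.0, 100.0], xi=0.5, beta=3.0))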
The identification of an appropriate threshold u is again somewhat of an art and requires a data
analysis based on a knowledge of EVT [3, 6, 7].
The capital provision for operational risk over the unexpected loss threshold $u$ is given in [10] as

$$u + E(X - u \mid X > u) = u + \frac{\beta_u + \xi u}{1 - \xi}, \qquad (7)$$

where $E(X - u \mid X > u) = \dfrac{\beta_u + \xi u}{1 - \xi}$ is the expectation of excesses over the threshold $u$ (which is defined for $\xi < 1$ and must be replaced by the median for $\xi > 1$), $\beta_u := \beta + \xi(u - \mu)$ and the exceedances form a Poisson point process with intensity

$$\lambda_u := \left(1 + \frac{\xi(u - \mu)}{\beta}\right)^{-1/\xi}. \qquad (8)$$
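Taking the reconstruction of (7) and (8) above at face value, the capital provision and exceedance intensity can be computed as follows (a sketch; the numerical parameter values are illustrative assumptions, not the fitted values reported below):

    def poisson_intensity(u, mu, beta, xi):
        """Exceedance intensity lambda_u of equation (8)."""
        return (1.0 + xi * (u - mu) / beta) ** (-1.0 / xi)

    def capital_provision(u, mu, beta, xi):
        """u + E(X - u | X > u) as in equation (7); requires xi < 1."""
        assert xi < 1.0, "expected excess undefined for xi >= 1 (use the median)"
        beta_u = beta + xi * (u - mu)
        return u + (beta_u + xi * u) / (1.0 - xi)

    # Illustrative (assumed) parameter values:
    u, mu, beta, xi = 30.0, 20.0, 10.0, 0.5
    print(f"lambda_u = {poisson_intensity(u, mu, beta, xi):.3f}")
    print(f"capital provision = {capital_provision(u, mu, beta, xi):.1f}")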
For observations $X_1, \ldots, X_n$ a conjugate prior $f_{\theta \mid \tau}$ with hyper-parameters $\tau$ leads to a posterior $f_{\theta \mid \tau^+}$ of the same form, with the new hyper-parameters $\tau^+$ determined by $\tau$ and the observations $X_1, \ldots, X_n$. In the hierarchical Bayesian model the hyper-hyper-parameters $\nu$ are chosen to generate a vague prior, due to the lack of a prior distribution for the hyper-parameters before excess loss data is seen. Hence we can decompose the posterior parameter density $f_{\theta \mid X, \nu}$, with the observations $X$ and the initial hyper-hyper-parameters $\nu$, as

$$\begin{aligned}
f_{\theta \mid X, \nu} &\propto f_{X \mid \theta}(X \mid \theta)\, f_{\theta \mid \tau}(\theta \mid \tau)\, f_\tau(\tau \mid \nu) \\
&\propto f_{X \mid \theta}(X \mid \theta)\, f_{\theta \mid \tau, \nu}(\theta \mid \tau, \nu) \\
&\propto f_{X \mid \theta}(X \mid \theta)\, f_\theta(\theta \mid \tau^+).
\end{aligned}$$

Sampling from this posterior is thus performed in two stages.
[Figure: Schematic of the Bayesian hierarchical model. Loss data $x_{ij}$ for business units $i = 1, \ldots, n$ and loss types $j = 1, \ldots, J$; for each type the GPD parameters are the mean $\mu_j$, the log scale $\log \sigma_j$ and the shape $\xi_j$, with normal hyper-parameters of means $m$ and variances $s^2$.]
Illustrative example
The data is assumed to represent the operational losses of a bank attributable to three different business units. The data starts on 03.01.1980 and ends on 31.12.1990. The time span is measured in years, hence the parameters will also be measured on a yearly basis. The data has been generated from the Danish insurance claims data [3] by two independent random multiplicative factors to obtain the three sets of loss data summarized in Table 2.
A typical analysis of such data includes time series plots, log histogram plots, sample mean excess plots, QQ plots for extreme value analysis against the GPD, Hill estimate plots of the shape parameter and plots of the empirical distribution functions. All these tests have been performed for the three data sets, leading to the conclusion that the data are heavy tailed and that the POT model is valid.
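As an example of one such diagnostic (a sketch with synthetic data, not the actual bank losses), the sample mean excess plot computes $e(u)$, the average excess over each candidate threshold $u$; an upward-sloping plot indicates a heavy tail:

    import numpy as np

    def mean_excess(losses, thresholds):
        """Sample mean excess e(u) = mean(X - u | X > u) for each threshold u."""
        losses = np.asarray(losses, dtype=float)
        return np.array([(losses[losses > u] - u).mean() for u in thresholds])

    rng = np.random.default_rng(3)
    losses = rng.pareto(1.3, size=2167) + 1.0   # synthetic heavy-tailed losses
    thresholds = np.quantile(losses, np.linspace(0.5, 0.98, 25))
    e_u = mean_excess(losses, thresholds)
    print(e_u[:5].round(2))  # roughly increasing in u for a heavy-tailed sample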
Table 2: Summary statistics of the three loss data sets.

             X1 (Danish)       X2         X3
Min              1.000        0.800      1.200
1st Qu.          1.321        1.057      1.585
Mean             3.385        2.708      4.062
Median           1.778        1.423      2.134
3rd Qu.          2.967        2.374      3.560
Max            263.250      210.600    315.900
N                 2167         2167       2167
StdDev           8.507        6.806     10.209
[Figure: Time series plot of the loss data from 03.01.1980 to 03.01.1990.]
[Figure: Log-histogram of the loss data (percent of total versus log losses).]
The tables below summarize the posterior mean estimates of the parameters for each unit $j$: the GPD parameters $\mu_j$, $\log \sigma_j$ and $\xi_j$, the implied scale $\beta_u$ and Poisson intensity $\lambda_u$ at the threshold, and the expected excess, based on the MCMC posterior distribution mean parameter values.
For j = 1 (Unit 1):

Code          Mean(μ₁)   Mean(log σ₁)   Mean(ξ₁)    β_u     λ_u    Expected excess
1000 loops      37.34        3.14          0.77     18.46    1.41       180.70
2000 loops      36.89        3.13          0.80     18.35    1.39       211.75
For j = 2 (Unit 2):

Code          Mean(μ₂)   Mean(log σ₂)   Mean(ξ₂)    β_u     λ_u    Expected excess
1000 loops      36.41        3.16          0.77     19.04    1.34       186.22
2000 loops      35.76        3.13          0.80     18.50    1.30       218.40
For j = 3 (Unit 3):

Code          Mean(μ₃)   Mean(log σ₃)   Mean(ξ₃)    β_u     λ_u    Expected excess
1000 loops      39.55        3.05          0.79     14.21    1.71       180.52
2000 loops      39.23        3.03          0.82     13.83    1.70       213.50
The plots in Figure 3 below, for the results of 2000 simulation loops, show that convergence has been reached for the marginal posterior distributions of all parameters for Unit 1 and that the estimates of these parameters are distributed approximately normally. (Those of $\sigma$ are thus approximately lognormal.) Similar results hold for the other two units.
The capital provision for operational losses is calculated using expression (7). The probability of such losses is given by the choice of threshold $u$ for extreme operational losses. This threshold must be obtained from an analysis of the historical operational loss data and should equal or exceed the threshold level $u_{VaR}$ of unexpected losses due to market and credit risk. The probability $\alpha$ of crossing the combined market and credit risk threshold $u_{VaR}$ is chosen according to the usual value at risk (VaR) risk management procedures. The level of losses $u$ due to operational risks is exceeded with probability $\delta$, so that $u \ge u_{VaR}$. The probability $\delta$ of exceeding $u$ depends on the shape of the tail of the loss distribution but is in general very much smaller than $\alpha$.
Assuming that the three types of losses are the bank's business unit losses from operational risk over a period of 11 years, the bank should hedge its operational risk for these units by putting aside 944.60 units of capital (1.39 × 211.75 + 1.30 × 218.40 + 1.70 × 213.50) for any one-year period.
Although in this illustrative example unexpected losses above the combined VaR level (30 units
of capital) occur with probability 2.5% per annum, unexpected operational risk losses will exceed
this capital sum with probability less than 0.5%. In practice lower tail probabilities might be
chosen, but similar or higher probability ratios would obtain.
Note that in this example the loss data for each business unit was generated independently and the total capital figure takes the resulting diversification effect into account. On actual loss data the dependencies in the realized data are taken into account by the method, and the diversification effect of the result can be analyzed by estimating each unit separately and adding the individual capital figures (which conservatively treats losses across units as perfectly correlated) [10].
Although the results presented here are based on very large (2167) original sample sizes, the
simulation experiments on actual banking data reported in [10] verify the high accuracy of
MCMC Bayesian hierarchical methods for exceedance sample sizes as low as 10 and 25, as in
this example.
[Figure 3: MCMC convergence diagnostics for Unit 1 (2000 loops): trace plots against the loop counter and posterior density plots for $\mu$, $\log \sigma$ and $\xi$.]
Conclusion
In this chapter we have introduced Markov chain Monte Carlo (MCMC) concepts and techniques
and shown how to apply them to the estimation of a Bayesian hierarchical model of
interdependent extreme operational risks. This model employs the peaks over threshold (POT)
model of extreme value theory (EVT) to generate both frequency and severity statistics for the
extreme operational losses of interdependent business units which are of interest at the board
level of a financial institution. These are obtained respectively in terms of Poisson exceedances of an unexpected loss level for other risks and the generalized Pareto distribution (GPD).
The model leads to annual business unit capital allocations for unexpected extreme risks which
take account of the statistical interdependencies of individual business unit losses.
The concepts discussed in this chapter are illustrated by an artificially created example involving
three business units but actual banking studies are described in [10] and in forthcoming work
relating to internally collected operational loss data.
References
[1] Beichl I., Sullivan F. (2000). The Metropolis algorithm, Computing in Science & Engineering 2(1): 65-69.
[2] Casella G., George E.I. (1992). Explaining the Gibbs sampler, The American Statistician 46(3): 167-174.
[3] Castillo E. (1988). Extreme Value Theory in Engineering. Academic Press, Orlando.
[4] Chib S., Greenberg E. (1995). Understanding the Metropolis-Hastings algorithm, The American Statistician 49(4): 327-335.
… in Operational Risk: Firmwide issues for financial institutions. Risk Books, pp. 115-127.
[10] Medova E.A., Kyriacou M.N. (2002). Extremes in operational risk measurement, in Risk
Management: Value At Risk And Beyond, M.A.H. Dempster, ed., Cambridge University
Press, pp. 247-274.
[11] Metropolis N., Rosenbluth A.W., Rosenbluth M.N., Teller A.H., Teller E. (1953). Equation of state calculations by fast computing machines, Journal of Chemical Physics 21(6): 1087-1092.
[12] Müller P. (1993). A generic approach to posterior integration and Gibbs sampling, Technical Report, Purdue University.
[13] Roberts G., Gelman A., Gilks W. (1994). Weak convergence and optimal scaling of random
walk Metropolis algorithms, Technical Report, University of Cambridge.
[14] Raiffa H., Schlaifer R. (1961). Applied Statistical Decision Theory, Harvard University
Press.
[15] Leadbetter M.R., Lindgren G., Rootzén H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer, New York.
… Value At Risk And Beyond, M.A.H. Dempster, ed., Cambridge University Press.
[19] Tierney L. (1994). Markov chains for exploring posterior distributions, Annals of Statistics 22(4): 1701-1762.
[20] The New Basel Capital Accord (2001). Bank for International Settlements Press Release.