
Problem set 6 (solution)

Numerical Methods for EOR

Week 6 - Simulation and bootstrap


This week we focus on simulation and bootstrap.

Reading material
Read chapter 20 of Jones et al. The Wikipedia page on bootstrapping contains some additional explanation
of the bootstrap method.

Problem 1
Consider the table with losses depicted below. Individual losses can in this case be modelled well with a
Gamma distribution. For loss distributions, the probability of large losses is important.
1. Estimate the parameters of the Gamma distribution, and calculate the 95th percentile of this distribution.
2. Calculate a 95% confidence interval for the 95th percentile using the delta method.
3. Bootstrap using the asymptotic distribution of α̂ and β̂ to get B = 1000 different estimates for the
95th percentile of the loss distribution. Use these estimates to construct a 95% confidence interval.
4. (Difficult) Can you think of a procedure to implement a nonparametric bootstrap that can be used to
estimate the 95th percentile of the loss distribution?

solution
First, we get the parameters and data.
table2.10 <- cbind(lower = c(0, 2.5, 7.5, 12.5, 17.5, 22.5, 32.5,
                             47.5, 67.5, 87.5, 125, 225, 300),
                   upper = c(2.5, 7.5, 12.5, 17.5, 22.5, 32.5, 47.5,
                             67.5, 87.5, 125, 225, 300, Inf),
                   freq  = c(41, 48, 24, 18, 15, 14, 16, 12, 6, 11, 5, 4, 3))
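Because the individual losses are only recorded by interval, the log-likelihood adds up, over the cells, the
frequency times the log of the Gamma probability mass in that cell (the last cell being open-ended):

$$\ell(\alpha, \beta) = \sum_j n_j \log\big\{ F(u_j; \alpha, \beta) - F(l_j; \alpha, \beta) \big\},$$

where $F$ is the Gamma cdf, $(l_j, u_j]$ are the cell bounds and $n_j$ the cell frequencies. This is exactly what
the function loglik below computes.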

# log-likelihood for the grouped (interval-censored) losses
loglik <- function(p, d){
  lower <- d[, 1]
  upper <- d[, 2]
  n     <- d[, 3]
  # probability mass of the Gamma(p[1], p[2]) in each cell; the last cell is open-ended
  ll <- n * log(ifelse(upper < Inf, pgamma(upper, p[1], p[2]), 1) -
                  pgamma(lower, p[1], p[2]))
  sum(ll)
}

p0 <- c(alpha = 0.47, beta = 0.014)
m <- optim(p0, loglik, hessian = TRUE, control = list(fnscale = -1),
           d = table2.10)

## Warning in pgamma(upper, p[1], p[2]): NaNs produced
## Warning in pgamma(lower, p[1], p[2]): NaNs produced
## Warning in pgamma(upper, p[1], p[2]): NaNs produced
## Warning in pgamma(lower, p[1], p[2]): NaNs produced
(The warnings appear because optim occasionally tries negative parameter values, for which pgamma returns
NaN; they are harmless here.)
theta <- qgamma(0.95,m$par[1],m$par[2])
theta

## [1] 132.0048
Now we proceed to calculate a 95% confidence interval using the delta method. We need to differentiate the
quantile function of the claims, $F_X^{-1}(0.95; \alpha, \beta)$, with respect to $\alpha$ and $\beta$. There is no
closed-form solution, so we use numerical differentiation.
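Recall that the delta method approximates the variance of $\hat\theta = g(\hat\alpha, \hat\beta)$, with
$g(\alpha, \beta) = F_X^{-1}(0.95; \alpha, \beta)$, by

$$\widehat{\operatorname{var}}(\hat\theta) \approx \nabla g' \; \widehat{\operatorname{var}}\big((\hat\alpha, \hat\beta)'\big) \; \nabla g,$$

where the gradient $\nabla g$ is evaluated at the estimates and the covariance matrix of the estimates is the
inverse of the negative Hessian returned by optim.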
p <- m$par

# three step lengths, to check that the numerical derivative is stable
eps <- c(1e-5, 1e-6, 1e-7)

d.alpha <- 0 * eps
d.beta  <- 0 * eps
for (i in 1:3){
  d.alpha[i] <- (qgamma(0.95, p[1] + eps[i], p[2]) -
                   qgamma(0.95, p[1] - eps[i], p[2])) / (2 * eps[i])
  d.beta[i]  <- (qgamma(0.95, p[1], p[2] + eps[i]) -
                   qgamma(0.95, p[1], p[2] - eps[i])) / (2 * eps[i])
}
d.alpha

## [1] 182.1136 182.1136 182.1136


d.beta

## [1] -9462.755 -9462.750 -9462.750


var.p <- solve(-m$hessian)  # covariance matrix: inverse of the negative Hessian
var.q95 <- t(c(d.alpha[2], d.beta[2])) %*% var.p %*% c(d.alpha[2], d.beta[2])

qgamma(0.95, p[1], p[2]) + qnorm(c(0.025, 0.975)) * sqrt(c(var.q95))

## [1] 104.4446 159.5650


Note how we check whether the step length is appropriate when calculating the numerical derivatives: the
three values of eps give virtually identical results, so the derivative estimates are stable.
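As a cross-check on the hand-rolled central differences, one could use the numDeriv package (an assumption:
it is not part of the course material and must be installed separately); its grad() picks step sizes
automatically via Richardson extrapolation.
# optional cross-check of the numerical gradient (assumes numDeriv is installed)
library(numDeriv)
grad(function(par) qgamma(0.95, par[1], par[2]), p)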
An alternative approach is parametric bootstrap using the distribution of the estimated coefficients.
library(mvtnorm)
B <- 10000
q.b <- rep(NA, B)
for (b in 1:B){
  # draw the parameters from their asymptotic normal distribution
  p.b <- rmvnorm(1, p, var.p)
  # skip draws with a negative parameter (qgamma would return NaN)
  if (!any(p.b < 0)) q.b[b] <- qgamma(0.95, p.b[1], p.b[2])
}
# check for NA's due to negative draws
mean(is.na(q.b))

## [1] 0
q.pb <- q.b

# 95% confidence interval


quantile(q.b,c(0.025,0.975))

## 2.5% 97.5%
## 109.8377 168.4200
To do the nonparametric bootstrap, we first ‘expand’ the data to reflect each individual observation. Then
we sample with replacement from the line numbers, calculate the frequency table, estimate the model and
compute its 95th percentile.
line.numbers <- rep(1:13, table2.10[, "freq"])

q.b <- rep(NA, B)

table2.10b <- table2.10
for (b in 1:B){
  # resample the 217 individual observations with replacement
  line.numbers.b <- sample(line.numbers, size = 217, replace = TRUE)
  table2.10b[, "freq"] <- table(factor(line.numbers.b, levels = 1:13))
  m.b <- optim(m$par, loglik, hessian = TRUE, control = list(fnscale = -1),
               d = table2.10b)
  q.b[b] <- qgamma(0.95, m.b$par[1], m.b$par[2])
}
q.npb <- q.b
quantile(q.b, c(0.025, 0.975))

## 2.5% 97.5%
## 97.40332 172.30320
We graph the three approaches.
library(ggplot2)
q.normal <- rnorm(B, mean = qgamma(0.95, p[1], p[2]), sd = sqrt(var.q95))
q.bootstrap <- data.frame(quantile = c(q.pb, q.npb, q.normal),
                          type = rep(c("parametric bootstrap",
                                       "nonparametric bootstrap",
                                       "delta method"), c(B, B, B)))

ggplot(q.bootstrap) + geom_density(aes(x = quantile, color = type))

[Figure: density of the estimated 95th percentile under the delta method, the nonparametric bootstrap and
the parametric bootstrap.]

Problem 2
On Brightspace you will find dataset penalties.xlsx. It contains penalty kick results from the English
Premier League. The variable score indicates whether or not the penalty kick was converted to a goal, home
indicates whether or not the penalty was awarded to the home team, and goal.difference measures the goal
difference before the penalty kick was taken.
1. To assess whether home and goal.difference are significant predictors of conversion, estimate a logit
model by maximum likelihood. Program the log-likelihood function yourself and use optim.
2. Read the help of the glm-function in R. Re-estimate your model using that function. Are the results
similar?
3. Estimate the effect of playing at home on the probability of conversion and provide a 95% confidence
interval for that effect. Note that this is a non-linear model, so the effect of playing at home is not
constant. Therefore, in your calculations constrain the goal difference to some range, for example
(−2, 2). This choice results in five estimates of the effect of playing at home and five corresponding
confidence intervals. To compute the confidence intervals, use the delta method.
4. Estimate a 95% confidence interval for the effect of playing at home with zero goal difference using a
paired bootstrap (that is, by sampling from the observed distribution of $(y_i, x_i')'$).

solution
First, we read the data and estimate the logit model.
# first read in the data
library(readxl)

d<-read_excel("penalties.xlsx")
head(d)

## # A tibble: 6 x 4
## id score home goal.difference
## <dbl> <chr> <dbl> <dbl>
## 1 328 score 0 -1
## 2 291 score 0 0
## 3 77 no score 0 -1
## 4 41 score 1 0
## 5 75 no score 0 -1
## 6 151 score 0 1
# we can remove the first column and check the remaining data
d$id <- NULL

lapply(d,summary)

## $score
## Length Class Mode
## 402 character character
##
## $home
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.6219 1.0000 1.0000
##
## $goal.difference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.00000 -1.00000 0.00000 0.00995 1.00000 7.00000
table(d$score,d$home)

##
## 0 1
## no score 47 48
## score 105 202
About one in four penalties is missed, and it seems the away team misses relatively more often. Now we
program the log-likelihood.
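In equation form, with $p_i = 1/(1+\exp(-x_i'\beta))$ the probability of scoring, the log-likelihood to be
maximized is

$$\ell(\beta) = \sum_{i=1}^{n} \big[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\big];$$

the function below returns $-\ell(\beta)$, since optim minimizes by default.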
d$score.num <- ifelse(d$score == "score", 1, 0)

# negative log-likelihood for the logit model (optim minimizes by default)
logit.ll <- function(beta, y, x){
  z    <- beta[1] + x %*% beta[2:3]
  p    <- 1/(1 + exp(-z))
  prob <- ifelse(y == 1, p, 1 - p)
  -sum(log(prob))
}

The log-likelihood can be optimized with optim; a simple starting value is a vector of zeros. Because the
positions in the vector of starting values are named, we know which coefficient measures what when
interpreting the output.
m <- optim(c(intercept = 0, home = 0, goal.difference = 0), logit.ll, hessian = TRUE,
           y = d$score.num, x = as.matrix(d[, c("home", "goal.difference")]))

m

## $par
## intercept home goal.difference
## 0.78290717 0.67015848 -0.05698959
##
## $value
## [1] 216.1122
##
## $counts
## function gradient
## 148 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## intercept home goal.difference
## intercept 71.1886337 38.74594 0.4798622
## home 38.7459449 38.74594 12.1776569
## goal.difference 0.4798622 12.17766 115.3177053
vcov <- solve(m$hessian)  # Hessian of the negative log-likelihood, so no sign change needed
results.m <- cbind(beta = m$par, sd = sqrt(diag(vcov)))
t.stat <- results.m[, 1] / results.m[, 2]
p.value <- 2 * pnorm(-abs(t.stat))
results.m <- cbind(results.m, t.stat, p.value)
results.m

## beta sd t.stat p.value


## intercept 0.78290717 0.17898475 4.374156 1.219033e-05
## home 0.67015848 0.24673544 2.716101 6.605567e-03
## goal.difference -0.05698959 0.09655079 -0.590255 5.550197e-01
The effect of goal.difference is not significant (but it will be used below). Home teams are more
likely to convert a penalty kick into a goal. Note that optim reports that the optimization has converged.

subquestion b
Read the help of the glm-function in R. Re-estimate your model using that function. Are the results similar?
Generalized Linear Models can be used to estimate a logit model by specifying a binomial model with logit
link.
?glm
glm.m <- glm(score.num~home+goal.difference,data=d,family=binomial)
summary(glm.m)

##
## Call:
## glm(formula = score.num ~ home + goal.difference, family = binomial,
## data = d)
##

## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9244 0.6157 0.6483 0.8075 0.9095
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.78269 0.17898 4.373 1.22e-05 ***
## home 0.67039 0.24673 2.717 0.00659 **
## goal.difference -0.05693 0.09655 -0.590 0.55544
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 439.63 on 401 degrees of freedom
## Residual deviance: 432.22 on 399 degrees of freedom
## AIC: 438.22
##
## Number of Fisher Scoring iterations: 4
results.m

## beta sd t.stat p.value


## intercept 0.78290717 0.17898475 4.374156 1.219033e-05
## home 0.67015848 0.24673544 2.716101 6.605567e-03
## goal.difference -0.05698959 0.09655079 -0.590255 5.550197e-01
The results are practically identical.
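As a quick check, the estimated standard errors can also be put side by side; vcov() applied to the glm
object returns its estimated covariance matrix.
# compare the optim-based standard errors with those from glm
cbind(optim.sd = sqrt(diag(solve(m$hessian))),
      glm.sd = sqrt(diag(vcov(glm.m))))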

subquestion c
Estimate the effect of playing at home on the probability of conversion and provide a 95% confidence
interval for that effect. Note that this is a non-linear model, so the effect of playing at home is not
constant. Therefore, in your calculations constrain the goal difference to some range, for example (−2, 2).
This choice results in five estimates of the effect of playing at home and five corresponding confidence
intervals. To compute the confidence intervals, use the delta method.
As playing at home enters the model as a dummy variable, we cannot really calculate a marginal effect.
Instead, we have to calculate the change in the expected value of the dependent variable when the dummy
variable changes from 0 to 1. To calculate the probability, we also need the level of goal.difference. The
effect of home is thus

$$\Pr(\text{score} \mid \text{home}, gd) - \Pr(\text{score} \mid \text{away}, gd) = \frac{1}{1+\exp(-\beta_0-\beta_1-\beta_2\, gd)} - \frac{1}{1+\exp(-\beta_0-\beta_2\, gd)}.$$

This expression depends on the actual value of goal.difference, as is common in nonlinear models. For this
reason, we calculate the effect at different levels of the variable goal.difference.
gd <- seq(-2,2,1) #pick the range for goal difference
b0 <- m$par["intercept"]
b1 <- m$par["home"]
b2 <- m$par["goal.difference"]
home.score.prob <- 1/(1+exp(-b0-b1-b2*gd))
away.score.prob <- 1/(1+exp(-b0-b2*gd))
effect <- cbind(goal.difference=gd,effect=home.score.prob-away.score.prob)
effect

## goal.difference effect
## [1,] -2 0.1170527
## [2,] -1 0.1206259
## [3,] 0 0.1241635
## [4,] 1 0.1276517
## [5,] 2 0.1310766
The effect is larger when the goal difference is more positive: since $\hat\beta_2 < 0$, a higher goal difference
pushes the scoring probability closer to 0.5, where the logistic curve is steepest. The biggest advantage of
playing at home thus occurs when it matters least.
Now we calculate a 95% confidence interval for the effect. To do so, we need the derivative of the effect with
respect to the parameters. In a simple logit model with one explanatory variable we have

$$\frac{\partial}{\partial \beta_0}\, \frac{1}{1+\exp(-\beta_0 - \beta_1 X)} = \frac{\exp(-\beta_0 - \beta_1 X)}{\big(1+\exp(-\beta_0 - \beta_1 X)\big)^2} = \Pr(Y=0) \times \Pr(Y=1)$$

and

$$\frac{\partial}{\partial \beta_1}\, \frac{1}{1+\exp(-\beta_0 - \beta_1 X)} = \frac{X \exp(-\beta_0 - \beta_1 X)}{\big(1+\exp(-\beta_0 - \beta_1 X)\big)^2} = X \times \Pr(Y=0) \times \Pr(Y=1).$$

Note that the function of interest is the change in the probability of scoring, so its gradient is the difference
of two such terms, evaluated at the home and the away covariate vector: $x_h\, p_h(1-p_h) - x_a\, p_a(1-p_a)$.
The standard deviations according to the delta method are now computed as follows.
sd.difference <- rep(NA, length(gd))
for (i in 1:length(gd)){
  x.home <- c(1, 1, gd[i])
  x.away <- c(1, 0, gd[i])
  derivative.of.difference <- x.home * home.score.prob[i] * (1 - home.score.prob[i]) -
    x.away * away.score.prob[i] * (1 - away.score.prob[i])
  var.difference <- t(derivative.of.difference) %*% vcov %*% derivative.of.difference
  sd.difference[i] <- sqrt(var.difference)
}
effect <- cbind(effect, lower.delta = effect[, 2] + qnorm(0.025) * sd.difference,
                upper.delta = effect[, 2] + qnorm(0.975) * sd.difference)
effect

## goal.difference effect lower.delta upper.delta


## [1,] -2 0.1170527 0.03316055 0.2009448
## [2,] -1 0.1206259 0.03368129 0.2075706
## [3,] 0 0.1241635 0.03253809 0.2157888
## [4,] 1 0.1276517 0.03008978 0.2252136
## [5,] 2 0.1310766 0.02673029 0.2354229
Zero is not in the confidence intervals, so the effect is statistically significant at all levels of goal.difference.

subquestion d
Estimate a 95% confidence interval for the effect of playing at home with zero goal difference using a paired
bootstrap (that is, by sampling from the observed distribution of $(y_i, x_i')'$).
First, we define the statistic of interest, and then we estimate that statistic. We need to estimate the effect
at zero goal difference, which is

$$\Pr(\text{score} \mid \text{home}, gd = 0) - \Pr(\text{score} \mid \text{away}, gd = 0) = \frac{1}{1+\exp(-\beta_0-\beta_1)} - \frac{1}{1+\exp(-\beta_0)}.$$

effect.at.0 <- function(b){
  1/(1 + exp(-b[1] - b[2])) - 1/(1 + exp(-b[1]))
}
effect.at.0(m$par)

## intercept
## 0.1241635
The last number is of course equal to the effect at zero goal difference we calculated earlier. Now we set up
sampling from the 402 observations, re-estimate the model with glm on each bootstrap sample, and use the
output to construct different intervals.
B <- 9999
bootstrap.effects <- rep(NA, B)
for (b in 1:B){
  # paired bootstrap: resample rows (y_i, x_i) with replacement
  sample.b <- sample(1:402, replace = TRUE)
  glm.m <- glm(score.num ~ home + goal.difference, data = d[sample.b, ], family = binomial)
  bootstrap.effects[b] <- effect.at.0(coef(glm.m))
}

We try to learn about $\hat\psi - \psi^0$ by looking at $\hat\psi^* - \hat\psi$. The first interval we consider is the
‘basic bootstrap’ interval. A $(1-\alpha)$ confidence interval for the true parameter $\psi^0$ satisfies

$$\Pr(a_{\alpha/2} \le \hat\psi - \psi^0 \le a_{1-\alpha/2}) = 1 - \alpha.$$

This interval can be rewritten as follows:

$$\Pr(-\hat\psi + a_{\alpha/2} \le -\psi^0 \le -\hat\psi + a_{1-\alpha/2}) = \Pr(\hat\psi - a_{1-\alpha/2} \le \psi^0 \le \hat\psi - a_{\alpha/2}) = 1 - \alpha.$$

The quantiles $a_{\alpha/2}$ and $a_{1-\alpha/2}$ of the distribution of $\hat\psi - \psi^0$ are now estimated by the
corresponding quantiles of the bootstrap distribution of $\hat\psi^* - \hat\psi$, so $a_{\alpha/2}$ is set to the
$\alpha/2$ quantile of $\hat\psi^* - \hat\psi$, and since $\hat\psi$ is fixed, this is $\hat\psi^*_{\alpha/2} - \hat\psi$.
So $a_{\alpha/2} = \hat\psi^*_{\alpha/2} - \hat\psi$ and $a_{1-\alpha/2} = \hat\psi^*_{1-\alpha/2} - \hat\psi$. The
basic bootstrap interval is thus

$$[\hat\psi - a_{1-\alpha/2},\ \hat\psi - a_{\alpha/2}] = [2\hat\psi - \hat\psi^*_{1-\alpha/2},\ 2\hat\psi - \hat\psi^*_{\alpha/2}].$$
Note that the high quantile of the bootstrap distribution is found in the lower bound of the interval.
hat.psi <- effect.at.0(m$par)
basic.bootstrap <- c(2*hat.psi-quantile(bootstrap.effects,prob=0.975),
2*hat.psi-quantile(bootstrap.effects,prob=0.025))
basic.bootstrap

## intercept intercept
## 0.03191625 0.21367621
The next method is the percentile method. Here, we take the quantiles of $\hat\psi$ directly from the bootstrap
distribution, so a $(1-\alpha)$ confidence interval is

$$[\hat\psi^*_{\alpha/2},\ \hat\psi^*_{1-\alpha/2}].$$

In the example, it is
perc.bootstrap <- c(quantile(bootstrap.effects,prob=0.025),
quantile(bootstrap.effects,prob=0.975))
perc.bootstrap

## 2.5% 97.5%
## 0.0346507 0.2164107

A third approximation to the confidence interval is obtained by assuming that the variation of $\hat\psi - \psi^0$
follows a normal distribution. This is known as the normal bootstrap interval, and the $(1-\alpha)$ confidence
interval is

$$\Big[\hat\psi + \sqrt{\operatorname{var}(\hat\psi^*)}\, \Phi^{-1}(\alpha/2),\ \hat\psi + \sqrt{\operatorname{var}(\hat\psi^*)}\, \Phi^{-1}(1-\alpha/2)\Big].$$

In this example, this boils down to


normal.interval <- c(hat.psi + sd(bootstrap.effects)*qnorm(0.025),
hat.psi + sd(bootstrap.effects)*qnorm(0.975))
normal.interval

## intercept intercept
## 0.03258446 0.21574245
A final interval is the BCa interval (bias-corrected and accelerated). It adjusts the percentiles of $\hat\psi^*$ we
use to find a confidence interval. As discussed in week 6, the quantiles satisfy $\Pr(\xi_{\beta_1} \le \psi \le \xi_{\beta_2}) \approx 1 - \alpha$, with

$$\beta_1 = \Phi\left(b + \frac{b + z_{\alpha/2}}{1 - a(b + z_{\alpha/2})}\right), \qquad \beta_2 = \Phi\left(b + \frac{b + z_{1-\alpha/2}}{1 - a(b + z_{1-\alpha/2})}\right).$$

How can we find $a$ and $b$? The constant $b$ corrects for bias: $\hat b = \Phi^{-1}\big(\hat F^*(\hat\psi)\big)$, so if
$\hat\psi$ equals the median of $\hat F^*$ we find $b = 0$. An estimate for the acceleration $a$ is

$$a = \frac{1}{6} \frac{\sum_i \big(\hat\psi_{(\cdot)} - \hat\psi_{(-i)}\big)^3}{\Big(\sum_i \big(\hat\psi_{(\cdot)} - \hat\psi_{(-i)}\big)^2\Big)^{3/2}},$$

where the subscript $(-i)$ denotes the leave-one-out estimator and $\hat\psi_{(\cdot)} = \frac{1}{n}\sum_i \hat\psi_{(-i)}$.
b <- qnorm(mean(bootstrap.effects<hat.psi))
b

## [1] 0.008899538
psi.loo <- rep(NA, nrow(d))
for (i in 1:nrow(d)){
  # leave-one-out estimates of the effect at zero goal difference
  psi.loo[i] <- effect.at.0(coef(glm(score.num ~ home + goal.difference,
                                     data = d[-i, ], family = binomial)))
}
psi.loo.c <- mean(psi.loo) - psi.loo
a <- (1/6) * (sum(psi.loo.c^3)) / (sum(psi.loo.c^2))^(3/2)
a
a

## [1] 0.002264807
beta.1 <- pnorm(b+(b+qnorm(0.025))/(1-a*(b+qnorm(0.025))))
beta.2 <- pnorm(b+(b+qnorm(0.975))/(1-a*(b+qnorm(0.975))))
beta.1

## [1] 0.0265823
beta.2

## [1] 0.9765156
bca.interval <- quantile(bootstrap.effects,prob=c(beta.1,beta.2))

We collect all intervals:


rbind(basic.bootstrap,perc.bootstrap,normal.interval,bca.interval)

## intercept intercept
## basic.bootstrap 0.03191625 0.2136762

## perc.bootstrap 0.03465070 0.2164107
## normal.interval 0.03258446 0.2157425
## bca.interval 0.03590484 0.2174756
In all these cases, the bootstrap intervals are very similar to the delta-method interval.
These intervals can also be estimated using the boot function:
library(boot)
effect.at.0.boot <- function(dataset,bootstrap.sample){
glm.m <- glm(score.num~home+goal.difference,data=dataset[bootstrap.sample,],
family=binomial)
b <- coef(glm.m)
1/(1+exp(-b[1]-b[2])) - 1/(1+exp(-b[1]))
}
bb <- boot(d,effect.at.0.boot,R=9999)
boot.ci(bb)

## Warning in boot.ci(bb): bootstrap variances needed for studentized intervals


## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 9999 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bb)
##
## Intervals :
## Level Normal Basic
## 95% ( 0.0330, 0.2156 ) ( 0.0335, 0.2156 )
##
## Level Percentile BCa
## 95% ( 0.0328, 0.2149 ) ( 0.0330, 0.2152 )
## Calculations and Intervals on Original Scale
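The warning above appears because boot.ci can only compute studentized intervals when the statistic also
returns an estimated variance as its second element. A minimal sketch of the required pattern, assuming we
obtain that variance with the delta method inside each replication (the function name and gradient derivation
are our own, not part of the original solution):
# statistic returning (estimate, variance); the variance comes from the
# delta method applied to the effect at zero goal difference
effect.at.0.stud <- function(dataset, bootstrap.sample){
  glm.b <- glm(score.num ~ home + goal.difference,
               data = dataset[bootstrap.sample, ], family = binomial)
  b <- coef(glm.b)
  p.home <- 1/(1 + exp(-b[1] - b[2]))
  p.away <- 1/(1 + exp(-b[1]))
  # gradient of the effect w.r.t. (intercept, home, goal.difference) at gd = 0
  grad <- c(p.home * (1 - p.home) - p.away * (1 - p.away),
            p.home * (1 - p.home),
            0)
  var.psi <- as.numeric(t(grad) %*% vcov(glm.b) %*% grad)
  c(p.home - p.away, var.psi)
}
bb.stud <- boot(d, effect.at.0.stud, R = 999)
boot.ci(bb.stud, type = "stud")
The studentized interval should again be close to the other four.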
