CS1B Workbook Answers
ASSIGNMENT – Basic Calculations
Question 1. Evaluate the following expressions in R:
(i) 12 x 6 (ii) 6 - 8 (iii) 1 - (3+5)/2 (iv) e^(-5) (v) 10! (vi) 12C2
(vii) ln(-10). What does this output mean? [Hint: NaN means not a number]
Question 2. Store the values 50, 100, 150, 200 and 250 in the objects A, B, C, D and E, respectively.
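A minimal sketch of one possible answer:
A <- 50; B <- 100; C <- 150; D <- 200; E <- 250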
ASSIGNMENT – Vectors
Question 1. Create a vector, x, containing the 100 numbers 21, 22, ..., 120 using:-
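A sketch of possible approaches, mirroring the style of the solutions later in this workbook:
#any of:
x <- 21:120
x <- seq(21,120)
x <- seq(21,120,by=1)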
Question 4. Create a vector a of (1, 2, 3, 4, 5, 6), a vector b of (0, 1) and a vector c of (5, 1, 3, 2).
What will be the result of:
Question 1. A roulette wheel has the numbers 0 to 36. The wheel is spun one way and a
ball is sent round the other way. Our event of interest is the number on which the ball
lands.
(i) Give the sample space of the above experiment and store it in object “S”.
(ii) Calculate a) P(S < 14) b) P(S ≥ 8) c) P(20 < S ≤ 32)
(iii) Set the seed as 100 and use sample function to generate a gambler’s outcomes
if he plays roulette 700 times.
(iv) Obtain a frequency table of the above sample.
(v) Draw a histogram of the sample.
(vi) Now, calculate empirical probabilities of part (ii) using your sample.
(vii) Find empirical values of
a) Mean b) Median c) Coefficient Of Skewness.
Question 2. In a remote island where tsunamis are regular, the survival probability from a tsunami is
78%. The population of the island is 15.
Question 3. You throw darts at a board until you hit the center area. Probability of hitting the center
area is 0.17.
(i) Find the probability that it takes eight throws until you hit the center from
scratch and also using inbuilt function.
(ii) Calculate the probability that it takes:
(a) At least 5 throws (b) One throw (c) More than 10
throws until you hit the centre.
(iii) Calculate the smallest no. of throws, x, before the first success such that:
(a) P(X ≤ x) ≥ 0.9 (b) P(X > x) ≤ 0.4
(iv) Draw a step graph of CDF.
(v) Simulate 500 values of the experiment; and then compare empirical and
theoretical mean and variance. (Using seed =50)
(vi) Find the empirical probabilities for part (ii) and comment.
Question 4. Bob is a high school basketball player. His probability of making a free throw is 0.70.
(i) During the season, what is the probability that Bob makes his third free throw
on his fifth shot?
(ii) Draw 4 bar charts of the probability function for negative binomial distributions
with p = 0.7 and k = 1 ,2, 3 and 4, with titles showing the value of k. (all charts
should be displayed simultaneously).
(iii) Find the probability that:
(a) at most 3 shots didn't become free throws before the 5th one did.
(b) more than 3 shots didn't become free throws before the 4th one did.
(iv) Calculate the smallest number of shots that didn't become free throws, x,
before the fourth one did, such that: (a) P(X ≤ x) ≥ 0.6 (b) P(X > x) ≤ 0.6.
(v) Simulate 1200 values of the number of shots which didn’t become a free throw
before the 4th one did. (Use seed=70)
(vi) Plot a histogram of the data obtained in (v) and superimpose a line of
theoretical expected frequencies on it. Comment on your findings.
(vii) Find theoretical and empirical values of: - (a) Standard Deviation (b) IQR
Question 5. Suppose we randomly select 5 cards without replacement from an ordinary deck of
playing cards.
(i) What is the probability of getting exactly 2 red cards (i.e., hearts or diamonds)?
a) From Scratch b) Using inbuilt function in R c) Using a binomial
approximation
(ii) Draw 4 bar charts showing the number of cards selected being red from
samples of size 5, 10, 15 and 20.
(iii) Find the probability that fewer than 2 of the selected cards are red.
(iv) Draw a step graph of the CDF .
(v) Simulate, 1000 times, the number of red cards among the cards selected. Use
seed = 45.
(vi) Find the empirical probability for part (iii) and comment.
(vii) Find empirical and theoretical values of:- a) Mean b) Lower Quartile
(viii) Draw line graph of the binomial approximation to the probabilities of the
number of red cards obtained from the selected cards. Superimpose the actual
probabilities on this graph, and comment.
Question 6. Complaints about hair found in McDonald's hamburgers average 4 per week.
(i) Find the probability that there are 7 complaints per week;
a) From scratch b) Using inbuilt dpois function
(ii) Draw 4 bar charts showing the number of complaints in a week if it occurs at
rates of 2, 4, 8 and 20 per week; and comment on the shape and distribution for
larger values of lambda. (all charts should be displayed simultaneously).
(iii) Find the probability that there are at most 5 complaints in a week.
(v) Simulate the number of complaints received in 1000 separate weeks & plot a
histogram of the data. Use seed = 50
b) Draw a graph of "A" showing how the mean of the simulations changes over the
1000 values compared with the true value.
Question 7. Suppose that on a certain highway, cars pass at an average rate of five cars per minute.
Assume that the duration of time between successive cars follows the exponential
distribution.
(iii) Add to any one of your graphs from part (ii) the following: -
a) a green dotted line showing the PDF of an exponential distribution with λ = 8
b) a red dashed line showing the PDF of an exponential distribution with λ =2
Also add legend to the graph obtained.
(viii) Obtain the empirical probabilities for part (iv) and compare your answers.
(ix) Obtain the empirical and theoretical:-
a) Mean b) Standard Deviation c) 95th Percentile
Question 8. Time spent on a computer (X) is gamma distributed with mean 20 min and variance
80 min².
(i) Obtain the pdf of this gamma distribution at x = 45 mins.
(ii) Obtain a graph of the pdf for this gamma distribution for x∈(10,50).
(iii) Find the probability that the time spent on the computer is:-
a) At most 35 mins b) between 40 to 60 mins.
(iv) Calculate the IQR.
(v) Simulate 1600 values of this distribution using seed = 40.
(vi) Draw a labelled diagram of the histogram of the densities of the data you
simulated and superimpose the graph of the actual pdf on it.
(viii) Obtain the empirical probabilities for part (iii) using the data you simulated.
Solutions
Answer 1
(i) Store outcomes in object S
#Any one of the following
S <- 0:36
S <- seq(0,36)
S <- seq(0,36,1)
S <- seq(0,36,by=1)
#could use S <- c(0,1,2,3,...,36) but would have to enter all 37 numbers...
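Part (ii) is not shown above; a minimal sketch, treating the 37 numbers as equally likely:
#(ii)
length(S[S<14])/length(S) #a) P(S<14) = 14/37
length(S[S>=8])/length(S) #b) P(S>=8) = 29/37
length(S[S>20 & S<=32])/length(S) #c) P(20<S<=32) = 12/37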
(iii) Simulation
set.seed(100)
S1 <- sample(S, 700, replace=TRUE)
(iv) Table
table(S1)
(v) Histogram
hist(S1,breaks=(-0.5:40.5))
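Parts (vi) and (vii) are not shown; a sketch of the empirical versions using the simulated sample S1 (the skewness formula below, third central moment over sd cubed, is one common convention):
#(vi) empirical probabilities
length(S1[S1<14])/length(S1)
length(S1[S1>=8])/length(S1)
length(S1[S1>20 & S1<=32])/length(S1)
#(vii) empirical statistics
mean(S1)
median(S1)
mean((S1-mean(S1))^3)/sd(S1)^3 #coefficient of skewness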
Answer 2
(i) Probability 9 people survive
#Any one of the following
n <- 15
p <- 0.78
x <- 9
(a)(factorial(n)/(factorial(n-x)*factorial(x)))*p^x*(1-p)^(n-x)
#Ans= 0.06064452
(b)dbinom(x,n,p)
#answer: 0.06064452
(iv) Mode
#we can see it is 12 survivors
(v) Mean
x <- 0:15
px <- dbinom(x,n,p)
sum(x*px)
#which is also 11.7, approx equal to 12
(vii)qbinom(0.5,n,p)
#Median = 12
#IQR
q1 <- qbinom(0.25,n,p)
q3 <- qbinom(0.75,n,p)
q3-q1
#Ans= 2
Answer 3
(i)
x <- 8
y <- x-1
p <- 0.17
#From Scratch
p*(1-p)^(y)
#Using inbuilt fn
dgeom(y,p)
#Ans= 0.04613129
(ii)(a)
pgeom(3,p,lower.tail = FALSE)
#Ans= 0.4745832
#(b)
dgeom(0,p)
#Ans= 0.17
#(c)
pgeom(9,p,lower.tail=FALSE)
#Ans= 0.1551604
(iii)(a)
qgeom(0.9,p)
#Ans= 12
#(b)
qgeom(0.4,p,lower.tail=FALSE)
#Ans= 4
(iv)
x <- 0:40
plot(x,pgeom(x,p),type="s",main="Step Graph of CDF",
xlab="No of failures before 1st success",ylab="P(X<=x)")
(v)
set.seed(50)
sample <- rgeom(500,p)
mean(sample)
#Ans= 4.82
var(sample)
#Ans= 34.23206
#th mean = q/p
(1-p)/p
#Ans= 4.882353
#th var = q/p^2
(1-p)/(p^2)
#Ans = 28.71972
# we see that the empirical and theoretical means are in close agreement;
# however, there is a slight difference between the empirical and theoretical variances
(vi)(a)
length(sample[sample>=4])/length(sample)
#at least 5 throws means at least 4 failures; compare with 0.4746 from (ii)(a)
#(b)
length(sample[sample==0])/length(sample)
#one throw means X = 0 failures; compare with the theoretical 0.17 from (ii)(b)
#(c)
length(sample[sample>9])/length(sample)
#more than 10 throws means at least 10 failures; compare with 0.1552 from (ii)(c)
Answer 4
(i) Probability
#5th shot means 2 shots are not free throws before the 3rd one is (Type 2 NB)
p <- 0.7
k <- 3
x <- 2
dnbinom(x,k,p)
#Ans= 0.18522
(iv) Quantiles
(a) Find smallest x such that P(X<=x) >= 0.6
qnbinom(0.6,k,p)
#answer: 1
#check
pnbinom(1,k,p)
pnbinom(2,k,p)
(b) Find smallest x such that P(X>x)<=0.6
qnbinom(0.6,k,p,lower=FALSE)
#answer: 1
(v) Simulations
set.seed(70)
k <- 4
N <- rnbinom(1200,k,p)
hist(N,breaks=(-0.5:10.5))
(b) superimpose
x<-0:10
lines(x,1200*dnbinom(x,k,p),type="o",col="blue")
#expected frequency = 1200 simulations x probability
Answer 5
(i) Probability
#sample size n=5 as 5 are selected
n <- 5
#the population has k=26 "successes"
k <- 26
#the population has N-k=26 "failures"
N <- 52
#We could calculate the probability of X=2 as follows
x <- 2
(a) From scratch using choose function
choose(k,x)*choose(N-k,n-x)/choose(N,n)
#answer: 0.3251301
(b) Using dhyper
dhyper(x,k,N-k,n)
#answer: 0.3251301
(c) Using binomial approx
p <- k/N
p
dbinom(x,n,p)
#answer:0.3125
(ii) 4 bar charts
(a)par(mfrow=c(2,2))
(b)x <- 0:20
barplot(dhyper(x,k,N-k,5),names=x, main="n=5",col="blue")
barplot(dhyper(x,k,N-k,10),names=x, main="n=10",col="blue")
barplot(dhyper(x,k,N-k,15),names=x, main="n=15",col="blue")
barplot(dhyper(x,k,N-k,20),names=x, main="n=20",col="blue")
(v) Simulations
set.seed(45)
H <- rhyper(1000,k,N-k,n)
(vi)length(H[H<2])/length(H)
#Ans= 0.183
(vii) mean(H)
#answer: 2.457
#compare with actual
n*k/N
#answer:2.5
quantile(H,0.25)
#answer: 2
#compare with actual
qhyper(0.25,k,N-k,n)
#answer:2
(viii)Binomial approximation
p <- k/N
x <-0:10
plot(x,dbinom(x,n,p),xlab="number of successes",
ylim=c(0,0.4),ylab="probability",type="o")
lines(x,dhyper(x,k,N-k,n),type="o",col="red")
Answer 6
(i)#Poisson(4)
m <- 4
#probability of 7 complaints in a week
x <- 7
(a) From scratch using exp and factorial function
m^x*exp(-m)/factorial(x)
#Ans= 0.1490028
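Parts (ii) and (iii) are not shown; a sketch following the pattern of the other answers (the 2x2 layout and the x-range 0:35 are assumptions):
#(ii) four bar charts
par(mfrow=c(2,2))
x <- 0:35
barplot(dpois(x,2),names=x,main="lambda=2")
barplot(dpois(x,4),names=x,main="lambda=4")
barplot(dpois(x,8),names=x,main="lambda=8")
barplot(dpois(x,20),names=x,main="lambda=20")
par(mfrow=c(1,1))
#for larger lambda the distribution becomes more symmetric (approximately normal)
#(iii) at most 5 complaints
ppois(5,m)
#Ans= 0.7851304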
(v) Simulations
set.seed(50)
P <- rpois(1000,m)
table(P)
hist(P,breaks=(-0.5:11.5))
(vi)#theoretical
qpois(0.75,m)-qpois(0.25,m)
#Ans= 2
quantile(P,0.75)-quantile(P,0.25)
#Ans=3
(vii)Trend of mean
A <- rep(0,1000)
for (i in 1:1000)
{A[i] <- mean(P[1:i])}
x<-1:1000
plot(x,A[1:1000])
abline(h=m,col="red",lty=2,lwd=2)
Answer 7
(i) Calculate the PDF
l <- 5
x <- 1
(a)l*exp(-l*x)
(b)dexp(x,l)
#answer: 0.03368973
(ii) Labelled graph of PDF
x <- seq(0,6,by=0.01)
(a)plot(x, dexp(x,l),type="l",col="blue",ylab="f(x)",main="PDF of Exp(5)")
(b)curve(dexp(x,l),0, 6, col="blue",ylab="f(x)",main="PDF of Exp(5)")
(c)plot(x,type="n",xlim=c(0,6),ylim=c(0,3),xlab="x",ylab="f(x)",main="PDF of Exp(5)")
lines(x,dexp(x,l),type="l",col="blue")
(iii) l<-8
lines(x,dexp(x,l),type="l",col="green",lty=3)
l<-2
lines(x,dexp(x,l),type="l",col="red",lty=2)
# legend
legend("topright",title="PDF of Exp(l)",c("l = 5","l = 8","l = 2"),
lty=c(1,3,2),col=c("blue","green","red"))
(vii)l <- 5
n <- 1000
set.seed(90)
W <- rexp(n,l)
(a) Draw a labelled histogram
hist(W,prob=TRUE,xlab="simulated value",main="simulations from Exp(5)")
(b) Superimpose actual PDF
#Histogram goes from 0 to 2 or could use range
range(W)
x <- seq(0,2,by=0.01)
lines(x,dexp(x,l),type="l",col="blue")
(viii)(a) P(X>0.5)
length(W[W>0.5])/length(W)
#answer: 0.07
#compare to actual probability of 0.082085
pexp(0.5,l,lower=FALSE)
(b) P(2.5<X<3.5)
length(W[W>2.5 & W<3.5])/length(W)
#answer: 0
#compare to actual probability of 3.701543e-06
pexp(3.5,l)-pexp(2.5,l)
(ix)Compare empirical & actual moments
(a) mean
mean(W)
#answer: 0.1994997
#compare with actual mean of 0.2
1/l
(b) standard deviation
sd(W)
#answer: 0.1934161
#th. sd = mean = 0.2
(c) 95th percentile
quantile(W,0.95)
#answer: 0.5931666
#compare with the actual 95th percentile of 0.5991465
qexp(0.95,l)
Answer 8
(i)#mean = a/l = 20
#var = a/l^2 = 80
#solving, we get
l <- 1/4
a <- 5
x <- 45
dgamma(x,a,l)
#ans = 0.002170331
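Part (ii) is not shown; a minimal sketch of the pdf graph over the requested range:
#(ii) pdf of Gamma(5,0.25) for x in (10,50)
x <- seq(10,50,by=0.1)
plot(x,dgamma(x,a,l),type="l",ylab="f(x)",main="PDF of Gamma(5,0.25)")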
(iii)(a)
#at most 35 means <=
pgamma(35,a,l)
#Ans= 0.9359932
(b)# between 40 to 60
pgamma(60,a,l)-pgamma(40,a,l)
# Ans= 0.02839605
(iv) qgamma(0.75,a,l)-qgamma(0.25,a,l)
#Ans= 11.62332
(v)set.seed(40)
S <- rgamma(1600,a,l)
(vii)(a)
length(S[S<=35])/length(S)
#Ans= 0.93375
(b)#between 40 and 60
length(S[S>40 & S<60])/length(S)
#Ans= 0.029375
Answer 9
(i) Values of PDF
x <- 90
(a)m <- 75
s <- sqrt(36)
dnorm(x,m,s)
#answer: 0.002921383
(b)M <- 4.5
S <- sqrt(0.005)
dlnorm(x,M,S)
#answer:0.0626875
(ii) Probabilities
(a)pnorm(85,m,s)-pnorm(60,m,s)
#answer: 0.946
(b)plnorm(80,M,S)
#answer: 0.04761864
(v) Graphs
(a)xval <- seq(60,130,by=1)
plot(xval,dlnorm(xval,M,S),type="l",col="blue")
(b)lines(density(L),col="red")
ASSIGNMENT – Simulation
Question 1. A team has 20 people, each member has been independently infected by a deadly disease.
The survival probability for the disease is 65%.
(i) Use set.seed(28) and rbinom to simulate the number of survivors 400 times. Store this
in the object S.
(ii) (a) Use the table function on S to obtain a frequency table for the survivors.
(b) Hence, calculate the empirical probabilities.
(c) Compare the results of (b) with the actual probabilities from dbinom (round them
to 3DP using the round function).
(d) Use length to obtain the empirical probability of at most 13 survivors and compare
with the actual probability using pbinom.
(iii) (a) Draw a histogram of the results obtained from simulation, centring.
(b) Superimpose on the histogram a line graph of the expected frequencies for the
binomial distribution using the lines function.
(c) Comment on the differences.
(iv) Compare the following statistics for the distribution and simulated values:
(a) mean (b) standard deviation (c) IQR (use the quantile function).
(v) (a) Create a vector StdDev which contains 400 zeros.
(b) Use a loop to store the standard deviation of the first i values in the object S in the
ith element of StdDev.
(c) Plot a graph of the object StdDev showing how the standard deviation of the
simulations changes over the 400 values compared to a horizontal line showing the true
value.
Question 2. The probability of having a female child can be assumed to be 0.45 independently from
birth to birth.
(i) Use set.seed(35) and rgeom to simulate the number of sons before the first
daughter 1,000 times. Store this in the object G.
(iii) Compare the following statistics for the distribution and simulated values:
(a) mean (b) variance.
ANSWERS
Answer 1
n <- 20
p <- 0.65
(i) set.seed(28)
S<-rbinom(400,n,p)
S
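Part (ii) is not shown; a sketch of the comparisons (no outputs stated here):
#(ii)(a) frequency table
table(S)
#(b) empirical probabilities
table(S)/length(S)
#(c) actual probabilities, rounded to 3DP
round(dbinom(0:n,n,p),3)
#(d) empirical P(at most 13 survivors) vs actual
length(S[S<=13])/length(S)
pbinom(13,n,p)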
Answer 2
(i) p <- 0.45
set.seed(35)
G <- rgeom(1000,p)
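Part (iii)(a), the mean comparison, is not shown; a minimal sketch:
#(a) empirical vs theoretical mean
mean(G)
(1-p)/p
#theoretical mean = 0.55/0.45 = 1.222222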
(b) var(G)
#answer:2.9443
#compare with actual
(1-p)/p^2
#answer: 2.716049
#overestimates the mean and variance
(ii) Using for loop, simulate 5000 values of X, by simulating one value of λ and then
using it to obtain values of X. Store it in the object “values”. Use seed= 40.
(iii) Use your simulations to obtain empirical values of mean and variance.
(v) Obtain a histogram of your simulations and use “breaks” to make sure that the
bars line up with the correct values.
(vi) Calculate the empirical probability that there are more than 2 claims in a
particular month.
ANSWERS
Answer 1
(i) values <- rep(0,5000)
(ii) set.seed(40)
for (i in 1:5000)
{lambda <- rexp(1,(1/0.4))
values[i] <- rpois(1,lambda)}
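A sketch for part (iii); the theoretical values follow from the tower law: E[X] = E[lambda] = 0.4 and Var[X] = E[lambda] + Var[lambda] = 0.4 + 0.4^2 = 0.56.
#(iii) empirical mean and variance
mean(values)
var(values)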
Question 1. Let X be the amount of time (in minutes) a postal clerk spends with his/her customer.
The time is known to have an exponential distribution with mean 4 minutes.
(i) Using seed = 50, simulate a sample of size 100 from this distribution.
(b) Using for loop, store the sum of the results obtained in the ith sample of 100
simulations in the ith element of S. Use seed = 60.
(iv) (a) Draw a labelled histogram of the probabilities of the results in “S”.
(v) Find the probability that the sum of time spent with 100 customers is greater
than 400 minutes:-
(vi) (a) Repeat parts (iv) and (v) for sample sizes of 10,30,50 and 200 customers.(All
4 histograms should be displayed simultaneously).
(vii) Find the mean and variance of your simulated sample and compare it with the
theoretical mean and variance (using CLT).
(vii) Calculate the median, lower and upper quartiles empirically and theoretically
(using CLT).
Question 2. The number of claims received in a day (X) follows Poisson distribution with mean 7.
(iv) (a) Calculate the empirical probability of more than 5 claims in a day.
(c) Compare the above two probabilities with the exact probability using ppois.
(iii) (a) Use qqnorm to obtain a QQ plot for the simulations and a normal
distribution.
Question 3. The following data represent the average total number of marks obtained for a
particular exam, observed over seven exam sessions that had been administered by a
professional examination body: 87 53 72 90 78 85 83
(i) Enter these data into R and compute their sample mean and variance.
(ii) Investigate whether the Poisson model is appropriate for these data, by
calculating the sample mean and sample variance of 10 Poisson samples having
the same size and mean as the sample given above.
ANSWERS
Answer 1
(i) Exponential distribution simulations
#mean = 4, so rate = ¼
l <- 1/4
n <-100
set.seed(50)
E <- rexp(n,l)
(ii) Histogram
hist(E,xlab="time",main="")
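The loop that creates the sample sums S for n = 100 is not shown; a sketch mirroring the n = 30/50/200 blocks below, using the seed from the question:
#(iii)(b) sums of 1000 re-samples of size 100
S <- rep(0,1000)
set.seed(60)
for (i in 1:1000)
{E <- rexp(n,l); S[i] <- sum(E)}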
(b) range(S)
x<-seq(250,600,by=0.1)
hist(S, prob=TRUE,xlab="sample sums",main="")
curve(dnorm(x,n/l,sqrt(n/l^2)),add=TRUE,lwd=2,col="red")
n <- 30
S<-rep(0,1000)
set.seed(50)
for (i in 1:1000)
{E<-rexp(n,l);S[i]<-sum(E)}
hist(S, prob=TRUE,xlab="sample sums",main="")
lines(x,dnorm(x,n/l,sqrt(n/l^2)),type="l",lwd=2,col="red")
n <- 50
S<-rep(0,1000)
set.seed(50)
for (i in 1:1000)
{E<-rexp(n,l);S[i]<-sum(E)}
hist(S, prob=TRUE,xlab="sample sums",main="")
lines(x,dnorm(x,n/l,sqrt(n/l^2)),type="l",lwd=2,col="red")
n <- 200
S<-rep(0,1000)
set.seed(50)
for (i in 1:1000)
{E<-rexp(n,l);S[i]<-sum(E)}
hist(S, prob=TRUE,xlab="sample sums",main="")
lines(x,dnorm(x,n/l,sqrt(n/l^2)),type="l",lwd=2,col="red")
#As the sample size increases, the normal approximation matches the
#empirical distribution more closely
Answer 2
(i) Poisson simulations
(a)m <- 7
set.seed(79)
P<-rpois(1500,m)
(b)#check range before plot histogram
range(P)
#or could use table(P)
hist(P,breaks=(-0.5:16.5),prob=TRUE)
(c)xvals<-seq(-0.5,11.5,by=0.01)
lines(xvals,dnorm(xvals,m,sqrt(m)),type="l",lwd=2,col="red")
#The probability from the simulation is closer than that from the
#normal approximation.
(iii) QQ plot
(a)qqnorm(P)
(b) qqline(P, lty=2, col="red",lwd=2)
Answer 3
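No worked solution is shown for Answer 3; a minimal sketch of one approach (the seed is an assumption, as the question does not specify one):
#(i) sample mean and variance
marks <- c(87,53,72,90,78,85,83)
mean(marks) #78.28571
var(marks) #159.9048
#(ii) for a Poisson model the mean and variance should be similar;
#compare with 10 Poisson samples of the same size and mean
set.seed(1)
for (i in 1:10)
{s <- rpois(length(marks),mean(marks))
print(c(mean(s),var(s)))}
#the sample variance (about 160) is roughly double its mean,
#unlike the Poisson samples, so the Poisson model looks doubtful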
ASSIGNMENT – Estimation
Question 1. The times (in minutes) between successive calls at a call center are contained in the CSV
data file “times”.
Analysts suggest that the data could be modelled using an Exponential distribution with
λ=0.0987 or as a Gamma distribution with α=0.8628 and λ=0.0939.
(iv) Superimpose the pdfs of these two distributions on the graph obtained in (iii).
(i) (a) Generate 1000 values from the fitted exponential distribution and store
them in object “x”. Use seed = 80.
(vi) (a) Generate 1000 values from the fitted gamma distribution and store then in
object “y”. Use seed=80.
(vii) Which is the most appropriate model? Use results obtained in (v) and (vi).
Question 2. The CSV file “complaints” contains the number of complaints of a smartphone in a year
made by 1,00,000 of its users.
No of complaints No of Users
0 83989
1 14667
2 1270
3 72
4 2
≥5 -
It is thought that this data could be modelled as a Poisson distribution with λ=0.175 or
as a Type 2 Negative Binomial distribution with k= 2.2569 and p= 0.92804.
(iii) (a) List out the expected frequencies for each of these fitted distributions to the
nearest whole number.
(b) Obtain the differences between the observed and expected frequencies for
the two fitted distributions.
(c) Hence, comment on the fit of these two distributions to the observed data.
ANSWERS
Answer 1
(i) Set your working directory
t <- read.table("times.csv")
t #Here t is a dataframe but we want it to be a vector, so:
t <- t$V1
t #Now t is a vector.
(ii) hist(t,prob=TRUE,xlab="minutes",main="Times between successive calls")
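The x-grid s and the exponential rate l used below are not defined in the extract; a sketch adding the fitted exponential pdf for part (iv) (the grid end-point of 60 is an assumption):
#fitted exponential
s <- seq(0,60,by=0.1)
l <- 0.0987
lines(s, dexp(s,l), col="blue", lty=2)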
#fitted gamma
a <- 0.8628
l2 <- 0.0939
lines(s, dgamma(s,a,l2),col="dark green",lty=3)
#Hard to comment on which is the better fit from this graph.
#Hence will use QQ plots instead.
(v)(a) set.seed(80)
x <- rexp(1000,l)
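Part (v)(b), the QQ plot itself, is not shown; a minimal sketch (the axis choice follows the comments below):
#(b) QQ plot: model quantiles (x) vs sample quantiles (y)
qqplot(x, t, xlab="model", ylab="sample")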
(c) abline(0,1,col="red",lty=2)
#middle to upper sample values are higher than the model,
#so a heavier upper tail - more positively skewed
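The corresponding gamma simulation and QQ plot for part (vi) are not shown; a sketch using the same seed:
#(vi)(a)-(b)
set.seed(80)
y <- rgamma(1000,a,l2)
qqplot(y, t, xlab="model", ylab="sample")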
(c) abline(0,1,col="red",lty=2)
#Middle to higher values get worse (with the highest value very poor)
#but better in the middle than the exponential, since points lie on both sides of the line
(vii) Both models fit well at the lower end but worse elsewhere.
Despite the single extreme value in the gamma QQ plot,
the middle has a better fit than the exponential, so the gamma is the more appropriate model.
Answer 2
(i) #set working directory
c <-read.table("complaints.csv")
c <- c$V1
c
(ii) range(c)
hist(c,breaks=c(0:5),ylim=c(0,100000),xlim=c(0,5),xlab="no of complaints")
ASSIGNMENT – Confidence Intervals
Question 1. A random sample of 60 observations has mean 160 and the population standard
deviation is 40. Find a 95% confidence interval for the population mean.
Question 2. An insurance company has clients for automobile policies. A sample of 8 automobiles for
which claims have been made in a month have been selected at random. The claim
amounts (in 000’s) are as follows: -
49,53,51,52,47,50,52,53
(ii) Find 99% CI for the standard deviation of the claim amount (use qchisq) .
(iii) (a) Assuming claims are normally distributed, use seed=50, rnorm and for loop
to obtain the mean of 800 re-samples of size 8.
(b) Hence obtain a 95% parametric bootstrap CI for the average claim size.
(iv) (a) Making no distributional assumption, use seed=50, sample and for loop to
obtain 800 re-samples of size 8.
(b) Hence obtain a non-parametric 95% CI for the average claim size.
(v) Using the method in (iii) and the same seed, obtain a 99% CI for the standard
deviation of claim size.
(vi) Using the method in (iv) and the same seed, obtain a non-parametric 99% CI for
the standard deviation of claim size.
The built in dataset, ChickWeight contains weight versus age of chicks on different
diets.
(vii) Use t.test to obtain a 99% CI for the average weights of chicks being fed diet 1.
(viii) Find 90% CI for the variance of the weights of chicks being fed diet 2.
Question 3. A random sample of 450 pineapples was taken from a large consignment and 85 were
found to be bad.
(ii) Comment on the likelihood of more than 25% of the pineapples in a sample
being bad.
Question 4. A statistician has a sample of 25 values from a Poisson distribution with mean of 7.
Find the exact 95% CI for the mean rate.
Question 5. The length of time (in minutes) required to assemble the device using standard
procedure and new procedure are given below. Two groups of nine employees were
selected randomly, one group using the new procedure and other using the standard
procedure
Standard Procedure 32 37 35 28 41 44 35 31 34
New Procedure 35 31 29 25 34 40 27 32 31
(i) (a) Using t.test, calculate a 99% CI for the difference in mean times for the two
groups.
(ii) It is now known that both the groups were made of the same employees at
different times. Incorporate this information and repeat (i).
(iii) (a) Find the 90% CI for the ratio of variances of the two groups.
Question 6. Random samples of 500 men and 700 women were asked whether they would have a
flyover near their residence. 300 men and 425 women were in favour of the proposal.
(i) Use two vectors and prop.test command to find the 99% CI for the difference in
proportions between men and women who were in favour.
(ii) Now, solve (i) using a matrix for the results instead of two vectors.
ANSWERS
Answer 1
(i) xbar <- 160
n <- 60
sigma <- 40
alpha <- 0.05
xbar + c(-1,1)*qnorm(1-alpha/2)*sigma/sqrt(n)
#or
xbar + c(-1,1)*qnorm(alpha/2,lower=FALSE)*sigma/sqrt(n)
#Ans= (149.8788,170.1212)
Answer 2
(i) (a) claims <- c(49,53,51,52,47,50,52,53)
n <- length(claims)
alpha <- 0.05
mean(claims)+ c(-1,1)*qt(1-alpha/2,n-1)*sd(claims)/sqrt(n)
#or
mean(claims)+ c(-1,1)*qt(alpha/2,n-1,lower=FALSE)*sd(claims)/sqrt(n)
#Ans= (49.11921,52.63079)
(b) t.test(claims,conf= 0.95)
(ii) sqrt(c((n-1)*var(claims)/qchisq(1-alpha/2,df=n-1),
(n-1)*var(claims)/qchisq(alpha/2,df=n-1)))
#Ans = (1.388578,4.274418)
(vii) ChickWeight
x <- ChickWeight$weight[ChickWeight$Diet==1]
t.test(x,conf.level = 0.99) #Ans= (92.71988,112.57103)
(viii) y <- ChickWeight$weight[ChickWeight$Diet==2]
n<-length(y)
alpha <- 0.05
c((n-1)*var(y)/qchisq(1-alpha/2,df=n-1),
(n-1)*var(y)/qchisq(alpha/2,df=n-1))
#Ans= (4038.725,6727.576)
Answer 3
(i) x <- 85
n <- 450
binom.test(x,n,conf.level=0.95)
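Part (ii) is not shown; a short sketch:
#(ii) observed proportion
x/n
#0.1889; the 95% CI from part (i) lies entirely below 25%,
#so more than 25% of a sample being bad is unlikely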
Answer 4
#careful: x is the total claims but we are given the mean of 25 values
x <- 25*7
n <- 25
poisson.test(x, n, conf=0.95)
#gives the exact 95% CI for the mean rate
Answer 5
(i) (a) sp <- c(32,37,35,28,41,44,35,31,34)
np <- c(35,31,29,25,34,40,27,32,31)
t.test(sp,np,conf.level = 0.99)
#Ans= (-2.834459,10.167793)
(b) #the confidence interval contains 0 so means could be equal
Answer 6
(i) #vector of success
x <- c(300,425)
#vector of trials
n <- c(500,700)
prop.test(x,n,conf=0.99,correct=FALSE)
#Ans= (-0.08093681,0.06665110)
(ii) m <- cbind(x,n-x)
m
prop.test(m,conf=0.99,correct=FALSE)
ASSIGNMENT – Hypothesis Testing
Question 1. A random sample of 60 observations has mean 160 and the population standard
deviation is 40. Test, at the 5% level, (a) H0: μ = 165 vs H1: μ < 165 and
(b) H0: μ = 155 vs H1: μ ≠ 155.
Question 2. An insurance company has clients for automobile policies. A sample of 8 automobiles for
which claims have been made in a month have been selected at random. The claim amo
unts (in £ 000’s) are as follows: -
49,53,51,52,47,50,52,53
(i) Test whether the average claim amount is greater than the presumed average
value of £ 50,000.
(ii) Test whether the standard deviation of claim amount is equal to 5 from scratch
using pchisq to obtain the p-value.
The built in dataset, ChickWeight contains weight versus age of chicks on different
diets.
(iii) Use t.test to test whether the average weights of chicks being fed diet 1 is
100gm.
(iv) Test whether the variance of the weights of chicks being fed diet 2 is less than
7000gm2.
Question 3. A random sample of 450 pineapples was taken from a large consignment and 85 were
found to be bad.
(i) Test whether the proportion of bad pineapples is less than 25% using
binom.test.
Question 4. The number of typing errors in a page are modelled using Poisson distribution with
mean λ. In a particular draft, there are 900 pages which were found to have 283 errors.
Question 5. The length of time (in minutes) required to assemble the device using standard
procedure and new procedure are given below. Two groups of nine employees were
selected randomly, one group using the new procedure and other using the standard
procedure
Standard Procedure 32 37 35 28 41 44 35 31 34
New Procedure 35 31 29 25 34 40 27 32 31
(i) Using t.test, test the hypothesis that the new procedure reduces the average
time required to assemble the device (you may assume equal variances).
(ii) It is now known that both the groups were made of the same employees at
different times. Incorporate this information and repeat (i).
(iii) Test the hypothesis that the lengths of times in the new procedure vary more
than those of standard procedure.
Question 6. Random samples of 500 men and 700 women were asked whether they would have a
flyover near their residence. 300 men and 425 women were in favour of the proposal.
(i) Use two vectors and prop.test to test the hypothesis that the proportion of
women that favour the proposal is greater than that of men. Do not use
continuity correction.
(ii) Now, solve (i) using a matrix for the results instead of two vectors.
Experts suggest that this proportion should be modelled using a Poisson distribution.
(iii) Use poisson.test and test whether the proportions of men and women
favouring the proposal are different.
Question 7. The length of time (in minutes) required to assemble the device using standard
procedure and new procedure are given below. Two groups of nine employees were
selected randomly, one group using the new procedure and other using the standard
procedure
Standard Procedure 32 37 35 28 41 44 35 31 34
New Procedure 35 31 29 25 34 40 27 32 31
(i) Store these results in two vectors and the value of the difference between their
means in the object “d”.
(ii) Carry out a permutation test to test the hypothesis that the lengths of times
using new procedure have a lower average than that using standard procedure.
(b) Create an object “index” that gives the positions of the values in “r”.
(c) Use the function combn on the object “index” to calculate all the
combinations of lengths of time using standard procedure and store this in the
object “p”.
(d) Use a loop to store the differences in the average lengths of the two groups
in the object “dif”.
(iii) (a) Plot a labelled histogram of the differences in the average lengths of times of
the two groups for every combination.
(b) Use the function abline to add a dotted vertical green line to show the
critical value.
(c) Use the function abline to add a dashed vertical blue line to show the
observed statistic.
(iv) (a) Calculate the p-value of the test based on this permutation test.
(b) The p-value calculated under the normality assumption was 13.19%.
Comment on your result.
(v) Repeat part (ii) but with 10,000 resamples from the object results using the
function sample and set.seed(77).
(vi) Calculate the p-value of the test using resampling and compare it to the answer
using all the combinations calculated in part (iv).
Question 8. The length of time (in minutes) required to assemble the device using standard
procedure and new procedure are given below. Two groups of nine employees were
selected randomly, one group using the new procedure and other using the standard
procedure
Standard Procedure 32 37 35 28 41 44 35 31 34
New Procedure 35 31 29 25 34 40 27 32 31
(i) Store the differences of pairs of results in the vector D and the mean value of
these differences in the object ObsD.
(ii) Carry out a permutation test to test the hypothesis that the new procedure
has a lower average assembly time than the standard procedure:
(b) Use the function permutations from the package gtools to calculate all the
permutations of the signs of the differences in object D and store these
permutations in the object p.
(c) Use a loop to store the mean differences in the average lengths of times of
the two groups in the object dif.
(iii) (a) Calculate the p-value of the test based on this permutation test.
(b) The p-value calculated under the normality assumption was 2.885%.
Comment on your result.
(iv) Repeat part (ii) but with 10,000 resamples from the object sign using the
function sample in the loop and set.seed(79).
(v) Calculate the p-value of the test using resampling and compare it to the answer
using all the combinations calculated in part (iii).
Question 9. The number of admissions in a hospital for each weekday are given below:-
Day        Mon  Tue  Wed  Thu  Fri  Total
Frequency  73   52   52   55   68   300
(b) Give the expected frequencies if the number of admissions was
independent of the day (ie uniformly distributed).
(c) Use chisq.test to determine whether the observed results fit a uniform
distribution.
For a cleaning solution, the ratio of chemicals A,B and C should theoretically be 2:3:5.
For 120 bottles of the solution, the results were as follows:-
Chemicals A B C Total
Frequency 17 29 74 120
(ii) (a) Use chisq.test to determine whether the observed results are consistent with
the theoretical ratio.
(b) Extract the dof from the results of the above test.
A survey was analysed and it was found that the distribution of the number of
accidents of vehicle owners in a year is binomial with parameters n=3 and p. Data on
153 vehicle owners is as follows:-
No of claims 0 1 2 3
No of policies 60 75 16 2
(iii) (a) Show that the method of moments estimate for p is 0.246.
(b) Use chisq.test to carry out a goodness of fit for the specified binomial model
for the number of accidents of each vehicle owner in a year, ensuring that the
expected frequencies are greater than 5.
Question 10. Two sample polls of votes for two candidates A and B for a public office are taken, one
each from among the residents of rural and urban areas. The results (in 000’s) are given
in the table below.
A B
Rural 37 15
Urban 12 50
(i) (a) Store these names and frequencies in the matrix obs2
(b) Use chisq.test to determine whether the nature of the area is related to
voting preference in this election.
(ii) Use fisher.test to determine whether the nature of the area and voting
preference are independent and give the exact p-value.
ANSWERS
Answer 1
xbar <- 160
n <- 60
sigma <- 40
alpha <- 0.05
(a) mu <- 165
statistic <- (xbar-mu)/(sigma/sqrt(n))
statistic #Ans= -0.9682458
#critical value
qnorm(alpha)
#greater than critical value of -1.644854 so don't reject
#p-value
pnorm(statistic) #p-value= 16.64%
(b)mu <- 155
statistic <- (xbar-mu)/(sigma/sqrt(n))
statistic #Ans= 0.9682458
#critical values
qnorm(alpha/2)
qnorm(1-alpha/2)
#between critical values of ±1.959964 so don't reject
#p-value
2*pnorm(statistic,lower=FALSE) #p-value= 33.297%
Answer 2
(i) claims <- c(49,53,51,52,47,50,52,53)
n <- length(claims)
alpha <- 0.05
mu <- 50
(a) statistic <- (mean(claims)-mu)/(sd(claims)/sqrt(n))
statistic #Ans= 1.178416
#p-value
1-pt(statistic,n-1)
#Ans= p-value = 13.85% so we cannot reject H0
(b) t.test(claims,alt="greater",mu=50)
(iii) ChickWeight
x <- ChickWeight$weight[ChickWeight$Diet==1]
t.test(x,conf.level = 0.95,mu=100)
#p-value = 48.93%; hence we may not reject H0.
(iv) y <- ChickWeight$weight[ChickWeight$Diet==2]
n<-length(y)
alpha <- 0.05
sigma <- sqrt(7000)
statistic <- (n-1)*var(y)/sigma^2
statistic #Ans= 87.16977
#lower critical value
qchisq(alpha,n-1) #Ans= 94.81124
#statistic is below the lower critical value, so we may reject H0
#p-value (one-sided)
pchisq((n-1)*var(y)/sigma^2,df=n-1)
#p value is smaller than alpha.
Answer 3
(i) x<-85
n<-450
alpha<-0.05
binom.test(x,n,p=0.25,alternative="less",conf.level=1-alpha)
#p-value<1% reject H0, p<0.25
Answer 4
(i) Test for lambda using poisson.test
x<-283
n<-900
alpha<-0.05
poisson.test(x,n,r=0.2,alt="greater",conf=1-alpha)
#p-value is extremely small; reject H0, rate > 0.2
(ii)test <-poisson.test(x,n,r=0.2,alt="greater")
test$estimate #answer: 0.3144444
Answer 5
(i) sp <- c(32,37,35,28,41,44,35,31,34)
np <- c(35,31,29,25,34,40,27,32,31)
#alt is mean(sp) greater than mean(np)
#sp>np
t.test(sp,np,alt="greater",var.equal=TRUE)
#p-value >0.05; hence do not reject
(ii) t.test(sp,np,alt="greater",var.equal=TRUE,paired=TRUE)
# p-value <0.05; hence we may reject
(iii) var.test(sp,np,alt="less")
#p-value is > 0.05; hence we may not reject.
Answer 6
(i) #vector of success
x <- c(300,425)
#vector of trials
n <- c(500,700)
prop.test(x,n,alt="less",conf=0.95,correct=FALSE)
#with x = (men, women), H1: p_women > p_men means prop1 < prop2, so alt="less"
(iii) poisson.test(x,n)
#p-value = 0.8803 so do not reject H0, same proportions
Answer 7
(i) sp <- c(32,37,35,28,41,44,35,31,34)
np <- c(35,31,29,25,34,40,27,32,31)
d <- mean(sp)-mean(np)
#Ans= 3.666666
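The code for parts (ii)(a)-(d) is not shown; a sketch consistent with the objects (r, index, nsp, p, dif) used in part (v) below:
#(ii)(a) pooled results
r <- c(sp,np)
#(b) positions of the values in r
index <- 1:length(r)
#(c) all combinations for the standard-procedure group
nsp <- length(sp)
p <- combn(index,nsp)
#(d) differences in the average lengths for every combination
dif <- rep(0,ncol(p))
for (i in 1:ncol(p))
{dif[i] <- mean(r[p[,i]])-mean(r[-p[,i]])}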
(iii) Histogram
(a)hist(dif)
(b)abline(v=quantile(dif,0.05), col="green",lty=3)
#or, if the difference is taken the other way round:
abline(v=quantile(dif,0.95), col="green",lty=3)
(c)abline(v=d, col="blue",lty=2)
(iv) #p-value
(a)length(dif[dif<=d])/length(dif) #Ans= 0.9470177
#So insufficient evidence to reject H0, no difference in mean assembly times
(b)#p-value is very close to the value under the normality assumption
set.seed(77)
for (i in 1:10000)
{p<-sample(index, nsp, replace=FALSE)
dif[i]<-mean(r[p])-mean(r[-p])}
(vi) #p-value
length(dif[dif<=d])/length(dif) #Ans= 0.9456
#Insufficient evidence to reject H0, no difference in mean assembly times
#very close to the value using all combination
Answer 8
(i)sp <- c(32,37,35,28,41,44,35,31,34)
np <- c(35,31,29,25,34,40,27,32,31)
D <- sp-np
ObsD <- mean(D)
ObsD #Ans= 3.666667
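The code for part (ii) is not shown; a sketch using gtools::permutations to flip the signs of the paired differences:
#(ii)(b) all 2^9 sign permutations
library(gtools)
p <- permutations(2,length(D),c(-1,1),repeats.allowed=TRUE)
#(c) mean difference under each sign permutation
dif <- rep(0,nrow(p))
for (i in 1:nrow(p))
{dif[i] <- mean(p[i,]*abs(D))}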
(iii) #p-value
(a)length(dif[dif<=ObsD])/length(dif) #Ans= 0.9902344
#So insufficient evidence to reject H0
(v) #p-value
length(dif[dif<=ObsD])/length(dif) # Ans= 0.9911
#Insufficient evidence to reject H0
#very close to value using all combinations
Answer 9
(i) Goodness of fit test - uniform
(a)obs <- c(73,52,52,55,68)
(b)#would be 300/5 = 60 each day
exptd <- rep(60,5)
(c)chisq.test(obs,p=exptd,rescale=TRUE)
#or
chisq.test(obs)
#or
exptd <- rep(1/5,5)
chisq.test(obs,p=exptd)
#statistic= 6.4333 on chi-square 4 #p-value = 0.169, not reject H0
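Parts (ii) and (iii) are not shown; a sketch (the helper names chem, claims2 and probs are introduced here, and merging the last two cells in (iii)(b) is one common approach):
#(ii)(a) chemicals in ratio 2:3:5
chem <- c(17,29,74)
chisq.test(chem,p=c(2,3,5)/10)
#(b) extract the degrees of freedom
chisq.test(chem,p=c(2,3,5)/10)$parameter
#(iii)(a) method of moments: sample mean = 3p
xbar <- sum(0:3*c(60,75,16,2))/153
xbar/3 #Ans= 0.246
#(b) the expected frequency for 3 accidents is below 5, so merge the last two cells
p <- 0.246
claims2 <- c(60,75,16+2)
probs <- c(dbinom(0:1,3,p),1-pbinom(1,3,p))
chisq.test(claims2,p=probs)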
Answer 10
(i)(a) obs2 <- matrix(c(37,12,15,50), 2, 2,
dimnames=list(c("rural", "urban"),c("A","B")))
obs2
#or do it separately
obs2 <- matrix(c(37,12,15,50), 2, 2)
rownames(obs2) <- c("rural", "urban")
colnames(obs2) <- c("A","B")
(b)chisq.test(obs2)
#statistic= 28.885 on chi-square 1
#p-value = 0, reject H0.
(c)chisq.test(obs2, correct=FALSE
#statistic= 30.962 on chi-square 1
#p-value = 0, reject H0
(ii) fisher.test(obs2)
#p-value = 2.36e-08, reject H0
ASSIGNMENT – Data Analysis
Question 1. The average heights and weights of American women are given in the inbuilt
dataset “women”.
(b) Create objects Sxx, Syy and Sxy which contain the sum of squares for the
data.
(iv) (a) Calculate the Pearson Correlation coefficient of the rank of the heights
and rank of the weights.
(v) (a) Use cor.test and the Pearson correlation coefficient to test whether
ρ=0.
(c) Use the statistic from part (v)(b) to obtain the p-value for the test in part
(v)(a).
(vii) Use cor.test to test if the true value of Kendall’s correlation coefficient is
less than zero.
(viii) Use Fisher’s transformation to test whether H0 : ρ = 0.9 vs H1 : ρ> 0.9 stating
the p-value.
(ix) Use prcomp to carry out PCA on the women data and store it in the
object pca1.
(xi) (a) Obtain the principal components decomposition (matrix P) for the
women data from pca1.
(xiii) (a) Obtain the percentage of the total variance explained by each of the
principal components using the summary of the prcomp function.
Question 2. The built-in data set "iris" contains measurements (in cm) of the variables sepal
length, sepal width, petal length and petal width, respectively, for 50 flowers from
each of 3 species (Iris setosa, versicolor, and virginica) of iris.
(i) Extract the four measurements for the setosa species only and store them
in the 50x4 data frame, SDF.
(ii) Use plot to obtain a scatter graph of each pair of measurements for the
setosa species.
(iv) Comment on the relationship between Petal Width and the other
measurements.
(v) Use eigen to obtain the eigenvectors of XᵀX and store them in the matrix object
W.
(b) Calculate what percentage of the total each of the variances in matrix S represents.
(b) Obtain the percentages in part (vii)(b) using the summary of the prcomp
function.
(c) Draw a scree diagram using plot on the result of the prcomp function and
hence state which principal component(s) should be dropped to simplify the
dimensionality.
(ix) (a) Carry out PCA with scaling of the data using prcomp.
(b) Using the Kaiser Criterion state which principal component(s) should be
dropped to simplify the dimensionality.
(x) (a) Using cbind and rep, or otherwise, obtain a new matrix P1 which has only the
first two principal components and vectors of zeroes for the removed
components.
ANSWERS
Answer 1
(i)(a) women
plot(women,main="Height vs Weight")
(b) #we see that the relationship between height and weight is almost
#perfectly linear
(iv)(a) x <- rank(women$height); y <- rank(women$weight); n <- nrow(women)
d <- x-y
1-(6*sum(d^2))/(n*(n^2-1))
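As a check, the built-in function gives the same value:
cor(women$height,women$weight,method="spearman")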
(x)(a) Eigenvectors
#These are given when the variable is called:
pca1
#or can be extracted using:
pca1$rotation
#answer:
# PC1 PC2
#height 0.2762612 0.9610826
#weight 0.9610826 -0.2762612
(b) Explain
#These are the orthogonal vectors of the new co-ordinate system
#(which is a rotation of the old co-ordinate system)
(iv) Comment
#It looks like there might be weak positive correlation between
#Petal Width and all the other variables (as there are no values
#in the top left quadrant).
detach(iris)
pairs(X)
ASSIGNMENT – Linear Regression
(b) Plot a labelled scatter graph of the data and add a red dashed regression
line onto your scatterplot.
(iv) Add green points to the scatterplot to show the fitted values.
(vi) Obtain the total sum of squares in the baby weights model together with its
split between the residual sum of squares and the regression sum of
squares:
(b) from first principles using the functions sum, mean, fitted and residuals.
Question 2. The average heights and weights of American women are given in the inbuilt
dataset “women”. Using model1 fitted in the previous question: -
(iii) Extract the estimated value of beta, the standard error of beta and the
degrees of freedom and store them in the objects b, se and dof.
(iv) Using the objects created in part (iii), use a first principles approach to:
(b) obtain the statistic and p-value for a test of H0: β = 3.6 vs H1: β < 3.6.
(c) obtain the statistic and p-value for a test of H0: β = 3.2 vs H1: β ≠ 3.2.
(v) Obtain the results of an F-test to test the ‘no linear relationship’ hypothesis
using the:
(vi) Calculate the F statistic and p-value from first principles by extracting the
mean sum of squares and degrees of freedom from the ANOVA table.
(vii) Obtain a 95% confidence interval for the error variance, σ2 , from first
principles.
(viii) (a) Estimate the mean weight of a woman with height 55 inches.
Question 3. The average heights and weights of American women are given in the inbuilt
dataset “women”. Using model1 fitted in the first question: -
(ii) (a) Obtain a plot of the residuals against the fitted values.
(b) Comment on the constancy of the variance and whether a linear model
is appropriate.
(iv) Examine the final two graphs obtained by plot(model1) and comment.
(v) (a) Obtain a new linear regression model, model2, based on the data
without the third data point (height=60 inches).
(iii) (a) Obtain estimates for the slope and intercept parameters for model3.
(b) Add a red dashed regression line to your scatterplot of lny vs x from part
(i)(c).
(b) Re-plot the scatterplot of y vs x and this time add blue points to the
scatterplot to show the fitted values of y using model3.
(c) Add a dashed red regression curve that passes through the fitted points.
(v) Obtain a 95% confidence interval for the mean value of y when x = 8.5 .
ANSWERS
Answer 1
(i)Fit a linear regression model
women
model1 <- lm(weight~height,data=women)
coef(model1)[1]+coef(model1)[2]*72
(b) using predict
newdata1 <-data.frame(height=72)
#Then we use the "predict" function
predict(model1,newdata1) #Answer: 160.8833
Answer 2
(i) Test beta=0
summary(model1)
#t statistic 37.85, p-value 1.09e-14
#reject H0, beta definitely not equal to zero
(v) F test
(a)anova(model1)
#F statistic = 1433
#p-value = 1.091e-14
#reject H0, there is a linear relationship btwn height and weight
(b)summary(model1)
Answer 3
(i)(a) Residuals from 1st principles
#residuals are the differences btwn true y values and fitted y values
women$weight-fitted(model1)
(b) Residuals using command
model1$residuals
#or
residuals(model1)
(ii) Plot the residuals against the fitted values and comment
(a)plot(model1,1)
plot(model1,5)
Answer 4
(i)(a) x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(0.33,0.51,0.75,1.16,1.9,2.59,5.14,7.39,11.3,17.4)
obs <- data.frame(x,y)
(b) Scatterplot x vs y
plot(obs,pch=3)
newdata <-data.frame(x=8.5)
exp(predict(model3,newdata,interval="confidence",level=0.95))
#(8.41,9.63)
ASSIGNMENT – Multiple Linear Regression
Question 1. Two objects, A and B have been made by three creators, Tom, Matt and John. The
dataset "data1" contains 50 values each of the length and width of A and B created
by the three creators.
(i) Extract the four measurements corresponding to the creator Matt and
store them in the data frame MDF.
(ii) Using the data for Matt, fit a linear regression model, model2, with B.width
as the response variable and A.length, A.width and B.length as explanatory
variables:
(v) Obtain the expected B width with A length 5.1cm, A width 3.5cm and B
length 1.4cm created by Matt:-
Question 2. This question uses the creator Matt linear regression model, model2, with B.width
(y ) as the response variable and A.length (x1) , A.width (x2) and B.length (x3) as
explanatory variables:
(i) Obtain the total sum of squares in model2 together with its split between
the residual sum of squares and the regression sums of squares using the
anova command.
(vi) Extract the value of β2 , the standard error of β2 and the degrees of
freedom and store them in the objects b2, se2 and dof.
(vii) Using the objects created in part (vi), use a first principles approach to:
(a) obtain a 90% confidence interval for β2 and compare to part (v)(a).
(b) obtain the statistic and p-value for a test of H0 : β2 = 0.3 vs H1 : β2 < 0.3.
Question 3. This question uses the creator Matt linear regression model, model2, with B.width
(y ) as the response variable and A.length (x1) , A.width (x2) and B.length (x3) as
explanatory variables:
(i) Carry out an F-test to test H0 : β1= β2= β3 = 0 using the summary command,
stating the test statistic and p-value clearly.
(ii) Obtain a 95% confidence interval for the error variance, σ2 , from first
principles.
Question 4. This question uses the creator Matt linear regression model, model2, with B.width
(y ) as the response variable and A.length (x1) , A.width (x2) and B.length (x3) as
explanatory variables:
(i) Obtain the expected B width with A length 5.94cm, A width 2.77cm and B
length 4.26cm created by Matt.
(a) mean B width with A length 5.94cm, A width 2.77cm and B length
4.26cm for creator Matt
(b) individual B width with A length 5.94cm, A width 2.77cm and B length
4.26cm for creator Matt
(a) from the model2 parameters (b) using the predict function.
(a) using the confint function (b) using the predict function.
Question 5. This question uses the creator Matt linear regression model, model2, with B.width
(y ) as the response variable and A.length (x1) , A.width (x2) and B.length (x3) as
explanatory variables:
(ii) (a) Obtain a plot of the residuals against the fitted values.
(b) Comment on the constancy of the variance and whether a linear model
is appropriate.
(iv) Examine the final two graphs obtained by plot(model2) and comment.
Question 6. This question uses the creator Matt data from Q1 which should be stored in the
data frame MDF. We are fitting multiple linear regression models with B.width as
the response variable and a combination of A.length, A.width and B.length as
explanatory variables.
Forward selection
(i) Fit the null regression model, fit0, to the B.width data.
(ii) Obtain the (Pearson) linear correlation coefficient between all the pairs of
variables.
(iii) Fit a linear regression model, fit1, with B.width as the response variable and
the variable with the greatest correlation with B.width as the explanatory
variable.
(iv) (a) Fit a linear regression model, fit2, with B.width as the response variable
and the variable from part (iii) and the variable with the next highest
correlation with B.width as the two explanatory variables.
(v) (a) Fit a linear regression model, fit3, with B.width as the response variable
and the variables from part (iv) plus the last variable as the explanatory
variables.
(vi) Comment on the output of the fit3 model and the results of the ANOVA
output.
Backward selection
Start with creator Matt linear regression model, model2, with B.width (y ) as the
response variable and A.length (x1) , A.width (x2) and B.length (x3) as explanatory
variables:
(vii) (a) Update the model to create model2b by removing the variable with βj
not significantly different from zero.
Question 7. This question uses the creator Matt linear regression model, model2, with B.width
(y ) as the response variable and A.length (x1) , A.width (x2) and B.length (x3) as
explanatory variables:
Forward selection
(i) (a) Fit a linear regression model, fit4, with B.width as the response variable
and a two-way interaction term between the two most significant variables.
(b) Compare the adjusted R2 of fit3 and fit4. Comment on these values and the
results of the ANOVA output.
(ii) Create two further models, fit5 and fit6, each containing the three
explanatory variables from fit3 plus a single two-way interaction term.
Show that only one of them improves the value of the adjusted R² but the
ANOVA output shows that there is no significant improvement in fit.
(iii) Explain why we would not consider adding a three-way interaction term in
this case.
Backward selection
Start with creator Matt linear regression model, fitA, with B.width (y) as the
response variable and A.length (x1), A.width (x2) and B.length (x3) as explanatory
variables, together with all two and three way interactions.
(iv) Update the model fitA to create fitB, fitC, etc by removing:
Each time compare only the adjusted R2 of the models to ensure only those
models which improve the fit are kept.
(v) Comment on the limitations of only using adjusted R2 as a basis for model
fit.
ANSWERS
Answer 1
(i) data <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"",
dec = ".")
attach(data)
#We want columns 1 to 4
MDF <- data[Creators=="Matt",1:4]
MDF
Answer 2
data <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"", dec
= ".")
attach(data)
MDF <- data[Creators=="Matt",1:4]
MDF
#which we fitted the following model to
model2 <- lm(B.width~A.length+A.width+B.length,data=MDF)
Answer 3
(i) F test
summary(model2)
#F statistic = 36.98 #p-value = 2.571e-12
#reject H0, there is at least one non-zero slope parameter
Answer 4
(i) Estimate B width
#from 1st principles
coef(model2)[1]+coef(model2)[2]*5.94+coef(model2)[3]*2.77+coef(model2)[4]*4.26
#or using predict
#Wrap the parameters inside a data frame
newdata2 <-data.frame(A.length=5.94,A.width=2.77,B.length=4.26)
#Then we use the 'predict' function
predict(model2,newdata2) #answer:1.325704cm
Answer 5
(i)(a) Residuals from 1st principles
#residuals are the difference btwn true y value and fitted y value
#since data is attached
B.width-fitted(model2)
#but that gives all 150 values, so:
MDF[,4]-fitted(model2)
MDF$B.width-fitted(model2)
B.width[51:100]-fitted(model2)
(b) Residuals using command
model2$residuals
model2$resid
residuals(model2)
resid(model2)
(ii) Plot the residuals against the fitted values and comment
(a)plot(model2,1)
(b) #68, 69, 74 are marked as outliers but still within 3 sds
3*summary(model2)$sigma
#variance appears to start increasing towards the end
#so may not be constant
Answer 6
#Forward selection
(i) Null model
fit0 <- lm(B.width ~ 1, data = MDF)
summary(fit0)
Answer 7
#Forward selection
fit3 <- lm(B.width~B.length+A.width+A.length,data=MDF)
#note that the order is different to model2
model2 <- lm(B.width~A.length+A.width+B.length,data=MDF)
fitD <- update(fitC, . ~ . - A.length:B.length)
summary(fitD)
#adjusted R² has fallen from 0.6975 to 0.6919
#implies we should not remove it, yet none of the coefficients are significant
#similarly, removing the other interaction would see a fall to 0.681
(c)#not appropriate to remove single terms when 2-way interactions
#that involve them are still present
#Note we would have got this model with forward selection if we had
#ONLY considered the adjusted R² and not the results of the ANOVA test
(v) Comment
#even though we have maximised the adjusted R²
#none of the coefficients are significant
#so need a better method of fit - hence tend to use the ANOVA test
#between models to check improvement (although in a later unit we use AIC)
ASSIGNMENT – GLMs
Question 1. Two objects, A and B have been made by three creators, Tom, Matt and John. The
dataset “data1” contains 50 values each of the length and width of A and B created
by the three creators.
(i) Extract the four measurements corresponding to the creator Matt and
store them in the data frame MDF.
(ii) Using the data for Matt, fit a linear regression model, model2, with B.width
as the response variable and A.length, A.width and B.length as explanatory
variables:
(iii) (a) Use the function glm to fit an equivalent generalised linear model,
glmodel, to the creator Matt data. State explicitly the appropriate family
and the link function in the arguments.
(b) Confirm that the estimated parameters are identical to the linear model
in part (ii).
(c) Give a shortened version of the R code from part (iii)(b) that will fit the
same GLM as part (iii)(a) but makes use of the default settings of the glm
function.
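No worked answers are shown for the GLM assignment; a minimal sketch of part (iii), assuming MDF and model2 from parts (i)-(ii) and normally distributed B.width:
#(iii)(a) explicit family and link
glmodel <- glm(B.width~A.length+A.width+B.length, data=MDF,
family=gaussian(link="identity"))
#(b) the estimated parameters match the linear model
coef(glmodel)
coef(model2)
#(c) shortened version relying on the glm defaults
glmodel <- glm(B.width~A.length+A.width+B.length, data=MDF)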
Question 2. Two objects, A and B have been made by three creators, Tom, Matt and John. The
dataset “data1” contains 50 values each of the length and width of A and B created
by the three creators.
(i) (a) Assuming that the measurements are normally distributed, use the
function glm to fit a generalised linear model, glmodel1, with B.width as the
response variable and A.length (x1) , A.width (x2) , B.length (x3) and Creator
( γi )as explanatory variables:
(c) Explain what has happened to the coefficient for the creator Tom.
(ii) State the code for a linear predictor which also included a quadratic effect
from B.length.
The built-in data set esoph contains data from a case-control study of oesophageal
cancer in Ille-et-Vilaine, France. agegp contains 6 age groups, alcgp contains 4
alcohol consumption groups, tobgp contains 4 tobacco consumption groups, ncases
gives the number of observed cases of oesophageal cancer out of the group of size
ncontrols.
(iii) Fit a binomial generalised linear model, glmodel2, with a logit link function
to estimate the probability of obtaining oesophageal cancer as the response
variable and a linear predictor containing the main effects of agegp ( αi ),
alcgp ( βj ) and tobgp ( γk )
(iv) State the code for a linear predictor which also has interaction between
alcohol and tobacco.
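A sketch for parts (iii) and (iv), treating ncases as successes and ncontrols as failures in line with the data description above:
#(iii) main-effects binomial GLM with logit link
glmodel2 <- glm(cbind(ncases,ncontrols)~agegp+alcgp+tobgp,
family=binomial(link="logit"), data=esoph)
#(iv) linear predictor with an alcohol:tobacco interaction, i.e.:
#glm(cbind(ncases,ncontrols)~agegp+alcgp*tobgp, family=binomial, data=esoph)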
Question 3. The first two parts use the data1 generalised linear model, glmodel1, with B.width as
the response variable and A.length (x1) , A.width (x2) , B.length (x3) and Creator ( γi )as
explanatory variables:
(i) (a) State the statistic, p-value and conclusion for a test of H0 : β2 =0 = vs H1 :
β2 ≠ 0.
(a) obtain a 90% confidence interval for the creator Matt coefficient.
The next two parts use the claims binomial probability generalized linear model,
glmodel2, with the probability of claim as the response variable and a linear predictor
containing the main effects of age ( αi ), alcohol ( βj ) and tobacco( γk )
(iii) State the p-value and conclusion for a test that the second non-base
category in the age group is zero.
(a) obtain a 99% confidence interval for the third non-base coefficient in the
alcohol group.
(b) test, at the 5% level, whether the first non-base coefficient in the
tobacco group is equal to 0.5.
Question 4. The first three parts use the “data1” generalised linear model, glmodel1, B.width as
the response variable and A.length (x1) , A.width (x2) , B.length (x3) and Creator
(γi)as explanatory variables:
(i) (a) Obtain the residual degrees of freedom and residual deviance for this
model.
(iii) (a) Create a new GLM, glmodel01, which does not contain creator as an
explanatory variable.
(c) Use anova to carry out a formal F test to compare these two models.
The last part of this exercise uses the oesophageal cancer binomial probability
generalised linear model, glmodel2, with the probability of obtaining oesophageal
cancer as the response variable and a linear predictor containing the main effects of
age ( αi ), alcohol ( βj ) and tobacco( γk ) :
(iv) (a) Create a new GLM, glmodel02, which does not contain tobacco as an
explanatory variable.
(c) Use anova to carry out a formal χ2 test to compare these two models.
Question 5. We are fitting generalised linear models to “data1” with B.width as the response
variable and a combination of A.length, A.width, B.length and Creator as
explanatory variables, assuming the measurements are normally distributed.
Forward selection
(i) Fit the null generalised linear model, fit0, to the data.
First covariate
(ii) (a) By examining the scatterplot of all the pairs of variables explain why
either Creator or B.length should be chosen as our first explanatory
variable.
(b) Fit a linear regression model, fit1a, with B.width as the response
variable and Creator as the only explanatory variable. Determine the AIC
for fit1a.
(c) Fit a linear regression model, fit1b, with B.width as the response
variable and B.length as the only explanatory variable. Determine the AIC
for fit1b.
(d) By examining the AIC of fit1a and fit1b choose the model that provides
the best fit to the data.
(e) Use the anova function to carry out an F test comparing fit0 and the
model chosen in part (ii)(d).
Second covariate
(iii) (a) Fit a linear regression model, fit2, with B.width as the response
variable and both Creator and B.length as the explanatory variables.
(b) By examining the AIC and carrying out an F test compare fit2 and the
model chosen in part (ii)(d).
Third covariate
(iv) (a) Fit a linear regression model, fit3a, with B.width as the response
variable and Creator, B.length and A.length as explanatory variables.
Determine the AIC for fit3a.
(b) Fit a linear regression model, fit3b, with B.width as the response
variable and Creator, B.length and A.width as explanatory variables.
Determine the AIC for fit3b.
(c) By examining the AIC of fit3a and fit3b choose the model that provides
the best fit to the data.
(d) Use the anova function to carry out an F test comparing fit2 and the
model chosen in part (iv)(c).
Fourth covariate
(v) (a) Fit a linear regression model, fit4, with B.width as the response
variable and all four covariates as the explanatory variables.
(b) By examining the AIC and carrying out an F test compare fit4 and the
model chosen in part (iv)(c).
Fifth covariate
(vi) (a) Fit a linear regression model, fit5a, with B.width as the response
variable, all four covariates as main effects and an interactive term
between Creator and A.width as explanatory variables. Determine the AIC
for fit5a.
(b) Fit a linear regression model, fit5b, with B.width as the response
variable, all four covariates as main effects and an interactive term
between B.length and A.width as explanatory variables. Determine the
AIC for fit5b.
(c) By examining the AIC of fit5a and fit5b choose the best fit to the data.
(d) Use the anova function to carry out an F test comparing fit4 and the
model chosen in part (vi)(c).
Sixth covariate
(vii) (a) Fit a linear regression model, fit6, with B.width as the response
variable, all four covariates as main effects, the interactive terms between
Creator and A.width, and between B.length and A.width as explanatory
variables.
(b) By examining the AIC and carrying out an F test compare fit6 and the
model chosen in part (vi)(c).
Seventh covariate
(viii) Show that adding an interaction term between B.length and A.length to fit6
leads to a drop in the AIC and a significant improvement in the residual
deviance.
It can be shown that adding the other two-way interaction terms does not improve the
AIC or lead to a significant improvement in the residual deviance.
(ix) Explain why we should not add any three-way interaction terms at this
stage.
Backward selection
(x) Fit the full generalised linear model, fitA, to the data1 data to model
B.width using Creator*B.length*A.length*A.width and show the AIC is
−109.79.
(xi) Show that the generalised linear model, fitB, which removes the four-way
interaction term leads to an improvement in the AIC.
It can be shown that two three-way interaction terms have parameters that are
insignificant.
(xii) (a) Update the model fitB to create fitC1 by removing the three-way
interaction between Creator, B.length and A.width. Determine the AIC for
fitC1.
(b) Update the model fitB to create fitC2 by removing the three-way
interaction between Creator, B.length and A.length. Determine the AIC for
fitC2.
Let fitC be the model from parts (xii)(a) and (b) which produces the biggest
improvement in the AIC.
(xiv) Show that the generalised linear model, fitE, which removes another
insignificant three-way interaction term, also leads to an improvement
in the AIC.
(xv) Use the summary function to show that the parameter of the final
three-way interaction term is still significant but that the two-way
interaction term between Creator and A.length is not. Update the
model fitE to create fitF by removing this two-way interaction and show
it leads to an improvement in the AIC.
(xvi) Use the summary function to show that the parameters of three of the
two-way interaction terms are insignificant at the 5% level. Show that
removing any of these terms does not lead to an improvement in the AIC.
Question 6. This question uses the “data1” generalised linear model, glmodel1, with B.width as the
response variable and A.length (x1), A.width (x2), B.length (x3) and Creator (γi) as
explanatory variables:
(i) Obtain the value of the linear predictor for glmodel1 for creator Matt with
A.length 5.1cm, A.width 3.5cm and B.length 1.4cm.
(ii) (a) Explain why the expected B.width for creator Matt will be the same as
the linear predictor in part (i).
(b) Show that this is the case by using the predict function.
(iii) Explain why there is no constant for the creator Tom in the linear predictor.
(iv) Obtain the expected B.width for creator Tom with A.length 5.1cm, A.width
3.5cm and B.length 1.4cm.
Question 7. This question uses the “data1” generalised linear model, glmodel1, with B.width as the
response variable and A.length (x1), A.width (x2), B.length (x3) and Creator (γi) as
explanatory variables:
(i) Obtain the raw residuals for the generalised linear model.
(ii) Show that the raw residuals are the same as the deviance residuals and the
Pearson residuals.
(iii) By examining the median, lower and upper quartiles of the residuals,
comment on their skewness.
(iv) (a) Obtain a plot of the residuals against the fitted values.
(b) Comment on the constancy of the variance of the residuals and whether
a normal model is appropriate.
(vi) Examine the final two graphs obtained by plot(glmodel1) and comment.
ANSWERS
Answer 1
(i)data1 <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"", dec = ".")
attach(data1)
#We want columns 1 to 4
MDF <- data1[Creators=="Matt",1:4]
MDF
Answer 2
(i)data1 <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"",
dec = ".")
(a)glmodel1 <-
glm(B.width~A.length+A.width+B.length+Creators,data=data1,family=gaussian
(link="identity"))
attach(data1)
glmodel1 <- glm(B.width~A.length+A.width+B.length+Creators)
(b)coef(glmodel1)
#or
glmodel1$coef
(c)#It has been absorbed into the 'intercept' coefficient.
#Since the binomial canonical link function (logit) is the default, this is the same as:
glmodel2 <- glm(cbind(ncases,ncontrols) ~ agegp+alcgp+tobgp, data=esoph,
family=binomial)
#or if attach esoph
attach(esoph)
glmodel2 <- glm(cbind(ncases,ncontrols) ~ agegp+alcgp+tobgp,
family=binomial)
glmodel2
Answer 3
data1 <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"", dec
= ".")
glmodel1 <-
glm(B.width~A.length+A.width+B.length+Creators,data=data1,family=gaussian
(link="identity"))
(i) Test beta2=0
(a)#beta2 is the coefficient of A.width
summary(glmodel1)
#t statistic 5.072, p-value 1.20e-06 (ie 0.00000120)
#reject H0, beta2 is not equal to zero
(b)#The p-value is in the 3rd row and 4th column
coef(summary(glmodel1))[3,4]
summary(glmodel1)$coef[3,4]
(iii) Test whether 2nd non-base category in the age group is zero.
#The base category is absorbed into the intercept coefficient
#the second non-base category is the 'Quadratic' category denoted by Q
#ie agegp.Q
summary(glmodel2)
#t statistic -2.263, p-value 0.02362
#reject H0, parameter is not equal to zero
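Part (iv) is not reproduced above. A minimal sketch, assuming the coefficient names alcgp.C and tobgp.L shown by summary(glmodel2) (the non-base categories of the ordered factors appear as .L, .Q and .C contrasts):
#(a) 99% confidence interval for the third non-base alcohol coefficient
confint.default(glmodel2, "alcgp.C", level=0.99)
#(b) test H0: first non-base tobacco coefficient = 0.5
est <- coef(summary(glmodel2))["tobgp.L","Estimate"]
se <- coef(summary(glmodel2))["tobgp.L","Std. Error"]
2*pnorm(-abs((est-0.5)/se)) #two-sided p-value; reject H0 at the 5% level if below 0.05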
Answer 4
data1 <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"", dec
= ".")
glmodel1 <-
glm(B.width~A.length+A.width+B.length+Creators,data=data1,family=gaussian
(link="identity"))
(i)(a)#Residual dof and deviance are given in both the model output and the summary
glmodel1
summary(glmodel1)
#Residual dof = 144
#Residual deviance = 3.998
(b)glmodel1$df.res
glmodel1$dev
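The answers to parts (iii) and (iv) are not reproduced above. A minimal sketch, assuming glmodel2 has been fitted to the esoph data as in Answer 2:
#(iii)(a) remove Creators as an explanatory variable
glmodel01 <- update(glmodel1, .~. - Creators)
#(iii)(c) formal F test comparing the two models
anova(glmodel01, glmodel1, test="F")
#(iv)(a) remove tobacco as an explanatory variable
glmodel02 <- update(glmodel2, .~. - tobgp)
#(iv)(c) formal chi-squared test comparing the two models
anova(glmodel02, glmodel2, test="Chisq")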
Answer 5
#Forward selection
(i) Null model
fit0 <- glm(B.width~1,data=data1)
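The worked answers for parts (ii) to (ix) are not reproduced above. A minimal sketch of part (ii), using the Creators column name from the earlier answers (the AIC values depend on the data):
#(ii)(b) Creators as the only explanatory variable
fit1a <- glm(B.width~Creators, data=data1)
fit1a$aic
#(ii)(c) B.length as the only explanatory variable
fit1b <- glm(B.width~B.length, data=data1)
fit1b$aic
#(ii)(d) choose whichever of fit1a and fit1b has the lower AIC
#(ii)(e) compare the chosen model with the null model, for example:
anova(fit0, fit1b, test="F")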
#Backward selection
(x) Full model
fitA <- glm(B.width~Creators*B.length*A.length*A.width,data=data1)
fitA$aic #AIC is -109.79
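The answer to part (xi) is not reproduced above. A minimal sketch:
#(xi) remove the four-way interaction term from the full model
fitB <- update(fitA, .~. - Creators:B.length:A.length:A.width)
fitB$aic #an improvement if lower (more negative) than -109.79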
(xvi) Removing the other two-way interaction terms does not improve the AIC
summary(fitF)
# The Creators:A.width parameters are not significant
# The Creators:B.length parameters are not significant
# The A.length:A.width parameter is not significant either
fitG1 <- update(fitF,.~.-Creators:A.width)
summary(fitG1) #AIC is -116.14, which is worse
fitG2 <- update(fitF,.~.-Creators:B.length)
summary(fitG2) #AIC is -118.8, which is worse
fitG3 <- update(fitF,.~.-A.length:A.width)
summary(fitG3) #AIC is -118.61, which is worse
step(fitA,scope=~Creators*B.length*A.length*A.width,direction="backward",test="F")
Answer 6
data1 <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"", dec
= ".")
glmodel1 <- glm(B.width~A.length+A.width+B.length+Creators,data=data1,family=gaussian(link="identity"))
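The worked answers for parts (i) to (iv) are not reproduced above. A minimal sketch using the predict function, assuming the creator levels in data1 are named "Matt" and "Tom":
#(i) linear predictor for creator Matt
newdata <- data.frame(A.length=5.1, A.width=3.5, B.length=1.4, Creators="Matt")
predict(glmodel1, newdata, type="link")
#(ii)(b) the link is the identity, so the fitted mean equals the linear predictor
predict(glmodel1, newdata, type="response")
#(iv) expected B.width for creator Tom
predict(glmodel1, data.frame(A.length=5.1, A.width=3.5, B.length=1.4, Creators="Tom"), type="response")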
Answer 7
(i) Raw residuals
(a) From 1st principles
attach(data1)
B.width-fitted(glmodel1)
#or
data1[,4]-fitted(glmodel1)
#answer: -0.0396860931,...., -0.1867598541
(b) Using residuals command
resid(glmodel1,type="response")
#answer: -0.0396860931,...., -0.1867598541
#Note: glmodel1$resid gives the standardised (deviance) residuals, not the raw residuals.
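The answer to part (iii) is not reproduced above. A minimal sketch:
#(iii) median and quartiles of the raw residuals
summary(resid(glmodel1, type="response"))
#compare (median - Q1) with (Q3 - median): a larger upper gap suggests
#positive skew, a larger lower gap suggests negative skew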
(iv) Plot the residuals against the fitted values and comment
(a)plot(glmodel1,1)
(b)#The middle section is good but there are issues in the extremes.
#The residuals at the lower end are more negative than expected, so the fitted values are too large.
#The residuals at the upper end are more positive than expected, so the fitted values are too small.
The Bayesian estimate for p under quadratic loss for a single outcome x is the posterior mean, (a + x)/(a + b + 1).
(b) Repeat part (i)(b) but also store the ith theoretical Bayesian estimate in
the ith element of pm.
(iv) (a) Repeat part (iii)(b) but also store the ith theoretical Bayesian estimate
in the ith element of pm.
Question 2. Consider the n = 30 independent and identically distributed observations (y1, y2, ..., yn)
given below from a random variable Y with probability distribution
f(y, θ) = θ^y e^(−θ) / y!
By assuming a prior distribution proportional to e^(−aθ), we can show that the posterior
distribution of θ is:
f(θ | y1, y2, ..., yn) ∝ θ^(Σ yi) e^(−(n+a)θ)
where the sum runs over i = 1, ..., n.
(i) (a) Plot the posterior probability density function of θ for values of θ in the interval
[3.2, 6.8] and assuming a = 0.01. [Hint: the range of values of θ can be
obtained in R by seq(3.2, 6.8, by = 0.01).]
(b) Carry out a simulation of N = 5,000 posterior samples for the parameter θ using
seed 100.
(iii) Calculate the mean, median and standard deviation of the posterior distribution
of θ.
Two possible values for the true value of parameter θ are θ = 15 and θ = 5.
(iv) Comment on these two values based on the posterior distribution of θ plotted
in part (ii) and summarised in part (iii).
ANSWERS
Answer 1
(i) Binomial/beta posterior (sample size 1)
a <- 2
b <- 3
(a) x <- rep(0,1000)
(b)set.seed(77)
for (i in 1:1000)
{p <- rbeta(1,a,b)
x[i] <- rbinom(1,1,p)}
(c)mean(x) #answer: 0.382
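The remaining parts of this answer are not reproduced above. A minimal sketch of the part that also stores the theoretical Bayesian estimates, using the posterior mean (a + x)/(a + b + 1) for a single Bernoulli outcome and reusing the seed from (b):
pm <- rep(0,1000)
set.seed(77)
for (i in 1:1000)
{p <- rbeta(1,a,b)
x[i] <- rbinom(1,1,p)
pm[i] <- (a+x[i])/(a+b+1)} #theoretical Bayesian estimate for outcome x[i]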
Answer 2
(i)(a) ## Data entry
y = c(5, 5, 6, 2, 4, 10, 2, 5, 5, 2, 5, 3, 7, 4, 4, 5, 4, 6, 7, 2, 8, 4, 6, 4, 3, 6, 6, 6, 5, 7)
## plot the posterior pdf of theta
theta = seq(3.2, 6.8, by = 0.01)
plot(theta, dgamma(theta, sum(y)-1, length(y) + 0.01), ylab ="Density",
type = "l")
(b) set.seed(100)
x = rgamma(5000, sum(y)-1, 30 + 0.01)
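The answer to part (iii) is not reproduced above. A minimal sketch using the simulated posterior sample x from (i)(b):
#(iii) mean, median and standard deviation of the posterior distribution
mean(x)
median(x)
sd(x)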
(iv) 15 lies far outside the range of samples obtained for the posterior
distribution of θ, so it is very unlikely to be the true value unless
there is a calculation error. On the other hand, 5 lies near the centre
of the posterior distribution and is much more likely to be the true value.
(ii) Plot a labelled bar chart of the simulated credibility estimates for p using
the functions barplot and table.
(iii) Calculate the mean and standard deviation of the empirical credibility
estimates.
Answer 1
(i)
a <-2
b <-3
n <-5
Z <-n/(n+a+b)
(a) cp <- rep(0,1000)
(b) set.seed(79)
for (i in 1:1000){
p <- rbeta(1,a,b)
x <- rbinom(1,n,p)
cp[i] <- Z*x/n + (1-Z)*a/(a+b)
}
(ii) x <- seq(0.2, 0.7, by=0.1)
barplot(table(cp), names=x, xlab="credibility premium", ylab="frequency", main="bar chart of credibility premiums")
ASSIGNMENT – EBCT
Question 1. The table below shows the aggregate claim amounts (in £m) claimed from an insurer
by 4 companies over a 5-year period:
Year
Company    1    2    3    4    5
A         48   53   42   50   59
B         64   71   64   73   70
C         85   54   76   65   90
D         44   52   69   55   71
(i) Load the data frame and store it in the matrix “amounts”.
(ii) Store the number of years and number of companies in the objects n and N,
respectively.
An actuary is using EBCT Model 1 to set premiums for the coming year.
(iii) (a) Use mean and rowMeans (or otherwise) to calculate an estimate of
E[m(θ)] and store it in the object m.
(b) Use apply, var and mean to calculate an estimate of E[s2(θ)] and store it in
the object s.
(c) Use var and rowMeans (or otherwise) and your result from part (iii)(b) to
calculate an estimate of var[m(θ)] and store it in the object v.
(iv) Use your results from parts (ii) and (iii) to calculate the credibility factor and
store it in the object Z.
(v) Calculate the EBCT premiums for each of the four companies.
Question 2. This question also uses the data from the previous question.
The table below shows the volumes of business for each company in each year for
the insurer.
Year
Company    1    2    3    4    5
A         12   15   13   16   10
B         20   14   22   15   30
C          5    8    6   12    4
D         22   35   30   16   10
(i) Load the data frame of volumes and store it in the matrix “volume”.
An actuary is using EBCT Model 2 to set premiums for the coming year.
(ii) Calculate the claims per unit of risk volume and store them in the matrix X.
(iii) (a) Use rowSums to calculate the total policies for each company and store
them in the object Pi.
(b) Use sum to calculate the overall total policies for all companies and
store it in the object P.
(iv) (a) Use sum and your result from part (iii)(b) to calculate an estimate of
E[m(θ)] and store it in the object m.
(b) Use rowSums to calculate the mean claims per policy for each company
and store it in the object Xibar.
(c) Use rowSums and mean to calculate E[s2(θ)] and store it in the object s.
(d) Use sum and rowSums and your result from part (iv)(c) to calculate
var[m(θ)] and store it in the object v.
(v) Use your results from parts (iii) and (iv) to calculate the credibility factor for
each company and store the values in the object Zi.
(vi) If the volumes of business for each company for the coming year are 20, 25,
10 and 12, respectively, calculate the EBCT Model 2 premiums for each of
the four companies.
ANSWERS
Answer 1
(i)#set working directory to where the data is stored
#could store in data frame but question asks for matrix
amounts<-as.matrix(read.table("ebct1.txt",header=TRUE))
(ii)n<-ncol(amounts)
n
N<-nrow(amounts)
N # n=5, N=4
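The answers to parts (iii) to (v) are not reproduced above. A minimal sketch using the standard EBCT Model 1 estimators:
#(iii)(a) E[m(theta)]: overall mean claim amount
m <- mean(rowMeans(amounts))
#(iii)(b) E[s^2(theta)]: mean of the within-company sample variances
s <- mean(apply(amounts,1,var))
#(iii)(c) var[m(theta)]: variance of the company means less s/n
v <- var(rowMeans(amounts)) - s/n
#(iv) credibility factor
Z <- n/(n + s/v)
#(v) EBCT Model 1 premiums for the four companies
Z*rowMeans(amounts) + (1-Z)*m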
Answer 2
amounts<-as.matrix(read.table("ebct1.txt",header=TRUE))
(i) Load dataframe and store in matrix volume
#could store in data frame but question asks for matrix
volume<-as.matrix(read.table("ebct2.txt",header=TRUE))
(iv)(a) E[m(theta)] mean claims per policy for all companies (X bar)
m<-sum(amounts)/P
m
#answer 3.984127
(b) Mean claims per policy for each company (Xi bar)
Xibar<-rowSums(amounts)/rowSums(volume)
Xibar
#answer 3.818, 3.386, 10.57, 2.575
(c) E[s²(theta)]
#in one go
s <-mean(rowSums(volume*(X-Xibar)^2)/(n-1))
s
#in steps
row.var1<-rowSums(volume*(X-Xibar)^2)/(n-1)
s<-mean(row.var1)
s
#s = 104.642
(d) var[m(theta)]
#in one go
v<-(sum(rowSums(volume*(X-m)^2))/(n*N-1)-s)/Pstar
v
#in steps
row.var2<-rowSums(volume*(X-m)^2)
v<-(sum(row.var2)/(n*N-1)-s)/Pstar
v
#v = 6.538782
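The answers to parts (v) and (vi) are not reproduced above. A minimal sketch using the objects computed earlier:
#(v) credibility factor for each company
Zi <- Pi/(Pi + s/v)
Zi
#(vi) EBCT Model 2 premiums for next year's volumes
newvol <- c(20,25,10,12)
newvol*(Zi*Xibar + (1-Zi)*m)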