CS1B Workbook Answers

This document contains an index listing topics related to statistics and probability distributions in R, including arithmetic operations, objects, vectors, probability distributions like binomial, negative binomial, Poisson, normal, sampling distributions, simulations, estimation, confidence intervals, hypothesis testing, linear regression, generalized linear models, Bayesian statistics, and credibility theory. The document provides examples of assignments and questions to practice these statistical techniques in R.



INDEX

1. Arithmetic Operations in R
2. Basic Arithmetic Functions
3. Objects
4. Vectors
5. Probability Distributions
6. Simulations
7. Conditional Expectations
8. Central Limit Theorem
9. Estimation
10. Confidence Intervals
11. Hypothesis Testing
12. Data Analysis
13. Linear Regression
14. Multiple Linear Regression
15. GLM
16. Bayesian Statistics
17. Credibility Theory
18. EBCT


ASSIGNMENT – Arithmetic Operations in R

Question Use R to calculate the following

(i) 5+10 (ii) 6-9 (iii) 7- - 9

(iv) -8+4 (v) 5x45 (vi) 56/14

(vii) 6 - (12 x 6)/8
(viii) 1- (3+5)/2

(ix) 8^5 (x) 1 - (2/(2+3))^3 (xi) (1 - 1.21^(-10))/0.21

(xii) Remainder when 67 is divided by 8

(xiii) 2/0

Answers (i) 15 (ii) -3 (iii) 26

(iv) -4 (v) 225 (vi) 4

(vii) Command 6-12*(6/8) Result= -3

(viii) Command 1-(3+5)/2 Result= -3

(ix) Command 8^5 or 8**5 Result= 32768

(x) Command 1-(2/(2+3))^3 Result= 0.936

(xi) Command (1-1.21^(-10))/0.21 Result= 4.054078

(xii) Command 67%%8 Result=3

(xiii) Inf


ASSIGNMENT – Basic Arithmetic Functions

1. - start a new session


- clear the console screen

2. Use R to calculate the following: -

(i) log(2050) (ii) ln (10) (iii) log5(35)

(iv) ln (-10) what does this output mean? [Hint: NaN means not a number]

(v) log 0 what does this output mean?

(vi) e^(-5)   (vii) ,'(√+1 / 4   (viii) 10!   (ix) 12C2
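
One possible set of commands for the readable parts of this assignment is sketched below (assuming log(2050) means log to base 10; in R, log() on its own is the natural log):

#question 1: in RStudio a new session can be started from the Session menu;
#Ctrl+L clears the console and rm(list=ls()) clears the workspace
#question 2
log10(2050)       #(i)
log(10)           #(ii) natural log
log(35,base=5)    #(iii) log to base 5
log(-10)          #(iv) NaN (with a warning) - the log of a negative number is not a real number
log(0)            #(v) -Inf - log x tends to minus infinity as x tends to 0
exp(-5)           #(vi)
factorial(10)     #(viii) 10!
choose(12,2)      #(ix) 12C2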


ASSIGNMENT – Using Objects

1. Store the values 50, 100, 150, 200 and 250 in the objects A, B, C, D and E, respectively.

2. Calculate A+B, B - A, C/A, B*E, ln(E), exp(D)

3. Remove the object B

4. Remove the objects A and C in one go.

5. Remove all the objects in the working memory.
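
A possible set of commands for this assignment (one of several ways):

A <- 50; B <- 100; C <- 150; D <- 200; E <- 250   #1. store the values
A + B; B - A; C / A; B * E; log(E); exp(D)        #2. calculations
rm(B)                                             #3. remove B
rm(A, C)                                          #4. remove A and C in one go
rm(list = ls())                                   #5. remove everything in the working memory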


ASSIGNMENT – Vectors

Question 1. Create a vector, x, containing the 100 numbers 21, 22, ..., 120 using:-

(a) the c function

(b) the seq function.

Question 2. Use the seq function to generate the following:

(a) 50, 51, ...., 95

(b) 50, 47, 44, ...., 14

(c) 7 values equally spread between 1 and 31 inclusive.

Question 3. Use indexing on the vector x from Q1 to display:


3rd element
4th and 7th elements
6th-10th elements
all but the 5th element
all but the 2nd and 8th elements
all but the 3rd to 6th elements
all elements which are smaller than 27.

Question 4. Create a vector a of (1, 2, 3, 4, 5, 6), a vector b of (0, 1) and a vector c of (5, 1, 3, 2).
What will be the result of:

b-1 b*c a+b a^b a/c
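
A possible set of commands (a sketch; running the last few lines shows how R's recycling rule handles the Question 4 vectors):

#Question 1
x <- c(21:120)            #(a) using c (with the colon operator)
x <- seq(21,120)          #(b) using seq
#Question 2
seq(50,95)                #(a)
seq(50,14,by=-3)          #(b)
seq(1,31,length=7)        #(c)
#Question 3
x[3]; x[c(4,7)]; x[6:10]
x[-5]; x[-c(2,8)]; x[-(3:6)]
x[x<27]
#Question 4
a <- c(1,2,3,4,5,6); b <- c(0,1); c <- c(5,1,3,2)
b-1; b*c; a+b; a^b
a/c                       #recycles but gives a warning as length 6 is not a multiple of 4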


ASSIGNMENT – Probability Distributions

Question 1. In roulette, the wheel has the numbers 0 to 36. The wheel is spun one way and a
ball is sent round the other way. Our event of interest is the number on which the ball
lands.

(i) Give the sample space of the above experiment and store it in object “S”.
(ii) Calculate a) P(S < 14) b) P(S ≥ 8) c) P(20 < S ≤ 32)
(iii) Set the seed as 100 and use sample function to generate a gambler’s outcomes
if he plays roulette 700 times.
(iv) Obtain a frequency table of the above sample.
(v) Draw a histogram of the sample.
(vi) Now, calculate empirical probabilities of part (ii) using your sample.
(vii) Find empirical values of
a) Mean b) Median c) Coefficient Of Skewness.

Question 2. In a remote island where tsunamis are regular, the survival probability from a tsunami is
78%. The population of the island is 15.

(i) Find the probability that 9 people survive, using


a) Factorial function, from scratch b) Inbuilt functions in R.
(ii) Find the following probabilities:-
a) At most 12 people survive. b) More than 6 people survive.
(iii) Draw a bar chart showing the no of people surviving a tsunami and their survival
probabilities.
(iv) Using bar chart, also obtain modal no. of people that’ll survive.
(v) Calculate the mean no of people who will survive (from scratch).
(vi) Draw a step graph of the CDF.
(vii) Obtain Median and IQR for no. of survivors.


Question 3. You throw darts at a board until you hit the center area. Probability of hitting the center
area is 0.17.

(i) Find the probability that it takes eight throws until you hit the center from
scratch and also using inbuilt function.
(ii) Calculate the probability that it takes:
(a) At least 5 throws (b) One throw (c) More than 10
throws until you hit the centre.
(iii) Calculate the smallest no. of throws, x, before the first success such that:
(a) P(X ≤ x) ≥ 0.9 (b) P(X > x) ≤ 0.4
(iv) Draw a step graph of CDF.
(v) Simulate 500 values of the experiment; and then compare empirical and
theoretical mean and variance. (Using seed =50)
(vi) Find the empirical probabilities for part (ii) and comment.

Question 4. Bob is a high school basketball player. His probability of making a free throw is 0.70.

(i) During the season, what is the probability that Bob makes his third free throw
on his fifth shot?

(a) From scratch (b) Using Inbuilt Function

(ii) Draw 4 bar charts of the probability function for negative binomial distributions
with p = 0.7 and k = 1 ,2, 3 and 4, with titles showing the value of k. (all charts
should be displayed simultaneously).

(iii) Find the probability that: -

(a) at most 3 shots didn’t become free throw before the 5th one did.

(b) more than 3 shots didn’t become free throw before the 4th one did.

(iv) Calculate the smallest number of shots that didn't become free throws, x,
before the fourth one did, such that: - (a) P(X ≤ x) ≥ 0.6 (b) P(X > x) ≤ 0.6.

(v) Simulate 1200 values of the number of shots which didn’t become a free throw
before the 4th one did. (Use seed=70)

(vi) Plot a histogram of the data obtained in (v) and superimpose a line of
theoretical expected frequencies on it. Comment on your findings.


(vii) Find theoretical and empirical values of: - (a) Standard Deviation (b) IQR

Question 5. Suppose we randomly select 5 cards without replacement from an ordinary deck of
playing cards.

(i) What is the probability of getting exactly 2 red cards (i.e., hearts or diamonds)?
a) From Scratch b) Using inbuilt function in R c) Using a binomial
approximation
(ii) Draw 4 bar charts showing the number of cards selected being red from
samples of size 5, 10, 15 and 20.
(iii) Find the probability that less than 2 cards in the selected cards are red.
(iv) Draw a step graph of the CDF .
(v) Simulate the no of cards which are red in the cards selected 1000 times. Use
seed= 45.
(vi) Find the empirical probability for part (iii) and comment.
(vii) Find empirical and theoretical values of:- a) Mean b) Lower Quartile
(viii) Draw line graph of the binomial approximation to the probabilities of the
number of red cards obtained from the selected cards. Superimpose the actual
probabilities on this graph, and comment.

Question 6. Complaints of hair found in McDonald's hamburgers occur at an average rate of 4 per week.

(i) Find the probability that there are 7 complaints per week;
a) From scratch b) Using inbuilt dpois function

(ii) Draw 4 bar charts showing the number of complaints in a week if it occurs at
rates of 2, 4, 8 and 20 per week; and comment on the shape and distribution for
larger values of lambda. (all charts should be displayed simultaneously).

(iii) Find the probability that there are at most 5 complaints in a week.

(iv) Draw a step graph of the CDF.

(v) Simulate the number of complaints received in 1000 separate weeks & plot a
histogram of the data. Use seed = 50

(vi) Find the empirical and theoretical values of the IQR.

(vii) Create a vector “A” containing 1000 zeroes.


a) Store the mean of the first i values obtained in (v) in the ith element of “A”
using for loop.


b) Draw graph of “A” showing how mean of simulations changes over the 1000
values compared to the true value.
Question 7. Suppose that on a certain highway, cars pass at an average rate of five cars per minute.
Assume that the duration of time between successive cars follows the exponential
distribution.

(i) Calculate the pdf when x=1: -


a) From Scratch b) Using inbuilt function

(ii) Draw graph of the PDF for x∈ (0,6) using the:


a) plot function b) curve function
c) plot function to draw a blank set of axes and then the lines function to draw
the PDF.

(iii) Add to any one of your graphs from part (ii) the following: -
a) a green dotted line showing the PDF of an exponential distribution with λ = 8
b) a red dashed line showing the PDF of an exponential distribution with λ =2
Also add legend to the graph obtained.

(iv) Calculate the probability of waiting: -


a) more than half a minute b) between 2.5 and 3.5 minutes

(v) Calculate the number of minutes waited, x, such that: -


a) P(X ≤ x) = 0.8 b) P(X > x) = 0.3

(vi) Calculate IQR for the waiting time.

(vii) Simulate 1000 waiting times (Use seed=90).


a) Plot a histogram of their densities.
b) Superimpose on the histogram a graph of the actual pdf.

(viii) Obtain the empirical probabilities for part (iv) and compare your answers.
(ix) Obtain the empirical and theoretical:-
a) Mean b) Standard Deviation c) 95th Percentile


(x) (a) Create a vector “M” which contains 500 zeros.


(b) Use a loop to store the median of the first i values of simulations of part (vii)
in the ith value of “M”.
(c) Plot a graph of the object “M” showing how the median of the simulations
changes over the 500 values and show the median of the distribution on the
same graph.

Question 8. Time spent on a computer (X) is gamma distributed with mean 20 min and variance 80
min2.
(i) Obtain the pdf of this gamma distribution when x= 45mins.
(ii) Obtain a graph of the pdf for this gamma distribution for x∈(10,50).

(iii) Find the probability that the time spent on the computer is:-
a) At most 35 mins b) between 40 to 60 mins.
(iv) Calculate the IQR.
(v) Simulate 1600 values of this distribution using seed = 40.

(vi) Draw a labelled diagram of the histogram of the densities of the data you
simulated and superimpose the graph of the actual pdf on it.

(vii) Obtain the empirical probabilities for part (iii) using the data you simulated.

Question 9. (i) Find the pdf when x= 90 such that:-


(a) X is normally distributed with mean =75 and variance= 36.
(b) X follows lognormal distribution with μ = 4.5 and σ2 = 0.005.
(ii) Obtain the probabilities:-
a) P( 60 < N(75,36) < 85)
b) P( logN (4.5,0.005) < 80)
(iii) Find the 70th percentile for:
a) a normal distribution with mean 75 and variance 36
b) a lognormal distribution with parameters μ = 4.5 and σ2 = 0.005
(iv) Simulate 1,200 values from a lognormal distribution with μ = 4.5 and σ2 = 0.005.
Use seed = 50.
(v) Plot the PDF of a lognormal with parameters μ = 4.5 and σ2 = 0.005 for


x∈(60,130) and superimpose the empirical PDF on the graph.


(vi) Use the generated data to calculate empirical probabilities in ii(b) and iii(b) and
compare them to the true values.


Solutions
Answer 1
(i) Store outcomes in object S
#Any one of the following
S <- 0:36
S <- seq(0,36)
S <- seq(0,36,1)
S <- seq(0,36,by=1)
#could use S <- c(0,1,2,...,36) but would have to enter all 37 numbers

(ii) Calculating probabilities


(a) P(S < 14)
length(S[S<14])/length(S)
#answer: 0.3783784
(b) P(S>=8)
length(S[S>=8])/length(S)
#answer: 0.7837838
(c) P(20<S<=32)
length(S[S>20&S<=32])/length(S)
#answer: 0.3243243

(iii) Simulation
set.seed(100)
S1 <- sample(S, 700, replace=TRUE)

(iv) Table
table(S1)

(v) Histogram
hist(S1,breaks=(-0.5:40.5))


(vi) Empirical probabilities


(a) P(S < 14)
length(S1[S1<14])/length(S1)
#answer: 0.37
(b) P(S >= 8)
length(S1[S1>=8])/length(S1)
#answer: 0.7971429
(c) P(20<S<=32)
length(S1[S1>20&S1<=32])/length(S1)
#answer:0.2985714

(vii) Empirical moments


mean(S1)
#answer: 17.91857
median(S1)
#answer: 18
sd(S1)
#answer: 10.3999
skew <- sum((S1-mean(S1))^3)/length(S1)
skew/sd(S1)^3
#or
skew/var(S1)^1.5
#answer: 0.02160827


Answer 2
(i) Probability 9 people survive
#Any one of the following
n <- 15
p <- 0.78
x <- 9
(a)(factorial(n)/(factorial(n-x)*factorial(x)))*p^x*(1-p)^(n-x)
#Ans= 0.06064452
(b)dbinom(x,n,p)
#answer: 0.06064452

(ii) Calculating probabilities


(a) P(X <= 12)
#any of the following
pbinom(12,n,p)
#answer: 0.6730802
(b) P(X > 6)
#the quickest way is
pbinom(6,n,p,lower=FALSE)
#answer: 0.9983765

(iii) bar chart


x <- 0:15
barplot(dbinom(x,n,p),names=x,xlab="number of survivors",ylab="probability",main="Tsunami survival",col="green")


(iv) Mode
#we can see it is 12 survivors

(v) Mean
x <- 0:15
px <- dbinom(x,n,p)
sum(x*px)
#which is 11.7, approximately equal to the mode of 12

(vi) Step graph


prob <- seq(0,1,0.01)
plot(prob, qbinom(prob,n,p),type="s",xlab="cumulative probability",ylab="number of survivors",main="Graph Of CDF")

(vii)qbinom(0.5,n,p)
#Median = 12
#IQR
q1 <- qbinom(0.25,n,p)
q3 <- qbinom(0.75,n,p)
q3-q1
#Ans= 2


Answer 3
(i)
x <- 8
y <- x-1
p <- 0.17
#From Scratch
p*(1-p)^(y)
#Using inbuilt fn
dgeom(y,p)
#Ans= 0.04613129
(ii)(a)
pgeom(3,p,lower.tail = FALSE)
#Ans= 0.4745832
#(b)
dgeom(0,p)
#Ans= 0.17
#(c)
pgeom(9,p,lower.tail=FALSE)
#Ans= 0.1551604

(iii)(a)
qgeom(0.9,p)
#Ans= 12
#(b)
qgeom(0.4,p,lower.tail=FALSE)
#Ans= 4

(iv)
x <- 0:40
plot(x,pgeom(x,p),type="s",main="Step Graph of CDF",xlab="No of failures before 1st success",ylab="P(X<=x)")


(v)
set.seed(50)
sample <- rgeom(500,p)
mean(sample)
#Ans= 4.82
var(sample)
#Ans= 34.23206
#th mean = q/p
(1-p)/p
#Ans= 4.882353
#th var = q/p^2
(1-p)/(p^2)
#Ans = 28.71972
# we see that the empirical and theoretical means are in close agreement;
# however there is a slight difference between the empirical and theoretical variances.

(vi)(a)
length(sample[sample>=4])/length(sample)
#empirical value; compare with the theoretical 0.4746
#(b)
length(sample[sample==0])/length(sample)
#empirical value; compare with the theoretical 0.17
#(c)
length(sample[sample>9])/length(sample)
#empirical value; compare with the theoretical 0.1552

Answer 4
(i) Probability

#5th shot means 2 shots are not free throws before 3rd one is (Type 2 NB)

p <- 0.7
k <- 3
x <- 2

#(a) From scratch using gamma function


gamma(k+x)/(gamma(x+1)*gamma(k))*p^k*(1-p)^x
#answer:0.18522
(b) Using dnbinom
dnbinom(x,k,p)
#answer: 0.18522

(ii) 4 bar charts


(a)par(mfrow=c(2,2))
x <- 0:10
barplot(dnbinom(x,1,p),names=x, main="k=1",col="blue")
barplot(dnbinom(x,2,p),names=x, main="k=2",col="blue")
barplot(dnbinom(x,3,p),names=x, main="k=3",col="blue")
barplot(dnbinom(x,4,p),names=x, main="k=4",col="blue")


#reset graphic display


par(mfrow=c(1,1))

(iii) Calculating probabilities


(a)
x <- 5-3
k <- 3
pnbinom(2,k,p)
#answer: 0.83692
(b) 1-pnbinom(3,k,p)
pnbinom(3,k,p,lower=FALSE)
#answer: 0.07047

(iv) Quantiles
(a) Find smallest x such that P(X<=x)=0.6 or greater
qnbinom(0.6,k,p)
#answer: 1
#check
pnbinom(1,k,p)
pnbinom(2,k,p)
(b) Find smallest x such that P(X>x)<=0.6
qnbinom(0.6,k,p,lower=FALSE)
#answer: 1

(v) Simulations
set.seed(70)
k <- 4
N <- rnbinom(1200,k,p)

(vi) Histogram and line plot


(a) histogram
table(N)
#highest value seen in table(N) is 8


hist(N,breaks=(-0.5:10.5))
(b) superimpose
x<-0:10
lines(x,1200*dnbinom(x,k,p),type="o",col="blue")

(vii) Empirical and actual spread


(a) sd
sd(N)
#answer: 1.492474
#compare with actual
sqrt(k*(1-p)/p^2)
#answer:1.564922
#simulation slightly more spread out
(b) IQR
quantile(N,0.75)-quantile(N,0.25)
#Ans=2

Answer 5
(i) Probability
#sample size n=5 as 5 are selected
n <- 5
#the population has k=26 "successes"
k <- 26
#the population has N-k=26 "failures"
N <- 52
#We could calculate the probability of X=2 as follows
x <- 2
(a) From scratch using choose function
choose(k,x)*choose(N-k,n-x)/choose(N,n)
#answer: 0.3251301
(b) Using dhyper
dhyper(x,k,N-k,n)
#answer: 0.3251301
(c) Using binomial approx
p <- k/N
p


dbinom(x,n,p)
#answer:0.3125
(ii) 4 bar charts
(a)par(mfrow=c(2,2))
(b)x <- 0:20
barplot(dhyper(x,k,N-k,5),names=x, main="n=5",col="blue")
barplot(dhyper(x,k,N-k,10),names=x, main="n=10",col="blue")
barplot(dhyper(x,k,N-k,15),names=x, main="n=15",col="blue")
barplot(dhyper(x,k,N-k,20),names=x, main="n=20",col="blue")

(iii) Calculating probabilities


n <- 5
phyper(1,k,N-k,n)
#answer: 0.1749 (approx)

(iv) step graph of CDF


x <-0:5
plot(x, phyper(x,k,N-k,n),type="s")


(v) Simulations
set.seed(45)
H <- rhyper(1000,k,N-k,n)

(vi)length(H[H<2])/length(H)
#Ans= 0.183

(vii) mean(H)
#answer: 2.457
#compare with actual
n*k/N
#answer:2.5
quantile(H,0.25)
#answer: 2
#compare with actual
qhyper(0.25,k,N-k,n)
#answer:2

(viii)Binomial approximation
p <- k/N
x <-0:10
plot(x,dbinom(x,n,p),xlab="number of successes",
ylim=c(0,0.4),ylab="probability",type="o")
lines(x,dhyper(x,k,N-k,n),type="o",col="red")

Answer 6
(i)#Poisson(4)
m <- 4
#probability of 7 complaints in a week
x <- 7
(a) From scratch using exp and factorial function
m^x*exp(-m)/factorial(x)
#Ans= 0.1490028


(b) Using dpois


dpois(x,m)
#Ans= 0.1490028

(ii) 4 bar charts


par(mfrow=c(2,2))
x <- 0:30
barplot(dpois(x,2),names=x, main="m=2",col="blue")
barplot(dpois(x,5),names=x, main="m=5",col="blue")
barplot(dpois(x,10),names=x, main="m=10",col="blue")
barplot(dpois(x,20),names=x, main="m=20",col="blue")
#shifts to right and gets more symmetrical

#reset graphic display


par(mfrow=c(1,1))

(iii) Calculating probability


# at most 5 complaints; means <=
ppois(5,m)
#Ans= 0.3007083

(iv) step graph of CDF


x <-0:10
plot(x, ppois(x,m),type="s")

(v) Simulations
set.seed(50)
P <- rpois(1000,m)
table(P)
hist(P,breaks=(-0.5:11.5))


(vi)#theoretical
qpois(0.75,m)-qpois(0.25,m)
#Ans= 2
quantile(P,0.75)-quantile(P,0.25)
#Ans=3

(vii)Trend of mean
A <- rep(0,1000)
for (i in 1:1000)
{A[i] <- mean(P[1:i])}
x<-1:1000
plot(x,A[1:1000])
abline(h=m,col="red",lty=2,lwd=2)

Answer 7
(i) Calculate the PDF
l <- 5
x <- 1
(a)l*exp(-l*x)


(b)dexp(x,l)
#answer: 0.03368973
(ii) Labelled graph of PDF
x <- seq(0,6,by=0.01)
(a)plot(x, dexp(x,l),type="l",col="blue",ylab="f(x)",main="PDF of Exp(5)")
(b)curve(dexp(x,l),0, 6, col="blue",ylab="f(x)",main="PDF of Exp(5)")
(c)plot(x,type="n",xlim=c(0,6),ylim=c(0,3),xlab="x",ylab="f(x)",main="PDF of Exp(5)")
lines(x,dexp(x,l),type="l",col="blue")

(iii) l<-8
lines(x,dexp(x,l),type="l",col="green",lty=3)
l<-2
lines(x,dexp(x,l),type="l",col="red",lty=2)
# legend
legend("topright",title="PDF of Exp(l)",c("l = 5","l = 8","l =
2"),lty=c(1,3,2),col=c("blue","green","red"))


(iv) Calculating probabilities using pexp


l <- 5
(a) P(X > 0.5)
1-pexp(0.5,l)
#answer: 0.082085
(b) P(2.5 < X < 3.5)
pexp(3.5,l) - pexp(2.5,l)
#answer: 3.701543e-06

(v) Calculating values using qexp


l <- 5
(a) P(X <= x)=0.8
qexp(0.8,l)
#answer: 0.3218876
(b) P(X > x) = 0.3
qexp(0.3,l,lower=FALSE)
#or
qexp(0.7,l)
#answer: 0.2407946

(vi) Obtain the IQR for the waiting time


qexp(0.75,l) - qexp(0.25,l)
#answer: 0.2197225

(vii)l <- 5
n <- 1000
set.seed(90)
W <- rexp(n,l)
(a) Draw a labelled histogram
hist(W,prob=TRUE,xlab="simulated value",main="simulations from Exp(5)")
(b) Superimpose actual PDF
#Histogram goes from 0 to 2 or could use range
range(W)
x <- seq(0,2,by=0.01)
lines(x,dexp(x,l),type="l",col="blue")

(viii) Empirical probabilities


(a) P(X>0.5)
length(W[W>0.5])/length(W)
#answer: 0.07
#compare to actual probability of 0.082085
pexp(0.5,l,lower=FALSE)
(b) P(2.5<X<3.5)
length(W[W>2.5 & W<3.5])/length(W)
#answer: 0
#compare to actual probability of 3.701543e-06
pexp(3.5,l)-pexp(2.5,l)
(ix)Compare empirical & actual moments
(a) mean
mean(W)
#answer: 0.1994997
#compare with actual mean of 0.2
1/l
(b) standard deviation
sd(W)
#answer: 0.1934161
#th. sd = mean = 0.2
(c) 95th percentile
quantile(W,0.95)
#answer: 0.5931666
#compare with actual 95th percentile of 0.5991465
qexp(0.95,l)

(x) M <- rep(0,500)


(b) Loop
for (i in 1:500)
{M[i] <- median(W[1:i])}
(c) Plot trend
#plot the trend of the median
x<-1:500
plot(x,M[1:500])
#add horizontal line at the median
abline(h=qexp(0.5,l),col="red",lty=2,lwd=2)

#After 500 simulations still not settled down to long-term value


Answer 8
(i)#mean = a/l and var = a/l^2
#so 20 = a/l and 80 = a/l^2
#solving, we get
l <- 1/4
a <- 5
x <- 45
dgamma(45,a,l)
#ans = 0.002170331

(ii)x <- seq(10,50,0.01)


plot(x,dgamma(x,a,l))

(iii)(a)
#at most 35 means <=
pgamma(35,a,l)
#Ans= 0.9359932

(b)# between 40 to 60
pgamma(60,a,l)-pgamma(40,a,l)
# Ans= 0.02839605

(iv) qgamma(0.75,a,l)-qgamma(0.25,a,l)


#Ans= 11.62332

(v)set.seed(40)
S <- rgamma(1600,a,l)

(vi)hist(S,probability = TRUE,main="Graph of Densities of Sample",ylim=c(0,0.05))
#ylim set according to graph obtained so as to see complete graph
x <- seq(0,60,0.01)
lines(x,dgamma(x,a,l),col="red")

(vii)(a)
length(S[S<=35])/length(S)
#Ans= 0.93375
(b)between 40 to 60
length(S[S>40 & S<60])/length(S)
#Ans= 0.029375

Answer 9
(i) Values of PDF
x <- 90
(a)m <- 75
s <- sqrt(36)
dnorm(x,m,s)
#answer: 0.002921383
(b)M <- 4.5
S <- sqrt(0.005)
dlnorm(x,M,S)
#answer:0.0626875

(ii) Probabilities
(a)pnorm(85,m,s)-pnorm(60,m,s)
#answer: 0.946
(b)plnorm(80,M,S)
#answer: 0.04761864


(iii) 70th percentile


(a)qnorm(0.70,m,s)
#answer: 78.1464
(b)qlnorm(0.70,M,S)
#answer: 93.41769

(iv) Generating random simulations with exact mu and sigma


#1200 simulations from N(0,1)
n <- 1200
set.seed(50)
Z <- rnorm(n,0,1)
#Now standardise these
Z <- (Z-mean(Z))/sd(Z)
#Now destandardise to obtain N(M,S)
X <- Z*S + M
#Now find the exponential to obtain logN(M,S)
L <- exp(X)

(v) Graphs
(a)xval <- seq(60,130,by=1)
plot(xval,dlnorm(xval,M,S),type="l",col="blue")

(b)lines(density(L),col="red")

(vi) Empirical results


#We can calculate empirical probabilities, quantiles and moments in the usual way
(a)
length(L[L<80])/length(L)
#answer:0.05166667
#True value was 0.04761864
(b)
quantile(L,0.70)
#answer: 93.39722
#True value was 93.41769
#Very similar


ASSIGNMENT – Simulation

Question 1. A team has 20 people, each member has been independently infected by a deadly disease.
The survival probability for the disease is 65%.

(i) Use set.seed(28) and rbinom to simulate the number of survivors 400 times. Store this
in the object S.
(ii) (a) Use the table function on S to obtain a frequency table for the survivors.
(b) Hence, calculate the empirical probabilities.
(c) Compare the results of (b) with the actual probabilities from dbinom (round them
to 3DP using the round function).
(d) Use length to obtain the empirical probability of at most 13 survivors and compare
with the actual probability using pbinom.
(iii) (a) Draw a histogram of the results obtained from the simulation, centring the bars on whole numbers.
(b) Superimpose on the histogram a line graph of the expected frequencies for the
binomial distribution using the lines function.
(c) Comment on the differences.
(iv) Compare the following statistics for the distribution and simulated values:
(a) mean (b) standard deviation (c) IQR (use the quantile function).
(v) (a) Create a vector StdDev which contains 400 zeros.
(b) Use a loop to store the standard deviation of the first i values in the object S in the
ith element of StdDev.
(c) Plot a graph of the object StdDev showing how the standard deviation of the
simulations changes over the 400 values compared to a horizontal line showing the true
value.

Question 2. The probability of having a female child can be assumed to be 0.45 independently from
birth to birth.

(i) Use set.seed(35) and rgeom to simulate the number of sons before the first
daughter 1,000 times. Store this in the object G.

(ii) (a) Use length to obtain the empirical probabilities for


(1) at most 3 sons before her first daughter
(2) more than 4 sons before her first daughter
(3) 2 sons before her first daughter.
(b) Use quantile to calculate the empirical results for
(1) P(X ≤ x) ≥ 0.8 (2) P(X > x) ≤ 0.6

(iii) Compare the following statistics for the distribution and simulated values:
(a) mean (b) variance.


ANSWERS
Answer 1
n <- 20
p <- 0.65
(i) set.seed(28)
S<-rbinom(400,n,p)
S

(ii) (a) table(S)


#answer:
# 7 8 9 10 11 12 13 14 15 16 17 18
# 2 5 21 26 45 64 70 68 49 29 13 8

(b) table(S)/length(S) or table(S)/400


(c) round(dbinom(c(0:20),n,p),3)
#Fairly similar
(d) length(S[S<=13])/length(S)
#answer: 0.5825
pbinom(13,n,p)
#answer: 0.5833746
#simulations have slightly fewer small values

(iii) (a) hist(S,breaks=(-0.5:20.5))


(b) x<-0:20
lines(x,400*dbinom(x,n,p),type="o",col="blue")
(c) There is an overestimation from 10 to 16.
It doesn't seem to be particularly skewed, so this is possibly just random variation.

(iv) (a) mean(S)


#answer: 12.98
#compare with actual
n*p
#answer:13
#very similar
(b) sd(S)
#answer: 2.207211
sqrt(n*p*(1-p))
#answer: 2.133073
#slightly less spread
(c) quantile(S,0.75)-quantile(S,0.25)
#answer: 2
qbinom(0.75,n,p)-qbinom(0.25,n,p)
#answer: 2
#same
(v) (a) StdDev <- rep(0,400)
(b) for (i in 1:400)
{StdDev[i] <- sd(S[1:i])
}
StdDev
(c) x<-1:400
plot(x,StdDev[1:400])
#draw a red line to show correct value
abline(h=sqrt(n*p*(1-p)),col="red",lty=2,lwd=2)
#After 400 simulations still not settled down


Answer 2
(i) p <- 0.45
set.seed(35)
G <- rgeom(1000,p)

(ii) (a) (1) length(G[G<=3])/length(G)


#answer: 0.9
(2) length(G[G>4])/length(G)
#answer: 0.055
(3) length(G[G==2])/length(G)
#answer: 0.14

(b) (1) quantile(G,0.8)


#answer: 2
(2) quantile(G,0.4)
#answer: 0

(iii) (a) mean(G)


#mean 1.262
#compare with actual
(1-p)/p
#answer: 1.2222

(b) var(G)
#answer:2.9443
#compare with actual
(1-p)/p^2
#answer: 2.716049
#overestimates the mean and variance


ASSIGNMENT – Conditional Expectation


Question 1. The number of claims per month are independent Poisson random variables with mean
λ, where λ has an exponential distribution with mean 0.4. Let X denote the number of
claims in a month.
Hence, X| λ ~ Poisson(λ)

(i) Create a vector “values” that contains 5000 zeroes.

(ii) Using for loop, simulate 5000 values of X, by simulating one value of λ and then
using it to obtain values of X. Store it in the object “values”. Use seed= 40.

(iii) Use your simulations to obtain empirical values of mean and variance.

(iv) Calculate the theoretical mean and variance of X.

(v) Obtain a histogram of your simulations and use “breaks” to make sure that the
bars line up with the correct values.

(vi) Calculate the empirical probability that there are more than 2 claims in a
particular month.


ANSWERS
Answer 1

(i) values=rep(0,5000)

(ii) set.seed(40)
for (i in 1:5000)
{lambda <- rexp(1,(1/0.4))
values[i] <- rpois(1,lambda)}

(iii) mean(values) #answer: 0.389


var(values) #answer: 0.5193829

(iv) #Th. mean = E[X] = E[E(X|lambda)]= E[lambda] = 0.4


#Th. var = V[X]= E[var(X|lambda)] +var[E(X|lambda)] = E[lambda]+ var[lambda]
0.4 + (0.4)^2 #answer: 0.56

(v) range(values) #answer:0 6


hist(values,probability = TRUE,breaks=(-0.5:6.5))

(vi) length(values[values>2])/length(values) #answer: 0.0218


ASSIGNMENT – Central Limit Theorem

Question 1. Let X be the amount of time (in minutes) a postal clerk spends with his/her customer.
The time is known to have an exponential distribution with mean 4 minutes.

(i) Using seed = 50, simulate a sample of size 100 from this distribution.

(ii) Obtain the histogram of the simulated data.

(iii) (a) Create a vector “S” which contains 1,500 zeros.

(b) Using for loop, store the sum of the results obtained in the ith sample of 100
simulations in the ith element of S. Use seed = 60.

(iv) (a) Draw a labelled histogram of the probabilities of the results in “S”.

(b) Superimpose on the histogram a graph of approximate pdf of the sums


(using CLT).

(v) Find the probability that the sum of time spent with 100 customers is greater
than 400 minutes:-

(a) Empirically using your simulated data.

(b) Theoretically using CLT.

(vi) (a) Repeat parts (iv) and (v) for sample sizes of 10,30,50 and 200 customers.(All
4 histograms should be displayed simultaneously).

(b) Comment on your findings.

(vii) Find the mean and variance of your simulated sample and compare it with the
theoretical mean and variance (using CLT).

(viii) Calculate the median, lower and upper quartiles empirically and theoretically
(using CLT).


Question 2. The number of claims received in a day (X) follows Poisson distribution with mean 7.

(i) Simulate 1500 values from this distribution. Use seed=79.

(ii) Plot the histogram of the probabilities of the simulated data.

(iii) Superimpose on the histogram, the graph of pdf (using CLT).

(iv) (a) Calculate the empirical probability of more than 5 claims in a day.

(b) Calculate the equivalent probability using normal approximation.

(c) Compare the above two probabilities with the exact probability using ppois.

(iii) (a) Use qqnorm to obtain a QQ plot for the simulations and a normal
distribution.

(b) Use qqline to add a line.

(c) Comment on how close the normal distribution approximation is to the


Poisson.

Question 3. The following data represent the average total number of marks obtained for a
particular exam, observed over seven exam sessions that had been administered by a
professional examination body: 87 53 72 90 78 85 83

(i) Enter these data into R and compute their sample mean and variance.

(ii) Investigate whether the Poisson model is appropriate for these data, by
calculating the sample mean and sample variance of 10 Poisson samples having
the same size and mean as the sample given above.


ANSWERS
Answer 1
(i) Exponential distribution simulations
#mean = 4, so rate = ¼
l <- 1/4
n <-100
set.seed(50)
E <- rexp(n,l)

(ii) Histogram
hist(E,xlab="time",main="")

(iii) Sum of repeat samples


#We need to generate repeated samples in order to get the sums of each sample
#hence E<-rexp() needs to appear within the loop
S<-rep(0,1000)
set.seed(60)
for (i in 1:1000)
{E<-rexp(100,l);S[i]<-sum(E)}

(iv) (a)hist(S, prob=TRUE,xlab="sample sums",main="")


(b) range(S)
x<-seq(250,600,by=0.1)
hist(S, prob=TRUE,xlab="sample sums",main="")
curve(dnorm(x,n/l,sqrt(n/l^2)),add=TRUE,lwd=2,col="red")

(v) (a)length(S[S>400])/length(S) #answer: 0.497


(b)#since xsum ~ N(n/l,n/l^2)
pnorm(400,n/l,sqrt(n/l^2),lower=FALSE) #answer: 0.5

(vi) Different sample sizes


par(mfrow=c(2,2))
x<-seq(0,1000,by=.1)
n <-10
S<-rep(0,1000)
set.seed(50)
for (i in 1:1000)
{E<-rexp(n,l);S[i]<-sum(E)}
hist(S, prob=TRUE,xlab="sample sums",main="")
lines(x,dnorm(x,n/l,sqrt(n/l^2)),type="l",lwd=2,col="red")


n <- 30
S<-rep(0,1000)
set.seed(50)
for (i in 1:1000)
{E<-rexp(n,l);S[i]<-sum(E)}
hist(S, prob=TRUE,xlab="sample sums",main="")
lines(x,dnorm(x,n/l,sqrt(n/l^2)),type="l",lwd=2,col="red")
n <- 50
S<-rep(0,1000)
set.seed(50)
for (i in 1:1000)
{E<-rexp(n,l);S[i]<-sum(E)}
hist(S, prob=TRUE,xlab="sample sums",main="")
lines(x,dnorm(x,n/l,sqrt(n/l^2)),type="l",lwd=2,col="red")
n <- 200
S<-rep(0,1000)
set.seed(50)
for (i in 1:1000)
{E<-rexp(n,l);S[i]<-sum(E)}
hist(S, prob=TRUE,xlab="sample sums",main="")
lines(x,dnorm(x,n/l,sqrt(n/l^2)),type="l",lwd=2,col="red")

#As the sample size increases, the empirical distribution of the sums approaches
#the normal approximation more closely
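
Parts (vii) and (viii) can be tackled along the following lines (a sketch; re-run the part (iii) loop first so that S again holds the sums of samples of size n = 100, for which the CLT gives the sum as approximately N(n/l, n/l^2)):

n <- 100
#(vii) empirical vs CLT mean and variance
mean(S); var(S)
n/l; n/l^2
#(viii) empirical vs CLT median and quartiles
quantile(S,c(0.25,0.5,0.75))
qnorm(c(0.25,0.5,0.75),n/l,sqrt(n/l^2))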


Answer 2
(i) Poisson simulations
(a)m <- 7
set.seed(79)
P<-rpois(1500,m)
(b)#check range before plot histogram
range(P)
#or could use table(P)
hist(P,breaks=(-0.5:16.5),prob=TRUE)

(c)xvals<-seq(-0.5,11.5,by=0.01)
lines(xvals,dnorm(xvals,m,sqrt(m)),type="l",lwd=2,col="red")

(ii) #Probability > 5


(a)length(P[P>5])/length(P) #answer: 0.7
(b)#Need continuity correction when approximating a discrete distribution with a continuous one
pnorm(5.5,m,sqrt(m),lower=FALSE) #answer: 0.7146248
(c)ppois(5,7,lower=FALSE) #Answer: 0.6992917


#The probability from the simulation is closer than that from the
#normal approximation.

(iii) QQ plot
(a)qqnorm(P)
(b) qqline(P, lty=2, col="red",lwd=2)

(c)#QQ plot 'banana shape' shows signs of positive skew.


#The mean of the Poisson distribution is not large enough
#to ensure normal is a good approximation

Answer 3

(i) y <- c(87, 53, 72, 90, 78, 85, 83)
c(mean=mean(y), variance=var(y)) # mean 78.29, variance 159.90
(ii) xbar = s2 = numeric(10)
for (j in 1:10){
x <- rpois(7, 78.29)
xbar[j] = mean(x)
s2[j] = var(x)
}
xbar
#[1] 77.85714 79.71429 68.71429 82.14286 69.71429 84.57143 77.28571 83.0000 76.85714 79.28571
s2
#[1] 104.80952 127.23810 136.23810 42.47619 51.90476 103.28571 83.90476 107.33333 49.80952 90.57143
It is unusual to get as large a difference between the mean and the variance as that
observed for these data, making it doubtful that these data are from a Poisson distribution.


ASSIGNMENT – Estimation
Question 1. The times (in minutes) between successive calls at a call center are contained in the CSV
data file “times”.

(i) Load the data and store it in an object “t”.

(ii) Plot a histogram of times using the data.

(iii) Plot the empirical pdf using the data.

Analysts suggest that the data could be modelled using an Exponential distribution with
λ=0.0987 or as a Gamma distribution with α=0.8628 and λ=0.0939.

(iv) Superimpose the pdfs of these two distributions on the graph obtained in (iii).

(v) (a) Generate 1000 values from the fitted exponential distribution and store
them in object “x”. Use seed = 80.

(b) Plot a QQ plot using qqplot on x and t.

(c) Add abline(0,1) to the QQ plot and comment on the fit.

(vi) (a) Generate 1000 values from the fitted gamma distribution and store them in
object “y”. Use seed=80.

(b) Plot a QQ plot using qqplot on y and t.

(c) Add abline(0,1) to the QQ plot and comment on the fit.

(vii) Which is the most appropriate model? Use results obtained in (v) and (vi).

Question 2. The CSV file “complaints” contains the number of complaints of a smartphone in a year
made by 1,00,000 of its users.

No of complaints No of Users

0 83989
1 14667
2 1270
3 72
4 2
≥5 -

(i) Extract the data and save it as a vector in an object “c”.


(ii) Plot a histogram of complaints ensuring the bars line up correctly.

It is thought that this data could be modelled as a Poisson distribution with λ=0.175 or
as a Type 2 Negative Binomial distribution with k= 2.2569 and p= 0.92804.

(iii) (a) List out the expected frequencies for each of these fitted distributions to the
nearest whole number.

(b) Obtain the differences between the observed and expected frequencies for
the two fitted distributions.

(c) Hence, comment on the fit of these two distributions to the observed data.


ANSWERS
Answer 1
(i) Set your working directory
t <- read.table("times.csv")
t #Here t is a dataframe but we want it to be a vector, so:
t <- t$V1
t #Now t is a vector.
(ii) hist(t,xlab="minutes",main="Times between successive calls")

(iii) plot(density(t),main="Times between Successive calls")

(iv) range(t) #goes from 0.0202 to 56.607 ; so we'll go from 0 to 60


s <- seq(0,60,1)
#fitted exponential
l <- 0.0987
lines(s, dexp(s,l),col="red",lty=2)


#fitted gamma
a <- 0.8628
l2 <- 0.0939
lines(s, dgamma(s,a,l2),col="dark green",lty=3)
#Hard to comment on which is better fit from this graph.
#Hence will use QQ plots instead.

(v)(a) set.seed(80)
x <- rexp(1000,l)

(b) qqplot(x,t,xlab="exponential quantiles",ylab="sample quantiles")

(c) abline(0,1,col="red",lty=2)
#middle to upper sample values are higher than model
#so heavier upper tail - more positively skewed


(vi) (a) set.seed(80)


y <- rgamma(1000,a,l2)

(b) qqplot(y,t,xlab="gamma quantiles",ylab="sample quantiles")

(c) abline(0,1,col="red",lty=2)
#Middle to higher values get worse (with the highest value very poor)
#but better in middle than exponential since both sides of line

(vii) Both have good fit at lower end but worse elsewhere.
Despite the single extreme value in the gamma
the middle has a better fit than the exponential.

Answer 2
(i) #set working directory
c <-read.table("complaints.csv")
c <- c$V1
c
(ii) range(c)
hist(c,breaks=c(0:5),ylim=c(0,100000),xlim=c(0,5),xlab="no of complaints")


(iii) #fitted poisson


m <- 0.175
x <- 0:5
#(a)
round(100000*dpois(x,m),0)
#(b)
f <- c(83989,14667,1270,72,2,0)
f-round(100000*dpois(x,m),0)
#Fitted negative binomial
k <- 2.2569
p <- 0.92804
#(a)
round(100000*dnbinom(x,k,p),0)
#(b)
f-round(100000*dnbinom(x,k,p),0)
#(c) Poisson expected frequencies are closer to the observed frequencies.


ASSIGNMENT – Confidence Intervals


Question 1. Muscular endurance of athletes is assumed to be normally distributed with a standard
deviation 40. The mean muscular endurance score of a random sample of 60 athletes
was found to be 160. Obtain from scratch: -

(i) 95% CI for the true mean

(ii) 99% CI for the true mean of the form (0, L)

Question 2. An insurance company has clients for automobile policies. A sample of 8 automobiles for
which claims have been made in a month have been selected at random. The claim
amounts (in 000’s) are as follows: -

49,53,51,52,47,50,52,53

(i) Find the 95% CI for the average claim amount


(a) From scratch
(b) using t.test command

(ii) Find 99% CI for the standard deviation of the claim amount (use qchisq) .

(iii) (a) Assuming claims are normally distributed, use seed=50, rnorm and for loop
to obtain the mean of 800 re-samples of size 8.

(b) Hence obtain a 95% parametric bootstrap CI for the average claim size.

(iv) (a) Making no distributional assumption, use seed=50, sample and for loop to
obtain 800 re-samples of size 8.

(b) Hence obtain a non-parametric 95% CI for the average claim size.

(v) Using the method in (iii) and the same seed, obtain a 99% CI for the standard
deviation of claim size.

(vi) Using the method in (iv) and the same seed, obtain a non-parametric 99% CI for
the standard deviation of claim size.

The built in dataset, ChickWeight contains weight versus age of chicks on different
diets.

(vii) Use t.test to obtain a 99% CI for the average weights of chicks being fed diet 1.

(viii) Find 90% CI for the variance of the weights of chicks being fed diet 2.


Question 3. A random sample of 450 pineapples was taken from a large consignment and 85 were
found to be bad.

(i) Calculate a 95% CI for the proportion of bad pineapples.

(ii) Comment on the likelihood of more than 25% of the pineapples in a sample
being bad.

Question 4. A statistician has a sample of 25 values from a Poisson distribution with mean of 7.
Find the exact 95% CI for the mean rate.

Question 5. The length of time (in minutes) required to assemble the device using standard
procedure and new procedure are given below. Two groups of nine employees were
selected randomly, one group using the new procedure and other using the standard
procedure

Standard Procedure 32 37 35 28 41 44 35 31 34
New Procedure 35 31 29 25 34 40 27 32 31

(i) (a) Using t.test, calculate a 99% CI for the difference in mean times for the two
groups.

(b) Comment on the results.

(ii) It is now known that both the groups were made of the same employees at
different times. Incorporate this information and repeat (i).

(iii) (a) Find the 90% CI for the ratio of variances of the two groups.

(b) Comment on the result.

Question 6. Random samples of 500 men and 700 women were asked whether they would have a
flyover near their residence. 300 men and 425 women were in favour of the proposal.

(i) Use two vectors and prop.test command to find the 95% CI for the difference in
proportions between men and women who were in favour.

(ii) Now, solve (i) using a matrix for the results instead of two vectors.


ANSWERS
Answer 1
(i) xbar <- 160
n <- 60
sigma <- 40
alpha <- 0.05
xbar + c(-1,1)*qnorm(1-alpha/2)*sigma/sqrt(n)
#or
xbar + c(-1,1)*qnorm(alpha/2,lower=FALSE)*sigma/sqrt(n)
#Ans= (149.8788,170.1212)

(ii) alpha <- 0.01


xbar + qnorm(1-alpha)*sigma/sqrt(n)
#or
xbar + qnorm(alpha,lower=FALSE)*sigma/sqrt(n)
#Ans= (0,172.0132)

Answer 2
(i) (a) claims <- c(49,53,51,52,47,50,52,53)
n <- length(claims)
alpha <- 0.05
mean(claims)+ c(-1,1)*qt(1-alpha/2,n-1)*sd(claims)/sqrt(n)
#or
mean(claims)+ c(-1,1)*qt(alpha/2,n-1,lower=FALSE)*sd(claims)/sqrt(n)
#Ans= (49.11921,52.63079)
(b) t.test(claims,conf= 0.95)

(ii) alpha <- 0.01
sqrt(c((n-1)*var(claims)/qchisq(1-alpha/2,df=n-1),(n-1)*var(claims)/qchisq(alpha/2,df=n-1)))
#Ans approx (1.234, 5.587)

(iii) #define sample size


n <- length(claims)
(a) bm <- rep(0,800)
set.seed(50)
for(i in 1:800)
{bm[i]<-mean(rnorm(n,mean=mean(claims),sd=sd(claims)))}
(b) quantile(bm,c(0.025,0.975)) #Ans= (49.47789,52.41451)

(iv) (a) bm <- rep(0,800)


set.seed(50)
for(i in 1:800)
{bm[i]<-mean(sample(claims,replace=TRUE))}
(b) quantile(bm,c(0.025,0.975)) #Ans = (49.375,52.125)

(v) (a) bs <- rep(0,800)


set.seed(50)
for(i in 1:800)
{bs[i]<-sd(rnorm(n,mean=mean(claims),sd=sd(claims)))}
(b) quantile(bs,c(0.005,0.995)) #Ans = (0.7806731,3.5638685)

(vi) (a) bs <- rep(0,800)


set.seed(50)
for(i in 1:800)
{bs[i]<-sd(sample(claims,replace=TRUE))}
(b) quantile(bs,c(0.005,0.995)) #Ans = (0.5175492,2.8755118)


(vii) ChickWeight
x <- ChickWeight$weight[ChickWeight$Diet==1]
t.test(x,conf.level = 0.99) #Ans= (92.71988,112.57103)
(viii) y <- ChickWeight$weight[ChickWeight$Diet==2]
n<-length(y)
alpha <- 0.1
c((n-1)*var(y)/qchisq(1-alpha/2,df=n-1),(n-1)*var(y)/qchisq(alpha/2,df=n-1))
#Ans approx (4195, 6436)

Answer 3
(i) x <- 85
n <- 450
binom.test(x,n,conf.level=0.95)

(ii) #Since 95% CI for p doesn't contain p=0.25(or higher values of p)


#it is unlikely that more than 25% of the pineapples will be bad.

Answer 4
#careful x is the total claims but we are given the mean
x <- 25*7
n <- 25
poisson.test(x, n, conf=0.95)
#Ans approx (6.00, 8.12)

Answer 5
(i) (a) sp <- c(32,37,35,28,41,44,35,31,34)
np <- c(35,31,29,25,34,40,27,32,31)
t.test(sp,np,conf.level = 0.99)
#Ans= (-2.834459,10.167793)
(b) #the confidence interval contains 0 so means could be equal

(ii) t.test(sp,np,conf.level = 0.99,paired = TRUE)


#Ans= (-0.4428268,7.7761601)
(iii) (a)var.test(sp,np,conf=0.9)
#Ans= (0.3550003,4.1962955)
(b) The CI contains 1 so variances could be equal.

Answer 6
(i) #vector of success
x <- c(300,425)
#vector of trials
n <- c(500,700)
prop.test(x,n,conf=0.95,correct=FALSE)
#Ans approx (-0.0633, 0.0490)
(ii) m <- cbind(x,n-x)
m
prop.test(m,conf=0.95,correct=FALSE)


ASSIGNMENT – Hypothesis Tests


Question 1. Muscular endurance of athletes is assumed to be normally distributed with a standard
deviation 40. The mean muscular endurance score of a random sample of 60 athletes was
found to be 160. Carry out the following tests (use pnorm): -
(a) H0 : μ = 165 vs H1 : μ <165

(b) H0 : μ = 155 vs H1 : μ ≠155

Question 2. An insurance company has clients for automobile policies. A sample of 8 automobiles for
which claims have been made in a month have been selected at random. The claim
amounts (in £ 000's) are as follows: -
49,53,51,52,47,50,52,53

(i) Test whether the average claim amount is greater than the presumed average
value of £ 50,000.

(a) From scratch (b) Using t.test.

(ii) Test whether the standard deviation of claim amount is equal to 5 from scratch
using pchisq to obtain the p-value.

The built in dataset, ChickWeight contains weight versus age of chicks on different
diets.

(iii) Use t.test to test whether the average weights of chicks being fed diet 1 is
100gm.

(iv) Test whether the variance of the weights of chicks being fed diet 2 is less than
7000gm2.

Question 3. A random sample of 450 pineapples was taken from a large consignment and 85 were
found to be bad.

(i) Test whether the proportion of bad pineapples is less than 25% using
binom.test.

(ii) Extract the p-value of the test in (i).

Question 4. The number of typing errors in a page are modelled using Poisson distribution with
mean λ. In a particular draft, there are 900 pages which were found to have 283 errors.

(i) Test the null hypothesis H0 : λ= 0.2 against H1 : λ > 0.2.

(ii) Extract the estimate for the mean rate in (i).


Question 5. The length of time (in minutes) required to assemble the device using standard
procedure and new procedure are given below. Two groups of nine employees were
selected randomly, one group using the new procedure and other using the standard
procedure

Standard Procedure 32 37 35 28 41 44 35 31 34
New Procedure 35 31 29 25 34 40 27 32 31

(i) Using t.test, test the hypothesis that the new procedure reduces the average
time required to assemble the device (you may assume equal variances).

(ii) It is now known that both the groups were made of the same employees at
different times. Incorporate this information and repeat (i).

(iii) Test the hypothesis that the lengths of times in the new procedure vary more
than those of standard procedure.

Question 6. Random samples of 500 men and 700 women were asked whether they would have a
flyover near their residence. 300 men and 425 women were in favour of the proposal.

(i) Use two vectors and prop.test to test the hypothesis that the proportion of
women that favour the proposal is greater than that of men. Do not use
continuity correction.

(ii) Now, solve (i) using a matrix for the results instead of two vectors.

Experts suggest that this proportion should be modelled using a Poisson distribution.

(iii) Use poisson.test and test whether the proportions of men and women
favouring the proposal are different.

Question 7. The length of time (in minutes) required to assemble the device using standard
procedure and new procedure are given below. Two groups of nine employees were
selected randomly, one group using the new procedure and other using the standard
procedure

Standard Procedure 32 37 35 28 41 44 35 31 34
New Procedure 35 31 29 25 34 40 27 32 31

(i) Store these results in two vectors and the value of the difference between their
means in the object “d”.
(ii) Carry out a permutation test to test the hypothesis that the lengths of times
using new procedure have a lower average than that using standard procedure.

(a) Store the combined vector of results in the object “r”


(b) Create an object “index” that gives the positions of the values in “r”.

(c) Use the function combn on the object “index” to calculate all the
combinations of lengths of time using standard procedure and store this in the
object “p”.

(d) Use a loop to store the differences in the average lengths of the two groups
in the object “dif”.

(iii) (a) Plot a labelled histogram of the differences in the average lengths of times of
the two groups for every combination.

(b) Use the function abline to add a dotted vertical green line to show the
critical value.

(c) Use the function abline to add a dashed vertical blue line to show the
observed statistic.

(iv) (a) Calculate the p-value of the test based on this permutation test.

(b) The p-value calculated under the normality assumption was 13.19%.
Comment on your result.

(v) Repeat part (ii) but with 10,000 resamples from the object results using the
function sample and set.seed(77).

(vi) Calculate the p-value of the test using resampling and compare it to the answer
using all the combinations calculated in part (iv).
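
A sketch of one way to set up the permutation test described in part (ii), taking the difference d as new minus standard (the direction is a choice, as long as it is used consistently):

sp <- c(32,37,35,28,41,44,35,31,34)
np <- c(35,31,29,25,34,40,27,32,31)
d <- mean(np)-mean(sp)              #(i) observed difference in means
r <- c(sp,np)                       #(ii)(a) combined results
index <- 1:length(r)                #(ii)(b) positions of the values in r
p <- combn(index,9)                 #(ii)(c) every choice of 9 positions for the "standard" group
dif <- rep(0,ncol(p))
for (i in 1:ncol(p))                #(ii)(d) difference in means for each combination
{dif[i] <- mean(r[-p[,i]])-mean(r[p[,i]])}
#p-value for part (iv)(a): proportion of differences at least as extreme as d
length(dif[dif<=d])/length(dif)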

Question 8. The length of time (in minutes) required to assemble the device using standard
procedure and new procedure are given below. Two groups of nine employees were
selected randomly, one group using the new procedure and other using the standard
procedure

Standard Procedure 32 37 35 28 41 44 35 31 34
New Procedure 35 31 29 25 34 40 27 32 31

(i) Store the differences of pairs of results in the vector D and the mean value of
these differences in the object ObsD.


(ii) Carry out a permutation test to test the hypothesis that the new procedure has a
lower average assembly time than the standard procedure:

(a) Store the values (-1, 1) in the vector sign.

(b) Use the function permutations from the package gtools to calculate all the
permutations of the signs of the differences in object D and store these
permutations in the object p.

(c) Use a loop to store the mean differences in the average lengths of times of
the two groups in the object dif.

(iii) (a) Calculate the p-value of the test based on this permutation test.

(b) The p-value calculated under the normality assumption was 2.885%.
Comment on your result.

(iv) Repeat part (ii) but with 10,000 resamples from the object sign using the
function sample in the loop and set.seed(79).

(v) Calculate the p-value of the test using resampling and compare it to the answer
using all the combinations calculated in part (iii).
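
Similarly, a sketch of the paired (sign) permutation test described in part (ii), assuming the gtools package is installed:

library(gtools)
sp <- c(32,37,35,28,41,44,35,31,34)
np <- c(35,31,29,25,34,40,27,32,31)
D <- np-sp                          #(i) paired differences
ObsD <- mean(D)
sign <- c(-1,1)                     #(ii)(a)
p <- permutations(2,9,sign,repeats.allowed=TRUE)   #(ii)(b) all 2^9 sign patterns
dif <- rep(0,nrow(p))
for (i in 1:nrow(p))                #(ii)(c) mean difference under each sign pattern
{dif[i] <- mean(p[i,]*D)}
#p-value for part (iii)(a): proportion of sign patterns giving a mean at most ObsD
length(dif[dif<=ObsD])/length(dif)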

Question 9. The number of admissions in a hospital for each weekday are given below:-

Day Mon Tue Wed Thurs Fri Total

Frequency 73 52 52 55 68 300

(i) (a) Store these frequencies in the vector obs.

(b) Give the expected frequencies if the number of admissions was
independent of the day (ie uniformly distributed).

(c) Use chisq.test to determine whether the observed results fit a uniform
distribution.

For a cleaning solution, the ratio of chemicals A,B and C should theoretically be 2:3:5.
For 120 bottles of the solution, the results were as follows:-

Chemicals A B C Total

Frequency 17 29 74 120


(ii) (a) Use chisq.test to determine whether the observed results are consistent with
the theoretical ratio.

(b) Extract the dof from the results of the above test.

A survey was analysed and it was found that the distribution of the number of
accidents of vehicle owners in a year is binomial with parameters n=3 and p. Data on
153 vehicle owners is as follows:-

No of accidents 0 1 2 3

No of vehicle owners 60 75 16 2

(iii) (a) Show that the method of moments estimate for p is 0.246.

(b) Use chisq.test to carry out a goodness of fit for the specified binomial model
for the number of accidents of each vehicle owner in a year, ensuring that the
expected frequencies are greater than 5.

(c) Use pchisq to find the correct p-value.

Question 10. Two sample polls of votes for two candidates A and B for a public office are taken, one
each from among the residents of rural and urban areas. The results (in 000’s) are given
in the table below.

Area Votes for

A B

Rural 37 15

Urban 12 50

(i) (a) Store these names and frequencies in the matrix obs2

(b) Use chisq.test to determine whether the nature of the area is related to
voting preference in this election.

(c) Repeat part(b) without Yates continuity correction.

(ii) Use fisher.test to determine whether the nature of the area and voting
preference are independent and give the exact p-value.


ANSWERS
Answer 1
xbar <- 160
n <- 60
sigma <- 40
alpha <- 0.05
(a) mu <- 165
statistic <- (xbar-mu)/(sigma/sqrt(n))
statistic #Ans= -0.9682458
#critical value
qnorm(alpha)
#greater than critical value of -1.644854 so don't reject
#p-value
pnorm(statistic) #p-value= 16.64%
(b)mu <- 155
statistic <- (xbar-mu)/(sigma/sqrt(n))
statistic #Ans= 0.9682458
#critical values
qnorm(alpha/2)
qnorm(1-alpha/2)
#between critical values of ±1.959964 so don't reject
#p-value
2*pnorm(statistic,lower=FALSE) #p-value= 33.297%
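#Since base R has no built-in one-sample z-test, the two calculations above can be
#wrapped in a small optional helper. This is an illustrative sketch only - the
#function name z.test1 is an assumption, not part of the workbook answer.
z.test1 <- function(xbar, mu, sigma, n, alternative = "two.sided") {
  #z statistic for a one-sample test of the mean with known sigma
  z <- (xbar - mu)/(sigma/sqrt(n))
  p <- switch(alternative,
              less = pnorm(z),
              greater = pnorm(z, lower.tail = FALSE),
              two.sided = 2*pnorm(-abs(z)))
  c(statistic = z, p.value = p)
}
z.test1(160, 165, 40, 60, "less")   #part (a): z = -0.968, p = 16.64%
z.test1(160, 155, 40, 60)           #part (b): z = 0.968, p = 33.30%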

Answer 2
(i) claims <- c(49,53,51,52,47,50,52,53)
n <- length(claims)
alpha <- 0.05
mu <- 50
(a) statistic <- (mean(claims)-mu)/(sd(claims)/sqrt(n))
statistic #Ans= 1.178416
#p-value
1-pt(statistic,n-1)
#Ans= p-value = 13.85% so we cannot reject H0
(b) t.test(claims,alt="greater",mu=50)

(ii) sigma <- 5


alpha <- 0.05
statistic <- (n-1)*var(claims)/sigma^2
statistic #Ans= 1.235
#critical value
qchisq(alpha/2,n-1)
qchisq(alpha/2,n-1,lower=FALSE)
#Ans= below lower critical value so reject H0
#p-value
2*(pchisq((n-1)*var(claims)/sigma^2,df=n-1)) #Ans = 1.98%

(iii) ChickWeight
x <- ChickWeight$weight[ChickWeight$Diet==1]
t.test(x,conf.level = 0.95,mu=100)
#p-value = 48.93%; hence we may not reject Ho.

(iv) y <- ChickWeight$weight[ChickWeight$Diet==2]


n<-length(y)
alpha <- 0.05
sigma <- sqrt(7000)
statistic <- (n-1)*var(y)/sigma^2
statistic #Ans= 87.16977
#lower critical value
qchisq(alpha,n-1) #Ans= 94.81124
#Ans= below lower value, so we may reject H0
#p-value (one-sided)
pchisq((n-1)*var(y)/sigma^2,df=n-1)
#p value is smaller than alpha.

Answer 3
(i) x<-85
n<-450
alpha<-0.05
binom.test(x,n,p=0.25,alternative="less",conf.level=1-alpha)
#p-value<1% reject H0, p<0.25

(ii) test <-binom.test(x,n,p=0.25,alt="less")


test$p.value #Ans= 0.00126712

Answer 4
(i) Test for lambda using poisson.test
x<-283
n<-900
alpha<-0.05
poisson.test(x,n,r=0.2,alt="greater",conf=1-alpha)
poisson.test(x,n,r=0.2,alt="greater")
#p-value is extremely small; reject H0, rate > 0.2

(ii)test <-poisson.test(x,n,r=0.2,alt="greater")
test$estimate #answer: 0.3144444

Answer 5
(i) sp <- c(32,37,35,28,41,44,35,31,34)
np <- c(35,31,29,25,34,40,27,32,31)
#alt is mean(sp) greater than mean(np)
#sp>np
t.test(sp,np,alt="greater",var.equal=TRUE)
#p-value >0.05; hence do not reject

(ii) t.test(sp,np,alt="greater",var.equal=TRUE,paired=TRUE)
# p-value <0.05; hence we may reject
(iii) var.test(sp,np,alt="less")
#p-value is > 0.05; hence we may not reject.

Answer 6
(i) #vector of success
x <- c(300,425)
#vector of trials
n <- c(500,700)
prop.test(x,n,alt="greater",conf=0.95,correct=FALSE)


#p-value = 0.5985 hence not reject H0, no difference in proportions

(ii)m <- cbind(x,n-x)


m
prop.test(m,alt="greater",correct=FALSE)

(iii) poisson.test(x,n)
#p=value=0.8803 so not reject H0, same proportions

Answer 7
(i) sp <- c(32,37,35,28,41,44,35,31,34)
np <- c(35,31,29,25,34,40,27,32,31)
d <- mean(sp)-mean(np)
#Ans= 3.666666

(ii) Permutation test


(a)r <- c(sp,np)
(b)index <- 1:length(r)
(c)nsp <- length(sp)
p<-combn(index,nsp)
(d)n<-ncol(p)
dif<-rep(0,n)
for (i in 1:n)
{dif[i]<-mean(r[p[,i]])-mean(r[-p[,i]])}

(iii) Histogram
(a)hist(dif)
(b)abline(v=quantile(dif,0.05), col="green",lty=3)
#or, if the difference had been taken the other way round, use the upper tail
abline(v=quantile(dif,0.95), col="green",lty=3)

(c)abline(v=d, col="blue",lty=2)

(iv) #p-value
(a)length(dif[dif<=d])/length(dif) #Ans= 0.9470177
#So insufficient evidence to reject H0, no difference in mean assembly times
(b)#p-value is very close to the value under the normality assumption

(v) Permutation test using resampling


dif<-rep(0,10000)


set.seed(77)
for (i in 1:10000)
{p<-sample(index, nsp, replace=FALSE)
dif[i]<-mean(r[p])-mean(r[-p])}

(vi) #p-value
length(dif[dif<=d])/length(dif) #Ans= 0.9456
#Insufficient evidence to reject H0, no difference in mean assembly times
#very close to the value using all combination
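#As an optional aside (not required by the question), the resampling loop in
#part (v) can be written more compactly using replicate; with the same seed it
#should reproduce the same 10,000 resampled differences
set.seed(77)
dif <- replicate(10000, {p <- sample(index, nsp); mean(r[p]) - mean(r[-p])})
length(dif[dif<=d])/length(dif)   #should again give 0.9456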

Answer 8
(i)sp <- c(32,37,35,28,41,44,35,31,34)
np <- c(35,31,29,25,34,40,27,32,31)
D <- sp-np
ObsD <- mean(D)
ObsD #Ans= 3.666667

(ii) Permutation test


(a)sign <- c(-1,1)
(b)nD <- length(D)
library(gtools)
p<-permutations(2,nD,sign,repeats.allowed=TRUE)
(c)n<-nrow(p)
dif<-rep(0,n)
for (i in 1:n)
{dif[i]<-mean(D*p[i,])}

(iii) #p-value
(a)length(dif[dif<=ObsD])/length(dif) #Ans= 0.9902344
#So insufficient evidence to reject H0

(iv) Permutation test using resampling


dif<-rep(0,10000)
set.seed(79)
for (i in 1:10000)
{p<-sample(sign, nD, replace=TRUE)
dif[i]<-mean(D*p)}

(v) #p-value
length(dif[dif<=ObsD])/length(dif) # Ans= 0.9911
#Insufficient evidence to reject H0
#very close to value using all combinations

Answer 9
(i) Goodness of fit test - uniform
(a)obs <- c(73,52,52,55,68)
(b)#would be 300/5 = 60 each day
exptd <- rep(60,5)
(c)chisq.test(obs,p=exptd,rescale=TRUE)
#or
chisq.test(obs)
#or
exptd <- rep(1/5,5)
chisq.test(obs,p=exptd)
#statistic= 6.4333 on chi-square 4 #p-value = 0.169, not reject H0


(ii) Goodness of fit test - given freq/prob


(a)obs <- c(17,29,74)
exptd <- c(2,3,5)
chisq.test(obs,p=exptd,rescale=TRUE)
#or
exptd <- c(0.2,0.3,0.5)
chisq.test(obs,p=exptd)
#statistic= 6.6694 on chi-square 2
#p-value = 0.03562, reject H0, not consistent with theoretical ratio
(b)chisq.test(obs,p=exptd)$parameter

(iii) Goodness of fit test - given distribn


(a)#equate mean=np to sample mean
x <- c(0,1,2,3)
f <- c(60,75,16,2)
xbar <- sum(x*f)/sum(f)
n<-3
p <- xbar/n
p
(b)exptd <- c(dbinom(c(0,1,2,3),n,p))
exptd
#We should NOT carry out the test as last group exptd freq <5
exptd*sum(f)
#However, if you do then R will give you a warning message.
#need to combine last 2 groups
exptd2 <-c(dbinom(c(0,1),n,p),1-pbinom(1,n,p))
exptd2
exptd2*sum(f)
#or
exptd2 <- c(exptd[1:2],sum(exptd[3:4]))
exptd2
f2 <- c(f[1:2],sum(f[3:4]))
f2
chisq.test(f2,p=exptd2)
#statistic= 3.4675 on chi-square 2
#p-value = 0.1766, not reject H0, binomial model good fit
(c)#HOWEVER dof should be 1 as we estimated a parameter
#extract statistic
test <-chisq.test(f2,p=exptd2)
test$statistic
pchisq(test$statistic,df=1,lower=FALSE) #p-value=0.06258403
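#Optional cross-check (not asked for in the question): the same chi-square
#statistic from first principles using the combined expected frequencies
e2 <- exptd2*sum(f2)
sum((f2-e2)^2/e2)   #should match test$statistic (3.4675)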


Answer 10
(i)(a) obs2 <- matrix(c(37,12,15,50), 2, 2,
dimnames=list(c("rural", "urban"),c("A","B")))
obs2
#or do it separately
obs2 <- matrix(c(37,12,15,50), 2, 2)
rownames(obs2) <- c("rural", "urban")
colnames(obs2) <- c("A","B")
(b)chisq.test(obs2)
#statistic= 28.885 on chi-square 1
#p-value = 0, reject H0.
(c)chisq.test(obs2, correct=FALSE)
#statistic= 30.962 on chi-square 1
#p-value = 0, reject H0

(ii) fisher.test(obs2)
#p-value = 2.36e-08, reject H0
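#As an optional aside, fisher.test also returns the estimated odds ratio and an
#exact confidence interval, which can be extracted in the usual way:
ft <- fisher.test(obs2)
ft$estimate   #estimated odds ratio
ft$conf.int   #95% confidence interval for the odds ratio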


ASSIGNMENT – Data Analysis

Question 1. The average heights and weights of American women are given in the inbuilt
dataset “women”.

(i) (a) Obtain the scattergraph between heights and weights.

(b) Comment on the linear relationship

(ii) Obtain the following correlation coefficients:

(a) Pearson (b) Spearman (c) Kendall.

(iii) (a) Store the heights in vector x and weights in vector y.

(b) Create objects Sxx, Syy and Sxy which contain the sum of squares for the
data.

(c) Hence calculate the (pearson) correlation coefficient.

(iv) (a) Calculate the Pearson Correlation coefficient of the rank of the heights
and rank of the weights.

(b) Check this gives the same result as the formula rs = 1 - 6*sum(d^2)/(n*(n^2-1)),
where d is the vector of differences between the ranks of the heights and weights.

(v) (a) Use cor.test and the Pearson correlation coefficient to test whether
ρ=0.

(b) Extract the statistic for the test in part (v)(a).

(c) Use the statistic from part (v)(b) to obtain the p-value for the test in part
(v)(a).

(vi) Use cor.test and Spearman’s correlation coefficient to test H0 : ρs = 0 vs


H1 : ρs > 0 at the 1% significance level.

(vii) Use cor.test to test if the true value of Kendall’s correlation coefficient is
less than zero.

(viii) Use Fisher’s transformation to test whether H0 : ρ = 0.9 vs H1 : ρ> 0.9 stating
the p-value.


(ix) Use prcomp to carry out PCA on the women data and store it in the
object pca1.

(x) (a) Obtain the eigenvectors (matrix W) of each principal component in


pca1.

(b) Explain what they represent.

(xi) (a) Obtain the principal components decomposition (matrix P) for the
women data from pca1.

(b) Interpret what this represents.

(xii) (a) Obtain the centre and scale of pca1.

(b) Explain what these mean.

(xiii) (a) Obtain the percentages each of the variances of the principal
components using the summary of the prcomp function.

(b) Using part (a) determine which components, if any,


should be dropped.

Question 2. The built-in data set iris contains measurements (in cm) of the variables sepal
length, sepal width, petal length and petal width, respectively, for 50 flowers from
each of 3 species (Iris setosa, versicolor, and virginica) of iris.

(i) Extract the four measurements for the setosa species only and store them
in the 50x4 data frame, SDF.

(ii) Use plot to obtain a scatter graph of each pair of measurements for the
setosa species.

(iii) Use pairs to produce the following scatter graph:


(iv) Comment on the relationship between Petal Width and the other
measurements.

(v) Using scale or otherwise, obtain a centred matrix of the 50 observations of the 4
variables (each column having zero mean) and store it in the matrix object X.

(vi) Use eigen to obtain the eigenvectors of X^T X and store them in the matrix object
W.

(vii) Obtain P = XW, the principal components decomposition P of X.

(viii) (a) Obtain the diagonal matrix S = P^T P.

(b) Calculate what percentage each of the variances in matrix S are of the total.

(c) State which principal component(s) should be dropped to simplify the


dimensionality.

(ix) (a) Obtain the matrix P using the prcomp function.

(b) Obtain the percentages in part (viii)(b) using the summary of the prcomp
function.

(c) Draw a scree diagram using plot on the result of the prcomp function and
hence state which principal component(s) should be dropped to simplify the
dimensionality.

(x) (a) Carry out PCA with scaling of the data using prcomp.


(b) Using the Kaiser Criterion state which principal component(s) should be
dropped to simplify the dimensionality.

(xi) (a) Using cbind and rep, or otherwise, obtain a new matrix P1 which has only the
first two principal components and vectors of zeroes for the removed
components.

(b) Obtain the reduced data set X1 using X1 = P1 W^T.

(c) Use pairs to plot X1 and compare to the original data.


ANSWERS
Answer 1
(i)(a) women
plot(women,main="Height vs Weight")

(b) #we see that the relationship between height and weight is almost
perfectly linear.

(ii) (a) women


cor(women,method="pearson")
#or
cor(women)
#or
cor(women$height,women$weight) #r=0.99549480
(b)cor(women,method="spearman") #rs=1 as monotonically increasing
(c)cor(women,method="kendall") #t=1

(iii)(a)x <- women$height


y <- women$weight
(b)Sxx <- sum((x-mean(x))^2)
Syy <- sum((y-mean(y))^2)
Sxy <- sum((x-mean(x))*(y-mean(y)))
#Sxx=280, Syy=3362.933, Sxy=966
(c)Sxy/sqrt(Sxx*Syy) #r=0.9954948

(iv)(a) Pearson correlation coefficient of ranks


cor(rank(women$height),rank(women$weight)) #rs=1 as before
(b) Obtain Spearman correlation coefficient (from scratch)
x <- rank(x)
y <- rank(y)
n <- length(x)


d <- x-y
1-(6*sum(d^2))/(n*(n^2-1))

(v)(a)(Pearson) correlation coefficient inference (using cor.test)


#default is already 2-sided and 5% significance level
cor.test(women$height,women$weight, method="pearson")
#p-value = 1.091e-14 reject H0, rho is not zero
(b) Extracting values from cor.test
#Just like tests before we can extract each bit of info from the output
test <- cor.test(women$height,women$weight, method="pearson")
test$statistic
#statistic =37.85531
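(c) p-value from the extracted statistic
#a sketch consistent with part (a): use the statistic with its degrees of freedom
#(test$parameter holds the df, here 13)
2*pt(test$statistic, df=test$parameter, lower.tail=FALSE)
#matches the p-value of 1.091e-14 reported in part (a)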

(vi) Spearman correlation coefficient inference (usingcor.test)


cor.test(women$height,women$weight,method="spearman",alt="greater",conf=0.99)
#p-value very small; reject H0, rho is greater than zero

(vii) Kendall correlation coefficient inference (using cor.test)


cor.test(women$height,women$weight,method="kendall",alt="less")
#p-value = 1 do NOT reject H0, tau is not less than zero

(viii) Test rho>0.9 using Fisher's transformation


n<-nrow(women)
r <- cor(women$height,women$weight)
z <- (atanh(r)-atanh(0.9))/sqrt(1/(n-3))
z #statistic is 5.454174
pnorm(z,lower.tail=FALSE)
#p-value is 2.460049e-08, reject Ho
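#Optional: the same test wrapped in a small helper function. This is an
#illustrative sketch only - the name fisher.rho.test is an assumption, not
#part of the workbook.
fisher.rho.test <- function(x, y, rho0, alternative = "greater") {
  n <- length(x)
  r <- cor(x, y)
  z <- (atanh(r) - atanh(rho0))*sqrt(n - 3)
  p <- switch(alternative,
              greater = pnorm(z, lower.tail = FALSE),
              less = pnorm(z),
              two.sided = 2*pnorm(-abs(z)))
  c(statistic = z, p.value = p)
}
fisher.rho.test(women$height, women$weight, 0.9)   #matches the values above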

(ix) PCA using prcomp


women
pca1<-prcomp(women)

(x)(a) Eigenvectors
#These are given when the variable is called:
pca1
#or can be extracted using:
pca1$rotation
#answer:
# PC1 PC2
#height 0.2762612 0.9610826
#weight 0.9610826 -0.2762612
(b) Explain
#These are the orthogonal vectors of the new co-ordinate system
#(which is a rotation of the old co-ordinate system)

(xi)(a) Principal components decomposition


pca1$x
(b) Interpret P
#Expresses the 15 data points in terms of the two new principal components


#like a new co-ordinate system


#with the most important one across the horizontal axis

(xii)(a) center and scale


pca1$center
# pca1$center
#height weight
#65.0000 136.7333
pca1$scale
#answer: #FALSE
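(b) Explain
#center holds the variable means (height 65, weight 136.73) that prcomp
#subtracted from the data before rotating the axes
#scale = FALSE indicates the variables were not divided by their standard
#deviations, ie the PCA was carried out on the covariance rather than the
#correlation structure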

(xiii)(a) percentages using prcomp


summary(pca1)
# PC1 PC2
#Standard deviation 16.1259 0.40754
#Proportion of Variance 0.9994 0.00064
#99.94% from 1st component and only 0.064% from the 2nd component
(b) Which component(s) should be dropped
#We should clearly drop the second component

Answer 2
(i) Extract 4 measurements of setosa


attach(iris)
#We want columns 1 to 4
SDF <- iris[Species=="setosa",1:4]
SDF

(ii) Plot scattergraph of pairs using plot


plot(SDF)

(iii) Plot scattergraphs of pairs using pairs


pairs(SDF,main="Setosa Iris measurements",labels=c("Sepal Length","Sepal Width","Petal Length","Petal Width"),pch=16)

(iv) Comment
#It looks like there might be weak positive correlation between
#Petal Width and all the other variables (as there are no values
#in the top left quadrant).
detach(iris)

(v) Convert to matrix and store in X


X <- scale(as.matrix(SDF), center=TRUE, scale=FALSE)
#or
X <- scale(as.matrix(SDF), scale=FALSE)
#or even
X <- scale(SDF, scale=FALSE)
X


(vi) Store eigenvectors of X^T * X in W


W <- eigen(t(X) %*% X)$vectors
W

(vii) Obtain principal components decomposition P of X


P <- X %*% W
P

(viii)(a) Obtain matrix S = P^T * P


#note use rounding here and later to make it clearer
S <- t(P) %*% P
round(S,4)
(b) Percentage of total variance
round(diag(S/sum(S)),4) #answer: 76.47% 11.94% 8.67% 2.92%
(c) Which drop?
#1st two PCs explain 88.41%
#1st three PCs explain 97.08% which is more than 90%
#so probably drop just 4th PC
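#Optional check (an aside, using the objects defined above): the diagonal of
#S/(n-1) is just the vector of sample variances of the columns of P
n <- nrow(X)
round(diag(S)/(n-1), 4)
round(apply(P, 2, var), 4)   #should agree with the line above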

(ix)(a) PCA using prcomp


pca<-prcomp(X)
#or
pca<-prcomp(SDF)
pca
summary(pca)
#or could extract it
pca$x
#compare P
P
#Note, the matrix P has the opposite signs for PC1.
(b) percentages using prcomp
#it's in the summary
summary(pca)
#compare
round(diag(S/sum(S)),4)
(c) Scree diagram
plot(pca,type="line",main="scree plot")


#seems to level off after just the 1st PC

(x) (a) PCA of scaled data


pca1 <- prcomp(SDF,scale=TRUE)
pca1
(b) The Kaiser Criterion
#sd given in the summary
summary(pca1)
#or could extract sd
pca1$sdev
#Kaiser Criterion only keep components whose var (or sd) of scaled data >1
#Hence would suggest keeping only 1st 2 PCs

(xi)(a) Reduced principal components P1


n <- nrow(P)
P1 <- cbind(P[,1:2],rep(0,n),rep(0,n))
P1
#or
P1 <- cbind(pca$x[,1:2],rep(0,n),rep(0,n))
P1
(b) Reduced data set X1
X1 <- P1 %*% t(W)
X1
#or
X1 <- P1 %*% solve(W)
(c) Plot X1
pairs(X1)


pairs(X)

#It captures sepal length and width relationship well


#but not the other relationships.


ASSIGNMENT – Linear Regression


Question 1. The average heights and weights of American women are given in the inbuilt
dataset “women”.

(i) Fit a linear regression model, model1, of weight on height.

(ii) (a) Obtain the slope and intercept parameters.

(b) Plot a labelled scatter graph of the data and add a red dashed regression
line onto your scatterplot.

(iii) Obtain the fitted values:

(a) By extracting them from model1.

(b) by using the fitted command.

(c) using the predict command

(iv) Add green points to the scatterplot to show the fitted values.

(v) Obtain the expected weight of a woman whose height is 73 inches:

(a) from first principles by extracting the coefficients from model1.

(b) using the predict function.

(vi) Obtain the total sum of squares in the baby weights model together with its
split between the residual sum of squares and the regression sum of
squares:

(a) using the ANOVA command

(b) from first principles using the functions sum, mean, fitted and residuals.

(vii) Obtain the coefficient of determination, R2 , from the:

(a) linear regression model, model1

(b) correlation coefficient

(c) by calculation from the values in the ANOVA table.

(viii) Obtain the correlation coefficient from the extracted coefficient of


determination.


Question 2. The average heights and weights of American women are given in the inbuilt
dataset “women”. Using model1 fitted in the previous question: -

(i) Obtain the statistic and p-value for a test of H0 : β = 0 vs H1 : β ≠ 0.

(ii) Use confint to:

(a) obtain a 99% confidence interval for the slope parameter

(b) test, at the 5% level, whether β = 3.4.

(iii) Extract the estimated value of beta, the standard error of beta and the
degrees of freedom and store them in the objects b, se and dof.

(iv) Using the objects created in part (iii), use a first principles approach to:

(a) obtain a 90% confidence interval for the slope parameter.

(b) obtain the statistic and p-value for a test of H0: β = 3.6 vs H1: β < 3.6.

(c) obtain the statistic and p-value for a test of H0: β = 3.2 vs H1: β ≠ 3.2.

(v) Obtain the results of an F-test to test the ‘no linear relationship’ hypothesis
using the:

(a) anova command (b) summary command.

(vi) Calculate the F statistic and p-value from first principles by extracting the
mean sum of squares and degrees of freedom from the ANOVA table.

(vii) Obtain a 95% confidence interval for the error variance, σ2 , from first
principles.

(viii) Estimate the mean weight of a woman with height 55 inches:

(a) from first principles by extracting the coefficients from model1.

(b) using the predict function.

(ix) Obtain a 90% confidence interval for the:

(a) mean weight of a woman with height 55 inches.

(b) weight of an individual woman with height 55 inches.


Question 3. The average heights and weights of American women are given in the inbuilt
dataset “women”. Using model1 fitted in the first question: -

(i) Obtain the residuals for the regression model:

(a) from first principles using the fitted command

(b) using the residuals function.

(ii) (a) Obtain a plot of the residuals against the fitted values.

(b) Comment on the constancy of the variance and whether a linear model
is appropriate.

(iii) (a) Obtain a Q-Q plot of the residuals.

(b) Comment on the normality assumption.

(iv) Examine the final two graphs obtained by plot(model1) and comment.

(v) (a) Obtain a new linear regression model, model2, based on the data
without the third data point (height=60 inches).

(b) By examining the new value of R2 comment on the fit of model2


compared to that of model1.

Question 4. The following data were observed in an experiment:

x 1 2 3 4 5 6 7 8 9 10
y 0.33 0.51 0.75 1.16 1.90 2.59 5.14 7.39 11.30 17.40

It is thought that a suitable model is y = a*exp(b*x).

(i) (a) Store the data in a dataframe “obs”.

(b) Obtain a scatterplot of y vs x and comment whether a linear model is


appropriate for these data.

(c) Obtain a scatterplot of lny vs x and comment whether a linear model is


appropriate for this transformation.

(ii) Fit a linear regression model, model3, of ln yi on x .


(iii) (a) Obtain estimates for the slope and intercept parameters for model3.

(b) Add a red dashed regression line to your scatterplot of lny vs x from part
(i)(c).

(iv) (a) Use part (iii)(a) to obtain estimates for a and b .

(b) Re-plot the scatterplot of y vs x and this time add blue points to the
scatterplot to show the fitted values of y using model3.

(c) Add a dashed red regression curve that passes through the fitted points.

(v) Obtain a 95% confidence interval for the mean value of y when x = 8.5 .


ANSWERS
Answer 1
(i)Fit a linear regression model
women
model1 <- lm(weight~height,data=women)

(ii)(a) Obtain slope and intercept parameters


model1
#alternatively we could get them from the summary function
summary(model1)
#or using the coefficients (coef) function or subset
coef(model1)
model1$coef
#answer: slope param = 3.45000, intercept param = -87.51667
(b) Add the regression line
plot(women,main="Heights and Weights of Women",xlab="Height (in)",ylab="Weight (lb)",pch=3)
abline(model1,col="red",lty="dashed")
#or
abline(model1,col="red",lty=2)

(iii) Obtain the fitted values


(a)model1$fitted
(b)fitted(model1)
fitted.values(model1)
(c)predict(model1)
#Answer: 112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333
#140.1833 143.6333 147.0833 150.5333 153.9833 157.4333 160.8833

(iv) Add green points for the fitted values


points(women$height,fitted(model1),col="green",pch=16)

(v) Predict weight when height = 73 inches


(a) from 1st principles


coef(model1)[1]+coef(model1)[2]*73
(b) using predict
newdata1 <-data.frame(height=73)
#Then we use the "predict" function
predict(model1,newdata1) #Answer: 164.3333

(vi)(a) Obtain SStot, SSres and SSreg from ANOVA


#the residual and regression sum of squares are given in the following:
anova(model1)
#add up the sum of squares to get the total sum of squares:
anova(model1)[1,2]+anova(model1)[2,2]
#Ans= 3362.933
(b) Obtain SStot, SSres and SSreg from 1st principles
#for brevity let's store variables in x and y
x <- women$height
y <- women$weight
n <- nrow(women)
#SS TOT
sum((y-mean(y))^2)
#We could also calculate it from var(y) but the qn asks otherwise
var(y)*(n-1) #Ans= 3362.933
#SS RES
#just the sum of squares of the residuals (as the mean is zero)
sum(residuals(model1)^2) #Ans= 30.23333
#Alternatively:
sum((y-fitted(model1))^2)
#We could have also calculated it from summary(model1) but the qn asks otherwise
(n-2)*summary(model1)$sigma^2
#SS REG
sum((fitted(model1)-mean(y))^2) #Ans= 3332.7
#Also could have calculated Sxx, Syy and Sxy from 1st principles but qn asks otherwise
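#For illustration only (an optional aside using that route):
Sxx <- sum((x-mean(x))^2)
Sxy <- sum((x-mean(x))*(y-mean(y)))
Sxy^2/Sxx                        #regression SS = 3332.7
sum((y-mean(y))^2) - Sxy^2/Sxx   #residual SS = 30.23333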

(vii)(a) Obtain the coefficient of determination from model


#This is given in the summary
summary(model1)
#Note we can extract it from the summary as follows:
summary(model1)$r.squared
summary(model1)$r.sq
#Ans= 0.9910098
(b) Obtain the coefficient of determination from r
#square the correlation coefficient
cor(women$height,women$weight)^2
cor(x,y)^2
(c) Obtain the coefficient of determination from results in anova(model1)
anova(model1)[1,2]/(anova(model1)[1,2]+anova(model1)[2,2])

(viii) Obtain the correlation coefficient from R²


#square root R²
sqrt(summary(model1)$r.squared)


Answer 2
(i) Test beta=0
summary(model1)
#t statistic 37.85, p-value 1.09e-14
#reject H0, beta definitely not equal to zero

(ii)(a) 99% CI for beta


confint(model1,level=0.99)
#answer: (3.175472,3.724528)
#or could have extracted just the beta row using:
confint(model1,2,level=0.99)
b) Test if beta=3.4
#95% CI for beta is:
confint(model1,2,level=0.95)
#since the CI (3.253112,3.646888) contains beta=3.4 we do not reject the n
ull hypothesis

(iii) Extract b, se and dof


#we can extract the estimated beta,the s.e(beta) and the dof as follows:
b <- coef(model1)[2] #Ans= 3.45
#or
b <- model1$coef[2]
b <- coef(summary(model1))[2,1]
b <- summary(model1)$coef[2,1]
se <-coef(summary(model1))[2,2] #Ans= 0.0911365
dof <- model1$df #Ans= 13

(iv)(a) 90% CI for beta (1st principles)


#using b, se and df from part (iii) the 90% CI is given by:
b+c(-1,1)*qt(0.95,dof)*se
#answer: (3.288603,3.611397)
#check
confint(model1,2,level=0.9)
(b) 1 sided test of beta=3.6 (1st principles)
#using b, se and dof from part (iii) the statistic is:
t <- (b-3.6)/se
t
#statistic -1.645883
#p-value is the prob of being less than this statistic
pt(t,dof)
#p-value 0.06186836 ; hence beta is not significantly different from 3.6
(c) 2 sided test of beta=3.2(1st principles)
#using b, se and dof from part (iv) the statistic is:
t <- (b-3.2)/se
t #statistic 2.743138
#nearest critical region is >, so calculate this prob
#but since 2 sided we then double it to get the p-value
2*pt(t,dof,lower.tail=FALSE)
#p-value 0.01675607
#reject H0 at 5% level, beta is significantly different from 3.2


(v) F test
(a)anova(model1)
#F statistic = 1433
#p-value = 1.091e-14
#reject H0, there is a linear relationship btwn height and weight
(b)summary(model1)

(vi) F test (first principles)


#Obtaining the F statistic
f <- anova(model1)[1,3]/anova(model1)[2,3]
f
#Obtaining the dof
df1 <- anova(model1)[1,1]
df2 <- anova(model1)[2,1]
#df1= 1 ; df2= 13
#calculating the p-value
pf(f,df1,df2,lower.tail=FALSE)
#Ans= 1.090973e-14

(vii) 95% CI for error variance from 1st principles


dof <- model1$df
#or
dof <- anova(model1)[2,1]
dof <- anova(model1)$Df[2]
SSres <- anova(model1)[2,2]
#or could extract sigma from summary and then multiply by (n-2)
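#for example, the following equivalent line (an optional aside) gives the same SSres:
SSres <- summary(model1)$sigma^2*dof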
#CI given by
c(SSres/qchisq(0.975,df=dof),SSres/qchisq(0.025,df=dof))
#answer (1.222260,6.036103)

(viii)(a)coef(model1)[1]+coef(model1)[2]*55 #Ans= 102.2333


(b)newdata1 <-data.frame(height=55)
#Then we use the "predict" function
predict(model1,newdata1)

(ix) 90% CI for weight of a woman with height 55 inches


(a) mean weight
predict(model1,newdata1,interval="confidence",level=0.90)
# (100.4752,103.9915)
(b) individual weight
predict(model1,newdata1,interval="predict",level=0.90)
# (99.01078,105.4559)

Answer 3
(i)(a) Residuals from 1st principles
#residuals are the differences btwn true y values and fitted y values
women$weight-fitted(model1)
(b) Residuals using command
model1$residuals
#or
residuals(model1)

(ii) Plot the residuals against the fitted values and comment
(a)plot(model1,1)


(b) There is a clear pattern (curvature) in the residuals rather than a random scatter,
so a simple linear model may not be entirely appropriate and the error variance does
not appear to be constant.
(iii) Q-Q plot of residuals and comment
(a)plot(model1,2)

(b) The points deviate from the line, particularly in the tails, so we cannot be
confident in the normality assumption.


(iv) Examine the final two graphs obtained by plot(model1)
plot(model1,3)


plot(model1,5)

(v)(a) New linear regression model excl. height=60


model2 <- lm(weight~height,data=women,subset=(height!=60))
(b) Compare R² from model 1 and model 2
summary(model1)$r.squared
#R² for model1 was 0.9910098
summary(model2)$r.squared
#R² for model2 is 0.9902325
#Hence model1 has a better fit


Answer 4
(i)(a) x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(0.33,0.51,0.75,1.16,1.9,2.59,5.14,7.39,11.3,17.4)
obs <- data.frame(x,y)

(b) Scatterplot x vs y
plot(obs,pch=3)

#the original data shows exponential growth


(c)Scatterplot x vs lny
#now plot lny instead
attach(obs)
plot(x,log(y),pch=3)

#much more linear


(ii) Fit regression model to lny on x


model3 <- lm(log(y)~x)

(iii)(a) Obtain slope and intercept parameters


model3
#intercept=-1.5998 #slope = 0.4467
(b) Add red dashed regression line to log graph
abline(model3,col="red",lty="dashed")

(iv)(a) Obtain estimates for a and b


#a is given by:
a<-exp(coef(model3)[1])
a
#b is the same as beta
b<-coef(model3)[2]
b
#a=0.2019 b=0.4467
(b) Replot scattergraph and add blue points for the fitted values
plot(obs,pch=3)
points(x,exp(fitted(model3)),col="blue",pch=16)
#or from first principles
points(x,a*exp(b*x),col="green",pch=3)


(c) Add dashed red regression line


lines(x,a*exp(b*x),col="red",lty=2)
#or
lines(x,exp(fitted(model3)),col="red",lty=2)
#alternatively you could plot it from scratch using curve
#then add scatterplot (and fitted) values using points

(v) 95% confidence interval for y when x=8.5.

newdata <-data.frame(x=8.5)
exp(predict(model3,newdata,interval="confidence",level=0.95))
#(8.41,9.63)


ASSIGNMENT – Multiple Linear Regression


Question 1. Two objects, A and B have been made by three creators, Tom, Matt and John. The
dataset “data1” contains 50 values each of the length and width of A and B created
by the three creators.

(i) Extract the four measurements corresponding to the creator Matt and
store them in the data frame MDF.

(ii) Using the data for Matt, fit a linear regression model, model2, with B.width
as the response variable and A.length, A.width and B.length as explanatory
variables:

(iii) Obtain the slope and intercept parameters.

(iv) Obtain the fitted B width values:

(a) by extracting them from model2

(b) using the fitted command

(c) using the predict command.

(v) Obtain the expected B width with A length 5.1cm, A width 3.5cm and B
length 1.4cm created by Matt:-

(a) from first principles by extracting the coefficients from model2.

(b) using the predict function.

Question 2. This question uses the creator Matt linear regression model, model2, with B.width
(y ) as the response variable and A.length (x1) , A.width (x2) and B.length (x3) as
explanatory variables:

(i) Obtain the total sum of squares in model2 together with its split between
the residual sum of squares and the regression sums of squares using the
anova command.

(ii) Obtain the coefficient of determination, R2 from the:

(a) linear regression model, model2


(b) by calculation from the values in the ANOVA table.

(iii) Obtain the adjusted coefficient of determination, R2adj from the:

(a) linear regression model, model2

(b) by calculation from the values in the ANOVA table.

(iv) State the statistic and p-value for a test of H0 : β1 = 0 vs H1 : β1 ≠ 0 .

(v) Use confint to:

(a) obtain a 90% confidence interval for β2

(b) test, at the 5% level, whether β3 = 0.24.

(vi) Extract the value of β2 , the standard error of β2 and the degrees of
freedom and store them in the objects b2, se2 and dof.

(vii) Using the objects created in part (vi), use a first principles approach to:

(a) obtain a 90% confidence interval for β2 and compare to part (v)(a).

(b) obtain the statistic and p-value for a test of H0 : β2 = 0.3 vs H1 : β2 < 0.3.

Question 3. This question uses the creator Matt linear regression model, model2, with B.width
(y ) as the response variable and A.length (x1) , A.width (x2) and B.length (x3) as
explanatory variables:

(i) Carry out an F-test to test H0 : β1= β2= β3 = 0 using the summary command,
stating the test statistic and p-value clearly.

(ii) Obtain a 95% confidence interval for the error variance, σ2 , from first
principles.

Question 4. This question uses the creator Matt linear regression model, model2, with B.width
(y ) as the response variable and A.length (x1) , A.width (x2) and B.length (x3) as
explanatory variables:

(i) Obtain the expected B width with A length 5.94cm, A width 2.77cm and B
length 4.26cm created by Matt.

(ii) Obtain a 90% confidence interval for the:


(a) mean B width with A length 5.94cm, A width 2.77cm and B length
4.26cm for creator Matt

(b) individual B width with A length 5.94cm, A width 2.77cm and B length
4.26cm for creator Matt

(iii) Obtain the estimated mean value of α :

(a) from the model2 parameters (b) using the predict function.

(iv) Obtain a 99% confidence interval for the mean value of α :

(a) using the confint function (b) using the predict function.

Question 5. This question uses the creator Matt linear regression model, model2, with B.width
(y ) as the response variable and A.length (x1) , A.width (x2) and B.length (x3) as
explanatory variables:

(i) Obtain the residuals for the regression model:

(a) from first principles using the fitted command

(b) using the residuals function.

(ii) (a) Obtain a plot of the residuals against the fitted values.

(b) Comment on the constancy of the variance and whether a linear model
is appropriate.

(iii) (a) Obtain a Q-Q plot of the residuals.

(b) Comment on the normality assumption.

(iv) Examine the final two graphs obtained by plot(model2) and comment.

Question 6. This question uses the creator Matt data from Q1 which should be stored in the
data frame MDF. We are fitting multiple linear regression models with B.width as
the response variable and a combination of A.length, A.width and B.length as
explanatory variables.
Forward selection

(i) Fit the null regression model, fit0, to the B.width data.

(ii) Obtain the (Pearson) linear correlation coefficient between all the pairs of
variables.


(iii) Fit a linear regression model, fit1, with B.width as the response variable and
the variable with the greatest correlation with B.width as the explanatory
variable.

(iv) (a) Fit a linear regression model, fit2, with B.width as the response variable
and the variable from part (iii) and the variable with the next highest
correlation with B.width as the two explanatory variables.

(b) Compare the adjusted R2 of fit1 and fit2 and comment.

(v) (a) Fit a linear regression model, fit3, with B.width as the response variable
and the variables from part (iv) plus the last variable as the explanatory
variables.

(b) Compare the adjusted R2 of fit2 and fit3 and comment.

(vi) Comment on the output of the fit3 model and the results of the ANOVA
output.

Backward selection

Start with creator Matt linear regression model, model2, with B.width (y ) as the
response variable and A.length (x1) , A.width (x2) and B.length (x3) as explanatory
variables:

(vii) (a) Update the model to create model2b by removing the variable with βj
not significantly different from zero.

(b) Compare the adjusted R2 of model2 and model2b and comment.

Question 7. This question uses the creator Matt linear regression model, model2, with B.width
(y ) as the response variable and A.length (x1) , A.width (x2) and B.length (x3) as
explanatory variables:

Forward selection

(i) (a) Fit a linear regression model, fit4, with B.width as the response variable
and a two-way interaction term between the two most significant variables.

(b) Compare the adjusted R2 of fit3 and fit4. Comment on these values and the
results of the ANOVA output.


(ii) Create two further models, fit5 and fit6, each containing the three
explanatory variables from fit3 plus a single two-way interaction term.
Show that only one of them improves the value of the adjusted R2 but the
ANOVA output shows that there is no significant improvement in fit.

(iii) Explain why we would not consider adding a three-way interaction term in
this case.

Backward selection

Start with creator Matt linear regression model, fitA, with with B.width (y ) as the
response variable and A.length (x1) , A.width (x2) and B.length (x3) as explanatory
variables, together with all two and three way interactions.

(iv) Update the model fitA to create fitB, fitC, etc by removing:

(a) any insignificant three way interaction terms

(b) any insignificant two way interaction terms

(c) any insignificant main effect terms.

Each time compare only the adjusted R2 of the models to ensure only those
models which improve the fit are kept.

(v) Comment on the limitations of only using adjusted R2 as a basis for model
fit.


ANSWERS
Answer 1
(i) data <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"",
dec = ".")
attach(data)
#We want columns 1 to 4
MDF <- data[Creators=="Matt",1:4]
MDF

(ii) Fit a linear regression model


model2 <- lm(B.width~A.length+A.width+B.length,data=MDF)
#Note that the following code is incorrect as it uses ALL 150 values from
original data frame
# model2 <- lm(B.width~A.length+A.width+B.length)

(iii) Obtain slope and intercept parameters


model2
#alternatively we could get them from the summary function
summary(model2)
#or using the coefficients (coef) function or subset
coef(model2)
model2$coef
#answer: -0.1686, -0.07398, 0.2233, 0.3088 (4 SF)

(iv) Obtain the fitted values


(a)model2$fitted
(b)fitted(model2)
(c)predict(model2)
#answer: 1.4791476, ....., 1.3007574

(v) Predict B width


(a) from 1st principles
coef(model2)[1]+coef(model2)[2]*5.1+coef(model2)[3]*3.5+coef(model2)[4]*1.4
(b) using predict
#Wrap the parameters inside a data frame
newdata2 <-data.frame(A.length=5.1,A.width=3.5,B.length=1.4)
#Then we use the 'predict' function
predict(model2,newdata2)
#answer 0.6678075cm


Answer 2
data <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"", dec
= ".")
attach(data)
MDF <- data[Creators=="Matt",1:4]
MDF
#which we fitted the following model to
model2 <- lm(B.width~A.length+A.width+B.length,data=MDF)

(i) Obtain SStot, SSres and SSreg from ANOVA


#the residual sum of squares is given in the following:
anova(model2)
#answer: 0.56164
#to get the total sum of squares you can either add them up:
sum(anova(model2)[,2])
#could use colSums (but note the total it gives for the Mean Sq column is meaningless)
colSums(anova(model2))
#answer: 1.9162
#regression sum of squares is difference between them
sum(anova(model2)[,2])-anova(model2)[4,2]
#or
sum(anova(model2)[1:3,2])
#answer: 1.354562

(ii)(a) Obtain the coefficient of determination from model


#This is given in the summary
summary(model2)
#Note we can extract it from the summary as follows:
summary(model2)$r.squared
summary(model2)$r.sq
#answer: 0.7069001 (only 0.7069 given in summary)
(b) Obtain the coefficient of determination from results in anova(model2)
#SSREG divided by total
(sum(anova(model2)[,2])-anova(model2)[4,2])/sum(anova(model2)[,2])
#1-SSres/SStot
1-(anova(model2)[4,2])/sum(anova(model2)[,2])
#answer: 0.7069001

(iii)(a) Obtain the adjusted R² from model


#This is given in the summary
summary(model2)
#Note we can extract it from the summary as follows:
summary(model2)$adj.r.squared
#answer: 0.6877849 (only 0.6878 given in summary)
(b) Obtain the adjusted R² from results in anova(model2)
#1-MSSres/MSStot
#MSSres given in anova table


#MSStot is not - so total SS divided by total df (or divide by (n-1))


1-(anova(model2)[4,3])/(sum(anova(model2)[,2])/sum(anova(model2)[,1]))
n <- nrow(MDF)
1-(anova(model2)[4,3])/(sum(anova(model2)[,2])/(n-1))

(iv) Test beta1=0


#beta1 is the coefficient of A.length
summary(model2)
#t statistic -1.560, p-value 0.125599
#do not reject H0, beta1 equal to zero

(v)(a) 90% CI for beta2


#beta2 is the coefficient of A.width
confint(model2,level=0.9)
#answer: (0.119, 0.327)
#or could have extracted just the beta row using any of the following:
confint(model2,3,level=0.9)
confint(model2,level=0.9)[3,]
(b) Test if beta3=0.24
#beta3 is the coefficient of B.length
confint(model2,4,level=0.95)
confint(model2,4)
#since (0.201,0.416) contains beta=0.24 we do not reject the null
hypothesis

(vi) Extract b2, se2 and dof


#beta2 is the coefficient of A.width
#we can extract beta2,the s.e(beta2) and the dof as follows:
b2 <- coef(model2)[3]
#or
b2 <- model2$coef[3]
se2 <-coef(summary(model2))[3,2]
dof <- model2$df

(vii)(a) 90% CI for beta2 (1st principles)


#using b2, se2 and dof from part (iii) the 90% CI is given by:
b2+c(-1,1)*qt(0.95,dof)*se2
#answer: (0.119, 0.327)
#same as part (ii)(a)
(b) 1 sided test of beta2=0.3 (1st principles)
#using b2, se2 and dof from part (iii) the statistic is:
t <- (b2-0.3)/se2
t
#statistic -1.240004
#p-value is the prob of being less than this statistic
pt(t,dof)
#p-value 0.1106
#do not reject H0 at 5% level, beta is not < 0.3


Answer 3
(i) F test
summary(model2)
#F statistic = 36.98 #p-value = 2.571e-12
#reject H0, there is at least one non-zero slope parameter

(ii) 95% CI for error variance from 1st principles


dof <- model2$df
SSres <- anova(model2)[4,2]
#or could extract sigma from summary, square it, and multiply by the residual dof (n-4 = 46)
#CI given by
c(SSres/qchisq(0.975,df=dof),SSres/qchisq(0.025,df=dof))
#answer (0.00843, 0.0193)

Answer 4
(i) Estimate B width
#from 1st principles
coef(model2)[1]+coef(model2)[2]*5.94+coef(model2)[3]*2.77+coef(model2)[4]*
4.26
#or using predict
#Wrap the parameters inside a data frame
newdata2 <-data.frame(A.length=5.94,A.width=2.77,B.length=4.26)
#Then we use the 'predict' function
predict(model2,newdata2) #answer:1.325704cm

(ii) 90% CI for B width


(a) mean B width
predict(model2,newdata2,interval="confidence",level=0.90) #(1.299, 1.352)
(b) indiv B width
predict(model2,newdata2,interval="predict",level=0.90) #(1.138, 1.513)

(iii) estimated mean value of alpha


(a) model parameter (ie intercept)
coef(model2)[1]
#or just read off from model2 or summary(model2) #answer = -0.16864
(b) using predict
newdata0 <-data.frame(A.length=0,A.width=0,B.length=0)
predict(model2,newdata0) #answer = -0.16864
#clearly the linear relationship does not continue backwards to zero
lengths

(iv) 99% CI for mean value of alpha


(a) confint(model2,1,level=0.99) #answer (-0.6777, 0.3404)
(b)predict(model2,newdata0,interval="confidence",level=0.99)
#answer (-0.6777, 0.3404)


Answer 5
(i)(a) Residuals from 1st principles
#residuals are the difference btwn true y value and fitted y value
#since data is attached
B.width-fitted(model2)
#but that gives all 150 values, so:
MDF[,4]-fitted(model2)
MDF$B.width-fitted(model2)
B.width[51:100]-fitted(model2)
(b) Residuals using command
model2$residuals
model2$resid
residuals(model2)
resid(model2)

(ii) Plot the residuals against the fitted values and comment
(a)plot(model2,1)
(b) #68, 69, 74 are marked as outliers but still within 3 sds
3*summary(model2)$sigma
#variance appears to start increasing towards the end
#so may not be constant

(iii) Q-Q plot of residuals and comment


(a)plot(model2,2)

(b)#middle is good, however extremes detract from normal


#appears to have 'fat' tails

(iv) The final two graphs obtained by plot(model2)


plot(model2,3)


#this appears to display constant variance


plot(model2,5)

#point 99 has the most influence


#no combination of outlier and high influence


Answer 6
#Forward selection
(i) Null model
fit0 <- lm(B.width ~ 1, data = MDF)
summary(fit0)

(ii) Correlation coefficient


cor(MDF)

(iii) One explanatory variable model


#B.length has the greatest correlation with B.width
fit1 <- update(fit0, . ~ . + B.length)
summary(fit1)
#adjusted R² is 0.6109

#(iv) Two explanatory variables model


(a)#A.width has the next greatest correlation with B.width
fit2 <- update(fit1, . ~ . + A.width)
summary(fit2)
#adjusted R² is now 0.6783
(b)#has increased therefore keep both

(v) Three explanatory variables model


(a)#A.length has the next greatest correlation with B.width
fit3 <- update(fit2, . ~ . + A.length)
summary(fit3)
#adjusted R² is now 0.6878
(b)#has increased therefore keep all three

(vi) Three explanatory variables model


summary(fit3)
#A.length parameter not significant - which suggests remove
anova(fit3)
#not a significant improvement in fit when add last variable
#so both would suggest not include A.length
#but yet the adjusted R² does increase marginally
#Problem caused by overlap btwn variables - PCA would remove this issue

(vii) Backward selection


(a)#From earlier we had the following
model2 <- lm(B.width~A.length+A.width+B.length,data=MDF)
summary(model2)
#A.length is not significant
model2b <- update(model2, .~.-A.length)
summary(model2b)
(b)#adjusted R² has fallen from 0.6878 to 0.6783
#hence do not remove A.length


Answer 7
#Forward selection
fit3 <- lm(B.width~B.length+A.width+A.length,data=MDF)
#note that the order is different to model2
model2 <- lm(B.width~A.length+A.width+B.length,data=MDF)

(i) Add two-way interactive term


(a)summary(fit3)
#B.length and A.width have the greatest significance
#Note: this would also be the case if we started with model2
fit4 <- update(fit3, . ~ . + B.length:A.width)
(b)summary(fit4)
#adjusted R² has fallen from 0.6878 to 0.6814
anova(fit4)
#not significant improvement in model fit4 - confirming do not add

(ii) Other two-way interactive terms


fit5 <- update(fit3, . ~ . + B.length:A.length)
summary(fit5)
#adjusted R² has risen from 0.6878 to 0.6919
anova(fit5)
#This shows that there is no significant improvement - so should not add
fit6 <- update(fit3, . ~ . + A.width:A.length)
summary(fit6)
#Adjusted R² has fallen from 0.6878 to 0.681
anova(fit6)
#No significant improvement - confirming do not add

(iii) Three-way interactive term


#since no two-way terms have been included
#so should not include three way
#can still check - will see that adjusted R² will fall

(iv) Backward selection


#start with all explanatory variables and their interactions
fitA <- lm(B.width~A.length*A.width*B.length,data=MDF)
summary(fitA)
(a)#3 way interaction not significant so remove that parameter
fitB <- update(fitA, . ~ . - A.length:A.width:B.length)
summary(fitB)
#adjusted R² has risen from 0.6863 to 0.691
(b)#least significant 2 way interaction
#A.width:B.length with p-value of 0.7818
fitC <- update(fitB, . ~ . - A.width:B.length)
summary(fitC)
#adjusted R² has risen from 0.691 to 0.6975
#next least significant 2 way interaction
#A.length:A.width with p-value of 0.1815
fitD <- update(fitC, . ~ . - A.length:A.width)


summary(fitD)
#adjusted R² has fallen from 0.6975 to 0.6919
#implies should not remove but yet none of the coefficients are
significant
#similarly if removed other interaction would see fall to 0.681
fitD <- update(fitC, . ~ . - A.length:B.length)
summary(fitD)
(c)#not appropriate to remove single terms when have 2 way interactions
#that involve them
#Note we would have got this model with forward selection if we had
#ONLY considered the adjusted R² and not the results of the ANOVA test

(v) Comment
#even though we have maximised the adjusted R²
#none of the coefficients are significant
#so need a better method of fit - hence tend to use the ANOVA test
#between models to check improvement (although in a later unit we use AIC)


ASSIGNMENT – GLMs
Question 1. Two objects, A and B have been made by three creators, Tom, Matt and John. The
dataset “data1” contains 50 values each of the length and width of A and B created
by the three creators.

(i) Extract the four measurements corresponding to the creator Matt and
store them in the data frame MDF.

(ii) Using the data for Matt, fit a linear regression model, model2, with B.width
as the response variable and A.length, A.width and B.length as explanatory
variables:

State the coefficients of the fitted model.

(iii) (a) Use the function glm to fit an equivalent generalised linear model,
glmodel, to the creator Matt data. State explicitly the appropriate family
and the link function in the arguments.

(b) Confirm that the estimated parameters are identical to the linear model
in part (ii).

(c) Give a shortened version of the R code from part (iii)(b) that will fit the
same GLM as part (iii)(a) but makes use of the default settings of the glm
function.
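For reference, a minimal sketch of the kind of call part (iii)(a) is asking for is shown
below (assuming the Matt data are in the data frame MDF from part (i)); this is an
illustration rather than the model answer.

glmodel <- glm(B.width ~ A.length + A.width + B.length,
               family = gaussian(link = "identity"), data = MDF)
coef(glmodel)   #part (iii)(b): these should match coef(model2) from part (ii)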

Question 2. Two objects, A and B have been made by three creators, Tom, Matt and John. The
dataset “data1” contains 50 values each of the length and width of A and B created
by the three creators.

(i) (a) Assuming that the measurements are normally distributed, use the
function glm to fit a generalised linear model, glmodel1, with B.width as the
response variable and A.length (x1) , A.width (x2) , B.length (x3) and Creator
( γi )as explanatory variables:

(b) Obtain the coefficients by extracting them from glmodel1.

(c) Explain what has happened to the coefficient for the creator Tom.

(ii) State the code for a linear predictor which also includes a quadratic effect
from B.length.


The built-in data set esoph contains data from a case-control study of oesophageal
cancer in Ille-et-Vilaine, France. agegp contains 6 age groups, alcgp contains 4
alcohol consumption groups, tobgp contains 4 tobacco consumption groups, ncases
gives the number of observed cases of oesophageal cancer and ncontrols gives the
number of controls in each group.

(iii) Fit a binomial generalised linear model, glmodel2, with a logit link function
to estimate the probability of obtaining oesophageal cancer as the response
variable and a linear predictor containing the main effects of agegp ( αi ),
alcgp ( βj ) and tobgp ( γk )

(iv) State the code for a linear predictor which also has interaction between
alcohol and tobacco.

Question 3. The first two parts use the data1 generalised linear model, glmodel1, with B.width as
the response variable and A.length (x1) , A.width (x2) , B.length (x3) and Creator ( γi )as
explanatory variables:

(i) (a) State the statistic, p-value and conclusion for a test of H0 : β2 = 0 vs H1 :
β2 ≠ 0.

(b) Use R to extract only the p-value for this test.

(ii) Use confint to:

(a) obtain a 90% confidence interval for the creator Matt coefficient.

(b) test, at the 5% level, whether β1 = −0.2

The next two parts use the oesophageal cancer binomial probability generalised linear
model, glmodel2, with the probability of obtaining oesophageal cancer as the response
variable and a linear predictor containing the main effects of age ( αi ), alcohol ( βj ) and tobacco ( γk )

(iii) State the p-value and conclusion for a test that the second non-base
category in the age group is zero.

(iv) Use confint to:


(a) obtain a 99% confidence interval for the third non-base coefficient in the
alcohol group.

(b) test, at the 5% level, whether the first non-base coefficient in the
tobacco group is equal to 0.5.

Question 4. The first three parts use the “data1” generalised linear model, glmodel1, with B.width as
the response variable and A.length (x1) , A.width (x2) , B.length (x3) and Creator
(γi) as explanatory variables:

(i) (a) Obtain the residual degrees of freedom and residual deviance for this
model.

(b) Use R to extract only these numbers from the model.

(ii) Find and extract the AIC for this model.

(iii) (a) Create a new GLM, glmodel01, which does not contain creator as an
explanatory variable.

(b) Use the AIC to compare glmodel01 and glmodel1.

(c) Use anova to carry out a formal F test to compare these two models.
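As an illustration (assuming glmodel01 and glmodel1 have both been fitted), the
comparisons in parts (iii)(b) and (iii)(c) might look like the sketch below.

AIC(glmodel01, glmodel1)                 #the model with the lower AIC is preferred
anova(glmodel01, glmodel1, test = "F")   #F test is appropriate for the gaussian family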

The last part of this exercise uses the oesophageal cancer binomial probability
generalised linear model, glmodel2, with the probability of obtaining oesophageal
cancer as the response variable and a linear predictor containing the main effects of
age ( αi ), alcohol ( βj ) and tobacco( γk ) :

(iv) (a) Create a new GLM, glmodel02, which does not contain tobacco as an
explanatory variable.

(b) Use the AIC to compare glmodel02 and glmodel2.

(c) Use anova to carry out a formal χ2 test to compare these two models.

Question 5. We are fitting generalised linear models to “data1” with B.width as the response
variable and a combination of A.length, A.width, B.length and Creator as
explanatory variables, assuming the measurements are normally distributed.

Forward selection


(i) Fit the null generalised linear model, fit0, to the data.

First covariate

(ii) (a) By examining the scatterplot of all the pairs of variables explain why
either Creator or B.length should be chosen as our first explanatory
variable.

(b) Fit a linear regression model, fit1a, with B.width as the response
variable and Creator as the only explanatory variable. Determine the AIC
for fit1a.

(c) Fit a linear regression model, fit1b, with B.width as the response
variable and B.length as the only explanatory variable. Determine the AIC
for fit1b.

(d) By examining the AIC of fit1a and fit1b choose the model that provides
the best fit to the data.

(e) Use the anova function to carry out an F test comparing fit0 and the
model chosen in part (ii)(d).

Second covariate

(iii) (a) Fit a linear regression model, fit2, with B.width as the response
variable and both Creator and B.length as the explanatory variables.

(b) By examining the AIC and carrying out an F test compare fit2 and the
model chosen in part (ii)(d).

Third covariate

(iv) (a) Fit a linear regression model, fit3a, with B.width as the response
variable and Creator, B.length and A.length as explanatory variables.
Determine the AIC for fit3a.

(b) Fit a linear regression model, fit3b, with B.width as the response
variable and Creator, B.length and A.width as explanatory variables.
Determine the AIC for fit3b.

(c) By examining the AIC of fit3a and fit3b choose the model that provides
the best fit to the data.

(d) Use the anova function to carry out an F test comparing fit2 and the
model chosen in part (iv)(c).


Fourth covariate

(v) (a) Fit a linear regression model, fit4, with B.width as the response
variable and all four covariates as the explanatory variables.

(b) By examining the AIC and carrying out an F test compare fit4 and the
model chosen in part (iv)(c).

Fifth covariate

(vi) (a) Fit a linear regression model, fit5a, with B.width as the response
variable, all four covariates as main effects and an interactive term
between Creator and A.width as explanatory variables. Determine the AIC
for fit5a.

(b) Fit a linear regression model, fit5b, with B.width as the response
variable, all four covariates as main effects and an interactive term
between B.length and A.width as explanatory variables. Determine the
AIC for fit5b.

(c) By examining the AIC of fit5a and fit5b choose the best fit to the data.

(d) Use the anova function to carry out an F test comparing fit4 and the
model chosen in part (vi)(c).

Sixth covariate

(vii) (a) Fit a linear regression model, fit6, with B.width as the response
variable, all four covariates as main effects, the interactive terms between
Creator and A.width, and between B.length and A.width as explanatory
variables.

(b) By examining the AIC and carrying out an F test compare fit6 and the
model chosen in part (vi)(c).

Seventh covariate

(viii) Show that adding interaction between B.length and A.length to fit6 leads
to a drop in the AIC and a significant improvement in the residual
deviance.

It can be shown that adding other two-way interactions terms do not improve the
AIC nor lead to a significant improvement in residual deviance.


(ix) Explain why we should not add any three-way interaction terms at this
stage.

Backward selection

(x) Fit the full generalised linear model, fitA, to the data1 data to model
B.width using Creator*B.length*A.length*A.width and show the AIC is
−109.79 .

(xi) Show that the generalised linear model, fitB, which removes the four-way
interaction term leads to an improvement in the AIC.

It can be shown that two three-way interaction terms have parameters that are
insignificant.

(xii) (a) Update the model fitB to create fitC1 by removing the three-way
interaction between Creator, B.length and A.width. Determine the AIC for
fitC1.

(b) Update the model fitB to create fitC2 by removing the three-way
interaction between Creator, B.length and A.length. Determine the AIC for
fitC2.

Let fitC be the model from parts (xii)(a) and (b) which produces the biggest
improvement in the AIC.

(xiii) It can be shown that another three-way interaction term has insignificant
parameters at the 10% level. Use the summary function to
determine which interaction term this is. Create the generalised linear
model, fitD, which removes it and show that there is an improvement in
the AIC.

(xiv) Show that generalised linear model, fitE, which removes another
insignificant three- way interaction term also leads to an improvement
in the AIC.

(xv) Use the summary function to show that the parameter of the final
three-way interaction term is still significant but that the two-way
interaction term between Creator and A.length is not. Update the
model fitE to create fitF by removing this two-way interaction and show
it leads to an improvement in the AIC.

(xvi) Use the summary function to show that the parameters of three of the
two-way interaction terms are insignificant at the 5% level. Show that


removing any of these interaction terms leads to no improvement in the AIC.

Question 6. This question uses the “data1” generalised linear model, glmodel1, with B.width as the
response variable and A.length (x1), A.width (x2), B.length (x3) and Creator (γi) as
explanatory variables:

(i) Obtain the value of the linear predictor for glmodel1 for creator Matt with
A length 5.1cm, A width 3.5cm and B length 1.4cm:

(a) from first principles by extracting the coefficients from glmodel1.

(b) using the predict function.

(ii) (a) Explain why the expected B width for creator Matt will be the same as
the linear predictor in part(i).

(b) Show that this is the case by using the predict function.

(iii) Explain why there is no constant for the creator Tom in the linear predictor.

(iv) Obtain the expected B width for creator Tom with A length 5.1cm, A width
3.5cm and B length 1.4cm:

(a) from first principles by extracting the coefficients from glmodel1.

(b) using the predict function.

Question 7. This question uses the “data1” generalised linear model, glmodel1, with B.width as the
response variable and A.length (x1), A.width (x2), B.length (x3) and Creator (γi) as
explanatory variables:

(i) Obtain the raw residuals for the generalised linear model:

(a) from first principles using the fitted command

(b) using the residuals function.

(ii) Show that the raw residuals are the same as the:

(a) deviance residuals

(b) Pearson residuals.


(iii) By examining the median, lower and upper quartiles of the residuals,
comment on their skewness.

(iv) (a) Obtain a plot of the residuals against the fitted values.

(b) Comment on the constancy of the variance of the residuals and whether
a normal model is appropriate.

(v) (a) Obtain a Q-Q plot of the residuals.

(b) Comment on the normality assumption.

(vi) Examine the final two graphs obtained by plot(glmodel1) and comment.


ANSWERS
Answer 1
(i)read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"", dec =
".")
attach(data1)
#We want columns 1 to 4
MDF <- data1[Creators=="Matt",1:4]
MDF

(ii) Fit a linear regression model


lmodel <- lm(B.width~A.length+A.width+B.length,data=MDF)
#Note that lm(B.width~A.length+A.width+B.length,data=MDF) uses ALL data
#from original data frame
lmodel
#(Intercept) A.length A.width B.length
#-0.16864 -0.07398 0.22328 0.30875

(iii) Fit a GLM


(a)#reasonable to assume measurements are normally distributed
glmodel <- glm(B.width~A.length+A.width+B.length,data=MDF,family=gaussian
(link="identity"))
(b)glmodel
# (Intercept) A.length A.width B.length
#-0.16864 -0.07398 0.22328 0.30875
(c)glm(B.width~A.length+A.width+B.length,data=MDF)
#since normal distribution is default
#and identity is default link function for normal

Answer 2
(i)data1 <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"",
dec = ".")
(a)glmodel1 <-
glm(B.width~A.length+A.width+B.length+Creators,data=data1,family=gaussian
(link="identity"))
attach(data1)
glmodel1 <- glm(B.width~A.length+A.width+B.length+Creators)
(b)coef(glmodel1)
#or
glmodel1$coef
(c)#It has been absorbed into the 'intercept' coefficient.
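As an optional aside (not part of the question), the base level that has been absorbed can be inspected, and changed, using levels and relevel; this sketch assumes Creators is stored as a character or factor column:
levels(factor(data1$Creators))
#the first level listed is the base category absorbed into the intercept
#a different base could be chosen with, e.g., relevel(factor(data1$Creators), ref="Matt")
#before refitting the model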

(ii) Code for quadratic effect


glm(B.width~A.length+A.width+B.length+Creators+I(B.length^2))

(iii)glmodel2 <- glm(cbind(ncases,ncontrols) ~ agegp+alcgp+tobgp,


data=esoph, family=binomial (link="logit"))


#Since using binomial canonical link function which is the default, same
as:
glmodel2 <- glm(cbind(ncases,ncontrols) ~ agegp+alcgp+tobgp, data=esoph,
family=binomial)
#or if attach esoph
attach(esoph)
glmodel2 <- glm(cbind(ncases,ncontrols) ~ agegp+alcgp+tobgp,
family=binomial)
glmodel2

(iv) Code for interaction


#either:
#agegp+alcgp+tobgp+alcgp:tobgp
#ie glm(cbind(ncases,ncontrols) ~ agegp+alcgp+tobgp+alcgp:tobgp,
family=binomial)
#or:
#agegp+alcgp*tobgp
#ie glm(cbind(ncases,ncontrols) ~ agegp+alcgp*tobgp, family=binomial)

Answer 3
data1 <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"", dec
= ".")
glmodel1 <-
glm(B.width~A.length+A.width+B.length+Creators,data=data1,family=gaussian
(link="identity"))
(i) Test beta2=0
(a)#beta2 is the coefficient of A.width
summary(glmodel1)
#t statistic 5.072, p-value 1.20e-06 (ie 0.00000120)
#reject H0, beta2 is not equal to zero
(b)#The p-value is in the 3rd row and 4th column
coef(summary(glmodel1))[3,4]
summary(glmodel1)$coef[3,4]

(ii)(a) 90% CI for Matt


confint(glmodel1,level=0.9)
#answer: (0.446, 0.851)
(b) Test if beta1=-0.2
#beta1 is the coefficient of A.length
confint(glmodel1,2,level=0.95)
confint(glmodel1,2)
#since it does not contain beta=-0.2 we reject the null hypothesis
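An equivalent optional check (a sketch, not required by the question) is to compute the test statistic for H0: beta1 = -0.2 by hand from the coefficient table and compare it with the t distribution on the residual degrees of freedom:
est <- coef(summary(glmodel1))[2,1]   #estimate of beta1 (A.length)
se <- coef(summary(glmodel1))[2,2]    #its standard error
2*pt(-abs((est-(-0.2))/se), df=glmodel1$df.res)
#a small p-value here agrees with -0.2 lying outside the 95% confidence interval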


#Oesophageal cancer model


#Using the esoph data we fitted the following model
glmodel2 <- glm(cbind(ncases,ncontrols) ~ agegp+alcgp+tobgp, data=esoph,
family=binomial)
#For ordered factors it gives coefficients for the intercept (base),
#the linear term (L), the quadratic term (Q), the cubic term (C)
#and higher powers (4, 5)
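For clarity (an optional check, not required by the question), the exact labels R gives these terms can be listed directly from the fitted model, which is how the "non-base" categories referred to below are identified:
names(coef(glmodel2))
#e.g. "agegp.L", "agegp.Q", "agegp.C", "agegp^4", "agegp^5", "alcgp.L", "alcgp.Q", ...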

(iii) Test whether 2nd non-base category in the age group is zero.
#The base category is absorbed into the intercept coefficient
#the second non-base category is the 'Quadratic' category denoted by Q
#ie agegp.Q
summary(glmodel2)
#z statistic -2.263, p-value 0.02362
#reject H0, parameter is not equal to zero

(iv)(a) 99% CI for 3rd non-base coefficient in the alcohol group.


#the third non-base category is the "Cubic" category denoted by C
#ie alcgp.C
confint(glmodel2,level=0.99)
#answer: (-0.153, 0.668)
#or could have extracted just the alcgpC row (9th row) using any of the
following:
confint(glmodel2,9,level=0.99)
confint(glmodel2,"alcgp.C",level=0.99)
confint(glmodel2,level=0.99)[9,]
(b) Test if 1st non-base coefficient in the tobacco group is equal to 0.5.
#the 1st non-base category is the "Linear" category denoted by L
#ie tobgp.L
confint(glmodel2,level=0.95)
confint(glmodel2)
#since it contains 0.5 we do not reject the null hypothesis
#or could have extracted just the tobgpL row (10th row) using any of the
following:
confint(glmodel2,10)
confint(glmodel2,"tobgp.L")
confint(glmodel2)[10,]


Answer 4
data1 <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"", dec
= ".")
glmodel1 <-
glm(B.width~A.length+A.width+B.length+Creators,data=data1,family=gaussian
(link="identity"))
(i)(a)# Residual dof and deviance are given both in the model and the
summary
glmodel1
summary(glmodel1)
#Residual dof = 144
#Residual deviance = 3.998
(b)glmodel1$df.res
glmodel1$dev

(ii)#AIC is given both in the model and the summary


glmodel1
summary(glmodel1) #AIC = -104.06
#we extract it using either of the following:
glmodel1$aic
summary(glmodel1)$aic
#Note that the AIC is negative because the maximised log-likelihood exceeds the
#number of parameters in the model, so AIC = 2k - 2*logL is below zero
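As an optional sanity check of the relationship AIC = 2k - 2*logL (where k here includes the dispersion parameter as well as the regression coefficients), the figure can be reproduced from logLik:
ll <- logLik(glmodel1)
2*attr(ll,"df") - 2*as.numeric(ll)
#matches glmodel1$aic (approximately -104.06)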

(iii)(a)glmodel01 <- glm(B.width~A.length+A.width+B.length,data=data1)


(b)glmodel01 #AIC = -63.5
#Since AIC for glmodel1 (-104.06) is lower it is considered the better
model ; so we should include creators in our model
(c)anova(glmodel01, glmodel1, test="F")
#The p-value is 5.143e-10 which is way less than 5% so we would reject H0
#The model with Creators significantly reduces the scaled deviance
#and is a better fit.

#We fitted the following model to the esoph data


glmodel2 <- glm(cbind(ncases,ncontrols) ~ agegp+alcgp+tobgp, data=esoph,
family=binomial)
(iv)(a)glmodel02 <- glm(cbind(ncases,ncontrols) ~ agegp+alcgp, data=esoph,
family=binomial)
(b)glmodel2
glmodel02
#AIC for glmodel2 is 225.5, AIC for glmodel02 is 230.1
#Since AIC for glmodel2 is lower it is considered the better model
#so we should include tobacco in our model
(c)anova(glmodel02, glmodel2, test="Chi")
#The p-value is 0.0141 which is smaller than 5% so we would reject H0
#The model with tobacco significantly reduces the scaled deviance
#and is a better fit.


Answer 5
#Forward selection
(i) Null model
fit0 <- glm(B.width~1,data=data1)

(ii) First covariate


(a) Scatterplot
plot(data1)
#This plot shows categorical data as well as quantitative data.
#A quick search on the internet shows that there are lots of different
ways of displaying categorical data in a similar way to a pairs plot (some
being much more useful than this way) but all require packages that are
very unlikely to be used in the CS1 exam.
This means we have to interpret the pairs plot as it stands.
If B width could be explained by the creators we would expect the plotted
points to form three non-overlapping sections or lines. This would mean
that we want to use creators as an explanatory variable. At the other
extreme, if the plotted points formed three completely overlapping
sections or lines then B width is not being explained by the creators and we
would not want to use Creators as an explanatory variable. In practice we
are not going to see these extremes so we will have to use judgement as to
how close the plotted points are to forming separate sections or lines.
#We can see that the strongest correlation is between B.width and Creators
or B.width and B.length
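A simple optional alternative view (not required by the question) is a boxplot of B.width by creator, which makes any separation between the three groups easier to see:
boxplot(B.width~Creators, data=data1, xlab="Creator", ylab="B width")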

(b) Add Creators to null model


fit1a <- update(fit0, . ~ . + Creators)
fit1a$aic
summary(fit1a)
#AIC: -45.29
(c) Add B.length to null model
fit1b <- update(fit0, . ~ . + B.length)
fit1b$aic
summary(fit1b) #AIC: -43.59
(d) Best model
#Since AIC for fit1a (-45.29) is lower (ie more negative) it is considered
the better model
(e) Anova test
anova(fit0, fit1a, test="F")
#The p-value is 2.2e-16 which is less than 5% so we would reject H0
#The model with Creators significantly reduces the residual deviance
#and is a better fit.
(iii) Second covariate
(a) Add B.length to fit1a model
fit2 <- update(fit1a, . ~ . + B.length)
#or from scratch:
fit2 <- glm(B.width~Creators+B.length,data=data1)


(b) AIC and anova test


fit2$aic #AIC: -83.41
#Since AIC for fit2 is lower than fit1a it is considered the better model
#so we should also include B.length in our model
anova(fit1a, fit2, test="F")
#The p-value is 4.409e-10 which is less than 5% so we would reject H0
#The model with B.length significantly reduces the residual deviance
#and is a better fit.
(iv) Third covariate
(a) Add A.length to fit2 model
fit3a <- update(fit2, . ~ . + A.length)
fit3a$aic #AIC: -81.41
(b) Add A.width to fit2 model
fit3b <- update(fit2, . ~ . + A.width)
fit3b$aic #AIC: -101.6
(c) Best model
#Since AIC for fit3b (-101.6) is lower it is considered the better model
(d) anova test
anova(fit2, fit3b, test="F")
#The p-value is 1.03e-05 which is less than 5% so we would reject H0
#The model with A.width significantly reduces the residual deviance
#and is a better fit.
(v) Fourth covariate
(a) Add A.length to fit3b model
fit4 <- update(fit3b, . ~ . + A.length)
(b) AIC and anova test
fit4$aic #AIC: -104.06
#Since AIC for fit4 is lower than fit3b it is considered the better model
#so we should also include A.length in our model
anova(fit3b, fit4, test="F")
#The p-value is 0.03889 which is less than 5% so we would reject H0
#The model with A.length significantly reduces the residual deviance
#and is a better fit.
(vi) Fifth covariate
(a)fit5a <- update(fit4, . ~ . + Creators:A.width)
fit5a$aic #AIC: -109.9 improvement (compared to fit4)
(b)fit5b <- update(fit4, . ~ . + B.length:A.width)
fit5b$aic #AIC: -107.5 improvement (compared to fit4)
(c) AIC
Since AIC for fit5a is lower than fit5b it is considered the better model
#so we should also include the interaction term Creators:A.width
(d) anova test
anova(fit4, fit5a, test="F")
#The p-value is 0.009593 which is less than 5% so we would reject H0
#The model with Creators:A.width significantly reduces the residual
deviance
#and is a better fit.


(vii) Sixth covariate


(a) Add B.length:A.width to fit5a model
fit6 <- update(fit5a, . ~ . + B.length:A.width)
(b) AIC and anova test
fit6$aic #AIC: -114.2586
#Since AIC for fit6 is lower than fit5a it is considered the better model
#so we should also include B.length:A.width in our model
anova(fit5a, fit6, test="F")
#The p-value is 0.0145 which is less than 5% so we would reject H0
#The model with B.length:A.width significantly reduces the residual
deviance
#and is a better fit.

(viii) Seventh covariate


fit7 <- update(fit6, . ~ . + B.length:A.length)
fit7$aic #AIC: -116.5533
#Since AIC for fit7 is lower than fit6 it is considered the better model
#so we should also include B.length:A.length in our model
anova(fit6, fit7, test="F")
#The p-value is 0.04566 which is less than 5% so we would reject H0
#The model with B.length:A.length significantly reduces the residual
deviance
#and is a better fit.
(ix) Three-way interaction
#We can't add B.length:A.length:A.width as there's no 2-way
#interaction between A.length and A.width in the model.
#Nor can we add a three-way interaction involving Creators, as the model does not
#contain 2-way interactions between Creators and B.length or Creators and A.length.
#Forward selection shortcut
#All of parts (i) to (ix) and more could be achieved with the following
command
step(fit0,scope=~Creators*B.length*A.length*A.width,direction="forward",test="F")

#Backward selection
(x) Full model
fitA <- glm(B.width~Creators*B.length*A.length*A.width,data=data1)
fitA$aic #AIC is -109.79

(xi) Remove 4 way interaction term


fitB <- update(fitA,.~.-Creators:B.length:A.length:A.width)
summary(fitB) #AIC is -112.49

(xii) Remove 3 way interaction terms


(a)fitC1 <- update(fitB,.~.-Creators:B.length:A.width)
fitC1$aic #AIC is -113.76
(b)fitC2 <- update(fitB,.~.-Creators:B.length:A.length)


fitC2$aic #AIC is -115.13


#fitC2 leads to the greatest drop in AIC

(xiii) Remove other 3 way interaction term


summary(fitC2)
#The Creators:B.length:A.width parameter is not significant
fitD <- update(fitC2,.~.-Creators:B.length:A.width)
summary(fitD) #AIC is -117.22
#The Creators:A.length:A.width parameter is not significant

(xiv) Remove another 3 way interaction term


fitE <- update(fitD,.~.-Creators:A.length:A.width)
fitE$aic #AIC is -117.6

(xv) Remove 2 way interaction term


summary(fitE)
# The B.length:A.length:A.width parameter is significant
# But no Creators:A.length parameter is significant
fitF <- update(fitE,.~.-Creators:A.length)
fitF$aic #AIC is -120.62 improvement

(xvi) Removing other 2 way interaction terms does not improve the AIC
summary(fitF)
# The Creators:A.width parameters are not significant
# The Creators:B.length parameters are not significant
# The A.length:A.width parameter is not significant either
fitG1 <- update(fitF,.~.-Creators:A.width)
summary(fitG1) #AIC is -116.14 worse
fitG2 <- update(fitF,.~.-Creators:B.length)
summary(fitG2) #AIC is -118.8 worse
fitG3 <- update(fitF,.~.-A.length:A.width)
summary(fitG3) #AIC is -118.61 worse

#Backward selection shortcut


#All of parts (xi) to (xiv) and more could be achieved with the following
command:

step(fitA,scope=~Creators*B.length*A.length*A.width,direction="backward",test="F")


Answer 6
data1 <- read.csv("data1.csv", header = TRUE, sep = ",", quote = "\"", dec
= ".")
glmodel1 <- glm(B.width~A.length+A.width+B.length+Creators,data=data1,fami
ly=gaussian (link="identity"))

(i) Linear predictor


(a) From 1st principles
#A length 5.1cm, A width 3.5cm, B length 1.4cm and creator Matt:
coef(glmodel1)[1]+coef(glmodel1)[2]*5.1+coef(glmodel1)[3]*3.5+coef(glmodel
1)[4]*1.4+coef(glmodel1)[5]
#0.8877986
(b) Using the predict function
newdata <-data.frame(A.length=5.1,A.width=3.5,B.length=1.4,Creators="Matt"
)
predict(glmodel1,newdata)
#or
predict(glmodel1,newdata,type="link")
#0.8877986

(ii) Response variable


(a) Explain why expected response is same as linear predictor
#The canonical link function for the normal distribution is the identity f
unction
#Hence the mean response variable is equal to the linear predictor
(b) Using the predict function
predict(glmodel1,newdata,type="response")
Expected B.width = 0.8877986 cm
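An optional check that the link and response scales really do coincide for this model:
all.equal(predict(glmodel1,newdata,type="link"), predict(glmodel1,newdata,type="response"))
#TRUE, since the identity link is its own inverse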
(iii) Creator Tom
#It has been absorbed into the intercept parameter.

(iv) Response variable for creator Tom


(a) From 1st principles
#A length 5.1cm, A width 3.5cm, B length 1.4cm and creator Tom:
coef(glmodel1)[1]+coef(glmodel1)[2]*5.1+coef(glmodel1)[3]*3.5+coef(glmodel
1)[4]*1.4
#0.2396861cm
(b) Using the predict function
newdata <-data.frame(A.length=5.1,A.width=3.5,B.length=1.4,Creators="Tom")
predict(glmodel1,newdata,type="response")
#0.2396861cm


Answer 7
(i) Raw residuals
(a) From 1st principles
attach(data1)
B.width-fitted(glmodel1)
#or
data1[,4]-fitted(glmodel1)
#answer: -0.0396860931,...., -0.1867598541
(b) Using residuals command
resid(glmodel1,type="response")
#answer: -0.0396860931,...., -0.1867598541
#Note: glmodel1$resid gives the working residuals rather than the raw (response)
#residuals, although for a normal model with an identity link these coincide.

(ii) Standardised residuals


(a) Deviance residuals
glmodel1$resid
resid(glmodel1)
resid(glmodel1, type="deviance")
#answer: -0.0396860931,..., -0.1867598541
(b) Pearson residuals
resid(glmodel1, type="pearson")
#answer: -0.0396860931,..., -0.1867598541
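An optional confirmation that all three residual types coincide here:
all.equal(resid(glmodel1,type="response"), resid(glmodel1,type="deviance"))
all.equal(resid(glmodel1,type="response"), resid(glmodel1,type="pearson"))
#both return TRUE for a normal model with the identity link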

(iii) Skewness of residuals


#summary gives min, max and quartiles of deviance residuals
summary(glmodel1)
# Min 1Q Median 3Q Max
#-0.59239 -0.08288 -0.01349 0.08773 0.45239
#Median is nearly zero, lower and upper quartiles have nearly equal absolute
values.
#So middle 50% of the data is nearly symmetrical.

(iv) Plot the residuals against the fitted values and comment
(a)plot(glmodel1,1)


(b)#Line fairly horizontal - so variance of residuals is fairly constant


#So normal model is appropriate

(v) Q-Q plot of residuals and comment


(a)plot(glmodel1,2)

(b)#The middle section is good but there are issues in the extremes.
#the residuals at the lower end are more negative than expected - so the
fitted values are too large
#the residuals at the upper end are more positive than expected - so the
fitted values are too small

#Final two graphs obtained by plot(model)


plot(glmodel1,3)


#The variance is clearly increasing - implying a defect in our model


#Interaction terms may resolve this problem
plot(glmodel1,5)

#No data points have undue influence on our model.


ASSIGNMENT – Bayesian Statistics


Question 1. The probability of a person dying from a particular disease is p. The prior
distribution of p is beta with parameters a = 2 and b = 3.

(i) (a) Create a vector x which contains 1,000 zeros.

(b) Use a loop to obtain 1,000 simulations of the posterior outcome


(where 1 denotes death and 0 denotes survival) for a single person. Use
the functions set.seed(77), rbeta and rbinom and store the i th outcome in
the i th element of x.

(c) Hence, obtain an empirical Bayesian estimate for p under quadratic


loss.

The Bayesian estimate for p under quadratic loss for a single outcome x is (x + a)/(a + b + 1).

(ii) (a) Create a vector pm which contains 1,000 zeros.

(b) Repeat part (i)(b) but also store the ith theoretical Bayesian estimate in
the ith element of pm.

(c) Compare the average empirical and theoretical Bayesian estimates


under quadratic loss.

A biologist is now going to analyse samples of 12 people.

(iii) (a) Create a vector xp which contains 1,000 zeros.

(b) Use a loop to obtain 1,000 simulations of the posterior probability of


death, based on 1,000 samples each of 12 people. Use the functions
set.seed(79), rbeta and rbinom and store the estimate for the probability
in the i th outcome in the i th element of xp.

(c) Hence, obtain an empirical Bayesian estimate for p under quadratic


loss.

The Bayesian estimate for p under quadratic loss, given x deaths in a sample of size n,
is (x + a)/(a + b + n).


(iv) (a) Repeat part (iii)(b) but also store the i th theoretical Bayesian estimate
in the i thelement of pm.

(b) Compare the average empirical and theoretical Bayesian estimates


under quadratic loss.

Question 2. Consider the n = 30 independent and identically distributed observations (y1, y2, ..., yn)
given below from a random variable Y with probability distribution f(y, θ) = θ^y e^(-θ) / y!.

You can enter the y values into R by using:


y=c(5,5,6,2,4,10,2,5,5,2,5,3,7,4,4,5,4,6,7,2,8,4,6,4,3,6,6,6,5,7)

By assuming a prior distribution proportional to e^(-αθ), we can show that the posterior
distribution of θ is:

f(θ | y1, y2, ..., yn) ∝ θ^(Σ yi - 2) e^(-(n + α)θ)

We can observe that the posterior distribution of θ is Gamma with parameters
Σ yi - 1 and n + α (the sums being taken over i = 1, ..., n).

(i) (a) Plot the posterior probability density function of θ for values of θ in the interval
[3.2, 6.8] and assuming a = 0.01. [Hint: the range of values of θ can be
obtained in R by seq(3.2, 6.8, by = 0.01).]
(b) Carry out a simulation of N = 5,000 posterior samples for the parameter θ using
seed 100.

(ii) Plot the histogram of the posterior distribution of θ.

(iii) Calculate the mean, median and standard deviation of the posterior distribution
of θ.

Two possible values for the true value of parameter θ are θ =15 and θ = 5.
(iv) Comment on these two values based on the posterior distribution of θ plotted
in part (ii) and summarised in part (iii).


ANSWERS
Answer 1
(i) Binomial/beta posterior (sample size 1)
a <- 2
b <- 3
(a) x <- rep(0,1000)
(b)set.seed(77)
for (i in 1:1000)
{p <- rbeta(1,a,b)
x[i] <- rbinom(1,1,p)}
(c)mean(x) #answer: 0.382

(ii) Compare empirical mean to theoretical mean


(a)pm <- rep(0,1000)
(b)set.seed(77)
for (i in 1:1000)
{p <- rbeta(1,a,b)
x[i] <- rbinom(1,1,p)
pm[i] <- (x[i]+a)/(a+b+1)}
(c)mean(x) #answer: 0.382
mean(pm) #answer: 0.397
#About 4% difference
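A short optional cross-check of these figures:
#averaging the theoretical estimate over the prior gives
#(E[X]+a)/(a+b+1) = (a/(a+b)+a)/(a+b+1) = (0.4+2)/6 = 0.4,
#so both simulated averages are close to the value we would expect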

(iii) Binomial/beta posterior (sample size n)


n <- 12
(a)xp <- rep(0,1000)
(b)set.seed(79)
for (i in 1:1000)
{p <- rbeta(1,a,b)
x <- rbinom(1,n,p)
xp[i] <- x/n}
#Note: Could also do x <- rbinom(n,1,p)
(c)mean(xp)
#answer: 0.4097 (or 0.4023 if use rbinom(n,1,p))

(iv) Compare empirical mean to theoretical mean


(a)set.seed(79)
for (i in 1:1000)
{p <- rbeta(1,a,b)
x <- rbinom(1,n,p)
xp[i] <- x/n
pm[i] <- (x+a)/(a+b+n)}
(b)#The average of our 1,000 empirical posterior means is:
mean(xp)
#answer: 0.4097 (or 0.4023 if use rbinom(n,1,p))
#The average of our 1,000 theoretical posterior means is:
mean(pm)
#answer: 0.4068 (or 0.4016 if use rbinom(n,1,p))
#very similar - less than 1% difference


Answer 2
(i)(a) ## Data entry
y = c(5, 5, 6, 2, 4, 10, 2, 5, 5, 2, 5, 3, 7, 4, 4, 5, 4, 6, 7, 2, 8, 4, 6, 4, 3, 6, 6, 6, 5, 7)
## plot the posterior pdf of theta
theta = seq(3.2, 6.8, by = 0.01)
plot(theta, dgamma(theta, sum(y)-1, length(y) + 0.01), ylab ="Density",
type = "l")

(b) set.seed(100)
x = rgamma(5000, sum(y)-1, 30 + 0.01)

(ii) hist(x, main="Posterior distribution of theta",xlab="theta")

(iii) mean(x) # 4.893173


median(x) # 4.879693
sd(x) # 0.3988416
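An optional addition that helps with part (iv): an approximate 95% equal-tailed credible interval for θ from the same samples:
quantile(x, c(0.025, 0.975))
#roughly (4.1, 5.7) for this seed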

(iv) 15 lies well outside the range of the posterior samples for θ, whereas 5 sits
close to the centre of the posterior distribution. So 5 is a plausible value
for the true parameter, while 15 is very unlikely unless there has been a
calculation error.


ASSIGNMENT – Credibility Theory


Question 1. The probability of a person dying from a particular disease is p. The prior
distribution of p is beta with parameters a = 2 and b = 3.
A statistician is going to calculate a credibility estimate of p:

Z × (x/n) + (1 - Z) × a/(a + b)

where the credibility factor is:

Z = n/(n + a + b)

The statistician is going to take samples of 5 people to calculate the credibility


estimate.

(i) (a) Create a vector cp which contains 1,000 zeros.

(b) Use a loop to obtain 1,000 simulations of the posterior probability of


death, based on 1,000 random samples each containing 5 people. Use the
functions set.seed(79), rbeta and rbinom and store the credibility estimate
of the i th outcome in the i th element of cp.

(ii) Plot a labelled bar chart of the simulated credibility estimates for p using
the functions barplot and table.

(iii) Calculate the mean and standard deviation of the empirical credibility
estimates.


Answer 1
(i)

a <-2
b <-3
n <-5
Z <-n/(n+a+b)
(a) cp <- rep(0,1000)
(b) set.seed(79)
for (i in 1:1000){
p <- rbeta(1,a,b)
x <- rbinom(1,n,p)
cp[i] <- Z*x/n + (1-Z)*a/(a+b)
}
(ii) x <- seq(0.2, 0.7, by=0.1)
barplot(table(cp),names=x, xlab="credibility
premium",ylab="frequency", main="bar chart of credibility
premiums")

#answer: frequencies of 158, 243, 216, 210, 140, 33

(iii) mean(cp) #answer 0.403


sd(cp) #answer: 0.1393931
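A short optional cross-check of the simulated mean:
#averaging the credibility estimate over the prior gives Z*E[p] + (1-Z)*a/(a+b) = a/(a+b) = 0.4,
#which the simulated mean of 0.403 is close to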


ASSIGNMENT – EBCT
Question 1. The table below shows the aggregate claim amounts (in £m) made by 4 companies
from an insurer in a 5 year period:-

Year

Company 1 2 3 4 5

A 48 53 42 50 59

B 64 71 64 73 70

C 85 54 76 65 90

D 44 52 69 55 71

This data is contained in the file “ebct1.txt” .

(i) Load the data frame and store it in the matrix “amounts”.

(ii) Store the number of years and number of companies in the objects n and N,
respectively.

An actuary is using EBCT Model 1 to set premiums for the coming year.

(iii) (a) Use mean and rowMeans (or otherwise) to calculate an estimate of
E[m(θ)] and store it in the object m.

(b) Use apply, var and mean to calculate an estimate of E[s2(θ)] and store it in
the object s.

(c) Use var and rowMeans (or otherwise) and your result from part (iii)(b) to
calculate an estimate of var[m(θ)] and store it in the object v.

(iv) Use your results from parts (ii) and (iii) to calculate the credibility factor and
store it in the object Z.

(v) Calculate the EBCT premiums for each of the four companies.


Question 2. This question uses the data from previous question also.
The table below shows the volumes of business for each company in each year for
the insurer.

Year

Company 1 2 3 4 5

A 12 15 13 16 10

B 20 14 22 15 30

C 5 8 6 12 4

D 22 35 30 16 10

This data is contained in the file : “ebct2.txt”

(i) Load the data frame of volumes and store it in the matrix “volume”.

An actuary is using EBCT Model 2 to set premiums for the coming year.

(ii) Calculate the claims per unit of risk volume and store them in the matrix X.

(iii) (a) Use rowSums to calculate the total policies for each company and store
them in the object Pi.

(b) Use sum to calculate the overall total policies for all companies and
store it in the object P.

(c) Use ncol, nrow and sum to calculate P* = (1/(Nn - 1)) × Σ Pi(1 - Pi/P) and
store it in the object Pstar.

(iv) (a) Calculate E[m(θ)] and store it in the object m.

(b) Use rowSums to calculate the mean claims per policy for each company
and store it in the object Xibar.

(c) Use rowSums and mean to calculate E[s2(θ)] and store it in the object s.

(d) Use sum and rowSums and your result from part (iii)(c) to calculate
var[m(θ)] and store it in the object v.


(v) Use your results from parts (iii) and (iv) to calculate the credibility factor for
each company and store the values in the object Zi.

(vi) If the volumes of business for each company for the coming year are 20, 25,
10 and 12, respectively, calculate the EBCT Model 2 premiums for each of
the four companies.


ANSWERS
Answer 1
(i)#set working directory to where the data is stored
#could store in data frame but question asks for matrix
amounts<-as.matrix(read.table("ebct1.txt",header=TRUE))

(ii)n<-ncol(amounts)
n
N<-nrow(amounts)
N # n=5, N=4

(iii)(a) Calculate E[m(theta)] and store in m


#in one go
m <-mean(rowMeans(amounts))
m
#in one go if matrix
m<-mean(amounts)
#in steps
row.mean <- rowMeans(amounts)
m<-mean(row.mean)
m # m = 62.75
(b) Calculate E[s²(theta)] and store in s
#in one go
s <-mean(apply(amounts,1,var))
#in steps
row.var<-apply(amounts,1,var)
row.var
s<-mean(row.var)
s # s = 101.2
(c) Calculate var[m(theta)] and store in v
#in one go
v<-var(rowMeans(amounts))-s/n
v
#using intermediate results
v<-var(row.mean)-s/n
v # v = 90.33
(iv) Z<-n/(n+s/v)
Z #z = 0.8169485

(v) Calculate credibility premiums


#in one go
Z*rowMeans(amounts)+(1-Z)*m
#using intermediate objects
Z*row.mean+(1-Z)*m
#Company A premium = 52.66
#Company B premium = 67.37
#Company C premium = 71.94
#Company D premium = 59.03
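An optional sanity check (assuming the intermediate objects row.mean, Z and m created above are available): each credibility premium should lie between the company's own mean and the overall mean m:
cbind(company.mean=row.mean, premium=Z*row.mean+(1-Z)*m, overall.mean=m)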


Answer 2
amounts<-as.matrix(read.table("ebct1.txt",header=TRUE))
(i) Load dataframe and store in matrix volume
#could store in data frame but question asks for matrix
volume<-as.matrix(read.table("ebct2.txt",header=TRUE))

(ii) Claims per unit of risk volume


X <- amounts/volume
X

(iii) policy volume totals


(a) total policies for each company
Pi <-rowSums(volume)
Pi
#answer: 66,101,35,113
(b) total policies for all companies
P <- sum(Pi)
#or
P <- sum(volume)
P
#answer: 315
(c) calculate P*
n<-ncol(amounts)
N<-nrow(amounts)
Pstar <- sum(Pi*(1-Pi/P))/(N*n-1)
Pstar
#answer 11.80852
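A short optional arithmetic check of this figure:
#sum(Pi*(1-Pi/P)) = 66*(1-66/315) + 101*(1-101/315) + 35*(1-35/315) + 113*(1-113/315) ≈ 224.4
#and 224.4/(4*5-1) ≈ 11.81, agreeing with Pstar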

(iv)(a) E[m(theta)] mean claims per policy for all companies (X bar)
m<-sum(amounts)/P
m
#answer 3.984127
(b) Mean claims per policy for each company (Xi bar)
Xibar<-rowSums(amounts)/rowSums(volume)
Xibar
#answer 3.818, 3.386, 10.57, 2.575
(c) E[s²(theta)]
#in one go
s <-mean(rowSums(volume*(X-Xibar)^2)/(n-1))
s
#in steps
row.var1<-rowSums(volume*(X-Xibar)^2)/(n-1)
s<-mean(row.var1)
s
#s = 104.642
(d) var[m(theta)]
#in one go
v<-(sum(rowSums(volume*(X-m)^2))/(n*N-1)-s)/Pstar
v


#in steps
row.var2<-rowSums(volume*(X-m)^2)
v<-(sum(row.var2)/(n*N-1)-s)/Pstar
v
#v = 6.538782

(v) Credibility factors


Zi<-Pi/(Pi+s/v)
Zi
#answer 0.8048, 0.8632, 0.6862, 0.8759

(vi) Credibility premiums


#the credibility premium per unit of risk volume for each company is
cred.prem <- Zi*Xibar+(1-Zi)*m
cred.prem
#answer 3.851, 3.468, 8.505, 2.750
#Storing the risk volumes for the coming year in new.volume
new.volume <- c(20,25,10,12)
#the credibility premium for each company for the coming year is
cred.prem*new.volume
#answer 77.0, 86.7, 85.0, 33.0

