CS2B Summary Sheets 2021
barplot(dpois(0:<n>, <rate>),
main = "PMF of Pois(<rate>)",
xlab = "n",
ylab = "P(N = n)",
names.arg = 0:<n>)
For the same process and times, to calculate the probability of at least a certain number of
arrivals, <n>:
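Since P(N >= <n>) = 1 - P(N <= <n> - 1), a possible calculation is:
ppois(<n> - 1, <rate>, lower.tail = FALSE)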
Quantiles of arrivals
We can calculate quantiles of the distribution of arrivals between two times, <s> and <t>, by
using the qpois() function with the following structure:
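A possible structure, assuming the process has rate <lambda>, so that the number of arrivals in (<s>, <t>] has a Poisson distribution with mean <lambda> * (<t> - <s>), and <probability> is the required quantile level:
qpois(<probability>, <lambda> * (<t> - <s>))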
dexp(<t>, <lambda>)
To plot the PDF of the inter-arrival time distribution from 0 to <t> using an increment of <h>:
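For example (the object name times is assumed):
times = seq(0, <t>, by = <h>)
plot(times, dexp(times, <lambda>), type = "l")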
pexp(<t>, <lambda>)
To calculate the probability of the next arrival being greater than or equal to <t> time units we
can use the pexp() function with the following structure:
1 - pexp(<t>, <lambda>)
pexp(<t>, <lambda>, lower.tail = FALSE)
rexp(<n>, <lambda>)
set.seed(<seed>)
(arrivals = rpois(<sims>, <lambda> * <t>))
total = numeric(<sims>)
for(i in 1:length(arrivals)){
value = rbinom(arrivals[i], <n>, <p>)
total[i] = sum(value * <y>)
}
Note that you may need to adjust this for different distributions for the value associated with
each arrival. More detail on compound Poisson processes can be found in Chapter 19.
hist(total)
mean(total)
sd(total)
Estimating the probability that the total value is greater than <x>:
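A possible calculation, using the proportion of simulated totals above <x>:
length(total[total > <x>]) / length(total)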
getwd()
setwd("<folder location>")
claims = read.csv("<data.csv>")
claims
head(claims)
claims[6,]
claims[6,2]
claims$claim.type
Selecting rows that meet a condition, for example all house insurance claims:
claims[claims$claim.type == "House", ]
<L> = nrow(<claims>)/<t>
Future arrivals
Calculating the probability of exactly <n> arrivals over a period of <t> time units:
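For example, using the estimated rate <L> from above:
dpois(<n>, <L> * <t>)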
CDF-type probability calculations, quantile calculations and simulations can be performed with
ppois(), qpois() and rpois() respectively.
Calculating the probability of the next arrival being less than or equal to <t> time units:
pexp(<t>, <L>)
Estimating the parameter for each type, over an observation period of <t> time units:
LM = nrow(motor) / <t>
LH = nrow(house) / <t>
Creating vectors
To create row vectors of data:
data = c(<value 1>, <value 2>, …, <value n>)
Creating matrices
Creating a matrix:
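A general structure (placeholder names assumed) is:
matrix(<data vector>, nrow = <rows>, ncol = <columns>, byrow = TRUE)
The diag() function below creates an <n> × <n> identity matrix.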
diag(<n>)
Summing rows
To sum rows:
rowSums(<matrix>)
Matrix multiplication
Use the %*% operator:
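For example:
<matrix 1> %*% <matrix 2>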
Alternatively:
install.packages("markovchain")
library("markovchain")
is.irreducible(<mc>)
period(<mc>)
Write a function with a for loop, set M as the identity matrix, and for each j in the loop,
multiply M by P . Return M as the output of the function:
Pn = function(P,n) {
M = diag(nrow(P))
for (j in 1:n) {
M = M %*% P
}
M
}
Package functionality
Use the power operator to calculate powers of the transition matrix using markovchain objects.
For example, to calculate the <n>-step transition probabilities for the markovchain object <mc>:
<mc> ^ <n>
Use the %*% operator to matrix multiply the starting distribution by the relevant power of the
transition matrix:
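For example, using the Pn() function defined above:
<d> %*% Pn(<matrix>, <n>)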
We could also construct a function that takes as inputs the starting position, the transition matrix
and the number of steps, n , and returns the expected distribution after that many steps.
A possible function that calculates this, without using the matrix power function, is as follows:
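A possible sketch (the function and argument names are assumed):
dist.n = function(start, P, n) {
  position = start
  for (j in 1:n) {
    position = position %*% P
  }
  position
}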
Package functionality
To calculate the expected distribution after <n> steps with a given starting distribution, <d>, we
can use the following command structure:
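markovchain objects can be multiplied by a numeric starting distribution, so a possible structure is:
<d> * <mc> ^ <n>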
Raise the transition matrix to a very large power to find the long-term behaviour of the chain:
Pn(<matrix>, <power>)
Alternatively, create a row vector of any starting position and multiply it by the matrix many
times, using a for loop:
Package functionality
steadyStates(<mc>)
To calculate the expected total income over <n> years for any given starting distribution <d>,
discount vector <discount> and premium <premium>, we can use the following commands:
discount = <discount>
income = <premium> * (1 - discount)
position = <d>
prem.inc = 0
for (j in 1:<n>) {
  prem.inc = prem.inc + sum(position * income)
  position = position %*% <matrix>
}
prem.inc
For a time-inhomogeneous model, the same calculation can be wrapped in a function, say exp.prem <- function(age, start, n), initialising prem.inc = 0 and using the one-step transition matrix for each age inside the for loop.
Package functionality
To calculate the total expected premium income over the next <n> years with income
vector <income>, we can use the expectedRewards() function with the following structure:
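A possible structure (the second argument may need to be <n> - 1, since expectedRewards() includes the reward at time 0):
expectedRewards(<mc>, <n> - 1, <income>)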
If we instead want to consider the expected income for a specific starting distribution, <d>, we
can use the following structure:
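For example, weighting the state-by-state results by the starting distribution:
sum(<d> * expectedRewards(<mc>, <n> - 1, <income>))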
Leave out the t0 argument for a start state to be chosen at random. Leave out the include.t0
argument to exclude the start state in the output (or set it to FALSE).
prob1 = function(x){
0.015*x
}
prob2 = function(x){
0.02*x
}
We can then use these to calculate the one-step transition matrix as a function. For example:
Px = function(x) {
matrix(c(1 - prob1(x), prob1(x),
1 - prob2(x), prob2(x)),
nrow = 2, ncol = 2, byrow = TRUE)
}
Alternatively, we could code the probabilities directly into the matrix function. For example:
Px = function(x) {
matrix(c(1 - 0.015*x, 0.015*x,
1 - 0.02*x, 0.02*x),
nrow = 2, ncol = 2, byrow = TRUE)
}
position = <d>
for (j in <a>:(<a> + <b> - 1)) {
position = position %*% Px(j)
}
position
length(<journey>)
table(<journey>)
doubles = character(length(<journey>) - 1)
for (i in 1:(length(doubles))){
doubles[i] = paste(<journey>[i], <journey>[i+1], sep = "")
}
triplets = character(length(<journey>) - 2)
for (i in 1:(length(triplets))){
triplets[i] =
paste(<journey>[i], <journey>[i+1], <journey>[i+2], sep = "")
}
triplets = paste(doubles[-length(doubles)],
<journey>[-(1:2)], sep = "")
Calculating ni
(ni = table(<journey>[-length(<journey>)]))
Calculating nij
(nij = table(doubles))
Calculating p̂ij
pij = numeric(length(nij))
names(pij) = names(nij)
for(AB in names(nij)){
pij[AB] = nij[AB] / ni[substr(AB, 1, 1)]
}
library(markovchain)
markovchainFit(<journey>)$estimate
(nijk = table(triplets))
triplets
ABA ABC ACA ACB BAB BAC BCA BCB CAB CAC CBA CBC
47 36 66 157 59 147 34 98 24 76 159 96
Calculating tnij
(tnij = table(doubles[-length(doubles)]))
trip.table = as.matrix(nijk)
trip.table = cbind(trip.table, NA)
colnames(trip.table) = c("Observed", "Expected")
for(ABC in rownames(trip.table)){
AB = substr(ABC, 1, 2)
BC = substr(ABC, 2, 3)
tnAB = tnij[AB]
pBC = pij[BC]
trip.table[ABC, "Expected"] = tnAB * pBC
}
To perform a χ² test at the 5% significance level, assuming all possible triplets were observed:
O = trip.table[,"Observed"]
E = trip.table[,"Expected"]
sum((O-E)^2/E)
r = length(names(nijk))
q = length(names(nij))
s = length(unique(<journey>))
DOF = r - 2*q + s
qchisq(0.95, DOF)
rowSums(A)
mu12 = 0.4
mu13 = 0.6
mu21 = 0.5
mu23 = 1
mu31 = 0.8
mu32 = 1.2
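A possible construction of the generator matrix A from these rates (the state labels are assumed), for use in the approximation P(h) ≈ I + Ah below:
states = paste("State", 1:3)
A = matrix(c(-(mu12 + mu13), mu12, mu13,
             mu21, -(mu21 + mu23), mu23,
             mu31, mu32, -(mu31 + mu32)),
           nrow = 3, ncol = 3, byrow = TRUE,
           dimnames = list(states, states))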
where:
A is the generator matrix
I is the identity matrix, and
h = 1/12
Ph = diag(3)+A*h
Ph
Pn = function(P, n) {
M = diag(nrow(P))
for (j in 1:n) {
M = M %*% P
}
M
}
Pn(Ph, 240)
library(markovchain)
mc <- new("markovchain", transitionMatrix = Ph,
states = states, name = "my.mc")
mc ^ 240
Ph = function(h, A){
diag(nrow(A)) + h * A
}
Creating a function to calculate P(t) for a given value of t, time increment h and generator
matrix A, by using our Ph() and Pn() functions:
Pt = function(t, h, A){
n = t / h
Pn(Ph(h, A), n)
}
Alternatively, we can calculate using the markovchain package:
Pt = function(t, h, A){
mc <- new("markovchain", transitionMatrix = Ph(h,A),
states = states, name = "my.mc")
mc^(t/h)
}
A function that outputs the generator matrix for the jump process as a matrix in R. The function should take the two rates as inputs, λ and μ (number of states = 0 to 8):
state <- 0:8
gen.matrixv <- function(lambda, mu){
  mat <- matrix(0, nrow = 9, ncol = 9)
  for (k in 1:9){
    if (k > 1){
      mat[k, k - 1] <- (k - 1) * mu
    }
    if (k < 9){
      mat[k, k + 1] <- lambda
    }
  }
  mat <- mat + diag(-rowSums(mat))
  dimnames(mat) <- list(state, state)
  mat
}
Creating a markovchain object with transition matrix constructed from the probabilities over
the period h and generator matrix A:
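A possible sketch, also simulating a path of <n steps> transitions from the resulting chain:
mc <- new("markovchain", transitionMatrix = Ph(<h>, A), states = states, name = "my.mc")
rmarkovchain(<n steps>, mc, include.t0 = TRUE)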
To force the path to start at a particular state, "Start state", we can include
t0 = "Start state" in the function.
Exact method
(P = generatorToTransitionMatrix(A))
Calculating the λi:
ljs = abs(diag(A))
ljs[sims]
states = paste("State", 1:3)
A = function(t){
  matrix(c( -t, 0.4*t, 0.6*t,
            0.5*t, -1.5*t, t,
            0.8*t, 1.2*t, -2*t),
         nrow = 3, ncol = 3, byrow = TRUE,
         dimnames = list(states, states))
}
Alternatively, the individual entries can be assigned first inside the function (sigma12 = 0.4*t, sigma13 = 0.6*t, sigma11 = -t, and so on) and then collected into the same matrix.
$$P(s,t) \approx \prod_{j=0}^{\frac{t-s}{h}-1} P\big(s+hj,\; s+hj+h\big)$$
Pst = function(s,t,h){
M = diag(length(states))
dimnames(M) = list(states,states)
n = (t - s)/h
for (j in 1:n){
M = M %*% Ph(h, A(s))
s <- s+h
}
M
}
1.1 Calculating μx
To calculate the force of mortality when mortality follows Makeham's law with constants <A>,
<B> and <C>, ie $\mu_x = A + Bc^x$:
A = <A>
B = <B>
C = <C>
mu = function(x){
A + B * C ^ x
}
tpx = function(x, t){
  s ^ t * g ^ (C ^ x * (C ^ t - 1))
}
where s = exp(-A) and g = exp(-B / log(C)).
Calculating the probability that a life aged <x> lives for at least another <t> years:
tpx(<x>, <t>)
Calculating the survival probability that a life aged <x> survives for at least another <t> years for
each <t> in a vector of times, <time vector>:
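For example, wrapping tpx in sapply():
sapply(<time vector>, function(t) tpx(<x>, t))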
Constructing a function to calculate $_tp_x = \exp\left(-\int_x^{x+t} \mu_s \, ds\right)$, using a function for the force of mortality, <mu>:
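A possible implementation (assuming <mu> is vectorised, as integrate() requires):
tpx = function(x, t){
  exp(-integrate(<mu>, x, x + t)$value)
}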
1.4 Calculating tqx
To calculate tqx, the probability that a life aged <x> does not survive a further <t> years,
assuming we have a function, tpx, that calculates tpx:
1 - tpx(<x>, <t>)
1.5 Calculating mx
To calculate $m_x = q_x \Big/ \int_0^1 {}_tp_x \, dt$ for a life aged <x>, assuming we have a function for tpx, tpx:
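A possible calculation, wrapping tpx in sapply() so that integrate() receives a vectorised function:
qx = 1 - tpx(<x>, 1)
qx / integrate(function(t) sapply(t, function(s) tpx(<x>, s)), 0, 1)$value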
Or:
Alternatively, to plot a <function> with fixed arguments <fixed arguments> over the input
range <x values> for the varying argument, <arg>:
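A possible structure is:
plot(<x values>,
     sapply(<x values>, function(<arg>) <function>(<arg>, <fixed arguments>)),
     type = "l")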
dexp(<t>, <mu>)
pexp(<t>, <mu>)
To calculate the 100<u>th percentile of Tx, ie the value of t such that P(Tx ≤ t) = <u>:
qexp(<u>, <mu>)
rexp(<n>, <mu>)
In R, the Weibull parameters are shape = γ and scale = c^(-1/γ).
To calculate the 100<u>th percentile of Tx, ie the value of t such that P(Tx ≤ t) = <u>:
ex
To calculate the expected future lifetime under the exponential model with parameter <mu> using
integration:
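A possible calculation, integrating the survival function numerically:
integrate(function(t) pexp(t, <mu>, lower.tail = FALSE), 0, Inf)$value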
or, directly:
1/<mu>
ex
To calculate the curtate expected future lifetime under the exponential model using an
appropriate upper limit, <upper limit>:
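A possible calculation, summing the survival probabilities at integer durations:
sum(pexp(1:<upper limit>, <mu>, lower.tail = FALSE))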
exp(-<mu>) / (1 - exp(-<mu>))
To calculate the curtate expected future lifetime under the Weibull model using an appropriate
upper limit, <upper limit>:
In this case, it is Tstart age that follows the Gompertz distribution with parameters B and c or,
in R, parameters shape = ln(c) and rate = B. The distribution of Tstart age + a then follows the
Gompertz distribution with the same shape, ln(c), and rate B·c^a.
Throughout the rest of this section, <shape> and <rate> are used to denote appropriately
calculated shape and rate parameters for Tx for the case where Tstart age follows the
Gompertz with parameters <B> and <C>.
Alternatively:
To calculate the t px , the probability that a life aged <x> survives at least another <t> years using
the integrated hazard:
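A possible calculation, using the Gompertz integrated hazard Λ(t) = (<rate>/<shape>)(exp(<shape>·t) - 1):
exp(-(<rate> / <shape>) * (exp(<shape> * <t>) - 1))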
ex
To calculate the expected future lifetime under the Gompertz model for a life aged <x>:
ex
To calculate the curtate expected future lifetime using an appropriate upper limit,
<upper limit>:
surv.data = read.csv("<data.csv>")
Calculating tj and dj together from a data set, <surv data>, assuming it contains a column of
event times, <event times>, and a column of observation end reasons (say ‘Death’ or
‘Censoring’), <end reason>:
(deaths = as.data.frame(table(
  <surv data>$<event times>[<surv data>$<end reason> == "Death"])))
names(deaths) = c("tj", "dj")
deaths$tj = as.numeric(as.vector(deaths$tj))
or:
1.4 Calculating nj
Calculating n j from a table (with no lives being left truncated) containing t j and d j , <deaths>,
and a data set, <surv data>, containing a column of event times, <event times>:
or:
or:
<deaths>$nj = numeric(nrow(<deaths>))
for (i in 1:nrow(<deaths>)){
  <deaths>$nj[i] = sum(<surv data>$<event times> >= <deaths>$tj[i])
}
This is used when there can be more than one death or censoring at the same time:
deaths$nj <- sapply(deaths$tj,function(j){
sum(surv.data$Events[surv.data$Time>=j])})
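The Nelson-Aalen increments λ̂j = dj/nj are needed first, for example:
<deaths>$lamj = <deaths>$dj / <deaths>$nj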
<deaths>$Lj = cumsum(<deaths>$lamj)
<deaths>$sna = exp(-<deaths>$Lj)
This part assumes that the last observation is a censoring at time <last obs>.
Creating an appropriate vector of the survival function estimate from a table, <deaths>,
containing Ŝ(t) in the column <s est>:
This part assumes that the last observation is a death at time <last obs>.
Creating an appropriate vector of the survival function estimate from a table, <deaths>,
containing Ŝ(t) in the column <s est>:
To tell R that the next plot() command should be on the same chart:
par(new = TRUE)
Alternatively, using the lines() function to add another survival curve estimate at times
<times> with values <surv est>:
Calculating standard error estimates for the survival function from a table, <deaths>, containing ŜKM(t) in a
column skm as well as nj and dj:
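A possible calculation, using Greenwood's formula:
<deaths>$se = <deaths>$skm *
  sqrt(cumsum(<deaths>$dj / (<deaths>$nj * (<deaths>$nj - <deaths>$dj))))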
Confidence intervals
Calculating the relevant percentage point for a 100(1 - <alpha>)% level symmetric confidence interval:
z = qnorm(1 - <alpha> / 2)
KM_CI$lower_95 = pmax(KM_CI$lower_95, 0)
KM_CI$upper_95 = pmin(KM_CI$upper_95, 1)
Calculating standard error estimates for the integrated hazard from a table containing n j and d j ,
<deaths>:
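A possible calculation, using Var[Λ̂(t)] ≈ Σ dj(nj - dj)/nj³:
<deaths>$se = sqrt(cumsum(<deaths>$dj * (<deaths>$nj - <deaths>$dj) / <deaths>$nj ^ 3))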
Calculating the relevant percentage point for a 100(1 - <alpha>)% level symmetric confidence interval:
z = qnorm(1 - <alpha> / 2)
Assuming that the table <deaths> also has a column containing Λ̂(t) called Lj, calculating the
approximate confidence intervals for the integrated hazard:
Assuming that the table <deaths> also has a column containing ŜNA(t) called sna, calculating
the approximate confidence intervals for the survival function, using appropriate bounds:
To create a vector of event codes (1’s and 0’s) from a vector of reasons that observation ended
for each life, <reasons>, that contains, for each life, either an indicator of death, <death>, or
right censoring, <censoring>:
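A possible construction is:
<event codes> = ifelse(<reasons> == "<death>", 1, 0)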
To create a survival object from a vector of event times, <event times>, and a vector of
event codes, <event codes>, that, for each life, contains either the value ‘1’ (indicating that the
event was a death) or ‘0’ (indicating that the life was right censored):
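Using the survival package:
library(survival)
Surv(<event times>, <event codes>)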
To calculate the Kaplan-Meier estimate of the survival function given a survival object,
<survival object>, without any grouping:
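A possible call (the object name fitKM is assumed, to match the summary below):
fitKM = survfit(<survival object> ~ 1)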
summary(fitKM)
To extract the estimate and lower and upper parts of the approximate confidence intervals:
fitKM.summary = summary(fitKM)
fitKM.summary$surv
fitKM.summary$lower
fitKM.summary$upper
To calculate the Nelson-Aalen estimate of the survival function given a survival object,
<survival object>, without any grouping:
summary(fitNA)
To extract the estimate and lower and upper parts of the approximate confidence intervals:
fitNA.summary = summary(fitNA)
fitNA.summary$surv
fitNA.summary$lower
fitNA.summary$upper
plot(<fit>)
Adding a survival function estimate, constructed using the survfit() function and stored in
<fit>, to an existing graph:
lines(<fit>)
To calculate the Nelson-Aalen estimate of the survival function given a survival object,
<survival object>, with groups for each life in the vector <groups>:
part.lik = function(b){
<formula for L>
}
part.log.lik = function(b){
<formula for log(L)>
}
In the context of Chapter 8, <function> may be a function that calculates the negative partial
log-likelihood.
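For example (the object name MLE is assumed, to match the output below):
MLE = nlm(<function>, <starting values>)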
Remember to look for a code of 1 or 2 in the output to determine whether the algorithm appears
to have converged on a solution.
MLE$estimate
surv.data = read.csv("<data.csv>")
If there are multiple explanatory variables to include, say n , then the structure of
<explanatory variables> is <var 1> + <var 2> + … + <var n>.
summary(<fit>)$logtest["pvalue"]
Performing a likelihood ratio test to assess whether <fit2> is a significant improvement over
<fit1> where <fit2> contains the same parameters as <fit1> as well as some additional
parameters:
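A possible calculation, comparing the test statistic with the appropriate chi-squared critical value:
test.stat = -2 * (<fit1>$loglik[2] - <fit2>$loglik[2])
test.stat > qchisq(0.95, <number of additional parameters>)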
Using a function
Alternatively, we can set up a function that calculates the exposed to risk for each individual and
then apply this function to the whole data table.
We first set up a function that operates on x , the vector of four data values in an individual
record.
enddate = min(d71,d4,d99);
enddate2 - startdate }
etr = sum(apply(data,1,etrfn))
etr
Similar code can be used to calculate the number of lives who died at age 70 last birthday during the investigation period, or the number of lives aged 70 last birthday whose policies were in force at the start of the investigation period.
The acf() function uses equation (10.2) from Chapter 10 of the Course Notes but omits
the factor of (m - j)/m on the denominator, ie it is using:

$$r_j = \frac{\sum_{i=1}^{m-j}(z_i - \bar z)(z_{i+j} - \bar z)}{\sum_{i=1}^{m}(z_i - \bar z)^2}$$

The formula on page 34 of the Tables is the same as equation (10.2) from Chapter 10 of
the Course Notes:

$$r_j = \frac{\sum_{i=1}^{m-j}(z_i - \bar z)(z_{i+j} - \bar z)}{\frac{m-j}{m}\sum_{i=1}^{m}(z_i - \bar z)^2}$$

This can be calculated in R as the estimate from the acf() function multiplied by m/(m - j).

The cor() function uses equation (10.1) from Chapter 10 of the Course Notes:

$$r_j = \frac{\sum_{i=1}^{m-j}\big(z_i - \bar z^{(1)}\big)\big(z_{i+j} - \bar z^{(2)}\big)}{\sqrt{\sum_{i=1}^{m-j}\big(z_i - \bar z^{(1)}\big)^2 \sum_{i=1}^{m-j}\big(z_{i+j} - \bar z^{(2)}\big)^2}}$$
n = 6; xx = c(0,5,15,20,30,40)
a0 = 0.00077; a1 = -0.00012
b = c(0.70,-0.88,-0.91,1.75)*1e-6
phi = function(x, j){
  ifelse(x >= xx[j], (x - xx[j])^3, 0)
}
xvalues = 1:40
yvalues = NULL
for(x in xvalues){
  term = a0 + a1*x
  for(j in 1:(n-2)){
    c1 = (xx[n] - xx[j])/(xx[n] - xx[n-1])
    c2 = (xx[n-1] - xx[j])/(xx[n] - xx[n-1])
    Phixj = phi(x,j) - c1*phi(x,n-1) + c2*phi(x,n)
    term = term + b[j]*Phixj
  }
  yvalues = c(yvalues, term)
}
$$f(x) = a_0 + a_1 x + \sum_{j=1}^{n-2} b_j \Phi_j(x)$$

where

$$\Phi_j(x) = \phi_j(x) - \frac{x_n - x_j}{x_n - x_{n-1}}\,\phi_{n-1}(x) + \frac{x_{n-1} - x_j}{x_n - x_{n-1}}\,\phi_n(x)$$

and

$$\phi_j(x) = \begin{cases}(x - x_j)^3 & \text{if } x \ge x_j \\ 0 & \text{otherwise}\end{cases}$$
CS2B-13 and 14: Time series – Summary
1 Summary
set.seed(123)
Xt <- arima.sim(n=n1,list(ar=c(ar1,ar2,…),ma=c(ma1,ma2,…)),sd=sd1)
To add a deterministic trend onto a time series Xt with integer times, we create a new time series
Yt a ct Xt :
n <- length(Xt)
Yt <- a + c*seq(1,n) + Xt
To create an ARIMA(p, d, q) time series where d = 1 (ie a time series Yt such that Xt = (1 - B)Yt):
Yt <- constant+cumsum(Xt)
Yt <- ts(Yt)
To plot the correlogram and partial correlogram of the data Xt, where k relates to the number of
ACFs (or PACFs) to include in the graph:
par(mfrow=c(2,1))
acf(Xt,lag.max=k,main="Time series Xt",ylab="Sample ACF")
pacf(Xt,lag.max=k,main="Time series Xt",ylab="Sample PACF")
par(mfrow=c(1,1))
Note: acf plots start from lag 0 and include k+1 values in the graph. pacf plots start from lag 1
and include k values in the graph.
Note that, as before, the theoretical ACF is calculated for lags 0 ≤ lag ≤ k, whereas the theoretical
PACF is calculated for lags 1 ≤ lag ≤ k.
frequency(Xt)
start(Xt)
end(Xt)
The sample ACFs and PACFs, for lags 0 ≤ lag ≤ k or 1 ≤ lag ≤ k respectively, are:
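For example, suppressing the plots so that the values are printed:
acf(Xt, lag.max = k, plot = FALSE)
pacf(Xt, lag.max = k, plot = FALSE)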
PP.test(Xt)
(This assumes there are no other underlying reasons for non-stationarity, besides the unit root.)
Removing trends
To difference time series data d times, we find Dt = (1 - B)^d Xt:
Dt <- diff(Xt,lag=1,differences=d)
To check how frequently the seasonal variation occurs, plot the series or the sample ACF:
plot(Xt)
points(Xt, pch = 20)
abline(v = <vector of values>)
acf(Xt)
To apply seasonal differencing with a lag of n, we find St = (1 - B^n)Xt:
St <- diff(Xt,lag=n,differences=1)
To apply the method of seasonal means, we first create a data frame containing the values of Xt
and their associated time periods. For example, for a series with monthly data:
We then calculate the relevant averages, for example for each month:
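A possible sketch for a monthly series (the object names xt.df and xbars are assumed, and the series is assumed to start in the first period of the cycle):
xt.df = data.frame(values = as.numeric(Xt), month = cycle(Xt))
xbars = aggregate(values ~ month, data = xt.df, FUN = mean)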
Wt = Xt - xbars$values
plot(decompose(Xt,type="additive"))
(Use "additive" if the seasonality / variability appears to be constant over time. Otherwise use
"multiplicative".)
Alternatively, if data is additive and does not span an integer time period, use the stl function:
The second line (labelled ‘trend’) is calculated using the method of moving averages.
The third line is calculated by subtracting the trend and then applying the method of seasonal means.
The final line represents the random variation in the data. It is calculated by subtracting the seasonal and
trend components from the observed data.
(In fact, the calculation is somewhat more complicated than the description implies. However, a deeper
understanding of the method is not required.)
Combining plots
To print graphs in an a × b array, use par(mfrow=c(a,b)).
m <- matrix(2,2,data=c(1,2,1,3))
layout(m)
Fitting a model
To fit an ARIMA(p, d , q) to data Xt:
d <- diff(Xt,differences=d)
arima(d,order=c(p,0,q))
To find what order of autoregressive model AR(p) is the best fit, and to find the fitted mean μ̂:
ar(Xt)
ar(Xt)$x.mean
To plot the (standardised) residuals and the ACF of the residuals of the fitted model fit:
tsdiag(fit)
Box.test(e,lag=m,type="Ljung",fitdf=p+q)
To extract the Akaike Information Criterion (AIC) for a fitted model fit:
fit$aic
answer<-numeric(4)
for (p in 0:2) for (d in 0:2) for (q in 0:2) {
aic<-arima(Xt,order=c(p,d,q))$aic
row<-c(p,d,q,aic)
answer<-rbind(answer,row)
}
answer <- answer[-1,]
answer
e <- a$residuals
n <- length(e)
turns <- numeric(n)
The loops below run from i = 2 to (n - 1), not from i = 1 to n, because the first and last residuals cannot be turning points.
for(i in 2:(n-1)){
  if(e[i-1] < e[i] && e[i] > e[i+1]){
    turns[i] <- 1
  }
}
for(i in 2:(n-1)){
  if(e[i-1] > e[i] && e[i] < e[i+1]){
    turns[i] <- 1
  }
}
turns
nt <- sum(turns)
E <- (2/3)*(n-2)
V <- (16*n - 29)/90
E
nt
ts <- (nt+0.5-E)/sqrt(V)
Since the value of ts is greater than -1.96, we have insufficient evidence to reject the null hypothesis, so the residuals are consistent with a white noise process.
Forecasting
Stationary data
p <- predict(fit,n.ahead=n); p
To plot the forecast on the same graph as the original data Xt, extend the x axis to include the
start time of the data and the end time of the forecast:
Non-stationary data
If the model was fitted to differenced data Dt, where the original data was Xt (ie Dt = (1 - B)Xt),
we have to add back the trend:
p <- predict(fit,n.ahead=n)$pred
p.with.trend <- tail(Xt,1)+cumsum(p)
p.with.trend <- ts(p.with.trend,start=c(start year,period),
frequency = number of observations per year)
To plot the forecast on the same graph as the original data Xt, extend the x axis as before, and
also extend the y axis to include all values of both the forecast and the data:
Exponential smoothing
To forecast n future values and 95% confidence intervals of the forecasts, for data Xt, and
smoothing parameter a:
HW <- HoltWinters(Xt,alpha=a,beta=FALSE,gamma=FALSE)
predict(HW,n.ahead=n,level=0.95,prediction.interval=TRUE)
To calculate the eigenvalues of the 2×2 coefficient matrix A with first row (a, b) and second row (c, d):
A <- matrix(2,2,data=c(a,c,b,d))
eigen(A)$values
To find a vector (a, b), for two time series Xt and Yt, such that aXt + bYt is stationary:
First write a function to carry out the PP.test for the cointegrating vector:
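A possible sketch, mirroring the version used inside is.cointegrated() below:
find.coint <- function(coint) {
  comb <- coint[1]*Xt + coint[2]*Yt
  PP.test(comb)$p.value
}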
Then specify (arbitrary) starting values for the iteration, and minimise the value of the function
find.coint:
v <- c(1,1)
fit <- nlm(find.coint,v); fit
c(fit$estimate[1],fit$estimate[2])
is.cointegrated<-function(x,y) {
xp<- PP.test(x)$p.value>=0.05 && PP.test(diff(x))$p.value<0.05
yp<- PP.test(y)$p.value>=0.05 && PP.test(diff(y))$p.value<0.05
find.coint<-function(coint) {
comb<-coint[1]*x+coint[2]*y
test<-PP.test(comb)$p.value
test
}
v<-c(1,1)
fit<-nlm(find.coint,v)
a<-fit$estimate[1]
b<-fit$estimate[2]
v.fit<-c(a,b)
v.fit.check<-round(a,6)==0 && round(b,6)==0
comb<-a*x+b*y
combp<- PP.test(comb)$p.value<0.05
if (xp==TRUE && yp==TRUE && combp==TRUE && v.fit.check==FALSE) {
print("the vectors are cointegrated with cointegrating vector");
print(v.fit)
} else {
print("x and y are not cointegrated")
}
}
CS2B-15: Loss distributions – Summary
1 Summary
Calculation required | X ~ Exp(λ) | X ~ Gamma(α, λ) | X ~ logN(μ, σ²) | X ~ W(c, γ)
f(x) | dexp(x, λ) | dgamma(x, α, λ) | dlnorm(x, μ, σ) | dweibull(x, γ, c^(-1/γ))
Simulate n random variates | rexp(n, λ) | rgamma(n, α, λ) | rlnorm(n, μ, σ) | rweibull(n, γ, c^(-1/γ))
In R, the second parameter of the Weibull distribution is c^(-1/γ).
The lognormal (and normal) distributions use σ rather than σ².
Summarising data
head(x)
summary(x)
min(x); max(x)
quantile(x,seq(0,0.5,by=0.1))
Empirical probabilities
To estimate P(X > c):
length(x[x>c])/length(x)
Key metrics
mean(x)
var(x)
quantile(x,0.5)
Then specify starting values a, b, c etc for the iteration, and find the maximum likelihood:
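For example (assuming the negative log-likelihood function has already been written):
p <- c(a, b, c)
nlm(<negative log-likelihood function>, p)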
Alternatively, writing a function that takes both the sample data and parameters as inputs:
Then specify starting values a, b, c etc for the iteration, and find the maximum likelihood:
The <starting values> are not required when closed-form formulae exist. This is the case for the normal, lognormal, geometric, exponential and Poisson distributions.
Histogram
hist(data, freq=FALSE)
x <- seq(0,max(data))
parameter1 <- nlm$estimate[1]
parameter2 <- nlm$estimate[2] … etc
y <- dXXX(x, parameter1, parameter2, …)
lines(x,y)
Q-Q plots
set.seed(XXX)
x <- rXXX(n, parameter1, parameter2, …)
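The simulated values can then be compared with the observed data, for example:
qqplot(x, data)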
The fitdistr() function does not support the Pareto distribution. The full list of supported
distributions, given in the required format to input into the function, is:
"beta" "lognormal"
"cauchy" "logistic"
"chi-squared" "negative binomial"
"exponential" "normal"
"gamma" "Poisson"
"geometric" "t"
"log-normal" "weibull"
1 Summary
Block maxima
To find the maximum in each year, first assign each observation to a year:
(data - constant)%/%n+1
(but be prepared to adjust the constant and the +1, depending on the data provided).
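The yearly maxima can then be found by grouping the observed values by these year indices, for example (names assumed):
tapply(<values>, <years>, max)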
Threshold exceedances
For threshold level u, the threshold exceedances te are:
u <- constant
te <- data[data>u]-u;
Specify starting values a, b, c, for the iteration and find the maximum likelihood estimates:
p <- c(a,b,c)
nlm(fMLE,p)
Q-Q plots
GEV distribution
First simulate 1,000 values from the fitted GEV distribution (for γ ≠ 0):
GPD distribution
First simulate 1,000 values from the fitted GPD distribution (for γ ≠ 0):
Hazard rate
x <- seq(min,max,by=constant)
H <- dXXX(x,parameters of distribution)/(1-pXXX(x,parameters of
distribution))
y <- c(min:max)
Write the survival function. For distributions built into R, this is:
Sy <- function(y) {
1 - pXXX(y,parameters of distribution)
}
ex <- integrate(Sy,x,Inf)$value/Sy(x)
1 Summary
Simulation
Setting the seed for generating random numbers
set.seed(17)
The random numbers generated in R are really just a single very long sequence of numbers.
The set.seed function sets the starting point to use for the next random number ‘generated’.
An integer from 1 to 100 is usually a good choice.
runif(1)
To generate a vector of length n of independent random numbers from the U(0,1) distribution:
runif(n)
To generate a vector of length n of independent random numbers from the N(μ, σ²) distribution:
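For example, remembering that rnorm() takes the standard deviation rather than the variance:
rnorm(n, <mu>, <sigma>)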
Probability functions
Finding the percentile points of a distribution
qnorm(0.975)
This is for the N(0,1) distribution and will give the familiar value 1.96 (or rather, the slightly more
accurate value of 1.959964). Other distributions work in the same way, eg to calculate the
median of a Gamma(1,1) distribution:
qgamma(0.5, 1, 1)
Correlations
Calculating correlation coefficients
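A possible call (x and y are assumed to be the two data vectors):
cor(x, y)
cor(x, y, method = "kendall")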
The default method is Pearson, but you can change the method to "spearman" or "kendall".
Note that these arguments must be written with a small letter.
If your data values are stored in a two-column matrix, you can use:
cmatrix = cor(matrix)
This returns the correlation matrix, rather than a single value. The correlation coefficient is the
‘off-diagonal’ value in this matrix:
cmatrix[2,1] or cmatrix[1,2]
1 Copula Package
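The copula functions below come from the copula package, which needs to be installed and loaded first:
install.packages("copula")
library(copula)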
gumbelCopula(param = <alpha> , dim = 2) OR gum.cop = archmCopula("gumbel", 3, 2)
frankCopula(param = <alpha>, dim = 2)
claytonCopula(param = <alpha>, dim = 2)
archmCopula(family, param = <alpha>, dim = 2)
Fitting a copula
fitCopula(<copula>, <data>, method ="...")
Methods:-
mpl = max pseudo-likelihood (using pobs)
ml = max likelihood (using true values)
itau = invert Kendall's tau estimator (using pobs or true values)
irho = invert spearman's rho (using pobs or true values)
summary(<fitCopula>)
Marginals
my.joint.dist = mvdc(gum.cop, margins = c("norm", "gamma"),
paramMargins = list(list(mean = 3, sd = 2), list(shape = 2, rate = 4)))
set.seed(100)
joint.sims = rMvdc(1000, my.joint.dist)
Reinsurance
1 Summary
y <- pmin(x,M)
z <- pmax(0,x - M)
w <- z[z>0]
ky <- pmin(k*x,M)
kz <- pmax(0,k*x - M)
w <- kz[kz>0]
To calculate the percentage reduction in the number of claims affecting the insurer, if a policy
excess is involved:
y <- pmax(0,x - M)
length(y[y>0])/length(x)
To find the retention level M required for the reinsurer to break even:
To find the retention level that minimises the variance of the insurer’s claims (if the reinsurer’s
maximum payment on a claim is Zmax):
f <- function(M) {
var(x - pmin(pmax(x - M,0),Zmax))
}
nlm(f,constant)
Proportional reinsurance
To find the amounts paid by the insurer and reinsurer, for retained proportion a:
y <- a*x
z <- (1 - a)*x
ky <- a*k*x
kz <- (1 - a)*k*x
$$\ln L = n_m \ln P(X \le M) + \sum_{i=1}^{n_z} \ln f_X(z_i + M)$$
Initial calculations:
nm <- length(z[z==0])
zcensored <- z[z>0]
M <- constant
Write the negative log-likelihood function (where pXXX and dXXX are the distribution function
and density function, respectively):
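A possible sketch, assuming a two-parameter distribution with parameters p[1] and p[2]:
flnL <- function(p) {
  - nm * log(pXXX(M, p[1], p[2])) - sum(log(dXXX(zcensored + M, p[1], p[2])))
}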
p <- c(a,b,c)
nlm <- nlm(flnL,p); nlm
set.seed(123)
n <- rpois(10000,l)
s <- numeric(10000)
for(i in 1:10000)
{x <- rXXX(n[i],parameters of claim size distribution)
s[i] <- sum(x)}
set.seed(123)
n <- rbinom(10000,size=m,prob=p)
s <- numeric(10000)
for(i in 1: 10000) {
x <- rXXX(n[i], parameters of claim size distribution)
s[i] <- sum(x)
}
set.seed(123)
n <- rnbinom(10000,size=k,prob=p)
s <- numeric(10000)
for(i in 1: 10000) {
x <- rXXX(n[i], parameters of claim size distribution)
s[i] <- sum(x)
}
set.seed(123)
n <- rNNN(10000, parameters of claim number distribution)
sI <- numeric(10000)
for(i in 1:10000) {
x <- rXXX(n[i], parameters of claim size distribution)
y <- a/100*x
sI[i] <- sum(y)
}
set.seed(123)
n <- rNNN(10000, parameters of claim number distribution)
sR <- numeric(10000)
for(i in 1:10000) {
x <- rXXX(n[i], parameters of claim size distribution)
z <- (1-a/100)*x
sR[i] <- sum(z)
}
set.seed(123)
n <- rNNN(10000, parameters of claim number distribution)
sI <- numeric(10000)
for(i in 1:10000) {
x <- rXXX(n[i], parameters of claim size distribution)
y <- pmin(x,M)
sI[i] <- sum(y)
}
set.seed(123)
n <- rNNN(10000, parameters of claim number distribution)
sR <- numeric(10000)
for(i in 1:10000) {
x <- rXXX(n[i], parameters of claims size distribution)
z <- pmax(0,x-M)
sR[i] <- sum(z)
}
sI <- pmin(s,M)
or:
sR <- pmax(0,s-M)
set.seed(123)
S.sim <- numeric(10000)
for (i in 1:10000) {
death <- rbinom(n,1,q)
x <- rXXX (n, parameters = size[,1], size[,2], … )
sim <- x*death
S.sim[i] <- sum(sim)
}
When reinsurance applies, insert the calculation of y (or z) immediately before the line sim <- x*death, and in that line replace x with y (or z).
Simulate the aggregate claims S.sim before reinsurance, and then calculate:
The code to generate 10,000 simulations from the mixture distribution of the number of claims
for a compound Poisson distribution with unknown continuously distributed parameter is:
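A possible sketch (the object names lambda and N are assumed; the parameter is drawn once for each simulation):
set.seed(123)
lambda <- rXXX(10000, parameters of the mixing distribution)
N <- rpois(10000, lambda)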
The code to generate 10,000 simulations from the mixture distribution of a randomly chosen
policy with a compound Poisson distribution with unknown continuously distributed parameter is:
set.seed(123)
for (i in 1:sims){
S[i] <- sum(rXXX(N[i], parameters of claim distribution))
}
The code to generate 10,000 simulations of the aggregate portfolio claims with 100 such policies
is:
set.seed(123)
for (i in 1:sims){
  for (j in 1:policies){
    S[j] <- sum(rXXX(N[j], parameters of claim distribution))
  }
  Results[i, ] <- S
}
The code to generate 10,000 simulations from the mixture distribution of a randomly chosen
policy with a compound Poisson distribution with unknown discretely distributed parameter is:
set.seed(123)
for (i in 1:sims){
S[i] <- sum(rXXX(N[i], parameters of claim distribution))
}
The code to generate 10,000 simulations of the aggregate portfolio claims with 100 policies, all
with the same Poisson parameter is:
set.seed(123)
for (i in 1:sims){
  for (j in 1:policies) {
    S[j] <- sum(rXXX(N[i], parameters of claim distribution))
  }
  Results[i, ] <- S
}
Remember that the size of the sample when generating the observations for the Poisson
parameter is 1 for each simulation (as it is the same across policies); note the use of N[i], rather
than N[j], in the code above.
Machine learning
Summary
We can calculate z-scores for a data set using the scale() function:
scale(data)
k-means algorithm
R has a built-in function kmeans() that carries out the calculations for the k-means algorithm. To
use it, you just need to specify the data and the number of clusters, k , to search for.
model1 = kmeans(data,k)
Alternatively, you can specify a starting set of cluster centres. The number of centres that you
provide will be the number of clusters that the algorithm returns:
model1 = kmeans(data,centres)
It’s best to assign the output to a variable (we’ve called it model1 here) so that you can access the
components of the results. For example, you can then access the coordinates of the centroids the
algorithm has found:
model1$centers
To calculate the centroid of cluster i for the points in the data set data using the clustering
vector clusters:
colMeans(data[clusters == i,])
To calculate the centroids for all clusters at once for the points in the data set data using the
clustering vector clusters:
Alternatively, the aggregate() function gives this same output but as rows:
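For example:
aggregate(data, by = list(cluster = clusters), FUN = mean)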
To plot the points in the data set data and colour them according to the values in the vector
clusters:
plot(data, col = clusters)
When the aggregate() function has been used to calculate the centres, the distance of each row of data from, say, the first centre can be calculated as:
apply(data, 1, function(row) { sqrt(sum((row - centres[1,2:n])^2)) })
where n is the last column number in centres (the first column holds the cluster label).
Calculating distances
To calculate the distance between each row of the data set data and the cluster centre centre:
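A possible calculation is:
apply(data, 1, function(row) sqrt(sum((row - centre)^2)))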
You may need to select relevant elements of each row if there are additional columns over and
above the variables used in the clustering.
Assigning clusters
To assign a cluster based on a set of distance columns dist_columns in the data set data:
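A possible calculation, picking the column with the smallest distance for each row:
apply(data[, dist_columns], 1, which.min)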
Calculating the total within-group sum of squares when applying k-means for 1 to k clusters on
the data set data:
tot.withinss = numeric(k)
for(i in 1:k){
tot.withinss[i] = kmeans(data, i)$tot.withinss
}
Calculating the distance of each point from each of the three cluster centres, for example:
d1 = sqrt((data$<genre> - centers[1,2])^2 +
  (data$Romcom - centers[1,3])^2 +
  (data$Action - centers[1,4])^2 +
  (data$Comedy - centers[1,5])^2 +
  (data$Fantasy - centers[1,6])^2)
with d2 and d3 calculated in the same way using centers[2, ] and centers[3, ].
To construct a decision tree by testing numerous conditions on each data point we can use nested
ifelse statements. For example, consider the following decision tree that uses height and
weight to divide athletes up into sports:
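A structural sketch only; the actual split points and category order come from the tree diagram, so the thresholds <split 1> to <split 4> are placeholders:
decision = function(height, weight){
  ifelse(height < <split 1>,
         ifelse(weight < <split 2>, "H", "C"),
         ifelse(height < <split 3>,
                ifelse(weight < <split 4>, "T", "R"),
                "B"))
}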
We have used the first letter of each sport to represent the categories. Running this on some
example data:
height = c(155,150,170,180,197,170,175,168,170,205,193,199,185,170,207,205,190,180)
weight = c(65,69,64,66,71,73,80,85,88,85,90,95,96,96,100,102,101,103)
decision(data$height, data$weight)
[1] "H" "H" "C" "C" "B" "T" "T" "T" "R" "B" "R" "B" "R"
[14] "R" "B" "B" "R" "R"
To compare predictions in the vector preds with actual values in the vector obs:
table(preds,obs)
To calculate the index of the maximum probability for each row in a data set probs containing
probabilities in columns:
apply(probs, 1, which.max)
Code for the PMF of the discrete uniform distribution, checking that x takes only integer values:
ddunif <- function(x,a,b){
if(sum(x != floor(x))!=0){
stop("Error:A value in x is not an integer")
}
ifelse(x<a | x>b ,0,1/(b-a+1))
}
cond.prob.given.T1 = function(F1, F2){
ddunif(F1, 40, 100) * ddunif(F2, 40, 100)
}
cond.prob.given.T2 = function(F1, F2){
ddunif(F1, 10, 70) * ddunif(F2, 30, 90)
}
cond.prob.given.T3 = function(F1, F2){
ddunif(F1, 0, 80) * ddunif(F2, 0, 60)
}
prior1 = 0.5
prior2 = 0.3
prior3 = 0.2