Basic R Programming: Exercises
R Programming
John Fox
1. Logistic Regression: Iterated weighted least squares (IWLS) is a standard method of fitting
generalized linear models to data. As described in Section 5.5 of An R and S-PLUS Companion
to Applied Regression (Fox, 2002), the IWLS algorithm applied to binomial logistic
regression proceeds as follows:
(a) Set the regression coefficients to initial values, such as β^(0) = 0 (where the superscript
0 indicates start values).
(b) At each iteration t calculate the current fitted probabilities μ, variance-function values
v, working-response values z, and weights w:

μ_i^(t) = [1 + exp(−η_i^(t))]^(−1)
v_i^(t) = μ_i^(t) (1 − μ_i^(t))
z_i^(t) = η_i^(t) + (y_i − μ_i^(t)) / v_i^(t)
w_i^(t) = n_i v_i^(t)

where η_i^(t) = x_i′β^(t) is the current linear predictor for the ith observation.
Here, ni represents the binomial denominator for the ith observation; for binary data,
all of the ni are 1.
(c) Regress the working response on the predictors by weighted least squares, minimizing
the weighted residual sum of squares

Σ_{i=1}^{n} w_i^(t) (z_i^(t) − x_i′β)^2

This produces the updated estimate β^(t+1) = (X′WX)^(−1) X′Wz^(t), where W = diag{w_i^(t)} is
the diagonal matrix of weights from the last iteration and X is the model matrix.
(d) Repeat steps (b) and (c) until the coefficients stabilize at the maximum-likelihood
estimates.
Problem: Program this method in R. The function that you define should take (at least)
three arguments: The model matrix X; the response vector of observed proportions y; and
the vector of binomial denominators n. I suggest that you let n default to a vector of 1s (i.e.,
for binary data, where y consists of 0s and 1s), and that you attach a column of 1s to the
model matrix for the regression constant so that the user does not have to do this when the
function is called.
Programming hints:
• Adapt the structure of the example developed in Section 8.5.1 of “Writing Programs”
(Fox and Weisberg, draft), but note that this example is for binary logistic regression,
while the current exercise is to program the more general binomial logit model.
• Use the lsfit function to get the weighted-least-squares fit, calling the function as
lsfit(X, z, w, intercept=FALSE), where X is the model matrix; z is the current
working response; and w is the current weight vector. The argument intercept=FALSE
is needed because the model matrix already has a column of 1s. The function lsfit
returns a list, with element $coef containing the regression coefficients. See ?lsfit for
details.
• One tricky point is that lsfit requires that the weights (w) be a vector, while your
calculation will probably produce a one-column matrix of weights. You can coerce the
weights to a vector using the function as.vector.
• Return a list with the maximum-likelihood estimates of the coefficients, the covariance
matrix of the coefficients, and the number of iterations required.
• You can test your function on the Mroz data in the car package, being careful to make
all of the variables numeric. You might also try fitting a binomial (as opposed to binary)
logit model.
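One way to assemble the algorithm and the hints above is sketched below; the function name lreg.iwls and the tol and max.iter arguments are illustrative choices, not part of the exercise:

```r
# Sketch of the IWLS fitting function described above; lreg.iwls and the
# tol/max.iter arguments are illustrative, not part of the exercise.
lreg.iwls <- function(X, y, n = rep(1, length(y)),
                      tol = 1e-8, max.iter = 25) {
  X <- cbind(1, X)                # attach a column of 1s for the constant
  beta <- rep(0, ncol(X))         # start values: beta^(0) = 0
  for (it in 1:max.iter) {
    eta <- as.vector(X %*% beta)  # current linear predictor
    mu  <- 1/(1 + exp(-eta))      # fitted probabilities
    v   <- mu*(1 - mu)            # variance-function values
    z   <- eta + (y - mu)/v       # working response
    w   <- n*v                    # weights
    beta.new <- lsfit(X, z, as.vector(w), intercept = FALSE)$coef
    if (max(abs(beta.new - beta)) < tol) break
    beta <- beta.new
  }
  W <- diag(as.vector(w))         # diagonal weight matrix, last iteration
  list(coefficients = beta.new,
       var = solve(t(X) %*% W %*% X),  # estimated covariance matrix
       iterations = it)
}
```

For binary data the result should agree closely with glm(y ~ x, family = binomial), which provides a convenient check alongside the Mroz data.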
2. A Challenging Problem – Ordered Logit and Probit Models: Ordered logit and probit models
are popular regression models for ordinal response variables; the ordered logit model is also
called the proportional-odds model (see below for an explanation). The following description
is adapted from Fox, Applied Regression Analysis and Generalized Linear Models, Second
Edition (2008, Ch. 14):
Imagine that there is a latent (i.e., unobservable) variable ξ that is a linear function of X’s
plus a random error:
ξ_i = α + β_1 X_i1 + ⋯ + β_k X_ik + ε_i
The latent response ξ is dissected by m − 1 thresholds (i.e., boundaries) into m regions.
Denoting the thresholds by α1 < α2 < · · · < αm−1 , and the resulting response by Y , we
observe

Y_i = 1      if ξ_i ≤ α_1
      2      if α_1 < ξ_i ≤ α_2
      ⋮                                            (1)
      m − 1  if α_{m−2} < ξ_i ≤ α_{m−1}
      m      if α_{m−1} < ξ_i
The thresholds, regions, and corresponding values of ξ and Y are represented graphically in
the following figure. Notice that the thresholds are not in general uniformly spaced.
[Figure: the real line for the latent variable ξ, divided at the thresholds α_1, α_2, …, α_{m−2}, α_{m−1} into regions labeled with the corresponding values Y = 1, 2, …, m − 1, m.]
It follows that

Pr(Y_i ≤ j) = Pr(ξ_i ≤ α_j)
            = Pr(α + β_1 X_i1 + ⋯ + β_k X_ik + ε_i ≤ α_j)
            = Pr(ε_i ≤ α_j − α − β_1 X_i1 − ⋯ − β_k X_ik)
If the errors εi are independently distributed according to the standard normal distribution,
then we obtain the ordered probit model. If the errors instead follow the similarly shaped logistic distribution,
then we get the ordered logit model. In the latter event,
logit[Pr(Y_i ≤ j)] = log_e [Pr(Y_i ≤ j) / Pr(Y_i > j)]
                   = α_j − α − β_1 X_i1 − ⋯ − β_k X_ik
Equivalently,

logit[Pr(Y_i > j)] = log_e [Pr(Y_i > j) / Pr(Y_i ≤ j)]          (2)
                   = (α − α_j) + β_1 X_i1 + ⋯ + β_k X_ik

for j = 1, 2, …, m − 1.
The logits in Equation 2 are for cumulative categories: at each point they contrast the categories
above category j with category j and the categories below. The slopes of these regression equations
are identical; the equations differ only in their intercepts.
Put another way, for a fixed set of X’s, any two different cumulative log-odds (i.e., logits),
say those at categories j and j′, differ only by the constant (α_j − α_{j′}). The odds are
therefore proportional to one another; that is,

odds_j / odds_{j′} = exp(logit_j − logit_{j′}) = exp(α_j − α_{j′}) = e^{α_j} / e^{α_{j′}}

where, for example, odds_j = Pr(Y_i > j)/Pr(Y_i ≤ j) and logit_j = logit[Pr(Y_i > j)]. For this
reason, Equation 2 is called the proportional-odds logit model.
There are (k + 1) + (m − 1) = k + m parameters to estimate in the proportional-odds model,
including the regression coefficients α, β 1 , . . . , β k and the category thresholds α1 , . . . , αm−1 .
Note, however, that there is an extra parameter in the regression equations (Equation 2),
because each equation has its own constant, −α_j, along with the common constant α. A
simple solution is to set α = 0 (and to absorb the negative sign into α_j), producing

logit[Pr(Y_i > j)] = α_j + β_1 X_i1 + ⋯ + β_k X_ik

and thus

Pr(Y_i > j) = Λ(α_j + β_1 X_i1 + ⋯ + β_k X_ik)

where Λ(·) is the cumulative logistic distribution function. In this parametrization, the intercepts α_j
are the negatives of the category thresholds. The ordered probit model is similar, with Λ replaced
by the standard-normal distribution function Φ. In either case, the individual-category probabilities are
π_i1 = 1 − Pr(Y_i > 1)
π_i2 = Pr(Y_i > 1) − Pr(Y_i > 2)
⋮
π_im = Pr(Y_i > m − 1)
Problem: Program the ordered-logit model or the ordered-probit model (or both). The
function that you define should take (at least) two arguments: The model matrix X and the
response vector y, which should be a factor or ordered factor; I suggest that you attach a
column of 1s to the model matrix for the regression constants so that the user does not
have to do this when the function is called; the ordered logit and probit models always have
constants. Your function can include an argument to indicate which model – logit or probit
– is to be fit.
Programming hints:
• The parameters consist of the intercepts and the other regression coefficients, say a and
b. Although there are cleverer ways to proceed, you can set b to a vector of zeroes to
start, and compute start values for a from the marginal distribution of the response;
e.g.,
marg.p <- rev(cumsum(rev(table(y)/n)))[-1]
a <- log(marg.p/(1 - marg.p))
Here y is the response vector and n is the number of observations.
• If you’re fitting the ordered logit model, use the cumulative logistic distribution function
plogis(); if you’re fitting the ordered probit model, use the cumulative normal distribution
function pnorm().
• Use optim() to maximize the likelihood, treating the lreg2() function in Section 8.5.1
of “Writing Programs” (Fox and Weisberg, draft) as a model, but noting that for the
ordered logit and probit models, I have shown only the log-likelihood and not the gradient.
• Return a list with the maximum-likelihood estimates of the coefficients, including the
intercepts or thresholds (the negatives of the intercepts); the covariance matrix of the
coefficients (obtained by inverting the Hessian returned by optim()); the residual deviance
for the model (i.e., minus twice the maximized log-likelihood); and an indication of whether
or not the computation converged.
• The ordered logit and probit models may be fit by the polr() function in the MASS
package (one of the standard R packages). You can use polr() to verify that your
function works properly. To test your program, you can use the WVS dataset in the
effects package. For testing purposes, use a simple additive model rather than the
model with interactions given in ?WVS.
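A sketch of such a function for the logit case follows; the name ordered.logit, the clamping of probabilities away from zero, and the choice of the BFGS method are all illustrative assumptions, not requirements of the exercise:

```r
# Sketch of an ordered-logit fitter along the lines suggested above;
# ordered.logit, the probability clamping, and BFGS are illustrative.
ordered.logit <- function(X, y) {
  X <- as.matrix(X)
  y <- as.factor(y)
  m <- nlevels(y)                 # number of response categories
  n <- length(y)
  k <- ncol(X)
  Y <- as.numeric(y)              # category numbers 1, ..., m
  # start values: slopes b = 0, intercepts a from the marginal distribution
  marg.p <- rev(cumsum(rev(table(y)/n)))[-1]   # Pr(Y > j), j = 1, ..., m - 1
  a <- log(marg.p/(1 - marg.p))
  negLogLik <- function(par) {
    a <- par[1:(m - 1)]
    b <- par[m:(m + k - 1)]
    eta <- as.vector(X %*% b)
    # cumulative probabilities Pr(Y > j), padded with 1 (j = 0) and 0 (j = m)
    cum <- cbind(1, sapply(a, function(aj) plogis(aj + eta)), 0)
    p <- cum[cbind(1:n, Y)] - cum[cbind(1:n, Y + 1)]  # Pr(Y = observed)
    -sum(log(pmax(p, 1e-12)))     # clamp to avoid log(0) while optimizing
  }
  res <- optim(c(a, rep(0, k)), negLogLik, method = "BFGS", hessian = TRUE)
  list(coefficients = res$par,    # intercepts a, then slopes b
       var = solve(res$hessian),  # covariance from the inverted Hessian
       deviance = 2*res$value,    # -2 * maximized log-likelihood
       converged = res$convergence == 0)
}
```

Note that the intercepts here are α_j, so the thresholds are their negatives; polr() in MASS reports the thresholds (its zeta), which is worth remembering when comparing output.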
3. General Cumulative Logit and Probit Models: The ordered logit and probit models of the
previous problem make the strong assumption that all m − 1 cumulative probabilities can be
modeled with the same regression coefficients, except for different intercepts. More general
versions of these models permit different regression coefficients:
Pr(Y_i > j) = Λ(α_j + β_1j X_i1 + ⋯ + β_kj X_ik),   j = 1, …, m − 1

or

Pr(Y_i > j) = Φ(α_j + β_1j X_i1 + ⋯ + β_kj X_ik),   j = 1, …, m − 1
Program one or the other (or both) of these models. For your example regression, use a
likelihood-ratio test to compare the more general cumulative logit or probit model to the
more restrictive ordered logit or probit model of the preceding problem. This test checks the
assumption of equal slopes. The cumulative logit and probit models (along with the ordered
logit and probit models) can be fit by the vglm() function in the VGAM package.
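If both of your fitting functions return a residual deviance, the likelihood-ratio test reduces to a χ² test on the difference in deviances. The general model has (m − 1)(k + 1) parameters and the proportional-odds model (m − 1) + k, a difference of (m − 2)k. A small helper (the name lr.test is illustrative) might look like:

```r
# Sketch of the likelihood-ratio test for equal slopes, assuming each
# fitting function returns the residual deviance as described above.
# Degrees of freedom: (m - 1)(k + 1) - [(m - 1) + k] = (m - 2)*k.
lr.test <- function(dev.restricted, dev.general, m, k) {
  chisq <- dev.restricted - dev.general   # deviance difference
  df <- (m - 2)*k
  c(chisq = chisq, df = df,
    p.value = pchisq(chisq, df, lower.tail = FALSE))
}
```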
4. Numerical Linear Algebra: A matrix is said to be in reduced row-echelon form when it satisfies
the following criteria:
(a) All of its nonzero rows (if any) precede all of its zero rows (if any).
(b) The first entry (from left to right) – called the leading entry – in each nonzero row is
1.
(c) The leading entry in each nonzero row after the first is to the right of the leading entry
in the previous row.
(d) All other entries are 0 in a column containing a leading entry.
A matrix can be put into reduced row-echelon form by a sequence of elementary row operations,
which are of three types:
(a) Multiply a row by a nonzero constant.
(b) Add a multiple of one row to another, replacing the other row.
(c) Exchange two rows.
Gaussian elimination proceeds through the matrix row by row and column by column; at each
position, do the following:
(a) If there is a 0 in the current row and column (called the pivot), exchange the current row
with a lower row, if possible, to bring a nonzero element into the pivot position; if no
nonzero pivot is available, move one column to the right and repeat this step. If there are
no nonzero elements anywhere to the right (and below), then stop.
(b) Divide the current row by the pivot, putting a 1 in the pivot position.
(c) Proceeding through the other rows of the matrix, multiply the pivot row by the element
in the pivot column in another row, subtracting the result from the other row; this zeroes
out the pivot column.
The matrix is now in reduced row-echelon form. The rank of a matrix is the number of
nonzero rows in its reduced row-echelon form.
Problem: Write an R function to calculate the reduced row-echelon form of a matrix by
elimination.
Programming hints:
• When you do “floating-point” arithmetic on a computer, there are almost always round-
ing errors. One consequence is that you cannot rely on a number being exactly equal to
a value such as 0. When you test that an element, say x, is 0, therefore, you should do
so within a tolerance – e.g., |x| < 1 × 10^(−6).
• The computations tend to be more accurate if the absolute values of the pivots are as
large as possible. Consequently, you can exchange a row for a lower one to get a larger
pivot even if the element in the pivot position is nonzero.
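A sketch following both hints (a zero tolerance and partial pivoting, i.e., always bringing the largest available pivot into position); the name rref is an illustrative choice:

```r
# Sketch of reduced row-echelon form by Gaussian elimination with
# partial pivoting, per the hints above; the name rref is illustrative.
rref <- function(A, tol = 1e-6) {
  A <- as.matrix(A)
  n <- nrow(A); m <- ncol(A)
  row <- 1
  for (col in 1:m) {
    if (row > n) break
    # choose the row (at or below the current one) with the largest pivot
    pivots <- abs(A[row:n, col])
    i <- which.max(pivots) + row - 1
    if (pivots[i - row + 1] < tol) next   # no usable pivot in this column
    A[c(row, i), ] <- A[c(i, row), ]      # exchange rows
    A[row, ] <- A[row, ]/A[row, col]      # scale the pivot to 1
    for (j in setdiff(1:n, row))          # zero out the rest of the column
      A[j, ] <- A[j, ] - A[j, col]*A[row, ]
    row <- row + 1
  }
  A
}
```

Counting the nonzero rows of the result then gives the rank of the input matrix.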
5. A less difficult problem: Write a function to compute running medians. Running medians are
a simple smoothing method usually applied to time series. For example, for the numbers 7,
5, 2, 8, 5, 5, 9, 4, 7, 8, the running medians of length 3 are 5, 5, 5, 5, 5, 5, 7, 7. The first
running median is the median of the three numbers 7, 5, and 2; the second running median
is the median of 5, 2, and 8; and so on. Your function should take two arguments: the data
(say, x), and the number of observations for each median (say, length). Notice that there
are fewer running medians than observations. How many fewer?
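A minimal sketch, using the argument names suggested above (running.median itself is an illustrative name); note that each median uses a window of `length` consecutive observations:

```r
# Minimal sketch of running medians; running.median is an illustrative name.
# base::length() is called explicitly to avoid confusion with the numeric
# argument of the same name.
running.median <- function(x, length = 3) {
  n.med <- base::length(x) - length + 1   # number of running medians
  sapply(1:n.med, function(i) median(x[i:(i + length - 1)]))
}
```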
6. Simulation: Develop a simulation illustrating the central limit theorem for the mean: Almost
regardless of the population distribution of X, the mean X̄ of repeated samples of size n drawn
from the population is approximately normally distributed, with mean E(X̄) = E(X) = μ
and variance V(X̄) = V(X)/n = σ²/n, and with the approximation improving as the sample
size grows. Sample from a highly skewed distribution, such as the exponential distribution
with a small “rate” parameter λ (e.g., λ = 1); use several different sample sizes, such as 1,
2, 5, 25, and 100, and draw many samples of each size, comparing the observed distribution
of sample means with the approximating normal distribution. Exponential random variables
may be generated in R using the rexp() function. Note that the mean of an exponential
random variable is 1/λ and its variance is 1/λ2 .
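One compact way to carry out the simulation is sketched below; the function clt.demo and its defaults are illustrative. Each panel compares a histogram of the simulated sample means with the approximating normal density, which for the exponential has mean 1/λ and standard deviation 1/(λ√n):

```r
# Sketch of the suggested simulation; clt.demo and its defaults are
# illustrative choices.  Each panel overlays the approximating
# N(1/lambda, 1/(lambda^2 * n)) density on a histogram of sample means.
clt.demo <- function(sizes = c(1, 2, 5, 25, 100), reps = 10000, rate = 1) {
  means <- lapply(sizes, function(n) replicate(reps, mean(rexp(n, rate))))
  names(means) <- sizes
  for (i in seq_along(sizes)) {
    hist(means[[i]], breaks = 50, freq = FALSE,
         main = paste("n =", sizes[i]), xlab = "sample mean")
    curve(dnorm(x, mean = 1/rate, sd = 1/(rate*sqrt(sizes[i]))),
          add = TRUE)             # the approximating normal density
  }
  invisible(means)                # return the simulated means for inspection
}
```

The skewness of the exponential should be clearly visible for n = 1 and n = 2, with the histograms looking increasingly normal by n = 25 and n = 100.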