Lecture notes
Introductory Material
ST419 2008/2009
Computational Statistics
This course teaches the fundamental computing skills required by practicing statisticians. We
focus on analysis of data using a computer and simulation as a tool to improve understanding
of statistical models. The software package R is taught throughout. On successful completion
of the course you will be a competent R user with skills that adapt readily to other statistical
software packages. A small subset of the things you will be able to do are listed below:
• generate detailed descriptive analyses of data sets (both numbers and graphics),
• fit statistical models and interpret the output from model fitting,
• write your own R functions to simplify commonly performed tasks or to implement novel
statistical methods.
Course Administration
Prerequisites
This course is intended primarily for MSc Statistics, MSc Social Research Methods (Statistics) and MSc
Operational Research students. You are required to have a good basic knowledge of statistics
(to the level of ST203 Statistics for Management Sciences). The course makes extensive use
of computers in teaching and assessment. Basic familiarity with machines running a Windows
operating system is assumed.
For this year’s course, I have slightly modified the original course notes written by Dr. J. Penzer
Timetabling
You are required to attend four hours per week for this course.
On Tuesday of week 10, we will be in H102 from 0900-1200 for students’ presentations (see
assessment below).
Assessment
This course is assessed by coursework (50%) and by a two hour examination during the summer
term (50%). There are two pieces of coursework:
Detailed instructions for each of the projects will be given. There will be two practice written
tests.
Books
This course has detailed lecture notes. It should not be necessary to buy a book for this course.
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, (Fourth Edi-
tion), Springer. (QA276.4 V44)
• Venables, W. N., Smith, D.M. and the R Development Core Team (2001) An Introduction
to R, freely available from http://cran.r-project.org/doc/manuals/R-intro.pdf.
Course content
Week 1 Data, Session Management and Simple Statistics – R objects: vectors, data
frames and model objects; assignment; descriptive analysis; managing objects;
importing data; t-tests; simple regression; financial series.
On Mondays we will give details of the application of R to new material. On Tuesdays we will
go through the exercises from the previous day - you will be expected to have attempted to
complete the exercises so that you can contribute to the class. Any common problems with the
exercises or in the use of R will be dealt with during these sessions.
Computer practicals
The only way to learn a computer language is to actually use it. The computer practicals provide
time when you can work on material from the lecture notes and exercises. Using a combination
of the notes, the R help system and experimentation you should be able to work out most of the
commands. The R commands in the text are there to illustrate certain points. You do not have
to follow them rigidly. There is often more than one way to do something and you may be able
to find a better solution than the one given. If you find something interesting or difficult, spend
more time on it; if you get the idea straight away, move on. If you can’t work something out
by experimenting with it, ask me or Limin. Please don’t mindlessly type in commands
from the notes without understanding what they do.
Lecture notes
The course attempts to convey a large amount of information in a short space of time. Some
of the material is of a technical nature and may not be covered explicitly in the lectures and
classes. You are expected to read the lecture notes thoroughly. The syllabus is defined
by the contents of the lecture notes with the exception of topics marked with a †. Additional
reading is suggested at the end of each week's notes. Some of the notational conventions adopted
in the notes are described below.
italic typewriter font – things to be replaced with an appropriate value, iden-
tifier or expression.
If you miss a set of notes, additional copies are available from three sources:
• Public folder – to find copies of the notes in pdf format, open Outlook and go to
Public Folders −→All Public Folders −→Departments
−→Statistics −→ST419 −→Notes
• Moodle
The versions of the notes in the public folders and on the website will be updated to include
corrections. Despite my best efforts, there will be mistakes in the notes. If you spot
something that looks wrong please let me know.
Asides
Interesting issues that arise from the topic under consideration are placed
in boxes in the text. These may be hints on using R effectively, quick
questions or suggestions for further work.
R software is freely available under the GNU General Public License. The R project homepage
is http://www.r-project.org/. You can download the software to install on your own machine
from http://www.stats.bris.ac.uk/R/. Data files associated with the course can be found in
the ST419 public folder under Data.
Communicating with me
By far the best way to communicate with me is via email. I will respond to sensible emails
relating to:
The ST419 projects are unsupervised; I will not provide any direct assistance with projects.
1.1.1 Objects
R works on objects. All objects have the following properties (referred to as intrinsic attributes):
• mode: tells us what kind of thing the object is – possible modes include numeric, complex,
logical, character and list.
• length: is the number of components that make up the object.
At the simplest level, an object is a convenient way to store information. In statistics, we need
to store observations of a variable of interest. This is done using a numeric vector. Note that
there are no scalars in R; a number is just a numeric vector of length 1. Vectors are referred to
as atomic structures; all of their components have the same mode.
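For instance, at the R prompt (a small illustration, not taken from the notes):
> mode(c(2.5, 3.1, 4.0))
[1] "numeric"
> length(c(2.5, 3.1, 4.0))
[1] 3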
If an object stores information, we need to name it so that we can refer to it later (and thus
recover the information that it contains). The term used for the name of an object is identifier.
An identifier is something that we choose. Identifiers can be chosen fairly freely in R. The points
below are a few simple rules to bear in mind.
• In general any combination of letters, digits and the dot character can be used although
it is obviously sensible to choose names that are reasonably descriptive.
• You cannot start an identifier with a digit, or with a dot followed by a digit, so moonbase3.sample
is acceptable but 3moons.samplebase is not; identifiers starting with a dot (such as
.sample3basemoon) are allowed but are hidden from objects(), so they are best avoided.
• Identifiers are CASE SENSITIVE so moon.sample is different from moon.Sample. It is
easy to get caught out by this.
• Some characters are already assigned values. These include c, q, t, C, D, F, I and T. Avoid
using these as identifiers.
Typically we are interested in data sets that consist of several variables. In R, data sets are
represented by an object known as a data frame. As with all objects, a data frame has the
intrinsic attributes mode and length; data frames are of mode list and the length of a data
frame is the number of variables that it contains. In common with many larger objects, a data
frame has other attributes in addition to mode and length. The non-intrinsic attributes of a
data frame are:
• names: these are the names of the variables that make up the data set,
• row.names: these are the names of the individuals on whom the observations are made,
• class: this attribute can be thought of as a detailed specification of the kind of thing the
object is; in this case the class is "data.frame".
The class attribute tells certain functions (generic functions) how to deal with the object. For
example, objects of class "data.frame" are displayed on screen in a particular way.
R works by calling functions. The notion of a function is a familiar one; a function takes a
collection of inputs and maps them to a single output. In R, inputs and output are objects. The
inputs are referred to as arguments and the output is referred to as the return value. Many useful
functions are part of the standard R setup. Others are available via the inclusion of packages
(see 3.2.2). One of the key advantages of R is the ease with which users can define their own
functions (see 3.3). Functions are also objects; the mode of a function is "function" (sensibly
enough). In computing, functions can also have side-effects. For example, calling a function may cause
a graph to be drawn. In many instances, it is the side-effect that is important rather than the
return value.
During an R session, a number of objects will be generated; for example we may generate vectors,
data frames and functions. For the duration of the session, these objects are stored in an area of
memory referred to as the workspace. If we want to save the objects for future use, we instruct
R to write them to a file in our current working directory (directory is just another name for a
folder). Note the distinction: things in memory are temporary (they will be lost when we log
out); files are more permanent (they are stored on disk and the information they contain can be
loaded into memory during our next session). Managing objects and files is an important part
of using R effectively (see 1.2.7).
In the initial (three week) period of the course, we will not be doing much statistics. However,
we will use some simple statistical ideas to illustrate the use of R.
Suppose that we have a random sample y1 , . . . , yn from a population that is normally distributed.
We can view y1 , . . . , yn as being an instance of independent identically distributed random
variables Y1 , . . . , Yn , where Yi ∼ N(µY , σY²) for i = 1, . . . , n. We would like to test
H0 : µY = µ0 ,
H1 : µY ≠ µ0 ,
where µ0 is some real number. We set significance level α, that is, the probability of rejecting
the null hypothesis when it is true is α. Our test statistic is
\[ T = \frac{\bar{Y} - \mu_0}{S_Y/\sqrt{n}}, \]
where
\[ \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i \qquad \text{and} \qquad S_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2. \]
Note that Ȳ and SY² are respectively unbiased estimators of µY and σY². Well established theory
tells us that, if H0 is true, T follows a t-distribution on n − 1 degrees of freedom. We evaluate the
sample value of our statistic
\[ t = \frac{\bar{y} - \mu_0}{s_y/\sqrt{n}}, \]
where
\[ \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad \text{and} \qquad s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2. \]
This sample value is used to determine whether H0 is rejected. Two (equivalent) approaches for
the two-tailed case are described below.
For a given sample, these approaches lead to the same conclusion. Using p-values tends to be
favoured; at any significance level larger than p, we reject H0 .
Simple regression
If we believe there is a linear relationship between a response variable, y1 , . . . , yn , and the values
of a fixed explanatory factor, x1 , . . . , xn , we may model this using simple linear regression:
\[ Y_i = \alpha + \beta x_i + \varepsilon_i, \qquad \varepsilon_i \sim (0, \sigma_\varepsilon^2), \qquad i = 1, \ldots, n, \]
where α is the intercept parameter and β is the slope parameter. The notation εi ∼ (0, σε²) indicates
that the errors, εi , are random variables with mean zero and variance σε². The parameters
α and β are estimated by least squares, that is, by finding α̂ and β̂ that minimize Σ_{i=1}^{n} (Yi − Ŷi)²,
where Ŷi = α̂ + β̂xi is the fitted value at i. It is easy to show (try it) that
\[ \hat{\beta} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(x_i - \bar{x})}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}
\qquad \text{and} \qquad \hat{\alpha} = \bar{Y} - \hat{\beta}\,\bar{x}. \]
The residuals of the fitted model are defined as Ei = Yi − Ŷi , for i = 1, . . . , n. The sample
residuals, ei = yi − ŷi , can be interpreted as the difference between what we observe and what
the model predicts. Scaled versions of the sample residuals are used as diagnostics in regression.
In practice, we often want to test whether the explanatory factor influences the response. The
null hypothesis is that there is no linear relationship,
H0 : β = 0,
H1 : β ≠ 0.
The statistic S² = Σ_{i=1}^{n} Ei²/(n − 2)
is an unbiased estimator for the error variance, σε². We discuss the properties of these statistics
in section 4.1.5.
This is not a course in finance and I am no financial expert; however, financial data provide
many of the examples we will consider. A few preliminaries are important. The frequency
with which financial data are observed varies. We will work mostly with daily data, that is,
observations made every working day. The presence of weekends and bank holidays means that
the observations in daily data are not regularly spaced in time. Irregular spacing is a feature
common to many financial series.
Suppose that we observe the price of an asset, y1 , . . . , yn . Our interest lies not in the actual
price of the asset but in whether the price has increased or decreased relative to the previous
price; in other words, we are interested in returns. There are many ways of constructing returns,
three of which are given below; see Tsay (2002) for more details:
• simple gross return: yt /yt−1 ,
• simple net return: (yt − yt−1 )/yt−1 ,
• log return: log(yt /yt−1 ).
Financial professionals make interesting use of simple regression; the returns from a stock are
regressed against the returns from an appropriate index. The resulting slope estimate is referred
to as the beta value of the stock. The beta value is taken as a measure of the risk associated
with the stock relative to market risk; if a beta value is greater than 1 the stock is considered
more risky (and potentially more rewarding), while if the beta value is less than 1 the stock is
less risky. We calculate beta values for real stocks and discuss problems associated with their
interpretation in the exercise.
> 1
[1] 1
> 1+4.23
[1] 5.23
> 1+1/2*9-3.14
[1] 2.36
# Note the order in which operations are performed
# in the final calculation
Comments in R
R ignores anything after a # sign in a command. We will follow this
convention. Anything after a # in a set of R commands is a comment.
We can create vectors at the command prompt using the concatenation function c(...).
c(object1,object2,...)
This function takes arguments of the same mode and returns a vector containing these values.
> c(1,2,3)
[1] 1 2 3
> c("Ali","Bet","Cat")
[1] "Ali" "Bet" "Cat"
In order to make use of vectors, we need identifiers for them (we do not want to have to write
vectors from scratch every time we use them). This is done using the assignment operator <-.
name <- expression
name now refers to an object whose value is the result of evaluating expression .
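For example (the identifier numbers is used in the arithmetic that follows):
> numbers <- c(1,2,3)
> numbers
[1] 1 2 3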
> c(1,2,3)+c(4,5,6)
[1] 5 7 9
> numbers + numbers
[1] 2 4 6
> numbers - c(8,7.5,-2)
[1] -7.0 -5.5 5.0
> c(1,2,4)*c(1,3,3)
[1] 1 6 12
> c(12,12,12)/numbers
[1] 12 6 4
Note in the above example that multiplication and division are done element by element.
Reusing commands
If you want to bring back a command which you have used earlier in the
session, press the up arrow key ↑. This allows you to go back through
the commands until you find the one you want. The commands reappear
at the command line and can be edited and then run by pressing return.
The outcome of an arithmetic calculation can be given an identifier for later use.
If we try to add together vectors of different lengths, R uses a recycling rule; the smaller vector
is repeated until the dimensions match.
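For example (a small illustration, not from the notes):
> c(1,2,3,4,5,6) + c(10,20)
[1] 11 22 13 24 15 26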
If the length of the larger vector is not a multiple of the length of the smaller vector, a
warning message will be given. The concatenation function can be used to concatenate vectors.
> c(small,large,small)
[1] 1 2 0 0 0 0 0 0 1 2
We have now created a number of objects. To ensure clarity in the following examples we need
to remove all of the objects we have created.
> rm(list=objects())
We want to work with data sets. In general we have multiple observations for each variable.
Vectors provide a convenient way to store observations.
We have taken a random sample of the weight of 5 sheep in the UK. The weights (kg) are
84.5 72.6 75.7 94.8 71.3
We are going to put these values in a vector and illustrate some standard procedures.
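A sketch of the commands (the exact definitions of numobs, total and meanweight are assumptions; those identifiers reappear in the workspace listing later in this chapter):
> weight <- c(84.5, 72.6, 75.7, 94.8, 71.3)
> numobs <- length(weight)
> total <- sum(weight)
> meanweight <- total/numobs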
> mean(weight)
[1] 79.78
A data frame is an R object that can be thought of as representing a data set. A data frame
consists of variables (column vectors) of the same length with each row corresponding to an
experimental unit. The general syntax for setting up a data frame is
name <- data.frame(variable1 , variable2 , ...)
and an individual variable can then be accessed as name $variable .
Once a data frame has been created we can view and edit it in a spreadsheet format using the
command fix(...) (or equivalently data.entry(...)). New variables can be added to an
existing data frame by assignment.
Suppose that, for each of the sheep weighed in the example above, we also measure the height
at the shoulder. The heights (cm) are
86.5 71.8 77.2 84.9 75.4.
We will set up another variable for height. We would also like to have a single structure in which
the association between weight and height (that is, that they are two measurements of the same
sheep) is made explicit. This is done by adding each variable to a data frame. We will call the
data frame sheep and view it using fix(sheep).
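A sketch of the commands, assuming the weight vector created above:
> height <- c(86.5, 71.8, 77.2, 84.9, 75.4)
> sheep <- data.frame(weight, height)
> fix(sheep)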
Suppose that a third variable consisting of measurements of the length of the sheep’s backs
becomes available. The values (in cm) are
130.4 100.2 109.4 140.6 101.4
We can include a new variable in the data frame using assignment. Suppose we choose the
identifier backlength for this new variable:
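A sketch of the assignment, using the values listed above:
> sheep$backlength <- c(130.4, 100.2, 109.4, 140.6, 101.4)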
A set of descriptive statistics is produced by the function summary(...). The argument can
be an individual variable or a data frame. The output is a table. Most functions to generate
descriptive statistics are reasonably obvious; a couple are given below with more in the glossary.
> summary(sheep$weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
71.30 72.60 75.70 79.78 84.50 94.80
> summary(sheep)
weight height backlength
Min. : 71.30 Min. : 71.80 Min. : 100.2
1st Qu. : 72.60 1st Qu. : 75.40 1st Qu. : 101.4
Median : 75.70 Median : 77.20 Median : 109.4
Mean : 79.78 Mean : 79.16 Mean : 116.4
3rd Qu. : 84.50 3rd Qu. : 84.90 3rd Qu. : 130.4
Max. : 94.80 Max. : 86.50 Max. : 140.6
> IQR(sheep$height)
[1] 9.5
> sd(sheep$backlength)
[1] 18.15269
• If you have a query about a specific function then typing ? and then the function's name
at the prompt will bring up the relevant help page.
> ?mean
> ?setwd
> ?t.test
• If your problem is of a more general nature, then typing help.start() will open up a
window which allows you to browse for the information you want. The search engine on
this page is very helpful for finding context specific information.
All of the objects created during an R session are stored in a workspace in memory. We can see
the objects that are currently in the workspace by using the command objects(). Notice the
(), these are vital for the command to work.
> objects()
[1] "height" "meanweight" "numobs" "sheep" "total"
[6] "weight"
The information in the variables height and weight is now encapsulated in the data frame
sheep. We can tidy up our workspace by removing the height and weight variables (and
various others that we are no longer interested in) using the rm(...) function. Do this and
then check what is left.
> rm(height,weight,meanweight,numobs,total)
> objects()
[1] "sheep"
The height and weight variables are now only accessible via the sheep data frame.
> weight
Error: Object "weight" not found
> sheep$weight
[1] 84.5 72.6 75.7 94.8 71.3
The advantage of this encapsulation of information is that we could now have another data frame,
say dog, with height and weight variables without any ambiguity. However, the $ notation can
be a bit cumbersome. If we are going to be using the variables in the sheep data frame a lot,
we can make them visible from the command line by using the attach(...) command. When
we have finished using the data frame, it is good practice to use the detach() command (notice
the empty () again) so the encapsulated variables are no longer visible.
> weight
Error: Object "weight" not found
> attach(sheep)
> weight
We can save the current workspace to file at any time. It is a good idea to create a folder for
each new piece of work that you start. We are going to create a folder called ST419chapter1 for
the work we are doing in this chapter. The correct R terminology for this folder is the working
directory. You can set ST419chapter1 to be the working directory whenever you want to work
on objects created during work on this chapter (see §1.2.11).
• Create a folder
In My computer use File −→New −→Folder to create a new folder in your
H: space and give your new folder the name ST419chapter1.
The command to save a workspace is save.image(...). If we do not specify a file name the
workspace will be saved in a file called .Rdata. It is usually a good idea to use something a
little more informative than this default.
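A sketch of the commands (the file name matches the directory listing described below; the H: path is the folder created above):
> setwd("H:/ST419chapter1")
> save.image("intro1.Rdata")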
The dir() command gives a listing of the files in the current working directory. You will see
that your ST419chapter1 directory now contains a file intro1.Rdata. This file contains the
objects from our current session, that is, the sheep data frame. It is important to be clear about
the distinction between objects (which are contained in a workspace, listed by objects()) and
the workspace (which can be stored in a file, contained in a directory, listed by dir()).
In practice, five observations would not provide a very good basis for inference about the entire
UK sheep population. Usually we will want to import information from large (potentially very
large) data sets that are in an electronic form. We will use data that are in a plain text format.
The variables are in columns with the first row of the data giving the variable names. The file
sheep.dat contains weight and height measurements from 100 randomly selected UK sheep.
We are going to copy this file to our working directory ST419chapter1. The information is read
into R using the read.table(...) function. This function returns a data frame.
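A sketch of the command (the identifier sheep2 is used in later examples):
> sheep2 <- read.table("sheep.dat", header = TRUE)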
Using header = TRUE gives us a data frame in which the variable names in the first row of the
data file are used as identifiers for the columns of our data set. If we exclude the header =
TRUE, the first line will be treated as a line of data.
Common wisdom states that the population mean sheep weight is 80kg. The data from 100
randomly selected sheep may be used to test this. We formulate a hypothesis test in which
the null hypothesis is that the population mean weight, µ, of UK sheep is 80kg and the alternative is that the
population mean takes a different value:
H0 : µ = 80,
H1 : µ ≠ 80.
We set a significance level of 5%, that is α = 0.05. Assuming that sheep weight is normally
distributed with unknown variance, the appropriate test is a t-test (two-tailed). We can use the
function t.test(...) to perform this test.
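The call here takes the form below (assuming the variables in sheep2 have been attached):
> t.test(weight, mu = 80)
        One Sample t-test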
data: weight
t = 2.1486, df = 99, p-value = 0.03411
alternative hypothesis: true mean is not equal to 80
95 percent confidence interval:
80.21048 85.29312
sample estimates:
mean of x
82.7518
Notice that the first argument of the t.test function is the variable that we want to test. The other
arguments are optional. The argument mu is used to set the value of the mean that we would like
to test (the default is zero). The output includes the sample value of our test statistic t = 2.1486
and the associated p-value 0.03411. For this example, p < α so we reject H0 and conclude that
there is evidence to suggest the mean weight of UK sheep is not 80kg. What conclusion would
we have come to if the significance level had been 1%?
We can use the alternative argument to do one-tailed tests. For each of the following write
down the hypotheses that are being tested and the conclusion of the test.
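For illustration only (these particular calls are assumptions, not the original exercise list), one-tailed tests take the form:
> t.test(weight, mu = 80, alternative = "greater")
> t.test(weight, mu = 80, alternative = "less")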
The weight of sheep is of interest to farmers. However, weighing the sheep is time consuming
and emotionally draining (the sheep do not like getting on the scales). Measuring sheep height
is much easier. It would be enormously advantageous for the farmer to have a simple mechanism
to approximate a sheep’s weight from a measurement of its height. One obvious way to do this
is to fit a simple linear regression model with height as the explanatory variable and weight as
the response.
The plausibility of a linear model can be investigated with a simple scatter plot. The R command
plot(...) is very versatile; here we use it in one of its simplest forms.
> plot(height,weight)
Notice that the x-axis variable is given first in this type of plot command. To fit linear models
we use the R function lm(...). Once again this is very flexible but is used here in a simple
form to fit a simple linear regression of weight on height.
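The command is of the form below (the identifier lmres is used in what follows):
> lmres <- lm(weight ~ height)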
• The strange argument weight ~ height: this is a model formula. The ~ means “described
by”. So the command here is asking for a linear model in which weight is described
by height. We will deal with the model formula notation in much greater detail in section
5.2.
• Nothing happens: no output from our model is printed to the screen. This is because R
works by putting all of the information into the object returned by the lm(...) function.
This is known as a model object. In this instance we are storing the information in a model
object called lmres which we can then interrogate using extractor functions.
The simplest way to extract information is just to type the identifier of a model object (in
this case we have chosen the identifier lmres). We can also use the summary(...) function to
provide more detailed information or abline(...) to generate a fitted line plot. For each of
the commands below make a note of the output.
> lmres
> summary(lmres)
> abline(lmres)
From the output of these commands write down the slope and intercept estimates. Does height
influence weight? What weight would you predict for a sheep with height 56cm?
Before ending a session it is advisable to remove any unwanted objects and then save the
workspace in the current working directory.
> save.image("intro2.Rdata")
> quit()
On quitting you will be given the option of saving the current workspace. If you say Yes the
current workspace will be saved to the default file .Rdata (we have already saved the workspace,
there is no need to save it again).
In order to return to the workspace we have just saved, restart R, set the appropriate working
directory and then use the load(...) command to bring back the saved objects.
> setwd("H:/ST419chapter1")
> load("intro2.Rdata")
> objects()
# you should have all the objects from the previous session
We have already used the function length(...) to get the length of a vector. This can be
applied to other objects.
> length(sheep2)
[1] 2
> length(plot)
[1] 1
The length of a data frame is the number of variables it contains. The length of a function can
be evaluated but is a bit meaningless. The function mode(...) gives us the mode of an object,
while attributes(...) lists the non-intrinsic attributes.
> mode(sheep2)
[1] "list"
> attributes(sheep2)
$names
[1] "weight" "height"
$class
[1] "data.frame"
$row.names
[1] "1" "2" "3" # ... etc to "100"
> mode(plot)
[1] "function"
> attributes(plot)
NULL
> mode(lmres)
[1] "list"
> attributes(lmres)
$names
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
$class
[1] "lm"
Notice that, although both are of mode "list", data frames are of class "data.frame" while
linear model objects are of class "lm". This influences the way they appear on screen when you
type the identifier.
> sheep2
> lmres
A data frame is printed out in full while very little output is given for a linear model object.
We can force R to treat both of these objects as lists by using the unclass(...) function. Try
the following and scroll back through the screen to see all the output.
> unclass(sheep2)
> unclass(lmres)
Notice that the final part of the output from > unclass(lmres) looks like the output from
sheep2. This is because lmres contains an attribute ($model) that is a data frame made up
of the variables that we put into our model. We can find out the class of something using the
class(...) function.
> class(lmres$model)
[1] "data.frame"
This is not quite the end of the story. The mode is a convenient device that disguises a slightly
messier reality. R really works on the type of an object. In some instances type is more specific
than mode, in others the two properties are identical. We can find out the type of something
using the function typeof(...). The way in which it is stored is given by storage.mode(...).
> typeof(sheep2)
[1] "list"
> typeof(sheep2$weight)
[1] "double"
> z <- 1 + 1i
> typeof(z)
[1] "complex"
> typeof(plot)
[1] "closure"
> storage.mode(plot)
[1] "function"
• Linear model fitting: lm(...) can fit a variety of linear models; its first argument is a model
formula.
1.4 Exercise
The data file IntelNASDAQ.dat is a plain text file containing the daily closing prices for Intel
stock and the NASDAQ index from 3rd January 2005 to 9th September 2005. The top row
contains the variable names Intel, NASDAQ, IntelB1 and NASDAQB1.
1. Import data: Create a data frame (with a suitable identifier) containing values from the
file IntelNASDAQ.dat.
2. Descriptive analysis: Using the plot(...) command with single argument will generate
a time series plot of the data (data values on the y-axis, time on the x-axis). Using the
argument type = "l" (for lines) will improve the appearance of the plot (more on this in
the next chapter).
(a) Generate a time series plot and summary statistics, means and standard deviations
for Intel and NASDAQ variables.
(b) Is the mean value useful in this context?
(c) Can we perform meaningful comparisons of the variability of Intel and NASDAQ?
3. Generate log returns: The variables IntelB1 and NASDAQB1 contain back-shifted values of
the original series, that is, yt−1 for t = 2, . . . , n.
(a) Look at the data using the spreadsheet format available in R to make sure you
understand how IntelB1 and NASDAQB1 are derived from Intel and NASDAQ. [Note:
NA (not available) denotes a missing value, more on these in section 3.2.5.]
(b) Using an appropriate arithmetic combination of the original and the back-shifted
values, generate log return series for Intel and NASDAQ.
(c) Put the log return variables you have just generated into the data frame and remove
them from the workspace.
(d) Attach the data frame.
4. Descriptive analysis of log returns: We often want R to ignore missing values in a variable.
This is done using the argument na.rm = TRUE – you will need this to calculate the mean
below.
(a) Generate time series plots and summary statistics for the log return series.
(b) How could the sign of the mean of the series be interpreted?
5. A hypothesis test: We suspect that Intel stock has not done very well. We can test the
null hypothesis that the mean return is zero against the alternative that the mean return
is negative (H0 : µ = 0 versus H1 : µ < 0.)
(a) Choose a significance level and perform the appropriate hypothesis test.
(b) Interpret the output of the hypothesis test and write down your conclusions.
(c) Comment on the validity of the test.
6. Generate beta value: We would like to determine how risky investment in Intel is relative
to an investment tracking the NASDAQ100. One possible approach is to evaluate the beta
value for Intel.
(a) Plot Intel log returns against NASDAQ log returns. Does a linear relationship seem
plausible?
(b) Perform simple linear regression of Intel log returns on NASDAQ log returns. Write
down and interpret the beta value from this fit.
(c) Why might we doubt the usefulness of a beta value calculated in this naive way?
(d) † Using the function lm(...), R automatically generates output associated with
testing H0 : β = 0 versus H1 : β ≠ 0 for our simple regression model. What is really
of interest is to test H0 : β = 1 versus H1 : β ≠ 1. Evaluate the appropriate statistic
for this hypothesis test.
1.5 Reading
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th edition,
Springer. [Chapter 1 and objects and data frames sections of chapter 2.]
• Venables, W. N. et al. (2001) An Introduction to R. [Chapter 1, sections 2.1, 2.2, 6.3 and
7.1.]
• Newbold, P (1995) Statistics for Business and Economics, 4th edition, Prentice-Hall. [A
new edition became available in 2003. One of many books with a good elementary intro-
duction to hypothesis testing and regression (chapter 9 and chapter 12 in the 4th edition).]
• Tsay, R (2002) Analysis of Financial Time Series, Wiley. [For those who want to know
more about financial series.]
• Murakami, H (1982) A Wild Sheep Chase (English translation A. Birnbaum (1989)), Vin-
tage. [Fiction – no relevance other than a possible influence on the choice of animal in the
examples.]
In this chapter we look at accessing part of an object. We have seen some examples of this
already; in section 1.2.4 we use the $ operator to select a named variable (vector) from a data
frame. We may also be interested in selecting part of a vector, for example, those values lying
in a particular range (trimming). We introduce regular sequences and logical vectors and show
how they can be used in selection. The second theme of this chapter is graphics. Pictorial
representations of both data and the results of statistical analysis are of profound importance.
Graphics are one of the key tools which statisticians use to communicate with each other and with
the rest of the world (the people who pay our wages). R has powerful graphical capabilities.
In this chapter we introduce some of the basics and provide pointers to more sophisticated
techniques.
Logic is fundamental to the operation of a computer. We use logic as a mechanism for selection
and as a tool in writing functions (see section 3.3). There are many equivalent ways of thinking
about logical operations; some alternatives are given below.
Logical statements are a part of natural language, for example, “sheep eat grass”, “the Queen
is dead” or “I am Spartacus”. A logical statement is either true or false (we will not consider
anything other than two-valued logic). In computing, logical statements can be formed by
comparing two expressions. One possible comparison is for equality; for example, (expr1 ==
expr2 ) is a logical statement. Note the double equality, ==, to denote comparison; in R a single
equality, =, means assignment (equivalent to <-) which is not what we want here. As with
natural language, a logical statement in computing is either TRUE or FALSE. Logical statements
are combined using the operators AND and OR (these are denoted & and | in R – more details
in section 2.2.6). Combining a true statement with a true statement results in a true statement
(the same holds for false statements). However, the result of combining a true statement with
a false statement depends on the operator used. The truth tables below demonstrate this.

AND      FALSE    TRUE
FALSE    FALSE    FALSE
TRUE     FALSE    TRUE

OR       FALSE    TRUE
FALSE    FALSE    TRUE
TRUE     TRUE     TRUE
The negation operator NOT (! in R) is applied to a single statement (unary operator). Negation
turns a true statement into a false one and vice versa. Suppose that A and B represent logical
statements. De Morgan's laws then tell us that
NOT(A AND B) = NOT(A) OR NOT(B)    and    NOT(A OR B) = NOT(A) AND NOT(B).
One consequence of de Morgan’s laws is that the AND operation can be represented in terms of OR
and NOT since (A AND B) = NOT(NOT(A) OR NOT(B)); similarly, (A OR B) = NOT(NOT(A) AND
NOT(B)).
The operators AND and OR are analogous to intersection and union in set theory with TRUE and
FALSE corresponding to inclusion or exclusion from a set. To illustrate, suppose that Y and Z
are sets. Then x ∈ Y but x ∉ Z implies x ∈ Y ∪ Z but x ∉ Y ∩ Z. This leads to tables with
the same structure as the truth tables 2.1.1.
∩          x ∉ Y        x ∈ Y
x ∉ Z      x ∉ Y ∩ Z    x ∉ Y ∩ Z
x ∈ Z      x ∉ Y ∩ Z    x ∈ Y ∩ Z

∪          x ∉ Y        x ∈ Y
x ∉ Z      x ∉ Y ∪ Z    x ∈ Y ∪ Z
x ∈ Z      x ∈ Y ∪ Z    x ∈ Y ∪ Z
The complement can be interpreted as negation; Y^c is the set of all elements that are not in Y.
De Morgan's laws now take on a familiar form:
(Y ∩ Z)^c = Y^c ∪ Z^c    and    (Y ∪ Z)^c = Y^c ∩ Z^c.
We consider two-valued logic. Another system that works on just two values is modular arith-
metic where the modulus is 2 (referred to as mod 2 arithmetic). The two values are 0 and 1
(representing FALSE and TRUE). The result of a mod 2 arithmetic expression is the remain-
der when the value calculated in the usual way is divided by 2. The logical operation AND is
equivalent to mod 2 multiplication and NOT is represented by mod 2 addition of 1.
mod 2 de Morgan
Use one of de Morgan’s laws to derive an expression for OR using mod
2 arithmetic.
Graphics form an important part of any descriptive analysis of a data set; a histogram provides
a visual impression of the distribution of the data and comparison with specific probability dis-
tributions, such as the normal, are possible using a quantile-quantile (qq) plot. The distribution
of several variables can be compared using parallel boxplots and relationships investigated using
scatter plots. Some plots are specific to a type of data; for example, in time series analysis, time
series plots and correlograms (plots that are indicative of serial correlation) are commonly used.
Graphical methods also play a role in model building and the analysis of output from a fitting
process, in particular, diagnostic plots are used to determine whether a model is an adequate
representation of the data.
2.2 R Essentials
At the start of each chapter’s practical work we will go through the same process; create a
directory for the work, copy the data into this directory, start R and change the working directory
to the directory we have just created.
• Create a folder
In My computer use File −→New −→Folder to create a new folder in your
H: space and give your new folder the name ST419chapter2.
• Start R
Go to Start −→Programs −→Statistics −→R to start the R software pack-
age.
For the initial part of the work we are going to use two data sets relating to marks. The first is
a toy data set with marks (percentage) for six students in three MSc courses. These data are in
a file called marks.dat. The second is a larger data set with marks (out of 40) for three difficult
exams for a second year undergraduate group (marks2.dat). The candidate numbers have been
changed to protect the innocent.
The identifiers for the data frames are mkmsc for the MSc course marks and mk2nd for the second
year exam marks. The variables in mkmsc are courseA, courseB and courseC; for mk2nd we have
exam1, exam2 and exam3.
In mathematics, the order in which arguments are passed to a function determines what we do
with them; for example, if f(u, v) = u² + 3v then f(v, u) = v² + 3u, f(a, b) = a² + 3b and so on.
We are so accustomed to this idea that it seems trivial. R functions can operate using the same
convention; for example, plot(exam1,exam2) will plot exam2 against exam1 (that is, exam1 on
the x-axis and exam2 on the y-axis) while plot(exam2,exam1) will plot exam1 against exam2.
The position of the argument in the call tells R what to do with it.
> attach(mk2nd)
> plot(exam1, exam2)
> plot(exam2, exam1)
An alternative, that is available in R and many other languages, is to use named arguments;
for example, the arguments of the plot command are x and y. If we name arguments, this takes
precedence over the ordering so plot(y = exam2, x = exam1) has exactly the same effect as
plot(x = exam1, y = exam2) (which is also the same as plot(exam1,exam2)).
In the previous example, there was little benefit (other than clarity) in the use of named argu-
ments. The real use of named arguments is when some arguments have default values. We have
encountered this already in the t.test(...) function of section 1.2.9. Often a combination of
position and naming arguments is used. We will illustrate with three of the arguments of the
t.test(...) function, namely x the data to test (the first argument), mu the population mean
in the null hypothesis and alternative which specifies the type of alternative to be used. The
argument x does not have a default value so we must pass this to the function. Suppose that we
want to test hypotheses about the population mean result for exam1, µ1 say.
>t.test(exam1)
will use the default values mu = 0 and alternative = "two.sided". The test that will be
performed is H0 : µ1 = 0 vs. H1 : µ1 ≠ 0. Specifying values of mu and alternative will have
the obvious consequences.
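For example (the value of mu used here is an assumption; the structure of the calls matches the description that follows):
> t.test(exam1, mu = 20)
> t.test(exam1, mu = 20, alternative = "greater")
> t.test(exam1, mu = 20, alternative = "less")
> t.test(mu = 20, alternative = "less", x = exam1)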
In the first of these the argument alternative is not specified so the default value ("two.sided")
is used. In all except the last of these, the position of exam1 (as the first unnamed argument)
tells R this value is the x argument. We can use the name of the x argument if we choose. Note
that the help system (?t.test ) can be used to find the names and default values of arguments.
Paired t-test
Use help to find out about the other arguments of t.test. Using the
y and paired arguments, test the hypothesis that the population mean
marks for exam1 and exam2 are identical. What about exam2 and exam3?
As mentioned in section 1.1.1, data frames have a names attribute that refers to the names of
the variables in the data frame. These names can be listed using the names(...) function.
> names(mkmsc)
[1] "courseA" "courseB" "courseC"
The operator $ allows us to access the sub-objects listed by names(...) directly. We can change
the names and change the contents of the sub-objects by assignment. Suppose that the marks
actually come from ST402, ST419 and ST422. We want to assign these names to the variables,
look at the ST419 marks and then add 2 to each of these marks.
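A sketch of the commands (the correspondence between the old and new variable names is an assumption):
> names(mkmsc) <- c("ST402", "ST419", "ST422")
> mkmsc$ST419
> mkmsc$ST419 <- mkmsc$ST419 + 2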
Non-intrinsic attributes other than names can be changed by assignment. For example, in a
data frame, we may want to change the default row names into something more informative (in
this case, the students' names).
> row.names(mkmsc)
[1] "1" "2" "3" "4" "5" "6"
In the data set of marks for the second year exams, the first column is a column of candidate
numbers. These identify each row and should really be the row names. A column in the
data file can be used as row names by specifying a number for the row.names argument in a
read.table(...) function; this number refers to the column in which row names can be found.
We will redefine our mk2nd data frame.
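A sketch of the call (the file name and header argument follow the earlier convention; column 1 holds the candidate numbers):
> mk2nd <- read.table("marks2.dat", row.names = 1, header = TRUE)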
A regular sequence is a sequence of numbers or characters that follow a fixed pattern. These
are useful for selecting portions of a vector and in generating values for categorical variables.
We can use a number of different methods for generating sequences. First we investigate the :
sequence generator.
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
> 2*1:10
[1] 2 4 6 8 10 12 14 16 18 20
> 1:10 + 1:20
[1] 2 4 6 8 10 12 14 16 18 20 12 14 16 18 20 22 24 26 28 30
> 1:10-1
[1] 0 1 2 3 4 5 6 7 8 9
> 1:(10-1)
[1] 1 2 3 4 5 6 7 8 9
The seq function allows a greater degree of sophistication in generating sequences. The function
definition is
seq(from , to , by= , length= , along= )
The arguments from and to are self explanatory. by gives the increment for the sequence and
length the number of entries that we want to appear. Notice that if we include all of these
arguments there will be some redundancy and an error message will be given. along allows a
vector to be specified whose length is the desired length of our sequence. Notice how the named
arguments are used below. By playing around with the command, see whether you can work
out what the default values for the arguments are.
> seq(1,10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(to=10, from=1)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(1,10,by=0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
[16] 8.5 9.0 9.5 10.0
> seq(1,10,length=19)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
[16] 8.5 9.0 9.5 10.0
> seq(1,10,length=19,by=0.25)
Error in seq.default(1, 10, length = 19, by = 0.25) :
Too many arguments
> seq(1,by=2,length=6)
[1] 1 3 5 7 9 11
> seq(to=30,length=13)
[1] 18 19 20 21 22 23 24 25 26 27 28 29 30
> seq(to=30)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30
Finally you can use the function rep. Find out about rep using the R help system then experi-
ment with it.
> ?rep
> rep(1, times = 3)
[1] 1 1 1
> rep((1:3), each =5)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
The logical values in R are TRUE and FALSE. Comparisons are performed using == for equality,
!= for inequality, > for greater than, >= for greater than or equal, < for less than and <= for less
than or equal to. The logical operators are & for AND, | for OR and ! for NOT. When logical
values are combined the result is another logical value (either TRUE or FALSE). For example, TRUE
& TRUE = TRUE; two true statements joined with AND results in a true statement – “pandas
eat bamboo” AND “sheep eat grass”, as a combination of two true statements, is also true.
However, TRUE & FALSE = FALSE – “statisticians are joyful” AND “pandas eat cheese” is false
since (at least) one of these statements is false. Note that the scalar logical operators || (OR)
and && (AND) are available. These are faster but do not yield vector return values.
For any logical operator, we can work out the outcome of every possible combination of TRUE
and FALSE to generate a truth table. We can do this in R by creating logical vectors.
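A sketch of the sort of commands involved (the exact vectors used in the notes are not shown; the last comparison behaves as EXCLUSIVE OR, as discussed next):
> A <- c(TRUE, TRUE, FALSE, FALSE)
> B <- c(TRUE, FALSE, TRUE, FALSE)
> A & B
[1]  TRUE FALSE FALSE FALSE
> A | B
[1]  TRUE  TRUE  TRUE FALSE
> !A
[1] FALSE FALSE  TRUE  TRUE
> A != B
[1] FALSE  TRUE  TRUE FALSE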
The last one of these is sometimes treated as a logical operator in its own right known as EX-
CLUSIVE OR (XOR). Try to work out what is going on here (a Venn diagram may help).
We can generate logical vectors from data. We use the marks example below to generate some
logical vectors. In each case give an English version of the logical statement; for example, the
first statement is ST419 marks are over 70 and R tells us that this is true for the second and
fourth students.
> attach(mkmsc)
> ST419>70
[1] FALSE TRUE FALSE TRUE FALSE FALSE
> ST419>70 & ST402>70
[1] FALSE TRUE FALSE TRUE FALSE FALSE
> ST419>70 & ST402>80
[1] FALSE FALSE FALSE TRUE FALSE FALSE
> row.names(mkmsc)=="Ali"
[1] TRUE FALSE FALSE FALSE FALSE FALSE
We often only want to use a portion of the data or a subset whose members satisfy a
particular criterion. The notation for referring to a part of a vector is the square brackets [].
This is known as the subscripting operator. The contents of the brackets is a vector, referred
to as an indexing vector. If the indexing vector is integer valued (often a regular sequence), the
elements with indices corresponding to the values of the indexing vector are selected. If there is
a minus sign in front then all elements except those indexed by the sequence will be selected.
> ST402
[1] 52 71 44 90 23 66
> ST402[1]
[1] 52
> ST402[c(1,2,6)]
[1] 52 71 66
> ST402[1:4]
[1] 52 71 44 90
> ST402[-(1:4)]
[1] 23 66
> ST402[seq(6,by=-2)]
[1] 66 90 71
> ST402[rep((1:3),each=2)]
[1] 52 52 71 71 44 44
> row.names(mkmsc)[1:3]
[1] "Ali" "Bet" "Cal"
If the contents of the [] brackets is a logical vector of the same length as our original, the result
is just those elements for which the value of the logical vector is true. We can now work things
like:
• the ST402 marks for people who got less than 65 in ST419,
• the ST422 marks for people whose total marks are less than 200.
These are implemented below. Try to work out an English statement for what the fourth
command is doing (note 50 is the pass mark for these exams).
> ST419[ST419>70]
[1] 82 78
> ST402[ST419<65]
[1] 44 23 66
> ST422[(ST402+ST419+ST422)<200]
[1] 71 55 52 61
> ST422[ST402>50 & ST419>50 & ST422>50]
[1] 71 84 68 61
In practice we may be more interested in the names of the students with ST419 marks over 70
rather than the marks themselves. Try to work out what the last two of the commands below
are doing.
> row.names(mkmsc)[ST419>70]
[1] "Bet" "Dan"
> row.names(mkmsc)[ST402<50 | ST419<50 | ST422<50]
[1] "Cal" "Eli"
> names(mkmsc)[c(sum(ST402), sum(ST419), sum(ST422)) > 350]
[1] "ST419" "ST422"
> detach()
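Character strings can be built up with the paste(...) function; a hedged sketch of the sort of command that could produce label1 (the exact arguments are an assumption):
> label1 <- paste("exam", 1:3, sep = "")
> label1
[1] "exam1" "exam2" "exam3"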
> mode(label1)
[1] "character"
# Note that the result of paste is of mode "character"
†The functions grep and regexpr can be used to find specific patterns of characters. The
wildcard . can represent any single character. Functions are also available for character substitution.
[1] 2 3 5
> grep("z.", cvec)
[1] 1 2 5
> regexpr(".z.", cvec)
[1] -1 1 -1 -1 1
attr(,"match.length")
[1] -1 3 -1 -1 3
2.2.9 Lists*
A list is an ordered collection of objects. The objects in a list are referred to as components. The
components may be of different modes. Lists can be generated using the function list(...);
the return value is a list with components made up of the arguments passed to the function.
Below we create a list, look at the intrinsic attributes then display the contents of the list.
Here we include a title, a date and our MSc exam marks data frame. We can use the indexing
operator [] to access parts of the list.
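The list might be created with a command of this form (the title string is an assumption; the date matches the output below):
> mklst <- list("MSc exam marks", c(10, 6, 2005), mkmsc)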
> mklst[2]
[[1]]
[1] 10 6 2005
> mode(mklst[2])
[1] "list"
> length(mklst[2])
[1] 1
Note that the result of using indexing on a list is another list. If we want to add 4 to the day
we cannot do this using [] indexing since mklst[2] is not numeric:
> mklst[2]+c(4,0,0)
Error in mklst[2] + c(4, 0, 0) : non-numeric argument to binary operator
To avoid this problem we use the double square bracket subscripting operator [[]]. In our
example, mklst[[2]] is a numeric vector.
> mklst[[2]]
[1] 10 6 2005
> mode(mklst[[2]])
[1] "numeric"
> mklst[[2]] <- mklst[[2]] + c(4,0,0)
> mklst
In fact, since mklst[[2]] is a numeric vector, we can refer to the first element directly (and
add 4 to it).
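For example (consistent with the value 18 shown in the output further down):
> mklst[[2]][1] <- mklst[[2]][1] + 4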
The default labels for the components of a list (1, 2, 3,...) bear no association to what
the list contains. In the example above, we have to remember that the date was the second
component of our list. This is no great hardship for a list with three elements but becomes very
inconvenient in larger lists. We can associate names with the components of a list.
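A sketch of the naming command (only the name date is confirmed by later output; the other two names are assumptions):
> names(mklst) <- c("title", "date", "marks")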
We can now refer to components of the list using either the [[]] operator or the $ operator.
> mklst[[2]]
[1] 18 6 2005
> mklst[["date"]]
[1] 18 6 2005
> mklst$date
[1] 18 6 2005
Note that in using the $ operator we can refer to a name directly (we do not need to use the "").
Names may also be assigned when we define the list; the same effect would have been achieved
using:
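A sketch (the component names other than date, and the title string, are assumptions):
> mklst <- list(title = "MSc exam marks", date = c(10, 6, 2005), marks = mkmsc)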
The attach(...) and detach() functions described in section 1.2.7 can be applied to any list
with named components.
R has powerful and flexible graphics capabilities. This section provides a flavour of the sorts of
things that are possible rather than a comprehensive treatment.
One of the simplest ways to get a feel for the distribution of data is to generate a histogram.
This is done using the hist(...) command. By setting the value of arguments of hist(...)
we can alter the appearance of the histogram; setting probability = TRUE will give relative
frequencies, nclass allows us to suggest the number of classes to use and breaks allows the
precise break points in the histogram to be specified.
> attach(mk2nd)
> hist(exam1)
> hist(exam1, probability = TRUE)
> hist(exam1, nclass=10)
> hist(exam1, breaks=c(0,20,25,30,40))
We can use a selection from a variable or data frame as the main argument in our hist(...)
command. Notice that we cannot pass the data frame as an argument directly. I have used a
trick selection (selecting the whole data frame) to get round this - there may well be a better
way.
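The trick selection looks something like this (mirroring the qq-plot commands below):
> hist(mk2nd[mk2nd <= max(mk2nd)])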
It is often useful to compare a data set to the normal distribution. The qqnorm(...) command
plots the sample quantiles against the quantiles from a normal distribution. A qqline(...)
command after qqnorm(...) will draw a straight line through the coordinates corresponding
to the first and third quartiles. We would expect a sample from a normal to yield points on the
qq-plot that are close to this line.
> qqnorm(exam2)
> qqline(exam2)
> qqnorm(mk2nd[mk2nd<=max(mk2nd)])
> qqline(mk2nd[mk2nd<=max(mk2nd)])
Boxplots provide another mechanism for getting a feel for the distribution of data. Parallel
boxplots are useful for comparison. The full name is a box-and-whisker plot. The box is made
up by connecting three horizontal lines: the lower quartile, median and upper quartile. In the
default set up, the whiskers extend to any data points that are within 1.5 times the inter-quartile
range of the edge of the box.
> boxplot(exam1,exam2,exam3)
# The labels here are not very informative
> boxplot(mk2nd)
# Using the data frame as an argument gives a better result
The default plot type in R is usually quite sensible but we may on occasion want to change
it. This is done using the type argument in the plotting function. Below is a (rather silly)
illustration of some of the different types of plot.
> plot(exam1,exam2)
> plot(exam1, exam2, type="p") # points (the default)
> plot(exam1, exam2, type="l") # lines
> plot(exam1, exam2, type="b") # both lines and points
> plot(exam1, exam2, type="o") # overlaid lines on points
> plot(exam1, exam2, type="h") # high density (vertical lines)
If you look at the original scatter plot you will notice that the individual points are marked by small
circles. This is not necessarily what we want. The symbol used to mark the points can be
changed by altering a graphics parameter pch (plotting character). There are a huge number of
graphics parameters that can be altered but this is one of the most useful to be able to change.
You might want to make a permanent change to a graphics parameter. This is done using the
par(...) command. This command can also be used to list the graphics parameters.
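For example (pch = 3, a plus sign, is just an illustrative choice):
> plot(exam1, exam2, pch = 3)   # change the plotting symbol for this plot only
> par(pch = 3)                  # change the default symbol for subsequent plots
> par()                         # list the current graphics parameter settings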
R has a number of interactive graphics capabilities. One of the most useful is the identify(...)
command. This allows us to label interesting points on the plot. After an identify(...)
command, R will wait while the user selects points on the plot using the mouse. The process
is ended by right-clicking and choosing Stop.
> plot(exam1,exam2)
> identify(exam1,exam2)
> identify(exam1,exam2,row.names(mk2nd))
# Note the default marks are the position (row number) of the point in
# the data frame. Using row names may be more informative.
R allows you to put more than one plot on the page by setting the mfrow parameter. The value
that mfrow is set to is an integer vector of length 2 giving the number of rows and the number
of columns.
> par(mfrow=c(3,2))
> hist(exam1)
> qqnorm(exam1)
> hist(exam2)
> qqnorm(exam2)
> hist(exam3)
> qqnorm(exam3)
> par(mfrow=c(1,1))
You could use par(mfcol=c(3,2)) to get the same 3 × 2 multi-figure plot. Work out the
difference between mfrow and mfcol. R also allows you to change the tick marks and labels, the
borders around plots and the space allocated for titles – more can be found in Venables et al.
(see supplementary reading section 2.6.2).
R allows you to build plots up more or less from scratch. The commands text(...) and
lines(...) can be used to add text and lines to a plot. In the following example we plot the
positions of five cities (using their longitude and latitude) and then connect them using lines.
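A sketch of the sort of commands involved (the city names and coordinates below are illustrative assumptions, not the values used in the notes):
> long <- c(-0.1, 2.4, -3.7, 12.5, 13.4)   # approximate longitudes: London, Paris, Madrid, Rome, Berlin
> lat <- c(51.5, 48.9, 40.4, 41.9, 52.5)   # corresponding approximate latitudes
> plot(long, lat, pch = 16)
> text(long, lat, c("London", "Paris", "Madrid", "Rome", "Berlin"), pos = 3)
> lines(long, lat)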
We can generate various types of three dimensional plot in R. The input for this type of plotting
function is two vectors (values from the x and y axes) and a matrix (values from the surface
that is to be plotted). We are going to generate 3-D plots for the surface
f(x, y) = x²y³
for x ∈ [−1, 1] and y ∈ [−1, 1]. In order to do this we first generate a regular sequence of 50
points on the interval [−1, 1] and then use the outer function to give the 50 × 50 matrix (outer
product) of points on the surface (more on matrices in section 3.2.7).
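A sketch of the commands (persp(...) is one possible choice of 3-D plotting function; contour(...) and image(...) accept the same form of input):
> x <- seq(-1, 1, length = 50)
> y <- seq(-1, 1, length = 50)
> z <- outer(x^2, y^3)   # 50 x 50 matrix with entries x[i]^2 * y[j]^3
> persp(x, y, z)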
Trellis graphics provide graphical output with a uniform style. The functions are written in R
and are slow and memory intensive. A graph produced using trellis graphics cannot be altered
so everything we want to include must be specified in the initial command. Trellis graphics
are implemented using the lattice library (more on the use of libraries in the next chapter).
We will not use trellis graphics in this course. However, if you are interested in producing
sophisticated conditional plots, you may want to look at the trellis graphics section of Venables
and Ripley (see supplementary reading section 2.6.2).
• Regular sequences : sequence generator gives sequence with unit steps between left
and right arguments
seq(...) more general regular sequence generator, can set step
(by) and/or length of sequence (length)
rep(...) for generating sequence with repeated values
• Logical values TRUE and FALSE
• Generating lists list(...) put arguments (which may be named) into a list.
2.5 Exercises
The file IntelNASDAQdated.dat contains closing prices of Intel stocks and the NASDAQ100
index from 3rd January 2000 to 9th September 2005. The variables in this file are Day, Month,
Year, Intel, NASDAQ, IntelLR and NASDAQLR. Here, LR denotes log returns. The months are
specified by their first three letters, that is Jan, Feb, . . . . Import these data into an appropriately
named data frame.
(a) Plot the Intel log returns against NASDAQ log returns.
(b) Identify the outliers in the plot generated above.
(c) Use the Day, Month and Year variables to construct row names for the data frame.
(d) Repeat the plot and identification stages above using the row names you have just
generated to label the points.
(e) What do the outliers you have identified correspond to?
2. Comparing 2000 with 2003: Selection allows us to construct subvectors and compare them.
(a) Using appropriate selection construct a data frame containing log returns for both
series for 2000.
(b) Construct another data frame with log returns for both series for 2003.
(c) Generate a scatter plot of Intel log returns against NASDAQ log returns for the year
2000.
(d) Add points for 2003 to the plot generated above using a different colour and plotting
character.
(e) Comment on this plot noting any similarities and differences between 2000 and 2003.
(f) Fit linear models with Intel log returns as the response variable and NASDAQ log
returns as the explanatory variable for 2000. Repeat for 2003. Compare the resulting
models.
3. (a) Change the appropriate graphics parameters to get four plots to a page.
(b) Generate histograms for Intel log returns and NASDAQ log returns for 2000 and 2003
(four plots in all).
(c) The plots just generated are not much good for comparison. Redo this plot setting
the breaks in the histograms so that they are all drawn on the same horizontal scale.
(d) Reset the appropriate graphics parameters to get one plot per page.
4. Boxplots: Often a good way of doing quick comparisons between several variables.
5. Test of variances: It seems clear from the descriptive plots that 2003 differs from 2000 quite
dramatically. We can perform hypothesis tests to check whether the statistics support this.
(a) Find the mean price of Intel stock on your birthday for the period.
(b) Find the date on which the NASDAQ index reached its lowest point for the period.
Repeat for Intel and comment on the results.
(c) Find the standard deviation of Intel log returns for the single month that contains
the highest price of Intel stock.
(a) Create a vector containing the standard deviations of Intel log returns for each month
from January 2000 to September 2004. [Hint: look up the use of factors – chapter 4
of An Introduction to R.]
(b) Plot the values of the standard deviations against time.
2.6 Reading
• Venables, W. N. et al. (2001) An Introduction to R. [Chapter 2 for vectors and selection,
chapter 6 for data frames and lists, chapter 7 for reading in data and chapter 12 for
graphics.]
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th edition,
Springer. [Chapter 2 (excluding 2.4) for data manipulation (includes some topics we have
not covered yet), chapter 3 up to 3.4 for more on language elements and chapter 4 for
graphics.]
• Venables, W. N. et al. (2001) An Introduction to R. [Section 12.5 for details of plot
spacing and tick marks, chapter 4 for factors.]
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th edition,
Springer. [Section 4.5 for trellis graphics]
Writing functions in R
A program is a set of instructions for a computer to follow. Putting a set of instructions together
in a program means that we do not have to rewrite them every time we want to execute them.
Programming a computer is a demanding (but potentially rewarding) task. The process is made
simpler and the end product more effective by following the simple guidelines set out below.
• Problem specification: The starting point for any program should be specification of
the problem that we would like to solve. We need to have a clear idea of what we want
the program to do before we start trying to write it. In particular, the available inputs
(arguments) and the desired output (return value) should be specified.
• Code planning: In the early stage of writing a program it is best to stay away from the
machine. This is hard to do, but time sketching out a rough version of the program with
pen and paper is almost always time well spent. Developing your own pseudo-code, something
between natural language and a programming language, often helps this process.
• Identifying constants: If you see a number cropping up repeatedly in the same context,
it is good practice to assign this value to an identifier at the start of the program. The
advantage is one of maintenance; if we need to change the value it can be done with a
single alteration to the program.
• Commenting: Do not rely on ‘obvious code’. Just a few simple comments about what the program is for and what each
part does will help.
• Solving runtime problems: Having carefully crafted and typed in your program, more
often than not, the computer will refuse to run it the first time of asking. You may get
some vaguely helpful suggestion as to where you have gone wrong. We will not be writing
big programs so we do not require very sophisticated debugging tools. If the error message
does not lead you straight to the solution of the problem, it is possible to comment out
sections of the program to try to isolate the code which is causing the problem.
• Program verification: Once the program runs, we need to make sure it does what it is
intended to do. It is possible to automate the process of program verification but we will
usually be satisfied with checking some cases for which the correct output is known.
Computers are very good at performing repetitive tasks. If we want a set of operations to
be repeated several times we use what is known as a loop. The computer will execute the
instructions in the loop a specified number of times or until a specified condition is met. Once
the loop is complete, the computer moves on to the section of code immediately following the
loop as illustrated below.
program:               section A → loop B → section C
order of execution:    A B B . . . B C
There are three types of loop common to most (C-like) programming languages: the for loop,
the while loop and the repeat loop. These types of loop are equivalent in the sense that a loop
constructed using one type could also be constructed using either of the other two. Details of
the R implementation can be found in sections 3.5.1 and 3.5.3. In general loops are implemented
very inefficiently in R; we discuss ways of avoiding loops in section 3.5.4. However, a loop is
sometimes the only way to achieve the result we want.
Often there are parts of the program that we only want to execute if certain conditions are
met. Conditional statements are used to provide branches in computer programs. The simplest
structure is common to most programming languages:
if (condition ) ifbranch
program:                                         section A → if (condition ) B → section D
order of execution if the condition is true:     A B D
order of execution if the condition is false:    A D
In this example, B forms the if branch of our program. Another common conditional statement
takes the form
program:                                         section A → if (condition ) B else C → section D
order of execution if the condition is true:     A B D
order of execution if the condition is false:    A C D
Mathematical models fall into two categories, deterministic and stochastic. For a deterministic
model, if we know the input, we can determine exactly what the output will be. This is not true
for a stochastic model; the output of a stochastic model is a random variable. Remember that we
are talking about the properties of models; discussions of the deterministic or stochastic nature
of real-life phenomena often involve very muddled thinking and should be left to the philoso-
phers. Simple deterministic models can produce remarkably complex and beautiful behaviour.
An example is the logistic map,
$x_{n+1} = r x_n (1 - x_n), \qquad\qquad (3.1)$
for $n = 0, 1, 2, \ldots$ and starting value $x_0 \in (0, 1)$. For some values of the parameter r, this map
will produce a sequence of numbers x1 , x2 , . . . that look random. Put more precisely, we would
not reject a null hypothesis of independence if we were to test these values. We call sequences
of this sort pseudo-random. Pseudo-random number generators play an important role in the
study of model properties using simulation (see chapter 4).
3.2 R essentials
R maintains a record of command history so that we can scroll through previous commands
using ↑ and ↓ keys. It may sometimes be useful to save commands for future use. We can save
the command history at any stage using
> savehistory("filename.Rhistory")
The commands entered so far will be saved to file in a plain text format. All of the sessions and
my solutions to the exercises are available as .Rhistory files on the webpage and public folders.
In order to restore saved commands we use
> loadhistory("filename.Rhistory")
The commands will not be run but will be made available via the ↑ key. In order to run a saved set
of commands or commands in plain text format from some other source we use
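the source(...) command. A minimal sketch, assuming the commands have been saved in a file called filename.R, is
> source("filename.R", echo=TRUE)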
The echo argument tells R to print the commands and their results to the screen. All of the commands in the file will
be executed using this approach. It is also possible to paste commands (that have been copied
to the Windows clipboard) at the R command prompt.
3.2.2 Packages
A package in R (library section in S) is a file containing a collection of objects which have some
common purpose. For example, the package MASS contains objects associated with Venables
and Ripley’s Modern Applied Statistics with S. We can generate a list of available packages,
make the objects in a package available and get a list of the contents of a package using the
library(...) command.
If we are specifically interested in the data sets available in a package we can use the data(...)
command to list them and add them to our workspace.
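For example, using the MASS package:
> library()                        # list the packages available on this machine
> library(MASS)                    # make the objects in MASS available
> library(help = MASS)             # list the contents of the MASS package
> data(package = "MASS")           # list the data sets in MASS
> data(Cars93, package = "MASS")   # copy the Cars93 data frame into the workspace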
Information on recommended packages (in the very large R Reference Index) along with a guide
to writing R extensions can be found at http://cran.r-project.org/manuals.html
When we pass an identifier to a function R looks for an associated object. The first place R
looks is in the workspace (also known as the global environment .GlobalEnv). If the object is
not found in the workspace, R will search any attached lists (including data frames), packages
loaded by the user and automatically loaded packages. The real function of attach(...) is now
clear; attach(...) places a list in the search path and detach(...) removes it. The function
search() returns the search path. Passing the name of an environment, or its position in the search path, to the
objects(...) function will list the objects in that environment.
> search()
> objects(package:MASS)
> objects(4)
For anything other than the simplest functions, the format of the output that is printed to screen
must be carefully managed; a function is of little use if its results are not easy to interpret. The
function cat(...) concatenates its arguments and prints them as a character string. The
function substring(...) is used to extract part of a character string.
Note that "\n" is the new line character. The function format(...) allows greater flexibility
including setting the accuracy to which numerical results are displayed. A combination of
cat(...) and format(...) is often useful.
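A small illustration of how these functions might be combined:
> x <- sqrt(2)
> cat("The square root of 2 is approximately", format(x, digits=4), "\n")
The square root of 2 is approximately 1.414
> substring("Computational Statistics", 1, 13)
[1] "Computational"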
We have already encountered the special value NA denoting not available. This is used to
represent missing values. In general, computations involving NAs result in NA. In order to get
the result we require, it may be necessary to specify how we want missing values to be treated.
Many functions have named arguments in which the treatment of missing values can be specified.
For example, for many statistical functions, using the argument na.rm = TRUE instructs R to
perform the calculation as if the missing values were not there. R has a representation of infinity
(Inf). However, the results of some expressions are not computable, for example 0/0 or Inf -
Inf. R uses NaN to denote not a number.
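For example (the marks vector is just for illustration):
> marks <- c(58, 72, NA, 65)
> mean(marks)
[1] NA
> mean(marks, na.rm=TRUE)
[1] 65
> 0/0
[1] NaN
> Inf - Inf
[1] NaN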
†The value NA is of mode "logical". Despite the assurances of section 2.1.1, R can actually
handle three-valued logic. Try the following commands, construct the truth tables and try to work out
what is going on.
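Some commands worth trying (the particular combinations here are just suggestions):
> NA & TRUE
> NA & FALSE
> NA | TRUE
> NA | FALSE
> NA == NA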
Some functions will only accept objects of a particular mode or class. It is useful to be able
to convert (coerce) an object from one class to another. The generic function for this purpose
is as(...). The first argument is the object that we would like to convert and the second is
the class we would like to convert to. There are a number of ad hoc functions of the form as.x
that perform a similar task. Particularly useful are as.numeric(...) and as.vector(...).
The function is(...) and the similar is.x (...) functions check whether an object is of the specified
class or type, for example is.logical(...) and is.numeric(...). [There are two functions
of the form is.x (...) that do not fit this pattern: is.element(...) checks for
membership of a set; is.na(...) checks for missing values element by element.]
> is.vector(roots.of.two)
> is.character(roots.of.two)
> is(roots.of.two, "character")
> as.character(roots.of.two)
> is.logical(lvec1)
> as(lvec1, "numeric")
> is(as(lvec1, "numeric"), "numeric")
An array in R consists of a data vector that provides the contents of the array and a dimension
vector. A matrix is a two-dimensional array. The dimension vector has two elements: the number
of rows and the number of columns. A matrix can be generated by the array(...) function
with the dim argument set to be a vector of length 2. Alternatively, we can use the matrix(...)
command or provide a dim attribute for a vector. Notice the direction in which the matrix is
filled (column by column).
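For example (the values 1:6 are arbitrary):
> m1 <- matrix(1:6, nrow=2, ncol=3)
> m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> dim(m1)
[1] 2 3
> v <- 1:6
> dim(v) <- c(2,3)        # the same matrix constructed via a dim attribute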
Using arithmetic operators on similar arrays will result in element by element calculations being
performed. This may be what we want for matrix addition. However, it does not correspond to
the usual mathematical definition of matrix multiplication.
> m1+m1
> m1*m1
Conformable matrices (that is, matrices for which the number of columns of the first is equal to
the number of rows of the second) can be multiplied using the operator %*%. Matrix inversion
and (equivalent) solution of linear systems of equations can be performed using the function
solve(...). For example, to solve the system Ax = b, we would use solve(A, b). If a single
matrix argument is passed to solve(...), it will return the inverse of the matrix.
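A small sketch (the matrix A and vector b below are arbitrary):
> A <- matrix(c(2, 1, 1, 3), nrow=2)
> b <- c(1, 2)
> A %*% A          # matrix multiplication
> solve(A)         # the inverse of A
> solve(A, b)      # the solution of Ax = b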
The outer product of two vectors is the matrix generated by looking at every possible combi-
nation of their elements. The first two arguments of the outer(...) function are the vectors.
The third argument is the operation that we would like to perform (the default is multiplica-
tion). This is a simplification since the outer product works on arrays of general dimension (not
just vectors). The following example is taken from Venables et al.’s An Introduction to R.
The aim is to generate the mass function for the determinant of a 2 × 2 matrix whose
elements are chosen from a discrete uniform distribution on the set {0, . . . , 5}. The determinant
is of the form AD − BC where A, B, C and D are uniformly distributed. The mass function
can be calculated by enumerating all possible outcomes. It exploits the table(...) function
that forms a frequency table of its argument (notice how plot(...) works for objects of class
"table").
Various other matrix operations are available in R: t(...) will take the transpose, nrow(...)
will give the number of rows and ncol(...) the number of columns. The function rbind(...)
(cbind(...)) will bind together the rows (columns) of matrices with the same number of
columns (rows). We can get the eigenvectors and eigenvalues of a symmetric matrix using
eigen(...) and perform singular value decomposition using svd(...). Investigating the use
of these commands is left as an exercise.
We have seen in the previous two sections that there are a large number of useful functions built
into R; these include mean(...), plot(...) and lm(...). Explicitly telling the computer
to add up all of the values in a vector and then divide by the length every time we wanted
to calculate the mean would be extremely tiresome. Fortunately, R provides the mean(...)
function so we do not have to do long winded calculations. One of the most powerful features
of R is that the user can write their own functions. This allows complicated procedures to be
built with relative ease.
The general form of a function definition is
name <- function(arglist ) expr1
We will discuss the possible forms of the argument list in section 3.6. When the function is called
the statements that make up expr1 are executed. The final line of expr1 gives the return value.
Consider the logistic map (3.1). The map could be rewritten as follows
$x_{n+1} = g(x_n),$
where
$g(x) = r x (1 - x).$
Clearly, the quadratic function g plays an important role in the map. We can write R code for
this function; note that we follow the usual computing convention and give this function the
descriptive name logistic rather than calling it g.
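Consistent with the calls that follow, the definition is a single line:
> logistic <- function(r, x) r*x*(1-x)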
Now that we have defined the function, we can use it to evaluate the logistic function for different
values of r and x (including vector values).
> logistic(3,0)
[1] 0
> logistic(3,0.4)
[1] 0.72
> logistic(2,0.4)
[1] 0.48
> logistic(3.5, seq(0,1,length=6))
[1] 0.00 0.56 0.84 0.84 0.56 0.00
The expression whose statements are executed by a call to logistic(...) is just the single line
r*x*(1-x). This is also the return value of the function. The expression in a function may run
to several lines. In this case the expression is enclosed in curly braces { } and the final line of
the expression determines the return value. In the following function we use the logistic(...)
function that we have just defined.
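A minimal version of such a function might be:
> lmap1 <- function(r, x) logistic(r, x)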
This function lmap1 performs a single iteration in the logistic map. Note that the argument
name r is just a label. We can name this argument what we like.
This is a silly example to prove a point; in practice it is best to use sensible names that you will
be able to understand when you come back to look at the code.
In many instances we are more interested in the side effects of a function than its return value.
This type of function is written in exactly the same way as other functions. The only difference
is that we do not need to worry about the final line. [In fact it is polite to return a value that
indicates whether the function has been successful or not but this will not concern us for now.]
The general form of a for loop is
for (name in sequence ) expr1
Here sequence is any vector expression, but it usually takes the form of a regular sequence
such as 1:5. The statements of expr1 are executed once for each value of the loop variable name
in the sequence. A couple of examples follow.
> load("moreRessentials.Rdata")
> attach(mk2nd)
> for (i in 1:length(exam1))
+ { ans <- exam1[i] + exam2[i] + exam3[i]
+ cat(row.names(mk2nd)[i], " total: ", ans, "\n")
+ }
if (condition ) ifbranch
if (condition ) ifbranch else elsebranch
Here the condition is an expression that yields a logical value (TRUE or FALSE) when evaluated.
This is typically a simple expression like x > y or dog == cat. A couple of examples follow to
illustrate the difference between if and if - else statements.
> for (i in 1:5)
+ { if (i<3) cat("small")
+   cat(" number \n")
+ }
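A second version, using if - else (the cut-off and messages are just for illustration):
> for (i in 1:5)
+ { if (i<3) cat("small") else cat("large")
+   cat(" number \n")
+ }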
As a slightly more substantial example, suppose that in order to pass a student must achieve a
total mark of 60 or higher. We can easily write R code to tell us which students have passed.
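A sketch of such code, using the mk2nd marks attached above and the pass rule just described:
> for (i in 1:length(exam1))
+ { total <- exam1[i] + exam2[i] + exam3[i]
+   if (total >= 60) cat(row.names(mk2nd)[i], " PASS \n")
+   else cat(row.names(mk2nd)[i], " FAIL \n")
+ }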
In section 3.1.2 we mention the fact that loops are not efficiently implemented in R. One way
of avoiding the use of loops is to use commands that operate on whole objects. For example,
ifelse(...) is a conditional statement that works on whole vectors (rather than requiring a
loop to go through the elements).
ifelse(condition,vec1,vec2 )
If condition, vec1 and vec2 are vectors of the same length, the return value is a vector whose
ith element is vec1[i] if condition[i] is TRUE and vec2[i] otherwise. If condition, vec1
and vec2 are of different lengths, the recycling rule is used. We exploit recycling in this version
of the PASS/FAIL example above.
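For example:
> totals <- exam1 + exam2 + exam3
> ifelse(totals >= 60, "PASS", "FAIL")   # "PASS" and "FAIL" are recycled to the length of the condition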
This is not quite what we want. In the previous example we had candidate numbers associated
with the PASS/FAIL results. We will return to this example in section 3.5.4.
†3.5.3 More loops – while and repeat
We can achieve the same result as a for loop using either a while or repeat loop.
A while loop continues execution of the expression while the condition holds true. A repeat
loop repeatedly executes the expression until explicitly terminated. We illustrate with the trivial
example from section 3.5.1 that just prints out numbers from 1 to 5.
> j <- 1
> while (j<6)
+ { print(j)
+ j <- j+1
+ }
> j <- 1
> repeat
+ { print(j)
+ j <- j+1
+ if (j>5) break
+ }
Note that in these examples we have to explicitly initialise and increment the value of our loop
variable j. The break command can be used in any type of loop to terminate and move execution
to the final } of the loop expression. In a repeat loop, break is the only way to terminate the
loop.
A while loop can be useful when we have a condition for termination of the loop rather than an
exact point. For example, suppose that we want to get the first 10 pass marks from the mk2nd
data. We do not know exactly how far through the data we must look to find 10 pass marks.
We could write a for loop with a break statement but a while loop is neater.
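A sketch of such a while loop:
> passmarks <- numeric(0)
> i <- 0
> while (length(passmarks) < 10)
+ { i <- i + 1
+   total <- exam1[i] + exam2[i] + exam3[i]
+   if (total >= 60) passmarks <- c(passmarks, total)
+ }
> passmarks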
Loops are an important part of any programming language. However, as we have mentioned
before, loops are very inefficiently implemented in R (S is no better). Having learnt about loops,
it is now important to learn how to avoid them wherever possible. We return to previous
examples to demonstrate how the same operations can be performed using whole vectors rather
than running through the indices using a loop. First, consider the example in section 3.5.1. It is easy to
avoid using loops here by adding the vectors and using the cat(...) and paste(...) functions
to get the correct output.
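For example:
> cat(paste(row.names(mk2nd), " total: ", exam1+exam2+exam3), fill=30)   # fill width chosen for illustration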
Note that the fill argument of cat(...) determines the width of the output. This allows us
to get the output as a column (of the appropriate width). In section 3.5.2 we had nearly solved
the vectorization problem but the output was not quite what we wanted. This problem is solved
below.
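For example:
> totals <- exam1 + exam2 + exam3
> ifelse(totals >= 60, paste(row.names(mk2nd), "PASS"), paste(row.names(mk2nd), "FAIL"))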
We now consider an example that involves the use of two functions that we have not encountered
before. The function scan(...) is used to read vectors in from file while apply(...) allows
us to apply functions to whole sections of arrays (a good way of avoiding loops). The apply
function has three arguments, the first is an array (or matrix) and the third is the function that
we would like to apply. The second argument gives the index of the array we would like to apply
the function to. In matrices an index value of 1 corresponds to rows while an index value of 2
corresponds to columns. In the following example we scan in a set of values of the NASDAQ
index for 23 days in October 2003. For each day we have 39 readings (intradaily data, ten-minute
readings from 8:40 to 15:00). We construct a matrix of log returns, remove the first row (why?)
and work out the standard deviation of the log returns for each of the remaining 38 time points
(8:50 to 15:00). Try to interpret the resulting plot.
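A sketch of the steps described above (the file name is illustrative):
> nasintra <- matrix(scan("NASDAQintra.dat"), nrow=39)        # one column per day
> logret <- log(nasintra) - log(rbind(NA, nasintra[-39, ]))   # log returns; the first row has no previous reading
> logret <- logret[-1, ]                                      # remove the first row
> sdret <- apply(logret, 1, sd)                               # sd of the log returns at each of the 38 times
> plot(sdret, type="l")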
We can exploit the ideas of named arguments and default values from 2.2.2 in functions that
we write ourselves. Consider a function to draw the binomial mass function for some values of
the number of trials n (size) and probability of success p (prob).
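A sketch of such a function (the argument names size, prob, colour and outputvals match the description that follows):
> binomplot <- function(size, prob, colour, outputvals)
+ { x <- 0:size
+   probs <- dbinom(x, size, prob)
+   plot(x, probs, type="h", col=colour)
+   if (outputvals) cbind(x, probs)
+ }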
The argument colour is a number to specify the colour of the plotted lines while outputvals
is a logical value; if outputvals is TRUE, R will print out the values used to generate the plot.
In general, we may want green lines and may not want to print out all of the values used to generate the
plot. If this is the case, we can specify default values for these arguments.
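For example (colour 3 is green in the default palette):
> binomplot <- function(size, prob, colour = 3, outputvals = FALSE)
+ { x <- 0:size
+   probs <- dbinom(x, size, prob)
+   plot(x, probs, type="h", col=colour)
+   if (outputvals) cbind(x, probs)
+ }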
Notice that the body of the function is unchanged. You can make these alterations easily using
fix(binomplot). Experiment with passing different arguments to the function.
It is often useful to be able to pass functions as arguments. For example, we may want an R
function that can plot any mathematical function (not just binomial mass). In R, we treat an
argument which is a function in an identical way to any other argument. The general plotting
function below will plot the values of a function ftoplot for a specified set of x values.
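A sketch of such a function (the default set of x values is an arbitrary choice):
> genplot <- function(ftoplot, x = seq(-5, 5, length=100), ptype = "l")
+ { plot(x, ftoplot(x), type=ptype)
+ }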
> genplot(sin)
> genplot(sin, ptype="h")
> cubfun <- function(x) x^3-6*x-6
> genplot(cubfun, x=seq(-3,2,length=500))
This works well for plotting a (mathematical) function with a single argument. However, in many
instances we need to pass more information to the (mathematical) function to be plotted. For
example, to generate binomial mass using a general function like genplot(...), we would like
to be able to pass the values of n and p. This can be done by using a ... argument. Any
arguments that cannot be identified as arguments of the main function will be passed to the
location of the ... argument.
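A sketch of how this might look (the name genplot2 is illustrative):
> genplot2 <- function(ftoplot, x = seq(-5, 5, length=100), ptype = "l", ...)
+ { plot(x, ftoplot(x, ...), type=ptype)    # extra arguments are passed on to ftoplot
+ }
> genplot2(dbinom, x=0:20, ptype="h", size=20, prob=0.3)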
Binary operators are familiar from mathematics. The usual arithmetic operators such as × and
+ could be viewed as functions with two arguments. Convention dictates that we write 1 + 2
rather than +(1, 2). R syntax (sensibly) follows this convention; using binary operators with the
arguments either side of the function name is much easier for most of us to understand. Other
binary operators include matrix multiplication %*% and outer product %o%. R allows us to write
our own binary operators of the form %name%. Examples follow (note that the "" around the
function name are not used when the function is called).
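For example, a binary operator that pastes two character strings together (the name %+% is an arbitrary choice):
> "%+%" <- function(a, b) paste(a, b, sep="")
> "ST" %+% "419"
[1] "ST419"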
• Loops for(...)
while(...)
repeat
3.10 Exercises
1. Determinants for matrices with random components: In this example we generalize the
example given at the end of section 3.2.7.
(a) Write a function to generate a plot of the mass function for the determinant of a 2 × 2
matrix with elements that are uniformly distributed on the set {0, . . . , 5}.
(b) Generalize the function written above so that it will work on an arbitrary set of
integers.
(c) Provide an argument to the function that allows us (if we choose) to print out the
frequency table on which the plot is based.
2. (a) Write a back-shift function that takes a vector as its argument and returns the back-
shifted version of the vector. The first element of the returned vector should be NA.
Do not use loops! Check that your function works.
(b) Write a version of the function that back-shifts for an arbitrary number of times d.
Set a default value for d.
3. Cobweb plot: One way to view the iterations of the logistic map is using a cobweb plot.
(a) Write an R function to plot the logistic function and add the straight line y = x to
the plot (use the lines(...) command).
(b) Edit your function so that, from a starting point of x = 0.1, it draws a line up
(vertically) to the logistic curve and then over (horizontally) to the line y = x.
(c) Include a loop to take a line vertically from the current point to the logistic curve
then horizontally over to the line y = x a fixed number of times (thus constructing
the cobweb plot).
(d) Is there any way to do this without using a loop?
4. Consider the AR(1) model
$Y_t = \phi Y_{t-1} + \varepsilon_t .$
(a) The file (saved workspace) arsim.Rdata contains a data frame (simdata) of val-
ues simulated from an AR(1) with φ taking values between −0.9 and 0.9. Use the
load(...) command to add the objects from the file to your workspace. Write
a function that will generate a time series plot of any named columns from a data
frame. Test out your function on the simdata data frame.
(b) Write a function that evaluates the likelihood for values of φ and σε2 for a specified
data vector.
(c) Write a function that uses the function from part (b) to draw the likelihood surface
for a given region of the parameter space.
5. † Bifurcation plot: The convergence properties of the logistic map for different values of r
are summarized in a bifurcation plot. The bifurcation plot has values of r on the x axis
and the values that the logistic map converges to on the y axis. For small values of r the
map converges to a single value. As r increases we see period doubling and then chaotic
regions. By repeatedly running the logistic map for a fixed number of iterations and plotting
the last 50 values, write a function that will draw a bifurcation plot.
6. ††Mandelbrot set: The Mandelbrot set is a subset of the complex plane. Consider the
map
$z_{n+1} = z_n^2 + c \qquad\qquad (3.2)$
where $z_0 = c$. The Mandelbrot set consists of those points c in the complex plane for
which the limit as n tends to infinity of |zn | is finite. Write an R function to construct a
reasonable visual impression of the Mandelbrot set.
3.11 Reading
• Venables, W. N. et al. (2001) An Introduction to R. [Chapter 5 for arrays and matrices,
chapter 9 for program control and chapter 10 for writing functions.]
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th edition,
Springer. [Section 3.5 for formatting and printing.]
Distributions and simulation
The distribution of probability associated with a random variable Y is given by the distribution
function, $F_Y : \mathbb{R} \to [0, 1]$, defined by
$F_Y(y) = P(Y \le y).$
Distributions are often classified as either discrete (FY is a step function) or continuous (FY is
a continuous function). Note that real observations are never made on a continuous scale.
For a discrete distribution there is a countable set of possible values of the variable, $\{y_1, y_2, \ldots\}$
(the points at which the discontinuities in $F_Y$ occur). The probability mass function is defined
by
$f_Y(y) = P(Y = y)$
and takes non-zero values on $\{y_1, y_2, \ldots\}$. The distribution function can then be written as
$F_Y(y) = \sum_{i : y_i \le y} f_Y(y_i).$
For a continuous distribution the density function, $f_Y$, is a positive real-valued function satisfying
$F_Y(y) = \int_{-\infty}^{y} f_Y(u)\,du.$
In what follows we assume that Y is a continuous random variable (in the discrete case, sum-
mation replaces integration). The expected value of Y is defined as
$E(Y) = \int_{-\infty}^{\infty} y f_Y(y)\,dy$
whenever this integral exists. Expectation is linear in the sense that for any constants a and b
E(a + bY ) = a + bE(Y ).
For any reasonably well-behaved function g we can define the expectation
$E(g(Y)) = \int_{-\infty}^{\infty} g(y) f_Y(y)\,dy.$
In particular, we may be interested in the moments $E(Y^2), E(Y^3), \ldots$ and the central moments
$E[(Y - E(Y))^2], E[(Y - E(Y))^3], \ldots$ and so on. For example, the second central moment is the
variance:
$\mathrm{Var}(Y) = E[(Y - E(Y))^2] = E(Y^2) - E(Y)^2.$
The variance operator has the property
$\mathrm{Var}(a + bY) = b^2\,\mathrm{Var}(Y).$
We are often interested in the relationship between random variables. For two random variables
X and Y we define the joint distribution function as
$F_{X,Y}(x, y) = P(X \le x, Y \le y).$
It is clear from this definition that for any random variables with finite means,
E(X + Y ) = E(X) + E(Y ).
However,
Var(X + Y ) = Var(X) + 2Cov(X, Y ) + Var(Y ),
where
$\mathrm{Cov}(X, Y) = E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y).$
Covariance is used as a measure of the linear association between X and Y. It can be scaled to
give the correlation
$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.$
We say that X and Y are independent when there exist functions $F_X$ and $F_Y$ such that
$F_{X,Y}(x, y) = F_X(x) F_Y(y)$
for all x, y. It then holds that
Var(X + Y ) = Var(X) + Var(Y ).
i. A random sample from a population is a sample in which every member of the population
has equal probability of being included. A random sample of size n from a population can be
modelled by a set of independent and identically distributed random variables, $Y_1, \ldots, Y_n$. We
often use Y (with no subscript) to denote a random variable that has the same distribution
as all of the $Y_i$s (this makes sense since the $Y_i$s are identically distributed). Parameters
of interest include the population mean, $\mu_Y = E(Y)$, and the population variance, $\sigma_Y^2 =
\mathrm{Var}(Y)$.
ii. A sample of size n from a population that is thought to depend linearly on fixed explanatory
factors, $x_1, \ldots, x_n$, can be modelled by the simple regression
$Y_i = \alpha + \beta x_i + \varepsilon_i, \qquad \varepsilon_i \sim iN(0, \sigma_\varepsilon^2), \qquad\qquad (4.1)$
where $iN(0, \sigma_\varepsilon^2)$ denotes normally and independently distributed with mean 0 and variance
$\sigma_\varepsilon^2$. Parameters of interest include the intercept $\alpha$, the slope $\beta$ and the error variance $\sigma_\varepsilon^2$.
The value of a statistic is a function of the data; to represent this general relationship we use
the notation
u = h(y1 , . . . , yn ).
The term statistic refers to the same function h applied to the random variables Y1 , . . . , Yn ,
U = h(Y1 , . . . , Yn ).
Just as the data, y1 , . . . , yn , are thought of as one instance of the model, Y1 , . . . , Yn , the value
u is taken to be one instance of the statistic U . The table below gives some statistics and their
sample values.
In all cases, U is a RANDOM VARIABLE. As such it makes sense to talk about the distribution
of U and its moments, E(U ), Var(U ), etc. Many of the statistics that we are interested in are
estimators of population parameters; referring to the table above, the sample mean, Ȳ , is used
as an estimator of the population mean, µY , the slope estimator, β̂, is used as an estimator of the
slope parameter, β. The observed value of the statistic is used as an estimate of the population
parameter. Note the distinction: an estimator is a model quantity (a random variable) while an
estimate is an observed quantity (a number).
We have an intuitive understanding of what constitutes a good estimator; a good estimator will
provide estimates that are close to the true value. As statisticians, we would like to make this
notion more precise. Suppose that U is an estimator for λ. We define two criteria that can be
used to judge the quality of an estimator. Note that evaluation of these quantities is based on
our model.
• $\mathrm{Bias}(U) = E(U - \lambda) = E(U) - \lambda$. Bias measures how far our estimate will be from the true
value on average. Unbiasedness ($E(U) = \lambda$) is usually taken to be a desirable property for
an estimator.
• $\mathrm{Var}(U) = E[(U - E(U))^2] = E(U^2) - E(U)^2$. Variance measures the spread of the esti-
mator. An estimator with low variance (an efficient estimator) is usually desirable.
Neither bias nor variance alone provides an indication of the quality of an estimator; we do not
want an unbiased estimator with a large variance, nor do we want an efficient estimator with a
large bias. However, these measures can be combined to give a single measure of the quality of
an estimator.
• $\mathrm{MSE}(U) = E[(U - \lambda)^2] = (\mathrm{Bias}(U))^2 + \mathrm{Var}(U)$. Mean square error is a widely used
measure of the quality of an estimator.
The properties of some statistics are (at least partly) analytically tractable. Two well known
examples follow.
Sample mean of a random sample: The mean and variance of the sample mean of a random
sample are easy to evaluate under the usual model. Suppose that Y1 , . . . , Yn are independent
identically distributed random variables with mean $\mu_Y$ and variance $\sigma_Y^2$. Then
$E(\bar{Y}) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = E(Y) = \mu_Y.$
The mean of the sample mean is just the population mean. Thus, the sample mean is an unbiased
estimator of the population mean. The derivation of the variance exploits independence.
$\mathrm{Var}(\bar{Y}) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(Y_i) \quad \text{(by independence)}$
$= \frac{1}{n}\mathrm{Var}(Y) = \frac{1}{n}\sigma_Y^2.$
The variance of the sample mean decreases as the sample size increases, a desirable property for
an unbiased estimator. The asymptotic distribution of the sample mean is described by a well
known theorem.
The central limit theorem: Let $Y_1, \ldots, Y_n$ be independent, identically distributed random vari-
ables with mean $E(Y) = \mu_Y$ and finite variance $\mathrm{Var}(Y) = \sigma_Y^2 < \infty$. Let
$Z_n = \frac{\bar{Y} - \mu_Y}{\sigma_Y/\sqrt{n}}.$
Then the distribution of $Z_n$ converges to a standard normal distribution as $n \to \infty$.
There is a neat proof of this theorem using cumulant generating functions. The central limit
theorem is used to provide an approximate distribution for the sample mean; in a large sample,
the sample mean is approximately normally distributed regardless of the distribution of the
population (or to be more precise, provided the population has finite variance). The central
limit theorem is a remarkable result that is used routinely in testing hypotheses about the
population mean. However, it is an asymptotic result; the normal distribution will only be a
good approximation to the distribution of the sample mean in large samples. In section 4.6 we
use simulation to investigate the small sample properties of the sample mean.
Least squares estimate of slope in regression: For the simple regression model (4.1) the
least squares estimator of the slope is
$\hat{\beta} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(x_i - \bar{x})}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2},$
where
$S_\varepsilon^2 = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2} \quad \text{and} \quad \hat{Y}_i = \hat{\alpha} + \hat{\beta} x_i.$
Here $\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{x}$ is the estimator of $\alpha$.
For example, consider $Y_1, \ldots, Y_n$ independent $N(\mu_Y, \sigma_Y^2)$ random variables. We know that
$\frac{\bar{Y} - \mu_Y}{S_Y/\sqrt{n}} \sim t_{n-1}.$
By definition
$P\!\left(\frac{\bar{Y} - \mu_Y}{S_Y/\sqrt{n}} > t_{n-1,\alpha/2}\right) = \alpha/2 \quad \text{and} \quad P\!\left(\frac{\bar{Y} - \mu_Y}{S_Y/\sqrt{n}} \le -t_{n-1,\alpha/2}\right) = \alpha/2.$
Rearranging yields
$P\!\left(\bar{Y} - t_{n-1,\alpha/2}\,S_Y/\sqrt{n} > \mu_Y\right) = \alpha/2 \quad \text{and} \quad P\!\left(\bar{Y} + t_{n-1,\alpha/2}\,S_Y/\sqrt{n} \le \mu_Y\right) = \alpha/2.$
Thus, the $100(1-\alpha)\%$ confidence interval for $\mu_Y$ is the familiar
$\left[\,\bar{y} - t_{n-1,\alpha/2}\,s_Y/\sqrt{n},\;\; \bar{y} + t_{n-1,\alpha/2}\,s_Y/\sqrt{n}\,\right).$
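A quick sketch in R, assuming the exam1 marks from chapter 3 are available and taking α = 0.05:
> n <- length(exam1)
> mean(exam1) + c(-1, 1) * qt(0.975, n-1) * sd(exam1)/sqrt(n)
> t.test(exam1)$conf.int      # the same interval from the built-in t-test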
Simulation involves using a computer to generate observed samples from a model. In many fields
this model is a complex deterministic system that does not have a tractable analytic solution,
such as the models of the atmosphere used in weather forecasting. As statisticians, we use
simulation to provide insight into properties of statistics when these properties are not available
data we know the data generating process. The starting point for a simulation experiment is to
simulate observations from the stochastic model of interest; for example,
The simulated observations can be used to construct a simulated instance of a statistic; for
example
i. the sample mean, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$,
ii. the estimate of the slope parameter, $\hat{\beta} = \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) \big/ \left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$.
The process of simulating a sample can be repeated many times to build up a collection of
samples known as simulated replications. Evaluating a statistic for each replication allows us to
build up a picture of the distribution of the statistic. Consider repeatedly generating instances
from the model and evaluating the statistic U for each replication. This process yields
$y_1^{(1)}, \ldots, y_n^{(1)} \;\to\; u^{(1)}$
$y_1^{(2)}, \ldots, y_n^{(2)} \;\to\; u^{(2)}$
$\vdots$
$y_1^{(r)}, \ldots, y_n^{(r)} \;\to\; u^{(r)}$
where $y_i^{(j)}$ is the jth replication of the ith sample member and $u^{(j)}$ is the jth replication of
the statistic of interest. Here we use r to denote the total number of replications. The values
$u^{(1)}, \ldots, u^{(r)}$ can be thought of as r instances of the random variable U.
The design of a simulation experiment is crucial to its success; it is easy to waste time on
inefficient or totally meaningless simulations.
(a) sample size: usually increasing the sample size will improve the properties of a statis-
tic,
(b) parameters of the population distribution: a statistic may perform better for certain
parameter values.
(c) the population distribution: a statistic’s properties may change dramatically (or re-
main surprisingly stable) when the population distribution is changed.
3. The number of simulated replications: A large value of r should improve the accuracy of
our simulation. However, large values of r will increase the computational cost and there
may be better ways of improving our simulation.
(b) We may want to test whether U has a particular distribution. The usual χ2 good-
ness of fit tests can be applied to our simulated statistics, u(1) , . . . , u(r) . For some
distributions, specific tests may be appropriate; for example, the Shapiro-Wilk test
for normality.
(c) Graphical comparisons are also useful. In particular, parallel box plots provide a
good means of comparing two or more sets of simulated statistics.
R supports calculations for a large number of distributions. All of the commonly known distri-
butions are handled. The following table is a selection from page 51 of Venables et al.’s An
Introduction to R.
The additional arguments are mostly self-explanatory. The ncp stands for non-centrality pa-
rameter and allows us to deal with non-central χ2 and non-central t distributions.
R can be used like a set of statistical tables, that is, for a random variable Y , work out the
values of p and q associated with expressions of the form
P (Y ≤ q) ≥ p.
The function
pdistributionname (q,...)
will return the probability that the distribution named takes a value less than or equal to q.
The following simple example illustrates.
Suppose X ∼ Bin(85, 0.6) and Y ∼ N (12, 9) and that X and Y are independent. Calculate
i. P (X ≤ 60)
ii. P (X ≥ 60)
iv. P (Y ≤ 15)
# i.
> pbinom(60,85,0.6)
[1] 0.9837345
# ii.
> 1 - pbinom(59,85,0.6)
[1] 0.02827591
# An alternative approach
> pbinom(59,85,0.6,lower.tail=FALSE)
[1] 0.02827591
# v. (using independence)
> pbinom(59,85,0.6) * pnorm(15,12,3,lower.tail=FALSE)
[1] 0.1541691
The function
qdistributionname (p,...)
will return the value which the named distribution has probability p of being below. For a
discrete distribution it returns the smallest value whose cumulative probability is greater than or equal to p.
Suppose X ∼ Bin(85, 0.6) and Y ∼ N (12, 9) and that X and Y are independent. Solve
i. P (X ≤ q) ≥ 0.8
# iii.
> qnorm(0.25,12,3)
[1] 9.97653
R allows us to calculate density function values (probability mass in the discrete case). The
function
ddistributionname (x,...)
will generate the value of the density function (probability mass function) for the named distribu-
tion at x. These values can be used to produce density function plots. We use this function to investigate
the effect of the parameter values on the t-distribution and gamma distribution below.
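For example, for the t-distribution (the grid of x values and the degrees of freedom are arbitrary choices):
> x <- seq(-4, 4, length=200)
> plot(x, dnorm(x), type="l")         # standard normal for reference
> lines(x, dt(x, df=1), col=2)        # heavy-tailed t with 1 degree of freedom
> lines(x, dt(x, df=5), col=3)        # closer to the normal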
In determining which distribution is appropriate for a set of data, a plot comparing the empirical
density (or empirical cumulative distribution) with the theoretical density is useful. This applies
to data from simulation experiments as well as observed data.
> library(stepfun)
> plot(ecdf(allmarks))
> lines(x,pnorm(x, mean=mean(allmarks), sd = sd(allmarks)),col=2)
For discrete data we may want to use a slightly different plotting method. The probability mass
function of a binomial is given below using vertical lines to indicate that we only have positive
probability at discrete points.
Computer programs cannot produce genuine random numbers; for given input we can, in prin-
ciple, predict exactly what the output of a program will be. However, we can use a computer
to generate a sequence of pseudo-random numbers (as defined in section 3.1.3). The distinc-
tion is somewhat artificial since no finite sequence of numbers is truly random; randomness is a
property that can only really be attributed to mathematical models (random variables). From
this point on we will drop the pseudo with the understanding that, whenever we use the word
random in the context of numbers or samples, what we mean is something that appears random;
see Knuth, Seminumerical Algorithms for an interesting discussion of this issue.
Using R we can generate random instances from any commonly used distribution. The function
rdistributionname (n,...)
will return a random sample of size n from the named distribution. At the heart of this function,
R uses some of the most recent innovations in random number generation. We illustrate by
sampling from Poisson and normal distributions.
The random number generator works as an iterative process. Thus, consecutive identical com-
mands will not give the same output.
> rnorm(5)
[1] 0.4874291 0.7383247 0.5757814 -0.3053884 1.5117812
> rnorm(5)
[1] 0.38984324 -0.62124058 -2.21469989 1.12493092 -0.04493361
The command set.seed(...) allows us to determine the starting point of the iterative process
and thus ensure identical output from the random number generator. This is useful when
developing the code for a simulation experiment.
> set.seed(1)
> rnorm(5)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
> set.seed(1)
> rnorm(5)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
The function sample(...) can be used to generate random permutations and random samples
from a data vector. The arguments to the function are the vector that we would like to sample
from and the size of the sample (if the size is excluded, a permutation of the vector is generated).
Sampling with replacement is also possible using this command.
The numbers generated by the logistic map appear uncorrelated but are not uniformly distributed.
A common means of generating a random sample from a uniform distribution is to use a recursion
of the form
$x_{t+1} = a x_t \pmod{m}.$
These are known as multiplicative congruential generators. We can implement this generator
using a quick amendment to our pdordm function.
> pdordm <- function(seed, n, aparam = 171, mparam = 30269)
+ { ans <- numeric(n)                  # multiplier and modulus defaults are illustrative choices
+   for (i in 1:n)
+   { seed <- (aparam*seed) %% mparam
+     ans[i] <- seed
+   }
+   ans/mparam                         # scale on to the interval (0,1)
+ }
> rnums <- pdordm(12345,10000)
> acf(rnums)
> hist(rnums,nclass=30)
This looks much closer to uniformity than the output of the logistic map.
To demonstrate how a simulation experiment works, we are going to look at the small sample
properties of the sample mean for a Poisson population, a statistic whose asymptotic properties
are well known. Consider a population that has a Poisson distribution with mean λ, so Y ∼
Pois(λ). We take a sample of size n. We would like to know whether the normal distribution
will provide a reasonable approximation to the distribution of the sample mean. Following the
principles laid down in section 4.1.7, we first consider the questions that we would like our
simulation to answer:
1. How large does n need to be for the normal distribution to be a reasonable approximation?
Writing down these questions makes clear the factors that we will need to make comparisons
across:
• n, the sample size: the central limit theorem tells us that for large values of n the normal
will be a reasonable approximation,
• λ, the parameter of the population distribution: we might expect the shape of the under-
lying Poisson distribution to have an effect on the distribution of the sample mean.
> poisSampMean1 <- function(n, lambda, r)
+ { meanvec <- numeric(r)
+   for (i in 1:r)
+   { meanvec[i] <- mean(rpois(n, lambda))
+   }
+   meanvec
+ }
> set.seed(1)
> poisSampMean1(10, 3, 6)
[1] 3.3 3.4 2.6 3.0 3.3 2.6
It is easy to alter this program to avoid the loop. We first put n × r simulated values into a
vector. We then use the matrix(...) function to divide this vector into r columns (each column
corresponds to a simulation replication). Using colMeans(...) to calculate the column means
will then yield r simulated instances of the sample mean.
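A sketch of this loop-free version (the name poisSampMean2 matches the calls below):
> poisSampMean2 <- function(n, lambda, r)
+ { simvals <- rpois(n*r, lambda)           # n*r simulated values
+   simmat <- matrix(simvals, ncol=r)       # one column per replication
+   colMeans(simmat)                        # r simulated sample means
+ }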
To get a visual impression of the simulated sample means we write a function to draw a histogram
and plot a normal distribution with the same mean and standard deviation.
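A sketch of such a function (the name histNorm and the argument nbins are taken from the text; the default of 20 bins is arbitrary):
> histNorm <- function(simvals, nbins = 20)
+ { hist(simvals, nclass=nbins, probability=TRUE)
+   x <- seq(min(simvals), max(simvals), length=200)
+   lines(x, dnorm(x, mean=mean(simvals), sd=sd(simvals)), col=2)
+ }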
Try experimenting with this function with various values of n and λ with r = 1000. If this runs
very slowly, try reducing r. If you get a histogram with strange gaps in, try changing the value
of the nbins argument.
> histNorm(poisSampMean2(8,1,1000))
> histNorm(poisSampMean2(100,10,1000))
We can use a test for normality to provide some numerical indication of whether or not the
normal distribution provides a good fit. The Shapiro-Wilk test is a standard test for normality.
First we need a function to extract the p-value from the output of shapiro.test(...).
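A one-line sketch (the function name swpval is illustrative; the p-value is stored in the p.value component of the test output):
> swpval <- function(x) shapiro.test(x)$p.value
> swpval(poisSampMean2(8, 1, 1000))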
∗ If we feed a vector parameter to the rpois(...) command, the recycling rule will be used
in generating random values with each value of the vector used in turn and then repeated. We
exploit this below to write a function that generates a long vector of simulated instances then
splits them up using factors that we define and the tapply(...) function. Note that this is
probably taking loop avoidance a little too far; the solution is easy to implement using a
loop (try it) and my loop-free solution causes memory problems for large values of r (can
you improve on it?).
In order to generate test statistic values for n = 5, 10, 50 and 100, with λ = 1, 3 and 10 (total
of 12 combinations) we apply the function defined above.
4.8 Exercises
for $t = 1, \ldots, n$.
2. Write a function that takes y and x as arguments and returns β̂, the estimate of the slope
from regression y on x.
3. Using the same parameter values as in question 1, write a function to simulate 1000 instances of the
model (r = 1000). For each simulated instance, use the function from question 2 to generate the estimate of the slope
parameter. Return a vector of instances of β̂.
5. Repeat questions 3 and 4 with α = 10. How does this alter your analysis of β̂?
6. † Now consider the case where εt ∼ Pois(λ) − λ, for t = 1, . . . , n, that is, the errors
in the regression have zero mean but are not normally distributed. Leaving the other
parameters unchanged, write a function to simulate simple regression with Poisson errors,
generate instances of the statistic t = (β̂ − β)/se(β̂) and perform a goodness of fit test
(check the Kolmogorov–Smirnov test ks.test) comparing these values with an appropriate
t-distribution. Compare results across different values of λ.
4.9 Reading
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th edition.
Springer. [Section 5.1 for distributions and section 5.2 for random numbers]
Linear Models I
In regression problems we often want a model that represents our response variable in terms of
several explanatory variables. A multiple regression model is
$Y_i = \beta_0 + \beta_1 x_{1,i} + \ldots + \beta_p x_{p,i} + \varepsilon_i,$
where $\varepsilon_i \sim iN(0, \sigma_\varepsilon^2)$ is an error term and $x_1, \ldots, x_p$ are p explanatory variables with
$x_j = (x_{j,1}, \ldots, x_{j,n})'$ for $j = 1, \ldots, p$. Normality of errors is required for the statements about the
distributions of test statistics made below to hold. A convenient representation of this model is the
matrix form
$Y = X\beta + \varepsilon,$
where $Y = (Y_1, \ldots, Y_n)'$, $X = (1, x_1, \ldots, x_p)$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$ and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$. The
least squares estimator of β is given by (invertibility of $X'X$ is guaranteed when $\mathrm{rank}(X) = p+1$)
$\hat{\beta} = (X'X)^{-1} X'Y.$
Hypothesis tests about individual coefficients are based on t-statistics of the form $\hat{\beta}_j/\mathrm{se}(\hat{\beta}_j)$,
where $\mathrm{se}(\hat{\beta}_j)$ can be calculated from $s_\varepsilon^2$ and the jth diagonal element of $(X'X)^{-1}$.
In determining the explanatory power of our regression model for these data the following
quantities are useful:
$SST = (Y - \bar{Y})'(Y - \bar{Y}) = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = S(\{1\})$
$SSE = (Y - \hat{Y})'(Y - \hat{Y}) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = S(\{1, x_1, \ldots, x_p\})$
$SSREG = SST - SSE = S(\{1\}) - S(\{1, x_1, \ldots, x_p\})$
The total sum of squares (SST) measures the unexplained variability when we fit a model with
just a constant term (the sample mean Ȳ ). We use S({1}) to indicate that the sum of squares
is calculated for a model with just a constant term. The error sum of squares (SSE) measures
the unexplained variability when the full regression model is fitted. We use S({1, x1 , . . . , xp })
to indicate that the sum of squares is calculated for a model with the p explanatory variables
included. The regression sum of squares (SSREG) then measures the variability explained by
our regression model. The significance of the whole regression is measured by comparing SSREG
with SSE. In fact,
$\frac{SSREG/p}{SSE/(n-p-1)} \sim F_{p,\,n-p-1}$
under the null hypothesis that $\beta_1 = \ldots = \beta_p = 0$. The $R^2$ statistic is generated by comparing
SSREG with SST:
$R^2 = \frac{SSREG}{SST}.$
This is a popular measure but should be treated with caution – we do not have a sampling
distribution for this statistic.
The sequential analysis of variance table is calculated by measuring the reduction in the er-
ror sum of squares as variables are added one by one. If we include variables in the order
x1 , x2 , . . . , xp , the analysis of variance table gives us
For the jth entry, the sum of squares measures the reduction in unexplained variability in going from
a model with explanatory variables $x_1, \ldots, x_{j-1}$ to one with $x_1, \ldots, x_j$. The associated test statistic is
$\frac{S(\{1, x_1, \ldots, x_{j-1}\}) - S(\{1, x_1, \ldots, x_j\})}{SSE/(n-p-1)} \sim F_{1,\,n-p-1}$
under the null hypothesis that $\beta_j = 0$. If we square the t-statistic for testing $\beta_j = 0$, we get the
F -statistic for testing βj = 0 from a sequential analysis in which xj is the last variable to enter
the model.
As with all statistical models, it is important that we check that our fitted regression model
provides an adequate representation of the data. We also need to check the distributional
assumptions, on which inference is based, are reasonable. A key quantity in model checking is
the vector of residual values. Residuals are useful for detecting outlying points or other structure
we may not have accounted for in our model. They are calculated from the difference between
what is observed and what is expected based on our model:
$e_i = Y_i - \hat{Y}_i.$
The residuals, $e_1, \ldots, e_n$, do not have constant variance. In fact $\mathrm{Var}(e_i) = (1 - h_i)\sigma_\varepsilon^2$, where $h_i$
is the ith diagonal element of the hat matrix $H = X(X'X)^{-1}X'$. The standardized residuals are the residuals
scaled by their estimated standard deviations,
$r_i = \frac{e_i}{s_\varepsilon\sqrt{1 - h_i}}.$
Sometimes individual observations can have a strong influence on the estimated variance. This
may mask outliers. One approach to tackling this problem is to evaluate a deletion residual
$r_i^* = \frac{e_i}{s_{(i)}\sqrt{1 - h_i}},$
where $s_{(i)}^2$ is the estimate of the error variance from fitting a model excluding point i. These are
often referred to as studentized residuals.
The leverage hi measures the distance of the explanatory variable values associated with point
i, that is x1,i , . . . , xp,i , from the explanatory variable values associated with other points. A
point with high leverage will exert a strong influence over the parameter estimates. Cook’s
distance measures the influence of point i by comparing β̂ with β̂(i) , the parameter estimate that
results from excluding i from the fitting process. There are many other measures of influence –
equivalent measures are often given different names; see Atkinson (1985) for a clear treatment.
A model with several variables is constructed using the operator + in the model formula. In
this context + denotes inclusion, not addition. The operator - is used for exclusion
in model formulae. When models are updated, we use a . to denote the contents of the original
model. The use of these operators is best understood in the context of an example.
To illustrate consider the Cars93 data from the MASS package. These are the values of variables
recorded on 93 cars in the USA in 1993. Consider the MPG.highway (fuel consumption in highway
driving) variable. It is reasonable to suppose that fuel consumption may, in part, be determined
by the size of the vehicle and by characteristics of the engine. We start by using pairs(...) to
generate a scatter plot matrix for four of the variables in the data set. A linear model is fitted
using the lm(...) function (notice we use the named argument data to specify the data set as
an alternative to attaching the data).
> library(MASS)
> ?Cars93
> names(Cars93)
> pairs(Cars93[c("MPG.highway","Horsepower","RPM","Weight")], col=2)
> lmMPG1 <- lm(MPG.highway ~ Horsepower + RPM + Weight, data=Cars93)
> summary(lmMPG1)
The anova(...) function gives us the (sequential) analysis of variance table using the order in
which the variables are specified.
> anova(lmMPG1)
In the analysis of variance table for this example, adding each variable in the order specified
gives a significant reduction in the error sum of squares. However, the explanatory variables are
highly correlated; changing the order in which they are included alters the analysis of variance
table.
Notice that we could have achieved the same outcome by updating our lmMPG1 model object.
Below we use (.-Weight) to denote the existing right-hand-side of the model formula with the
variable Weight removed.
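A sketch of the update (the object name lmMPG2 is illustrative):
> lmMPG2 <- update(lmMPG1, . ~ . - Weight)
> summary(lmMPG2)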
When explanatory variables are correlated, inclusion or exclusion of one variable will affect the
significance (as measured by the t-statistic) of the other variables. For example, if we exclude
the Weight variable, both Horsepower and RPM become highly significant. Variables cannot be
chosen simply on the basis of their significance in a larger model. This is the subject of the next
section.
The MASS package provides a number of functions that are useful in the process of variable
selection. Consider the situation where we are interested in constructing the best (according to
some criteria that we will specify later) linear model of MPG.highway in terms of ten of the other
variables in model.
In a forward search, from some starting model, we include variables one by one. We have
established that MPG.highway is reasonably well explained by Weight. A model with just the
Weight variable is a reasonable starting point for our forward search.
We may want to consider what the effect of adding another term to this model is likely to be.
We can do this using the addterm(...) function.
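A sketch of how this might look (the starting model object and the list of candidate terms shown here are illustrative):
> lmMPGfwd <- lm(MPG.highway ~ Weight, data=Cars93)
> addterm(lmMPGfwd, ~ . + Length + Wheelbase + Horsepower + RPM, test="F")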
This function adds a single term from those listed and displays the corresponding F statistic.
The second argument is a model formula; the term ~ . denotes our existing model. If we select
variables according to their significance (most significant first) the variable Length would be the
next to be included.
This process of variable inclusion can be repeated until there are no further significant variables
to include. Notice something strange here; the coefficient for Length is positive. Does this seem
counter intuitive? What happens when we remove Weight from the model?
An alternative approach is to start with a large model and remove terms one by one. The
function dropterm(...) allows us to see the impact of removing variables from a model. We
start by including all ten candidate explanatory variables.
Rather surprisingly, it is clear from this output that, in the presence of the other variables,
Horsepower is not significant. Horsepower can be removed using the update(...) function
and the process repeated.
Which variable does the output of this command suggest should be dropped next?
The process of model selection can be automated using the step(...) function. This uses the
Akaike information criterion (AIC) to select models. AIC is a measure of goodness of fit that
penalises models with too many parameters; low AIC is desirable. The function can be used to
perform a forward search.
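A sketch of the forward search:
> lmMPG7 <- lm(MPG.highway ~ 1, data=Cars93)     # constant-only starting model
> step(lmMPG7, scope=list(upper=lmMPG5), direction="forward")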
Here the starting model (lmMPG7) just contains a constant term. The scope=list(upper=lmMPG5)
argument tells R the largest model that we are willing to consider. The process stops when the
model with the smallest AIC is that resulting from adding no further variables to the model.
Using a similar strategy we can automate backwards selection.
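For example, starting from the full model:
> step(lmMPG5, direction="backward")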
True step-wise regression allows us to go in both directions; variables may be both included and
removed (if these actions result in a reduction of AIC).
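A sketch, assuming the selected model is stored as lmMPG8 (the object used for the diagnostic plots below):
> lmMPG8 <- step(lmMPG7, scope=list(upper=lmMPG5), direction="both")
> summary(lmMPG8)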
At each stage the output shows the AIC resulting from removing variables that are in the model,
including variables that are outside the model or doing nothing <none>. The result is a model
with four of the original ten variables included.
Model selection procedures are based on arbitrary criteria. There is no guarantee that the
resulting model will be a good model (in the sense of giving good predictions) or that it will be
sensible (in terms of what we know about the economics/physics/chemistry/. . . of the process
under consideration). The procedures discussed in this section may also produce undesirable
results in the presence of outliers or influential points.
R will produce a set of four diagnostic plots when a linear model object is passed as an argument
to the plot(...) function. These are two plots of residuals against fitted values, a normal qq-
plot and an index plot of Cook’s distance.
> par(mfrow=c(2,2))
> plot(lmMPG8)
> par(mfrow=c(1,1))
We may want to generate other diagnostic plots. For example, a plot of studentised residuals
against fitted values with outlying values identified by name.
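One possible set of commands (labelling points with |studentised residual| greater than 2 by the Make variable is an illustrative choice):
> sres <- rstudent(lmMPG8)
> plot(fitted(lmMPG8), sres, xlab="fitted values", ylab="studentised residuals")
> big <- abs(sres) > 2
> text(fitted(lmMPG8)[big], sres[big], labels=Cars93$Make[big], pos=4, cex=0.7)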
The leverages are generated using the lm.influence(...) function. Below we generate an
index plot of leverage and label the high leverage (influential) points using vehicle type.
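A sketch (the cut-off of twice the mean leverage is an arbitrary rule of thumb):
> lev <- lm.influence(lmMPG8)$hat
> plot(lev, type="h", ylab="leverage")                  # index plot of leverage
> high <- lev > 2*mean(lev)
> text(which(high), lev[high], labels=Cars93$Type[high], pos=3, cex=0.7)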
We can fit a model that excludes certain observations by using the subset argument. For
example, in the leverage plot generated by the commands above, it is clear that many of the
influential points are vans and sporty cars. It is reasonable to suppose that a model built
including these types of vehicles may not provide the best representation of fuel consumption
for normal cars. The commands below update the model excluding vans and sporty cars.
> lmMPG8 <- update(lmMPG8, subset = (Type != "Van" & Type != "Sporty"))
> summary(lmMPG8)
The argument subset can also be used with the lm(...) function. Note that the subset is just
a logical vector; in this example the values of the vector will be true when the corresponding
vehicle is neither a van nor a sporty car. The result of excluding these types is a substantial
alteration in parameter values. In fact, it would be appropriate to repeat the process of model
selection if we are going to restrict our attention to a subset of observations.
Residuals scaled for constant variance are given by the function rstandard(...) applied to
a linear model objects, while cooks.distance(...) will give Cook’s distance (no surprises
there). Other measures of influence are available; dffits(...) gives influence on fitted values
and dfbetas(...) measures impact on individual parameter estimates.
† Colour and shading can be used to good effect in diagnostic plots. To illustrate consider the
regression model including weight alone. We draw a plot in which the influence (according to
Cook’s distance) determines the darkness of shading of points.
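A sketch (the model name lmMPGw is illustrative):
> lmMPGw <- lm(MPG.highway ~ Weight, data=Cars93)
> cd <- cooks.distance(lmMPGw)
> plot(Cars93$Weight, Cars93$MPG.highway, pch=21, cex=1.5,
+      bg=gray(1 - cd/max(cd)))       # the most influential points are shaded darkest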
Here gray(...) takes a vector of values between 0 and 1 and returns the corresponding shades of
gray (where 0 is black and 1 is white). The named argument bg gives the colours to shade points and cex determines the size
of the plotting character. Other colour commands include rgb(...) (for red, green, blue) and
hsv (for hue, saturation, value).
5.5 Transformations
So far we have considered linear models for the observations. In many instances, a simple trans-
formation of the response variable improves the fit of a linear model. The use of transformations
requires caution; over-zealous application can result in a model that fits very well but tells us
nothing about the real world. The Box-Cox family of transformations is often used in practice:
y^(λ) = (y^λ − 1)/λ   if λ ≠ 0,
y^(λ) = log(y)        if λ = 0.
We may need to add a constant to the response variable values to ensure that they are positive
before applying this transformation.
The MASS package contains the function boxcox(...) that draws the profile likelihood values
against λ. This allows us to choose the transformation most appropriate for our data. To
illustrate we consider the model with just one explanatory variable, Weight.
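For example:
> boxcox(lm(MPG.highway ~ Weight, data=Cars93))     # profile likelihood against lambda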
The likelihood plot suggests that λ = −1 (reciprocal transformation) may be appropriate. Ap-
plying this transformation then refitting the model results in marked improvement in the prop-
erties of the residuals.
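A sketch of the refit under the reciprocal transformation (object names are illustrative):
> lmMPGrecip <- lm(I(1/MPG.highway) ~ Weight, data=Cars93)
> summary(lmMPGrecip)
> par(mfrow=c(2,2))
> plot(lmMPGrecip)
> par(mfrow=c(1,1))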
We have considered models that are linear in both the parameters and the explanatory variables.
Higher order terms in the explanatory variables are readily included in a regression model. A
regression model with a quadratic term is
Y_i = β_0 + β_1 x_i + β_2 x_i^2 + ε_i .
In polynomial regression the lower order terms are referred to as being marginal. For example, x
is marginal to x^2. If the marginal term is absent, a constraint is imposed on the fitted function; for instance, fitting a quadratic without the linear term forces the turning point of the fitted curve to be at x = 0.
A quadratic regression model is fitted in R by including a quadratic term in the model formula.
The I(...) is necessary to prevent ^ from being interpreted as part of the model formula. This
function forces R to interpret an object in its simplest form – it is also useful for preventing
character vectors from being interpreted as factors. The commands below draw the fitted curve.
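For example, using Horsepower as the explanatory variable (an illustrative choice):
> lmquad <- lm(MPG.highway ~ Horsepower + I(Horsepower^2), data=Cars93)
> plot(Cars93$Horsepower, Cars93$MPG.highway)
> hgrid <- seq(min(Cars93$Horsepower), max(Cars93$Horsepower), length.out=100)
> lines(hgrid, predict(lmquad, newdata=data.frame(Horsepower=hgrid)), col=2)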
We can fit higher order terms in multiple regression. These often take the form of interactions.
For example, in the model
Y_i = β_0 + β_1 x_{1i} + β_2 x_{2i} + β_3 x_{1i} x_{2i} + ε_i ,
the β3 parameter measures the strength of the interaction between x1 and x2 . This model can
be fitted using
or equivalently
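For example, with Horsepower and RPM playing the roles of x1 and x2 (an illustrative choice), the two equivalent fits are:
> lmInt1 <- lm(MPG.highway ~ Horsepower + RPM + Horsepower:RPM, data=Cars93)
> lmInt2 <- lm(MPG.highway ~ Horsepower * RPM, data=Cars93)   # marginal terms added automatically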
Note that in the second of these commands, R fits the marginal terms automatically. More on
interactions next week.
†
5.7 Harmonic regression
The explanatory variables in a regression model may be vectors that we construct. We have
already seen an example of this in the simple time trend model
Yt = α + βt + εt
where t = 1, . . . , n is the explanatory variable. These constructed variables are often referred
to as dummy variables. In harmonic regression the dummy variables are constructed from
trigonometric functions. This is useful for modelling seasonal behaviour in time series. Suppose
that we have a time series with seasonal period s (s = 4 for quarterly data, s = 12 for monthly
data and s = 365.2422 for daily data). A simple harmonic regression model consists of a cyclical
function with the fundamental frequency, ω = 2π/s, plus noise:
Yt = α1 sin(ωt) + β1 cos(ωt) + εt .
In order to represent more complex seasonal patterns we may include harmonics. The model
below consists of a linear trend and a trigonometric representations of seasonality with m fre-
quencies (fundamental and m − 1 harmonics),
Y_t = α_0 + β_0 t + Σ_{j=1}^{m} ( α_j sin(ωjt) + β_j cos(ωjt) ) + ε_t
where m is a number that is smaller than the integer part of s/2 (why?).
5.8 Exercise
It is 1993. A well-known US manufacturer has developed a new make of sporty car – Vehicle X.
The sales director wants to get some idea of how Vehicle X should be priced. Using the Cars93
data set, our aim is to find the regression which best explains price in terms of the other numeric
and integer variables (obviously not using the other price variables). To find out the class of the
variables try
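For example (one possibility, not necessarily the original command):
> sapply(Cars93, class)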
1. Using the variable selection method of your choice propose an initial model for Price.
2. Construct diagnostic plots and comment on the quality of the model that you have pro-
posed.
3. Attempt model improvement; you might like to consider excluding certain points, trans-
formations or the inclusion of higher order terms. Be careful not to make your model too
complicated.
4. The sales director does not trust any model with more than three explanatory variables
in it. Reformulate your model (if necessary) in light of this information.
5. Vehicle X has CityMPG 25, HighwayMPG 35, 2.4 litre engine, 130 horsepower, 5500 rpm,
1500 revs per mile, 11.3 litre fuel tank, passenger capacity 2, length 175 inches, wheelbase
98 inches, width 64 inches, turn circle 40 feet, no rear seats or luggage room, and weight
2700 pounds. Using the model you feel most appropriate, suggest a price range for Vehicle
X. Do you have any reservations about this prediction?
6. Write a function that takes a linear model object as input and draws a selection of diag-
nostic plots (different from those generated by the generic plot(...) command).
7. † Write a function that takes a data frame and a numeric variable from this data frame
(the response) as arguments and does step wise selection using all of the other numeric
variables as candidate explanatory variables (the aim of this function is to avoid having
to type out a big list of variables when we want to do stepwise selection).
8. † Simulate a harmonic regression model for monthly data (s = 12) with two frequencies
(m = 2). For each simulated replication fit a model that has three frequencies (m = 3).
What are the properties of α3 and β3 ?
5.9 Reading
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th edition.
Springer. [Section 6.3 for regression diagnostics and p172 for model selection]
• Dalgaard, P. (2002) Introductory Statistics with R, Springer. [Chapter 9 and 10 for some
gentle stuff (all R no theory)]
Linear Models II
Consider a situation in which we take measurements of some attribute Y on two distinct groups.
We want to know whether the mean of group 1, µ1 , is different from that of group 2, µ2 . The
null hypothesis is equality, that is, H0 : µ1 = µ2 or H0 : µ1 − µ2 = 0. Suppose that we sample
n1 individuals from group 1 and n2 individuals from group 2. A model for these data is, for
i = 1, 2 and j = 1, . . . , n_i,
Y_{i,j} = μ_i + ε_{i,j} ,   (6.1)
where the ε_{i,j} are independent, zero-mean errors with common variance σ^2.
Our observations can be treated as coming from a single population. All of our response variable
values can be stacked in a single vector (Y_1 , . . . , Y_n )′ where n = n1 + n2 and Yi = Y1,i for
i = 1, . . . , n1 and Yi = Y2,i−n1 for i = n1 + 1, . . . , n. We also set up an additional variable d; this
is an indicator for being in group 2, so di = 0 for i = 1, . . . , n1 and di = 1 for i = n1 + 1, . . . , n.
Model (6.1) is then equivalent to the linear model
Y_i = β_0 + β_1 d_i + ε_i ,
where β_0 = μ_1 and β_1 = μ_2 − μ_1 ; testing H_0 : μ_1 = μ_2 is equivalent to testing β_1 = 0.
The linear model framework described above is readily generalised to other analysis of variance
problems. For example, suppose that we are interested in determining the effect on our response
variable, Y , of a factor with k levels. We could use a model of the form
Y_i = β_0 + β_2 d_{2,i} + . . . + β_k d_{k,i} + ε_i ,
where d_r is an indicator for level r of the factor; thus, d_{r,i} = 1 if the level of the factor is r
for observation i, and dr,i = 0 otherwise. Notice that there is no d1 variable. For a factor with k
levels we require a constant and k − 1 dummy variables to completely characterise the problem.
6.2 Factors
Factors are variables that have a number of different levels rather than numeric values (these
levels may have numbers as labels but these are not interpreted as numeric values). For example,
in the Cars93 data set the Type variable is a factor; vehicles in the study are of type "Compact",
"Large", "Midsize", "Small", "Sporty" or "Van".
> library(MASS)
> class(Cars93$Type)
> levels(Cars93$Type)
Note that Cylinders is also a factor. Some of the levels of cylinders are labeled by number but
these are not interpreted as numeric values in model fitting.
> class(Cars93$Cylinders)
> levels(Cars93$Cylinders)
Factors are used to identify groups within a data set. For example, in order to work out the mean
of a variable for each type of vehicle, we would use the function tapply(...). This function
takes three arguments: the variable of interest, the factor (or list of factors) and the function
that we would like to apply. A few examples follow.
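For example (the choice of variables here is illustrative):
> tapply(Cars93$MPG.city, Cars93$Type, mean)                      # mean city MPG by vehicle type
> tapply(Cars93$Price, list(Cars93$Type, Cars93$Origin), median)  # median price by type and origin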
In order to create a factor variable, we use the function factor(...). This returns an object
with class "factor". We can specify an order for this factor using the levels argument and
ordered=TRUE. To illustrate, consider question 7 from the exercise in week 2. We have daily
readings of log returns and we would like to group these by month and year (that is, Jan 2000,
Feb 2000, . . . , Sep 2004) and work out the standard deviation.
> load("IntNas.Rdata")
> attach(intnas2)
> Monthf <- factor(Month, levels=c("Jan","Feb","Mar","Apr","May","Jun",
+ "Jul","Aug","Sep","Oct","Nov","Dec"), ordered=TRUE)
> IntelLRsds <- tapply(IntelLR, list(Monthf,Year), sd, na.rm=TRUE)
> plot(as.vector(IntelLRsds),type="l")
The results from fitting linear models in which the explanatory variables are constructed from
factors are often referred to as analysis of variance (ANOVA). Venables and Ripley, section
6.2, gives the details of the construction of explanatory variables from factors; this material is
technical and is not required for the course. In most instances the R default for constructing
the model matrix X from factors is sensible. We will build up the complexity of our models
using examples.
Consider the following question: does the provision of airbags affect the maximum price that
people are willing to pay for a car? We can test this using a linear model with a single factor
(one-way analysis of variance). The null hypothesis is that there is no effect. We start by
plotting Max.Price against AirBags; notice the type of plot that is generated. Try to interpret
the output of the following commands.
> plot(AirBags,Max.Price)
> lmMaxP1 <- lm(Max.Price ~ AirBags, data=Cars93)
> summary(lmMaxP1)
> plot(lmMaxP1)
Notice that R ensures that the columns of the model matrix are not linearly dependent by
excluding one level from the linear model. The validity of analysis of variance results is dependent
on constant variance within groups. We can see from the diagnostic plots that this is not entirely
unreasonable for these data. We could check using the function var.test(...).
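For example, comparing two of the airbag groups (var.test(...) compares the variances of two samples, so with three levels we would check the pairs; a sketch):
> var.test(Cars93$Max.Price[Cars93$AirBags == "Driver only"],
+          Cars93$Max.Price[Cars93$AirBags == "None"])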
A linear model with two factors is referred to as two-way analysis of variance. We can use this
technique to test the effect of two factors simultaneously. For example, we may ask whether, in
addition to provision of airbags, the availability of manual transmission explains differences in
maximum price.
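A sketch of the two-way analysis of variance (the object name is illustrative):
> lmMaxP2 <- lm(Max.Price ~ AirBags + Man.trans.avail, data=Cars93)
> summary(lmMaxP2)
> anova(lmMaxP2)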
As with multiple regression models, the sequential analysis of variance table is affected by the
order in which variables are included.
We can readily include interaction terms in our model. A term of the form a:b is used to denote
an interaction. We can include interactions and all lower order terms by the notation a*b. The
following example illustrates.
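A sketch of the two equivalent ways of specifying the model with an interaction:
> lmMaxP3 <- lm(Max.Price ~ AirBags + Man.trans.avail + AirBags:Man.trans.avail,
+               data=Cars93)
> lmMaxP4 <- lm(Max.Price ~ AirBags * Man.trans.avail, data=Cars93)
> anova(lmMaxP3)
> anova(lmMaxP4)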
Note that the output of these commands is identical. What do you conclude about the interaction
between provision of airbags and manual transmission in determining maximum price?
Hines and Montgomery (1990, p.416) give the following results for an experiment in integrated
circuit manufacture in which arsenic is deposited on silicon wafers. The factors are deposition
time (A) and arsenic flow rate (B). Although both variables are quantitative, measurements are
only taken at a high (labeled 1) and low level (labeled 0) of each. The purpose is to find out
whether the factors have an effect and, if so, to estimate the sign and magnitude of the effect.
The response is the thickness of the deposited layer.
Treatment combination    A    B    Thickness
(1) 0 0 14.037, 14.165, 13.972, 13.907
a 1 0 14.821, 14.757, 14.843, 14.878
b 0 1 13.880, 13.860, 14.032, 13.914
ab 1 1 14.888, 14.921, 14.415, 14.932
The figures for thickness are stored in a single column row by row in the file arsenic.dat. We
will create a data frame for these data and then generate variables to represent the levels of the
two factors A and B.
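A sketch of these commands, assuming the sixteen thickness values appear in arsenic.dat in the row order of the table above:
> thickness <- scan("arsenic.dat")
> A <- factor(rep(c(0, 1, 0, 1), each=4))    # deposition time for rows (1), a, b, ab
> B <- factor(rep(c(0, 0, 1, 1), each=4))    # arsenic flow rate
> arsenic <- data.frame(thickness, A, B)
> lmars <- lm(thickness ~ A * B, data=arsenic)
> anova(lmars)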
Having established from this that neither B (the flow rate) nor the interaction term are signifi-
cant, we fit a model with just A the deposition time and generate some diagnostic plots.
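For example:
> lmars2 <- lm(thickness ~ A, data=arsenic)
> summary(lmars2)
> par(mfrow=c(2,2))
> plot(lmars2)
> par(mfrow=c(1,1))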
In an analysis of covariance we are interested in the relationship between the response and
covariates (explanatory variables in the usual sense) for different levels of a set of factors. For
example, suppose we are interested in the behaviour of the relationship between fuel consumption
in city driving and vehicle weight, comparing vehicles where manual transmission is available
with those where it is not. We start with a plot of MPG.city against Weight with the level of
Man.trans.avail identified for each point. The legend(...) function pauses while we choose
a location for the legend on the plot.
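A sketch of the plot commands:
> plot(Cars93$Weight, Cars93$MPG.city, col=as.numeric(Cars93$Man.trans.avail))
> legend(locator(n=1), legend=levels(Cars93$Man.trans.avail), col=1:2, pch=1)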
Initially we fit a model in which the slope is the same for both levels of the factor.
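For example (the object name is illustrative):
> lmMPGcov <- lm(MPG.city ~ Weight + Man.trans.avail, data=Cars93)
> summary(lmMPGcov)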
This indicates that, if the same slope is fitted for each level of the factor, then the difference in
intercepts is not significant. We can allow for an interaction between Weight and Man.trans.avail.
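A sketch, assuming the interaction model is the object lmMPGint referred to below:
> lmMPGint <- lm(MPG.city ~ Weight * Man.trans.avail, data=Cars93)
> summary(lmMPGint)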
With the interaction term included, both the Man.trans.avail variable and the interaction
between Weight and Man.trans.avail are significant. This indicates that both slope and
intercept are significantly different when the levels of Man.trans.avail are taken into account.
Note that the difference between the slope estimates of these two models is precisely the inter-
action term from lmMPGint.
We can fit these models simultaneously using a model formula of the form a/x -1 where a is
a factor. This will fit the separate regression models. The -1 is to remove the overall intercept
term that we do not want.
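For example:
> lmMPGsep <- lm(MPG.city ~ Man.trans.avail/Weight - 1, data=Cars93)
> summary(lmMPGsep)      # separate intercept and slope for each level of the factor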
6.4 Exercise
1. This question illustrates the importance of including marginal terms. Fit quadratic re-
gression models for MPG.highway with Horsepower as an explanatory variable. First of all
just with the quadratic term and then with the marginal term included. The commands
are:
lmMPG1 <- lm(MPG.highway ~ I(Horsepower^2), data=Cars93)
lmMPG2 <- update(lmMPG1, ~ Horsepower + .)
Plot MPG.highway against Horsepower and add the fitted lines generated by the two
regressions. Which do you think is more appropriate?
2. We wish to determine whether the log price is affected by the origin of the vehicle. Why
might we consider performing a log transformation? Create a variable logPrice by taking
logs of the Price variable.
(a) Carry out one-way analysis of variance with Origin as a factor. What conclusions
do you draw?
(b) Now include Type in the model. Can you explain the resulting changes?
(c) Is there a significant interaction between type and origin? [You will need to fit
another model to answer this question.]
(d) Draw a plot of logPrice against as.numeric(Type) with the colour of each point
determined by its origin. [You will need to use col = as.numeric(Origin).] Create
a legend for this plot. Does this help to explain the results of our models?
3. Write a function that takes two variables and the data frame that contains these variables
as arguments. The function should perform the quadratic regression of first variable on
the second and generate a plot of the response against the explanatory variable with the
fitted curve added to the plot.
4. † Consider the zero-mean auto-regressive model of order 1, AR(1):
w_t = φ w_{t−1} + ε_t ,   {ε_t} ∼ IID N(0, 1).
Write a function to simulate a series of n values of an AR(1). Now consider a simple
regression model with time as the explanatory variable and autoregressive errors:
Yt = α + βt + wt , {wt } ∼ AR(1).
Write a function that simulates n values from this regression model and returns β̂. Write a
function that simulates r instances of β̂ and use it to investigate the MSE of the associated
estimator. How does this compare with (analytic) results for the case of uncorrelated
errors.
6.5 Reading
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th edition.
Springer. [Section 6.1 for analysis of covariance example and section 6.7 for factorial
design]
• Dalgaard, P. (2002) Introductory Statistics with R, Springer. [Section 10.3 dummy vari-
ables, 10.5 interaction, 10.6 two-way ANOVA and 10.7 analysis of covariance - all R no
theory]
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th edition.
Springer. [Section 6.2 for stuff on contrasts]
Time Series
1. Back-shift operator, B. Moves the time index of an observation back by one time interval:
B Y_t = Y_{t−1}, and more generally B^p Y_t = Y_{t−p}.
Stationarity features prominently in the statistical theory of time series. A model {Y_t} is
covariance stationary if: (i) E(Y_t) = μ for all t; (ii) Var(Y_t) = σ^2 < ∞ for all t; and (iii) Cov(Y_t, Y_{t+k}) depends only on the lag k and not on t.
In summary, the mean, variance and the correlation structure of the series do not change over
time.
7.1.3 Auto-correlation
The auto-correlation function (ACF) is of key interest in time series analysis. This function
describes the strength of the relationships between different points in the series. For a covariance
stationary time series model {Y_t}, the autocorrelation function (ACF), ρ(.), is defined as
ρ(k) = Corr(Y_t , Y_{t+k}) = Cov(Y_t , Y_{t+k}) / Var(Y_t) ,   k = 0, 1, 2, . . . .
Models of the autoregressive moving average (ARMA) class are widely used to represent time
series. They are not the only time series models available and are often not the most appropriate.
However, they form a good starting point for the analysis of time series. In what follows we
describe zero mean processes. The definitions are easily adapted to the non-zero mean case.
Time series are usually characterised by their dependence structure; what happens today is
dependent on what happened yesterday. A simple way to model dependence on past observations
is to use ideas from regression. We can build a simple regression model for our time series in
which the explanatory variable is the observation immediately prior to our current observation.
This is an autoregressive model of order 1, AR(1): Y_t = φ_1 Y_{t−1} + ε_t. More generally, including dependence on the p previous observations gives an autoregressive model of order p, AR(p),
Y_t = φ_1 Y_{t−1} + . . . + φ_p Y_{t−p} + ε_t ,
or
φ(B)Y_t = ε_t ,
where φ(B) = 1 − φ_1 B − . . . − φ_p B^p. We can specify conditions on the function φ(.) in order to
ensure that this model is stationary.
A more general class of model is defined by including dependence on past error terms. These
are referred to as autoregressive moving average (ARMA) models. If we include dependence on
the q previous error terms, the result is an ARMA(p, q):
Y_t = φ_1 Y_{t−1} + . . . + φ_p Y_{t−p} + ε_t + θ_1 ε_{t−1} + . . . + θ_q ε_{t−q} ,
or
φ(B)Y_t = θ(B)ε_t ,
where θ(B) = 1 + θ_1 B + . . . + θ_q B^q.
Models which are stationary after differencing may be represented as autoregressive integrated
moving averages (ARIMA). If {Y_t} ∼ ARIMA(p, d, q), then {Δ^d Y_t} ∼ ARMA(p, q). This relationship is represented by the model equation
φ(B) Δ^d Y_t = θ(B) ε_t .
The order of an ARIMA model is determined by three numbers p, d and q. If the process is
seasonal, in addition to these numbers we have s the seasonal period, and P , D and Q the orders
of the seasonal parts of the model. † The seasonal ARIMA(p, d, q) × (P, D, Q)s model is
Φ(B^s) φ(B) Δ_s^D Δ^d Y_t = Θ(B^s) θ(B) ε_t ,
where Φ(.) and Θ(.) are the seasonal autoregressive and moving average polynomials (of orders P and Q) and Δ_s = 1 − B^s denotes seasonal differencing.
Let {X_t} be a zero mean time series and let F_t denote the past of the series, {X_{t−i} : i =
0, 1, . . . }. The generalised autoregressive conditionally heteroskedastic, GARCH(p, q), model is
defined by
X_t = σ_t ε_t ,
σ_t^2 = c + Σ_{i=1}^{p} b_i X_{t−i}^2 + Σ_{i=1}^{q} a_i σ_{t−i}^2 ,
where σ_t^2 = Var(X_t | F_{t−1}) = E(X_t^2 | F_{t−1}), and {ε_t} ∼ IID(0, 1) with ε_t independent of F_{t−1}.
All of the parameters are non-negative, c > 0 and Σ_i a_i + Σ_i b_i < 1. In financial applications,
GARCH models are routinely fitted to the (log) returns for the price of an asset. The popularity
of the GARCH class is due, at least in part, to the ease with which the likelihood is constructed;
if x = (x1 , . . . , xn ) is a sequence of observed returns then, by prediction error decomposition,
ℓ(θ) = log f(x | θ, F_0) = Σ_{t=1}^{n} [ log f_ε(x_t / σ_t) − log σ_t ] ,
where f_ε denotes the density of ε_t.
In order to illustrate some time series ideas we will use the data in the file UKairpass.dat. This
file contains the monthly numbers of airline passengers (thousands) using UK airports from
January 1995 to June 2003. It contains two series, one for domestic passengers and one for
international passengers.
The function ts(...) provides a flexible mechanism for creating time series objects from vectors,
matrices or data frames. We will illustrate using a data frame.
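A sketch, assuming the two columns of UKairpass.dat are named Domestic and International and that the file has a header row:
> ukair <- read.table("UKairpass.dat", header=TRUE)
> Domestic <- ts(ukair$Domestic, start=c(1995, 1), frequency=12)
> International <- ts(ukair$International, start=c(1995, 1), frequency=12)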
Domestic and International are now time series objects. The argument frequency refers
to the seasonal period s. We can view time series objects by simply typing their name or use
the function tsp(...) to extract the start, end and frequency. Notice how, since we specified
frequency = 12 when we created these objects, R assumes that the appropriate labels for the
observations are month and year.
> Domestic
> tsp(Domestic)
The function ts.plot(...) will generate a time series plot. Below we add colour to differentiate
between the series and include a legend.
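For example:
> ts.plot(Domestic, International, col=c(4,6))
> legend(locator(n=1), legend=c("Domestic","International"), col=c(4,6), lty=1)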
We can shift the series in time using the lag(...) function. This function has two arguments:
the series that we want to shift and k the number of time steps we would like to push the series
into the past (the default is 1). This function works in a counter-intuitive way; if we want to
back-shift the series in the usual sense (so that the Jan 1995 reading appears at Feb 1995, and so
on) we have to use k=-1.
> lag(Domestic)
> lag(Domestic, k=-1)
> lag(Domestic, k=-12)
In order to take differences we use the diff(...) function. The diff(...) function takes two
arguments: the series we would like to difference and lag, the back-shift that we would like to
perform (default is 1). Here the lag argument of the diff(...) function (not to be confused with
the lag(...) function!) is interpreted in the usual way, so diff(Domestic, lag=12) is equal
to Domestic - lag(Domestic, k=-12). Note that this is seasonal differencing, Δ_12, not order-12
differencing, Δ^12.
> diff(Domestic)
> ts.plot(diff(Domestic, lag=12), diff(International, lag=12), col=c(4,6))
> legend(locator(n=1),legend=c("Domestic","International"), col=c(4,6),
+ lty=1)
We can perform simple arithmetic and transformations using time series objects. The results
are also time series objects.
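For example:
> Total <- Domestic + International    # still a monthly time series
> tsp(Total)
> ts.plot(log(International))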
The function to generate the ACF in R is acf(...). This function returns an object of mode
list and (unless we alter the default setting) plots the ACF.
> acf(International)
> acfInt <- acf(International, plot=FALSE)
> mode(acfInt)
> attributes(acfInt)
> acfInt$acf
Note that 1.0 on the x-axis in this plot denotes 1 year. We can use the time series plots and
ACF plots to guide the process of model identification. The time series plot of International
suggests a trend and seasonal pattern. We may remove these by differencing and look at the
ACF of the resulting data.
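For example:
> acf(diff(International))                       # after first differencing
> acf(diff(diff(International), lag=12))         # after first and seasonal differencing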
This final plot indicates that after differencing and seasonal differencing there is still some sig-
nificant correlation of points one year apart. † We may tentatively identify an ARIMA(0, 1, 0) ×
(0, 1, 1)12 on the basis of this plot.
In order to generate a multivariate time series from two series that are observed at the same
time points we may use the ts.union(...) function. Multivariate time series may be passed
as arguments to plotting functions. Passing a multivariate time series to the acf(...) function
will draw individual ACF plots and also the cross-correlations. We can also perform simple
manipulation operations on a multivariate series; the result is that each of the component series
is altered.
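For example:
> air <- ts.union(Domestic, International)
> plot(air)
> acf(air)       # individual ACFs plus cross-correlations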
There are functions available in R to fit autoregressive integrated moving average (ARIMA)
models. We proceed by examples.
For every year from 1871 to 1970 the volume of water flowing through the Nile at Aswan was
recorded (I think the units are cubic kilometers). These 100 data points are in the file nile.dat.
We fit ARIMA models using the function arima(...). The commands below fit an AR(1) and
an ARIMA(0,1,1) to the Nile data. Note the brackets around the command; this is equivalent to
performing the command without brackets then typing the name of the object just created. The
order is a vector of three numbers (p, d, q). Which of these models do you think is preferable?
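A sketch of these commands (the way nile.dat is read in is an assumption):
> nile <- ts(scan("nile.dat"), start=1871)
> (nileAR1 <- arima(nile, order=c(1,0,0)))
> (nileARIMA011 <- arima(nile, order=c(0,1,1)))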
One of the key questions in time series modelling is, does our model account adequately for
serial correlation in the observed series? The function tsdiag(...) will give an index plot of
standardized residuals (over time), a plot of the sample ACF values of the residuals and a plot
of the Ljung-Box statistic calculated from the residuals for a given number of lags (this number
can be set using the gof.lag argument). Both sample ACF and Ljung-Box statistics are used
to detect serial correlation in the residuals, that is, serial correlation that we have not accounted
for using our model. The Ljung-Box statistic at lag h tests the hypothesis of no significant
correlation below lag h. The commands below generate diagnostic plots for our two models.
Does this change your opinion about which model is preferable?
> tsdiag(nileAR1)
> tsdiag(nileARIMA011)
One of the aims of time series modelling is to generate forecasts. In R this is done using the
predict(...) function which takes a model object and a number (number of prediction) as
arguments. The return value is a list with named components pred (the predictions) and se
(the associated standard errors). Below we forecast 20 years beyond the end of the series for
each of our models. What do you notice about these predictions? (A picture may help.) The
predictions are very different; can you explain this?
> predict(nileAR1,20)$pred
> predict(nileARIMA011,20)$pred
Seasonal ARIMA models are also fitted using the arima(...) function. The argument seasonal
is a list with named elements order (the seasonal order of the process, (P, D, Q)) and period
(the seasonal period). Earlier we indicated that an ARIMA(0, 1, 0) × (0, 1, 1)12 may be appro-
priate for the international air passengers data. We will fit this model and look at the properties
of residuals. What do you notice in the index plot of the residuals?
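A sketch of the fit (the object name intARIMA is illustrative):
> intARIMA <- arima(International, order=c(0,1,0),
+                   seasonal=list(order=c(0,1,1), period=12))
> tsdiag(intARIMA)
> plot(residuals(intARIMA))      # plot of the residuals over time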
A good way to represent forecasts is in a plot along with the observed data. Below we generate
a plot of forecasts and an approximate 95% confidence interval for the forecasts.
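A sketch, forecasting two years ahead and using plus or minus two standard errors as an approximate 95% interval:
> intpred <- predict(intARIMA, n.ahead=24)
> ts.plot(International, intpred$pred,
+         intpred$pred + 2*intpred$se, intpred$pred - 2*intpred$se,
+         col=c(1, 2, 4, 4), lty=c(1, 1, 2, 2))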
7.5 Exercise
1. The file rpi.dat contains the monthly RPI for the UK (retail price index excluding housing
costs) from January 1987 to August 2002 measured on a scale which takes the January
1987 figure to be 100. Generate a time series plot of the data. Generate a plot of the
sample ACF for the original series, for the first differences and for the seasonal difference
of the first difference series. Comment on these plots.
2. The file passport.in7 contains the number of applications arriving at the UK passport
service monthly from July 1993 to June 2002. Fit an ARIMA(0, 1, 1) × (0, 1, 1)12 model
to these data and generate a plot of 5 years of forecasts. Is there anything in the residual
plots for this model that makes you feel uneasy about your forecasts?
3. In this exercise I would like you to write some of your own time series functions.
(a) The lag(...) function behaves counter-intuitively (as described above). Write a
simple function that takes a time series object and a positive integer as arguments
and performs the back-shifting operation in the normal way (it should only take one line).
(b) By default the plot produced by the acf(...) function includes the autocorrelation at lag 0. This
is irritating; we know that the ACF at lag 0 is always equal to 1, so we do not need it to be
included on the plot. Try to write a function that plots the ACF from lag 1 upwards.
[Hint: have a look at the attributes of the return value of the acf(...) function.]
(c) Write a function that takes as arguments a time series and an integer, d, and differences
the series d times, that is, performs the operation Δ^d.
(d) † Write a function that fits several different ARIMA models to a data set and chooses
the one with minimum AIC.
4. A couple of interesting time series simulation problems (one easy, one tricky).
7.6 Reading
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th edition.
Springer. [Sections 14.1, 14.2 and 14.3 but not the spectral stuff.]
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th edition.
Springer. [The remainder of chapter 14 interesting spectral and financial time series stuff.]
We will consider principal components analysis as a data reduction technique. In this section
we depart somewhat from our previous notation. We follow most of the conventions adopted
in Mardia, Kent and Bibby (1979). In particular we consider a random 1 × p row vector, x,
with mean µ and variance matrix Σ. The reasons for this notational convention will become
apparent in §8.1.2.
The principal components are (standardized) linear combinations of our original variables. The
first principal component combines the variables in a way that maximizes variance. This has
important implications for the use of principal components as a data reduction technique.
Reversing transformation (8.1), y = (x − μ)Γ (where the columns of Γ are the eigenvectors of Σ,
ordered by decreasing eigenvalue), yields
x = μ + yΓ′.
We know that y1 has maximal variance, y2 has the second largest variance and so on with yp
having the smallest variance. It may be reasonable to ignore principal components beyond some
point r < p. This allows us to reduce the dimension of our problem and focus analysis on the
first r principal components. As these r components account for the bulk of the variability, a
reasonable approximation to the original vector is
x* = μ + y*Γ*′ ,
where y* consists of the first r principal components and the columns of Γ* are the corresponding eigenvectors.
Now consider taking a sample x1 , . . . , xn where each xi is a 1 × p vector. We stack these vectors
row-wise in the n × p data matrix X = (x_1′, . . . , x_n′)′. Another way to view X is as p column
vectors of dimension n stacked column-wise, X = (X_1, . . . , X_p). We define the sample mean by
x̄ = (1/n) Σ_{j=1}^{n} x_j, and the sample variance matrix by S = (1/n) Σ_{j=1}^{n} (x_j − x̄)′(x_j − x̄).
We perform a decomposition that is analogous to that described for the population variance
matrix in §8.1.1. Thus, G′SG = L where L is a matrix with the eigenvalues of S on the
diagonal (in decreasing order of magnitude) and G is the column-wise stack of the corresponding
eigenvectors, G = (g_1, . . . , g_p). Defining 1 as an n × 1 vector of 1s, the sample principal
components are given by the columns of Y where
Y = (X − 1x̄)G.
The kth sample principal component is the column
Y_k = (X − 1x̄)g_k.
Retaining only the first r components, with Y* = (Y_1, . . . , Y_r) and G* = (g_1, . . . , g_r), the corresponding approximation to the data matrix is
X* = 1x̄ + Y*G*′.
We will work initially with the data set temperature.dat. This contains observations of the
average daily temperature at 15 locations from 1 January 1997 to 31 December 2001. We
generate a data frame, check variable names and calculate the values p (number of variables)
and n (number of observations) in the usual way.
It will be useful later to have our data in matrix format. We use as.matrix() to convert a data
frame to a matrix, then work out the sample mean and covariance matrix.
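A sketch (the data frame name temps and the header row are assumptions; note that var(...) uses divisor n − 1 rather than the n in the definition above):
> temps <- read.table("temperature.dat", header=TRUE)
> names(temps)
> p <- ncol(temps)
> n <- nrow(temps)
> X <- as.matrix(temps)
> xbar <- colMeans(X)
> S <- var(X)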
We could determine the principal components directly by computing the eigenvalues and eigen-
vectors of S. However, there are functions available in R to perform these calculations. The
function princomp(...) takes a data matrix argument and returns a list (of class princomp).
Information about the principal components can be extracted from this list; summary(...) and
plot(...) provide the variance of each component while loadings(...) gives the values in
the matrix G used to transform to principal components.
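For example, assuming the output is stored as temps.pca (the object referred to below):
> temps.pca <- princomp(X)
> summary(temps.pca)
> plot(temps.pca)          # variances of the components
> loadings(temps.pca)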
We gain some insight into the nature of the principal components by plotting their scores in time order.
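For example:
> plot(temps.pca$scores[,1], type="l")    # first principal component over time
> plot(temps.pca$scores[,2], type="l")    # second principal component over time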
Which feature of the data does the first principal component represent? Is this surprising? What
about the second principal component?
We can generate approximations to our data matrix from the principal components. We first
set up the matrix of eigenvectors G and the matrix of mean values 1x̄ (note temps.pca$center
and colMeans(X) are the same thing)
Note the rather awkward matrix construction that we have to do here since R interprets Y[,1]
as a vector. This is not the case when we use more than one principal component.
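A sketch, using the objects from the commands above:
> G <- loadings(temps.pca)
> Y <- temps.pca$scores
> M <- matrix(colMeans(X), nrow=n, ncol=p, byrow=TRUE)          # the matrix 1 x-bar
> X1 <- M + matrix(Y[,1], ncol=1) %*% t(matrix(G[,1], ncol=1))  # one component
> X2 <- M + Y[,1:2] %*% t(G[,1:2])                              # two components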
Do you think that using two principal components provides a reasonable approximation to the
data?
In this section we will use data on the weekly closing prices of six stocks recorded for the weeks
starting 22nd November 2004 to 7th November 2005.
Using the notation of the previous section, suppose we observe x1 , . . . , xn where each xi is a 1×p
vector. There are numerous approaches to clustering available; we consider an agglomerative
clustering method. Starting with n clusters, points are brought into clusters one by one. The
aim is to keep dissimilarity within clusters low compared to dissimilarity between clusters. We
measure dissimilarity using the usual Euclidean distances between the rows of the data matrix
calculated by dist(...). The clustering method hclust(...) takes a matrix of distances as an
argument and returns a list (of class hclust). The plot(...) command applied to the output
of hclust produces a dendrogram illustrating how the clusters are put together. We can divide
the data up into a given number of clusters using the cutree(...) command; this is illustrated
below dividing the data into 4 clusters.
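A sketch (the file name stocks.dat and its layout are assumptions):
> stocks <- read.table("stocks.dat", header=TRUE)
> d <- dist(stocks)            # Euclidean distances between the rows (weeks)
> hc <- hclust(d)
> plot(hc)                     # dendrogram
> cutree(hc, k=4)              # divide the observations into 4 clusters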
Conventional wisdom is that financial data contains volatility clustering. One rather crude way
to look for this is to perform cluster analysis on the squared returns.
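For example (a sketch, continuing from the objects above):
> rets <- apply(log(stocks), 2, diff)     # weekly log returns for each stock
> hc2 <- hclust(dist(rets^2))             # cluster the weeks by squared returns
> plot(hc2)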
Cluster analysis is not guaranteed to produce sensible clusters. We can sometimes identify struc-
ture from simple graphics such as those generated by stars(...). This produces a glyph for
each xi with shape determined by the values of the p variables.
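For example:
> stars(stocks)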
8.3 Exercise
1. For the weather example in §8.1.3, find the difference between the data matrix and the
approximation generated using just the first principal component. Is it reasonable to model
this difference as IID normal? What about when 2 or 3 principal components are used?
2. † Our principal components in §8.1.3 are dominated by the first component. We might
consider principal components analysis in series for which this dominant component has
been removed. Using an appropriate model for seasonality, generate deseasonalised data
and apply principal components. What is the dominant effect now?
3. † From finance.yahoo.com, collect a long series of daily highs and lows from a number of
stocks in the same sector. The difference between high and low is a measure of volatility.
Can you identify volatility clustering using these differences?
1. The plain text data file prices.dat contains the daily closing prices ($) and log returns
of Intel stock and the NASDAQ index from 1st February 2000 to 31st July 2000. The first
line of the file gives the variable names. Create a data frame inp using:
> inp <- read.table("prices.dat", header=TRUE)
How many shares do we have and how much money do we have in the bank at the
end of the period (31st July)?
2. It has been suggested that the Intel log returns can be represented using simple regression
with the NASDAQ log returns as the explanatory variable. The model is
Y_t = α + βx_t + ε_t ,   (8.2)
where Y_t is the Intel log return, x_t is the NASDAQ log return and ε_t is a noise term for
t = 1, . . . , n.
(a) Fit model (8.2) using the Intel log returns and NASDAQ log returns from the inp data
frame created in question 1. Give the parameter estimates and associated p-values.
Give the null hypotheses and conclusions of the associated hypothesis test.
(b) We would like to include month of the year as a factor in our linear model.
i. How many explanatory variables are associated with the factor Month? When
the factor is included in a linear model, what is the nature of the explanatory
variables?
ii. Include Month in our existing model for Intel log returns. Is month a significant
explanatory factor? Give the reason for your answer.
iii. Fit a model for Intel log returns that includes the interaction between month
of the year and NASDAQ log returns along with all the marginal terms. Using
the output from this model, write down the estimated linear association between
Intel log returns and NASDAQ log returns for July.
(c) Write a function that takes a linear model object as an argument and returns the
date of the point with the largest (in absolute value) residual and the value of that
residual. Apply this function to the model fitted in part (b)iii. and write down the
date of the point with the largest absolute residual and the value of the residual.
(d) Generate an indicator for the date given in part (c) (that is, a vector that takes the
value 1 on that date and zero everywhere else). Include this indicator as an explana-
tory variable in the model generated in part (b)iii. Give the estimated coefficient for
the indicator. What does this estimate represent? Compare the standard diagnostic
plots for the model without the indicator to those with the indicator.
3. Consider the simple regression model
y_i = βx_i + ε_i ,   ε_i ∼ N(0, σ^2),   (8.3)
where x = (x_1 , . . . , x_n ) is a fixed vector of explanatory variables.
(a) Write a function that takes three arguments, the slope parameter β, the error standard
deviation σ, and a vector of explanatory variables x. The default values for β and
σ should both be 1. The function should return simulated values y1 , . . . , yn from
model (8.3) where n is the length of the explanatory variable vector. Set the random
number generator seed to 7 and test your function on x = (1, 2, . . . , 30) using the
default values of β and σ. Give the minimum and maximum of the simulated y
values.
(b) Write a function that uses the function in part (a) and returns a vector of r instances
of the parameter estimate β̂. Set the random number generator seed to 7 and test
your function on x = (1, 2, . . . , 30) and r = 50 using the default values of β and σ.
Give the mean and variance of the resulting vector of parameter estimates.
(c) Write a function that simulates instances of β̂ for every possible combination of β = 0.5
and 1.5, and σ = 1, 2 and 4 and returns the approximated mean square error for each
combination. Set the random number generator seed to 7 and use your function
with the NASDAQ log returns from the data frame inp created in question 1 as the
explanatory variable and r = 100.
8.5 Reading
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th edition.
Springer. [Parts of sections 11.1 and 11.2.]
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th edition.
Springer. [The remainder of chapter 11.]