CS1B R Programming
Getting started
Covered in R1
1 Installing R
R is an open‐source programming language that is increasingly being used in the actuarial
profession with many applications in the world of statistics. If you haven’t used R before then
your first job is probably to install it on your computer or device. This section gives you a rough
guide on how to download and install R, although you may find that the process is slightly
different depending on the device you are using. You may also need to ask someone with
administrative privileges for help (eg your IT Service Desk) if you are installing R on a work
computer.
If you encounter any problems, then please do not contact ActEd for help. You will probably find
a solution much more quickly if you search the internet. Many people have published installation
guides as well as numerous problem solving tips in discussion forums.
1. Visit http://www.stats.bris.ac.uk/R/. (If you are IT savvy then you can probably find the
relevant file to download and install R without following the instructions below.)
2. Click on “Download R for Windows” (assuming you are using a Windows‐based system).
3. Click on “base”.
4. Click on “Download R 3.5.1 for Windows” or whatever the latest version is.
5. Find the downloaded file (eg R‐3.5.1‐win.exe) where it was saved and double click on it.
6. You can quickly progress through the next four pop‐ups with one click of OK and three
clicks of Next. However, please note the information in the third pop‐up about
administrator rights.
7. A couple more clicks of Next (after changing the options if you wish) and then Finish and
you’re done.
8. You should then be able to run R using the desktop icon or via the Start menu.
2 Installing RStudio
We will have a very brief look at working directly in R in the next section, but most of the time we
will instead be working in RStudio. This is a more user‐friendly interface which you will probably
find easier to use. So you now need to install this as well:
1. Visit the RStudio website and download the free RStudio Desktop installer for your
operating system.
2. Run the .exe file and follow the installation instructions. (You may need to ask your IT
service desk for help if you need administrator privileges.)
3. If you run into any difficulties then we recommend that you browse the internet for help.
3 Working directly in R
Although we will be working most of the time in RStudio, it might be useful to have a quick look at
R itself.
When you load up R you’ll be greeted with R’s graphical user interface (or GUI for short).
Inside this you’ll see one open window which is called the R console.
You can change the display preferences by Edit/GUI preferences, where you can change features
like the font, size and style (normal, bold, italic) of the text:
Entering commands
Unlike modern mouse‐driven programs, R is a command‐based programming language.
So rather than choosing options from menus or clicking on icons we’ll be typing commands into
the console window that tell R what we want it to do.
For example if we type 2+3 and then press enter we’ll get
5, as shown.
In this introduction, we will write the command you will enter in red and the results of executing
that command in blue.
2+3
5
The trouble with R is going to be remembering the names of the commands, which is made
harder by the fact that R is case sensitive…
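To see what case sensitivity means in practice, here is a minimal sketch (the name Sqrt is deliberately wrong, to contrast with the real function sqrt):

```r
# R is case sensitive: sqrt is a function, Sqrt is not
sqrt(25)         # works, gives 5
exists("sqrt")   # TRUE
exists("Sqrt")   # FALSE - typing Sqrt(25) would give an error
```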
Graphics window
If we produce any graphics then they will appear in a separate window to the console, called the
graphics window.
demo(graphics)
You’ll need to hit enter each time to move onto the next graphic.
Script window
Rather than entering commands directly in the console window we can use another window
called the script window (or script editor).
You can open this window first by clicking in the Console (to get the right menus at the top of the
GUI), and then choosing New script from the File menu:
Just like a script for a play or movie which contains the lines that you read out – it has the lines of
commands which can be “read out” or put into the console window either using copy and paste,
or more quickly by clicking on the line and typing CTRL+R. We’ll talk about scripts more later on.
Other interfaces
This is the basic graphical user interface (GUI) for R. As already mentioned, other packages are
also available which offer user‐friendly features, for example RStudio, which we will be using
shortly. Another example is R commander which is another GUI which expands the menus to
include standard commands such as importing data, producing graphs, carrying out tests and
fitting models to the data set.
Quitting R
To end your session in R you could type the command quit( ) or just q( ) in the console.
Alternatively choose exit from the File menu or just click on the close window icon in the corner:
R will then ask you if you want to save the workspace image:
We’ll talk about workspaces more later. Suffice to say, if you have created any objects (that is,
important things assigned a special name) that you want next time then you may wish to click on
Yes. If you haven’t done anything you wish to save, then just click No.
4 Working in RStudio
Now you have had a quick look at R, it won’t take long to become familiar with the basics of
RStudio.
Start/run RStudio and you will probably see something like this:
The panel on the left hand side is simply R’s Console, which we have already met. The panel at
the top on the right has a number of tabs. The first is the Environment (or Workspace) which will
prove very useful as it displays the values of variables and contents of datasets that we are using.
The second tab, History, not surprisingly displays a history of your work in R. We won’t worry
about the third tab for now.
The panel at the bottom on the right also has a number of useful tabs. One displays recently used
files, allowing you to access them quickly. Another, Plots, is simply the graphics window and will
display the plots/graphs that you ask R to produce. There are also important tabs called Packages
and Help which we will look at later.
If you open up a Script in RStudio, using File, New File, R Script, or by clicking on
the drop‐down arrow and then R Script (found in the top right‐hand corner of
the screen), then RStudio will display all four panels neatly arranged.
You can drag the edges of the panels to size them as you wish:
Commands
Just like in R, to clear the Console press Ctrl L from within the prompt of the Console.
To run lines of codes from the Script window, press Ctrl Enter (and not Ctrl R). Alternatively use
the Run Button on the top bar of the Script window.
Quitting RStudio
You can exit RStudio in the same way you exit R, or just press Ctrl Q.
5 Summary
Key terms
GUI Graphical user interface
Console window The window where commands are entered and then executed by hitting
the enter key.
Graphics window The window where graphics are displayed. You can then export these to
put in any documentation you produce.
Script window A window where commands can be written but not executed. We can
transfer them to the console window and execute them using CTRL+R.
This will be covered in a later chapter.
Menus
Changing display preferences (eg font size, type and style):
R: Edit, GUI preferences
RStudio: Tools, Global Options, Appearance
Key commands
Running a line of script: Ctrl+R (R), Ctrl+Enter (RStudio)
Clearing the console: Ctrl+L
Quitting: q( ) in the console, or Ctrl+Q (RStudio)
6 Have a go
You will only get proficient at R by practising.
Basic arithmetic
Covered in R2
1 Arithmetic operators
Load up RStudio and clear the Console (using Ctrl L from within the Console).
Recall from the previous chapter that R is a command based programming language.
In the Console you’ll see a little greater than symbol which is the prompt:
>
We’ll type in a command at the prompt and then press enter to ask R to execute the command.
R will then return the result, or produce the graphical output or send the output to a file or
device.
Throughout this and later chapters we will use red for the commands we enter and blue for the
output/results that we get by executing that command.
Addition
So let’s get R to carry out some simple arithmetic. Type:
2+3
5
Index numbers
If the output consists of many values over several lines then each new line starts with the index
number of the first element on that line.
So supposing I had five answers, three answers on the 1st line (say 5, 7 and 2) and two on the 2nd
line (say 8 and 3). Then the 1st line would have [1] and the 2nd line would have [4], eg:
[1] 5 7 2
[4] 8 3
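A quick way to see these index numbers for yourself is to print a vector long enough to wrap over several console lines, for example:

```r
# Print the numbers 1 to 30; at the default console width the output
# wraps, and each new line starts with the index number of its first
# element, eg [1] and [26]
x <- 1:30
print(x)
```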
Subtraction
Similarly we can subtract numbers. Enter the following:
‐3 ‐ ‐ 5
2
Spaces
Note that R ignores spaces so typing the same command with lots of spaces will give the same
answer. Try it again with spaces between the numbers, signs and operator:
‐ 3 ‐ ‐ 5
Multiplication and division
Multiplication and division are carried out using the * and / operators:
3*2‐1/4
5.75
We can see that R uses algebraic logic to calculate expressions – that is it calculates multiplication
and division before addition and subtraction.
3*(2‐1/4)
5.25
Note that RStudio automatically enters the closing bracket for you.
Suppose you make a mistake – and you actually wanted to calculate 1/5 not 1/4.
Rather than retyping the expression again, you can use the up arrows key to move back to
previous commands and then edit that command and re‐execute it.
Try it out yourself. Use the up arrow to bring up the previous command, then the left arrow to go
back into that command to edit it to read:
3*(2‐1/5)
5.4
Alternatively you could use the mouse to copy and paste the expression.
Powers
Powers can be obtained using the ^ key (or even a double asterisk, **):
2^3
2**3
Non‐numeric outputs
Let’s look at some non‐numeric outputs you might see in the future. Try entering the following:
5/0
Inf
0/0
NaN
which isn’t a misspelling of naan but stands for “Not a Number”, ie it’s undefined.
2^
R realises that we have clearly not finished our command and so it helpfully puts a + prompt to
say “and…”:
This + prompt also appears if our expression runs onto the next line. We can use this feature to
deliberately split longer expressions over several lines to make them easier to read.
If you make a mistake and want to cancel what you are doing and type something else then just
press the Esc button.
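For example, a longer calculation can be deliberately split over two lines; because the first line ends with an operator, R shows the + prompt and waits for the rest:

```r
# The first line is incomplete (it ends with +), so R waits
# for the rest of the expression before evaluating
total <- 1 + 2 + 3 +
  4 + 5
total
```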
2 Summary
Key terms
Prompt The symbol (>) that appears to the left of each line, showing that R is ready for a command
Index number The start of each line of output has an index number eg [1]
Key commands
+ add
‐ subtract
* multiply
/ divide
^ power
** power
3 Have a go
You will only get proficient at R by practising.
Use R to calculate the following:
3 - -8
-7 + 5
3 × (2 + 5)
4^5
8 ÷ 0
(1 - 1.08^-3) ÷ 0.08
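If you are unsure how to type mathematical notation at the prompt, note that × becomes * and ÷ becomes /. One possible way to enter calculations like those above (a sketch, not the only way):

```r
3 - -8                  # subtracting a negative number
-7 + 5
3 * (2 + 5)             # x becomes *
4^5
8 / 0                   # gives Inf
(1 - 1.08^-3) / 0.08    # an annuity-style calculation
```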
Basic functions
Covered in R3
Functions
Log and exponential
Square root
Trigonometric functions
Factorial and Choose
1 Introduction
In the previous chapter we covered the common arithmetic operators. In this chapter we’re
going to look at functions for simple mathematical operations (such as logs and trig functions)
that we can find on a scientific calculator.
There are literally hundreds of functions included in the standard R program that cover
mathematical operations, statistical analysis, graphing and many other purposes.
We can also get extra functions for specific features (such as time series analysis, twitter data
mining or prettier graphics) by downloading something called packages. We’ll cover how to do
this in a later chapter.
2 Functions
Functions in R take the form <function name>(<arguments>).
You have the name of the function, followed by brackets. Inside the brackets you put the
arguments of the function in a specified order separated by commas. There are two types of
arguments: the values you’re going to put into the function and then the options. If we omit any
options then R will use the default settings.
Note that both brackets are needed even if there are no arguments. For example, in an earlier
video, we introduced the command for quitting R which was quit( ) or q( ). It has no arguments –
but it’s still got both brackets, which RStudio will remind you of by automatically inserting the
closing bracket.
Let’s start with the log function. The function name is, unsurprisingly, log. The value is the value
we wish to find the log of, and there is only one option, which is the base of the log.
Again note the index number [1] at the beginning of the output line – which, as mentioned
previously, gives the index number of the first answer on that output line.
Recall that R ignores spaces in commands, whereas spaces make life a bit easier for us humans to
read the commands. So we could put spaces after the comma, or around the equals or anywhere
that helps.
or (being quite risky for an actuary) we could omit the option names altogether.
Just make sure you’ve got all the arguments in the brackets in the correct order!
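For example, with log these calls are all equivalent (a minimal sketch):

```r
log(1000, base = 10)      # option given by name
log(1000, 10)             # option given by position only - riskier
log(base = 10, x = 1000)  # named arguments can come in any order
```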
log(1000)
6.907755
R uses its default option setting which is base e, ie natural log (in this case ln 1000).
Non‐numeric outputs
Let’s just remind ourselves of some unusual outputs that we might get.
log(0)
–Inf
log(‐5)
NaN
Recall that this stands for “not a number” as the log function is undefined for negative numbers.
Errors in commands
Finally recall that R is case sensitive. So let’s try that command again (you can use the up cursor
key to get this command to reappear). If you type log with a capital L you’ll get this output:
So remembering the names of functions and their arguments is important and we’ll look at some
more ways to do this in the next chapter. However, you have probably already noticed that as
you type in functions in RStudio, it tries to predict what you are trying to do and also displays
information about the function selected. You can scroll down through the suggestions to find the
one you’re looking for.
If the function you are looking for is selected before you have finished typing, you can press enter
to jump to the arguments.
Exponential
The exponential function is exp(x)
The only argument is the value you’re finding the exponential of. It doesn’t have any options.
exp(‐5)
0.006737947
Square root
We covered powers in the previous chapter, so we could calculate the square root either by
raising to the power of a half or by using the sqrt function:
36^(1/2)
6
sqrt(36)
6
3 Trigonometric functions
As you’d expect these are given by sin( ), cos( ) and tan( ).
The only argument is the value you’re inputting and there are no options.
sin(30)
‐0.9880316
As you can see the function assumes the input is in radians not degrees (π radians = 180 degrees).
We can enter π as pi in R:
pi
3.141593
sin(pi/6)
0.5
cos(pi)
‐1
tan(pi/2)
1.633124e+16
This should be undefined. Why is it not? Because the value stored for pi is a little limited. There
are ways of getting more accurate values that we won’t cover here. Similarly:
sin(pi)
1.224647e‐16
This should be zero but is again hindered by the rounded value of pi.
The inverse trig functions arcsine, arccosine and arctangent are: asin( ), acos( ) and atan( ).
acos(1)
0
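The other inverse functions work the same way; a quick sketch you can paste into the console:

```r
asin(0.5)   # pi/6, ie 0.5235988
atan(1)     # pi/4, ie 0.7853982
acos(1)     # 0
```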
4 Factorial and choose
Factorials can be calculated using the factorial function:
factorial(6)
720
Let’s find the factorial of a different number. Recall that we can use the up arrow to cycle
through the previous commands. Use this to bring up the previous factorial command and
change it to calculate 0!
factorial(0)
The number of ways of choosing k objects from n, nCk, is given by the choose function:
choose(n, k)
This has two arguments which are the values n and k. There are no options. For example:
choose (10,3)
120
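The choose function agrees with the factorial formula n!/(k!(n−k)!), as this quick check (a sketch) shows:

```r
# Verify 10C3 two ways
choose(10, 3)
factorial(10) / (factorial(3) * factorial(7))   # same answer, 120
```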
Errors in commands
What happens if we enter only one of the two values? Say choose(20)?
choose(20)
Error in choose(20) : argument "k" is missing, with no default
R tells us helpfully that we’ve missed an argument and there’s no default. Which is a polite way
of telling us we’ve made a mess of things.
5 Summary
Key terms
Function An R command that takes inputs (values and options) and has the
following form: <function name>(<value1>, <value2>, …, <option> = …)
Key commands
log(x, base = b) log_b(x), default base = exp(1), ie it gives ln x
log(x) ln x
exp(x) e^x
sqrt(x) √x
sin(x), cos(x), tan(x) trig functions sine, cosine and tangent, in radians
factorial(x) x!
choose(n, k) nCk
6 Have a go
You will only get proficient at R by practising. Use R to calculate the following:
3 × (2 + 5)
4^(-5)
8 ÷ 0
log₁₀ 100
ln 5
log₂ 32
e^3
3^(7/4)
10!
⁸C₄
Getting help
Covered in R4
1 Generic help
In the last chapter we introduced some basic mathematical functions in R. However because R is
a command based programming language – if you don’t know how the commands work you’re
going to have trouble getting R to do what you want.
General help
help.start( )
In RStudio, this will display, in the bottom‐right window, general help html page which links to all
the manuals, reference documents and other miscellaneous material:
The most useful general item is “An Introduction to R” which is the starting R manual. However,
the manuals and FAQ documents are clearly intended for experts and so aren’t going to help you
much until you get much more proficient.
RStudio has its own help page which you can access by clicking the home button in the Help
window:
This links to lots of useful information and you might wish to try a few of the links yourself.
In particular, the first link, Learning R Online, recommends lots of free resources that you could
use, instead of this one if you wish, to get to grips with R.
(Alternatively use the function args, eg try typing args(log) in the Console.)
2 Help on functions
To get help on a specific function, say log, type either:
help(log)
or
?log
Either will open up the relevant help page in the help window of RStudio:
The help page gives us lots of information, probably beyond what we were looking for, about the
function.
Another way of reaching the same page is by using the search box in the help window – circled in
red above.
Rather than reading about them you can run them in R, either by copying and pasting or by using
the example function:
example(<function name>)
However, some of these examples can often be a bit obscure. So if in doubt you may just end up
searching the internet (eg looking on YouTube) to see if someone can explain it more simply.
If you know the function’s name starts with a given phrase, say log, you can type that into the
Console to reveal the list of functions, as mentioned previously:
This will list all the commands that start with log (with a brief explanation of each). In this case
there are 8. You can scroll through the list and hit enter when you find the one you want.
If you’re still not sure which of these it is then you could look up each of them using the help
function as we did above.
apropos(“<phrase>”)
This comes from the French à propos, which literally means “to purpose”, ie with reference or with
regard to this purpose.
The only frustration is you have to put the letters that you’re searching for in speech marks:
You’ll see we get all the 8 functions starting with log that we saw earlier and a further 15
functions. (You may see even more depending on versions/packages installed.)
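For example (the exact results will vary with the R version and packages you have installed):

```r
# List all the functions on the search path whose names contain "log"
apropos("log")
```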
Alternatively, you can search all the help pages for a phrase by typing two question marks
followed by the phrase (eg ??log) in the Console, or by using the search box of the help window;
this displays a slightly different set of results:
However, you may just end up searching the internet for “How do I … in R?”.
The beginning of each entry in the search results gives the name of the package where that
function can be found:
You can see above that the function log is in the base package – ie the standard package that is
included in R. We’ll talk more about packages later.
Finally, you may wish to explore the other features that can be found in the Help menu of
RStudio, including a list of keyboard shortcuts, which might help you save time as you become
more proficient in R. For example, Ctrl 2 takes you to the Console.
3 Summary
Key commands
help.start( ) Takes you to the general html help page
help(<function>) or ?<function> Opens the html help page for <function>
example(<function>) Runs the examples for <function> given at the bottom of its html
help page
apropos(“abc”) Lists the functions that contain the phrase abc in their name
4 Have a go
You will only get proficient at R by practising.
Using scripts
Covered in R5
1 Scripts
So far we have been typing commands directly into the Console. However, this is not ideal for a
number of reasons: it’s harder to spot errors (as commands are interspersed with the outputs), it
has to be run line by line (so using the up arrow to re‐run multiple‐line commands causes issues),
and it’s also harder to share our work with our colleagues.
A better method is to use something called a Script. A Script is simply a text file with R commands
in it.
To create a new script use the menu: File, New File, R Script. RStudio opens up a new script
window in the top‐left corner of the screen. If a script is already
open then it will open up a new tab in the same window so you can have multiple scripts open at
the same time.
log(10)
sqrt(36)
sin(pi/2)
Just like a script for a play or movie which contains the lines that you read out – it has the lines of
commands which can be “read out” into the console.
We can do this by copying commands line by line or in blocks (using Ctrl+C or Edit/Copy) from the
editor and then pasting them into the console window (using Ctrl+V or Edit/Paste) and then
pressing enter to execute them. If you want to copy everything then first select all using the
standard shortcut command Ctrl+A.
This makes scripts a very efficient way of running the same code again.
However, a quicker way than copying and pasting is to “run” the commands from the script
editor.
To run a single command line in the Script, place the cursor in that line and then do one of:
press Ctrl+Enter
click on the Run button at the top of the Script window
use the Menu: Code, Run Selected Line(s).
This “reads” the script into the console window and executes it. Note that each line of code, and
its output, will appear in the Console as we run it:
Even more useful is that our cursor moves onto the next line in the script window ready for the
next command. So if we press Ctrl+Enter three times, we will run the three lines of code
displayed in this script:
Similarly to run several lines in the script we just select at least part of those lines and press
Ctrl+Enter.
Alternatively, there are useful options in the Code menu, as well as some more short‐cut keys
which you may find useful:
Later in this chapter we’ll give another way of running the whole script without even opening it.
2 Using comments
Since we may be coming back to our work some time later or sharing our scripts with colleagues,
it might be helpful to add comments as an audit trail so that we, or our colleagues, can follow
what we are doing.
We enter comments by preceding them with a #. R then ignores anything to the right of the #.
You can put comments on a separate line or after a command at the end of a line.
So let’s put a comment before our three commands in our script and also one on the same line of
a command but after it:
Writing comments as an audit trail is a really good habit to foster – especially for the CP2 exam
which tests audit trails.
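For example, a commented version of our three commands might look like this (a sketch):

```r
# Calculate some basic functions  <- a comment on its own line
log(10)       # natural log of 10
sqrt(36)      # square root of 36
sin(pi/2)     # sine of 90 degrees (the input is in radians)
```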
3 Saving a script
There are a number of ways of saving a script:
click on one of the save icons in either the toolbar or at the top of the Script
window
use the Menu: File, Save or Save As
Press Ctrl+S.
(Note that if you are working in R and not RStudio, and your cursor is in the console window then
using CTRL+S or the save icon won’t save the script but something called the workspace which
we’ll cover in a later chapter.)
If you try to close down a script editor window without saving it then you will be prompted to
save it.
However, if you close down RStudio without saving a script then you won’t be given a prompt.
But don’t worry, the script should still be there the next time you open RStudio.
If you don’t change the file location, by default R will save the script in what is called the “working
directory” (ie folder for those of you too young to remember DOS).
Let’s save the current script as “test” (note where it is being saved).
If you look in the folder where you saved the file – you can see that it has been saved with the
extension “.R”:
4 Opening a script
There are several ways of opening a script: use the menu File, Open File, press Ctrl+O, or click on
the open icon in the toolbar and select the file.
You can also write scripts outside RStudio, in a text editor such as Notepad. When we save such a
file, let’s put a “.R” extension on the end of the file name, for example, “notepad test.R”.
In fact, R and RStudio can read normal text files (which have the .txt extension) as scripts. So we
didn’t have to put the “.R” extension on the end. If the file had been saved with a .txt extension
then we could just as easily have opened it in the same way using RStudio.
If you are using a word processor such as Word to create scripts then you will need to save the
script as a “.txt” file to strip out all the “invisible” coding.
If you don’t, RStudio will probably just open up the file in the program it was written in.
6 Sourcing a script
We can actually run a whole script without opening it.
Find the test.R file we created earlier (or a different file you might have created) and source it
using the menu: Code, Source File.
(Note that we could again source a text file, ie one that ends in “.txt”.)
Even though it runs, nothing seems to happen in the console – all we have is the “source”
command:
Note the forward slashes in the filename path – more on that later.
To display the outputs of the commands in a sourced script, we need to ask for them explicitly.
There are two ways of doing this. The first is to use the print command.
print(<object>)
Object is the term used in R to describe a “thing” that we perform commands on. We’ll talk about
objects more in the next chapter.
If we load up the “test.R” script and put print( ) around all of the commands as follows:
Now we’ll save this script as before (Ctrl+S). Putting the cursor in the console we’ll source the file
as before using the menu item Code, Source File.
We can see that it has printed the output of the three commands.
The second way to display the outputs from the commands in the source script is to use the
source command:
The input value is the filename, which includes the file path, for example C:\Documents\R. (The
easiest way of getting this is to copy it from the address bar in Windows Explorer.) However, if we
enter the path with single backslashes, we get an error:
source(“C:\R\notepad test.txt”)
This is because R doesn’t like single backslashes (as they are used for other commands). So either
we need to change each of them to double backslashes (ie C:\\R\\test.R) or to forward (normal)
slashes (ie C:/R/test.R).
Let’s try again with forward slashes, but also set one of the arguments echo = TRUE. We need to
do this so that the output is printed in the console. Try leaving it out and you will see the same
problem we had earlier with no output.
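If you want to experiment without worrying about paths, you can write a small script to a temporary file and source it; a minimal sketch (the file name here is generated by tempfile, not a fixed path):

```r
# Write a two-line script to a temporary file...
script <- tempfile(fileext = ".R")
writeLines(c("log(10)", "sqrt(36)"), script)

# ...then source it with echo = TRUE so that both the commands
# and their output appear in the console
source(script, echo = TRUE)
```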
In fact, the input as well as the output has been displayed, which is even more useful.
There are a number of other arguments that you could use with the source command but, given
that we don’t think you’ll use it very much in your CS1 or CS2 studies, we won’t discuss them
here. You can always use the help menus if you find you need them.
7 Summary
Key terms
Script/editor window A window where commands can be written but not executed. We
can transfer them to the console window and execute them using
Ctrl+Enter
Comment Text following a #, which R ignores (ie does not execute) but we
can use to leave an audit trail
Menus
File, New File, R Script Opens a new script editor window
Code, Run Selected Line(s) Runs (ie transfers to the console and executes) the line of script
the cursor is on or the lines if several are selected.
Code, Run Region, Run All Runs (ie transfers to the console and executes) all the lines of the
script in the console
Code, Source File Runs a whole script file without opening it. Only displays outputs
in the console if explicitly told to do so (by using the print
command or the echo = TRUE argument)
Key commands
print(<object>) Displays the <object> (or its output) in the R console.
8 Have a go
You will only get proficient at R by practising.
– log 1000
– ¹⁰C₆
3. Without referring to this chapter (unless you get stuck), use R to:
Using objects
Covered in R6
Just like a calculator can store values in its memories we can store values in (or assign values to)
what are called variables (or objects).
Suppose I want to store the value 5 to the variable/object capital A. There are a few ways of
doing this. One is to write A=5:
R has executed this command but nothing else is displayed in the Console. If we want to see
what’s assigned or stored in the variable A (in the Console) we can use the function
print(<object name>). Alternatively, a quicker method is to type the name of the object (in this
case A) and execute it:
We can see that these return the value 5 as that is what has been assigned to the object A.
However, most users of R prefer to use the assignment operator, which is made up of a dash and
an inequality sign that together make an arrow. We can do this either way round:
8 ‐> B
This assigns the value 8 to the variable B. The arrow goes from the 8 to the B.
Alternatively:
C <‐ 10
This assigns 10 to the variable C. The arrow goes from the 10 to the C.
Ensure that you don’t enter a space between the dash and the inequality sign, because together
the two symbols are treated as one operation.
Again, using the print command or just typing the object (or variable) name displays the values
that have been assigned (ie stored) in those objects:
As we’ll discover later, we can assign all sorts of things to an object including characters (such as
names of policyholders) or even the results of functions. Since a function could take up many
lines, many users of R tend to use the second version which puts the object first. This makes it
easier to see it in the coding.
You can clear all the values you have allocated using the icon circled in red above. (You can also
use the menu: Session, Clear Workspace.) This effectively wipes R’s memory. But be careful, as
this can’t be undone.
We can use these named objects we’ve created in, for example, simple mathematical operations
such as addition, subtraction, multiplication and powers:
We can also put objects in functions such as square roots, logs, exponential or combinations:
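Putting the pieces together, here is a minimal sketch (re-creating the objects A, B and C so that it runs on its own):

```r
A <- 5
B <- 8
C <- 10

A + B         # 13
C - A         # 5
A * B         # 40
C^2           # 100

sqrt(C - 1)   # 3
log(C)        # natural log of 10
exp(A - 5)    # e^0 = 1
```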
And we’ll see later that we can store all sorts of other things in objects including characters and
even functions.
Naming objects
You can name objects using letters (a‐z, A‐Z), numbers (0‐9), full stops (.) and underscores (_) and
they can be as long as you like. However, they must start with a letter.
For example you could call an object “bob” or “larry123” or “go.west” or “age_at_death”. But
you can’t use spaces. For example, if we try to assign 5 to the object “bob larry” we get:
Just like on a calculator if we assign another value to the same memory it overwrites it.
We can see that R has overwritten the value 5 that was stored with the value 6 and it didn’t even
warn us that it had done so.
You can use the Environment window in RStudio to keep track of the objects you’ve assigned
values to.
It is good practice to give objects meaningful names. For example, if we’ve got data on age of
death then we should store it in an object with this name.
Given that we can’t use spaces in an object name, sensible alternatives could include full stops,
underscores or capitals:
age.of.death
age_of_death
AgeOfDeath
Removing objects
Once we’ve assigned a value to an object it will stay in R’s memory (called the workspace) until
we exit R or manually remove it. We can remove an object, say A, using the remove command:
remove(A)
An abbreviation for the remove command is rm. So typing rm(B) removes object B.
Rather than repeating this for each of the objects we want to remove, we could just list all the
objects we want to remove as the arguments. For example:
remove(C, D)
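A short sketch of removing objects; the last line, rm(list = ls()), clears everything, which is the command equivalent of the menu Session, Clear Workspace (and, like the menu option, it can't be undone):

```r
D <- 1
E <- 2
remove(D)     # D is gone
rm(E)         # rm is just an abbreviation of remove
exists("D")   # FALSE

# Remove every remaining object in the workspace in one go
rm(list = ls())
```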
4 Summary
Key terms
Object Something which stores data which R can perform commands on (also
called a variable).
Object names can include letters (a‐z, A‐Z), numbers (0‐9), full stops (.)
and underscores (_). They must start with a letter.
Menus
Session, Clear Workspace Remove all objects in the current working environment
Key commands
<‐ Assigns a value (on the right) to an object (on the left)
‐> Assigns a value (on the left) to an object (on the right)
print(<object>) Displays the <object> in the R console. For objects which have a
value assigned to them it displays the value.
5 Have a go
You will only get proficient at R by practising.
Without referring to this chapter (unless you get stuck), use R to:
Store the values 5, 10, 15, 20 and 25 in the objects V, W, X, Y and Z, respectively.
Calculate V – W, V/W, X*Y, ln(Z), exp(W)
Remove the object V
Remove the objects W and X in one go.
Remove all the objects in the working memory.
Workspaces
Covered in R7
Workspaces
Working directory
RStudio Projects
1 Workspaces
When we open up R we start what is called a session. In this chapter we will look at how we can
keep a history of the commands we entered and the objects we’ve created so that they can be
used in another session.
We’ll now enter some of the commands we used in the last chapter:
A <- 5
A
8 -> B
C <- 10
A+B
A-B
A*B
We have created three objects (A, B and C) and these are stored in R’s local environment or
working memory. This is called the workspace.
You can see the values assigned to A, B and C in RStudio’s Environment window.
We can also see a record of all the commands entered into the console so far in the History
window, a different tab next to Environment:
You can delete some or all of your history using the icons circled above.
If we reopen RStudio you will see that (assuming you didn’t save the workspace after working
through an earlier chapter) we have lost the objects, but retained the history record, which is
stored in a file called “.Rhistory”.
If we do save the workspace on exit, R saves a second file, called “.RData”, which contains all the
objects we created. When we next open RStudio it loads up the workspace called .RData from the
folder ~/R/. We’ll see where that folder is saved in a minute.
You should still now be able to see the values assigned to A, B and C as well as the history of
previous commands.
Rather than save your data in the working directory on exit, you may wish to save it in a different
folder. You can do this using the Session menu, or by using the save icon in the Environment
window. You can also save the history record somewhere other than the default location using
the save icon in the History window.
You can then open an old workspace using Session, Load Workspace from the menu, or using the
open icon from the Environment window.
Overwriting warning!
A brief word of warning about loading .RData files. Doing so overwrites any objects with the same
name. So if you had a different value assigned to object A in this new RData file, it would replace
whatever was in object A currently, without any warning.
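The command equivalents of these menu options are save.image and load (the filename below is our own choice):

```r
# Save every object in the current workspace to a file
save.image("mywork.RData")

# Restore the workspace later; any current objects with the same
# names are overwritten without warning
load("mywork.RData")
```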
2 Working directory
We can find the location of this directory/folder using the “get working directory” command:
getwd()
The tilde sign, ~, in a path name is a shortcut used by R. We can find what the shortcut is using
the path.expand command:
path.expand("~")
[1] "C:/Users/username/Documents"
Depending on your computer setup, tilde (~) and the working directory may be in a folder named
R in Program Files or elsewhere.
If you click on the icon next to the working directory name near the top of the Console, then you
can see the files stored in the working directory in the Files window, which is similar to Windows
Explorer. The file list includes the two files mentioned above:
Alternatively, you can use the set working directory function whose argument is the new
directory/folder in speech marks:
setwd("<folder address>")
For example:
setwd("C:/Users/username/Documents/R")
or more quickly:
setwd("~/R")
Another way of setting the working directory is through the Files window menu:
You can navigate to the folder you want to use and then click on: More, Set as Working Directory.
Note that the Home folder, circled above, is the path of the tilde, ~.
If you wish to set your working directory to a folder that isn’t a sub‐folder of Home, then you will
need to use one of the other methods above.
All of these methods only change the working directory temporarily. If you close down RStudio
and reopen it, it reverts to the default directory. If you wish to change it permanently you need
to use the Menu: Tools, Global Options, General:
3 Projects
If you save your work in the default working directory, you will automatically pick up where you
left off. But you will probably want to keep work in different places with different names for easy
reference.
One way of doing this is to save your scripts, workspace and/or history record, as discussed
previously, and open them again when you want them.
A project stores a workspace, history file and scripts, as well as the working directory, all together.
So if you’re about to start a significant piece of work, for example, an assignment, it’s probably a
good idea to start a new project. This is easily done from the menu using File, New Project.
After clicking on New Directory and then New Project you can enter the project name, chosen
here to be Test Project:
A final click on Create Project and you will be working in your new project.
RStudio has automatically set the working directory to the new location, which you can see at the
top of the Console window:
Let’s create a new object, E, for this project and assign 200 to it:
We’ll now close the project down using File, Close Project, clicking “Save” when it gives us the
prompt:
Note that the prompt tells us where we are saving the data.
If we then reopen RStudio, and open our project using the File menu:
our data and history will be restored and we can carry on with our work.
1. Clear the Environment of its objects; clear the history, close down any scripts and then
save the workspace as you exit RStudio.
2. Go to the working directory in Windows Explorer and delete the .RData and .Rhistory
files.
4 Summary
Key terms
Workspace R’s working memory which contains the objects created during
that session (as well as any data or packages that are loaded)
Working directory The default folder where R loads files from or saves files to
Menus
Session, Save Workspace As Saves the current workspace (ie objects created)
Session, Set Working Directory Changes the working directory (ie the default location where R
loads files from or saves files to) for this session
File, New Project Opens a new project (where you can store all work for an exercise
or assignment)
File, Open Project Opens an existing project (can also use the Recent Projects
option)
File, Close Project Closes the current project, will prompt you to save on exit
Tools, Global Options You can change the default working directory in the General tab
Key commands
setwd("<folder address>") Sets the working directory to the specified folder (eg "C:\\")
5 Have a go
You will only get proficient at R by practising.
3. Enter code in the script that will assign the values 10, 50 and 100 to the objects X, Y and Z
respectively.
4. Also, in the script, set a new object Answer to equal X*Y/Z and then to output Answer.
6. Check the History window matches the Console and then view the Environment window.
Packages
Covered in R8
Loading Packages
Help with packages
Unloading and uninstalling packages
Installing packages
1 Loading packages
Packages are the name given to a collection of related R functions, data and code stored in a
particular format that R can load up.
In this chapter we will look at how we can install packages and get help on them.
Overview
To use a package in R we have to go through two steps:
CRAN --(install)--> library --(load)--> workspace
First we have to install the package onto our computer’s hard drive. We only do this once. The
directory or folder where packages are stored is called the library.
Second we have to load the package into R’s workspace so we can use its features. We will have to
do this every time we want to make use of this package as to save time and memory R only loads
the standard (or base) packages each time. However, we will see that RStudio simplifies the process
greatly.
As well as containing the values assigned to any objects we’ve made, the workspace also contains
the functions/data/code from any packages that are loaded.
We can see what is currently loaded into the workspace using the search command:
search()
We can see that a number of packages (stats, graphics, grDevices, utils, datasets, methods) are
loaded up in addition to the standard (base) package. We won’t worry about the other items now.
We can list all the packages installed in our library using the library command with no arguments:
library()
This opens up a new window listing all the packages installed and available in the library.
However, RStudio makes things much easier because there is a tab, called Packages, in the
bottom‐right window which displays the packages installed in your library.
Loading packages
We’ll now look at how we can load packages (that have been installed on our computer) into R’s
workspace (so that we can use their features).
All you need to do in RStudio is to use the tick box next to the package in the Packages window. For
example, here we have loaded the graphics package:
You’ll see in the Console that R has used the library function to load the package:
If you ever need to use this in a program then you can copy the code into your script.
2 Help with packages
The first way of getting help is to use the help files that accompany the package. One way to access
these uses the library function. For example, for help with the package called MASS, we use:
library(help=MASS)
This opens a window which contains an introduction (which includes the dependencies, ie other
packages which are necessary for this package to work):
Scrolling down it gives lists of the new functions and datasets included in the package:
Alternatively, click on the name of the package in the Packages window to open up the relevant
pages in the Help window:
Clicking on “description” gives the information we saw above in the introduction. And then
underneath it lists all the package’s functions and datasets intermingled in alphabetical order.
Clicking on the links will take us to the help page on that specific function or dataset.
Some packages also come with demonstrations which can be accessed via the demo command. In
the same way that using the library command with no arguments lists all the packages installed on
our machine, using the demo command with no arguments lists all the demos available for all the
packages installed on our machine:
demo()
To run a particular demonstration we pass its name as the argument:
demo(<demo name>)
For example:
3 Unloading and uninstalling packages
Unloading
You are unlikely to need to unload a package, as its presence in R doesn’t affect the running of
other functions in any way. However, if you do wish to unload (or detach) a package, all you need
to do is uncheck its box in the Packages window.
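From code, the same job is done by the detach function; a sketch, assuming the MASS package is currently loaded:

```r
library(MASS)                          # load the package
"package:MASS" %in% search()           # TRUE while it is loaded

detach("package:MASS", unload = TRUE)  # unload it again
"package:MASS" %in% search()           # FALSE once detached
```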
Uninstalling
Again, you probably won’t have a need during your CS1 or CS2 studies to uninstall a package, but
it’s possible in RStudio using the small cross on the package line in the Package Window.
This deletes the package from your library and if you decide you want it again you will need to
reinstall it using the instructions in the next section.
4 Installing Packages
Visiting the CRAN website will take us to a page where we need to choose what is called a CRAN mirror:
Recall that CRAN stands for the Comprehensive R Archive Network. Essentially there are lots of
sites on the network which hold identical contents, hence the term “mirror”. You can scroll down
the page and choose one near to you or just select the first option on the list, “0-Cloud”, which should
automatically direct you to the nearest one.
Next click on packages on the left hand menu and then we can choose to view the packages sorted
by name or by date of publication:
If we choose to view them by name, we see the packages listed in alphabetical order, each with a
brief sentence describing its contents.
If you need to use a package in your CS1 or CS2 studies then you should be told what package that
is and so you won’t need to look too far. However, you do need to know how to install a package
if it isn’t one of the default packages automatically installed with R.
Installing packages
RStudio enables us to install packages with ease. Say we are asked to install the package called
ggplot2. First click on the Install button in the Packages window:
Start typing the name of the package and then select it from those displayed before clicking Install.
Once the relevant files have been downloaded you will be able to see the package listed in the
Packages window.
If you look back in the Console you will see that R has used this command to install the package:
install.packages("ggplot2")
You can then load it using the checkbox in the Packages window or the library function we saw
earlier:
library("ggplot2", lib.loc="~/R/R-3.5.1/library")
5 Summary
Key terms
Install Get a package from the web onto the computer’s hard drive (library)
Library Directory on the computer’s hard drive where packages are stored.
Load Get a package from the hard drive’s library into R’s workspace
Key commands
install.packages("<package name>", lib = "<library location>")
Installs the chosen package into the given library (or the default
library location if not specified)
library("<package name>", lib.loc = "<library location>")
Loads the chosen package from the given library (or the default
library if omitted) into R’s workspace
library(help = "<package>") Opens the help file for the specified package
6 Have a go
You will only get proficient at R by practising.
1. Choose another package and install it, then load it. Use the help pages to use one of its
functions.
Data Types
Covered in R9
Types of data
Numeric data
Character data
Logical data
Complex data
Raw data
1 Types of data
We mentioned in a previous chapter that R stores information in data structures called objects.
We looked at how we could store (assign) numbers in these objects by using the assign command,
<-.
A <- 5
To see what has been assigned to an object we can either type its name or use the print function:
print(A)
However, in addition to numbers there are other types of data that can be assigned to (ie stored
in) objects. The five data types (sometimes called the “atomic modes”, “atomic vectors” or
“primitive objects”) are:
numeric
character
logical
complex
raw.
2 Numeric data
Numeric data are real numbers such as ‐2, 3.7 and pi. We can use the is.numeric function to
determine whether data is a real number or not. For example:
is.numeric(-2)
TRUE
is.numeric(2+3i)
FALSE
is.numeric("Bob")
FALSE
is.numeric(A)
TRUE
Numeric can be subdivided further into: integer and double, depending on how the number is
stored in R’s working memory.
Double stands for double precision, that is a floating point number such as 3.78e+12, which is
stored in two pieces – the significand (ie the 3.78) and the exponent (ie the 12). This is the default
type of numeric data.
is.double(3)
TRUE
is.integer(3)
FALSE
So even though 3 is an integer, it is stored in R’s working memory as a floating point number.
Integer in R means that it is an integer and it is not stored as a floating point number. We can
force R to store an integer as a non‐floating point number by placing a capital L after the number:
is.integer(3L)
TRUE
3 Character data
Character data (sometimes called strings or character strings) are qualitative data such as names
of policyholders or cities. We use single or double quotes to specify data as character:
is.character("Bob")
TRUE
is.character('Larry')
TRUE
is.character("2")
TRUE
Even though 2 is a number, by using quotes we are telling R to treat it as a character, just like we
might type '2 in Excel to make it text.
is.character(Alice)
Because Alice has not got quotes and is not a number, R assumes it must be an object. But since
we haven’t assigned anything to Alice, R gets a bit confused.
We can store character data in objects. For example, to store the name “Bob” in object B we
type:
B <- "Bob"
B
"Bob"
is.character(B)
TRUE
A useful function is nchar(), which counts the number of characters in character data. For
example:
nchar("actuary")
7
If our qualitative data are categorical, that is if they can only take a specified number of categories
(eg policy type, gender, etc), then we would want to store them in a special object called a factor.
We’ll cover this in the next chapter.
4 Logical data
Logical data refers to data values which take one of the two Boolean states TRUE and FALSE.
is.logical(TRUE)
TRUE
is.logical(FALSE)
TRUE
is.logical(true)
Only the upper case TRUE and FALSE are used, so when R encounters true it thinks it must be an
object. But since we haven’t assigned anything to it, R has got confused.
C <- FALSE
C
FALSE
is.logical(C)
TRUE
We can use the abbreviations T and F to stand for TRUE and FALSE:
is.logical(T)
TRUE
is.logical(F)
TRUE
But unlike TRUE and FALSE, T and F are not reserved words in R. If we try to assign something to
the object FALSE we get the following:
That’s because FALSE is a reserved word and so cannot be used as an object name.
However, since T and F are not reserved we can treat them as objects that we can assign things
to. For example:
T <- 10
is.logical(T)
FALSE
If we had a wicked sense of humour we could even assign TRUE to F and FALSE to T and cause all
sorts of havoc! Hence, since T and F are not reserved it would be wise to always use TRUE and
FALSE to prevent such problems arising.
It should be mentioned that NA, standing for “not available”, is also treated as logical data:
is.logical(NA)
TRUE
is.na(NA)
TRUE
We will make extensive use of NA for missing data. Like TRUE and FALSE, NA is a reserved word
and so cannot be used as an object name.
5 Complex data
Complex data are used for complex numbers such as 2+3i.
is.complex(2+3i)
TRUE
is.complex(2)
FALSE
is.complex(2+0i)
TRUE
is.complex(i)
R only recognises i as an imaginary number if there is a number before it. Otherwise it thinks it is
an object.
D <- 5-2i
D
5-2i
is.complex(D)
TRUE
6 Raw data
This final data type stands for raw byte data in hexadecimal. For example, R stores the word
“actuary” as the following raw bytes: 61 63 74 75 61 72 79.
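We can check this ourselves with the built-in charToRaw function:

```r
# Convert a character string to its raw (hexadecimal) bytes
bytes <- charToRaw("actuary")
bytes            # 61 63 74 75 61 72 79
is.raw(bytes)    # TRUE

# And convert the raw bytes back to a character string
rawToChar(bytes) # "actuary"
```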
For the purposes of our actuarial studies we will only be working with numeric, character and
logical data.
Coercion
When R imports data it will make (what it thinks) is a sensible decision as to what type of data it
is. If we want to tell R that it is a particular type of data we can coerce it using the “as.” functions.
For example, earlier we saw that R will, by default, store all numbers (eg 3) using double precision
(ie floating point values). If we want it to store 3 as an integer we would type as.integer(3).
We could also swap between data types using coercion as the following examples show:
as.integer(4.5)
4
as.character(5)
"5"
as.logical(1)
TRUE
as.complex(7)
7+0i
as.numeric(TRUE)
1
as.character(TRUE)
"TRUE"
as.complex(TRUE)
1+0i
as.numeric("2.5")
2.5
as.logical("true")
TRUE
as.complex("5-4i")
5-4i
However, sometimes this just doesn’t make any sense and R will return an NA (not available)
result. For example:
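One such case (a sketch; the particular string is our own) is coercing text that isn’t a number to numeric:

```r
# Coercing text that isn't a number returns NA, with a warning
as.numeric("actuary")
# [1] NA
```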
In the next chapter we’ll look at the data structures (objects) that R stores data in.
7 Summary
Key terms
Data type The information that can be stored in objects – can be one of the
following: numeric, character, logical, complex or raw
NA Not available – used for missing data or function results if it’s not
possible to perform the action
Logical Data type for the logical results TRUE, FALSE and NA
Double Stands for double precision, which refers to the two values used in
a floating point value (eg 2.78e‐12) – default way numeric data is
stored in R’s memory
Integer Integer numeric data which is not stored as floating point values
in R’s memory
Key commands
is.<data type>(<object>) Logical test of whether <object> has the <data type>
as.<data type>(<object>) Coerces the R <object> into the required <data type> if possible
8 Have a go
You will only get proficient at R by practising.
is.numeric(5+0i)
is.integer(5)
is.double(5L)
is.logical(0)
as.integer(3.47)
as.integer("3.47")
as.numeric(3i-2)
as.logical(0)
as.complex("3i-2")
as.character(3.47)
as.character(FALSE)
as.character(NA)
nchar("hello")
nchar(3.47)
nchar(FALSE)
nchar(3i-2)
Objects
Covered in R10
Types of objects
Vectors and matrices
Arrays, data frames, lists and factors
1 Types of objects
R is an object-oriented language. It stores information in data structures called objects.
Everything in R is an object (even unassigned numbers are treated as objects with no name) and
we perform operations on these objects.
In the previous chapter we looked at the types of data that can be stored in objects. These were
numeric (ie real numbers such as 2.7 or pi), character (ie qualitative data such as policyholders’
names), logical (ie TRUE, FALSE and NA), complex (ie complex numbers such as 3 + 2i) and raw
(which was raw data bytes).
In this chapter we’re going to look at six different types (or classes) of objects (ie data structures)
that we can store the data in, which are:
vectors
matrices
arrays
data frames
lists
factors.
The type of object is called its class. We can find the class of an object by using class(<object>).
The class of an object tells R how functions interact with it. For example, using the print command
on a vector object just displays its contents, while using it on the result of a function call displays
that result:
A <- 5
print(A)
print(log(A))
Everything we have worked with so far has been a vector. Even unassigned values are considered
vectors with a single element. For example:
is.vector(5)
TRUE
length(5)
1
Unlike a vector in maths, which contains only numbers, vectors in R can contain any of the five
types of data that we met in the previous chapter (numeric, character, logical, complex and raw).
Examples of vectors would include:
(3.7, 1.4)
("bob", "larry", "ginger")
(TRUE, TRUE, FALSE, TRUE)
(2, 7, 4)
(3i, 5i, 8.7i)
However R displays vectors horizontally rather than vertically. We will look at vectors in detail in
the next chapter.
A matrix is a two‐dimensional object containing data of the same type. It is essentially composed
of several vectors of the same length. You can find out whether an object is a matrix by using
is.matrix(<object>). The dimensions are called rows and columns and the numbers of each can be
found by using the nrow(<matrix object>) and ncol(<matrix object>), respectively. Alternatively,
you could use the dimensions command dim(<object>) to get both the number of rows and
columns.
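A small sketch of these commands (the matrix and its values are our own):

```r
# A 2x3 numeric matrix, filled column by column (R's default)
M <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
is.matrix(M)  # TRUE
nrow(M)       # 2
ncol(M)       # 3
dim(M)        # 2 3
```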
Again, unlike matrices in maths which contain only numbers, matrices in R can contain any of the
five types of data (numeric, character, logical, complex and raw). Examples include:
a 2×2 numeric matrix:
3    2.1
4.9  8.6
and a 3×2 character matrix:
"barry"  "alice"
"harry"  "belinda"
"larry"  "chelsea"
A data frame is a two‐dimensional object (like a matrix). However, whilst each column (ie vector)
contains data of the same type the different columns (ie vectors) can be a different data type.
This will be most useful for statistical analysis where each row represents a single observation (eg
a single policyholder). For example, a data frame could include policyholders’ names, their ages
and their smoker status:
Alfie 34 TRUE
Belinda 28 FALSE
Charlie 31 FALSE
Delilah 38 TRUE
You can find out whether an object is a data frame by using is.data.frame(<object>). We will look
at data frames in more detail in a later chapter.
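The table above could be built with the data.frame function (the column names are our own):

```r
# One row per policyholder; each column is a different data type
policyholders <- data.frame(
  name   = c("Alfie", "Belinda", "Charlie", "Delilah"),
  age    = c(34, 28, 31, 38),
  smoker = c(TRUE, FALSE, FALSE, TRUE)
)
is.data.frame(policyholders)  # TRUE
```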
A list is a one‐dimensional ordered collection of data (like a vector) but the data items don’t have
to be the same type. We can have lists of things like vectors, matrices and data frames and even
lists! An example might be:
You can find out whether an object is a list by using is.list(<object>). Like a vector, the dimension
is called length and can be found by using the length(<list object>) command.
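A sketch of building such a list (the contents here are our own):

```r
# A list can mix data types and even contain other objects
mixed <- list(c(1.5, 2.5), "bob", TRUE, matrix(1:4, nrow = 2))
is.list(mixed)  # TRUE
length(mixed)   # 4
```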
Factors are vectors of characters where the entries are categorical data (eg gender, insurance
group, country). Each entry can only take one of a specified number of categories (eg
male/female, or groups 1‐15, or UK, US, etc). We call these categories the levels of the factor. By
default, R will assign the levels alphabetically (so female=1 and male=2). If the categorical data
are ordinal (eg high/medium/low), then we use an ordered factor.
We have to be a bit careful when importing data into R as it often assumes that character data are
factors (for example, policyholders’ names). So we might need to use coercion to tell R what type
of data values they are. We met coercion in the last chapter.
4 Summary
Key terms
Object Something which stores data which R can perform commands on
Factor Vector of characters where the entries are categorical data (eg
gender, insurance group, country)
Key commands
class(<object>) Displays the class of an object
is.<object type>(<object>) Logical test of whether <object> has the <object type>
There is not a “Have a go” section in this chapter as we explore the key types of objects in more
detail in the next few chapters.
Vectors
Covered in R11
Creating vectors
Naming vectors
Indexing vectors
Vector arithmetic
1 Creating vectors
In the last chapter we briefly described the six types of objects (called classes) that R uses to store
data. In this chapter we look at the most fundamental of these which is the vector.
A vector is a one‐dimensional ordered collection of data of the same type (numeric, character,
logical, complex or raw). Vectors are also called atomic vectors as there is no object more basic
than this – even unassigned values (eg 5) are considered vectors with a single element.
You can find out whether an object is a vector by using is.vector(<object>). The dimension is
called length and can be found by using the length(<vector object>) command.
The simplest way to create a vector is to use the c( ) function. The c stands for “concatenate”
which means “combine” or “join together”. Suppose we want to make a numeric vector, called v,
containing the numbers 1 to 10. We could do this as follows:
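Using the c function, this looks like:

```r
# Combine the numbers 1 to 10 into a vector called v
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
v
# [1]  1  2  3  4  5  6  7  8  9 10
length(v)
# [1] 10
```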
Even though in maths we might write a vector vertically, we can see that R displays vectors
horizontally, which is why it measures the number of elements by length.
Rather than saying vector it gives the type of data it holds, as a vector is the default object and is
defined by its elements.
Another way of obtaining all this information in one go is to use the str(<object>) command which
displays the structure of an R object:
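For the vector v above, a sketch of what str displays:

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
str(v)
#  num [1:10] 1 2 3 4 5 6 7 8 9 10
```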
It says it is numeric data, gives the dimensions and the contents (in this case all of the contents,
but for larger objects it will only give some of them).
Note that because of the length of this vector, when we look at its structure it won’t display all of
its contents:
Since a vector is a collection of data of the same type, if we try to create a vector of different data
types R will try to coerce them all to the same type. For example:
In this case it has converted the logical data into numeric data (TRUE becomes 1 and FALSE
becomes 0) so that it is now a vector of numeric data.
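A sketch of this coercion (the particular values are our own):

```r
# The logical values are coerced to numbers (TRUE -> 1, FALSE -> 0)
c(2.5, TRUE, FALSE)
# [1] 2.5 1.0 0.0
```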
A quicker way of generating the run of integers from 1 to 10 uses the colon operator:
1:10
c(1:10)
If we want some other sequence of numbers we could use the sequence generation command:
“from” is the value the sequence starts at. The default value is 1.
“to” is the value it finishes at – if only one argument is given it will assume this is the “to”
and the “from” argument will take the default value of 1
“by” is an optional argument which gives the steps the sequence increases by, its default is
±1 unless the length option is used
“length” is an optional argument that gives the required length of the sequence.
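Sketches of seq using these arguments (the particular values are our own):

```r
seq(10)                # 1 to 10: the single argument is the "to"
seq(2, 10)             # 2 to 10 in steps of 1
seq(2, 10, by = 2)     # 2 4 6 8 10
seq(0, 1, length = 5)  # 0.00 0.25 0.50 0.75 1.00
```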
Finally, we could use the many functions that produce simulations from common distributions
such as runif(n) which returns n values from the U(0,1) or rnorm(n) which returns n values from
the N(0,1) distribution:
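For example (the results are random, so yours will differ; set.seed simply makes the draw reproducible):

```r
set.seed(123)  # fix the random seed so the simulation can be repeated
runif(3)       # three simulated values from U(0,1)
rnorm(3)       # three simulated values from N(0,1)
```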
2 Naming vectors
It may be the case that we wish to name the elements in our vectors. For example suppose we
want to create a vector, age, which contains the ages (34, 28, 47) of three policyholders (bob,
larry and ginger). If wish to keep the names of the policyholders associated with their ages in the
vector we could do this as follows:
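One way is to name each element inside the c function itself (a sketch):

```r
# Name each element as it is combined
age <- c(bob = 34, larry = 28, ginger = 47)
age
names(age)  # "bob" "larry" "ginger"
```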
We can see that when the vector is displayed it also includes the names. Similarly the names are
given if we use the structure command:
It says that it is a named numeric vector with 3 elements (34, 28, 47). Then it gives the attributes
of the vector (ie its names attribute) which are the character data (“bob”, “larry”, “ginger”).
An alternative way of performing this action would be to use the names function. Suppose we
have another vector, claim.free, which gives the number of claim free years (3, 0, 8) for the three
policyholders (bob, larry and ginger). We can assign the names to the vector claim.free as
follows:
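A sketch using the names function:

```r
claim.free <- c(3, 0, 8)
names(claim.free) <- c("bob", "larry", "ginger")
claim.free
names(claim.free)  # "bob" "larry" "ginger"
```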
Note that you don’t have to name every element in a vector; you could use “” for those elements
to which you wish to give no name.
3 Indexing vectors
Sometimes we may be interested in only some of the elements in a vector. To do that we use
indexing.
Recall that in a previous chapter we said that the [1] at the start of the output line referred to the
index value of the first answer. The square brackets tell R it’s an index and the number gives the
position of the element.
Earlier we defined the vector v to contain the numbers 1, 2, 3, …., 10. So to select the third
element we type v[3]:
Suppose we want to select the second and fifth elements of the vector v. If we try v[2,5] we get
the following:
That’s because with indexing it thinks we are referring to the second row and the fifth column.
Since a vector only has one dimension it is very confused. So to specify both elements in just one
dimension (ie both values are rows) we need to use the combine function as follows v[c(2,5)]:
The following show clever ways of specifying the third to seventh elements and all the elements
from the fourth value to the end:
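Sketches of these selections, using the vector v from earlier:

```r
v <- 1:10
v[3]            # 3 (the third element)
v[c(2, 5)]      # 2 5
v[3:7]          # 3 4 5 6 7
v[4:length(v)]  # 4 5 6 7 8 9 10
```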
To specify all elements except the specified ones we use a negative in front of their positions:
We can use the results of logical tests to select elements. For example we could select all the
values of v whose values are between 4 and 8 inclusive as follows:
We can see that all the elements for which the test is TRUE are selected.
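Sketches of these two forms of selection:

```r
v <- 1:10
v[-c(2, 5)]         # all elements except the 2nd and 5th
v[v >= 4 & v <= 8]  # 4 5 6 7 8
```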
Finally, if we have a named vector then we can select the elements using their names. For
example, using our vector of policyholders’ ages we could just select Larry’s age as follows:
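A sketch, using the ages vector from earlier:

```r
age <- c(bob = 34, larry = 28, ginger = 47)
age["larry"]  # the element named "larry", ie 28
```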
4 Vector arithmetic
The standard arithmetic operations work on vectors but on an element by element basis:
Whilst this is not quite the same as how vectors work in mathematics, it provides a powerful way
of performing the same operation on many values at once.
We can see from the following examples that operations between two vectors are also carried out
on an element‐by‐element basis:
Let’s have a look at what happens if vectors are of different lengths by first defining vectors v3
and v4 as the values (1,2) and (1, 2, 3), respectively:
We can see that the shorter vector v3 has been extended by repetition to (1, 2, 1, 2). This is
called recycling. This works fine as the length of v1 is a multiple of the length of v3.
Let’s look at what happens when this is not the case by adding vectors v1 and v4 together:
We can see that the shorter vector v4 has been extended by repetition to (1, 2, 3, 1) and a
warning message is displayed.
This process of recycling the shorter vector explains why v1 + 3 returns the following:
Recall that the default object is a vector. So 3 is treated as a vector of length 1. This is recycled to
(3, 3, 3, 3) and added to vector v1.
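Putting the recycling examples together (the values of v1 are not shown in the text, so c(1, 2, 3, 4) is our assumption):

```r
v1 <- c(1, 2, 3, 4)  # assumed values; the text only tells us v1 has length 4
v3 <- c(1, 2)
v4 <- c(1, 2, 3)

v1 * 2   # 2 4 6 8 (element by element)
v1 + v3  # 2 4 4 6 (v3 recycled to 1 2 1 2)
v1 + 3   # 4 5 6 7 (3 recycled to 3 3 3 3)
v1 + v4  # 2 4 6 5, plus a warning because the lengths don't divide evenly
```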
5 Summary
Key terms
Vector The default object type is a vector which is a one‐dimensional
ordered collection of data of the same type – it has one dimension
which is called length
Key commands
class(<object>) Gives the class of an object. Since atomic vectors are defined by
their data type it gives their data type instead
seq(from, to, by, length) Returns a sequence starting at “from”, finishing at “to”, either
increasing in steps of “by” (default ±1) or equally spaced so that
there are “length” elements in the vector
6 Have a go
You will only get proficient at R by practising.
Using :
2. What type of vector (numeric, logical, character) would be formed from each of the
following:
c(3, 2, FALSE)
c("larry", 7, 2)
4. Create a named vector of the temperatures 18, 20, 15 for the cities London, Paris and
Stockholm using the c function.
Create a named vector for the indices 6125.7, 17140.20 and 15323.10 for the FTSE 100,
Dow Jones and Nikkei 225 using the names function.
3rd element
6th-10th elements
6. Create a vector a of (1, 2, 3, 4, 5, 6), a vector b of (0, 1) and a vector c of (5, 1, 3, 2).
b‐1
b*c
a+b
a^b
a/c
Use indexing to obtain all the values which are greater than 2 and store this in m.
Use vectors m and n to obtain an empirical estimate of P(Z > 2) (ie the probability that
Z > 2).
Factors
Covered in R12
Creating factors
Specifying the order of the categories
Abbreviating the names of arguments
Changing the name of categories
Indexing and arithmetic
1 Creating factors
In the last chapter we looked at vector objects, which are one‐dimensional ordered collections of
data of the same type (numeric, character, logical, etc). These can be used to store the values of
one particular variable that is of interest to us, say policyholders’ names, ages, genders, etc.
In this chapter we look at a special vector used for storing categorical data (such as gender,
occupation, make of car, country, etc) called a factor. Unlike, for example, the policyholder’s
name, categorical data can only take one of a limited number of categories. For example, gender
can only take the categories male or female.
In R, the different categories are called levels and they are assigned the values 1, 2, 3, …, n. This
allows R to store them more efficiently (rather than treating each as unique) and use the
categories for graphing or as inputs in a statistical model, such as a generalised linear model.
As expected, we see it contains character data (“chr”), has six values and that they are stored in
R’s memory as “Male”, “Female”, etc.
factor(<vector object>)
Let’s take the data stored in the vector object gender and put it in a factor object, which we’ll call
gender.factor and print it out:
We can see that it prints the elements (but no longer as character values in speech marks) and it
also gives the levels (ie the categories) that the data can take. Note that the levels are, by default,
sorted into alphabetical order.
We can see that gender.factor is no longer a vector object but a factor object, although it still
has length 6:
We can see that it is a factor with two levels (“Female” and “Male”) which are by default in
alphabetical order. However, when it displays the elements we no longer see “Male”, “Female”,
“Female”, … but 2, 1, 1, … . This is because R assigns positive integers to each level/category. The
female category is assigned a value of 1 and the male category is assigned a value of 2, and R
stores these numbers in its memory instead. So essentially we can see that R has converted our
categorical data to an equivalent numeric vector. This saves memory (1 and 2 use less space than
“Female” and “Male”) and means we can put these numbers into functions.
We can use levels(<factor object>) to display the levels and nlevels(<factor object>) to display
the number of levels of a factor object:
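A minimal sketch of the steps above, assuming the six genders were Male, Female, Female, Male, Female, Male (the exact values appeared in a screenshot not reproduced here):

```r
# Character vector of genders, converted to a factor
gender <- c("Male", "Female", "Female", "Male", "Female", "Male")
gender.factor <- factor(gender)   # levels default to alphabetical order

levels(gender.factor)             # "Female" "Male"
nlevels(gender.factor)            # 2
as.integer(gender.factor)         # 2 1 1 2 1 2 - Female = 1, Male = 2
```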
In the example above we converted a categorical character vector into a factor. We could also
have entered the data directly using the factor command.
Suppose we collect the occupations (which can take the categories of blue collar, white collar and
professional) of the same six policyholders. Abbreviating them as bc, wc and prof they were:
wc wc wc bc wc prof
We can put these in a factor object, which we’ll call occupation, as follows:
We can again see that the numbers have been assigned alphabetically to each category, so bc is
level 1, prof is level 2 and wc is level 3.
For an object that already exists we can change the levels using the levels(<factor object>)
command. For example the gender.factor object has two levels currently in alphabetical order
(Female, Male). To change them to (Male, Female) we do the following:
The levels have changed order, male is first instead of female but the assignment of 1 to female
and 2 to male is unchanged. This is unfortunate as the data values were stored internally as 2, 1,
1, … and so whereas before that was Male, Female, Female,… it now says Female, Male, Male. So
what we have done is relabelled the levels of the factors and by doing this have changed our data
set! This is obviously not a good idea.
So we have to specify the order of the levels when we create the factor. Levels is an optional
argument of the factor command:
So let’s redefine the object gender.factor using the factor command but this time we’ll specify the
order of the levels to be Male then Female:
Now when we print it out and look at its structure we get the following:
We can see that the levels are now in the specified order (Male then Female) and the assignment
of the numbers now follows this order, so Male = 1 and Female = 2. Hence the data values are as
they should be Male, Female, Female, etc.
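A sketch of the redefinition, with illustrative gender data:

```r
gender <- c("Male", "Female", "Female", "Male", "Female", "Male")

# The levels argument fixes the order of the categories
gender.factor <- factor(gender, levels = c("Male", "Female"))

levels(gender.factor)       # "Male" "Female" - the order we specified
as.integer(gender.factor)   # 1 2 2 1 2 1 - Male = 1, Female = 2
```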
Let’s now redefine the factor object occupation but this time we’ll specify the order of the levels
to be bc, then wc and then prof:
Note that if you are entering the above command in an RStudio Script, rather than in the Console,
then you won't see the "+" symbol. This is just R telling us that it needs more input when we
press Enter before completing the command.
Now when we print it out and look at its structure we get the following:
We can see that the levels are now in the order we’ve specified and the assignment of the
numbers to the levels is in this order, so bc is now 1, wc is now 2 and prof is now 3.
levels is actually the second argument and so we could omit its name altogether:
We’ll now re‐enter the previous command but this time with the labels argument with the full
names in the same order as the levels argument:
We can see that the data are now printed in full. Unfortunately, because the first two category
names each contain two words, the display is a little confusing to read. So it would be wise to
use a full stop or underscore to separate the words.
We could re‐enter the whole command again but as we saw earlier we can change the labels of
an existing factor using the levels(<factor object>) command. This is a bit confusing as we’d
expect to use a “labels(<factor object>)” command but this is unfortunately the way R works. So,
being careful to ensure we keep the correct order of the levels/categories, we’ll change the labels
of the factor object “occupation” to “blue.collar”, “white.collar” and “professional” as follows:
We can see that not only have the levels been relabelled but using the full stops makes it much
easier to differentiate between the different data values. So levels inside the factor command
specifies the order only, however levels(<factor object>) replaces the category names (ie the
labels) with the new names.
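The relabelling described above can be sketched as follows:

```r
occupation <- factor(c("wc", "wc", "wc", "bc", "wc", "prof"),
                     levels = c("bc", "wc", "prof"))

# levels(<factor>) <- ... replaces the category names (ie the labels)
levels(occupation) <- c("blue.collar", "white.collar", "professional")
occupation[1]   # white.collar
```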
Indexing
Just like for vectors we can select some of the elements using indexing. Here are some examples
on the occupation factor:
Factor arithmetic
Whilst each category/level is stored as a positive integer, factors are, for all intents and
purposes, character data. As such, we can’t apply arithmetic operations to them. For example:
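For example, trying to add 1 to the occupation factor gives NA values together with a warning that "+" is not meaningful for factors:

```r
occupation <- factor(c("wc", "wc", "wc", "bc", "wc", "prof"))

# Arithmetic on a factor is not meaningful - R warns and returns NAs
result <- suppressWarnings(occupation + 1)
result   # NA NA NA NA NA NA
```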
6 Ordered factors
The categorical data we’ve been looking at in this chapter so far (gender and occupation) has no
intrinsic order to it. As such, if we try to compare policyholders, R warns us that the comparison
is not meaningful. For
example, comparing the first and second policyholder’s occupations (white.collar and
white.collar) gives the following:
However, some categorical data do have an inherent order. For example, we might have the
categories small, medium or large, which have the following order:
Or the categories strongly disagree, disagree, all the way up to strongly agree:
strongly disagree < disagree < neutral < agree < strongly agree
This kind of categorical data is called ordinal data and, in R, we store ordinal data in an ordered
factor. To do this we use the factor command as before with the optional argument “ordered”
set to TRUE (by default if it’s omitted it is set to FALSE which gives us nominal data).
Suppose our six policyholders are asked to describe their general health and the ordered
categories are poor, average and good:
We can put these in an ordered factor object, which we’ll call health, as follows:
However, because R sets the levels alphabetically by default, the order it gives is not the most
sensible:
It says average < good < poor. So poor is the best health category! The lesson here is to always
specify the desired order of the levels!
Re‐entering the command with the levels option and the ordered option:
We can see that they are now in the correct ascending order poor < average < good.
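A sketch of the ordered factor, assuming illustrative health responses for the six policyholders (the actual data values were shown in a screenshot):

```r
# ordered = TRUE makes this an ordered factor; levels fixes the ascending order
health <- factor(c("poor", "good", "average", "average", "good", "poor"),
                 levels = c("poor", "average", "good"), ordered = TRUE)

levels(health)          # "poor" "average" "good", ie poor < average < good
health[2] > health[1]   # TRUE - comparisons now work: good is better than poor
```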
7 Summary
Key terms
Factor A special type of vector used for storing categorical data
Categorical data Data which can only take one of a number of specified categories, eg
gender taking only Male or Female
Key commands
factor(<vector object>) Turns a vector into a factor. Has optional arguments of levels,
labels and ordered
8 Have a go
You will only get proficient at R by practising.
1. Create an ordered factor, results, containing some maths test results from 7 students:
(A, C, C, E, D, B, B)
Label the grades A‐E as Excellent, Good, Average, Below Average and Poor.
2. Use a command in R to check that the second student performed better than the fifth
student.
Matrices
Covered in R13
Creating matrices
Naming matrices
Indexing matrices
Matrix arithmetic
Other matrix functions
1 Creating matrices
Recall from a previous chapter that a vector is a one‐dimensional ordered collection of data of the
same type (numeric, character, logical, complex or raw). In this chapter we look at matrices.
A matrix is a two‐dimensional ordered collection of data of the same type. For example:
a numeric matrix:      a character matrix:        a logical matrix:

 3    2.1              "barry"  "alice"           TRUE   TRUE   FALSE
 4.9  8.6              "harry"  "belinda"         FALSE  TRUE   FALSE
                       "larry"  "chelsea"
We can find out whether an object is a matrix by using is.matrix(<object>). The dimensions are
called rows and columns and the numbers of each can be found by using the nrow(<matrix
object>) and ncol(<matrix object>), respectively. Alternatively, you could use the dimensions
command dim(<object>) to get both the number of rows and columns.
Suppose we want to create a 2×2 matrix called A containing the values 3, 2, 4 and 1. We’ll need
to use the concatenate, c( ), function in the first argument to let R know these 4 values are the
data. Otherwise it will think we have one value 3 in a matrix with 2 rows and 4 columns!
Or as long as we enter the arguments in the correct order we can omit their names:
A <‐ matrix(c(3,2,4,1), 2, 2)
Notice how by default R will fill the elements of the matrix by column:
To specify that we want to fill the elements of the matrix by row we need to specify another
argument of the matrix function, byrow, as TRUE.
To create a matrix B with these same data values but filled by row, we would type:
Or again, as long as we enter the arguments in the correct order we can omit their names:
Recall that another way of obtaining all this information in one go is to use the str(<object>)
command which displays the structure of an R object:
It says it is numeric data, gives the dimensions (rows, columns) and the contents listed in column
order (in this case all of the contents, but for larger objects it will only give some of them).
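Putting the matrix commands above together in one sketch:

```r
A <- matrix(c(3, 2, 4, 1), nrow = 2, ncol = 2)                # filled by column
B <- matrix(c(3, 2, 4, 1), nrow = 2, ncol = 2, byrow = TRUE)  # filled by row

A[1, ]        # 3 4 - first row of A
B[1, ]        # 3 2 - first row of B
dim(A)        # 2 2
is.matrix(A)  # TRUE
```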
Coercing
Since a matrix is a collection of data of the same type, if we try to create a matrix of different data
types, R will try to coerce them all to the same type. For example:
In this case it has converted the logical data into numeric data (TRUE becomes 1 and FALSE
becomes 0) so that it is now a matrix of numeric data.
So in the above example we can see that all 4 elements from Matrix A have been read in columns
and then used to fill up the first column of the new matrix. Likewise all 4 elements from Matrix B
form the second column.
We can see the new matrix is made up of the two new matrices stuck side by side (ie by column).
Similarly when we column bind two vectors:
We can see that the vectors are combined side by side (ie as columns) to form a matrix. This is
what we would expect as they are column vectors even though they are displayed horizontally.
We can see the new matrix is made up of the two new matrices stuck one on the other (ie by
row).
We can see that the vectors have been treated as row vectors and put on top of each other.
We can also use these functions if we want to add another row or column to a matrix:
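A sketch of cbind and rbind on vectors and matrices (the vectors and values are illustrative):

```r
v1 <- c(1, 2)
v2 <- c(3, 4)

cbind(v1, v2)        # 2x2 matrix: vectors side by side as columns
rbind(v1, v2)        # 2x2 matrix: vectors stacked as rows

M <- matrix(1:4, 2, 2)
rbind(M, c(9, 9))    # adds the row (9, 9) to the bottom of M
```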
2 Naming matrices
Recall that we could name each of the elements in our vectors. For matrices, we can name the
rows and columns. There are two ways of doing this. The first is to use the option dimnames in
the matrix command:
matrix(<data>, nrow = .., ncol = .. , byrow = FALSE, dimnames = list(<row names>, <col names>))
The dimension names are supplied as a "list" object because they form a single argument of the
command, yet we need names for both dimensions. We use the concatenate, c, command to combine the
row names and the column names. Finally, since the names are characters, they should be enclosed in quotes.
For example suppose we want to create a matrix, N, that contains the expenditure on rent, food
and bills for two individuals A and B:
A B
Using the dimnames(<matrix object>) command lists out the row and column names:
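A sketch of the dimnames option; the expenditure figures below are placeholders, as the original values appeared in a screenshot:

```r
# Placeholder figures: rent/food/bills for individuals A and B
N <- matrix(c(500, 200, 100, 600, 180, 120), nrow = 3, ncol = 2,
            dimnames = list(c("rent", "food", "bills"), c("A", "B")))

dimnames(N)     # a list: row names then column names
N["food", "B"]  # 180 - elements can now be selected by name
```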
These are also displayed when using the structure command, str(<object>):
An alternative way of naming the rows and columns of a matrix is to use the rownames or
colnames functions.
For example suppose we want to now create a matrix, P, that contains the expenditure on rent,
food and bills for two different individuals C and D:
C D
Note that if you create matrices using the cbind or rbind functions then the names of those
matrices will be used for the columns or rows, respectively:
We see here that the names of the vectors have become the names of the columns.
And in this second case the names of the vectors have become the names of the rows.
Since matrices have two dimensions we will need to give the row and column of the element we
want. So let’s define a bigger matrix:
     ( 1  4  7 )
 M = ( 2  5  8 )
     ( 3  6  9 )
We can choose the element in the first row and second column by writing M[1,2]:
To display all the elements of, say, the first row, we omit the figure for the column and type M[1,]:
Similarly, to display all the elements of the second column, we omit the figure for the row and
type M[ ,2]:
Notice that even though we have selected a column, R displays it horizontally, just as it does for
vectors.
To display more than one row or more than one column we simply enter the multiple values using
the c( ) command. For example, to display the elements in the 1st and 2nd rows of the 3rd
column:
Or the 2nd and 3rd rows of the 1st and 3rd columns:
We could also select multiple rows or columns using a:b, the consecutive integer function:
We can use the dimension commands nrow( ) or ncol( ) to specify all the rows or columns until
the end:
To specify all row/column elements except the specified ones we use a negative in front of their
positions:
We can select elements using the results of logical tests. For example we could select all the
values of M whose value is between 4 and 8 inclusive as follows:
We can see it returns all the elements from the matrix for which the test is TRUE. These are
collected together in a vector. However, it doesn’t give their original positions in the matrix.
Finally, if we have named our rows and columns, then we can select the elements using their
names. For example, using our matrix N from earlier which had row names of rent, food and bills,
and column names of A and B we have:
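The indexing methods above can be sketched on the 3×3 matrix M:

```r
M <- matrix(1:9, nrow = 3)   # columns are (1,2,3), (4,5,6), (7,8,9)

M[1, 2]              # 4 - first row, second column
M[1, ]               # 1 4 7 - the whole first row
M[c(1, 2), 3]        # 7 8 - rows 1 and 2 of column 3
M[2:3, c(1, 3)]      # a 2x2 submatrix
M[, -1]              # everything except the first column
M[M >= 4 & M <= 8]   # 4 5 6 7 8 - selected values returned as a vector
```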
4 Matrix arithmetic
We multiply or divide a non‐character matrix by a scalar as you’d expect:
 A = ( 3  1 )        2A = (  6  2 )
     ( 5  4 )             ( 10  8 )

Adding and subtracting matrices also works element by element:

 A = ( 3  1 )    B = ( 4  2 )    A + B = ( 7  3 )    A − B = ( −1  −1 )
     ( 5  4 )        ( 3  1 )            ( 8  5 )            (  2   3 )
Just like in maths, you can only perform this operation on matrices of the same dimensions:
However, unlike maths, if you add or subtract a scalar to a matrix, it does this to each element:
To multiply matrices you should use the operator %*% rather than * :
 A = ( 3  1 )   B = ( 4  2 )   AB = ( (3×4)+(1×3)   (3×2)+(1×1) ) = ( 15   7 )
     ( 5  4 )       ( 3  1 )        ( (5×4)+(4×3)   (5×2)+(4×1) )   ( 32  14 )
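A sketch of the arithmetic above. Note the difference between %*% (the matrix product) and * (element-by-element multiplication):

```r
A <- matrix(c(3, 5, 1, 4), 2, 2)   # A = (3 1 / 5 4), filled by column
B <- matrix(c(4, 3, 2, 1), 2, 2)   # B = (4 2 / 3 1)

2 * A      # every element doubled
A + B      # element-wise addition
A %*% B    # the matrix product: (15 7 / 32 14)
A * B      # element-wise product - NOT the matrix product
```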
Transpose
To obtain the transpose of a matrix we use the t(<matrix object>) function:
Determinants
These can be found using, unsurprisingly, the det(<matrix object>) function:
 A = ( 3  1 )        det A = (3 × 4) − (1 × 5) = 7
     ( 5  4 )
Inverses
We can find the inverse of a matrix using the solve(<matrix object>) function. Recall for a 2×2
matrix:
 A = ( a  b )        A⁻¹ = 1/(ad − bc) (  d  −b )
     ( c  d )                          ( −c   a )
Hence, we have:
 M = ( 3  2 )        M⁻¹ = 1/2 (  4  −2 )
     ( 5  4 )                  ( −5   3 )
The solve function can be used more generally to solve a set of equations of the form:
Ax = b
For example, consider the simultaneous equations:

 2x + 3y = 14
 3x − 4y = 4

In matrix form, Ax = b, these become:

 ( 2   3 ) ( x )   ( 14 )
 ( 3  −4 ) ( y ) = (  4 )
Then:
x = A⁻¹b
 ( x )   ( 2   3 )⁻¹ ( 14 )   ( 4 )
 ( y ) = ( 3  −4 )   (  4 ) = ( 2 )
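A sketch of solve() used both ways, for the system above:

```r
A <- matrix(c(2, 3, 3, -4), 2, 2)  # A = (2 3 / 3 -4), filled by column
b <- c(14, 4)

solve(A)      # the inverse of A
solve(A, b)   # solves Ax = b directly: x = 4, y = 2
```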
We can obtain the eigenvalues by finding the values of λ for which det(A − λI) = 0.
 A = ( 2  −1 )
     ( 2   5 )

 det ( 2 − λ    −1   ) = 0
     (   2    5 − λ  )

 (2 − λ)(5 − λ) + 2 = 0
 λ² − 7λ + 12 = 0
 (λ − 3)(λ − 4) = 0   ⟹   λ = 3 or λ = 4
For λ = 3:

 ( −1  −1 ) ( x )   ( 0 )       −x − y = 0
 (  2   2 ) ( y ) = ( 0 )   ⟹   2x + 2y = 0   ⟹   y = −x

So the eigenvectors corresponding to λ = 3 are of the form:

 k (  1 )
   ( −1 )
For λ = 4:

 ( −2  −1 ) ( x )   ( 0 )       −2x − y = 0
 (  2   1 ) ( y ) = ( 0 )   ⟹    2x + y = 0   ⟹   y = −2x

So the eigenvectors corresponding to λ = 4 are of the form:

 k (  1 )
   ( −2 )
To obtain the eigenvalues and eigenvectors in R we use the function eigen(<matrix object>):
Notice that it gives the “normalised” version of the eigenvectors – that is a vector with a modulus
of 1. So:
 k (  1/√2 )         k (  1/√5 )
   ( −1/√2 )   and     ( −2/√5 )
Notice how it says $values and $vectors? If you just wanted the eigenvalues you could type:
eigen(A)$values
We’ll discuss this $ notation more in the next chapter on data frames.
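A sketch of eigen() on the matrix above:

```r
A <- matrix(c(2, 2, -1, 5), 2, 2)  # A = (2 -1 / 2 5), filled by column
e <- eigen(A)

e$values    # 4 3 - the eigenvalues, largest first
e$vectors   # the normalised eigenvectors, one per column
```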
6 Summary
Key terms
Matrix A two‐dimensional (row, column) ordered collection of data of the
same type
Coercing The process of changing the data types so that they are all the
same
Key commands
matrix(<data>, nrow = , ncol = , byrow = FALSE, dimnames = list(<row names>, <col names>))
<matrix>[<row>,<column>] Returns the element from <matrix> given in <row> and <column>.
7 Have a go
You will only get proficient at R by practising.
1. Create a matrix C :
1 2 3 4
C 5 6 7 8
9 10 11 12
2. Use an appropriate function (other than matrix( )) to create a new matrix, D , which is the
same as matrix C but with an additional row containing the elements 13, 14, 15, 16.
3. What type of matrix (numeric, logical, character) would be formed from each of the
following:
4. (a) Create the following named matrix of temperatures using the dimnames option:
Mon Tue
London 18 20
T Paris 20 19
Stockholm 15 13
(b) Create the following named matrix of indices using rownames and colnames
functions:
Mon Tue

(e) a 2×2 matrix of the bottom right:

  7   8
 11  12
(f) the 2nd and 3rd columns of the 1st and 2nd rows
1 3
M 2 5
0 9
(c) Use R to obtain the column sums and row sums for matrix M .
 A = ( 1   9 )        B = ( 1  2 )
     ( 2  10 )            ( 3  4 )
(a) Use R to calculate AB and BA. Hence show that matrix multiplication is not commutative
(ie AB ≠ BA).
(b) Use R to find the determinant of B and B⁻¹, and hence show that BB⁻¹ = I.
 2p + 4q = 14
 2q + 3p = 13
Data frames
Covered in R14
In this chapter we look at data frames which are two‐dimensional objects like matrices. However,
they can contain different types of data. Essentially a data frame is a collection of column
vectors/factors of the same length. The columns are the vectors/factors which contain data for a
single variable (eg policyholder’s names, ages and smoker status). The rows represent a single
observation (eg a single policyholder).
Alfie 34 TRUE
Belinda 28 FALSE
Charlie 31 FALSE
Delilah 38 TRUE
So we can see that whilst each column (ie vector/factor) contains data of the same type, the
different columns (ie vectors/factors) can be different data types. So in the example above we
have character data for the first column, numeric data for the second column and logical data for
the third column.
You can find out whether an object is a data frame by using is.data.frame(<object>). The
dimensions are called rows and columns and the numbers of each can be found by using the
nrow(<data frame object>) and ncol(<data frame object>) commands respectively. Alternatively,
you could use the dimensions command dim(<object>) to get both the number of rows and
columns.
In this section we look at how to create a data frame from scratch. This will be rather long‐
winded and so you will often use scripts to prevent laborious retyping. In reality we will usually
import data from a spreadsheet or other source to create a data frame. We cover this in the next
chapter.
To create a data frame from scratch we use the data.frame( ) command. This will create a data
frame out of a list of vectors (or other data frames):
As already mentioned, vectors don’t have to contain the same data type but they do need to be
the same length. If not, then R will recycle/extend the shorter ones to make them the same
length as the longest vector.
We could either create the vectors separately and then put them together in a data frame or we
can create them inside this function. We’ll look at both to show you the pros and cons of each
approach.
Alfie 34 TRUE
Belinda 28 FALSE
Charlie 31 FALSE
Delilah 38 TRUE
A <‐ data.frame(c("Alfie", "Belinda", "Charlie", "Delilah"), c(34, 28, 31, 38), c(TRUE, FALSE, FALSE,
TRUE))
You might prefer to enter this function in two parts, putting part of the function on the next
line.
However, when R displays this data frame it is not pretty. The reason is that a data frame
automatically looks for column names from the vectors. This is great if we create a data frame
out of existing vectors (as we shall see a bit later) but if, as we did above, we create it from
scratch it results in rather messy names.
We can name the vectors either within the data.frame( ) command itself or separately using the
colnames( ) command or the names( ) command (which assumes you’re talking about columns
when applying it to a dataframe).
To define the names inside the data.frame( ) command is similar to naming elements in a vector.
We put the name equal to the vector as follows:
Let’s create this data frame again but this time name the columns – name, age and smoker:
A <‐ data.frame(name = c("Alfie", "Belinda", "Charlie", "Delilah"), age = c(34, 28, 31, 38), smoker =
c(TRUE, FALSE, FALSE, TRUE))
Had we wished to do this via the colnames or names functions we would have typed one of the
following:
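A sketch of the two alternatives:

```r
A <- data.frame(c("Alfie", "Belinda", "Charlie", "Delilah"),
                c(34, 28, 31, 38),
                c(TRUE, FALSE, FALSE, TRUE))

colnames(A) <- c("name", "age", "smoker")
# or, equivalently for a data frame:
names(A) <- c("name", "age", "smoker")
```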
Note that the numbers 1, 2, 3, 4 down the left-hand side of the data frame are not index values
referring to the first, second, third and fourth rows but the names of the rows and are, therefore,
characters “1”, “2”, “3”, “4”. We can rename them using the rownames( ) function. However,
there is little point in this case as the names of the individuals are included in the data frame
itself.
Properties
Let’s look at the properties of the data frame A:
Recall that another way of obtaining (nearly) all this information in one go is to use the
str(<object>) command which displays the structure of an R object:
So the data frame is the default object for data input which, like our example, assumes that the
columns are the variables we are observing and the rows are the observations. Hence it says that
we have 4 observations of 3 variables.
The names of the observations are given at the start of each line, followed by their data type,
followed by their contents (or some of the contents if there are too many to display).
We can see that age is numeric data and smoker status is logical, however names are not
characters but factors. The reason for this is that data frames assume that observations are
factors unless we tell it otherwise. This is understandable as most data we observe is categorical
(eg policy type, car type, postcode, etc). We can turn this feature off by setting the
stringsAsFactors option in the data frame to FALSE (instead of the default TRUE).
A <‐ data.frame(name = c("Alfie", "Belinda", "Charlie", "Delilah"), age = c(34, 28, 31, 38), smoker =
c(TRUE, FALSE, FALSE, TRUE), stringsAsFactors = FALSE)
When we do this we can see that the structure command now lists names as character data:
We can see that, by default, the names of the vectors will be the column names.
Similarly we can create a new data frame from other data frames. Let’s now combine our two
data frames A and B together:
Since the arguments of the data frame function are the (column) vectors, it combines the two
data frame objects side by side (ie as new columns).
Coercing
Since each column of a data frame is a vector then each column must contain data of the same
type. So if there is a mix of different data types in a column then R will try to coerce them all to
the same type.
For example suppose we create a data frame, D, with two columns, c1 and c2, as follows:
We can see for the first column it has converted the logical data into numeric data (TRUE
becomes 1 and FALSE becomes 0) so that it is now a vector of numeric data. We can see that the
second column has been converted to factors. Had we specified stringsAsFactors = FALSE
then it would have been coerced into a character vector. Try it out to see for yourself.
Recycling
Since a data frame is a collection of column vectors/factors of the same length then if we try to
create a data frame from vectors of differing lengths then it will recycle shorter vectors. The
following data frame is made from 3 vectors (v1, v2 and v3) of lengths 1, 2 and 4:
Since the length of the longest vector is 4, the shorter vectors have been recycled (ie extended by
repetition) until they are also of length 4.
However, suppose that the longest vector is not a multiple of one or more of the other vectors
(eg lengths 1, 2 and 5). When we did vector arithmetic R displayed a warning message but still
performed the operation. However, for data frames it displays an error message and stops:
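A sketch of both cases (the vector names and values are illustrative):

```r
v1 <- 7
v2 <- c(1, 2)
v3 <- c(10, 20, 30, 40)

D <- data.frame(v1, v2, v3)   # v1 and v2 are recycled to length 4
nrow(D)                       # 4

# Lengths 2 and 5 don't recycle evenly, so data.frame() stops with an error
bad <- tryCatch(data.frame(a = 1:2, b = 1:5), error = function(e) "error")
bad                           # "error"
```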
Since data frames have two dimensions we will need to give the row and column of the element
we want. Let’s apply this to our first data frame object A:
Alfie 34 TRUE
Belinda 28 FALSE
Charlie 31 FALSE
Delilah 38 TRUE
We can choose the element in the first row and second column by writing A[1,2]:
To display all the elements of, say, the first row, we omit the figure for the column A[1,]:
Similarly, to display all the elements of the second column, we omit the figure for the row A[ ,2]:
Notice that even though we have selected a column, R displays it horizontally, just as it does for
vectors.
To display more than one row or more than one column we simply enter the multiple values using
the c( ) command. For example, to display the elements in the 1st and 2nd rows of the 3rd
column:
Or the 2nd and 3rd rows of the 1st and 3rd columns:
We could also select multiple rows or columns using a:b the consecutive integer function:
We can use the dimension commands nrow( ) or ncol( ) to specify all the rows or columns until
the end:
To specify all row/column elements except the specified ones we use a negative in front of their
positions:
We can select elements using the results of logical tests. However, since our data frame consists
of different types of data we will look at how we can do logical tests just on one variable (ie
column) in a later chapter.
We can also use the names of rows or columns to select elements. In our data frame we had
variable/column names of name, age and smoker:
For example, our data frame object is called A and the first of its column names is “name”. So we
can specify that column immediately using A$name. Similarly for the other columns:
This is a very useful way of obtaining a vector object from the data frame to use in calculations (eg
mean or correlation). We’ll make use of this in a later chapter.
Note that just like with function arguments, you can abbreviate the column name to the fewest
letters that uniquely define it. Since our columns all begin with different letters we could just use
the first letters:
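A sketch of the $ notation on the data frame A:

```r
A <- data.frame(name = c("Alfie", "Belinda", "Charlie", "Delilah"),
                age = c(34, 28, 31, 38),
                smoker = c(TRUE, FALSE, FALSE, TRUE),
                stringsAsFactors = FALSE)

A$age        # 34 28 31 38 - the age column as a vector
mean(A$age)  # 32.75
A$a          # partial matching: "a" is enough to identify the age column
```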
The second way is to use the cbind function. So we could have obtained the same result as
follows:
The third method is to use the dollar notation to define a new column. For example let’s add the
weight vector as another column to the data frame H above:
The only way to do it is to use the row bind function on the original data frame and a new data
frame containing the observations for that individual.
Suppose we wish to add Eddie to the data frame who is aged 24, is not a smoker, is 172cm tall
with a weight of 82kg. We’ll put these results in a data frame called I and then add this row to
data frame H above:
There is a problem as the names of our new observation data frame do not match those of the
original. So we have to define the same names first, and then we can combine them:
We can see that we have the objects “height” and “weight” that were vectors we used to create
more columns in some of our data frames such as C:
Let’s look at what happens to the data frames if we change the weights vector.
You can see that the last result in the vector weight has been changed but not the values of the
weight column in the data frame.
The data frame doesn't update its weight column when the weight vector is changed after the data
frame has been created. To do that you would have to use the weight subset of the data frame as follows:
This may seem like a silly point at the moment – but it will be very important in the next chapter
when we look at attaching data frames.
5 Summary
Key terms
Data frame A two‐dimensional (row, column) ordered collection of column
vectors/factors of the same length – each column (vector) must
contain data of the same type but different columns can have
different types.
Coercing The process of changing the data types so that they are all the
same
Subsetting Another name for indexing – usually used where more than
one element is obtained from an object
Key commands
data.frame(<name1>=<vector1>, <name2>=<vector2>, … , stringsAsFactors=TRUE)
dim(<data frame>) Gives the dimensions (ie rows and columns) of <data frame>
cbind(<object1>, <object2>, …) creates a data frame by combining the different vector objects
names(<data frame>) Gives the names of the column vectors in the <data frame>
rownames(<data frame>) Gives the names of the rows of the <data frame>
colnames(<data frame>) Gives the names of the column vectors in the <data frame>
<data frame>[<row>,<column>]
Returns the element from <data frame> given in <row> and
<column>
6 Have a go
You will only get proficient at R by practising.
3. Add an additional column called Year containing the data: 2017, 2015, 2012.
You will have ample opportunity to create and manipulate data frames in your study of CS1 or
CS2.
Importing data
Covered in R15
Overview
Using datasets in packages
Importing data frames from other programs
Importing data from CSV files
Importing data from Excel files and elsewhere
1 Overview
In the last chapter we looked at the data frame, which is going to be the standard object used for
most data analysis.
We have already covered how to manually enter data into vectors or data frames in previous
chapters. So this chapter will cover the other two methods.
Inbuilt datasets
It may be helpful to review Chapter 8 before reading this section.
We’re going to look at the datasets package which, unsurprisingly, contains a variety of datasets.
To find out more about the contents of this package we can click on its name in the Packages
window.
Clicking on the links will take us to the help page on that specific dataset. For example, clicking on
rivers gives:
Here R displays the vector of 141 observations (lengths in miles of 141 major rivers in North
America).
Let’s load up the MASS package (using the tick box in the Packages window).
Let’s take a look at one of these, Cars93, which contains data from the different types of cars sold
in the USA in 1993. From the help page we can see it is a data frame with 93 types of cars (rows)
and 27 variables (columns):
Once we’ve loaded the package into the workspace we can access the dataset simply by typing its
name:
Because of the size of this data frame it is not possible to display it easily in the Console. The
structure command can tell us about the different variables (columns):
It's easier to view the dataset by selecting it in the Environment window. First select the package
by changing Global Environment to package:MASS using the dropdown menu:
You can import datasets in RStudio using the Environment window. Click on Import Dataset and
select the first option: From Text (base).
Open the text file you have just created and RStudio will open a window where you can select
some options to use for the import (feel free to experiment with the options now or later).
We can see that it has placed a V1 above the numbers. That’s because R places the data by
default into a data frame and has given the column the name “V1”. When you click on Import
RStudio will display the data in a window and you’ll also see the object listed in the Environment
window.
Let’s now create a three‐column dataset in notepad. We can separate the columns with spaces or
tabs – both will work. Let’s save this dataset as “data2”:
We will now load this into object C (by changing the Name in the Import window):
Again, we can see that it is placed in a data frame and by default it has column headings of V
followed by the column number.
Suppose we want to have column headings “name”, “age” and “smoker”. One way to do this is
to use the colnames (or names) function like we did in Chapter 14. For example:
An alternative way is to add the column names in the original text document (calling this
data3.txt):
When we reach the Import window we can use the Heading option.
By default the row names are “1”, “2”, “3”, etc. We could change these using the rownames
function like we did in Chapter 14. Alternatively, we could use the row names option in the
window above and use the first column of data.
Just like in the previous chapter, R assumes non‐numeric data are factors unless we specify
otherwise. We can do this by unchecking the Strings as Factors box in the Import window.
Missing data
Missing data is often an issue. R uses the logical value NA to tell its functions that the data is
missing. However, our text file might use something else. In which case we need to tell R what it
is. We do this via the na.strings option in the Import window. For example, we have specified
“n/a” and R will replace any occurrences of this with NA:
If you keep an eye on the Console when you import data you will see that R is using the read.table
(or possibly the read.delim) function. For example, the last instruction might have read:
If the dataset contained more than one way of indicating missing data, we could adjust this line of
code to include the alternatives. For example, here we are looking for “n/a” and “-”:
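A sketch of what that call might look like – the file name and contents here are illustrative:

```r
# Create a small illustrative file, then read it back, treating
# both "n/a" and "-" as missing values
writeLines(c("name age smoker", "Ann n/a N", "Bob 31 -"), "data4.txt")
D <- read.table("data4.txt", header = TRUE, na.strings = c("n/a", "-"))
D   # the "n/a" and "-" entries now show as NA
```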
We can import CSV files in the same way as other text files. RStudio will usually recognise the file
type and automatically change the Separator option to “Comma”. For example:
You can select From Excel from the Import Dataset drop‐down menu:
RStudio will then open a window from where you can open the Excel file and select a number of
options. Experiment with the options so you can understand what they do.
You may be prompted to install or update some packages when you first use these options and if
you receive an error then search the internet for some help to find out what you might need to
install first.
6 Summary
Key terms
Import Load data stored elsewhere into R
CSV Comma Separated Value – a format often used to store data sets,
with each field of a record separated by a comma
Menus
Import dataset Located in the Environment window
Key commands
read.table Imports data into R. View its help for more information on its
arguments.
7 Have a go
You will only get proficient at R by practising.
2. Create a dataset in Excel, including column and row headings, and import it into R.
3. Alternatively, search the internet for some datasets and load one or two that interest you
into R.
Exporting data
Covered in R16
Overview
Exporting vector data to windows clipboard
Exporting data to a text file
Using write.table with data frames
Using write.csv
Other export commands
1 Overview
In this chapter we’ll look at how to export data to text, CSV or other formats which can be used in
other programs to, for example, produce reports.
There are a variety of places that we can export data to. These include:
the Windows clipboard
a text document
a CSV file
Excel
other statistical packages such as SAS or SPSS.
You can copy data to the Windows clipboard using the writeClipboard function. We can then
paste this into any other program such as Notepad, Word or Excel. However, it only works well
with vectors of character data.
writeClipboard(<character object>)
Then we’ll use the writeClipboard command on object “name” to copy the contents of “name”
into Windows clipboard:
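For example, we might have created a character vector like this (the names are illustrative):

```r
# Illustrative character vector standing in for the object name
name <- c("Ann", "Bob", "Carl", "Deb")
# writeClipboard only exists on Windows, so guard the call
if (.Platform$OS.type == "windows") writeClipboard(name)
```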
Now if we open Word or Excel and paste it using CTRL+V or Edit/Paste we find that we have the
column of data from “name”:
The writeClipboard command only works with character data. Let’s look at what happens if we
try to apply it to a numeric vector.
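For example, suppose we have a numeric vector (values illustrative):

```r
# Illustrative numeric vector standing in for the object age
age <- c(25, 31, 47, 52)
# On Windows the next line would produce an error, because age is numeric:
# writeClipboard(age)
```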
Then we’ll use the writeClipboard command on object “age” to copy the contents of “age” into
Windows clipboard:
R returns an error telling us it’s not a character vector and so it can’t do it.
We can fix this by using the as.character command. This coerces the data type into character
data:
We could put this in a new object and then apply writeClipboard on this new object. For example:
age2 <- as.character(age)
writeClipboard(age2)
Incidentally, if our vectors have named elements they would not be included in the clipboard.
write.table(<object>,file="<filename>")
If the filename does not specify the file path then R will save the file in the current working
directory.
Recall that you can find out the current working directory using the getwd() function and you can
change it using the setwd() function or the menu option Session/Set Working Directory.
Also remember that we can use the tilde, ~, shortcut to specify the location. You can find out
what the shortcut is on your computer by using path.expand("~").
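For example:

```r
getwd()            # display the current working directory
path.expand("~")   # show what the tilde shortcut expands to
```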
write.table(name, file="name")
or since file is the second argument we could omit the file= as long as we put it in the second
position:
write.table(name, "name")
However, because we didn’t specify the file type, Windows doesn’t have a clue what it is. We can
still open it but we’ll have to tell Windows which program to use:
Since this is a bit annoying let’s save the file correctly this time by adding the extension .txt:
When we look in our working directory we can see this file correctly identified as a text file:
There are two odd things about our text file. First, we notice that there is an “x” at the top. This
is because by default write.table adds column headings. Since our vector hasn’t got a column
heading it calls it “x”. Secondly, it has added the row names “1”, “2”, “3” and “4”. Had we named
the elements in our vector these would have appeared here instead of the default row names.
Both of these features can be turned off by using the optional arguments col.names and
row.names. By default both of these options are set to TRUE.
Let’s experiment to see what happens if we try the logical options FALSE or NA. First, let’s set
col.names to FALSE:
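A sketch, using an illustrative vector called name:

```r
name <- c("Ann", "Bob", "Carl", "Deb")            # illustrative data
write.table(name, "name.txt", col.names = FALSE)
readLines("name.txt")   # four lines, no header row
```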
You may be shocked that R has just overwritten our previous file without even a warning – so be
careful when saving files from now on. Once you’ve recovered from the shock and opened this
file in WordPad or Notepad you’ll see the following:
We can see that it has removed the default column name of “x”.
Setting col.names to NA instead, we can see that it puts in the default column name “x” but it
also adds a blank column name for the row names. This is a useful output if we were going to
paste this into, say, Excel as it ensures that everything lines up correctly in a grid.
Finally, suppose we want to call the column “name” rather than the default “x”. We simply set
the col.names option to this:
We can see that, as expected, it has removed the row names “1”, “2”, “3” and “4”.
Suppose we want to name the rows “P1”, “P2”, “P3” and “P4” rather than the default “1”, “2”,
“3” and “4”. We simply set the row.names option to equal this:
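A sketch with illustrative data:

```r
name <- c("Ann", "Bob", "Carl", "Deb")            # illustrative data
write.table(name, "name.txt", row.names = c("P1", "P2", "P3", "P4"))
readLines("name.txt")   # header row, then rows labelled P1 to P4
```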
For a vector we will probably not actually want either the row or the column names. In which
case we would type the following:
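For example (illustrative data again):

```r
name <- c("Ann", "Bob", "Carl", "Deb")            # illustrative data
write.table(name, "name.txt", row.names = FALSE, col.names = FALSE)
readLines("name.txt")   # just the four quoted values
```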
This gives:
Before we move on to applying this function to matrices and data.frames let’s show a couple of
other optional arguments.
By now, we know how to get rid of the quote marks, set col.names to NA and separators to tabs
to make it all look pretty as follows:
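A sketch of what that might look like:

```r
name <- c("Ann", "Bob", "Carl", "Deb")            # illustrative data
write.table(name, "name.txt", quote = FALSE, col.names = NA, sep = "\t")
readLines("name.txt")   # blank column name over the row names
```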
Or we could get rid of the row names, in which case it would be pointless setting col.names to NA
(and you’ll get a lovely error message from R if you try):
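For example:

```r
name <- c("Ann", "Bob", "Carl", "Deb")            # illustrative data
write.table(name, "name.txt", quote = FALSE, row.names = FALSE, sep = "\t")
readLines("name.txt")   # header "x" followed by the four values
```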
5 Using write.csv
To write (ie export) data to a csv file (*.csv) we use the write.csv( ) command:
write.csv(<object>, file="<filename>")
The write.csv function is actually the same as the write.table function; it just has fixed defaults
for some of the optional arguments to ensure that the output is properly formatted as a CSV file
that can be read by any spreadsheet program such as Excel.
These defaults include commas for the separator (which tells spreadsheets to put the value in the
next column). It always has a header row (ie column names) and will assign default ones if none
are specified. If row.names is TRUE then it sets col.names=NA so that everything lines up
beautifully as we saw earlier.
write.csv(name, "name.csv")
Because we gave the file extension .csv we can see in our working directory that Windows
recognises it as a spreadsheet file:
We can see that we have the default row names “1”, “2”, “3” and “4” and the default vector
column name “x”.
If you get an error message it’s because we are saving a file called “name.csv” whilst that file is
open in Excel. Simply close the Excel file and re‐execute the command and you’ll get the
following:
This is because write.csv always has a header row, to match the CSV convention. That’s why,
when we used read.csv in the previous chapter, it always assumed the file had a header row.
Similarly if you tried to change the separator to something other than a comma you’d get an error
too.
Again, everything is beautifully lined up with the row and column names.
If we had no row or column names, the function would automatically assign them as we have
seen.
You can also use functions from add-on packages to export an object to an xlsx file. Additionally,
you can even specify the formatting and other features.
7 Summary
Key terms
Export Move data from R into another format, eg a text or Excel file
Key commands
writeClipboard(<character object>)
Copies data stored in a character vector to the clipboard
write.table(<object>,file="<filename>")
Exports data into a text file. Optional arguments include
row.names, col.names, quote and sep
write.csv(<object>,file="<filename>")
Exports data into a csv file
Discrete Distributions
Exercises
Data requirements
These exercises do not require you to upload any data files.
Exercise 1.01
A European roulette wheel has the numbers 0 to 36. Each play involves spinning the wheel one
way and a ball the other. The result is the number the ball lands on.
(ii) Use the length function and logical operators to calculate the following probabilities:
(c) P(3 < R ≤ 9)
(iii) Use set.seed(37) and the sample function to simulate the mathematician’s results
and store it in the object S1.
(iv) Use the table function to obtain a frequency table of the results.
(v) Use the function hist to plot a histogram of the results, ensuring that the labels on the
horizontal axis are in the centres of the bars.
(vi) Use the results of the simulation to calculate the empirical probabilities in part (ii).
(vii) Use the results of the simulation to calculate empirical values of the:
(a) mean
(b) median
Exercise 1.02
A group consists of 10 people who have each been independently infected by a serious disease.
The survival probability for the disease is 70%.
(iii) Draw a labelled bar chart showing the number of people surviving from the group of 10
people using the barplot function.
(iv) Use the bar chart from part (iii) to obtain the modal number of people in the group that
will survive.
(v) Using ∑ x·P(X = x), calculate the mean number of people who will survive.
Exercise 1.03
A group consists of 10 people who have each been independently infected by a serious disease.
The survival probability for the disease is 70%.
(ii) Check your answer to part (i)(a) using the dbinom function.
(iii) Draw a labelled bar chart showing the CDF of the number of people surviving from the
group of 10 people using the barplot function.
(iv) Draw a stepped graph of the CDF in part (iii) using the plot function.
Exercise 1.04
A group consists of 10 people who have each been independently infected by a serious disease.
The survival probability for the disease is 70%.
(i) Use qbinom to calculate the minimum number of survivors, x, such that:
(a) P(X ≤ x) ≥ 0.8
(b) P(X ≤ x) ≥ 0.95
(c) P(X > x) ≤ 0.4
(iv) Draw a stepped graph showing the number of survivors (percentiles) against the
cumulative probability for the group of 10 people using the plot function.
Exercise 1.05
A group consists of 10 people who have each been independently infected by a serious disease.
The survival probability for the disease is 70%.
(i) Use set.seed(37) and rbinom to simulate the number of survivors 500 times. Store
this in the object B.
(ii) (a) Use the table function on B to obtain a frequency table for the survivors.
(c) Compare the results of (b) with the actual probabilities from dbinom (round
them to 3DP using the round function).
(d) Use length to obtain the empirical probability of at most 6 survivors and
compare with the actual probability using pbinom.
(iii) (a) Draw a histogram of the results obtained from the simulation, centring the labels
on the bars.
(b) Superimpose on the histogram a line graph of the expected frequencies for the
binomial distribution using the lines function.
(iv) Compare the following statistics for the distribution and simulated values:
(a) mean
(b) Use a loop to store the standard deviation of the first i values in the object B in
the ith element of StdDev.
(c) Plot a graph of the object StdDev showing how the standard deviation of the
simulations changes over the 500 values compared to a horizontal line showing
the true value.
Exercise 1.06
The probability of having a male child can be assumed to be 0.51 independently from birth to
birth.
(i) Calculate the probability that a woman’s fourth child is her first son:
(ii) Draw a labelled bar chart showing the probability of obtaining 0 to 10 daughters before
her first son using the barplot function.
(iii) Use pgeom to calculate the probability that the woman has:
(iv) Draw a stepped graph of the CDF using the plot function.
(v) Use qgeom to calculate the smallest number of daughters, x , before the first son such
that:
(a) P(X ≤ x) ≥ 0.9
(b) P(X > x) ≤ 0.4
(vi) Use set.seed(47) and rgeom to simulate the number of daughters before the first
son 1,000 times. Store this in the object G.
(vii) (a) Use length to obtain the empirical probabilities for part (iii) and comment.
(b) Use quantile to calculate the empirical results for part (v) and comment.
(viii) Compare the following statistics for the distribution and simulated values:
(a) mean
(b) variance.
Exercise 1.07
The probability that a person will believe a rumour about a scandal in politics is 0.8.
(i) Calculate the probability that the ninth person to hear the rumour will be the fourth
person to believe it:
(ii) (a) Use the par function and mfrow to prepare the plot area to display 4 graphs in a
2 by 2 grid.
(b) Use the barplot function to draw 4 bar charts of the probability function for
negative binomial distributions with p = 0.8 and k = 1, 2, 3 and 4, with titles
showing the value of k .
(c) Reset the graphics display area using the par function and mfrow.
(iii) Use pnbinom to calculate the probability that:
(a) at most 2 people didn’t believe the rumour before the fourth person did
(b) more than 3 people didn’t believe the rumour before the fourth person did.
(iv) Use qnbinom to calculate the smallest number of people who didn’t believe the rumour,
x , before the fourth person did such that:
(v) Use set.seed(57) and rnbinom to simulate the number of people who didn’t
believe the rumour before the fourth person did 2,000 times. Store this in the object N.
(vi) (a) Draw a histogram of the results obtained from the simulation in part (v).
(vii) Use length to obtain the empirical probabilities for part (iii) and comment.
(viii) Compare the following statistics for the distribution and simulated values:
(b) IQR (use the quantile function and the results from part (iv)).
(ix) Obtain one simulated value for the number of people who didn’t believe the rumour
before the fourth person did using set.seed(57) and the rgeom function.
Exercise 1.08
Among the 58 people applying for a job, only 30 have a particular qualification. 5 of the group are
randomly selected for a survey about the job application procedure.
(i) Calculate the probability that none of the group selected have the qualification:
(ii) (a) Use the par function and mfrow to prepare the plot area to display 4 graphs in a
2 by 2 grid.
(b) Use the barplot function to draw 4 bar charts showing the number of the
group selected having the qualification from samples of size 5, 10, 15 and 20.
(c) Reset the graphics display area using the par function and mfrow.
(iii) Use phyper to calculate the probability that more than 2 people in the group selected
have the qualification.
(iv) Draw a stepped graph of the CDF using the plot function.
(v) Use qhyper to calculate the upper quartile of the number of people in the group
selected who have the qualification.
(vi) Use set.seed(67) and rhyper to simulate the number of people who have the
qualification in the group selected 2,000 times. Store this in the object H.
(vii) Use length to obtain the empirical probability for part (iii) and comment.
(viii) Compare the following statistics for the distribution and simulated values:
(a) mean
(b) upper quartile (use the quantile function and the result from part (v)).
(ix) (a) Draw a line graph of the binomial approximation to the probabilities of the
number of people selected who have the qualification using the plot function.
(b) Superimpose the actual probabilities using dhyper and the lines function.
(c) Superimpose the actual probabilities when 116 people apply for the job with the
same proportion having the particular qualification.
Exercise 1.09
A home insurance company receives claims at a rate of 2 per month.
(i) Calculate the probability that the company receives 4 claims in a month:
(ii) (a) Use the par function and mfrow to prepare to display 4 graphs in a 2 by 2 grid.
(b) use the barplot function to draw 4 bar charts showing the numbers of claims
received in a month if they occur at rates of 2, 5, 10 and 20 per month.
(c) Comment on the shape and position of the distribution for larger values of the
mean.
(d) Reset the graphics display area using the par function and mfrow.
(iii) Use ppois to calculate the probability that the company receives at least 3 claims in a
month.
(iv) Draw a stepped graph of the CDF using the plot function.
(v) Use qpois to calculate the interquartile range of the number of claims received in a
month.
(vi) Use set.seed(77) and rpois to simulate the number of claims received in 2,000
separate months. Store this in the object P and plot a histogram.
(vii) Use length to obtain the empirical probability for part (iii) and comment.
(viii) Compare the following statistics for the distribution and simulated values:
(b) IQR (use the quantile function and the result from part (v)).
(b) Use a loop to store the mean of the first i values in the object P in the ith
element of Average.
(c) Plot a graph of the object Average showing how the mean of the simulations
changes over the 2,000 values compared to the true value.
Discrete Distributions
Answers
Exercise 1.01
(ii) (a) 0.5405405
(b) 0.7297297
(c) 0.1621622
(v)
(b) 0.704
(c) 0.175
(b) 17
(c) 10.73636
(d) 0.04215146
Exercise 1.02
(i) 0.2001209
(b) 0.04734899
(iii)
(iv) 7
(v) 7
Exercise 1.03
(i) (a) 0.6172172
(b) 0.8497317
(c) 0.3827828
(d) 0.1502683
(e) 0.1210608
(iii)
(iv)
Exercise 1.04
(i) (a) 8
(b) 9
(c) 7
(ii) 7
(iii) 2
(iv)
Exercise 1.05
(ii) (a)
survivors 2 3 4 5 6 7 8 9 10
Freq 3 4 18 48 83 142 110 72 20
(b)
survivors 2 3 4 5 6 7 8 9 10
Prob 0.006 0.008 0.036 0.096 0.166 0.284 0.220 0.144 0.040
(c)
survivors 2 3 4 5 6 7 8 9 10
Prob 0.001 0.009 0.037 0.103 0.200 0.267 0.233 0.121 0.028
Fairly similar
(v) (c)
Exercise 1.06
(i) 0.06000099
(ii)
(b) 0.117649
(c) 0.2499
(iv)
(v) (a) 3
(b) 1
Exercise 1.07
(i) 0.007340032
(b) 0.033344
(iv) (a) 2
(b) 0
(b) 0.042
There are slightly more values at the upper end in the simulation.
(viii) (a) 1.155412 vs true value of 1.118034, simulation slightly more spread out
(b) 2 vs true value of 2, IQR identical – ties in with part (vii) – slightly more extreme
values produced
(ix) 3
Exercise 1.08
(i) (a) 0.02144861
(b) 0.02144861
(c) 0.02622106
(iii) 0.5334928
(iv)
(v) 3
(vii) 0.5155, slightly fewer values at upper end than the true distribution
(ix)
Exercise 1.09
(i) 0.09022352
(ii)(c) For higher values of λ the distribution shifts to the right and becomes more symmetrical.
(iii) 0.3233236
(v) 2
(vii) 0.3165. The probability is similar but slightly underestimates the true value.
(ix) (c)
Continuous Distributions
Exercises
Exercise 1.11
Claims to a general insurance company’s 24-hour call centre occur at a rate of 3 per hour.
Accordingly the waiting time between calls is modelled using an exponential distribution with
λ = 3.
(ii) Draw a labelled graph of the PDF for this exponential distribution over the range x ∈ (0,6)
using the:
(c) plot function to draw a blank set of axes and then the lines function to draw
the PDF.
(iii) Use the lines function to add to any one of your graphs from part (ii) the following:
(a) a red dotted line showing the PDF of an exponential distribution with λ = 6
(b) a green dashed line showing the PDF of an exponential distribution with λ = 1.5.
Exercise 1.12
Claims to a general insurance company’s 24-hour call centre occur at a rate of 3 per hour.
Accordingly the waiting time between calls is modelled using an exponential distribution with
λ = 3.
(ii) Draw a labelled graph of the CDF for this exponential distribution over the range x ∈ (0,5)
using either the plot function, the curve function, or the plot function to draw a
blank set of axes and then the lines function to draw the CDF.
Exercise 1.13
Claims to a general insurance company’s 24-hour call centre occur at a rate of 3 per hour.
Accordingly the waiting time between calls is modelled using an exponential distribution with
λ = 3.
(i) Use qexp to calculate the number of hours waited, x, such that:
(a) P(X ≤ x) = 0.8
(b) P(X ≤ x) = 0.95
(c) P(X > x) = 0.3.
Exercise 1.14
Claims to a general insurance company’s 24-hour call centre occur at a rate of 3 per hour.
Accordingly the waiting time between calls is modelled using an exponential distribution with
λ = 3.
(i) Use set.seed(37) and rexp to simulate 500 waiting times. Store this in the object W.
(ii) (a) Draw a labelled histogram of the densities of the 500 simulated waiting times.
(b) Superimpose on the histogram a graph of the actual PDF of the waiting times
using the lines function.
(iii) Use length to obtain the empirical probability of waiting and compare to results from
Exercise 1.12(i):
(a) mean
(b) Use a loop to store the median of the first i values in the object W in the ith
element of middle.
(c) Plot a graph of the object middle showing how the median of the simulations
changes over the 500 values and show the median of the distribution on the same
graph.
Exercise 1.15
The annual claim amounts (in $m) due to damage caused by sinkholes in a certain American state
are modelled by a gamma distribution with parameters α = 2 and λ = 1.5.
(ii) Draw a labelled graph of the PDF for this gamma distribution over the range x ∈ (0,8)
using either plot, curve or plot and lines.
(iii) Use the lines function to add the following to your graph from part (ii):
(a) a red dotted line showing the PDF of a gamma distribution now with 1
(b) a green dashed line showing the PDF of a gamma distribution now with 0.5 .
(v) Use pgamma to calculate the probability for the original gamma distribution that:
(vi) Use qgamma to calculate the IQR of the annual sinkhole claim amounts.
(vii) Use set.seed(57) and rgamma to simulate 2,000 annual damage amounts. Store this
in the object C.
(viii) (a) Draw a labelled histogram of the densities of the 2,000 simulated annual claims.
(b) Superimpose on the histogram a graph of the actual PDF of the claim amounts
using the lines function.
(ix) Use length to obtain the empirical probabilities for part (v) and comment.
(b) IQR (use the quantile function and the result from part (vi)).
Exercise 1.16
There is no Exercise 1.16.
Exercise 1.17
(i) Claim amounts, X, are modelled using a continuous distribution with CDF given by:
F(x) = 1 − e^(−3x^0.5),  x > 0
(a) Use set.seed(47) and runif to simulate 1,000 random numbers between 0
and 1. Store this in the object U.
(b) Use the random numbers from part (i)(a) to obtain 1,000 simulated claims.
(c) Hence obtain an estimate of the mean and standard deviation of the claims.
(ii) Draw the PDFs of the following beta distributions in blue, green and red, respectively, on
the same axes using plot and lines and adding a legend:
(a) beta(0.5,2)
(b) beta(2,0.5)
(c) beta(0.5,0.5)
(iv) Use qbeta to find x such that P(X > x) = 0.65, where X has a beta(0.5,2) distribution.
Exercise 1.18
(i) Calculate the value of the PDF when x = 120 for:
(iv) Use set.seed(58) and rlnorm to simulate 1,000 values from a lognormal
distribution whose estimated parameters will be exactly μ = 4.5 and σ² = 0.005. Store
them in the object L.
(v) (a) Plot the PDF of a lognormal distribution with parameters μ = 4.5 and σ² = 0.005 for
x ∈ (60,130)
(b) Use lines and density(L) to superimpose the empirical PDF in red.
(vi) Use the simulations from part (iv) to calculate the empirical value of:
Exercise 1.19
(i) Calculate the value of the PDF when x = 0.5 for:
(ii) Use plot and lines to draw a labelled graph of the PDF of the following distributions over
the range x ∈ (−3.5, 3.5):
(a) t5
(b) t10
(c) N(0,1)
(vi) Use set.seed(59) and rf to simulate 500 values from an F distribution with 2 and 7
degrees of freedom. Store this in the object S.
(vii) Use the simulated values from part (vi) to calculate the empirical value of:
Continuous Distributions
Answers
Exercise 1.11
(i) (a)(b) 0.0074363
(ii),(iii),(iv)
Exercise 1.12
(i) (a) 0.22313
(b) 0.99988
(c) 0.010556
(ii)
Exercise 1.13
(i) (a) 0.53648
(b) 0.99858
(c) 0.40132
(ii) 0.23105
(iii) 0.36620
Exercise 1.14
(ii) (a)(b)
All simulated values are greater than the actual values, so the simulations are more spread out.
All simulated values are greater than those of the distribution, so the simulated values are higher and more spread out.
(v) (c)
After 500 simulations the median has still not settled down to its long-term value.
Exercise 1.15
(i) 0.07498573
(ii)-(iv)
(b) 0.4967259
(vi) 1.154237
(viii)
The probabilities are very similar – unsurprising given that we have 2,000 simulations.
Exercise 1.16
There is no Exercise 1.16.
Exercise 1.17
(i) (c) mean 0.2298496, standard deviation 0.4498087
(ii)
(b) 0.3690101
(iv) 0.05655679
Exercise 1.18
(i) (a) 0.001033349
(b) 1.209859e-05
(b) 0.0684637
(b) 101.1201
(v)
Exercise 1.19
(i) (a) 0.3279185
(b) 0.5483227
(ii)
(iii) The PDF of the t distribution approaches that of the N(0,1) as the degrees of freedom get
larger.
(b) 0.6707434
(c) 0.2869744
(b) 0.5122197
(b) 0.5353863
Data requirements
These exercises do not require you to upload any data files.
Exercise 5.01
Claim amounts for a particular policy are modelled using an exponential distribution with mean
£1,000.
(i) Use set.seed(27) and rexp to simulate the claim amounts 50 times. Store this in
the object E.
(ii) (a) Draw a histogram of the results obtained from the 50 simulations.
(b) Use a loop to store the sum of the results obtained in the i th sample of 50
simulations (using set.seed(27)) in the i th element of xsum.
(iv) (a) Draw a labelled histogram of the probabilities of the results in xsum.
(v) Calculate the probability of the sum of 50 claims being greater than £60,000:
(b) using the central limit theorem and the pnorm function.
(vi) (a) Use the par function and mfrow to prepare the plot area to display 4 graphs in a
2 by 2 grid.
(b) Repeat parts (iii) and (iv) for sample sizes of 5, 10, 50 and 100 claims.
(d) Reset the graphics display area using the par function and mfrow.
Exercise 5.02
Claim amounts for a particular policy are modelled using an exponential distribution with mean
£1,000.
If you have already created xsum in your current R session then proceed directly onto part (ii).
(b) Use a loop to store the sum of the results obtained in the i th sample of 50
simulations (using set.seed(27)) in the i th element of xsum.
(ii) (a) Calculate the mean and variance of the simulations of the sum of 50 claims.
(b) Compare part (a) with the mean and variance of an appropriate normal
distribution of the form N(nμ, nσ²).
(iii) (a) Calculate the median, lower and upper quartiles of the simulations of the sum of
50 claims using either quantile or summary.
(b) Compare part (a) with the median, lower and upper quartiles of a N(nμ, nσ²)
distribution.
(iv) (a) Use qqnorm to obtain a QQ plot for the simulations of the sum of 50 claims and a
normal distribution.
(c) Comment on the skewness of the distribution of the sum of 50 claims and how
close it is to a normal distribution.
Exercise 5.03
The numbers of claims are modelled by a Poisson distribution with mean 10 per day.
(i) (a) Obtain 1,000 simulations from this distribution using set.seed(29). Store this
in the object P.
(b) Plot a histogram of the probabilities of the results, ensuring that the labels on the
horizontal axis are in the centres of the bars.
(ii) (a) Use length to obtain the empirical probability of more than 10 claims in a day.
(c) Compare parts (a) and (b) to the exact Poisson probability using ppois.
(iii) (a) Use qqnorm to obtain a QQ plot for the simulations and a normal distribution.
(c) Comment on how close the normal distribution approximation is to the Poisson.
Exercise 5.01
(ii) (a)
(iv) (a)(b)
(b) 0.07865
(vi) (c) As the sample size increases, the normal approximation approaches the empirical
distribution more closely.
Exercise 5.02
(ii) (a) mean 49,909 and variance 50,999,108.
(iv) (a)(b)
(c) Close to normal in the middle and fairly good in the upper tail.
However, the ‘banana shape’ indicates skewness. The sample quantiles lie above
the line in both tails – they would need to be lower to match the normal.
The lower tail is noticeably lighter, so it does not extend as low as it should, and
the upper tail is slightly heavier, so it extends higher than expected.
A lighter lower tail and a heavier upper tail indicate positive skew.
Exercise 5.03
(i) (b)(c)
(b) 0.43718
(c) The probability from the Poisson simulation is closer than that from the normal
approximation to the true value of 0.41696.
(iii) (a)(b)
(c) The QQ plot’s ‘banana shape’ shows signs of positive skew. The mean of the Poisson
distribution is not large enough to ensure that the normal distribution is a good
approximation.
Sampling distributions
Exercises
Data requirements
These exercises do not require you to upload any data files.
Exercise 6.01
Heights of a particular group of women are normally distributed with mean 162 cm and standard
deviation 9 cm.
(b) Use a loop to obtain 1,000 sample variances from a sample of 20 women, using
set.seed(27) and storing the sample variance of the i th sample of 20 women
in the i th element of xvar.
Recall that (n − 1)S²/σ² ~ χ²_{n−1} for samples of size n from a N(μ, σ²) distribution.
(ii) Create a new vector X from xvar which is equal to (20 − 1) × xvar / 9².
(b) Superimpose on the histogram the empirical PDF of vector X using the functions
density and lines.
(c) Superimpose on the histogram a graph of the PDF of a χ²_{n−1} distribution using
the lines function.
(d) Comment on how close our empirical distribution is to the χ²_{n−1} distribution.
(iv) Calculate the mean and variance of X and compare them to the mean and variance of a
χ²_{n−1} distribution.
(v) Calculate the median, lower and upper quartiles of the vector X using either quantile
or summary and compare them to the median, lower and upper quartiles of a χ²_{n−1}
distribution.
(vi) (a) Simulate 1,000 values from a χ²_{n−1} distribution and store them in the vector chi.
(b) Use qqplot to obtain a QQ plot for the vectors X and chi.
(vii) (a) Calculate the probability of a value of X being greater than 15.
Sampling distributions
Answers
Exercise 6.01
(iii) (a),(b),(c)
(d) Fairly similar – simulated peak slightly to the left and also there's a lump in the
positive tail
(iv) Mean and variance of simulated values are 18.888 and 37.623, respectively.
Mean and variance of the χ²_{19} distribution are 19 and 38, respectively.
(v)
Simulated Actual
Lower quartile 14.352 14.562
Median 18.139 18.338
Upper quartile 22.612 22.718
The results are close – but the quantiles of the simulated values are all slightly lower than the true values.
(vi) (b)(c)
(d) Fairly close – we can see the lump towards the upper end.
Slightly lighter in the tails – but on both sides of the line, so not skewed.
(b) 0.72260
Estimation
Exercises
Data requirements
These exercises require the following data files:
motor.txt
lifetime.csv
Exercise 7.01
A motor insurance portfolio produces claim incidence data for 100,000 policies over one year. The
table below shows the observed number of policyholders making 0, 1, 2, 3, 4, 5, and 6 or more
claims in a year.
These data values are contained in the data file, ‘motor.txt’.
It is thought the data could either be modelled as a Poisson distribution with mean 0.13345
or as a Type 2 negative binomial distribution with k = 1.8569 and p = 0.93295.
(iii) (a) List out the expected frequencies for each of these fitted distributions to the
nearest whole number.
(b) Obtain the differences between the observed and expected frequencies for the
two fitted distributions.
(c) Hence, comment on the fit of these two distributions to the observed data.
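One way to obtain the expected frequencies in part (iii)(a), assuming the fitted parameters quoted above and R's size/prob parameterisation of the Type 2 negative binomial:

```r
n <- 100000
k <- 0:5

# Expected frequencies, with the final cell covering 6 or more claims
exp.pois <- n * c(dpois(k, lambda = 0.13345),
                  1 - ppois(5, lambda = 0.13345))
exp.nbin <- n * c(dnbinom(k, size = 1.8569, prob = 0.93295),
                  1 - pnbinom(5, size = 1.8569, prob = 0.93295))

round(exp.pois)
round(exp.nbin)
```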
Exercise 7.02
The lifetimes (in hours) of 250 incandescent bulbs are contained in the CSV data file ‘lifetime’.
It is thought the data could either be modelled as an exponential distribution with parameter
λ = 0.00049724 or as a gamma distribution with α = 0.86280 and λ = 0.00042902.
(iv) Superimpose the PDFs of these two distributions on the empirical PDF from part (iii).
(v) (a) Use set.seed(71) and rexp to generate 1,000 values from the fitted
exponential distribution and store them in the object xexp.
(vi) (a) Use set.seed(71) and rgamma to generate 1,000 values from the fitted
gamma distribution and store them in the object xgamma.
(vii) State which model is most appropriate using your results of parts (v) and (vi).
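A sketch of parts (v)(a) and (vi)(a); the parameter values are the fitted ones from the question, and rate is R's name for λ:

```r
# 1,000 simulated lifetimes from the fitted exponential model
set.seed(71)
xexp <- rexp(1000, rate = 0.00049724)

# 1,000 simulated lifetimes from the fitted gamma model
set.seed(71)
xgamma <- rgamma(1000, shape = 0.86280, rate = 0.00042902)
```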
Exercise 7.03
There is currently no Exercise 7.03.
Exercise 7.04
A random sample of eight observations from an unknown distribution is given below:
(ii) (a) Making no distributional assumption, use set.seed(19), sample and either
replicate or a loop to obtain the mean of 1,000 re-samples of size 8.
(c) Use lines and density to superimpose the empirical distribution of the
estimators.
(d) Calculate the mean and standard deviation of this empirical distribution.
It is now assumed that the random sample has been taken from a χ² distribution.
(iv) (a) Use set.seed(19), rchisq and either replicate or a loop to obtain the
mean of 1,000 samples of size 8.
(c) Use lines and density to superimpose the empirical distribution of the
estimators.
(d) Calculate the mean and standard deviation of this empirical distribution.
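The non-parametric bootstrap in part (ii)(a) could look like this; the vector x below is hypothetical stand-in data, since the question's table of eight observations is not reproduced here:

```r
# Hypothetical data standing in for the question's eight observations
x <- c(1.9, 2.3, 2.6, 2.8, 3.0, 3.1, 3.5, 4.4)

# 1,000 re-sample means, sampling with replacement from the data itself
set.seed(19)
means <- replicate(1000, mean(sample(x, size = 8, replace = TRUE)))

mean(means); sd(means)   # part (ii)(d)
```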
Estimation
Answers
Exercise 7.01
(ii)
Poisson:
# claims        0      1     2   3  4  5  6+
Expected   87,507 11,678   779  35  1  0   0
Difference    382    678   221  65  9  1   0

Negative binomial:
# claims        0      1     2   3  4  5  6+
Expected   87,908 10,945 1,048  90  7  1   0
Difference     19     55    48  10  3  0   0
(c) Negative binomial expected frequencies are closer to the observed frequencies,
hence it is the better fit to the number of claims.
Exercise 7.02
(ii)
(iii)
(iv) (a)
(b) Hard to comment on which is the better fit from this graph.
(v) (c)(d)
Middle to upper sample values are higher than the model, so it has a heavier upper tail – more
positively skewed than the model.
(vi) (b)(c)
Middle to higher values get worse (with the highest value fitting very poorly) but the fit in the
middle is better than the exponential since points lie on both sides of the line.
(vii) Both have a good fit at the lower end but are worse elsewhere. Despite the single extreme
value, the gamma has a better fit in the middle than the exponential.
Exercise 7.03
There is currently no Exercise 7.03.
Exercise 7.04
(ii) (b)(c)
(iii) 2.95
(iv) (b)(c)
(v) The longer tail in the parametric bootstrap leads to larger standard deviation.
Confidence intervals
Exercises
Data requirements
These exercises require the following data file:
water.txt
Exercise 8.01
Heights of males with classic congenital adrenal hyperplasia (CAH) are assumed to be normally
distributed with a standard deviation of 8.4 cm.
(c) a 90% confidence interval for the mean height of men of the form (0, L).
The weights of women in the US are assumed to be normally distributed with standard deviation
12.1 kg.
(b) Obtain a 95% confidence interval for the mean weight of women in the US.
Exercise 8.02
The annual rainfall in centimetres at a certain weather station over the last ten years has been as
follows:
17.2, 28.1, 25.3, 26.2, 30.7, 19.2, 23.4, 27.5, 29.5, 31.6
(ii) Obtain a 99% confidence interval for the average annual rainfall.
A sample of 100 claims (in £) for damage due to water leakage on an insurance company’s
household contents policies are contained in the file ‘water.txt’.
(iii) Use t.test to obtain a 95% confidence interval for the mean water leakage damage.
The built in data object, women, contains the average heights (in inches) and weights (in lbs) for
American women aged 30–39.
(iv) Use t.test to obtain a 90% confidence interval for the average weights.
Exercise 8.03
The annual rainfall in centimetres at a certain weather station over the last ten years has been as
follows:
17.2, 28.1, 25.3, 26.2, 30.7, 19.2, 23.4, 27.5, 29.5, 31.6
(i) Obtain a 90% confidence interval for the standard deviation of the annual rainfall from
scratch using qchisq.
A sample of 100 claims (in £) for damage due to water leakage on an insurance company’s
household contents policies are contained in the file ‘water.txt’.
(ii) Obtain a 95% confidence interval for the variance of the water leakage damage.
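Part (i) can be done from first principles using the pivotal quantity (n − 1)S²/σ² ~ χ²_{n−1}; a sketch:

```r
x <- c(17.2, 28.1, 25.3, 26.2, 30.7, 19.2, 23.4, 27.5, 29.5, 31.6)
n <- length(x)

# 90% CI for the variance, then square root for the standard deviation
ci.var <- (n - 1) * var(x) / qchisq(c(0.95, 0.05), df = n - 1)
sqrt(ci.var)
```

Note that the 95th percentile of the chi-squared distribution gives the lower limit of the interval, so the quantiles are supplied in that order.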
Exercise 8.04
The annual rainfall in centimetres at a certain weather station over the last ten years has been as
follows:
17.2, 28.1, 25.3, 26.2, 30.7, 19.2, 23.4, 27.5, 29.5, 31.6
(ii) (a) Assuming that rainfall is normally distributed use set.seed(19), rnorm and
either replicate or a loop to obtain the mean of 1,000 samples of size 10.
(b) Hence, obtain a 99% parametric bootstrap confidence interval for the average
annual rainfall.
(iii) (a) Making no distributional assumption, use set.seed(19), sample and either
replicate or a loop to obtain the mean of 1,000 re-samples of size 10.
(b) Hence, obtain a non-parametric 99% confidence interval for the average annual
rainfall.
(iv) Using the method in part (ii) and the same seed, obtain a 90% confidence interval for the
standard deviation of the annual rainfall.
(v) Using the method in part (iii) and the same seed, obtain a non-parametric 90% confidence
interval for the standard deviation of the annual rainfall.
Exercise 8.05
An opinion poll of 1,000 voters found that 450 favoured Party P.
(i) (a) Use binom.test to calculate a 99% confidence interval for the proportion of
voters who favour Party P.
(b) Comment on the likelihood of more than 50% of the voters voting for Party P in
an election.
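Part (i)(a) is a one-liner; extracting conf.int from the test object keeps just the interval:

```r
# Exact 99% CI for the proportion favouring Party P
binom.test(450, 1000, conf.level = 0.99)$conf.int
```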
Exercise 8.06
A sample of 30 values from the Poisson distribution has a mean of 2.
Use poisson.test to calculate an exact 90% confidence interval for the mean rate.
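Since poisson.test works with the total count and the exposure, 30 observations with mean 2 correspond to a total of 60 events over T = 30; a sketch:

```r
# Exact 90% CI for the Poisson rate
poisson.test(60, T = 30, conf.level = 0.90)$conf.int
```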
Exercise 8.07
The average blood pressure (in mmHg) for a control group C of 10 patients and a similar group T
of 10 patients on a special diet are given in the table below:
C 73.0, 76.9, 82.8, 74.8, 83.0, 79.7, 78.2, 73.9, 74.5, 73.2
T 72.4, 76.2, 73.9, 72.2, 84.6, 75.4, 78.2, 72.8, 72.0, 72.3
(i) (a) Use two vectors and t.test to calculate a 90% confidence interval for the
difference in average blood pressures for the two groups.
(ii) It is now known that both groups were made up of the same patients at different times.
Repeat part (i)(a) given this new information.
The built in data set iris contains measurements (in cm) of various features of 3 species of iris:
setosa, versicolor and virginica.
(iii) (a) Extract the Petal.Length of the setosa and virginica species and store them in the
vectors PS and PV.
(b) Obtain a 99% confidence interval for the difference between the mean petal
lengths of the two species in part (a), assuming equal variances.
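A sketch of parts (i) and (ii); note that storing the treatment group in an object called T would shadow R's shorthand for TRUE, so a safer name is used here:

```r
ctrl <- c(73.0, 76.9, 82.8, 74.8, 83.0, 79.7, 78.2, 73.9, 74.5, 73.2)
trt  <- c(72.4, 76.2, 73.9, 72.2, 84.6, 75.4, 78.2, 72.8, 72.0, 72.3)

# Part (i)(a): unpaired comparison of the two groups
t.test(ctrl, trt, conf.level = 0.90)$conf.int

# Part (ii): the same patients measured twice, so a paired test
t.test(ctrl, trt, conf.level = 0.90, paired = TRUE)$conf.int
```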
Exercise 8.08
There is no Exercise 8.08.
Exercise 8.09
The average blood pressure (in mmHg) for a control group C of 10 patients and a similar group T
of 10 patients on a special diet are given in the table below:
C 73.0, 76.9, 82.8, 74.8, 83.0, 79.7, 78.2, 73.9, 74.5, 73.2
T 72.4, 76.2, 73.9, 72.2, 84.6, 75.4, 78.2, 72.8, 72.0, 72.3
(i) (a) Use two vectors and var.test to calculate a 99% confidence interval for the
ratio of the variances of the two groups' blood pressures.
The built in data set iris contains measurements (in cm) of various features of 3 species of iris:
setosa, versicolor and virginica.
(ii) (a) Extract the Petal.Length of the setosa and virginica species and store them in the
vectors PS and PV.
(b) Obtain a 90% confidence interval for the ratio of the variances of the petal lengths
of the two species in part (a).
Exercise 8.10
A sample of 100 claims on household policies made during the year just ended showed that 62
were due to burglary. A sample of 200 claims made during the previous year had 115 due to
burglary.
(i) Use two vectors and prop.test to calculate a 90% confidence interval for the
difference in proportions of home claims due to burglary between the two years.
(ii) Repeat part (i) using a matrix for the results instead of two vectors.
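A sketch of both approaches; in the matrix form, prop.test expects one row per year with columns of burglary and non-burglary counts:

```r
# Part (i): vectors of successes and sample sizes
prop.test(c(62, 115), c(100, 200), conf.level = 0.90)$conf.int

# Part (ii): a 2x2 matrix of (burglary, non-burglary) counts by year
m <- matrix(c(62, 38,
              115, 85), nrow = 2, byrow = TRUE)
prop.test(m, conf.level = 0.90)$conf.int
```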
Confidence intervals
Answers
Exercise 8.01
(i) (a) (157.59, 168.00)
Exercise 8.02
(ii) (20.987, 30.753)
Exercise 8.03
(i) (3.4652, 7.8166)
Exercise 8.04
(ii) (b) (21.714, 29.312)
Exercise 8.05
(i) (a) (0.40930, 0.49119)
(b) Since the 99% CI for p doesn't contain p = 0.5 (or higher values of p) it is unlikely
that Party P will gain more than 50% of the votes.
Exercise 8.06
(i) (1.5951, 2.4797)
Exercise 8.07
(i) (a) (−1.0063, 5.0063) or (−5.0063, 1.0063)
Exercise 8.08
There is no Exercise 8.08.
Exercise 8.09
(i) (a) (0.14049, 6.0111) or (0.16636, 7.1178)
Exercise 8.10
(i) (a) (−0.14339, 0.053387) or (−0.053387, 0.14339)
9
Hypothesis Tests
Exercises
Data requirements
These exercises require the following data file:
• water.txt
Exercise 9.01
Heights of males with classic congenital adrenal hyperplasia (CAH) are assumed to be normally
distributed with a standard deviation of 8.4 cm.
(i) Carry out the following tests in R using pnorm to obtain the p-value:
(a) H0 : µ = 165 vs H1 : µ < 165
(b) H0 : µ = 158 vs H1 : µ ≠ 158
The weights of women in the US are assumed to be normally distributed with standard deviation
12.1kg.
Exercise 9.02
The annual rainfall in centimetres at a certain weather station over the last ten years has been as
follows:
17.2, 28.1, 25.3, 26.2, 30.7, 19.2, 23.4, 27.5, 29.5, 31.6
(ii) Test whether the average annual rainfall has increased from its former long-term value of
22 cm:
A sample of 100 claims (in £) for damage due to water leakage on an insurance company’s
household contents policies are contained in the file ‘water.txt’.
(iii) Use t.test to test whether the mean water leakage damage is £300.
The built in data object, women, contains the average heights (in inches) and weights (in lbs) for
American women aged 30–39.
(iv) Use t.test to test whether the mean height is less than 70 inches.
Exercise 9.03
The annual rainfall in centimetres at a certain weather station over the last ten years has been as
follows:
17.2, 28.1, 25.3, 26.2, 30.7, 19.2, 23.4, 27.5, 29.5, 31.6
(i) Test whether the standard deviation of annual rainfall is equal to 10cm from scratch using
pchisq to obtain the p-value.
A sample of 100 claims (in £) for damage due to water leakage on an insurance company’s
household contents policies are contained in the file ‘water.txt’.
(ii) Test whether the standard deviation of the water leakage damage is less than £100.
Exercise 9.04
There is currently no Exercise 9.04.
Exercise 9.05
A new gene has been identified that makes carriers particularly susceptible to a particular
degenerative disease. In a random sample of 250 adult males born in the UK, 8 were found to be
carriers of the disease.
(i) Use binom.test to test whether the proportion of adult males born in the UK carrying
the gene is less than 10%.
Exercise 9.06
A random sample of 500 policies of a particular kind revealed a total of 116 claims during the last
year. Assume the annual claim frequency per policy has a Poisson distribution with mean λ .
(i) Test the null hypothesis H0 : λ = 0.18 against the alternative H1 : λ > 0.18 .
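One way to carry out the test, treating the 116 claims as a Poisson count over 500 policy-years:

```r
# Exact one-sided test of lambda = 0.18 against lambda > 0.18
poisson.test(116, T = 500, r = 0.18, alternative = "greater")
```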
Exercise 9.07
The average blood pressure (in mmHg) for a control group C of 10 patients and a similar group T
of 10 patients on a special diet are given in the table below:
C 73.0, 76.9, 82.8, 74.8, 83.0, 79.7, 78.2, 73.9, 74.5, 73.2
T 72.4, 76.2, 73.9, 72.2, 84.6, 75.4, 78.2, 72.8, 72.0, 72.3
(i) Use two vectors and t.test to test the hypothesis that patients on the special diet have
a lower average blood pressure than the control group (you may assume that the
variances of the 2 groups of patients are equal).
(ii) It is now known that both groups were made up of the same patients at different times.
Repeat part (i) given this new information.
The built in data set iris contains measurements (in cm) of various features of 3 species of iris:
setosa, versicolor and virginica.
(iii) (a) Extract the Petal.Length of the setosa and virginica species and store them in the
vectors PS and PV.
(b) Obtain the p-value for a test that the difference between the means of the two
species in part (a) is equal to 4cm.
Exercise 9.08
The average blood pressure (in mmHg) for a control group C of 10 patients and a similar group T
of 10 patients on a special diet are given in the table below:
C 73.0, 76.9, 82.8, 74.8, 83.0, 79.7, 78.2, 73.9, 74.5, 73.2
T 72.4, 76.2, 73.9, 72.2, 84.6, 75.4, 78.2, 72.8, 72.0, 72.3
(i) Use two vectors and var.test to test the hypothesis that patients on the special diet
have a higher variance in their blood pressures.
The built in data set iris contains measurements (in cm) of various features of 3 species of iris:
setosa, versicolor and virginica.
(ii) (a) Extract the Petal.Length of the setosa and virginica species and store them in the
vectors PS and PV.
(b) Calculate the p-value for a test that the variances of the two species in part (a) are
equal and comment.
Exercise 9.09
There is currently no Exercise 9.09.
Exercise 9.10
A sample of 100 claims on household policies made during the year just ended showed that 62
were due to burglary. A sample of 200 claims made during the previous year had 115 due to
burglary.
(i) Use two vectors and prop.test to test the hypothesis that the underlying proportion
of home claims that are due to burglary is higher in the second year than in the first. Do
not use a continuity correction.
(ii) Repeat part (i) using a matrix for the results instead of two vectors.
An actuary claims that the rates of burglary should be modelled using a Poisson distribution.
(iii) Use poisson.test to test whether the rates of burglary have changed between the
two years.
Exercise 9.11
The average blood pressure (in mmHg) for a control group C of 10 patients and a similar group T
of 10 patients on a special diet are given in the table below:
C 73.0, 76.9, 82.8, 74.8, 83.0, 79.7, 78.2, 73.9, 74.5, 73.2
T 72.4, 76.2, 73.9, 72.2, 84.6, 75.4, 78.2, 72.8, 72.0, 72.3
(i) Store these results in two vectors and the value of the difference between their means in
the object ObsT.
(ii) Carry out a permutation test to test the hypothesis that patients on the special diet have a
lower average blood pressure than the control group:
(b) Create an object index that gives the positions of the values in results.
(c) Use the function combn on the object index to calculate all the combinations of
patients in the control group and store this in the object p.
(d) Use a loop to store the differences in the average blood pressures of the two
groups in the object dif.
(iii) (a) Plot a labelled histogram of the differences in the average blood pressures of the
two groups for every combination.
(b) Use the function abline to add a dotted vertical blue line to show the critical
value.
(c) Use the function abline to add a dashed vertical red line to show the observed
statistic.
(iv) (a) Calculate the p-value of the test based on this permutation test.
(b) The p-value calculated under the normality assumption was 13.19%. Comment on
your result.
(v) Repeat part (ii) but with 10,000 resamples from the object results using the function
sample and set.seed(77).
(vi) Calculate the p-value of the test using resampling and compare it to the answer using all
the combinations calculated in part (iv).
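The steps in parts (i), (ii) and (iv)(a) might be sketched as follows; combn(index, 10) enumerates every possible choice of 10 patients for the control group:

```r
C   <- c(73.0, 76.9, 82.8, 74.8, 83.0, 79.7, 78.2, 73.9, 74.5, 73.2)
trt <- c(72.4, 76.2, 73.9, 72.2, 84.6, 75.4, 78.2, 72.8, 72.0, 72.3)

results <- c(C, trt)
ObsT <- mean(C) - mean(trt)            # part (i)

index <- 1:length(results)             # part (ii)(b)
p <- combn(index, 10)                  # part (ii)(c): all control-group choices

dif <- numeric(ncol(p))                # part (ii)(d)
for (i in 1:ncol(p)) {
  dif[i] <- mean(results[p[, i]]) - mean(results[-p[, i]])
}

# Part (iv)(a): one-sided p-value
sum(dif >= ObsT) / length(dif)
```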
Exercise 9.12
The average blood pressure (in mmHg) for a group of 10 patients under a controlled diet (C) and a
special diet (T) are given in the table below:
C 73.0, 76.9, 82.8, 74.8, 83.0, 79.7, 78.2, 73.9, 74.5, 73.2
T 72.4, 76.2, 73.9, 72.2, 84.6, 75.4, 78.2, 72.8, 72.0, 72.3
(i) Store the differences of pairs of results in the vector D and the mean value of these
differences in the object ObsD.
(ii) Carry out a permutation test to test the hypothesis that patients on the special diet have a
lower average blood pressure than the control group:
(b) Use the function permutations from the package gtools to calculate all the
permutations of the signs of the differences in object D and store these
permutations in the object p.
(c) Use a loop to store the mean differences in the average blood pressures of the
two groups in the object dif.
(iii) (a) Calculate the p-value of the test based on this permutation test.
(b) The p-value calculated under the normality assumption was 2.885%. Comment on
your result.
(iv) Repeat part (ii) but with 10,000 resamples from the object sign using the function
sample in the loop and set.seed(79).
(v) Calculate the p-value of the test using resampling and compare it to the answer using all
the combinations calculated in part (iii).
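A sketch of the sign-permutation approach, assuming the gtools package is installed; each row of p is one of the 2^10 possible sign patterns, and the differences here are taken as T − C (one labelling of the pairs):

```r
C   <- c(73.0, 76.9, 82.8, 74.8, 83.0, 79.7, 78.2, 73.9, 74.5, 73.2)
trt <- c(72.4, 76.2, 73.9, 72.2, 84.6, 75.4, 78.2, 72.8, 72.0, 72.3)

D <- trt - C                # part (i)
ObsD <- mean(D)

library(gtools)
p <- permutations(2, 10, c(-1, 1), repeats.allowed = TRUE)  # part (ii)(b)

dif <- numeric(nrow(p))     # part (ii)(c)
for (i in 1:nrow(p)) {
  dif[i] <- mean(p[i, ] * abs(D))
}

# Part (iii)(a): one-sided p-value (diet lowers blood pressure)
sum(dif <= ObsD) / length(dif)
```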
Exercise 9.13
The numbers of employees in a particular company who are absent on each work day are given in
the table below:
(b) Give the expected frequencies if the number of employees absent was
independent of the day (ie uniformly distributed).
(c) Use chisq.test to determine whether the observed results fit a uniform
distribution.
According to genetic theory the number of colour-strains (pink, white and blue) of a certain
flower should appear in the ratio 2:3:5.
(ii) (a) Use chisq.test to determine whether the observed results are consistent
with genetic theory.
An insurer believes that the distribution of the number of claims on a particular type of policy is
binomial with parameters n = 3 and p . A random sample of the number of claims on 153 policies
revealed the following results:
Number of claims 0 1 2 3
Number of policies 60 75 16 2
(iii) (a) Show that the method of moments estimate for p is 0.246.
(b) Use chisq.test to carry out a goodness of fit test for the specified binomial
model for the number of claims on each policy, ensuring that the expected
frequencies are greater than 5.
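A sketch of part (iii)(b), combining the last two cells so that every expected frequency exceeds 5; note that chisq.test here reports 2 degrees of freedom, without adjusting for the estimated parameter:

```r
obs <- c(60, 75, 16 + 2)                     # combine the 2 and 3 claim cells

# Cell probabilities under the fitted binomial(3, 0.246) model
p.fit <- dbinom(0:3, size = 3, prob = 0.246)
probs <- c(p.fit[1:2], p.fit[3] + p.fit[4])

chisq.test(obs, p = probs)
```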
Exercise 9.14
In an investigation into the effectiveness of car seat belts, 292 accident victims were classified
according to the severity of their injuries and whether they were wearing a seat belt at the time
of the accident. The results were as follows:
Death 3 47
Severe injury 78 32
(i) (a) Store these names and frequencies in the matrix obs.
The eye colour and hair colour for a group of Caucasian people were noted:
(ii) (a) Store these names and frequencies in the matrix obs2.
(b) Use chisq.test to determine whether eye colour and hair colour are
independent.
Exercise 9.15
The eye colour and hair colour for a group of Caucasian people were noted:
(ii) Use fisher.test to determine whether eye colour and hair colour are independent
and give the exact p-value.
(iii) Compare the p-value obtained in part (ii) using chisq.test both with and without
Yates’ continuity correction.
(iv) Get R to display only the p-value from part (ii), rather than the whole test.
9
Hypothesis Tests
Answers
Exercise 9.01
(i) (a) Statistic = −0.8282156, p-value = 0.2037742, hence do not reject H0 and conclude
µ = 165.
(b) Statistic = 1.807016, p-value = 0.07075981, hence do not reject H0 and conclude
µ = 158.
(ii) (b) Statistic = 0.3019688, p-value = 0.7626759, hence do not reject H0 and conclude the
mean weight is 64.2 kg.
Exercise 9.02
(ii) Statistic = 2.5758, p-value = 0.01495, hence sufficient evidence to reject H0, annual
rainfall has increased.
(iii) Statistic = 1.3138, p-value = 0.192, hence insufficient evidence to reject H0, mean water
leakage damage is £300.
(iv) Statistic = −4.3301, p-value = 0.000346, hence sufficient evidence to reject H0, mean
height is less than 70 inches.
Exercise 9.03
(i) Statistic = 2.03161, p-value = 0.01808439, hence sufficient evidence to reject H0, the
standard deviation of annual rainfall is not equal to 10 cm.
(ii) Statistic = 86.7747, p-value = 0.195035, hence insufficient evidence to reject H0, the
standard deviation of the water leakage damage is not less than £100.
Exercise 9.04
There is currently no Exercise 9.04.
Exercise 9.05
(i) p-value = 3.977e-05 (ie 0.004%), hence sufficient evidence to reject H0 , the proportion of
adult males born in the UK carrying the gene is less than 10%.
Exercise 9.06
(i) p-value = 0.004787, hence sufficient evidence to reject H0 , we conclude that λ > 0.18 .
(ii) 0.232
Exercise 9.07
(i) p-value = 0.1319, hence insufficient evidence to reject H0, patients on the special diet
have the same average blood pressure as the control group.
(ii) p-value = 0.02885, hence sufficient evidence to reject H0 , patients on the special diet
have lower average blood pressure than the control group.
(iii) (b) p-value = 0.2759, hence insufficient evidence to reject H0 , the difference
between the means of the two species is equal to 4cm.
Exercise 9.08
(i) p-value = 0.4509, hence insufficient evidence to reject H0, patients in the two groups
have the same variance in their blood pressures.
(ii) (b) The p-value is as good as zero, hence there is sufficient evidence to reject H0 , the
variances of the two species are not equal.
Exercise 9.09
There is currently no Exercise 9.09.
Exercise 9.10
(i) p-value = 0.2275, hence insufficient evidence to reject H0 , the underlying proportion of
home claims that are due to burglary is the same in both years.
(iii) p-value = 0.633, hence insufficient evidence to reject H0 , the rates of burglary have not
changed between the two years.
Exercise 9.11
(iii)
(b) This p-value is very close to the value under the normality assumption.
(vi) p-value = 0.1337, hence insufficient evidence to reject H0 , no difference in mean blood
pressure under special diet.
Exercise 9.12
(iii) (a) p-value = 0.015625, hence sufficient evidence to reject H0 , blood pressure drops
under special diet.
(b) The non-parametric p-value is lower than the one under the normality
assumption.
(v) (a) p-value = 0.0153, hence sufficient evidence to reject H0 , blood pressure drops
under special diet.
(b) This p-value is very close to the value using all combinations.
Exercise 9.13
(i) (b) 300/5 = 60 each day
(c) Statistic = 6.4333 on 4 degrees of freedom, p-value = 0.169, hence do not reject H0
and conclude absent employees are evenly distributed.
(ii) (a) Statistic = 6.6694 on 2 degrees of freedom, p-value = 0.03562, hence reject H0 and
conclude that the observed results are not consistent with genetic theory.
(b) 2
(iii) (b) Combining the last two groups so expected frequencies are greater than 5 gives a
statistic of 3.4675 on 2 degrees of freedom, p-value = 0.1766, hence do not reject H0
and conclude that the binomial model is a good fit.
(c) Statistic of 3.4675 on 1 degree of freedom, p-value = 0.0625, hence do not reject H0
and conclude that the binomial model is a good fit.
Exercise 9.14
(i) (b) Statistic = 85.449 on 2 degrees of freedom, p-value ≈ 0, hence reject H0 and conclude
injury is NOT independent of wearing a seat belt.
(ii) (b) Statistic = 28.885 on 1 degree of freedom, p-value ≈ 0, hence reject H0 and conclude
eye colour is not independent of hair colour.
(c) Statistic = 30.962 on 1 degree of freedom, p-value ≈ 0, hence reject H0 and conclude
eye colour is not independent of hair colour.
Exercise 9.15
(ii) p-value = 2.36e-08, we reject H0 and conclude eye colour depends on hair colour.
(iii) With Yates’ continuity correction: p-value = 7.68e-08, we reject H0 and conclude eye
colour depends on hair colour.
Without Yates’ continuity correction: p-value = 2.63e-08, we reject H0 and conclude eye
colour depends on hair colour.
(iv) 2.362163e-08
10
Data Analysis
Exercises
Data requirements
These exercises require the following data files:
• baby weights.txt
• AIDS.csv
Exercise 10.01
A new computerised ultrasound scanning technique has enabled doctors to monitor the weights
of unborn babies. The table below shows the estimated weights for one particular baby at
fortnightly intervals during the pregnancy.
Estimated baby weight (kg) 1.6 1.7 2.5 2.8 3.2 3.5
(i) (a) Load the data frame and store it in the data frame baby.
The numbers of new AIDS cases recorded in the US in successive years during the early part of the
AIDS epidemic are shown in the table below.
Year 81 82 83 84 85 86 87 88
Number of cases (000s) 0.34 1.20 3.15 6.37 12.04 19.40 29.11 36.13
Source: www.avert.org
(ii) (a) Load the data frame and store it in the data frame AIDS.
(iii) (a) Create a new data frame AIDS2 which contains the log of the number of cases.
Exercise 10.02
This question uses the ‘baby weights’ data from Exercise 10.01 that should be stored in the data
frame baby.
(i) Use the cor function to obtain the following correlation coefficients:
(a) Pearson
(b) Spearman
(c) Kendall.
(ii) (a) Store the gestation period in vector x and the weight in vector y.
(b) Create objects Sxx, Syy and Sxy which contain the sum of squares for the baby
weights data.
(iii) (a) Calculate the Pearson correlation coefficient of the ranks of the gestation period
and ranks of the weights.
rs = 1 − 6Σi di² / (n(n² − 1))
This question uses the ‘AIDS’ data from Exercise 10.01 that should be stored in the data frame
AIDS and the log data stored in the data frame AIDS2.
(iv) (a) Use cor to calculate the correlation coefficient for the AIDS data before and after
logging.
(b) Comment on the effect of logging the data on the linear correlation.
Exercise 10.03
This question uses the ‘baby weights’ data from Exercise 10.01 that should be stored in the data
frame baby.
(i) (a) Use cor.test and the Pearson correlation coefficient to test whether ρ = 0 .
(c) Use the statistic from part (i)(b) to obtain the p-value for the test in part (i)(a).
(iii) Use cor.test to test if the true value of Kendall’s correlation coefficient is less than
zero.
(iv) Use Fisher’s transformation to test whether H0 : ρ = 0.9 vs H1 : ρ > 0.9, stating the
p-value.
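Part (iv) could be sketched like this, using r = 0.984336 (the Pearson coefficient reported in the answers) and n = 6 observations; under H0, tanh⁻¹(r) is approximately N(tanh⁻¹(0.9), 1/(n − 3)):

```r
r <- 0.984336
n <- 6

z  <- atanh(r)     # Fisher's transformation of the sample coefficient
z0 <- atanh(0.9)   # transformed value under the null hypothesis

# One-sided p-value for H1: rho > 0.9
pnorm((z - z0) * sqrt(n - 3), lower.tail = FALSE)
```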
Exercise 10.04
The built in data set iris contains measurements (in cm) of the variables sepal length, sepal width,
petal length and petal width, respectively, for 50 flowers from each of 3 species (Iris setosa,
versicolor, and virginica) of iris.
(i) Extract the four measurements for the setosa species only and store them in the 50 × 4
data frame, SDF.
(ii) Use plot to obtain a scattergraph of each pair of measurements for the setosa species.
(iv) Comment on the relationship between Petal Width and the other measurements.
Exercise 10.05
This question uses the setosa iris data from Exercise 10.04 that should be stored in the data frame
SDF.
(i) Use the cor function to obtain the following correlation coefficients between all the four
pairs of variables:
(a) Pearson
(b) Spearman
(c) Kendall.
(ii) Obtain the Spearman correlation coefficient between the Sepal Length and the Petal
Length only.
Exercise 10.06
This question uses the setosa iris data from Exercise 10.04 that should be stored in the data frame
SDF.
(i) (a) Use cor.test and the Pearson correlation coefficient to test H0 : ρ = 0 vs
H1 : ρ > 0 between Sepal Length and Petal Length.
(b) Extract the statistic and the degrees of freedom for the test in part (i)(a).
(c) Use the statistic from part (i)(b) to obtain the p-value for the test in part (i)(a).
(ii) Use cor.test and Kendall’s correlation coefficient to test whether the true value of τ
is zero between Sepal Width and Petal Length.
Exercise 10.07
This question uses the ‘baby weights’ data from Exercise 10.01 that should be stored in the data
frame baby.
(i) Use prcomp to carry out PCA on the baby weights data and store it in the object pca1.
(ii) (a) Obtain the eigenvectors (matrix W) of each principal component in pca1.
(iii) (a) Obtain the principal components decomposition (matrix P) for the baby weight
data from pca1.
(v) (a) Obtain the percentages each of the variances of the principal components using
the summary of the prcomp function.
(b) Using part (a) determine which components, if any, should be dropped.
The scattergraph of the centred baby weight data is shown by the + signs below:
The circles show the points obtained when the second principal component is removed.
Exercise 10.08
This question uses the setosa iris data from Exercise 10.04 that should be stored in the data frame
SDF.
(i) Using scale or otherwise, obtain a scaled matrix of the 50 observations of 4 variables
which have zero mean and store in the matrix object X.
(ii) Use eigen to obtain the eigenvectors of XT X and store them in the matrix object W.
(b) Calculate what percentage each of the variances in matrix S are of the total.
(b) Obtain the percentages in part (iv)(b) using the summary of the prcomp
function.
(c) Draw a scree diagram using plot on the result of the prcomp function and hence
state which principal component(s) should be dropped to simplify the
dimensionality.
(vi) (a) Carry out PCA with scaling of the data using prcomp.
(b) Using the Kaiser Criterion state which principal component(s) should be dropped
to simplify the dimensionality.
(vii) (a) Using cbind and rep, or otherwise, obtain a new matrix P1 which has only the
first two principal components and vectors of zeroes for the removed
components.
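Parts (i) and (ii) might be sketched as follows, working from the built-in iris data:

```r
# The 50 setosa observations of the four measurements
SDF <- iris[iris$Species == "setosa", 1:4]

# Part (i): centre each column (no scaling of the variances)
X <- scale(as.matrix(SDF), center = TRUE, scale = FALSE)

# Part (ii): eigenvectors of X'X
W <- eigen(t(X) %*% X)$vectors

# Principal components decomposition
P <- X %*% W
```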
10
Data Analysis
Answers
Exercise 10.01
(i) (c) Apart from the result at 32 weeks it is nearly a perfect straight line so there is a
very strong linear relationship.
(ii) (b)
(iii) (c) It’s much more linear but it still appears to have some curvature.
Exercise 10.02
(i) (a) 0.984336
(b) 1
(c) 1
(iii) (a) 1
(b) The correlation coefficient has increased – so logging the data has improved the
linearity.
Exercise 10.03
(i) (a) p-value = 0.0003661 reject H0 and conclude ρ ≠ 0 .
(b) 11.16642
(iii) p-value = 1 do NOT reject H0 and conclude Kendall’s correlation coefficient is not less
than zero.
Exercise 10.04
(ii)
(iv) It looks like there might be weak positive correlation between Petal Width and all the
other variables. For example there are few values in the top left quadrant.
Exercise 10.05
(i) (a)
(b)
(c)
(ii) 0.2788849
Exercise 10.06
(i) (a) p-value = 0.03035, reject H0 and conclude that there is a positive linear
relationship between Sepal Length and Petal Length.
(b) 1.920876, 48
(ii) p-value = 0.1876, do not reject H0 and conclude that there is no monotonic relationship
between Sepal Width and Petal Length.
Exercise 10.07
(ii) (a)
PC1 PC2
gestation 0.9797143 -0.2003992
weight 0.2003992 0.9797143
(b) These are the orthogonal vectors of the new co-ordinate system (which is a
rotation of the old co-ordinate system).
(iii) (a)
PC1 PC2
[1,] -5.0889509 0.07126724
[2,] -3.1094823 -0.23155967
[3,] -0.9897343 0.15141345
[4,] 1.0298141 0.04452941
[5,] 3.0694025 0.03561680
[6,] 5.0889509 -0.07126724
(b) Matrix P expresses the 6 points in terms of the two new principal components.
This is effectively a new co-ordinate system with the most important one across
the horizontal axis.
(iv) (a) Centre: gestation = 35.00, weight = 2.55; Scale: FALSE
(b) prcomp subtracted 35 from gestation and 2.55 from weight to make the means
zero. prcomp did not change the scale ie did not divide to change the variance.
(vi) We have essentially removed one dimension and so left a straight line which is the trend
line. We can see it is a good fit to most of the points.
Exercise 10.08
(ii)
[,1] [,2] [,3] [,4]
[1,] 0.66907840 0.5978840 0.4399628 -0.03607712
[2,] 0.73414783 -0.6206734 -0.2746075 -0.01955027
[3,] 0.09654390 0.4900556 -0.8324495 -0.23990129
[4,] 0.06356359 0.1309379 -0.1950675 0.96992969
(iv) (a)
(c) The first two PCs explain 88.41% of the variation and the first three explain
97.08%, which is more than 90%. So we should probably drop just the fourth PC.
Note: if you used matrix P rather than the output of prcomp (as asked for), you
will have the opposite signs for PC1.
(c)
(vi) (b)
The Kaiser Criterion only keeps components whose variance (or standard deviation)
of the scaled data is greater than 1. Hence, it would suggest keeping only the
first two PCs.
It captures sepal length and width relationship well but not the other
relationships.
11
Linear Regression
Exercises
Data requirements
These exercises require the following data files:
• baby weights.txt
• growth.csv
Exercise 11.01
There is currently no Exercise 11.01.
However, you may wish to revisit the exercises from the data analysis chapter if it has been a
while since you looked at them. The exercises in this chapter will assume you are able to recall
and use the R code from that chapter.
Exercise 11.02
A new computerised ultrasound scanning technique has enabled doctors to monitor the weights
of unborn babies. The table below shows the estimated weights for one particular baby at
fortnightly intervals during the pregnancy.
Gestation period (weeks) 30 32 34 36 38 40
Estimated baby weight (kg) 1.6 1.7 2.5 2.8 3.2 3.5
(i) (a) Load the data in the file ‘baby weights.txt’, and store it in the data frame baby.
(b) Plot a labelled scattergraph of the data and add a red dashed regression line onto
your scatterplot.
(iv) Add blue points to the scatterplot to show the fitted values.
(v) Obtain the expected baby’s weight at 42 weeks (assuming it hasn’t been born by then):
Exercise 11.03
This question uses the ‘baby weights’ linear regression model, model1, of weight on gestation
period, created in an earlier exercise.
(i) Obtain the total sum of squares in the baby weights model together with its split between
the residual sum of squares and the regression sum of squares:
(b) from first principles using the functions sum, mean, fitted and residuals.
(iii) Obtain the correlation coefficient from the extracted coefficient of determination.
Exercise 11.04
This question uses the ‘baby weights’ linear regression model, model1, of weight on gestation
period, created in an earlier exercise.
(iii) Extract the estimated value of beta, the standard error of beta and the degrees of
freedom and store them in the objects b, se and dof.
(iv) Using the objects created in part (iii), use a first principles approach to:
(b) obtain the statistic and p-value for a test of H0 : β = 0.25 vs H1 : β < 0.25 .
(c) obtain the statistic and p-value for a test of H0 : β = 0.18 vs H1 : β ≠ 0.18 .
Exercise 11.05
This question uses the ‘baby weights’ linear regression model, model1, of weight on gestation
period, created in an earlier exercise.
(i) Obtain the results of an F-test to test the ‘no linear relationship’ hypothesis using the:
(ii) Calculate the F statistic and p-value from first principles by extracting the mean sum of
squares and degrees of freedom from the ANOVA table.
(iii) Obtain a 95% confidence interval for the error variance, σ 2 , from first principles.
Exercise 11.06
This question uses the ‘baby weights’ linear regression model, model1, of weight on gestation
period, created in an earlier exercise.
(iv) Obtain a 99% confidence interval for the mean weight of a baby at 0 weeks:
(a) the mean weights of babies at 20, 21, 22, 23, 24 weeks
(b) 95% confidence intervals for the mean weight of a baby at 20, 21, 22, 23, 24
weeks.
Exercise 11.07
This question uses the ‘baby weights’ linear regression model, model1, of weight on gestation
period, created in an earlier exercise.
(ii) (a) Obtain a plot of the residuals against the fitted values.
(b) Comment on the constancy of the variance and whether a linear model is
appropriate.
(iv) Examine the final two graphs obtained by plot(model1) and comment.
Exercise 11.08
Part (i) of this question uses the ‘baby weights’ linear regression model, model1, of weight on
gestation period, created in an earlier exercise.
(i) (a) Obtain a new linear regression model, model2, based on the data without the
second data point (gestation of 32 weeks).
(b) By examining the new value of R2 comment on the fit of model2 compared to
that of model1 which had R2 = 0.9689 .
x 1 2 3 4 5 6 7 8 9 10
y 0.33 0.51 0.75 1.16 1.90 2.59 5.14 7.39 11.3 17.4
(a) Load the csv file and store it in the data frame growth.
(iv) (a) Obtain estimates for the slope and intercept parameters for model3.
(b) Add a red dashed regression line to your scatterplot of lny vs x from part (ii)(c).
(b) Re-plot the scatterplot of y vs x and this time add blue points to the scatterplot
to show the fitted values of y using model3.
(c) Add a dashed red regression curve that passes through the fitted points.
(vi) Obtain a 95% confidence interval for the mean value of y when x = 8.5 .
11
Linear Regression
Answers
Exercise 11.01
There is currently no Exercise 11.01.
Exercise 11.02
(ii) (a) slope parameter = 0.2043, intercept parameter = −4.6000.
(b) See part (iv) but ignore the points on the regression line.
(iii)
1 2 3 4 5 6
1.528571 1.937143 2.345714 2.754286 3.162857 3.571429
(iv)
(v) 3.98 kg
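These answers can be reproduced with code along the following lines. This is a sketch: the column names gestation and weight are assumptions about the contents of 'baby weights.txt', and the data frame is built inline (the gestation values are inferred from the question: fortnightly readings with the second at 32 weeks).

```r
# Sketch of Exercise 11.02; normally: baby <- read.table("baby weights.txt",
# header = TRUE). Column names and gestation values are assumptions.
baby <- data.frame(gestation = seq(30, 40, by = 2),      # weeks, fortnightly
                   weight    = c(1.6, 1.7, 2.5, 2.8, 3.2, 3.5))  # kg

# (i)(b) labelled scattergraph with a red dashed regression line
plot(baby$gestation, baby$weight,
     xlab = "Gestation (weeks)", ylab = "Weight (kg)",
     main = "Estimated baby weight")
model1 <- lm(weight ~ gestation, data = baby)
abline(model1, col = "red", lty = "dashed")

coef(model1)                     # (ii)(a) slope 0.2043, intercept -4.6000
fitted(model1)                   # (iii) the fitted values quoted above

# (iv) add the fitted values to the scatterplot as blue points
points(baby$gestation, fitted(model1), col = "blue")

# (v) expected weight at 42 weeks: 3.98 kg
predict(model1, newdata = data.frame(gestation = 42))
```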
Exercise 11.03
(i) SSTOT = 3.015 , SSRES = 0.09371 and SSREG = 2.92129 .
(ii) R2 = 0.9689173
(iii) r = 0.984336
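The first-principles split of the sums of squares can be sketched as follows (the data frame is rebuilt inline; the gestation values are an assumption inferred from the question).

```r
# Setup, as in Exercise 11.02 (gestation values assumed: fortnightly, 30-40)
baby <- data.frame(gestation = seq(30, 40, by = 2),
                   weight    = c(1.6, 1.7, 2.5, 2.8, 3.2, 3.5))
model1 <- lm(weight ~ gestation, data = baby)

# (i)(b) sums of squares from first principles
y      <- baby$weight
SS_TOT <- sum((y - mean(y))^2)               # 3.015
SS_RES <- sum(residuals(model1)^2)           # 0.09371
SS_REG <- sum((fitted(model1) - mean(y))^2)  # 2.92129

# (ii)/(iii) coefficient of determination and correlation coefficient
R2 <- summary(model1)$r.squared              # 0.9689173
r  <- sqrt(R2)                               # 0.984336
```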
Exercise 11.04
(i) Statistic = 11.17 and p-value = 0.000366, hence reject H0 and conclude β ≠ 0 .
(b) Since the 95% confidence interval, (0.153, 0.255), contains β = 0.24 , we do not
reject H0 .
(b) Statistic = −2.499 and p-value = 0.0334, hence reject H0 and conclude β < 0.25 .
(c) Statistic = 1.327 and p-value = 0.2550457, hence do not reject H0 and conclude
β = 0.18 .
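The first-principles tests can be sketched as follows (data rebuilt inline; gestation values are an assumption inferred from the question).

```r
# Setup, as in Exercise 11.02 (gestation values assumed: fortnightly, 30-40)
baby <- data.frame(gestation = seq(30, 40, by = 2),
                   weight    = c(1.6, 1.7, 2.5, 2.8, 3.2, 3.5))
model1 <- lm(weight ~ gestation, data = baby)

# (iii) extract the slope estimate, its standard error and the df
b   <- summary(model1)$coefficients["gestation", "Estimate"]
se  <- summary(model1)$coefficients["gestation", "Std. Error"]
dof <- model1$df.residual

# (iv)(b) H0: beta = 0.25 vs H1: beta < 0.25 (one-sided, lower tail)
t1 <- (b - 0.25) / se        # -2.499
p1 <- pt(t1, dof)            # 0.0334

# (iv)(c) H0: beta = 0.18 vs H1: beta != 0.18 (two-sided)
t2 <- (b - 0.18) / se        # 1.327
p2 <- 2 * pt(-abs(t2), dof)  # 0.2550457
```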
Exercise 11.05
(i) F statistic = 124.69 and p-value = 0.0003661, hence reject H0 and conclude there is a
linear relationship between gestation and weight.
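Parts (ii) and (iii) can be sketched as follows, extracting the pieces of the ANOVA table (data rebuilt inline; gestation values are an assumption inferred from the question).

```r
# Setup, as in Exercise 11.02 (gestation values assumed: fortnightly, 30-40)
baby <- data.frame(gestation = seq(30, 40, by = 2),
                   weight    = c(1.6, 1.7, 2.5, 2.8, 3.2, 3.5))
model1 <- lm(weight ~ gestation, data = baby)

# (ii) F statistic and p-value from the ANOVA table
tab    <- anova(model1)
MS_REG <- tab["gestation", "Mean Sq"]
MS_RES <- tab["Residuals", "Mean Sq"]
df2    <- tab["Residuals", "Df"]
Fstat  <- MS_REG / MS_RES                           # 124.69
pval   <- pf(Fstat, tab["gestation", "Df"], df2,
             lower.tail = FALSE)                    # 0.0003661

# (iii) 95% CI for sigma^2, using df2 * MS_RES / sigma^2 ~ chi-squared(df2)
c(df2 * MS_RES / qchisq(0.975, df2),
  df2 * MS_RES / qchisq(0.025, df2))
```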
Exercise 11.06
(i) 2.141429 kg
(iii) −4.6 kg, clearly the linear relationship does not continue backwards to conception.
(b) (−1.30, 0.267) , (−1.04, 0.422) , (−0.788, 0.577) , (−0.535, 0.732) , (−0.282, 0.888)
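The predictions and confidence intervals can be sketched with predict() (data rebuilt inline; gestation values are an assumption inferred from the question).

```r
# Setup, as in Exercise 11.02 (gestation values assumed: fortnightly, 30-40)
baby <- data.frame(gestation = seq(30, 40, by = 2),
                   weight    = c(1.6, 1.7, 2.5, 2.8, 3.2, 3.5))
model1 <- lm(weight ~ gestation, data = baby)

# (iv) 99% CI for the mean weight at 0 weeks (a long extrapolation)
predict(model1, newdata = data.frame(gestation = 0),
        interval = "confidence", level = 0.99)

# (v)(a)/(b) fitted means and 95% CIs at 20, 21, 22, 23, 24 weeks
ci <- predict(model1, newdata = data.frame(gestation = 20:24),
              interval = "confidence", level = 0.95)
ci
```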
Exercise 11.07
(i)
1 2 3 4 5
0.07142857 -0.23714286 0.15428571 0.04571429 0.03714286
6
-0.07142857
(ii) (a)
(b) It’s hard to tell with so few values – but point 2 looks like an outlier. If we include
point 2 then there is no discernible pattern, however if we omit point 2 then there
could be a possible pattern.
The variance appears constant (in the sense it's not increasing).
(iii) (a)
(b) It’s hard to tell with so few values – but point 2 looks like an outlier. If we don’t
include that value then it seems OK.
(iv)
Again, it’s hard to tell with so few values but it looks like constant variance.
The diagram shows that points 1 and 6 have the most influence. However, point 2 is
both an outlier and a point of high influence, and therefore should be removed.
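The diagnostic output above can be reproduced along these lines (data rebuilt inline; gestation values are an assumption inferred from the question).

```r
# Setup, as in Exercise 11.02 (gestation values assumed: fortnightly, 30-40)
baby <- data.frame(gestation = seq(30, 40, by = 2),
                   weight    = c(1.6, 1.7, 2.5, 2.8, 3.2, 3.5))
model1 <- lm(weight ~ gestation, data = baby)

residuals(model1)          # (i) the residuals quoted above

# (ii)(a) residuals against fitted values
plot(fitted(model1), residuals(model1),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)

# (iii)/(iv) Q-Q, scale-location and leverage plots
plot(model1)   # produces the four standard diagnostic plots in turn
```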
Exercise 11.08
(i) (b) R2 = 0.9935 which is greater hence model 2 has a better fit.
(ii) (b)
(c)
(b)
(b)(c)
(vi) (8.41,9.63)
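Parts (iv) to (vi) can be sketched as follows, with the growth data built inline rather than read with read.csv("growth.csv").

```r
# Sketch for Exercise 11.08(iv)-(vi); the data are as tabulated in the question
growth <- data.frame(x = 1:10,
                     y = c(0.33, 0.51, 0.75, 1.16, 1.90,
                           2.59, 5.14, 7.39, 11.3, 17.4))

# (iv) regress ln y on x
model3 <- lm(log(y) ~ x, data = growth)
coef(model3)

# (v)(b) re-plot y vs x and add the fitted values of y as blue points
plot(growth$x, growth$y, xlab = "x", ylab = "y")
points(growth$x, exp(fitted(model3)), col = "blue")

# (vi) 95% CI for the mean of y at x = 8.5: exponentiate the CI for mean ln y
ci <- predict(model3, newdata = data.frame(x = 8.5), interval = "confidence")
exp(ci[1, c("lwr", "upr")])   # approximately (8.41, 9.63)
```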
11b
Multivariate Linear
Regression
Exercises
Data requirements
These exercises do not require you to upload any data files.
Exercise 11b.01
There is currently no Exercise 11b.01.
However, you may wish to revisit the exercises from the data analysis and linear regression
chapters if it has been a while since you looked at them. The exercises in this chapter will assume
you are able to recall and use the R code from those chapters.
Exercise 11b.02
The built in data set iris contains measurements (in cm) of the variables sepal length, sepal
width, petal length and petal width, respectively, for 50 flowers from each of 3 species (Iris
setosa, versicolor, and virginica) of iris.
(i) Extract the four measurements for the versicolor species only and store them in the 50×4
data frame, VDF.
A scattergraph of the VDF shows positive correlation between Petal.Width and the other
variables:
(ii) Using the versicolor iris data fit a linear regression model, model2, with Petal.Width as
the response variable and Sepal.Length, Sepal.Width and Petal.Length as explanatory
variables:
y = α + β1 x1 + β2 x2 + β3 x3
(v) Obtain the expected petal width on a versicolor iris with sepal length 5.1cm, sepal width
3.5cm and petal length 1.4cm:
Exercise 11b.03
This question uses the versicolor iris linear regression model, model2, with Petal.Width (y) as
the response variable and Sepal.Length (x1 ) , Sepal.Width (x2 ) and Petal.Length (x3 ) as
explanatory variables:
y = α + β1 x1 + β2 x2 + β3 x3
(i) Obtain the total sum of squares in model2 together with its split between the residual
sum of squares and the regression sums of squares using the anova command.
(iii) Obtain the adjusted coefficient of determination, R2adj, from the:
Exercise 11b.04
This question uses the versicolor iris linear regression model, model2, with Petal.Width (y) as
the response variable and Sepal.Length (x1 ) , Sepal.Width (x2 ) and Petal.Length (x3 ) as
explanatory variables:
y = α + β1 x1 + β2 x2 + β3 x3
(iii) Extract the value of β2 , the standard error of β2 and the degrees of freedom and store
them in the objects b2, se2 and dof.
(iv) Using the objects created in part (iii), use a first principles approach to:
(a) obtain a 90% confidence interval for β2 and compare to part (ii)(a).
(b) obtain the statistic and p-value for a test of H0 : β2 = 0.3 vs H1 : β2 < 0.3 .
Exercise 11b.05
This question uses the versicolor iris linear regression model, model2, with Petal.Width (y) as
the response variable and Sepal.Length (x1 ) , Sepal.Width (x2 ) and Petal.Length (x3 ) as
explanatory variables:
y = α + β1 x1 + β2 x2 + β3 x3
(ii) Obtain a 95% confidence interval for the error variance, σ 2 , from first principles.
Exercise 11b.06
This question uses the versicolor iris linear regression model, model2, with Petal.Width (y) as
the response variable and Sepal.Length (x1 ) , Sepal.Width (x2 ) and Petal.Length (x3 ) as
explanatory variables:
y = α + β1 x1 + β2 x2 + β3 x3
(i) Obtain the expected petal width on a versicolor iris with sepal length 5.94cm, sepal
width 2.77cm and petal length 4.26cm.
(a) mean petal width for versicolor irises with sepal length 5.94cm, sepal
width 2.77cm and petal length 4.26cm.
(b) petal width of an individual versicolor iris with sepal length 5.94cm, sepal
width 2.77cm and petal length 4.26cm.
Exercise 11b.07
This question uses the versicolor iris linear regression model, model2, with Petal.Width (y) as
the response variable and Sepal.Length (x1 ) , Sepal.Width (x2 ) and Petal.Length (x3 ) as
explanatory variables:
y = α + β1 x1 + β2 x2 + β3 x3
(ii) (a) Obtain a plot of the residuals against the fitted values.
(b) Comment on the constancy of the variance and whether a linear model is
appropriate.
(iv) Examine the final two graphs obtained by plot(model2) and comment.
Exercise 11b.08
This question uses the versicolor iris data from Exercise 11b.02 which should be stored in the data
frame VDF.
We are fitting multiple linear regression models with Petal.Width as the response variable and a
combination of Sepal.Length, Sepal.Width and Petal.Length as explanatory variables.
Forward selection
(i) Fit the null regression model, fit0, to the Petal.Width data.
(ii) Obtain the (Pearson) linear correlation coefficient between all the pairs of variables.
(iii) Fit a linear regression model, fit1, with Petal.Width as the response variable and the
variable with the greatest correlation with Petal.Width as the explanatory variable.
(iv) (a) Fit a linear regression model, fit2, with Petal.Width as the response variable
and the variable from part (iii) and the variable with the next highest correlation
with Petal.Width as the two explanatory variables.
(v) (a) Fit a linear regression model, fit3, with Petal.Width as the response variable
and the variables from part (iv) plus the last variable as the explanatory variables.
(vi) Comment on the output of the fit3 model and the results of the ANOVA output.
Backward selection
Start with versicolor iris linear regression model, model2, with Petal.Width (y) as the response
variable and Sepal.Length (x1 ) , Sepal.Width (x2 ) and Petal.Length (x3 ) as explanatory variables:
y = α + β1 x1 + β2 x2 + β3 x3
(vii) (a) Update the model to create model2b by removing the variable with β j not
significantly different from zero.
Exercise 11b.09
This question uses the versicolor iris linear regression model, fit3, from Exercise 11b.08, with
Petal.Width (y) as the response variable and Petal.Length (x1 ) , Sepal.Width (x2 ) and
Sepal.Length (x3 ) as explanatory variables:
y = α + β1 x1 + β2 x2 + β3 x3
Forward selection
(i) (a) Fit a linear regression model, fit4, with Petal.Width as the response variable
and a two-way interaction term between the two most significant variables.
(b) Compare the adjusted R2 of fit3 and fit4. Comment on these values and the
results of the ANOVA output.
(ii) Create two further models, fit5 and fit6, each containing the three explanatory
variables from fit3 plus a single two-way interaction term. Show that only one of them
improves the value of the adjusted R2 but the ANOVA output shows that there is no
significant improvement in fit.
(iii) Explain why we would not consider adding a three-way interaction term in this case.
Backward selection
Start with the versicolor iris linear regression model, fitA, with Petal.Width (y) as the response
variable and Sepal.Length (x1 ) , Sepal.Width (x2 ) and Petal.Length (x3 ) as explanatory variables,
together with all two and three way interactions.
(iv) Update the model fitA to create fitB, fitC, etc by removing:
Each time compare only the adjusted R2 of the models to ensure only those models
which improve the fit are kept.
(v) Comment on the limitations of only using adjusted R2 as a basis for model fit.
Exercise 11b.10
There is currently no Exercise 11b.10.
11b
Multivariate Linear
Regression
Answers
Exercise 11b.01
There is currently no Exercise 11b.01.
Exercise 11b.02
(iii) α = −0.1686, β1 = −0.07398, β2 = 0.2233, β3 = 0.3088
(v) 0.6678075cm
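These answers can be reproduced from the built-in iris data along the following lines.

```r
# (i) the 50 versicolor flowers; columns 1:4 are the four measurements
VDF <- iris[iris$Species == "versicolor", 1:4]

# (ii)/(iii) multiple linear regression of petal width on the other three
model2 <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
             data = VDF)
coef(model2)

# (v) expected petal width: 0.6678075 cm
predict(model2, newdata = data.frame(Sepal.Length = 5.1,
                                     Sepal.Width  = 3.5,
                                     Petal.Length = 1.4))
```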
Exercise 11b.03
(i) SSTOT = 1.9162, SSRES = 0.56164, SSREG = 1.354562
(ii) R2 = 0.7069
Exercise 11b.04
(i) Statistic = −1.560 and p-value = 0.125599 hence do not reject H0 and conclude β1 = 0 .
(b) Since the 95% confidence interval (0.201, 0.416) contains 0.24 we do not reject
H0 and conclude β3 = 0.24 .
(iv) (a) (0.119, 0.327). The confidence interval is the same as part (ii)(a).
(b) Statistic = −1.240004 and p-value = 0.1106 hence do not reject H0 and conclude
β2 = 0.3 .
Exercise 11b.05
(i) Statistic = 36.98 and p-value = 2.571e-12, hence reject H0 and conclude there is at least
one non-zero slope parameter.
Exercise 11b.06
(i) 1.325704cm
(iii) −0.16864
Exercise 11b.07
(i) −0.0791475663,..., −0.0007574191
(ii) (a)
(b) The variance appears to start increasing towards the end, so it may not be
constant.
(iii) (a)
(b) The middle section corresponds well, however extremes detract from normal
distribution. It appears to have ‘fat’ tails.
(iv)
Point 99 has the most influence, but there is no point that is both an outlier and has a high
influence.
Exercise 11b.08
(ii)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 0.5259107 0.7540490 0.5464611
Sepal.Width 0.5259107 1.0000000 0.5605221 0.6639987
Petal.Length 0.7540490 0.5605221 1.0000000 0.7866681
Petal.Width 0.5464611 0.6639987 0.7866681 1.0000000
(iii) The variable with the greatest correlation with Petal.Width is Petal.Length. The fitted
model has an adjusted R2 = 0.6109 .
(iv) (a) The variable with the next highest correlation with Petal.Width is Sepal.Width.
The fitted model has an adjusted R2 = 0.6783 .
(b) The adjusted R2 of fit2 is greater than that of fit1. Therefore we should keep
both variables.
(b) The adjusted R2 of fit3 is greater than that of fit2. Therefore we should keep
all three variables.
(vi) In the summary the Sepal.Length parameter is not significant – which suggests we should
remove it.
The ANOVA printout shows there is not a significant improvement in fit when we add the
last variable which also suggests that we should not include Sepal.Length.
However, the adjusted R2 does increase marginally. The problem is caused by the
overlap (correlation) between the explanatory variables – PCA would remove this issue.
(vii) (a) The Sepal.Length parameter is not significantly different from zero.
(b) The adjusted R2 has fallen from 0.6878 to 0.6783. Hence, we should not remove
Sepal.Length.
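Part (vii) of the backward selection can be sketched with update():

```r
# Start from the full versicolor model of Exercise 11b.02
VDF <- iris[iris$Species == "versicolor", 1:4]
model2 <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
             data = VDF)
summary(model2)$coefficients    # Sepal.Length is not significantly non-zero

# (vii)(a) drop Sepal.Length
model2b <- update(model2, . ~ . - Sepal.Length)

# (vii)(b) compare the adjusted R^2 values: 0.6878 vs 0.6783
summary(model2)$adj.r.squared
summary(model2b)$adj.r.squared
```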
Exercise 11b.09
(i) (a) Petal.Length and Sepal.Width have greatest significance.
(b) The adjusted R2 has fallen from 0.6878 to 0.6814. The ANOVA printout confirms
there is not a significant improvement. Therefore we should not add the
interactive term.
(iii) Since no two-way terms have been included, we should not consider adding a three-way
interaction term.
(iv) (a) The three way interaction term is not significant. Removing it increases the
adjusted R2 from 0.6863 to 0.691.
(c) It is not appropriate to remove single terms when we have two-way interactions
that involve them.
(v) Even though we have maximised the adjusted R2 none of the coefficients are significant.
So we need a better method to use. Hence we tend to use the ANOVA test between
models to check improvement instead.
Exercise 11b.10
There is currently no Exercise 11b.10.
12
GLMs
Exercises
Data requirements
These exercises do not require you to upload any data files.
Exercise 12.01
The built in data set iris contains measurements (in cm) of the variables sepal length, sepal
width, petal length and petal width, respectively, for 50 flowers from each of 3 species (Iris
setosa, versicolor, and virginica) of iris.
(i) Extract the four measurements for the versicolor species only and store them in the 50×4
data frame, VDF.
(ii) Using the versicolor iris data fit a linear regression model, lmodel, with Petal.Width as
the response variable and Sepal.Length, Sepal.Width and Petal.Length as explanatory
variables:
y = α + β1 x1 + β2 x2 + β3 x3
(iii) (a) Use the function glm to fit an equivalent generalised linear model, glmodel, to
the versicolor iris data. State explicitly the appropriate family and the link
function in the arguments.
(b) Confirm that the estimated parameters are identical to the linear model in
part (ii).
(c) Give a shortened version of the R code from part (iii)(b) that will fit the same GLM
as part (iii)(a) but makes use of the default settings of the glm function.
Exercise 12.02
The built in data set iris contains measurements (in cm) of the variables sepal length, sepal
width, petal length and petal width, respectively, for 50 flowers from each of 3 species (setosa,
versicolor, and virginica) of iris.
(i) (a) Assuming that the measurements are normally distributed, use the function glm
to fit a generalised linear model, glmodel1, with Petal.Width as the response
variable and Sepal.Length (x1 ) , Sepal.Width (x2 ) , Petal.Length (x3 ) and
Species (γ i ) as explanatory variables:
y = α + β1 x1 + β2 x2 + β3 x3 + γi
(c) Explain what has happened to the coefficient for the setosa species.
(ii) State the code for a linear predictor which also included a quadratic effect from
Petal.Length.
The built in data set esoph contains data from a case-control study of oesophageal cancer in Ille-
et-Vilaine, France. agegp contains 6 age groups, alcgp contains 4 alcohol consumption groups,
tobgp contains 4 tobacco consumption groups, ncases gives the number of observed cases of
oesophageal cancer out of the group of size ncontrols.
(iii) Fit a binomial generalised linear model, glmodel2, with a logit link function to estimate
the probability of obtaining oesophageal cancer as the response variable and a linear
predictor containing the main effects of agegp (α i ) , alcgp (β j ) and tobgp (γ k ) :
y = αi + βj + γk
(iv) State the code for a linear predictor which also has interaction between alcohol and
tobacco.
Exercise 12.03
The first two parts of this exercise use the iris generalised linear model, glmodel1, with
Petal.Width (y) as the response variable and Sepal.Length (x1 ) , Sepal.Width (x2 ) , Petal.Length
(x3 ) and Species (γ i ) as explanatory variables:
y = α + β1 x1 + β2 x2 + β3 x3 + γi
(i) (a) State the statistic, p-value and conclusion for a test of H0 : β2 = 0 vs H1 : β2 ≠ 0 .
The next two parts of this exercise use the oesophageal cancer binomial probability generalised
linear model, glmodel2, with the probability of obtaining oesophageal cancer as the response
variable and a linear predictor containing the main effects of agegp (α i ) , alcgp (β j ) and
tobgp (γ k ) :
y = αi + βj + γk
(iii) State the p-value and conclusion for a test that the second non-base category in the age
group is zero.
(a) obtain a 99% confidence interval for the third non-base coefficient in the alcohol
group.
(b) test, at the 5% level, whether the first non-base coefficient in the tobacco group is
equal to 0.5.
Exercise 12.04
The first three parts of this exercise use the iris generalised linear model, glmodel1, with
Petal.Width (y) as the response variable and Sepal.Length (x1 ) , Sepal.Width (x2 ) , Petal.Length
(x3 ) and Species (γ i ) as explanatory variables:
y = α + β1 x1 + β2 x2 + β3 x3 + γi
(i) (a) Obtain the residual degrees of freedom and residual deviance for this model.
(iii) (a) Create a new GLM, glmodel01, which does not contain Species as an
explanatory variable.
(c) Use anova to carry out a formal F test to compare these two models.
The last part of this exercise uses the oesophageal cancer binomial probability generalised linear
model, glmodel2, with the probability of obtaining oesophageal cancer as the response variable
and a linear predictor containing the main effects of agegp (α i ) , alcgp (β j ) and tobgp (γ k ) :
y = αi + βj + γk
(iv) (a) Create a new GLM, glmodel02, which does not contain tobgp as an explanatory
variable.
(c) Use anova to carry out a formal χ 2 test to compare these two models.
Exercise 12.05
We are fitting generalised linear models to the iris data with Petal.Width as the response variable
and a combination of Sepal.Length, Sepal.Width, Petal.Length and Species as explanatory
variables, assuming the measurements are normally distributed.
Forward selection
(i) Fit the null generalised linear model, fit0, to the iris data.
First covariate
(ii) (a) By examining the scatterplot of all the pairs of variables explain why either
Species or Petal.Length should be chosen as our first explanatory variable.
(b) Fit a linear regression model, fit1a, with Petal.Width as the response variable
and Species as the only explanatory variable. Determine the AIC for fit1a.
(c) Fit a linear regression model, fit1b, with Petal.Width as the response variable
and Petal.Length as the only explanatory variable. Determine the AIC for fit1b.
(d) By examining the AIC of fit1a and fit1b choose the model that provides the
best fit to the data.
(e) Use the anova function to carry out an F test comparing fit0 and the model
chosen in part (ii)(d).
Second covariate
(iii) (a) Fit a linear regression model, fit2, with Petal.Width as the response variable
and both Species and Petal.Length as the explanatory variables.
(b) By examining the AIC and carrying out an F test compare fit2 and the model
chosen in part (ii)(d).
Third covariate
(iv) (a) Fit a linear regression model, fit3a, with Petal.Width as the response variable
and Species, Petal.Length and Sepal.Length as explanatory variables. Determine
the AIC for fit3a.
(b) Fit a linear regression model, fit3b, with Petal.Width as the response variable
and Species, Petal.Length and Sepal.Width as explanatory variables. Determine
the AIC for fit3b.
(c) By examining the AIC of fit3a and fit3b choose the model that provides the
best fit to the data.
(d) Use the anova function to carry out an F test comparing fit2 and the model
chosen in part (iv)(c).
Fourth covariate
(v) (a) Fit a linear regression model, fit4, with Petal.Width as the response variable
and all four covariates as the explanatory variables.
(b) By examining the AIC and carrying out an F test compare fit4 and the model
chosen in part (iv)(c).
Fifth covariate
(vi) (a) Fit a linear regression model, fit5a, with Petal.Width as the response variable,
all four covariates as main effects and an interactive term between Species and
Sepal.Width as explanatory variables. Determine the AIC for fit5a.
(b) Fit a linear regression model, fit5b, with Petal.Width as the response variable,
all four covariates as main effects and an interactive term between Petal.Length
and Sepal.Width as explanatory variables. Determine the AIC for fit5b.
(c) By examining the AIC of fit5a and fit5b choose the best fit to the data.
(d) Use the anova function to carry out an F test comparing fit4 and the model
chosen in part (vi)(c).
Sixth covariate
(vii) (a) Fit a linear regression model, fit6, with Petal.Width as the response variable, all
four covariates as main effects, the interactive terms between Species and
Sepal.Width, and between Petal.Length and Sepal.Width as explanatory variables.
(b) By examining the AIC and carrying out an F test compare fit6 and the model
chosen in part (vi)(c).
Seventh covariate
(viii) Show that adding interaction between Petal.Length and Sepal.Length to fit6 leads to a
drop in the AIC and a significant improvement in the residual deviance.
It can be shown that adding other two-way interactions terms do not improve the AIC nor lead to
a significant improvement in residual deviance.
(ix) Explain why we should not add any three-way interaction terms at this stage.
Backward selection
(x) Fit the full generalised linear model, fitA, to the iris data to model Petal.Width using
Species*Petal.Length*Sepal.Length*Sepal.Width and show the AIC is −109.79 .
(xi) Show that the generalised linear model, fitB, which removes the four-way interaction
term leads to an improvement in the AIC.
It can be shown that two three-way interaction terms have parameters that are insignificant.
(xii) (a) Update the model fitB to create fitC1 by removing the three-way interaction
between Species, Petal.Length and Sepal.Width. Determine the AIC for fitC1.
(b) Update the model fitB to create fitC2 by removing the three-way interaction
between Species, Petal.Length and Sepal.Length. Determine the AIC for fitC2.
Let fitC be the model from parts (xii)(a) and (b) which produces the biggest improvement in the
AIC.
(xiii) It can be shown that another three-way interaction term has insignificant parameters at
the 10% level. Use the summary function to determine which interaction term this is.
Create the generalised linear model, fitD, which removes it and show that there is an
improvement in the AIC.
(xiv) Show that generalised linear model, fitE, which removes another insignificant three-
way interaction term also leads to an improvement in the AIC.
(xv) Use the summary function to show that the parameter of the final three-way interaction
term is still significant but that the two-way interaction term between Species and
Sepal.Length is not. Update the model fitE to create fitF by removing this two-way
interaction and show it leads to an improvement in the AIC.
(xvi) Use the summary function to show that the parameters of three of the two-way
interaction terms are insignificant at the 5% level. Show that removing any of these
interaction terms leads to no improvement in the AIC.
Exercise 12.06
The first three parts of this exercise use the iris generalised linear model, glmodel1, with
Petal.Width (y) as the response variable and Sepal.Length (x1 ) , Sepal.Width (x2 ) , Petal.Length
(x3 ) and Species (γ i ) as explanatory variables:
y = α + β1 x1 + β2 x2 + β3 x3 + γi
(i) Obtain the value of the linear predictor for glmodel1 for a versicolor iris with sepal
length 5.1cm, sepal width 3.5cm and petal length 1.4cm:
(ii) (a) Explain why the expected petal width of a versicolor iris will be the same as the
linear predictor in part(i).
(b) Show that this is the case by using the predict function.
(iii) Explain why there is no constant for the setosa species in the linear predictor.
(iv) Obtain the expected petal width of a setosa iris with sepal length 5.1cm, sepal width
3.5cm and petal length 1.4cm:
Exercise 12.07
The first two parts of this exercise use the iris generalised linear model, glmodel1, with
Petal.Width (y) as the response variable and Sepal.Length (x1 ) , Sepal.Width (x2 ) ,
Petal.Length (x3 ) and Species (γ i ) as explanatory variables:
y = α + β1 x1 + β2 x2 + β3 x3 + γi
(i) Obtain the raw residuals for the generalised linear model:
(ii) Show that the raw residuals are the same as the:
(iii) By examining the median, lower and upper quartiles of the residuals, comment on their
skewness.
(iv) (a) Obtain a plot of the residuals against the fitted values.
(b) Comment on the constancy of the variance of the residuals and whether a normal
model is appropriate.
(vi) Examine the final two graphs obtained by plot(glmodel1) and comment.
12
GLMs
Answers
Exercise 12.01
(ii) Coefficients: (Intercept) = −0.1686, Sepal.Length = −0.07398,
Sepal.Width = 0.2233, Petal.Length = 0.3088
(c) glm(Petal.Width~Sepal.Length+Sepal.Width+Petal.Length,
data=VDF)
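A sketch of part (iii), assuming the versicolor subset VDF from part (i): the normal (gaussian) family with identity link reproduces the linear model exactly.

```r
# (i) the versicolor subset of the built-in iris data
VDF <- iris[iris$Species == "versicolor", 1:4]

# (ii) the linear model
lmodel <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
             data = VDF)

# (iii)(a) the equivalent GLM with family and link stated explicitly
glmodel <- glm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
               family = gaussian(link = "identity"), data = VDF)

# (iii)(b) the estimated parameters are identical
all.equal(coef(lmodel), coef(glmodel), tolerance = 1e-6)
```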
Exercise 12.02
(i) (b)
(c) The setosa coefficient has been absorbed into the ‘intercept’ coefficient.
(ii) glm(Petal.Width~Sepal.Length+Sepal.Width+Petal.Length+Species
+I(Petal.Length^2))
ie glm(cbind(ncases,ncontrols) ~
agegp+alcgp+tobgp+alcgp:tobgp, family=binomial)
or glm(cbind(ncases,ncontrols) ~ agegp+alcgp*tobgp,
family=binomial).
Exercise 12.03
(i) (a) Statistic = 5.072, p-value = 1.20e-06 hence we reject H0 and conclude that
β2 ≠ 0 .
(b) A 95% confidence interval for β1 is (−0.180, −0.00555) . Since it does not contain
−0.2 we reject the null hypothesis at the 5% level and conclude that β1 ≠ −0.2 .
(iii) p-value = 0.02362 hence we reject H0 and conclude that the coefficient is not zero.
(b) A 95% confidence interval for the first non-base coefficient in the tobacco group is
(0.209, 0.972). Since it contains 0.5 we have insufficient evidence to reject the
null hypothesis at the 5% level and conclude that the coefficient is 0.5.
Exercise 12.04
(i) (a) The residual degrees of freedom are 144 and the residual deviance is 3.998.
(iii) (b) The AIC for glmodel01 is −63.5 . Since the AIC for glmodel1 is lower it is
considered a better fit. Therefore we should include Species in our model.
(c) The p-value is 5.143e-10, which is much less than 5%, so we would reject H0 . The
model with Species significantly reduces the scaled deviance and hence is a better
fit.
(iv) (b) The AIC for glmodel2 is 225.5, whereas the AIC for glmodel02 is 230.1. Since
the AIC for glmodel2 is lower it is considered the better model. Therefore we
should include tobacco in our model.
(c) The p-value is 0.0141 which is smaller than 5% so we would reject H0 . The model
with tobacco significantly reduces the scaled deviance and is a better fit.
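Part (iv) can be sketched with the built-in esoph data, fitting glmodel2 as set up in Exercise 12.02.

```r
# The full binomial model and the model without tobgp (Exercise 12.04(iv))
glmodel2  <- glm(cbind(ncases, ncontrols) ~ agegp + alcgp + tobgp,
                 family = binomial, data = esoph)
glmodel02 <- glm(cbind(ncases, ncontrols) ~ agegp + alcgp,
                 family = binomial, data = esoph)

# (b) compare the AICs: 225.5 vs 230.1
AIC(glmodel2)
AIC(glmodel02)

# (c) chi-squared test of the reduction in deviance
anova(glmodel02, glmodel2, test = "Chisq")   # p-value = 0.0141
```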
Exercise 12.05
First covariate
(ii) (a) The strongest correlation is between Petal.Width and Species or Petal.Width and
Petal.Length.
(b) −45.29
(c) −43.59
(d) The AIC for fit1a (−45.29) is the lower out of fit1a and fit1b. So it is
considered the better model.
(e) The p-value is 2.2e-16 which is much lower than 5% so we would reject H0 . The
model with Species significantly reduces the residual deviance and is a better fit.
Second covariate
(iii) (b) Since the AIC for fit2, −83.41 , is lower than fit1a it is considered the better
model so we should also include Petal.Length in our model.
Third covariate
(b) −101.60
(c) The AIC for fit3a is −81.41 (which is worse than fit2) and the AIC for fit3b
is −101.6 which is an improvement. The best model is fit3b.
(d) The p-value is 1.03e-05 which is much lower than 5% so we would reject H0 . The
model with Sepal.Width significantly reduces the residual deviance and is a better
fit.
Fourth covariate
(v) (b) Since the AIC for fit4, −104.06 , is lower than fit3b it is considered the better
model so we should also include Sepal.Length in our model.
The p-value is 0.03889 which is less than 5% so we would reject H0 . The model
with Sepal.Length significantly reduces the residual deviance and is a better fit.
Fifth covariate
(b) −107.46
(c) The AIC for fit5a is −109.88 and the AIC for fit5b is −107.46 . Both are an
improvement but fit5a is lower and so is the better fit.
(d) The p-value is 0.009593 which is less than 5% so we would reject H0 . The model
with Species:Sepal.Width significantly reduces the residual deviance and is a
better fit.
Sixth covariate
(vii) (b) Since the AIC for fit6, −114.26 , is lower than fit5a it is considered the better
model so we should also include Petal.Length:Sepal.Width in our model.
The p-value is 0.0145 which is less than 5% so we would reject H0 . The model
with Petal.Length:Sepal.Width significantly reduces the residual deviance and is a
better fit.
Seventh covariate
(viii) The AIC drops to −116.55 and the p-value is 0.04566 so there is a significant
improvement.
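The covariate names suggest these exercises use R's built-in iris data. On that assumption, the first forward-selection step can be sketched as follows (an illustration of the method, not the official solution; the AIC figures quoted above are those reported in the answers):

```r
# First forward-selection step, assuming the iris data (an assumption
# based on the covariate names used in the exercise).
fit0  <- lm(Petal.Width ~ 1, data = iris)             # null model
fit1a <- lm(Petal.Width ~ Species, data = iris)       # candidate covariates
fit1b <- lm(Petal.Width ~ Petal.Length, data = iris)

AIC(fit1a)
AIC(fit1b)          # keep the candidate with the lower AIC
anova(fit0, fit1a)  # F test on the reduction in residual deviance

# Then repeat, adding one covariate (or interaction) at a time:
fit2 <- update(fit1a, . ~ . + Petal.Length)
AIC(fit2)
```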
Backward selection
Exercise 12.06
(i) 0.8877986 cm
(ii) (a) The canonical link function for the normal distribution is the identity function.
Hence the mean response variable is equal to the linear predictor.
(iv) 0.2396861 cm
Exercise 12.07
(i) −0.0396860931, ... , − 0.1867598541
The median is nearly zero and the lower and upper quartiles have nearly equal absolute
values, so the middle 50% of the data is roughly symmetrical.
(iv) (a)
(b) The line is fairly horizontal, so the variance of the residuals is fairly constant
and hence the normal model is appropriate.
(v) (a)
(b) The middle section is good but there are issues in the extremes.
The residuals at the lower end are more negative than expected – so the fitted
values are too large.
The residuals at the upper end are more positive than expected – so the fitted
values are too small.
So the current model has very ‘fat’ tails which is not ideal.
(vi)
The variance of the residuals is increasing – implying a defect in our model. Interaction
terms may resolve this problem.
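The diagnostic plots referred to in parts (iv)-(vi) can be reproduced for any fitted lm object; a generic sketch using R's built-in cars data (an illustrative model, not the exercise's):

```r
# Generic residual diagnostics for a fitted linear model.
fit <- lm(dist ~ speed, data = cars)   # illustrative model only

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")     # look for constant spread about zero
abline(h = 0, lty = 2)

qqnorm(resid(fit))   # points close to the reference line support
qqline(resid(fit))   # the normality assumption
```

Fanning out in the first plot (increasing variance) and systematic departures in the tails of the Q-Q plot are exactly the defects identified in the answers above.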
13
Bayesian Statistics
Exercises
Data requirements
These exercises do not require you to upload any data files.
Exercise 13.01
The probability of a person dying from a particular disease is p. The prior distribution of p is
beta with parameters a = 2 and b = 3.
(b) Use a loop to obtain 1,000 simulations of the posterior outcome (where 1 denotes
death and 0 denotes survival) for a single person. Use the functions
set.seed(77), rbeta and rbinom and store the i th outcome in the i th
element of x.
(c) Hence, obtain an empirical Bayesian estimate for p under quadratic loss.
The Bayesian estimate for p under quadratic loss for a single outcome x is:
(x + a) / (a + b + 1)
(b) Repeat part (i)(b) but also store the i th theoretical Bayesian estimate in the i th
element of pm.
(c) Compare the average empirical and theoretical Bayesian estimates under
quadratic loss.
(b) Use a loop to obtain 1,000 simulations of the posterior probability of death, based
on 1,000 random samples each containing 12 people. Use the functions set.seed(79),
rbeta and rbinom and store the i th estimate of the probability in the
i th element of xp.
(c) Hence, obtain an empirical Bayesian estimate for p under quadratic loss.
The Bayesian estimate for p under quadratic loss, given x deaths in a sample of size n, is:
(x + a) / (a + b + n)
(iv) (a) Repeat part (iii)(b) but also store the i th theoretical Bayesian estimate in the i th
element of pm.
(b) Compare the average empirical and theoretical Bayesian estimates under
quadratic loss.
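One plausible shape for the loops requested above, shown for part (i) (a sketch of the intended approach, not the official solution):

```r
# Part (i): simulate the posterior outcome (death = 1, survival = 0)
# for a single person, with p drawn from its Beta(2,3) prior.
a <- 2; b <- 3
x <- rep(0, 1000)
set.seed(77)
for (i in 1:1000) {
  p    <- rbeta(1, a, b)    # draw p from the prior
  x[i] <- rbinom(1, 1, p)   # simulate the outcome for one person
}

# Part (i)(c): empirical Bayesian estimate under quadratic loss,
# averaging the posterior means (x + a)/(a + b + 1):
mean((x + a) / (a + b + 1))
```

For parts (iii)-(iv) the same pattern applies with samples of 12 people and posterior mean (x + a)/(a + b + n).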
13
Bayesian Statistics
Answers
Exercise 13.01
(i) (c) 0.382
(ii) (c) Average empirical estimate = 0.382, theoretical Bayesian estimate = 0.397. There
is about a 4% difference between the two quadratic loss estimates.
(iv) (b) Average empirical estimate = 0.4097 (or 0.4023 if using rbinom(n,1,p)),
theoretical Bayesian estimate = 0.4068 (or 0.4016 if using rbinom(n,1,p)).
There is less than a 1% difference between the two quadratic loss estimates.
14
Credibility Theory
Exercises
Data requirements
These exercises do not require you to upload any data files.
Exercise 14.01
The probability of a person dying from a particular disease is p. The prior distribution of p is
beta with parameters a = 2 and b = 3.
Z × (x/n) + (1 − Z) × a/(a + b)
where:
Z = n/(n + a + b)
The statistician is going to take samples of 5 people to calculate the credibility estimate.
(b) Use a loop to obtain 1,000 simulations of the posterior probability of death, based
on 1,000 random samples each containing 5 people. Use the functions
set.seed(79), rbeta and rbinom and store the credibility estimate of
the i th outcome in the i th element of cp.
(ii) Plot a labelled bar chart of the simulated credibility estimates for p using the functions
barplot and table.
(iii) Calculate the mean and standard deviation of the empirical credibility estimates.
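A sketch of one way to carry out the simulation and summaries (an illustration of the approach, not the official solution):

```r
# Credibility estimates of p from 1,000 samples of n = 5 people.
a <- 2; b <- 3; n <- 5
Z <- n / (n + a + b)        # credibility factor = 5/10 = 0.5

cp <- rep(0, 1000)
set.seed(79)
for (i in 1:1000) {
  p     <- rbeta(1, a, b)          # draw p from the Beta(2,3) prior
  x     <- sum(rbinom(n, 1, p))    # deaths observed in the sample
  cp[i] <- Z * x / n + (1 - Z) * a / (a + b)
}

barplot(table(cp), main = "Credibility estimates of p",
        xlab = "estimate", ylab = "frequency")   # part (ii)
mean(cp)                                         # part (iii)
sd(cp)
```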
14
Credibility Theory
Answers
Exercise 14.01
(ii)
(iii) mean = 0.403, standard deviation = 0.1393931 (or 0.4046 and 0.1402088 if using
rbinom(n,1,p))
15
EBCT
Exercises
Data requirements
These exercises require the following data files:
• insurance claims.txt
• insurance volumes.txt
Exercise 15.01
The table below shows the aggregate claim amounts (in £m ) for an international insurer’s fire
portfolio for a 5-year period. The claim amounts are subdivided by country of origin.
Total claim amount (£m)

Country   Year 1   Year 2   Year 3   Year 4   Year 5
A           48       53       42       50       59
B           64       71       64       73       70
C           85       54       76       65       90
D           44       52       69       55       71
(i) Load the data frame and store it in the matrix ins.claim.
(ii) Store the number of years and number of countries in the objects n and N, respectively.
An actuary is using EBCT Model 1 to set premiums for the coming year.
(iii) (a) Use mean and rowMeans (or otherwise) to calculate an estimate of E[m(θ)] and
store it in the object m.
(b) Use apply, var and mean to calculate an estimate of E[s²(θ)] and store it in
the object s.
(c) Use var and rowMeans (or otherwise) and your result from part (iii)(b) to
calculate an estimate of var[m(θ)] and store it in the object v.
(iv) Use your results from parts (ii) and (iii) to calculate the credibility factor and store it in the
object Z.
(v) Calculate the EBCT premiums for each of the four countries.
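The whole calculation can be sketched by entering the claims table directly rather than loading the data file (a sketch, not the official solution — the intermediate values can be checked against the answers):

```r
# EBCT Model 1 for the fire portfolio (claims table entered directly).
ins.claim <- matrix(c(48, 53, 42, 50, 59,
                      64, 71, 64, 73, 70,
                      85, 54, 76, 65, 90,
                      44, 52, 69, 55, 71),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("A", "B", "C", "D"), 1:5))

n <- ncol(ins.claim)                   # number of years
N <- nrow(ins.claim)                   # number of countries

m <- mean(ins.claim)                   # estimate of E[m(theta)]
s <- mean(apply(ins.claim, 1, var))    # estimate of E[s^2(theta)]
v <- var(rowMeans(ins.claim)) - s / n  # estimate of var[m(theta)]

Z <- n / (n + s / v)                   # credibility factor
Z * rowMeans(ins.claim) + (1 - Z) * m  # EBCT premiums by country
```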
Exercise 15.02
This question uses the aggregate claim amounts (in £m ) for an international insurer’s fire
portfolio for a 5-year period from the previous exercise, which should be stored in the matrix
ins.claim.
The table below shows the volumes of business for each country in each year for the international
insurer.
Volume of business

Country   Year 1   Year 2   Year 3   Year 4   Year 5
A           12       15       13       16       10
B           20       14       22       15       30
C            5        8        6       12        4
D           22       35       30       16       10
(i) Load the data frame of volumes and store it in the matrix ins.volume.
An actuary is using EBCT Model 2 to set premiums for the coming year.
(ii) Calculate the claims per unit of risk volume and store them in the matrix X.
(iii) (a) Use rowSums to calculate the total policies for each country and store them in
the object Pi.
(b) Use sum to calculate the overall total policies for all countries and store it in the
object P.
(c) Use ncol, nrow and sum to calculate
P* = 1/(Nn − 1) × Σ (i = 1 to N) Pi (1 − Pi/P)
and store it in the object Pstar.
(b) Use rowSums to calculate the mean claims per policy for each country and store
it in the object Xibar.
(c) Use rowSums and mean to calculate E[s²(θ)] and store it in the object s.
(d) Use sum and rowSums and your result from part (iii)(c) to calculate var[m(θ)]
and store it in the object v.
(v) Use your results from parts (iii) and (iv) to calculate the credibility factor for each country
and store the values in the object Zi.
(vi) If the volumes of business for each country for the coming year are 20, 25, 10 and 12,
respectively, calculate the EBCT Model 2 premiums for each of the four countries.
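Similarly, the Model 2 steps can be sketched by entering both tables directly (a sketch, not the official solution):

```r
# EBCT Model 2 for the fire portfolio (both tables entered directly).
ins.claim <- matrix(c(48, 53, 42, 50, 59,
                      64, 71, 64, 73, 70,
                      85, 54, 76, 65, 90,
                      44, 52, 69, 55, 71), nrow = 4, byrow = TRUE)
ins.volume <- matrix(c(12, 15, 13, 16, 10,
                       20, 14, 22, 15, 30,
                        5,  8,  6, 12,  4,
                       22, 35, 30, 16, 10), nrow = 4, byrow = TRUE)

n <- ncol(ins.volume); N <- nrow(ins.volume)
X  <- ins.claim / ins.volume          # claims per unit of risk volume
Pi <- rowSums(ins.volume)             # total volume by country
P  <- sum(ins.volume)                 # overall total volume
Pstar <- sum(Pi * (1 - Pi / P)) / (N * n - 1)

Xibar <- rowSums(ins.claim) / Pi      # mean claims per unit volume, by country
Xbar  <- sum(ins.claim) / P           # overall mean claims per unit volume
s <- mean(rowSums(ins.volume * (X - Xibar)^2) / (n - 1))
v <- (sum(ins.volume * (X - Xbar)^2) / (N * n - 1) - s) / Pstar

Zi <- Pi / (Pi + s / v)               # credibility factor by country
newvol <- c(20, 25, 10, 12)           # coming year's volumes
newvol * (Zi * Xibar + (1 - Zi) * Xbar)   # EBCT Model 2 premiums
```

Note that X - Xibar subtracts each country's mean from its own row because R recycles the length-4 vector down the columns of the 4×5 matrix.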
15
EBCT
Answers
Exercise 15.01
(iii) (a) m = 62.75
(b) s = 101.2
(c) v = 90.33
(iv) 0.8169485
(v) Country A premium = 52.66, Country B premium = 67.37, Country C premium = 71.94,
Country D premium = 59.03.
Exercise 15.02
(b) P = 315
(c) P∗ = 11.80852
(b) X̄1 = 3.818, X̄2 = 3.386, X̄3 = 10.57, X̄4 = 2.575
(c) s = 104.642
(d) v = 6.538782
(v) Z1 = 0.8048, Z2 = 0.8632, Z3 = 0.6862, Z4 = 0.8759
(vi) The credibility premiums for countries A, B, C and D are 77.0, 86.7, 85.0, 33.0,
respectively.
If you are having your assignment marked by ActEd, please follow these instructions carefully:
– Download and open the Word document ‘CS1 Assignment Y1 Answer Booklet 12345’.
Follow the instructions provided in the template and enter your answers where
indicated.
– In your submission include sufficient R code for the markers to work out how you
arrived at your answers.
– Begin your answer to each question on a new page. Only send ActEd one Word file
(created using the template) when you have completed the assignment.
– Assignment marking is not included in the price of the course materials. Please
purchase Series Y Marking or a Marking Voucher before submitting your script.
– We only accept the current version of assignments for marking, and so you can only
submit this assignment in the sessions leading to the 2019 exams.
– We only accept Word files produced in Office 2007, 2010 or 2013 format. Submitted
assignments will not be marked if any of the files are suspected to have been affected
by a computer virus or to have been corrupted.
– You should aim to submit this script for marking by the recommended submission
date. The recommended and deadline dates for submission of this assignment are
listed on the summary page at the back of this pack and on our website at
www.ActEd.co.uk.
– Scripts received after the deadline date will not be marked, unless you are using a
Marking Voucher. It is your responsibility to ensure that scripts reach ActEd in good
time. If you are using Marking Vouchers, then please make sure that your script
reaches us by the Marking Voucher deadline date to give us enough time to mark
and return the script before the exam.
– In addition to this paper, you should have available actuarial tables and an electronic
calculator.
Y1.1 In a particular portfolio of 1,000 life assurance policyholders, deaths are assumed to occur
independently with a probability of 0.05.
(ii) Calculate the probability that the number of deaths, D , lies between 45 and 59 inclusive:
(a) exactly
Y1.2 (i) (a) Simulate 1,000 values from a U(0,1) distribution using set.seed(13).
(b) Hence, determine 1,000 simulations from the distribution which has cumulative
distribution function:
F(x) = 1 − 1/(1 + x),  x > 0 [7]
(a) plot a labelled graph of the empirical PDF of the simulations for the range
x ∈(0,200)
(b) calculate the empirical mean, standard deviation and coefficient of skewness and
comment on the shape of the distribution. [13]
[Total 20]
Y1.3 A company that makes Gizmos™ is trying to ascertain the percentage of consumers who are
aware of the existence of its product. A study is to be carried out in which a random sample of
the population will be interviewed and asked whether or not they are aware of it.
(i) In a sample of 20 people, 10 had heard of Gizmos™. Determine the width of an exact 95%
confidence interval for the underlying population proportion. [5]
(ii) Show exhaustively for a sample of size 20 that the greatest width of an exact binomial
confidence interval occurs when half of the sample have heard of Gizmos™. [8]
[Total 13]
Y1.4 An insurer is measuring the inter-arrival times between notification of consecutive claims from a
portfolio of policies with a low claim rate. The insurer believes that these inter-arrival times may
have an exponential distribution with unknown parameter λ . A random sample gives the
following time periods (in days) between consecutive claims:
14, 4, 3, 2, 3, 1, 5, 10, 4, 23
(i) Derive a 99% confidence interval for the exponential parameter λ using a non-parametric
bootstrap and set.seed(17), based on a sample of 1,000 values. [10]
After extensive analysis it is decided that the inter-arrival times have an exponential distribution
with parameter 0.145.
(ii) (a) Determine 1,000 simulated means from samples of size 10 from this exponential
distribution using set.seed(19).
(c) Use the results of part (ii)(a) to calculate the empirical probability that the sample
mean is less than 5. [14]
(iii) Plot the PDF of the appropriate gamma distribution on the histogram of part (ii)(b) and
comment. [5]
(iv) Calculate the exact probability that the sample mean is less than 5 using this result and
compare to part (ii)(c). [5]
(v) (a) Determine 1,000 simulated values from the appropriate gamma distribution using
set.seed(21).
(b) Plot a Q-Q plot of the sample means from part (ii)(a) and the simulations from
part (v)(a) and comment on the result. [13]
[Total 47]
END OF PAPER
For the session leading to the April 2019 exams – CS1B, CS2B, CM1B & CM2B Subjects
[Table of recommended submission dates and final deadline dates for Marking Vouchers and Series Y Assignments not reproduced here.]
If you submit your assignment on the final deadline date you are likely to receive your script back less than a
week before your exam.
For the session leading to the September 2019 exams – CS1B, CS2B, CM1B & CM2B Subjects
[Table of recommended submission dates and final deadline dates for Marking Vouchers and Series Y Assignments not reproduced here.]
If you submit your assignment on the final deadline date you are likely to receive your script back less than a
week before your exam.
Solution Y1.1
(i) Median
n <- 1000
p <- 0.05
qbinom(0.5,n,p) [3]
qbinom(0.5,1000,0.05)
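The exact probability in part (ii)(a) can be computed directly with pbinom (a sketch of the natural approach):

```r
# Part (ii)(a): exact probability that D lies between 45 and 59 inclusive,
# where D ~ Bin(1000, 0.05).
pbinom(59, 1000, 0.05) - pbinom(44, 1000, 0.05)
```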
Since Bin(n, p) ≈ Poi(np), we have a Poisson distribution with mean 50. [2]
ppois(59,n*p) - ppois(44,n*p) [2]
ppois(59,50) - ppois(44,50)
pnorm(59.5,50,sqrt(47.5)) - pnorm(44.5,50,sqrt(47.5))
Solution Y1.2
set.seed(13) [1]
u <- runif(1000)
Inverting the CDF: u = 1 − 1/(1 + x), so x = 1/(1 − u) − 1. [2]
x <- 1/(1-u) - 1
plot(density(x),xlim=c(0,200),main="Empirical PDF of
simulations", xlab="x",col="blue")
[4]
[Subtract 1 mark per error, colour not needed]
mean(x)
sd(x)
skew <- mean((x-mean(x))^3)
skew/(sd(x)^3)
The huge standard deviation for such a small mean indicates that we have a very long tail (as the
values must be greater than zero). [1]
Solution Y1.3
test <- binom.test(10,20)
test$conf[2]-test$conf[1]
width <- rep(0,20)
for (i in 1:20)
{test <-binom.test(i,20);width[i]<-test$conf[2]-test$conf[1]}
width
max(width)
[7]
By examining the widths, we can see the greatest width (0.4560843) occurs for 10 successes. [1]
Solution Y1.4
Since λ̂ = 1/x̄, we could obtain a 99% confidence interval for the sample mean and then take its
reciprocal.
x <- c(14,4,3,2,3,1,5,10,4,23)
bm <- rep(0,1000)
set.seed(17)
for(i in 1:1000)
{bm[i] <- mean(sample(x,replace=TRUE))}
[6]
set.seed(17)
bm <- replicate(1000,mean(sample(x,replace=TRUE)))
ci <- quantile(bm,c(0.005,0.995))
1/ci [1]
Note that you will have to swap the numbers to get the confidence interval in the correct form.
xbar <- rep(0,1000)
set.seed(19)
for (i in 1:1000)
{xbar[i] <- mean(rexp(10,0.145))}
Alternatively:
set.seed(19)
xbar <- replicate(1000,mean(rexp(10,0.145)))
(ii)(b) Histogram
length(xbar[xbar<5])/length(xbar) [3]
lines(xvals,dgamma(xvals,10,1.45),type="l",col="blue") [3]
[Subtract 1 mark per error, colour not needed]
pgamma(5,10,1.45) [2]
There’s about a 7% difference in the answers so they’re not that close. [2]
set.seed(21)
(v)(b) QQ plot
abline(0,1,col="red",lty=2,lwd=2) [2]
[Colour, line type and width not needed]
The fit appears fairly good in the middle. The lower end sample means are slightly higher than
expected – so we have a lighter lower tail. [2]
The upper end sample means are much lower than expected – so we have a lighter upper tail
except for a handful of extremely large sample mean values. [2]
So it’s not a very good fit, possibly because of the sample size not being large enough. [1]
If you are having your assignment marked by ActEd, please follow these instructions carefully:
– Download and open the Word document ‘CS1 Assignment Y2 Answer Booklet 12345’.
Follow the instructions provided in the template and enter your answers where
indicated.
– In your submission include sufficient R code for the markers to work out how you
arrived at your answers.
– Begin your answer to each question on a new page. Only send ActEd one Word file
(created using the template) when you have completed the assignment.
– Assignment marking is not included in the price of the course materials. Please
purchase Series Y Marking or a Marking Voucher before submitting your script.
– We only accept the current version of assignments for marking, and so you can only
submit this assignment in the sessions leading to the 2019 exams.
– We only accept Word files produced in Office 2007, 2010 or 2013 format. Submitted
assignments will not be marked if any of the files are suspected to have been affected
by a computer virus or to have been corrupted.
– You should aim to submit this script for marking by the recommended submission
date. The recommended and deadline dates for submission of this assignment are
listed on the summary page at the back of this pack and on our website at
www.ActEd.co.uk.
– Scripts received after the deadline date will not be marked, unless you are using a
Marking Voucher. It is your responsibility to ensure that scripts reach ActEd in good
time. If you are using Marking Vouchers, then please make sure that your script
reaches us by the Marking Voucher deadline date to give us enough time to mark
and return the script before the exam.
– In addition to this paper, you should have available actuarial tables and an electronic
calculator.
Y2.1 An investigation is to be carried out into the spreading properties of two brands of paint, Brand A
and Brand B . Samples of 5 cans (of the same size) of each type are analysed, and the area of wall
covered by the paint in each can is measured (in square metres), with the following results:
(i) Test whether the variances of the 2 brands can be considered to be equal. [8]
(ii) Based on your answer to part (i), test the hypothesis that both brands are equally
effective, against the alternative that Brand A covers a greater area than Brand B.
State the probability value of your test statistic. [6]
(iii) It is decided that the assumption of normality is not appropriate. Repeat part (ii) using an
appropriate non-parametric test without resampling. [10]
[Total 24]
Y2.2 The aggregate claims X each year, from a portfolio of insurance policies, are assumed to have a
normal distribution with unknown mean θ and variance τ² = 400. Prior information is such that
θ is assumed to have a normal distribution with mean μ = 270 and variance σ² = 225.
Independent claim amounts over the past 5 years have been obtained.
Simulate 1,000 samples of 5 aggregate claims using a seed value of 13 and calculate the mean of
each sample. Hence, obtain an empirical Bayesian estimate for θ under quadratic loss. [10]
Y2.3 A statistician is analysing the fall in fertility rates that occurred in Switzerland in 1888. The
standardised fertility measure for each of 47 French-speaking provinces is obtained along with
five other socio-economic factors:
(i) Show that Education has the second strongest Spearman’s correlation with Fertility. [6]
(iii) (a) Fit a linear regression model, using Fertility as the response variable and
Education as the explanatory variable. State the intercept and gradient
parameters and comment on their p-values.
(b) Plot a red dotted fitted regression line to the scattergraph in part (i).
(c) By considering the coefficient of determination and the fitted line from part (b),
explain the limitations of this model. [12]
Neuchatel, the 42nd Swiss province, has 17.6% of men occupied in agriculture, 35% of draftees
receiving the highest mark on the army examination, 32% of draftees receiving education beyond
primary school, 16.92% who are Catholic and 23.0% of live births who live less than 1 year.
(iv) Calculate the residual and a 90% confidence interval for the fertility rate of the Neuchatel
province, based on the model in part (iii). [8]
[continued over]
To improve the model it is decided to use forward selection to add other variables and interaction
terms that meet the following criteria:
• The variable that most improves the adjusted R 2 out of the remaining possibilities is to be
added first
• The variable is then only kept if all the resulting parameters in the model are significant.
(v) Derive the best model for fertility that meets all these criteria, recording your adjusted
R 2 for each model considered. Comment on how each model meets the criteria. [26]
(vi) (a) Repeat part (iv) for your model from part (v).
(b) Hence, comment on the fit of the second model compared to the first. [8]
[Total 66]
END OF PAPER
Solution Y2.1
var.test(A,B) [4]
This gives a p-value of 0.5576. Hence we have insufficient evidence to reject H0 . Therefore it is
reasonable to assume that the variances are equal. [2]
Alternatively, students may reverse the A and B in the function, which gives the same answer.
Since the variances can be considered equal we can use the equal variance t test:
t.test(A,B,var.equal=TRUE,alt="greater") [4]
This gives a p-value of 0.01446. Hence we have sufficient evidence to reject H0 . Therefore it is
reasonable to assume that Brand A covers a greater area than Brand B.
Alternatively, students may reverse the A and B in the function, and test ‘less’ which gives the
same answer.
Subtract 2 marks for students who carry out a 2 sided test and get a p-value of 0.02891.
Subtract 2 marks for students who don’t specify that the variances are equal and get a p-value of
0.01564.
results <- c(A,B)
index <- 1:length(results)
ObsT <- mean(A)-mean(B)
p<-combn(index,length(A)) [2]
n <- ncol(p)
dif<-rep(0,n) [1]
for (i in 1:n)
{dif[i]<-mean(results[p[,i]])-mean(results[-p[,i]])} [2]
length(dif[dif>=ObsT])/length(dif) [2]
This gives an empirical p-value of 0.01984127. Hence we have sufficient evidence to reject H0 .
Therefore it is reasonable to assume that Brand A covers a greater area than Brand B. [2]
Alternatively, students may reverse the A and B in the function, and test dif<=ObsT which gives the
same answer.
Solution Y2.2
n <- 5
xp <- rep(0,1000)
set.seed(13)
for (i in 1:1000)
x <- rnorm(n,mu,sqrt(dvar))
[8, 2 marks each for the last 3 lines, 2 marks for everything else]
Alternatively, students could place the values directly into the loop:
xp <- rep(0,1000)
set.seed(13)
for (i in 1:1000)
x <- rnorm(5,mu,sqrt(400))
mean(xp) [1]
Hence the empirical Bayesian estimate for θ under quadratic loss is 268.9667. [1]
Solution Y2.3
cor(swiss,method="spearman") [4]
Hence, we can see that Education has the second strongest correlation coefficient (−0.44) with
Fertility, after Examination (−0.66). [2]
Subtract 2 marks for students who calculate the default Pearson or the Kendall correlation.
attach(swiss)
plot(Education,Fertility,pch=3,main="Scattergraph of
Fertility rate vs education")
[6]
Alternatively, students need not attach the data and instead use either of the following:
plot(swiss$Education,swiss$Fertility,pch=3,xlab="Education",
ylab="Fertility",main="Scattergraph of Fertility rate vs
education")
plot(swiss[,4],swiss[,1],pch=3,xlab="Education",
ylab="Fertility",main="Scattergraph of Fertility rate vs
education")
model <- lm(swiss$Fertility ~ swiss$Education)
summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.6101 2.1041 37.836 < 2e-16 ***
Education -0.8624 0.1448 -5.954 3.66e-07 ***
The p-values are both very significant indicating that the parameters are non-zero. [1]
abline(model,col="red",lty=3) [4]
R 2 = 0.4406 [1]
Half marks for students who give the adjusted R 2 of 0.4282 as the question asks for the R 2 .
The low coefficient of determination is due to the large spread of results around the line. [1]
There are also very few provinces with high levels of education. [1]
The residual for Neuchatel which is the 42nd province can be obtained from either of the
following:
resid(model)[42] [1]
model$resid[42]
newdata <-
data.frame(Agriculture=17.6,Examination=35,Education=32,
Catholic=16.92,Infant.Mortality=23.0)
Students who use the mean response will get (46.4, 57.6). Lose 1 mark.
Students who obtain a 95% confidence interval will get individual (31.8, 72.2) or mean (45.3, 58.7).
Lose 1 mark and 2 marks, respectively.
summary(fit1a)
The adjusted R 2 is 0.4242, which is a bit worse, and not all parameters are significant. [1]
summary(fit1b)
summary(fit1c)
summary(fit1d)
So fit1b (+Catholic) improves the adjusted R 2 the most (to 0.5552) and has all parameters
significant. [1]
summary(fit2a)
summary(fit2b)
The adjusted R 2 is 0.5452, which is worse, and not all parameters are significant. [1]
summary(fit2c)
So fit2c (+Infant.Mortality) improves the adjusted R 2 the most (to 0.639) and has all
parameters significant. [1]
summary(fit3a)
summary(fit3b)
The adjusted R 2 is 0.6319, which is worse, and not all parameters are significant. [1]
So fit3a (+Agriculture) is the only model that improves the adjusted R 2 (to 0.6707) and
has all parameters significant. [1]
summary(fit4a)
The adjusted R 2 is 0.671, which is a marginal improvement but not all the parameters are
significant. [1]
Hence, we do not add this covariate and stick with model fit3a with an adjusted R 2 of 0.6707.
[1]
summary(fit5a)
The adjusted R 2 is 0.6628, which is worse, and not all parameters are significant. [1]
summary(fit5b)
Examination was not a significant main effect and therefore should not be considered.
summary(fit5c)
The adjusted R 2 is 0.6779, which is an improvement, but not all parameters are significant. [1]
So fit5b (+Education:Catholic) is the only model that improves the adjusted R 2 (to
0.699) and has all parameters significant. [1]
summary(fit6a)
The adjusted R 2 is 0.6917, which is worse, and not all parameters are significant. [1]
summary(fit6b)
The adjusted R 2 is 0.6957, which is worse, and not all parameters are significant. [1]
Neither of these models meet the criteria so the best model is fit5b which has the following
covariates:
Students could try examination as a main effect again after the first interaction term to see if it is
now significant. If they do, they’ll get an adjusted R 2 of 0.7073, which is an improvement, but
note that not all parameters are significant. So again, it would not be added.
The residual for Neuchatel which is the 42nd province can be obtained from either of the
following:
resid(fit5b)[42] [1]
fit5b$resid[42]
newdata <-
data.frame(Agriculture=17.6,Examination=35,Education=32,
Catholic=16.92,Infant.Mortality=23.0) [2]
Students who use the mean response will get (54.3, 67.3). Lose 1 mark.
Students who obtain a 95% confidence interval will get individual (44.9, 76.7) or mean (53.0, 68.6).
Lose 1 mark and 2 marks, respectively.
(vi)(b) Comment
For the second model (fit5b), the residual is smaller and the confidence interval is narrower.
Both of these indicate that the model is a better fit. [2]