Csc121 Full Notes

This document provides an overview of CSC 121: Computer Science for Statistics, a course taught by Radford M. Neal at the University of Toronto in 2017. The course introduces programming in R, which is widely used by statisticians. It explains why learning to program is important for working with data and doing statistics. It also gives examples of using R to perform statistical analysis and modeling. The document outlines some key things students will learn, like performing linear regression, and how to effectively learn to program in R through experimentation, documentation, and writing many programs.


CSC 121: Computer Science for Statistics

Radford M. Neal, University of Toronto, 2017

http://www.cs.utoronto.ca/~radford/csc121/

Week 1
Why Learn to Program (in R)?

• Programming is a fundamental skill in today’s world, akin to reading,
writing, and arithmetic.

• Programming is essential for working with data in more than a superficial
way, and a crucial tool in learning and doing statistics.

• The R programming language is widely used by statisticians, and has some
features that are especially helpful for statistics. Many statisticians have
written R “packages” that implement various statistical methods.

• Learning to program is also the gateway to learning more advanced computer
science, and to research and development in statistical computation.
A Statistical Analysis Using R
[Figure: “Petal Length Versus Sepal Length in Three Iris Species”: a plot of
Petal Length against Sepal Length, with a legend naming Iris setosa, Iris
versicolor, and Iris virginica.]

Model for all iris flowers:

(Intercept) Sepal.Length
  -7.101443     1.858433

Model for species setosa:

(Intercept) Sepal.Length
  0.8030518    0.1316317

Model for species versicolor:

(Intercept) Sepal.Length
  0.1851155    0.6864698

Model for species virginica:

(Intercept) Sepal.Length
  0.6104680    0.7500808
The R Script Used (which you’re not expected to understand yet)
# Analyse the relationship of petal length to sepal length in the flowers of
# three iris species, fitting regression models to all data and to each species.

species <- levels(iris$Species)          # Names of the three species of iris
species_col <- c("red","green","blue")   # Colours to use for the three species
names(species_col) <- species

# Plot the sepal and petal lengths for each flower that was measured. Identify
# species by colour. Randomly jitter the data slightly to prevent overlap.

plot (iris$Sepal.Length + runif(nrow(iris),-0.02,0.02), xlab="Sepal Length",
      iris$Petal.Length + runif(nrow(iris),-0.02,0.02), ylab="Petal Length",
      col=species_col[as.character(iris$Species)],
      main="Petal Length Versus Sepal Length in Three Iris Species")

mtext (paste(" Iris",species), adj=0, line=c(-5,-3.5,-2), col=species_col)

# Show and plot regression line of petal length on sepal length fit to all data.

m <- lm (Petal.Length ~ Sepal.Length, data=iris)  # Find fit for all data
cat ("\nModel for all iris flowers:\n\n")         # Print the regression
print (coef(m))                                   #   coefficients
abline (m)                                        # Add regression line to plot

# Print and plot linear regression lines for petal length on sepal length
# fit to data on each species separately.

for (sp in species) {
    d <- iris[iris$Species==sp,]                     # Data for one species
    m <- lm (Petal.Length ~ Sepal.Length, data=d)    # Find fit for one species
    cat ("\nModel for species ",sp,":\n\n", sep="")  # Print the regression
    print (coef(m))                                  #   coefficients
    clip (min(d$Sepal.Length)-0.2,                   # Add the regression line
          max(d$Sepal.Length)+0.2, -1e10, 1e10)      #   for this species
    abline (m, col=species_col[sp])                  #   to the plot
}
Some Statistical Tasks Where Programming Helps

• Obtaining data from files or databases where it isn’t in the format required,
or needs to be “cleaned up” (eg, inconsistent names for the same thing, some
records with erroneous data, . . . ).

• Applying a statistical method that isn’t implemented in a standard package,
but which may be the most appropriate method for the problem at hand.
(Or a standard package may just need to be tweaked a bit.)

• Research into new statistical methods for general use. They’re not very useful
if they can’t be done on a computer!

• Reporting the results of an analysis using appropriate plots, tables, or other
output, which might not be what a standard package produces.

• Embedding statistical methods in larger systems (eg, detecting when there’s
enough evidence of a problem to shut down a refinery before it explodes).
About This Course
This is a programming course, not a statistics course. Some programming
examples will involve statistics, but only very elementary statistics.

This course is meant for students who

• Do not already have a firm knowledge of introductory programming (who
might be totally ignorant of programming, or who might have some slight
programming experience).
• Are interested in working in statistics, or in a field that uses statistics
extensively. Students with other interests might be better off in CSC 120 or
in CSC 108.

It is not meant for

• Students who already know how to program well (even if they don’t know R).
They should take CSC 148 directly, and/or just learn R on their own.
• Students expecting to go on to more advanced courses in computer science.
They should take CSC 108, and then CSC 148.
(If you change your mind, CSC 121 can, with some extra study, substitute
for CSC 108 as preparation for CSC 148, but consult the CS department.)
Using R on CS Teaching Computers or on Your Computer
R can be used on almost any computer, whatever kind of hardware it has, and
whether its operating system is Linux, Mac OS X, or Microsoft Windows.

R is free. You can download it from r-project.org and install it on your
computer.

R is also installed on the CS department’s teaching computers
(teach.cs.utoronto.ca), which you can use for this course.

There are several ways to use R:

– in a “terminal” window, typing commands and seeing output there
– within a more elaborate graphical user interface
– non-interactively, with input from a file, and output and plots to other files

We’ll talk more about these ways of running R later.


How to Learn to Program (in R)

• Play around. R is an interactive language: you can type something and
immediately see the result.
> 14 + 11 # You type this and see the answer below...
[1] 25
> plot(iris) # You type this and see plots of the "iris" dataset

• Use R’s “help” facility, the on-line “Introduction to R”, and other on-line or
paper documentation. See the course web page for some links.

• Read other people’s programs.

• Write your own programs — lots of them!

• Read other people’s programs.

• Write your own programs — lots of them!

• etc.
Two Kinds of Programs
Some programs compute an output from some input.
Examples:

• Input: A list of numbers (for example, 2, 11, 5).
Output: The mean (average) of these numbers (for the above example, 6).

• Input: The age, blood pressure, and cholesterol levels of 1000 people.
Output: Three clusters of people who are similar in these measurements.
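
As a minimal sketch, the first example above (a list of numbers in, their mean out) takes only a few lines of R. It uses the built-in functions c and mean, which are introduced properly later in these notes; the variable names are only for illustration.

```r
numbers <- c (2, 11, 5)      # Input: a list of numbers
average <- mean (numbers)    # Output: their mean, (2 + 11 + 5) / 3
print (average)
```

The clustering example is the same kind of program, but computing its output takes far more work.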

Other programs do things, and perhaps also take input and produce output.
Examples:

• Any simple video game: No input, no output, just play.

• Input: A data set of pairs of numbers.
Actions: Plot the data points.
Allow the user to identify “outliers” by clicking on them with a mouse.
Output: The data set with outliers removed.
What is a Program? Data + Procedures + Structure
All programs work with data of some sort:
– there’s usually some input data
– all but the simplest programs create more data during computations
– most programs produce some output data

The procedures in a program do things with data, producing new data, or taking
actions. For example, procedures in a program might do things like:
– add two numbers to get a third number
– re-arrange a list of numbers in increasing order
– display a plot of a set of numbers
– change all the upper-case letters in a document to lower-case
Specifying the procedures that operate on data — sometimes also called
“scripts”, “methods”, or “functions” — is a major part of programming.

Procedures and data need to have a good structure that

– produces the correct answer, and also . . .
– is easy for a person to read and modify
Some Types of Data in R
Every data item in R has a type — the kind of data it is, which determines what
operations can be done with it.

Real numbers in R have numeric type (also called “double”, for obscure reasons).
We can write these numbers in mostly familiar fashion:

123
1.234
1.23e-44 ← this means 1.23 × 10^-44

R can also operate on strings of characters, which are written in single or double
quotation marks:

"x"
"Hello, James."
'say "please"!'
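
To see the type of a data item for yourself, one option is R's built-in typeof function, which reports how a value is stored. This is a small sketch; the values and variable names are just examples.

```r
number_type <- typeof (123)               # How R stores an ordinary number
small_type  <- typeof (1.23e-44)          # Scientific notation gives the same type
string_type <- typeof ('say "please"!')   # A string of characters
print (c (number_type, small_type, string_type))
```

Both numbers are reported as "double", and the string as "character".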
Arithmetic Operations
R can do all the usual arithmetic operations on numbers. You can try them out
by typing expressions at R’s command prompt (“>”):

> 4.1 + 6.2 # Addition
[1] 10.3
> 7.7 - 0.1 # Subtraction
[1] 7.6
> 4 * 5 # Multiplication (you need to explicitly write *)
[1] 20
> 10 / 4 # Division
[1] 2.5
> 2 ^ 3 # Raising to a power
[1] 8

Note that everything you type after “#” is a comment, which R ignores (but which
people reading what you wrote may find helpful). R also ignores extra spaces (in
most places), but they may make an expression easier to read.
You can ignore the “[1]” seen above (we’ll see later what it means).
Combining Operations, Parentheses, and Precedence
You can combine operations, using parentheses to indicate which is done first:
> (8 + 2) * 5
[1] 50
> 8 + (2 * 5)
[1] 18
You can omit parentheses if the precedence of the operators would produce the
desired result. Addition and subtraction have lower precedence than multiplication
and division, which have lower precedence than raising to a power:
> 8 + 2*5 # Same as 8 + (2*5)
[1] 18
> 3 * 5^2 # Same as 3 * (5^2)
[1] 75
Operators (except “^”) with the same precedence are applied leftmost first:
> 2 - 1 + 9 # Same as (2 - 1) + 9
[1] 10
> 50 / 5*10 # Same as (50 / 5)*10, NOT 50 / (5*10)
[1] 100
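
The exception for “^” can be checked directly: repeated powers are grouped from the right rather than from the left. The sketch below also shows that a leading minus sign has lower precedence than “^”, which is easy to get wrong. The variable names are only for illustration.

```r
right_grouped <- 2 ^ 3 ^ 2      # Same as 2 ^ (3 ^ 2), which is 2 ^ 9
left_grouped  <- (2 ^ 3) ^ 2    # Parentheses force 8 ^ 2 instead
negated       <- -2 ^ 2         # Same as -(2 ^ 2), NOT (-2) ^ 2
print (c (right_grouped, left_grouped, negated))
```

When in doubt about precedence, adding parentheses costs nothing and makes the intent clear to a reader.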
More on Typing Expressions Into R
If you need to split an expression between lines, make sure the first line doesn’t
look like a whole expression on its own.
Example:
> 1234 + 5678 + 1111 + 2222 + 3333 + 567890 *
+ 876 / (1 + 2 + 3 + 4) - 888^2
[1] 48972198
The first line, “1234 + 5678 + 1111 + 2222 + 3333 + 567890 * ”, isn’t a
valid expression — there’s nothing after the “*”. To tell you that more is needed,
R changes the prompt from “>” to “+” (this “+” has nothing to do with addition).

If what you type doesn’t make sense, R displays an error message (and ignores
what you typed).
Example:
> 2 * (3 + 4))
Error: unexpected ')' in " 2 * (3 + 4))"
This kind of error is called a syntax error. R doesn’t even try to do anything,
because it can’t figure out what you meant.
Mathematical Functions
R can also compute mathematical functions, such as logarithms and cosines:

> log(10) # Natural logarithm (to base e)
[1] 2.302585
> exp(1) # Exponential (power of e)
[1] 2.718282
> cos(1) # Cosine, for angle in radians
[1] 0.5403023
> sqrt(2) # Square root
[1] 1.414214

You can combine several mathematical functions and/or arithmetic operators:

> exp(-1) # Should be the same as 1 / e
[1] 0.3678794
> 2 * log(3*4) # This should be the same as the next one...
[1] 4.969813
> log(3^2) + log(4^2)
[1] 4.969813
String Operations
R can also do operations on strings of characters.

You can put two or more strings together into one string:

> paste ("John", "Henry", "Smith") # Separated by spaces
[1] "John Henry Smith"
> paste ("John", "Henry", "Smith", sep="") # Separated by nothing
[1] "JohnHenrySmith"
> paste ("Toronto", "Ontario", sep=", ") # Separated by ", "
[1] "Toronto, Ontario"

You can also extract just part of a character string:

> substring("12 Jan 2016", 4, 6) # Get the 4th through 6th characters
[1] "Jan"

Why do we need these operations, when people are good at combining and
extracting characters without the help of a computer? They’re useful as parts of
larger programs — for example, to build suitable titles and axis labels for plots.
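
As a sketch of that use, here is how paste and substring might build a plot title and an axis label. The species name, the count, and the variable names are made up for illustration.

```r
species   <- "setosa"     # Hypothetical values, for illustration only
n_flowers <- 50
plot_title <- paste ("Iris ", species, " (", n_flowers, " flowers)", sep="")
month   <- substring ("12 Jan 2016", 4, 6)
x_label <- paste ("Measured in", month)
print (plot_title)
print (x_label)
```

The resulting strings could then be passed to a plotting function, for instance as its main and xlab arguments.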
Saving Values in Variables
You can save a value in a variable, giving it some name. You then can use that
name to refer to the value in the variable later:

> x <- 123 + 456 # We assign a value to variable x using <-
> x * 10 # We can then refer to the variable many times
[1] 5790
> x / 10
[1] 57.9

You can see what value is in a variable by just typing its name:

> x
[1] 579

Note: You can use “=” rather than “<-” to assign a value to a variable, and that
is what is used in some other programming languages. But “<-” looks like an
arrow, which is more descriptive of what happens: x <- 9 moves the value 9 into
the variable x. I recommend that you use “<-”.
Names for Variables
A variable name can be any sequence of letters, digits, “.”, and “_”, except that
it can’t start with “_”, or with a digit, or with “.” followed by a digit.
Choosing good names for variables helps you (and others) remember what they
are for.
Examples:

> this_year <- 2016
> this_year + 2
[1] 2018
> this.month <- "January"
> paste (this.month, this_year) # paste converts numbers to strings
[1] "January 2016"

Using “.” in a variable name (like in this.month above) is a bit archaic, and
clashes with usage in other programming languages. It’s better to use “_”. It’s
also good to be consistent, whatever you do. (Not like above!)

Note! xy is the name of a single variable, not x times y (which we write as x*y).
Changing the Value in a Variable
The value stored in a variable can be changed. When you refer to a variable you
always get the last value stored into it:
> My_age <- 12 # Set My_age to 12
> My_age - 19
[1] -7
> My_age <- 22 # Change My_age to now be 22 (value 12 forgotten)
> My_age - 19
[1] 3
Changing a variable’s value doesn’t change things previously computed from it:
> My_age <- 12
> h <- My_age - 19 # Set h based on My_age being 12
> h
[1] -7
> My_age <- 22 # When we change My_age to be 22, the value
> h # of h doesn’t change
[1] -7
> h <- My_age - 19 # But we can re-compute h with the new My_age
> h
[1] 3
Vectors
R lets you put together several data values of the same type into a vector.

Here’s one way to create a vector from individual values:

c (4.1, 123, 0.099)

This creates a numeric vector containing three numbers.

Vectors of character strings can be created similarly:

c ("Robert", "Mary", "George", "Helen", "Vladimir")

This creates a character vector containing five character strings.

The order within a vector matters — c(3,4) and c(4,3) are not the same thing.
Repetitions also matter — c(3,3) is not the same as c(3,3,3).
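
These two claims can be checked with R's built-in identical function, which tests whether two values are exactly the same. This is only a sketch; identical and length are standard R functions, but the variable names are invented here.

```r
same_order    <- identical (c(3,4), c(3,4))     # Same values, same order
swapped_order <- identical (c(3,4), c(4,3))     # Same values, different order
extra_repeat  <- identical (c(3,3), c(3,3,3))   # Different number of repetitions
n_names <- length (c ("Robert", "Mary", "George", "Helen", "Vladimir"))
print (c (same_order, swapped_order, extra_repeat))
print (n_names)
```

Only the first comparison gives TRUE, and the character vector has length five.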
Combining Vectors
The “c” function can also create vectors by combining other vectors:

> a <- c (1.3, 5.1, 3.3)
> b <- c (200, 400, 100)
> c(a,b,b,b,b,b,a)
[1] 1.3 5.1 3.3 200.0 400.0 100.0 200.0 400.0 100.0 200.0
[11] 400.0 100.0 200.0 400.0 100.0 200.0 400.0 100.0 1.3 5.1
[21] 3.3

Note how R prints long vectors that take more than one line: each line starts
with the index of the next element printed in brackets. (That’s also why R
prints “[1]” before the answer when it’s a single number.)
Plotting Data Stored in Vectors
> x <- c (4.1, 4.9, 5.3, 4.3, 3.5, 3.0, 3.1, 2.8)
> plot (x) # plot data x as points, against indexes 1, 2, ..., 8

[Plot: x against Index (1 to 8), shown as points]

> plot (x, type="l") # plot the data as lines instead

[Plot: x against Index (1 to 8), shown as connected lines]

> z <- c (4.3, 4.8, 5.1, 4.2, 3.1, 3.2, 3.0, 2.7)
> lines (z, col="red") # add data z to the plot, in red

[Plot: x against Index (1 to 8) as lines, with z added as a red line]
Arithmetic on Vectors
R can do arithmetic on two vectors of the same length, applying the arithmetic
operation to corresponding elements:

> x <- c (4.1, 4.9, 5.3, 4.3, 3.5, 3.0, 3.1, 2.8)
> z <- c (4.3, 4.8, 5.1, 4.2, 3.1, 3.2, 3.0, 2.7)
> x - z
[1] -0.2 0.1 0.2 0.1 0.4 -0.2 0.1 0.1

One application is to plot the differences between two data vectors (that are the
same length):

> plot (x-z, type="b") # Plots both lines and dots
> abline(h=0) # Adds a horizontal line at 0 to the plot

[Plot: x−z against Index (1 to 8), as lines and dots, with a horizontal line at 0]
Arithmetic on a Vector and a Scalar
You can also do an arithmetic operation on a vector and a single number:

> x
[1] 4.1 4.9 5.3 4.3 3.5 3.0 3.1 2.8
> x + 1
[1] 5.1 5.9 6.3 5.3 4.5 4.0 4.1 3.8
> 10 * x
[1] 41 49 53 43 35 30 31 28

In fact, R will do arithmetic on any two vectors, repeating the shorter one to
reach the length of the longer:

> x + c(100,0)
[1] 104.1 4.9 105.3 4.3 103.5 3.0 103.1 2.8

Indeed, R thinks of a scalar (a single number) as a vector of length one, so
operations on a vector and a scalar are a special case of this.
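
A small sketch of this repetition rule, using a shorter vector than the one above so the pattern is easy to follow (the variable names are only for illustration):

```r
x <- c (4.1, 4.9, 5.3, 4.3)
doubled_odd <- x * c(2,1)    # c(2,1) is repeated to give c(2,1,2,1)
shifted     <- x + 10        # 10 is treated like c(10,10,10,10)
print (doubled_odd)
print (shifted)
```

If the length of the longer vector is not a multiple of the length of the shorter one, R still repeats the shorter vector, but it also prints a warning, since this is often a sign of a mistake.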
Getting Individual Elements of a Vector
You can extract a single number (an “element”) from a vector of numbers by
subscripting the vector with an index. For example
> x <- c (4.1, 4.9, 5.3, 4.3, 3.5, 3.0, 3.1, 2.8)
> x[3]
[1] 5.3
> x[1]
[1] 4.1
> x[8]
[1] 2.8
Notice that indexes start at one, for the first element, and go up to the length of
the vector, for the last element. (Some other programming languages start their
indexes at zero.)
You can find out the length of a vector using the length function:
> length(x)
[1] 8
> x[length(x)] # Gets the last element regardless of how long x is
[1] 2.8
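
Subscripted elements can be used in arithmetic like any other numbers. The sketch below computes the change from the first element to the last, and also shows what happens when an index goes past the end of the vector. The variable names are invented for this example.

```r
x <- c (4.1, 4.9, 5.3, 4.3, 3.5, 3.0, 3.1, 2.8)
change <- x[length(x)] - x[1]    # Last element minus first element
beyond <- x[length(x) + 1]       # There is no 9th element, so R gives NA
print (change)
print (beyond)
```

NA is R's marker for a value that is “not available”. R does not report an error here, so an index that is too large can go unnoticed.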
Setting Individual Elements in a Vector
You can also change a single element in a vector held in a variable, by assigning
to the name of the variable followed by a subscript. For example:

> x <- c (4.1, 4.9, 5.3, 4.3, 3.5, 3.0, 3.1, 2.8)
> x[2] <- 7.7
> x
[1] 4.1 7.7 5.3 4.3 3.5 3.0 3.1 2.8 # x[2] changed from 4.9 to 7.7

Changing an element in one vector doesn’t change any other vector:

> x <- c (4.1, 4.9, 5.3, 4.3, 3.5, 3.0, 3.1, 2.8)
>
> y <- x # The vector in y is now the same as the vector in x
>
> x[2] <- 7.7 # We change the second element in the vector x
>
> y # But y is still the same as before
[1] 4.1 4.9 5.3 4.3 3.5 3.0 3.1 2.8
An Example: Reading Data, Plotting it, and Editing It
Data editing is one use for getting and changing individual numbers in a vector.
Let’s read a vector of numbers from a file on my web site (the scan function will
do this if it’s a simple file of numbers), then plot the data to see what it looks like:

> data <- scan ("http://www.cs.utoronto.ca/~radford/csc121/data1")
> plot(data)

[Plot: data against Index, with the vertical axis running from 0 to 80]

The 12th data point looks like it might be wrong. Let’s see exactly what it is:

> data[12]
[1] 77

Maybe there’s a missing decimal point? Could the correct value be 7.7?
. . . Example Continued
Let’s change the 12th data point assuming it’s missing a decimal point and see
what the plot looks like then:

> data[12] <- 7.7
> plot(data)

[Plot: data against Index, with the vertical axis now running from 5 to 20]

That looks more plausible.

Perhaps with further investigation we could determine whether or not 7.7 is
actually the correct value.
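
With a long data vector, the suspicious point could also be located by program rather than by eye. The sketch below uses R's built-in which.max function on made-up numbers (not the course's data1 file), and assumes, as in the example above, that the problem is a missing decimal point.

```r
data <- c (6.1, 7.4, 5.9, 77, 8.2, 6.6)   # Made-up measurements, for illustration
suspect <- which.max (data)               # Index of the largest value
print (suspect)
print (data[suspect])
data[suspect] <- data[suspect] / 10       # Assumed fix: a missing decimal point
print (data)
```

Dividing by 10 is only a guess at the correct value. Whether it is right would still need to be checked against the original records.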
Week 2
Typing Stuff into R Can be Good . . .
You can learn a lot by seeing what happens when you type an R command.
Interactive use of R is also a good way to start exploring a new data set.
For instance, you can

– make sure it’s actually the data you were told it was
– play around with how best to plot the data
– look at plots to see if there are any obviously erroneous data points
– see if relationships between variables seem to be roughly linear
. . . Or Bad
But when you’re seriously analysing data, you don’t want to just type stuff, since
– it’s tedious to type things again and again
– it’s easy to make a mistake when typing something and not notice
– you won’t be able to remember exactly what you typed
– other people won’t be able to replicate your analysis
– if you decide to change the analysis slightly, you have to do it all again
– if you get another similar data set, you have to do it all again

Instead, you want to write a program to analyse your data, saving your program
in a text file. Once you’ve written it
– you can look it over carefully to make sure it’s correct
– make changes to it without starting over again
– run it on as many data sets as you have
– share it with someone else

The ability to write readable and reliable programs is one big advantage of using
R rather than less flexible analysis tools, such as spreadsheets.
Creating and Using R Scripts
One kind of R program is a text file containing R commands that you can ask
R to perform — much as if you had typed them at the R prompt. This kind of
program is called an R script.

You can create an R script with whatever your favourite text editor is (but not
with a word processor, unless you save the document in .txt format).
RStudio has a built-in text editor, which may be the most convenient one to use.

Once you’ve created a script, you can get R to read it — and do the commands it
contains — with the source function, giving it the name of the script file:

> source("myscript.r")

RStudio has a button you can click to do this for a script created with its editor.
If you type the command yourself, you may have to use the full path to the file
(such as /Users/mary/myscript.r).

If the script doesn’t work as desired, you can change it in the editor (and save the
new version), and then get R to do it again, until you have “debugged” it.
Example Script: Read Data, Compute its Mean and SD, Plot it

# Read a file of numbers from the course web page.

data <- scan ("http://www.cs.utoronto.ca/~radford/csc121/data2")

# Compute the sample mean and sample standard deviation of the data.

m <- mean(data)
s <- sd(data)

# Plot the data points, along with a horizontal line at the mean, two
# dashed lines at the mean plus and minus the standard deviation, and
# two dotted lines at mean plus and minus twice the standard deviation.

plot(data)

abline (h=m)
abline (h=c(m-s,m+s), lty="dashed")
abline (h=c(m-2*s,m+2*s), lty="dotted")
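
For reference, the values that mean and sd compute can also be found step by step. This sketch uses made-up numbers rather than the course's data file, and relies on the standard definition of the sample standard deviation (dividing by one less than the number of data points). The variable names are only for illustration.

```r
data <- c (2.1, 3.4, 1.9, 5.0, 2.6)    # Made-up numbers, for illustration
n <- length (data)
m_by_hand <- sum (data) / n
s_by_hand <- sqrt (sum ((data - m_by_hand)^2) / (n - 1))
print (c (m_by_hand, mean(data)))      # The two means should agree
print (c (s_by_hand, sd(data)))        # So should the two standard deviations
```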
How to Run This Script, and the Plot It Produces
This script is stored in a file on the course web page. You can run scripts
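obtained from the internet using the URL rather than a file name:
> source("http://www.cs.utoronto.ca/~radford/csc121/demo-script2a.r")

[Plot: data against Index (0 to 50), with horizontal lines at the mean, at the mean plus and minus the standard deviation, and at the mean plus and minus twice the standard deviation]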

Note: For security reasons, don’t run a script from a URL at a website you
don’t trust! Instead, download the script and verify it’s OK before running it.
Looking at the Variables Set in the Script
The variables used by the R script will still exist after it has run, and can be
examined:

> source("http://www.cs.utoronto.ca/~radford/csc121/demo-script2a.r")
Read 50 items
> data
[1] 2.170 1.985 1.616 3.181 -0.978 1.597 0.862 -0.186 0.956
[10] 0.169 -0.403 1.107 0.965 2.155 -0.141 1.647 3.007 0.631
[19] 0.393 2.883 1.588 -0.228 1.078 0.355 -0.113 0.852 0.913
[28] 2.876 -0.499 -1.607 1.749 0.167 1.366 1.976 2.907 3.470
[37] 1.162 0.871 0.506 3.138 0.920 0.743 -1.242 -0.678 1.104
[46] 0.817 2.226 1.521 1.915 0.095
> m
[1] 1.07128
> s
[1] 1.205056
Ways to Run an R Script
As was mentioned, you can run an R script in the file myscript.r with the
command source("myscript.r").
But this isn’t quite the same as typing the contents of myscript.r. The
commands in myscript.r aren’t displayed, and you don’t see the value of each
expression. (Though the print function can be used to explicitly display values.)
If you want to see everything, much as if you had typed the commands, use

> source("myscript.r",echo=TRUE)

In RStudio, there is a button for sourcing the script being edited, with an option
to set echo=TRUE.

You can run a script non-interactively (plots going to the file Rplots.pdf) with
the Unix/Linux command

Rscript myscript.r

Later, we’ll see how to run scripts and get pretty output using the spin function
in the knitr package.
Uses and Limitations of Scripts
R scripts are a good way to do one thing — such as produce output and plots
from analysing one data set in one way. The script helps document exactly what
you did for later reference.

But a script isn’t a good way of doing many things, for instance:

– analysing several different data sets, or
– varying the way that the data is analysed

For example, the source of the dataset is fixed in the R script shown earlier.

It’s also not very convenient to take the output of an R script and do more with it.

It is possible to change what a script does by setting variables before you run the
script. And the script can set variables to values that can be looked at later.
But there is a better way to write programs that can do many things, and be
used as part of a larger program — using functions.
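
To make the workaround concrete, here is a sketch of a script whose behaviour is changed by setting a variable before it is run. The script is written to a temporary file only so that the example is self-contained; the file and the variable names (values, m, s) are invented for this illustration.

```r
# Write a tiny script to a temporary file (a stand-in for a real script file).
script_file <- tempfile (fileext=".r")
writeLines (c ("m <- mean(values)",
               "s <- sd(values)"), script_file)

values <- c (1, 2, 3, 4)     # Set the variable that the script expects
source (script_file)         # The script sets m and s
first_mean <- m

values <- c (10, 20, 30)     # Change the input and run the script again
source (script_file)
second_mean <- m
print (c (first_mean, second_mean))
```

This works, but the link between the script and the variable values is invisible to anyone reading the script, and each run overwrites m and s. Functions, described next, make the inputs and the output explicit.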
Programming by Defining Functions
An R function specifies how to compute an output — the value of the function
— from one or more inputs — the arguments (or parameters) of the function.

Within a function, the arguments are referred to by their names. Each time the
function is used (“called”), values for the arguments are specified, and the
argument names will refer to those values during that use of the function.
The next time the function is called, the arguments may have different values.

A function will compute a value from its arguments. When the arguments are
different, in a different call of the function, the value may also be different.
The value computed by a function call can be assigned to a variable, or used in
arithmetic, just like for R’s built-in functions like log and sin.

When defining a function, you can make use of other functions you have defined.
In this way, large, complex programs can be built from simpler parts, which helps
make them easier to understand.
Defining and Using a Simple Function
Let’s define a function called sin_deg that computes the sine of an angle specified
in degrees, rather than in radians (as for sin):
> sin_deg <- function (angle) sin(angle*pi/180)
This sets the variable sin_deg to the function specified by the expression
function (angle) sin(angle*pi/180), in the same way we can set a variable
to a number or a string. This function has one argument, which is referred to by
the name angle. The value of the function is computed as sin(angle*pi/180).
(The variable pi is pre-defined by R as π = 3.14159 . . .)
We can then use this function just like we can use R’s built-in functions:
> sin_deg(30)
[1] 0.5
> sin_deg(45)
[1] 0.7071068
> 100 + sin_deg(90)
[1] 101
Within sin_deg, the argument named angle will have values 30, 45, and 90 for
the three uses of sin_deg above.
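
Functions you define can be built from other functions you have defined. As a sketch, the conversion from degrees to radians can be given its own function and then shared. The names deg_to_rad and cos_deg are invented for this example, and sin_deg is re-defined here in terms of the helper.

```r
deg_to_rad <- function (angle) angle * pi / 180     # Hypothetical helper
sin_deg <- function (angle) sin (deg_to_rad (angle))
cos_deg <- function (angle) cos (deg_to_rad (angle))

print (sin_deg(30))
print (cos_deg(60))
print (sin_deg(30)^2 + cos_deg(30)^2)   # Should be 1, up to rounding error
```

If the conversion ever needed to change, it would now have to be changed in only one place.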
A Function With Two Vector Arguments
Functions can have more than one argument, and the arguments can be vectors
rather than single numbers.
Here’s a function that computes the distance between two points in a plane, with
each point specified by a vector of two coordinates:

> distance <- function (a,b) sqrt ((a[1]-b[1])^2 + (a[2]-b[2])^2)

Within this function, the two arguments (the points we want to compute the
distance between) are referred to by the names a and b.

Here are some uses of this function:

> distance (c(1,2), c(4,-2))
[1] 5
> x <- c(1.3,2.4)
> y <- x + c(1,1)
> distance(x,y)
[1] 1.414214
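
Because arithmetic works on whole vectors, the same computation can be written without picking out individual coordinates. This is an alternative sketch, not the version used in these notes; the name distance_any is invented here, and sum is R's built-in function for adding up the elements of a vector.

```r
distance_any <- function (a,b) sqrt (sum ((a-b)^2))   # Hypothetical name

d2 <- distance_any (c(1,2), c(4,-2))         # Same points as the first use above
d3 <- distance_any (c(0,0,0), c(2,3,6))      # Points in three dimensions
print (c (d2, d3))
```

One consequence of this design is that it works for points with any number of coordinates, as long as both points have the same number.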
Defining a Function With Several Steps
In the examples we’ve just looked at, the functions were simple enough that their
value could be computed with a single expression that wasn’t too complex.
For complicated functions, it can be convenient to break the computation into
several steps, enclosed in curly brackets ({ and }). The early steps assign values
to variables, which are used in the last step to compute the final value of the
function.

Here is a version of the distance function re-written in this way:

distance <- function (a,b) {
    diff1 <- a[1] - b[1]
    diff2 <- a[2] - b[2]
    sqrt (diff1^2 + diff2^2)
}

In this example, the steps inside the function definition are indented by four
spaces, so that it’s easier to see that they are part of the distance function.
This is a good practice, which you should follow.
Computing the Perimeter of a Diamond
Here’s a function that computes the total length of the four sides of a diamond
that has widths width1 and width2 for its two axes:

diamond_perimeter <- function (width1, width2) {

    vertex1 <- c(width1/2,0)
    vertex2 <- c(0,width2/2)
    vertex3 <- c(-width1/2,0)
    vertex4 <- c(0,-width2/2)

    ( distance(vertex1,vertex2) +
      distance(vertex2,vertex3) +
      distance(vertex3,vertex4) +
      distance(vertex4,vertex1) )
}

(As you may realize, the four sides are actually all the same length, so this could
be simplified — but we’ll pretend we don’t realize that for this example.)
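
For comparison, here is a sketch of the simplified version, which computes the length of one side and multiplies it by four. The name diamond_perimeter_simple is invented for this example, and distance is repeated so that the example stands on its own.

```r
distance <- function (a,b) sqrt ((a[1]-b[1])^2 + (a[2]-b[2])^2)

# Simplified version (hypothetical name): the length of one side, times four.
diamond_perimeter_simple <- function (width1, width2)
    4 * distance (c(width1/2,0), c(0,width2/2))

perim <- diamond_perimeter_simple (6, 8)   # Each side runs from (3,0) to (0,4)
print (perim)
```

Each side here has length 5, so the perimeter is 20, which is also what the longer version gives for the same widths.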
Using Functions Defined in a File from an R Script
We usually don’t type functions into R — they’re inconveniently long, and we
may wish to change them without having to re-type everything.
Instead, we store the definitions in a file, just as for an R script.

When we want to use these functions in an R script, we use source at the start of
the script to read these function definitions into R.

For example, we could put the definitions of distance and diamond_perimeter
into a file called distfuns.r, and then use these functions in a script as follows:

source("distfuns.r")
big_diamond_perim <- diamond_perimeter (12.1, 4.7)
small_diamond_perim <- diamond_perimeter (0.4, 0.9)
print(big_diamond_perim)
print(small_diamond_perim)

By putting the definitions of these functions in a file separate from the script that
uses them, we can easily use the same functions in other scripts as well.
Functions that Do Things
The purpose of the functions in the previous examples is to compute some output
value from the inputs given as arguments. We can also define functions whose
purpose is to do something, instead of (or in addition to) computing something.
Here’s an example:

> parrot <- function () {
+     what <- readLines (n=1)
+     cat (what, what, what, "\n")
+ }

This function has no arguments, and produces no value as output. It just reads a
line of text typed by the user, and then prints it three times (followed by an
end-of-line marker, which is written as "\n").
For example:

> parrot()
Hello!
Hello! Hello! Hello!
Example: Plotting Data with Mean and SD

# PLOT DATA VECTOR SHOWING MEAN AND STANDARD DEVIATION. Plots a vector
# of data points, along with horizontal lines showing the mean (solid
# line), the mean +- sd (dashed lines), and the mean +- 2*sd (dotted
# lines). The single argument must be a numeric vector. No return value.

plot_showing_mean_sd <- function (data) {

    m <- mean(data)
    s <- sd(data)

    plot(data)

    abline (h=m)
    abline (h=c(m-s,m+s), lty="dashed")
    abline (h=c(m-2*s,m+2*s), lty="dotted")
}
Using this Function in a Script
We can put this function definition in a script called demo-funs2b.r, and then
use it (twice) in another script, demo-script2b.r, which starts by reading in the
script that defines the function:

source("http://www.cs.utoronto.ca/~radford/csc121/demo-funs2b.r")

par(mfrow=c(1,2)) # Put two plots side-by-side

data1 <- scan ("http://www.cs.utoronto.ca/~radford/csc121/data1")
plot_showing_mean_sd(data1)

data2 <- scan ("http://www.cs.utoronto.ca/~radford/csc121/data2")
plot_showing_mean_sd(data2)

We can run this script by

source("http://www.cs.utoronto.ca/~radford/csc121/demo-script2b.r")
Try it and see what you get!
Better yet, download both scripts, change them a bit to do something else, and
try that.
Week 3
Making Functions Do Different Things, Using if
When you call a function with different values for its arguments, it can compute
different return values, or plot different data. That’s more useful than a script
that computes or does only one thing.
But what if the way the return value should be computed, or data should be
plotted, depends on the arguments?
We can use if to do this:

# Function to compute how much of your income you should save.
money_to_save <- function (income)
    if (income < 30000) 0 else 0.1 * (income-30000)

# Plot data with lines, plus dots if no more than 100 points.
plot_data <- function (data) {
if (length(data) > 100)
plot (data, type="l") # Plot lines only
else
plot (data, type="b") # Plot lines, plus dots at the points
}
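As a quick check of the first function, a few calls with incomes on either side of the 30000 threshold might look like the following. The income figures are made up for illustration.

```r
# Definition repeated from above so this example runs on its own.
money_to_save <- function (income)
    if (income < 30000) 0 else 0.1 * (income-30000)

low  <- money_to_save (20000)   # below the threshold, so nothing is saved
edge <- money_to_save (30000)   # exactly at the threshold: 0.1 * 0
high <- money_to_save (50000)   # 0.1 * (50000-30000)
cat (low, edge, high, "\n")
```

Since if looks at a single TRUE or FALSE value, this function is meant to be called with one income at a time.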
Comparisons and the Logical Data Type
In these if expressions, which thing to do is determined by comparing two numbers.
Comparisons in R produce values of logical data type — either TRUE or FALSE.
Here are some examples:
> a <- 12
> a < 10 # "less than" comparisons
[1] FALSE
> a < 20
[1] TRUE
> a > 0 # "greater than" comparisons
[1] TRUE
> 10 > a
[1] FALSE
> a == 12 # "equals" comparisons - note that it uses ==, not just =
[1] TRUE
> a == 9
[1] FALSE
> a != 9 # "not equal" comparison
[1] TRUE
More Comparisons
> a <- 12
> a >= 12 # "greater or equal" comparisons
[1] TRUE
> 13 >= a
[1] TRUE
> a <= 11 # "less or equal" comparison
[1] FALSE

You can compare strings too:


> time <- "morning"
> time == "evening"
[1] FALSE
> time == "morning"
[1] TRUE
> time != "evening"
[1] TRUE
Strings can also be compared with <, >, <=, and >=, according to “alphabetical
order”, but exactly how this works may depend on what “locale” R is operating in.
Some Notes on Spaces
Although R usually ignores spaces, an exception is that the ==, !=, <=, and >=
operators must be written without spaces inside them. Notice the syntax error
below, for example:
> a < = 4
Error: unexpected ’=’ in "a < ="

This is true for the assignment operator, <-, as well. It’s a good idea to always
put spaces around the <- operator, however, because it makes programs easier to
read. Spaces around other operators can sometimes improve readability too.
Compare the following:
> abc[123]<-xyz+456*pqr
> abc[123] <- xyz + 456*pqr

Warning: The expression a<-9 assigns the value 9 to the variable a. If you want
to compare the value in a to the number -9, you must put a space between < and -:
> a < -9
[1] FALSE
More on How if Works
An if expression has the form

if ( condition ) true-option else false-option

The condition produces a TRUE or FALSE value. If the value is TRUE, the
true-option expression is done; if FALSE, the false-option expression is done.
The true-option and false-option expressions can have several steps, enclosed
between { and }.
When the expression is evaluated for what it does, rather than producing a value,
the else part can be omitted — that’s the same as making false-option be { },
which does nothing.
It’s often useful for false-option to be another if expression. An example:

> edu_level <- ( if (school=="primary") 1
                 else if (school=="secondary") 2
                 else if (school=="university") 3
                 else 0 )
Doing Things Again and Again and Again. . . Using for
We often want to do the same thing, or similar things, many times. One way is to
do arithmetic operations on vectors. But only some simple things can be done
many times that way.
A more general way is to use a loop.
One kind of loop in R is a for loop, which has the form

for ( variable in vector ) body

The body can be one statement, or several enclosed in { and }. The for loop does
the body as many times as there are elements in vector, with variable set to each
element in turn.
Here’s an example:

> for (i in c(10,12,456)) cat ("The square of",i,"is",i^2,"\n")
The square of 10 is 100
The square of 12 is 144
The square of 456 is 207936
Using for with a Sequence Vector
We often want a for loop to go through a sequence of integers. We can create a
vector containing such a sequence with the : operator. For example:

> 1:5
[1] 1 2 3 4 5

Here’s an example of its use with for:

> for (i in 1:5) cat ("The square of",i,"is",i^2,"\n")
The square of 1 is 1
The square of 2 is 4
The square of 3 is 9
The square of 4 is 16
The square of 5 is 25
Example: Looking at and Modifying a Vector Using a Loop
The function below takes a vector argument and returns this vector modified so
all elements are between 0 and 100. It also prints a message if any elements were
outside this range:
make_in_0_to_100 <- function (vec) {
below_0 <- 0; above_100 <- 0
for (i in 1:length(vec)) {
if (vec[i] < 0) {
vec[i] <- 0
below_0 <- below_0 + 1
}
else if (vec[i] > 100) {
vec[i] <- 100
above_100 <- above_100 + 1
}
}
if (below_0 + above_100 > 0)
cat ("Out of", length(vec), "elements,", below_0,
"were below 0 and", above_100, "were above 100\n")
vec
}
Calls of the Example Function
Here are some calls of this function:
> source("http://www.cs.utoronto.ca/~radford/csc121/make_in_0_to_100.r")
> make_in_0_to_100(c(33,55,77))
[1] 33 55 77
> original_vec <- c(12,-2,17,33,101,-3,104,-1,93)
> modified_vec <- make_in_0_to_100(original_vec)
Out of 9 elements, 3 were below 0 and 2 were above 100
> modified_vec
[1] 12 0 17 33 100 0 100 0 93

Notice the way we count how many elements were below zero using the below_0
variable. This variable is set to zero before the loop. Inside the loop, whenever an
element is found to be below zero, the count is increased by the assignment
below_0 <- below_0 + 1
This assigns a new value to below_0 that’s equal to the old value of below_0 plus 1.

Later, we’ll see how we can write this function more easily, without a loop, using
more advanced vector-handling facilities. But avoiding loops isn’t always possible.
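As a preview of a loop-free approach, here is one possible sketch using R's pmax and pmin functions, which these notes have not introduced yet. The name make_in_0_to_100_v is invented for this illustration, and this is not necessarily the version shown later in the course.

```r
# Hypothetical loop-free variant (name invented for this sketch).
make_in_0_to_100_v <- function (vec) {
    below_0 <- sum (vec < 0)        # count of elements below 0
    above_100 <- sum (vec > 100)    # count of elements above 100
    if (below_0 + above_100 > 0)
        cat ("Out of", length(vec), "elements,", below_0,
             "were below 0 and", above_100, "were above 100\n")
    pmin (pmax (vec, 0), 100)       # raise values to at least 0, then cap at 100
}

modified_vec <- make_in_0_to_100_v (c(12,-2,17,33,101,-3,104,-1,93))
print (modified_vec)
```

Here sum applied to a comparison counts the TRUE values, which takes the place of the counting variables updated inside the loop.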
When You Don’t Know How Many Times. . . Using while
A for loop repeats its body only as many times as the length of the vector it
is given at the start. But sometimes you can’t know at the start how many
repetitions will be needed.
Instead, we can use a while loop. It has the form

while ( condition ) body

This will repeat body as many times as necessary, until condition is FALSE. If
condition is FALSE at the start, body is not done even once.

Here’s an example, which searches for the smallest integer, i, greater than one, for
which i^20 is less than e^i:

> i <- 2
> while (i^20 >= exp(i)) i <- i + 1
> i
[1] 90
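One way to convince yourself that the loop stopped at the right place is to redo the comparison just before and at the final value of i. This is only a sanity check added for illustration; the variable names before and at are made up.

```r
i <- 2
while (i^20 >= exp(i)) i <- i + 1

before <- (i-1)^20 < exp(i-1)   # the comparison one step earlier: should be FALSE
at <- i^20 < exp(i)             # the comparison at the final i: should be TRUE
cat (i, before, at, "\n")
```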
Now You Can Do Anything!
With what you now know about R programming, for anything that can possibly
be computed, you can (in theory) write a program that can compute it!

The key points:

• You know about vectors, which as they get longer (eg, by putting shorter
ones together with c (..., ...)) will hold more and more data, with no
upper limit (in theory — in practice you run out of memory on your
computer at some point).

• You know about while loops, which can repeat operations as many times as
necessary, with no upper limit (in theory — in practice your computer will
wear out and fail after some number of years of computing).

The technical term for this is that the R language (even just the part you know
now) is “Turing complete” (after famous computer scientist Alan Turing, who
formalized the notion of what can and cannot be computed).
So That’s the End of the Course?
Now that you know how to compute anything that’s computable, what’s left to
do in the course?
• You may “know” how to do any computation, but you still need to develop
the skills that will help you to actually do it. This takes both instruction
and practice, practice, practice, . . .
• Once you learn more about R, you’ll know how to do some computations
more easily than you could do them using only what you know now.
• You’ll learn about some R features that aren’t strictly computational, but are
very useful (such as how to put titles on plots).
• We’ll talk about how to write programs that run faster — so you won’t have
to wait hours or days for the results.
• We’ll talk about how to write programs that are easier for yourself and other
people to understand.
• We’ll talk about how to test that your programs actually work correctly.
• We’ll talk about how to keep track of different versions of your programs, as
you change them to make them better, or work for various different problems.
Week 4
Combining Data of Different Types in a List
We’ve seen how we can put several numbers into a vector of numbers. Or we can
put several strings into a vector of strings. But what if we want to combine both
types of data? Let’s try. . .
> c(123,"fred",456)
[1] "123" "fred" "456"
R converts the numbers to character strings, so that the elements of the vector
will all be the same type (character).
But we can put together data of different types in a list:
> list(123,"fred",456)
[[1]]
[1] 123

[[2]]
[1] "fred"

[[3]]
[1] 456
Lists Can Contain Anything
Elements of a list can actually be anything, including vectors of different lengths:
> list (1:4, 3:10)
[[1]]
[1] 1 2 3 4

[[2]]
[1] 3 4 5 6 7 8 9 10
You can even put lists within lists (though these are hard to read when printed):
> list(4,list(5,6))
[[1]]
[1] 4

[[2]]
[[2]][[1]]
[1] 5

[[2]][[2]]
[1] 6
Extracting and Replacing Elements of a List
You can get a single element of a list by subscripting with the [[ . . . ]] operator:
> L <- list (c(3,1,7), c("red","green"), 1:4)
> L[[2]]
[1] "red" "green"
> L[[3]]
[1] 1 2 3 4
You can replace elements the same way. Continuing from above. . .
> L[[3]] <- c("x","y","z")
> L
[[1]]
[1] 3 1 7

[[2]]
[1] "red" "green"

[[3]]
[1] "x" "y" "z"
Notice that the new value can have a type different from that of the old value.
Looking at All Elements of a List; Extending Vectors
You can look at all elements of a list with the for statement, using length to
find out how many elements there are.
Suppose we have a list of vectors of strings or numbers. For example, we might
create such a list as follows:

> L <- list (c("a","b"), 2:4, c("x","y","z"))

The following will create a single vector of strings, called v, containing all the
elements of all the vectors from the list L:

> v <- character(0) # creates a string vector with zero strings
> for (i in 1:length(L)) v <- c (v, L[[i]])
> v
[1] "a" "b" "2" "3" "4" "x" "y" "z"

Note how we can start with a vector with no elements, and then extend it using
the c function. Also note how the vector of numbers was automatically converted
to a vector of strings, so they could be combined with a string vector.
Extending Lists
You can also build up lists starting with a list containing zero elements, which we
can create with list().
One way to extend the list is to just assign to an element that doesn’t exist yet
(usually the one just after the last existing element):

> a <- list()
> a[[1]] <- 1:3; a[[2]] <- TRUE; a[[3]] <- "hello"
> a
[[1]]
[1] 1 2 3

[[2]]
[1] TRUE

[[3]]
[1] "hello"

You can also combine lists with the c function.
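A small sketch of combining lists with c follows. The variable names a, b, and ab are chosen just for this example.

```r
a <- list (1:3, TRUE)
b <- list ("hello")
ab <- c (a, b)        # a list with the elements of a followed by those of b
print (length (ab))
print (ab[[3]])
```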


More on Logical Values
We’ve seen that R uses logical values to represent the result of a comparison, such
as below:

> a <- 10
> a < 3
[1] FALSE
> a < 30
[1] TRUE

We can save logical values in variables, and then use them as if or while
conditions:

> b <- a < 30
> if (b) cat("It’s TRUE!\n")
It’s TRUE!

We can also just assign TRUE or FALSE to a variable.


Using Logical Variables to Stop a While Loop
Logical variables we set to TRUE or FALSE can be useful for stopping while loops.
This bit of a program checks for values in vec outside the range 0 to 100, stops
with a message if it finds one, or stops with no message if there are none:
i <- 0
keep_going <- TRUE
while (keep_going) {
i <- i + 1
if (i > length(vec))
keep_going <- FALSE
else if (vec[i] < 0) {
cat("Found a value less than 0\n")
keep_going <- FALSE
}
else if (vec[i] > 100) {
cat("Found a value greater than 100\n")
keep_going <- FALSE
}
}
The Logical “AND” Operator — &&
Suppose we want to print a message if the number in next_value is within the
range 0 to 100 (and do nothing if it is not).
Here’s one way we could do this:

if (next_value >= 0)
if (next_value <= 100)
cat("Next value is OK\n")

Instead, we can do this with just one if by using R’s logical AND operator,
which is written &&:

if (next_value >= 0 && next_value <= 100)
    cat("Next value is OK\n")

An expression such as X && Y produces TRUE only if X and Y are both TRUE, and
FALSE if either (or both) of X and Y are FALSE.
The Logical “OR” Operator — ||
Similarly, R has a logical OR operator, written ||.
An expression such as X || Y produces TRUE if either X or Y (or both) are TRUE,
and FALSE if both X and Y are FALSE.
We could use it to print a message if the number in next_value is not in the
range 0 to 100:

if (next_value < 0 || next_value > 100)
    cat("Next value is out of range\n")

The && and || operators can both be used in a condition, with && having higher
precedence.

There’s also a “NOT” operator, written !.
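The following sketch illustrates the precedence rule and the NOT operator. The value 150 and the variable names p, q, and out_of_range are made up for the example.

```r
# && is done before ||, so this is TRUE || (FALSE && FALSE)
p <- TRUE || FALSE && FALSE

# Parentheses force the || to be done first
q <- (TRUE || FALSE) && FALSE

# ! turns TRUE into FALSE and FALSE into TRUE
next_value <- 150
out_of_range <- ! (next_value >= 0 && next_value <= 100)

cat (p, q, out_of_range, "\n")
```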


Shortcuts when Evaluating && and ||
When R evaluates something like X && Y, it first finds the value of X, and if it is
FALSE, it doesn’t bother to find the value of Y, since the result must be FALSE
regardless of Y.
This can be useful if evaluating Y would cause an error:

if (i <= length(L) && L[[i]] > 0) ... # do something

Trying to get element i of the list L results in an error message if i is greater
than the length of the list, but this won’t happen with the condition above.

Similarly, the value of X || Y will be TRUE if X is TRUE, regardless of Y, so if X is
TRUE, R doesn’t try to evaluate Y.
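These shortcuts can be seen directly in a small sketch. The list contents and the variable names ok and also_ok are invented for the example.

```r
L <- list (5, -2, 8)
i <- 4                       # one past the end of the list

# The second comparison is never evaluated, so there is no error here.
ok <- i <= length(L) && L[[i]] > 0
print (ok)

# Similarly, stop is never called here, because the left side is TRUE.
also_ok <- TRUE || stop ("this is never reached")
print (also_ok)
```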
Week 5
Making Vectors by Repetition
As you may recall from a previous lab exercise, you can make a vector in R by
repeating a single value or a vector of values. For example:

> rep (5,10)
[1] 5 5 5 5 5 5 5 5 5 5
> rep (c(8,1,2), 5)
[1] 8 1 2 8 1 2 8 1 2 8 1 2 8 1 2
> rep (c("fred","mary"), 3)
[1] "fred" "mary" "fred" "mary" "fred" "mary"

Instead of saying how many times to repeat, you can instead say what the final
length should be:

> rep (c(8,1,2), length=10)
[1] 8 1 2 8 1 2 8 1 2 8

Another option is to say how many times each element should be repeated
in succession:

> rep (c(8,1,2), each=3)
[1] 8 8 8 1 1 1 2 2 2
Making Sequence Vectors
You’ve seen that you can create a vector consisting of a sequence of consecutive
integers like this:

> 1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

If the first operand of : is greater than the second, the sequence it creates will go
backwards:

> 20:1
[1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

The seq function is more flexible. It can create sequences of numbers that differ
by an amount other than one:

> seq (1, 2, by=0.1)
[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
> seq (1.1, by=0.01, length=13)
[1] 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.20 1.21 1.22
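Two further uses of seq may be worth trying: giving both end points together with a length, and giving a negative step. The variable names s and d are made up for this sketch.

```r
s <- seq (0, 1, length=5)    # five equally spaced values from 0 to 1
print (s)

d <- seq (10, 1, by=-3)      # a negative step gives a decreasing sequence
print (d)
```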
Combining Ways of Creating Vectors
We can use the various ways of creating vectors that we’ve seen in combination.
For example:
> c (1:5, 5:1)
[1] 1 2 3 4 5 5 4 3 2 1
> rep (1:5, 3)
[1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
> c (seq(1,2,by=0.2), rep(2,5))
[1] 1.0 1.2 1.4 1.6 1.8 2.0 2.0 2.0 2.0 2.0 2.0

Note: The c function combines single values or vectors to make a bigger vector.
If you already have the vector you want, you don’t have to use c!
For example, the use of c in all the following is unnecessary.
> c(5)
[1] 5
> c(1:5)
[1] 1 2 3 4 5
> rep(c(5),3)
[1] 5 5 5
Matrices
In R, the elements of a vector can be arranged in a two-dimensional array, called
a matrix.
You can create a matrix with the matrix function, giving it a vector of data to fill
the matrix (down columns), which is repeated automatically if necessary:
> matrix (3, nrow=2, ncol=2)
[,1] [,2]
[1,] 3 3
[2,] 3 3
> matrix (1:6, nrow=2, ncol=3)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
You can fill in the data by row instead if you like:
> matrix (1:6, nrow=2, ncol=3, byrow=TRUE)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Treating Matrices Mathematically
R has operators that treat a matrix in the mathematical sense as in linear
algebra. For example, you can do matrix multiplication with the %*% operator:
> A <- matrix(c(2,3,1,5),nrow=2,ncol=2); A
[,1] [,2]
[1,] 2 1
[2,] 3 5
> B <- matrix(c(1,0,2,1),nrow=2,ncol=2); B
[,1] [,2]
[1,] 1 2
[2,] 0 1
> A %*% B # This multiplies A and B as matrices
[,1] [,2]
[1,] 2 5
[2,] 3 11
> A * B # This just multiplies element-by-element
[,1] [,2]
[1,] 2 2
[2,] 0 5
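To see what %*% is doing, one element of the product can be recomputed by hand: element [1,2] is row 1 of A times column 2 of B, summed. The names P and by_hand are invented for this check.

```r
A <- matrix (c(2,3,1,5), nrow=2, ncol=2)
B <- matrix (c(1,0,2,1), nrow=2, ncol=2)
P <- A %*% B

# Element [1,2] of the product: row 1 of A times column 2 of B, summed.
by_hand <- A[1,1]*B[1,2] + A[1,2]*B[2,2]
cat (P[1,2], by_hand, "\n")
```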
Treating Matrices Just as Arrays of Data
You can instead just consider a matrix to be a convenient way of laying out your
data, not as an object in linear algebra.
For this purpose, it’s useful that you can create matrices with data other than
numbers:

> matrix (c(TRUE,FALSE,TRUE), nrow=3, ncol=3)
      [,1]  [,2]  [,3]
[1,]  TRUE  TRUE  TRUE
[2,] FALSE FALSE FALSE
[3,]  TRUE  TRUE  TRUE

> matrix (c("abc","xyz"), nrow=3, ncol=2)
     [,1]  [,2]
[1,] "abc" "xyz"
[2,] "xyz" "abc"
[3,] "abc" "xyz"
Indexing Elements of a Matrix
You can get or change elements in a matrix by using [...] with two subscripts,
the first identifying the row of the element, the second the column.
For example:
> X <- matrix (1:6, nrow=2, ncol=3); X
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> X[1,3]
[1] 5
> A <- matrix (0, nrow=3, ncol=3)
> A[2,1] <- 5
> A[1,3] <- 7
> A[3,3] <- 9
> A
[,1] [,2] [,3]
[1,] 0 0 7
[2,] 5 0 0
[3,] 0 0 9
Extracting Rows and Columns of a Matrix
You can also use [...] to extract an entire row or column of a matrix, by just
omitting one of the two subscripts.
For example:

> X <- matrix (1:6, nrow=2, ncol=3); X
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

> X[1,] # get the first row
[1] 1 3 5
> X[,2] # get the second column
[1] 3 4
> X[1,] + X[2,] # add the first and second rows
[1] 3 7 11
Combining Matrices with cbind and rbind
You can put two matrices with the same number of rows together with cbind:

> X <- matrix (1:6, nrow=2, ncol=3)
> Y <- matrix (3, nrow=2, ncol=4)
> cbind(X,Y)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    3    5    3    3    3    3
[2,]    2    4    6    3    3    3    3

Similarly, rbind can put together two matrices with the same number of columns.
You can also use cbind or rbind to combine a matrix with a vector, which is
treated like a matrix with one row or one column:

> rbind(X,c(10,20,30))
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
[3,] 10 20 30
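The column version works the same way. In this sketch (the values 100 and 200 and the name Z are made up), a vector of length two is added as a new column:

```r
X <- matrix (1:6, nrow=2, ncol=3)
Z <- cbind (X, c(100,200))    # the vector becomes a fourth column
print (Z)
print (dim (Z))
```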
Example: Plotting a Function of Two Arguments
One use of matrices is in plotting functions or data in three dimensions.
Here, we compute values of the function cos(8*sqrt(x^2 + y^2)) for a grid of values for
x from −1 to +1, and a grid of values for y from 0 to 2.5, storing these values in a
matrix called funvals. The grid points are spaced apart by 0.01.

> gridx <- seq(-1,1,by=0.01)
> gridy <- seq(0,2.5,by=0.01)
>
> funvals <- matrix (0, nrow=length(gridx), ncol=length(gridy))
> for (i in 1:length(gridx))
+     for (j in 1:length(gridy))
+         funvals[i,j] <- cos (8*sqrt(gridx[i]^2 + gridy[j]^2))
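The same matrix can also be built without explicit loops using R's outer function, which these notes have not covered. The sketch below repeats the loop version and compares the two results; the name funvals2 is invented for the comparison.

```r
gridx <- seq(-1,1,by=0.01)
gridy <- seq(0,2.5,by=0.01)

# Loop version, as above.
funvals <- matrix (0, nrow=length(gridx), ncol=length(gridy))
for (i in 1:length(gridx))
    for (j in 1:length(gridy))
        funvals[i,j] <- cos (8*sqrt(gridx[i]^2 + gridy[j]^2))

# outer applies the function to every pair of a gridx value and a gridy value.
funvals2 <- outer (gridx, gridy, function (x,y) cos (8*sqrt(x^2 + y^2)))

print (dim (funvals2))
print (all.equal (funvals, funvals2))
```

The function given to outer has to work on whole vectors of x and y values at once, which is the case here because the arithmetic operators, sqrt, and cos all operate on vectors.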
One Column of the Computed Matrix
Here’s a single column of the matrix of function values that we computed:
> round (funvals[,1], 2)
[1] -0.15 -0.07 0.01 0.09 0.17 0.25 0.33 0.40 0.47 0.54 0.61 0.67
[13] 0.73 0.78 0.83 0.87 0.91 0.94 0.96 0.98 0.99 1.00 1.00 0.99
[25] 0.98 0.96 0.93 0.90 0.87 0.82 0.78 0.72 0.67 0.60 0.54 0.47
[37] 0.40 0.32 0.25 0.17 0.09 0.01 -0.07 -0.15 -0.23 -0.31 -0.38 -0.46
[49] -0.52 -0.59 -0.65 -0.71 -0.77 -0.81 -0.86 -0.90 -0.93 -0.96 -0.98 -0.99
[61] -1.00 -1.00 -0.99 -0.98 -0.97 -0.94 -0.91 -0.88 -0.84 -0.79 -0.74 -0.68
[73] -0.62 -0.56 -0.49 -0.42 -0.34 -0.27 -0.19 -0.11 -0.03 0.05 0.13 0.21
[85] 0.29 0.36 0.44 0.51 0.57 0.64 0.70 0.75 0.80 0.85 0.89 0.92
[97] 0.95 0.97 0.99 1.00 1.00 1.00 0.99 0.97 0.95 0.92 0.89 0.85
[109] 0.80 0.75 0.70 0.64 0.57 0.51 0.44 0.36 0.29 0.21 0.13 0.05
[121] -0.03 -0.11 -0.19 -0.27 -0.34 -0.42 -0.49 -0.56 -0.62 -0.68 -0.74 -0.79
[133] -0.84 -0.88 -0.91 -0.94 -0.97 -0.98 -0.99 -1.00 -1.00 -0.99 -0.98 -0.96
[145] -0.93 -0.90 -0.86 -0.81 -0.77 -0.71 -0.65 -0.59 -0.52 -0.46 -0.38 -0.31
[157] -0.23 -0.15 -0.07 0.01 0.09 0.17 0.25 0.32 0.40 0.47 0.54 0.60
[169] 0.67 0.72 0.78 0.82 0.87 0.90 0.93 0.96 0.98 0.99 1.00 1.00
[181] 0.99 0.98 0.96 0.94 0.91 0.87 0.83 0.78 0.73 0.67 0.61 0.54
[193] 0.47 0.40 0.33 0.25 0.17 0.09 0.01 -0.07 -0.15
A Perspective Plot of the Function
We can produce a three dimensional plot from the function values we computed
using R’s persp function (with options phi and theta to set the viewing angle):

> persp(gridx,gridy,funvals,phi=40,theta=20,shade=0.75,border=NA)

[Figure: perspective plot of funvals over gridx and gridy]
A Contour Plot of the Function
Another way to display a function or data is with a contour plot, which we can
produce as follows:

> contour (gridx, gridy, funvals, levels=seq(-0.9,0.9,by=0.3))

[Figure: contour plot of funvals, with gridx (-1.0 to 1.0) on the horizontal axis,
gridy (0.0 to 2.5) on the vertical axis, and labelled contour levels from -0.9 to 0.9]


Specifying Function Arguments by Name
Suppose you define a function with several arguments, such as
hohoho <- function (times, what) {
r <- what
while (times > 1) { r <- paste(what,r); times <- times-1 }
r
}
You can call the function by just giving values for the arguments, in the same
order as in the function definition. For example:
> hohoho (3, "ho")
[1] "ho ho ho"
But you can instead specify arguments using their names, in any order:
> hohoho (times=3, what="ho")
[1] "ho ho ho"
> hohoho (what="ho", times=3)
[1] "ho ho ho"
This is very useful if there are many arguments, whose order is hard to remember.
Default Values for Function Arguments
When you define a function, you can specify a default value for an argument,
which is used if a value for the argument isn’t specified when the function is
called. For example, here is the hohoho function with defaults for both arguments:
hohoho <- function (times=3, what="ho") {
r <- what
while (times > 1) { r <- paste(what,r); times <- times-1 }
r
}
Here are some calls of this function:
> hohoho(4) # ’what’ will default to "ho"
[1] "ho ho ho ho"
> hohoho(what="hee") # ’times’ will default to 3
[1] "hee hee hee"
> hohoho() # uses defaults for both arguments
[1] "ho ho ho"
This is very useful for functions with many arguments that are often set to the
same (default) value, as is the case for many of R’s pre-defined functions.
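Named and unnamed arguments can also be mixed in one call. In the sketch below, arguments given by name are matched first, and the unnamed value is then used for the remaining argument. The variable names a and b are made up for the example.

```r
# Definition repeated from above so this example runs on its own.
hohoho <- function (times=3, what="ho") {
    r <- what
    while (times > 1) { r <- paste(what,r); times <- times-1 }
    r
}

a <- hohoho (2, what="hee")   # times given by position, what by name
b <- hohoho (what="hee", 2)   # the unnamed 2 is used for times
print (a)
print (b)
```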
Giving Names to List Elements
You can give names to elements of a list, and then refer to these elements by
name with the $ operator. For example:
> L <- list (a=c(3,1,7), bc=c("red","green"), q=1:4)
> L$a
[1] 3 1 7
> L$bc
[1] "red" "green"
> L$q <- TRUE
> L
$a
[1] 3 1 7

$bc
[1] "red" "green"

$q
[1] TRUE
If an element has a name, R uses it for printing, rather than the numerical index.
Using a List to Return Multiple Values from a Function
This function takes as input a vector of character strings, and returns a list of two
vectors, with the first and the last characters of the input strings:
first_and_last_chars <- function (strings) {
first <- character(length(strings)) # Create two string vectors for
last <- character(length(strings)) # the results, initially all ""
for (i in 1:length(strings)) {
nc <- nchar(strings[i])
first[i] <- substring(strings[i],1,1) # Find first & last chars
last[i] <- substring(strings[i],nc,nc) # of the i’th string
}
list (first=first, last=last) # Return list of both result vectors
}
Here’s an example of its use:
> fl <- first_and_last_chars (c("abc","wxyz"))
> fl$first
[1] "a" "w"
> fl$last
[1] "c" "z"
Names for Vector Elements and Matrix Rows and Columns
You can also give names to elements of vectors, and use the names as indexes:
> x <- c (dog=5, cat=3)
> x
dog cat
5 3
> x["cat"]
cat
3
You can also give names to the rows and columns of matrices:
> M <- matrix(1:4,ncol=2,nrow=2,dimnames=list(c("cat","dog"),
+ c("big","small")))
> M
big small
cat 1 3
dog 2 4
> M["dog","big"]
[1] 2
Scanning the Elements of a Matrix
Here’s an example function that finds the largest negative element in a numeric
matrix (ie, the negative element with smallest absolute value), returning this
element’s value, or minus infinity if there are no negative elements.
Note that you can find the number of rows and number of columns in a matrix
with nrow and ncol.

largest_neg <- function (M) {
    result <- -Inf
    for (i in 1:nrow(M))
        for (j in 1:ncol(M))
            if (M[i,j] < 0 && M[i,j] > result)
                result <- M[i,j]
    result
}

Here’s an example call:

> largest_neg (matrix (c(-6,3,-2,1), nrow=2, ncol=2))
[1] -2
Week 6
Random Numbers and Their Uses
Random variation is a big part of what statistics is about. So it’s natural that R
has facilities to create its own random variation — to generate random numbers.
Random numbers have many uses (and not just in statistics):

• Simulate random processes, such as how a disease epidemic might spread
between people.

• See how the results of some statistical method vary when the data it is
applied to vary randomly.

• Compute things using “Monte Carlo” methods.

• Make interactions with a user have a random aspect — we don’t want a video
game to behave the same way every time we play!
Generating Random Numbers with Uniform Distribution
One simple kind of random number is one that takes on a real value that is
uniformly distributed within some bounds.
You can get such numbers in R using the runif function. It takes as arguments
the number of random numbers to generate, the low bound, and the high bound.
We’ll try generating one at a time here:

> runif(1,0,10) # one random number in (0,10)
[1] 3.195956
> runif(1,0,10) # another one, not the same
[1] 5.551191
> runif(1,0,10) # ... and another
[1] 1.165307
> runif(1,100,200) # one from a different range
[1] 182.0236

The random numbers generated are supposed to be independent — eg, which one
we get the second time is unrelated to what the first one was.
R’s Random Numbers Aren’t Really Random
Computers are carefully designed to not behave randomly.
Some computers have special devices for producing random numbers that are
really random. This is useful for cryptography (you want a really random key for
your code, so nobody else can guess it).
But for most purposes we don’t actually want real random numbers. They’re too
hard to generate, and if we use them, we can’t reproduce our results another day.

For example: Imagine that after running your program for a long time, it stops
with an error message, indicating it has a bug. You think you’ve now fixed the
bug. But how do you verify that you’ve really fixed it if you can’t reproduce the
run that led to the error?

So most computer “random” numbers are really “pseudo-random”: numbers
that look random for most purposes, but are actually generated by an algorithm
that isn’t random at all, so if it is run again, it will generate exactly the same
numbers.
An Example of a Pseudo-Random Generator
Here’s one simple way to generate a series of pseudo-random numbers, uniformly
distributed over the integers 1, 2, . . . , 30.
> nxt <- 1; series <- c()
> for (i in 1:200) { nxt <- (nxt * 17) %% 31; series <- c(series,nxt) }
Here’s a plot of the resulting series:
[Figure: plot of series (vertical axis, 0 to 30) against Index (horizontal axis, 0 to 200)]
It looks random, except that it repeats with period 30. Similar generators can
have much longer periods, however.
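The period can be checked directly by generating 60 values and comparing the two halves. This check is an illustration added here; the names repeats and covers_all are invented.

```r
nxt <- 1; series <- c()
for (i in 1:60) { nxt <- (nxt * 17) %% 31; series <- c(series,nxt) }

repeats <- all (series[1:30] == series[31:60])    # second 30 values match the first 30
covers_all <- all (sort (series[1:30]) == 1:30)   # each of 1 to 30 appears exactly once
cat (repeats, covers_all, "\n")
```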
Setting the Random Seed
R uses a more sophisticated pseudo-random generator, but it also is deterministic,
and will reproduce the same sequence if restarted with the same “seed”.
For example:
> set.seed(123)
> runif(1)
[1] 0.2875775
> runif(1)
[1] 0.7883051
> runif(1)
[1] 0.4089769
> set.seed(123)
> runif(1)
[1] 0.2875775
> runif(1)
[1] 0.7883051
> runif(1)
[1] 0.4089769
For serious work, you should set the seed, so you’ll be able to reproduce your results.
The sample function
The call sample(n) will generate a random permutation of the integers from
1 to n, as illustrated below:

> set.seed(1)
> sample(10)
[1] 3 4 5 7 2 8 9 6 10 1
> sample(10)
[1] 3 2 6 10 5 7 8 4 1 9
> sample(10)
[1] 10 2 6 1 9 8 7 5 3 4

With other kinds of arguments, sample can do other things as well, including
sampling with replacement.
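Here is a sketch of two of those other uses: sampling with replacement, and sampling from a vector other than 1 to n. The dice and colour examples are made up, and the exact values printed may depend on the version of R, so no output is shown.

```r
set.seed(1)
rolls <- sample (6, 10, replace=TRUE)   # ten rolls of a six-sided die
print (rolls)

picks <- sample (c("red","green","blue"), 2)   # two different colours, no replacement
print (picks)
```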
Generating Random Vectors
The runif function can generate a whole vector of random numbers at once. The
first argument of runif is the number of random numbers to generate.
For instance, here we plot 500 random numbers uniformly distributed from 0 to 1,
using the command

> plot(runif(500))
[Figure: plot of runif(500) (vertical axis, 0.0 to 1.0) against Index (horizontal axis, 0 to 500)]
The Problem with Plotting Rounded Data Points
Recall the “iris” data set of width and length of petals and sepals in three species
of Iris. It is stored in a special kind of list called a “data frame”, which also looks
somewhat like a matrix. We’ll talk more about data frames later.
Here’s a scatterplot of two of the variables (species marked by colour):
plot (iris$Sepal.Width, iris$Petal.Width, col=iris$Species,
xlab="Sepal Width", ylab="Petal Width")
[Figure: scatterplot of Petal Width (vertical axis) against Sepal Width (horizontal axis, 2.0 to 4.0)]
Solving the Problem with Random Jitter
Because the data is rounded to one decimal place, many of the dots in the
scatterplot are on top of each other. To see all the data points, we can add
random “jitter” to each data point before plotting:
plot (iris$Sepal.Width + runif(nrow(iris),-0.05,+0.05),
iris$Petal.Width + runif(nrow(iris),-0.05,+0.05),
col=iris$Species, xlab="Sepal Width", ylab="Petal Width")
[Figure: jittered scatterplot of Petal Width (vertical axis) against Sepal Width (horizontal axis, 2.0 to 4.0)]
Making Random Choices
Often, we want to make a random choice, with certain probabilities for doing
certain things.

If we have a binary choice (to do or not do something), we can compare a random
number that’s uniform over (0, 1) to the desired probability.
For example, at some point in a computer game, we might want to kill the player
and end the game with probability 0.15. We can do it as follows:

if (runif(1) < 0.15) stop("You’re dead. Game over!")

Why does this work?

Suppose instead we have a three-way choice – do A with probability 0.15, do B
with probability 0.4, or do C with probability 0.45. (Note that these three
probabilities add to one.)
Could we generate one random number uniform over (0, 1) and use it to make this
choice?
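One possible answer, offered as a sketch, is to split the interval (0, 1) into three pieces whose lengths equal the three probabilities, and see which piece the random number falls in. The function name choose_abc, the seed, and the 10000 repetitions are all choices made for this illustration.

```r
set.seed(42)    # arbitrary seed, so the run can be reproduced

# Hypothetical helper: returns "A", "B", or "C" with probabilities 0.15, 0.4, 0.45.
choose_abc <- function () {
    u <- runif(1)
    if (u < 0.15) "A"              # the piece from 0 to 0.15 has length 0.15
    else if (u < 0.55) "B"         # the piece from 0.15 to 0.55 has length 0.4
    else "C"                       # the piece from 0.55 to 1 has length 0.45
}

choices <- character(10000)
for (i in 1:10000) choices[i] <- choose_abc()
print (table(choices) / 10000)     # proportions should be near 0.15, 0.4, 0.45
```

The same reasoning explains the binary case: a number uniform over (0, 1) is below 0.15 with probability 0.15.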
Simulating a Random Walk
One well known “stochastic process” is a random walk on the integers, in which
we start at 0, and at each time step thereafter we randomly go to the position one
above or one below our current position, with probability 0.5 for either direction.
Here’s an R function to simulate a random walk:

random_walk <- function (steps) {
    position <- numeric(steps+1)
    for (i in 1:steps) {
        if (runif(1) < 0.5)
            position[i+1] <- position[i] + 1
        else
            position[i+1] <- position[i] - 1
    }
    position
}
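A call of this function might look like the following sketch. The seed and the number of steps are arbitrary choices for the example, and the definition is repeated so the example runs on its own.

```r
random_walk <- function (steps) {
    position <- numeric(steps+1)
    for (i in 1:steps) {
        if (runif(1) < 0.5)
            position[i+1] <- position[i] + 1
        else
            position[i+1] <- position[i] - 1
    }
    position
}

set.seed(7)                   # arbitrary seed, for reproducibility
walk <- random_walk (1000)
print (length (walk))         # the start position plus one position per step
print (range (walk))          # lowest and highest positions reached
```

Passing the result to plot with type="l" draws the walk as a line.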
Three Random Walks

[Figure: three panels, each showing one random walk, with position (-40 to 40) on the
vertical axis and 0 to 1000 on the horizontal axis]


Environments
An R environment is a collection of variables and their current values.

The global environment contains variables that are created when you assign to a
name in a command typed at the R console (or run from an R script, which is
treated as if it had been typed).
For example, typing the command below creates (if it didn’t exist already) a
variable in the global environment named fred:
> fred <- 1+2

Calling a function creates a local environment used for just that call. Assignments
inside the function create or change variables in that environment — below, the
assignment to fred inside f changes fred in the local, not global, environment:
> f <- function (x) { fred <- 2*x; fred+1 }
> fred
[1] 3
> f(100)
[1] 201
> fred
[1] 3
Listing and Removing Variables
You can see what variables exist in the environment that is currently being used
with the ls function, which returns a vector of strings with the names of variables.

You can remove a variable from the current environment with rm.

Here’s an example (which assumes you haven’t already defined other variables in
the global environment):

> a <- 1
> b <- 2
> ls()
[1] "a" "b"
> rm(a)
> ls()
[1] "b"
> a
Error: object ’a’ not found

Note: After x <- "b", calling rm(x) removes variable x, not variable b.
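If you do want to remove the variable whose name is stored in a string, rm
accepts a list argument giving names as strings. A small sketch, wrapped in a
made-up function demo so that it doesn't disturb your global environment:

```r
demo <- function () {
    b <- 2
    x <- "b"
    rm (list=x)    # removes the variable named by the string in x, that is, b
    ls()           # names of the variables that remain in this environment
}

remaining <- demo()
remaining
```

Here only x is left afterwards, since b was the variable removed.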
Function Arguments in the Local Environment
When a function is called, all its arguments become variables in its local
environment. Their values are those specified in the call of the function,
or their default values if they were not specified.
We can see this by printing the result of ls inside a function:
> f <- function (x,y=100,z=1000) { print(ls()); x + y + z }
> f(7,z=10)
[1] "x" "y" "z"
[1] 117
If we create new variables by assignment, they also are in the local environment:
> g <- function (x,y=100,z=1000) { a <- x + y + z; print(ls()); a }
> g(7)
[1] "a" "x" "y" "z"
[1] 1107
The global environment isn’t changed when local variables are created for
arguments or by assignment. So after doing the above, in a new R session, we see
> ls()
[1] "f" "g"
Local and Global Variable References
When you reference a variable inside a function, it refers to the local variable of
that name, if it exists, and if not, to the global variable of that name, if it exists.
Here’s an example:
> f <- function (xyz,def) {
+ print (abc) # refers to the global variable ’abc’
+ print (xyz) # refers to the local variable (argument) ’xyz’
+ print (def) # refers to the local variable (argument) ’def’
+ xyz + def + abc
+ }
>
> abc <- 1
> def <- 2
>
> f(200,3000)
[1] 1
[1] 200
[1] 3000
[1] 3201
Changing Local and Global Variables Inside a Function
Assigning a value to a name with <- (or with =) from inside a function creates or
changes the local variable with that name. Assigning a value to a name with <<-
creates or changes the global variable with that name. Here’s an example:
> g <- function () {
+ x <- a # creates a local variable ’x’, with value from global ’a’
+ a <- 10 # creates a local variable ’a’; global ’a’ is not affected
+ b <<- 300 # changes the global variable ’b’; doesn’t create a local ’b’
+ a + b + x # here, ’a’ refers to the new local ’a’, not the global ’a’
+ }
> g()
Error in g() : object ’a’ not found
> a <- 100
> b <- 200
> g()
[1] 410
> a
[1] 100
> b
[1] 300
> x
Error: object ’x’ not found
Assigning to Arguments Doesn’t Change Them
Since assignments with <- inside a function change only the local environment,
assigning to a function argument doesn’t change what the caller passed.
For example:

> h <- function (x) { x[1] <- 0; sum(x) } # sum all but first element
> a <- c(3,4,1,7)
> h(a)
[1] 12
> a # the global variable ’a’ was not changed
[1] 3 4 1 7
> x <- c(10,20,30)
> h(x)
[1] 50
> x # global ’x’ unchanged - not the same as the local ’x’!
[1] 10 20 30

Exception: R has some “special” functions that do alter their arguments — for
example, as we’ve seen, rm(x) actually removes x!
When and How to Use Local and Global Variables
When writing a function, you should try to
• Separate what the function does from how it does it, so someone using the
function only needs to understand the “what”.
• Make what the function does be easy to describe and understand.
• Make what the function does be general, so it will be useful in many contexts.
Functions should usually get input from their arguments, not global variables —
they’re then more generally useful, as it’s easy to use different arguments in calls.
Functions should usually not assign to global variables. Putting intermediate
results in global variables makes “how” the function works be visible. Returning
information in global variables makes it hard to use the function in a general way.
There are exceptions:
• If many functions all refer to the same data, having them all refer to a global
data variable may be easier than passing a data argument to all of them.
• Assigning to a global variable can be a convenient way to keep track of overall
counts of how often something happened (e.g., number of errors of some sort).
• Assigning some intermediate result to a global variable may help when
debugging a program (but take it out once the program is working).
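As a sketch of the counting exception, here is a made-up function safe_sqrt
that uses <<- to count how often it was given a negative number. The names
safe_sqrt and error_count are assumptions for illustration only:

```r
error_count <- 0                          # global counter of bad inputs

safe_sqrt <- function (x) {
    if (x < 0) {
        error_count <<- error_count + 1   # updates the global, not a local copy
        return (NA)
    }
    sqrt(x)
}

results <- c (safe_sqrt(4), safe_sqrt(-1), safe_sqrt(9), safe_sqrt(-16))
results
error_count                               # two of the four inputs were negative
```

The function still returns its main result in the usual way; only the
bookkeeping goes through the global variable.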

Week 7
Features of R for Statistics
R is a general purpose programming language. You can write all sorts of
programs in R — video games, accounting packages, word processors, programs
for navigating rocket ships to Mars, . . .

But R is more appropriate for some of these tasks than others. It’s probably not
the best choice for video game programming — games need to respond quickly,
but speed is not R’s strong point. On the other hand, some features of R that are
not common in other languages are especially useful for statistical applications.
Here are some:
• Specifying function arguments by name, with arguments often having default
values — very useful for functions implementing statistical methods.
• Names for elements of vectors and lists, and for rows and columns of matrices
and data frames — “age” is a better label for a column than the number 17.
• R’s “data frames” for storing observations in a way that is convenient for
statistical analysis.
• Special NA values to indicate where data is missing
We’ve talked about the first two, and will now talk about the last two.
Adding Attributes to R Objects
An R object can have one or more “attributes”, that record extra information.
They are mostly ignored if you don’t look at them, but are there if you look.
An example:

> x <- 123 # Set x to a plain number
> x
[1] 123
> attr(x,"fred") <- "abc" # Add a "fred" attribute to x
> x
[1] 123
attr(,"fred")
[1] "abc"
> attr(x,"fred") # We can get just the attribute if we like
[1] "abc"
> x + 1000 # The attribute (usually) gets passed on
[1] 1123
attr(,"fred")
[1] "abc"
Attributes for Dimensions and Names
You can attach attributes to objects for your own purposes, but R also has some
standard uses for attributes.

R uses a dim attribute to mark an object as a matrix, and hold how many rows
and columns it has. This attribute is not usually shown explicitly, but we can see
it if we look using attr:

> M <- matrix(0,nrow=3,ncol=5)
> attr(M,"dim")
[1] 3 5

R uses a names attribute to hold the names of elements in a list or a vector:

> L <- list (abc=9, def=10, xyz="ha")
> attr(L,"names")
[1] "abc" "def" "xyz"

Names for rows and columns in a matrix are stored in a dimnames attribute.
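A brief sketch of looking at the dimnames attribute directly (the matrix and
its names here are made up for illustration):

```r
M <- matrix (1:6, nrow=2, ncol=3)
rownames(M) <- c("r1","r2")
colnames(M) <- c("a","b","c")

dn <- attr(M,"dimnames")   # a list: row names first, then column names
dn[[1]]                    # the row names
dn[[2]]                    # the column names
```

So rownames and colnames are convenient ways of getting and setting the two
parts of this one attribute.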
The Class Attribute
The special class attribute tells R that some operations on the object should be
done in a special way. We’ll cover more about how this works later — and about
how it can be used to program in a style known as “object-oriented programming”.
For the moment, here’s a brief illustration of what can be done:
> g <- 123
> attr(g,"class") <- "gobbler"
> print.gobbler <- function (what) {
+ cat ("I’m a gobbler with value", unclass(what), "\n")
+ }
> g
I’m a gobbler with value 123
> g+1000
I’m a gobbler with value 1123
We’ve used the class attribute to tell R that objects in our “gobbler” class
should be printed in a different way than ordinary numbers. Note that unclass
gets rid of the class attribute, which lets us handle the number inside a gobbler
object in the usual way (though using unclass is not strictly necessary here).
Data Frames
One major use of classes is for R’s data.frame objects, which are the most
common way that data is represented in R.
A data frame is sort of like a list and sort of like a matrix. Each “row” of a data
frame holds information on some individual, object, case, or whatever. The
“columns” of a data frame correspond to variables whose values have been
measured for each case. These variables can be numbers, logical (TRUE/FALSE)
values, or character strings (but all values for one variable have the same type).
For example, here’s how R prints a small data frame containing the heights and
weights of three people:
> heights_and_weights
name height weight
1 Fred 62 144
2 Mary 60 131
3 Joe 71 182
A data frame is really a list, with named elements that are the columns of the
data frame, but with a data.frame class attribute that makes R do things like
printing and subscripting differently from an ordinary list.
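We can check the "really a list" claim directly. The sketch below builds the
same small table with the data.frame function, under the made-up name hw:

```r
hw <- data.frame (name=c("Fred","Mary","Joe"),
                  height=c(62,60,71),
                  weight=c(144,131,182))

is.list(hw)            # TRUE: underneath, a data frame is a list
class(hw)              # "data.frame"
names(hw)              # the column names, as for a named list
length(hw)             # the number of columns, as for a list

plain <- unclass(hw)   # without the class attribute, it's treated as a list
is.data.frame(plain)   # FALSE
```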
Getting Data Out of a Data Frame
You can get data from a data frame using subscripting operations similar to those
for a matrix (by row and column index), or by operations similar to a list (using
names of variables). For example:

> heights_and_weights # The data frame from the last slide
name height weight
1 Fred 62 144
2 Mary 60 131
3 Joe 71 182
> heights_and_weights$height # All values of the "height" variable
[1] 62 60 71
> heights_and_weights[2,] # All values for the 2nd person
name height weight
2 Mary 60 131
> heights_and_weights[2,3] # Value of 3rd variable for 2nd person
[1] 131
> heights_and_weights$weight[2] # ... and the same, by variable name
[1] 131
Creating a Data Frame
Using as.data.frame, you can create a data frame from a list (it just adds the
data.frame class attribute) or from a matrix (it has to split it up into columns).
If you don’t provide variable names, R uses V1, V2, etc.
Examples:
> as.data.frame (list (abc=c(1,3,2),
+ pqr=c(TRUE,FALSE,FALSE),
+ xyz=c("a","bb","c")))
abc pqr xyz
1 1 TRUE a
2 3 FALSE bb
3 2 FALSE c
>
> as.data.frame (matrix (1:12, nrow=3, ncol=4))
V1 V2 V3 V4
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
If a matrix has row and column names, they become those of the data frame.
Reading Data Into a Data Frame
The read.table function creates a data frame using data it reads from a text file.
The file has to contain one line for each row of the data frame, containing a value
(eg, a number, TRUE/FALSE, a string) for each variable for the case corresponding
to that row.
If a header=TRUE argument is given to read.table, the names of the variables
will be taken from the first line of the file.
Here’s how we could read the heights and weights data frame from a file on the
course web page:

heights_and_weights <-
    read.table ("http://www.cs.utoronto.ca/~radford/csc121/data7",
                header=TRUE)

The contents of the file read are as below:

name height weight
Fred 62 144
Mary 60 131
Joe 71 182
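To try read.table without a file or network access, the text argument lets it
read from a string instead. This is only a sketch; the variable names
file_contents and hw are made up:

```r
file_contents <- "name height weight
Fred 62 144
Mary 60 131
Joe 71 182"

hw <- read.table (text=file_contents, header=TRUE)
names(hw)        # variable names, taken from the first line
hw$height        # the heights, read as numbers
nrow(hw)         # one row for each remaining line
```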
Indicating Missing Values with NA
It is very common for data collected to have some missing values — where the
subject declined to answer one of the survey questions, or the interviewer forgot
to fill out one page of the form, or where the machine taking the readings was
broken that day.
Sometimes these values are indicated by some special number like −999. But this
is very unreliable. The person analysing the data may not realize that this is
what −999 is supposed to mean, leading to drastically incorrect averages. Or
there may be an actual, non-missing, value of −999!

R supports representation of missing data by a special NA value. NA can be the
value of an element in a vector, matrix, or data frame. For example:

> c(5,1,NA,8,NA)
[1] 5 1 NA 8 NA
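A small sketch, with made-up numbers, of how badly a special code like -999 can
distort an average, compared with recoding to NA:

```r
coded <- c(5, 1, -999, 8, -999)    # missing values coded as -999
mean(coded)                        # far from sensible: the -999s are included

proper <- coded
proper[proper == -999] <- NA       # recode the special value as NA
proper
mean (proper, na.rm=TRUE)          # the average of 5, 1, and 8 only
```

The recoding step uses a logical index. It is only safe when -999 cannot occur
as a real measurement, which is exactly the weakness described above.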
Arithmetic on NA values
Arithmetic operations where one or both operands are NA produce NA as the
result:

> a <- c(5,1,NA,8,NA)
> a+100
[1] 105 101 NA 108 NA
> b <- c(10,NA,20,NA,NA)
> a*b
[1] 50 NA NA NA NA

Comparisons with NA also produce NA, rather than TRUE or FALSE. Trying to
use NA as an if or while condition gives an error:

> a == 1
[1] FALSE TRUE NA FALSE NA
> if (a[3]==1) cat("true\n") else cat("false\n")
Error in if (a[3] == 1) cat("true\n") else cat("false\n") :
missing value where TRUE/FALSE needed
Checking For NA
Sometimes you need to check whether a value is NA. But you can’t do this with
something like if (a == NA) ... — that will always give an error!
Instead, you can use the is.na function. It can be applied to a single value,
giving TRUE or FALSE, or a vector of values, giving a logical vector.

For example, R’s built-in airquality demonstration dataset has some NA values.
The following statements create a modified version of the airquality data frame
in which missing values for solar radiation are replaced by the average of all the
non-missing measurements (found with mean using the na.rm option):

ave_solar <- mean (airquality$Solar.R, na.rm=TRUE)
mod_airquality <- airquality
for (i in 1:nrow(mod_airquality))
    if (is.na(mod_airquality$Solar.R[i]))
        mod_airquality$Solar.R[i] <- ave_solar

(We’ll see later how one can do this more easily using logical indexes.)
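A smaller sketch of the difference between comparing with NA and using is.na,
on the vector used in the earlier arithmetic examples:

```r
a <- c(5,1,NA,8,NA)

a == NA          # NA for every element, so it tells us nothing
is.na(a)         # FALSE FALSE TRUE FALSE TRUE
sum(is.na(a))    # the number of missing values, here 2

# is.na gives a proper TRUE or FALSE, so it is safe as an if condition
if (is.na(a[3])) cat("missing\n") else cat("present\n")
```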
NA and NaN
A value will also be “missing” if it is the result of an undefined mathematical
operation. R prints such values as NaN, not NA, but is.na will be TRUE for
them. Operations on NaN produce NaN as a result. Here are some examples:
> 0/0
[1] NaN
> sqrt(-1)
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
> x <- 0/0
> 10*x
[1] NaN
> v <- asin((-2):2)
Warning message:
In asin((-2):2) : NaNs produced
> v
[1] NaN -1.570796 0.000000 1.570796 NaN
> v / 0
[1] NaN -Inf NaN Inf NaN
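When it matters which kind of missing value you have, the is.nan function
picks out only the NaN values. A brief sketch:

```r
x <- c(1, NA, 0/0)

is.na(x)     # FALSE TRUE TRUE   (NaN counts as missing)
is.nan(x)    # FALSE FALSE TRUE  (only the NaN, not the NA)
```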

Week 8
Using Numeric Vectors as Subscripts
A subscript used with [ ] can be a vector of indexes, rather than just one index,
yielding a subset of elements having those indexes, not just one element:
> v <- c(9,10,3)
> names(v) <- c("abc","def","xyz")
> v
abc def xyz
9 10 3
> v[c(1,3)] # Notice that names of elements are carried along
abc xyz
9 3
You can also index with a vector of negative numbers. This gets you all elements
except those whose indexes are in the index vector (negated):
> v[-2]
abc xyz
9 3
> v[c(-1,-length(v))]
def
10
Difference Between [ ] and [[ ]]
We can now see better what the difference is between subscripting with [ ] and
with [[ ]] — [ ] extracts a subset of elements (which might be just one),
whereas [[ ]] extracts a single element.
> v
abc def xyz
9 10 3
> v[2]
def
10
> v[[2]] # Notice there’s no name here, just the element
[1] 10
> L <- list (a="xy", b=9, c=TRUE)
> L[2] # Notice that the result is still a list
$b
[1] 9

> L[[2]] # ... but this is an element of the list
[1] 9
Using Logical Vectors as Subscripts
A subscript can also be a logical vector, which selects elements in positions where
this subscript is TRUE:
> v
abc def xyz
9 10 3
> v[c(TRUE,FALSE,TRUE)]
abc xyz
9 3
> v[v>5]
abc def
9 10
R’s “and” (&) and “or” (|) operators can be useful for this:
> v[v>5 & v<10]
abc
9
> v[v>9 | v<7]
def xyz
10 3
Vectors as Matrix Subscripts
Vector subscripts can also be used to select rows or columns of a matrix:

> M <- matrix (1:12, nrow=3, ncol=4)
> rownames(M) <- c("ab","cd","ef")
> colnames(M) <- c("w","x","y","z")
> M
w x y z
ab 1 4 7 10
cd 2 5 8 11
ef 3 6 9 12
> M[c(3,1),c(2,4,4)] # Indexes needn’t be in order, can be duplicates
x z z
ef 6 12 12
ab 4 10 10
> M[c(TRUE,FALSE,TRUE),]
w x y z
ab 1 4 7 10
ef 3 6 9 12
Using Vector Indexes to Replace Elements in a Vector
Numeric and logical vectors can be used as indexes when we replace elements in a
vector rather than get them out.
For example:

> v <- c(66,33,99,10,12)
> v[c(2,4,1)] <- c(100,200,300)
> v
[1] 300 100 99 200 12
> v[c(TRUE,FALSE,FALSE,FALSE,TRUE)] <- c(800,900)
> v
[1] 800 100 99 200 900

Here’s how we can use this to make a modified version of the airquality data
frame (see last week’s slides) with missing values for Solar.R filled in:

mod_airquality <- airquality
mod_airquality$Solar.R [is.na(airquality$Solar.R)] <-
    mean (airquality$Solar.R, na.rm=TRUE)
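It's worth checking that a replacement like this did what was intended. The
sketch below repeats the replacement, keeping the logical index in a variable
(the name missing is made up), and then runs some simple checks:

```r
mod_airquality <- airquality
missing <- is.na(airquality$Solar.R)      # TRUE where Solar.R is missing
mod_airquality$Solar.R[missing] <- mean (airquality$Solar.R, na.rm=TRUE)

sum(missing)                              # how many values were filled in
anyNA(mod_airquality$Solar.R)             # FALSE: nothing is missing now

# The values that were not missing should be unchanged
all (mod_airquality$Solar.R[!missing] == airquality$Solar.R[!missing])
```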
Re-Ordering a Vector, Matrix, or Data Frame
We can change the order of elements in a vector, or of rows in a matrix or data
frame, using an index that is a permutation of the possible indexes.
One use is to change the order to be increasing in some variable. The order
function produces the permutation needed to do this. For example:
> heights_and_weights
name height weight
1 Fred 62 144
2 Mary 60 131
3 Joe 71 182
> by_weight <- order (heights_and_weights$weight)
> by_weight
[1] 2 1 3
> new <- heights_and_weights [by_weight, ]
> new
name height weight
2 Mary 60 131
1 Fred 62 144
3 Joe 71 182
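The order function also takes a decreasing argument. As a sketch, here is the
same table (rebuilt under the made-up name hw) sorted from tallest to shortest:

```r
hw <- data.frame (name=c("Fred","Mary","Joe"),
                  height=c(62,60,71),
                  weight=c(144,131,182))

by_height <- order (hw$height, decreasing=TRUE)
by_height                      # 3 1 2: Joe is tallest, then Fred, then Mary
tallest_first <- hw [by_height, ]
tallest_first
```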
Selecting a Subset of Rows in a Data Frame
Another use of logical indexes is in selecting a subset of rows in a data frame for
which the variables have certain values.
For example, here we select only people with weight greater than 140:
> heights_and_weights
name height weight
1 Fred 62 144
2 Mary 60 131
3 Joe 71 182
> heights_and_weights [heights_and_weights$weight > 140, ]
name height weight
1 Fred 62 144
3 Joe 71 182
And here we get only people with weight greater than 140 and height less than 70:
> heights_and_weights [heights_and_weights$weight > 140
+ & heights_and_weights$height < 70, ]
name height weight
1 Fred 62 144
Some Design Flaws in R
R is a very useful language, but like all programming languages, it’s not perfect.
Indeed, some of R’s features are poorly designed, making it too easy to write code
that doesn’t always work.
I’ll talk about two of these here:

• You can’t get an empty vector when making a sequence with an expression
like i:j.

• R will sometimes convert matrices to plain vectors when you don’t want it to.
The Problem of Reversing Sequences
The : operator will produce either an increasing sequence or a decreasing
sequence, depending on whether the first operand is less or greater than the
second:

> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1

This may seem convenient — and it is for Small Assignment 3 — but it’s a bad
idea. When you use : in a program, you need to be sure which sort of sequence
you’re going to get!
An Illustration of Why Reversing Sequences are Bad
Here’s a function that is supposed to return a modified square matrix in which all
the elements above the diagonal have been set to one:

ones_above_diagonal <- function (M) {
    n <- nrow(M)
    for (i in 1:n)
        for (j in (i+1):n)
            M[i,j] <- 1
    M
}

Here’s what happens when we try to use it:

> ones_above_diagonal(matrix(0,nrow=4,ncol=4))
Error in M[i, j] <- 1 : subscript out of bounds

(The exact error message depends on the version of R used.)


Why the error? We need to get a zero-length sequence from (i+1):n when i
equals n. But instead we get a sequence of length two, containing n+1 and n.
How could we fix it?
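One possible fix, offered as a sketch rather than as the intended answer, is to
build the column indexes with seq_len, which gives a zero-length sequence when
its argument is zero:

```r
ones_above_diagonal <- function (M) {
    n <- nrow(M)
    for (i in seq_len(n))
        for (j in i + seq_len(n-i))   # columns i+1, ..., n; empty when i is n
            M[i,j] <- 1
    M
}

R <- ones_above_diagonal (matrix(0,nrow=4,ncol=4))
R
```

Another option would be to test whether i is less than n before running the
inner loop, but the seq_len version avoids the extra if.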
The Problem of Dropped Dimensions
When you index a matrix with a single row or column index, R converts the
result to a vector, rather than keep it as a matrix.
Sometimes this is what you want:

> M <- matrix(1:6,nrow=2,ncol=3)
> M[1,2]
[1] 3
> M[1,2:ncol(M)]
[1] 3 5

But sometimes not:

> A <- M[,2:ncol(M)]
> A[1,1]
[1] 3
> B <- M[2:nrow(M),]
> B[1,1]
Error in B[1, 1] : incorrect number of dimensions
Stopping R From Dropping Dimensions
You can tell R to not drop dimensions from a matrix with the drop=FALSE option:

> M <- matrix(1:6,nrow=2,ncol=3)
> M[,2:ncol(M)]
[,1] [,2]
[1,] 3 5
[2,] 4 6
> M[2:nrow(M),]
[1] 2 4 6
> M[,2:ncol(M),drop=FALSE]
[,1] [,2]
[1,] 3 5
[2,] 4 6
> M[2:nrow(M),,drop=FALSE]
[,1] [,2] [,3]
[1,] 2 4 6

But adding drop=FALSE all the time makes everything longer and messier. So it’s
tempting not to. But then you may get unexpected bugs once in a while. . .
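Inside a function meant for general use, drop=FALSE is usually worth the extra
typing. Here is a sketch of a made-up helper, last_rows, that returns the last
k rows of a matrix and always gives back a matrix:

```r
last_rows <- function (M, k) {
    rows <- nrow(M) - k + seq_len(k)   # the last k row indexes; empty if k is 0
    M [rows, , drop=FALSE]             # stays a matrix even when k is 0 or 1
}

M <- matrix (1:6, nrow=2, ncol=3)
B <- last_rows (M, 1)
dim(B)                                 # 1 3
B[1,1]                                 # 2, with no error this time
dim (last_rows (M, 0))                 # 0 3
```

Using seq_len here, rather than the : operator, also avoids the reversing
sequence problem when k is zero.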
CSC 121: Computer Science for Statistics

Radford M. Neal, University of Toronto, 2017

http://www.cs.utoronto.ca/∼radford/csc121/

Week 7
Features of R for Statistics
R is a general purpose programming language. You can write all sorts of
programs in R — video games, accounting packages, word processors, programs
for navigating rocket ships to Mars, . . .

But R is more appropriate for some of these tasks than others. It’s probably not
the best choice for video game programming — games need to respond quickly,
but speed is not R’s strong point. On the other hand, some features of R that are
not common in other languages are especially useful for statistical applications.
Here are some:
• Specifying function arguments by name, with arguments often having default
values — very useful for functions implementing statistical methods.
• Names for elements of vectors and lists, and for rows and columns of matrices
and data frames — “age” is a better label for a column than the number 17.
• R’s “data frames” for storing observations in a way that is convenient for
statistical analysis.
• Special NA values to indicate where data is missing
We’ve talked about the first two, and will now talk about the last two.
Adding Attributes to R Objects
An R object can have one or more “attributes”, that record extra information.
They are mostly ignored if you don’t look at them, but are there if you look.
An example:

> x <- 123 # Set x to a plain number


> x
[1] 123
> attr(x,"fred") <- "abc" # Add a "fred" attribute to x
> x
[1] 123
attr(,"fred")
[1] "abc"
> attr(x,"fred") # We can get just the attribute if we like
[1] "abc"
> x + 1000 # The attribute (usually) gets passed on
[1] 1123
attr(,"fred")
[1] "abc"
Attributes for Dimensions and Names
You can attach attributes to objects for your own purposes, but R also has some
standard uses for attributes.

R uses a dim attribute to mark an object as a matrix, and hold how many rows
and columns it has. This attribute is not usually shown explicitly, be we can see
it if we look using attr:

> M <- matrix(0,nrow=3,ncol=5)


> attr(M,"dim")
[1] 3 5

R uses a names attribute to hold the names of elements in a list or a vector:

> L <- list (abc=9, def=10, xyz="ha")


> attr(L,"names")
[1] "abc" "def" "xyz"

Names for rows and columns in a matrix are stored in a dimnames attribute.
The Class Attribute
The special class attribute tells R that some operations on the object should be
done in a special way. We’ll cover more about how this works later — and about
how it can be used to program in a style known as ‘object-oriented programming”.
For the moment, here’s a brief illustration of what can be done:
> g <- 123
> attr(g,"class") <- "gobbler"
> print.gobbler <- function (what) {
+ cat ("I’m a gobbler with value", unclass(what), "\n")
+ }
> g
I’m a gobbler with value 123
> g+1000
I’m a gobbler with value 1123
We’ve used the class attribute to tell R that objects in our “gobbler” class
should be printed in a different way than ordinary numbers. Note that unclass
gets rid of the class attribute, which lets us handle the number inside a gobbler
object in the usual way (though using unclass is not strictly necessary here).
Data Frames
One major use of classes is for R’s data.frame objects, which are the most
common way that data is represented in R.
A data frame is sort of like a list and sort of like a matrix. Each “row” of a data
frame holds information on some individual, object, case, or whatever. The
“columns” of a data frame correspond to variables whose values have been
measured for each case. These variables can be numbers, logical (TRUE/FALSE)
values, or character strings (but all values for one variable have the same type).
For example, here’s how R prints a small data frame containing the heights and
weights of three people:
> heights_and_weights
name height weight
1 Fred 62 144
2 Mary 60 131
3 Joe 71 182
A data frame is really a list, with named elements that are the columns of the
data frame, but with a data.frame class attribute that makes R do things like
printing and subscripting differently from an ordinary list.
Getting Data Out of a Data Frame
You can get data from a data frame using subscripting operations similar to those
for a matrix (by row and column index), or by operations similar to a list (using
names of variables). For example:

> heights_and_weights # The data frame from the last slide


name height weight
1 Fred 62 144
2 Mary 60 131
3 Joe 71 182
> heights_and_weights$height # All values of the "height" variable
[1] 62 60 71
> heights_and_weights[2,] # All values for the 2nd person
name height weight
2 Mary 60 131
> heights_and_weights[2,3] # Value of 3rd variable for 2nd person
[1] 131
> heights_and_weights$weight[2] # ... and the same, by variable name
[1] 131
Creating a Data Frame
Using as.data.frame, you can create a data frame from a list (it just adds the
data.frame class attribute) or from a matrix (it has to split it up into columns).
If you don’t provide variable names, R uses V1, V2, etc.
Examples:
> as.data.frame (list (abc=c(1,3,2),
+ pqr=c(TRUE,FALSE,FALSE),
+ xyz=c("a","bb","c")))
abc pqr xyz
1 1 TRUE a
2 3 FALSE bb
3 2 FALSE c
>
> as.data.frame (matrix (1:12, nrow=3, ncol=4))
V1 V2 V3 V4
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
If a matrix has row and column names, they become those of the data frame.
Reading Data Into a Data Frame
The read.table function creates a data frame using data it reads from a text file.
The file has to contain one line for each row of the data frame, containing a value
(eg, a number, TRUE/FALSE, a string) for each variable for the case corresponding
to that row.
If a header=TRUE argument is given to read.table, the names of the variables
will be taken from the first line of the file.
Here’s how we could read the heights and weights data frame from a file on the
course web page:

heights_and_weights <-
read.table ("http://www.cs.utoronto.ca/~radford/csc121/data7",
header=TRUE)

The contents of the file read are as below:

name height weight


Fred 62 144
Mary 60 131
Joe 71 182
Indicating Missing Values with NA
It is very common for data collected to have some missing values — where the
subject declined to answer one of the survey questions, or the interviewer forgot
to fill out one page of the form, or where the machine taking the readings was
broken that day.
Sometimes these values are indicated by some special number like −999. But this
is very unreliable. The person analysing the data may not realize that this is
what −999 is supposed to mean, leading to drastically incorrect averages. Or
there may be an actual, non-missing, value of −999!

R supports representation of missing data by a special NA value. NA can be the


value of an element in a vector, matrix, or data frame. For example:

> c(5,1,NA,8,NA)
[1] 5 1 NA 8 NA
Arithmetic on NA values
Arithmetic operations where one or both operands are NA produce NA as the
result:

> a <- c(5,1,NA,8,NA)


> a+100
[1] 105 101 NA 108 NA
> b <- c(10,NA,20,NA,NA)
> a*b
[1] 50 NA NA NA NA

Comparisons with NA also produce NA, rather than TRUE or FALSE. Trying to
use NA as an if or while condition gives an error:

> a == 1
[1] FALSE TRUE NA FALSE NA
> if (a[3]==1) cat("true\n") else cat("false\n")
Error in if (a[3] == 1) cat("true\n") else cat("false\n") :
missing value where TRUE/FALSE needed
Checking For NA
Sometimes you need to check whether a value is NA. But you can’t do this with
something like if (a == NA) ... — that will always give an error!
Instead, you can use the is.na function. It can be applied to a single value,
giving TRUE or FALSE, or a vector of values, giving a logical vector.

For example, R’s built-in airquality demonstration dataset has some NA values.
The following statements create a modified version of the airquality data frame
in which missing values for solar radiation are replaced by the average of all the
non-missing measurements (found with mean using the na.rm option):

ave_solar <- mean (airquality$Solar.R, na.rm=TRUE)


mod_airquality <- airquality
for (i in 1:nrow(mod_airquality))
if (is.na(mod_airquality$Solar.R[i]))
mod_airquality$Solar.R[i] <- ave_solar

(We’ll see later how one can do this more easily using logical indexes.)
NA and NaN
A value will also be “missing” if it is the result of an undefined mathematical
operation. R prints such values as NaN, not NA, but is.na will be TRUE for
them. Operations on NaN produce NaN as a result. Here are some examples:
> 0/0
[1] NaN
> sqrt(-1)
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
> x <- 0/0
> 10*x
[1] NaN
> v <- asin((-2):2)
Warning message:
In asin((-2):2) : NaNs produced
> v
[1] NaN -1.570796 0.000000 1.570796 NaN
> v / 0
[1] NaN -Inf NaN Inf NaN
CSC 121: Computer Science for Statistics

Radford M. Neal, University of Toronto, 2017

http://www.cs.utoronto.ca/∼radford/csc121/

Week 8
Using Numeric Vectors as Subscripts
A subscript used with [ ] can be a vector of indexes, rather than just one index,
yielding a subset of elements having those indexes, not just one element:
> v <- c(9,10,3)
> names(v) <- c("abc","def","xyz")
> v
abc def xyz
9 10 3
> v[c(1,3)] # Notice that names of elements are carried along
abc xyz
9 3
You can also index with a vector of negative numbers. This gets you all elements
except those whose indexes are in the index vector (negated):
> v[-2]
abc xyz
9 3
> v[c(-1,-length(v))]
def
10
Difference Between [ ] and [[ ]]
We can now see better what the difference is between subscripting with [ ] and
with [[ ]] — [ ] extracts a subset of elements (which might be just one),
whereas [[ ]] extracts a single element.
> v
abc def xyz
9 10 3
> v[2]
def
10
> v[[2]] # Notice there’s no name here, just the element
[1] 10
> L <- list (a="xy", b=9, c=TRUE)
> L[2] # Notice that the result is still a list
$b
[1] 9

> L[[2]] # ... but this is an element of the list


[1] 9
Using Logical Vectors as Subscripts
A subscript can also be a logical vector, which selects elements in positions where
this subscript is TRUE:
> v
abc def xyz
9 10 3
> v[c(TRUE,FALSE,TRUE)]
abc xyz
9 3
> v[v>5]
abc def
9 10
R’s “and” (&) and “or” (|) operators can be useful for this:
> v[v>5 & v<10]
abc
9
> v[v>9 | v<7]
def xyz
10 3
Vectors as Matrix Subscripts
Vector subscripts can also be used to select rows or columns of a matrix:

> M <- matrix (1:12, nrow=3, ncol=4)


> rownames(M) <- c("ab","cd","ef")
> colnames(M) <- c("w","x","y","z")
> M
w x y z
ab 1 4 7 10
cd 2 5 8 11
ef 3 6 9 12
> M[c(3,1),c(2,4,4)] # Indexes needn’t be in order, can be duplicates
x z z
ef 6 12 12
ab 4 10 10
> M[c(TRUE,FALSE,TRUE),]
w x y z
ab 1 4 7 10
ef 3 6 9 12
Using Vector Indexes to Replace Elements in a Vector
Numeric and logical vectors can be used as indexes when we replace elements in a
vector rather than get them out.
For example:

> v <- c(66,33,99,10,12)


> v[c(2,4,1)] <- c(100,200,300)
> v
[1] 300 100 99 200 12
> v[c(TRUE,FALSE,FALSE,FALSE,TRUE)] <- c(800,900)
> v
[1] 800 100 99 200 900

Here’s how we can use this to make a modified version of the airquality data
frame (see last week’s slides) with missing values for Solar.R filled in:

mod_airquality <- airquality


mod_airquality$Solar.R [is.na(airquality$Solar.R)] <-
mean (airquality$Solar.R, na.rm=TRUE)
Re-Ordering a Vector, Matrix, or Data Frame
We can change the order of elements in a matrix, or of rows in a matrix or data
frame, using an index that is a permutation of the possible indexes.
One use is to change the order to be increasing in some variable. The order
function produces the permutation needed to do this. For example:
> heights_and_weights
  name height weight
1 Fred     62    144
2 Mary     60    131
3  Joe     71    182
> by_weight <- order (heights_and_weights$weight)
> by_weight
[1] 2 1 3
> new <- heights_and_weights [by_weight, ]
> new
  name height weight
2 Mary     60    131
1 Fred     62    144
3  Joe     71    182
Selecting a Subset of Rows in a Data Frame
Another use of logical indexes is in selecting a subset of rows in a data frame for
which the variables have certain values.
For example, here we select only people with weight greater than 140:
> heights_and_weights
  name height weight
1 Fred     62    144
2 Mary     60    131
3  Joe     71    182
> heights_and_weights [heights_and_weights$weight > 140, ]
  name height weight
1 Fred     62    144
3  Joe     71    182
And here we get only people with weight greater than 140 and height less than 70:
> heights_and_weights [heights_and_weights$weight > 140
+                      & heights_and_weights$height < 70, ]
  name height weight
1 Fred     62    144
Some Design Flaws in R
R is a very useful language, but like all programming languages, it’s not perfect.
Indeed, some of R’s features are poorly designed, making it too easy to write code
that doesn’t always work.
I’ll talk about two of these here:

• You can’t get an empty vector when making a sequence with an expression
like i:j.

• R will sometimes convert matrices to plain vectors when you don’t want it to.
The Problem of Reversing Sequences
The : operator will produce either an increasing sequence or a decreasing
sequence, depending on whether the first operand is less or greater than the
second:

> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1

This may seem convenient — and it is for Small Assignment 3 — but it’s a bad
idea. When you use : in a program, you need to be sure which sort of sequence
you’re going to get!
An Illustration of Why Reversing Sequences are Bad
Here’s a function that is supposed to return a modified square matrix in which all
the elements above the diagonal have been set to one:

ones_above_diagonal <- function (M) {
    n <- nrow(M)
    for (i in 1:n)
        for (j in (i+1):n)
            M[i,j] <- 1
    M
}

Here’s what happens when we try to use it:

> ones_above_diagonal(matrix(0,nrow=4,ncol=4))
Error in M[i, j] <- 1 : subscript out of bounds

(The exact error message depends on the version of R used.)


Why the error? We need to get a zero-length sequence from (i+1):n when i
equals n. But instead we get a sequence of length two, containing n+1 and n.
How could we fix it?
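
One possible fix, offered as a sketch rather than the only answer: build the
inner sequence with seq_len, which returns an empty vector when its argument is
zero, so the inner loop simply does not run when i equals n.

```r
ones_above_diagonal <- function (M) {
    n <- nrow(M)
    for (i in seq_len(n))
        for (j in i + seq_len(n-i))   # seq_len(0) is empty, so no j when i == n
            M[i,j] <- 1
    M
}

ones_above_diagonal (matrix(0,nrow=4,ncol=4))
```

Another option would be to guard the inner loop with if (i < n), which keeps
the : operator but makes the special case explicit.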
The Problem of Dropped Dimensions
When you index a matrix with a single row or column index, R converts the
result to a vector, rather than keep it as a matrix.
Sometimes this is what you want:

> M <- matrix(1:6,nrow=2,ncol=3)
> M[1,2]
[1] 3
> M[1,2:ncol(M)]
[1] 3 5

But sometimes not:

> A <- M[,2:ncol(M)]
> A[1,1]
[1] 3
> B <- M[2:nrow(M),]
> B[1,1]
Error in B[1, 1] : incorrect number of dimensions
Stopping R From Dropping Dimensions
You can tell R to not drop dimensions from a matrix with the drop=FALSE option:

> M <- matrix(1:6,nrow=2,ncol=3)
> M[,2:ncol(M)]
     [,1] [,2]
[1,]    3    5
[2,]    4    6
> M[2:nrow(M),]
[1] 2 4 6
> M[,2:ncol(M),drop=FALSE]
     [,1] [,2]
[1,]    3    5
[2,]    4    6
> M[2:nrow(M),,drop=FALSE]
     [,1] [,2] [,3]
[1,]    2    4    6

But adding drop=FALSE all the time makes everything longer and messier. So it’s
tempting not to. But then you may get unexpected bugs once in a while. . .
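
A small sketch of the kind of bug this can cause, using the same M as above:
code that asks for the number of rows of a selection works only if the
selection is still a matrix.

```r
M <- matrix(1:6,nrow=2,ncol=3)
nrow (M[2:nrow(M),])              # NULL, since the result is a plain vector
nrow (M[2:nrow(M),,drop=FALSE])   # 1, since the result is still a matrix
```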

Week 9
Operations on Numeric Vectors that Produce One Number
R has several functions that take a numeric vector or matrix as their argument,
and return a single number as their value, including:
sum finds the sum of all elements.
prod finds the product of all elements.
max finds the largest of all elements.
min finds the smallest of all elements.
mean finds the mean (average) of all elements.

For example:
> u <- c(3,5,1,9)
> sum(u)
[1] 18
This does pretty much the same thing as the following loop:
> s <- 0
> for (x in u) s <- s + x
> s
[1] 18
However, sum(u) is faster, and in some cases more accurate.
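
For illustration, here are the other functions from the list above applied to
the same vector u. This is only a quick sketch; the values in the comments
follow directly from the elements 3, 5, 1, 9.

```r
u <- c(3,5,1,9)
prod(u)   # 3*5*1*9 = 135
max(u)    # 9
min(u)    # 1
mean(u)   # 18/4 = 4.5
```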
Operations on Logical Vectors that Produce One Logical Value
R also has two functions that take a logical vector as their argument, and return
a single logical value:

any   Return TRUE if any elements are TRUE
all   Return TRUE if all elements are TRUE

Looked at another way, any finds the “or” of all elements in its argument, and
all finds the “and” of all elements.
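
A quick sketch of how these behave on a small logical vector (the vector x
here is made up for illustration):

```r
x <- c(FALSE,TRUE,FALSE)
any(x)    # TRUE, since at least one element is TRUE
all(x)    # FALSE, since not every element is TRUE
all(!x)   # FALSE, since the second element of !x is FALSE
```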
Here’s an example of the use of these functions:

check_age <- function (df) {
    if (any(is.na(df$age)))
        stop("Age is missing for some people")
    if (!all (df$age >= 0 & df$age < 150))
        stop("Age is invalid for some people")
}

Can you think of a way to replace the second if condition with one that uses any
rather than all?
Creating a Plot in Stages
Many simple plots can be created with a single plot command — eg, plot(x,y)
will plot points with coordinates given by the vectors x and y.

More complicated plots can be created in stages by adding more points, lines, and
text to what has already been plotted.

The general approach:

• Create a new plot with plot. It might contain some points or lines, or might
be completely empty. Features such as the axis scales and labels are
determined at this stage.

• Then add more information, using functions such as points, lines, abline,
and text. You can call these functions as many times as needed, perhaps
with different options for things like colour and line width each time.

• You can also add a title above the plot with the title function.
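
As a rough sketch of this staged approach (the data here is made up, and the
particular colours, labels, and positions are arbitrary choices):

```r
x <- 1:10
y <- c(2,4,3,7,6,9,8,12,11,14)

plot (x, y, xlab="x", ylab="y", ylim=c(0,16))   # new plot, axes fixed here
points (x, y+1, col="red", pch=20)              # add more points
lines (x, y, col="blue", lty="dashed")          # add lines joining the first points
abline (h=mean(y), col="grey")                  # add a horizontal line at mean of y
text (2, 15, "made-up data")                    # add some text
title ("A plot built in stages")                # add a title above the plot
```

Each call after plot draws on top of what is already there, so the order of
the calls decides what ends up in front.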
Creating a New Plot
You create a new plot with the plot function. It takes one or two data vectors as
its first arguments, but has many, many other possible arguments. You’ll want to
let most of these have their default values, and refer to any that you set by name.
Here are some of the possible arguments to plot:
type Type of plotting — "p" for points (the default), "l" for lines,
"b" for both points and lines, "c" for lines only but with space for points
col Colour for points/lines plotted (default is "black")
xaxt Set to "n" to get rid of horizontal axis numbers
yaxt Set to "n" to get rid of vertical axis numbers
xlab Label for the horizontal axis
ylab Label for the vertical axis
xlim Horizontal range for plot (vector of length two)
ylim Vertical range for plot (vector of length two)
asp Aspect ratio, asp=1 ensures one vertical unit looks the same
length as one horizontal unit
For example, plot (c(), xlim=c(0,2), ylim=c(1,5)) will plot an empty
frame with horizontal axis labels from 0 to 2 and vertical axis labels from 1 to 5.
Adding Points to a Plot
We can add points to a plot with the points function. Like plot, it takes two
vectors as its first two arguments, containing the x and y coordinates of the
points. (Or just a single vector argument with the y coordinates, in which case
the x coordinates are 1, 2, 3, . . . )

It can also take other arguments that set various options, such as

type  Set to "b" for lines as well as points
col   Colour for points plotted
pch   Character to plot points with; the default is a circle, other possibilities
      are pch="x" for plotting with x symbols, or pch=20 for solid dots

For example, points (x, y, col="red", pch=20) will add solid red dots to the
plot, at the coordinates given by the vectors x and y.
Adding Lines to a Plot
We can add lines to a plot with the lines function.
In addition to one or two arguments giving the coordinates of the points to
connect with lines, it can take other arguments such as those below (which can
also be used for plot):

type Set to "b" for points too, "c" for lines only but with space for points
col Colour for lines plotted
lty Line type — eg, "dotted", "dashed", or "solid" (the default)
lwd Line width (default is 1)

For example, lines (y, col="green", lty="dotted") will add dotted green
lines to the plot, at the x coordinates 1, 2, 3, . . . and y coordinates given by the
vector y.
Adding Text to a Plot
We can add text to a plot with the text function.

Here’s an example that adds “WOW” at the origin of the plot:

> text (0, 0, "WOW")

We can put many character strings on a plot with one call of text, since its
arguments can be vectors of x coordinates, y coordinates, and character strings.
For example:

> x <- 1:10
> y <- x^2
> plot(x,y,xlim=c(0,11))
> text(x,y+2,paste("square of",x))
Example: Drawing a Spiral
Here’s an example R script that draws a spiral in a plain box, using 7 segments
each time it winds around, with red dots at the vertices. The start and end are
labelled with “start” and “end”.

n <- 20
angle <- 2*pi*(0:n)/7
dist <- 0:n
x <- dist * cos(angle)
y <- dist * sin(angle)

plot (x, y, type="c", xaxt="n", yaxt="n", xlab="", ylab="",
      xlim=c(-n,n), ylim=c(-n,n), asp=1)

points (x, y, col="red")

text (x[1], y[1]-1, "start")
text (x[n+1], y[n+1]+1, "end")
The Spiral Plot
> source("http://www.cs.utoronto.ca/~radford/csc121/spiral-script.r")

[Figure: the spiral drawn by the script, with its two ends labelled "start" and "end".]

Week 10
Many Ways to Write a Simple Function
In this lecture, we’ll look at many ways of writing a simple function called
is_not_decreasing, which takes one argument, a vector, and returns TRUE if
the elements in the vector are in non-decreasing order, and FALSE otherwise.
We’ll see some new R features along the way.
Examples:

> is_not_decreasing (c(4,8,8,9))
[1] TRUE
> is_not_decreasing (c(5,1,3))
[1] FALSE
> is_not_decreasing (7)
[1] TRUE

We’ll assume that the vector has no NA values. What would be a reasonable
thing to do if it did?
Ending a Loop Using a Logical Flag Variable
Here’s one solution, that uses the setting of a logical variable as a way of
terminating a while loop:
is_not_decreasing <- function (v) {
    answer_is_known <- FALSE
    i <- 2
    while (!answer_is_known) {
        if (i > length(v)) {
            answer <- TRUE
            answer_is_known <- TRUE
        }
        else if (v[i] < v[i-1]) {
            answer <- FALSE
            answer_is_known <- TRUE
        }
        i <- i + 1
    }
    answer
}
Using a repeat Loop and break Statement
This function used two logical variables — one to hold the answer returned, the
other to indicate when the answer is now known, and hence the loop can end.
We can instead use a loop written using repeat, which continues indefinitely,
until a break statement is executed:
is_not_decreasing <- function (v) {
    i <- 2
    repeat {
        if (i > length(v)) {
            answer <- TRUE
            break
        }
        if (v[i] < v[i-1]) {
            answer <- FALSE
            break
        }
        i <- i + 1
    }
    answer
}
Using break Within a for Loop
We can use break to immediately exit any kind of loop. Here’s another way to
write this function:
is_not_decreasing <- function (v) {
    answer <- TRUE
    if (length(v) > 1)
        for (i in 2:length(v)) {
            if (v[i] < v[i-1]) {
                answer <- FALSE
                break
            }
        }
    answer
}
In this version, we initially set answer to TRUE, which will be the answer if we
don’t find a place where the elements decrease. If we do find a decrease, we set
answer to FALSE, and also immediately exit the for loop.
Caution: The break statement exits from the innermost loop that contains it.
If you’re inside two loops, you can’t use break to exit both of them at once.
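
A small sketch of this behaviour, using a made-up counter: the break below
leaves only the inner loop, so the outer loop still runs all three times.

```r
count <- 0
for (i in 1:3) {
    for (j in 1:3) {
        if (j == 2) break      # exits only the inner (j) loop
        count <- count + 1
    }
}
count    # 3: the outer loop still ran for i = 1, 2, 3
```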
Returning a Value for a Function Immediately
Rather than exit a loop with break after setting answer, and then making
answer the value of the function by putting it as the last thing, we can instead
use return to exit the whole function, and specify the value it returns.

is_not_decreasing <- function (v) {
    if (length(v) > 1) {
        for (i in 2:length(v)) {
            if (v[i] < v[i-1])
                return(FALSE)
        }
    }
    return(TRUE)
}

At the end, we could just have written TRUE instead of return(TRUE) — they do
the same thing at the end of a function.
Why is the check for length(v) > 1 needed?
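
One way to explore this question (a sketch, not the only explanation) is to
look at what the loop would see for a vector of length one:

```r
v <- 7
2:length(v)     # the sequence 2 1, not an empty sequence
v[2] < v[1]     # NA, because v[2] is past the end of v
```

An if statement whose condition is NA stops with an error, so without the
check the function would fail for a vector of length one.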
Avoiding Loops with a Vector Comparison
We can write is_not_decreasing without an R loop using a vector comparison
and the all function:

is_not_decreasing <- function (v) all (v[-length(v)] <= v[-1])

In this version, v[-length(v)] will contain all of v except the last element, and
v[-1] will contain all of v except the first element. So v[-length(v)] <= v[-1]
compares each element except the last to the next element. The vector v is
non-decreasing if all these comparisons are TRUE.
Here’s another way to do the same thing:

is_not_decreasing <- function (v) {
    if (length(v) < 2)
        TRUE
    else
        all (v[1:(length(v)-1)] <= v[2:length(v)])
}

Why is the check for length(v) < 2 needed here, but not in the version above?
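
To explore this, it may help to try the subscripts from both versions on a
vector of length one. This is just a sketch of the pieces involved:

```r
v <- 7
v[-length(v)]          # numeric(0): nothing is left to compare
all (logical(0))       # TRUE
v[1:(length(v)-1)]     # 1:0 is the sequence 1 0, and v[c(1,0)] is 7
v[2:length(v)]         # 2:1 is the sequence 2 1, and v[c(2,1)] is NA 7
```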
Recursion — When a Function Calls Itself
As you know, an R function can call another R function, which can call yet
another R function, etc.
Indeed, an R function can even call itself. This is called “recursion”.
Of course, a function had better not always call itself, or it will just keep calling,
and calling, and calling, without end.

But having a function sometimes call itself can be useful. Here’s a recursive
function to compute factorials in R:

fact <- function (n) if (n == 0) 1 else n * fact(n-1)

(Although R already has a pre-defined factorial function.)

In fact, anything computable can be computed using if and recursion, without
any loops or assignment statements. That’s not a typical style of programming
in R, but it is typical for some other programming languages.
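
As a small sketch of this style, here is a recursive function (the name rsum
is made up for this example) that adds up a vector using only if and
recursion, with no loop and no assignment inside the function:

```r
rsum <- function (v) if (length(v) == 0) 0 else v[1] + rsum(v[-1])

rsum (c(3,5,1,9))    # 18, the same as sum(c(3,5,1,9))
```

For long vectors this is much slower than sum, and R limits how deeply
function calls can nest, so it is meant only as an illustration.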
Two Recursive Versions of is_not_decreasing
We could write the is_not_decreasing function using recursion. Here’s one way:
is_not_decreasing <- function (v) {
    if (length(v) <= 1)
        TRUE
    else if (v[2] < v[1])
        FALSE
    else
        is_not_decreasing(v[-1])
}
Here’s another way that doesn’t copy parts of v, and also extends the function’s
meaning so it checks only from a certain point forward (default, from the start):
is_not_decreasing <- function (v, from=1) {
    if (length(v) <= from)
        TRUE
    else if (v[from+1] < v[from])
        FALSE
    else
        is_not_decreasing(v,from+1)
}
Operations on Vectors
We’ve seen before that R can do many operations on entire vectors (or matrices),
not just on single numbers. For example, we can add 1 to all elements of a vector:
> u <- c(3,5,1,9)
> v <- u + 1
> v
[1] 4 6 2 10
Instead of the statement v <- u + 1 we could have written a loop:
> v <- u
> for (i in 1:length(v)) v[i] <- v[i] + 1
> v
[1] 4 6 2 10
But v <- u + 1 is easier to write, easier to read, and also faster in R.
This isn’t magic, though — there still is a loop hidden within the implementation
of R, and in some other languages writing a loop yourself would be just as fast.

R has many other facilities for doing operations on vectors, matrices, or lists
without having to write a loop, which often are also faster.
Replacing Loops with “apply” Functions
Functions in the “apply” family take as arguments both a data structure and a
function to apply to parts of the data structure — an example of “functional
programming”, using functions to construct more complex operations.
The lapply function operates on a list, and returns a list of results of applying a
given function to each element of the list. Here’s an example using the
is.numeric function, which says whether something is a numeric vector:

> L <- list ("abc", c(123,456), TRUE)
> lapply(L,is.numeric)
[[1]]
[1] FALSE

[[2]]
[1] TRUE

[[3]]
[1] FALSE
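
To make the connection with loops concrete, here is a sketch of a loop that
builds the same list as lapply(L,is.numeric). The variable name result is
just for this example.

```r
L <- list ("abc", c(123,456), TRUE)

result <- vector ("list", length(L))      # a list of the right length
for (i in seq_along(L))
    result[[i]] <- is.numeric (L[[i]])    # apply the function to each element

identical (result, lapply (L, is.numeric))
```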
Using “apply” on Matrices
You can use apply to apply a function to all rows or to all columns of a matrix.
If the function applied returns a single value, the result is a vector of these values:

> M <- matrix (1:6, nrow=2, ncol=3)
> M
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> apply (M, 1, function (v) sum(v^2)) # 2nd arg of 1 means apply to rows
[1] 35 56
> apply (M, 2, function (v) sum(v^2)) # 2nd arg of 2 means apply to cols
[1] 5 25 61

If the function returns a vector of length greater than one, the result is a matrix:

> apply (M, 1, function (v) c(sum(v), prod(v)))
     [,1] [,2]
[1,]    9   12
[2,]   15   48
Logical Operators
Some previous slides have mentioned logical operations on vectors. These operate
on vectors of logical values, returning a vector of logical values.
For one logical value, the operators are defined as follows:

! Logical “not”: TRUE if its operand is FALSE, FALSE if its operand is TRUE.
& Logical “and”: TRUE only if both operands are TRUE.
| Logical “or”: TRUE if either operand is TRUE.

When applied to logical vectors, the operations are done on each element in turn:

> a <- c (TRUE, TRUE, FALSE, FALSE)
> b <- c (TRUE, FALSE, TRUE, FALSE)
> a & b
[1] TRUE FALSE FALSE FALSE
> a | b
[1] TRUE TRUE TRUE FALSE
> !a
[1] FALSE FALSE TRUE TRUE
An Example of apply Using Logical Operations
Here’s how apply can be used to see which columns in a matrix have values that
are all in the range of the first value to the last value:

> A <- matrix (c(2,9,0,1,3,8,4,9), nrow=4, ncol=2)
> A
     [,1] [,2]
[1,]    2    3
[2,]    9    8
[3,]    0    4
[4,]    1    9
> apply (A, 2, function (v) all (v >= v[1] & v <= v[length(v)]))
[1] FALSE TRUE

Week 11
Another Use for Classes — Factors
Recall that how R handles an object can be changed by giving it a “class”
attribute. That’s how lists become data frames. Another example is the “factor”
class, which is used to represent a vector of strings as a vector of integers, along
with a vector of just the distinct string values.
Here’s an illustration:
> a <- as.factor(c("red","green","yellow","red","green","blue","red"))
> a
[1] red green yellow red green blue red
Levels: blue green red yellow
> class(a) # We can see that this object has the class "factor"
[1] "factor"
> unclass(a) # Here’s what it is without its class attribute
[1] 3 2 4 3 2 1 3
attr(,"levels")
[1] "blue" "green" "red" "yellow"
The main reason factors exist is that an integer previously used less memory than
a string, though this is less true in recent versions of R. Before R 4.0.0,
read.table converted strings to factors unless you used the
stringsAsFactors=FALSE option; since R 4.0.0, strings are left as strings
unless you ask for stringsAsFactors=TRUE.
Operations on Factors
Factors look like strings for many purposes:

> a <- as.factor(c("red","green","yellow","red","green","blue","red"))
> a == "red"
[1] TRUE FALSE FALSE TRUE FALSE FALSE TRUE

Even though factors are represented as integers, mathematical operations on
them are not allowed:

> sqrt(a)
Error in Math.factor(a) : sqrt not meaningful for factors

This is because the integers representing the “levels” of the factor are arbitrary,
so treating them like numbers would be misleading. (Unfortunately, R isn’t
completely consistent in this, and will sometimes use a factor as a number
without a warning.)
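
One well-known case of this, shown here as a sketch with a made-up factor f:
converting a factor of number-like strings with as.numeric gives the integer
codes, not the numbers the strings stand for.

```r
f <- as.factor (c("10","5","10"))
levels(f)                     # "10" "5", in sorted order as strings
as.numeric(f)                 # 1 2 1: the integer codes, not the values
as.numeric(as.character(f))   # 10 5 10: the values the strings stand for
```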
Another Use of Classes — Dates and Time Differences
R also defines classes for dates, and for differences in dates. Some of what you
can do with these is illustrated below:

> d1 <- as.Date("2015-03-24") # d1 will be an object of class "Date"
> d1
[1] "2015-03-24"
> d1+2 # Adding an integer to a date gives a new date
[1] "2015-03-26"
> d1+10 # Addition will automatically change the month
[1] "2015-04-03"
> d2 <- as.Date("2015-02-24")
> d1-d2 # The difference has class "difftime"
Time difference of 28 days
> as.numeric(d1-d2) # We can convert a "difftime" object to a number
[1] 28
Defining Your Own Classes
You can attach a class attribute of your choice to any object. If that’s all you do,
the object gets handled just as before, except the class attribute is carried along:
> x <- 9
> class(x) <- "mod17"
> x + 10
[1] 19
attr(,"class")
[1] "mod17"
But you can now redefine some operations (ones that are “generic”) to operate
specially on your class:
> `+.mod17` <- function (a,b) {
+ r <- (unclass(a) + unclass(b)) %% 17
+ class(r) <- "mod17"
+ r
+ }
> x + 10
[1] 2
attr(,"class")
[1] "mod17"
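
Printing is also generic, so the same idea can tidy up how mod17 objects are
displayed. The method below is a hypothetical addition, not part of the
original example; it is a minimal sketch of a print method for this class.

```r
x <- 9
class(x) <- "mod17"

# Hypothetical print method: show the number followed by "(mod 17)"
print.mod17 <- function (x, ...) {
    cat (unclass(x), "(mod 17)\n")
    invisible(x)
}

print(x)    # this method is also used when x is displayed at the R prompt
```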
Defining Your Own Generic Functions
You can also create new generic functions, that you can define “methods” for,
that are used when they are called with objects of particular classes. For example:

> picture <- function (x) UseMethod("picture")
> picture.default <- function (x) cat(x,"\n")
> picture.mod17 <- function (x) cat(rep("-",x),"X",rep("-",16-x),"\n")
> picture(9)
9
> picture(x)
- - - - - - - - - X - - - - - - -
> picture(x+3)
- - - - - - - - - - - - X - - - -

The definition of picture just says it’s generic. If no special method is defined
for a class, picture.default is used. By defining picture.mod17, we create a
special method for class mod17. R finds the method to use based on the class of
the first argument to the generic function.
The Object-Oriented Approach to Programming
R’s classes are designed to support what is called “object-oriented” programming.
This approach to programming has several goals:

• Allow manipulation of “objects” without having to know exactly what kind of
object you’re manipulating, as long as the object can do the things that
you need to do (it has the right “methods”).
Benefit: We can write just one function for all objects, not many functions
that all do the same thing but in somewhat different ways.

• Separate what the methods for an object do from how they do it (including
how the object is represented).
Benefit: We can change how objects work without having to change all the
functions that use them.

• Permit the things that can be done with objects (“methods”) and the kinds
of objects (“classes”) to be extended without changing existing functions.
Benefit: We can more easily add new facilities, without having to rewrite
existing programs.
Generic Functions for Drawing, Rescaling, and Translating
Let’s see how we can define a set of generic functions for drawing and
transforming objects like circles and boxes.
We start by setting up the generic functions we want:

draw <- function (w) UseMethod("draw")
rescale <- function (w,s) UseMethod("rescale")
translate <- function (w,tx,ty) UseMethod("translate")

Then we need to define methods for these generic functions for all the classes of
objects we want. We also need functions for creating such objects.

Note: We might not have done things in this order. For example, we might have
first defined only draw and translate methods, and then later added the
rescale method. We would then need to implement a rescale method for a
class only if we actually will use rescale for objects of that class.
Implementing a Circle Object
We’ll represent a circle by the x and y coordinates of its centre and its radius.
new_circle <- function (x, y, r) {
    w <- list (centre_x=x, centre_y=y, radius=r)
    class(w) <- "circle"
    w
}
draw.circle <- function (w) {
    angles <- seq (0, 2*pi, length=100)
    lines (w$centre_x + w$radius*cos(angles),
           w$centre_y + w$radius*sin(angles))
}
rescale.circle <- function (w,s) {
    w$radius <- w$radius * s
    w
}
translate.circle <- function (w,tx,ty) {
    w$centre_x <- w$centre_x + tx; w$centre_y <- w$centre_y + ty
    w
}
Implementing a Box Object
We’ll represent a box by the x coordinates of its left and right sides (x1, x2)
and the y coordinates of its bottom and top (y1, y2). But to create a box we’ll
give the coordinates of its centre and the offsets from the centre to the sides.
new_box <- function (x, y, sx, sy) {
    w <- list (x1=x-sx, x2=x+sx, y1=y-sy, y2=y+sy)
    class(w) <- "box"
    w
}
draw.box <- function (w) {
    lines (c(w$x1,w$x1,w$x2,w$x2,w$x1), c(w$y1,w$y2,w$y2,w$y1,w$y1))
}
rescale.box <- function (w,s) {
    xm <- (w$x1+w$x2) / 2
    w$x1 <- xm + s*(w$x1-xm); w$x2 <- xm + s*(w$x2-xm)
    ym <- (w$y1+w$y2) / 2
    w$y1 <- ym + s*(w$y1-ym); w$y2 <- ym + s*(w$y2-ym)
    w
}
translate.box <- function (w,tx,ty) {
    w$x1 <- w$x1 + tx; w$x2 <- w$x2 + tx
    w$y1 <- w$y1 + ty; w$y2 <- w$y2 + ty
    w
}
An Example of Drawing Objects This Way
> plot(NULL,xlim=c(-7,7),ylim=c(-7,7),xlab="",ylab="",asp=1)
> c <- new_circle(3,4,2.5)
> draw(c); draw(rescale(c,0.7)); draw(translate(rescale(c,0.3),1,-5))
> b <- new_box(-3,-3,2,3)
> b2 <- translate(b,-1.3,2.2)
> draw(b); draw(b2); draw(rescale(b2,1.1))

[Figure: the three circles and three boxes drawn by the commands above, on axes from −7 to 7.]
Defining a Function That Works On Both Circles and Boxes
Here is a function that should work for circles, boxes, or any other class of object
that has draw, rescale, and translate methods:
smaller <- function (w, n)
    for (i in 1:n) { draw (w); w <- rescale(translate(w,1,0),0.9) }

Here are two uses of it:

> plot(NULL,xlim=c(-7,7),ylim=c(-7,7), xlab="",ylab="",asp=1)
> smaller (new_circle(-3,3.1,3),10)
> smaller (new_box(-3,-3,3.1,3),10)
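[Figure: ten successively smaller circles and ten successively smaller boxes, each shifted right of the one before, on axes from −7 to 7.]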
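
To illustrate the point about extending the set of classes, here is a sketch
of a third kind of object. The "segment" class and new_segment function are
hypothetical, made up for this example; the three generic functions are
repeated so that the code can be run on its own.

```r
# Generic functions, as set up earlier
draw <- function (w) UseMethod("draw")
rescale <- function (w,s) UseMethod("rescale")
translate <- function (w,tx,ty) UseMethod("translate")

# Hypothetical "segment" class: a line segment from (x1,y1) to (x2,y2)
new_segment <- function (x1, y1, x2, y2) {
    w <- list (x1=x1, y1=y1, x2=x2, y2=y2)
    class(w) <- "segment"
    w
}
draw.segment <- function (w) lines (c(w$x1,w$x2), c(w$y1,w$y2))
rescale.segment <- function (w,s) {      # rescale about the midpoint
    xm <- (w$x1+w$x2) / 2;  ym <- (w$y1+w$y2) / 2
    w$x1 <- xm + s*(w$x1-xm);  w$x2 <- xm + s*(w$x2-xm)
    w$y1 <- ym + s*(w$y1-ym);  w$y2 <- ym + s*(w$y2-ym)
    w
}
translate.segment <- function (w,tx,ty) {
    w$x1 <- w$x1 + tx;  w$x2 <- w$x2 + tx
    w$y1 <- w$y1 + ty;  w$y2 <- w$y2 + ty
    w
}

# Halve a segment from (0,0) to (4,2), then shift it right by 1
s <- translate (rescale (new_segment(0,0,4,2), 0.5), 1, 0)
unlist(s)
```

Once these methods exist, a function written only in terms of draw, rescale,
and translate should work on segments without being changed.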
Statistical Facilities in R
In this course, we’ve mostly looked at R as a programming language, and at
general programming concepts.
But R is most popular as a language for statistical applications. So it has many
special facilities for doing statistics. I’ll talk about some now.

Don’t worry if you don’t understand some of the statistical concepts — that’s OK
for this course. Though learning about R’s statistical facilities is one good way to
learn statistics in a hands-on way!
Creating Tables of Counts
R can count how many times a value or combination of values occurs in a data
set, with the table function. It returns an object of class table, which looks like
a vector or matrix of integer counts.
For a vector, table counts how many times each unique value occurs:
> colours <- c("red","blue","red","red","green","blue")
> print (tcol <- table(colours))
colours
 blue green   red
    2     1     3
> names(tcol)
[1] "blue" "green" "red"
> ages <- c(4,9,12,2,4,9,10)
> print (tage <- table(ages))
ages
 2  4  9 10 12
 1  2  2  1  1
> names(tage)
[1] "2" "4" "9" "10" "12"
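
Since the table has names, individual counts can be picked out by name, and
ordinary arithmetic works on the counts. A brief sketch:

```r
colours <- c("red","blue","red","red","green","blue")
tcol <- table(colours)
tcol["red"]         # the count for "red", which is 3
tcol / sum(tcol)    # proportions rather than counts
```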
Tables of Joint Counts
When used with two vectors, or a data frame with two columns, table creates a
two-dimensional table of how often each combination of values occurs. Examples:

> colours <- c("red","blue","red","red","green","blue")
> shapes <- c("round","round","square","square","square","round")
> table(colours,shapes)
       shapes
colours round square
  blue      2      0
  green     0      1
  red       1      2
> df <- data.frame(col=colours,shape=shapes)
> table(df)
       shape
col     round square
  blue      2      0
  green     0      1
  red       1      2
Statistical Modeling in R
One big part of statistics is fitting a model to data. R has many functions for
doing this, but I’ll mention only lm, which fits a linear model.
Models in R are often specified using formulas, that say how one thing is
modelled in terms of other things.
For lm, we want to specify that some response variable is modelled as a linear
combination (plus noise) of some explanatory variables. This is done using a
formula such as

growth ~ ave_temp + fertilizer + variety

This might express that the amount by which some plant grows is linearly related
to the average temperature, the amount of fertilizer used, and a set of indicator
variables indicating the variety of the plant.
A Simple Example of a Linear Model
Here, I’ll show the results of a very simple linear model, relating the volume of
wood in a cherry tree to its girth (diameter of trunk). The data is in the data
frame trees that comes with R.
Here’s a plot of the data:

[Figure: scatterplot of trees$Volume (vertical axis, 10 to 70) against trees$Girth (horizontal axis, 8 to 20).]
Fitting the Model with lm
We can fit a linear model for volume given girth as follows:

> lm (trees$Volume ~ trees$Girth)

Call:
lm(formula = trees$Volume ~ trees$Girth)

Coefficients:
(Intercept)  trees$Girth
    -36.943        5.066

The result says that the best-fit model for the volume is

Volume = −36.943 + 5.066 Girth + noise

We can get the same result with an abbreviated formula by saying the data comes
from the data frame trees:

lm (Volume ~ Girth, data=trees)


Using the Result of lm
The value returned by lm is an object of class "lm", which has special methods
for printing and other operations.
We can save the result, and then get the regression coefficients with coef.

> m <- lm (Volume ~ Girth, data=trees)
> coef(m)
(Intercept)       Girth
 -36.943459    5.065856

We could use these coefficients to predict the volume for a new tree, with girth
of 11.6:

> coef(m) %*% c(1,11.6) # %*% will compute the dot product
[,1]
[1,] 21.82048
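
R's predict function offers another route to the same number. The sketch
below assumes the model was fitted with data=trees, so that the new data
frame only needs a column named Girth.

```r
m <- lm (Volume ~ Girth, data=trees)
predict (m, newdata=data.frame(Girth=11.6))   # about 21.82, as above
```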
Getting More Details on the Model Fitted
We can also ask for more statistical details with summary:
> summary(m)

Call:
lm(formula = Volume ~ Girth, data = trees)

Residuals:
   Min     1Q Median     3Q    Max
-8.065 -3.107  0.152  3.495  9.587

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
Plotting the Regression Line
We can also plot the regression line from the fitted model on top of a scatterplot
of the data, using abline(m):

[Figure: the same scatterplot of trees$Volume against trees$Girth, with the fitted regression line drawn on top.]

The plot shows some indication that the relationship is actually curved.
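
A plot like this one can be produced with commands along the following lines
(a sketch; axis labels and other options are left at their defaults):

```r
m <- lm (Volume ~ Girth, data=trees)
plot (trees$Girth, trees$Volume)   # scatterplot of the data
abline (m)                         # add the fitted line on top
```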
Trying a Quadratic Model
Let’s try fitting volume to both girth and the square of girth:
> Girth_squared <- trees$Girth^2
> summary (lm (trees$Volume ~ trees$Girth + Girth_squared))
Call:
lm(formula = trees$Volume ~ trees$Girth + Girth_squared)

Residuals:
    Min      1Q  Median      3Q     Max
-5.4889 -2.4293 -0.3718  2.0764  7.6447

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   10.78627   11.22282   0.961 0.344728
trees$Girth   -2.09214    1.64734  -1.270 0.214534
Girth_squared  0.25454    0.05817   4.376 0.000152 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 3.335 on 28 degrees of freedom
Multiple R-squared: 0.9616, Adjusted R-squared: 0.9588
F-statistic: 350.5 on 2 and 28 DF, p-value: < 2.2e-16
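
The squared term can also be written inside the formula with R's I function,
which avoids creating a separate variable. This is an alternative sketch, not
what the slide above does; it should give the same coefficient estimates.

```r
m2 <- lm (Volume ~ Girth + I(Girth^2), data=trees)
coef(m2)    # should match the estimates in the summary above
```

The I(...) wrapper is needed because ^ has a special meaning inside a model
formula.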
Some Useful Functions of Vectors
The unique function returns a vector of unique values:
> colours <- c("red","blue","red","red","green","blue")
> unique(colours)
[1] "red" "blue" "green"
The sort function sorts a vector in increasing order (or decreasing order if you
use decreasing=TRUE):
> ages <- c(4,9,12,2,4,9,10)
> sort(ages)
[1] 2 4 4 9 9 10 12
> sort(unique(ages),decreasing=TRUE)
[1] 12 10 9 4 2
The which.min and which.max functions give the index of the smallest and
largest elements in a vector (first occurrence if they occur more than once):
> which.min(ages)
[1] 4
> which.max(ages)
[1] 3
Checking if Things are in a Set
The %in% operator checks whether values are in some set of values (represented
by a vector of values in the set):

> colours <- c("red","blue","red","red","green","blue")
> "black" %in% colours
[1] FALSE
> colours %in% c("red","green")
[1] TRUE FALSE TRUE TRUE TRUE FALSE

You can use the results to find the elements of a vector that are in some set:

> colours [ colours %in% c("red","green") ]
[1] "red" "red" "red" "green"

With the which function, which returns the indexes of the TRUE values in a
logical vector, you can also find the indexes of the elements that are in the set:

> which (colours %in% c("red","green"))
[1] 1 3 4 5

Week 12
Computers are Fast
Modern computers are so fast that most simple operations on not-too-large
amounts of data appear to happen instantly.
If you’re working with a data frame with 1000 rows and 10 columns, you can
expect all of the following to happen so fast that a human can’t perceive the delay:

• Adding or deleting one row or one column to make a new data frame.

• Replacing all the NA values in a column with zeros.

• Fitting a linear model for one column in terms of other columns.

If you see any apparent delay, it’s probably not for the operation itself, but for
things like fetching the data over the internet, or waiting for your computer to
stop doing something else.
. . . But Not Always Fast Enough
Nevertheless, computing speed is still an issue today.

• Because today’s computers are fast, people try to use them on bigger
problems than before. If you work on a data frame with 1000000 rows and
1000 columns, many operations will be noticeably slow, maybe very slow.

• Sometimes you want to do simple things many, many times. For example,
how well a statistical method works is often assessed by trying it on many
randomly-generated data sets.

• Some tasks are inherently extremely slow for computers to do. If the famous
P ≠ NP conjecture is true, this includes many useful tasks like finding the
shortest route visiting all locations in some set (the “Travelling Salesman”
problem).

• You can get computers to be very slow if you write your program in an
inefficient way, when there would have been a better way.
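As an illustration of doing a simple thing many times, here is a small simulation sketch. It checks how often a 95% confidence interval from t.test contains the true mean, using many randomly-generated data sets. The function name covers, the sample size, the true mean, and the number of data sets are all choices made just for this example.

```r
# Does the 95% confidence interval from one simulated data set of size n
# contain the true mean, mu?
covers <- function (n, mu) {
    x <- rnorm (n, mean=mu)
    ci <- t.test(x)$conf.int
    ci[1] <= mu && mu <= ci[2]
}

# Try it on 2000 random data sets, and see what fraction of intervals
# contain the true mean.
set.seed(1)
print (system.time (coverage <- mean (replicate (2000, covers(20,5)))))
print (coverage)
```

The fraction found should be close to 0.95. Each call of covers is fast, but the total time grows in proportion to the number of data sets simulated.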
Computing Time and Problem Size
The time it takes for a program to run depends on what computer you run it on.
A low-end laptop computer might take five times longer to run a program than a
high-end desktop computer.
When analysing program speed, we therefore often look not at the actual time,
but at how the time grows with problem size.
Example: Suppose we want to sort a vector of numbers in increasing order,
creating another vector with the sorted list. How might the time for this grow
with the length of the vector, which we’ll call n?

• For many simple methods, the time grows in proportion to n^2.

• Cleverer methods reduce this to a time growing as n log n.

• If we assume that the numbers are integers that aren’t huge, it can be done in
time proportional to n.
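The first and last of these can be sketched as follows. The function names selection_sort and counting_sort are made up for this illustration, and the counting method assumes the numbers are integers from 1 up to some known max_value. R's built-in sort function is included only for comparison.

```r
# A simple method: repeatedly swap the smallest remaining element into place.
# Finding the smallest of the remaining elements takes time proportional to
# how many remain, so the total time grows in proportion to n^2.
selection_sort <- function (v) {
    n <- length(v)
    if (n < 2) return (v)
    for (i in 1:(n-1)) {
        j <- (i-1) + which.min(v[i:n])   # index of smallest element in v[i:n]
        tmp <- v[i]; v[i] <- v[j]; v[j] <- tmp
    }
    v
}

# For integers known to be in 1:max_value: count how often each value occurs,
# then write out each value that many times. The time grows in proportion
# to n (plus max_value).
counting_sort <- function (v, max_value) {
    rep (1:max_value, times = tabulate (v, nbins=max_value))
}

set.seed(1)
v <- sample (1000, 5000, replace=TRUE)
print (system.time (s1 <- selection_sort(v)))
print (system.time (s2 <- sort(v)))
print (system.time (s3 <- counting_sort(v,1000)))
stopifnot (identical(s1,s2), identical(s2,s3))
```

Try doubling the length of v to see how the time for each method changes.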
What Does “Time Growing in Proportion to n” Mean?
To say that the time grows in proportion to n (or to n^2, or n log n), means that
asymptotically, as n becomes larger and larger, the time will grow that way, with
some unknown constant of proportionality.
Here’s an example of a function that asymptotically grows in proportion to n:

[Figure: plot of the function for n from 0 to 100, with the vertical axis also
running from 0 to 100, and a grey reference line.]

The constant of proportionality seems to be 0.5 (grey line has that slope).
But if this is the time for a program to run, that constant will vary from
computer to computer.
Example of Time Growing in Proportion to n
Here’s an example of a simple R function that (pointlessly) counts up to n, and
how its time grows with n:

> count <- function (n) { r <- 0; for (i in 1:n) r <- r + 1; r }


> system.time(count(100000))
user system elapsed
0.017 0.000 0.017
> system.time(count(1000000))
user system elapsed
0.160 0.001 0.163
> system.time(count(10000000))
user system elapsed
1.583 0.013 1.596

The constant of proportionality seems to be about 0.00000016 seconds per
iteration. But it would be
smaller on a faster computer, bigger on a slower one.
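The constant can be estimated by dividing each time by n. This sketch just redoes that arithmetic using the elapsed times shown above (the variable names are made up for the illustration):

```r
n <- c (100000, 1000000, 10000000)
elapsed <- c (0.017, 0.163, 1.596)    # elapsed times in seconds, from above
ratio <- elapsed / n                  # seconds per unit of n
print (ratio)
```

The three ratios are nearly the same, which is what we expect when the time grows in proportion to n. Times you measure on your own computer will give a different constant.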
Example of Time Growing in Proportion to n^2
> sums <- function (v) {
+ r <- numeric(length(v))
+ for (i in 1:length(v)) r[i] <- sum(v[1:i])
+ r
+ }
> system.time(sums(1:1000))
user system elapsed
0.005 0.000 0.005
> system.time(sums(1:2000))
user system elapsed
0.019 0.003 0.024
> system.time(sums(1:4000))
user system elapsed
0.069 0.013 0.082
> system.time(sums(1:8000))
user system elapsed
0.261 0.064 0.325
How could you write a sums function that is faster?
(R’s built-in cumsum function does it the faster way.)
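Here is a sketch of one possible answer (the name sums_fast is made up). Rather than adding up v[1:i] from scratch for every i, it keeps a running total, so each element is added only once, and the time should grow in proportion to n rather than n^2.

```r
sums_fast <- function (v) {
    r <- numeric(length(v))
    total <- 0
    for (i in seq_along(v)) {
        total <- total + v[i]     # running total of v[1], ..., v[i]
        r[i] <- total
    }
    r
}

v <- 1:8000
print (system.time (r <- sums_fast(v)))

# Check that the result agrees with R's built-in cumsum function.
stopifnot (isTRUE (all.equal (r, as.numeric(cumsum(v)))))
```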

Week 13
A Few Final Comments
Here in the last lecture, I’ll mention a few things that we haven’t had time to
really cover. . .

• More on R packages.

• More on testing your program.

• Managing versions of your program, and writing a program together with
other people.

• Programming languages other than R.


R Packages
The basic features of R can be extended using “packages” of function definitions,
data sets, etc.
These packages are available in several ways:

• A few packages come with R and are available for use by default. One such is
the stats package that defines the lm function.

• Some more packages come with R, but have to be loaded manually before you
can use them. An example is the survival package for analysing survival
data. To use it, say library(survival).

• Many other packages (thousands) are available for installation from package
repositories, to which many people have contributed. The CRAN repository
is the best-known, and is the default when you try to install such packages
using the install.packages function.

• You can write your own packages.
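A short sketch of the first three cases. The cars data set used here comes with R. The install.packages call is left commented out, since it needs an internet connection, and the package named in it is just an example.

```r
# The stats package is available by default, so lm can be used immediately.
fit <- lm (dist ~ speed, data=cars)
print (coef(fit))

# The survival package normally comes with R, but must be loaded before use.
# (The check is only so that this example still runs if it is missing.)
if (requireNamespace ("survival", quietly=TRUE)) library(survival)

# A package from CRAN has to be installed once, then loaded in each session:
# install.packages("magrittr")
# library(magrittr)
```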


Some Things Packages May Do
• Provide general extensions to the R language.
Example: magrittr defines an operator %>% for “piping” data from one
function to another.

• Provide convenient ways of interacting with R.
Example: knitr, which we’ve been using to produce convenient output.

• Interface to other software.
Example: foreign provides ways of reading data from other statistics
packages (eg, SAS) into R.

• Provide more elaborate ways of producing graphical output.
Example: ggplot2 is a very popular way of producing graphs.

• Provide additional statistical methods.
Example: tgp implements “Treed Gaussian Process” models, that can model
how a response variable relates to explanatory variables more flexibly than a
linear model.
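As a small illustration of the first kind of package, here is a sketch of how the %>% operator from magrittr can be used. It assumes magrittr has been installed, and does nothing otherwise. The data is made up.

```r
if (requireNamespace ("magrittr", quietly=TRUE)) {
    library(magrittr)
    x <- c (4, 9, 16)
    a <- sum (sqrt (x))              # ordinary nested function calls
    b <- x %>% sqrt() %>% sum()      # the same computation, written as a pipe
    stopifnot (a == b)
    print (b)
}
```

The pipe passes the value on its left as the first argument of the function call on its right, so the steps read from left to right in the order they are done.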
Program Testing
In the assignments, you’ve been creating some tests for the functions you write.
Creating a good set of tests is important when first writing a program, and also
when making changes to the program later, to check that you haven’t broken it.
There are two general kinds of tests one might do:

• Tests that the whole program works as intended.
This is what we really care about. But it may be hard to think of ways to test
everything about a program at this level. And it’s particularly hard to test
programs that produce plots for a user to view, or that interact with a user.

• Tests that individual functions work as intended, including functions that are
just used inside the program (aren’t meant to be used elsewhere).
If we are confident that many of the parts of the program work correctly, we
will be more confident that the program as a whole works correctly.

One aim in testing is to make sure that every bit of code has been used — eg,
that every if statement has been tried with the condition being both TRUE and
FALSE. But that’s not enough to guarantee that the program always works.
Testing isn’t a substitute for careful design and coding.
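Here is a sketch of tests for one small function, written with stopifnot. The function median_no_na is hypothetical, made up for this illustration. The tests are chosen so that each if condition is tried as both TRUE and FALSE.

```r
# Hypothetical function to test: the median of the non-NA values in v,
# or NA if there are no such values.
median_no_na <- function (v) {
    v <- v[!is.na(v)]
    if (length(v) == 0) return (NA)
    v <- sort(v)
    n <- length(v)
    if (n %% 2 == 1) v[(n+1)/2] else (v[n/2] + v[n/2+1]) / 2
}

stopifnot (is.na (median_no_na (c(NA,NA))))       # no non-NA values at all
stopifnot (median_no_na (7) == 7)                 # a single value
stopifnot (median_no_na (c(3,1,2)) == 2)          # odd number of values
stopifnot (median_no_na (c(4,NA,1,3,2)) == 2.5)   # even number, after NA removed
```

If every test passes, nothing is printed. If one fails, stopifnot stops with an error saying which condition was not true.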
Source Code Control
A source code control system manages the files containing your function
definitions, scripts, or documentation.
Here are some things a source code control system lets you do:

• Go back to an earlier version if you find out that some recent changes you
made were a bad idea.

• See what has changed from some earlier version to the current version.

• Create multiple versions of a program, perhaps specialized for slightly
different tasks.

• Merge work on one project that is done by several programmers.

Currently, the most popular source code control system is git. It is supported by
RStudio, or it can be used on its own.
Source Code Repositories
It’s increasingly popular for programs (managed by a source code control system)
to be made available to everyone on source code repositories.
Two popular ones based on git are gitlab.com and github.com.
These repositories support

• The developers uploading programs, including changes to previous versions.

• People downloading the programs, including the revision history if they wish.

• People reporting bugs.

• People submitting changes to programs to be considered by the developers.

Of course, the developers have to provide a license that allows the program to be
used / changed.
Other Programming Languages for Statistics
R is probably the most common programming language used by statisticians.
But there are others.
There are statistical packages that provide programming facilities, such as
• SAS
• Stata
There are several programming languages with wider communities that are
somewhat similar to R, including
• Matlab (and Octave, a free and largely compatible alternative).
• Python
There are also languages centred on symbolic mathematical computation, like
• Maxima
• Maple
• Mathematica
In these languages, you can multiply 2+x by 1+3*x and get 2+7*x+3*x^2.
Compiled Programming Languages
There are also programming languages that are usually compiled, rather than
interpreted as R and the other languages on the previous slide are.
Compilation translates the program to a program in machine language, which the
computer can do directly. In contrast, an interpreter is a program in machine
language (usually compiled from some other language) that looks at a program
and does what it says, which is much slower.
So if you need your program to go really fast, you may want to write it in a
language that can be compiled, rather than in R.
Some common compiled languages:

• C

• C++ (like C but with object-oriented programming facilities)

• Fortran

You can also write just the time-critical part of the program in one of these
languages, and then call that part from an R program.
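This sketch does not show how to write your own C code. It only illustrates the speed difference, by comparing a loop run by the R interpreter with R's built-in sum function, which is implemented in compiled code. The function name loop_sum is made up, and the times will vary from computer to computer.

```r
# Add up the elements of v with a loop run by the R interpreter.
loop_sum <- function (v) {
    s <- 0
    for (x in v) s <- s + x
    s
}

set.seed(1)
v <- runif(1000000)
print (system.time (a <- loop_sum(v)))   # interpreted loop
print (system.time (b <- sum(v)))        # built-in function (compiled code)

# The two results should agree, apart from small rounding differences.
stopifnot (isTRUE (all.equal (a, b)))
```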
Alternative Implementations of R
Several projects are currently in the works to improve on the current
implementation of R (as distributed at http://R-project.org).
These include:

• FastR, supported by Oracle: https://github.com/graalvm/FastR

• Rho, supported by Google: https://github.com/rho-devel/Rho

• pqR, my own effort: http://pqR-project.org
My original aim was to create a faster version of the R interpreter.
I have also begun to extend the R language in ways that make it more useful,
and fix some of its design flaws.
