Csc121 Full Notes
http://www.cs.utoronto.ca/~radford/csc121/
Week 1
Why Learn to Program (in R)?
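[Figure: “Petal Length Versus Sepal Length in Three Iris Species”. Scatter plot of petal length against sepal length, with points coloured by species (Iris setosa, Iris versicolor, Iris virginica) and the fitted regression lines.]
Model for all data:
(Intercept) Sepal.Length
  -7.101443     1.858433
Model for species setosa:
(Intercept) Sepal.Length
  0.8030518    0.1316317
Model for species versicolor:
(Intercept) Sepal.Length
  0.1851155    0.6864698
Model for species virginica:
(Intercept) Sepal.Length
  0.6104680    0.7500808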
The R Script Used (which you’re not expected to understand yet)
# Analyse the relationship of petal length to sepal length in the flowers of
# three iris species, fitting regression models to all data and to each species.
# Plot the sepal and petal lengths for each flower that was measured. Identify
# species by colour. Randomly jitter the data slightly to prevent overlap.
# Show and plot regression line of petal length on sepal length fit to all data.
# Print and plot linear regression lines for petal length on sepal length
# fit to data on each species separately.
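As a rough sketch of a script matching the comments above (an assumption, not the course's actual code), the following uses R's built-in iris data frame. The colour choices and the variable names species_names, species_colours, model_all, and model are hypothetical.

```r
# Sketch only: assumes the built-in 'iris' data frame; colours are hypothetical.
species_names <- levels(iris$Species)          # "setosa" "versicolor" "virginica"
species_colours <- c("red", "green", "blue")

# Plot jittered sepal and petal lengths, coloured by species.
plot (jitter(iris$Sepal.Length), jitter(iris$Petal.Length),
      col = species_colours[as.integer(iris$Species)],
      xlab = "Sepal Length", ylab = "Petal Length",
      main = "Petal Length Versus Sepal Length in Three Iris Species")

# Regression of petal length on sepal length, fit to all data.
model_all <- lm (Petal.Length ~ Sepal.Length, data = iris)
print (coef(model_all))
abline (model_all)

# The same regression, fit to each species separately.
for (i in 1:3) {
    model <- lm (Petal.Length ~ Sepal.Length,
                 data = iris[iris$Species == species_names[i], ])
    cat ("Model for species ", species_names[i], ":\n", sep="")
    print (coef(model))
    abline (model, col = species_colours[i])
}
```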
• Obtaining data from files or databases where it isn’t in the format required,
or needs to be “cleaned up” (eg, inconsistent names for the same thing, some
records with erroneous data, . . . ).
• Research into new statistical methods for general use. They’re not very useful
if they can’t be done on a computer!
• Use R’s “help” facility, the on-line “Introduction to R”, and other on-line or
paper documentation. See the course web page for some links.
• etc.
Two Kinds of Programs
Some programs compute an output from some input.
Examples:
• Input: The age, blood pressure, and cholesterol levels of 1000 people.
Output: Three clusters of people who are similar in these measurements.
Other programs do things, and perhaps also take input and produce output.
Examples:
The procedures in a program do things with data, producing new data, or taking
actions. For example, procedures in a program might do things like:
– add two numbers to get a third number
– re-arrange a list of numbers in increasing order
– display a plot of a set of numbers
– change all the upper-case letters in a document to lower-case
Specifying the procedures that operate on data — sometimes also called
“scripts”, “methods”, or “functions” — is a major part of programming.
Real numbers in R have numeric type (also called “double”, for obscure reasons).
We can write these numbers in mostly familiar fashion:
123
1.234
1.23e-44 ← this means $1.23 \times 10^{-44}$
R can also operate on strings of characters, which are written in single or double
quotation marks:
"x"
"Hello, James."
'say "please"!'
Arithmetic Operations
R can do all the usual arithmetic operations on numbers. You can try them out
by typing expressions at R’s command prompt (“>”):
Note that everything you type after “#” is a comment, that R ignores (but that
people reading what you wrote may find helpful). R also ignores extra spaces (in
most places), but they may make an expression easier to read.
You can ignore the “[1]” that R prints before each result (we’ll see later what it means).
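As an illustration of the points above, here is a short sketch of expressions you might try, written as a script. The particular numbers are arbitrary, and the comments give the value R displays after “[1]”.

```r
2 + 3        # addition: 5
7 - 10       # subtraction: -3
6 * 7        # multiplication: 42
10 / 4       # division: 2.5
2 ^ 10       # raising to a power: 1024
6    *   7   # extra spaces make no difference: still 42
```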
Combining Operations, Parentheses, and Precedence
You can combine operations, using parentheses to indicate which is done first:
> (8 + 2) * 5
[1] 50
> 8 + (2 * 5)
[1] 18
You can omit parentheses if the precedence of the operators would produce the
desired result. Addition and subtraction have lower precedence than multiplication
and division, which have lower precedence than raising to a power:
> 8 + 2*5 # Same as 8 + (2*5)
[1] 18
> 3 * 5^2 # Same as 3 * (5^2)
[1] 75
Operators (except “^”) with the same precedence are applied leftmost first:
> 2 - 1 + 9 # Same as (2 - 1) + 9
[1] 10
> 50 / 5*10 # Same as (50 / 5)*10, NOT 50 / (5*10)
[1] 100
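The exception for “^” mentioned above is that repeated powers are applied rightmost first. A small sketch (the numbers are arbitrary):

```r
2^3^2      # same as 2^(3^2), which is 2^9 = 512
(2^3)^2    # 8^2 = 64
-2^2       # same as -(2^2), which is -4, since ^ is done before the minus sign
```

When in doubt about precedence, adding parentheses does no harm.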
More on Typing Expressions Into R
If you need to split an expression between lines, make sure the first line doesn’t
look like a whole expression on its own.
Example:
> 1234 + 5678 + 1111 + 2222 + 3333 + 567890 *
+ 876 / (1 + 2 + 3 + 4) - 888^2
[1] 48972198
The first line, “1234 + 5678 + 1111 + 2222 + 3333 + 567890 * ”, isn’t a
valid expression — there’s nothing after the “*”. To tell you that more is needed,
R changes the prompt from “>” to “+” (this “+” has nothing to do with addition).
If what you type doesn’t make sense, R displays an error message (and ignores
what you typed).
Example:
> 2 * (3 + 4))
Error: unexpected ')' in " 2 * (3 + 4))"
This kind of error is called a syntax error. R doesn’t even try to do anything,
because it can’t figure out what you meant.
Mathematical Functions
R can also compute mathematical functions, such as logarithms and cosines:
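A few calls of this kind, as a sketch (the arguments are arbitrary; all of these are standard R functions):

```r
log(10)        # natural logarithm of 10, about 2.302585
log10(1000)    # base-10 logarithm: 3
cos(0)         # cosine of 0 radians: 1
sqrt(16)       # square root: 4
exp(1)         # e to the power 1, about 2.718282
```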
You can put two or more strings together into one string, or extract part of a string:
> substring("12 Jan 2016", 4, 6) # Get the 4th through 6th characters
[1] "Jan"
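Putting strings together can be done with R's paste function. A minimal sketch, with made-up strings and a hypothetical variable called title:

```r
paste("Hello", "world")             # "Hello world" (a space is put between them)
paste("Jan", "2016", sep="-")       # "Jan-2016"
paste0("x", "y", "z")               # "xyz" (no separator)

month <- substring("12 Jan 2016", 4, 6)
title <- paste("Data for", month)   # "Data for Jan"
```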
Why do we need these operations, when people are good at combining and
extracting characters without the help of a computer? They’re useful as parts of
larger programs — for example, to build suitable titles and axis labels for plots.
Saving Values in Variables
You can save a value in a variable, giving it some name. You then can use that
name to refer to the value in the variable later:
You can see what value is in a variable by just typing its name:
> x
[1] 579
Note: You can use “=” rather than “<-” to assign a value to a variable, and that
is what is used in some other programming languages. But “<-” looks like an
arrow, which is more descriptive of what happens: x <- 9 moves the value 9 into
the variable x. I recommend that you use “<-”.
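A small sketch of saving and then using a value (the variable y is hypothetical):

```r
x <- 123 + 456    # compute 579 and save it in x
x                 # typing the name shows the value: 579
y <- x * 2        # x can be used in later expressions; y is now 1158
y
```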
Names for Variables
A variable name can be any sequence of letters, digits, “.”, and “_”, except that
it can’t start with “_”, or with a digit, or with “.” followed by a digit.
Choosing good names for variables helps you (and others) remember what they
are for.
Examples:
Using “.” in a variable name (like in this.year above) is a bit archaic, and
clashes with usage in other programming languages. It’s better to use “_”. It’s
also good to be consistent, whatever you do. (Not like above!)
Note! xy is the name of a single variable, not x times y (which we write as x*y).
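To illustrate these naming points, here is a sketch with made-up variables. It deliberately mixes the “.” and “_” styles in the first two lines, which is the kind of inconsistency to avoid.

```r
this.year <- 2016            # legal, but uses the older "." style
next_year <- this.year + 1   # the "_" style recommended above

x <- 3
y <- 4
xy <- 100                    # a separate variable, unrelated to x and y
x*y                          # 12
xy                           # 100
```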
Changing the Value in a Variable
The value stored in a variable can be changed. When you refer to a variable you
always get the last value stored into it:
> My_age <- 12 # Set My_age to 12
> My_age - 19
[1] -7
> My_age <- 22 # Change My_age to now be 22 (value 12 forgotten)
> My_age - 19
[1] 3
Changing a variable’s value doesn’t change things previously computed from it:
> My_age <- 12
> h <- My_age - 19 # Set h based on My_age being 12
> h
[1] -7
> My_age <- 22 # When we change My_age to be 22, the value
> h # of h doesn’t change
[1] -7
> h <- My_age - 19 # But we can re-compute h with the new My_age
> h
[1] 3
Vectors
R lets you put together several data values of the same type into a vector.
The order within a vector matters — c(3,4) and c(4,3) are not the same thing.
Repetitions also matter — c(3,3) is not the same as c(3,3,3).
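Vectors are created with the c function. A minimal sketch (the values are arbitrary; identical is a standard R function that tests whether two objects are exactly the same):

```r
v <- c(3, 4, 7)               # a numeric vector with three elements
w <- c("red", "green")        # a vector of two strings
length(v)                     # 3

identical(c(3,4), c(4,3))     # FALSE: the order matters
identical(c(3,3), c(3,3,3))   # FALSE: repetitions matter
```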
Combining Vectors
The “c” function can also create vectors by combining other vectors:
Note how R prints long vectors that take more than one line: each line starts
with the index of the next element printed, in brackets. (That’s also why R
prints “[1]” before the answer when it’s a single number.)
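A sketch of combining vectors with c (the names a, b, and ab are hypothetical):

```r
a <- c(1, 2, 3)
b <- c(10, 20)
ab <- c(a, b, 99)   # combines two vectors and a single number: 1 2 3 10 20 99
length(ab)          # 6

100:130             # a vector long enough that R may print it on several lines,
                    # each line starting with an index in brackets
```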
Plotting Data Stored in Vectors
> x <- c (4.1, 4.9, 5.3, 4.3, 3.5, 3.0, 3.1, 2.8)
> plot (x) # plot data x as points, against indexes 1, 2, ..., 8
[Plot: the eight values of x shown as points, with Index (1 to 8) on the horizontal axis and x on the vertical axis.]
> z <- c (4.3, 4.8, 5.1, 4.2, 3.1, 3.2, 3.0, 2.7)
> lines (z, col="red") # add data z to the plot, in red
[Plot: the same points, with the values of z added as a red line.]
Arithmetic on Vectors
R can do arithmetic on two vectors of the same length, applying the arithmetic
operation to corresponding elements:
> x <- c (4.1, 4.9, 5.3, 4.3, 3.5, 3.0, 3.1, 2.8)
> z <- c (4.3, 4.8, 5.1, 4.2, 3.1, 3.2, 3.0, 2.7)
> x - z
[1] -0.2 0.1 0.2 0.1 0.4 -0.2 0.1 0.1
One application is to plot the differences between two data vectors (that are the
same length):
[Plot: the differences x - z plotted against Index (1 to 8).]
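A plot of this kind could be produced along the following lines (a sketch, reusing the x and z vectors from above):

```r
x <- c (4.1, 4.9, 5.3, 4.3, 3.5, 3.0, 3.1, 2.8)
z <- c (4.3, 4.8, 5.1, 4.2, 3.1, 3.2, 3.0, 2.7)
plot (x - z)    # plot the eight differences against indexes 1, 2, ..., 8
```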
Arithmetic on a Vector and a Scalar
You can also do an arithmetic operation on a vector and a single number:
> x
[1] 4.1 4.9 5.3 4.3 3.5 3.0 3.1 2.8
> x + 1
[1] 5.1 5.9 6.3 5.3 4.5 4.0 4.1 3.8
> 10 * x
[1] 41 49 53 43 35 30 31 28
In fact, R will do arithmetic on any two vectors, repeating the shorter one to
reach the length of the longer:
> x + c(100,0)
[1] 104.1 4.9 105.3 4.3 103.5 3.0 103.1 2.8
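Another sketch of this repetition (often called “recycling”), with arbitrary numbers and a hypothetical variable r:

```r
r <- c(1, 2, 3, 4, 5, 6) + c(10, 20)   # c(10,20) is used three times over
r                                      # 11 22 13 24 15 26

c(1, 2, 3, 4) * 2                      # a single number is just a vector of length one: 2 4 6 8
```

If the longer length is not a multiple of the shorter one, R still does the arithmetic but prints a warning, which usually signals a mistake.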
> x <- c (4.1, 4.9, 5.3, 4.3, 3.5, 3.0, 3.1, 2.8)
> x[2] <- 7.7
> x
[1] 4.1 7.7 5.3 4.3 3.5 3.0 3.1 2.8 # x[2] changed from 4.9 to 7.7
> x <- c (4.1, 4.9, 5.3, 4.3, 3.5, 3.0, 3.1, 2.8)
>
> y <- x # The vector in y is now the same as the vector in x
>
> x[2] <- 7.7 # We change the second element in the vector x
>
> y # But y is still the same as before
[1] 4.1 4.9 5.3 4.3 3.5 3.0 3.1 2.8
An Example: Reading Data, Plotting it, and Editing It
Data editing is one use for getting and changing individual numbers in a vector.
Let’s read a vector of numbers from a file on my web site (the scan function will
do this if it’s a simple file of numbers), then plot the data to see what it looks like:
[Plot: data plotted against Index.]
The 12th data point looks like it might be wrong. Let’s see exactly what it is:
> data[12]
[1] 77
Maybe there’s a missing decimal point? Could the correct value be 7.7?
. . . Example Continued
Let’s change the 12th data point assuming it’s missing a decimal point and see
what the plot looks like then:
[Plot: data plotted against Index, after changing the 12th data point.]
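The whole editing session can be sketched as follows. The numbers here are made up (an assumption standing in for the course data file), and they are read with scan from a text string rather than from a file or URL.

```r
# Made-up data for illustration only; the 12th value has a "missing decimal point".
data <- scan (text="6.1 7.3 5.9 8.2 7.7 6.6 7.0 8.1 6.4 7.5 6.9 77 7.2 6.8")
plot (data)                 # the 12th point stands out
data[12]                    # look at the suspicious value: 77
data[12] <- data[12] / 10   # assume the decimal point was dropped
plot (data)                 # plot again with the edited value
```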
Week 2
Typing Stuff into R Can be Good . . .
You can learn a lot by seeing what happens when you type an R command.
Interactive use of R is also a good way to start exploring a new data set.
For instance, you can
– make sure it’s actually the data you were told it was
– play around with how best to plot the data
– look at plots to see if there are any obviously erroneous data points
– see if relationships between variables seem to be roughly linear
. . . Or Bad
But when you’re seriously analysing data, you don’t want to just type stuff, since
– it’s tedious to type things again and again
– it’s easy to make a mistake when typing something and not notice
– you won’t be able to remember exactly what you typed
– other people won’t be able to replicate your analysis
– if you decide to change the analysis slightly, you have to do it all again
– if you get another similar data set, you have to do it all again
Instead, you want to write a program to analyse your data, saving your program
in a text file. Once you’ve written it
– you can look it over carefully to make sure it’s correct
– make changes to it without starting over again
– run it on as many data sets as you have
– share it with someone else
The ability to write readable and reliable programs is one big advantage of using
R rather than less flexible analysis tools, such as spreadsheets.
Creating and Using R Scripts
One kind of R program is a text file containing R commands that you can ask
R to perform — much as if you had typed them at the R prompt. This kind of
program is called an R script.
You can create an R script with whatever your favourite text editor is (but not
with a word processor, unless you save the document in .txt format).
RStudio has a built-in text editor, which may be the most convenient one to use.
Once you’ve created a script, you can get R to read it — and do the commands it
contains — with the source function, giving it the name of the script file:
> source("myscript.r")
RStudio has a button you can click to do this for a script created with its editor.
If you type the command yourself, you may have to use the full path to the file
(such as /Users/mary/myscript.r).
If the script doesn’t work as desired, you can change it in the editor (and save the
new version), and then get R to do it again, until you have “debugged” it.
Example Script: Read Data, Compute its Mean and SD, Plot it
# Compute the sample mean and sample standard deviation of the data.
m <- mean(data)
s <- sd(data)
# Plot the data points, along with a horizontal line at the mean, two
# dashed lines at the mean plus and minus the standard deviation, and
# two dotted lines at mean plus and minus twice the standard deviation.
plot(data)
abline (h=m)
abline (h=c(m-s,m+s), lty="dashed")
abline (h=c(m-2*s,m+2*s), lty="dotted")
How to Run This Script, and the Plot It Produces
This script is stored in a file on the course web page. You can run scripts
obtained from the internet using the URL rather than a file name:
> source("http://www.cs.utoronto.ca/~radford/csc121/demo-script2a.r")
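[Plot: data against Index (0 to 50), with a solid horizontal line at the mean, dashed lines at the mean plus and minus the standard deviation, and dotted lines at the mean plus and minus twice the standard deviation.]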
Note: For security reasons, don’t run a script from a URL at a website you
don’t trust! Instead, download the script and verify it’s OK before running it.
Looking at the Variables Set in the Script
The variables used by the R script will still exist after it has run, and can be
examined:
> source("http://www.cs.utoronto.ca/~radford/csc121/demo-script2a.r")
Read 50 items
> data
[1] 2.170 1.985 1.616 3.181 -0.978 1.597 0.862 -0.186 0.956
[10] 0.169 -0.403 1.107 0.965 2.155 -0.141 1.647 3.007 0.631
[19] 0.393 2.883 1.588 -0.228 1.078 0.355 -0.113 0.852 0.913
[28] 2.876 -0.499 -1.607 1.749 0.167 1.366 1.976 2.907 3.470
[37] 1.162 0.871 0.506 3.138 0.920 0.743 -1.242 -0.678 1.104
[46] 0.817 2.226 1.521 1.915 0.095
> m
[1] 1.07128
> s
[1] 1.205056
Ways to Run an R Script
As was mentioned, you can run an R script in the file myscript.r with the
command source("myscript.r").
But this isn’t quite the same as typing the contents of myscript.r. The
commands in myscript.r aren’t displayed, and you don’t see the value of each
expression. (Though the print function can be used to explicitly display values.)
If you want to see everything, much as if you had typed the commands, use
> source("myscript.r",echo=TRUE)
In RStudio, there is a button for sourcing the script being edited, with an option
for whether echo=TRUE.
You can run a script non-interactively (plots going to the file Rplots.pdf) with
the Unix/Linux command
Rscript myscript.r
Later, we’ll see how to run scripts and get pretty output using the spin function
in the knitr package.
Uses and Limitations of Scripts
R scripts are a good way to do one thing — such as produce output and plots
from analysing one data set in one way. The script helps document exactly what
you did for later reference.
But a script isn’t a good way of doing many things. For example, the source
of the data set is fixed in the R script shown earlier.
It’s also not very convenient to take the output of an R script and do more with it.
It is possible to change what a script does by setting variables before you run the
script. And the script can set variables to values that can be looked at later.
But there is a better way to write programs that can do many things, and be
used as part of a larger program — using functions.
Programming by Defining Functions
An R function specifies how to compute an output — the value of the function
— from one or more inputs — the arguments (or parameters) of the function.
Within a function, the arguments are referred to by their names. Each time the
function is used (“called”), values for the arguments are specified, and the
argument names will refer to those values during that use of the function.
The next time the function is called, the arguments may have different values.
A function will compute a value from its arguments. When the arguments are
different, in a different call of the function, the value may also be different.
The value computed by a function call can be assigned to a variable, or used in
arithmetic, just like for R’s built-in functions like log and sin.
When defining a function, you can make use of other functions you have defined.
In this way, large, complex programs can be built from simpler parts, which helps
make them easier to understand.
Defining and Using a Simple Function
Let’s define a function called sin_deg that computes the sine of an angle specified
in degrees, rather than in radians (as for sin):
> sin_deg <- function (angle) sin(angle*pi/180)
This sets the variable sin_deg to the function specified by the expression
function (angle) sin(angle*pi/180), in the same way we can set a variable
to a number or a string. This function has one argument, which is referred to by
the name angle. The value of the function is computed as sin(angle*pi/180):
multiplying by pi/180 converts the angle from degrees to radians.
(The variable pi is pre-defined by R as π = 3.14159 . . .)
We can then use this function just like we can use R’s built-in functions:
> sin_deg(30)
[1] 0.5
> sin_deg(45)
[1] 0.7071068
> 100 + sin_deg(90)
[1] 101
Within sin_deg, the argument named angle will have values 30, 45, and 90 for
the three uses of sin_deg above.
A Function With Two Vector Arguments
Functions can have more than one argument, and the arguments can be vectors
rather than single numbers.
Here’s a function that computes the distance between two points in a plane, with
each point specified by a vector of two coordinates:
Within this function, the two arguments (the points we want to compute the
distance between) are referred to by the names a and b.
In this example, the steps inside the function definition are indented by four
spaces, so that it’s easier to see that they are part of the distance function.
This is a good practice, which you should follow.
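A function matching this description might look like the sketch below. The name distance and the argument names a and b come from the text; the body (ordinary Euclidean distance, with hypothetical variables dx and dy) is an assumption.

```r
distance <- function (a, b) {
    dx <- a[1] - b[1]      # difference in the first coordinates
    dy <- a[2] - b[2]      # difference in the second coordinates
    sqrt (dx^2 + dy^2)     # the value of the last expression is the function's value
}

distance (c(0,0), c(3,4))  # 5
```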
Computing the Perimeter of a Diamond
Here’s a function that computes the total length of the four sides of a diamond
that has widths width1 and width2 for its two axes:
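diamond_perimeter <- function (width1, width2) {
    # Vertices, assuming the diamond is centred at the origin
    vertex1 <- c (width1/2, 0);   vertex2 <- c (0, width2/2)
    vertex3 <- c (-width1/2, 0);  vertex4 <- c (0, -width2/2)
    ( distance(vertex1,vertex2) +
      distance(vertex2,vertex3) +
      distance(vertex3,vertex4) +
      distance(vertex4,vertex1) )
}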
(As you may realize, the four sides are actually all the same length, so this could
be simplified — but we’ll pretend we don’t realize that for this example.)
Using Functions Defined in a File from an R Script
We usually don’t type functions into R — they’re inconveniently long, and we
may wish to change them without having to re-type everything.
Instead, we store the definitions in a file, just as for an R script.
When we want to use these functions in an R script, we use source at the start of
the script to read these function definitions into R.
source("distfuns.r")
big_diamond_perim <- diamond_perimeter (12.1, 4.7)
small_diamond_perim <- diamond_perimeter (0.4, 0.9)
print(big_diamond_perim)
print(small_diamond_perim)
By putting the definitions of these functions in a file separate from the script that
uses them, we can easily use the same functions in other scripts as well.
Functions that Do Things
The purpose of the functions in the previous examples is to compute some output
value from the inputs given as arguments. We can also define functions whose
purpose is to do something, instead of (or in addition to) computing something.
Here’s an example:
This function has no arguments, and produces no value as output. It just reads a
line of text typed by the user, and then prints it three times (followed by an
end-of-line marker, which is written as "\n").
For example:
> parrot()
Hello!
Hello! Hello! Hello!
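A function behaving as described could be written as in the following sketch. The name parrot comes from the example above; the body, using R's readline and cat functions and a hypothetical variable line, is an assumption.

```r
parrot <- function () {
    line <- readline()              # read one line of text typed by the user
    cat (line, line, line, "\n")    # print it three times, then end the line
}
```

Calling parrot() and typing Hello! would then print Hello! three times on one line.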
Example: Plotting Data with Mean and SD
# PLOT DATA VECTOR SHOWING MEAN AND STANDARD DEVIATION. Plots a vector
# of data points, along with horizontal lines showing the mean (solid
# line), the mean +- sd (dashed lines), and the mean +- 2*sd (dotted
# lines). The single argument must be a numeric vector. No return value.
plot_mean_sd <- function (data) {   # hypothetical name; the original name is not shown
    m <- mean(data)
    s <- sd(data)
    plot(data)
    abline (h=m)
    abline (h=c(m-s,m+s), lty="dashed")
    abline (h=c(m-2*s,m+2*s), lty="dotted")
}
Using this Function in a Script
We can put this function definition in a script called demo-funs2b.r, and then
use it (twice) in another script, demo-script2b.r, which starts by reading in the
script that defines the function:
source("http://www.cs.utoronto.ca/~radford/csc121/demo-funs2b.r")
Week 3
Making Functions Do Different Things, Using if
When you call a function with different values for its arguments, it can compute
different return values, or plot different data. That’s more useful than a script
that computes or does only one thing.
But what if the way the return value should be computed, or data should be
plotted, depends on the arguments?
We can use if to do this:
# Plot data with lines, plus dots if no more than 100 points.
plot_data <- function (data) {
if (length(data) > 100)
plot (data, type="l") # Plot lines only
else
plot (data, type="b") # Plot lines, plus dots at the points
}
Comparisons and the Logical Data Type
In these if expressions, which thing to do is determined by comparing two numbers.
Comparisons in R produce values of logical data type — either TRUE or FALSE.
Here are some examples:
> a <- 12
> a < 10 # "less than" comparisons
[1] FALSE
> a < 20
[1] TRUE
> a > 0 # "greater than" comparisons
[1] TRUE
> 10 > a
[1] FALSE
> a == 12 # "equals" comparisons - note that it uses ==, not just =
[1] TRUE
> a == 9
[1] FALSE
> a != 9 # "not equal" comparison
[1] TRUE
More Comparisons
> a <- 12
> a >= 12 # "greater or equal" comparisons
[1] TRUE
> 13 >= a
[1] TRUE
> a <= 11 # "less or equal" comparison
[1] FALSE
This is true for the assignment operator, <-, as well. It’s a good idea to always
put spaces around the <- operator, however, because it makes programs easier to
read. Spaces around other operators can sometimes improve readability too.
Compare the following:
> abc[123]<-xyz+456*pqr
> abc[123] <- xyz + 456*pqr
Warning: The expression a<-9 assigns the value 9 to the variable a. If you want
to compare the value in a to the number -9, you must put a space between < and -:
> a < -9
[1] FALSE
More on How if Works
An if expression has the form
    if (condition) true-option else false-option
The condition produces a TRUE or FALSE value. If the value is TRUE, the
true-option expression is done; if FALSE, the false-option expression is done.
The true-option and false-option expressions can have several steps, enclosed
between { and }.
When the expression is evaluated for what it does, rather than producing a value,
the else part can be omitted — that’s the same as making false-option be { },
which does nothing.
It’s often useful for false-option to be another if expression. An example:
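A sketch of such a chain of if expressions (the function name describe_size and the thresholds are hypothetical):

```r
describe_size <- function (n) {
    if (n < 10)
        "small"
    else if (n < 100)      # only looked at when n < 10 is FALSE
        "medium"
    else
        "large"
}

describe_size (7)      # "small"
describe_size (42)     # "medium"
describe_size (1000)   # "large"
```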
A for loop has the form
    for (variable in vector) body
The body can be one statement, or several enclosed in { and }. The for loop does
the body as many times as there are elements in vector, with variable set to each
element in turn.
Here’s an example:
> 1:5
[1] 1 2 3 4 5
Notice the way we count how many elements were below zero using the below_0
variable. This variable is set to zero before the loop. Inside the loop, whenever an
element is found to be below zero, the count is increased by the assignment
below_0 <- below_0 + 1
This assigns a new value to below_0 that’s equal to the old value of below_0 plus 1.
Later, we’ll see how we can write this function more easily, without a loop, using
more advanced vector-handling facilities. But avoiding loops isn’t always possible.
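A function that counts elements below zero in the way just described might look like this sketch. The variable below_0 is the one named above; the function name count_below_0 and the argument name vec are hypothetical.

```r
count_below_0 <- function (vec) {
    below_0 <- 0                        # no negative elements seen yet
    for (x in vec) {
        if (x < 0)
            below_0 <- below_0 + 1      # one more element below zero
    }
    below_0                             # the final count is the value returned
}

count_below_0 (c(3, -1, 0, -7, 2))      # 2
```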
When You Don’t Know How Many Times. . . Using while
A for loop repeats its body only as many times as the length of the vector it
is given at the start. But sometimes you can’t know at the start how many
repetitions will be needed.
Instead, we can use a while loop. It has the form
    while (condition) body
This will repeat body as many times as necessary, until condition is FALSE. If
condition is FALSE at the start, body is not done even once.
Here’s an example, which searches for the smallest integer, i, greater than one, for
which $i^{20}$ is less than $e^i$:
> i <- 2
> while (i^20 >= exp(i)) i <- i + 1
> i
[1] 90
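Another sketch of a while loop whose number of repetitions isn't known in advance (the function name halvings is hypothetical): it counts how many times a number can be halved before it drops below 1.

```r
halvings <- function (n) {
    count <- 0
    while (n >= 1) {         # keep going as long as n is at least 1
        n <- n / 2
        count <- count + 1
    }
    count
}

halvings (10)     # 10, 5, 2.5, 1.25 are all at least 1, so the result is 4
halvings (0.5)    # the condition is FALSE at the start, so the body is never done: 0
```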
Now You Can Do Anything!
With what you now know about R programming, for anything that can possibly
be computed, you can (in theory) write a program that can compute it!
• You know about vectors, which as they get longer (eg, by putting shorter
ones together with c (..., ...)) will hold more and more data, with no
upper limit (in theory — in practice you run out of memory on your
computer at some point).
• You know about while loops, which can repeat operations as many times as
necessary, with no upper limit (in theory — in practice your computer will
wear out and fail after some number of years of computing).
The technical term for this is that the R language (even just the part you know
now) is “Turing complete” (after famous computer scientist Alan Turing, who
formalized the notion of what can and cannot be computed).
So That’s the End of the Course?
Now that you know how to compute anything that’s computable, what’s left to
do in the course?
• You may “know” how to do any computation, but you still need to develop
the skills that will help you to actually do it. This takes both instruction
and practice, practice, practice, . . .
• Once you learn more about R, you’ll know how to do some computations
more easily than you could do them using only what you know now.
• You’ll learn about some R features that aren’t strictly computational, but are
very useful (such as how to put titles on plots).
• We’ll talk about how to write programs that run faster — so you won’t have
to wait hours or days for the results.
• We’ll talk about how to write programs that are easier for yourself and other
people to understand.
• We’ll talk about how to test that your programs actually work correctly.
• We’ll talk about how to keep track of different versions of your programs, as
you change them to make them better, or work for various different problems.
CSC 121: Computer Science for Statistics
Week 4
Combining Data of Different Types in a List
We’ve seen how we can put several numbers into a vector of numbers. Or we can
put several strings into a vector of strings. But what if we want to combine both
types of data? Let’s try. . .
> c(123,"fred",456)
[1] "123" "fred" "456"
R converts the numbers to character strings, so that the elements of the vector
will all be the same type (character).
But we can put together data of different types in a list:
> list(123,"fred",456)
[[1]]
[1] 123
[[2]]
[1] "fred"
[[3]]
[1] 456
Lists Can Contain Anything
Elements of a list can actually be anything, including vectors of different lengths:
> list (1:4, 3:10)
[[1]]
[1] 1 2 3 4
[[2]]
[1] 3 4 5 6 7 8 9 10
You can even put lists within lists (though these are hard to read when printed):
> list(4,list(5,6))
[[1]]
[1] 4
[[2]]
[[2]][[1]]
[1] 5
[[2]][[2]]
[1] 6
Extracting and Replacing Elements of a List
You can get a single element of a list by subscripting with the [[ . . . ]] operator:
> L <- list (c(3,1,7), c("red","green"), 1:4)
> L[[2]]
[1] "red" "green"
> L[[3]]
[1] 1 2 3 4
You can replace elements the same way. Continuing from above. . .
> L[[3]] <- c("x","y","z")
> L
[[1]]
[1] 3 1 7
[[2]]
[1] "red" "green"
[[3]]
[1] "x" "y" "z"
Notice that the new value can have a type different from that of the old value.
Looking at All Elements of a List; Extending Vectors
You can look at all elements of a list with the for statement, using length to
find out how many elements there are.
Suppose we have a list of vectors of strings or numbers. For example, we might
create such a list as follows:
The following will create a single vector of strings, called v, containing all the
elements of all the vectors from the list L:
Note how we can start with a vector with no elements, and then extend it using
the c function. Also note how the vector of numbers was automatically converted
to a vector of strings, so they could be combined with a string vector.
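The steps described above can be sketched as follows. The contents of the list L are made up; v starts as an empty vector and is extended one list element at a time.

```r
L <- list (c("a","b"), c(1.5, 2), "xyz")   # hypothetical list of vectors

v <- c()                       # a vector with no elements
for (i in 1:length(L))
    v <- c (v, L[[i]])         # extend v with the i'th vector in the list

v                              # "a" "b" "1.5" "2" "xyz"
```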
Extending Lists
You can also build up lists starting with a list containing zero elements, which we
can create with list().
One way to extend the list is to just assign to an element that doesn’t exist yet
(usually the one just after the last existing element):
[[2]]
[1] TRUE
[[3]]
[1] "hello"
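A sketch of building up a list this way (the list name M and the value of its first element are hypothetical):

```r
M <- list()                    # a list with zero elements
M[[1]] <- 3.5                  # assigning to element 1 makes the list length 1
M[[2]] <- TRUE
M[[3]] <- "hello"
M[[length(M)+1]] <- 1:3        # always extends just past the current last element
length(M)                      # 4
```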
> a <- 10
> a < 3
[1] FALSE
> a < 30
[1] TRUE
We can save logical values in variables, and then use them as if or while
conditions:
if (next_value >= 0)
if (next_value <= 100)
cat("Next value is OK\n")
Instead, we can do this with just one if by using R’s logical AND operator,
which is written &&:
An expression such as X && Y produces TRUE only if X and Y are both TRUE, and
FALSE if either (or both) of X and Y are FALSE.
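The range check above, rewritten with a single if, might look like this sketch (the value given to next_value and the variable in_range are hypothetical):

```r
next_value <- 42

if (next_value >= 0 && next_value <= 100)
    cat("Next value is OK\n")

in_range <- next_value >= 0 && next_value <= 100   # a logical value saved in a variable
in_range                                           # TRUE
```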
The Logical “OR” Operator — ||
Similarly, R has a logical OR operator, written ||.
An expression such as X || Y produces TRUE if either X or Y (or both) is TRUE,
and FALSE if both X and Y are FALSE.
We could use it to print a message if the number in next_value is not in the
range 0 to 100:
The && and || operators can both be used in a condition, with && having higher
precedence.
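A sketch of the out-of-range check and of the precedence rule (the value of next_value and the message text are hypothetical):

```r
next_value <- 150

if (next_value < 0 || next_value > 100)
    cat("Next value is out of range\n")

TRUE || FALSE && FALSE      # && is done first: TRUE || (FALSE && FALSE), which is TRUE
(TRUE || FALSE) && FALSE    # parentheses change the result to FALSE
```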
Week 5
Making Vectors by Repetition
As you may recall from a previous lab exercise, you can make a vector in R by
repeating a single value or a vector of values. For example:
Instead of saying how many times to repeat, you can instead say what the final
length should be:
Another option is to say how many times each element should be repeated
immediately:
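The three ways of using R's rep function described above can be sketched as follows (the values repeated are arbitrary):

```r
rep (7, 3)                       # repeat a single value: 7 7 7
rep (c(1,2), 3)                  # repeat a whole vector three times: 1 2 1 2 1 2
rep (c(1,2), length.out=5)       # repeat until the result has length 5: 1 2 1 2 1
rep (c(1,2), each=3)             # repeat each element three times in a row: 1 1 1 2 2 2
```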
> 1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
If the first operand of : is greater than the second, the sequence it creates will go
backwards:
> 20:1
[1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
The seq function is more flexible. It can create sequences of numbers that differ
by an amount other than one:
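A few sketches of seq (the end points and step sizes are arbitrary):

```r
seq (0, 1, by=0.25)           # 0.00 0.25 0.50 0.75 1.00
seq (10, 1, by=-3)            # a negative step goes backwards: 10 7 4 1
seq (0, 100, length.out=5)    # say how many values you want instead: 0 25 50 75 100
```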
Note: The c function combines single values or vectors to make a bigger vector.
If you already have the vector you want, you don’t have to use c!
For example, the use of c in all the following is unnecessary.
> c(5)
[1] 5
> c(1:5)
[1] 1 2 3 4 5
> rep(c(5),3)
[1] 5 5 5
Matrices
In R, the elements of a vector can be arranged in a two-dimensional array, called
a matrix.
You can create a matrix with the matrix function, giving it a vector of data to fill
the matrix (down columns), which is repeated automatically if necessary:
> matrix (3, nrow=2, ncol=2)
[,1] [,2]
[1,] 3 3
[2,] 3 3
> matrix (1:6, nrow=2, ncol=3)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
You can fill in the data by row instead if you like:
> matrix (1:6, nrow=2, ncol=3, byrow=TRUE)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Treating Matrices Mathematically
R has operators that treat a matrix in the mathematical sense as in linear
algebra. For example, you can do matrix multiplication with the %*% operator:
> A <- matrix(c(2,3,1,5),nrow=2,ncol=2); A
[,1] [,2]
[1,] 2 1
[2,] 3 5
> B <- matrix(c(1,0,2,1),nrow=2,ncol=2); B
[,1] [,2]
[1,] 1 2
[2,] 0 1
> A %*% B # This multiplies A and B as matrices
[,1] [,2]
[1,] 2 5
[2,] 3 11
> A * B # This just multiplies element-by-element
[,1] [,2]
[1,] 2 2
[2,] 0 5
Treating Matrices Just as Arrays of Data
You can instead just consider a matrix to be a convenient way of laying out your
data, not as an object in linear algebra.
For this purpose, it’s useful that you can create matrices with data other than
numbers:
Similarly, rbind can put together two matrices with the same number of columns.
You can also use cbind or rbind to combine a matrix with a vector, which is
treated like a matrix with one row or one column:
> rbind(X,c(10,20,30))
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
[3,] 10 20 30
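A sketch of these operations, assuming X is the 2-by-3 matrix of the numbers 1 to 6 (which matches the output above); the character matrix Y is hypothetical:

```r
X <- matrix (1:6, nrow=2, ncol=3)
Y <- matrix (c("a","b","c","d"), nrow=2, ncol=2)   # a matrix of strings

dim (cbind (X, X))              # side by side: 2 rows, 6 columns
dim (rbind (X, c(10,20,30)))    # vector added as a new row: 3 rows, 3 columns
dim (cbind (X, c(10,20)))       # vector added as a new column: 2 rows, 4 columns
```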
Example: Plotting a Function of Two Arguments
One use of matrices is in plotting functions or data in three dimensions.
Here, we compute values of the function $\cos\left(8\sqrt{x^2+y^2}\right)$ for a grid of values for
x from −1 to +1, and a grid of values for y from 0 to 2.5, storing these values in a
matrix called funvals. The grid points are spaced apart by 0.01.
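A minimal sketch of how funvals could be filled in with nested for loops. The names gridx, gridy, and funvals are the ones used in the persp call below; the formula is an assumption, read here as cos(8*sqrt(x^2+y^2)).

```r
gridx <- seq (-1, 1, by=0.01)       # grid of x values
gridy <- seq (0, 2.5, by=0.01)      # grid of y values

funvals <- matrix (0, nrow=length(gridx), ncol=length(gridy))
for (i in 1:length(gridx))
    for (j in 1:length(gridy))
        funvals[i,j] <- cos (8 * sqrt (gridx[i]^2 + gridy[j]^2))   # assumed formula
```

Rows of funvals correspond to values of x and columns to values of y, which is the layout persp expects.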
> persp(gridx,gridy,funvals,phi=40,theta=20,shade=0.75,border=NA)
[Figure: perspective (3D surface) plot of funvals over gridx and gridy.]
A Contour Plot of the Function
Another way to display a function or data is with a contour plot, which we can
produce as follows:
[Figure: contour plot of the function, with contour lines labelled from −0.9 to 0.9 and the vertical axis running from 0.0 to 2.5.]
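R's standard contour function draws plots of this kind. The sketch below recomputes the grid using outer, which applies a function to every combination of an x value and a y value; the formula is an assumption about the function being plotted.

```r
gridx <- seq (-1, 1, by=0.01)
gridy <- seq (0, 2.5, by=0.01)
funvals <- outer (gridx, gridy, function (x, y) cos (8 * sqrt (x^2 + y^2)))

contour (gridx, gridy, funvals)     # contour lines of funvals over the grid
```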
$bc
[1] "red" "green"
$q
[1] TRUE
If an element has a name, R uses it for printing, rather than the numerical index.
Using a List to Return Multiple Values from a Function
This function takes as input a vector of character strings, and returns a list of two
vectors, with the first and the last characters of the input strings:
first_and_last_chars <- function (strings) {
    first <- character(length(strings))   # Create two string vectors for
    last <- character(length(strings))    # the results, initially all ""
    for (i in 1:length(strings)) {
        nc <- nchar(strings[i])
        first[i] <- substring(strings[i],1,1)    # Find first & last chars
        last[i] <- substring(strings[i],nc,nc)   # of the i’th string
    }
    list (first=first, last=last)   # Return list of both result vectors
}
Here’s an example of its use:
> fl <- first_and_last_chars (c("abc","wxyz"))
> fl$first
[1] "a" "w"
> fl$last
[1] "c" "z"
Names for Vector Elements and Matrix Rows and Columns
You can also give names to elements of vectors, and use the names as indexes:
> x <- c (dog=5, cat=3)
> x
dog cat
5 3
> x["cat"]
cat
3
You can also give names to the rows and columns of matrices:
> M <- matrix(1:4,ncol=2,nrow=2,dimnames=list(c("cat","dog"),
+ c("big","small")))
> M
    big small
cat   1     3
dog   2     4
> M["dog","big"]
[1] 2
Scanning the Elements of a Matrix
Here’s an example function that finds the largest negative element in a numeric
matrix (ie, the negative element with smallest absolute value), returning this
element’s value, or minus infinity if there are no negative elements.
Note that you can find the number of rows and number of columns in a matrix
with nrow and ncol.
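A sketch of such a function is given below. The name largest_negative and the details of the loops are assumptions made for illustration, following the description above:

```r
largest_negative <- function (M) {
  largest <- -Inf                    # the answer if no negative element is found
  for (i in seq_len(nrow(M))) {      # scan every row ...
    for (j in seq_len(ncol(M))) {    # ... and every column within it
      if (M[i,j] < 0 && M[i,j] > largest)
        largest <- M[i,j]            # a negative element closer to zero
    }
  }
  largest
}

print (largest_negative (matrix (c(3,-5,-2,7), nrow=2, ncol=2)))
print (largest_negative (matrix (1:4, nrow=2, ncol=2)))
```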
http://www.cs.utoronto.ca/~radford/csc121/
Week 6
Random Numbers and Their Uses
Random variation is a big part of what statistics is about. So it’s natural that R
has facilities to create its own random variation — to generate random numbers.
Random numbers have many uses (and not just in statistics):
• See how the results of some statistical method vary when the data it is
applied to vary randomly.
• Make interactions with a user have a random aspect — we don’t want a video
game to behave the same way every time we play!
Generating Random Numbers with Uniform Distribution
One simple kind of random number is one that takes on a real value that is
uniformly distributed within some bounds.
You can get such numbers in R using the runif function. It takes as arguments
the number of random numbers to generate, the low bound, and the high bound.
We’ll try generating one at a time here:
The random numbers generated are supposed to be independent — eg, which one
we get the second time is unrelated to what the first one was.
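As a sketch of such calls (the bounds 0 and 10 are chosen arbitrarily, and no output is shown since it differs from run to run):

```r
a <- runif (1, 0, 10)    # one random number between 0 and 10
b <- runif (1, 0, 10)    # another one, unrelated to the first
d <- runif (1, -1, 1)    # one random number between -1 and +1

print (c(a, b, d))
```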
R’s Random Numbers Aren’t Really Random
Computers are carefully designed to not behave randomly.
Some computers have special devices for producing random numbers that are
really random. This is useful for cryptography (you want a really random key for
your code, so nobody else can guess it).
But for most purposes we don’t actually want real random numbers. They’re too
hard to generate, and if we use them, we can’t reproduce our results another day.
For example: Imagine that after running your program for a long time, it stops
with an error message, indicating it has a bug. You think you’ve now fixed the
bug. But how do you verify that you’ve really fixed it if you can’t reproduce the
run that led to the error?
[Figure: the generated values plotted against Index; vertical axis from 0 to 15]
It looks random, except that it repeats with period 30. Similar generators can
have much longer periods, however.
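To make the idea concrete, here is a sketch of one very simple deterministic generator that also repeats with period 30. It is not necessarily the generator plotted above: the multiplier 3 and modulus 31 are assumptions chosen for illustration, and simple_generator is a hypothetical name.

```r
simple_generator <- function (n, seed=1) {
  x <- numeric(n)
  state <- seed
  for (i in seq_len(n)) {
    state <- (3 * state) %% 31    # next state depends only on the current state
    x[i] <- state
  }
  x
}

r <- simple_generator(60)
print (r[1:30])      # visits each of 1, ..., 30 once, in a scrambled order
print (r[31:60])     # then repeats exactly the same sequence
```

Starting from the same seed always gives the same sequence, which is exactly the property that makes results reproducible.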
Setting the Random Seed
R uses a more sophisticated pseudo-random generator, but it also is deterministic,
and will reproduce the same sequence if restarted with the same “seed”.
For example:
> set.seed(123)
> runif(1)
[1] 0.2875775
> runif(1)
[1] 0.7883051
> runif(1)
[1] 0.4089769
> set.seed(123)
> runif(1)
[1] 0.2875775
> runif(1)
[1] 0.7883051
> runif(1)
[1] 0.4089769
For serious work, you should set the seed, so you’ll be able to reproduce your results.
The sample function
The call sample(n) will generate a random permutation of the integers from
1 to n, as illustrated below:
> set.seed(1)
> sample(10)
[1] 3 4 5 7 2 8 9 6 10 1
> sample(10)
[1] 3 2 6 10 5 7 8 4 1 9
> sample(10)
[1] 10 2 6 1 9 8 7 5 3 4
With other kinds of arguments, sample can do other things as well, including
sampling with replacement.
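A few of these other uses are sketched below. The seed and the example values are arbitrary, and outputs are not shown because they depend on the R version:

```r
set.seed(1)

rolls  <- sample (6, 10, replace=TRUE)           # ten rolls of a die
picks  <- sample (c("red","green","blue"), 2)    # two different colours
biased <- sample (c("H","T"), 20, replace=TRUE,  # twenty flips of a coin that
                  prob=c(0.9,0.1))               # lands heads 90% of the time

print (rolls)
print (picks)
print (biased)
```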
Generating Random Vectors
The runif function can generate a whole vector of random numbers at once. The
first argument of runif is the number of random numbers to generate.
For instance, here we plot 500 random numbers uniformly distributed from 0 to 1,
using the command
> plot(runif(500))
[Figure: plot of runif(500) against Index; vertical axis from 0.0 to 1.0]
The Problem with Plotting Rounded Data Points
Recall the “iris” data set of the widths and lengths of petals and sepals in three
species of Iris. It is stored in a special kind of list called a “data frame”, which
also looks somewhat like a matrix. We’ll talk more about data frames later.
Here’s a scatterplot of two of the variables (species marked by colour):
plot (iris$Sepal.Width, iris$Petal.Width, col=iris$Species,
xlab="Sepal Width", ylab="Petal Width")
[Figure: scatterplot of Petal Width against Sepal Width, coloured by species]
Solving the Problem with Random Jitter
Because the data is rounded to one decimal place, many of the dots in the
scatterplot are on top of each other. To see all the data points, we can add
random “jitter” to each data point before plotting:
plot (iris$Sepal.Width + runif(nrow(iris),-0.05,+0.05),
iris$Petal.Width + runif(nrow(iris),-0.05,+0.05),
col=iris$Species, xlab="Sepal Width", ylab="Petal Width")
[Figure: scatterplot of Petal Width against Sepal Width, coloured by species, with jitter added]
Making Random Choices
Often, we want to make a random choice, with certain probabilities for doing
certain things.
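One common way to do this, sketched here under the assumption that we want to do one thing with probability 0.3 and another otherwise, is to compare a uniform random number with the desired probability:

```r
set.seed(42)           # arbitrary seed, so the run can be reproduced
n <- 10000
moves <- numeric(n)

for (i in 1:n) {
  if (runif(1) < 0.3)  # TRUE with probability 0.3
    moves[i] <- +1     # ... so step up with probability 0.3
  else
    moves[i] <- -1     # ... and step down with probability 0.7
}

print (mean (moves == 1))   # fraction of up steps, should be close to 0.3
```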
The global environment contains variables that are created when you assign to a
name in a command typed at the R console (or as if typed in an R script).
For example, typing the command below creates (if it didn’t exist already) a
variable in the global environment named fred:
> fred <- 1+2
Calling a function creates a local environment used for just that call. Assignments
inside the function create or change variables in that environment — below, the
assignment to fred inside f changes fred in the local, not global, environment:
> f <- function (x) { fred <- 2*x; fred+1 }
> fred
[1] 3
> f(100)
[1] 201
> fred
[1] 3
Listing and Removing Variables
You can see what variables exist in the environment that is currently being used
with the ls function, which returns a vector of strings with the names of variables.
You can remove a variable from the current environment with rm.
Here’s an example (which assumes you haven’t already defined other variables in
the global environment):
> a <- 1
> b <- 2
> ls()
[1] "a" "b"
> rm(a)
> ls()
[1] "b"
> a
Error: object ’a’ not found
Note: After x <- "b", calling rm(x) removes variable x, not variable b.
Function Arguments in the Local Environment
When a function is called, all its arguments become variables in its local
environment. Their values are what was specified in the call of the function,
or their default values if they were not specified.
We can see this by printing the result of ls inside a function:
> f <- function (x,y=100,z=1000) { print(ls()); x + y + z }
> f(7,z=10)
[1] "x" "y" "z"
[1] 117
If we create new variables by assignment, they also are in the local environment:
> g <- function (x,y=100,z=1000) { a <- x + y + z; print(ls()); a }
> g(7)
[1] "a" "x" "y" "z"
[1] 1107
The global environment isn’t changed when local variables are created for
arguments or by assignment. So after doing the above, in a new R session, we see
> ls()
[1] "f" "g"
Local and Global Variable References
When you reference a variable inside a function, it refers to the local variable of
that name, if it exists, and if not, to the global variable of that name, if it exists.
Here’s an example:
> f <- function (xyz,def) {
+ print (abc) # refers to the global variable ’abc’
+ print (xyz) # refers to the local variable (argument) ’xyz’
+ print (def) # refers to the local variable (argument) ’def’
+ xyz + def + abc
+ }
>
> abc <- 1
> def <- 2
>
> f(200,3000)
[1] 1
[1] 200
[1] 3000
[1] 3201
Changing Local and Global Variables Inside a Function
Assigning a value to a name with <- (or with =) from inside a function creates or
changes the local variable with that name. Assigning a value to a name with <<-
creates or changes the global variable with that name. Here’s an example:
> g <- function () {
+ x <- a # creates a local variable ’x’, with value from global ’a’
+ a <- 10 # creates a local variable ’a’; global ’a’ is not affected
+ b <<- 300 # changes the global variable ’b’; doesn’t create a local ’b’
+ a + b + x # here, ’a’ refers to the new local ’a’, not the global ’a’
+ }
> g()
Error in g() : object ’a’ not found
> a <- 100
> b <- 200
> g()
[1] 410
> a
[1] 100
> b
[1] 300
> x
Error: object ’x’ not found
Assigning to Arguments Doesn’t Change Them
Since assignments with <- inside a function change only the local environment,
assigning to a function argument doesn’t change what the caller passed.
For example:
> h <- function (x) { x[1] <- 0; sum(x) } # sum all but first element
> a <- c(3,4,1,7)
> h(a)
[1] 12
> a # the global variable ’a’ was not changed
[1] 3 4 1 7
> x <- c(10,20,30)
> h(x)
[1] 50
> x # global ’x’ unchanged - not the same as the local ’x’!
[1] 10 20 30
Exception: R has some “special” functions that do alter their arguments — for
example, as we’ve seen, rm(x) actually removes x!
When and How to Use Local and Global Variables
When writing a function, you should try to
• Separate what the function does from how it does it, so someone using the
function only needs to understand the “what”.
• Make what the function does be easy to describe and understand.
• Make what the function does be general, so it will be useful in many contexts.
Functions should usually get input from their arguments, not global variables —
they’re then more generally useful, as it’s easy to use different arguments in calls.
Functions should usually not assign to global variables. Putting intermediate
results in global variables exposes “how” the function works. Returning
information in global variables makes it hard to use the function in a general way.
There are exceptions:
• If many functions all refer to the same data, having them all refer to a global
data variable may be easier than passing a data argument to all of them.
• Assigning to a global variable can be a convenient way to keep track of overall
counts of how often something happened (eg, number of errors of some sort).
• Assigning some intermediate result to a global variable may help when
debugging a program (but take it out once the program is working).
CSC 121: Computer Science for Statistics
http://www.cs.utoronto.ca/~radford/csc121/
Week 7
Features of R for Statistics
R is a general purpose programming language. You can write all sorts of
programs in R — video games, accounting packages, word processors, programs
for navigating rocket ships to Mars, . . .
But R is more appropriate for some of these tasks than others. It’s probably not
the best choice for video game programming — games need to respond quickly,
but speed is not R’s strong point. On the other hand, some features of R that are
not common in other languages are especially useful for statistical applications.
Here are some:
• Specifying function arguments by name, with arguments often having default
values — very useful for functions implementing statistical methods.
• Names for elements of vectors and lists, and for rows and columns of matrices
and data frames — “age” is a better label for a column than the number 17.
• R’s “data frames” for storing observations in a way that is convenient for
statistical analysis.
• Special NA values to indicate where data is missing.
We’ve talked about the first two, and will now talk about the last two.
Adding Attributes to R Objects
An R object can have one or more “attributes”, that record extra information.
They are mostly ignored if you don’t look at them, but are there if you look.
An example:
R uses a dim attribute to mark an object as a matrix, and to hold how many rows
and columns it has. This attribute is not usually shown explicitly, but we can see
it if we look using attr:
Names for rows and columns in a matrix are stored in a dimnames attribute.
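A short sketch of looking at these attributes, using made-up objects M and v (not from the original slides):

```r
M <- matrix (1:6, nrow=2, ncol=3)
d <- attr (M, "dim")            # the dim attribute: 2 rows and 3 columns
print (d)

v <- 1:6                        # a plain vector ...
attr (v, "dim") <- c(2,3)       # ... becomes a matrix once it has a dim attribute
print (is.matrix(v))

dimnames(M) <- list (c("a","b"), c("x","y","z"))
dn <- attr (M, "dimnames")      # row and column names, stored as a list
print (dn)
```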
The Class Attribute
The special class attribute tells R that some operations on the object should be
done in a special way. We’ll cover more about how this works later — and about
how it can be used to program in a style known as “object-oriented programming”.
For the moment, here’s a brief illustration of what can be done:
> g <- 123
> attr(g,"class") <- "gobbler"
> print.gobbler <- function (what) {
+ cat ("I’m a gobbler with value", unclass(what), "\n")
+ }
> g
I’m a gobbler with value 123
> g+1000
I’m a gobbler with value 1123
We’ve used the class attribute to tell R that objects in our “gobbler” class
should be printed in a different way than ordinary numbers. Note that unclass
gets rid of the class attribute, which lets us handle the number inside a gobbler
object in the usual way (though using unclass is not strictly necessary here).
Data Frames
One major use of classes is for R’s data.frame objects, which are the most
common way that data is represented in R.
A data frame is sort of like a list and sort of like a matrix. Each “row” of a data
frame holds information on some individual, object, case, or whatever. The
“columns” of a data frame correspond to variables whose values have been
measured for each case. These variables can be numbers, logical (TRUE/FALSE)
values, or character strings (but all values for one variable have the same type).
For example, here’s how R prints a small data frame containing the heights and
weights of three people:
> heights_and_weights
  name height weight
1 Fred     62    144
2 Mary     60    131
3  Joe     71    182
A data frame is really a list, with named elements that are the columns of the
data frame, but with a data.frame class attribute that makes R do things like
printing and subscripting differently from an ordinary list.
Getting Data Out of a Data Frame
You can get data from a data frame using subscripting operations similar to those
for a matrix (by row and column index), or by operations similar to a list (using
names of variables). For example:
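As a sketch of both styles of access, here the small data frame is rebuilt with data.frame under the hypothetical name hw, so that the example is self-contained:

```r
hw <- data.frame (name   = c("Fred","Mary","Joe"),
                  height = c(62,60,71),
                  weight = c(144,131,182))

w2 <- hw[2,3]               # matrix style: row 2, column 3 (Mary's weight)
h3 <- hw[3,"height"]        # row index with a column name (Joe's height)
wt <- hw$weight             # list style: the whole weight column
ht <- hw[["height"]]        # another list-style way to get a column

tall <- hw[hw$height > 61, ]   # rows for people taller than 61

print (w2); print (h3); print (wt); print (tall)
```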
heights_and_weights <-
    read.table ("http://www.cs.utoronto.ca/~radford/csc121/data7",
                header=TRUE)
> c(5,1,NA,8,NA)
[1] 5 1 NA 8 NA
Arithmetic on NA values
Arithmetic operations where one or both operands are NA produce NA as the
result:
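A brief sketch of this behaviour, assuming a is the vector c(5,1,NA,8,NA) used in the comparison example:

```r
a <- c(5,1,NA,8,NA)

b <- a + 1                   # NA wherever a is NA
p <- NA * 0                  # still NA, even though anything times 0 "should" be 0
s <- sum(a)                  # NA, since some elements are missing
t <- sum(a, na.rm=TRUE)      # 14, the sum of the non-missing elements

print (b); print (p); print (s); print (t)
```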
Comparisons with NA also produce NA, rather than TRUE or FALSE. Trying to
use NA as an if or while condition gives an error:
> a == 1
[1] FALSE TRUE NA FALSE NA
> if (a[3]==1) cat("true\n") else cat("false\n")
Error in if (a[3] == 1) cat("true\n") else cat("false\n") :
missing value where TRUE/FALSE needed
Checking For NA
Sometimes you need to check whether a value is NA. But you can’t do this with
something like if (a == NA) ... — that will always give an error!
Instead, you can use the is.na function. It can be applied to a single value,
giving TRUE or FALSE, or a vector of values, giving a logical vector.
For example, R’s built-in airquality demonstration dataset has some NA values.
The following statements create a modified version of the airquality data frame
in which missing values for solar radiation are replaced by the average of all the
non-missing measurements (found with mean using the na.rm option):
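As a sketch of such statements, a loop with is.na could be used. The variable names aq and avg are chosen here for illustration and are not from the original:

```r
aq <- airquality                        # copy of the built-in data frame
avg <- mean (aq$Solar.R, na.rm=TRUE)    # average of non-missing values

for (i in 1:nrow(aq)) {
  if (is.na(aq$Solar.R[i]))             # is this measurement missing?
    aq$Solar.R[i] <- avg                # if so, fill in the average
}
```

The original airquality data frame is left unchanged; only the copy aq is modified.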
(We’ll see later how one can do this more easily using logical indexes.)
NA and NaN
A value will also be “missing” if it is the result of an undefined mathematical
operation. R prints such values as NaN, not NA, but is.na will be TRUE for
them. Operations on NaN produce NaN as a result. Here are some examples:
> 0/0
[1] NaN
> sqrt(-1)
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
> x <- 0/0
> 10*x
[1] NaN
> v <- asin((-2):2)
Warning message:
In asin((-2):2) : NaNs produced
> v
[1] NaN -1.570796 0.000000 1.570796 NaN
> v / 0
[1] NaN -Inf NaN Inf NaN
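When NA and NaN need to be told apart, is.nan can be used alongside is.na, as in this small sketch:

```r
x <- c(1, NA, 0/0)

na_flags  <- is.na(x)     # TRUE for both NA and NaN
nan_flags <- is.nan(x)    # TRUE only for NaN

print (na_flags)
print (nan_flags)
```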
Week 8
Using Numeric Vectors as Subscripts
A subscript used with [ ] can be a vector of indexes, rather than just one index,
yielding a subset of elements having those indexes, not just one element:
> v <- c(9,10,3)
> names(v) <- c("abc","def","xyz")
> v
abc def xyz
9 10 3
> v[c(1,3)] # Notice that names of elements are carried along
abc xyz
9 3
You can also index with a vector of negative numbers. This gets you all elements
except those whose indexes are in the index vector (negated):
> v[-2]
abc xyz
9 3
> v[c(-1,-length(v))]
def
10
Difference Between [ ] and [[ ]]
We can now see better what the difference is between subscripting with [ ] and
with [[ ]] — [ ] extracts a subset of elements (which might be just one),
whereas [[ ]] extracts a single element.
> v
abc def xyz
9 10 3
> v[2]
def
10
> v[[2]] # Notice there’s no name here, just the element
[1] 10
> L <- list (a="xy", b=9, c=TRUE)
> L[2] # Notice that the result is still a list
$b
[1] 9
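For comparison, here is a sketch of what [[ ]] gives for the same list (the names one and elt are used only for illustration):

```r
L <- list (a="xy", b=9, c=TRUE)

one <- L[2]      # a list of length one, containing the element named b
elt <- L[[2]]    # the element itself, the number 9

print (is.list(one))
print (is.list(elt))
print (elt + 1)          # works, since elt is a number
print (L[["b"]])         # [[ ]] with a name is the same as L$b
```

Trying one + 1 would instead give an error, since one is a list rather than a number.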
Here’s how we can use this to make a modified version of the airquality data
frame (see last week’s slides) with missing values for Solar.R filled in:
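A sketch of how this might be written with a logical vector as the subscript. The names aq and miss are assumptions made for illustration:

```r
aq <- airquality                  # copy of the built-in data frame
miss <- is.na (aq$Solar.R)        # logical vector, TRUE where Solar.R is missing

# Assign the average of the non-missing values to just the missing positions
aq$Solar.R[miss] <- mean (aq$Solar.R, na.rm=TRUE)
```

No loop is needed, since the logical vector selects all the missing elements at once.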
• You can’t get an empty vector when making a sequence with an expression
like i:j.
• R will sometimes convert matrices to plain vectors when you don’t want it to.
The Problem of Reversing Sequences
The : operator will produce either an increasing sequence or a decreasing
sequence, depending on whether the first operand is less or greater than the
second:
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
This may seem convenient — and it is for Small Assignment 3 — but it’s a bad
idea. When you use : in a program, you need to be sure which sort of sequence
you’re going to get!
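The danger shows up when the second operand can be smaller than the first. In the sketch below, seq_len is used as one way to get a sequence that is empty when n is zero (n and count are illustrative names):

```r
n <- 0

len_colon  <- length (1:n)           # 2, because 1:0 is the sequence 1, 0
len_seqlen <- length (seq_len(n))    # 0, because seq_len(0) is empty

count <- 0
for (i in seq_len(n)) count <- count + 1    # body is never executed when n is 0

print (len_colon); print (len_seqlen); print (count)
```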
An Illustration of Why Reversing Sequences are Bad
Here’s a function that is supposed to return a modified square matrix in which all
the elements above the diagonal have been set to one:
> ones_above_diagonal(matrix(0,nrow=4,ncol=4))
Error in M[i, j] <- 1 : subscript out of bounds
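The definition of the function is not shown here, so the following is a guess at the kind of code that produces this error, together with one possible fix. The name ones_above_diagonal_buggy and both function bodies are assumptions for illustration:

```r
# Suspected form of the bug: when i equals n, (i+1):n is the decreasing
# sequence n+1, n, so the function tries to assign to column n+1.
ones_above_diagonal_buggy <- function (M) {
  n <- nrow(M)
  for (i in 1:n)
    for (j in (i+1):n)
      M[i,j] <- 1
  M
}

# One possible fix: seq_len(n-i) is empty when i equals n, so the inner
# loop does nothing for the last row.
ones_above_diagonal <- function (M) {
  n <- nrow(M)
  for (i in seq_len(n))
    for (j in seq_len(n-i) + i)    # i+1, ..., n, or nothing when i == n
      M[i,j] <- 1
  M
}

print (ones_above_diagonal (matrix(0,nrow=4,ncol=4)))
```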
But adding drop=FALSE all the time makes everything longer and messier. So it’s
tempting not to. But then you may get unexpected bugs once in a while. . .
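A small sketch of the difference that drop=FALSE makes, using a made-up matrix M:

```r
M <- matrix (1:6, nrow=2, ncol=3)

r1 <- M[1,]                # plain vector of length 3; the dim attribute is dropped
r2 <- M[1,,drop=FALSE]     # still a matrix, with 1 row and 3 columns

print (dim(r1))            # NULL
print (dim(r2))            # 1 3
```

Code that later calls nrow or ncol on the result will only behave as expected in the second case.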
Week 9
Operations on Numeric Vectors that Produce One Number
R has several functions that take a numeric vector or matrix as their argument,
and return a single number as their value, including:
sum finds the sum of all elements.
prod finds the product of all elements.
max finds the largest of all elements.
min finds the smallest of all elements.
mean finds the mean (average) of all elements.
For example:
> u <- c(3,5,1,9)
> sum(u)
[1] 18
This does pretty much the same thing as the following loop:
> s <- 0
> for (x in u) s <- s + x
> s
[1] 18
However, sum(u) is faster, and in some cases more accurate.
Operations on Logical Vectors that Produce One Logical Value
R also has two functions that take a logical vector as their argument, and return
a single logical value:
Looked at another way, any finds the “or” of all elements in its argument, and
all finds the “and” of all elements.
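A short sketch of any and all applied to the results of comparisons, reusing the vector u from earlier:

```r
u <- c(3,5,1,9)

a1 <- any (u > 8)     # TRUE, since at least one element (9) is greater than 8
a2 <- all (u > 8)     # FALSE, since not every element is greater than 8
a3 <- all (u > 0)     # TRUE, since every element is positive

e1 <- any (logical(0))   # FALSE for an empty vector
e2 <- all (logical(0))   # TRUE for an empty vector

print (c(a1, a2, a3, e1, e2))
```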
Here’s an example of the use of these functions:
Can you think of a way to replace the second if condition with one that uses any
rather than all?
Creating a Plot in Stages
Many simple plots can be created with a single plot command — eg, plot(x,y)
will plot points with coordinates given by the vectors x and y.
More complicated plots can be created in stages by adding more points, lines, and
text to what has already been plotted.
• Create a new plot with plot. It might contain some points or lines, or might
be completely empty. Features such as the axis scales and labels are
determined at this stage.
• Then add more information, using functions such as points, lines, abline,
and text. You can call these functions as many times as needed, perhaps
with different options for things like colour and line width each time.
• You can also add a title above the plot with the title function.
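The stages above can be sketched as follows. The data (one cycle of the sine function) and all the option values are made up for illustration:

```r
x <- seq (0, 2*pi, length.out=50)
y <- sin(x)

# Stage 1: create an empty plot, which fixes the axis scales and labels
plot (x, y, type="n", xlab="x", ylab="sin(x)")

# Stage 2: add information, with different options on each call
lines (x, y, col="blue", lwd=2)                        # the curve
points (x[c(1,50)], y[c(1,50)], col="red", pch=20)     # its two end points
abline (h=0, lty="dotted")                             # horizontal line at 0
text (pi, 0.5, "one cycle of sin")                     # a text label

# Stage 3: add a title above the plot
title ("A plot built in stages")
```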
Creating a New Plot
You create a new plot with the plot function. It takes one or two data vectors as
its first arguments, but has many, many other possible arguments. You’ll want to
let most of these have their default values, and refer to any that you set by name.
Here are some of the possible arguments to plot:
  type   Type of plotting: "p" for points (the default), "l" for lines,
         "b" for both points and lines, "c" for lines only but with space for points
  col    Colour for points/lines plotted (default is "black")
  xaxt   Set to "n" to get rid of horizontal axis numbers
  yaxt   Set to "n" to get rid of vertical axis numbers
  xlab   Label for the horizontal axis
  ylab   Label for the vertical axis
  xlim   Horizontal range for plot (vector of length two)
  ylim   Vertical range for plot (vector of length two)
  asp    Aspect ratio, asp=1 ensures one vertical unit looks the same
         length as one horizontal unit
For example, plot (c(), xlim=c(0,2), ylim=c(1,5)) will plot an empty
frame with horizontal axis labels from 0 to 2 and vertical axis labels from 1 to 5.
Adding Points to a Plot
We can add points to a plot with the points function. Like plot, it takes two
vectors as its first two arguments, containing the x and y coordinates of the
points. (Or just a single vector argument with the y coordinates, in which case
the x coordinates are 1, 2, 3, . . . )
It can also take other arguments that set various options, such as
For example, points (x, y, col="red", pch=20) will add solid red dots to the
plot, at the coordinates given by the vectors x and y.
Adding Lines to a Plot
We can add lines to a plot with the lines function.
In addition to one or two arguments giving the coordinates of the points to
connect with lines, it can take other arguments such as those below (which can
also be used for plot):
  type   Set to "b" for points too, "c" for lines only but with space for points
  col    Colour for lines plotted
  lty    Line type, eg "dotted", "dashed", or "solid" (the default)
  lwd    Line width (default is 1)
For example, lines (y, col="green", lty="dotted") will add dotted green
lines to the plot, at the x coordinates 1, 2, 3, . . . and y coordinates given by the
vector y.
Adding Text to a Plot
We can add text to a plot with the text function.
We can put many character strings on a plot with one call of text, since its
arguments can be vectors of x coordinates, y coordinates, and character strings.
For example:
n <- 20
angle <- 2*pi*(0:n)/7
dist <- 0:n
x <- dist * cos(angle)
y <- dist * sin(angle)
[Figure: the resulting plot, with text labels including “start” and “end”]
Week 10
Many Ways to Write a Simple Function
In this lecture, we’ll look at many ways of writing a simple function called
is_not_decreasing, which takes one argument, a vector, and returns TRUE if
the elements in the vector are in non-decreasing order, and FALSE otherwise.
We’ll see some new R features along the way.
Examples:
We’ll assume that the vector has no NA values. What would be a reasonable
thing to do if it did?
Ending a Loop Using a Logical Flag Variable
Here’s one solution, that uses the setting of a logical variable as a way of
terminating a while loop:
is_not_decreasing <- function (v) {
    answer_is_known <- FALSE
    i <- 2
    while (!answer_is_known) {
        if (i > length(v)) {
            answer <- TRUE
            answer_is_known <- TRUE
        }
        else if (v[i] < v[i-1]) {
            answer <- FALSE
            answer_is_known <- TRUE
        }
        i <- i + 1
    }
    answer
}
Using a repeat Loop and break Statement
This function used two logical variables — one to hold the answer returned, the
other to indicate when the answer is now known, and hence the loop can end.
We can instead use a loop written using repeat, which continues indefinitely,
until a break statement is done:
is_not_decreasing <- function (v) {
    i <- 2
    repeat {
        if (i > length(v)) {
            answer <- TRUE
            break
        }
        if (v[i] < v[i-1]) {
            answer <- FALSE
            break
        }
        i <- i + 1
    }
    answer
}
Using break Within a for Loop
We can use break to immediately exit any kind of loop. Here’s another way to
write this function:
is_not_decreasing <- function (v) {
    answer <- TRUE
    if (length(v) > 1)
        for (i in 2:length(v)) {
            if (v[i] < v[i-1]) {
                answer <- FALSE
                break
            }
        }
    answer
}
In this version, we initially set answer to TRUE, which will be the answer if we
don’t find a place where the elements decrease. If we do find a decrease, we set
answer to FALSE, and also immediately exit the for loop.
Caution: The break statement exits from the innermost loop that contains it.
If you’re inside two loops, you can’t use break to exit both of them at once.
Returning a Value for a Function Immediately
Rather than exit a loop with break after setting answer, and then making
answer the value of the function by putting it as the last thing, we can instead
use return to exit the whole function, and specify the value it returns.
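One way this might look, as a minimal sketch built on the for loop version (not necessarily the exact code shown in lecture):

```r
is_not_decreasing <- function (v) {
    if (length(v) > 1) {
        for (i in 2:length(v)) {
            # Found a decrease: leave the whole function at once.
            if (v[i] < v[i-1]) return (FALSE)
        }
    }
    # No decrease was found (or the vector has fewer than two elements).
    return (TRUE)
}
```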
At the end, we could just have written TRUE instead of return(TRUE) — they do
the same thing at the end of a function.
Why is the check for length(v) > 1 needed?
Avoiding Loops with a Vector Comparison
We can write is_not_decreasing without an R loop using a vector comparison
and the all function:
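A minimal sketch of such a version, written to match the description that follows:

```r
is_not_decreasing <- function (v) {
    # Compare each element except the last with the element after it.
    all (v[-length(v)] <= v[-1])
}
```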
In this version, v[-length(v)] will contain all of v except the last element, and
v[-1] will contain all of v except the first element. So v[-length(v)] <= v[-1]
compares each element except the last to the next element. The vector v is
non-decreasing if all these comparisons are TRUE.
Here’s another way to do the same thing:
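One possibility, as a sketch that uses explicit index ranges (an assumption about the form intended):

```r
is_not_decreasing <- function (v) {
    n <- length(v)
    if (n < 2) return (TRUE)
    # Elements 1 to n-1 are compared with elements 2 to n.
    all (v[1:(n-1)] <= v[2:n])
}
```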
Why is the check for length(v) < 2 needed here, but not in the version above?
Recursion — When a Function Calls Itself
As you know, an R function can call another R function, which can call yet
another R function, etc.
Indeed, an R function can even call itself. This is called “recursion”.
Of course, a function had better not always call itself, or it will just keep calling,
and calling, and calling, without end.
But having a function sometimes call itself can be useful. Here’s a recursive
function to compute factorials in R:
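A minimal sketch of such a function. The name fact is hypothetical, chosen here so that R's built-in factorial function is not replaced.

```r
fact <- function (n) {
    # For n of 0 or 1 the answer is 1, and the function does not call itself.
    # Otherwise it calls itself on a smaller problem.
    if (n <= 1) 1 else n * fact(n-1)
}
```

For example, fact(5) computes 5 * fact(4), which computes 4 * fact(3), and so on down to fact(1).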
R has many other facilities for doing operations on vectors, matrices, or lists
without having to write a loop, which often are also faster.
Replacing Loops with “apply” Functions
Functions in the “apply” family take as arguments both a data structure and a
function to apply to parts of the data structure — an example of “functional
programming”, using functions to construct more complex operations.
The lapply function operates on a list, and returns a list of results of applying a
given function to each element of the list. Here’s an example using the
is.numeric function, which says whether something is a numeric vector:
[[2]]
[1] TRUE
[[3]]
[1] FALSE
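A small sketch of a call of this kind. The list L is made up for illustration and is not the list used in the lecture.

```r
# A list whose elements are a string, a numeric vector, and a logical value.
L <- list ("abc", c(3.5,1.2), TRUE)

# Apply is.numeric to each element; the result is a list of the same length.
res <- lapply (L, is.numeric)
```

Here res is a list of three logical values, one for each element of L.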
Using “apply” on Matrices
You can use apply to apply a function to all rows or to all columns of a matrix.
If the function applied returns a single value, the result is a vector of these values:
If the function returns a vector of length greater than one, the result is a matrix:
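A small sketch of both cases, using a made-up matrix M. The second argument of apply is 1 to work on rows and 2 to work on columns.

```r
M <- matrix (1:6, nrow=2, ncol=3)

# sum returns a single value for each row, so the result is a vector.
row_sums <- apply (M, 1, sum)

# range returns two values (minimum and maximum) for each column,
# so the result is a matrix with one column for each column of M.
col_ranges <- apply (M, 2, range)
```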
! Logical “not”: TRUE if its operand is FALSE, FALSE if its operand is TRUE.
& Logical “and”: TRUE only if both operands are TRUE.
| Logical “or”: TRUE if either operand is TRUE.
When applied to logical vectors, the operations are done on each element in turn:
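A small sketch with two made-up logical vectors that cover all four combinations of TRUE and FALSE:

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

not_a   <- !a       # FALSE FALSE TRUE TRUE
a_and_b <- a & b    # TRUE FALSE FALSE FALSE
a_or_b  <- a | b    # TRUE TRUE TRUE FALSE
```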
Week 11
Another Use for Classes — Factors
Recall that how R handles an object can be changed by giving it a “class”
attribute. That’s how lists become data frames. Another example is the “factor”
class, which is used to represent a vector of strings as a vector of integers, along
with a vector of just the distinct string values.
Here’s an illustration:
> a <- as.factor(c("red","green","yellow","red","green","blue","red"))
> a
[1] red green yellow red green blue red
Levels: blue green red yellow
> class(a) # We can see that this object has the class "factor"
[1] "factor"
> unclass(a) # Here’s what it is without its class attribute
[1] 3 2 4 3 2 1 3
attr(,"levels")
[1] "blue" "green" "red" "yellow"
The main reason factors exist is that an integer previously used less memory than
a string, though this is less true in recent versions of R. In versions of R before
4.0.0, strings are converted to factors by read.table unless you use the
stringsAsFactors=FALSE option; since R 4.0.0, they are left as strings by default.
Operations on Factors
Factors look like strings for many purposes, but they can't be used as numbers:
> sqrt(a)
Error in Math.factor(a) : sqrt not meaningful for factors
This is because the integers representing the “levels” of the factor are arbitrary,
so treating them like numbers would be misleading. (Unfortunately, R isn’t
completely consistent in this, and will sometimes use a factor as a number
without a warning.)
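A short sketch of some things that can be done with the factor a defined earlier. These particular operations are chosen for illustration and may not be the ones shown in lecture.

```r
a <- as.factor (c("red","green","yellow","red","green","blue","red"))

is_red <- a == "red"        # comparison with a string works element by element
lv     <- levels(a)         # the distinct string values, in sorted order
strs   <- as.character(a)   # convert back to an ordinary vector of strings
codes  <- as.integer(a)     # the underlying integers (arbitrary, so use with care)
```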
Another Use of Classes — Dates and Time Differences
R also defines classes for dates, and for differences in dates. Some of what you
can do with these is illustrated below:
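A brief sketch, with dates made up for illustration. Subtracting two dates gives an object of class difftime, and adding a number to a date moves it by that many days.

```r
d1 <- as.Date ("2018-01-15")
d2 <- as.Date ("2018-03-01")

gap   <- d2 - d1     # a time difference of 45 days
later <- d1 + 30     # the date 30 days after d1

class(d1)            # "Date"
class(gap)           # "difftime"
```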
The definition of picture just says it’s generic. If no special method is defined
for a class, picture.default is used. By defining picture.mod17, we create a
special method for class mod17. R finds the method to use based on the class of
the first argument to the generic function.
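A sketch of how such definitions might look. What picture actually does is an assumption here: these hypothetical methods just return a descriptive string.

```r
# The generic function: it only says that a method should be looked up.
picture <- function (x) UseMethod("picture")

# Used when no special method exists for the class of x.
picture.default <- function (x) paste ("<", format(x), ">")

# Special method for objects of class mod17 (hypothetical behaviour).
picture.mod17 <- function (x) paste (unclass(x) %% 17, "(mod 17)")

a <- structure (20, class="mod17")
p1 <- picture(a)     # dispatches to picture.mod17
p2 <- picture(20)    # dispatches to picture.default
```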
The Object-Oriented Approach to Programming
R’s classes are designed to support what is called “object-oriented” programming.
This approach to programming has several goals:
• Separate what the methods for an object do from how they do it (including
how the object is represented).
Benefit: We can change how objects work without having to change all the
functions that use them.
• Permit the things that can be done with objects (“methods”) and the kinds
of objects (“classes”) to be extended without changing existing functions.
Benefit: We can more easily add new facilities, without having to rewrite
existing programs.
Generic Functions for Drawing, Rescaling, and Translating
Let’s see how we can define a set of generic functions for drawing and
transforming objects like circles and boxes.
We start by setting up the generic functions we want:
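A minimal sketch of these definitions, with argument names taken from the methods defined on the following slides:

```r
# Each generic function just asks R to find the method for the class of w.
draw      <- function (w) UseMethod("draw")
rescale   <- function (w, s) UseMethod("rescale")
translate <- function (w, tx, ty) UseMethod("translate")
```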
Then we need to define methods for these generic functions for all the classes of
objects we want. We also need functions for creating such objects.
Note: We might not have done things in this order. For example, we might have
first defined only draw and translate methods, and then later added the
rescale method. We would then need to implement a rescale method for a
class only if we actually will use rescale for objects of that class.
Implementing a Circle Object
We’ll represent a circle by the x and y coordinates of its centre and its radius.
new_circle <- function (x, y, r) {
    w <- list (centre_x=x, centre_y=y, radius=r)
    class(w) <- "circle"
    w
}
draw.circle <- function (w) {
    angles <- seq (0, 2*pi, length=100)
    lines (w$centre_x + w$radius*cos(angles),
           w$centre_y + w$radius*sin(angles))
}
rescale.circle <- function (w,s) {
    w$radius <- w$radius * s
    w
}
translate.circle <- function (w,tx,ty) {
    w$centre_x <- w$centre_x + tx; w$centre_y <- w$centre_y + ty
    w
}
Implementing a Box Object
We’ll represent a box by the x coordinates of its left and right sides and the y
coordinates of its bottom and top. But to create a box, we’ll give the coordinates
of its centre and the offsets from the centre to the corners.
new_box <- function (x, y, sx, sy) {
    w <- list (x1=x-sx, x2=x+sx, y1=y-sy, y2=y+sy)
    class(w) <- "box"
    w
}
draw.box <- function (w) {
    lines (c(w$x1,w$x1,w$x2,w$x2,w$x1), c(w$y1,w$y2,w$y2,w$y1,w$y1))
}
rescale.box <- function (w,s) {
    xm <- (w$x1+w$x2) / 2
    w$x1 <- xm + s*(w$x1-xm); w$x2 <- xm + s*(w$x2-xm)
    ym <- (w$y1+w$y2) / 2
    w$y1 <- ym + s*(w$y1-ym); w$y2 <- ym + s*(w$y2-ym)
    w
}
translate.box <- function (w,tx,ty) {
    w$x1 <- w$x1 + tx; w$x2 <- w$x2 + tx
    w$y1 <- w$y1 + ty; w$y2 <- w$y2 + ty
    w
}
An Example of Drawing Objects This Way
> plot(NULL,xlim=c(-7,7),ylim=c(-7,7),xlab="",ylab="",asp=1)
> c <- new_circle(3,4,2.5)
> draw(c); draw(rescale(c,0.7)); draw(translate(rescale(c,0.3),1,-5))
> b <- new_box(-3,-3,2,3)
> b2 <- translate(b,-1.3,2.2)
> draw(b); draw(b2); draw(rescale(b2,1.1))
[Plot: the circles and boxes drawn by the commands above, on axes running from -7 to 7.]
Defining a Function That Works On Both Circles and Boxes
Here is a function that should work for circles, boxes, or any other class of object
that has draw, rescale, and translate methods:
smaller <- function (w, n)
for (i in 1:n) { draw (w); w <- rescale(translate(w,1,0),0.9) }
Statistical Facilities in R
In this course, we’ve mostly looked at R as a programming language, and at
general programming concepts.
But R is most popular as a language for statistical applications. So it has many
special facilities for doing statistics. I’ll talk about some now.
Don’t worry if you don’t understand some of the statistical concepts — that’s OK
for this course. Though learning about R’s statistical facilities is one good way to
learn statistics in a hands-on way!
Creating Tables of Counts
R can count how many times a value or combination of values occurs in a data
set, with the table function. It returns an object of class table, which looks like
a vector or matrix of integer counts.
For a vector, table counts how many times each unique value occurs:
> colours <- c("red","blue","red","red","green","blue")
> print (tcol <- table(colours))
colours
blue green red
2 1 3
> names(tcol)
[1] "blue" "green" "red"
> ages <- c(4,9,12,2,4,9,10)
> print (tage <- table(ages))
ages
2 4 9 10 12
1 2 2 1 1
> names(tage)
[1] "2" "4" "9" "10" "12"
Tables of Joint Counts
When used with two vectors, or a data frame with two columns, table creates a
two-dimensional table of how often each combination of values occurs. Examples:
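A small sketch with two made-up vectors of the same length, where element i of each vector describes the same item:

```r
colours <- c("red","blue","red","red","green","blue")
sizes   <- c("big","small","small","big","big","small")

# Rows correspond to values of colours, columns to values of sizes.
joint <- table (colours, sizes)

joint["red","big"]      # how many items are both red and big
```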
This might express that the amount by which some plant grows is linearly related
to the average temperature, the amount of fertilizer used, and a set of indicator
variables indicating the variety of the plant.
A Simple Example of a Linear Model
Here, I’ll show the results of a very simple linear model, relating the volume of
wood in a cherry tree to its girth (diameter of trunk). The data is in the data
frame trees that comes with R.
Here’s a plot of the data:
[Plot: trees$Volume (vertical axis) against trees$Girth (horizontal axis).]
Fitting the Model with lm
We can fit a linear model for volume given girth as follows:
> lm (trees$Volume ~ trees$Girth)
Call:
lm(formula = trees$Volume ~ trees$Girth)
Coefficients:
(Intercept) trees$Girth
-36.943 5.066
The result says that the best-fit model for the volume is
Volume = -36.943 + 5.066 × Girth
We can get the same result with an abbreviated formula by saying the data comes
from the data frame trees:
> m <- lm (Volume ~ Girth, data=trees)
We could use these coefficients to predict the volume for a new tree, with girth
of 11.6:
> coef(m) %*% c(1,11.6) # %*% will compute the dot product
[,1]
[1,] 21.82048
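The same prediction can also be obtained with R's predict function, as sketched below. The name new_tree is made up for illustration; the new data frame must have a column with the same name as the predictor in the formula.

```r
m <- lm (Volume ~ Girth, data=trees)

# A data frame holding the girth of the new tree.
new_tree <- data.frame (Girth=11.6)

# predict applies the fitted coefficients to the new data.
p <- predict (m, new_tree)
```

This works only for a model fitted with the data argument, since the formula trees$Volume ~ trees$Girth refers to the original data directly.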
Getting More Details on the Model Fitted
We can also ask for more statistical details with summary:
> summary(m)
Call:
lm(formula = Volume ~ Girth, data = trees)
Residuals:
Min 1Q Median 3Q Max
-8.065 -3.107 0.152 3.495 9.587
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
Girth 5.0659 0.2474 20.48 < 2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
[Plot: trees$Volume (vertical axis) against trees$Girth (horizontal axis).]
The plot shows some indication that the relationship is actually curved.
Trying a Quadratic Model
Let’s try fitting volume to both girth and the square of girth:
> Girth_squared <- trees$Girth^2
> summary (lm (trees$Volume ~ trees$Girth + Girth_squared))
Call:
lm(formula = trees$Volume ~ trees$Girth + Girth_squared)
Residuals:
Min 1Q Median 3Q Max
-5.4889 -2.4293 -0.3718 2.0764 7.6447
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.78627 11.22282 0.961 0.344728
trees$Girth -2.09214 1.64734 -1.270 0.214534
Girth_squared 0.25454 0.05817 4.376 0.000152 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
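As a sketch of an alternative, the squared term can be written inside the formula using I(...), which tells R to treat ^ as arithmetic rather than as a formula operator. The coefficients should be the same as above, apart from their names.

```r
# Quadratic model without creating a separate Girth_squared variable.
q <- lm (Volume ~ Girth + I(Girth^2), data=trees)
coef(q)
```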
You can use the results to find the elements of a vector that are in some set:
With the which function, which returns the indexes of the TRUE elements in a
logical vector, you can also find the indexes of the elements that are in the set:
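A small sketch of both uses, assuming that membership is tested with R's %in% operator, and using made-up vectors v and s:

```r
v <- c(7, 3, 9, 3, 5)
s <- c(3, 5)                  # the set of interest

in_set   <- v %in% s          # FALSE TRUE FALSE TRUE TRUE
elements <- v[in_set]         # the elements of v that are in the set
indexes  <- which (in_set)    # the positions of those elements in v
```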
Week 12
Computers are Fast
Modern computers are so fast that most simple operations on not-too-large
amounts of data appear to happen instantly.
If you’re working with a data frame with 1000 rows and 10 columns, you can
expect all of the following to happen so fast that a human can’t perceive the delay:
• Adding or deleting one row or one column to make a new data frame.
If you see any apparent delay, it’s probably not for the operation itself, but for
things like fetching the data over the internet, or waiting for your computer to
stop doing something else.
. . . But Not Always Fast Enough
Nevertheless, computing speed is still an issue today.
• Because today’s computers are fast, people try to use them on bigger
problems than before. If you work on a data frame with 1000000 rows and
1000 columns, many operations will be noticeably slow, maybe very slow.
• Sometimes you want to do simple things many, many times. For example,
how well a statistical method works is often assessed by trying it on many
randomly-generated data sets.
• Some tasks are inherently extremely slow for computers to do. If the famous
P ≠ NP conjecture is true, this includes many useful tasks like finding the
shortest route visiting all locations in some set (the “Travelling Salesman”
problem).
• You can get computers to be very slow if you write your program in an
inefficient way, when there would have been a better way.
Computing Time and Problem Size
The time it takes for a program to run depends on what computer you run it on.
A low-end laptop computer might take five times longer to run a program than a
high-end desktop computer.
When analysing program speed, we therefore often look not at the actual time,
but at how the time grows with problem size.
Example: Suppose we want to sort a vector of numbers in increasing order,
creating another vector with the sorted list. How might the time for this grow
with the length of the vector, which we’ll call n?
• If we assume that the numbers are integers that aren’t huge, it can be done in
time proportional to n.
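One way to sort in time proportional to n is to count how often each value occurs, as in the sketch below. It assumes that the values are positive integers no larger than a known limit; the names counting_sort and max_value are made up for illustration.

```r
counting_sort <- function (v, max_value) {
    # counts[k] is the number of times the value k occurs in v.
    counts <- tabulate (v, nbins=max_value)
    # Write out each value as many times as it occurred, in increasing order.
    rep (1:max_value, times=counts)
}
```

The work done is proportional to n plus max_value, which is why the integers must not be huge.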
What Does “Time Growing in Proportion to n” Mean
To say that the time grows in proportion to n (or to n^2, or n log n), means that
asymptotically, as n becomes larger and larger, the time will grow that way, with
some unknown constant of proportionality.
Here’s an example of a function that asymptotically grows in proportion to n:
[Plot: the value of the function for n from 0 to 100, together with a grey straight line.]
The constant of proportionality seems to be 0.5 (grey line has that slope).
But if this is the time for a program to run, that constant will vary from
computer to computer.
Example of Time Growing in Proportion to n
Here’s an example of a simple R function that (pointlessly) counts up to n, and
how its time grows with n:
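A sketch of what such a function and its timing might look like. The name count_up is hypothetical, and the times reported by system.time will differ from computer to computer.

```r
count_up <- function (n) {
    i <- 0
    while (i < n) i <- i + 1    # one addition and one comparison per step
    i
}

# Time the function for two sizes; the second should take roughly ten
# times as long as the first if the time grows in proportion to n.
t1 <- system.time (count_up (100000))
t2 <- system.time (count_up (1000000))
```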
Week 13
A Few Final Comments
Here in the last lecture, I’ll mention a few things that we haven’t had time to
really cover. . .
• More on R packages.
• A few packages come with R and are available for use by default. One such is
the stats package that defines the lm function.
• Some more packages come with R, but have to be loaded manually before you
can use them. An example is the survival package for analysing survival
data. To use it, say library(survival).
• Many other packages (thousands) are available for installation from package
repositories, to which many people have contributed. The CRAN repository
is the best-known, and is the default when you try to install such packages
using the install.packages function.
• Tests that individual functions work as intended, including functions that are
just used inside the program (aren’t meant to be used elsewhere).
If we are confident that many of the parts of the program work correctly, we
will be more confident that the program as a whole works correctly.
One aim in testing is to make sure that every bit of code has been used — eg,
that every if statement has been tried with the condition being both TRUE and
FALSE. But that’s not enough to guarantee that the program always works.
Testing isn’t a substitute for careful design and coding.
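As an illustration, here is a sketch of a few tests for the is_not_decreasing function from Week 10, written with R's stopifnot function, which stops with an error if any of its arguments is not TRUE. The test cases are chosen so that each if condition is tried with both TRUE and FALSE.

```r
is_not_decreasing <- function (v) {
    answer <- TRUE
    if (length(v) > 1)
        for (i in 2:length(v)) {
            if (v[i] < v[i-1]) {
                answer <- FALSE
                break
            }
        }
    answer
}

stopifnot (is_not_decreasing (c(1,2,2,5)))   # loop runs, no decrease found
stopifnot (!is_not_decreasing (c(1,3,2)))    # loop runs, decrease found
stopifnot (is_not_decreasing (4))            # length(v) > 1 is FALSE
stopifnot (is_not_decreasing (numeric(0)))   # empty vector
```

If all the tests pass, nothing is printed.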
Source Code Control
A source code control system manages the files containing your function
definitions, scripts, or documentation.
Here are some things a source code control system lets you do:
• Go back to an earlier version if you find out that some recent changes you
made were a bad idea.
• See what has changed from some earlier version to the current version.
Currently, the most popular source code control system is git. It is supported by
RStudio, or it can be used on its own.
Source Code Repositories
It’s increasingly popular for programs (managed by a source code control system)
to be made available to everyone on source code repositories.
Two popular ones based on git are gitlab.com and github.com.
These repositories support
• People downloading the programs, including the revision history if they wish.
Of course, the developers have to provide a license that allows the program to be
used / changed.
Other Programming Languages for Statistics
R is probably the most common programming language used by statisticians.
But there are others.
There are statistical packages that provide programming facilities, such as
• SAS
• Stata
There are several programming languages with wider communities that are
somewhat similar to R, including
• Matlab (and Octave, a free and largely compatible alternative).
• Python
There are also languages centred on symbolic mathematical computation, like
• Maxima
• Maple
• Mathematica
In these languages, you can multiply 2+x by 1+3*x and get 2+7*x+3*x^2.
Compiled Programming Languages
There are also programming languages that are usually compiled, rather than
interpreted as R and the other languages on the previous slide usually are.
Compilation translates the program to a program in machine language, which the
computer can do directly. In contrast, an interpreter is a program in machine
language (usually compiled from some other language) that looks at a program
and does what it says, which is much slower.
So if you need your program to go really fast, you may want to write it in a
language that can be compiled, rather than in R.
Some common compiled languages:
• C
• Fortran
You can also write just the time-critical part of the program in one of these
languages, and then call that part from an R program.
Alternative Implementations of R
Several projects are currently in the works to improve on the current
implementation of R (as distributed at http://R-project.org).
These include: