DATA SCIENCE 101

Predicting with Data

Shabnam Fani, UBC

Winter T1, 2021

1
Data analysis case study - data science in the real world
Data analysis case study — data collection

The study included twenty participants who were split into 2 treatment
groups. Ten envelopes contained instructions for the participant to
commence with their DH needle and NDH probe and ten for the opposite
hand arrangement.

Once the participant demonstrated the needle tip to be within the target
in the phantom jelly, the task was complete and the opposite hand
arrangement was started.

This method was chosen to allow 10 attempts at each arrangement to be
completed by each of the 20 participants.
Data analysis case study – data frame
Data analysis case study – predictive modelling

• The final model for time used is

y = B_0 + B_1 order + B_2 group + β_3 (order × group) + ε

where the Bs are normally distributed with means β_0

• The log of the mean of the counts is related to the covariates, order
and group. That is,

log(Mean Count) = B_0 + β_1 order + B_2 group

where B_0 and B_2 are normally distributed with means β_0 and β_2,
respectively.
Summary of Data Science

• Propose the problem.

• Design a way to collect the data.

• Record the information and measurements and clean the data.

• Build the model with the cleaned data.

• Apply the model to make predictions or to make decisions.

To be a data scientist, you need to learn enough statistics and computer
science, and you need to be open to cooperating with other professionals.
Data 101 will focus on mastering basic R programming and learning
some basic statistical concepts related to the functions that will be used
most often.
Why R Language

• R is one of the most widely used programming languages for data
analysis.

• R is open source.

• R has more than 9000 packages to use. A package is like an app: it
adds new capabilities to the base system.

• R has a large community that provides support and examples.


Outline of this course

We will start with basic programming. We will teach you R, but we will
try not to just teach you R. We will emphasize those things that are
common to many computing platforms and are important to beginning
data scientists.

• We will show you how to construct statistical graphics.

• We will show how to control the flow of execution of a program.

• We will touch on Boolean algebra, a formal way to manipulate


logical statements.

• We will discuss how to break down complex problems into simple


parts.

• We will spend quite a lot of time discussing how to get it right

• We will provide an introduction to simple, multiple and tree-based


regression.
The course textbook and additional notes

The course textbook is called A First Course in Statistical Programming
with R.

We will cover the first 4 chapters of the textbook, in addition to some
notes on regression and data management that I will post later.

Solutions to selected exercises in the textbook can be found on the web
at http://www.statprogr.science.
An overview of R

These lectures introduce R. It is based on the S language, which was
originally developed by John Chambers and others at Bell Laboratories in
1976, and it was implemented and made into an Open Source program¹ by
Robert Gentleman and Ross Ihaka in 1995.

There is nothing wrong with making errors when learning a programming
language like R.

You learn from your mistakes, and there is no harm done.

Try out the code embedded into these slides and experiment with new
variations to discover how the system will respond.

¹ https://www.r-project.org/Licenses/GPL-2
Downloading and installing R and RStudio

R can be downloaded for free from CRAN*.

RStudio is also very popular. You can download the “Open Source
Edition” of “RStudio Desktop” from http://www.rstudio.com/, and
follow the instructions to install it on your computer.

Although much or all of what is described here can be carried out in
RStudio, most of our focus will be on the use of R itself.

* http://cloud.r-project.org or check out the short video on Canvas.


Executing commands in R

Clicking on the R icon, or opening RStudio similarly, should provide you
with access to a window or pane, called the R console, in which you can
execute commands.*

The > sign is the R prompt, which indicates where you can type in the
command to be executed.

*A short video can be found on Canvas to show how to start running R.


2
Executing commands in R

You can do arithmetic of any type, including multiplication:

By hitting the “Enter”


key, you are asking R to
execute this
calculation.

3
Executing commands in R

The answer appears on the next line:

Often, you will type in commands such as this into a script window, as in
RStudio, for later execution, through hitting the “Run” button, “ctrl-R” or
another related keystroke sequence.*
* Check out the short ScriptRStudio video for a quick example.
4
Executing commands in R

Objects that are built in to R or saved in your workspace, i.e. the


environment in which you are currently doing your calculations, can be
displayed, simply by invoking their name.

For example, the data set or data frame called women contains
information on heights and weights of American women:

women

##    height weight
## 1      58    115
## 2      59    117
## 3      60    120
## 4      61    123
## 5      62    126
## 6      63    129
## 7      64    132
## 8      65    135
## 9      66    139
## 10     67    142
## 11     68    146
## 12     69    150
## 13     70    154
## 14     71    159
## 15     72    164

5
Polling questions - Yes or No?

If I want to subtract 238 from 745 in R, do I just type


745 - 238

6
Polling question - Answer

Yes!
745 - 238

## [1] 507

7
Polling questions - Yes or No?

If I want to subtract 238 from 745 and then divide the result by 0 in R,
do I just type

(745-238) / 0

## [1] Inf

If I accidentally type the letter o instead of the digit 0, the result is

(745-238) / o

## Error in eval(expr, envir, enclos): object 'o' not found

8
Polling question - Yes or No?

There is a built-in object called mean which can be used to calculate


averages. If I just type
mean

I will get an error message, because it will only work if I supply it with
data.

9
Polling questions - Answer

No!
mean

## function (x, ...)


## UseMethod("mean")
## <bytecode: 0x55be140614c8>
## <environment: namespace:base>

mean is an object, so if we type its name, its contents are printed to the
screen.

10
Polling question - Yes or No?

I have created a function that will convert Fahrenheit temperatures to


Celsius temperatures and I have stored it in an object called F2C.

I will see the contents of the function, if I type


F2C

11
Polling questions - Answer

Yes!
F2C

## function(x) (x - 32)*5/9

and I can use my function to find the Celsius temperature for my


American friend who likes going to the beach when it is 86 degrees
Fahrenheit:
F2C(86)

## [1] 30
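
The slides show the contents of F2C but not how it was created. A minimal sketch, assuming it was defined with the usual function() syntax and stored with the <- assignment arrow that is introduced later in these notes:

```r
# Define a function of one argument x and store it under the name F2C
F2C <- function(x) (x - 32) * 5 / 9

F2C(86)           # a single Fahrenheit temperature
F2C(c(32, 212))   # a vector: the freezing and boiling points of water
```

Because arithmetic in R works element by element, the same function converts a whole vector of temperatures in one call.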

12
What else to expect in this course

1. You will learn a lot about R:

• using it to perform calculations

• using it to create graphs

• using it as a programming language

• using built-in functions, together with data, to build models (i.e.


equations or formulas) that can be used to make predictions.

2. You will learn some things about data: organizing it, summarizing it,
and cleaning it.

3. Later in the course, you will learn about predictive models:


regression models, Bayes classifiers and tree-based models.

13
What else to expect in this lecture

1. Running (executing) commands in R

2. Doing arithmetic calculations and storing the results for future use.

3. Creating some simple graphics.

4. Learning about some of R’s built-in functions.

5. Extending R’s abilities by installing add-on packages.

6. Seeing how R organizes data - reading in and manipulating data


frames

14
R can be used as a calculator

Some arithmetic problems to try*:

12*11 # multiplication

## [1] 132

22*22

## [1] 484

125/25 # division

## [1] 5

* Note that R ignores everything typed to the right of #.


15
More Calculations with R

R can compute powers with the ^ operator. For example,

3^4

## [1] 81

It can also calculate square roots - in two ways:

9^(1/2)

## [1] 3

or
sqrt(9)

## [1] 3
16
Polling question - Yes or No?

Can I find the cube root of 1000 by typing

1000^(1/3)

17
Polling question - Answer

Yes! The cube root of a number is the 1/3 power of that number (or the
value that, when raised to the 3rd power, gives the original number).
10^3 = 1000 and

1000^(1/3)

## [1] 10

18
Polling question - Yes or No?

The area A of a circle of radius r can be found from the formula

A = πr^2

and R has stored a value of π as

pi

## [1] 3.141593

Can I calculate the area of a circle with radius 12 using

pi*12^2

19
Polling question - Answer

Yes!

pi*12^2

## [1] 452.3893

20
Calculations in R

You can control the number of digits in the output with the options()
function.

This is useful when reporting final results such as means and standard
deviations, since including excessive numbers of digits can give a
misleading impression of the accuracy in your results.

options(digits=3)
583/31

## [1] 18.8

Compare with the result under the default setting of 7 digits:

583/31

## [1] 18.80645

21
Calculations in R

Observe the patterns in the following calculations.

options(digits = 18)
1111111*1111111

## [1] 1234567654321

11111111*11111111

## [1] 123456787654321

111111111*111111111

## [1] 12345678987654320

The error in the final calculation is due to the way R stores information
about numbers. There are around 17 digits of numeric storage available.
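
One way to see this limit for yourself is sketched below. It is an illustration added to these notes and relies on the fact that R stores ordinary numbers as double-precision values, which represent every whole number exactly only up to 2^53 (about 9 followed by 15 zeros).

```r
2^53 + 1 == 2^53               # TRUE: adding 1 is lost beyond the exact range
111111111 * 111111111 > 2^53   # TRUE: the last product above exceeds that range
.Machine$double.digits         # number of binary digits of precision available
```

The first two products in the pattern are below 2^53, which is why only the third one shows an incorrect final digit.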

22
More Calculations with R

We can compute the remainder after division of 17 by 6


17 %% 6

## [1] 5

This is called modular arithmetic.

The calculation above is referred to as “17 (mod 6).”

23
More Calculations with R

We can also calculate the quotient, without remainder, using %/%:


17 %/% 6

## [1] 2

2 * 6 + 5 # check the calculation

## [1] 17

24
Polling question - Yes or No?

If we divide 17 by 3, we should get 5 and a remainder of


17%/%3

25
Polling question - Answer

No!
17%/%3

## [1] 5

This is what we get when dividing 17 by 3 and ignoring the remainder.

To calculate the remainder only, we use


17%%3

## [1] 2
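
The two operators fit together: the quotient times the divisor, plus the remainder, gives back the original number. A short sketch of that check, using variable names chosen only for this illustration:

```r
n <- 17
d <- 3
quotient  <- n %/% d   # integer part of the division
remainder <- n %% d    # what is left over

quotient * d + remainder == n   # the two pieces rebuild the original number
```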

26
What are the numbers in square brackets?

The following example displays the data in rivers, lengths of 141 North
American rivers (in miles). The second line starts with the 12th value,
the third line starts with the 23rd value, and so on.*
options(width=60)
rivers

## [1] 735 320 325 392 524 450 1459 135 465 600 330
## [12] 336 280 315 870 906 202 329 290 1000 600 505
## [23] 1450 840 1243 890 350 407 286 280 525 720 390
## [34] 250 327 230 265 850 210 630 260 230 360 730
## [45] 600 306 390 420 291 710 340 217 281 352 259
## [56] 250 470 680 570 350 300 560 900 625 332 2348
## [67] 1171 3710 2315 2533 780 280 410 460 260 255 431
## [78] 350 760 618 338 981 1306 500 696 605 250 411
## [89] 1054 735 233 435 490 310 460 383 375 1270 545
## [100] 445 1885 380 300 380 377 425 276 210 800 420
## [111] 350 360 538 1100 1205 314 237 610 360 540 1038
## [122] 424 310 300 444 301 268 620 215 652 900 525
## [133] 246 360 529 500 720 270 430 671 1770

* The line break is based on the optional setting options(width=60).


27
Simple Number Patterns

The : operator yields increasing sequences of numbers. For example,


1:10

## [1] 1 2 3 4 5 6 7 8 9 10

Here is what happens when you add 3 to the above sequence:


(1:10)+3

## [1] 4 5 6 7 8 9 10 11 12 13

28
Simple Number Patterns

You can subtract or multiply a number too:


(1:10)-3

## [1] -2 -1 0 1 2 3 4 5 6 7

(1:10)*3

## [1] 3 6 9 12 15 18 21 24 27 30

29
More patterns

You can take powers of all elements in your sequence:

(1:10)^2

## [1] 1 4 9 16 25 36 49 64 81 100

(1:10)^3

## [1] 1 8 27 64 125 216 343 512 729 1000

30
Polling question - Yes or No?

We can divide the numbers from 1 through 10 by 3, ignoring the


remainder using
(1:10)%/%3

31
Polling question - Answer

Yes!
(1:10)%/%3

## [1] 0 0 1 1 1 2 2 2 3 3

Dividing 1 by 3 gives 0, and so does dividing 2 by 3.

Dividing 3 by 3 gives 1, and so does dividing 4 and 5 by 3, and so on.

32
Polling question - Yes or No?

To get the sequence 4, 5, 6, 7, we use


4:7

33
Polling question - Answer

Yes!
4:7

## [1] 4 5 6 7

34
Polling question - Yes or No?

The command
7:4

will generate an error message.

35
Polling question - Answer

No!
7:4

## [1] 7 6 5 4

If the first number is larger than the second number, a decreasing


sequence is generated.

36
Named storage

When you begin an R session, you open a workspace known as the


global environment.

This environment is where you can begin to store the results of your
work.

For example, you might want to keep track of some calculations, or you
might have invented a new function to solve some kind of problem.

To store your output, you need to provide names for each object that you
want to save.

37
Named storage: Example

The rivers object contains measurements in miles. One mile is about
1.609 kilometers, so to convert to kilometers we multiply all of the
measurements by 1.609:

riversKm <- rivers*1.609

The above command causes R to assign the calculated values to


riversKm using an arrow that points to the left, created with the
less-than sign (<) and the hyphen (-).

Note that no output appears. You can see the results of this assignment
by typing

riversKm

38
Polling question - Yes or No?

There are 5280 feet in one mile. That means we can convert the lengths
of the rivers from miles to feet by multiplying each value by 5280. Does
the following code assign these lengths to the object riversFeet?

riversFeet < rivers*5280

39
Polling question - Answer

No! You need <-, not just <

riversFeet <- rivers*5280
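
The difference matters because < on its own is the "less than" comparison. If riversFeet does not exist yet, the line with < stops with an "object not found" error; if it does exist, R silently compares instead of assigning. A small sketch with a made-up object x:

```r
x <- 5     # assignment: x now holds 5
x < 3      # comparison: is x less than 3? The value of x is unchanged
x <- 3     # assignment again: x now holds 3
x < -3     # with a space after <, this compares x with -3
x          # x is still 3
```

Putting spaces around <- makes it harder to mix up the two.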

40
Quitting R

To quit your R session, type q()


q()

If you then hit the Enter key, you will be asked whether to save an image
of the current workspace, or not, or to cancel.

The workspace image contains a record of the computations you’ve


done, and may contain some saved results.

41
Functions

Most of the work in R is done using functions.

For example, we saw that to quit R we type q(). This tells R to call the
function named q.

The brackets surround the argument list, which in this case contains
nothing: we just want R to quit, and do not need to tell it how.

42
q is a function

Attempting to quit without the brackets gives:


q

## function (save = "default", status = 0, runLast = TRUE)


## .Internal(quit(save, status, runLast))
## <bytecode: 0x55be140f3098>
## <environment: namespace:base>

This has happened because q is a function that is used to


tell R to quit.

43
q is a function

Typing q by itself tells R to show us the contents of the function q. By


typing q(), we are telling R to call the function q.

The action of this function is to quit R.

q has three arguments: save, status, and runLast.

44
Default Values of Parameters

Each argument has a default value: "default", 0, and TRUE


respectively.

What happens when we execute q() is that R calls the q function with
the arguments set to their default values.

45
Changing from the Defaults

To change from the default values, specify them in the function call.

Arguments are identified by their position or by their name.

For example,

q("no") # these calls tell R to quit without


q(save = "no") # saving the workspace image

46
Changing from the Defaults

If we had given two arguments without names, they would apply to save
and status.

If we want to accept the defaults of the early parameters but change later
ones, we give the name when calling the function, e.g.
q(runLast = FALSE)

47
Changing from the Defaults

Alternatively, commas can be used to mark the missing arguments, e.g.


q( , , FALSE)

It is a good idea to use named arguments when calling a function which


has many arguments or when using uncommon arguments, because it
reduces the risk of specifying the wrong argument, and makes your
code easier to read.
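
Since calling q() would end the session, the same rules can be tried safely on a function of your own. The function power_sum below is hypothetical, written only to illustrate matching by position, by name, and with an empty slot between commas:

```r
# Hypothetical function: one required argument and two with default values
power_sum <- function(x, power = 2, scale = 1) {
  scale * sum(x^power)
}

power_sum(1:3)               # all defaults: 1 + 4 + 9
power_sum(1:3, 3)            # second argument matched by position: power = 3
power_sum(1:3, scale = 10)   # later argument changed by name
power_sum(1:3, , 10)         # same call, using a comma to skip power
```

The last two calls give the same answer, but the named version is much easier to read.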

48
Named storage: Graphics Example

The amounts of time spent by a person watching 4 different types of TV


shows were measured.

15% of the time was spent on sports, 10% on game shows, 30% on
movies, and 45% on comedies. Set up a pie chart and a bar chart.

2
Named storage: Graphics Example

First, you need to set up an object that contains the information to be


plotted. Here, an object called tv is assigned the required information.
tv <- c("sports" = 15, "game shows"= 10,
"movies" = 30, "comedies" = 45)

The pie() function can then be used to create the pie chart.

3
Named storage: Graphics Example
pie(tv)

[Figure: pie chart of tv, with slices labelled sports, game shows,
movies and comedies]

4
Named storage: Bar Chart Example
barplot(tv)

[Figure: bar chart of tv, with bars for sports, game shows, movies and
comedies; the vertical axis runs from 0 to 40]

5
Polling question - Yes or No?

My friend has baked 3 apple pies, 4 blueberry pies and 7 cherry pies.
Create a pie chart for these data. The first step is:

pies <- c(apple = 3, blueberry = 4, cherry = 7)

6
Polling question - Answer

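Yes! The quotation marks are optional here, because apple, blueberry and
cherry are valid R names. Quotation marks are required only for names
such as "game shows" that contain a space. The following is equivalent:

pies <- c("apple" = 3, "blueberry" = 4, "cherry" = 7)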

7
Polling question - Yes or No?

The second step to create the pie chart for my friend’s pie data is

pie(pies)

8
Polling question - Answer

Yes!
pie(pies)

[Figure: pie chart of pies, with slices labelled apple, blueberry and
cherry]

9
R is case-sensitive

Let’s try to find the length of the longest river:

MAX(rivers)

## Error in MAX(rivers): could not find function "MAX"

Now try

max(rivers)

## [1] 3710

The function max() is built in to R. R would consider MAX to be a


different function, because it is case-sensitive: m is different from M.

10
R is case-sensitive

If you really want a function called MAX to do the work of max, you would
type
MAX <- max

Now, MAX will do what you want:

MAX(rivers)

## [1] 3710

11
Listing the objects in the workspace

Our workspace now contains some objects that we have created.

A list of all objects in the current workspace can be printed to the screen
using the objects() function:
objects()

## [1] "MAX" "pies" "tv"

The same result is obtained with the alias function


ls()

## [1] "MAX" "pies" "tv"

Remember that if we quit our R session without saving the workspace


image, then these objects will disappear.
12
The simplest model for random noise - the runif function

Suppose you are measuring a length with a ruler that gives you
accuracy to the nearest millimeter.

When you measure the length of your pencil as 273 millimeters, the truth
could be anywhere between 272.5 and 273.5 millimeters.

The uniform distribution on the interval [−.5, .5] provides a model for the
error in your measurement, and we can simulate values from this
distribution using the runif() function.

Following is a sample of 4 such simulated values:

runif(4, min = -.5, max = .5)

## [1] 0.4958510 -0.4893649 0.4281421 0.4016467
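
Each call to runif() gives different values, so your output will not match the numbers above. A sketch of how to make a simulation repeatable and check that it behaves like the measurement-error model; the seed value 101 and the object name errors are arbitrary choices for this illustration:

```r
set.seed(101)                                  # fix the random number stream
errors <- runif(1000, min = -0.5, max = 0.5)   # 1000 simulated measurement errors

range(errors)        # every value lies between -0.5 and 0.5
mean(errors)         # close to 0: the errors balance out on average
273 + errors[1:4]    # four simulated true lengths for a pencil read as 273 mm
```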

13
Packages

One of the major strengths of R is the availability of add-on packages


that have been created by statisticians and computer scientists from
around the world.

There are thousands of packages, e.g. graphics, ggplot2, and MPV.

A package contains functions and data which extend the abilities of R.

Every installation of R contains a number of packages by default (e.g.


base, stats, and graphics) which are automatically loaded when you
start R.

14
Packages

To load an additional package, for example, called DAAG, type

library(DAAG)

If you get an error saying that the package can’t be found, then the
package is not installed on your computer, but it can likely be installed. Try

install.packages("DAAG")

15
Packages

In RStudio, it may be simpler to use the Tools menu.

16
Packages

Choose “Install Packages”:

17
Packages

Type in the name of the package you are requesting, and click “Install”:

18
Packages

Before DAAG is installed and loaded, typing seedrates gives an error:

> seedrates
Error: object 'seedrates' not found

Once DAAG is installed, it can be loaded using the library() function,
and you can access data frames and functions that were not available
previously. For example, the seedrates data frame is now available:
seedrates

## rate grain
## 1 50 21.2
## 2 75 19.9
## 3 100 19.2
## 4 125 18.4
## 5 150 17.9
19
Polling question - Yes or No?

To install the MPV package, type

install.packages(MPV)

20
Polling question - Answer

No! The package name must be given in quotation marks.

install.packages("MPV")

21
Using one object from a package at a time

The MPV package is installed on my system, but I have not loaded it. I
only want to access the p2.12 data frame and nothing else.

To do this, just type the package name (MPV), followed by two colons
(::) and the object name you seek.

MPV::p2.12

##    temp  usage
## 1    21 185.79
## 2    24 214.47
## 3    32 288.03
## 4    47 424.84
## 5    50 454.68
## 6    59 539.03
## 7    68 621.55
## 8    74 675.06
## 9    62 562.03
## 10   50 452.93
## 11   41 369.95
## 12   30 273.98

22
Polling question - Yes or No?

If I don’t load the DAAG package, but I want to assign seedrates to an


object called plantingData, the following should work:

plantingData <- DAAG::seedrates

23
Polling question - Answer

Yes!

plantingData <- DAAG::seedrates

Look at the result:

plantingData

##   rate grain
## 1   50  21.2
## 2   75  19.9
## 3  100  19.2
## 4  125  18.4
## 5  150  17.9

The contents of plantingData are the same as the contents of seedrates.

24
Packages

You might want to know which packages are loaded into your system
already.

To see which packages are loaded, run


search()

## [1] ".GlobalEnv" "package:DAAG" "package:latt


## [4] "package:stats" "package:graphics" "package:grDe
## [7] "package:utils" "package:datasets" "package:meth
## [10] "Autoloads" "package:base"

(Your list will likely be different from mine.)

25
Packages

This list also indicates the search order: a package can only contain one
function of any given name, but the same name may be used in another
package.

When you use that function, R will choose it from the first package in the
search list.

If you want to force a function to be chosen from a particular package,


prefix the name of the function with the name of the package and ::,
e.g.

stats::median(x)
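
The search order also explains what happens if you create an object with the same name as a built-in function: the global environment comes first in the list, so your version is found first. A sketch, in which the replacement median function is deliberately useless and exists only for this illustration:

```r
x <- c(3, 1, 4, 1, 5)

# Hypothetical mistake: we create our own object called median
median <- function(x) "not the real median"

median(x)          # our version is found first
stats::median(x)   # the version from the stats package is still reachable
rm(median)         # remove our version
median(x)          # the stats version is used again
```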

26
Packages

Thousands of contributed packages are available, though you likely


have only a few dozen installed on your computer.

If you try to use one that isn’t already there, you will receive an error
message:

library(notInstalled)

## Error in library(notInstalled): there is no package


called ’notInstalled’

This means that the package doesn’t exist on your computer, but it
might be available in a repository online.
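
If you would rather check for a package without triggering an error, one option is the base function requireNamespace(), which returns TRUE or FALSE. The helper name is_installed below is made up for this sketch:

```r
# Hypothetical helper: TRUE if the named package is installed on this computer
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

is_installed("stats")          # stats comes with every installation of R
is_installed("notInstalled")   # FALSE, assuming no package of that name exists
```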

27
Packages

The biggest repository of R packages is known as CRAN. To install a


package from CRAN, you can run a command like

install.packages("knitr")

or, within RStudio, click on the Packages tab in the Output Pane,
choose Install, and enter the name in the resulting dialog box.

28
Packages

Because there are so many contributed packages, it is hard to know


which one to use to solve your own problems.

If you can’t get help from someone with more experience, you can get
information from the CRAN task views at
https://cloud.r-project.org/web/views.

These are reviews of available packages written by experts in dozens of


different subject areas.

29
Built-in help pages

There is an online help facility that can help you to see what a particular
function is supposed to do.

If you know the name of the function that you need help with, the
help() function is likely sufficient.

It may be called with a string or function name as an argument, or you


can simply put a question mark (?) in front of your query.

30
Built-in help pages

For example, for help on the q() function, type


?q

or
help(q)

or just hit the F1 key while pointing at q in RStudio. Any of these will
open a help page containing a description of the function for quitting R.

31
Data frames
Most data sets are stored in R as data frames, such as the women object
we encountered earlier.

Data frames are like matrices, but the columns have their own names.
You can obtain information about a built-in data frame by using the
help() function. For example, observe the outcome of typing
help(women).

women package:datasets R Documentation


Average Heights and Weights for American Women
Description:

This data set gives the average heights and weights for American
women aged 30-39.
Usage:

women

Format:
A data frame with 15 observations on 2 variables.

[,1] height numeric Height (in)


[,2] weight numeric Weight (lbs)

Details:

The data set appears to have been taken from the American Society
Data frames

It is generally unwise to inspect data frames by printing their entire


contents to your computer screen, as it is far better to use graphical
procedures to display large amounts of data or to exploit numerical
summaries.

> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
Data frames

The summary() function provides information about the main features of


a data frame:
summary(women)

## height weight
## Min. :58.0 Min. :115.0
## 1st Qu.:61.5 1st Qu.:124.5
## Median :65.0 Median :135.0
## Mean :65.0 Mean :136.7
## 3rd Qu.:68.5 3rd Qu.:148.0
## Max. :72.0 Max. :164.0

34
Polling question - Yes or No?

Is the first quartile (1st Qu.) the value which is at least as large as 25%
of the measurements?

35
Polling question - Answer

Yes!

The first quartile is the same as the 25th percentile - the value which
divides the lower 25 percent of the data from the upper 75 percent.

Similarly, the third quartile is the 75th percentile, and the median is the
50th percentile; the median is the middle value of a sorted collection of
measurements.
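
The quartiles reported by summary() can also be computed directly with the quantile() function. A short sketch for the height column of women; the object name height_quartiles is chosen only for this example:

```r
# 25th, 50th and 75th percentiles of the heights
height_quartiles <- quantile(women$height, probs = c(0.25, 0.50, 0.75))
height_quartiles

# Proportion of heights at or below the first quartile
mean(women$height <= height_quartiles[1])
```

With only 15 observations the proportion cannot be exactly 25%, which is why the first quartile falls between two of the recorded heights.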

36
Polling question - Yes or No?

We can count the number of rows in the women data frame using
nrow(women)

37
Polling question - Answer

Yes!
nrow(women)

## [1] 15

There are 15 rows in the women data frame.

38
Reading data into a data frame from an external file

If you have prepared the data set yourself, you could simply type it into a
text file, for example called file1.txt, perhaps with a header indicating
column names, and where you use blank spaces to separate the data
entries.

39
Reading data into a data frame from an external file

The read.table() function will read in the data for you as follows:

mydata <- read.table("file1.txt", header = TRUE)

The object mydata now contains the data read in from the external file.

You could use any name that you wish in place of mydata, as long as the
first element of its name is an alphabetic character.
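
To try this without preparing a file by hand, the sketch below first writes a small space-separated file to a temporary location and then reads it back. The file contents and the use of tempfile() are assumptions made for this illustration; in practice you would give the name of your own file.

```r
# Create a small text file with a header row and space-separated entries
data_file <- tempfile(fileext = ".txt")
writeLines(c("height weight",
             "58 115",
             "59 117",
             "60 120"), data_file)

# Read it into a data frame; header = TRUE uses the first line as column names
mydata <- read.table(data_file, header = TRUE)
mydata
str(mydata)   # shows the column names and their types
```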

40
Reading data into a data frame from an external file

If the data entries are separated by commas and there is no header row,
as in the file wx_l3_2006.txt, you would type:

wx1 <- read.table("wx_l3_2006.txt", header = FALSE, sep = ",")

41
Reading data into a data frame from an external file

Often, your data will be in a spreadsheet.

If possible, export it as a .csv file and use something like the following
to read it in.

wx2 <- read.table("wx_l3_fwi_2006-2011.csv",
                  header = FALSE, sep = ",")

If you cannot export to .csv, you can leave it as .xlsx and use the
read.xlsx() function in the xlsx package.

42
Data input and output

When in an R session, it is possible to read and write data to files


outside of R, for example on your computer’s hard drive.

It is helpful to know where the data is coming from or going to.

While the work that you do in R is in an environment called the


workspace, there is another environment on your computer that the R
program is working in: this is called the working directory.

This directory (or folder) contains the files that you read in to R and write
out from R.

2
Changing working directories

In the RStudio Files tab of the output pane you can navigate to the
directory where you want to work, and choose
Set As Working Directory from the More menu item.

Alternatively you can run the R function setwd(). For example, to work
with data in the folder mydata on the C: drive, run

setwd("c:/mydata") # or setwd("c:\\mydata")

3
Changing working directories

After running this command, all data input and output will default to the
mydata folder in the C: drive.

If you are accustomed to folder names in Windows, you might have


expected this to be written as "c:\mydata".

However, R treats the backslash character “\” as a special “escape”


character, which modifies the interpretation of the next character. If you
really want a backslash, you need to double it: the first backslash tells
the second backslash not to be an escape.

Because other systems use a forward slash “/” in their folder names,
and because doubling the backslash is tedious in Windows, R accepts
either form.
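
The sketch below shows the escape rule at work and a safe way to experiment with setwd(). It moves to R's temporary folder rather than to c:/mydata, because that folder may not exist on your computer; this substitution is an assumption made for the illustration.

```r
nchar("c:/mydata")     # number of characters in the forward-slash form
nchar("c:\\mydata")    # the same: the doubled backslash is stored as one character
cat("c:\\mydata\n")    # prints the path as Windows would show it

old <- setwd(tempdir())   # setwd() returns the previous working directory
getwd()                   # now the temporary folder
setwd(old)                # go back to where we started
```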

4
dump() and source()

Suppose you have constructed an R object called usefuldata. In order


to save this object for a future session, type
dump("usefuldata", "useful.R")

This stores the command necessary to create the vector usefuldata


into the file useful.R on your computer’s hard drive.

5
dump() and source()

The choice of filename is up to you, as long as it conforms to the usual


requirements for filenames on your computer.

To retrieve the vector in a future session, type


source("useful.R")

This reads and executes the command in useful.R, resulting in the


creation of the usefuldata object in your global environment. If there
was an object of the same name there before, it will be replaced.

To save all of the objects that you have created during a session, type
dump(list = objects(), "all.R")

This produces a file called all.R on your computer’s hard drive. Using
source("all.R") at a later time will allow you to retrieve all of these
objects.
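
A complete round trip can be sketched as follows. The values stored in usefuldata are made up, and the file is written to R's temporary folder rather than the working directory so that the example leaves nothing behind; both are assumptions made for this illustration.

```r
usefuldata <- c(2.5, 3.1, 4.7)
dump_file <- file.path(tempdir(), "useful.R")

dump("usefuldata", dump_file)   # write the R code that recreates the object
readLines(dump_file)            # look at what was written

rm(usefuldata)                  # pretend we are starting a new session
source(dump_file)               # run the saved code
usefuldata                      # the object is back
```

Because the dumped file is ordinary R code, you can open it in any text editor to see or change what will be recreated.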
6
Example

To save existing objects nhtemp and nhtempC to a file called nhtemp.R


on your hard drive, type
dump(c("nhtemp", "nhtempC"), "nhtemp.R")

7
Saving and retrieving image files

The vectors and other objects created during an R session are stored in
the workspace known as the global environment.

When ending an R session, we have the option of saving the workspace


in a file called a workspace image.

If we choose to do so, a file called by default .RData is created in the


current working directory (folder) which contains the information
needed to reconstruct this workspace.

8
Saving and retrieving image files

In Windows, the workspace image will be automatically loaded if R is


started by clicking on the icon representing the file .RData, or if the
.RData file is saved in the directory from which R is started.

If R is started in another directory, the load() function may be used to


load the workspace image.

9
Saving and retrieving image files

It is also possible to save workspace images without quitting. For


example, we could save all current workspace image information to a file
called temp.RData by typing

save.image("temp.RData")

Again, we can begin an R session with that workspace image, by


clicking on the icon for temp.RData. Alternatively, we can type
load("temp.RData") after entering an R session. Objects that were
already in the current workspace image will remain, unless they have the
same name as objects in the workspace image associated with
temp.RData. In the latter case, the current objects will be overwritten
and lost.
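
The overwriting behaviour of load() can be checked with a small sketch.
It uses save() on a single object and a temporary file rather than
save.image(), so that the rest of the workspace is left alone; the object
names x and y are assumptions for illustration.

```r
x <- 1
f <- tempfile(fileext = ".RData")
save(x, file = f)   # store only x (save.image() would store everything)

x <- 99             # change x in the current workspace
y <- "kept"         # an object that is not in the saved file

load(f)             # x is replaced by the saved value; y is untouched
x
y
```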

10
Vectors in R

Data come in the form of numbers and characters.

In R, a vector is a list of either numbers or of characters.

We will see some other kinds of vectors as well.

11
Numeric Vectors

A numeric vector is a list of numbers.

The rivers object is an example of a vector that is built in to R.

We can count the number of elements in rivers using the length()


function:

length(rivers)

## [1] 141

This vector contains 141 elements.

12
Numeric Vectors

In the previous lecture, we learned that we can view the entire contents
of an object by typing its name. Let’s do that one more time:
rivers

## [1] 735 320 325 392 524 450 1459 135 465 600 330
## [12] 336 280 315 870 906 202 329 290 1000 600 505
## [23] 1450 840 1243 890 350 407 286 280 525 720 390
## [34] 250 327 230 265 850 210 630 260 230 360 730
## [45] 600 306 390 420 291 710 340 217 281 352 259
## [56] 250 470 680 570 350 300 560 900 625 332 2348
## [67] 1171 3710 2315 2533 780 280 410 460 260 255 431
## [78] 350 760 618 338 981 1306 500 696 605 250 411
## [89] 1054 735 233 435 490 310 460 383 375 1270 545
## [100] 445 1885 380 300 380 377 425 276 210 800 420
## [111] 350 360 538 1100 1205 314 237 610 360 540 1038
## [122] 424 310 300 444 301 268 620 215 652 900 525
## [133] 246 360 529 500 720 270 430 671 1770

13
Extracting elements from vectors

Suppose we want to see the 35th measurement in the rivers object.

You can extract this element from the rivers vector using the value 35
inside square brackets:

rivers[35]

## [1] 327

14
Extracting elements from vectors: Example

nhtemp contains average annual temperatures for New Haven,
Connecticut, starting in 1912. We can extract the 5th element of this time
series vector using
nhtemp[5]

## [1] 49.4

This gives us the average temperature for 1916.

15
Polling question - Yes or No?

How do we find the number of elements in nhtemp, i.e. the length of


nhtemp?

(a) length(nhtemp)

(b) length[nhtemp]

Is the correct answer (a)?

16
Polling question - Answer

Yes!
length(nhtemp)

## [1] 60

There are 60 elements in the nhtemp vector.

17
Polling question - Yes or No?

How do we find the 57th element of nhtemp?

(a) nhtemp(57)

(b) nhtemp[57]

Is the correct answer (a)?

18
Polling question - Answer

No!
nhtemp(57)

## Error in nhtemp(57): could not find function "nhtemp"

You should use square brackets, not round brackets, when extracting an
element from a vector. Here, R is telling you that it cannot find a
function called nhtemp: the round brackets made R treat nhtemp as a
function call, which is not what we want.

The 57th element of nhtemp is correctly displayed with


nhtemp[57]

## [1] 51.9

19
Building your own numeric vectors

The c() function is used to collect things together into a vector. We can
create a vector called myvector which contains some random data:
myvector <- c(2.5, 5, 0, 0.7, -8)

We can see the contents of myvector by typing its name


myvector

## [1] 2.5 5.0 0.0 0.7 -8.0

20
Polling question: Yes or no?

I want to assign the first 3 prime numbers to the object prime3. Does
the following work?

prime3 <- (2, 3, 5)

21
Polling question: Answer

No! You need to use the c function:

prime3 <- c(2, 3, 5)

22
Vectors of Sequences

Earlier, we learned that the : symbol can be used to create sequences of
increasing (or decreasing) values.

We can create a vector called numbers5to20 which contains all of the
integers from 5 through 20:
numbers5to20 <- 5:20
numbers5to20

## [1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

23
Putting Vectors Together

Vectors can be joined together (i.e. concatenated) with the c function.

For example, watch what happens when we combine the existing object
numbers5to20 with the numbers 31 through 35:
c(numbers5to20, 31:35)

## [1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 31 32
## [19] 33 34 35

24
Putting Vectors Together

Here is another example of the use of the c() function.


some.numbers <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37,
41, 43, 47, 59, 67, 71, 73, 79, 83, 89, 97, 103, 107, 109,
113, 119)

If you type this in the R console (not in the RStudio Source Pane), R will
prompt you with a + sign for the second line of input. RStudio doesn’t
add the prompt, but it will indent the second line.

In both cases you are being told that the first line is incomplete: you
have an open bracket which must be followed by a closing bracket in
order to complete the command.

Also, don’t forget to include all the commas where they are needed.*

* Watch the video vectorErrorMsg.mp4 for an example.


25
Extracting multiple elements from vectors

We can extract more than one element at a time. To do this, we use a
vector inside the square brackets which indicates the elements that we
want.

For example, the second and fourth elements of rivers are


rivers[c(2, 4)]

## [1] 320 392

You can check the second and fourth elements of rivers.


> rivers
[1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870
[16] 906 202 329 290 1000 600 505 1450 840 1243 890 350 407 286 280
[31] 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600
[46] 306 390 420 291 710 340 217 281 352 259 250 470 680 570 350
[61] 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260
[76] 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735
[91] 233 435 490 310 460 383 375 1270 545 445 1885 380 300 380 377
[106] 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540
[121] 1038 424 310 300 444 301 268 620 215 652 900 525 246 360 529
[136] 500 720 270 430 671 1770
26
Extracting multiple elements from vectors

To get the fifth through ninth element of rivers, type

rivers[5:9]

## [1] 524 450 1459 135 465

* Here is an example of the use of the colon (:) to create increasing
sequences of integers: 5:9 gives the sequence {5, 6, 7, 8, 9}.
27
Polling question - Yes or No?

The result from executing rivers[3:2] is

(a) 320 325

(b) 325 320

Is the correct answer (a)?


> rivers
[1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870
[16] 906 202 329 290 1000 600 505 1450 840 1243 890 350 407 286 280
[31] 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600
[46] 306 390 420 291 710 340 217 281 352 259 250 470 680 570 350
[61] 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260
[76] 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735
[91] 233 435 490 310 460 383 375 1270 545 445 1885 380 300 380 377
[106] 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540
[121] 1038 424 310 300 444 301 268 620 215 652 900 525 246 360 529
[136] 500 720 270 430 671 1770

28
Polling question - Answer

No!

rivers[3:2]

## [1] 325 320

Remember that, when using the : operator, the numbers decrease if the
first value is larger than the second.

29
Extracting multiple elements from vectors

When we extract elements from a vector, we will usually want to create a
new vector that contains the smaller collection of values.

For example, we can create river23 which will contain only the second
and third elements of rivers:

river23 <- rivers[2:3]

river23

## [1] 320 325

30
Extracting multiple elements from vectors

Negative indices can be used to avoid certain elements. For example,
we can select all but the fifth, ninth and eleventh elements of nhtemp as
follows:
nhtemp[-c(5, 9, 11)]

## [1] 49.9 52.3 49.4 51.1 47.9 49.8 50.9 51.9 49.6 49.3 50.6
## [12] 48.4 50.7 50.9 50.6 51.5 52.8 51.8 51.1 49.8 50.2 50.4
## [23] 51.6 51.8 50.9 48.8 51.7 51.0 50.6 51.7 51.5 52.1 51.3
## [34] 51.0 54.0 51.4 52.7 53.1 54.6 52.0 52.0 50.9 52.6 50.2
## [45] 52.6 51.6 51.9 50.5 50.9 51.7 51.4 51.7 50.8 51.9 51.8
## [56] 51.9 53.0

Compare with the original vector

[1] 49.9 52.3 49.4 51.1 49.4 47.9 49.8 50.9 49.3 51.9 50.8 49.6 49.3 50.6 48.4
[16] 50.7 50.9 50.6 51.5 52.8 51.8 51.1 49.8 50.2 50.4 51.6 51.8 50.9 48.8 51.7
[31] 51.0 50.6 51.7 51.5 52.1 51.3 51.0 54.0 51.4 52.7 53.1 54.6 52.0 52.0 50.9
[46] 52.6 50.2 52.6 51.6 51.9 50.5 50.9 51.7 51.4 51.7 50.8 51.9 51.8 51.9 53.0

31
Extracting multiple elements from vectors

The fourth through eightieth elements of rivers can be omitted as
follows:
rivers[-(4:80)]

## [1] 735 320 325 338 981 1306 500 696 605 250 411
## [12] 1054 735 233 435 490 310 460 383 375 1270 545
## [23] 445 1885 380 300 380 377 425 276 210 800 420
## [34] 350 360 538 1100 1205 314 237 610 360 540 1038
## [45] 424 310 300 444 301 268 620 215 652 900 525
## [56] 246 360 529 500 720 270 430 671 1770

32
Extracting elements from vectors: 0 indices

Using a zero index returns nothing.

This is not something that one would usually type, but it may be useful
in more complicated expressions.

For example,
nhtemp[1:5]

## [1] 49.9 52.3 49.4 51.1 49.4

myindices <- c(1:3, 0, 5)


nhtemp[myindices]

## [1] 49.9 52.3 49.4 49.4

The result is just the first 3 elements plus the 5th element of nhtemp.
33
Polling question - Yes or No?

The result from executing

x <- 1:10
x[0:5]

is

(a) 0 1 2 3 4 5

(b) 1 2 3 4 5

Is the correct answer (a)?

34
Polling question - Answer

No!
x <- 1:10
x[0:5]

## [1] 1 2 3 4 5

The 0 index is ignored and only the 1, 2, 3, 4 and 5 indices are used.

35
Extracting elements from vectors - positives and negatives

Do not mix positive and negative indices. To see what happens, observe

nhtemp[c(-2, 3)]

## Error in `[.default`(nhtemp, c(-2, 3)): only 0's may be mixed with
## negative subscripts

The problem is that it is not clear what is to be extracted: do we want the
third element of nhtemp before or after removing the second one?

36
Extracting elements from vectors - Fractional Indices

Always be careful to make sure that vector indices are integers. When
fractional values are used, they will be truncated towards 0. Thus 0.6
becomes 0, as in
nhtemp[0.6]

## numeric(0)

The output numeric(0) indicates a numeric vector of length zero.
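
A short sketch of the truncation rule on a small made-up vector (the
vector x is an assumption for illustration): fractional indices are
truncated towards 0 whether they are positive or negative.

```r
x <- c(10, 20, 30, 40)

x[2.9]    # 2.9 is truncated to 2, so this is the second element
x[-1.7]   # -1.7 is truncated to -1, so the first element is dropped
x[0.6]    # 0.6 is truncated to 0, giving a vector of length zero
```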

37
Extracting elements from vectors - logical subsetting

What if you want to see only the values of rivers that are larger than
some number, such as 2000?

There is a simple way to do this, with the square bracket and the greater-
than (>) sign:

rivers[rivers > 2000]

## [1] 2348 3710 2315 2533

2
Extracting elements from vectors - which ones?

To see which values of rivers are larger than 2000, use the which()
function:

which(rivers > 2000)

## [1] 66 68 69 70

This tells us that we could extract the values that are larger than 2000,
using the indices 66, 68, 69, 70.
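
One way to see how the two approaches fit together is sketched below: the
comparison rivers > 2000 produces a logical vector, and which() converts
it into positions. The names big and idx are assumptions for illustration.

```r
big <- rivers > 2000    # logical vector: TRUE where the length exceeds 2000
sum(big)                # number of TRUE values

idx <- which(big)       # positions of the TRUE values
idx

# Indexing by position or by the logical vector gives the same values
identical(rivers[idx], rivers[big])
```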

3
Polling question: Yes or no?

I want to know which elements of rivers are less than 250. Does the
following work?

which(rivers < 250)

4
Polling question: Answer

Yes!

which(rivers < 250)

## [1] 8 17 36 39 42 52 91 108 117 129 133

5
Vector arithmetic

Arithmetic can be done on R vectors. For example, we can divide all
elements of river23 by 5:

river23 <- rivers[2:3]


river23

## [1] 320 325

river23/5

## [1] 64 65

Note that the computation is performed elementwise. Addition,
subtraction and multiplication by a constant have the same kind of
effect. For example,
y <- river23 - 5
y

## [1] 315 320


Vector arithmetic

For another example, consider taking the square root of the elements of
river23:

sqrt(river23)

## [1] 17.88854 18.02776

7
Vector arithmetic

The above examples show how a binary arithmetic operator can be used
with vectors and constants.

In general, the binary operators also work element-by-element when
applied to pairs of vectors.

For example, we can compute $y_i^{x_i}$ for $i = 1, 2, 3$, i.e.
$(y_1^{x_1}, y_2^{x_2}, y_3^{x_3})$, where y = [2 4 5] and x = [5 2 3],
as follows:
y <- c(2, 4, 5)
x <- c(5, 2, 3)
y^x

## [1] 32 16 125

8
Vector arithmetic

When the vectors are different lengths, the shorter one is extended by
recycling: values are repeated, starting at the beginning.

For example, to see the pattern of the numbers 1 to 10 after adding 2 and
3, we need only give the 2:3 vector once:
c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,
10, 10) + 2:3

## [1] 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13
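
Recycling can be made explicit with a small sketch: adding a short vector
gives the same result as adding that vector written out in full with
rep(). The vectors x and short are made-up values for illustration.

```r
x <- 1:6
short <- c(10, 20)

x + short                   # short is recycled: 10 20 10 20 10 20
x + rep(short, times = 3)   # the same calculation with the repeats written out

identical(x + short, x + rep(short, times = 3))
```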

9
Vector arithmetic

R will give a warning if the length of the longer vector is not a multiple of
the length of the smaller one, because that is often a symptom of an error
in the code. For example, if we wanted to add 2, 3 and 4 to the numbers 1
through 10, this is the wrong way to do it:
c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10) + 2:4

## Warning in c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,
10, 10) + : longer object length is not a multiple of shorter object
length

## [1] 3 4 6 4 6 7 6 7 9 7 9 10 9 10 12 10 12 13 12 13

(Do you see the error?)

10
Polling question: Yes or no?

Consider the following code.


(1:10) + (1:5)

Will there be an error message?

11
Polling question: Answer

No!
(1:10) + (1:5)

## [1] 2 4 6 8 10 7 9 11 13 15

The numbers 1 through 5 are first added to the numbers 1 through 5, and
then the numbers 6 through 10 are added to the numbers 1 through 5.

12
Simple patterned vectors

We have seen the use of the : operator for producing simple sequences
of integers.

Patterned vectors can also be produced using the seq() function as well
as the rep() function. For example, the sequence of odd numbers less
than or equal to 21 can be obtained using
seq(1, 21, by = 2)

## [1] 1 3 5 7 9 11 13 15 17 19 21

Notice the use of by = 2 here. The seq() function has several optional
parameters, including one named by. If by is not specified, the default
value of 1 will be used.
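
Two further uses of seq() are sketched below as an illustration; the
particular values are arbitrary. The length.out parameter asks for a given
number of equally spaced values, and a negative by produces a decreasing
sequence.

```r
seq(0, 1, length.out = 5)   # five equally spaced values from 0 to 1
seq(10, 1, by = -3)         # decreasing sequence: 10 7 4 1
```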

13
Simple patterned vectors

Repeated patterns are obtained using rep(). Consider the following
examples:
rep(3, 12) # repeat the value 3, 12 times

## [1] 3 3 3 3 3 3 3 3 3 3 3 3

rep(seq(2, 20, by = 2), 2) # repeat the pattern 2 4 ... 20, twice

## [1] 2 4 6 8 10 12 14 16 18 20 2 4 6 8 10 12 14 16 18 20

14
Polling question - Yes or No?

The result from executing seq(1,28, by = 7) is

(a) 1 8 15 22

(b) an error message.

Is the correct answer (a)?

15
Polling question - Answer

Yes!
seq(1, 28, by = 7)

## [1] 1 8 15 22

16
Simple patterned vectors - repeating patterns

Try these examples on your own:


rep(c(1, 4), c(3, 2)) # repeat 1, 3 times and 4, twice

## [1] 1 1 1 4 4

rep(c(1, 4), each = 3) # repeat each value 3 times

## [1] 1 1 1 4 4 4

rep(1:10, rep(2, 10)) # repeat each value twice

## [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10

17
Vectors with random patterns

We already saw that runif() will help us simulate a simple kind of noise.

The sample() function allows us to simulate things like the results of the
repeated tossing of a 6-sided die.

# an imaginary die is tossed 8 times:


dieTosses <- sample(1:6, size = 8, replace = TRUE)
dieTosses

## [1] 6 3 1 3 6 5 3 5
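
Each call to sample() normally gives different results. If a simulation
needs to be repeated exactly, one common approach is to call set.seed()
first. The sketch below assumes an arbitrary seed value of 101.

```r
set.seed(101)   # arbitrary seed, chosen only for illustration
first <- sample(1:6, size = 8, replace = TRUE)

set.seed(101)   # resetting the seed repeats the same random draws
second <- sample(1:6, size = 8, replace = TRUE)

identical(first, second)
```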

18
Polling question - Yes or No?

Which of the following could be used to simulate 1000 coin tosses,
where 1 is Tails and 2 is Heads?

(a) coinTosses <- sample(1:2, size = 1000, replace = TRUE)

(b) coinTosses <- sample(1:6, size = 1000, replace = TRUE)/3

Is the correct answer (a)?

19
Polling question - Answer

Yes!

coinTosses <- sample(1:2, size = 1000, replace = TRUE)

There are too many results in coinTosses to list, but we can use the
table function to display the numbers of 1’s and the number of 2’s:

table(coinTosses)

## coinTosses
## 1 2
## 506 494

We would expect about 500 of each type, although the exact counts vary
from one simulation to the next.

20
Character vectors

Scalars and vectors can be made up of strings of characters instead of
numbers. All elements of a vector must be of the same type. For example,
colors <- c("red", "yellow", "blue")

Just like numeric vectors, when you type the name of a character vector,
you see its contents:
colors

## [1] "red" "yellow" "blue"

21
Character vectors

We can create a new vector, more.colors, which adds new elements
(green, magenta, and cyan) to those in the colors vector:
more.colors <- c(colors, "green", "magenta", "cyan")

more.colors

## [1] "red" "yellow" "blue" "green" "magenta" "cyan"

22
Character vectors

Mixing numerics and characters coerces all elements to become
character type.

For example,
z <- c("red", "green", 1)

z # 1 has been converted to the character "1"

## [1] "red" "green" "1"
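
A quick way to confirm the coercion is to check the class of the vector.
The sketch below also converts the third element back to a number with
as.numeric(), which is needed before doing arithmetic on it.

```r
z <- c("red", "green", 1)

class(z)               # "character": the 1 is stored as the string "1"
as.numeric(z[3]) + 1   # convert back to a number before doing arithmetic
```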

23
Manipulating character vectors

There are two basic operations you might want to perform on character
vectors. To take substrings, use substr().

It takes arguments substr(x, start, stop), where x is a vector of
character strings, and start and stop say which characters to keep.

For example, to print the first two letters of each color use
substr(colors, 1, 2)

## [1] "re" "ye" "bl"

24
Manipulating character vectors

The substring() function is similar, but with slightly different
definitions of the arguments: see the help page ?substring.

The other basic operation is building up strings by concatenation.

Use the paste() function for this. For example,


paste(colors, "flowers")

## [1] "red flowers" "yellow flowers" "blue flowers"

25
Manipulating character vectors

There are two optional parameters to paste().

The sep parameter controls what goes between the components being
pasted together.

We might not want the default space, for example:


paste("several ", colors, "s", sep = "")

## [1] "several reds" "several yellows" "several blues"

26
Manipulating character vectors

The paste0() function is a shorthand way to set sep = "":


paste0("several ", colors, "s")

## [1] "several reds" "several yellows" "several blues"

The collapse parameter to paste() allows all the components of the


resulting vector to be collapsed into a single string:

paste("I like", colors, collapse = ", ")

## [1] "I like red, I like yellow, I like blue"
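
The sep and collapse parameters can also be used together, as in this
small sketch. The separator strings are arbitrary choices for
illustration.

```r
colors <- c("red", "yellow", "blue")

# collapse alone joins the elements of one vector into a single string
paste(colors, collapse = " and ")

# sep joins the pieces within each element; collapse then joins the elements
paste("color", 1:3, sep = "_", collapse = "; ")
```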

27
Polling question - Yes or No?

The result from executing rep(c("red", "green"), each = 3) is

(a) red, red, red, green, green, green

(b) red, green, red, green, red, green

(c) an error message.

Is the correct answer (a)?

28
Polling question - Answer

Yes!
rep(c("red", "green"), each = 3)

## [1] "red" "red" "red" "green" "green" "green"

29
Polling question - Yes or No?

The result from executing rep(c("red", "green"), c(5,6)) is

(a) red, green, red, green, red, green, red, green, red, green

(b) red, red, red, red, red, green, green, green, green, green, green

(c) an error message.

Is the correct answer (a)?

30
Polling question - Answer

No, (b) is correct


rep(c("red", "green"), c(5,6))

## [1] "red" "red" "red" "red" "red" "green" "green" "green"
## [9] "green" "green" "green"

31
Introduction to the R Programming Language (Cont’d)

What we’ll learn about today:

1. Factor objects
• What are they?
• What can we do with these objects?
• How do we create them?
2. More about data frames
3. Aggregating data - to calculate group means etc.
4. Some of the many functions built into R
5. Missing values and other special symbols
6. Handling special circumstances when reading in data

2
Factors

Factors offer an alternative way to store character data.

For example, a factor with four elements and having the two levels,
control and treatment can be created using:
grp <- c("control", "treatment", "control", "treatment")
grp

## [1] "control" "treatment" "control" "treatment"

grp <- factor(grp)


grp

## [1] control treatment control treatment


## Levels: control treatment

3
Factors

Factors can be an efficient way of storing character data when there are
repeats among the vector elements.

This is because the levels of a factor are internally coded as integers.

To see what the codes are for our factor, we can type

as.integer(grp)

## [1] 1 2 1 2

4
Factors

The labels for the levels are only stored once each, rather than being
repeated. The codes are indices of the vector of levels:
levels(grp)

## [1] "control" "treatment"

levels(grp)[as.integer(grp)]

## [1] "control" "treatment" "control" "treatment"
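
The integer codes depend on the order of the levels. By default the levels
are the sorted unique values, but a different order can be given through
the levels argument of factor(). The sketch below uses the hypothetical
names vals, f1 and f2.

```r
vals <- c("control", "treatment", "control", "treatment")

# Default: levels are the sorted unique values
f1 <- factor(vals)
levels(f1)
as.integer(f1)    # 1 2 1 2

# Explicit level order: "treatment" becomes level 1
f2 <- factor(vals, levels = c("treatment", "control"))
as.integer(f2)    # 2 1 2 1

# The labels seen by the user are the same in both cases
identical(as.character(f1), as.character(f2))
```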

5
Factors

The labels of the levels can be changed. For example, suppose we wish to
change the "control" label to "placebo".

Since "control" is the first level, we change the first element of the
levels(grp) vector:
levels(grp)[1] <- "placebo"

In this example, grp was a very small vector. The same command could
be used if grp had a large number of elements.

6
Polling question: a, b or c?

Recall our sample of coin tosses from the previous lecture:

coinTosses <- sample(1:2, size = 1000, replace = TRUE)

How do we convert the 1000 1’s and 2’s in coinTosses to a factor with
levels Tails and Heads?

(a) coinTossfactor <- factor(coinTosses)

(b) coinTossfactor <- factor(coinTosses)


levels(coinTossfactor) <- c(Tails, Heads)

(c) coinTossfactor <- factor(coinTosses)


levels(coinTossfactor) <- c("Tails", "Heads")

7
Polling question: Answer

(c)!

coinTossfactor <- factor(coinTosses)


levels(coinTossfactor) <- c("Tails", "Heads")

Compare the output of the summary() function for the original numeric
vector and the factor:

summary(coinTosses) # numeric summary

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 1.000 1.000 1.000 1.486 2.000 2.000

summary(coinTossfactor) # factor summary

## Tails Heads
## 514 486
8
Factors can have levels that are empty

An important use for factors is to list all possible values, even if some
are not present.

For example,
sex <- factor(c("F", "F"), levels = c("F", "M"))
sex

## [1] F F
## Levels: F M

shows that there are two possible values for sex, but only one is present
in our vector.
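
One place where empty levels matter is in tabulation: table() reports a
count of zero for a level that is not present, instead of leaving it out.
A minimal sketch:

```r
sex <- factor(c("F", "F"), levels = c("F", "M"))

table(sex)     # M appears in the table with a count of 0
nlevels(sex)   # number of possible values, not the number observed
```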

9
Extracting elements from factors and character vectors

As for numeric vectors, square brackets [] are used to index factor and
character vector elements.

For example, the factor grp has 4 elements, so we can print out the third
element by typing
grp[3]

## [1] placebo
## Levels: placebo treatment

10
Extracting elements from factors and character vectors

Recall
more.colors

## [1] "red" "yellow" "blue" "green" "magenta" "cyan"

We can access the second through fifth elements of more.colors as


follows:
more.colors[2:5]

## [1] "yellow" "blue" "green" "magenta"

11
Polling question: a, b or c?

How do we extract the first twenty elements of coinTossfactor?

(a) coinTossfactor[1:20]

(b) coinTossfactor[seq(1, 20, by = 1)]

(c) None of the above.

12
Polling question: Answer

(a) and (b) are both correct!


coinTossfactor[seq(1, 20, by = 1)]

## [1] Heads Tails Heads Tails Heads Heads Heads


## [8] Heads Heads Heads Tails Tails Tails Tails
## [15] Tails Tails Tails Tails Tails Tails
## Levels: Tails Heads

13
More information about data frames

Recall from the first lecture that data sets usually consist of more than
one column of data, where each column represents measurements of a
single variable.

Each row usually represents a single observation.

This format is referred to as case-by-variable format.

Most data sets are stored in R as data frames.

14
Data frames

An example is trees which contains the girth, height and volume of 31


cherry trees:
trees

## Girth Height Volume


## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
## 7 11.0 66 15.6
## 8 11.0 75 18.2
## 9 11.1 80 22.6
## 10 11.2 75 19.9
## 11 11.3 79 24.2
## 12 11.4 76 21.0
## 13 11.4 76 21.4
## 14 11.7 69 21.3
## 15 12.0 75 19.1
## 16 12.9 74 22.2
## 17 12.9 85 33.8
## 18 13.3 86 27.4
## 19 13.7 71 25.7
## 20 13.8 64 24.9
## 21 14.0 78 34.5
## 22 14.2 80 31.7
## 23 14.5 74 36.3
## 24 16.0 72 38.3
## 25 16.3 77 42.6
## 26 17.3 81 55.4
## 27 17.5 82 55.7
## 28 17.9 80 58.3
## 29 18.0 80 51.5
## 30 18.0 80 51.0
## 31 20.6 87 77.0

We have displayed the entire data frame, a practice not normally
recommended, since data frames can be very large, and not much can
be learned by scanning columns of numbers.
Data frames

Note that each row of this data frame corresponds to one of the trees, an
observation - measurements for a single tree.

Each column corresponds to the type of measurement, e.g. girth,


volume or height: the variables.

16
Data frames - viewing the data

Trying to look at the whole data frame all at once is not usually very
helpful. It is difficult to see patterns or unusual features in a large
collection of numbers.

Better ways to view the data are through the use of the summary()
function as shown below, or by constructing a pairwise scatterplot
obtained by executing the command plot(trees).
summary(trees)

## Girth Height Volume


## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
17
Data frames - viewing the data
plot(trees)

[Figure: pairwise scatterplot matrix of Girth, Height and Volume,
produced by plot(trees)]

These scatterplots indicate that volume tends to increase with height,
and it tends to increase with girth. Height and girth are not as clearly
related to each other.
18
Data frames - viewing the data

For larger data frames, a quick way of counting the number of rows and
columns is important. The functions nrow() and ncol() play this role.

We can get both at once using dim() (for dimension):


dim(trees)

## [1] 31 3

and can get summary information using str() (for structure):


str(trees)

## 'data.frame': 31 obs. of 3 variables:


## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

In fact, str() works with almost any R object, and is often a quick way
to find what you are working with.
19
Extracting data frame elements

We can extract elements from data frames using two indices. For
example, the value in the seventh row, second column is
trees[7, 2]

## [1] 66

20
Extracting data frame elements

Whole rows or columns of data frames may be selected by leaving one


index blank:
trees[1,] # this gives the first row of trees

## Girth Height Volume


## 1 8.3 70 10.3

trees[, 1] # this gives the first column of trees

## [1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1
## [10] 11.2 11.3 11.4 11.4 11.7 12.0 12.9 12.9 13.3
## [19] 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5
## [28] 17.9 18.0 18.0 20.6

21
Polling question: a, b or c?

Look back at the trees data frame and predict the output from the
following:

trees[3, 2]

(a) ## [1] 63

(b) ## [1] 10.3

(c) Neither of the above.

22
Polling question: Answer

(a)!
trees[3, 2]

## [1] 63

23
Polling question: a, b or c?

Look back at the trees data frame and predict the output from the
following:

trees[3, ]

(a) ## [1] 63

(b) ## Girth Height Volume


## 3 8.8 63 10.2

(c) Neither of the above.

24
Polling question: Answer

(b)!
trees[3, ]

## Girth Height Volume


## 3 8.8 63 10.2

25
Polling question: a, b or c?

Look back at the trees data frame and predict the output from the
following:
trees[4:7, 1]

(a) ## [1] 10.5 10.7 10.8 11.0

(b) ## [1] 11
(c) Neither of the above.

26
Polling question: Answer

(a)!
trees[4:7, 1]

## [1] 10.5 10.7 10.8 11.0

27
Data frame - accessing individual columns

Data frame columns can also be addressed by name using the $ operator.
For example, the Height column of trees can be extracted with
trees$Height.

We can also extract all heights for which the volumes are less than 17.5
using
trees$Height[trees$Volume < 17.5]

## [1] 70 65 63 72 66

28
Data frames - accessing individual columns

The with() function allows us to access columns of a data frame
directly without using the $.

For example, we can divide the volumes by the heights in the trees data
frame using
with(trees, Volume/Height)

## [1] 0.1471429 0.1584615 0.1619048 0.2277778


## [5] 0.2320988 0.2373494 0.2363636 0.2426667
## [9] 0.2825000 0.2653333 0.3063291 0.2763158
## [13] 0.2815789 0.3086957 0.2546667 0.3000000
## [17] 0.3976471 0.3186047 0.3619718 0.3890625
## [21] 0.4423077 0.3962500 0.4905405 0.5319444
## [25] 0.5532468 0.6839506 0.6792683 0.7287500
## [29] 0.6437500 0.6375000 0.8850575

See help(with) for more information.
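
For comparison, the same calculation can be written with the $ operator.
The sketch below checks that both forms give identical results; the names
ratio1 and ratio2 are assumptions for illustration.

```r
ratio1 <- with(trees, Volume / Height)    # columns referred to directly
ratio2 <- trees$Volume / trees$Height     # the same calculation using $

identical(ratio1, ratio2)
```

The with() form tends to be easier to read when an expression involves
several columns of the same data frame.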


29
Data frames - accessing individual columns and averaging

We saw that the mean is part of the output from the summary function.

What if we want to calculate the mean for a single column?

We can use $, together with the mean function, to do this.

For example, the average of the tree volumes is

mean(trees$Volume)

## [1] 30.17097

30
Polling question: a, b or c?

To find the average height in the trees data frame, we can use

(a) mean(trees$Height)

(b) mean(trees$height)
(c) Neither of the above.

31
Polling question: Answer

(a)!

mean(trees$Height)

## [1] 76

The column names in the trees data frame are all capitalized as you can
see if you use the names function:
names(trees)

## [1] "Girth" "Height" "Volume"

32
Data frames

Columns can be of different types from each other.

An example is the built-in chickwts data frame:


summary(chickwts)

## weight feed
## Min. :108.0 casein :12
## 1st Qu.:204.5 horsebean:10
## Median :258.0 linseed :12
## Mean :261.3 meatmeal :11
## 3rd Qu.:323.5 soybean :14
## Max. :423.0 sunflower:12

One column is of factor type while the other is numeric.

33
Data frames

If you want to see the first few rows of a data frame, you can use the
head() function:

head(chickwts)

## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean

The tail() function displays the last few rows.

34
Subsetting a data frame

If you want only the chicks who were fed horsebean, you can apply the
subset() function to the chickwts data frame:
chickHorsebean <- subset(chickwts, feed == "horsebean")
chickHorsebean

## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
## 7 108 horsebean
## 8 124 horsebean
## 9 143 horsebean
## 10 140 horsebean
35
Subsetting a data frame

You can now apply the summary function to this new data frame:

summary(chickHorsebean$weight)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 108.0 137.0 151.5 160.2 176.2 227.0

Specifically obtaining the average weight for this type of feed is also
possible:

mean(chickHorsebean$weight)

## [1] 160.2

36
Polling question: a, b or c?

If we want to only consider the observations in the chickwts data set
for which the weight is less than 145, we would use the following code:

(a) subset(chickwts, weight < 145)

(b) subset(chickwts$weight < 145)


(c) Neither of the above.

37
Polling question: Answer

(a)!

subset(chickwts, weight < 145)

## weight feed
## 3 136 horsebean
## 7 108 horsebean
## 8 124 horsebean
## 9 143 horsebean
## 10 140 horsebean
## 14 141 linseed

38
Aggregating data according to factor levels

We would like to calculate the average weight for each type of feed in
order to make a comparison – e.g. which feed leads to the highest
weight?

The aggregate function allows us to calculate statistics such as the
average for all values within different groups (i.e. different factor
levels).

The syntax of the aggregate function is:

aggregate(measurements ~ myfactor, data = mydata,
FUN = myfun)

The first term is a model formula which relates the measurements
(weight in our example) with the factor (feed in our example). The FUN
argument allows us to specify what function is to operate on each of the
groupings of data.
39
Aggregating data according to factor levels

We plug in the appropriate quantities as:

aggregate(weight ~ feed, data = chickwts, FUN = mean)

## feed weight
## 1 casein 323.5833
## 2 horsebean 160.2000
## 3 linseed 218.7500
## 4 meatmeal 276.9091
## 5 soybean 246.4286
## 6 sunflower 328.9167

This analysis suggests that sunflower might be the most effective at
increasing the weights of the chicks.*

* A more thorough treatment of this question would be through the Analysis of Variance
(ANOVA), which is outside the scope of this course.
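
As a small follow-up sketch: the aggregated result is itself a data frame, so the feed with the highest average weight can be picked out directly. The object name feedMeans is an assumption introduced here for illustration only.

```r
# Average weight for each feed type, stored for further use
feedMeans <- aggregate(weight ~ feed, data = chickwts, FUN = mean)

# Row with the largest average weight
feedMeans[which.max(feedMeans$weight), ]

# All feeds, ordered from highest to lowest average weight
feedMeans[order(feedMeans$weight, decreasing = TRUE), ]
```

The same pattern works with other summary functions, e.g. FUN = median.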
40
Polling question: a, b or c?

The iris data frame contains measurements on the petals of samples
of 3 species of flowers (setosa, versicolor, and virginica). One of the
columns is called Petal.Length and another is called Species. If we
want to compute the average petal length for each of the three species,
we can type

(a) aggregate(Petal.Length = Species, data = iris,
    FUN = mean)

(b) aggregate(Petal.Length ~ Species, data = iris,
    FUN = mean)
(c) Neither of the above.

41
Polling question: Answer

(b)!

aggregate(Petal.Length ~ Species, data = iris, FUN = mean)

## Species Petal.Length
## 1 setosa 1.462
## 2 versicolor 4.260
## 3 virginica 5.552

The = does not work in the model formula!

42
Taking random samples from populations

The sample() function can be used to take samples (with or without
replacement) from larger finite populations.

Suppose, for example, that we have a data frame called fluSurvey
consisting of 15000 entries, and we would like to randomly select 8
entries (without replacement) for detailed study.

43
Taking random samples from populations

If the entries have been enumerated (say, by the use of an ID index) from
1 through 15000, we could select the 8 numbers with
sampleID <- sample(1:15000, size = 8, replace = FALSE)
sampleID

## [1] 5543 1992 1553 166 12353 11452 2385
## [8] 889

The above numbers have been chosen randomly (or at least
approximately so), and the random rows of fluSurvey, a supposedly
existing data frame, can now be extracted with
fluSample <- fluSurvey[sampleID,]

The result is a new data frame consisting of 8 rows and the same
number of columns as fluSurvey.
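
Since fluSurvey is hypothetical, here is a sketch of the same idea using the built-in chickwts data frame, so that it can actually be run. The seed value and the name chickSample are assumptions made for illustration; set.seed() is only there so that the same rows are drawn each time the code is run.

```r
set.seed(101)  # arbitrary seed, chosen only for reproducibility

# Choose 8 distinct row numbers from 1, 2, ..., nrow(chickwts)
sampleID <- sample(1:nrow(chickwts), size = 8, replace = FALSE)

# Extract those rows; all columns are kept
chickSample <- chickwts[sampleID, ]
dim(chickSample)  # 8 rows, 2 columns
```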
44
Polling question: a, b or c?

We would like to take a random sample of 10 rivers from the 141 whose
lengths are recorded in rivers. The code to do this is

(a) whichRivers <- sample(1:141, size = 10, replace = FALSE)
    rivers[whichRivers]

(b) sample(rivers, size = 10, replace = FALSE)


(c) Neither of the above.

45
Polling question: Answer

(a) and (b) are both correct!

sample(rivers, size = 10, replace = FALSE)

## [1] 780 870 618 270 360 360 215 329 1885
## [10] 420

46
Constructing data frames

Use the data.frame() function to construct data frames from vectors
that already exist in your workspace:
y <- c(2, 4, 5); x <- c(5, 2, 3)

xy <- data.frame(x, y)
xy

## x y
## 1 5 2
## 2 2 4
## 3 3 5

For another example, consider


xynew <- data.frame(x, y, new = 3:1)

47
Data frames can have non-numeric columns

As an example, consider the following data that might be used as a
baseline in an obesity study:

gender <- c("M", "M", "F", "F", "F")
weight <- c(73, 68, 52, 69, 64)
obesityStudy <- data.frame(gender, weight)

48
Data frames can have non-numeric columns

The vector gender is clearly a character vector:

obesityStudy$gender

## [1] "M" "M" "F" "F" "F"

49
Some built-in functions

Summary statistics can be calculated for data stored in vectors. In
particular, try
summary(x) # computes several summary statistics on the data in x
length(x) # number of elements in x
min(x) # minimum value of x
max(x) # maximum value of x
pmin(x, y) # pairwise minima of corresponding elements of x and y
pmax(x, y) # pairwise maxima of x and y
range(x) # minimum and maximum of the data in x
IQR(x) # interquartile range: difference between 1st and 3rd
# quartiles of data in x
sd(x) # computes the standard deviation of the data in x
var(x) # computes the variance of the data in x
diff(x) # successive differences of the values in x
sort(x) # arranges the elements of x in ascending order

52
Some built-in functions: pmin

For an example of the calculation of pairwise minima of two vectors,
consider
x <- 1:5
y <- 7:3
x

## [1] 1 2 3 4 5

y

## [1] 7 6 5 4 3

pmin(x,y)

## [1] 1 2 3 4 3

53
Polling question: (a), (b) or (c)?

Let’s calculate the pairwise maxima of the two vectors,
x

## [1] 1 2 3 4 5

y

## [1] 7 6 5 4 3

pmax(x,y)

What is the output?

(a) 7 6 5 4 5
(b) 7 7 7 7 7
(c) 7 6 5 4 3
54
Polling question: Answer

(a)!
pmax(x,y)

## [1] 7 6 5 4 5

55
Some built-in functions: median

The sample median measures the middle value of a data set.

It is either the average of the middle two measurements (when the
sample size is even) or the middle measurement (when the sample size
is odd).
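
To make the two cases concrete, here is a minimal sketch with made-up numbers (the vector names oddSample and evenSample are assumptions for this illustration):

```r
oddSample <- c(3, 1, 4)       # sorted: 1 3 4
median(oddSample)             # the middle value: 3

evenSample <- c(3, 1, 4, 1)   # sorted: 1 1 3 4
median(evenSample)            # the average of the middle two values, 1 and 3: 2
```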

The median of the New Haven (USA) average annual temperatures is

median(nhtemp)

## [1] 51.2

56
Comparing two data sets with the median

We can compare the medians for the New Haven temperatures between
1920 and 1939 with those of 1952 to 1971 by extracting the appropriate
subsets from the nhtemp vector.

Since the first observation was in 1912, the 9th must be in 1920 and the
28th in 1939. The 41st is in 1952 and the 60th is in 1971.

The medians of the New Haven temperatures for 1920 through 1939, and
1952 through 1971 are:
median(nhtemp[9:28])

## [1] 50.75

median(nhtemp[41:60])

## [1] 51.85
57
A questionable form of data analysis

We might be interested in comparing New Haven temperatures to those
of Nottingham, England (monthly values, between 1920 and 1939) which
can be found in the nottem object.

median(nottem) # median Nottingham temperatures

## [1] 47.35

median(nhtemp) # median New Haven temperatures

## [1] 51.2

It appears to be warmer in New Haven than in Nottingham...

58
Polling question: a, b or c?

Why is the analysis on the previous slide possibly misleading?


(a) If there has been warming over the years in question, the data set
that includes later years could have a larger median value.
(b) The use of monthly averages could lead to a different overall
median value than the use of annual averages.
(c) There is no problem.

59
Polling question: Answer

(a) and (b) are both correct!

(a) It is better to compare the temperatures at the two locations over the
same time period.

(b) Consider 2 years of artificial monthly data from the planet Xenon:
x <- rep(c(rep(0,11), 12), 2); x

## [1] 0 0 0 0 0 0 0 0 0 0 0 12 0 0 0
## [16] 0 0 0 0 0 0 0 0 12

Both yearly averages are 1 (12/12), so the median of the 2 yearly averages is 1.

The median of the 24 monthly values is 0 (sort them and take the average of
the middle 2 values).

So it makes a difference whether you use months or years as your basis
for comparison.
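
The Xenon calculation can be checked in R. This is only a sketch: the year labels 1 and 2 and the object name yearly are assumptions introduced for the illustration.

```r
x <- rep(c(rep(0, 11), 12), 2)     # 24 monthly values
year <- rep(1:2, each = 12)        # year label for each month

# Yearly averages, then their median
yearly <- aggregate(x ~ year, data = data.frame(x, year), FUN = mean)
median(yearly$x)   # 1

# Median of the monthly values directly
median(x)          # 0
```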
60
A proper comparison: aggregating data

In order to compare the New Haven temperatures with the
Nottingham temperatures, we need to convert the monthly averages in
Nottingham to yearly averages.

We can do this with the aggregate function.

The nottem data vector contains 240 monthly averages, for the years
1920 through 1939.

We would like to compute averages for each year (12 months each), so
we first create a vector of the 20 years, each repeated 12 times, so that we
have a vector of length 240 which matches the nottem vector:
year <- rep(1920:1939, each = 12)

61
A proper comparison: aggregating data

Next, we do the aggregating with the mean function (FUN):

nottinghamtemp <- aggregate(nottem ~ year, FUN = mean)

Here, we are using a formula which relates the values in nottem to the
values in year.

We don’t need to specify a data frame because the objects nottem and
year are located in the workspace.

62
A proper comparison: aggregating data

The result is the 20 average temperatures:


nottinghamtemp

## year nottem
## 1 1920 48.89167
## 2 1921 50.73333
## 3 1922 47.27500
## 4 1923 47.81667
## 5 1924 48.72500
## 6 1925 48.45833
## 7 1926 49.36667
## 8 1927 48.36667
## 9 1928 48.99167
## 10 1929 48.13333
## 11 1930 49.15000
## 12 1931 48.13333
## 13 1932 49.01667
## 14 1933 50.19167
## 15 1934 50.31667
## 16 1935 49.85833
## 17 1936 48.60000
## 18 1937 49.10000
## 19 1938 50.27500
## 20 1939 49.39167

Note that the result is a data frame with columns year and nottem.
The second column contains the yearly averages.
63
A proper comparison: aggregating data

We can now compare New Haven annual average temperatures with
Nottingham annual average temperatures – for 1920 through 1939 – as
follows:

median(nottinghamtemp$nottem)

## [1] 49.00417

median(nhtemp[9:28]) # 1920 - 1939 values

## [1] 50.75

New Haven temperatures are marginally higher (in median), but the
difference is not nearly as large as the earlier analysis suggested.

64
The range and the range statistic

The range statistic measures how spread out the distribution of
measurements is: it is the difference between the minimum and
maximum values.

We can calculate the range of the New Haven temperatures in two steps.
First we use the range function to calculate the minimum and maximum
values:
range(nhtemp)

## [1] 47.9 54.6

Then we compute the difference, using the diff function:


diff(range(nhtemp))

## [1] 6.7
65
Comparing ranges

We can also compute the range statistic of the Nottingham annual
averages by extracting the nottem column from nottinghamtemp and
applying the range and diff functions:
diff(range(nottinghamtemp$nottem))

## [1] 3.458333

diff(range(nhtemp[9:28])) # 1920-1939 values

## [1] 4.4

The range for the Nottingham average temperatures is less than the one
for New Haven.

Temperatures in Nottingham seem to be less variable than in New
Haven. This means that this aspect of the weather is more predictable in
Nottingham than in New Haven.
66
Outliers and the interquartile range (IQR)

The range statistic is not often used to compare spread, because it can
be distorted by unusual values - called outliers.

For example, suppose you have 100 numbers which are between 5 and
15, except for one which is 99. And somebody else has 100 numbers
which are all between 0 and 20.

Your range is 94 and the other person’s range is 20. But which data set
is really more spread out? With the exception of a single data point, the
other person’s is more spread out, so it would be better to have a
statistic that would not be influenced so strongly by one data point.

The interquartile range (IQR) of a data set conveys the same kind of
information as the range: how spread out is the distribution of values?
But it does it in a way that is a lot less sensitive to outlying values.
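
The scenario above can be imitated with artificial numbers. The following is a sketch under stated assumptions: the values are evenly spaced rather than real data, and the names mine and theirs are invented for the illustration.

```r
# 99 evenly spaced values between 5 and 15, plus one outlier at 99
mine <- c(seq(5, 15, length.out = 99), 99)

# 100 evenly spaced values between 0 and 20
theirs <- seq(0, 20, length.out = 100)

diff(range(mine))    # 94, driven by the single outlier
diff(range(theirs))  # 20

IQR(mine)            # smaller than IQR(theirs): the outlier has little effect
IQR(theirs)
```

The range statistic ranks the first data set as far more spread out, while the IQR ranks it as less spread out, which matches the informal judgement made above.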

67
Interquartile range (IQR)

The interquartile range (IQR) is the difference between the 75th
percentile (3rd quartile) and the 25th percentile (1st quartile) of the data.

Let’s compare the IQR for the two temperature data sets:

IQR(nottinghamtemp$nottem)

## [1] 1.072917

IQR(nhtemp[9:28])

## [1] 1.425

We still see that the Nottingham temperatures are less variable than the
New Haven temperatures.

68
Standard deviation (sd)

Another statistic that is used to measure variability is the standard
deviation (sd). It can be interpreted in the same way as the IQR and the
range statistic.

Let’s compare the sd for the two temperature data sets:

sd(nottinghamtemp$nottem)

## [1] 0.9069859

sd(nhtemp[9:28])

## [1] 1.06524

We see that the Nottingham temperatures are less variable than the New
Haven temperatures by this criterion as well.
69
Polling question: (a), (b) or (c)?

A lecturer calculated the range statistic for a data set to be −7.5. This
means

(a) the data set has negative variability.


(b) the data set has imaginary numbers.
(c) the lecturer made an error in the calculation.

70
Polling question: Answer

(c)!

It is not possible for the range to be negative, since you are subtracting
a smaller number (minimum) from a larger number (maximum). If the
minimum and maximum are the same, then all data points are equal and
the range is 0.

71
Polling question: (a), (b) or (c)?

The IQR for a set of measurements on the time to recovery from
COVID-19 was found to be 17 days in one country and 9 days in a
second country. This means

(a) people in the first country recover faster than people in the second
country.
(b) recovery time in the first country is less predictable than recovery
time in the second country.
(c) none of the above.

72
Polling question: Answer

(b)!

The IQR measures the variability in the measurements. The higher the
variability, the less predictable an individual measurement would be. In
this case, the first country has more variability in recovery times than
the second country, so recovery times are harder to predict.

73
Missing values and other special values

The missing value symbol is NA.

Consider the data frame table.b3 in the MPV package:


library(MPV) # load the package

This data frame contains data on gas mileage for a number of cars and
additional characteristics such as horsepower and weight and so on.
The variable x3 contains the measurements on torque for each car.

The output from the summary function applied to table.b3$x3 is


summary(table.b3$x3)

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 81.0 171.2 243.0 217.9 258.8 366.0 2

This gives us the summary information for all non-missing values of the
torque measurement and also tells us that 2 of the values are missing.
74
Missing values and other special values

As the example indicates, missing values often arise in real data
problems, but they can also arise because of the way calculations are
performed.

some.evens <- NULL # creates a vector with no elements
some.evens[seq(2, 10, 2)] <- seq(2, 10, 2)
some.evens

## [1] NA 2 NA 4 NA 6 NA 8 NA 10

What happened here is that we assigned values to elements 2, 4, ..., 10
but never assigned anything to elements 1, 3, ..., 9, so R uses NA to
signal that the value is unknown.

75
Missing values and other special values

Consider the following:


x <- c(0, 1, 2)
x / x

## [1] NaN 1 1

The NaN symbol denotes a value which is ‘not a number’, which arises as
a result of attempting to compute the indeterminate 0/0.

This symbol is sometimes used when a calculation does not make
sense.

76
Missing values and other special values

In other cases, special values may be shown, or you may get an error or
warning message:
1 / x

## [1] Inf 1.0 0.5

Here R has tried to evaluate 1/0 and reports the infinite result as Inf.

77
Missing values and other special values

When there may be missing values, the is.na() function should be
used to detect them. For instance,

is.na(some.evens)

## [1] TRUE FALSE TRUE FALSE TRUE FALSE
## [7] TRUE FALSE TRUE FALSE

The result is a “logical vector”.

78
Missing values and other special values

The ! symbol means “not”, so we can locate the non-missing values in
some.evens as follows:

!is.na(some.evens)

## [1] FALSE TRUE FALSE TRUE FALSE TRUE
## [7] FALSE TRUE FALSE TRUE

79
Missing values and other special values

We can then display the even numbers only:

some.evens[!is.na(some.evens)]

## [1] 2 4 6 8 10

Here we have used logical indexing. (More on this later.)
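
Two related idioms may be useful here. This is a sketch, not part of the original example: it counts the missing values with sum() and uses the na.rm argument of mean() to leave them out of a calculation.

```r
some.evens <- NULL
some.evens[seq(2, 10, 2)] <- seq(2, 10, 2)

sum(is.na(some.evens))           # number of missing values: 5
mean(some.evens)                 # NA, because some values are missing
mean(some.evens, na.rm = TRUE)   # 6, the mean of 2, 4, 6, 8, 10
```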

80
Reading data from an external file containing missing values

When reading in a file with columns separated by blanks with blank
missing values, you can use code such as

dataset1 <- read.table("file1.txt", header=TRUE,
                       sep=" ", na.strings=" ")

This tells R that the blank spaces should be read in as missing values.

81
Reading data from an external file containing missing values

Observe the contents of dataset1:

dataset1

## x y z
## 1 3 4 NA
## 2 51 48 23
## 3 23 33 111

Note the appearance of NA. This is a missing value.
Functions such as is.na() are important for detecting missing values in
vectors and data frames.

82
Reading data into a data frame from an external file

Sometimes, external software exports data files that are tab-separated.
When reading in a file with columns separated by tabs with blank
missing values, you could use code like

dataset2 <- read.table("file2.txt", header=TRUE,
                       sep="\t", na.strings=" ")

Again, observe the result:

dataset2

## x y z
## 1 33 223 NA
## 2 32 88 2
## 3 3 NA NA

If you need to skip the first 3 lines of a file to be read in, use the skip=3
argument.
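
Because file1.txt and file2.txt are not available here, the following self-contained sketch writes a small tab-separated file to a temporary location and reads it back. The file contents, the three note lines and the name dataset3 are all invented for this illustration, and empty fields (rather than blanks) are used to mark the missing values.

```r
# Write a small tab-separated file with 3 lines of notes before the header.
# The file contents are invented for this example.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Exported by some hypothetical software",
             "Units: arbitrary",
             "Run 1",
             "x\ty\tz",
             "33\t223\t",
             "32\t88\t2",
             "3\t\t"), tmp)

# Skip the 3 note lines; empty fields are read in as NA
dataset3 <- read.table(tmp, header = TRUE, sep = "\t",
                       na.strings = "", skip = 3)
dataset3
is.na(dataset3)   # logical matrix marking the missing entries
```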

83
Digging More Deeply into R and Computing

What we’ll learn about:

1. Logical vectors and relational operators
• Boolean algebra
• Logical operations
• Relational operators
2. Data storage
• Computer arithmetic and binary representations of numbers
• Exact storage
3. Dates, times and the time series object

2
Logical Vectors and relational operators

Logical vectors can contain only three types of elements: TRUE and
FALSE, as well as NA for missing.

They can be created in many ways, including by the use of the c
function:
eg1.logical <- c(TRUE, FALSE, FALSE, NA)
eg1.logical

## [1] TRUE FALSE FALSE NA

The rep function can also be used:


eg2.logical <- c(rep(TRUE, 2), rep(FALSE, 4), rep(NA, 2))
eg2.logical

## [1] TRUE TRUE FALSE FALSE FALSE FALSE NA NA


3
Logical vectors and relational operators

They can also be created by using relational operators, e.g., >, < and ==.
x <- 1:8
x

## [1] 1 2 3 4 5 6 7 8

eg3.logical <- (x > 5)


eg3.logical

## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE

eg4.logical <- (x == 5)
eg4.logical

## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE

4
Logical vectors and relational operators

The %in% operator tests whether elements of one vector can be found in
another vector.
y <- seq(-5, 15, 4)
y

## [1] -5 -1 3 7 11 15

x

## [1] 1 2 3 4 5 6 7 8

y %in% x

## [1] FALSE FALSE TRUE TRUE FALSE FALSE

The third and fourth elements of y are somewhere in x.


5
Logical vectors and relational operators

Note that y %in% x is not the same as x %in% y.
y

## [1] -5 -1 3 7 11 15

x

## [1] 1 2 3 4 5 6 7 8

x %in% y

## [1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE

The third and seventh elements of x are somewhere in y.
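
As a brief sketch of how the two results relate: the logical vectors differ in length and position, but the matching values themselves are the same. The use of which() to report positions is an addition made for this illustration.

```r
x <- 1:8
y <- seq(-5, 15, 4)

x[x %in% y]       # elements of x that also appear in y: 3 7
y[y %in% x]       # elements of y that also appear in x: 3 7
which(x %in% y)   # their positions in x: 3 7
which(y %in% x)   # their positions in y: 3 4
```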

6
Polling question - (a), (b), (c) or (d)?

Consider vectors aa and bb as follows:

aa <- c("d", "dd", "ff")
bb <- c("dd", "f", "ff")

What is the output from

aa %in% bb

(a) ## [1] FALSE TRUE TRUE

(b) ## [1] TRUE FALSE TRUE

(c) ## [1] TRUE FALSE FALSE


(d) none of the above

7
Polling question - Answer

(a)!

aa %in% bb

## [1] FALSE TRUE TRUE

The second and third elements of aa can be found somewhere in bb.

8
Practical use of the %in% operator

Example:

The cuckoos data frame in the DAAG package has measurements of the
eggs laid by cuckoos in the nests of birds of other species. Here is the
basic summary:

library(DAAG)
summary(cuckoos[, -4]) # the 4th column is not needed

## length breadth species


## Min. :19.60 Min. :15.00 hedge.sparrow:14
## 1st Qu.:21.90 1st Qu.:16.20 meadow.pipit :45
## Median :22.35 Median :16.60 pied.wagtail :15
## Mean :22.45 Mean :16.55 robin :16
## 3rd Qu.:23.23 3rd Qu.:17.00 tree.pipit :15
## Max. :25.00 Max. :17.50 wren :15

9
Practical use of the %in% operator

If we only want to study the measurements of the eggs laid in the robin
and wren nests, we can subset the data with the %in% operator as in
cuckooWrenRobin <- subset(cuckoos[, -4],
species %in% c("robin", "wren"))

Summarize the result:


summary(cuckooWrenRobin)

## length breadth species


## Min. :19.80 Min. :15.00 hedge.sparrow: 0
## 1st Qu.:21.00 1st Qu.:15.90 meadow.pipit : 0
## Median :22.00 Median :16.00 pied.wagtail : 0
## Mean :21.86 Mean :16.15 robin :16
## 3rd Qu.:22.50 3rd Qu.:16.40 tree.pipit : 0
## Max. :23.90 Max. :17.20 wren :15
The new data frame only has observations on the robins and wrens.
10
Polling question ... (a) or (b)?

Recall the chickwts data that has weight measurements on samples of
chicks that have been fed different types of feed.
summary(chickwts)

## weight feed
## Min. :108.0 casein :12
## 1st Qu.:204.5 horsebean:10
## Median :258.0 linseed :12
## Mean :261.3 meatmeal :11
## 3rd Qu.:323.5 soybean :14
## Max. :423.0 sunflower:12

Which of the following will give us a subset containing only the
samples fed on horsebean or sunflower?

(a) subset(chickwts, c("horsebean", "sunflower") %in% feed)
(b) subset(chickwts, feed %in% c("horsebean", "sunflower"))

11
Polling question ... Answer

(b)!

chick.sub <- subset(chickwts,
                    feed %in% c("horsebean", "sunflower"))
summary(chick.sub)

## weight feed
## Min. :108.0 casein : 0
## 1st Qu.:162.0 horsebean:10
## Median :261.0 linseed : 0
## Mean :252.2 meatmeal : 0
## 3rd Qu.:331.0 soybean : 0
## Max. :423.0 sunflower:12
Only the horsebean and sunflower measurements remain in the new data set.

12
Not: turning TRUE into FALSE

(or FALSE into TRUE)

The ! operator turns logical elements into their opposites.*

Example:
x <- c(TRUE, FALSE, TRUE)
!x

## [1] FALSE TRUE FALSE

* But it leaves missing values as missing.


13
Not: turning TRUE into FALSE

Returning to our cuckoos example, we might want to study the
measurements of the eggs laid anywhere but in the robin and wren
nests. Then we can subset the data with the ! operator as in
cuckooNotWrenRobin <- subset(cuckoos[, -4],
!(species %in% c("robin", "wren")))

summary(cuckooNotWrenRobin)

## length breadth species


## Min. :19.60 Min. :15.80 hedge.sparrow:14
## 1st Qu.:22.00 1st Qu.:16.30 meadow.pipit :45
## Median :22.60 Median :16.80 pied.wagtail :15
## Mean :22.66 Mean :16.69 robin : 0
## 3rd Qu.:23.40 3rd Qu.:17.00 tree.pipit :15
## Max. :25.00 Max. :17.50 wren : 0
The new data frame has observations on all but the robins and wrens.
14
Polling question ... (a) or (b) ?

The code

rivers[!(rivers < 500)]

will give

(a) values that are not larger than 500.

(b) values that are not smaller than 500.

15
Polling question ... answer

(b)!

rivers[!(rivers < 500)]

## [1] 735 524 1459 600 870 906 1000 600 505
## [10] 1450 840 1243 890 525 720 850 630 730
## [19] 600 710 680 570 560 900 625 2348 1171
## [28] 3710 2315 2533 780 760 618 981 1306 500
## [37] 696 605 1054 735 1270 545 1885 800 538
## [46] 1100 1205 610 540 1038 620 652 900 525
## [55] 529 500 720 671 1770

16
Boolean algebra

To understand how R handles TRUE and FALSE, we need to understand
Boolean algebra.

Consider a collection of statements which may be either true or false.

We represent a statement by a letter or variable, e.g. A is the statement
that the sky is clear, and B is the statement that it is raining.

Depending on the weather where you are, those two statements may
both be true (there is a “sunshower”), A may be true and B false (the
usual clear day), A false and B true (the usual rainy day), or both may be
false (a cloudy but dry day).

17
Boolean algebra

Boolean algebra tells us how to evaluate the truth of compound
statements, or collections of statements considered together.

For example, “A and B” is the statement that it is both clear and raining.
(This might be true with a small amount of cloud overhead - conditions
for a “sunshower”).

“A or B” says that it is clear or it is raining, or both: anything but the
cloudy dry day.

This is sometimes called an inclusive or, to distinguish it from the
exclusive or “A xor B”, which says that it is either clear or raining, but
not both.

18
Boolean algebra

There is also the “not A” statement, which says that it is not clear.

There is a very important relation between Boolean algebra and set
theory.

If we interpret A and B as sets, then we can think of “A and B” as the
set of elements which are in A and are in B, i.e. the intersection A ∩ B.

Similarly “A or B” can be interpreted as the set of elements that are in A
or are in B, i.e. the union A ∪ B.

Finally, “not A” is the complement of A, i.e. $A^c$.

19
Boolean algebra - example

Let A be the set of animals, and B be the set of objects with four legs.

A spider is in set A but not in set B, since it has eight legs.

A dining room table has four legs so it is in set B, but it is not in set A.

A dog is in set A and in set B.

To summarize:

• spider ∈ A, spider ∉ B, spider ∈ A ∪ B, and spider ∉ A ∩ B.
• table ∉ A, table ∈ B, table ∈ A ∪ B, and table ∉ A ∩ B.
• dog ∈ A, dog ∈ B, dog ∈ A ∪ B, and dog ∈ A ∩ B.
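
The same membership statements can be explored in R with the set functions intersect() and union() together with %in%. This is a minimal sketch: the sets below contain only the three objects from the example, which is an assumption made to keep the illustration small.

```r
# Small illustrative sets (membership lists invented for this example)
A <- c("spider", "dog")    # animals
B <- c("table", "dog")     # objects with four legs

intersect(A, B)            # in A and in B: "dog"
union(A, B)                # in A or in B: "spider" "dog" "table"

"spider" %in% A                 # TRUE
"spider" %in% intersect(A, B)   # FALSE
"table" %in% union(A, B)        # TRUE
```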

20
Polling question - (a), (b), (c) or (d)?

Assume A and B are as in the previous slide. Which one of the following
statements is true?

(a) apple tree ∈ A

(b) apple tree ∉ A ∩ B

(c) apple tree ∈ B

(d) apple tree ∈ A ∪ B.

21
Polling question - Answer

(b)!

An apple tree is not an animal and does not have four legs, so it is
definitely not a four-legged animal.

22
Polling question - (a), (b), (c) or (d)?

A is the set of positive numbers, and B is the set of odd numbers.

Which one of the following statements is true?

(a) 4 ∈ B

(b) 4 ∈ A ∩ B

(c) 3 ∉ B

(d) 3 ∈ A ∩ B

23
Polling question - Answer

(d)!

3 is positive and 3 is odd.

24
Boolean algebra

Because there are only two possible values (true and false), we can
record all Boolean operations in a table.

On the first line, we list the basic Boolean expressions, on the second
line the equivalent way to code them in R, and in the body of the table
the results of the operations.
Boolean   A      B      not A  not B  A and B  A or B
R         A      B      !A     !B     A & B    A | B
          TRUE   TRUE   FALSE  FALSE  TRUE     TRUE
          TRUE   FALSE  FALSE  TRUE   FALSE    TRUE
          FALSE  TRUE   TRUE   FALSE  FALSE    TRUE
          FALSE  FALSE  TRUE   TRUE   FALSE    FALSE
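
The table can be reproduced in R by applying the operators to two logical vectors that run through all four combinations. This is a sketch; the column names and the extra xor() column (the exclusive or mentioned earlier) are additions made for illustration.

```r
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)

# One row per combination of A and B, one column per operation
truthTable <- data.frame(A, B,
                         notA = !A, notB = !B,
                         AandB = A & B, AorB = A | B,
                         AxorB = xor(A, B))
truthTable
```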

25
Logical operations in R

One of the basic types of vector in R holds logical values. For example,
a logical vector may be constructed as
a <- c(TRUE, FALSE, FALSE, TRUE)

The result is a vector of 4 logical values. Logical vectors may be used as
indices:
b <- c(13, 7, 8, 2)
b[a]

## [1] 13 2

The elements of b corresponding to TRUE are selected.

26
Logical operations in R

If we attempt arithmetic on a logical vector, e.g.

sum(a)

## [1] 2

then the operations are performed after converting FALSE to 0 and TRUE
to 1, so by summing we count how many occurrences of TRUE are in the
vector.
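
A common practical use of this conversion is counting how many observations satisfy a condition. The sketch below applies it to the built-in rivers data (lengths in miles); the name longRivers is an assumption for the illustration.

```r
longRivers <- rivers >= 500   # logical vector: TRUE where length is at least 500

sum(longRivers)    # how many rivers are at least 500 miles long
mean(longRivers)   # proportion of such rivers
```

Taking the mean works because the average of a vector of 0s and 1s is the fraction of 1s.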

27
Logical operations in R

There are two versions of the Boolean operators. The usual versions are
&, | and !, as discussed earlier. These are all vectorized, so we see for
example
!a

## [1] FALSE TRUE TRUE FALSE

28
Logical operations in R

If we attempt logical operations on a numerical vector, 0 is taken to be
FALSE, and any non-zero value is taken to be TRUE:
a & (b - 2)

## [1] TRUE FALSE FALSE FALSE

29
Relational operators

It is often necessary to test relations when programming. R allows for
equality and inequality relations to be tested using the relational
operators: <, >, ==, >=, <=, !=.

Be careful with tests of equality. Because R works with only a limited
number of decimal places, rounding error can accumulate, and you may
find surprising results, such as 49 * (4 / 49) not being equal to 4.
x <- 49*(4/49)
y <- 4
x == y

## [1] FALSE

all.equal(x, y) # this function tests approximate equality

## [1] TRUE
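
When the result of all.equal() is needed as a single TRUE or FALSE, it is commonly wrapped in isTRUE(), because all.equal() returns a description of the difference rather than FALSE when the values differ. The sketch below also shows the same idea written by hand; the tolerance 1e-8 is an arbitrary choice made for this illustration.

```r
x <- 49 * (4 / 49)
y <- 4

isTRUE(all.equal(x, y))   # TRUE: equal up to a small tolerance
abs(x - y) < 1e-8         # TRUE: the same idea written out by hand
```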
30
Relational operators

Some simple examples involving the first 6 years of New Haven
temperature data:
nhtemp6 <- nhtemp[1:6]
nhtemp6

## [1] 49.9 52.3 49.4 51.1 49.4 47.9

nhtemp6 > 51 # which elements exceed 51

## [1] FALSE TRUE FALSE TRUE FALSE FALSE

nhtemp6 == 51 # which elements are exactly equal to 51

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

nhtemp6 >= 51 # which elements are greater than or equal to 51

## [1] FALSE TRUE FALSE TRUE FALSE FALSE

31
Relational operators

More examples:
threeM <- c(3, 6, 9)
threeM != 6 # which elements are not equal to 6

## [1] TRUE FALSE TRUE

threeM[threeM > 4] # elements which are greater than 4

## [1] 6 9

four68 <- c(4, 6, 8)
four68 > threeM # four68 elements which exceed threeM

## [1] TRUE FALSE FALSE

four68[threeM < four68] # print them

## [1] 4
32
Data Storage in R

33
Approximate Storage of Numbers

One important distinction in computing is between exact and
approximate results.

It is possible in a computer to represent any rational number exactly, but
it is more common to use approximate representations: usually
floating point representations.

These are a binary (base-two) variation on scientific notation.

34
Approximate Storage of Numbers

For example, we might write a number to three significant digits in
scientific notation as $4.93 \times 10^{-2}$.

This representation of a number could represent any true value between
0.04925 and 0.04935.

Standard floating point representations on computers are similar, except
that a power of 2 would be used rather than a power of 10, and the
fraction would be written in binary notation.

The number above would be written as $1.1001_2 \times 2^{-5}$ if five binary digit
precision was used.

35
Approximate Storage of Numbers

The subscript 2 in the mantissa $1.1001_2$ indicates that this number is
shown in base 2; that is, it represents $1 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-4}$, or
1.5625 in decimal notation.

However, $4.93 \times 10^{-2}$ and $1.1001_2 \times 2^{-5}$ are not identical.

Five binary digits give less precision than three decimal digits: a range
of values from approximately 0.0488 to 0.0508 would all get the same
representation to five binary digit precision.
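
The end points of that range can be worked out directly. This is a sketch that assumes binary digits beyond the fifth are simply dropped, so that every value from the five-digit number shown above up to (but not including) the next five-digit number shares the same representation. The names lower and upper are invented for the illustration.

```r
# 1.1001 in base 2, times 2^-5
lower <- (1 + 1/2 + 1/16) * 2^-5
lower == 0.048828125   # TRUE: this value is stored exactly

# The next five-digit mantissa is 1.1010 in base 2
upper <- (1 + 1/2 + 1/8) * 2^-5
upper == 0.05078125    # TRUE
```

Rounded to three significant digits, these are the values 0.0488 and 0.0508 quoted above.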

36
Approximate Storage of Numbers

In fact, $4.93 \times 10^{-2}$ cannot be represented exactly in binary notation in
a finite number of digits.

The problem is similar to trying to represent 1/3 as a decimal: 0.3333 is
a close approximation, but is not exact. The standard precision in R is 53
binary digits, which is equivalent to about 15 or 16 decimal digits.

37
Approximate Storage of Numbers

To illustrate, consider the fractions 5/4 and 4/5.

In decimal notation these can be represented exactly as 1.25 and 0.8
respectively.

In binary notation 5/4 is $1 + 1/4 = 1.01_2$.

38
Approximate Storage of Numbers

How do we determine the binary representation of 4/5?

It is between 0 and 1, so we’d expect something of the form $0.b_1 b_2 b_3 \cdots$
where each $b_i$ represents a “bit”, i.e. a 0 or 1 digit.

Multiplying by 2 moves all the bits left by one place, i.e.
$2 \times 4/5 = 1.6 = b_1.b_2 b_3 \cdots$.

Thus $b_1 = 1$, and $0.6 = 0.b_2 b_3 \cdots$.

We can now multiply by 2 again to find $2 \times 0.6 = 1.2 = b_2.b_3 \cdots$, so
$b_2 = 1$. Repeating twice more yields $b_3 = b_4 = 0$.

39
Approximate Storage of Numbers

At this point we’ll have the number 0.8 again, so the sequence of 4 bits
will repeat indefinitely: in base 2, 4/5 is $0.110011001100\cdots$.

Since R only stores 53 bits, it won’t be able to store 0.8 exactly.

Some rounding error will occur in the storage.
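
The doubling procedure used to find the bits of 4/5 can be sketched as a short R function. The function name binary_digits and its arguments are hypothetical, introduced only for this illustration; it works with the numerator and denominator as whole numbers so that the digits it produces are not themselves affected by rounding error.

```r
# First n_bits binary digits of the fraction numerator/denominator,
# assuming 0 <= numerator < denominator. Whole-number arithmetic is used
# so that the digits themselves are not affected by rounding error.
binary_digits <- function(numerator, denominator, n_bits) {
  bits <- integer(n_bits)
  for (i in seq_len(n_bits)) {
    numerator <- 2 * numerator            # shift all bits left by one place
    if (numerator >= denominator) {       # the bit in front of the point is 1
      bits[i] <- 1L
      numerator <- numerator - denominator
    }
  }
  bits
}

binary_digits(4, 5, 12)   # 1 1 0 0 1 1 0 0 1 1 0 0
binary_digits(5, 8, 4)    # 1 0 1 0, since 5/8 = 1/2 + 1/8
```

The first call shows the repeating block 1100 for 4/5; the second shows a fraction whose expansion terminates.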

40
Approximate Storage of Numbers

We can observe the rounding error with the following experiment.

With exact arithmetic, (5/4) × (4/5) = 1, so (5/4) × (n × 4/5) should
be exactly n for any value of n.

But if we try this calculation in R, we find

n <- 1:10
1.25 * (n * 0.8) - n

## [1] 0.000000e+00 0.000000e+00 4.440892e-16
## [4] 0.000000e+00 0.000000e+00 8.881784e-16
## [7] 8.881784e-16 0.000000e+00 0.000000e+00
## [10] 0.000000e+00

i.e. it is equal for some values, but not equal for n = 3, 6 or 7. The errors
are very small, but non-zero.

41
Approximate Storage of Numbers

Rounding error tends to accumulate in most calculations, so usually a
long series of calculations will result in larger errors than a short one.

Some operations are particularly prone to rounding error: for example,
subtraction of two nearly equal numbers, or (equivalently) addition of
two numbers with nearly the same magnitude but opposite signs.

Since the leading bits in the binary expansions of nearly equal numbers
will match, they will cancel in subtraction, and the result will depend on
what is stored in the later bits.

42
Approximate Storage of Numbers - Variance Example

Consider the standard formula for the sample variance of a sample
$x_1, \ldots, x_n$:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (1)$$

where $\bar{x}$ is the sample mean, $(1/n) \sum_{i=1}^{n} x_i$.*

The sample variance gives an idea of how much the measurements in a
sample differ from each other. Large values of $s^2$ correspond to a lot of
variation in a sample. Small values correspond to less variation in the
measurements.

In R, $s^2$ is available as var(), and $\bar{x}$ is mean().

* The symbol $\sum$ is the mathematical shorthand for sum. If $x_1 = 3$, $x_2 = 4$ and $x_3 = 2$,
then $\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3 = 9$. In R, we would obtain this sum by typing sum(x).
43
Approximate Storage of Numbers - Variance Example
x <- 1:11
mean(x)

## [1] 6

var(x)

## [1] 11

sum((x - mean(x))^2) / 10 # this replaces the formula at (1)

## [1] 11

Because this formula requires calculation of $\bar{x}$ first and the sum of
squared deviations second, it requires that all $x_i$ values be kept in
memory.
44
Approximate Storage of Numbers - Variance Example

Not too long ago memory was so expensive that it was advantageous to
rewrite the formula as

$$s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)$$

This is called the “one-pass formula”, because we evaluate each $x_i$
value just once, and accumulate the sums of $x_i$ and of $x_i^2$.

It gives the correct answer, both mathematically and in our example:

( sum(x^2) - 11 * mean(x)^2 ) / 10

## [1] 11

However, notice what happens if we add a large value A to each $x_i$.

The sum $\sum_{i=1}^{n} x_i^2$ increases by approximately $nA^2$, and so does $n\bar{x}^2$.

This doesn’t change the variance, but it provides the conditions for a
“catastrophic loss of precision” when we take the difference:
A <- 1.e10
x <- 1:11 + A
var(x)

## [1] 11

( sum(x^2) - 11 * mean(x)^2 ) / 10

## [1] 0

Since R gets the right answer, it clearly doesn’t use the one-pass
formula, and neither should you.
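
If keeping all the values in memory really is a problem, there are single-pass methods that avoid the subtraction of two huge sums. The sketch below uses Welford's update, a well-known technique that is not covered in these slides and is not necessarily what var() does internally; the function name welford_var is hypothetical.

```r
# Welford's online algorithm: one pass over the data, updating a running
# mean and a running sum of squared deviations from that mean.
welford_var <- function(x) {
  n <- 0
  m <- 0    # running mean
  ss <- 0   # running sum of squared deviations
  for (xi in x) {
    n <- n + 1
    delta <- xi - m
    m <- m + delta / n
    ss <- ss + delta * (xi - m)
  }
  ss / (n - 1)
}

A <- 1.e10
x <- 1:11 + A
welford_var(x)   # agrees with var(x) up to rounding error
var(x)
```

Because each update works with deviations from the current mean, the large offset A never gets squared, so the cancellation seen in the one-pass formula does not arise.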
Exact Storage of Numbers

We have seen that R uses floating point storage for numbers, using a
base 2 format that stores 53 bits of accuracy.

It turns out that this format can store some fractions exactly: if the
fraction can be written as $n/2^m$, where n and m are integers (not too
large; m can be no bigger than about 1000, but n can be very large), R
can store it exactly.

The number 5/4 is in this form, but the number 4/5 is not, so only the
former is stored exactly.
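
One way to see the difference is to print more digits than R shows by default. This is a small sketch using sprintf(); the variable names are illustrative, and the exact trailing digits printed for 4/5 may depend on the platform.

```r
exact <- 5 / 4     # 5 / 2^2, representable exactly in base 2
inexact <- 4 / 5   # not of the form n / 2^m

# Print 20 decimal places instead of R's default 7 significant digits
exact_digits <- sprintf("%.20f", exact)
inexact_digits <- sprintf("%.20f", inexact)

exact_digits
inexact_digits
```

The first string ends in zeros, while the second shows non-zero digits beyond the 16th decimal place.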


Floating point storage is not the only format that R uses.

For whole numbers, it can use 32 bit integer storage.

In this format, numbers are stored as binary versions of the integers 0 to
$2^{32} - 1 = 4294967295$.

Numbers that are bigger than $2^{31} - 1 = 2147483647$ are treated as
negative values by subtracting $2^{32}$ from them, i.e. to find the stored
value for a negative number, add $2^{32}$ to it.

Exact Storage of Numbers - Example

The number 11 can be stored as the binary value of 11, i.e. 0 . . . 01011,
whereas −11 can be stored as the binary value of
$2^{32} - 11 = 4294967285$, which turns out to be 1 . . . 10101.

If you add these two numbers together, you get $2^{32}$.

Using only 32 bits for storage, this is identical to 0, which is what we’d
hope to get for 11 + (−11).
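
The bit patterns can be inspected directly in R. The following sketch uses the base function intToBits(), which returns the 32 bits with the least significant bit first; the helper name bits is hypothetical.

```r
# Show the 32-bit pattern of an integer, most significant bit first
bits <- function(k) paste(rev(as.integer(intToBits(k))), collapse = "")

bits(11L)    # ends in 01011
bits(-11L)   # ends in 10101, with leading 1s

# The largest value that 32 bit integer storage can hold
.Machine$integer.max

# Going past that limit gives NA (with a warning) rather than wrapping around
overflow <- suppressWarnings(.Machine$integer.max + 1L)
overflow
```

The L suffix asks R for integer storage; without it, 11 would be stored as a floating point number.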

Dates and Times

Dates and times are among the most difficult types of data to work with
on computers.

The standard calendar is very complicated: months of different lengths,
leap years every four years (with exceptions for whole centuries) and so
on.

When looking at dates over historical time periods, changes to the
calendar (such as the switch from the Julian calendar to the modern
Gregorian calendar that occurred in various countries between 1582 and
1923) affect the interpretation of dates.

Times are also messy, because there is often an unstated time zone
(which may change for some dates due to daylight saving time), and
some years have “leap seconds” added in order to keep standard clocks
consistent with the rotation of the earth.

There have been several attempts to deal with this in R.

The base package has the function strptime() to convert from strings
(e.g. "2007-12-25", or "12/25/07") to an internal numerical
representation, and format() to convert back for printing.

The ISOdate() and ISOdatetime() functions are used when
numerical values for the year, month, day, etc. are known.
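
As a brief illustration of these base functions, the sketch below converts the two example strings into dates and back. The format codes (%Y, %m, %d, %y) are standard strptime() codes; the variable names are illustrative.

```r
# Two different string layouts for the same day
d1 <- as.Date(strptime("2007-12-25", format = "%Y-%m-%d"))
d2 <- as.Date(strptime("12/25/07", format = "%m/%d/%y"))
d1 == d2

# Convert back to a string for printing
format(d1, "%d/%m/%Y")

# Build a date-time from numerical components (defaults to 12:00 GMT)
d3 <- ISOdate(2007, 12, 25)
as.Date(d3) == d1
```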


Other functions are available in the chron package.

For example, suppose a trucker’s first accident occurred on February 18,
2019. We might store this information in a spreadsheet as 19-02-18, but
for calculation purposes, such as determining how many days were
accident-free, we need to know the number of days after a certain
reference date:

library(chron) # load the chron package

accidentDate <- "19-02-18"
numberOfDaysAcc <- chron(dates = accidentDate,
                         format = c('y-m-d'))
as.numeric(numberOfDaysAcc) # No. of days since Jan. 1, 1970

## [1] 17945


If the trucker started driving on July 11, 2014, we count the number of
accident-free days as follows:
startDate <- "14-07-11"
numberOfDaysStart <- chron(dates = startDate,
                           format = c('y-m-d'))
as.numeric(numberOfDaysStart) # Ref. date is Jan. 1, 1970

## [1] 16262

numberOfDaysAcc - numberOfDaysStart # time from start to accident

## Time in days:
## [1] 1683
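
The same calculation can be done without an add-on package. This is a sketch using base R's Date class, which also counts days from January 1, 1970; the variable names are illustrative, and four-digit years are used to avoid any ambiguity about the century.

```r
accident <- as.Date("2019-02-18")
start <- as.Date("2014-07-11")

as.numeric(accident)   # days since Jan. 1, 1970
as.numeric(start)

# Subtracting two dates gives a time difference in days
accident_free <- as.numeric(accident - start)
accident_free
```

These values agree with the chron results shown above.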

Time series objects

The Nile object contains annual flow amounts for the Nile River (Egypt)
and is an important example of a time series. The Nile River has
important effects on agriculture, and the flow data have been studied
extensively in order to understand how patterns of flow change over time.

The object looks like a numeric vector but with additional features.
Nile

## Time Series:
## Start = 1871
## End = 1970
## Frequency = 1
## [1] 1120 1160 963 1210 1160 1160 813 1230 1370 1140 995 935
## [13] 1110 994 1020 960 1180 799 958 1140 1100 1210 1150 1250
## [25] 1260 1220 1030 1100 774 840 874 694 940 833 701 916
## [37] 692 1020 1050 969 831 726 456 824 702 1120 1100 832
## [49] 764 821 768 845 864 862 698 845 744 796 1040 759
## [61] 781 865 845 944 984 897 822 1010 771 676 649 846
## [73] 812 742 801 1040 860 874 848 890 744 749 838 1050
## [85] 918 986 797 923 975 815 1020 906 901 1170 912 746
## [97] 919 718 714 740

As can be seen, the time series object contains information about the
starting year and ending year as well as the number of observations per
year. Here, there is one observation per year.

Monthly observations would correspond to a frequency of 12, and
quarterly observations (often used in Economics and Finance) would
have a frequency of 4.
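
The extra information can be extracted with a few base functions. The sketch below queries the Nile series; window() is used here to pick out a range of years.

```r
start(Nile)       # first time point: year and position within the year
end(Nile)         # last time point
frequency(Nile)   # observations per year

# Extract the observations for a range of years
first_three <- window(Nile, start = 1871, end = 1873)
first_three
```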


The extra information is used when constructing a graph of the
observations against time, as in
plot(Nile)

[Figure: line plot of the Nile series against Time, 1871 to 1970]

According to this graph, it seems that the flow was higher in the 1800s
than in the 1900s.

We can create our own time series objects with the use of the ts
function.

Consider the monthly BC jobs data for 1995 and 1996 in the jobs data
frame in the DAAG package.
library(DAAG) # provides the jobs data frame
jobs$BC

## [1] 1752 1737 1765 1762 1754 1759 1766 1775 1777 1771 1757 1766
## [13] 1786 1784 1791 1800 1800 1798 1814 1803 1796 1818 1829 1840

We will create the time series object jobsBC as follows:

jobsBC <- ts(jobs$BC, start = c(1995, 1),
             end = c(1996, 12), frequency = 12)

The start value corresponds to the first month of 1995, and the end
vector corresponds to the 12th month of 1996.
plot(jobsBC)

[Figure: line plot of jobsBC against Time, 1995 to 1996]

The graph shows that the job situation in BC improved through 1995 and
1996.

Programming Base Graphics in R

What we’ll learn in this section:

1. Base graphics - the main plotting system in R

2. Displaying single sets of data - bar charts, dot charts and
histograms

3. Displaying bivariate (paired) data - scatter plots

4. Visual perception and choosing a high level graphic

5. Fine-tuning graphs - low level parameter settings

Programming Statistical Graphics

The main graphic system in R is known as base graphics. The grid
package provides the basis for a more modern graphics system.

Other packages, such as lattice and ggplot2, provide functions for
high-level plots based on grid graphics.

Most graphics in R are designed to be “device independent”. Directions
are given where to draw and these drawing commands work on any
device.

Bar Charts and Dot Charts

The most basic type of graph is one that displays a single set of
numbers.

Bar charts and dot charts do this by displaying a bar or dot whose
length or position corresponds to the number.

Basic Bar Charts

The WorldPhones matrix holds counts of the numbers of telephones in
the major regions of the world for a number of years. The first row of the
matrix corresponds to the year 1951. In order to display these data
graphically, we first extract that row.
WorldPhones51 <- WorldPhones[1, ]
WorldPhones51

##   N.Amer   Europe     Asia   S.Amer  Oceania   Africa Mid.Amer
##    45939    21574     2876     1815     1646       89      555

Bar Charts

We could plot the bar chart using the barplot() function as

barplot(WorldPhones51)

but the plot that results needs some minor changes: we’d like to display
a title at the top, and we’d like to shrink the size of the labels on the
axes. We can do that with the following code.

barplot(WorldPhones51, cex.names = .75, cex.axis = .75,
main = "Numbers of Telephones in 1951")

[Figure: bar chart titled "Numbers of Telephones in 1951", one bar per region (N.Amer, Europe, Asia, S.Amer, Oceania, Africa, Mid.Amer)]

The cex.names = .75 argument reduced the size of the region names
to 0.75 of their former size, and the cex.axis = .75 argument reduced
the labels on the vertical axis by the same amount.

The main argument sets the main title for the plot.

Dot Charts

An alternative way to plot the same kind of data is in a dot chart.

dotchart(WorldPhones51, xlab = "Numbers of Phones ('000s)")

[Figure: dot chart of WorldPhones51 with open dots, one row per region; horizontal axis "Numbers of Phones ('000s)"]

Use pch=16 to get a filled in dot for the plotting character, for clarity:
dotchart(WorldPhones51, xlab = "Numbers of Phones ('000s)", pch=16)

[Figure: the same dot chart drawn with filled dots; horizontal axis "Numbers of Phones ('000s)"]

Data with More Structure

Data sets having more complexity can also be displayed using these
graphics functions.

The barplot() function has a number of options which allow for
side-by-side or stacked styles of displays; legends can be included
using the legend argument, and so on.

Example

The VADeaths dataset in R contains death rates (number of deaths per
1000 population per year) in various subpopulations within the state of
Virginia in 1940.
VADeaths

##       Rural Male Rural Female Urban Male Urban Female
## 50-54       11.7          8.7       15.4          8.4
## 55-59       18.1         11.7       24.3         13.6
## 60-64       26.9         20.3       37.0         19.3
## 65-69       41.0         30.9       54.6         35.1
## 70-74       66.0         54.3       71.1         50.0

This data set may be displayed as a sequence of bar charts,
one for each subgroup.

Virginia Deaths
barplot(VADeaths, beside = TRUE, legend = TRUE, ylim = c(0, 90),
ylab = "Deaths per 1000",
main = "Death rates in Virginia")

[Figure: side-by-side bar chart titled "Death rates in Virginia"; bars grouped by Rural Male, Rural Female, Urban Male and Urban Female, with a legend for age groups 50-54 to 70-74; vertical axis "Deaths per 1000"]

The bars correspond to each number in the matrix.

The beside = TRUE argument causes the values in each column to be
plotted side-by-side; legend = TRUE causes the legend in the top right
to be added.

The ylim = c(0, 90) argument modifies the vertical scale of the
graph to make room for the legend.

Finally, main = "Death rates in Virginia" sets the main title for
the plot.
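
For comparison, here is a sketch of the stacked style mentioned earlier, obtained by setting beside = FALSE. The ylim value of 250 is an illustrative choice made to leave room for the legend; it is not taken from the slides.

```r
# Side-by-side bars: one bar per matrix entry
mids_beside <- barplot(VADeaths, beside = TRUE, legend = TRUE,
                       ylim = c(0, 90), ylab = "Deaths per 1000",
                       main = "Death rates in Virginia")

# Stacked bars: one bar per column, age groups stacked within it
mids_stacked <- barplot(VADeaths, beside = FALSE, legend = TRUE,
                        ylim = c(0, 250), ylab = "Deaths per 1000",
                        main = "Death rates in Virginia (stacked)")
```

barplot() returns the horizontal positions of the bar midpoints, which is useful when adding text or points to the chart later.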

Virginia Deaths Dot Chart
dotchart(VADeaths, xlim=c(0, 75),
xlab="Deaths per 1000",
main="Death rates in Virginia", pch=16)

We set the x-axis limits to run from 0 to 75 so that zero is included,
because it is natural to want to compare the total rates in the different
groups.

[Figure: dot chart titled "Death rates in Virginia"; four groups (Rural Male, Rural Female, Urban Male, Urban Female), each with rows for age groups 50-54 to 70-74; horizontal axis "Deaths per 1000"]

The Histogram

A histogram is a special type of bar chart that is used to show the
frequency distribution of a collection of numbers.

Each bar represents the count of x values that fall in the range indicated
by the base of the bar. Usually all bars should be the same width; this is
the default in R.

When bars have equal width, the height of each bar is proportional to the
number of observations in the corresponding interval.

If bars have different widths, then the area of the bar should be
proportional to the count; in this way the height represents the density
(i.e. the frequency per unit of x).


In R, hist(x, ...) is the main way to plot histograms.

The main parameter is x, a vector consisting of numeric observations.

There are several optional parameters in ... that are used to control the
details of the display.

The Histogram – Example

Data: Escape times (in seconds) from a floating oil rig:

389 356 359 363 375 424 325 394
402 373 373 370 364 366 364 325
339 393 392 369 374 359 356 403
334 397

Store them in a file called escape.txt. Then type

escape <- scan("escape.txt")

hist(escape, xlab="escape times (in seconds)")

[Figure: histogram titled "Histogram of escape" with six bars between 320 and 440; horizontal axis "escape times (in seconds)", vertical axis "Frequency"]

The Histogram – Choosing Bin Widths

If you have n values of x, R, by default, divides the range into
approximately $\log_2(n) + 1$ intervals, giving rise to that number of bars.

For example, our data set consisted of 26 measurements. Since

$2^4 = 16 < 26 < 2^5 = 32$, we have $4 < \log_2(26) < 5$,

so $\log_2(26) + 1$ lies between 5 and 6, and R should choose about 5 or
6 bars. In fact, it chose 6; it also attempts to put the breaks at round
numbers (multiples of 20 in this case).
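
The bar count can be checked without drawing anything. In this sketch the escape times are typed in directly with c() instead of being read from escape.txt, and hist() is called with plot = FALSE so that it only returns the computed breaks and counts.

```r
escape <- c(389, 356, 359, 363, 375, 424, 325, 394,
            402, 373, 373, 370, 364, 366, 364, 325,
            339, 393, 392, 369, 374, 359, 356, 403,
            334, 397)

n <- length(escape)
log2(n) + 1              # between 5 and 6
nclass.Sturges(escape)   # the Sturges rule rounds this up

# Compute the histogram without plotting it
h <- hist(escape, plot = FALSE)
h$breaks                 # break points at round numbers
h$counts                 # number of observations in each bar
```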


The above rule (known as the “Sturges” rule) is not always satisfactory
for very large values of n, giving too few bars.

Current research suggests that the number of bars should increase
proportionally to $n^{1/3}$ rather than proportionally to $\log_2(n)$.

Use breaks = "Scott" or breaks = "Freedman-Diaconis".

A Smoother Alternative: Density Plotting
plot(density(escape), main=' ')

This command produces a nonparametric estimate of the probability
density function. The method of estimation is not based on a parametric
model, such as the normal distribution, exponential distribution, etc.

[Figure: kernel density estimate of escape; vertical axis "Density"; N = 26, Bandwidth = 11.29]

Box plots

A box plot gives a quick visual summary of the main features of a set
of data.

A rectangular box is drawn, together with line segments on opposing
sides.

The box indicates location and spread of the main body of the data, and
the line segments indicate the range of the data.

Outliers (observations that are very different from the rest of the data)
are often plotted as separate points.


The basic construction of the box part of the boxplot is as follows:

1. Draw a line at the median.
2. Split the data into two halves, each containing the median.
3. Calculate the upper and lower quartiles as the medians of each half,
and draw horizontal lines at each of these values.

Note that the box specifies the inter-quartile range (IQR).

The lower line segment is drawn from the middle of the lower end of the
box to the smallest value that is no more than 1.5 IQR below the lower
quartile.

Similarly, the upper segment is drawn from the middle of the upper end
of the box to the largest value that is no more than 1.5 IQR above the
upper quartile.
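
The numbers behind a box plot can be obtained with the base function boxplot.stats(). The sketch below applies it to the sepal lengths of the virginica irises from the built-in iris data frame; the variable names are illustrative.

```r
x <- iris$Sepal.Length[iris$Species == "virginica"]
s <- boxplot.stats(x)

# Five numbers: lower whisker end, lower quartile, median,
# upper quartile, upper whisker end
s$stats

# Observations beyond the whiskers, plotted as separate points
s$out
```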


[Figure: schematic box plot, labelled from top to bottom: outlier, upper whisker, upper quartile, median, lower quartile, lower whisker, outliers]

Box plots are convenient for comparing distributions of data in two or
more categories, with a number (say 10 or more) of numerical
observations per category.

For example,

boxplot(Sepal.Length ~ Species, data = iris,
        ylab = "Sepal length (cm)",
        main = "Iris measurements", boxwex = 0.5)

compares the distributions of the sepal length measurements between
different species of irises.

In the code, we have used R’s formula based interface to the graphics
function: the syntax Sepal.Length ~ Species is read as
“Sepal.Length depending on Species”, where both are columns of the
data frame specified by data = iris.

The boxplot() function draws separate side-by-side box plots for each
species.

From these, we can see substantial differences between the median
lengths for the species, and that there is one unusually small specimen
among the virginica samples.

[Figure: side-by-side box plots titled "Iris measurements" for setosa, versicolor and virginica; vertical axis "Sepal length (cm)"]

Scatterplots

When doing statistics, most of the interesting problems have to do with
the relationships between different measurements.

To study this, one of the most commonly used plots is the
scatterplot, in which points $(x_i, y_i)$, $i = 1, \ldots, n$ are drawn using dots
or other symbols.

In R, scatterplots (and many other kinds of plots) are drawn using the
plot() function.

Scatterplots – Optional Arguments

There are many additional optional arguments: e.g. type, pch, xlab,
ylab, main, col, xlim, ylim, ...

The default for type is type="p", which plots points. Line plots (in which
line segments join the $(x_i, y_i)$ points in order from first to last) are drawn
using type="l".

Many other types are available, including type="n", to draw nothing:
this just sets up the frame around the plot, allowing other functions to
be used to draw in it.

Example: Distances Traveled Down a Ramp
library(DAAG) # package containing modelcars data frame
summary(modelcars)

## distance.traveled starting.point
## Min. :11.75 Min. : 3.00
## 1st Qu.:17.78 1st Qu.: 5.25
## Median :24.12 Median : 7.50
## Mean :23.19 Mean : 7.50
## 3rd Qu.:27.94 3rd Qu.: 9.75
## Max. :33.62 Max. :12.00


Basic scatterplot of distance traveled against starting point (height up
the ramp):
attach(modelcars) # to access variables directly
plot(starting.point, distance.traveled)

[Figure: scatterplot of distance.traveled against starting.point]

A More Carefully Drawn Version of the Plot
plot(starting.point, distance.traveled,
pch=16, cex=1.25, xlab="Starting Point",
ylab="Distance Traveled", main="Model Car Data",
xlim=c(0,12), ylim=c(0,35))

• pch=16: change the plotting character
• cex=1.25: expand the character by a factor of 1.25
• xlab, ylab: x and y axis labels
• main: plot title
• xlim, ylim: ranges for x and y axes

[Figure: scatterplot titled "Model Car Data" with filled dots; horizontal axis "Starting Point" from 0 to 12, vertical axis "Distance Traveled" from 0 to 35]

Orange Trees

The Orange data frame is in the datasets package installed with R.

It consists of 35 observations on the age (in days since December 31,
1968) and the corresponding circumference of 5 different orange trees,
with identifiers

unique(as.character(Orange$Tree))

## [1] "1" "2" "3" "4" "5"

(Since Orange$Tree is a factor, we use as.character() to get the displayed form, and unique() to
select the unique values.)


To get a sense of how circumference relates to age, we might try the
following:
plot(circumference ~ age, data = Orange)

[Figure: scatterplot of circumference against age for the Orange data, all points drawn with the same symbol]

The previous graphic hides important information: the observations are
not all from the same tree, and they are not all from different trees; they
are from five different trees, but we cannot tell which observations are
from which tree.

One way to remedy this problem is to use a different plotting symbol for
each tree.

The pch parameter controls the plotting character.

The default setting pch = 1 yields the open circular dot.

Other numerical values of this parameter will give different plotting
characters. We can also ask for different characters to be plotted; for
example, pch = "A" causes R to plot the character A.

The following code can be used to identify the individual trees:

plot(circumference ~ age, data = Orange,
     pch = as.character(Tree), cex = .75)

[Figure: scatterplot of circumference against age, with each point drawn as its tree number (1 to 5)]

Individual lines can be used to do the same thing. This code uses a for
loop, which will be discussed in more detail when we study programming.
plot(circumference ~ age, data = Orange, pch = as.numeric(Orange$Tree))
for (i in levels(Orange$Tree)) { # one pass per tree
  lines(circumference ~ age, data = subset(Orange, Tree == i),
        lty = as.numeric(i))
}

[Figure: scatterplot of circumference against age, with the points for each tree joined by a line of a different type]

Choosing a High Level Graphic

We have described bar, dot, and pie charts, histograms, box plots and
scatterplots. There are many other styles of statistical graphics that we
haven’t discussed. How should a user choose which one to use?

The first consideration is the type of data. Bar, dot and pie charts
display individual values, histograms, box plots and QQ plots display
distributions, and scatterplots display pairs of values.

Another consideration is the audience. If the plot is for yourself or for a
statistically educated audience, then you can assume a more
sophisticated understanding.

For example, a box plot or QQ plot would require more explanation than
a histogram, and might not be appropriate for the general public.

Choosing a High Level Graphic – Visual Perception

It is also important to have some understanding of how human visual
perception works in order to make a good choice. There has been a
huge amount of research on this and we can only touch on it here.

When looking at a graph, you extract quantitative information when your
visual system decodes the graph.

This process can be described in terms of unconscious measurements
of lengths, positions, slopes, angles, areas, volumes, and various
aspects of colour.

It has been found that people are particularly good at recognizing
lengths and positions, not as good at slopes and angles, and their
perception of areas and volumes can be quite inaccurate, depending on
the shape.

Most of us are quite good at recognizing differences in colours.
However, up to 10% of men and a much smaller proportion of women are
partially colour-blind, and almost nobody is very good at making
quantitative measurements from colours.

We can take these facts about perception into account when we
construct graphs. We should try to convey the important information in
ways that are easy to perceive, and we should try not to have conflicting
messages in a graph.

For example, the bars in bar charts are easy to recognize, because the
position of the ends of the bars and the length of the bars are easy to
see.

The area of the bars also reinforces our perception.

However, the fact that we see length and area when we look at a bar
constrains us.

We should normally base bar charts at zero, so that the position, length
and area all convey the same information.

If we are displaying numbers where zero is not relevant, then a dot chart
is a better choice: in a dot chart it is mainly the position of the dot that is
perceived.


Thinking in terms of visual tasks tells us that pie charts can be poor
choices for displaying data.

In order to see the sizes of slices of the pie, we need to make angle and
area measurements, and we are not very good at those.


Finally, colour can be very useful in graphs to distinguish groups from
each other. The RColorBrewer package in R contains a number of
palettes, or selections of colours.

Some palettes indicate sequential groups from low to high, others show
groups that diverge from a neutral value, and others are purely
qualitative.

These are chosen so that most people (even if colour-blind) can easily
see the differences.

Low Level Graphics Functions

Functions like barplot(), dotchart() and plot() do their work by
using low level graphics functions to draw lines and points, to establish
where they will be placed on a page, and so on.

In this section we will describe some of these low level functions, which
are also available to users to customize their plots.

We will start with a description of how R views the page it is drawing on,
then show how to add points, lines and text to existing plots, and finish
by showing how some of the common graphics settings are changed.

The plotting region and margins

[Figure: the plot region, containing a point labelled (6, 20), surrounded by Margin 1 (bottom), Margin 2 (left), Margin 3 (top) and Margin 4 (right); in each margin, Line 0 to Line 4 are marked outward from the plot region]

Base graphics in R divides up the display into several regions.

The plot region is where data will be drawn.

Within the plot region R maintains a coordinate system based on the
data.

The axes show this coordinate system.

Outside the plot region are the margins, numbered clockwise from 1 to
4, starting at the bottom.


Normally text and labels are plotted in the margins, and R positions
objects based on a count of lines out from the plot region.

We can see from the figure that R chose to draw the tick mark labels on
line 1. We drew the margin titles on line 3.

Adding to plots

Several functions exist to add components to the plot region of existing
graphs:

• points(x, y, ...) adds points
• lines(x, y, ...) adds line segments
• text(x, y, labels, ...) adds text into the graph
• abline(a, b, ...) adds the line y = a + bx
• abline(h=y, ...) adds a horizontal line
• abline(v=x, ...) adds a vertical line
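
The functions listed above can be tried out on a small made-up data set. In this sketch the x and y values are invented purely for illustration; the plot is first set up empty with type = "n" and then built up piece by piece.

```r
# Made-up data, for illustration only
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2)

plot(x, y, type = "n")         # set up the frame, draw nothing
points(x, y, pch = 16)         # add the points
lines(x, y, lty = 2)           # join them with dashed segments

fit <- lm(y ~ x)
abline(fit)                    # least-squares line y = a + bx
abline(h = mean(y), lty = 3)   # horizontal line at the mean of y
abline(v = mean(x), lty = 3)   # vertical line at the mean of x
text(2, 18, "made-up data")    # text at the point (2, 18)
```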

Adding to plots - Example

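Consider the Orange data frame again. In addition to using different
plotting characters for the different trees, we will pass lines of best-fit
(i.e. least-squares regression lines) through the points corresponding to
each tree.

The basic scatterplot is obtained from the following code. The tree labels
are converted with as.character() first, so that tree 1 is drawn with
plotting character 1, tree 2 with plotting character 2, and so on.

plot(circumference ~ age,
     pch = as.numeric(as.character(Tree)), data = Orange)
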
Adding lines to plots

The best-fit lines for the five trees can be obtained using the lm()
function which relates circumference to age for each tree. A legend has
been added to identify which data points come from the different trees.
abline(lm(circumference ~ age, data = Orange, subset = Tree == "1"),
       lty = 1)
abline(lm(circumference ~ age, data = Orange, subset = Tree == "2"),
       lty = 2)
abline(lm(circumference ~ age, data = Orange, subset = Tree == "3"),
       lty = 3)
abline(lm(circumference ~ age, data = Orange, subset = Tree == "4"),
       lty = 4)
abline(lm(circumference ~ age, data = Orange, subset = Tree == "5"),
       lty = 5)
legend("topleft", legend = paste("Tree", 1:5), lty = 1:5, pch = 1:5)

Adding broken lines to plots
lines(circumference ~ age, data = Orange, subset = Tree == "1", lty = 1)
lines(circumference ~ age, data = Orange, subset = Tree == "2", lty = 2)
lines(circumference ~ age, data = Orange, subset = Tree == "3", lty = 3)
lines(circumference ~ age, data = Orange, subset = Tree == "4", lty = 4)
lines(circumference ~ age, data = Orange, subset = Tree == "5", lty = 5)

Adding lines and broken lines to plots

[Figure: two scatterplots of circumference against age. Left: best-fit lines for each tree, with a legend for Tree 1 to Tree 5. Right: the points for each tree joined by broken lines.]

Adding Material Outside the Plotting Region

• title(main, sub, xlab, ylab, ...) adds a main title, a
subtitle, an x-axis label and/or a y-axis label
• mtext(text, side, line, ...) draws text in the margins
• axis(side, at, labels, ...) adds an axis to the plot
• box(...) adds a box around the plot region

Example
par(mar=c(5, 5, 5, 5) + 0.1)
plot(c(1, 9), c(0, 50), type='n', xlab="", ylab="")
text(6, 40, "Plot region")
points(6, 20)
text(6, 20, "(6, 20)", adj=c(0.5, 2))
mtext(paste("Margin", 1:4), side=1:4, line=3)
mtext(paste("Line", 0:4), side=1, line=0:4,
at=3, cex=0.6)
mtext(paste("Line", 0:4), side=2, line=0:4,
at=15, cex=0.6)
mtext(paste("Line", 0:4), side=3, line=0:4,
at=3, cex=0.6)
mtext(paste("Line", 0:4), side=4, line=0:4,
at=15, cex=0.6)

Setting Graphical Parameters

After a device is opened, other graphical parameters may be set
using the par(...) function. This function controls a very large
number of parameters.

• mfrow=c(m, n) tells R to draw m rows and n columns of plots,
rather than going to a new page for each plot.
• mfg=c(i, j) says to draw the figure in row i and column j next.
• ask=TRUE tells R to ask the user before erasing a plot to draw a new
one.

• cex=1.5 tells R to expand characters by this amount in the plot
region. There are separate cex.axis, etc. parameters to control text
in the margins.
• mar=c(side1, side2, side3, side4) sets the margins of the
plot to the given numbers of lines of text on each side.
• oma=c(side1, side2, side3, side4) sets the outer margins
(the region outside the array of plots).
• usr=c(x1, x2, y1, y2) sets the coordinate system within the
plot with x and y coordinates on the given ranges.

The par() function is set up to take arguments in several forms.

If you give character strings (e.g. par("mfrow")) the function will return
the current value of the graphical parameter.

If you provide named arguments (e.g. par(mfrow=c(1, 2))), you will
set the corresponding parameter, and the previous value will be returned
in a list.

Finally, you can use a list as input to set several parameters at once.
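
These forms combine into a common idiom: save the list that par() returns, draw the plots, then pass the list back to restore the earlier settings. The sketch below assumes the default single-plot layout at the start; the choice of plots and margin sizes is illustrative.

```r
# Set two parameters at once; the previous values are returned in a list
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

hist(Nile, main = "Histogram")
plot(Nile, main = "Time series")

# Use the saved list as input to restore the earlier settings
par(old)
par("mfrow")
```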

Setting Graphical Parameters – Example

Construct a 2 × 2 layout of graphs for the cuckoos data in the DAAG
package:

• A histogram of length
• A histogram of breadth
• A scatterplot of length versus breadth
• Side-by-side boxplots of length for each host species.


First check what the current layout setting is:

par("mfrow")

## [1] 1 1

We want to change this to 2 × 2:

par(mfrow = c(2,2))

library(DAAG) # provides the cuckoos data frame
par(mfrow = c(2,2))
hist(cuckoos$length, xlab="length")
hist(cuckoos$breadth, xlab="breadth")
plot(length ~ breadth, data=cuckoos)
boxplot(length ~ species, data=cuckoos)

[Figure: 2 × 2 layout. Top row: "Histogram of cuckoos$length" and "Histogram of cuckoos$breadth". Bottom row: scatterplot of length against breadth, and side-by-side box plots of length for each host species.]

Orange Juice Data

The calorie content of six different brands of orange juice was
determined by three different machines.

The data below are in calories per 6 fluid ounces.

We are interested in knowing whether the caloric content differs for the
different brands, but we also would like to take into account differences
in the machines’ ability to measure caloric content.

Read in the Data
oj <- read.table("oj.txt",
header=TRUE)
names(oj) <- c("machine", paste("Brand",
c("A", "B", "C", "D", "E", "F")))
oj

##   machine Brand A Brand B Brand C Brand D Brand E Brand F
## 1      M1      89      97      92     105     100      91
## 2      M1      94      96      94     101     103      92
## 3      M2      92     101      94     110     100      95
## 4      M2      90     100      98     106     104      99
## 5      M3      90      98      94     109      99      94
## 6      M3      94      92      96     107      97      98

Fix the Data Frame
rownames(oj) <- paste(oj[,1], rep(1:2,3), sep="")
oj.mat <- as.matrix(oj[,-1])
oj.mat

##     Brand A Brand B Brand C Brand D Brand E Brand F
## M11      89      97      92     105     100      91
## M12      94      96      94     101     103      92
## M21      92     101      94     110     100      95
## M22      90     100      98     106     104      99
## M31      90      98      94     109      99      94
## M32      94      92      96     107      97      98
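
If the file oj.txt is not at hand, the same matrix can be typed in directly from the values printed above. This is a sketch of one way to do that; the colMeans() call at the end is an illustrative summary that is not part of the original slides.

```r
# Rebuild oj.mat from the printed values, one row per measurement
oj.mat <- matrix(c(89,  97, 92, 105, 100, 91,
                   94,  96, 94, 101, 103, 92,
                   92, 101, 94, 110, 100, 95,
                   90, 100, 98, 106, 104, 99,
                   90,  98, 94, 109,  99, 94,
                   94,  92, 96, 107,  97, 98),
                 nrow = 6, byrow = TRUE,
                 dimnames = list(paste0("M", rep(1:3, each = 2), rep(1:2, 3)),
                                 paste("Brand", LETTERS[1:6])))
oj.mat

# Average measurement for each brand, across all machines
brand_means <- colMeans(oj.mat)
brand_means
```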

Plot a Dot Chart
The following code produces a dot chart ... but it is a bit hard to read.
dotchart(oj.mat)

[Figure: dot chart of oj.mat, grouped by brand (Brand A to Brand F) with rows M11 to M32 in each group; horizontal axis from 90 to 110; the row labels overlap the group labels]

Fixing the Axis Labels
Remove the M’s, since they are cluttering the vertical axis. Add a horizontal axis label
and a title.
dotchart(oj.mat, labels="", xlab="Energy (calories)")
title("Orange Juice Caloric Measurements")

[Figure: dot chart titled "Orange Juice Caloric Measurements", grouped by Brand A to Brand F, without row labels; horizontal axis: Energy (calories), 90 to 110.]

77
Fixing the Plot
Colour the lines:
dotchart(oj.mat, labels="", xlab="Energy (calories)",
lcolor=rep(1:3,rep(2,3)))
title("Orange Juice Caloric Measurements")

[Figure: the same dot chart with the horizontal lines coloured by machine; horizontal axis: Energy (calories), 90 to 110.]

78
Fixing the Plot
Colour the points:
dotchart(oj.mat, labels="", xlab="Energy (calories)",
lcolor=rep(1:3,rep(2,3)), color=rep(1:3,rep(2,3)))
title("Orange Juice Caloric Measurements")

[Figure: the same dot chart with both the lines and the points coloured by machine; horizontal axis: Energy (calories), 90 to 110.]

79
Fixing the Plot
Use different plotting characters:
dotchart(oj.mat, labels="", xlab="Energy (calories)",
lcolor=rep(1:3,rep(2,3)), color=rep(1:3,rep(2,3)),
pch=rep(1:3, rep(2,3)))
title("Orange Juice Caloric Measurements")

80
[Figure: the same dot chart with a different plotting character for each machine; horizontal axis: Energy (calories), 90 to 110.]
Fixing the Plot
Add axis labels to identify the machines, using axis()
dotchart(oj.mat, labels="",
xlab="Energy (calories)",
lcolor=rep(1:3,rep(2,3)),
color=rep(1:3,rep(2,3)),
pch=rep(1:3, rep(2,3)))
title("Orange Juice Caloric Measurements")
lab.locations <- seq(1.5,48,2)[-seq(4,48,4)]
labels <- paste("Machine", rep(1:3, 6))
axis(side=2, at=lab.locations,
label=labels, las=2)

81
Fixing the Plot
Add axis labels to identify the machines, using axis()

[Figure: dot chart titled "Orange Juice Caloric Measurements" with axis labels Machine 1, Machine 2 and Machine 3 inside each brand group; horizontal axis: Energy (calories), 90 to 110.]

ouch!

82
Fixing the Plot
Fix the labels to identify the machines using mtext():
dotchart(oj.mat, labels="",
xlab="Energy (calories)",
lcolor=rep("grey", 18), color=
rep(1:3,rep(2,3)), pch=rep(1:3, rep(2,3)))
title("Orange Juice Caloric Measurements")
lab.locations <- seq(1.5,48,2)[-seq(4,48,4)]
labels <- paste("Machine", rep(1:3, 6))
mtext(labels, at=lab.locations, side=2,
las=2, cex=.6, col=1:3, line=-1.25)

83
Fixing the Plot

[Figure: dot chart titled "Orange Juice Caloric Measurements" with small coloured margin labels Machine 1, Machine 2 and Machine 3 inside each brand group; horizontal axis: Energy (calories), 90 to 110.]

84
Predicting with Data

• Example: radon data


• Another example: lawn roller data
• The simple linear regression model
• Parameter estimation
• Prediction
• Lists and function output

2
Example: radon release data

Remember the radon data: the measurements in radon.txt describe


the percentage of radon released in showers with different sized
orifices. (Orifices are the holes that the water sprays through.)

Radon-enriched water was used for the experiment.


source("radon.R")

3
Radon release data
Plot the data first:
plot(percentage ~ diameter, data = radon,
ylab="% radon released")

[Figure: scatterplot of % radon released against diameter.]

Goal: to predict the percentage of radon released for a given diameter.

4
Radon release data and best-fit line
plot(percentage ~ diameter, data = radon,
ylab="% radon released")
radon.lm <- lm(percentage ~ diameter, data = radon)
abline(radon.lm)

[Figure: scatterplot of % radon released against diameter, with the best-fit line overlaid.]

The goal of these slides is to understand how the best-fit line is calculated.

5
Another example: lawn roller data

Different weights of roller (in kilograms) were used to roll over different
parts of a lawn, and the depth of the depression (in millimeters) was
recorded at various locations.
library(DAAG) # DAAG contains the roller data
plot(depression ~ weight, data = roller, pch=16)

[Figure: scatterplot of depression against weight.]

Again, we want to predict the size of the depression for a given weight. We start by studying models for
such data. The first model to look at is the Simple Linear Regression Model.

6
The simple linear regression model

• Measurement of Y (response) changes in a linear fashion with a setting of the variable x (predictor):

Y = β0 + β1x + ε

where β0 + β1x is the linear relation and ε is the noise.
◦ The linear relation is deterministic (non-random).

◦ β0 represents the intercept of the line. This parameter is


unknown and must be estimated from data.

◦ β1 represents the slope of the line. This parameter is unknown


and must be estimated from data.

7
The simple linear regression model

◦ The noise or error is random.

• Noise accounts for the variability of the observations about the straight line.
No noise ⇒ relation is deterministic.
Increased noise ⇒ increased variability.

• The variability in the noise is due to any factors, other than the
weight of the roller. For example, the type of soil (sandy or hard clay,
etc), or amount of moisture in different parts of the lawn, and so on.

• The variability in the noise is summarized by the parameter σ


(pronounced “sigma”). Larger values of σ lead to larger amounts of
noise.

8
The simple linear regression model

Experiment with this simulation program:

simple.sim <- function(intercept=0,slope=1,x=seq(1,10),


sigma=1){
noise <- rnorm(length(x),sd = sigma)
y <- intercept + slope*x + noise
plot(x,y,pch=16) # plot noisy data
abline(intercept,slope,col=4,lwd=2) # blue line, no noise
}
Download simplesim.R from Canvas and source it into R. Also, watch the videos simpleSimVideo.mp4
and simpleSimVideo2.mp4 to see how the data become noisier as the value of σ (sigma) increases.
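
The effect of σ can also be seen without a plot. The sketch below uses a hypothetical helper, simple.sim.data (it is not part of simplesim.R), which follows the same model as simple.sim but returns the simulated data. It then compares the spread of the points about the true line for a small and a large value of sigma.

```r
# Hypothetical helper (not from simplesim.R): same model as simple.sim,
# but it returns the simulated data instead of drawing a plot
simple.sim.data <- function(intercept = 0, slope = 1, x = seq(1, 10),
                            sigma = 1) {
  noise <- rnorm(length(x), sd = sigma)
  y <- intercept + slope * x + noise
  data.frame(x = x, y = y)
}

set.seed(101)  # arbitrary seed, only so that the run is reproducible
quiet <- simple.sim.data(intercept = 2, slope = 0.5, sigma = 0.01)
noisy <- simple.sim.data(intercept = 2, slope = 0.5, sigma = 10)

# Typical distance of the points from the true line y = 2 + 0.5x
sd.quiet <- sd(quiet$y - (2 + 0.5 * quiet$x))
sd.noisy <- sd(noisy$y - (2 + 0.5 * noisy$x))
c(sd.quiet, sd.noisy)
```

The second value should be much larger than the first, matching what the plots for different values of σ show.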

9
The simple linear regression model

Example 1: y = 2 + 0.5x + ε, with σ = 0.01:


simple.sim(intercept = 2, slope = 0.5, sigma=.01)
[Figure: simulated points and the line y = 2 + 0.5x, for x from 1 to 10.]

Very little noise, so the points lie very close to the straight line.

10
The simple linear regression model

Example 2: y = 2 + 0.5x + ε, with σ = 0.1:


simple.sim(intercept = 2, slope = 0.5, sigma=.1)
[Figure: simulated points and the line y = 2 + 0.5x, for x from 1 to 10.]

A little more noise, so the points are not as close to the straight line.

11
The simple linear regression model

Example 3: y = 2 + 0.5x + ε, with σ = 1:


simple.sim(intercept = 2, slope = 0.5, sigma=1)


[Figure: simulated points and the line y = 2 + 0.5x, for x from 1 to 10.]

A lot more noise, so the points are scattered about the straight line.

12
The simple linear regression model

Example 4: y = 2 + 0.5x + ε, with σ = 10:


simple.sim(intercept = 2, slope = 0.5, sigma=10)
[Figure: simulated points and the line y = 2 + 0.5x, for x from 1 to 10; the vertical axis now runs from about −10 to 15.]

Mostly noise, so the line is no longer very recognizable from the points.

13
The Setup

• Assumptions:
1. Expected value of y = β0 + β1x.
2. Standard Deviation(ε) = σ.
• Data: Suppose data Y1, Y2, . . . , Yn are obtained at settings
x1, x2, . . . , xn, respectively. Then the model on the data is

Yi = β0 + β1xi + εi
Either
1. the x’s are fixed values and measured without error (controlled
experiment) - Example: Radon Data
OR
2. the analysis is conditional on the observed values of x
(observational study) - Example: Roller Data

14
Parameter estimation, fitted values and residuals

Least Squares Estimation

◦ Assumptions:

1. The mean of the noise term εi is 0.

2. The standard deviation of εi is σ.

3. εi’s are independent.

15
Method

• We want to choose estimates β̂0 and β̂1 of the parameters (or regression coefficients) β0 and β1 so that the fitted line passes as close to all of the points as possible.

◦ Aim: small residuals (observed − fitted response values):

ei = Yi − β̂0 − β̂1xi

16
Visualizing residuals*

residViz.plot(a=14, b=0)

On Canvas, you will find a file called residViz.plot.R which can be read into R.

It plots the roller data and overlays a line with slope 0 and intercept 14.

It also calculates the residuals and plots them together with line segments linking the data points with the fitted values.

[Figure: roller data with the horizontal line y = 14; legend: fitted values, data values; positive (+ve) and negative (−ve) residuals drawn as vertical line segments.]

Notice that there are both negative and positive residuals.

*The videos residplot2.mp4 and residplot3.mp4 also show how the fitted line and
residuals change depending on the values of the slope and intercept.
17
Visualizing residuals - same data; different line

This time, we pass the line y = 2 + 2x through the plot:

residViz.plot(a=2, b=2)
[Figure: roller data with the line y = 2 + 2x; positive and negative residuals drawn as vertical line segments.]

18
Visualizing residuals - same data; different line

This time, we pass the line y = 12 + x through the plot:

residViz.plot(a=12, b=1)
[Figure: roller data with the line y = 12 + x; positive and negative residuals drawn as vertical line segments.]

19
Visualizing residuals - same data; different line

This time, we pass the line y = −2 + 2.67x through the plot:


residViz.plot(a=-2, b=2.67)

[Figure: roller data with the line y = −2 + 2.67x; positive and negative residuals drawn as vertical line segments.]

Which plot is best? It is hard to tell: the residuals can be negative or positive, so if we just add them
up, there will be cancellation.

20
The Key: minimize squared residuals

The function resid2Viz.plot is available on Canvas in the file resid2Viz.plot.R and does the same thing as residViz.plot except that
it squares the residuals before plotting them.

The video resid2plot1.mp4 also shows how the fitted line and residuals
change depending on the values of the slope and intercept.

21
The Key: minimize squared residuals

resid2Viz.plot(a=14,b=0)

[Figure: roller data with squared residuals for intercept 14 and slope 0; vertical axis: Depression in lawn (mm); horizontal axis: Roller weight (t).]

These look big.

22
The Key: minimize squared residuals

resid2Viz.plot(a=2,b=2)

[Figure: roller data with squared residuals for intercept 2 and slope 2; vertical axis: Depression in lawn (mm); horizontal axis: Roller weight (t).]

These look a lot smaller.

23
The Key: minimize squared residuals

resid2Viz.plot(a=12,b=1)

[Figure: roller data with squared residuals for intercept 12 and slope 1; vertical axis: Depression in lawn (mm); horizontal axis: Roller weight (t).]

These look big again.

24
The Key: minimize squared residuals

resid2Viz.plot(a=-2, b=2.67)

[Figure: roller data with squared residuals for intercept −2 and slope 2.67; vertical axis: Depression in lawn (mm); horizontal axis: Roller weight (t).]

These look the smallest of all that we have seen. This suggests that the intercept
might be near -2 and the slope might be near 2.67.
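
The visual comparison can be checked numerically by adding up the squared residuals for each candidate line. This is only a sketch: the helper name ssr is made up here, and the roller data are typed in (the depression values are reconstructed from the fitted values and residuals printed later in these slides) so that the code runs without DAAG.

```r
# Roller data, typed in so that DAAG is not required
# (depression values reconstructed from the slides' printed output)
weight <- c(1.9, 3.1, 3.3, 4.8, 5.3, 6.1, 6.4, 7.6, 9.8, 12.4)
depression <- c(2, 1, 5, 5, 20, 20, 23, 10, 30, 25)

# Hypothetical helper: sum of squared residuals for the line y = a + b*x
ssr <- function(a, b) sum((depression - a - b * weight)^2)

# The four candidate lines tried in the plots
candidates <- data.frame(a = c(14, 2, 12, -2), b = c(0, 2, 1, 2.67))
candidates$ssr <- mapply(ssr, candidates$a, candidates$b)
candidates

# Row with the smallest sum of squared residuals
candidates[which.min(candidates$ssr), ]
```

Squaring removes the cancellation between positive and negative residuals, so the four sums can be compared directly.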

25
Calculating the slope and intercept estimates in R

The lm function finds the slope and intercept values that minimize
the sum of the squared residuals.

roller.lm <- lm(depression ~ weight, data = roller)

coef(roller.lm) # the intercept and slope estimates

## (Intercept) weight
## -2.09 2.67
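
For simple linear regression the least-squares estimates also have closed forms: the slope is Sxy/Sxx and the intercept is ȳ − slope × x̄. The sketch below checks these formulas against lm. The names Sxx and Sxy are chosen here for illustration, and the roller data are typed in (depression values reconstructed from the output printed in these slides) so that DAAG is not needed.

```r
# Roller data, typed in so that DAAG is not required
weight <- c(1.9, 3.1, 3.3, 4.8, 5.3, 6.1, 6.4, 7.6, 9.8, 12.4)
depression <- c(2, 1, 5, 5, 20, 20, 23, 10, 30, 25)

# Sums of squares and cross-products about the means
Sxx <- sum((weight - mean(weight))^2)
Sxy <- sum((weight - mean(weight)) * (depression - mean(depression)))

# Closed-form least-squares estimates
slope <- Sxy / Sxx
intercept <- mean(depression) - slope * mean(weight)
c(intercept = intercept, slope = slope)

# lm should agree with the closed-form values
fit <- lm(depression ~ weight)
coef(fit)
```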

26
Calculating the slope and intercept estimates in R

• From the output,

◦ the slope estimate is β̂1 = 2.67.

◦ the intercept estimate is β̂0 = −2.09.

The slope is positive, so this means that the amount of depression in the
lawn will increase with increasing roller weight.

27
Making predictions

The equation of the fitted line is ŷ = −2.1 + 2.7x.

This is a formula that can be used to predict the amount of depression for different roller weights.

For example, if we want to predict the amount of depression that is expected from a 10 t roller, we plug 10 into the formula above to get
ŷ = −2.1 + 2.7 × 10 = 24.9 mm.

A 2 t roller would be expected to depress the ground by
ŷ = −2.1 + 2.7 × 2 = 3.3 mm.

Be careful though: because of the noise, these predictions will not be


exactly correct. We need to use the residuals to obtain intervals which
will contain the true values with high probability - more on this soon.
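
The same arithmetic can be done in R with the rounded coefficients of the fitted line. The function name depression_hat is made up for this sketch.

```r
# Hypothetical helper: predicted depression (mm) from the rounded fitted line
depression_hat <- function(weight) -2.1 + 2.7 * weight

depression_hat(10)  # 10 t roller
depression_hat(2)   # 2 t roller

# The function is vectorized, so several weights can be handled at once
depression_hat(c(2, 5, 10))
```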

28
Minimizing squared residuals for the radon data

Try slope b = 2 and intercept a = 65:

resid2Viz.plot(radon$percentage, radon$diameter, a=65, b=2)

[Figure: radon data with squared residuals for intercept 65 and slope 2.]

large squared residuals


29
Squared residuals for the radon data

Try slope b = −2 and intercept a = 75:

resid2Viz.plot(radon$percentage, radon$diameter, a=75, b=-2)

[Figure: radon data with squared residuals for intercept 75 and slope −2.]

smaller squared residuals


30
Squared residuals for the radon data

Try slope b = −12 and intercept a = 84:

resid2Viz.plot(radon$percentage, radon$diameter, a=84, b=-12)

[Figure: radon data with squared residuals for intercept 84 and slope −12.]

even smaller squared residuals


31
Estimating the slope and intercept for the radon data

Use the lm function to find the slope and intercept that give the
smallest squared residuals:

radon.lm <- lm(percentage ~ diameter, data = radon)

coef(radon.lm) # the intercept and slope estimates

## (Intercept) diameter
## 84.2 -11.8

These values are close to the ones on the preceding slide.

The slope is negative, so the amount of radon released will decrease


with increasing orifice diameter.

32
Predicting radon release

The fitted line is ŷ = 84.2 − 11.8x.

This is a formula that can be used to predict the percentage of radon released for different orifice diameters.

For example, if we want to predict the percentage of released radon that is expected from a 1.0 mm diameter orifice, we plug 1.0 into the formula
above to get ŷ = 84.2 − 11.8 × 1.0 = 72.4%.

A 2.5 mm diameter would be expected to give a release percentage of
ŷ = 84.2 − 11.8 × 2.5 = 54.7%.

Caution: predicting far outside the range of the original diameters


amounts to extrapolation and may be subject to serious inaccuracy.
Here, we have predicted at 2.5 mm, and the maximum diameter in the
data is 2.0, so we have mildly extrapolated.
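
A small guard can make extrapolation visible. In the sketch below, radon_hat is a hypothetical helper that predicts with the rounded fitted line and flags any diameter outside the range of the diameters in the data (0.37 to 1.99, as printed by radon$diameter).

```r
# Hypothetical helper: predict with the rounded fitted line and flag
# diameters outside the range covered by the data (0.37 to 1.99)
radon_hat <- function(diameter, lower = 0.37, upper = 1.99) {
  data.frame(diameter = diameter,
             fit = 84.2 - 11.8 * diameter,
             extrapolated = diameter < lower | diameter > upper)
}

res <- radon_hat(c(1.0, 2.5))
res
```

The prediction at 2.5 is flagged because it lies beyond the largest diameter used in the experiment.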
33
Other regression analysis functions

◦ fitted values: predict(roller.lm)


◦ residuals: resid(roller.lm)
◦ diagnostic plots: plot(roller.lm)
(these include a plot of the residuals against the fitted values)
◦ Also plot(roller); abline(roller.lm)
(this gives a plot of the data with the fitted line overlaid)

34
Calculating fitted values

The fitted values are the values of ŷ that are obtained by plugging in the
original x values.

For the roller data, the x values are the values of roller$weight
roller$weight

## [1] 1.9 3.1 3.3 4.8 5.3 6.1 6.4 7.6 9.8 12.4

The fitted values are obtained by plugging these values in for x in
ŷ = −2.1 + 2.7x.

Manual calculation of fitted values:


-2.1 + 2.7*roller$weight

## [1] 3.03 6.27 6.81 10.86 12.21 14.37 15.18 18.42 24.36 31.38

35
Calculating fitted values using predict

The predict function calculates the fitted values automatically.


predict(roller.lm)

## 1 2 3 4 5 6 7 8 9 10
## 2.98 6.18 6.71 10.71 12.05 14.18 14.98 18.18 24.05 30.98

36
Polling question ... yes or no?

If we type predict(radon.lm) will we get more than 6 different values,


since we know that there are only 6 different values in the column
radon$diameter which are the x values we would plug into the
prediction equation?

i.e.
radon$diameter

## [1] 0.37 0.37 0.37 0.37 0.51 0.51 0.51 0.51 0.71 0.71 0.71 0.71 1.02 1.02 1.02
## [16] 1.02 1.40 1.40 1.40 1.40 1.99 1.99 1.99 1.99

37
Polling question ... answer

No!
predict(radon.lm)

## 1 2 3 4 5 6
## 79.83882 79.83882 79.83882 79.83882 78.18019 78.18019
## 7 8 9 10 11 12
## 78.18019 78.18019 75.81073 75.81073 75.81073 75.81073
## 13 14 15 16 17 18
## 72.13805 72.13805 72.13805 72.13805 67.63607 67.63607
## 19 20 21 22 23 24
## 67.63607 67.63607 60.64614 60.64614 60.64614 60.64614

There are only six different values. To see this clearly, use the table
function.
table(predict(radon.lm))

##
## 60.6461377309841 67.6360657498926 72.1380532874947
## 4 4 4
## 75.8107273313279 78.1801944563816 79.8388214439192
## 4 4 4
This shows that there are six different predicted (or fitted) values, each occurring four times.

38
Calculating residuals

For every true value of y there is a fitted value ŷ. The difference between
these two values, y − ŷ, is defined as the residual.

We can calculate the residuals manually by subtracting the fitted values from the true values of y, but the resid function does this calculation
for us.

For the lawn roller data, the residuals are


resid(roller.lm)

## 1 2 3 4 5
## -0.9796695 -5.1797646 -1.7131138 -5.7132327 7.9533944
## 6 7 8 9 10
## 5.8199976 8.0199738 -8.1801213 5.9530377 -5.9805017

These values should be like noise - random, with no clear pattern or trend.
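
The manual calculation can be compared with resid directly. This sketch types in the roller data (depression values reconstructed from the output printed in these slides) so that it runs without DAAG; the name manual is chosen here for illustration.

```r
# Roller data, typed in so that DAAG is not required
weight <- c(1.9, 3.1, 3.3, 4.8, 5.3, 6.1, 6.4, 7.6, 9.8, 12.4)
depression <- c(2, 1, 5, 5, 20, 20, 23, 10, 30, 25)
fit <- lm(depression ~ weight)

# Residual = observed value minus fitted value
manual <- depression - fitted(fit)
all.equal(unname(manual), unname(resid(fit)))

# With an intercept in the model, the least-squares residuals
# add up to zero (apart from rounding error)
sum(resid(fit))
```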

39
Plotting residuals

We can plot them against the fitted values to see if there is a pattern:
plot(resid(roller.lm) ~ predict(roller.lm),
ylab = "roller residuals", xlab="fitted values")

[Figure: roller residuals against fitted values.]

No obvious trend is present. This is a characteristic of a model that fits the data well.

40
Automatic plotting of residuals

If we apply the plot function directly to roller.lm, we can obtain four


plots, of which the first one is the plot of the residuals against the fitted
values.* †
plot(roller.lm, which = 1)

[Figure: "Residuals vs Fitted" plot for lm(depression ~ weight), with observations 5, 7 and 8 labelled.]

* The other three plots are beyond the scope of this course.
† The red curve is added to help the eye identify patterns. In this case, there might be
some nonlinearity that has been missed by the model.
41
Automatic plotting of residuals - radon data

This is the residual versus fitted value plot for the radon data.

plot(radon.lm, which = 1)

[Figure: "Residuals vs Fitted" plot for lm(percentage ~ diameter), with observations 17, 18 and 24 labelled.]

Polling question (yes or no?): Do you see a pattern?

42
Polling question ... answer

Yes!

The residuals decrease and then increase. The effect is slight, but the
pattern suggests that it is possible to improve the model.

43
Making predictions on new data using predict

Suppose we want to predict the depression made by a roller with a


weight of 5 units. Use

predict(roller.lm, newdata=data.frame(weight = 5))

## 1
## 11.24658

44
Making predictions on new data using predict

It is even better to include an interval estimate (this is a prediction interval):

predict(roller.lm, newdata=data.frame(weight = 5),
interval="prediction")

## fit lwr upr


## 1 11.24658 -5.134942 27.62811

Interpretation: the prediction interval (−5.13, 27.6) contains the true


amount of depression with 95% probability.
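
For reference, the 95% prediction interval in simple linear regression is ŷ0 ± t × s × sqrt(1 + 1/n + (x0 − x̄)²/Sxx), where s is the residual standard deviation on n − 2 degrees of freedom and t is the 97.5% point of the t distribution on n − 2 degrees of freedom. The sketch below computes this by hand and compares it with predict. The variable names are chosen for illustration, and the roller data are typed in (depression values reconstructed from the output printed in these slides).

```r
# Roller data, typed in so that DAAG is not required
weight <- c(1.9, 3.1, 3.3, 4.8, 5.3, 6.1, 6.4, 7.6, 9.8, 12.4)
depression <- c(2, 1, 5, 5, 20, 20, 23, 10, 30, 25)
fit <- lm(depression ~ weight)

x0 <- 5                                   # weight at which to predict
n <- length(weight)
s <- sqrt(sum(resid(fit)^2) / (n - 2))    # residual standard deviation
Sxx <- sum((weight - mean(weight))^2)

# Standard error for predicting a single new observation at x0
se.pred <- s * sqrt(1 + 1/n + (x0 - mean(weight))^2 / Sxx)

fit0 <- unname(coef(fit)[1] + coef(fit)[2] * x0)
tcrit <- qt(0.975, df = n - 2)
manual <- c(fit = fit0,
            lwr = fit0 - tcrit * se.pred,
            upr = fit0 + tcrit * se.pred)
manual

# Should match the interval reported by predict
p <- predict(fit, newdata = data.frame(weight = x0), interval = "prediction")
p
```

The "1 +" term under the square root is what separates a prediction interval for a new observation from the narrower confidence interval for the mean response.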

45
Visualizing the fitted line
plot(depression ~ weight, data = roller)
abline(roller.lm) # overlays fitted line

[Figure: scatterplot of depression against weight, with the fitted line overlaid.]

46
Radon release data example (cont’d)

• Predict the percentage release when the orifice diameter is set to one
of 1.15, 1.25 and 1.35.
• Overlay the plot of the data with the fitted line.

47
Predicting radon release percentage

Predictions at diameters of 1.15, 1.25 and 1.35 are as follows:

predict(radon.lm, newdata =
data.frame(diameter = c(1.15, 1.25, 1.35)))

## 1 2 3
## 70.59790 69.41317 68.22843
Interval predictions are:
predict(radon.lm, newdata =
data.frame(diameter = c(1.15, 1.25, 1.35)),
interval ="prediction")

## fit lwr upr


## 1 70.59790 63.88790 77.30790
## 2 69.41317 62.68594 76.14039
## 3 68.22843 61.47545 74.98142

For example, if the diameter is 1.15, there is 95% probability that the release
percentage will be between 63.9 and 77.3.

48
Visualizing the fitted line
plot(percentage ~ diameter, data = radon)
abline(radon.lm) # overlays fitted line

[Figure: scatterplot of percentage against diameter, with the fitted line overlaid.]

49
Lists

Lists are a very flexible type of object. They literally consist of a list of
things.

Lists in R can contain any kinds of objects, such as vectors, functions,


data frames and even other lists.

Data frames are actually a special kind of list.

You won’t often construct these yourself, but many functions return
complicated results as lists.

The lm function is an example. It returns output as a list.

50
Named lists

You can see the names of the objects in a list using the names()
function, and extract parts of it:

names(d) # Print the names of the objects in d.


d$x # Print the x component of d

51
Named lists

Let’s see what the objects are in the output from radon.lm.
names(radon.lm)

## [1] "coefficients" "residuals" "effects"


## [4] "rank" "fitted.values" "assign"
## [7] "qr" "df.residual" "xlevels"
## [10] "call" "terms" "model"

52
Lists

The list() function is one way of organizing multiple pieces of output


from functions. For example,
x <- c(3, 2, 3)
y <- c(7, 7)
z <- list(x = x, y = y)
z

## $x
## [1] 3 2 3
##
## $y
## [1] 7 7

53
Lists - accessing elements

In the previous slide, you should have seen the $.

Just like data frames, we can use the $ to extract elements.

For example, to extract x from z, try

z$x

## [1] 3 2 3

54
Lists - accessing elements

To access the coefficients from radon.lm, try

radon.lm$coefficients

## (Intercept) diameter
## 84.22234 -11.84734

This is the same result that is obtained from the extractor function coef.

55
Lists

There are several functions which make working with lists easy. Two of
them are lapply() and vapply(). The lapply() function “applies”
another function to every element of a list and returns the results in a
new list; for example,
lapply(z, mean)

## $x
## [1] 2.666667
##
## $y
## [1] 7
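
vapply() works like lapply(), but it returns a vector and checks that each result has the declared type and length. A minimal sketch using the same list z (the names means and sizes are chosen here):

```r
x <- c(3, 2, 3)
y <- c(7, 7)
z <- list(x = x, y = y)

# Each call to mean() must return exactly one number: numeric(1)
means <- vapply(z, mean, numeric(1))
means

# Each call to length() must return exactly one integer: integer(1)
sizes <- vapply(z, length, integer(1))
sizes
```

The result keeps the names of the list elements, so means["x"] picks out the mean of x.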

56
Programming 1: Flow Control in R

Computer programming often involves the use of repeated commands


or statements.

There are several R functions that control how many times statements
are repeated. The main function to use for this is for.

We will also describe how to control when code is executed and when it
is not to be executed. The main function for this is if.

2
Another look at adding lines to plots

Recall the Orange trees data, where we related circumference to age.

The basic scatterplot was obtained from

# Tree is an ordered factor whose level order is not 1, 2, ..., 5,
# so convert through character to recover the actual tree numbers
Orange$tree <- as.numeric(as.character(Orange$Tree))

plot(circumference ~ age, pch = tree, data = Orange)

and the broken lines were added one at a time with

lines(circumference ~ age, data = subset(Orange, tree == 1), lty = 1)
lines(circumference ~ age, data = subset(Orange, tree == 2), lty = 2)
lines(circumference ~ age, data = subset(Orange, tree == 3), lty = 3)
lines(circumference ~ age, data = subset(Orange, tree == 4), lty = 4)
lines(circumference ~ age, data = subset(Orange, tree == 5), lty = 5)

3
Adding lines and broken lines to plots

The resulting plot:

[Figure: circumference against age for the five orange trees, with a different plotting symbol and line type for each tree; legend: Tree 1 to Tree 5.]

4
Using the for() function to save time and space

The for function can be used to save time and space when
programming with lines that repeat.

The command we want to repeatedly execute is of the form


lines(circumference ~ age, data = subset(Orange, tree == 1), lty = 1)

but where the “1” changes to “2” and then to “3” and so on.

We can express these kinds of code lines as


lines(circumference ~ age, data = subset(Orange, tree == i), lty = i)

where i is changing, progressing through the values 1, 2, . . . , 5.

5
Using the for() function to save time and space

We can use the : function to generate these values:


1:5

## [1] 1 2 3 4 5

and for each of these values, we want to execute the command


lines(circumference ~ age, data = subset(Orange, tree == i), lty = i)

6
Using the for() function to save time and space

The for() function allows us to repeat a command a specified number


of times.

Syntax:

for (i in values) command

This sequentially sets a variable called i equal to each of the elements


of values.

For each value of i, the listed command is executed.
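
As a minimal illustration of this syntax (the vector values and the variable total are made up for the example), the loop below adds up the squares of the elements of values:

```r
values <- c(2, 5, 9)
total <- 0

# i is set to 2, then 5, then 9; the command runs once for each value
for (i in values) total <- total + i^2

total
```

Note that values does not have to be a sequence like 1:5; any vector can be used.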

7
Using the for() function on the orange tree plots
plot(circumference ~ age, data = Orange, pch = tree)
for (i in 1:5) lines(circumference ~ age,
data = subset(Orange, tree == i), lty = i)
legend("topleft", legend = paste("Tree", 1:5), lty = 1:5, pch = 1:5)

[Figure: circumference against age for the five orange trees, with a different plotting symbol and line type for each tree; legend in the top left: Tree 1 to Tree 5.]

8
Another example - plotting financial data

The four columns of the EuStockMarkets object contain daily closing prices for different European stock markets. We could plot the four time
series in a 2 × 2 layout of plots using
par(mfrow=c(2,2))
plot(EuStockMarkets[, 1])
plot(EuStockMarkets[, 2])
plot(EuStockMarkets[, 3])
plot(EuStockMarkets[, 4])

9
Another example - plotting financial data

Or we can use the for loop applied to the column index that is running through the values from 1 through 4:

par(mfrow=c(2,2))
for (i in 1:4) plot(EuStockMarkets[, i])

[Figure: 2 × 2 layout of time series plots of EuStockMarkets[, i] against Time (1992 to 1998), one panel per column.]

10
Factorial example

The factorial n! counts how many ways n different objects could be


ordered. It is defined as

n! = 1 · 2 · 3 · · · (n − 1) · n

11
Factorial Example

One way to calculate it would be to use a for() statement.

For example, we could find the value of 13! using the code
n <- 13
result <- 1
for (i in 1:n) result <- result * i
result

## [1] 6227020800

12
Understanding the Code

The first line sets a variable named n to 13, and the second line
initializes result to 1, i.e. a product with no terms.

The third line starts the for() statement: the variable i will be set to the
values 1, 2, . . . , n in succession.

Line 3 multiplies result by i in each of those steps, and the final line
prints it.

There is also a factorial() function built in to R; it is much faster than


the for() loop.
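
The loop result can be checked against the built-in function, and against prod(), which multiplies all the elements of a vector in one call. A minimal sketch:

```r
n <- 13
result <- 1
for (i in 1:n) result <- result * i

# Compare with the built-in factorial function
all.equal(result, factorial(n))

# prod() multiplies 1, 2, ..., n without an explicit loop
result == prod(1:n)
```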

13
Repeating several commands at a time

If we want to execute several commands at once, we enclose them in


curly brackets:

Syntax:

for (n in values) {
command 1
command 2
...
}

14
Repeating several commands at a time - cuckoos example

The code on the next slide will produce a 2 × 3 layout of plots for the
cuckoos data showing:

1. the egg breadths versus the egg lengths for the cuckoo eggs laid in
each of 6 host species’ nests.

2. the corresponding best-fit lines relating breadth to length

3. plot titles which give the species’ name (these are levels of the
species factor column).

4. an overall title describing the plot.

15
Repeating several commands at a time - cuckoos example
library(DAAG) # contains the cuckoos data frame
par(mfrow=c(2, 3))
for (i in levels(cuckoos$species)) {
plot(breadth ~ length,
data = subset(cuckoos, species == i))
breadth.lm <- lm(breadth ~ length,
data = subset(cuckoos, species==i))
abline(breadth.lm)
title(i)
}
mtext(side=3, line=-1.5,
"Characteristics of Cuckoo Eggs Laid in Other Birds' Nests",
outer=TRUE) # outer=TRUE puts this text in the outer margin

16
Repeating several commands at a time - cuckoos example

[Figure: 2 × 3 layout titled "Characteristics of Cuckoo Eggs Laid in Other Birds' Nests"; panels hedge.sparrow, meadow.pipit, pied.wagtail, robin, tree.pipit and wren, each showing breadth against length with the best-fit line.]

17
Example - simulation of random numbers

This example describes how to simulate random numbers on a


computer.

We will describe a method of generating nonrandom values, which are


then treated as though they are random.

Simulated random numbers are called pseudorandom numbers.

One of the simplest methods for simulating independent uniform


random variables on the interval [0,1] is the multiplicative congruential
random number generator.

It produces a sequence of pseudorandom numbers, u0, u1, u2, . . . , which


appear to be equally likely to occur in the interval between 0 and 1.

18
Generation of pseudorandom numbers

The formulas below can be coded to create pseudorandom numbers.

x_n = (171 x_{n−1}) %% 30269

u_n = x_n / 30269
with initial value x_0. (x_0 is called the seed.)

Recall that the first formula calculates the remainder after division of
171 x_{n−1} by 30269.

The second formula ensures that the resulting u-values are between 0
and 1.

19
Generation of pseudorandom numbers
u <- numeric(30268) # the output
# will be stored here
x0 <- 27218 # arbitrarily chosen seed
x <- x0 # current x value
for (j in 1:30268) {
x <- (171 * x) %% 30269 # update x with formula 1
u[j] <- x/30269
}

The results, stored in the vector u, are in the range between 0 and 1.
These are the pseudorandom numbers, u1, u2, . . . , u30268.
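
The same loop can be wrapped in a function so that the seed and the number of values are arguments. The function name mc.runif is made up for this sketch; the formulas are the ones given above.

```r
# Hypothetical wrapper around the multiplicative congruential generator
mc.runif <- function(n, seed) {
  u <- numeric(n)              # the output will be stored here
  x <- seed                    # current x value
  for (j in seq_len(n)) {
    x <- (171 * x) %% 30269    # update x with formula 1
    u[j] <- x / 30269          # scale with formula 2
  }
  u
}

u5 <- mc.runif(5, seed = 27218)
u5

# The same seed always gives the same sequence
identical(u5, mc.runif(5, seed = 27218))
```

Because the values are fully determined by the seed, they are not random at all, which is why they are called pseudorandom.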

20
Visualizing some of our pseudorandom numbers
u[1:5]

## [1] 0.7638508 0.6184876 0.7613730 0.1947867 0.3085335

hist(u[1:5000]) # histogram of first 5000 numbers

[Figure: histogram of u[1:5000]; the bars are of roughly equal height over the range 0 to 1.]

The values are all inside the interval between 0 and 1 and appear to be pretty evenly distributed over that
range.

The runif function uses a much better method, however.

21
Example: simulating normal random variables

Normal random variables are commonly used in data analysis as models


for continuous data. The normal distribution is the familiar bell-shaped
curve.

They are very common, because there is a mathematical theorem, called the Central Limit Theorem, that says that adding up measurements
will often result in approximately normally distributed random variables.

22
Example - simulation of normal variates

Simulating normal random variables is possible in a variety of ways.

If we add up 12 uniform random variables on [−.5, .5], we can get a sum


that follows a close approximation to the standard normal distribution.

We will use a for() loop to construct a large vector of such values so


that we can draw a histogram and QQ-plot, to verify that we have
succeeded in simulating normal random variables.

23
Example - simulation of normal variates

We can use the runif() function to simulate the uniform variates that
we will need.

Simulation of uniform variates:
N <- 10000
U <- runif(N, min=-.5, max=.5)

U contains values in the interval [−0.5, 0.5].

A histogram of the simulated uniform values:
hist(U, col="blue")

[Figure: histogram of U over the range −0.5 to 0.5.]

24
Example - simulation of normal variates

We initially assign 0 to our outcome vector Z.

Then we successively add a uniform vector of size N = 10000 to Z, 12


times.
Z <- 0; N <- 10000
for (i in 1:12) {
U <- runif(N, min=-.5, max=.5)
Z <- Z + U
}

Note that we started with a Z which had only one entry, and successively
added vectors of size 10000. R automatically changes the length of Z to
make elementwise addition with U possible.
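
This automatic length change is R's recycling rule. The sketch below first shows recycling on a tiny example, and then gives a loop-free version of the same simulation using a matrix with 12 columns; the name Z2 and the seed are arbitrary choices made here.

```r
# Recycling: the single value 0 is repeated to match the longer vector
Z <- 0
Z <- Z + c(0.1, -0.2, 0.3)
length(Z)

# Loop-free sketch of the same simulation: place 12 * N uniforms in a
# matrix with 12 columns and add across each row
set.seed(2021)  # arbitrary seed, for reproducibility only
N <- 10000
Z2 <- rowSums(matrix(runif(12 * N, min = -.5, max = .5), ncol = 12))

# Sample mean and standard deviation; these should be near 0 and 1
c(mean(Z2), sd(Z2))
```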

25
Simulating normal random variables - visualizing the result

The histogram is given below.

hist(Z)

[Figure: Histogram of Z; x-axis from −4 to 4, y-axis Frequency.]

This is a typical “bell-shaped” curve: the normal distribution.
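The QQ-plot mentioned earlier can be drawn with the base R functions qqnorm() and qqline(). The sketch below regenerates Z so that it runs on its own; the seed value is an arbitrary choice made only for reproducibility.

```r
set.seed(1)   # arbitrary seed, used only so the sketch is reproducible
N <- 10000
Z <- 0
for (i in 1:12) {
  Z <- Z + runif(N, min = -.5, max = .5)
}
qqnorm(Z)     # sample quantiles plotted against standard normal quantiles
qqline(Z)     # reference line through the first and third quartiles
```

If the simulated values are close to normal, the points should lie near the reference line, with possible departures in the extreme tails.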

26
Simulating standard normal random variables

In theory, the mean of random variables like Z should be 0, and the


standard deviation should be 1.

In fact, for our simulated sample, the values of the sample mean and
standard deviation are:
mean(Z) # sample average

## [1] 0.003215301

sd(Z) # sample standard deviation

## [1] 0.9942739

Different samples would have slightly different means and standard deviations, but all
would be pretty close to 0 and 1.

27
Summing squared standard normal variables

If Z is a standard normal random variable, then $X = Z^2$ is called a chi-squared random variable on 1 degree of freedom.
X <- Z^2; hist(X)

[Figure: Histogram of X; x-axis from 0 to 15, y-axis Frequency.]

This is an example of a skewed distribution. Most of the values are near


0, but there are a few very large values.
28
Example - nesting for() loops

If Z1, Z2, . . . , Zk are independent standard normal random variables,


then the sum of their squares is a chi-squared random variable on k
degrees of freedom.

We can use nested for() loops to simulate these sums of squared


normals.

For example, suppose k = 7 as in the following:


X <- 0; k <- 7
for (i in 1:k) {
    Z <- 0
    for (j in 1:12) {
        U <- runif(N, min = -.5, max = .5)
        Z <- Z + U      # Z is approximately standard normal
    }
    X <- X + Z^2        # X is approximately chi-squared
}
29
Example - nesting for() loops

The histogram below shows what a chi-squared distribution on 7 degrees of freedom looks like:

hist(X, main="")

[Figure: histogram of X; x-axis from 0 to 30, y-axis Frequency.]

It is skewed, but not as much as when the number of degrees of freedom is smaller.
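One quick numerical check on the nested loops relies on the standard fact that a chi-squared random variable on k degrees of freedom has mean k. The sketch below repeats the simulation so that it is self-contained; the seed is an arbitrary choice.

```r
set.seed(2)          # arbitrary seed for reproducibility
N <- 10000; k <- 7
X <- 0
for (i in 1:k) {
  Z <- 0
  for (j in 1:12) Z <- Z + runif(N, min = -.5, max = .5)
  X <- X + Z^2
}
mean(X)              # should be close to k = 7
```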

30
The if() statement

The if() statement allows us to control which statements are executed.

Syntax:

if (condition) {commands when TRUE}


if (condition) {commands when TRUE} else {commands if FALSE}

This statement causes a set of commands to be invoked if condition


evaluates to TRUE.

The else part is optional, and provides an alternative set of commands which are to be invoked when condition evaluates to FALSE.

31
The if() statement: Caution!

Be careful how you type the else statement. Typing it as

if (condition) {commands when TRUE}


else {commands when FALSE}

may produce an error, because R will execute the first line before you
have time to enter the second.

If these two lines appear within a block of commands in curly brackets,


they won’t trigger an error, because R will collect all the lines before it
starts to act on any of them.

32
The if() statement: Recommendation

To avoid this kind of difficulty, use the form

if (condition) {
commands when TRUE
} else {
commands when FALSE
}

33
The if() statement: Another Warning

R also allows numerical values to be used as the value of condition.

These are converted to logical values using the rule that zero becomes
FALSE, and any other value becomes TRUE.

Missing values are not allowed for the condition, and will trigger an error.
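Both warnings can be tried out directly. The following is a small sketch: tryCatch() is used only to capture the error message so that the script keeps running, and the final guard with is.na() is one possible way to protect a condition, not the only one.

```r
# Numeric conditions: zero is treated as FALSE, any other number as TRUE.
if (0) "nonzero" else "zero"      # takes the else branch
if (-2.5) "nonzero" else "zero"   # takes the first branch

# A missing condition stops with an error; tryCatch() captures the message.
x <- NA
msg <- tryCatch(
  if (x > 2) "big" else "small",
  error = function(e) conditionMessage(e)
)
msg

# One possible guard: test for NA before comparing.
result <- if (!is.na(x) && x > 2) "big" else "not big or missing"
result
```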

34
Example

A simple example:
x <- 3
if (x > 2) y <- 2 * x else y <- 3 * x

Since x > 2 is TRUE, y is assigned 2 * 3 = 6.

If it hadn’t been true, y would have been assigned the value of 3 * x.

35
Example - counting missing values
The cfseal data frame in DAAG contains mass measurements for various organs from
cape fur seals in addition to estimates of age and the overall weight.
summary(cfseal)

## age weight heart lung
## Min. : 10.0 Min. : 18.00 Min. : 84.5 Min. : 380.0
## 1st Qu.: 31.5 1st Qu.: 28.12 1st Qu.: 141.4 1st Qu.: 601.9
## Median : 54.5 Median : 46.25 Median : 230.0 Median :1055.0
## Mean : 56.2 Mean : 54.79 Mean : 316.1 Mean :1136.8
## 3rd Qu.: 73.0 3rd Qu.: 68.38 3rd Qu.: 422.5 3rd Qu.:1483.3
## Max. :120.0 Max. :179.00 Max. :1075.0 Max. :2735.0
## NA's :6
## liver spleen stomach leftkid
## Min. : 435.4 Min. : 24.5 Min. : 120.4 Min. : 53.5
## 1st Qu.:1041.8 1st Qu.: 46.0 1st Qu.: 388.7 1st Qu.:119.5
## Median :1820.0 Median : 90.0 Median : 640.0 Median :145.4
## Mean :2211.7 Mean :122.5 Mean : 782.0 Mean :179.8
## 3rd Qu.:2976.8 3rd Qu.:145.0 3rd Qu.:1029.5 3rd Qu.:234.4
## Max. :8309.0 Max. :425.0 Max. :2500.0 Max. :385.0
## NA's :1 NA's :6
## rightkid kidney intestines
## Min. : 58.0 Min. : 112.0 Min. : 711.4
## 1st Qu.:106.8 1st Qu.: 239.8 1st Qu.:1147.2
## Median :151.8 Median : 373.5 Median :1694.5
## Mean :182.7 Mean : 440.6 Mean :1987.0
## 3rd Qu.:238.2 3rd Qu.: 600.0 3rd Qu.:2707.5
## Max. :400.0 Max. :1410.0 Max. :3570.0
## NA's :6 NA's :6

36
Example - counting missing values

The summary shows that some of the columns have missing values (NA).

We can use is.na to determine which values are missing and


sum(is.na()) to count them up.

For example,
is.na(cfseal$lung)

## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [23] FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE

sum(is.na(cfseal$lung))

## [1] 6
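The same idea extends to every column at once with colSums(). The sketch below uses a small made-up data frame (named seals here purely for illustration) so that it runs without the DAAG package.

```r
# Hypothetical data frame standing in for cfseal (made-up values).
seals <- data.frame(
  age   = c(10, 25, 40, 55),
  lung  = c(380, NA, 1055, NA),
  liver = c(435, 1040, NA, 2975)
)
# is.na() applied to a data frame gives a logical matrix;
# colSums() then counts the TRUE values in each column.
na_counts <- colSums(is.na(seals))
na_counts
```

With the real data, colSums(is.na(cfseal)) should give the per-column counts that summary() reports.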

37
Example - counting missing values

par(mfrow=c(3, 4))
for (i in 1:11) hist(cfseal[ ,i], main=names(cfseal)[i], xlab="mass")

[Figure: 3 × 4 grid of histograms, one per column (age, weight, heart, lung, liver, spleen, stomach, leftkid, rightkid, kidney, intestines); x-axis mass, y-axis Frequency.]

From this figure, you would not know that the data contain missing values.

38
Example - counting missing values

The code below uses the if statement to add information about


numbers of missing values to the figure, in the location of the subtitles
(this is controlled with the sub argument in the title function).
par(mfrow=c(3, 4))
for (i in 1:11) {
hist(cfseal[ ,i], main=names(cfseal)[i], xlab="mass")
Nmissing <- sum(is.na(cfseal[, i]))
if(Nmissing > 0) title(sub=paste(Nmissing, "NA's"))
}

Note how we can use the paste function to combine the missing value
count in Nmissing with the character string NA's.

The figure is on the next slide.

39
Example - counting missing values

[Figure: the same 3 × 4 grid of histograms (age, weight, heart, lung, liver, spleen, stomach, leftkid, rightkid, kidney, intestines), now with missing-value counts as subtitles: "6 NA's" in the first row, "1 NA's" and "6 NA's" in the second row, and "6 NA's" twice in the third row.]

40

1
Programming 2: Functions

In this lecture, we will learn

• what a function is, in R

• how to construct functions in R

• how to manage the complexity of computer programming by the use


of functions which break a large complicated problem into a bunch
of smaller manageable problems

We will also learn how to smooth a scatter plot by constructing a set of


small functions that do simple things but accomplish a complicated feat
when put together.

2
Programming 2: Functions

As we have seen, R calculations are carried out by functions, and


graphs are produced by functions.

The usual composition of a function is

• a header that includes the word function and an argument list


(which might be empty)

• a body which includes a set of statements enclosed in curly


brackets {}.

Function names should be chosen to describe the action of the function.


For example, median() computes medians, and boxplot() produces
box plots.
3
An example using an if statement

The if() statement is often used inside user-defined functions.

The correlation between two vectors of numbers is often calculated


using the cor() function.

It is supposed to give a measure of linear association.

4
An example using an if statement

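We can write a function that optionally draws a scatter plot of the data and then returns the correlation, as follows.

corplot <- function(x, y, plotit) {
    if (plotit) plot(x, y)   # draw the scatter plot only when plotit is TRUE
    cor(x, y)                # the value of the last statement is returned
}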

5
An example using an if statement

We can apply this function to two vectors without plotting by typing

Z1 <- runif(100)
Z2 <- runif(100)
corplot(Z1, Z2, FALSE)

## [1] 0.01178233

Correlations can take values between -1 and 1. Positive correlations


indicate that one variable can be predicted from the other by a straight
line with positive slope.

A correlation near 0 tells us that the two variables are not predictable from each other by a straight line.

6
An example using an if statement

We will now apply the function to the first two columns of the European stock price data, requesting the scatter plot as well as the computed correlation:
corplot(EuStockMarkets[, 1], EuStockMarkets[,2], TRUE)

[Figure: scatter plot of y against x; x-axis from 2000 to 6000, y-axis from 2000 to 8000.]


## [1] 0.9911539
Not surprisingly, the two stock markets are very highly correlated. They take low values together, and
high values together.

7
Example - a normal random number generator

We will write a function to approximately simulate standard normal


random variables. An appropriate header for the function could be:

rStdNorm <- function(n)

Note that this function will take n as an input. The output should be that
number of standard normal variates.

At some point in the body of the function there is normally a statement


like return(Z) which specifies the output value of the function. If there
is no return() statement, then the value of the last statement executed
is returned.

8
Example - a normal random number generator

In our standard normal simulator, we will want to return a vector of


length n. We will use Z as the name of this object.

rStdNorm <- function(n) {


...
return(Z)
}

9
Example - a normal random number generator

Using the sum of uniforms concept from the earlier example, we will use
a function body of the form:

{
Z <- 0
for (j in 1:12) {
U <- runif(n, min = -.5, max = .5)
Z <- Z + U
}
return(Z)
}

10
Example - a normal random number generator

Putting the header and body together, we have the following function:

rStdNorm <- function(n) {
    Z <- 0
    for (j in 1:12) {
        U <- runif(n, min = -.5, max = .5)
        Z <- Z + U
    }
    return(Z)
}

11
Example - a normal random number generator

A trial with 3 values is executed as follows:


rStdNorm(3)

## [1] -0.5420390 0.3298943 -1.1798510

12
Functions can take any number of arguments

We can use our new rStdNorm() function inside a function which


calculates chi-squared random variables on k degrees of freedom.

Two arguments, n and k will be needed in this function.

rChisq <- function(n, k) {
    X <- 0
    for (i in 1:k) {
        Z <- rStdNorm(n)
        X <- X + Z^2
    }
    return(X)
}

13
Functions can take any number of arguments

A trial with k = 17 degrees of freedom, and 2 values is executed as


follows:

rChisq(2, 17)

## [1] 10.03174 12.92907

14
Use of default arguments

To give the user of a function a hint as to the kind of input that the
function is expecting, we may give default values to some arguments.

If the user doesn’t specify the value, the default will be used.

15
Example

We could have used the header, i.e. the first line of the function,

rChisq <- function(n, k = 1)

to indicate that if a user called rChisq(10) without specifying k, then it


should act as though k = 1.
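A minimal sketch of this header in use follows. The two functions are restated compactly (relying on the value of the last statement being returned) so that the example is self-contained.

```r
rStdNorm <- function(n) {
  Z <- 0
  for (j in 1:12) Z <- Z + runif(n, min = -.5, max = .5)
  Z
}
# k = 1 is used unless the caller supplies a value for k.
rChisq <- function(n, k = 1) {
  X <- 0
  for (i in 1:k) X <- X + rStdNorm(n)^2
  X
}
a <- rChisq(10)          # k is not given, so the default k = 1 is used
b <- rChisq(10, k = 3)   # the default is overridden by name
```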

16
Function environment

We conclude our brief discussion of functions with a mention of the


function’s environment.

We won’t give a complete description here, but will limit ourselves to the
following circular definition: the environment is a reference to the
environment in which the function was defined.

This determines which objects the function can access.

17
Example

A function myfun is created in an environment that does not contain


mydata:

myfun <- function() {


mymean <- mean(mydata)
return(mymean)
}
myfun() # execute function

## Error in mean(mydata): object ’mydata’ not found

18
Example

Now, consider what happens when mydata is in the function’s


environment:

mydata <- rChisq(4, 1)


myfun() # mydata exists now and mymean exists internally to

## [1] 1.082443

Note, as well, that mymean does not exist in the workspace, only locally
to myfun:
mymean

## Error in eval(expr, envir, enclos): object ’mymean’ not found
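The two lookups can be shown side by side in a short sketch. The names scale_factor, make_scaled and doubled are hypothetical and are used only for this illustration.

```r
# A function looks up free variables in the environment where it was defined.
scale_factor <- 10                 # hypothetical object in the workspace
make_scaled <- function(x) {
  doubled <- 2 * x                 # doubled is local to the function
  doubled * scale_factor           # scale_factor is found in the defining environment
}
out <- make_scaled(3)
out                                # 60
exists("doubled")                  # FALSE: doubled was never created in the workspace
```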

19
Managing complexity through functions

Most real computer programs are much longer than the examples we
give in this course.

Most people can’t keep the details in their heads all at once, so it is
extremely important to find ways to reduce the complexity.

There have been any number of different strategies of program design


developed over the years.

We now give a short outline of some of the strategies that have been
effective.

20
Reminder: what are functions?

Functions are self-contained units of R code with a well-defined


purpose.

In general, functions take inputs, do calculations (possibly printing


intermediate results, drawing graphs, calling other functions, etc.), and
produce outputs.

If the inputs and outputs are well-defined, the programmer can be reasonably sure whether the function works or not and, once it works, can move on to the next problem.

21
Example

Suppose payments of R dollars are deposited annually into a bank


account which earns constant interest i per year.

What is the accumulated value of the account at the end of n years,


supposing deposits are made at the end of each year?

The total amount at the end of n years is

$$R(1 + i)^{n-1} + \cdots + R(1 + i) + R = R\,\frac{(1 + i)^n - 1}{i}$$

An R function to calculate the amount of an annuity is

annuityAmt <- function(n, R, i) {
    R*((1 + i)^n - 1) / i
}
22
Example

If $400 is deposited annually for 10 years into an account bearing 5%


annual interest, we can calculate the accumulated amount using

annuityAmt(10, 400, 0.05)

## [1] 5031.157
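As a check on the closed-form expression, the term-by-term sum can be computed directly and compared with the function's result. This is a small sketch; the variable names direct and closed are chosen here for illustration.

```r
annuityAmt <- function(n, R, i) {
  R * ((1 + i)^n - 1) / i
}
n <- 10; R <- 400; i <- 0.05
# The deposit made at the end of year t earns interest for n - t years,
# so the exponents run from 0 (last deposit) to n - 1 (first deposit).
direct <- sum(R * (1 + i)^(0:(n - 1)))
closed <- annuityAmt(n, R, i)
all.equal(direct, closed)   # TRUE when the two agree up to rounding error
```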

23
Functions in R are objects

R is somewhat unusual among computer languages in that functions are


objects that can be manipulated like other more common objects such
as vectors, matrices and lists.

Functions can be created and called within the body of a function.

24
Example

A function implementing the sieve of Eratosthenes for finding prime numbers has the following body:

{
if (n >= 2) {
sieve <- seq(2, n)
primes <- c()
for (i in seq(2, n)) {
if (any(sieve == i)) {
primes <- c(primes, i)
sieve <- c(sieve[(sieve %% i) != 0], i)
}
}
return(primes)
} else {
stop("Input value of n should be at least 2.")
}
}
25
Prime number sieve example

At some point in the body of the function there is normally a statement


like return(primes) which specifies the output value of the function.

(In R all functions produce a single output. In some other languages


functions may produce no output, or multiple outputs.)

If there is no return() statement, then the value of the last statement


executed is returned.

26
Prime number sieve example

Within the Eratosthenes function, we could define a new function. Its


environment would include n, sieve, primes and i.

For example, we might want to make the removal of multiples of the


prime values clearer by putting that operation into a small function
called noMultiples.

27
Prime number sieve example
Eratosthenes <- function(n) {
    # Return all prime numbers up to n (based on the sieve of Eratosthenes)
    if (n >= 2) {
        # noMultiples() drops the multiples of j from sieve
        noMultiples <- function(j) sieve[(sieve %% j) != 0]
        sieve <- seq(2, n)
        primes <- c()
        for (i in seq(2, n)) {
            if (any(sieve == i)) {
                primes <- c(primes, i)
                sieve <- c(noMultiples(i), i)
            }
        }
        return(primes)
    } else {
        stop("Input value of n should be at least 2.")
    }
}

The noMultiples function defines j in its header, so j is a local variable, and it finds
sieve in its environment.
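A short trial run helps confirm that the nested function behaves as intended. The function is restated here in a compact form so that the sketch runs on its own; the early stop() replaces the if/else layout but is meant to behave the same way.

```r
Eratosthenes <- function(n) {
  if (n < 2) stop("Input value of n should be at least 2.")
  noMultiples <- function(j) sieve[(sieve %% j) != 0]  # finds sieve in its environment
  sieve <- seq(2, n)
  primes <- c()
  for (i in seq(2, n)) {
    if (any(sieve == i)) {
      primes <- c(primes, i)
      sieve <- c(noMultiples(i), i)
    }
  }
  primes
}
small_primes <- Eratosthenes(20)
small_primes   # 2 3 5 7 11 13 17 19
```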

28
Returning multiple objects

R functions always return a single object.

But sometimes you want to return more than one thing.

The trick here is to return them in a list() or vector.

29
Example

For example, our annuityAmt() function simply calculated the amount


of the annuity at the end of n years.

But we might also want to know the present value, which is $(1 + i)^{-n}$ times the amount.

We return both in this function:

annuityValues <- function(n, R, i) {
    amount <- R*((1 + i)^n - 1) / i
    PV <- amount * (1 + i)^(-n)
    list(amount = amount, PV = PV)
}
annuityValues(10, 400, 0.05)

## $amount
## [1] 5031.157
##
## $PV
## [1] 3088.694

30
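Once the values are returned in a list, individual components can be pulled out by name. The sketch below restates annuityValues() so that it is self-contained; vals is a name chosen here for illustration.

```r
annuityValues <- function(n, R, i) {
  amount <- R * ((1 + i)^n - 1) / i
  PV <- amount * (1 + i)^(-n)
  list(amount = amount, PV = PV)
}
vals <- annuityValues(10, 400, 0.05)
vals$amount      # one component, extracted by name with $
vals[["PV"]]     # the same idea using double brackets
```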
Exercise - smoothing a scatterplot

The faithful data set consists of the waiting times until the next
eruption of the Old Faithful geyser together with the corresponding
eruption times.

The scatterplot for these data can be obtained by typing


plot(faithful, pch=16, col="black")


[Figure: scatter plot of waiting (y-axis, 50 to 90) against eruptions (x-axis, 1.5 to 5.0).]

31
Smoothing a scatterplot

A simple way to make predictions from such data is to smooth the


scatterplot of the y values that are plotted against the x values.

One way to do this is to use moving averages.

In other words, just take averages of y values that are near each other
according to their x values.

Join these averages together to form a curve.

32
Smoothing a scatterplot

We will construct a function called smoother() which outputs a new


data frame consisting of a column of equally spaced x values and a
column of corresponding local averages, taking the following arguments

• x: the vector of x values

• y: the vector of y values

• x.min: a constant which specifies the left boundary of the plotted


curve

• x.max: a constant which specifies the right boundary of the plotted


curve

• window: a constant giving the range of x values used to calculate


the moving averages

33
Smoothing a scatterplot - function header

The function header:

smoother <- function(x, y, x.min, x.max, window) {

34
Smoothing a scatterplot - function output

The output for this function will be a data frame with 2 columns, x and y: the x locations where the averages are taken and the corresponding y-averages.

Thus, we include a line such as the one at the end of the following
body-less function:

smoother <- function(x, y, x.min, x.max, window) {


...
data.frame(x = xpoints, y = yaverages)
}

35
Smoothing a scatterplot - function body

We use the seq() function to create a sequence of 401 equally spaced x


values, starting at x.min and ending at x.max.

We include a line of code that assigns this sequence to an object called


xpoints:

smoother <- function(x, y, x.min, x.max, window) {


xpoints <- seq(x.min, x.max, len=401)
...
data.frame(x = xpoints, y = yaverages)
}

36
Smoothing a scatterplot - function body

We use a for() loop to calculate the column of corresponding


yaverages.

To do this, we need to first initialize the yaverages object to have the


same number of elements as xpoints.

Thus, we include the following line in the function:

yaverages <- numeric(length(xpoints))

37
Smoothing a scatterplot - function body

Next, for each value of i, running from 1 through length(xpoints), we need to determine which elements of the original data vector x are close to xpoints[i], so that we can take the average of the corresponding y values only.

In other words, we want to determine the indices of x for which the


absolute value of x - xpoints[i] is less than the window parameter
that was specified in the argument to the smoother() function.

smoother <- function(x, y, x.min, x.max, window) {


xpoints <- seq(x.min, x.max, len=401)
yaverages <- numeric(length(xpoints))
for (i in 1:length(xpoints)) {
indices <- which(abs(x - xpoints[i]) < window)
}
data.frame(x = xpoints, y = yaverages)
}
38
Smoothing a scatterplot - function body

Within the for() loop just created, we add a line of code which assigns
the average of the values in y[indices] to yaverages[i]:

smoother <- function(x, y, x.min, x.max, window){


xpoints <- seq(x.min, x.max, len=401)
yaverages <- numeric(length(xpoints))
for (i in 1:length(xpoints)) {
indices <- which(abs(x - xpoints[i]) < window)
yaverages[i] <- mean(y[indices])
}
data.frame(x = xpoints, y = yaverages)
}

39
Smoothing a scatterplot - testing the function

We should now have a working function, which we can test on artificial


data, in this case, a noisy parabola:

For example,

x <- seq(0, 3, length=20)
y <- x^2 + rnorm(20)
plot(x, y, pch=16)
lines(smoother(x, y,
               x.min=0.25,
               x.max=2.75,
               window=0.5), lwd=2)

[Figure: scatter plot of y against x (x-axis from 0.0 to 3.0) with the smoothed curve overlaid.]

40
Smoothing a scatterplot - testing the function

For example,

x <- seq(0, 3, length=20)
y <- x^2 + rnorm(20)
plot(x, y, pch=16)
lines(smoother(x, y,
               x.min=0.25,
               x.max=2.75,
               window=0.06), lwd=2)

[Figure: scatter plot of y against x (x-axis from 0.0 to 3.0) with the smoothed curve for window=0.06 overlaid.]

When window is very close to 0, we see missing pieces in the smooth


curve. Why?

If the window parameter is too close to 0, there will be no data points


close enough to some of the values in xpoints, so you will be averaging
no data, thus, there is nothing to plot.
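What "averaging no data" produces can be seen in isolation. In this small sketch the vectors are made up, and the target value 10 is deliberately far from every x value so that no index qualifies.

```r
y <- c(1.2, 3.4, 5.6)
x <- c(0, 1, 2)
indices <- which(abs(x - 10) < 0.5)   # no x value lies within 0.5 of 10
length(indices)                       # 0
avg <- mean(y[indices])               # the mean of an empty vector
avg                                   # NaN
is.nan(avg)                           # TRUE
```

Each such NaN in yaverages leaves a point that lines() cannot draw, which is how the gaps in the curve arise.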

41
Adding an error message - using if and stop()

To avoid such a problem, we include an error message in the function to


tell the user that the window parameter is too small.

The stop() function provides such a message and aborts execution of


the function.

Within the for() loop of the function, we include the following lines of code:

if (length(indices) < 1) {
stop("Your choice of window width is too small.")
} else {
yaverages[i] <- mean(y[indices])
}

42
Smoothing a scatterplot

smoother <- function(x, y, x.min, x.max, window=1) {
    xpoints <- seq(x.min, x.max, len=401)
    yaverages <- numeric(401)
    for (i in 1:length(xpoints)) {
        indices <- which(abs(x - xpoints[i]) < window)
        if (length(indices) < 1) {
            stop("Your choice of window width is too small.")
        } else {
            yaverages[i] <- mean(y[indices])
        }
    }
    data.frame(x = xpoints, y = yaverages)
}

43
Smoothing a scatterplot

Finally, note that the so-called “smooth” curve is still quite bumpy.

To reduce the bumpiness, we can iterate the smoothing procedure.

In other words, we can repeat the smoothing procedure on the output


from smoother(), as follows:

output1 <- smoother(x, y, 0.25, 2.75, window = .5)


output2 <- smoother(output1$x, output1$y, 0.25, 2.75,
window = .25)

Observe that the window parameter does not have to be the same for
each iteration.

44
Smoothing a scatterplot

We now construct a new function called doublesmoother() which


takes the same arguments as smoother, but where window is now
assumed to be a vector with 2 elements.

The output from doublesmoother() is again a data frame consisting of


xpoints and yaverages as in smoother() but should be the result of
the second round of smoothing.

doublesmoother <- function(x, y, x.min, x.max, window) {


output1 <- smoother(x, y, x.min, x.max, window[1])
output2 <- smoother(output1$x, output1$y, x.min, x.max,
window[2])
output2
}

45
Smoothing the faithful scatterplot

Applying the doublesmoother() function to the faithful data frame, with a window parameter of 1 unit for the first level of smoothing and a value of 0.1 unit for the second level, and equally spaced xpoints in the interval [1.5, 5.0]:

plot(faithful, pch=16, col="grey")
lines(doublesmoother(faithful$eruptions, faithful$waiting,
                     1.5, 5.0, c(1, 0.1)), col="blue", lwd=2)

[Figure: scatter plot of waiting (y-axis, 50 to 90) against eruptions (x-axis, 1.5 to 5.0) with the double-smoothed curve overlaid.]

46

1
Visualizing and Modelling Multiple Variables

In this lecture, we will introduce

• lattice graphics

1. dot plots

2. conditioning scatter plots

• predictive modelling with multiple regression

2
Dot plots - in the lattice package

The lattice package contains functions for plotting that extend beyond
what is easily done with base graphics.

We will focus on two lattice graphing functions. The first gives another
version of the dot plot, but the syntax depends on a modelling formula
instead of a matrix.

Typical usage:

library(lattice) # loads lattice package
dotplot(myfactor ~ mymeasurement|optionalfactor,
        data = mydata)

Other arguments such as main, xlab and so on, can still be used.
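A runnable instance of this pattern is sketched below. It uses the built-in iris data set purely as a stand-in for mydata (it is not one of the data sets used in these slides), and the title and axis label are illustrative.

```r
library(lattice)   # lattice is a recommended package shipped with most R installations
# Species plays the role of myfactor, Sepal.Length the role of mymeasurement.
p <- dotplot(Species ~ Sepal.Length, data = iris,
             main = "Sepal length by species", xlab = "sepal length (cm)")
print(p)           # lattice plots are objects; printing draws them
```

Storing the plot in an object and printing it explicitly matters inside loops and functions, where a lattice plot is not drawn automatically.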

3
Dot plot example - lengths of cuckoo eggs

Recall the cuckoo data in the DAAG package which contain lengths of
eggs laid in the nests of other birds.
A basic lattice dot plot showing how the length distributions compare between the
different host species:
library(DAAG) # contains cuckoo data
dotplot(species ~ length, data = cuckoos)

[Figure: dot plot of length (x-axis, 20 to 25) for each host species: wren, tree.pipit, robin, pied.wagtail, meadow.pipit, hedge.sparrow.]

Wrens are small birds, so it is not surprising that the cuckoo eggs found in their nests are smaller.

4
Dot plot example - comparing treatments for anorexia

The anorexia data frame, in the MASS package contains weight


measurements for young female anorexia patients before and after a
study comparing two types of treatments with a control.

The three columns of the data frame are:

1. Treat which is a factor containing levels


• Cont (Control - no therapy)
• CBT (Cognitive Behavioural Therapy)
• FT (Family Therapy)
2. Prewt: Weight of patient before study period, in lbs.
3. Postwt: Weight of patient after study period, in lbs.

5
Dot plots - comparing treatments for anorexia

We usually want to compare the change in weight between the treatment


groups.

For this purpose, we construct a new column which subtracts the


pre-study weights from the post-study weights:
library(MASS) # contains anorexia data set
anorexia$change <- with(anorexia, Postwt - Prewt)

The first few data frame rows:


head(anorexia, n=3)

## Treat Prewt Postwt change


## 1 Cont 80.7 80.2 -0.5
## 2 Cont 89.4 80.1 -9.3
## 3 Cont 91.8 86.4 -5.4
Note that in this context, a negative change is bad. The therapies should increase the weight if they are
effective.
6
Dot plots - comparing treatments for anorexia

The factor of interest is Treat.

We want to know if the measured changes in weight are different for the
different therapies.

If there is no difference from the control group (where nothing was


done), we would conclude that the therapies are not useful.

7
Dot plots - comparing treatments for anorexia

Side-by-side dot plots* allow us to see differences between the different


treatment groups:

dotplot(Treat ~ change, data = anorexia,
        xlab = "weight change", ylab="treatment")

[Figure: dot plot of weight change (x-axis, −10 to 20) for each treatment: FT, Cont, CBT.]

It appears that the treatments (Family Therapy and CBT) sometimes lead to better increases in weight than the control group, but there is a lot of overlap among the measurement distributions.

This is not very strong evidence in favour of the therapies. We will return to this example later.

* Box plots could also be used - try bwplot in place of dotplot.


8
Dot plots - orange juice energy data

Recall the orange juice energy data from an earlier lecture where we
used base graphics to plot the energy measurements for the different
orange juice brands, taking account of the machine used to do the
measuring.

The data in orangejuice.R are in 3 columns.


source("orangejuice.R")
head(orangejuice, n=3)

## energy brand machine


## 1 89 A M1
## 2 94 A M1
## 3 92 A M2

We are interested in how energy relates to machine for the levels of


brand.
9
Dot plots - orange juice energy data

Including brand as the optional factor, we can see clearly how energy
depends on brand, for the different machines.

dotplot(machine ~ energy|brand, data = orangejuice)

[Figure: dot plots of energy (x-axis, 90 to 110) by machine (M1, M2, M3), one panel per brand (A, B, C, D, E, F).]

Observations: Now it is clear that brand D is a pretty high energy juice, especially compared with A, C
and F. Energy in brand E is intermediate. Machine does not have a systematic effect.

10
Conditioning scatter plots

The second lattice function gives another version of the scatter plot, extended in a way that helps visualize relatively complex data frames.

Typical usage:

library(lattice) # loads lattice package
xyplot(y ~ x|optionalfactor, data = mydata)

Other arguments such as main, xlab and so on, can still be used.

11
Conditioning scatter plots - cuckoo eggs example

A scatter plot of length versus breadth for the cuckoo eggs can be
obtained with
xyplot(length ~ breadth, data = cuckoos)

[Figure: scatter plot of length (y-axis, 20 to 25) against breadth (x-axis, 15.0 to 17.5).]

We see a general tendency for length to increase with breadth, but this plot hides any possible effects
due to the different species.

12
Conditioning scatter plots - cuckoo eggs example

We can condition on species by including the vertical slash and the


additional factor species:
xyplot(length ~ breadth|species, data = cuckoos)

[Figure: scatter plots of length (20 to 25) against breadth (15.0 to 17.5), one panel per species: robin, tree.pipit, wren, hedge.sparrow, meadow.pipit, pied.wagtail.]

Some observations: there is still only a vague tendency for length to increase with breadth, but we can
see that both length and breadth are small for wrens. For other species, the breadth measurements tend
to be larger, while length measurements have a relatively large range.

13
Conditioning scatter plots - cuckoo eggs example

We can overlay a smooth curve in each of the panels, through the use of
the type argument.
xyplot(length ~ breadth|species, data = cuckoos, type=c("p", "smooth"))

[Figure: the same panels of length against breadth by species (robin, tree.pipit, wren, hedge.sparrow, meadow.pipit, pied.wagtail), each with a smooth curve overlaid.]

The smooth curve is fit to the data using a method similar to double smoothing described earlier.

14
Conditioning scatter plots - nitrous oxide emissions example

Nitrous oxide is one of the four main greenhouse gases. It is produced by internal combustion engines.

data(ethanol) # contained in the lattice package

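The ethanol data set contains engine emissions of nitrous oxide (the NOx variable, which the lattice documentation describes as the concentration of nitrogen oxides, NO and NO2) at several different equivalency ratios E and five different compression ratios C:
summary(ethanol)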

##       NOx              C                E
##  Min.   :0.370   Min.   : 7.500   Min.   :0.5350
##  1st Qu.:0.953   1st Qu.: 8.625   1st Qu.:0.7618
##  Median :1.754   Median :12.000   Median :0.9320
##  Mean   :1.957   Mean   :12.034   Mean   :0.9265
##  3rd Qu.:3.003   3rd Qu.:15.000   3rd Qu.:1.1098
##  Max.   :4.028   Max.   :18.000   Max.   :1.2320

All three variables are numeric; we might want to predict nitrous oxide emissions from the other two variables.

15
Conditioning Plots - nitrous oxide emissions example

Because there are several measurements taken at only five values of the
compression ratio C, we can construct five scatter plots, each relating
nitrous oxide emissions to E for a fixed value of C.

We say that we are conditioning on C, and each scatter plot tells us how
NOx and E relate to each other, after accounting for C.

The xyplot function produces a set of conditioning scatter plots (“co-plots”).

16
Conditioning Plots - nitrous oxide emissions example

The following code produces a co-plot of nitrous oxide emissions (NOx) vs equivalency ratio (E) for each value of the compression ratio (C).

xyplot(NOx ~ E|C, data=ethanol)

[Figure: five-panel co-plot of NOx (vertical axis, 1 to 4) against E (horizontal axis, 0.6 to 1.2), one panel for each value of C.]

Each panel shows how the nitrous oxide emissions increase with equivalency ratio to a maximum and then decrease again.

The orange bar in the strip at the top of each panel indicates the relative size of C for that panel. That is, in the
lower left panel, the value of C is lowest and in the upper right panel, the value of C is highest.
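As a rough check of why conditioning on C gives exactly five panels, the sketch below counts the distinct compression ratios and the number of measurements behind each panel. It assumes only that the lattice package (which supplies ethanol and xyplot) is installed; the object names c_values and coplot_obj are made up for this illustration.

```r
library(lattice)  # supplies the ethanol data and xyplot()

data(ethanol)

# C takes only a handful of distinct values, which is what makes
# conditioning on it natural.
c_values <- sort(unique(ethanol$C))
print(c_values)

# Number of measurements that end up in each panel.
print(table(ethanol$C))

# Build (but do not yet draw) the co-plot; print(coplot_obj) draws it.
coplot_obj <- xyplot(NOx ~ E | C, data = ethanol)
```

Storing the plot in an object first is only a convenience here: it lets the same co-plot be redrawn or modified later without repeating the call.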

17
Conditioning Plots - nitrous oxide emissions example

Why are the conditioning plots useful? We can see this by comparing
with what we would get by simply looking at a scatter plot of NOx vs E:

xyplot(NOx ~ E, data=ethanol)

[Figure: single scatter plot of NOx (vertical axis, 1 to 4) against E (horizontal axis, 0.6 to 1.2), ignoring C.]

In the conditioning plots, a pattern of increase followed by decrease was clearly evident.

When we ignore the effects of C, we see very complicated looking patterns in the relation between NOx and E - these patterns are due to C, not due to how NOx and E are related.

18
Overlaying the conditioning plots with smooth curves

Again, we can overlay a smooth curve.


xyplot(NOx ~ E|C, data=ethanol, type=c("p", "smooth"), span=.65)

[Figure: the five-panel co-plot of NOx (1 to 4) against E (0.6 to 1.2) for each value of C, with a smooth curve overlaid in each panel.]

The span argument indicates what proportion of the data should be used to estimate each point of the
smooth curve.
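To get a feel for what span does, the following sketch computes two smooth curves for the same data with different spans. It uses loess.smooth from base R, which, as far as I know, is the routine that the lattice "smooth" panel relies on; the spans 0.4 and 0.9 and the roughness helper are arbitrary choices for illustration, not values from the lecture.

```r
library(lattice)  # only needed for the ethanol data
data(ethanol)

# Each call returns 50 (x, y) points along the fitted curve.
smooth_small <- loess.smooth(ethanol$E, ethanol$NOx, span = 0.4,
                             degree = 1, family = "symmetric",
                             evaluation = 50)
smooth_large <- loess.smooth(ethanol$E, ethanol$NOx, span = 0.9,
                             degree = 1, family = "symmetric",
                             evaluation = 50)

# A crude wiggliness measure: total absolute second difference of the curve.
roughness <- function(s) sum(abs(diff(s$y, differences = 2)))
print(c(span_0.4 = roughness(smooth_small),
        span_0.9 = roughness(smooth_large)))
```

A smaller span uses fewer neighbouring points for each local fit, so the curve can follow the data more closely; a larger span averages over more of the data and gives a flatter curve.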

19
Another look at the anorexia data

Another way to visualize pre/post data is to use a scatter plot relating the post-study data to the pre-study data. Here we do that for each treatment group.

xyplot(Postwt ~ Prewt|Treat, data = anorexia,
type=c("p", "smooth"),
span=.75)

[Figure: three-panel scatter plot (CBT, Cont, FT) of Postwt (vertical axis, 70 to 100) against Prewt (horizontal axis, 70 to 95), with a smooth curve in each panel.]

Now, we see a clear difference between the control group and the treatment groups.

For pre-study weights above 82 pounds, the control is not having a good effect, but the other therapies are. For pre-study weights below about 80-82 pounds, it does not seem to make a difference.

20
Another look at the anorexia data

We can reconstruct the dot plots now to take this new information into
account.

We can create a factor which separates the subjects with very low pre-study weight from the others as follows:

anorexia$lowPrewt <- factor(anorexia$Prewt < 82)


levels(anorexia$lowPrewt) <-
c("Higher Preweight", "Very Low Preweight")
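The same kind of two-level factor can be built in one step by giving factor() explicit levels and labels, which avoids relying on the default ordering of FALSE and TRUE. The weights below are hypothetical values chosen for illustration, not the anorexia data.

```r
# Hypothetical pre-study weights (pounds), for illustration only.
prewt <- c(70.5, 79.9, 82.0, 86.3, 94.1)

# FALSE (weight of 82 or more) maps to the first label,
# TRUE (weight below 82) maps to the second.
low <- factor(prewt < 82,
              levels = c(FALSE, TRUE),
              labels = c("Higher Preweight", "Very Low Preweight"))

print(data.frame(prewt, low))
print(table(low))
```

Note that a weight of exactly 82 is not below 82, so it lands in the "Higher Preweight" group.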

21
Another look at the anorexia data
The dot plots, conditional on whether the pre-study weight was very low
or not are constructed as follows:
dotplot(Treat ~ change|lowPrewt, data = anorexia,
xlab = "weight change", ylab="treatment")

[Figure: two-panel dot plot (Higher Preweight, Very Low Preweight) of treatment (CBT, Cont, FT) against weight change (horizontal axis, -10 to 20).]

Now, we see that for subjects with a very low pre-study weight, there are no differences, but for subjects
with a high enough pre-study weight, the therapies really appear to help, especially the Family Therapy.

22
Another example - how far do elastic bands travel?

Experiments were conducted using elastic bands which were stretched by a certain amount and then released.

The distance travelled was then measured.

The data from two such experiments are contained in elastic1 and
elastic2 in the DAAG package.

library(DAAG) # this library contains the data


names(elastic1) # elastic band experiment 1

## [1] "stretch" "distance"

names(elastic2) # experiment 2

## [1] "stretch" "distance"


23
Another example - how far do elastic bands travel?

We first create a single data frame that contains the data from both
experiments.

We will create a variable called expt which will indicate for which
experiment the measurement was taken.

elastic1$expt <- rep(1, length(elastic1$distance))


elastic2$expt <- rep(2, length(elastic2$distance))
elastic <- rbind(elastic1, elastic2)
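The stacking step can be tried on small made-up data frames. In this sketch the measurements, the names run1, run2 and both, and the panel labels are all hypothetical; the one variation from the slide is that the indicator is stored as a labelled factor, which is one way to get readable panel titles in a conditioning plot.

```r
# Hypothetical measurements, for illustration only.
run1 <- data.frame(stretch = c(40, 50, 60), distance = c(140, 180, 230))
run2 <- data.frame(stretch = c(35, 45, 55), distance = c(120, 160, 210))

# A single value is recycled to every row of the data frame.
run1$expt <- 1
run2$expt <- 2

# Stack the two experiments; the columns must have matching names.
both <- rbind(run1, run2)

# Optional: turn the indicator into a factor with descriptive labels.
both$expt <- factor(both$expt, labels = c("Experiment 1", "Experiment 2"))

print(both)
```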

24
Another example - how far do elastic bands travel?

We are interested in how the flight distance is related to the amount of stretching, but we want to make this comparison for each experiment:

xyplot(distance ~ stretch|expt, data=elastic,
type=c("p", "smooth"), span=.7)

[Figure: two-panel scatter plot (one panel per value of expt) of distance (vertical axis, 100 to 300) against stretch (horizontal axis, 30 to 60), with a smooth curve in each panel.]

In both experiments, we see that as the amount of stretch increases, the elastic band travels farther -
and the relationship is pretty close to linear.

25
Summary

When you have multiple variables and you are interested in predicting
one of them, using the other variables, a simple scatter plot might not
display all of the information in the data.

Conditional plots can be very useful in highlighting the relationships between two variables, taking other variables into account.

26
DATA SCIENCE 101

Predicting with Data

1
Predicting with Several Numeric Variables

In this part of today’s lecture, we will learn how to set up predictive models using more than one predictor variable.

We will see that we can still make predictions using the predict function.

We will learn about a form of leave-one-out cross-validation called PRESS.

2
Predicting brain weight from body weight and litter size

The data frame litters contains body and brain weights of 20 mice. The size of the litter in which each mouse was born is also recorded.

library(DAAG) # contains the litters data
head(litters)

##   lsize bodywt brainwt
## 1     3  9.447   0.444
## 2     3  9.780   0.436
## 3     4  9.155   0.417
## 4     4  9.613   0.429
## 5     5  8.850   0.425
## 6     5  9.610   0.434

3
Look at all pairwise relationships
pairs(litters, pch=16)

[Figure: scatter plot matrix of lsize, bodywt and brainwt for the litters data.]

It appears that brain weight increases with body weight, and it decreases with litter size.

4
Setting up the predictive model for brain weight

In order to find out how brain weight relates to both body weight and
litter size, we can use the following model:

brainwt = β0 + β1 × bodywt + β2 × lsize + ε

This is an example of a multiple regression model. It is a little more complicated to fit than a simple regression model, but the lm function still applies.

There is still a response variable brainwt on the left side of the model formula, but now there are two predictor variables bodywt and lsize on the right side of the model formula:

brainwt ~ bodywt + lsize

5
Fitting the model in R
litters.lm <- lm(brainwt ~ bodywt + lsize, data = litters)
coef(litters.lm)

## (Intercept)      bodywt       lsize
## 0.178246962 0.024306344 0.006690331

The fitted model is then

ŷ = 0.18 + 0.024 x1 + 0.0067 x2

where ŷ is the predicted brain weight, x1 is body weight and x2 is litter size.

Note that this fitted model says that for a fixed body weight, brain weight is actually higher for larger litters.

This is consistent with what is known as ‘brain sparing’: nutritional deprivation that results from large litter sizes has a proportionately smaller effect on brain weight than on body weight.

6
The brain sparing effect can be visualized - but only with effort

Our earlier visualization with the pairs plot did not reveal the brain sparing effect, but a conditional plot can. We need to condition on different levels of body weight to see this.

We will use the cut function to turn the numeric bodywt variable into a
factor with interval based categories.

Example of use of cut, where we find intervals (5, 6], (6, 7], ..., (9, 10]
which contain the different body weights:
cut(litters$bodywt, 5:10)

## [1] (9,10] (9,10] (9,10] (9,10] (8,9] (9,10] (8,9] (8,9] (7,8]
## [10] (8,9] (7,8] (7,8] (6,7] (7,8] (6,7] (6,7] (7,8] (6,7]
## [19] (5,6] (6,7]
## Levels: (5,6] (6,7] (7,8] (8,9] (9,10]

The output tells us that the first four body weights are in the interval (9, 10], the fifth is in (8, 9] and the sixth is again in (9, 10], which agrees with the tabular display on slide 3.
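A small sketch of how cut assigns values to intervals may help here. The weights in w are picked for illustration (two of them appear in the table on slide 3); the point is that, by default, each interval is open on the left and closed on the right, and values outside the break points become NA.

```r
# Illustrative weights; 9.000 sits exactly on a break point.
w <- c(9.447, 8.850, 9.000, 5.5)

# Breaks 5, 6, ..., 10 define the intervals (5,6], (6,7], ..., (9,10].
groups <- cut(w, breaks = 5:10)
print(groups)

# A value equal to a break point belongs to the interval it closes,
# so 9.000 falls in (8,9], not (9,10].
print(table(groups))

# Values outside the range of the breaks are returned as NA.
outside <- cut(c(4.9, 10.2), breaks = 5:10)
print(outside)
```

This is why the cutpoints chosen for a conditioning plot need to span the full range of the variable.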

7
The brain sparing effect can be visualized - but only with effort

We will choose cutpoints that divide the body weights into 6 approximately equal-sized groups.

cutpoints <- c(5, 6.3, 7, 7.3, 8.5, 9.4, 10)


cutpoints

## [1] 5.0 6.3 7.0 7.3 8.5 9.4 10.0

8
The brain sparing effect can be visualized - but only with effort

Then we plot brain weight against litter size for each of these groups with the xyplot:
xyplot(brainwt ~ lsize|cut(bodywt, cutpoints),
data = litters, type=c("p", "smooth"), span = 2)

[Figure: six-panel scatter plot of brainwt (vertical axis, 0.38 to 0.44) against lsize (horizontal axis, 4 to 12), one panel per body weight interval: (5,6.3], (6.3,7], (7,7.3], (7.3,8.5], (8.5,9.4] and (9.4,10], with a smooth curve in each panel.]

This plot shows that for relatively fixed values of body weight, the brain weight is somewhat more likely
to grow with litter size than to decrease.

9
The brain sparing effect can be visualized - but only with effort

As stated earlier, the brain sparing effect is completely hidden if we don’t try to condition on fixed values of body weight:
xyplot(brainwt ~ lsize, data = litters, type=c("p", "smooth"), span = 2)

[Figure: single scatter plot of brainwt (vertical axis, 0.38 to 0.44) against lsize (horizontal axis, 4 to 12), with a smooth curve.]

This plot hides the body weight effects.

10
The brain sparing effect can be visualized - but only with effort

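Here is another view:

xyplot(brainwt ~ lsize, data = litters, groups=cut(bodywt, cutpoints),
type=c("p", "smooth"), span = 2)

[Figure: single scatter plot of brainwt (vertical axis, 0.38 to 0.44) against lsize (horizontal axis, 4 to 12), with points and smooth curves coloured by body weight group.]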

This time, we use different colours, using the groups argument, to represent the points and smooths
corresponding to the different body weights. Now, we see that, often, for fixed body weight, brain weight
increases with litter size. The light green and blue points are exceptions.

11
Making predictions with the fitted model

If we have a mouse born in a litter of size 6 with a body weight of 8.5, we can predict the brain weight from the fitted model, using the unrounded coefficients:

ŷ = 0.17825 + 0.024306(8.5) + 0.0066903(6) ≈ 0.425.

We can do this automatically with the predict function:


predict(litters.lm, newdata =
data.frame(bodywt = 8.5, lsize = 6))

## 1
## 0.425
We can also obtain a 95% prediction interval:
predict(litters.lm, newdata =
data.frame(bodywt = 8.5, lsize = 6), interval="prediction")

## fit lwr upr


## 1 0.425 0.399 0.451
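To see that predict is doing the same arithmetic as the hand calculation, here is a sketch on R's built-in mtcars data (used only because it is always available; it is not the litters data). The model mpg ~ wt + hp and the new case are arbitrary choices for illustration.

```r
# Fit a two-predictor model to a built-in data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# A hypothetical new case: weight 3 (thousand lb), 120 horsepower.
new_car <- data.frame(wt = 3, hp = 120)

# Prediction by hand: intercept plus each coefficient times its predictor.
b <- coef(fit)
by_hand <- b["(Intercept)"] + b["wt"] * new_car$wt + b["hp"] * new_car$hp

# The same prediction from predict().
by_predict <- predict(fit, newdata = new_car)
print(c(by_hand = unname(by_hand), by_predict = unname(by_predict)))

# A 95% prediction interval for the new case.
pred_int <- predict(fit, newdata = new_car,
                    interval = "prediction", level = 0.95)
print(pred_int)
```

Working from the stored coefficients rather than rounded ones avoids the small discrepancies that rounding introduces.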
12
Leave-one-out Cross-Validation: PRESS

• PRESS residuals:
$e_{(i)} = y_i - \hat{y}_{(i)}$.
Here, $y_i$ is the ith observed response and $\hat{y}_{(i)}$ is the predicted value at the ith observation based on the regression of y against the xs, omitting the ith observation.

• PRedicted Error Sum of Squares:

$\mathrm{PRESS} = \sum_{i=1}^{n} e_{(i)}^{2}$
This gives an idea of how well a regression model can predict new
data. Small values of PRESS are desired.
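The definition can be turned into code directly. The sketch below computes PRESS in two ways on the built-in mtcars data: an explicit leave-one-out loop that follows the definition, and the standard shortcut for least-squares fits, where each PRESS residual equals the ordinary residual divided by one minus its leverage. The function names press_loop and press_hat are made up for this illustration and are not the PRESS function used in the lecture.

```r
# PRESS by the definition: refit the model n times, each time
# leaving out one observation and predicting it.
press_loop <- function(formula, data) {
  n <- nrow(data)
  e <- numeric(n)
  response <- all.vars(formula)[1]
  for (i in seq_len(n)) {
    fit_i <- lm(formula, data = data[-i, , drop = FALSE])
    y_hat_i <- predict(fit_i, newdata = data[i, , drop = FALSE])
    e[i] <- data[[response]][i] - y_hat_i
  }
  sum(e^2)
}

# PRESS from a single fit: e_(i) = e_i / (1 - h_ii),
# where h_ii is the leverage of observation i.
press_hat <- function(fit) {
  sum((residuals(fit) / (1 - hatvalues(fit)))^2)
}

f <- mpg ~ wt + hp
print(press_loop(f, mtcars))
print(press_hat(lm(f, data = mtcars)))
```

The two numbers agree, which is why PRESS can be computed without actually refitting the model n times.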

13
PRESS - litters example

# PRESS() is provided by the MPV package
> library(MPV)
# regression of brain weight against body weight and litter size:
> litters.lm <- lm(brainwt ~ bodywt + lsize, data = litters)
> PRESS(litters.lm)
[1] 0.0035
# same regression as above, but without the intercept term:
> litters.0 <- lm(brainwt ~ bodywt + lsize - 1, data = litters)
> PRESS(litters.0)
[1] 0.00482
# regression of brain weight against body weight only, with intercept:
> litters.1 <- lm(brainwt ~ bodywt, data = litters)
> PRESS(litters.1)
[1] 0.00385
# regression of brain weight against both variables plus an interaction term:
> litters.2 <- lm(brainwt ~ bodywt + lsize + lsize:bodywt, data = litters)
> PRESS(litters.2)
[1] 0.0037
# best predictor is the 1st model!

14
Example - winning football games

The data in table.b1 in the MPV package concern the number of games won y in a 14 game season, together with measurements on specific aspects of the game, such as the number of yards that the football was passed, and kicked, and so on. There are 9 variables like this, labeled x1 through x9.

library(MPV) # contains table.b1 - football example

We will use PRESS to help choose between some possible models.

15
Example - winning football games
We can fit the model that relates y to ALL of the x’s by using a dot (.):
all.lm <- lm(y ~ . , data = table.b1)
PRESS(all.lm) # calculate PRESS value for this full model

## [1] 145.9

Fit a model that only contains x2, x4, x7, x8 and x9:
five.lm <- lm(y ~ x2 + x4 + x7 + x8 + x9, data = table.b1)
PRESS(five.lm)

## [1] 97.13

Compare with a similar model that does not have x7:

four.lm <- lm(y ~ x2 + x4 + x8 + x9, data = table.b1)
PRESS(four.lm)

## [1] 119.2

16
Example - winning football games

Since the PRESS value is smallest for the five variable model, we would
prefer that one.
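The comparison of candidate models by PRESS can be organised as a small table. This sketch again uses the built-in mtcars data, with three arbitrary candidate formulas and a home-made press helper based on the leverage shortcut for least-squares fits; none of these names or models come from the football example.

```r
# PRESS for a fitted least-squares model, via the leverage shortcut.
press <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Candidate models, from small to large (arbitrary choices).
candidates <- list(
  wt_only  = mpg ~ wt,
  wt_hp    = mpg ~ wt + hp,
  all_vars = mpg ~ .
)

# Fit each candidate and record its PRESS value.
press_values <- sapply(candidates,
                       function(f) press(lm(f, data = mtcars)))
print(round(press_values, 1))

# The preferred model is the one with the smallest PRESS.
best <- names(which.min(press_values))
print(best)
```

Because PRESS measures error on held-out observations, adding more predictors does not automatically reduce it, unlike the ordinary residual sum of squares.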

We can partially visualize the effects of the various variables on the response y, using the conditioning plots and by use of the groups argument.

17
Example - winning football games

If we do not condition at all, but view y as a function of x2, we have

xyplot(y ~ x2, data = table.b1, pch=16, type=c("smooth", "p"))

[Figure: scatter plot of y (vertical axis, 0 to 10) against x2 (horizontal axis, 1500 to 3000), with a smooth curve.]

There is a lot of noise around our predictions.

18
Example - winning football games

Splitting x7 roughly in half to define the panels, and splitting x8 roughly in half as the group variable, we have

xyplot(y ~ x2|(x7 < 56), groups=(x8 < 2100), data = table.b1,
pch=16, type=c("smooth", "p"), span=2)

[Figure: two-panel scatter plot (panels FALSE and TRUE for x7 < 56) of y (vertical axis, 0 to 10) against x2 (horizontal axis, 1500 to 3000), with points and smooth curves grouped by x8 < 2100.]

By considering roughly fixed values of x7 and x8, we have more precise predictions of y based on x2.

19
Example - winning football games

We can predict the number of wins for a team with 2000 passing yards
x2, 60% field goal percentage x4, 80% rushing x7, 1900 opponent
rushing yards x8 and 1800 opponent passing yards x9:

predict(five.lm, newdata = data.frame(x2 = 2000, x4 = 60,


x7 = 80, x8 = 1900, x9 = 1800))

## 1
## 12.99

We would predict that this team would win 13 out of the 14 games.

20
Example - winning football games

A prediction interval can be obtained from

predict(five.lm, newdata = data.frame(x2 = 2000, x4 = 60,


x7 = 80, x8 = 1900, x9 = 1800), interval="prediction")

## fit lwr upr


## 1 12.99 7.669 18.31

The prediction interval is pretty wide (about 7.7 to 18.3), and since there are only 14 games, the part of the interval above 14 is impossible. We could be very confident that this team will win at least one-half of its games.
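One informal way to report such an interval is to clip it to the range of possible outcomes. The sketch below does this for the numbers printed above; clipping is just a presentational convenience assumed here for illustration, not a method from the lecture, and it does not fix the underlying issue that a linear model ignores the 0 to 14 limits.

```r
n_games <- 14

# Fitted value and interval limits as reported in the output above.
interval <- c(fit = 12.99, lwr = 7.669, upr = 18.31)

# Clip to the feasible range [0, n_games].
clipped <- pmin(pmax(interval, 0), n_games)
print(clipped)
```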

21
Summary

The lm() function can be used to fit predictive models where there are several predictor variables.

The PRESS statistic, a form of leave-one-out cross-validation, can be used to decide which model is preferred - small PRESS values are desirable.

Predictions and prediction intervals can be computed for fitted models using the predict function.

22
DATA SCIENCE 101

Predicting with Data

Term 1, 2021W

1
Predicting with regression trees

Consider the type of data that might be used by car insurance companies:

source("driverhistory.R")

head(driverhistory, n=6)

##   Sex Age CarAccident
## 1   M  19        TRUE
## 2   M  31       FALSE
## 3   F  20       FALSE
## 4   F  40       FALSE
## 5   M  41       FALSE
## 6   M  21        TRUE

This is a very small data set, used to illustrate the basic tree idea. Normally, a database with thousands of drivers would be used.

2
Predicting with regression trees

We can plot age against sex, with pink for the accident cases and blue for non-accident cases:

[Figure: plot of Age (vertical axis, 20 to 60) for each sex (F, M), with accident cases in pink and non-accident cases in blue.]

Not all males have accidents, and not all young drivers have accidents.

But ...

3
Predicting with regression trees

If we split the data set between the sexes and then divide the males at an
age of around 25, we can separate most of the accident cases from the
non-accident cases.

For a new case, we could then make a prediction. Say, a 40 year old
female, would be predicted to not have an accident, while a 20 year old
male would be predicted to have an accident.

library(rpart)
driver.rpart <- rpart(CarAccident ~ Age + Sex,
data = driverhistory)

Here, we are using the rpart function in the rpart package. Note that its
syntax is similar to lm.
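Since driverhistory.R is not available here, the following sketch fits the same kind of tree to simulated data. Everything about the simulation is an assumption made for illustration: the sample size, the age range, and the accident probabilities (0.8 for males under 25, 0.1 otherwise). The response is stored as 0/1 so that the value in each leaf is plainly the proportion of accidents in that group.

```r
library(rpart)

set.seed(101)
n <- 400

# Simulated drivers (hypothetical data, not the driverhistory data set).
drivers <- data.frame(
  Sex = factor(sample(c("F", "M"), n, replace = TRUE)),
  Age = sample(18:65, n, replace = TRUE)
)
p_accident <- ifelse(drivers$Sex == "M" & drivers$Age < 25, 0.8, 0.1)
drivers$CarAccident <- as.numeric(runif(n) < p_accident)

# With a numeric response, rpart fits a regression tree by default.
tree <- rpart(CarAccident ~ Age + Sex, data = drivers)
print(tree)

# Predicted accident proportions for a 40 year old female
# and a 20 year old male.
new_drivers <- data.frame(
  Sex = factor(c("F", "M"), levels = c("F", "M")),
  Age = c(40, 20)
)
pred <- predict(tree, newdata = new_drivers)
print(pred)
```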

4
Predicting with regression trees

The output is in the form of a tree.

plot(driver.rpart)
text(driver.rpart)

[Figure: regression tree with root split Sex=a, followed by the splits Age>=20.5 and Age>=26.5; the leaf values are 0, 0.1429, 0.1538 and 0.9091.]

Let’s read the tree for a 40 year old female, using the rule that if the statement at the top of a split is true, we take the left branch. By default, text() abbreviates factor levels to letters in level order, so “a” is the first level of Sex, which is Female, and we move left. 40 is greater than 20.5, so we move left. The probability of an accident is very low.
5
Predicting with regression trees

What is the probability of an accident for a 40 year old male?

How about a 20 year old female?

And an 18 year old male?

6
Predicting with classification trees

The earlier tree was called a regression tree. Now, we will construct a
classification tree, using method = "class".

library(rpart)
driver.rpart <- rpart(CarAccident ~ Age + Sex,
data = driverhistory, method="class")

7
Predicting with classification trees

The output is again in the form of a tree.

plot(driver.rpart)
text(driver.rpart)

[Figure: classification tree with root split Sex=a and a second split Age>=26.5; the leaves are labelled FALSE, FALSE and TRUE.]

Let’s read the tree for a 40 year old female, using the rule that if the
statement at the top of a split is true, we take the left branch. We predict
this driver to not have an accident.
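Rather than reading the tree by eye, predictions for new drivers can be obtained with predict. The sketch below uses simulated data again (hypothetical sample size and accident probabilities, not the driverhistory data) and stores the response as a factor with levels "no" and "yes". The comment at the end mentions text(..., pretty = 0), which, as far as I know, prints full factor level names instead of the letters a, b, and so on.

```r
library(rpart)

set.seed(101)
n <- 400

# Simulated drivers (hypothetical data, for illustration only).
drivers <- data.frame(
  Sex = factor(sample(c("F", "M"), n, replace = TRUE)),
  Age = sample(18:65, n, replace = TRUE)
)
p_accident <- ifelse(drivers$Sex == "M" & drivers$Age < 25, 0.8, 0.1)
drivers$CarAccident <- factor(ifelse(runif(n) < p_accident, "yes", "no"))

# Classification tree for the yes/no response.
ctree <- rpart(CarAccident ~ Age + Sex, data = drivers, method = "class")

new_drivers <- data.frame(
  Sex = factor(c("F", "M"), levels = c("F", "M")),
  Age = c(40, 20)
)

# Predicted class, and the class proportions behind it.
pred_class <- predict(ctree, newdata = new_drivers, type = "class")
pred_prob  <- predict(ctree, newdata = new_drivers, type = "prob")
print(pred_class)
print(pred_prob)

# To draw the tree with full level names on the splits:
# plot(ctree); text(ctree, pretty = 0)
```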

8
Predicting with classification trees

What is the prediction for a 40 year old male?

How about a 20 year old female?

And an 18 year old male?

9
Predicting Spam with classification trees

George Forman of Hewlett-Packard Labs collected 4601 email items, of which 1813 items were spam.

There are 57 explanatory variables.

There is one response variable (yes/no) which is 0 (no) for non-spam and 1 (yes) for spam.

10
Predicting Spam with classification trees
6 of the 57 explanatory variables are contained in the data frame spam7 in the DAAG package:

crl.tot, total length of words that are in capitals.

dollar, frequency of the $ symbol, as a percentage of all characters.

bang, frequency of the ! symbol, as a percentage of all characters.

money, frequency of the word ‘money’ as a percentage of all words.

n000, frequency of the character string ‘000’ as a percentage of all words.

make, frequency of the word ‘make’, as a percentage of all words.

11
Predicting Spam with classification trees

yesno is the variable that indicates spam (y) or not (n).

We can see how well the variables might predict spam with side-by-side box plots, such as
bwplot(yesno ~ crl.tot, data = spam7)

[Figure: side-by-side box plots of crl.tot (horizontal axis, 0 to 15000) for yesno = n and yesno = y.]

More capital letters in the message are an indicator of spam.


12
Predicting with classification trees

We can use the classification tree again.

spam.tree <- rpart(formula = yesno ~ crl.tot + dollar +
bang + money + n000 + make, data = spam7,
method = "class")

13
Predicting with classification trees

plot(spam.tree)
text(spam.tree)

[Figure: classification tree with splits dollar< 0.0555, bang< 0.0915, crl.tot< 85.5, bang< 0.7735 and crl.tot< 17; the leaves are labelled n or y.]

The resulting tree tells us how to classify an email message as spam depending on its numbers of dollar signs and capital letters, and so on.
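A natural follow-up is to ask how often a classification tree gets the label right. The sketch below shows one common way to check, a confusion table of predicted against actual classes. It uses the kyphosis data that ships with rpart, as a stand-in because spam7 needs the DAAG package; the variable names in the formula belong to that data set, not to the spam example.

```r
library(rpart)

# Fit a classification tree to the kyphosis data bundled with rpart.
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class")

# Predicted class for each case in the data used to build the tree.
pred <- predict(fit, type = "class")

# Rows: predicted class; columns: actual class.
confusion <- table(predicted = pred, actual = kyphosis$Kyphosis)
print(confusion)

# Proportion of cases on the diagonal (correctly classified).
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

Accuracy measured on the same data that built the tree tends to be optimistic, so a fairer check would hold out some cases or use cross-validation.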
14
