Data 101 Complete PDF
Data analysis case study - data science in the real world
Data analysis case study — data collection
The study included twenty participants who were split into 2 treatment
groups. Ten envelopes contained instructions for the participant to
commence with their DH needle and NDH probe and ten for the opposite
hand arrangement.
Once the participant demonstrated the needle tip to be within the target
in the phantom jelly, the task was complete and the opposite hand
arrangement was started.
• The log of the mean of the counts is related to the covariates, order
and group. That is,
• R is among the most widely used programming languages for data
analysis.
• R is open source.
• R has more than 9000 packages to use. A package is like an app
that we install and use.
We will start with basic programming. We will teach you R, but we will
try not to just teach you R. We will emphasize those things that are
common to many computing platforms and are important to beginning
data scientists.
There is nothing wrong with making errors when learning a programming
language like R.
Try out the code embedded into these slides and experiment with new
variations to discover how the system will respond.
1 https://www.r-project.org/Licenses/GPL-2
Downloading and installing R and RStudio
RStudio is also very popular. You can download the “Open Source
Edition” of “RStudio Desktop” from http://www.rstudio.com/, and
follow the instructions to install it on your computer.
Executing commands in R
Often, you will type in commands such as this into a script window, as in
RStudio, for later execution, through hitting the “Run” button, “ctrl-R” or
another related keystroke sequence.*
* Check out the short ScriptRStudio video for a quick example.
For example, the data set or data frame called women contains
information on heights and weights of American women:
> women
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
## 7 64 132
## 8 65 135
## 9 66 139
## 10 67 142
## 11 68 146
## 12 69 150
## 13 70 154
## 14 71 159
## 15 72 164
Polling questions - Yes or No?
Polling question - Answer
Yes!
745 - 238
## [1] 507
Polling questions - Yes or No?
## [1] Inf
Polling question - Yes or No?
I will get an error message, because it will only work if I supply it with
data.
Polling questions - Answer
No!
mean
mean is an object, so if we type its name, its contents are printed to the
screen.
Polling question - Yes or No?
Polling questions - Answer
Yes!
F2C
## function(x) (x - 32)*5/9
## [1] 30
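The printed definition suggests how F2C could have been created. The sketch below assumes the call that produced 30 was F2C(86); the slide does not show the input, so treat that value as an illustration only.

```r
# Convert a temperature from degrees Fahrenheit to degrees Celsius.
F2C <- function(x) (x - 32) * 5 / 9

celsius <- F2C(86)   # assumed input: 86 degrees Fahrenheit
celsius

F2C                  # typing the name without brackets prints the definition
```

Calling a function needs round brackets around its arguments; typing only the name prints the object's contents, just as with mean.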
What else to expect in this course
2. You will learn some things about data: organizing it, summarizing it,
and cleaning it.
What else to expect in this lecture
2. Doing arithmetic calculations and storing the results for future use.
R can be used as a calculator
12*11 # multiplication
## [1] 132
22*22
## [1] 484
125/25 # division
## [1] 5
3^4
## [1] 81
9^(1/2)
## [1] 3
or
sqrt(9)
## [1] 3
Polling question - Yes or No?
1000^(1/3)
Polling question - Answer
Yes! The cube root of a number is the 1/3 power of that number (or the
value that, when raised to the 3rd power, gives the original number).
10^3 = 1000 and
1000^(1/3)
## [1] 10
Polling question - Yes or No?
A = πr^2
and R has stored a value of π as
pi
## [1] 3.141593
pi*12^2
Polling question - Answer
Yes!
pi*12^2
## [1] 452.3893
Calculations in R
You can control the number of digits in the output with the options()
function.
This is useful when reporting final results such as means and standard
deviations, since including excessive numbers of digits can give a
misleading impression of the accuracy in your results.
583/31
## [1] 18.80645
Compare with
options(digits=3)
583/31
## [1] 18.8
options(digits = 18)
1111111*1111111
## [1] 1234567654321
11111111*11111111
## [1] 123456787654321
111111111*111111111
## [1] 12345678987654320
The error in the final calculation is due to the way R stores information
about numbers. There are around 17 digits of numeric storage available.
More Calculations with R
## [1] 5
## [1] 2
## [1] 17
Polling question - Yes or No?
Polling question - Answer
No!
17%/%3
## [1] 5
## [1] 2
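The two outputs above are consistent with integer division and remainder. A short sketch of both operators follows; the variable names quotient and remainder are chosen here for illustration and do not appear in the slides.

```r
quotient  <- 17 %/% 3   # integer division: how many whole 3s fit into 17
remainder <- 17 %% 3    # what is left over after that division

quotient
remainder

# quotient * divisor + remainder recovers the original number
quotient * 3 + remainder
```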
What are the numbers in square brackets?
The following example displays the data in rivers, lengths of 141 North
American rivers (in miles). The second line starts with the 12th value,
and the third line starts with the 23rd value, and so on.
options(width=60)
rivers
## [1] 735 320 325 392 524 450 1459 135 465 600 330
## [12] 336 280 315 870 906 202 329 290 1000 600 505
## [23] 1450 840 1243 890 350 407 286 280 525 720 390
## [34] 250 327 230 265 850 210 630 260 230 360 730
## [45] 600 306 390 420 291 710 340 217 281 352 259
## [56] 250 470 680 570 350 300 560 900 625 332 2348
## [67] 1171 3710 2315 2533 780 280 410 460 260 255 431
## [78] 350 760 618 338 981 1306 500 696 605 250 411
## [89] 1054 735 233 435 490 310 460 383 375 1270 545
## [100] 445 1885 380 300 380 377 425 276 210 800 420
## [111] 350 360 538 1100 1205 314 237 610 360 540 1038
## [122] 424 310 300 444 301 268 620 215 652 900 525
## [133] 246 360 529 500 720 270 430 671 1770
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 4 5 6 7 8 9 10 11 12 13
Simple Number Patterns
## [1] -2 -1 0 1 2 3 4 5 6 7
(1:10)*3
## [1] 3 6 9 12 15 18 21 24 27 30
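Some of the commands behind these patterns are not shown on the slides. Assuming they were built from the sequence 1:10, calls like the following would reproduce the printed outputs; each arithmetic operation is applied to every element of the sequence.

```r
x <- 1:10      # the integers 1 through 10

x              # 1 2 ... 10
x + 3          # add 3 to every element
x - 3          # subtract 3 from every element
x * 3          # multiply every element by 3
```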
More patterns
(1:10)^2
## [1] 1 4 9 16 25 36 49 64 81 100
(1:10)^3
Polling question - Yes or No?
Polling question - Answer
Yes!
(1:10)%/%3
## [1] 0 0 1 1 1 2 2 2 3 3
Polling question - Yes or No?
Polling question - Answer
Yes!
4:7
## [1] 4 5 6 7
Polling question - Yes or No?
The command
7:4
Polling question - Answer
No!
7:4
## [1] 7 6 5 4
Named storage
This environment is where you can begin to store the results of your
work.
For example, you might want to keep track of some calculations, or you
might have invented a new function to solve some kind of problem.
To store your output, you need to provide names for each object that you
want to save.
Named storage: Example
Note that no output appears. You can see the results of this assignment
by typing
riversKm
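The assignment itself is missing from the slide. A minimal sketch, assuming the river lengths were converted from miles to kilometres with a factor of about 1.609 (the factor is an assumption; only the name riversKm comes from the slides):

```r
# rivers is a built-in vector of river lengths in miles.
# Assumption: 1 mile is approximately 1.609 kilometres.
riversKm <- rivers * 1.609   # an assignment prints nothing

riversKm[1:3]                # look at the first three stored values
```

The arrow <- stores the result under the name on its left, so the converted lengths can be reused later without recomputing them.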
Polling question - Yes or No?
There are 5280 feet in one mile. That means we can convert the lengths
of the rivers from miles to feet by multiplying each value by 5280. Does
the following code assign these lengths to the object riversFeet?
Polling question - Answer
Quitting R
If you then hit the Enter key, you will be asked whether to save an image
of the current workspace, or not, or to cancel.
Functions
For example, we saw that to quit R we type q(). This tells R to call the
function named q.
The brackets surround the argument list, which in this case contains
nothing: we just want R to quit, and do not need to tell it how.
q is a function
Default Values of Parameters
What happens when we execute q() is that R calls the q function with
the arguments set to their default values.
Changing from the Defaults
To change from the default values, specify them in the function call.
For example,
If we had given two arguments without names, they would apply to save
and status.
If we want to accept the defaults of the early parameters but change later
ones, we give the name when calling the function, e.g.
q(runLast = FALSE)
DATA SCIENCE 101
Shabnam Fani
Winter 2021
Named storage: Graphics Example
15% of the time was spent on sports, 10% on game shows, 30% on
movies, and 45% on comedies. Set up a pie chart and a bar chart.
The pie() function can then be used to create the pie chart.
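The step that stores the percentages is not shown. One plausible way to set up the tv object, using a named numeric vector so that the chart labels come from the names (the exact construction used in the lecture is an assumption):

```r
# Percentages of viewing time from the example, named by category.
tv <- c(sports = 15, "game shows" = 10, movies = 30, comedies = 45)

tv
sum(tv)   # sanity check: the percentages should add up to 100
```

A name containing a space, such as "game shows", has to be quoted inside c().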
pie(tv)
[Figure: pie chart of tv with slices labelled sports, game shows, movies and comedies]
Named storage: Bar Chart Example
barplot(tv)
[Figure: bar chart of tv; vertical axis running from 0 to 40]
Polling question - Yes or No?
My friend has baked 3 apple pies, 4 blueberry pies and 7 cherry pies.
Create a pie chart for these data. The first step is:
Polling question - Answer
Polling question - Yes or No?
The second step to create the pie chart for my friend’s pie data is
pie(pies)
Polling question - Answer
Yes!
pie(pies)
[Figure: pie chart of pies with slices labelled apple, blueberry and cherry]
R is case-sensitive
MAX(rivers)
Now try
max(rivers)
## [1] 3710
If you really want a function called MAX to do the work of max, you would
type
MAX <- max
MAX(rivers)
## [1] 3710
Listing the objects in the workspace
A list of all objects in the current workspace can be printed to the screen
using the objects() function:
objects()
Suppose you are measuring a length with a ruler that gives you
accuracy to the nearest millimeter.
When you measure the length of your pencil as 273 millimeters, the truth
could be anywhere between 272.5 and 273.5 millimeters.
The uniform distribution on the interval [−.5, .5] provides a model for the
error in your measurement, and we can simulate values from this
distribution using the runif() function.
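As a rough illustration of the measurement-error model, the sketch below draws five simulated errors. The seed and the sample size of 5 are arbitrary choices made here, not values from the lecture.

```r
set.seed(1)   # arbitrary seed, fixed only so the run is repeatable

# Five simulated measurement errors, uniform on the interval [-0.5, 0.5].
measurementError <- runif(5, min = -0.5, max = 0.5)
measurementError

range(measurementError)   # every value lies between -0.5 and 0.5
```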
Packages
library(DAAG)
If you get a warning that the package can’t be found, then the package
doesn’t exist on your computer, but it can likely be installed. Try
install.packages("DAAG")
Type in the name of the package you are requesting, and click “Install”:
> seedrates
Error: object 'seedrates' not found
## rate grain
## 1 50 21.2
## 2 75 19.9
## 3 100 19.2
## 4 125 18.4
## 5 150 17.9
Polling question - Yes or No?
install.packages(MPV)
Polling question - Answer
install.packages("MPV")
Using one object from a package at a time
Polling question - Yes or No?
Polling question - Answer
Yes!
plantingData
## rate grain
## 1 50 21.2
## 2 75 19.9
## 3 100 19.2
## 4 125 18.4
## 5 150 17.9
The contents of plantingData are the same as the contents of seedrates.
Packages
You might want to know which packages are loaded into your system
already.
This list also indicates the search order: a package can only contain one
function of any given name, but the same name may be used in another
package.
When you use that function, R will choose it from the first package in the
search list.
stats::median(x)
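The package::name form on the last line asks for a function from a named package regardless of the search order. A small sketch, in which the vector x is made up for illustration:

```r
x <- c(3, 1, 4, 1, 5)   # hypothetical data

stats::median(x)   # explicitly use median() from the stats package
median(x)          # found by walking the search list; normally the same function

search()           # the attached packages and environments, in search order
```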
If you try to use one that isn’t already there, you will receive an error
message:
library(notInstalled)
This means that the package doesn’t exist on your computer, but it
might be available in a repository online.
install.packages("knitr")
or, within RStudio, click on the Packages tab in the Output Pane,
choose Install, and enter the name in the resulting dialog box.
If you can’t get help from someone with more experience, you can get
information from the CRAN task views at
https://cloud.r-project.org/web/views.
Built-in help pages
There is an online help facility that can help you to see what a particular
function is supposed to do.
If you know the name of the function that you need help with, the
help() function is likely sufficient.
or
help(q)
or just hit the F1 key while pointing at q in RStudio. Any of these will
open a help page containing a description of the function for quitting R.
Data frames
Most data sets are stored in R as data frames, such as the women object
we encountered earlier.
Data frames are like matrices, but where the columns have their own
names. You can obtain information about a built-in data frame by using
the help() function. For example, observe the outcome of typing
help(women).
This data set gives the average heights and weights for American
women aged 30-39.
Usage:
women
Format:
A data frame with 15 observations on 2 variables.
Details:
The data set appears to have been taken from the American Society
> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
Data frames
## height weight
## Min. :58.0 Min. :115.0
## 1st Qu.:61.5 1st Qu.:124.5
## Median :65.0 Median :135.0
## Mean :65.0 Mean :136.7
## 3rd Qu.:68.5 3rd Qu.:148.0
## Max. :72.0 Max. :164.0
Polling question - Yes or No?
Is the first quartile (1st Qu.) the value which is at least as large as 25%
of the measurements?
Polling question - Answer
Yes!
The first quartile is the same as the 25th percentile - the value which
divides the lower 25 percent of the data from the upper 75 percent.
Similarly, the third quartile is the 75th percentile, and the median is the
50th percentile; the median is the middle value of a sorted collection of
measurements.
Polling question - Yes or No?
We can count the number of rows in the women data frame using
nrow(women)
Polling question - Answer
Yes!
nrow(women)
## [1] 15
Reading data into a data frame from an external file
If you have prepared the data set yourself, you could simply type it into a
text file, for example called file1.txt, perhaps with a header indicating
column names, and where you use blank spaces to separate the data
entries.
The read.table() function will read in the data for you as follows:
The object mydata now contains the data read in from the external file.
You could use any name that you wish in place of mydata, as long as the
first element of its name is an alphabetic character.
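The read.table() call is not shown on the slide. The sketch below first writes a small made-up file to a temporary folder (standing in for a file1.txt you prepared yourself) and then reads it back; the file contents are hypothetical.

```r
# Hypothetical data: a header line, then blank-separated entries.
path <- file.path(tempdir(), "file1.txt")
writeLines(c("height weight",
             "58 115",
             "59 117",
             "60 120"), path)

# header = TRUE tells R that the first line holds the column names.
mydata <- read.table(path, header = TRUE)
mydata
```

With a file in your working directory you would pass its name directly, for example read.table("file1.txt", header = TRUE).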
and there is no header row, as in the file wx_l3_2006.txt, you would type:
wx1 <- read.table("wx_l3_2006.txt", header=FALSE, sep=",")
If possible, export it as a .csv file and use something like the following
to read it in.
If you cannot export to .csv, you can leave it as .xlsx and use the
read.xlsx() command in the xlsx package.
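For the .csv route, a minimal sketch using read.csv() is given here. The file name and its contents are invented for illustration, and the file is written to a temporary folder so that the example is self-contained.

```r
# Hypothetical spreadsheet export: comma-separated values with a header row.
csvPath <- file.path(tempdir(), "spreadsheet.csv")
writeLines(c("rate,grain",
             "50,21.2",
             "75,19.9"), csvPath)

# read.csv() assumes a header row and comma separators by default.
sheet <- read.csv(csvPath)
sheet
```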
DATA SCIENCE 101
Data input and output
This directory (or folder) contains the files that you read in to R and write
out from R.
Changing working directories
In the RStudio Files tab of the output pane you can navigate to the
directory where you want to work, and choose
Set As Working Directory from the More menu item.
Alternatively you can run the R function setwd(). For example, to work
with data in the folder mydata on the C: drive, run
setwd("c:/mydata") # or setwd("c:\\mydata")
After running this command, all data input and output will default to the
mydata folder in the C: drive.
Because other systems use a forward slash “/” in their folder names,
and because doubling the backslash is tedious in Windows, R accepts
either form.
dump() and source()
To save all of the objects that you have created during a session, type
dump(list = objects(), "all.R")
This produces a file called all.R on your computer’s hard drive. Using
source("all.R") at a later time will allow you to retrieve all of these
objects.
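A small round trip may make this concrete. The sketch dumps two made-up objects rather than the whole workspace, and writes to a temporary folder so that no existing all.R is overwritten; both choices are assumptions made for the example.

```r
a <- 1:3
b <- "hello"

dumpFile <- file.path(tempdir(), "all.R")   # temporary location for the dump file
dump(list = c("a", "b"), file = dumpFile)   # write R code that recreates a and b

rm(a, b)            # remove the objects, as if a new session had started
source(dumpFile)    # run the saved code to bring them back

a
b
```

The dump file is plain R code, so it can be opened and read in any text editor.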
Example
Saving and retrieving image files
The vectors and other objects created during an R session are stored in
the workspace known as the global environment.
save.image("temp.RData")
Vectors in R
Numeric Vectors
length(rivers)
## [1] 141
In the previous lecture, we learned that we can view the entire contents
of an object by typing its name. Let’s do that one more time:
rivers
## [1] 735 320 325 392 524 450 1459 135 465 600 330
## [12] 336 280 315 870 906 202 329 290 1000 600 505
## [23] 1450 840 1243 890 350 407 286 280 525 720 390
## [34] 250 327 230 265 850 210 630 260 230 360 730
## [45] 600 306 390 420 291 710 340 217 281 352 259
## [56] 250 470 680 570 350 300 560 900 625 332 2348
## [67] 1171 3710 2315 2533 780 280 410 460 260 255 431
## [78] 350 760 618 338 981 1306 500 696 605 250 411
## [89] 1054 735 233 435 490 310 460 383 375 1270 545
## [100] 445 1885 380 300 380 377 425 276 210 800 420
## [111] 350 360 538 1100 1205 314 237 610 360 540 1038
## [122] 424 310 300 444 301 268 620 215 652 900 525
## [133] 246 360 529 500 720 270 430 671 1770
Extracting elements from vectors
You can extract this element from the rivers vector using the value 35
inside square brackets:
rivers[35]
## [1] 327
Extracting elements from vectors: Example
## [1] 49.4
Polling question - Yes or No?
(a) length(nhtemp)
(b) length[nhtemp]
Polling question - Answer
Yes!
length(nhtemp)
## [1] 60
Polling question - Yes or No?
(a) nhtemp(57)
(b) nhtemp[57]
Polling question - Answer
No!
nhtemp(57)
You should use square brackets, not round brackets, when extracting an
element from a vector. Here, R is telling you that nhtemp is not a
function, because the round brackets made it think you wanted to call
nhtemp as a function. (You don’t.)
## [1] 51.9
Building your own numeric vectors
The c() function is used to collect things together into a vector. We can
create a vector called myvector which contains some random data:
myvector <- c(2.5, 5, 0, 0.7, -8)
Polling question: Yes or no?
I want to assign the first 3 prime numbers to the object prime3. Does
the following work?
Polling question: Answer
Vectors of Sequences
## [1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Putting Vectors Together
For example, watch what happens when we combine the existing object
numbers5to20 with the numbers 31 through 35:
c(numbers5to20, 31:35)
## [1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 31 32
## [19] 33 34 35
If you type this in the R console (not in the RStudio Source Pane), R will
prompt you with a + sign for the second line of input. RStudio doesn’t
add the prompt, but it will indent the second line.
In both cases you are being told that the first line is incomplete: you
have an open bracket which must be followed by a closing bracket in
order to complete the command.
Also, don’t forget to include all the commas where they are needed.*
rivers[5:9]
* Here is an example of the use of the colon (:) to create increasing sequences of integers:
5:9 gives the sequence {5, 6, 7, 8, 9}.
Polling question - Yes or No?
Polling question - Answer
No!
rivers[3:2]
Remember that, when using the : operator, the numbers decrease if the
first value is larger than the second.
Extracting multiple elements from vectors
For example, we can create river23 which will contain only the second
and third elements of rivers:
river23
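The assignment that creates river23 is not shown. One way it could be written, using a vector of indices inside the square brackets:

```r
# Keep only the second and third elements of the built-in rivers vector.
river23 <- rivers[c(2, 3)]   # rivers[2:3] selects the same elements

river23
```

Any vector of positive whole numbers can be used as an index in this way, not just a consecutive run.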
## [1] 49.9 52.3 49.4 51.1 47.9 49.8 50.9 51.9 49.6 49.3 50.6
## [12] 48.4 50.7 50.9 50.6 51.5 52.8 51.8 51.1 49.8 50.2 50.4
## [23] 51.6 51.8 50.9 48.8 51.7 51.0 50.6 51.7 51.5 52.1 51.3
## [34] 51.0 54.0 51.4 52.7 53.1 54.6 52.0 52.0 50.9 52.6 50.2
## [45] 52.6 51.6 51.9 50.5 50.9 51.7 51.4 51.7 50.8 51.9 51.8
## [56] 51.9 53.0
[1] 49.9 52.3 49.4 51.1 49.4 47.9 49.8 50.9 49.3 51.9 50.8 49.6 49.3 50.6 48.4
[16] 50.7 50.9 50.6 51.5 52.8 51.8 51.1 49.8 50.2 50.4 51.6 51.8 50.9 48.8 51.7
[31] 51.0 50.6 51.7 51.5 52.1 51.3 51.0 54.0 51.4 52.7 53.1 54.6 52.0 52.0 50.9
[46] 52.6 50.2 52.6 51.6 51.9 50.5 50.9 51.7 51.4 51.7 50.8 51.9 51.8 51.9 53.0
## [1] 735 320 325 338 981 1306 500 696 605 250 411
## [12] 1054 735 233 435 490 310 460 383 375 1270 545
## [23] 445 1885 380 300 380 377 425 276 210 800 420
## [34] 350 360 538 1100 1205 314 237 610 360 540 1038
## [45] 424 310 300 444 301 268 620 215 652 900 525
## [56] 246 360 529 500 720 270 430 671 1770
Extracting elements from vectors: 0 indices
This is not something that one would usually type, but it may be useful
in more complicated expressions.
For example,
nhtemp[c(1:3, 0, 5)]
The result is just the first 3 elements plus the 5th element of nhtemp.
Polling question - Yes or No?
x <- 1:10
x[0:5]
is
(a) 0 1 2 3 4 5
(b) 1 2 3 4 5
Polling question - Answer
No!
x <- 1:10
x[0:5]
## [1] 1 2 3 4 5
The 0 index is ignored and only the 1, 2, 3, 4 and 5 indices are used.
Extracting elements from vectors - positives and negatives
Do not mix positive and negative indices. To see what happens, observe
nhtemp[c(-2, 3)]
Extracting elements from vectors - Fractional Indices
Always be careful to make sure that vector indices are integers. When
fractional values are used, they will be truncated towards 0. Thus 0.6
becomes 0, as in
nhtemp[0.6]
## numeric(0)
DATA SCIENCE 101
Extracting elements from vectors - logical subsetting
What if you want to see only the values of rivers that are larger than
some number, such as 2000?
There is a simple way to do this, with the square bracket and the greater-
than (>) sign:
Extracting elements from vectors - which ones?
To see which values of rivers are larger than 2000, use the which()
function:
## [1] 66 68 69 70
This tells us that we could extract the values that are larger than 2000,
using the indices 66, 68, 69, 70.
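The commands for these two slides are missing from the extracted text. The following sketch shows one way to obtain both the values and their positions; the names longRivers and longIndex are introduced here for illustration.

```r
# A comparison on a vector gives TRUE or FALSE for every element; using that
# logical vector inside square brackets keeps only the TRUE positions.
longRivers <- rivers[rivers > 2000]
longRivers

# which() reports the positions of the TRUE values instead of the values.
longIndex <- which(rivers > 2000)
longIndex

rivers[longIndex]   # the same values, extracted by position
```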
Polling question: Yes or no?
I want to know which elements of rivers are less than 250. Does the
following work?
Polling question: Answer
Yes!
Vector arithmetic
river23/5
## [1] 64 65
y <- river23 - 5
y
For another example, consider taking the square root of the elements of
river23:
sqrt(river23)
The above examples show how a binary arithmetic operator can be used
with vectors and constants.
For example, we can compute $y_i^{x_i}$, for i = 1, 2, 3, i.e. ($y_1^{x_1}$, $y_2^{x_2}$, $y_3^{x_3}$),
where y = [2 4 5] and x = [5 2 3] as follows:
y <- c(2, 4, 5)
x <- c(5, 2, 3)
y^x
## [1] 32 16 125
When the vectors are different lengths, the shorter one is extended by
recycling: values are repeated, starting at the beginning.
For example, to see the pattern of the numbers 1 to 10 after adding 2 and
3, we need only give the 2:3 vector once:
c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,
10, 10) + 2:3
## [1] 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13
R will give a warning if the length of the longer vector is not a multiple of
the length of the smaller one, because that is often a symptom of an error
in the code. For example, if we wanted to add 2, 3 and 4 to the numbers 1
through 10, this is the wrong way to do it:
c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10) + 2:4
## Warning in c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,
10, 10) + : longer object length is not a multiple of shorter object
length
## [1] 3 4 6 4 6 7 6 7 9 7 9 10 9 10 12 10 12 13 12 13
Polling question: Yes or no?
Polling question: Answer
No!
(1:10) + (1:5)
## [1] 2 4 6 8 10 7 9 11 13 15
The numbers 1 through 5 are first added to the numbers 1 through 5, and
then the numbers 6 through 10 are added to the numbers 1 through 5.
Simple patterned vectors
We have seen the use of the : operator for producing simple sequences
of integers.
Patterned vectors can also be produced using the seq() function as well
as the rep() function. For example, the sequence of odd numbers less
than or equal to 21 can be obtained using
seq(1, 21, by = 2)
## [1] 1 3 5 7 9 11 13 15 17 19 21
Notice the use of by = 2 here. The seq() function has several optional
parameters, including one named by. If by is not specified, the default
value of 1 will be used.
Repeated patterns are obtained using rep(). Consider the following
examples:
rep(3, 12) # repeat the value 3, 12 times
## [1] 3 3 3 3 3 3 3 3 3 3 3 3
## [1] 2 4 6 8 10 12 14 16 18 20 2 4 6 8 10 12 14 16 18 20
Polling question - Yes or No?
(a) 1 8 15 22
Polling question - Answer
Yes!
seq(1, 28, by = 7)
## [1] 1 8 15 22
Simple patterned vectors - repeating patterns
## [1] 1 1 1 4 4
## [1] 1 1 1 4 4 4
## [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10
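Several rep() calls are missing from the slides. The calls below are one plausible reconstruction, chosen because they reproduce the printed outputs; the lecture may have written them differently.

```r
twice   <- rep(seq(2, 20, by = 2), 2)   # a whole sequence repeated twice
unequal <- rep(c(1, 4), c(3, 2))        # 1 three times, then 4 twice
eachOne <- rep(c(1, 4), each = 3)       # every value repeated three times
pairs   <- rep(1:10, rep(2, 10))        # each of 1 to 10 repeated twice

twice
unequal
eachOne
pairs
```

The second argument can be a single count, a vector of counts (one per element), or the named argument each.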
Vectors with random patterns
We already saw that runif() will help us simulate a simple kind of noise.
The sample() function allows us to simulate things like the results of the
repeated tossing of a 6-sided die.
## [1] 6 3 1 3 6 5 3 5
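The sample() call is not shown. A sketch of eight simulated rolls of a six-sided die follows; the seed is arbitrary and the name dieRolls is introduced here, so the values will differ from those printed above.

```r
set.seed(101)   # arbitrary seed, fixed only so the run is repeatable

# Eight rolls of a fair six-sided die. replace = TRUE lets the same face
# come up more than once, as it can with a real die.
dieRolls <- sample(1:6, size = 8, replace = TRUE)
dieRolls
```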
Polling question - Yes or No?
Polling question - Answer
Yes!
There are too many results in coinTosses to list, but we can use the
table function to display the numbers of 1’s and the number of 2’s:
table(coinTosses)
## coinTosses
## 1 2
## 506 494
Character vectors
Just like numeric vectors, when you type the name of a character vector,
you see its contents:
colors
We can add new elements (green, magenta, and cyan) to the colors vector:
more.colors <- c(colors, "green", "magenta", "cyan")
more.colors
For example,
z <- c("red", "green", 1)
Manipulating character vectors
There are two basic operations you might want to perform on character
vectors. To take substrings, use substr().
For example, to print the first two letters of each color use
substr(colors, 1, 2)
The sep parameter controls what goes between the components being
pasted together.
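The paste() examples are missing from the extracted slides. A short sketch of how sep changes the result is given below; the contents of colors and the word "flowers" are made up for illustration.

```r
colors <- c("red", "yellow", "blue")   # hypothetical contents of the colors vector

spaced <- paste(colors, "flowers")             # default: a single space between pieces
joined <- paste(colors, "flowers", sep = "")   # sep = "" puts nothing between them

spaced
joined
```

paste() works element by element, so the single string "flowers" is recycled to match each colour.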
Polling question - Yes or No?
Polling question - Answer
Yes!
rep(c("red", "green"), each = 3)
Polling question - Yes or No?
(a) red, green, red, green, red, green, red, green, red, green
(b) red, red, red, red, red, green, green, green, green, green, green
Polling question - Answer
DATA SCIENCE 101
Introduction to the R Programming Language (Cont’d)
1. Factor objects
• What are they?
• What can we do with these objects?
• How do we create them?
2. More about data frames
3. Aggregating data - to calculate groups means etc.
4. Some of the many functions built into R
5. Missing values and other special symbols
6. Handling special circumstances when reading in data
Factors
For example, a factor with four elements and having the two levels,
control and treatment can be created using:
grp <- c("control", "treatment", "control", "treatment")
grp
Factors can be an efficient way of storing character data when there are
repeats among the vector elements.
To see what the codes are for our factor, we can type
as.integer(grp)
## [1] 1 2 1 2
The labels for the levels are only stored once each, rather than being
repeated. The codes are indices of the vector of levels:
levels(grp)
levels(grp)[as.integer(grp)]
Since "control" is the first level, we change the first element of the
levels(grp) vector:
levels(grp)[1] <- "placebo"
In this example, grp was a very small vector. The same command could
be used if grp had a large number of elements.
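The factor steps from these slides can be run together as one sketch. The conversion with factor() is assumed here, since the slides show only the character vector being created.

```r
grp <- c("control", "treatment", "control", "treatment")
grp <- factor(grp)            # assumed step: convert the character vector to a factor

codes <- as.integer(grp)      # the integer codes stored for each element
codes
levels(grp)                   # the labels, stored once each

levels(grp)[1] <- "placebo"   # rename the first level everywhere at once
grp
```

Only the vector of levels changes; the integer codes stay the same, which is why the rename is equally cheap for a very long factor.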
Polling question: a, b or c?
How do we convert the 1000 1’s and 2’s in coinTosses to a factor with
levels Tails and Heads?
Polling question: Answer
(c)!
Compare the output of the summary() function for the original numeric
vector and the factor:
## Tails Heads
## 514 486
Factors can have levels that are empty
An important use for factors is to list all possible values, even if some
are not present.
For example,
sex <- factor(c("F", "F"), levels = c("F", "M"))
sex
## [1] F F
## Levels: F M
shows that there are two possible values for sex, but only one is present
in our vector.
Extracting elements from factors and character vectors
As for numeric vectors, square brackets [] are used to index factor and
character vector elements.
For example, the factor grp has 4 elements, so we can print out the third
element by typing
grp[3]
## [1] placebo
## Levels: placebo treatment
Recall
more.colors
Polling question: a, b or c?
(a) coinTossfactor[1:20]
Polling question: Answer
More information about data frames
Recall from the first lecture that data sets usually consist of more than
one column of data, where each column represents measurements of a
single variable.
Data frames
Note that each row of this data frame corresponds to one of the trees, an
observation - measurements for a single tree.
Data frames - viewing the data
Trying to look at the whole data frame all at once is not usually very
helpful. It is difficult to see patterns or unusual features in a large
collection of numbers.
Better ways to view the data are through the use of the summary()
function as shown below, or by constructing a pairwise scatterplot
obtained by executing the command plot(trees).
summary(trees)
[Figure: pairwise scatterplot matrix of the trees data frame, with panels for Girth, Height and Volume]
For larger data frames, a quick way of counting the number of rows and
columns is important. The functions nrow() and ncol() play this role.
## [1] 31 3
In fact, str() works with almost any R object, and is often a quick way
to find what you are working with.
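The commands for this slide are not shown. A sketch of the size and structure functions applied to the built-in trees data frame:

```r
nrow(trees)   # number of rows (observations)
ncol(trees)   # number of columns (variables)
dim(trees)    # both at once: rows, then columns

str(trees)    # compact overview: each column's name, type and first few values
```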
Extracting data frame elements
We can extract elements from data frames using two indices. For
example, the value in the seventh row, second column is
trees[7, 2]
## [1] 66
## [1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1
## [10] 11.2 11.3 11.4 11.4 11.7 12.0 12.9 12.9 13.3
## [19] 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5
## [28] 17.9 18.0 18.0 20.6
Polling question: a, b or c?
Look back at the trees data frame and predict the output from the
following:
trees[3, 2]
(a) ## [1] 63
Polling question: Answer
(a)!
trees[3, 2]
## [1] 63
Polling question: a, b or c?
Look back at the trees data frame and predict the output from the
following:
trees[3, ]
(a) ## [1] 63
Polling question: Answer
(b)!
trees[3, ]
Polling question: a, b or c?
Look back at the trees data frame and predict the output from the
following:
trees[4:7, 1]
(b) ## [1] 11
(c) Neither of the above.
Polling question: Answer
(a)!
trees[4:7, 1]
Data frame - accessing individual columns
Data frame columns can also be addressed by name using the
$ operator. For example, the weight column can be extracted as follows:
We can also extract all heights for which the volumes are less than 17.5
using
trees$Height[trees$Volume < 17.5]
## [1] 70 65 63 72 66
Data frames - accessing individual columns
For example, we can divide the volumes by the heights in the trees data
frame using
with(trees, Volume/Height)
We saw that the mean is part of the output from the summary function.
mean(trees$Volume)
## [1] 30.17097
Polling question: a, b or c?
To find the average height in the trees data frame, we can use
(a) mean(trees$Height)
(b) mean(trees$height)
(c) Neither of the above.
Polling question: Answer
(a)!
mean(trees$Height)
## [1] 76
The column names in the trees data frame are all capitalized as you can
see if you use the names function:
names(trees)
Data frames
## weight feed
## Min. :108.0 casein :12
## 1st Qu.:204.5 horsebean:10
## Median :258.0 linseed :12
## Mean :261.3 meatmeal :11
## 3rd Qu.:323.5 soybean :14
## Max. :423.0 sunflower:12
If you want to see the first few rows of a data frame, you can use the
head() function:
head(chickwts)
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
Subsetting a data frame
If you want only the chicks who were fed horsebean, you can apply the
subset() function to the chickwts data frame:
chickHorsebean <- subset(chickwts, feed == "horsebean")
chickHorsebean
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
## 7 108 horsebean
## 8 124 horsebean
## 9 143 horsebean
## 10 140 horsebean
You can now apply the summary function to this new data frame:
summary(chickHorsebean$weight)
Specifically obtaining the average weight for this type of feed is also
possible:
mean(chickHorsebean$weight)
## [1] 160.2
Polling question: a, b or c?
Polling question: Answer
(a)!
## weight feed
## 3 136 horsebean
## 7 108 horsebean
## 8 124 horsebean
## 9 143 horsebean
## 10 140 horsebean
## 14 141 linseed
Aggregating data according to factor levels
We would like to calculate the average weight for each type of feed in
order to make a comparison – e.g. which feed leads to the highest
weight?
## feed weight
## 1 casein 323.5833
## 2 horsebean 160.2000
## 3 linseed 218.7500
## 4 meatmeal 276.9091
## 5 soybean 246.4286
## 6 sunflower 328.9167
* A more thorough treatment of this question would be through the Analysis of Variance
(ANOVA), which is outside the scope of this course.
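The table of feed averages above is the kind of result that aggregate() returns when it is given a formula. A minimal sketch, assuming the built-in chickwts data frame and mean as the summary function (the object name feedMeans is ours, chosen for illustration):

```r
# Average weight for each feed type in the built-in chickwts data frame.
# 'feedMeans' is an illustrative name, not one used elsewhere in the slides.
feedMeans <- aggregate(weight ~ feed, data = chickwts, FUN = mean)
feedMeans

# The feed with the highest average weight
feedMeans$feed[which.max(feedMeans$weight)]
```

The formula weight ~ feed reads as "weight, grouped by feed"; swapping FUN = mean for FUN = median would give the group medians instead.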
Polling question: a, b or c?
Polling question: Answer
(b)!
## Species Petal.Length
## 1 setosa 1.462
## 2 versicolor 4.260
## 3 virginica 5.552
Taking random samples from populations
If the entries have been enumerated (say, by the use of an ID index) from
1 through 15000, we could select the 8 numbers with
sampleID <- sample(1:15000, size = 8, replace = FALSE)
sampleID
The result is a new data frame consisting of 8 rows and the same
number of columns as fluSurvey.
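A minimal sketch of the row selection, assuming a stand-in data frame: the real fluSurvey is not reproduced here, so the ID and temperature columns below are invented for illustration, as is the name fluSample.

```r
# Stand-in for the survey data: 15000 rows with an ID index.
# The column names and values are hypothetical.
set.seed(101)                      # makes the random sample reproducible
fluSurvey <- data.frame(ID = 1:15000,
                        temperature = round(rnorm(15000, mean = 37, sd = 0.5), 1))

# Choose 8 distinct row numbers, then extract those rows
sampleID <- sample(1:15000, size = 8, replace = FALSE)
fluSample <- fluSurvey[sampleID, ]
dim(fluSample)                     # 8 rows, same number of columns as fluSurvey
```

Leaving the column position empty in fluSurvey[sampleID, ] keeps every column for the selected rows.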
Polling question: a, b or c?
We would like to take a random sample of 10 rivers from the 141 whose
lengths are recorded in rivers. The code to do this is
Polling question: Answer
## [1] 780 870 618 270 360 360 215 329 1885
## [10] 420
Constructing data frames
xy <- data.frame(x, y)
xy
## x y
## 1 5 2
## 2 2 4
## 3 3 5
Data frames can have non-numeric columns
obesityStudy$gender
Some built-in functions
Some built-in functions: pmin
## [1] 1 2 3 4 5
## [1] 7 6 5 4 3
pmin(x,y)
## [1] 1 2 3 4 3
Polling question: (a), (b) or (c)?
## [1] 1 2 3 4 5
## [1] 7 6 5 4 3
pmax(x,y)
(a) 7 6 5 4 5
(b) 7 7 7 7 7
(c) 7 6 5 4 3
Polling question: Answer
(a)!
pmax(x,y)
## [1] 7 6 5 4 5
Some built-in functions: median
median(nhtemp)
## [1] 51.2
Comparing two data sets with the median
We can compare the medians for the New Haven temperatures between
1920 and 1939 with those of 1952 to 1971 by extracting the appropriate
subsets from the nhtemp vector.
Since the first observation was in 1912, the 9th must be in 1920 and the
28th in 1939. The 41st is in 1952 and the 60th is in 1971.
The medians of the New Haven temperatures for 1920 through 1939, and
1952 through 1971 are:
median(nhtemp[9:28])
## [1] 50.75
median(nhtemp[41:60])
## [1] 51.85
A questionable form of data analysis
## [1] 47.35
## [1] 51.2
Polling question: a, b or c?
Polling question: Answer
(a) It is better to compare the temperatures at the two locations over the
same time period.
(b) Consider 2 years of artificial monthly data from the planet Xenon:
x <- rep(c(rep(0,11), 12), 2); x
## [1] 0 0 0 0 0 0 0 0 0 0 0 12 0 0 0
## [16] 0 0 0 0 0 0 0 0 12
The median of the 2 yearly averages which are both 1 (12/12) is: 1.
The median of the 24 monthly averages is (sort and take the average of
the middle 2 values): 0.
The nottem data vector contains 240 monthly averages, for the years
1920 through 1939.
We would like to compute averages for each year (12 months each), so
we first create a vector of the 20 years, repeated 12 times, so that we
have a vector of length 240 which matches the nottem vector:
year <- rep(1920:1939, each = 12)
A proper comparison: aggregating data
Here, we are using a formula which relates the values in nottem to the
values in year.
We don’t need to specify a data frame because the objects nottem and
year are located in the workspace.
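The call being described is not shown above. One way to write it, as a sketch: here the two vectors are first placed in a data frame (the slides rely on the workspace objects instead), and the names nottinghamMonthly and nottinghamYearly are ours.

```r
# nottem is a built-in series of 240 monthly average temperatures (1920-1939)
year <- rep(1920:1939, each = 12)

# Put the values and the matching years side by side
nottinghamMonthly <- data.frame(year = year, nottem = as.numeric(nottem))

# Average the 12 monthly values within each year
nottinghamYearly <- aggregate(nottem ~ year, data = nottinghamMonthly, FUN = mean)
head(nottinghamYearly, 3)
```

The result has one row per year, so it can be summarized with median() in the same way as the yearly New Haven values.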
## year nottem
## 1 1920 48.89167
## 2 1921 50.73333
## 3 1922 47.27500
## 4 1923 47.81667
## 5 1924 48.72500
## 6 1925 48.45833
## 7 1926 49.36667
## 8 1927 48.36667
## 9 1928 48.99167
## 10 1929 48.13333
## 11 1930 49.15000
## 12 1931 48.13333
## 13 1932 49.01667
## 14 1933 50.19167
## 15 1934 50.31667
## 16 1935 49.85833
## 17 1936 48.60000
## 18 1937 49.10000
## 19 1938 50.27500
## 20 1939 49.39167
Note that the result is a data frame with columns year and nottem.
The second column contains the yearly averages.
median(nottinghamtemp$nottem)
## [1] 49.00417
median(nhtemp[9:28])
## [1] 50.75
New Haven temperatures are marginally higher (in median), but the
difference is not nearly as large as the earlier analysis suggested.
The range and the range statistic
We can calculate the range of the New Haven temperatures in two steps.
First we use the range function to calculate the minimum and maximum
values:
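range(nhtemp)
## [1] 47.9 54.6
Then we take the difference between the maximum and the minimum:
diff(range(nhtemp))
## [1] 6.7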
Comparing ranges
## [1] 3.458333
## [1] 4.4
The range for the Nottingham average temperatures is less than the one
for New Haven.
The range statistic is not often used to compare spread, because it can
be distorted by unusual values - called outliers.
For example, suppose you have 100 numbers which are between 5 and
15, except for one which is 99. And somebody else has 100 numbers
which are all between 0 and 20.
Your range is 94 and the other person’s range is 20. But which data set
is really more spread out? With the exception of a single data point, the
other person’s is more spread out, so it would be better to have a
statistic that would not be influenced so strongly by one data point.
The interquartile range (IQR) of a data set conveys the same kind of
information as the range: how spread out is the distribution of values?
But it does it in a way that is a lot less sensitive to outlying values.
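The outlier example above can be checked numerically. A sketch with made-up data that matches the description (the vector names yours and theirs are illustrative):

```r
# 99 values spread evenly between 5 and 15, plus one outlier at 99
yours <- c(seq(5, 15, length.out = 99), 99)
# 100 values spread evenly between 0 and 20
theirs <- seq(0, 20, length.out = 100)

diff(range(yours))    # 94: dominated by the single outlier
diff(range(theirs))   # 20

IQR(yours)            # smaller than IQR(theirs)
IQR(theirs)
```

The range ranks the first data set as more spread out, while the IQR, which ignores the extreme quarter at each end, ranks the second one as more spread out.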
Interquartile range (IQR)
Let’s compare the IQR for the two temperature data sets:
IQR(nottinghamtemp$nottem)
## [1] 1.072917
IQR(nhtemp[9:28])
## [1] 1.425
We still see that the Nottingham temperatures are less variable than the
New Haven temperatures.
Standard deviation (sd)
sd(nottinghamtemp$nottem)
## [1] 0.9069859
sd(nhtemp[9:28])
## [1] 1.06524
We see that the Nottingham temperatures are less variable than the New
Haven temperatures by this criterion as well.
Polling question: (a), (b) or (c)?
A lecturer calculated the range statistic for a data set to be −7.5. This
means
Polling question: Answer
(c)!
It is not possible for the range to be negative, since you are subtracting
a smaller number (minimum) from a larger number (maximum). If the
minimum and maximum are the same, then all data points are equal and
the range is 0.
Polling question: (a), (b) or (c)?
(a) people in the first country recover faster than people in the second
country.
(b) recovery time in the first country is less predictable than recovery
time in the second country.
(c) none of the above.
Polling question: Answer
(b)!
The IQR measures the variability in the measurements. The higher the
variability, the less predictable an individual measurement would be. In
this case, the first country has more variability in recovery times than
the second country, so recovery times are harder to predict.
Missing values and other special values
This data frame contains data on gas mileage for a number of cars and
additional characteristics such as horsepower and weight and so on.
The variable x3 contains the measurements on torque for each car.
This gives us the summary information for all non-missing values of the
torque measurement and also tells us that 2 of the values are missing.
## [1] NA 2 NA 4 NA 6 NA 8 NA 10
## [1] NaN 1 1
The NaN symbol denotes a value which is ‘not a number’ which arises as
a result of attempting to compute the indeterminate 0/0.
In other cases, special values may be shown, or you may get an error or
warning message:
1 / x
Here R has tried to evaluate 1/0 and reports the infinite result as Inf.
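Both special values can be reproduced with a short vector. A sketch, assuming x holds the values 0, 1 and 2 (the definition of x is not shown above, so this choice is an assumption):

```r
x <- c(0, 1, 2)   # hypothetical vector; the first element is zero

x / x    # first element is 0/0, which gives NaN
1 / x    # first element is 1/0, which gives Inf

is.nan(x / x)      # flags the NaN
is.finite(1 / x)   # FALSE for the infinite value
```

Neither calculation stops with an error; the special values simply propagate into later arithmetic, so it is worth testing for them.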
is.na(some.evens)
!is.na(some.evens)
some.evens[!is.na(some.evens)]
## [1] 2 4 6 8 10
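A self-contained sketch of the same steps. How some.evens was created is not shown above, so the construction below is an assumption that reproduces the printed values:

```r
# Build a vector whose odd positions are missing
some.evens <- rep(NA, 10)
some.evens[seq(2, 10, by = 2)] <- seq(2, 10, by = 2)
some.evens

sum(is.na(some.evens))            # number of missing values
some.evens[!is.na(some.evens)]    # keep only the non-missing values
mean(some.evens, na.rm = TRUE)    # many functions can skip NA directly
```

The na.rm = TRUE argument is often a shorter alternative to subsetting with !is.na() when only a summary is needed.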
Reading data from an external file containing missing values
This tells R that the blank spaces should be read in as missing values.
Reading data into a data frame from an external file
Again, observe the result:
dataset2
## x y z
## 1 33 223 NA
## 2 32 88 2
## 3 3 NA NA
If you need to skip the first 3 lines of a file to be read in, use the skip=3
argument.
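A sketch of reading such a file with read.table(). The file is written to a temporary location first so the example is self-contained; its contents and the names dataFile and dataset are invented for illustration.

```r
# Write a small comma-separated file with three junk lines at the top
dataFile <- tempfile(fileext = ".txt")
writeLines(c("Study 17",
             "collected in 2019",
             "exported from the lab system",
             "x,y,z",
             "33,223,",
             "32,88,2",
             "3,,"),
           dataFile)

# skip = 3 ignores the first three lines; empty fields are read as NA
dataset <- read.table(dataFile, header = TRUE, sep = ",", skip = 3,
                      na.strings = "")
dataset
```

With header = TRUE, the first line after the skipped ones supplies the column names.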
DATA SCIENCE 101
Digging More Deeply into R and Computing
Logical Vectors and relational operators
Logical vectors can contain only three types of elements: TRUE and
FALSE, as well as NA for missing.
They can also be created by using relational operators, e.g., >, < and ==.
x <- 1:8
x
## [1] 1 2 3 4 5 6 7 8
eg4.logical <- (x == 5)
eg4.logical
Logical vectors and relational operators
The %in% operator tests whether elements of one vector can be found in
another vector.
y <- seq(-5, 15, 4)
y
## [1] -5 -1 3 7 11 15
## [1] 1 2 3 4 5 6 7 8
y %in% x
x %in% y
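The two calls give different answers because the result always has one element per element of the left-hand vector. A short sketch using the x and y defined above:

```r
x <- 1:8
y <- seq(-5, 15, 4)   # -5 -1 3 7 11 15

y %in% x   # one answer per element of y: is it found in x?
x %in% y   # one answer per element of x: is it found in y?

y[y %in% x]   # the elements common to both vectors
```

Only 3 and 7 appear in both vectors, so each result contains exactly two TRUE values, in different positions.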
Polling question - (a), (b), (c) or (d)?
aa %in% bb
Polling question - Answer
(a)!
aa %in% bb
Practical use of the %in% operator
Example:
The cuckoos data frame in the DAAG package has measurements of the
eggs laid by cuckoos in the nests of birds of other species. Here is the
basic summary:
library(DAAG)
summary(cuckoos[, -4]) # the 4th column is not needed
If we only want to study the measurements of the eggs laid in the robin
and wren nests, we can subset the data with the %in% operator as in
cuckooWrenRobin <- subset(cuckoos[, -4],
species %in% c("robin", "wren"))
## weight feed
## Min. :108.0 casein :12
## 1st Qu.:204.5 horsebean:10
## Median :258.0 linseed :12
## Mean :261.3 meatmeal :11
## 3rd Qu.:323.5 soybean :14
## Max. :423.0 sunflower:12
(a) subset(chickwts, c("horsebean", "sunflower") %in% feed)
(b) subset(chickwts, feed %in% c("horsebean", "sunflower"))
Polling question ... Answer
(b)!
## weight feed
## Min. :108.0 casein : 0
## 1st Qu.:162.0 horsebean:10
## Median :261.0 linseed : 0
## Mean :252.2 meatmeal : 0
## 3rd Qu.:331.0 soybean : 0
## Max. :423.0 sunflower:12
Only the horsebean and sunflower measurements remain in the new data set.
Not: turning TRUE into FALSE
Example:
x <- c(TRUE, FALSE, TRUE)
!x
summary(cuckooNotWrenRobin)
The code
will give
Polling question ... answer
(b)!
## [1] 735 524 1459 600 870 906 1000 600 505
## [10] 1450 840 1243 890 525 720 850 630 730
## [19] 600 710 680 570 560 900 625 2348 1171
## [28] 3710 2315 2533 780 760 618 981 1306 500
## [37] 696 605 1054 735 1270 545 1885 800 538
## [46] 1100 1205 610 540 1038 620 652 900 525
## [55] 529 500 720 671 1770
Boolean algebra
Depending on the weather where you are, those two statements may
both be true (there is a “sunshower”), A may be true and B false (the
usual clear day), A false and B true (the usual rainy day), or both may be
false (a cloudy but dry day).
For example, “A and B” is the statement that it is both clear and raining.
(This might be true with a small amount of cloud overhead - conditions
for a “sunshower”).
There is also the “not A” statement, which says that it is not clear.
Boolean algebra - example
Let A be the set of animals, and B be the set of objects with four legs.
A dining room table has four legs so it is in set B, but it is not in set A.
To summarize:
Polling question - (a), (b), (c) or (d)?
Assume A and B are as in the previous slide. Which one of the following
statements is true?
Polling question - Answer
(b)!
An apple tree is not an animal and does not have four legs, so it is
definitely not a four-legged animal.
Polling question - (a), (b), (c) or (d)?
(a) 4 ∈ B
(b) 4 ∈ A ∩ B
(c) 3 ∉ B
(d) 3 ∈ A ∩ B
Polling question - Answer
(d)!
Boolean algebra
Because there are only two possible values (true and false), we can
record all Boolean operations in a table.
On the first line, we list the basic Boolean expressions, on the second
line the equivalent way to code them in R, and in the body of the table
the results of the operations.
Boolean   A      B      not A  not B  A and B  A or B
R         A      B      !A     !B     A & B    A | B
          TRUE   TRUE   FALSE  FALSE  TRUE     TRUE
          TRUE   FALSE  FALSE  TRUE   FALSE    TRUE
          FALSE  TRUE   TRUE   FALSE  FALSE    TRUE
          FALSE  FALSE  TRUE   TRUE   FALSE    FALSE
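The same table can be generated in R, which is a handy way to check it. A sketch; the column names and the object name truthTable are ours:

```r
# All four combinations of two logical values
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)

# Each column applies one operator element by element
truthTable <- data.frame(A, B,
                         notA = !A, notB = !B,
                         AandB = A & B, AorB = A | B)
truthTable
```

Because the operators work element by element, each row of the data frame corresponds to one row of the table.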
Logical operations in R
One of the basic types of vector in R holds logical values. For example,
a logical vector may be constructed as
a <- c(TRUE, FALSE, FALSE, TRUE)
## [1] 13 2
## [1] 2
then the operations are performed after converting FALSE to 0 and TRUE
to 1, so by summing we count how many occurrences of TRUE are in the
vector.
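A short sketch of counting with logical vectors. The last line applies the same idea to the built-in chickwts data; the threshold of 300 is an arbitrary choice for illustration.

```r
a <- c(TRUE, FALSE, FALSE, TRUE)

sum(a)    # number of TRUE values
mean(a)   # proportion of TRUE values

# A common use: count how many observations satisfy a condition
sum(chickwts$weight > 300)
```

Taking the mean of a logical vector works for the same reason as the sum: each TRUE counts as 1 and each FALSE as 0.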
There are two versions of the Boolean operators. The usual versions are
&, | and !, as discussed earlier. These are all vectorized, so we see for
example
!a
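For contrast, a sketch of both versions. The second versions, && and ||, are not named in the text above; they are base R operators that work on single values and are mainly used in if statements. The vector b and the value n are made up for the example.

```r
a <- c(TRUE, FALSE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE, FALSE)

!a      # element-by-element negation
a & b   # element-by-element "and"
a | b   # element-by-element "or"

# && and || compare single values only
n <- 5
(n > 0) && (n < 10)
```

Use the vectorized forms when working with whole columns of data, and the double forms for a single yes/no decision.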
Relational operators
## [1] FALSE
## [1] TRUE
More examples:
threeM <- c(3, 6, 9)
threeM != 6 # which elements are not equal to 6
## [1] 6 9
## [1] 4
Data Storage in R
Approximate Storage of Numbers
The number above would be written as $1.1001_2 \times 2^{-5}$ if five binary digit
precision was used.
Five binary digits give less precision than three decimal digits: a range
of values from approximately 0.0488 to 0.0508 would all get the same
representation to five binary digit precision.
At this point we’ll have the number 0.8 again, so the sequence of 4 bits
will repeat indefinitely: in base 2, 4/5 is 0.110011001100 · · · .
i.e. it is equal for some values, but not equal for n = 3, 6 or 7. The errors
are very small, but non-zero.
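The same effect shows up in a well-known small case. This sketch uses 0.1, 0.2 and 0.3 rather than the calculation referred to above, and shows the usual remedy of comparing with a tolerance:

```r
# 0.1, 0.2 and 0.3 cannot be stored exactly in base 2
0.1 + 0.2 == 0.3       # exact comparison
(0.1 + 0.2) - 0.3      # the error is tiny, but not zero

# Compare with a tolerance instead
isTRUE(all.equal(0.1 + 0.2, 0.3))
```

all.equal() treats two numbers as equal when they differ by less than a small tolerance, which is usually what is wanted for computed values.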
Since the leading bits in the binary expansions of nearly equal numbers
will match, they will cancel in subtraction, and the result will depend on
what is stored in the later bits.
Approximate Storage of Numbers - Variance Example
* The symbol $\sum$ is the mathematical shorthand for sum. If $x_1 = 3$, $x_2 = 4$ and $x_3 = 2$,
then $\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3 = 9$. In R, we would obtain this sum by typing sum(x).
x <- 1:11
mean(x)
## [1] 6
var(x)
## [1] 11
## [1] 11
Not too long ago memory was so expensive that it was advantageous to
rewrite the formula as
$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$$
( sum(x^2) - 11 * mean(x)^2 ) / 10
## [1] 11
This doesn’t change the variance, but it provides the conditions for a
“catastrophic loss of precision” when we take the difference:
A <- 1.e10
x <- 1:11 + A
var(x)
## [1] 11
( sum(x^2) - 11 * mean(x)^2 ) / 10
## [1] 0
Since R gets the right answer, it clearly doesn’t use the one-pass
formula, and neither should you.
Exact Storage of Numbers
We have seen that R uses floating point storage for numbers, using a
base 2 format that stores 53 bits of accuracy.
It turns out that this format can store some fractions exactly: if the
fraction can be written as n/2^m, where n and m are integers (not too
large; m can be no bigger than about 1000, but n can be very large), R
can store it exactly.
The number 5/4 is in this form, but the number 4/5 is not, so only the
former is stored exactly.
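Printing more digits than usual shows the difference. A sketch using sprintf() with 20 decimal places, which is an arbitrary choice large enough to expose the stored value:

```r
# Print 20 decimal places to reveal what is actually stored
sprintf("%.20f", 5/4)   # 5/4 = 5/2^2, stored exactly
sprintf("%.20f", 4/5)   # 4/5 has a repeating binary expansion, so it is rounded
```

R normally prints only about 7 significant digits, which is why both numbers look exact at the console.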
Exact Storage of Numbers - Example
The number 11 can be stored as the binary value of 11, i.e. 0...01011,
whereas −11 can be stored as the binary value of
2^32 − 11 = 4294967285, which turns out to be 1...10101.
Using only 32 bits for storage, this is identical to 0, which is what we’d
hope to get for 11 + (−11).
Dates and Times
Dates and times are among the most difficult types of data to work with
on computers.
Times are also messy, because there is often an unstated time zone
(which may change for some dates due to daylight saving time), and
some years have “leap seconds” added in order to keep standard clocks
consistent with the rotation of the earth.
The base package has the function strptime() to convert from strings
(e.g. "2007-12-25", or "12/25/07") to an internal numerical
representation, and format() to convert back for printing.
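A sketch of the round trip with strptime() and format(). The time zone is fixed to UTC so the result does not depend on the local settings, and the last line uses as.Date(), a base R function that is not mentioned above, to show the underlying day count.

```r
# Convert a string to R's internal date-time representation
xmas <- strptime("2007-12-25", format = "%Y-%m-%d", tz = "UTC")

# Convert back to a string, in a different layout
format(xmas, "%m/%d/%y")

# Dates are stored as the number of days since January 1, 1970
as.numeric(as.Date("2014-07-11"))
```

The format codes (%Y for a four-digit year, %y for two digits, %m for month, %d for day) are the same in both directions.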
## [1] 17945
If the trucker started driving on July 11, 2014, we count the number of
accident-free days as follows:
startDate <- "14-07-11"
numberOfDaysStart <- chron(dates = startDate,
format=c('y-m-d'))
as.numeric(numberOfDaysStart) # Ref. date is Jan. 1, 1970
## [1] 16262
## Time in days:
## [1] 1683
Time series objects
The Nile object contains annual flow amounts for the Nile River (Egypt)
and is an important example of a time series. The Nile River has
important effects on agriculture, and the flow data have been studied
extensively, in order to understand how patterns of flow change over time.
The object looks like a numeric vector but with additional features.
Nile
## Time Series:
## Start = 1871
## End = 1970
## Frequency = 1
## [1] 1120 1160 963 1210 1160 1160 813 1230 1370 1140 995 935
## [13] 1110 994 1020 960 1180 799 958 1140 1100 1210 1150 1250
## [25] 1260 1220 1030 1100 774 840 874 694 940 833 701 916
## [37] 692 1020 1050 969 831 726 456 824 702 1120 1100 832
## [49] 764 821 768 845 864 862 698 845 744 796 1040 759
## [61] 781 865 845 944 984 897 822 1010 771 676 649 846
## [73] 812 742 801 1040 860 874 848 890 744 749 838 1050
## [85] 918 986 797 923 975 815 1020 906 901 1170 912 746
## [97] 919 718 714 740
As can be seen, the time series object contains information about the
starting year and ending year as well as the number of observations per
year. Here, there is one observation per year.
[Figure: plot of the Nile time series; horizontal axis: Time]
According to this graph, it seems that the flow was higher in the 1800’s
than in the 1900’s.
We can create our own time series objects with the use of the ts
function.
Consider the monthly BC jobs data for 1995 and 1996 in the jobs data
frame in the DAAG package.
jobs$BC
## [1] 1752 1737 1765 1762 1754 1759 1766 1775 1777 1771 1757 1766
## [13] 1786 1784 1791 1800 1800 1798 1814 1803 1796 1818 1829 1840
The start value corresponds to the first month of 1995, and the end
vector corresponds to the 12th month of 1996.
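The ts() call being described is not shown above. A sketch of what it could look like, with the values typed in from the output above so that the DAAG package is not needed (the vector name BC is ours):

```r
# Monthly BC jobs figures for 1995 and 1996, copied from the output above
BC <- c(1752, 1737, 1765, 1762, 1754, 1759, 1766, 1775, 1777, 1771, 1757, 1766,
        1786, 1784, 1791, 1800, 1800, 1798, 1814, 1803, 1796, 1818, 1829, 1840)

# 12 observations per year, from month 1 of 1995 to month 12 of 1996
jobsBC <- ts(BC, start = c(1995, 1), end = c(1996, 12), frequency = 12)
jobsBC
```

With frequency = 12, R prints the series as a table with one row per year and one column per month.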
plot(jobsBC)
[Figure: time series plot of jobsBC against Time; vertical axis from about 1740 to 1820]
The graph shows that the job situation in BC improved through 1995 and
1996.
DATA SCIENCE 101
Programming Base Graphics in R
Programming Statistical Graphics
Bar Charts and Dot Charts
The most basic type of graph is one that displays a single set of
numbers.
Bar charts and dot charts do this by displaying a bar or dot whose
length or position corresponds to the number.
Basic Bar Charts
Bar Charts
but the plot that results needs some minor changes: we’d like to display
a title at the top, and we’d like to shrink the size of the labels on the
axes. We can do that with the following code.
barplot(WorldPhones51, cex.names = .75, cex.axis = .75,
main = "Numbers of Telephones in 1951")
The cex.names = .75 argument reduced the size of the region names
to 0.75 of their former size, and the cex.axis = .75 argument reduced
the labels on the vertical axis by the same amount.
The main argument sets the main title for the plot.
Dot Charts
[Figure: dot chart of WorldPhones51 with rows, from top to bottom: Mid.Amer, Africa, Oceania, S.Amer, Asia, Europe, N.Amer]
Use pch=16 to get a filled in dot for the plotting character, for clarity:
dotchart(WorldPhones51, xlab = "Numbers of Phones ('000s)", pch=16)
[Figure: dot chart of WorldPhones51 drawn with filled dots; rows, from top to bottom: Mid.Amer, Africa, Oceania, S.Amer, Asia, Europe, N.Amer]
Data with More Structure
Data sets having more complexity can also be displayed using these
graphics functions.
Example
Virginia Deaths
barplot(VADeaths, beside = TRUE, legend = TRUE, ylim = c(0, 90),
ylab = "Deaths per 1000",
main = "Death rates in Virginia")
[Figure: grouped bar chart of VADeaths; legend: age groups 50-54, 55-59, 60-64, 65-69, 70-74; vertical axis: Deaths per 1000]
The ylim = c(0, 90) argument modifies the vertical scale of the
graph to make room for the legend.
Finally, main = "Death rates in Virginia" sets the main title for
the plot.
Virginia Deaths Dot Chart
dotchart(VADeaths, xlim=c(0, 75),
xlab="Deaths per 1000",
main="Death rates in Virginia", pch=16)
[Figure: dot chart of VADeaths with groups Rural Male, Rural Female, Urban Male and Urban Female, each listing age groups 50-54 to 70-74; horizontal axis: Deaths per 1000]
The Histogram
Each bar represents the count of x values that fall in the range indicated
by the base of the bar. Usually all bars should be the same width; this is
the default in R.
When bars have equal width, the height of each bar is proportional to the
number of observations in the corresponding interval.
If bars have different widths, then the area of the bar should be
proportional to the count; in this way the height represents the density
(i.e. the frequency per unit of x).
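A sketch of the unequal-width case, using made-up data and arbitrary break points. With plot = FALSE, hist() returns the computed values instead of drawing, which makes the relationship between counts and densities easy to check.

```r
set.seed(1)
x <- runif(200, min = 0, max = 10)   # made-up data

# Unequal bin widths
h <- hist(x, breaks = c(0, 1, 2, 5, 10), plot = FALSE)

h$counts                           # observations in each bin
h$density                          # count / (n * bin width)
sum(h$density * diff(h$breaks))    # bar areas add up to 1
```

When the same call is made without plot = FALSE, R draws the densities rather than the counts, so that the areas of the bars remain comparable.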
There are several optional parameters in ... that are used to control the
details of the display.
The Histogram – Example
hist(escape, xlab="escape times (in seconds)")
[Figure: histogram titled "Histogram of escape"; vertical axis: Frequency, from 0 to 10]
The Histogram – Choosing Bin Widths
26 > 2^4 = 16
26 < 2^5 = 32
The above rule (known as the “Sturges” rule) is not always satisfactory
for very large values of n, giving too few bars.
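A sketch of the arithmetic for the escape-time example with n = 26, assuming the rule picks the smallest k for which 2^k is at least n. Note that R's own helper, nclass.Sturges(), computes ceiling(log2(n) + 1), so it suggests one more class than this.

```r
n <- 26                 # number of observations in the example
ceiling(log2(n))        # smallest k with 2^k >= n

# R's helper adds 1 to log2(n) before rounding up
nclass.Sturges(numeric(n))
```

Either way the suggested number grows only with the logarithm of n, which is why very large data sets end up with relatively few bars.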
A Smoother Alternative: Density Plotting
plot(density(escape), main=' ')
[Figure: density plot of escape; vertical axis: Density; N = 26, Bandwidth = 11.29]
Box plots
A box plot gives a quick visual summary of the main features of a set
of data.
The box indicates location and spread of the main body of the data, and
the line segments indicate the range of the data.
Outliers (observations that are very different from the rest of the data)
are often plotted as separate points.
The lower line segment is drawn from the lower end of the box to the
smallest value that is no smaller than 1.5 IQR below the lower quartile.
Similarly, the upper segment is drawn from the middle of the upper end
of the box to the largest value that is no larger than 1.5 IQR above the
upper quartile.
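The whisker rule can be checked without drawing anything. A sketch using boxplot.stats(), which returns the numbers behind a box plot; the data are made up so that one value lies well beyond 1.5 IQR above the upper quartile.

```r
x <- c(1:10, 30)   # made-up data with one unusually large value

b <- boxplot.stats(x)   # applies the 1.5 * IQR rule described above
b$stats   # lower whisker, lower quartile, median, upper quartile, upper whisker
b$out     # points drawn separately as outliers
```

Here the upper whisker stops at 10, the largest value within 1.5 IQR of the upper quartile, and 30 is reported as an outlier.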
[Figure: annotated box plot showing, from top to bottom: outlier, upper whisker, upper quartile, median, lower quartile, lower whisker, outliers]
In the code, we have used R’s formula based interface to the graphics
function: the syntax Sepal.Length ~ Species is read as
“Sepal.Length depending on Species”, where both are columns of the
data frame specified by data = iris.
The boxplot() function draws separate side-by-side box plots for each
species.
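The call itself is not shown above. A sketch consistent with the description, using the built-in iris data; the axis label and title follow the wording of the figure, and the object name irisBox is ours.

```r
# Side-by-side box plots of sepal length, one per species
irisBox <- boxplot(Sepal.Length ~ Species, data = iris,
                   ylab = "Sepal length (cm)", main = "Iris measurements")

irisBox$names   # one box per level of Species
```

boxplot() returns the statistics it plotted, so saving the result gives access to the quartiles and outliers for each group.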
[Figure: side-by-side box plots titled "Iris measurements"; vertical axis: Sepal length (cm), from 4.5 to 8.0]
Scatterplots
In R, scatterplots (and many other kinds of plots) are drawn using the
plot() function.
Scatterplots – Optional Arguments
There are many additional optional arguments: e.g. type, pch, xlab,
ylab, main, col, xlim, ylim, ...
The default for type is type="p", which plots points. Line plots (in which
line segments join the (xi, yi) points in order from first to last) are drawn
using type="l".
Example: Distances Traveled Down a Ramp
library(DAAG) # package containing modelcars data frame
summary(modelcars)
## distance.traveled starting.point
## Min. :11.75 Min. : 3.00
## 1st Qu.:17.78 1st Qu.: 5.25
## Median :24.12 Median : 7.50
## Mean :23.19 Mean : 7.50
## 3rd Qu.:27.94 3rd Qu.: 9.75
## Max. :33.62 Max. :12.00
[Figure: scatterplot of distance.traveled against starting.point]
A More Carefully Drawn Version of the Plot
with(modelcars, plot(starting.point, distance.traveled,
pch=16, cex=1.25, xlab="Starting Point",
ylab="Distance Traveled", main="Model Car Data",
xlim=c(0,12), ylim=c(0,35)))
[Figure: scatterplot titled "Model Car Data"; horizontal axis: Starting Point, from 0 to 12; vertical axis: Distance Traveled, from 0 to 35]
Orange Trees
unique(as.character(Orange$Tree))
[Figure: scatterplot of circumference against age for the Orange data]
One way to remedy this problem is to use a different plotting symbol for
each tree.
[Figure: scatterplot of circumference against age, with each point plotted as its tree number (1 to 5)]
Individual lines can be used to do the same thing. This code uses a for
loop, which will be discussed in more detail when we study programming.
plot(circumference ~ age, data = Orange, pch=as.numeric(Orange$Tree))
# loop over each tree once, rather than over every row of the data frame
for (i in unique(as.character(Orange$Tree))) {
lines(circumference ~ age, data = subset(Orange, Tree == i),
lty=as.numeric(i))
}
[Figure: plot of circumference against age, with a separate line type for each tree]
Choosing a High Level Graphic
We have described bar, dot, and pie charts, histograms, box plots and
scatterplots. There are many other styles of statistical graphics that we
haven’t discussed. How should a user choose which one to use?
The first consideration is the type of data. Bar, dot and pie charts
display individual values, histograms, box plots and QQ plots display
distributions, and scatterplots display pairs of values.
For example, a box plot or QQ plot would require more explanation than
a histogram, and might not be appropriate for the general public.
Choosing a High Level Graphic – Visual Perception
For example, the bars in bar charts are easy to recognize, because the
position of the ends of the bars and the length of the bars are easy to
see.
However, the fact that we see length and area when we look at a bar
constrains us.
We should normally base bar charts at zero, so that the position, length
and area all convey the same information.
If we are displaying numbers where zero is not relevant, then a dot chart
is a better choice: in a dot chart it is mainly the position of the dot that is
perceived.
Thinking in terms of visual tasks tells us that pie charts can be poor
choices for displaying data.
In order to see the sizes of slices of the pie, we need to make angle and
area measurements, and we are not very good at those.
Some palettes indicate sequential groups from low to high, others show
groups that diverge from a neutral value, and others are purely
qualitative.
These are chosen so that most people (even if colour-blind) can easily
see the differences.
Low Level Graphics Functions
In this section we will describe some of these low level functions, which
are also available to users to customize their plots.
We will start with a description of how R views the page it is drawing on,
then show how to add points, lines and text to existing plots, and finish
by showing how some of the common graphics settings are changed.
The plotting region and margins
[Figure: the plot region surrounded by Margin 1 to Margin 4, with Line 0 to Line 4 marked in each margin; the point (6, 20) is shown inside the plot region]
Outside the plot region are the margins, numbered clockwise from 1 to
4, starting at the bottom.
Normally text and labels are plotted in the margins, and R positions
objects based on a count of lines out from the plot region.
We can see from the figure that R chose to draw the tick mark labels on
line 1. We drew the margin titles on line 3.
Adding to plots
• points(x, y, ...) adds points
• lines(x, y, ...) adds line segments
• text(x, y, labels, ...) adds text into the graph
• abline(a, b, ...) adds the line y = a + bx
• abline(h=y, ...) adds a horizontal line
• abline(v=x, ...) adds a vertical line
Adding to plots - Example
Adding lines to plots
The best-fit lines for the five trees can be obtained using the lm()
function which relates circumference to age for each tree. A legend has
been added to identify which data points come from the different trees.
abline(lm(circumference ~ age, data = Orange, subset = Tree == "1"),
lty = 1)
abline(lm(circumference ~ age, data = Orange, subset = Tree == "2"),
lty = 2)
abline(lm(circumference ~ age, data = Orange, subset = Tree == "3"),
lty = 3)
abline(lm(circumference ~ age, data = Orange, subset = Tree == "4"),
lty = 4)
abline(lm(circumference ~ age, data = Orange, subset = Tree == "5"),
lty = 5)
legend("topleft", legend = paste("Tree", 1:5), lty = 1:5, pch = 1:5)
Adding broken lines to plots
lines(circumference ~ age, data = Orange, subset = Tree == "1", lty = 1)
lines(circumference ~ age, data = Orange, subset = Tree == "2", lty = 2)
lines(circumference ~ age, data = Orange, subset = Tree == "3", lty = 3)
lines(circumference ~ age, data = Orange, subset = Tree == "4", lty = 4)
lines(circumference ~ age, data = Orange, subset = Tree == "5", lty = 5)
Adding lines and broken lines to plots
[Figure: two side-by-side plots of circumference against age, with a legend identifying Tree 1 to Tree 5]
Adding Material Outside the Plotting Region
Example
par(mar=c(5, 5, 5, 5) + 0.1)
plot(c(1, 9), c(0, 50), type='n', xlab="", ylab="")
text(6, 40, "Plot region")
points(6, 20)
text(6, 20, "(6, 20)", adj=c(0.5, 2))
mtext(paste("Margin", 1:4), side=1:4, line=3)
mtext(paste("Line", 0:4), side=1, line=0:4,
at=3, cex=0.6)
mtext(paste("Line", 0:4), side=2, line=0:4,
at=15, cex=0.6)
mtext(paste("Line", 0:4), side=3, line=0:4,
at=3, cex=0.6)
mtext(paste("Line", 0:4), side=4, line=0:4,
at=15, cex=0.6)
Setting Graphical Parameters
If you give character strings (e.g. par("mfrow")) the function will return
the current value of the graphical parameter.
Finally, you can use a list as input to set several parameters at once.
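A sketch of the three uses of par() together: setting values, querying a value by name, and restoring saved values from a list. The choice of mfrow and mar values, the name oldPar and the use of the built-in nhtemp data are ours.

```r
# Setting parameters returns their previous values as a list
oldPar <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
par("mfrow")   # query the current value

hist(nhtemp, main = "New Haven")
plot(nhtemp)

par(oldPar)    # the saved list restores several parameters at once
```

Saving the old values before changing them, and restoring them afterwards, keeps later plots from inheriting settings they were not meant to have.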
Setting Graphical Parameters – Example
• A histogram of length
• A histogram of breadth
• A scatterplot of length versus breadth
• Side-by-side boxplots of length for each host species.
## [1] 1 1
par(mfrow = c(2,2))
hist(cuckoos$length, xlab="length")
hist(cuckoos$breadth, xlab="breadth")
plot(length ~ breadth, data=cuckoos)
boxplot(length ~ species, data=cuckoos)
[Figure: 2 x 2 grid of plots: histogram of length, histogram of breadth, scatterplot of length against breadth, and box plots of length by species]
Orange Juice Data
We are interested in knowing whether the caloric content differs for the
different brands, but we also would like to take into account differences
in the machines’ ability to measure caloric content.
Read in the Data
oj <- read.table("oj.txt",
header=TRUE)
names(oj) <- c("machine", paste("Brand",
c("A", "B", "C", "D", "E", "F")))
oj
Fix the Data Frame
rownames(oj) <- paste(oj[,1], rep(1:2,3), sep="")
oj.mat <- as.matrix(oj[,-1])
oj.mat
Plot a Dot Chart
The following code produces a dot chart ... but it is a bit hard to read.
dotchart(oj.mat)
[Figure: dot chart of oj.mat, grouped by Brand A to Brand F, with row labels M11, M12, M21, M22, M31 and M32 within each group]
Fixing the Axis Labels
Remove the M’s, since they are cluttering the vertical axis. Add a horizontal axis label
and a title.
dotchart(oj.mat, labels="", xlab="Energy (calories)")
title("Orange Juice Caloric Measurements")
[Figure: dot chart grouped by Brand A to Brand F, without row labels; horizontal axis: Energy (calories).]
Fixing the Plot
Colour the lines:
dotchart(oj.mat, labels="", xlab="Energy (calories)",
lcolor=rep(1:3,rep(2,3)))
title("Orange Juice Caloric Measurements")
[Figure: the same dot chart with the lines coloured by machine; horizontal axis: Energy (calories).]
Fixing the Plot
Colour the points:
dotchart(oj.mat, labels="", xlab="Energy (calories)",
lcolor=rep(1:3,rep(2,3)), color=rep(1:3,rep(2,3)))
title("Orange Juice Caloric Measurements")
[Figure: the same dot chart with both lines and points coloured by machine; horizontal axis: Energy (calories).]
Fixing the Plot
Use different plotting characters:
dotchart(oj.mat, labels="", xlab="Energy (calories)",
lcolor=rep(1:3,rep(2,3)), color=rep(1:3,rep(2,3)),
pch=rep(1:3, rep(2,3)))
title("Orange Juice Caloric Measurements")
[Figure: dot chart titled "Orange Juice Caloric Measurements", grouped by Brand A to Brand F, with a different plotting character for each machine; horizontal axis: Energy (calories).]
Fixing the Plot
Add axis labels to identify the machines, using axis()
dotchart(oj.mat, labels="",
xlab="Energy (calories)",
lcolor=rep(1:3,rep(2,3)),
color=rep(1:3,rep(2,3)),
pch=rep(1:3, rep(2,3)))
title("Orange Juice Caloric Measurements")
lab.locations <- seq(1.5,48,2)[-seq(4,48,4)]
labels <- paste("Machine", rep(1:3, 6))
axis(side=2, at=lab.locations,
label=labels, las=2)
[Figure: the dot chart with "Machine 1" to "Machine 3" labels added by axis() within each brand group; horizontal axis: Energy (calories).]
ouch!
Fixing the Plot
Fix the labels to identify the machines using mtext():
dotchart(oj.mat, labels="",
xlab="Energy (calories)",
lcolor=rep("grey", 18), color=
rep(1:3,rep(2,3)), pch=rep(1:3, rep(2,3)))
title("Orange Juice Caloric Measurements")
lab.locations <- seq(1.5,48,2)[-seq(4,48,4)]
labels <- paste("Machine", rep(1:3, 6))
mtext(labels, at=lab.locations, side=2,
las=2, cex=.6, col=1:3, line=-1.25)
[Figure: the dot chart with small "Machine 1" to "Machine 3" labels added by mtext() within each brand group; horizontal axis: Energy (calories).]
DATA SCIENCE 101
1
Predicting with Data
2
Example: radon release data
3
Radon release data
Plot the data first:
plot(percentage ~ diameter, data = radon,
ylab="% radon released")
[Figure: scatterplot of % radon released against diameter.]
Radon release data and best-fit line
plot(percentage ~ diameter, data = radon,
ylab="% radon released")
radon.lm <- lm(percentage ~ diameter, data = radon)
abline(radon.lm)
[Figure: scatterplot of % radon released against diameter with the best-fit line overlaid.]
The goal of these slides is to understand how the best-fit line is calculated.
5
Another example: lawn roller data
Different weights of roller (in kilograms) were used to roll over different
parts of a lawn, and the depth of the depression (in millimeters) was
recorded at various locations.
library(DAAG) # DAAG contains the roller data
plot(depression ~ weight, data = roller, pch=16)
[Figure: scatterplot of depression against weight for the roller data.]
Again, we want to predict the size of the depression for a given weight. We start by studying models for
such data. The first model to look at is the Simple Linear Regression Model.
6
The simple linear regression model
7
The simple linear regression model
• The variability in the noise is due to any factors, other than the
weight of the roller. For example, the type of soil (sandy or hard clay,
etc), or amount of moisture in different parts of the lawn, and so on.
8
The simple linear regression model
9
The simple linear regression model
[Figure: scatterplot of y against x, for x from 2 to 10.]
Very little noise, so the points lie very close to the straight line.
10
The simple linear regression model
[Figure: scatterplot of y against x, for x from 2 to 10.]
A little more noise, so the points are not as close to the straight line.
11
The simple linear regression model
[Figure: scatterplot of y against x, for x from 2 to 10.]
A lot more noise, so the points are scattered about the straight line.
12
The simple linear regression model
[Figure: scatterplot of y against x, for x from 2 to 10.]
Mostly noise, so the line is no longer very recognizable from the points.
13
The Setup
• Assumptions:
1. Expected value of y = β0 + β1x.
2. Standard Deviation(ε) = σ.
• Data: Suppose data Y1, Y2, . . . , Yn are obtained at settings
x1, x2, . . . , xn, respectively. Then the model on the data is
Yi = β0 + β1xi + εi
Either
1. the x’s are fixed values and measured without error (controlled
experiment) - Example: Radon Data
OR
2. the analysis is conditional on the observed values of x
(observational study) - Example: Roller Data
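The effect of the noise level can be explored by simulating from the model. The following is a minimal sketch, not taken from the slides: the intercept, slope, x settings and the four values of σ are arbitrary choices, and the noise is assumed to be normally distributed purely for illustration.

```r
set.seed(1)                        # arbitrary seed, for reproducibility
beta0 <- 3; beta1 <- 0.3           # illustrative intercept and slope
x <- 1:10                          # fixed settings of x

par(mfrow = c(2, 2))
for (sigma in c(0.1, 0.3, 1, 5)) {
  eps <- rnorm(length(x), mean = 0, sd = sigma)  # noise with standard deviation sigma
  y <- beta0 + beta1 * x + eps                   # Yi = beta0 + beta1 * xi + eps_i
  plot(x, y, main = paste("sigma =", sigma))
  abline(beta0, beta1)                           # the underlying straight line
}
par(mfrow = c(1, 1))
```

As σ grows, the points drift further from the line, in the same way as in the four pictures above.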
14
Parameter estimation, fitted values and residuals
◦ Assumptions:
15
Method
ei = Yi − β̂0 − β̂1xi
16
Visualizing residuals*
residViz.plot(a=14, b=0)
On Canvas, you will find a file called residViz.plot.R which can be read into R.
It plots the roller data and overlays a line with slope 0 and intercept 14.
It also calculates the residuals and plots them together with line segments linking the data points with the fitted values.
[Figure: roller data values and fitted values on the line with intercept 14 and slope 0, with line segments marking the positive and negative residuals.]
*The videos residplot2.mp4 and residplot3.mp4 also show how the fitted line and
residuals change depending on the values of the slope and intercept.
17
Visualizing residuals - same data; different line
residViz.plot(a=2, b=2)
[Figure: roller data values and fitted values on the line with intercept 2 and slope 2, with the positive and negative residuals marked.]
Visualizing residuals - same data; different line
residViz.plot(a=12, b=1)
[Figure: roller data values and fitted values on the line with intercept 12 and slope 1, with the positive and negative residuals marked.]
Visualizing residuals - same data; different line
[Figure: roller data values and fitted values on another candidate line, with the positive and negative residuals marked.]
Which plot is best? It is hard to tell, because the residuals can be negative or positive, so if we just add them
up, there will be cancellation.
20
The Key: minimize squared residuals
The video resid2plot1.mp4 also shows how the fitted line and residuals
change depending on the values of the slope and intercept.
21
The Key: minimize squared residuals
resid2Viz.plot(a=14,b=0)
[Figure: squared residuals for the roller data and the line with intercept 14 and slope 0; vertical axis: Depression in lawn (mm).]
The Key: minimize squared residuals
resid2Viz.plot(a=2,b=2)
[Figure: squared residuals for the roller data and the line with intercept 2 and slope 2; vertical axis: Depression in lawn (mm).]
The Key: minimize squared residuals
resid2Viz.plot(a=12,b=1)
[Figure: squared residuals for the roller data and the line with intercept 12 and slope 1; vertical axis: Depression in lawn (mm).]
The Key: minimize squared residuals
resid2Viz.plot(a=-2, b=2.67)
[Figure: squared residuals for the roller data and the line with intercept -2 and slope 2.67; vertical axis: Depression in lawn (mm).]
These look the smallest of all that we have seen. This suggests that the intercept
might be near -2 and the slope might be near 2.67.
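The visual comparison can be backed up numerically. The sketch below computes the sum of squared residuals for each of the four candidate lines. To keep it self-contained, the roller measurements are typed in directly; they are intended to match the roller data frame in DAAG, which is an assumption of this example. The helper name rss is hypothetical.

```r
# Roller data typed in by hand (assumed to match DAAG's roller data frame)
weight     <- c(1.9, 3.1, 3.3, 4.8, 5.3, 6.1, 6.4, 7.6, 9.8, 12.4)
depression <- c(2, 1, 5, 5, 20, 20, 23, 10, 30, 25)

# Sum of squared residuals for the line with intercept a and slope b
rss <- function(a, b) {
  residuals <- depression - (a + b * weight)
  sum(residuals^2)
}

rss(14, 0)
rss(2, 2)
rss(12, 1)
rss(-2, 2.67)   # the smallest of the four
```

Squaring removes the cancellation between positive and negative residuals, so the four lines can be ranked by a single number.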
25
Calculating the slope and intercept estimates in R
The lm function finds the slope and intercept values that minimize
the sum of the squared residuals.
## (Intercept) weight
## -2.09 2.67
26
Calculating the slope and intercept estimates in R
The slope is positive, so this means that the amount of depression in the
lawn will increase with increasing roller weight.
27
Making predictions
28
Minimizing squared residuals for the radon data
[Figures: three candidate lines for the radon data, each showing fitted values, data values, and the squared positive and negative residuals.]
Use the lm function to find the slope and intercept that give the
smallest squared residuals:
## (Intercept) diameter
## 84.2 -11.8
32
Predicting radon release
34
Calculating fitted values
The fitted values are the values of ŷ that are obtained by plugging in the
original x values.
For the roller data, the x values are the values of roller$weight:
roller$weight
## [1] 1.9 3.1 3.3 4.8 5.3 6.1 6.4 7.6 9.8 12.4
## [1] 3.03 6.27 6.81 10.86 12.21 14.37 15.18 18.42 24.36 31.38
35
Calculating fitted values using predict
## 1 2 3 4 5 6 7 8 9 10
## 2.98 6.18 6.71 10.71 12.05 14.18 14.98 18.18 24.05 30.98
36
Polling question ... yes or no?
i.e.
radon$diameter
## [1] 0.37 0.37 0.37 0.37 0.51 0.51 0.51 0.51 0.71 0.71 0.71 0.71 1.02 1.02 1.02
## [16] 1.02 1.40 1.40 1.40 1.40 1.99 1.99 1.99 1.99
37
Polling question ... answer
No!
predict(radon.lm)
## 1 2 3 4 5 6
## 79.83882 79.83882 79.83882 79.83882 78.18019 78.18019
## 7 8 9 10 11 12
## 78.18019 78.18019 75.81073 75.81073 75.81073 75.81073
## 13 14 15 16 17 18
## 72.13805 72.13805 72.13805 72.13805 67.63607 67.63607
## 19 20 21 22 23 24
## 67.63607 67.63607 60.64614 60.64614 60.64614 60.64614
There are only six different values. To see this clearly, use the table
function.
table(predict(radon.lm))
##
## 60.6461377309841 67.6360657498926 72.1380532874947
## 4 4 4
## 75.8107273313279 78.1801944563816 79.8388214439192
## 4 4 4
This shows that there are six different predicted (or fitted) values, each occurring four times.
38
Calculating residuals
## 1 2 3 4 5
## -0.9796695 -5.1797646 -1.7131138 -5.7132327 7.9533944
## 6 7 8 9 10
## 5.8199976 8.0199738 -8.1801213 5.9530377 -5.9805017
These values should be like noise - random, with no clear pattern or trend.
39
Plotting residuals
We can plot them against the fitted values to see if there is a pattern
plot(resid(roller.lm) ~ predict(roller.lm),
ylab = "roller residuals", xlab="fitted values")
[Figure: scatterplot of roller residuals against fitted values.]
No obvious trend is present. This is a characteristic of a model that fits the data well.
40
Automatic plotting of residuals
[Figure: "Residuals vs Fitted" plot for lm(depression ~ weight), with observations 5, 7 and 8 labelled.]
* The other three plots are beyond the scope of this course.
† The red curve is added to help the eye identify patterns - in this case, there might be
some nonlinearity that has been missed by the model.
41
Automatic plotting of residuals - radon data
This is the residual versus fitted value plot for the radon data.
plot(radon.lm, which = 1)
[Figure: "Residuals vs Fitted" plot for lm(percentage ~ diameter), with observations 17, 18 and 24 labelled.]
42
Polling question ... answer
Yes!
The residuals decrease and then increase. The effect is slight, but the
pattern suggests that it is possible to improve the model.
43
Making predictions on new data using predict
## 1
## 11.24658
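A prediction of this kind can be sketched as follows. The call that produced the output above is not shown here, so the new roller weight of 5 is an assumption (it is the value that is consistent with the printed prediction). The roller data are typed in so that the example runs on its own; they are intended to match the roller data frame in DAAG.

```r
# Roller data typed in by hand (assumed to match DAAG's roller data frame)
roller <- data.frame(
  weight     = c(1.9, 3.1, 3.3, 4.8, 5.3, 6.1, 6.4, 7.6, 9.8, 12.4),
  depression = c(2, 1, 5, 5, 20, 20, 23, 10, 30, 25)
)

roller.lm <- lm(depression ~ weight, data = roller)
coef(roller.lm)                        # intercept and slope estimates

# newdata must be a data frame whose column name matches the predictor
new.pred <- predict(roller.lm, newdata = data.frame(weight = 5))
new.pred
```

The newdata argument takes a data frame rather than a bare number, because predict() looks up the predictor by its column name.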
44
Making predictions on new data using predict
45
Visualizing the fitted line
plot(depression ~ weight, data = roller)
abline(roller.lm) # overlays fitted line
[Figure: scatterplot of depression against weight with the fitted line overlaid.]
46
Radon release data example (cont’d)
• Predict the percentage release when the orifice diameter is set to one
of 1.15, 1.25 and 1.35.
• Overlay the plot of the data with the fitted line.
47
Predicting radon release percentage
predict(radon.lm, newdata =
data.frame(diameter = c(1.15, 1.25, 1.35)))
## 1 2 3
## 70.59790 69.41317 68.22843
Interval predictions are:
predict(radon.lm, newdata =
data.frame(diameter = c(1.15, 1.25, 1.35)),
interval ="prediction")
For example, if the diameter is 1.15, there is 95% probability that the release
percentage will be between 63.9 and 77.3.
48
Visualizing the fitted line
plot(percentage ~ diameter, data = radon)
abline(radon.lm) # overlays fitted line
[Figure: scatterplot of percentage against diameter with the fitted line overlaid.]
49
Lists
Lists are a very flexible type of object. They literally consist of a list of
things.
You won’t often construct these yourself, but many functions return
complicated results as lists.
50
Named lists
You can see the names of the objects in a list using the names()
function, and extract parts of it:
51
Named lists
Let’s see what the objects are in the output from radon.lm.
names(radon.lm)
52
Lists
## $x
## [1] 3 2 3
##
## $y
## [1] 7 7
53
Lists - accessing elements
z$x
## [1] 3 2 3
54
Lists - accessing elements
radon.lm$coefficients
## (Intercept) diameter
## 84.22234 -11.84734
This is the same result that is obtained from the extractor function coef.
55
Lists
There are several functions which make working with lists easy. Two of
them are lapply() and vapply(). The lapply() function “applies”
another function to every element of a list and returns the results in a
new list; for example,
lapply(z, mean)
## $x
## [1] 2.666667
##
## $y
## [1] 7
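The other function mentioned above, vapply(), can be sketched in the same way. The list z is rebuilt here from the values printed earlier (x is 3, 2, 3 and y is 7, 7); the object names means.list and means.vec are hypothetical.

```r
z <- list(x = c(3, 2, 3), y = c(7, 7))   # a named list with two elements
names(z)

means.list <- lapply(z, mean)              # result is a list
means.vec  <- vapply(z, mean, numeric(1))  # result is a named numeric vector
means.vec
```

The third argument of vapply() states that each result must be a single number, so the results can be collected into a vector rather than a list.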
56
DATA SCIENCE 101
1
Programming 1: Flow Control in R
There are several R functions that control how many times statements
are repeated. The main function to use for this is for.
We will also describe how to control when code is executed and when it
is not to be executed. The main function for this is if.
2
Another look at adding lines to plots
3
Adding lines and broken lines to plots
[Figure: circumference against age for the Orange data, with a different line type and plotting character for each of Tree 1 to Tree 5.]
4
Using the for() function to save time and space
The for function can be used to save time and space when
programming with lines that repeat.
but where the “1” changes to “2” and then to “3” and so on.
5
Using the for() function to save time and space
## [1] 1 2 3 4 5
6
Using the for() function to save time and space
Syntax:
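A minimal sketch of the general form, under the assumption that the usual single-command version is intended: for (name in vector) command. The command is run once for each element of the vector, with name set to that element. The vector squares is a hypothetical example.

```r
squares <- numeric(5)                 # storage for the results

# i takes the values 1, 2, 3, 4, 5 in turn
for (i in 1:5) squares[i] <- i^2

squares
```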
7
Using the for() function on the orange tree plots
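# Convert via character so the plotting symbol matches the tree label, not the factor level order
plot(circumference ~ age, data = Orange,
     pch = as.numeric(as.character(Orange$Tree)))
for (i in 1:5) lines(circumference ~ age,
     data = subset(Orange, Tree == i), lty = i)  # the column is named Tree
legend("topleft", legend = paste("Tree", 1:5), lty = 1:5, pch = 1:5)
[Figure: circumference against age for the Orange data, with lines and a legend for Tree 1 to Tree 5.]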
8
Another example - plotting financial data
9
Another example - plotting financial data
Or we can use the for loop applied to the column index that is running from 1 to 4:
for (i in 1:4) plot(EuStockMarkets[, i])
[Figure: 2 × 2 layout of time series plots of the four columns of EuStockMarkets against Time (1992 to 1998), each with vertical axis label EuStockMarkets[, i].]
10
Factorial example
n! = 1 · 2 · 3 · · · (n − 1) · n
11
Factorial Example
For example, we could find the value of 13! using the code
n <- 13
result <- 1
for (i in 1:n) result <- result * i
result
## [1] 6227020800
12
Understanding the Code
The first line sets a variable named n to 13, and the second line
initializes result to 1, i.e. a product with no terms.
The third line starts the for() statement: the variable i will be set to the
values 1, 2, . . . , n in succession.
Line 3 multiplies result by i in each of those steps, and the final line
prints it.
13
Repeating several commands at a time
Syntax:
for (n in values) {
command 1
command 2
...
}
14
Repeating several commands at a time - cuckoos example
The code on the next slide will produce a 2 × 3 layout of plots for the
cuckoos data showing:
1. the egg breadths versus the egg lengths for the cuckoo eggs laid in
each of 6 host species’ nests.
2. the fitted straight line overlaid on each plot.
3. plot titles which give the species’ name (these are levels of the
species factor column).
15
Repeating several commands at a time - cuckoos example
library(DAAG) # contains the cuckoos data frame
par(mfrow=c(2, 3))
for (i in levels(cuckoos$species)) {
plot(breadth ~ length,
data = subset(cuckoos, species == i))
breadth.lm <- lm(breadth ~ length,
data = subset(cuckoos, species==i))
abline(breadth.lm)
title(i)
}
mtext(side=3, line=-1.5,
"Characteristics of Cuckoo Eggs Laid in Other Birds' Nests",
outer=TRUE) # outer=TRUE puts this text in the outer margin
16
Repeating several commands at a time - cuckoos example
[Figure: 2 × 3 layout of scatterplots of breadth against length, one panel per host species, each with its fitted line.]
Example - simulation of random numbers
18
Generation of pseudorandom numbers
xn = (171 xn−1)%%30269
un = xn/30269
with initial value x0. (x0 is called the seed.)
Recall that the first formula calculates the remainder after division of
171xn−1 by 30269.
The second formula ensures that the resulting u-values are between 0
and 1.
19
Generation of pseudorandom numbers
u <- numeric(30268) # the output
# will be stored here
x0 <- 27218 # arbitrarily chosen seed
x <- x0 # current x value
for (j in 1:30268) {
x <- (171 * x) %% 30269 # update x with formula 1
u[j] <- x/30269
}
The results, stored in the vector u, are in the range between 0 and 1.
These are the pseudorandom numbers, u1, u2, . . . , u30268.
20
Visualizing some of our pseudorandom numbers
u[1:5]
[Figure: histogram of u[1:5000].]
The values are all inside the interval between 0 and 1 and appear to be pretty evenly distributed over that
range.
21
Example: simulating normal random variables
22
Example - simulation of normal variates
23
Example - simulation of normal variates
We can use the runif() function to simulate the uniform variates that
we will need.
[Figure: histogram of simulated uniform variates on the interval [−0.5, 0.5].]
24
Example - simulation of normal variates
Note that we started with a Z which had only one entry, and successively
added vectors of size 10000. R automatically changes the length of Z to
make elementwise addition with U possible.
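The accumulation described here can be sketched as follows. This is a hedged reconstruction that assumes 12 uniform vectors on [−0.5, 0.5] are summed, as in the function body used later in these slides; the seed is an arbitrary choice.

```r
set.seed(101)                 # arbitrary seed, for reproducibility

Z <- 0                        # starts as a vector of length one
for (j in 1:12) {
  U <- runif(10000, min = -0.5, max = 0.5)  # 10000 uniform variates
  Z <- Z + U                  # Z is recycled to length 10000 on the first pass
}

length(Z)
mean(Z)                       # should be close to 0
sd(Z)                         # should be close to 1
```

Each uniform on [−0.5, 0.5] has mean 0 and variance 1/12, so a sum of 12 of them has mean 0 and variance 1.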
25
Simulating normal random variables - visualizing the result
hist(Z)
[Figure: histogram of Z, ranging from about −4 to 4.]
26
Simulating standard normal random variables
In fact, for our simulated sample, the values of the sample mean and
standard deviation are:
mean(Z) # sample average
## [1] 0.003215301
sd(Z) # sample standard deviation
## [1] 0.9942739
Different samples would have slightly different means and standard deviations, but all
would be pretty close to 0 and 1.
27
Summing squared standard normal variables
[Figure: histogram of X.]
... what a chi-squared distribution on 7 degrees of freedom looks like:
hist(X, main="")
[Figure: histogram of X for 7 degrees of freedom.]
It is skewed, but not as much as ... freedom is smaller.
30
The if() statement
Syntax:
31
The if() statement: Caution!
may produce an error, because R will execute the first line before you
have time to enter the second.
32
The if() statement: Recommendation
if (condition) {
commands when TRUE
} else {
commands when FALSE
}
33
The if() statement: Another Warning
These are converted to logical values using the rule that zero becomes
FALSE, and any other value becomes TRUE.
Missing values are not allowed for the condition, and will trigger an error.
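These rules can be checked directly. The following is a small sketch; the helper name check is hypothetical, and tryCatch() is used only so that the error raised by the missing value does not stop the script.

```r
# Returns a label describing how a numeric condition is treated by if()
check <- function(value) {
  if (value) "treated as TRUE" else "treated as FALSE"
}

check(0)       # zero becomes FALSE
check(-3.5)    # any other number becomes TRUE

# A missing value in the condition triggers an error
na.result <- tryCatch(check(NA), error = function(e) "error")
na.result
```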
34
Example
A simple example:
x <- 3
if (x > 2) y <- 2 * x else y <- 3 * x
35
Example - counting missing values
The cfseal data frame in DAAG contains mass measurements for various organs from
cape fur seals in addition to estimates of age and the overall weight.
summary(cfseal)
36
Example - counting missing values
The summary shows that some of the columns have missing values (NA).
For example,
is.na(cfseal$lung)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [23] FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE
sum(is.na(cfseal$lung))
## [1] 6
37
Example - counting missing values
par(mfrow=c(3, 4))
for (i in 1:11) hist(cfseal[ ,i], main=names(cfseal)[i], xlab="mass")
[Figure: 3 × 4 layout of histograms, one for each of the 11 columns of cfseal.]
... data contain missing values.
38
Example - counting missing values
Note how we can use the paste function to combine the missing value
count in Nmissing with the character string NA's.
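One way this could look is sketched below. It is an illustration under stated assumptions, not the code from the slide: the small data frame seals is a made-up stand-in for cfseal, and placing the count in the x-axis label is a guess at the intended design.

```r
# Hypothetical stand-in for a data frame with missing values
seals <- data.frame(heart = c(10, 12, NA, 15, 11),
                    lung  = c(NA, 30, NA, 41, 35),
                    liver = c(50, 61, 58, 47, 66))

par(mfrow = c(1, 3))
xlabels <- character(ncol(seals))
for (i in 1:ncol(seals)) {
  Nmissing <- sum(is.na(seals[, i]))          # count the missing values
  if (Nmissing > 0) {
    xlabels[i] <- paste(Nmissing, "NA's")     # e.g. "2 NA's"
  } else {
    xlabels[i] <- "no NA's"
  }
  hist(seals[, i], main = names(seals)[i], xlab = xlabels[i])
}
par(mfrow = c(1, 1))
xlabels
```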
39
Example - counting missing values
[Figure: 3 × 4 layout of histograms for the cfseal columns, labelled with the missing value counts.]
DATA SCIENCE 101
1
Programming 2: Functions
2
Programming 2: Functions
4
An example using an if statement
5
An example using an if statement
Z1 <- runif(100)
Z2 <- runif(100)
corplot(Z1, Z2, FALSE)
## [1] 0.01178233
A correlation near 0 tells us that the two variables are not predictable
from each other by a straight line.
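The definition of corplot is not shown here. A minimal sketch that is consistent with the calls in this section, assuming the third argument is a logical flag that requests a scatter plot, might be:

```r
# Hypothetical reconstruction: returns the correlation of x and y,
# drawing a scatter plot first when plotit is TRUE
corplot <- function(x, y, plotit) {
  if (plotit == TRUE) plot(x, y)
  cor(x, y)
}

corplot(1:10, 2 * (1:10) + 1, FALSE)   # points on a straight line
```

The if statement lets one function serve both purposes: a quick numerical summary, or the summary together with a picture.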
6
An example using an if statement
We will now apply the function to the first two columns of the European
stock price data, requesting the scatter plot as well as the computed
correlation:
corplot(EuStockMarkets[, 1], EuStockMarkets[,2], TRUE)
[Figure: scatterplot of the second column of EuStockMarkets against the first column; the points lie close to an increasing curve.]
## [1] 0.9911539
Not surprisingly, the two stock markets are very highly correlated. They take low values together, and
high values together.
7
Example - a normal random number generator
Note that this function will take n as an input. The output should be that
number of standard normal variates.
8
Example - a normal random number generator
9
Example - a normal random number generator
Using the sum of uniforms concept from the earlier example, we will use
a function body of the form:
{
Z <- 0
for (j in 1:12) {
U <- runif(n, min = -.5, max = .5)
Z <- Z + U
}
return(Z)
}
10
Example - a normal random number generator
Putting the header and body together, we have the following function:
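A sketch of the assembled function, using the body shown earlier. The function name rnormApprox is hypothetical; only the argument n and the body are taken from these slides.

```r
# Approximate standard normal generator: sum of 12 uniforms on [-0.5, 0.5]
rnormApprox <- function(n) {
  Z <- 0
  for (j in 1:12) {
    U <- runif(n, min = -.5, max = .5)
    Z <- Z + U
  }
  return(Z)
}

rnormApprox(5)   # five approximately standard normal variates
```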
11
Example - a normal random number generator
12
Functions can take any number of arguments
13
Functions can take any number of arguments
rChisq(2, 17)
14
Use of default arguments
To give the user of a function a hint as to the kind of input that the
function is expecting, we may give default values to some arguments.
If the user doesn’t specify the value, the default will be used.
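As a small, hypothetical illustration of a default value (the function power and its arguments are invented for this sketch):

```r
# p has a default value of 2, so it may be omitted in the call
power <- function(x, p = 2) {
  x^p
}

power(3)          # uses the default p = 2
power(3, p = 3)   # overrides the default
```

The default also documents the function: a reader of the header can see that p is expected to be a single number.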
15
Example
We could have used the header, i.e. the first line of the function,
16
Function environment
We won’t give a complete description here, but will limit ourselves to the
following circular definition: the environment is a reference to the
environment in which the function was defined.
This has implications for which objects the function can access.
17
Example
18
Example
## [1] 1.082443
Note, as well, that mymean does not exist in the workspace, only locally
to myfun:
mymean
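The behaviour can be reproduced with a small sketch. The definition of myfun below is an assumption, since the original is not shown here; it simply creates a local object called mymean and returns it.

```r
# Hypothetical definition: mymean is created inside the function only
myfun <- function(x) {
  mymean <- mean(x)
  mymean
}

myfun(c(1, 2, 3))

# FALSE, provided no object called mymean has been created in the workspace
found <- exists("mymean")
found
```

Typing mymean at the prompt in this situation gives an "object not found" error, because the object existed only while myfun was running.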
19
Managing complexity through functions
Most real computer programs are much longer than the examples we
give in this course.
Most people can’t keep the details in their heads all at once, so it is
extremely important to find ways to reduce the complexity.
We now give a short outline of some of the strategies that have been
effective.
20
Reminder: what are functions?
21
Example
amount = R [(1 + i)^n − 1] / i
## [1] 5031.157
23
Functions in R are objects
24
Example
{
if (n >= 2) {
sieve <- seq(2, n)
primes <- c()
for (i in seq(2, n)) {
if (any(sieve == i)) {
primes <- c(primes, i)
sieve <- c(sieve[(sieve %% i) != 0], i)
}
}
return(primes)
} else {
stop("Input value of n should be at least 2.")
}
}
25
Prime number sieve example
26
Prime number sieve example
27
Prime number sieve example
Eratosthenes <- function(n) {
# Print all prime numbers up to n (based on the sieve of Eratosthenes)
if (n >= 2) {
The noMultiples function defines j in its header, so j is a local variable, and it finds
sieve in its environment.
28
Returning multiple objects
29
Example
But we might also want to know the present value, which is (1 + i)^(−n)
times the amount.
## $amount
## [1] 5031.157
##
## $PV
## [1] 3088.694
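A function that returns both quantities in a list can be sketched as follows. The function name annuity and the inputs R = 400, i = 0.05 and n = 10 are assumptions; these inputs were chosen because they are consistent with the output printed above.

```r
# Returns the accumulated amount and its present value as a named list
annuity <- function(R, i, n) {
  amount <- R * ((1 + i)^n - 1) / i
  PV <- amount * (1 + i)^(-n)
  list(amount = amount, PV = PV)
}

result <- annuity(R = 400, i = 0.05, n = 10)
result
result$PV     # individual elements are extracted with $
```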
Exercise - smoothing a scatterplot
The faithful data set consists of the waiting times until the next
eruption of the Old Faithful geyser together with the corresponding
eruption times.
[Figure: scatterplot of waiting against eruptions for the faithful data.]
31
Smoothing a scatterplot
In other words, just take averages of y values that are near each other
according to their x values.
32
Smoothing a scatterplot
33
Smoothing a scatterplot - function header
34
Smoothing a scatterplot - function output
The output for this function will be a data frame with 2 columns: x and y,
which will correspond to the x locations where the averages are taken
and the corresponding y-averages.
Thus, we include a line such as the one at the end of the following
body-less function:
35
Smoothing a scatterplot - function body
36
Smoothing a scatterplot - function body
37
Smoothing a scatterplot - function body
Within the for() loop just created, we add a line of code which assigns
the average of the values in y[indices] to yaverages[i]:
39
Smoothing a scatterplot - testing the function
For example,
x <- seq(0, 3, length=20)
y <- x^2 + rnorm(20)
plot(x, y, pch=16)
lines(smoother(x, y, x.min=0.25, x.max=2.75, window=0.5), lwd=2)
[Figure: scatterplot of y against x with the smoothed curve overlaid.]
40
Smoothing a scatterplot - testing the function
For example,
y <- x^2 + rnorm(20)
plot(x, y, pch=16)
lines(smoother(x, y, x.min=0.25, x.max=2.75, window=0.06), lwd=2)
[Figure: scatterplot of y against x.]
41
Adding an error message - using if and stop()
Within the for loop of the function, we include the following lines of
code:
if (length(indices) < 1) {
stop("Your choice of window width is too small.")
} else {
yaverages[i] <- mean(y[indices])
}
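Putting the pieces together, a complete version of the function might look like the sketch below. Several details are assumptions rather than the course's own code: the number of x locations (the hypothetical argument n.points), the name xpoints, and the rule that a y value is included in an average when its x value lies within window of the location.

```r
smoother <- function(x, y, x.min, x.max, window, n.points = 50) {
  xpoints <- seq(x.min, x.max, length = n.points)  # where averages are taken
  yaverages <- numeric(n.points)
  for (i in 1:n.points) {
    # observations whose x value is within `window` of the current location
    indices <- which(abs(x - xpoints[i]) < window)
    if (length(indices) < 1) {
      stop("Your choice of window width is too small.")
    } else {
      yaverages[i] <- mean(y[indices])
    }
  }
  data.frame(x = xpoints, y = yaverages)
}

x <- seq(0, 3, length = 20)
y <- x^2                      # noise-free, so the result is reproducible
head(smoother(x, y, x.min = 0.25, x.max = 2.75, window = 0.5))
```

Because the result is a data frame with columns x and y, it can be passed straight to lines() to overlay the smooth curve on a scatterplot.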
42
Smoothing a scatterplot
43
Smoothing a scatterplot
Finally, note that the so-called “smooth” curve is still quite bumpy.
Observe that the window parameter does not have to be the same for
each iteration.
44
Smoothing a scatterplot
45
Smoothing the faithful scatterplot
... function to the faithful data frame, with a window parameter of 1 unit for the ... [1.5, 5.0]:
[Figure: scatterplot of waiting against eruptions for the faithful data, with eruptions running from 1.5 to 5.0.]
46
DATA SCIENCE 101
1
Visualizing and Modelling Multiple Variables
• lattice graphics
1. dot plots
2
Dot plots - in the lattice package
The lattice package contains functions for plotting that extend beyond
what is easily done with base graphics.
We will focus on two lattice graphing functions. The first gives another
version of the dot plot, but the syntax depends on a modelling formula
instead of a matrix.
Typical usage:
3
Dot plot example - lengths of cuckoo eggs
Recall the cuckoo data in the DAAG package which contain lengths of
eggs laid in the nests of other birds.
A basic lattice dot plot showing how the length distributions compare between the
different host species:
library(lattice) # provides dotplot()
library(DAAG) # contains cuckoo data
dotplot(species ~ length, data = cuckoos)
[Figure: dot plot of egg length (20 to 25) for each host species, including wren, tree.pipit, robin, pied.wagtail and hedge.sparrow.]
Wrens are small birds, so it is not surprising that the cuckoo eggs found in their nests are smaller.
4
Dot plot example - comparing treatments for anorexia
5
Dot plots - comparing treatments for anorexia
We want to know if the measured changes in weight are different for the
different therapies.
7
Dot plots - comparing treatments for anorexia
[Figure: dot plot of weight change for each treatment group.]
... distributions.
Recall the orange juice energy data from an earlier lecture where we
used base graphics to plot the energy measurements for the different
orange juice brands, taking account of the machine used to do the
measuring.
Including brand as the optional factor, we can see clearly how energy
depends on brand, for the different machines.
[Figure: dot plots of energy by machine (M1, M2, M3), with one panel for each of the brands A to F.]
Observations: Now it is clear that brand D is a pretty high energy juice, especially compared with A, C
and F. Energy in brand E is intermediate. Machine does not have a systematic effect.
10
Conditioning scatter plots
The second lattice function gives another version of the scatter plot,
which extends in a way to help visualize relatively complex data frames.
Typical usage:
Other arguments such as main, xlab and so on, can still be used.
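A hedged sketch of the general pattern, xyplot(y ~ x | g, data = ...), where g is the conditioning variable. The example uses the ethanol data frame that is supplied with the lattice package; the title and axis label text are arbitrary choices made here.

```r
library(lattice)   # provides xyplot() and the ethanol data frame

# One panel of NOx against E for each level of the compression ratio C
p <- xyplot(NOx ~ E | C, data = ethanol,
            main = "NOx emissions by compression ratio",
            xlab = "equivalence ratio (E)")
print(p)           # lattice plots are objects; printing draws them
```

Leaving out the "| C" part gives a single ordinary scatter plot of NOx against E.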
11
Conditioning scatter plots - cuckoo eggs example
A scatter plot of length versus breadth for the cuckoo eggs can be
obtained with
xyplot(length ~ breadth, data = cuckoos)
[Figure: scatterplot of length against breadth for all cuckoo eggs.]
We see a general tendency for length to increase with breadth, but this plot hides any possible effects
due to the different species.
12
Conditioning scatter plots - cuckoo eggs example
[Figure: scatterplots of length against breadth (15.0 to 17.5), one panel per host species.]
Some observations: there is still only a vague tendency for length to increase with breadth, but we can
see that both length and breadth are small for wrens. For other species, the breadth measurements tend
to be larger, while length measurements have a relatively large range.
13
Conditioning scatter plots - cuckoo eggs example
We can overlay a smooth curve in each of the panels, through the use of
the type argument.
xyplot(length ~ breadth|species, data = cuckoos, type=c("p", "smooth"))
[Figure: scatterplots of length against breadth, one panel per host species, each with a smooth curve overlaid.]
The smooth curve is fit to the data using a method similar to double smoothing described earlier.
14
Conditioning scatter plots - nitrous oxide emissions example
## NOx C E
## Min. :0.370 Min. : 7.500 Min. :0.5350
## 1st Qu.:0.953 1st Qu.: 8.625 1st Qu.:0.7618
## Median :1.754 Median :12.000 Median :0.9320
## Mean :1.957 Mean :12.034 Mean :0.9265
## 3rd Qu.:3.003 3rd Qu.:15.000 3rd Qu.:1.1098
## Max. :4.028 Max. :18.000 Max. :1.2320
All three variables are numeric; we might want to predict nitrous oxide emissions from the other two
variables.
Conditioning Plots - nitrous oxide emissions example
Because there are several measurements taken at only five values of the
compression ratio C, we can construct five scatter plots, each relating
nitrous oxide emissions to E for a fixed value of C.
We say that we are conditioning on C, and each scatter plot tells us how
NOx and E relate to each other, after accounting for C.
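The conditioning described above can be sketched as follows, assuming that ethanol is the data frame of that name shipped with the lattice package (it has the columns NOx, C and E):

```r
library(lattice)  # the ethanol data frame ships with lattice

# C takes only a handful of distinct values, so conditioning on it
# gives one panel per compression ratio.
sort(unique(ethanol$C))

# NOx against E, with a separate panel for each value of C.
p <- xyplot(NOx ~ E | C, data = ethanol)
print(p)
```

The vertical bar in the formula reads as "given": NOx against E, given C.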
[Figure: conditioning plot of NOx against E, one panel for each of the five values of C.]
Each panel shows how the nitrous oxide emissions increase with equivalency ratio to a maximum and
then decrease again.
The orange bar in the strip at the top of each panel indicates the relative size of C for that panel:
C is lowest in the lower left panel and highest in the upper right panel.
Why are the conditioning plots useful? We can see this by comparing
with what we would get by simply looking at a scatter plot of NOx vs E:
xyplot(NOx ~ E, data=ethanol)
[Figure: scatter plot of NOx against E for all of the data, ignoring C; E axis from 0.6 to 1.2.]
In the conditioning plots, a pattern of increase followed by decrease was clearly evident.
When we ignore the effects of C, we see very complicated looking patterns in the relation
between NOx and E. These patterns are due to C, not to how NOx and E are related.
Overlaying the conditioning plots with smooth curves
[Figure: the conditioning plots of NOx against E given C, each panel overlaid with a smooth curve.]
The span argument indicates what proportion of the data should be used to estimate each point of the
smooth curve.
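A hedged sketch of how the span argument might be supplied, again assuming the ethanol data from lattice. The two span values are illustrative choices only; the value used for the slide is not shown in the text.

```r
library(lattice)

# type = c("p", "smooth") draws the points plus a loess curve in each panel.
# span is passed on to the smoother: larger values use more of the data
# around each point and so give a smoother, less wiggly curve.
p.less.smooth <- xyplot(NOx ~ E | C, data = ethanol,
                        type = c("p", "smooth"), span = 0.75)
p.more.smooth <- xyplot(NOx ~ E | C, data = ethanol,
                        type = c("p", "smooth"), span = 1)
print(p.more.smooth)
```

Printing the two objects one after the other is a quick way to see how sensitive the fitted curves are to the choice of span.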
Another look at the anorexia data
[Figure: scatter plots of Postwt against Prewt in separate panels; Prewt axis from 70 to 95.]
For pre-study weights above 82 pounds, the control is not having a good effect, but the
other therapies are. For pre-study weights below about 80-82 pounds, it does not seem to make a
difference.
We can reconstruct the dot plots now to take this new information into
account.
We can create a factor which separates the very low pre-weight subjects
from the others as follows:
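One possible way to build such a factor is sketched below. It assumes the anorexia data frame from the MASS package, a cutoff of 82 pounds read off the plot, and that change means post-study weight minus pre-study weight; the slides' own definitions are not shown here, so all three are assumptions.

```r
library(MASS)  # anorexia: Treat, Prewt, Postwt

# Work on a copy so the packaged data set is left untouched.
anorexia <- MASS::anorexia

# Weight change over the study (assumed definition of `change`).
anorexia$change <- anorexia$Postwt - anorexia$Prewt

# Assumed cutoff of 82 pounds for a "very low" pre-study weight.
anorexia$lowPrewt <- factor(anorexia$Prewt < 82,
                            levels = c(FALSE, TRUE),
                            labels = c("Prewt >= 82", "Prewt < 82"))

# How many subjects fall in each group, by treatment.
table(anorexia$lowPrewt, anorexia$Treat)
```

Giving the factor descriptive labels means the panel strips in any later conditioned plot say which group they show.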
The dot plots, conditional on whether the pre-study weight was very low
or not are constructed as follows:
dotplot(Treat ~ change|lowPrewt, data = anorexia,
xlab = "weight change", ylab="treatment")
[Figure: dot plots of weight change (about -10 to 20) for the treatments FT, Cont and CBT, one panel per level of lowPrewt.]
Now, we see that for subjects with a very low pre-study weight, there are no differences, but for subjects
with a high enough pre-study weight, the therapies really appear to help, especially the Family Therapy.
Another example - how far do elastic bands travel?
The data from two such experiments are contained in elastic1 and
elastic2 in the DAAG package.
names(elastic2) # experiment 2
We first create a single data frame that contains the data from both
experiments.
We will create a variable called expt which will indicate for which
experiment the measurement was taken.
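The combining step might look like the sketch below, assuming the DAAG package is installed and that both data frames have the same two columns, stretch and distance. The name elastic for the combined data frame is a choice made here.

```r
library(DAAG)     # elastic1 and elastic2
library(lattice)

# Tag each data frame with its experiment number, then stack them.
e1 <- cbind(elastic1, expt = 1)
e2 <- cbind(elastic2, expt = 2)
elastic <- rbind(e1, e2)

# Store the experiment indicator as a factor so it can be used for conditioning.
elastic$expt <- factor(elastic$expt)

# Distance against stretch, one panel per experiment.
p <- xyplot(distance ~ stretch | expt, data = elastic)
print(p)
```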
[Figure: scatter plots of distance (about 100 to 300) against stretch (30 to 60), one panel per experiment (expt).]
In both experiments, we see that as the amount of stretch increases, the elastic band travels farther -
and the relationship is pretty close to linear.
Summary
When you have multiple variables and you are interested in predicting
one of them, using the other variables, a simple scatter plot might not
display all of the information in the data.
DATA SCIENCE 101
Predicting with Several Numeric Variables
We will see that we can still make predictions using the predict
function.
Predicting brain weight from body weight and litter size
Look at all pairwise relationships
pairs(litters, pch=16)
[Figure: scatter plot matrix of lsize, bodywt and brainwt.]
It appears that brain weight increases with body weight, and it decreases with litter size.
Setting up the predictive model for brain weight
In order to find out how brain weight relates to both body weight and
litter size, we can use the following model:
There is still a response variable brainwt on the left side of the model
formula, but now there are two predictor variables bodywt and lsize on
the right side of the model formula:
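Written out, a model of this kind is usually taken to be linear in the two predictors with an added error term. The coefficient symbols below are notation introduced here for illustration, not taken from the slides:

```latex
\text{brainwt} = \beta_0 + \beta_1\,\text{bodywt} + \beta_2\,\text{lsize} + \varepsilon
```

In R's formula notation this corresponds to brainwt ~ bodywt + lsize; the intercept is included automatically.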
Fitting the model in R
litters.lm <- lm(brainwt ~ bodywt + lsize, data = litters)
coef(litters.lm)
Note that this fitted model says that for a fixed body weight, brain weight
is actually higher for larger litters.
The brain sparing effect can be visualized - but only with effort
Our earlier visualization with the pairs plot did not reveal the brain
sparing effect, but a conditional plot can. We need to condition on different
levels of body weight to see this.
We will use the cut function to turn the numeric bodywt variable into a
factor with interval based categories.
Example of use of cut, where we find intervals (5, 6], (6, 7], ..., (9, 10]
which contain the different body weights:
cut(litters$bodywt, 5:10)
## [1] (9,10] (9,10] (9,10] (9,10] (8,9] (9,10] (8,9] (8,9] (7,8]
## [10] (8,9] (7,8] (7,8] (6,7] (7,8] (6,7] (6,7] (7,8] (6,7]
## [19] (5,6] (6,7]
## Levels: (5,6] (6,7] (7,8] (8,9] (9,10]
The output tells us that the first four body weights are in the interval (9, 10] and the fifth is in the
interval (8, 9], which agrees with the tabular display on slide 3.
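To see how cut treats values that sit exactly on a break point, here is a small self-contained sketch. The three weights are made up for illustration and are not from the litters data.

```r
# Three made-up weights, chosen to land on and between the break points.
w <- c(5.5, 6, 9.7)

# Breaks at 5, 6, ..., 10 give intervals that are open on the left and
# closed on the right, so 6 falls in (5,6], not in (6,7].
f <- cut(w, breaks = 5:10)
f

# Counts per interval, including the intervals that are empty.
table(f)
```

The result is a factor with one level per interval, which is why it can be used directly as a conditioning variable in a lattice formula.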
Then we plot brain weight against litter size for each of these groups
with the xyplot:
xyplot(brainwt ~ lsize|cut(bodywt, cutpoints),
data = litters, type=c("p", "smooth"), span = 2)
[Figure: brainwt against lsize with a smooth curve, one panel per body weight interval.]
This plot shows that for relatively fixed values of body weight, the brain weight is somewhat more likely
to grow with litter size than to decrease.
[Figure: brainwt (0.38 to 0.44) against lsize (4 to 12) in a single panel, with points and smooth curves coloured by body weight group.]
This time, we use different colours, using the groups argument, to represent the points and smooths
corresponding to the different body weights. Now, we see that, often, for fixed body weight, brain weight
increases with litter size. The light green and blue points are exceptions.
Making predictions with the fitted model
## 1
## 0.425
We can also obtain a 95% prediction interval:
predict(litters.lm, newdata =
data.frame(bodywt = 8.5, lsize = 6), interval="prediction")
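For reference, a self-contained sketch of both kinds of prediction, assuming litters is the data frame of that name in the DAAG package. The object names new.mouse and pi95 are introduced here for illustration.

```r
library(DAAG)  # assumed source of the litters data frame

litters.lm <- lm(brainwt ~ bodywt + lsize, data = litters)

# A hypothetical new case: body weight 8.5 and litter size 6.
new.mouse <- data.frame(bodywt = 8.5, lsize = 6)

# Point prediction of brain weight.
predict(litters.lm, newdata = new.mouse)

# 95% prediction interval: columns fit, lwr and upr.
pi95 <- predict(litters.lm, newdata = new.mouse,
                interval = "prediction", level = 0.95)
pi95
```

The newdata data frame must use the same column names as the predictors in the model formula; otherwise predict cannot match them up.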
• PRESS residuals:
$e_{(i)} = y_i - \hat{y}_{(i)}$.
Here, $y_i$ is the $i$th observed response and $\hat{y}_{(i)}$ is the predicted value
at the $i$th observation based on the regression of $y$ against the $x$s,
omitting the $i$th observation.
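The definition above can be checked by hand. The sketch below uses the built-in women data purely as a convenient example (it is not the data used in these slides), computes each PRESS residual by refitting without one observation, and compares the result with the standard shortcut based on leverages.

```r
# Ordinary fit on all of the data.
fit <- lm(weight ~ height, data = women)

# Leave-one-out version: refit without observation i, predict it,
# and record the prediction error.
n <- nrow(women)
press.resid <- numeric(n)
for (i in seq_len(n)) {
  fit.i <- lm(weight ~ height, data = women[-i, ])
  press.resid[i] <- women$weight[i] - predict(fit.i, newdata = women[i, ])
}

# Shortcut: ordinary residual divided by (1 - leverage).
shortcut <- residuals(fit) / (1 - hatvalues(fit))

sum(press.resid^2)                        # the PRESS statistic
all.equal(press.resid, unname(shortcut))  # the two versions agree
```

Because every PRESS residual is at least as large in size as the ordinary residual, the PRESS statistic is never smaller than the residual sum of squares.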
PRESS - litters example
Example - winning football games
We can fit the model that relates y to ALL of the x’s by using a dot (.):
all.lm <- lm(y ~ . , data = table.b1)
PRESS(all.lm) # calculate PRESS value for this full model
## [1] 145.9
Fit a model that only contains x2, x4, x7, x8 and x9:
five.lm <- lm(y ~ x2 + x4 + x7 + x8 + x9, data = table.b1)
PRESS(five.lm)
## [1] 97.13
## [1] 119.2
Since the PRESS value is smallest for the five variable model, we would
prefer that one.
[Figure: scatter plot of y (0 to 10) against x2.]
[Figure: scatter plots of y against x2 in two panels labelled FALSE and TRUE.]
By considering roughly fixed values of x7 and x8, we have more precise predictions of y based on x2.
We can predict the number of wins for a team with 2000 passing yards
(x2), a 60% field goal percentage (x4), 80% rushing (x7), 1900 opponent
rushing yards (x8) and 1800 opponent passing yards (x9):
## 1
## 12.99
We would predict that this team would win 13 out of the 14 games.
The prediction interval is pretty wide (7 to 18), and since there are only
14 games, some of the interval is impossible. We could be very
confident that this team will win at least one-half of its games.
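The point prediction and interval discussed above could be obtained along the following lines. This is a sketch that assumes table.b1 comes from the MPV package (which also provides a PRESS function); the names new.team and team.pi are introduced here.

```r
library(MPV)  # assumed source of table.b1

five.lm <- lm(y ~ x2 + x4 + x7 + x8 + x9, data = table.b1)

# The hypothetical team described in the text.
new.team <- data.frame(x2 = 2000, x4 = 60, x7 = 80, x8 = 1900, x9 = 1800)

# Point prediction of the number of wins.
predict(five.lm, newdata = new.team)

# Prediction interval: columns fit, lwr and upr.
team.pi <- predict(five.lm, newdata = new.team, interval = "prediction")

# Wins cannot be negative or exceed the 14 games played, so clip
# the interval to the possible range before reporting it.
pmin(pmax(team.pi, 0), 14)
```

Clipping is only a presentation step; it does not change the fitted model, which knows nothing about the 14-game limit.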
Summary
The lm() function can be used to fit predictive models where there are
several predictor variables.
DATA SCIENCE 101
Term 1, 2021W
Predicting with regression trees
source("driverhistory.R")
head(driverhistory, n=6)
We plot age against sex, with pink for the accident cases and
blue for the non-accident cases:
[Figure: age (about 20 to 60) plotted against sex (F, M), with points coloured by accident status.]
Not all males have accidents, and not all young drivers have accidents.
But ...
If we split the data set between the sexes and then divide the males at an
age of around 25, we can separate most of the accident cases from the
non-accident cases.
For a new case, we could then make a prediction. For example, a 40 year old
female would be predicted not to have an accident, while a 20 year old
male would be predicted to have an accident.
library(rpart)
driver.rpart <- rpart(CarAccident ~ Age + Sex,
data = driverhistory)
Here, we are using the rpart function in the rpart package. Note that its
syntax is similar to lm.
plot(driver.rpart)
text(driver.rpart)
[Figure: regression tree with root split Sex=a; the leaf values shown include 0.1538 and 0.9091.]
Let’s read the tree for a 40 year old female, using the rule that if the
statement at the top of a split is true, we take the left branch. By default,
text() labels the levels of a factor with letters in level order, so “a” is
Female and we move left. 40 is greater than 20.5, so we
move left again. The probability of an accident is very low.
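The leaf values of such a tree are proportions. Since driverhistory.R is not reproduced here, the sketch below shows the same idea on a stand-in: the kyphosis data frame that ships with rpart, with its two-level outcome recoded as 0/1. The variable present and the new case are introduced here for illustration.

```r
library(rpart)  # also provides the kyphosis data frame

# Stand-in for the accident data: a 0/1 response fitted with the
# default regression ("anova") method.
kyphosis$present <- as.numeric(kyphosis$Kyphosis == "present")
kyph.rpart <- rpart(present ~ Age + Start, data = kyphosis)

# The prediction for a new case is the mean of the 0/1 response in
# the leaf it falls into, that is, a proportion between 0 and 1.
new.child <- data.frame(Age = 40, Start = 10)
p.hat <- predict(kyph.rpart, newdata = new.child)
p.hat
```

Using predict in this way saves reading the tree by hand, and follows the same left/right rule described above.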
Predicting with classification trees
The earlier tree was called a regression tree. Now, we will construct a
classification tree, using method = "class".
library(rpart)
driver.rpart <- rpart(CarAccident ~ Age + Sex,
data = driverhistory, method="class")
plot(driver.rpart)
text(driver.rpart)
[Figure: classification tree with root split Sex=a and a second split Age>=26.5; leaves labelled FALSE, FALSE and TRUE.]
Let’s read the tree for a 40 year old female, using the rule that if the
statement at the top of a split is true, we take the left branch. We predict
this driver to not have an accident.
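With a classification tree, predict can return the predicted label directly. As driverhistory.R is not available here, this sketch again uses the kyphosis data from rpart as a stand-in; the new case is hypothetical.

```r
library(rpart)

# Classification tree for a two-level factor response.
kyph.class <- rpart(Kyphosis ~ Age + Start, data = kyphosis,
                    method = "class")

new.child <- data.frame(Age = 40, Start = 10)

# Predicted class label for the new case.
pred.class <- predict(kyph.class, newdata = new.child, type = "class")
pred.class

# Class proportions in the leaf the new case falls into.
pred.prob <- predict(kyph.class, newdata = new.child, type = "prob")
pred.prob
```

The label returned by type = "class" is simply the class with the largest proportion in that leaf.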
Predicting Spam with classification trees
The data are in the spam7 data frame in the DAAG package:
We can see how well the variables might predict spam with side-by-side
box plots, such as
bwplot(yesno ~ crl.tot, data = spam7)
[Figure: side-by-side box plots of crl.tot for yesno = y and yesno = n.]
Predicting with classification trees
plot(spam.tree)
text(spam.tree)
[Figure: classification tree for the spam data, with splits dollar< 0.0555, bang< 0.0915, crl.tot< 85.5, bang< 0.7735 and crl.tot< 17, and leaves labelled y or n.]