STA 421 Notes II

STA 421 NOTES BATCH II, R. WACHANA
3. More on R Graphics
Not only does R have fancy graphical tools, it also has all sorts of useful commands that allow
users to control almost every aspect of their graphical output down to the finest details.
3.1 Histogram
We will use a data set fuel.frame which is based on makes of cars taken from the April 1990
issue of Consumer Reports.
> fuel.frame<-read.table("c:/fuel-frame.txt", header=T, sep=",")
> names(fuel.frame)
[1] "row.names" "Weight" "Disp." "Mileage" "Fuel" "Type"
> attach(fuel.frame)
attach() lets you reference variables in fuel.frame without the cumbersome fuel.frame$
prefix.
In general, graphic functions are very flexible and intuitive to use. For example, hist()
produces a histogram, boxplot() does a boxplot, etc.
> hist(Mileage)
> hist(Mileage, freq=F) # if probability instead of frequency is desired
Let us look at the Old Faithful geyser data, which is a built-in R data set.
> data(faithful)
> attach(faithful)
> names(faithful)
[1] "eruptions" "waiting"
> hist(eruptions, seq(1.6, 5.2, 0.2), prob=T)

> lines(density(eruptions, bw=0.1))
> rug(eruptions, side=1)
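The same histogram can be produced without attach(), which avoids name clashes on the search path. The sketch below uses with() on the built-in faithful data; the object names h and d are introduced here only for illustration.

```r
# Histogram of eruption times with a kernel density overlay, without attach()
h <- with(faithful, hist(eruptions, breaks = seq(1.6, 5.2, 0.2), prob = TRUE))
d <- density(faithful$eruptions, bw = 0.1)   # kernel density estimate
lines(d)                                     # overlay the density on the histogram
rug(faithful$eruptions, side = 1)            # one tick per observation on the x-axis

# With prob=TRUE the bar areas sum to 1
sum(h$density * diff(h$breaks))
```

Because the bars are on the density scale, the density curve and the histogram share the same y-axis.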
3.2 Boxplot
> boxplot(Weight) # usual vertical boxplot
> boxplot(Weight, horizontal=T) # horizontal boxplot
> rug(Weight, side=2)
If you want to get the statistics involved in the boxplots, the following commands show them. In
this example, a$stats gives the value of the lower end of the whisker, the first quartile (25th
percentile), second quartile (median=50th percentile), third quartile (75th percentile), and the
upper end of the whisker.
> a<-boxplot(Weight, plot=F)
> a$stats
[,1]
[1,] 1845.0
[2,] 2567.5
[3,] 2885.0
[4,] 3242.5
[5,] 3855.0
> a #gives additional information
> fivenum(Weight) #directly obtain the five number summary
[1] 1845.0 2567.5 2885.0 3242.5 3855.0
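Here a$stats and fivenum() agree, which is what happens when no observation lies beyond the whiskers. When outliers are present, the whisker ends in a$stats stop at the most extreme non-outlying points and so can differ from the minimum and maximum. A small sketch with a made-up vector w (hypothetical data, not taken from fuel.frame):

```r
w <- c(10, 12, 13, 14, 15, 16, 18, 40)   # hypothetical data; 40 is an outlier
b <- boxplot(w, plot = FALSE)
b$stats[, 1]     # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out            # points that would be drawn individually as outliers
fivenum(w)       # minimum, lower hinge, median, upper hinge, maximum
```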
Boxplots are most useful when comparing grouped data. For example, side-by-side boxplots of
weights grouped by vehicle type are shown below:
> boxplot(Weight ~Type)
> title("Weight by Vehicle Types")
On-line help is available for the commands:

> help(hist)
> help(boxplot)
3.3 plot()
plot() is a general graphic command with numerous options.
> plot(Weight)
The following command produces a scatterplot with Weight on the x-axis and Mileage on the y-axis.
> plot(Weight, Mileage, main="Weight vs. Mileage")
A fitted straight line is shown in the plot by executing two more commands.
> fit<-lm(Mileage~Weight)
> abline(fit)
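For readers without the fuel.frame file, the same steps can be tried on the built-in mtcars data, where wt (weight in 1000 lb) and mpg play the roles of Weight and Mileage. This is only a sketch under that substitution; fit2 is a name chosen here.

```r
# Scatterplot with a fitted least-squares line, using built-in data
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lb)",
     ylab = "Miles per gallon", main = "Weight vs. Mileage (mtcars)")
fit2 <- lm(mpg ~ wt, data = mtcars)   # regress mileage on weight
abline(fit2)                          # add the fitted line to the existing plot
coef(fit2)                            # intercept and slope of the line
```

Passing data= to lm() makes attach() unnecessary.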
3.4 matplot()
matplot() is used to plot two or more vectors of equal length.
> y60<-c(316.27, 316.81, 317.42, 318.87, 319.87, 319.43, 318.01, 315.74,
314.00, 313.68, 314.84, 316.03)
> y70<-c(324.89, 325.82, 326.77, 327.97, 327.91, 327.50, 326.18, 324.53,
322.93, 322.90, 323.85, 324.96)
> y80<-c(337.84, 338.19, 339.91, 340.60, 341.29, 341.00, 339.39, 337.43,
335.72, 335.84, 336.93, 338.04)
> y90<-c(353.50, 354.55, 355.23, 356.04, 357.00, 356.07, 354.67, 352.76,
350.82, 351.04, 352.69, 354.07)
> y97<-c(363.23, 364.06, 364.61, 366.40, 366.84, 365.68, 364.52, 362.57,
360.24, 360.83, 362.49, 364.34)
> CO2<-data.frame(y60, y70, y80, y90, y97)
> row.names(CO2)<-c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec")
> CO2
y60 y70 y80 y90 y97

Jan 316.27 324.89 337.84 353.50 363.23
Feb 316.81 325.82 338.19 354.55 364.06
Mar 317.42 326.77 339.91 355.23 364.61
Apr 318.87 327.97 340.60 356.04 366.40
May 319.87 327.91 341.29 357.00 366.84
Jun 319.43 327.50 341.00 356.07 365.68
Jul 318.01 326.18 339.39 354.67 364.52
Aug 315.74 324.53 337.43 352.76 362.57
Sep 314.00 322.93 335.72 350.82 360.24
Oct 313.68 322.90 335.84 351.04 360.83
Nov 314.84 323.85 336.93 352.69 362.49
Dec 316.03 324.96 338.04 354.07 364.34
> matplot(CO2)
Note that the observations labeled 1 represent the monthly CO2 levels for 1960, those labeled 2
represent the levels for 1970, and so on. We can enhance the plot by changing the line types and
adding axis labels and titles:
> matplot(CO2,axes=F,frame=T,type='b',ylab="")
> #axes=F: initially do not draw the axes;
> #frame=T: a box is drawn around the plot;
> #type='b': both lines and characters represent a series;
> #ylab="": no label for the y-axis is shown;
> #ylim=c(310,400): (optional, not used above) specifies the y-axis range
> axis(2) # put numerical annotations at the tickmarks in y-axis;
> axis(1, 1:12, row.names(CO2))
> # use the Monthly names for the tickmarks in x-axis; length is 12;
> title(xlab="Month") #label for x-axis;
> title(ylab="CO2 (ppm)")#label for y-axis;
> title("Monthly CO2 Concentration \n for 1960, 1970, 1980, 1990 and 1997")
> # two-line title for the matplot
4. Plot Options
4.1 Multiple plots in a single graphic window

You can have more than one plot in a graphic window. For example, par(mfrow=c(1,2)) allows
you to have two plots side by side. par(mfrow=c(2,3)) allows 6 plots to appear on a page (2
rows of 3 plots each). Note that the arrangement remains in effect until you change it. If you
want to go back to the one plot per page setting, type par(mfrow=c(1,1)).
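Since par() returns the previous values of the parameters it changes, one common pattern is to save them and restore them afterwards. A minimal sketch using the built-in faithful data (the name op is chosen here for illustration):

```r
op <- par(mfrow = c(1, 2))     # switch to 1 row, 2 columns; op holds the old setting
hist(faithful$eruptions, main = "Eruptions")
boxplot(faithful$waiting, main = "Waiting")
par(op)                        # restore the layout that was in effect before
par("mfrow")                   # check the current layout
```

This avoids having to remember what the earlier layout was.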
4.2 Adjusting graphical parameters

4.2.1 Labels and title; axis limits
Any plot benefits from clear and concise labels, which greatly enhance its readability.
> plot(Fuel, Weight)
If the main title is too long, you can split it across two lines with \n. Adding a subtitle below
the horizontal axis label is just as easy:
> title(main="Title is too long \n so split it into two", sub="subtitle goes here")
By default, when you issue a plot command R inserts the variable name(s) as axis labels if they are
available and figures out the ranges of the x-axis and y-axis by itself. Sometimes you may want to change these:
> plot(Fuel, Weight, ylab="Weight in pounds", ylim=c(1000,6000))
Similarly, you can specify xlab and xlim to change the x-axis. If you do not want the default labels
to appear, specify xlab=" ", ylab=" ". This gives you a plot with no axis labels. Of course, you
can add the labels afterwards using the appropriate arguments of title().
> plot(Mileage, Weight, xlab="Miles per gallon", ylab="Weight in pounds",
xlim=c(20,30),ylim=c(2000,4000))
> title(main="Weight versus Mileage \n data=fuel.frame;", sub="Figure 4.1")
4.2.2 Types for plots and lines

In a series plot (especially time series plot), type provides useful options:
> par(mfrow=c(2,2))
> plot(Fuel, type="l"); title("lines")
> plot(Fuel, type="b"); title("both")
> plot(Fuel, type="o"); title("overstruck")
> plot(Fuel, type="h"); title("high density")
You can also specify the line type using the lty argument within the plot() command:
> plot(Fuel, type="l", lty=1) #the usual series plot (solid line)
> plot(Fuel, type="l", lty=2) #shows a dashed line instead; built-in line types run from 0 to 6
> plot(Fuel, type="l", lty=1); title(main="Fuel data", sub="lty=1")
Note that we can control the thickness of the lines with lwd; lwd=1 is the default, and larger values such as lwd=5 give thicker lines.
4.3 Colors and characters
You can change the color by specifying

> plot(Fuel, col=2)
which shows a plot in a different color. The default is col=1. The actual color assignment
depends on the system you are using. You may want to experiment with different numbers. Of
course you can specify the col option together with other options such as type or lty. The pch
option allows you to choose alternative plotting characters when making a points-type plot. For
example:
> plot(Fuel, pch="*") # plots with * characters
> plot(Fuel, pch="M") # plots with M.
4.4 Controlling axis line
bty ="n"; No box is drawn around the plot, although the x and y axes are still drawn.
bty="o"; The default box type; draws a four-sided box around the plot.
bty="c"; Draws a three-sided box around the plot in the shape of an uppercase "C."
bty="l"; Draws a two-sided box around the plot in the shape of an uppercase "L."
bty="7"; Draws a two-sided box around the plot in the shape of a square numeral "7."
> par(mfrow = c(2,2))
> plot(Fuel)
> plot(Fuel, bty="l")
> plot(Fuel, bty="7")
> plot(Fuel, bty="c")
4.5 Controlling tick marks

The tck parameter is used to control the length of tick marks. tck=1 draws grid lines. Any positive
value between 0 and 1 draws inward tick marks for each axis. Also, with some more work you
can have tick marks of different lengths, as the following example shows.
> plot(Fuel, main="Default")

> plot(Fuel, tck=0.05, main="tck=0.05")
> plot(Fuel, tck=1, main="tck=1")
> plot(Fuel, axes=F, main="Different tick marks for each axis")
> #axes=F suppresses the drawing of axis
> axis(1)# draws x-axis.
> axis(2, tck=1, lty=2) # draws y-axis with horizontal grid of dotted line
> box()# draws box around the remaining sides.
4.6 Legend
legend() is useful when adding more information to the existing plot.
In the following example, the legend() command says
(1) put a box whose upper left corner coordinates are x=30 and y=3.5;
(2) write the two texts Fuel and Smoothed Fuel within the box together with corresponding
symbols described in pch and lty arguments.
> par(mfrow = c(1,1))
> plot(Fuel)
> lines(lowess(Fuel))
> legend(30,3.5, c("Fuel","Smoothed Fuel"), pch="* ", lty=c(0,1))
If you want to keep the legend box from appearing, add bty="n" to the legend command.
4.7 Putting text to the plot; controlling the text size
mtext() allows you to place text on any of the four sides of the plot. Starting from the bottom
(side=1), the sides are numbered clockwise up to side 4. The plot command in the example
suppresses the axis labels and the plotted points, so it gives just the frame. Also shown is the
use of the cex (character expansion) argument, which controls the relative size of the text
characters. By default, cex is set to 1, so graphics text and symbols appear in the default font
size. With cex=2, text appears at twice the default font size. The text() statement allows precise
positioning of the text at any specified point. The first text statement puts the text within the
quotation marks centered at x=15, y=4.3. By using the optional argument adj, you can align the
text to the left (adj=0), so that the specified coordinates are the starting point of the text, or
to the right (adj=1), so that they are its end point.
> plot(Fuel, xlab=" ", ylab=" ", type="n")

> mtext("Text on side 1, cex=1", side=1,cex=1)
> mtext("Text on side 2, cex=1.2", side=2,cex=1.2)
> mtext("Text on side 3, cex=1.5", side=3,cex=1.5)
> mtext("Text on side 4, cex=2", side=4,cex=2)
> text(15, 4.3, "text(15, 4.3)")
> text(35, 3.5, adj=0, "text(35, 3.5), left aligned")
> text(40, 5, adj=1, "text(40, 5), right aligned")
4.8 Adding symbols to plots
abline() can be used to add a straight line to a plot.

abline(a,b) a=y-intercept, b=slope.
abline(h=30) draws a horizontal line at y=30.
abline(v=12) draws a vertical line at x=12.
4.9 Adding arrow and line segment

> plot(Mileage, Fuel)
> arrows(23,3.5,24.5,3.9)
> segments(31.96713,3.115541, 29.97309,3.309592)
> title("arrow and segment")
> text(23,3.4,"Chrysler Le Baron V6", cex=0.7)
4.10 Identifying plotted points

While examining a plot, you can identify individual data points, such as possible outliers, using
the identify() function.
> plot(Fuel)
> identify(Fuel, n=3)
After you press return, R waits for you to identify n=3 points with the mouse. Move the mouse
cursor over the graphics window and click on a data point. The observation number then appears
next to the point, thus making the point identifiable.
4.11 Managing graphics windows

Normally, high-level graphic commands (hist(), plot(), boxplot(), ...) produce a plot
which replaces the previous one. To avoid this, use win.graph() to open a separate graphic
window (win.graph() is available only on Windows; dev.new() opens a new device on any
platform). Even if more than one graphics window is open, only one window is active, i.e., as
long as you don't change the active window, subsequent plotting commands will show the plot in
that particular window. dev.cur() gives the currently active window, dev.list() lists all
open graphics windows, dev.set() changes the active window, dev.off() closes the
current graphic window, and graphics.off() closes all the open graphics windows at once.
The following examples assume that currently no graphic window is open.
> for (i in 1:3) win.graph() #open three graphic windows

> dev.list()
windows windows windows
2 3 4
> dev.cur()
windows
4
> dev.set(3) #change the current window to window 3
windows
3
> dev.cur() #check it
windows
3
> dev.off() #close the current window and window 4 is active
windows
4
> dev.list()
windows windows
2 4
> graphics.off() # now close the two remaining windows
> dev.list()
Correlations and scatter plots
The following computes correlations among write, read, math and science with listwise deletion of
missing values. By default cor() returns NA if any value is missing, so it is important to set the
use argument (for example, use="complete.obs") to indicate how the missing values should be handled.
# correlation of a pair of variables

cor(write, math)
cor(write, science)
cor(write, science, use="complete.obs")
# correlation matrix
cor(read.sci, use="complete.obs")
cor(read.sci, use="pairwise.complete.obs")
plot(math, write)
# scatter plot matrix

plot(read.sci)
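The difference between the settings of the use argument is easiest to see on a small example. The vectors x, y and z below are made up for illustration and are not part of the hs0 data.

```r
x <- c(1, 2, 3, 4, 5, NA)
y <- c(2, 4, 5, 4, 5, 7)
z <- c(1, NA, 2, 4, 3, 6)

cor(x, y)                                # NA, because x has a missing value
cor(x, y, use = "complete.obs")          # uses only the 5 complete pairs

m <- cbind(x, y, z)
cor(m, use = "complete.obs")             # listwise: rows with any NA are dropped
cor(m, use = "pairwise.complete.obs")    # each pair uses its own complete rows
```

With listwise deletion every correlation is based on the same rows; with pairwise deletion the number of rows can differ from one pair to the next.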
Unless you are going to continue working with the hs0 data frame it is generally a good idea to
detach all attached data frames.
detach()
detach()
Data Modification in R
1.0 R functions used in this session and the syntax file
comment     adds a comment to an object
sapply      applies a function over a list or vector
is.factor   checks whether a variable is a factor variable
factor      creates a categorical variable, with value labels if desired
table       creates a frequency table
Use Data Modification file
2.0 Commenting a data frame or a variable
It is a good practice to label the data sets or variables that we have been working on. This can be
accomplished by using the comment function.
# cleaning up
rm(list=ls())
# reading in data
hs0 <- read.table("hs0.csv", header=T, sep=",")
# commenting the data set
comment(hs0)<-"High school and beyond data"
# checking
comment(hs0)
# variable labels using comment
comment(hs0$write)<-"writing score"
comment(hs0$read) <-"reading score"
# more checking to make sure that our comments stay with the data frame
save(hs0,file="hs0.rda")
rm(list=ls())
load(file="hs0.rda")
comment(hs0)
comment(hs0$write)
3.0 Creating factor variables
For the rest of this section, we are going to attach hs0 so our syntax will look cleaner. The
search() function displays what is currently on the search path.
search()
attach(hs0)
search()
We use the sapply function with the is.factor function to check if any of the variables in the hs0
data frame are factor variables.
sapply(hs0, is.factor)
Creating a factor (categorical) variable called schtyp.f for schtyp and a factor variable female
for gender with value labels.
schtyp.f <- factor(schtyp, levels=c(1, 2), labels=c("public", "private"))
female <- factor(gender, levels=c(0, 1), labels=c("male", "female"))
table(schtyp.f)
table(female)
4.0 Recoding variables and generating new variables
Recoding race=5 to be NA (to be missing).
table(hs0$race)
hs0$race[hs0$race==5] <-NA
table(hs0$race)
# displaying the missings as well

table(hs0$race, useNA="ifany")
Creating a variable called total = read + write+ math+science
total<-read+write+math+science
# noticing the missing values generated
summary(total)
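Creating a variable called grade based on total.
# initializing a variable
grade<-0
grade[total <=140]<-0
grade[total > 140 & total <= 180] <-1
# note: the assignments for grades 2 and 3 (totals above 180 and up to 234)
# are missing from these notes, so those cases are left unassigned here
grade[total > 234] <-4
comment(grade)<-"combined grades of read, write, math, science"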
The following R code creates a categorical variable called grade
grade<-factor(grade, levels=c(0, 1, 2, 3, 4), labels=c("F", "D", "C", "B", "A"))
Displaying one-way table for grades in R
table(grade)
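An alternative to assigning each range by hand is cut(), which bins a numeric variable into a factor in one step. In the sketch below the cutoffs 140, 180 and 234 are taken from the code above; the cutoff 210 between C and B, and the vector tot, are assumptions made only for this illustration.

```r
tot <- c(120, 150, 200, 220, 240, NA)            # hypothetical total scores
g <- cut(tot, breaks = c(-Inf, 140, 180, 210, 234, Inf),   # 210 is an assumed cutoff
         labels = c("F", "D", "C", "B", "A"))    # intervals are (lower, upper]
table(g, useNA = "ifany")                        # missing totals stay missing
```

Because the intervals are closed on the right, a total of exactly 140 falls in "F", matching the condition total <= 140.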
Creating mean scores in two ways - working with missing values differently.
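# m1 is NA for any student with a missing score
m1<-(read+write+math+science)/4
# rowMeans gives the same result by default
m2<-rowMeans(cbind(read, write, math, science))
# with na.rm=T the mean is taken over the non-missing scores only
m2<-rowMeans(cbind(read, write, math, science), na.rm=T)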
At this point, we might want to combine the new variables we have created with the original data
set. We can use the cbind function for this.
# passing the new variables directly keeps schtyp.f, female and grade as factors
hs1<-cbind(hs0, schtyp.f, female, total, grade)
table(hs1$race)
is.data.frame(hs1)
Data Management in R
1.0 R functions used in this session and the syntax file
mean    calculates the mean
names   lists the variable names of a data frame
table   creates a frequency table
rbind   combines rows of data
order   sorts data frames
merge   match-merges two data frames
cbind   combines columns of data
Use Data Management File
2.0 Keeping and dropping a subset of variables or observations
Read in the hs1 data using the read.table function and store it in an object called hs1.
hs1 <- read.table("hs1.csv", header=T, sep=",")
Keeping only the observations where the reading score is 60 or higher.
hs1.read.well <- hs1[hs1$read >= 60, ]
Comparing means of read in the original hs1 data frame and the new smaller hs1.read.well data
frame. To keep from getting confused we will use the convention of using the data name, dollar
sign, variable name. For example, hs1$read is the read variable from the hs1 data.
mean(hs1.read.well$read)
mean(hs1$read)
Keeping only the variables read and write from the hs1 data frame.
hs2<-hs1[, c("read", "write")]

# another way of doing the same thing
hs3<-hs1[, c(7, 8)]
names(hs3)
Dropping the variables read and write from the hs1 data frame by using the column indices
corresponding to these two variables with a negative sign.
hs2.drop<-hs1[, -c(7, 8)]

names(hs2.drop)
3.0 Append files
We will subset hs1 to two data sets, one for female and one for male. We then put them back
together.
hsfemale<-hs1[hs1$female==1, ]
hsmale<-hs1[hs1$female==0, ]
dim(hsfemale)
dim(hsmale)
hs.all<-rbind(hsfemale, hsmale)
dim(hs.all)
4.0 Merging Files
We will create two data sets from hs1, one contains demographic variables and the other one
contains test scores. We then merge the two data sets by the id variable.
hs.demo<-hs1[, c("id", "ses", "female", "race")]

hs.scores<-hs1[, c("id", "read", "write", "math", "science")]
dim(hs.demo)
dim(hs.scores)
hs.merge <- merge(hs.demo, hs.scores, by="id", all=T)

head(hs.merge)
dim(hs.merge)
If the variable that we were merging on had different names in each data frame then we could
use the by.x and by.y arguments. In the by.x argument we would list the name of the variable(s)
that was in the data frame listed first in the merge function (in this case in hs.demo) and in the
by.y argument we would name the variable(s) that was in the data frame listed second (in this
case hs.scores).
hs.merge1 <- merge(hs.demo, hs.scores, by.x="id", by.y="id", all=T)
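In hs.demo and hs.scores the key has the same name, so by.x and by.y are not strictly needed. The sketch below uses two small made-up data frames (demo and scores, not part of the hs1 data) in which the key really is named differently.

```r
demo   <- data.frame(id = c(1, 2, 3), ses = c("low", "middle", "high"))
scores <- data.frame(student = c(2, 3, 4), read = c(57, 63, 47))

# the key is "id" in the first data frame and "student" in the second
both <- merge(demo, scores, by.x = "id", by.y = "student", all = TRUE)
both
```

With all=TRUE, cases found in only one data frame are kept and filled with NA; the merged key column takes its name from by.x.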
Statistical Analysis of Data in R
1.0 Functions used in this session, the data set and the syntax file
t.test        t-tests, including one-sample, two-sample and paired
tapply        applies a function to each cell of a ragged array
var           calculates the variance
lm            fits a linear model (regression)
anova         extracts the anova table from an lm object
summary       generic function that provides a synopsis of an object
fitted        extracts the fitted values from an lm object
resid         extracts the residuals from an lm object
plot          a generic function, used here to obtain default plots of an lm object
              as well as to generate a scatter plot between two continuous variables
glm           fits generalized linear models
wilcox.test   non-parametric analyses
kruskal.test  non-parametric analyses
Read in the hs1 data using the read.table function. We also use the attach function to
place the data set on the search path of R.
Use Statistical Data Analysis file in R
rm(list=ls())
hs1 <- read.table("c://Users/Richard Wachana/SAS 205/hs1.csv",
header=T, sep=",")
attach(hs1)
2.0 chi-square test
This is a chi-square test of independence for the two-way table.
tab1 <- table(female, ses)

# chi-square test of independence
summary(tab1)
3.0 t-tests
This is the one-sample t-test, testing whether the sample of writing scores was drawn from a
population with a mean of 50.
t.test(write, mu=50)
This is the paired t-test, testing whether or not the mean of write equals the mean of read.
t.test(write, read, paired=TRUE)
This is the two-sample independent t-test. We can use either the by function or the tapply
function to look at the variances of the variable write for each group of female. The first
t.test call assumes equal variances (var.equal=TRUE); the second assumes unequal variances
(the Welch test), which is the default in the t.test function.
by(write, female, var)

tapply(write, female, var)
# assuming equal variances
t.test(write~female, var.equal=TRUE)
# assuming unequal variances

t.test(write~female, var.equal=FALSE)
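The effect of var.equal shows up most clearly in the degrees of freedom. The following sketch uses simulated data (score and group are made-up names and values, not the hs1 variables) in which the two groups have clearly different spreads.

```r
set.seed(1)                                     # reproducible simulated data
score <- c(rnorm(30, mean = 50, sd = 5), rnorm(30, mean = 55, sd = 12))
group <- factor(rep(c("a", "b"), each = 30))

tapply(score, group, var)                          # compare the group variances
pooled <- t.test(score ~ group, var.equal = TRUE)  # pooled-variance t-test
welch  <- t.test(score ~ group)                    # Welch test (var.equal = FALSE)
c(pooled = unname(pooled$parameter), welch = unname(welch$parameter))  # degrees of freedom
```

The pooled test always uses n1 + n2 - 2 degrees of freedom, while the Welch test adjusts them downward when the variances differ.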
4.0 Anova
In R you can use either the aov function or the anova function combined with the lm function.
Both alternatives will give you the same results. The anova function extracts the anova table
from the linear model fitted by the lm function. The aov function only fits an anova model and
we use the summary function to see all the output.
anova(lm(write~factor(prog)))
summary(aov(write~factor(prog)))
The following is an example of a two-way factorial ANOVA. Notice that in R, the sum of
squares is type I.
m2<-lm(write~factor(prog)*factor(female))
anova(m2)
Here is an analysis of covariance (ANCOVA). In this example, prog is the categorical predictor
and read is the continuous covariate.
anova(lm(write~factor(prog) + read))
summary(aov(write~factor(prog) + read))
5.0 Regression
Plain old OLS regression.
summary(lm(write~female+read))
The generic plot function will produce multiple diagnostic plots when applied to an lm object.
These plots include residuals versus fitted values, a normal Q-Q plot of the residuals, a
scale-location plot and residuals versus leverage. If you are only interested in one or a few of
the plots, use the which argument of the plot function (for example, which=1:2 selects the first two).
lm2 <- lm(write~read+socst)

summary(lm2)
# plotting diagnostic plots of lm2
plot(lm2)
Let's take a closer look at the object lm2 we just created. Notice that an object can have many
components of different types and different sizes.
class(lm2)
names(lm2)
length(lm2)
length(lm2$residuals)
length(lm2$coefficients)
lm2$coefficients
The fitted function will extract the fitted values from the lm object and the resid function will
extract the residuals.
write[1:20]
fitted(lm2)[1:20]
resid(lm2)[1:20]
6.0 Logistic regression
In order to demonstrate, we will create a dichotomous variable called honcomp (honors
composition). honcomp will be TRUE when the logical test write >= 60 is true and FALSE when
it is not; glm treats these values as 1 and 0. This variable is created purely for illustrative
purposes!
honcomp <- write >= 60

honcomp[1:20]
The glm function fits a generalized linear model including a logistic regression. In order to fit a
logistic model we need to specify that the distribution of the dependent variable is binomial in
the family argument and the default link function used will then be the logit function.
lr <- glm(honcomp~female+read, family=binomial)

summary(lr)
# odds ratios
exp(coef(lr))
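The same steps can be tried without the hs1 file. The sketch below is an illustration on the built-in mtcars data, where am (0 = automatic, 1 = manual) is the dichotomous outcome and wt is the predictor; the model and the name fit are chosen here and are not part of the original analysis.

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)   # logistic regression
summary(fit)$coefficients            # estimates on the log-odds scale
exp(coef(fit))                       # odds ratios
exp(confint.default(fit))            # Wald 95% intervals on the odds-ratio scale
```

An odds ratio below 1 for wt means that the odds of a manual transmission decrease as weight increases.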
7.0 Non-Parametric Tests
The Wilcoxon signed-rank test is a nonparametric analog to the single-sample t-test and is obtained
by using the wilcox.test function. The value that is being tested is specified by the mu argument.
wilcox.test(write, mu=50)
The signed-rank test for paired data is the nonparametric analog to the paired t-test. This test
can also be obtained by using the wilcox.test function and specifying T in the paired argument.
wilcox.test(write, read, paired=T)
The rank-sum (Wilcoxon-Mann-Whitney) test is the nonparametric analog to the independent two-sample
t-test. The grouping variable is given on the right-hand side of a formula.
wilcox.test(write~female)
The Kruskal-Wallis test is the nonparametric analog to the one-way anova.
kruskal.test(write, ses)
Unless you are going to continue working with the hs1 data it is generally a good idea to detach
all attached data frames.
detach()
Functions
Almost everything in R is done through functions. Here I'm only referring to numeric and
character functions that are commonly used in creating or recoding variables.
1. Numeric Functions
Function                    Description
abs(x)                      absolute value
sqrt(x)                     square root
ceiling(x)                  rounds up to the nearest integer; ceiling(3.475) is 4
floor(x)                    rounds down to the nearest integer; floor(3.475) is 3
trunc(x)                    discards the decimal part; trunc(5.99) is 5
round(x, digits=n)          rounds to the specified number of decimal places;
                            round(3.475, digits=2) is 3.48
signif(x, digits=n)         rounds to the specified number of significant digits;
                            signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x),     cosine, sine, tangent, arc-cosine, arc-sine, arc-tangent,
acos(x), asin(x), atan(x),  and the two-argument arc-tangent
atan2(y, x)
log(x)                      natural logarithm
log10(x)                    common logarithm
exp(x)                      e^x
2. Character Functions
Function                            Description
substr(x, start=n1, stop=n2)        Extract or replace substrings in a character vector.
                                    x <- "abcdef"
                                    substr(x, 2, 4) is "bcd"
                                    substr(x, 2, 4) <- "22222" changes x to "a222ef"
grep(pattern, x,                    Search for pattern in x. If fixed=FALSE then pattern is a regular
  ignore.case=FALSE, fixed=FALSE)   expression. If fixed=TRUE then pattern is a text string. Returns
                                    matching indices.
                                    grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x,        Find pattern in x and replace it with the replacement text.
  ignore.case=FALSE, fixed=FALSE)   If fixed=FALSE then pattern is a regular expression.
                                    If fixed=TRUE then pattern is a text string.
                                    sub("\\s",".","Hello There") returns "Hello.There"
strsplit(x, split)                  Split the elements of character vector x at split.
                                    strsplit("abc", "") returns a list containing the
                                    3-element vector "a","b","c"
paste(..., sep=" ")                 Concatenate strings, using the sep string to separate them
                                    (the default is a single space).
                                    paste("x",1:3,sep="") returns c("x1","x2","x3")
                                    paste("x",1:3,sep="M") returns c("xM1","xM2","xM3")
                                    paste("Today is", date())
toupper(x)                          Uppercase
tolower(x)                          Lowercase
3. Statistical Probability Functions
The following table describes functions related to probability distributions. You need to type at
the command prompt set.seed(1234) to create reproducible pseudo-random numbers below.
Function                 Description
dnorm(x)                 normal density function (by default mean=0, sd=1)
                         # plot standard normal curve
                         x <- pretty(c(-3,3), 30)
                         y <- dnorm(x)
                         plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i")
pnorm(q)                 cumulative normal probability for q
                         (area under the normal curve to the left of q)
                         pnorm(1.96) is 0.975
qnorm(p)                 normal quantile:
                         value at the p percentile of the normal distribution
                         qnorm(.9) is 1.28 # computes the 90th percentile
rnorm(n, mean=0, sd=1)   n random normal deviates with the given mean
                         and standard deviation sd
                         # generating 50 random normal variates with mean=50, sd=10
                         x <- rnorm(50, mean=50, sd=10)
dbinom(x, size, prob)    binomial distribution, where size is the number of trials
pbinom(q, size, prob)    and prob is the probability of a head (pi)
qbinom(p, size, prob)    # prob of each of 0 to 5 heads of a fair coin out of 10 flips
rbinom(n, size, prob)    dbinom(0:5, 10, .5)
                         # prob of 5 or fewer heads of a fair coin out of 10 flips
                         pbinom(5, 10, .5)
dpois(x, lambda)         Poisson distribution with mean = variance = lambda
ppois(q, lambda)         # probability of 0, 1, or 2 events with lambda=4
qpois(p, lambda)         dpois(0:2, 4)
rpois(n, lambda)         # probability of at least 3 events with lambda=4
                         1 - ppois(2, 4)
dunif(x, min=0, max=1)   uniform distribution; follows the same pattern
punif(q, min=0, max=1)   as the normal distribution above
qunif(p, min=0, max=1)   # 10 uniform random variates
runif(n, min=0, max=1)   x <- runif(10)
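The d, p, q and r prefixes follow the same pattern for every distribution, and a few identities tie them together. The short sketch below checks some of them numerically; the object names p, r1 and r2 are chosen here for illustration.

```r
# p and q functions are inverses of each other
p <- pnorm(1.96)          # area to the left of 1.96
qnorm(p)                  # recovers 1.96

# a cumulative probability is the sum of the point probabilities
sum(dbinom(0:5, 10, 0.5))
pbinom(5, 10, 0.5)

# an upper-tail probability, computed two equivalent ways
1 - ppois(2, 4)
ppois(2, 4, lower.tail = FALSE)

# setting the same seed reproduces the same random draws
set.seed(1234); r1 <- rnorm(3)
set.seed(1234); r2 <- rnorm(3)
identical(r1, r2)
```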
4. Other Statistical Functions
Other useful statistical functions are provided in the following table. Most have the option na.rm
to strip missing values before calculations (diff and scale are exceptions). Otherwise the presence
of missing values will lead to a missing result. Object can be a numeric vector or data frame.
mean(x, trim=0, na.rm=FALSE)        mean of object x
                                    # trimmed mean, removing any missing values and
                                    # 5 percent of highest and lowest scores
                                    mx <- mean(x,trim=.05,na.rm=TRUE)
sd(x)                               standard deviation of object x; see also var(x) for variance
                                    and mad(x) for median absolute deviation
median(x)                           median
quantile(x, probs)                  quantiles, where x is the numeric vector whose quantiles are
                                    desired and probs is a numeric vector with probabilities in [0,1]
                                    # 30th and 84th percentiles of x
                                    y <- quantile(x, c(.3,.84))
range(x)                            range
sum(x)                              sum
diff(x, lag=1)                      lagged differences, with lag indicating which lag to use
min(x)                              minimum
max(x)                              maximum
scale(x, center=TRUE, scale=TRUE)   column-center or standardize a matrix
5. Other Useful Functions

seq(from, to, by)      generate a sequence
                       indices <- seq(1,10,2)
                       # indices is c(1, 3, 5, 7, 9)
rep(x, ntimes, each)   repeat x n times
                       y <- rep(1:3, 2)
                       # y is c(1, 2, 3, 1, 2, 3)
cut(x, n)              divide a continuous variable into a factor with n levels
                       y <- cut(x, 5)
Note that while the examples on this topic apply functions to individual variables, many can be
applied to vectors and matrices as well.
Numerical Analysis in R
While R is best known as an environment for statistical computing, it is also a
great tool for numerical analysis (optimization, integration, interpolation, matrix
operations, differential equations and so on). This topic discusses the capabilities
that R offers for such tasks.
Integration
To integrate a function numerically, use integrate (note that the function
must be able to accept and return a vector):
fun <- function(x, const) x^2 + 2*x + const

integrate(fun, lower=0, upper=10, const=1)
#output 443.3333 with absolute error < 4.9e-12
integrate will evaluate the function over the specified range (lower to upper) by
passing a vector of values in that range to the function being integrated. Note that any
other arguments to fun must also be specified, as extra arguments to integrate,
and that the order of the arguments of fun does not matter, provided all arguments
are supplied in this way, apart from the one being integrated over:
fun2 <- function(A, b, x) A*x^b # "x" doesn't have to be the first argument
integrate(fun2, lower=0, upper=10, A=1, b=2) # "A" & "b" are given explicitly
Now, let's say you wanted to integrate this function for a series of values of b
bvals <- seq(0, 2, by=0.2) # create vector of b values

fun2.int <- function(b) integrate(fun2, lower=0, upper=10, A=1, b=b)$value
fun2.int(bvals[1]) # works for a single value of b
fun2.int(bvals) # FAILS for a vector of values of b
To make it work, you need to force vectorization of the function, so it can cycle
piecewise through the elements of the vector and evaluate the function for each
one:
fun2.intV <- Vectorize(fun2.int, "b") # Vectorize "fun2.int" over "b"

fun2.intV(bvals) # returns a vector of values
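An equivalent way to loop over the values of b is sapply(). Because this integrand has a closed-form integral, the numerical results can also be checked against it. The names ints and exact below are introduced only for this sketch.

```r
fun2 <- function(A, b, x) A * x^b
bvals <- seq(0, 2, by = 0.2)

# loop over b with sapply; each call integrates for a single value of b
ints <- sapply(bvals, function(b)
  integrate(fun2, lower = 0, upper = 10, A = 1, b = b)$value)

# closed form for comparison: the integral of x^b from 0 to 10 is 10^(b+1)/(b+1)
exact <- 10^(bvals + 1) / (bvals + 1)
max(abs(ints - exact) / exact)      # largest relative error
```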
Differentiation
To compute symbolically the derivative of a simple expression, use D (see
?deriv for more information):
> D(expression(sin(x)^2 - exp(x^2)), "x") # differentiate with respect to "x"

# Output 2 * (cos(x) * sin(x)) - exp(x^2) * (2 * x)
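The result of D is an unevaluated R expression, so it can be evaluated at a particular x with eval(). The sketch below does this at x = 0.5 and compares the value with a central finite difference; dfun, f and h are names introduced here for illustration.

```r
dfun <- D(expression(sin(x)^2 - exp(x^2)), "x")   # derivative as an unevaluated call
x <- 0.5
eval(dfun)                                        # derivative evaluated at x = 0.5

# central finite-difference check of the same derivative
f <- function(x) sin(x)^2 - exp(x^2)
h <- 1e-5
(f(0.5 + h) - f(0.5 - h)) / (2 * h)
```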
Differential equations
To solve differential equations, use the deSolve package.
install.packages("deSolve")
library("deSolve")
library(help="deSolve") # see information on package
Parameter optimization
To find the minimum value of a function within some interval, we use optimize
(optimise is a valid alias):
fun <- function(x) x^2 + x - 1 # create function

curve(fun, xlim=c(-2, 1)) # plot function
( res <- optimize(fun, interval=c(-10, 10)) )
points(res$minimum, res$objective) # plot point at minimum
Now, let's say you want to find the x value of the function at which the y value
equals some number (1.234, say):
#--Define an auxiliary minimizing function:

fun.aux <- function(x, target) ( fun(x) - target )^2
(res <- optimize(fun.aux, interval=c(-10, 10), target=1.234) )
fun(res$minimum) # close enough
Of course, there are 2 solutions in this case, as seen by plotting the function to be
minimized:
curve(fun.aux(x, target=1.234), xlim=c(-3, 2))

points(res$minimum, res$objective)
We can get the other solution by giving a skewed search interval (type at the
command prompt ?optimize to see how the start point is determined):
res2 <- optimize(fun.aux, interval=c(-10, 100), target=1.234) # force higher start value
points(res2$minimum, res2$objective)# plot other minimum
#--Show target values plotted with original function:

curve(fun, xlim=c(-3, 2))
abline(h=1.234, lty=2)
abline(v=c(res$minimum, res2$minimum), lty=3)
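As an alternative to minimizing a squared difference, the same two x values can be found with the root finder uniroot. This is a different technique from the optimize approach above, shown only as a sketch. The helper name fun.shift and the two bracketing intervals are assumptions, chosen by looking at the plot so that the function changes sign across each interval.

```r
fun <- function(x) x^2 + x - 1                     # same function as above
fun.shift <- function(x, target) fun(x) - target   # zero where fun(x) equals target

# Assumed brackets: fun.shift must change sign across each interval
root.lo <- uniroot(fun.shift, interval = c(-10, -0.5), target = 1.234, tol = 1e-9)
root.hi <- uniroot(fun.shift, interval = c(-0.5, 10), target = 1.234, tol = 1e-9)

roots <- c(root.lo$root, root.hi$root)
roots        # the two x values
fun(roots)   # both should be close to 1.234
```

Unlike optimize, uniroot needs an interval with a sign change, so each solution gets its own bracket rather than relying on where the search happens to start.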
Interpolating data
An example of spline interpolation:
fun <- function(x) sqrt(3) * sin(2*pi*x) # function to generate some data
x <- seq(0, 1, length=20)

set.seed(123) # allow reproducible random numbers
y <- jitter(fun(x), factor=20) # add a small amount of random noise
plot(y ~ x) # plot noisy data
lines(spline(x, y)) # add splined data
Now, compare with the prediction from a smoothing spline:
f <- smooth.spline(x, y)
lines(predict(f), lty=2)
Why not also add the best-fit sine curve predicted from a linear regression with lm:
lines(x, predict(lm( y ~ sin(2*pi*x))), col="red")
Note that, by default, the predicted values are evaluated at the (X) positions of the
raw data. This means that you can end up with rather coarse curves, as seen above.
To get round this, you need to work with functions for the splines, which can be
supplied with more finely-spaced X values for plotting:
fun.spline <- splinefun(x, y)

fun.smooth <- function(xx, ...) predict(smooth.spline(x, y), x=xx,
...)$y
plot(y ~ x)
curve(fun.spline, add=TRUE) # } "curve" uses n=101 points by default
curve(fun.smooth, add=TRUE, lty=2)# }at which to evaluate the function
#--And add a smoother best-fit sine curve:

fun.sine <- function(X) predict(lm( y ~ sin(2*pi*x)),
newdata=list(x=X))
curve(fun.sine, add=TRUE, col="red")
#--Finally, just for completeness, plot the original function:

curve(fun, add=TRUE, col="blue")
A wider range of splines is available in the package of the same name, accessed via
library("splines"), including B splines, natural splines and so on.
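As a brief sketch of what the splines package offers, the code below fits a regression on a natural cubic spline basis with ns and plots the fitted curve on a fine grid. The data are regenerated as in the example above; the choice df = 5 and the object names fit.ns, xx and yy are assumptions made for illustration.

```r
library("splines")                        # ships with the standard R distribution

fun <- function(x) sqrt(3) * sin(2*pi*x)  # same generating function as above
x <- seq(0, 1, length.out = 20)
set.seed(123)
y <- jitter(fun(x), factor = 20)

# Regression on a natural cubic spline basis; df = 5 is an arbitrary choice
fit.ns <- lm(y ~ ns(x, df = 5))

xx <- seq(0, 1, length.out = 101)         # finer grid for a smooth curve
yy <- predict(fit.ns, newdata = data.frame(x = xx))

plot(y ~ x)
lines(xx, yy, col = "darkgreen")
```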
Testing a population parameter
Consider a simple survey. You ask 100 randomly chosen people and 42 say "yes" to your
question. Does this support the hypothesis that the true proportion is 50%?
prop.test(42,100,p=.5)
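The object returned by prop.test can be stored and its components inspected individually. The sketch below also reproduces the test statistic by hand, including the continuity correction that prop.test applies by default; the variable names are chosen for this illustration.

```r
res <- prop.test(42, 100, p = 0.5)  # continuity correction is on by default
res$statistic                       # chi-squared statistic
res$p.value                         # two-sided p-value
res$conf.int                        # confidence interval for the true proportion

# The same statistic by hand, with the 0.5 continuity correction
x <- 42; n <- 100; p0 <- 0.5
chisq.manual <- (abs(x - n * p0) - 0.5)^2 / (n * p0 * (1 - p0))
chisq.manual                        # 2.25
```

The p-value here is about 0.13, so at the 5% level the data do not contradict a true proportion of 50%.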
Testing a mean
Suppose a car manufacturer claims a model gets 25 mpg. A consumer group asks 10 owners of
this model to calculate their mpg; the mean value is 22 with a standard deviation of 1.5. Is
the manufacturer's claim supported?
## Compute the t statistic. Note we assume mu=25 under H_0
xbar=22;s=1.5;n=10
t = (xbar-25)/(s/sqrt(n))
t
## use pt to get the distribution function of t

pt(t,df=n-1)
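The calculation above can be wrapped in a small helper so that it is reusable for other summary statistics. The function name t.from.summary is hypothetical (it is not part of base R), and the sketch assumes the lower-tailed alternative that the true mean is below the claimed value.

```r
# Hypothetical helper (not a base R function): one-sample t test from
# summary statistics, for the lower-tailed alternative mu < mu0
t.from.summary <- function(xbar, s, n, mu0) {
  tstat <- (xbar - mu0) / (s / sqrt(n))
  list(statistic = tstat,
       df = n - 1,
       p.value = pt(tstat, df = n - 1))  # P(T <= tstat)
}

res <- t.from.summary(xbar = 22, s = 1.5, n = 10, mu0 = 25)
res$statistic  # about -6.32
res$p.value    # compare with the chosen significance level, e.g. 0.05
```

A p-value this small means the sample mean of 22 is very unlikely if the true mean were 25, so the manufacturer's claim is not supported.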
Two-sample tests of proportion
We use the command prop.test to handle these problems.
Example: Two surveys

A survey is taken twice over the course of two weeks. The pollsters wish to see whether the
results differ, since a new advertising campaign ran in between. Here are the data:
            Week 1  Week 2
Favorable       45      56
Unfavorable     35      47
The standard hypothesis test is $H_0: \pi_1 = \pi_2$ against the two-sided alternative $H_1: \pi_1 \neq \pi_2$. The
function prop.test is usually called as prop.test(x,n), where x is the number favorable
and n is the total. Here it is no different, but since there are two samples, x and n are each vectors of length two.
Here is how to do it in R:
prop.test(c(45,56),c(45+35,56+47))
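To make the call easier to read, the counts can be stored in named vectors first and the result inspected piece by piece. The variable names favorable and total are chosen for this sketch.

```r
favorable <- c(45, 56)            # week 1, week 2
total     <- c(45 + 35, 56 + 47)  # favorable + unfavorable in each week

res <- prop.test(favorable, total)
res$estimate                      # the two sample proportions
favorable / total                 # the same values computed directly
res$p.value                       # two-sided p-value for H0: equal proportions
```

The two sample proportions are close (about 0.56 and 0.54), and the large p-value gives no evidence of a change between the two weeks.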
Two-sample t-tests
Equal variances
When the two samples are assumed to have equal variances, then the data can be pooled to find
an estimate for the variance. By default, R assumes unequal variances. If the variances are
assumed equal, then you need to specify var.equal=TRUE when using t.test.
Example: Recovery time for new drug

Suppose the recovery time for patients taking a new drug is measured (in days). A placebo group
is also used to control for the placebo effect. The data are as follows:
with drug: 15 10 13 7 9 8 21 9 14 8
placebo: 15 14 12 8 14 7 16 10 15 12
After a side-by-side boxplot (boxplot(x,y), not shown), the assumptions of equal variances
and normality are judged to be reasonable. A one-sided two-sample t-test is then carried out.
It tests the null hypothesis of equal means against the one-sided alternative that the drug
group has a smaller mean ($\mu_1 - \mu_2 < 0$). Here are the commands:
x = c(15, 10, 13, 7, 9, 8, 21, 9, 14, 8)
y = c(15, 14, 12, 8, 14, 7, 16, 10, 15, 12)
t.test(x,y,alt="less",var.equal=TRUE)
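To see what var.equal=TRUE does, the pooled-variance t statistic can be computed by hand and compared with the output of t.test. This is a sketch using the standard pooled-variance formulas; the names sp2, t.manual and p.manual are chosen for the illustration.

```r
x <- c(15, 10, 13, 7, 9, 8, 21, 9, 14, 8)     # drug group
y <- c(15, 14, 12, 8, 14, 7, 16, 10, 15, 12)  # placebo group
n <- length(x); m <- length(y)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2)
t.manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1/n + 1/m))
p.manual <- pt(t.manual, df = n + m - 2)      # lower tail, matching alt="less"

res <- t.test(x, y, alt = "less", var.equal = TRUE)
c(manual = t.manual, t.test = unname(res$statistic))
c(manual = p.manual, t.test = res$p.value)
```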
Unequal variances
If the variances are unequal, the two samples cannot be pooled and the degrees of freedom of
the t-statistic must be approximated (the Welch approximation), which is tedious to compute by
hand. But not with R. The only difference is that you don't specify
var.equal=TRUE (so it is actually easier with R).
If we continue the same example we would get the following:
t.test(x,y,alt="less")
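For comparison, here is a sketch of the Welch statistic and its approximate degrees of freedom computed by hand, checked against what t.test reports. The names vx, vy, t.welch and df.welch are chosen for the illustration.

```r
x <- c(15, 10, 13, 7, 9, 8, 21, 9, 14, 8)     # drug group
y <- c(15, 14, 12, 8, 14, 7, 16, 10, 15, 12)  # placebo group

vx <- var(x) / length(x)   # squared standard error of mean(x)
vy <- var(y) / length(y)   # squared standard error of mean(y)

t.welch  <- (mean(x) - mean(y)) / sqrt(vx + vy)
# Welch-Satterthwaite approximation to the degrees of freedom
df.welch <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

res <- t.test(x, y, alt = "less")             # var.equal=FALSE is the default
c(manual = t.welch,  t.test = unname(res$statistic))
c(manual = df.welch, t.test = unname(res$parameter))
```

The degrees of freedom are generally not a whole number here, which is why the hand calculation is awkward with printed t tables.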
Matched samples
Matched or paired t-tests use a different statistical model. Rather than assume the two samples
are independent normal samples albeit perhaps with different means and standard deviations, the
matched-samples test assumes that the two samples share common traits.
The basic model is that $Y_i = X_i + \epsilon_i$, where $\epsilon_i$ is the randomness. We want to test whether the $\epsilon_i$ have mean
0 against the alternative that they do not. In order to do so, one subtracts the X's from the
Y's and then performs a regular one-sample t-test.
Actually, R does all that work. You only need to specify paired=TRUE when calling the t.test
function.
Example: Dilemma of two graders

In order to promote fairness in grading, each application was graded twice by different graders.
Based on the grades, can we see if there is a difference between the two graders? The data is
Grader 1: 3 0 5 2 5 5 5 4 4 5
Grader 2: 2 1 4 1 4 3 3 2 3 5
Clearly there are differences. Are they explained by random fluctuations (the mean of $\epsilon_i$ is 0), or is
there a bias of one grader over the other (mean $\neq 0$)? A matched-sample test will give us some
insight. First we should check the assumption of normality, with normal plots, say. (The data are
discrete due to necessary rounding, but the general shape is seen to be normal.) Then we can
apply the t-test as follows:
x = c(3, 0, 5, 2, 5, 5, 5, 4, 4, 5)
y = c(2, 1, 4, 1, 4, 3, 3, 2, 3, 5)
t.test(x,y,paired=TRUE)
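The claim that the paired test is just a one-sample t-test on the differences can be checked directly. In this sketch the differences are taken as x - y, which is the order t.test uses for paired data; taking y - x would only flip the sign of the statistic.

```r
x <- c(3, 0, 5, 2, 5, 5, 5, 4, 4, 5)  # grader 1
y <- c(2, 1, 4, 1, 4, 3, 3, 2, 3, 5)  # grader 2

res.paired <- t.test(x, y, paired = TRUE)  # matched-samples test
res.diff   <- t.test(x - y, mu = 0)        # one-sample test on the differences

c(paired = unname(res.paired$statistic), differences = unname(res.diff$statistic))
c(paired = res.paired$p.value, differences = res.diff$p.value)
```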
Regression Analysis
Regression analysis forms a major part of the statistician's toolbox. This section discusses statistical
inference for the regression coefficients.
13.1 Simple linear regression model

R can be used to study the linear relationship between two numerical variables. Such a study is called
linear regression for historical reasons.
The basic model for linear regression is that pairs of data, $(x_i, y_i)$, are related through the equation
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
The values of $\beta_0$ and $\beta_1$ are unknown and will be estimated from the data. The value of $\epsilon_i$ is the amount
the y observation differs from the straight line model.
x = c(18,23,25,35,65,54,34,56,72,19,23,42,18,39,37)
y = c(202,186,187,180,156,169,174,172,153,199,193,174,198,183,178)
plot(x,y) # make a plot
abline(lm(y ~ x)) # plot the regression line
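lm.result = lm(y ~ x) # fit the model and store the result
lm.result # the basic values of the regression analysis
summary(lm.result) # coefficient table with standard errors and tests
es = resid(lm.result) # the residuals of lm.result
b1 = coef(lm.result)[['x']] # the x part (slope) of the coefficients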
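Since this section is about inference for the coefficients, the following sketch pulls the slope's estimate and standard error out of the summary table, recomputes the t statistic and two-sided p-value for $H_0: \beta_1 = 0$ by hand, and asks for a confidence interval. The object names fit, ctab, b1.hat, se1, t1 and p1 are chosen for this illustration.

```r
x <- c(18,23,25,35,65,54,34,56,72,19,23,42,18,39,37)
y <- c(202,186,187,180,156,169,174,172,153,199,193,174,198,183,178)
fit <- lm(y ~ x)

ctab <- coef(summary(fit))  # columns: Estimate, Std. Error, t value, Pr(>|t|)
ctab

b1.hat <- ctab["x", "Estimate"]
se1    <- ctab["x", "Std. Error"]
t1 <- b1.hat / se1                              # t statistic for H0: beta_1 = 0
p1 <- 2 * pt(-abs(t1), df = df.residual(fit))   # two-sided p-value
c(t = t1, p = p1)                               # should match the "x" row of ctab

confint(fit, "x", level = 0.95)                 # 95% confidence interval for the slope
```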
Analysis of Variance
Recall, the t-test was used to test hypotheses about the means of two independent samples. For
example, to test if there is a difference between control and treatment groups. The method called
analysis of variance (ANOVA) allows one to compare means for more than 2 independent
samples.
One-way analysis of variance

We begin with an example of one-way analysis of variance.
Example: Scholarship Grading

Suppose a school is trying to grade 300 different scholarship applications. As the job is too much
work for one grader, suppose 6 are used. The scholarship committee would like to ensure that
each grader is using the same grading scale, as otherwise the students aren't being treated
equally. One approach to checking if the graders are using the same scale is to randomly assign
each grader 50 exams and have them grade. Then compare the grades for the 6 graders knowing
that the differences should be due to chance errors if the graders all grade equally.
To illustrate, suppose we have just 24 tests and 3 graders (rather than 300 and 6, to simplify data entry).
Furthermore, suppose the grading scale runs from 1 to 5, with 5 being the best, and the scores
are reported as
grader 1:  4 3 4 5 2 3 4 5
grader 2:  4 4 5 5 4 5 4 4
grader 3:  3 4 2 4 5 5 4 4
We enter this into our R session as follows and then make a data frame
> x = c(4,3,4,5,2,3,4,5)
> y = c(4,4,5,5,4,5,4,4)
> z = c(3,4,2,4,5,5,4,4)
> scores = data.frame(x,y,z)
> boxplot(scores)
Before beginning, we made a side-by-side boxplot which allows us to compare the three
distributions. From this graph (not shown) it appears that grader 2 is different from graders 1 and
3.
Analysis of variance allows us to investigate if all the graders have the same mean. The R
function to do the analysis of variance hypothesis test (oneway.test) requires the data to be in a
different format. It wants to have the data with a single variable holding the scores, and a factor
describing the grader or category. The stack command will do this for us:
scores = stack(scores) # look at scores if not clear
names(scores)
Looking at the names, we get the values in the variable values and the category in ind. To call
oneway.test we need to use the model formula notation as follows
oneway.test(values ~ ind, data=scores, var.equal=T)
Now, we have formulas and could do all the work ourselves, but we're here to learn how to let the
computer do as much work for us as possible. Two functions are useful in this example:
oneway.test to perform the hypothesis test, and anova to give more detail.
For the data above, the call to oneway.test is
df = stack(data.frame(x,y,z)) # prepare the data
oneway.test(values ~ ind, data=df,var.equal=T)
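By default, it returns the value of F and the p-value but that's it. For these data F is about
1.13 on 2 and 21 degrees of freedom and the p-value is about 0.34, so despite the visual
impression from the boxplot there is no significant evidence that the graders' means differ.
Notice, we set explicitly that the variances are equal with var.equal=T.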
The function anova gives more detail. You need to call it on the result of lm
anova(lm(values ~ ind, data=df))
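To connect the output with the formulas, here is a sketch that computes the between-group and within-group sums of squares by hand and forms the F statistic, then compares it with oneway.test. The names ssb, ssw, F.manual and p.manual are chosen for this illustration.

```r
x <- c(4,3,4,5,2,3,4,5)
y <- c(4,4,5,5,4,5,4,4)
z <- c(3,4,2,4,5,5,4,4)
df <- stack(data.frame(x, y, z))  # columns: values, ind

k <- nlevels(df$ind)              # number of groups
N <- nrow(df)                     # total number of observations

group.means <- tapply(df$values, df$ind, mean)
group.n     <- tapply(df$values, df$ind, length)

ssb <- sum(group.n * (group.means - mean(df$values))^2)  # between groups
ssw <- sum((df$values - ave(df$values, df$ind))^2)       # within groups

F.manual <- (ssb / (k - 1)) / (ssw / (N - k))
p.manual <- pf(F.manual, k - 1, N - k, lower.tail = FALSE)

res <- oneway.test(values ~ ind, data = df, var.equal = TRUE)
c(manual = F.manual, oneway.test = unname(res$statistic))
c(manual = p.manual, oneway.test = res$p.value)
```

The same sums of squares appear in the "Sum Sq" column of the anova table.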