Homework 3 R Tutorial: How To Use This Tutorial
This tutorial will give you some more practice with R and also reinforce some concepts regarding the standard
error and sampling distributions. It will also help you do HW 3.
Read through the tutorial, typing the commands directly into the console as you go. For each step, once
you get the provided code running, you will be asked to tweak that code to get different effects or to answer
different questions. As you go, experiment and try different things. Pay attention to any error messages you
get when you forget a parenthesis or the like; this will help you learn to write code more easily in the future.
Hit the up-arrow key to recall previously typed commands. You can then edit them in the console and
rerun them.
library(mosaicData)
data(Galton)
heights = Galton$height
mu = mean(heights)
mu
## [1] 66.76
Now let’s take a sample! Here we take one sample of size 10 and compute the mean:
smp0 = sample(heights, 10)
mean(smp0)
## [1] 66.8
It turns out that the standard error for a random sample of size n is the population standard deviation
divided by the square root of n. So for our sample, the standard error is:
SE = sd(heights) / sqrt(10)
SE
## [1] 1.133
Now, how many standard errors is our sample away from the real mean? This is a z-score:
( mean(smp0) - mu ) / SE
## [1] 0.03469
Take 4 more samples. How far was each of these sample means from the true mean?
mean( sample(heights, 10) )
## [1] 66.37
mean( sample(heights, 10) )
## [1] 66.17
mean( sample(heights, 10) )
## [1] 65.65
mean( sample(heights, 10) )
## [1] 66.52
Each time you take a sample you get a random number, i.e. the sample mean. We expect that random
number to be about µ, the real population mean, but we know that sampling is not perfect. Any given
sample is probably going to be about 1 SE away from the true mean. Look at your five samples. On average,
were they about 1 SE away from the true mean?
Take the standard deviation of your five means. Is the standard deviation about equal to your estimated
standard error?
sd( c( 66.8, 66.37, 66.17, 65.65, 66.52 ) )
## [1] 0.4309
Note: the above has two parts. First we make a list of our five means with c(). Second, we take the standard
deviation of our list of five numbers with sd().
The extra bits above hopefully illustrate the fundamental idea of the math we are doing. In real life, we only
have one sample. However, the way we understand our one sample is by thinking about what would happen
if we did our sampling over and over again. We then assume that our one sample is “typical” (i.e., close to
the population mean) and then base our inference on that assumption.
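The “over and over” idea is easy to check by simulation. Here is a small base-R sketch (it uses a made-up normal population with roughly the same mean and spread as the heights, so it runs without any packages) showing that the standard deviation of many sample means comes out close to the formula, the population standard deviation over sqrt(n):

```r
set.seed(1)

# A made-up population with about the same mean and SD as the Galton heights
population = rnorm(10000, mean = 66.76, sd = 3.58)

n = 10
SE.theory = sd(population) / sqrt(n)   # population SD over sqrt(n)

# Sample "over and over": 2000 samples of size n, recording each sample mean
many.means = replicate(2000, mean(sample(population, n)))

SE.theory
sd(many.means)   # should be close to SE.theory
```

The two numbers will not match exactly, but they should agree to a couple of decimal places, which is the whole point of the standard error formula.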
A crazier bootstrap
Let’s do a bootstrap on something called a trimmed mean. A trimmed mean is a mean where you drop the
top and bottom x percent of the data.
Here is some fake data along with the mean and an 80% trimmed mean:
lst = c( 1, 1, 1, 1, 2, 3, 4, 5, 6, 10000)
length(lst)
## [1] 10
mean(lst)
## [1] 1002
mean(lst, trim=0.1)
## [1] 2.875
Q: What happens when you trim 50% (0.5) from each end? Does having a 70% trimmed mean make any
sense?
The trimmed mean is the mean of the middle 80%, after lopping off 10% from each end. (Here 10% is 1 data
point, so the trimmed mean is the mean of the middle 8 points). We can check:
mean( c( 1, 1, 1, 2, 3, 4, 5, 6 ) )
## [1] 2.875
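If you want to see exactly what the trim= option is doing, here is a hand-rolled sketch. Note that my.trimmed.mean is a made-up helper, not a built-in; it matches mean(x, trim=) when trim times the sample size is a whole number:

```r
# A made-up helper that mimics mean(x, trim=): sort the data, drop
# floor(trim * n) points from each end, then average what is left.
my.trimmed.mean = function(x, trim) {
  x = sort(x)
  k = floor(trim * length(x))     # how many points to drop from each end
  if (k > 0) x = x[(k + 1):(length(x) - k)]
  mean(x)
}

lst = c(1, 1, 1, 1, 2, 3, 4, 5, 6, 10000)
my.trimmed.mean(lst, 0.1)   # 2.875, same as mean(lst, trim=0.1)
```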
library( mosaic )    # for favstats() and resample()
library( Lock5Data )
data( HollywoodMovies2011 )
HollywoodMovies2011 = subset(HollywoodMovies2011, !is.na( Budget ) )
bdg = HollywoodMovies2011$Budget
favstats(bdg)
mean( bdg, trim=0.1 )
## [1] 45.61
smp = resample(HollywoodMovies2011)
mean( smp$Budget, trim=0.1 )
## [1] 45.93
boots = replicate( 10000, {
smp = resample(HollywoodMovies2011)
mean( smp$Budget, trim=0.1 )
})
sd(boots)
## [1] 4.313
hist(boots)
[Figure: histogram of boots (Frequency vs. boots), roughly symmetric and centered near 45, ranging from about 30 to 65.]
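resample() comes from the mosaic package. Underneath, a bootstrap sample is just a draw of the same size as the data, with replacement. Here is the same loop written in base R on a small made-up vector (fake.bdg is invented for the sketch), so it runs without mosaic or Lock5Data:

```r
set.seed(42)

# A made-up stand-in for the budget data
fake.bdg = c(10, 15, 20, 25, 30, 40, 50, 60, 80, 200)

# One bootstrap sample: draw n values WITH replacement from our data,
# then compute the statistic of interest (here, the 80% trimmed mean).
one.boot = function() {
  smp = sample(fake.bdg, length(fake.bdg), replace = TRUE)
  mean(smp, trim = 0.1)
}

boots = replicate(2000, one.boot())
sd(boots)   # the bootstrap estimate of the standard error
```

The sd of the bootstrap statistics plays the same role here that the sd of your five sample means did earlier: it estimates how much the statistic bounces around from sample to sample.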
You can get the coefficients out of your linear model like so:
data( Galton )
my.lm = lm( height ~ father + mother, data=Galton )
coef( my.lm )
coef( my.lm )[1]
## (Intercept)
##       22.31
coef( my.lm )[2]
## father
## 0.3799
new.f = 72
new.m = 72
coefs = coef( my.lm )
coefs[1] + coefs[2]*new.f + coefs[3]*new.m
## (Intercept)
##       70.05
(The “(Intercept)” label is just a leftover name carried along from coefs[1]; the number itself is the predicted height.)
Q: Predict your own height based on your mother and father’s height.
The easiest way to predict with a linear model is to just do it by hand. But R can do it for you, if you so
desire, via the predict() command. Let’s look at some examples.
library( Lock5Data )
data( RestaurantTips )
my.lm = lm( Tip ~ Bill, data=RestaurantTips )
new.dat = data.frame( Bill=c(10,20,40) )
new.dat
## Bill
## 1 10
## 2 20
## 3 40
predict( my.lm, newdata=new.dat )
##     1     2     3
## 1.530 3.352 6.996
We have made a new data table (new.dat) and then asked predict() to predict tips for our new bills.
Notice that we match the Bill variable name exactly.
Q: What is the predicted tip for a $50 bill? $1000 bill?
Q: Try getting the coefficients of this model using the stuff above.
This can be used for multiple variables too:
my.lm2 = lm( Tip ~ Bill + Server, data=RestaurantTips )
new.dat = data.frame( Bill=c(20,20,20), Server=c("A","B","C") )
new.dat
## Bill Server
## 1 20 A
## 2 20 B
## 3 20 C
predict( my.lm2, newdata=new.dat )
##     1     2     3
## 3.540 3.228 3.253
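To convince yourself that the by-hand coefficient formula and predict() really do agree, you can check them against each other on made-up data (the variables a, b, and y here are invented for the sketch):

```r
set.seed(7)

# Made-up data: y depends linearly on two predictors a and b, plus noise
dat = data.frame(a = runif(50, 0, 10), b = runif(50, 0, 10))
dat$y = 3 + 2 * dat$a - 1 * dat$b + rnorm(50, sd = 0.5)

fit = lm(y ~ a + b, data = dat)
cf = coef(fit)

new.dat = data.frame(a = 4, b = 6)

by.hand = cf[1] + cf[2] * new.dat$a + cf[3] * new.dat$b
with.predict = predict(fit, newdata = new.dat)

by.hand - with.predict   # essentially zero: the two methods agree
```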
The following introduces the idea of transforming your data, and gives a nice trick for learning about curved
relationships between two variables.
Say we ran a linear model on the following data:
X = 1:10
Y = 0.25 * X^(3.0)
plot( Y ~ X )
bad.lm = lm( Y ~ X )
abline( bad.lm )
[Figure: scatterplot of Y vs. X with the line from bad.lm; the points curve sharply upward while the fitted line is straight.]
Q: Why is running a linear model on these X and Y not appropriate?
We can transform our data. Here we take the log of both X and Y.
lX = log(X)
lY = log(Y)
Now let’s plot and run a linear model on your transformed data.
plot( lY ~ lX )
mlm = lm( lY ~ lX )
abline( mlm )
[Figure: scatterplot of lY vs. lX with the line from mlm; the log-transformed points fall exactly on a straight line.]
summary(mlm)
##
## Call:
## lm(formula = lY ~ lX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.25e-15 -5.39e-16 2.39e-16 4.96e-16 9.09e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.39e+00 6.17e-16 -2.25e+15 <2e-16 ***
## lX 3.00e+00 3.71e-16 8.09e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.16e-16 on 8 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.54e+31 on 1 and 8 DF, p-value: <2e-16
Notice that the fitted slope is 3.00 and the intercept is -1.39, which is log(0.25): since log(Y) = log(0.25) + 3 log(X), the log-log regression has exactly recovered both the power and the constant in Y = 0.25 X^3.
new.val = 6
log.pred = coef(mlm)[1] + coef(mlm)[2]*log(new.val)
exp( log.pred )
## (Intercept)
## 54
Y[6]
## [1] 54
We use 6 above so that we can check the back-transformed prediction against the true value, Y[6]. Try some other values too.
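Here is one way to try different values: rebuild the model and loop over several new x values, comparing the back-transformed prediction to the true curve:

```r
X = 1:10
Y = 0.25 * X^3

# Same log-log fit as above (lm accepts log() right in the formula)
mlm = lm( log(Y) ~ log(X) )

# Back-transform the log-scale prediction for several new values
for (new.val in c(2, 6, 9)) {
  log.pred = coef(mlm)[1] + coef(mlm)[2] * log(new.val)
  cat( new.val, ": predicted", exp(log.pred), "true", 0.25 * new.val^3, "\n" )
}
```

Because the fake data follow the power law exactly, the predictions match the true curve to within rounding; with real, noisy data the agreement would only be approximate.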