
Homework 3 R Tutorial

This tutorial will give you some more practice with R and also reinforce some concepts regarding the standard
error and sampling distributions. It will also help you do HW 3.

How to use this tutorial

Read through the tutorial, typing in the commands directly into the console as you go. For each step, once
you get the provided code running, you will be asked to tweak that code to get different effects or to answer
different questions. As you go, experiment and try different things. Pay attention to the error messages you get when you
forget a parenthesis or make a similar mistake; learning to read them will help you write code more easily in the future.
Hit the up-arrow key to recall previously typed commands. You can then edit them in the console and rerun them.

Exploring What a Standard Error Is

I would like to reinforce the idea of a Standard Error.


We are going to do this by taking samples of the Galton data (on people’s heights) and looking at how the
samples vary.
First, we are going to pretend our Galton data is a population and look at what happens when we take
samples from it. So let’s look at the actual average height of our population:

library(mosaicData)
data(Galton)
heights = Galton$height
mu = mean(heights)
mu

## [1] 66.76

Now let’s take a sample! Here we take one sample of size 10 and compute the mean:

smp0 = sample( heights, 10 )


mean( smp0 )

## [1] 66.8

It turns out that the standard error of the sample mean for a random sample of size n is the population standard
deviation divided by the square root of n. So for our samples of size 10, the standard error is:

SE = sd( heights ) / sqrt( 10 )


SE

## [1] 1.133

Now, how many standard errors is our sample away from the real mean? This is a z-score:

( mean(smp0) - mu ) / SE

## [1] 0.03469

Take 4 more samples. How far were each of these sample means from the true mean?

smp1 = sample( heights, 10 )


mean( smp1 )

## [1] 66.37

smp2 = sample( heights, 10 )


mean( smp2 )

## [1] 66.17

smp3 = sample( heights, 10 )


mean( smp3 )

## [1] 65.65

smp4 = sample( heights, 10 )


mean( smp4 )

## [1] 66.52

Each time you take a sample you get a random number, i.e. the sample mean. We expect that random
number to be about µ, the real population mean, but we know that sampling is not perfect. Any given
sample is probably going to be about 1 SE away from the true mean. Look at your five samples. On average,
were they about 1 SE away from the true mean?
Take the standard deviation of your five means. Is the standard deviation about equal to your estimated
standard error?

sd( c( mean(smp0), mean(smp1), mean(smp2), mean(smp3), mean(smp4) ) )

## [1] 0.4309

Note: the above has two parts. First we make a list of our five means with c(). Second, we take the standard
deviation of our list of five numbers with sd().
The extra bits above hopefully illustrate the fundamental idea of the math we are doing. In real life, we only
have one sample. However, the way we understand our one sample is by thinking about what would happen
if we did our sampling over and over again. We then assume that our one sample is “typical” (i.e., close to
the population mean) and then base our inference on that assumption.
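We can see this "over and over again" idea directly by simulating it. The following sketch takes 1000 samples of size 10 from the Galton heights and compares the spread of their means to the formula SE:

```r
library(mosaicData)   # for the Galton data
data(Galton)
heights = Galton$height

# Simulate the sampling distribution: 1000 sample means of size-10 samples.
many.means = replicate( 1000, mean( sample( heights, 10 ) ) )

# The SD of the simulated means should be close to the formula SE.
sd( many.means )            # roughly 1.1, varying run to run
sd( heights ) / sqrt( 10 )  # the formula value, 1.133
```

The two numbers should agree within a few percent; the histogram of many.means is the sampling distribution the formula describes.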

A crazier bootstrap
Let’s do a bootstrap on something called a trimmed mean. A trimmed mean is a mean where you drop the
top and bottom x percent of the data.
Here is some fake data along with the mean and an 80% trimmed mean:

lst = c( 1, 1, 1, 1, 2, 3, 4, 5, 6, 10000)
length(lst)

## [1] 10

mean(lst)

## [1] 1002

mean(lst, trim=0.1)

## [1] 2.875

Q: What happens when you trim 50% (0.5) from each end? Does having a 70% trimmed mean make any
sense?
The trimmed mean is the mean of the middle 80%, after lopping off 10% from each end. (Here 10% is 1 data
point, so the trimmed mean is the mean of the middle 8 points). We can check:

mean( c( 1, 1, 1, 2, 3, 4, 5, 6 ) )

## [1] 2.875

Here is the trimmed mean of the Hollywood movie budgets:

library( Lock5Data )
library( mosaic )   # provides favstats() and resample()
data( HollywoodMovies2011 )
HollywoodMovies2011 = subset( HollywoodMovies2011, !is.na( Budget ) )
bdg = HollywoodMovies2011$Budget
favstats( bdg )

## min Q1 median Q3 max mean sd n missing
## 0.2 20.25 36.5 70 250 53.48 49.17 134 0

mean( bdg, trim=0.1 )

## [1] 45.61

Note how it is lower, since we have dropped the high outliers.


Let’s now take the trimmed mean of a bootstrap sample:

smp = resample(HollywoodMovies2011)
mean( smp$Budget, trim=0.1 )

## [1] 45.93

And bootstrap it!

boots = replicate( 10000, {
smp = resample(HollywoodMovies2011)
mean( smp$Budget, trim=0.1 )
})

And finally get the SE (and plot a histogram)

sd(boots)

## [1] 4.313

hist(boots)

[Histogram of boots: x-axis boots, ranging from about 30 to 65; y-axis Frequency]

Getting coefficients from a linear model

You can get the coefficients out of your linear model like so:

data( Galton )
my.lm = lm( height ~ father + mother, data=Galton )
coef( my.lm )

## (Intercept) father mother
## 22.3097 0.3799 0.2832

coef( my.lm )[1]

## (Intercept)
## 22.31

coef( my.lm )[2]

## father
## 0.3799

This is useful since you can use it to predict new observations:

new.f = 72
new.m = 72
coefs = coef( my.lm )
coefs[1] + coefs[2]*new.f + coefs[3]*new.m

## (Intercept)
## 70.05

Q: Predict your own height based on your mother and father’s height.

Predicting with linear models

The easiest way to predict with a linear model is to just do it by hand. But R can do it for you, if you so
desire, via the predict() command. Let’s look at some examples.

library( Lock5Data )
data( RestaurantTips )
my.lm = lm( Tip ~ Bill, data=RestaurantTips )
new.dat = data.frame( Bill=c(10,20,40) )
new.dat

## Bill
## 1 10
## 2 20
## 3 40

predict( my.lm, new.dat )

## 1 2 3
## 1.530 3.352 6.996

We have made a new data table (new.dat) and then asked predict() to predict tips for our new bills.
Notice that the variable name Bill in new.dat matches the variable in the model exactly.
Q: What is the predicted tip for a $50 bill? $1000 bill?
Q: Try getting the coefficients of this model using the stuff above.
This can be used for multiple variables too:

my.lm2 = lm( Tip ~ Bill + Server, data=RestaurantTips )
new.dat = data.frame( Bill=c(20,20,20), Server=c("A","B","C") )
new.dat

## Bill Server
## 1 20 A
## 2 20 B
## 3 20 C

predict( my.lm2, new.dat )

## 1 2 3
## 3.540 3.228 3.253

Q: Now predict for different Bill amounts and different Servers.


Q: Change your model to include credit card (use the Credit variable) and predict a tip for a $100 bill for a
credit card user and a cash user.

(Optional) Playing with log-transforms

The following introduces the idea of transforming your data, and gives a nice trick for learning about curved
relationships between two variables.
Say we ran a linear model on the following data:

X = 1:10
Y = 0.25 * X^(3.0)
plot( Y ~ X )
bad.lm = lm( Y ~ X )
abline( bad.lm )
[Scatterplot of Y against X with the fitted line from bad.lm]
Q: Why is running a linear model on these X and Y not appropriate?
We can transform our data. Here we take the log of both X and Y.

lX = log(X)
lY = log(Y)

Now let's plot and run a linear model on the transformed data.

plot( lY ~ lX )
mlm = lm( lY ~ lX )
abline( mlm )
[Scatterplot of lY against lX with the fitted line from mlm]

summary(mlm)

## Warning: essentially perfect fit: summary may be unreliable

##
## Call:
## lm(formula = lY ~ lX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.25e-15 -5.39e-16 2.39e-16 4.96e-16 9.09e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.39e+00 6.17e-16 -2.25e+15 <2e-16 ***
## lX 3.00e+00 3.71e-16 8.09e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.16e-16 on 8 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.54e+31 on 1 and 8 DF, p-value: <2e-16

Note we have a perfect fit!


Q: Play with different values for the constants 0.25 and 3 in your Y variable and see what happens when you
make a log-log plot. How do the slope and intercept correspond to these values?
You can use your linear model to predict, but you have to take the logging into account: the model predicts log(Y), so you must exponentiate to get back to Y:

new.val = 6
log.pred = coef(mlm)[1] + coef(mlm)[2]*log(new.val)
exp( log.pred )

## (Intercept)
## 54

Y[6]

## [1] 54

We use 6, above, so we can check the prediction against Y[6], which we know exactly. But try different values.
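To see why the log-log fit works so well, note that taking logs of Y = 0.25 * X^3 gives log(Y) = log(0.25) + 3*log(X), so the intercept and slope of the fit recover the original constants. A quick check:

```r
# Fit the log-log model and back out the constants of Y = 0.25 * X^3.
X = 1:10
Y = 0.25 * X^3
mlm = lm( log(Y) ~ log(X) )

exp( coef(mlm)[1] )   # about 0.25, the multiplicative constant
coef( mlm )[2]        # about 3, the exponent
```

In general, a power law Y = a * X^b shows up on a log-log plot as a line with intercept log(a) and slope b.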
