


Statistical inference for data science

A companion to the Coursera Statistical Inference Course


by Brian Caffo


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License

Statistical inference for data science


Table of Contents

About this book
About the picture on the cover
1. Introduction
Before beginning
Statistical inference defined
Summary notes
The goals of inference
The tools of the trade
Different thinking about probability leads to different styles of inference
Exercises
2. Probability
Where to get a more thorough treatment of probability
Kolmogorov's Three Rules
Consequences of The Three Rules
Random variables
Probability mass functions
Probability density functions
CDF and survival function
Quantiles
Exercises
3. Conditional probability
Conditional probability, motivation
Conditional probability, definition
Bayes' rule
Diagnostic Likelihood Ratios
Independence
IID random variables
Exercises
4. Expected values
The population mean for discrete random variables
The sample mean
Continuous random variables
Simulation experiments
Summary notes
Exercises
5. Variation
The variance
The sample variance
Simulation experiments
The standard error of the mean
Data example
Summary notes
Exercises
6. Some common distributions
The Bernoulli distribution
Binomial trials
The normal distribution
The Poisson distribution
Exercises
7. Asymptopia
Asymptotics
Limits of random variables
The Central Limit Theorem
CLT simulation experiments
Confidence intervals
Simulation of confidence intervals
Poisson interval
Summary notes
Exercises
8. t Confidence intervals
Small sample confidence intervals
Gosset's t distribution
The data
Independent group t confidence intervals
Confidence interval
Mistakenly treating the sleep data as grouped
Unequal variances
Summary notes
Exercises
9. Hypothesis testing
Hypothesis testing
Types of errors in hypothesis testing
Discussion relative to court cases
Building up a standard of evidence
General rules
Two sided tests
T test in R
Connections with confidence intervals
Two group intervals
Exact binomial test
Exercises
10. P-values
Introduction to P-values
What is a P-value?
The attained significance level
Binomial P-value example
Poisson example
Exercises
11. Power
Power
Question
Notes
T-test power
Exercises
12. The bootstrap and resampling
The bootstrap
The bootstrap principle
Group comparisons via permutation tests
Permutation tests
Variations on permutation testing
Permutation test B v C
Exercises
About this book
This book is written as a companion book to the Statistical Inference Coursera class
as part of the Data Science Specialization. However, if you do not take the class, the
book mostly stands on its own. A useful component of the book is a series of
YouTube videos that comprise the Coursera class.

The book is intended to be a low cost introduction to the important field of statistical inference. The intended audience is students who are numerically and computationally literate and who would like to put those skills to use in Data Science or Statistics. The book is offered for free as a series of markdown documents on GitHub and in more convenient forms (epub, mobi) on LeanPub and retail outlets.

This book is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, which requires author attribution for derivative works, non-commercial use of derivative works and that changes are shared in the same way as the original work.

About the picture on the cover


The picture on the cover is a public domain image taken from Wikipedia's article on Francis Galton's quincunx. Francis Galton was a 19th century polymath who invented many of the key concepts of statistics. The quincunx was an ingenious invention for illustrating the central limit theorem using a pinball setup.

1. Introduction

Before beginning
This book is designed as a companion to the Statistical Inference Coursera class as
part of the Data Science Specialization, a ten course program offered by three faculty,
Jeff Leek, Roger Peng and Brian Caffo, at the Johns Hopkins University Department
of Biostatistics.

The videos associated with this book can be watched in full here, though the
relevant links to specific videos are placed at the appropriate locations throughout.

Before beginning, we assume that you have a working knowledge of the R programming language. If not, there is a wonderful Coursera class by Roger Peng that can be found here.

The entirety of the book is on GitHub here. Please submit pull requests if you find errata! In addition, the course notes can also be found on GitHub here. While most code is in the book, all of the code for every figure and analysis in the book is in the R markdown files (.Rmd) for the respective lectures.

Finally, we should mention swirl (statistics with interactive R programming). swirl is an intelligent tutoring system developed by Nick Carchedi, with contributions by Sean Kross and Bill and Gina Croft. It offers a way to learn R in R. Download swirl here. There's a swirl module for this course! Try it out; it's probably the most effective way to learn.

Statistical inference defined


Watch this video before beginning.

We'll define statistical inference as the process of generating conclusions about a population from a noisy sample. Without statistical inference we're simply living within our data. With statistical inference, we're trying to generate new knowledge.

Knowledge and parsimony (using the simplest reasonable models to explain complex phenomena) go hand in hand. Probability models will serve as our parsimonious description of the world. The use of probability models as the connection between our data and a population represents the most effective way to obtain inference.

Motivating example: who’s going to win the election?


In every major election, pollsters would like to know, ahead of the actual election,
who’s going to win. Here, the target of estimation (the estimand) is clear, the
percentage of people in a particular group (city, state, county, country or other
electoral grouping) who will vote for each candidate.

We cannot poll everyone. Even if we could, some polled may change their vote by
the time the election occurs. How do we collect a reasonable subset of data and
quantify the uncertainty in the process to produce a good guess at who will win?

Motivating example, predicting the weather


When a weatherman tells you the probability that it will rain tomorrow is 70%,
they’re trying to use historical data to predict tomorrow’s weather - and to actually
attach a probability to it. That probability refers to a population.

Motivating example, brain activation


An example that’s very close to the research I do is trying to predict what areas of
the brain activate when a person is put in the fMRI scanner. In that case, people are
doing a task while in the scanner. For example, they might be tapping their finger.
We’d like to compare when they are tapping their finger to when they are not
tapping their finger and try to figure out what areas of the brain are associated
with the finger tapping.

Summary notes
These examples illustrate many of the difficulties of trying to use data to create
general conclusions about a population.

Paramount among our concerns are:

Is the sample representative of the population that we'd like to draw inferences about?
Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions?
Is there systematic bias created by missing data or the design or conduct of the study?
What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization or random sampling, or implicit as the aggregation of many complex unknown processes.
Are we trying to estimate an underlying mechanistic model of phenomena under study?
Statistical inference requires navigating the set of assumptions and tools and
subsequently thinking about how to draw conclusions from data.

The goals of inference


You should recognize the goals of inference. Here we list five examples of
inferential goals.

1. Estimate and quantify the uncertainty of an estimate of a population quantity (the proportion of people who will vote for a candidate).

2. Determine whether a population quantity is a benchmark value ("is the treatment effective?").

3. Infer a mechanistic relationship when quantities are measured with noise ("What is the slope for Hooke's law?").

4. Determine the impact of a policy ("If we reduce pollution levels, will asthma rates decline?").

5. Talk about the probability that something occurs.

The tools of the trade


Several tools are key to the use of statistical inference. We’ll only be able to cover a
few in this class, but you should recognize them anyway.

1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest.

2. Random sampling: concerned with obtaining data that is representative of the population of interest.

3. Sampling models: concerned with creating a model for the sampling process, the most common being the so called "iid" model.

4. Hypothesis testing: concerned with decision making in the presence of uncertainty.

5. Confidence intervals: concerned with quantifying uncertainty in estimation.

6. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are approximated.

7. Study design: the process of designing an experiment to minimize biases and variability.

8. Nonparametric bootstrapping: the process of using the data to, with minimal probability model assumptions, create inferences.

9. Permutation, randomization and exchangeability testing: the process of using data permutations to perform inferences.

Different thinking about probability leads to different styles of inference
We won’t spend too much time talking about this, but there are several different
styles of inference. Two broad categories that get discussed a lot are:

1. Frequency probability: is the long run proportion of times an event occurs in independent, identically distributed repetitions.

2. Frequency style inference: uses frequency interpretations of probabilities to control error rates. Answers questions like "What should I decide given my data controlling the long run proportion of mistakes I make at a tolerable level."

3. Bayesian probability: is the probability calculus of beliefs, given that beliefs follow certain rules.

4. Bayesian style inference: the use of Bayesian probability representation of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what should I believe now?"

Data scientists tend to fall within shades of gray of these and various other schools
of inference. Furthermore, there are so many shades of gray between the styles of
inferences that it is hard to pin down most modern statisticians as either Bayesian
or frequentist. In this class, we will primarily focus on basic sampling models, basic
probability models and frequency style analyses to create standard inferences. This
is the most popular style of inference by far.

Being data scientists, we will also consider some inferential strategies that
rely heavily on the observed data, such as permutation testing and bootstrapping.
As probability modeling will be our starting point, we first build up basic probability.

Exercises
1. The goal of statistical inference is to?
  Infer facts about a population from a sample.
  Infer facts about the sample from a population.
  Calculate sample quantities to understand your data.
  To torture Data Science students.

2. The goal of randomization of a treatment in a randomized trial is to?
  It doesn't really do anything.
  To obtain a representative sample of subjects from the population of interest.
  Balance unobserved covariates that may contaminate the comparison between the treated and control groups.
  To add variation to our conclusions.

3. Probability is a?
  Population quantity that we can potentially estimate from data.
  A data quantity that does not require the idea of a population.

2. Probability

Watch this video before beginning.

Probability forms the foundation for almost all treatments of statistical inference.
In our treatment, probability is a law that assigns numbers to the long run
occurrence of random phenomena after repeated unrelated realizations.

Before we begin discussing probability, let’s dispense with some deep philosophical
questions, such as “What is randomness?” and “What is the fundamental
interpretation of probability?”. One could spend a lifetime studying these questions
(and some have). For our purposes, randomness is any process occurring without
apparent deterministic patterns. Thus we will treat many things as if they were
random when, in fact they are completely deterministic. In my field, biostatistics,
we often model disease outcomes as if they were random when they are the result
of many mechanistic components whose aggregate behavior appears random.
Probability for us will be the long run proportion of times some event occurs in
repeated unrelated realizations. So, think of the proportion of times that you get a
head when flipping a coin.

For the interested student, I would recommend the books and work by Ian Hacking
to learn more about these deep philosophical issues. For us data scientists, the
above definitions will work fine.
Where to get a more thorough treatment of probability
In this lecture, we will cover the fundamentals of probability at a low enough
level to have a basic understanding for the rest of the series. For a more complete
treatment see the class Mathematical Biostatistics Boot Camp 1, which can be
viewed on YouTube here. In addition, there’s the actual Coursera course that I run
periodically (this is the first Coursera class that I ever taught). Also there are a set of
notes on GitHub. Finally, there’s a follow up class, uninspiringly named Mathematical
Biostatistics Boot Camp 2, that is more devoted to biostatistical topics that has an
associated YouTube playlist, Coursera Class and GitHub notes.

Kolmogorov’s Three Rules


Watch this lecture before beginning.

Given a random experiment (say rolling a die) a probability measure is a population quantity that summarizes the randomness. The brilliant discovery of the father of probability, the Russian mathematician Kolmogorov, was that to satisfy our intuition about how probability should behave, only three rules were needed.

Consider an experiment with a random outcome. Probability takes a possible outcome from an experiment and:

1. assigns it a number between 0 and 1

2. requires that the probability that something occurs is 1

3. requires that the probability of the union of any two sets of outcomes that have nothing in common (mutually exclusive) is the sum of their respective probabilities.

From these simple rules all of the familiar rules of probability can be developed.
This all might seem a little odd at first and so we’ll build up our intuition with some
simple examples based on coin flipping and die rolling.

I would like to reiterate the important definition that we wrote out: mutually
exclusive. Two events are mutually exclusive if they cannot both simultaneously
occur. For example, we cannot simultaneously get a 1 and a 2 on a die. Rule 3 says
that since the event of getting a 1 and 2 on a die are mutually exclusive, the
probability of getting at least one (the union) is the sum of their probabilities. So if
we know that the probability of getting a 1 is 1/6 and the probability of getting a 2
is 1/6, then the probability of getting a 1 or a 2 is 2/6, the sum of the two
probabilities since they are mutually exclusive.
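If you'd like to check this numerically, here is a small simulation sketch of my own (not from the text); the seed and number of rolls are arbitrary choices:

# Estimate P(roll is a 1 or a 2) on a fair die by simulation
set.seed(1)
rolls <- sample(1:6, 10000, replace = TRUE)
mean(rolls %in% c(1, 2))   # should be close to 2/6, about 0.333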

Consequences of The Three Rules


Let's cover some consequences of our three simple rules. Take, for example, the fact that the probability that something occurs is 1 minus the probability of the opposite occurring. Let $A$ be the event that we get a 1 or a 2 on a rolled die. Then $A^c$ is the opposite, getting a 3, 4, 5 or 6. Since $A$ and $A^c$ cannot both simultaneously occur, they are mutually exclusive. So the probability that either $A$ or $A^c$ occurs is $P(A) + P(A^c)$. Notice that the probability that either occurs is the probability of getting a 1, 2, 3, 4, 5 or 6, or in other words, the probability that something occurs, which is 1 by rule number 2. So we have that $1 = P(A) + P(A^c)$ or that $P(A^c) = 1 - P(A)$.
We won’t go through this tedious exercise (since Kolmogorov already did it for us).
Instead here’s a list of some of the consequences of Kolmogorov’s rules that are
often useful.

1. The probability that nothing occurs is 0.

2. The probability that something occurs is 1.

3. The probability of something is 1 minus the probability that the opposite occurs.

4. The probability of at least one of two (or more) things that cannot simultaneously occur (mutually exclusive) is the sum of their respective probabilities.

5. For any two events the probability that at least one occurs is the sum of their probabilities minus the probability of their intersection.

This last rule states that $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, which shows what the issue is with adding probabilities that are not mutually exclusive. If we do this, we've added the probability that both occur in twice! (Watch the video where I draw a Venn diagram to illustrate this.)

Example of Implementing Probability Calculus


The National Sleep Foundation (www.sleepfoundation.org) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one sleep problem of these sorts? In other words, can we simply add these two probabilities?

Answer: No, the events can simultaneously occur and so are not mutually exclusive. To elaborate let:

$A_1 = \{\text{person has sleep apnea}\}$ and $A_2 = \{\text{person has restless leg syndrome}\}$.

Then

$P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2) = 0.13 - P(A_1 \cap A_2)$.

Given the scenario, it's likely that some fraction of the population has both. This example serves as a reminder: don't add probabilities unless the events are mutually exclusive. We'll have a similar rule for multiplying probabilities and independence.
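As a rough numerical illustration of my own (which assumes, purely for the sake of the arithmetic, that the two conditions occur independently; in reality they likely do not), the overcounting looks like this:

# Hypothetical calculation assuming independence of the two sleep problems
p_apnea <- 0.03                 # P(sleep apnea)
p_rls   <- 0.10                 # P(restless leg syndrome)
p_both  <- p_apnea * p_rls      # only valid under the independence assumption
p_apnea + p_rls - p_both        # P(at least one) = 0.127, not 0.13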

Random variables
Watch this video before reading this section

Probability calculus is useful for understanding the rules that probabilities must
follow. However, we need ways to model and think about probabilities for numeric
outcomes of experiments (broadly defined). Densities and mass functions for
random variables are the best starting point for this. You’ve already heard of a
density since you’ve heard of the famous “bell curve”, or Gaussian density. In this
section you’ll learn exactly what the bell curve is and how to work with it.

Remember, everything we're talking about up to this point is a population quantity, not a statement about what occurs in our data. Think about the fact that a 50% probability for a head is a statement about the coin and how we're flipping it, not a statement about the percentage of heads we obtained in a particular set of
flips. This is an important distinction that we will emphasize over and over in this
course. Statistical inference is about describing populations using data. Probability
density functions are a way to mathematically characterize the population. In this
course, we’ll assume that our sample is a random draw from the population.

So our definition is that a random variable is a numerical outcome of an experiment. The random variables that we study will come in two varieties, discrete or continuous. Discrete random variables are random variables that take on only a countable number of possibilities. Mass functions will assign probabilities that they take specific values. Continuous random variables can conceptually take any value on the real line or some subset of the real line and we talk about the probability that they lie within some range. Densities will characterize these probabilities.

Let's consider some examples of measurements that could be considered random variables. First, familiar gambling experiments like the tossing of a coin and the rolling of a die produce random variables. For the coin, we typically code a tail as a 0 and a head as a 1. (For the die, the number facing up would be the random variable.) We will use these examples a lot to help us build intuition. However, they aren't that interesting on their own, in the sense of seeming very contrived. Nonetheless, the coin example is particularly useful since many of the experiments we consider will be modeled as if tossing a biased coin. Modeling any binary characteristic from a random sample of a population can be thought of as a coin toss, with the random sampling playing the role of the toss and the population percentage of individuals with the characteristic playing the role of the probability of a head. Consider, for example, logging whether or not subjects were hypertensive in a random sample. Each subject's outcome can be modeled as a coin toss. In a similar sense the die roll serves as our model for phenomena with more than one level, such as hair color or rating scales.

Consider also the random variable of the number of web hits for a site each day.
This variable is a count, but is largely unbounded (or at least we couldn’t put a
specific reasonable upper limit). Random variables like this are often modeled with
the so called Poisson distribution.

Finally, consider some continuous random variables. Think of things like lengths or
weights. It is mathematically convenient to model these as if they were continuous
(even if measurements were truncated liberally). In fact, even discrete random
variables with lots of levels are often treated as continuous for convenience.

For all of these kinds of random variables, we need convenient mathematical functions to model the probabilities of collections of realizations. These functions,
called mass functions and densities, take possible values of the random variables,
and assign the associated probabilities. These entities describe the population of
interest. So, consider the most famous density, the normal distribution. Saying that
body mass indices follow a normal distribution is a statement about the population
of interest. The goal is to use our data to figure out things about that normal
distribution, where it’s centered, how spread out it is and even whether our
assumption of normality is warranted!
Probability mass functions
A probability mass function evaluated at a value corresponds to the probability that a random variable takes that value. To be a valid pmf a function, $p$, must satisfy:

1. It must always be larger than or equal to 0.

2. The sum over the possible values that the random variable can take has to add up to one.

Example
Let $X$ be the result of a coin flip where $X = 0$ represents tails and $X = 1$ represents heads. $p(x) = 1/2$ for $x = 0, 1$. Suppose that we do not know whether or not the coin is fair; let $\theta$ be the probability of a head expressed as a proportion (between 0 and 1). $p(x) = \theta^x (1 - \theta)^{1 - x}$ for $x = 0, 1$.
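A quick way to check the two pmf conditions in R; this is my own sketch, with theta standing in for the unknown probability of a head:

# Bernoulli pmf: p(x) = theta^x * (1 - theta)^(1 - x) for x = 0, 1
pmf <- function(x, theta) theta^x * (1 - theta)^(1 - x)
theta <- 0.3              # a hypothetical coin bias
pmf(0:1, theta)           # both values are nonnegative (0.7 and 0.3)
sum(pmf(0:1, theta))      # and they add up to one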

Probability density functions


Watch this video before beginning.

A probability density function (pdf) is a function associated with a continuous random variable. Because of the peculiarities of treating measurements as having been recorded to infinite decimal expansions, we need a different set of rules. This
leads us to the central dogma of probability density functions:

Areas under PDFs correspond to probabilities for that random variable

Therefore, when one says that intelligence quotients (IQs) in a population follow a bell curve, they are saying that the probability of a randomly selected person from this population having an IQ between two values is given by the area under the bell curve.

Not every function can be a valid probability density function. For example, if the
function dips below zero, then we could have negative probabilities. If the function
contains too much area underneath it, we could have probabilities larger than one.
The following two rules tell us when a function is a valid probability density
function.

Specifically, to be a valid pdf, a function must satisfy

1. It must be larger than or equal to zero everywhere.

2. The total area under it must be one.

Example
Suppose that the proportion of help calls that get addressed in a random day by a help line is given by $f(x) = 2x$ for $0 < x < 1$. The R code for plotting this density is

Code for plotting the density

x <- c(-0.5, 0, 1, 1, 1.5)
y <- c(0, 0, 2, 0, 0)
plot(x, y, lwd = 3, frame = FALSE, type = "l")
The result of the code is given below.

Help call density

Is this a mathematically valid density? To answer this we need to make sure it satisfies our two conditions. First, it's clearly nonnegative (it's at or above the horizontal axis everywhere). The area is similarly easy. Being a right triangle in the only section of the density that is above zero, we can calculate it as 1/2 the area of the base times the height. This is $\frac{1}{2} \times 1 \times 2 = 1$.

Now consider answering the following question. What is the probability that 75% or
fewer of calls get addressed? Remember, for continuous random variables,
probabilities are represented by areas underneath the density function. So, we
want the area from 0.75 and below, as illustrated by the figure below.

Help call density

This again is a right triangle, with length of the base as 0.75 and height 1.5. The R
code below shows the calculation.

> 1.5 * 0.75/2

[1] 0.5625

Thus, the probability of 75% or fewer calls getting addressed in a random day for
this help line is 56%. We’ll do this a lot throughout this class and work with more
useful densities. It should be noted that this specific density is a special case of the
so called beta density. Below I show how to use R’s built in evaluation function for
the beta density to get the probability.

> pbeta(0.75, 2, 1)

[1] 0.5625

Notice the syntax pbeta. In R, a prefix of p returns probabilities, d returns the density,
q returns the quantile and r returns generated random variables. (You’ll learn what
each of these does in subsequent sections.)
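For example, sticking with the beta density used above, the four prefixes look like this (my own illustration; the last line returns different numbers each run unless you set a seed):

dbeta(0.75, 2, 1)    # d: the density (height of the curve) at 0.75, which is 1.5
pbeta(0.75, 2, 1)    # p: the probability of 0.75 or less, 0.5625
qbeta(0.5625, 2, 1)  # q: the value with 0.5625 of the area below it, 0.75
rbeta(3, 2, 1)       # r: three random draws from this density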

CDF and survival function


Certain areas of PDFs and PMFs are so useful, we give them names. The cumulative distribution function (CDF) of a random variable, $X$, returns the probability that the random variable is less than or equal to the value $x$. Notice the (slightly annoying) convention that we use an upper case $X$ to denote a random, unrealized, version of the random variable and a lowercase $x$ to denote a specific number that we plug into. (This notation, as odd as it may seem, dates back to Fisher and isn't going anywhere, so you might as well get used to it. Uppercase for unrealized random variables and lowercase as placeholders for numbers to plug into.) So we could write the following to describe the distribution function $F$:

$F(x) = P(X \leq x)$

This definition applies regardless of whether the random variable is discrete or continuous. The survival function of a random variable $X$ is defined as the probability that the random variable is greater than the value $x$:

$S(x) = P(X > x)$

Notice that $S(x) = 1 - F(x)$, since the survival function evaluated at a particular value of $x$ is calculating the probability of the opposite event (greater than as opposed to less than or equal to). The survival function is often preferred in biostatistical applications while the distribution function is more generally used (though both convey the same information).

Example
What are the survival function and CDF from the density considered before?

$F(x) = P(X \leq x) = \frac{1}{2} \times x \times 2x = x^2$

for $0 \leq x \leq 1$. Notice that calculating the survival function is now trivial given that we've already calculated the distribution function: $S(x) = 1 - F(x) = 1 - x^2$.

Again, R has a function that calculates the distribution function for us in this case, pbeta. Let's try calculating $F(0.4)$, $F(0.5)$ and $F(0.6)$.

> pbeta(c(0.4, 0.5, 0.6), 2, 1)


[1] 0.16 0.25 0.36

Notice, of course, these are simply the numbers squared. By default the prefix p in front of a density in R gives the distribution function (pbeta, pnorm, pgamma). If you want the survival function values, you could always subtract them from one, or give the argument lower.tail = FALSE to the function, which asks R to calculate the upper area instead of the lower.
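For instance, both of the following give the survival function values at the same three points (my own check):

# Survival function S(x) = 1 - F(x) for this beta(2, 1) density
1 - pbeta(c(0.4, 0.5, 0.6), 2, 1)
pbeta(c(0.4, 0.5, 0.6), 2, 1, lower.tail = FALSE)
# Both return 0.84 0.75 0.64, i.e. one minus the numbers squared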

Quantiles
You’ve heard of sample quantiles. If you were the 95th percentile on an exam, you
know that 95% of people scored worse than you and 5% scored better. These are
sample quantities. But you might have wondered, what are my sample quantiles
estimating? In fact, they are estimating the population quantiles. Here we define
these population analogs.

The $\alpha^{th}$ quantile of a distribution with distribution function $F$ is the point $x_\alpha$ so that

$F(x_\alpha) = \alpha$

So the 0.95 quantile of a distribution is the point so that 95% of the mass of the
density lies below it. Or, in other words, the point so that the probability of getting
a randomly sampled point below it is 0.95. This is analogous to the sample
quantiles where the 0.95 sample quantile is the value so that 95% of the data lies
below it.

A percentile is simply a quantile with $\alpha$ expressed as a percent rather than a proportion. The (population) median is the $50^{th}$ percentile. Remember that percentiles are not probabilities! Remember that quantiles have units. So the population median height is the height (in inches say) so that the probability that a randomly selected person from the population is shorter is 50%. The sample, or empirical, median would be the height in a sample so that 50% of the people in the sample were shorter.
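As a quick check of this correspondence (my own simulation sketch, with an arbitrary seed and sample size), the sample median of a large draw from the beta(2, 1) density used earlier lands close to the population median derived in the example below:

set.seed(2)
x <- rbeta(10000, 2, 1)       # a large sample from the beta(2, 1) density
quantile(x, probs = 0.5)      # sample median, roughly 0.71
qbeta(0.5, 2, 1)              # population median, 0.7071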

Example
What is the median of the distribution that we were working with before? We want to solve $0.5 = F(x) = x^2$, resulting in the solution

> sqrt(0.5)

[1] 0.7071

Therefore, 0.7071 of calls being answered on a random day is the median. Or, the
probability that 70% or fewer calls get answered is 50%.

R can approximate quantiles for you for common distributions with the prefix q in
front of the distribution name

> qbeta(0.5, 2, 1)

[1] 0.7071
Exercises
1. Can you add the probabilities of any two events to get the probability of at least one occurring?

2. I define a PMF so that for and we have and . Is this a valid PMF?

3. What is the probability that 75% or fewer calls get answered in a randomly sampled day from the population distribution from this chapter?

4. The 97.5th percentile of a distribution is?

5. Consider influenza epidemics for two parent heterosexual families. Suppose that the probability is 15% that at least one of the parents has contracted the disease. The probability that the father has contracted influenza is 10% while the probability that the mother contracted the disease is 9%. What is the probability that both contracted influenza expressed as a whole number percentage? Watch a video solution to this problem and see a written out solution.

6. A random variable, $X$, is uniform, a box from 0 to 1 of height 1. (So that its density is $f(x) = 1$ for $0 \leq x \leq 1$.) What is its median expressed to two decimal places? Watch a video solution to this problem here and see written solutions here.

7. If a continuous density that never touches the horizontal axis is symmetric about zero, can we say that its associated median is zero? Watch a worked out solution to this problem here and see the question and a typed up answer here.

3. Conditional probability

Conditional probability, motivation


Watch this video before beginning.

Conditioning is a central subject in statistics. If we are given information about a random variable, it changes the probabilities associated with it. For example, the
probability of getting a one when rolling a (standard) die is usually assumed to be
one sixth. If you were given the extra information that the die roll was an odd
number (hence 1, 3 or 5) then conditional on this new information, the probability
of a one is now one third.

This is the idea of conditioning, taking away the randomness that we know to have
occurred. Consider another example, such as the result of a diagnostic imaging test
for lung cancer. What’s the probability that a person has cancer given a positive
test? How does that probability change under the knowledge that a patient has
been a lifetime heavy smoker and both of their parents had lung cancer?
Conditional on this new information, the probability has increased dramatically.

Conditional probability, definition


We can formalize the definition of conditional probability so that the mathematics matches our intuition.

Let $B$ be an event so that $P(B) > 0$. Then the conditional probability of an event $A$ given that $B$ has occurred is:

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

If $A$ and $B$ are unrelated in any way, or in other words independent (discussed more later in the lecture), then

$P(A \mid B) = \frac{P(A) P(B)}{P(B)} = P(A)$

That is, if the occurrence of $B$ offers no information about the occurrence of $A$ (the probability conditional on the information is the same as the probability without the information), we say that the two events are independent.

Example
Consider our die roll example again. Here we have that $B = \{1, 3, 5\}$ (the roll is odd), $A = \{1\}$ and

$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)}{P(B)} = \frac{1/6}{3/6} = \frac{1}{3},$

which exactly mirrors our intuition.
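A small simulation sketch of my own (arbitrary seed and number of rolls) that estimates the same conditional probability:

# Estimate P(roll is 1 | roll is odd) by simulation
set.seed(3)
rolls <- sample(1:6, 100000, replace = TRUE)
odd   <- rolls %% 2 == 1                # condition on the roll being odd
mean(rolls[odd] == 1)                   # should be close to 1/3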

Bayes’ rule
Watch this video before beginning

Bayes’ rule is a famous result in statistics and probability. It forms the foundation
for large branches of statistical thinking. Bayes’ rule allows us to reverse the
conditioning set provided that we know some marginal probabilities.

Why is this useful? Consider our lung cancer example again. It would be relatively easy for physicians to calculate the probability that the diagnostic method is positive for people with lung cancer and negative for people without. They could take several people who are already known to have the disease and apply the test and conversely take people known not to have the disease. However, for the collection of people with a positive test result, the reverse probability is more of interest: "given a positive test what is the probability of having the disease?", and "given a negative test what is the probability of not having the disease?".

Bayes' rule allows us to switch the conditioning event, provided a little bit of extra information. Formally Bayes' rule is:

$P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A \mid B) P(B) + P(A \mid B^c) P(B^c)}$

Diagnostic tests
Since diagnostic tests are a really good example of Bayes’ rule in practice, let’s go
over them in greater detail. (In addition, understanding Bayes’ rule will be helpful
for your own ability to understand medical tests that you see in your daily life). We
require a few definitions first.

Let $+$ and $-$ be the events that the result of a diagnostic test is positive or negative, respectively. Let $D$ and $D^c$ be the events that the subject of the test has or does not have the disease, respectively.

The sensitivity is the probability that the test is positive given that the subject actually has the disease, $P(+ \mid D)$.

The specificity is the probability that the test is negative given that the subject does not have the disease, $P(- \mid D^c)$.

So, conceptually at least, the sensitivity and specificity are straightforward to estimate. Take people known to have and not have the disease and apply the
diagnostic test to them. However, the reality of estimating these quantities is quite
challenging. For example, are the people known to have the disease in its later
stages, while the diagnostic will be used on people in the early stages where it’s
harder to detect? Let’s put these subtleties to the side and assume that they are
known well.

The quantities that we’d like to know are the predictive values.

The positive predictive value is the probability that the subject has the disease given that the test is positive, $P(D \mid +)$.

The negative predictive value is the probability that the subject does not have the disease given that the test is negative, $P(D^c \mid -)$.

Finally, we need one last thing, the prevalence of the disease, which is the marginal probability of disease, $P(D)$. Let's now try to figure out a PPV in a specific setting.
specific setting.

Example
A study comparing the efficacy of HIV tests reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%. Suppose that a subject, from a population with a 0.1% prevalence of HIV, receives a positive test result. What is the positive predictive value?

Mathematically, we want $P(D \mid +)$ given the sensitivity, $P(+ \mid D) = 0.997$, the specificity, $P(- \mid D^c) = 0.985$, and the prevalence $P(D) = 0.001$:

$P(D \mid +) = \frac{P(+ \mid D) P(D)}{P(+ \mid D) P(D) + P(+ \mid D^c) P(D^c)} = \frac{0.997 \times 0.001}{0.997 \times 0.001 + 0.015 \times 0.999} \approx 0.062$

In this population a positive test result only suggests a 6% probability that the subject has the disease (the positive predictive value is 6% for this test). If you were wondering how it could be so low for this test, the low positive predictive value is due to the low prevalence of disease and the somewhat modest specificity.
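The calculation itself is a one-liner in R; this is my own sketch of Bayes' rule applied to the stated numbers:

sens <- 0.997; spec <- 0.985; prev <- 0.001
ppv  <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
ppv                                     # roughly 0.062, i.e. about 6%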

Suppose it was known that the subject was an intravenous drug user and routinely
had intercourse with an HIV infected partner? Our prevalence would change
dramatically, thus increasing the PPV. You might wonder if there’s a way to
summarize the evidence without appealing to an often unknowable prevalence?
Diagnostic likelihood ratios provide this for us.

Diagnostic Likelihood Ratios


The diagnostic likelihood ratios summarize the evidence of disease given a positive or negative test. They are defined as:

The diagnostic likelihood ratio of a positive test, labeled $DLR_+$, is $P(+ \mid D) / P(+ \mid D^c)$, which is the $\frac{\text{sensitivity}}{1 - \text{specificity}}$.

The diagnostic likelihood ratio of a negative test, labeled $DLR_-$, is $P(- \mid D) / P(- \mid D^c)$, which is the $\frac{1 - \text{sensitivity}}{\text{specificity}}$.

How do we interpret the DLRs? This is easiest when looking at so called odds ratios. Remember that if $p$ is a probability, then $p / (1 - p)$ is the odds. Consider now the odds in our setting. Using Bayes' rule, we have

$P(D \mid +) = \frac{P(+ \mid D) P(D)}{P(+)}$

and

$P(D^c \mid +) = \frac{P(+ \mid D^c) P(D^c)}{P(+)}.$

Therefore, dividing these two equations we have:

$\frac{P(D \mid +)}{P(D^c \mid +)} = \frac{P(+ \mid D)}{P(+ \mid D^c)} \times \frac{P(D)}{P(D^c)}$

In other words, the post test odds of disease is the pretest odds of disease times the $DLR_+$. Similarly, $DLR_-$ relates the decrease in the odds of the disease after a negative test result to the odds of disease prior to the test.

So, the DLRs are the factors by which you multiply your pretest odds to get your post test odds. Thus, if a test has a $DLR_+$ of 6, regardless of the prevalence of disease, the post test odds is six times that of the pretest odds.

HIV example revisited


Let's reconsider our HIV antibody test again. Suppose a subject has a positive HIV test:

$DLR_+ = 0.997 / (1 - 0.985) \approx 66$

The result of the positive test is that the odds of disease is now 66 times the pretest odds. Or, equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis of no disease.

Suppose instead that a subject has a negative test result:

$DLR_- = (1 - 0.997) / 0.985 \approx 0.003$

Therefore, the post-test odds of disease is now 0.3% of the pretest odds given the negative test. Or, the hypothesis of disease is supported 0.003 times that of the hypothesis of absence of disease given the negative test result.
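The two ratios used above follow directly from the stated sensitivity and specificity; here is a quick sketch of the arithmetic (my own):

sens <- 0.997; spec <- 0.985
dlr_pos <- sens / (1 - spec)    # about 66: a positive test multiplies the odds by ~66
dlr_neg <- (1 - sens) / spec    # about 0.003: a negative test scales the odds to ~0.3%
c(dlr_pos, dlr_neg)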

Independence
Watch this video before beginning.

Statistical independence of events is the idea that the events are unrelated.
Consider successive coin flips. Knowledge of the result of the first coin flip tells us
nothing about the second. We can formalize this into a definition.

Two events $A$ and $B$ are independent if

$P(A \cap B) = P(A) P(B)$

Equivalently, if $P(A \mid B) = P(A)$. Note that since $A$ is independent of $B$ we know that $A^c$ is independent of $B$, $A$ is independent of $B^c$, and $A^c$ is independent of $B^c$.

While this definition works for sets, remember that random variables are really the things that we are interested in. Two random variables, $X$ and $Y$, are independent if for any two sets $A$ and $B$

$P([X \in A] \cap [Y \in B]) = P(X \in A) P(Y \in B)$
We will almost never work with these definitions. Instead, the important principle is
that probabilities of independent things multiply! This has numerous
consequences, including the idea that we shouldn’t multiply non-independent
probabilities.

Example
Let's cover a very simple example: "What is the probability of getting two consecutive heads?". Then we have that $A$ is the event of getting a head on flip 1, $B$ is the event of getting a head on flip 2 and $A \cap B$ is the event of getting heads on flips 1 and 2. Then independence would tell us that:

$P(A \cap B) = P(A) P(B) = 0.5 \times 0.5 = 0.25$
This is exactly what we would have intuited of course. But, it’s nice that the
mathematics mirrors our intuition. In more complex settings, it’s easy to get
tripped up. Consider the following famous (among statisticians at least) case study.
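A simulation sketch of my own (arbitrary seed and number of flips) confirming that the proportion of two-head outcomes is about 0.25:

set.seed(4)
flip1 <- rbinom(100000, 1, 0.5)   # first fair coin flip (1 = head)
flip2 <- rbinom(100000, 1, 0.5)   # second, independent flip
mean(flip1 == 1 & flip2 == 1)     # should be close to 0.5 * 0.5 = 0.25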

Case Study
Volume 309 of Science reports on a physician who was on trial for expert testimony in a criminal trial. Based on an estimated prevalence of sudden infant death syndrome (SIDS) of 1 out of 8,543, a physician testified that the probability of a mother having two children with SIDS was $\left(\frac{1}{8{,}543}\right)^2$. The mother on trial was convicted of murder.

Relevant to this discussion, the principal mistake was to assume that the events of having SIDS within a family are independent. That is, $P(A_1 \cap A_2)$ is not necessarily equal to $P(A_1) P(A_2)$. This is because biological processes that have a believed genetic or familial environmental component, of course, tend to be dependent within families. Thus, we can't just multiply the probabilities to obtain the result.

There are many other interesting aspects to the case. For example, the idea of a low probability of an event representing evidence against a defendant. (Could we convict all lottery winners of fixing the lottery since the chance that they would win is so small?)

IID random variables


Now that we've introduced random variables and independence, we can introduce a central modeling assumption made in statistics: the idea of a random sample. Random variables are said to be independent and identically distributed (iid) if they are independent and all are drawn from the same population. The reason iid samples are so important is that they are a model for random samples. This is a default starting point for most statistical inferences.

The idea of having a random sample is powerful for a variety of reasons. Consider
that in some study designs, such as in election polling, great pains are made to
make sure that the sample is randomly drawn from a population of interest. The
idea is to expend a lot of effort on design to get robust inferences. In these settings
assuming that the data is iid is both natural and warranted.

In other settings, the study design is far more opaque, and statistical inferences are
conducted under the assumption that the data arose from a random sample, since
it serves as a useful benchmark. Most studies in the fields of epidemiology and
economics fall under this category. Take, for example, studying how policies impact
countries' gross domestic product by looking at countries before and after enacting
the policies. The countries are not a random sample from the set of countries.
Instead, conclusions must be made under the assumption that the countries are a
random sample and the interpretation of the strength of the inferences adapted in
kind.

Exercises
1. I pull a card from a deck and do not show you the result. I say that the resulting card is a heart. What is the probability that it is the queen of hearts?

2. The odds associated with a probability, $p$, are defined as?

3. The probability of getting two sixes when rolling a pair of dice is?

4. The probability that a manuscript gets accepted to a journal is 12% (say). However, given that a revision is asked for, the probability that it gets accepted is 90%. Is it possible that the probability that a manuscript has a revision asked for is 20%? Watch a video of this problem getting solved and see the worked out solutions here.

5. Suppose 5% of housing projects have issues with asbestos. The sensitivity of a test for asbestos is 93% and the specificity is 88%. What is the probability that a housing project has no asbestos given a negative test, expressed as a percentage to the nearest percentage point? Watch a video solution here and see the worked out problem here.

4. Expected values

Watch this video before beginning.


Expected values characterize a distribution. The most useful expected value, the
mean, characterizes the center of a density or mass function. Another expected
value summary, the variance, characterizes how spread out a density is. Yet
another expected value calculation is the skewness, which considers how much a
density is pulled toward high or low values.

Remember, in this lecture we are discussing population quantities. It is convenient (and of course by design) that the names for all of the sample analogs estimate the
associated population quantity. So, for example, the sample or empirical mean
estimates the population mean; the sample variance estimates the population
variance and the sample skewness estimates the population skewness.

The population mean for discrete random variables


The expected value or (population) mean of a random variable is the center of its distribution. For a discrete random variable $X$ with PMF $p(x)$, it is defined as follows:

$E[X] = \sum_x x p(x)$

where the sum is taken over the possible values of $x$. Where did they get this idea from? It's taken from the physical idea of the center of mass. Specifically, $E[X]$ represents the center of mass of a collection of locations and weights, $\{x, p(x)\}$. We can exploit this fact to quickly calculate population means for distributions where the center of mass is obvious.

The sample mean


It is important to contrast the population mean (the estimand) with the sample mean (the estimator). The sample mean estimates the population mean. Not coincidentally, since the population mean is the center of mass of the population distribution, the sample mean is the center of mass of the data. In fact, it's exactly the same equation:

$\bar{X} = \sum_{i=1}^n x_i p(x_i),$

where $p(x_i) = 1/n$.

Example Find the center of mass of the bars


Let's go through an example illustrating how the sample mean is the center of
mass of observed data. Below we plot Galton’s fathers and sons data:

Loading in and displaying the Galton data

library(UsingR); data(galton); library(ggplot2); library(reshape2)
longGalton <- melt(galton, measure.vars = c("child", "parent"))
g <- ggplot(longGalton, aes(x = value)) + geom_histogram(aes(y = ..density.., fill = variable), binwidth = 1, color = "black") + geom_density(size = 2)
g <- g + facet_grid(. ~ variable)
g
Galton’s Data

Using RStudio's manipulate package, you can try moving the histogram around and see
what value balances it out. Be sure to watch the video to see this in action.

Using manipulate to explore the mean

library(manipulate)
myHist <- function(mu){
g <- ggplot(galton, aes(x = child))
g <- g + geom_histogram(fill = "salmon",
binwidth=1, aes(y = ..density..), color = "black")
g <- g + geom_density(size = 2)
g <- g + geom_vline(xintercept = mu, size = 2)
mse <- round(mean((galton$child - mu)^2), 3)
g <- g + labs(title = paste('mu = ', mu, ' MSE = ', mse))
g
}
manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))

Going through this exercise, you find that the point that balances out the
histogram is the empirical mean. (Note there’s a small distinction here that comes
about from rounding with the histogram bar widths, but ignore that for the time
being.) If the bars of the histogram are from the observed data, the point that
balances it out is the empirical mean; if the bars are the true population
probabilities (which we don’t know of course) then the point is the population
mean. Let’s now go through some examples of mathematically calculating the
population mean.

The center of mass is the empirical mean


Histogram illustration

Example of a population mean, a fair coin


Watch the video before beginning here.

Suppose a coin is flipped and $X$ is declared 0 or 1 corresponding to a head or a tail, respectively. What is the expected value of $X$?

$E[X] = 0.5 \times 0 + 0.5 \times 1 = 0.5$

Note, if thought about geometrically, this answer is obvious; if two equal weights are spaced at 0 and 1, the center of mass will be 0.5.

Fair coin mass function

What about a biased coin?


Suppose that a random variable, $X$, is so that $P(X = 1) = p$ and $P(X = 0) = 1 - p$. (This is a biased coin when $p \neq 0.5$.) What is its expected value?

$E[X] = 0 \times (1 - p) + 1 \times p = p$

Notice that the expected value isn't a value that the coin can take, in the same way that the sample proportion of heads will also likely be neither 0 nor 1.

This coin example is not exactly trivial as it serves as the basis for a random sample
of any population for a binary trait. So, we might model the answer from an
election polling question as if it were a coin flip.

Example Die Roll


Suppose that a die is rolled and $X$ is the number face up. What is the expected value of $X$?

$E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5$

Again, the geometric argument makes this answer obvious without calculation.

Bar graph of die probabilities
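The center-of-mass calculation is easy to write out in R (my own check):

x <- 1:6
p <- rep(1/6, 6)
sum(x * p)      # expected value of a die roll, 3.5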

Continuous random variables


Watch this video before beginning.

For a continuous random variable, $X$, with density, $f$, the expected value is again exactly the center of mass of the density. Think of it like cutting the continuous density out of a thick piece of wood and trying to find the point where it balances out.

Example
Consider a density where $f(x) = 1$ for $x$ between zero and one. Suppose that $X$ follows this density; what is its expected value?
Uniform Density

The answer is clear since the density looks like a box, it would balance out exactly
in the middle, 0.5.

Facts about expected values


Recall that expected values are properties of population distributions. The
expected value, or mean, height is the center of the population density of heights.

Of course, the average of ten randomly sampled people's heights is itself a random variable, in the same way that the average of ten die rolls is itself a random number. Thus, the distribution of heights gives rise to the distribution of averages of ten heights in the same way that the distribution associated with a die roll gives rise to the distribution of the average of ten dice.

An important question to ask is: “What does the distribution of averages look like?”.
This question is important, since it tells us things about averages, the best way to
estimate the population mean, when we only get to observe one average.

Consider the die rolls again. If you wanted to know the distribution of averages of 100
die rolls, you could (at least in principle) roll 100 dice, take the average and repeat
that process. Imagine, if you could only roll the 100 dice once. Then we would have
direct information about the distribution of die rolls (since we have 100 of them),
but we wouldn’t have any direct information about the distribution of the average
of 100 die rolls, since we only observed one average.

Fortunately, the mathematics tells us about that distribution. Notably, it’s centered
at the same spot as the original distribution! Thus, the distribution of the estimator
(the sample mean) is centered at what it's estimating (the population mean). When the expected value of an estimator is what it's trying to estimate, we say that the estimator is unbiased.

Let’s go through several simulation experiments to see this more fully.

Simulation experiments
Standard normals
Consider simulating a lot of standard normals and plotting a histogram (the blue
density). Now consider simulating lots of averages of 10 standard normals and
plotting their histogram (the salmon colored density). Notice that they’re centered
in the same spot! It’s also more concentrated around that point. (We’ll discuss that
more in the next lectures).

Simulation of normals
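A sketch of the kind of simulation behind this figure (my own code with arbitrary choices of seed and 10,000 simulations, not the course .Rmd):

library(ggplot2)
set.seed(5)
n_sim <- 10000
singles  <- rnorm(n_sim)                                   # lots of standard normals
averages <- rowMeans(matrix(rnorm(n_sim * 10), ncol = 10)) # lots of averages of 10
dat <- data.frame(x = c(singles, averages),
                  what = rep(c("single normals", "averages of 10"), each = n_sim))
ggplot(dat, aes(x = x, fill = what)) + geom_density(alpha = 0.5)
# Both densities are centered at 0; the averages are much more concentrated.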

Averages of x die rolls


Consider rolling a die a lot of times and taking a histogram of the results, that’s the
left most plot. The bars are equally distributed at the six possible outcomes and
thus the histogram is centered around 3.5. Now consider simulating lots of
averages of 2 dice. Its histogram is also centered at 3.5. So it is for 3 and 4. Notice also the distribution gets increasingly Gaussian looking (like a bell curve) and increasingly concentrated around 3.5.

Simulation of die rolls
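A similar sketch of my own for the dice (arbitrary seed; averages of 4 rolls as an example):

set.seed(6)
avg4 <- replicate(10000, mean(sample(1:6, 4, replace = TRUE)))
hist(avg4, breaks = 20)    # bell-ish shape centered near 3.5
mean(avg4)                 # close to 3.5, the center of a single roll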

Averages of x coin flips


For the coin flip simulation exactly the same occurs. All of the distributions are
centered around 0.5.
Simulation of coin flips

Summary notes
Expected values are properties of distributions.
The population mean is the center of mass of the population.
The sample mean is the center of mass of the observed data.
The sample mean is an estimate of the population mean.
The sample mean is unbiased: the population mean of its distribution is the mean that it's trying to estimate.
The more data that goes into the sample mean, the more concentrated its density / mass function is around the population mean.

Exercises
1. A standard die takes the values 1, 2, 3, 4, 5, 6 with equal probability. What is the expected value?

2. Consider a density that is uniform from -1 to 1. (I.e. has height equal to 1/2 and looks like a box starting at -1 and ending at 1.) What is the mean of this distribution?

3. If a population has mean $\mu$, what is the mean of the distribution of averages of 20 observations from this distribution?

4. You are playing a game with a friend where you flip a coin and if it comes up heads you give her $X$ dollars and if it comes up tails she gives you $Y$ dollars. The odds that the coin is heads is . What is your expected earnings? Watch a video of the solution to this problem and look at the problem and the solution here.

5. If you roll ten standard dice, take their average, then repeat this process over and over and construct a histogram, what would it be centered at? Watch a video solution here and see the original problem here.

5. Variation

The variance
Watch this video before beginning.

Recall that the mean of a distribution was a measure of its center. The variance, on the other hand, is a measure of spread. To get a sense, the plot below shows a series of increasing variances.
Distributions with increasing variance

We saw another example of how variances changed in the last chapter when we looked at the distribution of averages; they were always centered at the same spot as the original distribution, but are less spread out. Thus, it is less likely for sample means to be far away from the population mean than it is for individual observations. (This is why the sample mean is a better estimate of the population mean than any single observation.)

If $X$ is a random variable with mean $\mu$, the variance of $X$ is defined as

$Var(X) = E[(X - \mu)^2] = E[X^2] - E[X]^2$

The rightmost equation is the shortcut formula that is almost always used for calculating variances in practice.
Thus the variance is the expected (squared) distance from the mean. Densities with
a higher variance are more spread out than densities with a lower variance. The
square root of the variance is called the standard deviation. The main benefit of
working with standard deviations is that they have the same units as the data,
whereas the variance has the units squared.

In this class, we’ll only cover a few basic examples for calculating a variance.
Otherwise, we’re going to use the ideas without the formalism. Also remember,
what we’re talking about is the population variance. It measures how spread out
the population of interest is, unlike the sample variance which measures how
spread out the observed data are. Just like the sample mean estimates the
population mean, the sample variance will estimate the population variance.

Example
What's the variance from the result of a toss of a die? First recall that $E[X] = 3.5$, as we discussed in the previous lecture. Then let's calculate the other bit of information that we need, $E[X^2]$:

$E[X^2] = 1^2 \times \frac{1}{6} + 2^2 \times \frac{1}{6} + 3^2 \times \frac{1}{6} + 4^2 \times \frac{1}{6} + 5^2 \times \frac{1}{6} + 6^2 \times \frac{1}{6} \approx 15.17$

Thus now we can calculate the variance as:

$Var(X) = E[X^2] - E[X]^2 \approx 15.17 - 3.5^2 \approx 2.92$
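The same arithmetic in R (my own check):

x <- 1:6
p <- rep(1/6, 6)
sum(x^2 * p) - sum(x * p)^2   # E[X^2] - E[X]^2, about 2.92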


Example
What's the variance from the result of the toss of a (potentially biased) coin with probability of heads (1) of $p$? First recall that $E[X] = 0 \times (1 - p) + 1 \times p = p$. Secondly, recall that since $X$ is either 0 or 1, $X^2 = X$. So we know that:

$E[X^2] = E[X] = p.$

Thus we can now calculate the variance of a coin flip as

$Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p).$

This is a well known formula, so it's worth committing to memory. It's interesting to note that this function is maximized at $p = 0.5$. The plot below shows this by plotting $p(1 - p)$ by $p$.

Plotting the binomial variance

p <- seq(0, 1, length = 1000)
y <- p * (1 - p)
plot(p, y, type = "l", lwd = 3, frame = FALSE)

Plot of the binomial variance

The sample variance


The sample variance is the estimator of the population variance. Recall that the
population variance is the expected squared deviation around the population
mean. The sample variance is (almost) the average squared deviation of
observations around the sample mean. It is given by

$$S^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n - 1}$$

The sample standard deviation is the square root of the sample variance. Note
again that the sample variance is almost, but not quite, the average squared
deviation from the sample mean since we divide by $n - 1$ instead of $n$. Why do
we do this you might ask? To answer that question we have to think in the terms of
simulations. Remember that the sample variance is a random variable, thus it has a
distribution and that distribution has an associated population mean. That mean is
the population variance that we're trying to estimate if we divide by $n - 1$
rather than $n$.
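To make the formula concrete, here is a minimal R sketch (the data vector is made up for illustration) comparing the by-hand calculation with R's built-in var():

x <- c(4.7, 5.1, 6.3, 5.8, 4.9)    # a small made-up sample
n <- length(x)
sum((x - mean(x))^2) / (n - 1)     # the sample variance by hand
var(x)                             # matches R's built-in estimator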

It is also nice that as we collect more data the distribution of the sample variance
gets more concentrated around the population variance that it’s estimating.

Simulation experiments
Watch this video before beginning.

Simulating from a population with variance 1


Let's try simulating collections of standard normals and taking the variance. If we
repeat this over and over, we get a sense of the distribution of sample variances.
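The book shows the resulting histograms; a minimal sketch of this kind of simulation (the sample size of 10 and the 10,000 repetitions are choices made here for illustration) is:

nosim <- 10000
## sample variances of nosim samples of 10 standard normals
vars10 <- apply(matrix(rnorm(nosim * 10), nosim), 1, var)
hist(vars10)    # centered near 1, the population variance
mean(vars10)    # close to 1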

Simulation of variances of samples of standard normals

Notice that these histograms are always centered in the same spot, 1. In other
words, the sample variance is an unbiased estimate of the population variance.
Notice also that they get more concentrated around 1 as more data goes into
them. Thus, sample variances comprised of more observations are less variable
than sample variances comprised of fewer.

Variances of x die rolls


Let’s try the same thing, now only with die rolls instead of simulating standard
normals. In this experiment, we simulated samples of die rolls, took the variance
and then repeated that process over and over. What is plotted are histograms of
the collections of sample variances.
Simulated distributions of variances of dies

Recall that we calculated the variance of a die roll as 2.92 earlier on in this chapter.
Notice each of the histograms is centered there. In addition, they get more
concentrated around 2.92 as the variances are comprised of more dice.

The standard error of the mean


At last, we finally get to a perhaps very surprising (and useful) fact: how to estimate
the variability of the mean of a sample, when we only get to observe one
realization. Recall that the average of a random sample from a population is itself a
random variable having a distribution, which in simulation settings we can explore
by repeatedly sampling averages. We know that this distribution is centered around
the population mean, $\mu$. We also know the variance of the distribution of
means of random samples.

The variance of the sample mean is $\sigma^2 / n$, where $\sigma^2$ is the variance
of the population being sampled from.

This is very useful, since we don't have repeated sample means to get its variance
directly using the data. We already know a good estimate of $\sigma^2$ via the sample
variance. So, we can get a good estimate of the variability of the mean, even
though we only get to observe one mean.

Notice also this explains why in all of our simulation experiments the variance of
the sample mean kept getting smaller as the sample size increased. This is because
of the square root of the sample size in the denominator.

Often we take the square root of the variance of the mean to get the standard
deviation of the mean. We call the standard deviation of a statistic its standard
error.

Summary notes
The sample variance, $S^2$, estimates the population variance, $\sigma^2$.

The distribution of the sample variance is centered around $\sigma^2$.

The variance of the sample mean is $\sigma^2 / n$.

Its logical estimate is $S^2 / n$.

The logical estimate of the standard error is $S / \sqrt{n}$.

$\sigma$, the standard deviation, talks about how variable the population is.

$S / \sqrt{n}$, the standard error, talks about how variable averages of random
samples of size $n$ from the population are.

Simulation example 1: standard normals


Watch this video before beginning.

Standard normals have variance 1. Let's try sampling means of 10 standard
normals. If our theory is correct, they should have standard deviation
$1 / \sqrt{10} \approx 0.316$.
Simulating means of random normals

> nosim <- 1000


> n <- 10
## simulate nosim averages of 10 standard normals
> sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean))
[1] 0.3156
## Let's check to make sure that this is sigma / sqrt(n)
> 1 / sqrt(n)
[1] 0.3162

So, in this simulation, we simulated 1000 means of 10 standard normals. Our
theory says the standard deviation of averages of 10 standard normals must be
$1 / \sqrt{10} \approx 0.3162$. Taking the standard deviation of the 1000 simulated means yields nearly exactly
that. (Note that it's only close, 0.3156 versus 0.3162. To get it to be exact, we'd
have to simulate infinitely many means.)

Simulation example 2: uniform density


Standard uniforms have variance $1/12$. Our theory mandates that means of
random samples of $n$ uniforms have sd $1 / \sqrt{12 n}$. Let's try it with a


simulation.

Simulating means of uniforms

> nosim <- 1000


> n <- 10
> sd(apply(matrix(runif(nosim * n), nosim), 1, mean))
[1] 0.09017
> 1 / sqrt(12 * n)
[1] 0.09129

Simulation example 3: Poisson


Poisson(4) random variables have variance 4. Thus means of random samples of
$n$ Poisson(4) random variables should have standard deviation $2 / \sqrt{n}$. Again let's try it out.

Simulating means of Poisson variates

> nosim <- 1000


> n <- 10
> sd(apply(matrix(rpois(nosim * n, 4), nosim), 1, mean))
[1] 0.6219
> 2 / sqrt(n)
[1] 0.6325

Simulation example 4: coin flips


Our last example is an important one. Recall that the variance of a coin flip is
$p(1 - p)$. Therefore the standard deviation of the average of $n$ coin flips should

be $\sqrt{p(1 - p)/n}$.

Let's just do the simulation with a fair coin. Such coin flips have variance 0.25. Thus

means of random samples of $n$ coin flips have sd $1 / (2\sqrt{n})$. Let's try it.

Simulating means of coin flips

> nosim <- 1000


> n <- 10
> sd(apply(matrix(sample(0 : 1, nosim * n, replace = TRUE),
nosim), 1, mean))
[1] 0.1587
> 1 / (2 * sqrt(n))
[1] 0.1581

Data example
Watch this before beginning.

Now let’s work through a data example to show how the standard error of the
mean is used in practice. We’ll use the father.son height data from Francis Galton.

Loading the data

library(UsingR); data(father.son);
x <- father.son$sheight
n<-length(x)

Here's a histogram of the sons' heights from the dataset. Let's calculate different
variances and interpret them in this context.

Histogram of the sons’ heights

Computing the variance estimates

> round(c(var(x), var(x) / n, sd(x), sd(x) / sqrt(n)), 2)


[1] 7.92 0.01 2.81 0.09

The first number, 7.92, and its square root, 2.81, are the estimated variance and
standard deviation of the sons’ heights. Therefore, 7.92 tells us exactly how
variable sons’ heights were in the data and estimates how variable sons’ heights
are in the population. In contrast 0.01, and the square root 0.09, estimate how
variable averages of sons’ heights are.

Therefore, the smaller numbers discuss the precision of our estimate of the mean
of sons’ heights. The larger numbers discuss how variable sons’ heights are in
general.

Summary notes
The sample variance estimates the population variance.

The distribution of the sample variance is centered at what it's estimating.

It gets more concentrated around the population variance with larger sample
sizes.

The variance of the sample mean is the population variance divided by $n$.

The square root of that is the standard error.

It turns out that we can say a lot about the distribution of averages from
random samples, even though we only get one to look at in a given data set.

Exercises
1. If I have a random sample from a population, the sample variance is an estimate of?

The population standard deviation.

The population variance.

The sample variance.

The sample standard deviation.

2. The distribution of the sample variance of a random sample from a population is
centered at what?

The population variance.

The population mean.

3. I keep drawing samples of size $n$ from a population with variance $\sigma^2$ and taking
their average. I do this thousands of times. If I were to take the variance of the
collection of averages, about what would it be?

4. You get a random sample of $n$ observations from a population and take their
average. You would like to estimate the variability of averages of $n$
observations from this population to better understand how precise of an estimate
it is. Do you need to repeatedly collect averages to do this?

No, we can multiply our estimate of the population variance by $1/n$ to get a
good estimate of the variability of the average.

Yes, you have to get repeat averages.

5. A random variable takes the value -4 with probability .2 and 1 with probability .8.
What is the variance of this random variable? Watch a video solution to this problem
and look at a version with a worked out solution.

6. If $\bar{X}$ and $\bar{Y}$ are comprised of n iid random variables arising from distributions
having means $\mu_x$ and $\mu_y$, respectively, and common variance $\sigma^2$, what is the
variance of $\bar{X} - \bar{Y}$? Watch a video solution to this problem here and see a typed up
solution here.

7. Let $X$ be a random variable having standard deviation $\sigma$. What can be said about
the variance of $X / \sigma$? Watch a video solution to this problem here and typed up
solutions here.

8. Consider the following pmf given in R by the code p <- c(.1, .2, .3, .4) and x <- 2 : 5.
What is the variance? Watch a video solution to this problem here and here is the
problem worked out.

9. If you roll ten standard dice, take their average, then repeat this process over and
over and construct a histogram, what would be its variance expressed to 3 decimal
places? Watch a video solution here and see the text here.

6. Some common distributions

The Bernoulli distribution


The Bernoulli distribution arises as the result of a binary outcome, such as a coin
flip. Thus, Bernoulli random variables take (only) the values 1 and 0 with
probabilities of (say) $p$ and $1 - p$, respectively. Recall that the PMF for a Bernoulli
random variable is

$$P(X = x) = p^x (1 - p)^{1 - x} \quad \mbox{for } x = 0, 1.$$

The mean of a Bernoulli random variable is $p$ and the variance is $p(1 - p)$. If we
let $X$ be a Bernoulli random variable, it is typical to call $X = 1$ a "success" and
$X = 0$ a "failure".

If a random variable follows a Bernoulli distribution with success probability $p$ we
write that $X \sim$ Bernoulli$(p)$.

Bernoulli random variables are commonly used for modeling any binary trait for a
random sample. So, for example, in a random sample whether or not a participant
has high blood pressure would be reasonably modeled as Bernoulli.

Binomial trials
The binomial random variables are obtained as the sum of iid Bernoulli trials. So if
a Bernoulli trial is the result of a coin flip, a binomial random variable is the total
number of heads.

To write it out as mathematics, let $X_1, \ldots, X_n$ be iid Bernoulli$(p)$, then
$X = \sum_{i=1}^n X_i$ is a binomial random variable. We write out that $X \sim$ Binomial$(n, p)$.
The binomial mass function is

$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$$

where $x = 0, \ldots, n$. Recall that the notation

$$\binom{n}{x} = \frac{n!}{x!(n - x)!}$$

(read "$n$ choose $x$") counts the number of ways of selecting $x$ items out of $n$
without replacement disregarding the order of the items. It turns out that
$n$ choose 0 and $n$ choose $n$ are both 1.
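In R, the binomial coefficient is computed by choose(); a couple of quick checks of the facts above:

> choose(5, 0)
[1] 1
> choose(5, 5)
[1] 1
> choose(5, 2)
[1] 10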

Example
Suppose a friend has 8 children, 7 of which are girls and none are twins. If each
gender has an independent 50% probability for each birth, what's the probability
of getting 7 or more girls out of 8 births?
Calculating the probability of 7 or more girls out of 8 births

> choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8


[1] 0.03516
> pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)
[1] 0.03516

The normal distribution


Watch this video before beginning

The normal distribution is easily the handiest distribution in all of statistics. It can
be used in an endless variety of settings. Moreover, as we’ll see later on in the
course, sample means follow normal distributions for large sample sizes.

Remember the goal of probability modeling. We are assuming a probability


distribution for our population as a way of parsimoniously characterizing it. In fact,
the normal distribution only requires two numbers to characterize it. Specifically, a
random variable $X$ is said to follow a normal or Gaussian distribution with mean $\mu$
and variance $\sigma^2$ if the associated density is:

$$\frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x - \mu)^2 / 2\sigma^2}$$

If $X$ is a RV with this density then $E[X] = \mu$ and $Var(X) = \sigma^2$. That is, the
normal distribution is characterized by the mean and variance. We write
$X \sim N(\mu, \sigma^2)$ to denote a normal random variable. When $\mu = 0$ and $\sigma = 1$
the resulting distribution is called the standard normal distribution. Standard
normal RVs are often labeled $Z$.

Consider an example: suppose that intelligence quotients are normally distributed
with a mean of 100 and a standard deviation of 15. Then, we are saying that if we
randomly sample a person from this population, the probability that they have an
IQ of, say, 120 or larger is governed by a normal distribution with a mean of 100
and a variance of $15^2 = 225$.
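For instance, a one-line sketch of that calculation in R (using the distribution just stated) would be:

## probability that a randomly sampled IQ is 120 or larger under N(100, 15^2)
pnorm(120, mean = 100, sd = 15, lower.tail = FALSE)   # about 0.09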

Taken another way, if we know that the population is normally distributed then to
estimate everything about the population, we need only estimate the population
mean and variance. (Estimated by the sample mean and the sample variance.)

Reference quantiles for the standard normal


The normal distribution is so important that it is useful to memorize reference
probabilities and quantiles. The image below shows reference lines at 0, 1, 2 and 3
standard deviations above and below 0. This is for the standard normal; however,
all of the rules apply to non standard normals as 0, 1, 2 and 3 standard deviations
above and below , the population mean.
Standard normal reference lines.

The most relevant probabilities are:

1. Approximately 68%, 95% and 99% of the normal density lies within 1, 2 and 3
standard deviations from the mean, respectively.

2. -1.28, -1.645, -1.96 and -2.33 are the 10th, 5th, 2.5th and 1st percentiles of the
standard normal distribution, respectively.

3. By symmetry, 1.28, 1.645, 1.96 and 2.33 are the 90th, 95th, 97.5th and 99th
percentiles of the standard normal distribution, respectively.
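These reference quantiles are easy to verify in R:

> round(qnorm(c(0.90, 0.95, 0.975, 0.99)), 3)
[1] 1.282 1.645 1.960 2.326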

Shifting and scaling normals


Since the normal distribution is characterized by only the mean and variance,
which are a shift and a scale, we can transform normal random variables to be
standard normals and vice versa. For example, if $X \sim N(\mu, \sigma^2)$ then:

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$

If $Z$ is standard normal, then

$$X = \mu + \sigma Z \sim N(\mu, \sigma^2).$$

We can use these facts to answer questions about
non-standard normals by relating them back to the standard normal.

Example
What is the 95th percentile of a $N(\mu, \sigma^2)$ distribution? Quick answer in R:
qnorm(.95, mean = mu, sd = sigma). Alternatively, because we have the standard normal
quantiles memorized, and we know that 1.645 is its 95th percentile, the answer has
to be $\mu + 1.645 \sigma$.

In general, the quantile is $\mu + z \sigma$, where $z$ is the appropriate standard normal quantile.

To put some context on our previous setting, population mean BMI for men is
reported as 29 with a standard deviation of 4.73. Assuming normality of
BMI, what is the population 95th percentile? The answer is then:

$$29 + 1.645 \times 4.73 = 36.78.$$

Or alternatively, we could simply type qnorm(.95, 29, 4.73) in R.

Now let's reverse the process. Imagine asking what's the probability that a
randomly drawn subject from this population has a BMI less than 24.27? Notice
that

$$\frac{24.27 - 29}{4.73} = -1.$$

Therefore, 24.27 is 1 standard deviation below the mean. We know that 16% of the density
lies below -1 standard deviation from the mean (and 16% lies above +1). Thus the answer is
16%. Alternatively, pnorm(24.27, 29, 4.73) yields the result.

Example
Assume that the number of daily ad clicks for a company is (approximately)
normally distributed with a mean of 1020 and a standard deviation of 50. What's
the probability of getting more than 1,160 clicks in a day? Notice that:

$$\frac{1160 - 1020}{50} = 2.8$$

Therefore, 1,160 is 2.8 standard deviations above the mean. We know from our
standard normal quantiles that the probability of being larger than 2 standard
deviations is 2.5% and 3 standard deviations is far in the tail. Therefore, we know
that the probability has to be smaller than 2.5% and should be very small. We can
obtain it exactly as pnorm(1160, 1020, 50, lower.tail = FALSE), which is 0.3%. Note that we can
also obtain the probability as pnorm(2.8, lower.tail = FALSE).

Example
Consider the previous example again. What number of daily ad clicks would
represent the one where 75% of days have fewer clicks (assuming days are
independent and identically distributed)? We can obtain this as:

Finding a normal quantile

> qnorm(0.75, mean = 1020, sd = 50)


[1] 1054

The Poisson distribution


Watch this video before beginning.

The Poisson distribution is used to model counts. It is perhaps second only to the
normal distribution in usefulness. In fact, the Bernoulli, binomial and multinomial
distributions can all be modeled by clever uses of the Poisson.

The Poisson distribution is especially useful for modeling unbounded counts or


counts per unit of time (rates). Like the number of clicks on advertisements, or the
number of people who show up at a bus stop. (While these are in principle
bounded, it would be hard to actually put an upper limit on it.) There is also a deep
connection between the Poisson distribution and popular models for so-called
event-time data. In addition, the Poisson distribution is the default model for so-
called contingency table data, which is simply tabulations of discrete
characteristics. Finally, when $n$ is large and $p$ is small, the Poisson is an accurate
approximation to the binomial distribution.

The Poisson mass function is:

$$P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$$

for $x = 0, 1, \ldots$. The mean of this distribution is $\lambda$. The variance of this
distribution is also $\lambda$. Notice that $x$ ranges from 0 to $\infty$. Therefore, the Poisson
distribution is especially useful for modeling unbounded counts.
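A quick simulation sketch of the mean-equals-variance property (the choice of lambda = 5 and the number of draws are arbitrary):

x <- rpois(100000, lambda = 5)   # many draws from a Poisson(5)
mean(x)                          # close to 5
var(x)                           # also close to 5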

Rates and Poisson random variables


The Poisson distribution is useful for rates, counts that occur over units of time.
Specifically, if $X \sim \mbox{Poisson}(\lambda t)$ where $\lambda = E[X / t]$ is the expected count per
unit of time and $t$ is the total monitoring time.

Example
The number of people that show up at a bus stop is Poisson with a mean of 2.5 per
hour. If watching the bus stop for 4 hours, what is the probability that 3 or fewer
people show up for the whole time?

Calculating a Poisson probability

> ppois(3, lambda = 2.5 * 4)


[1] 0.01034

Therefore, there is about a 1% chance that 3 or fewer people show up. Notice the
multiplication by four in the function argument. Since lambda is specified as events
per hour we have to multiply by four to consider the number of events that occur
in 4 hours.

Poisson approximation to the binomial


When $n$ is large and $p$ is small the Poisson distribution is an accurate
approximation to the binomial distribution. Formally, if $X \sim \mbox{Binomial}(n, p)$,
then $X$ is approximately Poisson with $\lambda = n p$, provided that $n$ is large and
$p$ is small.

Example, Poisson approximation to the binomial


We flip a coin with success probability 0.01 five hundred times. What’s the
probability of 2 or fewer successes?

Poisson approximation to the binomial in R

> pbinom(2, size = 500, prob = 0.01)


[1] 0.1234
> ppois(2, lambda = 500 * 0.01)
[1] 0.1247

So we can see that the probabilities agree quite well. This approximation is often
done as the Poisson model is a more convenient model in many respects.
Exercises
1. Your friend claims that changing the font to comic sans will result in more ad
revenue on your web sites. When presented in random order, 9 pages out of 10 had
more revenue when the font was set to comic sans. If it was really a coin flip for
these 10 sites, what's the probability of getting 9 or 10 out of 10 with more revenue
for the new font?

2. A software company is doing an analysis of documentation errors of their products.
They sampled their very large codebase in chunks and found that the number of
errors per chunk was approximately normally distributed with a mean of 11 errors
and a standard deviation of 2. When randomly selecting a chunk from their
codebase, what's the probability of fewer than 5 documentation errors?

3. The number of search entries entered at a web site is Poisson at a rate of 9
searches per minute. The site is monitored for 5 minutes. What is the probability of
40 or fewer searches in that time frame?

4. Suppose that the number of web hits to a particular site are approximately normally
distributed with a mean of 100 hits per day and a standard deviation of 10 hits per
day. What's the probability that a given day has fewer than 93 hits per day
expressed as a percentage to the nearest percentage point? Watch a video solution
and see the problem.

5. Suppose that the number of web hits to a particular site are approximately normally
distributed with a mean of 100 hits per day and a standard deviation of 10 hits per
day. What number of web hits per day represents the number so that only 5% of
days have more hits? Watch a video solution and see the problem and solution.

6. Suppose that the number of web hits to a particular site are approximately normally
distributed with a mean of 100 hits per day and a standard deviation of 10 hits per
day. Imagine taking a random sample of 50 days. What number of web hits would
be the point so that only 5% of averages of 50 days of web traffic have more hits?
Watch a video solution and see the problem and solution.

7. You don't believe that your friend can discern good wine from cheap. Assuming
that you're right, in a blind test you randomize 6 paired varieties (Merlot,
Chianti, ...) of cheap and expensive wines. What is the chance that she gets 5 or 6
right? Watch a video solution and see the original problem.

8. The number of web hits to a site is Poisson with mean 16.5 per day. What is the
probability of getting 20 or fewer in 2 days? Watch a video solution and see a written
solution.

7. Asymptopia

Asymptotics
Watch this video before beginning.

Asymptotics is the term for the behavior of statistics as the sample size limits to
infinity. Asymptotics are incredibly useful for simple statistical inference and
approximations. Asymptotics often make hard problems easy and difficult
calculations simple. We will not cover the philosophical considerations in this book,
but it is true nonetheless that asymptotics often lead to a nice understanding of
procedures. In fact, the ideas of asymptotics are so important that they form the basis for
the frequency interpretation of probabilities by considering the long run proportion of
times an event occurs.

Some things to bear in mind about the seemingly magical nature of asymptotics:
there's no free lunch and, unfortunately, asymptotics generally give no assurances
about finite sample performance.

Limits of random variables


We’ll only talk about the limiting behavior of one statistic, the sample mean.
Fortunately, for the sample mean there’s a set of powerful results. These results
allow us to talk about the large sample distribution of sample means of a collection
of iid observations.

The first of these results we intuitively already know. It says that the average limits
to what its estimating, the population mean. This result is called the Law of Large
Numbers. It simply says that if you go to the trouble of collecting an infinite
amount of data, you estimate the population mean perfectly. Note there’s
sampling assumptions that have to hold for this result to be true. The data have to
be iid.

A great example of this comes from coin flipping. Imagine that $\bar{X}_n$ is the average of
the result of $n$ coin flips (i.e. the sample proportion of heads). The Law of Large
Numbers states that as we flip a coin over and over, it eventually converges to the
true probability of a head.

Law of large numbers in action


Let’s try using simulation to investigate the law of large numbers in action. Let’s
simulate a lot of standard normals and plot the cumulative means. If the LLN is
correct, the line should converge to 0, the mean of the standard normal
distribution.

Simulating the law of large numbers with standard normals

n <- 10000
means <- cumsum(rnorm(n))/(1:n)
library(ggplot2)
g <- ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0) + geom_line(size = 2)
g <- g + labs(x = "Number of obs", y = "Cumulative mean")
g

Cumulative average from standard normal simulations.

Law of large numbers in action, coin flip


Let’s try the same thing, but for a fair coin flip. We’ll simulate a lot of coin flips and
plot the cumulative proportion of heads.

Simulating the law of large numbers with coin flips

means <- cumsum(sample(0:1, n, replace = TRUE))/(1:n)


g <- ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0.5) + geom_line(size = 2)
g <- g + labs(x = "Number of obs", y = "Cumulative mean")
g

Cumulative proportion of heads from a sequence of coin flips.

Discussion
An estimator is called consistent if it converges to what you want to estimate. Thus,
the LLN says that the sample mean of iid sample is consistent for the population
mean. Typically, good estimators are consistent; it’s not too much to ask that if we
go to the trouble of collecting an infinite amount of data that we get the right
answer. The sample variance and the sample standard deviation of iid random
variables are consistent as well.

The Central Limit Theorem


Watch this video before beginning.

The Central Limit Theorem (CLT) is one of the most important theorems in
statistics. For our purposes, the CLT states that the distribution of averages of iid
variables becomes that of a standard normal as the sample size increases.
Consider this fact for a second. We already know the mean and standard deviation
of the distribution of averages from iid samples. The CLT gives us an approximation
to the full distribution! Thus, for iid samples, we have a good sense of the distribution
of the average even though: (1) we only observed one average and (2) we don't
know what the population distribution is. Because of this, the CLT applies in an
endless variety of settings and is one of the most important theorems ever
discovered.

The formal result is that

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}$$

has a distribution like that of a standard normal for large $n$. Replacing the
standard error $\sigma / \sqrt{n}$ by its estimated value doesn't change the CLT.

The useful way to think about the CLT is that $\bar{X}_n$ is approximately $N(\mu, \sigma^2 / n)$.

CLT simulation experiments


Let’s try simulating lots of averages from various distributions and showing that the
resulting distribution looks like a bell curve.

Die rolling
Simulate a standard normal random variable by rolling $n$ (six sided) dice.

Let $X_i$ be the outcome for die $i$.

Then note that $\mu = E[X_i] = 3.5$.

Recall also that $Var(X_i) = 2.92$.

The SE of the mean is $\sqrt{2.92 / n} = 1.71 / \sqrt{n}$.

Let's roll $n$ dice, take their mean, subtract off 3.5, and divide by $1.71 / \sqrt{n}$ and
repeat this over and over (a code sketch of this is given after the figure).

Result of the die CLT simulation.

It’s pretty remarkable that the approximation works so well with so few rolls of the
die. So, if you’re stranded on an island, and need to simulate a standard normal
without a computer, but you do have a die, you can get a pretty good
approximation with 10 rolls even.
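Here is the promised sketch of that simulation (1,000 repetitions and n = 10 rolls per average are choices made here for illustration):

nosim <- 1000; n <- 10
## roll n dice nosim times, average each set, then center and scale
dat <- matrix(sample(1:6, nosim * n, replace = TRUE), nosim)
z <- (apply(dat, 1, mean) - 3.5) / (1.71 / sqrt(n))
hist(z)   # roughly bell shaped, centered at 0 with sd near 1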

Coin CLT
In fact the oldest application of the CLT is to the idea of flipping coins (by de Moivre).
Let $X_i$ be the 0 or 1 result of the flip of a possibly unfair coin with success probability $p$.
The sample proportion, say $\hat{p}$, is the average of the coin flips. We know that:

$$E[\hat{p}] = p \quad \mbox{and} \quad Var(\hat{p}) = \frac{p(1 - p)}{n}$$

Furthermore, because of the CLT, we also know that:

$$\frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$$

will be approximately normally distributed.


Let's test this by flipping a fair coin $n$ times, taking the sample proportion of heads,
subtracting off 0.5 and multiplying the result by $2\sqrt{n}$ (i.e. dividing by $1/(2\sqrt{n})$).

Results of the coin CLT simulation.

This convergence doesn't look quite as good as the die, since the coin has fewer
possible outcomes. In fact, among coins of various degrees of bias, the
convergence to normality is governed by how far from 0.5 $p$ is. Let's redo the
simulation, now using $p = 0.9$ instead of $p = 0.5$ like we did before.

Results of the simulation when p=0.9

Notice that the convergence to normality is quite poor. Thus, be careful when using
CLT approximations for sample proportions when your proportion is very close to
0 or 1.

Confidence intervals
Watch this video before beginning.
Confidence intervals are methods for quantifying uncertainty in our estimates. The
fact that the interval has width characterizes that there is randomness that
prevents us from getting a perfect estimate. Let’s go through how a confidence
interval using the CLT is constructed.

According to the CLT, the sample mean, $\bar{X}$, is approximately normal with mean $\mu$
and standard deviation $\sigma / \sqrt{n}$. Furthermore,

$$\mu + 2 \sigma / \sqrt{n}$$

is pretty far out in the tail (only 2.5% of a normal being larger than 2 sds in the tail).
Similarly,

$$\mu - 2 \sigma / \sqrt{n}$$

is pretty far in the left tail (only 2.5% chance of a normal being smaller than 2
standard deviations in the tail). So the probability that $\bar{X}$ is bigger than $\mu + 2\sigma/\sqrt{n}$
or smaller than $\mu - 2\sigma/\sqrt{n}$ is 5%. Or equivalently, the probability that these
limits contain $\mu$ is 95%. The quantity:

$$\bar{X} \pm 2 \sigma / \sqrt{n}$$

is called a 95% interval for $\mu$. The 95% refers to the fact that if one were to
repeatedly get samples of size $n$, about 95% of the intervals obtained would
contain $\mu$. The 97.5th quantile is 1.96 (so I rounded up to 2 above). If instead of a 95%
interval, you wanted a 90% interval, then you want (100 - 90) / 2 = 5% in each tail.
Thus you replace the 2 with the 95th percentile of the standard normal, which is 1.645.
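In R, these quantiles come straight from qnorm:

> qnorm(0.975)   # the quantile behind the "2" in a 95% interval
[1] 1.959964
> qnorm(0.95)    # used for a 90% interval
[1] 1.644854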

Example CI
Give a confidence interval for the average height of sons in Galton’s data.

Finding a confidence interval.

> library(UsingR)
> data(father.son)
> x <- father.son$sheight
> (mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)/sqrt(length(x)))/12
[1] 5.710 5.738

Here we divided by 12 to get our interval in feet instead of inches. So we estimate


the average height of the sons as 5.71 to 5.74 with 95% confidence.

Example using sample proportions


In the event that each $X_i$ is 0 or 1 with common success probability $p$ then
$\sigma^2 = p(1 - p)$. The interval takes the form:

$$\hat{p} \pm z_{1 - \alpha/2} \sqrt{\frac{p(1 - p)}{n}}$$

Replacing $p$ by $\hat{p}$ in the standard error results in what is called a Wald confidence
interval for $p$. Remember also that $p(1 - p)$ is maximized at 1/4. Plugging this in
and setting our quantile as 2 (which is about a 95% interval) we find that a quick
and dirty confidence interval is:

$$\hat{p} \pm \frac{1}{\sqrt{n}}$$

This is useful for doing quick confidence intervals for binomial proportions in your
head.

Example
Your campaign advisor told you that in a random sample of 100 likely voters, 56
intend to vote for you. Can you relax? Do you have this race in the bag? Without
access to a computer or calculator, how precise is this estimate?

> 1/sqrt(100)
[1] 0.1

so a back of the envelope calculation gives an approximate 95% interval of (0.46, 0.66).

Thus, since the interval contains 0.5 and numbers below it, there’s not enough
votes for you to relax; better go do more campaigning!

The basic rule of thumb is then that $1/\sqrt{n}$ gives you a good estimate for the margin
of error of a proportion. Thus, $n = 100$
for about 1 decimal place, 10,000 for 2, 1,000,000 for 3.

> round(1/sqrt(10^(1:6)), 3)
[1] 0.316 0.100 0.032 0.010 0.003 0.001

We could very easily do the full Wald interval, which is less conservative (may
provide a narrower interval). Remember the Wald interval for a binomial
proportion is:

$$\hat{p} \pm z_{1 - \alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

Here’s the R code for our election setting, both coding it directly and using binom.test.

> 0.56 + c(-1, 1) * qnorm(0.975) * sqrt(0.56 * 0.44/100)


[1] 0.4627 0.6573
> binom.test(56, 100)$conf.int
[1] 0.4572 0.6592

Simulation of confidence intervals


It is interesting to note that the coverage of confidence intervals describes an
aggregate behavior. In other words the confidence interval describes the
percentage of intervals that would cover the parameter being estimated if we were
to repeat the experiment over and over. So, one can not technically say that the
interval contains the parameter with probability 95%, say. So called Bayesian
credible intervals address this issue at the expense (or benefit depending on who
you ask) of adopting a Bayesian framework.

For our purposes, we’re using confidence intervals and so will investigate their
frequency performance over repeated realizations of the experiment. We can do
this via simulation. Let's consider different values of $p$ and look at the Wald
interval's coverage when we repeatedly create confidence intervals.
Code for investigating Wald interval coverage

n <- 20
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000
coverage <- sapply(pvals, function(p) {
phats <- rbinom(nosim, prob = p, size = n)/n
ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
mean(ll < p & ul > p)
})

Plot of Wald interval coverage.

The figure shows that if we were to repeatedly try experiments for any fixed value
of $p$, it's rarely the case that our intervals will cover the value that they're trying to
estimate in 95% of them. This is bad, since covering the parameter that it's
estimating 95% of the time is the confidence interval's only job!

So what's happening? Recall that the CLT is an approximation. In this case $n$ isn't
large enough for the CLT to be applicable for many of the values of $p$. Let's see if
the coverage improves for larger $n$.

Code for investigating Wald interval coverage

n <- 100
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000
coverage2 <- sapply(pvals, function(p) {
phats <- rbinom(nosim, prob = p, size = n)/n
ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
mean(ll < p & ul > p)
})
Output of the simulation with n = 100.

Now it looks much better. Of course, increasing our sample size is rarely an option.
There are exact fixes to make this interval work better for small sample sizes.

However, a quick fix is to take your data and add two successes and two
failures. So, for example, in our election example, we would form our interval with
58 votes out of 104 sampled (disregarding that the actual numbers were 56 and
100). This interval is called the Agresti/Coull interval. This interval has much better
coverage. Let’s show it via a simulation.

Code for investigating Agresti/Coull interval coverage when n=20.

n <- 20
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000
coverage <- sapply(pvals, function(p) {
phats <- (rbinom(nosim, prob = p, size = n) + 2)/(n + 4)
ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
mean(ll < p & ul > p)
})

Coverage of the Agresti/Coull interval with n = 20


The coverage is better, if maybe a little conservative in the sense of being over the
95% line most of the time. If the interval is too conservative, it's likely a little too
wide. To see this clearly, imagine if we made our interval $-\infty$ to $\infty$. Then we
would always have 100% coverage in any setting, but the interval wouldn't be
useful. Nonetheless, the Agresti/Coull interval gives a much better trade off
between coverage and width than the Wald interval.

In general, one should use the add two successes and failures method for binomial
confidence intervals with smaller $n$. For very small $n$ consider using an exact
interval (not covered in this class).

Poisson interval
Since the Poisson distribution is so central for data science, let's do a Poisson
confidence interval. Remember that if $X \sim \mbox{Poisson}(\lambda t)$ then our estimate of $\lambda$
is $\hat{\lambda} = X / t$. Furthermore, we know that $Var(\hat{\lambda}) = \lambda / t$ and so the natural
variance estimate is $\hat{\lambda} / t$. While it's not immediate how the CLT applies in this case, the
interval is of the familiar form "estimate plus or minus a quantile times the standard error".

So our Poisson interval is:

$$\hat{\lambda} \pm z_{1 - \alpha/2} \sqrt{\frac{\hat{\lambda}}{t}}$$

Example
A nuclear pump failed 5 times out of 94.32 days. Give a 95% confidence interval for
the failure rate per day.

Code for asymptotic Poisson confidence interval

> x <- 5
> t <- 94.32
> lambda <- x/t
> round(lambda + c(-1, 1) * qnorm(0.975) * sqrt(lambda/t), 3)
[1] 0.007 0.099

A non-asymptotic test, one that guarantees coverage, is also available. But, it has to
be evaluated numerically.

Code for exact Poisson confidence interval

> poisson.test(x, T = 94.32)$conf


[1] 0.01721 0.12371

Simulating the Poisson coverage rate


Let’s see how the asymptotic interval performs for lambda values near what we’re
estimating.

Code for evaluating the coverage of the asymptotic Poisson confidence interval

lambdavals <- seq(0.005, 0.1, by = 0.01)


nosim <- 1000
t <- 100
coverage <- sapply(lambdavals, function(lambda) {
lhats <- rpois(nosim, lambda = lambda * t)/t
ll <- lhats - qnorm(0.975) * sqrt(lhats/t)
ul <- lhats + qnorm(0.975) * sqrt(lhats/t)
mean(ll < lambda & ul > lambda)
})

Coverage of Poisson intervals for various values of lambda

The coverage can be low for low values of lambda. In this case the asymptotics
work as we increase the monitoring time, t. Here's the coverage if we increase $t$
to 1,000.

Coverage of Poisson intervals for various values of lambda and t=1000

Summary notes
The LLN states that averages of iid samples converge to the population means
that they are estimating.

The CLT states that averages are approximately normal, with distributions

centered at the population mean,

with standard deviation equal to the standard error of the mean.

The CLT gives no guarantee that $n$ is large enough.

Taking the mean and adding and subtracting the relevant normal quantile
times the SE yields a confidence interval for the mean.

Adding and subtracting 2 SEs works for 95% intervals.


Confidence intervals get wider as the coverage increases.

Confidence intervals get narrower with less variability or larger sample sizes.

The Poisson and binomial case have exact intervals that don't require the CLT.

But a quick fix for small sample size binomial calculations is to add 2
successes and failures.

Exercises
1. I simulate 1,000,000 standard normals. The LLN says that their sample average
must be close to?

2. About what is the probability of getting 45 or fewer heads out of 100 flips of a fair
coin? (Use the CLT, not the exact binomial calculation.)

3. Consider the father.son data. Using the CLT and assuming that the fathers are a
random sample from a population of interest, what is a 95% confidence interval for
the mean height in inches?

4. The goal of a confidence interval having coverage 95% is to imply that:

If one were to repeatedly collect samples and reconstruct the intervals, around
95% percent of them would contain the true mean being estimated.

The probability that the sample mean is in the interval is 95%.

5. The rate of search entries into a web site was 10 per minute when monitoring for an
hour. Use R to calculate the exact Poisson interval for the rate of events per minute.

6. Consider a uniform distribution. If we were to sample 100 draws from a uniform
distribution (which has mean 0.5 and variance 1/12) and take their mean, $\bar{X}$,
what is the approximate probability of getting as large as 0.51 or larger? Watch this
video solution and see the problem and solution here.

8. t Confidence intervals

Small sample confidence intervals


Watch this video before beginning.

In the previous lecture, we discussed creating a confidence interval using the CLT.
Our intervals took the form:

$$\mbox{Estimate} \pm Z \mbox{ quantile} \times \mbox{SE of the estimate}$$

In this lecture, we discuss some methods for small samples, notably Gosset's t
distribution and t confidence intervals.

These intervals are of the form:

$$\mbox{Estimate} \pm t \mbox{ quantile} \times \mbox{SE of the estimate}$$

So the only change is that we've replaced the Z quantile now with a t quantile.
These are some of the handiest of intervals in all of statistics. If you want a rule
between whether to use a t interval or normal interval, just always use the t
interval.

Gosset’s t distribution
The t distribution was invented by William Gosset (under the pseudonym “Student”)
in 1908. Fisher provided further mathematical details about the distribution later.
This distribution has thicker tails than the normal. It's indexed by a degrees of
freedom parameter and it gets more like a standard normal as the degrees of freedom get
larger. It assumes that the underlying data are iid Gaussian, with the result that

$$\frac{\bar{X} - \mu}{S / \sqrt{n}}$$

follows Gosset's t distribution with $n - 1$ degrees of freedom. (If we replaced $S$
by $\sigma$ the statistic would be exactly standard normal.) The interval is

$$\bar{X} \pm t_{n-1} \frac{S}{\sqrt{n}},$$

where $t_{n-1}$ is the relevant quantile from the t distribution with $n - 1$ degrees of freedom.
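To see how much wider the t quantile makes the interval for a small sample, compare the two quantiles directly (n = 10 is an arbitrary choice):

> n <- 10
> qt(0.975, df = n - 1)   # t quantile used by the t interval
[1] 2.262157
> qnorm(0.975)            # Z quantile used by the CLT interval
[1] 1.959964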

Code for manipulate


You can use RStudio's manipulate function to compare the t and Z distributions.

Code for investigating t and Z densities.

library(ggplot2); library(manipulate)   # manipulate ships with RStudio
k <- 1000
xvals <- seq(-5, 5, length = k)
myplot <- function(df){
d <- data.frame(y = c(dnorm(xvals), dt(xvals, df)),
x = xvals,
dist = factor(rep(c("Normal", "T"), c(k,k))))
g <- ggplot(d, aes(x = x, y = y))
g <- g + geom_line(size = 2, aes(color = dist))
g
}
manipulate(myplot(mu), mu = slider(1, 20, step = 1))

The difference is perhaps easier to see in the tails. Therefore, the following code
plots the upper quantiles of the Z distribution by those of the t distribution.

Code for investigating the upper quantiles of the t and Z densities.

pvals <- seq(.5, .99, by = .01)


myplot2 <- function(df){
d <- data.frame(n= qnorm(pvals),t=qt(pvals, df),
p = pvals)
g <- ggplot(d, aes(x= n, y = t))
g <- g + geom_abline(size = 2, col = "lightblue")
g <- g + geom_line(size = 2, col = "black")
g <- g + geom_vline(xintercept = qnorm(0.975))
g <- g + geom_hline(yintercept = qt(0.975, df))
g
}
manipulate(myplot2(df), df = slider(1, 20, step = 1))

Summary notes
In this section, we give an overview of important facts about the t distribution.

The t interval technically assumes that the data are iid normal, though it is
robust to this assumption.

It works well whenever the distribution of the data is roughly symmetric and
mound shaped.

Paired observations are often analyzed using the t interval by taking
differences.

For large degrees of freedom, t quantiles become the same as standard normal
quantiles; therefore this interval converges to the same interval as the CLT
yielded.

For skewed distributions, the spirit of the t interval assumptions are violated.

Also, for skewed distributions, it doesn't make a lot of sense to center the
interval at the mean.

In this case, consider taking logs or using a different summary like the
median.

For highly discrete data, like binary, other intervals are available.
Example of the t interval, Gosset’s sleep data
Watch this video before beginning.

In R, typing data(sleep) brings up the sleep data originally analyzed in Gosset's
Biometrika paper, which shows the increase in hours of sleep for 10 patients on two
soporific drugs. R treats the data as two groups rather than paired.

The data
Loading the sleep data.

> data(sleep)
> head(sleep)
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
4 -1.2 1 4
5 -0.1 1 5
6 3.4 1 6

Here’s a plot of the data. In this plot paired observations are connected with a line.

A plot of the pairs of observations from the sleep data.

Now let’s calculate the t interval for the differences from baseline to follow up.
Below we give four different ways for calculating the interval.

Calculating the t interval for the sleep data.

g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]


difference <- g2 - g1
mn <- mean(difference); s <- sd(difference); n <- 10
## Calculating directly
mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n)
## using R's built in function
t.test(difference)
## using R's built in function, another format
t.test(g2, g1, paired = TRUE)
## using R's built in function, another format
t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)
## Below are the results (after a little formatting)
[,1] [,2]
[1,] 0.7001 2.46
[2,] 0.7001 2.46
[3,] 0.7001 2.46
[4,] 0.7001 2.46

Therefore, since our interval doesn't include 0, our 95% confidence interval
estimate for the mean change (follow up - baseline) is 0.70 to 2.46.

Independent group t confidence intervals


Watch this video before beginning.

Suppose that we want to compare the mean blood pressure between two groups
in a randomized trial; those who received the treatment to those who received a
placebo. The randomization is useful for attempting to balance unobserved
covariates that might contaminate our results. Because of the randomization, it
would be reasonable to compare the two groups without considering further
variables.

We cannot use the paired t interval that we just used for the sleep data, because the
groups are independent. Person 1 from the treated group has no relationship with
person 1 from the control group. Moreover, the groups may have different sample
sizes, so taking paired differences may not even be possible, even if it were advisable
in this setting.

We now present methods for creating confidence intervals for comparing


independent groups.

Confidence interval
A $(1 - \alpha) \times 100\%$ confidence interval for the mean difference between the
groups, $\mu_y - \mu_x$, is:

$$\bar{Y} - \bar{X} \pm t_{n_x + n_y - 2, 1 - \alpha/2} \, S_p \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}$$

The notation $t_{n_x + n_y - 2, 1 - \alpha/2}$ means a t quantile with $n_x + n_y - 2$ degrees of
freedom. The pooled variance estimator is:

$$S_p^2 = \frac{(n_x - 1) S_x^2 + (n_y - 1) S_y^2}{n_x + n_y - 2}$$

This variance estimate is used if one is willing to assume a constant variance across
the groups. It is a weighted average of the group-specific variances, with greater
weight given to whichever group has the larger sample size.

If there is some doubt about the constant variance assumption, assume a different
variance per group, which we will discuss later.

Mistakenly treating the sleep data as grouped


Let's first go through an example where we treat paired data as if it were
independent. Consider the sleep data from before. In the code below, we do
the calculation for grouped data directly, and using the t.test function.

The sleep data treated as grouped and independent.

n1 <- length(g1); n2 <- length(g2)


sp <- sqrt( ((n1 - 1) * sd(g1)^2 + (n2 - 1) * sd(g2)^2) / (n1 + n2 - 2))
md <- mean(g2) - mean(g1)
semd <- sp * sqrt(1 / n1 + 1/n2)
rbind(
md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd,
t.test(g2, g1, paired = FALSE, var.equal = TRUE)$conf,
t.test(g2, g1, paired = TRUE)$conf
)

The results are:

[,1] [,2]
[1,] -0.2039 3.364
[2,] -0.2039 3.364
[3,] 0.7001 2.460

Notice that the paired interval (the last row) is entirely above zero. The grouped
interval (first two rows) contains zero. Thus, acknowledging the pairing explains
variation that would otherwise be absorbed into the variation for the group means.
As a result, treating the groups as independent results in wider intervals. Even if it
didn’t result in a shorter interval, the paired interval would be correct as the groups
are not statistically independent!

ChickWeight data in R
Now let’s try an example with actual independent groups. Load in the ChickWeight
data in R. We are also going to manipulate the dataset to have a total weight gain
variable using dplyr.

library(datasets); data(ChickWeight); library(reshape2)


##define weight gain or loss
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight")
names(wideCW)[-(1 : 2)] <- paste("time", names(wideCW)[-(1 : 2)], sep = "")
library(dplyr)
wideCW <- mutate(wideCW,
gain = time21 - time0
)

Here’s a plot of the data.

Chickweight data over time.

Here’s a plot only of the weight gain for the diets.


Violin plots of chickweight data by weight gain (final minus baseline) by diet.

Now let’s do a t interval comparing groups 1 and 4. We’ll show the two intervals,
one assuming that the variances are equal and one assuming otherwise.

Code for t interval of the chickWeight data

wideCW14 <- subset(wideCW, Diet %in% c(1, 4))


rbind(
t.test(gain ~ Diet, paired = FALSE, var.equal = TRUE, data = wideCW14)$conf,
t.test(gain ~ Diet, paired = FALSE, var.equal = FALSE, data = wideCW14)$conf
)

[,1] [,2]
[1,] -108.1 -14.81
[2,] -104.7 -18.30

For the time being, let's interpret the equal variance interval. Since the interval is
entirely below zero it suggests that group 1 had less weight gain than group 4 (at
95% confidence).

Unequal variances
Watch this video before beginning.

Under unequal variances our t interval becomes:

$$\bar{Y} - \bar{X} \pm t_{df} \left(\frac{S_x^2}{n_x} + \frac{S_y^2}{n_y}\right)^{1/2}$$

where $t_{df}$ is the t quantile calculated with degrees of freedom:

$$df = \frac{\left(S_x^2 / n_x + S_y^2 / n_y\right)^2}{\left(S_x^2 / n_x\right)^2 / (n_x - 1) + \left(S_y^2 / n_y\right)^2 / (n_y - 1)}$$

which will be approximately a 95% interval. This works really well. So when in
doubt, just assume unequal variances. We present the formula for
completeness; in practice, it's easy to mess up, so make sure to use t.test.
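As a sanity check, here is a minimal sketch of the unequal variance interval computed by hand on two small made-up samples (x and y are hypothetical); it agrees with t.test's Welch interval:

x <- c(3.2, 4.1, 5.0, 4.4, 3.8); y <- c(5.9, 6.3, 7.1, 6.0)
vx <- var(x) / length(x); vy <- var(y) / length(y)
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
mean(y) - mean(x) + c(-1, 1) * qt(0.975, df) * sqrt(vx + vy)
t.test(y, x, var.equal = FALSE)$conf.int   # same interval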

Referring back to the previous ChickWeight example, the violin plots suggest that
considering unequal variances would be wise. Recall the code is

> t.test(gain ~ Diet, paired = FALSE, var.equal = FALSE, data = wideCW14)$conf


[2,] -104.7 -18.30

This interval remains entirely below zero. However, it is wider than the equal
variance interval.

Summary notes
The t distribution is useful for small sample size comparisons.

It technically assumes normality, but is robust to this assumption within limits.

The t distribution gives rise to t confidence intervals (and tests, which we will
see later).

For other kinds of data, there are preferable small and large sample intervals and
tests.

For binomial data, there's lots of ways to compare two groups:

relative risk, risk difference, odds ratio;

Chi-squared tests, normal approximations, exact tests.

For count data, there are also Chi-squared tests and exact tests.

We'll leave the discussions for comparing groups of data for binary and count
data until covering glms in the regression class.

In addition, Mathematical Biostatistics Boot Camp 2 covers many special cases
relevant to biostatistics.

Exercises

1. For iid Gaussian data, the statistic $\frac{\bar{X} - \mu}{S / \sqrt{n}}$ must follow a:

Z distribution

t distribution

2. Paired differences T confidence intervals are useful when:

Pairs of observations are linked, such as when there is subject level matching or
in a study with baseline and follow up measurements on all participants.

When there was randomization of a treatment between two independent
groups.

3. The assumption that the variances are equal for the independent group T interval
means that:

The sample variances have to be nearly identical.

The population variances are identical, but the sample variances may be
different.

4. Load the data set mtcars in the datasets R package. Calculate a 95% confidence interval
to the nearest MPG for the variable mpg. Watch a video solution and see written
solutions.

5. Suppose that the standard deviation of 9 paired differences is 1. What value would
the average difference have to be so that the lower endpoint of a 95% Student's t
confidence interval touches zero? Watch a video solution here and see the text here.

6. An independent group Student's T interval is used instead of a paired T interval
when:

The observations are paired between the groups.

The observations between the groups are naturally assumed to be statistically
independent.
As long as you do it correctly, either is fine.

More details are needed to answer this question. Watch a discussion of this
problem and see the text.

7. Consider the mtcars dataset. Construct a 95% T interval for MPG comparing 4 to 6
cylinder cars (subtracting in the order of 4 - 6); assume a constant variance. Watch a
video solution and see the text.

8. If someone put a gun to your head and said "Your confidence interval must contain
what it's estimating or I'll pull the trigger", what would be the smart thing to do?

Make your interval as wide as possible.

Make your interval as small as possible.

Call the authorities. Watch the video solution and see the text.

9. Refer back to comparing MPG for 4 versus 6 cylinders (question 7). What do you
conclude?

The interval is above zero, suggesting 6 is better than 4 in the terms of MPG.

The interval is above zero, suggesting 4 is better than 6 in the terms of MPG.

The interval does not tell you anything about the hypothesis test; you have to
do the test.

The interval contains 0, suggesting no difference. Watch a video solution and
see the text.

10. Suppose that 18 obese subjects were randomized, 9 each, to a new diet pill and a
placebo. Subjects' body mass indices (BMIs) were measured at a baseline and
again after having received the treatment or placebo for four weeks. The average
difference from follow-up to the baseline (followup - baseline) was 3 kg/m2 for the
treated group and 1 kg/m2 for the placebo group. The corresponding standard
deviations of the differences was 1.5 kg/m2 for the treatment group and 1.8 kg/m2
for the placebo group. The study aims to answer whether the change in BMI over
the four week period appears to differ between the treated and placebo groups.
What is the pooled variance estimate? Watch a video solution here and see the text
here.

9. Hypothesis testing

Hypothesis testing is concerned with making decisions using data.

Hypothesis testing
Watch this video before beginning.

To make decisions using data, we need to characterize the kinds of conclusions we


can make. Classical hypothesis testing is concerned with deciding between two
decisions (things get much harder if there's more than two). The first, a null
hypothesis, is specified to represent the status quo. This hypothesis is usually
labeled $H_0$. This is what we assume by default. The alternative or research
hypothesis is what we require evidence to conclude. This hypothesis is usually
labeled $H_a$ or sometimes $H_1$ (or some other number other than 0).

So to reiterate, the null hypothesis is assumed true and statistical evidence is
required to reject it in favor of a research or alternative hypothesis.

Example
A respiratory disturbance index (RDI) of more than 30 events / hour, say, is
considered evidence of severe sleep disordered breathing (SDB). Suppose that in a
sample of 100 overweight subjects with other risk factors for sleep disordered
breathing at a sleep clinic, the mean RDI was 32 events / hour with a standard
deviation of 10 events / hour.

We might want to test the hypothesis that

$$H_0 : \mu = 30$$

versus the hypothesis

$$H_a : \mu > 30$$

where $\mu$ is the population mean RDI. Clearly, somehow we must figure out a way
to decide between these hypotheses using the observed data, particularly the
sample mean.

Before we go through the specifics, let’s set up the central ideas.

Types of errors in hypothesis testing


The alternative hypotheses are typically of the form of the true mean being less than,
greater than, or not equal to the hypothesized mean, such as $H_a : \mu > 30$ from our example. The
null typically sharply specifies the mean, such as $H_0 : \mu = 30$ in our example.
More complex null hypotheses are possible, but are typically covered in later
courses.

Note that there are four possible outcomes of our statistical decision process:

Truth    Decide    Result

$H_0$    $H_0$     Correctly accept null

$H_0$    $H_a$     Type I error

$H_a$    $H_a$     Correctly reject null

$H_a$    $H_0$     Type II error

We will perform hypothesis testing by forcing the probability of a Type I error to be


small. This approach has consequences, which we can discuss with an analogy to court
cases.

Discussion relative to court cases


Consider a court of law and a criminal case. The null hypothesis is that the
defendant is innocent. The rules requires a standard on the available evidence to
reject the null hypothesis (and the jury to convict). The standard is specified loosely
in this case, such as convict if the defendant appears guilty “Beyond reasonable
doubt”. In statistics, we can be mathematically specific about our standard of
evidence.

Note the consequences of setting a standard. If we set a low standard, say
convicting only if there is circumstantial or greater evidence, then we would increase
the percentage of innocent people convicted (type I errors); however, we would
also increase the percentage of guilty people convicted (correctly rejecting the null).
If we set a high standard, say the standard of convicting if the jury has "No doubts
whatsoever", then we increase the percentage of innocent people let free
(correctly accepting the null) while we would also increase the percentage of guilty
people let free (type II errors).

Building up a standard of evidence


Watch this video before beginning.

Consider our sleep example again. A reasonable strategy would reject the null
hypothesis if the sample mean, $\bar{X}$, was larger than some constant, say $C$.
Typically, $C$ is chosen so that the probability of a Type I error, labeled $\alpha$, is 0.05
(or some other relevant constant). To reiterate, Type I error rate = Probability
of rejecting the null hypothesis when, in fact, the null hypothesis is correct.

Let's see if we can figure out what $C$ has to be. The standard error of the mean is
$10 / \sqrt{100} = 1$. Furthermore, under $H_0$ we know that $\bar{X} \sim N(30, 1)$ (at least
approximately) via the CLT. We want to choose $C$ so that:

$$P(\bar{X} > C; H_0) = 0.05$$

The 95th percentile of a normal distribution is 1.645 standard deviations from the
mean. So, if $C$ is set 1.645 standard deviations from the mean, we should be set,
since the probability of getting a sample mean that large is only 5%. The 95th
percentile of a $N(30, 1)$ is:

$$C = 30 + 1 \times 1.645 = 31.645$$

So the rule "Reject $H_0$ when $\bar{X} \geq 31.645$" has the property that the probability
of rejection is 5% when $H_0$ is true.

In general, however, we don't convert $C$ back to the original scale. Instead, we
calculate how many standard errors the observed mean is from the hypothesized
mean:

$$Z = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$

This is called a Z-score. We can compare this statistic to standard normal quantiles.

To reiterate, the Z-score is how many standard errors the sample mean is above
the hypothesized mean. In our example:

$$Z = \frac{32 - 30}{10 / \sqrt{100}} = 2$$

Since 2 is greater than 1.645 we would reject. Setting the rule "We reject if our Z-
score is larger than 1.645" controls the Type I error rate at 5%. We could write out a
general rule for this alternative hypothesis as reject whenever

$$\sqrt{n}(\bar{X} - \mu_0) / s > Z_{1 - \alpha}$$

where $\alpha$ is the desired Type I error rate.
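A minimal R sketch of the calculation for this example (the numbers are taken from the RDI example above):

mu0 <- 30; xbar <- 32; s <- 10; n <- 100
z <- (xbar - mu0) / (s / sqrt(n))   # the Z-score, which is 2
z > qnorm(0.95)                     # TRUE, so we reject H0 at alpha = 0.05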

Because the Type I error rate was controlled to be small, if we reject we know that
one of the following occurred:
1. the null hypothesis is false,

2. we observed an unlikely event in support of the alternative even though the null is
true,

3. our modeling assumptions are wrong.

The third option can be difficult to check and at some level all bets are off
depending on how wrong we are about our basic assumptions. So for this course,
we speak of our conclusions under the assumption that our modeling choices
(such as the iid sampling model) are correct, but do so wide eyed acknowledging
the limitations of our approach.

General rules
We developed our test for one possible alternative. Here are some general rules for
the three most important alternatives.

Consider the test for $H_0 : \mu = \mu_0$ versus: $H_1 : \mu < \mu_0$, $H_2 : \mu \neq \mu_0$,
$H_3 : \mu > \mu_0$. Our test statistic is

$$TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$$

We reject the null hypothesis when:

$$TS \leq -Z_{1 - \alpha}, \quad |TS| \geq Z_{1 - \alpha/2}, \quad TS \geq Z_{1 - \alpha}$$

respectively.

Summary notes
We have fixed $\alpha$ to be low, so if we reject $H_0$ either our model is wrong or
there is a low probability that we have made an error.

We have not fixed the probability of a type II error, $\beta$; therefore we tend to say
"Fail to reject $H_0$" rather than accepting $H_0$.

Statistical significance is not the same as scientific significance.

The region of TS values for which you reject $H_0$ is called the rejection region.

The test requires the assumptions of the CLT and for $n$ to be large enough
for it to apply.

If $n$ is small, then a Gosset's t test is performed exactly in the same way, with
the normal quantiles replaced by the appropriate Student's t quantiles and
$n - 1$ df.

The probability of rejecting the null hypothesis when it is false is called power.

Power is used a lot to calculate sample sizes for experiments.

Example reconsidered
Watch this video before beginning.

Consider our example again. Suppose that $n = 16$ (rather than 100). The statistic

$$\frac{\bar X - 30}{s / \sqrt{16}}$$

follows a t distribution with 15 df under $H_0$.

Under $H_0$, the probability that it is larger than the 95th percentile of the t distribution is 5%. The 95th percentile of the t distribution with 15 df is 1.7531 (obtained via qt(.95, 15) in R).

Assuming that everything but the sample size is the same, our test statistic is now $\sqrt{16}(32 - 30)/10 = 0.8$. Since 0.8 is not larger than 1.75, we now fail to reject.
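
The same calculation as a quick R sketch:

ts <- sqrt(16) * (32 - 30) / 10
ts                  # 0.8
ts > qt(0.95, 15)   # FALSE, so we fail to reject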

Two sided tests


In many settings, we would like to reject if the true mean is different from the hypothesized mean, not just larger or smaller. In other words, we would reject the null hypothesis if in fact the sample mean was much larger or smaller than the hypothesized mean. In our example, we want to test the alternative $H_a : \mu \neq 30$.

We will reject if the test statistic, 0.8, is either too large or too small. Then we want the probability of rejecting under the null to be 5%, split equally as 2.5% in the upper tail and 2.5% in the lower tail.

Thus we reject if our test statistic is larger than qt(.975, 15) or smaller than qt(.025, 15).
This is the same as saying: reject if the absolute value of our statistic is larger than
qt(0.975, 15) = 2.1314.

In this case, since our test statistic is 0.8, which is smaller than 2.1314, we fail to
reject the two sided test (as well as the one sided test).

If you fail to reject the one sided test, then you would also fail to reject the two sided test, since the two sided test requires a more extreme statistic to reject (2.13 versus 1.75 here). Being the more conservative choice, two sided tests are the norm (even in settings where a one sided test makes more sense).

T test in R
Let’s try the t test on the pairs of fathers and sons in Galton’s data.

Example of using the t test in R.

> library(UsingR); data(father.son)


> t.test(father.son$sheight - father.son$fheight)

One Sample t-test

data: father.son$sheight - father.son$fheight


t = 11.79, df = 1077, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.831 1.163
sample estimates:
mean of x
0.997

Connections with confidence intervals


Consider testing $H_0 : \mu = \mu_0$ versus $H_a : \mu \neq \mu_0$. Take the set of all possible values of $\mu_0$ for which you fail to reject $H_0$; this set is a $(1 - \alpha)100\%$ confidence interval for $\mu$.

The same works in reverse; if a $(1 - \alpha)100\%$ interval contains $\mu_0$, then we fail to reject $H_0$.

In other words, two sided tests and confidence intervals agree.
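
For instance, a small sketch using the father/son t test from the previous section: the 95% interval excludes 0 exactly when the two sided 5% level test of $H_0 : \mu = 0$ rejects.

# assumes father.son from UsingR is loaded, as above
fit <- t.test(father.son$sheight - father.son$fheight, mu = 0)
fit$conf.int         # 0.831 to 1.163, does not contain 0
fit$p.value < 0.05   # TRUE, so the test rejects, in agreement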

Two group intervals


Doing group tests is now straightforward given that we’ve already covered independent group T intervals. Our rejection rules are the same; the only change is how the statistic is calculated. However, the form is familiar:

Consider now testing $H_0 : \mu_1 = \mu_2$. Our statistic is

$$TS = \frac{\bar X_1 - \bar X_2}{SE_{\bar X_1 - \bar X_2}}.$$

For the equal variance case,

$$SE_{\bar X_1 - \bar X_2} = S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

and

$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}.$$

Let’s just go through an example.

Example chickWeight data


Recall that we reformatted this data as follows

Reformatting the data.

library(datasets); data(ChickWeight); library(reshape2)


##define weight gain or loss
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight")
names(wideCW)[-(1 : 2)] <- paste("time", names(wideCW)[-(1 : 2)], sep = "")
library(dplyr)
wideCW <- mutate(wideCW,
gain = time21 - time0
)

Pooled (equal variance) T test comparing diets 1 and 4.

> wideCW14 <- subset(wideCW, Diet %in% c(1, 4))


> t.test(gain ~ Diet, paired = FALSE,
+ var.equal = TRUE, data = wideCW14)

Two Sample t-test

data: gain by Diet


t = -2.725, df = 23, p-value = 0.01207
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-108.15 -14.81
sample estimates:
mean in group 1 mean in group 4
136.2 197.7

Exact binomial test


Recall this problem. Suppose a friend has 8 children, 7 of whom are girls and none are twins. Perform the relevant hypothesis test: $H_0 : p = 0.5$ versus $H_a : p > 0.5$.

What is the relevant rejection region so that the probability of rejecting is (less than) 5%?

Rejection region   Type I error rate
[0 : 8]            1
[1 : 8]            0.9961
[2 : 8]            0.9648
[3 : 8]            0.8555
[4 : 8]            0.6367
[5 : 8]            0.3633
[6 : 8]            0.1445
[7 : 8]            0.0352
[8 : 8]            0.0039

Thus if we reject under 7 or 8 girls, we will have a less than 5% chance of rejecting
under the null hypothesis.

It’s impossible to get an exact 5% level test for this case due to the discreteness of
the binomial. The closest is the rejection region [7 : 8]. Further note that an alpha
level lower than 0.0039 is not attainable. For larger sample sizes, we could do a
normal approximation.
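
The Type I error rates in the table can be reproduced with pbinom (a small sketch):

# P(X >= j) under H0 : p = 0.5, for each rejection region [j : 8]
j <- 0 : 8
round(pbinom(j - 1, size = 8, prob = 0.5, lower.tail = FALSE), 4)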

Extending this test to a two sided test isn’t obvious. Given a way to do two sided tests, we could take the set of values of $p$ for which we fail to reject to get an exact binomial confidence interval (called the Clopper/Pearson interval, by the way). We’ll cover two sided versions of this test when we cover P-values.
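
For reference, R’s binom.test carries out this exact test; with its default two sided alternative it also reports the Clopper/Pearson interval mentioned above (a small sketch):

binom.test(7, n = 8, p = 0.5, alternative = "greater")  # one sided exact test
binom.test(7, n = 8)  # two sided version; the reported interval is Clopper/Pearson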

Exercises
1. Which hypothesis is typically assumed to be true in hypothesis testing?
The null.
The alternative.
2. The type I error rate controls what?
3. Load the data set mtcars in the datasets R package. Assume that the data set mtcars is a random sample. Compute the mean MPG, $\bar x$, of this sample. You want to test whether the true MPG is $\mu_0$ or smaller using a one sided 5% level test ($H_0 : \mu = \mu_0$ versus $H_a : \mu < \mu_0$). Using that data set and a Z test: based on the mean MPG of the sample, what is the smallest value of $\mu_0$ that you would reject for (to two decimal places)? Watch a video solution here and see the text here.
4. Consider again the mtcars dataset. Use a two group t-test to test the hypothesis that the 4 and 6 cyl cars have the same mpg. Use a two sided test with unequal variances. Do you reject? Watch the video here and see the text here.
5. A sample of 100 men yielded an average PSA level of 3.0 with a sd of 1.1. What is the complete set of values of $\mu_0$ that a 5% two sided Z test of $H_0 : \mu = \mu_0$ would fail to reject the null hypothesis for? Watch the video solution and see the text.
6. You believe the coin that you’re flipping is biased towards heads. You get 55 heads out of 100 flips. Do you reject at the 5% level that the coin is fair? Watch a video solution and see the text.
7. Suppose that in an AB test, one advertising scheme led to an average of 10 purchases per day for a sample of 100 days, while the other led to 11 purchases per day, also for a sample of 100 days. Assume a common standard deviation of 4 purchases per day. Assuming that the groups are independent and that the days are iid, perform a Z test of equivalence. Do you reject at the 5% level? Watch a video solution and see the text.
8. A confidence interval for the mean contains:
All of the values of the hypothesized mean for which we would fail to reject with $\alpha$ equal to one minus the confidence level.
All of the values of the hypothesized mean for which we would fail to reject with $\alpha$ equal to the confidence level.
All of the values of the hypothesized mean for which we would reject with $\alpha$ equal to one minus the confidence level.
All of the values of the hypothesized mean for which we would reject with $\alpha$ equal to the confidence level. Watch a video solution and see the text.
9. In a court of law, all things being equal, if via policy you require a lower standard of evidence to convict people then
Less guilty people will be convicted.
More innocent people will be convicted.
More innocent people will not be convicted. Watch a video solution and see the text.

10. P-values

Introduction to P-values
Watch this video before beginning.

P-values are the most common measure of statistical significance. Their ubiquity,
along with concern over their interpretation and use makes them controversial
among statisticians. The following manuscripts are interesting reads about P-
values.

http://warnercnr.colostate.edu/~anderson/thompson1.html
Also see Statistical Evidence: A Likelihood Paradigm by Richard Royall
Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy by Steve Goodman
The hilariously titled: The Earth is Round (p < .05) by Cohen.

Some positive comments

simply statistics
normal deviate
Error statistics

What is a P-value?
The central idea of a P-value is to assume that the null hypothesis is true and
calculate how unusual it would be to see data as extreme as was seen in favor of
the alternative hypothesis. The formal definition is:

A P-value is the probability of observing data as or more extreme in favor of the


alternative than was actually obtained, where the probability is calculated
assuming that the null hypothesis is true.

A P-value then requires a few steps:
1. Decide on a statistic that evaluates support of the null or alternative hypothesis.
2. Decide on a distribution of that statistic under the null hypothesis (the null distribution).
3. Calculate the probability of obtaining a statistic as or more extreme than was observed, using the distribution in step 2.

The way to interpret P-values is as follows. If the P-value is small, then either $H_0$ is true and we have observed a rare event, or $H_0$ is false (or possibly the null model is incorrect).

Let’s do a quick example. Suppose that you get a t statistic of 2.5 for 15 degrees of freedom testing $H_0 : \mu = \mu_0$ versus $H_a : \mu > \mu_0$. What’s the probability of getting a t statistic as large as 2.5?

P-value calculation in R.

> pt(2.5, 15, lower.tail = FALSE)


[1] 0.01225

Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained under $H_0$ is 0.0123. So, (assuming our model is correct) either we observed data that was pretty unlikely under the null, or the null hypothesis is false.

The attained significance level


Recall from a previous chapter that our test statistic was 2 for $H_0 : \mu = 30$ versus $H_a : \mu > 30$ using a normal test ($n$ was 100). Notice that we rejected the one sided test when $\alpha = 0.05$; would we reject if $\alpha = 0.01$? How about 0.001?

The smallest value of $\alpha$ for which you still reject the null hypothesis is called the attained significance level. This is mathematically equivalent to, but philosophically a little different from, the P-value. Whereas the P-value is interpreted in terms of how probabilistically extreme our test statistic is under the null, the attained significance level merely conveys the smallest level of $\alpha$ at which one could reject.

This equivalence makes P-values very convenient to convey. The reader of the results can perform the test at whatever $\alpha$ level he or she chooses. This is especially useful in multiple testing circumstances.
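
As a quick sketch, the attained significance level for the Z statistic of 2 above is:

# smallest alpha at which a one sided Z test with Z = 2 rejects
pnorm(2, lower.tail = FALSE)   # about 0.0228, so we reject at 0.05 but not at 0.01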

Here are the two rules for performing hypothesis tests with P-values:
* If the P-value for a test is less than $\alpha$, you reject the null hypothesis.
* For two sided hypothesis tests, double the smaller of the two one sided hypothesis test P-values.
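
For instance, a small sketch of the doubling rule applied to the t statistic of 2.5 on 15 df from the earlier example:

# two sided P-value via doubling the smaller of the one sided P-values
2 * min(pt(2.5, 15), pt(2.5, 15, lower.tail = FALSE))   # about 0.0245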

Binomial P-value example


Suppose a friend has 8 children, 7 of which are girls and none are twins. If each
gender has an independent 50% probability for each birth, what’s the probability of
getting 7 or more girls out of 8 births?
This calculation is a P-value where the statistic is the number of girls and the null distribution is a fair coin flip for each gender. We want to test $H_0 : p = 0.5$ versus $H_a : p > 0.5$, where $p$ is the probability of having a girl for each birth.

Recall here’s the calculation:

Example of a Binomial P-value calculation in R.

> pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)


[1] 0.03516

Since our P-value is less than 0.05 we would reject at a 5% error rate. Note,
however, if we were doing a two sided test, we would have to double the P-value
and thus would then fail to reject.
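
A quick sketch of that doubled, two sided P-value:

2 * pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)   # about 0.0703, so we fail to reject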

Poisson example
Watch this video before beginning.

Suppose that a hospital has an infection rate of 10 infections per 100 person/days
at risk (rate of 0.1) during the last monitoring period. Assume that an infection rate
of 0.05 is an important benchmark.

Given a Poisson model, could the observed rate being larger than 0.05 be attributed to chance? We want to test $H_0 : \lambda = 0.05$, where $\lambda$ is the rate of infections per person day, so that 5 would be the rate per 100 person days. Thus we want to know if 10 or more events per 100 person/days is unusual with respect to a Poisson distribution with a rate of 5 events per 100. Consider $H_a : \lambda > 0.05$.

Poisson P-value calculation.

> ppois(9, 5, lower.tail = FALSE)


[1] 0.03183

Again, since this P-value is less than 0.05 we reject the null hypothesis. The P-value would be about 0.06 for the two sided hypothesis test (double the one sided value), and so we would fail to reject in that case.
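
The same one sided P-value comes out of R’s exact Poisson test (a small sketch):

# exact test of 10 events over 100 person-days against a rate of 0.05
poisson.test(10, T = 100, r = 0.05, alternative = "greater")$p.value   # about 0.0318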

Exercises
1. P-values are probabilities that are calculated assuming which hypothesis is true?
the alternative
the null
2. You get a P-value of 0.06. Would you reject for a type I error rate of 0.05?
Yes you would reject the null
No you would not reject the null
It depends on information not given
3. The proposed procedure for getting a two sided P-value for the exact binomial test considered here is what?
Multiplying the one sided P-value by one half
Doubling the larger of the two one sided P-values
Doubling the smaller of the two one sided P-values
No procedure exists
4. Consider again the mtcars dataset. Use a two group t-test to test the hypothesis that the 4 and 6 cyl cars have the same mpg. Use a two sided test with unequal variances. Give a P-value. Watch the video here and see the text here.
5. You believe the coin that you’re flipping is biased towards heads. You get 55 heads out of 100 flips. Give an exact P-value for the hypothesis that the coin is fair. Watch a video solution and see the text.
6. A web site was monitored for a year and it received 520 hits per day. In the first 30 days of the next year, the site received 15,800 hits. Assuming that web hits are Poisson, give an exact one sided P-value to the hypothesis that web hits are up this year over last. Do you reject? Watch the video solutions and see the problem text.
7. Suppose that in an AB test, one advertising scheme led to an average of 10 purchases per day for a sample of 100 days, while the other led to 11 purchases per day, also for a sample of 100 days. Assume a common standard deviation of 4 purchases per day. Assuming that the groups are independent and that the days are iid, perform a Z test of equivalence. Give a P-value for the test. Watch a video solution and see the text.
8. Consider the mtcars data set.
Give the p-value for a t-test comparing MPG for 6 and 8 cylinder cars assuming equal variance, as a proportion to 3 decimal places.
Give the associated P-value for a z test.
Give the common standard deviation estimate for MPG across cylinders to 3 decimal places.
Would the t test reject at the two sided 0.05 level (0 for no, 1 for yes)? Watch a video solution and see the text.

11. Power

Power
Watch this video before beginning. and then watch this video as well.

Power is the probability of rejecting the null hypothesis when it is false. Ergo,
power (as its name would suggest) is a good thing; you want more power. A type II
error (a bad thing, as its name would suggest) is failing to reject the null hypothesis
when it’s false; the probability of a type II error is usually called $\beta$. Note: Power $= 1 - \beta$.

Let’s go through an example of calculating power. Consider our previous example


involving RDI: $H_0 : \mu = 30$ versus $H_a : \mu > 30$. Then power is

$$P\left(\frac{\bar X - 30}{\sigma / \sqrt{n}} > z_{1-\alpha} ; \mu = \mu_a\right).$$

Note that this is a function that depends on the specific value of $\mu_a$! Further notice that as $\mu_a$ approaches 30 the power approaches $\alpha$.

Pushing this example further, we reject if

$$Z = \frac{\bar X - 30}{\sigma / \sqrt{n}} > z_{1-\alpha},$$

or, equivalently, if

$$\bar X > 30 + z_{1-\alpha} \frac{\sigma}{\sqrt{n}}.$$

But note that, under $H_0$, $\bar X \sim N(30, \sigma^2 / n)$, while under $H_a$, $\bar X \sim N(\mu_a, \sigma^2 / n)$.

So for this test we could calculate power with this R code:

Power calculation for the sleep example in R

alpha = 0.05
z = qnorm(1 - alpha)
pnorm(mu0 + z * sigma/sqrt(n), mean = mua, sd = sigma/sqrt(n), lower.tail = FALS\
E)

Let’s plug in the specific numbers for our example, where $\mu_0 = 30$, $\mu_a = 32$, $\sigma = 4$ and $n = 16$.

> mu0 = 30
> mua = 32
> sigma = 4
> n = 16
> z = qnorm(1 - alpha)
> pnorm(mu0 + z * sigma/sqrt(n), mean = mu0, sd = sigma/sqrt(n), lower.tail = FA\
LSE)
[1] 0.05
> pnorm(mu0 + z * sigma/sqrt(n), mean = mua, sd = sigma/sqrt(n), lower.tail = FA\
LSE)
[1] 0.6388

When we plug in $\mu_0 = 30$, the value under the null hypothesis, we get that the probability of rejection is 5%, as the test was designed. However, when we plug in a value of 32, we get 64%. Therefore, the probability of rejection is 64% when the true value of $\mu$ is 32. We could create a curve of the power as a function of $\mu_a$, as seen below. We also varied the sample size to see how the curve depends on that.

Plot of power as $\mu_a$ varies.

The code below shows how to use manipulate to investigate power as the various
inputs change.

Code for investigating power.

library(ggplot2)  # needed for ggplot below; assumed loaded earlier in the book
library(manipulate)
mu0 = 30
myplot <- function(sigma, mua, n, alpha) {
g = ggplot(data.frame(mu = c(27, 36)), aes(x = mu))
g = g + stat_function(fun = dnorm, geom = "line", args = list(mean = mu0,
sd = sigma/sqrt(n)), size = 2, col = "red")
g = g + stat_function(fun = dnorm, geom = "line", args = list(mean = mua,
sd = sigma/sqrt(n)), size = 2, col = "blue")
xitc = mu0 + qnorm(1 - alpha) * sigma/sqrt(n)
g = g + geom_vline(xintercept = xitc, size = 3)
g
}
manipulate(myplot(sigma, mua, n, alpha), sigma = slider(1, 10, step = 1, initial\
= 4),
mua = slider(30, 35, step = 1, initial = 32), n = slider(1, 50, step = 1,
initial = 16), alpha = slider(0.01, 0.1, step = 0.01, initial = 0.05))

Question
Watch this video before beginning.

When testing $H_a : \mu > \mu_0$, notice that if power is $1 - \beta$, then

$$1 - \beta = P\left(\bar X > \mu_0 + z_{1-\alpha} \frac{\sigma}{\sqrt{n}} ; \mu = \mu_a\right),$$

where $\bar X \sim N(\mu_a, \sigma^2 / n)$. The unknowns in the equation are $\mu_a$, $\sigma$, $n$ and $\beta$, and the knowns are $\mu_0$ and $\alpha$. Specify any 3 of the unknowns and you can solve for the remainder.
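
For example, a minimal sketch (using the one sided Z test above and a hypothetical target power) that solves the equation for $n$:

mu0 <- 30; mua <- 32; sigma <- 4   # values from the example above
alpha <- 0.05; beta <- 0.2         # hypothetical target: 80% power
n <- ((qnorm(1 - alpha) + qnorm(1 - beta)) * sigma / (mua - mu0))^2
ceiling(n)   # about 25 under the normal (Z test) approximation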

Notes
The calculation for $H_a : \mu < \mu_0$ is similar.
For $H_a : \mu \neq \mu_0$ calculate the one sided power using $\alpha / 2$ (this is only approximately right, since it excludes the probability of getting a large TS in the opposite direction of the truth).
Power goes up as $\alpha$ gets larger.
Power of a one sided test is greater than the power of the associated two sided test.
Power goes up as $\mu_a$ gets further away from $\mu_0$.
Power goes up as $n$ goes up.
Power doesn’t need $\mu_a$, $\sigma$ and $n$ individually, instead only $\sqrt{n}(\mu_a - \mu_0)/\sigma$.
The quantity $(\mu_a - \mu_0)/\sigma$ is called the effect size, the difference in the means in standard deviation units.
Being unit free, it has some hope of interpretability across settings.

T-test power
Watch this before beginning.

Consider calculating power for Gosset’s t test for our example, where we now assume that $n = 16$. The power is

$$P\left(\frac{\bar X - \mu_0}{S / \sqrt{n}} > t_{1-\alpha, n-1} ; \mu = \mu_a\right).$$
Calculating this requires the so-called non-central t distribution. However,


fortunately for us, the R function power.t.test does this very well. Omit (exactly) any
one of the arguments and it solves for it. Our t-test power again only relies on the
effect size.

Let’s do our example trying different options.

Example of using ‘power.t.test’ in R.


# omitting the power and getting a power estimate
> power.t.test(n = 16, delta = 2/4, sd = 1, type = "one.sample", alt = "one.side\
d")$power
[1] 0.604
# illustrating that it depends only on the effect size, delta/sd
> power.t.test(n = 16, delta = 2, sd = 4, type = "one.sample", alt = "one.sided"\
)$power
[1] 0.604
# same thing again
> power.t.test(n = 16, delta = 100, sd = 200, type = "one.sample", alt = "one.si\
ded")$power
[1] 0.604
# specifying the power and getting n
> power.t.test(power = 0.8, delta = 2/4, sd = 1, type = "one.sample", alt = "one\
.sided")$n
[1] 26.14
# again illustrating that the effect size is all that matters
power.t.test(power = 0.8, delta = 2, sd = 4, type = "one.sample", alt = "one.sid\
ed")$n
[1] 26.14
# again illustrating that the effect size is all that matters
> power.t.test(power = 0.8, delta = 100, sd = 200, type = "one.sample", alt = "o\
ne.sided")$n
[1] 26.14

Exercises
1. Power is a probability calculation assuming which is true:
The null hypothesis
The alternative hypothesis
Both the null and alternative
2. As your sample size gets bigger, all else equal, what do you think would happen to power?
It would get larger
It would get smaller
It would stay the same
It cannot be determined from the information given
3. What happens to power as $\mu_a$ gets further from $\mu_0$?
Power decreases
Power increases
Power stays the same
Power oscillates
4. In the context of calculating power, the effect size is?
The null mean divided by the standard deviation
The alternative mean divided by the standard error
The difference between the null and alternative means divided by the standard deviation
The standard error divided by the null mean
5. Recall this problem: “Suppose that in an AB test, one advertising scheme led to an average of 10 purchases per day for a sample of 100 days, while the other led to 11 purchases per day, also for a sample of 100 days. Assuming a common standard deviation of 4 purchases per day.” Assume that 10 purchases per day is a benchmark null value, that days are iid and that the standard deviation is 4 purchases per day. Suppose that you plan on sampling 100 days. What would be the power for a one sided 5% Z mean test that purchases per day have increased, under the alternative of $\mu = 11$ purchases per day? Watch a video solution and see the text.
6. Researchers would like to conduct a study of healthy adults to detect a four year mean brain volume loss of .01 mm3. Assume that the standard deviation of four year volume loss in this population is .04 mm3. What is the necessary sample size for the study for a 5% one sided test versus a null hypothesis of no volume loss to achieve 80% power? Watch the video solution and see the text.
12. The bootstrap and resampling

The bootstrap
Watch this video before beginning.

The bootstrap is a tremendously useful tool for constructing confidence intervals


and calculating standard errors for difficult statistics. For a classic example, how
would one derive a confidence interval for the median? The bootstrap procedure
follows from the so-called bootstrap principle.

To illustrate the bootstrap principle, imagine a die roll. The image below shows the
mass function of a die roll on the left. On the right we show the empirical
distribution obtained by repeatedly averaging 50 independent die rolls. By this
simulation, without any mathematics, we have a good idea of what the distribution
of averages of 50 die rolls looks like.

Image of true die roll distribution (left) and simulation of averages of 50 die rolls

Now imagine a case where we didn’t know whether or not the die was fair. We have
a sample of size 50 and we’d like to investigate the distribution of the average of 50
die rolls where we’re not allowed to roll the die anymore. This is more like a real data analysis; we only get one sample from the population.

Image of empirical die roll distribution (left) and simulations of averages of 50 die rolls from this distribution

The bootstrap principle is to use the empirical mass function of the data to
perform the simulation, rather than the true distribution. That is, we simulate
averages of 50 samples from the histogram that we observe. With enough data, the
empirical distribution should be a good estimate of the true distribution and this
should result in a good approximation of the sampling distribution.

That’s the bootstrap principle: investigate the sampling distribution of a statistic by


simulating repeated realizations from the observed distribution.

If we could simulate from the true distribution, then we would know the exact
sampling distribution of our statistic (if we ran our computer long enough.)
However, since we only get to sample from that distribution once, we have to be
content with using the empirical distribution. This is the clever idea of the
bootstrap.

Example Galton’s fathers and sons dataset


Watch this video before beginning.

The code below creates resamples via draws of size n with replacement from the original data of the sons’ heights in Galton’s data; a histogram of the median of each resampled dataset is plotted further below.

Bootstrapping example

library(UsingR)
data(father.son)
x <- father.son$sheight
n <- length(x)
B <- 10000
resamples <- matrix(sample(x,
n * B,
replace = TRUE),
B, n)
resampledMedians <- apply(resamples, 1, median)
Bootstrapping example for the median of sons’ heights from Galton’s

The bootstrap principle


Watch this video before beginning.

Suppose that I have a statistic that estimates some population parameter, but I
don’t know its sampling distribution. The bootstrap principle suggests using the
distribution defined by the data to approximate its sampling distribution

The bootstrap in practice


In practice, the bootstrap principle is always carried out using simulation. We will cover only a few aspects of bootstrap resampling. The general procedure follows by first simulating complete data sets from the observed data with replacement. This is approximately drawing from the sampling distribution of the statistic, at least as far as the data is able to approximate the true population distribution. We then calculate the statistic for each simulated data set and use the simulated statistics to either define a confidence interval or take their standard deviation to estimate a standard error.

Nonparametric bootstrap algorithm example


Bootstrap procedure for calculating a confidence interval for the median from a data set of $n$ observations:

1. Sample $n$ observations with replacement from the observed data, resulting in one simulated complete data set.

2. Take the median of the simulated data set.

3. Repeat these two steps $B$ times, resulting in $B$ simulated medians.

4. These medians are approximately drawn from the sampling distribution of the median of $n$ observations; therefore we can:

Draw a histogram of them
Calculate their standard deviation to estimate the standard error of the median
Take the 2.5th and 97.5th percentiles as a confidence interval for the median

For the general bootstrap, just replace the median with whatever statistic that
you’re investigating.
Example code
Consider our father/son data from before. Here is the relevant code for doing the
resampling.

B <- 10000
resamples <- matrix(sample(x,
n * B,
replace = TRUE),
B, n)
medians <- apply(resamples, 1, median)

And here are some results.

> sd(medians)
[1] 0.08424

Thus, 0.084 estimates the standard error of the median for this data set. It did this
by repeatedly sampling medians from the observed distribution and taking the
standard deviation of the resulting collection of medians. Taking the 2.5 and 97.5
percentiles gives us a bootstrap 95% confidence interval for the median.

> quantile(medians, c(.025, .975))


2.5% 97.5%
68.43 68.81

We also always want to plot a histogram or density estimate of our simulated


statistic.

library(ggplot2)  # assumed loaded earlier in the book
g = ggplot(data.frame(medians = medians), aes(x = medians))


g = g + geom_histogram(color = "black", fill = "lightblue", binwidth = 0.05)
g

Bootstrapping example for the median of sons’ heights from Galton’s

Summary notes on the bootstrap


The bootstrap is non-parametric.
Better percentile bootstrap confidence intervals correct for bias.
There are lots of variations on bootstrap procedures; the book An Introduction to the Bootstrap by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information.


Group comparisons via permutation tests
Watch this video before beginning.

Consider comparing two independent groups; for example, comparing insect sprays B and C.

data(InsectSprays)
g = ggplot(InsectSprays, aes(spray, count, fill = spray))
g = g + geom_boxplot()
g

Comparison of insect spray.

Permutation tests
Consider comparing means between the groups. However, let’s calculate the distribution of our statistic under a null hypothesis that the labels are irrelevant (exchangeable). This is a handy way to create a null distribution for our test statistic: simply permute the labels over and over and see how extreme our data are with respect to this permuted distribution.

The procedure would be as follows:

1. consider a data frame with count and spray,

2. permute the spray (group) labels,

3. recalculate the statistic (such as the difference in means),

4. calculate the percentage of simulations where the simulated statistic was more extreme (toward the alternative) than the observed.

Variations on permutation testing


This idea of exchangeability of the group labels is so powerful that it’s been reinvented several times in statistics. The table below gives three famous tests that are obtained by permuting group labels.

Data type   Statistic             Test name
Ranks       rank sum              rank sum test
Binary      hypergeometric prob   Fisher’s exact test
Raw data                          permutation test

Also, so-called randomization tests are exactly permutation tests, with a different
motivation. In that case, think of the permutation test as replicating the random
assignment over and over.

For matched or paired data, it wouldn’t make sense to randomize the group labels, since that would break the association between the pairs. Instead, one can randomize the signs of the pairs. For data that has been replaced by ranks, you might have heard of this test before as the signed rank test.
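
As a minimal sketch (not from the text) of such a sign-flip test, using the father/son height differences as hypothetical paired data:

# randomize the signs of the paired differences to build a null distribution
diffs <- father.son$sheight - father.son$fheight   # assumes father.son is loaded
observed <- mean(diffs)
flipped <- sapply(1 : 10000, function(i)
    mean(diffs * sample(c(-1, 1), length(diffs), replace = TRUE)))
mean(abs(flipped) >= abs(observed))   # two sided sign-flip permutation P-value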

Again we won’t cover more complex examples, but it should be said that
permutation strategies work for regression as well by permuting a regressor of
interest (though this needs to be done with care). These tests work very well in
massively multivariate settings.

Permutation test B v C
Let’s create some code for our example. Our statistic will be the difference in the
means in each group.

Permutation distribution for the insect sprays dataset.

subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),]


y <- subdata$count
group <- as.character(subdata$spray)
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
observedStat <- testStat(y, group)
permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))

Let’s look at some of the results. First let’s look at the observed statistic.

> observedStat
[1] 13.25

Now let’s see what proportion of times we got a simulated statistic larger than our
observed statistic.

mean(permutations > observedStat)


[1] 0

Since this is 0, our estimate of the P-value is 0 (i.e. we strongly reject the null hypothesis); none of the 10,000 label permutations produced a statistic as large as the one we observed. It’s useful to look at a histogram of the permuted statistics with a vertical line drawn at the observed test statistic for reference.
Permutation distribution from the insectsprays dataset
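
A sketch of code that would produce such a plot (assuming ggplot2 is loaded):

g = ggplot(data.frame(permutations = permutations), aes(x = permutations))
g = g + geom_histogram(color = "black", fill = "lightblue", binwidth = 1)
g = g + geom_vline(xintercept = observedStat, size = 2)
g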

Exercises
1. The bootstrap uses what to estimate the sampling distribution of a statistic?
The true population distribution
The empirical distribution that puts probability 1/n on each observed data point
2. When performing the bootstrap via Monte Carlo resampling for a data set of size n, which is true? Assume that you’re going to do 10,000 bootstrap resamples.
You sample n complete data sets of size 10,000 with replacement
You sample 10,000 complete data sets of size n without replacement
You sample 10,000 complete data sets of size n with replacement
You sample n complete data sets of size 10,000 without replacement
3. Permutation tests do what?
Create a null distribution for a hypothesis test by permuting a predictor variable.
Create a null distribution by resampling from the response with replacement.
Create an alternative distribution by permuting group labels.
Create confidence intervals by resampling with replacement.
