STAT:3510 (22S:101) Biostatistics

Dale Zimmerman

Summer 2016
Definition of biostatistics
• Statistics — the science of collecting, describing, analyzing,
and interpreting data, so that inferences (conclusions about a
population based on data from merely a sample) can be made
with quantifiable certainty.

• Biostatistics — that portion of statistics that is most relevant


to the biological sciences.

The definition implies that Statistics requires defining a population


of interest, drawing a sample from that population, measuring some-
thing on each member of that sample, making a conclusion about the
population based on the measured quantities and, finally, using prob-
ability to say something about how sure you are of your conclusion.

2
Example applications of Biostatistics

We use biostatistics to address questions like:

• Does drinking a glass of red wine every day reduce a person’s


risk of heart disease?

• Does listening to classical music while pregnant increase the


musical ability and/or the intelligence of the child in the womb?

• Do larger male dragonflies defend larger territories?

• Does the presence of wind turbines decrease bird populations?

3
Data collection

Some issues:

• The sample drawn from the population should, if possible, be


a random sample, i.e. drawn in such a way that every member
of the population has an equal chance of being included in the
sample.

• Sample size, n (larger n is better, all other things being equal)

• Accuracy of measurement

For the most part, we will not do our own data collection in this
class, but will use existing data sets.

4
Descriptive Statistics

Summaries or reductions of the data into something easier to under-


stand. May be:

• numerical

• tabular

• graphical

Being a summary, a descriptive statistic may reveal some important


feature(s) of the data, but not others.

5
Data Types

The kinds of descriptive statistics that are most appropriate depend


on the type of data we have collected.
We consider four data types:

• Continuous — numbers that can take on any value in an inter-


val

• Discrete — numbers which are restricted to “isolated” values

• Ordinal (ranked) — ordered labels or categories

• Categorical (nominal) — unordered labels or categories

6
Data types: Continuous data

Examples:

• Weights of newborn infants

• Elapsed time from swine flu exposure to first symptoms

• Blood lead concentrations in kindergarten children

The data arise by measurement. All arithmetic operations (addition,


subtraction, multiplication, division) on the data are meaningful.

7
Data types: Discrete data

Examples:

• Number of children in family

• Number of gold medals won by U.S. in Winter Olympics

• Number of lung cancer cases for each county in U.S. in 2014

The data arise by counting. All arithmetic operations are meaning-


ful, but proper interpretations require common sense (e.g. an average
of 1.73 children per family).

8
Data types: Ordinal data

Examples:

• Burn severity (1st, 2nd, 3rd degree burns)

• Opinion questionnaire items (strongly agree, somewhat agree,


neither agree nor disagree, somewhat disagree, strongly dis-
agree)

Can assign numbers to the category labels; ranking these numbers is


meaningful, but reporting results of arithmetic operations on them is
questionable.

9
Data types: Categorical data

Examples:

• Gender (male or female)

• Ethnicity (white, black, Asian, . . . )

• Blood type (A+, A-, B+, B-, O+, O-, AB+, AB-)

Can assign numbers to the category labels, but the order and magni-
tude of the numbers have no meaning. So almost no mathematical
operations can be performed on the data (exception: counting the
number of individuals that fall into each category).

10
Data types: Final remarks
• The line between continuous and discrete data may sometimes
appear blurry, due to measurement devices which are not “in-
finitely accurate.” Key discriminator: Would the data be dis-
crete if we could measure to an infinite level of accuracy?

• Generally as we move from continuous → discrete → ordinal
→ categorical, the data become cruder and less informative.
Data collection may require trade-offs (larger sample size ver-
sus each datum being more informative).

11
Descriptive statistics: Measures of Center

One important characteristic of a set of data is its “center,” although


there are different ways to define center.
Three common measures of center:

• Mean

• Median

• Mode

12
Measures of Center: The Mean

If we represent the numbers in our data set generically as

X1 , X2 , . . . , Xn

then their mean, X (read “X bar”), is

X = (X1 + X2 + · · · + Xn )/n.

More compactly, using summation notation,

X̄ = (1/n) Σ_{i=1}^{n} X_i .

13
Measures of Center: The Mean

Clearly, in order to compute the mean and have it be meaningful, the


data must be either discrete or continuous.
Toy examples:

• Data 1,2,3,4,5: X = (1 + 2 + 3 + 4 + 5)/5 = 3

• Data 6,7,8,9,10: X = (6 + 7 + 8 + 9 + 10)/5 = 8

• Data 4,6,8,10,12: X = (4 + 6 + 8 + 10 + 12)/5 = 8

• Data 1,1,1,1,36: X = (1 + 1 + 1 + 1 + 36)/5 = 8

• Data 2,3,6,8,11,18: X = (2 + 3 + 6 + 8 + 11 + 18)/6 = 8

14
Measures of Center: The Mean

The mean is the “balance point” for the data, i.e. it is where a fulcrum
would need to be put to balance equally weighted objects placed on
the number line at X1 , X2 , . . . , Xn .

Equivalently, Σ_{i=1}^{n} (X_i − X̄) = 0.

15
Measures of Center: The Median

The median is the middle value in the ordering of all data values
from smallest to largest. Clearly this requires the data to be ordinal,
discrete, or continuous.
If we represent the ordered values in our data set generically as

X_(1) ≤ X_(2) ≤ · · · ≤ X_(n)

then their median, X̃ (read “X tilde”), is

X̃ = X_((n+1)/2)                  if n is odd
X̃ = (X_(n/2) + X_(n/2+1)) / 2    if n is even
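To make these two definitions concrete, here is a minimal Python sketch (my own illustration, not part of the original notes) that computes the mean and the odd/even-n median exactly as defined above, checked against the toy data sets used on these slides.

    # Sample mean and median as defined on the preceding slides.
    # The toy data sets come from the notes themselves.

    def sample_mean(data):
        """Mean: sum of the observations divided by n."""
        return sum(data) / len(data)

    def sample_median(data):
        """Median: middle ordered value (odd n) or average of the two middle values (even n)."""
        ordered = sorted(data)
        n = len(ordered)
        if n % 2 == 1:
            return ordered[n // 2]                      # this is X_((n+1)/2)
        return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

    print(sample_mean([1, 1, 1, 1, 36]))        # 8.0
    print(sample_median([1, 1, 1, 1, 36]))      # 1
    print(sample_median([2, 3, 6, 8, 11, 18]))  # 7.0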

16
Measures of Center: The Median

Toy examples:

• Data 1,2,3,4,5: X̃ = 3

• Data 6,7,8,9,10: X̃ = 8

• Data 4,6,8,10,12: X̃ = 8

• Data 1,1,1,1,36: X̃ = 1

• Data 2,3,6,8,11,18: X̃ = (6 + 8)/2 = 7

The median is also known as the 50th percentile.

17
Measures of Center: The Mode

The mode is the datum that occurs most frequently in the sample.
Toy examples:

• Data 1,1,1,1,36: Mode= 1

• Data 2,3,6,8,11,18: No mode (or, alternatively, every datum is


a mode)

• Data 2,3,3,8,11,18,18: Two modes (bimodal), 3 and 18

The mode is the most common value, but it may or may not be rep-
resentative of the dataset’s center.

18
Mean vs. Median vs. Mode
• Mode is well-defined for all data types, median requires at
least ordinal data, mean requires at least discrete data.

• Mode often useless for continuous data.

• Mode is a datum; median may or may not be a datum (depend-


ing on whether n is odd or even); mean often is not a datum.

• Mean has superior statistical properties (to be seen later).

• Mean is distorted more than the others if the data are skewed
(definition to come later) or contain outliers.

• Units for all three are the same as the units of the data.

19
Descriptive Statistics: Measures of Dispersion (Spread)

Another important attribute of a dataset is how spread out it is from


its center.
Example: Dataset #1 (6,7,8,9,10) versus Dataset #2 (0,4,8,12,16).

Practical applications: pill dosage, nuts and bolts

20
Measures of Dispersion: The Range

Range = X_(n) − X_(1)


Features:

• Very easy to compute

• Because it’s based on only 2 values, it is very sensitive to out-


liers

• Because it’s based on only 2 values, it does not reflect the


variability in the data that lie between the two extremes

21
Measures of Dispersion: The Interquartile Range (IQR)

IQR = Q3 − Q1
where Q1 and Q3 are the first and third quartiles (Q2 , the second
quartile, coincides with the median).
How are the first and third quartiles defined?

• Q1 is the median of the observations less than Q2 (i.e., Q1 is


the median of the lower half of the ordered sample)

• Q3 is the median of the observations greater than Q2 (i.e., Q3


is the median of the upper half of the ordered sample)
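As a small illustration (mine, not from the notes), the sketch below computes Q1 and Q3 as the medians of the lower and upper halves of the ordered sample; the assumption here, made so that the results agree with the toy examples a few slides later, is that when n is odd the middle value is included in both halves.

    # Quartiles and IQR via the "median of each half" rule described above.
    # Assumes the middle observation belongs to both halves when n is odd.

    def median(xs):
        xs = sorted(xs)
        n, mid = len(xs), len(xs) // 2
        return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

    def iqr(xs):
        xs = sorted(xs)
        n = len(xs)
        half = (n + 1) // 2
        q1 = median(xs[:half])       # lower half (includes the median when n is odd)
        q3 = median(xs[n - half:])   # upper half (includes the median when n is odd)
        return q1, q3, q3 - q1

    print(iqr([6, 7, 8, 9, 10]))   # (7, 9, 2), matching the toy example later in the notes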

22
Measures of Dispersion: The Interquartile Range (IQR)

Features of the IQR:

• Not as easy to compute as the range

• Much less sensitive to outliers than the range

• Still, it’s not as informative as a measure that would utilize all


the data

23
Measures of Dispersion: The Variance

We’d like a measure of spread that utilizes information from all the
observations. How about the mean deviation, (1/n) Σ_{i=1}^{n} (X_i − X̄)?
We can avoid the problem of negative deviations canceling out positive
deviations by squaring each deviation, i.e. the mean squared
deviation

(1/n) Σ_{i=1}^{n} (X_i − X̄)² .

The variance, s², is very similar to this:

s² = [1/(n−1)] Σ_{i=1}^{n} (X_i − X̄)² .

(Using a divisor of n − 1 rather than n improves some statistical
properties.)
24
Measures of Dispersion: The Variance

There is an alternate formula for s2 , called the “computational for-


mula,” which is usually easier to calculate (and less susceptible to
careless errors) than the one on the previous page:
s² = [1/(n−1)] [ Σ_{i=1}^{n} X_i² − (Σ_{i=1}^{n} X_i)² / n ]

Note that this involves summing the squares of the data (i.e. to
compute Σ_{i=1}^{n} X_i² we square first and then sum), as well as squaring the
sum of the data.
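A quick Python check (my own sketch) that the definitional and computational formulas agree on one of the toy data sets used on the next slide:

    # Sample variance two ways; both divide by n - 1.

    def variance_definitional(xs):
        n = len(xs)
        xbar = sum(xs) / n
        return sum((x - xbar) ** 2 for x in xs) / (n - 1)

    def variance_computational(xs):
        n = len(xs)
        return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

    data = [6, 7, 8, 9, 10]
    print(variance_definitional(data))   # 2.5
    print(variance_computational(data))  # 2.5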

25
Measures of Dispersion: The Variance

Features of the variance:

• Units of s2 are the squares of the units of observation. To


convert back to the units of observation, we take the (positive)
square root, yielding the standard deviation, s:
s = √(s²)

• s2 (and s) is affected by outliers less than the range, but more


so than IQR.

26
Toy examples
• Data 6,7,8,9,10: Range = 10 − 6 = 4, IQR = 9 − 7 = 2,
s² = (1/4)[(6−8)² + (7−8)² + (8−8)² + (9−8)² + (10−8)²] = 2.5, or
s² = (1/4)[(6² + 7² + 8² + 9² + 10²) − 40²/5] = (1/4)(330 − 320) = 2.5,
s = √2.5 = 1.58

• Data 4,6,8,10,12: Range = 12 − 4 = 8, IQR = 10 − 6 = 4, s² = 10, s = 3.16

• Data 1,1,1,1,36: Range = 36 − 1 = 35, IQR = 1 − 1 = 0, s² = 245, s = 15.65

27
Sample statistics vs. population parameters

Consider taking a sample (hopefully random) from a population and


recording some variable for each member of the sample. When com-
puted from this sample, the measures of center and spread that we’ve
discussed are called statistics and are often called the sample mean,
sample median, . . . , sample standard deviation.
Conceptually, we can imagine computing the same measures for the
entire population; yielding the population mean, population median,
etc. These measures are called parameters.
Each statistic can be thought of as an estimate of the corresponding
parameter. E.g., the sample variance is an estimate of the popula-
tion variance. As n increases, the estimate tends to get closer to the
corresponding parameter.

28
Sample statistics vs. population parameters

In some rare instances the population is sufficiently small and ac-


cessible that we can record the variable for every member of the
population. In such cases we can calculate the mean and variance of
the population — we don’t need to estimate it. These calculations
for the population are performed just as for the sample, except that
a divisor of N (the population size) is used in place of n − 1 in the
calculation of the population variance.

29
Computing X and s2 from grouped data

Some discrete, ordinal, or categorical datasets are large, but if the


number of distinct values in these datasets is small they can be rep-
resented compactly by a frequency table.
Example (from Problem 5 in Chapter 1): 100 sampling quadrats
were surveyed in NY, Xi = # of Cepaea nemoralis snails per quadrat.
# of snails, X[i] Frequency, fi
0 69
1 18
2 7
3 2
4 1
5 1
8 1
15 1
100

30
Computing X and s2 from grouped data

X̄ = (1/n) Σ_{i=1}^{n} X_i = (1/n) Σ_i f_i X_[i] ,

s² = [1/(n−1)] [ Σ_i f_i X_[i]² − (Σ_i f_i X_[i])² / n ]

# of snails, X[i]    f_i    f_i X_[i]    f_i X_[i]²
0                     69        0            0
1                     18       18           18
2                      7       14           28
3                      2        6           18
4                      1        4           16
5                      1        5           25
8                      1        8           64
15                     1       15          225
Total                100       70          394

X̄ = 70/100 = 0.7,   s² = (394 − 70²/100) / 99 = 3.48,   s = 1.87
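A minimal Python sketch (my own) of these grouped-data formulas, using the snail frequency table above:

    # Grouped-data mean and variance from a frequency table of (value, frequency) pairs.
    # The table is the Cepaea nemoralis snail data from the notes (n = 100 quadrats).

    table = [(0, 69), (1, 18), (2, 7), (3, 2), (4, 1), (5, 1), (8, 1), (15, 1)]

    n = sum(f for _, f in table)
    sum_fx = sum(f * x for x, f in table)
    sum_fx2 = sum(f * x * x for x, f in table)

    xbar = sum_fx / n
    s2 = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

    print(n, sum_fx, sum_fx2)            # 100 70 394
    print(round(xbar, 2), round(s2, 2))  # 0.7 3.48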
31
Linear transformations of data

Sometimes we may have computed a measure of center or spread


for data recorded in one type of units (e.g. minutes, pounds), but we
want to have instead the corresponding measure of center or spread
for the data recorded in different units (seconds, kilograms).
One option is to transform every datum to the new units, and recom-
pute the measure of center or spread.
But this is unnecessarily tedious if the new units are a linear trans-
formation of the original units. A linear transformation of data X1 , X2 , . . .
yields transformed data Y1 ,Y2 , . . . ,Yn via the equation

Yi = aXi + b, i = 1, 2, . . . , n.

32
Linear transformations of data

Examples:

• Minutes → seconds: Y_i = 60 X_i + 0

• Pounds → kilograms: Y_i ≈ (1/2.20462262) X_i + 0

• °F → °C: Y_i = (5/9)(X_i − 32) = (5/9) X_i − 160/9

Note: not all transformations are linear. One example of a nonlinear


transformation is the logarithmic (“log”) transformation, Yi = log Xi .

33
Linear transformations of data

What happens to measures of center when the data are linearly trans-
formed? Consider the following examples:

• Original data (1,2,3,4,4,4): X = 3, X̃ = 3.5, mode = 4

• Add 100 to each datum (Yi = Xi + 100), yielding transformed


data (101,102,103,104,104,104): Y = 103, Ỹ = 103.5, mode =
104

• Multiply each original datum by 5 (Yi = 5Xi ), yielding trans-


formed data (5,10,15,20,20,20): Y = 15, Ỹ = 17.5, mode = 20

In general, Y = aX + b (with a similar result for median and mode).

34
Linear transformations of data

What happens to measures of spread when the data are linearly trans-
formed? Consider the same example:

• Original data (1,2,3,4,4,4): Range = 3, IQR = 2, s2X = 1.6,


sX = 1.26

• Add 100 to each datum (Yi = Xi + 100), yielding transformed


data (101,102,103,104,104,104): Range = 3, IQR = 2, sY2 =
1.6, sY = 1.26

• Multiply each original datum by 5 (Yi = 5Xi ), yielding trans-


formed data (5,10,15,20,20,20): Range = 15, IQR = 10, sY2 =
40, sY = 6.32

35
In general, s_Y = |a| s_X (with a similar result for range and IQR), while
s_Y² = a² s_X².
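The two rules above are easy to verify numerically; here is a small Python sketch (my own) using the toy data from these slides and the transformation Y_i = 5 X_i:

    # Check Ybar = a*Xbar + b, s_Y^2 = a^2 * s_X^2, and s_Y = |a| * s_X.
    from statistics import mean, stdev, variance

    x = [1, 2, 3, 4, 4, 4]
    a, b = 5, 0
    y = [a * xi + b for xi in x]

    print(mean(y), a * mean(x) + b)            # 15 15
    print(variance(y), a ** 2 * variance(x))   # 40 40
    print(stdev(y), abs(a) * stdev(x))         # both about 6.32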

36
Accuracy and Precision

Accuracy is the closeness of a measured or computed value to its


true value.
Precision is the closeness of repeated measurements of the same
quantity to each other (regardless of whether they are close to the
true value).
The # of digits used for recording continuous data implies a certain
level of precision. The “30–300 rule” (see next slide) should be used
to determine this level. Whatever the level of precision of the data,
measures of center and all measures of spread except the variance
should be calculated to one additional digit. The variance should be
calculated to two additional digits. Example:
Data 1.7, 1.2, 2.9, 2.1: X = 1.975 → 1.98, s² = 0.5158333 → 0.516.
37
The 30–300 Rule

As noted previously, the greater the level of precision in the mea-


sured data, generally the more costly (in time and effort) it is to
collect that data, as well as to compute numerical measures such as
the mean and variance.
But measuring too crudely (e.g. measuring height of people to the
nearest meter) may render the entire inferential enterprise worthless.

The 30–300 rule says to measure data to a level of precision for


which there are at least 30, but not more than 300, unit steps be-
tween the smallest and largest measurements.
Example: If the smallest and largest heights of students in this class
are 4’8” and 6’4”, then we should record height to the nearest 0.5”.
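As a quick check of this example (my own sketch, not from the notes): 4'8" is 56 inches and 6'4" is 76 inches, so we can count the unit steps between the extremes for a few candidate levels of precision.

    # 30-300 rule: count unit steps between the smallest and largest measurements.

    def unit_steps(smallest, largest, precision):
        return (largest - smallest) / precision

    for precision in (1.0, 0.5, 0.1):
        steps = unit_steps(56, 76, precision)      # heights in inches from the example
        print(precision, steps, 30 <= steps <= 300)
    # 1.0 in -> 20 steps  (too crude: fewer than 30)
    # 0.5 in -> 40 steps  (satisfies the rule, as the notes conclude)
    # 0.1 in -> 200 steps (also within 30-300, but costlier to measure)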
38
Frequency distributions (in tabular form)

Recall the frequency table used to summarize data from 100 sam-
pling quadrats surveyed in NY, Xi = # of Cepaea nemoralis snails
per quadrat:
# of snails, X[i] Frequency, fi
0 69
1 18
2 7
3 2
4 1
5 1
8 1
15 1
100

We can add columns to this table that give relative frequencies (pro-
portions of the total # of observations that fall in each category) and
cumulative relative frequencies (proportions of the total # of obser-
39
vations that fall in each category or previous categories).

# of snails, X[i] Frequency, fi Relative freq Cum rel freq


0 69 0.69 0.69
1 18 0.18 0.87
2 7 0.07 0.94
3 2 0.02 0.96
4 1 0.01 0.97
5 1 0.01 0.98
8 1 0.01 0.99
15 1 0.01 1.00
100 1.00

The example above involves discrete data; we can do a similar thing


for continuous data but we have to partition the interval containing
the data into several subintervals of equal size and do the counting
within those subintervals.

40
Frequency table for first age guess data

Age guess Frequency Relative freq Cum rel freq


[30, 35)    1    0.01    0.01
[35, 40)    2    0.03    0.04
[40, 45)    4    0.06    0.10
[45, 50)   26    0.39    0.49
[50, 55)   28    0.42    0.91
[55, 60)    6    0.09    1.00
Total      67    1.00

41
Bar graphs
• A graphical display of the distribution of frequencies (or rela-
tive frequencies)

• Used for discrete, ordinal, or categorical data

• Simply plot a bar, centered at each data value, whose height is


equal to the corresponding (relative) frequency

42
Bar graph for snail data

43
Histograms
• Similar to a bar graph

• Used for continuous data

• Have to choose how many subintervals to use, and where to


start the first and end the last

• Generally 5 to 15 subintervals work best: < 5 → oversummarization, and > 15 → undersummarization.

44
Histogram of Old Faithful geyser eruption durations
[Histogram: geyser$duration (1 to 5) on the horizontal axis, frequency (0 to 80) on the vertical axis]

45
46
Histogram of guesses of Dr. Z’s age

47
Data shape

In addition to center, spread, and outliers, a bar graph/histogram dis-


plays the shape of the data.
The data’s relative frequency distribution is said to be symmetric
(approximately) if the bar graph/histogram is symmetric (approxi-
mately) around its center.
Examples of symmetric distributions:

48
Data shape: Skewness

If the data are not symmetric, they may be skewed.

• Right skewed — a longer right tail

• Left skewed — a longer left tail

Examples of skewed distributions:

49
Effects of data shape on measures of center
• For a symmetric distribution, mean = median

• For a symmetric unimodal distribution, mean = median = mode

• For a right skewed distribution, mode < median < mean

• For a left skewed distribution, mean < median < mode

50
Five-number summaries and box plots

The five-number summary of a dataset consists of the numbers in the


following list (and in this order):

Minimum, Q1 , Median, Q3 , Maximum

A box plot of a dataset is a graphical version of the five-number


summary, with a few extras.
Generic box plot:

51
Step-by-step procedure for constructing a box plot
1. Draw a horizontal (or vertical) reference scale based on the
extent of the data.
2. Draw a box whose sides (or top and bottom) are located at Q1
and Q3 .
3. Draw a vertical (horizontal) line segment at the median.
4. Compute the fences, f1 = Q1 − 1.5 * IQR and f3 = Q3 + 1.5 *
IQR.
5. Extend a line segment (so-called whisker) from Q1 out to the
most extreme observation that is at or inside the fence, i.e.,
≥ f1. Repeat on the other side, i.e., from Q3 to the most
extreme observation that is ≤ f3. Mark the end of these line
segments with a ×.
52
6. Mark any observations beyond the fences with an open circle,
○; these are regarded as outliers.

7. If you are constructing more than one box plot for comparison
purposes, use the same scale for all of them and put them side-
by-side (or one on top of another)
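Here is a minimal Python sketch (my own) of steps 4–6 above, using the same quartile convention as the earlier IQR sketch; it returns the five-number summary, the fences, the whisker endpoints, and any outliers.

    # Box plot ingredients: fences f1 = Q1 - 1.5*IQR and f3 = Q3 + 1.5*IQR,
    # whiskers to the most extreme observations inside the fences, outliers beyond.

    def median(xs):
        xs = sorted(xs)
        n, mid = len(xs), len(xs) // 2
        return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

    def box_plot_summary(data):
        data = sorted(data)
        n = len(data)
        half = (n + 1) // 2
        q1, q3 = median(data[:half]), median(data[n - half:])
        iqr = q3 - q1
        f1, f3 = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        inside = [x for x in data if f1 <= x <= f3]
        return {"five_number": (data[0], q1, median(data), q3, data[-1]),
                "fences": (f1, f3),
                "whiskers": (min(inside), max(inside)),
                "outliers": [x for x in data if x < f1 or x > f3]}

    print(box_plot_summary([1, 1, 1, 1, 36]))   # 36 is flagged as an outlier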

53
54
Box plots for guesses of Dr. Z’s age

55
Describing shape from a box plot
• For an (approximately) symmetric distribution, Q2 will be near
the middle of the box, and the two whiskers will be nearly the
same length.

• Right (left) skewness lengthens the box and the whisker to the
right (left) of the median, relative to the lengths of the box and
whisker on the other side of the median.

56
Probability: Basic concepts and terminology

Probability is a number between 0 and 1 (inclusive) that measures


how likely something is to occur. A more formal definition will be
given subsequently.
An experiment is an activity with an outcome that is observable but
not predictable with certainty (so it’s sometimes called a random
experiment). Examples:

• Rolling a six-sided die


• Taking an aspirin when you have a headache and determining
whether it relieved the pain
• Randomly sampling an incoming UI freshman and measuring
their HS GPA
57
Probability: Basic concepts and terminology

The sample space is the collection of all possible outcomes of the


experiment. Examples from previous 3 experiments:

• S = {1, 2, 3, 4, 5, 6}

• S = {pain relieved, pain not relieved}

• S = [0.00, 4.00]

58
Probability: Basic concepts and terminology

An event is a subset of the sample space, i.e. an outcome or a sub-


set of outcomes of the experiment. Usually represent events using
symbols A, B,C, . . .. Examples from previous 3 sample spaces:

• A = {1, 2}, B = {2, 4, 6}


• A = {pain not relieved}
• A = {2.93}, B = [2.5, 3.0)

Probability, P, is a mathematical function that assigns a unique num-


ber between 0 and 1 (inclusive) to every possible event of an exper-
iment. It must obey certain axioms (see p. 35 of text).
We write the probability that event A occurs as P(A).
59
Probability: Basic concepts and terminology

If a sample space consists of a finite #, say n, of outcomes, and the


outcomes are equally likely, then the axioms of probability can be
used to show that each outcome has probability 1/n and any event
with k outcomes has probability k/n.
Example: Consider rolling a fair six-sided die. The sample space is
S = {1, 2, 3, 4, 5, 6} and each of the six outcomes is equally likely.
So, for instance, if we define A = {2, 4, 5}, then P(A) = 3/6 = 1/2.
The definition of probability based on the aforementioned axioms is
known as classical probability.

60
Probability: Basic concepts and terminology

Another type, or interpretation, of probability is the relative fre-


quency definition. Under this definition, we consider repeating the
experiment a large number, N of times. The relative frequency prob-
ability of an event A is given by
(# times A occurs) / N.
When the classical probability of A is known, it turns out that it
coincides with
lim_{N→∞} (# times A occurs) / N.

61
Counting outcomes

Sometimes the sample space of an experiment consists of n equally


likely outcomes but n is not so easy to determine. This is particu-
larly so when the experiment is a multi-stage or composite exper-
iment, i.e. an experiment built up from smaller experiments. For
those cases we need to learn some rules about counting outcomes.
Example 1: Toss a fair coin 6 times. This experiment is made up of 6
smaller experiments, each of which is to toss a fair coin once. Each
smaller experiment has 2 possible (and equally likely) outcomes: H
for “heads” and T for “tails”. How many outcomes does the com-
posite experiment have?

62
Counting outcomes

The Multiplication Rule says that the number of outcomes for a


composite experiment is obtained by multiplying the number of out-
comes for the individual experiments together, subject to any im-
posed restrictions on repetition (or any other restrictions).
So for Example 1, the number of outcomes for the composite exper-
iment is
2 × 2 × 2 × 2 × 2 × 2 = 2^6 = 64.
Thus, P(H, H, T, T, H, T) = 1/64,
and P(first and last tosses are heads) = 2^4/64 = 1/4.
In general, if a composite experiment consists of k smaller experi-
ments, each with n possible outcomes, and repetition of outcomes is
allowed, then the composite experiment has n^k possible outcomes.
63
Counting outcomes

Example 2: There are four different stimuli to be applied to crayfish.


Four crayfish are available. Each stimulus is to be applied to one and
only one crayfish. How many different assignments of stimulus to
the crayfish are possible?
Answer: There are 4 choices of stimulus for the first crayfish, 3
choices for the second, 2 for the third, and 1 for the last crayfish.
Thus the number of different assignments of stimulus to the 4 cray-
fish is
4 × 3 × 2 × 1 = 4! = 24.
In general, if a composite experiment consists of k smaller experi-
ments, each with n possible outcomes, and repetition of outcomes is
not allowed, then the composite experiment has n × (n − 1) × · · · ×
(n − k + 1) = n!/(n − k)! possible outcomes.
64
Counting outcomes

Example 3: There are four different stimuli that can be applied to


crayfish. Two of the four stimuli are to be randomly selected and ap-
plied to one crayfish (i.e. the crayfish receives both stimuli), and the
order in which it receives the two stimuli is regarded as irrelevant.
How many different stimulus pairs are there?
Answer: 6; specifically they are {(s1 , s2 ), (s1 , s3 ), (s1 , s4 ), (s2 , s3 ),
(s2 , s4 ), (s3 , s4 )}. This is the # of ways of choosing 2 stimuli from 4
stimuli, without regard to order.
In general, the # of ways of choosing k items from n items, without
regard to order, is

n! / (k!(n−k)!) ≡ \binom{n}{k}.
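A small Python sketch (my own) of the three counting rules from the last few slides, using the standard library's math module:

    import math

    # Example 1: 6 coin tosses, repetition allowed -> n^k outcomes.
    print(2 ** 6)                              # 64

    # Example 2: 4 stimuli assigned to 4 crayfish, no repetition -> n!/(n-k)!.
    print(math.factorial(4))                   # 24

    # Example 3: choose 2 of 4 stimuli, order irrelevant -> "n choose k".
    print(math.comb(4, 2))                     # 6
    print(math.comb(3, 1) / math.comb(4, 2))   # 0.5 = P(s3 is among the two selected)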

65
Counting outcomes

Example 3, continued: Thus, the probability that s3 is among the


two stimuli selected is calculated as
[# of outcomes in {(s_1, s_3), (s_2, s_3), (s_3, s_4)}] / \binom{4}{2} = 3/6 = 1/2.

Example 4: Problem 2.3(a).

66
Complement of an event

Let A be an event. The complement of A, denoted A′, consists of all
outcomes of the experiment that are not in A.
Complement in a Venn diagram:
Example 1: Roll a fair die once. If A = {3, 6}, then A′ = {1, 2, 4, 5}.

Example 2: Toss a fair coin 6 times. If A = {first and last tosses are
heads}, then A′ = {either the first toss or the last toss (or both) are
tails}.
Probability of complement:

P(A′) = 1 − P(A).

67
Intersection of two events

Let A and B be two events. The intersection of A and B, denoted by


A ∩ B, consists of all outcomes that are in both A and B.
Intersection in a Venn diagram:

Example: Roll a fair die once. If A = {3, 6}, B = {2, 3, 4}, and
C = {1, 2, 4}, then:

A ∩ B = {3},
A ∩ C = ∅,
B ∩ C = {2, 4}.

68
Conditional probability

Let A and B be two events. The conditional probability, P(A|B), is


the probability that A occurs given that B has occurred.
The probability of the intersection of two events is related to the con-
ditional probability of one event given the other through the multi-
plicative rule of probability:

P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B).

Provided that P(A) ≠ 0, we may re-express the first of these equali-
ties as follows (with a similar result for the second equality):

P(B|A) = P(A ∩ B) / P(A).
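A short Python sketch (my own) that applies this rule to the die-roll events used in the examples on the surrounding slides:

    # Conditional probability for one roll of a fair die: P(B|A) = P(A ∩ B) / P(A).
    from fractions import Fraction

    S = {1, 2, 3, 4, 5, 6}

    def prob(event):
        return Fraction(len(event & S), len(S))

    def cond_prob(b, a):
        """P(B|A), assuming P(A) != 0."""
        return prob(a & b) / prob(a)

    A, B, C = {1, 2}, {1, 3, 5}, {1, 2, 3}
    print(prob(A), cond_prob(A, B))   # 1/3 and 1/3 -> A and B independent
    print(prob(A), cond_prob(A, C))   # 1/3 and 2/3 -> A and C not independent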

69
Conditional probability examples

In the die roll experiment, if A = {1, 2}, B = {1, 3, 5}, and C =


{1, 2, 3}, then:

70
Independent events

Events A and B are said to be independent if the occurrence of one


of them has no effect on the probability of occurrence of the other,
i.e. if

P(A|B) = P(A) or equivalently P(B|A) = P(B).

As a consequence of the multiplicative rule of probability, if A and


B are independent then the probability of their intersection is

P(A ∩ B) = P(A) · P(B).

Example: Roll a fair die once. Let A = {1, 2}, B = {1, 3, 5}, C =
{1, 2, 3}. Then A and B are independent, but A and C are not inde-
pendent.

71
Mutually exclusive events

Events A and B are said to be mutually exclusive (or disjoint) if they


have no outcomes in common, i.e. if their intersection is empty.
Mutually exclusive events in a Venn diagram:

Example: Roll a fair die once. Then A = {3, 6} and B = {1, 4, 5} are
mutually exclusive.
Note: If A and B are mutually exclusive, then P(A ∩ B) = 0 (and vice
versa).

72
Union of two events

Let A and B be two events. The union of A and B, denoted by A ∪ B,


consists of all outcomes that are in either A or B (or both).
Union in a Venn diagram:

Example: Roll a fair die once. If A = {3, 6}, B = {2, 3, 4}, and
C = {1, 2, 4}, then:

A ∪ B = {2, 3, 4, 6},
A ∪ C = {1, 2, 3, 4, 6},
B ∪ C = {1, 2, 3, 4}.

73
Probability of the union of two events

In general,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

However, if A and B are mutually exclusive, this equality simplifies
to
P(A ∪ B) = P(A) + P(B).
Applying these to the example on the previous slide, we obtain

P(A ∪ B) = 1/3 + 1/2 − 1/6 = 2/3

P(A ∪ C) = 1/3 + 1/2 − 0 = 5/6.

74
Probability practice with a biological problem

A study in the pinyon pine woodlands of northern Arizona examined


the susceptibility of pinyon pine trees to drought-induced mortality.
The study area is often exposed to severe drought. Some pinyon
pine trees contain a protein called glycerate dehydrogenase (GLY),
which is suspected to increase a tree’s drought resistance. Suppose
that a pinyon pine tree is selected at random from the study area, and
that:
P(D) = 0.20 where D ={tree is dead},
P(G) = 0.70 where G ={GLY is present in tree}, and
P(D ∩ G) = 0.07.

75
Probability practice with a biological problem (cont’d)

1. Calculate the probability that the selected tree is dead or contains


GLY (or both).

76
Probability practice with a biological problem (cont’d)

2. Calculate the probability that the selected tree is dead, given that
it contains GLY.

77
Probability practice with a biological problem (cont’d)

3. Calculate the probability that the selected tree is dead, given that
it does not contain GLY.

78
Probability practice with a biological problem (cont’d)

4. Do the results of Problems 2 and 3 provide evidence to support the


suspicion that GLY provides some resistance to drought? Explain
briefly.

79
Probability practice with a biological problem (cont’d)

5. Are events D and G independent, mutually exclusive, both, or


neither? Justify your answer.

80
Bayes Rule

In some biological settings we know P(A), P(B), and P(A|B), and


we want to find P(B|A). We can obtain it using Bayes Rule:

P(B|A) = P(B)P(A|B) / P(A)

If we do not know P(A) but we know P(A|B′), we can substitute for
P(A) in the above equation using the result

P(A) = P(A ∩ B) + P(A ∩ B′)
     = P(B)P(A|B) + P(B′)P(A|B′).
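A minimal Python sketch (my own) of Bayes Rule in this form, using the TB skin-test numbers from the diagnostic-screening slide that follows:

    # Bayes Rule with P(A) expanded by the law of total probability.

    def bayes(p_b, p_a_given_b, p_a_given_not_b):
        p_a = p_b * p_a_given_b + (1 - p_b) * p_a_given_not_b   # P(A)
        return p_b * p_a_given_b / p_a                          # P(B|A)

    # P(B) = .03 (has TB), P(A|B) = .90 (positive test given TB), P(A|B') = .05.
    print(round(bayes(0.03, 0.90, 0.05), 4))   # about 0.358: P(TB | positive skin test)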

81
Application of Bayes rule: Diagnostic screening

Suppose that a nurse tests a person for TB using the “skin test.”
Define

A = {skin test positive}, B = {person has TB}.

Suppose further that for the population of interest, P(B) = .03, P(A|B) =
.90, and P(A|B′) = .05. Then

P(B|A) =

82
Random variables

A random variable is a variable whose outcome is determined at


least partly by chance. It results from carrying out a (random) ex-
periment.
We focus (for now) on two types of random variables:

• Discrete random variable — the only values the variable can


have are discrete. Examples: # heads in 3 tosses of a fair coin,
# snails in a randomly chosen quadrat.

• Continuous — the values the variable can have are continuous.


Examples: wt of 1st infant born in UIHC in 2015, survival
time of a randomly selected stroke patient.

83
Random variables

We usually represent random variables by capital letters near the end


of the alphabet, especially X, Y , and Z; and we represent a generic
value of the variable by the corresponding lower-case letter.

84
Discrete random variables: the probability density
function

For every discrete random variable, the numerical values it can take
on can be listed (prior to conducting the random experiment) and a
probability can be associated with each one. The function, f (x) =
P(X = x), that assigns these probabilities is called the probability
density function (pdf).
Example: Toss a fair coin three times, let X = # heads. The pdf is:
x f (x)
0 .125
1 .375
2 .375
3 .125

85
Discrete random variables: the probability density
function

More generally, the pdf, f , of a discrete random variable satisfies:

• f is defined for all real numbers;


• f (x) ≥ 0 (since it’s a probability);
• f (x) = 0 for most real numbers since X is discrete;
• Σ_{all x} f(x) = 1.

Furthermore, if A is an event, we have

P(A) = P(X ∈ A) = Σ_{all x ∈ A} f(x).

86
Discrete random variables: the probability density
function

Example: If A in the previous example is {at least one head}, we


find that

P(A) = P(X = 1)+P(X = 2)+P(X = 3) = .375+.375+.125 = .875.

If B = {an odd number of heads}, then

P(B) = P(X = 1) + P(X = 3) = .375 + .125 = .500.

87
Discrete random variables: the probability density
function

Note: We may display the pdf of a discrete random variable graphi-


cally, by something similar to a bar graph for discrete data. For the
previous example, we have:

88
Discrete random variables: the cumulative distribu-
tion function

If we apply the same accumulating procedure to the pdf of a dis-


crete variable as we applied to the relative frequencies of a discrete
frequency distribution to get the cumulative relative frequencies, we
obtain the cumulative distribution function (CDF), F(x) = P(X ≤ x).

Example: Toss a fair coin three times, let X = # heads. The CDF is:
x f (x) F(x)
0 .125 .125
1 .375 .500
2 .375 .875
3 .125 1.000

89
Discrete random variables: the expected value

Suppose we repeated the experiment of 3 tosses of a fair coin a very


large number of times. What would be the average value of X over
all repetitions of the experiment? Using the pdf, we can determine
this analytically rather than empirically.
The expected value (long-run average value) of a discrete random
variable X is given by
µ = E(X) = Σ_{all x} x f(x).
Example 1: For the experiment of 3 tosses of a fair coin, with X =
# of heads, we have
E(X) = (0)(.125) + (1)(.375) + (2)(.375) + (3)(.125) = 1.5.

90
Discrete random variables: the expected value

Example 2: The proportions of the 114.4 million U.S. households of


various sizes in 2006 were as follows (Source: U.S. Census Bureau):
Household size (x) Proportion ( f (x))
1 .270
2 .330
3 .170
4 .140
5 .060
6 .020
7+ .010

If we randomly sample one household from this population, what is


its expected size? Answer: E(X) = (1)(.270) + (2)(.330) + · · · +
(7)(.010) = 2.49 (actually 2.49+).

91
Discrete random variables: the variance

Again, suppose we repeat a random experiment a very large number


of times, and observe the outcome of the random variable X for each
such experiment. How spread out would we expect these outcomes
to be? The answer is provided by the variance of X, computed from
the pdf as follows:

σ² = E[(X − µ)²] = Σ_{all x} (x − µ)² f(x) = Σ_{all x} x² f(x) − µ².

Example: For the experiment of 3 tosses of a fair coin, with X = #
of heads, we have

σ² = (0)²(.125) + (1)²(.375) + (2)²(.375) + (3)²(.125) − 1.5² = 0.75.
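The two computations above are easy to reproduce; here is a minimal Python sketch (my own) using the coin-toss pdf from these slides:

    # Expected value and variance of a discrete random variable from its pdf.
    pdf = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}   # X = # heads in 3 fair-coin tosses

    mu = sum(x * f for x, f in pdf.items())
    sigma2 = sum(x ** 2 * f for x, f in pdf.items()) - mu ** 2

    print(mu, sigma2)   # 1.5 0.75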

92
The binomial distribution: basic framework

In many biological applications, there is some basic trial for which


there are only two possible outcomes, labelled “success” and “fail-
ure”, and we are interested in the # of successes that occur when we
repeat the basic trial n times.
Examples:

• # of heads in 3 tosses of a fair coin


• # of left-handed people in a class of 25 kindergarteners
• # of cancer patients in a clinical trial whose cancer goes into
remission

All these are examples of binomial random variables.


93
The binomial distribution: basic framework

More precisely, if:

• A fixed number n of trials are carried out;

• The outcome of each trial is either a “success” or a “failure”;

• The probability of success, denoted by p, is constant from trial


to trial;

• The trials are independent (the outcome on any particular trial


does not affect the outcome of any other trial);

then X = # successes in the n trials


is a binomial random variable with parameters n and p.
94
The binomial distribution: the pdf

The pdf of a binomial random variable (with parameters n and p) is


called the binomial pdf (with parameters n and p). It turns out that
this pdf is given by
f(x) = [n! / (x!(n−x)!)] p^x (1−p)^(n−x)   for x = 0, 1, 2, . . . , n.

(See text, p. 72 for rationale.)
Example 1: X = # of heads in 3 tosses of a fair coin:

f(x) = [3! / (x!(3−x)!)] (0.5)^x (1−0.5)^(3−x) = [3! / (x!(3−x)!)] (0.5)^3 = (3/4) / [x!(3−x)!]
for x = 0, 1, 2, 3. (Check to make sure this matches what we had
previously on p. 81 of these notes.)
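A small Python sketch (my own) of the binomial pdf just defined, checked against the coin-toss pdf from earlier in the notes and applied to Example 2 on the next slide:

    # Binomial pdf: f(x) = C(n, x) * p^x * (1 - p)^(n - x).
    import math

    def binom_pdf(x, n, p):
        return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

    # X = # heads in 3 tosses of a fair coin; matches .125, .375, .375, .125.
    print([binom_pdf(x, 3, 0.5) for x in range(4)])

    # Example 2 (next slide): P(no lactose intolerant people among 7) with p = 0.10.
    print(round(binom_pdf(0, 7, 0.10), 4))   # 0.9^7, about 0.478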
95
The binomial distribution: biological examples

Example 2: The proportion of U.S. residents that are lactose intoler-


ant is believed to be about 0.10. If this is correct and we randomly
select 7 people from the U.S. population, what is the probability that
there are no lactose intolerant people in the sample?
Answer:

96
The binomial distribution: biological examples

Example 3: The proportion of U.S. residents that have Type A blood


is about 40%. If we randomly select 20 people from the U.S. popu-
lation, what is the probability that at least 40% of the people in the
sample have Type A blood?
Answer:

97
The binomial distribution: the CDF and Table C.1

If X has a binomial distribution with parameters n and p, and if we


let d represent any integer between 0 and n, then the CDF of X is
given by
F(d) = P(X ≤ d) = Σ_{x=0}^{d} [n! / (x!(n−x)!)] p^x (1−p)^(n−x) .

For selected values of n and p, this CDF is given in Table C.1 of our
text, and we can use it to avoid having to compute the pdf.
Re-do of Example 1:

98
The binomial distribution: the CDF and Table C.1

Re-do of Example 2:

Re-do of Example 3:

99
The binomial distribution: Mean, variance, and shape

If X is a binomial random variable with parameters n and p, then:

• µ = E(X) = np

• σ² = np(1 − p)

• the pdf is symmetric if p = 0.5; right skewed if p < 0.5; and


left skewed if p > 0.5

100
The Poisson distribution: basic framework

Another important experimental framework in biology is one in which


counts are made of the # of occurrences of some phenomenon (a ba-
sic event) in a fixed interval of time or a fixed unit of length, area, or
volume.
Examples:

• # of hurricanes in a year

• # of snails in a 1 m × 1 m quadrat

• # of mutations in a segment of DNA of fixed length

All these are examples of Poisson random variables.

101
The Poisson distribution: basic framework

More precisely, if:

• The basic events occur one at a time (never simultaneously);

• The occurrence of a basic event in a given period is indepen-


dent of the occurrence of the event in any other non-overlapping
period;

• The expected # of basic events during any period of unit length


is µ, and the expected # of basic events during any period of
length t is tµ;

then X = # occurrences of the basic event in a period of unit length


is a Poisson random variable with parameter (and mean) µ.
102
The Poisson distribution: the pdf

The pdf of a Poisson random variable with parameter µ is given by


f(x) = e^(−µ) µ^x / x! ,   for x = 0, 1, 2, . . . .

Example: Suppose that X, the number of hurricanes in any given
calendar year, is a Poisson random variable with mean 5. Find the
pdf of X and use it to determine the probability that there will be 3
or fewer hurricanes in 2016.
Answer:

f(x) = e^(−5) 5^x / x! ,   for x = 0, 1, 2, . . . ,

so P(X ≤ 3) = e^(−5) (5^0/0! + 5^1/1! + 5^2/2! + 5^3/3!) = .2650.
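A short Python sketch (my own) reproducing the hurricane calculation with the Poisson pdf above:

    # Poisson pdf f(x) = exp(-mu) * mu^x / x!, with the CDF as a running sum.
    import math

    def poisson_pdf(x, mu):
        return math.exp(-mu) * mu ** x / math.factorial(x)

    def poisson_cdf(d, mu):
        return sum(poisson_pdf(x, mu) for x in range(d + 1))

    print(round(poisson_cdf(3, 5), 4))       # 0.265 = P(3 or fewer hurricanes)
    print(round(1 - poisson_cdf(0, 5), 4))   # P(at least one hurricane)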

103
The Poisson distribution: the CDF and Table C.2

The CDF of a Poisson random variable with parameter µ is given by


F(d) = P(X ≤ d) = Σ_{x=0}^{d} e^(−µ) µ^x / x! ,   for d = 0, 1, 2, . . . .

Hurricane example, continued: Use Table C.2 to re-obtain P(X  3)


and also to obtain the probability that there are exactly 3 hurricanes
in 2016 and the probability that there is at least one hurricane in
2016.

104
The Poisson distribution: changing the length of the
time period

The third part of the framework for a Poisson random variable (p. 98
of these notes) tells us that if the number of basic events in a period
of unit length has expected value µ, then the number of basic events
in a period of length t is a Poisson random variable with parameter
(and mean) tµ. So, with proper modification we can compute prob-
abilities for events involving a period of any length.
Hurricane example, continued: What is the probability that there are
10 or fewer hurricanes from 2016-2018 (inclusive)?

105
Poisson approximation to the binomial distribution

In addition to being useful for determining probabilities involving


Poisson random variables, the Poisson pdf (and CDF) is useful for
approximating probabilities involving binomial random variables in
certain circumstances. The circumstances are:

n ≥ 100, np ≤ 10.

In place of the binomial CDF with parameters n and p, we use the


Poisson CDF with mean µ = np.
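A quick Python sketch (my own) comparing the exact binomial probability with its Poisson approximation, using the mosquito-bite numbers from the example on the next slide (n = 120, p = .01):

    # Poisson approximation to the binomial: P(X >= 1) exactly vs. approximately.
    import math

    n, p = 120, 0.01
    mu = n * p                           # Poisson mean used in the approximation

    exact = 1 - (1 - p) ** n             # binomial: 1 - P(X = 0)
    approx = 1 - math.exp(-mu)           # Poisson:  1 - P(X = 0)

    print(round(exact, 4), round(approx, 4))   # 0.7006 vs 0.6988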

106
Poisson approximation to the binomial distribution

Example: Suppose that you go on a trip to the Caribbean, and while


you are there, each time you are bitten by a mosquito the probability
that the mosquito is carrying the Zika virus is .01. Suppose further
that you are bitten by 120 mosquitos while on the trip. What is the
probability that at least one of these bites will be from a Zika virus
carrier?

107
Continuous random variables

Recall that if X is a continuous random variable, it can take on any


value in a specified interval. Higher mathematics tells us that the #
of real numbers in an interval is infinite, in fact uncountably infinite
(in contrast to, say, the nonnegative integers which are countably
infinite). As a result, the axioms of probability dictate that:

• We must assign a probability of 0 to the event that X equals


any single real number in the specified interval;

• The only events that can be assigned meaningful nonzero prob-


abilities are subintervals (or unions thereof) of the specified
interval.

108
Continuous random variables

Thus, if a and b are two real numbers such that a < b, then

P(X = a) = 0 and P(X = b) = 0,

but P(a < X < b) can be nonzero. Furthermore,

P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b)

since, for example,

P(a ≤ X < b) = P(X = a) + P(a < X < b) = 0 + P(a < X < b).

Example: If we randomly select an incoming UI freshman, we must


assign 0 to the probability that his/her HS GPA equals 2.93, and . . .

109
Continuous random variables: the pdf

Let X be a continuous random variable. Since P(X = x) = 0 for any


number x, the pdf of X must be defined somewhat differently than it
was for a discrete random variable.
The pdf of a continuous random variable is a function, f , that satis-
fies:

• f (x) ≥ 0;

• the area under the graph of f (and above the x-axis) is equal
to 1.0;

• for any real numbers a and b with a ≤ b, P(a ≤ X ≤ b) is given


by the area under the graph of f between x = a and x = b.

110
Continuous random variables: the pdf

Example 1: Suppose that X is a continuous random variable with


pdf given by the following picture:

Equivalently, f(x) = 1/2 if 1 < x < 3, and f(x) = 0 otherwise.

Then, e.g., P(1 < X < 2) = (1/2) · 1 = 1/2, P(1.5 < X < 4) =
We can also solve problems like: Find a number c such that P(X <
c) = 2/3.

111
Continuous random variables: the pdf

Example 2: Suppose that X is a continuous random variable with


pdf given by the following picture:

Then, e.g.:

• P(X > 1) =
• P(|X| < 1) =
• Find a number c such that P(0 < X < c) = 14 :

112
Continuous random variables: the CDF

If X is a continuous random variable, its CDF is given by

F(x) = P(X ≤ x),

which is the area under the graph of the pdf from −∞ to x.

For the previous Example 1,

F(x) = 0              for x < 1,
F(x) = (1/2)(x − 1)   for 1 ≤ x < 3,
F(x) = 1              for x ≥ 3.

113
Continuous random variables: mean and variance

The mean, µ, of the distribution of a continuous random variable is


the place where a fulcrum placed under the graph of the pdf would
make the graph balance.
In the previous Example 1, µ = 2; in the previous Example 2, µ = 0.
These were easily determined because of the symmetry of the pdf
around its mean. More generally, the mean is computed by methods
of integral calculus (never mind!).
The variance, σ², of the distribution of a continuous random variable
is a measure of how spread out the pdf is, and it’s also computed by
methods of integral calculus (never mind!).

114
The normal distribution: Introduction

The most important continuous probability distribution is the nor-


mal distribution, whose pdf is a bell-shaped curve. Why is it so
important?

• In practice, samples of many physical measurements and other


biological variables have a relative frequency distribution which
often seems to be bell-shaped.

• The normal distribution has nice mathematical properties.

• The normal distribution provides an accurate approximation


to the distribution of the sample mean, no matter what the
distribution of the observations in the sample is (Central Limit
Theorem — much more on this later).

115
The normal distribution: the pdf

The normal distribution’s pdf, the “normal curve,” has the following
features:

• a single peak, located at the mean µ

• symmetric around µ (thus mean=median=mode)

• tails that extend infinitely far in both directions

• a standard deviation, σ, that controls where the curve’s 2 points
of inflection are: µ + σ and µ − σ

• a complicated functional form (given in text — never mind!)

116
The normal distribution: the pdf

There is a different normal curve for each set of parameters µ and


σ², which we label as N(µ, σ²).
Example 1: N(0, 1), N(1, 1), N(2, 1)

Example 2: N(0, 1), N(0, 4), N(0, 9)

117
The standard normal distribution and its CDF

The specific normal curve, N(0, 1), is called the standard normal
distribution, and the corresponding random variable is denoted by
Z. We write
Z ∼ N(0, 1).

The CDF of the standard normal distribution is

F(z) = P(Z ≤ z),
and is given by the area under the standard normal curve to the left
of z. These probabilities are listed in Table C.3 for a large number
of values of z.
118
The standard normal distribution: Determining prob-
abilities

We can use Table C.3 and knowledge of the symmetry of the stan-
dard normal curve around 0 to obtain the probabilities of many events
involving Z.

• P(Z < 0) =

• P(Z > 0) =

• P(Z < 1.76) =

• P(Z > 1.76) =

• P(Z < 0.62) =

119
The standard normal distribution: Determining prob-
abilities
• P(−0.39 < Z < 1.64) =

• P(0.50 < Z < 1.50) =

• P(|Z| < 1.00) =

• P(|Z| < 2.00) =

• P(|Z| < 3.00) =

Drawing a picture is always a good idea.

120
The standard normal distribution: Inverse problems

We can also solve “inverse” problems, where we are given the prob-
ability and asked to find a z-value. For example, find z such that:

• P(Z < z) = 0.9750

• P(Z > z) = 0.7486

• P(0 < Z < z) = 0.3749

• P(−z < Z < z) = 0.9010

121
The normal distribution: Standardization

In practice, biological variables rarely have a standard normal distri-


bution, but may instead have some other normal distribution. How
do we determine probabilities of events of interest in this case?
Example: Suppose that diastolic blood pressures of hypertensive
women are normally distributed, with mean 100 mm Hg and stan-
dard deviation 16 mm Hg. What is the probability that the diastolic
blood pressure of a randomly selected hypertensive woman is less
than 90 mm Hg?
We want P(X < 90) where X ∼ N(100, 16²). We use standardiza-
tion to convert this type of problem to an equivalent one involving
Z.

122
The normal distribution: Standardization

Standardization requires subtracting µ from X and dividing the re-
sult by σ, and when we do so, the resulting random variable has a
standard normal distribution, i.e.

(X − µ)/σ ∼ N(0, 1).

Applied to the previous example, we have

P(X < 90) = P( (X − 100)/16 < (90 − 100)/16 ) = P(Z < −0.63) = 0.2643.
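A small Python sketch (my own) of this standardization step; it uses the standard normal CDF written via the error function rather than Table C.3, so the answer differs slightly from the table value because z = −0.625 is not rounded to −0.63:

    # P(X < x) = Phi((x - mu) / sigma), where Phi is the standard normal CDF.
    import math

    def phi(z):
        """Standard normal CDF via the error function (in place of Table C.3)."""
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    def normal_cdf(x, mu, sigma):
        return phi((x - mu) / sigma)

    # Blood-pressure example from the notes: X ~ N(100, 16^2).
    print(round(normal_cdf(90, 100, 16), 4))   # about 0.266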

123
The normal distribution: Standardization

We can also do “inverse” problems like the following: 10% of the


population of hypertensive women have a diastolic blood pressure
above what level?
We want a number (in units mm Hg), say c, such that P(X > c) =
0.10.

124
Normal approximation to the binomial distribution

In addition to being useful for determining probabilities involving


normal random variables, the normal CDF is useful for approximat-
ing probabilities involving binomial random variables in certain cir-
cumstances. The circumstances are:

np > 5, n(1 − p) > 5.

In place of the binomial CDF with parameters n and p, we use the
normal CDF with mean µ = np and variance σ² = np(1 − p).

125
Normal approximation to the binomial distribution

Also, since we are approximating a discrete distribution with a con-


tinuous one, we often employ a continuity correction factor. For a
random variable whose possible values are all integers within some
interval, this involves subtracting 0.5 from x in events of the form
(X < x), and adding 0.5 to x in all events of the form (X ≤ x). So
instead of finding P(X < x) we find P(X < x − 0.5), and instead of
finding P(X ≤ x) we find P(X < x + 0.5).
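A minimal Python sketch (my own) of the normal approximation with this continuity correction; the check values n = 20 and p = 0.4 are hypothetical, chosen only so that np > 5 and n(1 − p) > 5:

    # Normal approximation to a binomial CDF with a continuity correction.
    import math

    def phi(z):
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    def binom_cdf_normal_approx(d, n, p):
        """Approximate P(X <= d) by P(Y < d + 0.5) with Y ~ N(np, np(1-p))."""
        mu, sigma = n * p, math.sqrt(n * p * (1 - p))
        return phi((d + 0.5 - mu) / sigma)

    n, p, d = 20, 0.4, 7                      # hypothetical illustration values
    exact = sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(d + 1))
    print(round(binom_cdf_normal_approx(d, n, p), 4), round(exact, 4))   # ~0.4097 vs ~0.4159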

126
Normal approximation to the binomial distribution

Example: Recall the Caribbean trip you take, on which you are bit-
ten by 120 mosquitos. Suppose that each time you are bitten by a
mosquito on the trip, the probability that the mosquito is carrying
the Zika virus is .10. What is the probability that at least one of the
120 bites will be from a Zika virus carrier?

127
Some practice combining concepts of Chapters 2 & 3

Example 1: The proportion of U.S. residents that are lactose intoler-


ant is believed to be about 0.10. If this is correct and we randomly
sample 7 people from the U.S. population, (a) what is the probabil-
ity that fewer than 3 and an odd number of people in the sample are
lactose intolerant?

(b) What is the probability that fewer than 3 or an odd number of


people in the sample are lactose intolerant?

128
(c) What is the probability that fewer than 3 people in the sample are
lactose intolerant, given that an odd number of people in the sample
are lactose intolerant?

(d) What is the probability that an odd number of people in the sam-
ple are lactose intolerant, given than fewer than 3 people in the sam-
ple are lactose intolerant?

129
Example 2: Suppose Z ∼ N(0, 1). (a) Find P(0.50 < Z < 1.50 ∩ Z <
1.00).

(b) Find P(0.50 < Z < 1.50|Z < 1.00).

130
(c) In a random sample of size 6 from a population whose distribu-
tion is N(0, 1), what is the probability distribution of X, where X
is defined as the number of observations in the sample that are less
than 1.00?

131
Sampling distributions: Introduction

Now we begin to wed the concepts of sample statistics (Chapter 1)


and probability distributions (Chapters 2 and 3).
Recall that we take a (random) sample from a population of inter-
est because we want to estimate some parameter of that popula-
tion; we use the sample-based estimate (e.g. X) to make inferences
about the population parameter (e.g. µ). We want to be able to say
something about how close our estimate is to the parameter (e.g.
P(|X − µ| < .01) = ?).
To do so, we need to know the sampling distribution, i.e. the proba-
bility distribution of the sample-based estimate.

132
Sampling distributions: Introduction

Suppose the population of interest is newborn infants in Iowa; the


random variable of interest is their weight at birth, X; and the pa-
rameter of interest is their mean, µ. Suppose we propose to take
a random sample X1 , X2 , . . . , Xn (of size n) from the population and
compute the sample mean, X, as an estimate of µ. What are the ran-
dom experiment and random variable here?

If you took a random sample of size n and I did likewise, would we


be likely to obtain the same value of X?
Thus X is a random variable in its own right, and it has a probability
distribution. What is this probability distribution?

133
Sampling distribution of X: A discrete example

Suppose X is a discrete random variable with the following pdf:


x f (x)
1 .4
2 .1      µ_X = 2.40, σ_X² = 1.64
3 .2
4 .3

Now think of the x’s as the values of objects in some very large pop-
ulation, and the f (x)’s as their corresponding relative frequencies.
Imagine taking a random sample of size 2 from this population with
replacement, and let X represent the mean of this sample. Then the
probability distribution of X is as follows:

134
Sampling distribution of X: A discrete example

x f (x)
1.0 (.4)(.4)=.16
1.5 (.4)(.1)+(.1)(.4)=.08
2.0 (.4)(.2)+(.2)(.4)+(.1)(.1)=.17
µ_X̄ = 2.40, σ_X̄² = 0.82
2.5 (.4)(.3)+(.3)(.4)+(.1)(.2)+(.2)(.1)=.28
3.0 (.1)(.3)+(.3)(.1)+(.2)(.2)=.10
3.5 (.2)(.3)+(.3)(.2)=.12
4.0 (.3)(.3)=.09

135
Sampling distribution of X: A discrete example

If we take a random sample of size 3 (rather than 2), then the pdf of
X is:
x f (x)
1.0 .064
1.3̄ .048
1.6̄ .108
2.0 .193
2.3̄ .126      µ_X̄ = 2.40, σ_X̄² = 0.546
2.6̄ .165
3.0 .152
3.3̄ .063
3.6̄ .054
4.0 .027

136
Sampling distribution of X: A discrete example
[Three bar graphs of the sampling distribution of the sample mean for increasing sample sizes: horizontal axis from 0 to 5, vertical axis (relative frequency) from 0.0 to 0.5]

137
Sampling distribution of X: A discrete example

If we take a random sample of size 100, then the pdf of X is:

µ_X̄ = 2.40, σ_X̄² = 0.0164, and the pdf is very similar to that of a


normal distribution.

138
Sampling distribution of X: A discrete example

As the sample size increases, how does the distribution of X behave?

• It has the same mean as the distribution of X, regardless of n.

• Its variance is smaller than the variance of X, and keeps get-


ting progressively smaller.

• It becomes progressively more bell-shaped (more normal).

The behavior of the sampling distribution of X seen in the particular


example above also occurs much more generally. The general result
is known as the Central Limit Theorem.

139
The Central Limit Theorem (CLT)

When taking a random sample of size n from a population with any


probability distribution having mean µ_X and variance σ_X², the distri-
bution of X̄:

1. has mean µ_X ;

2. has variance σ_X²/n;

3. becomes more and more like a normal distribution as n in-


creases. (If the population you’re sampling from is normal,
then the distribution of X is exactly normal.)

Amazing!

140
The Central Limit Theorem (CLT)

We can summarize the CLT as follows:

X̄ ∼ N(µ, σ²/n), approximately, for n sufficiently large.

Or equivalently,

(X̄ − µ) / (σ/√n) ∼ N(0, 1), approximately, for n sufficiently large.

The quantity σ/√n is called the standard error of the mean.
Note: as the sample size increases, the standard error of the mean
decreases.
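A small simulation sketch in Python (my own) that illustrates these statements, drawing samples from the discrete population used in the example a few slides back (values 1–4 with probabilities .4, .1, .2, .3):

    # Monte Carlo illustration of the CLT: the sample mean has mean mu_X,
    # variance sigma_X^2 / n, and an increasingly normal-looking distribution.
    import random
    from statistics import mean, pvariance

    random.seed(1)
    values, probs = [1, 2, 3, 4], [0.4, 0.1, 0.2, 0.3]

    def xbar(n):
        return mean(random.choices(values, weights=probs, k=n))

    for n in (2, 3, 100):
        sims = [xbar(n) for _ in range(20000)]
        print(n, round(mean(sims), 3), round(pvariance(sims), 4))
    # Expect means near 2.40 and variances near 1.64/n (0.82, 0.547, 0.0164).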

141
The Central Limit Theorem in practice

As a consequence of the CLT, we can approximate probabilities of


events involving X using a normal distribution, provided the sample
size (n) is sufficiently large.
How large must n be to safely use the approximation?

• If the sampled population is normal, then any n is OK.

• If the sampled population is non-normal but symmetric, then


n ≥ 5 suffices.

• If the sampled population is not symmetric, then n ≥ 25 suf-


fices in all but some pathological cases that usually don’t occur
in practice.

142
The CLT in practice: Examples

Suppose that the birth weights of Iowa infants born at gestational


age 40 weeks are approximately normally distributed with mean µ =
3500 g and standard deviation σ = 430 g.

• What is the (approximate) probability that the birth weight of


an infant randomly selected from this population is less than
3000 g?

143
The CLT in practice: Examples
• What is the (approximate) probability that the average birth
weight of 5 infants randomly selected from this population is
less than 3000g?

144
The CLT in practice: Examples
• What birth weight (approximately) cuts off the lower 5% of
the distribution of this population’s birth weights?

145
The CLT in practice: Examples
• What birth weight (approximately) cuts off the lower 5% of
the distribution of sample mean birth weights based on sam-
ples of size 5 drawn from this population?

146
Interval estimation: Introduction

A point estimate of a population parameter is a single number, com-


puted from a sample, believed to be close to the parameter. Exam-
ples:

• X is a point estimate of µ

• s2 is a point estimate of s 2

An interval estimate, or confidence interval, is an interval of num-


bers that is computed from a sample in such a way that the interval
has a prespecified probability of containing the population parame-
ter.

147
Confidence interval for µ: Derivation

Suppose a random sample is taken from a population with unknown mean µ and known variance σ², and n is large enough for the CLT to apply. Then, approximately,

P( −2.58 ≤ (X̄ − µ)/(σ/√n) ≤ 2.58 ) = 0.99.

(The interval from −2.58 to +2.58 captures the middle 99% of the standard normal distribution.)
Algebraic manipulations to get µ by itself:

148
Confidence interval for µ: Derivation

So the probability that the random interval

(X̄ − 2.58·σ/√n, X̄ + 2.58·σ/√n)

contains µ is 0.99. When we substitute the observed sample mean for X̄, we call the interval a 99% confidence interval for µ and we say we are 99% confident that µ lies within this interval.
Note that the width of the interval is not random: it =

149
Confidence interval for µ: Example

Ten insomniac patients were given two sleep-inducing drugs, say


drugs A and B, on separate occasions when they were having trou-
ble sleeping. The additional hours of sleep gained by using drug B
instead of drug A, for the ten patients, are given below (in hours):

1.2, 2.4, 1.3, 1.3, −1.0, 3.8, 0.0, 0.8, 4.6, 1.4

For these data, X̄ = 1.58 hrs. Assume that σ = 1.5 hrs and that the distribution of sleep gain (B over A) is symmetric. Find an approximate 99% confidence interval for µ, the mean additional hours of sleep gained by using drug B rather than drug A among all insomniacs.
Answer:
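(One possible calculation, offered as a sketch rather than the official answer: with σ = 1.5 known, the 99% CI is X̄ ± 2.58·σ/√10 = 1.58 ± 2.58(1.5)/√10 = 1.58 ± 1.22, i.e. approximately (0.36, 2.80) hours.)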

150
Confidence interval for µ: Levels of confidence

There is nothing special about 99% as a level of confidence; if we


wish, we can obtain a confidence interval for µ with greater or lesser
level of confidence.
In general, if 0 < α < 1, a 100(1 − α)% confidence interval for µ is given by

(X̄ − z_{1−α/2}·σ/√n, X̄ + z_{1−α/2}·σ/√n)

where z_{1−α/2} cuts off the upper α/2 of the N(0, 1) distribution.

Confidence level    α     z_{1−α/2}
90%                .10     1.645
95%                .05     1.96
99%                .01     2.58

151
Confidence interval for µ: Levels of confidence

Insomniac example: A 95% CI for µ is

Is this interval narrower or wider than the 99% CI computed on page


150 of these notes?
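(A sketch of the comparison, assuming the same X̄ = 1.58 and σ = 1.5: the 95% CI is 1.58 ± 1.96(1.5)/√10 = 1.58 ± 0.93, i.e. about (0.65, 2.51) hours, which is narrower than the 99% interval because the lower confidence level uses a smaller multiplier, 1.96 < 2.58.)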

152
Confidence interval for µ: Unknown s

Up to now, we've assumed (unrealistically!) that we know σ. In practice, we generally don't. So how do we obtain a CI when we don't know σ?
A natural idea: replace σ in the CI formula with s, its point estimate, obtained from the same sample used to obtain the point estimate X̄ of µ:

(X̄ − z_{1−α/2}·s/√n, X̄ + z_{1−α/2}·s/√n)

While this is not completely off-base, it is not quite right, because the distribution of (X̄ − µ)/(s/√n) is not standard normal. Instead, it has what is called a t distribution with n − 1 degrees of freedom.

153
Diversion: The t distributions
• Like the N(0, 1), the t distributions are continuous, symmetric, and bell-shaped, with mean 0.

• A different t distribution exists for each sample size, n, or equivalently for each degrees of freedom, n − 1.

• The t distribution with any finite degrees of freedom is more spread out than the N(0, 1).

• The larger that n (or n − 1) is, the more closely the t distribution resembles the N(0, 1).

• The t distributions are tabled in Table C.4.

154
Confidence interval for µ: Unknown s

So the proper expression for a 100(1 − α)% CI for µ when σ is unknown is

(X̄ − t_{1−α/2, n−1}·s/√n, X̄ + t_{1−α/2, n−1}·s/√n)

where t_{1−α/2, n−1} is a number (from Table C.4) that cuts off the upper α/2 of the probability of the t distribution with n − 1 degrees of freedom.
This is an exact 100(1 − α)% CI when the random sample is drawn from a normally-distributed population. If the population is not normally distributed, then the interval given above is an adequate approximate 100(1 − α)% CI provided that the sample size is sufficiently large (see p. 142 for how large is "large enough").
155
Confidence interval for µ: Example with unknown s

The insomniac example, revisited: Ten insomniac patients were given


two sleep-inducing drugs, say drugs A and B, on separate occasions
when they were having trouble sleeping. The additional hours of
sleep gained by using drug B instead of drug A, for the ten patients,
are given below (in hours):

1.2, 2.4, 1.3, 1.3, −1.0, 3.8, 0.0, 0.8, 4.6, 1.4

For these data, X̄ = 1.58 hrs and s = 1.66 hrs. Assume that the distribution of sleep gain (B over A) is symmetric. Find an approximate 99% confidence interval for µ, the mean additional hours of sleep gained by using drug B rather than drug A among all insomniacs.
Answer:
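(One possible calculation, offered as a sketch rather than the official answer: from Table C.4, t_{.995, 9} = 3.250, so the 99% CI is 1.58 ± 3.250(1.66)/√10 = 1.58 ± 1.71, i.e. about (−0.13, 3.29) hours. Note that this interval is wider than the known-σ interval on page 150 and now includes 0.)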

156
Confidence interval for µ: Factors affecting width

The narrower the confidence interval, the more precisely we've pinned down µ. The width of a 100(1 − α)% confidence interval for µ is

What factors affect width?

• Level of confidence

• Sample standard deviation

• Sample size

157
Confidence interval for σ²: Introduction

Now suppose we are interested in an interval of plausible values for not (or not only) the population mean, but for the population variance, σ². We've already introduced a point estimate for σ², namely the sample variance s².
By analogy with the CI for the mean, we need to find a mathematical expression for a random variable that contains σ² and has a known probability distribution. In theoretical statistics courses, it is proved that if we take a random sample of size n from a normal distribution, then the quantity

(n − 1)s²/σ²

has a known probability distribution called the chi-square distribution (with n − 1 degrees of freedom).
158
Diversion: The chi-square (χ²) distribution
• a probability distribution for a continuous random variable

• the pdf is positive over the interval (0, ∞)

• the pdf is asymmetrically bell-shaped (right-skewed)

• there is a different chi-square distribution for each value of a parameter called the degrees of freedom (like the t distribution in this respect). We label each such distribution χ²_df.

• The CDF of χ²_df is tabled in Table C.5 for df = 1, . . . , 60.

159
Confidence interval for σ²: Derivation

Write χ²_{n−1, α/2} and χ²_{n−1, 1−α/2} for the values that cut off 100(α/2)% from each tail of the χ²_{n−1} distribution (leaving 100(1 − α)% of the distribution in the middle). Then

P( χ²_{n−1, α/2} ≤ (n − 1)s²/σ² ≤ χ²_{n−1, 1−α/2} ) = 1 − α

which can be manipulated algebraically (see textbook) to yield

P( (n − 1)s²/χ²_{n−1, 1−α/2} ≤ σ² ≤ (n − 1)s²/χ²_{n−1, α/2} ) = 1 − α.

160
Confidence interval for σ²: Formula

Thus, a 100(1 − α)% confidence interval for σ², assuming the randomly sampled population is normally distributed, is given by

( (n − 1)s²/χ²_{n−1, 1−α/2} , (n − 1)s²/χ²_{n−1, α/2} ).

Observe that this interval is not of the form "s² ± something". So s² is not the midpoint of the CI.

161
Confidence interval for σ²: Example

The lifespans of a population of men having a certain gene are nor-


mally distributed. A random sample of size 16 from this population
had an average lifespan of 81.2 years with a standard deviation of
8.0 years. Determine a 95% confidence interval for the variance of
lifespans for this population of men.
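(One possible calculation, offered as a sketch rather than the official answer: here n = 16 and s² = 64, and from Table C.5, χ²_{15, .025} ≈ 6.26 and χ²_{15, .975} ≈ 27.49, so the 95% CI for σ² is ( 15(64)/27.49, 15(64)/6.26 ) = (960/27.49, 960/6.26) ≈ (34.9, 153.3) years².)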

162
Confidence interval for a proportion: Introduction

Now suppose the variable of interest for the population of interest


is categorical, and even more specifically, dichotomous; that is, the
values of the variable are coded as either a 1 (for “success,” i.e. pos-
sessing the characteristic of interest) or 0 (for “failure,” i.e. not pos-
sessing the characteristic of interest). In this case the population
parameter of interest is p, the proportion of the population having
the characteristic of interest.
From a random sample X_1, X_2, . . . , X_n drawn from the population, we estimate p by the sample proportion of successes,

p̂ = (# successes)/n = (Σ_{i=1}^n X_i)/n = X̄.

163
Confidence interval for a proportion: Sampling dis-
tribution of p̂

Since p̂ is a sample mean, the CLT gives its approximate sampling distribution. Furthermore, recall that the expressions for the mean and variance of a binomial random variable (with parameters n and p) are np and np(1 − p), respectively. Thus the mean and variance of p̂ are p and p(1 − p)/n, respectively. Thus by the CLT,

p̂ ∼ N( p, p(1 − p)/n ) (approximately), for n sufficiently large.

Here, "sufficiently large" means np ≥ 5 and n(1 − p) ≥ 5.
Standardizing yields

(p̂ − p)/√(p(1 − p)/n) ∼ N(0, 1) (approximately).
164
Confidence interval for a proportion

Using the result at the bottom of the previous page, we can derive the following approximate 100(1 − α)% CI for p:

( p̂ − z_{1−α/2}·√(p(1 − p)/n), p̂ + z_{1−α/2}·√(p(1 − p)/n) ).

But notice that the standard error of p̂ depends on p, which is unknown! To fix this, we simply replace p with p̂ in the expression for the standard error of p̂, yielding the following approximate 100(1 − α)% CI for p:

( p̂ − z_{1−α/2}·√(p̂(1 − p̂)/n), p̂ + z_{1−α/2}·√(p̂(1 − p̂)/n) ).

165
Confidence interval for a proportion: Example

In 2001-02, 272 deer were legally killed by hunters in the Mount


Horeb area of SW Wisconsin. From tissue sample analysis, it was
determined that 9 of the deer had chronic wasting disease (a disease
similar to mad cow disease). Determine a 95% confidence interval
for the proportion of the entire deer population in this area of Wis-
consin that had chronic wasting disease in 2001-02.
Answer:

Any issues with the assumptions?
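(One possible calculation, offered as a sketch rather than the official answer: p̂ = 9/272 ≈ 0.033, so the 95% CI is 0.033 ± 1.96·√(0.033(0.967)/272) = 0.033 ± 0.021, i.e. roughly (0.012, 0.054). Regarding assumptions: n(1 − p̂) is large, but np̂ = 9 is only slightly above 5, so the normal approximation is marginal; one might also question whether hunter-killed deer form a random sample from the deer population.)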

166
Confidence interval for a proportion: Sample size con-
siderations

The width of the 100(1 − α)% CI for p is

2·z_{1−α/2}·√(p̂(1 − p̂)/n)

and the margin of error, defined as half the width of this CI, is

z_{1−α/2}·√(p̂(1 − p̂)/n).

We can use these expressions (with some variations) to determine the sample size necessary for the width or margin of error of this CI to be less than a pre-specified value.

167
Confidence interval for a proportion: Sample size con-
siderations

Example: Suppose we wish to obtain a 95% CI for the proportion of lefthanded people in the U.S., and we want this CI to be of width .02 or less (equivalently, we want the CI's margin of error to be .01 or less). Using the formula above, we want to choose n such that

2(1.96)·√(p̂(1 − p̂)/n) ≤ .02,

or equivalently,

n ≥ 4(1.96)² p̂(1 − p̂) / .02².

168
Confidence interval for a proportion: Sample size con-
siderations

But what do we use for p̂? Three possibilities:

• Prior guess. If we guess that the proportion of lefthanders is about 10%, then we have

n ≥ 4(1.96)²(0.1)(0.9)/.02² = 3457.44.

• Pilot study. Suppose we sample 200 people to start with, and 32 of them are lefthanded. Then our provisional p̂ is 32/200 = 0.16; using this we obtain

n ≥ 4(1.96)²(0.16)(0.84)/.02² = 5163.11.
169
Confidence interval for a proportion: Sample size con-
siderations
• Conservative approach. Note that the function f(x) = x(1 − x) is maximized over 0 < x < 1 at x = 0.5; so replace p̂ with 0.5:

n ≥ 4(1.96)²(0.5)(0.5)/.02² = 9604.
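A small Python sketch (not part of the notes; the function name is mine) that reproduces these three sample-size calculations from the margin-of-error formula:

import math

def n_for_margin(z, p_hat, margin):
    """Smallest n with z * sqrt(p_hat * (1 - p_hat) / n) <= margin."""
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)

print(n_for_margin(1.96, 0.10, 0.01))   # prior guess: 3458
print(n_for_margin(1.96, 0.16, 0.01))   # pilot study: 5164
print(n_for_margin(1.96, 0.50, 0.01))   # conservative: 9604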

170
Interpretations of confidence intervals

There are correct interpretations of confidence intervals, and there


are incorrect interpretations. Consider a 100(1 − α)% confidence
interval for a population mean µ, based on a random sample of size
n. (Similar comments apply to a confidence interval for p.) Suppose
the computed interval is (117.3,126.9).
Correct interpretations:

• If we were to repeatedly take random samples of size n from the population, on average 100(1 − α)% of the CI's so constructed would contain µ.

• We are 100(1 − α)% confident that (117.3, 126.9) contains µ.

171
Interpretations of confidence intervals

Incorrect interpretations:

• The probability that (117.3, 126.9) contains µ is 1 − α.

• 100(1 − α)% of the population's values lie in (117.3, 126.9).

• 100(1 − α)% of the sample means (based on samples of size n) lie in (117.3, 126.9).

172
Hypothesis testing: Introduction

Confidence intervals are one of the two main currencies of statistical


inference. The other is hypothesis testing.
Briefly, statistical hypothesis testing is a procedure for testing the
validity of a claim about a population parameter by evaluating how
compatible, probabilistically speaking, it is with a relevant statistic
computed from a random sample.
Example: An investigator doesn’t know the population mean body temperature of
African elephants, but using current knowledge of the physiology, surface area,
and weight of African elephants, together with current theory of how these things
affect body temperature, he hypothesizes that it is 96.0 F. If the mean of a random sample of 50 African elephants is 96.2 F, does this cast doubt on whether the current theory is applicable to African elephants?

173
Hypothesis testing: Null and alternative hypotheses

We’ve answered a few questions similar to this in an informal way


(e.g. Problems 1.4b, 4.23b). Now, however, we formalize things so
that we can test hypotheses in a consistent, scientifically valid way.
We consider only those situations in which there are two mutually
exclusive and exhaustive hypotheses:

1. Null hypothesis, H0 . This is the default, or status quo, claim


about a population parameter.

2. Alternative hypothesis, Ha . Also called the research hypothe-


sis, this is the scientific investigator’s claim about the popula-
tion parameter.

174
Hypothesis testing: Null and alternative hypotheses
• The investigator must specify these two hypotheses according
to the problem at hand and his/her goals.
• Depending on their goals, one investigator’s H0 may be differ-
ent than another investigator’s H0 , even for the same problem.
• The burden of proof is always on the investigator to provide
strong evidence that the null hypothesis is false (this is consis-
tent with the scientific method).

Example 1 (not biological): A murder is committed, and a suspect is arrested and put on trial.
The jury’s H0 :
The jury’s Ha :
175
Hypothesis testing: Null and alternative hypotheses

Example 2: Mean body temperature of African elephants. Let µ rep-


resent the mean body temperature for the population of elephants.
The researcher’s H0 :
The researcher’s Ha :

Example 3: A new treatment regimen for pancreatic cancer is devel-


oped by researchers. Let p represent the proportion of a conceptual
population of patients receiving the new treatment that will be alive
after 5 years, and suppose that this proportion for the population of
patients receiving the current best treatment is 0.046.
The researcher’s H0 :
The researcher’s Ha :
176
Hypothesis testing: Hypotheses about a population
mean

Hypotheses about a population mean are of 3 types:

• H0 : µ = µ0 versus Ha : µ ≠ µ0

• H0 : µ ≤ µ0 versus Ha : µ > µ0

• H0 : µ ≥ µ0 versus Ha : µ < µ0

In the first of these, Ha is two-sided; in the others, Ha is one-sided. Note that µ0 is always included in the null hypothesis.

177
Hypothesis testing: Type I and Type II errors

Based on the evidence, H0 will be either rejected or not rejected


(accepted). This decision can be right or wrong. There are two types
of correct decisions, and two types of wrong decisions (errors); the
latter are called Type I and Type II errors.
Example 1: Jury’s decision about murder suspect

178
Hypothesis testing: Type I and Type II errors

Example 2: Researcher’s decision about population mean body tem-


perature of African elephants

179
Hypothesis testing: Type I and Type II errors

Define:

α = Probability of making a Type I error,
β = Probability of making a Type II error.

• In statistical hypothesis testing, we have the ability to set α at some relatively small number, traditionally at .05 or .01; then we take what we get for β (generally, the smaller we make α, the larger that β gets).
• Some situations may call for a larger or smaller α, e.g. desperate health situations.
• α is also called the level of significance.

180
Hypothesis testing: Power

The power of a test is the probability, using that test, that we will
reject H0 when it is false.

• Thus, Power = 1 − β.

• High power is a good thing!

• Some tests have higher power than others, but often at the
price of more restrictive assumptions

181
Hypothesis testing: Six-step procedure
1. Formulate H0 and Ha , based on the scientific question of in-
terest.

2. Choose a level of significance, α, based on the relative importance of Type I and Type II errors in the given situation.

3. Choose an appropriate test statistic for the problem and com-


pute it. We generally choose the most powerful test statistic
available, provided that the assumptions required for its valid-
ity are satisfied.

4. (a) Determine a critical value(s), using a table, to which the


test statistic’s value will be compared; OR
(b) Determine the P value, using a table, to compare to α.

182
Hypothesis testing: Six-step procedure
5. (a) If the test statistic is more extreme than the critical value(s),
then reject H0 ; otherwise do not reject H0 . OR
(b) If the P value is less than α, reject H0 ; otherwise do not reject H0.

6. Express your conclusion as an answer to the scientific question


of interest.

183
Test statistics for hypotheses about a population mean

A test statistic is a quantity, computable from a random sample,


which measures the discrepancy between what the data say about
the population parameter’s value and what H0 claims the population
parameter’s value is.
For testing hypotheses about a population mean µ, this discrepancy can (initially) be measured by

X̄ − µ0.

However, the "extremeness" of any particular value of this discrepancy cannot be judged until it is scaled by the inherent variability of the data. So for our test statistic we scale this discrepancy by the variability of X̄, or an estimate thereof.

184
Test statistics for hypotheses about a population mean
According to the CLT, the standard error of X̄ is σ/√n. So, if σ is known to us, we may use as our test statistic the "z-statistic"

Z = (X̄ − µ0)/(σ/√n).

• Dividing by σ/√n calibrates the discrepancy between X̄ and µ0 in units of standard error.

• Note that if µ really does equal µ0 (i.e. if H0 is true), and if n is sufficiently large, then by the CLT the distribution of the z-statistic is approximately N(0, 1).

185
Test statistics for hypotheses about a population mean

If σ is not known (which is almost always the case in practice), then we may replace it with s in the z-statistic, yielding as a test statistic the "t-statistic"

t = (X̄ − µ0)/(s/√n).

• The discrepancy between X̄ and µ0 is calibrated in units of estimated standard error.

• If µ really does equal µ0 (i.e. if H0 is true), and if n is sufficiently large, then the distribution of the t-statistic is approximately t with n − 1 degrees of freedom.

186
Critical values for testing hypotheses about a popula-
tion mean

On the previous two pages, we noted what the distributions were,


under H0 , for the z and t test statistics. Tables of these distributions
are where we go to find critical values.
For the z-statistic, critical values are:

• ±z_{1−α/2} if Ha is two-sided

• z_{1−α} if Ha is µ > µ0

• −z_{1−α} (that is, z_α) if Ha is µ < µ0

187
Critical values for testing hypotheses about a popula-
tion mean

For the t-statistic, critical values are:

• ±t_{1−α/2, n−1} if Ha is two-sided

• t_{1−α, n−1} if Ha is µ > µ0

• −t_{1−α, n−1} (that is, t_{α, n−1}) if Ha is µ < µ0

If our computed test statistic is more extreme than the critical value,
we reject H0 ; otherwise, we do not reject H0 .

188
Hypothesis testing for a population mean: Example

A researcher wanted to test the hypothesis that the mean body tem-
perature of African elephants was 96.0 F. He has no prior notion
about which direction the mean body temperature of elephants will
differ from 96.0 F if it is not equal to 96.0.

• Step 1. So, letting µ represent the mean body temperature of


African elephants, he wants to test
H0 : µ = 96.0 versus Ha : µ ≠ 96.0.
• Step 2. He makes a traditional choice of a = .05.

A random sample of 50 African elephants is taken, from which the


sample mean and sample standard deviation were computed as fol-
lows: X̄ = 96.2, s = 0.63.
189
Hypothesis testing for a population mean: Example
• Step 3. Since the population standard deviation is not known,
but the sample size is sufficiently large, he chooses the t-statistic
as our test statistic and computes it as follows:

• Step 4. Critical values are ±t.975,49 = ±2.010.

• Step 5. Since the computed test statistic is more extreme than


the critical values, he rejects H0 .

• Step 6. He concludes that the population mean body tempera-


ture of African elephants is not equal to 96.0 F.
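(A hedged worked sketch of Steps 3-5, not the official answer key: t = (96.2 − 96.0)/(0.63/√50) = 0.2/0.0891 ≈ 2.24, which is more extreme than ±2.010, so H0 is rejected; the corresponding two-sided P value, P(t_49 > 2.24) + P(t_49 < −2.24), is roughly 0.03, less than α = .05.)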

190
Hypothesis testing for a population mean: Example

In this example, we implemented Steps 4a and 5a. Alternatively, we


could have used the “P value approach” of Steps 4b and 5b.
The P value of a computed test statistic is the probability, under H0 ,
that if we repeated the experiment we would get a computed test
statistic as extreme or more extreme than the one we got for the
experiment we actually did.

• Step 4b. Computed test statistic = , so

P value = P(t49 > ) + P(t49 < )=

• Step 5b. Since P value < a, we reject H0 .

191
Hypothesis testing: More on P values
• The P value approach to HT is equivalent to the critical value
approach; they always yield the same decision about H0 .

• The P value is a measure of the strength of evidence against


H0 that the data provides: a smaller P value corresponds to
stronger evidence against H0 .

• The P value may also be interpreted as the value of α at which we would go from rejecting H0 to not rejecting H0 , if we repeatedly retested our hypotheses at significance levels starting at α = 1 and decreasing α towards 0.

192
Hypothesis testing: Variations on the elephant exam-
ple

Suppose we change the significance level from .05 to .01. How does
the test change?

193
Hypothesis testing: Variations on the elephant exam-
ple

Suppose we return to using α = .05, but the investigator has reason to believe, or wants to show, that the population mean body temperature of African elephants is less than 96.0. Thus, he wants to test

H0 : µ ≥ 96.0 versus Ha : µ < 96.0.

How does the test change?

194
Equivalence between hypothesis tests and confidence
intervals

Consider testing

H0 : µ = µ0 versus Ha : µ ≠ µ0

at the α level of significance (and suppose σ is unknown). Then we will reject H0 if our test statistic t satisfies

t < −t_{1−α/2, n−1} or t > t_{1−α/2, n−1}.

Equivalently, we will not reject H0 if

−t_{1−α/2, n−1} ≤ (X̄ − µ0)/(s/√n) ≤ t_{1−α/2, n−1}.

195
Equivalence between hypothesis tests and confidence
intervals

Manipulating this last inequality to get µ0 by itself in the middle, we obtain the equivalent inequality

X̄ − t_{1−α/2, n−1}·s/√n ≤ µ0 ≤ X̄ + t_{1−α/2, n−1}·s/√n.

But observe that the endpoints of this interval coincide with the endpoints of a 100(1 − α)% confidence interval for µ!!
Thus, if µ0 is a value inside the 100(1 − α)% confidence interval for µ, we will not reject H0 ; otherwise we will reject H0.
Bottom line: If we don't care about reporting a P value, we can simply use the appropriate confidence interval to perform a two-sided test for a population mean.
196
Hypothesis testing for a population variance: Intro-
duction

Sometimes we wish to test two competing hypotheses about a population variance (rather than a mean).
Example: As a result of the manufacturing process for a particular drug, there is some variation in the actual dosage of the active ingredient included in each pill. Suppose that the variance in actual dosage is known to be 0.36 µg². A new manufacturing process is developed which is believed to reduce this variance. Let σ² represent the variance in the population of pills manufactured by the new process. The developer of the new process wants to test

H0 : σ² ≥ 0.36 versus Ha : σ² < 0.36.

197
Hypothesis testing for a population variance: Intro-
duction

More generally, hypotheses about a population variance are of 3 types:

• H0 : σ² = σ0² versus Ha : σ² ≠ σ0²

• H0 : σ² ≤ σ0² versus Ha : σ² > σ0²

• H0 : σ² ≥ σ0² versus Ha : σ² < σ0²

Note that a hypothesis about a population standard deviation is equivalent to a hypothesis about a population variance, e.g.

H0 : σ² = σ0²  ⇔  H0 : σ = σ0

198
HT for a population variance: Test statistic

Based on a random sample of size n from the population, our best estimate of σ² is the sample variance, s².
To measure the discrepancy between what the data say about the variance (best guess is s²) and what the null hypothesis claims about the variance (claimed to equal σ0²), we could use

s² − σ0².

However, it turns out that a better and more convenient measure of the discrepancy is the ratio, s²/σ0², and even more convenient is the scaled ratio,

(n − 1)s²/σ0².  (The book's denominator is σ².)

199
HT for a population variance: Test statistic

The further that s²/σ0² is from 1.0, the greater the discrepancy (and the stronger the evidence against H0 ). Equivalently, the further that

(n − 1)s²/σ0²

is from n − 1, the stronger the evidence is against H0. So this is our test statistic.

For critical values and P values, we use the fact that our test statistic has a chi-square distribution with n − 1 degrees of freedom when H0 is true.

200
HT for a population variance: Critical values

Suppose the significance level is α. Let χ²_{n−1, a} be the 100a-th percentile of the χ²_{n−1} distribution, i.e.

P( χ²_{n−1} < χ²_{n−1, a} ) = a.

Then:

• if Ha is two-sided, we reject H0 if

• if Ha is Ha : σ² > σ0², we reject H0 if

• if Ha is Ha : σ² < σ0², we reject H0 if

201
HT for a population variance: P values

If we wish to use the P value approach to HT instead, then we com-


pute the P value as follows:

• if Ha is two-sided, P value =

• if Ha is Ha : σ² > σ0², P value =

• if Ha is Ha : σ² < σ0², P value =

202
HT for a population variance: Example

A healthy lifestyle undoubtedly plays a role in longevity, but so does genetic makeup. Recent studies
have linked large cholesterol particles to longevity. A variant of a gene called CETP encoding the cholesteryl ester transfer protein apparently causes the formation of large cholesterol particles. In
a particular population the life spans for males are normally distributed with a mean of 74.2 yrs and
a standard deviation of 10.0 yrs. A sample of 16 males in this population that had the variant CETP
gene lived an average of 81.2 yrs with a standard deviation of 8.0 yrs. Does this establish that CETP
variant carriers are significantly less variable in their life spans than the general population?

203
Nonparametric methods for hypothesis testing

The HT methods we’ve learned so far require either (a) the sampled
population to be normally distributed, or (b) the sample size to be
large enough for the CLT to “steer” the distribution of the test statis-
tic sufficiently close to its reference distribution (Z, t, χ²). What if
neither (a) nor (b) is satisfied?
In that case, we use alternative HT methods called nonparametric or
distribution-free methods. Though more widely applicable than the
parametric methods already learned, they are not as powerful.

204
The sign test: Introduction

The first nonparametric HT method we learn is the sign test.

• tests hypotheses on the median, M, of a continuous population

• based on the idea that if M = M0 , then roughly half of the


observations in a random sample drawn from the population
should be greater than M0 , and half less than M0

• If too few observations in the sample lie to one side of M0 , that


suggests that M is not equal to M0 .

205
The sign test: Hypotheses and test statistics

The hypotheses to be tested are one of three types:

• H0 : M = M0 versus Ha : M ≠ M0
• H0 : M ≤ M0 versus Ha : M > M0
• H0 : M ≥ M0 versus Ha : M < M0

The test statistic is one or the other of:

S− = # of observations in the sample less than M0 ,
S+ = # of observations in the sample greater than M0 .

If any observations in the sample equal M0 , they are ignored; n excludes them also (so here n represents a possibly reduced sample size).
206
The sign test: Test statistics and reference distribu-
tion

Based on a random sample of size n, we compute S− and S+ . Now, each of these is a binomial random variable; in fact, if M = M0 then

So our reference distribution is bin(n, 1/2). If tabled critical values were available for this distribution (for the commonly used levels of significance) we would use them, but they aren't. So we use a P value testing approach instead.

207
The sign test: P value testing approach

We use a P value approach to make our decision about the hypothe-


ses. Three cases:

• If Ha is M > M0 , we reject H0 if S− is too small, i.e. if

P( bin(n, 1/2) ≤ S− ) < α.

• If Ha is M < M0 , we reject H0 if S+ is too small, i.e. if

P( bin(n, 1/2) ≤ S+ ) < α.

• If Ha is two-sided, we reject H0 if min(S−, S+) is too small, i.e. if

2·P( bin(n, 1/2) ≤ min(S−, S+) ) < α.

208
The sign test: P value testing approach

How do we get the binomial probabilities needed to do the test?

• If 5 ≤ n ≤ 20, we can use the column corresponding to p = 0.5 in Table C.1 directly.

• If n > 20, we invoke the CLT to approximate the bin(n, 1/2) distribution by the N(n/2, n/4) distribution:

P( bin(n, 1/2) ≤ S ) ≈ P( Z ≤ (S + 0.5 − n/2) / √(n/4) ).

(Here S is either S− or S+ , depending on which one is applicable.)

209
The sign test: Example 1 (Problem 6.32 in text)

Abuse of substances containing toluene (for example, various glues) can produce neurological symp-
toms. In an investigation of the mechanism of these toxic effects, researchers measured the concentra-
tions of certain chemicals in the brains of rats who had been exposed to a toluene-laden atmosphere.
The concentrations (ng/gm) of the brain chemical norepinephrine in the medulla region of the brain of
9 toluene-exposed rats was determined and recorded below:

543 523 431 635 564 580 600 610 550

Does the exposure to toluene significantly increase norepinephrine levels in rat medullas above the
normal median level of 530 ng/gm?
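(A hedged worked sketch, not the official answer: here M0 = 530 and n = 9; seven observations exceed 530 and two, 523 and 431, fall below it, so S+ = 7 and S− = 2. Since Ha is M > 530, the P value is P(bin(9, 1/2) ≤ 2) = (1 + 9 + 36)/512 ≈ 0.090, which exceeds α = .05, so by the sign test the increase is not statistically significant.)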

210
The sign test: Example 2

According to the National Center for Health Statistics, the median


height of adult women in the United States is 63.6 inches. Assuming
that the women in our class are a random sample from the population
of women at UI, is the median height of UI women different than
this?

211
The Wilcoxon signed-rank test: Introduction
• Tests hypotheses about a population median, M (tests same
hypotheses as sign test)

• Its use requires the population distribution to be symmetric, so


it’s not as widely applicable as the sign test

• Since symmetry implies that the mean and median are equal,
this test tests the same hypotheses as the t test too

• Based on the idea that when the population distribution is


symmetric and M = M0 , the distances of sampled observations
from M0 should be about the same on both sides of it

• More powerful than sign test; not as powerful as t test

212
The Wilcoxon signed-rank test: Test statistic

Follow these steps to compute the test statistic:

1. Form the differences, Xi − M0 , and take their absolute values to get the distances of the observations from the hypothesized median, |Xi − M0|.

2. Rank those distances from smallest to largest and replace each numerical distance with its rank.

3. Attach a plus or minus sign to the rank according to whether the original difference was positive or negative.

4. Compute W+ = sum of the positive signed ranks, and W− = −1 × sum of the negative signed ranks.

213
The Wilcoxon signed-rank test: Test statistic

Note:

• Once again we ignore any observations that equal M0 (and


reduce n accordingly).

• Also, if ties occur we average all the successive ranks that are
tied. For example,

214
The Wilcoxon signed-rank test statistic: Example

Recall the example of toluene-exposed rats used to illustrate the sign test. The data (norepinephrine concentrations in the medulla) and the steps of the test statistic computation are as follows:

Xi            543  523  431  635  564  580  600  610  550
Xi − 530       13   −7  −99  105   34   50   70   80   20
|Xi − 530|     13    7   99  105   34   50   70   80   20
Rank            2    1    8    9    4    5    6    7    3
Signed rank     2   −1   −8    9    4    5    6    7    3

W+ = 36, W− = 9

215
The Wilcoxon signed-rank test: Rationale

Recall the following fact:

1 + 2 + 3 + · · · + n = n(n + 1)/2.

So, if the population distribution is symmetric and M = M0 , we would expect both W+ and W− to equal roughly n(n + 1)/4. If either of them is too small (too far away from n(n + 1)/4), we should reject H0.
How far is too far? We need a reference distribution for the Wilcoxon signed-rank statistic, which is provided in Table C.6. Call this distribution W(n).
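As a rough illustration of where Table C.6 comes from, the Python sketch below (not part of the notes; the helper name is mine) enumerates the exact null distribution of W+ for small n. Under H0 each rank independently receives a + or − sign with probability 1/2, so P(W(n) ≤ w) is just the proportion of subsets of the ranks {1, . . . , n} whose sum is at most w.

from itertools import combinations

def wplus_cdf(n, w):
    """Exact P(W(n) <= w): count subsets of {1,...,n} with sum <= w,
    out of the 2**n equally likely sign assignments under H0."""
    count = 0
    for k in range(n + 1):
        for subset in combinations(range(1, n + 1), k):
            if sum(subset) <= w:
                count += 1
    return count / 2 ** n

print(wplus_cdf(9, 9))   # about 0.064; relevant to the rat example on the following pages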

216
The Wilcoxon signed-rank test: P value testing ap-
proach

Three cases:

• If Ha is M > M0 , we reject H0 if W− is too small, i.e. if

P( W(n) ≤ W− ) < α.

• If Ha is M < M0 , we reject H0 if W+ is too small, i.e. if

P( W(n) ≤ W+ ) < α.

• If Ha is two-sided, we reject H0 if min(W−, W+) is too small, i.e. if

2·P( W(n) ≤ min(W−, W+) ) < α.

217
The Wilcoxon signed-rank test: P value testing ap-
proach

To get the probabilities for the aforementioned approach:

• If 5 ≤ n ≤ 25, we can use Table C.6 directly.

• If n > 25, we invoke the CLT to approximate the W(n) distribution by the N( n(n+1)/4, n(n+1)(2n+1)/24 ) distribution:

P( W(n) ≤ w ) ≈ P( Z ≤ (w + 0.5 − n(n+1)/4) / √(n(n+1)(2n+1)/24) ).

(Here w is either W− or W+ , depending on which one is applicable.)

218
The Wilcoxon signed-rank test: Example 1

Let’s finish up the example of the epinephrine concentrations in the


medullas of toluene-exposed rats. Recall that W+ = 36 and W = 9.
Since Ha is M > 530, the P value is

219
The Wilcoxon signed-rank test: Example 2

According to the National Center for Health Statistics, the median


height of adult women in the United States is 63.6 inches. Assuming
that the women in our class are a random sample from the population
of women at UI, is the median height of UI women different than
this?

For these data, n = , W+ = , W− = .

P( W(n) ≤ ) =

220
Comparing two population means: Introduction

In some situations, the scientific question of interest is not so much


what the mean of a single population is, but how the means of two
distinct populations compare to each other.
Example: The ages (in days) at time of death for random samples of
11 girls and 16 boys who died from SIDS were as follows:
Girls 53 56 60 60 78 87 102 117 134 160 277
Boys 46 52 58 59 77 78 80 81 84 103 114 115 133 134 167 175

The question of interest: Are the mean ages at death due to SIDS
identical for boys and girls?

221
Comparing two population means: Introduction

The previous example is of the following general type.


Suppose there are two populations of interest, the first with mean µ1 and variance σ1², say, and the second with mean µ2 and variance σ2².
From the first population we take a random sample of size n1 :

X11 , X12 , . . . , X1n1 .

From the second population we take a random sample of size n2 :

X21 , X22 , . . . , X2n2 .

Based on the information in these two samples, we wish to estimate the difference in the two means, i.e. µ1 − µ2 , and perhaps test hypotheses about this difference.
222
Comparing two population means: Types of hypothe-
ses

Hypotheses comparing two population means are of 3 types:

• H0 : µ1 = µ2 versus Ha : µ1 ≠ µ2
• H0 : µ1 ≤ µ2 versus Ha : µ1 > µ2
• H0 : µ1 ≥ µ2 versus Ha : µ1 < µ2

Note that each hypothesis can also be expressed in terms of the difference of the means. For example, the last pair of hypotheses can also be expressed as follows:

• H0 : µ1 − µ2 ≥ 0 versus Ha : µ1 − µ2 < 0
223
Comparing two population means: Types of sampling

The random samples can be of two distinct types:

1. Paired samples — for each i, the ith individual in the first


sample is more closely related, in some meaningful way, to
the ith individual in the second sample than it is to the other
individuals in the second sample.
2. Independent samples — for each i, the ith individual in the
first sample is no more closely related, in any meaningful way,
to any individual in the second sample than it is to other indi-
viduals in the second sample.

For paired samples, n1 = n2 ; not necessarily so for independent sam-


ples.
224
Paired versus independent sampling: Examples
• Bacterial contamination in meat samples, before and after ir-
radiation

• Weight gain for cattle on two feeds

• Studies of survival after two disease therapies

225
Comparing two means via paired sampling

When sampling is paired, the data can be listed in pairs:

(X11 , X21 ), (X12 , X22 ), . . . , (X1n , X2n )

and we can reduce the data to a single sample of within-pair differences

di = X1i − X2i , i = 1, . . . , n.

(Here we write n for n1 and n2 .)
Reducing the data to within-pair differences controls for sources of variation in the data other than the source of main interest (more on this later).

226
Comparing two means via paired sampling

In this context we let µd represent µ1 − µ2 ; equivalently, µd is the mean of the population of within-pair differences. The di 's are a random sample from this population.
The natural point estimate of µd is

X̄d = (1/n) Σ_{i=1}^n di ,

and the natural estimate of the variance of the population of within-pair differences is

s²d = (1/(n − 1)) [ Σ_{i=1}^n di² − (Σ_{i=1}^n di)²/n ].

227
Comparing two means via paired sampling

A 100(1 − α)% confidence interval for µd is

X̄d ± t_{1−α/2, n−1} · sd/√n.

Also, to test hypotheses about µd , we use the same test statistic, same critical value(s), same everything used for HT about the mean µ of a single population — but applied here to the within-pair differences. For example, to test H0 : µd = 0 versus Ha : µd ≠ 0 at the α significance level, we reject H0 if

(X̄d − 0)/(sd/√n) < −t_{1−α/2, n−1}  or  (X̄d − 0)/(sd/√n) > t_{1−α/2, n−1}.

228
Comparing two means via paired sampling: Example

An experiment was performed to study the effects of irradiation on bacterial contamination in meat.
The logarithm of the direct microscopic count (log DMC) of bacteria in 12 meat samples was measured
before irradiating the 12 meat samples, and then again afterwards. The data were as follows:

log DMC, before log DMC, after di (before minus after)


6.98 6.95 .03
7.08 6.94 .14
8.34 7.17 1.17
5.30 5.15 .15
6.26 6.28 –.02
6.77 6.81 –.04
7.03 6.59 .44
5.56 5.34 .22
5.97 5.98 –.01
6.64 6.51 .13
7.03 6.84 .19
7.69 6.99 .70

X̄d = .258, s²d = .127

229
Comparing two means via paired sampling: Example

The investigator wanted to show that irradiation reduces bacterial contamination so, letting µd represent the mean change (before minus after) in log DMC due to irradiation for the conceptual population of all possible meat samples, the hypotheses of interest are

H0 : µd ≤ 0 versus Ha : µd > 0.

The computed test statistic is

(X̄d − 0)/(sd/√n) = .258/√(.127/12) = 2.51.

If we test at the .05 significance level, the critical value is t.95,11 = 1.796, so we reject H0. Conclusion: there is statistically significant evidence that irradiation reduces bacterial contamination in meat.
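A quick numerical check of these calculations (a Python sketch, not part of the notes):

import numpy as np

before = np.array([6.98, 7.08, 8.34, 5.30, 6.26, 6.77, 7.03, 5.56, 5.97, 6.64, 7.03, 7.69])
after  = np.array([6.95, 6.94, 7.17, 5.15, 6.28, 6.81, 6.59, 5.34, 5.98, 6.51, 6.84, 6.99])
d = before - after                      # within-pair differences
n = len(d)
t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
print(round(d.mean(), 3), round(d.var(ddof=1), 3), round(t, 2))   # 0.258 0.127 2.51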
230
Comparing two means via independent sampling

For data arising from independent sampling of two populations, we use the data as they are (we don't form differences).
Our best point estimate of µ1 − µ2 is X̄1 − X̄2 , but how we use this to obtain CI's and do HT's depends on what is assumed about the population variances (σ1² and σ2²). Two cases:

1. Assume σ1² = σ2²

2. Do not assume σ1² = σ2²

231
Comparing two means via independent sampling, as-
suming equal variances

Assuming that σ1² = σ2² , the two sample variances (s1² and s2²) are estimates of the same quantity. So it makes sense to combine, or "pool," them:

s²p = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)

• The divisor, n1 + n2 − 2, is the degrees of freedom here

• The sample variance from the larger of the two samples gets more weight in the pooled estimate (makes sense!)

• If n1 = n2 , then s²p is merely the average of the two sample variances

232
Comparing two means via independent sampling, as-
suming equal variances

A 100(1 − α)% confidence interval for µ1 − µ2 is

(X̄1 − X̄2) ± t_{1−α/2, n1+n2−2} · sp·√(1/n1 + 1/n2).

Furthermore, to test hypotheses comparing the two population means, we use

t = (X̄1 − X̄2) / √( s²p (1/n1 + 1/n2) )

as our test statistic.

233
Comparing two means via independent sampling, as-
suming equal variances

More specifically:

• To test H0 : µ1 = µ2 versus Ha : µ1 ≠ µ2 , we reject H0 if

(X̄1 − X̄2) / √( s²p (1/n1 + 1/n2) ) < −t_{1−α/2, n1+n2−2}

or

(X̄1 − X̄2) / √( s²p (1/n1 + 1/n2) ) > t_{1−α/2, n1+n2−2}.

234
Comparing two means via independent sampling, as-
suming equal variances
• To test H0 : µ1 ≥ µ2 versus Ha : µ1 < µ2 , we reject H0 if

(X̄1 − X̄2) / √( s²p (1/n1 + 1/n2) ) < −t_{1−α, n1+n2−2}

• To test H0 : µ1 ≤ µ2 versus Ha : µ1 > µ2 , we reject H0 if

(X̄1 − X̄2) / √( s²p (1/n1 + 1/n2) ) > t_{1−α, n1+n2−2}

235
Comparing two means via independent sampling, as-
suming equal variances: Example

In a study of the periodical cicada (Magicicada septendecim), researchers measured the hind tibia
lengths of the shed skins of 110 individuals: 60 males and 50 females. Some summary statistics for
the tibia lengths were as follows:

Gender    ni    X̄i     si
Males     60   78.42   2.87
Females   50   80.44   3.52

Let µ1 and σ1² represent the mean and variance of hind tibia lengths for the entire population of male periodical cicadas at shedding; define µ2 and σ2² similarly for females. We want to test H0 : µ1 = µ2 versus Ha : µ1 ≠ µ2 at, say, the .05 level of significance, and suppose we're willing to assume that σ1² = σ2².

236
Comparing two means via independent sampling, as-
suming equal variances: Example

Pooled sample variance:

Test statistic:

Critical values:

P-value:

Conclusion:
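(One possible set of calculations, offered as a hedged sketch rather than the official answer key: s²p = [59(2.87²) + 49(3.52²)]/108 = (485.98 + 607.13)/108 ≈ 10.12; t = (78.42 − 80.44)/√(10.12(1/60 + 1/50)) = −2.02/0.609 ≈ −3.32; critical values ±t.975,108 ≈ ±1.98, so the test statistic is more extreme than the critical values and the two-sided P value is well below .05. H0 is rejected, and we conclude that mean hind tibia lengths differ between male and female periodical cicadas.)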

237
Comparing two means via independent sampling, as-
suming equal variances: Example

95% confidence interval for µ1 − µ2 :

95% confidence interval for µ1 :

95% confidence interval for µ2 :

238
Comparing two means via independent sampling, when
variances are possibly unequal

In this case s1² and s2² aren't necessarily estimating the same quantity, so we do not pool them. We sum them instead (actually we sum scaled versions of them).
100(1 − α)% confidence interval for µ1 − µ2 :

(X̄1 − X̄2) ± t_{1−α/2, ν} · √( s1²/n1 + s2²/n2 ).

The degrees of freedom here, represented by the symbol ν, is determined using a messy expression given on page 196 of the text.

239
Comparing two means via independent sampling, when
variances are possibly unequal

To test hypotheses comparing the means, we follow the procedure for the equal variances case described on pp. 234-235 of these notes, with two important differences:

1. Replace the test statistic there with

(X̄1 − X̄2) / √( s1²/n1 + s2²/n2 )

2. Replace the df there (which is n1 + n2 − 2) with ν

240
Comparing two means via independent sampling, when
variances are possibly unequal: Example

Let’s revisit the periodical cicada example, only this time let’s not
assume that the two population variances are equal. Then our test
statistic is

The df, n, is:

So critical values are ±t.975,94 = ±1.986. We still reject H0 , but


slightly less emphatically (test statistic not as extreme, critical values
more extreme).
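(A hedged sketch of the unequal-variance computations: t = (78.42 − 80.44)/√(2.87²/60 + 3.52²/50) = −2.02/√(0.137 + 0.248) ≈ −3.25, and the messy degrees-of-freedom formula gives ν ≈ 94, consistent with the critical values ±t.975,94 = ±1.986 quoted above.)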

241
Pros and cons of paired sampling

Paired sampling is to be preferred over independent sampling when


the within-pair differences are likely to be less variable than the data
from individuals in each sample.
Why? Because when the variability of within-pair differences is less
than the variability among individuals in the same sample, paired-
sampling based confidence intervals for the difference in the two
population means will be narrower, and paired-sampling based hy-
pothesis tests about the two population means will be more power-
ful.
Why? Consider the following numerical demonstrations. For both,
suppose we take paired samples of size 5 from two populations.

242
Pros and cons of paired sampling: Numerical demon-
stration

Case #1: Pairing is effective


First sample   Second sample   Within-pair difference
 1              0               1
 6              5               1
10             10               0
17             15               2
21             20               1

X̄d = 1, s²d = (0 + 0 + 1 + 1 + 0)/4 = 0.5.

95% confidence interval for µd :

X̄d ± t.975,4 · sd/√n = 1 ± 2.776·√0.5/√5 = 1 ± 0.88.

The t-test rejects H0 : µd = 0 (against a two-sided Ha ) at the .05 significance level (test statistic = 3.16, critical values are ±2.776).
243
Pros and cons of paired sampling: Numerical demon-
stration

Suppose we do an independent-sampling based analysis of the same data:

X̄1 − X̄2 = 1, s1² = (100 + 25 + 1 + 36 + 100)/4 = 65.5,

s2² = (100 + 25 + 0 + 25 + 100)/4 = 62.5,

s²p = [ 4(62.5) + 4(65.5) ] / 8 = 64.

A 95% confidence interval for µ1 − µ2 is

(X̄1 − X̄2) ± t.975,8 · sp·√(1/n1 + 1/n2) = 1 ± 2.306·(8)·√(2/5) = 1 ± 11.67.
244
The t-test does not reject H0 : µ1 = µ2 (against a two-sided Ha ) at
the .05 significance level (test statistic = 0.197, critical values are
±2.306).
Thus, in this case the pairing was very effective, and the paired-
sampling analysis leads to a much better (narrower) confidence in-
terval for the difference in means, and a much more powerful test.

245
Pros and cons of paired sampling: Numerical demon-
stration

Case #2: Pairing is ineffective


First sample   Second sample   Within-pair difference
21              0               21
17              5               12
10             10                0
 6             15               −9
 1             20              −19

X̄d = 1, s²d = (400 + 121 + 1 + 100 + 400)/4 = 255.5.

95% confidence interval for µd :

X̄d ± t.975,4 · sd/√n = 1 ± 2.776·√255.5/√5 = 1 ± 19.84.

The t-test does not reject H0 : µd = 0 (against a two-sided Ha ) at the .05 significance level (test statistic = 0.14, critical values are ±2.776).
246
Pros and cons of paired sampling: Numerical demon-
stration

Note that an independent-sampling based analysis of the second set


of data is identical to an independent-sampling based analysis of the
first set of data.
Thus, in Case #2 the paired-sampling based analysis is even worse
than the independent-sampling based analysis.
Moral of the story: To be effective, a paired-sampling based anal-
ysis must be based on effective paired sampling. For paired sam-
pling to be effective, it must remove (control for) at least some varia-
tion between individuals, so that the variability of differences within
pairs is less than the variability among individuals in each sample.

247
Sources of variation that paired sampling controls for:
Examples
• Bacterial contamination in paired meat samples, before and
after irradiation

• Weight gain for twin calves on two feeds

• Studies of survival of “matched pairs” after two disease ther-


apies

248
The irradiated meat example, revisited: Incorrectly
analyzed by an independent-sampling based approach

Although the sampling was paired in the irradiated meat example, let's see what happens when we incorrectly act as though the sampling was independent. For simplicity assume that σ1² = σ2².
Relevant summary statistics:

n1 = n2 = 12, X̄1 = 6.721, X̄2 = 6.463,

s1² = 0.6565, s2² = 0.4341, s²p = 0.5453,

√( s²p (1/n1 + 1/n2) ) = 0.301.

249
For testing H0 : µ1 ≤ µ2 vs. Ha : µ1 > µ2 at the .05 significance level, the critical value of t is t.95,22 = 1.717. The test statistic is

t = (X̄1 − X̄2) / √( s²p (1/n1 + 1/n2) ) = 0.857.

So, based on this analysis, we do not reject H0 — a different conclusion from that reached by the paired-sampling based analysis.
Why the difference? Less variability among within-pair differences than among the data in each sample (compare the denominators of the test statistics:

√( s²p (1/n1 + 1/n2) ) = 0.301 vs. sd/√n = 0.103.)

250
Nonparametric tests for two populations: Introduc-
tion

The previously described tests for the means of two populations are
strictly valid only when either:

• the population of within-pair differences (for paired sampling)


or the two sampled populations themselves (for independent
sampling) are normally distributed; or

• the sample size n (for paired sampling), or both of n1 and n2


(for independent sampling), are large enough for the CLT to
apply.

If neither of these holds, then we can instead use nonparametric tests.

251
Nonparametric tests for two populations: Introduc-
tion

When the sampling is paired, we may test hypotheses about the median of the population of within-pair differences using either the

• sign test, or

• Wilcoxon signed-rank test (if the population is symmetric).

When the sampling is independent, we may test hypotheses about


the medians of the two populations using a test called the Wilcoxon
rank-sum test.

252
The irradiated meat example, revisited: Sign test and
Wilcoxon signed-rank test for the population median
of paired differences

Recall, once again, the following log DMC data in 12 meat samples before and after irradiation. The
investigator wishes to test
H0 : Md ≤ 0 versus Ha : Md > 0.

log DMC, before log DMC, after di


6.98 6.95 .03
7.08 6.94 .14
8.34 7.17 1.17
5.30 5.15 .15
6.26 6.28 –.02
6.77 6.81 –.04
7.03 6.59 .44
5.56 5.34 .22
5.97 5.98 –.01
6.64 6.51 .13
7.03 6.84 .19
7.69 6.99 .70
253
For these data,

S− = , S+ = , W− = 7, W+ = 71.

P-value for sign test:

P-value for Wilcoxon signed rank test:

Thus, the sign test does not reject H0 (at α = .05), but the Wilcoxon signed-rank test does reject H0 (indicating that there is not and is, respectively, statistically significant evidence that irradiation reduces bacterial contamination in meat).
Why the difference in conclusions?
254
Wilcoxon rank-sum test

When we wish to compare two population medians, and the sam-


pling of those populations is independent, we may use the Wilcoxon
rank-sum test, if we are willing to assume that the two populations
have the same shape.
The hypotheses to be tested are one of the following:

• H0 : M1 = M2 versus Ha : M1 ≠ M2

• H0 : M1 ≤ M2 versus Ha : M1 > M2

• H0 : M1 ≥ M2 versus Ha : M1 < M2

255
Wilcoxon rank-sum test: Test statistic

Important: Label your samples (and populations) such that the


smaller of the two sample sizes is n1 (if the two samples are equal in
size, then it doesn’t matter how you label them).
Procedure for computing the test statistic:

1. Conceptually pool the data from both samples into one sample
and rank the data from smallest to largest. Replace the data
with their ranks.

2. Sum the ranks that correspond to the smaller of the two sam-
ples (Sample 1). Call this rank sum W1 , which is our test statis-
tic.

256
Wilcoxon rank-sum test: Critical values

Critical values are listed in Table C.8:

• The book's m and n are my n1 and n2 , respectively; the table gives critical values for sample sizes in the range 3 ≤ n1 ≤ n2 ≤ 25.

• The significance level options in the table are very limited.

• Column labeled W gives the lower and upper critical values.


Use both critical values for a two-sided test; use only the lower
one if Ha is M1 < M2 ; use only the upper one if Ha is M1 > M2 .

• Column labeled P gives the P value for a one-sided test whose


test statistic exactly equals the critical value; this rarely hap-
pens so you can ignore it.

257
Wilcoxon rank-sum test: Example

Recall (from page 221 of these notes) the following data, which are
the ages (in days) at time of death for random samples of 11 girls
and 16 boys who died from SIDS:
Girls 53 56 60 60 78 87 102 117 134 160 277
Boys 46 52 58 59 77 78 80 81 84 103 114 115 133 134 167 175

Histograms of these data show that for both girls and boys, the data
are right-skewed. Thus, the corresponding populations are probably
not normally distributed, and the sample sizes are relatively small.
Question of interest: Are the median ages at death due to SIDS iden-
tical for boys and girls?

258
So we wish to test
H0 : M1 = M2 versus Ha : M1 ≠ M2.
Original data:
Girls 53 56 60 60 78 87 102 117 134 160 277
Boys 46 52 58 59 77 78 80 81 84 103 114 115 133 134 167 175

Ranks in pooled sample:


Girls 3 4 7.5 7.5 10.5 15 16 20 22.5 24 27
Boys 1 2 5 6 9 10.5 12 13 14 17 18 19 21 22.5 25 26

Test statistic: W1 = 157.


Critical values for two-sided test at a = .05: 113 and 195.
Since our test statistic is not as extreme as the critical values, we do not reject H0. Scientific conclusion: Among babies that die of SIDS, there is no significant difference between males' and females' median ages at death.
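The pooled ranks and the rank sum W1 can be checked with a short Python sketch (not part of the notes):

from scipy.stats import rankdata

girls = [53, 56, 60, 60, 78, 87, 102, 117, 134, 160, 277]
boys  = [46, 52, 58, 59, 77, 78, 80, 81, 84, 103, 114, 115, 133, 134, 167, 175]
ranks = rankdata(girls + boys)      # ranks in the pooled sample, with ties averaged
W1 = ranks[:len(girls)].sum()       # rank sum for the smaller (girls') sample
print(W1)                           # 157.0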
259
Comparing two population variances: Introduction

Several pages back, we saw that the specifics of inference for the
difference in two population means based on independent samples
depended on whether or not we assumed that the two population
variances were equal. If we do make this assumption, it’s desirable
to have some justification for doing so. This leads us to consider the
problem of testing the hypothesis that the two population variances
are equal.
We may also be interested in testing whether two population vari-
ances are equal for its own sake.

260
Comparing two population variances: Hypotheses

The hypotheses to be tested are one of the following:

• H0 : σ1² = σ2² versus Ha : σ1² ≠ σ2²

• H0 : σ1² ≤ σ2² versus Ha : σ1² > σ2²

• H0 : σ1² ≥ σ2² versus Ha : σ1² < σ2²

Observe that we can re-express any of these in terms of the ratio of the two variances. For example, the third pair of hypotheses can be rewritten as

• H0 : σ1²/σ2² ≥ 1 versus Ha : σ1²/σ2² < 1

261
Comparing two population variances: Test statistic

As a test statistic, consider the ratio of the two sample variances, which we label as F:

F = s1²/s2²

If the two population variances are equal (equivalently, if σ1²/σ2² = 1), then F should usually be pretty close to 1. Extreme (too large or too small) values of F would cast doubt on the hypothesis that the two population variances are equal.
What do we compare F to, to determine if it’s extreme enough to
reject H0 ?

262
Comparing two population variances: Critical values

When H0 is true, F has a well-known (and tabled) distribution called


the F distribution, so it is this distribution that we refer to to obtain
critical values (and P values).
The F distribution:

• has a shape similar to the chi-square distribution (unimodal, bell-shaped, right-skewed, positive pdf only over the positive half-line)

• is really a family of distributions — one for each of two degree-of-freedom parameters, ν1 and ν2 , called the numerator df and denominator df

• is tabled in Table C.7
263
Comparing two population variances: Critical values

Values in Table C.7 are critical values in the right tail, i.e. F_{1−α, (ν1, ν2)} :

• F.95,(6,11) =

• F.99,(60,25) =

• F.975,(47,83) =

Critical values in the left tail are not tabled (to save space), but can be obtained from critical values in the right tail as follows:

F_{α, (ν1, ν2)} = 1/F_{1−α, (ν2, ν1)}.

(Note that the order of the df is reversed on the RHS.)

264
Comparing two population variances: Critical values

For comparing two population variances,

ν1 = n1 − 1 and ν2 = n2 − 1.

So for the F-test, critical values are:

• 1/F_{1−α/2, (n2−1, n1−1)} and F_{1−α/2, (n1−1, n2−1)} , if Ha is two-sided

• F_{1−α, (n1−1, n2−1)} , if Ha is σ1² > σ2²

• 1/F_{1−α, (n2−1, n1−1)} , if Ha is σ1² < σ2²

If our F test statistic is more extreme than the critical value(s), we reject H0 ; otherwise, we do not reject H0.
265
Comparing two population variances: Example

Let's revisit the periodical cicada example (from p. 236 of these notes) to see if the assumption of equal population variances that we made there can be justified. We wish to test

H0 : σ1² = σ2² versus Ha : σ1² ≠ σ2².

Relevant summary statistics:

n1 = 60, n2 = 50, s1 = 2.87, s2 = 3.52

Test statistic: F =

Critical values (using α = .05):

Conclusion:
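One way to fill in these blanks (a hedged sketch; scipy is used here only to look up the F critical values that Table C.7 would provide):

from scipy.stats import f

s1, s2, n1, n2 = 2.87, 3.52, 60, 50
F = s1**2 / s2**2                               # test statistic, about 0.66
lo = 1 / f.ppf(0.975, dfn=n2 - 1, dfd=n1 - 1)   # lower critical value, roughly 0.59
hi = f.ppf(0.975, dfn=n1 - 1, dfd=n2 - 1)       # upper critical value, roughly 1.72
print(round(F, 3), round(lo, 3), round(hi, 3))

Since F ≈ 0.66 lies between the two critical values, H0 is not rejected, so the equal-variance assumption made on page 236 is not contradicted by these data.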
266
Hypothesis testing: Some loose ends . . .

1. Point estimate(s) satisfies null hypothesis


For a particular population, suppose we wish to test

H0 : µ ≤ 65 versus Ha : µ > 65.

Furthermore, suppose a random sample is taken from this population and the sample mean (X̄) is 61. Then the test statistic is

In which tail is the critical value?

So we will not reject H0 (at any typical level of significance).

267
Hypothesis testing: Some loose ends . . . (continued)

Similarly, for a situation with two populations, suppose we wish to test

H0 : µ1 ≥ µ2 versus Ha : µ1 < µ2.

If independent random samples are taken from the two populations, and the sample means satisfy X̄1 ≥ X̄2 , then the test statistic is

In which tail is the critical value?

So we will not reject H0.
General Fact: When doing a one-sided test, if the point estimate
satisfies H0 , we can skip calculation of the test statistic and conclude
immediately that H0 is not rejected (at typical levels of significance).

268
Hypothesis testing: Some loose ends . . . (continued)

2. Comparison of conclusion for one-sided and two-sided alterna-


tives

(a) Suppose we are asked to test

H0 : µ ≤ 65 versus Ha : µ > 65

but we mistakenly test

H0 : µ = 65 versus Ha : µ ≠ 65.

Suppose further that X̄ is larger than 65, and that we reject the second H0 (at a particular level of significance, say .05).
Would we also reject the first H0 (at the same level of significance)?
269
Hypothesis testing: Some loose ends . . . (continued)

(b) Same scenario as (a), except suppose that we do not reject the
second H0 . Would we also not reject the first H0 ?

270
Hypothesis testing: Some loose ends . . . (continued)
(c) Suppose we are asked to test

H0 : µ = 65 versus Ha : µ ≠ 65

but we mistakenly test

H0 : µ ≤ 65 versus Ha : µ > 65.

Suppose further that X̄ is larger than 65, and that we reject the second H0 (at a particular level of significance, say .05).
Would we also reject the first H0 (at the same level of significance)?

271
Hypothesis testing: Some loose ends . . . (continued)
(d) Same scenario as (c), except suppose that we do not reject the
second H0 . Would we also not reject the first H0 ?

General fact: It takes a more extreme value of the test statistic to


reject H0 when Ha is two-sided than it does to reject H0 when Ha is
one-sided (at the same level of significance).

272
Hypothesis testing: Some loose ends . . . (continued)

3. Equivalence between hypothesis tests and confidence intervals


Recall, from pp. 195-196, that we can carry out the hypothesis test
of

$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_a: \mu \neq \mu_0$$

at the α level of significance by rejecting H0 if, and only if, $\mu_0$ is not
contained in the 100(1 − α)% confidence interval for µ.

Likewise, we can carry out the hypothesis test of

$$H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_a: \mu_1 \neq \mu_2$$

at the α level of significance by rejecting H0 if, and only if, 0 is not
contained in the 100(1 − α)% confidence interval for $\mu_1 - \mu_2$.
273
Hypothesis testing: Some loose ends . . . (continued)

Similar equivalences can be stated for tests of other parameters (but


don’t worry about the details).

274
Testing hypotheses on more than two population means:
Motivating example

Problem 8.8 from textbook, adapted from the article "Phytoestrogen
supplements for the treatment of hot flashes: The isoflavone clover
extract (ICE) study," Journal of the American Medical Association,
vol. 290, pp. 207-214, by J. Tice et al., 2002:
The primary reason women seek medical attention for menopausal symptoms is hot flashes. Dietary
supplements containing isoflavones derived from natural sources such as soy or red clover are marketed
as an alternative treatment for such symptoms and are being used increasingly by women in the U.S.
Isoflavones are polyphenol compounds that are similar in structure to estrogens.

A study was carried out to determine whether two dietary supplements derived from red clover were
more effective than a placebo in reducing hot flashes in post-menopausal women. The randomized,
double-blind trial was conducted using 252 menopausal women, aged 45 to 60 years, who were expe-
riencing at least 35 hot flashes per week. After a 2-week period in which all were given a placebo, the
women were randomly assigned to Promensil (82 mg of total isoflavones per day), Rimostil (57 mg of
total isoflavones per day), or an identical placebo; and then followed up for 12 weeks. The table below
provides summary statistics on the number of hot flashes (per day) experienced by the women at the
end of the trial.

275
Testing hypotheses on more than two population means:
Motivating example
              Promensil   Rimostil   Placebo
$n_i$           84          83         85
$\bar{X}_i$     5.1         5.4        5.0
$s_i$           4.1         4.6        3.2

Assuming normality, analyze these data to determine whether there are any differences in the mean
number of hot flashes per day for these three treatments.

Hypotheses to test, using a two-population approach repeatedly:

276
Testing hypotheses on more than two population means:
Motivating example

Suppose we test each hypothesis (using a two-sample t test, assum-


ing equal population variances) at the .05 level of significance. What
is the overall Type I error probability, i.e. the probability that we re-
ject at least one of these null hypotheses when it is true?
Analogy to a monkey taking a three-question multiple-choice statis-
tics exam, where each question has 20 choices:

277
Testing hypotheses on more than two population means:
Motivating example

If the data being used to perform each hypothesis test were independent of the data being used to perform the others, the overall Type I
error probability could be computed as follows:

But some of the data are the same in each of the tests, so independence doesn't hold and the previous calculation isn't valid.
Bottom line: We can't control (determine) the overall Type I error
probability by doing multiple two-sample t tests. If we want to control α, we need to take a completely different approach, which is
called the Analysis of Variance (ANOVA).

278
The ANOVA: Set-up and notation

So suppose we want to compare the means of k populations, where
$k \ge 2$. (We already know how to do this when k = 2, so it's really
k > 2 that interests us now.) We take independent random samples
from each of the k populations.
Notation:

• $\mu_i$: population mean of the ith population

• $\sigma_i^2$: population variance of the ith population

• $n_i$: size of sample from the ith population

• N: total number of observations, i.e. N =

279
The ANOVA: Set-up and notation

More notation:

• $X_{ij}$: the jth observation in the ith sample

• $\bar{X}_{i.}$: sample mean of the ith sample, i.e.,

  $\bar{X}_{i.} =$

• $\bar{X}_{..}$: grand mean, i.e.

  $\bar{X}_{..} =$

• $s_i^2$: sample variance of the ith sample, i.e.,

  $s_i^2 =$

280
The ANOVA: The hypotheses tested

The objective of an ANOVA is to test the null hypothesis

H0 : µ1 = µ2 = · · · = µk

against the alternative hypothesis

Ha : at least one µi is different from the others

in such a way that the overall Type I error probability can be pre-
specified (see pp. 277-278 of these notes).

281
The ANOVA: Underlying assumptions
1. The samples are independent random samples from their re-
spective populations.

2. The populations are normally distributed, or if not, then sample sizes are large enough for the CLT to apply (all $n_i > 25$).

3. Population variances are all equal, i.e. $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$.

Sketch of population distributions satisfying the last 2 assumptions:

282
The ANOVA: Partitioning deviations from grand mean

Consider the deviation of an arbitrary observation from the grand
mean, i.e.

$$X_{ij} - \bar{X}_{..}$$

By adding and subtracting the ith sample mean, we have the
algebraic identity

$$X_{ij} - \bar{X}_{..} =$$

This partitions the deviation of an observation from the grand mean
into two parts:

• the deviation of the group mean from the grand mean

• the deviation of the observation from its group mean

283
The ANOVA: Partitioning deviations from grand mean

Consider a toy example in which k = 3 and the data are as follows


(lines separate the 3 samples, whose sample sizes are 3, 2, and 3,
respectively):

Data   $X_{ij}-\bar{X}_{..}$   $\bar{X}_{i.}-\bar{X}_{..}$   $X_{ij}-\bar{X}_{i.}$
9 -11 -10 -1
10 -10 -10
11 -9 -10
19
21
28
30
32

284
The ANOVA: Partitioning the sums of squares

Now let us square the deviations of observations from the grand
mean, and then sum them up, i.e.

$$\sum_{i=1}^{k}\sum_{j=1}^{n_i} (X_{ij}-\bar{X}_{..})^2.$$

It turns out that

$$\sum_{i=1}^{k}\sum_{j=1}^{n_i} (X_{ij}-\bar{X}_{..})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left[(\bar{X}_{i.}-\bar{X}_{..}) + (X_{ij}-\bar{X}_{i.})\right]^2$$
$$=$$

(algebra details provided on p. 236 of textbook — never mind!).
We rewrite this as

$$SS_{Total} = SS_{Treat} + SS_{Error}.$$
285
The ANOVA: Partitioning the sums of squares

Let’s try this on our toy example:

$SS_{Total} = (-11)^2 + (-10)^2 + \cdots + (12)^2 =$

$SS_{Treat} = (-10)^2 + (-10)^2 + \cdots + (10)^2 =$

$SS_{Error} = (-1)^2 + (0)^2 + \cdots + (2)^2 =$

It works!

So what?

286
The ANOVA: Test statistic

All of the preceding algebraic development was for the purpose of


computing a test statistic for testing the equality of means hypothesis
described previously.
The test statistic is

$$F = \frac{SS_{Treat}/(k-1)}{SS_{Error}/(N-k)}.$$

We compare this to a right-tail critical value from the F distribution
with $k-1$ and $N-k$ degrees of freedom. If $F > F_{1-\alpha,\,k-1,\,N-k}$, then
we reject H0; otherwise we do not reject H0.

287
The ANOVA: Toy example

For our toy example,

$$SS_{Treat} = 600 \quad \text{and} \quad SS_{Error} = 12.$$

So,

$$F = \frac{600/(3-1)}{12/(8-3)} = 125.0$$

The critical value for a test at the .05 significance level is $F_{.95,2,5} =$
5.79, so we would reject H0 for these "data."

288
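The same toy-example ANOVA can be reproduced with scipy's built-in one-way ANOVA routine; the following is a minimal sketch (Python/scipy are assumed, and are not part of the course materials):

```python
# Sketch: one-way ANOVA on the toy data from these notes.
from scipy import stats

sample1 = [9, 10, 11]
sample2 = [19, 21]
sample3 = [28, 30, 32]

F, p_value = stats.f_oneway(sample1, sample2, sample3)
print(F, p_value)                 # F should be 125.0, matching the hand calculation

# Critical value F_{.95,2,5} for a test at the .05 level:
print(stats.f.ppf(0.95, 2, 5))    # about 5.79, so H0 is rejected
```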
The ANOVA: Some remarks
• Although the hypotheses being tested are concerned with pop-
ulation means, the test statistic is a ratio of measures of spread!
Why is this reasonable?

• The alternative hypothesis is not one-sided (actually it’s “multi-


sided”), despite the fact that we reject H0 only for values of F
that are too large.

• There are better computational formulas for the sums of squares


(see next two pages).

289
The ANOVA: Computational formulas for sums of
squares

Additional notation:

• $T_{i.}$: sum of the observations in the ith sample, i.e.,

$$T_{i.} = \sum_{j=1}^{n_i} X_{ij}$$

• $T_{..}$: sum of all observations (over all samples), i.e.

$$T_{..} =$$

290
The ANOVA: Computational formulas for sums of
squares

In practice, the following formulas for the sums of squares are algebraically equivalent to the definitions, but less painful to compute and less prone to numerical errors:

$$SS_{Total} = \left(\sum_{i=1}^{k}\sum_{j=1}^{n_i} X_{ij}^2\right) - \frac{T_{..}^2}{N}$$

$$SS_{Treat} = \left(\sum_{i=1}^{k}\frac{T_{i.}^2}{n_i}\right) - \frac{T_{..}^2}{N}$$

$$SS_{Error} = \sum_{i=1}^{k}(n_i - 1)s_i^2$$

Actually, we only need to compute any two of these, and we can
then get the third using the fact that $SS_{Total} = SS_{Treat} + SS_{Error}$.
291
The ANOVA table

It is customary (for books, scholarly journals, and computer soft-


ware) to display the results of an ANOVA in the form of a table, as
follows:
Source of variation      SS             df        MS                        F
Treatments               $SS_{Treat}$   $k-1$     $SS_{Treat}/(k-1)$        $MS_{Treat}/MS_{Error}$
Error                    $SS_{Error}$   $N-k$     $SS_{Error}/(N-k)$
Total                    $SS_{Total}$   $N-1$

A mean square (MS) is simply a sum of squares (SS) divided by the


corresponding degrees of freedom.

292
The ANOVA: A real example

Now let’s return to the isoflavone compound example.


              Promensil   Rimostil   Placebo
$n_i$           84          83         85
$\bar{X}_i$     5.1         5.4        5.0
$T_{i.}$        428         448        425
$s_i$           4.1059      4.6023     3.1997

We get

$$SS_{Treat} = \frac{(428)^2}{84} + \frac{(448)^2}{83} + \frac{(425)^2}{85} - \frac{(1301)^2}{252} = 7.21$$

$$SS_{Error} = (83)(4.1059)^2 + (82)(4.6023)^2 + (84)(3.1997)^2 = 3996.1$$

293
The ANOVA: A real example

The ANOVA table is


Source of variation      SS        df      MS        F
Treatments                         2
Error                    3996.1            16.049
Total                              251

Since the computed F statistic is not in the right tail, we do not re-
ject H0 at the .05 level of significance (or at any other typical level
of significance).
Conclusion: The mean number of hot flashes per day is not signifi-
cantly different for the three treatments.

294
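Because only summary statistics are available here, the computational SS formulas from p. 291 can be applied directly. A minimal Python sketch (numpy assumed; not part of the course materials) that fills in the blanks of the ANOVA table above:

```python
# Sketch: isoflavone ANOVA table from the summary statistics in these notes.
import numpy as np

n = np.array([84, 83, 85])                 # group sample sizes
T = np.array([428.0, 448.0, 425.0])        # group totals T_i.
s = np.array([4.1059, 4.6023, 3.1997])     # group standard deviations

N, k = n.sum(), len(n)
ss_treat = np.sum(T**2 / n) - T.sum()**2 / N     # about 7.21
ss_error = np.sum((n - 1) * s**2)                # about 3996.1

ms_treat = ss_treat / (k - 1)
ms_error = ss_error / (N - k)                    # about 16.049
F = ms_treat / ms_error                          # far below any right-tail critical value
print(ss_treat, ss_error, k - 1, N - k, ms_error, F)
```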
Mean separation: Introduction

In the isoflavone compound example, we found no significant differ-


ences among the three treatment means, so the analysis is finished
at that point. In other examples, however, the F test statistic will be
larger than the critical value (at the chosen level of significance) and
we will reject H0 . What then?
In that case we must try to discover which population means are sig-
nificantly different from the rest. The textbook describes two meth-
ods for this, Bonferroni t tests and the Student-Newman-Keuls test.
You may ignore these; we will use a different, and much easier,
method called the “protected F method” that performs just as well.

295
Mean separation: Protected F method

The protected F method consists of t tests at significance level α for
the equality of each pair of population means, but these are only
performed if the F test at significance level α rejects the overall
equality-of-means hypothesis.
In this context, the t test statistic for comparing $\mu_i$ and $\mu_j$ is

$$t_{ij} = \frac{\bar{X}_{i.} - \bar{X}_{j.}}{\sqrt{MS_{Error}\left(\dfrac{1}{n_i}+\dfrac{1}{n_j}\right)}}.$$

The appropriate critical values, taking α to be identical to that used
for the overall F test, are $\pm t_{1-\alpha/2,\,N-k}$. We conclude that $\mu_i$ and $\mu_j$
are significantly different if $t_{ij}$ is more extreme than either of the two
critical values.
296
Mean separation: Example

An experiment was performed to study the psychological effects of exercise on male college students.
Four groups of college men were studied:

• Exercisers (E): participants in a prescribed semester-length exercise program

• Quitters (Q): people who volunteered to participate in the exercise program but did not follow
through

• Joggers (J): non-participants who, however, jog regularly

• Slackers (S): other non-participants

At the beginning and end of the experiment, a psychological test was taken by each person. The scoring
on the exam was measured in such a way that a greater degree of satisfaction/confidence/happiness at
the end of the experiment corresponded to a greater difference in the two exam scores taken by each
person. We therefore think of the µi ’s as representing mean satisfaction levels.

At the 0.10 level of significance, we want to test H0 : µE = µQ = µJ = µS versus Ha : at least one mean
is different from the others.

297
Mean separation: Example
Results:

Group   $n_i$   $\bar{X}_{i.}$   $s_i$
E       5       57.40            10.46
Q       10      51.90            6.42
J       10      58.20            9.49
S       11      49.73            6.27

$MS_{Treat} = 158.88$, $MS_{Error} = 62.88$

Test statistic: F = 2.53

Critical value: $F_{.90,3,32} \doteq F_{.90,3,30} = 2.28$.

So we reject H0, concluding that mean satisfaction levels of the four
groups are not all equal.

298
Mean separation: Example

Protected F method of mean separation:

$$t_{EQ} = \frac{57.40-51.90}{\sqrt{62.88\left(\frac{1}{5}+\frac{1}{10}\right)}} = 1.27, \qquad t_{EJ} = \frac{57.40-58.20}{\sqrt{62.88\left(\frac{1}{5}+\frac{1}{10}\right)}} = -0.18$$

$$t_{ES} = \frac{57.40-49.73}{\sqrt{62.88\left(\frac{1}{5}+\frac{1}{11}\right)}} = 1.79, \qquad t_{QJ} = \frac{51.90-58.20}{\sqrt{62.88\left(\frac{1}{10}+\frac{1}{10}\right)}} = -1.78$$

$$t_{QS} = \frac{51.90-49.73}{\sqrt{62.88\left(\frac{1}{10}+\frac{1}{11}\right)}} = 0.63, \qquad t_{JS} = \frac{58.20-49.73}{\sqrt{62.88\left(\frac{1}{10}+\frac{1}{11}\right)}} = 2.44$$

299
Mean separation: Example

Critical values are ±t.95,32 = ±1.694. So we conclude that 3 of the


pairwise comparisons are statistically significant at the .10 level:

• Exercisers versus Slackers

• Quitters versus Joggers

• Joggers versus Slackers

In other words, faithful participants in the exercise program and jog-


gers have significantly higher satisfaction levels than slackers; and
likewise for joggers compared to people who begin an exercise pro-
gram but quit.

300
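A minimal Python sketch of the protected F comparisons above, using the group means, sample sizes, and MS_Error reported in these notes (scipy assumed; not part of the course materials):

```python
# Sketch: protected F pairwise t tests for the exercise-study example.
from itertools import combinations
from scipy import stats

groups = {"E": (5, 57.40), "Q": (10, 51.90), "J": (10, 58.20), "S": (11, 49.73)}
ms_error = 62.88
N, k, alpha = 36, 4, 0.10

t_crit = stats.t.ppf(1 - alpha / 2, N - k)   # t_{.95,32}, about 1.694
for (g1, (n1, m1)), (g2, (n2, m2)) in combinations(groups.items(), 2):
    t = (m1 - m2) / (ms_error * (1 / n1 + 1 / n2)) ** 0.5
    verdict = "significant" if abs(t) > t_crit else "not significant"
    print(g1, "vs", g2, round(t, 2), verdict)
# E-S, Q-J, and J-S come out significant, as concluded on this slide.
```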
Nonparametric ANOVA (Kruskal-Wallis test): Introduction

Recall that one of the assumptions for the ANOVA to be valid is


that the variable of interest is normally distributed in each of the
populations under study, or if not, then the sample sizes are large
enough for the CLT to apply. If neither of these holds, then there
is an alternative, nonparametric testing approach called the Kruskal-
Wallis test.
The Kruskal-Wallis test is essentially an ANOVA of the ranks of the
data (rather than of their numerical values). It extends the Wilcoxon
rank sum test to k groups in the same way that the ANOVA F test
extends the two-sample t test to k groups.

301
Kruskal-Wallis test: The hypotheses tested

The objective of a Kruskal-Wallis test is to test the null hypothesis

H0 : M1 = M2 = · · · = Mk

against the alternative hypothesis

Ha : at least one Mi is different from the others

in such a way that the overall Type I error probability can be pre-
specified.

302
Kruskal-Wallis test: Underlying assumptions
1. The samples are independent random samples from their re-
spective populations.

2. The population distributions are all the same shape; they differ
(possibly) only insofar as their medians are concerned.

Sketch of population distributions satisfying the last assumption:

303
Kruskal-Wallis test: Test statistic

As with the Wilcoxon rank sum statistic, we begin to compute the


Kruskal-Wallis statistic by ranking all observations without regard
to which sample they come from.
Notation and terminology:

• $R_i$: sum of the ranks associated with the ith sample

• $R_i/n_i$: sample mean rank for the ith sample

• $(N+1)/2$: grand mean rank

304
Kruskal-Wallis test: Test statistic

Then the test statistic is

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} n_i\left(\frac{R_i}{n_i} - \frac{N+1}{2}\right)^2.$$

If H0 is true, then we would expect each $R_i/n_i$ to be fairly close to $(N+1)/2$;
so if H is large it casts doubt on H0.
Computational formula for test statistic:

$$H = \frac{12}{N(N+1)}\left(\sum_{i=1}^{k}\frac{R_i^2}{n_i}\right) - 3(N+1).$$

305
Kruskal-Wallis test: Critical values and P values

Rather than providing a table of critical values specific to the distribution of H under H0, the textbook recommends obtaining critical
values and P values from the chi-square distribution.
Specifically, use $\chi^2_{k-1}$. So we reject H0 at level of significance α if
and only if

$$H > \chi^2_{1-\alpha,\,k-1}.$$
If we don’t reject H0 , the analysis stops. If we do reject H0 , we must
follow up with pairwise comparisons of medians, using a Wilcoxon
rank sum test for each pairwise comparison. This is called the “pro-
tected Kruskal-Wallis method.”

306
Kruskal-Wallis test: Example

(Problem 8.16 from textbook.) To compare the efficacy of three insect repellants, 19 volunteers applied
fixed amounts of repellant to their hand and arm and then placed them in a chamber with several
hundred hungry female Culex erraticus mosquitoes. The repellants were citronella, N,N-diethyl-meta-
toluamide (DEET) 20%, and Avon Skin So Soft hand lotion. The data recorded below are the times in
minutes until first bite; ranks are given in parentheses.

Citronella     DEET 20%     Avon Skin So Soft
5 (1)          12 (10)       6 (2.5)
6 (2.5)        16 (14)       7 (4)
8 (5.5)        25 (16)       9 (7)
8 (5.5)        27 (17)      10 (8)
14 (12)        28 (18)      11 (9)
15 (13)        31 (19)      13 (11)
                            17 (15)
$R_i$  39.5       94          56.5

307
Kruskal-Wallis test: Example

Test statistic:

$$H = \frac{12}{19(20)}\left(\frac{39.5^2}{6} + \frac{94^2}{6} + \frac{56.5^2}{7}\right) - 3(20) = 9.13.$$

Critical value: $\chi^2_{.95,2} = 5.99$.
So we reject H0 . There is statistically significant evidence that the
median time to first bite for at least one repellant is different from
the others.
Pairwise Wilcoxon rank sum tests for each pairwise comparison of
medians show that median times to first bite are significantly differ-
ent for citronella and DEET 20%, and also for Avon Skin So Soft
and DEET 20%; but median times to first bite are not significantly
different for citronella and Avon Skin So Soft.
308
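A minimal Python sketch of the Kruskal-Wallis test on the repellant data above (scipy assumed; not part of the course materials). Note that scipy applies a small correction for tied ranks, so its H may differ slightly from the hand-computed 9.13.

```python
# Sketch: Kruskal-Wallis test on the times to first bite (minutes).
from scipy import stats

citronella = [5, 6, 8, 8, 14, 15]
deet20     = [12, 16, 25, 27, 28, 31]
avon       = [6, 7, 9, 10, 11, 13, 17]

H, p_value = stats.kruskal(citronella, deet20, avon)
print(H, p_value)                      # close to the hand-computed H = 9.13

# Critical value chi^2_{.95,2}:
print(stats.chi2.ppf(0.95, df=2))      # about 5.99, so H0 is rejected
```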
Hypothesis testing for the probabilities of a distribu-
tion of a categorical variable

In all the hypothesis testing situations we’ve considered so far, the


variable of interest was continuous, or discrete with many levels,
e.g.:

• hind tibia lengths of periodical cicadas

• lifespan of male CETP variant carriers

• norepinephrine concentrations in the medulla of toluene-exposed rats

• average number of hot flashes per day in menopausal women

309
Hypothesis testing for the probabilities of a distribu-
tion of a categorical variable
Now we consider situations where the variable of interest is categor-
ical. Examples:

• presence of disease (present or absent)

• color morph of a salamander species (black, red-striped, or


red-spotted)
In such situations, the parameter(s) of interest are the actual prob-
abilities of each category of the variable, e.g. the probability that a
randomly selected individual from the population of interest has the
disease of interest.
Equivalently, we’re interested in population proportions.

310
Hypothesis testing for the probabilities of a distribu-
tion of a categorical variable

In the dichotomous case, the categorical variable of interest has only


two levels, which we generically label as “success” and “failure.”
Write

p = probability of success = proportion of successes in population.

We may want to test hypotheses about p, for example

$$H_0: p = 0.9 \quad \text{versus} \quad H_a: p \neq 0.9.$$

More generally, we may want to test hypotheses about all the pro-
portions, for example whether the ratio of three color morphs in a
salamander population is Black:Red-striped:Red-spotted = 1:2:1.
311
The binomial and proportions tests: Introduction

Consider a situation with a single population. When the variable of
interest is dichotomous, there are just two population proportions of
interest: p and 1 − p. So there's just one population parameter of
interest: p.
Hypotheses about p are of 3 types:

• $H_0: p = p_0$ versus $H_a: p \neq p_0$

• $H_0: p \le p_0$ versus $H_a: p > p_0$

• $H_0: p \ge p_0$ versus $H_a: p < p_0$

Note: we've already dealt with a confidence interval for p (p. 165).

312
The binomial and proportions tests: Introduction

Example: Some years ago, public health officials in Atlanta decided


that if less than 90% of Atlanta children under 6 had received the
DPT vaccine, then they would carry out an immunization program.
Here, p represents the proportion of children under 6 in Atlanta that
had received the DPT vaccine, and the hypotheses of interest are

$$H_0: p \ge 0.90 \quad \text{versus} \quad H_a: p < 0.90.$$

The statistical hypothesis test that can address this exists in 2 versions, depending on the size of the random sample drawn from the
population:

• the binomial test, if $np_0 \le 5$ or $n(1-p_0) \le 5$

• the proportions test, if $np_0 > 5$ and $n(1-p_0) > 5$.

313
The binomial test: Test statistic

The test statistic for the binomial test is simply the number of suc-
cesses in the sample, i.e.

S = # of successes in sample.

This will be an integer between 0 and n (sound familiar?). If H0 is


true, then we would expect S to lie near the middle of the bin(n, p0 )
distribution; if H0 is false then S is likely to lie closer to one of the
extremes (0 or n).

You might have noticed that this is similar to the sign test statistic;
in fact it’s equivalent to the sign test when p0 = 0.5.

314
The binomial test: P values

P value testing approach:

• If Ha is $H_a: p < p_0$, reject H0 if $P(\text{bin}(n, p_0) \le S) < \alpha$

• If Ha is $H_a: p > p_0$, reject H0 if $P(\text{bin}(n, p_0) \ge S) < \alpha$

• If Ha is 2-sided, reject H0 if

$$P[\text{bin}(n, p_0) \le \min(S, n-S)] + P[\text{bin}(n, p_0) \ge \max(S, n-S)] < \alpha$$

315
The binomial test: Beer bottling example

Beer drinkers and brewmeisters have long known that exposure to light can cause a “skunky” taste and
smell in beer. In fact, chemical studies have shown how the light-sensitive compounds in hops called
isohumulones degrade forming free radicals that bond to sulfur to cause the skunky taste. Most bottled
beer is sold in green or brown bottles to prevent this. Miller Genuine Draft (MGD) is claimed to be
made from chemically altered hops that don’t break down into free radicals in light and, therefore, the
beer can be sold in less expensive clear bottles. The company thinks the extra cost of a dark bottle will
pay off for them only if more than 60% of beer drinkers would prefer MGD unexposed to light. In a
taste test of MGD stored for 6 months in light-tight containers or exposed to light, a panel of 20 tasters
preferred the light-tight beer 16 times. Should the company use dark bottles?

Note: Here $np_0 = 20(0.6) = 12$ and $n(1-p_0) = 20(0.4) = 8$, so
the proportions test could be used, but the binomial test is doable
and will give a more accurate answer since the sample size is rather
small.
316
The binomial test: Beer bottling example (continued)

Hypotheses of interest: $H_0: p \le 0.6$ versus $H_a: p > 0.6$.

Test statistic: S = 16

P value:

$$P(\text{bin}(20, 0.6) \ge 16) = 1 - P(\text{bin}(20, 0.6) \le 15) = 1 - .9490 = .051.$$

The company would probably want to do additional taste-panel test-


ing, but if forced to make a decision based merely on this result, they
would be well-advised to use dark bottles (even though the P value
was a bit over .05).

317
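A minimal Python sketch of the binomial-test P value computed above (scipy assumed; not part of the course materials):

```python
# Sketch: beer-bottling binomial test via the binomial distribution.
from scipy import stats

n, p0, S = 20, 0.6, 16

# P value for Ha: p > 0.6 is P(bin(20,0.6) >= 16) = 1 - P(bin(20,0.6) <= 15):
p_value = 1 - stats.binom.cdf(S - 1, n, p0)
print(p_value)    # about 0.051, as on this slide

# Recent scipy versions (>= 1.7) also package the same exact binomial test:
print(stats.binomtest(S, n, p0, alternative="greater").pvalue)
```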
The proportions test: Test statistic and critical values

When the sample sizes are such that $np_0 > 5$ and $n(1-p_0) > 5$, we
can still do the binomial test, but alternatively we can get a good
approximation to it using the normal approximation to the binomial.
The test statistic is

$$z = \frac{S - np_0}{\sqrt{np_0(1-p_0)}},$$

or equivalently,

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$

where $\hat{p}$ is the sample proportion of successes, S/n (defined previously on page 158 of these notes).
Critical values are either $z_{1-\alpha}$, $z_{\alpha}$, or $\pm z_{1-\alpha/2}$ depending on whether
Ha points to the right, to the left, or is two-sided.
318
The proportions test: Atlanta immunization example
Recall the Atlanta immunization example introduced on page 313.
In order to test the hypotheses

$$H_0: p \ge 0.9 \quad \text{versus} \quad H_a: p < 0.9,$$

a random sample of 537 Atlanta children under age 6 was taken. Of
these, 460 had been immunized for DPT. Should the city carry out
the immunization campaign?
Sample proportion: $\hat{p} = 460/537 = 0.857$.

Test statistic: $z = \dfrac{0.857 - 0.90}{\sqrt{0.90(1-0.90)/537}} = -3.32$.

Critical value (at the .01 level of significance): $z_{.01} = -2.33$.

So we reject H0. On this basis, Atlanta officials determined that it
was wise to spend the $'s to carry out the immunization campaign.
319
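A minimal Python sketch of the Atlanta proportions test (scipy assumed; not part of the course materials). The notes round p-hat to 0.857 before computing z, so the unrounded value differs slightly.

```python
# Sketch: one-sample proportions test for the Atlanta DPT example.
from scipy import stats

n, p0, successes = 537, 0.90, 460
p_hat = successes / n

z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
print(p_hat, z)                 # z is about -3.3 (-3.32 with the rounded p-hat used above)

# Left-tail critical value z_{.01} for a test at the .01 level:
print(stats.norm.ppf(0.01))     # about -2.33, so H0 is rejected
```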
Comparing two population proportions

Just as there are situations where we want to compare the means,


µ1 and µ2 , of two populations (for a continuous variable of inter-
est), there are situations where we want to compare the proportions,
p1 and p2 , of two populations (for a dichotomous characteristic of
interest). The hypotheses to be tested may be of 3 types:

• H0 : p1 = p2 versus Ha : p1 6= p2

• H0 : p1  p2 versus Ha : p1 > p2

• H0 : p1 p2 versus Ha : p1 < p2

320
Comparing two population proportions: Test statistic

Suppose that independent random samples of sizes n1 and n2 are


taken from the two populations. We will consider only a large-
sample version of the appropriate test (analogous to the proportions
test rather than the binomial test). For this test to be applicable, we
require

$$n_1 p_1 > 5, \quad n_1(1-p_1) > 5, \quad n_2 p_2 > 5, \quad n_2(1-p_2) > 5.$$

Calculate the 2 sample proportions of successes:

$$\hat{p}_1 = \frac{\text{\# successes in 1st sample}}{n_1}, \qquad \hat{p}_2 = \frac{\text{\# successes in 2nd sample}}{n_2}$$

321
and a "pooled" sample proportion,

$$\hat{p}_c = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{\text{total \# of successes}}{\text{total sample size}}.$$

Rationale for pooled sample proportion: If the null hypothesis is
true, then $\hat{p}_1$ and $\hat{p}_2$ are estimating the same quantity, so we get a
better estimate by combining them (analogous to pooling the sample variance for the t test comparing means when we are willing to
assume the two population variances are equal).
Test statistic:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_c(1-\hat{p}_c)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

322
Comparing two population proportions: Critical val-
ues

This test, like the proportions test for the proportion of a single pop-
ulation, is based on the normal approximation to the binomial dis-
tribution. So we get critical values (and P values) from the standard
normal distribution.
Critical values are $z_{1-\alpha}$, $z_{\alpha}$, or $\pm z_{1-\alpha/2}$ depending on whether Ha
points to the right, to the left, or is 2-sided.

323
Comparing two population proportions: Chronic wast-
ing disease example

On page 166 of these notes, we described a study in which 272 deer were legally
killed by hunters in the Mount Horeb area of SW Wisconsin in 2001-02. From
tissue sample analysis, it was determined that 9 of the deer had chronic wasting
disease (a disease similar to mad cow disease). If 272 deer from the population in
that same region were sampled next winter, and 16 tested positive for the disease,
would that be statistically significant evidence of a change in the infection rate?

Let p1 and p2 represent the proportions of infected deer in these


populations in 2001-02 and 2014-2015, respectively.
Hypotheses tested: $H_0: p_1 = p_2$ versus $H_a: p_1 \neq p_2$

324
Comparing two population proportions: Chronic wast-
ing disease example

Sample proportions:

$$\hat{p}_1 = \frac{9}{272} = .03309, \quad \hat{p}_2 = \frac{16}{272} = .05882, \quad \hat{p}_c = \frac{9+16}{272+272} = .04596$$

Test statistic:

$$z = \frac{.03309 - .05882}{\sqrt{.04596(1-.04596)\left(\frac{1}{272}+\frac{1}{272}\right)}} = -1.43$$

Critical values (taking α = .05): $\pm z_{.975} = \pm 1.96$

So we don't reject H0. If the sample infection rate changed to this
degree, it would not constitute statistically significant evidence for a
change in the population's infection rate.
325
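A minimal Python sketch of the two-sample proportions test above (scipy assumed; not part of the course materials):

```python
# Sketch: comparing two population proportions for the chronic wasting disease example.
from scipy import stats

x1, n1 = 9, 272     # infected deer, 2001-02 sample
x2, n2 = 16, 272    # infected deer, hypothetical later sample

p1, p2 = x1 / n1, x2 / n2
pc = (x1 + x2) / (n1 + n2)                          # pooled sample proportion
z = (p1 - p2) / (pc * (1 - pc) * (1/n1 + 1/n2)) ** 0.5
print(p1, p2, pc, z)                                # z is about -1.43

# Two-sided critical values at alpha = .05:
print(stats.norm.ppf(0.975))                        # 1.96, so H0 is not rejected
```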
Confidence interval for the difference in two popula-
tion proportions

An approximate 100(1 − α)% confidence interval for $p_1 - p_2$ can
also be based on the normal approximation to the binomial distribution. The interval is

$$(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}.$$

For the approximation to be sufficiently good, the sample sizes should
satisfy the conditions listed on page 321 of these notes.

326
Chi-square test for goodness-of-fit: Introduction

In some situations with categorical data, there are more than two
categories (levels) and hence more than one proportion parameter of
interest. Consider the following example.

The nests of the wood ant, Formica rufa, are constructed from small twigs and
wood chips. As part of a study of where ants build these nests, the direction of
greatest slope was recorded for 42 such nests in Pound Wood, Essex, England.
Compass directions were divided into four classes: North, East, South, West. The
direction of greatest slope for the 42 nests were 3, 8, 24, and 7 (in the same order
of listing as the compass directions). Do the ants prefer any particular direction of
exposure over another?

327
Chi-square test for goodness-of-fit: Introduction

This scientific question can be addressed by letting pN , pE , pS , pW


represent the proportions of nests facing these directions for the en-
tire population of wood ants (in this region), and then testing

$$H_0: p_N = p_E = p_S = p_W = \tfrac{1}{4} \quad \text{versus}$$
$$H_a: \text{At least one proportion is different from the others}$$

We might think we could accomplish this by doing 4 tests of population proportions equalling $\tfrac{1}{4}$, or 6 tests of two population proportions
being equal. But neither approach allows us to prespecify the Type
I error probability. So we need a new approach.
The new approach is called the chi-square goodness-of-fit test.

328
Chi-square test for goodness-of-fit: Hypotheses tested

For general use, let p1 , p2 , . . . , pk represent the proportions for the k


categories of the categorical variable. We aim to test

H0 : p1 = p01 , p2 = p02 , . . . , pk = p0k versus


Ha : At least one proportion is different from the others

Here $p_{01}, p_{02}, \ldots, p_{0k}$ are proportions we specify. (In the wood ant
example they are all $\tfrac{1}{4}$.)
Note: The textbook describes H0 and Ha in an equivalent way, but
using words; it doesn’t use the notation p1 , p01 , etc.

329
Chi-square test for goodness-of-fit: Test statistic

Suppose we take a random sample of size n from the population of
interest, and observe the categorical variable of interest for each one.
Define:

• $O_i$ = observed frequency of the ith category

• $E_i = np_{0i}$ = expected frequency of the ith category under H0

Our test statistic is

$$\chi^2 = \sum_{i=1}^{k}\frac{(O_i - E_i)^2}{E_i}.$$

330
Chi-square test for goodness-of-fit: Test statistic

Note:

• $\chi^2 = 0$ only if the observed frequencies exactly match those predicted by the null hypothesis

• The larger the discrepancy between the observed frequencies and what they are expected to be under H0, the larger the value of $\chi^2$.

• Therefore, we should reject H0 for values of $\chi^2$ that are too large.

331
Chi-square test for goodness-of-fit: Critical value

The test statistic is given the symbol $\chi^2$ because it has a chi-square
distribution when H0 is true; and so we get our critical value from a
chi-square distribution.
Specifically, we reject H0 at significance level α if and only if

$$\chi^2 > \chi^2_{1-\alpha,\,k-1}.$$

Note: What we have just tested is what the textbook would call an
extrinsic model. The text also describes an intrinsic model and modifies the $\chi^2$ test slightly for such a model; you can skip this.

332
Chi-square test for goodness-of-fit: Wood ant exam-
ple

Recall the wood ant example, for which the hypotheses to be tested
are

$$H_0: p_N = p_E = p_S = p_W = \tfrac{1}{4} \quad \text{versus} \quad H_a: \text{At least one proportion is different from the others.}$$

Summary of data and partial computation of test statistic:

Direction   $O_i$   $E_i$   $(O_i - E_i)^2/E_i$
North       3       10.5    5.357
East        8       10.5    0.595
South       24      10.5    17.357
West        7       10.5    1.167

333
Chi-square test for goodness-of-fit: Wood ant exam-
ple

Test statistic: $\chi^2 = 5.357 + 0.595 + 17.357 + 1.167 = 24.476$.

Critical value (using α = .01): $\chi^2_{.99,3} = 11.3$.
So we reject H0 , concluding that ants take direction into account
when choosing where to build their nest. In fact, it appears (though
we have not shown it statistically) that ants prefer their nests to face
one particular direction (South) over the others.

334
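A minimal Python sketch of the wood ant goodness-of-fit test (scipy assumed; not part of the course materials):

```python
# Sketch: chi-square goodness-of-fit test for the wood ant nest directions.
from scipy import stats

observed = [3, 8, 24, 7]                  # N, E, S, W
expected = [10.5, 10.5, 10.5, 10.5]       # 42 * (1/4) under H0

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2_stat, p_value)                 # statistic about 24.48

# Critical value chi^2_{.99,3} for a test at alpha = .01:
print(stats.chi2.ppf(0.99, df=3))         # about 11.3, so H0 is rejected
```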
Chi-square test for r × k contingency tables

A contingency table is a rectangular array (a table having rows and


columns) of frequencies of categorical variables. The frequencies in
the table can arise in either of 2 ways:

1. Taking independent random samples from r populations and


recording the frequencies of one categorical variable for each
sample

2. Taking a random sample from 1 population and recording the


cross-classified frequencies of two categorical variables

335
Contingency tables example: The CASH Study

The CASH (Cancer and Steroid Hormone) study was conducted during the 1980s
to investigate the relationship between oral contraceptive use and three cancers
(breast, endometrial, and ovarian) in U.S. women. Part of this very comprehensive
study investigated whether family history of breast cancer was a risk factor for
breast cancer. 4730 women having breast cancer, and 4688 women not having
breast cancer, were asked how many of their first-degree relatives (mother, sister,
daughter) had breast cancer. The results are displayed below:

Breast cancer? 0 1 2 or more


No 4403 279 6 4688
Yes 4171 511 48 4730
8574 790 54 9418

Scientific question: Does family history of breast cancer increase a woman’s own
risk of breast cancer?

336
Contingency tables example: beetles in logs
A method commonly used by ecologists to detect an association between species
(possibly mutualistic, parasitic, or something else) is to take a series of obser-
vational units where the species live or forage, such as ponds or trees, and then
count the number of those units in which both species are found, neither species
is found, or one or the other of the species is found.

In one such study, 500 logs in a forest were sampled for the presence of two beetle
species (labeled here generically as Species A and Species B). Results were as
follows:

                   Species B
Species A?     Present    Absent
Present          202        80       282
Absent           106        112      218
                 308        192      500

Scientific question: Do these results indicate that there is a positive association


between the two species?
337
Chi-square test for contingency tables: Proportions

A contingency table has “cells” (row ⇥ column combinations), and


the frequencies in these cells can be used to estimate the proportion
of individuals in the population(s) that have the relevant characteris-
tics.
For example, in the CASH Study, define

$p_{No,0}$ = proportion of women w/o breast cancer who have 0 relatives w/ breast cancer,
$p_{No,1}$ = proportion of women w/o breast cancer who have 1 relative w/ breast cancer,
$p_{No,\ge 2}$ = proportion of women w/o breast cancer who have two or more relatives w/ breast cancer,

338
with similar definitions for $p_{Yes,0}$, $p_{Yes,1}$, and $p_{Yes,\ge 2}$. Then estimates of, for example, $p_{No,0}$ and $p_{Yes,1}$ are

$$\hat{p}_{No,0} = \frac{4403}{4688} = 0.9392, \qquad \hat{p}_{Yes,1} = \frac{511}{4730} = 0.1080.$$

In the beetles-in-logs study, a single population is sampled, and we
have, for example,

$$\hat{p}_{PP} = \frac{202}{500} = 0.404, \qquad \hat{p}_{AP} = \frac{106}{500} = 0.212.$$

339
Chi-square test for contingency tables: Hypotheses

In words, the hypotheses to be tested are

H0 : The row and column variables are not associated

versus

Ha : The row and column variables are associated.

Another word for “not associated” is “independent” (which your


textbook uses).

340
Chi-square test for contingency tables: Hypotheses

In the CASH example, this is equivalent to

$$H_0: p_{No,0} = p_{Yes,0},\; p_{No,1} = p_{Yes,1},\; p_{No,\ge 2} = p_{Yes,\ge 2}$$

versus

$$H_a: \text{At least one of the equalities in } H_0 \text{ is false.}$$

In the beetles-in-logs example, the hypotheses are equivalent to

$$H_0: \frac{p_{PP}}{p_{PA}} = \frac{p_{AP}}{p_{AA}} \quad \text{versus} \quad H_a: \frac{p_{PP}}{p_{PA}} \neq \frac{p_{AP}}{p_{AA}}.$$

341
Chi-square test for contingency tables: Test statistic

The test statistic for testing these hypotheses is once again a chi-square statistic,

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{k}\frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$

where

• $O_{ij}$ = observed frequency of the ijth cell

• $E_{ij}$ = expected frequency of the ijth cell under H0

• $rk$ = number of cells

We use double subscripts because it seems natural to do so when the
data are laid out in a two-way table.
342
Chi-square test for contingency tables: Test statistic

The way we compute an expected frequency is different in this context than it was in the goodness-of-fit testing context, however. Here,

$$E_{ij} = \frac{i\text{th row total} \times j\text{th column total}}{\text{overall total}}.$$

For example, for the beetles-in-logs example, the $E_{ij}$'s are:

$$E_{11} = \frac{(282)(308)}{500} = 173.712, \qquad E_{12} = \frac{(282)(192)}{500} = 108.288$$

$$E_{22} = \frac{(218)(192)}{500} = 83.712, \qquad E_{21} = \frac{(218)(308)}{500} = 134.288.$$

343
Chi-square test for contingency tables: Critical val-
ues

As in the chi-square goodness-of-fit test, we reject H0 at significance
level α if and only if the test statistic is too large, and "too large" means larger than
the $(1-\alpha)$th percentile of a chi-square distribution.
However, the degrees of freedom are different: they are

$$(r-1)(k-1),$$

where r = # of rows and k = # of columns.

So we reject H0 if and only if

$$\chi^2 > \chi^2_{1-\alpha,\,(r-1)(k-1)}.$$

344
Chi-square test for contingency tables: Breast cancer
example

Review the breast cancer example from several pages back. The
Oi j ’s are

Breast cancer? 0 1 2 or more


No 4403 279 6 4688
Yes 4171 511 48 4730
8574 790 54 9418

and the corresponding Ei j ’s are

Breast cancer? 0 1 2 or more


No 4267.9 393.2 26.9
Yes 4306.1 396.8 27.1
345
Chi-square test for contingency tables: Breast cancer
example

$$\chi^2 = \frac{(4403 - 4267.9)^2}{4267.9} + \frac{(279 - 393.2)^2}{393.2} + \cdots + \frac{(48 - 27.1)^2}{27.1} = 106.91$$

The critical value for a test at significance level 0.005 is $\chi^2_{.995,2} =$
10.6, so we reject H0. The statistical evidence is very strong that a
woman's risk of getting breast cancer is associated with her family
history of breast cancer. In fact, from the sample proportions we can
say that a woman's risk of getting breast cancer is increased if she
has a family history of breast cancer.

346
Chi-square test for contingency tables: Beetles-in-logs
example

We've already given the observed and expected cell frequencies.
The test statistic is

$$\chi^2 = \frac{(202 - 173.712)^2}{173.712} + \frac{(80 - 108.288)^2}{108.288} + \frac{(112 - 83.712)^2}{83.712} + \frac{(106 - 134.288)^2}{134.288} = 27.51$$

The critical value for a test at significance level 0.005 is $\chi^2_{.995,1} =$
7.88, so we reject H0. We conclude that there is an association between the species. In fact, there is a positive association between
species: they occur together more often than we would expect if
they were just distributed randomly among logs.
347
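A minimal Python sketch of the contingency-table test above (scipy assumed; not part of the course materials). Setting correction=False matches the uncorrected chi-square statistic used in these notes.

```python
# Sketch: chi-square test of association for the beetles-in-logs table.
import numpy as np
from scipy import stats

table = np.array([[202, 80],
                  [106, 112]])

chi2_stat, p_value, df, expected = stats.chi2_contingency(table, correction=False)
print(chi2_stat, df)      # about 27.5 on 1 df
print(expected)           # the E_ij's: 173.712, 108.288, 134.288, 83.712

# Critical value chi^2_{.995,1}:
print(stats.chi2.ppf(0.995, df=1))   # about 7.88, so H0 is rejected
```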
Chi-square test for contingency tables: Final remarks
• There is a shortcut for finding the $E_{ij}$'s: compute them in the usual way for cells in the first $r-1$ rows and $k-1$ columns, but get the rest by subtraction from the appropriate row totals and column totals. Thus, in the 2 × 2 case, only $E_{11}$ needs to be computed in the usual way.

• For the chi-square test to be valid, the sample sizes must be sufficiently large. Specifically, what is required is that every $E_{ij} \ge 5$.

• Some authors (ours included) recommend using a continuity correction factor in the chi-square test statistic, particularly if the dimensions of the contingency table are only 2 × 2. You may ignore this recommendation and stick with the $\chi^2$ statistic described in these notes.
348
Correlation and regression analysis: Overview

In some scientific research, the question of interest pertains to the


relationship between two continuous variables. So far in the course
we have not considered situations where there are two continuous
variables (though we did just consider testing for an association be-
tween two categorical variables).
Some examples of such questions:

• Is there a relationship between amount of time a child listens


to Mozart in the womb and their IQ at age 16?

• Is there a relationship between GPA and amount of alcohol


consumed by college students?

349
Correlation and regression analysis: Overview
Let X and Y represent the two continuous variables. Often, it makes
sense to think that one of the two variables may depend on the other;
we let Y represent that variable (the dependent variable) and let X
represent the other variable (the independent, or explanatory, vari-
able).
Statistical approach to understanding the relationship, if any, be-
tween X and Y : We imagine that X and Y exist for each member
of a large (possibly infinitely large) population. We take a finite ran-
dom sample of size n from the population and measure X and Y on
each sampled individual.
This yields data (X1 ,Y1 ), (X2 ,Y2 ), . . . , (Xn ,Yn ) which we may use to
make inferences about the relationship between X and Y for the pop-
ulation as a whole.
350
Correlation and regression analysis: Overview

A useful graphical summary of such data is the scatterplot, which is


a plot of the (Xi ,Yi ) points (with X on the horizontal axis and Y on
the vertical axis).
Terms associated with scatterplots (and with relationships between
X and Y ): linear, nonlinear, positive, negative, perfect, imperfect,
strong, weak, nonexistent.
Example scatterplots:

351
Correlation and regression analysis: Overview

352
Correlation and regression analysis: Overview

Correlation analysis seeks to describe whether an assumed linear


relationship between X and Y is positive or negative, and how strong
it is.
Regression analysis seeks to quantify precisely how much Y tends
to change if X changes by one unit, assuming that the relationship is
linear.
Note:

• Both techniques require the relationship to be linear. Methods


for studying nonlinear relationships between two continuous
variables are considered in more advanced courses.
• Correlation analysis is preliminary to regression analysis.
353
Pearson’s correlation coefficient

For correlation analysis, we need a statistic that measures the sign of
the relationship, and how strong that relationship is. Such a statistic
is Pearson's correlation coefficient,

$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\left[\sum_{i=1}^{n}(X_i - \bar{X})^2\right]\left[\sum_{i=1}^{n}(Y_i - \bar{Y})^2\right]}}
= \frac{\sum_{i=1}^{n}X_iY_i - \frac{1}{n}\left(\sum_{i=1}^{n}X_i\right)\left(\sum_{i=1}^{n}Y_i\right)}{\sqrt{\left[\sum_{i=1}^{n}X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}X_i\right)^2\right]\left[\sum_{i=1}^{n}Y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}Y_i\right)^2\right]}}
= \frac{SS_{XY}}{\sqrt{SS_X \cdot SS_Y}}.$$

354
Pearson’s correlation coefficient

Some properties of r:

• $-1 \le r \le 1$

• $r = 0 \Leftrightarrow$ no linear relationship exists between X and Y

• $r > 0 \Leftrightarrow$ positive linear relationship; $r = 1 \Leftrightarrow$ perfect positive linear relationship

• $r < 0 \Leftrightarrow$ negative linear relationship; $r = -1 \Leftrightarrow$ perfect negative linear relationship

• The closer r is to 1, the stronger the positive linear relationship; the closer r is to $-1$, the stronger the negative linear relationship

355
Pearson’s correlation coefficient for various scatter-
plots

356
Inference for a population correlation coefficient: Hy-
potheses

Imagine computing Pearson's r for the entire population of (X,Y)
values; call this quantity the population correlation coefficient, and
use the symbol $\rho$ for it.
$\rho$ is a population parameter, and r is a point estimate of it.
We often want to address the question, "Are variables X and Y significantly linearly related?" To address this we test

$$H_0: \rho = 0 \quad \text{versus} \quad H_a: \rho \neq 0.$$

One-sided Ha's could be considered too, if our research question
also includes the sign (positive or negative) of the linear relationship.

357
Inference for a population correlation coefficient: Test
statistic and critical values

The appropriate test statistic is

$$t = \frac{r - 0}{\sqrt{(1-r^2)/(n-2)}}.$$

The appropriate critical values are:

• $t_{1-\alpha,\,n-2}$ if Ha is $\rho > 0$

• $-t_{1-\alpha,\,n-2}$ if Ha is $\rho < 0$

• $\pm t_{1-\alpha/2,\,n-2}$ if Ha is $\rho \neq 0$

This testing approach is valid, provided that either X and Y are normally distributed or n is sufficiently large ($n \ge 25$).
358
Correlation analysis example: Relationship between
heart disease and a fatty diet

Data from 22 countries (from 1980) are available on the variables

Y = 100[log(# of deaths from heart disease per 100,000 males aged 55-59) − 2]
X = fat calories as a percent of total calories in diet.

Do these data indicate that heart disease and a fatty diet are associated?

Country Y X Country Y X Country Y X


Australia 81 33 France 45 29 Netherlands 38 37
Austria 55 31 Germany 50 35 New Zealand 72 40
Canada 80 38 Ireland 69 31 Norway 41 38
Sri Lanka 24 17 Israel 66 23 Portugal 38 25
Chile 78 20 Italy 45 21 Sweden 52 39
Denmark 52 39 Japan 24 8 Switzerland 52 33
Finland 88 30 Mexico 43 23 United Kingdom 66 38
United States 89 39

359
Correlation analysis example: Relationship between
heart disease and a fatty diet
Scatter diagram:

For these data, r = 0.45, indicating that there might be a statistically


significant positive linear relationship.
360
Correlation analysis example: Relationship between
heart disease and a fatty diet

Let's test

$$H_0: \rho = 0 \quad \text{versus} \quad H_a: \rho \neq 0$$

at the 0.05 level of significance.

Test statistic: $t = \dfrac{0.45}{\sqrt{(1-0.45^2)/(22-2)}} = 2.23$.

Critical values: $\pm t_{.975,20} = \pm 2.086$ (P value is between 0.02 and 0.05).

So we reject H0, concluding that there is a statistically significant
linear relationship between heart disease and a fatty diet.

361
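A minimal Python sketch of this correlation analysis (scipy assumed; not part of the course materials). The Y and X arrays below are transcribed from the 22-country table on p. 359 of these notes; the notes report r ≈ 0.45 for these data.

```python
# Sketch: Pearson correlation of heart disease (Y) versus fat calories (X).
from scipy import stats

Y = [81, 55, 80, 24, 78, 52, 88, 45, 50, 69, 66, 45, 24, 43,
     38, 72, 41, 38, 52, 52, 66, 89]
X = [33, 31, 38, 17, 20, 39, 30, 29, 35, 31, 23, 21, 8, 23,
     37, 40, 38, 25, 39, 33, 38, 39]

r, p_value = stats.pearsonr(X, Y)
print(r, p_value)                        # r should be about 0.45, per the notes

# Hand-style t test of H0: rho = 0 versus Ha: rho != 0:
n = len(X)
t = r / ((1 - r**2) / (n - 2)) ** 0.5
print(t, stats.t.ppf(0.975, n - 2))      # compare |t| to t_{.975,20} = 2.086
```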
Another correlation analysis example: Relationship
between heart disease and telephone abundance

Data from the same 22 countries from the previous example are also available on
the variable
X = # of telephones per 1000 population.
Are heart disease (Y ) and telephone abundance associated?

Country Y X Country Y X Country Y X


Australia 81 124 France 45 54 Netherlands 38 63
Austria 55 49 Germany 50 43 New Zealand 72 170
Canada 80 181 Ireland 69 41 Norway 41 125
Sri Lanka 24 4 Israel 66 17 Portugal 38 15
Chile 78 22 Italy 45 22 Sweden 52 221
Denmark 52 152 Japan 24 16 Switzerland 52 171
Finland 88 75 Mexico 43 10 United Kingdom 66 97
United States 89 254

362
Another correlation analysis example: Relationship
between heart disease and telephone abundance
Scatter diagram:

For these data, r = 0.47, indicating that there might be a statistically


significant positive linear relationship.
363
Another correlation analysis example: Relationship
between heart disease and telephone abundance

When we test

$$H_0: \rho = 0 \quad \text{versus} \quad H_a: \rho \neq 0$$

at the 0.05 level of significance, we find that the test statistic is
2.37 and the critical values are the same as in the previous example ($\pm t_{.975,20} = \pm 2.086$), so we reject H0.
The conclusion is that there is a statistically significant linear rela-
tionship between heart disease and telephone abundance. Do you
think there’s a cause-effect relationship here?
A proverb in science: “Correlation does not imply causation.”

364
Nonparametric correlation coefficients

In those situations where the sample size is relatively small (< 25)
and the population of values of the variable of interest is not nor-
mally distributed, we cannot safely do correlation analysis with Pear-
son’s correlation coefficient. Instead, we use a nonparametric corre-
lation coefficient, which is based on the ranks of the data.
Our textbook describes the following 2 nonparametric correlation
coefficients:

1. Kendall’s correlation coefficient


2. Spearman’s correlation coefficient

Of these, Spearman’s is easier to deal with so I will present it only


(you’re not responsible for anything regarding Kendall’s).
365
Spearman’s correlation coefficient
Spearman’s correlation coefficient, rS , is merely Pearson’s correla-
tion coefficient of the data’s ranks (rather than of the original val-
ues). Specifically, we rank the Xi ’s from smallest to largest, and do
the same for the Yi ’s (settle ties as before); then plug the ranks into
the formula for Pearson’s r.
Two equivalent formulas for $r_S$ (don't use the 2nd one if there are ties):

1. $r_S = \dfrac{\sum_{i=1}^{n}(r_{X_i} - \bar{r}_X)(r_{Y_i} - \bar{r}_Y)}{\sqrt{\left[\sum_{i=1}^{n}(r_{X_i} - \bar{r}_X)^2\right]\left[\sum_{i=1}^{n}(r_{Y_i} - \bar{r}_Y)^2\right]}}$

   where "r" is used to indicate that the observations have been replaced with their ranks

2. $r_S = 1 - \dfrac{6\sum_{i=1}^{n}d_i^2}{n(n^2-1)}$

   where $d_i = r_{X_i} - r_{Y_i}$ (the difference in the ranks of $X_i$ and $Y_i$).
366
Spearman’s correlation coefficient: Toy example

Suppose we have three (X,Y ) observations, as follows:

Xi Yi rXi rYi di
1.72 0.19
0.58 0.92
1.12 1.54

So $r_S = 1 - \dfrac{6(6)}{3(8)} = -0.5$.

367
Spearman’s correlation coefficient: Hypothesis test-
ing

We can use Spearman's correlation coefficient to test

$$H_0: \rho_S = 0 \quad \text{versus} \quad H_a: \rho_S \neq 0$$

or either one-sided version. Here $\rho_S$ represents the value of $r_S$ when
computed using every member of the population (usually an impossible task!).
The test statistic is merely $r_S$ itself, and the critical value(s) are found
in Table C.13 in the textbook. For a specified significance level α,
we use the "2-tail column" in the table if Ha is 2-sided; otherwise
we use the "1-tail column." Reject H0 if $r_S$ is more extreme than the
critical value(s).

368
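A minimal Python sketch of computing Spearman's coefficient (scipy assumed; not part of the course materials). The small data set below is purely hypothetical, for illustration; note that scipy's reported p-value comes from an approximation rather than from Table C.13, so small-sample decisions may differ slightly from the table-based approach used in this course.

```python
# Sketch: Spearman's r_S as Pearson's r computed on ranks.
from scipy import stats

# Hypothetical paired observations (illustration only):
X = [3.1, 1.4, 4.2, 2.8, 5.0, 3.9]
Y = [10.0, 6.2, 12.5, 9.1, 15.3, 8.0]

r_s, p_value = stats.spearmanr(X, Y)
print(r_s, p_value)

# Equivalent calculation: Pearson's r applied to the ranks of the data.
rX = stats.rankdata(X)
rY = stats.rankdata(Y)
print(stats.pearsonr(rX, rY)[0])     # matches r_s
```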
Spearman’s correlation coefficient: Examples
Let’s revisit the example of heart disease versus fatty diet considered
previously. For those data we have
n = 22, rS = 0.39.
The critical values for a two-sided test at α = 0.05 are ±0.425, so
we do not reject H0. (In fact, 0.05 < P < 0.10.) So we conclude
that there is not statistically significant evidence of a relationship
between heart disease and a fatty diet.
Recall that with the parametric test (the t-test), we concluded there
was statistically significant evidence of a linear relationship between
these 2 variables. What gives? Two probable explanations:
• Nonparametric tests are not as powerful as parametric tests
• There are 2 outliers that strongly affect r but not rS
369
Spearman’s correlation coefficient: Examples

When we consider the example of heart disease versus telephone


abundance, we find that

$$n = 22, \quad r_S = 0.54.$$

The critical values for a two-sided test at α = 0.05 are ±0.425, so
here we do reject H0. (In fact, 0.01 < P < 0.02.) So we conclude
that there is statistically significant evidence of a relationship between heart disease and telephone abundance.
In comparison to the results based on Pearson’s correlation coeffi-
cient, this analysis yields even stronger evidence against H0 . Note
that there are no influential outliers in the scatterplot.

370
Correlation analysis: Final remarks
• Correlation analysis assumes that a linear relationship exists
between Y and X (i.e. Yi = AXi + B, possibly with A = 0).
• Correlation analysis seeks to determine if the linear relation-
ship is positive, negative, or 0; and how strong it is.
• Effect of transformations:
– Linear transformations, $U_i = A_U X_i + B_U$ and $V_i = A_V Y_i + B_V$, have no effect on either r or $r_S$
– Monotone increasing transformations (like taking logs)
have no effect on rS (because the ranking remains the
same), but r may change
– Non-monotonic nonlinear transformations affect r and rS
in unpredictable ways

371
Regression analysis: Introduction

Regression analysis adds to the results of a correlation analysis.


Specifically, using a regression analysis we can address the follow-
ing questions:

1. What is the equation of the straight line that best fits the data?

2. What amount of increase (or decrease) in Y can I expect by


increasing X by a certain amount?

3. What do I predict the value of Y to be at a value of X that’s


not in my data?

372
Simple linear regression analysis: Introduction

Regression analysis is a HUGE topic in scientific research — many


hundreds of books are devoted entirely to this subject. We will con-
sider a very small piece of one small subtopic, called simple linear
regression analysis.

• Regression refers to predicting the value of one variable from


the value of others

• Linear refers to the assumption that the functional form of


the relationship between Y (actually the mean of Y ) and X is
linear

• Simple refers to the use of merely one independent (or ex-


planatory) variable to do the prediction of Y

373
Simple linear regression analysis: Conceptual foun-
dation

We imagine that at each value of X, there is a population of values
of Y. Each such population of Y-values has a mean and a variance,
but this mean and variance could conceivably depend on the value
of X. So we represent them by

$$\mu_{Y|X} \quad \text{and} \quad \sigma^2_{Y|X}.$$

Furthermore, we assume that

$$\mu_{Y|X} = \alpha + \beta X;$$

that is, we assume that $\mu_{Y|X}$ is a linear function of X.

374
Simple linear regression analysis: Conceptual foun-
dation

Here:

• α is the "Y-intercept," while β is the "slope."

• α and β are unknown parameters, and part of our task in regression analysis is to estimate them from the available data.

How should we estimate α and β?

• By eye?

• By the method of least squares (due to Legendre in 1805)

375
Least squares estimation of α and β

The "least squares estimates" of α and β are those values which
minimize the sum of squares of vertical deviations from the data to
the line.
Picture:

376
Least squares estimation of α and β

Minimization methods from calculus can be used to show that the
desired estimates are

$$\hat{\beta} = \frac{\sum_{i=1}^{n}X_iY_i - \frac{\left(\sum_{i=1}^{n}X_i\right)\left(\sum_{i=1}^{n}Y_i\right)}{n}}{\sum_{i=1}^{n}X_i^2 - \frac{\left(\sum_{i=1}^{n}X_i\right)^2}{n}} = \frac{SS_{XY}}{SS_X}, \qquad (\text{"}b\text{" in book})$$

$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}. \qquad (\text{"}a\text{" in book})$$

Notice that we must obtain $\hat{\beta}$ first, then $\hat{\alpha}$.

Once we have $\hat{\alpha}$ and $\hat{\beta}$, we can plot the least squares line,

$$Y = \hat{\alpha} + \hat{\beta}X,$$

on the scatterplot.
377
Least squares estimation of α and β: Toy example

Suppose we have five (X,Y ) observations, as follows:


Xi Yi
1 0
3 3
4 2
2 1
2 2

b̂ =

â =

378
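A minimal Python sketch of the least squares estimates left blank above, using the toy (X, Y) values as they appear in the printed table on this page (numpy/scipy assumed; not part of the course materials):

```python
# Sketch: least squares estimates from the SS formulas, checked against scipy.
import numpy as np
from scipy import stats

# Toy data as printed on this page of the notes:
X = np.array([1.0, 3.0, 4.0, 2.0, 2.0])
Y = np.array([0.0, 3.0, 2.0, 1.0, 2.0])
n = len(X)

ss_xy = np.sum(X * Y) - X.sum() * Y.sum() / n
ss_x = np.sum(X**2) - X.sum()**2 / n
beta_hat = ss_xy / ss_x                    # slope estimate
alpha_hat = Y.mean() - beta_hat * X.mean() # intercept estimate
print(beta_hat, alpha_hat)

fit = stats.linregress(X, Y)
print(fit.slope, fit.intercept)            # should match beta_hat and alpha_hat
```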
Prediction using the least squares regression line

One of our main objectives in a regression analysis is to predict what


the value of Y would be at any X value. Using the least squares
regression line we can do this, and we can also predict what the
mean of all Y -values at any X value would be.
Let x be the specified value of X at which we want to predict Y. To
predict the population mean of Y-values at x, we merely plug x into
the equation of the least squares regression line:

$$\hat{\mu}_{Y|x} = \hat{\alpha} + \hat{\beta}x.$$

To predict a single observation of Y at x, we do the same, i.e.

$$\hat{Y}|x = \hat{\alpha} + \hat{\beta}x.$$

379
Prediction using the least squares regression line

Illustrations, using the previous toy example:

Caution: Do not extrapolate too far from the interval of observed


X’s!

380
Simple linear regression analysis: Assumptions for
inference

$\hat{\alpha}$, $\hat{\beta}$, and $\hat{\mu}_{Y|x}$ are all point estimates of their respective parameters.
To obtain confidence intervals for these parameters and do hypothesis tests, we need to make some further assumptions. We assume
that:

• $\mu_{Y|X} = \alpha + \beta X$ (as before)

• $\sigma^2_{Y|X}$ actually does not depend on X, so we write it more simply as $\sigma^2_Y$

• Each population of Y values is normally distributed

• All observations of Y are sampled independently

381
Simple linear regression model

All of these assumptions can be summarized and written as a model


for our data, as follows:

$$Y_i = \alpha + \beta X_i + \varepsilon_i, \qquad \varepsilon_i \sim \text{independent } N(0, \sigma^2_Y), \quad i = 1, \ldots, n.$$

Picture:

This model is called the simple linear regression model. Its parameters are α, β, and $\sigma^2_Y$.

382
Estimation of residual variance

We've already estimated α and β; it remains to estimate $\sigma^2_Y$. The
estimate is

$$s_Y^2 = \frac{\sum_{i=1}^{n}\left[Y_i - (\hat{\alpha} + \hat{\beta}X_i)\right]^2}{n-2}.$$

This estimate is very nearly the average squared vertical deviation
from the data to the least squares line; if the divisor were n rather
than $n-2$ it would be exactly the average squared vertical deviation
from the data to the least squares line.
We use sY to perform inference for many other quantities in regres-
sion analysis.

383
Estimation of residual variance: Toy example
For the toy example on page 378 of these notes, recall that

$\hat{\beta} = \qquad$ , $\qquad \hat{\alpha} =$

Can add columns to the table of data to facilitate calculation of $s_Y^2$:

$X_i$   $Y_i$   $\hat{\alpha} + \hat{\beta}X_i$   $Y_i - (\hat{\alpha} + \hat{\beta}X_i)$
1       0
3       3
4       2
2       1
2       2

So $s_Y^2 =$

384
Confidence interval for Y|x

A 100(1 − α)% confidence interval for $Y|x$ is

$$(\hat{\alpha} + \hat{\beta}x) \pm t_{1-\alpha/2,\,n-2}\, s_Y \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{X})^2}{SS_X}}.$$

Toy example: Confidence interval for $Y|X = 2.5$ is

385
