Contents

1 What is statistics?
1.1 Types of numerical data
1.2 Ways to summarize data
1.2.1 Descriptive statistics
1.2.2 Frequency tables and graphs
1.3 Examples
1.3.1 Frequency tables
1.3.2 Bar charts
1.3.3 The box plot
3 Probability distributions
3.1 Distributions of interest
3.1.1 The binomial distribution (discrete)
3.1.2 Reading the standard-normal table
3.1.3 Examples
3.2 Standardization
4 Statistical inference
4.1 Sampling distributions
4.2 The Central Limit Theorem
4.2.1 Cholesterol level in U.S. males 20-74 years old
4.2.2 Level of glucose in the blood of diabetic patients
4.3 Hypothesis testing
4.3.1 Hypothesis testing involving a single mean and known variance
4.4 Implications of each step in hypothesis testing
4.4.1 Diabetes example
4.5 Hypothesis testing involving means and unknown variance
4.5.1 Concentration of benzene in a cigar
4.5.2 Concentration of benzene in cigars
4.5.3 Computer implementation
4.6 Analyses involving two independent samples
4.6.1 Serum iron levels and cystic fibrosis
4.6.2 Testing of two independent samples (assuming equal variances)
4.6.3 Paired samples
4.6.4 Hypothesis testing of paired samples
5 Estimation
5.1 Confidence intervals
5.2 Estimation for the population mean (σ known)
5.2.1 Characteristics of confidence intervals
5.2.2 Distribution of cholesterol levels
5.2.3 One-sided confidence intervals
5.2.4 Anemia and lead exposure
5.3 Confidence intervals when σ is unknown
5.3.1 Antacids and plasma aluminum level
5.3.2 Computer implementation
5.4 Confidence intervals of a difference of two means
5.4.1 Serum iron levels and cystic fibrosis
5.5 Performing hypothesis testing using confidence intervals
5.5.1 Computer implementation
5.5.2 One-sided tests
8 Contingency tables
8.0.3 Computer implementation
8.1 The odds ratio
8.1.1 Testing the hypothesis of no association (using the odds ratio)
8.1.2 Confidence intervals
8.1.3 Computer implementation
8.2 Combining 2 × 2 contingency tables
8.2.1 Confidence intervals of the overall odds ratio
8.2.2 The Mantel-Haenszel test for association
8.2.3 Computer implementation
10 Correlation
10.1 Characteristics of the correlation coefficient
10.2 Hypothesis testing for ρ = 0
10.2.1 Computer implementation

Figures and tables
8.1 Chi-square distribution with one degree of freedom
3.1 Table A.1 Areas in one tail of the standard normal curve
Chapter 1
What is statistics?
• Obtain data
• Analyze data
• Present data
What is statistics anyway?
• Statistics is the summary of information (data) in a meaningful fashion, and its ap-
propriate presentation
• Biostatistics is the segment of statistics that deals with data arising from biological
processes or medical experiments
• Nominal data
Numbers or text representing unordered categories (e.g., 0=male, 1=female)
• Ordinal data
Numbers or text representing categories where order counts (e.g., 1=fatal injury, 2=severe injury, 3=moderate injury, etc.)
• Discrete data
Numerical data where both ordering and magnitude are important but only whole-number values are possible (e.g., the number of deaths caused by heart disease (765,156 in 1988) versus suicide (40,368 in 1988); page 10 in text).
• Continuous data
Numerical data where any conceivable value is, in theory, attainable (e.g., height,
weight, etc.)
1. Mean
The mean is the average of the measurements in the data. If the data are made up
of n observations x1, x2, ..., xn, the mean is given by the sum of the observations
divided by their number, i.e.,
x̄ = (1/n) Σ_{i=1}^{n} x_i

where the notation Σ_{i=1}^{n} means "sum of terms counted from 1 to n". For example, if the data are x1 = 1, x2 = 2, x3 = 3, then their average is x̄ = (1/3)(1 + 2 + 3) = 2.
2. Median
The median is the middle observation according to the observations' rank in the data.
In the previous example, the median is m = 2. The median is the observation with
rank (n + 1)/2 if n is odd, or the average of the observations with ranks n/2 and
n/2 + 1 if n is even.
Note what would happen if x3 = 40 in the above example. Then the mean is x̄ = (1/3)(1 + 2 + 40) = 14.33. However, the median is still m = 2. In general, the median is less
sensitive than the mean to extremely large or small values in the data.
Thus, when data are skewed to the left (there are a large number of small values), then the
mean will be smaller than the median. Conversely, if the data are skewed to the right (there
is a large number of high values), then the mean is larger than the median.
For example, the distribution of light bulb lifetimes (time until they burn out) is skewed to
the right (i.e., most burn out quickly, but some can last longer). Next time you buy a light
bulb, notice whether the mean or the median life of the light bulb is quoted on the package
by the manufacturer. Which statistic would be most appropriate?
Measures of spread
The most common measures of spread (variability) of the data are the variance and the
standard deviation.
1. Variance
The variance is the average of the squared deviations of the observations from
the mean. The deviations are squared because we are only interested in the size of
the deviation rather than its direction (larger or smaller than the mean). Note also
that Σ_{i=1}^{n} (x_i − x̄) = 0. Why? The variance is given by

s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²

where x1, x2, ..., xn are the data observations and x̄ is their mean. The variance of
x1 = 1, x2 = 2, x3 = 3 is s² = (1/2)[(1 − 2)² + (2 − 2)² + (3 − 2)²] = 1.
The reason that we divide by n − 1 instead of n has to do with the number of
"information units" in the standard deviation. Try to convince yourself that, after
estimating the sample mean, there are only n − 1 independent (i.e., a priori
unknown) observations in our data. Why? (Hint: use the fact that the sum of the
deviations from the sample mean is zero.)
2. Standard deviation
The standard deviation is given by the square root of the variance. It is attractive,
because it is expressed in the same units as the mean (instead of square units like the
variance).
s = √( (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)² )
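These definitions are easy to verify numerically. A minimal Python sketch (an addition to the notes, using the example above with the extreme value x3 = 40):

data = [1, 2, 40]
n = len(data)
mean = sum(data) / n                                 # (1/n) * sum(x_i) -> 14.33
median = sorted(data)[(n - 1) // 2]                  # middle observation -> 2
var = sum((x - mean) ** 2 for x in data) / (n - 1)   # divide by n - 1, not n
sd = var ** 0.5                                      # same units as the mean
print(mean, median, var, sd)

Note how the single extreme value drags the mean far from the median, as discussed above.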
Graphs
• Bar charts
• Frequency polygons
• Scatter plots
• Line graphs
• Box plots
1.3 Examples
1.3.1 Frequency tables
Table 1.1: Frequencies of serum cholesterol levels

Cholesterol level              Cumulative    Relative        Cumulative relative
(mg/100 ml)       Frequency    frequency     frequency (%)   frequency (%)
80-119                 13           13            1.2              1.2
120-159               150          163           14.1             15.3
160-199               442          605           41.4             56.7
200-239               299          904           28.0             84.7
240-279               115         1019           10.8             95.5
280-319                34         1053            3.2             98.7
320-359                 9         1062            0.8             99.5
360-399                 5         1067            0.5            100.0
Total                1067                        100.0
The choice of intervals in a frequency table is very important. Unfortunately, there are no
established rules for determining them. Just make sure that a "cut-off" value is the beginning
point of one of the intervals. In the table above, the value of 200 mg/100 ml of cholesterol
is such a value.
The following coded data give the cause of death for 100 accident victims (see the key below):
1 5 3 1 2 4 1 3 1 5
2 1 1 5 3 1 2 1 4 1
4 1 3 1 5 1 2 1 1 2
5 1 1 5 1 5 3 1 2 1
2 3 1 1 2 1 5 1 5 1
1 2 5 1 1 2 3 4 1 1
1 1 2 1 1 2 1 1 2 3
3 3 1 5 2 3 5 1 3 4
1 1 2 4 5 4 1 5 1 5
5 1 1 5 1 1 5 1 1 5
1. Motor vehicle, 2. Drowning, 3. House fire, 4. Homicide, 5. Other
Table 1.2: U.S.A. cigarette consumption, 1900-1990

Year    Number of cigarettes
1900        54
1910       151
1920       665
1930      1485
1940      1976
1950      3522
1960      4171
1970      3985
1980      3851
1990      2828
. tab accident
[Figure: line graph of the U.S. cigarette consumption data in Table 1.2, 1900-1990]
6. Find the lower adjacent value: LAV = smallest value in the data that is greater than or equal
to the lower fence
7. Find the upper adjacent value: UAV = largest value in the data that is smaller than or equal
to the upper fence
. label val accident acclab
[Figure: bar chart of the reasons of death (Motor vehicle, Drowning, House fire, Homicide, Other); vertical axis: frequency, 0-60]
8. Any value outside the LAV or UAV is called an outlier and should receive extra
attention
Consider the following depression scale scores:
2 5 6 8 8 9 9
10 11 11 11 13 13 14
14 14 14 14 14 15 15
16 16 16 16 16 16 16
16 17 17 17 18 18 18
19 19 19 19 19 19 19
19 20 20
7. Find the upper adjacent value: UAV = largest value in the data ≤ 25.5, so UAV = 20
8. Since 2 and 5 are lower than the LAV , these observations are outliers and must be
investigated further
[Figure: box plot of the depression scale scores (depscore)]
Chapter 2

Probability
Consider what happens when you roll a die: the possible outcomes of a single die roll are the
six faces. What if you rolled two dice? In that case you would get 36 possible pairs of faces.
2.1 Events
The outcome of each die roll is called an event. Events are also coin flips, results of experi-
ments, the weather and so on.
Two events that cannot both happen are called mutually exclusive events. For example,
event A = "Male" and B = "Pregnant" are two mutually exclusive events (as no males can be
pregnant).
To envision events graphically, especially when all the outcomes are not easy to count, we
use diagrams like the following one that shows two mutually exclusive events.
Figure 2.3: Mutually exclusive events
Note S, the sample space is context specific. In the previous example, S=“Human”, but
in the example of the single die roll S = {1, 2, 3, 4, 5, 6}.
1. Event intersection
Consider the following figure: the intersection of events A and B consists of all cases where the two events overlap. For example, if A = "Face of die is odd" and B = "Number is less than 3", then A ∩ B = {1}. Note that if A and B are mutually exclusive, then their intersection is the null event (i.e., A ∩ B = ∅).
2. Union of two events
The union of events A and B is comprised of all outcomes consistent with either A or
B or both. In the above example, the union of the two events A = "Face of die is odd"
and B = "Number is less than 3" is A ∪ B = {1, 2, 3, 5} (Figure 2.5).
3. Complement of an event
The complement of an event A, denoted as Ac is comprised of all outcomes that are not
compatible with A. Note that A ∪ Ac = S since all outcomes either will be contained
in A or its complement (not A). This is seen by the following figure:
By the definition of the intersection, events A and Ac are mutually exclusive (i.e.,
A ∩ Ac = ∅). This is because there is no outcome that is consistent with both A and Ac
(so their intersection is the null event as shown above).
2.3 Probability
We define a measure of the likelihood of the occurrence of an event A as

P(A) = n_A / n

as n becomes large, where n_A is the number of times A occurs in n repetitions of the experiment. According to this (the "frequentist") definition, probability is the "long-run frequency" of occurrence of the event.
For example, if A = "Die comes up 1", then n = 6 equally likely faces and n_A = 1, so P(A) = 1/6.
By the definition, P (S) = 1 and P (∅) = 0 but in general 0 ≤ P (A) ≤ 1.
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

You can verify this visually from Figure 2.5, or by considering the fact that A = (A ∩ Bc) ∪ (A ∩ B) and B = (B ∩ Ac) ∪ (A ∩ B), so by taking the union we incorporate the intersection event A ∩ B twice in the calculations and thus need to remove it once. As a special case, when A and B are mutually exclusive events (since then A ∩ B = ∅), the above reduces to

P(A ∪ B) = P(A) + P(B)
P(B|A) = P(A ∩ B) / P(A)

and consequently, P(A ∩ B) = P(B|A) P(A).
Proof (Rozanov, 1977): By the definition of probability, P(A) = n_A/n and P(B) = n_B/n. Now,
since A is given to occur, event B can only occur among points that are compatible
with the occurrence of A (i.e., the relevant total is now n_A). Also notice that, given that A occurs, the
occurrence of B means that both A and B will occur simultaneously (i.e., event A ∩ B will
occur). By the definition of probability then,

P(B|A) = n_{A∩B} / n_A = (n_{A∩B}/n) / (n_A/n) = P(A ∩ B) / P(A)
Compute the probability of the event B | A=“A 60 year-old person in the U.S. will live to
the age of 65.”
From life tables collected on the U.S. population, it is known that out of 100,000 individuals
born, in 1988, 85,331 have reached 60 years of age while 79,123 have reached 65 years of
age. Given the large n we can consider these proportions as reasonably accurate estimates
of P (A) and P (B). That is,
P (A) = P (“Lives to 60”) ≈ 0.85
P (B) = P (“Lives to 65”) ≈ 0.79
Also, notice that P(A ∩ B) = P("Lives to 60" and "Lives to 65") = P("Lives to 65") =
P(B) ≈ 0.79. Finally,

P(B|A) = P(A ∩ B) / P(A) = 0.79 / 0.85 ≈ 0.93

That is, a person at birth has a 79% chance of reaching 65, but a 60-year-old has a 93% chance of reaching the same age. The reason of
course is that all situations where an individual would have died prior to having reached 60
years of age (i.e., the elements of S that are incompatible with A) have been excluded from
the calculations (by the division with P (A)).
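The arithmetic is immediate from the life-table counts. A minimal Python sketch (an addition, not part of the original notes):

n_60, n_65 = 85_331, 79_123        # out of 100,000 births
p_A = n_60 / 100_000               # P("Lives to 60") ~ 0.85
p_AB = n_65 / 100_000              # P("Lives to 60" and "Lives to 65") = P(B) ~ 0.79
print(p_AB / p_A)                  # P(B|A) ~ 0.93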
For example, the events A = "Heads on the first toss" and B = "Tails on the second toss" of a
coin are independent events. Having observed a head on the previous throw does not change
the probability of tails in the current throw. That is, P(B|A) = P(B) = 0.5 and
P(A ∩ B) = P(A) × P(B) = 0.25 (i.e., the sequence {H, T} has probability 0.25).
Consider the event A = "The sum of two dice is 7". To compute this probability, we must realize that each pair of faces is a mutually exclusive event (since you cannot have 4+3 and 5+2 in the same toss), and thus P(A) = P[(1,6) ∪ (2,5) ∪ ... ∪ (6,1)] = P(1,6) + P(2,5) + ... + P(6,1) by the additive rule. In addition, each die is rolled independently, so for example P(1,6) = P(1 ∩ 6) = P(1)P(6) = (1/6)(1/6) = 1/36 by the multiplicative rule. The same of course holds true for the other sums. Thus,

P("Sum = 7") = 1/36 + 1/36 + ... + 1/36 = 6/36 = 1/6
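Since the sample space here is small, the result can also be checked by brute-force enumeration; a short Python sketch (an addition to the notes):

outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]   # all 36 pairs
sevens = [o for o in outcomes if o[0] + o[1] == 7]              # (1,6), (2,5), ..., (6,1)
print(len(sevens), len(outcomes))                               # 6 36
print(len(sevens) / len(outcomes))                              # 0.1666... = 1/6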
2.7 Diagnostic tests
Consider the following events:
• D = “Disease is present”
• D c = “Disease is absent”
• T + = “Positive test result (test detects disease)”
• T − = “Negative test result (test does not detect disease)”
In diagnostic-testing situations, the following “performance parameters” of the diagnostic
procedure under consideration will be available:
• P (T+ |D) = “Sensitivity (true positive rate) of the test”
• P(T+|Dc) = "Probability of a false positive test result"
• P(T−|D) = "Probability of a false negative test result"
• P(T−|Dc) = "Specificity (or true-negative rate) of the test"
In addition, in order to derive estimates of the PVP (and of the predictive value of a negative
test, PVN = P(Dc|T−)), we will need an estimate of the overall probability of disease in the
general population. This is called the prevalence of the disease, P(D).
Goal: Find P(D|T+), the predictive value of a positive test result (or PVP); that is, find
the probability that a subject has the disease given a positive test.
Consider the following data relating X-ray screening results to tuberculosis status (the Negative row and the totals are obtained by subtraction):

                    Tuberculosis
X-ray result      Yes       No    Total
Positive           22       51       73
Negative            8     1739     1747
Total              30     1790     1820

From this table we can derive approximate estimates for the sensitivity and specificity of
the X-ray as a diagnostic test. For example, P(T+|D) ≈ 22/30 = 0.7333. Notice that since D
is "given", the sample space is comprised only of the 30 tuberculosis cases in the first column.
Similarly, P(T−|Dc) ≈ 1739/1790 = 0.9715.
To find the PVP, write

P(D|T+) = P(D ∩ T+) / P(T+) = P(T+|D) P(D) / P(T+)
Since we do not know P(T+), let us consult Figure 2.7. From the figure it is seen that
T+ = (D ∩ T+) ∪ (T+ ∩ Dc), so that

P(T+) = P[(D ∩ T+) ∪ (T+ ∩ Dc)] = P(D ∩ T+) + P(T+ ∩ Dc)

since D ∩ T+ and T+ ∩ Dc are mutually exclusive events (using the additive rule). Then,
substituting above, we have

P(D|T+) = P(T+|D)P(D) / [P(T+|D)P(D) + P(T+|Dc)P(Dc)]
        = (0.7333)(0.000093) / [(0.7333)(0.000093) + (0.0285)(0.999907)] ≈ 0.00239
For every 100,000 positive x-rays, only 239 signal true cases of tuberculosis. This is called
the “false positive paradox”. Note also how we have incorporated the evidence from the
positive X-ray in the calculation of the probability of tuberculosis.
Before the X-ray P (D) = prior probability of disease = 0.000093. After the presence of
a positive test result we have P (D| T +) = posterior probability of disease (updated in the
presence of evidence)= 0.00239. So, although the probability of tuberculosis is low, we
have in fact reduced our degree of uncertainty 26-fold (0.00239/0.000093).
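The whole chain of reasoning fits in one small function. A Python sketch (the function name is illustrative, not from the text):

def pvp(sens, spec, prev):
    # Bayes theorem: P(D|T+) = P(T+|D)P(D) / [P(T+|D)P(D) + P(T+|Dc)P(Dc)]
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

print(pvp(22 / 30, 1739 / 1790, 0.000093))   # ~0.00239, the "false positive paradox"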
In a similar example involving testing for drug use, we have P("Drug user") = P(D) = 0.05, the
prevalence of drug use, and (from the counts in Table 2.3 below) a test with sensitivity and
specificity both about 0.95. So finally, the predictive value of a positive test is only PVP = P(D|T+) ≈ 48/96 = 0.50.
Why does this happen? To answer this consider a representative group from the general
population as in Table 2.3. Approximately 48 (≈ 50 × 0.95) out of the 50 drug users in this
Table 2.3: Expected number of drug users among 1,000 individuals randomly selected from the
general population (Negative row and totals obtained by subtraction)

                      Drug use
Drug test result    Yes     No    Total
Positive             48     48       96
Negative              2    902      904
Total                50    950    1,000
group of 1,000 individuals will test positive, but so will (by mistake; a false positive result)
48 (≈ 0.05 × 950) of the 950 non-drug users. Thus, only half of the 96 positive drug tests
will have detected true cases of drug use (and thus PVP ≈ 50%). In general, when a disease
(or, as in this case, drug use) is rare, even an accurate test will not easily reverse our initial
(prior) low probability of its occurrence.
2.8 Bayes Theorem
If A1, A2, . . . , An are mutually exclusive events whose union is S (i.e., these events
account for all possible outcomes or events without overlap), and suppose that the
probabilities P (B|Ai), P (Ai), i = 1 . . . , n are known. Then, P (Ai|B), i = 1, . . . , n is
given by
P(Ai|B) = P(B|Ai) P(Ai) / [ P(B|A1)P(A1) + ... + P(B|Ai)P(Ai) + ... + P(B|An)P(An) ]
It is easily seen that diagnostic testing is a special case of the Bayes Theorem. In the case of
calculating the predictive value of a positive test (PV P ), then n = 2 and D ≡ A1, Dc ≡
A2 and T + ≡ B. In the case of the PV N , then T − ≡ B.
2.9 Bibliography
1. Pagano M and Gauvreau K. Principles of Biostatistics. Duxbury Press.
Chapter 3

Probability distributions
A random variable is a measurement whose observed values are the outcomes of a random
experiment. In this sense, its values cannot be a priori determined. That is, we do not know
what the values of the random variable are going to be before we collect the sample, run the
experiment, etc.
The mechanism determining the probability or chance of observing each individual value
of the random variable is called a probability distribution (as it literally distributes the
probability among all the possible values of the random variables). Probability distributions
are defined through frequency tables, graphs, or mathematical expressions.
There are two types of probability distributions corresponding to the two kinds of random
variables:
1. Discrete probability distributions (Figure 2A): These handle cases where the random
variable can take only isolated (countable) values, such as counts.
2. Continuous probability distributions (Figure 2B): These handle cases where all
possible (real) numbers can be observed (e.g., height or weight). Note that large or
infinite numbers of countable (i.e., discrete) values are usually handled by continuous
distributions¹.
Note. Unlike the discrete case, in the case of a continuous distribution the probability
of observing any individual number is zero! Only intervals have non-zero
probability. Those probabilities are equal to the area between the x axis and the probability
(density) curve.
¹In fact, one may argue that, given the finite precision with which measurements can be made, there are
no truly continuous data!
Figure 3.1: Examples of probability distribution functions
2. All Bernoulli trials are mutually independent of all the others (i.e., information on
the outcome of one does not affect the chances of any other)
3. There are two possible outcomes, usually denoted as "success" = 1 and "failure" = 0
The formula producing the probabilities of all possible arrangements of successes and
failures is
P(X = j) = C_j^n π^j (1 − π)^{n−j}

where C_j^n = n! / (j!(n − j)!) is the number of ways of actually having j successes out of n
trials. (The notation n! = n(n − 1)...1 is called "n factorial".)
For example, if n = 4 and j = 2 then the possible ways to have two ones (successes)
among four trials, is 4!/(2!2!) = 24/[(2)(2)] = 6. Enumerating these we have: [1100], [1010],
[1001], [0110], [0101], [0011].
Now if the probability of a success in each Bernoulli trial is π = 0.5 (say flipping a coin with
“heads” considered as the “success”) then the probability of two successes out of four trials
is P(X = 2) = (6)(0.5)²(1 − 0.5)^{4−2} = (6)(0.25)(0.25) = 0.375. In the coin-tossing
experiment that would mean that there is about a 38% probability of seeing two heads out of
four tosses.
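A one-line check of this arithmetic, using scipy's binomial distribution (a Python addition, not from the original notes):

from scipy.stats import binom

print(binom.pmf(2, 4, 0.5))   # 0.375, matching the hand calculation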
The mean of the binomial distribution is

µ = nπ
This is intuitive. Consider the probability of “heads” π = 0.5 in a coin flip. Then if you toss
the coin n times you would expect heads approximately half of the time. Less intuitive is
the variance of the binomial distribution. This is given by
σ2 = nπ(1 − π)
f(x) = ( 1/√(2πσ²) ) exp( −(x − µx)² / (2σ²) )
where µx and σ2 are the population (parameters) mean and variance respectively. The
function f (x) is called a probability density function. It is symmetrical and centered
around µx. Each probability is determined as the area between the density curve and the
x axis (see Figure 2B).
The areas under the curve of the normal distribution with mean µ = 0 and standard deviation
σ = 1 (the so-called “standard normal distribution”) have been tabulated and are given in
Table 3.1. This table presents probabilities in the tail of the standard normal
distribution, i.e., P (Z > z) for z > 0.0 (see Figure 3).
Table 3.1: Table A.1 Areas in one tail of the standard normal curve
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.500 0.496 0.492 0.488 0.484 0.480 0.476 0.472 0.468 0.464
0.1 0.460 0.456 0.452 0.448 0.444 0.440 0.436 0.433 0.429 0.425
0.2 0.421 0.417 0.413 0.409 0.405 0.401 0.397 0.394 0.390 0.386
0.3 0.382 0.378 0.374 0.371 0.367 0.363 0.359 0.356 0.352 0.348
0.4 0.345 0.341 0.337 0.334 0.330 0.326 0.323 0.319 0.316 0.312
0.5 0.309 0.305 0.302 0.298 0.295 0.291 0.288 0.284 0.281 0.278
0.6 0.274 0.271 0.268 0.264 0.261 0.258 0.255 0.251 0.248 0.245
0.7 0.242 0.239 0.236 0.233 0.230 0.227 0.224 0.221 0.218 0.215
0.8 0.212 0.209 0.206 0.203 0.200 0.198 0.195 0.192 0.189 0.187
0.9 0.184 0.181 0.179 0.176 0.174 0.171 0.169 0.166 0.164 0.161
1.0 0.159 0.156 0.154 0.152 0.149 0.147 0.145 0.142 0.140 0.138
1.1 0.136 0.133 0.131 0.129 0.127 0.125 0.123 0.121 0.119 0.117
1.2 0.115 0.113 0.111 0.109 0.107 0.106 0.104 0.102 0.100 0.099
1.3 0.097 0.095 0.093 0.092 0.090 0.089 0.087 0.085 0.084 0.082
1.4 0.081 0.079 0.078 0.076 0.075 0.074 0.072 0.071 0.069 0.068
1.5 0.067 0.066 0.064 0.063 0.062 0.061 0.059 0.058 0.057 0.056
1.6 0.055 0.054 0.053 0.052 0.051 0.049 0.048 0.047 0.046 0.046
1.7 0.045 0.044 0.043 0.042 0.041 0.040 0.039 0.038 0.038 0.037
1.8 0.036 0.035 0.034 0.034 0.033 0.032 0.031 0.031 0.030 0.029
1.9 0.029 0.028 0.027 0.027 0.026 0.026 0.025 0.024 0.024 0.023
2.0 0.023 0.022 0.022 0.021 0.021 0.020 0.020 0.019 0.019 0.018
2.1 0.018 0.017 0.017 0.017 0.016 0.016 0.015 0.015 0.015 0.014
2.2 0.014 0.014 0.013 0.013 0.013 0.012 0.012 0.012 0.011 0.011
2.3 0.011 0.010 0.010 0.010 0.010 0.009 0.009 0.009 0.009 0.008
2.4 0.008 0.008 0.008 0.008 0.007 0.007 0.007 0.007 0.007 0.006
2.5 0.006 0.006 0.006 0.006 0.006 0.005 0.005 0.005 0.005 0.005
2.6 0.005 0.005 0.004 0.004 0.004 0.004 0.004 0.004 0.004 0.004
2.7 0.003 0.003 0.003 0.003 0.003 0.003 0.003 0.003 0.003 0.003
2.8 0.003 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002
2.9 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.001 0.001 0.001
3.0 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
3.1 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
3.2 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
3.3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
3.4 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Figure 3.2: Probabilities under the curve of the standard normal distribution
Figure 3.3: Probabilities under the curve of the standard normal distribution
From the table we see, for example, that P(Z > 0.16) = 0.436 (row 0.1, column 0.06). When
reading a normal table, we take advantage of the following features of the normal distribution:
• The symmetry of the standard normal curve around zero (its mean). Thus, P (Z ≥
z) = P (Z ≤ −z), where z ≥ 0.
• The fact that (as in any distribution) the total area under the curve is equal to 1. Thus,
for two complementary events, P(Z ≥ z) = 1 − P(Z ≤ z).
(a) P(Z ≥ z) = p
If p ≤ 0.5, then z ≥ 0 and we look up p in the table. On the other hand, if
p ≥ 0.5, then z ≤ 0; we look up p1 = 1 − p in the table, and z is the negative of
the number located there.
(b) P(Z ≤ z) = p
If p ≤ 0.5, then z ≤ 0 and again we look up p in the table; z is the negative of the
number located there. On the other hand, if p ≥ 0.5, then z ≥ 0 and we look
up p1 = 1 − p in the table.
(c) P(−z ≤ Z ≤ z) = p
Look up p1 = (1 − p)/2 in the table; z is the closest number, while −z is its negative.
3.1.3 Examples
1. Find z such that P (Z > z) = 0.025. From above this can be looked-up directly in
the standard normal table. We see that z = 1.96 is such that P (Z > 1.96) = 0.025.
2. Find z such that P (Z < −z) = 0.05. This is equal to P (Z > z) which can be looked
up on the table. We note that there are two numbers close to 0.05 but none fulfils
the requirement exactly. We have P (Z > 1.64) = 0.051 while P (Z > 1.65) =
0.049. Interpolating between these two values we have P (Z > 1.645) ≈ 0.05.
²Capital Z is the (normally distributed) random variable, while z is the value it assumes.
3. Find z such that P(−z < Z < z) = 0.95. As above, this probability is 1 − 2P(Z >
z) = 0.95, which means that P(Z > z) = 0.025 and thus z = 1.96. That is, 95%
of the area under the standard normal distribution is found between ±1.96.
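These table look-ups can be reproduced with the standard normal quantile function; a short Python sketch (assuming scipy, an addition to the notes):

from scipy.stats import norm

print(norm.isf(0.025))   # 1.960: the z with P(Z > z) = 0.025 (example 1)
print(norm.isf(0.05))    # 1.645: so P(Z < -1.645) = 0.05 by symmetry (example 2)
# Example 3: each tail holds (1 - 0.95)/2 = 0.025, so z = 1.96 again.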
3.2 Standardization
A useful feature of the normal distribution is that if a variable X is distributed according
to an arbitrary normal distribution N (µ, σ) then the variable Z = σ is distributed as a
X−µ
standard normal distribution N (0, 1) for which probabilities have been tabulated.
Intuitively this means that for all normal distributions the same amount of probability
is concentrated under the normal distribution curve within the same number of standard
deviations from the mean. Let’s see how this works: In the case of the standard normal
distribution, we know for example that 2.5% probability is concentrated above 1.96. That
is, 2.5% probability is concentrated above 1.96 standard deviations above the mean (recall
in the case of the standard normal, µ = 0 and σ = 1). What we are saying is that for any
normal distribution 2.5% probability is concentrated above µ + 1.96σ, that is,
P (X > µ + 1.96σ) = 0.025
Thus, any probability like P(X > a) can be calculated by reference to the standard
normal distribution, if one figures out how many standard deviations a is above the
mean µ. This will be,

P(X > a) = P( (X − µ)/σ > (a − µ)/σ ) = P( Z > (a − µ)/σ )

where Z = (X − µ)/σ is distributed according to N(0, 1). What the above says is that a is
z = (a − µ)/σ standard deviations above the mean. The probability associated with this event
is of course easily obtained from the normal table in the textbook.
Other, more complex probabilities are obtained by simplifying the expression according to
the methods that we discussed earlier. For example, recall the cholesterol level data, where
the cholesterol level in the U.S. male population ages from 20-74 years was distributed
according to the normal distribution N (211, 46). What would be the probability that a
randomly selected individual from this population has cholesterol level above a = 220? The
answer is given if one thinks about how many standard deviations is a above the mean µ =
211. That is,
P(X > 220) = P( (X − µ)/σ > (220 − 211)/46 ) = P(Z > 9/46) = P(Z > 0.196) ≈ 0.42
That is, about 42% of U.S. males aged 20-74 years have cholesterol above 220
mg/100 ml.
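The same probability can be computed directly; a minimal Python sketch (scipy assumed, an addition to the notes):

from scipy.stats import norm

z = (220 - 211) / 46                      # ~0.196 standard deviations above the mean
print(norm.sf(z))                         # P(Z > 0.196) ~ 0.42
print(norm.sf(220, loc=211, scale=46))    # same answer without standardizing by hand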
Chapter 4
Statistical inference
[Figure: schematic of statistical inference: a statistic computed from a sample is used to infer a population parameter]
Here we pause to introduce some basic statistical concepts.
1. A population is the entire set of measurements of interest; a characteristic of the population is
called a parameter.
2. A sample is any subset of the population of interest. A characteristic of the sample is
called a statistic.
3. As n gets large, the shape of the sampling distribution of the mean is approximately that
of a normal distribution.
¹A random sample is one where every member of the population has an equal chance of being selected.
4.2.1 Cholesterol level in U.S. males 20-74 years old
The serum cholesterol levels for all 20-74 year-old US males has mean µ = 211 mg/100 ml
and the standard deviation is σ = 46 mg/100 ml. That is, each individual serum cholesterol
level is distributed around µ = 211 mg/100 ml, with variability expressed by the standard
deviation σ.
What is the probability that the mean of a sample of size n = 25 from this population is
217 mg/100 ml or higher? From the Central Limit Theorem we have that

P(X̄ ≥ 217) = P( (X̄ − µ)/(σ/√n) ≥ (217 − 211)/(46/√25) ) = P(Z ≥ 0.65) = 0.258
Thus, less than 26% of the time will the means of samples of size 25 be above 217
mg/100 ml; about 16% of the time they will be above 220 mg/100 ml, and less than 2% of
the sample means are expected to be larger than 230 mg/100 ml.
To calculate the upper and lower cutoff points enclosing the middle 95% of the means of
samples of size n = 25 drawn from this population we work as follows:
The cutoff points in the standard normal distribution are −1.96 and +1.96. We can
translate this to a statement about serum cholesterol levels:

−1.96 ≤ Z ≤ 1.96  ⟺  −1.96 ≤ (x̄25 − 211)/(46/√25) ≤ 1.96
Approximately 95% of the sample means will fall between 193 and 229 mg/100 ml.
Note. This is a general result, i.e., 95% of any normal distribution lies between µ ± 1.96σ.
Here the distribution of the sample mean has standard deviation σx̄ = σ/√n and mean µ.
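Both numbers above are easy to reproduce; a small Python sketch (scipy assumed, an addition to the notes):

from math import sqrt
from scipy.stats import norm

mu, sigma, n = 211, 46, 25
se = sigma / sqrt(n)                     # std. deviation of the sample mean, 9.2
print(norm.sf((217 - mu) / se))          # P(Xbar >= 217) ~ 0.26
print(mu - 1.96 * se, mu + 1.96 * se)    # middle 95% of sample means: ~ (193, 229)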
Suppose now that glucose levels in the general (healthy) population have mean µ = 9.7 and
standard deviation σ = 2, and that a sample of n = 64 diabetic patients has mean X̄64 = 13.6.
Then

P(X̄64 > 13.6) = P( (X̄64 − µ)/(σ/√n) > (13.6 − 9.7)/(2/√64) ) = P( Z > 3.9/0.25 ) = P(Z > 15.6)
This is equivalent to asking what the probability is that a number is 15.6 standard deviations
away from the mean. This of course is essentially zero!
Is this compatible with the hypothesis that diabetic patients have the same glucose levels
as the rest of the population? Most people would say that this probability is “too small”
or that the mean in the diabetics sample is “too far” from the hypothesized mean (of the
healthy population), so that the hypothesis of equality of the diabetic and healthy means is
suspect.
1. State the null hypothesis H0. Usually we will try to disprove it (i.e., "reject" it).
2. State the alternative hypothesis Ha.
3. Determine the α level of the test. This is the lowest level of probability, computed
assuming the null hypothesis is true, that you are willing to consider before rejecting
the null hypothesis (as having led you to a very unlikely event).
4. Specify the statistic T on which the test is based. In the cases that we are concerned
with, this statistic is of the form
T = (θ̂ − θ) / s.e.(θ̂)

where θ and θ̂ are the population parameter and sample statistic respectively, and
s.e.(θ̂), the "standard error", is the standard deviation of the statistic θ̂.
5. Specify the decision rule for rejecting or not the null hypothesis. This must be based
on the α level of the test and the test statistic T .
4.3.1 Hypothesis testing involving a single mean and known variance
Based on a random sample of size n we compute the sample mean X̄n. The testing of
hypotheses in this case is carried out as follows:
1. The null hypothesis is H0: µ = µ0.
2. The alternative hypothesis is Ha: µ > µ0, Ha: µ < µ0, or Ha: µ ≠ µ0.
3. Usually the α level will be 5% or 1% (the significance level of the test is (1 − α)100%, i.e.,
95% or 99% respectively).
4. The test statistic is T = (X̄n − µ0)/(σ/√n).
5. Rejection rule: for Ha: µ > µ0 reject H0 if T > z1−α; for Ha: µ < µ0 reject H0 if T < zα;
where z1−α is the upper (1 − α) cutoff point and zα the lower α cutoff point of the standard normal
distribution respectively.
2. The alternative hypothesis is Ha : µ > µ0 which means that the mean glucose level
among diabetics is higher than normal
3. Let us choose α = 0.05 (significance level is 95%)
4. The test statistic is

T = (x̄ − µ0) / (σ/√n) = (13.6 − 9.7) / (2/√64) = 15.6
5. Rejection rule (this is a one-sided test): Reject the null hypothesis if T > 1.645 = z0.95
Decision: Since T = 15.6 > 1.645 we reject Ho.
The data contradict the null hypothesis that diabetic patients have the same blood glucose
level as healthy patients. On the contrary, the data suggest that diabetics have significantly
higher glucose levels on average than individuals not suffering from diabetes.
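Steps 4 and 5 amount to two lines of arithmetic; a Python sketch (scipy assumed, an addition to the notes):

from math import sqrt
from scipy.stats import norm

T = (13.6 - 9.7) / (2 / sqrt(64))   # 15.6
print(T, T > norm.isf(0.05))        # True: T exceeds z_0.95 = 1.645, so reject H0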
[Figure 4.2: Sampling distribution of the mean under H0, centered at µ0]
STEP 1. State the null hypothesis, H0: µ = µ0; under H0 the sampling distribution of the mean is centered at the hypothesized mean value.
STEP 2. State the alternative hypothesis, Ha: µ > µ0 (other alternatives are possible).
STEP 3. Choose the α level of the test.
Graphically, STEPS 2 and 3 are shown in Figure 4.3.
[Figure 4.3: Impact of STEPS 2 and 3 on our assumptions]

Steps 2 and 3 determine the location of the cutoff point(s) of the test. Step 2 implies that the
cutoff point x̄0 will be on the right tail of the sample mean distribution. Any observed value
of X̄ above this point will raise suspicion about the veracity of the null hypothesis. Steps 2 and
3 have implications for the rejection rule. Step 3 determines how certain we want to be of
our decision. A small alpha level indicates that we would be willing to reject the null
hypothesis only for extremely unlikely values of X¯ . Larger alpha levels indicate a
willingness to reject more easily. Compare this to a jury verdict. In the first case, we would
want to be extra certain, while in the latter we would convict with weaker evidence.
Calculation of the cutoff point x̄0 proceeds by translating the statement P(X̄n > x̄0) = α
into a statement about Z (for which cutoff points have been tabulated). Since P(Z > z1−α) = α
and Z = (X̄n − µ)/(σ/√n), we have that

P( (X̄n − µ)/(σ/√n) > z1−α ) = α  ≡  P( X̄n > µ + z1−α σ/√n ) = α

This in turn immediately gives the cutoff point x̄0 = µ0 + z1−α σ/√n.
So, given α, we would go up to z1−α std. deviations above the mean before rejecting the null
hypothesis (in favor of the one-sided alternative Ha : µ > µ0) at this α level.
If Ha: µ < µ0, then the cutoff point will be x̄0 = µ0 − z1−α σ/√n. Thus, we reject H0 for values
of X̄ that are z1−α std. deviations below the mean.
If the test is two-sided (alternative hypothesis of the form Ha: µ ≠ µ0), then the
situation is as shown in Figure 4.4 (x̄l and x̄u are the lower and upper cutoff points
respectively). Note now that the sum of the two tails is α, i.e., P(X̄n < x̄l) + P(X̄n > x̄u) = α,
with α/2 in each tail.

[Figure 4.4: A two-sided alternative; rejection regions below x̄l and above x̄u, acceptance region in between]

Working in a similar manner as before, we see that since P(Z > z1−α/2) = α/2 and
P(Z < −z1−α/2) = α/2, we have that x̄l = µ0 − z1−α/2 σ/√n and, similarly, x̄u =
µ0 + z1−α/2 σ/√n. This means that, given α, we would reject the null hypothesis if x̄ were z1−α/2
standard deviations above or below the mean.
[Figure 4.5: The distribution of the sample mean in the diabetes example, centered at 9.7]

[Figure 4.6: The diabetes example under a one-sided alternative and α = 0.05; acceptance region below the cutoff 10.11, rejection region above it]
Note, however, that we are given less information than when the population
standard deviation was known. Thus, T is not distributed according to a standard normal
distribution. In fact, we should expect T to be more variable than Z, and its distribution
should reflect this.
The t distribution is symmetric and centered around zero; it has "fatter" tails compared
to the standard normal distribution and is defined by n − 1 "degrees of freedom" (where n
is the sample size). Notice in the figure below how the t distribution approaches the standard
normal distribution as the degrees of freedom increase.

[Figure: densities of the standard normal, the t distribution with 5 d.f., and the t distribution with 1 d.f.]

This is intuitively as expected, since when
we have a large sample size n, then the information increases (and thus the uncertainty in-
troduced from having to estimate the standard deviation decreases). The degrees of freedom
are essentially the number of independent pieces of information provided by the sample.
Initially, every sample has n independent pieces of information (as many as the number of
observations). However, after we calculate the sample mean, there are only n − 1 independent
pieces. Recall that Σ_{i=1}^{n} (x_i − x̄) = 0. Thus, if we know the first n − 1 observations, we
can compute the nth one (it would be x_n = x̄ − Σ_{i=1}^{n−1} (x_i − x̄)), and thus there are n − 1
independent pieces of information. The test of hypotheses involving means with unknown
variance proceeds as follows:
2. Ha: µ ≠ µ0
The question is: "Could a sample like ours arise if the cigar population mean benzene
concentration were µ = 81 µg/g?"
Since t = (x̄7 − µ0)/(s/√n) is distributed as a t distribution with n − 1 = 6 degrees of freedom, and

t = (151 − 81) / (9/√7) = 20.6

the probability that a sample mean of 151 or higher would occur under
the null hypothesis is less than 0.0001.
Since this is less than the alpha level of the test we reject the null hypothesis. Cigars have
higher concentration of benzene than cigarettes.
The syntax of the relevant STATA command is

ttesti #obs #mean #sd #val [, level(#)]

where #obs is the sample size, #mean is the sample mean, #sd is the sample standard deviation,
and #val is the population mean under the null hypothesis.
Computer implementation of the benzene concentration example
. ttesti 7 151 9 81, level(95)
Number of obs = 7
Degrees of freedom: 6
Ho: mean(x) = 81
Since we are performing a two-sided test, we concentrate in the middle part of the STATA
output. Since P > |t| = 0.0000, which is much smaller than 0.05, we reject the null
hypothesis.
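For readers without STATA, the same test from summary statistics takes a few lines of Python (scipy assumed; an addition to the notes):

from math import sqrt
from scipy.stats import t

n, xbar, s, mu0 = 7, 151, 9, 81
T = (xbar - mu0) / (s / sqrt(n))    # ~20.6, as in the hand calculation
print(T, 2 * t.sf(abs(T), n - 1))   # two-sided p-value with 6 d.f.: far below 0.0001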
                     Group 1    Group 2
Population
  mean                 µ1         µ2
  std. deviation       σ1         σ2
Sample
  mean                 X̄1         X̄2
  std. deviation       s1         s2
  sample size          n1         n2
Testing for difference of the means of two independent samples (assuming equal variances) proceeds
as follows:
1. State the null hypothesis.
(a) One-sided tests: H0 : µ1 ≥ µ2 or H0 : µ1 ≤ µ2
(b) Two-sided tests: H0 : µ1 = µ2
2. Set up the alternative hypothesis
(a) One-sided tests: Ha : µ1 < µ2 or Ha : µ1 > µ2
(b) Two-sided tests: Ha : µ1 ≠ µ2
3. The α level is 5%.
4. The test statistic is T = (X̄1 − X̄2)/(sp √(1/n1 + 1/n2)) = 2.63, where sp is the pooled
estimate of the standard deviation (see Chapter 5).
5. Rejection rule: Reject H0 if T > t20;0.975 = 2.086 or if T < −t20;0.975 = −2.086. Since
T = 2.63 > 2.086, we reject the null hypothesis.
That is, we are 95% sure that children suffering from cystic fibrosis have significantly different
levels of iron in their serum compared to healthy children. It appears that these children have an
iron deficiency. To carry out the above test of hypothesis by STATA we use the following
command:

ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2 [, level(#)]
where #obs1 and #obs2 are the sample sizes, #mean1 and #mean2 are the sample means, and #sd1
and #sd2 are the sample standard deviations for the two groups respectively.
Note. ttesti is the immediate version of the ttest command in STATA. We use the immediate
versions of commands, when we do not have access to the raw data, but we do have access to the
necessary summary statistics (like n, mean, standard deviation, etc.). If we had access to the raw
data, say under variable names X1 and X2, then the previous ttest command would be ttest
X1=X2 (and STATA would then proceed to calculate the means and standard deviations
necessary). The computer output is as follows:
. ttesti 9 18.9 5.9 13 11.9 6.3
x: Number of obs = 9
y: Number of obs = 13
Degrees of freedom: 20
Ho: mean(x) - mean(y) = diff = 0
Ha: diff < 0 Ha: diff ~= 0 Ha: diff > 0
t = 2.6278 t = 2.6278 t = 2.6278
P < t = 0.9919 P > |t| = 0.0161 P > t = 0.0081
The two-sided test corresponds to the middle alternative (Ha: diff ~= 0). The p-value (P > |t| =
0.0161) is less than the α level, so we reject H0. Children with cystic fibrosis (group y) have
different levels of iron in their blood from healthy children. The sample mean is less than that of
the healthy children meaning that children with cystic fibrosis have lower blood iron levels.
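The pooled two-sample computation is also easy to reproduce from the summary statistics alone; a Python sketch (scipy assumed, an addition to the notes):

from math import sqrt
from scipy.stats import t

n1, m1, s1 = 9, 18.9, 5.9     # group x: healthy children
n2, m2, s2 = 13, 11.9, 6.3    # group y: children with cystic fibrosis
sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))   # pooled sd
T = (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))
print(T, 2 * t.sf(abs(T), n1 + n2 - 2))   # ~2.63 and p ~ 0.016, as in the output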
1. The two sets of measurements are not independent (because each pair is measured on the
same patient) and each patient serves as his own “control”. The advantage of this design
is that we are able to account for individual (biological) patient variability. Someone that
tends to experience angina faster on clean air will more likely experience angina faster
when the air is mixed with CO. Similarly someone that experienced angina later when
breathing clean air, will likely experience symptoms later when breathing CO as well.
2. It is not appropriate to think that we have 2n distinct (independent) data points (or units
of information) available to us, since each data point on the same subject provides a great
deal of information on the subsequent data points collected on the same subject.
4.6.4 Hypothesis testing of paired samples
In a random sample of size n paired observations, we compute the sample mean of the
differences between the pairs of observations, di = xCi − xTi, i = 1, ..., n, where "C" means
control and "T" means treatment. We carry out the test like a usual single-sample t test based
on these differences; that is,
1. The null hypothesis is H0: δ = 0, where δ is the population mean of the differences
2. The alternative hypothesis is Ha: δ < 0
3. The α level is 5%
4. The test statistic is

T = d̄ / (sd/√n) = −2.59

5. Rejection rule: Reject H0 if T < −t62;0.95 = −1.673. Since T = −2.59 < −1.673 = −t62;0.95,
the null hypothesis is rejected.
Subjects when breathing air with CO experience angina faster than when breathing air without
CO.
Computer implementation
To carry out the above test of hypothesis by STATA we use the one-sample t-test command as
before, noting that our data are now comprised of the differences of the paired observations and
that the mean under the null hypothesis is zero. The output is as follows:
Degrees of freedom: 62
Ho: mean(x) = 0
Ha: mean < 0 Ha: mean ~= 0 Ha: mean > 0
t = -2.5936 t = -2.5936 t = -2.5936
P < t = 0.0059 P > |t| = 0.0118 P > t = 0.9941
Since P<t=0.0059 is less than 0.05, we reject the null hypothesis. Subjects experience angina faster
(by about 6.63%) when breathing air mixed with CO than when breathing clean air.
Chapter 5
Estimation
Hypothesis testing is one large part of what we call statistical inference, where by using a
sample we infer (make statements) about the population that the sample came from. Another
major part of statistical inference (and closely related to hypothesis testing) is estimation.
Estimation may be regarded as the opposite of hypothesis testing, in that we make a “guess” of
the value (or range of values) of the unknown quantity. This is different from testing where a
hypothesis about the value of this quantity must be made (what in hypothesis testing was the null
hypothesis) and until shown otherwise, this hypothesized value is considered known.
Nevertheless, estimation is closely related to testing, both conceptually (after all we still try to
“guess” the true value of the unknown quantity) as well as in terms of mathematical
implementation.
In what follows, we will concentrate on confidence intervals of the unknown population mean µ.
Just as in hypothesis testing, we will be concerned with two types of confidence intervals:
• One-sided confidence intervals
• Two-sided confidence intervals
We will also consider the case where the population standard deviation σ is known or unknown.
Then, the statistic

Z = (X̄ − µ) / (σ/√n)

is distributed according to the standard normal distribution, so that P(−1.96 ≤ Z ≤ 1.96) = 0.95.
Rewriting this statement in terms of µ gives

P( X̄ − 1.96 σ/√n ≤ µ ≤ X̄ + 1.96 σ/√n ) = 0.95

This means that even though we do not know the exact value of µ, we expect it to be between
x̄ − 1.96 σ/√n and x̄ + 1.96 σ/√n 95% of the time. In this case, x̄ is the point estimate of µ, while the
interval (X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n) is the 95% confidence interval for µ.
For a 99% confidence interval we use z0.995 = 2.58, and consequently the 99% two-sided
confidence interval of the population mean is

( X̄ − 2.58 σ/√n, X̄ + 2.58 σ/√n )
All else being equal therefore, higher confidence (say 99% versus 95%) gets translated to a
wider confidence interval. This is intuitive, since the more certain we want to be that the interval
covers the unknown population mean, the more values (i.e., wider interval) we must allow this
unknown quantity to take. In general, all things being equal:
• Larger variability (larger standard deviation) is associated with wider confidence intervals
• Larger sample sizes are associated with narrower confidence intervals
• Higher confidence levels are associated with wider confidence intervals
In estimation we obviously want the narrowest confidence intervals for the highest confidence
(i.e., wide confidence intervals are to be avoided).
5.2.2 Distribution of cholesterol levels
For all males in the United States who are hypertensive (have high systolic blood pressure) and
smoke the distribution of cholesterol levels has an unknown mean µ and standard deviation σ =
46mg/100ml. If we draw a sample of size n = 12 subjects from this group of hypertensive
smokers and compute their (sample) mean cholesterol level x̄12 = 217 mg/100 ml, the 95%
confidence interval based on information from this sample is

( 217 − 1.96 (46/√12), 217 + 1.96 (46/√12) ) = (191, 243)
In other words we are 95% confident that the interval (191, 243) covers the unknown mean of
the population of hypertensive smokers. Note that approximately 5% of the time the confidence
interval that we compute will not cover the unknown population mean.
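The interval endpoints follow from two lines of arithmetic; a Python sketch (scipy assumed, an addition to the notes):

from math import sqrt
from scipy.stats import norm

xbar, sigma, n = 217, 46, 12
half = norm.isf(0.025) * sigma / sqrt(n)   # 1.96 * 46/sqrt(12) ~ 26
print(xbar - half, xbar + half)            # ~ (191, 243)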
5.2.3 One-sided confidence intervals

A one-sided 95% confidence interval, when only a lower bound for µ is of interest, is

( X̄ − 1.645 σ/√n, +∞ )
Just as in the case of one-sided hypothesis testing, the advantage of using one-sided confidence
intervals is obvious. Since we use z1−α instead of z1−α/2, we need a less extreme cutoff
point in the direction of interest. For example, if only high values are of interest, in the case of a
95% one-sided confidence interval we need to go only 1.645 standard deviations above the
sample mean, instead of 1.96 standard deviations, as would be the case for the two-sided
confidence interval.
5.3 Confidence intervals when σ is unknown

When σ is unknown and is estimated by the sample standard deviation s, the confidence
intervals are based on the t distribution with n − 1 degrees of freedom. The two-sided interval is

( X̄ − tn−1;1−α/2 s/√n, X̄ + tn−1;1−α/2 s/√n )

while the corresponding one-sided intervals are

( −∞, X̄ + tn−1;1−α s/√n )   and   ( X̄ − tn−1;1−α s/√n, +∞ )
5.3.1 Antacids and plasma aluminum level

For the aluminum example (n = 10, x̄ = 37.2 µg/l, s = 7.13 µg/l), the 95% two-sided interval is

( x̄ − tn−1;1−α/2 s/√n, x̄ + tn−1;1−α/2 s/√n ) = ( 37.2 − 2.262 (7.13/√10), 37.2 + 2.262 (7.13/√10) ) = (32.1, 42.3)
Compare the previous interval to the 95% confidence interval based on the normal distribution,
derived by pretending that the estimate of the standard deviation s = 7.13 µg/l is the true
population standard deviation. This interval is (32.8, 41.6) and has length 8.8 (= 41.6 −
32.8) µg/l, whereas the one based on the t distribution has length 10.2 (= 42.3 − 32.1) µg/l. This
loss of accuracy (widening of the confidence interval) is the "penalty" we pay for the lack of
knowledge of the true population standard deviation.
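The t-based interval is reproduced below; a Python sketch (scipy assumed, an addition to the notes):

from math import sqrt
from scipy.stats import t

n, xbar, s = 10, 37.2, 7.13
half = t.isf(0.025, n - 1) * s / sqrt(n)   # t(9; 0.975) = 2.262
print(xbar - half, xbar + half)            # ~ (32.1, 42.3)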
5.3.2 Computer implementation
Confidence intervals can be computed in STATA by using the command ci, or its immediate
equivalent cii. The syntax is as follows:
ci [varlist] [, level(#)]
cii #obs #mean #sd [, level(#)]

where ci is used when we have access to the complete data set, while cii is used when only
the sample size (#obs), sample mean (#mean) and standard deviation (#sd) are known. In all
cases, we can manipulate the alpha level of the confidence interval by using the option level(#).
For example, level(95) would calculate a 95% confidence interval (default), while level(90)
would calculate a 90% confidence interval.
Example: Antacids and aluminum level (continued):
In the example above, a 95% (two-sided) confidence interval is as follows:
This agrees with our hand calculations (i.e., that the 95% C.I. is (32.1, 42.3)).
Caution! STATA only produces two-sided confidence intervals. If you want to obtain one-sided
confidence intervals for the aluminum example, you have to use the level(#) option as follows:
Thus, an upper 95% confidence interval would be (−∞, 41.3), while a lower 95% confidence
interval would be (33.1, +∞).
5.4 Confidence intervals of a difference of two means

These intervals are based on the statistic

t = ( X̄1 − X̄2 − (µ1 − µ2) ) / ( sp √(1/n1 + 1/n2) )

where sp, the pooled estimate of the population standard deviation, is

sp = √( ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) )

In this case, t is distributed according to a t distribution with n1 + n2 − 2 degrees of freedom.
Notice that sp √(1/n1 + 1/n2) = √( sp²/n1 + sp²/n2 ); that is, the standard error of the
difference of two means is the square root of the sum of the variances of each mean (recall
that we have two sources of variability when we deal with two groups). This of course holds
only when the two groups are independent!
The two-sided confidence interval for the difference of two means is

( (x̄1 − x̄2) − tn1+n2−2;1−α/2 sp √(1/n1 + 1/n2),  (x̄1 − x̄2) + tn1+n2−2;1−α/2 sp √(1/n1 + 1/n2) )

For the serum iron example this becomes

( (18.9 − 11.9) − (2.086)(6.14)√(1/9 + 1/13),  (18.9 − 11.9) + (2.086)(6.14)√(1/9 + 1/13) )
⇓
(1.4 , 12.6)
How does this compare to the result of the hypothesis test (which as you may recall rejected the
null hypothesis at the 5% level)?
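A minimal Python sketch of the interval calculation (scipy assumed, an addition to the notes):

from math import sqrt
from scipy.stats import t

n1, m1, n2, m2, sp = 9, 18.9, 13, 11.9, 6.14
half = t.isf(0.025, n1 + n2 - 2) * sp * sqrt(1 / n1 + 1 / n2)   # 2.086 * s.e.
print(m1 - m2 - half, m1 - m2 + half)                           # ~ (1.4, 12.6)

Because the interval excludes zero, it leads to the same conclusion as the test, which rejected H0 at the 5% level.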
If the data are not given as above, or access to the raw data is not possible, the ttest and
ttesti commands must be used instead. The previous caution for calculating one-sided confidence
intervals carries over to this case, that is, the option level(#) must be used. That is, to produce
a 95% one-sided confidence interval we use the ttest or ttesti command with the option
level(90).
Degrees of freedom: 20
Ho: mean(x) - mean(y) = diff = 0
Thus, a two-sided 95% confidence interval for the difference in serum iron levels between healthy
children and children suffering from cystic fibrosis is (1.4, 12.6) as we saw before.
Each experiment is called a Bernoulli trial. Such experiments include throwing a die and
observing whether or not it comes up six, investigating the survival of a cancer patient, etc. For
example, consider smoking status, and define X = 1 if the person is a smoker and X = 0 if he
or she is a non-smoker. If "success" is the event that a randomly selected individual is a
smoker, and from previous research it is known that about 29% of individuals are smokers, then

P(X = 1) = p = 0.29,   P(X = 0) = 1 − p = 0.71
This is an example of a Bernoulli trial (we select just one individual at random, each selection is
carried out independently, and each time the probability that the individual is a "success" is
constant).
Now suppose we select two individuals at random and let X count the number of smokers among them:
• X = 0: Neither is a smoker
P(X = 0) = (1 − p)² = (0.71)² = 0.5041
• X = 1: Exactly one is a smoker
P(X = 1) = 2p(1 − p) = 2(0.29)(0.71) = 0.4118
• X = 2: Both are smokers
P(X = 2) = p² = (0.29)² = 0.0841
[Figure: bar chart of the probabilities of X = 0, 1, 2 smokers among two randomly selected individuals]
The probability distribution of X assigns a probability to each possible value
of X, going over all the possible numbers that X can attain (in the previous example those were
0, 1, and 2 = n). There is a special distribution that closely models the behavior of variables that
"count" successes among n repeated Bernoulli experiments. This is called the binomial
distribution.
In our treatment of the binomial distribution, we only need to know two basic parameters:
• The probability of “success” p
• The number of Bernoulli experiments n
One way of looking at p, the proportion of times that an experiment comes out as a "success"
out of n repeated (Bernoulli) trials, is as the mean of a sample of measurements that are zeros or
ones. That is,

p = (1/n) Σ_{i=1}^{n} X_i

where the X_i are zeros or ones.
Given the above parameters, the mean and standard deviation of X, the count of successes out of
n trials, are µ = np and σ = √(np(1 − p)) respectively.
For example, suppose that we want to find the proportion of samples of size n = 30 in which at
most six individuals smoke. With p = 0.29 and n = 30, X = 6 < np = 8.7. Thus, applying the
continuity correction as shown above,

P(X ≤ 6) = P( Z ≤ (x − np + 0.5)/√(np(1 − p)) )
         = P( Z ≤ (6 − (30)(0.29) + 0.5)/√((30)(0.29)(0.71)) )
         = P(Z ≤ −0.89) = 0.187
The exact binomial probability is 0.190, which is very close to the approximate value given above.
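Both numbers are reproduced below; a Python sketch (scipy assumed, an addition to the notes):

from math import sqrt
from scipy.stats import binom, norm

n, p = 30, 0.29
exact = binom.cdf(6, n, p)                                     # ~0.190
approx = norm.cdf((6 - n * p + 0.5) / sqrt(n * p * (1 - p)))   # ~0.188, continuity-corrected
print(exact, approx)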
If we select repeated samples of size n = 50 patients diagnosed with lung cancer, what fraction
of the samples will have 20% or more survivors? That is, “what percent of the time 10(=
50(0.20)) or more patients will be alive after 5 years”?
• Two-sided hypothesis: Ho : p = p0
• One-sided hypothesis: Ho : p ≥ p0 or Ho : p ≤ p0
• Two-sided hypothesis: Ha : p ≠ p0
• One-sided hypothesis: Ha : p < p0 or Ha : p > p0
Equivalently, the rejection rule can be expressed as follows: Reject the null hypothesis if:
• Two-sided hypothesis: P(|Z| > z) < α, i.e., if P(Z < −z) + P(Z > z) < α
• One-sided hypothesis: P(Z < −z) < α or P(Z > z) < α respectively
From a sample of n = 52 patients under the age of 40 that have been diagnosed with lung cancer
the proportion surviving after five years is pˆ = 0.115. Is this equal or not to the known 5-year
survival of older patients? The test of hypothesis is constructed as follows:
1. Ho : p = 0.082
2. Ha : p ≠ 0.082
3. The α level is 5%
4. The test statistic is z = (p̂ − p0)/√( p0(1 − p0)/n ) = (0.115 − 0.082)/√( 0.082(0.918)/52 ) = 0.87
5. Rejection rule and decision: Since P(|Z| > 0.87) = P(Z > 0.87) + P(Z < −0.87) = 0.192 + 0.192 = 0.384 > 0.05, we do
not reject the null hypothesis. That is, there is no evidence to indicate that the five-year survival
of lung cancer patients who are younger than 40 years of age is different than that of the older
patients.
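A minimal Python sketch of the same z test (scipy assumed, an addition to the notes):

from math import sqrt
from scipy.stats import norm

phat, p0, n = 0.115, 0.082, 52
z = (phat - p0) / sqrt(p0 * (1 - p0) / n)   # ~0.87
print(z, 2 * norm.sf(abs(z)))               # two-sided p ~ 0.39 > 0.05: do not reject H0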
The syntax of the prtest command and its immediate version prtesti is as follows:
prtest varname = # [if exp] [in range] [, level(#)]
prtest varname = varname [if exp] [in range] [, level(#)]
prtesti #obs #p1 #p2 [, level(#)]
prtesti #obs1 #p1 #obs2 #p2 [, level(#)]
We see that the p value associated with the two-sided test is 0.318 which is close to that
calculated above. Using the normal approximation we have:
. prtesti 52 6 0.082,count
Note! The above output was produced with STATA version 7.0. To obtain the same output
with STATA 6.0 or earlier you must omit the option count as follows:
. prtesti 52 6 0.082
This closely matches our calculations. We do not reject the null hypothesis, since the p value
associated with the two-sided test is 0.380 > α. Note that any differences with the hand
calculations are due to round-off error.
6.4 Estimation
Similar to the testing of hypothesis involving proportions, we can construct confidence intervals
where we can be fairly confident (at a pre-specified level) that the unknown true proportion lies.
Again these intervals will be based on the statistic
Z = (p̂ − p) / √( p(1 − p)/n )

where p̂ = x/n and √( p(1 − p)/n ) are the estimates of the proportion and its associated standard deviation
respectively.
• Two-sided confidence interval:
( p̂ − z_{1−α/2} √(p̂(1 − p̂)/n) , p̂ + z_{1−α/2} √(p̂(1 − p̂)/n) )
• Upper, one-sided interval:
( 0 , p̂ + z_{1−α} √(p̂(1 − p̂)/n) )
• Lower, one-sided interval:
( p̂ − z_{1−α} √(p̂(1 − p̂)/n) , 1 )
In the previous example, if 6 out of 52 lung cancer patients under 40 years of age were alive after five years, then, using the normal approximation (which is justified since np̂ = 52(0.115) = 5.98 > 5 and n(1 − p̂) = 52(1 − 0.115) = 46.02 > 5), an approximate 95% confidence interval for the true proportion p is given by
( p̂ − z_{1−α/2} √(p̂(1 − p̂)/n) , p̂ + z_{1−α/2} √(p̂(1 − p̂)/n) )
⇓
( 0.115 − 1.96 √(0.115(1 − 0.115)/52) , 0.115 + 1.96 √(0.115(1 − 0.115)/52) )
⇓
(0.028 , 0.202)
In other words, we are 95% confident that the true five-year survival of lung-cancer patients
under 40 years of age is between 2.8% and 20.2%. Note that this interval contains 8.2% (the
five-year survival rate among lung cancer patients that are older than 40 years of age). The interval is thus consistent with the hypothesis test, which did not reject the hypothesis that the five-year survival of lung cancer patients older than 40 years of age is the same as that of younger subjects.
Computer implementation
To construct one- and two-sided confidence intervals we use the ci command and its
immediate equivalent cii. Their syntax is as follows:
ci varlist [weight] [if exp] [in range] [, level(#) binomial poisson exposure(varname)
by(varlist2) total ]
cii #obs #mean #sd [, level(#) ] (normal)
cii #obs #succ [, level(#) ] (binomial)
cii #exposure #events , poisson [ level(#) ] (Poisson)
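For the lung cancer example above, a sketch of the immediate command (using the binomial form, since the data are 6 successes out of 52 trials) is:
. cii 52 6
This reports an exact binomial confidence interval, which will differ somewhat from the normal-approximation interval (0.028, 0.202) computed by hand.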
which is distributed according to a standard normal distribution. The test is carried out as follows:
• Ha : p1 < p2 (or p1 − p2 < 0)
• Ha : p1 > p2 (or p1 − p2 > 0)
5. Rejection rule:
• Two-sided tests: Reject Ho if P (|Z| > z) < α (i.e., P (Z > z) + P (Z < —z) < α).
• One-sided tests: Reject Ho if P (Z > z) < α or reject Ho if P (Z < —z) < α respectively.
Example: Mortality of pediatric victims
In a study investigating morbidity and mortality among pediatric victims of motor vehicle accidents, information regarding the effectiveness of seat belts was collected. Two random
samples were selected, one of size n1 = 123 from a population of children that were wearing seat
belts at the time of the accident, and another of size n2 = 290 from a group of children that
were not wearing seat belts at the time of the accident. In the first case, x1 = 3 children died,
while in the second x2 = 13 died. Consequently, pˆ1 = 0.024 and pˆ2 = 0.045 and the task is
to compare the two.
Carrying out the test of hypothesis as proposed earlier,
1. Ho : p1 = p2 (or p1 − p2 = 0)
2. Ha : p1 ≠ p2 (or p1 − p2 ≠ 0)
3. The alpha level of the test is 5%
4. The test statistic is
z = [ (p̂1 − p̂2) − (p1 − p2) ] / √( p̂(1 − p̂)(1/n1 + 1/n2) )
= (0.024 − 0.045) / √( 0.039(1 − 0.039)(1/123 + 1/290) ) = −0.98
where p̂ = (x1 + x2)/(n1 + n2) = (3 + 13)/(123 + 290) ≈ 0.039 is the pooled estimate of the common proportion under the null hypothesis.
5. Rejection rule: Reject Ho if P (|Z| > z) < α (i.e., P (Z > z) + P (Z < —z) < α).
That is, P (Z > 0.98) + P (Z < −0.98) = 0.325 > α. Thus, there is no evidence that children not wearing seat belts die at a different rate than children wearing seat belts.
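The same test can be run with the immediate command prtesti, here given the two sample sizes and the numbers of deaths as counts (a sketch, following the count syntax shown earlier):
. prtesti 123 3 290 13, count
The reported two-sided p value should be close to the 0.325 computed by hand.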
Confidence intervals for the difference p1 − p2 are based on the statistic
z = [ (p̂1 − p̂2) − (p1 − p2) ] / √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
Note! Since we no longer need to assume that the two proportions are equal, the estimate of the standard error in the denominator is not based on a pooled proportion; instead, the variances of the two sample proportions are summed. That is, the standard error estimate is
ŝ = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
This is an important difference from hypothesis testing, and it may lead to inconsistency between decisions reached through the usual hypothesis test and those based on the confidence interval.
1. Two-sided confidence interval:
( (p̂1 − p̂2) − z_{1−α/2} √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ) , (p̂1 − p̂2) + z_{1−α/2} √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ) )
2. Upper, one-sided interval:
( −1 , (p̂1 − p̂2) + z_{1−α} √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ) )
3. Lower, one-sided interval:
( (p̂1 − p̂2) − z_{1−α} √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ) , 1 )
A two-sided 95% confidence interval for the true difference in death rates between children wearing seat belts and those who were not is given by
( (p̂1 − p̂2) − z_{1−α/2} √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ) , (p̂1 − p̂2) + z_{1−α/2} √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ) )
⇓
((0.024 − 0.045) − 1.96(0.018) , (0.024 − 0.045) + 1.96(0.018))
⇓
(−0.057 , 0.015)
That is, we are 95% confident that the true difference between the two groups lies between 5.7 percentage points in favor of children wearing seat belts and 1.5 percentage points in favor of children not wearing seat belts. Since the zero difference (hypothesized under the null hypothesis) is included in the confidence interval, we do not reject the null hypothesis. There is no evidence to suggest a benefit of seat belts.
statistic z = (X̄_n − µ)/(σ/√n), and the fact that we will reject the null hypothesis if Z ≥ z_α. In the cholesterol level example, the cutoff cholesterol level corresponding to a 5% α level is found as follows:
Z ≥ z_α
⇔ (X̄_n − µ)/(σ/√n) ≥ z_α
⇔ (X̄_n − 180)/(46/√25) ≥ 1.645
⇔ X̄_n ≥ (1.645)(46/√25) + 180 = 195.1
Thus, if the sample mean from a group of n = 25 males 20-74 years old is higher than 195.1 mg/dL, then we will reject the null hypothesis and decide that the 20-74 population mean cholesterol level is higher than that of the 20-24 year-old population.
Figure 7.2: Implication of setting the α level of a test
How often will such a sample mean exceed 195.1 mg/dL even if the mean cholesterol level of 20-74 year-old males is the same as that of the 20-24 year-olds?
This will happen α% of the time. That is, α% of the time we will be rejecting the null hypothesis
even though it is true. This is called an error of Type I.
The alpha level of the test is the maximum allowed probability of type-I error
What if the ”true” mean cholesterol of 20-74 year-olds is µ1 = 211 mg/dL? This situation is
given in the following figure.
Figure 7.3: Normal distributions of X̄_n with means µ_o = 180 mg/dL and µ_1 = 211 mg/dL and identical standard deviations σ/√n = 9.2 mg/dL (regions I-IV mark the four possible outcomes of the test, discussed below; the horizontal axis spans 140-260 mg/dL)
There is a chance that even though the mean cholesterol level is truly µ1 = 211 mg/dL, that
the sample mean will be to the left of the cutoff point (and in the acceptance region). In that
case we would have to accept the null hypothesis (even though it would be false). That would be
an error of Type II. The probability of a type-II error is symbolized by β.
What is this probability in this example? The probability of a type-II error is
β = P ( X̄_n ≤ 195.1 | µ = µ_1 = 211 )
= P ( (X̄_n − 211)/(46/√25) ≤ (195.1 − 211)/(46/√25) )
= P (Z ≤ −1.73) = 0.042
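This tail probability can be checked in STATA (a sketch, using the normprob() standard-normal CDF function):
. display normprob((195.1 - 211)/(46/sqrt(25)))
which should return approximately 0.042.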
• I. The mean cholesterol level of the 20-74 year-olds is the same as that of the 20-24 year-olds, and the test has correctly failed to reject the null hypothesis.
• II. In this case, the distribution of cholesterol levels among 20-74 year-old males has a higher mean compared to that of 20-24 year-olds, and the test has erroneously failed to reject the null hypothesis. This is an error of Type II.
• III. The mean cholesterol of the 20-74 year-olds is the same as that of 20-24 year-olds but
the null hypothesis is rejected. This is an error of Type I.
• IV. The sample mean among 20-74 year-old individuals is truly higher than that of 20-24
year-olds and the test has correctly rejected the null hypothesis.
7.2 Power
The error associated with case II is called error of Type II. Just as we had defined the probability
of a Type I error as α, and the probability of a Type II error as β, we have a special name for the
probability associated with case IV, that is, the probability that the test will correctly reject the
null hypothesis. This is called the power of the test. In other words, power is the chance that the
test as defined will pick up true differences in the two populations.
Power = 1 — β
Figure 7.4: Separation of the null and alternative distributions with increasing sample size. The curves on the left are based on a sample size of n = 5 subjects, versus n = 25 in the original situation (right).
To determine the sample size we need:
1. Null and alternative means
2. Standard deviation (or estimate of variability)
3. Alpha level of the test
4. Desired power
Items 1 and 2 can sometimes be substituted by the “standardized difference” δ = (µ1 − µ2)/σ, where σ is the assumed common standard deviation.
Example: Cholesterol example (continued):
For example, if α = 1%, the desired power is 1 − β = 95%, and the two means are µ0 = 180 mg/dL and µ1 = 211 mg/dL respectively, then the cutoff point is
x̄ = µ0 + z_α (σ/√n) = 180 + (2.32)(46/√n)
Also, by virtue of the desired power,
x̄ = µ1 − z_β (σ/√n) = 211 − (1.645)(46/√n)
So,
180 + (2.32)(46/√n) = 211 − (1.645)(46/√n)
and thus,
n = [ (2.32 + 1.645)(46) / (211 − 180) ]² = 34.6 ≈ 35
In general,
n = [ (z_α + z_β) σ / (µ0 − µ1) ]²
To be assured of a 95% chance of detecting differences in cholesterol level between 20-74 and 20-24
year-old males (power) when carrying out the test at a 1% α level, we would need about 35 20-74
year-old males.
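In STATA, sample-size and power calculations of this sort are handled by the sampsi command. A sketch for the calculation above (one-sided test, 1% α, 95% power, common standard deviation 46) is:
. sampsi 180 211, sd1(46) alpha(0.01) power(0.95) onesample onesided
which should report a required sample size of about n = 35.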
In the two-sample case, when n2 (and ratio) and/or sd2 is omitted, they are assumed equal
to n1 and sd1 respectively. You can use options n1 and ratio (=n2/n1) to specify n2. The
default is a two-sample comparison. In the one-sample case (population mean is known
exactly) use option onesample. Options pre(#), post(#), method(post|change|ancova|all), r0(#),
r1(#) refer to repeated-measures designs and are beyond the scope of this course.
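A sketch of the command that would produce the power output below (assuming the cholesterol setup with n = 25 and the default one-sided α of 5%):
. sampsi 180 211, sd1(46) n1(25) onesample onesided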
Assumptions:
alpha = 0.0500 (one-sided)
alternative m = 211
sd = 46
sample size n = 25
Estimated power:
power = 0.9577
The sampling distribution of the difference d̄ between the means of two (say) independent samples with known and equal variances (σ1 = σ2 = σ) is d̄ ~ N (δ, σ_δ), where
σ_δ = σ √(1/n1 + 1/n2) = σ √(2/n) if n1 = n2 = n.
1. Ho: d = 0
2. Ha: d = 31 mg/dL (corresponding to the situation where µ1 = 180 mg/dL and µ2 = 211 mg/dL)
3. α=0.01
4. Power=1 — β = 0.95
To calculate the sample size for each group (n′) we can use the previous one-sample formula, with the appropriate estimate of the variance of course. That is,
n′ = [ (z_α + z_β) σ_d / δ ]² = [ (z_α + z_β) σ √2 / δ ]² = 2 [ (z_α + z_β) σ / δ ]² = 2n
where n is the size of the identically defined one-sample case. That is, the sample size in the two-sample case will be roughly double that of the one-sample case. Here,
n′ = 2 [ (2.32 + 1.645)(46) / 31 ]² = 69.23
The required sample size is at least 70 subjects per group (double the n = 35 subjects required in the identical one-sample study).
Note! The total required sample is 140 subjects, or four times that of the single-sample study.
This is the penalty for having to estimate both means, rather than just one of the two.
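The corresponding two-sample sampsi sketch (specifying the standard deviation for each group) is:
. sampsi 180 211, sd1(46) sd2(46) alpha(0.01) power(0.95) onesided
which should report about 70 subjects per group.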
Power calculations involving a single proportion are based on the statistic
Z = (p̂ − p)/√(p(1 − p)/n) ~ N (0, 1)
With a one-sided test at the α = 5% level, the null hypothesis is rejected if
p̂ > p0 + z_α √(p0(1 − p0)/n) = 0.082 + 1.645 √(0.082(1 − 0.082)/52) ≈ 0.145
This situation is depicted in the following figure:
Figure 7.6: Distribution of p̂ under the null hypothesis, p̂ ~ N (0.082, 0.038) (blue), and under the alternative hypothesis, p̂ ~ N (0.200, 0.055) (red)
Example: Five-year survival of lung-cancer patients (continued):
First of all,
z_β = [ (p_a − p0) − z_α √(p0(1 − p0)/n) ] / √( p_a(1 − p_a)/n )
= [ (0.200 − 0.082) − 1.645 √(0.082(1 − 0.082)/52) ] / √( 0.200(1 − 0.200)/52 ) ≈ 1.00
Thus, the probability of a type-II error is β = P (Z > z_β) = 0.159, and the power of a test for a single proportion based on n = 52 subjects is 1 − β = 0.841, or about 84%.
To guarantee a given power, the required sample size is found by solving for n:
n = [ ( z_α √(p0(1 − p0)) + z_β √(p_a(1 − p_a)) ) / (p_a − p0) ]² = [ ( 2.32 √(0.082(1 − 0.082)) + 1.645 √(0.200(1 − 0.200)) ) / (0.200 − 0.082) ]² ≈ 121
That is, about 121 lung cancer patients under 40 years old will need to be followed, and their 5-year survival status determined, in order to ensure power of 95% when carrying out the test at the 1% alpha level.
The relevant STATA output fragments read:
Estimated power:
power = 0.8411
Notice that omission of estimates for the standard deviation (sd1 and/or sd2) produces power calculations for proportions.
Estimated sample size:
n = 121
Thus, n = 121 subjects will be necessary to be involved in the study and followed for 5-year survival.
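Sketches of the sampsi invocations behind these two calculations (power given n = 52, and the required n for 95% power at the 1% α level; the arguments are recognized as proportions because no standard deviations are supplied):
. sampsi 0.082 0.200, n1(52) onesample onesided
. sampsi 0.082 0.200, alpha(0.01) power(0.95) onesample onesided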
Chapter 8
Contingency tables
Consider the following table:
Wearing Helmet
Head Injury      Yes     No   Total
Yes               17    218     235
No               130    428     558
Total            147    646     793
If we want to test whether the proportion of unprotected cyclists that have serious head injuries is
higher than that of protected cyclists, we can carry out a general test of hypothesis involving
the two proportions p1 = 17/147 = .115, and p2 = 218/646 = .337.
The normal variable associated with the difference between p1 (protected cyclists having head injuries) and p2 (unprotected cyclists with head injuries) is z = −5.323; since |z| far exceeds 1.96, the null hypothesis is rejected at the 5% α level.
Now suppose that you wanted to determine whether there is any association between wearing
helmets and frequency of brain injuries based on the same data. Then you must perform a
different test, called the chi-square test because it is based on the χ2 distribution, which we will
cover momentarily. This test is set up as follows:
1. Ho: Suffering a head injury is not associated with wearing a helmet
2. Ha: There is an association between wearing a helmet and suffering a head injury
3. Specify the alpha level of the test
4. Rejection rule (two-sided only):
Reject Ho if the chi-square statistic is too large (see discussion below)
The test is based on the following statistic:
χ² = Σ_{i=1}^{rc} (O_i − E_i)²/E_i
where
Ei is the expected number of head injuries, Oi is the observed number, r is the number of rows
in the table, and c is the number of columns. Then, χ2 is distributed according to the chi square
distribution with df = (r — 1)(c — 1) degrees of freedom. Critical percentiles of the chi-square
distribution can be found in the appendix of your textbook. The chi-square distribution with one
degree of freedom is shown below:
Figure 8.1: Chi-square distribution with one degree of freedom (the 5% critical value, 3.84, is marked on the horizontal axis)
In the case of a 2 × 2 table, a continuity correction is applied, and the statistic becomes
χ² = Σ_{i=1}^{rc} (|O_i − E_i| − 0.5)²/E_i
For the helmet data,
χ² = (|17 − 43.6| − 0.5)²/43.6 + (|130 − 103.4| − 0.5)²/103.4 + (|218 − 191.4| − 0.5)²/191.4 + (|428 − 454.6| − 0.5)²/454.6
= 15.62 + 6.59 + 3.56 + 1.50 = 27.27
It is clear that large deviations of the observed counts from the expected ones will lead to large
chi-square statistics. Thus, large values of χ2 contradict the null hypothesis. The cutoff point of
the chi-square distribution is determined by the number of degrees of freedom and the alpha level
of the test. In the case of the previous example, the number of degrees of freedom is (r − 1) × (c − 1) = (2 − 1) × (2 − 1) = 1. For α = 0.05, the point to the right of which lies 5% of the chi-square distribution with one degree of freedom is 3.84.
The chi-square test in the previous example is implemented as follows:
1. Ho: Suffering a head injury is not associated with wearing a helmet
2. Ha: There is an association between wearing a helmet and suffering a head injury
3. α = 0.05
4. Rejection rule (two-sided only): Reject Ho if the chi-square statistic is higher than χ²(1)0.05 = 3.84
Comparing the observed value of the statistic to 3.84, we reject the null hypothesis, as 27.27 is much higher than 3.84. In fact, the p value of the test is given easily by STATA as follows:
. display chiprob(1,27.27)
1.769e-07
In other words, the probability under the chi-square distribution with one degree of freedom to the right of 27.27 is 0.0000001769, or 1.769 in ten million! This is the chance of observing so large a statistic if the null hypothesis (of no association between wearing a helmet and suffering a head injury) were correct. Since this probability is smaller than α = 0.05, this is an alternative justification for rejecting the null hypothesis.
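The whole test can also be reproduced with the immediate tabulate command (a sketch; note that tabi reports the uncorrected Pearson chi-square, which is slightly larger than the continuity-corrected 27.27 computed by hand):
. tabi 17 218 \ 130 428, chi2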
| col
row | 1 2 | Total
+ +
1 | 17 218 | 235
2 | 130 428 | 558
+ +
Total | 147 646 | 793
Labeling the four cells of the table a, b, c, and d as usual, the odds ratio is estimated by
OR̂ = [ (a/(a+c)) / (c/(a+c)) ] / [ (b/(b+d)) / (d/(b+d)) ] = (a/c) / (b/d) = ad/bc
To construct statistical tests of hypotheses involving the odds ratio we must determine its distribution. The OR itself is not distributed normally; but its logarithm is.
In fact, the statistic
Z = ln(ad/bc) / √(1/a + 1/b + 1/c + 1/d)
is approximately distributed according to the standard normal distribution. Tests and confidence
intervals are derived as usual.
5. Rejection rule:
(a) One-sided alternatives
• Reject the null hypothesis if Z > z_{1−α}
• Reject the null hypothesis if Z < −z_{1−α}
(b) Two-sided alternative
Reject the null hypothesis if Z > z_{1−α/2} or if Z < −z_{1−α/2}
A (1 − α)% confidence interval for ln(OR) is
( ln(OR̂) − z_{1−α/2} √(1/a + 1/b + 1/c + 1/d) , ln(OR̂) + z_{1−α/2} √(1/a + 1/b + 1/c + 1/d) )
Thus, the (1 − α)% confidence interval of the true odds ratio is given by
( exp{ ln(OR̂) − z_{1−α/2} √(1/a + 1/b + 1/c + 1/d) } , exp{ ln(OR̂) + z_{1−α/2} √(1/a + 1/b + 1/c + 1/d) } )
where ln(x) is the natural logarithm (or logarithm base e) of x, i.e., eln(x) = x; exp(x) is the
same as ex. Finally, e ≈ 2.718.
This confidence interval can also be used to perform a hypothesis test, by inspecting whether it covers 1 (the value of the OR hypothesized under the null hypothesis).
Example: Consider the following data on use of EFM (Electronic Fetal Monitoring) and frequency of Caesarean birth deliveries. The table is as follows:
                     EFM exposure
Caesarean delivery    Yes     No   Total
Yes                   358    229     587
No                   2492   2745    5237
Total                2850   2974    5824
To test the hypothesis of no association between EFM and Caesarean delivery we compute
Z = ln(ad/bc) / √(1/a + 1/b + 1/c + 1/d)
= ln( (358 × 2745)/(229 × 2492) ) / √( 1/358 + 1/229 + 1/2492 + 1/2745 )
= ln(1.72)/0.089 = 6.107
Since Z = 6.107 > 1.96 = z0.975, we reject the null hypothesis (in favor of the two-sided alternative).
These data support a rather strong (positive) association between EFM and Caesarean births.
On the other hand, the 95% confidence interval is given by
( exp{ln(1.72) − (1.96)(0.089)} , exp{ln(1.72) + (1.96)(0.089)} ) = ( e^0.368 , e^0.716 ) = (1.44, 2.05)
Notice that 1 is not contained in the above confidence interval. This is consistent with the result of the test of hypothesis, which rejected the null hypothesis of no association between EFM exposure and risk of Caesarean section. The estimated odds of a Caesarean delivery among women monitored via EFM are from 44% higher to over double those of women who were not monitored by EFM.
The odds ratio is 1.72 with a 95% confidence interval (1.447, 2.050). Thus, the null hypothesis of
no association is rejected as both limits of the confidence interval are above 1.0.
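A sketch of the immediate STATA command for this table (the cornfield option requests Cornfield-type confidence limits, one common way an interval like the one quoted above could have been produced):
. cci 358 229 2492 2745, cornfield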
In general we have g tables (i = 1, . . . , g) that are constructed as follows (g = 2 in the previous example):
Exposure
Disease Yes No Total
Yes ai bi N1i
No ci di N2i
Total M1i M2i Ti
Confidence intervals for the combined odds ratio are of the form
( exp{ Y − z_{1−α/2} s.e.(Y ) } , exp{ Y + z_{1−α/2} s.e.(Y ) } )
where Y is the logarithm of the combined odds-ratio estimate.
Thus, under the assumption of independence (no association), P (A ∩ B) = a_i/T_i = (M_{1i}/T_i)(N_{1i}/T_i) = P (A)P (B), and finally the expected count of a_i is M_{1i}N_{1i}/T_i. A less obvious estimate of the variance of a_i is
σ_i² = M_{1i}M_{2i}N_{1i}N_{2i} / ( T_i²(T_i − 1) )
The Mantel-Haenszel test statistic is
X²_MH = [ Σ_{i=1}^g a_i − Σ_{i=1}^g m_i ]² / Σ_{i=1}^g σ_i²
where m_i = M_{1i}N_{1i}/T_i, i = 1, . . . , g, are the expected counts of diseased exposed individuals.
4. Rejection rule: Reject Ho if X²_MH > χ²(1)_{1−α}.
In the previous example, a1 = 1011, m1 = 981.3, σ1² = 29.81, a2 = 383, m2 = 358.4, σ2² = 37.69. Thus,
X²_MH = [ (1011 + 383) − (981.3 + 358.4) ]² / (29.81 + 37.69) = 43.68
Since 43.68 is much larger than 3.84, the 5% critical value of the chi-square distribution with 1 degree of freedom, we reject the null hypothesis. Coffee consumption has a significant positive association with the risk of M.I. across smokers and non-smokers.
. sort smoke
Then we carry out the M-H test, remembering that each line of the data set does not represent a single subject, but as many subjects as the number in the variable count. This is done as follows:
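A sketch of the command (it assumes the case indicator is a variable named mi and the exposure a variable named coffee; only count and smoke are named in the text above):
. cc mi coffee [freq=count], by(smoke)
The by(smoke) option produces the two stratum-specific tables, the test of homogeneity, and the combined Mantel-Haenszel estimate discussed below.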
Following the earlier strategy, the analysis can be performed from the previous output
1. Analyze the two tables separately.
The odds ratio among smokers is 2.464292, and among non-smokers is 1.955542.
2. Test of the homogeneity of the association between coffee consumption and MI
The test of homogeneity (“test for heterogeneity” in STATA) has a p-value 0.3342 > 0.05.
We do not reject the hypothesis of homogeneity in the two groups. A combined analysis can
be carried out over both smokers and non-smokers
3. Since the assumption of homogeneity was not rejected, we perform an overall (combined) analysis. From this analysis, the hypothesis of no association between coffee consumption and myocardial infarction is rejected at the 5% α level (since the M-H p-value 0.0000 < 0.05).
By inspection of the combined Mantel-Haenszel estimate of the odds ratio (2.179779), we see that the odds of M.I. among coffee drinkers (adjusting for smoking status) are over twice those of non-coffee drinkers.
Chapter 9
Analysis of Variance
Patients from 3 centers, Johns Hopkins, Rancho Los Amigos, and St. Louis, were involved in a
clinical trial. As a part of their baseline evaluations, the patients’ pulmonary performance was
assessed. A good marker of this is the Forced Expiratory Volume in one second, FEV1. The data are presented in Table 12.1 of the textbook and the STATA output below.
It was important to the investigators to ascertain whether the patients from the 3 centers had on
average similar pulmonary function before the beginning of the trial.
STATA Summary Statistics
. sort center
To address the investigators’ concerns we must compare the average pulmonary function of the
patients at the 3 sites. Since the population mean and standard deviation of the pulmonary
function at each site is not known, we must estimate them from the data.
In general, when k such groups are involved we have the following:
We must use the sample information, in order to perform inference (hypothesis testing,
confidence intervals, etc.) on the population parameters.
A statistical test addressing this question is constructed as follows:
1. Ho : µ1 = µ2 = . . . = µk
2. Ha: At least one pair is not equal
(e) Pair-wise test rejection rule: Reject Ho,l if T > t_{ni+nj−2; α∗/2}, or if T < −t_{ni+nj−2; α∗/2}.
3. Rejection rule (of the overall test): Reject Ho if any of the g pair-wise tests
rejects its null hypothesis Ho,l. Otherwise, do not reject Ho.
ii. If each of the g sub-tests is carried out at the (1 − α)% level of significance, then the level of significance of the overall test is lower than (1 − α)%.
Example: Consider the case where α = 0.05 (then the significance level is 95%) and k =
3 (then g = 3). Then if event A=“The overall test correctly rejects Ho”, and Al=“Pair-wise
test l correctly rejects Ho,l”, then P (A) = P (A1 ∩ A2 ∩ . . . ∩ Ag) = P (A1)P (A2) · · · P (Ag) = (1 − α)^g, assuming independence among sub-tests. If α = 0.05, P (A) = (1 − α)^g = (0.95)³ = 0.857 < 0.95. Consequently, the probability of a Type-I error is 1 − 0.857 = 0.143 instead of only
0.05 (under the assumption of independence). Even if the individual pair-wise tests are not
independent however, the significance level of the overall test can be much smaller than
anticipated. Thus, the two-sample t test is not totally satisfactory.
Pulmonary function example (continued):
(l = 1 ) Johns Hopkins versus Rancho Los Amigos
. ttest FEV1 if center==1 | center==2, by(center)
Two-sample t test with equal variances
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
+
Johns Ho | 21 2.62619 .1082732 .4961701 2.400337 2.852044
Rancho L | 16 3.0325 .13081 .5232399 2.753685 3.311315
+
combined | 37 2.801892 .0889105 .5408216 2.621573 2.982211
+
diff | -.4063096 .1685585 -.7485014 -.0641177
Degrees of freedom: 35
Ho: mean(Johns Ho) - mean(Rancho L) = diff = 0
. ttest FEV1 if center==1 | center==3, by(center)
Two-sample t test with equal variances
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
+
Johns Ho | 21 2.62619 .1082732 .4961701 2.400337 2.852044
St Louis | 23 2.878696 .1037809 .4977157 2.663467 3.093924
+
combined | 44 2.758182 .0765034 .5074664 2.603898 2.912466
+
diff | -.2525052 .1500002 -.5552179 .0502075
Degrees of freedom: 42
Ho: mean(Johns Ho) - mean(St Louis) = diff = 0
. ttest FEV1 if center==2 | center==3, by(center)
Two-sample t test with equal variances
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
+
Rancho L | 16 3.0325 .13081 .5232399 2.753685 3.311315
St Louis | 23 2.878696 .1037809 .4977157 2.663467 3.093924
+
combined | 39 2.941795 .0812345 .5073091 2.777344 3.106245
+
diff | .1538044 .1654468 -.1814227 .4890314
Degrees of freedom: 37
Ho: mean(Rancho L) - mean(St Louis) = diff = 0
Conclusions
1. We consider the two-sided alternative in each case, i.e., Ha,l:µi/=µj , l = 1, 2, 3.
3. Since g = 3, three pair-wise comparisons were possible. Each pair-wise test p value should be less than α∗ = α/3 ≈ 0.017 for the mean difference to be statistically significant if the overall test were carried out at the 5% α-level. If the overall test were carried out at the 10% α-level, then the p value of the pairwise comparison would have to be less than α∗ = α/3 ≈ 0.033.
4. Given that the pair-wise comparison between Rancho Los Amigos and Johns Hopkins had a p value of 0.0213, under the Bonferroni adjustment the comparison would be statistically significant only if the overall test were carried out at the 10% α-level (as 0.0213 < 0.033), but not at the 5% α-level (as 0.0213 > 0.017).
where N = Σ_{i=1}^k n_i is the total sample size, Ȳ_i. = (1/n_i) Σ_{j=1}^{n_i} Y_ij is the mean of each group i, and Ȳ.. = (1/N) Σ_{i=1}^k Σ_{j=1}^{n_i} Y_ij is the overall mean of all the observations. This variability can be divided into two parts as follows:
Σ_{i=1}^k Σ_{j=1}^{n_i} (Y_ij − Ȳ..)² = Σ_{i=1}^k Σ_{j=1}^{n_i} (Y_ij − Ȳ_i.)² + Σ_{i=1}^k n_i (Ȳ_i. − Ȳ..)² = SS_w + SS_b
where SS_w is the variability due to differences within groups and SS_b is the variability due to differences between groups.
It can be shown that these two sources of variation can be estimated as follows:
a. Variability within each of the k groups, which to a certain extent is inherent. This is estimated by
s_w² = [ (n1 − 1)s1² + (n2 − 1)s2² + . . . + (nk − 1)sk² ] / (n1 + n2 + . . . + nk − k)
where s_i² = (1/(n_i − 1)) Σ_{j=1}^{n_i} (Y_ij − Ȳ_i.)² is the sample variance in group i. Notice that this is a pooled estimate of the variance for k groups.
b. Variability due to differences between the k groups, estimated by
s_b² = (1/(k − 1)) Σ_{i=1}^k n_i (Ȳ_i. − Ȳ..)²
On the other hand, consider what happens when the k group means are not equal. Then s_b² increases (as the squared deviations from the overall mean increase), and the ratio of the between to the within variability becomes significantly larger than 1. It can be shown that the ratio of s_b² over s_w² has an F distribution with k − 1 (the number of groups minus 1) and n − k degrees of freedom (the degrees of freedom remaining from k − 1 to the total n − 1). The criterion of what is a “large” (statistically significant) deviation from 1.0 is determined by comparing the ratio to the tail of F_{k−1,n−k}, i.e., an F distribution with k − 1 numerator degrees of freedom associated with the between-groups variability and n − k denominator degrees of freedom associated with the within-group variability. Critical values of the F distribution can be found in Appendix A.5 of the textbook.
The overall test is constructed as follows:
1. Ho : µ1 = µ2 = . . . = µk
2. Ha : At least two means are not equal
3. Tests are carried out at the (1 − α)% level of significance
4. The test statistic is F = s_b²/s_w² = MS_b/MS_w
The computations are usually laid out in an analysis-of-variance table:
Source of variability   Sums of squares (SS)         df      Mean squares (MS)
Between groups          SS_b = (k − 1)s_b²           k − 1   MS_b = s_b²
Within groups           SS_w = (n − k)s_w²           n − k   MS_w = s_w²
Total                   (k − 1)s_b² + (n − k)s_w²    n − 1
3. STATA lists the Bartlett’s test for equality of the individual population variances.
This is a chi-square test, with the usual rejection rule (i.e. reject the hypothesis of
equality
of variances if the p-value listed is lower than a pre-specified α-level). From the
output above we are reassured that the hypothesis of equal group variances holds.
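The ANOVA table and Bartlett's test are produced by the oneway command; a sketch for this data set (using the FEV1 and center variables seen earlier) is:
. oneway FEV1 center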
9.2.4 Remarks
1. The ANOVA procedure is a very powerful and direct way to compare an arbitrary
number of k groups for any value of k. If k = 2, then the F -test is equivalent to a
two-sample t test. In fact, the value of the F statistic in this case is equal to the square
of the two-sample T statistic.
Note! It is not meaningful to carry out such tests if the overall hypothesis has not been re-
jected. In some borderline cases, one or more pair-wise comparison may appear statistically
significant, even though the overall hypothesis has not been rejected by the F test.
1. Carry out all pair-wise tests l such that Ho,l : µi = µj, i, j are two of the k groups, and
l = 1, . . . , g.
2. Ha,l : µi /= µj
5. Rejection rule: Reject Ho,l if T > tn−k;α∗/2 or if T < —tn−k;α∗/2 (notice the degrees
of freedom).
The option bon in the oneway command of STATA produces the following output:
Row Mean-|
Col Mean | Johns Ho Rancho L
|
Rancho L | .40631
| 0.055
|
St. Loui | .252505 -.153804
| 0.307 1.000
The g = 3 possible pair-wise comparisons between the 3 clinical centers are listed in the
STATA output. The first entry is the difference between the two group sample means.
For example, the difference between the mean pulmonary function among patients at Rancho Los Amigos (“row mean”) and Johns Hopkins (“column mean”) is x̄2 − x̄1 = 3.0325 − 2.6261905 = 0.40631. After adjusting for multiple comparisons, the p-value of the test (second entry) is 0.055. Thus, we would reject the null hypothesis that patients at Johns Hopkins and Rancho Los Amigos have the same pulmonary function levels, as measured by FEV1, at the 0.10 α-level but not at the 0.05 α-level (since 0.05 < 0.055 < 0.10).
Note that STATA simplifies carrying out the Bonferroni multiple test procedure by printing out an adjusted p-value. This means that you should compare it to the α-level of the pair-wise test, and not to α∗ = α/g. In that regard, STATA makes it unnecessary to think in terms of α∗, and we can thus consistently carry out all tests at the usual level of significance.
Chapter 10
Correlation
Consider the diphtheria, pertussis, and tetanus (DPT) immunization rates, presented on
page 398 of your textbook. Now consider the following question:
Is there any association between the proportion of newborns immunized and the
level of infant mortality?
Notice the inadequacy of chi-square-based tests for addressing this question. The data are continuous, and even in a small data set such as the one considered here, the problem is beyond the scope of any test for categorical data (continuous data have too many “categories” for such tests to be appropriate).
Example: DPT Immunization and Infant Mortality Consider the following two-way scatter
plot of the under-5 mortality rate on the y axis and the DPT levels (percent of the population
immunized) on the x axis (under five mortality rate data set).
By simple inspection of the graph it is clear that as the proportion of infants immunized
against DPT increases, the infant mortality rate decreases.
Now consider:
X: Percent of infants immunized against DPT
Y : Infant mortality (number of infants under 5 dying per 1,000 live births)
A measure of this association is the Pearson correlation coefficient ρ, the average of the
product of the standardized (normalized) deviates from the mean of each population. It is
estimated by
r = [1/(n − 1)] Σ_{i=1}^n ( (x_i − x̄)/s_x ) ( (y_i − ȳ)/s_y )
= Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^n (x_i − x̄)² Σ_{i=1}^n (y_i − ȳ)² )
. label var under5 Mortality rate per 1000 live births
. label var immunize Percent immunized
. graph under5 immunize, xlab ylab
Figure 10.1: Scatter plot of DPT immunization (percent immunized, x axis) and under-5 mortality rate (per 1,000 live births, y axis)
Consider Figures 2 and 4. The correlation coefficient is zero in the former case (Figure 2)
and greater than zero in the latter case (Figure 4). However, in neither case is the
relationship between X and Y linear. Considering the data from table 17.1 (DPT data set),
we have the following:
x̄ = (1/n) Σ_{i=1}^{20} x_i = 77.4%,  ȳ = (1/n) Σ_{i=1}^{20} y_i = 59.0 per 1,000 live births
Σ_{i=1}^{20} (x_i − x̄)(y_i − ȳ) = Σ_{i=1}^{20} (x_i − 77.4)(y_i − 59.0) = −22706
Σ_{i=1}^{20} (x_i − 77.4)² = 10630.8,  Σ_{i=1}^{20} (y_i − 59.0)² = 77498
so that
r = −22706 / √(10630.8 × 77498) = −0.79
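The arithmetic can be checked directly in STATA:
* r computed from the sums above
. display -22706/sqrt(10630.8*77498)
which should return approximately −0.7911, matching the correlation output shown below.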
(Figure: four scatter-plot panels showing relationships between x and Y of varying form, referred to above as Figures 2 and 4.)
This implies that there is a fairly substantial negative association between immunization
levels for DPT and infant mortality.
1. Ho : ρ = 0
2. (a) Ha : ρ > 0
(b) Ha : ρ < 0
(c) Ha : ρ /= 0
The test is based on the statistic t = r √( (n − 2)/(1 − r²) ), which under Ho follows a t distribution with n − 2 degrees of freedom.
5. Rejection rule:
(a) Reject Ho if t > tn−2;1−α
(b) Reject Ho if t < tn−2;α
(c) Reject Ho if t > tn−2;1−α/2 or t < tn−2;α/2
In the previous example, with α = 5% (significance level 95%),
t = r √( (n − 2)/(1 − r²) ) = −0.79 √( (20 − 2)/(1 − (−0.79)²) ) = −5.47
Since −5.47 << t_{18;0.025}, we reject the null hypothesis at the 95% level of significance. There is a statistically significant negative correlation between immunization levels and infant mortality. This means that as immunization levels rise, infant mortality decreases.
Note! We cannot estimate how much infant mortality would decrease if a country were to increase its immunization levels by, say, 10%.
Figure 10.3: Scatter plot of DPT immunization and under-5 mortality rate
             | immunize   under5
-------------+-------------------
    immunize |   1.0000
             |
      under5 |  -0.7911   1.0000
             |   0.0000
The equation relating the xi to the yi is y = α + βx + ϵ. The linear part of the relationship
between x and y is:
µy|x = α + βx
α is called the intercept of the line (because if xi = 0 the line “intercepts” the y axis at α),
and β is called the slope of the line. The additional term ϵ, is an error term that accounts
for random variability from what is expected from a linear relationship.
(Figure: four panels, I-IV, each showing a pair of straight lines with various intercepts and slopes.)
II. Both lines have the same slope (they are parallel) but different intercept.
III. Both lines have the same intercept but different negative slopes.
IV. Both lines have the same (negative) slope but different intercepts.
The appeal of a linear relationship is the constant slope. This means that for a fixed increase
∆x in x, there will be a fixed change ∆y(= β∆x). This is going to be a fixed increase if
the slope is positive, or a fixed decrease if the slope is negative, regardless of the value of
x. This is in contrast to a non-linear relationship, such as a quadratic or polynomial, where for
some values of x, y will be increasing, and for some other values y will be decreasing (or
vice versa).
Figure 11.2: Possible linear relationships between gestational age and head circumference
(x axis: gestational age, 20-35 weeks; y axis: head circumference, 20-35 cm)
Although it seems that head circumference increases with increasing gestational age, the relationship is not a perfect line. If we want to draw a line through the plotted observations that we think best describes the trends in our data, we may be confronted with many candidate lines.
11.1.1 The least-squares line
The least-squares estimates of α and β that determine the least-squares line are
β̂ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²
α̂ = ȳ − β̂ x̄
The total variability of the data around the sample mean can be decomposed as
Σ_{i=1}^n (Y_i − Ȳ)² = Σ_{i=1}^n (Y_i − Ŷ_i)² + Σ_{i=1}^n (Ŷ_i − Ȳ)²
where the first term on the right is the unexplained variability and the second the explained variability.
(Figure: the fitted line Ŷ = α̂ + β̂X, showing, for a data point (X_i, Y_i), the residual e_i = Y_i − Ŷ_i and the decomposition of Y_i − Ȳ.)
There are two parts to the total variability in the data. One part,
SSR = Σ_{i=1}^n (Ŷ_i − Ȳ)²
is explained by the linear association of x and y; the other,
SSE = Σ_{i=1}^n (Y_i − Ŷ_i)²
is left unexplained, because the regression model cannot further explain why there are still distances between the estimated points and the data (this is called the error sum of squares).
11.1.3 Degrees of Freedom
The total variability in the data is given by
Σ_{i=1}^n (Y_i − Ȳ)² = Σ_{i=1}^n (Y_i − Ŷ_i)² + Σ_{i=1}^n (Ŷ_i − Ȳ)²
that is, SSY = SSE + SSR.
1. The total sum of squares SSY = Σ_{i=1}^n (Y_i − Ȳ)² is made up of n terms of the form (Y_i − Ȳ)². Once the mean Ȳ has been estimated, however, only n − 1 terms are needed to compute SSY: the nth term is determined by the rest, since Σ_{i=1}^n (Y_i − Ȳ) = 0 for any sample. Thus, the degrees of freedom1 of SSY are n − 1.
2. On the other hand, it can be shown that the sum of squares due to regression, SSR, is
computed from a single function involving β and has thus only one degree of freedom
associated with it (which is “expended” in the estimation of β).
4. For every value x, the standard deviation of the outcomes y is constant (and equal to σ_{y|x}). This property is called homoscedasticity.
Within each regression the primary interest is the assessment of the existence of the linear
relationship between x and y. If such an association exists, then x provides information
about y.
Inference on the existence of the linear association is accomplished via tests of hypotheses and confidence intervals. Both of these center around the estimate of the slope, β̂, since it is clear that if the slope is zero, then changing x will have no impact on y (thus there is no association between x and y).
1
You can think of the degrees of freedom as unique pieces of information
Source of variability   Sums of squares (SS)   df      Mean squares (MS)    F             Reject Ho if
Regression              SSR                    1       MSR = SSR/1          F = MSR/MSE   F > F_{1,n−2;1−α}
Residual (error)        SSE                    n − 2   MSE = SSE/(n − 2)
Total                   SSY                    n − 1
5. Rejection rule: Reject Ho, if F > F1,n−2;1−α. This will happen if F is far from unity
(just like in the ANOVA case).
The F test of linear association tests whether a line (other than the horizontal one going through the sample mean of the Y ’s) is useful in explaining some of the variability of the data. The test is based on the observation that, under the null hypothesis, MSR ≈ σ²_{y|x} and MSE ≈ σ²_{y|x}. If the population regression slope β ≈ 0, that is, if the regression does not add anything new to our understanding of the data (i.e., does not explain a substantial part of the total variability), the two mean squares MSR and MSE estimate a common quantity (the population variance σ²_{y|x}), and thus the ratio should be close to 1 when there is no linear association between X and Y. On the other hand, if a linear relationship exists (β is far from zero), then MSR > MSE and the ratio will deviate significantly from 1.
Tests and confidence intervals for the slope are based on the statistic T = β̂/s.e.(β̂) ~ t_{n−2}, with
s.e.(β̂) = s_{y|x} / √( Σ_{i=1}^n (x_i − x̄)² )
where s_{y|x} is the estimate of σ_{y|x}, and
s_{y|x} = √( Σ_{i=1}^n (y_i − ŷ_i)² / (n − 2) ) = √MSE
One-sided confidence intervals are constructed in a similar manner.
On some occasions, tests involving the intercept are carried out. Both hypothesis tests and confidence intervals are based on the standard error
s.e.(α̂) = s_{y|x} √( 1/n + x̄² / Σ_{i=1}^n (x_i − x̄)² )
The statistic is
T = α̂ / s.e.(α̂) ~ t_{n−2}
A. Degrees of freedom. There is one degree of freedom associated with the model, and
n — 2 = 98 degrees of freedom comprising the residual.
B. F test. This is the overall test of the hypothesis of no linear association. Note that
the numerator degrees of freedom for this test are the model degrees of freedom, while
the denominator degrees of freedom for this test are the residual (error) degrees of
freedom. This test is identical to the F test in the multi-mean comparison case, i.e., it
measures deviations from unity.
Figure 11.4: Output of the low birth weight data (the letters A-G mark the parts of the output discussed in the text)
. reg headcirc gestage

headcirc |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+------------------------------------------------------------------
 gestage |   .7800532   .0630744    12.367   0.000      .6548841    .9052223
   _cons |   3.914264   1.829147     2.140   0.035      .2843818    7.544146
C. Rejection rule of the F test. Since the p-value is 0.000 << 0.05 = α, we reject the null hypothesis. There appears to be strong evidence of a linear association between gestational age and head circumference of the newborn.
D. Root MSE. This is the square root of the mean square error and, as mentioned before, can be used as an estimate of σ_{y|x}: here s_{y|x} = 1.5904.
E. gestage is the estimate of the slope, β̂ = 0.7800532. This means that for each additional week of gestational age, the head circumference increases by 0.78 cm on average. _cons is the estimate of the intercept, α̂ = 3.914264. It means that at gestational age zero, the head circumference is approximately 3.91 cm (this is of course not true). Note that the fact that the model fails at gestage = 0 does not mean that it is not useful, or that the linear association is not valid. Normally we would be interested in a finite range of values within which the linear relationship would be both useful and valid. This is one of those cases.
F. p-value of the t test described above. Since 0.000 << 0.05, we reject the null hypothesis. There is strong evidence of a positive linear relationship between gestational age and head circumference of a newborn. Notice also that the value of the t statistic squared, 12.367² ≈ 152.9, equals the value of the F statistic: the F test of overall linear association and the t test of zero slope are equivalent in simple linear regression. This is not the case in multiple regression.
G. The confidence interval for β. The 95% confidence interval is the default. In the previous example, the 95% confidence interval for the population slope is [0.6548841, 0.9052223]. Since this interval excludes 0, we reject the null hypothesis and conclude that there is a strong positive linear relationship between gestational age and head circumference of a newborn.