
1 Lecture 1 – Preliminaries

Textbook Sections: Chapter 1, Sections 2.1,2.2,3.1,3.2


Problems: 1.3,1.5,1.7,2.3,2.4,2.13,3.20,3.25,3.45a-d

These notes are intended to simultaneously review and extend the basic concepts of STA 2023
that are used in business applications. In this section, we describe the notions of:

• Populations and Samples

• Descriptive and Inferential Statistics

• Variable Types

• Tabular and Graphical Descriptions

• Numerical Descriptive Measures

1.1 Populations and Samples


Populations are collections of individuals or items of interest to a researcher. We are typically
concerned with one or more characteristics of the elements of the population. Examples include:

PO1 All firms listed on the New York Stock Exchange (NYSE) throughout year 2001.

PO2 All living graduates of the University of Florida.

PO3 All pairs of Levi’s 550 jeans produced in January, 2002.

Samples are subsets of their corresponding populations, used to describe or make inferences
concerning particular characteristics of the elements of the population. Examples include:

SA1 30 firms sampled at random from all firms listed on NYSE throughout 2001.

SA2 100 UF graduates sampled from alumni records.

SA3 A randomly selected set of 250 pairs of Levi’s 550 jeans produced in January, 2002.

1.2 Descriptive and Inferential Statistics


Descriptive Statistics — Methods used to describe a group of measurements (e.g. mean, median,
standard deviation, proportion (percent) with some characteristic). Examples include (Sources:
Wall Street Journal Almanac, 1999; US Statistical Abstract, 1992):

• Average daily shares traded NYSE (millions):


1980 – 49.8 1985 – 121.26 1990 – 176.0 1995 – 361.9

• Median earnings for year round full–time workers (1990, in $1000):


Males – 27.7 Females – 19.8

• Mean and standard deviation of heights of adults 25–34 years old:


Females – Mean=63.5”, std. dev=2.5” Males – Mean=68.5”, std dev=2.7”
• Percent of families in US living below poverty level:
1970 – 12.6 1990 – 13.5

Inferential Statistics — Methods used to reach conclusions (decisions) concerning a popula-


tion, based on measurements from a sample. Examples include:

• A sample of 2007 American adults was asked if they thought there would be a recession in
the next five years. Of those sampled, 66% answered “Yes”. Based on this sample we can be
very confident that a majority of American adults feel there will be a recession in the next
five years. (Source: WSJ, 6/27/97, p. R1).
• After determining a safe dosing regimen, drug manufacturers must demonstrate efficacy of
new drugs by comparing them with a placebo or standard drug in large–scale Phase III trials.
In one such trial for the antidepressant Prozac (Eli Lilly & Co), researchers measured the
change from baseline in Hamilton Depression (HAM–D) scale. Based on a sample of 185
patients receiving Prozac, the mean change (improvement) was 11.0 points, and among 169
patients receiving placebo, the mean change was 8.2 points. Based on these samples, we can
conclude that mean change from baseline in all patients receiving Prozac would be higher
than the mean change from baseline in all patients receiving placebo at a very high level of
confidence.

1.3 Levels of Data Measurement


Nominal — Variable’s levels have no distinct ordering. Examples:
• Type of business (cyclical,non–cyclical,utility,. . . )
• Sex (female,male)

• Brand of beer purchased (Bud,Miller,Coors,. . . )


Ordinal — Levels can be ordered, but distances between levels are indeterminable. Examples:
• Product quality (poor,fair,good,excellent)
• Standard & Poor’s Corporate bond rating (AAA,AA,A,BBB,BB,B,CCC,CC,C,D)

• Response to test drug (death,extensive deterioration,moderate/slight deterioration,no change,mod/slight


improvement,extensive improvement).

• Placement in sales rankings (1st, 2nd, …, last).


Interval — Measurements fall along a numerical scale, such that distances between levels have
meaning. Examples:
• Quarterly corporate profits (in dollars)

• Time to assemble an automobile (in minutes)


• Number of items sold by a salesman in one month (units)

• Number of defective computer keyboards produced by a worker on a given day.


Ratio — Same as interval, but also containing an absolute 0, so that ratios, as well as distances
have meaning. All examples above, except profits (which can be negative) are ratio scale.
1.4 Tabular and Graphical Distributions
Frequency distributions are lists of “classes” of levels of a variable, and the number of observed
outcomes within that range. Relative frequency distributions can be obtained for data of any level
of measurement. They can be depicted in tabular or graphical form.

Example 1.1 – 1994 Florida County Data


Table 1.4 gives the population, total income (in $1000s), per capita income (in $1000s), and
retail sales (in $1000s) for Florida’s 67 counties in 1994 (Source: U.S. Census Bureau).
The “classes” chosen for the frequency distribution for per capita income (in $1000s) are 5-10,
10–15, etc. If any observation fell right on a “breakpoint” between classes, it was assigned to the
upper class. The computer output below gives the following distributions for the 67 counties in
this dataset:

Frequency — Labelled “Frequency”, this gives the list of the numbers of counties falling in the
various categories.

Relative Frequency — Labelled “Percent”, this gives the percentage of the counties falling in
each of the categories.

Cumulative Frequency — Labelled “Cumulative Frequency”, this gives the number of counties
falling in or below this category.

Relative Cumulative Frequency — Labelled “Cumulative Percent”, this gives the percent of
counties falling in or below this category.

Cumulative Cumulative
pci94 Frequency Percent Frequency Percent
---------------------------------------------------------
5-10 1 1.49 1 1.49
10-15 21 31.34 22 32.84
15-20 28 41.79 50 74.63
20-25 10 14.93 60 89.55
25-30 3 4.48 63 94.03
30-35 4 5.97 67 100.00
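These columns can be reproduced directly from the raw per capita incomes in Table 1.4. Below is a minimal Python sketch (assuming numpy is available; the variable names are ours). Note that numpy's histogram classes are half-open, which matches the rule of assigning breakpoint values to the upper class:

import numpy as np

# Per capita incomes ($1000s) for the 67 Florida counties (Table 1.4).
pci = np.array([
    19.382, 14.852, 17.891, 13.248, 19.605, 24.643, 12.446, 19.077, 16.077,
    18.616, 30.670, 15.020, 20.142, 16.357, 12.090, 20.707, 17.636, 15.708,
    15.340, 14.363, 12.553, 14.675, 14.385, 12.230, 15.643, 17.246, 15.987,
    17.598, 20.242, 12.709, 29.108, 15.187, 14.515, 13.526, 18.300, 22.060,
    19.791, 13.761, 14.358, 13.002, 22.954, 16.663, 32.023, 25.391, 20.918,
    18.969, 14.939, 20.404, 16.219, 33.337, 16.912, 24.848, 17.841, 14.266,
    25.657, 16.488, 17.121, 30.275, 21.817, 14.594, 14.812, 15.472, 9.455,
    17.715, 15.510, 14.408, 13.820])

bins = np.arange(5, 40, 5)                # class breakpoints 5, 10, ..., 35
freq, _ = np.histogram(pci, bins=bins)    # classes are [5,10), [10,15), ...
pct = 100 * freq / freq.sum()
print("class  freq   pct    cum.freq  cum.pct")
for lo, f, p, cf, cp in zip(bins[:-1], freq, pct, freq.cumsum(), pct.cumsum()):
    print(f"{lo}-{lo+5:<4} {f:4d} {p:7.2f} {cf:8d} {cp:8.2f}")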

Various graphs are useful in describing bodies of data, and are often given in business reporting.

Histograms – Vertical bar charts that identify categories for categorical variables and ranges of
values for interval scale variables, with heights representing frequencies of outcomes for a
single variable.

Pie Charts – Circular graphs where the size of each “slice” represents the frequency for a partic-
ular category or range of values.

Scatter Plots – Plots of pairs of outcomes on two variables, where each point on the graph
represents a single element from a set of data.

Time Series Plots – Plot of a single variable that is measured over a series of points in time.
Data Maps – Plot of a single variable that is measured over a series of points in 2-dimensional
space.

Table 1.4: 1994 Florida county data (income and sales in $1000s)

                        Total       Per Capita  Retail
County       Population Income      Income      Sales
Alachua 193353 3747486 19.382 305841
Baker 19786 293855 14.852 14637
Bay 139507 2495859 17.891 243300
Bradford 24004 318011 13.248 17103
Brevard 442637 8677944 19.605 574026
Broward 1386497 34167902 24.643 2493269
Calhoun 11738 146096 12.446 9410
Charlotte 125832 2400459 19.077 156605
Citrus 104035 1672558 16.077 102660
Clay 120257 2238700 18.616 154681
Collier 177778 5452519 30.670 361542
Columbia 47886 719253 15.020 57166
Dade 2012237 40530049 20.142 3251235
De Soto 25074 410138 16.357 21529
Dixie 11706 141523 12.090 8725
Duval 702846 14553773 20.707 1206452
Escambia 272187 4800237 17.636 391410
Flagler 37818 594050 15.708 26894
Franklin 9906 151954 15.340 8930
Gadsden 43021 617896 14.363 30312
Gilchrist 11929 149748 12.553 3492
Glades 7615 111747 14.675 2548
Gulf 13041 187590 14.385 8380
Hamilton 11570 141496 12.230 6139
Hardee 21611 338067 15.643 18125
Hendry 29325 505741 17.246 27234
Hernando 117141 1872699 15.987 106199
Highlands 73685 1296740 17.598 82240
Hillsborough 871046 17631999 20.242 1546211
Holmes 16933 215197 12.709 9701
Indian River 95250 2772529 29.108 148059
Jackson 43787 664992 15.187 45529
Jefferson 12761 185227 14.515 7127
Lafayette 5873 79439 13.526 2237
Lake 173250 3170498 18.300 188032
Lee 367322 8103201 22.060 637949
Leon 211763 4190977 19.791 358281
Levy 28827 396698 13.761 24600
Liberty 6257 89835 14.358 2405
Madison 17197 223594 13.002 12120
Manatee 226289 5194196 22.954 308061
Marion 219358 3655070 16.663 290072
Martin 109027 3491389 32.023 181558
Monroe 81460 2068322 25.391 195625
Nassau 49496 1035360 20.918 52127
Okaloosa 160725 3048783 18.969 250499
Okeechobee 31036 463635 14.939 32529
Orange 740474 15108479 20.404 1598855
Osceola 126386 2049838 16.219 197719
Palm Beach 959721 31994145 33.337 1674647
Pasco 298677 5051203 16.912 282914
Pinellas 865364 21502994 24.848 1499770
Polk 429408 7661229 17.841 746285
Putnam 68598 978635 14.266 55690
St. Johns 98214 2519924 25.657 131375
St. Lucie 169116 2788362 16.488 178863
Santa Rosa 99003 1695027 17.121 68145
Sarasota 291722 8831912 30.275 540520
Seminole 323719 7062419 21.817 507307
Sumter 33367 486950 14.594 26922
Suwannee 29489 436779 14.812 36258
Taylor 17332 268153 15.472 19028
Union 12193 115280 9.455 3937
Volusia 403899 7154872 17.715 561530
Wakulla 16665 258477 15.510 9233
Walton 32677 470799 14.408 32274
Washington 17984 248533 13.820 12653

A histogram of per capita income is given in Figure 1. We see that most counties are in the
range of $10,000 to $25,000 (the second, third, and fourth ranges of values), with one county lower
than this range, and the seven most affluent counties being above this range. A pie chart of the
same data is given in Figure 2.
A scatter plot of retail sales (on the vertical or up/down axis) versus total income (on the
horizontal or left/right axis) is given in Figure 3. A tendency for counties with higher total incomes
to have higher retail sales can be seen. This is considered to be a positive association.
A data map of per capita income is given in Figure 4. We can see visually where the most
affluent and poorest counties are.
A time series plot of monthly average airfares (per 1000 miles of domestic flights) is given in
Figure 5 for the period January 1980 through December 2001 (Source: Air Transport Association).
We observe periodic trends (as demand shifts throughout the year) as well as longer term cycles;
however, the series shows only a very small long-term increase in trend. These prices are
not adjusted for inflation and are called nominal prices (not to be confused with nominal variable
types). Figure 6 gives the series adjusted for inflation, showing that real prices have decreased
over this period. Figure 7 gives the monthly consumer price index (CPI) over this 22 year (264
month) period (Source: US Department of Commerce).

1.5 Parameters and Statistics


Parameters are numerical descriptive measures corresponding to populations. We will use the
general notation θ to represent parameters. Special cases include:

µ Population mean — The average value of all elements in the population. It is also considered
the ‘long–run’ average measurement in terms of conceptual populations. It can be thought of
as the value each unit would receive if the total of the outcomes had been evenly distributed
among the units.

σ 2 Population variance — Measure of spread (around µ) of the elements of the population.

P Population proportion — The proportion of all elements of the population that possess a partic-
ular characteristic.

µ1 − µ2 The difference between 2 population means.

P1 − P2 The difference between 2 population proportions.

Examples related to previous scenarios, as well as new ones include:

PA1 The proportion of all NYSE listed firms whose stock value increased in 2001 (P ).

PA2 The proportion of all living UF graduates who are members of the alumni association (P ).

PA3 The mean number of flaws in all pairs of Levi’s 550 jeans manufactured in January, 2002 (µ).

PA4 The proportion of all people who have (or will have) a disease that show remission due to
drug treatment (P ).

PA5 The difference between mean lifetimes of two brands of automobile tires (µ1 − µ2 ).

Figure 1: Frequency histogram of per capita incomes among Florida counties



Figure 2: Pie chart of per capita incomes among Florida counties



Figure 3: Scatter plot of retail sales versus total income among Florida counties

Figure 4: Data map of per capita incomes among Florida counties



Figure 5: Monthly nominal (unadjusted for inflation) airfares (price per 1000 miles) on domestic
flights

Figure 6: Monthly real (adjusted for inflation) airfares (price per 1000 miles) on domestic flights

Figure 7: Monthly Consumer Price Index (CPI-U) for all goods


PA6 The difference in the proportions of all men and women who have made credit card purchases
over the internet (P1 − P2 ).

Statistics are numerical descriptive measures corresponding to samples. We will use the general
notation θ̂ to represent statistics. Special cases include:

Mode — Outcome that occurs most often. It is usually reported for nominal or ordinal variables, or
simply as the peak of the distribution when the variable is continuous.

Median — Middle value (after the numbers have been sorted from smallest to largest). Can be
reported for ordinal or interval scale data. Let X(1) be the smallest, X(n) be the largest, and
X(i) be the ith ordered observation in a sample of n items:

n even: Median = M = (X(n/2) + X(n/2+1))/2

n odd: Median = M = X((n+1)/2)

X̄ Sample mean — The average value of the elements of the sample:

X̄ = (ΣXi)/n    (summing over i = 1, …, n)

S² Sample variance — Measure of the spread (around X̄) of the elements of the sample:

S² = Σ(Xi − X̄)²/(n − 1) = (ΣXi² − nX̄²)/(n − 1)

S Sample standard deviation — Measure of the spread (around X̄) of the elements of the sample:

S = √[Σ(Xi − X̄)²/(n − 1)] = √[(ΣXi² − nX̄²)/(n − 1)]

p̂ Sample proportion — The proportion of elements in the sample that have a particular charac-
teristic:

p̂ = X/n = (# of elements with the characteristic (successes))/(# of elements in the sample (trials))

X̄1 − X̄2 — The difference between two sample means.

p̂1 − p̂2 — The difference between two sample proportions.

Examples related to previous scenarios, as well as new ones include:

ST1 Among a random sample of n = 50 firms listed on the NYSE in 2001, 18 (p̂ = 18/50 = 0.36)
had stock prices increase during 2001.

ST2 Among a sample of 200 UF graduates, 44 are paying members of the alumni association
(p̂ = 44/200 = 0.22).

ST3 A quality inspector samples 60 pairs of Levi’s 550 jeans, and finds a total of 66 flaws, yielding
an average of X̄ = 66/60 = 1.10 flaws per pair of jeans.

ST4 Of 20 patients selected with a particular disease, 12 (p̂ = 12/20 = .60) show some remission
after drug treatment.

ST5 Samples of 20 tires from each of two manufacturers are obtained, and the number of miles
run until the tread is worn to the legal limit is measured. Brand A has an average of
X̄1 = 27,459 miles, while Brand B has an average of X̄2 = 32,671 miles. The difference
between the two brands’ sample means is X̄1 − X̄2 = 27,459 − 32,671 = −5,212 miles.

ST6 Independent samples of male and female consumers find that among males, p̂1 = 0.26 have
made credit card purchases over the internet. Among females, p̂2 = 0.44 have made credit
card purchases on the internet.
Statistics based on samples will be used to estimate parameters corresponding to populations, as
well as test hypotheses concerning the true values of parameters.

Example 1.2 – Closing Prices for Stocks: 3/5/2002

A sample of n = 5 firms is obtained from the NYSE, and their closing prices are given in
Table 1.5. We then compute the sample mean, median, variance, and standard deviation, where
Xi is the closing price for firm i.

Firm (i)        Xi      Rank    Xi²       Xi − X̄            (Xi − X̄)²
Coca-Cola (1)   47.60   4       2265.76   47.6−39.5=8.1     (8.1)² = 65.61
GE (2)          40.50   2       1640.25   40.5−39.5=1.0     (1.0)² = 1.00
Pfizer (3)      40.60   3       1648.36   40.6−39.5=1.1     (1.1)² = 1.21
Sony (4)        50.30   5       2530.09   50.3−39.5=10.8    (10.8)² = 116.64
Toys R Us (5)   18.50   1       342.25    18.5−39.5=−21.0   (−21.0)² = 441.00
Sum (Σ)         197.50          8426.71   0.00              625.46

The sample mean, X̄, is computed as follows:

X̄ = (ΣXi)/n = 197.50/5 = 39.50

The median is the ((n + 1)/2)th = ((5 + 1)/2)th = 3rd ordered outcome, which is Pfizer’s closing
price (not because i = 3, but because its rank is 3): M = X(3) = 40.60. The sample
variance, S², and sample standard deviation, S, can be computed in two ways: the definitional
form and the shortcut form. The definitional form is:

S² = Σ(Xi − X̄)²/(n − 1) = 625.46/(5 − 1) = 156.37        S = +√S² = +√156.37 = 12.50

The shortcut form is:

S² = (ΣXi² − nX̄²)/(n − 1) = (8426.71 − 5(39.50)²)/(5 − 1) = (8426.71 − 7801.25)/4 = 156.37        S = +√156.37 = 12.50
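These hand calculations can be verified with a few lines of Python (a minimal sketch using only the standard library; note that statistics.variance and statistics.stdev divide by n − 1, matching S² and S):

import statistics

prices = [47.60, 40.50, 40.60, 50.30, 18.50]   # closing prices from Example 1.2
print(statistics.mean(prices))       # X-bar = 39.50
print(statistics.median(prices))     # M = 40.60
print(statistics.variance(prices))   # S^2 = 156.365 (156.37 rounded)
print(statistics.stdev(prices))      # S   = 12.50 (approximately)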
2 Lecture 2 — Probability
Textbook Sections: 4.4–4.8
Problems: 4.25,4.27,4.29,4.31,4.33,4.39,4.43

Probability is used to measure the ‘likelihood’ or ‘chances’ of certain events (prespecified


outcomes) of an experiment. Certain rules of probability will be used in this course and are reviewed
here. We first will define 2 events A and B, with probabilities P (A) and P (B), respectively. The
intersection of events A and B is the event that both A and B occur, the notation being AB
(sometimes written A ∩ B). The union of events A and B is the event that either A or B occur,
the notation being A ∪ B. The complement of event A is the event that A does not occur, the
notation being Ā. Some useful rules on obtaining these and other probabilities include:

• P (A ∪ B) = P (A) + P (B) − P (AB)


• P(A|B) = P(A occurs given B has occurred) = P(AB)/P(B) (assuming P(B) > 0)

• P (AB) = P (A)P (B|A) = P (B)P (A|B)

• P(Ā) = 1 − P(A)

A special case occurs when events A and B are independent. This is when P(A|B) = P(A), or
equivalently P(B|A) = P(B); in this situation, P(AB) = P(A)P(B). We will be using
this idea later in this course.

Example 2.1 – Phase III Clinical Trial for Pravachol


Among a population of adult males with high cholesterol, approximately half of the males were
assigned to receive Pravachol (Bristol–Myers Squibb), and approximately half received a placebo.
The outcome observed was whether or not the patient suffered from a cardiac event within five
years of beginning treatment. The counts of patients falling into each combination of treatment
and outcome are given in Table 1.

                              Cardiac Event
Treatment       Present (B)   Absent (B̄)   Total
Pravachol (A)   174           3128          3302
Placebo (Ā)     248           3045          3293
Total           422           6173          6595

Table 1: Numbers of patients falling in each treatment/cardiac outcome combination (Source:


NEJM, 11/16/95, pp 1301–1307)

If we define the event A to be that the patient received pravachol, and the event B to be that
the patient suffers from a cardiac event over the study period, we can use the table to obtain some
pertinent probabilities:

1. P(A) = P(AB) + P(AB̄) = (174/6595) + (3128/6595) = 3302/6595 = .5007

2. P(Ā) = P(ĀB) + P(ĀB̄) = (248/6595) + (3045/6595) = 3293/6595 = .4993

3. P(B) = P(AB) + P(ĀB) = (174/6595) + (248/6595) = 422/6595 = .0640

4. P(B̄) = P(AB̄) + P(ĀB̄) = (3128/6595) + (3045/6595) = 6173/6595 = .9360

5. P(AB) = 174/6595 = .0264

6. P(ĀB) = 248/6595 = .0376

7. P(A ∪ B) = P(A) + P(B) − P(AB) = .5007 + .0640 − .0264 = .5383

8. P(B|A) = P(AB)/P(A) = .0264/.5007 = .0527

9. P(B|Ā) = P(ĀB)/P(Ā) = .0376/.4993 = .0753
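Each of these probabilities is a ratio of counts from Table 1, so they are straightforward to verify programmatically. A minimal Python sketch (the variable names are ours):

# Counts from Table 1: A = Pravachol, B = cardiac event present.
n_AB, n_ABc = 174, 3128       # Pravachol row: event, no event
n_AcB, n_AcBc = 248, 3045     # Placebo row:   event, no event
n = n_AB + n_ABc + n_AcB + n_AcBc            # 6595

P_A = (n_AB + n_ABc) / n                     # .5007
P_B = (n_AB + n_AcB) / n                     # .0640
P_AB = n_AB / n                              # .0264
print(P_A + P_B - P_AB)                      # P(A or B) = .5383
print(P_AB / P_A)                            # P(B|A)    = .0527
print((n_AcB / n) / ((n_AcB + n_AcBc) / n))  # P(B|Abar) = .0753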

2.1 Bayes’ Rule


Sometimes we can easily obtain probabilities of the form P (A|B) and P (B) and wish to obtain
P (B|A). This is very important in decision theory with respect to updating information. We start
with a prior probability, P (B), we then observe an event A, and obtain P (A|B). Then, we update
our probability of B in light of knowledge that A has occurred.
In the case of B only having two possible outcomes: B and B, Bayes’ rule can be stated as
follows:
P(B|A) = P(AB)/P(A) = P(AB)/[P(AB) + P(AB̄)] = P(A|B)P(B)/[P(A|B)P(B) + P(A|B̄)P(B̄)]

In general if B has k possible (mutually exclusive and exhaustive) outcomes, the rule can be
stated as follows:
P(Bj|A) = P(ABj)/P(A) = P(ABj)/[Σi P(ABi)] = P(A|Bj)P(Bj)/[Σi P(A|Bi)P(Bi)]    (sums over i = 1, …, k)

Example 2.2 – Moral Hazard

A manager cannot observe whether her salesperson works hard. She believes based on prior
experience that the probability her salesperson works hard (H) is 0.30. She believes that if the
salesperson works hard, the probability a sale (S) is made is 0.75. If the salesperson does not work
hard, the probability the sale is made is 0.15.
What is the probability that the salesperson worked hard if the sale was made? If not made?

• Pr{Works Hard} = P(H) = 0.30    Pr{Not Works Hard} = P(H̄) = 1 − 0.30 = 0.70

• Pr{Makes Sale | Works Hard}=P (S|H) = 0.75

• Pr{Makes Sale | Not Works Hard} = P(S|H̄) = 0.15

P(H|S) = P(HS)/P(S) = P(S|H)·P(H)/[P(S|H)·P(H) + P(S|H̄)·P(H̄)]
       = 0.75(0.30)/[0.75(0.30) + 0.15(0.70)] = 0.225/(0.225 + 0.105) = 0.225/0.330 = 0.68

P(H|S̄) = P(S̄|H)·P(H)/[P(S̄|H)·P(H) + P(S̄|H̄)·P(H̄)]
       = 0.25(0.30)/[0.25(0.30) + 0.85(0.70)] = 0.075/(0.075 + 0.595) = 0.075/0.670 = 0.11
Note the amount of updating of the probability the salesperson worked hard, depending on whether
the sale was made.
This is a simplistic example of a theoretical area in information economics (See e.g. D.M. Kreps,
A Course in Microeconomic Theory, Chapter 16).
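The same updating step applies to any two-outcome partition. A minimal Python sketch of Bayes’ rule, checked against the moral hazard numbers above (the function name bayes_update is ours):

def bayes_update(prior, like_given_B, like_given_notB):
    """Return P(B|A) from P(B), P(A|B), and P(A|Bbar) via Bayes' rule."""
    num = like_given_B * prior
    return num / (num + like_given_notB * (1 - prior))

# P(H|S): prior P(H) = 0.30, P(S|H) = 0.75, P(S|Hbar) = 0.15
print(bayes_update(0.30, 0.75, 0.15))   # 0.6818... (0.68 in the text)
# P(H|Sbar): P(Sbar|H) = 1 - 0.75 = 0.25, P(Sbar|Hbar) = 1 - 0.15 = 0.85
print(bayes_update(0.30, 0.25, 0.85))   # 0.1119... (0.11 in the text)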

Example 2.3 – O.J. Simpson’s DNA


In the O.J. Simpson murder trial, it was stated that 0.43% (proportion=.0043) of blood samples
taken from all victims and suspects observed by the LA police department match the blood taken
from the murder scene of Nicole Brown Simpson and Ronald Goldman. We will assume that this
is representative of the fraction of people in the general population whose blood types match the
blood at the crime scene. Define the following events:

A — A randomly selected person’s blood matches that found at the crime scene

B — A person is innocent of the murders

B̄ — A person is guilty of the murders

Assume that a guilty person’s blood will match with that at the crime scene with certainty.
In terms of diagnostic testing, the sensitivity of this test is 100% and the specificity of the test is
99.57%. That is:

P(A|B̄) = 1    P(A|B) = .0043 = 1 − .9957

Suppose you had a prior (to observing blood evidence) probability that O.J. was innocent of
0.5 (P (B) = 0.5). You now find out that his blood matches that at the crime scene. What is your
updated probability that he is innocent (ignoring possibility of tampering)?

P(B|A) = 0.5(.0043)/[0.5(.0043) + (1 − 0.5)(1)] = .00215/(.00215 + .5) = .00215/.50215 = .0043

Repeat for prior probabilities of 0.9 and 0.1.


Source: Forst B. (1996). “Evidence, Probabilities and Legal Standards for the Determination
of Guilt: Beyond the O.J. Trial.” In Representing O.J.: Murder, Criminal Justice, and the Mass
Culture, ed. G. Barak, pp 22-28. Guilderland, N.Y.: Harrow and Heston.
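The requested computations for priors of 0.9 and 0.1 follow the same pattern; a minimal Python sketch loops over all three priors:

# P(A|B) = .0043 (innocent person's blood matches); P(A|Bbar) = 1 (guilty match).
for prior_innocent in (0.1, 0.5, 0.9):
    num = 0.0043 * prior_innocent
    posterior = num / (num + 1.0 * (1 - prior_innocent))
    print(f"prior P(B) = {prior_innocent:.1f} -> P(B|A) = {posterior:.5f}")
# The prior of 0.5 reproduces the posterior of .0043 derived above.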

Example 2.4 – Adverse Selection (Job Market Signaling)


Consider a simple model where there are two types of workers – low quality and high quality.
Employers are unable to determine the worker’s quality type. The workers choose education levels
to signal to employers their quality types. Workers can either obtain a college degree (high education
level) or not obtain a college degree (low education level). The effort of obtaining a college degree is
lower for high quality workers than for low quality workers. Employers pay higher wages to workers
with higher education levels, since this is a (imperfect) signal for their quality types.
Suppose you know that in the population of workers, half are low quality and half are high
quality. Thus, prior to observing a potential employee’s education level, the employer thinks the
probability the worker will be high quality is 0.5. Among high quality workers, 70% will pursue a
college degree (30% do not pursue a degree), and among low quality workers, 20% pursue a college
degree (80% do not).
Let Q be the event a worker is high quality, and Q be the event the worker is low quality.
Further, let E be the event the worker obtains a college degree, and E be the event that the worker
does not obtain a college degree. Then we are given the following probabilities from the previous
problem description:
P(Q) = 0.5    P(Q̄) = 1 − P(Q) = 1 − 0.5 = 0.5    P(E|Q) = 0.70    P(E|Q̄) = 0.20
What is the probability a worker is high quality, given (s)he has a college degree?
P(Q|E) = P(Q)·P(E|Q)/[P(Q)·P(E|Q) + P(Q̄)·P(E|Q̄)] = 0.5(0.70)/[0.5(0.70) + 0.5(0.20)] = 0.35/(0.35 + 0.10) = 0.35/0.45 = 0.78
What is the probability a worker is high quality, given (s)he does not have a college degree?
P(Q|Ē) = P(Q)·P(Ē|Q)/[P(Q)·P(Ē|Q) + P(Q̄)·P(Ē|Q̄)] = 0.5(0.30)/[0.5(0.30) + 0.5(0.80)] = 0.15/(0.15 + 0.40) = 0.15/0.55 = 0.27
This is a simplistic example of a theoretical area in information economics (See e.g. D.M. Kreps,
A Course in Microeconomic Theory, Chapter 17).
Example 2.5 – Cholera and London Water Companies
Epidemiologist John Snow conducted a massive survey during a cholera epidemic in London
during 1853-1854. He found that water was being provided through the pipes of two companies:
Southwark & Vauxhall (S&V) and Lambeth (L). Apparently, the Lambeth company was obtaining
its water from the Thames River upstream of the London sewer outflow, while the S&V company
got its water near the sewer outflow.
Table 2 gives the numbers (or counts) of people who died of cholera and who did not, separately
for the two firms.
Outcome Lambeth S&V Row Total
Cholera Death 407 3702 4109
No Cholera Death 170956 261211 432167
Column Total 171363 264913 436276

Table 2: John Snow’s London cholera results

a) What is the probability a randomly selected person received water from the Lambeth com-
pany? From the S&V company?

b) What is the probability a randomly selected person died of cholera? Did not die of cholera?

c) What proportion of the Lambeth consumers died of cholera? Among the S&V consumers?
Is the incidence of cholera death independent of firm?

d) What is the probability a person received water from S&V, given (s)he died of cholera?

Source: W.H. Frost (1936). Snow on Cholera, London, Oxford University Press.
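A minimal Python sketch of parts a)–d), working directly from the counts in Table 2 (use it to check your own answers; the variable names are ours):

# Counts from Table 2.
deaths = {"Lambeth": 407, "S&V": 3702}
no_deaths = {"Lambeth": 170956, "S&V": 261211}
total = sum(deaths.values()) + sum(no_deaths.values())            # 436276

for firm in ("Lambeth", "S&V"):
    p_firm = (deaths[firm] + no_deaths[firm]) / total             # part a)
    p_death_given_firm = deaths[firm] / (deaths[firm] + no_deaths[firm])  # part c)
    print(firm, round(p_firm, 4), round(p_death_given_firm, 5))
print(sum(deaths.values()) / total)          # part b) P(cholera death)
print(deaths["S&V"] / sum(deaths.values()))  # part d) P(S&V | cholera death)
# For independence (part c), compare P(death | firm) with the overall P(death).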
3 Lecture 3 – Discrete Random Variables and Probability Distributions
Textbook Sections: 5.1,5.2,Notes(for bivariate r.v.’s)
Problems: 5.1,5.3, see lecture notes

An experiment is conducted and some measurement is to be made regarding the outcome.


This type of measurement can be classified as either discrete or continuous. Discrete random
variables can take on only a finite (or countably infinite) possible set of outcomes. Examples
include:

RV1 The number of surveyed voters who favor Al Gore in the upcoming election from a survey of
722 registered voters

RV2 The number of military personnel that oppose the military’s ban on homosexuals from a
survey of 300 current military personnel

RV4 The number of patients, out of a group of 20 under study, that react positively to a new drug
treatment

RV5 The number of successful shuttle launches out of the first 30 shuttle missions

Continuous random variables can take on any value corresponding to points on a line interval. It
should be noted that while this type of variable occurs on a continuous scale, it is measured on some
sort of discrete scale (a news weatherman reports the temperature as 93°F, not 92.7756…°F).
Examples include:

RV3 The gas mileage of a Ford Mustang GT convertible when run at 65 miles per hour.

RV7 The number of miles a tire can travel before wearing out.

RV9 The amount of time needed to housetrain a dog.

These are considered random variables because we have randomly selected some subject or object
from a population of such subjects (objects). The populations of these subjects (whether existing
or conceptual) are said to have probability distributions. These are models of the distribution
of the measurements corresponding to the elements of the population.
Discrete probability distributions are a set of outcomes (denoted by x) and their corre-
sponding probabilities. The distribution can be presented as a table, graph, or formula
representing each possible outcome of the random variable and its probability of occurring. Defining
p(x) as “the probability the random variable takes on the value x”, we have the following simple
rules for discrete probability distributions:

1. 0 ≤ p(x) ≤ 1

2. Σx p(x) = 1

Thus, all probabilities must be between 0 and 1, and all probabilities must sum to 1.
We consider discrete random variables in this lecture.

Example 3.1 — New Florida Lotto Game


Consider Florida’s newly renovated lotto game. Before the drawing, you buy a card by choosing
6 different numbers between 1 and 53 inclusive, and giving the clerk $1. When the state chooses
their 6 numbers subsequently, there will be either 0, 1, 2, 3, 4, 5 or 6 numbers that match yours.
This is a discrete random variable. You do not know how many of the state’s numbers will match
yours, but you can obtain the probability of each possible outcome. This is a set of probabilities
that can be used to set up the corresponding probability distribution. For this case, if we let
X be the random variable representing how many of the state’s numbers match yours, it has the
probability distribution given in Table 3. The distribution used is the hypergeometric distribution,
which is described in many textbooks on mathematical statistics.

x p(x)
0 .46771566391
1 .40089914050
2 .11654044782
3 .01412611489
4 .00070630574
5 .00001228358
6 .00000004356

Table 3: Probability distribution for number of winning digits on a Florida lotto ticket

Note that all probabilities are between 0 and 1, and that they sum to 1. Of course, your ticket
is worthless unless x ≥ 3, so the probability distribution corresponding to your prize amount will
be different than this distribution (you will pool p(0), p(1), and p(2) to obtain the probability you
win nothing (.98515525223)).

For discrete probability distributions, the mean µ is interpreted as the ‘long run average
outcome’ if the experiment were conducted many times. The variance σ² is a measure of how
variable these outcomes are: it is the average squared distance between the outcome
of the random variable and the mean. The positive square root of the variance is the standard
deviation, σ, and is in the units of the original data.
For a discrete random variable:

• µ = E(X) = Σx x·p(x)

• σ² = V(X) = E[(X − µ)²] = Σx (x − µ)²·p(x) = Σx x²·p(x) − µ²

• σ = +√σ²

Example 3.1 – continued


Referring back to the new Florida lotto example, we obtain the mean and variance from the calcu-
lations in Table 4. Under the new game, the average number of “winning digits” is µ = 0.6792, with
variance and standard deviation (using 4 digits in calculations):

σ² = 1.0058 − (0.6792)² = 0.5445        σ = 0.7379


x     p(x)            x · p(x)        x² · p(x)
0     .46771566391    0               0
1     .40089914050    .40089914050    .40089914050
2     .11654044782    .23308089564    .46616179128
3     .01412611489    .04237834466    .12713503399
4     .00070630574    .00282522298    .01130089191
5     .00001228358    .00006141789    .00030708945
6     .00000004356    .00000026135    .00000156812
Sum   1.00            .67924528302    1.00580551525

Table 4: Mean and variance calculations for the number of winning digits on a Florida lotto ticket

The variation in the correct numbers is relatively small as well, reflecting the fact that almost
always people get either 0, 1, or 2 correct numbers.
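The probabilities in Table 3 can be reproduced from the hypergeometric distribution: 6 “winning” numbers among 53, with 6 numbers drawn. A minimal sketch, assuming scipy is available:

from scipy.stats import hypergeom

# X = matches on one ticket: M = 53 numbers total, n = 6 winning, N = 6 picked.
rv = hypergeom(M=53, n=6, N=6)
for x in range(7):
    print(x, rv.pmf(x))          # reproduces Table 3
print(rv.mean(), rv.var())       # about 0.6792 and 0.5445, as computed above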

Example 3.2 – Adverse Selection (Akerlof ’s Market for Lemons)


George Akerlof shared the Nobel Prize in Economics in 2001 for an extended version of this
model. There are two used car types: peaches and lemons. Sellers know the car type, having been
driving it for a period of time. Buyers are unaware of a car’s quality. Buyers value peaches at $3000
and lemons at $2000. Sellers value peaches at $2500 and lemons at $1000. Note that if sellers had
higher valuations, no cars would be sold.
Suppose that 1/3 of the cars are peaches and the remaining 2/3 are lemons. What is the
expected value to a buyer, if (s)he purchases a car at random? We will let X represent the value
to the buyer, which takes on the values 3000 (for peaches) and 2000 (for lemons).
µ = E(X) = Σ x·p(x) = 3000(1/3) + 2000(2/3) = 2333.33

Thus, buyers will not pay over $2333.33 for a used car, and since the value of peaches is $2500 to
sellers, only lemons will be sold, and buyers will learn that, and pay only $2000. At what fraction
of the cars being peaches, will both types of cars be sold?
For a theoretical treatment of this problem, see e.g. D.M. Kreps, A Course in Microeconomic
Theory, Chapter 17.

3.1 Bivariate Distributions


Often we are interested in the outcomes of 2 (or more) random variables. Suppose you have the
opportunity to purchase shares of two firms. Your (subjective) joint probability distribution (p(x, y))
for the return on the two stocks is given in Table 5.

                  Stock B
              6%        10%
Stock    0%   .10       .40
A        16%  .40       .10

Table 5: Joint probability distribution for stock returns – Substitutable Industries


Thus, you have reason to believe there is little possibility that both will perform poorly or
strongly. For now, denote X as the return for stock A and Y as the return for stock B. We can
think of these industries as “substitutes.”
Marginally, what is the probability distribution for stock A (this is called the marginal distribu-
tion)? For stock B? These are given in Table 6.

Stock A Stock B
x P (X = x) y P (Y = y)
0 .10+.40=.50 6 .10+.40=.50
16 .40+.10=.50 10 .40+.10=.50

Table 6: Marginal probability distributions for stock returns

Hence, we can compute the mean and variance for X and Y:


E(X) = µX = 0(.5) + 16(.5) = 8.0        V(X) = σX² = (0 − 8)²(.5) + (16 − 8)²(.5) = 64.0

E(Y) = µY = 6(.5) + 10(.5) = 8.0        V(Y) = σY² = (6 − 8)²(.5) + (10 − 8)²(.5) = 4.0

So, both stocks have the same expected return, but stock A is riskier, in the sense that its
variance is much larger.
How do X and Y “co-vary” together?
For these two firms, we find that the covariance is negative, since high values of X tend to be
seen with low values of Y and vice versa. We compute the covariance of their returns, which we
denote as COV(X, Y) = E[(X − µX)(Y − µY)], in Table 7.

COV(X, Y) = E[(X − µX)(Y − µY)] = σXY = Σx Σy (x − µX)(y − µY)·p(x, y) = E(XY) − µX µY

x     x − µX    y     y − µY    P(X = x, Y = y) = p(x, y)    (x − µX)(y − µY)p(x, y)
0     −8        6     −2        0.10                         (−8)(−2)(.10) = 1.6
0     −8        10    2         0.40                         (−8)(2)(.40) = −6.4
16    8         6     −2        0.40                         (8)(−2)(.40) = −6.4
16    8         10    2         0.10                         (8)(2)(.10) = 1.6
Sum                             1.00                         −9.6

Table 7: Covariance of stock returns

Functions of Random Variables


Suppose we are interested in the sum of X and Y. What will be its probability distribution
(specifically its mean and variance)?

E(X + Y) = E(X) + E(Y) = µX + µY

V(X + Y) = V(X) + V(Y) + 2COV(X, Y) = σX² + σY² + 2σXY
x y p(x, y) x+y
0 6 .10 6
0 10 .40 10
16 6 .40 22
16 10 .10 26

Table 8: Distribution for the sum of stock returns

To see this, look at the distribution of the random variable X + Y in Table 8.


1) By definition of mean and variance:

E(X + Y) = 6(.10) + 10(.40) + 22(.40) + 26(.10) = 0.6 + 4.0 + 8.8 + 2.6 = 16

V(X + Y) = (6 − 16)²(.1) + (10 − 16)²(.4) + (22 − 16)²(.4) + (26 − 16)²(.1) =

100(.1) + 36(.4) + 36(.4) + 100(.1) = 10.0 + 14.4 + 14.4 + 10.0 = 48.8


2) By the formula:

E(X + Y ) = E(X) + E(Y ) = 8 + 8 = 16

V (X + Y ) = V (X) + V (Y ) + 2COV (X, Y ) = 64 + 4 + 2(−9.6) = 48.8

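These moment calculations extend mechanically to any small joint table. A minimal Python sketch that reproduces the numbers above from Table 5 (the dictionary layout is ours):

# Joint distribution of (x, y) = (stock A return, stock B return), Table 5.
joint = {(0, 6): 0.10, (0, 10): 0.40, (16, 6): 0.40, (16, 10): 0.10}

EX = sum(x * p for (x, y), p in joint.items())                     # 8.0
EY = sum(y * p for (x, y), p in joint.items())                     # 8.0
VX = sum((x - EX) ** 2 * p for (x, y), p in joint.items())         # 64.0
VY = sum((y - EY) ** 2 * p for (x, y), p in joint.items())         # 4.0
cov = sum((x - EX) * (y - EY) * p for (x, y), p in joint.items())  # -9.6
print(VX + VY + 2 * cov)    # V(X+Y) = 48.8, matching both derivations above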
General Case (Linear Function), where a and b are any constants:

E(aX + bY) = aE(X) + bE(Y) = aµX + bµY

V(aX + bY) = a²V(X) + b²V(Y) + 2abCOV(X, Y) = a²σX² + b²σY² + 2abσXY

Example 3.3 – Stock Purchase


You can purchase either stock A, stock B, or any combination of A and B. Your two criteria
for choosing are 1) highest expected return, and 2) lowest variance of return. You can choose p
between 0 and 1 (inclusive), where p is the fraction of A you will purchase and 1 − p is the fraction
of B you will purchase. Your (random) return is:

R = pX + (1 − p)Y

1) Compute your expected return:

E(R) =

2) Compute the variance of your return:

V (R) =

3) What value of p should you choose?


                  Stock B
              6%        10%
Stock    0%   .40       .10
A        16%  .10       .40

Table 9: Joint probability distribution for stock returns – Complementary Industries

Problem 3.1
Conduct the analysis for two complementary industries, where their fortunes tend to be good/bad
simultaneously. The joint probability distribution is given in Table 9.
A classic paper on this topic (more mathematically rigorous than this example, where each stock
has only two possible outcomes) is given in: Harry M. Markowitz, “Portfolio Selection,” Journal of
Finance, 7 (March 1952), pp 77-91.
4 Lecture 4 – Introduction to Decision Analysis
Textbook Sections: 18.1,18.2(1st 2 subsections),18.3(1st 3 subsections)
Problems: 18.1a,b,3,5,6,7,8,9

Oftentimes managers must make long-term decisions without knowing what future events will
occur that will affect the firm’s financial outcome from their decisions. Decision analysis is a means
for managers to consider their choices and help them select an optimal strategy. For instance:

• Financial officers must decide among certain investment strategies without knowing the state
of the economy over the investment horizon.

• A buyer must choose a model type for the firm’s fleet of cars, without knowing what gas
prices will be in the future.

• A drug company must decide whether to aggressively develop a new drug without knowing
whether the drug will be effective in the patient population.

Decision analysis in its simplest form includes the following components:

Decision Alternatives – These are the actions that the decision maker has to choose from.

States of Nature – These are occurrences that are out of the control of the decision maker, and
that occur after the decision has been made.

Payoffs – Benefits (or losses) that result when a particular decision alternative has been selected
and a given state of nature has occurred.

Payoff Table – A tabular listing of payoffs for all combinations of decision alternatives and states
of nature.

Case 1 - Decision Making Under Certainty


In the extremely unlikely case that the manager knows which state of nature will occur, the
manager will simply choose the decision alternative with the highest payoff conditional on that
state of nature. Of course, this is a very unlikely situation unless you have a very accurate psychic
on the company payroll.

Case 2 - Decision Making Under Uncertainty


When the decision maker does not know which state will occur, or even what probabilities to
assign to the states of nature, several options occur. The two simplest criteria are:

Maximax – Look at the maximum payoff for each decision alternative. Choose the alternative
with the highest maximum payoff. This is optimistic.

Maximin – Look at the minimum payoff for each decision alternative. Choose the alternative
with the highest minimum payoff. This is pessimistic.
Case 3 - Decision Making Under Risk
In this case, the decision maker does not know which state will occur, but does have probabilities
to assign to the states. Payoff tables can be written in the form of decision trees. Note that in
such diagrams, squares refer to decision alternatives and circles refer to states of nature.
Expected Monetary Value (EMV) – This is the expected payoff for a given decision al-
ternative. We take each payoff times the probability of that state occurring, and sum across
states. There will be one EMV per decision alternative. One criterion commonly used is to select
the alternative with the highest EMV.
Expected Value of Perfect Information (EVPI) – This is a measure of how valuable it
would be to know what state will occur. First we obtain the expected payoff with perfect information
by multiplying the probability of each state of nature and its highest payoff, then summing over
states of nature. Then we subtract off the highest EMV to obtain EVPI.

Example 4.1 – Long-term Marketing Plan


A drug manufacturer has two potential drugs for research and development. One drug targets a
childhood illness, the other targets an illness among the elderly. The firm expects that both drugs
will be effective and will obtain FDA approval, but it will be 10 years before either drug will be
brought to market, and involve very expensive research and development. They are not sure what
the size of each market will be in 10 years.
The firm has four decision alternatives:
• Pursue neither drug

• Pursue only the childhood drug

• Pursue only the elderly drug

• Pursue both drugs


There are six possible states of nature:
• Birth rates decrease and life expectancies stay constant (B − /L0)

• Birth rates stay constant and life expectancies stay constant (B0/L0)

• Birth rates increase and life expectancies stay constant (B + /L0)

• Birth rates decrease and life expectancies increase (B − /L+)

• Birth rates stay constant and life expectancies increase (B0/L+)

• Birth rates increase and life expectancies increase (B + /L+)


The payoffs (in $million) for each combination of decisions and states of nature are given in
Table 10.
a) What would be your decision and payoff under each state of nature, if you were certain that
state were to occur?

B − /L0 – Decision: Payoff:

B0/L0 – Decision: Payoff:

B + /L0 – Decision: Payoff:


Decision State of Nature
Alternative B − /L0 B0/L0 B + /L0 B − /L+ B0/L+ B + /L+
Neither 0 0 0 0 0 0
Child -20 10 40 -20 10 40
Elderly -10 -10 -10 30 30 30
Both -30 0 30 10 40 70

Table 10: Payoff table for drug development decision

B − /L+ – Decision: Payoff:

B0/L+ – Decision: Payoff:

B + /L+ – Decision: Payoff:

b) Give the maximax and maximin decisions and their corresponding criteria:

Maximax – Decision: Criteria:

Maximin – Decision: Criteria:

c) Suppose we are given the probability distribution for the 6 states of nature in Table 11.

State Probability
B − /L0 0.05
B0/L0 0.10
B + /L0 0.15
B − /L+ 0.15
B0/L+ 0.25
B + /L+ 0.30

Table 11: Probability distribution for states of nature for drug development decision

To obtain the expected monetary value for each decision alternative, we multiply the payoffs
for each state of nature and their corresponding probabilities, summing over states of nature. For
the decision to develop only the childhood drug:

EMV(Child) = (−20)(0.05) + 10(0.10) + 40(0.15) + (−20)(0.15) + 10(0.25) + 40(0.30) =

−1.0 + 1.0 + 6.0 − 3.0 + 2.5 + 12.0 = 17.5

Neither – EMV(Neither) =

Child – EMV(Child) =

Elderly – EMV(Elderly) =

Both – EMV(Both) =
Based on the EMV criteria, which decision should the firm make?

d) A firm that conducts extensive research on population dynamics can be hired and can be
expected to tell your firm exactly which state of nature will occur. Give the expected payoff under
perfect information, and how much you would be willing to pay for that (EVPI).
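Once the payoff table and state probabilities are entered, the EMV and EVPI computations are mechanical. A minimal Python sketch (use it to check the blanks above after working them by hand):

probs = [0.05, 0.10, 0.15, 0.15, 0.25, 0.30]        # Table 11
payoffs = {                                          # Table 10, $million
    "Neither": [0, 0, 0, 0, 0, 0],
    "Child":   [-20, 10, 40, -20, 10, 40],
    "Elderly": [-10, -10, -10, 30, 30, 30],
    "Both":    [-30, 0, 30, 10, 40, 70],
}
emv = {d: sum(p * v for p, v in zip(probs, pay)) for d, pay in payoffs.items()}
print(emv)                                   # EMV(Child) = 17.5, as above
best = max(emv, key=emv.get)
# Expected payoff with perfect information: best payoff per state, weighted.
eppi = sum(p * max(pay[j] for pay in payoffs.values())
           for j, p in enumerate(probs))
print(best, "EVPI =", eppi - emv[best])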

Example 4.2 – Merck’s Decision to Build New Factory


Around 1993, Merck had to decide whether to build a new plant to manufacture the AIDS drug
Crixivan. The drug had not been tested at the time in clinical trials. The plant would be very
specialized as the process to synthesize the drug was quite different from the process to produce
other drugs.
Consider the following facts that were known at the time (I obtained most numbers through
newspaper reports, and company balance sheets, all numbers are approximate):

• Projected revenues – $500M/Year

• Merck profit margin – 25%

• Probability that drug will prove effective and obtain FDA approval – 0.10

• Cost of building new plants – $300M

• Sunk costs – $400M (Money spent in development prior to this decision)

• Length of time until new generation of drugs – 8 years

Ignoring tremendous social pressure, does Merck build the factory now, or wait two years and
observe the results of clinical trials (thus forfeiting market share to Hoffmann-La Roche and Abbott,
who are in fierce competition with Merck)? Assume for this problem that if Merck builds now,
and the drug gets approved, they will make $125M/Year (present value) for eight years (note
125 = 500(0.25)). If they wait, and the drug gets approved, they will generate $62.5M/Year (present
value) for six years, a byproduct of losing market share to competitors and two years of
production. Due to the specificity of the production process, the cost of the plant will be a total
loss if the drug does not obtain FDA approval.

a) What are Merck’s decision alternatives?

b) What are the states of nature?

c) Give the payoff table.

d) Give the Expected Monetary Value (EMV) for each decision. Ignoring social pressure, should
Merck go ahead and build the plant?

e) At what probability of the drug being successful is Merck indifferent between building early and
waiting? That is, for what value are the EMVs equal for the decision alternatives?
Note: Merck did build the plant early, and the drug did receive FDA approval.
5 Lecture 5 – Normal Distribution and Sampling Distributions
Textbook Sections and pages: 6.2,7.2,7.3,pp336-337,pp364-365,9.1
Problems: 6.7,9,11, 7.13,21,22,23,25,27

Continuous probability distributions are smooth curves that represent the ‘density’ of
probability around particular values. This density is not interpreted as a probability at the point
(all points have probability 0); rather, the probability of an outcome occurring between
points a and b is measured as the area under the density function between a and b. The
density function is always defined so that the total area under it is 1, and it is never negative. The
continuous distribution you have seen most often is the normal distribution, but many others
exist including the t-distribution, which you also have already seen.

5.1 The Normal Distribution


Normal distributions are indexed by 2 parameters, the mean and variance (standard deviation).
Figure 8 depicts 3 normal distributions with the same mean (µ = 100) and varying standard
deviations (σ = 3, 10, and 25). Figure 9 depicts 3 normal distributions with the same standard
deviation (σ = 10) and varying means (µ = 75, 100, and 125).


Figure 8: Normal distributions with common means and varying standard deviations (3, 10, 25)

Standard notation for a random variable X, that follows a normal distribution with mean µ
and standard deviation σ is X ∼ N (µ, σ). Since there are infinitely many normal distributions
(corresponding to any µ and any σ > 0), we must standardize normal random variables to obtain
probabilities corresponding to them. If X ∼ N(µ, σ), we define Z = (X − µ)/σ. Z represents the number
of standard deviations above (or below, if negative) the mean that X lies. Table A.5 (p. A–14
and last page of text, not including inside back cover) gives the probability that Z lies between 0
and z for values of z between 0 and 3.49. Recall that the total area under the curve is 1, that the
probability that Z is larger than 0 is 0.5, and that the curve is symmetric.

Example 5.1

Figure 9: Normal distributions with common standard deviations and varying means (75, 100, 125)

Scores on the Verbal Ability section of the Graduate Record Examination (GRE) between
10/01/92 and 9/30/95 had a mean of 479 and a standard deviation of 116, based on a population
of N = 1188386 examinations. Scores can range between 200 and 800. Scores on standardized
tests tend to be approximately normally distributed. Let X be a score randomly selected from this
population. That is, X ∼ N (479, 116).
What is the probability that a randomly selected student scores above 700?
What is the probability the student scores between 400 and 600?
Above what score do the top 5% of all students score?

1. P(X ≥ 700) = P((X − µ)/σ ≥ (700 − 479)/116) = P(Z ≥ 1.91) = P(Z ≥ 0) − P(0 ≤ Z ≤ 1.91) =
0.50 − 0.4719 = .0281

2. P(400 ≤ X ≤ 600) = P((400 − 479)/116 ≤ (X − µ)/σ ≤ (600 − 479)/116) = P(−.68 ≤ Z ≤ 1.04) =
P(−.68 ≤ Z ≤ 0) + P(0 ≤ Z ≤ 1.04) = P(0 ≤ Z ≤ .68) + P(0 ≤ Z ≤ 1.04) = .2517 + .3508 = .6025

3. .05 = .5 − .4500 = .5 − P(0 ≤ Z ≤ 1.645) = P(Z ≥ 1.645) = P((X − µ)/σ ≥ 1.645) =
P(X ≥ µ + 1.645σ) = P(X ≥ 479 + 1.645(116)) = P(X ≥ 670) = .05

Source: “Interpreting Your GRE General Test and Subject Test Scores – 1996-97,” Educational
Testing Service.
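These table lookups can be checked with scipy’s normal distribution functions (a minimal sketch; the small discrepancies from the values above reflect rounding z to two decimal places):

from scipy.stats import norm

mu, sigma = 479, 116
print(1 - norm.cdf(700, mu, sigma))                         # P(X >= 700), ~.028
print(norm.cdf(600, mu, sigma) - norm.cdf(400, mu, sigma))  # P(400<=X<=600), ~.60
print(norm.ppf(0.95, mu, sigma))                            # 95th percentile, ~670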

5.2 Sample Statistics and Sampling Distributions


We have described sample statistics previously, showing how they are calculated once a sample
has been taken from a larger population. Since these samples are taken at random, the elements
of the sample, and thus the sample statistics themselves, are random variables. One of the most
important theorems in statistics is the Central Limit Theorem, which states that when the
sample size is large (n ≥ 30), the sample mean is approximately normally distributed with mean µ
and variance σ²/n, regardless of the shape of the underlying distribution of measurements. Here,
µ and σ² are the mean and variance of the distribution of the measurements. We can then write
X̄ ∼ N(µ, σ/√n).
The other sample statistics p̂, X̄1 − X̄2, and p̂1 − p̂2 are also approximately normal in large
samples. The distribution of a sample statistic is called its sampling distribution. The standard
deviation of a sample statistic’s sampling distribution is called its standard error. Table 12
gives each of these 4 sample statistics as well as the means and standard errors of their sampling
distributions. The row involving d is a special case of the sample mean for differences among
matched pairs of observations (see the paired difference experiment).

Estimator (θ̂)   Parameter (θ)   Std. Error (σθ̂)                Estimated Std. Error (Sθ̂)         Degrees of Freedom (ν)
X̄                µ               σ/√n                            S/√n                               n − 1
p̂                P               √[P(1−P)/n]                     √[p̂(1−p̂)/n]                       —
X̄1 − X̄2          µ1 − µ2         √[σ1²/n1 + σ2²/n2]              √[S1²/n1 + S2²/n2]                 n1 + n2 − 2*
d̄                µd              σd/√n                           Sd/√n                              n − 1
p̂1 − p̂2          P1 − P2         √[P1(1−P1)/n1 + P2(1−P2)/n2]    √[p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2]      —**

Table 12: Means, standard errors, and estimated standard errors of four sample statistics (estima-
tors)

To obtain probabilities of observing particular values of a sample statistic, we use the fact that
the statistic is normally distributed, and work with Z = (θ̂ − θ)/σθ̂.

Example 5.2 – NCAA Basketball Tournament Scores


The NCAA basketball tournament (often referred to as “March Madness”) has been held every
spring since 1939. In the 55 years of the tournament (up until 1993), there had been 1583 games
played. Among these 1583 games (the population), the mean and standard deviation of the com-
bined scores of the two combatants are µ = 143.40 and σ = 26.07 points, respectively. Suppose
each person in this class took samples of size 1, 10, 25, and 50, respectively, from this population
of games. Between what two bounds would virtually all students’ sample means fall? We
know that the sample mean is approximately normally distributed with mean µ = 143.40 and stan-
dard error σ/√n (the underlying distribution is very well approximated by the normal, meaning
that we don’t need large sample sizes for the Central Limit Theorem to hold). We also know that
for any random variable that is normally distributed, the probability that the random variable falls
within two standard deviations (standard errors) of the mean is approximately .95. So, for each
sample size, we obtain bounds by computing µ ± 2σ/√n. Table 13 gives these bounds for the sample
sizes mentioned above.

n      µ − 2σ/√n                         µ + 2σ/√n
1      143.40 − 2(26.07)/√1 = 91.26      143.40 + 2(26.07)/√1 = 195.54
10     143.40 − 2(26.07)/√10 = 126.91    143.40 + 2(26.07)/√10 = 159.89
25     143.40 − 2(26.07)/√25 = 132.97    143.40 + 2(26.07)/√25 = 153.83
50     143.40 − 2(26.07)/√50 = 136.03    143.40 + 2(26.07)/√50 = 150.77

Table 13: Sample sizes and upper and lower bounds for sample means (95% confidence)

As the sample size increases, the sample means get closer and closer to the true mean. Thus if
we don’t know the true mean, but we wish to estimate it, we know that if we take a large sample the
sample mean will be relatively close to the true mean. The part that we are adding and subtracting
from the true mean is referred to as the bound on the error of estimation (it is also referred
to as the margin of error, particularly when used in context of a sample proportion).
Similar examples could be worked in terms of the other three estimators (sample statistics)
given in Table 12, using the corresponding parameter and standard error of the estimator in place
of those used in Example 5.2.
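The “two standard errors” bounds, and the fact that roughly 95% of sample means land inside them, can be illustrated by simulation. A minimal sketch assuming numpy; we draw from a normal distribution with the population mean and standard deviation, which, as noted above, approximates this population well:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 143.40, 26.07
for n in (1, 10, 25, 50):
    half = 2 * sigma / np.sqrt(n)                      # 2 standard errors
    means = rng.normal(mu, sigma, size=(10000, n)).mean(axis=1)
    covered = np.mean((means > mu - half) & (means < mu + half))
    print(n, round(mu - half, 2), round(mu + half, 2), covered)  # ~0.95 each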

Example 5.3 – Pravachol Clinical Trial

In Example 2.1 we considered the results of the clinical trial for Pravachol (and treated the
data as a population of patients). In reality, that was a sample (a very large one at that). If we let
P1 be the proportion of all possible Pravachol users to have a heart event within five years, and
P2 be the corresponding proportion for patients on a placebo, we are interested in the parameter
P1 − P2 . The estimator for this parameter is p̂1 − p̂2 , which, for this sample, takes on the value:
p̂1 − p̂2 = X1/n1 − X2/n2 = 174/3302 − 248/3293 = .0527 − .0753 = −.0226

The estimated standard error of p̂1 − p̂2 is:

√[p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2] = √[.0527(1 − .0527)/3302 + .0753(1 − .0753)/3293]

= √(.00001512 + .00002115) = .0060
Thus, we would expect that for approximately 95% of all possible samples, our statistic p̂1 − p̂2
will lie within 2 standard errors (2(.0060)=.0120) of the true difference P1 − P2 .
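A quick numerical check of the estimate and its standard error (standard library only):

import math

x1, n1 = 174, 3302                 # Pravachol: events, sample size
x2, n2 = 248, 3293                 # placebo:   events, sample size
p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2                     # -0.0226
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # 0.0060
print(diff, se, 2 * se)            # two standard errors = 0.0120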
6 Lecture 6 – Large–Sample Tests and Confidence Intervals
Textbook Sections: 9.2,9.4,10.1,10.4
Problems: 9.1,5,9,19,23,25 10.1,5,7,9,31,33,35

In this section we begin making statistical inferences, using sample data to comment on what
is occurring in a larger population or nature.

6.1 Large–sample Confidence Intervals


By making use of the sampling distributions of sample statistics, we can use sample data to make an
inference concerning a population parameter. Since each estimator (θ̂) described in the previous
section is normally distributed with a mean equal to the true parameter (θ), and standard error
(σθ̂ ) given in the table, we can obtain a confidence interval for the true parameter.
We first define zα/2 to be the point on the standard normal distribution such that P(Z ≥
zα/2) = α/2. Some values that we will see repeatedly are z.05 = 1.645, z.025 = 1.96, and z.005 = 2.58.
The main idea behind confidence intervals is the following. Since we know that
θ̂ ∼ N(θ, σθ̂), we also know Z = (θ̂ − θ)/σθ̂ ∼ N(0, 1). So, we can write:

P(−zα/2 ≤ (θ̂ − θ)/σθ̂ ≤ zα/2) = 1 − α

A little bit of algebra gives the following:

P(θ̂ − zα/2 σθ̂ ≤ θ ≤ θ̂ + zα/2 σθ̂) = 1 − α

This merely says that “in repeated sampling, our estimator will lie within zα/2 standard errors of
the mean a fraction 1 − α of the time.” The resulting formula for a (1 − α)100% confidence
interval for θ is
θ̂ ± zα/2 σθ̂.
When the standard error σθ̂ is unknown (almost always), we will replace it with the estimated
standard error Sθ̂. Some notes concerning confidence intervals are given below.
• α is the probability (with respect to repeated sampling) that the interval does not
contain the true parameter. If we wish to make α smaller, we must increase the width of our
interval (for a fixed sample size).

• The width of the interval depends on the sample size through the standard error. As the
sample size increases, the width of the interval will decrease (for a fixed α), which is good
since we have a more precise estimate.

• If we took many random samples of a fixed size from the population of interest, and calculated
the confidence interval based on each sample, approximately (1 − α)100% of these intervals
would contain the true parameter. This is where the term confidence arises from; since
almost all of these intervals contain θ, we can be very confident that the interval based on
our one sample contains θ.

Example 6.1
Fox News Opinion Poll: “CNN covered Iran–Contra live in 1987, but is not covering Senate
hearings of Democratic finance abuses. Do you think the decision was politically motivated?”
(Washington Times, National Weekly Edition,8/10/97). Out of n = 899 American adults sampled,
X = 476 agreed with the statement.

1 or 2 Populations? — We are observing a sample from a single population.

Numeric or Presence/Absence Outcome — Each person either agrees or does not agree with
the statement, thus it is a Presence/Absence outcome.

Parameter of Interest — P , the proportion of all American adults who feel the decision was
politically motivated.
Appropriate Estimator — $\hat p = \frac{X}{n}$

Estimated Standard Error — $\sqrt{\frac{\hat p(1-\hat p)}{n}}$

We wish to obtain a 95% confidence interval for the proportion of all U.S. adults who believe
the decision was politically motivated. For this sample, $\hat p = \frac{X}{n} = \frac{476}{899} = .53$, and its estimated standard error is:
\[ S_{\hat p} = \sqrt{\frac{\hat p(1-\hat p)}{n}} = \sqrt{\frac{(.53)(.47)}{899}} = .0166. \]
Thus a 95% confidence interval for the true proportion, P , is:

\[ \hat p \pm z_{.025} S_{\hat p} = .53 \pm 1.96(.0166) = .53 \pm .0325 = (.4975, .5625) \]

We are 95% confident that the proportion of all U.S. adults who feel that the decision was politically
motivated was between 0.4975 and 0.5625. Note that since values below 0.50 are contained in the
interval, we cannot conclude that a majority agree with the statement at this confidence level.
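As a computational aside, the interval can be reproduced in a few lines of Python (a minimal sketch using scipy; the variable names are ours, and carrying the unrounded $\hat p$ through gives endpoints that differ from the hand calculation in the third decimal):

```python
# Large-sample 95% CI for a single proportion, using the Example 6.1 data.
from math import sqrt
from scipy.stats import norm

x, n, alpha = 476, 899, 0.05
p_hat = x / n                            # sample proportion (about .5295)
se = sqrt(p_hat * (1 - p_hat) / n)       # estimated standard error
z = norm.ppf(1 - alpha / 2)              # z_{.025} = 1.96

lo, hi = p_hat - z * se, p_hat + z * se
print(f"p-hat = {p_hat:.4f}, SE = {se:.4f}, CI = ({lo:.4f}, {hi:.4f})")
# Prints roughly (.4969, .5622); the hand calculation's (.4975, .5625)
# differs only because p-hat was rounded to .53 first.
```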

Example 6.2 – Salary Progression Gap Between Dual Earner and Traditional Male
Managers
A study compared the salary progressions from 1984 to 1989 among a sample of married male
managers of Fortune 500 companies with children at home. For each manager, the 5-year salary
progression was obtained as 100*(1989 salary − 1984 salary)/1984 salary. This is a percent increase;
for example, if a manager's salary increased from $100K in 1984 to $200K in 1989, then
X = 100*(200K − 100K)/100K = 100(1) = 100%. The researchers were interested in determining whether there is a
difference in the mean salary progression between dual earner and traditional managers. Dual
earner managers had wives who worked full time; traditional managers' wives did not work. The
authors reported the sample statistics in Table 14.

Statistic     Dual Earner (i = 1)    Traditional (i = 2)
$\bar X_i$    60.46                  69.24
$S_i$         22.21                  61.27
$n_i$         166                    182

Table 14: Summary statistics for male manager salary progression study
1 or 2 Populations? — We are observing samples from two populations (dual earner male man-
agers and traditional male managers).

Numeric or Presence/Absence Outcome — The outcome measured is the percent change in
salary 1984-1989. This is a numeric outcome.

Parameter of Interest — µ1 − µ2 , the difference between true mean salary progressions for dual
earner and traditional male managers.

Appropriate Estimator — $\bar X_1 - \bar X_2$

Estimated Standard Error — $\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$

We obtain a 95% confidence interval for the true mean difference between these two groups of
managers: µ1 − µ2 .
\[ \bar X_1 - \bar X_2 = 60.46 - 69.24 = -8.78 \]
\[ \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} = \sqrt{\frac{(22.21)^2}{166} + \frac{(61.27)^2}{182}} = \sqrt{2.97 + 20.63} = \sqrt{23.60} = 4.86 \]
95% CI for $\mu_1 - \mu_2$:
\[ (\bar X_1 - \bar X_2) \pm z_{.025}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \equiv -8.78 \pm 1.96(4.86) \equiv -8.78 \pm 9.52 \equiv (-18.30, 0.74) \]
We can be 95% confident that the true difference in mean salary progressions between the two groups
is between −18.30% and 0.74%. Since 0 is in this range (that is, µ1 = µ2 is plausible), we cannot conclude there
is a difference in the true underlying population means, although the sample means differed by
8.78%. This is because of the large amount of variation in the individual salary progressions (see
S1 and S2). What would be your conclusion had you constructed a 90% confidence interval for
µ1 − µ2?
Source: Stroh, L.K. and J.M. Brett (1996), “The Dual-Earner Dad Penalty in Salary Progres-
sion,” Human Resource Management, 35:181-201.
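For readers who want to verify the arithmetic, a short Python sketch of the same interval (variable names are ours; the summary statistics come from Table 14):

```python
# Large-sample 95% CI for mu1 - mu2 from the Table 14 summary statistics.
from math import sqrt
from scipy.stats import norm

xbar1, s1, n1 = 60.46, 22.21, 166    # dual earner managers
xbar2, s2, n2 = 69.24, 61.27, 182    # traditional managers

diff = xbar1 - xbar2                          # -8.78
se = sqrt(s1**2 / n1 + s2**2 / n2)            # about 4.86
z = norm.ppf(0.975)                           # 1.96

print(f"diff = {diff:.2f}, SE = {se:.2f}, "
      f"CI = ({diff - z*se:.2f}, {diff + z*se:.2f})")
# Prints roughly (-18.30, 0.74); 0 is inside, matching the conclusion above.
```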

6.2 Large–Sample Tests of Hypotheses


We also have a procedure to test hypotheses concerning parameter values. Hypothesis testing is a
procedure to make a decision concerning the value of an unknown parameter (although the method
is also used to test more general characteristics of populations than simply parameter values). The
testing procedure involves setting up two contradicting statements concerning the true value of
the parameter, known as the null hypothesis and the alternative hypothesis, respectively. We
assume the null hypothesis is true, and usually (but not always) wish to show that the alternative is
actually true. After collecting sample data, we compute a test statistic which is used as evidence
for or against the null hypothesis (which we assume is true when calculating the test statistic). The
set of values of the test statistic that we feel provide sufficient evidence to reject the null hypothesis
in favor of the alternative is called the rejection region. The probability that we could have
obtained as strong or stronger evidence against the null hypothesis, assuming that it is true, than
what we observed from our sample data is called the observed significance level or p–value.
An analogy that may help clear up these ideas is as follows. The researcher is like a prosecutor
in a jury trial. The prosecutor must work under the assumption that the defendant is innocent (null
hypothesis), although he would like to show that the defendant is guilty (alternative hypothesis).
The evidence that the prosecutor brings to the court (test statistic) is weighed by the jury to see if
it provides sufficient evidence to rule the defendant guilty (rejection region). The probability that
an innocent defendant could have had more damning evidence brought to trial than was brought
by the prosecutor (p-value) provides a measure of how strong the prosecutor’s evidence is against
the defendant.
Testing hypotheses is ‘clearer’ than the jury trial because the test statistic and rejection region
are not subject to human judgement (directly) as the prosecutor’s evidence and jury’s perspective
are. Since we do not know the true parameter value and never will, we are making a decision in
light of uncertainty. We can break down reality and our decision into Table 15.

Decision
H0 True H0 False
Actual H0 True Correct Decision Type I Error
State H0 False Type II Error Correct Decision

Table 15: Possible outcomes of a hypothesis test

We would like to set up the rejection region to keep the probability of a Type I error (α) and
the probability of a Type II error (β) as small as possible. Unfortunately for a fixed sample size, if
we try to decrease α, we automatically increase β, and vice versa. We will set up rejection regions
to control for α, and will not concern ourselves with β. However all tests described here have the
lowest Type II error rates of any tests for a given sample size. Further, as sample sizes increase,
the type II error rate decreases for a given state (value of θ) in the alternative hypothesis. Here α
is the probability we reject the null hypothesis when it is true. (This is like sending an innocent
defendant to prison).
We can write out the general form of a hypothesis test in the following steps.

1. H0 : θ = θ0

2. $H_A: \theta \ne \theta_0$ or $H_A: \theta > \theta_0$ or $H_A: \theta < \theta_0$ (which alternative is appropriate should be clear from the setting).
3. T.S.: $z_{obs} = \frac{\hat\theta - \theta_0}{\sigma_{\hat\theta}}$ (if the standard error is unknown, it is replaced by the estimated standard error).

4. R.R.: |zobs | > zα/2 or zobs > zα or zobs < −zα (which R.R. depends on which alternative
hypothesis you are using).

5. P -value: 2P (Z > |zobs |) or P (Z > zobs ) or P (Z < zobs ) (again, depending on which alternative
you are using).

In all cases, a P -value less than α corresponds to a test statistic being in the rejection region
(reject H0 ), and a P -value larger than α corresponds to a test statistic failing to be in the rejection
region (fail to reject H0 ).
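The five steps above translate directly into a small helper function. The sketch below is ours (the function name and signature are illustrative, not from the text), using scipy for the normal tail areas:

```python
# Generic large-sample z-test following steps 1-5 above.
from scipy.stats import norm

def z_test(theta_hat, theta0, se, alternative="two-sided"):
    """Return (z_obs, p_value) for H0: theta = theta0."""
    z_obs = (theta_hat - theta0) / se
    if alternative == "two-sided":       # HA: theta != theta0
        p_value = 2 * norm.sf(abs(z_obs))
    elif alternative == "greater":       # HA: theta > theta0
        p_value = norm.sf(z_obs)
    else:                                # HA: theta < theta0
        p_value = norm.cdf(z_obs)
    return z_obs, p_value

# Reject H0 at level alpha exactly when p_value < alpha.
```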

Example 6.3 – Treatment for Erectile Dysfunction


The efficacy of intracavernosal alprostadil was studied in men suffering from erectile dysfunction
(impotence). The measure under study (X) was the duration (in minutes) of erection as measured
by the Rigiscan instrument (> 70% rigidity). Patients were assigned at random to receive either
a high or a low dose of the drug, and the manufacturer was interested in determining whether
increased doses are associated with longer (in time, not length) erections. Consider the following
problem components:

1 or 2 Populations? — We are comparing two groups – High vs Low Dose

Numeric or Presence/Absence Outcome — We are measuring the length of time that the
erection sustains a specific level, which is numeric

Parameter of Interest — µ1 − µ2 , the difference in the true mean lengths of duration

Appropriate Estimator — X 1 − X 2 , the difference in the mean lengths of duration for the
samples of subjects in the clinical trial
Estimated Standard Error — $\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$

Research Hypothesis (HA ) — Goal is to show increased dose gives longer durations: HA : µ1 >
µ2 or equivalently HA : µ1 − µ2 > 0

Type I Error — This occurs when we conclude that the drug is effective (µ1 > µ2 ), when in fact
it is not.

Type II Error — This occurs when we fail to conclude the drug is effective (fail to conclude
µ1 > µ2 ) when in fact it is.

The sample statistics are reported in Table 16 (times are in minutes).

High Dose Low Dose


X 1 = 44 X 2 = 12
S1 = 56 S2 = 28
n1 = 58 n2 = 57

Table 16: Summary statistics for the High and Low Doses (in minutes)

Now we test whether the mean time for the high dose exceeds that for the low dose (setting
α = 0.05):

1. H0 : µ1 − µ2 = 0 H A : µ1 − µ 2 > 0

2. T.S.: $z_{obs} = \frac{\hat\theta - \theta_0}{S_{\hat\theta}} = \frac{(\bar X_1 - \bar X_2) - 0}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} = \frac{(44-12) - 0}{\sqrt{\frac{(56)^2}{58} + \frac{(28)^2}{57}}} = \frac{32}{8.24} = 3.89$

3. R.R.: zobs > zα = z.05 = 1.645

4. p-value: P (Z ≥ zobs ) = P (Z ≥ 3.89) = .5 − P (0 ≤ Z ≤ 3.89) < .5 − .4998 = .0002

5. Conclusion: Since the test statistic falls in the rejection region (or, equivalently, the p-value
is below α), we reject the null hypothesis and claim that the true mean duration of erection
is higher for the high dose than the low dose (µ1 − µ2 > 0 ⇒ µ1 > µ2).
Compute a 95% confidence interval for the difference in true mean erection times.

Example 6.4 – Gastrointestinal Symptoms from Olestra


Anecdotal reports were spread through the mainstream press that Procter & Gamble's fat-
free substitute Olestra causes gastrointestinal (GI) side effects, even though such effects were not
expected based on clinical trials.
A study was conducted to compare the GI effects of Olestra based chips versus traditional chips
made with triglyceride (TG). The goal was to determine whether or not the levels of GI side effects
differ between consumers of Olestra based chips and traditional chips.
At a Chicago movie theater, 563 subjects were randomized (blindly) to Olestra based chips, and
529 received (blindly) traditional (TG) chips. Of the Olestra group, 89 reported suffering from a
gastrointestinal symptom (e.g. gas, diarrhea, abdominal cramping); of the TG group, 93 reported
such symptoms. Test whether the two types of chips differ in terms of gastrointestinal effects.
1 or 2 Populations? — We are comparing two groups – Olestra vs TG
Numeric or Presence/Absence Outcome — We are measuring whether a consumer had a
gastrointestinal (GI) side effect (Presence/Absence)
Parameter of Interest — P1 − P2 , the difference in the true proportions of consumers suffering
GI side effects
Appropriate Estimator — $\hat p_1 - \hat p_2$, the difference in the proportions of consumers suffering
from GI side effects in the two groups
Estimated Standard Error (Under $H_0: P_1 = P_2$) — $\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$ where $p = \frac{X_1 + X_2}{n_1 + n_2}$

Research Hypothesis (HA) — Goal is to show differences in proportions of GI side effects:
$H_A: P_1 \ne P_2$, or equivalently $H_A: P_1 - P_2 \ne 0$
Type I Error — This occurs when we conclude that the rates of GI symptoms differ ($P_1 \ne P_2$),
when in fact they do not.
Type II Error — This occurs when we fail to conclude the rates of GI symptoms differ (fail to
conclude $P_1 \ne P_2$) when in fact they do.
For this experiment:
\[ H_0: P_1 - P_2 = 0 \text{ (No Olestra Effect, Good or Bad)} \qquad H_A: P_1 - P_2 \ne 0 \text{ (Olestra Effect, Good or Bad)} \qquad \alpha = 0.05 \]
\[ \hat p_1 = \frac{X_1}{n_1} = \frac{89}{563} = 0.158 \qquad \hat p_2 = \frac{X_2}{n_2} = \frac{93}{529} = 0.176 \qquad p = \frac{X_1 + X_2}{n_1 + n_2} = \frac{89 + 93}{563 + 529} = \frac{182}{1092} = 0.167 \]
\[ TS: Z_{obs} = \frac{(\hat p_1 - \hat p_2) - 0}{\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(0.158 - 0.176) - 0}{\sqrt{0.167(1 - 0.167)\left(\frac{1}{563} + \frac{1}{529}\right)}} = -0.80 \qquad RR: |Z_{obs}| \ge z_{.025} = 1.96 \]

Thus, we have no evidence that the rate of GI symptoms differs between the two types of chips
(in fact, the sample proportion is smaller for Olestra chips).
Can you think of any other outcomes with respect to the chips of interest to Procter & Gamble?
Obtain a 95% confidence interval for the difference between the true proportions. (The standard
error used above is very similar to the standard error not assuming equal proportions).
Source: L.L. Cheskin, et al (1998), “Gastrointestinal Symptoms Following Consumption of Olestra
or Regular Triglyceride Potato Chips,” JAMA, 279:150-152.
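A short Python check of the computation above (our own sketch; the hand calculation rounds $\hat p_1$ and $\hat p_2$ before dividing, so it reports −0.80 where the unrounded computation gives about −0.79):

```python
# Pooled two-proportion z-test for the Olestra vs TG comparison.
from math import sqrt
from scipy.stats import norm

x1, n1 = 89, 563     # Olestra group reporting GI symptoms
x2, n2 = 93, 529     # TG group reporting GI symptoms

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                        # about 0.167
se0 = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))     # SE under H0: P1 = P2
z_obs = (p1 - p2) / se0                               # about -0.79
p_value = 2 * norm.sf(abs(z_obs))                     # about 0.43
print(f"z = {z_obs:.2f}, p-value = {p_value:.2f}")    # |z| < 1.96: fail to reject
```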
7 Lecture 7 — Small–Sample Inference
Textbook Sections and pages: 9.3,10.2,10.3,pp392–393
Problems: 9.11,13,17 10.11,15,17,23,25,29 11.1,3

In the case of small samples from populations with unknown variances, we can make use of the
t-distribution to obtain confidence intervals or conduct tests of hypotheses regarding population
means. In all cases, we must assume that the underlying distribution is normal (or approximately
normal), although this restriction is not necessary for moderate sample sizes. We will consider the
case of a single mean, µ, and the difference between two means, µ1 − µ2 , separately. First, though,
we refer back to the t-distribution in Table 4, page 669. This table gives the values tα such that
P (T > tα ) = α for values of the degrees of freedom between 1 and 29. The bottom line gives the
values zα , which should be used when the degrees of freedom exceed 30. I will also often add a
second subscript to tα to represent the appropriate degrees of freedom.

7.1 Inference Concerning µ


The general form for a confidence interval for µ remains the same as the large–sample case, except
we replace zα/2 by tα/2,n−1 . The general formula is as follows:
\[ \bar X \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} \]
Testing a hypothesis concerning µ is also very similar to the large–sample case, with similar
changes as for confidence intervals. The general method is as follows:
1. H0 : µ = µ0
2. $H_A: \mu \ne \mu_0$ or $H_A: \mu > \mu_0$ or $H_A: \mu < \mu_0$ (which alternative is appropriate should be clear
from the setting).
3. T.S.: $t_{obs} = \frac{\bar X - \mu_0}{s/\sqrt{n}}$

4. R.R.: |tobs | > tα/2,n−1 or tobs > tα,n−1 or tobs < −tα,n−1 (which R.R. depends on which
alternative hypothesis you are using).
5. p-value: 2P (T > |tobs |) or P (T > tobs ) or P (T < tobs ) (again, depending on which alternative
you are using).
In this case, you cannot obtain an exact p-value, but you can obtain bounds for the p-value.
Statistical computer packages report exact p-values.
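As an illustration of how a package computes these quantities, here is a minimal sketch (our own helper, not from the text) that returns the interval, the test statistic, and an exact two-sided p-value via scipy's t distribution:

```python
# One-sample t interval and test from summary statistics; scipy.stats.t
# supplies the exact tail areas that a table can only bound.
from math import sqrt
from scipy.stats import t

def t_interval_and_test(xbar, s, n, mu0, alpha=0.05):
    se = s / sqrt(n)
    tcrit = t.ppf(1 - alpha / 2, n - 1)          # t_{alpha/2, n-1}
    ci = (xbar - tcrit * se, xbar + tcrit * se)
    t_obs = (xbar - mu0) / se
    p_two_sided = 2 * t.sf(abs(t_obs), n - 1)    # exact two-sided p-value
    return ci, t_obs, p_two_sided
```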

7.2 Inference Concerning µ1 − µ2


Here we consider two ways of comparing the means of two populations or treatments. The first
approach is based on independent samples, where the measurements are independent across
groups. This occurs when we sample units separately from two populations, or assign experimental
units to only one of two treatments being compared. The second approach involves paired sam-
ples, where either each experimental unit is assigned to each of the two treatments, or units are
paired based on similar traits (e.g. individuals paired on race, gender, age, income, . . . ), and one
receives treatment 1, the matched individual receives treatment 2. Other examples of paired data
are given below.
7.2.1 Independent Samples
When the samples are independent, we use methods very similar to those for the large–sample case.
Examples of situations where the samples are independent include the following situations.
1. The mean lifetimes of two brands of television picture tubes are to be compared. A consumer
advocate samples n1 = 10 Sony televisions and n2 = 10 Mitsubishi televisions, measuring the
lifetime of all tubes. The samples are independent because there is no connection between
the televisions in the two groups.

2. Two methods of teaching children a foreign language are to be compared. A class of 24


children is split (randomly) into 2 groups of size 12. Each group receives one of the teaching
methods. The students’ foreign language proficiencies are measured at the end of the courses.
These samples are independent because different children received the two teaching methods.
One important difference is that these methods assume the two population variances, although
unknown, are equal. We then 'pool' the 2 sample variances to get an estimate of the common
variance $\sigma^2 = \sigma_1^2 = \sigma_2^2$. This estimate, which we will call $S_p^2$, is calculated as follows:
\[ S_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}. \]
The corresponding confidence interval can be written:
\[ (\bar X_1 - \bar X_2) \pm t_{\alpha/2,n_1+n_2-2}\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}. \]
Similarly, the test of hypothesis concerning µ1 − µ2 is conducted as follows:
1. H0 : µ1 − µ2 = ∆0 (∆0 is usually 0).
2. $H_A: \mu_1 - \mu_2 \ne \Delta_0$ or $H_A: \mu_1 - \mu_2 > \Delta_0$ or $H_A: \mu_1 - \mu_2 < \Delta_0$ (which alternative is
appropriate should be clear from the setting).
3. T.S.: $t_{obs} = \frac{(\bar X_1 - \bar X_2) - \Delta_0}{\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

4. R.R.: |tobs | > tα/2,n1 +n2 −2 or tobs > tα,n1 +n2 −2 or tobs < −tα,n1 +n2 −2 (which R.R. depends on
which alternative hypothesis you are using).

5. p-value: 2P (T > |tobs |) or P (T > tobs ) or P (T < tobs ) (again, depending on which alternative
you are using).

Example 7.2 – Prozac for Borderline Personality Disorder


The efficacy of fluoxetine (Prozac) on anger in patients with borderline personality disorder was
studied in 22 patients with BPD. Among the measurements made by researchers was the Profile of
Mood States (POMS) anger scale. Patients received either fluoxetine or placebo for 12 weeks, with
measurements being made before and after treatment. Table 17 gives post-treatment summary
statistics for the two treatment groups. Low scores are better since the patient displays less anger.
First, we obtain a 95% confidence interval for the difference in true mean scores for the two
treatment groups. Then, we conduct a test to determine whether fluoxetine reduces mean anger
score (has a lower true mean) as compared to placebo (α = 0.05).
Fluoxetine (i = 1)    Placebo (i = 2)
$\bar X_1 = 40.3$     $\bar X_2 = 44.9$
$s_1^2 = 25.7$        $s_2^2 = 75.2$
$n_1 = 13$            $n_2 = 9$

Table 17: Post-treatment POMS anger summary statistics for the fluoxetine and placebo groups

a) To set up this confidence interval, we need to obtain the pooled variance (we are assuming
these population variances are the same), as well as the value of tα/2,n1 +n2 −2 .
• $S_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} = \frac{(13-1)25.7 + (9-1)75.2}{13+9-2} = \frac{308.4 + 601.6}{20} = 45.5$

• tα/2,n1 +n2 −2 = t.025,20 = 2.086

Then we can set up the 95% confidence interval:


\[ (\bar X_1 - \bar X_2) \pm t_{\alpha/2,n_1+n_2-2}\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = (40.3 - 44.9) \pm 2.086\sqrt{45.5\left(\frac{1}{13} + \frac{1}{9}\right)} \]
\[ = -4.6 \pm 2.086(2.92) = -4.6 \pm 6.10 = (-10.7, 1.50) \]


We are 95% confident that the true mean difference in anger scores between the two treatment groups
is between −10.7 and 1.5. Since this interval for µ1 − µ2 contains 0, we cannot conclude that
µ1 − µ2 < 0; that is, we cannot conclude that Prozac reduces anger. Two notes: (i) This is equivalent to a
2-sided test ($H_A: \mu_1 \ne \mu_2$), not a 1-sided test ($H_A: \mu_1 < \mu_2$). (ii) Note that these are very small
samples.
b) To test whether Prozac reduces mean score we test as follows, making use of calculations
made in part a) (α = 0.05):

\[ H_0: \mu_1 - \mu_2 = 0 \qquad H_A: \mu_1 - \mu_2 < 0 \]
\[ TS: t_{obs} = \frac{(\bar X_1 - \bar X_2) - \Delta_0}{\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{-4.6}{2.92} = -1.58 \]
\[ RR: t_{obs} < -t_{\alpha,n_1+n_2-2} = -t_{.05,20} = -1.725 \]


Since our test statistic does not fall in the rejection region, we do not reject the null hypothesis of
no treatment effect. The P-value is the area under the t-distribution with 20 degrees of freedom
below −1.58, which is between 0.05 and 0.10 ($-t_{.05,20} = -1.725$ and $-t_{.10,20} = -1.325$).
Source: Salzman, et al (1995), “Effects of Fluoxetine on Anger in Symptomatic Volunteers with
Borderline Personality Disorder,” Journal of Clinical Psychopharmacology, 15:23-29.
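The calculations in this example are easy to check in Python (a sketch of ours using the Table 17 statistics; variable names are illustrative):

```python
# Pooled-variance t computations for Example 7.2 (Table 17 statistics).
from math import sqrt
from scipy.stats import t

xbar1, v1, n1 = 40.3, 25.7, 13    # fluoxetine group (sample variance given)
xbar2, v2, n2 = 44.9, 75.2, 9     # placebo group

df = n1 + n2 - 2                                    # 20
sp2 = ((n1 - 1)*v1 + (n2 - 1)*v2) / df              # pooled variance = 45.5
se = sqrt(sp2 * (1/n1 + 1/n2))                      # about 2.92

tcrit = t.ppf(0.975, df)                            # 2.086
diff = xbar1 - xbar2                                # -4.6
print("95% CI:", (diff - tcrit*se, diff + tcrit*se))   # about (-10.7, 1.5)

t_obs = diff / se                                   # about -1.58
print("one-sided p-value:", t.cdf(t_obs, df))       # about .066, in (.05, .10)
```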

7.2.2 Paired Samples


Samples are said to be paired if the measurements between the two samples are related. This is
often the case when we apply two ‘treatments’ to the same subjects or experimental material. The
following examples describe situations in which the experiment consists of paired samples.
1. An educator would like to compare student scores on two different tests of natural ability.
She selects 20 students at random and has each student take each exam (in random order),
measuring the students’ scores on each exam. These samples are paired because the two
samples are made up of the same students. Some students will do very well on both exams,
while others may do poorly on both exams. However, if one exam tends to yield higher scores
than the other, it should show up for most or all of the students.

2. A clothing manufacturer wishes to compare the color retention of two types of blue dye. She
selects a sample of 10 types of fabric, cutting each piece in half, and applying each type of
dye to a half of the piece. Each of the pieces are washed 15 times, and the amount of fading
is measured. (NOTE: there are 20 total measurements here since each type of fabric receives
both dyes). These samples are paired because each piece of experimental material receives
each dye.

The analysis of paired data involves computing the difference in the two measurements for each
subject and then treating these differences as a single–sample. For each subject (or experimental
unit), we observe two measurements X1i and X2i (the i just represents which subject in the sample
the measurement represents). Then, for each subject, we calculate Di = X1i − X2i . Now, testing
whether the 2 population means are equal is equivalent to testing whether or not the mean difference
is 0. We compute:
\[ \bar D = \frac{\sum_{i=1}^{n} D_i}{n}, \qquad S_d^2 = \frac{\sum_{i=1}^{n}(D_i - \bar D)^2}{n-1}. \]
The $(1-\alpha)100\%$ confidence interval for $\mu_1 - \mu_2 = \mu_D$ is:
\[ \bar D \pm t_{\alpha/2,n-1}\frac{S_d}{\sqrt{n}} \]

Note the similarity between this and the single–sample case.


To test hypotheses concerning the difference between the two population means, we use the
following method.

1. H0 : µ1 − µ2 = µD = ∆0

2. $H_A: \mu_D \ne \Delta_0$ or $H_A: \mu_D > \Delta_0$ or $H_A: \mu_D < \Delta_0$ (which alternative is appropriate should


be clear from the setting).
3. T.S.: $t_{obs} = \frac{\bar D - \Delta_0}{S_d/\sqrt{n}}$

4. R.R.: |tobs | > tα/2,n−1 or tobs > tα,n−1 or tobs < −tα,n−1 (which R.R. depends on which
alternative hypothesis you are using).

5. p-value: 2P (T > |tobs |) or P (T > tobs ) or P (T < tobs ) (again, depending on which alternative
you are using).

Example 7.3 – Nicotine Delivery Patches


The manufacturers of Nicoderm conducted an experiment to compare the delivery of nicotine
of their patch versus that of their competitor Habitrol. They had 24 adult male smokers wear each
patch for 5 days (half wore Nicoderm first and Habitrol second, and the other half wore Habitrol first
and Nicoderm second). There was a 6-day washout period between wearing the two patches. The
outcome measured was the amount of nicotine in the bloodstream over the fifth day. Higher values
mean more nicotine has been delivered to the bloodstream from the patch (the measure is called
AUC - area under the concentration vs time curve). The same subjects are used for each patch
because subjects' metabolisms differ greatly, and this removes subject-to-subject variability.
The mean difference (Nicoderm - Habitrol) among the n = 24 subjects was D = 55.0, with a
standard deviation among the differences of Sd = 69.8. First, we test whether the true means differ
(with α = 0.05), then we obtain a 95% confidence interval for difference in true means.
a) The test is done through the following steps.
1. $H_0: \mu_1 - \mu_2 = \mu_D = 0$ \qquad $H_A: \mu_1 - \mu_2 = \mu_D \ne 0$
2. T.S.: $t_{obs} = \frac{\bar D - \Delta_0}{S_d/\sqrt{n}} = \frac{55.0 - 0}{69.8/\sqrt{24}} = \frac{55.0}{14.2} = 3.87$

3. R.R.: |tobs | > tα/2,n−1 = t.025,23 = 2.069

4. p-value: 2P (T > tobs ) = 2P (T > 3.87) < 2P (t > 2.808) = 2(.005) = .01 (since 2.808 is the
largest value on the table for 23 d.f.).
We can conclude that the mean amount of nicotine delivered is higher for Nicoderm than for
Habitrol (since we reject H0 and D is positive).

b) The 95% confidence interval for the difference in true means is:
\[ \bar D \pm t_{\alpha/2,n-1}\frac{S_d}{\sqrt{n}} \equiv 55.0 \pm 2.069\frac{69.8}{\sqrt{24}} \equiv 55.0 \pm 29.4 \equiv (25.6, 84.4) \]
We can conclude that the true mean for Nicoderm is between 25.6 and 84.4 units higher than the
true mean for Habitrol.
Source: S.K. Gupta, et al, (1995), “Comparison of the Pharmacokinetics of Two Nicotine Trans-
dermal Systems: Nicoderm and Habitrol,” Journal of Clinical Pharmacology, 35:493-498.
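Again, the hand computations can be checked in a few lines (our sketch; small rounding differences, e.g. 3.86 vs 3.87, come from rounding $S_d/\sqrt{n}$ to 14.2 above):

```python
# Paired t computations for the Nicoderm vs Habitrol differences.
from math import sqrt
from scipy.stats import t

dbar, sd, n = 55.0, 69.8, 24
se = sd / sqrt(n)                              # about 14.25 (14.2 above)

t_obs = dbar / se                              # about 3.86 (3.87 above)
p_value = 2 * t.sf(abs(t_obs), n - 1)          # about .0008, well below .01
print(f"t = {t_obs:.2f}, p = {p_value:.4f}")

tcrit = t.ppf(0.975, n - 1)                    # 2.069
print("95% CI:", (dbar - tcrit*se, dbar + tcrit*se))   # about (25.5, 84.5)
```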

Example 7.4 – Consumer Response to Introduction of New Coke (1985)


In a well-publicized move, Coca-Cola made a bold business decision to replace the formulation
of its flag-ship soda Coke with a new formulation (New Coke). Preliminary research was conducted
and produced the following information:
From 1981 to 1984, Coca-Cola tested the new formula in studies involving more than 190,000
consumers in 25 cities. With the brands not identified, the New Coke flavor was preferred to the original
one by a proportion 0.55 (55%) of consumers. When the same consumers were told what they were tasting,
preference for the New Coke was 0.61 (61%).
However, there was a major problem: Consumers weren’t told that old Coke would be removed
from the market.
Coca-Cola introduces New Coke, and consumers revolt. The problem?
Referring to the fact that Coca-Cola researchers never made it clear to the consumers whom
they tested that original formula Coca-Cola would not be available as a choice, these executives
admitted ”[t]hat was a mistake” and ”maybe we goofed”.
A psychological theory of reactance is hypothesized to occur if it is the case that New Coke
wins in blind taste-tests and original Coke wins in labeled taste-tests. New Coke will be preferred
in both cases if reactance diminishes over time. Subjects tasted both new and old Coke under two
conditions (open label and blind), giving a score of 0-100 on taste. Different subjects were in the
two groups. That is, each subject drank the two formulations of Coke either blindly or open label,
not both.

a) Labeling µnew as the true mean score for New Coke and µold as the true mean score for Old
Coke, give the alternative (research) hypotheses of reactance for the two conditions (blind and
labeled), where in each condition H0 : µnew − µold = 0:

Blind: HA : µnew − µold ____ 0    Labeled: HA : µnew − µold ____ 0

b) The following sample statistics were obtained from two samples of consumers. The samples
were taken approximately 7 months after Coke was re-released after the New Coke disaster. Within
each sample, subjects tasted both Coke and New Coke, rating each brand on a scale of 0-100. The
sample means, mean differences, and standard deviation of the differences are given below. Compute
the two test statistics for the tests from part a).

Blind: $\bar X_{new} = 59.5$, $\bar X_{old} = 31.3$, $\bar D = 28.2$, $S_d = 30.5$, $n = 25$

Labeled: $\bar X_{new} = 60.7$, $\bar X_{old} = 70.8$, $\bar D = -10.1$, $S_d = 26.0$, $n = 24$

c) Give the appropriate rejection regions, assuming that preference differences are approximately
normally distributed (each test based on α = 0.05):

d) Can we conclude that reactance has been demonstrated by consumers? Note that since we
are conducting two independent tests, our overall Type I error rate is approximately 2(0.05)=0.10
(that we reject at least one null hypothesis when they are both true).

Example 7.5 – Battle of the Network Nightly News


Who gets the best ratings, Peter Jennings or Dan Rather? We take a random sample of weekly
mean number of households for each news program over the course of 1997. For each week, we
observe the mean number of viewers as reported by Nielsen (actually, these are estimates, but ignore
that for our purposes) for ABC and CBS. We wish to determine whether there are differences in
the mean weekly viewership between the two networks.

a) Would these samples be considered independent or paired? Why? (Hint: what might cause
variations in weekly news ratings, independent of the actual news programs)

b) If we denote x1i be the ratings for ABC on week i and x2i be the ratings for CBS on the
same week. What are the null and alternative hypotheses that we wish to test?

c) The data are given in Table 18, and are in millions of viewers per night. Give the mean and
standard deviation of the differences.
Week (i) ABC (x1i ) CBS (x2i ) Di = x1i − x2i Di2
1 8.2 7.2 1.0 1.00
2 7.2 6.3 0.9 0.81
3 7.1 6.2 0.9 0.81
4 8.7 8.9 -0.2 0.04
5 7.2 7.1 0.1 0.01
6 6.6 6.4 0.2 0.04
7 8.8 7.6 1.2 1.44
8 8.5 8.6 -0.1 0.01
9 9.6 7.8 1.8 3.24
10 9.2 8.5 0.7 0.49
Sum 81.1 74.6 6.5 7.89

Table 18: Sample of 10 weeks viewers for ABC and CBS news from 1997

d) Test whether the true means differ for the two networks. Clearly state the null and alternative
hypotheses, test statistic, and rejection region.

e) What is the appropriate conclusion?

(i)µABC > µCBS (ii)µABC = µCBS (iii)µABC < µCBS

f) Based on your conclusion, we are at risk of (but have not necessarily made):
(i) A Type I Error
(ii) A Type II Error
(iii) No error
(iv) Either a Type I or Type II Error

Source:Daily Variety (1997 editions)

7.3 Statistical Models


We have seen the concepts of random variables, probability distributions, and inferential methods
concerning their parameters (confidence intervals and tests of hypotheses). We will now write the
random variable in a form that breaks its value down into two components – its mean, and its
‘deviation’ from the mean. We can write X as follows:

X = µ + (X − µ) = µ + ε,

where ε = X − µ. Note that if X ∼ N (µ, σ), then ε ∼ N (0, σ). Also note that µ is unknown
(although we can estimate it), and so ε is unknown as well. We will be fitting different models in
this course, and estimating parameters corresponding to the models, as well as testing hypotheses
concerning them.
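As a quick illustration of this decomposition, the following toy simulation (our own, with arbitrarily chosen µ and σ) draws errors ε ∼ N(0, σ) and forms X = µ + ε:

```python
# Toy simulation of X = mu + eps with eps ~ N(0, sigma).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 50.0, 5.0                   # illustrative values only
eps = rng.normal(0.0, sigma, size=5)    # realized error terms
x = mu + eps                            # observed values
print(x)
print(x - mu)                           # recovers eps: the deviations from mu
```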
8 Lecture 8 — Experimental Design and the Analysis of Variance
Textbook Section: Section 11.2
Problems: 11.7,9,14,15,17

In this section, we will look at the effects of strictly qualitative variable(s) on the mean response
of a quantitative outcome variable. There are two distinct methods by which these measurements
can be made: controlled experiments and sample surveys. Some situations where analyses of
this type are used are given below:

AOV1 A drug manufacturer would like to compare four formulations to decide which is most
effective at reducing arthritic pain.

AOV2 A psychologist wishes to find out which of six classroom atmospheres provides the best
learning results among young children.

AOV3 A management consultant is interested in deciding if three managerial techniques differ in
terms of their corporate efficiencies.

Before we get involved in the mathematics and model formulation, we will describe the two exper-
imental situations and define some useful terms.
In a controlled experiment, the experimenter selects a sample of subjects or objects from
a population of such subjects or objects. These are referred to as experimental units. These
experimental units are what we will make our measurements on. After the experimental units are
selected, treatments are applied to the experimental units. These treatments are made up of one
or more factors, or experimental variables. We wish to estimate the effects of these treatments
on the units. We will refer to the levels as the intensity settings of the factors. Note that in a
controlled experiment, we are applying the treatments to the experimental units, and we wish to
estimate the effects of the various levels of the factor(s). Generally we would like to decide if certain
levels provide higher (or lower) mean responses than other levels. Note that we have already done
this in the case of one factor possessing two levels in the previous chapter (two-sample t-test and
the paired difference test).
In an observational study, the experimenter selects samples of objects from several popula-
tions and wishes to observe if the population means are the same. The mathematics of the analysis
is the same for both of these methods, but the interpretations have subtle differences. In this
situation, we are not applying treatments to the elements of the sample, but rather observing some
measurement of interest. We still often refer to these different populations as treatments, even
though we aren’t really applying them to experimental units.
A couple of examples should clear up this difference. First, consider a study of four blood pressure
medications. Twenty subjects with relatively comparable levels of high blood pressure are sampled,
and each subject is given one of the four medications for a month with their blood pressure being
measured at the end of the study. In this setting, the patients are the experimental units, the four
medications are the treatments, and we have randomly assigned one medication (level) to each
subject. This is a controlled experiment. Now consider a study to observe whether four brands of
television sets have the same mean lifetimes. We sample five of each brand, observing the lifetime
of each set. In this case, we consider the brands to be treatments, although we are not applying
brands to ‘experimental units’. However, in both of these situations, the method of testing for
differences among the effects of the medications, and of testing for differences among the brand
means are identical. Thus, we will not need to distinguish between the situations explicitly, but
it is important to distinguish which one you are in from an interpretation standpoint. We will
always refer to these designs from the controlled experiment setting, with obvious extensions to the
observational study being implied.

8.1 Completely Randomized Design (CRD)


In the Completely Randomized Design, we have one factor that we are controlling. This factor has
C levels, and we measure nj units on the j th level of the factor. In terms of the sample survey, we
are sampling nj items from the j th population, j = 1, . . . , C. We will define the observed responses
as Xij , representing the measurement on the ith experimental unit, receiving the j th treatment.
We will write this in model form as follows:

Xij = µ + αj + εij = µj + εij .

Here, µ is the overall mean measurement across all treatments, αj is the effect of the j th treatment
(µj = µ+αj ), and εij is a random error component that is assumed to be normally distributed with
mean 0 and standard deviation σ. This εij can be thought of as the fact that there will be variation
among the measurements of different experimental units receiving the same treatment in the case of
a controlled experiment, or the fact that different elements sampled from the same population will
have varying measurements. This means that our model assumes that Xij ∼ N (µ+αj = µj , σ). We
further assume the measurements are independent of one another. We will place a condition on the
effects αj , namely that they sum to zero. Of interest to the experimenter is whether or not there
is a treatment effect, that is do any of the levels of the treatment provide higher (lower) mean
response than other levels. This can be hypothesized symbolically as H0 : α1 = α2 = · · · = αC = 0
(no treatment effect) against the alternative HA : Not all αj = 0 (treatment effects exist). Before
we set up this testing procedure, we must define a few items.

\[ N = n_1 + \cdots + n_C \]
\[ \bar X_j = \frac{\sum_{i=1}^{n_j} X_{ij}}{n_j} \]
\[ S_j^2 = \frac{\sum_{i=1}^{n_j}(X_{ij} - \bar X_j)^2}{n_j - 1} \]
\[ Total\ SS = \sum_{j=1}^{C}\sum_{i=1}^{n_j}(X_{ij} - \bar X)^2 \]
\[ SST = \sum_{j=1}^{C}\sum_{i=1}^{n_j}(\bar X_j - \bar X)^2 = \sum_{j=1}^{C} n_j(\bar X_j - \bar X)^2 \]
\[ SSE = \sum_{j=1}^{C}\sum_{i=1}^{n_j}(X_{ij} - \bar X_j)^2 = \sum_{j=1}^{C}(n_j - 1)S_j^2 \]

Total SS represents the total variation of the sample measurements around the overall sample
mean. This Total variation is partitioned into variation Between treatment means (SST ) and
variation Within treatments (SSE). Often, we refer to SST as the Model sum of squares and
SSE as the Error sum of squares. Note that the model and error sums of squares add up to the
total sum of squares. That is:
\[ Total\ SS = SST + SSE \]
The point of the Analysis of Variance is to detect whether differences exist in the population means
of the treatments, and if so, to determine which treatments provide higher (lower) mean responses.
Associated with each source of variation, we have degrees of freedom. The total sum of
squares has N − 1 degrees of freedom, since it is made up of N − 1 independent terms (we have
estimated the mean from the sample). The model sum of squares measures the variation in the
C treatment means around the overall mean, and has dfT = C − 1 degrees of freedom. Finally,
the error sum of squares is made up of variation of the individual measurements around the C
treatment means, and has dfE = N − C. Note that just as the model and error sums of squares
sum to the total sum of squares, the degrees of freedom also are additive. That is:

\[ df_{Total} = N - 1 = (C - 1) + (N - C) = df_T + df_E \]

Also, we can obtain an estimate of the error variance $\sigma^2$ by taking the 'average' squared distance of each observed value to its treatment mean. That is,
\[ s^2 = \frac{\sum_{j=1}^{C}(n_j - 1)S_j^2}{N - C} = \frac{SSE}{N - C} = MSE; \]
we divide by $N - C$ because we are estimating $C$ parameters (treatment means). We can set up an
Analysis of Variance table representing the decomposition of the total variation into parts due to
the model (between treatments) and error (within treatments); this is shown in Table 19.

ANOVA
Source of Variation   Sum of Squares                                                    Degrees of Freedom   Mean Square                F
TREATMENTS            $SST = \sum_{j=1}^{C} n_j(\bar X_j - \bar X)^2$                   $C - 1$              $MST = \frac{SST}{C-1}$    $F = \frac{MST}{MSE}$
ERROR                 $SSE = \sum_{j=1}^{C}(n_j - 1)S_j^2$                              $N - C$              $MSE = \frac{SSE}{N-C}$
TOTAL                 $Total\ SS = \sum_{j=1}^{C}\sum_{i=1}^{n_j}(X_{ij} - \bar X)^2$   $N - 1$

Table 19: The Analysis of Variance Table for the Completely Randomized Design

Recall the model that we are using to describe the data in this design:

Xij = µ + αj + εij = µj + εij .

The effect of the j th treatment is αj . If there is no treatment effect among any of the levels of the
factor under study, that is that the population means of the C treatments are the same, then each
of the parameters αj are 0. This is a hypothesis we would like to test. The alternative hypothesis
will be that not all treatments have the same mean, or equivalently, that treatment effects exist (not
all αj are 0). If the null hypothesis is true (all C population means are equal), then the statistic
M ST
F = M SE follows the F -distribution with C − 1 numerator and N − C denominator degrees’ of
freedom. Large values of F are evidence against the null hypothesis of no treatment effect (recall
what SST and SSE are).
Upper percentage points of the F–distribution are given in Table A.7 (pp A-16 – A-25) of your
text book. This distribution has 2 parameters ν1 and ν2, which are called the numerator and
denominator degrees of freedom, respectively. These tables give the upper tail cut off for various
values of ν1, ν2, and α (the upper tail probability). Under the null hypothesis of no differences
among treatment means (α1 = · · · = αC = 0), the test statistic $F = \frac{MST}{MSE}$ has an F–distribution
with ν1 = C − 1 and ν2 = N − C degrees of freedom. Large values of $F = \frac{MST}{MSE}$ are evidence
against the null hypothesis. We will denote $F_{\alpha,\nu_1,\nu_2}$ as the cut off value that leaves a probability of
α in the upper tail of the F–distribution with ν1 and ν2 degrees of freedom. The testing procedure
is as follows:
1. H0 : α1 = · · · = αC = 0 (µ1 = · · · = µC ) (No treatment effect)

2. HA : Not all αj are 0 (Treatment effects exist)


3. T.S. $F_{obs} = \frac{MST}{MSE}$

4. R.R.: Fobs > Fα,C−1,N −C

5. p-value: P (F > Fobs )

Example 8.1 – Sexual Side Effects of 4 Antidepressants


A comparison of C = 4 antidepressants in terms of reported sexual side effects was conducted at
the University of Alabama Medical School. Patients currently prescribed to antidepressants were
contacted and asked a series of questions regarding their treatment and side effects. One response
of interest was their perceived change in libido, which was a continuous response between −2 and
+2, based on a mark along a continuous line segment. Note that this is an observational study,
as people had already been assigned to treatments and identified after they had begun treatment.
Note that three of the four brands (Prozac, Zoloft, and Paxil) are selective serotonin re-uptake
inhibitors (SSRI's), while the fourth brand does not come from that class of drug (Wellbutrin).
X — Self–reported change in libido after treatment on a continuous visual analogue scale
ranging from −2 to 2, with 0 representing no change. ($\bar X = -0.38$)
Summary calculations are given in Table 20 and Analysis of Variance in Table 21.

Drug (j) nj Xj Sj nj (X j − X)2 (nj − 1)Sj2


Wellbutrin (1) 22 0.46 0.80 22(0.46 − (−0.38))2 = 15.52 (22 − 1)(0.80)2 = 13.44
Prozac (2) 37 −0.49 0.97 37(−0.49 − (−0.38))2 = 0.45 (37 − 1)(0.97)2 = 33.87
Paxil (3) 21 −0.90 0.73 21(−0.90 − (−0.38))2 = 5.68 (21 − 1)(0.73)2 = 10.66
Zoloft (4) 27 −0.49 1.25 27(−0.49 − (−0.38))2 = 0.33 (27 − 1)(1.25)2 = 40.63
n = 107 X = −0.38 — SSC = 21.98 SSE = 98.60

Table 20: Summary statistics and sums of squares calculations for sexual side effects of antidepres-
sant data.

ANOVA
Source of Sum of Degrees of Mean
Variation Squares Freedom Square F
TREATMENTS 21.98 3 7.33 7.64
ERROR 98.60 103 0.96
TOTAL 120.58 106

Table 21: The Analysis of Variance table for sexual side effects in four antidepressant groups

Are there differences among the effects of the four brands (Test with α = 0.05)?

• H0 : µ1 = µ2 = µ3 = µ4    Ha : Not all µj are equal


• $TS: F_{obs} = \frac{MST}{MSE} = \frac{7.33}{0.96} = 7.64$

• RR : Fobs ≥ Fα,C−1,N −C = F0.05,3,103 ≈ 2.68

• p–value:P (F ≥ Fobs ) = P (F ≥ 7.64) < P (F ≥ 4.58) ≈ 0.005

We can conclude that the sexual side effects differ among the 4 brands at virtually any level
of α since our P –value is so small. There is virtually no chance we would have observed
this large of variation among the four sample means if the true (unknown) population
means are the same.
Source: J.G. Modell, et al (1997), “Comparative Sexual Side Effects of Bupropion, Fluoxetine,
Paroxetine, and Sertraline,” Clinical Pharmacology & Therapeutics, 61:476-487.
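The sums of squares and F statistic in Tables 20 and 21 can be rebuilt from the summary statistics alone; a sketch of ours (tiny discrepancies arise because the tables round the grand mean to −0.38):

```python
# One-way ANOVA rebuilt from the Table 20 summary statistics.
import numpy as np
from scipy.stats import f

n    = np.array([22, 37, 21, 27])              # Wellbutrin, Prozac, Paxil, Zoloft
mean = np.array([0.46, -0.49, -0.90, -0.49])
sd   = np.array([0.80, 0.97, 0.73, 1.25])

N, C = n.sum(), len(n)
grand = (n * mean).sum() / N                   # about -0.375 (-0.38 in text)
sst = (n * (mean - grand)**2).sum()            # about 21.97 (between groups)
sse = ((n - 1) * sd**2).sum()                  # about 98.60 (within groups)

F = (sst / (C - 1)) / (sse / (N - C))          # about 7.65 (7.64 in Table 21)
print(f"F = {F:.2f}, p-value = {f.sf(F, C - 1, N - C):.5f}")
```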

Example 8.2 – Corporate Social Responsibility and the Marketplace


A study was conducted to determine whether levels of corporate social responsibility (CSR)
vary by industry type. That is, can we explain a reasonable fraction of the overall variation in
CSR by taking into account the firm’s industry? If there are differences by industry, this might
be interpreted as the existence of ”industry forces” that affect what a firm’s CSR will be. For
instance, consumer and service firms may be more aware of social issues and demonstrate higher
levels of CSR than companies that deal less with the direct public (more removed from the retail
marketplace).
The partial ANOVA table is given in Table 22. Use it to complete the following questions.

ANOVA
Source of Degrees of Sum of Mean
Variation Freedom Squares Square F
Industry (Trts) 17
Error 162 57.55
Total 82.71

Table 22: The Analysis of Variance table for Corporate Social Responsibility

a) Complete the ANOVA table.

b) Test whether mean CSR scores differ among the industries (α = 0.05).

c) What can be said of the P -value (give a range)?

d) How many firms were represented in the sample?

e) How many industries were represented?

Source: M.T. Cottrill (1990), “Corporate Social Responsibility and the Marketplace,” Journal
of Business Ethics, 9:723-729.
Example 8.3 – Salary Progression By Industry
A recent study reported salary progressions during the 1980's among C = 8 industries. Results
including industry means, standard deviations, and sample sizes are given in Table 23. Also included
are columns that produce the treatment (between industry) and error (within industry) sums of
squares. The overall mean is $\bar X = 65.11$.

Industry (j) nj Xj Sj nj (X j − X)2 (nj − 1)Sj2


Pharmaceuticals (1) 35 70.69 27.64 1089.8 25975.0
Communications (2) 74 62.23 22.80 613.7 37948.3
Food (3) 49 54.93 16.23 5077.9 12643.8
Financial Srvcs (4) 21 131.16 145.42 91615.0 422939.5
Retail (5) 10 43.54 11.97 4652.6 1289.5
Hotel and Travel (6) 21 65.80 30.17 10.0 18204.6
Chemicals (7) 60 60.04 20.39 1542.2 24529.4
Manufacturing (8) 78 60.43 22.28 1708.3 38222.7
Sum 348 106309.5 581752.8

Table 23: Salary Progressions by Industry

a) Give the Analysis of Variance

b) Test whether the true mean salary progressions differ among these industries (α = 0.05).

Source: L.K. Stroh and J.M. Brett (1996), “The Dual-Earner Dad Penalty in Salary Progres-
sion,” Human Resources Management 35:181-201.

Example 8.4 – Professional Women as a Potential Market Segment


A study reported results of a survey of women's shopping habits. Among the variables
measured and reported was the amount of time spent shopping in a new grocery store. Women
were classified as: housewives (H), professional working women (P), or non-professional working
women (NP). Group means and sample sizes are given below and Table 24 gives the basis of the
Analysis of Variance.
$\bar X_H = 56.9$, $n_H = 80$ \qquad $\bar X_P = 53.3$, $n_P = 24$ \qquad $\bar X_{NP} = 49.7$, $n_{NP} = 40$

Source df SS
Groups 1411.2
Error 141 30244.5
Total

Table 24: ANOVA table for Professional Women as Market Segment study

a) Complete the ANOVA table.


b) Give the estimate of the within group standard deviation in shopping times.

c) Give the null and alternative hypotheses for testing whether true mean shopping times differ
among these three potential market segments.

d) Give the test statistic, rejection region, and conclusion for the test in part c) (use 120
denominator df ).

e) What can be said of the P -value?

(i)P < .05 (ii)P > .05 (iii)P cannot be determined

Source: M. Joyce and J. Guiltinan (1978), “The Professional Woman: A Potential Market
Segment for Retailers,” Journal of Retailing, 54:59-70.

Example 8.5 – Impact of Attention on Attribute Performance Assessments

A study was conducted to determine whether the amount of attention (measured by the time a
subject is exposed to an advertisement) is related to the importance ratings of a product attribute.
Subjects were asked to rate on a scale the importance of water resistance in a watch. People were
exposed to the ad for either 60, 105, or 150 seconds. The means, standard deviations, and sample
sizes for each treatment are given in Table 25. The overall mean is computed as follows:
\[ \bar X = \frac{\text{Total importance score}}{\text{Overall sample size}} = \frac{11(4.3) + 10(6.8) + 9(7.1)}{11 + 10 + 9} = \frac{179.2}{30} = 6.0 \]

Statistic 60 seconds (j = 1) 105 seconds (j = 2) 150 seconds (j = 3)


Mean (X j ) 4.3 6.8 7.1
Std Dev (Sj ) 1.8 1.7 1.5
Sample Size (nj ) 11 10 9

Table 25: Summary statistics for Attention/Attribute Performance Study

a) Set up the ANOVA table.

b) Test whether differences exist among the mean importance scores for the three exposure
times (α = 0.05)

Source: S.B. MacKenzie (1986), “The Role of Attention in Mediating the Effect of Advertising
on Attribute Performance,” Journal of Consumer Research, 13:174-195.
9 Lecture 9 — Comparison of Treatment Means
Textbook Section: 11.3
Problems: Apply Tukey’s Method to the Problems in Section 11.2

Assuming that we have concluded that treatment means differ, we generally would like to know
which means are significantly different. This is generally done by making either pre–planned or
all pairwise comparisons between pairs of treatments. We will look at how to make pre–planned
comparisons, and then how to make all comparisons. The two methods are very similar.

9.1 Pre–Planned Comparisons


Suppose we want to compare treatments i and j. That is, we’d like to decide whether or not
µi = µj . In previous coursework, we have done this when we had 2 populations to be compared
(the two–sample t-test). The method for making comparisons among 2 of the C populations in
a CRD is exactly like what we did then, except now $S^2 = MSE$ (see above) and the degrees of
freedom are N − C. It is useful to think of M SE as a pooled estimate of the common variances
among the C populations, just like Sp2 was pooled in the two–sample case. The (1 − α)100%
confidence interval for the difference in two means (µi − µj ) is:
\[ (\bar X_i - \bar X_j) \pm t_{\alpha/2,N-C}\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}. \]

The inference we can make concerning the population means is as follows:

1. If the entire confidence interval for µi − µj is positive, we conclude that treatment i has
a higher mean than treatment j.

2. If the entire confidence interval for µi − µj is negative, we conclude that treatment i has
a lower mean than treatment j.

3. If the interval contains both positive and negative values, we cannot conclude that
the means of treatments i and j are different.

9.2 All Pairwise Comparisons


The previous method described works well with pre–planned comparisons, but can lead to mis-
leading results when being used on many or all pairwise comparisons among treatments. Various
methods have been developed to handle all possible comparisons and keep the overall error rate at
α. We will describe one commonly used method known as Tukey’s method of multiple comparisons.
Computer packages will print these comparisons automatically. Tukey’s method involves setting up
confidence intervals for all pairs of treatment means simultaneously. If there are C treatments, there
will be $\frac{C(C-1)}{2}$ such intervals. The general form, allowing for different sample sizes for treatments
i and j, is:
\[ (\bar X_i - \bar X_j) \pm q_{\alpha,C,N-C}\sqrt{\frac{MSE}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}, \]
where qα,C,N −C is called the studentized range and is given in Table A.10 on page A–29. When the
sample sizes are equal ($n_i = n_j$), the formula can be simplified to:
\[ (\bar X_i - \bar X_j) \pm q_{\alpha,C,N-C}\sqrt{\frac{MSE}{n_i}}. \]

The term
\[ q_{\alpha,C,N-C}\sqrt{\frac{MSE}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} \]
is referred to as Tukey's "Honest Significant Difference", or HSD.


An alternative approach to forming the confidence interval is to simply compare |X i − X j |
with HSDi,j . If the difference in means exceeds (in absolute value) HSD, we can conclude that
the population means differ, otherwise we cannot conclude that they differ. The direction of any
significant difference depends on the sign of $(\bar X_i - \bar X_j)$.
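A sketch of the HSD computation (ours; it assumes SciPy 1.7 or later, which provides the studentized range distribution as scipy.stats.studentized_range):

```python
# Tukey HSD for one pair of treatment means; requires SciPy 1.7+.
from math import sqrt
from scipy.stats import studentized_range

def tukey_hsd(mse, ni, nj, C, N, alpha=0.05):
    q = studentized_range.ppf(1 - alpha, C, N - C)   # q_{alpha, C, N-C}
    return q * sqrt((mse / 2) * (1/ni + 1/nj))

# Antidepressant study of Example 9.1 below: MSE = 0.96, C = 4, N = 107.
print(tukey_hsd(0.96, 22, 37, 4, 107))   # about 0.69 (0.686 with the table's q = 3.68)
```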

Example 9.1
We’ve determined differences exist among the sexual side effects of the antidepressant brands.
Which brands differ?
We use Tukey's HSD test and make $\frac{C(C-1)}{2} = \frac{4(3)}{2} = 6$ comparisons with a simultaneous Type
I error rate of α = 0.05. The critical difference for treatments i and j is:
\[ HSD_{i,j} = q_{\alpha,C,N-C}\sqrt{\frac{MSE}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} \]

For treatments 1 and 2 (Wellbutrin vs Prozac), we have $q_{.05,4,103} \approx 3.68$, $MSE = 0.96$, $n_1 = 22$, $n_2 = 37$, which leads to:
\[ HSD_{1,2} = 3.68\sqrt{\frac{0.96}{2}\left(\frac{1}{22} + \frac{1}{37}\right)} = 3.68\sqrt{.0348} = 0.686 \]

Thus, we compare the difference between the means for Wellbutrin and Prozac with this critical
difference:
\[ \bar X_1 - \bar X_2 = 0.46 - (-0.49) = 0.95 > 0.686 \Rightarrow \mu_1 > \mu_2 \]
Since the means differ by more than 0.686, we conclude they differ and that Wellbutrin users report
higher scores on average than Prozac users. The results for all pairs are given in Table 26, where
N.S.D. in the Conclusion column means “Not Significantly Different”.
The primary conclusion is that Wellbutrin users have a higher population mean than the other
three brands’ users. None of the three SSRI’s means can be determined to differ.

Example 9.2 – Salary Progression By Industry


Refer to Example 8.3.

a) Between which two industries will Tukey’s HSD be the largest? Compute this value. Do
these two industries differ significantly (use α = 0.05)?
Simultaneous 95% CI’s
Comparison Xi − Xj HSDi,j Conclude
1v2 0.46 − (−0.49) = 0.95 0.686 µ1 > µ 2
1v3 0.46 − (−0.90) = 1.36 0.778 µ1 > µ 3
1v4 0.46 − (−0.49) = 0.95 0.732 µ1 > µ 4
2v3 −0.49 − (−0.90) = 0.41 0.697 N.S.D.
2v4 −0.49 − (−0.49) = 0.00 0.645 N.S.D
3v4 −0.90 − (−0.49) = −0.41 0.742 N.S.D

Table 26: Tukey multiple comparisons for the sexual side effects study patients receiving antide-
pressants

b) Between which two industries will Tukey’s HSD be the smallest? Compute this value. Do
these two industries differ significantly (use α = 0.05)?

Example 9.3 – Professional Women as a Potential Market Segment


Refer to Example 8.4. Compute Tukey’s HSD for all three pairwise comparisons (use α = 0.05).

Example 9.4 – Impact of Attention on Attribute Performance Assessments


Refer to Example 8.5. Compute Tukey’s HSD for all three pairwise comparisons (use α = 0.05).
10 Lecture 10 — Simple Linear Regression I – Least Squares Estimation
Textbook Sections: 12.1–12.4
Problems: 12.1,3,4,5,8,13,16,23,26

Previously, we have worked with a random variable X that comes from a population that is
normally distributed with mean µ and variance σ 2 . We have seen that we can write X in terms of
µ and a random error component ε, that is, X = µ + ε. For the time being, we are going to change
our notation for our random variable from X to Y . So, we now write Y = µ + ε. We will now
find it useful to call the random variable Y a dependent or response variable. Many times, the
response variable of interest may be related to the value(s) of one or more known or controllable
independent or predictor variables. Consider the following situations:

LR1 A college recruiter would like to be able to predict a potential incoming student’s first–year
GPA (Y ) based on known information concerning high school GPA (X1 ) and college entrance
examination score (X2 ). She feels that the student’s first–year GPA will be related to the
values of these two known variables.

LR2 A marketer is interested in the effect of changing shelf height (X1 ) and shelf width (X2 ) on
the weekly sales (Y ) of her brand of laundry detergent in a grocery store.

LR3 A psychologist is interested in testing whether the amount of time to become proficient in a
foreign language (Y ) is related to the child’s age (X).

In each case we have at least one variable that is known (in some cases it is controllable), and a
response variable that is a random variable. We would like to fit a model that relates the response
to the known or controllable variable(s). The main reasons that scientists and social researchers
use linear regression are the following:

1. Prediction – To predict a future response based on known values of the predictor variables
and past data related to the process.

2. Description – To measure the effect of changing a controllable variable on the mean value
of the response variable.

3. Control – To confirm that a process is providing responses (results) that we ‘expect’ under
the present operating conditions (measured by the level(s) of the predictor variable(s)).

10.1 A Linear Deterministic Model


Suppose you are a vendor who sells a product that is in high demand (e.g. cold beer on the beach,
cable television in Gainesville, or life jackets on the Titanic, to name a few). If you begin your day
with 100 items, have a profit of $10 per item, and an overhead of $30 per day, you know exactly
how much profit you will make that day, namely 100(10)-30=$970. Similarly, if you begin the day
with 50 items, you can also state your profits with certainty. In fact for any number of items you
begin the day with (X), you can state what the day’s profits (Y ) will be. That is,

Y = 10 · X − 30.
This is called a deterministic model. In general, we can write the equation for a straight line as
Y = β0 + β1 X,
where β0 is called the Y–intercept and β1 is called the slope. β0 is the value of Y when X = 0,
and β1 is the change in Y when X increases by 1 unit. In many real–world situations, the response
of interest (in this example it’s profit) cannot be explained perfectly by a deterministic model. In
this case, we make an adjustment for random variation in the process.

10.2 A Linear Probabilistic Model


The adjustment people make is to write the mean response as a linear function of the predictor
variable. This way, we allow for variation in individual responses (Y ), while associating the mean
linearly with the predictor X. The model we fit is as follows:
E(Y |X) = β0 + β1 X,
and we write the individual responses as
Y = β0 + β1 X + ε,
We can think of Y as being broken into a systematic and a random component:

\[ Y = \underbrace{\beta_0 + \beta_1 X}_{\text{systematic}} + \underbrace{\varepsilon}_{\text{random}} \]
where X is the level of the predictor variable corresponding to the response, β0 and β1 are
unknown parameters, and ε is the random error component corresponding to the response whose
distribution we assume is N (0, σ), as before. Further, we assume the error terms are independent
from one another, we discuss this in more detail in a later chapter. Note that β0 can be interpreted
as the mean response when X=0, and β1 can be interpreted as the change in the mean response
when X is increased by 1 unit. Under this model, we are saying that Y |X ∼ N (β0 + β1 X, σ).
Consider the following example.

Example 10.1 – Coffee Sales and Shelf Space


A marketer is interested in the relation between the width of the shelf space for her brand of
coffee (X) and weekly sales (Y ) of the product in a suburban supermarket (assume the height is
always at eye level). Marketers are well aware of the concept of ‘compulsive purchases’, and know
that the more shelf space their product takes up, the higher the frequency of such purchases. She
believes that in the range of 3 to 9 feet, the mean weekly sales will be linearly related to the
width of the shelf space. Further, among weeks with the same shelf space, she believes that sales
will be normally distributed with unknown standard deviation σ (that is, σ measures how variable
weekly sales are at a given amount of shelf space). Thus, she would like to fit a model relating
weekly sales Y to the amount of shelf space X her product receives that week. That is, she is fitting
the model:
Y = β0 + β1 X + ε,
so that Y |X ∼ N (β0 + β1 X, σ).
One limitation of linear regression is that we must restrict our interpretation of the model to
the range of values of the predictor variables that we observe in our data. We cannot assume this
linear relation continues outside the range of our sample data.
We often refer to β0 + β1 X as the systematic component of Y and ε as the random component.
10.3 Least Squares Estimation of β0 and β1
We now have the problem of using sample data to compute estimates of the parameters β0 and β1 .
First, we take a sample of n subjects, observing values Y of the response variable and X of the
predictor variable. We would like to choose as estimates for β0 and β1 , the values b0 and b1 that
‘best fit’ the sample data. Consider the coffee example mentioned earlier. Suppose the marketer
conducted the experiment over a twelve week period (4 weeks with 3’ of shelf space, 4 weeks with
6’, and 4 weeks with 9’), and observed the sample data in Table 27.

Shelf Space Weekly Sales Shelf Space Weekly Sales


x y x y
6 526 6 434
3 421 3 443
6 581 9 590
9 630 6 570
3 412 3 346
9 560 9 672

Table 27: Coffee sales data for n = 12 weeks


Figure 10: Plot of coffee sales vs amount of shelf space

Now, look at Figure 10. Note that while there is some variation among the weekly sales at 3’,
6’, and 9’, respectively, there is a trend for the mean sales to increase as shelf space increases. If
we define the fitted equation to be the line

Ŷ = b0 + b1 X,

we can choose the estimates b0 and b1 to be the values that minimize the distances of the data points
to the fitted line. Now, for each observed response Yi , with a corresponding predictor variable Xi ,
we obtain a fitted value Ŷi = b0 + b1 Xi . So, we would like to minimize the sum of the squared
distances of each observed response to its fitted value. That is, we want to minimize the error
sum of squares, SSE, where:
SSE = Σ(Yi − Ŷi)² = Σ(Yi − (b0 + b1 Xi))²,

where here and below Σ denotes summation over i = 1, . . . , n.

A little bit of calculus can be used to obtain the estimates:

b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² = SSXY / SSXX ,

and

b0 = Ȳ − b1 X̄ = ΣYi /n − b1 (ΣXi /n).
Some shortcut equations, known as the corrected sums of squares and crossproducts, that while
not very intuitive are very useful in computing these and other estimates, are:

• SSXX = Σ(Xi − X̄)² = ΣXi² − (ΣXi)²/n

• SSXY = Σ(Xi − X̄)(Yi − Ȳ) = ΣXiYi − (ΣXi)(ΣYi)/n

• SSYY = Σ(Yi − Ȳ)² = ΣYi² − (ΣYi)²/n

Example 10.1 Continued – Coffee Sales and Shelf Space


For the coffee data, we observe the following summary statistics in Table 28.

Week   Space (X)   Sales (Y)   X²   XY   Y²


1 6 526 36 3156 276676
2 3 421 9 1263 177241
3 6 581 36 3486 337561
4 9 630 81 5670 396900
5 3 412 9 1236 169744
6 9 560 81 5040 313600
7 6 434 36 2604 188356
8 3 443 9 1329 196249
9 9 590 81 5310 348100
10 6 570 36 3420 324900
11 3 346 9 1038 119716
12 9 672 81 6048 451584
Totals:   ΣX = 72   ΣY = 6185   ΣX² = 504   ΣXY = 39600   ΣY² = 3300627

Table 28: Summary Calculations — Coffee sales data

From this, we obtain the following sums of squares and crossproducts.


SSXX = Σ(X − X̄)² = ΣX² − (ΣX)²/n = 504 − (72)²/12 = 504 − 432 = 72

SSXY = Σ(X − X̄)(Y − Ȳ) = ΣXY − (ΣX)(ΣY)/n = 39600 − (72)(6185)/12 = 39600 − 37110 = 2490

SSYY = Σ(Y − Ȳ)² = ΣY² − (ΣY)²/n = 3300627 − (6185)²/12 = 3300627 − 3187852.08 = 112774.92
From these, we obtain the least squares estimate of the true linear regression relation (β0 +β1 X).

b1 = SSXY / SSXX = 2490/72 = 34.5833

b0 = ΣY/n − b1 (ΣX/n) = 6185/12 − 34.5833(72/12) = 515.4167 − 207.5000 = 307.9167.

Ŷ = b0 + b1 X = 307.917 + 34.583X

So the fitted equation, estimating the mean weekly sales when the product has X feet of shelf
space, is Ŷ = b0 + b1 X = 307.917 + 34.5833X. Our interpretation for b1 is “the estimate for the
increase in mean weekly sales due to increasing shelf space by 1 foot is 34.5833 bags of coffee”.
Note that this should only be interpreted within the range of X values that we have observed in
the “experiment”, namely X = 3 to 9 feet.
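
To make the arithmetic concrete, here is a minimal Python sketch (ours, not part of the original notes) that reproduces the coffee fit from the raw data in Table 28 using the shortcut formulas above:

```python
# Rough check of the coffee least squares fit (data from Table 28).
x = [6, 3, 6, 9, 3, 9, 6, 3, 9, 6, 3, 9]
y = [526, 421, 581, 630, 412, 560, 434, 443, 590, 570, 346, 672]
n = len(x)

ss_xx = sum(xi**2 for xi in x) - sum(x)**2 / n                    # 72.0
ss_xy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n    # 2490.0

b1 = ss_xy / ss_xx               # slope: 34.5833
b0 = sum(y)/n - b1 * sum(x)/n    # intercept: 307.9167
print(round(b1, 4), round(b0, 4))
```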

Example 10.2 – Computation of a Stock Beta


A widely used measure of a company’s performance is their beta. This is a measure of the firm’s
stock price volatility relative to the overall market’s volatility. One common use of beta is in the
capital asset pricing model (CAPM) in finance, but you will hear them quoted on many business
news shows as well. It is computed as (Value Line):

The “beta factor” is derived from a least squares regression analysis between weekly
percent changes in the price of a stock and weekly percent changes in the price of all
stocks in the survey over a period of five years. In the case of shorter price histories, a
smaller period is used, but never less than two years.

In this example, we will compute the stock beta over a 28-week period for Coca-Cola and
Anheuser-Busch, using the S&P500 as ’the market’ for comparison. Note that this period is only
about 10% of the period used by Value Line. Note: While there are 28 weeks of data, there are
only n=27 weekly changes.
Table 29 provides the dates, weekly closing prices, and weekly percent changes of: the S&P500,
Coca-Cola, and Anheuser-Busch. The following summary calculations are also provided, with X
representing the S&P500, YC representing Coca-Cola, and YA representing Anheuser-Busch. All
calculations should be based on 4 decimal places. Figure 11 gives the plot and least squares
regression line for Anheuser-Busch, and Figure 12 gives the plot and least squares regression line
for Coca-Cola.
ΣX = 15.5200   ΣYC = −2.4882   ΣYA = 2.4281

ΣX² = 124.6354   ΣYC² = 461.7296   ΣYA² = 195.4900

ΣXYC = 161.4408   ΣXYA = 84.7527
Closing S&P A-B C-C S&P A-B C-C
Date Price Price Price % Chng % Chng % Chng
05/20/97 829.75 43.00 66.88 – – –
05/27/97 847.03 42.88 68.13 2.08 -0.28 1.87
06/02/97 848.28 42.88 68.50 0.15 0.00 0.54
06/09/97 858.01 41.50 67.75 1.15 -3.22 -1.09
06/16/97 893.27 43.00 71.88 4.11 3.61 6.10
06/23/97 898.70 43.38 71.38 0.61 0.88 -0.70
06/30/97 887.30 42.44 71.00 -1.27 -2.17 -0.53
07/07/97 916.92 43.69 70.75 3.34 2.95 -0.35
07/14/97 916.68 43.75 69.81 -0.03 0.14 -1.33
07/21/97 915.30 45.50 69.25 -0.15 4.00 -0.80
07/28/97 938.79 43.56 70.13 2.57 -4.26 1.27
08/04/97 947.14 43.19 68.63 0.89 -0.85 -2.14
08/11/97 933.54 43.50 62.69 -1.44 0.72 -8.66
08/18/97 900.81 42.06 58.75 -3.51 -3.31 -6.28
08/25/97 923.55 43.38 60.69 2.52 3.14 3.30
09/01/97 899.47 42.63 57.31 -2.61 -1.73 -5.57
09/08/97 929.05 44.31 59.88 3.29 3.94 4.48
09/15/97 923.91 44.00 57.06 -0.55 -0.70 -4.71
09/22/97 950.51 45.81 59.19 2.88 4.11 3.73
09/29/97 945.22 45.13 61.94 -0.56 -1.48 4.65
10/06/97 965.03 44.75 62.38 2.10 -0.84 0.71
10/13/97 966.98 43.63 61.69 0.20 -2.50 -1.11
10/20/97 944.16 42.25 58.50 -2.36 -3.16 -5.17
10/27/97 941.64 40.69 55.50 -0.27 -3.69 -5.13
11/03/97 914.62 39.94 56.63 -2.87 -1.84 2.04
11/10/97 927.51 40.81 57.00 1.41 2.18 0.65
11/17/97 928.35 42.56 57.56 0.09 4.29 0.98
11/24/97 963.09 43.63 63.75 3.74 2.51 10.75

Table 29: Weekly closing stock prices – S&P 500, Anheuser-Busch, Coca-Cola

Figure 11: Plot of weekly percent stock price changes for Anheuser-Busch versus S&P 500 and least
squares regression line


Figure 12: Plot of weekly percent stock price changes for Coca-Cola versus S&P 500 and least
squares regression line
a) Compute SSXX , SSXYC , and SSXYA .

b) Compute the stock betas for Coca-Cola and Anheuser-Busch.
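
One way to carry out parts a) and b) is with a short Python sketch (ours, not from the notes), applying the shortcut formulas to the summary sums above; the slopes of the two least squares lines are the stock betas:

```python
# Sketch of parts (a) and (b), using the summary sums given above (n = 27 changes).
n = 27
sum_x, sum_x2 = 15.5200, 124.6354
sum_yc, sum_xyc = -2.4882, 161.4408   # Coca-Cola
sum_ya, sum_xya = 2.4281, 84.7527     # Anheuser-Busch

ss_xx = sum_x2 - sum_x**2 / n                 # SSXX for the S&P 500 changes
ss_xyc = sum_xyc - sum_x * sum_yc / n         # SSXY for Coca-Cola
ss_xya = sum_xya - sum_x * sum_ya / n         # SSXY for Anheuser-Busch

print(ss_xx, ss_xyc / ss_xx, ss_xya / ss_xx)  # the betas are the slopes b1
```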

Example 10.3 – Estimating Cost Functions of a Hosiery Mill

The following (approximate) data were published by Joel Dean, in the 1941 article: “Statistical
Cost Functions of a Hosiery Mill,” (Studies in Business Administration, vol. 14, no. 3).
Y — Monthly total production cost (in $1000s).
X — Monthly output (in thousands of dozens produced).
A sample of n = 48 months of data were used, with Xi and Yi being measured for each month.
The parameter β1 represents the change in mean cost per unit increase in output (unit variable
cost), and β0 represents the true mean cost when output is 0, without shutting down the plant (fixed
cost). The data are given in Table 30 (the order is arbitrary as the data are printed in table form,
and were obtained from visual inspection/approximation of a plot).

i Xi Yi i Xi Yi i Xi Yi
1 46.75 92.64 17 36.54 91.56 33 32.26 66.71
2 42.18 88.81 18 37.03 84.12 34 30.97 64.37
3 41.86 86.44 19 36.60 81.22 35 28.20 56.09
4 43.29 88.80 20 37.58 83.35 36 24.58 50.25
5 42.12 86.38 21 36.48 82.29 37 20.25 43.65
6 41.78 89.87 22 38.25 80.92 38 17.09 38.01
7 41.47 88.53 23 37.26 76.92 39 14.35 31.40
8 42.21 91.11 24 38.59 78.35 40 13.11 29.45
9 41.03 81.22 25 40.89 74.57 41 9.50 29.02
10 39.84 83.72 26 37.66 71.60 42 9.74 19.05
11 39.15 84.54 27 38.79 65.64 43 9.34 20.36
12 39.20 85.66 28 38.78 62.09 44 7.51 17.68
13 39.52 85.87 29 36.70 61.66 45 8.35 19.23
14 38.05 85.23 30 35.10 77.14 46 6.25 14.92
15 39.16 87.75 31 33.75 75.47 47 5.45 11.44
16 38.59 92.62 32 34.29 70.37 48 3.79 12.69

Table 30: Production costs and Output – Dean (1941)



This dataset has n = 48 observations with a mean output (in 1000s of dozens) of X̄ = 31.0673,
and a mean monthly cost (in $1000s) of Ȳ = 65.4329.

ΣXi = 1491.23   ΣXi² = 54067.42   ΣYi = 3140.78   ΣYi² = 238424.46   ΣXiYi = 113095.80

From these quantities, we get:

• SSXX = ΣXi² − (ΣXi)²/n = 54067.42 − (1491.23)²/48 = 54067.42 − 46328.48 = 7738.94

• SSXY = ΣXiYi − (ΣXi)(ΣYi)/n = 113095.80 − (1491.23)(3140.78)/48 = 113095.80 − 97575.53 = 15520.27

• SSYY = ΣYi² − (ΣYi)²/n = 238424.46 − (3140.78)²/48 = 238424.46 − 205510.40 = 32914.06

b1 = SSXY / SSXX = 15520.27/7738.94 = 2.0055

b0 = Ȳ − b1 X̄ = 65.4329 − (2.0055)(31.0673) = 3.1274

Ŷi = b0 + b1 Xi = 3.1274 + 2.0055Xi    i = 1, . . . , 48

ei = Yi − Ŷi = Yi − (3.1274 + 2.0055Xi )    i = 1, . . . , 48
Table 31 gives the raw data, their fitted values, and residuals.
A plot of the data and regression line are given in Figure 13.


Figure 13: Estimated cost function for hosiery mill (Dean, 1941)
i Xi Yi Ŷi ei
1 46.75 92.64 96.88 -4.24
2 42.18 88.81 87.72 1.09
3 41.86 86.44 87.08 -0.64
4 43.29 88.80 89.95 -1.15
5 42.12 86.38 87.60 -1.22
6 41.78 89.87 86.92 2.95
7 41.47 88.53 86.30 2.23
8 42.21 91.11 87.78 3.33
9 41.03 81.22 85.41 -4.19
10 39.84 83.72 83.03 0.69
11 39.15 84.54 81.64 2.90
12 39.20 85.66 81.74 3.92
13 39.52 85.87 82.38 3.49
14 38.05 85.23 79.44 5.79
15 39.16 87.75 81.66 6.09
16 38.59 92.62 80.52 12.10
17 36.54 91.56 76.41 15.15
18 37.03 84.12 77.39 6.73
19 36.60 81.22 76.53 4.69
20 37.58 83.35 78.49 4.86
21 36.48 82.29 76.29 6.00
22 38.25 80.92 79.84 1.08
23 37.26 76.92 77.85 -0.93
24 38.59 78.35 80.52 -2.17
25 40.89 74.57 85.13 -10.56
26 37.66 71.60 78.65 -7.05
27 38.79 65.64 80.92 -15.28
28 38.78 62.09 80.90 -18.81
29 36.70 61.66 76.73 -15.07
30 35.10 77.14 73.52 3.62
31 33.75 75.47 70.81 4.66
32 34.29 70.37 71.90 -1.53
33 32.26 66.71 67.82 -1.11
34 30.97 64.37 65.24 -0.87
35 28.20 56.09 59.68 -3.59
36 24.58 50.25 52.42 -2.17
37 20.25 43.65 43.74 -0.09
38 17.09 38.01 37.40 0.61
39 14.35 31.40 31.91 -0.51
40 13.11 29.45 29.42 0.03
41 9.50 29.02 22.18 6.84
42 9.74 19.05 22.66 -3.61
43 9.34 20.36 21.86 -1.50
44 7.51 17.68 18.19 -0.51
45 8.35 19.23 19.87 -0.64
46 6.25 14.92 15.66 -0.74
47 5.45 11.44 14.06 -2.62
48 3.79 12.69 10.73 1.96

Table 31: Approximated Monthly Outputs, total costs, fitted values and residuals – Dean (1941)
We have now seen how to estimate β0 and β1 . Next we can obtain an estimate of the variance of
the responses at a given value of X. Recall from your previous statistics course that you estimated the
variance by taking the ‘average’ squared deviation of each measurement from the sample (estimated)
mean. That is, you calculated S² = Σ(Yi − Ȳ)²/(n − 1). Now that we fit the regression model, we no
longer use Ȳ to estimate the mean for each Yi , but rather Ŷi = b0 + b1 Xi to estimate the mean.
The estimate we use now looks similar to the previous estimate, except that we replace Ȳ with Ŷi and
we replace n − 1 with n − 2, since we have estimated 2 parameters, β0 and β1 . The new estimate
(which we will refer to as the residual variance) is:

Se² = MSE = SSE/(n − 2) = Σ(Yi − Ŷi)²/(n − 2) = [SSYY − (SSXY)²/SSXX] / (n − 2).

This estimated variance Se² can be thought of as the ‘average’ squared distance from each observed
response to the fitted line. The word average is in quotes since we divide by n − 2 and not n. The
closer the observed responses fall to the line, the smaller Se² is and the better our predicted values
will be.

Example 10.1 (Continued) – Coffee Sales and Shelf Space


For the coffee data,
Se² = [112774.92 − (2490)²/72] / (12 − 2) = (112774.92 − 86112.5)/10 = 2666.24,

and the estimated residual standard error (deviation) is Se = √2666.24 = 51.64. We now have
estimates for all of the parameters of the regression equation relating the mean weekly sales to the
amount of shelf space the coffee gets in the store. Figure 14 shows the 12 observed responses and
the estimated (fitted) regression equation.


Figure 14: Plot of coffee data and fitted equation

Example 10.3 (Continued) – Estimating Cost Functions of a Hosiery Mill


For the cost function data:
• SSE = Σ(Yi − Ŷi)² = SSYY − (SSXY)²/SSXX = 32914.06 − (15520.27)²/7738.94 = 32914.06 − 31125.55 = 1788.51

• Se² = MSE = SSE/(n − 2) = 1788.51/(48 − 2) = 38.88

• Se = √38.88 = 6.24
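
A small Python sketch (ours, not from the notes) that reproduces both residual standard errors from the corrected sums of squares:

```python
import math

# Residual standard error from the corrected sums of squares, per the formula above.
def residual_sd(ss_yy, ss_xy, ss_xx, n):
    sse = ss_yy - ss_xy**2 / ss_xx      # error sum of squares
    return math.sqrt(sse / (n - 2))     # Se = sqrt(MSE)

print(residual_sd(112774.92, 2490, 72, 12))          # coffee: about 51.64
print(residual_sd(32914.06, 15520.27, 7738.94, 48))  # hosiery: about 6.24
```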
11 Lecture 11 — Simple Regression II — Inferences Concerning
β1
Textbook Sections: 12.5,12.6
Problems: 12.36,39,40,41, Compute 95% CI’s for β1 in these problems.

Recall that in our regression model, we are stating that E(Y |X) = β0 + β1 X. In this model, β1
represents the change in the mean of our response variable Y , as the predictor variable X increases
by 1 unit. Note that if β1 = 0, we have that E(Y |X) = β0 + β1 X = β0 + 0X = β0 , which implies
the mean of our response variable is the same at all values of X. In the context of the coffee sales
example, this would imply that mean sales are the same, regardless of the amount of shelf space, so
a marketer has no reason to purchase extra shelf space. This is like saying that knowing the level
of the predictor variable does not help us predict the response variable.
Under the assumptions stated previously, namely that Y ∼ N (β0 + β1 X, σ), our estimator b1
has a sampling distribution that is normal with mean β1 (the true value of the parameter), and
standard error σ/√(Σ(Xi − X̄)²) = σ/√SSXX. That is:

b1 ∼ N(β1 , σ/√SSXX).
We can now make inferences concerning β1 .

11.1 A Confidence Interval for β1


Recall the general form of a (1 − α)100% confidence interval for a parameter θ. The interval is of
the form:
θ̂ ± zα/2 σθ̂
for large samples, or
θ̂ ± tα/2 Sθ̂
for small samples when we must estimate the parameter’s standard error σθ̂ .
This leads us to the general form of a (1 − α)100% confidence interval for β1 . The interval can
be written:
b1 ± tα/2,n−2 Sb1 ≡ b1 ± tα/2,n−2 (Se /√SSXX).

Note that Se /√SSXX is the estimated standard error of b1 , since we use Se = √MSE to estimate σ.
Also, we have n − 2 degrees of freedom instead of n − 1, since the estimate Se² has 2 estimated
parameters used in it (refer back to how we calculated it above).

Example 11.1 – Coffee Sales and Shelf Space

For the coffee sales example, we have the following results:


b1 = 34.5833, SSXX = 72, Se = 51.64, n = 12.
So a 95% confidence interval for the parameter β1 is:

34.5833 ± t.025,12−2 (51.64/√72) = 34.5833 ± 2.228(6.085) = 34.583 ± 13.558,

which gives us the range (21.025, 48.141). We are 95% confident that the true mean sales increase
by between 21.0 and 48.1 bags of coffee per week for each extra foot of shelf space the brand
gets (within the range of 3 to 9 feet). Note that the entire interval is positive (above 0), so we are
confident that in fact β1 > 0, so the marketer is justified in pursuing extra shelf space.
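
A hedged Python sketch of this interval (ours; it assumes scipy is available for the t quantile):

```python
import math
from scipy import stats   # assumed available

# 95% CI for the slope, as in Example 11.1.
b1, se, ss_xx, n = 34.5833, 51.64, 72.0, 12
s_b1 = se / math.sqrt(ss_xx)             # estimated standard error of b1
t = stats.t.ppf(0.975, n - 2)            # t_{.025,10} = 2.228
print(b1 - t * s_b1, b1 + t * s_b1)      # roughly (21.0, 48.1)
```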

Example 11.2 – Hosiery Mill Cost Function

b1 = 2.0055, SSXX = 7738.94, Se = 6.24, n = 48.

For the hosiery mill cost function analysis, we obtain a 95% confidence interval for average unit
variable costs (β1 ). Note that t.025,48−2 = t.025,46 ≈ 2.015, since t.025,40 = 2.021 and t.025,60 = 2.000
(we could approximate this with z.025 = 1.96 as well).
2.0055 ± t.025,46 (6.24/√7738.94) = 2.0055 ± 2.015(.0709) = 2.0055 ± 0.1429 = (1.8626, 2.1484)
We are 95% confident that the true average unit variable costs are between $1.86 and $2.15 (this
is the incremental cost of increasing production by one unit, assuming that the production process
is in place).

11.2 Hypothesis Tests Concerning β1


Similar to the idea of the confidence interval, we can set up a test of hypothesis concerning β1 .
Since the confidence interval gives us the range of ‘believable’ values for β1 , it is more useful than
a test of hypothesis. However, here is the procedure to test if β1 is equal to some value, say β10 .

• H0 : β1 = β10 (β10 specified, usually 0)

• (1) Ha : β1 ≠ β10
  (2) Ha : β1 > β10
  (3) Ha : β1 < β10

• T.S.: tobs = (b1 − β10 )/(Se /√SSXX)

• (1) RR : |tobs | ≥ tα/2,n−2


(2) RR : tobs ≥ tα,n−2
(3) RR : tobs ≤ −tα,n−2

• (1) P –value: 2 · P (t ≥ |tobs |)


(2) P –value: P (t ≥ tobs )
(3) P –value: P (t ≤ tobs )
Using tables, we can only place bounds on these p–values.
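
Statistical software reports these p-values exactly; here is a minimal Python sketch of the test statistic (ours, assuming scipy is available):

```python
import math
from scipy import stats   # assumed available

# Slope t-test; software gives the exact p-values the tables only bound.
def slope_test(b1, beta10, se, ss_xx, n):
    t_obs = (b1 - beta10) / (se / math.sqrt(ss_xx))
    p_two_sided = 2 * stats.t.sf(abs(t_obs), n - 2)
    return t_obs, p_two_sided             # halve p for a one-sided test

print(slope_test(34.5833, 20, 51.64, 72, 12))  # the coffee test of Example 11.1 below
```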
Example 11.1 (Continued) – Coffee Sales and Shelf Space
Suppose in our coffee example, the marketer gets a set amount of space (say 6′) for free, and
she must pay extra for any more space. For the extra space to be profitable (over the long run),
the mean weekly sales must increase by more than 20 bags, otherwise the expense outweighs the
increase in sales. She wants to test to see if it is worth it to buy more space. She works under the
assumption that it is not worth it, and will only purchase more if she can show that it is worth it.
She sets α = .05.
1. H0 : β1 = 20 HA : β1 > 20
2. T.S.: tobs = (34.5833 − 20)/(51.64/√72) = 14.5833/6.085 = 2.397

3. R.R.: tobs > t.05,10 = 1.812


4. p-value: P (T > 2.397) < P (T > 2.228) = .025 and P (T > 2.397) > P (T > 2.764) = .010, so
.01 < p − value < .025.
So, she has concluded that β1 > 20, and she will purchase the shelf space. Note also that the entire
confidence interval was over 20, so we already knew this.

Example 11.2 (Continued) – Hosiery Mill Cost Function

Suppose we want to test whether average monthly production costs increase with monthly
production output. This is testing whether unit variable costs are positive (α = 0.05).
• H0 : β1 = 0 (Mean Monthly production cost is not associated with output)
• HA : β1 > 0 (Mean monthly production cost increases with output)
• T.S.: tobs = (2.0055 − 0)/(6.24/√7738.94) = 2.0055/0.0709 = 28.29

• RR : tobs > t0.05,46 ≈ 1.680 (or use z0.05 = 1.645)


• p-value: P (T > 28.29) ≈ 0
We have overwhelming evidence of positive unit variable costs.

11.3 The Analysis of Variance Approach to Regression


Consider the deviations of the individual responses, Yi , from their overall mean Ȳ . We would
like to break these deviations into two parts, the deviation of the observed value from its fitted
value, Ŷi = b0 + b1 Xi , and the deviation of the fitted value from the overall mean. See Figure 15
corresponding to the coffee sales example. That is, we’d like to write:
Yi − Ȳ = (Yi − Ŷi ) + (Ŷi − Ȳ ).

Note that all we are doing is adding and subtracting the fitted value. It so happens that
algebraically we can show the same equality holds once we’ve squared each side of the equation
and summed it over the n observed and fitted values. That is,
Σ(Yi − Ȳ)² = Σ(Yi − Ŷi)² + Σ(Ŷi − Ȳ)².

Figure 15: Plot of coffee data, fitted equation, and the line Ȳ = 515.4167

These three pieces are called the total, error, and model sums of squares, respectively. We
denote them as SSYY , SSE, and SSR, respectively. We have already seen that SSYY represents the
total variation in the observed responses, and that SSE represents the variation in the observed
responses around the fitted regression equation. That leaves SSR as the amount of the total
variation that is ‘accounted for’ by taking into account the predictor variable X. We can use
this decomposition to test the hypothesis H0 : β1 = 0 vs HA : β1 ≠ 0. We will also find this
decomposition useful in subsequent sections when we have more than one predictor variable. We
first set up the Analysis of Variance (ANOVA) Table in Table 32. Note that we will have to
make minimal calculations to set this up since we have already computed SSyy and SSE in the
regression analysis.

ANOVA
Source of   Sum of               Degrees of   Mean
Variation   Squares              Freedom      Square              F
MODEL       SSR = Σ(Ŷi − Ȳ)²    1            MSR = SSR/1         F = MSR/MSE
ERROR       SSE = Σ(Yi − Ŷi)²   n − 2        MSE = SSE/(n − 2)
TOTAL       SSYY = Σ(Yi − Ȳ)²   n − 1

Table 32: The Analysis of Variance Table for simple regression

The procedure of testing for a linear association between the response and predictor variables
using the analysis of variance involves using the F –distribution, which is given in Table A.7 (pp
A-16–A-25) of your text book. This is the same distribution we used in the previous chapter.
The testing procedure is as follows:
1. H0 : β1 = 0   HA : β1 ≠ 0 (This will always be a 2–sided test)

2. T.S.: Fobs = MSR/MSE

3. R.R.: Fobs > F1,n−2,α


4. p-value: P (F > Fobs ) (You can only get bounds on this, but computer outputs report them
exactly)
Note that we already have a procedure for testing this hypothesis (see the section on Inferences
Concerning β1 ), but this is an important lead–in to multiple regression.
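
As a sketch of how software builds this table and reports the exact p-value (ours, assuming scipy is available):

```python
from scipy import stats   # assumed available for F tail probabilities

# Simple-regression ANOVA quantities from SSYY and SSE.
def anova_f(ss_yy, sse, n):
    ssr = ss_yy - sse
    msr, mse = ssr / 1, sse / (n - 2)
    f_obs = msr / mse
    return f_obs, stats.f.sf(f_obs, 1, n - 2)   # exact p-value, as software reports

print(anova_f(112774.9, 26662.4, 12))   # coffee: F about 32.3, p about .0002
```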

Example 11.1 (Continued) – Coffee Sales and Shelf Space

Referring back to the coffee sales data, we have already made the following calculations:

SSYY = 112774.9, SSE = 26662.4, n = 12.

We then also have that SSR = SSYY − SSE = 86112.5. Then the Analysis of Variance is given in
Table 33.
ANOVA
Source of   Sum of             Degrees of    Mean
Variation   Squares            Freedom       Square                        F
MODEL       SSR = 86112.5      1             MSR = 86112.5/1 = 86112.5     F = 86112.5/2666.24 = 32.30
ERROR       SSE = 26662.4      12 − 2 = 10   MSE = 26662.4/10 = 2666.24
TOTAL       SSYY = 112774.9    12 − 1 = 11

Table 33: The Analysis of Variance Table for the coffee data example

To test the hypothesis of no linear association between amount of shelf space and mean weekly
coffee sales, we can use the F -test described above. Note that the null hypothesis is that there is
no effect on mean sales from increasing the amount of shelf space. We will use α = .01.
1. H0 : β1 = 0   HA : β1 ≠ 0

2. T.S.: Fobs = MSR/MSE = 86112.5/2666.24 = 32.30

3. R.R.: Fobs > F1,n−2,α = F1,10,.01 = 10.04

4. p-value: P (F > Fobs ) = P (F > 32.30) < P (F > 12.83) = .005 (p-value < .005). See p. A-24.
We reject the null hypothesis, and conclude that β1 ≠ 0. There is an effect on mean weekly sales
when we increase the shelf space.

Example 11.2 (Continued) – Hosiery Mill Cost Function

For the hosiery mill data, the sums of squares for each source of variation in monthly production
costs and their corresponding degrees of freedom are (from previous calculations):
Total SS – SSYY = Σ(Yi − Ȳ)² = 32914.06   dfTotal = n − 1 = 47

Error SS – SSE = Σ(Yi − Ŷi)² = 1788.51   dfE = n − 2 = 46

Model SS – SSR = Σ(Ŷi − Ȳ)² = SSYY − SSE = 32914.06 − 1788.51 = 31125.55   dfR = 1
The Analysis of Variance is given in Table 34.
To test whether there is a linear association between mean monthly costs and monthly produc-
tion output, we conduct the F -test (α = 0.05).
ANOVA
Source of   Sum of             Degrees of    Mean
Variation   Squares            Freedom       Square                          F
MODEL       SSR = 31125.55     1             MSR = 31125.55/1 = 31125.55     F = 31125.55/38.88 = 800.55
ERROR       SSE = 1788.51      48 − 2 = 46   MSE = 1788.51/46 = 38.88
TOTAL       SSYY = 32914.06    48 − 1 = 47

Table 34: The Analysis of Variance Table for the hosiery mill cost example

1. H0 : β1 = 0   HA : β1 ≠ 0

2. T.S.: Fobs = MSR/MSE = 31125.55/38.88 = 800.55

3. R.R.: Fobs > F1,n−2,α = F1,46,.05 ≈ 4.06

4. p-value: P (F > Fobs ) = P (F > 800.55) ≪ P (F > 8.83) = .005 (p-value ≪ .005). See p. A-24
(with 40 denominator df).

We reject the null hypothesis, and conclude that β1 ≠ 0.

11.3.1 Coefficient of Determination


A measure of association that has a clear physical interpretation is r 2 , the coefficient of deter-
mination. This measure is always between 0 and 1, so it does not reflect whether Y and X are
positively or negatively associated, and it represents the proportion of the total variation in the
response variable that is ‘accounted’ for by fitting the regression on X. The formula for r 2 is:
r² = (r)² = 1 − SSE/SSYY = SSR/SSYY.

Note that SSYY = Σ(Yi − Ȳ)² represents the total variation in the response variable, while
SSE = Σ(Yi − Ŷi)² represents the variation in the observed responses about the fitted equation
(after taking into account X). This is why we sometimes say that r² is the “proportion of the variation
in Y that is ‘explained’ by X.”

Example 11.1 (Continued) – Coffee Sales and Shelf Space

For the coffee data, we can calculate r² using the values of SSXY , SSXX , SSYY , and SSE we
have previously obtained.

r² = 1 − 26662.4/112774.9 = 86112.5/112774.9 = .7636
Thus, over 3/4 of the variation in sales is “explained” by the model using shelf space to predict
sales.

Example 11.2 (Continued) – Hosiery Mill Cost Function


For the hosiery mill data, the model (regression) sum of squares is SSR = 31125.55 and the
total sum of squares is SSYY = 32914.06. To get the coefficient of determination:

r² = 31125.55/32914.06 = 0.9457
Almost 95% of the variation in monthly production costs is “explained” by the monthly production
output.
12 Lecture 12 — Simple Regression III – Estimating the Mean
and Prediction at a Particular Level of X, Correlation
Textbook Sections: 12.7,12.8,12.9
Problems: 12.42,43,Compute r for problems in Section 12.2
We sometimes are interested in estimating the mean response at a particular level of the pre-
dictor variable, say X = X0 . That is, we’d like to estimate E(Y |X0 ) = β0 + β1 X0 . The actual
estimate is just Ŷ0 = b0 + b1 X0 , which is simply where the fitted line crosses X = X0 . Under
the previously stated normality assumptions, the estimator Ŷ0 is normally distributed with mean
β0 + β1 X0 and standard error of estimate σ√(1/n + (X0 − X̄)²/Σ(Xi − X̄)²). That is:

Ŷ0 ∼ N( β0 + β1 X0 , σ√(1/n + (X0 − X̄)²/SSXX) ).

Note that the standard error of the estimate is smallest at X0 = X̄, that is, at the mean of the
sampled levels of the predictor variable. The standard error increases as the value X0 goes away
from this mean.
For instance, our marketer may wish to estimate the mean sales when she has 6′ of shelf space,
or 7′, or 4′. She may also wish to obtain a confidence interval for the mean at these levels of X.

12.1 A Confidence Interval for E(Y |X0 ) = β0 + β1 X0


Using the ideas described in the previous section, we can write out the general form for a (1−α)100%
confidence interval for the mean response when X = X0 .
(b0 + b1 X0 ) ± tα/2,n−2 Se √(1/n + (X0 − X̄)²/SSXX)
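
A minimal Python sketch of this interval (ours; scipy assumed available for the t quantile):

```python
import math
from scipy import stats   # assumed available

# CI for the mean response at X0, a direct sketch of the formula above.
def mean_ci(b0, b1, x0, se, xbar, ss_xx, n, conf=0.90):
    t = stats.t.ppf(1 - (1 - conf) / 2, n - 2)
    hw = t * se * math.sqrt(1 / n + (x0 - xbar) ** 2 / ss_xx)
    yhat = b0 + b1 * x0
    return yhat - hw, yhat + hw

print(mean_ci(307.917, 34.5833, 4, 51.64, 6.0, 72, 12))   # about (411.4, 481.1)
```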

Example 12.1 – Coffee Sales and Shelf Space

Suppose our marketer wants to compute 90% confidence intervals for the mean weekly sales at
X=4,6, and 7 feet, respectively (these are not simultaneous confidence intervals as were computed
based on Tukey’s Method previously). Each of these intervals will depend on tα/2,n−2 = t.05,10 =
1.812 and X̄ = 6. These intervals are:

(307.917 + 34.5833(4)) ± 1.812(51.64)√(1/12 + (4 − 6)²/72) = 446.250 ± 93.57√.1389
= 446.250 ± 34.87 ≡ (411.38, 481.12)

(307.917 + 34.5833(6)) ± 1.812(51.64)√(1/12 + (6 − 6)²/72) = 515.417 ± 93.57√.0833
= 515.417 ± 27.01 ≡ (488.41, 542.43)

(307.917 + 34.5833(7)) ± 1.812(51.64)√(1/12 + (7 − 6)²/72) = 550.000 ± 93.57√.0972
= 550.000 ± 29.17 ≡ (520.83, 579.17)
Notice that the interval is the narrowest at X0 = 6. Figure 16 is a computer generated plot
of the data, the fitted equation and the confidence limits for the mean weekly coffee sales at each
value of X. Note how the limits get wider as X0 moves away from X̄ = 6. Would these intervals be
wider or narrower, had they been 95% confidence intervals?

Figure 16: Plot of coffee data, fitted equation, and 90% confidence limits for the mean

Example 12.2 – Hosiery Mill Cost Function

Suppose the plant manager is interested in mean costs among months where output is 30,000
items produced (X0 = 30). She wants a 95% confidence interval for this true unknown mean.
Recall:

b0 = 3.1274   b1 = 2.0055   Se = 6.24   n = 48   X̄ = 31.0673   SSXX = 7738.94   t.025,46 ≈ 2.015

Then the interval is obtained as:


3.1274 + 2.0055(30) ± 2.015(6.24)√(1/48 + (30 − 31.0673)²/7738.94)

≡ 63.29 ± 2.015(6.24)√0.0210 ≡ 63.29 ± 1.82 ≡ (61.47, 65.11)
We can be 95% confident that the mean production costs among months where 30,000 items are
produced is between $61,470 and $65,110 (recall X is in thousands of dozens and Y is in $1000s).
A plot of the data, regression line, and 95% confidence bands for mean costs is given in Figure 17.

12.2 Predicting a Future Response at a Given Level of X


In many situations, a researcher would like to predict the outcome of the response variable at a
specific level of the predictor variable. In the previous section we estimated the mean response,
in this section we are interested in predicting a single outcome. In the context of the coffee sales
example, this would be like trying to predict next week’s sales given we know that we will have 6′
of shelf space.

Figure 17: Plot of hosiery mill cost data, fitted equation, and 95% confidence limits for the mean
First, suppose you know the parameters β0 and β1 . Then you know that the response variable,
for a fixed level of the predictor variable (X = X0 ), is normally distributed with mean E(Y |X0 ) =
β0 + β1 X0 and standard deviation σ. We know from previous work with the normal distribution
that approximately 95% of the measurements lie within 2 standard deviations of the mean. So if we
know β0 , β1 , and σ, we would be very confident that our response would lie between (β0 +β1 X0 )−2σ
and (β0 + β1 X0 ) + 2σ. Figure 18 represents this idea.

Figure 18: Distribution of response variable with known β0 , β1 , and σ

We rarely, if ever, know these parameters, and we must estimate them as we have in previous
sections. There is uncertainty as to what the mean response is at the specified level X0 of the
predictor variable. We do, however, know how to obtain an interval that we are very confident contains the
true mean β0 + β1 X0 . If we apply the method of the previous paragraph to all ‘believable’ values of
this mean we can obtain a prediction interval that we are very confident will contain our future
response. Since σ is being estimated as well, instead of 2 standard deviations, we must use tα/2,n−2
estimated standard deviations. Figure 19 portrays this idea.

Figure 19: Distribution of response variable with estimated β0 , β1 , and σ

Note that all we really need are the two extreme distributions from the confidence interval for
the mean response. If we use the method from the last paragraph on each of these two distributions,
we can obtain the prediction interval by choosing the left–hand point of the ‘lower’ distribution
and the right–hand point of the ‘upper’ distribution. This is displayed in Figure 20.

Figure 20: Upper and lower prediction limits when we have estimated the mean

The general formula for a (1 − α)100% prediction interval of a future response is similar to the
confidence interval for the mean at X0 , except that it is wider to reflect the variation in individual
responses. The formula is:
(b0 + b1 X0 ) ± tα/2,n−2 Se √(1 + 1/n + (X0 − X̄)²/SSXX).
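
A minimal Python sketch of the prediction interval (ours; scipy assumed), differing from the interval for the mean only in the extra “1 +” under the square root:

```python
import math
from scipy import stats   # assumed available

# Prediction interval for a single future response at X0.
def pred_int(b0, b1, x0, se, xbar, ss_xx, n, conf=0.95):
    t = stats.t.ppf(1 - (1 - conf) / 2, n - 2)
    hw = t * se * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / ss_xx)
    y0 = b0 + b1 * x0
    return y0 - hw, y0 + hw

print(pred_int(307.917, 34.5833, 5, 51.64, 6.0, 72, 12))  # about (360, 601)
```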

Example 12.1 (Continued) – Coffee Sales and Shelf Space

For the coffee example, suppose the marketer wishes to predict next week’s sales when the coffee
will have 5′ of shelf space. She would like to obtain a 95% prediction interval for the number of
bags to be sold. First, we observe that t.025,10 = 2.228; all other relevant numbers can be found in
the previous example. The prediction interval is then:

(307.917 + 34.5833(5)) ± 2.228(51.64)√(1 + 1/12 + (5 − 6)²/72) = 480.833 ± 115.05√1.0972

= 480.833 ± 120.51 ≡ (360.32, 601.34).
This interval is relatively wide, reflecting the large variation in weekly sales at each level of X. Note
that just as the width of the confidence interval for the mean response depends on the distance
between X0 and X̄, so does the width of the prediction interval. This should be of no surprise,
considering the way we set up the prediction interval (see Figure 19 and Figure 20). Figure 21
shows the fitted equation and 95% prediction limits for this example.
It must be noted that a prediction interval for a future response is only valid if conditions when
the response occurs are similar to those when the data were collected. For instance, if the store is being
boycotted by a bunch of animal rights activists for selling meat next week, our prediction interval
will not be valid.

Figure 21: Plot of coffee data, fitted equation, and 95% prediction limits for a single response

Example 12.2 (Continued) – Hosiery Mill Cost Function

Suppose the plant manager knows based on purchase orders that this month, her plant will
produce 30,000 items (X0 = 30.0). She would like to predict what the plant’s production costs will
be. She obtains a 95% prediction interval for this month’s costs.
3.1274 + 2.0055(30) ± 2.015(6.24)√(1 + 1/48 + (30 − 31.0673)²/7738.94) ≡ 63.29 ± 2.015(6.24)√1.0210

≡ 63.29 ± 12.70 ≡ (50.59, 75.99)
She predicts that the costs for this month will be between $50,590 and $75,990. This interval is
much wider than the interval for the mean, since it includes random variation in monthly costs
around the mean. A plot of the 95% prediction bands is given in Figure 22.

12.3 Coefficient of Correlation


In many situations, we would like to obtain a measure of the strength of the linear association
between the variables Y and X. One measure of this association that is reported in research
journals from many fields is the Pearson product moment coefficient of correlation. This measure,
denoted by r, is a number that can range from -1 to +1. A value of r close to 0 implies that there
is very little association between the two variables (Y tends to neither increase nor decrease as X
increases). A positive value of r means there is a positive association between Y and X (Y tends
to increase as X increases). Similarly, a negative value means there is a negative association (Y
tends to decrease as X increases). If r is either +1 or -1, it means the data fall on a straight line
(SSE = 0) that has either a positive or negative slope, depending on the sign of r. The formula
for calculating r is:

r = SSXY / √(SSXX · SSYY).
Note that the sign of r is always the same as the sign of b1 .
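
A one-function Python sketch of this formula (ours), checked against the two worked examples that follow:

```python
import math

# Pearson correlation from the corrected sums of squares (same sign as b1).
def pearson_r(ss_xy, ss_xx, ss_yy):
    return ss_xy / math.sqrt(ss_xx * ss_yy)

print(pearson_r(2490, 72, 112774.9))           # coffee: about .874
print(pearson_r(15520.27, 7738.94, 32914.06))  # hosiery: about .9725
```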

Example 12.1 (Continued) – Coffee Sales and Shelf Space



Figure 22: Plot of hosiery mill cost data, fitted equation, and 95% prediction limits for an individual
outcome
For the coffee data, we can calculate r using the values of SSXY , SSXX , and SSYY we have previously
obtained.

r = 2490/√((72)(112774.9)) = 2490/2849.5 = .8738

Example 12.2 (Continued) – Hosiery Mill Cost Function

For the hosiery mill cost function data, we have:


r = 15520.27/√((7738.94)(32914.06)) = 15520.27/15959.95 = .9725
Computer Output for Coffee Sales Example (Sec 12.8)

Dependent Variable: SALES


Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Prob>F
Model 1 86112.50000 86112.50000 32.297 0.0002
Error 10 26662.41667 2666.24167
C Total 11 112774.91667

Root MSE 51.63566 R-square 0.7636


Dep Mean 515.41667 Adj R-sq 0.7399

Parameter Estimates

Parameter Standard T for H0:


Variable DF Estimate Error Parameter=0 Prob > |T|
INTERCEP 1 307.916667 39.43738884 7.808 0.0001
SPACE 1 34.583333 6.08532121 5.683 0.0002

Dep Var Predict Std Err Lower95% Upper95% Lower95%


Obs SALES Value Predict Mean Mean Predict
1 421.0 411.7 23.568 359.2 464.2 285.2
2 412.0 411.7 23.568 359.2 464.2 285.2
3 443.0 411.7 23.568 359.2 464.2 285.2
4 346.0 411.7 23.568 359.2 464.2 285.2
5 526.0 515.4 14.906 482.2 548.6 395.7
6 581.0 515.4 14.906 482.2 548.6 395.7
7 434.0 515.4 14.906 482.2 548.6 395.7
8 570.0 515.4 14.906 482.2 548.6 395.7
9 630.0 619.2 23.568 566.7 671.7 492.7
10 560.0 619.2 23.568 566.7 671.7 492.7
11 590.0 619.2 23.568 566.7 671.7 492.7
12 672.0 619.2 23.568 566.7 671.7 492.7

Upper95% Upper95%
Obs Predict Residual Obs Predict Residual
1 538.1 9.3333 7 635.2 -81.4167
2 538.1 0.3333 8 635.2 54.5833
3 538.1 31.3333 9 745.6 10.8333
4 538.1 -65.6667 10 745.6 -59.1667
5 635.2 10.5833 11 745.6 -29.1667
6 635.2 65.5833 12 745.6 52.8333
13 Lecture 13 — Multiple Regression I
Textbook Sections: 13.1,13.2
Problems: 13.3,5,7,8,9,13

In most situations, we have more than one independent variable. While the amount of math
can become overwhelming and involves matrix algebra, many computer packages exist that will
provide the analysis for you. In this chapter, we will analyze the data by interpreting the results
of a computer program. It should be noted that simple regression is a special case of multiple
regression, so most concepts we have already seen apply here.

13.1 The Multiple Regression Model and Least Squares Estimates


In general, if we have k predictor variables, we can write our response variable as:
Y = β0 + β1 X1 + · · · + βk Xk + ε.
Again, Y is broken into a systematic component, β0 + β1 X1 + · · · + βk Xk , and a random
component, ε.

We make the same assumptions as before in terms of ε, specifically that they are indepen-
dent and normally distributed with mean 0 and standard deviation σ. That is, we are assuming
that Y , at a given set of levels of the k independent variables (X1 , . . . , Xk ) is normal with mean
E[Y |X1 , . . . , Xk ] = β0 + β1 X1 + · · · + βk Xk and standard deviation σ. Just as before, β0 , β1 , . . . , βk ,
and σ are unknown parameters that must be estimated from the sample data. The parameters βi
represent the change in the mean response when the ith predictor variable changes by 1 unit and
all other predictor variables are held constant.
In this model:
• Y — Random outcome of the dependent variable
• β0 — Regression constant (E(Y |X1 = · · · = Xk = 0) if appropriate)
• βi — Partial regression coefficient for variable Xi (Change in E(Y ) when Xi increases by 1
unit and all other X’s are held constant)
• ε — Random error term, assumed (as before) that ε ∼ N (0, σ)
• k — The number of independent variables
By the method of least squares (choosing the bi values that minimize SSE = Σ(Yi − Ŷi)²),
we obtain the fitted equation:

Ŷ = b0 + b1 X1 + b2 X2 + · · · + bk Xk

and our estimate of σ:

Se = √(Σ(Y − Ŷ)²/(n − k − 1)) = √(SSE/(n − k − 1))
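
In matrix form this is exactly what software does; here is a minimal Python sketch (ours, assuming numpy is available) that fits any number of predictors, checked against the coffee fit:

```python
import numpy as np   # assumed available

# Least squares with a design matrix; handles any number of predictors.
def fit_ols(X, y):
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])            # prepend intercept column
    b, sse, *_ = np.linalg.lstsq(A, y, rcond=None)  # b = (b0, b1, ..., bk)
    s_e = float(np.sqrt(sse[0] / (n - k - 1)))      # residual standard error
    return b, s_e

# Sanity check against the simple-regression coffee fit (k = 1):
x = np.array([6, 3, 6, 9, 3, 9, 6, 3, 9, 6, 3, 9], dtype=float).reshape(-1, 1)
y = np.array([526, 421, 581, 630, 412, 560, 434, 443, 590, 570, 346, 672], dtype=float)
print(fit_ols(x, y))   # b about (307.92, 34.58), s_e about 51.64
```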
The Analysis of Variance table will be very similar to what we used previously, with the only
adjustments being in the degrees of freedom. Table 35 shows the values for the general case when
there are k predictor variables. We will rely on computer outputs to obtain the Analysis of Variance
and the estimates b0 , b1 , . . . , bk .
ANOVA
Source of   Sum of               Degrees of    Mean
Variation   Squares              Freedom       Square                  F
MODEL       SSR = Σ(Ŷi − Ȳ)²    k             MSR = SSR/k             F = MSR/MSE
ERROR       SSE = Σ(Yi − Ŷi)²   n − k − 1     MSE = SSE/(n − k − 1)
TOTAL       SSYY = Σ(Yi − Ȳ)²   n − 1

Table 35: The Analysis of Variance Table for multiple regression

13.2 Testing for Association Between the Response and the Full Set of Predictor
Variables
To see if the set of predictor variables is useful in predicting the response variable, we will test
H0 : β1 = β2 = . . . = βk = 0. Note that if H0 is true, then the mean response does not depend
on the levels of the predictor variables. We interpret this to mean that there is no association
between the response variable and the set of predictor variables. To test this hypothesis, we use
the following method:

1. H0 : β1 = β2 = · · · = βk = 0

2. HA : Not every βi = 0
3. T.S.: Fobs = MSR/MSE

4. R.R.: Fobs > Fα,k,n−k−1

5. p-value: P (F > Fobs ) (You can only get bounds on this, but computer outputs report them
exactly)

The computer automatically performs this test and provides you with the p-value of the test, so
in practice you really don’t need to obtain the rejection region explicitly to make the appropriate
conclusion. However, we will do so in this course to help reinforce the relationship between the
test’s decision rule and the p-value. Recall that we reject the null hypothesis if the p-value is less
than α.

13.3 Testing Whether Individual Predictor Variables Help Predict the Re-
sponse
If we reject the previous null hypothesis and conclude that not all of the βi are zero, we may wish
to test whether individual βi are zero. Note that if we fail to reject the null hypothesis that βi is
zero, we can drop the predictor Xi from our model, thus simplifying the model. Note that this
test is testing whether Xi is useful given that we are already fitting a model containing
the remaining k − 1 predictor variables. That is, does this variable contribute anything once
we’ve taken into account the other predictor variables. These tests are t-tests, where we compute
tobs = bi /Sbi , just as we did in the section on making inferences concerning β1 in simple regression. The
procedure for testing whether βi = 0 (the ith predictor variable does not contribute to predicting
the response given the other k − 1 predictor variables are in the model) is as follows:

• H0 : βi = 0 (Y is not associated with Xi after controlling for all other independent variables)
• (1) HA : βi ≠ 0
  (2) HA : βi > 0
  (3) HA : βi < 0

• T.S.: tobs = bi /Sbi

• R.R.: (1) |tobs | > tα/2,n−k−1


(2) tobs > tα,n−k−1
(3) tobs < −tα,n−k−1

• (1) p–value: 2P (T > |tobs |)


(2) p–value: P (T > tobs )
(3) p–value: P (T < tobs )

Computer packages print the test statistic and the p-value based on the two-sided test, so to
conduct this test is simply a matter of interpreting the results of the computer output.

13.4 Testing for an Association Between a Subset of Predictor Variables and the Response
We have seen the two extreme cases of testing whether all regression coefficients are simultaneously
0 (the F -test), and the case of testing whether a single regression coefficient is 0, controlling for
all other predictors (the t-test). We can also test whether a subset of the k regression coefficients
are 0, controlling for all other predictors. Note that the two extreme cases can be tested using this
very general procedure.
To make the notation as simple as possible, suppose our model consists of k predictor vari-
ables, of which we’d like to test whether q (q ≤ k) are simultaneously not associated with Y ,
after controlling for the remaining k − q predictor variables. Further assume that the k − q re-
maining predictors are labelled X1 , X2 , . . . , Xk−q and that the q predictors of interest are labelled
Xk−q+1 , Xk−q+2 , . . . , Xk .
This test is of the form:

H0 : βk−q+1 = βk−q+2 = · · · = βk = 0   HA : βk−q+1 ≠ 0 and/or βk−q+2 ≠ 0 and/or . . . and/or βk ≠ 0

The procedure for obtaining the numeric elements of the test is as follows:

1. Fit the model under the null hypothesis (βk−q+1 = βk−q+2 = · · · = βk = 0). It will include
only the first k − q predictor variables. This is referred to as the Reduced model. Obtain
the error sum of squares (SSE(R)) and the error degrees of freedom dfE (R) = n − (k − q) − 1.

2. Fit the model with all k predictors. This is referred to as the Complete or Full model
(and was used for the F -test for all regression coefficients). Obtain the error sum of squares
(SSE(F )) and the error degrees of freedom (dfE (F ) = n − k − 1).

By definition of the least squares criterion, we know that SSE(R) ≥ SSE(F ). We now obtain the
test statistic:
T.S.: Fobs = [(SSE(R) − SSE(F))/((n − (k − q) − 1) − (n − k − 1))] / [SSE(F)/(n − k − 1)]
           = [(SSE(R) − SSE(F))/q] / MSE(F)
and our rejection region is values of Fobs ≥ Fα,q,n−k−1 .
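
A minimal Python sketch of this procedure (ours; scipy assumed), applied to the two worked examples that follow:

```python
from scipy import stats   # assumed available

# Partial (reduced vs. full model) F-test, per the procedure above.
def partial_f(sse_r, sse_f, q, n, k):
    f_obs = ((sse_r - sse_f) / q) / (sse_f / (n - k - 1))
    return f_obs, stats.f.sf(f_obs, q, n - k - 1)

print(partial_f(60.935, 7.609, 2, 16, 3))      # Texas example below: F about 42.1
print(partial_f(0.11593, 0.10980, 3, 18, 6))   # mortgage example below: F about 0.2
```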

Example 13.1 – Texas Weather Data


In this example, we will use regression in the context of predicting an outcome. A construction
company is making a bid on a project in a remote area of Texas. A certain component of the project
will take place in December, and is very sensitive to the daily high temperatures. They would like
to estimate what the average high temperature will be at the location in December. They believe
that temperature at a location will depend on its latitude (measure of distance from the equator)
and its elevation. That is, they believe that the response variable (mean daily high temperature in
December at a particular location) can be written as:

Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε,

where X1 is the latitude of the location, X2 is the longitude, and X3 is its elevation (in feet). As
before, we assume that ε ∼ N (0, σ). Note that higher latitudes mean farther north and higher
longitudes mean farther west.
To estimate the parameters β0 , β1 , β2 , β3 , and σ, they gather data for a sample of n = 16
counties and fit the model described above. The data, including one other variable, are given in
Table 36.
COUNTY LATITUDE LONGITUDE ELEV TEMP INCOME
HARRIS 29.767 95.367 41 56 24322
DALLAS 32.850 96.850 440 48 21870
KENNEDY 26.933 97.800 25 60 11384
MIDLAND 31.950 102.183 2851 46 24322
DEAF SMITH 34.800 102.467 3840 38 16375
KNOX 33.450 99.633 1461 46 14595
MAVERICK 28.700 100.483 815 53 10623
NOLAN 32.450 100.533 2380 46 16486
ELPASO 31.800 106.40 3918 44 15366
COLLINGTON 34.850 100.217 2040 41 13765
PECOS 30.867 102.900 3000 47 17717
SHERMAN 36.350 102.083 3693 36 19036
TRAVIS 30.300 97.700 597 52 20514
ZAPATA 26.900 99.283 315 60 11523
LASALLE 28.450 99.217 459 56 10563
CAMERON 25.900 97.433 19 62 12931

Table 36: Data corresponding to 16 counties in Texas

The results of the Analysis of Variance are given in Table 37 and the parameter estimates,
estimated standard errors, t-statistics and p-values are given in Table 38. Full computer programs
and printouts are given as well.
We see from the Analysis of Variance that at least one of the variables (latitude, longitude,
elevation) is related to the response variable temperature. This can be seen by setting up the test H0 : β1 =
β2 = β3 = 0 as described previously. The elements of this test, provided by the computer output,
are detailed below, assuming α = .05.
1. H0 : β1 = β2 = β3 = 0
ANOVA
Source of   Sum of            Degrees of          Mean
Variation   Squares           Freedom             Square               F                     p-value
MODEL       SSR = 934.328     k = 3               MSR = 934.328/3      F = 311.443/0.634     .0001
                                                  = 311.443            = 491.235
ERROR       SSE = 7.609       n − k − 1 =         MSE = 7.609/12
                              16 − 3 − 1 = 12     = 0.634
TOTAL       SSYY = 941.938    n − 1 = 15

Table 37: The Analysis of Variance Table for Texas data

PARAMETER          ESTIMATE           t FOR H0: βi = 0   P-VALUE   STANDARD ERROR OF ESTIMATE
INTERCEPT (β0)     b0 = 109.25887     36.68              .0001     2.97857
LATITUDE (β1)      b1 = −1.99323      −14.61             .0001     0.13639
LONGITUDE (β2)     b2 = −0.38471      −1.68              .1182     0.22858
ELEVATION (β3)     b3 = −0.00096      −1.68              .1181     0.00057

Table 38: Parameter estimates and tests of hypotheses for individual parameters

2. HA : Not all βi = 0
3. T.S.: Fobs = MSR/MSE = 311.443/0.634 = 491.235

4. R.R.: Fobs > F3,12,.05 = 3.49 (This is not provided on the output; the p-value takes the place
of it).

5. p-value: P (F > 491.235) = .0001 (Actually it is less than .0001, but this is the smallest p-value
the computer will print).
We conclude that at least one of these three variables is related to the response
variable temperature.
We also see from the individual t-tests that latitude is useful in predicting temperature, even
after taking into account the other predictor variables.
The formal test (based on α = 0.05 significance level) for determining whether temperature is
associated with latitude after controlling for longitude and elevation is given here:

• H0 : β1 = 0 (TEMP (Y ) is not associated with LAT (X1 ) after controlling for LONG (X2 )
and ELEV (X3 ))

• HA : β1 ≠ 0 (TEMP is associated with LAT after controlling for LONG and ELEV)

• T.S.: tobs = b1 /Sb1 = −1.99323/0.13639 = −14.614

• R.R.: |tobs | > tα/2,n−k−1 = t.025,12 = 2.179

• p–value: 2P (T > |tobs |) = 2P (T > 14.614) = .0001

Thus, we can conclude that there is an association between temperature and latitude, controlling
for longitude and elevation. Note that the coefficient is negative, so we conclude that temperature
decreases as latitude increases (given a level of longitude and elevation).
Note from Table 38 that neither the coefficient for LONGITUDE (X2 ) nor the one for ELEVATION (X3 )
is significant at the α = 0.05 significance level (p-values are .1182 and .1181, respectively). Recall
these are testing whether each term is 0 controlling for LATITUDE and the other term.
Before concluding that neither LONGITUDE (X2 ) or ELEVATION (X3 ) are useful predictors,
controlling for LATITUDE, we will test whether they are both simultaneously 0, that is:

H0 : β2 = β3 = 0 vs HA : β2 ≠ 0 and/or β3 ≠ 0

First, note that we have:

n = 16   k = 3   q = 2   SSE(F) = 7.609   dfE(F) = 16 − 3 − 1 = 12   MSE(F) = 0.634

dfE(R) = 16 − (3 − 2) − 1 = 14   F.05,2,12 = 3.89


Next, we fit the model with only LATITUDE (X1 ) and obtain the error sum of squares: SSE(R) =
60.935 and get the following test statistic:

T.S.: Fobs = (SSE(R) − SSE(F))/q / MSE(F) = [(60.935 − 7.609)/2] / 0.634 = 26.663/0.634 = 42.055

Since 42.055 ≫ 3.89, we reject H0 , and conclude that LONGITUDE (X2 ) and/or ELEVATION
(X3 ) are associated with TEMPERATURE (Y ), after controlling for LATITUDE (X1 ).

The reason we failed to reject H0 : β2 = 0 and H0 : β3 = 0 individually based on the t-tests is


that ELEVATION and LONGITUDE are highly correlated (elevations rise as you go farther west
in the state). So, once we control for LONGITUDE, we observe little ELEVATION effect, and vice
versa. We will discuss why this is the case later. In theory, we have little reason to believe that
temperatures naturally increase or decrease with LONGITUDE, but we may reasonably expect
that as ELEVATION increases, TEMPERATURE decreases.
We re–fit the more parsimonious (simpler) model that uses ELEVATION (X1 ) and LAT-
ITUDE (X2 ) to predict TEMPERATURE (Y ). Note the new symbols for ELEVATION and
LATITUDE. That is to show you that they are merely symbols. The results are given in Table 39
and Table 40.
ANOVA
Source of   Sum of            Degrees of          Mean
Variation   Squares           Freedom             Square               F                     p-value
MODEL       SSR = 932.532     k = 2               MSR = 932.532/2      F = 466.266/0.724     .0001
                                                  = 466.266            = 644.014
ERROR       SSE = 9.406       n − k − 1 =         MSE = 9.406/13
                              16 − 2 − 1 = 13     = 0.724
TOTAL       SSYY = 941.938    n − 1 = 15

Table 39: The Analysis of Variance Table for Texas data – without LONGITUDE

Both predictors remain significant: the t-statistic for testing H0 : β2 = 0 (no latitude effect on
temperature) is −17.65, corresponding to a p-value of .0001, and the t-statistic for testing H0 :
β1 = 0 (no elevation effect) is −8.41, also corresponding to a p-value of .0001. Further note
that both estimates are negative, reflecting that as elevation and latitude increase, temperature
decreases. That should not come as any big surprise.
PARAMETER          ESTIMATE           t FOR H0: βi = 0   P-VALUE   STANDARD ERROR OF ESTIMATE
INTERCEPT (β0)     b0 = 63.45485      130.16             .0001     0.48750
ELEVATION (β1)     b1 = −0.00185      −8.41              .0001     0.00022
LATITUDE (β2)      b2 = −1.83216      −17.65             .0001     0.10380

Table 40: Parameter estimates and tests of hypotheses for individual parameters – without LON-
GITUDE

The magnitudes of the estimated coefficients are quite different, which may make you believe
that one predictor variable is more important than the other. This is not necessarily true, because
the ranges of their levels are quite different (a 1 unit change in latitude represents a change of
approximately 69 miles, while a unit change in elevation is 1 foot) and recall that βi represents the
change in the mean response when variable Xi is increased by 1 unit.
The data corresponding to the 16 locations in the sample are plotted in Figure 23 and the fitted
equation for the model that does not include LONGITUDE is plotted in Figure 24. The fitted
equation is a plane in three dimensions.


Figure 23: Plot of temperature data in 3 dimensions

Example 13.2 – Mortgage Financing Cost Variation (By City)


A study in the mid 1960’s reported regional differences in mortgage costs for new homes. The
sampling units were n = 18 metro areas (SMSA’s) in the U.S. The dependent variable (Y ) is the
average yield (in percent) on a new home mortgage for the SMSA. The independent variables (Xi )
are given below.
Source: Schaaf, A.H. (1966), “Regional Differences in Mortgage Financing Costs,” Journal of
Finance, 21:85-94.

X1 – Average Loan Value / Mortgage Value Ratio (Higher X1 means lower down payment and
higher risk to lender).

Figure 24: Plot of the fitted equation for temperature data

X2 – Road Distance from Boston (Higher X2 means further from Northeast, where most capital
was at the time, and higher costs of capital).

X3 – Savings per Annual Dwelling Unit Constructed (Higher X3 means higher relative credit
surplus, and lower costs of capital).

X4 – Savings per Capita (does not adjust for new housing demand).

X5 – Percent Increase in Population 1950–1960

X6 – Percent of First Mortgage Debt Controlled by Inter-regional Banks.

The data, fitted values, and residuals are given in Table 41. The Analysis of Variance is given
in Table 42. The regression coefficients, test statistics, and p-values are given in Table 43.
Show that the fitted value for Los Angeles is 6.19, based on the fitted equation, and that the
residual is -0.02.
Based on the large F -statistic, and its small corresponding P -value, we conclude that this set of
predictor variables is associated with the mortgage rate. That is, at least one of these independent
variables is associated with Y .
Based on the t-tests, while none are strictly significant at the α = 0.05 level, there is some
evidence that X1 (Loan Value/Mortgage Value, P = .0515), X3 (Savings per Unit Constructed,
P = .0593), and to a lesser extent, X4 (Savings per Capita, P = .1002) are helpful in predicting
mortgage rates. We can fit a reduced model, with just these three predictors, and test whether we
can simultaneously drop X2 , X5 , and X6 from the model. That is:

H0 : β2 = β5 = β6 = 0 vs HA : β2 ≠ 0 and/or β5 ≠ 0 and/or β6 ≠ 0

First, we have the following values:

n = 18   k = 6   q = 3

SSE(F) = 0.10980   dfE(F) = 18 − 6 − 1 = 11   MSE(F) = 0.00998

dfE(R) = 18 − (6 − 3) − 1 = 14   F.05,3,11 = 3.59
SMSA Y X1 X2 X3 X4 X5 X6 Ŷ e = Y − Ŷ
Los Angeles-Long Beach 6.17 78.1 3042 91.3 1738.1 45.5 33.1 6.19 -0.02
Denver 6.06 77.0 1997 84.1 1110.4 51.8 21.9 6.04 0.02
San Francisco-Oakland 6.04 75.7 3162 129.3 1738.1 24.0 46.0 6.05 -0.01
Dallas-Fort Worth 6.04 77.4 1821 41.2 778.4 45.7 51.3 6.05 -0.01
Miami 6.02 77.4 1542 119.1 1136.7 88.9 18.7 6.04 -0.02
Atlanta 6.02 73.6 1074 32.3 582.9 39.9 26.6 5.92 0.10
Houston 5.99 76.3 1856 45.2 778.4 54.1 35.7 6.02 -0.03
Seattle 5.91 72.5 3024 109.7 1186.0 31.1 17.0 5.91 0.00
New York 5.89 77.3 216 364.3 2582.4 11.9 7.3 5.82 0.07
Memphis 5.87 77.4 1350 111.0 613.6 27.4 11.3 5.86 0.01
New Orleans 5.85 72.4 1544 81.0 636.1 27.3 8.1 5.81 0.04
Cleveland 5.75 67.0 631 202.7 1346.0 24.6 10.0 5.64 0.11
Chicago 5.73 68.9 972 290.1 1626.8 20.1 9.4 5.60 0.13
Detroit 5.66 70.7 699 223.4 1049.6 24.7 31.7 5.63 0.03
Minneapolis-St Paul 5.66 69.8 1377 138.4 1289.3 28.8 19.7 5.81 -0.15
Baltimore 5.63 72.9 399 125.4 836.3 22.9 8.6 5.77 -0.14
Philadelphia 5.57 68.7 304 259.5 1315.3 18.3 18.7 5.57 0.00
Boston 5.28 67.8 0 428.2 2081.0 7.5 2.0 5.41 -0.13

Table 41: Data and fitted values for mortgage rate multiple regression example.

ANOVA
Source of    Sum of             Degrees of                     Mean
Variation    Squares            Freedom                        Square                        F        p-value
MODEL        SSR = 0.73877      k = 6                          MSR = 0.73877/6 = 0.12313     12.33    .0003
ERROR        SSE = 0.10980      n − k − 1 = 18 − 6 − 1 = 11    MSE = 0.10980/11 = 0.00998
TOTAL        SSY Y = 0.84858    n − 1 = 17

Table 42: The Analysis of Variance Table for Mortgage rate regression analysis

PARAMETER          ESTIMATE          STANDARD ERROR    t-statistic    P -value
INTERCEPT (β0 ) b0 =4.28524 0.66825 6.41 .0001
X1 (β1 ) b1 = 0.02033 0.00931 2.18 .0515
X2 (β2 ) b2 = 0.000014 0.000047 0.29 .7775
X3 (β3 ) b3 = −0.00158 0.000753 -2.10 .0593
X4 (β4 ) b4 = 0.000202 0.000112 1.79 .1002
X5 (β5 ) b5 = 0.00128 0.00177 0.73 .4826
X6 (β6 ) b6 = 0.000236 0.00230 0.10 .9203

Table 43: Parameter estimates and tests of hypotheses for individual parameters – Mortgage rate
regression analysis
ANOVA
Source of    Sum of             Degrees of                             Mean
Variation    Squares            Freedom                                Square                        F        p-value
MODEL        SSR = 0.73265      k − q = 3                              MSR = 0.73265/3 = 0.24422     29.49    .0001
ERROR        SSE = 0.11593      n − (k − q) − 1 = 18 − 3 − 1 = 14      MSE = 0.11593/14 = 0.00828
TOTAL        SSY Y = 0.84858    n − 1 = 17

Table 44: The Analysis of Variance Table for Mortgage rate regression analysis (Reduced Model)

PARAMETER          ESTIMATE          STANDARD ERROR    t-statistic    P -value
INTERCEPT (β0 ) b0 =4.22260 0.58139 7.26 .0001
X1 (β1 ) b1 = 0.02229 0.00792 2.81 .0138
X3 (β3 ) b3 = −0.00186 0.00041778 -4.46 .0005
X4 (β4 ) b4 = 0.000225 0.000074 3.03 .0091

Table 45: Parameter estimates and tests of hypotheses for individual parameters – Mortgage rate
regression analysis (Reduced Model)

Next, we fit the reduced model, with β2 = β5 = β6 = 0. We get the Analysis of Variance in
Table 44 and parameter estimates in Table 45.
Note first that all three regression coefficients are now significant at the α = 0.05 significance
level. Also, our residual standard error, Se = √MSE, has decreased (from 0.09991 to 0.09100). This
implies we have lost very little predictive ability by dropping X2 , X5 , and X6 from the model. Now
we formally test whether these three predictor variables’ regression coefficients are simultaneously
0 (with α = 0.05):

• H0 : β2 = β5 = β6 = 0

• HA : β2 ≠ 0 and/or β5 ≠ 0 and/or β6 ≠ 0
• T S : Fobs = [(SSE(R) − SSE(F ))/q] / MSE(F ) = [(0.11593 − 0.10980)/3] / 0.00998 = 0.00204/0.00998 = 0.20

• RR : Fobs ≥ F0.05,3,11 = 3.59

We fail to reject H0 , and conclude that none of X2 , X5 , or X6 are associated with mortgage
rate, after controlling for X1 , X3 , and X4 .
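These reduced-versus-full computations are easy to script. Below is a minimal Python sketch (Python and scipy are assumed here for illustration; they are not part of the original analysis) that reproduces the partial F -test from the SSE values in Tables 42 and 44:

    from scipy import stats

    n, k, q = 18, 6, 3
    sse_full, sse_reduced = 0.10980, 0.11593
    df_full = n - k - 1                      # error df in the full model: 11
    mse_full = sse_full / df_full            # about 0.00998

    # Partial F: extra SSE per dropped parameter, relative to MSE(Full)
    f_obs = ((sse_reduced - sse_full) / q) / mse_full
    f_crit = stats.f.ppf(0.95, q, df_full)   # F(.05; 3, 11) = 3.59
    print(f_obs, f_crit, stats.f.sf(f_obs, q, df_full))
    # f_obs is about 0.20 < 3.59, so we fail to reject H0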

Example 13.3 – Store Location Characteristics and Sales


A study proposed using linear regression to describe sales at retail stores based on location
characteristics. As a case study, the authors modelled sales at n = 16 liquor stores in Charlotte,
N.C. Note that in North Carolina, all stores are state run, and do not practice promotion as liquor
stores in Florida do. The response was SALES volume (for the individual stores) in the fiscal year
7/1/1979-6/30/1980. The independent variables were: POP (number of people living within 1.5
miles of store), MHI (mean household income among households within 1.5 miles of store), DIS
(distance to the nearest store), TFL (daily traffic volume on the street where the store was located), and
EMP (the amount of employment within 1.5 miles of the store). The regression coefficients and
standard errors are given in Table 46.

Source: Lord, J.D. and C.D. Lynds (1981), “The Use of Regression Models in Store Location
Research: A Review and Case Study,” Akron Business and Economic Review, Summer, 13-19.

Variable Estimate Std Error


POP 0.09460 0.01819
MHI 0.06129 0.02057
DIS 4.88524 1.72623
TFL -2.59040 1.22768
EMP -0.00245 0.00454

Table 46: Regression coefficients and standard errors for liquor store sales study

a) Do any of these variables fail to be associated with store sales after controlling for the others?

b) Consider the signs of the significant regression coefficients. What do they imply?

13.5 R2 and Adjusted–R2


As was discussed in the previous chapter, the coefficient of multiple determination represents the
proportion of the variation in the dependent variable (Y ) that is “explained” by the regression on the
collection of independent variables (X1 , . . . , Xk ). We use R² (as opposed to r²) to differentiate the
coefficient of multiple determination from the coefficient of simple determination. R² is computed
exactly as before:

R² = SSR/SSY Y = 1 − SSE/SSY Y

One problem with R2 is that when we continually add independent variables to a regression
model, it continually increases (or at least, never decreases), even when the new variable(s) add
little or no predictive power. Since we are trying to fit the simplest (most parsimonious) model
that explains the relationship between the set of independent variables and the dependent variable,
we need a measure that penalizes models that contain useless or redundant independent variables.
This penalization takes into account that by including useless or redundant predictors, we are
decreasing error degrees of freedom (dfE = n − k − 1). A second measure, which does not carry the
proportion-of-variation-explained interpretation but is useful for comparing models of varying degrees of
complexity, is Adjusted-R² :
Adjusted-R² = 1 − [SSE/(n − k − 1)] / [SSY Y /(n − 1)] = 1 − [(n − 1)/(n − k − 1)] · (SSE/SSY Y )

Example 13.1 (Continued) – Texas Weather Data


Consider the two models we have fit:
Full Model — I.V.’s: LATITUDE, LONGITUDE, ELEVATION
Reduced Model — I.V.’s: LATITUDE, ELEVATION
For the Full Model, we have:

n = 16 k=3 SSE = 7.609 SSY Y = 941.938


and we obtain R²F and Adj-R²F :

R²F = 1 − 7.609/941.938 = 1 − .008 = 0.992    Adj-R²F = 1 − (15/12)(7.609/941.938) = 1 − 1.25(.008) = 0.9900
For the Reduced Model, we have:

n = 16 k=2 SSE = 9.406 SSY Y = 941.938


and we obtain R²R and Adj-R²R :

R²R = 1 − 9.406/941.938 = 1 − .010 = 0.990    Adj-R²R = 1 − (15/13)(9.406/941.938) = 1 − 1.15(.010) = 0.9885
Thus, by both measures the Full Model “wins”, but it should be added that both appear to fit
the data very well!
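As a check on the arithmetic, the following short Python sketch (assumed here purely for illustration) reproduces all four values from SSE, SSY Y , n, and k:

    def r2_and_adj(sse, ssyy, n, k):
        r2 = 1 - sse / ssyy
        adj = 1 - (n - 1) / (n - k - 1) * (sse / ssyy)
        return r2, adj

    ssyy, n = 941.938, 16
    print(r2_and_adj(7.609, ssyy, n, 3))   # Full model:    (0.992, 0.990)
    print(r2_and_adj(9.406, ssyy, n, 2))   # Reduced model: (0.990, 0.9885)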

Example 13.2 (Continued) – Mortgage Financing Costs


For the mortgage data (with Total Sum of Squares SSY Y = 0.84858 and n = 18), when we
include all 6 independent variables in the full model, we obtain the following results:
SSR = 0.73877 SSE = 0.10980 k=6
From this full model, we compute R2 and Adj-R2 :
R²F = SSRF /SSY Y = 0.73877/0.84858 = 0.8706    Adj-R²F = 1 − [(n − 1)/(n − k − 1)](SSEF /SSY Y ) = 1 − (17/11)(0.10980/0.84858) = 0.8000

Example 13.3 (Continued) – Store Location Characteristics and Sales


In this study, the authors reported that R2 = 0.69. Note that although we are not given the
Analysis of Variance, we can still conduct the F test for the overall model:
F = MSR/MSE = [SSR/k] / [SSE/(n − k − 1)] = [(SSR/SSY Y )/k] / [(SSE/SSY Y )/(n − k − 1)] = [R²/k] / [(1 − R²)/(n − k − 1)]
For the liquor store example, there were n = 16 stores and k = 5 variables in the full model. To
test:
H0 : β1 = β2 = β3 = β4 = β5 = 0 vs HA : Not all βi = 0
we get the following test statistic and rejection region (α = 0.05):
T S : Fobs = (0.69/5) / [(1 − 0.69)/(16 − 5 − 1)] = 0.138/0.031 = 4.45    RR : Fobs ≥ Fα,k,n−k−1 = F0.05,5,10 = 3.33
Thus, at least one of these variables is associated with store sales.
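The same computation in a short Python sketch (scipy assumed, for illustration only):

    from scipy import stats

    r2, n, k = 0.69, 16, 5
    f_obs = (r2 / k) / ((1 - r2) / (n - k - 1))   # 0.138/0.031 = 4.45
    f_crit = stats.f.ppf(0.95, k, n - k - 1)      # F(.05; 5, 10) = 3.33
    print(f_obs, f_crit, stats.f.sf(f_obs, k, n - k - 1))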

What is Adjusted-R² for this analysis?


14 Lecture 14 — Special Cases of Multiple Regression
Textbook Sections: 13.3,13.4,Skim 13.5
Problems: 13.17,18,20,21,22,26

In this section, we will look at three special cases that are frequently used methods of multiple
regression. The ideas such as the Analysis of Variance, tests of hypotheses, and parameter estimates
are exactly the same as before and we will concentrate on their interpretation through specific
examples. The three special cases are:
1. Polynomial regression

2. Regression models with dummy variables

3. Regression models containing interaction terms

14.1 Polynomial Regression


While certainly not restricted to this case, it is best to describe polynomial regression in the case
of a model with only one predictor variable. In many real–world settings relationships will not be
linear, but will demonstrate nonlinear associations. In economics, a widely described phenomenon
is “diminishing marginal returns”. In this case, Y may increase with X, but the rate of increase
decreases over the range of X. By adding quadratic terms, we can test if this is the case. Other
situations may show that the rate of increase in Y is increasing in X.

Example 14.1 – Health Club Demand


Consider the dilemma of an owner of a new health club. She decides that she will charge a
very low membership fee and will charge people a fixed amount each time they come (this is to
help attract customers who are scared off by those tremendous membership fees that some clubs
charge). The premise to her plan is that she needs to have a fairly steady and heavy daily clientele
to keep money flowing sufficiently. A new, aesthetically pleasing, popular machine known as the
MEGABODY MAKER 3000 has just been mass produced, and she wants this to be the selling
point of her new club. She believes that the more of these machines she has in her club, the higher
the daily attendance will be. However, she also knows that this increase in attendance will ‘tail-off’
as the number of machines keeps getting larger. In the world of Economics, this is referred to as
“diminishing returns”. That is, she may expect mean daily attendance to increase by 300 people
if she were to increase from 1 to 2 machines, but possibly only increase by 20 people if she were to
increase from 5 to 6 machines. That is, she would expect the attendance to increase as the number
of machines increases, but the amount of the increase will diminish, thus implying that the effect of
changing the predictor variable depends on its level. This could be written into a model as follows
(letting y be the number of people attending on a given day, and x being the number of machines):

y = β0 + β1 X + β2 X² + ε.

Again, we assume that ε ∼ N (0, σ). In this model, the number of people attending in a day when
there are X machines is normally distributed with mean β0 + β1 X + β2 X² and standard deviation
σ. Note that we are no longer saying that the mean is linearly related to X, but rather that
it is approximately quadratically related to X (curved). Suppose she leases varying numbers of
machines over a period of n = 12 Wednesdays (always advertising how many machines will be there
on the following Wednesday), and observes the number of people attending the club each day,
obtaining the data in Table 47.

Week # Machines (X) Attendance (Y )


1 3 555
2 6 776
3 1 267
4 2 431
5 5 722
6 4 635
7 1 218
8 5 692
9 3 534
10 2 459
11 6 810
12 4 671

Table 47: Data for health club example

In this case, we would like to fit the multiple regression model:


y = β0 + β1 X + β2 X² + ε,

which is just like our previous model except that instead of a second predictor variable X2 , we are using
the variable X². The effect is that the fitted equation Ŷ will be a curve in 2 dimensions, not a
plane in 3 dimensions as we saw in the weather example. First we will run the regression on the
computer, obtaining the Analysis of Variance and the parameter estimates, then plot the data and
fitted equation. Table 48 gives the Analysis of Variance for this example and Table 49 gives the
parameter estimates and their standard errors. Note that even though we have only one predictor
variable, it is being used twice and could in effect be treated as two different predictor variables,
so k = 2.
ANOVA
Source of    Sum of               Degrees of                    Mean
Variation    Squares              Freedom                       Square                           F         p-value
MODEL        SSR = 393933.12      k = 2                         MSR = 393933.12/2 = 196966.56    253.80    .0001
ERROR        SSE = 6984.55        n − k − 1 = 12 − 2 − 1 = 9    MSE = 6984.55/9 = 776.06
TOTAL        SSY Y = 400917.67    n − 1 = 11

Table 48: The Analysis of Variance Table for health club data

The first test of hypothesis is whether the attendance is associated with the number of machines.
This is a test of H0 : β1 = β2 = 0. If the null hypothesis is true, that implies mean daily attendance
is unrelated to the number of machines, thus the club owner would purchase very few (if any) of the
machines. As before this test is the F -test from the Analysis of Variance table, which we conduct
here at α = .05.
1. H0 : β1 = β2 = 0
PARAMETER            ESTIMATE          t FOR H0 : βi = 0    P-VALUE    STANDARD ERROR OF ESTIMATE
INTERCEPT (β0 )      b0 = 72.0500      2.04                 .0712      35.2377
MACHINES (β1 )       b1 = 199.7625     8.67                 .0001      23.0535
MACHINES SQ (β2 )    b2 = −13.6518     −4.23                .0022      3.2239

Table 49: Parameter estimates and tests of hypotheses for individual parameters

2. HA : Not both βi = 0

3. T.S.: Fobs = MSR/MSE = 196966.56/776.06 = 253.80

4. R.R.: Fobs > F2,9,.05 = 4.26 (This is not provided on the output; the p-value takes its place.)

5. p-value: P (F > 253.80) = .0001 (Actually it is less than .0001, but this is the smallest p-value
the computer will print.)
Another test with an interesting interpretation is H0 : β2 = 0. This is testing the hypothesis
that the mean increases linearly with X (since if β2 = 0 this becomes the simple regression model
(refer back to the coffee data example)). The t-test in Table 49 for this hypothesis has a test
statistic tobs = −4.23, which corresponds to a p-value of .0022; since this is below .05, we reject
H0 and conclude β2 ≠ 0. Since b2 is negative, we conclude that β2 is negative,
which is in agreement with her theory that once you get to a certain number of machines, it does
not help to keep adding new machines. This is the idea of ‘diminishing returns’. Figure 25 shows
the actual data and the fitted equation Ŷ = 72.0500 + 199.7625X − 13.6518X².
[Figure: scatterplot of attendance (YHAT axis, 0–900) versus number of machines (X = 0–7), with the fitted quadratic curve overlaid.]

Figure 25: Plot of the data and fitted equation for health club example
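The quadratic fit can be reproduced by running ordinary least squares with the columns 1, X, and X² as the design matrix. A minimal Python/numpy sketch (assumed here for illustration, not part of the original notes), using the data of Table 47:

    import numpy as np

    x = np.array([3, 6, 1, 2, 5, 4, 1, 5, 3, 2, 6, 4], dtype=float)
    y = np.array([555, 776, 267, 431, 722, 635, 218, 692, 534, 459, 810, 671],
                 dtype=float)

    # Design matrix with columns: intercept, X, X^2
    X = np.column_stack([np.ones_like(x), x, x**2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(b)   # approximately [72.05, 199.76, -13.65], matching Table 49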

14.2 Regression Models With Dummy Variables


All of the predictor variables we have used so far were numeric or what are often called quantitative
variables. Other variables also can be used that are called qualitative variables. Qualitative vari-
ables measure characteristics that cannot be described numerically, such as a person’s sex, race,
religion, or blood type; a city’s region or mayor’s political affiliation; the list of possibilities is
endless. In this case, we frequently have some numeric predictor variable(s) that we believe is (are)
related to the response variable, but we believe this relationship may be different for different levels
of some qualitative variable of interest.
If a qualitative variable has m levels, we create m−1 indicator or dummy variables. Consider
an example where we are interested in health care expenditures as related to age for men and women,
separately. In this case, the response variable is health care expenditures, one predictor variable is
age, and we need to create a variable representing sex. This can be done by creating a variable X2
that takes on a value 1 if a person is female and 0 if the person is male. In this case we can write
the mean response as before:

E[Y |X1 , X2 ] = β0 + β1 X1 + β2 X2 .

Note that for women of age X1 , the mean expenditure is E[Y |X1 , 1] = β0 + β1 X1 + β2 (1) = (β0 + β2 ) +
β1 X1 , while for men of age X1 , the mean expenditure is E[Y |X1 , 0] = β0 + β1 X1 + β2 (0) = β0 + β1 X1 .
This model allows for different means for men and women, but requires them to have the same slope
(we will see a more general case in the next section). In this case the interpretation of β2 = 0 is
that the means are the same for both sexes; this is a hypothesis a health care professional may wish
to test in a study. In this example the variable sex had two levels, so we had to create 2 − 1 = 1
dummy variable. Now consider a second example.

Example 14.2
We would like to see if annual per capita clothing expenditures are related to annual per capita
income in cities across the U.S. Further, we would like to see if there are any differences in the means
across the 4 regions (Northeast, South, Midwest, and West). Since the variable region has 4 levels,
we will create 3 dummy variables X2 , X3 , and X4 as follows (we leave X1 to represent the predictor
variable per capita income):

X2 = 1 if region = South, 0 otherwise
X3 = 1 if region = Midwest, 0 otherwise
X4 = 1 if region = West, 0 otherwise
Note that cities in the Northeast have X2 = X3 = X4 = 0, while cities in other regions will have
either X2 , X3 , or X4 being equal to 1. Northeast cities act like males did in the previous example.
The data are given in Table 50.
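Constructing the dummies is mechanical. A small pandas sketch (assumed here for illustration, with a few rows taken from Table 50) shows the coding, with Northeast as the baseline:

    import pandas as pd

    df = pd.DataFrame({
        "metro":  ["New York City", "Atlanta", "Chicago", "Seattle"],
        "region": ["Northeast", "South", "Midwest", "West"],
        "income": [25405, 20263, 21982, 21087],
    })
    for level, name in [("South", "X2"), ("Midwest", "X3"), ("West", "X4")]:
        df[name] = (df["region"] == level).astype(int)
    print(df)   # the Northeast row has X2 = X3 = X4 = 0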
The Analysis of Variance is given in Table 51, and the parameter estimates and standard errors
are given in Table 52.
Note that we would fail to reject H0 : β1 = β2 = β3 = β4 = 0 at the α = .05 significance level if we looked
only at the F -statistic and its p-value (Fobs = 2.93, p-value=.0562). This would lead us to conclude
that there is no association between the predictor variables income and region and the response
variable clothing expenditures. This is where you need to be careful when using multiple regression
with many predictor variables. Look at the test of H0 : β1 = 0, based on the t-test in Table 52.
Here we observe tobs = 3.11, with a p-value of .0071. We thus conclude β1 ≠ 0, and that clothing
expenditures are related to income, as we would expect. However, we do fail to reject H0 : β2 = 0, H0 :
β3 = 0, and H0 : β4 = 0, so we fail to observe any differences among the regions in terms of clothing
PER CAPITA INCOME & CLOTHING EXPENDITURES (1990)
Income Expenditure
Metro Area Region X1 Y X2 X3 X4
New York City Northeast 25405 2290 0 0 0
Philadelphia Northeast 21499 2037 0 0 0
Pittsburgh Northeast 18827 1646 0 0 0
Boston Northeast 24315 1659 0 0 0
Buffalo Northeast 17997 1315 0 0 0
Atlanta South 20263 2108 1 0 0
Miami/Ft Laud South 19606 1587 1 0 0
Baltimore South 21461 1978 1 0 0
Houston South 19028 1589 1 0 0
Dallas/Ft Worth South 19821 1982 1 0 0
Chicago Midwest 21982 2108 0 1 0
Detroit Midwest 20595 1262 0 1 0
Cleveland Midwest 19640 2043 0 1 0
Minneapolis/St Paul Midwest 21330 1816 0 1 0
St Louis Midwest 20200 1340 0 1 0
Seattle West 21087 1667 0 0 1
Los Angeles West 20691 2404 0 0 1
Portland West 18938 1440 0 0 1
San Diego West 19588 1849 0 0 1
San Fran/Oakland West 25037 2556 0 0 1

Table 50: Clothes Expenditures and income example

ANOVA
Source of Sum of Degrees of Mean
Variation Squares Freedom Square F p-value
MODEL 1116419.0 4 279104.7 2.93 .0562
ERROR 1426640.2 15 95109.3
TOTAL 2543059.2 19

Table 51: The Analysis of Variance Table for clothes expenditure data

PARAMETER          ESTIMATE    t FOR H0 : βi = 0    P-VALUE    STANDARD ERROR OF ESTIMATE
INTERCEPT (β0 ) −657.428 −0.82 .4229 797.948
X1 (β1 ) 0.113 3.11 .0071 0.036
X2 (β2 ) 237.494 1.17 .2609 203.264
X3 (β3 ) 21.691 0.11 .9140 197.536
X4 (β4 ) 254.992 1.30 .2130 196.036

Table 52: Parameter estimates and tests of hypotheses for individual parameters
expenditures after ‘adjusting’ for the variable income. Figure 26 and Figure 27 show the original
data using region as the plotting symbol and the 4 fitted equations corresponding to the 4 regions.
Recall that the fitted equation is Ŷ = −657.428 + 0.113X1 + 237.494X2 + 21.691X3 + 254.992X4 ,
and each of the regions has a different set of levels of variables X2 , X3 , and X4 .
[Figure: scatterplot of clothing expenditures (Y , 1200–2600) versus per capita income (X1 , 16000–26000), with plotting symbols N = Northeast, S = South, M = Midwest, W = West.]

Figure 26: Plot of clothing data, with plotting symbol region

[Figure: the four fitted regression lines (one per region) for clothing expenditures versus per capita income, over X1 = 16000–26000.]

Figure 27: Plot of fitted equations for each region

14.3 Regression Models With Interactions


In some situations, two or more predictor variables may interact in terms of their effects on the
mean response. That is, the effect on the mean response of changing the level of one predictor
variable depends on the level of another predictor variable. This idea is easiest understood in
the case where one of the variables is qualitative. Consider the following example involving the
number of AIDS cases reported over a period of 8 years (treating this as a sample from a conceptual
population). We are interested in the number of cases reported (response variable) among men and
women (qualitative predictor variable) over time (quantitative predictor variable). If we fit a model
as in the previous section, it would be of the form:
Y = β0 + β1 X1 + β2 X2 + ε,
where X1 is the year from the beginning of the study, and X2 is a dummy variable corresponding
to gender. The model allows for different intercepts, but requires common slopes with respect to
changes over time (this can be interpreted as saying that while the number of new cases in a year
can be at different levels among men and women, the rates of increase are the same). This would
be unlikely since we expect the rate of increase to be much higher for men. We would rather fit a
model that allowed for different intercepts and slopes (in effect, different regression equations for
the two sexes). Note that this is a situation where we would be trying to predict a future outcome
based on past data. The model we will fit allows for interaction between time and sex, allowing
for different slopes for the two sexes. It can be written as:
Y = β0 + β1 X1 + β2 X2 + β3 X1 X2 + ε.
If we define X2 to be 1 for males and 0 for females, we can write the equations for the two sexes as
follows:
Males: Y = β0 + β1 X1 + β2 (1) + β3 X1 (1) + ε = (β0 + β2 ) + (β1 + β3 )X1 + ε,
and
Females: Y = β0 + β1 X1 + β2 (0) + β3 X1 (0) + ε = β0 + β1 X1 + ε.
The data for years 1984–1991 are given in Table 53; note that we will use X1 = year − 1983 as our
predictor variable representing time, to make the calculations neater.

AIDS Cases Reported in U.S.


YEAR    X1 = YEAR − 1983    SEX    X2    CASES (Y )
1984 1 FEMALE 0 296
1984 1 MALE 1 4146
1985 2 FEMALE 0 585
1985 2 MALE 1 7630
1986 3 FEMALE 0 1049
1986 3 MALE 1 12101
1987 4 FEMALE 0 1833
1987 4 MALE 1 19276
1988 5 FEMALE 0 3287
1988 5 MALE 1 27467
1989 6 FEMALE 0 3660
1989 6 MALE 1 29978
1990 7 FEMALE 0 4880
1990 7 MALE 1 36736
1991 8 FEMALE 0 5677
1991 8 MALE 1 37995

Table 53: AIDS cases reported in U.S. from 1984-1991 by sex

We will not provide the Analysis of Variance table for this example due to the magnitude of
the numbers, and the fact that we are certain of sex and year effects after one look at the data.
However, we do note that R2 = .991555, showing that our model does account for much of the
variation in reported cases. Table 54 provides the parameter estimates and their standard errors.

PARAMETER          ESTIMATE     t FOR H0 : βi = 0    P-VALUE    STANDARD ERROR OF ESTIMATE
INTERCEPT (β0 )    -1007.464    -0.94                .3675      1075.908
X1 (β1 )           814.631      3.82                 .0024      213.062
X2 (β2 )           -877.929     -0.58                .5746      1521.564
X1 X2 (β3 )        4474.595     14.85                .0001      301.315

Table 54: Parameter estimates and tests of hypotheses for individual parameters – AIDS data

For males the fitted equation is:

Ŷmale = (b0 + b2 ) + (b1 + b3 )X1 = −1885.393 + 5289.226X1 ,

while for females, the fitted equation is:

Ŷf emale = b0 + b1 X1 = −1007.464 + 814.631X1 .

Note the difference in these equations, particularly their slopes, which represent the increase in
the number of new cases each year. Suppose that a government official would like to predict the
number of new cases in 1992 based on this equation (we are assuming that this increasing pattern
will continue). In this case X1 = year − 1983 = 1992 − 1983 = 9. The two predictions would be:

Males: Ŷ1992 = −1885.393 + 5289.226(9) = 45717.641,

and
Females: Ŷ1992 = −1007.464 + 814.631(9) = 6324.215.
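A tiny Python sketch (assumed here for illustration) evaluating the interaction model at X1 = 9 with the estimates from Table 54 reproduces both predictions:

    b0, b1, b2, b3 = -1007.464, 814.631, -877.929, 4474.595  # from Table 54

    def predicted_cases(x1, male):
        x2 = 1 if male else 0
        return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

    print(predicted_cases(9, male=True))    # about 45717.6
    print(predicted_cases(9, male=False))   # about 6324.2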
Note that we are not limited to simple models like this; we could have interaction terms in any of
the regression models that we have seen in this chapter. Their interpretations increase in complexity
as the models include more variables. Usually we will test whether the coefficient of an interaction term is
0, using the t-test, and remove the interaction term from the model if we fail to reject H0 : βi = 0.
Figure 28 shows the data, as well as the two fitted equations for the AIDS data.

15 Lecture 15 — Multicollinearity and Intro to Time Series


Textbook Sections: 13.6,15.1
Problems: See lecture, 15.1,15.3

15.1 Multicollinearity
Multicollinearity refers to the situation where independent variables are highly correlated among
themselves. This can cause problems mathematically and creates problems in interpreting regres-
sion coefficients.
Some of the problems that arise include:
[Figure: the two fitted lines (one per sex) for AIDS cases (Y , up to 50000) versus X1 = year − 1983 (1–8).]

Figure 28: Plot of fitted equations for each sex

• Difficult to interpret regression coefficient estimates

• Inflated std errors of estimates (and thus small t–statistics)

• Signs of coefficients may not be what is expected.

• However, predicted values are not adversely affected

One can think of the independent variables as explaining “the same” variation in Y , making it
difficult for the model to attribute the variation explained to individual predictors (recall partial regression coefficients).
Variance Inflation Factors provide a means of detecting whether a given independent variable
is causing multicollinearity. They are calculated (for each independent variable) as:
VIFi = 1 / (1 − R²i )

where R²i is the coefficient of multiple determination when Xi is regressed on the k − 1 other
independent variables. One rule of thumb suggests that severe multicollinearity is present if VIFi >
10 (R²i > .90).
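A minimal numpy sketch (assumed here for illustration) of the VIF computation: regress each column on the others and apply the formula above.

    import numpy as np

    def vifs(X):
        """X: (n, k) array of predictor columns; returns one VIF per column."""
        n, k = X.shape
        out = []
        for j in range(k):
            y = X[:, j]
            # Regress X_j on the other k - 1 predictors (plus an intercept)
            Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            yhat = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
            r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
            out.append(1 / (1 - r2))
        return out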

Example 15.1
First, we run a regression with ELEVATION as the dependent variable and LATITUDE and
LONGITUDE as the independent variables. We then repeat the process with LATITUDE as the
dependent variable, and finally with LONGITUDE as the dependent variable. Table 55 gives R²
and VIF for each model.
Note how large the factor is for ELEVATION. Texas elevation increases as you go West and as
you go North. The Western rise is the more pronounced of the two (the simple correlation between
ELEVATION and LONGITUDE is .89).
Consider the effects on the coefficients in Table 56 and Table 57 (these are subsets of previously
shown tables).
Compare the estimate and estimated standard error for the coefficient for ELEVATION and
LATITUDE for the two models. In particular, the ELEVATION coefficient doubles in absolute
Variable     R²       VIF
ELEVATION    .9393    16.47
LATITUDE     .7635    4.23
LONGITUDE    .8940    9.43

Table 55: Variance Inflation Factors for Texas weather data

PARAMETER           ESTIMATE          STANDARD ERROR OF ESTIMATE
INTERCEPT (β0 )     b0 = 109.25887    2.97857
LATITUDE (β1 )      b1 = −1.99323     0.13639
LONGITUDE (β2 )     b2 = −0.38471     0.22858
ELEVATION (β3 )     b3 = −0.00096     0.00057

Table 56: Parameter estimates and standard errors for the full model

value and its standard error decreases by a factor of almost 3. The LATITUDE coefficient and
standard error do not change very much. We choose to keep ELEVATION, as opposed to LONGI-
TUDE, in the model due to theoretical considerations with respect to weather and climate.

15.2 Forecasting and Time Series


In the remainder of the course, we consider data that are collected over time. Many economic and
financial models are based on time series. We will describe some simple methods used to predict
future outcomes based on past values and (possibly) other known information at the time of the
forecast.
Since there is an unlimited number of ways of forecasting future outcomes, we need a
means of comparing the various methods. First, we introduce some notation:

• Xt — Actual (random) outcome at time t, unknown prior to t

• Ft — Forecast of Xt , made prior to t

• et — Error of forecast: et = Xt − Ft

Five commonly used measures are given below (MAD is also called the Mean Absolute Error, MAE); think of ways that the measures may differ:

Mean Error (ME) — ME = Σ ei / (number of forecasts)

Mean Absolute Deviation (MAD) — MAD = Σ |ei | / (number of forecasts)

PARAMETER           ESTIMATE         STANDARD ERROR OF ESTIMATE
INTERCEPT (β0 )     b0 = 63.45485    0.48750
ELEVATION (β1 )     b1 = −0.00185    0.00022
LATITUDE (β2 )      b2 = −1.83216    0.10380

Table 57: Parameter estimates and standard errors for the reduced model
Mean Square Error (MSE) — MSE = Σ e²i / (number of forecasts)

Mean Percentage Error (MPE) — MPE = Σ [(ei /Xi ) · 100] / (number of forecasts)

Mean Absolute Percentage Error (MAPE) — MAPE = Σ [(|ei |/Xi ) · 100] / (number of forecasts)
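All five measures are one-liners once the outcomes and forecasts are aligned; a short numpy sketch (assumed here for illustration):

    import numpy as np

    def forecast_error_measures(x, f):
        """x, f: outcomes and forecasts over the periods where a forecast exists."""
        x, f = np.asarray(x, float), np.asarray(f, float)
        e = x - f
        return {"ME":   e.mean(),
                "MAD":  np.abs(e).mean(),            # also reported as MAE
                "MSE":  (e ** 2).mean(),
                "MPE":  (e / x * 100).mean(),
                "MAPE": (np.abs(e) / x * 100).mean()}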

16 Lecture 16 — Simple Time Series Forecasting Techniques


Textbook Sections: 15.4, pp651–652
Problems: 15.11,13, See lecture

In this section, we describe some simple methods of using past data to predict future outcomes.
Most forecasts you hear reported are generally complex hybrids of these techniques.

16.1 Moving Averages


Use the mean of the last n observations to forecast the outcome at time t:

Ft = (Xt−1 + Xt−2 + · · · + Xt−n ) / n

The term “moving” implies that the n X’s move through time.
Problem: How to choose n?

Weighted Moving Averages

Put higher (presumably) weights on more recent values in the moving averages:

Ft = (w1 Xt−1 + w2 Xt−2 + · · · + wn Xt−n ) / Σwi

Presumably w1 ≥ w2 ≥ · · · ≥ wn

Example 16.1
Table 58 gives average dividend yields for Anheuser–Busch for the years 1952–1995 (Source: Value
Line), along with forecasts and errors based on moving averages with lags of 1, 2, and 3. Note that we
don’t have early year forecasts, and the longer the lag, the longer we must wait until we get our
first forecast.
Here we compute moving averages for year=1963:
1–Year: F1963 = X1962 = 3.2
2–Year: F1963 = (X1962 + X1961 )/2 = (3.2 + 2.8)/2 = 3.0
3–Year: F1963 = (X1962 + X1961 + X1960 )/3 = (3.2 + 2.8 + 4.4)/3 = 3.47
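A short Python sketch (assumed here for illustration) of the n-period moving average; note that the first available forecast is for period n + 1:

    import numpy as np

    def moving_average_forecasts(x, n):
        """F_t is the mean of the previous n outcomes."""
        x = np.asarray(x, dtype=float)
        return np.array([x[t - n:t].mean() for t in range(n, len(x))])

    yields = [5.3, 4.2, 3.9, 5.2, 5.8]            # first rows of Table 58
    print(moving_average_forecasts(yields, 2))    # [4.75, 4.05, 4.55]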
t Year Xt F1,t e1,t F2,t e2,t F3,t e3,t
1 1952 5.30 . . . . . .
2 1953 4.20 5.30 -1.10 . . . .
3 1954 3.90 4.20 -0.30 4.75 -0.85 . .
4 1955 5.20 3.90 1.30 4.05 1.15 4.47 0.73
5 1956 5.80 5.20 0.60 4.55 1.25 4.43 1.37
6 1957 6.30 5.80 0.50 5.50 0.80 4.97 1.33
7 1958 5.60 6.30 -0.70 6.05 -0.45 5.77 -0.17
8 1959 4.80 5.60 -0.80 5.95 -1.15 5.90 -1.10
9 1960 4.40 4.80 -0.40 5.20 -0.80 5.57 -1.17
10 1961 2.80 4.40 -1.60 4.60 -1.80 4.93 -2.13
11 1962 3.20 2.80 0.40 3.60 -0.40 4.00 -0.80
12 1963 3.10 3.20 -0.10 3.00 0.10 3.47 -0.37
13 1964 3.10 3.10 0.00 3.15 -0.05 3.03 0.07
14 1965 2.60 3.10 -0.50 3.10 -0.50 3.13 -0.53
15 1966 2.00 2.60 -0.60 2.85 -0.85 2.93 -0.93
16 1967 1.60 2.00 -0.40 2.30 -0.70 2.57 -0.97
17 1968 1.30 1.60 -0.30 1.80 -0.50 2.07 -0.77
18 1969 1.20 1.30 -0.10 1.45 -0.25 1.63 -0.43
19 1970 1.20 1.20 0.00 1.25 -0.05 1.37 -0.17
20 1971 1.10 1.20 -0.10 1.20 -0.10 1.23 -0.13
21 1972 0.90 1.10 -0.20 1.15 -0.25 1.17 -0.27
22 1973 1.40 0.90 0.50 1.00 0.40 1.07 0.33
23 1974 2.00 1.40 0.60 1.15 0.85 1.13 0.87
24 1975 1.90 2.00 -0.10 1.70 0.20 1.43 0.47
25 1976 2.30 1.90 0.40 1.95 0.35 1.77 0.53
26 1977 3.10 2.30 0.80 2.10 1.00 2.07 1.03
27 1978 3.50 3.10 0.40 2.70 0.80 2.43 1.07
28 1979 3.80 3.50 0.30 3.30 0.50 2.97 0.83
29 1980 3.70 3.80 -0.10 3.65 0.05 3.47 0.23
30 1981 3.10 3.70 -0.60 3.75 -0.65 3.67 -0.57
31 1982 2.60 3.10 -0.50 3.40 -0.80 3.53 -0.93
32 1983 2.40 2.60 -0.20 2.85 -0.45 3.13 -0.73
33 1984 3.00 2.40 0.60 2.50 0.50 2.70 0.30
34 1985 2.40 3.00 -0.60 2.70 -0.30 2.67 -0.27
35 1986 1.80 2.40 -0.60 2.70 -0.90 2.60 -0.80
36 1987 1.70 1.80 -0.10 2.10 -0.40 2.40 -0.70
37 1988 2.20 1.70 0.50 1.75 0.45 1.97 0.23
38 1989 2.10 2.20 -0.10 1.95 0.15 1.90 0.20
39 1990 2.40 2.10 0.30 2.15 0.25 2.00 0.40
40 1991 2.10 2.40 -0.30 2.25 -0.15 2.23 -0.13
41 1992 2.20 2.10 0.10 2.25 -0.05 2.20 0.00
42 1993 2.70 2.20 0.50 2.15 0.55 2.23 0.47
43 1994 3.00 2.70 0.30 2.45 0.55 2.33 0.67
44 1995 2.80 3.00 -0.20 2.85 -0.05 2.63 0.17

Table 58: Dividend yields, Forecasts, errors — 1, 2, and 3 year moving Averages
When might a “short” Moving Average be preferred to a “long” one?
When might a “long” Moving Average be preferred to a “short” one?
Figure 29 displays raw data and moving average forecasts.
[Figure: DIV_YLD (0–7) versus CAL_YEAR (1950–2000); series shown: Actual, MA(1), MA(2), MA(3).]

Figure 29: Plot of the data and moving average forecasts for Anheuser–Busch dividend data

Measurements of Forecasting Error

Mean Error: ME = Σ ei / (number of forecasts), where ei = Xi − Fi

1–Year: ME = [(−1.1) + (−0.3) + 1.3 + · · · + 0.5 + 0.3 + (−0.2)]/43 = −2.5/43 = −0.058
2–Year: ME = [(−0.85) + 1.15 + 1.25 + · · · + 0.55 + 0.55 + (−0.05)]/42 = −2.6/42 = −0.062
3–Year: ME = [0.73 + 1.37 + 1.33 + · · · + 0.47 + 0.67 + 0.17]/41 = −2.8/41 = −0.068

Mean Absolute Percentage Error: MAPE = Σ [(|ei |/Xi ) · 100] / (number of forecasts)

1–Year: MAPE = [(|−1.1|/4.2) · 100 + (|−0.3|/3.9) · 100 + · · · + (|0.3|/3.0) · 100 + (|−0.2|/2.8) · 100]/43 = 687.06/43 = 15.98

2–Year: MAPE = [(|−0.85|/3.9) · 100 + (|1.15|/5.2) · 100 + · · · + (|0.55|/3.0) · 100 + (|−0.05|/2.8) · 100]/42 = 843.31/42 = 20.08

16.2 Exponential Smoothing


Exponential smoothing is a method of forecasting that weights data from previous time periods
with exponentially decreasing magnitudes. Forecasts can be written as follows, where the forecast
for period 2 is traditionally (but not always) simply the outcome from period 1:

Ft+1 = α · Xt + (1 − α) · Ft
where :

• Ft+1 is the forecast for period t + 1

• Xt is the outcome at t

• Ft is the forecast for period t

• α is the smoothing constant (0 ≤ α ≤ 1)

Forecasts are “smoother” than the raw data and weights of previous observations decline expo-
nentially with time.
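The recursion is only a few lines of code; a minimal Python sketch (assumed here for illustration), with F2 = X1 as above:

    def exponential_smoothing_forecasts(x, alpha):
        """Returns forecasts for periods 2..n, with F_2 set to X_1."""
        f = [x[0]]
        for xt in x[1:-1]:
            f.append(alpha * xt + (1 - alpha) * f[-1])
        return f

    print(exponential_smoothing_forecasts([5.3, 4.2, 3.9, 5.2], 0.2))
    # [5.3, 5.08, 4.844], matching the alpha = 0.2 column of Table 59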

Example 16.1 (Continued)


3 smoothing constants (allowing decreasing amounts of smoothness) for illustration:

• α = 0.2 — Ft+1 = 0.2Xt + 0.8Ft

• α = 0.5 — Ft+1 = 0.5Xt + 0.5Ft

• α = 0.8 — Ft+1 = 0.8Xt + 0.2Ft

Year 2 (1953) — set F1953 = X1952 , then cycle from there.


Table 59 gives average dividend yields for Anheuser–Busch for the years 1952–1995 (Source: Value
Line), along with forecasts and errors based on exponential smoothing with α = 0.2, 0.5, and 0.8.

Here we obtain Forecasts based on Exponential Smoothing, beginning with year 2 (1953):
1953: Fα=.2,1953 = X1952 = 5.30    Fα=.5,1953 = X1952 = 5.30    Fα=.8,1953 = X1952 = 5.30
1954 (α = 0.2): Fα=.2,1954 = .2X1953 + .8Fα=.2,1953 = .2(4.20) + .8(5.30) = 5.08
1954 (α = 0.5): Fα=.5,1954 = .5X1953 + .5Fα=.5,1953 = .5(4.20) + .5(5.30) = 4.75
1954 (α = 0.8): Fα=.8,1954 = .8X1953 + .2Fα=.8,1953 = .8(4.20) + .2(5.30) = 4.42
Which level of α appears to be “discounting” more distant observations at a quicker rate? What
would happen if α = 1? If α = 0? Figure 30 gives raw data and exponential smoothing forecasts.
Table 60 gives measures of forecast errors for the three moving average and three exponential
smoothing methods.

16.3 Autoregression
Sometimes regression is run on past or “lagged” values of the dependent variable (and possibly
other variables). An Autoregressive model with independent variables corresponding to k periods
can be written as follows:

Ŷt = b0 + b1 Yt−1 + b2 Yt−2 + · · · + bk Yt−k

Note that the regression cannot be run for the first k responses in the series.
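An autoregression is an ordinary least squares fit on lagged copies of the series; a numpy sketch (assumed here for illustration):

    import numpy as np

    def fit_autoregression(y, k):
        """OLS fit of Y_t on Y_{t-1}, ..., Y_{t-k}; returns [b0, b1, ..., bk]."""
        y = np.asarray(y, dtype=float)
        n = len(y)
        # Column j holds the lag-j values, aligned with the responses y[k:]
        X = np.column_stack([np.ones(n - k)] +
                            [y[k - j:n - j] for j in range(1, k + 1)])
        b, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        return b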

Example 16.1 (Continued)


t Year Xt Fα=.2,t eα=.2,t Fα=.5,t eα=.5,t Fα=.8,t eα=.8,t
1 1952 5.30 . . . . . .
2 1953 4.20 5.30 -1.10 5.30 -1.10 5.30 -1.10
3 1954 3.90 5.08 -1.18 4.75 -0.85 4.42 -0.52
4 1955 5.20 4.84 0.36 4.33 0.88 4.00 1.20
5 1956 5.80 4.92 0.88 4.76 1.04 4.96 0.84
6 1957 6.30 5.09 1.21 5.28 1.02 5.63 0.67
7 1958 5.60 5.33 0.27 5.79 -0.19 6.17 -0.57
8 1959 4.80 5.39 -0.59 5.70 -0.90 5.71 -0.91
9 1960 4.40 5.27 -0.87 5.25 -0.85 4.98 -0.58
10 1961 2.80 5.10 -2.30 4.82 -2.02 4.52 -1.72
11 1962 3.20 4.64 -1.44 3.81 -0.61 3.14 0.06
12 1963 3.10 4.35 -1.25 3.51 -0.41 3.19 -0.09
13 1964 3.10 4.10 -1.00 3.30 -0.20 3.12 -0.02
14 1965 2.60 3.90 -1.30 3.20 -0.60 3.10 -0.50
15 1966 2.00 3.64 -1.64 2.90 -0.90 2.70 -0.70
16 1967 1.60 3.31 -1.71 2.45 -0.85 2.14 -0.54
17 1968 1.30 2.97 -1.67 2.03 -0.73 1.71 -0.41
18 1969 1.20 2.64 -1.44 1.66 -0.46 1.38 -0.18
19 1970 1.20 2.35 -1.15 1.43 -0.23 1.24 -0.04
20 1971 1.10 2.12 -1.02 1.32 -0.22 1.21 -0.11
21 1972 0.90 1.91 -1.01 1.21 -0.31 1.12 -0.22
22 1973 1.40 1.71 -0.31 1.05 0.35 0.94 0.46
23 1974 2.00 1.65 0.35 1.23 0.77 1.31 0.69
24 1975 1.90 1.72 0.18 1.61 0.29 1.86 0.04
25 1976 2.30 1.76 0.54 1.76 0.54 1.89 0.41
26 1977 3.10 1.86 1.24 2.03 1.07 2.22 0.88
27 1978 3.50 2.11 1.39 2.56 0.94 2.92 0.58
28 1979 3.80 2.39 1.41 3.03 0.77 3.38 0.42
29 1980 3.70 2.67 1.03 3.42 0.28 3.72 -0.02
30 1981 3.10 2.88 0.22 3.56 -0.46 3.70 -0.60
31 1982 2.60 2.92 -0.32 3.33 -0.73 3.22 -0.62
32 1983 2.40 2.86 -0.46 2.96 -0.56 2.72 -0.32
33 1984 3.00 2.77 0.23 2.68 0.32 2.46 0.54
34 1985 2.40 2.81 -0.41 2.84 -0.44 2.89 -0.49
35 1986 1.80 2.73 -0.93 2.62 -0.82 2.50 -0.70
36 1987 1.70 2.54 -0.84 2.21 -0.51 1.94 -0.24
37 1988 2.20 2.38 -0.18 1.96 0.24 1.75 0.45
38 1989 2.10 2.34 -0.24 2.08 0.02 2.11 -0.01
39 1990 2.40 2.29 0.11 2.09 0.31 2.10 0.30
40 1991 2.10 2.31 -0.21 2.24 -0.14 2.34 -0.24
41 1992 2.20 2.27 -0.07 2.17 0.03 2.15 0.05
42 1993 2.70 2.26 0.44 2.19 0.51 2.19 0.51
43 1994 3.00 2.35 0.65 2.44 0.56 2.60 0.40
44 1995 2.80 2.48 0.32 2.72 0.08 2.92 -0.12

Table 59: Dividend yields, Forecasts, and errors based on exponential smoothing with α =
0.2, 0.5, 0.8
[Figure: DIV_YLD (0–7) versus CAL_YEAR (1950–2000); series shown: Actual, ES(α=.2), ES(α=.5), ES(α=.8).]

Figure 30: Plot of the data and Exponential Smoothing forecasts for Anheuser–Busch dividend data

Moving Average Exponential Smoothing


Measure 1–Period 2–Period 3–Period α = 0.2 α = 0.5 α = 0.8
ME −0.06 −0.06 −0.07 −0.32 −0.12 −0.07
MAE 0.43 0.53 0.62 0.82 0.58 0.47
MSE 0.30 0.43 0.57 0.97 0.48 0.34
MPE −3.31 −4.87 −6.62 −22.57 −7.83 −4.36
MAPE 15.98 20.07 24.19 37.01 22.69 17.29

Table 60: Relative performances of 6 forecasting methods — Anheuser–Busch data


From Computer software, autoregressions based on lags of 1, 2, and 3 periods are fit:
1–Period: Ŷt = 0.29 + 0.88Yt−1
2–Period: Ŷt = 0.29 + 1.18Yt−1 − 0.29Yt−2
3–Period: Ŷt = 0.28 + 1.21Yt−1 − 0.37Yt−2 + 0.05Yt−3
Table 62 gives raw data and forecasts based on three autoregression models. Table 61 gives the
forecasting errors. Figure 31 displays the actual outcomes and predictions.

Autoregression
Measure 1–Period 2–Period 3–Period
ME 0.00 0.00 0.00
MAE 0.41 0.38 0.39
MSE 0.27 0.24 0.24
MPE −3.47 −3.13 −3.16
MAPE 16.02 15.14 15.45

Table 61: Relative performances of 3 forecasting methods — Anheuser–Busch data

How do these methods of forecasting compare with moving averages and exponential smoothing?
[Figure: DIV_YLD (0–7) versus CAL_YEAR (1950–2000); series shown: Actual, AR(1), AR(2), AR(3).]

Figure 31: Plot of the data and Autoregressive forecasts for Anheuser–Busch dividend data
t Year Xt FAR(1),t eAR(1),t FAR(2),t eAR(2),t FAR(3),t eAR(3),t
1 1952 5.3 . . . . . .
2 1953 4.2 4.96 -0.76 . . . .
3 1954 3.9 3.99 -0.09 3.72 0.18 . .
4 1955 5.2 3.72 1.48 3.68 1.52 3.72 1.48
5 1956 5.8 4.87 0.93 5.30 0.50 5.35 0.45
6 1957 6.3 5.40 0.90 5.64 0.66 5.58 0.72
7 1958 5.6 5.84 -0.24 6.06 -0.46 6.03 -0.43
8 1959 4.8 5.22 -0.42 5.09 -0.29 5.03 -0.23
9 1960 4.4 4.52 -0.12 4.34 0.06 4.35 0.05
10 1961 2.8 4.16 -1.36 4.10 -1.30 4.12 -1.32
11 1962 3.2 2.75 0.45 2.33 0.87 2.29 0.91
12 1963 3.1 3.11 -0.01 3.26 -0.16 3.35 -0.25
13 1964 3.1 3.02 0.08 3.03 0.07 3.00 0.10
14 1965 2.6 3.02 -0.42 3.06 -0.46 3.05 -0.45
15 1966 2 2.58 -0.58 2.47 -0.47 2.44 -0.44
16 1967 1.6 2.05 -0.45 1.90 -0.30 1.90 -0.30
17 1968 1.3 1.70 -0.40 1.60 -0.30 1.61 -0.31
18 1969 1.2 1.43 -0.23 1.36 -0.16 1.37 -0.17
19 1970 1.2 1.35 -0.15 1.33 -0.13 1.34 -0.14
20 1971 1.1 1.35 -0.25 1.36 -0.26 1.36 -0.26
21 1972 0.9 1.26 -0.36 1.24 -0.34 1.23 -0.33
22 1973 1.4 1.08 0.32 1.03 0.37 1.03 0.37
23 1974 2 1.52 0.48 1.68 0.32 1.70 0.30
24 1975 1.9 2.05 -0.15 2.25 -0.35 2.23 -0.33
25 1976 2.3 1.96 0.34 1.96 0.34 1.92 0.38
26 1977 3.1 2.31 0.79 2.46 0.64 2.47 0.63
27 1978 3.5 3.02 0.48 3.29 0.21 3.28 0.22
28 1979 3.8 3.37 0.43 3.53 0.27 3.49 0.31
29 1980 3.7 3.64 0.06 3.77 -0.07 3.75 -0.05
30 1981 3.1 3.55 -0.45 3.56 -0.46 3.54 -0.44
31 1982 2.6 3.02 -0.42 2.88 -0.28 2.86 -0.26
32 1983 2.4 2.58 -0.18 2.47 -0.07 2.47 -0.07
33 1984 3 2.40 0.60 2.37 0.63 2.39 0.61
34 1985 2.4 2.93 -0.53 3.14 -0.74 3.16 -0.76
35 1986 1.8 2.40 -0.60 2.26 -0.46 2.20 -0.40
36 1987 1.7 1.87 -0.17 1.72 -0.02 1.73 -0.03
37 1988 2.2 1.79 0.41 1.78 0.42 1.80 0.40
38 1989 2.1 2.23 -0.13 2.40 -0.30 2.41 -0.31
39 1990 2.4 2.14 0.26 2.13 0.27 2.10 0.30
40 1991 2.1 2.40 -0.30 2.52 -0.42 2.53 -0.43
41 1992 2.2 2.14 0.06 2.08 0.12 2.05 0.15
42 1993 2.7 2.23 0.47 2.28 0.42 2.29 0.41
43 1994 3 2.67 0.33 2.84 0.16 2.85 0.15
44 1995 2.8 2.93 -0.13 3.05 -0.25 3.03 -0.23

Table 62: Average dividend yields and Forecasts/errors based on autoregression with lags of 1, 2,
and 3 periods
17 Lecture 17 — Autocorrelation
Textbook Section: 15.5
Problems: See Lecture

Recall a key assumption in regression: error terms are independent. When data are collected
over time, the errors are often serially correlated (autocorrelated). Under first-order autocorrelation,
consecutive error terms are linearly related:

εt = ρεt−1 + νt

where ρ is the correlation between consecutive error terms, and νt is a normally distributed
independent error term. When errors display a positive correlation, ρ > 0 (consecutive error
terms are associated). We can test this relation as follows; note that when ρ = 0, error terms
are independent (which is the assumption in the derivation of the tests in the chapters on linear
regression).

Durbin–Watson Test for Autocorrelation

H0 : ρ = 0 (No autocorrelation)    Ha : ρ > 0 (Positive autocorrelation)

D = Σ(t=2..n) (et − et−1 )² / Σ(t=1..n) e²t

D ≥ dU =⇒ Don’t Reject H0
D ≤ dL =⇒ Reject H0
dL ≤ D ≤ dU =⇒ Withhold judgement
Values of dL and dU (indexed by n and k (the number of predictor variables)) are given in Table
A.9, p. A–27
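Given the residuals from a fitted model, D is a two-line computation; a numpy sketch (assumed here for illustration):

    import numpy as np

    def durbin_watson(e):
        """Sum of squared successive residual differences over the sum of
        squared residuals; values near 0 suggest positive autocorrelation."""
        e = np.asarray(e, dtype=float)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)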

“Cures” for Autocorrelation:

• Additional independent variable(s) — A variable may be missing from the model that will
eliminate the autocorrelation (see example).

• Transform the variables — Take “first differences” (Xt+1 − Xt ) and (Yt+1 − Yt ) and run
regression with transformed Y and X.

Example — Autocorrelation — P&G Sales and CPI


Y — Quarterly Sales for Procter & Gamble (1965(q1)–1995(q4))
X — Consumer Price Index for quarter (1982–1984=100)
(Data Sources: Value Line and Economic Indicators Handbook (3rd Ed.))
Simple Regression: Ŷt = b0 + b1 Xt = −1742.62 + 58.353Xt

The raw data are given in Table 63, and plotted (with the fitted equation) in Figure 32. Figure 33
gives a plot of residuals vs time order. Notice the distinct pattern in the residuals and that
consecutive residuals are very close to one another.
Compute the first three residuals, and their contributions to the numerator and denominator
of the D–W statistic.
For the entire sample, we obtain: n = 124, k = 1, D = 0.092 — Test for autocorrelation.
1st Qtr 2nd Qtr 3rd Qtr 4th Qtr
Year Sales CPI Sales CPI Sales CPI Sales CPI
1965 523.0 31.2 486.9 31.5 527.8 31.6 520.9 31.7
1966 558.5 32.0 531.2 32.3 591.8 32.6 561.7 32.9
1967 643.7 32.9 564.0 33.2 633.5 33.5 597.5 33.8
1968 659.9 34.2 590.1 34.5 668.8 35.0 623.8 35.4
1969 695.3 35.8 648.4 36.4 694.6 37.0 669.3 37.5
1970 747.7 38.0 706.8 38.6 760.0 39.1 764.3 39.6
1971 815.2 39.9 764.0 40.3 799.6 40.8 799.3 41.0
1972 904.6 41.3 816.4 41.6 915.0 42.0 878.4 42.4
1973 975.2 42.9 913.3 43.9 1024.3 44.9 993.9 45.9
1974 1158.6 47.2 1136.1 48.5 1338.9 50.0 1278.7 51.5
1975 1530.6 52.4 1455.6 53.2 1587.8 54.4 1507.7 55.2
1976 1586.7 55.8 1541.2 56.5 1736.8 57.4 1648.0 58.0
1977 1830.0 59.0 1731.0 60.3 1921.0 61.2 1802.0 61.9
1978 1932.0 62.9 1929.0 64.5 2173.0 66.1 2065.0 67.4
1979 2286.0 69.1 2249.0 71.5 2457.0 73.8 2337.0 75.9
1980 2664.0 78.9 2622.0 81.8 2790.0 83.3 2696.0 85.5
1981 2908.0 87.8 2759.0 89.8 2949.0 92.4 2800.0 93.7
1982 3026.0 94.5 2895.0 95.9 3094.0 97.7 2979.0 97.9
1983 3201.0 97.9 3030.0 99.1 3131.0 100.3 3090.0 101.2
1984 3277.0 102.3 3135.0 103.4 3238.0 104.5 3251.0 105.3
1985 3485.0 106.0 3375.0 107.3 3350.0 108.0 3342.0 109.0
1986 3605.0 109.2 3865.0 109.0 4081.0 109.8 3888.0 110.4
1987 4356.0 111.6 4255.0 113.1 4222.0 114.4 4167.0 115.4
1988 4664.0 116.1 4839.0 117.5 4860.0 119.1 4973.0 120.3
1989 5267.0 121.7 5268.0 123.7 5430.0 124.7 5433.0 125.9
1990 5807.0 128.0 6025.0 129.3 6123.0 131.6 6126.0 133.7
1991 6652.0 134.8 6857.0 135.6 6795.0 136.7 6722.0 137.7
1992 7205.0 138.7 7597.0 139.8 7483.0 140.9 7167.0 141.9
1993 7879.0 143.1 7839.0 144.2 7350.0 144.8 7365.0 145.8
1994 7564.0 146.7 7788.0 147.6 7441.0 148.9 7503.0 149.6
1995 8161.0 150.9 8467.0 152.2 8312.0 152.9 8494.0 153.6

Table 63: Quarterly Sales for P&G (Y ) and CPI (X) — 1965–1995

Here, we attempt to cure the autocorrelation. The relationship appears approximately linear,
with two slopes, with the split in 1985(q4) (CPI=109.0).
A LEXIS/NEXIS search shows that the company bought some OTC drug companies around
this time. Could this have changed rate of increase? Also, there was an uproar that their corporate
logo was Satanic around this time. Maybe a deal with Satan — Increased revenues for advertising
space?
We fit a piecewise linear regression model with an interesting use of dummy variables and
interaction terms:

Ŷt = b0 + b1 X1t + b2 (X1t − 109.0)X2t

where X1t is the CPI at time t, and X2t = 1 if after 1985(q4), 0 if before.
[Figure: SALES (0–9000) versus CPI (0–200), with the Model 1 fitted line.]

Figure 32: Plot of sales vs CPI and the fitted equation — P&G data (Model 1)

[Figure: residuals (−2000 to 2000) versus TIME (0–140) for Model 1.]

Figure 33: Plot of residuals vs time order — P&G data (Model 1)


We obtain the following fitted equation:

Ŷt = −757.98 + 40.548X1t + 68.580(X1t − 109.0)X2t
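The piecewise model is an ordinary regression once the design matrix is built; a numpy sketch (assumed here for illustration) of the three columns used for Model 2:

    import numpy as np

    def model2_design(cpi, after_1985q4):
        """Columns: intercept, CPI, and the slope-change term (CPI - 109)*X2."""
        cpi = np.asarray(cpi, dtype=float)
        x2 = np.asarray(after_1985q4, dtype=float)
        return np.column_stack([np.ones_like(cpi), cpi, (cpi - 109.0) * x2])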

Figure 34 gives a plot of the raw data and fitted equation and Figure 35 gives a plot of residuals
vs time order. Is there a pattern in the residuals? Are consecutive residuals close or far apart?
Compute the last three residuals, and their contributions to the numerator and denominator of the
D–W statistic. For the full sample, we obtain n = 124, k = 2, D = 0.986 — Test for autocorrelation.
[Figure: SALES (0–9000) versus CPI (0–200), with the Model 2 piecewise fitted line.]

Figure 34: Plot of sales vs CPI and the fitted equation — P&G data (Model 2)

[Figure: residuals (−600 to 600) versus TIME (0–140) for Model 2.]

Figure 35: Plot of residuals vs time order — P&G data (Model 2)
