2 Simple Linear Regression I Least Squares Estimation
These notes are intended to simultaneously review and extend the basic concepts of STA 2023
that are used in business applications. In this section, we describe the notions of:
• Variable Types
PO1 All firms listed on the New York Stock Exchange (NYSE) throughout year 2001.
Samples are subsets of their corresponding populations, used to describe or make inferences
concerning particular characteristics of the elements of the population. Examples include:
SA1 30 firms sampled at random from all firms listed on NYSE throughout 2001.
SA3 A randomly selected set of 250 pairs of Levi’s 550 jeans produced in January, 2002.
• A sample of 2007 American adults was asked if they thought there would be a recession in
the next five years. Of those sampled, 66% answered “Yes”. Based on this sample we can be
very confident that a majority of American adults feel there will be a recession in the next
five years. (Source: WSJ, 6/27/97, p. R1).
• After determining a safe dosing regimen, drug manufacturers must demonstrate efficacy of
new drugs by comparing them with a placebo or standard drug in large–scale Phase III trials.
In one such trial for the antidepressant Prozac (Eli Lilly & Co), researchers measured the
change from baseline on the Hamilton Depression (HAM–D) scale. Based on a sample of 185
patients receiving Prozac, the mean change (improvement) was 11.0 points, and among 169
patients receiving placebo, the mean change was 8.2 points. Based on these samples, we can
conclude that mean change from baseline in all patients receiving Prozac would be higher
than the mean change from baseline in all patients receiving placebo at a very high level of
confidence.
Frequency — Labelled “Frequency”, this gives the list of the numbers of counties falling in the
various categories.
Relative Frequency — Labelled “Percent”, this gives the percentage of the counties falling in
each of the categories.
Cumulative Frequency — Labelled “Cumulative Frequency”, this gives the number of counties
falling in or below this category.
Relative Cumulative Frequency — Labelled “Cumulative Percent”, this gives the percent of
counties falling in or below this category.
Cumulative Cumulative
pci94 Frequency Percent Frequency Percent
---------------------------------------------------------
5-10 1 1.49 1 1.49
10-15 21 31.34 22 32.84
15-20 28 41.79 50 74.63
20-25 10 14.93 60 89.55
25-30 3 4.48 63 94.03
30-35 4 5.97 67 100.00
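A table like this can be produced directly from raw data. Below is a minimal sketch in Python (pandas), assuming the per capita incomes are held in a series named pci94; the ten values shown are placeholders, not the actual county data.

```python
import pandas as pd

# Hypothetical per capita incomes (in $1000s) used only for illustration;
# the actual county values appear in the data listing below.
pci94 = pd.Series([19.4, 14.9, 17.9, 13.2, 19.6, 24.6, 12.4, 30.7, 33.3, 9.5])

# Bin into the same 5-unit ranges used above and count counties per bin.
counts = pd.cut(pci94, bins=[5, 10, 15, 20, 25, 30, 35]).value_counts().sort_index()

table = pd.DataFrame({"Frequency": counts})
table["Percent"] = 100 * table["Frequency"] / table["Frequency"].sum()
table["Cumulative Frequency"] = table["Frequency"].cumsum()
table["Cumulative Percent"] = table["Percent"].cumsum()
print(table)
```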
Various graphs are useful in describing bodies of data, and are often given in business reporting.
Histograms – Vertical bar charts that identify categories for categorical variables and ranges of
values for interval scale variables, with heights representing frequencies of outcomes for a
single variable.
Pie Charts – Circular graphs where the size of each “slice” represents the frequency for a partic-
ular category or range of values.
Scatter Plots – Plots of pairs of outcomes on two variables, where each point on the graph
represents a single element from a set of data.
Time Series Plots – Plot of a single variable that is measured over a series of points in time.
                              Total        Per Capita    Retail
County         Population     Income       Income        Sales
Alachua 193353 3747486 19.382 305841
Baker 19786 293855 14.852 14637
Bay 139507 2495859 17.891 243300
Bradford 24004 318011 13.248 17103
Brevard 442637 8677944 19.605 574026
Broward 1386497 34167902 24.643 2493269
Calhoun 11738 146096 12.446 9410
Charlotte 125832 2400459 19.077 156605
Citrus 104035 1672558 16.077 102660
Clay 120257 2238700 18.616 154681
Collier 177778 5452519 30.670 361542
Columbia 47886 719253 15.020 57166
Dade 2012237 40530049 20.142 3251235
De Soto 25074 410138 16.357 21529
Dixie 11706 141523 12.090 8725
Duval 702846 14553773 20.707 1206452
Escambia 272187 4800237 17.636 391410
Flagler 37818 594050 15.708 26894
Franklin 9906 151954 15.340 8930
Gadsden 43021 617896 14.363 30312
Gilchrist 11929 149748 12.553 3492
Glades 7615 111747 14.675 2548
Gulf 13041 187590 14.385 8380
Hamilton 11570 141496 12.230 6139
Hardee 21611 338067 15.643 18125
Hendry 29325 505741 17.246 27234
Hernando 117141 1872699 15.987 106199
Highlands 73685 1296740 17.598 82240
Hillsborough 871046 17631999 20.242 1546211
Holmes 16933 215197 12.709 9701
Indian River 95250 2772529 29.108 148059
Jackson 43787 664992 15.187 45529
Jefferson 12761 185227 14.515 7127
Lafayette 5873 79439 13.526 2237
Lake 173250 3170498 18.300 188032
Lee 367322 8103201 22.060 637949
Leon 211763 4190977 19.791 358281
Levy 28827 396698 13.761 24600
Liberty 6257 89835 14.358 2405
Madison 17197 223594 13.002 12120
Manatee 226289 5194196 22.954 308061
Marion 219358 3655070 16.663 290072
Martin 109027 3491389 32.023 181558
Monroe 81460 2068322 25.391 195625
Nassau 49496 1035360 20.918 52127
Okaloosa 160725 3048783 18.969 250499
Okeechobee 31036 463635 14.939 32529
Orange 740474 15108479 20.404 1598855
Osceola 126386 2049838 16.219 197719
Palm Beach 959721 31994145 33.337 1674647
Pasco 298677 5051203 16.912 282914
Pinellas 865364 21502994 24.848 1499770
Polk 429408 7661229 17.841 746285
Putnam 68598 978635 14.266 55690
St. Johns 98214 2519924 25.657 131375
St. Lucie 169116 2788362 16.488 178863
Santa Rosa 99003 1695027 17.121 68145
Sarasota 291722 8831912 30.275 540520
Seminole 323719 7062419 21.817 507307
Sumter 33367 486950 14.594 26922
Suwannee 29489 436779 14.812 36258
Taylor 17332 268153 15.472 19028
Union 12193 115280 9.455 3937
Volusia 403899 7154872 17.715 561530
Wakulla 16665 258477 15.510 9233
Walton 32677 470799 14.408 32274
Washington 17984 248533 13.820 12653
Data Maps – Plot of a single variable that is measured over a series of points in 2-dimensional
space.
A histogram of per capita income is given in Figure 1. We see that most counties are in the
range of $10,000 to $25,000 (the second, third, and fourth ranges of values), with one county lower
than this range, and the seven most affluent counties being above this range. A pie chart of the
same data is given in Figure 2.
A scatter plot of retail sales (on the vertical or up/down axis) versus total income (on the
horizontal or left/right axis) is given in Figure 3. A tendency for counties with higher total incomes
to have higher retail sales can be seen. This is considered to be a positive association.
A data map of per capita income is given in Figure 4. We can see visually where the most
affluent and poorest counties are.
A time series plot of monthly average airfares (per 1000 miles of domestic flights) is given in
Figure 5 for the period January 1980 through December 2001 (Source: Air Transport Association).
We observe periodic trends (as demand shifts throughout the year) as well as longer term cycles;
however, the series shows only a very small long-term increase in trend. These prices are
not adjusted for inflation and are called nominal prices (not to be confused with nominal variable
types). Figure 6 gives the series adjusted for inflation, showing that real prices have decreased
over this period. Figure 7 gives the monthly consumer price index (CPI) over this 22 year (264
month) period (Source: US Department of Commerce).
µ Population mean — The average value of all elements in the population. It is also considered
the ‘long–run’ average measurement in terms of conceptual populations. It can be thought of
as the value each unit would receive if the total of the outcomes had been evenly distributed
among the units.
P Population proportion — The proportion of all elements of the population that possess a
particular characteristic.
PA1 The proportion of all NYSE listed firms whose stock value increased in 2001 (P ).
PA2 The proportion of all living UF graduates who are members of the alumni association (P ).
PA3 The mean number of flaws in all pairs of Levi’s 550 jeans manufactured in January, 2002 (µ).
PA4 The proportion of all people who have (or will have) a disease that show remission due to
drug treatment (P ).
PA5 The difference between mean lifetimes of two brands of automobile tires (µ1 − µ2 ).
Figure 1: Histogram of per capita income (pci94) among Florida counties

Figure 2: Pie chart of per capita income (pci94) among Florida counties
Figure 3: Scatter plot of retail sales versus total income among Florida counties
Figure 4: Data map of per capita income (pci94) among Florida counties
Figure 5: Monthly nominal (unadjusted for inflation) airfares (price per 1000 miles) on domestic
flights
Figure 6: Monthly real (adjusted for inflation) airfares (price per 1000 miles) on domestic flights
Figure 7: Monthly consumer price index (CPI), January 1980 through December 2001
Statistics are numerical descriptive measures corresponding to samples. We will use the general
notation θ̂ to represent statistics. Special cases include:
Mode — Outcome that occurs most often. It is usually reported for nominal or ordinal variables,
or simply as the peak of the distribution when the variable is continuous.
Median — Middle value (after numbers have been sorted from smallest to largest). Can be
reported for ordinal or interval scale data. Let X(1) be the smallest, X(n) be the largest, and
X(i) be the i-th ordered observation in a sample of n items:

n odd: Median = M = X((n+1)/2)        n even: Median = M = [X(n/2) + X(n/2+1)] / 2
S² Sample variance — Measure of the spread (around X̄) of the elements of the sample:

S² = Σ_{i=1}^n (X_i − X̄)² / (n − 1) = [Σ_{i=1}^n X_i² − n X̄²] / (n − 1)
S Sample standard deviation — Measure of the spread (around X̄) of the elements of the sample:

S = √[ Σ_{i=1}^n (X_i − X̄)² / (n − 1) ] = √[ (Σ_{i=1}^n X_i² − n X̄²) / (n − 1) ]
p̂ Sample proportion — The proportion of elements in the sample that have a particular charac-
teristic:
ST1 Among a random sample of n = 50 firms listed on the NYSE in 2001, 18 (p̂ = 18/50 = 0.36)
had stock prices increase during 2001.
ST2 Among a sample of 200 UF graduates, 44 are paying members of the alumni association
(p̂ = 44/200 = 0.22).
ST3 A quality inspector samples 60 pairs of Levis 550 jeans, and finds a total of 66 flaws, yielding
an average of X = 66/60 = 1.10 flaws per pair of jeans.
ST4 Of 20 patients selected with a particular disease, 12 (p̂ = 12/20 = .60) show some remission
after drug treatment.
ST5 Samples of 20 tires from each of two manufacturers are obtained, and the number of miles
run until the tread is worn to the legal limit are measured. Brand A has an average of
X 1 = 27, 459 miles, while Brand B has an average of X 2 = 32, 671 miles. The difference
between the two brands’ sample means is X 1 − X 2 = 27, 459 − 32, 671 = −5212 miles.
ST6 Independent samples of male and female consumers find that among males, p̂1 = 0.26 have
made credit card purchases over the internet. Among females, p̂2 = 0.44 have made credit
card purchases on the internet.
Statistics based on samples will be used to estimate parameters corresponding to populations, as
well as test hypotheses concerning the true values of parameters.
A sample of n = 5 firms is obtained from the NYSE, and their closing prices are given in
Table 1.5. We then compute the sample mean, median, variance, and standard deviation, where
X_i is the closing price for firm i.
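Since Table 1.5 is not reproduced in this excerpt, the following sketch uses five hypothetical closing prices purely to illustrate the computations defined above.

```python
import statistics

prices = [23.50, 41.25, 12.75, 33.00, 18.50]  # hypothetical closing prices
n = len(prices)

xbar = sum(prices) / n                               # sample mean
med = statistics.median(prices)                      # sample median
s2 = sum((x - xbar) ** 2 for x in prices) / (n - 1)  # sample variance S^2
s = s2 ** 0.5                                        # sample standard deviation S
print(xbar, med, s2, s)
```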
• P(Ā) = 1 − P(A)
A special case occurs when events A and B are said to be independent. This is when P(A|B) =
P(A), or equivalently P(B|A) = P(B); in this situation, P(AB) = P(A)P(B). We will be using
this idea later in this course.
                        Cardiac Event
Treatment        Present (B)   Absent (B̄)   Total
Pravachol (A)        174          3128       3302
Placebo (Ā)          248          3045       3293
Total                422          6173       6595
If we define the event A to be that the patient received pravachol, and the event B to be that
the patient suffers from a cardiac event over the study period, we can use the table to obtain some
pertinent probabilities:
9. P(B|Ā) = P(ĀB)/P(Ā) = .0376/.4993 = .0753
In general if B has k possible (mutually exclusive and exhaustive) outcomes, the rule can be
stated as follows:
P(B_j|A) = P(AB_j)/P(A) = P(AB_j) / Σ_{i=1}^k P(AB_i) = P(A|B_j)P(B_j) / Σ_{i=1}^k P(A|B_i)P(B_i)
A manager cannot observe whether her salesperson works hard. She believes based on prior
experience that the probability her salesperson works hard (H) is 0.30. She believes that if the
salesperson works hard, the probability a sale (S) is made is 0.75. If the salesperson does not work
hard, the probability the sale is made is 0.15.
What is the probability that the salesperson worked hard if the sale was made? If not made?
• Pr{Works Hard} = P(H) = 0.30    Pr{Not Works Hard} = P(H̄) = 1 − 0.30 = 0.70
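As a numerical check, here is a short sketch applying Bayes’ rule with the probabilities stated above.

```python
# P(H) = 0.30, P(S|H) = 0.75, P(S|not H) = 0.15, as given in the text.
p_H = 0.30
p_S_given_H = 0.75
p_S_given_notH = 0.15

# Total probability a sale is made.
p_S = p_S_given_H * p_H + p_S_given_notH * (1 - p_H)    # 0.33

p_H_given_S = p_S_given_H * p_H / p_S                   # about 0.682
p_H_given_notS = (1 - p_S_given_H) * p_H / (1 - p_S)    # about 0.112
print(p_H_given_S, p_H_given_notS)
```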
A — A randomly selected person’s blood matches that found at the crime scene
Assume that a guilty person’s blood will match with that at the crime scene with certainty.
In terms of diagnostic testing, the sensitivity of this test is 100% and the specificity of the test is
99.57%. That is:
Suppose you had a prior (to observing blood evidence) probability that O.J. was innocent of
0.5 (P (B) = 0.5). You now find out that his blood matches that at the crime scene. What is your
updated probability that he is innocent (ignoring possibility of tampering)?
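A minimal sketch of this update, using only the sensitivity (100%), the specificity (99.57%), and the 0.5 prior stated above:

```python
prior_innocent = 0.5
p_match_given_guilty = 1.00          # sensitivity: guilty blood always matches
p_match_given_innocent = 1 - 0.9957  # 1 - specificity = 0.0043

p_match = (p_match_given_innocent * prior_innocent
           + p_match_given_guilty * (1 - prior_innocent))
posterior_innocent = p_match_given_innocent * prior_innocent / p_match
print(posterior_innocent)  # about 0.0043 -- the match is strong evidence
```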
a) What is the probability a randomly selected person received water from the Lambeth com-
pany? From the S&V company?
b) What is the probability a randomly selected person died of cholera? Did not die of cholera?
c) What proportion of the Lambeth consumers died of cholera? Among the S&V consumers?
Is the incidence of cholera death independent of firm?
d) What is the probability a person received water from S&V, given (s)he died of cholera?
Source: W.H. Frost (1936). Snow on Cholera, London, Oxford University Press.
3 Lecture 3 – Discrete Random Variables and Probability Distributions
Textbook Sections: 5.1,5.2,Notes(for bivariate r.v.’s)
Problems: 5.1,5.3, see lecture notes
RV1 The number of surveyed voters who favor Al Gore in the upcoming election from a survey of
722 registered voters
RV2 The number of military personnel that oppose the military’s ban on homosexuals from a
survey of 300 current military personnel
RV4 The number of patients, out of a group of 20 under study, that react positively to a new drug
treatment
RV5 The number of successful shuttle launches out of the first 30 shuttle missions
Continuous random variables can take on any value corresponding to points on a line interval. It
should be noted that while this type of variable occurs on a continuous scale, it is measured on some
sort of discrete scale (a news weatherman reports the temperature as 93◦ F , not 92.7756 . . . ◦ F ).
Examples include:
RV3 The gas mileage of a Ford Mustang GT convertible when run at 65 miles per hour.
RV7 The number of miles a tire can travel before wearing out.
These are considered random variables because we have randomly selected some subject or object
from a population of such subjects (objects). The populations of these subjects (whether existing
or conceptual) are said to have probability distributions. These are models of the distribution
of the measurements corresponding to the elements of the population.
Discrete probability distributions are a set of outcomes (denoted by x) and their corre-
sponding probabilities. The distribution can be presented in terms of a table, graph, or formula
representing each possible outcome of the random variable and its probability of occurring. Defining
p(x) as “the probability the random variable takes on the value x”, we have the following simple
rules for discrete probability distributions:
1. 0 ≤ p(x) ≤ 1

2. Σ_x p(x) = 1
Thus, all probabilities must be between 0 and 1, and all probabilities must sum to 1.
We consider discrete random variables in this lecture.
x p(x)
0 .46771566391
1 .40089914050
2 .11654044782
3 .01412611489
4 .00070630574
5 .00001228358
6 .00000004356
Table 3: Probability distribution for number of winning digits on a Florida lotto ticket
Note that all probabilities are between 0 and 1, and that they sum to 1. Of course, your ticket
is worthless unless x ≥ 3, so the probability distribution corresponding to your prize amount will
be different than this distribution (you will pool p(0), p(1), and p(2) to obtain the probability you
win nothing (.98515525223)).
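The probabilities in Table 3 are consistent with a hypergeometric model in which 6 winning numbers are drawn from 53 and x of your 6 picks match; the game’s exact rules are not stated in this excerpt, so that reading is an assumption. A sketch reproducing the table:

```python
from scipy.stats import hypergeom

# Assumed model: population of 53 numbers, 6 of them "winning",
# and your ticket is a draw of 6. x = number of your picks that win.
rv = hypergeom(M=53, n=6, N=6)
for x in range(7):
    print(x, rv.pmf(x))

# Pooling p(0), p(1), p(2) gives the probability the ticket wins nothing.
print("P(win nothing) =", rv.pmf(0) + rv.pmf(1) + rv.pmf(2))  # about .98516
```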
For discrete probability distributions, the mean, µ, is interpreted as the ‘long run average
outcome’ if the experiment were conducted many times. The variance, σ², is a measure of how
variable these outcomes are. The variance is the average squared distance between the outcome
of the random variable and the mean. The positive square root of the variance is the standard
deviation, σ, and is in the units of the original data.
For a discrete random variable:
• µ = E(X) = Σ_x x · p(x)

• σ² = V(X) = E[(X − µ)²] = Σ_x (x − µ)² · p(x) = Σ_x x² · p(x) − µ²

• σ = +√σ²
Table 4: Probability distribution for number of winning digits on a Florida lotto ticket
The variation in the correct numbers is relatively small as well, reflecting the fact that almost
always people get either 0, 1, or 2 correct numbers.
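Both quantities can be computed directly from Table 3 with the formulas above; a quick sketch:

```python
# Probabilities from Table 3, keyed by the number of winning digits x.
p = {0: .46771566391, 1: .40089914050, 2: .11654044782, 3: .01412611489,
     4: .00070630574, 5: .00001228358, 6: .00000004356}

mu = sum(x * px for x, px in p.items())                  # about 0.68
sigma2 = sum((x - mu) ** 2 * px for x, px in p.items())  # variance
print(mu, sigma2, sigma2 ** 0.5)
```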
Thus, buyers will not pay over $2333.33 for a used car, and since the value of peaches is $2500 to
sellers, only lemons will be sold, and buyers will learn that, and pay only $2000. At what fraction
of the cars being peaches, will both types of cars be sold?
For a theoretical treatment of this problem, see e.g. D.M. Kreps, A Course in Microeconomic
Theory, Chapter 17.
                  Stock B
                 6%     10%
Stock A    0%    .10    .40
          16%    .40    .10
Stock A Stock B
x P (X = x) y P (Y = y)
0 .10+.40=.50 6 .10+.40=.50
16 .40+.10=.50 10 .40+.10=.50
E(Y ) = µY = 6(.5) + 10(0.5) = 8.0 V (Y ) = σY2 = (6 − 8)2 (0.5) + (10 − 8)2 (0.5) = 4.0
So, both stocks have the same expected return, but stock A is riskier, in the sense that its
variance is much larger.
How do X and Y “co-vary” together?
For these two firms, we find that the covariance is negative, since high values of X tend to be
seen with low values of Y and vice versa. We compute the Covariance of their returns, which we
denote as COV (X, Y ) = E(X − µX )(Y − µY ) in Table 7.
COV(X, Y) = E[(X − µ_X)(Y − µ_Y)] = σ_XY = Σ_x Σ_y (x − µ_X)(y − µ_Y) p(x, y) = E(XY) − µ_X µ_Y
V(X + Y) = V(X) + V(Y) + 2 COV(X, Y) = σ_X² + σ_Y² + 2σ_XY
x y p(x, y) x+y
0 6 .10 6
0 10 .40 10
16 6 .40 22
16 10 .10 26
R = pX + (1 − p)Y
E(R) =
V (R) =
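The E(R) and V(R) lines are left to be filled in. As one way to check your work, this sketch computes the covariance from the joint distribution above and applies the standard formulas E(R) = pµ_X + (1 − p)µ_Y and V(R) = p²V(X) + (1 − p)²V(Y) + 2p(1 − p)COV(X, Y); the weight p = 0.5 is an arbitrary choice for illustration.

```python
# Joint distribution of (X, Y) from the table above: X = stock A return,
# Y = stock B return (in percent).
joint = {(0, 6): .10, (0, 10): .40, (16, 6): .40, (16, 10): .10}

mu_x = sum(x * pr for (x, y), pr in joint.items())   # 8.0
mu_y = sum(y * pr for (x, y), pr in joint.items())   # 8.0
var_x = sum((x - mu_x) ** 2 * pr for (x, y), pr in joint.items())        # 64.0
var_y = sum((y - mu_y) ** 2 * pr for (x, y), pr in joint.items())        # 4.0
cov = sum((x - mu_x) * (y - mu_y) * pr for (x, y), pr in joint.items())  # -9.6

p = 0.5  # portfolio weight on stock A (illustrative)
ER = p * mu_x + (1 - p) * mu_y
VR = p**2 * var_x + (1 - p)**2 * var_y + 2 * p * (1 - p) * cov
print(cov, ER, VR)  # negative covariance lowers the portfolio variance
```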
Problem 3.1
Conduct the analysis for two complementary industries, where their fortunes tend to be good/bad
simultaneously. The joint probability distribution is given in Table 9.
A classic paper on this topic (more mathematically rigorous than this example, where each stock
has only two possible outcomes) is given in: Harry M. Markowitz, “Portfolio Selection,” Journal of
Finance, 7 (March 1952), pp 77-91.
4 Lecture 4 – Introduction to Decision Analysis
Textbook Sections: 18.1,18.2(1st 2 subsections),18.3(1st 3 subsections)
Problems: 18.1a,b,3,5,6,7,8,9
Oftentimes managers must make long-term decisions without knowing what future events will
occur that will affect the firm’s financial outcome from their decisions. Decision analysis is a means
for managers to consider their choices and to help them select an optimal strategy. For instance:
• Financial officers must decide among certain investment strategies without knowing the state
of the economy over the investment horizon.
• A buyer must choose a model type for the firm’s fleet of cars, without knowing what gas
prices will be in the future.
• A drug company must decide whether to aggressively develop a new drug without knowing
whether the drug will be effective in the patient population.
Decision analysis in its simplest form includes the following components:
Decision Alternatives – These are the actions that the decision maker has to choose from.
States of Nature – These are occurrences that are out of the control of the decision maker, and
that occur after the decision has been made.
Payoffs – Benefits (or losses) that result when a particular decision alternative has been selected
and a given state of nature has been observed.
Payoff Table – A tabular listing of payoffs for all combinations of decision alternatives and states
of nature.
Maximax – Look at the maximum payoff for each decision alternative. Choose the alternative
with the highest maximum payoff. This is optimistic.
Maximin – Look at the minimum payoff for each decision alternative. Choose the alternative
with the highest minimum payoff. This is pessimistic.
Case 3 - Decision Making Under Risk
In this case, the decision maker does not know which state will occur, but does have probabilities
to assign to the states. Payoff tables can be written in the form of decision trees. Note that in the
diagrams below, squares refer to decision alternatives and circles refer to states of nature.
Expected Monetary Value (EMV) – This is the expected payoff for a given decision al-
ternative. We take each payoff times the probability of that state occurring, and sum across
states. There will be one EMV per decision alternative. One criterion commonly used is to select
the alternative with the highest EMV.
Expected Value of Perfect Information (EVPI) – This is a measure of how valuable it
would be to know what state will occur. First we obtain the expected payoff with perfect information
by multiplying the probability of each state of nature and its highest payoff, then summing over
states of nature. Then we subtract off the highest EMV to obtain EVPI.
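A compact sketch of both computations, using a made-up payoff table with two alternatives and three states (all numbers hypothetical):

```python
# Rows: decision alternatives; columns: payoffs under each state of nature.
payoffs = {"A1": [50, 20, -10],
           "A2": [30, 30, 10]}
probs = [0.3, 0.5, 0.2]  # probabilities of the three states

# EMV: probability-weighted payoff for each alternative.
emv = {a: sum(x * p for x, p in zip(row, probs)) for a, row in payoffs.items()}
best_emv = max(emv.values())

# Expected payoff with perfect information: best payoff in each state.
eppi = sum(p * max(payoffs[a][s] for a in payoffs) for s, p in enumerate(probs))
evpi = eppi - best_emv
print(emv, eppi, evpi)  # here EMV(A2) = 26, EPPI = 32, so EVPI = 6
```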
• Birth rates stay constant and life expectancies stay constant (B0/L0)
b) Give the maximax and maximin decisions and their corresponding criteria:
c) Suppose we are given the probability distribution for the 6 states of nature in Table 11.
State Probability
B − /L0 0.05
B0/L0 0.10
B + /L0 0.15
B − /L+ 0.15
B0/L+ 0.25
B + /L+ 0.30
Table 11: Probability distribution for states of nature for drug development decision
To obtain the expected monetary value for each decision alternative, we multiply the payoffs
for each state of nature by their corresponding probabilities, summing over states of nature. For
the decision to develop only the childhood drug:
Neither – EM V (Neither)=
Child – EM V (Child)=
Elderly – EM V (Elderly)=
Both – EM V (Both)=
Based on the EMV criteria, which decision should the firm make?
d) A firm that conducts extensive research on population dynamics can be hired and can be
expected to tell your firm exactly which state of nature will occur. Give the expected payoff under
perfect information, and how much you would be willing to pay for that (EVPI).
• Probability that drug will prove effective and obtain FDA approval – 0.10
Ignoring tremendous social pressure, does Merck build the factory now, or wait two years and
observe the results of clinical trials (thus forfeiting market share to Hoffman-Laroche and Abbott,
who are in fierce competition with Merck)? Assume for this problem that if Merck builds now,
and the drug gets approved, they will make $125M/Year (present value) for eight years (Note
125=500(0.25)). If they wait, and the drug gets approved, they will generate $62.5M/Year (present
value) for six years. This is a byproduct of losing market share to competitors and 2 years of
production. Due to the specificity of the production process, the cost of the plant will be a total
loss if the drug does not obtain FDA approval.
d) Give the Expected Monetary Value (EMV) for each decision. Ignoring social pressure, should
Merck go ahead and build the plant?
e) At what probability of the drug being successful is Merck indifferent between building early and
waiting? That is, for what value are the EMV’s equal for the decision alternatives?
Note: Merck did build the plant early, and the drug did receive FDA approval.
5 Lecture 5 – Normal Distribution and Sampling Distributions
Textbook Sections and pages: 6.2,7.2,7.3,pp336-337,pp364-365,9.1
Problems: 6.7,9,11, 7.13,21,22,23,25,27
Continuous probability distributions are smooth curves that represent the ‘density’ of
probability around particular values. This density is not interpreted as a probability at the point
(all points will have probability of 0), but rather the probability of an outcome occurring between
points a and b is measured as the area under the density function between a and b. The
density function is always defined so that the total area under it is 1, and it is never negative. The
continuous distribution you have seen most often is the normal distribution, but many others
exist including the t-distribution, which you also have already seen.
Figure 8: Normal distributions with common means and varying standard deviations (3, 10, 25)
Standard notation for a random variable X, that follows a normal distribution with mean µ
and standard deviation σ is X ∼ N (µ, σ). Since there are infinitely many normal distributions
(corresponding to any µ and any σ > 0), we must standardize normal random variables to obtain
probabilities corresponding to them. If X ∼ N(µ, σ), we define Z = (X − µ)/σ. Z represents the
number of standard deviations above (or below, if negative) the mean that X lies. Table A.5 (p. A–14
and last page of text, not including inside back cover) gives the probability that Z lies between 0
and z for values of z between 0 and 3.49. Recall that the total area under the curve is 1, that the
probability that Z is larger than 0 is 0.5, and that the curve is symmetric.
Example 5.1
Figure 9: Normal distributions with common standard deviations and varying means (75, 100, 125)
Scores on the Verbal Ability section of the Graduate Record Examination (GRE) between
10/01/92 and 9/30/95 had a mean of 479 and a standard deviation of 116, based on a population
of N = 1188386 examinations. Scores can range between 200 and 800. Scores on standardized
tests tend to be approximately normally distributed. Let X be a score randomly selected from this
population. That is, X ∼ N (479, 116).
What is the probability that a randomly selected student scores above 700?
What is the probability the student scores between 400 and 600?
Above what score do the top 5% of all students score?
Source: “Interpreting Your GRE General Test and Subject Test Scores – 1996-97,” Educational
Testing Service.
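All three questions can be answered numerically; a sketch using scipy with X ∼ N(479, 116), as stated above:

```python
from scipy.stats import norm

X = norm(loc=479, scale=116)  # GRE verbal scores

print(1 - X.cdf(700))           # P(X > 700), about 0.028
print(X.cdf(600) - X.cdf(400))  # P(400 < X < 600), about 0.60
print(X.ppf(0.95))              # cutoff for the top 5%, about 670
```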
Estimator (θ̂)   Parameter (θ)   Std. Error (σ_θ̂)               Estimated Std. Error (S_θ̂)         Degrees of Freedom (ν)
X̄                µ               σ/√n                            S/√n                                 n − 1
p̂                P               √[P(1−P)/n]                     √[p̂(1−p̂)/n]                        —
X̄1 − X̄2         µ1 − µ2         √[σ1²/n1 + σ2²/n2]              √[S1²/n1 + S2²/n2]                   n1 + n2 − 2 *
D̄                µ_d             σ_d/√n                          S_d/√n                               n − 1
p̂1 − p̂2         P1 − P2         √[P1(1−P1)/n1 + P2(1−P2)/n2]    √[p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2]     — **

Table 12: Means, standard errors, and estimated standard errors of sample statistics (estimators)
To obtain probabilities of observing particular values of a sample statistic, we use the fact that
the statistic is normally distributed, and work with Z = (θ̂ − θ)/σ_θ̂.
n     µ − 2σ/√n                        µ + 2σ/√n
1     143.40 − 2(26.07)/√1 = 91.26     143.40 + 2(26.07)/√1 = 195.54
10    143.40 − 2(26.07)/√10 = 126.91   143.40 + 2(26.07)/√10 = 159.89
25    143.40 − 2(26.07)/√25 = 132.97   143.40 + 2(26.07)/√25 = 153.83
50    143.40 − 2(26.07)/√50 = 136.03   143.40 + 2(26.07)/√50 = 150.77

Table 13: Sample sizes and upper and lower bounds for sample means (95% confidence)
As the sample size increases, the sample means get closer and closer to the true mean. Thus if
we don’t know the true mean, but we wish to estimate it, we know that if we take a large sample the
sample mean will be relatively close to the true mean. The part that we are adding and subtracting
from the true mean is referred to as the bound on the error of estimation (it is also referred
to as the margin of error, particularly when used in context of a sample proportion).
Similar examples could be worked in terms of the other three estimators (sample statistics)
given in Table 12, using the corresponding parameter and standard error of the estimator in place
of those used in Example 6.2.
In Example 2.1 we considered the results of the clinical trial for Pravachol (and treated the
data as a population of patients). In reality, that was a sample (very large one at that). If we let
P1 be the proportion of all possible Pravachol users to have a heart event within five years, and
P2 be the corresponding proportion for patients on a placebo, we are interested in the parameter
P1 − P2 . The estimator for this parameter is p̂1 − p̂2 , which, for this sample, takes on the value:
p̂1 − p̂2 = X1/n1 − X2/n2 = 174/3302 − 248/3293 = .0527 − .0753 = −.0226
The estimated standard error of p̂1 − p̂2 is:
√[ p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ] = √[ .0527(1 − .0527)/3302 + .0753(1 − .0753)/3293 ]
= √(.00001512 + .00002115) = .0060
Thus, we would expect that for approximately 95% of all possible samples, our statistic p̂1 − p̂2
will lie within 2 standard errors (2(.0060)=.0120) of the true difference P1 − P2 .
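The estimate and its standard error can be reproduced in a few lines:

```python
import math

# Pravachol vs. placebo counts from the clinical trial table above.
x1, n1 = 174, 3302
x2, n2 = 248, 3293

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2                                           # about -0.0226
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # about 0.0060
print(diff, se, (diff - 2 * se, diff + 2 * se))
```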
6 Lecture 6 – Large–Sample Tests and Confidence Intervals
Textbook Sections: 9.2,9.4,10.1,10.4
Problems: 9.1,5,9,19,23,25 10.1,5,7,9,31,33,35
In this section we begin making statistical inferences, using sample data to comment on what
is occurring in a larger population or nature.
P( −z_{α/2} ≤ (θ̂ − θ)/σ_θ̂ ≤ z_{α/2} ) = 1 − α
This merely says that “in repeated sampling, our estimator will lie within zα/2 standard errors of
the mean a fraction of 1 − α of the time.” The resulting formula for a (1 − α)100% confidence
interval for θ is
θ̂ ± zα/2 σθ̂ .
When the standard error σ_θ̂ is unknown (almost always), we will replace it with the estimated
standard error Sθ̂ . Some notes concerning confidence intervals are given below.
• α is the probability (with respect to repeated sampling) that the interval does not
contain the true parameter. If we wish to make α smaller, we will increase the width of our
interval (for a fixed sample size).
• The width of the interval depends on the sample size through the standard error. As the
sample size increases, the width of the interval will decrease (for a fixed α), which is good
since we have a more precise estimate.
• If we took many random samples of a fixed size from the population of interest, and calculated
the confidence interval based on each sample, approximately (1 − α)100% of these intervals
would contain the true parameter. This is where the term confidence arises from; since
almost all of these intervals contain θ, we can be very confident that the interval based on
our one sample contains θ.
Example 6.1
Fox News Opinion Poll: “CNN covered Iran–Contra live in 1987, but is not covering Senate
hearings of Democratic finance abuses. Do you think the decision was politically motivated?”
(Washington Times, National Weekly Edition, 8/10/97). Out of n = 899 American adults sampled,
X = 476 agreed with the statement.
Numeric or Presence/Absence Outcome — Each person either agrees or does not agree with
the statement, thus it is a Presence/Absence outcome.
Parameter of Interest — P , the proportion of all American adults who feel the decision was
politically motivated.
Appropriate Estimator — p̂ = X/n

Estimated Standard Error — √[ p̂(1 − p̂)/n ]
We wish to obtain a 95% confidence interval for the proportion of all U.S. adults who believe
the decision was politically motivated. For this sample, p̂ = X/n = 476/899 = .53, and its estimated
standard error is:

S_p̂ = √[ p̂(1 − p̂)/n ] = √[ (.53)(.47)/899 ] = .0166
Thus a 95% confidence interval for the true proportion, P , is:

p̂ ± z_{.025} S_p̂ ≡ .53 ± 1.96(.0166) ≡ .53 ± .0325 ≡ (.4975, .5625)
We are 95% confident that the proportion of all U.S. adults who feel that the decision was politically
motivated was between 0.4975 and 0.5625. Note that since values below 0.50 are contained in the
interval, we cannot conclude that a majority agree with the statement at this significance level.
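A quick numerical check of this interval (expect small differences from the hand calculation, which rounds p̂ to .53):

```python
import math

x, n = 476, 899  # number agreeing with the statement, sample size
p_hat = x / n
se = math.sqrt(p_hat * (1 - p_hat) / n)

lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(lo, hi)  # close to (0.4975, 0.5625)
```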
Example 6.2 – Salary Progression Gap Between Dual Earner and Traditional Male
Managers
A study compared the salary progressions from 1984 to 1989 among a sample of married male
managers of Fortune 500 companies with children at home. For each manager, their 5-year salary
progression was obtained as 100*(1989 salary − 1984 salary)/1984 salary. This is a percent increase;
for example, if the manager’s salary increased from $100K in 1984 to $200K in 1989, then
X = 100*(200K − 100K)/100K = 100(1) = 100%. The researchers were interested in determining whether there is a
difference in the mean salary progression between dual earner and traditional managers. Dual
earner managers had wives who worked full time, traditional managers’ wives did not work. The
authors reported the sample statistics in Table 14.
Table 14: Summary statistics for male manager salary progression study
1 or 2 Populations? — We are observing samples from two populations (dual earner male man-
agers and traditional male managers).
Parameter of Interest — µ1 − µ2 , the difference between true mean salary progressions for dual
earner and traditional male managers.
Appropriate Estimator — X̄1 − X̄2

Estimated Standard Error — √[ S1²/n1 + S2²/n2 ]
We obtain a 95% confidence interval for the true mean difference between these two groups of
managers: µ1 − µ2 .
X̄1 − X̄2 = 69.24 − 60.46 = 8.78
√[ S1²/n1 + S2²/n2 ] = √[ 22.21²/166 + 61.27²/182 ] = √(2.97 + 20.63) = √23.60 = 4.86

95% CI for µ1 − µ2: (X̄1 − X̄2) ± z_{.05/2} √[ S1²/n1 + S2²/n2 ] ≡ 8.78 ± 1.96(4.86) ≡ 8.78 ± 9.52 ≡ (−0.74, 18.30)
We can be 95% confident that the true difference in mean salary progressions between the two groups
is between −0.74% and 18.30%. Since 0 is in this range (that is, µ1 = µ2 is plausible), we cannot conclude there
is a difference in the true underlying population means, although the sample means differed by
8.78%. This is because of the large amount of variation in the individual salary progressions (see
S1 and S2). What would be your conclusion had you constructed a 90% confidence interval for
µ1 − µ2?
Source: Stroh, L.K. and J.M. Brett (1996), “The Dual-Earner Dad Penalty in Salary Progres-
sion,” Human Resource Management, 35:181-201.
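A sketch reproducing the interval. Table 14’s contents are not shown in this excerpt, so the pairing of means with standard deviations and sample sizes below is inferred from the arithmetic above.

```python
import math

xbar1, s1, n1 = 69.24, 22.21, 166  # pairing inferred from the text's arithmetic
xbar2, s2, n2 = 60.46, 61.27, 182

diff = xbar1 - xbar2                       # 8.78
se = math.sqrt(s1**2 / n1 + s2**2 / n2)    # about 4.86
print(diff - 1.96 * se, diff + 1.96 * se)  # about (-0.74, 18.30)
```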
Decision
H0 True H0 False
Actual H0 True Correct Decision Type I Error
State H0 False Type II Error Correct Decision
We would like to set up the rejection region to keep the probability of a Type I error (α) and
the probability of a Type II error (β) as small as possible. Unfortunately, for a fixed sample size, if
we try to decrease α, we automatically increase β, and vice versa. We will set up rejection regions
to control for α, and will not concern ourselves with β. However, all tests described here have the
lowest Type II error rates of any tests for a given sample size. Further, as sample sizes increase,
the type II error rate decreases for a given state (value of θ) in the alternative hypothesis. Here α
is the probability we reject the null hypothesis when it is true. (This is like sending an innocent
defendant to prison).
We can write out the general form of a hypothesis test in the following steps.
1. H0 : θ = θ0
4. R.R.: |zobs | > zα/2 or zobs > zα or zobs < −zα (which R.R. depends on which alternative
hypothesis you are using).
5. P -value: 2P (Z > |zobs |) or P (Z > zobs ) or P (Z < zobs ) (again, depending on which alternative
you are using).
In all cases, a P -value less than α corresponds to a test statistic being in the rejection region
(reject H0 ), and a P -value larger than α corresponds to a test statistic failing to be in the rejection
region (fail to reject H0 ).
Numeric or Presence/Absence Outcome — We are measuring the length of time that the
erection sustains a specific level, which is numeric
Appropriate Estimator — X 1 − X 2 , the difference in the mean lengths of duration for the
samples of subjects in the clinical trial
r
S12 S22
Estimated Standard Error — n1 + n2
Research Hypothesis (HA ) — Goal is to show increased dose gives longer durations: HA : µ1 >
µ2 or equivalently HA : µ1 − µ2 > 0
Type I Error — This occurs when we conclude that the drug is effective (µ1 > µ2 ), when in fact
it is not.
Type II Error — This occurs when we fail to conclude the drug is effective (fail to conclude
µ1 > µ2 ) when in fact it is.
Table 16: Summary statistics for the High and Low Doses (in minutes)
Now we test whether the mean time for the high dose exceeds that for the low dose (setting
α = 0.05):
1. H0: µ1 − µ2 = 0    HA: µ1 − µ2 > 0
5. Conclusion: Since the test statistic falls in the rejection region (or, equivalently, the p-value
is below α), we reject the null hypothesis and claim that the true mean duration of erection
is higher for the high dose than the low dose (µ1 − µ2 > 0 ⇒ µ1 > µ2).
Compute a 95% confidence interval for the difference in true mean erection times.
Thus, we have no evidence that the rate of GI symptoms differs between the two types of chips
(in fact, the sample proportion is smaller for Olestra chips).
Can you think of any other outcomes with respect to the chips of interest to Procter & Gamble?
Obtain a 95% confidence interval for the difference between the true proportions. (The standard
error used above is very similar to the standard error not assuming equal proportions).
Source: L.L. Cheskin, et al (1998), “Gastrointestinal Symptoms Following Consumption of Olestra
or Regular Triglyceride Potato Chips,” JAMA, 279:150-152.
7 Lecture 7 — Small–Sample Inference
Textbook Sections and pages: 9.3,10.2,10.3,pp392–393
Problems: 9.11,13,17 10.11,15,17,23,25,29 11.1,3
In the case of small samples from populations with unknown variances, we can make use of the
t-distribution to obtain confidence intervals or conduct tests of hypotheses regarding population
means. In all cases, we must assume that the underlying distribution is normal (or approximately
normal), although this restriction is not necessary for moderate sample sizes. We will consider the
case of a single mean, µ, and the difference between two means, µ1 − µ2 , separately. First, though,
we refer back to the t-distribution in Table 4, page 669. This table gives the values tα such that
P (T > tα ) = α for values of the degrees of freedom between 1 and 29. The bottom line gives the
values zα , which should be used when the degrees of freedom exceed 30. I will also often add a
second subscript to tα to represent the appropriate degrees of freedom.
4. R.R.: |tobs | > tα/2,n−1 or tobs > tα,n−1 or tobs < −tα,n−1 (which R.R. depends on which
alternative hypothesis you are using).
5. p-value: 2P (T > |tobs |) or P (T > tobs ) or P (T < tobs ) (again, depending on which alternative
you are using).
In this case, you cannot obtain an exact p-value, but you can obtain bounds for the p-value.
Statistical computer packages report exact p-values.
4. R.R.: |tobs | > tα/2,n1 +n2 −2 or tobs > tα,n1 +n2 −2 or tobs < −tα,n1 +n2 −2 (which R.R. depends on
which alternative hypothesis you are using).
5. p-value: 2P (T > |tobs |) or P (T > tobs ) or P (T < tobs ) (again, depending on which alternative
you are using).
a) To set up this confidence interval, we need to obtain the pooled variance (we are assuming
these population variances are the same), as well as the value of tα/2,n1 +n2 −2 .
• S_p² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2) = [(13 − 1)25.7 + (9 − 1)75.2] / (13 + 9 − 2) = (308.4 + 601.6)/20 = 45.5
H0: µ1 − µ2 = 0    HA: µ1 − µ2 < 0

T.S.: t_obs = [(X̄1 − X̄2) − ∆0] / √[ S_p² (1/n1 + 1/n2) ] = −4.6/2.92 = −1.58
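A sketch of the pooled-variance test statistic; the individual group means are not listed above, so only their difference (−4.6) is used.

```python
import math
from scipy.stats import t

n1, s1_sq = 13, 25.7   # sample sizes and variances from the example above
n2, s2_sq = 9, 75.2
diff = -4.6            # observed difference in sample means

sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)  # 45.5
se = math.sqrt(sp_sq * (1 / n1 + 1 / n2))                      # about 2.92
t_obs = diff / se                                              # about -1.58
print(t_obs, t.cdf(t_obs, df=n1 + n2 - 2))  # one-sided p-value, H_A: mu1 < mu2
```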
2. A clothing manufacturer wishes to compare the color retention of two types of blue dye. She
selects a sample of 10 types of fabric, cutting each piece in half, and applying each type of
dye to a half of the piece. Each of the pieces is washed 15 times, and the amount of fading
is measured. (NOTE: there are 20 total measurements here since each type of fabric receives
both dyes). These samples are paired because each piece of experimental material receives
each dye.
The analysis of paired data involves computing the difference in the two measurements for each
subject and then treating these differences as a single–sample. For each subject (or experimental
unit), we observe two measurements X1i and X2i (the i just represents which subject in the sample
the measurement represents). Then, for each subject, we calculate Di = X1i − X2i . Now, testing
whether the 2 population means are equal is equivalent to testing whether or not the mean difference
is 0. We compute:

D̄ = Σ_{i=1}^n D_i / n ,    S_d² = Σ_{i=1}^n (D_i − D̄)² / (n − 1)
The (1 − α)100% confidence interval for µ1 − µ2 = µD is:
D̄ ± t_{α/2,n−1} S_d/√n
1. H0 : µ1 − µ2 = µD = ∆0
4. R.R.: |tobs | > tα/2,n−1 or tobs > tα,n−1 or tobs < −tα,n−1 (which R.R. depends on which
alternative hypothesis you are using).
5. p-value: 2P (T > |tobs |) or P (T > tobs ) or P (T < tobs ) (again, depending on which alternative
you are using).
4. p-value: 2P (T > tobs ) = 2P (T > 3.87) < 2P (t > 2.808) = 2(.005) = .01 (since 2.808 is the
largest value on the table for 23 d.f.).
We can conclude that the mean amount of nicotine delivered is higher for Nicoderm than for
Habitrol (since we reject H0 and D̄ is positive).
b) The 95% confidence interval for the difference in true means is:
D̄ ± t_{α/2,n−1} S_d/√n ≡ 55.0 ± 2.069 (69.8/√24) ≡ 55.0 ± 29.4 ≡ (25.6, 84.4)
We can conclude that the true mean for Nicoderm is between 25.6 and 84.4 units higher than the
true mean for Habitrol.
Source: S.K. Gupta, et al, (1995), “Comparison of the Pharmacokinetics of Two Nicotine Trans-
dermal Systems: Nicoderm and Habitrol,” Journal of Clinical Pharmacology, 35:493-498.
a) Label µnew as the true mean score for New Coke, and µold as the true mean score for Old
Coke. Give the alternative (research) hypotheses of reactance for the two conditions (blind and
labeled), where in each condition H0 : µnew − µold = 0:
Blind: HA : µnew − µold ___ 0        Labeled: HA : µnew − µold ___ 0
b) The following sample statistics were obtained from two samples of consumers. The samples
were taken approximately 7 months after Coke was re-released after the New Coke disaster. Within
each sample, subjects tasted both Coke and New Coke, rating each brand on a scale of 0-100. The
sample means, mean differences, and standard deviation of the differences are given below. Compute
the two test statistics for the tests from part a).
c) Give the appropriate rejection regions, assuming that preference differences are approximately
normally distributed (each test based on α = 0.05):
d) Can we conclude that reactance has been demonstrated by consumers? Note that since we
are conducting two independent tests, our overall Type I error rate is approximately 2(0.05)=0.10
(that we reject at least one null hypothesis when they are both true).
a) Would these samples be considered independent or paired? Why? (Hint: what might cause
variations in weekly news ratings, independent of the actual news programs?)
b) Let x1i be the ratings for ABC on week i and x2i be the ratings for CBS on the
same week. What are the null and alternative hypotheses that we wish to test?
c) The data are given in Table 18, and are in millions of viewers per night. Give the mean and
standard deviation of the differences.
Week (i) ABC (x1i ) CBS (x2i ) Di = x1i − x2i Di2
1 8.2 7.2 1.0 1.00
2 7.2 6.3 0.9 0.81
3 7.1 6.2 0.9 0.81
4 8.7 8.9 -0.2 0.04
5 7.2 7.1 0.1 0.01
6 6.6 6.4 0.2 0.04
7 8.8 7.6 1.2 1.44
8 8.5 8.6 -0.1 0.01
9 9.6 7.8 1.8 3.24
10 9.2 8.5 0.7 0.49
Sum 81.1 74.6 6.5 7.89
Table 18: Sample of 10 weeks viewers for ABC and CBS news from 1997
d) Test whether the true means differ for the two networks. Clearly state the null and alternative
hypotheses, test statistic, and rejection region.
f) Based on your conclusion, we are at risk of (but have not necessarily made):
(i) A Type I Error
(ii) A Type II Error
(iii) No error
(iv) Either a Type I or Type II Error
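For part d), the differences in Table 18 give everything needed. A sketch, reading the question as a two-sided test:

```python
import math
from scipy.stats import t

# Weekly ABC - CBS differences from Table 18.
D = [1.0, 0.9, 0.9, -0.2, 0.1, 0.2, 1.2, -0.1, 1.8, 0.7]
n = len(D)

dbar = sum(D) / n                                          # 0.65
sd = math.sqrt(sum((d - dbar) ** 2 for d in D) / (n - 1))  # about 0.64
t_obs = dbar / (sd / math.sqrt(n))                         # about 3.22
print(t_obs, 2 * t.sf(abs(t_obs), df=n - 1))               # two-sided p-value
```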
X = µ + (X − µ) = µ + ε,
where ε = X − µ. Note that if X ∼ N (µ, σ), then ε ∼ N (0, σ). Also note that µ is unknown
(although we can estimate it), and so ε is unknown as well. We will be fitting different models in
this course, and estimating parameters corresponding to the models, as well as testing hypotheses
concerning them.
8 Lecture 8 — Experimental Design and the Analysis of Variance
Textbook Section: Section 11.2
Problems: 11.7,9,14,15,17
In this section, we will look at the effects of strictly qualitative variable(s) on the mean response
of a quantitative outcome variable. There are two distinct methods by which these measurements
can be made: controlled experiments and sample surveys. Some situations where analyses of
this type are used are given below:
AOV1 A drug manufacturer would like to compare four formulations to decide which is most
effective at reducing arthritic pain.
AOV2 A psychologist wishes to find out which of six classroom atmospheres provides the best
learning results among young children.
Before we get involved in the mathematics and model formulation, we will describe the two exper-
imental situations and define some useful terms.
In a controlled experiment, the experimenter selects a sample of subjects or objects from
a population of such subjects or objects. These are referred to as experimental units. These
experimental units are what we will make our measurements on. After the experimental units are
selected, treatments are applied to the experimental units. These treatments are made up of one
or more factors, or experimental variables. We wish to estimate the effects of these treatments
on the units. We will refer to the levels as the intensity settings of the factors. Note that in a
controlled experiment, we are applying the treatments to the experimental units, and we wish to
estimate the effects of the various levels of the factor(s). Generally we would like to decide if certain
levels provide higher (or lower) mean responses than other levels. Note that we have already done
this in the case of one factor possessing two levels in the previous chapter (two-sample t-test and
the paired difference test).
In an observational study, the experimenter selects samples of objects from several popula-
tions and wishes to observe if the population means are the same. The mathematics of the analysis
is the same for both of these methods, but the interpretations have subtle differences. In this
situation, we are not applying treatments to the elements of the sample, but rather observing some
measurement of interest. We still often refer to these different populations as treatments, even
though we aren’t really applying them to experimental units.
A couple of examples should clear up this difference. First, consider a study of four blood pressure
medications. Twenty subjects with relatively comparable levels of high blood pressure are sampled,
and each subject is given one of the four medications for a month with their blood pressure being
measured at the end of the study. In this setting, the patients are the experimental units, the four
medications are the treatments, and we have randomly assigned one medication (level) to each
subject. This is a controlled experiment. Now consider a study to observe whether four brands of
television sets have the same mean lifetimes. We sample five of each brand, observing the lifetime
of each set. In this case, we consider the brands to be treatments, although we are not applying
brands to ‘experimental units’. However, in both of these situations, the method of testing for
differences among the effects of the medications, and of testing for differences among the brand
means are identical. Thus, we will not need to distinguish between the situations explicitly, but
it is important to distinguish which one you are in from an interpretation standpoint. We will
always refer to these designs from the controlled experiment setting, with obvious extensions to the
observational study being implied.
Here, µ is the overall mean measurement across all treatments, αj is the effect of the j th treatment
(µj = µ+αj ), and εij is a random error component that is assumed to be normally distributed with
mean 0 and standard deviation σ. This εij can be thought of as the fact that there will be variation
among the measurements of different experimental units receiving the same treatment in the case of
a controlled experiment, or the fact that different elements sampled from the same population will
have varying measurements. This means that our model assumes that Xij ∼ N (µ+αj = µj , σ). We
further assume the measurements are independent of one another. We will place a condition on the
effects αj , namely that they sum to zero. Of interest to the experimenter is whether or not there
is a treatment effect, that is do any of the levels of the treatment provide higher (lower) mean
response than other levels. This can be hypothesized symbolically as H0 : α1 = α2 = · · · = αC = 0
(no treatment effect) against the alternative HA : Not all αj = 0 (treatment effects exist). Before
we set up this testing procedure, we must define a few items.
N = n1 + · · · + nC
X̄_j = Σ_{i=1}^{n_j} X_ij / n_j

S_j² = Σ_{i=1}^{n_j} (X_ij − X̄_j)² / (n_j − 1)

TotalSS = Σ_{j=1}^C Σ_{i=1}^{n_j} (X_ij − X̄)²

SST = Σ_{j=1}^C Σ_{i=1}^{n_j} (X̄_j − X̄)² = Σ_{j=1}^C n_j (X̄_j − X̄)²

SSE = Σ_{j=1}^C Σ_{i=1}^{n_j} (X_ij − X̄_j)² = Σ_{j=1}^C (n_j − 1) S_j²
Total SS represents the total variation of the sample measurements around the overall sample
mean. This Total variation is partitioned into variation Between treatment means (SST ) and
variation Within treatments (SSE). Often, we refer to SST as the Model sum of squares and
SSE as the Error sum of squares. Note that the model and error sums of squares add up to the
total sum of squares. That is:
T otalSS = SST + SSE
The point of the Analysis of Variance is to detect whether differences exist in the population means
of the treatments, and if so, to determine which treatments provide higher (lower) mean responses.
Associated with each source of variation, we have degrees of freedom. The total sum of
squares has N − 1 degrees of freedom, since it is made up of N − 1 independent terms (we have
estimated the mean from the sample). The model sum of squares measures the variation in the
C treatment means around the overall mean, and has dfT = C − 1 degrees of freedom. Finally,
the error sum of squares is made up of variation of the individual measurements around the C
treatment means, and has dfE = N − C. Note that just as the model and error sums of squares
sum to the total sum of squares, the degrees of freedom also are additive. That is:

N − 1 = (C − 1) + (N − C)
ANOVA
Source of     Sum of                                          Degrees of   Mean
Variation     Squares                                         Freedom      Square             F

TREATMENTS    SST = Σ_{j=1}^C n_j (X̄_j − X̄)²                C − 1        MST = SST/(C−1)    F = MST/MSE
ERROR         SSE = Σ_{j=1}^C (n_j − 1) S_j²                  N − C        MSE = SSE/(N−C)
TOTAL         TotalSS = Σ_{j=1}^C Σ_{i=1}^{n_j} (X_ij − X̄)²  N − 1
Table 19: The Analysis of Variance Table for the Completely Randomized Design
Recall the model that we are using to describe the data in this design:
The effect of the j th treatment is αj . If there is no treatment effect among any of the levels of the
factor under study, that is that the population means of the C treatments are the same, then each
of the parameters αj are 0. This is a hypothesis we would like to test. The alternative hypothesis
will be that not all treatments have the same mean, or equivalently, that treatment effects exist (not
all αj are 0). If the null hypothesis is true (all C population means are equal), then the statistic
F = MST/MSE follows the F-distribution with C − 1 numerator and N − C denominator degrees of
freedom. Large values of F are evidence against the null hypothesis of no treatment effect (recall
what SST and SSE are).
Upper percentage points of the F–distribution are given in Table A.7 (pp A-16 – A-25) of your
text book. This distribution has 2 parameters ν1 and ν2, which are called the numerator and
denominator degrees of freedom, respectively. These tables give the upper tail cut off for various
values of ν1, ν2, and α (the upper tail probability). Under the null hypothesis of no differences
among treatment means (α1 = · · · = αC = 0), the test statistic F = MST/MSE has an F–distribution
with ν1 = C − 1 and ν2 = N − C degrees of freedom. Large values of F = MST/MSE are evidence
against the null hypothesis. We will denote F_{α,ν1,ν2} as the cut off value that leaves a probability of
α in the upper tail of the F–distribution with ν1 and ν2 degrees of freedom. The testing procedure
is as follows:
1. H0 : α1 = · · · = αC = 0 (µ1 = · · · = µC ) (No treatment effect)
Table 20: Summary statistics and sums of squares calculations for sexual side effects of antidepres-
sant data.
ANOVA
Source of Sum of Degrees of Mean
Variation Squares Freedom Square F
TREATMENTS 21.98 3 7.33 7.64
ERROR 98.60 103 0.96
TOTAL 120.58 106
Table 21: The Analysis of Variance table for sexual side effects in four antidepressant groups
Are there differences among the effects of the four brands (Test with α = 0.05)?
We can conclude that the sexual side effects differ among the 4 brands at virtually any level
of α since our P –value is so small. There is virtually no chance we would have observed
this large of variation among the four sample means if the true (unknown) population
means are the same.
Source: J.G. Modell, et al (1997), “Comparative Sexual Side Effects of Bupropion, Fluoxetine,
Paroxetine, and Sertraline,” Clinical Pharmacology & Therapeutics, 61:476-487.
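The very small P-value referred to above can be computed from the F distribution with 3 and 103 degrees of freedom:

```python
from scipy.stats import f

# Upper-tail area beyond the observed F = 7.64 from Table 21.
print(f.sf(7.64, dfn=3, dfd=103))  # about 0.0001
```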
ANOVA
Source of Degrees of Sum of Mean
Variation Freedom Squares Square F
Industry (Trts) 17
Error 162 57.55
Total 82.71
Table 22: The Analysis of Variance table for Corporate Social Responsibility
b) Test whether mean CSR scores differ among the industries (α = 0.05).
Source: M.T. Cottrill (1990), “Corporate Social Responsibility and the Marketplace,” Journal
of Business Ethics, 9:723-729.
Example 8.3 – Salary Progression By Industry
A recent study reported salary progressions during the 1980’s among k=8 industries. Results
including industry means, standard deviations, and sample sizes are given in Table 23. Also included
are columns that produce the treatment (between industry) and error (within industry) sums of
squares. The overall mean is X̄ = 65.11.
b) Test whether the true mean salary progressions differ among these industries (α = 0.05).
Source: L.K. Stroh and J.M. Brett (1996), “The Dual-Earner Dad Penalty in Salary Progres-
sion,” Human Resources Management 35:181-201.
Source df SS
Groups 1411.2
Error 141 30244.5
Total
Table 24: ANOVA table for Professional Women as Market Segment study
c) Give the null and alternative hypotheses for testing whether true mean shopping times differ
among these three potential market segments.
d) Give the test statistic, rejection region, and conclusion for the test in part c) (use 120
denominator df ).
Source: M. Joyce and J. Guiltinan (1978), “The Professional Woman: A Potential Market
Segment for Retailers,” Journal of Retailing, 54:59-70.
A study was conducted to determine whether the amount of attention (measured by the time a
subject is exposed to an advertisement) is related to the importance ratings of a product attribute.
Subjects were asked to rate on a scale the importance of water resistance in a watch. People were
exposed to the ad for either 60, 105, or 150 seconds. The means, standard deviations, and sample
sizes for each treatment are given in Table 25. The overall mean is computed as follows:
X̄ = (Total importance score)/(Overall sample size) = [11(4.3) + 10(6.8) + 9(7.1)] / (11 + 10 + 9) = 179.2/30 = 6.0
b) Test whether differences exist among the mean importance scores for the three exposure
times (α = 0.05)
Source: S.B. MacKenzie (1986), “The Role of Attention in Mediating the Effect of Advertising
on Attribute Performance,” Journal of Consumer Research, 13:174-195.
9 Lecture 9 — Comparison of Treatment Means
Textbook Section: 11.3
Problems: Apply Tukey’s Method to the Problems in Section 11.2
Assuming that we have concluded that treatment means differ, we generally would like to know
which means are significantly different. This is generally done by making either pre–planned or
all pairwise comparisons between pairs of treatments. We will look at how to make pre–planned
comparisons, and then how to make all comparisons. The two methods are very similar.
1. If the entire confidence interval for µi − µj is positive, we conclude that treatment i has
a higher mean than treatment j.
2. If the entire confidence interval for µi − µj is negative, we conclude that treatment i has
a lower mean than treatment j.
3. If the interval contains both positive and negative values, we cannot conclude that
the means of treatments i and j are different.
The term

q_{α,C,N−C} √[ (MSE/2) (1/n_i + 1/n_j) ]
Example 9.1
We’ve determined differences exist among the sexual side effects of the antidepressant brands.
Which brands differ?
We use Tukey’s HSD test and make C(C − 1)/2 = 4(3)/2 = 6 comparisons with a simultaneous Type
I error rate of α = 0.05. The critical difference for treatments i and j is:

HSD_{i,j} = q_{α,C,N−C} √[ (MSE/2) (1/n_i + 1/n_j) ]
Thus, we compare the difference between the means for Wellbutrin and Prozac with this critical
difference.
$$\bar{X}_1 - \bar{X}_2 = 0.46 - (-0.49) = 0.95 > 0.686 \qquad (\mu_1 > \mu_2)$$
Since the means differ by more than 0.686, we conclude they differ and that Wellbutrin users report
higher scores on average than Prozac users. The results for all pairs are given in Table 26, where
N.S.D. in the Conclusion column means “Not Significantly Different”.
The primary conclusion is that Wellbutrin users have a higher population mean than the other
three brands’ users. None of the three SSRI’s means can be determined to differ.
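Because these comparisons are just arithmetic on the treatment means, they are easy to script. Below is a minimal Python sketch of the Tukey comparisons; the MSE, sample sizes, and Studentized-range value q are hypothetical placeholders (the notes do not list them), while the four brand means are taken from Table 26.

```python
import math
from itertools import combinations

def hsd(q, mse, n_i, n_j):
    # Tukey critical difference for comparing treatments i and j
    return q * math.sqrt((mse / 2.0) * (1.0 / n_i + 1.0 / n_j))

def tukey_pairs(means, ns, mse, q):
    # All C(C-1)/2 pairwise comparisons at a simultaneous error rate
    for i, j in combinations(range(len(means)), 2):
        diff = means[i] - means[j]
        crit = hsd(q, mse, ns[i], ns[j])
        verdict = "means differ" if abs(diff) > crit else "N.S.D."
        print(f"{i + 1} v {j + 1}: diff = {diff:5.2f}, HSD = {crit:.3f} -> {verdict}")

# Brand means from Table 26; mse, ns, and q are hypothetical placeholders.
tukey_pairs(means=[0.46, -0.49, -0.90, -0.49],
            ns=[25, 25, 25, 25], mse=1.0, q=3.74)
```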
a) Between which two industries will Tukey’s HSD be the largest? Compute this value. Do
these two industries differ significantly (use α = 0.05)?
Simultaneous 95% CI’s
Comparison   X̄i − X̄j                   HSDi,j    Conclude
1 v 2        0.46 − (−0.49) = 0.95      0.686     µ1 > µ2
1 v 3        0.46 − (−0.90) = 1.36      0.778     µ1 > µ3
1 v 4        0.46 − (−0.49) = 0.95      0.732     µ1 > µ4
2 v 3        −0.49 − (−0.90) = 0.41     0.697     N.S.D.
2 v 4        −0.49 − (−0.49) = 0.00     0.645     N.S.D.
3 v 4        −0.90 − (−0.49) = −0.41    0.742     N.S.D.
Table 26: Tukey multiple comparisons for the sexual side effects study patients receiving antide-
pressants
b) Between which two industries will Tukey’s HSD be the smallest? Compute this value. Do
these two industries differ significantly (use α = 0.05)?
Previously, we have worked with a random variable X that comes from a population that is
normally distributed with mean µ and variance σ 2 . We have seen that we can write X in terms of
µ and a random error component ε, that is, X = µ + ε. For the time being, we are going to change
our notation for our random variable from X to Y . So, we now write Y = µ + ε. We will now
find it useful to call the random variable Y a dependent or response variable. Many times, the
response variable of interest may be related to the value(s) of one or more known or controllable
independent or predictor variables. Consider the following situations:
LR1 A college recruiter would like to be able to predict a potential incoming student’s first–year
GPA (Y ) based on known information concerning high school GPA (X1 ) and college entrance
examination score (X2 ). She feels that the student’s first–year GPA will be related to the
values of these two known variables.
LR2 A marketer is interested in the effect of changing shelf height (X1 ) and shelf width (X2 ) on
the weekly sales (Y ) of her brand of laundry detergent in a grocery store.
LR3 A psychologist is interested in testing whether the amount of time to become proficient in a
foreign language (Y ) is related to the child’s age (X).
In each case we have at least one variable that is known (in some cases it is controllable), and a
response variable that is a random variable. We would like to fit a model that relates the response
to the known or controllable variable(s). The main reasons that scientists and social researchers
use linear regression are the following:
1. Prediction – To predict a future response based on known values of the predictor variables
and past data related to the process.
2. Description – To measure the effect of changing a controllable variable on the mean value
of the response variable.
3. Control – To confirm that a process is providing responses (results) that we ‘expect’ under
the present operating conditions (measured by the level(s) of the predictor variable(s)).
Suppose, for example, that a response of interest (say, a firm’s profit Y) is determined exactly by the level of a single variable X through the relation
$$Y = 10 \cdot X - 30.$$
This is called a deterministic model. In general, we can write the equation for a straight line as
Y = β0 + β1 X,
where β0 is called the Y–intercept and β1 is called the slope. β0 is the value of Y when X = 0,
and β1 is the change in Y when X increases by 1 unit. In many real–world situations, the response
of interest (in this example it’s profit) cannot be explained perfectly by a deterministic model. In
this case, we make an adjustment for random variation in the process.
$$Y = \underbrace{\beta_0 + \beta_1 X}_{\text{systematic}} + \underbrace{\varepsilon}_{\text{random}}$$
where X is the level of the predictor variable corresponding to the response, β0 and β1 are
unknown parameters, and ε is the random error component corresponding to the response whose
distribution we assume is $N(0, \sigma)$, as before. Further, we assume the error terms are independent
of one another; we discuss this in more detail in a later chapter. Note that β0 can be interpreted
as the mean response when X=0, and β1 can be interpreted as the change in the mean response
when X is increased by 1 unit. Under this model, we are saying that Y |X ∼ N (β0 + β1 X, σ).
Consider the following example.
[Figure 10: Plot of weekly coffee SALES versus feet of shelf SPACE]
Now, look at Figure 10. Note that while there is some variation among the weekly sales at 3’,
6’, and 9’, respectively, there is a trend for the mean sales to increase as shelf space increases. If
we define the fitted equation to be
$$\hat{Y} = b_0 + b_1 X,$$
we can choose the estimates b0 and b1 to be the values that minimize the distances of the data points
to the fitted line. Now, for each observed response Yi , with a corresponding predictor variable Xi ,
we obtain a fitted value Ŷi = b0 + b1 Xi . So, we would like to minimize the sum of the squared
distances of each observed response to its fitted value. That is, we want to minimize the error
sum of squares, SSE, where:
$$SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}\left(Y_i - (b_0 + b_1 X_i)\right)^2.$$
Calculus shows the minimizing values are $b_1 = \frac{SS_{XY}}{SS_{XX}}$ (these corrected sums of squares are defined just below) and
$$b_0 = \bar{Y} - b_1\bar{X} = \frac{\sum_{i=1}^{n} Y_i}{n} - b_1\,\frac{\sum_{i=1}^{n} X_i}{n}.$$
Some shortcut equations, known as the corrected sums of squares and crossproducts, that while
not very intuitive are very useful in computing these and other estimates are:
• $SS_{XX} = \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - \frac{(\sum_{i=1}^{n} X_i)^2}{n}$
• $SS_{XY} = \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - \frac{(\sum_{i=1}^{n} X_i)(\sum_{i=1}^{n} Y_i)}{n}$
• $SS_{YY} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \frac{(\sum_{i=1}^{n} Y_i)^2}{n}$
For the coffee sales data (n = 12):
$$SS_{XY} = \sum(X-\bar{X})(Y-\bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 39600 - \frac{(72)(6185)}{12} = 2490$$
$$SS_{YY} = \sum(Y-\bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} = 3300627 - \frac{(6185)^2}{12} = 112772.9$$
Similarly, $SS_{XX} = \sum(X-\bar{X})^2 = 72$.
From these, we obtain the least squares estimate of the true linear regression relation ($\beta_0 + \beta_1 X$):
$$b_1 = \frac{SS_{XY}}{SS_{XX}} = \frac{2490}{72} = 34.5833$$
$$b_0 = \frac{\sum Y}{n} - b_1\frac{\sum X}{n} = \frac{6185}{12} - 34.5833\left(\frac{72}{12}\right) = 515.4167 - 207.5000 = 307.9167$$
$$\hat{Y} = b_0 + b_1 X = 307.9167 + 34.5833 X$$
So the fitted equation, estimating the mean weekly sales when the product has X feet of shelf
space, is $\hat{Y} = 307.9167 + 34.5833X$. Our interpretation for $b_1$ is: “the estimated
increase in mean weekly sales from adding one foot of shelf space is 34.5833 bags of coffee”.
Note that this should only be interpreted within the range of X values that we have observed in
the “experiment”, namely X = 3 to 9 feet.
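These computations are easy to verify with a few lines of code. The following minimal Python sketch reproduces the coffee-data estimates from the summary sums used above (n = 12, ΣX = 72, ΣY = 6185, ΣXY = 39600, SSXX = 72):

```python
n = 12
sum_x, sum_y, sum_xy = 72.0, 6185.0, 39600.0

ss_xy = sum_xy - sum_x * sum_y / n   # 2490.0
ss_xx = 72.0                          # = sum of X^2 - (sum X)^2 / n, as used above

b1 = ss_xy / ss_xx                    # 34.5833...
b0 = sum_y / n - b1 * (sum_x / n)     # 307.9167...
print(f"fitted equation: Y-hat = {b0:.4f} + {b1:.4f} X")
```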
The “beta factor” is derived from a least squares regression analysis between weekly
percent changes in the price of a stock and weekly percent changes in the price of all
stocks in the survey over a period of five years. In the case of shorter price histories, a
smaller period is used, but never less than two years.
In this example, we will compute the stock beta over a 28-week period for Coca-Cola and
Anheuser-Busch, using the S&P500 as ’the market’ for comparison. Note that this period is only
about 10% of the period used by Value Line. Note: While there are 28 weeks of data, there are
only n=27 weekly changes.
Table 29 provides the dates, weekly closing prices, and weekly percent changes of: the S&P500,
Coca-Cola, and Anheuser-Busch. The following summary calculations are also provided, with X
representing the S&P500, YC representing Coca-Cola, and YA representing Anheuser-Busch. All
calculations should be based on 4 decimal places. Figure 11 gives the plot and least squares
regression line for Anheuser-Busch, and Figure 12 gives the plot and least squares regression line
for Coca-Cola.
$$\sum X = 15.5200 \qquad \sum Y_C = -2.4882 \qquad \sum Y_A = 2.4281$$
$$\sum X^2 = 124.6354 \qquad \sum Y_C^2 = 461.7296 \qquad \sum Y_A^2 = 195.4900$$
$$\sum XY_C = 161.4408 \qquad \sum XY_A = 84.7527$$
Closing S&P A-B C-C S&P A-B C-C
Date Price Price Price % Chng % Chng % Chng
05/20/97 829.75 43.00 66.88 – – –
05/27/97 847.03 42.88 68.13 2.08 -0.28 1.87
06/02/97 848.28 42.88 68.50 0.15 0.00 0.54
06/09/97 858.01 41.50 67.75 1.15 -3.22 -1.09
06/16/97 893.27 43.00 71.88 4.11 3.61 6.10
06/23/97 898.70 43.38 71.38 0.61 0.88 -0.70
06/30/97 887.30 42.44 71.00 -1.27 -2.17 -0.53
07/07/97 916.92 43.69 70.75 3.34 2.95 -0.35
07/14/97 916.68 43.75 69.81 -0.03 0.14 -1.33
07/21/97 915.30 45.50 69.25 -0.15 4.00 -0.80
07/28/97 938.79 43.56 70.13 2.57 -4.26 1.27
08/04/97 947.14 43.19 68.63 0.89 -0.85 -2.14
08/11/97 933.54 43.50 62.69 -1.44 0.72 -8.66
08/18/97 900.81 42.06 58.75 -3.51 -3.31 -6.28
08/25/97 923.55 43.38 60.69 2.52 3.14 3.30
09/01/97 899.47 42.63 57.31 -2.61 -1.73 -5.57
09/08/97 929.05 44.31 59.88 3.29 3.94 4.48
09/15/97 923.91 44.00 57.06 -0.55 -0.70 -4.71
09/22/97 950.51 45.81 59.19 2.88 4.11 3.73
09/29/97 945.22 45.13 61.94 -0.56 -1.48 4.65
10/06/97 965.03 44.75 62.38 2.10 -0.84 0.71
10/13/97 966.98 43.63 61.69 0.20 -2.50 -1.11
10/20/97 944.16 42.25 58.50 -2.36 -3.16 -5.17
10/27/97 941.64 40.69 55.50 -0.27 -3.69 -5.13
11/03/97 914.62 39.94 56.63 -2.87 -1.84 2.04
11/10/97 927.51 40.81 57.00 1.41 2.18 0.65
11/17/97 928.35 42.56 57.56 0.09 4.29 0.98
11/24/97 963.09 43.63 63.75 3.74 2.51 10.75
Table 29: Weekly closing stock prices – S&P 500, Anheuser-Busch, Coca-Cola
Figure 11: Plot of weekly percent stock price changes for Anheuser-Busch versus S&P 500 and least
squares regression line
Figure 12: Plot of weekly percent stock price changes for Coca-Cola versus S&P 500 and least
squares regression line
a) Compute SSXX , SSXYC , and SSXYA .
The following (approximate) data were published by Joel Dean, in the 1941 article: “Statistical
Cost Functions of a Hosiery Mill,” (Studies in Business Administration, vol. 14, no. 3).
Y — Monthly total production cost (in $1000s).
X — Monthly output (in thousands of dozens produced).
A sample of n = 48 months of data was used, with Xi and Yi being measured for each month.
The parameter β1 represents the change in mean cost per unit increase in output (unit variable
cost), and β0 represents the true mean cost when output is 0, without shutting down the plant (fixed
cost). The data are given in Table 30 (the order is arbitrary, as the data are printed in table form,
and were obtained from visual inspection/approximation of a plot).
i Xi Yi i Xi Yi i Xi Yi
1 46.75 92.64 17 36.54 91.56 33 32.26 66.71
2 42.18 88.81 18 37.03 84.12 34 30.97 64.37
3 41.86 86.44 19 36.60 81.22 35 28.20 56.09
4 43.29 88.80 20 37.58 83.35 36 24.58 50.25
5 42.12 86.38 21 36.48 82.29 37 20.25 43.65
6 41.78 89.87 22 38.25 80.92 38 17.09 38.01
7 41.47 88.53 23 37.26 76.92 39 14.35 31.40
8 42.21 91.11 24 38.59 78.35 40 13.11 29.45
9 41.03 81.22 25 40.89 74.57 41 9.50 29.02
10 39.84 83.72 26 37.66 71.60 42 9.74 19.05
11 39.15 84.54 27 38.79 65.64 43 9.34 20.36
12 39.20 85.66 28 38.78 62.09 44 7.51 17.68
13 39.52 85.87 29 36.70 61.66 45 8.35 19.23
14 38.05 85.23 30 35.10 77.14 46 6.25 14.92
15 39.16 87.75 31 33.75 75.47 47 5.45 11.44
16 38.59 92.62 32 34.29 70.37 48 3.79 12.69
This dataset has n = 48 observations, with a mean output (in 1000s of dozens) of $\bar{X} = 31.0673$
and a mean monthly cost (in $1000s) of $\bar{Y} = 65.4329$.
$$\sum_{i=1}^{n} X_i = 1491.23 \quad \sum_{i=1}^{n} X_i^2 = 54067.42 \quad \sum_{i=1}^{n} Y_i = 3140.78 \quad \sum_{i=1}^{n} Y_i^2 = 238424.46 \quad \sum_{i=1}^{n} X_i Y_i = 113095.80$$
$$b_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \frac{(\sum_{i=1}^{n} X_i)(\sum_{i=1}^{n} Y_i)}{n}}{\sum_{i=1}^{n} X_i^2 - \frac{(\sum_{i=1}^{n} X_i)^2}{n}} = \frac{SS_{XY}}{SS_{XX}} = \frac{15520.27}{7738.94} = 2.0055$$
$$b_0 = \bar{Y} - b_1\bar{X} = 65.4329 - 2.0055(31.0673) = 3.1274$$
Figure 13: Estimated cost function for hosiery mill (Dean, 1941)
i Xi Yi Ŷi ei
1 46.75 92.64 96.88 -4.24
2 42.18 88.81 87.72 1.09
3 41.86 86.44 87.08 -0.64
4 43.29 88.80 89.95 -1.15
5 42.12 86.38 87.60 -1.22
6 41.78 89.87 86.92 2.95
7 41.47 88.53 86.30 2.23
8 42.21 91.11 87.78 3.33
9 41.03 81.22 85.41 -4.19
10 39.84 83.72 83.03 0.69
11 39.15 84.54 81.64 2.90
12 39.20 85.66 81.74 3.92
13 39.52 85.87 82.38 3.49
14 38.05 85.23 79.44 5.79
15 39.16 87.75 81.66 6.09
16 38.59 92.62 80.52 12.10
17 36.54 91.56 76.41 15.15
18 37.03 84.12 77.39 6.73
19 36.60 81.22 76.53 4.69
20 37.58 83.35 78.49 4.86
21 36.48 82.29 76.29 6.00
22 38.25 80.92 79.84 1.08
23 37.26 76.92 77.85 -0.93
24 38.59 78.35 80.52 -2.17
25 40.89 74.57 85.13 -10.56
26 37.66 71.60 78.65 -7.05
27 38.79 65.64 80.92 -15.28
28 38.78 62.09 80.90 -18.81
29 36.70 61.66 76.73 -15.07
30 35.10 77.14 73.52 3.62
31 33.75 75.47 70.81 4.66
32 34.29 70.37 71.90 -1.53
33 32.26 66.71 67.82 -1.11
34 30.97 64.37 65.24 -0.87
35 28.20 56.09 59.68 -3.59
36 24.58 50.25 52.42 -2.17
37 20.25 43.65 43.74 -0.09
38 17.09 38.01 37.40 0.61
39 14.35 31.40 31.91 -0.51
40 13.11 29.45 29.42 0.03
41 9.50 29.02 22.18 6.84
42 9.74 19.05 22.66 -3.61
43 9.34 20.36 21.86 -1.50
44 7.51 17.68 18.19 -0.51
45 8.35 19.23 19.87 -0.64
46 6.25 14.92 15.66 -0.74
47 5.45 11.44 14.06 -2.62
48 3.79 12.69 10.73 1.96
Table 31: Approximated Monthly Outputs, total costs, fitted values and residuals – Dean (1941)
We have now seen how to estimate β0 and β1. Next we obtain an estimate of the variance of
the responses at a given value of X. Recall that in your previous statistics course you estimated the
variance by taking the ‘average’ squared deviation of each measurement from the sample (estimated)
mean; that is, you calculated
$$S^2 = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}.$$
Now that we fit the regression model, we no longer use $\bar{Y}$ to estimate the mean for each $Y_i$, but rather $\hat{Y}_i = b_0 + b_1 X_i$.
The estimate we use now looks similar to the previous one, except that we replace $\bar{Y}$ with $\hat{Y}_i$ and
we replace $n-1$ with $n-2$, since we have estimated 2 parameters, β0 and β1. The new estimate
(which we will refer to as the residual variance) is:
$$S_e^2 = MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2} = \frac{SS_{YY} - \frac{(SS_{XY})^2}{SS_{XX}}}{n-2}.$$
This estimated variance Se2 can be thought of as the ‘average’ squared distance from each observed
response to the fitted line. The word average is in quotes since we divide by n − 2 and not n. The
closer the observed responses fall to the line, the smaller Se2 is and the better our predicted values
will be.
For the hosiery mill cost data (n = 48), the residuals in Table 31 give SSE = 1788.51, so:
• $S_e^2 = MSE = \frac{SSE}{n-2} = \frac{1788.51}{48-2} = 38.88$
• $S_e = \sqrt{38.88} = 6.24$
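As a check, the shortcut form of SSE and the residual variance can be computed directly from the corrected sums of squares. A minimal sketch, using only the hosiery mill quantities given above:

```python
import math

ss_yy, ss_xy, ss_xx, n = 32914.06, 15520.27, 7738.94, 48

sse = ss_yy - ss_xy ** 2 / ss_xx   # ~ 1788.5, matching the residuals in Table 31
mse = sse / (n - 2)                 # S_e^2 ~ 38.88
se = math.sqrt(mse)                 # S_e  ~ 6.24
```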
11 Lecture 11 — Simple Regression II — Inferences Concerning β1
Textbook Sections: 12.5,12.6
Problems: 12.36,39,40,41, Compute 95% CI’s for β1 in these problems.
Recall that in our regression model, we are stating that E(Y |X) = β0 + β1 X. In this model, β1
represents the change in the mean of our response variable Y , as the predictor variable X increases
by 1 unit. Note that if β1 = 0, we have that E(Y |X) = β0 + β1 X = β0 + 0X = β0 , which implies
the mean of our response variable is the same at all values of X. In the context of the coffee sales
example, this would imply that mean sales are the same, regardless of the amount of shelf space, so
a marketer has no reason to purchase extra shelf space. This is like saying that knowing the level
of the predictor variable does not help us predict the response variable.
Under the assumptions stated previously, namely that Y ∼ N (β0 + β1 X, σ), our estimator b1
has a sampling distribution that is normal with mean β1 (the true value of the parameter), and
standard error $\sigma/\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}$. That is:
$$b_1 \sim N\left(\beta_1,\ \frac{\sigma}{\sqrt{SS_{XX}}}\right)$$
We can now make inferences concerning β1. In particular, a $(1-\alpha)100\%$ confidence interval for β1 is:
$$b_1 \pm t_{\alpha/2,n-2}\,\frac{S_e}{\sqrt{SS_{XX}}}$$
For the hosiery mill cost function analysis, we obtain a 95% confidence interval for average unit
variable costs (β1 ). Note that t.025,48−2 = t.025,46 ≈ 2.015, since t.025,40 = 2.021 and t.025,60 = 2.000
(we could approximate this with z.025 = 1.96 as well).
$$2.0055 \pm t_{.025,46}\frac{6.24}{\sqrt{7738.94}} = 2.0055 \pm 2.015(0.0709) = 2.0055 \pm 0.1429 \equiv (1.8626, 2.1484)$$
We are 95% confident that the true average unit variable cost is between $1.86 and $2.15 (this
is the incremental cost of increasing production by one unit, assuming that the production process
is in place).
• $H_0: \beta_1 = \beta_1^0$
• (1) $H_A: \beta_1 \neq \beta_1^0$
  (2) $H_A: \beta_1 > \beta_1^0$
  (3) $H_A: \beta_1 < \beta_1^0$
• T.S.: $t_{obs} = \frac{b_1 - \beta_1^0}{S_e/\sqrt{SS_{XX}}}$
Suppose we want to test whether average monthly production costs increase with monthly
production output. This is testing whether unit variable costs are positive (α = 0.05).
• H0 : β1 = 0 (Mean Monthly production cost is not associated with output)
• HA : β1 > 0 (Mean monthly production cost increases with output)
• T.S.: $t_{obs} = \frac{2.0055 - 0}{6.24/\sqrt{7738.94}} = \frac{2.0055}{0.0709} = 28.29$
Since $t_{obs} = 28.29$ far exceeds the critical value ($t_{.05,46} \approx 1.68$), we reject H0 and conclude that mean monthly production cost increases with output.
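The confidence interval and test statistic above follow the same few arithmetic steps, sketched below in Python with the hosiery mill numbers (the t critical values come from the table):

```python
import math

b1, se, ss_xx = 2.0055, 6.24, 7738.94
sb1 = se / math.sqrt(ss_xx)               # estimated standard error of b1 ~ 0.0709

t_ci = 2.015                              # t_{.025,46} for a 95% CI
lo, hi = b1 - t_ci * sb1, b1 + t_ci * sb1 # ~ (1.863, 2.148)

t_obs = (b1 - 0.0) / sb1                  # ~ 28.29 for H0: beta1 = 0
```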
Note that all we are doing is adding and subtracting the fitted value:
$$(Y_i - \bar{Y}) = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y}).$$
It so happens that, algebraically, the same equality holds once we’ve squared each side of the equation
and summed over the n observed and fitted values. That is,
$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2.$$
Figure 15: Plot of coffee data, fitted equation, and the line Y = 515.4167
These three pieces are called the total, error, and model sums of squares, respectively. We
denote them as SSyy , SSE, and SSR, respectively. We have already seen that SSyy represents the
total variation in the observed responses, and that SSE represents the variation in the observed
responses around the fitted regression equation. That leaves SSR as the amount of the total
variation that is ‘accounted for’ by taking into account the predictor variable X. We can use
this decomposition to test the hypothesis H0 : β1 = 0 vs HA : β1 6= 0. We will also find this
decomposition useful in subsequent sections when we have more than one predictor variable. We
first set up the Analysis of Variance (ANOVA) Table in Table 32. Note that we will have to
make minimal calculations to set this up since we have already computed SSyy and SSE in the
regression analysis.
ANOVA
Source of     Sum of                                         Degrees of    Mean
Variation     Squares                                        Freedom       Square                      F
MODEL         $SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$       1         $MSR = \frac{SSR}{1}$       $F = \frac{MSR}{MSE}$
ERROR         $SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$           n − 2     $MSE = \frac{SSE}{n-2}$
TOTAL         $SS_{YY} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$         n − 1
Table 32: The Analysis of Variance Table for simple linear regression
The procedure of testing for a linear association between the response and predictor variables
using the analysis of variance involves using the F –distribution, which is given in Table A.7 (pp
A-16–A-25) of your text book. This is the same distribution we used in the previous chapter.
The testing procedure is as follows:
1. H0: β1 = 0   HA: β1 ≠ 0 (This will always be a 2–sided test)
2. T.S.: $F_{obs} = \frac{MSR}{MSE}$
3. R.R.: $F_{obs} \geq F_{\alpha,1,n-2}$
Referring back to the coffee sales data, we have already calculated $SS_{YY} = 112772.9$ and $SSE = 26660.4$.
We then also have SSR = SSYY − SSE = 112772.9 − 26660.4 = 86112.5. The Analysis of Variance is given in
Table 33.
ANOVA
Source of     Sum of             Degrees of      Mean
Variation     Squares            Freedom         Square                        F
MODEL         SSR = 86112.5      1               MSR = 86112.5/1 = 86112.5     F = 86112.5/2666.04 = 32.30
ERROR         SSE = 26660.4      12 − 2 = 10     MSE = 26660.4/10 = 2666.04
TOTAL         SSYY = 112772.9    12 − 1 = 11
Table 33: The Analysis of Variance Table for the coffee data example
To test the hypothesis of no linear association between amount of shelf space and mean weekly
coffee sales, we can use the F -test described above. Note that the null hypothesis is that there is
no effect on mean sales from increasing the amount of shelf space. We will use α = .01.
1. H0: β1 = 0   HA: β1 ≠ 0
2. T.S.: $F_{obs} = \frac{MSR}{MSE} = \frac{86112.5}{2666.04} = 32.30$
3. R.R.: $F_{obs} \geq F_{.01,1,10} = 10.04$
4. p-value: P(F > Fobs) = P(F > 32.30) < P(F > 12.83) = .005 (p-value < .005). See p. A-24.
We reject the null hypothesis, and conclude that β1 6= 0. There is an effect on mean weekly sales
when we increase the shelf space.
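Since SSR = SSYY − SSE, the entire ANOVA table follows from two sums of squares. A minimal Python sketch with the coffee-data values:

```python
ss_yy, sse, n = 112772.9, 26660.4, 12

ssr = ss_yy - sse        # 86112.5, model sum of squares
msr = ssr / 1            # 1 model df in simple regression
mse = sse / (n - 2)      # 2666.04
f_obs = msr / mse        # ~ 32.30
```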
For the hosiery mill data, the sums of squares for each source of variation in monthly production
costs and their corresponding degrees of freedom are (from previous calculations):
Total SS – $SS_{YY} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = 32914.06$, $df_{Total} = n - 1 = 47$
Error SS – $SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = 1788.51$, $df_E = n - 2 = 46$
Model SS – $SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 = SS_{YY} - SSE = 32914.06 - 1788.51 = 31125.55$, $df_R = 1$
The Analysis of Variance is given in Table 34.
To test whether there is a linear association between mean monthly costs and monthly produc-
tion output, we conduct the F -test (α = 0.05).
ANOVA
Source of     Sum of             Degrees of      Mean
Variation     Squares            Freedom         Square                         F
MODEL         SSR = 31125.55     1               MSR = 31125.55/1 = 31125.55    F = 31125.55/38.88 = 800.55
ERROR         SSE = 1788.51      48 − 2 = 46     MSE = 1788.51/46 = 38.88
TOTAL         SSYY = 32914.06    48 − 1 = 47
Table 34: The Analysis of Variance Table for the hosiery mill cost example
1. H0: β1 = 0   HA: β1 ≠ 0
2. T.S.: $F_{obs} = \frac{MSR}{MSE} = \frac{31125.55}{38.88} = 800.55$
3. R.R.: $F_{obs} \geq F_{.05,1,46} \approx 4.05$
4. p-value: P(F > Fobs) = P(F > 800.55) ≪ P(F > 8.83) = .005 (p-value ≪ .005). See p. A-24 (with 40 denominator df).
We reject H0 and conclude that mean monthly production cost is linearly associated with production output.
For the coffee data, we can calculate $r^2 = 1 - \frac{SSE}{SS_{YY}} = \frac{SSR}{SS_{YY}}$ using the values of $SS_{YY}$ and SSE we
have previously obtained:
$$r^2 = 1 - \frac{26660.4}{112772.9} = \frac{86112.5}{112772.9} = .7636$$
Thus, over 3/4 of the variation in sales is “explained” by the model using shelf space to predict
sales.
$$\hat{Y}_0 \sim N\left(\beta_0 + \beta_1 X_0,\ \sigma\sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}\right).$$
Note that the standard error of the estimate is smallest at X0 = X, that is at the mean of the
sampled levels of the predictor variable. The standard error increases as the value X0 goes away
from this mean.
For instance, our marketer may wish to estimate the mean sales when she has 6 feet of shelf space,
or 7 feet, or 4 feet. She may also wish to obtain a confidence interval for the mean at these levels of X.
Suppose our marketer wants to compute 90% confidence intervals for the mean weekly sales at
X = 4, 6, and 7 feet, respectively (these are not simultaneous confidence intervals as were computed
based on Tukey’s Method previously). Each interval is of the form
$$(b_0 + b_1 X_0) \pm t_{\alpha/2,n-2}\, S_e\sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SS_{XX}}},$$
and will depend on $t_{\alpha/2,n-2} = t_{.05,10} = 1.812$ and $\bar{X} = 6$. These intervals are:
$$(307.9167 + 34.5833(4)) \pm 1.812(51.63)\sqrt{\frac{1}{12} + \frac{(4-6)^2}{72}} = 446.250 \pm 93.554\sqrt{.1389} = 446.250 \pm 34.866 \equiv (411.384, 481.116)$$
$$(307.9167 + 34.5833(6)) \pm 1.812(51.63)\sqrt{\frac{1}{12} + \frac{(6-6)^2}{72}} = 515.417 \pm 93.554\sqrt{.0833} = 515.417 \pm 27.001 \equiv (488.416, 542.418)$$
$$(307.9167 + 34.5833(7)) \pm 1.812(51.63)\sqrt{\frac{1}{12} + \frac{(7-6)^2}{72}} = 550.000 \pm 93.554\sqrt{.0972} = 550.000 \pm 29.171 \equiv (520.829, 579.171)$$
Notice that the interval is the narrowest at X0 = 6. Figure 16 is a computer generated plot
of the data, the fitted equation and the confidence limits for the mean weekly coffee sales at each
value of X. Note how the limits get wider as X goes away from X = 6. Would these intervals be
wider or narrower, had they been 95% confidence intervals?
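The three intervals above come from one formula evaluated at different X0. A minimal Python sketch using the coffee-data quantities (90% level, t.05,10 = 1.812):

```python
import math

b0, b1, se = 307.9167, 34.5833, 51.63
ss_xx, xbar, n, t = 72.0, 6.0, 12, 1.812   # t_{.05,10} for 90% confidence

def mean_ci(x0):
    fit = b0 + b1 * x0
    half = t * se * math.sqrt(1.0 / n + (x0 - xbar) ** 2 / ss_xx)
    return fit - half, fit + half

for x0 in (4, 6, 7):
    print(x0, mean_ci(x0))   # narrowest at x0 = 6, the mean of the sampled X values
```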
Figure 16: Plot of coffee data, fitted equation, and 90% confidence limits for the mean
Suppose the plant manager is interested in mean costs among months where output is 30,000
items produced (X0 = 30). She wants a 95% confidence interval for this true unknown mean.
Recalling that $\hat{Y}_0 = 3.1274 + 2.0055(30) = 63.29$, the interval is:
$$63.29 \pm 2.015(6.24)\sqrt{\frac{1}{48} + \frac{(30 - 31.0673)^2}{7738.94}} = 63.29 \pm 1.82 \equiv (61.47, 65.11)$$
She is 95% confident that mean monthly costs among such months are between $61,470 and $65,110.
Figure 17: Plot of hosiery mill cost data, fitted equation, and 95% confidence limits for the mean
First, suppose you know the parameters β0 and β1. Then you know that the response variable,
for a fixed level of the predictor variable (X = X0 ), is normally distributed with mean E(Y |X0 ) =
β0 + β1 X0 and standard deviation σ. We know from previous work with the normal distribution
that approximately 95% of the measurements lie within 2 standard deviations of the mean. So if we
know β0 , β1 , and σ, we would be very confident that our response would lie between (β0 +β1 X0 )−2σ
and (β0 + β1 X0 ) + 2σ. Figure 18 represents this idea.
[Figure 18: Distribution of individual responses around the mean β0 + β1X0, with limits 2σ above and below the mean]
We rarely, if ever, know these parameters, and we must estimate them as we have in previous
sections. There is uncertainty about the mean response at the specified level, X0, of the predictor
variable. We do, however, know how to obtain an interval that we are very confident contains the
true mean β0 + β1 X0 . If we apply the method of the previous paragraph to all ‘believable’ values of
this mean we can obtain a prediction interval that we are very confident will contain our future
response. Since σ is being estimated as well, instead of 2 standard deviations, we must use tα/2,n−2
estimated standard deviations. Figure 19 portrays this idea.
[Figure 19: Distributions of the response over the range of ‘believable’ values of the mean at X0]
Note that all we really need are the two extreme distributions from the confidence interval for
the mean response. If we use the method from the last paragraph on each of these two distributions,
we can obtain the prediction interval by choosing the left–hand point of the ‘lower’ distribution
and the right–hand point of the ‘upper’ distribution. This is displayed in Figure 20.
Figure 20: Upper and lower prediction limits when we have estimated the mean
The general formula for a (1 − α)100% prediction interval of a future response is similar to the
confidence interval for the mean at X0 , except that it is wider to reflect the variation in individual
responses. The formula is:
$$(b_0 + b_1 X_0) \pm t_{\alpha/2,n-2}\, S_e\sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SS_{XX}}}.$$
For the coffee example, suppose the marketer wishes to predict next week’s sales when the coffee
will have 5 feet of shelf space. She would like to obtain a 95% prediction interval for the number of
bags to be sold. First, we observe that $t_{.025,10} = 2.228$; all other relevant numbers can be found in
the previous example. The prediction interval is then:
$$(307.9167 + 34.5833(5)) \pm 2.228(51.63)\sqrt{1 + \frac{1}{12} + \frac{(5-6)^2}{72}} = 480.833 \pm 115.032\sqrt{1.0972}$$
$$= 480.833 \pm 120.49 \equiv (360.34, 601.32).$$
This interval is relatively wide, reflecting the large variation in weekly sales at each level of x. Note
that just as the width of the confidence interval for the mean response depends on the distance
between X0 and X, so does the width of the prediction interval. This should be of no surprise,
considering the way we set up the prediction interval (see Figure 19 and Figure 20). Figure 21
shows the fitted equation and 95% prediction limits for this example.
It must be noted that a prediction interval for a future response is only valid if conditions are
similar when the response occurs as when the data was collected. For instance, if the store is being
boycotted by a bunch of animal rights activists for selling meat next week, our prediction interval
will not be valid.
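The prediction interval differs from the interval for the mean only by the extra “1 +” under the square root. A minimal Python sketch for the X0 = 5 example (95% level, t.025,10 = 2.228):

```python
import math

b0, b1, se = 307.9167, 34.5833, 51.63
ss_xx, xbar, n, t = 72.0, 6.0, 12, 2.228   # t_{.025,10} for 95% prediction

def pred_interval(x0):
    fit = b0 + b1 * x0
    half = t * se * math.sqrt(1.0 + 1.0 / n + (x0 - xbar) ** 2 / ss_xx)
    return fit - half, fit + half

print(pred_interval(5))   # roughly (360.3, 601.3)
```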
Figure 21: Plot of coffee data, fitted equation, and 95% prediction limits for a single response
Suppose the plant manager knows based on purchase orders that this month, her plant will
produce 30,000 items (X0 = 30.0). She would like to predict what the plant’s production costs will
be. She obtains a 95% prediction interval for this month’s costs.
$$3.1274 + 2.0055(30) \pm 2.015(6.24)\sqrt{1 + \frac{1}{48} + \frac{(30 - 31.0673)^2}{7738.94}} \equiv 63.29 \pm 2.015(6.24)\sqrt{1.0210}$$
$$\equiv 63.29 \pm 12.70 \equiv (50.59, 75.99)$$
She predicts that the costs for this month will be between $50,590 and $75,990. This interval is
much wider than the interval for the mean, since it includes random variation in monthly costs
around the mean. A plot of the 95% prediction bands is given in Figure 22.
Figure 22: Plot of hosiery mill cost data, fitted equation, and 95% prediction limits for an individual
outcome
For the coffee data, we can calculate $r = \frac{SS_{XY}}{\sqrt{SS_{XX}\, SS_{YY}}}$ using the values of $SS_{XY}$, $SS_{XX}$, and $SS_{YY}$ we have previously
obtained:
$$r = \frac{2490}{\sqrt{(72)(112772.9)}} = \frac{2490}{2849.5} = .8738$$
Sum of Mean
Source DF Squares Square F Value Prob>F
Model 1 86112.50000 86112.50000 32.297 0.0002
Error 10 26662.41667 2666.24167
C Total 11 112774.91667
     Upper95%                       Upper95%
Obs  Predict    Residual      Obs  Predict    Residual
1 538.1 9.3333 7 635.2 -81.4167
2 538.1 0.3333 8 635.2 54.5833
3 538.1 31.3333 9 745.6 10.8333
4 538.1 -65.6667 10 745.6 -59.1667
5 635.2 10.5833 11 745.6 -29.1667
6 635.2 65.5833 12 745.6 52.8333
13 Lecture 13 — Multiple Regression I
Textbook Sections: 13.1,13.2
Problems: 13.3,5,7,8,9,13
In most situations, we have more than one independent variable. While the amount of math
can become overwhelming and involves matrix algebra, many computer packages exist that will
provide the analysis for you. In this chapter, we will analyze the data by interpreting the results
of a computer program. It should be noted that simple regression is a special case of multiple
regression, so most concepts we have already seen apply here.
We make the same assumptions as before in terms of ε, specifically that they are indepen-
dent and normally distributed with mean 0 and standard deviation σ. That is, we are assuming
that Y , at a given set of levels of the k independent variables (X1 , . . . , Xk ) is normal with mean
E[Y |X1 , . . . , Xk ] = β0 + β1 X1 + · · · + βk Xk and standard deviation σ. Just as before, β0 , β1 , . . . , βk ,
and σ are unknown parameters that must be estimated from the sample data. The parameters βi
represent the change in the mean response when the ith predictor variable changes by 1 unit and
all other predictor variables are held constant.
In this model:
• Y — Random outcome of the dependent variable
• β0 — Regression constant (E(Y |X1 = · · · = Xk = 0) if appropriate)
• βi — Partial regression coefficient for variable Xi (change in E(Y) when Xi increases by 1
unit and all other X’s are held constant)
• ε — Random error term, assumed (as before) that ε ∼ N (0, σ)
• k — The number of independent variables
Pn
By the method of least squares (choosing the bi values that minimize SSE = i=1 (Yi − Ŷi )2 ),
we obtain the fitted equation:
Ŷ = b0 + b1 X1 + b2 X2 + · · · + bk Xk
and our estimate of σ:
$$S_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n-k-1}} = \sqrt{\frac{SSE}{n-k-1}}$$
The Analysis of Variance table will be very similar to what we used previously, with the only
adjustments being in the degrees of freedom. Table 35 shows the values for the general case when
there are k predictor variables. We will rely on computer outputs to obtain the Analysis of Variance
and the estimates $b_0, b_1, \ldots, b_k$.
ANOVA
Source of     Sum of                                         Degrees of      Mean
Variation     Squares                                        Freedom         Square                        F
MODEL         $SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$       k           $MSR = \frac{SSR}{k}$         $F = \frac{MSR}{MSE}$
ERROR         $SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$           n − k − 1   $MSE = \frac{SSE}{n-k-1}$
TOTAL         $SS_{YY} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$         n − 1
Table 35: The Analysis of Variance Table for multiple regression
13.2 Testing for Association Between the Response and the Full Set of Predictor
Variables
To see if the set of predictor variables is useful in predicting the response variable, we will test
H0 : β1 = β2 = . . . = βk = 0. Note that if H0 is true, then the mean response does not depend
on the levels of the predictor variables. We interpret this to mean that there is no association
between the response variable and the set of predictor variables. To test this hypothesis, we use
the following method:
1. H0 : β1 = β2 = · · · = βk = 0
2. HA : Not every βi = 0
3. T.S.: $F_{obs} = \frac{MSR}{MSE}$
4. R.R.: $F_{obs} \geq F_{\alpha,k,n-k-1}$
5. p-value: P(F > Fobs) (You can only get bounds on this from the tables, but computer outputs report it
exactly)
The computer automatically performs this test and provides you with the p-value of the test, so
in practice you really don’t need to obtain the rejection region explicitly to make the appropriate
conclusion. However, we will do so in this course to help reinforce the relationship between the
test’s decision rule and the p-value. Recall that we reject the null hypothesis if the p-value is less
than α.
13.3 Testing Whether Individual Predictor Variables Help Predict the Re-
sponse
If we reject the previous null hypothesis and conclude that not all of the βi are zero, we may wish
to test whether individual βi are zero. Note that if we fail to reject the null hypothesis that βi is
zero, we can drop the predictor Xi from our model, thus simplifying the model. Note that this
test is testing whether Xi is useful given that we are already fitting a model containing
the remaining k − 1 predictor variables. That is, does this variable contribute anything once
we’ve taken into account the other predictor variables. These tests are t-tests, where we compute
$t_{obs} = \frac{b_i}{S_{b_i}}$, just as we did in the section on making inferences concerning β1 in simple regression. The
procedure for testing whether βi = 0 (the ith predictor variable does not contribute to predicting
the response given the other k − 1 predictor variables are in the model) is as follows:
• H0 : βi = 0 (Y is not associated with Xi after controlling for all other independent variables)
• (1) HA : βi 6= 0
(2) HA : βi > 0
(3) HA : βi < 0
• T.S.: $t_{obs} = \frac{b_i}{S_{b_i}}$
Computer packages print the test statistic and the p-value based on the two-sided test, so conducting
this test is simply a matter of interpreting the results of the computer output.
The procedure for obtaining the numeric elements of the test is as follows:
1. Fit the model under the null hypothesis (βk−q+1 = βk−q+2 = · · · = βk = 0). It will include
only the first k − q predictor variables. This is referred to as the Reduced model. Obtain
the error sum of squares (SSE(R)) and the error degrees of freedom dfE (R) = n − (k − q) − 1.
2. Fit the model with all k predictors. This is referred to as the Complete or Full model
(and was used for the F -test for all regression coefficients). Obtain the error sum of squares
(SSE(F )) and the error degrees of freedom (dfE (F ) = n − k − 1).
By definition of the least squares criterion, we know that SSE(R) ≥ SSE(F). We now obtain the
test statistic:
$$TS: F_{obs} = \frac{\dfrac{SSE(R) - SSE(F)}{(n-(k-q)-1) - (n-k-1)}}{\dfrac{SSE(F)}{n-k-1}} = \frac{(SSE(R) - SSE(F))/q}{MSE(F)}$$
and our rejection region is values of Fobs ≥ Fα,q,n−k−1 .
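The partial F statistic is simple arithmetic once the two error sums of squares are in hand. A minimal Python sketch, shown with the mortgage-rate numbers used later in these notes (SSE(R) = 0.11593, SSE(F) = 0.10980, n = 18, k = 6, q = 3):

```python
def partial_f(sse_r, sse_f, n, k, q):
    # (SSE(R) - SSE(F)) / q, divided by the MSE of the full (complete) model
    return ((sse_r - sse_f) / q) / (sse_f / (n - k - 1))

print(partial_f(0.11593, 0.10980, 18, 6, 3))   # ~ 0.20, compared with F_{.05,3,11}
```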
Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε,
where X1 is the latitude of the location, X2 is the longitude, and X3 is its elevation (in feet). As
before, we assume that ε ∼ N (0, σ). Note that higher latitudes mean farther north and higher
longitudes mean farther west.
To estimate the parameters β0, β1, β2, β3, and σ, they gather data for a sample of n = 16
counties and fit the model described above. The data, including one other variable, are given in
Table 36.
COUNTY LATITUDE LONGITUDE ELEV TEMP INCOME
HARRIS 29.767 95.367 41 56 24322
DALLAS 32.850 96.850 440 48 21870
KENNEDY 26.933 97.800 25 60 11384
MIDLAND 31.950 102.183 2851 46 24322
DEAF SMITH 34.800 102.467 3840 38 16375
KNOX 33.450 99.633 1461 46 14595
MAVERICK 28.700 100.483 815 53 10623
NOLAN 32.450 100.533 2380 46 16486
ELPASO 31.800 106.40 3918 44 15366
COLLINGTON 34.850 100.217 2040 41 13765
PECOS 30.867 102.900 3000 47 17717
SHERMAN 36.350 102.083 3693 36 19036
TRAVIS 30.300 97.700 597 52 20514
ZAPATA 26.900 99.283 315 60 11523
LASALLE 28.450 99.217 459 56 10563
CAMERON 25.900 97.433 19 62 12931
The results of the Analysis of Variance are given in Table 37 and the parameter estimates,
estimated standard errors, t-statistics and p-values are given in Table 38. Full computer programs
and printouts are given as well.
We see from the Analysis of Variance that at least one of the variables (latitude, longitude,
elevation) is related to the response variable temperature. This can be seen by setting up the test H0: β1 =
β2 = β3 = 0 as described previously. The elements of this test, provided by the computer output,
are detailed below, assuming α = .05.
1. H0 : β1 = β2 = β3 = 0
ANOVA
Source of     Sum of             Degrees of          Mean
Variation     Squares            Freedom             Square                        F          p-value
MODEL         SSR = 934.328      k = 3               MSR = 934.328/3 = 311.443     491.235    .0001
ERROR         SSE = 7.609        16 − 3 − 1 = 12     MSE = 7.609/12 = 0.634
TOTAL         SSYY = 941.938     n − 1 = 15
Table 37: The Analysis of Variance Table for the Texas temperature data

                                   t FOR H0:               STANDARD ERROR
PARAMETER          ESTIMATE        βi = 0      P-VALUE     OF ESTIMATE
INTERCEPT (β0)     b0 = 109.25887    36.68       .0001       2.97857
LATITUDE (β1)      b1 = −1.99323    −14.61       .0001       0.13639
LONGITUDE (β2)     b2 = −0.38471     −1.68       .1182       0.22858
ELEVATION (β3)     b3 = −0.00096     −1.68       .1181       0.00057
Table 38: Parameter estimates and tests of hypotheses for individual parameters
2. HA: Not all βi = 0
3. T.S.: $F_{obs} = \frac{MSR}{MSE} = \frac{311.443}{0.634} = 491.235$
4. R.R.: $F_{obs} > F_{.05,3,12} = 3.49$ (This is not provided on the output; the p-value takes its place).
5. p-value: P(F > 491.235) = .0001 (Actually it is less than .0001, but this is the smallest p-value
the computer will print).
We conclude, with a high degree of confidence, that at least one of these three variables is related to the response
variable temperature.
We also see from the individual t-tests that latitude is useful in predicting temperature, even
after taking into account the other predictor variables.
The formal test (based on the α = 0.05 significance level) for determining whether temperature is
associated with latitude after controlling for longitude and elevation is given here:
• H0 : β1 = 0 (TEMP (Y ) is not associated with LAT (X1 ) after controlling for LONG (X2 )
and ELEV (X3 ))
• HA: β1 ≠ 0 (TEMP is associated with LAT after controlling for LONG and ELEV)
• T.S.: $t_{obs} = \frac{b_1}{S_{b_1}} = \frac{-1.99323}{0.136399} = -14.614$
• R.R.: $|t_{obs}| \geq t_{.025,12} = 2.179$
Thus, we can conclude that there is an association between temperature and latitude, controlling
for longitude and elevation. Note that the coefficient is negative, so we conclude that temperature
decreases as latitude increases (for a given level of longitude and elevation).
Note from Table 38 that neither the coefficient for LONGITUDE (X2) nor that for ELEVATION (X3)
is significant at the α = 0.05 significance level (p-values are .1182 and .1181, respectively). Recall
that these test whether each term is 0, controlling for LATITUDE and the other term.
Before concluding that neither LONGITUDE (X2) nor ELEVATION (X3) is a useful predictor,
controlling for LATITUDE, we will test whether they are both simultaneously 0, that is:
H0: β2 = β3 = 0 vs HA: β2 ≠ 0 and/or β3 ≠ 0
Since the partial F statistic, Fobs = 42.055, far exceeds F.05,2,12 = 3.89, we reject H0 and conclude that LONGITUDE (X2) and/or ELEVATION
(X3) are associated with TEMPERATURE (Y), after controlling for LATITUDE (X1).
Table 39: The Analysis of Variance Table for Texas data – without LONGITUDE
We see this by observing that the t-statistic for testing H0 : β1 = 0 (no latitude effect on
temperature) is −17.65, corresponding to a p-value of .0001, and the t-statistic for testing H0 :
β2 = 0 (no elevation effect) is −8.41, also corresponding to a p-value of .0001. Further note
that both estimates are negative, reflecting that as elevation and latitude increase, temperature
decreases. That should not come as any big surprise.
t FOR H0 : STANDARD ERROR
PARAMETER ESTIMATE βi =0 P-VALUE OF ESTIMATE
INTERCEPT (β0 ) b0 =63.45485 36.68 .0001 0.48750
ELEVATION (β1 ) b1 = −0.00185 −8.41 .0001 0.00022
LATITUDE (β2 ) b2 = −1.83216 −17.65 .0001 0.10380
Table 40: Parameter estimates and tests of hypotheses for individual parameters – without LON-
GITUDE
The magnitudes of the estimated coefficients are quite different, which may make you believe
that one predictor variable is more important than the other. This is not necessarily true, because
the ranges of their levels are quite different (a 1 unit change in latitude represents a change of
approximately 69 miles, while a unit change in elevation is 1 foot), and recall that βi represents the
change in the mean response when variable Xi is increased by 1 unit.
The data corresponding to the 16 locations in the sample are plotted in Figure 23 and the fitted
equation for the model that does not include LONGITUDE is plotted in Figure 24. The fitted
equation is a plane in three dimensions.
[Figure 23: Three–dimensional plot of observed TEMP by LAT and ELEV]
[Figure 24: Plot of the fitted temperature equation (model without LONGITUDE) by LAT and ELEV]
The next example examines mortgage rates (Y) across n = 18 metro areas as a function of k = 6 predictor variables, including:
X1 – Average Loan Value / Mortgage Value Ratio (Higher X1 means lower down payment and
higher risk to lender).
X2 – Road Distance from Boston (Higher X2 means further from Northeast, where most capital
was at the time, and higher costs of capital).
X3 – Savings per Annual Dwelling Unit Constructed (Higher X3 means higher relative credit
surplus, and lower costs of capital).
X4 – Savings per Capita (does not adjust for new housing demand).
The data, fitted values, and residuals are given in Table 41. The Analysis of Variance is given
in Table 42. The regression coefficients, test statistics, and p-values are given in Table 43.
Show that the fitted value for Los Angeles is 6.19, based on the fitted equation, and that the
residual is -0.02.
Based on the large F -statistic, and its small corresponding P -value, we conclude that this set of
predictor variables is associated with the mortgage rate. That is, at least one of these independent
variables is associated with Y .
Based on the t-tests, while none are strictly significant at the α = 0.05 level, there is some
evidence that X1 (Loan Value/Mortgage Value, P = .0515), X3 (Savings per Unit Constructed,
P = .0593), and to a lesser extent, X4 (Savings per Capita, P = .1002) are helpful in predicting
mortgage rates. We can fit a reduced model, with just these three predictors, and test whether we
can simultaneously drop X2 , X5 , and X6 from the model. That is:
H0 : β2 = β5 = β6 = 0 vs HA : β2 6= 0 and/or β5 6= 0 and/or β6 6= 0
n = 18 k=6 q=3
Table 41: Data and fitted values for mortgage rate multiple regression example.
ANOVA
Source of     Sum of            Degrees of          Mean
Variation     Squares           Freedom             Square                          F          p-value
MODEL         SSR = 0.73877     k = 6               MSR = 0.73877/6 = 0.12313       12.33      .0003
ERROR         SSE = 0.10980     18 − 6 − 1 = 11     MSE = 0.10980/11 = 0.00998
TOTAL         SSYY = 0.84858    n − 1 = 17
Table 42: The Analysis of Variance Table for Mortgage rate regression analysis
STANDARD
PARAMETER ESTIMATE ERROR t-statistic P -value
INTERCEPT (β0 ) b0 =4.28524 0.66825 6.41 .0001
X1 (β1 ) b1 = 0.02033 0.00931 2.18 .0515
X2 (β2 ) b2 = 0.000014 0.000047 0.29 .7775
X3 (β3 ) b3 = −0.00158 0.000753 -2.10 .0593
X4 (β4 ) b4 = 0.000202 0.000112 1.79 .1002
X5 (β5 ) b5 = 0.00128 0.00177 0.73 .4826
X6 (β6 ) b6 = 0.000236 0.00230 0.10 .9203
Table 43: Parameter estimates and tests of hypotheses for individual parameters – Mortgage rate
regression analysis
ANOVA
Source of     Sum of            Degrees of              Mean
Variation     Squares           Freedom                 Square                          F          p-value
MODEL         SSR = 0.73265     k − q = 3               MSR = 0.73265/3 = 0.24422       29.49      .0001
ERROR         SSE = 0.11593     18 − 3 − 1 = 14         MSE = 0.11593/14 = 0.00828
TOTAL         SSYY = 0.84858    n − 1 = 17
Table 44: The Analysis of Variance Table for Mortgage rate regression analysis (Reduced Model)
STANDARD
PARAMETER ESTIMATE ERROR t-statistic P -value
INTERCEPT (β0 ) b0 =4.22260 0.58139 7.26 .0001
X1 (β1 ) b1 = 0.02229 0.00792 2.81 .0138
X3 (β3 ) b3 = −0.00186 0.00041778 -4.46 .0005
X4 (β4 ) b4 = 0.000225 0.000074 3.03 .0091
Table 45: Parameter estimates and tests of hypotheses for individual parameters – Mortgage rate
regression analysis (Reduced Model)
Next, we fit the reduced model, with β2 = β5 = β6 = 0. We get the Analysis of Variance in
Table 44 and parameter estimates in Table 45.
Note first that all three regression coefficients are significant now at the α = 0.05 significance
level. Also, our residual standard error, $S_e = \sqrt{MSE}$, has decreased (from 0.09991 to 0.09100). This
implies we have lost very little predictive ability by dropping X2, X5, and X6 from the model. Now
to formally test whether these three predictor variables’ regression coefficients are simultaneously
to formally test whether these three predictor variables’ regression coefficients are simultaneously
0 (with α = 0.05):
• H0 : β2 = β5 = β6 = 0
• HA : β2 6= 0 and/or β5 6= 0 and/or β6 6= 0
• T.S.: $F_{obs} = \frac{(0.11593 - 0.10980)/3}{0.00998} = \frac{0.00204}{0.00998} = 0.20$
• R.R.: $F_{obs} \geq F_{.05,3,11} = 3.59$
We fail to reject H0 , and conclude that none of X2 , X5 , or X6 are associated with mortgage
rate, after controlling for X1 , X3 , and X4 .
Source: Lord, J.D. and C.D. Lynds (1981), “The Use of Regression Models in Store Location
Research: A Review and Case Study,” Akron Business and Economic Review, Summer, 13-19.
Table 46: Regression coefficients and standard errors for liquor store sales study
a) Do any of these variables fail to be associated with store sales after controlling for the others?
b) Consider the signs of the significant regression coefficients. What do they imply?
One problem with R2 is that when we continually add independent variables to a regression
model, it continually increases (or at least, never decreases), even when the new variable(s) add
little or no predictive power. Since we are trying to fit the simplest (most parsimonious) model
that explains the relationship between the set of independent variables and the dependent variable,
we need a measure that penalizes models that contain useless or redundant independent variables.
This penalization takes into account that by including useless or redundant predictors, we are
decreasing the error degrees of freedom ($df_E = n - k - 1$). A second measure, which does not carry the
‘proportion of variation explained’ interpretation but is useful for comparing models of varying degrees of
complexity, is Adjusted-$R^2$:
$$\text{Adjusted-}R^2 = 1 - \frac{SSE/(n-k-1)}{SS_{YY}/(n-1)} = 1 - \left(\frac{n-1}{n-k-1}\right)\frac{SSE}{SS_{YY}}$$
In this section, we will look at three special cases that are frequently used methods of multiple
regression. The ideas, such as the Analysis of Variance, tests of hypotheses, and parameter estimates,
are exactly the same as before, and we will concentrate on their interpretation through specific
examples. The three special cases are:
1. Polynomial regression
2. Regression models with dummy (indicator) variables
3. Regression models with interaction terms
In polynomial regression, powers of a predictor variable enter the model. For example, a health
club owner may model the number of people attending her club in a day as a quadratic function of
the number of exercise machines X available that day:
$$y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon.$$
Again, we assume that ε ∼ N(0, σ). In this model, the number of people attending in a day when
there are X machines is normally distributed with mean β0 + β1X + β2X² and standard deviation
σ. Note that we are no longer saying that the mean is linearly related to X, but rather that
it is approximately quadratically related to X (curved). Suppose she leases varying numbers of
machines over a period of n = 12 Wednesdays (always advertising how many machines will be there
on the following Wednesday), and observes the number of people attending the club each day, and
obtaining the data in Table 47.
Table 48: The Analysis of Variance Table for health club data
The first test of hypothesis is whether the attendance is associated with the number of machines.
This is a test of H0 : β1 = β2 = 0. If the null hypothesis is true, that implies mean daily attendance
is unrelated to the number of machines, thus the club owner would purchase very few (if any) of the
machines. As before this test is the F -test from the Analysis of Variance table, which we conduct
here at α = .05.
1. H0 : β1 = β2 = 0
t FOR H0 : STANDARD ERROR
PARAMETER ESTIMATE βi =0 P-VALUE OF ESTIMATE
INTERCEPT (β0 ) b0 =72.0500 2.04 .0712 35.2377
MACHINES (β1 ) b1 = 199.7625 8.67 .0001 23.0535
MACHINES SQ (β2 ) b2 = −13.6518 −4.23 .0022 3.2239
Table 49: Parameter estimates and tests of hypotheses for individual parameters
2. HA : Not both βi = 0
3. T.S.: $F_{obs} = \frac{MSR}{MSE} = \frac{196966.56}{776.06} = 253.80$
4. R.R.: Fobs > F2,9,.05 = 4.26 (This is not provided on the output, the p-value takes the place
of it).
5. p-value: P (F > 253.80) = .0001 (Actually it is less than .0001, but this is the smallest p-value
the computer will print).
Another test with an interesting interpretation is H0 : β2 = 0. This is testing the hypothesis
that the mean increases linearly with X (since if β2 = 0 this becomes the simple regression model
(refer back to the coffee data example)). The t-test in Table 49 for this hypothesis has a test
statistic $t_{obs} = -4.23$, which corresponds to a p-value of .0022; since this is below .05,
we reject H0 and conclude β2 ≠ 0. Since b2 is negative, we conclude that β2 is negative,
which is in agreement with her theory that once you get to a certain number of machines, it does
not help to keep adding new machines. This is the idea of ‘diminishing returns’. Figure 25 shows
the actual data and the fitted equation $\hat{Y} = 72.0500 + 199.7625X - 13.6518X^2$.
Figure 25: Plot of the data and fitted equation for health club example
Suppose, for example, that Y is annual health care expenditure, X1 is age, and X2 is a dummy
variable, with X2 = 1 for women and X2 = 0 for men. The model is
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon.$$
Note that for women of age X1, the mean expenditure is $E[Y|X_1, 1] = \beta_0 + \beta_1 X_1 + \beta_2(1) = (\beta_0 + \beta_2) + \beta_1 X_1$,
while for men of age X1, the mean expenditure is $E[Y|X_1, 0] = \beta_0 + \beta_1 X_1 + \beta_2(0) = \beta_0 + \beta_1 X_1$.
This model allows for different means for men and women, but requires them to have the same slope
(we will see a more general case in the next section). In this case the interpretation of β2 = 0 is
that the means are the same for both sexes; this is a hypothesis a health care professional may wish
to test in a study. In this example the variable sex had two levels, so we had to create 2 − 1 = 1
dummy variable; now consider a second example.
Example 14.2
We would like to see if annual per capita clothing expenditure is related to annual per capita
income in cities across the U.S. Further, we would like to see if there are any differences in the means
across the 4 regions (Northeast, South, Midwest, and West). Since the variable region has 4 levels,
we will create 3 dummy variables X2, X3, and X4 as follows (we leave X1 to represent the predictor
variable per capita income):
$$X_2 = \begin{cases} 1 & \text{if region=South} \\ 0 & \text{otherwise} \end{cases} \qquad
X_3 = \begin{cases} 1 & \text{if region=Midwest} \\ 0 & \text{otherwise} \end{cases} \qquad
X_4 = \begin{cases} 1 & \text{if region=West} \\ 0 & \text{otherwise} \end{cases}$$
Note that cities in the Northeast have X2 = X3 = X4 = 0, while cities in other regions will have
either X2 , X3 , or X4 being equal to 1. Northeast cities act like males did in the previous example.
The data are given in Table 50.
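The dummy coding itself is mechanical, as in this minimal Python sketch (Northeast is the baseline, so all three dummies are 0 there):

```python
def region_dummies(region):
    # One fewer dummy than the number of levels; Northeast is the baseline.
    return {"X2": 1 if region == "South" else 0,
            "X3": 1 if region == "Midwest" else 0,
            "X4": 1 if region == "West" else 0}

print(region_dummies("Northeast"))   # {'X2': 0, 'X3': 0, 'X4': 0}
print(region_dummies("Midwest"))     # {'X2': 0, 'X3': 1, 'X4': 0}
```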
The Analysis of Variance is given in Table 51, and the parameter estimates and standard errors
are given in Table 52.
Note that we would fail to reject H0: β1 = β2 = β3 = β4 = 0 at the α = .05 significance level if we looked
only at the F-statistic and its p-value (Fobs = 2.93, p-value = .0562). This would lead us to conclude
that there is no association between the predictor variables income and region and the response
variable clothing expenditures. This is where you need to be careful when using multiple regression
with many predictor variables. Look at the test of H0 : β1 = 0, based on the t-test in Table 52.
Here we observe tobs =3.11, with a p-value of .0071. We thus conclude β1 6= 0, and that clothing
expenditures is related to income, as we would expect. However, we do fail to reject H0 : β2 = 0, H0 :
β3 = 0,and H0 : β4 = 0, so we fail to observe any differences among the regions in terms of clothing
PER CAPITA INCOME & CLOTHING EXPENDITURES (1990)
Income Expenditure
Metro Area Region X1 Y X2 X3 X4
New York City Northeast 25405 2290 0 0 0
Philadelphia Northeast 21499 2037 0 0 0
Pittsburgh Northeast 18827 1646 0 0 0
Boston Northeast 24315 1659 0 0 0
Buffalo Northeast 17997 1315 0 0 0
Atlanta South 20263 2108 1 0 0
Miami/Ft Laud South 19606 1587 1 0 0
Baltimore South 21461 1978 1 0 0
Houston South 19028 1589 1 0 0
Dallas/Ft Worth South 19821 1982 1 0 0
Chicago Midwest 21982 2108 0 1 0
Detroit Midwest 20595 1262 0 1 0
Cleveland Midwest 19640 2043 0 1 0
Minneapolis/St Paul Midwest 21330 1816 0 1 0
St Louis Midwest 20200 1340 0 1 0
Seattle West 21087 1667 0 0 1
Los Angeles West 20691 2404 0 0 1
Portland West 18938 1440 0 0 1
San Diego West 19588 1849 0 0 1
San Fran/Oakland West 25037 2556 0 0 1
ANOVA
Source of Sum of Degrees of Mean
Variation Squares Freedom Square F p-value
MODEL 1116419.0 4 279104.7 2.93 .0562
ERROR 1426640.2 15 95109.3
TOTAL 2543059.2 19
Table 51: The Analysis of Variance Table for clothes expenditure data
Table 52: Parameter estimates and tests of hypotheses for individual parameters
expenditures after ‘adjusting’ for the variable income. Figure 26 and Figure 27 show the original
data using region as the plotting symbol and the 4 fitted equations corresponding to the 4 regions.
Recall that the fitted equation is Ŷ = −657.428 + 0.113X1 + 237.494X2 + 21.691X3 + 254.992X4 ,
and each of the regions has a different set of levels of variables X2 , X3 , and X4 .
[Figure 26: Clothing expenditures (Y) versus per capita income (X1), plotted by region symbol (N = Northeast, S = South, M = Midwest, W = West)]
[Figure 27: The four fitted equations (one per region) for clothing expenditures versus income]
We will not provide the Analysis of Variance table for this example due to the magnitude of
the numbers, and the fact that we are certain of sex and year effects after one look at the data.
However, we do note that R2 = .991555, showing that our model does account for much of the
variation in reported cases. Table 54 provides the parameter estimates and their standard errors.
Table 54: Parameter estimates and tests of hypotheses for individual parameters – AIDS data
Note the difference in these equations, particularly their slopes, which represent the increase in
the number of new cases each year. Suppose that a government official would like to predict the
number of new cases in 1992 based on this equation (we are assuming that this increasing pattern
will continue). In this case X1 = year − 1983 = 1992 − 1983 = 9. The two predictions would be:
Females: $\hat{Y}_{1992} = -1007.464 + 814.631(9) = 6324.215$.
Note that we are not limited to simple models like this, we could have interaction terms in any of
the regression models that we have seen in this chapter. Their interpretations increase in complexity
as the models include more variables. Usually we will test if a coefficient of an interaction term is
0, using the t-test, and remove the interaction term from the model if we fail to reject H0 : βi = 0.
Figure 28 shows the data, as well as the two fitted equations for the AIDS data.
[Figure 28: Reported AIDS cases (Y) by year (X1), with separate fitted lines for males and females]
15.1 Multicollinearity
Multicollinearity refers to the situation where independent variables are highly correlated among
themselves. This can cause problems mathematically and creates problems in interpreting regres-
sion coefficients. Some of the problems that arise include the following:
It can be thought that the independent variables are explaining “the same” variation in Y , and
it is difficult for the model to attribute the variation explained (recall partial regression coefficients).
Variance Inflation Factors provide a means of detecting whether a given independent variable
is causing multicollinearity. They are calculated (for each independent variable) as:
$$VIF_i = \frac{1}{1 - R_i^2}$$
where Ri2 is the coefficient of multiple determination when Xi is regressed on the k − 1 other
independent variables. One rule of thumb suggests that severe multicollinearity is present if V IFi >
10 (Ri2 > .90).
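Once each auxiliary R²i is available, the VIFs are one division each. A minimal Python sketch, using the R² values reported in Example 15.1 below:

```python
def vif(r2_i):
    # variance inflation factor from the auxiliary-regression R^2
    return 1.0 / (1.0 - r2_i)

for name, r2 in [("ELEVATION", .9393), ("LATITUDE", .7635), ("LONGITUDE", .8940)]:
    print(name, round(vif(r2), 2))   # 16.47, 4.23, 9.43
```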
Example 15.1
Returning to the Texas temperature data, we first run a regression with ELEVATION as the dependent variable and LATITUDE and
LONGITUDE as the independent variables. We then repeat the process with LATITUDE as the
dependent variable, and finally with LONGITUDE as the dependent variable. The table below gives R²
and VIF for each model.
Note how large the factor is for ELEVATION. Texas elevation increases as you go West and as
you go North. The Western rise is the more pronounced of the two (the simple correlation between
ELEVATION and LONGITUDE is .89).
Consider the effects on the coefficients in Table 56 and Table 57 (these are subsets of previously
shown tables).
Compare the estimate and estimated standard error for the coefficient for ELEVATION and
LATITUDE for the two models. In particular, the ELEVATION coefficient doubles in absolute
Variable R2 V IF
ELEVATION .9393 16.47
LATITUDE .7635 4.23
LONGITUDE .8940 9.43
STANDARD ERROR
PARAMETER ESTIMATE OF ESTIMATE
INTERCEPT (β0 ) b0 =109.25887 2.97857
LATITUDE (β1 ) b1 = −1.99323 0.13639
LONGITUDE (β2 ) b2 = −0.38471 0.22858
ELEVATION (β3 ) b3 = −0.00096 0.00057
Table 56: Parameter estimates and standard errors for the full model
value and its standard error decreases by a factor of almost 3. The LATITUDE coefficient and
standard error do not change very much. We choose to keep ELEVATION, as opposed to LONGI-
TUDE, in the model due to theoretical considerations with respect to weather and climate.
• et — Error of forecast: et = Xt − Ft
STANDARD ERROR
PARAMETER ESTIMATE OF ESTIMATE
INTERCEPT (β0 ) b0 =63.45485 0.48750
ELEVATION (β1 ) b1 = −0.00185 0.00022
LATITUDE (β2 ) b2 = −1.83216 0.10380
Table 57: Parameter estimates and standard errors for the reduced model
Mean Square Error (MSE) — $MSE = \frac{\sum_i e_i^2}{\text{number of forecasts}}$
Mean Percentage Error (MPE) — $MPE = \frac{\sum_i \frac{e_i}{X_i} \cdot 100}{\text{number of forecasts}}$
Mean Absolute Percentage Error (MAPE) — $MAPE = \frac{\sum_i \frac{|e_i|}{X_i} \cdot 100}{\text{number of forecasts}}$
In this section, we describe some simple methods of using past data to predict future outcomes.
Most forecasts you hear reported are generally complex hybrids of these techniques.
Weighted moving averages put higher weights on more recent values, with
presumably $w_1 \geq w_2 \geq \cdots \geq w_n$.
Example 16.1
Table 58 gives average dividend yields for Anheuser–Busch for the years 1952–1995 (Source: Value
Line), along with forecasts and errors from moving averages with lags of 1, 2, and 3 years. Note that we
don’t have forecasts for the early years, and the longer the lag, the longer we must wait until we get our
first forecast.
Here we compute moving averages for year 1963:
1–Year: $F_{1963} = X_{1962} = 3.2$
2–Year: $F_{1963} = \frac{X_{1962} + X_{1961}}{2} = \frac{3.2 + 2.8}{2} = 3.0$
3–Year: $F_{1963} = \frac{X_{1962} + X_{1961} + X_{1960}}{3} = \frac{3.2 + 2.8 + 4.4}{3} = 3.47$
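A minimal Python sketch of the k-year moving-average forecast and two of the error measures, checked against the first few rows of Table 58:

```python
def ma_forecasts(x, k):
    # F_t is the average of the k most recent observations; None until enough data
    return [None] * k + [sum(x[t - k:t]) / k for t in range(k, len(x))]

x = [5.30, 4.20, 3.90, 5.20, 5.80, 6.30]   # 1952-1957 dividend yields
f = ma_forecasts(x, 2)                      # 2-year moving average: matches F2,t above

pairs = [(xi, fi) for xi, fi in zip(x, f) if fi is not None]
me = sum(xi - fi for xi, fi in pairs) / len(pairs)                  # Mean Error
mape = sum(abs(xi - fi) / xi * 100 for xi, fi in pairs) / len(pairs)  # MAPE
```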
t Year Xt F1,t e1,t F2,t e2,t F3,t e3,t
1 1952 5.30 . . . . . .
2 1953 4.20 5.30 -1.10 . . . .
3 1954 3.90 4.20 -0.30 4.75 -0.85 . .
4 1955 5.20 3.90 1.30 4.05 1.15 4.47 0.73
5 1956 5.80 5.20 0.60 4.55 1.25 4.43 1.37
6 1957 6.30 5.80 0.50 5.50 0.80 4.97 1.33
7 1958 5.60 6.30 -0.70 6.05 -0.45 5.77 -0.17
8 1959 4.80 5.60 -0.80 5.95 -1.15 5.90 -1.10
9 1960 4.40 4.80 -0.40 5.20 -0.80 5.57 -1.17
10 1961 2.80 4.40 -1.60 4.60 -1.80 4.93 -2.13
11 1962 3.20 2.80 0.40 3.60 -0.40 4.00 -0.80
12 1963 3.10 3.20 -0.10 3.00 0.10 3.47 -0.37
13 1964 3.10 3.10 0.00 3.15 -0.05 3.03 0.07
14 1965 2.60 3.10 -0.50 3.10 -0.50 3.13 -0.53
15 1966 2.00 2.60 -0.60 2.85 -0.85 2.93 -0.93
16 1967 1.60 2.00 -0.40 2.30 -0.70 2.57 -0.97
17 1968 1.30 1.60 -0.30 1.80 -0.50 2.07 -0.77
18 1969 1.20 1.30 -0.10 1.45 -0.25 1.63 -0.43
19 1970 1.20 1.20 0.00 1.25 -0.05 1.37 -0.17
20 1971 1.10 1.20 -0.10 1.20 -0.10 1.23 -0.13
21 1972 0.90 1.10 -0.20 1.15 -0.25 1.17 -0.27
22 1973 1.40 0.90 0.50 1.00 0.40 1.07 0.33
23 1974 2.00 1.40 0.60 1.15 0.85 1.13 0.87
24 1975 1.90 2.00 -0.10 1.70 0.20 1.43 0.47
25 1976 2.30 1.90 0.40 1.95 0.35 1.77 0.53
26 1977 3.10 2.30 0.80 2.10 1.00 2.07 1.03
27 1978 3.50 3.10 0.40 2.70 0.80 2.43 1.07
28 1979 3.80 3.50 0.30 3.30 0.50 2.97 0.83
29 1980 3.70 3.80 -0.10 3.65 0.05 3.47 0.23
30 1981 3.10 3.70 -0.60 3.75 -0.65 3.67 -0.57
31 1982 2.60 3.10 -0.50 3.40 -0.80 3.53 -0.93
32 1983 2.40 2.60 -0.20 2.85 -0.45 3.13 -0.73
33 1984 3.00 2.40 0.60 2.50 0.50 2.70 0.30
34 1985 2.40 3.00 -0.60 2.70 -0.30 2.67 -0.27
35 1986 1.80 2.40 -0.60 2.70 -0.90 2.60 -0.80
36 1987 1.70 1.80 -0.10 2.10 -0.40 2.40 -0.70
37 1988 2.20 1.70 0.50 1.75 0.45 1.97 0.23
38 1989 2.10 2.20 -0.10 1.95 0.15 1.90 0.20
39 1990 2.40 2.10 0.30 2.15 0.25 2.00 0.40
40 1991 2.10 2.40 -0.30 2.25 -0.15 2.23 -0.13
41 1992 2.20 2.10 0.10 2.25 -0.05 2.20 0.00
42 1993 2.70 2.20 0.50 2.15 0.55 2.23 0.47
43 1994 3.00 2.70 0.30 2.45 0.55 2.33 0.67
44 1995 2.80 3.00 -0.20 2.85 -0.05 2.63 0.17
Table 58: Dividend yields, forecasts, and errors — 1, 2, and 3 year moving averages
When might a “short” Moving Average be preferred to a “long” one?
When might a “long” Moving Average be preferred to a “short” one?
Figure 29 displays raw data and moving average forecasts.
[Figure: DIV_YLD vs CAL_YEAR, 1950–2000; series: Actual, MA(1), MA(2), MA(3)]

Figure 29: Plot of the data and moving average forecasts for Anheuser–Busch dividend data
Measurements of Forecasting Error

• et — Error of forecast: et = Xt − Ft
• Mean Error (ME) — ME = (Σ et) / (number of forecasts)
• Mean Square Error (MSE) — MSE = (Σ et^2) / (number of forecasts)
• Mean Percentage Error (MPE) — MPE = (Σ (et/Xt) · 100) / (number of forecasts)
• Mean Absolute Percentage Error (MAPE) — MAPE = (Σ (|et|/Xt) · 100) / (number of forecasts)

For the moving average forecasts of Table 58:

Mean Error:

1–Year: ME = [(−1.1) + (−0.3) + 1.3 + · · · + 0.5 + 0.3 + (−0.2)] / 43 = −2.5/43 = −0.058
2–Year: ME = [(−0.85) + 1.15 + 1.25 + · · · + 0.55 + 0.55 + (−0.05)] / 42 = −2.6/42 = −0.062
3–Year: ME = [0.73 + 1.37 + 1.33 + · · · + 0.47 + 0.67 + 0.17] / 41 = −2.8/41 = −0.068

Mean Absolute Percentage Error:

1–Year: MAPE = [(|−1.1|/4.2)·100 + (|−0.3|/3.9)·100 + · · · + (|0.3|/3.0)·100 + (|−0.2|/2.8)·100] / 43
             = 687.06/43 = 15.98
2–Year: MAPE = [(|−0.85|/3.9)·100 + (|1.15|/5.2)·100 + · · · + (|0.55|/3.0)·100 + (|−0.05|/2.8)·100] / 42
             = 843.31/42 = 20.08
Exponential Smoothing forecasts are computed from the recursion:

Ft+1 = α · Xt + (1 − α) · Ft

where:

• Xt is the outcome at time t
• Ft is the forecast for time t
• α is a smoothing constant (0 < α < 1)

Forecasts are "smoother" than the raw data, and the weights of previous observations decline
exponentially with time.
Here we obtain forecasts based on Exponential Smoothing, beginning with year 2 (1953):

1953: Fα=.2,1953 = X1952 = 5.30   Fα=.5,1953 = X1952 = 5.30   Fα=.8,1953 = X1952 = 5.30
1954 (α = 0.2): Fα=.2,1954 = .2X1953 + .8Fα=.2,1953 = .2(4.20) + .8(5.30) = 5.08
1954 (α = 0.5): Fα=.5,1954 = .5X1953 + .5Fα=.5,1953 = .5(4.20) + .5(5.30) = 4.75
1954 (α = 0.8): Fα=.8,1954 = .8X1953 + .2Fα=.8,1953 = .8(4.20) + .2(5.30) = 4.42
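The recursion is one line per period; a minimal sketch (function name hypothetical):

    def exp_smooth_forecasts(x, alpha):
        """F_{t+1} = alpha*X_t + (1 - alpha)*F_t, with F_2 = X_1."""
        f = [x[0]]                       # forecast for the second period
        for xt in x[1:-1]:
            f.append(alpha * xt + (1 - alpha) * f[-1])
        return f                         # f[i] is the forecast for x[i+1]

    # Reproduces the 1954 value above:
    # exp_smooth_forecasts([5.30, 4.20, 3.90], 0.2)  ->  [5.30, 5.08]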
Which level of α appears to be “discounting” more distant observations at a quicker rate? What
would happen if α = 1? If α = 0? Figure 30 gives raw data and exponential smoothing forecasts.
Table 60 gives measures of forecast errors for the three moving average and three exponential
smoothing methods.
16.3 Autoregression
Sometimes regression is run on past or "lagged" values of the dependent variable (and possibly
other variables). An autoregressive model with independent variables corresponding to k periods
can be written as follows:

Xt = β0 + β1Xt−1 + β2Xt−2 + · · · + βkXt−k + εt

Note that the regression cannot be run for the first k responses in the series.
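A sketch of fitting such a model by ordinary least squares on lagged values (these are in-sample
fitted values, not sequentially refit forecasts; names hypothetical):

    import numpy as np

    def ar_fitted(x, k):
        """Fit X_t = b0 + b1*X_{t-1} + ... + bk*X_{t-k} by least squares
        and return fitted values for periods k+1, ..., n."""
        x = np.asarray(x, dtype=float)
        A = np.vstack([np.r_[1.0, x[t - k:t][::-1]] for t in range(k, len(x))])
        b, *_ = np.linalg.lstsq(A, x[k:], rcond=None)
        return A @ b

    # fits = ar_fitted(div_yield, 2)    # autoregression with 2 lagged terms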
Table 59: Dividend yields, Forecasts, and errors based on exponential smoothing with α =
0.2, 0.5, 0.8
[Figure: DIV_YLD vs CAL_YEAR, 1950–2000; series: Actual, ES(α=.2), ES(α=.5), ES(α=.8)]

Figure 30: Plot of the data and Exponential Smoothing forecasts for Anheuser–Busch dividend
data
Forecast error measures for the autoregressive models:

Measure   1–Period   2–Period   3–Period
ME          0.00       0.00       0.00
MAE         0.41       0.38       0.39
MSE         0.27       0.24       0.24
MPE        −3.47      −3.13      −3.16
MAPE       16.02      15.14      15.45
How do these methods of forecasting compare with moving averages and exponential smoothing?
[Figure: DIV_YLD vs CAL_YEAR, 1950–2000; series: Actual, AR(1), AR(2), AR(3)]

Figure 31: Plot of the data and Autoregressive forecasts for Anheuser–Busch dividend data
t Year Xt FAR(1),t eAR(1),t FAR(2),t eAR(2),t FAR(3),t eAR(3),t
1 1952 5.3 . . . . . .
2 1953 4.2 4.96 -0.76 . . . .
3 1954 3.9 3.99 -0.09 3.72 0.18 . .
4 1955 5.2 3.72 1.48 3.68 1.52 3.72 1.48
5 1956 5.8 4.87 0.93 5.30 0.50 5.35 0.45
6 1957 6.3 5.40 0.90 5.64 0.66 5.58 0.72
7 1958 5.6 5.84 -0.24 6.06 -0.46 6.03 -0.43
8 1959 4.8 5.22 -0.42 5.09 -0.29 5.03 -0.23
9 1960 4.4 4.52 -0.12 4.34 0.06 4.35 0.05
10 1961 2.8 4.16 -1.36 4.10 -1.30 4.12 -1.32
11 1962 3.2 2.75 0.45 2.33 0.87 2.29 0.91
12 1963 3.1 3.11 -0.01 3.26 -0.16 3.35 -0.25
13 1964 3.1 3.02 0.08 3.03 0.07 3.00 0.10
14 1965 2.6 3.02 -0.42 3.06 -0.46 3.05 -0.45
15 1966 2 2.58 -0.58 2.47 -0.47 2.44 -0.44
16 1967 1.6 2.05 -0.45 1.90 -0.30 1.90 -0.30
17 1968 1.3 1.70 -0.40 1.60 -0.30 1.61 -0.31
18 1969 1.2 1.43 -0.23 1.36 -0.16 1.37 -0.17
19 1970 1.2 1.35 -0.15 1.33 -0.13 1.34 -0.14
20 1971 1.1 1.35 -0.25 1.36 -0.26 1.36 -0.26
21 1972 0.9 1.26 -0.36 1.24 -0.34 1.23 -0.33
22 1973 1.4 1.08 0.32 1.03 0.37 1.03 0.37
23 1974 2 1.52 0.48 1.68 0.32 1.70 0.30
24 1975 1.9 2.05 -0.15 2.25 -0.35 2.23 -0.33
25 1976 2.3 1.96 0.34 1.96 0.34 1.92 0.38
26 1977 3.1 2.31 0.79 2.46 0.64 2.47 0.63
27 1978 3.5 3.02 0.48 3.29 0.21 3.28 0.22
28 1979 3.8 3.37 0.43 3.53 0.27 3.49 0.31
29 1980 3.7 3.64 0.06 3.77 -0.07 3.75 -0.05
30 1981 3.1 3.55 -0.45 3.56 -0.46 3.54 -0.44
31 1982 2.6 3.02 -0.42 2.88 -0.28 2.86 -0.26
32 1983 2.4 2.58 -0.18 2.47 -0.07 2.47 -0.07
33 1984 3 2.40 0.60 2.37 0.63 2.39 0.61
34 1985 2.4 2.93 -0.53 3.14 -0.74 3.16 -0.76
35 1986 1.8 2.40 -0.60 2.26 -0.46 2.20 -0.40
36 1987 1.7 1.87 -0.17 1.72 -0.02 1.73 -0.03
37 1988 2.2 1.79 0.41 1.78 0.42 1.80 0.40
38 1989 2.1 2.23 -0.13 2.40 -0.30 2.41 -0.31
39 1990 2.4 2.14 0.26 2.13 0.27 2.10 0.30
40 1991 2.1 2.40 -0.30 2.52 -0.42 2.53 -0.43
41 1992 2.2 2.14 0.06 2.08 0.12 2.05 0.15
42 1993 2.7 2.23 0.47 2.28 0.42 2.29 0.41
43 1994 3 2.67 0.33 2.84 0.16 2.85 0.15
44 1995 2.8 2.93 -0.13 3.05 -0.25 3.03 -0.23
Table 62: Average dividend yields and Forecasts/errors based on autoregression with lags of 1, 2,
and 3 periods
17 Lecture 17 — Autocorrelation
Textbook Section: 15.5
Problems: See Lecture
Recall a key assumption in regression: error terms are independent. When data are collected
over time, the errors are often serially correlated (autocorrelated). Under first–order autocorre-
lation, consecutive error terms are linearly related:

εt = ρεt−1 + νt

where ρ is the correlation between consecutive error terms, and νt is a normally distributed
independent error term. When errors display a positive correlation, ρ > 0 (consecutive error
terms are associated). Note that when ρ = 0, error terms are independent (which is the assumption
in the derivation of the tests in the chapters on linear regression). The relation can be tested with
the Durbin–Watson statistic, computed from the regression residuals:

D = [Σ(t=2 to n) (et − et−1)^2] / [Σ(t=1 to n) et^2]

Values of D near 2 are consistent with independent errors, while small values point to positive
autocorrelation (D is compared with tabled bounds dL and dU). If autocorrelation is present,
possible remedies include:
• Additional independent variable(s) — A variable may be missing from the model that will
eliminate the autocorrelation (see example).
• Transform the variables — Take "first differences" (Xt+1 − Xt ) and (Yt+1 − Yt ) and run the
regression with the transformed Y and X (see the sketch after this list).
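A minimal sketch of the first-differences remedy, assuming y and x are the raw series in time
order (names hypothetical):

    import numpy as np

    def first_difference_fit(y, x):
        """Regress (Y_{t+1} - Y_t) on (X_{t+1} - X_t) by least squares."""
        dy, dx = np.diff(y), np.diff(x)            # first differences
        A = np.column_stack([np.ones(len(dx)), dx])
        (b0, b1), *_ = np.linalg.lstsq(A, dy, rcond=None)
        return b0, b1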
Example 17.1

We regress quarterly sales for Procter & Gamble (Y) on the Consumer Price Index (X) for
1965–1995 (Model 1). The raw data are given in Table 63, and plotted (with the fitted equation)
in Figure 32. Figure 33 gives a plot of residuals vs time order. Notice the distinct pattern in the
residuals and that consecutive residuals are very close to one another.
Compute the first three residuals, and their contributions to the numerator and denominator
of the D–W statistic.
For the entire sample, we obtain: n = 124, k = 1, D = 0.092 — Test for autocorrelation.
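A sketch of the D–W computation from a residual series (the residuals are assumed to be
already computed and in time order):

    import numpy as np

    def durbin_watson(resid):
        """D = sum_{t=2}^{n} (e_t - e_{t-1})^2 / sum_{t=1}^{n} e_t^2."""
        e = np.asarray(resid, dtype=float)
        return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

    # D near 2 is consistent with independent errors; a value as small
    # as 0.092 points strongly toward positive autocorrelation.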
1st Qtr 2nd Qtr 3rd Qtr 4th Qtr
Year Sales CPI Sales CPI Sales CPI Sales CPI
1965 523.0 31.2 486.9 31.5 527.8 31.6 520.9 31.7
1966 558.5 32.0 531.2 32.3 591.8 32.6 561.7 32.9
1967 643.7 32.9 564.0 33.2 633.5 33.5 597.5 33.8
1968 659.9 34.2 590.1 34.5 668.8 35.0 623.8 35.4
1969 695.3 35.8 648.4 36.4 694.6 37.0 669.3 37.5
1970 747.7 38.0 706.8 38.6 760.0 39.1 764.3 39.6
1971 815.2 39.9 764.0 40.3 799.6 40.8 799.3 41.0
1972 904.6 41.3 816.4 41.6 915.0 42.0 878.4 42.4
1973 975.2 42.9 913.3 43.9 1024.3 44.9 993.9 45.9
1974 1158.6 47.2 1136.1 48.5 1338.9 50.0 1278.7 51.5
1975 1530.6 52.4 1455.6 53.2 1587.8 54.4 1507.7 55.2
1976 1586.7 55.8 1541.2 56.5 1736.8 57.4 1648.0 58.0
1977 1830.0 59.0 1731.0 60.3 1921.0 61.2 1802.0 61.9
1978 1932.0 62.9 1929.0 64.5 2173.0 66.1 2065.0 67.4
1979 2286.0 69.1 2249.0 71.5 2457.0 73.8 2337.0 75.9
1980 2664.0 78.9 2622.0 81.8 2790.0 83.3 2696.0 85.5
1981 2908.0 87.8 2759.0 89.8 2949.0 92.4 2800.0 93.7
1982 3026.0 94.5 2895.0 95.9 3094.0 97.7 2979.0 97.9
1983 3201.0 97.9 3030.0 99.1 3131.0 100.3 3090.0 101.2
1984 3277.0 102.3 3135.0 103.4 3238.0 104.5 3251.0 105.3
1985 3485.0 106.0 3375.0 107.3 3350.0 108.0 3342.0 109.0
1986 3605.0 109.2 3865.0 109.0 4081.0 109.8 3888.0 110.4
1987 4356.0 111.6 4255.0 113.1 4222.0 114.4 4167.0 115.4
1988 4664.0 116.1 4839.0 117.5 4860.0 119.1 4973.0 120.3
1989 5267.0 121.7 5268.0 123.7 5430.0 124.7 5433.0 125.9
1990 5807.0 128.0 6025.0 129.3 6123.0 131.6 6126.0 133.7
1991 6652.0 134.8 6857.0 135.6 6795.0 136.7 6722.0 137.7
1992 7205.0 138.7 7597.0 139.8 7483.0 140.9 7167.0 141.9
1993 7879.0 143.1 7839.0 144.2 7350.0 144.8 7365.0 145.8
1994 7564.0 146.7 7788.0 147.6 7441.0 148.9 7503.0 149.6
1995 8161.0 150.9 8467.0 152.2 8312.0 152.9 8494.0 153.6
Table 63: Quarterly Sales for P&G (Y ) and CPI (X) — 1965–1995
Here, we attempt to cure the autocorrelation. The relationship appears approximately linear,
with two slopes, with the split at 1985(q4) (CPI=109.0).
A LEXIS/NEXIS search shows that the company bought some OTC drug companies around
this time. Could this have changed the rate of increase? Also, there was an uproar around this
time that their corporate logo was Satanic. Maybe a deal with Satan — increased revenues for
advertising space?
We fit a piecewise linear regression model with an interesting use of dummy variables and
interaction terms:

Yt = β0 + β1X1t + β2X2t + β3X1tX2t + εt

where X1t is CPI at time t, and X2t is 1 if after 1985(q4) and 0 if before, so that the slope is β1
before the split and β1 + β3 after it. The fitted equation is plotted in Figure 34.
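A sketch of the design matrix for such a model (names hypothetical; `after` is the 0/1 dummy,
equal to 1.0 for observations after 1985(q4) and 0.0 otherwise):

    import numpy as np

    def piecewise_fit(y, cpi, after):
        """Fit Y = b0 + b1*CPI + b2*AFTER + b3*CPI*AFTER + error.
        The slope is b1 before the split and b1 + b3 after it."""
        A = np.column_stack([np.ones(len(cpi)), cpi, after, cpi * after])
        b, *_ = np.linalg.lstsq(A, y, rcond=None)
        return b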
[Figure: SALES (0 to 9000) vs CPI (0 to 200)]

Figure 32: Plot of sales vs CPI and the fitted equation — P&G data (Model 1)
[Figure: Residual (−2000 to 2000) vs TIME (0 to 140)]

Figure 33: Plot of residuals vs time order — P&G data (Model 1)
Figure 34 gives a plot of the raw data and fitted equation and Figure 35 gives a plot of residuals
vs time order. Is there a pattern in the residuals? Are consecutive residuals close or far apart?
Compute the last three residuals, and their contributions to the numerator and denominator of the
D–W statistic. For the full sample, we obtain n = 124, k = 2, D = 0.986 — Test for autocorrelation.
[Figure: SALES (0 to 9000) vs CPI (0 to 200)]

Figure 34: Plot of sales vs CPI and the fitted equation — P&G data (Model 2)
[Figure: Residual (−600 to 600) vs TIME (0 to 140)]

Figure 35: Plot of residuals vs time order — P&G data (Model 2)