A mail-order sales company uses many different mailed catalogs to sell
different types of merchandise. One catalog that features home goods, such as
bedspreads and pillows, was mailed to 200,000 people who were not
current customers.1 The response variable is whether or not the person
places an order. Logistic regression is used to model the probability p
of a purchase as a function of five explanatory variables. These are the
number of purchases within the last 24 months from a home gift catalog,
the proportion of single people in the zip code area based on census data,
the number of credit cards, a variable that distinguishes apartment dwellers
from those who live in single-family homes, and an indicator of whether or
not the customer has ever made a purchase from a similar type of catalog.
The fitted logistic model is then used to estimate, for each person in a
large collection of potential customers, the probability of making a purchase.
Catalogs are sent to those whose estimated probability is above some cutoff value.
CHAPTER 17 Logistic Regression*
Introduction
The simple and multiple linear regression methods we studied in Chapters 10
and 11 are used to model the relationship between a quantitative response
variable and one or more explanatory variables. In this chapter we describe
similar methods for use when the response variable has only two possible
values: customer buys or does not buy, patient lives or dies, candidate
accepts job or not.
In general, we call the two values of the response variable “success”
and “failure” and represent them by 1 (for a success) and 0. The mean is
then the proportion of ones, p = P(success). If our data are n independent
observations with the same p, this is the Binomial setting (page 319). What
is new in this chapter is that the data now include an explanatory variable
x. The probability p of a success depends on the value of x. For example,
suppose we are studying whether a customer makes a purchase (y = 1) or
not (y = 0) after being offered a discount. Then p is the probability that
the customer makes a purchase, and possible explanatory variables include
(a) whether the customer has made similar purchases in the past, (b) the
type of discount, and (c) the age of the customer. Note that the explanatory
variables can be either categorical or quantitative. Logistic regression2 is a
statistical method for describing these kinds of relationships.
BINGE DRINKERS
    x = 1 if the student is a man
    x = 0 if the student is a woman
The sample contained 7180 men and 9916 women. The probability
that a randomly chosen student is a frequent binge drinker has two values,
p1 for men and p0 for women. The number of men in the sample who
are frequent binge drinkers has the Binomial distribution B(7180, p1 ). The
count of frequent binge drinkers among the women has the B(9916, p0 )
distribution.
CASE 17.1
The binge-drinking study found that 1630 of the 7180 men in the sample were
frequent binge drinkers, as were 1684 of the 9916 women. Our estimates of the
two population proportions are
Men: p̂1 = 1630/7180 = 0.2270
and
Women: p̂0 = 1684/9916 = 0.1698
That is, we estimate that 22.7% of college men and 17.0% of college women are
frequent binge drinkers.
odds    Logistic regression works with odds rather than proportions. The odds
are the ratio of the proportions for the two possible outcomes. If p is the
probability of a success, then 1 − p is the probability of a failure, and

    ODDS = p / (1 − p) = probability of success / probability of failure

A similar formula for the sample odds is obtained by substituting p̂ for p in
this expression.
p̂1 = 0.2270, so the proportion of men who are not frequent binge drinkers is
    1 − p̂1 = 1 − 0.2270 = 0.7730
The estimated odds of a male student being a frequent binge drinker are therefore
    ODDS = p̂1 / (1 − p̂1) = 0.2270 / 0.7730 = 0.2937
For women, the odds are
    ODDS = p̂0 / (1 − p̂0) = 0.1698 / (1 − 0.1698) = 0.2045
When people speak about odds, they often round to integers or fractions.
Since 0.205 is approximately 1/5, we could say that the odds that a female
college student is a frequent binge drinker are 1 to 5. In a similar way, we
could describe the odds that a college woman is not a frequent binge drinker
as 5 to 1.
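As a quick check of the arithmetic above, the proportions and odds for Case 17.1 can be computed directly. This is a minimal Python sketch; the counts are the ones quoted in the text:

```python
# Case 17.1 counts from the text: frequent binge drinkers by gender.
men_binge, men_total = 1630, 7180
women_binge, women_total = 1684, 9916

def odds(p):
    """Odds = probability of success / probability of failure."""
    return p / (1 - p)

p1 = men_binge / men_total      # proportion for men, about 0.227
p0 = women_binge / women_total  # proportion for women, about 0.170
print(round(odds(p1), 3), round(odds(p0), 3))  # 0.294 0.205
```

These agree with the worked values 0.2937 and 0.2045 up to rounding (the text rounds p̂ before forming the odds).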
APPLY YOUR KNOWLEDGE

17.1 Successful franchises and exclusive territories. In Case 9.1 (page 549) we
studied data on the success of 170 franchise firms and whether or not the
owner of a franchise had an exclusive territory.

17.2 No Sweat labels. Here are data on gender and whether or not the
consumer is a label user:

                  Gender
Label user    Women    Men    Total
Yes              63     27       90
No              233    224      457
Total           296    251      547

Find the proportion of men who are label users; do the same for women.
Restate each of these proportions as odds.
    log(p/(1 − p)) = β0 + β1x
Figure 17.1 graphs the relationship between p and x for some different
values of β0 and β1. The logistic regression model uses natural logarithms.
Most calculators and statistical software systems have a built-in function for
the natural logarithm, often labeled “ln.”
FIGURE 17.1 Plot of p versus x for three logistic regression curves: β0 = −4.0, β1 = 2.0; β0 = −8.0, β1 = 1.6; and β0 = −4.0, β1 = 1.8.
Verify these results with your calculator, remembering that “log” is the
natural logarithm. Here is a summary of the logistic regression model.
    log(p/(1 − p)) = β0 + β1x
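The model can be inverted to recover p from the log odds. A small sketch; the coefficient values here match one of the curves described for Figure 17.1:

```python
import math

def p_from_x(b0, b1, x):
    """Invert log(p/(1 - p)) = b0 + b1*x to recover the probability p."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

# With b0 = -4.0 and b1 = 2.0, the log odds are zero at x = 2,
# so the curve crosses p = 0.5 there.
print(p_from_x(-4.0, 2.0, 2.0))  # 0.5
```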
APPLY YOUR KNOWLEDGE

17.3 Log odds for exclusive territories. Refer to Exercise 17.1. Find the log odds
for the franchises that have exclusive territories. Do the same for the firms
that do not.

17.4 Log odds for No Sweat labels. Refer to Exercise 17.2. Find the log odds for
the women. Do the same for the men.
For men,
    log(p1/(1 − p1)) = β0 + β1
and for women,
    log(p0/(1 − p0)) = β0
Note that there is a β1 term in the equation for men because x = 1, but it is
missing in the equation for women because x = 0.
CASE 17.1
The estimated log odds are
    log(p̂1/(1 − p̂1)) = −1.23   (men)
    log(p̂0/(1 − p̂0)) = −1.59   (women)
Because
    log(p0/(1 − p0)) = β0   and   log(p̂0/(1 − p̂0)) = −1.59
the estimate b0 of the intercept is simply the log(ODDS) for the women,
    b0 = −1.59
Similarly, the estimated slope is the difference between the log(ODDS) for the men
and the log(ODDS) for the women,
    b1 = −1.23 − (−1.59) = 0.36
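With an indicator explanatory variable, these estimates can be computed directly from the sample proportions. A sketch with the Case 17.1 counts:

```python
import math

p1_hat = 1630 / 7180   # men (x = 1)
p0_hat = 1684 / 9916   # women (x = 0)

def log_odds(p):
    return math.log(p / (1 - p))

b0 = log_odds(p0_hat)        # intercept = log odds for the women
b1 = log_odds(p1_hat) - b0   # slope = difference in log odds
print(round(b0, 2), round(b1, 2))  # -1.59 0.36
```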
The slope in this logistic regression model is the difference between the
log(ODDS) for men and the log(ODDS) for women. Most people are not
comfortable thinking in the log(ODDS) scale, so interpretation of the results
in terms of the regression slope is difficult. Usually, we apply a transformation
to help us. The exponential function (the e^x key on your calculator) reverses the
natural logarithm transformation. That is, continuing Example 17.4, the ratio of
the odds for men (x = 1) and women (x = 0) is
    ODDSmen / ODDSwomen = e^0.36 = 1.43
The transformation e^0.36 undoes the logarithm and transforms the logistic
regression slope into an odds ratio: in this case, the ratio of the odds that a
man is a frequent binge drinker to the odds that a woman is a frequent binge
drinker. We can multiply the odds for women by the odds ratio to obtain
the odds for men: the odds for men are 1.43 times the odds for women.
Notice that we have chosen the coding for the indicator variable so that
the regression slope is positive. This will give an odds ratio that is greater
than 1. Had we coded women as 1 and men as 0, the sign of the slope would
be reversed, the fitted equation would be log(ODDS) = −1.23 − 0.36x, and
the odds ratio would be e^−0.36 = 0.70. The odds for women are 70% of the
odds for men.
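The effect of the coding choice is easy to verify numerically, using the rounded slope from Example 17.4:

```python
import math

b1 = 0.36  # slope with men coded x = 1
print(f"{math.exp(b1):.2f}")   # 1.43: odds ratio, men versus women
print(f"{math.exp(-b1):.2f}")  # 0.70: reversing the coding inverts the ratio
```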
It is of course often the case that the explanatory variable is quantitative
rather than an indicator variable. We must then use software to fit the
logistic regression model. Here is an example.
    log(p/(1 − p)) = β0 + β1x
where p is the probability that the cheese is acceptable and x is the value of Acetic.
The model for estimated log odds fitted by software is
    log(ODDS) = b0 + b1x = −13.71 + 2.25x
The odds ratio is e^b1 = 9.49. This means that if we increase the acetic acid content
x by one unit, we increase the odds that the cheese will be acceptable by about 9.5
times. (See Exercise 17.7.)
APPLY YOUR KNOWLEDGE

17.5 Fitted model for exclusive territories. Refer to Exercises 17.1 and 17.3.
Find the estimates b0 and b1 and give the fitted logistic model. What is
the odds ratio for exclusive territory (x = 1) versus no exclusive territory
(x = 0)?

17.6 Fitted model for No Sweat labels. Refer to Exercises 17.2 and 17.4. Find
the estimates b0 and b1 and give the fitted logistic model. What is the odds
ratio for women (x = 1) versus men (x = 0)?
17.7 Interpreting an odds ratio. If we apply the exponential function to the fitted
model in Example 17.5, we get
    ODDS = e^(−13.71 + 2.25x) = e^−13.71 × e^2.25x
Show that for any value of the quantitative explanatory variable x, the odds
ratio for increasing x by 1,
    ODDSx+1 / ODDSx
is e^2.25 = 9.49. This justifies the interpretation given at the end of
Example 17.5.
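The algebra in Exercise 17.7 can also be checked numerically. A sketch with the fitted coefficients from Example 17.5 (the x values chosen here are arbitrary):

```python
import math

b0, b1 = -13.71, 2.25  # fitted coefficients from Example 17.5

def odds(x):
    return math.exp(b0 + b1 * x)

# The odds ratio for a one-unit increase does not depend on x.
for x in (4.0, 5.5, 7.0):
    print(round(odds(x + 1) / odds(x), 2))  # 9.49 each time
```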
    log(p/(1 − p)) = β0 + β1x
That is, each value of x gives a different proportion p of successes. The
data are n values of x, with observed success or failure for each. The
model assumes that these n success-or-failure trials are independent, with
probabilities of success given by the logistic regression equation. The
parameters of the model are β0 and β1.
• The odds ratio is the ratio of the odds of a success at x + 1 to the odds
of a success at x. It is found as e^β1, where β1 is the slope in the logistic
regression equation.
• Software fits the data to the model, producing estimates b0 and b1 of the
parameters β0 and β1.
CASE 17.1
parameter estimates given by SPSS are b0 = −1.587 and b1 = 0.362, more exact
than we calculated directly in Example 17.4. The standard errors are 0.027 and
0.039. A 95% confidence interval for the slope is
    b1 ± 1.96 SEb1 = 0.362 ± (1.96)(0.039) = 0.2856 to 0.4384
FIGURE 17.2 Logistic regression output from SPSS and SAS for the
binge-drinking data, for Example 17.6.
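A 95% confidence interval for the slope uses the usual large-sample recipe, estimate ± 1.96 × standard error. A sketch with the SPSS values for the binge-drinking data:

```python
b1, se = 0.362, 0.039  # slope estimate and standard error from Figure 17.2
lo, hi = b1 - 1.96 * se, b1 + 1.96 * se
print(round(lo, 4), round(hi, 4))  # 0.2856 0.4384
```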
test of the null hypothesis that the odds ratio is 1 at the 5% significance level.
If the confidence interval does not include 1, we reject H0 and conclude that
the odds for the two groups are different; if the interval does include 1, the
data do not provide enough evidence to distinguish the groups in this way.
APPLY YOUR KNOWLEDGE

17.8 Read the output (Case 17.1). Examine the SAS output in Figure 17.2.
Report the estimates of β0 and β1 with the standard errors as given in
this display. Also report the odds ratio with its 95% confidence
interval as given in this output.

17.9 Inference for exclusive territories. Use software to run a logistic regression
analysis for the exclusive territory data of Exercise 17.1. Summarize the
results of the inference.

17.10 Inference for No Sweat labels. Use software to run the logistic regression
analysis for the No Sweat label data of Exercise 17.2. Summarize the results
of the inference.
FIGURE 17.3 Plot of log odds of percent killed versus log concentration
for the insecticide data, for Example 17.7.
FIGURE 17.4 Plot of the percent killed versus log concentration with the
logistic fit for the insecticide data, for Example 17.7.
When the explanatory variable has several values, we can use graphs
like those in Figures 17.3 and 17.4 to visually assess whether the logistic
regression model seems appropriate. Just as a scatterplot of y versus x in
simple linear regression should show a linear pattern, a plot of log odds
versus x in logistic regression should be close to linear. Just as in simple
linear regression, outliers in the x direction should be avoided because they
may overly influence the fitted model.
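A sketch of this diagnostic with grouped data: compute the empirical log odds at each value of x and check that they fall roughly on a line. The counts below are hypothetical (the insecticide counts are not reproduced in this section):

```python
import math

# Hypothetical grouped data: (x, number of trials, number of successes).
groups = [(1.0, 50, 6), (1.5, 50, 16), (2.0, 50, 33)]

for x, n, k in groups:
    p_hat = k / n
    log_odds = math.log(p_hat / (1 - p_hat))
    print(x, round(log_odds, 2))
```

Plotting these (x, log odds) pairs, as in Figure 17.3, shows whether the linear log-odds assumption is plausible.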
The graphs strongly suggest that insecticide concentration affects the kill
rate in a way that fits the logistic regression model. Is the effect statistically
significant? Suppose that rotenone has no ability to kill Macrosiphoniella
sanborni. What is the chance that we would observe experimental results
at least as convincing as what we observed if this supposition were
true? The answer is the P-value for the test of the null hypothesis that
the logistic regression slope is zero. If this P-value is not small, our
graph may be misleading. As usual, we must add inference to our data
analysis.
    log(p/(1 − p)) = β0 + β1x
where the values of the explanatory variable x are 0.96, 1.33, 1.63, 2.04, 2.32.
From the SPSS output we see that the fitted model is
    log(ODDS) = b0 + b1x = −4.89 + 3.11x
or
    p̂/(1 − p̂) = e^(−4.89 + 3.11x)
Figure 17.4 is a graph of the fitted p̂ given by this equation against x, along
with the data used to fit the model. SPSS gives the statistic X² under the heading
“Wald.” The null hypothesis that β1 = 0 is clearly rejected (X² = 64.23,
P < 0.001).
The estimated odds ratio is 22.39. An increase of one unit in the log
concentration of insecticide (x) is associated with a 22-fold increase in the odds
that an insect will be killed. The confidence interval for the odds ratio is given
in the SAS output: (10.470, 47.896).
Remember that the test of the null hypothesis that the slope is 0 is the same
as the test of the null hypothesis that the odds ratio is 1. If we were reporting the
results in terms of the odds, we could say, “The odds of killing an insect increase
by a factor of 22.4 for each unit increase in the log concentration of insecticide
(X² = 64.23, P < 0.001; 95% CI = 10.5 to 47.9).”
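The interval on the odds-ratio scale comes from exponentiating the endpoints of the interval for the slope. A sketch with the Minitab estimates for the insecticide data:

```python
import math

b1, se = 3.1088, 0.3879  # slope and standard error for lconc
lo, hi = b1 - 1.96 * se, b1 + 1.96 * se                # CI on the log odds scale
print(round(math.exp(lo), 1), round(math.exp(hi), 1))  # 10.5 47.9
```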
APPLY YOUR KNOWLEDGE

17.11 Find the 95% confidence interval for the slope. Using the information in the
output of Figure 17.5, find a 95% confidence interval for β1.
Minitab
Logistic Regression Table
                                                  Odds     95% CI
Predictor      Coef    StDev      Z      P       Ratio   Lower   Upper
Constant    –4.8923   0.6426  –7.61  0.000
lconc        3.1088   0.3879   8.01  0.000       22.39   10.47   47.90

FIGURE 17.5 Logistic regression output from SPSS, SAS, and Minitab for the insecticide
data, for Example 17.8.
17.12 Find the 95% confidence interval for the odds ratio. Using the estimate b1
and its standard error, find the 95% confidence interval for the odds
ratio and verify that this agrees with the interval given by SAS.
17.13 X² or z. The Minitab output in Figure 17.5 does not give the value of X².
The column labeled “Z” provides similar information.
(a) Find the value under the heading “Z” for the predictor lconc. Verify that
z is simply the estimated coefficient divided by its standard error. This
is a z statistic that has approximately the standard Normal distribution
if the null hypothesis (slope 0) is true.
(b) Show that the square of z is X². The two-sided P-value for z is the same
as P for X².
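This relationship is just arithmetic; with the Minitab values for lconc:

```python
b1, se = 3.1088, 0.3879  # coefficient and standard error for lconc
z = b1 / se
print(round(z, 2), round(z * z, 2))  # 8.01 64.23
```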
FIGURE 17.6 Logistic regression output from Minitab for the cheese data with Acetic as
the explanatory variable, for Example 17.9.
We estimate that increasing the acetic acid content of the cheese by one
unit will increase the odds that the cheese will be acceptable by about 9
times. The data, however, do not give us a very accurate estimate. The
odds ratio could be as small as a little more than 1 or as large as 71
with 95% confidence. We have evidence to conclude that cheeses with
higher concentrations of acetic acid are more likely to be acceptable, but
establishing the true relationship accurately would require more data.
SPSS
Omnibus Tests of Model Coefficients
Chi-square df Sig.
Model 16.334 3 0.001
SAS
Testing Global Null Hypothesis: BETA = 0
Test Chi-Square DF Pr > ChiSq
Minitab
Logistic Regression Table
Odds 95% CI
Predictor Coef StDev Z P Ratio Lower Upper
Constant –14.260 8.287 –1.72 0.085
acetic 0.584 1.544 0.38 0.705 1.79 0.09 37.01
h2s 0.6849 0.4040 1.69 0.090 1.98 0.90 4.38
lactic 3.468 2.650 1.31 0.191 32.09 0.18 5777.85
Log-Likelihood = –9.230
Test that all slopes are zero: G = 16.334, DF = 3, P-Value = 0.001
FIGURE 17.7 Logistic regression output from SPSS, SAS, and Minitab for the cheese data with
Acetic, H2S, and Lactic as the explanatory variables, for Example 17.10.
3 degrees of freedom. The P-value is 0.001. We reject H0 and conclude that one or
more of the explanatory variables can be used to predict the odds that the cheese is
acceptable.
Next, examine the coefficients for each variable and the tests that each of these
is 0 in a model that contains the other two. The P-values are 0.71, 0.09, and 0.19.
None of the null hypotheses, H0: β1 = 0, H0: β2 = 0, and H0: β3 = 0, can be
rejected. That is, none of the three explanatory variables adds significant predictive
ability once the other two are already in the model.
Our initial multiple logistic regression analysis told us that the explana-
tory variables contain information that is useful for predicting whether or
not the cheese is acceptable. Because the explanatory variables are correlated,
however, we cannot clearly distinguish which variables or combinations of
variables are important. Further analysis of these data using subsets of the
three explanatory variables is needed to clarify the situation. We leave this
work for the exercises.
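Statistical software fits these models by maximum likelihood. Purely as an illustration of what the software does (a NumPy sketch, not the textbook's procedure or any particular package), here is a minimal Newton-Raphson fit; on the binge-drinking indicator model it reproduces the estimates reported by SPSS:

```python
import numpy as np

def logistic_fit(X, y, iters=25):
    """Maximum likelihood for log(p/(1-p)) = X b via Newton-Raphson.
    X must include a column of 1s for the intercept; y is 0/1."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        w = p * (1.0 - p)                  # binomial variance weights
        # Newton step: b += (X' W X)^{-1} X'(y - p)
        b += np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
    return b

# Rebuild the binge-drinking data as 0/1 rows: x = 1 for men, 0 for women.
x = np.repeat([1, 1, 0, 0], [1630, 5550, 1684, 8232]).astype(float)
y = np.repeat([1, 0, 1, 0], [1630, 5550, 1684, 8232]).astype(float)
X = np.column_stack([np.ones_like(x), x])
b0, b1 = logistic_fit(X, y)
print(round(b0, 3), round(b1, 3))  # -1.587 0.362
```

The same function handles several explanatory variables at once: add a column to X for each predictor, as in the cheese model with Acetic, H2S, and Lactic.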
variables. The null hypothesis that the coefficients for all of the
explanatory variables are zero is tested by a statistic that has a distribution
that is approximately χ² with degrees of freedom equal to the number of
explanatory variables. The P-value is approximately P(χ² ≥ X²).
• Hypotheses about individual coefficients, H0: βj = 0 (or H0: e^βj = 1 in
terms of the odds ratio), are tested by a statistic that is approximately χ²
with 1 degree of freedom. The P-value is approximately P(χ² ≥ X²).
STATISTICS IN SUMMARY
Logistic regression is much like linear regression. We use one or more
explanatory variables to predict a response in both instances. For logistic
regression, however, the response variable has only two possible values.
The statistical inference issues are quite similar. We can test the significance
of an explanatory variable and give confidence intervals for its coefficient
in the model. When there are several explanatory variables, we can test
the null hypothesis that all of their coefficients are zero and the null
hypothesis that a single coefficient is zero when the other variables are
in the model. Here are the skills you should develop from studying this
chapter.
A. PRELIMINARIES
1. Recognize the logistic regression setting: a success-or-failure response
variable and a straight-line relationship between the log odds of a
success and an explanatory variable x.
2. If the data contain several observations with the same or similar values
of the explanatory variable, compute the proportion of successes, the
odds of a success, and the log odds for each different value. Plot
the log odds versus the explanatory variable. The relationship should
be approximately linear.
17.20 Healthy companies versus failed companies. Case 7.2 (page 476) compared
the mean ratio of current assets to current liabilities of 68 healthy firms
with the mean ratio for 33 firms that failed. Here we analyze the same
data with a logistic regression. The outcome is whether or not the firm
is successful, and the explanatory variable is the ratio of current assets to
current liabilities. Here is the output from Minitab:
(a) Give the fitted equation for the log odds that a firm will be successful.
(b) Describe the results of the significance test for the coefficient of the ratio
of current assets to current liabilities.
(c) The odds ratio is the estimated amount that the odds of being successful
would increase when the current assets to current liabilities ratio is
increased by one unit. Report this odds ratio with the 95% confidence
interval.
(d) Write a short summary of this analysis and compare it with the analysis
of these data that we performed in Chapter 7. Which approach do you
prefer?
17.21 Blood pressure and cardiovascular disease. There is much evidence that high
blood pressure is associated with increased risk of death from cardiovascular
disease. A major study of this association examined 3338 men with high
blood pressure and 2676 men with low blood pressure. During the period
of the study, 21 men in the low-blood-pressure group and 55 in the
high-blood-pressure group died from cardiovascular disease.
(a) Find the proportion of men who died from cardiovascular disease in the
high-blood-pressure group. Then calculate the odds.
(b) Do the same for the low-blood-pressure group.
(c) Now calculate the odds ratio with the odds for the high-blood-pressure
group in the numerator. Describe the result in words.
17.22 Do the inference and summarize the results. Refer to the previous exercise.
Computer output for a logistic regression analysis of these data gives the
estimated slope b1 = 0.7505 with standard error SEb1 = 0.2578.
(a) Give a 95% confidence interval for the slope.
(b) Calculate the X² statistic for testing the null hypothesis that the slope is
zero and use Table F to find an approximate P-value.
(c) Write a short summary of the results and conclusions.
17.23 Transform to the odds. The results describing the relationship between
blood pressure and cardiovascular disease are given in terms of the change
in log odds in the previous exercise.
(a) Transform the slope to the odds and the 95% confidence interval for
the slope to a 95% confidence interval for the odds.
(b) Write a conclusion using the odds to describe the results.
17.24 Do syntax textbooks have gender bias? To what extent do syntax textbooks,
which analyze the structure of sentences, illustrate gender bias? A study of
this question sampled sentences from 10 texts.7 One part of the study
examined the use of the words “girl,” “boy,” “man,” and “woman.” We
will call the first two words juvenile and the last two adult. Here are data
from one of the texts:
Gender n X (juvenile)
Female 60 48
Male 132 52
(a) Find the proportion of the female references that are juvenile. Then
transform this proportion to odds.
(b) Do the same for the male references.
(c) What is the odds ratio for comparing the female references to the male
references? (Put the female odds in the numerator.)
17.25 Do the inference and summarize the results. The data from the study of
gender bias in syntax textbooks given in the previous exercise are analyzed
using logistic regression. The estimated slope is b1 = 1.8171 and its standard
error is SEb1 = 0.3686.
(a) Give a 95% confidence interval for the slope.
(b) Calculate the X² statistic for testing the null hypothesis that the slope is
zero and use Table F to find an approximate P-value.
(c) Write a short summary of the results and conclusions.
17.26 Transform to the odds. The gender bias in syntax textbooks is described in
the log odds scale in the previous exercise.
(a) Transform the slope to the odds and the 95% confidence interval for
the slope to a 95% confidence interval for the odds.
(b) Write a conclusion using the odds to describe the results.
17.27 Analysis of a reduction in force. To meet competition or cope with economic
slowdowns, corporations sometimes undertake a “reduction in force” (RIF),
where substantial numbers of employees are terminated. Federal and various
state laws require that employees be treated equally regardless of their age.
In particular, employees over the age of 40 years are in a “protected” class,
and many allegations of discrimination focus on comparing employees over
40 with their younger coworkers. Here are the data for a recent RIF:
Over 40
Terminated No Yes
Yes 7 41
No 504 765
(a) Write the logistic regression model for this problem using the log odds
of a termination as the response variable and an indicator for over and
under 40 years of age as the explanatory variable.
CASE STUDY 17.2: Predict whether or not the GPA will be 3.0 or better. In
Case 11.2 we used multiple regression methods to predict grade point average using
the CSDATA data set described in the Data Appendix. The explanatory variables
were the two SAT scores and three high school grade variables. Let’s define success
as earning a GPA of 3.0 or better. So, we define an indicator variable, say HIGPA,
to be 1 if the GPA is 3.0 or better and 0 otherwise. Examine logistic regression
models for predicting HIGPA using the two SAT scores and three high school grade
variables. Summarize all of your results and compare them with what we found
using multiple regression to predict the GPA.
CASE STUDY 17.3: Analyze a Simpson’s paradox data set. In Exercise 2.85 (page
153) we studied an example of Simpson’s paradox, the reversal of the direction
of a comparison or an association when data from several groups are combined
to form a single group. The data concerned two hospitals, A and B, and whether
or not patients undergoing surgery died or survived. Here are the data for all
patients:
Hospital A Hospital B
Died 63 16
Survived 2037 784
Total 2100 800
And here are the more detailed data where the patients are categorized as being in
good condition or poor condition before surgery:
Use a logistic regression to model the odds of death with hospital as the explanatory
variable. Summarize the results of your analysis and give a 95% confidence interval
for the odds ratio of Hospital A relative to Hospital B. Then rerun your analysis using
the hospital and the condition of the patient as explanatory variables. Summarize
the results of your analysis and give a 95% confidence interval for the odds ratio of
Hospital A relative to Hospital B. Write a report explaining Simpson’s paradox in
terms of the results of your analyses.
CASE STUDY 17.4: Compare the homes in two zip codes. The HOMES data set
described in the Data Appendix gives data on the selling prices and characteristics
of homes in several zip codes. In Case 11.3 we used multiple regression models to
predict the prices of homes in zip code 47904. For this case study we will look at
only the homes in zip codes 47904 and 47906. Define a response variable that has
the value 0 for the homes in 47904 and 1 for the homes in 47906. Prepare numerical
and graphical summaries that describe the prices and characteristics of homes in
these two zip codes. Then explore logistic regression models to predict whether or
not a home is in 47906. Summarize your results in a report.
4. The poll is part of the American Express Retail Index Project and is reported in
Stores, December 2000, pp. 38–40.
5. Based on Greg Clinch, “Employee compensation and firms’ research and development
activity,” Journal of Accounting Research, 29 (1991), pp. 59–78.
6. Data from Guohua Li and Susan P. Baker, “Alcohol in fatally injured bicyclists,”
Accident Analysis and Prevention, 26 (1994), pp. 543–548.
7. From Monica Macaulay and Colleen Brice, “Don’t touch my projectile: gender
bias and stereotyping in syntactic examples,” Language, 73, no. 4 (1997),
pp. 798–825.
9. From Karin Weber and Wesley S. Roehl, “Profiling people searching for and
purchasing travel products on the World Wide Web,” Journal of Travel Research,
37 (1999), pp. 291–298.