Haberman Data Logistic Regression Analysis
Professor O’Brien
CDA 310
30 March 2020
I. Overview
The Haberman’s Survival Data Set consists of cases from a study conducted between 1958 and
1970 at the University of Chicago’s Billings Hospital. The goal was to determine whether
certain variables affected the survival status of patients (n=305) who had
To run a binary logistic regression on the data set, it was necessary to establish a dichotomous
response variable: the survival status (class attribute) of the patient. A value of 1 signified
that a patient died within 5 years of the operation, while a value of 2 signified that a patient
survived 5 years or longer. The independent exposure variables were the age of the patient at
the time of operation, the year of the operation, and the number of positive axillary nodes
detected. PSPP software was used to run two binary logistic
III. Findings
A logistic regression was performed to ascertain the effects of age, year of operation, and
number of nodes on the likelihood of survival of patients who had undergone surgery for breast
cancer. The explained variation in the dependent variable based on this model ranged from
61.0% to 91.0%. The classification table showed that survival status could be correctly
predicted from the independent variables with 98.69% accuracy, using a cutoff value of 0.50.
Based on the Wald statistic, only the exposure variable of the number of nodes was statistically
significant, as the p-value was reported as 0.000 (that is, p < 0.001). A second logistic
regression was then run with only the significant independent variable in order to improve the
quality of the model. The critical value was calculated using the formula -intercept/coefficient,
or -6.67/-1.38, to get 4.83. That is, if a patient had 4.83 or fewer nodes present at the time of
operation, the logistic regression predicted that the patient would survive longer than 5 years.
To further examine the effect of the exposure variables, two cross-wise analyses were run, one
for each outcome group (1=No Survival and 2=Survival). The difference in group means was
consistent with the number of nodes being statistically significant. An average of 2.79 nodes
was present for patients surviving 5 or more years, compared to 7.45 nodes for those who did not
survive more than 5 years. The
means for the age of the patient and year of the operation were nearly the same in both cases,
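
The critical value of 4.83 nodes reported above can be reproduced from the fitted intercept and coefficient. The following is a minimal sketch, not the PSPP procedure itself. It assumes the single-predictor model has the form log-odds = intercept + coefficient × nodes, that the modeled outcome is survival of 5 years or longer, and that the intercept is 6.67 and the coefficient is -1.38, as implied by the formula in the text. The function names are hypothetical and chosen for illustration.

```python
import math

# Values taken from the text for the single-predictor model.
# Assumption: log-odds of surviving 5+ years = INTERCEPT + NODE_COEFFICIENT * nodes.
INTERCEPT = 6.67
NODE_COEFFICIENT = -1.38
CUTOFF = 0.50  # classification cutoff reported in the text


def survival_probability(nodes: float) -> float:
    """Apply the logistic function to the fitted log-odds."""
    log_odds = INTERCEPT + NODE_COEFFICIENT * nodes
    return 1.0 / (1.0 + math.exp(-log_odds))


def critical_nodes() -> float:
    """Node count at which the log-odds are zero, i.e. the probability is 0.5."""
    return -INTERCEPT / NODE_COEFFICIENT


def predicts_survival(nodes: float) -> bool:
    """Classify a patient as a predicted survivor at the 0.50 cutoff."""
    return survival_probability(nodes) >= CUTOFF


if __name__ == "__main__":
    print(f"critical value: {critical_nodes():.2f} nodes")
    for n in (0, 4, 5, 10):
        print(n, round(survival_probability(n), 3), predicts_survival(n))
```

Because the coefficient is negative, the predicted probability of survival falls as the node count rises, so every whole-number node count of 4 or fewer is classified as survival and every count of 5 or more is not.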
IV. Recommendations
For future studies, it would be advantageous to include additional exposure variables to
account for possible confounding factors. The analysis did not take into account the patients’
lifestyle, habits, or medical history. Such factors that could have impacted their