Haberman Data Logistic Regression Analysis

The document analyzes a dataset of 305 patients who underwent breast cancer surgery between 1958-1970. Logistic regression was used to determine the effects of age, year of operation, and number of positive nodes on survival. The number of positive nodes was the only statistically significant variable, with patients having 4.83 nodes or less predicted to survive over 5 years. Further analysis found patients surviving had on average 2.79 nodes compared to 7.45 nodes for those who didn't survive over 5 years.

Uploaded by

Evelyn

Evelyn Lupancu

Professor O’Brien

CDA 310

30 March 2020

Analysis of Haberman’s Survival Data Set Using Logistic Regression

I. Overview

Haberman’s Survival Data Set consists of cases from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital. The goal was to determine whether certain variables affected the survival status of patients (n=305) who had undergone surgery for breast cancer.

II. Statistical Methods Used

To run a binary logistic regression on the data set, it was necessary to establish a dichotomous response variable: the survival status (class attribute) of the patient. A value of 1 signified that a patient died within 5 years of the operation, while a value of 2 signified that a patient survived 5 years or longer. The independent exposure variables were the age of the patient at the time of operation, the year of the operation, and the number of positive axillary nodes detected. PSPP software was used to run two binary logistic regressions and to further analyze the class-wise data.
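The coding step described above can be sketched in a few lines of Python. This is only an illustration, not the PSPP procedure used in the study: the helper name `recode_status` and the sample rows are hypothetical, and the sketch assumes that survival (class 2) is the event being modeled.

```python
# Hypothetical helper: maps the class attribute as described above
# (1 = died within 5 years, 2 = survived 5 years or longer)
# to a 0/1 indicator, the form a binary logistic regression works with.
def recode_status(status: int) -> int:
    if status not in (1, 2):
        raise ValueError(f"unexpected survival status: {status}")
    # Assumption: survival is treated as the modeled event (coded 1).
    return 1 if status == 2 else 0


# Made-up rows for illustration only, NOT records from the study:
# (age at operation, year of operation, positive nodes, survival status)
rows = [
    (45, 62, 0, 2),
    (60, 65, 12, 1),
    (52, 59, 3, 2),
]

# Separate the three exposure variables from the dichotomous response.
X = [row[:3] for row in rows]
y = [recode_status(row[3]) for row in rows]

print(y)  # [1, 0, 1]
```

Keeping the recoding in one small function makes the choice of which class counts as the "event" explicit, which matters when interpreting the signs of the fitted coefficients.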

III. Findings

A logistic regression was performed to ascertain the effects of age, year of operation, and number of nodes on the likelihood of survival of patients who had undergone surgery for breast cancer. The explained variation in the dependent variable based on this model ranged from 61.0% to 91.0%. The classification table showed that survival status could be correctly predicted from the independent variables with 98.69% accuracy, using a cutoff value of 0.50. The model provided a regression function of 4.27 - 1.40(x1) - 0.02(x2) + 0.05(x3). According to the Wald statistic, only the number of nodes was statistically significant, with a p-value reported as 0.000 (that is, p < 0.001). A second logistic regression containing only the significant independent variable was then run in order to improve the quality of the model. The critical value was calculated using the formula -intercept/coefficient, or -6.67/-1.38, to get 4.83. That is, if a patient had 4.83 or fewer nodes present at the time of operation, the logistic regression predicted that the patient would survive longer than 5 years. To further examine the effect of the exposure variables, two class-wise analyses were run using the data from 1=No Survival and 2=Survival. The difference in class means supported the finding that the number of nodes is statistically significant: an average of 2.79 nodes was present for patients surviving 5 or more years, compared to 7.45 nodes for those who did not survive more than 5 years. The means for the age of the patient and the year of the operation were nearly the same in both classes, which is consistent with those variables not being significant.
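The critical value follows from the cutoff of 0.50: a predicted probability of 0.50 corresponds to log-odds of zero, so intercept + coefficient × nodes = 0, which gives nodes = -intercept/coefficient. The sketch below reproduces that arithmetic with the reported one-variable coefficients. The function names are illustrative, and it assumes the modeled event is survival, which is what the interpretation above implies (fewer nodes, higher predicted probability of survival).

```python
import math

INTERCEPT = 6.67  # reported intercept of the one-variable model
SLOPE = -1.38     # reported coefficient for the number of positive nodes
CUTOFF = 0.50     # classification cutoff used in the analysis


def survival_probability(nodes: float) -> float:
    """Predicted probability of surviving 5+ years (assumed modeled event)."""
    log_odds = INTERCEPT + SLOPE * nodes
    return 1.0 / (1.0 + math.exp(-log_odds))


def critical_value(intercept: float, slope: float) -> float:
    """Node count at which the log-odds are zero, i.e. probability = 0.50."""
    return -intercept / slope


def predict_survival(nodes: float) -> bool:
    """Classify as 'survives' when the probability reaches the cutoff."""
    return survival_probability(nodes) >= CUTOFF


boundary = critical_value(INTERCEPT, SLOPE)
print(round(boundary, 2))  # 4.83

# Node counts on either side of the boundary receive different predictions.
for nodes in (0, 4, 5, 10):
    print(nodes, predict_survival(nodes))
```

Because the coefficient is negative, every node count below the boundary is classified as survival and every count above it as non-survival; in whole-node terms, 4 nodes falls on the survival side and 5 nodes on the other.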

IV. Recommendations

For future studies, it would be advantageous to include additional exposure variables in order to account for possible confounding factors. The data set did not take into account the patients’ lifestyle, habits, or medical history. Such factors could have affected their likelihood of survival and thus should be addressed in future logistic regression models.


Appendix

Binary Logistic Regression Using 3 Exposure Variables


Output from 1=No Survival Output from 2=Survival
Binary Logistic Regression Using One Exposure Variable
