Predicting Breast Cancer Using Logistic Regression - by Mo Kaiser - The Startup - Medium
Mo Kaiser
Mar 14, 2020 · 7 min read
Source: DataCamp
Background
Breast cancer is the second most common cancer among women in the United States and the second leading cause of cancer death in women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor is not necessarily cancerous; it can be benign (no breast cancer) or malignant (breast cancer). Tests such as an MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose breast cancer.
In this tutorial, we are going to create a model that predicts whether or not a patient has a positive breast cancer diagnosis based on tumor characteristics. The dataset contains the following columns:
id (patient ID)
name
radius (the distance from the center to the circumference of the tumor)
area
compactness
symmetry
fractal_dimension
age
Click here to get the dataset and see my full code on GitHub.
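The import cell itself isn't shown in this excerpt; a typical setup for the snippets that follow (pandas, matplotlib, and seaborn, inferred from the code below) would be:

```python
import pandas as pd                 # data loading and manipulation
import matplotlib.pyplot as plt     # figure creation and display
import seaborn as sns               # statistical plots (countplot, boxplot, heatmap)
```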
Now that our libraries have been imported, let’s go ahead and import our data using
pandas.
train = pd.read_csv('breastcancer.csv')
print(f'Preview:\n\n{train.head()}')
As a side note, f-strings are amazing! They let you print strings and expressions more concisely. The \n means add a new line; I do this to create more white space.
Let's check for missing values with a heat map of nulls:
sns.heatmap(train.isnull(), yticklabels = False, cbar = False)
plt.show()
Looks like we only have nulls in the radius column! Not bad at all and easily fixable :)
sns.set_style("whitegrid")
sns.countplot(data = train, x = 'diagnosis', palette = 'husl')
where:
style controls the color of the axes, whether a grid is enabled by default, and
other aesthetic elements
palette is the color scheme you want to use (a palette name, list, or dict)
Note that 0 doesn't always indicate the absence of something, nor does 1 always
indicate its presence. Make sure you are reading your data correctly; here,
diagnosis = 1 means a positive breast cancer diagnosis.
Here, figsize(width, height) sets up a figure object with a width of 10 inches
and a height of 6 inches.
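The plot itself is missing from this excerpt; here is a sketch of what it likely looked like, assuming a histogram of the radius column (the column choice and bin count are my guesses), with stand-in data in place of breastcancer.csv:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# stand-in data: in the tutorial, `train` is loaded from breastcancer.csv
rng = np.random.default_rng(42)
train = pd.DataFrame({'radius': rng.normal(14, 3, 500)})

plt.figure(figsize=(10, 6))      # 10 inches wide, 6 inches tall
train['radius'].hist(bins=30)
plt.xlabel('radius')
plt.ylabel('count')
plt.show()
```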
The data is not skewed and doesn't have a distinct shape, so it doesn't tell us too much. Let's move on to cleaning our data.
Data Cleaning
The missing data in the radius column needs to be filled in. We are going to do this by imputing the mean radius rather than dropping all null values. To impute a value simply means to replace a missing value with a newly calculated one; our method specifically is referred to as mean imputation.
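As an aside, group-wise mean imputation can also be done in one line with pandas transform (an alternative sketch, not the tutorial's exact approach, shown here on a tiny stand-in frame):

```python
import numpy as np
import pandas as pd

# tiny stand-in frame with one missing radius in each diagnosis group
train = pd.DataFrame({
    'diagnosis': [0, 0, 0, 1, 1, 1],
    'radius':    [10.0, 14.0, np.nan, 16.0, 20.0, np.nan],
})

# fill each null with the mean radius of its diagnosis group
group_means = train.groupby('diagnosis')['radius'].transform('mean')
train['radius'] = train['radius'].fillna(group_means)

print(train['radius'].tolist())   # → [10.0, 14.0, 12.0, 16.0, 20.0, 18.0]
```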
Let’s visualize the average radius of a tumor by diagnosis via a box plot.
plt.figure(figsize = (10,7))
sns.boxplot(x = "diagnosis", y = "radius", data = train)
Women who were diagnosed with breast cancer (diagnosis = 1) tend to have a
larger tumor radius, which is the distance from the center to the circumference
of the tumor.
train.groupby('diagnosis')["radius"].mean()
“Women who are not diagnosed with breast cancer have an average/mean tumor radius
size of 12.34.”
“Women who are diagnosed with breast cancer have an average/mean tumor radius size of
17.89.”
Now that we have found our average tumor radius by diagnosis, let’s impute them into
our missing (aka our null) values.
def impute_radius(cols):
    radius = cols[0]
    diagnosis = cols[1]
    # if value in radius column is null, return the group mean found above
    if pd.isnull(radius):
        if diagnosis == 1:
            return 17.89
        return 12.34
    # otherwise keep the existing value
    return radius

train['radius'] = train[['radius', 'diagnosis']].apply(impute_radius, axis = 1)
In English, this means we are applying our function to both the radius column and
diagnosis column.
We can visualize whether our function worked by checking our heat map again:
# check the heat map again after applying the above function
sns.heatmap(train.isnull(), yticklabels = False, cbar = False)
plt.show()
All rows that were missing data have now been imputed (aka substituted) with the
average radius size, which was determined by whether the woman was diagnosed with
breast cancer. No need to drop other columns or impute more missing values.
train.info()
See how the id and name columns are of object data type? They hold non-numeric
values that the model can't use, so we need to drop them like so:
train.drop(['id', 'name'], axis = 1, inplace = True)
train.head()
X = train.drop('diagnosis', axis = 1)
y = train['diagnosis']
y.head()
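The split-and-fit step isn't shown in this excerpt; a standard scikit-learn workflow looks like the following sketch (the test_size and random_state values are my choices, and sklearn's built-in breast cancer dataset stands in for the tutorial's CSV; note its label encoding may differ):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# stand-in for the tutorial's X and y: sklearn's built-in breast cancer
# dataset has similar tumor features (radius, area, compactness, ...)
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# hold out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

model = LogisticRegression(max_iter=5000)  # raise max_iter so the solver converges
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

predictions and y_test then feed straight into classification_report and confusion_matrix below.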
Precision and recall are not the same. Precision is the fraction of positive
predictions that are actually correct. Recall is the fraction of all actual
positives that were correctly classified.
The F1 score is the harmonic mean of precision and recall and ranges from 0
(terrible) to 1 (perfection).
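To make these definitions concrete, here is the arithmetic on the counts reported in the conclusion below (66 women predicted positive with 10 false positives, 105 predicted negative with 7 false negatives):

```python
# counts from the conclusion: 171 women in the test set
tp = 66 - 10   # predicted positive and actually positive
fp = 10        # predicted positive but actually negative
fn = 7         # predicted negative but actually positive
tn = 105 - 7   # predicted negative and actually negative

precision = tp / (tp + fp)                           # 56 / 66
recall = tp / (tp + fn)                              # 56 / 63
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
accuracy = (tp + tn) / (tp + tn + fp + fn)           # 154 / 171

print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
# → 0.848 0.889 0.868 0.901
```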
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
Conclusion
We had 171 women in our test set. Of the 105 women predicted to not have
breast cancer, 7 actually did have it (false negatives, a Type II error). Of
the 66 women predicted to have breast cancer, 10 did not (false positives, a
Type I error). In a nutshell, our model was about 90% accurate (154 of 171
correct).
Documentation Links
seaborn.heatmap
seaborn.set_style
seaborn.countplot
seaborn.color_palette
pandas.DataFrame.apply()
pandas.DataFrame.info()
sklearn.metrics.classification_report
References
Understanding the Classification Report
Author Note
Thanks for reading! Please feel free to follow me on Medium and LinkedIn. I’d love to
continue the conversation and hear your thoughts/suggestions.
-Mo
Thanks to P Ozturk.