Predicting Breast Cancer Using Logistic Regression
Learn how to perform Exploratory Data Analysis, apply mean imputation, build a
classification algorithm, and interpret the results.

Mo Kaiser
Mar 14, 2020 · 7 min read

Source: DataCamp
Background
Breast cancer is the most common cancer among women in the United States and the second leading cause of cancer death among women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not necessarily mean cancer; it can be benign (no breast cancer) or malignant (breast cancer). Tests such as an MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose breast cancer.

In this tutorial, we are going to create a model that will predict whether or not a
patient has a positive breast cancer diagnosis based on the tumor's characteristics.

This dataset contains the following features:

id (patient id)

name

radius (the distance from the center to the circumference of the tumor)

texture (standard deviation of gray-scale values)

perimeter (circumference of the tumor, approx. 2*3.14 *radius)

area

smoothness (local variation in radius lengths)

compactness

concavity (severity of concave portions of the contour)

symmetry

fractal_dimension

age

diagnosis (0 indicates no breast cancer, 1 indicates breast cancer)

Click here to get the dataset and see my full code on GitHub.

Import Libraries and Data


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
print(f'Libraries have been imported! :)')

Now that our libraries have been imported, let’s go ahead and import our data using
pandas.

train = pd.read_csv('breastcancer.csv')
print(f'Preview:\n\n{train.head()}')

As a side note, F-strings are amazing! They allow you to print strings and expressions
in a more concise manner. The \n part means to add a new line. I do this to create more
white space.
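
For instance, here is a tiny illustration of both features (my own example, not from the original post):

model_name = 'logistic regression'
print(f'Model: {model_name}\nStatus: ready')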

Exploratory Data Analysis


Exploratory Data Analysis (EDA) answers the “What are we dealing with?” question.
EDA is where we try to understand our data first. We want to gain insights before
messing around with it.
Visualizations are a great way to do this.

Visualization #1: Heat Map

# simple heat map showing where we are missing data

heat_map = sns.heatmap(train.isnull(), yticklabels = False, cbar = True, cmap = "PuRd", vmin = 0, vmax = 1)
plt.show()

where:

train.isnull() checks for nulls in the train df

yticklabels = False hides the row index labels on the y-axis

cbar = True adds a color bar

cmap maps data values to a color space

vmin sets 0 as the minimum for the color bar

vmax sets 1 as the maximum for the color bar


This heat map is interpreted as follows:

0 (white color) means we have a value

1 (dark red color) means we have a null

Looks like we only have nulls in the radius column! Not bad at all and easily fixable :)
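
If you'd also like a numeric confirmation of what the heat map shows, a quick optional check (my addition, not part of the original tutorial) is to count the nulls per column:

# number of missing values in each column; only radius should be non-zero
print(train.isnull().sum())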

Visualization #2: Count Plot

# a count plot shows the counts of observations in each categorical bin using bars
# think of it as a histogram across a categorical, instead of quantitative, variable

sns.set_style("whitegrid")
sns.countplot(data = train, x = 'diagnosis', palette = 'husl')

where:

set_style affects the color of the axes, whether a grid is enabled by default, and other aesthetic elements

data is the df, array, or list of arrays to plot

x is the name of the variable in the data parameter

palette is the color palette you want to use (palette name, list, or dict)
where:

0 indicates no breast cancer

1 indicates breast cancer

Note that 0 doesn't always indicate the absence of something, nor does 1 always indicate a presence. Make sure you are reading your data correctly.
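
If you want the exact counts behind the bars, one optional extra step (my addition, not in the original article) is:

# exact number of rows per diagnosis class (0 = no breast cancer, 1 = breast cancer)
print(train['diagnosis'].value_counts())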

Visualization #3: Histogram

# let's check out the spread of ages using a histogram

train['age'].plot.hist(bins = 25, figsize = (10,6))

where:

we are looking at the age column within the train df

bins sets the number of class intervals

figsize = (width, height) sets a figure object with a width of 10 inches and a height of 6 inches
The age distribution isn't skewed and doesn't have a distinct shape, so it doesn't tell us much on its own. Let's move on to cleaning our data.
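
Before we do, an optional numeric complement to the histogram (my addition, not from the original post):

# count, mean, spread, and quartiles of patient age
print(train['age'].describe())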

Data Cleaning
The missing data in the radius column needs to be filled in. We are going to do this by
imputing the mean radius, not just dropping all null values. To impute a value simply
means we are going to replace missing values with our newly calculated value. For our
method specifically, it is referred to as mean imputation.

Let’s visualize the average radius of a tumor by diagnosis via a box plot.

plt.figure(figsize = (10,7))
sns.boxplot(x = "diagnosis", y = "radius", data = train)

Women who were diagnosed with breast cancer (diagnosis = 1) tend to have a
higher tumor radius size, which is the distance from the center to the circumference
of the tumor.

# calculate the average radius size by diagnosis (0 or 1)

train.groupby('diagnosis')["radius"].mean()

This is interpreted as…

“Women who are not diagnosed with breast cancer have an average/mean tumor radius
size of 12.34.”

“Women who are diagnosed with breast cancer have an average/mean tumor radius size of
17.89.”

Now that we have found our average tumor radius by diagnosis, let’s impute them into
our missing (aka our null) values.

# create a function that imputes average radius into missing values

def impute_radius(cols):
    radius = cols[0]
    diagnosis = cols[1]
    # if value in radius column is null
    if pd.isnull(radius):
        # if woman is diagnosed with breast cancer
        if diagnosis == 1:
            return 17
        # if woman was not diagnosed with breast cancer
        else:
            return 12
    # when value in radius column is not null
    else:
        # return that same value
        return radius

After creating our function, we need to apply it like so:

train['radius'] = train[['radius', 'diagnosis']].apply(impute_radius, axis = 1)

In plain English, this applies our function row by row (axis = 1), passing each row's radius and diagnosis values to impute_radius.
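
As an aside, the same group-wise mean imputation can be written more compactly with pandas built-ins. This is an alternative sketch, not the tutorial's code, and it fills with the exact group means (12.34 and 17.89) rather than the rounded values returned by impute_radius:

# fill missing radius values with the mean radius of the matching diagnosis group
train['radius'] = train['radius'].fillna(
    train.groupby('diagnosis')['radius'].transform('mean')
)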

We can visualize whether our function worked by checking our heat map again:

# check the heat map again after applying the above function

heat_map = sns.heatmap(train.isnull(), yticklabels = False, cbar = True, cmap = "PuRd", vmin = 0, vmax = 1)
plt.show()

All rows that were missing data have now been imputed (aka substituted) with the
average radius size, which was determined by whether the woman was diagnosed with
breast cancer. No need to drop other columns or impute more missing values.

Let’s now look at a concise summary of our data:

train.info()

See how the id and name columns are of the object data type? That means they are categorical text fields rather than numeric features, and we need to drop them like so:

# dropping categorical variables

train.drop(['id', 'name'], axis = 1, inplace = True)

Checking out what our dataframe looks like:

train.head()

Build the Model


Step 1: Split data into X and y

X = train.drop('diagnosis', axis = 1)
y = train['diagnosis']

Check out what X and y look like:


X.head()

y.head()

Step 2: Split data into train set and test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)

Step 3: Train and predict

from sklearn.linear_model import LogisticRegression


logreg = LogisticRegression()
logreg.fit(X_train, y_train)
predictions = logreg.predict(X_test)
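
Note that depending on your scikit-learn version, the default solver may warn that it did not converge on unscaled features like these. A common remedy, shown here as a sketch rather than as part of the original tutorial, is to standardize the features and/or raise max_iter:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale features before fitting so the solver converges more easily
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
predictions = logreg.predict(X_test)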

Evaluate the Model


A classification report shows our model's precision, recall, and F1 score. The support is the number of true samples in each class.

Precision and recall are not the same. Precision is the fraction of predicted positive cases that are truly positive. Recall is the fraction of truly positive cases that the model correctly identified.

The F1 score is the harmonic mean of precision and recall, and it ranges from 0 (terrible) to 1 (perfect).

from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, predictions))
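
To make the definitions above concrete, here is a small sketch (my addition, not from the original post) that recovers precision, recall, and F1 for the positive class directly from the confusion-matrix counts:

# rows of the confusion matrix are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

precision = tp / (tp + fp)   # of the women predicted positive, how many truly were
recall = tp / (tp + fn)      # of the women who truly were positive, how many we caught
f1 = 2 * precision * recall / (precision + recall)
print(f'precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}')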

Conclusion
We had 171 women in our test set. Out of the 105 women predicted to not have breast cancer, 7 actually did have it (false negatives, a Type II error). Out of the 66 women predicted to have breast cancer, 10 did not actually have it (false positives, a Type I error). In a nutshell, our model was roughly 90% accurate.
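
As a quick arithmetic check on the counts above: the model got 171 − 7 − 10 = 154 predictions right, and 154 / 171 ≈ 0.90, which matches the reported accuracy.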

Documentation Links
seaborn.heatmap

seaborn.set_style

seaborn.countplot

seaborn.color_palette

pandas.DataFrame.apply()

pandas.DataFrame.info()

sklearn.metrics.classification_report

References
Understanding the Classification Report

Impute Missing Values with Means

Author Note
Thanks for reading! Please feel free to follow me on Medium and LinkedIn. I’d love to
continue the conversation and hear your thoughts/suggestions.

-Mo

Thanks to P Ozturk. 
