0% found this document useful (0 votes)

85 views15 pages

Predicting Breast Cancer Using Logistic Regression - by Mo Kaiser - The Startup - Medium

This document summarizes a tutorial on using logistic regression to predict breast cancer. It discusses exploring breast cancer data through visualizations to understand missing values and the distribution of variables. Mean imputation is used to fill in missing radius values. Logistic regression is applied to predict diagnoses and is evaluated to have around 90% accuracy on test data.

Uploaded by

Ghifari Raka

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

85 views15 pages

Predicting Breast Cancer Using Logistic Regression - by Mo Kaiser - The Startup - Medium

Uploaded by

Ghifari Raka

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

You have 1 free member-only story left this month.

Sign up for Medium and get an extra one

Predicting Breast Cancer Using Logistic

Regression
Learn how to perform Exploratory Data Analysis, apply mean imputation, build a
classification algorithm, and interpret the results.

Mo Kaiser
Mar 14, 2020 · 7 min read

Source: DataCamp
Background
Breast cancer is the second most common cancer and has the highest cancer death rate
among women in the United States. Breast cancer occurs as a result of abnormal
growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not
mean cancer — can be benign (no breast cancer) or malignant (breast cancer). Tests
such as an MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose
breast cancer.

In this tutorial, we are going to create a model that will predict whether or not a
patient has a positive breast cancer diagnosis based off of the tumor characteristics.

This dataset contains the following features:

id (patientid)

name

radius (the distance from the center to the circumference of the tumor)

texture (standard deviation of gray-scale values)

perimeter (circumference of the tumor, approx. 23.14 radius)

area

smoothness (local variation in radius lengths)

compactness

concavity (severity of concave portions of the contour)

symmetry

fractal_dimension

age

diagnosis: 0 or 1 indicating whether patient has breast cancer or not

Click here to get the dataset and see my full code on GitHub.

Import Libraries and Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
print(f'Libraries have been imported! :)')

Now that our libraries have been imported, let’s go ahead and import our data using
pandas.

train = pd.read_csv('breastcancer.csv')
print(f'Preview:\n\n{train.head()}')

As a side note, F-strings are amazing! They allow you to print strings and expressions
in a more concise manner. The \n part means to add a new line. I do this to create more
white space.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) answers the “What are we dealing with?” question.
EDA is where we try to understand our data first. We want to gain insights before
messing around with it.
Visualizations are a great way to do this.

Visualization #1: Heat Map

# simple heat map showing where we are missing data

heat_map = sns.heatmap(train.isnull(), yticklabels = False, cbar =

True, cmap = "PuRd", vmin = 0, vmax = 1)

plt.show()

where:

train.isnull() is checking for nulls in the train df

yticklabels is not plotting train df column names to y-axis

cbar is adding a color bar

cmap is mapping data values to a color space

vmin set 0 as the minimum for the color bar

vmax set 1 as the maximum for the color bar

This heat map is interpreted as the following:

0 (white color) means we have a value

1 (dark red color) means we have a null

Looks like we only have nulls in the radius column! Not bad at all and easily fixable :)

Visualization #2: Count Plot

# a count plot shows the counts of observations in each categorical

bin using bars
# think of it as a histogram across a categorical, instead of
quantitative, variable

sns.set_style("whitegrid")
sns.countplot(data = train, x = 'diagnosis', palette = 'husl')

where:

style is affecting the color of the axes, whether a grid is enabled by default, and
other aesthetic elements

data is the df, array, or list of arrays to plot

x is the the name of the variable in the data parameter

palette is the color you want to use (palette name, list, or dict)
where:

0 indicates no breast cancer

1 indicates breast cancer

Note that 0 doesn’t always indicate an absence of something and that 1 means a
presence of something. Make sure you are reading your data correctly.

Visualization #3: Histogram

# let's check out the spread of ages using a histogram

train['age'].plot.hist(bins = 25, figsize = (10,6))

where:

we are looking at the age column within the train df

bins are setting the number of class intervals

figsize(width, height) sets a figure object with a width of 10 inches and height of 6
inches
Data is not skewed and doesn’t have a distinct shape — doesn’t tell us too much. Let’s
move on to cleaning our data.

Data Cleaning
The missing data in the radius column needs to be filled in. We are going to do this by
imputing the mean radius, not just dropping all null values. To impute a value simply
means we are going to replace missing values with our newly calculated value. For our
method specifically, it is referred to as mean imputation.

Let’s visualize the average radius of a tumor by diagnosis via a box plot.

plt.figure(figsize = (10,7))
sns.boxplot(x = "diagnosis", y = "radius", data = train)
Women who were diagnosed with breast cancer (diagnosis = 1) tend to have a
higher tumor radius size, which is the distance from the center to the circumference
of the tumor.

# calculate the average radius size by diagnosis (0 or 1)

train.groupby('diagnosis')["radius"].mean()

This is interpreted as…

“Women who are not diagnosed with breast cancer have an average/mean tumor radius
size of 12.34.”

“Women who are diagnosed with breast cancer have an average/mean tumor radius size of
17.89.”

Now that we have found our average tumor radius by diagnosis, let’s impute them into
our missing (aka our null) values.

# create a function that imputes average radius into missing values

def impute_radius(cols):
radius = cols[0]
diagnosis = cols[1]
# if value in radius column is null
if pd.isnull(radius):

# if woman is diagnosed with breast cancer

if diagnosis == 1:
return 17
# if woman was not diagnosed with breast cancer
else:
return 12
# when value in radius column is not null
else:
# return that same value
return radius

After creating our function, we need to apply it like so:

train['radius'] = train[['radius',
'diagnosis']].apply(impute_radius, axis = 1)

In English, this means we are applying our function to both the radius column and
diagnosis column.

We can visualize whether our function worked by checking our heat map again:

# check the heat map again after applying the above function

heat_map = sns.heatmap(train.isnull(), yticklabels = False, cbar =

True, cmap = "PuRd", vmin = 0, vmax = 1)

plt.show()
All rows that were missing data have now been imputed (aka substituted) with the
average radius size, which was determined by whether the woman was diagnosed with
breast cancer. No need to drop other columns or impute more missing values.

Let’s now look at a concise summary of our data:

train.info()
See how the id and name columns are of object data type? That means they are
categorical, and we need to drop those like so:

# dropping categorical variables

train.drop(['id', 'name'], axis = 1, inplace = True)

Checking out what our dataframe looks like:

train.head()

Build the Model

Step 1: Split data into X and y

X = train.drop('diagnosis', axis = 1)
y = train['diagnosis']

Check out what X and y look like:

X.head()

y.head()

Step 2: Split data into train set and test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size

= 0.3, random_state = 101)

Step 3: Train and predict

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
predictions = logreg.predict(X_test)

Evaluate the Model

A classification report checks our model’s precision, recall, and F1 score. The support is
the number of samples of the true response that lies in that class.

Precision and recall are not the same. Precision is the fraction of relevant results.
Recall is the fraction of all relevant results that were correctly classified.

F1 score is the harmonic mean between precision and recall that ranges between 0
(terrible) to 1 (perfection).

from sklearn.metrics import classification_report

classification_report(y_test, predictions)

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, predictions)

Conclusion
We had 171 women in our test set. Out of the 105 women predicted to not have
breast cancer, 7 women were classified as not having breast cancer when they
actually did (Type I error). Out of the 66 women predicted to have breast cancer, 10
were classified as having breast cancer when they did not (Type II error). In a nut
shell, our model was more or less 90% accurate.

Documentation Links
seaborn.heatmap

seaborn.set_style

seaborn.countplot

seaborn.color_palette

pandas.DataFrame.apply()

pandas.DataFrame.info()

sklearn.metrics.classification_report

References
Understanding the Classification Report

Impute Missing Values with Means

Author Note
Thanks for reading! Please feel free to follow me on Medium and LinkedIn. I’d love to
continue the conversation and hear your thoughts/suggestions.

-Mo

Thanks to P Ozturk.

Sign up for Top 10 Stories

By The Startup

Get smarter at building your thing. Subscribe to receive The Startup's top 10 most read stories —
delivered straight into your inbox, once a week. Take a look.

Your email

Get this newsletter

By signing up, you will create a Medium account if you don’t already have one. Review our Privacy Policy for more information
about our privacy practices.

Logistic Regression Exploratory Data Analysis Machine Learning Data Science Python

About Help Legal

Get the Medium app

1 Anti-Steiner Point Definition
No ratings yet
1 Anti-Steiner Point Definition
24 pages
Analysis of A Three Flight Free Standing Staircase
100% (1)
Analysis of A Three Flight Free Standing Staircase
8 pages
Breast Cancer Detection
No ratings yet
Breast Cancer Detection
15 pages
Machine Learning Data Analysis
No ratings yet
Machine Learning Data Analysis
21 pages
Using Predictive Analytics Model To Diagnose Breast Cnacer
No ratings yet
Using Predictive Analytics Model To Diagnose Breast Cnacer
9 pages
Building A Simple Machine Learning Model On Breast Cancer Data
No ratings yet
Building A Simple Machine Learning Model On Breast Cancer Data
12 pages
Breast Cancer Detection and Prediction: Created by
No ratings yet
Breast Cancer Detection and Prediction: Created by
20 pages
IJERT Developing A Web Based System For
No ratings yet
IJERT Developing A Web Based System For
5 pages
BSAN Case 3
No ratings yet
BSAN Case 3
9 pages
IDS Project Group 11
No ratings yet
IDS Project Group 11
35 pages
Logistic Regression For Binary Classification With Core APIs - TensorFlow Core
No ratings yet
Logistic Regression For Binary Classification With Core APIs - TensorFlow Core
22 pages
Cancer Detection Using Data Mining
No ratings yet
Cancer Detection Using Data Mining
13 pages
Machine Learning Evaluation Metrics Lecturer
No ratings yet
Machine Learning Evaluation Metrics Lecturer
30 pages
Final Big Data
No ratings yet
Final Big Data
23 pages
Classification Algorithms
No ratings yet
Classification Algorithms
16 pages
Breast Cancer Diagnosis Using Machine Learning Alg
No ratings yet
Breast Cancer Diagnosis Using Machine Learning Alg
13 pages
Logistic Regression For Malignancy Prediction in Cancer - by Luca Zammataro - Towards Data Science
No ratings yet
Logistic Regression For Malignancy Prediction in Cancer - by Luca Zammataro - Towards Data Science
32 pages
Logistic Regression vs. SVMs - Solution
No ratings yet
Logistic Regression vs. SVMs - Solution
7 pages
Project Final
No ratings yet
Project Final
15 pages
Breast Cancer Detection Algo Comparison
No ratings yet
Breast Cancer Detection Algo Comparison
15 pages
Machine Learning Algorithm
No ratings yet
Machine Learning Algorithm
18 pages
Intel Report
No ratings yet
Intel Report
15 pages
S2 24 WIPRO AML Labcourse2 Kittu
No ratings yet
S2 24 WIPRO AML Labcourse2 Kittu
15 pages
HussainBadshah SafwanSheikh
No ratings yet
HussainBadshah SafwanSheikh
12 pages
Support Vector Machine (SVM) - Bioinformatics
No ratings yet
Support Vector Machine (SVM) - Bioinformatics
10 pages
Breast Cancer Survival Prediction With Machine Learning
No ratings yet
Breast Cancer Survival Prediction With Machine Learning
12 pages
Breast Cancer Classification
No ratings yet
Breast Cancer Classification
18 pages
Support Vector Machines Com Python
No ratings yet
Support Vector Machines Com Python
13 pages
Machine Learning For Breast Cancer Diagnosis A Proof of Concept
No ratings yet
Machine Learning For Breast Cancer Diagnosis A Proof of Concept
27 pages
Heart Disease Prediction - Jupyter Notebook
100% (1)
Heart Disease Prediction - Jupyter Notebook
9 pages
Cancer Detection
No ratings yet
Cancer Detection
12 pages
Breast Cancer Prediction
No ratings yet
Breast Cancer Prediction
5 pages
PROJECTS
No ratings yet
PROJECTS
6 pages
Data Analysis and Machine Learning On The Wisconsin Breast Cancer Dataset
No ratings yet
Data Analysis and Machine Learning On The Wisconsin Breast Cancer Dataset
11 pages
Breast Cancer Classification Using Python
No ratings yet
Breast Cancer Classification Using Python
26 pages
Python Final Project Group 03
No ratings yet
Python Final Project Group 03
18 pages
utf-8''C2M1 Assignment
No ratings yet
utf-8''C2M1 Assignment
24 pages
HW Wincon
No ratings yet
HW Wincon
3 pages
Applications of Machine Learning Techniques To Predict Diagnostic Breast Cancer
No ratings yet
Applications of Machine Learning Techniques To Predict Diagnostic Breast Cancer
11 pages
Breast Cancer Classification Report
No ratings yet
Breast Cancer Classification Report
16 pages
ML Healthcare Clean APA Final
No ratings yet
ML Healthcare Clean APA Final
9 pages
Unit 2
No ratings yet
Unit 2
19 pages
Final Research Paper
No ratings yet
Final Research Paper
3 pages
Thyroid Disease Classification Using Machine Learning Project
No ratings yet
Thyroid Disease Classification Using Machine Learning Project
34 pages
Worksheet Classification1
No ratings yet
Worksheet Classification1
15 pages
Exploratory Data Analysis Main Concepts
No ratings yet
Exploratory Data Analysis Main Concepts
1 page
Breast Cancer Classification Using DTC
No ratings yet
Breast Cancer Classification Using DTC
1 page
Mla - 2 (Cia - 1) - 20221013
No ratings yet
Mla - 2 (Cia - 1) - 20221013
14 pages
Cancer Detection - Formal Version
No ratings yet
Cancer Detection - Formal Version
38 pages
Decision Support
No ratings yet
Decision Support
21 pages
SUMMARY
No ratings yet
SUMMARY
16 pages
A Hybrid Model To Predict The Breast Cancer Using Stacking and Bagging Model
No ratings yet
A Hybrid Model To Predict The Breast Cancer Using Stacking and Bagging Model
6 pages
IEEE Conference Team ATOM
No ratings yet
IEEE Conference Team ATOM
5 pages
Predicting Disease With Machine Learning
No ratings yet
Predicting Disease With Machine Learning
20 pages
Project Synopsis On Breast Cancer Detection Using Data Mining
No ratings yet
Project Synopsis On Breast Cancer Detection Using Data Mining
3 pages
Asset-V1 VIT+MBA109+2020+type@asset+block@Introductio To ML Using Python
No ratings yet
Asset-V1 VIT+MBA109+2020+type@asset+block@Introductio To ML Using Python
7 pages
Thyroid Disease Classification Using ML
No ratings yet
Thyroid Disease Classification Using ML
37 pages
Molecular Classification of Leukemia Using Gene Expression Data and Random Forest
No ratings yet
Molecular Classification of Leukemia Using Gene Expression Data and Random Forest
17 pages
Project Report
No ratings yet
Project Report
18 pages
Machine Learning For Breast Cancer Diagnosis An End-To-End Analysis
No ratings yet
Machine Learning For Breast Cancer Diagnosis An End-To-End Analysis
20 pages
Vertical Projectile Motion 2025 DOBS
No ratings yet
Vertical Projectile Motion 2025 DOBS
55 pages
PHY10 Lesson 2 Kinematics (Full)
No ratings yet
PHY10 Lesson 2 Kinematics (Full)
35 pages
Share Class 6 Pt2 2025
No ratings yet
Share Class 6 Pt2 2025
4 pages
Equations PDF
No ratings yet
Equations PDF
20 pages
Bangxi Li (Auth.) - Linear Theory of Fixed Capital and China's Economy - Marx, Sraffa and Okishio-Springer Singapore (2017)
No ratings yet
Bangxi Li (Auth.) - Linear Theory of Fixed Capital and China's Economy - Marx, Sraffa and Okishio-Springer Singapore (2017)
132 pages
Mechanical 3 RD Sem Syllabus
No ratings yet
Mechanical 3 RD Sem Syllabus
6 pages
Balancing Hard To Balance Equations PDF
No ratings yet
Balancing Hard To Balance Equations PDF
2 pages
Unit 2 Acceleration Analysis - Relative Velocity Method
No ratings yet
Unit 2 Acceleration Analysis - Relative Velocity Method
9 pages
Lesson 4 Polynomial Curves Tangents and Normal To Plane Curves
No ratings yet
Lesson 4 Polynomial Curves Tangents and Normal To Plane Curves
57 pages
Cad-Cam Modeling PDF
100% (2)
Cad-Cam Modeling PDF
390 pages
第七單元細線化與骨架抽取
No ratings yet
第七單元細線化與骨架抽取
13 pages
DSP 1imp
No ratings yet
DSP 1imp
13 pages
Models - Ssf.forchheimer Flow
No ratings yet
Models - Ssf.forchheimer Flow
12 pages
Pennachi - Theory of Asset Pricing
100% (1)
Pennachi - Theory of Asset Pricing
570 pages
HW 4
No ratings yet
HW 4
7 pages
Hashsorting
No ratings yet
Hashsorting
33 pages
Math First Quarter Module
No ratings yet
Math First Quarter Module
4 pages
B Tech Electrical and Electronics Engineering
No ratings yet
B Tech Electrical and Electronics Engineering
221 pages
Astrophysics, Gravitation and Quantum Physics PDF
100% (2)
Astrophysics, Gravitation and Quantum Physics PDF
300 pages
Class Test
No ratings yet
Class Test
10 pages
Origin of The South African Measurement System
No ratings yet
Origin of The South African Measurement System
3 pages
Pressure Buildup Analysis With Wellbore Phase Redistribution
No ratings yet
Pressure Buildup Analysis With Wellbore Phase Redistribution
12 pages
Paraview Tutorial
No ratings yet
Paraview Tutorial
28 pages
Conference Latex Template ECCE
No ratings yet
Conference Latex Template ECCE
6 pages
Quadratic Formula PROOF
100% (1)
Quadratic Formula PROOF
1 page
Maths Paper Solving
No ratings yet
Maths Paper Solving
5 pages
MAT1100 Integral Calculus I - 2020
No ratings yet
MAT1100 Integral Calculus I - 2020
6 pages
A Numerical Case Study On Pier Shape Coefficient o
No ratings yet
A Numerical Case Study On Pier Shape Coefficient o
7 pages