18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium
Breast Cancer Classification Using
Python
A guide to EDA and classification
Mugdha Paithankar
Nov 8, 2020 · 13 min read
Photo by Peter Boccia on Unsplash
Breast cancer (BC) is one of the most common cancers among women in
the world today.
Currently, the average risk of a woman in the United States developing
breast cancer sometime in her life is about 13%, which means there is a 1
in 8 chance she will develop breast cancer!
An early diagnosis of BC can greatly improve the prognosis and chance of
survival for patients. Thus an accurate identification of malignant tumors is
of paramount importance.
In this article I will go over all the steps needed to make a complete data
science project and, with the use of machine learning algorithms,
ultimately build a model which accurately classifies tumors as Benign or
Malignant based on the tumor shape and its geometry.
Step 1: Get the data!
I got the dataset from Kaggle. It contains 569 rows and 32 columns of
tumor shape and specifications. The tumor is classified as benign or
malignant based on its geometry and shape. Features are computed from a
digitized image of a fine needle aspirate (FNA) of a breast mass, which is
a type of biopsy procedure. They describe characteristics of the cell nuclei
present in the image.
The features of the dataset include:
1. tumor radius (mean of distances from center to points on the
perimeter)
2. texture (standard deviation of gray-scale values)
3. perimeter
4. area
5. smoothness (local variation in radius lengths)
6. compactness (perimeter² / area - 1.0)
7. concavity (severity of concave portions of the contour)
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension
The mean, standard error and “worst” or largest (mean of the three largest
values) of these features were computed for each image, resulting in 30
features.
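As a quick check on that count, the 30 column names can be generated from the 10 base features and the 3 statistics. This is a minimal sketch; the `_mean`, `_se` and `_worst` suffixes and the `concave points` spelling are assumed to follow the Kaggle CSV.

```python
# The 10 base measurements listed above (names assumed to match the Kaggle CSV).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# The 3 statistics computed for every measurement.
stats_suffixes = ["mean", "se", "worst"]

# One column per (statistic, measurement) pair, grouped by statistic.
feature_columns = [
    f"{name}_{suffix}" for suffix in stats_suffixes for name in base_features
]

print(len(feature_columns))   # 10 measurements x 3 statistics
print(feature_columns[:3])
```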
Step 2: Exploratory Data Analysis (EDA)
import pandas as pd

#make a dataframe
df = pd.read_csv('[Link]')
#examine the shape of the data
df.shape
#get the column names
df.columns
The dataset has 569 rows and 33 columns. There are two extra columns,
“id” and “Unnamed: 32”. We drop Unnamed: 32, which has all NaN
values.
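A small sketch of the two ways of dropping such a column, using a toy DataFrame (the `toy` name and its values are made up for illustration). `dropna(axis=1)` removes every column that contains any missing value, while `how="all"` only removes columns that are entirely missing.

```python
import numpy as np
import pandas as pd

# Toy frame: one column is entirely NaN, one has a single missing value.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [14.1, np.nan, 12.3],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Default how='any': drops every column containing at least one NaN.
dropped_any = toy.dropna(axis=1)
# how='all': drops only columns where every value is NaN.
dropped_all = toy.dropna(axis=1, how="all")

print(list(dropped_any.columns))
print(list(dropped_all.columns))
```

Given that 30 features remain after the drop in this post, both calls appear to give the same result on this dataset, but `how="all"` is the safer choice when other columns may contain occasional gaps.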
import seaborn as sns

#Drop the column with all missing values (na, NAN, NaN)
#NOTE: This drops the Unnamed: 32 column
df = df.dropna(axis=1)
#Get a count of the number of 'M' & 'B' cells
df['diagnosis'].value_counts()
#Visualize this count
sns.countplot(x='diagnosis', data=df)
212 Malignant and 357 Benign tumors
There are now 30 features we can visualize. I decided to plot 10 features
at a time. This led to 3 plots containing 10 features each. The means of all
the features were plotted together, so were the standard errors and worst
dimensions.
Violin plots are similar to density plots. Unlike bar graphs with means and
error bars, they show the full distribution of the data, which makes them an
excellent tool for visualizing small samples.
I made violin plots and commented, based on their distributions, on whether
each feature is likely to be good for classification. To make violin plots for
this dataset, first separate the data labels ‘M’ or ‘B’ (into y) and the
features (into X). Then visualize 10 features at a time.
import matplotlib.pyplot as plt

# y includes diagnosis column with M or B values
y = df.diagnosis
# drop the column 'id' as it does not convey any useful info
# drop diagnosis since we are separating labels and features
drop_cols = ['id', 'diagnosis']
# X includes our features
X = df.drop(drop_cols, axis=1)
# standardization
data_std = (X - X.mean()) / X.std()
# get the first 10 features
data = pd.concat([y, data_std.iloc[:, 0:10]], axis=1)
data = pd.melt(data, id_vars="diagnosis", var_name="features", value_name='value')
# make a violin plot
plt.figure(figsize=(10, 10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data, split=True, inner="quart")
plt.xticks(rotation=90)
Violin plot displaying all the mean features
The median of texture_mean for Malignant and Benign looks separated,
so it might be a good feature for classification. For
fractal_dimension_mean, the medians of the Malignant and Benign
groups are very close to each other.
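One way to put a number on "looks separated" is to compare the per-group medians of the standardized features. The sketch below uses a tiny made-up DataFrame rather than the real data; the column names follow the dataset, but the values are invented.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "diagnosis": ["M", "M", "M", "B", "B", "B"],
    "texture_mean": [21.0, 22.5, 20.0, 17.0, 18.5, 16.0],
    "fractal_dimension_mean": [0.061, 0.063, 0.062, 0.062, 0.061, 0.063],
})

features = toy.drop(columns="diagnosis")
# Same z-score standardization as used for the violin plots.
standardized = (features - features.mean()) / features.std()

# Median of each standardized feature within each diagnosis group.
medians = standardized.groupby(toy["diagnosis"]).median()
# Absolute gap between the malignant and benign medians, per feature.
median_gap = (medians.loc["M"] - medians.loc["B"]).abs().sort_values(ascending=False)
print(median_gap)
```

A large gap (in standard deviations) suggests the feature separates the groups; a gap near zero suggests it does not.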
Violin plot displaying all the standard error features
The medians of the Malignant and Benign groups don’t differ much for most of
the standard error features above, except for concave points_se and
concavity_se. smoothness_se and symmetry_se have very similar
distributions for both groups, which could make classification using these
features difficult. The shape of the violin plot for area_se looks warped, and
the distributions of data points for benign and malignant are very different!
Violin plot displaying all the worst dimension features
area_worst looks well separated, so it might be easier to use this feature for
classification! Variance seems highest for fractal_dimension_worst.
concavity_worst and concave points_worst seem to have a similar data
distribution.
In order to check the correlation between the features, I plotted a correlation
matrix. It is effective in summarizing a large amount of data where the goal
is to see patterns.
import numpy as np

#correlation map
f, ax = plt.subplots(figsize=(18, 18))
# mask the upper triangle (including the diagonal) so each pair is shown once
matrix = np.triu(X.corr())
sns.heatmap(X.corr(), annot=True, linewidths=.5, fmt='.1f', ax=ax, mask=matrix)
Correlation heatmap of all the features
The means, std errors and worst dimension lengths of compactness,
concavity and concave points of tumors are highly correlated with
each other (correlation > 0.8). The means, std errors and worst dimensions
of radius, perimeter and area of tumors have correlations that round to 1!
texture_mean and texture_worst have a correlation of 0.9. area_worst and
area_mean have a correlation that rounds to 1.
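Reading pairs off a 30 x 30 heatmap is error-prone, so it can help to list them. Below is a sketch on synthetic data (the `toy` frame and the 0.8 cutoff are illustrative assumptions, not the real dataset) that lists every feature pair whose absolute correlation exceeds a threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.normal(14, 3, size=200)

# Synthetic features: perimeter and area are built from radius, texture is independent.
toy = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius + rng.normal(0, 0.5, size=200),
    "area_mean": np.pi * radius ** 2,
    "texture_mean": rng.normal(19, 4, size=200),
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()                      # (feature_a, feature_b) -> |correlation|
high_pairs = pairs[pairs > 0.8].sort_values(ascending=False)
print(high_pairs)
```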
By now we have a rough idea that many of the features are highly
correlated with each other. But what about the difference between the
benign and malignant groups for each feature? In order to understand if
there is a difference between the data distributions of the malignant and
benign groups, I visualized some features via box plots and performed a
t-test to check for statistical significance.
Box plots succinctly compare multiple distributions and are a great way to
visualize the IQR.
# create boxplots for texture mean vs diagnosis of tumor
plot = sns.boxplot(x='diagnosis', y='texture_mean', data=df, showfliers=False)
plot.set_title("Graph of texture mean vs diagnosis of tumor")
Comparing mean features for M and B groups
Texture means for malignant and benign tumors differ by about 3 units.
The distribution looks similar for both the groups. Malignant tumors tend
to have a higher texture mean compared to benign.
Fractal dimension means are almost the same for malignant and benign
tumors. The IQR is wider for malignant tumors.
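The medians and IQRs that a box plot draws can also be computed directly. A minimal sketch with pandas, using invented values for `texture_mean` rather than the real data:

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "diagnosis": ["M"] * 5 + ["B"] * 5,
    "texture_mean": [18.0, 20.0, 22.0, 24.0, 26.0, 15.0, 16.0, 17.0, 18.0, 19.0],
})

grouped = toy.groupby("diagnosis")["texture_mean"]
q1 = grouped.quantile(0.25)   # lower edge of the box
q3 = grouped.quantile(0.75)   # upper edge of the box
summary = pd.DataFrame({"median": grouped.median(), "IQR": q3 - q1})
print(summary)
```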
Comparing se features for M and B groups
Malignant groups have a distinctly wider range of values for area se. The
distribution range is very narrow for benign groups. This might be a good
feature for classification.
Standard error (se) of concave points has a higher mean and IQR for
malignant tumors. The distribution looks somewhat similar for both tumor
types.
Comparing worst dimension features for M and B groups
Malignant groups have a wider range of values for radius worst compared
to benign groups. The IQR is wider for the same. Malignant tumors have a
higher radius worst compared to benign groups.
Similar to area_se, area_worst has a very different data distribution for
malignant and benign tumors. Malignant tumors tend to have a higher
mean value and a wider IQR. Because of the noticeable differences
between B and M tumors, this could be a good feature for classification.
Box plots indicated a difference in means for most of the features
visualized above. But are these differences statistically significant? One
way to check for this is by a t test.
A t test tells you how significant the differences between groups
are; in other words, it lets you know if those differences (measured in means)
could have happened by chance.
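For intuition, the statistic behind a two-sample t test can be computed by hand. The sketch below implements the pooled-variance (Student's) form, which is what `scipy.stats.ttest_ind` uses by default; the helper name `pooled_t_statistic` and the sample values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances in both groups."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance, weighted by each group's degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    # Difference in means divided by its standard error.
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Invented measurements for two groups.
group_m = [1400.0, 1550.0, 1320.0, 1710.0]
group_b = [540.0, 610.0, 580.0, 495.0]
t_stat = pooled_t_statistic(group_m, group_b)
print(round(t_stat, 2))
```

For a given sample size, the larger the absolute value of the statistic, the smaller the p value, i.e. the less likely it is that the observed difference in means arose by chance.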
from scipy import stats

# make a new dataframe with only the desired feature for t test
new = pd.DataFrame(data=df[['area_worst', 'diagnosis']])
new = new.set_index('diagnosis')
stats.ttest_ind(new.loc['M'], new.loc['B'])
t test results for some features from the dataset
Except for fractal dimension mean, the difference in means between the M
and B groups is statistically significant for all the features in the table
above. For fractal dimension mean we fail to reject the null hypothesis,
meaning there is no evidence of a difference between the fractal dimension
means of M and B tumors.
From the correlation matrix we saw earlier, it was clear that there are quite
a few features with very high correlations. So I dropped one of the
features, from each of the feature pairs which had a correlation greater
than 0.95. ‘perimeter_mean’, ‘area_mean’, ‘perimeter_se’, ‘area_se’,
‘radius_worst’, ‘perimeter_worst’, ‘area_worst’ were amongst the features
that were dropped.
# Create correlation matrix
corr_matrix = X.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features
X = X.drop(columns=to_drop)
X.columns
Step 3: Machine Learning
We want to build a model which classifies tumors as benign or malignant. I
used sklearn’s Logistic Regression, Support Vector Classifier, Decision Tree
and Random Forest for this purpose.
But first, transform the categorical variable column (diagnosis) to a
numeric type. I used sklearn’s LabelEncoder for this purpose. The M and
B variables were changed to 1 and 0 by the label encoder.
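LabelEncoder assigns integers in sorted order of the class labels, which is why B becomes 0 and M becomes 1. A minimal sketch on a toy list of labels (the `labels` values are made up):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["M", "B", "B", "M", "B"]   # toy diagnosis values
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so index 0 is 'B' and index 1 is 'M'.
print(list(encoder.classes_))
print(list(encoded))
# inverse_transform maps the integers back to the original labels.
print(list(encoder.inverse_transform([0, 1])))
```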
Transform categorical variables
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
print(y)
Train Test Split the data
40% of the data was reserved for testing purposes. The split was
stratified in order to preserve the proportion of the target classes from the
original dataset in both the train and test datasets.
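The effect of stratification can be checked on labels alone. The sketch below builds a label array with the same class balance as the dataset (357 benign, 212 malignant) and a placeholder feature column; `X_toy` and `y_toy` are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class balance as the dataset: 357 benign (0), 212 malignant (1).
y_toy = np.array([0] * 357 + [1] * 212)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)   # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.40, stratify=y_toy, random_state=17
)

# Fraction of malignant cases overall, in the training set and in the test set.
print(round(y_toy.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With `stratify` set, the three fractions should be nearly identical; without it they can drift apart by chance.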
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, stratify=y, random_state=17)
Scale the features
sklearn’s RobustScaler was used to scale the features of the dataset. The
centering and scaling statistics of this scaler are based on percentiles and
are therefore not influenced by a small number of very large marginal
outliers.
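To see why percentile-based statistics help, here is a hand-rolled sketch of the same idea in NumPy: center on the median and divide by the interquartile range, which mirrors RobustScaler's default settings. The array `x` is a made-up feature with one extreme outlier.

```python
import numpy as np

# Toy feature with one extreme outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])

# Robust scaling: center on the median, divide by the interquartile range.
robust = (x - median) / (q3 - q1)
# Standard scaling for comparison: center on the mean, divide by the standard deviation.
standard = (x - x.mean()) / x.std()

print(robust)
print(standard)
```

In the robust version the four ordinary values keep a sensible spread around zero, while in the standard version the outlier inflates the mean and standard deviation and squeezes them together.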
#Feature Scaling
from sklearn.preprocessing import RobustScaler
sc = RobustScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Train the data
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Define a function which trains models
def models(X_train, y_train):
    #Using Logistic Regression
    log = LogisticRegression(random_state=0)
    log.fit(X_train, y_train)
    #Using SVC linear
    svc_lin = SVC(kernel='linear', random_state=0)
    svc_lin.fit(X_train, y_train)
    #Using SVC rbf
    svc_rbf = SVC(kernel='rbf', random_state=0)
    svc_rbf.fit(X_train, y_train)
    #Using DecisionTreeClassifier
    tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
    tree.fit(X_train, y_train)
    #Using Random Forest Classifier
    forest = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
    forest.fit(X_train, y_train)
    #print model accuracy on the training data.
    print('[0]Logistic Regression Training Accuracy:', log.score(X_train, y_train))
    print('[1]Support Vector Machine (Linear Classifier) Training Accuracy:', svc_lin.score(X_train, y_train))
    print('[2]Support Vector Machine (RBF Classifier) Training Accuracy:', svc_rbf.score(X_train, y_train))
    print('[3]Decision Tree Classifier Training Accuracy:', tree.score(X_train, y_train))
    print('[4]Random Forest Classifier Training Accuracy:', forest.score(X_train, y_train))
    return log, svc_lin, svc_rbf, tree, forest

#get the training results
model = models(X_train, y_train)
[0]Logistic Regression Training Accuracy: 0.9794721407624634
[1]Support Vector Machine (Linear Classifier) Training Accuracy: 0.9794721407624634
[2]Support Vector Machine (RBF Classifier) Training Accuracy: 0.9824046920821115
[3]Decision Tree Classifier Training Accuracy: 1.0
[4]Random Forest Classifier Training Accuracy: 0.9912023460410557
Confusion matrix
from sklearn.metrics import confusion_matrix
for i in range(len(model)):
    cm = confusion_matrix(y_test, model[i].predict(X_test))
    TN = cm[0][0]
    TP = cm[1][1]
    FN = cm[1][0]
    FP = cm[0][1]
    print(cm)
    print('Model[{}] Testing Accuracy = "{}"'.format(i, (TP + TN) / (TP + TN + FN + FP)))
    print()# Print a new line
[[142 1]
[ 2 83]]
Model[0] Testing Accuracy = "0.9868421052631579"
[[141 2]
[ 4 81]]
Model[1] Testing Accuracy = "0.9736842105263158"
[[141 2]
[ 3 82]]
Model[2] Testing Accuracy = "0.9780701754385965"
[[129 14]
[ 5 80]]
Model[3] Testing Accuracy = "0.9166666666666666"
[[139 4]
[ 6 79]]
Model[4] Testing Accuracy = "0.956140350877193"
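The same four counts also give recall and precision for the malignant class. A small worked example using the logistic regression matrix printed above (Model[0]):

```python
# Counts from the logistic regression confusion matrix above (Model[0]).
TN, FP = 142, 1
FN, TP = 2, 83

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)        # share of malignant tumors that were caught
precision = TP / (TP + FP)     # share of malignant predictions that were right

print(f"accuracy={accuracy:.4f} recall={recall:.4f} precision={precision:.4f}")
```

Recall is the number to watch here, since each false negative is a malignant tumor classified as benign.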
Classification Report
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
for i in range(len(model)):
    print('Model ', i)
    #Check precision, recall, f1-score
    print(classification_report(y_test, model[i].predict(X_test)))
    #Another way to get the models accuracy on the test data
    print(accuracy_score(y_test, model[i].predict(X_test)))
    print()#Print a new line
Model 0
precision recall f1-score support
0 0.99 0.99 0.99 143
1 0.99 0.98 0.98 85
accuracy 0.99 228
macro avg 0.99 0.98 0.99 228
weighted avg 0.99 0.99 0.99 228
0.9868421052631579
Model 1
precision recall f1-score support
0 0.97 0.99 0.98 143
1 0.98 0.95 0.96 85
accuracy 0.97 228
macro avg 0.97 0.97 0.97 228
weighted avg 0.97 0.97 0.97 228
0.9736842105263158
Model 2
precision recall f1-score support
0 0.98 0.99 0.98 143
1 0.98 0.96 0.97 85
accuracy 0.98 228
macro avg 0.98 0.98 0.98 228
weighted avg 0.98 0.98 0.98 228
0.9780701754385965
Model 3
precision recall f1-score support
0 0.96 0.90 0.93 143
1 0.85 0.94 0.89 85
accuracy 0.92 228
macro avg 0.91 0.92 0.91 228
weighted avg 0.92 0.92 0.92 228
0.9166666666666666
Model 4
precision recall f1-score support
0 0.96 0.97 0.97 143
1 0.95 0.93 0.94 85
accuracy 0.96 228
macro avg 0.96 0.95 0.95 228
weighted avg 0.96 0.96 0.96 228
0.956140350877193
Hyper parameter tuning
Hyperparameters are crucial as they control the overall behavior of a
machine learning model.
In the context of cancer classification, my goal was to minimize the
misclassifications for the positive class (i.e. when the tumor is malignant,
‘M’). But misclassifications include False Positives (FP) and False Negatives
(FN). I was focused more on reducing the FN because tumors which are
malignant should never be classified as benign, even if this means the
model might classify a few benign tumors as malignant! Therefore I used
sklearn’s fbeta_score as the scoring function with GridSearchCV. A
beta > 1 makes fbeta_score favor recall over precision.
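To make the effect of beta concrete, the F-beta score can be written out directly as (1 + beta²) · precision · recall / (beta² · precision + recall). The sketch below uses a hypothetical helper `f_beta` and two made-up models with mirrored precision and recall.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall; beta > 1 weights recall more."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Two hypothetical models with the same F1 but opposite strengths.
high_recall = f_beta(precision=0.80, recall=0.95, beta=2)
high_precision = f_beta(precision=0.95, recall=0.80, beta=2)

print(round(high_recall, 3), round(high_precision, 3))
```

With beta = 1 the two models would score the same; with beta = 2 the high-recall model scores higher, which is the behavior wanted when false negatives are the costlier error.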
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
#make the scoring function with a beta = 2
from sklearn.metrics import fbeta_score, make_scorer
ftwo_scorer = make_scorer(fbeta_score, beta=2)
# Create logistic regression
logistic = LogisticRegression()
# Create regularization penalty space
# note: the default lbfgs solver supports only 'l2'; pass solver='liblinear'
# to LogisticRegression to actually evaluate 'l1'
penalty = ['l1', 'l2']
# Create regularization hyperparameter space (C must be strictly positive)
C = np.arange(0.001, 1, 0.001)
# Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty)
# Create grid search using 5-fold cross validation
clf = GridSearchCV(logistic, hyperparameters, cv=5, scoring=ftwo_scorer, verbose=0)
# Fit grid search
best_model = clf.fit(X_train, y_train)
# View best hyperparameters
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])
Best Penalty: l2
Best C: 0.591
predictions = best_model.predict(X_test)
print("Accuracy score %f" % accuracy_score(y_test,
predictions))
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
Accuracy score 0.986842
precision recall f1-score support
0 0.99 0.99 0.99 143
1 0.99 0.98 0.98 85
accuracy 0.99 228
macro avg 0.99 0.98 0.99 228
weighted avg 0.99 0.99 0.99 228
[[142 1]
[ 2 83]]
After grid searching, the accuracy is unchanged at about 0.987 and the FNs
are still 2. Grid searching was done on SVC and Random Forest models too, but
the recall was best for logistic regression, which is why I am discussing
logistic regression in this post.
Custom Threshold to increase recall
The default threshold for interpreting probabilities to class labels is 0.5, and
tuning this hyperparameter is called threshold moving.
y_scores = best_model.predict_proba(X_test)[:, 1]
from sklearn.metrics import precision_recall_curve
p, r, thresholds = precision_recall_curve(y_test, y_scores)

def adjusted_classes(y_scores, t):
    # This function adjusts class predictions based on the prediction
    # threshold (t). Works only for binary classification problems.
    return [1 if y >= t else 0 for y in y_scores]

def precision_recall_threshold(p, r, thresholds, t=0.5):
    # generate new class predictions based on the adjusted_classes function
    # above and print the resulting confusion matrix and classification
    # report for the classifier's threshold (t).
    y_pred_adj = adjusted_classes(y_scores, t)
    print(pd.DataFrame(confusion_matrix(y_test, y_pred_adj),
                       columns=['pred_neg', 'pred_pos'],
                       index=['neg', 'pos']))
    print(classification_report(y_test, y_pred_adj))

precision_recall_threshold(p, r, thresholds, 0.42)
pred_neg pred_pos
neg 141 2
pos 1 84
precision recall f1-score support
0 0.99 0.99 0.99 143
1 0.98 0.99 0.98 85
accuracy 0.99 228
macro avg 0.98 0.99 0.99 228
weighted avg 0.99 0.99 0.99 228
Finally the FNs reduced to 1 after manually setting a decision threshold of
0.42, at the cost of one extra FP (2 instead of 1)!
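The trade-off behind threshold moving can be seen on a handful of made-up probabilities. In this sketch the labels, the scores and the helper `counts_at` are all invented for illustration; it counts false negatives and false positives at the default threshold and at 0.42.

```python
# Invented labels (1 = malignant) and predicted probabilities for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.95, 0.80, 0.55, 0.45, 0.48, 0.30, 0.10, 0.05]

def counts_at(threshold):
    """Return (false negatives, false positives) when predicting 1 for prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    return fn, fp

for threshold in (0.5, 0.42):
    fn, fp = counts_at(threshold)
    print(f"threshold={threshold}: FN={fn}, FP={fp}")
```

Lowering the threshold turns borderline cases into positive predictions, so false negatives can only go down and false positives can only go up.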
Graph of recall and precision VS threshold
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.figure(figsize=(8, 8))
    plt.title("Precision and Recall Scores as a function of the decision threshold")
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.axvline(x=.42, color='black')
    plt.text(.39, .50, 'Optimal Threshold for best Recall', rotation=90)
    plt.ylabel("Score")
    plt.xlabel("Decision Threshold")
    plt.legend(loc='best')

# use the same p, r, thresholds that were previously calculated
plot_precision_recall_vs_threshold(p, r, thresholds)
Graph of recall and precision scores VS thresholds
The line for optimal decision threshold indicates the point of maximum
recall which could be achieved without compromising a lot on precision.
After that point the precision starts to drop more.
from sklearn import metrics
from sklearn.metrics import roc_curve
# Compute predicted probabilities: y_pred_prob
y_pred_prob = best_model.predict_proba(X_test)[:, 1]
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
print(metrics.auc(fpr, tpr))
# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Logistic Regression')
plt.show()
AUC score is 0.9979432332373509
ROC Curve for Logistic Regression model
The AUC score for this model is 0.9979.
AUC score tells us how good our model is at distinguishing between classes,
in this case, predicting benign tumors as benign and malignant tumors as
malignant.
The ROC curve is plotted with TPR against the FPR where TPR is on y-axis
and FPR is on the x-axis. ROC curve looks almost ideal.
When the predicted probability distributions of the positive and negative
classes don’t overlap at all, the model has an ideal measure of separability,
i.e. it is able to correctly classify positives as positives and negatives as
negatives.
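Another way to read the AUC is as the probability that a randomly chosen malignant case gets a higher score than a randomly chosen benign one. The sketch below computes it that way on invented scores; `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs where the positive gets the higher score."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # A tie counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented scores: one malignant case (0.40) is ranked below one benign case (0.60).
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.90, 0.70, 0.40, 0.60, 0.20, 0.10]
print(pairwise_auc(y_true, y_score))
```

Here 8 of the 9 malignant-benign pairs are ordered correctly, so the value falls just short of 1; with fully separated scores it would be exactly 1.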
To conclude this post, I have discussed a few EDA, statistical analysis and
machine learning techniques as applied to the breast cancer classification
dataset. Complete code of this project can be found on GitHub.
The breast cancer classification dataset is good to get started with making a
complete Data Science project before you move on to more advanced
datasets and techniques.
Hope you guys found this post helpful and learnt something new too!
Follow Mugdha Paithankar for more stories. Please clap this article if you
like it!