Logistic Regression For Binary Classification With Core APIs - TensorFlow Core
This guide demonstrates how to use the TensorFlow Core low-level APIs
(https://www.tensorflow.org/guide/core) to perform binary classification
(https://developers.google.com/machine-learning/glossary#binary_classification) with logistic
regression (https://developers.google.com/machine-learning/crash-course/logistic-regression/). It uses
the Wisconsin Breast Cancer Dataset
(https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)) for tumor classification.
Setup
This tutorial uses pandas (https://pandas.pydata.org) for reading a CSV file into a DataFrame
(https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), seaborn
(https://seaborn.pydata.org) for plotting a pairwise relationship in a dataset, Scikit-learn
(https://scikit-learn.org/) for computing a confusion matrix, and matplotlib (https://matplotlib.org/)
for creating visualizations.
import tensorflow as tf
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
import sklearn.metrics as sk_metrics
import tempfile
import os
print(tf.__version__)
# To make the results reproducible, set the random seed value.
tf.random.set_seed(22)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'

# The raw file has no header row, so build the column names: 'id' and
# 'diagnosis', then the mean, standard error (ste), and largest value of
# each of the 10 measured features
features = ['radius', 'texture', 'perimeter', 'area', 'smoothness', 'compactness',
            'concavity', 'concave_points', 'symmetry', 'fractal_dimension']
column_names = ['id', 'diagnosis'] + [f'{f}_{stat}' for stat in ['mean', 'ste', 'largest'] for f in features]

dataset = pd.read_csv(url, names=column_names)
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
...
dataset.head()
(Output: the first five rows of the DataFrame, 5 rows × 32 columns.)
Split the dataset into training and test sets using pandas.DataFrame.sample
(https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html),
pandas.DataFrame.drop
(https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) and
pandas.DataFrame.iloc (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html).
Make sure to split the features from the target labels. The test set is used to evaluate your
model's generalizability to unseen data.
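The sampling call itself is not shown in this extract. A minimal sketch consistent with the 427/142 row counts below (frac=0.75 of 569 rows yields 427; the random_state value is an assumed seed for reproducibility):

# Sample 75% of the rows at random for the training set
train_dataset = dataset.sample(frac=0.75, random_state=1)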
len(train_dataset)
427
test_dataset = dataset.drop(train_dataset.index)
len(test_dataset)
142
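The feature/label split is also missing from this extract. A sketch, assuming the diagnosis column is mapped to a binary target ('M' → 1 for malignant, 'B' → 0 for benign) and everything is converted to float32 tensors for TensorFlow:

# The `id` column is dropped since each row is unique and it has no
# predictive value; `diagnosis` becomes the target label
x_train, y_train = train_dataset.iloc[:, 2:], train_dataset['diagnosis']
x_test, y_test = test_dataset.iloc[:, 2:], test_dataset['diagnosis']

# Map the categorical diagnosis to a binary label
y_train, y_test = y_train.map({'B': 0, 'M': 1}), y_test.map({'B': 0, 'M': 1})

# Convert to float32 tensors
x_train, y_train = tf.convert_to_tensor(x_train, tf.float32), tf.convert_to_tensor(y_train, tf.float32)
x_test, y_test = tf.convert_to_tensor(x_test, tf.float32), tf.convert_to_tensor(y_test, tf.float32)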
Make sure to also check the overall statistics. Note how each feature covers a vastly different
range of values.
train_dataset.describe().transpose()[:10]
      count  mean          std           min     25%       50%
id    427.0  2.756014e+07  1.162735e+08  8670.0  865427.5  905539.0
...
Given the inconsistent ranges, it is beneficial to standardize the data such that each feature
has a zero mean and unit variance. This process is called normalization
(https://developers.google.com/machine-learning/glossary#normalization).
class Normalize(tf.Module):
  def __init__(self, x):
    # Initialize the mean and standard deviation for normalization
    self.mean = tf.Variable(tf.math.reduce_mean(x, axis=0))
    self.std = tf.Variable(tf.math.reduce_std(x, axis=0))

  def norm(self, x):
    # Standardize the input: subtract the mean and divide by the std
    return (x - self.mean) / self.std
norm_x = Normalize(x_train)
x_train_norm, x_test_norm = norm_x.norm(x_train), norm_x.norm(x_test)
Logistic regression
Before building a logistic regression model, it is crucial to understand the method's differences
compared to traditional linear regression.
Logistic regression maps the continuous outputs of traditional linear regression, (-∞, ∞), to
probabilities, (0, 1). This transformation is also symmetric so that flipping the sign of the
linear output results in the inverse of the original probability.
Let $Y$ denote the probability of being in class 1 (the tumor is malignant). The desired mapping
can be achieved by interpreting the linear regression output as the log odds
(https://developers.google.com/machine-learning/glossary#log-odds) ratio of being in class 1 as
opposed to class 0:

$$\ln\left(\frac{Y}{1-Y}\right) = wX + b$$

Setting $wX + b = z$, this equation can then be solved for $Y$:

$$Y = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}$$

The expression $\frac{1}{1 + e^{-z}}$ is known as the sigmoid function, $\sigma(z)$.
The dataset in this tutorial deals with a high-dimensional feature matrix. Therefore, the above
equation must be rewritten in a matrix vector form as follows:

$$\ln\left(\frac{\hat{y}}{1-\hat{y}}\right) = \mathrm{X}\mathbf{w} + b$$

where:

$\hat{y}$ ($m \times 1$): a target vector
$\mathrm{X}$ ($m \times n$): a feature matrix
$\mathbf{w}$ ($n \times 1$): a weight vector
$b$: a bias
Start by visualizing the sigmoid function, which transforms the linear output, (-∞, ∞), to fall
between 0 and 1. The sigmoid function is available in tf.math.sigmoid.
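A quick sketch of that visualization, using the matplotlib setup from above:

x = tf.linspace(-10.0, 10.0, 500)
plt.plot(x, tf.math.sigmoid(x))
plt.ylim((-0.1, 1.1))
plt.title("Sigmoid function")
plt.show()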
The log loss, or binary cross-entropy loss, is the ideal loss function for a binary classification
problem with logistic regression. For each example, the log loss quantifies the similarity between
a predicted probability and the example's true value. It is determined by the following equation:

$$L = -\frac{1}{m}\sum_{i=1}^{m}\left(y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i)\right)$$

where:

$\hat{y}$: a vector of predicted probabilities
$y$: a vector of true targets

In the above equation for the log loss, recall that each $\hat{y}_i$ can be rewritten in terms of the
inputs as $\sigma(\mathrm{X}_i\mathbf{w} + b)$.
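In code, a sketch of this loss using tf.nn.sigmoid_cross_entropy_with_logits, which fuses the sigmoid activation with the cross-entropy for numerical stability (so the model passes in raw logits):

def log_loss(y_pred, y):
  # `y_pred` holds logits; the sigmoid is applied inside the fused op
  ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=y_pred)
  return tf.reduce_mean(ce)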
class LogisticRegression(tf.Module):

  def __init__(self):
    self.built = False

  def __call__(self, x, train=True):
    # Initialize the model parameters on the first call
    if not self.built:
      # Randomly generate the weights and the bias term
      rand_w = tf.random.uniform(shape=[x.shape[-1], 1], seed=22)
      rand_b = tf.random.uniform(shape=[], seed=22)
      self.w = tf.Variable(rand_w)
      self.b = tf.Variable(rand_b)
      self.built = True
    # Return logits during training (for the fused log loss above) and
    # sigmoid probabilities otherwise
    z = tf.squeeze(tf.matmul(x, self.w) + self.b, axis=1)
    return z if train else tf.math.sigmoid(z)
To validate, make sure the untrained model outputs values in the range of (0, 1) for a small
subset of the training data.
log_reg = LogisticRegression()
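For instance (a sketch; train=False makes the model apply the sigmoid as defined above):

y_pred = log_reg(x_train_norm[:5], train=False)
print(y_pred.numpy())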
Next, write an accuracy function to calculate the proportion of correct classifications during
training. To retrieve class predictions from the predicted probabilities, set a threshold above
which all probabilities are assigned to class 1. This configurable hyperparameter defaults to 0.5.
def predict_class(y_pred, thresh=0.5):
  # Return a tensor with `1` if `y_pred` > `thresh`, and `0` otherwise
  return tf.cast(y_pred > thresh, tf.float32)
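The accuracy function itself does not appear in this extract. A minimal sketch, assuming the model emits logits during training as defined above:

def accuracy(y_pred, y, thresh=0.5):
  # Map logits to probabilities, threshold them, and compare with the labels
  y_prob = tf.math.sigmoid(y_pred)
  y_pred_class = predict_class(y_prob, thresh)
  return tf.reduce_mean(tf.cast(y_pred_class == y, tf.float32))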
batch_size = 64
train_dataset = tf.data.Dataset.from_tensor_slices((x_train_norm, y_train))
train_dataset = train_dataset.shuffle(buffer_size=x_train.shape[0]).batch(batch_size)
test_dataset = tf.data.Dataset.from_tensor_slices((x_test_norm, y_test))
test_dataset = test_dataset.shuffle(buffer_size=x_test.shape[0]).batch(batch_size)
Now write a training loop for the logistic regression model. The loop utilizes the log loss
function and its gradients with respect to the model's parameters in order to iteratively update
them.
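A sketch of such a loop with plain gradient descent (epochs=200 and learning_rate=0.01 are assumed values; log_loss and accuracy are the helpers sketched above):

# Set the training hyperparameters (assumed values)
epochs = 200
learning_rate = 0.01
train_losses, test_losses = [], []
train_accs, test_accs = [], []

for epoch in range(epochs):
  batch_losses_train, batch_accs_train = [], []
  batch_losses_test, batch_accs_test = [], []

  # Iterate over the training data: forward pass, loss, gradient step
  for x_batch, y_batch in train_dataset:
    with tf.GradientTape() as tape:
      y_pred_batch = log_reg(x_batch)
      batch_loss = log_loss(y_pred_batch, y_batch)
    batch_acc = accuracy(y_pred_batch, y_batch)
    # Update the parameters with plain gradient descent
    grads = tape.gradient(batch_loss, log_reg.variables)
    for g, v in zip(grads, log_reg.variables):
      v.assign_sub(learning_rate * g)
    batch_losses_train.append(batch_loss)
    batch_accs_train.append(batch_acc)

  # Iterate over the testing data for evaluation only
  for x_batch, y_batch in test_dataset:
    y_pred_batch = log_reg(x_batch)
    batch_losses_test.append(log_loss(y_pred_batch, y_batch))
    batch_accs_test.append(accuracy(y_pred_batch, y_batch))

  # Keep track of epoch-level model performance
  train_losses.append(tf.reduce_mean(batch_losses_train))
  train_accs.append(tf.reduce_mean(batch_accs_train))
  test_losses.append(tf.reduce_mean(batch_losses_test))
  test_accs.append(tf.reduce_mean(batch_accs_test))
  if epoch % 20 == 0:
    print(f"Epoch: {epoch}, training log loss: {float(train_losses[-1]):.3f}")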
Performance evaluation
Observe the changes in your model's loss and accuracy over time.
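For example, with the per-epoch histories recorded by the loop above:

plt.plot(range(epochs), train_losses, label="Training loss")
plt.plot(range(epochs), test_losses, label="Testing loss")
plt.xlabel("Epoch")
plt.ylabel("Log loss")
plt.legend()
plt.title("Log loss vs training iterations")
plt.show()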
The model demonstrates a high accuracy and a low loss when it comes to classifying tumors
in the training dataset and also generalizes well to the unseen test data. To go one step further,
you can explore error rates that give more insight beyond the overall accuracy score. The two
most popular error rates for binary classification problems are the false positive rate (FPR) and
the false negative rate (FNR).
For this problem, the FPR is the proportion of malignant tumor predictions amongst tumors
that are actually benign. Conversely, the FNR is the proportion of benign tumor predictions
among tumors that are actually malignant.
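Both rates can be read off a confusion matrix. A sketch using the Scikit-learn and seaborn utilities imported in Setup (the helper name show_confusion_matrix is illustrative, not part of either library):

def show_confusion_matrix(y, y_classes, typ):
  # Compute the confusion matrix and normalize it over the true classes
  confusion = sk_metrics.confusion_matrix(y.numpy(), y_classes.numpy())
  confusion_normalized = confusion / confusion.sum(axis=1, keepdims=True)
  sns.heatmap(confusion_normalized, cmap='Blues', annot=True, fmt='.4f', square=True)
  plt.title(f"Confusion matrix: {typ}")
  plt.ylabel("True label")
  plt.xlabel("Predicted label")

y_pred_test = log_reg(x_test_norm, train=False)
show_confusion_matrix(y_test, predict_class(y_pred_test), 'Testing')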
In order to control for the FPR and FNR, try changing the threshold hyperparameter before
classifying the probability predictions. A lower threshold increases the model's overall chances
of making a malignant tumor classification. This inevitably increases the number of false
positives and the FPR but it also helps to decrease the number of false negatives and the FNR.
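A quick sketch of that trade-off, reusing y_pred_test from the previous snippet (the threshold values are arbitrary):

for thresh in [0.25, 0.5, 0.75]:
  test_classes = predict_class(y_pred_test, thresh)
  # Unpack the binary confusion matrix into its four counts
  tn, fp, fn, tp = sk_metrics.confusion_matrix(y_test.numpy(), test_classes.numpy()).ravel()
  print(f"threshold={thresh}: FPR={fp / (fp + tn):.3f}, FNR={fn / (fn + tp):.3f}")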
Save the model

Start by creating an export module that takes in raw data and performs the following operations:

Normalization
Probability prediction
Class prediction
class ExportModule(tf.Module):
  def __init__(self, model, norm_x, class_pred):
    # Initialize pre- and post-processing functions
    self.model = model
    self.norm_x = norm_x
    self.class_pred = class_pred

  @tf.function(input_signature=[tf.TensorSpec(shape=[None, None], dtype=tf.float32)])
  def __call__(self, x):
    # Run the module on raw data: normalize, predict, then classify
    x = self.norm_x.norm(x)
    y = self.model(x, train=False)
    return self.class_pred(y)
log_reg_export = ExportModule(model=log_reg,
norm_x=norm_x,
class_pred=predict_class)
If you want to save the model at its current state, you can do so with the
tf.saved_model.save (https://www.tensorflow.org/api_docs/python/tf/saved_model/save) function.
To load a saved model and make predictions, use the tf.saved_model.load
(https://www.tensorflow.org/api_docs/python/tf/saved_model/load) function.
models = tempfile.mkdtemp()
save_path = os.path.join(models, 'log_reg_export')
tf.saved_model.save(log_reg_export, save_path)
log_reg_loaded = tf.saved_model.load(save_path)
test_preds = log_reg_loaded(x_test)
test_preds[:10].numpy()
array([1., 1., 1., 1., 0., 1., 1., 1., 1., 1.], dtype=float32)
Conclusion
This notebook introduced a few techniques to handle a logistic regression problem. Here are a
few more tips that may help:
Analyzing error rates is a great way to gain more insight about a classification model's
performance beyond its overall accuracy score.
Overfitting is another common problem for logistic regression models, though it wasn't a
problem for this tutorial. Visit the Overfit and underfit
(https://www.tensorflow.org/tutorials/keras/overfit_and_underfit) tutorial for more help with this.
For more examples of using the TensorFlow Core APIs, check out the guide
(https://www.tensorflow.org/guide/core). If you want to learn more about loading and preparing
data, see the tutorials on image data loading (https://www.tensorflow.org/load_data/images) or CSV
data loading (https://www.tensorflow.org/load_data/csv).
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License
(https://creativecommons.org/licenses/by/4.0/), and code samples are licensed under the Apache 2.0 License
(https://www.apache.org/licenses/LICENSE-2.0). For details, see the Google Developers Site Policies
(https://developers.google.com/site-policies). Java is a registered trademark of Oracle and/or its affiliates.