0% found this document useful (0 votes)

2 views

04_training_linear_models

Chapter 4 focuses on training models using Python, specifically covering linear regression, gradient descent methods, polynomial regression, and learning curves. It provides sample code and exercises to illustrate concepts such as the Normal Equation, Batch Gradient Descent, Stochastic Gradient Descent, and Mini-batch Gradient Descent. The chapter also includes visualizations to demonstrate model predictions and performance metrics.

Uploaded by

aagamkasliwal

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

2 views

04_training_linear_models

Uploaded by

aagamkasliwal

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 35

Chapter 4 – Training Models

This notebook contains all the sample code and solutions to the exercises in
chapter 4.

Open in Colab Open in Kaggle

Setup
This project requires Python 3.7 or above:

In [1]: import sys

assert sys.version_info >= (3, 7)

It also requires Scikit-Learn ≥ 1.0.1:

In [2]: from packaging import version

import sklearn

assert version.parse(sklearn.version) >= version.parse("1.0.1")

As we did in previous chapters, let's define the default font sizes to make the
figures prettier:

In [3]: import matplotlib.pyplot as plt

plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

And let's create the images/training_linear_models folder (if it doesn't

already exist), and define the save_fig() function which is used through this
notebook to save the figures in high-res for the book:
In [4]: from pathlib import Path

IMAGES_PATH = Path() / "images" / "training_linear_models"

IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300)

path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
if tight_layout:
plt.tight_layout()
plt.savefig(path, format=fig_extension, dpi=resolution)

Linear Regression

The Normal Equation

In [5]: import numpy as np

np.random.seed(42) # to make this code example reproducible

m = 100 # number of instances
X = 2 * np.random.rand(m, 1) # column vector
y = 4 + 3 * X + np.random.randn(m, 1) # column vector

In [6]: # extra code – generates and saves Figure 4–1

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
plt.plot(X, y, "b.")
plt.xlabel("$x_1$")
plt.ylabel("$y$", rotation=0)
plt.axis([0, 2, 0, 15])
plt.grid()
save_fig("generated_data_plot")
plt.show()
In [7]: from sklearn.preprocessing import add_dummy_feature

X_b = add_dummy_feature(X) # add x0 = 1 to each instance

theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

In [8]: theta_best

Out[8]: array([[4.21509616],
[2.77011339]])

In [9]: X_new = np.array([[0], [2]])

X_new_b = add_dummy_feature(X_new) # add x0 = 1 to each instance
y_predict = X_new_b @ theta_best
y_predict

Out[9]: array([[4.21509616],
[9.75532293]])

In [10]: import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4)) # extra code – not needed, just formatting

plt.plot(X_new, y_predict, "r-", label="Predictions")
plt.plot(X, y, "b.")

# extra code – beautifies and saves Figure 4–2

plt.xlabel("$x_1$")
plt.ylabel("$y$", rotation=0)
plt.axis([0, 2, 0, 15])
plt.grid()
plt.legend(loc="upper left")
save_fig("linear_model_predictions_plot")

plt.show()

In [11]: from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_

Out[11]: (array([4.21509616]), array([[2.77011339]]))

In [12]: lin_reg.predict(X_new)

Out[12]: array([[4.21509616],
[9.75532293]])

The LinearRegression class is based on the scipy.linalg.lstsq() function

(the name stands for "least squares"), which you could call directly:

In [13]: theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)

theta_best_svd

Out[13]: array([[4.21509616],
[2.77011339]])
+ +
This function computes X y, where X is the pseudoinverse of X (specifically
the Moore-Penrose inverse). You can use np.linalg.pinv() to compute the
pseudoinverse directly:

In [14]: np.linalg.pinv(X_b) @ y

Out[14]: array([[4.21509616],
[2.77011339]])

Gradient Descent
Batch Gradient Descent
In [15]: eta = 0.1 # learning rate
n_epochs = 1000
m = len(X_b) # number of instances

np.random.seed(42)
theta = np.random.randn(2, 1) # randomly initialized model parameters

for epoch in range(n_epochs):

gradients = 2 / m * X_b.T @ (X_b @ theta - y)
theta = theta - eta * gradients

The trained model parameters:

In [16]: theta

Out[16]: array([[4.21509616],
[2.77011339]])
In [17]: # extra code – generates and saves Figure 4–8

import matplotlib as mpl

def plot_gradient_descent(theta, eta):

m = len(X_b)
plt.plot(X, y, "b.")
n_epochs = 1000
n_shown = 20
theta_path = []
for epoch in range(n_epochs):
if epoch < n_shown:
y_predict = X_new_b @ theta
color = mpl.colors.rgb2hex(plt.cm.OrRd(epoch / n_shown + 0.15))
plt.plot(X_new, y_predict, linestyle="solid", color=color)
gradients = 2 / m * X_b.T @ (X_b @ theta - y)
theta = theta - eta * gradients
theta_path.append(theta)
plt.xlabel("$x_1$")
plt.axis([0, 2, 0, 15])
plt.grid()
plt.title(fr"$\eta = {eta}$")
return theta_path

np.random.seed(42)
theta = np.random.randn(2, 1) # random initialization

plt.figure(figsize=(10, 4))
plt.subplot(131)
plot_gradient_descent(theta, eta=0.02)
plt.ylabel("$y$", rotation=0)
plt.subplot(132)
theta_path_bgd = plot_gradient_descent(theta, eta=0.1)
plt.gca().axes.yaxis.set_ticklabels([])
plt.subplot(133)
plt.gca().axes.yaxis.set_ticklabels([])
plot_gradient_descent(theta, eta=0.5)
save_fig("gradient_descent_plot")
plt.show()
Stochastic Gradient Descent
In [18]: theta_path_sgd = [] # extra code – we need to store the path of theta in th
# parameter space to plot the next figure

In [19]: n_epochs = 50
t0, t1 = 5, 50 # learning schedule hyperparameters

def learning_schedule(t):
return t0 / (t + t1)

np.random.seed(42)
theta = np.random.randn(2, 1) # random initialization

n_shown = 20 # extra code – just needed to generate the figure below

plt.figure(figsize=(6, 4)) # extra code – not needed, just formatting

for epoch in range(n_epochs):

for iteration in range(m):

# extra code – these 4 lines are used to generate the figure

if epoch == 0 and iteration < n_shown:
y_predict = X_new_b @ theta
color = mpl.colors.rgb2hex(plt.cm.OrRd(iteration / n_shown + 0.1
plt.plot(X_new, y_predict, color=color)

random_index = np.random.randint(m)
xi = X_b[random_index : random_index + 1]
yi = y[random_index : random_index + 1]
gradients = 2 * xi.T @ (xi @ theta - yi) # for SGD, do not divide b
eta = learning_schedule(epoch * m + iteration)
theta = theta - eta * gradients
theta_path_sgd.append(theta) # extra code – to generate the figure

# extra code – this section beautifies and saves Figure 4–10

plt.plot(X, y, "b.")
plt.xlabel("$x_1$")
plt.ylabel("$y$", rotation=0)
plt.axis([0, 2, 0, 15])
plt.grid()
save_fig("sgd_plot")
plt.show()
In [20]: theta

Out[20]: array([[4.21076011],
[2.74856079]])

In [21]: from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-5, penalty=None, eta0=0.01,

n_iter_no_change=100, random_state=42)
sgd_reg.fit(X, y.ravel()) # y.ravel() because fit() expects 1D targets

Out[21]: SGDRegressor(n_iter_no_change=100, penalty=None, random_state=42, tol=1e-0

In [22]: sgd_reg.intercept_, sgd_reg.coef_

Out[22]: (array([4.21278812]), array([2.77270267]))

Mini-batch gradient descent

The code in this section is used to generate the next figure, it is not in the book.

In [23]: # extra code – this cell generates and saves Figure 4–11

from math import ceil

n_epochs = 50
minibatch_size = 20
n_batches_per_epoch = ceil(m / minibatch_size)

np.random.seed(42)
theta = np.random.randn(2, 1) # random initialization

t0, t1 = 200, 1000 # learning schedule hyperparameters

def learning_schedule(t):
return t0 / (t + t1)
theta_path_mgd = []
for epoch in range(n_epochs):
shuffled_indices = np.random.permutation(m)
X_b_shuffled = X_b[shuffled_indices]
y_shuffled = y[shuffled_indices]
for iteration in range(0, n_batches_per_epoch):
idx = iteration * minibatch_size
xi = X_b_shuffled[idx : idx + minibatch_size]
yi = y_shuffled[idx : idx + minibatch_size]
gradients = 2 / minibatch_size * xi.T @ (xi @ theta - yi)
eta = learning_schedule(iteration)
theta = theta - eta * gradients
theta_path_mgd.append(theta)

theta_path_bgd = np.array(theta_path_bgd)
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)

plt.figure(figsize=(7, 4))
plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1,
label="Stochastic")
plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "g-+", linewidth=2,
label="Mini-batch")
plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "b-o", linewidth=3,
label="Batch")
plt.legend(loc="upper left")
plt.xlabel(r"$\theta_0$")
plt.ylabel(r"$\theta_1$ ", rotation=0)
plt.axis([2.6, 4.6, 2.3, 3.4])
plt.grid()
save_fig("gradient_descent_paths_plot")
plt.show()

Polynomial Regression
In [24]: np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(m, 1)

In [25]: # extra code – this cell generates and saves Figure 4–12
plt.figure(figsize=(6, 4))
plt.plot(X, y, "b.")
plt.xlabel("$x_1$")
plt.ylabel("$y$", rotation=0)
plt.axis([-3, 3, 0, 10])
plt.grid()
save_fig("quadratic_data_plot")
plt.show()

In [26]: from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=2, include_bias=False)

X_poly = poly_features.fit_transform(X)
X[0]

Out[26]: array([-0.75275929])

In [27]: X_poly[0]

Out[27]: array([-0.75275929, 0.56664654])

In [28]: lin_reg = LinearRegression()

lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_

Out[28]: (array([1.78134581]), array([[0.93366893, 0.56456263]]))

In [29]: # extra code – this cell generates and saves Figure 4–13

X_new = np.linspace(-3, 3, 100).reshape(100, 1)

X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)

plt.figure(figsize=(6, 4))
plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$")
plt.ylabel("$y$", rotation=0)
plt.legend(loc="upper left")
plt.axis([-3, 3, 0, 10])
plt.grid()
save_fig("quadratic_predictions_plot")
plt.show()

In [30]: # extra code – this cell generates and saves Figure 4–14

from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import make_pipeline

plt.figure(figsize=(6, 4))

for style, width, degree in (("r-+", 2, 1), ("b--", 2, 2), ("g-", 1, 300)):
polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
std_scaler = StandardScaler()
lin_reg = LinearRegression()
polynomial_regression = make_pipeline(polybig_features, std_scaler, lin_
polynomial_regression.fit(X, y)
y_newbig = polynomial_regression.predict(X_new)
label = f"{degree} degree{'s' if degree > 1 else ''}"
plt.plot(X_new, y_newbig, style, label=label, linewidth=width)

plt.plot(X, y, "b.", linewidth=3)

plt.legend(loc="upper left")
plt.xlabel("$x_1$")
plt.ylabel("$y$", rotation=0)
plt.axis([-3, 3, 0, 10])
plt.grid()
save_fig("high_degree_polynomials_plot")
plt.show()
Learning Curves
In [31]: from sklearn.model_selection import learning_curve

train_sizes, train_scores, valid_scores = learning_curve(

LinearRegression(), X, y, train_sizes=np.linspace(0.01, 1.0, 40), cv=5,
scoring="neg_root_mean_squared_error")
train_errors = -train_scores.mean(axis=1)
valid_errors = -valid_scores.mean(axis=1)

plt.figure(figsize=(6, 4)) # extra code – not needed, just formatting

plt.plot(train_sizes, train_errors, "r-+", linewidth=2, label="train")
plt.plot(train_sizes, valid_errors, "b-", linewidth=3, label="valid")

# extra code – beautifies and saves Figure 4–15

plt.xlabel("Training set size")
plt.ylabel("RMSE")
plt.grid()
plt.legend(loc="upper right")
plt.axis([0, 80, 0, 2.5])
save_fig("underfitting_learning_curves_plot")

plt.show()
In [32]: from sklearn.pipeline import make_pipeline

polynomial_regression = make_pipeline(
PolynomialFeatures(degree=10, include_bias=False),
LinearRegression())

train_sizes, train_scores, valid_scores = learning_curve(

polynomial_regression, X, y, train_sizes=np.linspace(0.01, 1.0, 40), cv=
scoring="neg_root_mean_squared_error")

In [33]: # extra code – generates and saves Figure 4–16

train_errors = -train_scores.mean(axis=1)
valid_errors = -valid_scores.mean(axis=1)

plt.figure(figsize=(6, 4))
plt.plot(train_sizes, train_errors, "r-+", linewidth=2, label="train")
plt.plot(train_sizes, valid_errors, "b-", linewidth=3, label="valid")
plt.legend(loc="upper right")
plt.xlabel("Training set size")
plt.ylabel("RMSE")
plt.grid()
plt.axis([0, 80, 0, 2.5])
save_fig("learning_curves_plot")
plt.show()
Regularized Linear Models

Ridge Regression
Let's generate a very small and noisy linear dataset:

In [34]: # extra code – we've done this type of generation several times before
np.random.seed(42)
m = 20
X = 3 * np.random.rand(m, 1)
y = 1 + 0.5 * X + np.random.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)

In [35]: # extra code – a quick peek at the dataset we just generated

plt.figure(figsize=(6, 4))
plt.plot(X, y, ".")
plt.xlabel("$x_1$")
plt.ylabel("$y$ ", rotation=0)
plt.axis([0, 3, 0, 3.5])
plt.grid()
plt.show()
In [36]: from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=0.1, solver="cholesky")

ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])

Out[36]: array([[1.55325833]])

In [37]: # extra code – this cell generates and saves Figure 4–17

def plot_model(model_class, polynomial, alphas, **model_kwargs):

plt.plot(X, y, "b.", linewidth=3)
for alpha, style in zip(alphas, ("b:", "g--", "r-")):
if alpha > 0:
model = model_class(alpha, **model_kwargs)
else:
model = LinearRegression()
if polynomial:
model = make_pipeline(
PolynomialFeatures(degree=10, include_bias=False),
StandardScaler(),
model)
model.fit(X, y)
y_new_regul = model.predict(X_new)
plt.plot(X_new, y_new_regul, style, linewidth=2,
label=fr"$\alpha = {alpha}$")
plt.legend(loc="upper left")
plt.xlabel("$x_1$")
plt.axis([0, 3, 0, 3.5])
plt.grid()

plt.figure(figsize=(9, 3.5))
plt.subplot(121)
plot_model(Ridge, polynomial=False, alphas=(0, 10, 100), random_state=42)
plt.ylabel("$y$ ", rotation=0)
plt.subplot(122)
plot_model(Ridge, polynomial=True, alphas=(0, 10**-5, 1), random_state=42)
plt.gca().axes.yaxis.set_ticklabels([])
save_fig("ridge_regression_plot")
plt.show()

In [38]: sgd_reg = SGDRegressor(penalty="l2", alpha=0.1 / m, tol=None,

max_iter=1000, eta0=0.01, random_state=42)
sgd_reg.fit(X, y.ravel()) # y.ravel() because fit() expects 1D targets
sgd_reg.predict([[1.5]])

Out[38]: array([1.55302613])

In [39]: # extra code – show that we get roughly the same solution as earlier when
# we use Stochastic Average GD (solver="sag")
ridge_reg = Ridge(alpha=0.1, solver="sag", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])

Out[39]: array([[1.55321535]])

In [40]: # extra code – shows the closed form solution of Ridge regression,
# compare with the next Ridge model's learned parameters below
alpha = 0.1
A = np.array([[0., 0.], [0., 1.]])
X_b = np.c_[np.ones(m), X]
np.linalg.inv(X_b.T @ X_b + alpha * A) @ X_b.T @ y

Out[40]: array([[0.97898394],
[0.3828496 ]])

In [41]: ridge_reg.intercept_, ridge_reg.coef_ # extra code

Out[41]: (array([0.97944909]), array([[0.38251084]]))

Lasso Regression
In [42]: from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])
Out[42]: array([1.53788174])

In [43]: # extra code – this cell generates and saves Figure 4–18
plt.figure(figsize=(9, 3.5))
plt.subplot(121)
plot_model(Lasso, polynomial=False, alphas=(0, 0.1, 1), random_state=42)
plt.ylabel("$y$ ", rotation=0)
plt.subplot(122)
plot_model(Lasso, polynomial=True, alphas=(0, 1e-2, 1), random_state=42)
plt.gca().axes.yaxis.set_ticklabels([])
save_fig("lasso_regression_plot")
plt.show()

In [44]: # extra code – this BIG cell generates and saves Figure 4–19

t1a, t1b, t2a, t2b = -1, 3, -1.5, 1.5

t1s = np.linspace(t1a, t1b, 500)

t2s = np.linspace(t2a, t2b, 500)
t1, t2 = np.meshgrid(t1s, t2s)
T = np.c_[t1.ravel(), t2.ravel()]
Xr = np.array([[1, 1], [1, -1], [1, 0.5]])
yr = 2 * Xr[:, :1] + 0.5 * Xr[:, 1:]

J = (1 / len(Xr) * ((T @ Xr.T - yr.T) ** 2).sum(axis=1)).reshape(t1.shape)

N1 = np.linalg.norm(T, ord=1, axis=1).reshape(t1.shape)

N2 = np.linalg.norm(T, ord=2, axis=1).reshape(t1.shape)

t_min_idx = np.unravel_index(J.argmin(), J.shape)

t1_min, t2_min = t1[t_min_idx], t2[t_min_idx]

t_init = np.array([[0.25], [-1]])

def bgd_path(theta, X, y, l1, l2, core=1, eta=0.05, n_iterations=200):

path = [theta]
for iteration in range(n_iterations):
gradients = (core * 2 / len(X) * X.T @ (X @ theta - y)
+ l1 * np.sign(theta) + l2 * theta)
theta = theta - eta * gradients
path.append(theta)
return np.array(path)

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(10.1, 8))

for i, N, l1, l2, title in ((0, N1, 2.0, 0, "Lasso"), (1, N2, 0, 2.0, "Ridge
JR = J + l1 * N1 + l2 * 0.5 * N2 ** 2

tr_min_idx = np.unravel_index(JR.argmin(), JR.shape)

t1r_min, t2r_min = t1[tr_min_idx], t2[tr_min_idx]

levels = np.exp(np.linspace(0, 1, 20)) - 1

levelsJ = levels * (J.max() - J.min()) + J.min()
levelsJR = levels * (JR.max() - JR.min()) + JR.min()
levelsN = np.linspace(0, N.max(), 10)

path_J = bgd_path(t_init, Xr, yr, l1=0, l2=0)

path_JR = bgd_path(t_init, Xr, yr, l1, l2)
path_N = bgd_path(theta=np.array([[2.0], [0.5]]), X=Xr, y=yr,
l1=np.sign(l1) / 3, l2=np.sign(l2), core=0)
ax = axes[i, 0]
ax.grid()
ax.axhline(y=0, color="k")
ax.axvline(x=0, color="k")
ax.contourf(t1, t2, N / 2.0, levels=levelsN)
ax.plot(path_N[:, 0], path_N[:, 1], "y--")
ax.plot(0, 0, "ys")
ax.plot(t1_min, t2_min, "ys")
ax.set_title(fr"$\ell_{i + 1}$ penalty")
ax.axis([t1a, t1b, t2a, t2b])
if i == 1:
ax.set_xlabel(r"$\theta_1$")
ax.set_ylabel(r"$\theta_2$", rotation=0)

ax = axes[i, 1]
ax.grid()
ax.axhline(y=0, color="k")
ax.axvline(x=0, color="k")
ax.contourf(t1, t2, JR, levels=levelsJR, alpha=0.9)
ax.plot(path_JR[:, 0], path_JR[:, 1], "w-o")
ax.plot(path_N[:, 0], path_N[:, 1], "y--")
ax.plot(0, 0, "ys")
ax.plot(t1_min, t2_min, "ys")
ax.plot(t1r_min, t2r_min, "rs")
ax.set_title(title)
ax.axis([t1a, t1b, t2a, t2b])
if i == 1:
ax.set_xlabel(r"$\theta_1$")

save_fig("lasso_vs_ridge_plot")
plt.show()
Elastic Net
In [45]: from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)

elastic_net.fit(X, y)
elastic_net.predict([[1.5]])

Out[45]: array([1.54333232])

Early Stopping
Let's go back to the quadratic dataset we used earlier:

Warning: In recent versions of Scikit-Learn, you must use

root_mean_squared_error() to compute the RMSE, instead of
mean_squared_error(labels, predictions, squared=False) . The following
try / except block tries to import root_mean_squared_error , and if it fails it
just defines it.
In [46]: try:
from sklearn.metrics import root_mean_squared_error
except ImportError:
from sklearn.metrics import mean_squared_error

def root_mean_squared_error(labels, predictions):

return mean_squared_error(labels, predictions, squared=False)

In [47]: from copy import deepcopy

from sklearn.preprocessing import StandardScaler

# extra code – creates the same quadratic dataset as earlier and splits it
np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(m, 1)
X_train, y_train = X[: m // 2], y[: m // 2, 0]
X_valid, y_valid = X[m // 2 :], y[m // 2 :, 0]

preprocessing = make_pipeline(PolynomialFeatures(degree=90, include_bias=Fal

StandardScaler())
X_train_prep = preprocessing.fit_transform(X_train)
X_valid_prep = preprocessing.transform(X_valid)
sgd_reg = SGDRegressor(penalty=None, eta0=0.002, random_state=42)
n_epochs = 500
best_valid_rmse = float('inf')
train_errors, val_errors = [], [] # extra code – it's for the figure below

for epoch in range(n_epochs):

sgd_reg.partial_fit(X_train_prep, y_train)
y_valid_predict = sgd_reg.predict(X_valid_prep)
val_error = root_mean_squared_error(y_valid, y_valid_predict)
if val_error < best_valid_rmse:
best_valid_rmse = val_error
best_model = deepcopy(sgd_reg)

# extra code – we evaluate the train error and save it for the figure
y_train_predict = sgd_reg.predict(X_train_prep)
train_error = root_mean_squared_error(y_train, y_train_predict)
val_errors.append(val_error)
train_errors.append(train_error)

# extra code – this section generates and saves Figure 4–20

best_epoch = np.argmin(val_errors)
plt.figure(figsize=(6, 4))
plt.annotate('Best model',
xy=(best_epoch, best_valid_rmse),
xytext=(best_epoch, best_valid_rmse + 0.5),
ha="center",
arrowprops=dict(facecolor='black', shrink=0.05))
plt.plot([0, n_epochs], [best_valid_rmse, best_valid_rmse], "k:", linewidth=
plt.plot(val_errors, "b-", linewidth=3, label="Validation set")
plt.plot(best_epoch, best_valid_rmse, "bo")
plt.plot(train_errors, "r--", linewidth=2, label="Training set")
plt.legend(loc="upper right")
plt.xlabel("Epoch")
plt.ylabel("RMSE")
plt.axis([0, n_epochs, 0, 3.5])
plt.grid()
save_fig("early_stopping_plot")
plt.show()

Logistic Regression

Estimating Probabilities
In [48]: # extra code – generates and saves Figure 4–21

lim = 6
t = np.linspace(-lim, lim, 100)
sig = 1 / (1 + np.exp(-t))

plt.figure(figsize=(8, 3))
plt.plot([-lim, lim], [0, 0], "k-")
plt.plot([-lim, lim], [0.5, 0.5], "k:")
plt.plot([-lim, lim], [1, 1], "k:")
plt.plot([0, 0], [-1.1, 1.1], "k-")
plt.plot(t, sig, "b-", linewidth=2, label=r"$\sigma(t) = \dfrac{1}{1 + e^{-t
plt.xlabel("t")
plt.legend(loc="upper left")
plt.axis([-lim, lim, -0.1, 1.1])
plt.gca().set_yticks([0, 0.25, 0.5, 0.75, 1])
plt.grid()
save_fig("logistic_function_plot")
plt.show()
Decision Boundaries
In [49]: from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
list(iris)

Out[49]: ['data',
'target',
'frame',
'target_names',
'DESCR',
'feature_names',
'filename',
'data_module']

In [50]: print(iris.DESCR) # extra code – it's a bit too long

.. _iris_dataset:

Iris plants dataset

--------------------

Data Set Characteristics:

:Number of Instances: 150 (50 in each of three classes)

:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================

Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None

:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%[email protected])
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is take
n
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the

pattern recognition literature. Fisher's paper is a classic in the field an
d
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to
a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

- Fisher, R.A. "The use of multiple measurements in taxonomic problems"

Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analys
is.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transacti
ons
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...

In [51]: iris.data.head(3)

Out[51]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)

0 5.1 3.5 1.4 0.2

1 4.9 3.0 1.4 0.2

2 4.7 3.2 1.3 0.2

In [52]: iris.target.head(3) # note that the instances are not shuffled

Out[52]: 0 0
1 0
2 0
Name: target, dtype: int64

In [53]: iris.target_names

Out[53]: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [54]: from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

X = iris.data[["petal width (cm)"]].values

y = iris.target_names[iris.target] == 'virginica'
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

Out[54]: LogisticRegression(random_state=42)

In [55]: X_new = np.linspace(0, 3, 1000).reshape(-1, 1) # reshape to get a column ve

y_proba = log_reg.predict_proba(X_new)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0, 0]

plt.figure(figsize=(8, 3)) # extra code – not needed, just formatting

plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2,
label="Not Iris virginica proba")
plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris virginica prob
plt.plot([decision_boundary, decision_boundary], [0, 1], "k:", linewidth=2,
label="Decision boundary")
# extra code – this section beautifies and saves Figure 4–23
plt.arrow(x=decision_boundary, y=0.08, dx=-0.3, dy=0,
head_width=0.05, head_length=0.1, fc="b", ec="b")
plt.arrow(x=decision_boundary, y=0.92, dx=0.3, dy=0,
head_width=0.05, head_length=0.1, fc="g", ec="g")
plt.plot(X_train[y_train == 0], y_train[y_train == 0], "bs")
plt.plot(X_train[y_train == 1], y_train[y_train == 1], "g^")
plt.xlabel("Petal width (cm)")
plt.ylabel("Probability")
plt.legend(loc="center left")
plt.axis([0, 3, -0.02, 1.02])
plt.grid()
save_fig("logistic_regression_plot")

plt.show()

In [56]: decision_boundary

Out[56]: 1.6516516516516517

In [57]: log_reg.predict([[1.7], [1.5]])

Out[57]: array([ True, False])

In [58]: # extra code – this cell generates and saves Figure 4–24

X = iris.data[["petal length (cm)", "petal width (cm)"]].values

y = iris.target_names[iris.target] == 'virginica'
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_reg = LogisticRegression(C=2, random_state=42)

log_reg.fit(X_train, y_train)

# for the contour plot

x0, x1 = np.meshgrid(np.linspace(2.9, 7, 500).reshape(-1, 1),
np.linspace(0.8, 2.7, 200).reshape(-1, 1))
X_new = np.c_[x0.ravel(), x1.ravel()] # one instance per point on the figur
y_proba = log_reg.predict_proba(X_new)
zz = y_proba[:, 1].reshape(x0.shape)

# for the decision boundary

left_right = np.array([2.9, 7])
boundary = -((log_reg.coef_[0, 0] * left_right + log_reg.intercept_[0])
/ log_reg.coef_[0, 1])

plt.figure(figsize=(10, 4))
plt.plot(X_train[y_train == 0, 0], X_train[y_train == 0, 1], "bs")
plt.plot(X_train[y_train == 1, 0], X_train[y_train == 1, 1], "g^")
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)
plt.clabel(contour, inline=1)
plt.plot(left_right, boundary, "k--", linewidth=3)
plt.text(3.5, 1.27, "Not Iris virginica", color="b", ha="center")
plt.text(6.5, 2.3, "Iris virginica", color="g", ha="center")
plt.xlabel("Petal length")
plt.ylabel("Petal width")
plt.axis([2.9, 7, 0.8, 2.7])
plt.grid()
save_fig("logistic_regression_contour_plot")
plt.show()

Softmax Regression
In [59]: X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = iris["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

softmax_reg = LogisticRegression(C=30, random_state=42)

softmax_reg.fit(X_train, y_train)

Out[59]: LogisticRegression(C=30, random_state=42)

In [60]: softmax_reg.predict([[5, 2]])

Out[60]: array([2])

In [61]: softmax_reg.predict_proba([[5, 2]]).round(2)

Out[61]: array([[0. , 0.04, 0.96]])

In [62]: # extra code – this cell generates and saves Figure 4–25
from matplotlib.colors import ListedColormap

custom_cmap = ListedColormap(["#fafab0", "#9898ff", "#a0faa0"])

x0, x1 = np.meshgrid(np.linspace(0, 8, 500).reshape(-1, 1),

np.linspace(0, 3.5, 200).reshape(-1, 1))
X_new = np.c_[x0.ravel(), x1.ravel()]

y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)

zz1 = y_proba[:, 1].reshape(x0.shape)

zz = y_predict.reshape(x0.shape)

plt.figure(figsize=(10, 4))
plt.plot(X[y == 2, 0], X[y == 2, 1], "g^", label="Iris virginica")
plt.plot(X[y == 1, 0], X[y == 1, 1], "bs", label="Iris versicolor")
plt.plot(X[y == 0, 0], X[y == 0, 1], "yo", label="Iris setosa")

plt.contourf(x0, x1, zz, cmap=custom_cmap)

contour = plt.contour(x0, x1, zz1, cmap="hot")
plt.clabel(contour, inline=1)
plt.xlabel("Petal length")
plt.ylabel("Petal width")
plt.legend(loc="center left")
plt.axis([0.5, 7, 0, 3.5])
plt.grid()
save_fig("softmax_regression_contour_plot")
plt.show()

Exercise solutions

1. to 11.

1. If you have a training set with millions of features you can use Stochastic
Gradient Descent or Mini-batch Gradient Descent, and perhaps Batch
Gradient Descent if the training set fits in memory. But you cannot use the
Normal Equation or the SVD approach because the computational complexity
grows quickly (more than quadratically) with the number of features.
2. If the features in your training set have very different scales, the cost
function will have the shape of an elongated bowl, so the Gradient Descent
algorithms will take a long time to converge. To solve this you should scale
the data before training the model. Note that the Normal Equation or SVD
approach will work just fine without scaling. Moreover, regularized models
may converge to a suboptimal solution if the features are not scaled: since
regularization penalizes large weights, features with smaller values will tend
to be ignored compared to features with larger values.
3. Gradient Descent cannot get stuck in a local minimum when training a
Logistic Regression model because the cost function is convex. Convex
means that if you draw a straight line between any two points on the curve,
the line never crosses the curve.
4. If the optimization problem is convex (such as Linear Regression or Logistic
Regression), and assuming the learning rate is not too high, then all Gradient
Descent algorithms will approach the global optimum and end up producing
fairly similar models. However, unless you gradually reduce the learning
rate, Stochastic GD and Mini-batch GD will never truly converge; instead,
they will keep jumping back and forth around the global optimum. This
means that even if you let them run for a very long time, these Gradient
Descent algorithms will produce slightly different models.
5. If the validation error consistently goes up after every epoch, then one
possibility is that the learning rate is too high and the algorithm is diverging.
If the training error also goes up, then this is clearly the problem and you
should reduce the learning rate. However, if the training error is not going
up, then your model is overfitting the training set and you should stop
training.
6. Due to their random nature, neither Stochastic Gradient Descent nor Mini-
batch Gradient Descent is guaranteed to make progress at every single
training iteration. So if you immediately stop training when the validation
error goes up, you may stop much too early, before the optimum is reached.
A better option is to save the model at regular intervals; then, when it has
not improved for a long time (meaning it will probably never beat the
record), you can revert to the best saved model.
7. Stochastic Gradient Descent has the fastest training iteration since it
considers only one training instance at a time, so it is generally the first to
reach the vicinity of the global optimum (or Mini-batch GD with a very small
mini-batch size). However, only Batch Gradient Descent will actually
converge, given enough training time. As mentioned, Stochastic GD and
Mini-batch GD will bounce around the optimum, unless you gradually reduce
the learning rate.
8. If the validation error is much higher than the training error, this is likely
because your model is overfitting the training set. One way to try to fix this
is to reduce the polynomial degree: a model with fewer degrees of freedom
is less likely to overfit. Another thing you can try is to regularize the model—
for example, by adding an ℓ₂ penalty (Ridge) or an ℓ₁ penalty (Lasso) to the
cost function. This will also reduce the degrees of freedom of the model.
Lastly, you can try to increase the size of the training set.
9. If both the training error and the validation error are almost equal and fairly
high, the model is likely underfitting the training set, which means it has a
high bias. You should try reducing the regularization hyperparameter α.
10. Let's see:

A model with some regularization typically performs better than a model

without any regularization, so you should generally prefer Ridge Regression
over plain Linear Regression.
Lasso Regression uses an ℓ₁ penalty, which tends to push the weights down
to exactly zero. This leads to sparse models, where all weights are zero
except for the most important weights. This is a way to perform feature
selection automatically, which is good if you suspect that only a few features
actually matter. When you are not sure, you should prefer Ridge Regression.
Elastic Net is generally preferred over Lasso since Lasso may behave
erratically in some cases (when several features are strongly correlated or
when there are more features than training instances). However, it does add
an extra hyperparameter to tune. If you want Lasso without the erratic
behavior, you can just use Elastic Net with an l1_ratio close to 1.

11. If you want to classify pictures as outdoor/indoor and daytime/nighttime,

since these are not exclusive classes (i.e., all four combinations are possible)
you should train two Logistic Regression classifiers.

12. Batch Gradient Descent with early

stopping for Softmax Regression
Exercise: Implement Batch Gradient Descent with early stopping for Softmax
Regression without using Scikit-Learn, only NumPy. Use it on a classification task
such as the iris dataset.

Let's start by loading the data. We will just reuse the Iris dataset we loaded
earlier.
In [63]: X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = iris["target"].values

We need to add the bias term for every instance (x0 = 1). The easiest option to
do this would be to use Scikit-Learn's add_dummy_feature() function, but the
point of this exercise is to get a better understanding of the algorithms by
implementing them manually. So here is one possible implementation:

In [64]: X_with_bias = np.c_[np.ones(len(X)), X]

The easiest option to split the dataset into a training set, a validation set and a
test set would be to use Scikit-Learn's train_test_split() function, but
again, we want to do it manually:

In [65]: test_ratio = 0.2

validation_ratio = 0.2
total_size = len(X_with_bias)

test_size = int(total_size * test_ratio)

validation_size = int(total_size * validation_ratio)
train_size = total_size - test_size - validation_size

np.random.seed(42)
rnd_indices = np.random.permutation(total_size)

X_train = X_with_bias[rnd_indices[:train_size]]
y_train = y[rnd_indices[:train_size]]
X_valid = X_with_bias[rnd_indices[train_size:-test_size]]
y_valid = y[rnd_indices[train_size:-test_size]]
X_test = X_with_bias[rnd_indices[-test_size:]]
y_test = y[rnd_indices[-test_size:]]

The targets are currently class indices (0, 1 or 2), but we need target class
probabilities to train the Softmax Regression model. Each instance will have
target class probabilities equal to 0.0 for all classes except for the target class
which will have a probability of 1.0 (in other words, the vector of class
probabilities for any given instance is a one-hot vector). Let's write a small
function to convert the vector of class indices into a matrix containing a one-hot
vector for each instance. To understand this code, you need to know that
np.diag(np.ones(n)) creates an n×n matrix full of 0s except for 1s on the
main diagonal. Moreover, if a is a NumPy array, then a[[1, 3, 2]] returns an
array with 3 rows equal to a[1] , a[3] and a[2] (this is advanced NumPy
indexing).

In [66]: def to_one_hot(y):

return np.diag(np.ones(y.max() + 1))[y]

Let's test this function on the first 10 instances:

In [67]: y_train[:10]

Out[67]: array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1])

In [68]: to_one_hot(y_train[:10])

Out[68]: array([[0., 1., 0.],

[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.],
[0., 1., 0.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 1., 0.],
[0., 1., 0.]])

Looks good, so let's create the target class probabilities matrix for the training
set and the test set:

In [69]: Y_train_one_hot = to_one_hot(y_train)

Y_valid_one_hot = to_one_hot(y_valid)
Y_test_one_hot = to_one_hot(y_test)

Now let's scale the inputs. We compute the mean and standard deviation of each
feature on the training set (except for the bias feature), then we center and scale
each feature in the training set, the validation set, and the test set:

In [70]: mean = X_train[:, 1:].mean(axis=0)

std = X_train[:, 1:].std(axis=0)
X_train[:, 1:] = (X_train[:, 1:] - mean) / std
X_valid[:, 1:] = (X_valid[:, 1:] - mean) / std
X_test[:, 1:] = (X_test[:, 1:] - mean) / std

Now let's implement the Softmax function. Recall that it is defined by the
following equation:

exp(sk (x))
σ(s(x)) =
k
K

∑ exp(sj (x))
j=1

In [71]: def softmax(logits):

exps = np.exp(logits)
exp_sums = exps.sum(axis=1, keepdims=True)
return exps / exp_sums

We are almost ready to start training. Let's define the number of inputs and
outputs:
In [72]: n_inputs = X_train.shape[1] # == 3 (2 features plus the bias term)
n_outputs = len(np.unique(y_train)) # == 3 (there are 3 iris classes)

Now here comes the hardest part: training! Theoretically, it's simple: it's just a
matter of translating the math equations into Python code. But in practice, it can
be quite tricky: in particular, it's easy to mix up the order of the terms, or the
indices. You can even end up with code that looks like it's working but is actually
not computing exactly the right thing. When unsure, you should write down the
shape of each term in the equation and make sure the corresponding terms in
your code match closely. It can also help to evaluate each term independently
and print them out. The good news it that you won't have to do this everyday,
since all this is well implemented by Scikit-Learn, but it will help you understand
what's going on under the hood.

So the equations we will need are the cost function:

$J(\mathbf{\Theta}) =

\dfrac{1}{m}\sum\limits_{i=1}^{m}\sum\limits_{k=1}^{K}
{y_k^{(i)}\log\left(\hat{p}_k^{(i)}\right)}$

And the equation for the gradients:

m
1 (i) (i) (i)
∇ ( k) J (Θ) = ^
∑ (p − y )x
θ k k
m
i=1

(i) (i)
Note that ^
log(p
k
) may not be computable if ^
p
k
= 0. So we will add a tiny
(i)
value ϵ to ^
log(p
k
) to avoid getting nan values.

In [73]: eta = 0.5

n_epochs = 5001
m = len(X_train)
epsilon = 1e-5

np.random.seed(42)
Theta = np.random.randn(n_inputs, n_outputs)

for epoch in range(n_epochs):

logits = X_train @ Theta
Y_proba = softmax(logits)
if epoch % 1000 == 0:
Y_proba_valid = softmax(X_valid @ Theta)
xentropy_losses = -(Y_valid_one_hot * np.log(Y_proba_valid + epsilon
print(epoch, xentropy_losses.sum(axis=1).mean())
error = Y_proba - Y_train_one_hot
gradients = 1 / m * X_train.T @ error
Theta = Theta - eta * gradients
0 3.7085808486476917
1000 0.14519367480830644
2000 0.1301309575504088
3000 0.12009639326384539
4000 0.11372961364786884
5000 0.11002459532472425

And that's it! The Softmax model is trained. Let's look at the model parameters:

In [74]: Theta

Out[74]: array([[ 0.41931626, 6.11112089, -5.52429876],

[-6.53054533, -0.74608616, 8.33137102],
[-5.28115784, 0.25152675, 6.90680425]])

Let's make predictions for the validation set and check the accuracy score:

In [75]: logits = X_valid @ Theta

Y_proba = softmax(logits)
y_predict = Y_proba.argmax(axis=1)

accuracy_score = (y_predict == y_valid).mean()

accuracy_score

Out[75]: 0.9333333333333333

Well, this model looks pretty ok. For the sake of the exercise, let's add a bit of ℓ2

regularization. The following training code is similar to the one above, but the
loss now has an additional ℓ2 penalty, and the gradients have the proper
additional term (note that we don't regularize the first element of Theta since
this corresponds to the bias term). Also, let's try increasing the learning rate
eta .

In [76]: eta = 0.5

n_epochs = 5001
m = len(X_train)
epsilon = 1e-5
alpha = 0.01 # regularization hyperparameter

np.random.seed(42)
Theta = np.random.randn(n_inputs, n_outputs)

for epoch in range(n_epochs):

logits = X_train @ Theta
Y_proba = softmax(logits)
if epoch % 1000 == 0:
Y_proba_valid = softmax(X_valid @ Theta)
xentropy_losses = -(Y_valid_one_hot * np.log(Y_proba_valid + epsilon
l2_loss = 1 / 2 * (Theta[1:] ** 2).sum()
total_loss = xentropy_losses.sum(axis=1).mean() + alpha * l2_loss
print(epoch, total_loss.round(4))
error = Y_proba - Y_train_one_hot
gradients = 1 / m * X_train.T @ error
gradients += np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
Theta = Theta - eta * gradients

0 3.7372
1000 0.3259
2000 0.3259
3000 0.3259
4000 0.3259
5000 0.3259

Because of the additional ℓ2 penalty, the loss seems greater than earlier, but
perhaps this model will perform better? Let's find out:

In [77]: logits = X_valid @ Theta

Y_proba = softmax(logits)
y_predict = Y_proba.argmax(axis=1)

accuracy_score = (y_predict == y_valid).mean()

accuracy_score

Out[77]: 0.9333333333333333

In this case, the ℓ2 penalty did not change the test accuracy. Perhaps try fine-
tuning alpha ?

Now let's add early stopping. For this we just need to measure the loss on the
validation set at every iteration and stop when the error starts growing.

In [78]: eta = 0.5

n_epochs = 50_001
m = len(X_train)
epsilon = 1e-5
C = 100 # regularization hyperparameter
best_loss = np.infty

np.random.seed(42)
Theta = np.random.randn(n_inputs, n_outputs)

for epoch in range(n_epochs):

logits = X_train @ Theta
Y_proba = softmax(logits)
Y_proba_valid = softmax(X_valid @ Theta)
xentropy_losses = -(Y_valid_one_hot * np.log(Y_proba_valid + epsilon))
l2_loss = 1 / 2 * (Theta[1:] ** 2).sum()
total_loss = xentropy_losses.sum(axis=1).mean() + 1 / C * l2_loss
if epoch % 1000 == 0:
print(epoch, total_loss.round(4))
if total_loss < best_loss:
best_loss = total_loss
else:
print(epoch - 1, best_loss.round(4))
print(epoch, total_loss.round(4), "early stopping!")
break
error = Y_proba - Y_train_one_hot
gradients = 1 / m * X_train.T @ error
gradients += np.r_[np.zeros([1, n_outputs]), 1 / C * Theta[1:]]
Theta = Theta - eta * gradients

0 3.7372
281 0.3256
282 0.3256 early stopping!

In [79]: logits = X_valid @ Theta

Y_proba = softmax(logits)
y_predict = Y_proba.argmax(axis=1)

accuracy_score = (y_predict == y_valid).mean()

accuracy_score

Out[79]: 0.9333333333333333

Oh well, still no change in validation accuracy, but at least early stopping

shortened training a bit.

Now let's plot the model's predictions on the whole dataset (remember to scale
all features fed to the model):

In [80]: custom_cmap = mpl.colors.ListedColormap(['#fafab0', '#9898ff', '#a0faa0'])

x0, x1 = np.meshgrid(np.linspace(0, 8, 500).reshape(-1, 1),

np.linspace(0, 3.5, 200).reshape(-1, 1))
X_new = np.c_[x0.ravel(), x1.ravel()]
X_new = (X_new - mean) / std
X_new_with_bias = np.c_[np.ones(len(X_new)), X_new]

logits = X_new_with_bias @ Theta

Y_proba = softmax(logits)
y_predict = Y_proba.argmax(axis=1)

zz1 = Y_proba[:, 1].reshape(x0.shape)

zz = y_predict.reshape(x0.shape)

plt.contourf(x0, x1, zz, cmap=custom_cmap)

contour = plt.contour(x0, x1, zz1, cmap="hot")
plt.clabel(contour, inline=1)
plt.xlabel("Petal length")
plt.ylabel("Petal width")
plt.legend(loc="upper left")
plt.axis([0, 7, 0, 3.5])
plt.grid()
plt.show()
And now let's measure the final model's accuracy on the test set:

In [81]: logits = X_test @ Theta

Y_proba = softmax(logits)
y_predict = Y_proba.argmax(axis=1)

accuracy_score = (y_predict == y_test).mean()

accuracy_score

Out[81]: 0.9666666666666667

Well we get even better performance on the test set. This variability is likely due
to the very small size of the dataset: depending on how you sample the training
set, validation set and the test set, you can get quite different results. Try
changing the random seed and running the code again a few times, you will see
that the results will vary.

In [ ]:

This notebook was converted with convert.ploomber.io

SPG/SSG Monitoring Tool
100% (4)
SPG/SSG Monitoring Tool
2 pages
Machine Learning Exercises in Python, Part 1: Curious Insight
No ratings yet
Machine Learning Exercises in Python, Part 1: Curious Insight
14 pages
Rapid Development PDF
100% (2)
Rapid Development PDF
66 pages
Setup: This Notebook Contains All The Sample Code and Solutions To The Exercises in Chapter 4
No ratings yet
Setup: This Notebook Contains All The Sample Code and Solutions To The Exercises in Chapter 4
24 pages
1.1 ID5059 1.2 Tom Kelsey - Jan 2021: February 15, 2021
No ratings yet
1.1 ID5059 1.2 Tom Kelsey - Jan 2021: February 15, 2021
43 pages
ML Labs
No ratings yet
ML Labs
46 pages
Gradient Descent Vizcs229 PDF
No ratings yet
Gradient Descent Vizcs229 PDF
7 pages
Assignment7
No ratings yet
Assignment7
5 pages
Aie231 NN Lab5
No ratings yet
Aie231 NN Lab5
7 pages
MLP - Week 6 - MNIST - LogitReg - Ipynb - Colaboratory
No ratings yet
MLP - Week 6 - MNIST - LogitReg - Ipynb - Colaboratory
19 pages
Machine Learning Lab
No ratings yet
Machine Learning Lab
23 pages
21BCE5775 Neural Networks
No ratings yet
21BCE5775 Neural Networks
19 pages
Notes5_Regression
No ratings yet
Notes5_Regression
14 pages
Lab-5 Report
No ratings yet
Lab-5 Report
11 pages
niraj dl
No ratings yet
niraj dl
15 pages
Experiment No
No ratings yet
Experiment No
29 pages
Machine Learning Lab (3) Report (21 CP 81)
No ratings yet
Machine Learning Lab (3) Report (21 CP 81)
7 pages
MMDS Da3
No ratings yet
MMDS Da3
8 pages
Advance AI and ML LAB
No ratings yet
Advance AI and ML LAB
16 pages
Infotec Ai 1000 Program-hcia-Ai Lab Guide
No ratings yet
Infotec Ai 1000 Program-hcia-Ai Lab Guide
82 pages
Lecture04. Training Models (Regression in Chapter 4)
No ratings yet
Lecture04. Training Models (Regression in Chapter 4)
44 pages
Physics-Informed-Neural-Networks-for-Numerical-Analysis
No ratings yet
Physics-Informed-Neural-Networks-for-Numerical-Analysis
16 pages
Weatherwax Geron Solutions
No ratings yet
Weatherwax Geron Solutions
52 pages
Solutions HOML PDF
No ratings yet
Solutions HOML PDF
45 pages
Lab 4 - Markdown Practical - Solution
No ratings yet
Lab 4 - Markdown Practical - Solution
5 pages
CS229 Lecture 2 PDF
100% (1)
CS229 Lecture 2 PDF
48 pages
practicalMachineLearning_lecture3
No ratings yet
practicalMachineLearning_lecture3
25 pages
ANN PR Code and Output
No ratings yet
ANN PR Code and Output
25 pages
Deep-Learning-Keras-Tensorflow - 1.1.1 Perceptron and Adaline - Ipynb at Master Leriomaggio - Deep-Learning-Keras-Tensorflow
No ratings yet
Deep-Learning-Keras-Tensorflow - 1.1.1 Perceptron and Adaline - Ipynb at Master Leriomaggio - Deep-Learning-Keras-Tensorflow
11 pages
Setup: This Notebook Contains All The Sample Code and Solutions To The Exercises in Chapter 3
No ratings yet
Setup: This Notebook Contains All The Sample Code and Solutions To The Exercises in Chapter 3
30 pages
Six Lectures On NN - Montanari
No ratings yet
Six Lectures On NN - Montanari
77 pages
7 neural_network
No ratings yet
7 neural_network
78 pages
DL Practical 02 Binary Class Classifier Using ANN
No ratings yet
DL Practical 02 Binary Class Classifier Using ANN
5 pages
21 CP 46 - (ML LAB 3)
No ratings yet
21 CP 46 - (ML LAB 3)
13 pages
Machinelearning
No ratings yet
Machinelearning
59 pages
ML Record Print
No ratings yet
ML Record Print
20 pages
SkriptOptMach
No ratings yet
SkriptOptMach
49 pages
IN5400 - Machine Learning For Image Analysis
No ratings yet
IN5400 - Machine Learning For Image Analysis
6 pages
Ex 4
No ratings yet
Ex 4
15 pages
Book Pytorch Scikit Learn Numpy
No ratings yet
Book Pytorch Scikit Learn Numpy
247 pages
ML
No ratings yet
ML
8 pages
Lecture 21
No ratings yet
Lecture 21
138 pages
20102A0071 DL Experiment5.b
No ratings yet
20102A0071 DL Experiment5.b
5 pages
Deep Learning Basics (Lecture Notes) : Romain Tavenard
No ratings yet
Deep Learning Basics (Lecture Notes) : Romain Tavenard
49 pages
Theory of Deep Learning 1652786371
No ratings yet
Theory of Deep Learning 1652786371
118 pages
1-Linear Regression and TensorFlow
No ratings yet
1-Linear Regression and TensorFlow
79 pages
Software Laboratory II Code
No ratings yet
Software Laboratory II Code
27 pages
6036 Lecture Notes
No ratings yet
6036 Lecture Notes
56 pages
Lecture3_upload
No ratings yet
Lecture3_upload
28 pages
Stochastic Gradient Descent
No ratings yet
Stochastic Gradient Descent
23 pages
DNN ALL Practical 28
No ratings yet
DNN ALL Practical 28
34 pages
LinearRegression Tutorial
No ratings yet
LinearRegression Tutorial
40 pages
Neural Network Code
No ratings yet
Neural Network Code
5 pages
LN ML Rug
No ratings yet
LN ML Rug
283 pages
Lec 03 Deep Networks 1
No ratings yet
Lec 03 Deep Networks 1
53 pages
21CSC305P Ml - Lab Programs 1 -9
No ratings yet
21CSC305P Ml - Lab Programs 1 -9
36 pages
Pdf
No ratings yet
Pdf
41 pages
NLP-NeuralNetworks Reading Notes
No ratings yet
NLP-NeuralNetworks Reading Notes
13 pages
ML_Lab_01999676272
No ratings yet
ML_Lab_01999676272
12 pages
Profound Python Data Science
From Everand
Profound Python Data Science
Onder Teker
No ratings yet
A Brief Introduction to MATLAB: Taken From the Book "MATLAB for Beginners: A Gentle Approach"
From Everand
A Brief Introduction to MATLAB: Taken From the Book "MATLAB for Beginners: A Gentle Approach"
Peter Kattan
2.5/5 (2)
Advanced C Concepts and Programming: First Edition
From Everand
Advanced C Concepts and Programming: First Edition
Gayatri
3/5 (1)
ME444_Lab5_Group8
No ratings yet
ME444_Lab5_Group8
18 pages
2017_Endsem_Paper_Solutions
No ratings yet
2017_Endsem_Paper_Solutions
12 pages
BB101_Tutorial2
No ratings yet
BB101_Tutorial2
2 pages
2018_Endsem_Paper
No ratings yet
2018_Endsem_Paper
19 pages
2019_Endsem_Paper
No ratings yet
2019_Endsem_Paper
8 pages
2009_Endsem_Paper
No ratings yet
2009_Endsem_Paper
10 pages
2012 Endsem Make Up Exam
No ratings yet
2012 Endsem Make Up Exam
2 pages
2011_Endsem_Paper
No ratings yet
2011_Endsem_Paper
8 pages
E6 - Report: Problem 1
No ratings yet
E6 - Report: Problem 1
16 pages
EPRS-knowledge-sources - Artificial Intelligence
100% (1)
EPRS-knowledge-sources - Artificial Intelligence
48 pages
Second Truth Inventory
No ratings yet
Second Truth Inventory
3 pages
2.0 Lowercase Alphabet On The Keyboard Lesson
No ratings yet
2.0 Lowercase Alphabet On The Keyboard Lesson
3 pages
Pembahasan Quiz Grammar
No ratings yet
Pembahasan Quiz Grammar
5 pages
Tuckman S Stages of Group Development PDF
No ratings yet
Tuckman S Stages of Group Development PDF
2 pages
Fall 2018 English 9 Syllabus - Engelke
No ratings yet
Fall 2018 English 9 Syllabus - Engelke
2 pages
PDF Merged
No ratings yet
PDF Merged
4 pages
Donna Claire B. Cañeza: Central Bicol State University of Agriculture
No ratings yet
Donna Claire B. Cañeza: Central Bicol State University of Agriculture
8 pages
Hanif Khan Ui Ux Designer Resume
No ratings yet
Hanif Khan Ui Ux Designer Resume
1 page
Shantelle Dauterive
No ratings yet
Shantelle Dauterive
3 pages
DLP Math 3 Odd and Even Numbers
No ratings yet
DLP Math 3 Odd and Even Numbers
6 pages
Category-I: Civil Engineering: S. No. Name of Services GEN OBC SC ST Total
No ratings yet
Category-I: Civil Engineering: S. No. Name of Services GEN OBC SC ST Total
1 page
RA 7722 - Higher Education Act of 1994
100% (1)
RA 7722 - Higher Education Act of 1994
5 pages
Cla 1 Ob
No ratings yet
Cla 1 Ob
7 pages
This Agreement Is Entered To by and Between
No ratings yet
This Agreement Is Entered To by and Between
3 pages
Investigating Iraqi EFL Learners' Performance in Using Sentence Types
No ratings yet
Investigating Iraqi EFL Learners' Performance in Using Sentence Types
39 pages
Mga Sikolohikal Na Salik Sa Pagkatuto NG Pangalawang
No ratings yet
Mga Sikolohikal Na Salik Sa Pagkatuto NG Pangalawang
13 pages
The Art of Verbal Storytelling
No ratings yet
The Art of Verbal Storytelling
2 pages
NSTU_application_form
No ratings yet
NSTU_application_form
2 pages
complete CMRS Application form(1)
No ratings yet
complete CMRS Application form(1)
2 pages
Personality Development: M.Saravana
No ratings yet
Personality Development: M.Saravana
24 pages
g7 Oct06
No ratings yet
g7 Oct06
2 pages
Lesson 14.: Unit Iii. 21 Century Literacies
No ratings yet
Lesson 14.: Unit Iii. 21 Century Literacies
70 pages
GRADING and REPORTING
No ratings yet
GRADING and REPORTING
39 pages
Influencers Book Summary
100% (1)
Influencers Book Summary
15 pages
Apresentacao Escala Risc Unitech-Bull
No ratings yet
Apresentacao Escala Risc Unitech-Bull
49 pages
Dean's Statement: Guidance For Applicants
No ratings yet
Dean's Statement: Guidance For Applicants
3 pages
Indian Institute of Technology Bombay August 24, 2023 EE782 Advanced Topics in Machine Learning Assignment 1: LSTM-based Stock Trading System
No ratings yet
Indian Institute of Technology Bombay August 24, 2023 EE782 Advanced Topics in Machine Learning Assignment 1: LSTM-based Stock Trading System
1 page