04_training_linear_models
04_training_linear_models
This notebook contains all the sample code and solutions to the exercises in
chapter 4.
Setup
This project requires Python 3.7 or above:
As we did in previous chapters, let's define the default font sizes to make the
figures prettier:
plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)
Linear Regression
plt.figure(figsize=(6, 4))
plt.plot(X, y, "b.")
plt.xlabel("$x_1$")
plt.ylabel("$y$", rotation=0)
plt.axis([0, 2, 0, 15])
plt.grid()
save_fig("generated_data_plot")
plt.show()
In [7]: from sklearn.preprocessing import add_dummy_feature
In [8]: theta_best
Out[8]: array([[4.21509616],
[2.77011339]])
Out[9]: array([[4.21509616],
[9.75532293]])
plt.show()
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
In [12]: lin_reg.predict(X_new)
Out[12]: array([[4.21509616],
[9.75532293]])
Out[13]: array([[4.21509616],
[2.77011339]])
+ +
This function computes X y, where X is the pseudoinverse of X (specifically
the Moore-Penrose inverse). You can use np.linalg.pinv() to compute the
pseudoinverse directly:
In [14]: np.linalg.pinv(X_b) @ y
Out[14]: array([[4.21509616],
[2.77011339]])
Gradient Descent
Batch Gradient Descent
In [15]: eta = 0.1 # learning rate
n_epochs = 1000
m = len(X_b) # number of instances
np.random.seed(42)
theta = np.random.randn(2, 1) # randomly initialized model parameters
In [16]: theta
Out[16]: array([[4.21509616],
[2.77011339]])
In [17]: # extra code – generates and saves Figure 4–8
np.random.seed(42)
theta = np.random.randn(2, 1) # random initialization
plt.figure(figsize=(10, 4))
plt.subplot(131)
plot_gradient_descent(theta, eta=0.02)
plt.ylabel("$y$", rotation=0)
plt.subplot(132)
theta_path_bgd = plot_gradient_descent(theta, eta=0.1)
plt.gca().axes.yaxis.set_ticklabels([])
plt.subplot(133)
plt.gca().axes.yaxis.set_ticklabels([])
plot_gradient_descent(theta, eta=0.5)
save_fig("gradient_descent_plot")
plt.show()
Stochastic Gradient Descent
In [18]: theta_path_sgd = [] # extra code – we need to store the path of theta in th
# parameter space to plot the next figure
In [19]: n_epochs = 50
t0, t1 = 5, 50 # learning schedule hyperparameters
def learning_schedule(t):
return t0 / (t + t1)
np.random.seed(42)
theta = np.random.randn(2, 1) # random initialization
random_index = np.random.randint(m)
xi = X_b[random_index : random_index + 1]
yi = y[random_index : random_index + 1]
gradients = 2 * xi.T @ (xi @ theta - yi) # for SGD, do not divide b
eta = learning_schedule(epoch * m + iteration)
theta = theta - eta * gradients
theta_path_sgd.append(theta) # extra code – to generate the figure
Out[20]: array([[4.21076011],
[2.74856079]])
In [23]: # extra code – this cell generates and saves Figure 4–11
n_epochs = 50
minibatch_size = 20
n_batches_per_epoch = ceil(m / minibatch_size)
np.random.seed(42)
theta = np.random.randn(2, 1) # random initialization
def learning_schedule(t):
return t0 / (t + t1)
theta_path_mgd = []
for epoch in range(n_epochs):
shuffled_indices = np.random.permutation(m)
X_b_shuffled = X_b[shuffled_indices]
y_shuffled = y[shuffled_indices]
for iteration in range(0, n_batches_per_epoch):
idx = iteration * minibatch_size
xi = X_b_shuffled[idx : idx + minibatch_size]
yi = y_shuffled[idx : idx + minibatch_size]
gradients = 2 / minibatch_size * xi.T @ (xi @ theta - yi)
eta = learning_schedule(iteration)
theta = theta - eta * gradients
theta_path_mgd.append(theta)
theta_path_bgd = np.array(theta_path_bgd)
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)
plt.figure(figsize=(7, 4))
plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1,
label="Stochastic")
plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "g-+", linewidth=2,
label="Mini-batch")
plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "b-o", linewidth=3,
label="Batch")
plt.legend(loc="upper left")
plt.xlabel(r"$\theta_0$")
plt.ylabel(r"$\theta_1$ ", rotation=0)
plt.axis([2.6, 4.6, 2.3, 3.4])
plt.grid()
save_fig("gradient_descent_paths_plot")
plt.show()
Polynomial Regression
In [24]: np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(m, 1)
In [25]: # extra code – this cell generates and saves Figure 4–12
plt.figure(figsize=(6, 4))
plt.plot(X, y, "b.")
plt.xlabel("$x_1$")
plt.ylabel("$y$", rotation=0)
plt.axis([-3, 3, 0, 10])
plt.grid()
save_fig("quadratic_data_plot")
plt.show()
Out[26]: array([-0.75275929])
In [27]: X_poly[0]
In [29]: # extra code – this cell generates and saves Figure 4–13
plt.figure(figsize=(6, 4))
plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$")
plt.ylabel("$y$", rotation=0)
plt.legend(loc="upper left")
plt.axis([-3, 3, 0, 10])
plt.grid()
save_fig("quadratic_predictions_plot")
plt.show()
In [30]: # extra code – this cell generates and saves Figure 4–14
plt.figure(figsize=(6, 4))
for style, width, degree in (("r-+", 2, 1), ("b--", 2, 2), ("g-", 1, 300)):
polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
std_scaler = StandardScaler()
lin_reg = LinearRegression()
polynomial_regression = make_pipeline(polybig_features, std_scaler, lin_
polynomial_regression.fit(X, y)
y_newbig = polynomial_regression.predict(X_new)
label = f"{degree} degree{'s' if degree > 1 else ''}"
plt.plot(X_new, y_newbig, style, label=label, linewidth=width)
plt.show()
In [32]: from sklearn.pipeline import make_pipeline
polynomial_regression = make_pipeline(
PolynomialFeatures(degree=10, include_bias=False),
LinearRegression())
train_errors = -train_scores.mean(axis=1)
valid_errors = -valid_scores.mean(axis=1)
plt.figure(figsize=(6, 4))
plt.plot(train_sizes, train_errors, "r-+", linewidth=2, label="train")
plt.plot(train_sizes, valid_errors, "b-", linewidth=3, label="valid")
plt.legend(loc="upper right")
plt.xlabel("Training set size")
plt.ylabel("RMSE")
plt.grid()
plt.axis([0, 80, 0, 2.5])
save_fig("learning_curves_plot")
plt.show()
Regularized Linear Models
Ridge Regression
Let's generate a very small and noisy linear dataset:
In [34]: # extra code – we've done this type of generation several times before
np.random.seed(42)
m = 20
X = 3 * np.random.rand(m, 1)
y = 1 + 0.5 * X + np.random.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)
Out[36]: array([[1.55325833]])
In [37]: # extra code – this cell generates and saves Figure 4–17
plt.figure(figsize=(9, 3.5))
plt.subplot(121)
plot_model(Ridge, polynomial=False, alphas=(0, 10, 100), random_state=42)
plt.ylabel("$y$ ", rotation=0)
plt.subplot(122)
plot_model(Ridge, polynomial=True, alphas=(0, 10**-5, 1), random_state=42)
plt.gca().axes.yaxis.set_ticklabels([])
save_fig("ridge_regression_plot")
plt.show()
Out[38]: array([1.55302613])
In [39]: # extra code – show that we get roughly the same solution as earlier when
# we use Stochastic Average GD (solver="sag")
ridge_reg = Ridge(alpha=0.1, solver="sag", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
Out[39]: array([[1.55321535]])
In [40]: # extra code – shows the closed form solution of Ridge regression,
# compare with the next Ridge model's learned parameters below
alpha = 0.1
A = np.array([[0., 0.], [0., 1.]])
X_b = np.c_[np.ones(m), X]
np.linalg.inv(X_b.T @ X_b + alpha * A) @ X_b.T @ y
Out[40]: array([[0.97898394],
[0.3828496 ]])
Lasso Regression
In [42]: from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])
Out[42]: array([1.53788174])
In [43]: # extra code – this cell generates and saves Figure 4–18
plt.figure(figsize=(9, 3.5))
plt.subplot(121)
plot_model(Lasso, polynomial=False, alphas=(0, 0.1, 1), random_state=42)
plt.ylabel("$y$ ", rotation=0)
plt.subplot(122)
plot_model(Lasso, polynomial=True, alphas=(0, 1e-2, 1), random_state=42)
plt.gca().axes.yaxis.set_ticklabels([])
save_fig("lasso_regression_plot")
plt.show()
In [44]: # extra code – this BIG cell generates and saves Figure 4–19
for i, N, l1, l2, title in ((0, N1, 2.0, 0, "Lasso"), (1, N2, 0, 2.0, "Ridge
JR = J + l1 * N1 + l2 * 0.5 * N2 ** 2
ax = axes[i, 1]
ax.grid()
ax.axhline(y=0, color="k")
ax.axvline(x=0, color="k")
ax.contourf(t1, t2, JR, levels=levelsJR, alpha=0.9)
ax.plot(path_JR[:, 0], path_JR[:, 1], "w-o")
ax.plot(path_N[:, 0], path_N[:, 1], "y--")
ax.plot(0, 0, "ys")
ax.plot(t1_min, t2_min, "ys")
ax.plot(t1r_min, t2r_min, "rs")
ax.set_title(title)
ax.axis([t1a, t1b, t2a, t2b])
if i == 1:
ax.set_xlabel(r"$\theta_1$")
save_fig("lasso_vs_ridge_plot")
plt.show()
Elastic Net
In [45]: from sklearn.linear_model import ElasticNet
Out[45]: array([1.54333232])
Early Stopping
Let's go back to the quadratic dataset we used earlier:
# extra code – creates the same quadratic dataset as earlier and splits it
np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(m, 1)
X_train, y_train = X[: m // 2], y[: m // 2, 0]
X_valid, y_valid = X[m // 2 :], y[m // 2 :, 0]
# extra code – we evaluate the train error and save it for the figure
y_train_predict = sgd_reg.predict(X_train_prep)
train_error = root_mean_squared_error(y_train, y_train_predict)
val_errors.append(val_error)
train_errors.append(train_error)
Logistic Regression
Estimating Probabilities
In [48]: # extra code – generates and saves Figure 4–21
lim = 6
t = np.linspace(-lim, lim, 100)
sig = 1 / (1 + np.exp(-t))
plt.figure(figsize=(8, 3))
plt.plot([-lim, lim], [0, 0], "k-")
plt.plot([-lim, lim], [0.5, 0.5], "k:")
plt.plot([-lim, lim], [1, 1], "k:")
plt.plot([0, 0], [-1.1, 1.1], "k-")
plt.plot(t, sig, "b-", linewidth=2, label=r"$\sigma(t) = \dfrac{1}{1 + e^{-t
plt.xlabel("t")
plt.legend(loc="upper left")
plt.axis([-lim, lim, -0.1, 1.1])
plt.gca().set_yticks([0, 0.25, 0.5, 0.75, 1])
plt.grid()
save_fig("logistic_function_plot")
plt.show()
Decision Boundaries
In [49]: from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
list(iris)
Out[49]: ['data',
'target',
'frame',
'target_names',
'DESCR',
'feature_names',
'filename',
'data_module']
:Summary Statistics:
The famous Iris database, first used by Sir R.A. Fisher. The dataset is take
n
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
.. topic:: References
In [51]: iris.data.head(3)
Out[51]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
Out[52]: 0 0
1 0
2 0
Name: target, dtype: int64
In [53]: iris.target_names
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
Out[54]: LogisticRegression(random_state=42)
plt.show()
In [56]: decision_boundary
Out[56]: 1.6516516516516517
In [58]: # extra code – this cell generates and saves Figure 4–24
plt.figure(figsize=(10, 4))
plt.plot(X_train[y_train == 0, 0], X_train[y_train == 0, 1], "bs")
plt.plot(X_train[y_train == 1, 0], X_train[y_train == 1, 1], "g^")
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)
plt.clabel(contour, inline=1)
plt.plot(left_right, boundary, "k--", linewidth=3)
plt.text(3.5, 1.27, "Not Iris virginica", color="b", ha="center")
plt.text(6.5, 2.3, "Iris virginica", color="g", ha="center")
plt.xlabel("Petal length")
plt.ylabel("Petal width")
plt.axis([2.9, 7, 0.8, 2.7])
plt.grid()
save_fig("logistic_regression_contour_plot")
plt.show()
Softmax Regression
In [59]: X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = iris["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Out[60]: array([2])
In [62]: # extra code – this cell generates and saves Figure 4–25
from matplotlib.colors import ListedColormap
y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)
plt.figure(figsize=(10, 4))
plt.plot(X[y == 2, 0], X[y == 2, 1], "g^", label="Iris virginica")
plt.plot(X[y == 1, 0], X[y == 1, 1], "bs", label="Iris versicolor")
plt.plot(X[y == 0, 0], X[y == 0, 1], "yo", label="Iris setosa")
Exercise solutions
1. to 11.
1. If you have a training set with millions of features you can use Stochastic
Gradient Descent or Mini-batch Gradient Descent, and perhaps Batch
Gradient Descent if the training set fits in memory. But you cannot use the
Normal Equation or the SVD approach because the computational complexity
grows quickly (more than quadratically) with the number of features.
2. If the features in your training set have very different scales, the cost
function will have the shape of an elongated bowl, so the Gradient Descent
algorithms will take a long time to converge. To solve this you should scale
the data before training the model. Note that the Normal Equation or SVD
approach will work just fine without scaling. Moreover, regularized models
may converge to a suboptimal solution if the features are not scaled: since
regularization penalizes large weights, features with smaller values will tend
to be ignored compared to features with larger values.
3. Gradient Descent cannot get stuck in a local minimum when training a
Logistic Regression model because the cost function is convex. Convex
means that if you draw a straight line between any two points on the curve,
the line never crosses the curve.
4. If the optimization problem is convex (such as Linear Regression or Logistic
Regression), and assuming the learning rate is not too high, then all Gradient
Descent algorithms will approach the global optimum and end up producing
fairly similar models. However, unless you gradually reduce the learning
rate, Stochastic GD and Mini-batch GD will never truly converge; instead,
they will keep jumping back and forth around the global optimum. This
means that even if you let them run for a very long time, these Gradient
Descent algorithms will produce slightly different models.
5. If the validation error consistently goes up after every epoch, then one
possibility is that the learning rate is too high and the algorithm is diverging.
If the training error also goes up, then this is clearly the problem and you
should reduce the learning rate. However, if the training error is not going
up, then your model is overfitting the training set and you should stop
training.
6. Due to their random nature, neither Stochastic Gradient Descent nor Mini-
batch Gradient Descent is guaranteed to make progress at every single
training iteration. So if you immediately stop training when the validation
error goes up, you may stop much too early, before the optimum is reached.
A better option is to save the model at regular intervals; then, when it has
not improved for a long time (meaning it will probably never beat the
record), you can revert to the best saved model.
7. Stochastic Gradient Descent has the fastest training iteration since it
considers only one training instance at a time, so it is generally the first to
reach the vicinity of the global optimum (or Mini-batch GD with a very small
mini-batch size). However, only Batch Gradient Descent will actually
converge, given enough training time. As mentioned, Stochastic GD and
Mini-batch GD will bounce around the optimum, unless you gradually reduce
the learning rate.
8. If the validation error is much higher than the training error, this is likely
because your model is overfitting the training set. One way to try to fix this
is to reduce the polynomial degree: a model with fewer degrees of freedom
is less likely to overfit. Another thing you can try is to regularize the model—
for example, by adding an ℓ₂ penalty (Ridge) or an ℓ₁ penalty (Lasso) to the
cost function. This will also reduce the degrees of freedom of the model.
Lastly, you can try to increase the size of the training set.
9. If both the training error and the validation error are almost equal and fairly
high, the model is likely underfitting the training set, which means it has a
high bias. You should try reducing the regularization hyperparameter α.
10. Let's see:
Let's start by loading the data. We will just reuse the Iris dataset we loaded
earlier.
In [63]: X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = iris["target"].values
We need to add the bias term for every instance (x0 = 1). The easiest option to
do this would be to use Scikit-Learn's add_dummy_feature() function, but the
point of this exercise is to get a better understanding of the algorithms by
implementing them manually. So here is one possible implementation:
The easiest option to split the dataset into a training set, a validation set and a
test set would be to use Scikit-Learn's train_test_split() function, but
again, we want to do it manually:
np.random.seed(42)
rnd_indices = np.random.permutation(total_size)
X_train = X_with_bias[rnd_indices[:train_size]]
y_train = y[rnd_indices[:train_size]]
X_valid = X_with_bias[rnd_indices[train_size:-test_size]]
y_valid = y[rnd_indices[train_size:-test_size]]
X_test = X_with_bias[rnd_indices[-test_size:]]
y_test = y[rnd_indices[-test_size:]]
The targets are currently class indices (0, 1 or 2), but we need target class
probabilities to train the Softmax Regression model. Each instance will have
target class probabilities equal to 0.0 for all classes except for the target class
which will have a probability of 1.0 (in other words, the vector of class
probabilities for any given instance is a one-hot vector). Let's write a small
function to convert the vector of class indices into a matrix containing a one-hot
vector for each instance. To understand this code, you need to know that
np.diag(np.ones(n)) creates an n×n matrix full of 0s except for 1s on the
main diagonal. Moreover, if a is a NumPy array, then a[[1, 3, 2]] returns an
array with 3 rows equal to a[1] , a[3] and a[2] (this is advanced NumPy
indexing).
In [68]: to_one_hot(y_train[:10])
Looks good, so let's create the target class probabilities matrix for the training
set and the test set:
Now let's scale the inputs. We compute the mean and standard deviation of each
feature on the training set (except for the bias feature), then we center and scale
each feature in the training set, the validation set, and the test set:
Now let's implement the Softmax function. Recall that it is defined by the
following equation:
exp(sk (x))
σ(s(x)) =
k
K
∑ exp(sj (x))
j=1
We are almost ready to start training. Let's define the number of inputs and
outputs:
In [72]: n_inputs = X_train.shape[1] # == 3 (2 features plus the bias term)
n_outputs = len(np.unique(y_train)) # == 3 (there are 3 iris classes)
Now here comes the hardest part: training! Theoretically, it's simple: it's just a
matter of translating the math equations into Python code. But in practice, it can
be quite tricky: in particular, it's easy to mix up the order of the terms, or the
indices. You can even end up with code that looks like it's working but is actually
not computing exactly the right thing. When unsure, you should write down the
shape of each term in the equation and make sure the corresponding terms in
your code match closely. It can also help to evaluate each term independently
and print them out. The good news it that you won't have to do this everyday,
since all this is well implemented by Scikit-Learn, but it will help you understand
what's going on under the hood.
$J(\mathbf{\Theta}) =
\dfrac{1}{m}\sum\limits_{i=1}^{m}\sum\limits_{k=1}^{K}
{y_k^{(i)}\log\left(\hat{p}_k^{(i)}\right)}$
m
1 (i) (i) (i)
∇ ( k) J (Θ) = ^
∑ (p − y )x
θ k k
m
i=1
(i) (i)
Note that ^
log(p
k
) may not be computable if ^
p
k
= 0. So we will add a tiny
(i)
value ϵ to ^
log(p
k
) to avoid getting nan values.
np.random.seed(42)
Theta = np.random.randn(n_inputs, n_outputs)
And that's it! The Softmax model is trained. Let's look at the model parameters:
In [74]: Theta
Let's make predictions for the validation set and check the accuracy score:
Out[75]: 0.9333333333333333
Well, this model looks pretty ok. For the sake of the exercise, let's add a bit of ℓ2
regularization. The following training code is similar to the one above, but the
loss now has an additional ℓ2 penalty, and the gradients have the proper
additional term (note that we don't regularize the first element of Theta since
this corresponds to the bias term). Also, let's try increasing the learning rate
eta .
np.random.seed(42)
Theta = np.random.randn(n_inputs, n_outputs)
0 3.7372
1000 0.3259
2000 0.3259
3000 0.3259
4000 0.3259
5000 0.3259
Because of the additional ℓ2 penalty, the loss seems greater than earlier, but
perhaps this model will perform better? Let's find out:
Out[77]: 0.9333333333333333
In this case, the ℓ2 penalty did not change the test accuracy. Perhaps try fine-
tuning alpha ?
Now let's add early stopping. For this we just need to measure the loss on the
validation set at every iteration and stop when the error starts growing.
np.random.seed(42)
Theta = np.random.randn(n_inputs, n_outputs)
0 3.7372
281 0.3256
282 0.3256 early stopping!
Out[79]: 0.9333333333333333
Now let's plot the model's predictions on the whole dataset (remember to scale
all features fed to the model):
plt.figure(figsize=(10, 4))
plt.plot(X[y == 2, 0], X[y == 2, 1], "g^", label="Iris virginica")
plt.plot(X[y == 1, 0], X[y == 1, 1], "bs", label="Iris versicolor")
plt.plot(X[y == 0, 0], X[y == 0, 1], "yo", label="Iris setosa")
Out[81]: 0.9666666666666667
Well we get even better performance on the test set. This variability is likely due
to the very small size of the dataset: depending on how you sample the training
set, validation set and the test set, you can get quite different results. Try
changing the random seed and running the code again a few times, you will see
that the results will vary.
In [ ]: