D2L CH3 Part1
Chapter 3: Linear Neural Networks
Linear Regression
Contents
• Linear Regression
• Linear Regression Implementation from Scratch
• Concise Implementation of Linear Regression
• Softmax Regression
• The Image Classification Dataset
• Softmax Regression Implementation from Scratch
• Concise Implementation of Softmax Regression
Linear Neural Networks
• Before we get into the details of deep neural networks, we need to cover the basics of neural network training.
Linear Regression
• Regression refers to a set of methods for modeling the relationship between one or more independent variables and
a dependent variable.
o The purpose of regression is most often to characterize the relationship between the inputs and outputs.
o Machine learning, on the other hand, is most often concerned with prediction.
https://elsaghirscience.weebly.com/predicting.html
https://www.sfehrlich.com/blog/stans-world-its-prediction-time-or-it
Linear Regression
• Regression problems pop up whenever we want to predict a numerical value.
o Predicting prices (of homes, stocks, etc.)
o Predicting length of stay (for patients in the hospital)
o Demand forecasting (for retail sales)
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
Basic Elements of Linear Regression
• Linear regression flows from a few simple assumptions:
o The relationship between the independent variables $\mathbf{x}$ and the dependent variable $y$ is linear, i.e., $y$ can be expressed as a weighted sum of the elements in $\mathbf{x}$, given some noise on the observations.
o Assume that any noise is well-behaved (following a Gaussian distribution).
• To develop a model for predicting house prices, we would need to get a dataset consisting of sales for which we know
the sale price, area, and age for each home.
o The dataset is called a training dataset or training set.
o Each row (here the data corresponding to one sale) is called an example (or data point, data instance, sample).
o The thing we are trying to predict (price) is called a label (or target).
o The independent variables (age and area) upon which the predictions are based are called
features (or covariates).
• We will use $n$ to denote the number of examples in our dataset. We index the data examples by $i$, denoting each input as $\mathbf{x}^{(i)}$ and the corresponding label as $y^{(i)}$.
Basic Elements of Linear Regression
• The linearity assumption says that the target (price) can be expressed as a weighted sum of the features (area and age):
$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b \tag{3.1.1}$$
$w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights, and $b$ is called a bias (also called an offset or intercept).
• The bias just says what value the predicted price should take when all of the features take value 0.
• (3.1.1) is an affine transformation of input features, which is characterized by a linear transformation of features via weighted sum, combined with a translation via the added bias.
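As a quick numeric illustration (the parameter values below are made up for this example, not estimates from any dataset): with $w_{\mathrm{area}} = 100$, $w_{\mathrm{age}} = -500$, and $b = 50{,}000$, a house with area 120 and age 10 would be priced at
$$100 \cdot 120 + (-500) \cdot 10 + 50{,}000 = 57{,}000.$$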
Basic Elements of Linear Regression
Linear Model | Loss Function | Analytic Solution | Minibatch Stochastic Gradient Descent | Making Predictions with the Learned Model
• Given a dataset, our goal is to choose the weights and the bias such that on average, the predictions made
according to our model best fit the true prices observed in the data.
• Models whose output prediction is determined by the affine transformation of input features are linear models.
o The affine transformation is specified by the chosen weights and bias.
• When our inputs consist of $d$ features, we express our prediction $\hat{y}$ (the "hat" symbol denotes estimates) as
$$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b \tag{3.1.2}$$
Basic Elements of Linear Regression
• Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express our model using a dot product:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b \tag{3.1.3}$$
• We refer to the features of our entire dataset of $n$ examples via the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$.
o Here, $\mathbf{X}$ contains one row for every example and one column for every feature.
• For a collection of features $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the matrix-vector product:
$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b \tag{3.1.4}$$
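A minimal sketch of (3.1.3) and (3.1.4) in code (plain NumPy here; the chapter's MXNet np module exposes the same interface, and the shapes and values are illustrative assumptions):
import numpy as np

n, d = 4, 2                  # 4 examples, 2 features (e.g., area and age)
X = np.random.rand(n, d)     # design matrix: one row per example, one column per feature
w = np.array([2.0, -1.0])    # weight vector
b = 0.5                      # bias

y_hat_one = X[0] @ w + b     # dot product for a single example, as in (3.1.3)
y_hat_all = X @ w + b        # matrix-vector product for all examples, as in (3.1.4)
print(y_hat_all.shape)       # (4,)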
• Given features of a training dataset $\mathbf{X}$ and corresponding (known) labels $\mathbf{y}$, the goal of linear regression is to find the weight vector $\mathbf{w}$ and the bias term $b$ such that, given features of a new data example sampled from the same distribution as $\mathbf{X}$, the new example's label will (in expectation) be predicted with the lowest error.
• We would not expect to find a real-world dataset of $n$ examples where $y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)} + b$ for all $1 \le i \le n$.
o Thus, even when we are confident that the underlying relationship is linear, we will incorporate a noise term to account for such errors.
• Before searching for the best parameters (or model parameters) $\mathbf{w}$ and $b$, we will need two more things:
1. A quality measure for some given model.
2. A procedure for updating the model to improve its quality.
Basic Elements of Linear Regression
• To think about how to fit data with our model, we need to determine a measure of fitness.
• The loss function quantifies the distance between the real and predicted value of the target.
o The loss will be a non-negative number where smaller values are better.
o Perfect predictions incur a loss of 0.
• The most popular loss function in regression problems is the squared error:
$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 \tag{3.1.5}$$
$\hat{y}^{(i)}$ is the predicted label and $y^{(i)}$ is the corresponding true label for the $i$-th example.
• The constant $\frac{1}{2}$ makes no difference but will prove notationally convenient, canceling out when we take the derivative of the loss.
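In code, the per-example squared error of (3.1.5) is a one-liner (a minimal sketch; the function name is ours):
def squared_loss(y_hat, y):
    """Squared error of (3.1.5); the factor 1/2 cancels when we differentiate."""
    return 0.5 * (y_hat - y) ** 2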
Basic Elements of Linear Regression
• Consider the example in Fig. 3.1.1, where we plot a regression problem for a one-dimensional case.
• Note that large differences between estimates and observations lead to even larger contributions to the loss,
due to the quadratic dependence.
Basic Elements of Linear Regression
• To measure the quality of a model on the entire dataset of $n$ examples, we average (or equivalently, sum) the losses on the training set:
$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2 \tag{3.1.6}$$
• When training the model, we want to find parameters $(\mathbf{w}^*, b^*)$ that minimize the total loss across all training examples:
$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b} \; L(\mathbf{w}, b) \tag{3.1.7}$$
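A sketch of the training objective (3.1.6) with plain NumPy and made-up data (all names and values here are ours, for illustration only):
import numpy as np

def total_loss(w, b, X, y):
    """Average squared error L(w, b) over the whole dataset, as in (3.1.6)."""
    residual = X @ w + b - y
    return float(np.mean(0.5 * residual ** 2))

# Sanity check: with the parameters that generated the (noise-free) data, the loss is 0.
X = np.random.rand(100, 2)
w_true, b_true = np.array([2.0, -3.4]), 4.2
y = X @ w_true + b_true
print(total_loss(w_true, b_true, X, y))   # 0.0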
Basic Elements of Linear Regression
• Linear regression has an analytic (closed-form) solution: folding the bias $b$ into $\mathbf{w}$ by appending a column of all ones to $\mathbf{X}$, the minimizer of $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$ is $\mathbf{w}^* = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}$ (3.1.8).
• However, the requirement of an analytic solution is so restrictive that it would exclude all of deep learning.
o Simple problems like linear regression may admit analytic solutions, but you should not get used to such good fortune.
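As a sketch of that closed form (not on the original slide; plain NumPy, synthetic data, and the standard trick of appending a constant-1 column so the bias is estimated together with the weights):
import numpy as np

X = np.random.rand(100, 2)
w_true, b_true = np.array([2.0, -3.4]), 4.2
y = X @ w_true + b_true + 0.01 * np.random.randn(100)   # noisy labels

X1 = np.hstack([X, np.ones((100, 1))])          # append a column of 1s for the bias
w_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)  # solves min_w ||X1 w - y||^2
print(w_hat)                                    # approximately [2.0, -3.4, 4.2]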
Basic Elements of Linear Regression
• In cases where we cannot solve the models analytically, it turns out that we can still train models effectively in practice.
• The key technique for optimizing nearly any deep learning model is called gradient descent.
o Gradient descent iteratively reduces the error by updating the parameters in the direction that incrementally
lowers the loss function.
[Figure: gradient descent illustrated on a loss curve. The loss (vertical axis) is plotted against the value of the weight (horizontal axis); from the starting point, updates move toward the point of convergence, where the loss function is at its minimum.]
Basic Elements of Linear Regression
• The most naive application of gradient descent consists of taking the derivative of the loss function, which is an
average of the losses computed on every single example in the dataset.
o This is extremely slow: we must pass over the entire dataset before making a single update.
o Thus, we will sample a random minibatch of examples every time we need to compute the update; this variant is called minibatch stochastic gradient descent.
• In each iteration:
1. We first randomly sample a minibatch B consisting of a fixed number of training examples.
2. We then compute the derivative (gradient) of the average loss on the minibatch with regard to the model
parameters.
3. Finally, we multiply the gradient by a predetermined positive value η and subtract the resulting term from
the current parameter values.
Basic Elements of Linear Regression
• We can express the update mathematically as follows ($\partial$ denotes the partial derivative):
$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b) \tag{3.1.9}$$
• For quadratic losses and affine transformations, we can write this out explicitly as follows:
$$\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right), \qquad b \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right) \tag{3.1.10}$$
• The values of the batch size $|\mathcal{B}|$ and the learning rate $\eta$ are manually pre-specified and not typically learned through model training.
o These parameters that are tunable but not updated in the training loop are called hyperparameters.
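Putting the three steps and the update rule (3.1.10) together, a minimal minibatch SGD loop for linear regression might look like the sketch below (plain NumPy; the synthetic data and hyperparameter values are assumptions chosen only for illustration):
import numpy as np

# Synthetic data: y = Xw + b + small noise
n = 1000
X = np.random.randn(n, 2)
w_true, b_true = np.array([2.0, -3.4]), 4.2
y = X @ w_true + b_true + 0.01 * np.random.randn(n)

w, b = np.zeros(2), 0.0                    # initial parameters
eta, batch_size, num_epochs = 0.03, 10, 3  # hyperparameters (pre-specified, not learned)

for epoch in range(num_epochs):
    idx = np.random.permutation(n)         # step 1: sample random minibatches
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        err = Xb @ w + b - yb              # residuals on the minibatch
        grad_w = Xb.T @ err / batch_size   # step 2: gradient of the average loss
        grad_b = err.mean()
        w -= eta * grad_w                  # step 3: move against the gradient,
        b -= eta * grad_b                  #         scaled by the learning rate
print(w, b)                                # should approach [2.0, -3.4] and 4.2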
Basic Elements of Linear Regression
• Hyperparameter tuning is the process by which hyperparameters are chosen, and typically requires that we adjust
them based on the results of the training loop as assessed on a separate validation dataset.
• After training for some predetermined number of iterations (or until some other stopping criteria are met), we record the estimated model parameters, denoted $\hat{\mathbf{w}}, \hat{b}$.
o Even if our function is truly linear and noiseless, these parameters will not be the exact minimizers of the loss because, although the algorithm converges slowly towards the minimizers, it cannot achieve them exactly in a finite number of steps.
Basic Elements of Linear Regression
• Linear regression happens to be a learning problem where there is only one minimum over the entire domain.
o For more complicated models, like deep networks, the loss surfaces contain many minima.
• The more formidable task is to find parameters that will achieve low loss
on data that we have not seen before.
o This challenge is called generalization.
• Given the learned linear regression model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$, we can estimate the price of a new house given its area $x_1$ and age $x_2$.
o Estimating targets given features is commonly called prediction or inference.
https://www.mltut.com/stochastic-gradient-descent-a-super-easy-complete-guide/
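Once $\hat{\mathbf{w}}$ and $\hat{b}$ are in hand, prediction is just another forward pass; a tiny sketch (the parameter values are placeholders, not actual estimates):
import numpy as np

w_hat = np.array([100.0, -500.0])   # hypothetical learned weights for (area, age)
b_hat = 50000.0                     # hypothetical learned bias

x_new = np.array([150.0, 5.0])      # a new house: area = 150, age = 5
print(x_new @ w_hat + b_hat)        # 62500.0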
Vectorization for Speed
• When training our models, we typically want to process whole minibatches of examples simultaneously.
o Doing this efficiently requires that we vectorize the calculations and leverage fast linear algebra libraries.
%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np
import time
n = 10000
a = np.ones(n)  # two all-ones vectors with 10,000 entries each
b = np.ones(n)
Vectorization for Speed
• Let us define a timer.
class Timer:  #@save
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of time."""
        return sum(self.times)

    def cumsum(self):
        """Return the accumulated time."""
        return np.array(self.times).cumsum().tolist()
Vectorization for Speed
• Now we can benchmark the workloads.
o First, we add them, one coordinate at a time, using a for-loop.
c = np.zeros(n)
timer = Timer()
for i in range(n):
    c[i] = a[i] + b[i]
f'{timer.stop():.5f} sec'
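For comparison, a sketch of the vectorized alternative (reusing the arrays a, b and the timer defined above): a single call into the optimized linear algebra backend replaces the Python loop and runs orders of magnitude faster.
timer.start()
d = a + b
f'{timer.stop():.5f} sec'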
The Normal Distribution and Squared Loss
• There is a strong connection between the normal distribution (Gaussian) and linear regression.
• The probability density of a normal distribution with mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$) is given as:
$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right) \tag{3.1.11}$$
• One way to motivate linear regression with the squared loss is to assume that observations arise from noisy measurements, where the noise is normally distributed:
$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2) \tag{3.1.12}$$
• Thus, we can write out the likelihood of seeing a particular $y$ for a given $\mathbf{x}$ via
$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} \left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2\right) \tag{3.1.13}$$
• Now, according to the principle of maximum likelihood, the best values of parameters $\mathbf{w}$ and $b$ are those that maximize the likelihood of the entire dataset:
$$P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} p\left(y^{(i)} \mid \mathbf{x}^{(i)}\right) \tag{3.1.14}$$
• Estimators chosen according to the principle of maximum likelihood are called maximum likelihood estimators.
The Normal Distribution and Squared Loss
• Maximizing the product of many exponential functions is difficult.
o So, we maximize the log of the likelihood instead.
o For historical reasons, optimizations are more often expressed as minimization rather than maximization, so we minimize the negative log-likelihood $-\log P(\mathbf{y} \mid \mathbf{X})$:
$$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^{n} \left( \frac{1}{2} \log\left(2 \pi \sigma^2\right) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2 \right) \tag{3.1.15}$$
• Minimizing the mean squared error is equivalent to maximum likelihood estimation of a linear model under the
assumption of additive Gaussian noise.
o The solution does not depend on $\sigma$.
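To make the equivalence explicit (a restatement of (3.1.15), not new material): the additive constant $\frac{n}{2}\log(2\pi\sigma^2)$ and the positive factor $\frac{1}{\sigma^2}$ do not change where the minimum lies, so
$$\operatorname*{argmin}_{\mathbf{w}, b} \; \left(-\log P(\mathbf{y} \mid \mathbf{X})\right) = \operatorname*{argmin}_{\mathbf{w}, b} \; \sum_{i=1}^{n} \frac{1}{2}\left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2,$$
which is exactly the squared-error objective of (3.1.6) up to scaling by $\frac{1}{n}$.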
From Linear Regression to Deep Networks
• We can think of the linear model as a neural network by expressing it in the language of neural networks.
o Neural networks cover a much richer family of models.
• We depict our linear regression model as a neural network.
o These diagrams highlight the connectivity pattern such as how each input is connected to the output,
but not the values taken by the weights or biases.
https://medium.com/@abhismatrix/neural-network-model-training-tricks-61254a2a1f6b
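As a concrete hint of this view (a minimal sketch assuming the MXNet Gluon API that the concise implementation later in this chapter relies on): a fully connected layer with a single output unit computes exactly the affine model of (3.1.3), with the layer's weight vector and bias playing the roles of $\mathbf{w}$ and $b$.
from mxnet.gluon import nn

net = nn.Dense(1)   # one output unit: a weighted sum of the inputs plus a bias
net.initialize()    # creates and initializes the parameters w and b (shapes inferred lazily)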
Summary
• Key ingredients in a machine learning model are training data, a loss function, an optimization algorithm,
and quite obviously, the model itself.
• Vectorizing makes everything better (mostly math) and faster (mostly code).
• Minimizing an objective function and performing maximum likelihood estimation can mean the same thing.