
Dive into Deep Learning

Chapter 3:
Linear Neural
Networks
Linear Regression
Contents

• Linear Regression: Basic Elements of Linear Regression, Vectorization for Speed, The Normal Distribution and Squared Loss, From Linear Regression to Deep Networks
• Linear Regression Implementation from Scratch: Generating the Dataset, Reading the Dataset, Initializing Model Parameters, Defining the Model, Defining the Loss Function, Defining the Optimization Algorithm, Training
• Concise Implementation of Linear Regression: Generating the Dataset, Reading the Dataset, Defining the Model, Initializing Model Parameters, Defining the Loss Function, Defining the Optimization Algorithm, Training
• Softmax Regression: Classification Problem, Network Architecture, Softmax Operation, Vectorization for Minibatches, Loss Function, Information Theory Basics, Model Prediction and Evaluation
• The Image Classification Dataset: Reading the Dataset, Reading a Minibatch, Putting All Things Together
• Softmax Regression Implementation from Scratch: Initializing Model Parameters, Defining the Softmax Operation, Defining the Model, Defining the Loss Function, Classification Accuracy, Training, Prediction
• Concise Implementation of Softmax Regression: Initializing Model Parameters, Softmax Implementation Revisited, Optimization Algorithm, Training
Linear Neural Networks
• Before we get into the details of deep neural networks, we need to cover the basics of neural network training.

• In this chapter, we will cover the entire training process including:


o Defining simple neural network architectures
o Handling data
o Specifying a loss function
o Training the model

• Classic statistical learning techniques such as linear regression and softmax regression can be cast as linear neural networks.
Linear Regression
• Regression refers to a set of methods for modeling the relationship between one or more independent variables and
a dependent variable.
o The purpose of regression is most often to characterize the relationship between the inputs and outputs.
o Machine learning, on the other hand, is most often concerned with prediction.

https://elsaghirscience.weebly.com/predicting.html

https://www.sfehrlich.com/blog/stans-world-its-prediction-time-or-it
Linear Regression
• Regression problems pop up whenever we want to predict a numerical value.
o Predicting prices (of homes, stocks, etc.)
o Predicting length of stay (for patients in the hospital)
o Demand forecasting (for retail sales)

• Not every prediction problem is a classic regression problem.

• In classification problems, the goal is to predict membership among a set of categories.

https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
Basic Elements of Linear Regression
• Linear regression flows from a few simple assumptions:
o The relationship between the independent variables $\mathbf{x}$ and the dependent variable $y$ is linear,
i.e., that $y$ can be expressed as a weighted sum of the elements in $\mathbf{x}$, given some noise on the observations.
o Assume that any noise is well-behaved (following a Gaussian distribution).

• To develop a model for predicting house prices, we would need to get a dataset consisting of sales for which we know
the sale price, area, and age for each home.
o The dataset is called a training dataset or training set.
o Each row (here the data corresponding to one sale) is called an example (or data point, data instance, sample).
o The thing we are trying to predict (price) is called a label (or target).
o The independent variables (age and area) upon which the predictions are based are called
features (or covariates).

• We will use $n$ to denote the number of examples in our dataset. We index the data examples by $i$, denoting each input
as $\mathbf{x}^{(i)}$ and the corresponding label as $y^{(i)}$.
Basic Elements of Linear Regression

• Linear Model
• Loss Function
• Analytic Solution
• Minibatch Stochastic Gradient Descent
• Making Predictions with the Learned Model
Basic Elements of Linear Regression

• The linearity assumption says that the target (price) can be expressed as a weighted sum of the features (area and age):

$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$    (3.1.1)

$w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights, and $b$ is called a bias (also called an offset or intercept).

• The weights determine the influence of each feature on our prediction.

• The bias just says what value the predicted price should take when all of the features take value 0.

• (3.1.1) is an affine transformation of input features, which is characterized by a linear transformation of features
via weighted sum, combined with a translation via the added bias.
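• As a small sketch of (3.1.1) (plain Python/NumPy; the weights and bias below are made-up illustrative values, not values from the text), the prediction is just a weighted sum of the features plus the bias:

import numpy as np

# Hypothetical parameters for the house-price example (illustrative values only)
w_area, w_age, b = 90.0, -10.0, 50.0

area, age = 120.0, 5.0                      # features of one house
price = w_area * area + w_age * age + b     # weighted sum of features plus bias, as in (3.1.1)
print(price)                                # 10800.0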
Basic Elements of Linear Regression

• Given a dataset, our goal is to choose the weights and the bias such that on average, the predictions made
according to our model best fit the true prices observed in the data.

• Models whose output prediction is determined by the affine transformation of input features are linear models
o The affine transformation is specified by the chosen weights and bias.

• In machine learning, we usually work with high-dimensional datasets.

• When our inputs consist of $d$ features, we express our prediction $\hat{y}$ (the “hat” symbol denotes estimates) as

$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b$    (3.1.2)
Basic Elements of Linear Regression

• Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express our model using
a dot product:

$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$    (3.1.3)

The vector $\mathbf{x}$ corresponds to the features of a single data example.

• We refer to the features of our entire dataset of $n$ examples via the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$.
o Here, $\mathbf{X}$ contains one row for every example and one column for every feature.
• For a collection of features $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the matrix-vector product:

$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$    (3.1.4)

where broadcasting (see Section 2.1.3) is applied during the summation.
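• A minimal NumPy sketch of (3.1.3) and (3.1.4) (synthetic, illustrative numbers): each row of the design matrix holds one example, and the scalar bias is broadcast across the resulting vector of predictions.

import numpy as np

X = np.array([[120.0,  5.0],    # design matrix: each row is (area, age) for one example
              [ 80.0, 30.0],
              [200.0,  2.0]])
w = np.array([90.0, -10.0])     # weight vector (illustrative values)
b = 50.0                        # bias, broadcast across all examples

y_hat_single = w @ X[0] + b     # dot product for a single example, as in (3.1.3)
y_hat_all = X @ w + b           # matrix-vector product for the whole dataset, as in (3.1.4)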


Basic Elements of Linear Regression

• Given features of a training dataset $\mathbf{X}$ and corresponding (known) labels $\mathbf{y}$, the goal of linear regression is to find the
weight vector $\mathbf{w}$ and the bias term $b$ such that, given the features of a new data example sampled from the same distribution
as $\mathbf{X}$, the new example’s label will (in expectation) be predicted with the lowest error.

• We would not expect to find a real-world dataset of $n$ examples where $y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)} + b$ for all $1 \le i \le n$.
o Thus, even when we are confident that the underlying relationship is linear, we will incorporate a noise term to
account for such errors.

• Before searching for the best parameters (or model parameters) $\mathbf{w}$ and $b$, we will need two more things:
1. A quality measure for some given model.
2. A procedure for updating the model to improve its quality.
Basic Elements of Linear Regression

• To think about how to fit data with our model, we need to determine a measure of fitness.

• The loss function quantifies the distance between the real and predicted value of the target.
o The loss will be a non-negative number where smaller values are better.
o Perfect predictions incur a loss of 0.

• The most popular loss function in regression problems is the squared error:

$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$    (3.1.5)

$\hat{y}^{(i)}$ is the predicted label and $y^{(i)}$ is the corresponding true label for the $i$-th example.
• The constant $\frac{1}{2}$ makes no real difference but will prove notationally convenient, canceling out when we take the
derivative of the loss.
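• As a minimal sketch (plain NumPy; the function name is ours, not from the text), the squared loss of (3.1.5) is a one-liner; keeping the factor 1/2 means its derivative carries no extra constant.

import numpy as np

def squared_loss(y_hat, y):
    """Squared error per example, as in (3.1.5)."""
    return 0.5 * (y_hat - y) ** 2

# Example: the loss is 0 for a perfect prediction and grows quadratically with the error
print(squared_loss(np.array([3.0, 5.0]), np.array([3.0, 4.0])))   # [0.  0.5]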
Basic Elements of Linear Regression

• The empirical error is only a function of the model parameters.

• Consider the example below where we plot a regression problem for a one-dimensional case as shown in Fig. 3.1.1.

Fig. 3.1.1 Fit data with a linear model.

• Note that large differences between estimates and observations lead to even larger contributions to the loss,
due to the quadratic dependence.
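• As a quick worked example of this quadratic dependence (illustrative numbers): an error of $\hat{y}^{(i)} - y^{(i)} = 1$ contributes $\frac{1}{2} \cdot 1^2 = 0.5$ to the loss, whereas an error of $10$ contributes $\frac{1}{2} \cdot 10^2 = 50$; a 10-fold larger error yields a 100-fold larger contribution.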
Basic Elements of Linear Regression

• To measure the quality of a model on the entire dataset of $n$ examples, we average (or equivalently, sum) the
losses on the training set:

$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2$    (3.1.6)

• When training the model, we want to find parameters $(\mathbf{w}^*, b^*)$ that minimize the total loss across all
training examples:

$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b} \; L(\mathbf{w}, b)$    (3.1.7)
Basic Elements of Linear Regression

• Linear regression can be solved analytically by applying a simple formula:


o Subsume the bias $b$ into the parameter $\mathbf{w}$ by appending a column to the design matrix consisting of all ones.
o Then our prediction problem is to minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$.
o There is just one critical point on the loss surface, and it corresponds to the minimum of the loss over the entire domain.
o Taking the derivative of the loss with respect to $\mathbf{w}$ and setting it equal to zero yields the analytic
(closed-form) solution:

$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}$    (3.1.8)

• The requirement of an analytic solution is so restrictive that it would exclude all of deep learning.
o Simple problems like linear regression may admit analytic solutions, but
you should not get used to such good fortune.
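• A minimal sketch of the closed-form solution (3.1.8) in plain NumPy (synthetic data and illustrative parameter values): append a column of ones to absorb the bias, then solve the normal equations.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 2
X = rng.normal(size=(n, d))
X = np.hstack([X, np.ones((n, 1))])            # column of ones subsumes the bias
w_true = np.array([2.0, -3.4, 4.2])            # last entry plays the role of the bias
y = X @ w_true + 0.01 * rng.normal(size=n)     # labels with a little Gaussian noise

# Closed-form solution: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)                                  # close to w_true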
Basic Elements of Linear Regression

• In cases where we cannot solve the models analytically, it turns out that we can still train models effectively in practice.

• The key technique for optimizing nearly any deep learning model is called gradient descent.
o Gradient descent iteratively reduces the error by updating the parameters in the direction that incrementally
lowers the loss function.

[Figure: gradient descent on a loss curve, stepping from a starting point down to the point of convergence where the loss function is at its minimum; horizontal axis: value of weight, vertical axis: loss.]
Basic Elements of Linear Regression

• The most naive application of gradient descent consists of taking the derivative of the loss function, which is an
average of the losses computed on every single example in the dataset.
o This is extremely slow: we must pass over the entire dataset before making a single update.
o Thus, we will sample a random minibatch of examples every time we need to compute the update;
this variant is called minibatch stochastic gradient descent.

• In each iteration:
1. We first randomly sample a minibatch B consisting of a fixed number of training examples.
2. We then compute the derivative (gradient) of the average loss on the minibatch with regard to the model
parameters.
3. Finally, we multiply the gradient by a predetermined positive value η and subtract the resulting term from
the current parameter values.
Basic Elements of Linear Regression

• We can express the update mathematically as follows ($\partial$ denotes the partial derivative):

$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)}\, l^{(i)}(\mathbf{w}, b)$    (3.1.9)

$\mathbf{w}$ is the weight vector,
$b$ is the bias,
$\eta$ is a predetermined positive value (the learning rate),
and the term $\partial_{(\mathbf{w}, b)}\, l^{(i)}(\mathbf{w}, b)$ denotes the partial derivative of the loss of the $i$-th example with respect to the parameters.

• To summarize, the steps of the algorithm are the following (a minimal code sketch is given below):
1. Randomly initialize the values of the model parameters.
2. Iteratively sample random minibatches from the data.
3. Update the parameters in the direction of the negative gradient.
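• A minimal sketch of these steps for linear regression with the squared loss, implementing the update rule (3.1.9) (plain NumPy; the synthetic data, learning rate, and batch size are illustrative choices, not values from the text):

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -3.4]) + 4.2 + 0.01 * rng.normal(size=n)

w, b = np.zeros(d), 0.0          # step 1: initialize the parameters (zeros here for simplicity)
lr, batch_size = 0.03, 10        # hyperparameters: learning rate eta and batch size |B|

for epoch in range(3):
    idx = rng.permutation(n)
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]     # step 2: sample a random minibatch B
        Xb, yb = X[batch], y[batch]
        err = Xb @ w + b - yb                     # prediction error on the minibatch
        w -= lr * (Xb.T @ err) / batch_size       # step 3: move against the gradient
        b -= lr * err.sum() / batch_size

print(w, b)                                       # approaches (2.0, -3.4) and 4.2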
Basic Elements of Linear Regression

• For quadratic losses and affine transformations, we can write this out explicitly as follows:

$\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \mathbf{x}^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right), \qquad b \leftarrow b - \frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)$    (3.1.10)

Note that $\mathbf{w}$ and $\mathbf{x}^{(i)}$ are vectors.
The set cardinality $|\mathcal{B}|$ represents the number of examples in each minibatch (the batch size).
$\eta$ denotes the learning rate.

• The values of the batch size and learning rate are manually pre-specified and not typically learned through
model training.
o These parameters that are tunable but not updated in the training loop are called hyperparameters.
Basic Elements of Linear Regression

• Hyperparameter tuning is the process by which hyperparameters are chosen, and typically requires that we adjust
them based on the results of the training loop as assessed on a separate validation dataset.

• After training for some predetermined number of iterations (or until some other stopping criterion is met),
we record the estimated model parameters, denoted $\hat{\mathbf{w}}, \hat{b}$.
o Even if our function is truly linear and noiseless, these parameters will not be the exact minimizers of the loss because,
although the algorithm converges slowly towards the minimizers, it cannot achieve them exactly in a finite number
of steps.
Basic Elements of Linear Regression

• Linear regression happens to be a learning problem where there is only one minimum over the entire domain.
o For more complicated models, like deep networks, the loss surfaces contain many minima.

• Deep learning practitioners seldom struggle to find parameters that minimize the loss on training sets.

• The more formidable task is to find parameters that will achieve low loss
on data that we have not seen before.
o A challenge called generalization.
• Given the learned linear regression model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$, we can estimate
the price of a new house given its area $x_1$ and age $x_2$.
o Estimating targets given features is commonly called prediction or inference.

https://www.mltut.com/stochastic-gradient-descent-a-super-easy-complete-guide/
Vectorization for Speed
• When training our models, we typically want to process whole minibatches of examples simultaneously.
o Doing this efficiently requires that we vectorize the calculations and leverage fast linear algebra libraries.

%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np
import time

• We consider two methods for adding vectors.


o To start we instantiate two 10000-dimensional vectors containing all ones.
o In one method we will loop over the vectors with a Python for-loop.
o In the other method we will rely on a single call to +.

n = 10000
a = np.ones(n)
b = np.ones(n)
Vectorization for Speed
• Let us define a timer.
class Timer:  #@save
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of time."""
        return sum(self.times)

    def cumsum(self):
        """Return the accumulated time."""
        return np.array(self.times).cumsum().tolist()
Vectorization for Speed
• Now we can benchmark the workloads.
o First, we add them, one coordinate at a time, using a for-loop.
c = np.zeros(n)
timer = Timer()
for i in range(n):
    c[i] = a[i] + b[i]
f'{timer.stop():.5f} sec'

o Alternatively, we rely on the overloaded + operator to compute the elementwise sum.


timer.start()
d = a + b
f'{timer.stop():0.5f} sec'

• The second method is dramatically faster than the first.


o Vectorizing code often yields order-of-magnitude speedups.
The Normal Distribution and Squared Loss
• There is a strong connection between the normal distribution (Gaussian) and linear regression.

• The probability density of a normal distribution with mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$) is given as:

$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$    (3.1.11)

o A Python function to compute the normal distribution:

def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma**2 * (x - mu)**2)

o Visualize the normal distributions:


# Use numpy again for visualization
x = np.arange(-7, 7, 0.01)

# Mean and standard deviation pairs
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params],
         xlabel='x', ylabel='p(x)', figsize=(4.5, 2.5),
         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])
The Normal Distribution and Squared Loss
• To motivate linear regression with the squared loss function, assume that observations arise from noisy measurements,
where the noise $\epsilon$ is normally distributed as follows:

$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2)$    (3.1.12)

• Thus, we write out the likelihood of seeing a particular $y$ for a given $\mathbf{x}$ via

$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2\right)$    (3.1.13)

• Now, according to the principle of maximum likelihood, the best values of parameters w and b are those that
maximize the likelihood of the entire dataset:

$P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} p\left(y^{(i)} \mid \mathbf{x}^{(i)}\right)$    (3.1.14)

• Estimators chosen according to the principle of maximum likelihood are called maximum likelihood estimators.
The Normal Distribution and Squared Loss
• Maximizing the product of many exponential functions is difficult.
o So, we maximize the log of the likelihood instead.
o For historical reasons, optimizations are expressed as minimization rather than maximization.

• So, we minimize the negative log-likelihood $-\log P(\mathbf{y} \mid \mathbf{X})$:

$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^{n} \left[ \frac{1}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2 \right]$    (3.1.15)

o We assume that $\sigma$ is some fixed constant.
o Thus we can ignore the first term, because it does not depend on $\mathbf{w}$ or $b$.
o Now the second term is identical to the squared error loss introduced earlier,
except for the multiplicative constant $\frac{1}{\sigma^2}$.

• Minimizing the mean squared error is equivalent to maximum likelihood estimation of a linear model under the
assumption of additive Gaussian noise.
o The solution does not depend on $\sigma$.
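• A small numerical sanity check of this equivalence (plain NumPy; one-dimensional synthetic data and a fixed σ, all illustrative): the negative log-likelihood equals a constant plus the sum of squared errors scaled by 1/(2σ²), so both objectives are minimized by the same parameters.

import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + sigma * rng.normal(size=200)   # noisy linear data

def sum_squared_error(w, b):
    return np.sum((y - (w * x + b)) ** 2)

def neg_log_likelihood(w, b):
    n = len(x)
    # constant term + (1 / (2 sigma^2)) * sum of squared errors, as in (3.1.15)
    return n * 0.5 * np.log(2 * np.pi * sigma**2) + sum_squared_error(w, b) / (2 * sigma**2)

# The two objectives differ only by a fixed positive rescaling and an offset,
# so they rank candidate parameters identically
for w, b in [(2.0, 1.0), (1.5, 0.5), (0.0, 0.0)]:
    print(sum_squared_error(w, b), neg_log_likelihood(w, b))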
From Linear Regression to Deep Networks
• We think of the linear model as a neural network by expressing it in the language of neural networks.
o Neural networks cover a much richer family of models.
• We depict our linear regression model as a neural network.
o These diagrams highlight the connectivity pattern such as how each input is connected to the output,
but not the values taken by the weights or biases.

• For the neural network shown in Fig. 3.1.2,


o The inputs are $x_1, \ldots, x_d$, so the number of inputs
(or feature dimensionality) in the input layer is $d$.
o The output of the network is $o_1$, so the number of outputs is 1.
o The inputs are all given and there is just a single computed neuron.

• We do not consider the input layer when counting layers.


o The number of layers for the neural network in Fig. 3.1.2 is 1.
o We can think of linear regression models as neural networks consisting of just a single artificial neuron,
or as single-layer neural networks.
o Since for linear regression, every input is connected to every output, we can regard this transformation
(the output layer in Fig. 3.1.2) as a fully-connected layer or dense layer.
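• In the framework used by the code examples in these slides (MXNet/Gluon), such a fully-connected output layer can be written directly; a minimal sketch, assuming MXNet is installed as in the vectorization example:

from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()

net = nn.Dense(1)      # a single fully-connected (dense) layer with one output: linear regression
net.initialize()       # parameters are created lazily once the input dimensionality is known

X = np.ones((3, 4))    # a minibatch of 3 examples with 4 features each
y_hat = net(X)         # forward pass computes X w^T + b, one prediction per example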
From Linear Regression to Deep Networks
• Linear regression (invented in 1795) predates computational neuroscience.
o Warren McCulloch and Walter Pitts began to develop models of artificial neurons.

• Consider the cartoonish picture of a biological neuron in Fig. 3.1.3


o Dendrites = input terminals,
o Nucleus = CPU,
o Axon = output wire,
o Axon terminals = output terminals, enable connections to other neurons via synapses.
From Linear Regression to Deep Networks
• Information arriving from other neurons (or environmental sensors such as the retina) is received in the dendrites.
o That information is weighted by synaptic weights $w_i$ determining the effect of the inputs (e.g., activation or
inhibition via the product $x_i w_i$).
o The weighted inputs arriving from multiple sources are aggregated in the nucleus as a weighted sum
$y = \sum_i x_i w_i + b$.
o This information is then sent for further processing in the axon $y$, typically after some nonlinear processing via $\sigma(y)$.
o From there it either reaches its destination (e.g., a muscle) or is fed into another neuron via its dendrites.

https://medium.com/@abhismatrix/neural-network-model-training-tricks-61254a2a1f6b
Summary

• Key ingredients in a machine learning model are training data, a loss function, an optimization algorithm,
and quite obviously, the model itself.

• Vectorizing makes everything better (mostly math) and faster (mostly code).

• Minimizing an objective function and performing maximum likelihood estimation can mean the same thing.

• Linear regression models are neural networks, too.
