D2L CH3 Part1
Chapter 3: Linear Neural Networks
Linear Regression
Contents
• Linear Regression
• Linear Regression Implementation from Scratch
• Concise Implementation of Linear Regression
• Softmax Regression
• The Image Classification Dataset
• Softmax Regression Implementation from Scratch
• Concise Implementation of Softmax Regression
Linear Neural Networks
• Before we get into the details of deep neural networks, we need to cover the basics of neural network training.
Linear Regression
• Regression refers to a set of methods for modeling the relationship between one or more independent variables and
a dependent variable.
o The purpose of regression is most often to characterize the relationship between the inputs and outputs.
o Machine learning, on the other hand, is most often concerned with prediction.
https://elsaghirscience.weebly.com/predicting.html
https://www.sfehrlich.com/blog/stans-world-its-prediction-time-or-it
Linear Regression
• Regression problems pop up whenever we want to predict a numerical value.
o Predicting prices (of homes, stocks, etc.)
o Predicting length of stay (for patients in the hospital)
o Demand forecasting (for retail sales)
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
Basic Elements of Linear Regression
• Linear regression flows from a few simple assumptions:
o The relationship between the independent variables $\mathbf{x}$ and the dependent variable $y$ is linear, i.e., $y$ can be expressed as a weighted sum of the elements in $\mathbf{x}$, given some noise on the observations.
o Assume that any noise is well-behaved (following a Gaussian distribution).
• To develop a model for predicting house prices, we would need to get a dataset consisting of sales for which we know
the sale price, area, and age for each home.
o The dataset is called a training dataset or training set.
o Each row (here the data corresponding to one sale) is called an example (or data point, data instance, sample).
o The thing we are trying to predict (price) is called a label (or target).
o The independent variables (age and area) upon which the predictions are based are called
features (or covariates).
• We will use $n$ to denote the number of examples in our dataset. We index the data examples by $i$, denoting each input as $\mathbf{x}^{(i)}$ and the corresponding label as $y^{(i)}$.
Basic Elements of Linear Regression
• The linearity assumption says that the target (price) can be expressed as a weighted sum of the features (area and age):
$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b \tag{3.1.1}$$
$w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights, and $b$ is called a bias (also called an offset or intercept).
• The bias just says what value the predicted price should take when all of the features take value 0.
• (3.1.1) is an affine transformation of input features, which is characterized by a linear transformation of features via weighted sum, combined with a translation via the added bias.
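As a quick numeric illustration (the parameter values below are made up for this example, not estimates from any dataset): with $w_{\mathrm{area}} = 100$, $w_{\mathrm{age}} = -500$, and $b = 50{,}000$, a house with area 120 and age 10 would be priced at
$$100 \cdot 120 + (-500) \cdot 10 + 50{,}000 = 57{,}000.$$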
Basic Elements of Linear Regression
Linear Model | Loss Function | Analytic Solution | Minibatch Stochastic Gradient Descent | Making Predictions with the Learned Model
• Given a dataset, our goal is to choose the weights and the bias such that on average, the predictions made
according to our model best fit the true prices observed in the data.
• Models whose output prediction is determined by the affine transformation of input features are linear models.
o The affine transformation is specified by the chosen weights and bias.
• When our inputs consist of $d$ features, we express our prediction $\hat{y}$ (the "hat" symbol denotes estimates) as
$$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b \tag{3.1.2}$$
Basic Elements of Linear Regression
• Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express our model using a dot product:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b \tag{3.1.3}$$
• We refer to the features of our entire dataset of $n$ examples via the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$.
o Here, $\mathbf{X}$ contains one row for every example and one column for every feature.
• For a collection of features $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the matrix-vector product:
$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b \tag{3.1.4}$$
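A minimal sketch of (3.1.3) and (3.1.4) in code (plain NumPy here; the chapter's MXNet np module exposes the same interface, and the shapes and values are illustrative assumptions):
import numpy as np

n, d = 4, 2                  # 4 examples, 2 features (e.g., area and age)
X = np.random.rand(n, d)     # design matrix: one row per example, one column per feature
w = np.array([2.0, -1.0])    # weight vector
b = 0.5                      # bias

y_hat_one = X[0] @ w + b     # dot product for a single example, as in (3.1.3)
y_hat_all = X @ w + b        # matrix-vector product for all examples, as in (3.1.4)
print(y_hat_all.shape)       # (4,)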
• Given features of a training dataset $\mathbf{X}$ and corresponding (known) labels $\mathbf{y}$, the goal of linear regression is to find the weight vector $\mathbf{w}$ and the bias term $b$ such that, given features of a new data example sampled from the same distribution as $\mathbf{X}$, the new example's label will (in expectation) be predicted with the lowest error.
• We would not expect to find a real-world dataset of $n$ examples where $y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)} + b$ for all $1 \le i \le n$.
o Thus, even when we are confident that the underlying relationship is linear, we will incorporate a noise term to account for such errors.
• Before searching for the best parameters (or model parameters) $\mathbf{w}$ and $b$, we will need two more things:
1. A quality measure for some given model.
2. A procedure for updating the model to improve its quality.
Basic Elements of Linear Regression
• To think about how to fit data with our model, we need to determine a measure of fitness.
• The loss function quantifies the distance between the real and predicted value of the target.
o The loss will be a non-negative number where smaller values are better.
o Perfect predictions incur a loss of 0.
• The most popular loss function in regression problems is the squared error:
$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 \tag{3.1.5}$$
$\hat{y}^{(i)}$ is the predicted label and $y^{(i)}$ is the corresponding true label for the $i$-th example.
• The constant $\frac{1}{2}$ makes no difference but will prove notationally convenient, canceling out when we take the derivative of the loss.
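In code, the per-example squared error of (3.1.5) is a one-liner (a minimal sketch; the function name is ours):
def squared_loss(y_hat, y):
    """Squared error of (3.1.5); the factor 1/2 cancels when we differentiate."""
    return 0.5 * (y_hat - y) ** 2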
Basic Elements of Linear Regression
• Consider the example in Fig. 3.1.1, where we plot a regression problem for a one-dimensional case.
• Note that large differences between estimates and observations lead to even larger contributions to the loss,
due to the quadratic dependence.
Basic Elements of Linear Regression
• To measure the quality of a model on the entire dataset of $n$ examples, we average (or equivalently, sum) the losses on the training set:
$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2 \tag{3.1.6}$$
• When training the model, we want to find parameters $(\mathbf{w}^*, b^*)$ that minimize the total loss across all training examples:
$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b} \; L(\mathbf{w}, b) \tag{3.1.7}$$
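A sketch of the training objective (3.1.6) with plain NumPy and made-up data (all names and values here are ours, for illustration only):
import numpy as np

def total_loss(w, b, X, y):
    """Average squared error L(w, b) over the whole dataset, as in (3.1.6)."""
    residual = X @ w + b - y
    return float(np.mean(0.5 * residual ** 2))

# Sanity check: with the parameters that generated the (noise-free) data, the loss is 0.
X = np.random.rand(100, 2)
w_true, b_true = np.array([2.0, -3.4]), 4.2
y = X @ w_true + b_true
print(total_loss(w_true, b_true, X, y))   # 0.0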
Basic Elements of Linear Regression
• Linear regression has an analytic (closed-form) solution: folding the bias $b$ into $\mathbf{w}$ by appending a column of all ones to $\mathbf{X}$, the minimizer of $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$ is $\mathbf{w}^* = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}$ (3.1.8).
• However, the requirement of an analytic solution is so restrictive that it would exclude all of deep learning.
o Simple problems like linear regression may admit analytic solutions, but you should not get used to such good fortune.
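As a sketch of that closed form (not on the original slide; plain NumPy, synthetic data, and the standard trick of appending a constant-1 column so the bias is estimated together with the weights):
import numpy as np

X = np.random.rand(100, 2)
w_true, b_true = np.array([2.0, -3.4]), 4.2
y = X @ w_true + b_true + 0.01 * np.random.randn(100)   # noisy labels

X1 = np.hstack([X, np.ones((100, 1))])          # append a column of 1s for the bias
w_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)  # solves min_w ||X1 w - y||^2
print(w_hat)                                    # approximately [2.0, -3.4, 4.2]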
Basic Elements of Linear Regression
• In cases where we cannot solve the models analytically, it turns out that we can still train models effectively in practice.
• The key technique for optimizing nearly any deep learning model is called gradient descent.
o Gradient descent iteratively reduces the error by updating the parameters in the direction that incrementally
lowers the loss function.
[Figure: gradient descent illustrated on a loss curve. The loss (vertical axis) is plotted against the value of the weight (horizontal axis); from the starting point, updates move toward the point of convergence, where the loss function is at its minimum.]
Basic Elements of Linear Regression
• The most naive application of gradient descent consists of taking the derivative of the loss function, which is an
average of the losses computed on every single example in the dataset.
o This is extremely slow: we must pass over the entire dataset before making a single update.
o Thus, we will sample a random minibatch of examples every time we need to compute the update; this variant is called minibatch stochastic gradient descent.
• In each iteration:
1. We first randomly sample a minibatch B consisting of a fixed number of training examples.
2. We then compute the derivative (gradient) of the average loss on the minibatch with regard to the model
parameters.
3. Finally, we multiply the gradient by a predetermined positive value η and subtract the resulting term from
the current parameter values.
Basic Elements of Linear Regression
• We can express the update mathematically as follows ($\partial$ denotes the partial derivative):
$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b) \tag{3.1.9}$$
• For quadratic losses and affine transformations, we can write this out explicitly as follows:
$$\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right), \qquad b \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right) \tag{3.1.10}$$
• The values of the batch size $|\mathcal{B}|$ and the learning rate $\eta$ are manually pre-specified and not typically learned through model training.
o These parameters that are tunable but not updated in the training loop are called hyperparameters.
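Putting the three steps and the update rule (3.1.10) together, a minimal minibatch SGD loop for linear regression might look like the sketch below (plain NumPy; the synthetic data and hyperparameter values are assumptions chosen only for illustration):
import numpy as np

# Synthetic data: y = Xw + b + small noise
n = 1000
X = np.random.randn(n, 2)
w_true, b_true = np.array([2.0, -3.4]), 4.2
y = X @ w_true + b_true + 0.01 * np.random.randn(n)

w, b = np.zeros(2), 0.0                    # initial parameters
eta, batch_size, num_epochs = 0.03, 10, 3  # hyperparameters (pre-specified, not learned)

for epoch in range(num_epochs):
    idx = np.random.permutation(n)         # step 1: sample random minibatches
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        err = Xb @ w + b - yb              # residuals on the minibatch
        grad_w = Xb.T @ err / batch_size   # step 2: gradient of the average loss
        grad_b = err.mean()
        w -= eta * grad_w                  # step 3: move against the gradient,
        b -= eta * grad_b                  #         scaled by the learning rate
print(w, b)                                # should approach [2.0, -3.4] and 4.2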
Basic Elements of Linear Regression
• Hyperparameter tuning is the process by which hyperparameters are chosen, and typically requires that we adjust
them based on the results of the training loop as assessed on a separate validation dataset.
• After training for some predetermined number of iterations (or until some other stopping criteria are met), we record the estimated model parameters, denoted $\hat{\mathbf{w}}, \hat{b}$.
o Even if our function is truly linear and noiseless, these parameters will not be the exact minimizers of the loss because, although the algorithm converges slowly towards the minimizers, it cannot achieve them exactly in a finite number of steps.
Basic Elements of Linear Regression
• Linear regression happens to be a learning problem where there is only one minimum over the entire domain.
o For more complicated models, like deep networks, the loss surfaces contain many minima.
• The more formidable task is to find parameters that will achieve low loss
on data that we have not seen before.
o This challenge is called generalization.
• Given the learned linear regression model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$, we can estimate the price of a new house given its area $x_1$ and age $x_2$.
o Estimating targets given features is commonly called prediction or inference.
https://www.mltut.com/stochastic-gradient-descent-a-super-easy-complete-guide/
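Once $\hat{\mathbf{w}}$ and $\hat{b}$ are in hand, prediction is just another forward pass; a tiny sketch (the parameter values are placeholders, not actual estimates):
import numpy as np

w_hat = np.array([100.0, -500.0])   # hypothetical learned weights for (area, age)
b_hat = 50000.0                     # hypothetical learned bias

x_new = np.array([150.0, 5.0])      # a new house: area = 150, age = 5
print(x_new @ w_hat + b_hat)        # 62500.0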
Vectorization for Speed
• When training our models, we typically want to process whole minibatches of examples simultaneously.
o Doing this efficiently requires that we vectorize the calculations and leverage fast linear algebra libraries.
%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np
import time
n = 10000
a = np.ones(n)  # two all-ones vectors with 10,000 entries each
b = np.ones(n)
Vectorization for Speed
• Let us define a timer.
class Timer:  #@save
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of time."""
        return sum(self.times)

    def cumsum(self):
        """Return the accumulated time."""
        return np.array(self.times).cumsum().tolist()
Vectorization for Speed
• Now we can benchmark the workloads.
o First, we add them, one coordinate at a time, using a for-loop.
c = np.zeros(n)
timer = Timer()
for i in range(n):
    c[i] = a[i] + b[i]
f'{timer.stop():.5f} sec'
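For comparison, a sketch of the vectorized alternative (reusing the arrays a, b and the timer defined above): a single call into the optimized linear algebra backend replaces the Python loop and runs orders of magnitude faster.
timer.start()
d = a + b
f'{timer.stop():.5f} sec'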
The Normal Distribution and Squared Loss
• There is a strong connection between the normal distribution (Gaussian) and linear regression.
• The probability density of a normal distribution with mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$) is given as:
$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right) \tag{3.1.11}$$
• One way to motivate linear regression with the squared loss is to assume that observations arise from noisy measurements, where the noise is normally distributed:
$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2) \tag{3.1.12}$$
• Thus, we can write out the likelihood of seeing a particular $y$ for a given $\mathbf{x}$ via
$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} \left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2\right) \tag{3.1.13}$$
• Now, according to the principle of maximum likelihood, the best values of parameters $\mathbf{w}$ and $b$ are those that maximize the likelihood of the entire dataset:
$$P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} p\left(y^{(i)} \mid \mathbf{x}^{(i)}\right) \tag{3.1.14}$$
• Estimators chosen according to the principle of maximum likelihood are called maximum likelihood estimators.
The Normal Distribution and Squared Loss
• Maximizing the product of many exponential functions is difficult.
o So, we maximize the log of the likelihood instead.
o For historical reasons, optimizations are more often expressed as minimization rather than maximization, so we minimize the negative log-likelihood $-\log P(\mathbf{y} \mid \mathbf{X})$:
$$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^{n} \left( \frac{1}{2} \log\left(2 \pi \sigma^2\right) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2 \right) \tag{3.1.15}$$
• Minimizing the mean squared error is equivalent to maximum likelihood estimation of a linear model under the
assumption of additive Gaussian noise.
o The solution does not depend on $\sigma$.
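To make the equivalence explicit (a restatement of (3.1.15), not new material): the additive constant $\frac{n}{2}\log(2\pi\sigma^2)$ and the positive factor $\frac{1}{\sigma^2}$ do not change where the minimum lies, so
$$\operatorname*{argmin}_{\mathbf{w}, b} \; \left(-\log P(\mathbf{y} \mid \mathbf{X})\right) = \operatorname*{argmin}_{\mathbf{w}, b} \; \sum_{i=1}^{n} \frac{1}{2}\left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2,$$
which is exactly the squared-error objective of (3.1.6) up to scaling by $\frac{1}{n}$.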
From Linear Regression to Deep Networks
• We can think of the linear model as a neural network by expressing it in the language of neural networks.
o Neural networks cover a much richer family of models.
• We depict our linear regression model as a neural network.
o These diagrams highlight the connectivity pattern such as how each input is connected to the output,
but not the values taken by the weights or biases.
https://medium.com/@abhismatrix/neural-network-model-training-tricks-61254a2a1f6b
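As a concrete hint of this view (a minimal sketch assuming the MXNet Gluon API that the concise implementation later in this chapter relies on): a fully connected layer with a single output unit computes exactly the affine model of (3.1.3), with the layer's weight vector and bias playing the roles of $\mathbf{w}$ and $b$.
from mxnet.gluon import nn

net = nn.Dense(1)   # one output unit: a weighted sum of the inputs plus a bias
net.initialize()    # creates and initializes the parameters w and b (shapes inferred lazily)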
Summary
• Key ingredients in a machine learning model are training data, a loss function, an optimization algorithm,
and quite obviously, the model itself.
• Vectorizing makes everything better (mostly math) and faster (mostly code).
• Minimizing an objective function and performing maximum likelihood estimation can mean the same thing.