
Deep learning

Dr. Aissa Boulmerka


[email protected]

2023-2024

1
CHAPTER 1
LOGISTIC REGRESSION

2
Introduction to Deep Learning

• Artificial Intelligence is the new Electricity.
• Electricity once transformed countless industries: transportation, manufacturing, healthcare, communications, and more.
• AI will now bring about an equally big transformation.
Andrew Ng

3
The Machine Learning Approach

• Instead of writing a program by hand for each specific task, we collect lots
of examples that specify the correct output for a given input.
• A machine learning algorithm then takes these examples and produces a
program that does the job.
– The program produced by the learning algorithm may look very
different from a typical hand-written program. It may contain millions
of numbers.
– If we do it right, the program works for new cases as well as the ones
we trained it on.
– If the data changes, the program can change too by training on the new data.
• Massive amounts of computation are now cheaper than paying someone
to write a task-specific program.

4
It is very hard to say what makes a handwritten digit a “2”.

5
Some examples of tasks best solved by learning

• Recognizing patterns:
– Objects in real scenes
– Facial identities or facial expressions
– Spoken words
• Recognizing anomalies:
– Unusual sequences of credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant
• Prediction:
– Future stock prices or currency exchange rates
– Which movies will a person like?

6
Types of learning tasks

• Supervised learning
– Learn to predict an output when given an input vector.

• Reinforcement learning
– Learn to select an action to maximize payoff.

• Unsupervised learning
– Discover a good internal representation of the input.

7
What will you learn in this course?

1. Neural Networks and Deep Learning
2. Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
3. Structuring your Machine Learning project
4. Convolutional Neural Networks
5. Sequence models, for applications such as Natural Language Processing and time series.

8
What is a neural network?
It is a powerful learning algorithm inspired by how the brain works.
Example 1 – single neural network
• Given data about the size of houses on the real estate market, you want to fit a function that predicts their price. This is a regression problem, because the price as a function of size is a continuous output.
• We know that prices can never be negative, so we use a function called the Rectified Linear Unit (ReLU), which starts at zero.

9
Example 1 – single neural network

• The input is the size of the house (x)


• The output is the price (y)
• The “neuron” implements the function ReLU (blue line)
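As an illustration, here is a minimal NumPy sketch of this single-neuron model; the weight w and bias b below are made-up values for illustration, not fitted ones:

import numpy as np

def relu(z):
    # Rectified Linear Unit: elementwise max(0, z)
    return np.maximum(0, z)

# Single "neuron" for the house-price example: price = relu(w * size + b).
w, b = 0.5, -10.0                    # hypothetical slope and intercept
size = np.array([10.0, 40.0, 100.0])
price = relu(w * size + b)           # -> array([ 0., 10., 40.])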

10
Example 2 – a neural network with multiple neurons
The price of a house can be affected by other features, such as size, number of bedrooms, zip code and wealth. The role of the neural network is to predict the price, and it will automatically generate the hidden units. We only need to give it the inputs x and the output y.

11
Supervised learning for Neural Networks
 In supervised learning, we are given a dataset and already know what our
correct output should look like, having the idea that there is a relationship
between the input and the output.
 Supervised learning problems are categorized into "regression" and
"classification" problems.
 In a regression problem, we are trying to predict results within a
continuous output, meaning that we are trying to map input variables to
some continuous function.
 In a classification problem, we are instead trying to predict results in a
discrete output. In other words, we are trying to map input variables into
discrete categories.

12
Examples of supervised learning
 Here are some examples of supervised learning.

 There are different types of neural networks, for example:

 Convolutional Neural Networks (CNN): often used for image applications
 Recurrent Neural Networks (RNN): used for one-dimensional sequence data, such as translating English to Chinese, or data with a temporal component, such as a text transcript.
 Hybrid neural network architectures: used, for example, for autonomous driving.
13
Structured vs. unstructured data
 Structured data refers to data with a well-defined meaning, such as a price or an age.
 Unstructured data refers to data like pixels, raw audio, or text.

14
Why is deep learning taking off?

 Deep learning is taking off due to the large amount of data made available through the digitization of society, faster computation, and innovation in the development of neural network algorithms.

 Two things have to be in place to reach a high level of performance:

1. The ability to train a big enough neural network
2. A huge amount of labeled data
15
Training a neural network
 The process of training a neural network is iterative.

 It can take a good amount of time to train a neural network, which affects your productivity. Faster computation helps you iterate on and improve new algorithms more quickly.

16
Binary Classification
 In a binary classification problem, the result is a discrete value output.
 For example:
 an account hacked (1) or not (0)
 a tumor malignant (1) or benign (0)

Example: Cat vs Non-Cat

 The goal is to train a classifier whose input is an image represented by a feature vector, $x$, and which predicts whether the corresponding label $y$ is 1 or 0: in this case, whether this is a cat image (1) or a non-cat image (0).

(Figure: a 64 × 64 pixel cat image.)
17
Binary Classification
 An image is stored in the computer in three separate matrices corresponding to the Red, Green, and Blue color channels of the image.
 The three matrices have the same size as the image. For example, if the resolution of the cat image is 64 pixels × 64 pixels, the three matrices (RGB) are 64 × 64 each.
 The value in each cell represents a pixel intensity, and these values are used to create a feature vector of dimension $n_x$.
 In pattern recognition and machine learning, a feature vector represents an object; in this case, a cat or no cat.

- To create a feature vector $x$, the pixel intensity values are “unrolled” or “reshaped” into a single column, channel by channel (Red, Green, Blue).
- The dimension of the input feature vector $x$ is: $n_x = 64 \times 64 \times 3 = 12288$.
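A minimal NumPy sketch of this unrolling step (the image below is random noise standing in for a real 64 × 64 RGB photo):

import numpy as np

image = np.random.randint(0, 256, size=(64, 64, 3))  # stand-in RGB image
x = image.reshape(-1, 1)                             # unroll into one column
print(x.shape)                                       # (12288, 1) = (64*64*3, 1)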

18
Binary Classification

1 (cat) vs. 0 (non-cat)

$64 \times 64 \times 3 = 12288$

$n = n_x = 12288$

$x \to y$

19
Notation
 A single training example: $(x, y)$, with $x \in \mathbb{R}^{n_x}$, $y \in \{0, 1\}$

 $m$ training examples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$

 $m$ = # training examples, $m_{test}$ = # test examples

 $X = \begin{bmatrix} | & | & \cdots & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & \cdots & | \end{bmatrix} \in \mathbb{R}^{n_x \times m}$, so X.shape = $(n_x, m)$

 $Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix} \in \mathbb{R}^{1 \times m}$, so Y.shape = $(1, m)$
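To make these conventions concrete, a small NumPy sketch with made-up sizes (n_x = 3 features, m = 4 examples):

import numpy as np

n_x, m = 3, 4
X = np.random.randn(n_x, m)           # examples stacked as columns
Y = np.random.randint(0, 2, (1, m))   # labels in a 1 x m row vector
print(X.shape)                        # (3, 4)  -> (n_x, m)
print(Y.shape)                        # (1, 4)  -> (1, m)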

20
Logistic Regression
 Logistic regression is a learning algorithm used in a supervised learning problem when the output labels $y$ are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training data.

Example: Cat vs Non-Cat
 Given an image represented by a feature vector $x$, the algorithm will evaluate the probability of a cat being in that image:

Given $x$, $\hat{y} = P(y = 1 \mid x)$, where $0 \le \hat{y} \le 1$

The parameters used in logistic regression are:
 The input features vector: $x \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features
 The training label: $y \in \{0, 1\}$
 The weights: $w \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features
 The threshold (bias): $b \in \mathbb{R}$
 The output: $\hat{y} = \sigma(w^T x + b)$
 The sigmoid function: $s = \sigma(w^T x + b) = \sigma(z) = \dfrac{1}{1 + e^{-z}}$, with $z = w^T x + b$
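A minimal sketch of this prediction step in NumPy (the helper names sigmoid and predict are ours; the zero weights are only to show the shapes involved):

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    # y_hat = sigma(w^T x + b)
    return sigmoid(np.dot(w.T, x) + b)

n_x = 12288
w = np.zeros((n_x, 1))       # (n_x, 1) weight vector
x = np.random.randn(n_x, 1)  # one input example
y_hat = predict(w, 0.0, x)   # 0.5 when w = 0 and b = 0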

21
Logistic Regression

 $w^T x + b$ is a linear function (like $ax + b$), but since we are looking for a probability, constrained to $[0, 1]$, the sigmoid function is used. The function is bounded between $[0, 1]$, as shown in the graph above.
 Some observations from the graph:
 If $z$ is a large positive number, then $\sigma(z) \approx 1$
 If $z$ is a large negative number, then $\sigma(z) \approx 0$
 If $z = 0$, then $\sigma(z) = 0.5$
22
Logistic Regression
Given $x \in \mathbb{R}^{n_x}$: $\hat{y} = P(y = 1 \mid x)$, with $0 \le \hat{y} \le 1$
Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$
Output: $\hat{y} = \sigma(w^T x + b)$

An alternative notation keeps a single parameter vector: add a constant feature $x_0 = 1$, so that $x \in \mathbb{R}^{n_x + 1}$, and write

$\hat{y} = \sigma(\theta^T x)$, where $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_{n_x} \end{bmatrix}$ ($\theta_0$ plays the role of $b$, and $\theta_1, \ldots, \theta_{n_x}$ the role of $w$)

23
Logistic regression cost function

 To train the parameters $w$ and $b$, we need to define a cost function.

Recap:

 $\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$, where $\sigma(z^{(i)}) = \dfrac{1}{1 + e^{-z^{(i)}}}$ and $z^{(i)} = w^T x^{(i)} + b$; the superscript $(i)$ denotes the $i$-th example.

 Given $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$, we want $\hat{y}^{(i)} \approx y^{(i)}$.

24
Logistic regression cost function

Loss (error) function:

The loss function computes the error for a single training example.

 $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$ (squared error; not used for logistic regression, because it makes the optimization non-convex)

 $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\left( y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right)$ (cross-entropy)

 If $y^{(i)} = 1$: $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\log \hat{y}^{(i)}$; minimizing the loss drives $\log \hat{y}^{(i)}$ toward 0, so $\hat{y}^{(i)}$ should be close to 1.

 If $y^{(i)} = 0$: $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\log\left(1 - \hat{y}^{(i)}\right)$; minimizing the loss means $\hat{y}^{(i)}$ should be close to 0.
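A quick numerical check of the cross-entropy loss (the two predicted values 0.9 are illustrative):

import numpy as np

def cross_entropy_loss(y_hat, y):
    # loss for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(cross_entropy_loss(0.9, 1))   # ~0.105: confident and correct -> small loss
print(cross_entropy_loss(0.9, 0))   # ~2.303: confident but wrong   -> large loss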

25
Logistic regression cost function

Cost function

 The cost function is the average of the loss function over the entire training set. The goal is to find the parameters $w$ and $b$ that minimize the overall cost function.

$J(w, b) = \dfrac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$
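A vectorized sketch of this cost, assuming the column-per-example layout $X \in \mathbb{R}^{n_x \times m}$ from the notation slide (the helper name cost is ours):

import numpy as np

def cost(w, b, X, Y):
    # J(w, b): average cross-entropy over the m training examples
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))   # predictions for all examples
    return float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m)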

27
Gradient Descent
 Recap:
$\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$, where $\sigma(z^{(i)}) = \dfrac{1}{1 + e^{-z^{(i)}}}$
$J(w, b) = \dfrac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$

We want to find $w, b$ that minimize $J(w, b)$.

(Figure: the cost $J(w, b)$ as a convex, bowl-shaped surface over the parameters $w$ and $b$.)
28
Gradient Descent

(Figure: a one-dimensional cost curve $J(w)$; gradient descent repeatedly steps downhill.)

Repeat {
    $w \leftarrow w - \alpha \dfrac{dJ}{dw}$
}
$\alpha$: learning rate

With both parameters, the updates use partial derivatives, written $dw$ and $db$ in code:

$dw = \dfrac{\partial J(w, b)}{\partial w}$,  $db = \dfrac{\partial J(w, b)}{\partial b}$

$J(w, b)$: cost function
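A minimal sketch of the update rule on a toy one-parameter cost $J(w) = (w - 3)^2$, whose derivative is known in closed form (everything here is illustrative):

def dJ_dw(w):
    # dJ/dw for J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, alpha = 0.0, 0.1
for _ in range(100):
    w = w - alpha * dJ_dw(w)   # w <- w - alpha * dJ/dw
print(w)                       # ~3.0, the minimizer of J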

29
Computation Graph
$J(a, b, c) = 3(a + bc) = 3(5 + 3 \times 2) = 33$

Break the computation into steps:
$u = bc$
$v = a + u$
$J = 3v$

Forward pass with $a = 5$, $b = 3$, $c = 2$: $u = 6$, $v = 11$, $J = 33$.

30
Computation Graph
$J(a, b, c) = 3(a + bc) = 3(5 + 3 \times 2) = 33$

$u = bc$, $v = a + u$, $J = 3v$; with $a = 5$, $b = 3$, $c = 2$: $u = 6$, $v = 11$, $J = 33$.

Backward pass (chain rule, right to left):

$\dfrac{dJ}{dv} = 3$

$\dfrac{dJ}{du} = \dfrac{dJ}{dv} \cdot \dfrac{dv}{du} = 3 \times 1 = 3$

$\dfrac{dJ}{db} = \dfrac{dJ}{du} \cdot \dfrac{du}{db} = 3 \times c = 6$

$\dfrac{dJ}{dc} = \dfrac{dJ}{du} \cdot \dfrac{du}{dc} = 3 \times b = 9$
31
Logistic Regression Gradient descent

$z = w^T x + b$
$\hat{y} = a = \sigma(z)$
$\mathcal{L}(a, y) = -\left( y \log a + (1 - y) \log(1 - a) \right)$

Computation graph for one example with two features:
$x_1, x_2, w_1, w_2, b \;\longrightarrow\; z = w_1 x_1 + w_2 x_2 + b \;\longrightarrow\; \hat{y} = a = \sigma(z) \;\longrightarrow\; \mathcal{L}(a, y)$

32
Logistic Regression Gradient descent
$\mathcal{L}(a, y) = -\left( y \log a + (1 - y) \log(1 - a) \right)$

$x_1, x_2, w_1, w_2, b \;\longrightarrow\; z = w_1 x_1 + w_2 x_2 + b \;\longrightarrow\; \hat{y} = a = \sigma(z) \;\longrightarrow\; \mathcal{L}(a, y)$

Going backwards through the graph:

$da = \dfrac{d\mathcal{L}(a, y)}{da} = -\dfrac{y}{a} + \dfrac{1 - y}{1 - a}$

$\dfrac{da}{dz} = a(1 - a)$

$dz = \dfrac{d\mathcal{L}}{dz} = \dfrac{d\mathcal{L}}{da} \cdot \dfrac{da}{dz} = a - y$

$dw_1 = \dfrac{d\mathcal{L}}{dw_1} = x_1\, dz$

$dw_2 = \dfrac{d\mathcal{L}}{dw_2} = x_2\, dz$

$db = \dfrac{d\mathcal{L}}{db} = dz$

Parameter updates:
$w_1 \leftarrow w_1 - \alpha\, dw_1$
$w_2 \leftarrow w_2 - \alpha\, dw_2$
$b \leftarrow b - \alpha\, db$
33
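A sketch of one forward/backward pass and one update for a single example with two features (all numeric values are made up):

import numpy as np

x1, x2, y = 1.0, 2.0, 1.0              # one training example
w1, w2, b, alpha = 0.1, -0.2, 0.0, 0.01

z = w1 * x1 + w2 * x2 + b              # forward pass
a = 1.0 / (1.0 + np.exp(-z))

dz = a - y                             # backward pass, using dz = a - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

w1 -= alpha * dw1                      # gradient-descent update
w2 -= alpha * dw2
b -= alpha * db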
Logistic regression on m examples
$x_1, x_2, w_1, w_2, b \;\longrightarrow\; z = w_1 x_1 + w_2 x_2 + b \;\longrightarrow\; \hat{y} = a = \sigma(z) \;\longrightarrow\; \mathcal{L}(a, y)$

$J(w, b) = \dfrac{1}{m} \sum_{i=1}^{m} \mathcal{L}(a^{(i)}, y^{(i)})$, with $a^{(i)} = \hat{y}^{(i)} = \sigma(z^{(i)}) = \sigma(w^T x^{(i)} + b)$

The gradient of the cost is the average of the per-example gradients:

$\dfrac{\partial J(w, b)}{\partial w_1} = \dfrac{1}{m} \sum_{i=1}^{m} \dfrac{\partial \mathcal{L}(a^{(i)}, y^{(i)})}{\partial w_1} = \dfrac{1}{m} \sum_{i=1}^{m} dw_1^{(i)}$

$\dfrac{\partial J(w, b)}{\partial b} = \dfrac{1}{m} \sum_{i=1}^{m} \dfrac{\partial \mathcal{L}(a^{(i)}, y^{(i)})}{\partial b} = \dfrac{1}{m} \sum_{i=1}^{m} db^{(i)}$
34
Logistic regression on m examples (iterative)
J = 0; dw1 = 0; dw2 = 0; db = 0
for i = 1 to m:
    z(i) = wᵀx(i) + b
    a(i) = σ(z(i))
    J += −[ y(i) log a(i) + (1 − y(i)) log(1 − a(i)) ]
    dz(i) = a(i) − y(i)
    dw1 += x1(i) dz(i)
    dw2 += x2(i) dz(i)
    db += dz(i)
J /= m; dw1 /= m; dw2 /= m; db /= m
w1 = w1 − α dw1
w2 = w2 − α dw2
b = b − α db
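A runnable version of this loop for the two-feature case, using NumPy only for exp/log (the function name step_iterative is ours):

import numpy as np

def step_iterative(w1, w2, b, x1s, x2s, ys, alpha):
    # One gradient-descent step, looping over the m examples explicitly.
    m = len(ys)
    J = dw1 = dw2 = db = 0.0
    for x1, x2, y in zip(x1s, x2s, ys):
        a = 1.0 / (1.0 + np.exp(-(w1 * x1 + w2 * x2 + b)))
        J += -(y * np.log(a) + (1 - y) * np.log(1 - a))
        dz = a - y
        dw1 += x1 * dz
        dw2 += x2 * dz
        db += dz
    J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
    return w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db, J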
35
What is vectorization?

$z = w^T x + b$, with $w \in \mathbb{R}^{n_x}$ and $x \in \mathbb{R}^{n_x}$

Iterative (non-vectorized):
z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b

Vectorized:
z = np.dot(w, x) + b
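A quick (machine-dependent) comparison of the two versions; on typical hardware the vectorized dot product is orders of magnitude faster:

import time
import numpy as np

n = 1_000_000
w, x = np.random.rand(n), np.random.rand(n)

t0 = time.time()
z = 0.0
for i in range(n):          # explicit loop
    z += w[i] * x[i]
t_loop = time.time() - t0

t0 = time.time()
z_vec = np.dot(w, x)        # vectorized
t_vec = time.time() - t0

print(f"loop: {t_loop:.3f}s, vectorized: {t_vec:.4f}s")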

36
Neural network programming guideline
Whenever possible, avoid explicit for-loops.

Example: the matrix-vector product $u = Av$, i.e. $u_i = \sum_j A_{ij} v_j$

Iterative (non-vectorized):
u = np.zeros((n, 1))
for i = 1 ...:
    for j = 1 ...:
        u[i] += A[i][j] * v[j]

Vectorized:
u = np.dot(A, v)
37
Vector- and matrix-valued functions
Say you need to apply the exponential operation to every element of a matrix/vector:

$v = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \;\longrightarrow\; u = \begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_n} \end{bmatrix}$

Iterative (non-vectorized):
import math
u = np.zeros((n, 1))
for i in range(n):
    u[i] = math.exp(v[i])

Vectorized:
import numpy as np
u = np.exp(v)

# Other elementwise functions:
np.log(v)
np.abs(v)
np.maximum(v, 0)   # elementwise maximum of v and 0
38
Logistic regression derivatives

Vectorizing the loop over the $n_x$ features: replace the individual dw1, dw2, ... with a single vector dw.

Iterative (non-vectorized):
J = 0; dw1 = 0; dw2 = 0; db = 0
for i = 1 to m:
    z(i) = wᵀx(i) + b
    a(i) = σ(z(i))
    J += −[ y(i) log a(i) + (1 − y(i)) log(1 − a(i)) ]
    dz(i) = a(i) − y(i)
    dw1 += x1(i) dz(i)
    dw2 += x2(i) dz(i)
    db += dz(i)
J /= m; dw1 /= m; dw2 /= m; db /= m

Vectorized over the features:
J = 0; dw = np.zeros((n_x, 1)); db = 0
for i = 1 to m:
    z(i) = wᵀx(i) + b
    a(i) = σ(z(i))
    J += −[ y(i) log a(i) + (1 − y(i)) log(1 − a(i)) ]
    dz(i) = a(i) − y(i)
    dw += x(i) dz(i)
    db += dz(i)
J /= m; dw /= m; db /= m

39
Vectorizing Logistic Regression

For each example: $z^{(i)} = w^T x^{(i)} + b$ and $a^{(i)} = \sigma(z^{(i)})$, e.g.

$z^{(1)} = w^T x^{(1)} + b$,  $z^{(2)} = w^T x^{(2)} + b$,  $z^{(3)} = w^T x^{(3)} + b$, ...
$a^{(1)} = \sigma(z^{(1)})$,  $a^{(2)} = \sigma(z^{(2)})$,  $a^{(3)} = \sigma(z^{(3)})$, ...

Stacking the examples as the columns of $X \in \mathbb{R}^{n_x \times m}$:

$Z = \begin{bmatrix} z^{(1)} & z^{(2)} & \cdots & z^{(m)} \end{bmatrix} = w^T X + \begin{bmatrix} b & b & \cdots & b \end{bmatrix} = \begin{bmatrix} w^T x^{(1)} + b & w^T x^{(2)} + b & \cdots & w^T x^{(m)} + b \end{bmatrix}$

 In code: Z = np.dot(w.T, X) + b (NumPy broadcasting expands the scalar b across the row)

$A = \begin{bmatrix} a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix} = \sigma(Z)$

 In code: A = sigmoid(Z)
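A small shape check of these two lines, with made-up sizes (n_x = 3, m = 5) and the sigmoid defined earlier:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, m = 3, 5
w, b = np.random.randn(n_x, 1), 0.5
X = np.random.randn(n_x, m)

Z = np.dot(w.T, X) + b     # (1, m); broadcasting adds b to every column
A = sigmoid(Z)             # (1, m)

# Same value as computing the first example on its own:
assert np.isclose(Z[0, 0], float(np.dot(w.T, X[:, [0]]) + b))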

40
Implementing Logistic Regression

Iterative (non-vectorial) Vectorial

Iterative (non-vectorized):
for iter in range(1000):
    J = 0; dw1 = 0; dw2 = 0; db = 0
    for i = 1 to m:
        z(i) = wᵀx(i) + b
        a(i) = σ(z(i))
        J += −[ y(i) log a(i) + (1 − y(i)) log(1 − a(i)) ]
        dz(i) = a(i) − y(i)
        dw1 += x1(i) dz(i)
        dw2 += x2(i) dz(i)
        db += dz(i)
    J /= m; dw1 /= m; dw2 /= m; db /= m
    w1 = w1 − α dw1; w2 = w2 − α dw2; b = b − α db

Vectorized:
for iter in range(1000):
    Z = np.dot(w.T, X) + b
    A = σ(Z)
    dZ = A − Y
    dw = (1/m) np.dot(X, dZ.T)
    db = (1/m) np.sum(dZ)
    w = w − α dw
    b = b − α db

41
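Putting the vectorized version together as a runnable sketch (the function name, the synthetic dataset, and the hyperparameters below are ours, for illustration only):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, Y, alpha=0.1, num_iters=1000):
    # X: (n_x, m) inputs, one example per column; Y: (1, m) labels in {0, 1}.
    n_x, m = X.shape
    w, b = np.zeros((n_x, 1)), 0.0
    for _ in range(num_iters):
        A = sigmoid(np.dot(w.T, X) + b)   # forward pass, all m examples at once
        dZ = A - Y                        # dZ = A - Y
        dw = np.dot(X, dZ.T) / m          # (1/m) X dZᵀ
        db = np.sum(dZ) / m               # (1/m) sum(dZ)
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Toy usage on a synthetic, linearly separable dataset:
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)   # label = 1 above the line x1 + x2 = 0
w, b = train_logistic_regression(X, Y)
preds = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(float)
print("train accuracy:", (preds == Y).mean())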
References
 Andrew Ng. Deep Learning Specialization. DeepLearning.AI.
 Geoffrey Hinton. Neural Networks for Machine Learning.

42
