
Deep Learning

Chapter 2: Building Neural Network from Scratch
Dr. Minhhuy Le
EEE, Phenikaa University
Chapter 2: Building Neural Network from Scratch
1. Shallow neural network
2. Deep neural network
3. Building neural network: step-by-step (modularization)
4. Regularization
5. Dropout
6. Batch Normalization
7. Optimizers
8. Hyper-parameters
9. Practice
Previous Lecture Overview
Chapter 1: Course Info & Programming Review - week 1
1. Course introduction and grades
2. History of Deep Learning
3. Deep Learning applications

Chapter 2: Building Neural Network from Scratch - week 2-7
1. Shallow neural network - week 2
2. Deep neural network - week 3
3. Building neural network: step-by-step (modularization) - week 3
4. Regularization - week 4
5. Dropout - week 4
6. Batch Normalization - week 5
7. Optimizers - week 6
8. Hyper-parameters - week 7
9. Practice
Midterm

Chapter 3: Convolutional Neural Network - week 8-10
1. Convolutional operator
2. History of CNN
3. Deep Convolutional Models
4. Layers in CNN
5. Applications of CNN
6. Practice
Midterm summary

Chapter 4: TensorFlow Library - week 11-13
1. Introduction to TensorFlow
2. Building a deep neural network with TensorFlow
3. Applications
4. Practice

Chapter 5: Recurrent Neural Network - week 14-15
1. Unfolding Computational Graphs
2. Building a Recurrent Neural Network
3. Long Short-Term Memory
4. Vision with Language Processing
5. Applications of RNN
6. Practice
Previous Lecture Overview
Basics of Neural Networks
• The Perceptron and its Learning Rule (Frank Rosenblatt, 1957)
• Adaptive Linear Neuron and Delta Rule (Widrow & Hoff, 1960)
• Logistic Regression and Gradient Descent



Previous Lecture Overview
Biologically inspired (akin to the neurons in a brain)



Previous Lecture Overview
Artificial Neurons and the McCulloch-Pitts Model (1943)

W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.


Previous Lecture Overview
Frank Rosenblatt’s Perceptron (1957)



Previous Lecture Overview
Adaptive Linear Neurons and the Delta Rule (1960)



Previous Lecture Overview
Adaptive Linear Neurons and the Delta Rule (1960)
• Gradient Descent
  • A first-order iterative optimization algorithm for finding the minimum of a function
  • Takes steps proportional to the negative of the gradient of the function at the current point
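To make this concrete, here is a minimal sketch of gradient descent on a one-dimensional toy function; the function f(w) = (w - 3)^2, the starting point, and the step count are assumptions chosen for illustration.

# Gradient descent sketch on the assumed toy function f(w) = (w - 3)^2.
# Its gradient is f'(w) = 2(w - 3), and its minimum is at w = 3.

def grad_f(w):
    return 2.0 * (w - 3.0)

w = 0.0           # initial guess
alpha = 0.1       # learning rate
for step in range(50):
    w = w - alpha * grad_f(w)   # move opposite to the gradient

print(w)          # approaches 3.0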



Previous Lecture Overview
Adaptive Linear Neurons and the Delta Rule (1960)
• Cost function: sum of squared errors (SSE)

  $J(w) = \frac{1}{2} \sum_i \left( y'^{(i)} - y^{(i)} \right)^2$

• To minimize the SSE, we can use gradient descent: take a step in the opposite direction of the gradient,

  $\Delta w = -\alpha \, \nabla J(w)$

  where $\alpha$ is the learning rate, $0 < \alpha < 1$.
• Thus, we need to compute the partial derivative of the cost function with respect to each weight in the weight vector:

  $\Delta w_j = -\alpha \, \frac{\partial J}{\partial w_j}$
Previous Lecture Overview
Adaptive Linear Neurons and the Delta Rule (1960)

• A step in gradient descent:

  $\Delta w_j = -\alpha \, \frac{\partial J}{\partial w_j} = -\alpha \sum_i \left( y'^{(i)} - y^{(i)} \right) \left( -x_j^{(i)} \right) = \alpha \sum_i \left( y'^{(i)} - y^{(i)} \right) x_j^{(i)}$

• Update the weight vector:

  $w := w + \Delta w$

• Differences from the perceptron rule:
  • The output $y^{(i)}$ is a real number, not a class label as in the perceptron learning rule.
  • The weight update is based on all samples in the training set (batch gradient descent).
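A minimal NumPy sketch of this batch delta-rule update; the toy data, learning rate, and epoch count are assumptions for illustration.

import numpy as np

# Minimal sketch of the delta rule with batch gradient descent (Adaline-style).
# X, y_true, and alpha below are assumed toy values for illustration.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])   # m samples x n features
y_true = np.array([1.0, 1.0, 0.0])                    # targets (y' in the slides)
w = np.zeros(X.shape[1])
b = 0.0
alpha = 0.01

for epoch in range(100):
    y_out = X @ w + b        # linear output y = w^T x + b for all samples
    error = y_true - y_out   # (y' - y) for every sample
    w += alpha * X.T @ error # delta w_j = alpha * sum_i (y'_i - y_i) x_ij
    b += alpha * error.sum() # bias updated with the same rule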



Previous Lecture Overview
Adaptive Linear Neurons and the Delta Rule (1960)

• If the learning rate is too large, gradient descent will overshoot the minimum and diverge.
• If the learning rate is too small, gradient descent will require too many epochs to converge and can become trapped in local minima more easily.



Previous Lecture Overview
Adaptive Linear Neurons and the Delta Rule (1960)

• If the features are on the same scale, gradient descent converges faster and prevents the weights from becoming too small (weight decay).

• A common way to scale features is standardization:

  $x_{j,\mathrm{std}} = \frac{x_j - \mu_j}{\sigma_j}$

  where $\mu_j$ is the sample mean of feature $x_j$ and $\sigma_j$ its standard deviation.

• After standardization, the features have unit variance and are centered around zero mean.
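A minimal NumPy sketch of this standardization, using an assumed toy feature matrix:

import numpy as np

# Standardize each feature column: subtract the mean, divide by the standard deviation.
# X is an assumed toy matrix of shape (m samples, n features).
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

mu = X.mean(axis=0)          # per-feature sample mean
sigma = X.std(axis=0)        # per-feature standard deviation
X_std = (X - mu) / sigma     # each column now has zero mean and unit variance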



Previous Lecture Overview
Adaptive Linear Neurons and the Delta Rule (1960)

• Batch Gradient Descent (BGD)
  • The cost function is minimized based on the complete training dataset (all samples).

• Stochastic Gradient Descent (SGD)
  • Weights are updated incrementally after each individual training sample.
  • Converges faster than BGD since the weights are updated immediately after each training sample.
  • Computationally more efficient, especially for large datasets.

• Mini-batch Gradient Descent (MGD)
  • A compromise between BGD and SGD: the dataset is divided into mini-batches.
  • Smoother convergence than SGD.
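To make the three variants concrete, here is a minimal sketch in which the only difference is how many samples feed each weight update; the linear model, toy data, and batch size are assumptions for illustration.

import numpy as np

# Sketch: BGD, SGD, and MGD differ only in how many samples feed each update.
# Linear model y ~ X w with squared error; data and settings are assumed toys.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = rng.standard_normal(100)
w = np.zeros(3)
alpha = 0.01

def gradient_step(X_batch, y_batch, w):
    error = X_batch @ w - y_batch
    return w - alpha * X_batch.T @ error / len(y_batch)

# BGD: one update per epoch, computed on all samples
w = gradient_step(X, y, w)

# SGD: one update per individual sample
for i in range(X.shape[0]):
    w = gradient_step(X[i:i + 1], y[i:i + 1], w)

# MGD: one update per mini-batch
batch_size = 16
for start in range(0, X.shape[0], batch_size):
    w = gradient_step(X[start:start + batch_size], y[start:start + batch_size], w)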



Previous Lecture Overview
Logistic Regression

[Figure: Perceptron vs. Adaline vs. Multi-Layer Perceptrons (Logistic Regression).]



Previous Lecture Overview
Logistic Regression

• Definition:
  • Given an input $x \in \mathbb{R}^{n_x}$, calculate the probability $\hat{y} = P(y = 1 \mid x)$, where $0 \le \hat{y} \le 1$.
• Parameters:
  • Weights: $w \in \mathbb{R}^{n_x}$
  • Bias: $b \in \mathbb{R}$
• Output:
  • $\hat{y} = \sigma(z) = \sigma(w^T x + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid activation function.

[Sigmoid plot, $\sigma(z) = \frac{1}{1 + e^{-z}}$: if z is a large positive number, σ(z) → 1; if z is a large negative number, σ(z) → 0.]
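A minimal NumPy sketch of this forward computation; the weights, bias, and input below are assumed toy values.

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}); maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy parameters and input for illustration
w = np.array([0.5, -1.2, 0.3])   # weights, shape (n_x,)
b = 0.1                          # bias
x = np.array([1.0, 0.5, 2.0])    # one input example, shape (n_x,)

z = w @ x + b                    # z = w^T x + b
y_hat = sigmoid(z)               # predicted probability P(y = 1 | x)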



Previous Lecture Overview
Logistic Regression

• The cost function is the average of all the cross-entropy losses:

  $\mathcal{J}(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$

• Goal:
  • Find w and b that minimize the cost function (total loss).

• Logistic regression can be viewed as a small neural network!
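A minimal NumPy sketch of this cost; the labels and predicted probabilities below are assumed toy values.

import numpy as np

def cross_entropy_cost(y_hat, y):
    # Average cross-entropy loss over m examples:
    # J = -(1/m) * sum( y*log(y_hat) + (1-y)*log(1-y_hat) )
    m = y.shape[0]
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m

# Assumed toy labels and predicted probabilities
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
print(cross_entropy_cost(y_hat, y))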



Previous Lecture Overview
Logistic Regression Convergence
• $\hat{y}^{(i)} = \sigma(z^{(i)}) = \sigma(w^T x^{(i)} + b)$, where $\sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}$

• $\mathcal{J}(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$

• Find w and b that minimize $\mathcal{J}(w, b)$.

[Figure: the cost surface J(w, b) plotted over w and b.]



Previous Lecture Overview
Logistic Regression Computation Graph

• A computation graph depicts all the computations required for a function along a forward path.
• For example: $J(x, y, z) = 4(x + yz)$

  Forward path (computation): $u = yz$, then $v = x + u$, then $J = 4v$
  Backward path (derivatives): computed right-to-left via the chain rule
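A minimal sketch of the forward and backward passes for this toy graph; the input values are assumptions.

# Toy computation graph: J(x, y, z) = 4 * (x + y*z)
# Forward pass computes the value; backward pass applies the chain rule.

x, y, z = 5.0, 3.0, 2.0      # assumed example inputs

# Forward path
u = y * z                    # u = y*z
v = x + u                    # v = x + u
J = 4.0 * v                  # J = 4v

# Backward path (derivatives of J with respect to each node)
dJ_dv = 4.0                  # dJ/dv
dJ_du = dJ_dv * 1.0          # dv/du = 1
dJ_dx = dJ_dv * 1.0          # dv/dx = 1
dJ_dy = dJ_du * z            # du/dy = z
dJ_dz = dJ_du * y            # du/dz = y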



Previous Lecture Overview
Logistic Regression Computation Graph

$z = w^T x + b$
$\hat{y} = a = \sigma(z)$
$\mathcal{L}(a, y) = -\big( y \log(a) + (1 - y) \log(1 - a) \big)$

[Graph: inputs x1, x2, w1, w2, b feed $z = w_1 x_1 + w_2 x_2 + b$, then $a = \sigma(z)$, then $\mathcal{L}(a, y)$.]



Previous Lecture Overview
Logistic Regression Computation Graph
[Graph annotation: the backward pass sends $-\frac{y}{a} + \frac{1-y}{1-a}$, $(a - y)$, and $x_1(a - y)$ back along the edges.]

• $\frac{\partial \mathcal{L}(a,y)}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}$

• $\frac{\partial \mathcal{L}(a,y)}{\partial z} = \frac{\partial \mathcal{L}(a,y)}{\partial a} \cdot \frac{\partial a}{\partial z} = \left( -\frac{y}{a} + \frac{1-y}{1-a} \right) a(1-a) = a - y$

• $\frac{\partial \mathcal{L}(a,y)}{\partial w_1} = \frac{\partial \mathcal{L}(a,y)}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w_1} = \left( -\frac{y}{a} + \frac{1-y}{1-a} \right) a(1-a) \, x_1 = x_1(a - y) = x_1 \frac{\partial \mathcal{L}(a,y)}{\partial z}$
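A minimal NumPy sketch of these gradients for a single training example; the feature values, parameters, and label are assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy example: two features, current parameters, and a label
x = np.array([1.0, 2.0])
w = np.array([0.1, -0.3])
b = 0.0
y = 1.0

# Forward pass
z = w @ x + b
a = sigmoid(z)

# Backward pass (chain rule, as derived above)
dz = a - y            # dL/dz = a - y
dw = x * dz           # dL/dw_j = x_j * (a - y)
db = dz               # dL/db = a - y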



1. Shallow neural network



1. Shallow neural network - One hidden layer Neural Network

• A 2-layer NN: 1 hidden layer

[Network diagram: inputs $x_1, x_2, x_3$ form the input layer ($a^{[0]} = X$); four hidden units $a^{[1]}_1, \dots, a^{[1]}_4$ form the hidden layer (Layer 1); a single output unit (Layer 2, the output layer) produces $\hat{y} = a^{[2]}$.]

$a^{[1]} = \begin{bmatrix} a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \\ a^{[1]}_4 \end{bmatrix}$


1. Shallow neural network - Computing NN's Output

[Diagram: a single neuron takes inputs $x_1, x_2, x_3$, computes $z$, applies $\sigma(z)$, and outputs $a = \hat{y}$.]

$z = w^T x + b$
$a = \sigma(z)$


1. Shallow neural network - Computing NN's Output

[Diagram: inputs $x_1, x_2, x_3$ feed hidden units $a^{[1]}_1, \dots, a^{[1]}_4$, which feed the output $\hat{y}$.]

$z^{[1]}_1 = w^{[1]T}_1 x + b^{[1]}_1, \quad a^{[1]}_1 = \sigma(z^{[1]}_1)$
$z^{[1]}_2 = w^{[1]T}_2 x + b^{[1]}_2, \quad a^{[1]}_2 = \sigma(z^{[1]}_2)$
$z^{[1]}_3 = w^{[1]T}_3 x + b^{[1]}_3, \quad a^{[1]}_3 = \sigma(z^{[1]}_3)$
$z^{[1]}_4 = w^{[1]T}_4 x + b^{[1]}_4, \quad a^{[1]}_4 = \sigma(z^{[1]}_4)$

Given input x (vectorized over the layer):

$z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]})$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = \sigma(z^{[2]})$
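A minimal NumPy sketch of this forward pass for a single example; the layer sizes (3 inputs, 4 hidden units, 1 output) match the diagram, while the random parameters are assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes: 3 inputs, 4 hidden units, 1 output (as in the diagram)
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3)) * 0.01   # W[1], shape (4, 3)
b1 = np.zeros((4, 1))                     # b[1], shape (4, 1)
W2 = rng.standard_normal((1, 4)) * 0.01   # W[2], shape (1, 4)
b2 = np.zeros((1, 1))                     # b[2], shape (1, 1)

x = rng.standard_normal((3, 1))           # one input example, shape (3, 1)

# Forward pass for one example
z1 = W1 @ x + b1          # z[1] = W[1] x + b[1]
a1 = sigmoid(z1)          # a[1] = sigma(z[1])
z2 = W2 @ a1 + b2         # z[2] = W[2] a[1] + b[2]
a2 = sigmoid(z2)          # a[2] = y_hat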



1. Shallow neural network - Vectorizing across multiple examples

• For all m examples:

  for i = 1 to m:
    $z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$
    $a^{[1](i)} = \sigma(z^{[1](i)})$
    $z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$
    $a^{[2](i)} = \sigma(z^{[2](i)})$


1. Shallow neural network - Vectorizing across multiple examples

Loop form:

  for i = 1 to m:
    $z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$
    $a^{[1](i)} = \sigma(z^{[1](i)})$
    $z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$
    $a^{[2](i)} = \sigma(z^{[2](i)})$

Stacking the examples as columns, $X = [\, x^{(1)} \; x^{(2)} \; \cdots \; x^{(m)} \,]$ and $A^{[1]} = [\, a^{[1](1)} \; a^{[1](2)} \; \cdots \; a^{[1](m)} \,]$, the loop becomes:

  $Z^{[1]} = W^{[1]} X + b^{[1]}$
  $A^{[1]} = \sigma(Z^{[1]})$
  $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
  $A^{[2]} = \sigma(Z^{[2]})$
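A minimal NumPy sketch of the vectorized forward pass over m examples; the sizes and random data are assumptions, and NumPy broadcasting handles adding the bias column-wise.

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

# Assumed sizes: n_x = 3 features, n_h = 4 hidden units, m = 5 examples
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))           # columns are examples x^(1) ... x^(m)
W1 = rng.standard_normal((4, 3)) * 0.01
b1 = np.zeros((4, 1))
W2 = rng.standard_normal((1, 4)) * 0.01
b2 = np.zeros((1, 1))

Z1 = W1 @ X + b1      # Z[1] = W[1] X + b[1]   -> shape (4, m)
A1 = sigmoid(Z1)      # A[1] = sigma(Z[1])
Z2 = W2 @ A1 + b2     # Z[2] = W[2] A[1] + b[2] -> shape (1, m)
A2 = sigmoid(Z2)      # A[2]: predicted probabilities for all m examples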



1. Shallow neural network - Activation functions

Comprehensive list of activation functions:
https://stats.stackexchange.com/questions/115258/comprehensive-list-of-activation-functions-in-neural-networks-with-pros-cons


1. Shallow neural network - Activation functions

Activation Function | Formula g(z)                                | Derivative g'(z)
sigmoid             | $a = \frac{1}{1 + e^{-z}}$                  | $a(1 - a)$
tanh                | $a = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ | $1 - a^2$
ReLU                | $\max(0, z)$                                | $0$ if $z < 0$; $1$ if $z \ge 0$
Leaky ReLU          | $\max(0.01z, z)$                            | $0.01$ if $z < 0$; $1$ if $z \ge 0$

[Plots: the sigmoid, tanh, ReLU, and Leaky ReLU curves as functions of z.]
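A minimal NumPy sketch of these activations and their derivatives, written in terms of z or of the activation a to match the table:

import numpy as np

# Activations and their derivatives, matching the table above.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(a):          # derivative in terms of a = sigmoid(z)
    return a * (1.0 - a)

def tanh_grad(a):             # derivative in terms of a = np.tanh(z)
    return 1.0 - a ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z >= 0).astype(float)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

def leaky_relu_grad(z):
    return np.where(z < 0, 0.01, 1.0)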
Minhhuy Le, ICSLab, Phenikaa Uni. 29
1. Shallow neural network - Why non-linear activation functions?

Given x:
  $z^{[1]} = W^{[1]} x + b^{[1]}$
  $a^{[1]} = g^{[1]}(z^{[1]})$
  $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
  $a^{[2]} = g^{[2]}(z^{[2]})$

• Why not linear? Suppose $g^{[1]}$ and $g^{[2]}$ are both linear (the identity):
  • $a^{[1]} = z^{[1]}$
  • $a^{[2]} = z^{[2]}$
  • $a^{[2]} = W^{[2]} a^{[1]} + b^{[2]} = W^{[2]} (W^{[1]} x + b^{[1]}) + b^{[2]} = W^{[2]} W^{[1]} x + W^{[2]} b^{[1]} + b^{[2]} = W' x + b'$
  • All LINEAR! The network collapses to a single linear model, no matter how many layers it has.
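A tiny numeric check of this collapse (the random matrices are assumptions): composing two linear layers gives exactly one linear layer with W' = W[2]W[1] and b' = W[2]b[1] + b[2].

import numpy as np

# Check that two linear layers collapse into one linear layer.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
x = rng.standard_normal((3, 1))

two_layers = W2 @ (W1 @ x + b1) + b2          # linear layer applied twice
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2      # equivalent single layer
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layers, one_layer))     # True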



1. Shallow neural network - Gradient descent for one hidden layer

Forward pass (computation graph: $x, W^{[1]}, b^{[1]} \rightarrow z^{[1]} \rightarrow a^{[1]}$; then $W^{[2]}, b^{[2]} \rightarrow z^{[2]} \rightarrow a^{[2]} \rightarrow \mathcal{L}$):

  $z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]})$
  $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = \sigma(z^{[2]}), \quad \mathcal{L}(a^{[2]}, y)$

Backward pass (for one example; $*$ denotes the element-wise product):

  $dz^{[2]} = a^{[2]} - y$
  $dW^{[2]} = dz^{[2]} \, a^{[1]T}$
  $db^{[2]} = dz^{[2]}$
  $dz^{[1]} = W^{[2]T} dz^{[2]} * g^{[1]\prime}(z^{[1]})$
  $dW^{[1]} = dz^{[1]} \, x^T$
  $db^{[1]} = dz^{[1]}$
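A minimal NumPy sketch of these forward and backward computations for a single example with a sigmoid hidden layer; the sizes and data are assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes: 3 inputs, 4 hidden units, 1 output; one training example
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1))
x = rng.standard_normal((3, 1))
y = np.array([[1.0]])

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

# Backward pass (sigmoid hidden layer, so g[1]'(z1) = a1 * (1 - a1))
dz2 = a2 - y                          # dz[2] = a[2] - y
dW2 = dz2 @ a1.T                      # dW[2] = dz[2] a[1]^T
db2 = dz2                             # db[2] = dz[2]
dz1 = (W2.T @ dz2) * a1 * (1 - a1)    # dz[1] = W[2]^T dz[2] * g[1]'(z[1])
dW1 = dz1 @ x.T                       # dW[1] = dz[1] x^T
db1 = dz1                             # db[1] = dz[1]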
1. Shallow neural network - Vectorizing Gradient Descent
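As a hedged sketch of the vectorized update, the per-example gradients above extend to m examples stacked as columns; a 1/m factor appears because the cost averages over the m examples. The shapes and the sigmoid hidden layer are assumptions.

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

# Vectorized gradients over m examples stacked as columns (a sketch; shapes
# assumed as before: X is (n_x, m), Y is (1, m), sigmoid hidden layer).
def backward(X, Y, W2, A1, A2):
    m = X.shape[1]
    dZ2 = A2 - Y                                # (1, m)
    dW2 = dZ2 @ A1.T / m                        # average over the m examples
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)          # element-wise times g[1]'(Z1)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2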



1. Shallow neural network - Initializing weights


1. Shallow neural network - Initialize weights RANDOMLY!

• W[1] = np.random.randn(2, 2) * 0.01
• Small random values are suggested!
  • If the weights are too large, Z[1] = W[1]X + b[1] will also be very large, a[1] = g[1](Z[1]) will land in the flat (saturated) regions of the activation, and gradient descent will be very, very slow.
• b[1] = np.zeros((2, 1)) (b can be initialized to zero, no problem!)
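A minimal sketch wrapping this initialization in a helper for a general 2-layer network; the function name and layer-size parameters are assumptions for illustration.

import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01):
    # Weights: small random values to avoid saturating the activations.
    # Biases: zeros are fine, since the random weights already break symmetry.
    W1 = np.random.randn(n_h, n_x) * scale
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * scale
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

params = initialize_parameters(n_x=3, n_h=4, n_y=1)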

