
Applied Deep Learning

Neural Network Basics

March 10th, 2020  http://adl.miulab.tw
2 Learning ≈ Looking for a Function
◉ Speech Recognition:      f( [audio] ) = "你好" ("hello")
◉ Handwritten Recognition: f( [image] ) = "2"
◉ Weather Forecast:        f( Thursday ) = "Saturday"
◉ Playing Video Games:     f( [game screen] ) = "move left"
3 Machine Learning Framework

Model: Hypothesis Function Set {f1, f2, ...}

function input  x : "It claims too much."
function output ŷ : - (negative)

Training Data {(x1, ŷ1), (x2, ŷ2), ...}
Training: Pick the best function f*  →  "Best" Function f*

Testing Data {(x, ?), ...}
Testing: f*(x) = y, e.g. y = + (positive)

Training is to pick the best function given the observed data.
Testing is to predict the label using the learned function.
4 How to Train a Model?
(How do we actually train a model in practice?)
5 Machine Learning Framework

Training Procedure: the model plus the training step.

Model: Hypothesis Function Set {f1, f2, ...}

function input  x : "It claims too much."
function output ŷ : - (negative)

Training Data {(x1, ŷ1), (x2, ŷ2), ...}
Training: Pick the best function f*  →  "Best" Function f*

Testing Data {(x, ?), ...}
Testing: f*(x) = y, e.g. y = + (positive)

Training is to pick the best function given the observed data.
Testing is to predict the label using the learned function.
6 Training Procedure

Model: Hypothesis Function Set {f1, f2, ...}
Training: Pick the best function f*
"Best" Function f*

◉ Q1. What is the model? (function hypothesis set)
◉ Q2. What does a "good" function mean?
◉ Q3. How do we pick the "best" function?
7 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
8 What is the Model?
(What is the model, exactly?)
9 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
10 Classification Task
◉ Sentiment Analysis → Binary Classification
  "這規格有誠意!" ("These specs show real effort!") → +
  "太爛了吧~" ("This is way too bad~") → -
  input object → Class A (yes) or Class B (no)
◉ Speech Phoneme Recognition → e.g. output /h/
◉ Handwritten Recognition → Multi-class Classification
  input object (e.g. an image of "2") → Class A, Class B, Class C, ...

Some cases are not easy to formulate as classification problems.
11 Target Function
◉ Classification Task
  f(x) = y,  f : R^N → R^M
  ○ x: input object to be classified → an N-dim vector
  ○ y: class/label → an M-dim vector

Assume both x and y can be represented as fixed-size vectors.
12 Vector Representation Example
◉ Handwriting Digit Classification f : R^N → R^M
  x: image
    Each pixel corresponds to an element in the vector: 1 for ink, 0 otherwise
    16 x 16 pixels → 16 x 16 = 256 dimensions
  y: class/label
    10 dimensions for digit recognition: one element per digit ("1" or not, "2" or not, "3" or not, ...)
    e.g. "1" → (1, 0, 0, ...),  "2" → (0, 1, 0, ...)
13 Vector Representation Example
◉ Sentiment Analysis f : R^N → R^M
  x: word, e.g. "love"
    Each element in the vector corresponds to a word in the vocabulary: 1 indicates the word, 0 otherwise
    dimensions = size of the vocabulary
  y: class/label
    3 dimensions (positive, negative, neutral): "+" or not, "-" or not, "?" or not
    e.g. "+" → (1, 0, 0),  "-" → (0, 1, 0)
14 Target Function
◉ Classification Task
  f(x) = y,  f : R^N → R^M
  ○ x: input object to be classified → an N-dim vector
  ○ y: class/label → an M-dim vector
Assume both x and y can be represented as fixed-size vectors
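As a sketch of these vector representations (assuming NumPy; the image content, vocabulary, and labels below are made-up examples, not from the slides):

```python
import numpy as np

# x for digit classification: flatten a 16 x 16 binary image (1 = ink) into 256 dims
image = np.zeros((16, 16))
image[2:14, 8] = 1                      # pretend a vertical stroke of ink
x_image = image.reshape(256)            # 16 x 16 = 256-dim input vector

# y: one-hot label over the 10 digit classes ("1" or not, "2" or not, ...)
digits = [str(d) for d in range(10)]
y_digit = np.zeros(len(digits))
y_digit[digits.index("2")] = 1          # the image is a "2"

# x for sentiment analysis: one-hot word over a tiny hypothetical vocabulary
vocab = ["love", "hate", "claims", "much"]
x_word = np.zeros(len(vocab))
x_word[vocab.index("love")] = 1         # 1 indicates the word, 0 otherwise

# y: 3 classes (positive, negative, neutral)
classes = ["+", "-", "?"]
y_sent = np.zeros(len(classes))
y_sent[classes.index("+")] = 1
```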


15 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
16 A Single Neuron

x1 --w1-->, x2 --w2-->, ..., xN --wN-->  (+)  → z → σ(z) → y     (σ: activation function)
 1 --b-->  (bias)

σ(z) = 1 / (1 + e^(-z))      Sigmoid function

Each neuron is a very simple function.
17 A Single Neuron

x1 --w1-->, x2 --w2-->, ..., xN --wN-->  (+)  → z → σ(z) → y     (σ: activation function)
 1 --b-->  (bias)

σ(z) = 1 / (1 + e^(-z))      Sigmoid function

The bias term is an "always on" feature.
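As a sketch (assuming NumPy; the input, weights, and bias below are made up), a single neuron is just a weighted sum plus a bias, passed through the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # sigma(z) = 1 / (1 + e^-z)

def neuron(x, w, b):
    z = np.dot(w, x) + b                      # z = w1*x1 + ... + wN*xN + b
    return sigmoid(z)                         # y = sigma(z)

x = np.array([0.5, -1.0, 2.0])                # hypothetical input
w = np.array([0.3, 0.8, -0.2])                # hypothetical weights
b = 0.1                                       # the "always on" bias feature
print(neuron(x, w, b))
```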
18 Why Bias?

[Figure: σ(z) plotted against z; the bias b shifts where the curve activates]

The bias term gives a class prior.
19 Model Parameters of A Single Neuron

x1 --w1-->, x2 --w2-->, ..., xN --wN-->  (+)  → z → σ(z) → y
 1 --b-->  (bias)

σ(z) = 1 / (1 + e^(-z))

w, b are the parameters of this neuron.
20 A Single Neuron    f : R^N → R

x1 --w1-->, x2 --w2-->, ..., xN --wN-->  (+)  → z → σ(z) → y
 1 --b-->  (bias)

is "2"       if y ≥ 0.5
is not "2"   if y < 0.5

A single neuron can only handle binary classification.
21 A Layer of Neurons
◉ Handwriting digit classification f : R^N → R^M

x1, x2, ..., xN (plus a bias input of 1) → (+) → y1  "1" or not
                                           (+) → y2  "2" or not      Which one is max?
                                           (+) → y3  "3" or not
                                           ...
                                           10 neurons / 10 classes

A layer of neurons can handle multiple possible outputs,
and the result depends on the max one.
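A sketch of such a layer, assuming NumPy and 10 digit classes; the weights below are random placeholders, not trained values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    return sigmoid(W @ x + b)                 # one output per neuron/class

rng = np.random.default_rng(0)
N, M = 256, 10                                # 16x16 image, 10 digit classes
W = rng.normal(size=(M, N))                   # hypothetical weights (10 neurons)
b = np.zeros(M)
x = rng.random(N)                             # hypothetical flattened image

y = layer(x, W, b)                            # y1 = "1" or not, y2 = "2" or not, ...
print("predicted digit index:", np.argmax(y)) # the result depends on the max one
```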
22 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
23 A Layer of Neurons – Perceptron
◉ Output units all operate separately – no shared weights

x1, x2, ..., xN (and a bias input of 1) → (+) → y1, (+) → y2, (+) → y3

Adjusting the weights moves the location, orientation, and steepness of the cliff.

http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
24 Expression of Perceptron

x1 --w1-->, x2 --w2--> (+) → z → y,  with bias b

A perceptron can represent AND, OR, NOT, etc., but not XOR → it is a linear separator.

http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
25 How to Implement XOR?

A xor B = AB' + A'B

  A  B | A xor B
  0  0 |    0
  0  1 |    1
  1  0 |    1
  1  1 |    0

[Figure: a two-stage circuit that combines the intermediate signals A + B' and A' + B to produce AB' + A'B]

Multiple operations can produce more complicated outputs.
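One way to realize this is a two-layer network of threshold units; the weights below are hand-picked by me (not taken from the slide's circuit) but reproduce the truth table above:

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)             # threshold unit

def xor(a, b):
    x = np.array([a, b])
    # hidden layer: h1 ~ (A and not B), h2 ~ (not A and B)
    W1 = np.array([[ 1.0, -1.0],
                   [-1.0,  1.0]])
    b1 = np.array([-0.5, -0.5])
    h = step(W1 @ x + b1)
    # output layer: OR of the two hidden units -> AB' + A'B
    return step(np.array([1.0, 1.0]) @ h - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor(a, b)))           # reproduces the truth table
```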
26 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
27 Neural Networks – Multi-Layer Perceptron

x1, x2 (and a bias of 1) → hidden units (z1 → a1, z2 → a2) → output unit (z → y)
28 Expression of Multi-Layer Perceptron
◉ Continuous function w/ 2 layers
  ○ Combine two opposite-facing threshold functions to make a ridge
◉ Continuous function w/ 3 layers
  ○ Combine two perpendicular ridges to make a bump
  ○ Add bumps of various sizes and locations to fit any surface

Multiple layers enhance the model's expressiveness
→ the model can approximate more complex functions.

http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
29 Deep Neural Networks (DNN)
◉ Fully connected feedforward network f : R^N → R^M

input vector x = (x1, x2, ..., xN) → Layer 1 → Layer 2 → ... → Layer L → output vector y = (y1, y2, ..., yM)

Deep NN: multiple hidden layers
30 Notation Definition
◉ Output of a neuron: a_i^l denotes the output of neuron i at layer l
  (layer l-1 has N_(l-1) nodes, layer l has N_l nodes)
  The outputs of one layer form a vector a^l.
31 Notation Definition
◉ w_ij^l denotes the weight from neuron j (at layer l-1) to neuron i (at layer l)
  The weights between two layers form a matrix W^l.
32 Notation Definition
◉ b_i^l : bias for neuron i at layer l
  The biases of all neurons at each layer form a vector b^l.
33 Notation Definition
◉ z_i^l : input of the activation function for neuron i at layer l
  The activation function inputs at each layer form a vector z^l.
34 Notation Summary
◉ a_i^l : output of a neuron                        w_ij^l : a weight
◉ a^l : output vector of a layer                    W^l : a weight matrix
◉ z_i^l : input of the activation function          b_i^l : a bias
◉ z^l : input vector of the activation function     b^l : a bias vector
       for a layer
35 Layer Output Relation

a^(l-1) (layer l-1, N_(l-1) nodes) → z^l → a^l (layer l, N_l nodes)
36 Layer Output Relation – from a to z

z_i^l = Σ_j w_ij^l a_j^(l-1) + b_i^l        →        z^l = W^l a^(l-1) + b^l
37 Layer Output Relation – from z to a

a_i^l = σ(z_i^l)

(a_1^l, a_2^l, ..., a_i^l, ...) = σ applied element-wise to (z_1^l, z_2^l, ..., z_i^l, ...)        →        a^l = σ(z^l)
38 Layer Output Relation

z^l = W^l a^(l-1) + b^l
a^l = σ(z^l)
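These two relations translate directly into code. A sketch assuming NumPy, with W^l of shape N_l × N_(l-1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b):
    """One layer: a_prev is a^(l-1); returns (z^l, a^l)."""
    z = W @ a_prev + b          # z^l = W^l a^(l-1) + b^l
    a = sigmoid(z)              # a^l = sigma(z^l)
    return z, a
```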
39 Neural Network Formulation
◉ Fully connected feedforward network f : R^N → R^M

input vector x = (x1, ..., xN) → Layer 1 → Layer 2 → ... → Layer L → output vector y = (y1, ..., yM)

a^1 = σ(W^1 x + b^1)
a^2 = σ(W^2 a^1 + b^2)
...
y = σ(W^L a^(L-1) + b^L)
40 Neural Network Formulation
◉ Fully connected feedforward network f : R^N → R^M

y = f(x) = σ(W^L ⋯ σ(W^2 σ(W^1 x + b^1) + b^2) ⋯ + b^L)
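Stacking the layer relation L times gives the whole feedforward function. A self-contained sketch with randomly initialized placeholder parameters and assumed layer sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """f(x) = sigma(W^L ... sigma(W^2 sigma(W^1 x + b^1) + b^2) ... + b^L)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)              # one layer at a time
    return a

rng = np.random.default_rng(0)
sizes = [256, 128, 64, 10]                  # N, two hidden layers, M (assumed sizes)
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

y = forward(rng.random(256), weights, biases)   # an M-dim output vector
```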
41 Activation Function

[Figure: the activation function is a bounded function]

42 Activation Function

[Figure: candidate activation functions: boolean (step), linear, non-linear]
43 Non-Linear Activation Function
◉ Sigmoid
◉ Tanh
◉ Rectified Linear Unit (ReLU)

Non-linear functions are frequently used in neural networks.
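For reference, a sketch of the three activations above using their standard definitions (assuming NumPy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                      # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)              # 0 for z < 0, identity for z >= 0
```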
44 Why Non-Linearity?
◉ Function approximation
  ○ Without non-linearity, a deep neural network is equivalent to a single linear transform
  ○ With non-linearity, networks with more layers can approximate more complex functions

http://cs224d.stanford.edu/lectures/CS224d-Lecture4.pdf
45 What does a "Good" Function Mean?
(What makes a function "good"?)
46 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
47 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
48 Function = Model Parameters

Different parameters W and b in the function set → different functions

◉ Formal definition
  ○ model parameter set θ = {W^1, b^1, W^2, b^2, ..., W^L, b^L}
  ○ picking a function f = picking a set of model parameters θ
49 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
50 Model Parameter Measurement
◉ Define a function to measure the quality of a parameter set θ
  ○ Evaluating by a loss/cost/error function C(θ) → how bad θ is
    Best model parameter set: θ* = argmin_θ C(θ)
  ○ Evaluating by an objective/reward function O(θ) → how good θ is
    Best model parameter set: θ* = argmax_θ O(θ)
51 Loss Function Example

function input  x : "It claims too much."
function output ŷ : - (negative)

Model: Hypothesis Function Set {f1, f2, ...}
Training Data {(x1, ŷ1), (x2, ŷ2), ...}
Training: Pick the best function f*  →  "Best" Function f*

A "good" function: its outputs on the training samples should be close to the given labels ŷ.
Define an example loss function C(θ) as the sum over the error of all training samples.
52 Frequent Loss Function
◉ Square loss

◉ Hinge loss

◉ Logistic loss

◉ Cross entropy loss

◉ Others: large margin, etc.

https://en.wikipedia.org/wiki/Loss_functions_for_classification
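Two of the losses listed above as a NumPy sketch (standard definitions for a single training pair; the variable names are mine):

```python
import numpy as np

def square_loss(y, y_hat):
    """Square loss between prediction y and target y_hat."""
    return np.sum((y - y_hat) ** 2)

def cross_entropy_loss(y, y_hat, eps=1e-12):
    """Cross entropy between predicted distribution y and one-hot target y_hat."""
    return -np.sum(y_hat * np.log(y + eps))

y = np.array([0.7, 0.2, 0.1])       # hypothetical predicted scores
y_hat = np.array([1.0, 0.0, 0.0])   # target: class "+"
print(square_loss(y, y_hat), cross_entropy_loss(y, y_hat))
```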
53 How can we Pick the "Best" Function?
(How do we find the "best" function?)
54 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
55 Problem Statement
◉ Given a loss function and several model parameter sets
  ○ Loss function: C(θ)
  ○ Model parameter sets: θ1, θ2, ...
◉ Find the model parameter set θ* that minimizes C(θ)

How to solve this optimization problem?
◉ 1) Brute force – enumerate all possible θ
◉ 2) Calculus – solve for the θ where the derivative of C(θ) is zero

Issue: the whole space of C(θ) is unknown
56 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
57 Gradient Descent for Optimization
◉ Assume that θ has only one variable
  θ^i : the model parameter at the i-th iteration

[Figure: the curve C(θ) with a ball rolling downhill through θ^0, θ^1, θ^2, θ^3]

Idea: drop a ball and find the position where the ball stops rolling (a local minimum)
58 Gradient Descent for Optimization
◉ Assume that θ has only one variable
  ○ Randomly start at θ^0
  ○ Compute dC(θ^0)/dθ:  θ^1 = θ^0 - η · dC(θ^0)/dθ
  ○ Compute dC(θ^1)/dθ:  θ^2 = θ^1 - η · dC(θ^1)/dθ
  ○ ...
  η is the "learning rate"
59 Gradient Descent for Optimization
◉ Assume that θ has two variables {θ1, θ2}

60 Gradient Descent for Optimization
◉ Assume that θ has two variables {θ1, θ2}
  • Randomly start at θ^0 = (θ1^0, θ2^0)
  • Compute the gradients of C(θ) at θ^0: ∇C(θ^0) = (∂C/∂θ1, ∂C/∂θ2) evaluated at θ^0
  • Update parameters: θ^1 = θ^0 - η ∇C(θ^0)
  • Compute the gradients of C(θ) at θ^1, update again, and repeat
61 Gradient Descent for Optimization

[Figure: contour plot of C over (θ1, θ2); at each θ^i the gradient ∇C(θ^i) points uphill and the movement is in the opposite direction, through θ^0, θ^1, θ^2, θ^3]

Algorithm
  Initialization: start at θ^0
  while(θ^(i+1) ≠ θ^i)
  {
    compute gradient at θ^i
    update parameters: θ^(i+1) = θ^i - η ∇C(θ^i)
  }
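To make the loop concrete, a small NumPy sketch of gradient descent on a made-up two-variable loss (the quadratic C below is my own example, not from the slides):

```python
import numpy as np

def C(theta):                                    # hypothetical loss surface
    return (theta[0] - 3.0) ** 2 + 2.0 * (theta[1] + 1.0) ** 2

def grad_C(theta):                               # its gradient
    return np.array([2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)])

eta = 0.1                                        # learning rate
theta = np.array([0.0, 0.0])                     # starting point theta^0
for i in range(100):
    theta = theta - eta * grad_C(theta)          # theta^(i+1) = theta^i - eta * grad C(theta^i)
print(theta, C(theta))                           # approaches the minimum at (3, -1)
```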
62 Revisit Neural Network Formulation
◉ Fully connected feedforward network f : R^N → R^M

input vector x = (x1, x2, ..., xN) → Layer 1 → Layer 2 → ... → Layer L → output vector y = (y1, y2, ..., yM)
63 Gradient Descent for Neural Network

Algorithm
  Initialization: start at θ^0
  while(θ^(i+1) ≠ θ^i)
  {
    compute gradient at θ^i
    update parameters
  }
64 Gradient Descent for Optimization
Simple Case

A single sigmoid neuron: x1 --w1-->, x2 --w2--> (+) → z → σ(z) → y, with bias b

Algorithm
  Initialization: start at θ^0
  while(θ^(i+1) ≠ θ^i)
  {
    compute gradient at θ^i
    update parameters
  }
65 Gradient Descent for Optimization
Simple Case – Three Parameters & Square Error Loss
◉ Update the three parameters w1, w2, b at the t-th iteration
◉ Square error loss
66 Gradient Descent for Optimization
Simple Case – Square Error Loss
◉ Square error loss: C(θ) = (y - ŷ)², where y = σ(z) and z = w1·x1 + w2·x2 + b

67 Gradient Descent for Optimization
Simple Case – Square Error Loss
◉ Apply the chain rule with the sigmoid function, using σ'(z) = σ(z)(1 - σ(z)):
  ∂C/∂w1 = 2(y - ŷ) · σ'(z) · x1

68 Gradient Descent for Optimization
Simple Case – Square Error Loss
◉ Likewise for the other parameters:
  ∂C/∂w2 = 2(y - ŷ) · σ'(z) · x2,   ∂C/∂b = 2(y - ŷ) · σ'(z)

69 Gradient Descent for Optimization
Simple Case – Three Parameters & Square Error Loss
◉ Update the three parameters of the neuron (x1 --w1-->, x2 --w2--> (+) → z → σ(z) → y, bias b) at the t-th iteration:
  w1^(t+1) = w1^t - η·∂C/∂w1,   w2^(t+1) = w2^t - η·∂C/∂w2,   b^(t+1) = b^t - η·∂C/∂b
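Putting the pieces together (sigmoid, square error loss, chain rule), a training sketch for this single neuron; the data point, starting values, and learning rate are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical training pair and initial parameters
x = np.array([1.0, -0.5])        # x1, x2
y_hat = 1.0                      # target label
w = np.array([0.0, 0.0])
b = 0.0
eta = 0.5                        # learning rate

for t in range(200):
    z = w @ x + b
    y = sigmoid(z)
    # C = (y - y_hat)^2, chain rule with sigma'(z) = sigma(z)(1 - sigma(z))
    dC_dz = 2.0 * (y - y_hat) * y * (1.0 - y)
    w = w - eta * dC_dz * x      # dC/dwi = dC/dz * xi
    b = b - eta * dC_dz          # dC/db  = dC/dz
print(sigmoid(w @ x + b))        # close to the target 1.0
```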
70 Optimization Algorithm

Algorithm
  Initialization: set the parameters θ, b at random
  while(stopping criteria not met)
  {
    for each training sample (x, ŷ), compute the gradient and update the parameters
  }
71 Gradient Descent for Neural Network

Algorithm
  Initialization: start at θ^0
  while(θ^(i+1) ≠ θ^i)
  {
    compute gradient at θ^i
    update parameters
  }

Computing the gradient involves millions of parameters.
To compute it efficiently, we use backpropagation.
72 Gradient Descent Issue

Training Data {(x1, ŷ1), (x2, ŷ2), ...}

The model can only be updated after seeing all training samples → slow
73 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
74 Stochastic Gradient Descent (SGD)
◉ Gradient Descent: update with the gradient of the loss summed over all training samples
◉ Stochastic Gradient Descent (SGD)
  ○ Pick a training sample x^k from the Training Data {(x1, ŷ1), (x2, ŷ2), ...}
  ○ Update using only the loss of that sample
  ○ If all training samples have the same probability of being picked, the expected update matches gradient descent

The model can be updated after seeing one training sample → faster
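A generic SGD loop sketch; grad_C_single is a hypothetical callback that returns the gradient of the loss on one training sample:

```python
import numpy as np

def sgd(theta, samples, grad_C_single, eta=0.01, epochs=10):
    """samples: list of (x, y_hat) pairs; grad_C_single(theta, x, y_hat) -> gradient."""
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for k in rng.permutation(len(samples)):      # every sample equally likely
            x, y_hat = samples[k]
            theta = theta - eta * grad_C_single(theta, x, y_hat)  # update per sample
    return theta
```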
75 Epoch Definition
◉ When running SGD, the model starts from θ^0
  Training Data {(x1, ŷ1), (x2, ŷ2), ...}
  pick x1 → update
  pick x2 → update
  ...
  pick xk → update
  ...
  pick xK → update      seeing all training samples once → one epoch
  pick x1 → update      (the next epoch begins)
76 Gradient Descent v.s. SGD
◉ Gradient Descent
  ✓ Update after seeing all examples
◉ Stochastic Gradient Descent
  ✓ If there are 20 examples, update 20 times in one epoch

[Figure: trajectories on the loss surface; gradient descent sees all examples per update, SGD sees only one example per update, over 1 epoch]

SGD approaches the target point faster than gradient descent
77 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
78 Mini-Batch SGD
◉ Batch Gradient Descent
  ○ Use all K samples in each iteration
◉ Stochastic Gradient Descent (SGD)
  ○ Pick a training sample x^k
  ○ Use 1 sample in each iteration
◉ Mini-Batch SGD
  ○ Pick a set of B training samples as a batch b (B is the "batch size")
  ○ Use all B samples in each iteration
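A mini-batch SGD loop sketch; grad_C_batch is a hypothetical callback returning the gradient of the loss over one batch:

```python
import numpy as np

def minibatch_sgd(theta, X, Y, grad_C_batch, eta=0.01, B=32, epochs=10):
    """X: (K, N) inputs, Y: (K, M) targets; grad_C_batch works on a whole batch."""
    rng = np.random.default_rng(0)
    K = len(X)
    for _ in range(epochs):
        order = rng.permutation(K)                 # shuffle before every epoch
        for start in range(0, K, B):
            idx = order[start:start + B]           # one batch of (up to) B samples
            theta = theta - eta * grad_C_batch(theta, X[idx], Y[idx])
    return theta
```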
79 Mini-Batch SGD
80 Batch v.s. Mini-Batch
Handwriting Digit Classification

[Figure: training comparison between batch size = 1 and full-batch gradient descent on the handwriting digit task]
81 Gradient Descent v.s. SGD v.s. Mini-Batch

Training speed: mini-batch > SGD > gradient descent

[Figure: training time (sec) versus batch size, from 1 (SGD) through 10, 100, 1000, 10000 to the full set (gradient descent)]

Why is mini-batch faster than SGD?
82 SGD v.s. Mini-Batch
◉ Stochastic Gradient Descent (SGD)
  ○ compute z¹ = W¹x for one sample at a time: many matrix-vector multiplications
◉ Mini-Batch SGD
  ○ stack the batch of input vectors into a matrix X and compute Z¹ = W¹X in one step: a single matrix-matrix multiplication

Modern computers run matrix-matrix multiplication faster than
repeated matrix-vector multiplication.
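A rough, hypothetical way to observe this with NumPy: compare many matrix-vector products against one matrix-matrix product over the stacked batch. Exact numbers depend on your machine:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(1000, 1000))
X = rng.normal(size=(1000, 256))          # a batch of B = 256 input vectors as columns

t0 = time.perf_counter()
for i in range(X.shape[1]):               # SGD-style: one matrix-vector product per sample
    z = W1 @ X[:, i]
t1 = time.perf_counter()
Z = W1 @ X                                # mini-batch style: one matrix-matrix product
t2 = time.perf_counter()

print(f"256 matrix-vector products: {t1 - t0:.4f}s, one matrix-matrix product: {t2 - t1:.4f}s")
```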
83 Big Issue: Local Optima

Neural network training has no guarantee of reaching the globally optimal solution.
84 Training Procedure Outline

◉ Model Architecture
  ✓ A Single Layer of Neurons (Perceptron)
  ✓ Limitation of Perceptron
  ✓ Neural Network Model (Multi-Layer Perceptron)
◉ Loss Function Design
  ✓ Function = Model Parameters
  ✓ Model Parameter Measurement
◉ Optimization
  ✓ Gradient Descent
  ✓ Stochastic Gradient Descent (SGD)
  ✓ Mini-Batch SGD
  ✓ Practical Tips
85 Initialization
◉ Different initialization parameters may result in different trained models

Do not initialize the parameters equally → set them randomly
86 Learning Rate

[Figure: cost versus number of parameter updates for different learning rates (very large, large, small, just right), shown next to the error surface]

The learning rate should be set carefully.
87 Tips for Mini-Batch Training
◉ Shuffle training samples before every epoch
  ○ otherwise the network might memorize the order in which you feed the samples
◉ Use a fixed batch size for every epoch
  ○ enables fast matrix multiplication for the calculations
◉ Adapt the learning rate to the batch size
  ○ larger batch → smaller learning rate

http://stackoverflow.com/questions/13693966/neural-net-selecting-data-for-each-mini-
88 Learning Recipe

Training Data (x, ŷ) → "Best" Function f*
Testing Data: Validation (x, y) and Real Testing (x, y)

89 Learning Recipe

Training Data (x, ŷ) → "Best" Function f*
Testing Data:
  Validation (x, y): we immediately know the performance
  Real Testing (x, y): we do not know the performance until submission
90 Learning Recipe

Do we get good results on the training set?
  no → modify the training process

◉ Possible reasons
  ○ no good function exists: bad hypothesis function set
    → reconstruct the model architecture
  ○ cannot find a good function: stuck in local optima
    → change the training strategy
91 Learning Recipe

Good results on the training set?
  no  → modify the training process
  yes → Good results on the dev/validation set?
          yes → done
          no  → prevent overfitting

Better performance on training but worse performance on dev → overfitting
92 Overfitting
◉ Possible solutions
  ○ more training samples
  ○ some tips: dropout, etc.
93 Concluding Remarks
◉ Q1. What is the model? → Model Architecture
◉ Q2. What does a "good" function mean? → Loss Function Design
◉ Q3. How do we pick the "best" function? → Optimization

Model: Hypothesis Function Set {f1, f2, ...}
Training: Pick the best function f*
"Best" Function f*