
Deep Learning Tutorial

李宏毅 (Hung-yi Lee)

Speech Processing and Machine Learning Lab


Outline

• Lecture I: Introduction of Deep Learning

• Lecture II: Practical Tips for Deep Learning

• Lecture III: Convolutional Neural Network (CNN)


Lecture I:
Introduction of
Deep Learning
Machine Learning
≈ Looking for a Function

• Speech Recognition   f( audio clip ) = “How are you”

• Image Recognition    f( image ) = “Cat”

• Playing Go           f( board position ) = “5-5” (next move)

• Chat-bot             f( “Hi” ) = “Hello”
                          (the user input)  (system response)
Image Recognition: Framework        f( image ) = “cat”

A set of functions (the Model): f1, f2, …

f1( cat image ) = “cat”     f2( cat image ) = “money”
f1( dog image ) = “dog”     f2( dog image ) = “snake”
Image Recognition: Framework        f( image ) = “cat”

Goodness of a function f: judged with supervised learning, using training
data consisting of function inputs (images) and the corresponding function
outputs (labels such as “monkey”, “cat”, “dog”).  A function that fits the
training data better is better.

Image Recognition: Framework        f( image ) = “cat”

Training:
  Step 1: a set of functions (the Model) f1, f2, …
  Step 2: goodness of each function, measured on the training data
          (“monkey”, “cat”, “dog”, …)
  Step 3: pick the “best” function f*
Testing:
  Using f*, e.g. f*( new image ) = “cat”
Three Steps for Deep Learning

• Step 1: define a set of functions


Neural Network

• Step 2: goodness of function

• Step 3: pick the best function


Neural Network
Neuron

z = a1 w1 + ⋯ + ak wk + ⋯ + aK wK + b,   a = σ(z)

A neuron is a simple function: the inputs a1 … aK are multiplied by the
weights w1 … wK, the bias b is added, and the weighted sum z is passed
through an activation function σ to produce the output a.
Neural Network

Sigmoid Function:  σ(z) = 1 / (1 + e^(−z))

A common choice of activation function is the sigmoid.  Example: a neuron
with inputs (1, −1), weights (1, −2) and bias 1 computes
z = 1·1 + (−1)·(−2) + 1 = 4, and outputs σ(4) ≈ 0.98.
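To make this concrete, here is a minimal numpy sketch of a single neuron (the function name `neuron` is just for illustration); the values reproduce the example above:

```python
import numpy as np

def neuron(a, w, b):
    """A single neuron: weighted sum of the inputs plus a bias,
    passed through the sigmoid activation function."""
    z = np.dot(w, a) + b
    return 1.0 / (1.0 + np.exp(-z))

# inputs (1, -1), weights (1, -2), bias 1  ->  z = 4, sigmoid(4) ~ 0.98
print(neuron(np.array([1, -1]), np.array([1, -2]), 1))
```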
Neural Network

Different connections between the neurons lead to different network
structures.  The neurons have different values of weights and biases.
Fully Connected Feedforward Network

Worked example: the input is (1, −1).  The first hidden layer has two
neurons with weights (1, −2) and (−1, 1) and biases 1 and 0, so it outputs
σ(4) = 0.98 and σ(−2) = 0.12.  Passing these values through the next
layers gives (0.86, 0.11) and finally (0.62, 0.83).

With input (0, 0), the same network outputs (0.73, 0.50), then
(0.72, 0.12), and finally (0.51, 0.85).

This is a function: input vector in, output vector out.
Given a network structure, we have defined a function set.

Fully Connected Feedforward Network

Input layer: x1, x2, …, xN
Hidden layers: Layer 1, Layer 2, …, Layer L, each made of neurons
Output layer: y1, y2, …, yM

Every neuron of one layer is connected to every neuron of the next layer
(“fully connected”), and the signal only flows forward from input to
output (“feedforward”).
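A minimal sketch of the forward pass through such a network (the weight matrices and bias vectors are placeholders; the usage example reproduces the first layer of the worked example above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Feedforward pass: at every layer, multiply by the weight matrix,
    add the bias vector, and apply the activation function."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W.dot(a) + b)
    return a

# first layer of the earlier example: input (1, -1) -> about (0.98, 0.12)
W1 = np.array([[1.0, -2.0], [-1.0, 1.0]])
b1 = np.array([1.0, 0.0])
print(forward(np.array([1.0, -1.0]), [W1], [b1]))
```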

Deep means many hidden layers

Deep = Many hidden layers

ImageNet error rate vs. depth
(figure from http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf):

  AlexNet (2012)        8 layers    16.4%
  VGG (2014)           19 layers     7.3%
  GoogleNet (2014)     22 layers     6.7%
  Residual Net (2015) 152 layers     3.57%  (special structure; more layers
                                             than Taipei 101 has floors)
Example Application
• Handwriting Digit Recognition

Input: a 16 x 16 image = 256 pixels, represented as a 256-dimensional
vector x1 … x256 (ink → 1, no ink → 0).

Output: a 10-dimensional vector y1 … y10, where each dimension represents
the confidence of a digit, e.g. y1 = 0.1 (“is 1”), y2 = 0.7 (“is 2”), …,
y10 = 0.2 (“is 0”).  Since y2 is the largest, the image is recognized as “2”.

What is needed is a neural network (the machine) implementing this
function — input: 256-dim vector, output: 10-dim vector.
Example Application

A network with input layer x1 … x256, several hidden layers, and output
layer y1 … y10 is a function set containing the candidate functions for
handwriting digit recognition.

You need to decide the network structure so that your function set
contains a good function.
FAQ

• Q: How many layers?  How many neurons for each layer?
  Trial and Error + Intuition
• Q: Can we design the network structure?
  Yes — Convolutional Neural Network (CNN), in the following lecture.
• Q: Can the structure be automatically determined?
  Yes, but not widely studied yet.
Three Steps for Deep Learning

• Step 1: define a set of functions

• Step 2: goodness of function

• Step 3: pick the best function


Training Data
• Preparing training data: images and their labels

“5” “0” “4” “1”

“9” “2” “1” “3”

The learning target is defined on


the training data.
Learning Target

The 16 x 16 = 256 pixel image (ink → 1, no ink → 0) is fed into the
network as x1 … x256, and the output layer uses Softmax.

The learning target is defined on the training data:
• Input an image of “1”  ⇒ y1 has the maximum value
• Input an image of “2”  ⇒ y2 has the maximum value
• …

Loss

A good function should make the loss of all examples as small as possible.

Given a set of parameters, feed in a training image (e.g. a “1”) and
compare the network output y1 … y10 with the target, which for “1” is the
one-hot vector (1, 0, 0, …, 0).

The loss can be the square error or the cross entropy between the network
output and the target.
Total Loss

For all R training examples x1 … xR, the total loss L is the sum of the
losses of the individual examples: L = l1 + l2 + ⋯ + lR.

We want L to be as small as possible: find the function in the function
set that minimizes the total loss L.
Three Steps for Deep Learning

• Step 1: define a set of functions

• Step 2: goodness of function

• Step 3: pick the best function


Gradient Descent

• Gradient descent never guarantees reaching the global minimum.
  Different initial points reach different minima, and hence give
  different results.
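A minimal sketch of the update rule behind gradient descent (the toy loss and learning rate are only for illustration):

```python
def gradient_descent(grad_L, w, eta=0.1, steps=100):
    """Repeatedly move the parameter against the gradient of the loss.
    Where it ends up depends on the starting point, so only a local
    minimum is guaranteed."""
    for _ in range(steps):
        w = w - eta * grad_L(w)
    return w

# toy example: L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3)
print(gradient_descent(lambda w: 2 * (w - 3), w=0.0))   # ends up close to 3.0
```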
You are playing Age of Empires: you cannot see the whole map, so you can
only decide where to go from what is around you — just like gradient
descent on the loss surface.

Gradient Descent

This is the “learning” of machines in deep learning …… even AlphaGo uses
this approach.  What people imagine the learning to be and what it
actually is are quite different — I hope you are not too disappointed :p


Backpropagation

Backpropagation is the standard way to compute the gradients efficiently;
toolkits such as libdnn (by NTU student Po-wei Chou) take care of it for you.

Ref: https://www.youtube.com/watch?v=ibJpTrp5mcE
Three Steps for Deep Learning
• Step 1: define a set of functions
• Step 2: goodness of function
• Step 3: pick the best function

Deep Learning is so simple ……

Now, if you want to find a function, and you have some input/output
examples of that function to use as training data, you can use deep
learning.
Spoken Question Answering
• TOEFL Listening Comprehension Test by Machine
• Example:
  Audio Story: (The original story is 5 min long.)
  Question: “What is a possible origin of Venus’ clouds?”
  Choices:
  (A) gases released as a result of volcanic activity
  (B) chemical reactions caused by high surface temperatures
  (C) bursts of radio energy from the planet’s surface
  (D) strong winds that blow dust into the atmosphere
Spoken Question Answering

The audio story is transcribed by ASR; the transcriptions, the question
(“What is a possible origin of Venus’ clouds?”) and the 4 choices are fed
into a neural network, which outputs the answer, e.g. (A).

The questions and answers of previous exams are used to train the network.
Spoken Question Answering

What we have learned today is only the tip of the iceberg — the networks
actually used for this task are more sophisticated.
Spoken Question Answering

Naive approaches (1)–(7) — for example (2) select the shortest choice as
the answer, or (4) select the choice whose semantics are most similar to
the other choices’ — achieve accuracies around the level of random
guessing.

Results (accuracy):
  Naive approaches (1)–(7):  around random guessing
  Memory Network:            39.2%
  Word-based Attention:      48.8%
Corpus & Code for
TOEFL Listening Comprehension Test by Machine
https://github.com/sunprinceS/Hierarchical-Attention-Model
Example Application
• Handwriting Digit Recognition

A 28 x 28 image goes into the machine, which outputs the digit, e.g. “1”.

MNIST Data: http://yann.lecun.com/exdb/mnist/
“Hello world” for deep learning

Keras
If you want to learn Theano instead:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html

TensorFlow or Theano: very flexible, but you need some effort to learn them.

Keras: an interface to TensorFlow or Theano — easy to learn and use, while
still having some flexibility; you can modify it if you can write
TensorFlow or Theano.

Keras documentation: https://keras.io/
Deep Learning Graduate Student (using Keras)

How my friends see me How my family sees me How society sees me

How my professor sees me How I see myself How it really is

Modified from the figure of


沈昇勳 (Sheng-syun Shen)
How you feel when you use Keras

Keras

The example network: the 28 x 28 input image is flattened into a
784-dimensional vector, followed by two fully connected hidden layers of
500 neurons each and a Softmax output layer of 10 neurons y1 … y10.
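The Keras code for this network looked roughly like the sketch below; this is a reconstruction using the Keras 1.x-era API of the tutorial, not the exact code from the slide:

```python
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(500, input_dim=28 * 28))   # hidden layer 1: 500 neurons
model.add(Activation('sigmoid'))
model.add(Dense(500))                      # hidden layer 2: 500 neurons
model.add(Activation('sigmoid'))
model.add(Dense(10))                       # output layer: one value per digit
model.add(Activation('softmax'))
```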
Keras

Step 3.1: Configuration — choose the loss, the optimizer and the learning
rate (e.g. 0.1).

Step 3.2: Find the optimal network parameters using the training data
(images) and their labels (digits).
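In code, step 3.1 looks roughly like this (a sketch under the same Keras 1.x assumptions as above; the training call itself follows once the data arrays are prepared):

```python
from keras.optimizers import SGD

# Step 3.1: configuration -- loss, optimizer and learning rate
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.1),
              metrics=['accuracy'])
```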
Keras
Step 3.2: Find the optimal network parameters

The training data are passed as two numpy arrays: one holding the images,
each flattened into a row of 28 x 28 = 784 values, and one holding the
labels, each a 10-dimensional one-hot row.  Both arrays have one row per
training example.

https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html
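A hedged sketch of step 3.2: preparing the two numpy arrays and fitting the model (the names `raw_images` and `raw_labels` are placeholders, not from the slides):

```python
import numpy as np

number_of_examples = len(raw_images)

# one row per image, flattened to 28 * 28 = 784 values in [0, 1]
x_train = raw_images.reshape(number_of_examples, 28 * 28).astype('float32') / 255

# one row per image, a 10-dimensional one-hot vector
y_train = np.zeros((number_of_examples, 10))
y_train[np.arange(number_of_examples), raw_labels] = 1

# Step 3.2: find the optimal network parameters
model.fit(x_train, y_train)
```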
Keras

Save and load models:
http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model

How to use the neural network (testing):
  case 1: the correct labels are available — evaluate the model’s accuracy
  case 2: only the inputs are available — let the model predict the outputs
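The two testing cases plausibly correspond to `evaluate` and `predict`; a sketch, assuming the model and arrays from above plus test arrays `x_test` / `y_test`:

```python
# case 1: inputs and correct labels available -> measure loss and accuracy
score = model.evaluate(x_test, y_test)
print('Test accuracy:', score[1])

# case 2: only inputs available -> let the network predict
probabilities = model.predict(x_test)            # 10 confidence values per image
predicted_digits = probabilities.argmax(axis=1)  # the most confident digit
```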
Using GPU to speed training
• THEANO_FLAGS=device=gpu0 python YourCode.py
Live Demo
• You can find the code for today’s demo at the following link:
• http://speech.ee.ntu.edu.tw/~tlkagk/DL_tutorial/DeepLecture_HelloWorld.py
Lesson we learned ...
http://ent.ltn.com.tw/news/breakingnews/1144545

Although deep learning is a convenient hammer, it is the hammer of Thor:
it is hard to lift.

Lecture II:
Practical Tips
for Deep Learning
Recipe of Deep Learning

After the three steps (Step 1: define a set of functions — a neural
network; Step 2: goodness of function; Step 3: pick the best function),
check the results:

  Good results on training data?  NO → go back and improve the training.
  YES ↓
  Good results on testing data?   NO → overfitting!  Fix it, then re-check
                                       the results on the training data.
  YES ↓ done
Do not always blame Overfitting

If a deeper network is worse on the testing data AND on the training data,
it is not overfitting — it is simply not well trained.
[Kaiming He, arXiv 2015]
Recipe of Deep Learning

Different approaches are used for different problems: for example,
“dropout” is for getting good results on the testing data, not on the
training data.
Recipe of Deep Learning — good results on training data

• Choosing proper loss
• Mini-batch
• New activation function
• Adaptive learning rate
• Momentum
Choosing Proper Loss

For a training image of “1”, the target is the one-hot vector
(1, 0, …, 0).  The loss between the Softmax outputs y1 … y10 and the
target can be the square error Σ(yi − ŷi)² or the cross entropy
−Σ ŷi ln yi; both are 0 when the output matches the target exactly.
Which one is better?
Let’s try it

Testing accuracy:  Square Error 0.11,  Cross Entropy 0.84.
On the training curves, the model trained with cross entropy also reaches
much higher accuracy than the one trained with square error.
Choosing Proper Loss

When using a softmax output layer, choose cross entropy.  Plotted over two
parameters w1 and w2, the total-loss surface of cross entropy stays steep
far away from the minimum, while square error is flat there, so gradient
descent with cross entropy makes faster progress.
http://jmlr.org/proceedings/papers/v9/glorot10a/
Recipe of Deep Learning — good results on training data
Next tip: mini-batch.
Mini-batch

We do not really minimize the total loss!

➢ Randomly initialize the network parameters.
➢ Pick the 1st mini-batch (e.g. x1, x31, …), compute its loss L′, and
  update the parameters once.
➢ Pick the 2nd mini-batch (e.g. x2, x16, …), compute its loss L″, and
  update the parameters once.
➢ …
➢ Until all mini-batches have been picked — this is one epoch.

Repeat the above process; for example, with 100 examples in a mini-batch,
repeat for 20 epochs.
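In Keras, the mini-batch size and the number of epochs are just arguments of `fit` (a sketch; Keras 2 and later use `epochs` instead of `nb_epoch`):

```python
# 100 examples per mini-batch, 20 passes over the whole training set
model.fit(x_train, y_train, batch_size=100, nb_epoch=20)
```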
Mini-batch

Note that the loss we minimize is different each time we update the
parameters, because it is computed on a different mini-batch rather than
on all of the training data.
Mini-batch

Compared with the original gradient descent, the updates of mini-batch
training are unstable: each step follows the loss of the current
mini-batch rather than the total loss.

Mini-batch is faster (although not always true with parallel computing):
the original gradient descent updates only after seeing all examples,
while with 20 mini-batches we update 20 times in one epoch.  Since seeing
all examples once and seeing one batch 20 times can take about the same
time (for a not super large data set), mini-batch makes much more progress
per epoch — and it also reaches better performance!
Mini-batch is Better!

Testing accuracy:  Mini-batch 0.84,  No batch 0.12.
On the training curves, the accuracy of mini-batch training also rises far
faster per epoch than training without batches.
Shuffle the training examples for each epoch, so that the mini-batches
differ from epoch to epoch (the 1st mini-batch of epoch 1 and of epoch 2
contain different examples).

Don’t worry — this is the default behaviour of Keras.
Recipe of Deep Learning — good results on training data
Next tip: new activation function.
Hard to get the power of Deep …

On the training data, deeper usually does not imply better: beyond a
certain depth, adding layers can make the training accuracy worse.
(Demo)
Vanishing Gradient Problem

In a deep sigmoid network, the layers close to the input x1 … xN have
smaller gradients, learn very slowly, and stay almost random, while the
layers close to the output y1 … yM have larger gradients, learn very fast,
and converge quickly — but they converge based on the nearly random
features of the lower layers!?
Vanishing Gradient Problem

Intuitive way to see it: think of a derivative as the effect a small
change of a weight has on the output.  A large change at the input of a
sigmoid neuron causes only a small change of its output, and this
attenuation happens again at every following layer.  A weight near the
input therefore has a much smaller effect on the final output — a much
smaller gradient — than a weight near the output.
Hard to get the power of Deep …

In 2006, people used RBM pre-training.


In 2015, people use ReLU.
ReLU
• Rectified Linear Unit (ReLU): a = z when z > 0, a = 0 otherwise.
  Reasons:
  1. Fast to compute
  2. Biological reason
  3. Equivalent to an infinite number of sigmoids with different biases
  4. Handles the vanishing gradient problem
  [Xavier Glorot, AISTATS’11] [Andrew L. Maas, ICML’13] [Kaiming He, arXiv’15]
ReLU

Neurons whose input z is negative output 0 and can be removed from the
network; what remains is a thinner, effectively linear network from
x1, x2 to y1, y2, so the gradients do not become smaller as they are
propagated back.
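A one-line numpy sketch of the activation itself; in Keras it is selected with `Activation('relu')` or `activation='relu'`:

```python
import numpy as np

def relu(z):
    # a = z when z > 0, a = 0 otherwise
    return np.maximum(0.0, z)
```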
Let’s try it — 9 layers

Testing accuracy:  Sigmoid 0.11,  ReLU 0.96.
On the training curves, the 9-layer ReLU network learns, while the 9-layer
sigmoid network barely improves.
Recipe of Deep Learning — good results on training data

• Adaptive learning rate: give different parameters different learning
  rates.
• Next tip: momentum.
In the physical world, a ball rolling on a surface has momentum — how
about putting this phenomenon into gradient descent?

Momentum

Movement = negative of ∂L/∂w + momentum (the movement of the previous
step).  The momentum can carry the parameters past plateaus and points
where ∂L/∂w = 0.  This still does not guarantee reaching the global
minimum, but it gives some hope ……

Adam = adaptive learning rate + momentum.
Let’s try it — ReLU, 3 layers

Testing accuracy:  Original 0.96,  Adam 0.97.
On the training curves, Adam also converges faster than the original
optimizer.
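Switching to Adam is a one-line change in the configuration step (a sketch, reusing the model assumed earlier):

```python
model.compile(loss='categorical_crossentropy',
              optimizer='adam',        # adaptive learning rate + momentum
              metrics=['accuracy'])
```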
Recipe of Deep Learning — good results on testing data

• Early stopping
• Regularization
• Dropout
• Network structure
Why Overfitting?
• Training data and testing data can be different.

The learning target is defined by the training data, so the parameters
achieving the learning target do not necessarily give good results on the
testing data.
Panacea for Overfitting
• Have more training data
• Create more training data (?)

For handwriting recognition, new training images can be created from the
original ones, e.g. by shifting / rotating them by 15°.
Recipe of Deep Learning — good results on testing data
Next: dropout.
Dropout
Training:

➢ Each time before updating the parameters, each neuron has a probability
  of p% to drop out.
➢ The structure of the network is changed — it becomes thinner — and the
  new, thinner network is used for that update.
➢ For each mini-batch, we resample which neurons drop out.
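In Keras, dropout is added as a layer after the layer whose outputs should be dropped; Keras switches it off automatically at testing time.  A minimal sketch (the rate 0.5 is an illustrative choice, not from the slides):

```python
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
model.add(Dense(500, input_dim=28 * 28))
model.add(Activation('relu'))
model.add(Dropout(0.5))   # each neuron is dropped with probability 0.5 during training
model.add(Dense(500))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10))
model.add(Activation('softmax'))
```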
Dropout

Using basketball as an example:
training (under pressure) vs. testing (in the real game).
(idea from Prof. Min Sun)
http://big5.xinhuanet.com/gate/big5/news.xinhuanet.com/sports/2012-07/03/c_123363695.htm
https://www.youtube.com/watch?v=pn5dP9s9yiM
Recipe of Deep Learning — good results on testing data
Next: network structure.

CNN is a very good example!  (next lecture)
Lecture III:
Convolutional Neural
Network (CNN)
Why CNN for Image?
• When processing an image, the first layer of a fully connected network
  would be very large: for a 100 x 100 x 3 image flattened into a vector
  and a first hidden layer of 1000 neurons, that layer alone needs about
  3 x 10^7 weights.

Can the fully connected network be simplified by considering the
properties of image recognition?
Why CNN for Image
• Some patterns are much smaller than the whole image.
  A neuron does not have to see the whole image to discover the pattern
  (e.g. a “beak” detector); connecting to a small region needs fewer
  parameters.
Why CNN for Image
• The same patterns appear in different regions.
  An “upper-left beak” detector and a “middle beak” detector do almost the
  same thing, so they can use the same set of parameters.
Why CNN for Image
• Subsampling the pixels will not change the object: a subsampled bird is
  still a bird.  We can subsample the pixels to make the image smaller,
  which leaves fewer parameters for the network to process.
The whole CNN

image → Convolution → Max Pooling → … (can repeat many times) …
      → Convolution → Max Pooling → Flatten
      → Fully Connected Feedforward network → “cat”, “dog”, ……

The three properties motivate the two new kinds of layers:
  Property 1: some patterns are much smaller than the whole image   → Convolution
  Property 2: the same patterns appear in different regions         → Convolution
  Property 3: subsampling the pixels will not change the object     → Max Pooling
CNN – Convolution

The values in the filter matrices are learned from the training data.
Each filter detects a small pattern (3 x 3).  (Property 1)

6 x 6 image:        Filter 1:       Filter 2:
1 0 0 0 0 1          1 -1 -1        -1  1 -1
0 1 0 0 1 0         -1  1 -1        -1  1 -1
0 0 1 1 0 0         -1 -1  1        -1  1 -1
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
CNN – Convolution (Filter 1, stride = 1)

Slide Filter 1 across the 6 x 6 image with stride 1 and take the inner
product at every position, giving a 4 x 4 output:

 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

The same filter — the same set of parameters — is used at every position
of the image.  (Property 2)
CNN – Convolution (Filter 2, stride = 1)

Do the same process for every filter.  Filter 2 gives another 4 x 4 output:

-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

Together, the outputs of all the filters form the Feature Map — here a
4 x 4 image with 2 channels.
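The feature map above can be verified with a few lines of numpy — slide the filter across the image with stride 1 and take the inner product at every position (a sketch; real frameworks do the same thing much faster):

```python
import numpy as np

image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])

filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

def convolve(img, flt, stride=1):
    """Inner product of the filter with every (stride-spaced) patch."""
    rows = (img.shape[0] - flt.shape[0]) // stride + 1
    cols = (img.shape[1] - flt.shape[1]) // stride + 1
    out = np.zeros((rows, cols), dtype=int)
    for i in range(rows):
        for j in range(cols):
            patch = img[i * stride:i * stride + flt.shape[0],
                        j * stride:j * stride + flt.shape[1]]
            out[i, j] = np.sum(patch * flt)
    return out

print(convolve(image, filter1))
# [[ 3 -1 -3 -1]
#  [-3  1  0 -3]
#  [-3 -3  0  1]
#  [ 3 -2 -2 -1]]
```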
CNN – Colorful image

For a colorful (RGB) image the input has 3 channels, so each filter is a
3 x 3 x 3 cube of weights, and the inner product is taken over all three
channels at once.
Convolution v.s. Fully Connected

Convolution is a fully connected layer with some connections removed and
some weights shared.  Flatten the 6 x 6 image into a 36-dimensional vector
x1 … x36 and compare.

The neuron that computes the first output value of Filter 1 (the “3”) is
connected to only 9 of the 36 inputs — pixels 1, 2, 3, 7, 8, 9, 13, 14, 15 —
not to all of them: less parameters!

The neuron that computes the next output value (the “−1”) is connected to
a different 9 inputs (pixels 2, 3, 4, 8, 9, 10, 14, 15, 16), but it uses
the same 9 weights as the first neuron — shared weights: even less
parameters!

CNN – Max Pooling

Take the two 4 x 4 outputs produced by Filter 1 and Filter 2 and group
their values into non-overlapping 2 x 2 regions.
CNN – Max Pooling

Keep only the maximum value of each 2 x 2 region.  After one convolution
and one max pooling, the 6 x 6 image becomes a new but smaller 2 x 2 image:
(3 0 / 3 1) for Filter 1 and (−1 1 / 0 3) for Filter 2.  Each filter gives
one channel of the new image.
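Continuing the numpy sketch from the convolution example above, 2 x 2 max pooling keeps the largest value in each non-overlapping 2 x 2 block of the feature map:

```python
def max_pool_2x2(feature_map):
    """Keep the maximum of every non-overlapping 2x2 region."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# applied to Filter 1's 4x4 feature map -> [[3, 0], [3, 1]]
print(max_pool_2x2(convolve(image, filter1)))
```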
The whole CNN

Convolution followed by max pooling turns the image into a new, smaller
image whose number of channels equals the number of filters.  This can be
repeated many times; the final image is then flattened and fed into a
fully connected feedforward network, which outputs “cat”, “dog”, ……
Flatten

The final 2 x 2 image with 2 channels is flattened into an 8-dimensional
vector, which becomes the input of the fully connected feedforward network.
CNN in Keras

Compared with the fully connected network, only the network structure and
the input format are modified: the input is a 3-D tensor instead of a
vector.  The first convolution layer has 25 filters of size 3 x 3,
followed by max pooling, another convolution, and another max pooling.

input_shape = (1, 28, 28): 1 channel (black/white; it would be 3 for RGB),
28 x 28 pixels.
CNN in Keras

input:                             1 x 28 x 28
Convolution (25 filters, 3 x 3):  25 x 26 x 26   (9 parameters per filter)
Max Pooling (2 x 2):              25 x 13 x 13
Convolution (50 filters, 3 x 3):  50 x 11 x 11   (25 x 3 x 3 = 225 parameters per filter)
Max Pooling (2 x 2):              50 x 5 x 5
CNN in Keras

After the last max pooling, the 50 x 5 x 5 output is flattened into a
1250-dimensional vector and fed into the fully connected feedforward
network, which produces the output.
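Putting the shapes above together, the model definition was roughly as follows (a reconstruction using the Keras 1.x API and Theano's channels-first ordering, matching input_shape = (1, 28, 28); the size of the fully connected layer is an assumption, and Keras 2 would write Conv2D(25, (3, 3), ...) instead):

```python
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Activation

model = Sequential()
model.add(Convolution2D(25, 3, 3, input_shape=(1, 28, 28)))  # -> 25 x 26 x 26
model.add(MaxPooling2D((2, 2)))                              # -> 25 x 13 x 13
model.add(Convolution2D(50, 3, 3))                           # -> 50 x 11 x 11
model.add(MaxPooling2D((2, 2)))                              # -> 50 x 5 x 5
model.add(Flatten())                                         # -> 1250-dim vector
model.add(Dense(100))                                        # fully connected part (size assumed)
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))
```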
Live Demo
What does CNN learn?

In the trained network (input → convolution with 25 3x3 filters → max
pooling → convolution with 50 3x3 filters → 50 x 11 x 11 → max pooling),
the output of the k-th filter of the second convolution layer is an
11 x 11 matrix with elements a^k_ij.

Define the degree of activation of the k-th filter as
  a^k = Σ_{i=1}^{11} Σ_{j=1}^{11} a^k_ij
and find the input image that activates the filter the most,
  x* = arg max_x a^k,
by gradient ascent on the input image.
What does CNN learn?

The same can be done for the neurons after the flatten step: for each
neuron of the fully connected part, find the image that maximizes the
output of that neuron.  Each resulting figure corresponds to one neuron.
What does CNN learn?

Can we see digits?  Doing the same for the output neurons corresponding to
the digits 0, 1, 2, …, 8 gives images that do not look like digits at all.

Deep Neural Networks are Easily Fooled:
https://www.youtube.com/watch?v=M2IebCN9Ht4
What does CNN learn?

Adding a constraint over all pixel values to the maximization (penalizing
ink everywhere in the image) makes the resulting images for the digits 0–8
look somewhat more like digits.
Deep Dream
• Given a photo, the machine adds what it sees: the CNN exaggerates what
  it detects in the image, and the image is modified accordingly.
http://deepdreamgenerator.com/
Deep Style
• Given a photo, make its style like famous paintings.
https://dreamscopeapp.com/

A Neural Algorithm of Artistic Style: https://arxiv.org/abs/1508.06576
One CNN captures the content of the photo, another CNN captures the style
of the painting, and a new image is found whose content matches the first
and whose style matches the second.
Application: Playing Go

The network takes the board position as a 19 x 19 matrix (an “image”:
black stone = 1, white stone = −1, none = 0) and outputs a 19 x 19 vector
giving the next move, e.g. “5-5”.
A fully connected feedforward network can be used, but a CNN performs much
better.  http://lgs.tw/qwwheue

Training: collect the records of many previous plays; the machine learns
to mimic the moves of human players.
Why CNN for Go playing?
• Some patterns are much smaller than the whole image:
  AlphaGo uses 5 x 5 filters for its first layer.
• The same patterns appear in different regions.

Why CNN for Go playing?
• Subsampling the pixels will not change the object — how would Max
  Pooling be explained for Go???  In fact, AlphaGo does not use Max
  Pooling ……
Concluding Remarks

• Lecture I: Introduction of Deep Learning

• Lecture II: Practical Tips for Deep Learning

• Lecture III: Convolutional Neural Network (CNN)
