Introduction to Deep Learning
Hoang-Quynh Le (PhD), VNU-UET
Outline
Introduction to Deep Learning
○ What is Deep Learning
○ Why is it useful
Neural Networks
○ Neural Networks: Perceptron, MLP
○ Gradient Descent, Activation Functions
○ Multi-layer NN: Forward- and Backward Propagation
Typical architectures
○ Convolutional neural networks (CNN)
○ Recurrent neural networks (RNN)
○ Attention mechanism
Data representation
Deep learning frameworks
2
1.
Introduction
to Deep Learning
3
AlphaGo
https://www.youtube.com/watch?v=8tq1C8spV_g
4
Tesla X
5
Emotion Detection
• Anger
• Disgust
• Fear
• Happiness
• Neutral
• Sadness
• Surprise
Check out https://faceinmotion.preferred.ai
https://www.freecodecamp.org/news/facial-emotion-recognition-develop-a-c-n-n-and-break-into-kaggle-top-10-f618c024faa7/
6
Google Translation
https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html
7
Google Assistant
8
YouTube
9
GPT-3 Applications
https://www.youtube.com/watch?v=_x9AwxfjxvE
10
Machine Learning
Machine learning is a field of computer
science that gives computers the ability to
learn without being explicitly programmed
11
Machine Learning Basics (1)
12
Machine Learning Basics (2)
Machine learning is a field of computer science that gives computers the ability to learn without
being explicitly programmed
Training: Labeled Data → Machine Learning algorithm → Learned model
Prediction: Labeled Data → Learned model → Prediction
13
Types of Learning
Supervised: Learning with a labeled training set
Example: email classification with already labeled emails
[Figure: Classification, Regression, Clustering]
14
Traditional Machine Learning
Traditional ML methods work well because of human-designed
representations and input features
ML becomes just optimizing weights to best make a final prediction
15
○ Traditional rule-based approach
○ Machine learning: traditional feature-based machine learning
○ Deep learning
16
Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
What is Deep Learning?
A subfield of machine learning focused on learning representations of data. Exceptionally
effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by
using a hierarchy of multiple layers
https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png
18
Traditional ML: Extract Hand-Crafted Features → Trainable Classifier (e.g. SVM, Random Forest) → Output (e.g. Outdoor: Yes or No)
Deep Learning: Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier → Output (e.g. outdoor, indoor)
19
“Deep Learning doesn’t do different things, it does things differently.”
20
Machine Learning vs. Deep Learning
https://lerablog.org/technology/ai-artificial-intelligence-vs-machine-learning-vs-deep-learning/
21
Why is DL useful?
o Manually designed features are often over-specified, incomplete and take a
long time to design and validate
o Learned features are easy to adapt, fast to learn
o Deep learning provides a very flexible, (almost?) universal, learnable
framework for representing world, visual and linguistic information.
o Can learn both unsupervised and supervised
o Effective end-to-end joint system learning
o Utilize large amounts of training data
[Figure: performance vs. size of data — deep learning algorithms vs. traditional ML algorithms]
22
How does DL learn features?
[Figure: classify an image as Indoor or Outdoor — answer: Indoor]
24
Image classification
https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53
26
2.
Neural Networks
27
Neurons in the Brain (1)
28
Neurons in the Brain (2)
29
Artificial Neural Network
An Artificial Neural Network is an information processing paradigm
that is inspired by the biological nervous systems, such as the
human brain’s information processing mechanism.
[Figure: inputs x1–x4 → hidden units a1(1)–a4(1) → a1(2) → output Y]
31
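A minimal numpy sketch (not from the slides) of the small network in the figure above — 4 inputs, a 4-unit hidden layer and a single output — with sigmoid activations assumed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # inputs x1..x4

W1 = rng.normal(size=(4, 4))      # weights: inputs -> hidden a1(1)..a4(1)
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))      # weights: hidden -> a1(2)
b2 = np.zeros(1)

a1 = sigmoid(W1 @ x + b1)         # hidden activations
y = sigmoid(W2 @ a1 + b2)         # output Y
print(y)
```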
Artificial Perceptron (2)
32
Perceptron Training Rule
33
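A minimal sketch of the perceptron training rule w ← w + η(t − o)·x on toy data (the AND function; the data and names are illustrative, not from the slides):

```python
import numpy as np

# Toy data: logical AND, with a constant 1 appended to each input as a bias term
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)   # targets

w = np.zeros(3)       # weights (last entry acts as the bias)
eta = 0.1             # learning rate

for epoch in range(20):
    for x_i, t_i in zip(X, t):
        o = 1.0 if w @ x_i > 0 else 0.0    # perceptron output (step activation)
        w += eta * (t_i - o) * x_i         # perceptron training rule

print(w)  # learned weights that separate AND
```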
Loss function
• The quantity to be minimized (optimized) during training
• the only thing the network cares about
• there might also be other metrics you care about
• Common tasks have “standard” loss functions:
• mean squared error for regression
• binary cross-entropy for two-class classification
• categorical cross-entropy for multi-class classification
• etc.
• https://lossfunctions.tumblr.com/
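Minimal numpy versions of the “standard” losses listed above (a sketch, not the slides’ code):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, for regression."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Binary cross-entropy, for two-class classification (p = predicted P(class 1))."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """Categorical cross-entropy, for multi-class classification."""
    return -np.mean(np.sum(y_onehot * np.log(np.clip(probs, eps, 1.0)), axis=1))
```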
Optimizer
36
Gradient Descent (2)
37
Gradient Descent (3)
38
Gradient Descent (4)
39
Gradient Descent (5)
https://machinelearningcoban.com/2017/01/12/gradientdescent/ 40
Gradient Descent (6)
https://machinelearningcoban.com/2017/01/12/gradientdescent/ 41
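A minimal gradient-descent sketch on a one-dimensional quadratic loss (illustrative only, not taken from the linked post):

```python
# Minimize L(w) = (w - 3)^2 with plain gradient descent
def grad(w):
    return 2 * (w - 3)          # dL/dw

w = -5.0                        # starting point
lr = 0.1                        # learning rate
for step in range(100):
    w -= lr * grad(w)           # w <- w - lr * dL/dw

print(w)                        # converges towards the minimum at w = 3
```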
Batch, Mini-Batch, Iterative Training
42
Iteration, Epoch
43
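A schematic training loop showing how mini-batches, iterations and epochs relate; `compute_gradient` and `update` are hypothetical placeholders, not real API calls:

```python
import numpy as np

def train(X, y, batch_size=32, num_epochs=10):
    """X, y: numpy arrays of examples and labels."""
    n = len(X)
    for epoch in range(num_epochs):            # one epoch = one full pass over the data
        order = np.random.permutation(n)       # reshuffle the data each epoch
        for start in range(0, n, batch_size):  # one iteration = one mini-batch update
            idx = order[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            # grads = compute_gradient(model, X_batch, y_batch)  # hypothetical helper
            # update(model, grads)                                # hypothetical helper
```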
Neural Networks with Activation Functions
The purpose of the activation function is to introduce non-linearity into the network
44
Activation: Sigmoid
http://adilmoujahid.com/images/activation.png
45
Activation: Tanh
http://adilmoujahid.com/images/activation.png
46
Activation: ReLU
http://adilmoujahid.com/images/activation.png
47
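The three activations above, written out in numpy (a minimal sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes values into (-1, 1), zero-centred

def relu(z):
    return np.maximum(0.0, z)         # 0 for negative inputs, identity otherwise

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```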
Multi Layer Perceptron
48
Multi-layer Neural Networks with Sigmoid
[Figure: Layer 1 = input layer, Layers 2–3 = hidden layers, Layer 4 = output layer]
49
Forward Propagation
Notations
○ Input vector at level l
○ Weight matrix of level l
[Figure: forward propagation from Layer 1 (input layer) through the hidden layers to Layer 4 (output layer)]
50
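With these notations, the forward pass is usually written as follows (a standard formulation; the slides’ exact symbols may differ):

```latex
a^{(1)} = x, \qquad a^{(l+1)} = f\left( W^{(l)} a^{(l)} + b^{(l)} \right), \quad l = 1, \dots, L-1
```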
Back Propagation
○ Error at level l
○ Error at the last level (L = 4)
[Figure: back-propagation from Layer 4 (output layer) through the hidden layers to Layer 1 (input layer)]
51
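A compact numpy sketch of forward and backward propagation for a stack of sigmoid layers, using the usual rules for the error at the last level and at level l (a generic implementation under a squared-error loss, not the slides’ code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, Ws, bs):
    """Ws[l], bs[l] map layer l to layer l+1. Returns per-layer gradients."""
    # Forward pass: store activations layer by layer
    a = [x]
    for W, b in zip(Ws, bs):
        a.append(sigmoid(W @ a[-1] + b))

    # Backward pass: error at the last level (squared-error loss, sigmoid output),
    # then propagate the error back through the hidden layers
    delta = (a[-1] - y) * a[-1] * (1 - a[-1])               # delta at the output layer
    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(delta, a[l]))            # dL/dW at level l
        grads_b.insert(0, delta)                            # dL/db at level l
        if l > 0:
            delta = (Ws[l].T @ delta) * a[l] * (1 - a[l])   # error at level l
    return grads_W, grads_b
```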
3.
Typical Architectures
52
Number of Parameters
[Figure: inputs x1–x4 → hidden units a1(1)–a4(1) → a1(2) → softmax output Y]
21 = 4 × 4 + 4 + 1
53
If the input is an Image?
[Figure: a 400 × 400 × 3 image flattened into inputs x1 … x480000, fully connected to hidden units a1(1) … a480000(1), then a1(2) → Y]
Number of Parameters
480,000 × 480,000 + 480,000 + 1 ≈ 230 billion !!!
480,000 × 1,000 + 1,000 + 1 ≈ 480 million (with a 1,000-unit hidden layer) !!!
54
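The counts above as a quick calculation (assuming the “+1” is a single output bias; illustrative only):

```python
def param_count(n_in, n_hidden):
    # input->hidden weights, hidden->output weights, plus one output bias
    return n_in * n_hidden + n_hidden + 1

print(param_count(4, 4))                              # 21, as on the earlier slide
print(param_count(400 * 400 * 3, 400 * 400 * 3))      # ~230 billion
print(param_count(400 * 400 * 3, 1000))               # ~480 million
```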
Convolution Layers
Inspired by the neurophysiological experiments conducted by Hubel and Wiesel (1962).
Filter:
0  1  0
1 -4  1
0  1  0
[Figure: a 20 × 20 grid of pixel intensities (Input Image) and the resulting Convolved Image]
55
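A direct (unoptimized) 2-D convolution sketch applying the filter above; the explicit loops are for illustration only, real frameworks use vectorized kernels:

```python
import numpy as np

kernel = np.array([[0,  1, 0],
                   [1, -4, 1],
                   [0,  1, 0]], dtype=float)   # the filter shown on the slide

def conv2d(image, kernel):
    """Valid convolution (no padding, stride 1), written with explicit loops."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(20, 20)       # stand-in for the 20x20 intensity grid
print(conv2d(image, kernel).shape)   # (18, 18)
```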
Convolution Layers
[Figure: a 4 × 4 input image (values a–p) convolved with two 2 × 2 filters (Filter 1: w1–w4, Filter 2: w5–w8), producing the Layer 1 and Layer 2 feature maps (h1, h2, …)]
In CNNs, hidden units are only connected to a local receptive field.
57
Pooling
Max pooling: reports the maximum output within a rectangular neighborhood.
Average pooling: reports the average output of a rectangular neighborhood.
MaxPool with a 2 × 2 filter and stride of 2:

Input              Output
1 3 5 3
4 2 3 1     →      4 5
3 1 1 3            3 4
0 1 0 4
58
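A sketch reproducing the max-pooling example above in numpy:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = patch.max()   # report the maximum in each neighborhood
    return out

x = np.array([[1, 3, 5, 3],
              [4, 2, 3, 1],
              [3, 1, 1, 3],
              [0, 1, 0, 4]])
print(max_pool(x))   # [[4. 5.] [3. 4.]], matching the slide
```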
Convolutional Neural Networks (1)
[Figure: feature extraction architecture — stacked convolution layers (64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512 filters) with max-pooling, followed by fully connected layers producing an output vector over room classes: Living Room, Bed Room, Kitchen, Bathroom]
59
Convolutional Neural Networks (2)
Output: Binary, Multinomial, Continuous, Count
Input: Fixed size, can use padding to make all images same size.
Architecture: Choice is ad hoc
○ requires experimentation.
Optimization: Backward propagation
○ Hyperparameters for a very deep model can be estimated properly only if you have billions of images.
• Use an architecture and trained hyperparameters from other papers (ImageNet models, Microsoft/Google APIs, etc.)
Computing Power: Buy a GPU!!
60
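The deck lists Keras among the frameworks; below is a hedged Keras sketch of a small CNN in the spirit of the architecture slide (the layer sizes, 224 × 224 × 3 input and the four room classes are illustrative assumptions, not the exact model):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(224, 224, 3)),                   # fixed-size (resized/padded) images
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(256, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),                # fully connected layers
    layers.Dense(4, activation="softmax"),               # e.g. living room / bedroom / kitchen / bathroom
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```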
Automatic Colorization of Black and White Images
61
Optimizing Images
64
Recurrent Neural
Networks (RNN)
65
Why RNN?
The limitations of convolutional neural networks:
They take fixed-length vectors as input and produce fixed-length vectors as output.
They allow only a fixed number of computational steps.
We need to model the data with temporal or sequential structures and
varying length of inputs and outputs
e.g.:
This movie is ridiculously good.
This movie is very slow in the beginning but picks up pace later on and
has some great action sequences and comedy scenes.
66
Modeling Sequences
Image Captioning: image → “A person riding a motorbike on dirt road”
Machine Translation: “Happy Diwali” → “शुभ दीपावली”
67
What is RNN?
Recurrent neural networks are connectionist models with the ability to selectively
pass information across sequence steps, while processing sequential data one
element at a time.
This allows a memory of the previous inputs to persist in the model’s internal state and
influence the outcome.
[Figure: INPUT x(t) → Hidden Layer → OUTPUT h(t), with a delay feeding h(t−1) back into the hidden layer]
68
RNN (rolled over time)
h(t) = f( W_h · h(t−1) + W_x · x(t) )
69
RNN (rolled over time)
70
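A minimal numpy sketch of the recurrence above, h(t) = f(W_h·h(t−1) + W_x·x(t)), unrolled over a short sequence (the sizes are illustrative):

```python
import numpy as np

hidden, n_in, T = 8, 5, 10
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(hidden, hidden))   # recurrent weights
W_x = rng.normal(scale=0.1, size=(hidden, n_in))     # input weights
xs = rng.normal(size=(T, n_in))                      # a length-T input sequence

h = np.zeros(hidden)
for t in range(T):
    h = np.tanh(W_h @ h + W_x @ xs[t])   # h(t) = f(W_h h(t-1) + W_x x(t))
print(h)
```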
The Vanishing Gradient Problem
RNNs use backpropagation.
Backpropagation uses the chain rule.
○ The chain rule multiplies derivatives
If these derivatives are between 0 and 1 the product vanishes as the
chain gets longer.
○ or the product explodes if the derivatives are greater than 1.
Sigmoid activation function in RNN leads to this problem.
ReLU, in theory, avoids this problem, but not in practice.
71
Problem with Vanishing or Exploding Gradients
They don’t allow us to learn long-term dependencies.
○ Param is a hard worker.
VS.
○ Param, student of Yong, is a hard worker.
BAD!!!!
Misguided!!!!
Unacceptable!!!!
72
Long Short-Term Memory
73
LSTM (1)
[Figure: LSTM cell — the forget gate f1 and input gate i1 are computed from h0 and x1 through weight matrices w and an activation f(); together with the candidate u1 they update the cell state: c1 = f1 · c0 + i1 · u1]
74
LSTM (2)
[Figure: the same LSTM cell, highlighting the weight matrices w and activations f() that produce the forget gate f1 and input gate i1 from h0 and x1]
75
LSTM (3)
[Figure: the same LSTM cell, highlighting how the candidate u1 is computed from h0 and x1 through a weight matrix w and activation f()]
76
LSTM (4)
[Figure: the same LSTM cell with the output gate o1 added — the new hidden state h1 is produced from the updated cell state c1 and o1]
77
LSTM (5)
[Figure: two LSTM cells unrolled over time — inputs x1, x2 and states (c0, h0) → (c1, h1) → (c2, h2), each step using forget (f), input (i), candidate (u) and output (o) gates]
78
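A minimal numpy sketch of the standard LSTM step shown in the figures above (one common parameterization that concatenates h0 and x1; the slides’ exact weight layout may differ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x1, h0, c0, W_f, W_i, W_u, W_o, b_f, b_i, b_u, b_o):
    z = np.concatenate([h0, x1])
    f1 = sigmoid(W_f @ z + b_f)          # forget gate
    i1 = sigmoid(W_i @ z + b_i)          # input gate
    u1 = np.tanh(W_u @ z + b_u)          # candidate cell update
    c1 = f1 * c0 + i1 * u1               # new cell state
    o1 = sigmoid(W_o @ z + b_o)          # output gate
    h1 = o1 * np.tanh(c1)                # new hidden state
    return h1, c1

# Usage with hidden size 4 and input size 3 (illustrative shapes)
H, D = 4, 3
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(H, H + D)) for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h1, c1 = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), *Ws, *bs)
```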
Attention
Mechanism
79
Attention Mechanism (1)
80
Attention Mechanism (2)
81
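The slides do not spell out the formula here; a common instance is scaled dot-product attention, which also underlies the Transformer architecture mentioned later. A minimal numpy sketch with made-up shapes:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 per query
    return weights @ V                   # weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(attention(Q, K, V).shape)          # (2, 8)
```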
4.
Data representation
82
Machines learn better by using a deep understanding of data.
83
○ Traditional rule-based approach
○ Machine learning: traditional feature-based machine learning
84
Transformer architecture
86
Word Embedding vs Language Models
87
Pre-trained language models
88
Large language models
89
5.
Deep Learning
Framework
90
Deep learning frameworks
• Python/numpy or R interfaces
• instead of C, C++, CUDA or HIP
• Open source

Deep learning frameworks (2)
High-level interfaces: Lasagne, Keras, TF Estimator, torch.nn, Gluon
Frameworks: Theano, TensorFlow, CNTK, PyTorch, MXNet, Caffe
94
Thanks!
Any questions?
[email protected]
95