
Introduction to Deep Learning
Hoang-Quynh Le (PhD), VNU-UET
Outline
Introduction to Deep Learning
o What is Deep Learning
o Why is it useful
Neural Networks
○ Neural Networks: Perceptron, MLP
○ Gradient Descent, Activation Functions
○ Multi-layer NN: Forward- and Backward Propagation
Typical architectures
○ Convolutional neural networks (CNN)
○ Recurrent neural networks (RNN)
○ Attention mechanism
Data representation
Deep learning frameworks

2
1. Introduction to Deep Learning

3
AlphaGo

https://www.youtube.com/watch?v=8tq1C8spV_g

4
Tesla X

5
Emotion Detection
• Anger
• Disgust
• Fear
• Happiness
• Neutral
• Sadness
• Surprise
Check out https://faceinmotion.preferred.ai

https://www.freecodecamp.org/news/facial-emotion-recognition-develop-a-c-n-n-and-break-into-kaggle-top-10-f618c024faa7/

6
Google Translation

https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html

7
Google Assistant

8
YouTube

9
GPT-3 Applications

https://www.youtube.com/watch?v=_x9AwxfjxvE

10
Machine Learning
Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.

11
Machine Learning Basis (1)

12
Machine Learning Basis (2)
Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.

Training: Labeled Data → Machine Learning algorithm → Learned model
Prediction: Labeled Data → Learned model → Prediction

Methods that can learn from and make predictions on data.

13
Types of Learning
Supervised: Learning with a labeled training set
Example: email classification with already labeled emails

Unsupervised: Discover patterns in unlabeled data


Example: cluster similar documents based on text

Reinforcement learning: learn to act based on feedback/reward


Example: learn to play Go, reward: win or lose

[Figure: examples of classification, regression, and clustering]
14
Traditional Machine Learning
Traditional ML methods work well because of human-designed
representations and input features
ML becomes just optimizing weights to best make a final prediction

15
[Figure: traditional rule-based approach vs. feature-based machine learning vs. deep learning]

16
Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
What is Deep Learning?
A machine learning subfield of learning representations of data. Exceptionally effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers.

https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png
18
Traditional ML:
Extract Hand-Crafted Features → Trainable Classifier (e.g. SVM, Random Forest) → Output (e.g. Outdoor: Yes or No)

Deep Learning:
Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier → Output (e.g. outdoor, indoor)

19

“Deep Learning doesn’t do
different things, it does things
differently”.

20
Machine Learning vs. Deep Learning

https://lerablog.org/technology/ai-artificial-intelligence-vs-machine-learning-vs-deep-learning/

21
Why is DL useful?
o Manually designed features are often over-specified, incomplete and take a long time to design and validate
o Learned features are easy to adapt, fast to learn
o Deep learning provides a very flexible, (almost?) universal, learnable framework for representing world, visual and linguistic information.
o Can learn both unsupervised and supervised
o Effective end-to-end joint system learning
o Utilize large amounts of training data

[Figure: Performance vs. Size of Data — deep learning algorithms continue to improve with more data, while traditional ML algorithms plateau]

22
How does DL learn features?

[Figure: learned feature hierarchy classifying an image as Indoor vs. Outdoor — Answer: Indoor]

24
How does DL learn features?

[Figure: learned feature hierarchy classifying an image as Indoor vs. Outdoor — Answer: Indoor]

25
Image classification

https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53
26
2. Neural Networks

27
Neurons in the Brain (1)

28
Neurons in the Brain (2)

29
Artificial Neural Network
An Artificial Neural Network is an information processing paradigm that is inspired by biological nervous systems, such as the human brain's information processing mechanism.

[Figure: inputs x1–x4 → hidden units a1(1)–a4(1) → a1(2) → output Y (Input | Hidden Layers | Output)]

#parameters: 4*4 + 4 + 1 = 21
30
Artificial Perceptron

31
Artificial Perceptron (2)

32
Perceptron Training Rule

33
Loss function
• The quantity to be minimized (optimized) during training
• the only thing the network cares about
• there might also be other metrics you care about
• Common tasks have “standard” loss functions:
• mean squared error for regression
• binary cross-entropy for two-class classification
• categorical cross-entropy for multi-class classification
• etc.
• https://lossfunctions.tumblr.com/
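For concreteness, a minimal NumPy sketch of these "standard" losses (illustrative formulas only, not any framework's exact implementation):

import numpy as np

def mean_squared_error(y_true, y_pred):
    # Regression: average squared difference between targets and predictions.
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Two-class classification: y_true in {0, 1}, y_pred = predicted probability of class 1.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    # Multi-class classification: y_true is one-hot, y_pred is a probability distribution per row.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))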
Optimizer
• How to update the weights based on the loss function
• Learning rate (+scheduling)
• Stochastic gradient descent, momentum, and their variants
• RMSProp is usually a good first choice
• more info: http://ruder.io/optimizing-gradient-descent/

Animation from: https://imgur.com/s25RsOr
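As one concrete update rule, a minimal sketch of SGD with classical momentum (the function name and the stand-in gradient are hypothetical, for illustration only):

import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    # Classical momentum: keep a running "velocity" of past gradients
    # and move the weights along it.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Hypothetical usage on a 3-parameter weight vector:
w, v = np.zeros(3), np.zeros(3)
grad = np.array([0.5, -1.0, 0.2])      # stand-in gradient from some loss
w, v = sgd_momentum_step(w, grad, v)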


Gradient Descent (1)

36
Gradient Descent (2)

37
Gradient Descent (3)

38
Gradient Descent (4)

39
Gradient Descent (5)

https://machinelearningcoban.com/2017/01/12/gradientdescent/
40
Gradient Descent (5)

https://machinelearningcoban.com/2017/01/12/gradientdescent/
41
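A minimal sketch of the gradient descent update itself, minimizing the toy function f(x) = x^2 (a made-up example, not from the slides):

# f(x) = x^2 has gradient f'(x) = 2x and its minimum at x = 0.
x = 5.0               # arbitrary starting point
learning_rate = 0.1
for step in range(50):
    grad = 2 * x                    # gradient of the loss at the current point
    x = x - learning_rate * grad    # move against the gradient
print(x)  # x approaches 0, the minimizer of f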
Batch, Mini-Batch, Iterative Training

Batch Training: use all training points to compute gradients for each iteration.
Mini-batch Training: use a subset of training points to compute gradients for each iteration.
Iterative Training: use a single training point to compute gradients for each iteration.

42
Iteration, Epoch

An iteration corresponds to training on one mini-batch.

An epoch corresponds to one training pass over the full dataset.

43
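A minimal sketch of how epochs and iterations relate in a mini-batch loop (the dataset and batch size are made up):

import numpy as np

X = np.random.randn(1000, 20)     # hypothetical training set: 1000 examples, 20 features
batch_size = 100
n_epochs = 5

for epoch in range(n_epochs):                 # one epoch = one pass over the full dataset
    indices = np.random.permutation(len(X))   # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = X[indices[start:start + batch_size]]
        # one iteration = one gradient update computed on this mini-batch
        # (compute gradients on `batch` and update the weights here)
# 1000 examples / batches of 100 -> 10 iterations per epoch, 50 iterations in total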
Neural Networks with Activation Functions

Non-linear function or Activation function

The purpose of the activation function is to introduce non-linearity into the network

44
Activation: Sigmoid

Takes a real-valued number and "squashes" it into the range between 0 and 1.

http://adilmoujahid.com/images/activation.png

45
Activation: Tanh

Takes a real-valued number and "squashes" it into the range between -1 and 1.

http://adilmoujahid.com/images/activation.png

46
Activation: ReLU

Takes a real-valued number and thresholds it at zero.

http://adilmoujahid.com/images/activation.png

47
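A minimal NumPy sketch of the three activations above (for illustration, not any framework's implementation):

import numpy as np

def sigmoid(x):
    # Squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real number into (-1, 1).
    return np.tanh(x)

def relu(x):
    # Thresholds at zero: max(0, x).
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))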
Multi Layer Perceptron

48
Multi-layer Neural Networks with Sigmoid

[Figure: Layer 1 (input layer) → Layers 2–3 (hidden layers) → Layer 4 (output layer)]

Each node ("neuron") is a sigmoid unit.

49
Forward Propagation
Notations
○ a^(l): input (activation) vector at level l
○ W^(l): weight matrix of level l
Forward-propagation (standard form): a^(l) = f(W^(l) a^(l-1) + b^(l))

[Figure: Layer 1 (input layer) → hidden layers → Layer 4 (output layer)]

50
Back Propagation
○ δ^(l): error at level l
○ The error at the last level (L = 4) is computed from the loss; it is then propagated backwards:
  δ^(l) = (W^(l+1))^T δ^(l+1) ⊙ f'(z^(l)),  where z^(l) = W^(l) a^(l-1) + b^(l)

[Figure: errors flow from Layer 4 (output layer) back through the hidden layers to Layer 1 (input layer)]

51
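A minimal NumPy sketch of one forward and one backward pass through a tiny 4-4-1 sigmoid network like the one on the #parameters slide, assuming a squared-error loss (initialization and target are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny network: 4 inputs -> 4 hidden sigmoid units -> 1 output.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))            # input vector
y = np.array([[1.0]])                  # target
W1, b1 = rng.normal(size=(4, 4)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# Forward propagation: a^(l) = sigmoid(W^(l) a^(l-1) + b^(l))
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Backward propagation of the squared error 0.5 * (a2 - y)^2
delta2 = (a2 - y) * a2 * (1 - a2)          # error at the output layer
delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # error propagated to the hidden layer
dW2, dW1 = delta2 @ a1.T, delta1 @ x.T     # gradients used for the weight update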
3. Typical Architectures

52
Number of Parameters

[Figure: inputs x1–x4 → hidden units a1(1)–a4(1) → a1(2) → softmax output Y (Input | Hidden Layers | Output)]

21 = 4*4 + 4 + 1
53
If the input is an Image?

[Figure: a 400 × 400 × 3 image flattened into inputs x1 … x480000, fully connected to a hidden layer]

Number of Parameters
With 480,000 hidden units: 480000*480000 + 480000 + 1 ≈ 230 billion !!!
With 1,000 hidden units: 480000*1000 + 1000 + 1 ≈ 480 million !!!
54
Convolution Layers
Inspired by the neurophysiological experiments conducted by Hubel and Wiesel (1962).

Filter:
0  1  0
1 -4  1
0  1  0

[Figure: the 3×3 filter is slid over the input image (a grid of pixel intensities) to produce the convoluted image]

55
Convolution Layers

Input Image:        Filter:      Convolved Image (Feature Map):
a b c d             w1 w2        h1 h2 ...
e f g h             w3 w4        ...
i j k l
m n o p

h2 = f(b*w1 + c*w2 + f*w3 + g*w4)

Number of Parameters for one feature map = 4
Number of Parameters for 100 feature maps = 4*100

56
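A minimal NumPy sketch of this sliding-filter computation (valid cross-correlation with stride 1, which is what deep learning frameworks usually call convolution; the example values are made up):

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image and sum the element-wise products.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # stands in for the a..p grid above
kernel = np.array([[1.0, 2.0], [3.0, 4.0]])        # the 2x2 filter w1..w4
print(conv2d(image, kernel).shape)                 # (3, 3): one feature map, only 4 shared weights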
Lower Level to More Complex Features

[Figure: Input Image → Filter 1 (w1–w4) → Layer 1 Feature Map → Filter 2 (w5–w8) → Layer 2 Feature Map]

In CNNs, hidden units are only connected to a local receptive field.
57
Pooling
Max pooling: reports the maximum output within a rectangular neighborhood.
Average pooling: reports the average output of a rectangular neighborhood.

MaxPool with a 2×2 filter and stride of 2:

Input Matrix:      Output Matrix:
1 3 5 3            4 5
4 2 3 1            3 4
3 1 1 3
0 1 0 4

58
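A minimal NumPy sketch of non-overlapping max pooling, reproducing the example above:

import numpy as np

def max_pool(x, size=2, stride=2):
    # Non-overlapping max pooling over a 2D feature map.
    oh, ow = x.shape[0] // stride, x.shape[1] // stride
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1, 3, 5, 3],
              [4, 2, 3, 1],
              [3, 1, 1, 3],
              [0, 1, 0, 4]], dtype=float)
print(max_pool(x))   # [[4. 5.]
                     #  [3. 4.]]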
Convolutional Neural Networks (1)

[Figure: feature extraction architecture — stacked convolution (filter) and max-pool layers with 64, 128, 256 and 512 feature maps, followed by fully connected layers producing an output vector over classes such as Living Room, Bed Room, Kitchen, Bathroom, Outdoor]

59
Convolutional Neural Networks (2)
Output: Binary, Multinomial, Continuous, Count
Input: Fixed size; padding can be used to make all images the same size.
Architecture: Choice is ad hoc
○ requires experimentation.
Optimization: Backward propagation
○ Hyperparameters for a very deep model can be estimated properly only if you have billions of images.
• Use an architecture and trained hyperparameters from other papers (ImageNet, or Microsoft/Google APIs, etc.)
Computing Power: Buy a GPU!!

60
Automatic Colorization of Black and White Images

61
Optimizing Images

[Figure: post-processing vs. feature optimization examples — color curves and details, illumination, color tone (warmness)]

62
63
CNN for text classification

64
Recurrent Neural Networks (RNN)

65
Why RNN?
The limitations of Convolutional Neural Networks:
They take fixed-length vectors as input and produce fixed-length vectors as output.
They allow a fixed number of computational steps.
We need to model data with temporal or sequential structure and varying lengths of inputs and outputs,
e.g.:
This movie is ridiculously good.
This movie is very slow in the beginning but picks up pace later on and has some great action sequences and comedy scenes.
66
Modeling Sequences

Image Captioning: image → "A person riding a motorbike on dirt road"
Sentiment Analysis: "Awesome tutorial." → Positive
Machine Translation: "Happy Diwali" → शुभ दीपावली

67
What is RNN?
Recurrent neural networks are connectionist models with the ability to selectively pass information across sequence steps, while processing sequential data one element at a time.
Allows a memory of the previous inputs to persist in the model's internal state and influence the outcome.

[Figure: recurrent cell — INPUT x(t) and the delayed state h(t-1) feed the hidden layer, which produces the OUTPUT h(t)]

68
RNN (rolled over time)

h(t) = f(w_h * h(t-1) + w_x * x(t))

69
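A minimal NumPy sketch of this recurrence, assuming tanh for the nonlinearity f and an added bias term (sizes and weights are made up for illustration):

import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    # One step of a vanilla RNN: h(t) = tanh(W_h h(t-1) + W_x x(t) + b)
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

rng = np.random.default_rng(0)
hidden, inputs = 8, 5                       # hypothetical sizes
W_h = rng.normal(size=(hidden, hidden))
W_x = rng.normal(size=(hidden, inputs))
b = np.zeros(hidden)
h = np.zeros(hidden)
for x_t in rng.normal(size=(10, inputs)):   # a sequence of 10 input vectors
    h = rnn_step(h, x_t, W_h, W_x, b)       # the same weights are reused at every step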
RNN (rolled over time)

70
The Vanishing Gradient Problem
RNNs use back propagation.
Back propagation uses the chain rule.
○ The chain rule multiplies derivatives
If these derivatives are between 0 and 1, the product vanishes as the chain gets longer.
○ or the product explodes if the derivatives are greater than 1.
The sigmoid activation function in RNNs leads to this problem.
ReLU, in theory, avoids this problem, but not in practice.

71
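A tiny numerical illustration of the vanishing case (a sketch; the 30-step chain is a made-up example):

import numpy as np

# Sigmoid derivatives are at most 0.25, so a chain of them shrinks fast.
def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

grads = sigmoid_grad(np.zeros(30))   # 30 time steps, best case (z = 0)
print(np.prod(grads))                # 0.25**30 ≈ 8.7e-19 -- the gradient vanishes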
Problem with Vanishing or Exploding Gradients
Don’t allow us to learn long term dependencies.
○ Param is a hard worker.
VS.
○ Param, student of Yong, is a hard worker.

BAD!!!!
Misguided!!!!
Unacceptable!!!!

72
Long Short-Term Memory

LSTMs provide a solution to the vanishing/exploding gradient problem.

Solution: a Memory Cell, which is updated at each step in the sequence.
Three Gates control the flow of information to and from the Memory Cell:
○ Input Gate: protects the current step from irrelevant inputs
○ Output Gate: prevents the current step from passing irrelevant information to later steps.
○ Forget Gate: limits information passed from one cell to the next.

73
LSTM (1)–(5)
Slides 74–78 build up one LSTM step from input x1, previous hidden state h0 and previous memory cell c0, then chain two steps together. In standard notation (σ is the sigmoid, ⊙ is element-wise multiplication, [h0, x1] is concatenation):

Forget gate:       f1 = σ(W_f [h0, x1] + b_f)
Input gate:        i1 = σ(W_i [h0, x1] + b_i)
Candidate update:  u1 = tanh(W_u [h0, x1] + b_u)
Memory cell:       c1 = f1 ⊙ c0 + i1 ⊙ u1
Output gate:       o1 = σ(W_o [h0, x1] + b_o)
Hidden state:      h1 = o1 ⊙ tanh(c1)

[Figure (slide 78): two LSTM steps chained — (c0, h0, x1) → (c1, h1), then (c1, h1, x2) → (c2, h2); the same gate weights are reused at every step]
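A minimal NumPy sketch of one LSTM step following these gate equations (weight shapes and sizes are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_u, W_o, b_f, b_i, b_u, b_o):
    # One LSTM step; z is the concatenated [h_prev, x_t] input to every gate.
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)      # forget gate
    i = sigmoid(W_i @ z + b_i)      # input gate
    u = np.tanh(W_u @ z + b_u)      # candidate cell update
    o = sigmoid(W_o @ z + b_o)      # output gate
    c = f * c_prev + i * u          # new memory cell
    h = o * np.tanh(c)              # new hidden state
    return h, c

rng = np.random.default_rng(0)
nh, nx = 4, 3                                        # hypothetical sizes
W_f, W_i, W_u, W_o = (rng.normal(size=(nh, nh + nx)) for _ in range(4))
b_f = b_i = b_u = b_o = np.zeros(nh)
h1, c1 = lstm_step(rng.normal(size=nx), np.zeros(nh), np.zeros(nh),
                   W_f, W_i, W_u, W_o, b_f, b_i, b_u, b_o)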
Attention Mechanism

79
Attention Mechanism (1)

80
Attention Mechanism (2)

81
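As one concrete form of attention, a minimal NumPy sketch of scaled dot-product attention, the variant used in the Transformer (the query/key/value sizes are made up for illustration):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Weights each value by how well its key matches the query, as in Vaswani et al. (2017).
    scores = Q @ K.T / np.sqrt(K.shape[-1])                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # attention-weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
print(scaled_dot_product_attention(Q, K, V).shape)             # (3, 4)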
4. Data representation

82
MACHINES learn BETTER by using a deep understanding of data

83
[Figure: traditional rule-based approach vs. feature-based machine learning vs. deep learning — and deeper?]

From Shallow to Deep Pre-Training Models for representing data

84
Transformer architecture

Attention is all you need!

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration.
Text: BERT, BART, ViT5
Image: ViT, ResNet, VGG

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
85
Language model

86
Word Embedding vs Language Models

87
Pre-trained language models

88
Large language models

A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation.
Large language models use transformer models and are trained using massive datasets.

89
5. Deep Learning Frameworks

90
Deep learning frameworks
• Actually tools for defining static or dynamic general-purpose computational graphs
• Automatic differentiation
• Seamless CPU / GPU usage
  • multi-GPU, distributed
• Python/numpy or R interfaces
  • instead of C, C++, CUDA or HIP
• Open source

[Figure: a small computational graph — inputs x, y and the constant 5 combined by + and × nodes]

Deep learning frameworks (2)

[Figure: software stack — high-level APIs (Lasagne, Keras, TF Estimator, torch.nn, Gluon) on top of frameworks (Theano, TensorFlow, CNTK, PyTorch, MXNet, Caffe), built on CUDA/cuDNN, MKL/MKL-DNN and HIP/MIOpen, running on GPUs and CPUs]

• Keras is a high-level neural networks API
  • we will use TensorFlow as the compute backend
  • included in TensorFlow 2 as tf.keras
  • https://keras.io/ , https://www.tensorflow.org/guide/keras
• PyTorch is:
  • a GPU-based tensor library
  • an efficient library for dynamic neural networks
  • https://pytorch.org/
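A minimal tf.keras sketch of defining and compiling a small network (assuming TensorFlow 2 is installed; the layer sizes and the 20-feature input are made up for illustration):

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),  # hidden layer
    keras.layers.Dense(10, activation="softmax"),                  # 10-class output
])
model.compile(optimizer="rmsprop",                 # RMSProp, as suggested above
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()                                    # prints layers and parameter counts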
Summary
Introduction to Deep Learning
o What is Deep Learning
o Why is it useful
Neural Networks
○ Neural Networks: Perceptron, MLP
○ Gradient Descent, Activation Functions
○ Multi-layer NN: Forward- and Backward Propagation
Typical architectures
○ Convolutional neural networks (CNN)
○ Recurrent neural networks (RNN)
○ Attention mechanism
Data representation
Deep learning Frameworks

94
Thanks!
Any questions?
[email protected]

95
