
Fundamentals of Deep Learning, Part 2: How a Neural Network Trains

The document discusses how neural networks are trained. It begins with an agenda that outlines topics including a simpler model, activation functions, and overfitting. It then explains how a neural network learns by adjusting its weights and biases to minimize a loss function through gradient descent. Different activation functions like ReLU and sigmoid are also introduced. Optimizers help neural networks learn more efficiently by determining how far to move along the gradient direction each step.


FUNDAMENTALS OF DEEP LEARNING
Part 2: How a Neural Network Trains
AGENDA
Part 1: An Introduction to Deep Learning
Part 2: How a Neural Network Trains
Part 3: Convolutional Neural Networks
Part 4: Data Augmentation and Deployment
Part 5: Pre-trained Models
Part 6: Advanced Architectures


AGENDA – PART 2
• Recap
• A Simpler Model
• From Neuron to Network
• Activation Functions
• Overfitting
• From Regression to Classification
RECAP OF THE EXERCISE
What just happened?
• Loaded and visualized our data
• Edited our data (reshaped, normalized, converted to categorical)
• Created our model
• Compiled our model
• Trained the model on our data


DATA PREPARATION
Input as an array: each 28 × 28 image is flattened into a single vector of 784 pixel values, e.g.
[0,0,0,24,75,184,185,78,32,55,0,0,0…]
DATA PREPARATION
Targets as categories (one-hot vectors):
0 → [1,0,0,0,0,0,0,0,0,0]
1 → [0,1,0,0,0,0,0,0,0,0]
2 → [0,0,1,0,0,0,0,0,0,0]
3 → [0,0,0,1,0,0,0,0,0,0]
…
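A minimal sketch of these preparation steps in Keras, assuming the MNIST digits used in the exercise; the variable names and the validation split are illustrative.

from tensorflow import keras

# Load MNIST: 28x28 grayscale images, labels are the digits 0-9
(x_train, y_train), (x_valid, y_valid) = keras.datasets.mnist.load_data()

# Reshape: flatten each 28x28 image into a 784-value vector, and normalize pixels to [0, 1]
x_train = x_train.reshape(-1, 784).astype("float32") / 255
x_valid = x_valid.reshape(-1, 784).astype("float32") / 255

# To categorical: one-hot encode the targets, e.g. 3 -> [0,0,0,1,0,0,0,0,0,0]
y_train = keras.utils.to_categorical(y_train, num_classes=10)
y_valid = keras.utils.to_categorical(y_valid, num_classes=10)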
AN UNTRAINED MODEL
Layer sizes:
Input           [ 0, 0, …, 0 ]   (784,)
Hidden layer                     (512,)
Hidden layer                     (512,)
Output layer                     (10,)
A SIMPLER MODEL
Fit a line ŷ = mx + b to two data points; the slope m and the intercept b are unknown.

x    y
1    3
2    5
A SIMPLER MODEL
ŷ = mx + b. Start with random parameters: m = −1, b = 5.

x    y    ŷ
1    3    4
2    5    3
A SIMPLER MODEL
With m = −1 and b = 5:

x    y    ŷ    (y − ŷ)²
1    3    4    1
2    5    3    4

RMSE = √( (1/n) · Σᵢ (yᵢ − ŷᵢ)² )

MSE = 2.5
RMSE ≈ 1.6
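As a quick check of those numbers, a small NumPy sketch (variable names are illustrative):

import numpy as np

# The two training points and the random starting line from the slides
x = np.array([1.0, 2.0])
y = np.array([3.0, 5.0])
m, b = -1.0, 5.0

y_hat = m * x + b                  # predictions: [4., 3.]
mse = np.mean((y - y_hat) ** 2)    # (1 + 4) / 2 = 2.5
rmse = np.sqrt(mse)                # ~1.58, shown as 1.6 on the slide
print(mse, rmse)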
THE LOSS CURVE
[Figure: the loss surface, plotting MSE against candidate values of m and b.]
THE LOSS CURVE
[Figure: the line for the current parameters (m = −1, b = 5) plotted against the data, and the corresponding MSE marked as the "current" point on the loss surface, away from the "target" minimum.]
THE LOSS CURVE
[Figure: after changing b from 5 to 4 (m = −1, b = 4), the previous position on the loss surface is marked "old" and the new one "current".]
THE LOSS CURVE
[Figure: another update gives m = 0, b = 4; the "current" point moves again relative to the "target" minimum.]
THE LOSS CURVE
Key terms:
• The gradient: which direction the loss decreases the most
• λ (the learning rate): how far to travel along that direction on each update
• Epoch: a model update with the full dataset
• Batch: a sample of the full dataset
• Step: an update to the weight parameters
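A worked example of how these terms relate (the dataset size and batch size here are illustrative assumptions): with 60,000 training images and a batch size of 32, each batch is a sample of 32 images, one step updates the weights using one batch, and one epoch is a full pass over the dataset, i.e. 60,000 / 32 = 1,875 steps per epoch.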
OPTIMIZERS
[Figure: loss over training with a momentum optimizer.]
Common optimizers:
• Adam
• Adagrad
• RMSprop
• SGD
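In Keras these correspond to optimizer objects, as in the sketch below; the learning rates shown are illustrative defaults, not values from the slides.

from tensorflow import keras

# Each optimizer implements a different rule for how far, and in what way,
# to move along the gradient at every step.
sgd = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)   # SGD with momentum
adagrad = keras.optimizers.Adagrad(learning_rate=0.01)
rmsprop = keras.optimizers.RMSprop(learning_rate=0.001)
adam = keras.optimizers.Adam(learning_rate=0.001)

# The chosen optimizer is passed to model.compile(optimizer=..., loss=..., metrics=[...])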
FROM NEURON TO NETWORK
BUILDING A NETWORK
• Scales to more inputs
[Diagram: a single neuron with two weighted inputs (w1, w2) producing the prediction ŷ.]
BUILDING A NETWORK
• Scales to more inputs
• Can chain neurons
[Diagram: inputs x1 and x2 feed two neurons through weights w1–w4; their outputs are combined through w5 and w6 into the prediction ŷ.]
BUILDING A NETWORK
• Scales to more inputs
• Can chain neurons
• If all the regressions are linear, then the output will also be a linear regression (see the expansion below)
[Diagram: the same chained network of linear neurons.]
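A short worked expansion of that last point, using illustrative names for the hidden outputs (h1, h2) and biases (b1, b2, b3); the exact wiring of w1–w6 in the diagram may differ:

h1 = w1·x1 + w2·x2 + b1
h2 = w3·x1 + w4·x2 + b2
ŷ  = w5·h1 + w6·h2 + b3
   = (w5·w1 + w6·w3)·x1 + (w5·w2 + w6·w4)·x2 + (w5·b1 + w6·b2 + b3)

The chained linear neurons collapse into one linear regression with new slopes and a new intercept; a non-linear activation between the layers is what prevents this collapse.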
ACTIVATION FUNCTIONS
ACTIVATION FUNCTIONS

Linear:   ŷ = wx + b
ReLU:     ŷ = wx + b  if wx + b > 0, otherwise 0
Sigmoid:  ŷ = 1 / (1 + e^−(wx + b))

[Plots: the linear function is an unbounded straight line, ReLU is flat at 0 for negative inputs and linear for positive inputs, and the sigmoid squashes its output into the range (0, 1).]
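A small NumPy sketch of the three activations, applied here to a precomputed value z = wx + b (names are illustrative):

import numpy as np

def linear(z):
    return z                          # unbounded straight line

def relu(z):
    return np.maximum(0.0, z)         # 0 for negative inputs, identity otherwise

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes any input into (0, 1)

z = np.linspace(-10, 10, 5)
print(linear(z), relu(z), sigmoid(z), sep="\n")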
ACTIVATION FUNCTIONS
[Figure: the linear, ReLU, and sigmoid curves shown side by side.]
ACTIVATION FUNCTIONS
[Figure: input x1 feeds two neurons through weights w1 and w2; their activated outputs are combined through w3 and w4 into the prediction ŷ. The plot next to each neuron shows the shape of its output.]
OVERFITTING
OVERFITTING
Why not have a super large neural network?

OVERFITTING
Which trendline is better?
[Figure: two fits to the same training points. The left fit passes through every point exactly (MSE = .0000); the right fit is a simpler trendline that misses some points (MSE = .0113).]
OVERFITTING
Which trendline is better?
[Figure: the same two fits evaluated on new data. The fit that matched the training points exactly now has the larger error (MSE = .0308), while the simpler trendline does better (MSE = .0062).]
TRAINING VS VALIDATION DATA
Avoid memorization

Training data
• Core dataset for the model to learn on

Validation data
• New data for the model, to see if it truly understands (can generalize)

Overfitting
• When the model performs well on the training data but not on the validation data (evidence of memorization)
• Ideally the accuracy and loss should be similar between both datasets

[Figure: MSE per epoch over 10 epochs, comparing the training MSE, the expected validation MSE, and a validation MSE that shows overfitting.]
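A sketch of training with a held-out validation set in Keras, assuming the prepared arrays and a compiled model from the earlier steps; the epoch count and batch size are illustrative.

# Train on the training split while monitoring the validation split each epoch
history = model.fit(
    x_train, y_train,
    validation_data=(x_valid, y_valid),
    epochs=10,
    batch_size=32,
)

# If val_loss stops improving while loss keeps falling, the model is starting
# to memorize the training data (overfitting).
print(history.history["loss"])
print(history.history["val_loss"])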
FROM REGRESSION TO CLASSIFICATION
AN MNIST MODEL
Layer sizes:
Input           [ 0, 0, …, 0 ]   (784,)
Hidden layer                     (512,)
Hidden layer                     (512,)
Output layer                     (10,)
AN MNIST MODEL
Layer sizes and activations:
Input            [ 0, 0, …, 0 ]   (784,)
Hidden layer, ReLU                (512,)
Hidden layer, ReLU                (512,)
Output layer, Sigmoid             (10,)
AN MNIST MODEL
Layer sizes and activations:
Input            [ 0, 0, …, 0 ]   (784,)
Hidden layer, ReLU                (512,)
Hidden layer, ReLU                (512,)
Output layer, Softmax             (10,)
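A minimal Keras sketch of that architecture; the layer sizes and activations match the slide, but the exact code used in the exercise may differ.

from tensorflow import keras

# 784 flattened pixel values in, two ReLU hidden layers of 512 units,
# and a 10-unit softmax output: one probability per digit class.
model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.summary()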
RMSE FOR PROBABILITIES?
[Figure: example points plotted along x from 0 to 3.5.]
CROSS ENTROPY
[Figure: left, the example points with the blue point's prediction; right, the cross-entropy loss plotted against the assigned probability, with one curve for "loss if true" and one for "loss if false".]
CROSS ENTROPY
[Figure: the same cross-entropy curves of loss against assigned probability, for the "true" and "false" cases.]

Loss = −[ t(x)·log(p(x)) + (1 − t(x))·log(1 − p(x)) ]

t(x) = target (0 if False, 1 if True)
p(x) = probability prediction for point x
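A small sketch of this loss for a single point (names are illustrative; the probability is clipped to avoid log(0)):

import numpy as np

def cross_entropy(t, p, eps=1e-12):
    # t is the target (0 if False, 1 if True); p is the predicted probability of True
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

# Confident and correct -> small loss; confident and wrong -> large loss
print(cross_entropy(1, 0.99))   # ~0.01
print(cross_entropy(1, 0.01))   # ~4.6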
BRINGING IT TOGETHER
THE NEXT EXERCISE
The American Sign Language Alphabet

LET’S GO!
APPENDIX: GRADIENT DESCENT
HELPING THE COMPUTER CHEAT CALCULUS
Learning From Error

MSE = ½ · [ (3 − (m·1 + b))² + (5 − (m·2 + b))² ]

∂MSE/∂m = 9m + 5b − 23        ∂MSE/∂b = 5m + 3b − 13

At the starting point m = −1, b = 5:
∂MSE/∂m = −7                  ∂MSE/∂b = −3
THE LOSS CURVE
[Figure: the loss surface, with the current parameters and the target minimum marked.]
THE LOSS CURVE
∂MSE/∂m = −7        ∂MSE/∂b = −3
[Figure: the gradient at the current point on the loss surface, with the target minimum marked.]
THE LOSS CURVE
∂MSE/∂m = −7        ∂MSE/∂b = −3

Update rules (λ is the learning rate):
m := m − λ · ∂MSE/∂m
b := b − λ · ∂MSE/∂b

[Figure: the step taken on the loss surface toward the target.]
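A minimal sketch of this update loop for the two-point example; the learning rate, the number of steps, and the standard mean-squared-error gradient used here are illustrative choices rather than values taken from the slides.

import numpy as np

# The two training points and the random starting parameters from the slides
x = np.array([1.0, 2.0])
y = np.array([3.0, 5.0])
m, b = -1.0, 5.0
lam = 0.1   # learning rate (lambda)

for step in range(1000):
    y_hat = m * x + b
    # Gradients of MSE = mean((y - y_hat)^2) with respect to m and b
    grad_m = np.mean(-2 * x * (y - y_hat))
    grad_b = np.mean(-2 * (y - y_hat))
    # Move against the gradient, scaled by the learning rate
    m -= lam * grad_m
    b -= lam * grad_b

print(m, b)   # approaches m = 2, b = 1, the line through both points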
THE LOSS CURVE
∂MSE/∂m = −7        ∂MSE/∂b = −3
m := m − λ · ∂MSE/∂m
b := b − λ · ∂MSE/∂b

With a large learning rate, λ = 0.6, each update takes a big step along the gradient.
[Figure: the resulting steps on the loss surface relative to the target.]
THE LOSS CURVE
∂MSE/∂m = −7        ∂MSE/∂b = −3
m := m − λ · ∂MSE/∂m
b := b − λ · ∂MSE/∂b

With a small learning rate, λ = 0.005, each update takes only a tiny step toward the target.
[Figure: the resulting steps on the loss surface.]
THE LOSS CURVE
With λ = 0.1, updating the intercept: b := 5 − 3λ = 4.7
[Figure: the updated point on the loss surface, with the target marked.]