26 Deep Learning Annotated
Jerry Cain
March 11, 2024
Lecture Discussion on Ed
Deep Learning
Innovations in deep learning
Esteva, Andre, et al. "Dermatologist-level classification of skin cancer with deep neural networks."
Nature 542.7639 (2017): 115-118.
Deep learning

LOL, yes: Lots Of Logistic regressions.

[Figure: the input $\boldsymbol{x}$ (e.g., a binary vector $[1, 0, \ldots, 1]$) feeds many logistic regressions to produce the output $\hat{y} = P(Y = 1 \mid \boldsymbol{X} = \boldsymbol{x})$; if the output is $> 0.5$, predict 1.]
Logistic Regression Model

$\hat{Y} = \arg\max_y P(Y = y \mid \boldsymbol{X})$

[Figure: the input $\boldsymbol{x}$ feeds a weighted sum and sigmoid to produce $\hat{y}$ (e.g., 0.8). Let's focus on the model up to $\hat{y}$.]
Logistic Regression Model

$\hat{Y} = \arg\max_y P(Y = y \mid \boldsymbol{X})$

[Figure: the input $\boldsymbol{x}$ feeds weighted sums and sigmoids $\sigma$; the output $\hat{y}$ (e.g., 0.8) is compared against 0.5 to make the prediction.]
Biological basis for neural networks

A neuron: inputs $x_1, x_2, x_3, x_4$ arrive with weights $\theta_1, \theta_2, \theta_3, \theta_4$ and produce one output $\hat{y}$. One neuron = one logistic regression.

Your brain: many neurons feeding into one another. Neural network = many logistic regressions.
$\boldsymbol{x}^{(i)} = [0,0,0,0, \ldots, 1,0,0,1, \ldots, 0,0,1,0]$, $y^{(i)} = 0$
$\boldsymbol{x}^{(i)} = [0,0,1,1, \ldots, 0,1,1,0, \ldots, 0,1,0,0]$, $y^{(i)} = 1$

[Figure: the input features $\boldsymbol{x}$ (pixels, on/off) feed a weighted sum and sigmoid $\sigma$ to produce the output $\hat{y} = P(Y = 1 \mid \boldsymbol{X} = \boldsymbol{x})$.]
Logistic Regression

[Figure: arrows indicate logistic regression connections from the input features $\boldsymbol{x}$ (pixels, on/off) to the output $\hat{y} = P(Y = 1 \mid \boldsymbol{X} = \boldsymbol{x})$. Output $> 0.5$? No, so predict 0. ✅]
Logistic Regression

[Figure: the same logistic regression on a different input. Output $> 0.5$? Yes, so predict 1. ✅]
Logistic Regression

[Figure: on this input the output is $> 0.5$, so we predict 1, which is wrong. ❌ A single logistic regression is not expressive enough for every dataset.]

Big idea #2: $\sigma(\theta^T \boldsymbol{x})$, a non-linear transform of multiple values into one value, using parameter $\theta$.
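To make big idea #2 concrete, here is a minimal sketch in Python/NumPy (not from the slides; the numbers are made up) of one neuron computing $\sigma(\theta^T \boldsymbol{x})$:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(theta, x):
    """Big idea #2: non-linear transform of many input values into
    one value, using the parameter vector theta: sigma(theta^T x)."""
    return sigmoid(theta @ x)

x = np.array([1.0, 0.0, 1.0])        # input features (e.g., pixels on/off)
theta = np.array([0.5, -1.2, 0.3])   # hypothetical learned parameters
p = neuron(theta, x)                 # P(Y = 1 | X = x) under the model
print(p > 0.5)                       # threshold at 0.5: predict 1 or 0
```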
Introducing: The Neural Network

[Figure: the input features $\boldsymbol{x}$ feed a hidden layer $\boldsymbol{h}$, which feeds the output $\hat{y}$. Output $> 0.5$? No, so predict 0. ✅]
Neural network

Big idea #1: model the conditional probability of the class label given the input, $\hat{y} = P(Y \mid \boldsymbol{X} = \boldsymbol{x})$.

[Figure: input features $\boldsymbol{x}$ → hidden layer $\boldsymbol{h}$ → output $\hat{y}$. Output $> 0.5$? No, so predict 0.]
Feed neurons into other neurons

[Figure: each hidden neuron computes a weighted sum followed by a sigmoid, i.e., big idea #2: $\sigma(\theta^T \boldsymbol{x})$. Another hidden neuron does the same with different parameters. The hidden layer $\boldsymbol{h}$ then feeds the output $\hat{y}$, which is thresholded at 0.5 to predict.]

• Neuron = logistic regression
• Different parameters for every connection
Feed neurons into other neurons

[Figure: there are $|\boldsymbol{h}|$ logistic regression connections from the input into the hidden layer, for $|\boldsymbol{x}| \cdot |\boldsymbol{h}|$ parameters; the output neuron is one more logistic regression from the hidden layer to the output, for $|\boldsymbol{h}|$ parameters.]

Why doesn't a linear model introduce "complexity"? (think about it by yourself first)

Neural network ($\boldsymbol{x}$: input features, $\boldsymbol{h}$: hidden layer, $\hat{y}$: output):
1. for $j = 1, \ldots, |\boldsymbol{h}|$: $h_j = \sigma(\theta_j^T \boldsymbol{x})$
2. $\hat{y} = \sigma(\theta^{(\hat{y})T} \boldsymbol{h}) = P(Y = 1 \mid \boldsymbol{X} = \boldsymbol{x})$

Linear network:
1. for $j = 1, \ldots, |\boldsymbol{h}|$: $h_j = \theta_j^T \boldsymbol{x}$
2. $\hat{y} = \sigma(\theta^{(\hat{y})T} \boldsymbol{h}) = P(Y = 1 \mid \boldsymbol{X} = \boldsymbol{x})$

The linear model is effectively a single logistic regression with $|\boldsymbol{x}|$ parameters: with no nonlinearity in the hidden layer, $\theta^{(\hat{y})T}(\Theta \boldsymbol{x}) = (\Theta^T \theta^{(\hat{y})})^T \boldsymbol{x}$, so the hidden layer collapses into one linear map.
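A quick numerical check of this claim, as a NumPy sketch (matrix and variable names are mine): with no sigmoid in the hidden layer, the two-step "linear network" computes exactly what a single logistic regression with collapsed weights $\Theta^T \theta^{(\hat{y})}$ computes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                  # |x| = 4 input features
Theta_hidden = rng.normal(size=(3, 4))  # 3 hidden "neurons", NO sigmoid
theta_out = rng.normal(size=3)          # output neuron's parameters

# Linear network: h_j = theta_j^T x, then y_hat = sigma(theta_out^T h)
h = Theta_hidden @ x
y_linear_net = sigmoid(theta_out @ h)

# One logistic regression with collapsed weights Theta^T theta_out
theta_collapsed = Theta_hidden.T @ theta_out
y_single_lr = sigmoid(theta_collapsed @ x)

print(np.isclose(y_linear_net, y_single_lr))  # True: the same model
```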
Demonstration
https://adamharley.com/nn_vis/
Neural networks
A neural network (like logistic regression) gets its intelligence from its parameters $\theta$.

Training:
• Learn parameters $\theta$
• Find $\theta_{\text{MLE}}$ that maximizes the likelihood of the training data (MLE)
Training: Logistic Regression Review

1. Optimization problem: $\theta_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^{n} f(y^{(i)} \mid \boldsymbol{x}^{(i)}, \theta) = \arg\max_\theta LL(\theta)$
2. Compute gradient
3. Optimize
1. Same output $\hat{y}$, same log conditional likelihood

$L(\theta) = \prod_{i=1}^{n} \left(\hat{y}^{(i)}\right)^{y^{(i)}} \left(1 - \hat{y}^{(i)}\right)^{1 - y^{(i)}}$

where each $\hat{y}^{(i)}$ now comes from the network ($\boldsymbol{x}$: input features, $\boldsymbol{h}$: hidden layer):
for $j = 1, \ldots, |\boldsymbol{h}|$: $h_j = \sigma(\theta_j^T \boldsymbol{x})$
$\hat{y} = \sigma(\theta^{(\hat{y})T} \boldsymbol{h}) = P(Y = 1 \mid \boldsymbol{X} = \boldsymbol{x})$
(the model is a little more complicated)

To optimize the log conditional likelihood, we now need to find $|\boldsymbol{h}| \cdot |\boldsymbol{x}| + |\boldsymbol{h}|$ parameters ($\boldsymbol{x}$: input features, $\boldsymbol{h}$: hidden layer):
for $j = 1, \ldots, |\boldsymbol{h}|$: $h_j = \sigma(\theta_j^T \boldsymbol{x})$, where each $\theta_j$ has dimension $|\boldsymbol{x}|$
$\hat{y} = \sigma(\theta^{(\hat{y})T} \boldsymbol{h}) = P(Y = 1 \mid \boldsymbol{X} = \boldsymbol{x})$, where $\theta^{(\hat{y})}$ has dimension $|\boldsymbol{h}|$
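To keep the dimensions straight, here is a hedged NumPy sketch of the forward pass (variable names and sizes are mine): `Theta` stacks the $|\boldsymbol{h}|$ hidden-layer parameter vectors as rows, and `theta_out` holds the output neuron's $|\boldsymbol{h}|$ parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Theta, theta_out, x):
    """One-hidden-layer network from the slides.
    Theta:     (|h|, |x|) matrix; row j is theta_j (dimension |x|)
    theta_out: (|h|,) vector; the output neuron's parameters
    Returns y_hat = P(Y = 1 | X = x)."""
    h = sigmoid(Theta @ x)          # h_j = sigma(theta_j^T x), j = 1..|h|
    return sigmoid(theta_out @ h)   # y_hat = sigma(theta_out^T h)

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=784).astype(float)  # e.g., 28x28 on/off pixels
Theta = rng.normal(scale=0.01, size=(32, 784))  # |h| = 32 hidden neurons
theta_out = rng.normal(scale=0.01, size=32)
print(forward(Theta, theta_out, x))             # a probability in (0, 1)
```

Note the parameter count matches the slide: $32 \cdot 784 + 32$, i.e., $|\boldsymbol{h}| \cdot |\boldsymbol{x}| + |\boldsymbol{h}|$.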
2. Compute gradient

1. Optimization problem: $\theta_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^{n} f(y^{(i)} \mid \boldsymbol{x}^{(i)}, \theta) = \arg\max_\theta LL(\theta)$, where $h_j = \sigma(\theta_j^T \boldsymbol{x})$ for $j = 1, \ldots, |\boldsymbol{h}|$ and $\hat{y} = \sigma(\theta^{(\hat{y})T} \boldsymbol{h})$.
2. Compute gradient: take the gradient with respect to all $\theta$ parameters.

Wait, did we just skip something difficult?
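The difficult part is carrying the chain rule through the network (backpropagation). Below is a minimal sketch for one training example, assuming the one-hidden-layer sigmoid network above (function and variable names are mine); the expressions follow from $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(Theta, theta_out, x, y):
    """Gradient of the log-likelihood y*log(y_hat) + (1-y)*log(1-y_hat)
    for one training example, via the chain rule (backpropagation)."""
    h = sigmoid(Theta @ x)                  # hidden activations
    y_hat = sigmoid(theta_out @ h)          # output probability
    delta_out = y - y_hat                   # dLL/d(output pre-activation)
    grad_theta_out = delta_out * h          # dLL/d(theta_out)
    delta_h = delta_out * theta_out * h * (1 - h)  # back through sigma
    grad_Theta = np.outer(delta_h, x)       # dLL/d(theta_j), stacked as rows
    return grad_Theta, grad_theta_out

# One gradient ASCENT step on LL would then be:
#   Theta += eta * grad_Theta; theta_out += eta * grad_theta_out
```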
Shared weights?

It turns out that if you want to force some of your weights to be shared across different neurons, the math isn't much harder.

Convolution is an example of such weight sharing and is used heavily for vision (Convolutional Neural Networks, CNNs).
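As an illustration of weight sharing (a sketch, not the course's notation): in a 1-D convolution, one small parameter vector is reused at every input position, so several neurons share the same few weights.

```python
import numpy as np

# One shared parameter vector ("filter") reused at every position:
# each output neuron computes theta^T x[i:i+3] with the SAME theta.
theta_shared = np.array([1.0, -2.0, 1.0])  # hypothetical 3-tap filter
x = np.array([0.0, 1.0, 1.0, 0.0, 0.0, 1.0])

outputs = np.array([theta_shared @ x[i:i + 3] for i in range(len(x) - 2)])
print(outputs)  # 4 neurons, but only 3 parameters total (shared weights)
```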
Neural networks with multiple layers

$\boldsymbol{x} \to \boldsymbol{a} \to \boldsymbol{b} \to \boldsymbol{c} \to \boldsymbol{d} \to \boldsymbol{e} \to \boldsymbol{f} \to \hat{y} \to LL$
Neurons learn features of the dataset
Neurons in later layers will respond strongly to high-level
features of your training data.
If your training data is faces, you will get lots of face neurons.
Softmax test metric: Top-5 error

Probabilities of predictions (true class label: 5):

$y$     $P(Y = y \mid \boldsymbol{X} = \boldsymbol{x})$
5       0.14
8       0.13
7       0.12
2       0.10
9       0.10
4       0.09
1       0.09
0       0.09
6       0.08
3       0.05

Top-5 classification error: what % of datapoints did not have the correct class label in the top-5 predictions?
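A minimal sketch of how top-5 error could be computed from a matrix of predicted probabilities (function and array names are mine):

```python
import numpy as np

def top5_error(probs, labels):
    """probs:  (n, num_classes) predicted P(Y = y | X = x) per datapoint
    labels: (n,) true class labels
    Returns the fraction of datapoints whose true label is NOT among
    the 5 classes with the highest predicted probability."""
    top5 = np.argsort(probs, axis=1)[:, -5:]      # 5 best guesses per row
    hit = (top5 == labels[:, None]).any(axis=1)   # true label in top 5?
    return 1.0 - hit.mean()

# The slide's example: the true label 5 has the highest probability,
# so it is in the top 5 and contributes no error.
probs = np.array([[0.09, 0.09, 0.10, 0.05, 0.09, 0.14, 0.08, 0.12, 0.13, 0.10]])
print(top5_error(probs, np.array([5])))  # 0.0
```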
ImageNet classification

22,000 categories
14,000,000 images

Hand-engineered features (SIFT, HOG, LBP), spatial pyramid, sparse coding/compression

Example fine-grained categories:
smoothhound, smoothhound shark, Mustelus mustelus
American smooth dogfish, Mustelus canis
Florida smoothhound, Mustelus norrisi
whitetip shark, reef whitetip shark, Triaenodon obesus
Atlantic spiny dogfish, Squalus acanthias
Pacific spiny dogfish, Squalus suckleyi
hammerhead, hammerhead shark
smooth hammerhead, Sphyrna zygaena
smalleye hammerhead, Sphyrna tudes
shovelhead, bonnethead, bonnet shark, Sphyrna tiburo
angel shark, angelfish, Squatina squatina, monkfish
Stingray:
electric ray, crampfish, numbfish, torpedo
smalltooth sawfish, Pristis pectinatus
guitarfish
roughtail stingray, Dasyatis centroura
butterfly ray
eagle ray
spotted eagle ray, spotted ray, Aetobatus narinari
cownose ray, cow-nosed ray, Rhinoptera bonasus
Manta ray:
manta, manta ray, devilfish
Atlantic manta, Manta birostris
devil ray, Mobula hypostoma
grey skate, gray skate, Raja batis
little skate, Raja erinacea
…

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
ImageNet classification challenge

Random guess, with 1000 classes and 5 guesses:
$P(\text{true class label not in 5 guesses}) = \frac{1000 - 5}{1000} = \frac{995}{1000} = 99.5\%$
ImageNet challenge: Top-5 classification error (lower is better)

[Chart: top-5 error falls from 16.4% (2012), to ? for GoogLeNet (2015), to 2.25% for SENet (2017).]

Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge. IJCV 2015
Szegedy et al., Going Deeper With Convolutions. CVPR 2015
Hu et al., Squeeze-and-Excitation Networks. arXiv preprint, 2017
GoogLeNet (2015)

[Figure: the GoogLeNet architecture: multiple, multi-class outputs, and 22 layers deep!]

Szegedy et al., Going Deeper With Convolutions. CVPR 2015
Speeding up gradient descent

Gradient descent minimizes loss (a function of prediction error):

initialize theta[j] = 0 for 0 <= j <= m
repeat many times:
    gradient[j] = 0 for 0 <= j <= m
    for each training example (x, y): accumulate gradient[j] for all j
    theta[j] += eta * gradient[j] for 0 <= j <= m

Our batch gradient descent (over the entire training set) will be slow and expensive. Two speedups, sketched in code below:
1. Use stochastic gradient descent (randomly select training examples with replacement).
2. Momentum update (incorporate "acceleration" or "deceleration" of the gradient updates so far).
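A hedged sketch combining both ideas (the learning rate `eta` and momentum coefficient `beta` are illustrative; `grad_fn` is a stand-in for the per-example gradient from backpropagation):

```python
import numpy as np

def sgd_momentum(grad_fn, theta, data, eta=0.01, beta=0.9, steps=10_000):
    """Stochastic gradient ascent with momentum on the log-likelihood.
    grad_fn(theta, x, y) returns the gradient for ONE training example;
    data is a list of (x, y) training pairs."""
    rng = np.random.default_rng(0)
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        x, y = data[rng.integers(len(data))]  # random example, w/ replacement
        velocity = beta * velocity + grad_fn(theta, x, y)  # "acceleration"
        theta = theta + eta * velocity        # ascend the log-likelihood
    return theta
```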
Good ML = Generalization

Overfitting: fitting the training data too well, such that we lose the generality of the model for predicting new data.

[Figure: left, a perfect fit to the training data but a bad predictor for new data; right, a more general fit and a better predictor for new data.]

Dropout: during training, randomly leave out some neurons each training step. It will make your network more robust.
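A minimal sketch of ("inverted") dropout on one layer's activations; the keep probability 0.8 is illustrative:

```python
import numpy as np

def dropout(h, keep_prob=0.8, training=True, rng=None):
    """Randomly zero out hidden activations during training.
    Scaling the survivors by 1/keep_prob keeps the expected
    activation the same at train and test time."""
    if not training:
        return h                      # use every neuron at test time
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

h = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout(h))  # some entries zeroed, the rest scaled up
```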
Making decisions?
http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html
Deep Reinforcement Learning
http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html