
26: Intro to Deep Learning
Jerry Cain
March 11, 2024

Lecture Discussion on Ed
Deep Learning

Innovations in deep learning

Deep learning and neural networks are the core theories and technologies behind the current AI revolution.

Errata:
• Checkers is the largest game solved to date (from game theory, where a solved game is one whose outcome under perfect play can be predicted from any board position).
https://en.wikipedia.org/wiki/Solved_game
• Deep Blue was the first computer to defeat a reigning world chess champion, winning a game against Garry Kasparov in 1996 and a full match in 1997.
https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)

AlphaGo (2016)
Computers making art

A Neural Algorithm of Artistic Style
https://arxiv.org/abs/1508.06576
https://github.com/jcjohnson/neural-style

Google Deep Dream
https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html

The Next Rembrandt
https://medium.com/@DutchDigital/the-next-rembrandt-bringing-the-old-master-back-to-life-35dfb1653597
Detecting skin cancer

Esteva, Andre, et al. "Dermatologist-level classification of skin cancer with deep neural networks." Nature 542.7639 (2017): 115-118.
Deep learning

def: Deep learning is maximum likelihood estimation with neural networks.

def: A neural network is, at its core, many logistic regression units stacked on top of each other.

LOL. Yes.

[Figure: an input $\mathbf{x} = [1, 0, \dots, 1]$ feeds into "lots of logistic (regressions)", producing the output $\hat{y} = P(Y = 1 \mid \mathbf{X} = \mathbf{x})$; if the output > 0.5, predict 1.]
Logistic Regression Model

$\hat{Y} = \arg\max_y P(Y = y \mid \mathbf{X} = \mathbf{x})$

$\hat{y} = P(Y = 1 \mid \mathbf{X} = \mathbf{x}) = \sigma\Big(\theta_0 + \sum_{j=1}^{m} \theta_j X_j\Big) = \dfrac{1}{1 + e^{-\big(\theta_0 + \sum_{j=1}^{m} \theta_j X_j\big)}}$

[Figure: plot of the sigmoid $\sigma(z) = 1/(1 + e^{-z})$ for $z$ from -10 to 10, rising from 0 to 1.]

[Figure: the input features $\mathbf{x}$ feed a weighted sum (+) that produces $\hat{y}$. Let's focus on the model up to $\hat{y}$.]
Logistic Regression Model

$\hat{Y} = \arg\max_y P(Y = y \mid \mathbf{X} = \mathbf{x})$

$\hat{y} = P(Y = 1 \mid \mathbf{X} = \mathbf{x}) = \sigma\Big(\theta_0 + \sum_{j=1}^{m} \theta_j X_j\Big) = \dfrac{1}{1 + e^{-\big(\theta_0 + \sum_{j=1}^{m} \theta_j X_j\big)}}$

[Figure: plot of the sigmoid $\sigma(z) = 1/(1 + e^{-z})$ for $z$ from -10 to 10.]

[Figure: the input features $\mathbf{x}$ feed a weighted sum (+) through $\sigma$, and the result is thresholded ($\hat{y} > 0.5$?). Let's focus on the model up to $\hat{y}$.]
One neuron = One logistic regression

[Figure: the input features $\mathbf{x}$ feed a weighted sum (+) through $\sigma$ to produce $\hat{y}$.]
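To make the correspondence concrete, here is a minimal sketch of one neuron as one logistic regression (illustrative numpy code, not from the lecture; the weights are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, theta, theta_0=0.0):
    # One neuron = one logistic regression:
    # a weighted sum of the inputs, squashed to a probability by the sigmoid.
    return sigmoid(theta_0 + np.dot(theta, x))

x = np.array([1.0, 0.0, 1.0])        # input features
theta = np.array([0.5, -1.2, 2.0])   # one parameter per input (made-up values)
y_hat = neuron(x, theta)             # P(Y = 1 | X = x)
print(y_hat, "-> predict", int(y_hat > 0.5))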
Biological basis for neural networks
A neuron
[Figure: inputs $x_1, \dots, x_4$, each with its own weight $\theta_j$, feed a single output $\hat{y}$.]
One neuron = one logistic regression

Your brain (or rather, someone else's brain)
[Figure: inputs $x_1, \dots, x_4$ feed many interconnected neurons.]
Neural network = many logistic regressions
Digit recognition example
Input image → input feature vector → output label

[Image of a handwritten 0] → $\mathbf{x} = [0,0,0,0, \dots, 1,0,0,1, \dots, 0,0,1,0]$ → $y = 0$
[Image of a handwritten 1] → $\mathbf{x} = [0,0,1,1, \dots, 0,1,1,0, \dots, 0,1,0,0]$ → $y = 1$

We make feature vectors from digitized pictures of numbers.
Logistic Regression

[Figure: the input features $\mathbf{x}$ (pixels, on/off) feed a weighted sum (+) through $\sigma$ to produce the output $\hat{y} = P(Y = 1 \mid \mathbf{X} = \mathbf{x})$.]
Logistic Regression

[Figure: the input features $\mathbf{x}$ (pixels, on/off); each edge indicates a logistic regression connection; output $\hat{y} = P(Y = 1 \mid \mathbf{X} = \mathbf{x})$. Is $\hat{y} > 0.5$? No → predict 0. ✅]
Logistic Regression

[Figure: the same model on a different input; each edge indicates a logistic regression connection. Is $\hat{y} > 0.5$? Yes → predict 1. ✅]
Logistic Regression

[Figure: the same model on a harder input; each edge indicates a logistic regression connection. Is $\hat{y} > 0.5$? Yes → predict 1. ❌]

What can we do to increase the complexity of our model?
Take two big ideas from Logistic Regression (review)

Big idea #1: model the conditional probability of the class label given the input, $\hat{y} = P(Y \mid \mathbf{X} = \mathbf{x})$.

Big idea #2: non-linear transform of multiple values into one value, $\sigma(\theta^T \mathbf{x})$, using parameters $\theta$.

[Figure: input features $\mathbf{x}$; each edge indicates a logistic regression connection; output $\hat{y}$.]
Introducing: The Neural network

[Figure: the input features $\mathbf{x}$ feed a hidden layer $\mathbf{h}$, which feeds the output $\hat{y}$; $\hat{y} > 0.5$? No → predict 0. ✅]
Neural network
Big idea #1: model the conditional probability of the class label given the input, $\hat{y} = P(Y \mid \mathbf{X} = \mathbf{x})$.

[Figure: input features $\mathbf{x}$ → hidden layer $\mathbf{h}$ → output $\hat{y}$; $\hat{y} > 0.5$? No → predict 0.]
Feed neurons into other neurons

Big idea #2: non-linear transform of multiple values into one value, $\sigma(\theta^T \mathbf{x})$, using parameters $\theta$.

• Neuron = logistic regression

[Figure: one hidden neuron computes (+, $\sigma$) over the input features $\mathbf{x}$; the hidden layer $\mathbf{h}$ feeds the output $\hat{y}$; $\hat{y} > 0.5$? No → predict 0.]
Feed neurons into other neurons

• Neuron = logistic regression
• Different parameters for every connection

[Figure: another hidden neuron computes (+, $\sigma$) over the input features $\mathbf{x}$; hidden layer $\mathbf{h}$; output $\hat{y}$; $\hat{y} > 0.5$? No → predict 0.]
Feed neurons into other neurons

• Neuron = logistic regression
• Different parameters for every connection

[Figure: $|\mathbf{h}|$ logistic regression connections into the hidden layer, for $|\mathbf{x}| \cdot |\mathbf{h}|$ parameters; output $\hat{y}$; $\hat{y} > 0.5$? No → predict 0.]
Feed neurons into other neurons

• Neuron = logistic regression
• Different parameters for every connection

[Figure: $|\mathbf{h}|$ logistic regression connections into the hidden layer ($|\mathbf{x}| \cdot |\mathbf{h}|$ parameters); the output neuron computes (+, $\sigma$) over the hidden layer $\mathbf{h}$; $\hat{y} > 0.5$? No → predict 0.]
Feed neurons into other neurons

• Neuron = logistic regression
• Different parameters for every connection

[Figure: $|\mathbf{h}|$ logistic regression connections into the hidden layer ($|\mathbf{x}| \cdot |\mathbf{h}|$ parameters) and 1 logistic regression connection into the output ($|\mathbf{h}|$ parameters); $\hat{y} > 0.5$? No → predict 0.]
Why doesn’t a linear model introduce “complexity”?
Neural network:
1. for 𝑗 = 1, … , |𝒉|:
#
ℎ! = 𝜎 𝜃! " 𝒙 1. 2.

𝑦, = 𝜎 𝜃 %$ # 𝒉 = 𝑃 𝑌 = 1|𝑿 = 𝒙
2. "#, output


$, hidden
layer
Linear network : !, input features
1. for 𝑗 = 1, … , |𝒉|:
" #
ℎ! = 𝜃! 𝒙
#
2. 𝑦, = 𝜎 𝜃 %$ 𝒉 = 𝑃 𝑌 = 1|𝑿 = 𝒙
(by yourself)

Lisa Yan, Chris Piech, Mehran Sahami, and Jerry Cain, CS109, Winter 2024 24
Why doesn’t a linear model introduce “complexity”?
Neural network:
1. for 𝑗 = 1, … , |𝒉|:
#
ℎ! = 𝜎 𝜃! " 𝒙 1. 2.

𝑦, = 𝜎 𝜃 %$ # 𝒉 = 𝑃 𝑌 = 1|𝑿 = 𝒙
2. "#, output


$, hidden
layer
Linear network : !, input features
1. for 𝑗 = 1, … , |𝒉|:
" #
ℎ! = 𝜃! 𝒙
The linear model is effectively
#
2. 𝑦, = 𝜎 𝜃 %$ 𝒉 = 𝑃 𝑌 = 1|𝑿 = 𝒙 a single logistic regression
with 𝒙 parameters.
Lisa Yan, Chris Piech, Mehran Sahami, and Jerry Cain, CS109, Winter 2024 25
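A quick numerical check of this claim (an illustration with arbitrary shapes and random weights, not course code): with a linear hidden layer, the hidden weights and the output weights compose into a single weight vector, so the two-layer model predicts exactly the same thing as one logistic regression over $\mathbf{x}$.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=5)
W_h = rng.normal(size=(3, 5))     # hidden-layer weights, one row per hidden unit
theta_out = rng.normal(size=3)    # output weights

# "Linear network": linear hidden layer, sigmoid only at the output.
y_linear = sigmoid(theta_out @ (W_h @ x))

# Collapse the two layers into one logistic regression with |x| parameters.
theta_combined = theta_out @ W_h
print(np.isclose(y_linear, sigmoid(theta_combined @ x)))   # True: no added complexity

# Neural network: a sigmoid hidden layer does NOT collapse this way.
y_nn = sigmoid(theta_out @ sigmoid(W_h @ x))
print(np.isclose(y_nn, sigmoid(theta_combined @ x)))        # generally False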
Demonstration

https://adamharley.com/nn_vis/
Neural networks

A neural network (like logistic regression) gets its intelligence from its parameters $\theta$.

Training:
• Learn parameters $\theta$
• Find the $\theta_{MLE}$ that maximizes the likelihood of the training data (MLE)

Testing/Prediction: for an input feature vector $\mathbf{X} = \mathbf{x}$:
• Use the parameters to compute $\hat{y} = P(Y = 1 \mid \mathbf{X} = \mathbf{x})$
• Classify the instance as 1 if $\hat{y} > 0.5$, and as 0 otherwise
Neural networks

A neural network (like logistic regression) gets its intelligence from its parameters $\theta$.

Training:
• Learn parameters $\theta$
• Find the $\theta_{MLE}$ that maximizes the likelihood of the training data (MLE)

How do we learn the $|\mathbf{x}| \cdot |\mathbf{h}| + |\mathbf{h}|$ parameters?
Gradient ascent + chain rule!
Training: Logistic Regression Review

1. Optimization problem:
$\theta_{MLE} = \arg\max_\theta \prod_{i=1}^{n} f\big(y^{(i)} \mid \mathbf{x}^{(i)}, \theta\big) = \arg\max_\theta LL(\theta)$

$LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big)$ 🌟

where $\hat{y}^{(i)} = \sigma\big(\theta^T \mathbf{x}^{(i)}\big) = P\big(Y = 1 \mid \mathbf{X} = \mathbf{x}^{(i)}\big)$

2. Compute gradient: find all $|\mathbf{x}|$ parameters

3. Optimize:
initialize params
repeat many times:
    compute gradient
    params += η * gradient
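Putting the three steps together for logistic regression (a hypothetical sketch, not the course's reference code): the gradient of $LL(\theta)$ works out to $\sum_i \big(y^{(i)} - \hat{y}^{(i)}\big)\,\mathbf{x}^{(i)}$, and gradient ascent repeatedly steps in that direction.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, n_steps=2000):
    # Gradient ascent on the log-likelihood LL(theta).
    # X: (n, m) matrix with a leading column of 1s for theta_0; y: (n,) labels in {0, 1}.
    theta = np.zeros(X.shape[1])            # initialize params
    for _ in range(n_steps):                # repeat many times
        y_hat = sigmoid(X @ theta)          # current P(Y = 1 | x) for every example
        gradient = X.T @ (y - y_hat)        # gradient of LL with respect to theta
        theta += eta * gradient / len(y)    # params += eta * gradient
    return theta

# Toy usage on a tiny separable dataset.
X = np.array([[1, 0.1], [1, 0.4], [1, 2.1], [1, 2.5]])   # first column is the intercept
y = np.array([0, 0, 1, 1])
theta = train_logistic_regression(X, y)
print(sigmoid(X @ theta))   # low for the first two examples, high for the last two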
Training: Neural networks

1. Optimization problem:
$\theta_{MLE} = \arg\max_\theta \prod_{i=1}^{n} f\big(y^{(i)} \mid \mathbf{x}^{(i)}, \theta\big) = \arg\max_\theta LL(\theta)$

2. Compute gradient

3. Optimize
1. Same output $\hat{y}$ ⇒ same log conditional likelihood

$\theta_{MLE} = \arg\max_\theta \prod_{i=1}^{n} f\big(y^{(i)} \mid \mathbf{x}^{(i)}, \theta\big) = \arg\max_\theta LL(\theta)$

$L(\theta) = \prod_{i=1}^{n} P\big(Y = y^{(i)} \mid \mathbf{X} = \mathbf{x}^{(i)}, \theta\big)$    (binary class labels: $Y \in \{0, 1\}$)

$L(\theta) = \prod_{i=1}^{n} \big(\hat{y}^{(i)}\big)^{y^{(i)}} \big(1 - \hat{y}^{(i)}\big)^{1 - y^{(i)}}$

$LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big)$

Model ($\mathbf{x}$, input features → $\mathbf{h}$, hidden layer → $\hat{y}$, output):
for $j = 1, \dots, |\mathbf{h}|$:  $h_j = \sigma\big(\theta_j^{(h)T} \mathbf{x}\big)$
$\hat{y} = \sigma\big(\theta^{(\hat{y})T} \mathbf{h}\big) = P(Y = 1 \mid \mathbf{X} = \mathbf{x})$
(model is a little more complicated)

$\theta_{MLE} = \arg\max_\theta \prod_{i=1}^{n} f\big(y^{(i)} \mid \mathbf{x}^{(i)}, \theta\big) = \arg\max_\theta LL(\theta)$

$LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big)$

Model ($\mathbf{x}$, input features → $\mathbf{h}$, hidden layer → $\hat{y}$, output):
for $j = 1, \dots, |\mathbf{h}|$:  $h_j = \sigma\big(\theta_j^{(h)T} \mathbf{x}\big)$    (each $\theta_j^{(h)}$ has dimension $|\mathbf{x}|$)
$\hat{y} = \sigma\big(\theta^{(\hat{y})T} \mathbf{h}\big) = P(Y = 1 \mid \mathbf{X} = \mathbf{x})$    ($\theta^{(\hat{y})}$ has dimension $|\mathbf{h}|$)

To optimize the log conditional likelihood, we now need to find $|\mathbf{h}| \cdot |\mathbf{x}| + |\mathbf{h}|$ parameters.
2. Compute gradient

1. Optimization problem:
$\theta_{MLE} = \arg\max_\theta \prod_{i=1}^{n} f\big(y^{(i)} \mid \mathbf{x}^{(i)}, \theta\big) = \arg\max_\theta LL(\theta)$

$LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big)$

$h_j = \sigma\big(\theta_j^{(h)T} \mathbf{x}\big)$ for $j = 1, \dots, |\mathbf{h}|$,   $\hat{y} = \sigma\big(\theta^{(\hat{y})T} \mathbf{h}\big)$

2. Compute gradient: take the gradient with respect to all $\theta$ parameters
Calculus refresher #1: the derivative of a sum is the sum of the derivatives
Calculus refresher #2: the chain rule 🌟🌟🌟

3. Optimize
3. Optimize

1. Optimization problem:
$\theta_{MLE} = \arg\max_\theta \prod_{i=1}^{n} f\big(y^{(i)} \mid \mathbf{x}^{(i)}, \theta\big) = \arg\max_\theta LL(\theta)$

$LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big)$

$h_j = \sigma\big(\theta_j^{(h)T} \mathbf{x}\big)$ for $j = 1, \dots, |\mathbf{h}|$,   $\hat{y} = \sigma\big(\theta^{(\hat{y})T} \mathbf{h}\big)$

2. Compute gradient: take the gradient with respect to all $\theta$ parameters

3. Optimize:
initialize params
repeat many times:
    compute gradient
    params += η * gradient
Training a neural net

1. Optimization problem:
$\theta_{MLE} = \arg\max_\theta \prod_{i=1}^{n} f\big(y^{(i)} \mid \mathbf{x}^{(i)}, \theta\big) = \arg\max_\theta LL(\theta)$

$LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big)$

$h_j = \sigma\big(\theta_j^{(h)T} \mathbf{x}\big)$ for $j = 1, \dots, |\mathbf{h}|$,   $\hat{y} = \sigma\big(\theta^{(\hat{y})T} \mathbf{h}\big)$

Wait, did we just skip something difficult?

2. Compute gradient: take the gradient with respect to all $\theta$ parameters

3. Optimize:
initialize params
repeat many times:
    compute gradient
    params += η * gradient
2. Compute gradient via backpropagation

1. Optimization problem:
$\theta_{MLE} = \arg\max_\theta \prod_{i=1}^{n} f\big(y^{(i)} \mid \mathbf{x}^{(i)}, \theta\big) = \arg\max_\theta LL(\theta)$

$LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big)$

$h_j = \sigma\big(\theta_j^{(h)T} \mathbf{x}\big)$ for $j = 1, \dots, |\mathbf{h}|$,   $\hat{y} = \sigma\big(\theta^{(\hat{y})T} \mathbf{h}\big)$

2. Compute gradient: take the gradient with respect to all $\theta$ parameters

3. Optimize:
initialize params
repeat many times:
    compute gradient
    params += η * gradient

Learn the tricks behind backpropagation in CS229, CS231N, CS224N, etc.
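Those courses cover the general algorithm, but for the one-hidden-layer model above the chain rule can be written out directly. Here is a minimal numpy sketch (my own illustration; biases omitted) that computes the gradients and checks one entry numerically, using the fact that dσ(z)/dz = σ(z)(1 - σ(z)).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_h, w_out):
    # Forward pass: sigmoid hidden layer, sigmoid output.
    h = sigmoid(W_h @ x)            # hidden layer, shape (|h|,)
    y_hat = sigmoid(w_out @ h)      # scalar, P(Y = 1 | X = x)
    return h, y_hat

def gradients(x, y, W_h, w_out):
    # Backpropagation (chain rule) for LL = y*log(y_hat) + (1-y)*log(1-y_hat).
    h, y_hat = forward(x, W_h, w_out)
    delta_out = y - y_hat                        # dLL/dz_out, where z_out = w_out . h
    grad_w_out = delta_out * h                   # dLL/dw_out
    delta_h = delta_out * w_out * h * (1 - h)    # dLL/dz_j for each hidden unit
    grad_W_h = np.outer(delta_h, x)              # dLL/dW_h, shape (|h|, |x|)
    return grad_W_h, grad_w_out

# Sanity check one entry against a finite-difference estimate.
rng = np.random.default_rng(0)
x, y = rng.normal(size=4), 1.0
W_h, w_out = rng.normal(size=(3, 4)), rng.normal(size=3)
grad_W_h, _ = gradients(x, y, W_h, w_out)

def ll(W, w):
    y_hat = forward(x, W, w)[1]
    return y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)

W_bump = W_h.copy()
W_bump[0, 0] += 1e-6
print(grad_W_h[0, 0], (ll(W_bump, w_out) - ll(W_h, w_out)) / 1e-6)   # nearly equal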
Beyond the basics
Shared weights?

It turns out if you want to force some of your weights to be shared over
different neurons, the math isn’t much harder.
Convolution is an example of such weight-sharing and is used a lot for
vision (Convolutional Neural Networks, CNN).

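As a tiny illustration of weight sharing (my own sketch, not from the slides): in a 1-D convolution the same small weight vector is slid across the whole input, instead of each output neuron getting its own parameters.

import numpy as np

def conv1d(x, w):
    # 1-D convolution (cross-correlation): reuse the same weights w
    # at every position of the input x.
    n, k = len(x), len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(n - k + 1)])

signal = np.array([0., 0., 1., 1., 1., 0., 0.])
edge_filter = np.array([1., -1.])      # two shared weights that respond to changes
print(conv1d(signal, edge_filter))     # [ 0. -1.  0.  0.  1.  0.]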
Neural networks with multiple layers

[Figure: a deep network as a chain of layers, $\mathbf{x} \to \mathbf{a} \to \mathbf{b} \to \mathbf{c} \to \mathbf{d} \to \mathbf{e} \to \mathbf{f} \to \hat{y} \to LL$.]
Neurons learn features of the dataset
Neurons in later layers will respond strongly to high-level
features of your training data.
If your training data is faces, you will get lots of face neurons.

If your training data is all of YouTube… you get a cat neuron.

[Figure: the cat neuron's top stimuli in the test set, and the optimal stimulus found by numerical optimization.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
Multiple outputs?

Softmax is a generalization of the sigmoid function.

sigmoid: for $z \in \mathbb{R}$, $\sigma(z)$ is a value in the range $[0, 1]$:
$P(Y = 1 \mid \mathbf{X} = \mathbf{x}) = \sigma(z)$    (equivalent: Bernoulli $p$)

softmax: for $\mathbf{z} \in \mathbb{R}^k$, $\mathrm{softmax}(\mathbf{z})$ is a vector of $k$ values in the range $[0, 1]$ that add up to 1:
$P(Y = i \mid \mathbf{X} = \mathbf{x}) = \mathrm{softmax}(\mathbf{z})_i$    (equivalent: Multinomial $p_1, \dots, p_k$)
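A minimal numpy sketch of softmax (the max-subtraction is a standard numerical-stability trick, not something the slide specifies):

import numpy as np

def softmax(z):
    # Map k real-valued scores to k probabilities that sum to 1.
    e = np.exp(z - np.max(z))   # subtract max(z) to avoid overflow
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # roughly [0.66, 0.24, 0.10]
print(softmax(scores).sum())    # 1.0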
Softmax test metric: Top-5 error

Probabilities of predictions for one example (true class label: 5):

Y = y | P(Y = y | X = x)
  5   | 0.14
  8   | 0.13
  7   | 0.12
  2   | 0.10
  9   | 0.10
  4   | 0.09
  1   | 0.09
  0   | 0.09
  6   | 0.08
  3   | 0.05

Top-5 classification error: what % of datapoints did not have the correct class label among the top-5 predictions?
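A small sketch of computing this metric from a matrix of softmax outputs (illustrative code; the argsort approach is my own choice):

import numpy as np

def top5_error(probs, labels):
    # Fraction of examples whose true label is NOT among the 5 classes
    # with the highest predicted probability.
    # probs: (n_examples, n_classes) array; labels: (n_examples,) integer labels.
    top5 = np.argsort(probs, axis=1)[:, -5:]
    hits = np.array([label in row for row, label in zip(top5, labels)])
    return 1.0 - hits.mean()

# The example above: the true label 5 has the highest probability,
# so it is in the top 5 and this datapoint does not count as an error.
probs = np.array([[0.09, 0.09, 0.10, 0.05, 0.09, 0.14, 0.08, 0.12, 0.13, 0.10]])
print(top5_error(probs, np.array([5])))   # 0.0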
ImageNet classification

22,000 categories
14,000,000 images
Hand-engineered features (SIFT, HOG, LBP), spatial pyramid, sparse coding/compression

Example categories (sharks and rays):
smoothhound, smoothhound shark, Mustelus mustelus
American smooth dogfish, Mustelus canis
Florida smoothhound, Mustelus norrisi
whitetip shark, reef whitetip shark, Triaenodon obseus
Atlantic spiny dogfish, Squalus acanthias
Pacific spiny dogfish, Squalus suckleyi
hammerhead, hammerhead shark
smooth hammerhead, Sphyrna zygaena
smalleye hammerhead, Sphyrna tudes
shovelhead, bonnethead, bonnet shark, Sphyrna tiburo
angel shark, angelfish, Squatina squatina, monkfish
electric ray, crampfish, numbfish, torpedo
smalltooth sawfish, Pristis pectinatus
guitarfish
roughtail stingray, Dasyatis centroura
butterfly ray
eagle ray
spotted eagle ray, spotted ray, Aetobatus narinari
cownose ray, cow-nosed ray, Rhinoptera bonasus
manta, manta ray, devilfish
Atlantic manta, Manta birostris
devil ray, Mobula hypostoma
grey skate, gray skate, Raja batis
little skate, Raja erinacea

[Images: a stingray and a manta ray]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
ImageNet classification challenge

22,000 categories → 1,000 categories
14,000,000 images → 1,200,000 images in the train set, 200,000 images in the test set
Hand-engineered features (SIFT, HOG, LBP), spatial pyramid, sparse coding/compression

[Same example shark and ray categories as on the previous slide.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
ImageNet challenge: Top-5 classification error (lower is better)

Random guess: 99.5%

$P(\text{true class label not in 5 guesses}) = \dfrac{\binom{999}{5}}{\binom{1000}{5}} = \dfrac{995}{1000}$
ImageNet challenge: Top-5 classification error (lower is better)

[Figure: bar chart of top-5 error. Random guess: 99.5%. Pre-neural-network methods: 25.8%. 16.4%. GoogLeNet (2015): ? Humans (2014): 5.1%.]

Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge. IJCV 2015
Szegedy et al., Going Deeper With Convolutions. CVPR 2015
Hu et al., Squeeze-and-Excitation Networks. arXiv preprint, 2017
ImageNet challenge: Top-5 classification error (lower is better)

[Figure: bar chart of top-5 error. Random guess: 99.5%. Pre-neural-network methods: 25.8%. 16.4%. GoogLeNet (2015): ? SENet (2017): 2.25%. Humans (2014): 5.1%.]

Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge. IJCV 2015
Szegedy et al., Going Deeper With Convolutions. CVPR 2015
Hu et al., Squeeze-and-Excitation Networks. arXiv preprint, 2017
GoogLeNet (2015)

1 trillion artificial neurons
(btw, human brains have roughly 86 billion neurons)

Multiple, multi-class outputs
22 layers deep!

Szegedy et al., Going Deeper With Convolutions. CVPR 2015
Speeding up gradient descent
(gradient descent minimizes a loss, which is a function of the prediction error)

initialize θ[j] = 0 for 0 ≤ j ≤ m
repeat many times:
    gradient[j] = 0 for 0 ≤ j ≤ m
    for each training example (x, y):                 ⚠ 1. What if we have 1,200,000 images in our training set?
        for each 0 ≤ j ≤ m:
            compute gradient
    θ[j] -= η * gradient[j] for all 0 ≤ j ≤ m         ⚠ 2. How can we speed up the update?

Our batch gradient descent (over the entire training set) will be slow and expensive.
1. Use stochastic gradient descent
   (randomly select training examples with replacement).
2. Momentum update
   (incorporate the "acceleration" or "deceleration" of the gradient updates so far).
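Here is a minimal sketch combining the two speed-ups (names, learning rate, and momentum coefficient are my own choices, not from the course): each step uses the gradient on one randomly chosen example, and the update accumulates a velocity term.

import numpy as np

def sgd_momentum(grad_fn, theta, data, lr=0.01, beta=0.9, steps=1000, seed=0):
    # Stochastic gradient descent with momentum.
    # grad_fn(theta, x, y) returns the gradient of the loss on one example.
    rng = np.random.default_rng(seed)
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        x, y = data[rng.integers(len(data))]   # one example, sampled with replacement
        g = grad_fn(theta, x, y)
        velocity = beta * velocity + g         # "acceleration" of recent gradients
        theta = theta - lr * velocity          # descend the estimated loss
    return theta

# Toy usage: fit y ≈ theta * x by minimizing squared error.
data = [(x, 3.0 * x) for x in np.linspace(-1, 1, 50)]
grad = lambda th, x, y: np.array([2 * (th[0] * x - y) * x])
print(sgd_momentum(grad, np.array([0.0]), data))   # close to [3.]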
Good ML = Generalization

Overfitting: fitting the training data too well, such that we lose the generality of the model for predicting new data.

[Figure: two fits to the same points; a perfect fit that is a bad predictor for new data, and a more general fit that is a better predictor for new data.]

Dropout: during training, randomly leave out some neurons at each training step. It will make your network more robust.
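A small sketch of one common way to implement dropout ("inverted" dropout; the rescaling of surviving units is standard practice but not something the slide specifies):

import numpy as np

def dropout(h, p_drop, rng, training=True):
    # During training, zero out each hidden unit with probability p_drop,
    # and scale the survivors so the expected activation is unchanged at test time.
    if not training:
        return h
    keep = rng.random(h.shape) >= p_drop
    return h * keep / (1.0 - p_drop)

rng = np.random.default_rng(0)
hidden = np.ones(10)
print(dropout(hidden, p_drop=0.5, rng=rng))   # about half the units zeroed, the rest scaled to 2.0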
Making decisions?

Not everything is classification.

Deep Reinforcement Learning


Instead of having the output of a model be a probability, you make the output an expectation.

http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html
Deep Reinforcement Learning

http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html

DeepMind Atari games
[Figure: score on each game compared to the best human.]
