
Lecture 11: Convolutional neural network case studies

Prof J.C. Kao, UCLA ECE

Announcements:


• HW #4 is due Friday, Feb 17, uploaded to Gradescope. To submit your Jupyter Notebook, print the notebook to a pdf with your solutions and plots filled in. You must also submit your .py files as pdfs.

• Please read the midterm announcement carefully.

• Last name beginning with K or L: Please take the exam in ROYCE 154.
• Last name beginning with S or T: Please take the exam in ROYCE 156.
• Last name beginning with B or C: Please take the exam in ROYCE 162.
• Everyone else: Please take the exam in HAINES 39, our normal classroom.

• Midterm exam review session: Thursday, Feb 16, 6-9pm at WG Young CS50. Review problems will be posted tonight.

• No lecture or OH this Monday (Presidents' Day holiday). Wednesday OH after the midterm are canceled.

All convolutions in this class are valid convolutions

In this class, all convolutions will be valid convolutions. We will explicitly specify the amount of zero padding when we need it.

Output size: for a w_in x w_in input and a w_f x w_f filter, a valid convolution produces a (w_in - w_f + 1) x (w_in - w_f + 1) output.
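As a quick sanity check (my own example, not from the slides), SciPy's convolve2d with mode='valid' reproduces this sizing:

```python
import numpy as np
from scipy.signal import convolve2d

x = np.random.randn(32, 32)   # input "image"
w = np.random.randn(5, 5)     # 5x5 filter

# A valid convolution shrinks each side by w_f - 1.
y = convolve2d(x, w, mode='valid')
print(y.shape)  # (28, 28) = (32 - 5 + 1, 32 - 5 + 1)
```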


Recap: convolutional padding



Convolutional stride

(Figure: a filter applied at stride = 2.)


Pooling layer

(Figure: pooling applied to each feature map.)
Sizing examples

With stride 1 and pad 0, the output size is w_in - w_f + 1.

C1 contains six 5x5 conv filters.
• Size of output feature maps at C1? 32 - 5 + 1 = 28, so (28x28x6).
• Number of parameters in C1? (5x5 + 1) x 6 = 156 params.

S2 is a 2x2 pooling layer applied at stride 2.
• Size of output feature maps at S2? (14x14x6).
• Number of parameters in S2? 0.

C3 contains sixteen 5x5 conv filters.
• Size of output feature maps at C3? 14 - 5 + 1 = 10, so (10x10x16).
• Number of parameters in C3? (5x5x6 + 1) x 16 = 2,416 params.
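These numbers can be checked with a small Python helper (my sketch, not from the lecture):

```python
def conv_output_size(w_in, w_f, stride=1, pad=0):
    """Spatial output size of a convolution: (w_in - w_f + 2*pad)/stride + 1."""
    return (w_in - w_f + 2 * pad) // stride + 1

def conv_params(w_f, c_in, n_filters):
    """Trainable parameters: w_f*w_f*c_in weights plus one bias per filter."""
    return (w_f * w_f * c_in + 1) * n_filters

print(conv_output_size(32, 5))     # 28   (C1 feature map size)
print(conv_params(5, 1, 6))        # 156  (C1 parameters)
print(conv_output_size(28, 2, 2))  # 14   (S2 pool output)
print(conv_output_size(14, 5))     # 10   (C3 feature map size)
print(conv_params(5, 6, 16))       # 2416 (C3 parameters)
```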


CNN architecture

(Figure: a typical CNN architecture: stacked CONV/POOL layers followed by fully connected layers.)


Case studies

To help get an intuition behind CNNs, we'll go over a few architectures that have been influential in recent years.

Case studies:

• LeNet (1998)
• AlexNet (2012)
• VGG (2014)
• GoogLeNet (2014)
• ResNet (2015)


LeNet-5

LeCun et al., 1998.

Applied to handwriting recognition.



LeNet-5

LeCun et al., 1998.

Question: How many connections are there in the first convolutional layer?

Every neuron has 5x5 + 1 = 26 connections (25 weights plus a bias), and there are 28x28x6 neurons. Number of connections: 28x28x6 x (5x5 + 1) = 122,304.




LeNet-5

4 total layers. Input is 32x32.

1. [28x28x6] CONV: 6 convolutional filters, 5x5 filter size, applied at stride 1.
2. [14x14x6] POOL: 2x2 pool with stride 2. (Adds all elems, multiplies them by a trainable coefficient, then passes through a sigmoid.)
3. [10x10x16] CONV: 16 convolutional filters, 5x5.
4. [5x5x16] POOL: 2x2 pool with stride 2.
5. [120] CONV: 120 5x5 convolutional filters. (5 - 5 + 1 = 1, so each filter produces a 1x1 output.)
6. [84] FC: FC layer, 84 x 120.
7. [10] OUT: MSE against a template for each digit.


LeNet-5

LeCun et al., 1998.

Overall architecture:

[CONV-POOL]x2 - CONV - FC - OUT

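A minimal PyTorch sketch of this architecture (my own approximation: it substitutes modern max pooling and ReLUs for LeNet's original scaled-sum pooling and sigmoids, and a plain linear output layer for the template-matching output):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32 -> 28x28x6
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2),         # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),   # -> 10x10x16
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2),         # -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=5), # -> 1x1x120
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, 10),                 # class scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.zeros(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```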


AlexNet in context

The number of layers refers to the number of convolutional or FC layers.

http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf


AlexNet

AlexNet, Krizhevsky et al., NIPS 2012.

Input processing:
• ImageNet has variable-sized images.
• Downsample or resize each image; given a rectangular image:
  • Resize so the shorter side is 256 pixels.
  • Crop out the central 256 x 256 pixels.
• The actual input to the CNN is 224 x 224 x 3 after data augmentation.
  • However, the layer sizing doesn't quite work out, so we'll say it's 227x227x3.
• Subtract the mean image over the training set from each pixel.
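A sketch of that preprocessing using Pillow and NumPy (illustrative only; `preprocess` and `mean_img` are my own names, and `mean_img` is assumed to be the precomputed 256x256x3 mean training image):

```python
import numpy as np
from PIL import Image

def preprocess(path, mean_img):
    img = Image.open(path).convert('RGB')
    # Resize so the shorter side is 256 pixels.
    w, h = img.size
    scale = 256 / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    # Crop out the central 256 x 256 patch.
    w, h = img.size
    left, top = (w - 256) // 2, (h - 256) // 2
    x = np.asarray(img.crop((left, top, left + 256, top + 256)), dtype=np.float32)
    # Subtract the mean training image from each pixel.
    return x - mean_img
```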


AlexNet

AlexNet, Krizhevsky et al., NIPS 2012.

Nonlinearity:
• Used the ReLU. It was faster than sigmoidal or tanh units.



AlexNet

AlexNet, Krizhevsky et al., NIPS 2012.

(Figure: training curves. Dotted line is tanh; solid line is ReLU.)

Clearly ReLU resulted in more efficient training.

ReLU is at the output of every convolutional and fully-connected layer.


AlexNet

AlexNet, Krizhevsky et al., NIPS 2012.

Training on multiple GPUs.

• This is why the above image is cropped. Everything is replicated x2, and the two paths correspond to training on two GPUs.
• They trained on GPUs due to memory; they trained on 1.2 million images and stored them on GPUs; each GPU had just 3 GB of memory.


AlexNet

AlexNet, Krizhevsky et al., NIPS 2012.

• Local response normalization (not common anymore).
• Used overlapping pooling (i.e., the stride was smaller than the pool width).


AlexNet

AlexNet, Krizhevsky et al., NIPS 2012.

• Data augmentation:
  • Image translations and horizontal reflections.
  • Extract random 224 x 224 patches and their horizontal reflections.
  • At test time, extract 5 patches (the four corners and the center) plus their reflections, and average the predictions of the 10 output softmaxes. This averaging reduces the error rate by ~1.5%.
  • Color augmentation: scale the PCs of the colors, capturing different levels of illumination and intensities.
    • Reduces the Top 1 error rate by 1%.
• Dropout with p = 0.5.
  • Substantially reduces overfitting; takes twice as long to train.
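The test-time averaging could look like this in PyTorch (a sketch under the assumption that `model` returns class scores for a batch; the crop offsets are for 256x256 inputs):

```python
import torch

def ten_crop_predict(model, img):
    """img: a 3x256x256 tensor. Average softmax probabilities over
    5 fixed 224x224 crops (4 corners + center) and their horizontal flips."""
    crops = []
    for top, left in [(0, 0), (0, 32), (32, 0), (32, 32), (16, 16)]:
        crop = img[:, top:top + 224, left:left + 224]
        crops += [crop, torch.flip(crop, dims=[2])]   # add the reflection
    batch = torch.stack(crops)                        # 10 x 3 x 224 x 224
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)
    return probs.mean(dim=0)                          # averaged class probabilities
```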
AlexNet

AlexNet, Krizhevsky et al., NIPS 2012.

• SGD with momentum and weight decay.
  • Batch size: 128, momentum: 0.9.
  • Learning rate initialized to 0.01, manually decreased when validation error stopped improving.
  • L2 weight decay: 0.0005 (a form of regularization).
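In PyTorch, these hyperparameters map directly onto torch.optim.SGD (a sketch; the tiny `model` here is a stand-in for the CNN, and ReduceLROnPlateau automates the manual decrease described above):

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for the CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# AlexNet decreased the LR by hand; ReduceLROnPlateau automates the same idea.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
```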


AlexNet

AlexNet, Krizhevsky et al., NIPS 2012.

• Training time: roughly five to six days on two GTX 580 GPUs.



AlexNet

AlexNet, Krizhevsky et al., NIPS 2012.

• Averaged the output of multiple CNNs. Validation error of …


• 1 CNN: 18.2%
• 5 CNNs: 16.4%
• 7 CNNs: 15.4%



AlexNet

(Figure: ILSVRC results; top 5 error rate of 15.4%.)


AlexNet

AlexNet, Krizhevsky et al., NIPS 2012.

• Importance of depth?
• Validation error worsens by 2% by removing any middle layer.



AlexNet

Architecture: 8 layers. Input is 227x227x3 (in paper, 224x224x3; numbers were changed so the operations work out).

Question: The input is 227x227x3. The first convolutional layer has 96 11x11 filters applied at stride 4. What is the output size?

Using w_out = (w_in - w_f + 2*pad)/stride + 1: (227 - 11)/4 + 1 = 55, so the output is (55x55x96).
AlexNet

Question: How many trainable parameters in the first convolutional layer? (Recall, 96 filters that are 11x11.)

Each filter has 11x11x3 weights plus 1 bias, so 96 x (11x11x3 + 1) = 34,944 parameters.




AlexNet

Question: The output of the first convolutional layer is 55x55x96. The pooling layer is 3x3 filters applied at stride 2. What is the output size?

(55 - 3)/2 + 1 = 27, so the output is (27x27x96).


AlexNet

Question: How many trainable parameters in the first pooling layer? (Recall, pool is with 3x3 filters at stride 2.)

Zero: pooling layers have no trainable parameters.




AlexNet

Question: The input into the second convolutional layer is 27x27x96. The layer has 256 5x5 filters at stride 1 with pad 2. What is the output size?

(27 - 5 + 2*2)/1 + 1 = 27, so the output is (27x27x256).




AlexNet

Architecture: 8 layers. Input is 227x227x3 (in paper, 224x224x3; numbers were changed so the operations work out). Throughout, sizes follow w_out = (w_in - w_f + 2*pad)/stride + 1.

1. [55x55x96] CONV: 96 filters of size 11x11x3 with stride 4.
2. [27x27x96] POOL: 3x3 filters with stride 2.
3. [27x27x96] NORM: normalization layer.
4. [27x27x256] CONV: 256 filters of size 5x5x96 with stride 1, pad 2.
5. [13x13x256] POOL: 3x3 filters with stride 2.
6. [13x13x256] NORM: normalization layer.
7. [13x13x384] CONV: 384 filters of size 3x3 at stride 1, pad 1.
8. [13x13x384] CONV: 384 filters of size 3x3 at stride 1, pad 1.
9. [13x13x256] CONV: 256 filters of size 3x3 at stride 1, pad 1.
10. [6x6x256] POOL: 3x3 filters at stride 2.
11. [4096] FC: Fully connected layer with 4096 units.
12. [4096] FC: Fully connected layer with 4096 units.
13. [1000] FC: Fully connected layer with 1000 units (class scores).
14. [1000] OUT: Softmax layer.
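For concreteness, here is a PyTorch sketch of these 14 steps (my own rendering, not the original implementation; the two-GPU split is omitted, and dropout is placed on the first two FC layers as described earlier):

```python
import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(),               # 1. 55x55x96
    nn.MaxPool2d(3, stride=2),                                # 2. 27x27x96
    nn.LocalResponseNorm(5),                                  # 3. NORM
    nn.Conv2d(96, 256, 5, stride=1, padding=2), nn.ReLU(),    # 4. 27x27x256
    nn.MaxPool2d(3, stride=2),                                # 5. 13x13x256
    nn.LocalResponseNorm(5),                                  # 6. NORM
    nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),             # 7. 13x13x384
    nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),             # 8. 13x13x384
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),             # 9. 13x13x256
    nn.MaxPool2d(3, stride=2),                                # 10. 6x6x256
    nn.Flatten(),
    nn.Linear(6 * 6 * 256, 4096), nn.ReLU(), nn.Dropout(0.5), # 11. [4096]
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),        # 12. [4096]
    nn.Linear(4096, 1000),                                    # 13. class scores
)  # 14. the softmax is folded into the loss (nn.CrossEntropyLoss)
```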


ZFNet

ZFNet was AlexNet, but with better hyperparameters.

http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf


What makes convolutional neural networks work better?

Some observations from ZFNet are:

• Smaller filters applied at smaller strides appear to help (at least in early layers).
• Having more filters in deeper layers appears to help.

(Figure: feature visualizations from the 3rd, 4th, and 5th layers.)


What about depth?

New architectures that are substantially deeper.

http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf


VGGNet

From the Visual Geometry Group, Dept. of Eng. Sci., Oxford: "Very Deep Convolutional Networks for Large-Scale Image Recognition," Simonyan & Zisserman, arXiv 2014.

"Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers."

Their approach: focus on a small convolutional filter (3x3) and extend the depth.


VGGNet

VGG Net:

Instead of 8 layers (AlexNet), VGGNet increased the network architecture to 16-19 layers.

ARCHITECTURE:

Input -> [CONVx2-POOL]x2 -> [CONVx3-POOL]x3 -> FC x 3 -> Softmax

All CONV filters are uniform: 3x3 with stride 1, pad 1.
All POOL filters are uniform: 2x2 max pool with stride 2.

Reduction from 11.7% to 7.3% top-5 error, approximately a 40% reduction in error rate.


VGGNet

Small filters and depth:

What might be a con of using a small filter, and how does VGGNet address this? (Think receptive fields.)

A single 3x3 filter has a receptive field of only 3, whereas e.g. ZFNet used 7x7 filters in its first layer. But stacking 3x3 convolutions grows the receptive field: two stacked 3x3 layers have an effective receptive field of 5, and three stacked 3x3 layers have an effective receptive field of 7, the same as a single 7x7 layer.
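A quick check of this rule (my own sketch): for stride-1 stacks, each additional k x k layer adds k - 1 to the receptive field.

```python
def receptive_field(kernel_sizes, strides=None):
    """Effective receptive field of a stack of conv layers (stride 1 by default)."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k-1)*jump pixels
        jump *= s
    return rf

print(receptive_field([3]))        # 3
print(receptive_field([3, 3]))     # 5
print(receptive_field([3, 3, 3]))  # 7  -- matches one 7x7 layer
print(receptive_field([7]))        # 7
```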
VGGNet

Small filters and depth:

Which has more parameters? One 7x7 CONV layer or three 3x3 stacked CONV layers? (Drop the biases, and assume C input and C output channels throughout.)

• Three stacked 3x3 layers: 3 x (3x3xC) x C = 27C² parameters.
• One 7x7 layer: (7x7xC) x C = 49C² parameters.

The stack of 3x3 filters has fewer parameters.
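Plugging in an example channel count (mine, purely for illustration):

```python
C = 64  # example channel count (assumed equal in and out)

three_3x3 = 3 * (3 * 3 * C) * C   # 27*C^2, ignoring biases
one_7x7 = (7 * 7 * C) * C         # 49*C^2, ignoring biases
print(three_3x3, one_7x7)         # 110592 200704
```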


VGGNet

Small filters and depth:

Why might stacking many 3x3 filters turn into a good thing?

Because there's more nonlinearity: three ReLUs (one after each 3x3 layer) vs. one ReLU for ZFNet's single 7x7 layer.


VGGNet

Small filters and depth:

What is a potential con of using small filters and more layers?

Memory: with more layers, more intermediate activations (feature maps) must be stored during the forward and backward passes. The next slides quantify this for VGGNet.


VGGNet

(Margin note: for comparison, a single fully connected layer with 100 units on a 32x32x3 image would already need 100 x 3072 weights.)

INPUT [224x224x3]
CONV (64) [224x224x64]   <- 64 filters, each 3x3x3, pad 1
CONV (64) [224x224x64]   <- 64 filters, each 3x3x64, pad 1
POOL [112x112x64]
CONV (128) [112x112x128] <- #ops: (112x112x128)(3x3x64)
CONV (128) [112x112x128]
POOL [56x56x128]
CONV (256) [56x56x256]   <- #ops: (56x56x256)(3x3x128)
CONV (256) [56x56x256]
CONV (256) [56x56x256]
POOL [28x28x256]
CONV (512) [28x28x512]
CONV (512) [28x28x512]
CONV (512) [28x28x512]
POOL [14x14x512]
CONV (512) [14x14x512]
CONV (512) [14x14x512]
CONV (512) [14x14x512]
POOL [7x7x512]
FC [1x1x4096]            <- 4096 x (7*7*512) weights
FC [1x1x4096]
FC [1x1x1000]
VGGNet

Activation sizes per layer:

INPUT [224x224x3] 224*224*3 ~ 150K
CONV (64) [224x224x64] 224*224*64 ~ 3.2M
CONV (64) [224x224x64] 224*224*64 ~ 3.2M
POOL [112x112x64] 112*112*64 ~ 800K
CONV (128) [112x112x128] 112*112*128 ~ 1.6M
CONV (128) [112x112x128] 112*112*128 ~ 1.6M
POOL [56x56x128] 56*56*128 ~ 400K
CONV (256) [56x56x256] 56*56*256 ~ 800K
CONV (256) [56x56x256] 56*56*256 ~ 800K
CONV (256) [56x56x256] 56*56*256 ~ 800K
POOL [28x28x256] 28*28*256 ~ 200K
CONV (512) [28x28x512] 28*28*512 ~ 400K
CONV (512) [28x28x512] 28*28*512 ~ 400K
CONV (512) [28x28x512] 28*28*512 ~ 400K
POOL [14x14x512] 14*14*512 ~ 100K
CONV (512) [14x14x512] 14*14*512 ~ 100K
CONV (512) [14x14x512] 14*14*512 ~ 100K
CONV (512) [14x14x512] 14*14*512 ~ 100K
POOL [7x7x512] 7*7*512 ~ 25K
FC [1x1x4096] 4096
FC [1x1x4096] 4096
FC [1x1x1000] 1000

Every activation is 4 bytes, so the ~24M total activations take 24M x 4 bytes ~ 96 MBytes per forward pass.
VGGNet (ignoring biases)

Activation sizes and parameter counts per layer:

INPUT [224x224x3] 224*224*3 ~ 150K 0


CONV (64) [224x224x64] 224*224*64 ~ 3.2M (3*3*3)*64 = 1,728
CONV (64) [224x224x64] 224*224*64 ~ 3.2M (3*3*64)*64 = 36,864
POOL [112x112x64] 112*112*64 ~ 800K 0
CONV (128) [112x112x128] 112*112*128 ~ 1.6M (3*3*64)*128 = 73,728
CONV (128) [112x112x128] 112*112*128 ~ 1.6M (3*3*128)*128 = 147,456
POOL [56x56x128] 56*56*128 ~ 400K 0
CONV (256) [56x56x256] 56*56*256 ~ 800K (3*3*128)*256 = 294,912
CONV (256) [56x56x256] 56*56*256 ~ 800K (3*3*256)*256 = 589,824
CONV (256) [56x56x256] 56*56*256 ~ 800K (3*3*256)*256 = 589,824
POOL [28x28x256] 28*28*256 ~ 200K 0
CONV (512) [28x28x512] 28*28*512 ~ 400K (3*3*256)*512 = 1,179,648
CONV (512) [28x28x512] 28*28*512 ~ 400K (3*3*512)*512 = 2,359,296
CONV (512) [28x28x512] 28*28*512 ~ 400K (3*3*512)*512 = 2,359,296
POOL [14x14x512] 14*14*512 ~ 100K 0
CONV (512) [14x14x512] 14*14*512 ~ 100K (3*3*512)*512 = 2,359,296
CONV (512) [14x14x512] 14*14*512 ~ 100K (3*3*512)*512 = 2,359,296
CONV (512) [14x14x512] 14*14*512 ~ 100K (3*3*512)*512 = 2,359,296
POOL [7x7x512] 7*7*512 ~ 25K 0
FC [1x1x4096] 4096 7*7*512*4096 = 102,760,448
FC [1x1x4096] 4096 4096*4096 = 16,777,216
FC [1x1x1000] 1000 4096*1000 = 4,096,000

VGGNet total: ~138M params, of which the FC layers account for ~122M.
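These totals can be recomputed from the layer list (a sketch of mine; biases ignored, as in the table):

```python
# VGG-16 conv layers as (in_channels, out_channels); all filters are 3x3.
convs = [(3, 64), (64, 64), (64, 128), (128, 128),
         (128, 256), (256, 256), (256, 256),
         (256, 512), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]
fcs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum(3 * 3 * c_in * c_out for c_in, c_out in convs)
fc_params = sum(n_in * n_out for n_in, n_out in fcs)
print(f"conv: {conv_params:,}  fc: {fc_params:,}  total: {conv_params + fc_params:,}")
# conv: 14,710,464  fc: 123,633,664  total: 138,344,128 -- the FC layers dominate
```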
VGGNet

Some observations:

• Memory: ~24M activations * 4 bytes ~ 96MB for one forward pass.
• Total parameters: 138M parameters.
• A lot of the network parameters are in the fully connected layers.


VGGNet

Number of weight layers in each configuration:

• A: 11
• B: 13
• C: 16
• D: 16
• E: 19

Simonyan et al., arXiv 2014.

Difference between C & D: C had three 1x1 conv layers, where D uses 3x3 conv layers instead.
VGGNet

Other implementation notes about VGGNet:

• Input is a 224x224 RGB image with global mean subtraction.
• They disposed of the local response normalization (LRN) layers from AlexNet, as they found these did not increase performance but consumed more memory & computation.
• Batch size 256, SGD + momentum 0.9.
• L2 penalty of 5e-4.
• Dropout for the first two FC layers.
• Learning rate adjusted as in AlexNet.
• For initialization, they trained a shallower network and then used its weights as the initial weights for the deeper networks.
  • But they later found that Xavier initialization performed comparably.
• Also performed the horizontal flipping, random crops, and RGB shifting that AlexNet and others used.
• Training took 2-3 weeks on a 4-GPU machine.
• Their submission averaged the output of 7 nets.


What about depth?

New architectures that are substantially deeper.

http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf


GoogLeNet

From Szegedy et al., IEEE CVPR 2014.

The main take-home points of GoogLeNet:

• 22 layers (deeper).
• Introduces the "Inception" module, which (1) lets the network pick out the most important features and (2) reduces computational expense.
• Gets rid of fully connected layers.
• Has only 5 million parameters, which is about 12x less than AlexNet and 27x less than VGGNet.
• Also tries to keep the computational budget down:
  • "… so that the [sic] they do not end up to be a purely academic curiosity…"
• Won the ImageNet top 5 challenge (error rate 6.7%).


GoogLeNet

Going deeper requires more parameters, and more computational expense.

Is there a way to address this?

GoogLeNet: the inception module.



GoogLeNet

They leverage an idea called "network-in-network."

Naive inception module: apply several filter sizes in parallel, each extracting different features, and concatenate the results.

(Figure: the naive inception module, with 1x1, 3x3, and 5x5 convolutions and a 3x3 max pool applied in parallel and concatenated depth-wise.)
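A hedged PyTorch sketch of the naive inception module (the branch filter counts below are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class NaiveInception(nn.Module):
    """Parallel 1x1/3x3/5x5 convs plus a 3x3 max pool, concatenated along depth."""
    def __init__(self, c_in, c1, c3, c5):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c1, 1)
        self.conv3 = nn.Conv2d(c_in, c3, 3, padding=1)
        self.conv5 = nn.Conv2d(c_in, c5, 5, padding=2)
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)

    def forward(self, x):
        # Every branch preserves the spatial size, so a depth-wise concat works.
        return torch.cat([self.conv1(x), self.conv3(x),
                          self.conv5(x), self.pool(x)], dim=1)

m = NaiveInception(192, 64, 128, 32)
print(m(torch.zeros(1, 192, 28, 28)).shape)  # [1, 416, 28, 28] = 64+128+32+192
```

Note how the output depth (416 channels here) already exceeds the input depth; this growth is part of the computational expense the module's later refinements address.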
