Lecture 11: CNNs-2
Announcements:
• Last name beginning with K or L: Please take the exam in ROYCE 154.
• Last name beginning with S or T: Please take the exam in ROYCE 156.
• Last name beginning with B or C: Please take the exam in ROYCE 162.
• Everyone else: Please take the exam in HAINES 39, our normal classroom.
• Midterm exam review session: Thursday, Feb 16, 6-9pm at WG Young CS50.
Review problems will be posted by 5pm tonight.
• No lecture or OH this Monday (President's Day holiday).
• My Wednesday OH after the midterm are canceled.
Output size of a conv or pool layer: W_out = (W - F + 2P)/S + 1, where W is
the input width, F the filter width, P the padding, and S the stride. E.g., a
2x2 pool applied at stride 2 halves each spatial dimension.
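As a quick sanity check, a minimal Python sketch of this formula (the helper name `output_size` is illustrative, not from the lecture):

```python
def output_size(w_in, f, stride=1, pad=0):
    """Spatial output size of a conv or pool layer: (W - F + 2P)/S + 1."""
    assert (w_in - f + 2 * pad) % stride == 0, "filter doesn't tile the input evenly"
    return (w_in - f + 2 * pad) // stride + 1

# A 2x2 pool applied at stride 2 halves the spatial size:
print(output_size(28, f=2, stride=2))  # 14
```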
Sizing examples
C1 consists of six 5x5 conv filters applied to the 32x32x1 input at stride 1, pad 0.
Size of output feature maps at C1? 32 - 5 + 1 = 28 => (28x28x6)
Number of parameters in C1 layer? (5x5 + 1) x 6 = 156 params
S2 is a 2x2 pooling layer applied at stride 2.
Size of output feature maps at S2? (14x14x6)
Number of parameters in S2 layer? 0 (pooling layers have no trainable parameters)
C3 contains sixteen 5x5 conv filters.
Size of output feature maps at C3? 14 - 5 + 1 = 10 => (10x10x16)
Number of params in C3? (5x5x6 + 1) x 16 = 2,416 params
The remaining layers are fully connected.
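The C1/S2/C3 numbers can be verified mechanically. A minimal sketch reusing the sizing formula above (variable names are illustrative):

```python
def output_size(w_in, f, stride=1, pad=0):
    return (w_in - f + 2 * pad) // stride + 1

# C1: six 5x5 filters on a 32x32x1 input, stride 1, pad 0
w_c1 = output_size(32, f=5)              # 28 -> feature maps are 28x28x6
params_c1 = (5 * 5 * 1 + 1) * 6          # 156 (25 weights + 1 bias per filter)

# S2: 2x2 pool at stride 2; no trainable parameters
w_s2 = output_size(w_c1, f=2, stride=2)  # 14 -> 14x14x6

# C3: sixteen 5x5 filters over all 6 input maps
w_c3 = output_size(w_s2, f=5)            # 10 -> 10x10x16
params_c3 = (5 * 5 * 6 + 1) * 16         # 2416

print(w_c1, params_c1, w_s2, w_c3, params_c3)  # 28 156 14 10 2416
```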
To help get an intuition behind CNNs, we'll go over a few architectures that
have been influential in recent years.
Case studies:
• LeNet (1998)
• AlexNet (2012)
• VGG (2014)
• GoogLeNet (2014)
• ResNet (2015)
Question: How many connections are there in the first convolutional layer?
Every neuron in C1 has 5x5 + 1 = 26 connections (25 weights + 1 bias).
Number of neurons at C1: 28x28x6 = 4,704.
Total: 26 x 4,704 = 122,304 connections.
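The count is easy to check directly (a sketch mirroring the slide's arithmetic):

```python
connections_per_neuron = 5 * 5 + 1   # each C1 neuron sees a 5x5 patch + bias
neurons_in_c1 = 28 * 28 * 6          # one neuron per output location per map
print(connections_per_neuron * neurons_in_c1)  # 122304
```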
Overall architecture: [figure from http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf]
Input processing:
• ImageNet has variable-sized images.
• Downsample or resize each image; given a rectangular image:
• Rescale so the shorter side is 256 pixels.
• Crop out the central 256 x 256 pixels.
• The actual input to the CNN is 224 x 224 x 3 after data augmentation.
• However, the layer sizing doesn’t quite work out, so we’ll say it’s
227x227x3.
• Subtracted the mean image over the training set from each pixel.
Nonlinearity:
• Used the ReLU. It was faster than sigmoidal or tanh units.
• Data augmentation:
• Image translations and horizontal reflections.
• Extract random 224 x 224 patches and their horizontal reflections.
• At test time, extract 5 patches (the four corners and the center) and their
horizontal reflections, and average the predictions of the 10 softmax
outputs (see the sketch after this list). This averaging reduces the error
rate by ~1.5%.
• Color augmentation: scale the PCs of the colors, capturing different
levels of illumination and intensities.
• Reduces the Top 1 error rate by 1%.
• Dropout with p = 0.5.
• Substantially reduces overfitting; takes twice as long to train.
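A minimal NumPy sketch of this test-time procedure, assuming a `predict` function that maps a 224x224 crop to class probabilities (all names here are illustrative, not AlexNet's actual code):

```python
import numpy as np

def ten_crop_predict(image, predict, crop=224):
    # Average class probabilities over the four corner crops and the center
    # crop, plus their horizontal reflections (10 views total).
    # `image` is HxWxC (e.g., 256x256x3); `predict` maps a crop to a
    # probability vector. Both names are illustrative assumptions.
    h, w, _ = image.shape
    corners = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    probs = []
    for r, c in corners:
        patch = image[r:r + crop, c:c + crop]
        probs.append(predict(patch))
        probs.append(predict(patch[:, ::-1]))  # horizontal reflection
    return np.mean(probs, axis=0)
```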
AlexNet
• Training time: roughly five to six days on two GTX 580 GPUs.
• Top-5 error rate: 15.4%.
• Importance of depth?
• Validation error worsens by ~2% when any middle layer is removed.
Architecture: 8 layers. Input is 227x227x3 (in paper, 224x224x3; numbers were
changed so the operations work out).
Question: The input is 227x227x3. The first convolutional layer has 96 11x11
filters applied at stride 4. What is the output size?
"
#2
- 1
+ 55
=
(55x55x 96)
AlexNet
Number of parameters in the first conv layer: 96 x (11x11x3 + 1) = 34,944.
Question: The output of the first convolutional layer is 55x55x96. The pooling
layer is 3x3 filters applied at stride 2. What is the output size?
(55 - 3)/2 + 1 = 27 => (27x27x96)

Question: How many trainable parameters in the first pooling layer? (Recall,
pool is with 3x3 filters at stride 2.)
0 (pooling layers have no trainable parameters)

Question: The input into the second convolutional layer is 27x27x96. The
layer has 256 5x5 filters at stride 1 with pad 2. What is the output size?
(27 - 5 + 2x2)/1 + 1 = 27 => (27x27x256)
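All three answers follow from the same sizing formula; a minimal sketch (the `output_size` helper is illustrative):

```python
def output_size(w_in, f, stride=1, pad=0):
    return (w_in - f + 2 * pad) // stride + 1

w_conv1 = output_size(227, f=11, stride=4)     # (227-11)/4 + 1 = 55 -> 55x55x96
w_pool1 = output_size(w_conv1, f=3, stride=2)  # (55-3)/2 + 1 = 27 -> 27x27x96
params_pool1 = 0                               # pooling has no trainable parameters
w_conv2 = output_size(w_pool1, f=5, pad=2)     # (27-5+4)/1 + 1 = 27 -> 27x27x256

print(w_conv1, w_pool1, w_conv2)               # 55 27 27
```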
7. [13x13x384] CONV: 384 filters of size 3x3 at stride 1, pad 1.
8. [13x13x384] CONV: 384 filters of size 3x3 at stride 1, pad 1. (ZF Net: 1024 filters)
9. [13x13x256] CONV: 256 filters of size 3x3 at stride 1, pad 1. (ZF Net: 512 filters)

ZF Net
http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
• Smaller filters applied at smaller strides appear to help (at least in early
layers).
• Having more filters in deeper layers appears to help.
http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
From the Visual Geometry Group, Dept. of Eng. Sci., Oxford: "Very Deep
Convolutional Networks for Large-Scale Image Recognition," Simonyan &
Zisserman, arXiv 2014.
Their approach: focus on a small convolutional filter (3x3) and extend the
depth.
VGG Net
What might be a con of using a small filter, and how does VGGNet address
this? (Think receptive fields.)
A 3x3 filter has a small receptive field, compared to, e.g., ZF Net's 7x7
filters. VGGNet addresses this by stacking 3x3 conv layers (stride 1):
the receptive field (RF) is 3 after one 3x3 conv, 5 after two stacked 3x3
convs, and 7 after three, matching a single 7x7 filter.
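For stride-1 convolutions, each additional layer grows the receptive field by F - 1, so the growth is easy to tabulate; a minimal sketch (function name is illustrative):

```python
def stacked_rf(filter_sizes):
    """Receptive field of a stack of stride-1 conv layers: RF = 1 + sum(F - 1)."""
    rf = 1
    for f in filter_sizes:
        rf += f - 1
    return rf

print(stacked_rf([3]))        # 3
print(stacked_rf([3, 3]))     # 5
print(stacked_rf([3, 3, 3]))  # 7, same as a single 7x7 filter
```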
VGGNet

Which has more parameters? One 7x7 CONV layer or three 3x3 stacked
CONV layers? (Drop biases; let the input and output depth of each layer be C.)

Three stacked 3x3 layers: 3 x (3x3xC) x C = 27C^2 weights.
One 7x7 layer (as in ZF Net): (7x7xC) x C = 49C^2 weights.

The stack of three 3x3 layers has fewer parameters.
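In concrete numbers, a minimal sketch (C = 64 is an illustrative channel depth; biases are dropped as on the slide):

```python
def conv_weights(f, c_in, c_out):
    """Weights in one conv layer, ignoring biases."""
    return f * f * c_in * c_out

C = 64                                 # illustrative depth
three_3x3 = 3 * conv_weights(3, C, C)  # 27*C^2 = 110592
one_7x7 = conv_weights(7, C, C)        # 49*C^2 = 200704
print(three_3x3, one_7x7)
```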
Why might stacking many 3x3 filters turn out to be a good thing? The stack
achieves the same receptive field as a larger filter with fewer parameters,
and inserts a nonlinearity after every conv, making the decision function
more discriminative.
For comparison, ZF Net's single 7x7 filter at stride 1 with pad 1 shrinks the
feature map: W_out = W_in - 7 + 2 + 1 = W_in - 4, whereas each 3x3 conv with
pad 1 preserves the spatial size.
INPUT [224x224x3]
CONV (64) [224x224x64] <- 64 filters, each 3x3x3, pad 1
CONV (64) [224x224x64]
POOL [112x112x64]
CONV (128) [112x112x128] <- #ops: (112x112x128) x (3x3x64)
CONV (128) [112x112x128]
POOL [56x56x128]
CONV (256) [56x56x256] <- #ops: (56x56x256) x (3x3x128)
CONV (256) [56x56x256]
CONV (256) [56x56x256]
POOL [28x28x256]
CONV (512) [28x28x512]
CONV (512) [28x28x512]
CONV (512) [28x28x512]
POOL [14x14x512]
CONV (512) [14x14x512]
CONV (512) [14x14x512]
CONV (512) [14x14x512]
POOL [7x7x512]
FC [1x1x4096] <- weights: 4096 x (7x7x512)
FC [1x1x4096]
FC [1x1x1000]
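The totals implied by this listing can be tabulated with a short sketch (the layer table below is hard-coded from the listing above; biases and the backward pass are ignored; all names are illustrative):

```python
# (layer name, output (H, W, C), input depth for convs, or None for input/pool)
layers = [
    ("INPUT",     (224, 224, 3),   None),
    ("CONV3-64",  (224, 224, 64),  3),
    ("CONV3-64",  (224, 224, 64),  64),
    ("POOL",      (112, 112, 64),  None),
    ("CONV3-128", (112, 112, 128), 64),
    ("CONV3-128", (112, 112, 128), 128),
    ("POOL",      (56, 56, 128),   None),
    ("CONV3-256", (56, 56, 256),   128),
    ("CONV3-256", (56, 56, 256),   256),
    ("CONV3-256", (56, 56, 256),   256),
    ("POOL",      (28, 28, 256),   None),
    ("CONV3-512", (28, 28, 512),   256),
    ("CONV3-512", (28, 28, 512),   512),
    ("CONV3-512", (28, 28, 512),   512),
    ("POOL",      (14, 14, 512),   None),
    ("CONV3-512", (14, 14, 512),   512),
    ("CONV3-512", (14, 14, 512),   512),
    ("CONV3-512", (14, 14, 512),   512),
    ("POOL",      (7, 7, 512),     None),
]
acts, params = 0, 0
for name, (h, w, c), c_in in layers:
    acts += h * w * c                          # activations stored per image
    params += 3 * 3 * c_in * c if c_in else 0  # 3x3 conv weights
for n_in, n_out in [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]:
    acts += n_out                              # FC layer outputs
    params += n_in * n_out                     # FC weights
print(f"activations: ~{acts * 4 / 1e6:.0f} MB per image (float32, forward only)")
print(f"weights: ~{params / 1e6:.0f}M parameters")  # ~138M for VGG-16
```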
VGGNet

[Slide table: per-layer memory in MBytes and parameter counts, ignoring biases.]
Some observations:
Number of layers
A - 11
B - 13
C - 16
D - 16
E - 19
http://kaiminghe.com/icml16tutorial/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf