EfficientNet Tutorial

Review: Inception-v3 — 1st Runner Up (Image Classification) in ILSVRC 2015

https://sh-tsang.medium.com/review-inception-v3-1st-runner-up-image-classification-in-ilsvrc-2015-17915421f77c

In this story, Inception-v3 [1] is reviewed. By rethinking the Inception architecture, higher computational efficiency and fewer parameters are realized. With fewer parameters, a 42-layer deep network, with complexity similar to VGGNet, can be achieved.

AlexNet [2]: 60 million parameters
VGGNet [3]: 3× more parameters than AlexNet
GoogLeNet / Inception-v1 [4]: 7 million parameters

With 42 layers, a lower error rate is obtained, making the network the 1st Runner Up for image classification in ILSVRC (ImageNet Large Scale Visual Recognition Competition) 2015. It is a 2016 CVPR paper with about 2000 citations at the time this story was written. (Sik-Ho Tsang @ Medium)

ImageNet is a dataset of over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images and 100,000 testing images.
About The Inception Versions
There are 4 versions. The first GoogLeNet is Inception-v1 [4], but there are numerous typos in Inception-v3 [1] which lead to wrong descriptions of the Inception versions. These may be due to the intense ILSVRC competition at that moment. Consequently, many reviews on the internet mix up v2 and v3, and some even treat v2 and v3 as the same network with only minor differences in settings.

Nevertheless, in Inception-v4 [5], Google gives a much clearer description of the version issue:

“The Inception deep convolutional architecture was introduced as GoogLeNet in (Szegedy et al.
2015a), here named Inception-v1. Later the Inception architecture was refined in various ways,
first by the introduction of batch normalization (Ioffe and Szegedy 2015) (Inception-v2). Later
by additional factorization ideas in the third iteration (Szegedy et al. 2015b) which will be
referred to as Inception-v3 in this report.”

Thus, BN-Inception / Inception-v2 [6] is about batch normalization, while Inception-v3 [1] is about the factorization ideas.

What is covered:


1. Factorizing Convolutions

2. Auxiliary Classifiers

3. Efficient Grid Size Reduction

4. Inception-v3 Architecture

5. Label Smoothing As Regularization

6. Ablation Study

7. Comparison with State-of-the-art Approaches

1. Factorizing Convolutions
The aim of factorizing convolutions is to reduce the number of connections/parameters without decreasing network efficiency.
1.1. Factorization Into Smaller Convolutions

Two 3×3 convolutions replace one 5×5 convolution as follows:

By using 1 layer of 5×5 filters, the number of parameters = 5×5 = 25
By using 2 layers of 3×3 filters, the number of parameters = 3×3 + 3×3 = 18
The number of parameters is reduced by 28%

A similar technique was already mentioned in VGGNet [3].
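
To make the counting concrete, here is a minimal Keras sketch of the replacement (this is not the paper's code; the filter count and the input tensor are placeholders):

from tensorflow.keras import layers

def conv5x5(x, filters):
    # one 5x5 convolution: 5*5 = 25 weights per input/output channel pair
    return layers.Conv2D(filters, (5, 5), padding="same", activation="relu")(x)

def two_conv3x3(x, filters):
    # two stacked 3x3 convolutions cover the same 5x5 receptive field
    # with 3*3 + 3*3 = 18 weights per channel pair, about 28% fewer
    x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    return layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)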

With this technique, one of the new Inception modules (I call it Inception Module A here)
becomes:
1.2. Factorization Into Asymmetric Convolutions

One 3×1 convolution followed by one 1×3 convolution replaces one 3×3 convolution as follows:

One 3×1 convolution followed by one 1×3 convolution replaces one 3×3 convolution
By using a 3×3 filter, the number of parameters = 3×3 = 9
By using 3×1 and 1×3 filters, the number of parameters = 3×1 + 1×3 = 6
The number of parameters is reduced by 33%

You may ask: why don't we use two 2×2 filters to replace one 3×3 filter?

If we use two 2×2 filters, the number of parameters = 2×2 + 2×2 = 8
The number of parameters is only reduced by 11%
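
As a sketch in the same spirit (again not the paper's code; filter counts are placeholders), the asymmetric factorization looks like this in Keras:

from tensorflow.keras import layers

def asymmetric_3x3(x, filters):
    # a 3x3 convolution replaced by a 3x1 followed by a 1x3 convolution,
    # as used inside Inception Modules B and C
    x = layers.Conv2D(filters, (3, 1), padding="same", activation="relu")(x)
    return layers.Conv2D(filters, (1, 3), padding="same", activation="relu")(x)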

With this technique, one of the new Inception modules (I call it Inception Module B here)
becomes:

Inception Module B using asymmetric factorization

Inception Module C is also proposed, for promoting high-dimensional representations according to the authors' description, as follows:
Inception Module C using asymmetric factorization

Thus, the authors suggest these 3 kinds of Inception modules. With factorization, the number of parameters is reduced for the whole network, the network is less likely to overfit, and consequently it can go deeper!

2. Auxiliary Classifier
Auxiliary Classifiers were already suggested in GoogLeNet / Inception-v1 [4]. There are some
modifications in Inception-v3.

Only 1 auxiliary classifier is used, on top of the last 17×17 layer, instead of the 2 auxiliary classifiers used before. (The overall architecture is shown later.)
The purpose is also different. In GoogLeNet / Inception-v1 [4], auxiliary classifiers are used for training a deeper network. In Inception-v3, the auxiliary classifier is used as a regularizer. So, actually, in deep learning, the modules are still quite intuitive.

Batch normalization, suggested in Inception-v2 [6], is also used in the auxiliary classifier.
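
For reference, a hedged sketch of such an auxiliary head in Keras is shown below; the pooling and filter sizes are illustrative assumptions, loosely following common implementations rather than the paper's exact specification:

from tensorflow.keras import layers

def aux_classifier(x, num_classes=1000):
    # x is the last 17x17 feature map
    x = layers.AveragePooling2D(pool_size=(5, 5), strides=3)(x)   # 17x17 -> 5x5
    x = layers.Conv2D(128, (1, 1), padding="same")(x)
    x = layers.BatchNormalization()(x)                            # BN is also used in the auxiliary head
    x = layers.Activation("relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    return layers.Dense(num_classes, activation="softmax")(x)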

3. Efficient Grid Size Reduction


Conventionally, as in AlexNet and VGGNet, feature map downsizing is done by max pooling. But the drawback is that the reduction is either too greedy (max pooling followed by a conv layer) or too expensive (a conv layer followed by max pooling). Here, an efficient grid size reduction is proposed as follows:
Conventional downsizing (Top Left), Efficient Grid Size Reduction (Bottom Left), Detailed
Architecture of Efficient Grid Size Reduction (Right)

With the efficient grid size reduction, 320 feature maps are produced by a conv layer with stride 2, and another 320 feature maps are obtained by max pooling. These 2 sets of feature maps are concatenated into 640 feature maps and go to the next Inception module.

A less expensive but still efficient network is achieved by this efficient grid size reduction.
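
A minimal Keras sketch of this idea (the 320-filter count comes from the text; padding and kernel sizes are illustrative assumptions, and the pooling branch keeps however many channels its input has):

from tensorflow.keras import layers

def grid_reduction(x):
    # stride-2 convolution branch and stride-2 max-pooling branch in parallel
    conv = layers.Conv2D(320, (3, 3), strides=2, padding="same", activation="relu")(x)
    pool = layers.MaxPooling2D((3, 3), strides=2, padding="same")(x)
    # concatenate along the channel axis, e.g. 320 + 320 = 640 feature maps
    return layers.Concatenate()([conv, pool])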

4. Inception-v3 Architecture
There are some typos in the architecture description (both the passage and the table) within the paper. I believe this is due to the intense ILSVRC competition in 2015. I therefore looked into the code to work out the architecture:
Inception-v3 Architecture (Batch Norm and ReLU are used after Conv)

At 42 layers deep, the computation cost is only about 2.5× higher than that of GoogLeNet [4], and the network is much more efficient than VGGNet [3].

The links I use for reference about the architecture:

PyTorch version of Inception-v3:


https://github.com/pytorch/vision/blob/master/torchvision/models/inception.py

Inception-v3 on Google Cloud


https://cloud.google.com/tpu/docs/inception-v3-advanced

5. Label Smoothing As Regularization


The purpose of label smoothing is to prevent the largest logit from becoming much larger than
all others:

new_labels = (1 - ε) * one_hot_labels + ε / K

where ε = 0.1 is a hyperparameter and K = 1000 is the number of classes. A kind of dropout effect is observed in the classifier layer.
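
A small sketch of how this can be computed, or applied directly through the Keras loss (the example label index is arbitrary):

import tensorflow as tf

K = 1000   # number of classes
eps = 0.1  # smoothing factor

one_hot_labels = tf.one_hot([42], depth=K)            # an example hard label
new_labels = (1.0 - eps) * one_hot_labels + eps / K   # smoothed target

# Equivalently, Keras can apply the smoothing inside the loss:
loss = tf.keras.losses.CategoricalCrossentropy(label_smoothing=eps)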

6. Ablation Study
Ablation Study (single-model single-crop)

Using single-model single-crop, we can see the top-1 error rate improves as the proposed techniques are added on top of each other:

Inception-v1: 29%
Inception-v2: 25.2%

Inception-v3: 23.4%
+ RMSProp: 23.1%
+ Label Smoothing: 22.8%
+ 7×7 Factorization: 21.6%
+ Auxiliary Classifier: 21.2% (With top-5 error rate of 5.6%)

where 7×7 Factorization means factorizing the first 7×7 conv layer into three 3×3 conv layers.

7. Comparison with State-of-the-art Approaches
With single-model multi-crop, Inception-v3 with 144 crops obtains a top-5 error rate of 4.2%, which outperforms PReLU-Net and Inception-v2, both published in 2015.

Multi-Model Multi-Crop Results

With multi-model multi-crop, using 144 crops and 4 models ensembled, Inception-v3 obtains a top-5 error rate of 3.58% and finally became the 1st Runner Up (image classification) in ILSVRC 2015, while the winner is ResNet [7], which will be reviewed later. Of course, Inception-v4 [5] will also be reviewed later on as well.


https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d

Understanding and Coding a ResNet in Keras

Doing cool things with data!
https://towardsdatascience.com/understanding-and-coding-a-resnet-in-keras-446d7ff84d33

ResNet, short for Residual Network, is a classic neural network used as a backbone for many computer vision tasks. This model was the winner of the ImageNet challenge in 2015. The fundamental breakthrough with ResNet was that it allowed us to train extremely deep neural networks with 150+ layers successfully. Prior to ResNet, training very deep neural networks was difficult due to the problem of vanishing gradients.

AlexNet, the winner of ImageNet 2012 and the model that apparently kick-started the focus on deep learning, had only 8 layers, the VGG network had 19, Inception / GoogLeNet had 22 layers, and ResNet-152 had 152 layers. In this blog we will code a ResNet-50, which is a smaller version of ResNet-152 and is frequently used as a starting point for transfer learning.

Revolution of Depth

However, increasing network depth does not work by simply stacking layers together. Deep
networks are hard to train because of the notorious vanishing gradient problem — as the gradient
is back-propagated to earlier layers, repeated multiplication may make the gradient extremely
small. As a result, as the network goes deeper, its performance gets saturated or even starts
degrading rapidly.

I learnt about coding ResNets from the DeepLearning.AI course by Andrew Ng. I highly recommend this course.
On my Github repo, I have shared two notebooks: one that codes ResNet from scratch as explained in DeepLearning.AI, and one that uses the pretrained model in Keras. I hope you pull the code and try it for yourself.

Skip Connection — The Strength of ResNet

ResNet first introduced the concept of skip connection. The diagram below illustrates skip
connection. The figure on the left is stacking convolution layers together one after the other. On
the right we still stack convolution layers as before but we now also add the original input to the
output of the convolution block. This is called skip connection

Skip Connection Image from DeepLearning.AI

It can be written as two lines of code:

X_shortcut = X  # Store the initial value of X in a variable
## Perform convolution + batch norm operations on X
X = Add()([X, X_shortcut])  # SKIP Connection

The coding is quite simple, but there is one important consideration: since X and X_shortcut above are two matrices, you can add them only if they have the same shape. So if the convolution + batch norm operations are done in a way that the output shape stays the same, then we can simply add them as shown below.

When x and x_shortcut are the same shape


Otherwise, x_shortcut goes through a convolution layer chosen such that its output has the same dimensions as the output of the convolution block, as shown below:

X_shortcut goes through convolution block

In the notebook on Github, the two functions identity_block and convolution_block implement the above. These functions use Keras to implement convolution and batch norm layers with ReLU activation. The skip connection is technically the one line X = Add()([X, X_shortcut]).

One important thing to note here is that the skip connection is applied before the final ReLU activation, as shown in the diagram above. Research has found that this gives the best results.
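
As a rough guide, an identity block with this structure can be sketched as follows; the filter sizes are illustrative and not copied from the repo, and the last convolution must produce the same number of channels as the shortcut for the addition to work:

from tensorflow.keras import layers

def identity_block(X, filters):
    f1, f2, f3 = filters          # f3 must equal the channel count of X
    X_shortcut = X                # store the input for the skip connection

    X = layers.Conv2D(f1, (1, 1))(X)
    X = layers.BatchNormalization()(X)
    X = layers.Activation("relu")(X)

    X = layers.Conv2D(f2, (3, 3), padding="same")(X)
    X = layers.BatchNormalization()(X)
    X = layers.Activation("relu")(X)

    X = layers.Conv2D(f3, (1, 1))(X)
    X = layers.BatchNormalization()(X)

    X = layers.Add()([X, X_shortcut])   # skip connection, added before the final ReLU
    return layers.Activation("relu")(X)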

Why do Skip Connections work?

This is an interesting question. I think there are two reasons why Skip connections work here:

1. They mitigate the problem of vanishing gradients by providing an alternate shortcut path for the gradient to flow through.
2. They allow the model to learn an identity function, which ensures that a higher layer will perform at least as well as a lower layer, and not worse.

In fact, since ResNet, skip connections have been used in many more model architectures, like the Fully Convolutional Network (FCN) and U-Net. They are used to flow information from earlier layers in the model to later layers. In these architectures, they pass information from the downsampling layers to the upsampling layers.

Testing the ResNet model we built

The identity and convolution blocks coded in the notebook are then combined to create a
ResNet-50 model with the architecture shown below:
ResNet-50 Model

The ResNet-50 model consists of 5 stages each with a convolution and Identity block. Each
convolution block has 3 convolution layers and each identity block also has 3 convolution layers.
The ResNet-50 has over 23 million trainable parameters.

I have tested this model on the signs data set, which is also included in my Github repo. This data set has hand images corresponding to 6 classes. We have 1080 training images and 120 test images.

Signs Data Set

Our ResNet-50 gets to 86% test accuracy in 25 epochs of training. Not bad!

Building ResNet in Keras using pretrained library

I loved coding the ResNet model myself since it allowed me a better understanding of a network
that I frequently use in many transfer learning tasks related to image classification, object
localization, segmentation etc.
However, for more regular use it is faster to use the pretrained ResNet-50 in Keras. Keras has many of these backbone models, with their ImageNet weights, available in its library.

I have uploaded a notebook on my Github that uses Keras to load the pretrained ResNet-50. You can load the model with 1 line of code:

base_model = applications.resnet50.ResNet50(weights=None, include_top=False,
                                            input_shape=(img_height, img_width, 3))

Here weights=None since I want to initialize the model with random weights, as I did for the ResNet-50 I coded. Otherwise I could also load the pretrained ImageNet weights. I set include_top=False to exclude the final pooling and fully connected layer of the original model. I added Global Average Pooling and a dense output layer to the ResNet-50 model.

x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dropout(0.7)(x)
predictions = Dense(num_classes, activation= 'softmax')(x)
model = Model(inputs = base_model.input, outputs = predictions)

As shown above, Keras provides a very convenient interface to load pretrained models, but it is important to code the ResNet yourself as well at least once so you understand the concept and can maybe apply this learning to another new architecture you are creating.

The Keras ResNet reached an accuracy of 75% after training for 100 epochs with the Adam optimizer and a learning rate of 0.0001. The accuracy is a bit lower than that of our own coded model, and I guess this has to do with the weight initializations.

Keras also provides an easy interface for data augmentation so if you get a chance, try
augmenting this data set and see if that results in better performance.
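
If you try it, a hedged sketch could look like the following; X_train and Y_train here stand in for the signs arrays loaded in the notebook, the augmentation parameters are illustrative, and depending on your Keras version you may need fit_generator instead of fit:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=False,  # hand signs are not left/right symmetric
)
model.fit(datagen.flow(X_train, Y_train, batch_size=32), epochs=25)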

Conclusion

 ResNet is a powerful backbone model that is used very frequently in many computer
vision tasks

 ResNet uses skip connections to add the output from an earlier layer to a later layer. This
helps it mitigate the vanishing gradient problem.

 You can use Keras to load its pretrained ResNet-50 or use the code I have shared to
code ResNet yourself.

I have my own deep learning consultancy and love to work on interesting problems. I have
helped many startups deploy innovative AI based solutions. Check us out at —
http://deeplearninganalytics.org/.

You can also see my other writings at: https://medium.com/@priya.dwivedi


If you have a project that we can collaborate on, then please contact me through my website or at
[email protected]

References

 DeepLearning.AI

 Keras

 ResNet Paper

How to do Transfer Learning with EfficientNet


https://www.dlology.com/blog/transfer-learning-with-efficientnet/

# Options: EfficientNetB0, EfficientNetB1, EfficientNetB2, EfficientNetB3


# The higher the number, the more complex the model.
from efficientnet import EfficientNetB0 as Net
from efficientnet import center_crop_and_resize, preprocess_input

# loading the pretrained conv base model
conv_base = Net(weights="imagenet", include_top=False, input_shape=input_shape)

from tensorflow.keras import models
from tensorflow.keras import layers

dropout_rate = 0.2
model = models.Sequential()
model.add(conv_base)
model.add(layers.GlobalMaxPooling2D(name="gap"))
# model.add(layers.Flatten(name="flatten"))
if dropout_rate > 0:
    model.add(layers.Dropout(dropout_rate, name="dropout_out"))
# model.add(layers.Dense(256, activation='relu', name="fc1"))
model.add(layers.Dense(2, activation="softmax", name="fc_out"))
conv_base.trainable = False

!wget https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
!unzip -qq kagglecatsanddogs_3367a.zip -d dog_vs_cat

total training cat images: 1000


total training dog images: 1000
total validation cat images: 500
total validation dog images: 500
total test cat images: 500
total test dog images: 500
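
To arrive at the small subset above, the downloaded images need to be split into class subfolders that the generators below can read. A hypothetical sketch (the PetImages layout of the unzipped archive and the directory names are assumptions, and a few files in the archive are known to be unreadable and may need to be skipped):

import os, shutil

src = "dog_vs_cat/PetImages"
base = "cats_and_dogs_small"
splits = {"train": range(0, 1000), "validation": range(1000, 1500), "test": range(1500, 2000)}

for split, indices in splits.items():
    for cls in ("Cat", "Dog"):
        dst = os.path.join(base, split, cls.lower() + "s")
        os.makedirs(dst, exist_ok=True)
        for i in indices:
            shutil.copyfile(os.path.join(src, cls, "%d.jpg" % i),
                            os.path.join(dst, "%s.%d.jpg" % (cls.lower(), i)))

train_dir = os.path.join(base, "train")
validation_dir = os.path.join(base, "validation")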

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode="nearest",
)

# Note that the validation data should not be augmented!


test_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_generator = train_datagen.flow_from_directory(
    # This is the target directory
    train_dir,
    # All images will be resized to target height and width.
    target_size=(height, width),
    batch_size=batch_size,
    # Since we use categorical_crossentropy loss, we need categorical labels
    class_mode="categorical",
)

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(height, width),
    batch_size=batch_size,
    class_mode="categorical",
)
from tensorflow.keras import optimizers

model.compile(
    loss="categorical_crossentropy",
    optimizer=optimizers.RMSprop(lr=2e-5),
    metrics=["acc"],
)
history = model.fit_generator(
    train_generator,
    steps_per_epoch=NUM_TRAIN // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=NUM_TEST // batch_size,
    verbose=1,
    use_multiprocessing=True,
    workers=4,
)

Another technique to make the model representation more relevant for the problem at hand is called fine-tuning. It is based on the following intuition.

Earlier layers in the convolutional base encode more generic, reusable features, while layers
higher up encode more specialized features.

The steps for fine-tuning a network are as follows:

 1) Add your custom network on top of an already trained base network.
 2) Freeze the base network.
 3) Train the part you added.
 4) Unfreeze some layers in the base network.
 5) Jointly train both these layers and the part you added.

We have already done the first three steps. To find out which layers to unfreeze, it is helpful to plot the Keras model.

from tensorflow.keras.utils import plot_model


plot_model(conv_base, to_file='conv_base.png', show_shapes=True)
from IPython.display import Image
Image(filename='conv_base.png')
conv_base.trainable = True

set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'multiply_16':
        set_trainable = True
    if set_trainable:
        layer.trainable = True
    else:
        layer.trainable = False
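
After unfreezing, the model has to be recompiled for the change to take effect, and then training continues, typically with a smaller learning rate. A sketch along the lines of the earlier training call (the learning rate here is an illustrative choice, not taken from the tutorial):

from tensorflow.keras import optimizers

model.compile(
    loss="categorical_crossentropy",
    optimizer=optimizers.RMSprop(lr=1e-5),  # smaller LR to avoid destroying the pretrained features
    metrics=["acc"],
)

history = model.fit_generator(
    train_generator,
    steps_per_epoch=NUM_TRAIN // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=NUM_TEST // batch_size,
    verbose=1,
)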
