EfficientNet Tutorial
With 42 layers, a lower error rate is obtained, which made it the 1st Runner Up for image classification in ILSVRC (ImageNet Large Scale Visual Recognition Competition) 2015. It is a 2016 CVPR paper with about 2000 citations at the time of writing. (Sik-Ho Tsang @ Medium)
ImageNet is a dataset of over 15 million labeled high-resolution images with around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 100,000 testing images.
About The Inception Versions
There are 4 versions. The first GoogLeNet is Inception-v1 [4], but there are numerous typos in Inception-v3 [1] which lead to wrong descriptions of the Inception versions. These may be due to the intense ILSVRC competition at that moment. Consequently, many reviews on the internet mix up v2 and v3. Some reviews even think that v2 and v3 are the same with only some minor differences in settings.
Nevertheless, in Inception-v4 [5], Google gives a much clearer description of the version issue:
“The Inception deep convolutional architecture was introduced as GoogLeNet in (Szegedy et al.
2015a), here named Inception-v1. Later the Inception architecture was refined in various ways,
first by the introduction of batch normalization (Ioffe and Szegedy 2015) (Inception-v2). Later
by additional factorization ideas in the third iteration (Szegedy et al. 2015b) which will be
referred to as Inception-v3 in this report.”
Thus, BN-Inception / Inception-v2 [6] is about batch normalization, while Inception-v3 [1] is about the factorization ideas.
1. Factorizing Convolutions
2. Auxiliary Classifier
3. Efficient Grid Size Reduction
4. Inception-v3 Architecture
5. Label Smoothing
6. Ablation Study
1. Factorizing Convolutions
The aim of factorizing convolutions is to reduce the number of connections/parameters without decreasing the network's effectiveness.
1.1. Factorization Into Smaller Convolutions
Two 3×3 convolutions replace one 5×5 convolution, reducing the number of parameters from 5×5 = 25 to 2×(3×3) = 18 (a 28% reduction) while covering the same receptive field.
With this technique, one of the new Inception modules (I call it Inception Module A here) becomes:
1.2. Factorization Into Asymmetric Convolutions
One 3×1 convolution followed by one 1×3 convolution replaces one 3×3 convolution as follows:
One 3×1 convolution followed by one 1×3 convolution replaces one 3×3 convolution
Using a 3×3 filter, the number of parameters = 3×3 = 9
Using 3×1 and 1×3 filters, the number of parameters = 3×1 + 1×3 = 6
The number of parameters is reduced by 33%.
You may ask why we don't use two 2×2 filters to replace one 3×3 filter. Two 2×2 filters would need 2×2 + 2×2 = 8 parameters, only an 11% reduction, so the asymmetric factorization is the more economical choice.
With this technique, one of the new Inception modules (I call it Inception Module B here)
becomes:
Inception Module C is also proposed for promoting high-dimensional representations, according to the authors' description, as follows:
Inception Module C using asymmetric factorization
Thus, the authors suggest these 3 kinds of Inception modules. With factorization, the number of parameters is reduced for the whole network, the network is less likely to overfit, and consequently it can go deeper!
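To make the idea concrete, here is a minimal Keras-style sketch (my own simplification, not the exact paper layout) of a branch that applies the 3×1 followed by 1×3 factorization used in Module B:

from tensorflow.keras import layers

def asymmetric_branch(x, filters):
    # 1x1 bottleneck followed by the factorized pair that replaces a single 3x3 convolution
    x = layers.Conv2D(filters, (1, 1), padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, (3, 1), padding="same", activation="relu")(x)  # 3x1
    x = layers.Conv2D(filters, (1, 3), padding="same", activation="relu")(x)  # 1x3
    return x

In a full Inception module, several such branches would be computed in parallel and concatenated along the channel axis.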
2. Auxiliary Classifier
Auxiliary classifiers were already suggested in GoogLeNet / Inception-v1 [4]. There are some modifications in Inception-v3.
Only 1 auxiliary classifier is used, on top of the last 17×17 layer, instead of 2 auxiliary classifiers. (The overall architecture is shown later.)
The purpose is also different. In GoogLeNet / Inception-v1 [4], auxiliary classifiers were used to help train a deeper network. In Inception-v3, the auxiliary classifier is used as a regularizer. So, actually, in deep learning, the modules are still quite intuitive.
Batch normalization, suggested in Inception-v2 [6], is also used in the auxiliary classifier.
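As a rough sketch only (my own approximation; the pooling size, 128-filter 1×1 conv, and 1024-unit dense layer are illustrative assumptions, not the exact paper code), an auxiliary classifier head attached to a 17×17 feature map could look like:

from tensorflow.keras import layers

def aux_classifier(x, num_classes=1000):
    # Pool the 17x17 feature map down, then a small conv + batch norm head and a softmax output
    x = layers.AveragePooling2D(pool_size=5, strides=3)(x)
    x = layers.Conv2D(128, 1, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)
    return layers.Dense(num_classes, activation="softmax")(x)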
3. Efficient Grid Size Reduction
With the efficient grid size reduction, 320 feature maps are produced by a convolution with stride 2, and another 320 feature maps are obtained by max pooling. These 2 sets of feature maps are concatenated into 640 feature maps and go to the next level of the Inception module.
A less expensive and still effective network is achieved by this efficient grid size reduction.
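A minimal Keras-style sketch of this idea (the kernel sizes and padding here are assumptions; the point is the concatenation of a stride-2 conv branch with a stride-2 pooling branch):

from tensorflow.keras import layers

def grid_reduction(x, conv_filters=320):
    # Stride-2 convolution branch and stride-2 max-pooling branch, concatenated along channels
    conv_branch = layers.Conv2D(conv_filters, 3, strides=2, padding="same", activation="relu")(x)
    pool_branch = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
    return layers.Concatenate()([conv_branch, pool_branch])  # 320 + 320 = 640 feature maps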
4. Inception-v3 Architecture
There are some typos in the architecture description and table within the paper. I believe this is due to the intense ILSVRC competition in 2015. I therefore looked into the code to work out the architecture:
Inception-v3 Architecture (Batch Norm and ReLU are used after Conv)
With 42 layers, the computation cost is only about 2.5× higher than that of GoogLeNet [4], and it is much more efficient than VGGNet [3].
5. Label Smoothing
Label smoothing replaces the one-hot ground-truth labels with a softened distribution:
new_labels = (1 − ε) * one_hot_labels + ε / K
where ε = 0.1 is a hyperparameter and K = 1000 is the number of classes. A kind of dropout effect is observed in the classifier layer.
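A tiny numeric sketch of this formula (illustrative only; here K = 3 for readability, whereas the paper uses K = 1000 and ε = 0.1):

import numpy as np

def smooth_labels(one_hot_labels, epsilon=0.1):
    # new_labels = (1 - eps) * one_hot_labels + eps / K
    num_classes = one_hot_labels.shape[-1]
    return (1.0 - epsilon) * one_hot_labels + epsilon / num_classes

print(smooth_labels(np.array([[0.0, 1.0, 0.0]])))  # approx [[0.033, 0.933, 0.033]]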
6. Ablation Study
Ablation Study (single-model single-crop)
Using single-model single-crop results, we can see that the top-1 error rate improves as the proposed techniques are added on top of each other:
Inception-v1: 29%
Inception-v2: 25.2%
Inception-v3: 23.4%
+ RMSProp: 23.1%
+ Label Smoothing: 22.8%
+ 7×7 Factorization: 21.6%
+ Auxiliary Classifier: 21.2% (With top-5 error rate of 5.6%)
where 7×7 Factorization means factorizing the first 7×7 conv layer into three 3×3 conv layers.
With multi-model multi-crop, Inception-v3 with 144 crops and 4 models ensembled, the top-5
error rate of 3.58% is obtained, and finally obtained 1st Runner Up (image classification) in
ILSVRC 2015, while the winner is ResNet [7] which will be reviewed later. Of course,
Inception-v4 [5] will also be reviewed later on as well.
https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d
ResNet, short for Residual Network, is a classic neural network used as a backbone for many computer vision tasks. This model was the winner of the ImageNet challenge in 2015. The fundamental breakthrough with ResNet was that it allowed us to train extremely deep neural networks with 150+ layers successfully. Prior to ResNet, training very deep neural networks was difficult due to the problem of vanishing gradients.
AlexNet, the winner of ImageNet 2012 and the model that arguably kick-started the focus on deep learning, had only 8 layers, the VGG network had 19, Inception or GoogLeNet had 22 layers, and ResNet-152 had 152 layers. In this blog we will code a ResNet-50, which is a smaller version of ResNet-152 and frequently used as a starting point for transfer learning.
Revolution of Depth
However, increasing network depth does not work by simply stacking layers together. Deep
networks are hard to train because of the notorious vanishing gradient problem — as the gradient
is back-propagated to earlier layers, repeated multiplication may make the gradient extremely
small. As a result, as the network goes deeper, its performance gets saturated or even starts
degrading rapidly.
I learnt about coding ResNets from the DeepLearning.AI course by Andrew Ng. I highly recommend this course.
On my Github repo, I have shared two notebooks: one that codes ResNet from scratch as explained in DeepLearning.AI, and another that uses the pretrained model in Keras. I hope you pull the code and try it for yourself.
ResNet first introduced the concept of the skip connection. The diagram below illustrates a skip connection. The figure on the left stacks convolution layers one after the other. On the right we still stack convolution layers as before, but we now also add the original input to the output of the convolution block. This is called a skip connection.
The coding is quite simple, but there is one important consideration: since X and X_shortcut above are two matrices, you can add them only if they have the same shape. So if the convolution + batch norm operations are done in a way that the output shape stays the same, then we can simply add them as shown below.
In the notebook on Github, the two functions identity_block and convolution_block implement the above. These functions use Keras to implement convolution and batch norm layers with ReLU activation. The skip connection is technically the one line X = Add()([X, X_shortcut]).
One important thing to note here is that the skip connection is applied before the ReLU activation, as shown in the diagram above. Research has found that this gives the best results.
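As a minimal sketch of that idea (a simplified two-stage block, not the exact notebook code, which uses three conv stages; the notebook's convolution_block additionally handles shape mismatches, typically with a 1×1 convolution on the shortcut path):

from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, Add

def simple_identity_block(X, filters, kernel_size=3):
    # Assumes the input already has `filters` channels so the addition shapes match
    X_shortcut = X  # save the input for the skip connection

    X = Conv2D(filters, kernel_size, padding="same")(X)
    X = BatchNormalization()(X)
    X = Activation("relu")(X)

    X = Conv2D(filters, kernel_size, padding="same")(X)
    X = BatchNormalization()(X)

    X = Add()([X, X_shortcut])   # the skip connection
    X = Activation("relu")(X)    # ReLU applied after the addition
    return X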
Why do skip connections work? I think there are two reasons:
1. They mitigate the problem of vanishing gradients by allowing an alternate shortcut path for the gradient to flow through.
2. They allow the model to learn an identity function, which ensures that a higher layer will perform at least as well as a lower layer, and not worse.
In fact, since ResNet, skip connections have been used in many more model architectures, like the Fully Convolutional Network (FCN) and U-Net. They are used to pass information from earlier layers in the model to later layers. In these architectures they pass information from the downsampling layers to the upsampling layers.
The identity and convolution blocks coded in the notebook are then combined to create a
ResNet-50 model with the architecture shown below:
ResNet-50 Model
The ResNet-50 model consists of 5 stages each with a convolution and Identity block. Each
convolution block has 3 convolution layers and each identity block also has 3 convolution layers.
The ResNet-50 has over 23 million trainable parameters.
I have tested this model on the signs data set which is also included in my Github repo. This data
set has hand images corresponding to 6 classes. We have 1080 train images and 120 test images.
Our ResNet-50 gets to 86% test accuracy in 25 epochs of training. Not bad!
I loved coding the ResNet model myself since it gave me a better understanding of a network that I frequently use in many transfer learning tasks related to image classification, object localization, segmentation, etc.
However, for more regular use it is faster to use the pretrained ResNet-50 in Keras. Keras has many of these backbone models with their ImageNet weights available in its library.
I have uploaded a notebook on my Github that uses Keras to load the pretrained ResNet-50. You can load the model with 1 line of code:
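The original line is not reproduced in this excerpt; based on the parameters described below, a minimal version (with an assumed input shape) would be:

from tensorflow.keras.applications import ResNet50

# weights=None -> random initialization; include_top=False -> drop the original pooling + FC head
base_model = ResNet50(weights=None, include_top=False, input_shape=(64, 64, 3))  # input shape assumed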
Here weights=None since I want to initialize the model with random weights, as I did with the ResNet-50 I coded. Otherwise I could also load the pretrained ImageNet weights. I set include_top=False so as not to include the final pooling and fully connected layer of the original model. I then added Global Average Pooling and a dense output layer to the ResNet-50 model.
from tensorflow.keras.layers import GlobalAveragePooling2D, Dropout, Dense
from tensorflow.keras.models import Model

x = base_model.output
x = GlobalAveragePooling2D()(x)   # pool the final feature maps into one vector per image
x = Dropout(0.7)(x)               # heavy dropout for regularization
predictions = Dense(num_classes, activation='softmax')(x)   # num_classes = 6 for the signs data set
model = Model(inputs=base_model.input, outputs=predictions)
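For completeness, a compile call matching the training setup described below (Adam, learning rate 0.0001; the loss is my assumption for the 6-class signs data) might look like:

from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",   # assumed loss for one-hot labels
              metrics=["accuracy"])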
As shown above, Keras provides a very convenient interface to load pretrained models, but it is important to code the ResNet yourself at least once so you understand the concept and can perhaps apply this learning to another new architecture you are creating.
The Keras ResNet got to an accuracy of 75% after training for 100 epochs with the Adam optimizer and a learning rate of 0.0001. The accuracy is a bit lower than that of our own coded model, and I guess this has to do with the weight initializations.
Keras also provides an easy interface for data augmentation so if you get a chance, try
augmenting this data set and see if that results in better performance.
Conclusion
ResNet is a powerful backbone model that is used very frequently in many computer
vision tasks
ResNet uses skip connections to add the output from an earlier layer to a later layer. This helps it mitigate the vanishing gradient problem.
You can use Keras to load the pretrained ResNet-50 or use the code I have shared to code ResNet yourself.
I have my own deep learning consultancy and love to work on interesting problems. I have helped many startups deploy innovative AI-based solutions. Check us out at http://deeplearninganalytics.org/.
References
DeepLearning.AI
Keras
ResNet Paper
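The snippet below builds a small classification head on top of a frozen pretrained convolutional base, conv_base, which is defined earlier in the original post but not reproduced in this excerpt. As an assumed, minimal stand-in (the exact base, input size, and weights are not shown here; EfficientNetB0 is a guess consistent with the title, and its layer names may differ from those referenced later):

from tensorflow.keras import models, layers, optimizers
from tensorflow.keras.applications import EfficientNetB0

# Assumed setup: the input size, batch size, and choice of EfficientNetB0 are placeholders,
# not taken from the original post.
height = width = 224
batch_size = 32
conv_base = EfficientNetB0(weights="imagenet", include_top=False, input_shape=(height, width, 3))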
dropout_rate = 0.2
model = models.Sequential()
model.add(conv_base)                      # pretrained convolutional base (see the assumed setup above)
model.add(layers.GlobalMaxPooling2D(name="gap"))
# model.add(layers.Flatten(name="flatten"))
if dropout_rate > 0:
    model.add(layers.Dropout(dropout_rate, name="dropout_out"))
# model.add(layers.Dense(256, activation='relu', name="fc1"))
model.add(layers.Dense(2, activation="softmax", name="fc_out"))   # 2 classes: dog vs cat
conv_base.trainable = False               # freeze the base so only the new head is trained
!wget https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
!unzip -qq kagglecatsanddogs_3367a.zip -d dog_vs_cat
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode="nearest",
)
train_generator = train_datagen.flow_from_directory(
    # This is the target directory
    train_dir,
    # All images will be resized to target height and width.
    target_size=(height, width),
    batch_size=batch_size,
    # Since we use categorical_crossentropy loss, we need categorical labels
    class_mode="categorical",
)
# test_datagen is not defined in the original excerpt; validation images are only rescaled (no augmentation)
test_datagen = ImageDataGenerator(rescale=1.0 / 255)

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(height, width),
    batch_size=batch_size,
    class_mode="categorical",
)
model.compile(
    loss="categorical_crossentropy",
    optimizer=optimizers.RMSprop(lr=2e-5),
    metrics=["acc"],
)
history = model.fit_generator(
    train_generator,
    steps_per_epoch=NUM_TRAIN // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=NUM_TEST // batch_size,
    verbose=1,
    use_multiprocessing=True,
    workers=4,
)
Another technique to make the model representation more relevant for the problem at hand is called fine-tuning. It is based on the following intuition.
Earlier layers in the convolutional base encode more generic, reusable features, while layers
higher up encode more specialized features.
We have already done the first three steps. To find out which layers to unfreeze, it is helpful to plot the Keras model.
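One way to do that (a minimal sketch; the output file name is arbitrary, and conv_base.summary() works as a text-only alternative):

from tensorflow.keras.utils import plot_model

# Save a diagram of the convolutional base so we can pick the layer name from which to unfreeze
plot_model(conv_base, to_file="conv_base.png", show_shapes=True)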
set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'multiply_16':
        set_trainable = True
    if set_trainable:
        layer.trainable = True
    else:
        layer.trainable = False
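After changing the trainable flags, the model has to be compiled again for the change to take effect; for example (the optimizer and learning rate here are illustrative, not taken from the original post):

model.compile(
    loss="categorical_crossentropy",
    optimizer=optimizers.RMSprop(lr=1e-5),   # a lower learning rate is typical for fine-tuning
    metrics=["acc"],
)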