CNN Architectures

Basic Introduction
LeNet-5, from the paper Gradient-Based Learning Applied to Document Recognition, is a very efficient
convolutional neural network for handwritten character recognition.
Paper: Gradient-Based Learning Applied to Document Recognition
Authors: Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner
Published in: Proceedings of the IEEE (1998)
Structure of the LeNet network

LeNet5 is a small network, it contains the basic modules of deep learning: convolutional layer, pooling layer,
and full link layer. It is the basis of other deep learning models. Here we analyze LeNet5 in depth. At the
same time, through example analysis, deepen the understanding of the convolutional layer and pooling
layer.
LeNet-5 Total seven layer , does not comprise an input, each containing a trainable parameters; each layer
has a plurality of the Map the Feature , a characteristic of each of the input FeatureMap extracted by means
of a convolution filter, and then each FeatureMap There are multiple neurons.
Detailed explanation of each layer parameter:
INPUT Layer
The first is the data INPUT layer. The size of the input image is uniformly normalized to 32 * 32.
Note: This layer does not count as the network structure of LeNet-5. Traditionally, the input
layer is not considered as one of the network hierarchy.
C1 layer-convolutional layer
Input picture: 32 * 32
Convolution kernel size: 5 * 5
Convolution kernel types: 6
Output featuremap size: 28 * 28 (32-5 + 1) = 28
Number of neurons: 28 28 6
Trainable parameters: (5 5 + 1) 6 (5 * 5 = 25 unit parameters and one bias parameter per

filter, a total of 6 filters)
Number of connections: (5 5 + 1) 6 28 28 = 122304
Detailed description:
1. The first convolution operation is performed on the input image (using 6 convolution kernels of size 5 5)
to obtain 6 C1 feature maps (6 feature maps of size 28 28, 32-5 + 1 = 28).
2. Let's take a look at how many parameters are needed. The size of the convolution kernel is 5 5, and
there are 6 (5 * 5 + 1) = 156 parameters in total, where +1 indicates that a kernel has a bias.
3. For the convolutional layer C1, each pixel in C1 is connected to 5 5 pixels and 1 bias in the input
image, so there are 156 28 * 28 = 122304 connections in total. There are 122,304 connections, but we
only need to learn 156 parameters, mainly through weight sharing.
S2 layer-pooling layer (downsampling layer)

Input: 28 * 28
Sampling area: 2 * 2
Sampling method: 4 inputs are added, multiplied by a trainable parameter, plus a trainable
offset. Results via sigmoid
Sampling type: 6
Output featureMap size: 14 * 14 (28/2)
Number of neurons: 14 14 6
Trainable parameters: 2 * 6 (the weight of the sum + the offset)
Number of connections: (2 2 + 1) 6 14 14
The size of each feature map in S2 is 1/4 of the size of the feature map in C1.
The pooling operation is followed immediately after the first convolution. Pooling is performed using 2 2
kernels, and S2, 6 feature maps of 14 14 (28/2 = 14) are obtained.
The pooling layer of S2 is the sum of the pixels in the 2 * 2 area in C1 multiplied by a weight coefficient plus
an offset, and then the result is mapped again.
So each pooling core has two training parameters, so there are 2x6 = 12 training parameters, but there are
5x14x14x6 = 5880 connections.
C3 layer-convolutional layer
Input: all 6 or several feature map combinations in S2
Convolution kernel type: 16
Output featureMap size: 10 * 10 (14-5 + 1) = 10
Each feature map in C3 is connected to all 6 or several feature maps in S2, indicating that the
feature map of this layer is a different combination of the feature maps extracted from the
previous layer.
One way is that the first 6 feature maps of C3 take 3 adjacent feature map subsets in S2 as
input. The next 6 feature maps take 4 subsets of neighboring feature maps in S2 as input.
The next three take the non-adjacent 4 feature map subsets as input. The last one takes all
the feature maps in S2 as input.
The trainable parameters are: 6 (3 5 5 + 1) + 6 (4 5 5 + 1) + 3 (4 5 5 + 1) + 1 (6 5 5 +1) =

1516
Number of connections: 10 10 1516 = 151600
After the first pooling, the second convolution, the output of the second convolution is C3, 16 10x10 feature
maps, and the size of the convolution kernel is 5 5. We know that S2 has 6 14 14 feature maps, how to get
16 feature maps from 6 feature maps? Here are the 16 feature maps calculated by the special combination
of the feature maps of S2. details as follows:
The first 6 feature maps of C3 (corresponding to the 6th column of the first red box in the figure above) are
connected to the 3 feature maps connected to the S2 layer (the first red box in the above figure), and the
next 6 feature maps are connected to the S2 layer The 4 feature maps are connected (the second red box
in the figure above), the next 3 feature maps are connected with the 4 feature maps that are not connected
at the S2 layer, and the last is connected with all the feature maps at the S2 layer. The convolution kernel
size is still 5 5, so there are 6 (3 5 5 + 1) + 6 (4 5 5 + 1) + 3 (4 5 5 + 1) +1 (6 5 5 + 1) = 1516 parameters.
The image size is 10 10, so there are 151600 connections.
The convolution structure of C3 and the first 3 graphs in S2 is shown below:
S4 layer-pooling layer (downsampling layer)

Input: 10 * 10
Sampling area: 2 * 2
Sampling method: 4 inputs are added, multiplied by a trainable parameter, plus a trainable
offset. Results via sigmoid
Sampling type: 16
Output featureMap size: 5 * 5 (10/2)
Number of neurons: 5 5 16 = 400
Trainable parameters: 2 * 16 = 32 (the weight of the sum + the offset)
Number of connections: 16 (2 2 + 1) 5 5 = 2000
The size of each feature map in S4 is 1/4 of the size of the feature map in C3
S4 is the pooling layer, the window size is still 2 * 2, a total of 16 feature maps, and the 16 10x10 maps of
the C3 layer are pooled in units of 2x2 to obtain 16 5x5 feature maps. This layer has a total of 32 training
parameters of 2x16, 5x5x5x16 = 2000 connections.
The connection is similar to the S2 layer.
C5 layer-convolution layer
Input: All 16 unit feature maps of the S4 layer (all connected to s4)
Convolution kernel type: 120
Output featureMap size: 1 * 1 (5-5 + 1)
Trainable parameters / connection: 120 (16 5 * 5 + 1) = 48120
The C5 layer is a convolutional layer. Since the size of the 16 images of the S4 layer is 5x5, which is the
same as the size of the convolution kernel, the size of the image formed after convolution is 1x1. This
results in 120 convolution results. Each is connected to the 16 maps on the previous level. So there are
(5x5x16 + 1) x120 = 48120 parameters, and there are also 48120 connections. The network structure of the
C5 layer is as follows:
F6 layer-fully connected layer

Input: c5 120-dimensional vector
Calculation method: calculate the dot product between the input vector and the weight
vector, plus an offset, and the result is output through the sigmoid function.
Trainable parameters: 84 * (120 + 1) = 10164
Layer 6 is a fully connected layer. The F6 layer has 84 nodes, corresponding to a 7x12 bitmap, -1 means
white, 1 means black, so the black and white of the bitmap of each symbol corresponds to a code. The
training parameters and number of connections for this layer are (120 + 1) x84 = 10164. The ASCII
encoding diagram is as follows:
The connection method of the F6 layer is as follows:
Output layer-fully connected layer

The output layer is also a fully connected layer, with a total of 10 nodes, which respectively represent the
numbers 0 to 9, and if the value of node i is 0, the result of network recognition is the number i. A radial
basis function (RBF) network connection is used. Assuming x is the input of the previous layer and y is the
output of the RBF, the calculation of the RBF output is:
The value of the above formula w_ij is determined by the bitmap encoding of i, where i ranges from 0 to 9,
and j ranges from 0 to 7 * 12-1. The closer the value of the RBF output is to 0, the closer it is to i, that is, the
closer to the ASCII encoding figure of i, it means that the recognition result input by the current network is
the character i. This layer has 84x10 = 840 parameters and connections.
Summary
LeNet-5 is a very efficient convolutional neural network for handwritten character recognition.
Convolutional neural networks can make good use of the structural information of images.
The convolutional layer has fewer parameters, which is also determined by the main characteristics of
the convolutional layer, that is, local connection and shared weights.
Code Implementation
In [5]: from tensorflow import keras
from keras.datasets import mnist
from keras.layers import Conv2D, MaxPooling2D,AveragePooling2D
from keras.layers import Dense, Flatten
from keras.models import Sequential
# Load the CIFAR-10 dataset

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
# Normalize pixel values between 0 and 1

x_train = x_train / 255.0
x_test = x_test / 255.0
# Convert labels to one-hot encoding

y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
# Building the Model Architecture
model = Sequential()
model.add(Conv2D(6, kernel_size = (5,5), padding = 'valid', activation='tanh', input_sha
model.add(AveragePooling2D(pool_size= (2,2), strides = 2, padding = 'valid'))
model.add(Conv2D(16, kernel_size = (5,5), padding = 'valid', activation='tanh'))

model.add(AveragePooling2D(pool_size= (2,2), strides = 2, padding = 'valid'))
model.add(Flatten())
model.add(Dense(120, activation='tanh'))
model.add(Dense(84, activation='tanh'))
model.add(Dense(10, activation='softmax'))
model.summary()
model.compile(loss=keras.metrics.categorical_crossentropy, optimizer=keras.optimizers.Ad
model.fit(x_train, y_train, batch_size=128, epochs=2, verbose=1, validation_data=(x_test
score = model.evaluate(x_test, y_test)
print('Test Loss:', score[0])

print('Test accuracy:', score[1])
Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_5 (Conv2D) (None, 28, 28, 6) 456
average_pooling2d_4 (Averag (None, 14, 14, 6) 0

ePooling2D)
conv2d_6 (Conv2D) (None, 10, 10, 16) 2416
average_pooling2d_5 (Averag (None, 5, 5, 16) 0

ePooling2D)
flatten_2 (Flatten) (None, 400) 0
dense_6 (Dense) (None, 120) 48120
=================================================================
Total params: 62,006
Trainable params: 62,006
Non-trainable params: 0
_________________________________________________________________
Epoch 1/2
391/391 [==============================] - 14s 8ms/step - loss: 1.8395 - accuracy: 0.346
6 - val_loss: 1.7231 - val_accuracy: 0.3949
Epoch 2/2
- val_loss: 1.6083 - val_accuracy: 0.4258
Test Loss: 1.6083446741104126
Test accuracy: 0.42579999566078186
In [ ]:
Introduction
AlexNet was designed by Hinton, winner of the 2012 ImageNet competition, and his student
Alex Krizhevsky. It was also after that year that more and deeper neural networks were
proposed, such as the excellent vgg, GoogleLeNet. Its official data model has an accuracy
rate of 57.1% and top 1-5 reaches 80.2%. This is already quite outstanding for traditional
machine learning classification algorithms.
The following table below explains the network structure of AlexNet:
Size /
Filter Depth Stride Padding Number of Parameters Forward Computation
Operation
3* 227 * 227
11 * (11*11*3 + 1) * 96 * 55 *
Conv1 + Relu 96 4 (11*11*3 + 1) * 96=34944
11 55=105705600
96 * 55 * 55
Max Pooling 3*3 2

96 * 27 * 27
Norm
(5 * 5 * 96 + 1) * 256 * 27 *
Conv2 + Relu 5*5 256 1 2 (5 * 5 * 96 + 1) * 256=614656
27=448084224
256 * 27 * 27
Max Pooling 3*3 2
256 * 13 * 13
Norm
(3 * 3 * 256 + 1) * 384 * 13 *
Conv3 + Relu 3*3 384 1 1 (3 * 3 * 256 + 1) * 384=885120
13=149585280
384 * 13 * 13
(3 * 3 * 384 + 1) * 384 * 13 *
Conv4 + Relu 3*3 384 1 1 (3 * 3 * 384 + 1) * 384=1327488
13=224345472
384 * 13 * 13
(3 * 3 * 384 + 1) * 256 * 13 *
Conv5 + Relu 3*3 256 1 1 (3 * 3 * 384 + 1) * 256=884992
13=149563648
256 * 13 * 13
Max Pooling 3*3 2
256 * 6 * 6
Dropout (rate
0.5)
FC6 + Relu 256 * 6 * 6 * 4096=37748736 256 * 6 * 6 * 4096=37748736
4096
Dropout (rate
0.5)
FC7 + Relu 4096 * 4096=16777216 4096 * 4096=16777216
4096
FC8 + Relu 4096 * 1000=4096000 4096 * 1000=4096000
1000 classes
Overall 62369152=62.3 million 1135906176=1.1 billion
Conv:3.7million (6%) , FC: 58.6 Conv: 1.08 billion (95%) , FC: 58.6
Conv VS FC
million (94% ) million (5%)
Why does AlexNet achieve better results?

1. Relu activation function is used.
Relu function: f (x) = max (0, x)

ReLU-based deep convolutional networks are trained several times faster than tanh and sigmoid- based
networks. The following figure shows the number of iterations for a four-layer convolutional network based
on CIFAR-10 that reached 25% training error in tanh and ReLU:
1. Standardization ( Local Response Normalization )
After using ReLU f (x) = max (0, x), you will find that the value after the activation function has no range like
the tanh and sigmoid functions, so a normalization will usually be done after ReLU, and the LRU is a steady
proposal (Not sure here, it should be proposed?) One method in neuroscience is called "Lateral inhibition",
which talks about the effect of active neurons on its surrounding neurons.
1. Dropout
Dropout is also a concept often said, which can effectively prevent overfitting of neural networks. Compared
to the general linear model, a regular method is used to prevent the model from overfitting. In the neural
network, Dropout is implemented by modifying the structure of the neural network itself. For a certain layer
of neurons, randomly delete some neurons with a defined probability, while keeping the individuals of the
input layer and output layer neurons unchanged, and then update the parameters according to the learning
method of the neural network. In the next iteration, rerandom Remove some neurons until the end of
training.
1. Enhanced Data ( Data Augmentation )
In deep learning, when the amount of data is not large enough, there are generally 4 solutions:
Data augmentation- artificially increase the size of the training set-create a batch of "new"
data from existing data by means of translation, flipping, noise
Regularization——The relatively small amount of data will cause the model to overfit,
making the training error small and the test error particularly large. By adding a regular term
after the Loss Function , the overfitting can be suppressed. The disadvantage is that a need
is introduced Manually adjusted hyper-parameter.
Dropout- also a regularization method. But different from the above, it is achieved by
randomly setting the output of some neurons to zero
Unsupervised Pre-training- use Auto-Encoder or RBM's convolution form to do

unsupervised pre-training layer by layer, and finally add a classification layer to do supervised
Fine-Tuning
Code Implementation
In [1]: !pip install tflearn
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/publ

ic/simple/
Collecting tflearn
Downloading tflearn-0.5.0.tar.gz (107 kB)
107.3/107.3 kB 10.2 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from tf
learn) (1.22.4)
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from tfle
arn) (1.16.0)
Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from t
flearn) (8.4.0)
Building wheels for collected packages: tflearn
Building wheel for tflearn (setup.py) ... done
Created wheel for tflearn: filename=tflearn-0.5.0-py3-none-any.whl size=127283 sha256=
e7f49a1218d295ab407bb51ae50a3185f8f81ab1df8dcbc0ebd5599aac7c8c6c
Stored in directory: /root/.cache/pip/wheels/55/fb/7b/e06204a0ceefa45443930b9a250cb5eb
e31def0e4e8245a465
Successfully built tflearn
Installing collected packages: tflearn
Successfully installed tflearn-0.5.0
In [2]: import tensorflow as tf

from tensorflow import keras
import keras
from keras.layers import Dense, Activation, Dropout, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization
In [8]: # Get Data

import tflearn.datasets.oxflower17 as oxflower17
from keras.utils import to_categorical
x, y = oxflower17.load_data()
x_train = x.astype('float32') / 255.0

y_train = to_categorical(y, num_classes=17)
In [10]: print(x_train.shape)
print(y_train.shape)
(1360, 224, 224, 3)

(1360, 17)
In [5]: # Create a sequential model

# 1st Convolutional Layer

model.add(Conv2D(filters=96, input_shape=(224,224,3), kernel_size=(11,11), strides=(4,4)
model.add(Activation('relu'))
# Pooling
model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))
# Batch Normalisation before passing it to the next layer
model.add(BatchNormalization())
# 2nd Convolutional Layer

model.add(Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), padding='same'))
# Pooling
# Batch Normalisation
# 3rd Convolutional Layer

model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
# 4th Convolutional Layer

# 5th Convolutional Layer
# Pooling
# Passing it to a dense layer

# 1st Dense Layer

model.add(Dense(4096, input_shape=(224*224*3,)))
# Add Dropout to prevent overfitting
model.add(Dropout(0.4))
# 2nd Dense Layer

model.add(Dense(4096))
# Add Dropout
# Output Layer
model.add(Dense(17))
model.add(Activation('softmax'))
model.summary()
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/keras/layers/normalizati
on/batch_normalization.py:581: _colocate_with (from tensorflow.python.framework.ops) is
deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Model: "sequential"
_________________________________________________________________
=================================================================
conv2d (Conv2D) (None, 54, 54, 96) 34944
activation (Activation) (None, 54, 54, 96) 0
max_pooling2d (MaxPooling2D (None, 26, 26, 96) 0

)
batch_normalization (BatchN (None, 26, 26, 96) 384

ormalization)
conv2d_1 (Conv2D) (None, 26, 26, 256) 614656
activation_1 (Activation) (None, 26, 26, 256) 0
max_pooling2d_1 (MaxPooling (None, 12, 12, 256) 0

2D)
batch_normalization_1 (Batc (None, 12, 12, 256) 1024

hNormalization)
conv2d_2 (Conv2D) (None, 10, 10, 384) 885120

hNormalization)
conv2d_3 (Conv2D) (None, 8, 8, 384) 1327488

hNormalization)
conv2d_4 (Conv2D) (None, 6, 6, 256) 884992

2D)

hNormalization)
flatten (Flatten) (None, 1024) 0
dense (Dense) (None, 4096) 4198400
activation_5 (Activation) (None, 4096) 0
dropout (Dropout) (None, 4096) 0
batch_normalization_5 (Batc (None, 4096) 16384

hNormalization)
dense_1 (Dense) (None, 4096) 16781312
dropout_1 (Dropout) (None, 4096) 0
batch_normalization_6 (Batc (None, 4096) 16384

hNormalization)
=================================================================
Total params: 24,834,833
Trainable params: 24,815,697
Non-trainable params: 19,136
_________________________________________________________________
In [11]: # Compile the model

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
In [12]: # Train
model.fit(x_train, y_train, batch_size=64, epochs=5, verbose=1,validation_split=0.2, shu
Train on 1088 samples, validate on 272 samples

Epoch 1/5
1088/1088 [==============================] - ETA: 0s - loss: 3.6493 - acc: 0.2858
/usr/local/lib/python3.10/dist-packages/keras/engine/training_v1.py:2335: UserWarning: `
Model.state_updates` will be removed in a future version. This property should not be us
ed in TensorFlow 2.0, as `updates` are applied automatically.
updates = self.state_updates
1088/1088 [==============================] - 11s 10ms/sample - loss: 3.6493 - acc: 0.285
8 - val_loss: 3.1504 - val_acc: 0.0699
Epoch 2/5
- val_loss: 5.0040 - val_acc: 0.0699
Epoch 3/5
- val_loss: 7.8723 - val_acc: 0.0551
Epoch 4/5
- val_loss: 5.6248 - val_acc: 0.0699
Epoch 5/5
- val_loss: 7.2296 - val_acc: 0.0699
<keras.callbacks.History at 0x7fe7577e01f0>
Out[12]:
In [ ]:
VGG-Net
Introduction
The full name of VGG is the Visual Geometry Group, which belongs to the Department of
Science and Engineering of Oxford University. It has released a series of convolutional
network models beginning with VGG, which can be applied to face recognition and image
classification, from VGG16 to VGG19. The original purpose of VGG's research on the depth
of convolutional networks is to understand how the depth of convolutional networks affects
the accuracy and accuracy of large-scale image classification and recognition. -Deep-16
CNN), in order to deepen the number of network layers and to avoid too many parameters, a
small 3x3 convolution kernel is used in all layers.
Network Structure of VGG19
The network structure

The input of VGG is set to an RGB image of 224x244 size. The average RGB value is
calculated for all images on the training set image, and then the image is input as an input to
the VGG convolution network. A 3x3 or 1x1 filter is used, and the convolution step is fixed. .
There are 3 VGG fully connected layers, which can vary from VGG11 to VGG19 according to
the total number of convolutional layers + fully connected layers. The minimum VGG11 has 8
convolutional layers and 3 fully connected layers. The maximum VGG19 has 16 convolutional
layers. +3 fully connected layers. In addition, the VGG network is not followed by a pooling
layer behind each convolutional layer, or a total of 5 pooling layers distributed under different
convolutional layers. The following figure is VGG Structure diagram:
1000
VGG16 contains 16 layers and VGG19 contains 19 layers. A series of VGGs are exactly the
same in the last three fully connected layers. The overall structure includes 5 sets of
convolutional layers, followed by a MaxPool. The difference is that more and more cascaded
convolutional layers are included in the five sets of convolutional layers .
Each convolutional layer in AlexNet contains only one convolution, and the size of the
convolution kernel is 7 7 ,. In VGGNet, each convolution layer contains 2 to 4 convolution
operations. The size of the convolution kernel is 3 3, the convolution step size is 1, the
pooling kernel is 2 * 2, and the step size is 2. The most obvious improvement of VGGNet is to
reduce the size of the convolution kernel and increase the number of convolution layers.
Using multiple convolution layers with smaller convolution kernels instead of a larger
convolution layer with convolution kernels can reduce parameters on the one hand, and the
author believes that it is equivalent to more non-linear mapping, which increases the Fit
expression ability.
Two consecutive 3 3 convolutions are equivalent to a 5 5 receptive field, and three are
equivalent to 7 7. The advantages of using three 3 3 convolutions instead of one 7 7
convolution are twofold : one, including three ReLu layers instead of one , makes the decision
function more discriminative; and two, reducing parameters . For example, the input and
output are all C channels. 3 convolutional layers using 3 3 require 3 (3 3 C C) = 27 C C, and
1 convolutional layer using 7 7 requires 7 7 C C = 49C C. This can be seen as applying a
kind of regularization to the 7 7 convolution, so that it is decomposed into three 3 3
convolutions.
The 1 1 convolution layer is mainly to increase the non-linearity of the decision function
without affecting the receptive field of the convolution layer. Although the 1 1 convolution
operation is linear, ReLu adds non-linearity.
Network Configuration
Table 1 shows all network configurations. These networks follow the same design principles,
but differ in depth.
This picture is definitely used when introducing VGG16. This picture contains a lot of
information. My interpretation here may be limited. If you have any supplements, please leave
a message.
Number 1 : This is a comparison chart of 6 networks. From A to E, the network is getting deeper.
Several layers have been added to verify the effect.
Number 2 : Each column explains the structure of each network in detail.
Number 3: This is a correct way to do experiments, that is, use the simplest method to solve the
problem , and then gradually optimize for the problems that occur.
Network A: First mention a shallow network, this network can easily converge on ImageNet. And then?
Network A-LRN: Add something that someone else (AlexNet) has experimented to say is effective (LRN),
but it seems useless. And then?
Network B: Then try adding 2 layers? Seems to be effective. And then?

Network C: Add two more layers of 1 * 1 convolution, and it will definitely converge. The effect seems to be
better. A little excited. And then?
Network D: Change the 1 1 convolution kernel to 3 3. Try it. The effect has improved again. Seems to be
the best (2014).
Training
The optimization method is a stochastic gradient descent SGD + momentum (0.9) with momentum. The
batch size is 256.
Regularization : L2 regularization is used, and the weight decay is 5e-4. Dropout is after the first two fully
connected layers, p = 0.5.
Although it is deeper and has more parameters than the AlexNet network, we speculate that VGGNet can
converge in less cycles for two reasons: one, the greater depth and smaller convolutions bring implicit
regularization ; Second, some layers of pre-training.
Parameter initialization : For a shallow A network, parameters are randomly initialized, the weight w is
sampled from N (0, 0.01), and the bias is initialized to 0. Then, for deeper networks, first the first four
convolutional layers and three fully connected layers are initialized with the parameters of the A network.
However, it was later discovered that it is also possible to directly initialize it without using pre-trained
parameters.
In order to obtain a 224 * 224 input image, each rescaled image is randomly cropped in each SGD iteration.
In order to enhance the data set, the cropped image is also randomly flipped horizontally and RGB color
shifted.
Summary of VGGNet improvement points

1. A smaller 3 3 convolution kernel and a deeper network are used . The stack of two 3 3 convolution
kernels is relative to the field of view of a 5 5 convolution kernel, and the stack of three 3 3 convolution
kernels is equivalent to the field of view of a 7 7 convolution kernel. In this way, there can be fewer
parameters (3 stacked 3 3 structures have only 7 7 structural parameters (3 3 3) / (7 7) = 55%); on the
other hand, they have more The non-linear transformation increases the ability of CNN to learn
features.
1. In the convolutional structure of VGGNet, a 1 * 1 convolution kernel is introduced. Without affecting the
input and output dimensions, non-linear transformation is introduced to increase the expressive power
of the network and reduce the amount of calculation.
1. During training, first train a simple (low-level) VGGNet A-level network, and then use the weights of the
A network to initialize the complex models that follow to speed up the convergence of training .
Some basic questions

Q1: Why can 3 3x3 convolutions replace 7x7 convolutions?
Answer 1
3 3x3 convolutions, using 3 non-linear activation functions, increasing non-linear expression capabilities,
making the segmentation plane more separable Reduce the number of parameters. For the convolution
kernel of C channels, 7x7 contains parameters , and the number of 3 3x3 parameters is greatly reduced.
Q2: The role of 1x1 convolution kernel
Answer 2
Increase the nonlinearity of the model without affecting the receptive field 1x1 winding machine is equivalent
to linear transformation, and the non-linear activation function plays a non-linear role
Q3: The effect of network depth on results (in the same year, Google also independently released
the network GoogleNet with a depth of 22 layers)
Answer 3
VGG and GoogleNet models are deep Small convolution VGG only uses 3x3, while GoogleNet uses 1x1,
3x3, 5x5, the model is more complicated (the model began to use a large convolution kernel to reduce the
calculation of the subsequent machine layer)
Code Implementation
From Scratch

ic/simple/
Collecting tflearn
107.3/107.3 kB 11.9 MB/s eta 0:00:00
learn) (1.22.4)
arn) (1.16.0)
flearn) (8.4.0)
f270efc507b05ff73c8125856b43368ccef0ca18caf050fe770d806a4f8b359f
e31def0e4e8245a465

import keras,os
from keras.layers import Dense, Conv2D, MaxPool2D , Flatten
from keras.preprocessing.image import ImageDataGenerator
import numpy as np
In [3]: # Get Data

WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compa
t/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scop
e) is deprecated and will be removed in a future version.
non-resource variables are not supported in the long term
Downloading Oxford 17 category Flower Dataset, Please wait...
100.0% 60276736 / 60270631
Succesfully downloaded 17flowers.tgz 60270631 bytes.
File Extracted
Starting to parse images...
Parsing Done!
(1360, 224, 224, 3)

(1360, 17)
In [8]: model = Sequential()

model.add(Conv2D(input_shape=(224,224,3),filters=64,kernel_size=(3,3),padding="same", ac
model.add(Conv2D(filters=64,kernel_size=(3,3),padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Dense(units=4096,activation="relu"))
model.add(Dense(units=4096,activation="relu"))
model.add(Dense(units=17, activation="softmax"))
model.summary()
Model: "sequential_1"
_________________________________________________________________
=================================================================
conv2d_13 (Conv2D) (None, 224, 224, 64) 1792
conv2d_14 (Conv2D) (None, 224, 224, 64) 36928

2D)
conv2d_15 (Conv2D) (None, 112, 112, 128) 73856
conv2d_16 (Conv2D) (None, 112, 112, 128) 147584

2D)
conv2d_17 (Conv2D) (None, 56, 56, 256) 295168

conv2d_18 (Conv2D) (None, 56, 56, 256) 590080
conv2d_19 (Conv2D) (None, 56, 56, 256) 590080

2D)
conv2d_20 (Conv2D) (None, 28, 28, 512) 1180160
conv2d_21 (Conv2D) (None, 28, 28, 512) 2359808
conv2d_22 (Conv2D) (None, 28, 28, 512) 2359808

2D)
conv2d_23 (Conv2D) (None, 14, 14, 512) 2359808
conv2d_24 (Conv2D) (None, 14, 14, 512) 2359808
conv2d_25 (Conv2D) (None, 14, 14, 512) 2359808

2D)
flatten_1 (Flatten) (None, 25088) 0
dense_3 (Dense) (None, 4096) 102764544
dense_4 (Dense) (None, 4096) 16781312
=================================================================
_________________________________________________________________
In [9]: # Compile the model

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
In [10]: # Train

Epoch 1/5
1088/1088 [==============================] - ETA: 0s - loss: 2.8376 - acc: 0.0478
8 - val_loss: 2.8353 - val_acc: 0.0331
Epoch 2/5
2 - val_loss: 2.8366 - val_acc: 0.0331
Epoch 3/5
3 - val_loss: 2.8365 - val_acc: 0.0331
Epoch 4/5
3 - val_loss: 2.8387 - val_acc: 0.0331
Epoch 5/5
3 - val_loss: 2.8391 - val_acc: 0.0331
<keras.callbacks.History at 0x7fe540c4b4f0>
Out[10]:
In [ ]:
VGG Pretrained
In [11]: # download the data from g drive
import gdown
url = "https://drive.google.com/file/d/12jiQxJzYSYl3wnC8x5wHAhRzzJmmsCXP/view?usp=sharin
file_id = url.split("/")[-2]
print(file_id)
prefix = 'https://drive.google.com/uc?/export=download&id='
gdown.download(prefix+file_id, "catdog.zip")
12jiQxJzYSYl3wnC8x5wHAhRzzJmmsCXP
Downloading...
From: https://drive.google.com/uc?/export=download&id=12jiQxJzYSYl3wnC8x5wHAhRzzJmmsCXP
To: /content/catdog.zip
100%|██████████| 9.09M/9.09M [00:00<00:00, 118MB/s]
'catdog.zip'
Out[11]:
In [12]: !unzip catdog.zip
Archive: catdog.zip
creating: train/
creating: train/Cat/
inflating: train/Cat/0.jpg
inflating: train/Cat/cat.2405.jpg
creating: train/Dog/
inflating: train/Dog/10493.jpg
inflating: train/Dog/dog.2432.jpg
creating: validation/
creating: validation/Cat/
inflating: validation/Cat/cat.2407.jpg
creating: validation/Dog/
inflating: validation/Dog/dog.2402.jpg

from keras.applications.vgg16 import VGG16, preprocess_input
from keras.layers import Dense, Dropout, Flatten
# Set the path to your training and validation data

train_data_dir = '/content/train'
validation_data_dir = '/content/validation'
# Set the number of training and validation samples

num_train_samples = 2000
num_validation_samples = 800
# Set the number of epochs and batch size

epochs = 5
batch_size = 16
# Load the VGG16 model without the top layer

base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
# Freeze the base model layers

for layer in base_model.layers:
layer.trainable = False
# Create a new model

# Add the base model as a layer
model.add(base_model)
# Add custom layers on top of the base model

model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile the model

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Preprocess the training and validation data

train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
validation_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_generator = train_datagen.flow_from_directory(
train_data_dir,
target_size=(224, 224),
batch_size=batch_size,
class_mode='binary')
validation_generator = validation_datagen.flow_from_directory(
validation_data_dir,
# Train the model

model.fit(
train_generator,
steps_per_epoch=num_train_samples // batch_size,
epochs=epochs,
validation_data=validation_generator,
validation_steps=num_validation_samples // batch_size)
# Save the trained model

model.save('dog_cat_classifier.h5')
Found 337 images belonging to 2 classes.

Epoch 1/5
125/125 [==============================] - ETA: 0s - batch: 62.0000 - size: 15.2800 - lo
ss: 2.4077 - acc: 0.9665
125/125 [==============================] - 19s 138ms/step - batch: 62.0000 - size: 15.28
00 - loss: 2.4086 - acc: 0.9665 - val_loss: 10.4672 - val_acc: 0.9297
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
00 - loss: 4.0183e-07 - acc: 1.0000 - val_loss: 2.7819 - val_acc: 0.9824
In [ ]:
In [ ]:
Inception
Also known as GoogLeNet , it is a 22-layer network that won the 2014 ILSVRC Championship.
1. The original intention of the design is to expand the width and depth on its basis .
2. which is designed motives derived from improving the performance of the depth of the network
generally can increase the size of the network and increase the size of the data set to increase, but at
the same time cause the network parameters and easily fit through excessive , computing resources
inefficient and The production of high-quality data sets is an expensive issue.
3. Its design philosophy is to change the full connection to a sparse architecture and try to change it to a
sparse architecture inside the convolution.
4. The main idea is to design an inception module and increase the depth and width of the network by
continuously copying these inception modules , but GooLeNet mainly extends these inception modules
in depth.
There are four parallel channels in each inception module , and concat is performed at the end of the
channel .
1x1 conv is mainly used to reduce the dimensions in the article to avoid calculation bottlenecks. It also adds
additional softmax loss to some branches of the previous network layer to avoid the problem of gradient
disappearance.
Four parallel channels:
1x1 conv: Borrowed from [ Network in Network ], the input feature map can be reduced in dimension
and upgraded without too much loss of the input spatial information;
1x1conv followed by 3x3 conv: 3x3 conv increases the receptive field of the feature map, and changes
the dimension through 1x1conv;
1x1 conv followed by 5x5 conv: 5x5 conv further increases the receptive field of the feature map, and
changes the dimensions through 1x1 conv;
3x3 max pooling followed by 1x1 conv: The author believes that although the pooling layer will lose
space information, it has been effectively applied in many fields, which proves its effectiveness, so a
parallel channel is added, and it is changed by 1x1 conv Its output dimension.
Complete network design : -

Two ways to improve network performance:
The most direct way to improve the performance of deep neural networks is to increase their size .
This includes depth, the number of levels, and their width, the size of each level unit .
Another easy and safe way is to increase the size of the training data.
However, both methods have two disadvantages .
Larger models mean more parameters, which makes it easier for the network to overfit , especially when the
number of label samples in the training data set is limited.
At the same time, because the production of high-quality training sets is tricky and expensive ,especially
when some human experts do it , there is a large error rate . As shown below.
Another shortcoming is that uniformly increasing the size of the network will increase the use of computing
resources . For example, in a deep network, if two convolutions are chained, any unified improvement of
their convolution kernels will cause demand for resources.
Power increase: If the increased capacity is inefficient, for example, if most of the weights end with 0 , then
a lot of computing resources are wasted. But because the computing resources are always limited, an
effective computational distribution always tends to increase the size of the model indiscriminately, and even
the main objective goal is to improve the performance of the results.
The basic method to solve these two problems is to finally change the fully connected network to a sparse
architecture, even inside the convolution.
The details of the GooLeNet network layer are shown in the following table:
To sum up:
128 1x1 convolution kernels are used to reduce dimensions and modify linear activation units
A fully connected layer of 1024 units and a modified linear activation unit;
A dropout layer that drops neuron connections with a 70% probability;
A linear layer with softmax loss as classification Predict 1000 categories, but removed during the
inference phase
Training Methodology
The momentum is set to 0.9 and the learning rate is set to decrease by 4% every 8 epochs.
Seven models were trained . To make the problem more detailed, some models were trained
on small crops, and some were trained on large crops .
The factors that make the model train well include : the sampling of patches of various sizes
in the image , the size of which is evenly distributed between 8% and 100%, and the aspect
ratio between 3/4 and 4/3.
Illumination changes have an effect on avoiding overfitting.
Later, random interpolation is used to resize the image.
Inception-v2(2015)
This architecture is a landmark in the development of deep network models . The most prominent
contribution is to propose a normalized Batch Normalization layer to unify the output range of the network. It
is fixed in a relatively uniform range. If the BN layer is not added, the value range of the network input and
output of each layer is greatly different, so the size of the learning rate will be different. The BN layer avoids
this situation This accelerates the training of the network and gives the network regular terms to a certain
extent , reducing the degree of overfitting of the network. In the subsequent development of network
models, most models have more or less added BN layers to the model.
In this paper, the BN layer is standardized before being input to the activation function. At the same time,
VGG uses 2 3x3 convs instead of 5x5 convs in the inception module to reduce the amount of parameters
and speed up the calculation.
Algorithm advantages:
1. Improved learning rate : In the BN model, a higher learning rate is used to accelerate training
convergence, but it will not cause other effects. Because if the scale of each layer is different, then the
learning rate required by each layer is different. The scale of the same layer dimension often also
needs different learning rates. Usually, the minimum learning is required to ensure the loss function to
decrease, but The BN layer keeps the scale of each layer and dimension consistent, so you can
directly use a higher learning rate for optimization.
1. Remove the dropout layer : The BN layer makes full use of the goals of the dropout layer. Remove
the dropout layer from the BN-Inception model, but no overfitting will occur.
1. Decrease the attenuation coefficient of L2 weight : Although the L2 loss controls the overfitting of
the Inception model, the loss of weight has been reduced by five times in the BN-Inception model.
1. Accelerate the decay of the learning rate : When training the Inception model, we let the learning
rate decrease exponentially. Because our network is faster than Inception, we will increase the speed of
reducing the learning rate by 6 times.
1. Remove the local response layer : Although this layer has a certain role, but after the BN layer is
added, this layer is not necessary.
1. Scramble training samples more thoroughly : We scramble training samples, which can prevent the
same samples from appearing in a mini-batch. This can improve the accuracy of the validation set by
1%, which is the advantage of the BN layer as a regular term. In our method, random selection is more
effective when the model sees different samples each time.
1. To reduce image distortion: Because BN network training is faster and observes each training sample
less often, we want the model to see a more realistic image instead of a distorted image.
Inception-v3-2015
This architecture focuses, how to use the convolution kernel two or more smaller size of the convolution
kernel to replace, but also the introduction of asymmetrical layers i.e. a convolution dimensional
convolution has also been proposed for pooling layer Some remedies that can cause loss of spatial
information; there are ideas such as label-smoothing , BN-ahxiliary .
Experiments were performed on inputs with different resolutions . The results show that although low-
resolution inputs require more time to train, the accuracy and high-resolution achieved are not much
different.
The computational cost is reduced while improving the accuracy of the network.
General Design Principles
We will describe some design principles that have been proposed through extensive experiments with
different architectural designs for convolutional networks. At this point, full use of the following principles can
be guessed, and some additional experiments in the future will be necessary to estimate their accuracy and
effectiveness.
1. Prevent bottlenecks in characterization . The so-called bottleneck of feature description is that a

large proportion of features are compressed in the middle layer (such as using a pooling operation).
This operation will cause the loss of feature space information and the loss of features. Although the
operation of pooling in CNN is important, there are some methods that can be used to avoid this loss
as much as possible (I note: later hole convolution operations ).
2. The higher the dimensionality of the feature, the faster the training converges . That is, the
independence of features has a great relationship with the speed of model convergence. The more
independent features, the more thoroughly the input feature information is decomposed. It is easier to
converge if the correlation is strong. Hebbin principle : fire together, wire together.
3. Reduce the amount of calculation through dimensionality reduction . In v1, the feature is first
reduced by 1x1 convolutional dimensionality reduction. There is a certain correlation between different
dimensions. Dimension reduction can be understood as a lossless or low-loss compression. Even if the
dimensions are reduced, the correlation can still be used to restore its original information.
4. Balance the depth and width of the network . Only by increasing the depth and width of the network
in the same proportion can the performance of the model be maximized.
Factorizing Convolutions with Large Filter Size
GooLeNet uses many dimensionality reduction methods, which has achieved certain results. Consider the
example of a 1x1 convolutional layer used to reduce dimensions before a 3x3 convolutional layer. In the
network, we expect the network to be highly correlated between the output neighboring elements at the
activation function. Therefore, we can reduce their activation values before aggregation , which should
generate similar local expression descriptions.
This paper explores experiments to decompose the network layer into different factors under different
settings in order to improve the computational efficiency of the method . Because the Inception network is
fully convolutional, each weight value corresponds to a product operation each time it is activated.
Therefore, any reduction in computational cost will result in a reduction in parameters. This means that we
can use some suitable decomposition factors to reduce the parameters and thus speed up the training.
3.1 Factorizing Convolutions with Large Filter Size
With the same number of convolution kernels, larger convolution kernels (such as 5x5 or 7x7) are more
expensive to calculate than 3x3 convolution kernels , which is about a multiple of 25/9 = 2.78. Of course, the
5x5 convolution kernel can obtain more correlations between the information and activation units in the
previous network, but under the premise of huge consumption of computing resources, a physical reduction
in the size of the convolution kernel still appears.
However, we still want to know whether a 5x5 convolutional layer can be replaced by a multi-layer
convolutional layer with fewer parameters when the input and output sizes are consistent . If we scale the
calculation map of 5x5 convolution, we can see that each output is like a small fully connected network
sliding on the input window with a size of 5x5. Refer to Figure 1.
Therefore, we have developed a network that explores translation invariance and replaces one layer of
convolution with two layers of convolution: the first layer is a 3x3 convolution layer and the second layer is a
fully connected layer . Refer to Figure 1. We ended up replacing two 5x5 convolutional layers with two 3x3
convolutional layers. Refer to Figure 4 Figure 5. This operation can realize the weight sharing of
neighboring layers. It is about (9 + 9) / 25 times reduction in computational consumption.
Spatial Factorization into Asymmetric Convolutions
We are wondering if the convolution kernel can be made smaller, such as 2x2, but there is an asymmetric
method that can be better than this method. That is to use nx1 size convolution. For example, using the
[3x1 + 1x3] convolution layer. In this case, a single 3x3 convolution has the same receptive field. Refer to
Figure 3. This asymmetric method can save [((3x3)-(3 + 3)) / (3x3) = 33%] computing resources, and
replacing two 2x2 only saves [11%] Computing resources.
In theory, we can have a deeper discussion and use the convolution of [1xn + nx1] instead of the
convolutional layer of nxn. Refer to Figure 6. But this situation is not very good in the previous layer, but it
can perform better on a medium-sized feature map [mxm, m is between 12 and 20]. In this case, use [1x7 +
7x1] convolutional layer can get a very good result.
Utility of Auxiliary Classifiers
Inception-v1 introduced some auxiliary classifiers (referring to some branches of the previous layer adding
the softmax layer to calculate the loss back propagation) to improve the aggregation problem in deep
networks. The original motive is to pass the gradient back to the previous convolutional layer , so that they
can effectively and improve the aggregation of features and avoid the problem of vanishing gradients.
Traditionally, pooling layers are used in convolutional networks to reduce the size of feature maps . In order
to avoid bottlenecks in the expression of spatial information, the number of convolution kernels in the
network can be expanded before using max pooling or average pooling.
For example, for a dxd network layer with K feature maps, to generate a network layer with 2K [d / 2 xd / 2]
feature maps, we can use 2K convolution kernels with a step size of 1. Convolution and then add a pooling
layer to get it, then this operation requires [2d 2 K 2 ]. But using pooling instead of convolution, the
approximate operation is [2 * (d / 2) 2 xK 2 ], which reduces the operation by four times. However, this will
cause a description bottleneck, because the feature map is reduced to [(d / 2) 2 xK], which will definitely
cause the loss of spatial information on the network. Refer to Figure 9. However, we have adopted a
different method to avoid this bottleneck, refer to Figure 10. That is, two parallel channels are used , one is
a pooling layer (max or average), the step size is 2, and the other is a convolution layer , and then it is
concatenated during output.
Inception-v4-2016
After ResNet appeared, ResNet residual structure was added.
It is based on Inception-v3 and added the skip connection structure in ResNet. Finally, under the structure of
3 residual and 1 inception-v4 , it reached the top-5 error 3.08% in CLS (ImageNet calssification) .
1-Introduction Residual conn works well when training very deep networks. Because the Inception network
architecture can be very deep, it is reasonable to use residual conn instead of concat.
Compared with v3, Inception-v4 has more unified simplified structure and more inception modules.
The big picture of Inception-v4:
Fig9 is an overall picture, and Fig3,4,5,6,7,8 are all local structures. For the specific structure of each
module, see the end of the article.
Residual Inception Blocks

For the residual version in the Inception network, we use an Inception module that consumes less than the
original Inception. The convolution kernel (followed by 1x1) of each Inception module is used to modify the
dimension, which can compensate the reduction of the Inception dimension to some extent.
One is named Inception-ResNet-v1, which is consistent with the calculation cost of Inception-v3. One is
named Inception-ResNet-v2, which is consistent with the calculation cost of Inception-v4.
Figure 15 shows the structure of both. However, Inception-v4 is actually slower in practice, probably
because it has more layers.
Another small technique is that we use the BN layer in the header of the traditional layer in the Inception-
ResNet module, but not in the header of the summations. ** There is reason to believe that the BN layer is
effective. But in order to add more Inception modules, we made a compromise between the two.
Inception-ResNet-v1
Inception-ResNet-v2
Scaling of the Residuals

This paper finds that when the number of convolution kernels exceeds 1,000 , the residual variants will start
to show instability , and the network will die in the early stages of training, which means that the last layer
before the average pooling layer is in the Very few iterations start with just a zero value . This situation
cannot be prevented by reducing the learning rate or by adding a BN layer . Hekaiming's ResNet article also
mentions this phenomenon.
This article finds that scale can stabilize the training process before adding the residual module to the
activation layer . This article sets the scale coefficient between 0.1 and 0.3.
In order to prevent the occurrence of unstable training of deep residual networks, He suggested in the
article that it is divided into two stages of training. The first stage is called warm-up (preheating) , that is,
training the model with a very low learning first. In the second stage, a higher learning rate is used. And this
article finds that if the convolution sum is very high, even a learning rate of 0.00001 cannot solve this
training instability problem, and the high learning rate will also destroy the effect. But this article considers
scale residuals to be more reliable than warm-up.
Even if scal is not strictly necessary, it has no effect on the final accuracy, but it can stabilize the training
process.
Conclusion
Inception-ResNet-v1 : a network architecture combining inception module and resnet module with similar
calculation cost to Inception-v3;
Inception-ResNet-v2 : A more expensive but better performing network architecture.
Inception-v4 : A pure inception module, without residual connections, but with performance similar to
Inception-ResNet-v2.
A big picture of the various module structures of Inception-v4 / Inception-ResNet-v1 / v2:
Fig3-Stem: (Inception-v4 & Inception-ResNet-v2)

Fig4-Inception-A: (Inception-v4)
Fig5-Inception-B: (Inception-v4)
Fig6-Inception-C: (Inception-v4)
Fig7-Reduction-A: (Inception-v4 & Inception-ResNet-v1 & Inception-ResNet-v2)
Fig8-Reduction-B: (Inception-v4)
Fig10-Inception-ResNet-A: (Inception-ResNet-v1)
Fig11-Inception-ResNet-B: (Inception-ResNet-v1)
Fig12-Reduction-B: (Inception-ResNet-v1)
Fig13-Inception-ResNet-C: (Inception-ResNet-v1)
Fig14-Stem: (Inception-ResNet-v1)
Fig16-Inception-ResNet-A: (Inception-ResNet-v2)
Fig17-Inception-ResNet-B: (Inception-ResNet-v2)
Fig18-Reduction-B: (Inception-ResNet-v2)
Fig19-Inception-ResNet-C: (Inception-ResNet-v2)
Summary
Inception v1 network, 1x1, 3x3, 5x5 conv and 3x3 pooling and stacking together, on the one hand,
increase the width of the network, and on the other hand, increase the adaptability of the network to
scale.
The network of v2 has been improved based on v1. On the one hand, the BN layer has been added to
reduce the internal covariate shift (the internal neuron's data distribution has changed), so that the
output of each layer is normalized to an N (0, 1) Gaussian, on the other hand, learning VGG replaces
5x5 in the inception module with two 3x3 convs, which reduces the number of parameters and speeds
up the calculation.
One of the most important improvements in v3 is Factorization, which decomposes 7x7 into two one-
dimensional convolutions (1x7, 7x1), and 3x3 is the same (1x3, 3x1). This benefit can speed up
calculations (redundant calculations Capacity can be used to deepen the network), and one conv can
be split into two convs, which further increases the network depth and increases the nonlinearity of the
network. It is also worth noting that the network input has changed from 224x224 to 299x299, which is
more refined. Designed 35x35 / 17x17 / 8x8 modules.
v4 studied whether the Inception module combined with the Residual Connection can be improved? It
was found that the structure of ResNet can greatly speed up training and improve performance at the
same time. An Inception-ResNet v2 network was obtained. At the same time, a deeper and more
optimized Inception v4 model was designed to achieve performance comparable to Inception-ResNet
v2
Code implementation
From Scratch
In [ ]: !pip install tflearn

ic/simple/
Collecting tflearn
107.3/107.3 kB 6.7 MB/s eta 0:00:00
learn) (1.22.4)
arn) (1.16.0)
flearn) (8.4.0)
f0daf99db3e5e956ce8947fe092b62231f5895e4149dfeaf4134e3d979b1729b
e31def0e4e8245a465
In [ ]: from tensorflow import keras

from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, concatenate, Flatten, Dense, Avera
from keras.optimizers import Adam
In [ ]: # Get Data

100.0% 60276736 / 60270631
File Extracted
Parsing Done!
In [ ]: print(x_train.shape)
(1360, 224, 224, 3)

(1360, 17)
In [ ]: # Inception block
def inception_block(x, filters):
tower_1 = Conv2D(filters[0], (1, 1), padding='same', activation='relu')(x)
tower_1 = Conv2D(filters[1], (3, 3), padding='same', activation='relu')(tower_1)
tower_2 = Conv2D(filters[2], (1, 1), padding='same', activation='relu')(x)

tower_3 = MaxPooling2D((3, 3), strides=(1, 1), padding='same')(x)

output = concatenate([tower_1, tower_2, tower_3], axis=3)

return output
# Build the Inception model

def inception(input_shape, num_classes):
inputs = Input(shape=input_shape)
x = Conv2D(64, (3, 3), padding='same', activation='relu')(inputs)

x = MaxPooling2D((2, 2))(x)
x = inception_block(x, filters=[64, 96, 128, 16, 32])



x = AveragePooling2D((4, 4))(x)
x = Flatten()(x)
outputs = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=inputs, outputs=outputs)

return model
In [ ]: # Create the Inception model

model = inception(input_shape=(224, 224, 3), num_classes=17)
# Compile the model

model.compile(optimizer=Adam(lr=0.001), loss='categorical_crossentropy', metrics=['accur
# Print a summary of the model

model.summary()
# Train
Model: "model"
________________________________________________________________________________________
__________
Layer (type) Output Shape Param # Connected to
========================================================================================
==========
input_1 (InputLayer) [(None, 224, 224, 3 0 []
)]
conv2d (Conv2D) (None, 224, 224, 64 1792 ['input_1[0][0]']
max_pooling2d (MaxPooling2D) (None, 112, 112, 64 0 ['conv2d[0][0]']
conv2d_1 (Conv2D) (None, 112, 112, 64 4160 ['max_pooling2d[0][0]']
conv2d_3 (Conv2D) (None, 112, 112, 12 8320 ['max_pooling2d[0][0]']
8)
max_pooling2d_1 (MaxPooling2D) (None, 112, 112, 64 0 ['max_pooling2d[0][0]']
conv2d_2 (Conv2D) (None, 112, 112, 96 55392 ['conv2d_1[0][0]']
conv2d_5 (Conv2D) (None, 112, 112, 32 2080 ['max_pooling2d_1[0]

[0]']
)
concatenate (Concatenate) (None, 112, 112, 14 0 ['conv2d_2[0][0]',
4) 'conv2d_4[0][0]',
'conv2d_5[0][0]']
conv2d_6 (Conv2D) (None, 112, 112, 12 18560 ['concatenate[0][0]']
8)
conv2d_8 (Conv2D) (None, 112, 112, 19 27840 ['concatenate[0][0]']
2)
max_pooling2d_2 (MaxPooling2D) (None, 112, 112, 14 0 ['concatenate[0][0]']
4)
8)
conv2d_10 (Conv2D) (None, 112, 112, 96 13920 ['max_pooling2d_2[0]

[0]']
)
concatenate_1 (Concatenate) (None, 112, 112, 25 0 ['conv2d_7[0][0]',
6) 'conv2d_9[0][0]',
'conv2d_10[0][0]']
max_pooling2d_3 (MaxPooling2D) (None, 56, 56, 256) 0 ['concatenate_1[0][0]']
conv2d_11 (Conv2D) (None, 56, 56, 192) 49344 ['max_pooling2d_3[0]

[0]']

[0]']
max_pooling2d_4 (MaxPooling2D) (None, 56, 56, 256) 0 ['max_pooling2d_3[0]

[0]']
conv2d_12 (Conv2D) (None, 56, 56, 96) 165984 ['conv2d_11[0][0]']


[0]']
concatenate_2 (Concatenate) (None, 56, 56, 160) 0 ['conv2d_12[0][0]',
'conv2d_14[0][0]',
'conv2d_15[0][0]']
conv2d_16 (Conv2D) (None, 56, 56, 160) 25760 ['concatenate_2[0][0]']

[0]']
'conv2d_19[0][0]',
/usr/local/lib/python3.10/dist-packages/keras/optimizers/legacy/adam.py:117: UserWarnin
g: The `lr` argument is deprecated, use `learning_rate` instead.
super().__init__(name, **kwargs)
'conv2d_20[0][0]']


[0]']
'conv2d_24[0][0]',
'conv2d_25[0][0]']

[0]']
'conv2d_29[0][0]',
'conv2d_30[0][0]']

[0]']

[0]']
max_pooling2d_9 (MaxPooling2D) (None, 28, 28, 240) 0 ['max_pooling2d_8[0]

[0]']

[0]']
'conv2d_34[0][0]',
'conv2d_35[0][0]']
max_pooling2d_10 (MaxPooling2D (None, 28, 28, 320) 0 ['concatenate_6[0][0]']

[0]']
'conv2d_39[0][0]',
'conv2d_40[0][0]']
max_pooling2d_11 (MaxPooling2D (None, 28, 28, 320) 0 ['concatenate_7[0][0]']
)

[0]']
'conv2d_44[0][0]',
'conv2d_45[0][0]']
average_pooling2d (AveragePool (None, 7, 7, 368) 0 ['concatenate_8[0][0]']
ing2D)
flatten (Flatten) (None, 18032) 0 ['average_pooling2d[0]

[0]']
dense (Dense) (None, 17) 306561 ['flatten[0][0]']
========================================================================================
==========
________________________________________________________________________________________
__________
Epoch 1/5
1088/1088 [==============================] - ETA: 0s - loss: 2.8350 - acc: 0.0377
7 - val_loss: 2.8338 - val_acc: 0.0368
Epoch 2/5
3 - val_loss: 2.8343 - val_acc: 0.0368
Epoch 3/5
3 - val_loss: 2.8345 - val_acc: 0.0368
Epoch 4/5
3 - val_loss: 2.8361 - val_acc: 0.0368
Epoch 5/5
3 - val_loss: 2.8362 - val_acc: 0.0368
<keras.callbacks.History at 0x7f9d3badf010>
Out[ ]:
Pretrained
import gdown
print(file_id)
Downloading...
100%|██████████| 9.09M/9.09M [00:00<00:00, 83.8MB/s]
'catdog.zip'
Out[1]:
Archive: catdog.zip
creating: train/
Training
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input



epochs = 5
batch_size = 16
# Load the InceptionV3 model without the top layer

base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(224, 224, 3



# Compile the model


train_data_dir,
# Train the model

model.fit(
train_generator,
epochs=epochs,

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/incep

tion_v3/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5
87910968/87910968 [==============================] - 5s 0us/step
Epoch 1/5
22/125 [====>.........................] - ETA: 10s - loss: 2.0161 - accuracy: 0.8961
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that you
r dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this
case, 625 batches). You may need to use the repeat() function when building your datase
t.
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that you
r dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this
case, 50 batches). You may need to use the repeat() function when building your dataset.
61 - val_loss: 2.6223 - val_accuracy: 0.9322
Prediction
In [4]: import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image
import os
In [5]: model = load_model("/content/dog_cat_classifier.h5")
img_path = "/content/train/Cat/1.jpg"
In [8]: test_image = image.load_img(img_path, target_size = (224,224))
test_image = image.img_to_array(test_image)
test_image = np.expand_dims(test_image, axis = 0)
result = np.argmax(model.predict(test_image), axis=1)
print(result)
1/1 [==============================] - 0s 28ms/step

[0]
In [9]: if result[0] == 1:
prediction = 'dog'
print(prediction)
else:
prediction = 'cat'
print(prediction)
cat
In [ ]:
Resnet
Introduction
ResNet is a network structure proposed by the He Kaiming, Sun Jian and others of Microsoft
Research Asia in 2015, and won the first place in the ILSVRC-2015 classification task. At the same
time, it won the first place in ImageNet detection, ImageNet localization, COCO detection, and COCO
segmentation tasks. It was a sensation at the time.
ResNet, also known as residual neural network, refers to the idea of adding residual learning to the
traditional convolutional neural network, which solves the problem of gradient dispersion and accuracy
degradation (training set) in deep networks, so that the network can get more and more The deeper, both
the accuracy and the speed are controlled.
Deep Residual Learning for Image Recognition Original link : ResNet Paper
The problem caused by increasing depth
The first problem brought by increasing depth is the problem of gradient explosion / dissipation . This is
because as the number of layers increases, the gradient of backpropagation in the network will become
unstable with continuous multiplication, and become particularly large or special. small. Among them ,
the problem of gradient dissipation often occurs .
In order to overcome gradient dissipation, many solutions have been devised, such as using
BatchNorm, replacing the activation function with ReLu, using Xaiver initialization, etc. It can be said
that gradient dissipation has been well solved
Another problem of increasing depth is the problem of network degradation, that is, as the depth
increases, the performance of the network will become worse and worse, which is directly reflected in
the decrease in accuracy on the training set. The residual network article solves this problem. And after
this problem is solved, the depth of the network has increased by several orders of magnitude.
Degradation of deep network
With network depth increasing, accuracy gets saturated (which might be unsurprising) and
then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and
adding more layers to a favored deep model leads to higher training error.
The above figure is the error rate of the training set classified by the network on the CIFAR10-data set
with the increase of the network depth . It can be seen that if we directly stack the convolutional layers,
as the number of layers increases, the error rate increases significantly. The trend is that the deepest
56-layer network has the worst accuracy . We verified it on the VGG network. For the CIFAR-10
dataset, it took 5 minutes on the 18-layer VGG network to get the full network training. The 80%
accuracy rate was achieved, and the 34-layer VGG model took 8 minutes to get the 72% accuracy rate.
The problem of network degradation does exist.
The decrease in the training set error rate indicates that the problem of degradation is not caused by
overfitting. The specific reason is that it is left for further study. The author's other paper "Identity
Mappings in Deep Residual Networks" proved the occurrence of degradation. It is because the
optimization performance is not good, which indicates that the deeper the network, the more difficult the
reverse gradient is to conduct.
Deep Residual Networks

From 10 to 100 layers
We can imagine that when we simply stack the network directly to a particularly long length, the internal
characteristics of the network have reached the best situation in one of the layers. At this time, the
remaining layers should not make any changes to the characteristics and learn automatically. The form of
identity mapping. That is to say, for a particularly deep deep network, the solution space of the shallow form
of the network should be a subset of the solution space of the deep network, in other words, a network
deeper than the shallow network will not have at least Worse effect, but this is not true because of network
degradation.
Then, we settle for the second best. In the case of network degradation, if we do not add depth, we can
improve the accuracy. Can we at least make the deep network achieve the same performance as the
shallow network, that is, let the layers behind the deep network achieve at least The role of identity mapping
. Based on this idea, the author proposes a residual module to help the network achieve identity mapping.
To understand ResNet, we must first understand what kind of problems will occur
when the network becomes deeper.
The first problem brought by increasing the network depth is the disappearance and explosion of
the gradient.
This problem was successfully solved after Szegedy proposed the BN (Batch Normalization) structure.
The BN layer can normalize the output of each layer. The size can still be kept stable after the reverse layer
transfer, and it will not be too small or too large.
Is it easy to converge after adding BN and then increasing the depth?
The answer is still negative. The author mentioned the second problem-the degradation problem: when
the level reaches a certain level, the accuracy will saturate and then decline rapidly. This decline is not
caused by the disappearance of the gradient. It is not caused by overfit, but because the network is so
complicated that it is difficult to achieve the ideal error rate by unconstrained stocking training alone.
The degradation problem is not a problem of the network structure itself, but is caused by the current
insufficient training methods. The currently widely used training methods, whether it is SGD, AdaGrad, or
RMSProp, cannot reach the theoretically optimal convergence result after the network depth becomes
larger.
We can also prove that as long as there is an ideal training method, deeper networks will definitely perform
better than shallow networks.
The proof process is also very simple: Suppose that several layers are added behind a
network A to form a new network B. If the added level is just an identity
mapping of the output of A, that is, the output of A is after the level of B
becomes the output of B, there is no change, so the error rates of network A
and network B are equal, which proves that the deepened network will not be
worse than the network before deepening.
He Kaiming proposed a residual structure to implement the above identity mapping (Below Figure): In
addition to the normal convolution layer output, the entire module has a branch directly connecting the input
to the output. The output and the output of the convolution do The final output is obtained by arithmetic
addition. The formula is H (x) = F (x) + x, x is the input, F (x) is the output of the convolution branch, and H
(x) is the output of the entire structure. It can be shown that if all parameters in the F (x) branch are 0, H (x)
is an identity mapping. The residual structure artificially creates an identity map, which can make the entire
structure converge in the direction of the identity map, ensuring that the final error rate will not become
worse because the depth becomes larger. If a network can achieve the desired result by simply setting the
parameter values by hand, then this structure can easily converge to the result through training. This is a
rule that is unsuccessful when designing complex networks. Recall that in order to restore the original
distribution after BN processing, the formula y = rx + delta is used. When r is manually set to standard
deviation and delta is the mean, y is the distribution before BN processing. This is the use of this Rules.
What does residual learning mean?
The idea of residual learning is the above picture, which can be understood as a block, defined as follows:
The residual learning block contains two branches or two mappings:
1. Identity mapping refers to the curved curve on the right side of the figure above. As its name implies,
identity mapping refers to its own mapping, which is x itself;
1. F(x) Residual mapping refers to another branch, that is, part. This part is called residual mapping (y -x)
.
What role does the residual module play in back propagation?
The residual module will significantly reduce the parameter value in the module, so that the parameters
in the network have a more sensitive response ability to the loss of reverse conduction, although the
fundamental It does not solve the problem that the loss of backhaul is too small, but it reduces the
parameters. Relatively speaking, it increases the effect of backhaul loss and also generates a certain
regularization effect.
Secondly, because there are branches of the identity mapping in the forward process, the gradient
conduction in the back-propagation process also has more simple paths , and the gradient can be
transmitted to the previous module after only one relu.
The so-called backpropagation is that the network outputs a value, and then compares it with the real
value to an error loss. At the same time, the loss is changed to change the parameter. The returned
loss depends on the original loss and gradient. Since the purpose is to change the parameter, The
problem is that if the intensity of changing the parameter is too small, the value of the parameter can be
reduced, so that the loss of the intensity of changing the parameter is relatively greater.
Therefore, the most important role of the residual module is to change the way of forward and
backward information transmission, thereby greatly promoting the optimization of the network.
Using the four criteria proposed by Inceptionv3, we will use them again to improve the residual module.
Using criterion 3, the dimensionality reduction before spatial aggregation will not cause information
loss, so the same method is also used here, adding 1 * 1 convolution The kernel is used to increase the
non-linearity and reduce the depth of the output to reduce the computational cost. You get the form of a
residual module that becomes a bottleneck. The figure above shows the basic form on the left and the
bottleneck form on the right.
To sum up, the shortcut module will help the features in the network perform identity mapping in the
forward process, and help conduct gradients in the reverse process, so that deeper models can be
successfully trained.
Why can the residual learning solve the problem of "the accuracy of the network
deepening declines"?
For a neural network model, if the model is optimal, then training can easily optimize the residual mapping
to 0, and only identity mapping is left at this time. No matter how you increase the depth, the network will
always be in an optimal state in theory. Because it is equivalent to all the subsequent added networks to
carry information transmission along the identity mapping (self), it can be understood that the number of
layers behind the optimal network is discarded (without the ability to extract features), and it does not
actually play a role. . In this way, the performance of the network will not decrease with increasing depth.
The author used two types of data, ImageNet and CIFAR, to prove the effectiveness of ResNet:
The first is ImageNet. The authors compared the training effect of ResNet structure and traditional structure
with the same number of layers. The left side of Figure is a VGG-19 network with a traditional structure
(each followed by BN), the middle is a 34-layer network with a traditional structure (each followed by BN),
and the right side is 34 layers ResNet (the solid line indicates a direct connection, and the dashed line
indicates a dimensional change using 1x1 convolution to match the number of features of the input and
output). Figure 3 shows the results after training these types of networks.
The data on the left shows that the 34-layer network (red line) with the traditional structure has a higher
error rate than the VGG-19 (blue-green line). Because the BN structure is added to each layer Therefore,
the high error is not caused by the gradient disappearing after the level is increased, but by the degradation
problem; the ResNet structure on the right side of Figure 3 shows that the 34-layer network (red line) has a
higher error rate than the 18-layer network (blue-green line). Low, this is because the ResNet structure has
overcome the degradation problem. In addition, the final error rate of the ResNet 18-layer network on the
right is similar to the error rate of the traditional 18-layer network on the left. This is because the 18-layer
network is simpler and can converge to a more ideal result even without the ResNet structure.
The ResNet structure like the left side of Fig. 4 is only used for shallow ResNet networks. If there are many
network layers, the dimensions near the output end of the network will be very large. Still using the structure
on the left side of Fig. 4 will cause a huge amount of calculation. For deeper networks, we all use the
bottleneck structure on the right side of Figure 4, first using a 1x1 convolution for dimensionality reduction,
then 3x3 convolution, and finally using 1x1 dimensionality to restore the original dimension.
In practice, considering the cost of the calculation, the residual block is calculated and optimized, that is, the
two 3x3 convolution layers are replaced with 1x1 + 3x3 + 1x1 , as shown below. The middle 3x3
convolutional layer in the new structure first reduces the calculation under one dimensionality-reduced 1x1
convolutional layer , and then restores it under another 1x1 convolutional layer , both maintaining accuracy
and reducing the amount of calculation .
This is equivalent to reducing the amount of parameters for the same number of layers , so it can be
extended to deeper models. So the author proposed ResNet with 50, 101 , and 152 layers , and not only did
not have degradation problems, the error rate was greatly reduced, and the computational complexity was
also kept at a very low level .
At this time, the error rate of ResNet has already dropped other networks a few streets, but it does not seem
to be satisfied. Therefore, a more abnormal 1202 layer network has been built. For such a deep network,
optimization is still not difficult, but it appears The problem of overfitting is quite normal. The author also said
that the 1202 layer model will be further improved in the future.
Diffrent Variants : -
Below is the transcript of resnet, winning the championship at ImageNet2015

Code Implementation
From Scratch

ic/simple/
Collecting tflearn
107.3/107.3 kB 9.7 MB/s eta 0:00:00
learn) (1.22.4)
arn) (1.16.0)
flearn) (8.4.0)
8a88a556a32a46e730ec7bc096cd085cfa75e78702e726ebedc481ef0341cda6
e31def0e4e8245a465

from keras.models import Model
from keras.layers import Conv2D, BatchNormalization, Activation, Add, AveragePooling2D,
from keras.optimizers import Adam
In [3]: # Get Data


100.0% 60276736 / 60270631
File Extracted
Parsing Done!
(1360, 224, 224, 3)

(1360, 17)
In [5]: # Residual block

def residual_block(x, filters, downsample=False):
strides = (2, 2) if downsample else (1, 1)
# First convolutional layer of the block

y = Conv2D(filters, kernel_size=(3, 3), strides=strides, padding='same')(x)
y = BatchNormalization()(y)
y = Activation('relu')(y)
# Second convolutional layer of the block

y = Conv2D(filters, kernel_size=(3, 3), strides=(1, 1), padding='same')(y)
y = BatchNormalization()(y)
# Skip connection if downsample or number of filters change

if downsample:
x = Conv2D(filters, kernel_size=(1, 1), strides=(2, 2), padding='same')(x)
# Add skip connection

y = Add()([x, y])
y = Activation('relu')(y)
return y
# Build the ResNet model

def resnet(input_shape, num_classes):
inputs = keras.Input(shape=input_shape)
# Initial convolutional layer

x = Conv2D(16, kernel_size=(3, 3), strides=(1, 1), padding='same')(inputs)
x = BatchNormalization()(x)
x = Activation('relu')(x)
# Residual blocks
x = residual_block(x, filters=16)
x = residual_block(x, filters=32, downsample=True)
x = residual_block(x, filters=64, downsample=True)
# Average pooling and output layer

x = AveragePooling2D(pool_size=(8, 8))(x)
x = Flatten()(x)
outputs = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=inputs, outputs=outputs)

return model
In [6]: # Create the ResNet model
model = resnet(input_shape=(224, 224, 3), num_classes=17)
# Compile the model

model.compile(optimizer=Adam(lr=0.001), loss='categorical_crossentropy', metrics=['accur
# Train
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/keras/layers/normalizati
on/batch_normalization.py:581: _colocate_with (from tensorflow.python.framework.ops) is
deprecated and will be removed in a future version.
Colocations handled automatically by placer.
/usr/local/lib/python3.10/dist-packages/keras/optimizers/legacy/adam.py:117: UserWarnin
g: The `lr` argument is deprecated, use `learning_rate` instead.
super().__init__(name, **kwargs)
Epoch 1/5
1088/1088 [==============================] - ETA: 0s - loss: 2.8100 - acc: 0.1967
7 - val_loss: 2.8402 - val_acc: 0.0699
Epoch 2/5
- val_loss: 2.8888 - val_acc: 0.0735
Epoch 3/5
- val_loss: 2.9678 - val_acc: 0.0735
Epoch 4/5
- val_loss: 3.1419 - val_acc: 0.0735
Epoch 5/5
- val_loss: 3.3201 - val_acc: 0.0404
<keras.callbacks.History at 0x7f2260d58df0>
Out[6]:
Pretrained
import gdown
print(file_id)
Downloading...
100%|██████████| 9.09M/9.09M [00:00<00:00, 113MB/s]
'catdog.zip'
Out[7]:
Archive: catdog.zip
creating: train/
from keras.datasets import cifar10

from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input



epochs = 5
batch_size = 16
# Load the VGG16 model without the top layer

base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))




# Compile the model


train_data_dir,
# Train the model

model.fit(
train_generator,
epochs=epochs,

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resne

t/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
94765736/94765736 [==============================] - 6s 0us/step
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

CNN Architectures

Uploaded by

Copyright:

Available Formats

CNN Architectures

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CNN Architectures

Uploaded by

Copyright:

Available Formats

Basic Introduction

Paper: Gradient-Based Learning Applied to Document Recognition

Published in: Proceedings of the IEEE (1998)

Structure of the LeNet network

Convolution kernel size: 5 * 5

Convolution kernel types: 6

Output featuremap size: 28 * 28 (32-5 + 1) = 28

Trainable parameters: (5 5 + 1) 6 (5 * 5 = 25 unit parameters and one bias parameter per

Number of connections: (5 5 + 1) 6 28 28 = 122304

S2 layer-pooling layer (downsampling layer)

Output featureMap size: 14 * 14 (28/2)

Trainable parameters: 2 * 6 (the weight of the sum + the offset)

Convolution kernel size: 5 * 5

Convolution kernel type: 16

Output featureMap size: 10 * 10 (14-5 + 1) = 10

The trainable parameters are: 6 (3 5 5 + 1) + 6 (4 5 5 + 1) + 3 (4 5 5 + 1) + 1 (6 5 5 +1) =

Number of connections: 10 10 1516 = 151600

S4 layer-pooling layer (downsampling layer)

Output featureMap size: 5 * 5 (10/2)

Number of neurons: 5 5 16 = 400

Trainable parameters: 2 * 16 = 32 (the weight of the sum + the offset)

Number of connections: 16 (2 2 + 1) 5 5 = 2000

The connection is similar to the S2 layer.

Convolution kernel size: 5 * 5

Convolution kernel type: 120

Output featureMap size: 1 * 1 (5-5 + 1)

Trainable parameters / connection: 120 (16 5 * 5 + 1) = 48120

F6 layer-fully connected layer

Trainable parameters: 84 * (120 + 1) = 10164

The connection method of the F6 layer is as follows:

Output layer-fully connected layer

# Load the CIFAR-10 dataset

# Normalize pixel values between 0 and 1

# Convert labels to one-hot encoding

# Building the Model Architecture

model.add(Conv2D(16, kernel_size = (5,5), padding = 'valid', activation='tanh'))

print('Test Loss:', score[0])

average_pooling2d_4 (Averag (None, 14, 14, 6) 0

conv2d_6 (Conv2D) (None, 10, 10, 16) 2416

average_pooling2d_5 (Averag (None, 5, 5, 16) 0

flatten_2 (Flatten) (None, 400) 0

dense_6 (Dense) (None, 120) 48120

dense_7 (Dense) (None, 84) 10164

dense_8 (Dense) (None, 10) 850

The following table below explains the network structure of AlexNet:

Max Pooling 3*3 2

Max Pooling 3*3 2

Max Pooling 3*3 2

FC6 + Relu 256 * 6 * 6 * 4096=37748736 256 * 6 * 6 * 4096=37748736

FC7 + Relu 4096 * 4096=16777216 4096 * 4096=16777216

FC8 + Relu 4096 * 1000=4096000 4096 * 1000=4096000

Overall 62369152=62.3 million 1135906176=1.1 billion

Why does AlexNet achieve better results?

Relu function: f (x) = max (0, x)

1. Standardization ( Local Response Normalization )

Unsupervised Pre-training- use Auto-Encoder or RBM's convolution form to do

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/publ

In [2]: import tensorflow as tf

In [8]: # Get Data