Qian PDF

2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE 2021)
MobileNetV3 for Image Classification

Siying Qian1st, * Yuepeng Hu1st Chenran Ning1st
1st
College of Engineering and School of Data Science School of Computer Science and These authors contributed
Computer Science University of Science and Engineering equally.
The Australian National Technology of China Fudan University
University Hefei, China Shanghai, China
Canberra, Australia [email protected] [email protected]
*
[email protected]
Abstract—Convolutional neural network (CNN) is a kind of naturally to apply deep learning to daily tasks on mobile devices.
deep neural networks, which extracts image features through Successfully applying CNN to mobile terminals is thus very
multiple convolution layers and is widely used in image important. Due to the limited computational power on mobile
classifications. With the increasing number of image data devices, model efficiency and small memory are highly
processed by mobile devices, application of neural network for required.
mobile terminals becomes popular. However, these networks need
massive computation and advanced hardware support, making There are numerous CNN architectures designed for
them difficult to adapt to mobile devices. This paper demonstrates different tasks. Among them, some are often used in image
that MobileNetV3 can get a superior balance between efficiency classification problems, such as Inception [2], AlexNet [3] and
and accuracy for real-life image classification tasks on mobile VGG16 [4]. However, classification tasks in mobile devices
terminals. In our experiments, classification performances are require not only high accuracy but also, more practically, low
compared among MobileNetV3 and several other commonly used memory costs and high computation efficiency, which give rise
pre-trained CNN models on various image datasets. The chosen to models suitable for mobile terminals. We choose
datasets are all good representatives of the application scenarios of MobileNetV3 [5] as our main study object, which are
mobile devices. The result shows that as a lightweight neural lightweight CNN models with a good balance between accuracy
network, MobileNetV3 achieved good accuracy performance in an and efficiency.
effective manner compared to other large networks. Furthermore,
ROC confirmed the advantages of MobileNetV3 over other In this paper, we compare the performance of MobileNetV3,
experimented models. Some conjectures are also brought out AlexNet, InceptionV3 [6] and ShuffleNetV2 [7] on different
about the characteristics of image datasets that are suitable for datasets, and validate the efficiency and adaptability of
MobileNetV3. MobileNetV3. The main contributions of this paper can be
summarized as follows:
Keywords-Convolutional neural network; Image classification;
Mobile devices; MobileNetV3 1. Fruits 360 [8], 10 Monkey Species and Bird Species
Classification, as the representations of common image
I. INTRODUCTION datasets on mobile devices, are used to train and evaluate
Convolutional Neural Network (CNN) is recently given the chosen neural networks’ performance.
great attention because of its extended applications in image 2. MobileNetV3, AlexNet, InceptionV3 and ShuffleNetV2
classification [1], segmentation, and other computer vision are applied to multiple image classification tasks related
problems. CNN usually consists of 2 parts: features extracting to mobile scenarios.
part – made of convolutional layers and pooling layers, and
classifying part – which contains lots of stacked fully connected 3. Multiple evaluation metrics are adopted to analyse the
layers. In the first part, kernels in convolutional layers scan the performance of different models on various datasets.
input image step by step, multiplying the weights in each kernel 4. Based on the experiment results, the characteristics of
by the pixels’ values and combining the sum to create a new MobileNetV3 on image classification and its advantages
image passed to the next layer. Pooling layers play a role in on mobile devices are summarized.
down-sampling to reduce the number of data and save
computational resources. In the second part, the image first The rest of this paper is organized as follows. In Section II,
passes through a flatten layer to be converted to a one- we describe the main structure of MobileNetV3 and the
dimensional array. The following fully connected layers use this characteristics of other models in the experiment. In Section III,
array as input and produce the predicted label by applying the datasets are introduced, and the performance of MobileNetV3
linear combination and the non-linear activation function. and other models is evaluated. The experiment results and
analysis are shown in Section III as well. There is a
Because of its advantage of extracting deep features layers summarization of the characteristics of MobileNetV3 and a
by layers, nowadays, CNNs are widely used to solve real-world discussion of our future work in Section IV.
problems, such as auto-driving and medical image diagnosis.
With a large number of tasks in daily scenarios, it comes very
978-0-7381-3122-1/21/$31.00 ©2021 IEEE

490
Authorized licensed use limited to: UNIVERSITY OF WESTERN ONTARIO. Downloaded on May 26,2021 at 20:21:46 UTC from IEEE Xplore. Restrictions apply.
II. MODEL FORMULATION
CNN networks for mobile terminals are developing rapidly
in recent years. The three versions of mobile nets have kept
being improved in architecture from 2017 to 2019.
MobileNetV1 [9] is developed referencing from the traditional
VGG architecture while introducing depthwise separable
convolutions. Based on that, MobileNetV2 [10, 11] is introduced
one year later with linear bottleneck and inverted residual. With
the help of NAS and NetAdapt network searching for
architecture optimization, MobileNetV3 is further developed by
dropping expensive layers and using h-swish non-linearity
function instead of ReLU to increase its efficiency and relative
accuracy at the same time in mid of 2019.
According to high or low resource use cases' targets,
MobileNetV3 are defined as MobileNetV3-Small and
MobileNetV3-Large two models with different architecture
complexities.
TABLE I MOBILENETV3-LARGE SPECIFICATION. #EXPAND DENOTES

THE NUMBER OF CONVOLUTIONAL FILTERS THAT ARE USED TO EXPAND THE
FEATURE SPACE FROM THE LAYER’S INPUT. SE INDICATES WHETHER THERE IS
A SQUEEZE-AND-EXCITE IN THAT BLOCK OR NOT. NL DENOTES THE NON-
LINEARITY TYPE FOR THAT BLOCK. BN MEANS BATCH NORMALIZATION. Figure 1 MobileNetV3 architecture. The general architectures are the same
for both MobileNetV3-Large and MobileNetV3-Small.
Input Operator Kernel #exp #out str S NL
type size and ide E
The general structure of MobileNetV3-Small is almost the
224! × 3 conv2d - - 16 2 - h-swish same as MobileNetV3-Large; besides that, MobileNetV3-Small
112! × 16 bottleneck 3×3 16 16 1 - ReLU
112! × 16 bottleneck 3×3 64 24 2 - ReLU
is four layers shortened, as shown in Table II.
56! × 24 bottleneck 3×3 72 24 1 - ReLU
56! × 24 bottleneck 5×5 72 40 2 ✓ ReLU TABLE II MOBILENETV3-SMALL SPECIFICATION. ALL NOTATIONS ARE
28! × 40 5×5 THE SAME AS TABLE 1.
bottleneck 120 40 1 ✓ ReLU
28! × 40 bottleneck 5×5 120 40 1 ✓ ReLU Input Operator Kernel #exp #out str S NL
28! × 40 bottleneck 3×3 240 80 2 - h-swish type size and ide E
14! × 80 bottleneck 3×3 200 80 1 - h-swish 224! × 3 conv2d 3×3 - 16 2 - h-swish
14! × 80 bottleneck 3×3 184 80 1 - h-swish 112! × 16 3×3
bottleneck 16 16 2 ✓ ReLU
14! × 80 bottleneck 3×3 184 80 1 - h-swish
56! × 16 bottleneck 3×3 72 24 2 - ReLU
14! × 80 bottleneck 3×3 480 112 1 ✓ h-swish
28! × 24 bottleneck 3×3 88 24 1 - ReLU
7! × 160 conv2d 1×1 - 960 1 - h-swish
7! × 960 pool 7×7 - - 1 - -
1! × 960 conv2d 1×1 - 1280 1 - h-swish 14! × 48 bottleneck 5×5 288 96 2 ✓ h-swish
(no BN) 7! × 96 bottleneck 5×5 576 96 1 ✓ h-swish
1! × 1280 conv2d 1×1 - k 1 - - 7! × 96 bottleneck 5×5 576 96 1 ✓ h-swish
(no BN) 7! × 96 conv2d 1×1 - 576 1 ✓ h-swish
7! × 576 pool 7×7 - - 1 - -
1! × 576 conv2d 1×1 - 1280 1 - h-swish
The full specification of MobileNetV3-Large is shown in (no BN)
Table I. 1! × 1280 conv2d 1×1 - k 1 - -
(no BN)
According to the detailed layer specification, the model’s
overall architecture can be specified as Figure 1. A. Depthwise Separable Convolution
To compute more efficiently, depthwise separable
convolution is introduced. It is very similar to the traditional
convolution. While unlike the traditional convolution with only
one convolutional calculation for each layer, the convolutional
calculation of the depthwise separable convolution is divided
into two phases. In the first phase, depthwise convolution applies
a single convolutional filter for each input channel. In the second
phase, pointwise convolution (a 1 × 1 convolution) is applied to
all the channels of outputs of the depthwise convolution.
Altogether, the depthwise separable convolution improves the
491
calculating speed by reducing the computation amount, though Additionally, the reward design in reinforcement learning is
it will sacrifice a little accuracy. Depthwise separable modified to fit small mobile models better. After the types of
convolution is a core technique for many efficient models, such layers are fixed, NetAdapt [14], a method for fine-tuning the
as MobileNetV1 - V3. hyper-parameters in each layer, is used to optimize the model.
B. Linear Bottleneck E. Swish Function
To extract features from high-dimensional space without For higher accuracy, a new kind of activation function called
losing too much information, MobileNetV2 proposes linear swish is introduced to replace the ReLU function. This function
bottlenecks to reduce the dimensionality of input. Linear is defined as:
bottleneck refers to a bottleneck layer, which is a convolutional
layer with a 1 × 1 filter, combined with the linear activation swish(x) = x i σ (x) (1)
function. Because the traditional ReLU function transformation
provides non-linearity with a possibility of information loss, However, the sigmoid function in the swish formula may
MobileNetV2 turns to insert linear bottleneck layers into cost a large number of computational resources on mobile
convolutional blocks instead (assuming that the flow of features devices. To solve this problem, the authors of MobileNetV3 use
is low-dimensional and capturable). ReLU6 function to approximate the sigmoid function in swish
C. Inverted Residual and produce an approximation of swish function called the hard
As a safer and more efficient way to extract all necessary version of swish (h-swish), defined as:
information of the input data, bottleneck layers replace the ReLU
layers. There is also an expansion layer at the beginning of the Re LU 6(x + 3)
h − swish[x] = x (2)
bottleneck block. Additionally, MobileNetV2 uses shortcuts 6
directly between bottlenecks to better propagate gradients across
multiple layers and prevent gradients loss and explosion. An F. Additional Models in Experiments
inverted residual block is tested valid to act almost the same as
AlexNet is a very initial deep CNN starting to use ReLU non-
a residual block, while it reduces memory costs considerably at
linearity instead of the tanh function and can be put on multiple
the same time.
GPUs. It is excellent to apply for high-resolution images.
D. Network Architecture Search However, the major disadvantage of the model is a severe
With reinforcement learning and recurrent neural network overfitting problem due to the massive amounts of parameters.
(RNN), network architecture search is applied to MobileNetV3 InceptionV3 first applied Batch Normalization layers to the
to determine the optimal architecture for a constrained hardware propagation of data to accelerate the process of gradient
platform. It is a method that constructs a search space for the descending. The inception modules in InceptionV3 are
architecture of neural networks and searches efficiently in a optimized and have more branches compared to InceptionV2
hierarchical search space with reinforcement learning to [16]. Furthermore, the key insight that factorization into small
approach the best structure of the model for specific tasks. For convolution, which means separating a 2-dimensional
instance, the expansion layer of MobileNetV3 is redesigned as convolution kernel into two 1-dimensional ones, is implemented
Figure 2, based on the original design for MobileNetV2. in InceptionV3.
ShuffleNet [15] uses group convolution and channel shuffle
to reduce computational complexity. Features are extracted by
depthwise separable convolutional layers, which contribute to
lower computational requirement and higher speed of data
propagation without significant decrease of accuracy.
III. EXPERIMENTS
A. Datasets
In this paper, we used three different datasets downloaded
from Kaggle. These datasets are to simulate daily scenarios
where MobileNetV3 can be applied to mobile devices.
As shown in Figure 3, in row one, the first dataset is a fruit-
360 dataset, including 131 fruit categories and 67692 training
images. Each fruit is shot from different angles multiple times
and in a clear background. Row two in Figure 3 demonstrates
the second dataset, 1097 images of 10 monkeys. The third
dataset contains 16 categories of birds and 150 images overall,
Figure 2 Comparison of original last stage and redesigned last stage. shown in row three. This dataset is small and imbalanced, and
birds are distributed randomly in various backgrounds.
Because of the similarity in the RNN controller and search
space, MobileNetV3 uses MnasNet-A1 [13] as the initial model.
492
TP + TN
Accuracy = (3)
TP + TN + FP + FN
where TP, TN, FP, FN are the number of true-positive, true-

negative, false-positive and false-negative samples,
respectively.
2) Cross-Entropy Loss:
In the stage of training a neural network, we usually define a
target function and convert the training process to an
Figure 3 Samples from three experimented datasets. The first row is fruit- optimization problem. To measure the turbulence of the neural
360 dataset; the second row is monkey dataset; the third row is bird dataset. network’s predictions objectively and provide a reasonable
direction in the searching space for us to optimize, Cross-
As for data pre-processing, we performed the same Entropy is adopted as the loss function in the training stage of
operations for the three datasets: firstly, the training dataset was models in the rest of the paper.
randomly split into a training subset and a validation subset, with
the split ratio of 0.7; secondly, all images were resized to Cross-Entropy Loss is defined as:
224 × 224 and normalized as the models’ input except for
InceptionV3. Input images were resized to 299 × 299 for 1 N M
InceptionV3. CrossEntropyLoss = ∑ ∑ p(xij )log(q(xij ))
N i=1 j=1
(4)
B. Experiment Steps
In this paper, we compared the performance of 5 models where N is the number of samples in the training set; M is the
(MobileNetV3-Large, MobileNetV3-Small, AlexNet, number of classes in the training set; p(xij ) is the true
InceptionV3 and ShuffleNetV2) on 3 different datasets.
probability of the ith sample belonging to the jth class; q(xij ) is
The comparisons of FLOPs and parameters’ numbers among the probability of the ith sample belonging to the jth class
these 5 models are shown in Table III. produced by the model.
TABLE III COMPARISON OF FLOPS AND THE NUMBER OF PARAMETERS 3) ROC Curves:
AMONG DIFFERENT MODELS.
To visually show the trade-off between the true positive rate
Model MobileNe MobileNe AlexNet Incepti Shuffle (TPR) and the false positive rate (FPR), Receiver Operating
tV3-Large tV3-Small onV3 NetV2 Characteristics (ROC) curves are drawn, where TPR = TP / P is
FLOPs 226.0 59.65 714.7 5731 148.8 the proportion of positive samples that are correctly labelled by
(Millions)
Parameters 5.48 2.54 61.1 23.8 2.28 the model and FPR = FP / N is the proportion of negative
(Millions) samples that are mislabelled as positive.
The area under a ROC curve is often used to measure the
The datasets include daily-life image classification problems performance of a model. The diagonal line on the graph
related to fruit, bird and monkey. represents a model that randomly labels the samples. A model
better than random should appear above the diagonal. The
We replaced the last classify layers for each model according
to the number of unique classification classes for different further a model is to random (i.e., when the area under a ROC
datasets. Pre-trained weights [17, 18] on ImageNet [19] were curve is much larger than 0.5), the model is better.
used. The models were then fine-tuned by setting all the layers D. Experimental Results
trainable. Using pre-trained weights made the training procedure
1) Experiments on Fruit Dataset:
easier and led to faster convergence. Fine-tuning the models was
Fruit dataset contains 90483 images from 131 fruit
to adapt models to specific given tasks. The experiment
categories. All training and testing photos of one category of
procedure on the three datasets followed the same steps.
fruit are captured from different angles of a single fruit
C. Evaluation Metrics representative from that category with a white background. It
1) Accuracy: means there is no apparent difference between training and
To evaluate the performance of models on a dataset, the testing data of one specific category. The training size is large
enough, and the fruit in the images are nearly without any
number of samples predicted to be positive, which are actually
positive, and the number of samples predicted to be negative, background interference. There is a large number of testing data
which are actually negative, are both calculated. It reflects the as well. The testing results can thus be considered convincing.
precision of models distinguishing true positive samples and true The fine-tuning training time is exceptionally long for fruit
negative samples from the whole dataset. datasets, as shown in Table IV, compared to the experiments on
the other two datasets. It may be due to the large dataset and the
Accuracy is defined as:
complex fruit varieties.
493
TABLE IV APPROXIMATE FINE-TUNING TRAINING TIME ON FRUIT
DATASET FOR DIFFERENT EXPERIMENTED MODELS. THE ABSOLUTE TIME
HIGHLY DEPENDS ON THE TRAINING ENVIRONMENT, WHILE THE TIME
COMPARISON AMONG MODELS IS RELATIVELY MEANINGFUL.
Model Approximate Training Time (in minutes)

MobileNetV3-Large 40
MobileNetV3-Small 33
AlexNet 417
InceptionV3 225
ShuffleNetV2 42
Though the training time is relatively longer, the actual Figure 5 Comparison of ROC curves for different models on fruit dataset.
necessary number of training epochs for fruit dataset is smaller
Considering fine-tuning training time, testing time, and the
to achieve good performance for fruit dataset. In Figure 4, all of
MobileNetV3-Large, MobileNetV3-Small, AlexNet and final performance for all models on the dataset, ShuffleNetV2
corresponds to high efficiency while low accuracy. AlexNet and
InceptionV3 can reach nearly 1 as accuracy from the first epoch.
InceptionV3 can achieve high accuracy with much higher cost
Only the ShuffleNetV2 model shows much lower validation
on both time and resources. Only the two MobileNetV3 models
accuracy in the whole training process. By comparing training
achieved an outstanding balance between accuracy and
and validation evaluation, the validation accuracy is even
generally higher than training accuracy for all models. No efficiency as expected. MobileNetV3 not only cost the shortest
time but also brought the best classification performance.
overfitting problem thus exists in these experiments.
2) Experiments on Monkey Dataset:
Monkey dataset consists of almost 1400 pictures of 10
species of monkeys. Since the photos produced by mobile
devices are in relatively low resolution and animals are the
objects that mobile devices may commonly capture, we picked
up this dataset to validate the performance of MobileNetV3 in
the classification of animals’ graphics in low resolution. Each
image in the dataset contains a monkey from one of those ten
species. The background of the image is the place the monkey
(a) Training Accuracy on Fruit Dataset (b) Training Loss on Fruit Dataset lives in. The monkeys are usually in the centre of the pictures,
and the backgrounds are blurry.
In each class of the dataset, the number of images is almost
the same. It can help to prevent over-fitting problem caused by
the imbalance of the training dataset. Additionally, there are
nearly 30 pictures in each class to validate the trained models,
showing each model’s performance objectively. Labels and the
number of images from each category are shown in Table V.
(c) Validation Accuracy on Fruit Dataset (d) Validation Loss on Fruit Dataset
TABLE V NUMBER OF IMAGES IN EACH CLASS OF MONKEY DATASET.
Figure 4 Comparison of training and validation accuracy and loss
throughout the fine-tuning training process on the fruit dataset experiment. Label Training images Testing images
Mantled howler 105 26
From both the test accuracies and ROC curves comparisons Patas monkey 111 28
among all experimented models on the fruit dataset, the same Bald uakari 110 27
result is concluded: On the one hand, MobileNetV3-Large Japanese macaque 122 30
achieves the best performance, and MobileNetV3-Small stays Pygmy marmoset 105 26
White headed capuchin 113 28
closely at the second place. InceptionV3 has a similar while Silvery marmoset 106 26
slightly lower accuracy. Although AlexNet ranks fourth for this Common squirrel monkey 114 28
dataset, its accuracy is still high. On the other hand, the Black-headed night monkey 106 27
performance of ShuffleNetV2 is not satisfied, with the 0.79 test Nilgiri langur 106 26
accuracy and an obvious classification ability gap compared to
the other models. It is also clearly shown in Figure 5 that all
models’ performances are much better than a random guess.
494
In the experiments, the training images were split into two resolution images and be applied to image classification.
parts: a training set consisting of 770 images of monkeys and a According to the huge number of the FLOPs and parameters (as
validation set containing 328 images. Before the training stage, shown in Table 3), InceptionV3 should have the best
all the pictures were rescaled to 224 × 224 pixels to fit the input generalization ability. However, it didn’t perform as expected.
size of most convolutional neural network (except 299 × 299 We suppose the reason lies in the training size of the monkey
pixels for InceptionV3). During training, we loaded the weights dataset is too small for such a large model to reach its best
from ImageNet transfer learning and used them as the initial performance, which also shows that MobileNetV3 can be trained
weights. with small datasets and achieve high accuracy. According to the
huge number of the FLOPs and parameters (as shown in Table
3), InceptionV3 should have the best generalization ability.
However, it didn’t perform as expected. We suppose the reason
lies in the training size of the monkey dataset is too small for
such a large model to reach its best performance, which also
shows that MobileNetV3 can be trained with small datasets and
achieve high accuracy.
To test each model’s efficiency, we recorded the time of all
(a) Training Accuracy on Monkey Dataset (b) Training Loss on Monkey Dataset five models to predict all labels in the test set. The results and
analysis are shown together with the other two datasets in
Section III D.4 Figure 10.
3) Experiments on Bird Dataset:
The bird dataset is difficult to be trained and easily becomes
overfitting because it only contains 150 images and has less
than 10 images for each category. The various backgrounds and
the small birds’ sizes in the entire large images made
(c) Validation Accuracy on Monkey Dataset (d) Validation Loss on Monkey Dataset experiments on this dataset even harder to reach a good result.
Compared to the other two datasets, bird dataset required the
Figure 6 Comparison of training and validation accuracy and loss
throughout the fine-tuning training process on the monkey dataset experiment. least training time due to its smallest size.
As shown in Figure 6, both MobileNetV3-Large and TABLE VI FINE-TUNING TRAINING TIME ON BIRD DATASET FOR
DIFFERENT EXPERIMENTED MODELS. THE ABSOLUTE TIME IS ALMOST THE
MobileNetV3-Small reached convergence within 4 epochs and SAME. THE BEST VALIDATION ACCURACY IS RECORDED DURING TRAINING.
achieved even the same high training accuracy as InceptionV3,
which is a considerably larger neural network than Model Training Time Best Validation Accuracy
MobileNetV3. Compared to ShuffleNetV2 that also designed for MobileNetV3-Large 14min 1s 0.80
mobile devices, MobileNetV3 converged at a higher accuracy MobileNetV3-Small 14min 18s 0.78
AlexNet 14min 13s 0.60
within fewer epochs, which shows that MobileNetV3 is easier to InceptionV3 14min 26s 0.67
be trained in this monkey dataset and some other similar ShuffleNetV2 14min 6s 0.60
animals’ datasets in low resolution.
At the testing stage, all the five trained models are applied to As shown in Table VI, the training time of all models is almost
the test set. The results of each model’s performance are shown the same. It shows that time may not be a good comparison
in Figure 6. metric here because the time gaps among models for bird dataset
are not as significant as those for fruit dataset. However,
MobileNetV3-Large and MobileNetV3-Small still obtained
much higher validation accuracy than other models within the
same training time as the other two datasets. More accuracy and
loss details in the experiment process are shown in Figure 8.
Figure 8 demonstrates that all models show different degrees
of overfitting on bird dataset. This is mainly due to the
imbalanced training data for each category and the small number
of data in bird dataset. According to the training loss trends in
Figure 8 (b), all models came to good convergence in the training
Figure 7 Comparison of ROC curves for different models on monkey process except ShuffleNetV2. Figure 8 (d) for validation loss
dataset. shows that AlexNet failed at this classification task. Moreover,
AlexNet, InceptionV3, and ShuffleNetV2 encountered heavy
MobileNetV3-Large reached the highest testing accuracy overfitting after 10 training epochs. MobileNetV3-Large and
among all the five models and is followed by MobileNetV3- MobileNetV3-Small converged faster and are less overfitting.
Small. Figure 7 illustrates the same result. It reveals that
MobileNetV3 can perform well in extracting features in low-
495
The validation accuracy of these two models is generally higher In addition to the classification accuracy, we also focus on
than others. model efficiency. Figure 10 (a) demonstrates that MobileNetV3
did not take significantly longer testing time than other
experimented models.
(a) Training Accuracy on Bird Dataset (b) Training Loss on Bird Dataset
(a) Log of Test Time on Datasets (b) Test Accuracy on Datasets
Figure 10 (a) Comparison of test time for all experimented models on the
three different datasets. Due to the large difference in the original test size for
the three datasets, we divide all testing time by the tested number in the
corresponding dataset. (b) Comparison of test accuracy for all experimented
models on the three different datasets.
(c) Validation Accuracy on Bird Dataset (d) Validation Loss on Bird Dataset From the accuracy and efficiency analysis in both training
Figure 8 Comparison of training and validation accuracy and loss and testing process for all models, it can be seen that the two
throughout the fine-tuning training process on the bird dataset experiment. MobileNetV3 models take much shorter time (especially for
fined-tuning training time) than large CNN models such as
Figure 9 also demonstrates that MobileNetV3-Large InceptionV3 and AlexNet, while achieving satisfying accuracy.
achieved the best performance among all models. MobileNetV3- MobileNetV3 is also slightly more efficient than other nets
Small had almost the same great performance, measured by the designed for mobile terminals such as ShuffleNetV2, but with
area under the ROC curve. much better performance.
Overall, our experiments’ results validate that MobileNetV3
can reach high accuracy with a small amount of time resource,
which means it is extremely suitable for mobile devices with
constrained computational resources requiring high efficiency.
5) Conjectures about Characteristics of Suitable Datasets:
As a further step of our experiment results’ analysis, we
conclude some conjectures about what kinds of datasets are
more suitable for MobileNetV3, by comparing performance for
MobileNetV3 classifying the three different datasets.
Figure 9 Comparison of ROC curves for different models on bird dataset. From Figure 10 (b) and ROC curves for individual datasets
(i.e., Figure 5, 7, and 9), it can be found that fruit-360 dataset
4) Overall Testing Performance Comparison: leads to 1.0 accuracy in an extremely short time, and monkey
From the above analysis for experiments of individual dataset corresponds to a pretty good accuracy (higher than 0.95)
datasets, it can be found that MobileNetV3-Large can always after a few fined-tuning training epochs. However, bird dataset
achieve the best performance among all experimented models in has far smaller accuracy (lower than 0.50) than the other two
the training process. MobileNetV3-Small usually comes as the datasets.
second. Also, the training time of MobileNetV3 is significantly
Considering the characteristics of these datasets, fruit dataset
shorter than the other models.
is a representative of images of daily-life objects with no
We want to verify the above training-process results (in significant differences among a huge number of different
Section III D.1 to III D.3) on testing performance again. To this categories. Training set for each category is large enough, and
end, experiments are conducted on testing data of the three there is no background interference. Monkey dataset is made up
datasets for all the five models. Figure 10 (a) and (b) demonstrate of low-resolution animal images from only ten categories of
the test summary about efficiency and accuracy, respectively. monkeys. Monkeys that need to be classified are put in the centre
of the images. Image backgrounds are environments that
As shown in Figure 10 (b), both two models for monkeys commonly live in, while they are blurry. Bird dataset
MobileNetV3 kept the highest classification accuracy for the has only a small number of images for ten categories of birds.
chosen datasets. MobileNetV3-Large performed slightly better Though those images are in high resolution, image backgrounds
than MobileNetV3-Small. are messy, and the birds that need to be categorized are in tiny
size of the entire images.
496
Thus, we suppose the characteristics of suitable image [3] Krizhevsky A., Sutskever I. and Hinton G.E., “Imagenet classification
datasets for MobileNetV3 classification tasks are as follows: with deep convolutional neural networks,” Communications of the ACM,
vol. 60, no. 6, pp. 84-90, 2017.
• Classified objects are highlighted in the images with no [4] Simonyan K. and Zisserman A., “Very deep convolutional networks for
strong background interference. large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[5] Howard A., Sandler M., Chu G., Chen L.C., Chen B., Tan M., Wang W.,
• Images have no high-resolution requirement (i.e., are Zhu Y., Pang R., Vasudevan V. and Le Q.V., “Searching for
friendly to low-resolution images). mobilenetv3,” Proceedings of the IEEE International Conference on
Computer Vision, pp. 1314-1324, 2019.
• The training dataset is better in relatively large size for the [6] Szegedy C., Vanhoucke V., Ioffe S., Shlens J. and Wojna Z., “Rethinking
classified category, which means the classified objects are the inception architecture for computer vision,” Proceedings of the IEEE
better commonly seen in the training library. Conference on Computer Vision and Pattern Recognition, pp. 2818-2826,
2016.
• The classified object is in typical shape or state of its [7] Ma N., Zhang X., Zheng, H.T. and Sun, J., “Shufflenet v2: Practical
category, and the backgrounds are better to be common as guidelines for efficient cnn architecture design,” Proceedings of the
well. European Conference on Computer Vision (ECCV), pp. 116-131, 2018.
[8] Mureşan H. and Oltean M., “Fruit recognition from images using deep
• There is no need to have apparent differences among learning,” Acta Universitatis Sapientiae, Informatica, vol. 10, no. 1, pp.
various categories in the dataset, as CNN can capture 26-42, 2018.
features more precisely than eyes of human beings. [9] Howard A.G., Zhu M., Chen B., Kalenichenko D., Wang W., Weyand T.,
Andreetto M. and Adam H., “Mobilenets: Efficient convolutional neural
The experimental summary of characteristics of networks for mobile vision applications,” arXiv preprint
MobileNetV3 suitable datasets convinces that MobileNetV3 is arXiv:1704.04861, 2017.
perfectly designed for mobile devices and brilliant to handle [10] Sandler M., Howard A., Zhu M., Zhmoginov A. and Chen L.C.,
“Mobilenetv2: Inverted residuals and linear bottlenecks,” Proceedings of
classification tasks for daily-life photos in an extremely short the IEEE Conference on Computer Vision and Pattern Recognition, pp.
time. 4510-4520, 2018.
[11] Xiang Q., Wang X., Li R., Zhang G., Lai J. and Hu Q., “Fruit image
IV. CONCLUSION classification based on mobilenetv2 with transfer learning technique,”
This paper compared the performance of image classification Proceedings of the 3rd International Conference on Computer Science and
tasks among MobileNetV3 and certain standard CNN models Application Engineering, pp. 1-7, 2019.
based on the image datasets, which are usually captured and [12] Hu J., Shen L. and Sun G., “Squeeze-and-excitation networks,”
Proceedings of the IEEE Conference on Computer Vision and Pattern
handled by mobile devices. By comparing and analysing the Recognition, pp. 7132-7141, 2018.
experimental results, we found that MobileNetV3 models can [13] Tan M., Chen B., Pang R., Vasudevan V., Sandler M., Howard A. and Le
complete image classification tasks with much higher efficiency. Q.V., “Mnasnet: Platform-aware neural architecture search for mobile,”
At the same time, their final accuracy is significantly higher than Proceedings of the IEEE Conference on Computer Vision and Pattern
the accuracy of other models. Besides, we note that Recognition, pp. 2820-2828, 2019.
MobileNetV3 is suitable for images that contain daily-life [14] Yang T.J., Howard A., Chen B., Zhang X., Go A., Sandler M., Sze V. and
objects with less background interference, even if the images are Adam H., “Netadapt: Platform-aware neural network adaptation for
mobile applications,” Proceedings of the European Conference on
in low resolution. Therefore, MobileNetV3 is very convenient to Computer Vision (ECCV), pp. 285-300, 2018.
deal with image classification tasks for mobile terminals. [15] Zhang X., Zhou X., Lin M. and Sun J., “Shufflenet: An extremely efficient
As further work to our research, we would like to summarize convolutional neural network for mobile device,” Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848-
the advantages of other large CNNs that can be used to adjust 6856, 2018.
the MobileNetV3 models further better, to improve their [16] Ioffe S. and Szegedy C., “Batch normalization: Accelerating deep
performance while avoiding the architecture to expand. We also network training by reducing internal covariate shift,” arXiv preprint
plan to experiment on more image datasets with MobileNetV3 arXiv:1502.03167, 2015.
models, which aims to summarize and confirm more universal [17] Hussain M., Bird J.J. and Faria D.R., “A study on cnn transfer learning
characteristics of images that perform excellently on for image classification,” UK Workshop on Computational Intelligence,”
MobileNetV3. vol. 840, pp. 191-202, 2018.
[18] Pan S.J. and Yang Q., “A survey on transfer learning,” IEEE Transactions
REFERENCES on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359,
2009.
[1] Al-Saffar A.A.M., Tao H. and Talab M.A., “Review of deep convolution
neural network in image classification,” International Conference on [19] Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Huang Z.,
Radar, Antenna, Microwave, Electronics, and Telecommunications Karpathy A., Khosla A., Bernstein M. and Berg A.C., “Imagenet large
(ICRAMET). IEEE, pp. 26-31, 2017. scale visual recognition challenge,” International Journal of Computer
Vision, vol. 115, no. 3, pp. 211-252, 2015.
[2] Szegedy C., Liu W., Jia Y., et al., “Going deeper with convolutions,”
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1-9, 2015.
497

Qian PDF

Uploaded by

Copyright:

Available Formats

Qian PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Qian PDF

Uploaded by

Copyright:

Available Formats

2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE 2021)

MobileNetV3 for Image Classification

978-0-7381-3122-1/21/$31.00 ©2021 IEEE

TABLE I MOBILENETV3-LARGE SPECIFICATION. #EXPAND DENOTES

where TP, TN, FP, FN are the number of true-positive, true-

Model Approximate Training Time (in minutes)

(a) Log of Test Time on Datasets (b) Test Accuracy on Datasets

You might also like