Article
Optimizing Convolutional Neural Networks for Image
Classification on Resource-Constrained Microcontroller Units
Susanne Brockmann * and Tim Schlippe *
Abstract: Running machine learning algorithms for image classification locally on small, cheap, and
low-power microcontroller units (MCUs) has advantages in terms of bandwidth, inference time,
energy, reliability, and privacy for different applications. Therefore, TinyML focuses on deploying
neural networks on MCUs with random access memory sizes between 2 KB and 512 KB and read-
only memory storage capacities between 32 KB and 2 MB. Models designed for high-end devices
are usually ported to MCUs using model scaling factors provided by the model architecture’s
designers. However, our analysis shows that this naive approach of substantially scaling down
convolutional neural networks (CNNs) for image classification using such default scaling factors
results in suboptimal performance. Consequently, in this paper we present a systematic strategy for
efficiently scaling down CNN model architectures to run on MCUs. Moreover, we present our CNN
Analyzer, a dashboard-based tool for determining optimal CNN model architecture scaling factors
for the downscaling strategy by gaining layer-wise insights into the model architecture scaling factors
that drive model size, peak memory, and inference time. Using our strategy, we were able to introduce
additional new model architecture scaling factors for MobileNet v1, MobileNet v2, MobileNet v3,
and ShuffleNet v2 and to optimize these model architectures. Our best model variation outperforms
the MobileNet v1 version provided in the MLPerf Tiny Benchmark on the Visual Wake Words image
classification task, reducing the model size by 20.5% while increasing the accuracy by 4.0%.
Therefore, TinyML focuses on deploying neural networks on MCUs with random access memory (RAM) sizes between 2 KB and 512 KB and read-only memory (ROM) storage capacities between 32 KB and 2 MB [9,17]. Examples of such IoT use cases include
the processing of sensor data in smart manufacturing, personalized healthcare, automated
retail, wildlife conservation, and precision agriculture contexts. In many of these fields,
image classification plays an important role.
To obtain convolutional neural networks (CNNs) for image classification that fit the aforementioned constraints, CNNs designed for high-end edge devices are often ported
to MCUs by reducing the input channels from RGB to grayscale [9], reducing the input
resolution [9,18], or by drastically decreasing the default model architecture scaling factor
of the model, such as the width multiplier α in MobileNets [11–13]. However, our analysis,
which we will present in Section 6.2.1, shows that the naive approach of reducing the
default model scaling factors leads to suboptimal results when substantially scaling down
the model architecture.
Consequently, in this study we develop a systematic strategy to efficiently optimize CNN model architectures for running image classification on MCUs. Our goal was to optimize tiny models that fit the following MCU constraints, which are also recommended in the TinyML literature [18] (a code sketch of this constraint check follows the list):
• ≤250 KB RAM
• ≤250 KB ROM
• Inference cost ≤60 M multiply–accumulate operations (MACs)
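As a minimal sketch of how a candidate model variation can be screened against these three constraints (the metric values in the example call are placeholders, not measurements from this paper):

```python
# Minimal sketch of screening a candidate model variation against the MCU constraints.
MAX_RAM_KB = 250        # peak memory for activations
MAX_ROM_KB = 250        # int-8 quantized model size
MAX_MACS = 60_000_000   # multiply-accumulate operations per inference

def fits_constraints(peak_memory_kb: float, model_size_kb: float, macs: int) -> bool:
    """Return True if a model variation satisfies all three MCU constraints."""
    return (peak_memory_kb <= MAX_RAM_KB and
            model_size_kb <= MAX_ROM_KB and
            macs <= MAX_MACS)

# Placeholder metric values for an illustrative model variation.
print(fits_constraints(peak_memory_kb=110.6, model_size_kb=172.8, macs=15_000_000))
```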
For our experiments, we used the Visual Wake Words (VWW) dataset [18] with a
resolution of 96 × 96 × 3 pixels. The VWW dataset was specifically designed for the MCU
use case of classifying whether a person is present in an image and is an important part of
the MLPerf Tiny Benchmark [19].
We developed our CNN Analyzer, a dashboard-based tool, to gain layer-by-layer
insights into the model architecture scaling factors that have the potential to minimize
model size, peak memory, and inference time. Using our strategy together with our CNN
Analyzer, we were able to (1) locate the bottlenecks of the model; (2) introduce new model
architecture scaling factors for MobileNet v1 [11], MobileNet v2 [12], MobileNet v3 [13],
and ShuffleNet v2 [15]; and (3) optimize these model architectures. This would not have
been possible with a neural architecture search (NAS) approach, as in [9,20–22], since NAS
requires the definition of the search space in advance and does not provide layer-by-layer
insights. In summary, our contributions are as follows:
• We investigated and developed a strategy to optimize existing CNN architectures for
given resource constraints.
• We created the CNN Analyzer to inspect the metrics of each layer in a CNN.
• Our model implementations use the TensorFlow Lite for Microcontrollers inference
library, as it can run on almost all available MCUs [23].
• We introduced new model architecture scaling factors to optimize MobileNet v1,
MobileNet v2, MobileNet v3, and ShuffleNet v2.
• We have published the CNN Analyzer and its related code on our GitHub repository:
https://github.com/subrockmann/tiny_cnn (accessed on 3 June 2024)
Our findings and developed tools are portable to other network architectures and can
be combined with NAS approaches. While the goal of this paper is to increase the performance of models that already fit the aforementioned MCU constraints, our strategy and the developed CNN Analyzer can also be applied to fit models that originally require more resources into these MCU constraints.
Examples of high-end MCUs to which those constraints apply are the ESP32 Xtensa LX6 (4 MB ROM, 520 KB RAM), the Arduino Nano 33 Cortex-M4 (1 MB ROM, 256 KB RAM), the Raspberry Pi Pico Cortex-M0+ (16 MB ROM, 264 KB RAM), and the STM32F746G-Disco board Cortex-M7 (1 MB ROM, 340 KB RAM), which we used for our experiments described in Section 6. Although the available ROM of these MCUs exceeds the 250 KB required to store the model, the storage overhead of the entire application utilizing the model must also be taken into account. Furthermore, these high-end MCUs "[a]re used in a huge range of use cases, from sensing and IoT to digital gadgets, smart appliances and wearables. At the time of writing, they represent the sweet spot for cost, energy usage, and computational ability for embedded machine learning" [16].
For running inference of neural networks on MCUs, all static data, including program
code and model parameters, have to fit into the ROM, while temporary data such as model
activations must fit into the RAM. The RAM required for neural network inference varies
throughout the layers, and is determined by intermediate tensors that must be stored for
data transfer between layers. The largest sum of input and output tensors of an operation
plus all other tensors that must be kept in the RAM for subsequent operations [24] is
known as the peak memory. The amount of ROM needed for an application is the sum of the
operating system size, the machine learning framework size, the neural network model
size, and the application code size. The number of MACs or floating point operations
(FLOPs) is used to measure the inference cost.
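As a minimal sketch of the peak-memory definition above (assuming a purely sequential execution order and hypothetical tensor sizes), the peak memory is the largest per-operator sum of input, output, and retained tensors:

```python
# Sketch of the peak-memory definition from [24]; all tensor sizes (in bytes) are hypothetical.
ops = [
    # (input_bytes, output_bytes, resident_bytes kept for later operators)
    (96 * 96 * 3, 48 * 48 * 8, 0),
    (48 * 48 * 8, 48 * 48 * 16, 0),
    (48 * 48 * 16, 24 * 24 * 32, 48 * 48 * 8),  # an earlier output kept for a skip connection
]

def peak_memory_bytes(ops):
    """Largest sum of input + output + resident tensors over all operators."""
    return max(inp + out + resident for inp, out, resident in ops)

print(f"Peak memory: {peak_memory_bytes(ops) / 1024:.1f} KB")
```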
While the number of MACs and FLOPs has an impact on accuracy, inference time,
and energy consumption, storage-related metrics such as the number of model parameters,
which impacts the model size, and the peak memory, which determines the RAM require-
ments, are crucial metrics for running neural networks on resource-constrained MCUs.
It is therefore essential to balance high accuracy, low inference time, minimal storage requirements, and low energy consumption. The authors of [18] used the number of model parameters as a proxy for model size, assuming 1 byte of storage per parameter with int-8 quantization. However, this neglects the additional storage required for metadata, computation graphs, and other information necessary for training and inference. As a consequence, a model with fewer model parameters may have a larger model size than a model with more model parameters. For example, the MLPerf Tiny Benchmark model of MobileNet v1 with scaling factor α = 0.25 requires a total memory that is 1.36 times larger than the size for storing the model parameters alone. Consequently, in our strategy for optimizing CNN model architectures, we introduce the bytes/parameter ratio as a new evaluation metric to estimate the number of model parameters that have to be removed to fit a model into the given constraints. For example, a bytes/parameter ratio of 1.3 indicates that removing approximately 1000 model parameters reduces the model size by about 1300 bytes.
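A minimal sketch of how this ratio can be computed and used; the file path and parameter count are illustrative, and in our setup they would come from the int-8 quantized TensorFlow Lite file and the Keras model summary:

```python
import os

def bytes_per_parameter(tflite_path: str, num_parameters: int) -> float:
    """Int-8 quantized model size in bytes divided by the number of model parameters."""
    return os.path.getsize(tflite_path) / num_parameters

def parameter_reduction_for(size_reduction_bytes: float, ratio: float) -> float:
    """Estimate how many parameters must be removed to shrink the model by the given bytes."""
    return size_reduction_bytes / ratio

# Illustrative: with a ratio of 1.3 bytes/parameter, shrinking the model by 1300 bytes
# requires removing roughly 1000 model parameters.
print(parameter_reduction_for(1300, 1.3))
```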
As we will explain in Section 5.3, we capture the aforementioned metrics in our CNN
Analyzer to derive optimization strategies.
3. Related Work
In this section, we will first describe techniques for reducing the size of neural networks
and designing CNNs that require low computational resources (so-called efficient CNNs).
Then, we will present efficient CNN architectures designed for mobile devices and MCUs.
Another approach for reducing model size is quantization. Quantization maps high-
precision weight values to low-precision weight values, reducing the number of bits
needed for storing each weight. For example, [30] proposed full-integer quantization (int-8
quantization) of weights and activations to leverage integer-only hardware accelerators,
which can improve inference time, computation, and power usage. The authors of [31]
suggested knowledge distillation, which transfers knowledge from a large teacher model
to a smaller student model by learning mappings from input to output vectors.
eters leading to a model size of more than 1.9 MB, as each model parameter uses 1 byte of
storage using int-8 quantization, as described in Section 2.
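As a sketch of this int-8 post-training quantization step with the TensorFlow Lite converter; the stock Keras MobileNet and the random representative dataset below are placeholders for one of our model variations and real VWW images:

```python
import numpy as np
import tensorflow as tf

# Placeholder model; in our experiments this would be one of the CNN model variations.
model = tf.keras.applications.MobileNet(input_shape=(96, 96, 3), alpha=0.25,
                                        weights=None, classes=2)

def representative_dataset():
    # In practice: yield preprocessed VWW images; random data is used here as a stand-in.
    for _ in range(100):
        yield [np.random.rand(1, 96, 96, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
print(f"int-8 model size: {len(tflite_model) / 1024:.1f} KB")
```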
is discarded. (3) For each of the remaining model variations, we investigate how to make the
model variation fit our constraints by changing the model architecture scaling factors. Then,
we proceed to step 1 to build a new model variation with these new model architecture
scaling factors.
5. Experimental Setup
In this section, we will first introduce the dataset we used to optimize and test our
model variations. Second, we will explain how we used TensorFlow, TensorFlow Lite, and TensorFlow Lite for Microcontrollers to transfer CNN models to an MCU. Then, we will present the CNN Analyzer, which we implemented to determine the optimal CNN model architecture scaling factors for our downscaling strategy. Finally, we will describe how we
tested our best models on a real MCU.
5.1. Dataset
Commonly used datasets for image classification include ImageNet [48] and CIFAR10 [49]. However, ImageNet [48], with 1000 classes, is not an appropriate dataset
for our MCU use case [18]. Furthermore, the resolution of the CIFAR10 images [49]
(32 × 32 × 3 pixels) is too small for most real-world IoT use cases [18].
Consequently, for our experiments we used the VWW dataset [18], which consists of
109,620 images (80% training, 10% validation, 10% test) with a resolution of 96 × 96 × 3 pix-
els. The VWW dataset was specifically designed for the MCU use case of classifying
whether a person is present in an image, and is an important part of the MLPerf Tiny
Benchmark [19]. Following this benchmark, we used the constraints defined in Section 2
and a minimum accuracy of 80%. Our goal was to find a model variation that reaches
maximum accuracy on the VWW test set while staying within these resource constraints.
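A sketch of how the dataset can be loaded and preprocessed at this resolution, assuming the TensorFlow Datasets build of visual_wake_words and its default train split; our own 80%/10%/10% split handling is omitted:

```python
import tensorflow as tf
import tensorflow_datasets as tfds

def preprocess(example):
    # Resize to the 96x96x3 input resolution and rescale pixel values to [0, 1].
    image = tf.image.resize(tf.cast(example["image"], tf.float32) / 255.0, (96, 96))
    return image, example["label"]

train_ds = (tfds.load("visual_wake_words", split="train")
            .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))
```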
5.3. CNN Analyzer: A Dashboard-Based Tool for Determining Optimal CNN Model Architecture
Scaling Factors
To determine optimal model architecture scaling factors for given constraints such as
accuracy, model size, peak memory, and inference time in steps 1–3 of our optimization
strategy (described in Section 4), we developed the CNN Analyzer. This toolkit allows
TensorFlow models to be built with different model architecture scaling factors, and enables
the storage, analysis, visualization, comparison, and optimization of the model variations.
5.3.2. Implementation
CNN Analyzer is powered by a collection of existing and self-developed analytical
tools that analyze and benchmark the model representations created for each model varia-
tion. In an interactive Jupyter notebook, the user can choose the model architecture, define
the model architecture scaling factors, and begin building, conversion, and analysis of
model variations. The extracted information about the different model variations, including their compound model names, is logged in the model database of CNN Analyzer to keep track of
all the different model architectures and model architecture scaling factors. The machine
learning operations (MLOps) tool Weights & Biases (https://www.wandb.com, accessed
on 4 July 2024) is used to log all model architecture scaling factors as well as to track and
visualize all model training runs. All metrics and tabular data of each model variation
are retrieved from the TensorFlow, TensorFlow Lite, and TensorFlow Lite for Microcontrollers
representations, which are described in Section 5.2.
TensorFlow provides the tf.keras.Model.summary method (https://www.tensorflow.org/api_docs/python/tf/keras/Model#summary, accessed on 4 July 2024) for generating a layer-
wise summary report with layer names, layer types, number of channels, output shape,
and number of model parameters, as well as a summary of the total MACs and FLOPs of
the model variation.
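A sketch of this layer-wise parameter extraction that feeds CNN Analyzer, using a stock Keras MobileNet v1 as a stand-in for one of our model variations:

```python
import pandas as pd
import tensorflow as tf

model = tf.keras.applications.MobileNet(input_shape=(96, 96, 3), alpha=0.25,
                                        weights=None, classes=2)

# Collect name, type, and parameter count per layer, then rank by contribution.
rows = [{"layer": layer.name,
         "type": type(layer).__name__,
         "parameters": layer.count_params()} for layer in model.layers]
report = pd.DataFrame(rows).sort_values("parameters", ascending=False)
print(report.head(10))   # the layers that contribute the most model parameters
model.summary()          # TensorFlow's built-in layer-wise summary report
```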
To capture the layer-wise RAM requirements and peak memory of the model varia-
tion, we used tflite-tools (https://github.com/eliberis/tflite-tools, accessed on 4 July 2024)
created by [24] to analyze the TensorFlow Lite model representations. Additionally, we
utilized the TensorFlow Lite native benchmarking binary (https://www.tensorflow.org/lite/
performance/measurement#native_benchmark_binary, accessed on 4 July 2024), which can
run on Linux, Mac OS, and Android devices and creates a report with average inference
time on the CPU and a breakdown of the inference time per layer.
To measure the inference time on MCUs, the TensorFlow Lite model representation has
to be compiled into a C byte array. Since compiling the model representation together with
its corresponding runtime code and uploading it to the MCU for inference time profiling
requires many manual steps, we first simulated the inference using a hardware simulator.
To simulate the inference, we used the Silicon Labs Machine Learning Toolkit (MLTK) (https:
//siliconlabs.github.io/mltk, accessed on 4 July 2024), which provides a model profiler that
uses a hardware simulator to estimate the inference time and CPU cycles per layer (based on
the ARM Cortex-M33). To compile the model and flash it on the MCU for the final inference
time evaluation, we used STM32Cube.AI (https://stm32ai.st.com/stm32-cube-ai, accessed on 4 July 2024). The STM32Cube.AI software framework supports profiling of TensorFlow Lite models on locally connected hardware such as the STM32F746G-Disco board, which we used for our experiments. STM32Cube.AI creates detailed reports including the model size,
peak RAM, and inference time as well as a layer-wise breakdown of the MACs, number of
model parameters, and inference time on the MCU.
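Since the deployment step mentioned above embeds the TensorFlow Lite flatbuffer in the firmware as a C byte array, the conversion can be sketched as follows; the file and array names are placeholders, and in practice a tool such as xxd produces the same result:

```python
def tflite_to_c_array(tflite_path: str, header_path: str, array_name: str = "g_model"):
    """Write a .tflite flatbuffer as a C byte array that can be compiled into MCU firmware."""
    with open(tflite_path, "rb") as f:
        data = f.read()
    lines = [f"const unsigned char {array_name}[] = {{"]
    for i in range(0, len(data), 12):
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in data[i:i + 12]) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {array_name}_len = {len(data)};")
    with open(header_path, "w") as f:
        f.write("\n".join(lines) + "\n")

tflite_to_c_array("model_int8.tflite", "model_data.h")
```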
To understand the impact of the width multiplier α for small values, we created model variations with α ∈ {0.1, 0.2, 0.25, 0.35} and l ∈ {1, 2, 3, 4, 5} and evaluated the impact of these model architecture scaling factors on the int-8 quantized model size and accuracy. The model sizes in KB for the model variations with different α and l are displayed in Figure 3, sorted by α and l. The figure consolidates the data stored in our CNN Analyzer. The horizontal line marks the model size constraint of 250 KB. It can be observed that the model size grows with increasing α and l. α significantly affects the model size, as more channels per layer require more model weights, which increases the storage requirements. The model with the highest accuracy (85.1%) that fits into our peak memory constraint is MobileNet v1 with α = 0.25 and l = 3 (mobilenetv1_0.25_96_c3_o2_l3).
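A sketch of this scaling-factor sweep using the stock Keras MobileNet v1 builder; unlike our own builder, the stock builder does not expose the number-of-blocks factor l, so only the α dimension is shown, and sizes are measured on randomly initialized weights:

```python
import numpy as np
import tensorflow as tf

def int8_size_kb(model):
    # Int-8 post-training quantization with a random stand-in representative dataset.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = lambda: (
        [np.random.rand(1, 96, 96, 3).astype(np.float32)] for _ in range(10))
    return len(converter.convert()) / 1024

for alpha in (0.1, 0.2, 0.25, 0.35):
    model = tf.keras.applications.MobileNet(input_shape=(96, 96, 3), alpha=alpha,
                                            weights=None, classes=2)
    size = int8_size_kb(model)
    print(f"alpha={alpha}: {size:.1f} KB (fits 250 KB ROM: {size <= 250})")
```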
Figure 4 shows the accuracy of our model variations. The horizontal line marks our
80% accuracy threshold. It demonstrates that the accuracy significantly decreases with
decreasing α. As our goal was to reduce model size while maintaining high accuracy, we
looked for other methods to reduce the number of model parameters and introduced new
additional model architecture scaling factors.
6.2.2. Layer-Wise Optimization with New Model Architecture Scaling Factors pl and ll
Figure 5 shows (in blue) the layer-wise visualization of the number of model parameters in MobileNet v1 with α = 0.25. It can be observed that the penultimate convolutional layer (consisting of 33 K parameters) and the last convolutional layer (consisting of 66 K parameters) are the biggest model parameter contributors, leading to a model size of 293.8 KB,
which exceeds our 250 KB constraint. The MobileNet v1 [11] architecture was designed for
ImageNet [48] classification with 1000 classes, unlike our use case with only two classes.
Therefore, we hypothesized that the model size could be significantly reduced by lowering
the number of model parameters in the penultimate and last convolutional layers without
incurring a significant negative impact on accuracy.
To test this hypothesis, we introduced our two new model architecture scaling factors:
penultimate_layer_channels (pl) determines the number of channels in the penultimate
convolutional layer, while last_layer_channels (ll) specifies the number of channels in the
last convolutional layer. We investigated the impact of varying these model architecture
scaling factors. For our best MobileNet v1 variation (mobilenetv1_0.3_96_c3_o2_l5ll32pl256), reducing pl from 1024 to 256 and ll from 1024 to 32 was optimal and decreased the number of model parameters significantly. As illustrated by the red bars in Figure 5, the number of model parameters in the penultimate convolutional layer dropped to 11.7 K, while the number of model parameters in the last convolutional layer dropped to 0.7 K. This reduced model size allowed us to increase the width multiplier α to 0.3. The resulting best model variation uses width multiplier α = 0.3 and has a model size of 243.4 KB, a 17.2% reduction that fits the ≤250 KB ROM constraint, while even achieving an accuracy of 86.1%, a 0.8% improvement over the benchmark model.
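A simplified sketch (not our exact implementation) of how pl and ll replace the fixed 1024-channel outputs of the last two depthwise-separable blocks in a MobileNet-v1-style head; the 3 × 3 × 128 input feature map is only an example:

```python
import tensorflow as tf
from tensorflow.keras import layers

def mobilenet_v1_head(x, pl=256, ll=32, num_classes=2):
    """Tail of a MobileNet-v1-style network with configurable channel counts pl and ll."""
    for channels in (pl, ll):
        x = layers.DepthwiseConv2D(3, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU(6.0)(x)
        x = layers.Conv2D(channels, 1, use_bias=False)(x)   # pointwise convolution
        x = layers.BatchNormalization()(x)
        x = layers.ReLU(6.0)(x)
    x = layers.GlobalAveragePooling2D()(x)
    return layers.Dense(num_classes, activation="softmax")(x)

# Example: attach the head to a hypothetical 3x3x128 feature map.
inputs = tf.keras.Input(shape=(3, 3, 128))
model = tf.keras.Model(inputs, mobilenet_v1_head(inputs))
model.summary()
```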
of the 222K parameters of the model variation. Without the layer-wise visualization of our
CNN Analyzer, the introduction of new model architecture scaling factors to control these
layers would not have been possible.
Since the CNN Analyzer displays the number of model parameters, model size,
bytes/parameter ratio, and layer-wise visualizations of channels and model parameters side-
by-side, the user can derive ideas on how to optimize specific scaling factor values. These
visualizations are even more important when several model architecture scaling factors
influence the same layer and the model parameter distribution shifts.
6.5.1. MobileNet v2
MobileNet v2 [12] also provides a width multiplier α, which we varied in our ex-
periments. Additionally, we exposed the expansion factor t, which scales the number
of channels inside the bottleneck block, as a model architecture scaling factor (t ∈ [1, 6],
default value t = 6). In the default implementation, α does not scale the last convolutional
layer with 1,280 channels. Consequently, we also introduced our new model architecture
scaling factor last_layer_channels to control and significantly reduce the number of model
parameters in this layer.
Since the architecture contains only one convolutional layer after the bottleneck blocks,
we could not introduce penultimate_layer_channels for the MobileNet v2 architecture.
The best MobileNet v2 model variation (mobilenetv2_0.25_96_c3_o2_t5l256) uses
α = 0.25, t = 5, last_layer_channels = 256, has an int-8 quantized model size of 248.0 KB,
uses 56.3 KB peak memory, requires 59.5 ms inference time on the MCU, and reaches an
accuracy of 84.1%, which is below the accuracy of the benchmark model.
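A simplified sketch (not our exact implementation) of the inverted-residual bottleneck with the expansion factor t exposed as a scaling factor, followed by a final convolution controlled by last_layer_channels instead of the fixed 1280 channels; input shapes and channel counts are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, out_channels, t=6, stride=1):
    """MobileNet-v2-style inverted residual block with a configurable expansion factor t."""
    in_channels = x.shape[-1]
    y = layers.Conv2D(in_channels * t, 1, use_bias=False)(x)   # expand by factor t
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(6.0)(y)
    y = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(6.0)(y)
    y = layers.Conv2D(out_channels, 1, use_bias=False)(y)      # linear projection
    y = layers.BatchNormalization()(y)
    if stride == 1 and in_channels == out_channels:
        y = layers.Add()([x, y])                               # inverted residual connection
    return y

inputs = tf.keras.Input(shape=(12, 12, 32))
x = bottleneck_block(inputs, out_channels=32, t=5)
x = layers.Conv2D(256, 1, use_bias=False)(x)                   # last_layer_channels = 256
model = tf.keras.Model(inputs, x)
```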
6.5.2. MobileNet v3
MobileNet v3 [13] extends the MobileNet v2 [12] block with an additional squeeze-and-
excitation module [50] that is used as an attention module inside the bottleneck structure.
The best architecture within our constraints uses α = 0.05, has an int-8 quantized model
size of 197.1 KB, peak memory of 75.3 KB, 41.7 ms inference time on the MCU, and reaches
83.5% accuracy, which is below the accuracy of our benchmark model.
Since model variations with higher width multipliers α exceeded our peak memory
constraint of 250 KB, we used the same approach as [18], who removed the squeeze-
and-excitation modules inside the MobileNet v3 architecture to lower the peak memory.
The model variations without the squeeze-and-excitation module are indicated by the
suffix NSQ (“no squeeze”). We also introduced our new model architecture scaling fac-
tors penultimate_layer_channels (pl) and last_layer_channels (ll) to significantly reduce the
number of model parameters in these layers.
Our best MobileNet v3 model variation without the squeeze-and-excitation module is
mobilenetv3smallNSQ_0.3_96_c3_o2_l32pl128. It uses α = 0.3, penultimate_layer_channels = 128,
last_layer_channels = 32, has an int-8 quantized model size of 172.8 KB, uses a peak memory
of 110.6 KB, requires 118.8 ms inference time on the MCU, and reaches an accuracy of 86.1%,
slightly outperforming our benchmark model’s accuracy of 85.4%.
6.5.3. ShuffleNet v1
The ShuffleNet v1 [14] architecture uses pointwise group convolutions instead of costly dense 1 × 1 convolutions to reduce computational cost while maintaining accuracy. The model can be scaled with the ShuffleNet-specific default model architecture scaling factor g ∈ {1, 2, 3, 4, 8}, which controls the number of groups in the pointwise convolutions and thus the connection sparsity, and the ShuffleNet-specific default model architecture scaling factor α ∈ {0.25, 0.5, 1, 1.5}, which scales the number of channels per
layer. Since the number of filters in each shuffle unit block must be divisible by g, only a
limited number of valid model variations can be created.
Due to architectural constraints and the downsampling strategy of ShuffleNet v1, we
could not introduce new model architecture scaling factors to further optimize the model.
The best model variation of ShuffleNet v1 (shufflenetv1_0.25_96_c3_o2_g1) with
α = 0.25 and g = 1 has an int-8 quantized model size of 175.2 KB, 81 KB of peak mem-
ory, 69.6 ms inference time on our MCU, and 85.1% accuracy, which is below the accuracy
of our benchmark model.
6.5.4. ShuffleNet v2
In ShuffleNet v2 [15], the number of channels c in the first ShuffleNet v2 block is
controlled by the ShuffleNet-specific default model architecture scaling factor α ∈ {0.5,
1, 1.5, 2}. We extended the range of α to also include α ∈ {0.05, 0.1, 0.2, 0.25, 0.35}. It is
important to take into account that the number of output channels of the first block must
be an even number in order to allow for the channel split operation.
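A minimal sketch of the channel split that motivates this even-number requirement; the incoming feature map is divided into two equal halves along the channel axis, and the values here are illustrative:

```python
import tensorflow as tf

x = tf.random.normal((1, 24, 24, 116))                  # 116 channels: an even number
branch_a, branch_b = tf.split(x, num_or_size_splits=2, axis=-1)
print(branch_a.shape, branch_b.shape)                   # (1, 24, 24, 58) each
```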
To further optimize the ShuffleNet v2 architecture, we introduced our new model architecture scaling factor last_layer_channels to significantly reduce the number of model parameters in the last convolutional layer. Since the architecture contains only one convolutional layer after the ShuffleNet blocks, we could not introduce penultimate_layer_channels for the ShuffleNet v2 architecture.
Our best ShuffleNet v2 model variation (shufflenetv2_0.1_96_c3_o2_l128) with α = 0.1
and last_layer_channels = 128 achieved 83.3% accuracy using 78.8 KB of peak memory and
had a model size of 167.8 KB. This optimized architecture does not reach the accuracy of
our benchmark model.
All models were developed using the following downscaling and optimization process applied to each candidate CNN model architecture (a code sketch of this loop follows the list):
• Build model variations with different width multipliers α and check the model size
and peak memory. Find a model variation where only one of those constraints is not
met.
• If the peak memory constraint is not met, choose a smaller width multiplier α.
• If the model size requirement is not met, create a layer-wise visualization of the model
parameters and identify the layers with the most model parameters.
• Reduce the number of channels in the layers that have the most model parameters.
• Finally, try to increase the width multiplier α as much as possible while keeping the
model variation within the constraints.
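A runnable sketch of this loop over a small table of already-measured model variations; all metric values and per-layer parameter counts are placeholders chosen only to exercise each branch, not results from this paper:

```python
# Sketch of the downscaling loop; variation metrics and layer breakdowns are placeholders.
variations = [
    # (alpha, model_size_kb, peak_memory_kb, params_per_layer)
    (0.35, 310.0, 260.0, {"conv_pw_12": 46000, "conv_pw_13": 92000}),
    (0.30, 280.0, 240.0, {"conv_pw_12": 34000, "conv_pw_13": 68000}),
    (0.25, 230.0, 200.0, {"conv_pw_12": 33000, "conv_pw_13": 66000}),
]

ROM_KB, RAM_KB = 250, 250
for alpha, size_kb, peak_kb, params in sorted(variations, key=lambda v: -v[0]):
    if peak_kb > RAM_KB:
        continue                      # peak memory violated: try a smaller alpha
    if size_kb > ROM_KB:
        heaviest = sorted(params, key=params.get, reverse=True)[:2]
        print(f"alpha={alpha}: reduce channels in {heaviest} (pl, ll)")
        continue                      # model size violated: shrink the heaviest layers
    print(f"alpha={alpha}: fits both constraints; try increasing alpha with pl/ll reduced")
    break
```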
References
1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of
the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Curran Associates, Inc.: Glasgow,
UK, 2012; Volume 25, pp. 1097–1105.
2. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-Based Convolutional Networks for Accurate Object Detection and
Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158. [CrossRef] [PubMed]
3. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer
Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. ISSN 2380-7504. [CrossRef]
4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. ISSN 1063-6919. [CrossRef]
5. Zagoruyko, S.; Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference, BMVC 2016, York,
UK, 19–22 September 2016; Wilson, R.C., Hancock, E.R., Smith, W.A.P., Eds.; BMVA Press: Durham, UK, 2016.
6. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd
International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
7. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper
with Convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston,
MA, USA, 7–12 June 2015; pp. 1–9. ISSN 1063-6919. [CrossRef]
8. Alyamkin, S.; Ardi, M.; Berg, A.C.; Brighton, A.; Chen, B.; Chen, Y.; Cheng, H.P.; Fan, Z.; Feng, C.; Fu, B.; et al. Low-Power
Computer Vision: Status, Challenges, Opportunities. arXiv 2019, arXiv:1904.07714.
9. Banbury, C.; Zhou, C.; Fedorov, I.; Navarro, R.M.; Thakker, U.; Gope, D.; Reddi, V.J.; Mattina, M.; Whatmough, P.N. MicroNets:
Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers. Proc. Mach. Learn. Syst.
2021, 3, 517–532.
10. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-Level Accuracy with 50×
Fewer Parameters and <0.5 MB Model Size. arXiv 2016, arXiv:1602.07360.
11. Howard, A.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient
Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
12. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks.
arXiv 2019, arXiv:1801.04381. [CrossRef]
13. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for
MobileNetV3. arXiv 2019, arXiv:1905.02244. [CrossRef]
14. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018;
pp. 6848–6856.
15. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings
of the ECCV 2018. Lecture Notes in Computer Science, Cham, Switzerland, 8–14 September 2018; Volume 11218. [CrossRef]
16. Situnayake, D.; Plunkett, J. AI at the Edge: Solving Real-World Problems with Embedded Machine Learning, 1st ed.; Machine Learning;
O’Reilly: Beijing, China; Boston, MA, USA; Farnham, UK; Sebastopol, CA, USA; Tokyo, Japan, 2023.
17. Hussein, D.; Ibrahim, D.; Alajlan, N. TinyML: Enabling of Inference Deep Learning Models on Ultra-Low-Power IoT Edge
Devices for AI Applications. Micromachines 2022, 13, 851. [CrossRef] [PubMed]
18. Chowdhery, A.; Warden, P.; Shlens, J.; Howard, A.; Rhodes, R. Visual Wake Words Dataset. arXiv 2019, arXiv:1906.05721.
[CrossRef]
19. Banbury, C.; Reddi, V.J.; Torelli, P.; Holleman, J.; Jeffries, N.; Kiraly, C.; Montino, P.; Kanter, D.; Ahmed, S.; Pau, D.; et al. MLPerf
Tiny Benchmark. arXiv 2021, arXiv:2106.07597.
20. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings
of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018;
pp. 8697–8710. ISSN: 2575-7075. [CrossRef]
21. Fedorov, I.; Adams, R.P.; Mattina, M.; Whatmough, P.N. SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained
Microcontrollers. arXiv 2019, arXiv:1905.12107. [CrossRef]
22. Lin, J.; Chen, W.M.; Lin, Y.; Cohn, J.; Gan, C.; Han, S. MCUNet: Tiny Deep Learning on IoT Devices—Technical Report. arXiv
2020, arXiv:2007.10319. [CrossRef]
23. David, R.; Duke, J.; Jain, A.; Reddi, V.J.; Jeffries, N.; Li, J.; Kreeger, N.; Nappier, I.; Natraj, M.; Wang, T.; et al. TensorFlow Lite
Micro: Embedded Machine Learning for TinyML Systems. Proc. Mach. Learn. Syst. 2021, 3, 800–811.
24. Liberis, E.; Lane, N.D. Neural Networks on Microcontrollers: Saving Memory at Inference via Operator Reordering. arXiv 2020,
arXiv:1910.05110. [CrossRef]
25. Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and
Huffman Coding. arXiv 2016, arXiv:1510.00149.
26. LeCun, Y.; Denker, J.S.; Solla, S.A. Optimal Brain Damage. In Proceedings of the Advances in Neural Information Processing
Systems 2, Denver, CO, USA, 12 December 1990; pp. 598–605.
27. Hassibi, B.; Stork, D.; Wolff, G. Optimal Brain Surgeon and general network pruning. In Proceedings of the IEEE International
Conference on Neural Networks, San Francisco, CA, USA, 28 March–1 April 1993; Volume 1, pp. 293–299. [CrossRef]
28. Frankle, J.; Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In Proceedings of the 7th
International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019.
29. Heim, L.; Biri, A.; Qu, Z.; Thiele, L. Measuring what Really Matters: Optimizing Neural Networks for TinyML. arXiv 2021,
arXiv:2104.10645. [CrossRef]
30. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and Training of Neural
Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision
and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713.
31. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
32. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. ISSN 1063-6919.
[CrossRef]
33. Freeman, I.; Roese-Koerner, L.; Kummert, A. EffNet: An Efficient Structure for Convolutional Neural Networks. arXiv 2018,
arXiv:1801.06434.
34. Lawrence, T.; Zhang, L. IoTNet: An Efficient and Accurate Convolutional Neural Network for IoT Devices. Sensors 2019, 19, 5541.
[CrossRef]
35. Tan, M.; Le, Q.V. EfficientNet: Improving Accuracy and Efficiency through AutoML and Model Scaling. 2019. Available online:
https://research.google/blog/efficientnet-improving-accuracy-and-efficiency-through-automl-and-model-scaling/ (accessed
on 1 July 2024).
36. Gholami, A.; Kwon, K.; Wu, B.; Tai, Z.; Yue, X.; Jin, P.; Zhao, S.; Keutzer, K. SqueezeNext: Hardware-Aware Neural Network
Design. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW),
Salt Lake City, UT, USA, 18–22 June 2018; pp. 1638–1647. ISSN 2160-7516. [CrossRef]
37. Huang, G.; Liu, S.; Maaten, L.V.D.; Weinberger, K.Q. CondenseNet: An Efficient DenseNet Using Learned Group Convolutions.
In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22
June 2018; pp. 2752–2761. [CrossRef]
38. Liu, C.; Zoph, B.; Neumann, M.; Shlens, J.; Hua, W.; Li, L.J.; Fei-Fei, L.; Yuille, A.; Huang, J.; Murphy, K. Progressive Neural
Architecture Search. arXiv 2018, arXiv:1712.00559. [CrossRef]
39. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. MnasNet: Platform-Aware Neural Architecture
Search for Mobile. arXiv 2019, arXiv:1807.11626. [CrossRef]
40. Real, E.; Aggarwal, A.; Huang, Y.; Le, Q.V. Regularized Evolution for Image Classifier Architecture Search. arXiv 2019,
arXiv:1802.01548. [CrossRef]
41. Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable Architecture Search. arXiv 2019, arXiv:1806.09055.
42. Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; Keutzer, K. FBNet: Hardware-Aware Efficient
ConvNet Design via Differentiable Neural Architecture Search. arXiv 2019, arXiv:1812.03443. [CrossRef]
43. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589.
44. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. arXiv 2022,
arXiv:2110.02178. [CrossRef]
45. Krishnamoorthi, R. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper. arXiv 2018, arXiv:1806.08342.
[CrossRef]
46. Lin, J.; Chen, W.M.; Cai, H.; Gan, C.; Han, S. MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning. arXiv
2021, arXiv:2110.15352. [CrossRef]
47. Liberis, E.; Dudziak, Ł.; Lane, N.D. µNAS: Constrained Neural Architecture Search for Microcontrollers. In Proceedings of the
1st Workshop on Machine Learning and Systems, New York, NY, USA, 26 April 2021; EuroMLSys ’21; pp. 70–79. [CrossRef]
48. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of
the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. ISSN
1063-6919. [CrossRef]
49. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report 0; University of Toronto: Toronto,
ON, Canada, 2009.
50. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. ISSN 2575-7075. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.