0% found this document useful (0 votes)

23 views

Model Compression

This paper provides a comprehensive review of model compression techniques in machine learning, highlighting their importance for deploying efficient models in resource-constrained environments such as mobile devices and IoT systems. It discusses various strategies like pruning, quantization, and knowledge distillation, emphasizing the need for a balance between model performance and computational demand. The review also identifies research gaps and advocates for future studies on hybrid methods and intelligent frameworks to enhance the effectiveness of model compression.

Uploaded by

Harsh Kr

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views

Model Compression

Uploaded by

Harsh Kr

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 41

Applied Intelligence (2024) 54:11804–11844

https://doi.org/10.1007/s10489-024-05747-w

A comprehensive review of model compression techniques in machine

learning
Pierre Vilar Dantas1 · Waldir Sabino da Silva Jr1 · Lucas Carvalho Cordeiro2 · Celso Barbosa Carvalho1

Accepted: 5 August 2024 / Published online: 2 September 2024

Abstract
This paper critically examines model compression techniques within the machine learning (ML) domain, emphasizing their
role in enhancing model efficiency for deployment in resource-constrained environments, such as mobile devices, edge com-
puting, and Internet of Things (IoT) systems. By systematically exploring compression techniques and lightweight design
architectures, it is provided a comprehensive understanding of their operational contexts and effectiveness. The synthesis of
these strategies reveals a dynamic interplay between model performance and computational demand, highlighting the bal-
ance required for optimal application. As machine learning (ML) models grow increasingly complex and data-intensive, the
demand for computational resources and memory has surged accordingly. This escalation presents significant challenges for
the deployment of artificial intelligence (AI) systems in real-world applications, particularly where hardware capabilities are
limited. Therefore, model compression techniques are not merely advantageous but essential for ensuring that these models can
be utilized across various domains, maintaining high performance without prohibitive resource requirements. Furthermore,
this review underscores the importance of model compression in sustainable artificial intelligence (AI) development. The
introduction of hybrid methods, which combine multiple compression techniques, promises to deliver superior performance
and efficiency. Additionally, the development of intelligent frameworks capable of selecting the most appropriate compression
strategy based on specific application needs is crucial for advancing the field. The practical examples and engineering applica-
tions discussed demonstrate the real-world impact of these techniques. By optimizing the balance between model complexity
and computational efficiency, model compression ensures that the advancements in AI technology remain sustainable and
widely applicable. This comprehensive review thus contributes to the academic discourse and guides innovative solutions for
efficient and responsible machine learning practices, paving the way for future advancements in the field.

Keywords Lightweight design approaches · Neural network compression · Architectural innovations · Computational
efficiency · Model generalization · Technological evolution in machine learning

1 Introduction nence for their ability to handle high-dimensional data in

classification and regression tasks. Long short-term mem-
1.1 Background and significance of machine ory (LSTM) [3], introduced in 1997, became essential for
learning (ML) and deep learning (DL) sequential data processing in language modeling and speech
recognition. LeNet-51 , introduced in 1998 [4], was one
The evolution of ML and deep learning (DL) has been punc- of the first convolutional neural network (CNN), pioneer-
tuated by a series of landmark models and technologies, ing digit recognition and setting the stage for future CNN
each representing a significant leap in the field. The percep- developments. In the early 2000s, ensemble methods such
tron, developed in 1958 [1], laid the early groundwork for as random forest emerged [5, 6], enhancing the capabili-
deep neural networks (DNNs) in pattern recognition. In the ties of classification and regression. Deep belief network
1990s, support vector machines (SVMs) [2] gained promi- (DBN), unveiled in 2006 [7], reignited interest in DNNs,

Waldir Sabino da Silva Jr, Lucas Carvalho Cordeiro, and Celso Barbosa 1 LeNet-5 itself is not an acronym; it is a name. The ’Le’ in LeNet-5 is
Carvalho contributed equally to this work. derived from the name of one of its developers, Yann LeCun. The ’Net’
part refers to the fact that it is a neural network (NN). The ’5’ denotes
Extended author information available on the last page of the article that this was the fifth iteration or version of the model developed.
A comprehensive review of model... 11805

ushering in the modern era of DL. A major milestone was ML and DL models since the early 2010s marks a signifi-
achieved with the advent of AlexNet in 2012 [8], a DNN cant evolution in computational technology [16]. Surpassing
that dominated the ImageNet challenge and brought DL into the traditional bounds of Moore’s law, training compute has
the AI spotlight. The development of generative adversarial doubled approximately every six months, introducing the
networks (GANs) in 2014 [9] introduced a novel genera- large-scale era around late 2015 [17]. This era, marked by the
tive modeling approach, impacting unsupervised learning need for 10 to 100 times more compute power for ML mod-
and image generation. The introduction of the transformer els, has significantly increased the demand for computational
model in 2017 [10], and subsequently bidirectional encoder resources and expertise in advanced ML systems [14–16,
representations from transformers (BERT) in 2018 [11], rev- 18]. One of the most noted manifestations of this growth is
olutionized natural language processing (NLP), setting new the expansion of the largest dense models in DL. Since the
performance benchmarks and highlighting the significance 2020s, these models have expanded from one hundred mil-
of context in language understanding. These milestones not lion parameters to over one hundred billion, primarily due to
only mark critical points in AI but also showcase the diverse advancements in system technology, including model paral-
methodologies and increasing sophistication in ML and DL. lelism, pipeline parallelism, and zero redundancy optimizer
Numerous practical advantages have been offered by DL, (ZeRO) optimization techniques [17]. These changes made it
revolutionizing various fields. One of the primary benefits possible to train larger and better models, which has changed
is its ability to automatically extract features from raw data, how we handle ML. As computational capabilities continue
significantly reducing the need for manual feature engineer- to expand, the increase in graphics processing unit (GPU)
ing. This capability is particularly impactful in domains memory, from 16 GB to 80 GB, struggles to keep pace with
with complex data structures, such as image and speech the exponential growth in the computational demands of ML
recognition, where traditional methods struggle to achieve models [14]. This gap tests the limits of current hardware
high accuracy [7, 8]. In healthcare, DL models are used and magnifies the importance of more efficient utilization
for analyzing medical images to detect diseases like can- of available resources. The integration of ML with high-
cer, providing early and accurate diagnoses that are crucial performance computing (HPC), or high-performance data
for effective treatment [8]. For instance, CNNs have been analytics (HPDA), has been pivotal in this context, enabling
successfully applied to mammography images to identify faster ML algorithms and facilitating the resolution of more
breast cancer with higher precision than traditional meth- complex problems [14, 16, 18]. Advanced techniques like
ods [8]. In industrial applications, DL enhances quality DeepSpeed [15] and ZeRO-Infinity [17] further demonstrate
control processes by detecting defects in products on assem- how innovative system optimizations can push the bound-
bly lines, thus improving efficiency and reducing waste. aries of DL model training.
Additionally, in the automotive industry, DL is a corner- Nevertheless, the continued increase in model size and
stone of autonomous driving technology, enabling vehicles complexity underscores the need for model optimization [19–
to interpret and respond to their environment in real-time. 23]. Compressing ML models emerges as a vital approach,
Beyond these specialized applications, DL also impacts the reducing the disparity between escalating computational
daily lives of citizens through various consumer technolo- demands and inadequate memory expansion by adjusting
gies. Mobile phone applications, such as virtual assistants models to be compressed without significantly affecting per-
(e.g., Siri, Google Assistant), rely on DL to understand and formance. This approach encompasses several techniques,
process natural language commands, providing users with quantization, and knowledge distillation [24–29]. Model
convenient and hands-free interaction with their devices [12]. compression not only addresses the challenge of deploying
Furthermore, facial recognition technology, powered by DL, AI systems in resource-constrained environments, such as
is used for secure authentication in smartphones, enhancing mobile devices and embedded systems, but also improves
both security and user experience [12]. Personalized recom- the efficiency and speed of these models, making them more
mendations on platforms like Netflix, Amazon, and Spotify accessible and scalable. For instance, in mobile applica-
also utilize DL algorithms to analyze user behavior and tions, compressed models enable faster inference times and
preferences, delivering tailored content and improving user lower power consumption, which are critical for enhancing
satisfaction [13]. These practical examples highlight how DL user experience and extending battery life. Additionally, in
not only pushes the boundaries of AI but also provides sig- edge computing scenarios, where computational resources
nificant improvements and solutions to real-world problems are limited, compressed models facilitate real-time data pro-
across various sectors, making everyday life more efficient cessing and decision-making, enabling a wide range of
and convenient. applications from smart home devices to autonomous drones.
The expanding frontiers of ML and DL expose a paradox- In essence, model compression becomes not just a beneficial
ical combination of advancement and limitation. [14, 15]. strategy, but a necessity for the practical deployment of AI
The exponential growth in training compute for large-scale systems, particularly in environments where resources are
11806 P. V. D antas et al.

inherently limited. By optimizing the balance between model DL in smart manufacturing, and reinforcement learning
complexity and computational efficiency, model compres- (RL) in supply chain optimization.
sion ensures that the advancements in AI technology remain 4. Future research directions: the paper advocates for
sustainable and widely applicable across various domains future studies to focus on hybrid compression methods
and industries [24–29]. that combine multiple techniques for enhanced effi-
In conclusion, the rapid advancement in training com- ciency. Additionally, we suggest the development of
pute for DL models marks a remarkable era of technological autonomous selection frameworks that can intelligently
progress, juxtaposed with significant challenges. The stark choose the most suitable compression strategy based on
contrast between the exponential demands of these models the specific requirements of the application.
and the more modest growth in GPU memory capacity under- 5. Practical examples and applications: to bridge the gap
scores a pivotal issue in the field of ML. It is this imbalance between theory and practice, we provide practical exam-
that necessitates innovative approaches like model compres- ples and case studies demonstrating the application of
sion and system optimization. As we proceed, this paper will model compression techniques in real-world scenarios.
delve deeper into these challenges, exploring the intricacies These examples illustrate how model compression can
of model compression techniques and their critical role in lead to significant improvements in computational effi-
optimizing large-scale ML and DL models. We will exam- ciency without compromising model accuracy.
ine how these techniques not only address the limitations of 6. Innovative solutions for efficient ML: we propose
current hardware but also open new avenues for efficient, innovative solutions for improving the efficiency and
practical deployment of AI systems in various real-world effectiveness of ML models in resource-constrained envi-
scenarios. ronments. This includes the development of lightweight
model architectures and the integration of advanced com-
1.2 Main contributions and novelty pression techniques to facilitate the deployment of ML
models in practical, real-world applications.
This paper makes significant contributions to the field of
model compression techniques in ML, focusing on their The novelty of this paper lies in its approach to under-
applicability and effectiveness in resource-constrained envi- standing and advancing model compression techniques. By
ronments such as mobile devices, edge computing, and synthesizing existing knowledge and identifying critical
internet of things (IoT) systems. The main contributions of research gaps, we provide a comprehensive roadmap for
this paper are as follows: future research in this domain. Our focus on practical applica-
tions and innovative solutions further enhances the relevance
and impact of this work, making it a valuable resource for
1. Comprehensive review of model compression tech- both researchers and practitioners in the field of ML.
niques: we provide an in-depth review of various model
compression strategies, including pruning, quantization,
low-rank factorization, knowledge distillation, trans-
1.3 Emerging areas and research gaps
fer learning, and lightweight design architectures. This
The existing literature provides a comprehensive overview
review not only covers the theoretical underpinnings
of various model compression techniques and their applica-
of these techniques but also evaluates their practical
tions across different domains. However, there is a noticeable
implementations and effectiveness in different opera-
gap in addressing the specific challenges and advancements
tional contexts.
in many ML applications. Key areas where further research
2. Highlighting the balance between performance and
is necessary and emerging areas that leverage the latest
computational demand: our synthesis reveals the dyna-
developments in ML and model compression techniques to
mic interplay between model performance and compu-
enhance performance, efficiency, and reliability include:
tational requirements. We emphasize the necessity for a
balanced approach that optimizes both aspects, crucial
for the sustainable development of AI. 1. Digital twin-driven intelligent systems: digital twins,
3. Identification of research gaps: by examining the cur- virtual replicas of physical systems, offer significant
rent state of model compression research, we identify potential for real-time monitoring and predictive mainte-
critical gaps, highlighting the need for more research on nance. Current research lacks a thorough exploration of
integrating digital twins, physics-informed residual net- how digital twins can be integrated with advanced ML
works (PIResNet), advanced data-driven methods like models. Integrating model compression techniques can
gated recurrent units for better predictive maintenance further enhance their efficiency by reducing the compu-
of industrial components, predictive maintenance using tational burden during real-time monitoring [30–33].
A comprehensive review of model... 11807

2. PIResNet: traditional ML models have been extensively included based on the following criteria: detailed discussion
studied, but the incorporation of physical laws into these on model compression techniques; empirical evaluation of
models, such as in PIResNet, remains underexplored. compression methods on ML models; availability of perfor-
This approach can enhance model accuracy and reliabil- mance metrics like compression ratio, speedup, and accuracy
ity by embedding domain-specific knowledge. Applying retention; and relevant real-world applications. Exclusion
model compression techniques can optimize PIResNet criteria involved: papers not in English, reviews without
for deployment in resource-constrained environments original research, and studies focusing solely on theoretical
without sacrificing diagnostic accuracy [34–36]. aspects without empirical validation.
3. Gated recurrent units (GRU): there is a need for inno- Data extracted from the selected studies included author
vative data-driven approaches that leverage multi-scale names, publication year, compression technique evaluated,
fused features and advanced recurrent units, like GRUs. model used, datasets, performance metrics (e.g., compres-
Existing studies often focus on conventional methods, sion ratio, inference speedup, accuracy), and key findings.
missing the potential benefits of these sophisticated tech- A thematic synthesis approach was used to categorize the
niques. Incorporating model compression techniques can compression techniques and summarize their effectiveness
further enhance the applicability of these approaches by across different applications and model architectures. The
reducing memory and computational requirements [37– synthesis involved comparing and contrasting the effective-
40]. ness of different model compression techniques, highlighting
4. Predictive maintenance using DL in smart manu- their advantages and limitations. The impact of these tech-
facturing: predictive maintenance involves using ML niques on computational efficiency, model size reduction,
models to predict equipment failures before they occur, and performance metrics was analyzed to identify trends and
allowing for timely maintenance and reducing downtime. potential areas for future research.
Current research gaps include optimizing DL models
for deployment in smart factories by integrating them 1.5 Paper organization
with IoT devices for continuous monitoring and real-
time analysis. Applying model compression techniques The structure of this paper has been systematically designed
can make these models more efficient for real-time data to guide the reader through a comprehensive exploration of
processing [41, 42]. model compression techniques in ML. The sections are orga-
5. RL in supply chain optimization: RL algorithms learn nized as follows:
optimal policies through interactions with the environ- Section 1. Introduction: the significance of model com-
ment, making them well-suited for dynamic and complex pression in enhancing the efficiency of ML models, espe-
systems like supply chains. Current research gaps include cially in resource-constrained environments, is introduced.
optimizing various aspects such as inventory manage- An overview of the main contributions and the novelty of
ment, demand forecasting, and logistics by simulating this paper is provided.
different scenarios and learning from outcomes. To make Section 2. Challenges in machine learning (ML) and
RL models more feasible for real-time application in sup- deep learning (DL): the historical context and evolution of
ply chain operations, model compression techniques can ML and DL are discussed, highlighting key milestones and
be utilized to reduce the model’s complexity and enhance the exponential growth in computational demands.
operational efficiency, facilitating faster decision-making Section 3. Common model compression approaches:
processes [43–45]. key model compression techniques such as pruning, quan-
tization, low-rank factorization, knowledge distillation, and
1.4 Material and methods transfer learning are delved into. Detailed explorations of
each technique, including theoretical foundations, practical
A systematic literature search was conducted across several implementation considerations, and their impact on model
databases, including IEEE Xplore [46], ScienceDirect [47], performance, are provided.
and Google Scholar [48]. Keywords related to model com- Section 4. Lightweight model design and synergy with
pression techniques such as pruning, quantization, knowl- model compression techniques: an overview of lightweight
edge distillation, transfer learning, and lightweight model model architectures, such as SqueezeNet, MobileNet, and
design were used. The search was limited to papers published EfficientNet, is presented. The design principles and the
in the last decade to ensure relevance and innovation in the synergy with model compression techniques to achieve
fields of ML and AI, including classical papers. Studies were enhanced efficiency and performance are discussed.
11808 P. V. D antas et al.

Section 5. Performance evaluation criteria: the crite- 2 Challenges in machine learning (ML) and
ria for evaluating the performance of compressed models, deep learning (DL)
including metrics like compression ratio, speed-up rate, and
robustness metrics, are discussed. The importance of bal- 2.1 Computational demands vs. computational
ancing model performance with computational demand is memory growth
emphasized.
A model serves as a mathematical construct that represents
Section 6. Model compression in various domains:
a system or process. This construct is primarily used for the
recent innovations in model compression are highlighted,
purpose of prediction or decision-making based on the anal-
and case studies demonstrating the application of these
ysis of input data. Typically, DL models are DNNs, which
techniques in various domains are presented. The signifi-
consist of numerous interconnected nodes or neurons. These
cant improvements in computational efficiency achieved by
nodes collectively process incoming data to produce output
compressed models without compromising performance are
predictions or decisions, as depicted in Fig. 1. DL models can
illustrated.
be implemented using a variety of programming frameworks,
Section 7. Innovations in model compression and
including TensorFlow [49] and PyTorch [50].
performance enhancement: The applications of model
The training process of these models involves the use of
compression techniques across various fields are explored,
substantial datasets, aiming to refine their predictive accuracy
demonstrating their implementation in real-world scenar-
and enhance their generalization capabilities across unseen
ios such as mobile devices, edge computing, IoT systems,
data. The training of a DL model is a critical process where
autonomous vehicles, and healthcare. Specific examples
large volumes of data are employed to iteratively adjust the
illustrate the practical benefits and challenges of deploying
model’s internal parameters, such as weights and biases. This
compressed models in these environments.
adjustment process, known as backpropagation, involves the
Section 8. Challenges, strategies, and future directions:
computation of the gradient of the loss function - a mea-
future research directions in model compression are out-
sure of model error - relative to the network’s parameters.
lined. Potential advancements and innovations that could
The optimization of these parameters is executed through
enhance the efficiency and applicability of model compres-
algorithms like stochastic gradient descent (SGD), aiming to
sion techniques are discussed, including hybrid methods
minimize the loss function and thereby improve the model’s
and autonomous selection frameworks. This section aims to
performance. Various programming environments, such as
inspire further research to address existing gaps.
TensorFlow and PyTorch, provide sophisticated application
Section 9. Discussion: the findings from the compre-
programming interface (API) that support the development
hensive review of model compression techniques are syn-
and training of complex DNN architectures. These environ-
thesized. The implications for future research and practical
ments also offer access to a range of pre-trained models,
applications are evaluated, research gaps are identified, and
which can be directly applied or further fine-tuned for tasks
future directions are suggested.
in diverse domains, including image recognition, NLP, and
Section 10. Conclusion: the paper concludes with an
beyond.
exploration of recent innovations in model compression and
performance enhancement. The ongoing advancements in the
field and the potential for future research to optimize ML
models are underscored.
Appendix A. Comprehensive summary of the refer-
ences used in this paper: summary of references used in
this paper, categorized by their specific application areas.
This table provides a comprehensive overview of the key
publications that have been referenced throughout the study,
offering insights into the foundational and recent advance-
ments in each area.
This organization ensures a logical progression from
introducing the importance of model compression to explor-
ing specific techniques, discussing their applications, and Fig. 1 A NN commonly used in DL scenarios. The illustration show-
cases the network’s architecture, highlighting the input layer, hidden
concluding with future research directions. The structure layers, and output layer. Each node represents a neuron, and the con-
provides a clear roadmap for readers, facilitating a deeper nections between them indicate the pathways through which data flows
understanding of the topic. and weights are adjusted during training
A comprehensive review of model... 11809

2.2 Model size and complexity also efficient enough to be deployed in real-world applica-
tions.
Model compression in DL is a technique aimed at reducing Achieving this balance involves making trade-offs between
the size of a model without significantly compromising its the size, speed, and accuracy of the models. Techniques
predictive accuracy. This process is vital in the context of such as pruning, quantization, low-rank factorization, and
deploying DL models on resource-constrained devices, such knowledge distillation are pivotal in this regard. For instance,
as mobile phones or IoT devices. By compressing a model, pruning reduces the number of parameters by eliminating less
it becomes feasible to utilize advanced DL capabilities in significant ones, which can decrease computational require-
environments where computational power and storage are ments while maintaining performance [89, 90]. Quantization
limited. further enhances efficiency by reducing the precision of
The performance of a DL model is fundamentally its model parameters, thereby decreasing memory usage and
ability to make accurate predictions or decisions when accelerating computation [23, 25]. Low-rank factorization
confronted with new, unseen data. This performance is decomposes large weight matrices into smaller matrices,
quantitatively measured through various metrics, including which can capture essential information with fewer param-
accuracy, precision, recall, F1-score, and area under the curve eters [91, 92]. Knowledge distillation involves training a
of receiver operating characteristic (AUC-ROC). The selec- smaller model to replicate the behavior of a larger, well-
tion of these metrics is contingent upon the specific nature trained model, effectively transferring knowledge while
of the problem being addressed and the type of model in use, reducing computational complexity [93, 94]. This technique
ensuring a comprehensive evaluation of the model’s effec- is particularly useful for deploying models in environments
tiveness in real-world applications. with limited resources without significantly sacrificing accu-
We have conducted a comprehensive analysis that delin- racy.
eates the performance retention, model size reduction, and For instance, lightweight models like MobileNet and
other critical dimensions across different compression tech- SqueezeNet are designed to operate efficiently on mobile
niques. This comparison elucidates the nuanced distinctions devices, with MobileNet using depthwise separable con-
between the methods, counteracting the impression that all volutions to reduce computational load while maintaining
techniques yield similar outcomes in performance mainte- accuracy [95], and SqueezeNet achieving AlexNet-level
nance and size reduction. In Table 1 is encapsulated the com- accuracy with significantly fewer parameters through the use
parative analysis of these methods, addressing the strengths of fire modules [96]. In edge computing scenarios, models
and drawbacks of each. This table provides a nuanced view must balance performance with the limited computational
of how each model compression method balances between capacity of edge devices, utilizing techniques such as quan-
model size reduction and performance retention, along with tization and pruning to ensure real-time inference without
their computational efficiency and application suitability. excessive latency [97]. For IoT applications, model compres-
This comparison should clarify the unique attributes and sion is crucial for deploying intelligent analytics on devices
trade-offs of each model compression technique, offering a with stringent power and memory constraints, with tech-
more refined understanding of their individual and compar- niques like low-rank factorization and knowledge distillation
ative impacts [51–55]. creating compact models that can operate efficiently in such
In Table 2 is presented an overview of model compression environments [86].
approaches applied across various ML application domains. In conclusion, the interplay between model performance
It summarizes the most suitable techniques for specific fields, and computational demand is a dynamic challenge that neces-
such as image and speech analysis, highlighting the bene- sitates a balanced approach. By leveraging various model
fits and limitations of each approach. This comprehensive compression techniques, it is possible to develop efficient
comparison aims to illustrate the effectiveness of pruning, models that are suitable for deployment in a variety of
quantization, low-rank factorization, knowledge distillation, resource-constrained environments, thus advancing the prac-
and transfer learning in reducing model size, retaining per- tical application of AI technologies.
formance, and enhancing computational efficiency.

2.3 Resource allocation and efficiency 3 Common model compression approaches

Balancing model performance with computational demand is This section delves into key model compression techniques
a critical consideration in the development and deployment of in DNNs. Each technique addresses the challenge of deploy-
ML models, especially in resource-constrained environments ing advanced DNNs in scenarios with limited computational
such as mobile devices, edge computing, and IoT systems. power, such as mobile devices and edge computing plat-
This balance ensures that models are not only accurate but forms, highlighting the trade-offs between model size reduc-
11810

Table 1 Various model compression methods are evaluated, detailing their compression ratios, performance retention, computational efficiency, key strengths, and potential drawbacks, providing
insights into their suitability for different applications
Technique Compression Ratio Performance Retention Computing Efficiency Key Strength Drawbacks

Pruning Reduces overhead effectively Risk of over-pruning

Quantization Increases speed Errors can impact performance
Low-rank factorization Efficient in redundancy reduction Limited in non-redundant models
Knowledge distillation Smaller models perform well Possible performance gap
Transfer Learning Saves resources, improves performance Risk of negative transfer
P. V. D antas et al.
A comprehensive review of model... 11811

Table 2 A comprehensive overview of model compression techniques applied across various application domains in ML
Application Domain P Q L K T Rationale

Image classification [53, 56–58] Efficient for reducing model size and speeding up
inference, leveraging redundancy in CNNs without
major accuracy loss.
Speech recognition [59–62] Enhances real-time processing; knowledge distilla-
tion simplifies complex models for edge deployment.
NLP [63–65] Manages large matrix operations efficiently, crucial
for maintaining performance in translation and senti-
ment analysis.
Real-time applications [66, 67] Minimizes latency and resource use on constrained
devices, essential for immediate responses.
Domain-specific tasks [68–70] Adapts pre-trained models to new environments effi-
ciently, optimizing for performance and efficiency.
Model deployment on edge devices [67, 71, 72] Balances model complexity and deployment feasibil-
ity on devices with limited resources.
Autonomous vehicles [73–76] Benefits from reduced model sizes and faster infer-
ence times for real-time decision-making.
Augmented/virtual reality [77–79] Ensures high-speed processing for immersive expe-
riences through efficient computation and reduced
model sizes.
Recommender systems [62, 80, 81] Efficiently captures essential information from vast
amounts of sparse data, enhancing speed and perfor-
mance.
Medical image analysis [82–85] Quickly adapts existing models to specific medical
tasks and optimizes them for efficient analysis without
sacrificing accuracy.
IoT applications [70, 86–88] Produces lightweight models enabling smarter, real-
time analytics at the edge with stringent power and
computational constraints.
Each domain is paired with commonly used compression approaches and their underlying rationale, highlighting the benefits and potential limitations
in terms of model size reduction, performance retention, and computational efficiency. The letter P stands for pruning, Q for quantization, L for
low-rank factorization, K for knowledge distillation, and T for transfer learning

tion and performance retention. This exploration underlines teacher model to a smaller student model, retaining high
the importance of innovative approaches to model compres- accuracy with fewer parameters. Transfer learning lever-
sion, essential for the practical application of DNNs across ages pre-trained models on extensive datasets to adapt to
various domains. Readers are guided through a detailed new tasks, minimizing the need for extensive data collec-
exploration of model compression techniques in DNNs, espe- tion and training. Lightweight design architectures, such
cially in scenarios where resources are constrained. Each as SqueezeNet and MobileNet, are engineered with fewer
technique is explained, encompassing theoretical founda- parameters and lower computational requirements with-
tions, practical implementation considerations, and their out significantly compromising accuracy. Collectively, these
direct impact on model performance, including accuracy, techniques address the challenge of deploying advanced
inference speed, and memory utilization. ML models in resource-constrained environments, balanc-
Pruning systematically removes less significant parame- ing model performance with computational demand, and
ters from a DNN to reduce size and computational complex- highlighting their importance in efficient and sustainable AI
ity while maintaining performance. Quantization reduces the development.
precision of model parameters to lower-bit representations,
decreasing memory usage and speeding up computation, 3.1 Pruning
which is ideal for constrained devices. Low-rank factoriza-
tion decomposes large weight matrices into smaller, low- Pruning is a process for enhancing ML model efficiency
rank matrices, capturing essential information and reducing and effectiveness. By systematically removing less signifi-
model size and computational demands. Knowledge dis- cant parameters of a DNN, pruning reduces the model’s size
tillation transfers knowledge from a larger, well-trained and computational complexity without substantially com-
11812 P. V. D antas et al.

promising its performance [19–22, 98, 99]. This practice is of pruning a distributed CNN for IoT performance enhance-
especially vital in contexts where storage and computational ment are illustrated through a case study, emphasizing the
resources are limited. In Fig. 2 it is depicted one illustration complexity of achieving optimal pruning without compro-
of pruned DNN. mising model integrity [114]. These discussions collectively
Pruning involves the selective removal of network param- underscore the importance of research focusing on devel-
eters (weights and neurons) that contribute the least to the oping pruning methodologies that reduce model size and
network’s output [100]. This process leads to a compressed computational demands and safeguard against the loss of
and efficient model, facilitating faster inference times and essential information.
reduced energy consumption [101]. The common types of Pruning allows for the creation of more efficient and com-
pruning includes neuron pruning, which involves removing pressed ML models. While it involves a trade-off between
entire neurons or filters from the network [102]. It is com- model size and performance, with careful implementation,
monly used in CNNs and targets neurons that contribute less it can significantly enhance computational efficiency. Ongo-
to the network’s ability to model the problem. By removing research in this field continues to refine and develop new
ing these neurons, the network’s complexity is reduced, pruning techniques, making it a dynamic and essential aspect
potentially leading to faster inference times [59]. Weight of DNN optimization.
pruning, focused on eliminating individual weights within
a DNN [103], involves identifying weights with minimal 3.2 Quantization
impact (typically those with the smallest absolute values)
and setting them to zero. This process creates a sparse Quantization serves as a pivotal technique for model com-
weight matrix, which can significantly reduce the model’s pression, playing a key role in enhancing computational
size and computational requirements [104]. Structured prun- efficiency and reducing storage requirements [23–27, 69].
ing focuses on removing larger structural components of a This process is particularly critical in deploying DNN mod-
network, such as entire layers or channels [105]. It is aligned els on devices with limited resources. For example, most
with hardware constraints and optimizes for computational modern DNN are made up of billions of parameters, and
efficiency and regular memory access patterns. the smallest large language model (LLM) has 7B param-
The parameters of DNN are selected based on their impact eters [115]. If every parameter is 32 bit, then it is needed
on the output. Techniques like sensitivity analysis or heuris- (7 × 109 ) × 32 = 112Gbit just to store the parameter on
tics are often used to identify these parameters [106]. Various disk. This implies that large models are not readily accessible
algorithms, like magnitude-based pruning or gradient-based on a conventional computer or on an edge device. Quan-
approaches, are employed to determine and execute the tization refers to the process of reducing the precision of
removal of parameters. These methods frequently involve the DNN’s parameters (weights and activations), simplify-
iteratively pruning and testing the network to find an optimal ing the model, leading to decreased memory usage and faster
balance between size and performance [107]. After pruning, computation, without significantly compromising model per-
it’s essential to evaluate the pruned model’s performance to formance. Quantization aims to reduce the total number of
ensure that accuracy or other performance metrics are not bits required to represent each parameter, usually converting
significantly compromised [108]. Re-training or fine-tuning floating-point numbers into integers [116].
the pruned network is typically required to recover any loss Uniform quantization involves mapping input values to
in accuracy [56]. Post-pruning, it is crucial to validate the equally spaced levels. It typically converts floating-point rep-
model on a relevant dataset to ensure that its accuracy and resentations into lower-bit representations, like 8-bit integers.
efficiency meet the required standards [109]. Uniform quantization simplifies computations and reduces
Pruning is emphasized as a vital technique for remov- model size, but it must be carefully managed to avoid signifi-
ing excess in oversized models [90, 110]. Although, the cant loss in model accuracy [117]. Non-uniform quantization
main challenge arises from over-pruning, which can result uses unevenly spaced levels, which are often optimized for
in the loss of crucial information, adversely affecting the the specific distribution of the data. Techniques like log-
model’s performance [90, 110]. Researchers have argued for arithmic or exponential scaling are used to allocate more
the necessity of optimized DNN approaches that meticu- levels where the data is denser. Non-uniform quantization
lously avoid the negative consequences of over-pruning [111, can be more efficient in representing complex data distri-
112]. The conversation extends to the impact of over-pruning butions, potentially leading to better preservation of model
on cloud-edge collaborative inference, with suggestions for a accuracy [118]. Post-training quantization involves applying
more conservative approach to network pruning to maintain quantization to a model after it has been fully trained. It sim-
model effectiveness [97, 113]. This reflects a consensus on plifies the process as it doesn’t require retraining; however, it
the need to preserve essential information while streamlining may require calibration on a subset of the dataset to maintain
models for efficiency. Moreover, the optimization challenges accuracy [119].
A comprehensive review of model... 11813

Fig. 2 The process of weight pruning in a DNN. (a) Shows the original non-essential. (c) Displays the pruned connections with dashed lines,
DNN with all nodes and connections intact. (b) Highlights the nodes illustrating the streamlined network structure after the less significant
that have been pruned, indicating the parts of the network identified as weights have been removed

Selecting the right parameters (like weights and activa- impact these errors can pose on model accuracy [127]. More-
tions) to quantize is crucial. The selection is based on their over, the development of methodologies such as sharpness-
impact on output and the potential for computational savings. and quantization-aware training (SQuAT) has been shown to
Techniques include linear quantization, which maintains a mitigate the challenges of quantization and enhance model
linear relationship between the quantized and original val- performance [128]. These studies collectively underscore
ues, and non-linear quantization, which can better adapt to the critical need for continued innovation in quantization
data distribution [70]. These methods often require additional techniques, aiming to minimize the adverse effects of quan-
consideration to ensure minimal impact on the model’s per- tization errors while leveraging the efficiency gains offered
formance. It is crucial to assess the post-quantification to by this approach in the field of DL.
ensure that there is no significant loss in accuracy or effi-
ciency. In some cases, fine-tuning the quantized model can 3.3 Low-rank factorization
help regain any lost accuracy. Methods include retraining
the model with a lower learning rate or using techniques Low-rank factorization is a way to make NN smaller and sim-
like knowledge distillation [120]. It is important to check pler without making it less effective. This method focuses on
if the model works well with a specific set of data to decomposing large, dense weight matrices found in DNN
make sure it is accurate and fast [121]. Quantization effec- into two smaller, lower-rank matrices [28, 29, 129]. The
tively compresses DNN models, enabling their deployment essence of low-rank factorization is its ability to combine
in resource-constrained environments. It helps make DNN these two resulting data matrices to approximate the orig-
simpler and faster by reducing the number of computational inal, thereby achieving compression. This method reduces
requirements [122]. Advancements in quantization methods the model size and data processing demands, making CNN
continue to focus on maintaining model performance while more suitable for applications in resource-limited environ-
maximizing compression [71]. ments [91]. This process aims to capture the most significant
Quantization plays a significant role in model size reduc- information in the network’s weights, allowing for a more
tion and inference speed, suitable for mobile and edge com- compact representation with minimal loss in performance.
puting and real-time applications [121, 123–125]. Despite Matrix decomposition and tensor decomposition are com-
its benefits, it carries the risk of introducing quantization mon applied techniques. Matrix decomposition in low-rank
errors that can significantly impair model accuracy, espe- factorization involves breaking down large weight matri-
cially in complex DL models [121, 123]. This concern ces into simpler matrix forms. Single value decomposition
is well-documented in the literature, with several studies (SVD) is a common method, where a matrix is decom-
addressing the impact of quantization on model accuracy posed into three smaller matrices, capturing the essential
and proposing various methods to mitigate these effects. features of the original matrix. This reduces the number of
For instance, research has highlighted the effectiveness of parameters in DNN models, leading to less storage and com-
higher-bits integer quantization, thereby achieving a balance putational requirements, while striving to maintain model
between reduced model size and maintained accuracy [126]. performance [130]. Extending beyond matrix decompo-
Additionally, the risks associated with quantization errors sition, tensor decomposition deals with multidimensional
have been explicitly discussed, emphasizing the negative arrays (tensors) in DNNs. Techniques like canonical polyadic
11814 P. V. D antas et al.

decomposition (CPD) or tucker decomposition are used, 94, 105]. It works by transferring experience from a large-
which factorize a tensor into a set of smaller tensors [131]. scale model (teacher) to a smaller-scale model (student),
Tensor decomposition is particularly effective for com- enhancing the latter’s performance without the computa-
pressing CNNs, often achieving higher compression rates tional intensity of the former [139]. At its core, knowledge
compared to matrix decomposition. distillation is about extracting the informative aspects of a
Layers with larger weight matrices or those contributing large model’s behavior and instilling this knowledge into a
less to the output variance are prime candidates for factor- smaller model. This approach allows for the retention of high
ization. Techniques like sensitivity analysis can help identify accuracy in the student model while significantly reducing
these layers [132]. The model’s accuracy and dimension need its size and complexity.
to be assessed after the factorization process. Metrics like In a teacher-student model framework, a large-scale, well-
accuracy, inference time, and model size are key consider- trained model is employed to guide the implementation of a
ations. Fine-tuning the factorized network can help recover smaller-scale model. The large-scale network provides guid-
any loss in accuracy due to the compression [133]. This might ance to the small-scale network [116]. The small-scale model
involve continued training with a reduced learning rate or aims to mimic the large-scale model’s output while having
applying techniques like knowledge distillation. It is recom- fewer parameters and computational complexity. The small-
mended to validate the factorized model on a relevant dataset scale model is optimized to infer the correct categorization,
to ensure that it still meets the required performance stan- and also to replicate the large-scale model’s output (predic-
dards. tions or intermediate features). The small-scale model can be
Low-rank factorization efficiently reduces redundancies trained to match the softmax output of the large-scale model,
in models, particularly in fully connected layers [28, 29, 134]. or to match its feature representations. There are common
Low-rank factorization faces challenges in its broad appli- loss functions that measure how closely the small-scale out-
cability, despite its effectiveness in reducing redundancies puts match the large-scale outputs [140]. Distillation loss,
in large-scale models. Its suitability is somewhat limited to for example, helps the small-scale model to learn the behav-
scenarios with significant redundant information in fully con- ior of the large-scale model, going beyond mere categorical
nected layers, indicating a constraint in its versatility across inferring. Knowledge distillation is especially effective at
different types of models [28, 29]. For instance, the effec- simplifying models’ complexity, making it suited for appli-
tiveness of approaches like Kronecker tensor decomposition cations in limited-resource systems. The small-scale model’s
in compressing weight matrices and reducing the parameter performance is similar to the large-scale model, but it requires
dimension in CNN highlights the potential of low-rank fac- fewer computational resources.
torization techniques in specific contexts [135]. However, the While it may seem that the overall performance of
literature indicates that the applicability of low-rank factor- small-scale models would decrease compared to large-scale
ization may be somewhat constrained, reflecting limitations, models, the primary goal of knowledge distillation is not to
especially in models with varying architectural complex- achieve identical performance across all tasks but rather to
ities or those not characterized by significant redundancy maintain similar performance on specific tasks while reduc-
in their fully connected layers [136, 137]. Such challenges ing model size and computational complexity. However,
underscore the necessity for ongoing research to expand the these large models often come with significant computational
scope and efficacy of low-rank factorization methods, poten- costs and memory requirements, making them impracti-
tially through approaches that can be applicable to a broader cal for deployment in resource-constrained environments or
spectrum of DL architectures without compromising model real-time applications. By distilling the knowledge from a
performance or efficiency. large-scale model into a smaller counterpart, the goal is to
Low-rank factorization is an effective approach for com- retain the essential information and decision-making capabil-
pressing DNNs, particularly useful in environments with ities necessary for specific tasks while reducing the model’s
limited computational resources. The compromise between size and computational demands. While it is true that small-
model size and precision is inevitable, but an optimiza- scale models may not match the performance of large-scale
tion can mitigate the impact of performance dips. Ongoing models across all tasks, the focus is on achieving compara-
research in this area continues to explore more efficient fac- ble performance on targeted tasks of interest while benefiting
torization techniques and their applications in various types from the efficiency and speed advantages of smaller mod-
of NNs [138]. els. The aim is not to replicate the exact performance of
the large model but to strike a balance between model size,
3.4 Knowledge distillation computational efficiency, and task-specific performance,
making knowledge distillation a valuable technique for
Knowledge distillation is primarily utilized in the domain model compression and optimization in practical applica-
of DNNs for model compression and optimization [63, 93, tions.
A comprehensive review of model... 11815

Distillation techniques can significantly enhance the per- scratch. In essence, transfer learning involves taking a model
formance of smaller models, often outperforming models established for one purpose and repurposing it for a different
trained in standard ways [141]. Knowledge distillation has but related task. The assumption behind this strategy is that
been successfully adopted in areas like computational vision, the knowledge gained by a model in learning one task can be
NLP, and speech recognition, demonstrating its versatility beneficial in learning another, especially when the tasks are
and effectiveness. The choice of large and small-scale mod- similar [149–155].
els is crucial. Too complex a teacher can make the distillation In the feature extractor approach, a model employed on
process less effective. The architecture of the large and small- a large dataset is applied as a fixed feature extractor. The
scale models also has a significant impact on the distillation pre-training layers encompass common capabilities that are
process’s success. Tuning hyperparameters such as tempera- applicable to a diverse set of tasks. Common in image and
ture in softmax and the weight of distillation loss is vital for speech recognition tasks, this method is beneficial when
achieving optimal performance in the student model [142]. there’s limited training data for the new task [156]. It allows
Knowledge distillation, well-known for its capacity to for leveraging complex features learned by the model without
encapsulate the functionalities of larger models into forms extensive retraining. Fine-tuning involves adjusting a pre-
that are suited to deployment in resource-restricted set- conditioned model by continuing the learning process on a
tings, faces a series of intricate challenges and drawbacks. different dataset. This approach often involves modifying the
A notable issue is the deployment of extensive pre-trained model design in order to better suit the upcoming task, and
language models (PLM) on devices with limited memory, then training these layers (or the entire model) on the new
which necessitates a delicate balance to optimize perfor- data. Fine-tuning can lead to more tailored and accurate mod-
mance without overwhelming system resources [143]. Fur- els for specific tasks.
thermore, the generalization capacity of distilled models may Transfer learning drastically reduces the time and resources
be compromised when utilizing public datasets that differ required to develop effective models, as the initial learning
from the training datasets, diluting the model’s relevance phase is bypassed. Models can achieve higher accuracy, espe-
and accuracy [144]. The constraints of existing knowledge cially in tasks where training data is scarce, by building upon
distillation-based approaches in federated learning under- pre-learned patterns and features [157]. Transfer learning
score the need for innovative solutions to address the scarcity has seen successful applications in areas like medical image
of cross-lingual alignments for knowledge transfer and analysis [158], NLP [159], and autonomous vehicles [160],
the potential for unreliable temporal knowledge discrepan- showcasing its versatility. The selection of the preconditioned
cies [145, 146]. Additionally, the sparsity, randomness, and model should reflect the nature of the upcoming task. Factors
varying density of point cloud data in light detection and like the similarity of the datasets and the complexity of the
ranging (LiDAR) semantic segmentation present challenges model need consideration. Careful adjustment of the model
that can yield inferior results when traditional distillation is required to avoid overfitting to the new task or underfitting
approaches are directly applied [147]. These challenges high- due to insufficient training. Regularization techniques and
light the necessity for continuous exploration and refinement data augmentation can be helpful in this regard [161].
of knowledge distillation techniques to ensure they can effec- Transfer learning is not without its challenges and draw-
tively reduce model size and complexity while maintaining backs. One such challenge is the need for large and diverse
or even enhancing performance across a broad spectrum of datasets to effectively train models, coupled with the lim-
applications. ited interpretability of DL models [162]. In the context of
Knowledge distillation stands as a powerful tool in the face recognition with masks, the reduction in visible features
realm of CNN, offering an efficient way to compress mod- due to masks poses a significant challenge to maintaining
els and enhance the performance of smaller networks. As model performance, highlighting the complexity of adapt-
the demand for deploying sophisticated models in resource- ing transfer learning to new and evolving scenarios [163].
limited environments grows, knowledge distillation will Furthermore, the application of transfer learning in breast
continue going in a fundamental area of knowledge and cancer classification underscores the technique’s dependency
development, paving the way for more efficient and accessi- on domain-specific data to achieve state-of-the-art (SOTA)
ble AI applications [148]. performance, suggesting limitations in its versatility across
different domains [164]. Moreover, scenarios with limited
3.5 Transfer learning resources emphasize the need for optimized transfer learning
models [165]. The selection of appropriate transfer learning
Transfer learning is a technioque in the DNN’s domain that algorithms for practical applications in industrial scenar-
enables models to leverage pre-existing knowledge for new, ios presents another layer of complexity, underscoring the
often related tasks. This methodology significantly reduces challenge of applying transfer learning to varied real-world
the need for extensive data collection and training from applications [166]. Additionally, hypothesis transfer learn-
11816 P. V. D antas et al.

ing in binary classification highlights the balance required on minimizing model size without compromising accuracy,
between leveraging existing knowledge and adapting to SqueezeNet stands as an example of lightweight model
new tasks, further complicating the deployment of transfer design in ML.
learning in applications reliant on big data [167]. These ref- At the heart of SqueezeNet is the use of fire modules,
erences collectively underscore some challenges associated which are small, carefully designed CNN that drastically
with transfer learning, from dataset and interpretability issues reduce the number of parameters without affecting per-
to computational constraints and the risk of negative trans- formance [21, 172]. This design aligns with the growing
fer, highlighting the need for research and development to need for deployable DL models in limited-resource appli-
expand it across more comprehensive applications. cations, such as smartphones and embedded systems. The
Transfer learning represents a significant leap in training compact nature of SqueezeNet also offers significant ben-
DNNs, offering a practical and efficient pathway to model efits in terms of reduced memory requirements and faster
development and deployment. It accelerates the training pro- computational speeds, making it ideal for real-time applica-
cess and opens up possibilities for tasks with limited data tions [96]. SqueezeNet’s architecture has also been influential
availability. As AI continues to evolve, transfer learning is in the realm of model compression. Its highly efficient design
poised to play an increasingly vital role [168]. makes it an excellent baseline for applying further com-
pression techniques. These methods enhance SqueezeNet’s
ease of deployment, particularly in scenarios where computa-
4 Lightweight model design and synergy tional resources are limited. The adaptability of SqueezeNet
with model compression techniques to various compression techniques exemplifies its versatility
and robustness as a DL model [89].
The quest for efficient and effective NN architectures is The application of SqueezeNet extends beyond theoretical
paramount. Two critical approaches emerge in this pur- research, finding practical use in areas including media anal-
suit: lightweight model design and model compression. Both ysis and mobile applications. Its influence has also paved the
methodologies aim to enhance the ease of deployment and way for future research in lightweight NN design, inspiring
performance of DNN, especially in resource-constrained the development of subsequent architectures like MobileNet
environments [57, 169]. This section delves into the concept and SqueezeNext. These models build on the foundational
of lightweight model design, exemplified by groundbreaking principles established by SqueezeNet, further pushing the
architectures, and draws connections to model compression, boundaries of efficiency in NN design [95, 173].
illustrating how these strategies collectively drive advance-
ments in the ML domain. 4.1.2 SqueezeNext architecture
Lightweight model design focuses on constructing DNN
from the ground up, with an emphasis on minimalism and SqueezeNext is an advanced CNN architecture lightweight
efficiency. This approach often involves innovating architec- model [173]. Building upon the principles of SqueezeNet,
tural elements, such as the fire modules in SqueezeNet and the SqueezeNext integrates novel design elements to achieve
low-rank separable convolutions in SqueezeNext, to reduce even greater efficiency in model size and computation.
the model’s scale and computational needs without signif- SqueezeNext stands out for its innovative architectural
icantly compromising its performance. The objective is to choices, which include low-rank separable convolutions and
create inherently efficient models that can operate effectively optimized layer configurations. These features enable it
on devices with reduced computational capacity and mem- to maintain high accuracy while drastically reducing the
ory, such as smartphones, IoT appliances, and embedded model’s size and computational demands. This efficiency
systems [169]. However, model optimization methodologies is particularly beneficial for deployment in environments
are applied to pre-existing, often more complex, DNN mod- with stringent memory and processing constraints, such as
els. The goal is to make these post-training models smaller mobile devices and edge computing platforms. The design
and easier to use in limited resource settings. These meth- of SqueezeNext demonstrates the progress made in crafting
ods are used to make networks smaller-scale. They balance models that are both lightweight and capable [173].
keeping the network efficient with reducing its size [170]. SqueezeNext’s design also contributes significantly to the
field of model compression. Its inherent efficiency provides
4.1 Overview of lightweight model architectures one foundation for applying additional compression tech-
niques. These methods further enhance the model’s suitabil-
4.1.1 SqueezeNet architecture ity for deployment in resource-limited settings, showcasing
SqueezeNext’s versatility in various application scenarios.
SqueezeNet represents a significant advancement in design- The architecture serves as a benchmark in the study of model
ing NN models [96, 171]. Developed with an emphasis compression, providing insights into achieving an optimal
A comprehensive review of model... 11817

balance between model size, speed, and accuracy [21]. The the computational burden and preserves important informa-
impact of SqueezeNext extends to practical applications in tion flowing through the network. The result is a model that
areas like image processing, real-time analytics, and IoT offers higher accuracy and efficiency, particularly in appli-
devices. cations where latency and power consumption are critical
considerations [174].
4.1.3 MobileNetV1 architecture MobileNetV2’s enhanced efficiency has significant impli-
cations for mobile and edge computing. Its ability to deliver
MobileNetV1, introduced by researchers at Google, marks high-performance ML with minimal resource usage has
a significant milestone in the development of efficient DL broadened the scope of applications possible on mobile
architectures [95]. It is specifically engineered for mobile devices. This includes advanced image and video process-
and embedded vision applications, offering a perfect blend ing tasks, real-time object detection, and augmented/virtual
of compactness, speed, and accuracy. The core innovation of reality (AR/VR) - all on devices with limited computa-
MobileNetV1 lies in its use of depth-wise separable convolutional capabilities. MobileNetV2’s architecture has set a new
tions. This design reduces the computational cost and model benchmark for developing AI models that are both power-
size compared to conventional CNNs. Depthwise separable ful and resource-efficient [175]. Its architectural innovations
convolutions split the standard convolution into two layers - have been foundational in the creation of more advanced
a depthwise convolution and a pointwise convolution - which models like MobileNetV3 and beyond, which continue to
substantially decreases the number of parameters and opera- push the boundaries of efficiency and performance in NN
tions required. This architectural choice makes MobileNetV1 design. The legacy of MobileNetV2 is evident in the ongo-
exceptionally suited for mobile devices, where computa- ing efforts to optimize DL models for the increasingly diverse
tional resources and power are limited [21, 95]. requirements of mobile and edge AI [176].
MobileNetV1’s efficient design has significantly impacted
the deployment of DL models on mobile and edge devices. 4.1.5 MobileNetV3 architecture
Its ability to deliver high performance with low latency and
power consumption has enabled a wide range of applications, MobileNetV3 represents a further refinement in the develop-
from real-time image and video processing to complex ML ment of efficient and compact DL models tailored for mobile
tasks on handheld devices. This breakthrough has opened up and edge devices. Developed by Google, MobileNetV3
new possibilities in the field of mobile computing, where the builds upon the foundations laid by its predecessors,
demand for powerful yet efficient AI models is constantly MobileNetV1 and MobileNetV2, incorporating several novel
growing [174]. architectural innovations to enhance performance while
MobileNetV1 not only stands as a remarkable achieve- maintaining efficiency [177]. One of the key innovations
ment in its own right, but also lays the groundwork for future in MobileNetV3 is the use of a neural architecture search
advancements in lightweight DL models. It has inspired a (NAS) to optimize the network structure. This automated
series of subsequent architectures, like MobileNetV2 and design process identifies the most efficient network config-
MobileNetV3, each iterating on the initial design to achieve urations, balancing the trade-offs between latency, accuracy,
even greater efficiency and performance. The principles and computational cost. Additionally, MobileNetV3 intro-
established by MobileNetV1 continue to influence the design duces squeeze-and-excitation modules, which adaptively
of NN aimed at edge computing and IoT devices [175]. recalibrate channel-wise feature responses by explicitly mod-
eling interdependencies between channels. This improves the
4.1.4 MobileNetV2 architecture model’s representational power without a significant increase
in computational burden [177].
MobileNetV2, an evolution of its predecessor MobileNetV1, MobileNetV3 also incorporates a combination of hard
further refines the concept of efficient NN design for swish (h-swish) activation functions and new efficient build-
mobile and edge devices. Introduced by Google researchers, ing blocks, such as the MobileNetV3 blocks, which include
MobileNetV2 incorporates novel architectural features to lightweight depthwise convolutions and linear bottleneck
enhance performance and efficiency, making it a standout structures. These architectural features collectively reduce
choice in the landscape of lightweight DL models [174]. the computational load and enhance the model’s perfor-
MobileNetV2 introduces the concept of inverted residu- mance on mobile and edge devices [177]. The efficiency
als and linear bottlenecks, which are key to its improved and high performance of MobileNetV3 make it particularly
efficiency and accuracy. These innovations involve using suitable for real-time applications, such as image classifi-
lightweight, depth-wise separable convolutions to filter fea- cation, object detection, and other vision-related tasks on
tures in the intermediate expansion layer, and then projecting resource-constrained devices. Its compact design ensures low
them back to a low-dimensional space. This approach reduces latency and reduced power consumption, enabling deploy-
11818 P. V. D antas et al.

ment in diverse environments, from smartphones to IoT tecture was specifically designed to address the limitations
devices [177]. and challenges observed in previous lightweight models,
The principles and techniques introduced in MobileNetV3 particularly in the context of computational efficiency and
have been adopted and extended in various new architec- practical deployment on mobile and edge devices. By intro-
tures, further advancing the SOTA in lightweight and efficient ducing a series of novel design principles and techniques,
model design. These developments continue to push the ShuffleNetV2 achieves a superior balance between speed and
boundaries of what is achievable in the context of mobile accuracy, making it highly effective for real-world applica-
and edge AI applications, ensuring that high-performance tions [181].
DL models remain accessible and practical for real-world The core innovation of ShuffleNetV2 lies in its strat-
use [178, 179]. egy to optimize the network’s computational graph through
a more refined use of channel operations. Unlike its pre-
4.1.6 ShuffleNetV1 architecture decessor, ShuffleNetV2 focuses on addressing the issues
of memory access cost and network fragmentation. The
ShuffleNetV1 marks a significant advancement in the field of architecture introduces an enhanced channel split opera-
efficient NN architectures. Developed to cater to the increas- tion, where each layer’s input is split into two branches:
ing demand for computational efficiency in mobile and edge one that undergoes a pointwise convolution and another
computing, ShuffleNetV1 introduces an innovative approach that remains unchanged, significantly reducing the compu-
to designing lightweight DL models. The defining feature tation and memory footprint. Additionally, ShuffleNetV2
of ShuffleNetV1 is its use of pointwise group convolutions employs an improved channel shuffle operation that ensures
and channel shuffle operations. These techniques dramati- an even and efficient mixing of information across fea-
cally reduce computational costs while maintaining model ture maps, thereby enhancing the network’s representational
accuracy. Point-wise group convolutions divide the input power without introducing substantial computational over-
channels into groups, reducing the number of parameters and head [181]. ShuffleNetV2 outperforms its predecessor and
computations. The channel shuffle operation then allows for other contemporary lightweight models in terms of speed
the cross-group information flow, ensuring that the grouped and accuracy on various benchmarks. It achieves a favorable
convolutions do not weaken the network’s representational trade-off between model size and computational efficiency,
capabilities. This unique combination of features enables making it particularly well-suited for deployment in scenar-
ShuffleNetV1 to offer a highly efficient network architec- ios with stringent resource constraints, such as mobile and
ture, particularly suitable for scenarios where computational edge AI applications [181].
resources are limited [180]. The impact of ShuffleNetV2 extends beyond its immedi-
ShuffleNetV1’s efficiency and high performance make it ate performance benefits. Its introduction has influenced the
a valuable asset in mobile and edge computing applications. broader field of efficient NN design, inspiring subsequent
Its design addresses the challenges of running complex DL research and development efforts aimed at further optimizing
models on devices with constrained processing power and lightweight architectures. By addressing the critical bottle-
memory, such as smartphones and IoT devices. The archi- necks in mobile and edge AI deployment, ShuffleNetV2 has
tecture has been widely adopted for tasks like real-time set a new standard for what is achievable in terms of balancing
image classification and object detection, offering a practical efficiency and accuracy in DL models. This has paved the way
solution for deploying advanced AI capabilities in resource- for more sophisticated applications in real-time image pro-
limited environments [181]. cessing, object detection, and other AI-driven tasks, ensuring
The introduction of ShuffleNetV1 has had a significant that high-performance DL remains accessible and practical
impact on the research and development of efficient NN mod- for a wide range of real-world uses [181].
els. Its approach to reducing computational demands with-
out compromising accuracy has contributed to subsequent 4.1.8 EfficientNet architecture
architectures, including ShuffleNetV2. These developments
continue to explore and expand the opportunities of what is EfficientNet, a groundbreaking series of CNN, represents
possible in the realm of lightweight and efficient DL mod- a significant advancement in the efficient scaling of DL
els [180]. models. Developed with a focus on balanced scaling of
network dimensions, EfficientNet has set new standards
4.1.7 ShuffleNetV2 architecture for achieving SOTA accuracy with remarkably efficient
resource utilization. The key innovation of EfficientNet is
ShuffleNetV2 represents a progression in the evolution of its systematic approach to scaling, called compound scaling.
efficient NN architectures, building upon the foundations laid Unlike traditional methods that independently scale network
by its predecessor, ShuffleNetV1. The ShuffleNetV2 archi- dimensions (depth, width, or resolution), EfficientNet uses
A comprehensive review of model... 11819

a compound coefficient to uniformly scale these dimensions subsequent research and inspiring new directions in the
in a principled manner. This balanced scaling method allows development of efficient NN models. The principles and
EfficientNet to achieve higher accuracy without an expo- techniques introduced in EfficientNetV2 have been adopted
nential increase in computational complexity. The network and further refined in various other architectures, pushing the
efficiently utilizes resources, making it highly effective for boundaries of what is possible in efficient model design for
both high-end and resource-constrained environments [176]. real-world applications [185].
EfficientNet’s performance sets a benchmark for vari-
ous ML challenges, especially in image classification tasks. 4.1.10 Overview of lightweight model architectures
The network’s ability to scale efficiently across differ-
ent computational budgets makes it adaptable for a wide Some of the key lightweight model architectures are summa-
range of applications, from mobile devices to cloud-based rized in the Table 3, with their year of launch, key features,
servers. EfficientNet has demonstrated superior performance and impact on various applications highlighted.
in tasks requiring high accuracy and efficiency, such as object
detection, image segmentation, and transfer learning across 4.2 Integration with compression techniques
different domains [176]. Its principles have been adopted and
adapted in subsequent research, pushing the limits of what is The concept of lightweight design focuses on architecturally
possible in terms of efficiency and performance in NNs [182, optimizing DNNs to minimize their demand on compu-
183]. tational resources without significantly undermining their
efficacy. Innovations such as efficient convolutional lay-
4.1.9 EfficientNetV2 architecture ers [95, 174], introduce structural efficiencies that lower the
parameter count and computational load. These innovations
EfficientNetV2 represents a significant advancement in the are crucial for enabling the deployment of high-performing
field of efficient DL models, building upon the success of the DNNs on devices with limited computational capacity, like
original EfficientNet architecture. Developed by researchers smartphones and IoT devices. The fusion of lightweight
at Google, EfficientNetV2 introduces several novel tech- design and model compression in DNNs represents a cru-
niques to further enhance performance and efficiency, mak- cial advancement for deploying advanced ML models under
ing it one of the leading models for mobile and edge device computational and resource constraints [186–188].
applications [184]. The synergy between lightweight model design and model
EfficientNetV2 incorporates a new scaling method called compression represents a comprehensive approach to opti-
progressive learning, which adjusts the size of the model dur- mizing CNN. While the former approach is proactive,
ing training to improve both accuracy and efficiency. This building efficiency into the model’s architecture, the lat-
technique begins training with smaller resolutions and sim- ter is reactive, refining and streamlining models that have
pler augmentations, progressively increasing the resolution already been developed. Together, they address the diverse
and complexity as training progresses. This method not only challenges in deploying advanced ML models, from the ini-
speeds up the training process but also helps the model tial design phase through to post-training optimization [170].
achieve higher accuracy. Another key innovation in Effi- This section will explore how SqueezeNet, SqueezeNext, and
cientNetV2 is the use of fused convolutional blocks, which similar architectures embody the principles of lightweight
combine the efficiency of depthwise convolutions with the design and how their integration with model compression
accuracy benefits of regular convolutions. These blocks help techniques exemplifies the broader strategy of NN optimiza-
reduce the overall computational cost while maintaining high tion in ML.
performance. Additionally, EfficientNetV2 employs various Lightweight model design and model compression, though
training-aware optimizations, such as improved data aug- related, represent distinct approaches in DL. Lightweight
mentations and regularization techniques, which contribute model design focuses on creating architectures that are inher-
to its superior performance [184]. ently optimized for performance and low resource consump-
The architecture of EfficientNetV2 is designed to be tion, while maintaining satisfactory accuracy. This involves
versatile, performing well across a wide range of tasks, techniques like employing smaller convolutional filters and
including image classification, object detection, and segmen- depthwise separable convolutions to reduce the number of
tation. Its balanced approach to scaling and optimization parameters and computational intensity of each layer [189].
allows it to deliver SOTA accuracy with significantly reduced In contrast, model compression is the process of downsiz-
computational resources, making it ideal for deployment in ing an existing model to diminish its size and computational
environments with limited processing power and memory, demands without significantly compromising accuracy. The
such as mobile devices and IoT platforms [184]. Efficient- objective here is to adapt a pre-trained model for more effi-
NetV2 has set new benchmarks in the field of DL, influencing cient deployment on specific hardware platforms [190]. Both
11820 P. V. D antas et al.

Table 3 Summary of lightweight model architectures: year of launch, key features, impact, and applications
Architecture Year Key Features Impact and Applications

SqueezeNet 2016 Fire modules to reduce parameters, compact design, Significant model size reduction, used in smartphones
efficient for low-resource devices and embedded systems, baseline for further com-
pression techniques, practical in media analysis and
mobile applications.
SqueezeNext 2018 Low-rank separable convolutions, optimized layers, Greater model size and computation efficiency, useful
enhanced efficiency in mobile devices and edge computing, benchmark for
model compression.
MobileNetV1 2017 Depth-wise separable convolutions to reduce compu- Suitable for mobile and edge devices, real-time image
tational cost and video processing, inspired subsequent architec-
tures like MobileNetV2 and V3.
MobileNetV2 2018 Inverted residuals, linear bottlenecks, depth-wise sep- Higher efficiency and accuracy, broadened scope for
arable convolutions mobile applications, influenced further research in
efficient NN design.
MobileNetV3 2019 NAS for optimal network structure, squeeze-and- Enhanced performance for mobile devices, low
excitation modules, h-swish activation latency and power consumption, influenced new
architectures in efficient model design.
ShuffleNetV1 2018 Pointwise group convolutions, channel shuffle opera- Highly efficient for mobile and edge computing, prac-
tions tical for real-time image classification and object
detection.
ShuffleNetV2 2018 Optimized channel operations, enhanced channel split Superior speed and accuracy, well-suited for resource-
and shuffle operations constrained environments, set new standards for
lightweight NN design.
EfficientNet 2019 Compound scaling to balance network dimensions SOTA accuracy with efficient resource utilization,
adaptable for mobile to cloud applications, influenced
model scaling techniques.
EfficientNetV2 2021 Progressive learning, fused convolutional blocks, Improved training efficiency and accuracy, versatile
training-aware optimizations for various tasks, set benchmarks in DL, influencing
new efficient architectures.

methods aim to produce models that are well-suited for of action recognition in CNNs [201] reduces the model size
deployment on devices with limited resources. The following and also decreases overfitting, leading to enhanced perfor-
subsections highlight some prominent lightweight models mance on large-scale datasets. Finally, a hardware/software
designed to have fewer parameters and lower computational co-design approach for a NN accelerator focuses on model
requirements compared to traditional DNNs [191, 192]. compression and efficient execution [202]. A two-phase fil-
Recent studies have contributed to the field by proposing ter pruning framework was proposed for model compression,
novel approaches [69, 193, 194], exploring various com- optimizing the execution of DNNs. This co-design approach
pression techniques for DNNs, including compact models, exemplifies how integration of hardware and software can
tensor decomposition [131, 195], data quantization [122, enhance the performance and efficiency of DNNs in practi-
196], and network sparsification [197, 198]. These methods cal applications.
are instrumental in the design of NN accelerators, facilitat-
ing the deployment of efficient ML models on constrained 4.2.1 Combined impact on performance and efficiency
devices. A noteworthy application in video coding [199] sug-
gested a lightweight model achieving up to 6.4% average bit Innovative techniques, such as pruning depthwise separa-
reduction compared to high efficiency video coding (HEVC), ble convolution networks [203], highlight the potential for
showcasing the potential of architecturally optimized DNNs improving speed and maintaining accuracy, emphasizing the
in real-world applications. Additionally, it a novel and importance of structural efficiency in lightweight design.
lightweight model for efficient traffic classification [200] was Meanwhile, the work on adaptive tensor-train decomposi-
developed, utilizing thin modules and multi-head attention tion [204] showcases the significant reduction in parameters
to significantly reduce parameter count and running time, and computation, further underscoring the advancements
demonstrating the practical utility of lightweight designs in model compactness and efficiency for mobile devices.
in enhancing running efficiency. A pruning algorithm to Cyclic sparsely connected (CSC) architectures suggests
decrease the computational cost and improve the accuracy structurally sparse architectures for both fully connected and
A comprehensive review of model... 11821

convolutional layers in CNNs [205]. Unlike traditional prun- compression, ensuring that the compressed model remains
ing methods that require indexing, CSC architectures are robust and reliable [212–214]. Integrating XAI with model
designed to be inherently sparse, reducing memory and com- compression techniques not only enhances the interpretabil-
putation complexity to O(N log N ). The number N denotes ity of the compressed models but also aids in fine-tuning the
the number of connections presents in a layer. This technique balance between model size and performance. This synergy
demonstrates an innovative way to achieve model com- is essential for developing efficient, scalable, and trustwor-
pactness and computational efficiency without the overhead thy AI systems capable of operating effectively in diverse
associated with conventional sparsity methods. An efficient and resource-limited environments.
evolutionary algorithm was introduced for NAS [206]. This In looking towards future directions for the advancement
method enhances the search efficiency for task-specific NN of CNN, a multidisciplinary approach emerges across various
models, illustrating how evolutionary strategies can automate domains. The potential of automated ML (AutoML) [215]
the design of efficient and effective CNN architectures for elucidates how it can streamline model optimization by sim-
various tasks. These examples collectively underscore the plifying the search for efficient architectures, thus making
diverse and innovative strategies being explored to make the model design process easier. Meanwhile, the imperative
DNNs more efficient and adaptable, reflecting the ongo- of energy efficiency takes center stage [216], who pushes for
ing commitment within the research community to push the greener practices in CNN development, and encourages for
boundaries of what is possible in ML efficiency. the adoption of energy-efficient models to mitigate environ-
The combinations of model compression techniques and mental impact.
their impacts on model performance reveals a complex land- The exploration of lightweight design and model compres-
scape. Integrating various model compression techniques to sion techniques underscores a significant stride in optimizing
avoid compromising the original model’s effectiveness is a DNN architectures for efficient deployment on devices with
well-acknowledged challenge in the field [207, 208]. Com- constrained resources. Lightweight design approaches proac-
bining different compression methods can indeed lead to tively embed efficiency into the model’s architecture, while
increased efficiency. However, it presents the challenge of model compression methods reactively refine existing mod-
balancing improvements in memory usage and computa- els to reduce their size and computational demands. This dual
tional efficiency against the potential for accuracy reduction strategy addresses the diverse challenges encountered from
and the introduction of noise. This variability underscores the the initial design phase to post-training optimization. The
need for application-specific evaluation and adaptation [56, integration of these techniques exemplifies a comprehensive
60]. Moreover, the complexity of optimizing these meth- approach to NN optimization, balancing performance and
ods for specific applications highlights an ongoing research resource efficiency. Studies have demonstrated the practical
area, necessitating innovation to address factors like fairness utility of these approaches in various applications, including
and bias and to explore hardware advancements for fur- video coding, traffic classification, and action recognition,
ther enhancement. This includes developing strategies that highlighting their impact on enhancing model performance
can effectively leverage the strengths of each compression and efficiency. The ongoing research and innovations in this
approach while mitigating their drawbacks, ensuring that the field continue to push the boundaries of what is achiev-
resulting models are efficient and suitable for deployment in able in ML efficiency, ensuring that advanced models can
limited-resource settings and capable of performing near the be effectively deployed in real-world scenarios with limited
standard’s set [123, 209]. computational capacity.

4.2.2 Synergies between model compression and

explainable artificial intelligence (XAI) 5 Performance evaluation criteria

When discussing model compression, it is crucial to also con- This section delves into the methodologies and metrics used
sider the role of explainable artificial intelligence (XAI) as to assess the efficacy of model compression techniques.
a complementary tool in the process [210, 211]. XAI pro- Key aspects of performance evaluation, such as accuracy,
vides insights into how ML models make decisions, which is model size, computational speed, and energy efficiency,
particularly beneficial during the compression process. By are discussed. The trade-offs between maintaining high
understanding which parts of the model are most impor- accuracy and achieving significant compression rates are
tant for making accurate predictions, developers can make explored, highlighting the challenges and breakthroughs in
more informed decisions about which components to prune this domain. Additionally, this section discusses the practical
or quantize. This targeted approach can help maintain the implications of model compression in real-world applica-
model’s performance while reducing its size. Furthermore, tions, emphasizing the need for robust and efficient models
can help identify potential biases or errors introduced during that can operate under computational constraints.
11822 P. V. D antas et al.

5.1 Compression ratio metric is particularly important for applications requiring

real-time processing, where minimizing delay is critical.
The compression ratio α can be determined by calculating While the speed-up rate focuses on the relative improvement
the fraction between the original and compressed model’s in computational efficiency, inference latency measures the
size [21, 90]. Consider that the original model size is 100 absolute time required for a model to process an input and
MB and the compressed model size is 10 MB, then the com- produce an output. This distinction is crucial for applica-
pression ratio would be 10:1 (100:10). Secondly, it can be tions that demand real-time processing, where minimizing
expressed as a proportion of the total amount of parameters delay is paramount. Inference latency is directly measured
in the original model and the simplified model [217], as the in time units (e.g., seconds, milliseconds) and is essential
following expression: for ensuring real-time performance in applications such as
autonomous driving, video processing, and interactive sys-
a
α(M, M ∗ ) = tems.
a∗ Reducing inference latency involves several techniques
where a is the amount of parameters in the initial model M beyond model compression: model parallelism, which splits
and a ∗ is the number of parameters in the simplified model the model across multiple devices to reduce the time taken
M ∗ . The compression ratio α(M, M ∗ ) of M ∗ over M is the for each layer; data parallelism, which distributes input
proportion of the total number of parameters in M to the total data across multiple devices to reduce batch processing
number of parameters in M ∗ . In addition, a commonly used time; weight sharing, which shares model parameters across
benchmark is the index space-saving β, defined as: multiple instances to minimize memory footprint and compu-
tation; and hardware acceleration, which utilizes specialized
a − a∗ hardware like GPUs or tensor processing units (TPUs) to
β(M, M ∗ ) = speed up inference computations.
a∗

where β(M, M ∗ ) is the defined space-saving rate. 5.4 Label loyalty

5.2 Speed up rate Label loyalty measures how closely a compressed model pre-
dicts the same labels as the original model. It is computed
The speed-up rate focuses on quantifying how much faster a by comparing the quantity of the small-scale model’s pre-
compressed model performs compared to the original model. dictions to the ground truth labels (the correct answers) and
It provides a clear measurement of the reduction in compu- the predictions made by the large-scale model, where N is
tational time achieved by compression. the total number of samples in the dataset being evaluated.
To calculate the speed-up rate, we compare the infer- The label loyalty score can be calculated as the percentage
ence time of the original model with that of the compressed of instances where the compressed model predicts the same
model [217]. For example, if the original model takes 10 s to label as the original model [207].
perform inference and the compressed model takes 3 s, the
speed-up achieved is 10:3. The speed-up rate δ(M, M ∗ ) is Samples with matching predicted labels
defined as: Label Loyalty =
s N
δ(M, M ∗ ) = ∗
s
where s is the running time of the original model M, and s ∗ 5.5 Probability loyalty
is the running time of the compressed model M ∗ .
Most studies use the average training duration per epoch Probability loyalty measures how closely the probability
or the average testing duration to measure running time. distribution of the compressed model matches that of the
The speed-up rate is a crucial metric for understanding original model [207]. It is calculated using the Jensen-
the efficiency gains from model compression, especially as Shannon (JS) divergence relation between the predicted
smaller-scale models typically result in faster computation probability distributions of the compressed model and the
for both training and testing stages, closely linked to the com- original model. The JS divergence is a symmetric and finite
pression rate. distance-like metric that measures the difference between
two probability distributions. The probability of a loyalty
5.3 Inference latency score can be calculated using the following expression:

Inference latency measures the time required for a model 1 1

D J S (P, Q) = D K L (P, M) + D K L (Q, M)
to process an input and produce an output [217]. This 2 2
A comprehensive review of model... 11823

where P and Q are the probability distributions of the original compression tactics, this discourse sheds light on the poten-
and compressed models, respectively, and M is the average tial and challenges of optimizing DL models for real-world
of P and Q. applications.
The formula for the probability of loyalty is provided: This section provides case studies detailing the per-
formance of model compression techniques in different
L p (P, Q) = 1 − D J S (P, Q) scenarios. Developing new techniques for compressed mod-
els is a challenging and important area of research that
where L p is the probability loyalty score, P is the predicted requires a careful balance between model complexity, accu-
probability distribution of the original model, Q is the prob- racy, and practical considerations. Researchers need to
ability distribution of the compressed model, and D J S is the address various challenges, including maintaining accuracy
JS divergence between P and Q. while compressing the model, optimizing performance in
resource-constrained environments like mobile devices, and
5.6 Robustness integrating compressed models effectively into real-world
applications [96, 110, 113, 123, 208].
A model’s robustness can be measured using different met- Furthermore, compressed models must overcome obsta-
rics, depending on the types of perturbations the model is cles related to their configuration and hardware constraints.
expected to be resistant to [217]. Some common metrics for Current SOTA approaches often rely on well-designed CNN
calculating robustness include the following: models, limiting flexibility in changing configurations for
Adversarial accuracy: This measures how accurate the more complex tasks. Additionally, the extension of CNN to
model is on inputs that have been deliberately changed to platforms, such as smartphone, robotic, or self-navigating
cause the model to misclassify them. vehicle platforms, is impacted by hardware bottlenecks. In
Robustness radius: This measures the maximum amount of the context of big data challenges, compressed models face
perturbation that the model can tolerate while still correctly issues related to prediction, data cleansing, dimensional-
classifying an input. ity reduction, and other tasks. Researchers must develop
Worst-case error: This measures the worst-case error of the outstanding models and optimization techniques, including
model over a set of perturbations. parallel and decentralized methods, to effectively handle big
Sensitivity analysis: This measure measures the sensitivity data challenges.
of the model’s output to minor modifications in the input.
The specific method for calculating robustness will depend 6.1 Model compression in image classification
on the type of perturbation that the model is expected to be
robust against and the specific application. Pruning and quantization are widely recognized for their
effectiveness in reducing the computational demands of
5.7 Computation reduction DNNs without significantly compromising accuracy. A
method proposed in one study utilizes both techniques to
To calculate computation reduction, it is needed to compare compress DNNs, enabling their deployment on embed-
the quantity of operations required to perform inference on ded platforms for image classification tasks. This approach
the original model and on the compressed model [217]. For demonstrated that strategic model compression could retain
example, if the original model requires 1000 operations to performance levels while significantly reducing model size
perform inference and the compressed model requires 100, and computational requirements [56]. In real-time image
then the computation reduction would be 10:1 (1000:100). processing, such as synthetic aperture radar (SAR) ship
detection, the need for rapid inference and minimal model
size is paramount. Research in this area has shown that model
6 Model compression in various domains compression can maintain high accuracy in image classi-
fication tasks while substantially reducing model size and
This section delves into the cutting-edge advancements in inference time. This balance is crucial for applications where
model compression and performance optimization, high- timely processing of large image datasets is essential [57].
lighting various strategies such as pruning, quantization, Studies have also explored the impact of model compression
knowledge distillation, and transfer learning. These tech- techniques on improving the efficiency of image classi-
niques aim to reduce the model size and computational fiers on platforms with limited computational capabilities.
demands without significantly compromising accuracy, ther- By employing various compression strategies, researchers
eby enabling faster, more energy-efficient, and cost-effective have been able to enhance the performance of image clas-
ML solutions. Through detailed analysis and comparison of sification models, demonstrating the potential for efficient
methods like SqueezeNet, model pruning, and innovative image analysis in resource-constrained environments [53].
11824 P. V. D antas et al.

Further research has focused on developing tailored coming technique was compared to other pruning techniques.
pression techniques that not only reduce model size but also The article also aims to show that refinement of the pruned
improve accuracy in image classification. These techniques models can further improve their efficiency. The preliminary
are designed to optimize widely used models, demonstrat- results indicated that the presented method achieves compa-
ing that model compression can lead to better efficiency and rable or even greater performance than the original models
performance in image classification tasks [58]. while reducing the number of parameters and computation
In summary, model compression techniques such as prun- costs.
ing and quantization are instrumental in optimizing image The presented technique was used on the UC Merced
classification models for various applications, including real- Image Dataset and 21 land-use scenes. Subsets of the images
time image processing and efficient operation on resource- are divided for training, fine-tuning, validation, and test-
constrained devices. Tailored compression strategies further ing. The models are trained using a batch size of 64, using
illustrate the potential to enhance both the efficiency and SGD. The first learning rate is set to 0.001. The study’s
accuracy of these models, underlining the significance of outcome reveals that the proposed method achieves up to
model compression in the ongoing advancement of image 50% floating-point operations per second (FLOPS) pruning
classification technologies. ratio for visual geometry group (VGG)-16 and up to 47.62%
FLOPS pruning ratio for residual neural network (ResNet)-
6.1.1 SqueezeNet 50 while maintaining high accuracy. This indicates that the
suggested technique can reduce the computational cost of
SqueezeNet, a CNN architecture [96], achieves AlexNet- these models by up to 50% while maintaining their accuracy.
level performance on the ImageNet database with 50× fewer It also achieved an overall accuracy of 92.50% with a pruning
parameters. The primary goal of the paper is to find a model ratio of 60%. The effectiveness of the proposed methodol-
with very few parameters that is still accurate. Smaller CNN ogy is unaffected by the training rate, which means that the
architectures offer several advantages, such as more effective method is robust and can achieve high pruning ratios despite
ML training, less processing time when deploying new mod- training data used. On the NWPU-RESISC45 dataset, the
els, and cost-effective field programmable gate array (FPGA) proposed method prunes up to 50.68% FLOPS for VGG-16
and embedded deployment. and up to 44.98% FLOPS for ResNet-50 while maintaining
SqueezeNet can reduce the size of a model by 5 times com- high accuracy. It also achieved an overall accuracy of 93.50%
pared to AlexNet, but still does better than its top-1 and top-5 with a pruning ratio of 70%.
accuracy. When applying deep compression with 8-bit quan-
tization, SqueezeNet results in 363 times smaller than 32-bit 6.1.3 Compression based on pruning and quantification
AlexNet with similar performance. Furthermore, applying
compression with 6-bit quantization on SqueezeNet, it results In this case [123], it is proposed a novel DL model optimiza-
in 510 times smaller than 32-bit AlexNet with similar per- tion technique focused on the use of filter-stripe combination
formance. pruning and data quantification. The proposed technique
For instance, it is observed that SqueezeNet achieves a achieves a high compression ratio while maintaining model
1.5× speed up over AlexNet on a central processing unit accuracy. The proposed technique is also suitable for mobile
(CPU) and a 4× speed up on a GPU. Additionally, it uses and embedded devices. The authors conducted experiments
3 − 4× less memory than AlexNet during inference, mak- on two DL models, VGG-16 and ResNet-56, using the Cana-
ing it optimized for running resource-limited applications. dian institute for advanced research (CIFAR)-10 dataset.
Moreover, the developers of SqueezeNet have successfully They trained the original models to converge and then applied
implemented a hardware accelerator known as efficient the proposed pruning and data quantification techniques to
inference engine (EIE), which can work directly on the com- obtain a fully compressed model. The observations suggested
pressed model, achieving significant acceleration and energy that the proposed model performs better than existing DL
efficiency gains. This highlights the potential for model com- compression techniques in terms of performance and com-
pression techniques not only to enhance accuracy and model pression percentage. The performance of the resized models
size but also energy efficiency, which is critical for mobile is very similar to that of the original, while the compression
and embedded devices. ratio is significantly improved. The inference speed, memory
utilization, and energy efficiency of the compressed models
6.1.2 Model pruning for image classification are also improved compared to the original models. Based on
the research results from ResNet-56, this technique can mini-
In this case [110], a pruning scheme for remote optical image mize the number of parameters to 4:1 and the number of steps
analysis is proposed that reduces the computational cost of of computations to 5:1, and the loss of model performance
CNNs, while maintaining high accuracy. The proposed prun- is only 0.01%. On VGG-16, the quantity of parameters is
A comprehensive review of model... 11825

reduced to 14:1, the quantity of computation is scaled down efficacy of various other algorithms, including VGG-19,
to 3:1 and the accuracy loss is 0.5%. InceptionV3, ResNet-50, and DenseNet-169, in stress pre-
diction was evaluated. The research methodology included
6.1.4 Facial expression recognition leaveone-out cross-validation (LOOCV) and 10-fold cross-
validation. Findings showed that frequency domain images
In this case [218], the effects of model compression methods exhibited greater complexity and variability compared to
on the performance and fairness of facial expression recogni- spatial images, which, despite their reduced variation, were
tion models are investigated. The research encompasses three simpler and more adaptable.
compression tactics - pruning, weight clustering, and post- Through model pruning, the total trainable parameters
training quantization - and examines their combinations, were reduced from about 55 MM to less than 1 MM, which
specifically pruning with quantization and weight clustering resulted in a processing time of 148 ms/step and accuracy of
with quantization. The findings indicate that model size can 86.25%. The computation cost was reduced by 4 times. Addi-
be substantially reduced through compression and quantiza- tionally, quantization was applied to lower the precision of
tion without sacrificing accuracy. However, these processes the model’s weights and activations. This approach achieved
might negatively affect the fairness of the algorithms. Addi- a classification accuracy of 90.62% in the stress detection
tionally, the study conducts a comparative analysis of the approach. Knowledge distillation outperformed others in
baseline models versus the compressed models and delves terms of the balance between performance and processing
into three research questions concerning the efficacy and fair- power, attaining an accuracy of 88.75% with a loss of 0.0066
ness of model compression methods. and a processing speed of 65 ms/step.
The study used two datasets, extended extended Cohn-
Kanade (CK+) and real-world affective faces database (RAF- 6.1.6 Model compression in medical image analysis
DB). The assessment focused on three key aspects: model
size, accuracy, and fairness. The study found that compres- Research has explored the effects of image compression on
sion and quantization can significantly reduce model size the classification performance of DL models for medical
without compromising accuracy, but may adversely affect images, such as mammograms. The findings indicate that
algorithmic fairness. The baseline model reached an overall model compression can be applied effectively in medical
accuracy of 67.96% on the CK+ dataset, while the base- imaging without compromising the accuracy of diagno-
line RAF-DB classifier attained an overall test accuracy sis, which is paramount in clinical settings [82]. A novel
of 82.46%. The compressed MobileNetV2 model attained approach to medical image compression involves the use
a higher accuracy compared to the model on the CK+ of variational autoencoders combined with ResNet. This
dataset, while the compressed ShuffleNetV2 model achieved method addresses common issues in CNN training and aims
a slightly lower accuracy than the full model. However, the to optimize the balance between image quality and com-
study also found that model compression and quantization pression rate, thus preserving the critical details necessary
can adversely impact algorithmic fairness, particularly in for accurate medical diagnosis [83, 220]. Further studies
terms of gender and race accuracy. The compressed models have introduced model compression techniques to enhance
showed a larger discrepancy between male and female accu- the efficiency of DNNs in medical image analysis. These
racy metrics compared to the full models, indicating potential techniques streamline the model architecture, reducing its
algorithmic bias. Similarly, the compressed models showed complexity while maintaining diagnostic accuracy. Such
a larger discrepancy between accuracy metrics for different advancements facilitate quicker and more resource-efficient
race groups compared to the full models. analysis, which is essential for real-time medical decision-
making [84]. In the context of medical imaging, transfer
6.1.5 Detecting stress on a person’s health through 2D learning has been leveraged to adapt existing models to
images specific medical tasks efficiently. This approach allows for
the optimization of models for precise diagnostic analysis
In this case [219], it is addressed the adverse effects of stress without sacrificing accuracy, demonstrating the potential of
on health and its high prevalence in American society. It intro- model compression in enhancing the applicability and per-
duces a new algorithm leveraging ML techniques to detect formance of DL in medical image recognition [85].
stress from 2D electrocardiogram (ECG) images, bypassing In conclusion, model compression in medical image
the need for intricate feature extraction. Pruning, quantiza- analysis is a growing field that addresses the need for
tion, and knowledge distillation are identified as effective high-performing, efficient diagnostic tools in healthcare.
techniques for compressing the model. Through techniques like image compression, variational
The VGG-16 algorithm was optimized to enhance its autoencoding, efficiency optimization, and transfer learning,
learning rate and overall performance. Additionally, the researchers are able to refine DL models to operate effec-
11826 P. V. D antas et al.

tively in the demanding environment of medical diagnostics, model compression strategies in NLP, highlighting their
ensuring that critical healthcare applications benefit from the effectiveness in maintaining high accuracy while reducing
advancements in AI and ML. computational load [63]. Compression of word embeddings
using low-rank matrix decomposition and knowledge dis-
6.2 Model compression in speech recognition tillation presents a significant advancement in NLP model
efficiency. This approach not only reduces the size of the
In the context of speech recognition, a variety of model model but also retains the semantic richness of the embed-
compression techniques have been employed to enhance dings, which is essential for tasks like translation and
processing efficiency. These include pruning, quantization, sentiment analysis. The technique demonstrates that it is
knowledge distillation, Low-rank factorization, and trans- possible to achieve substantial compression without com-
fer learning. These methods aim to reduce model size promising the quality of language representations [64]. The
and computational complexity while retaining the accuracy slow inference speed and high computational demands of
necessary for reliable speech recognition [59]. Lossy compre-trained deep models, such as BERT, pose significant chal-
pression techniques, particularly for CNN-based GANs in lenges in NLP. Knowledge distillation has been proposed
speech recognition, have shown promise in reducing numer- as a solution to compress these models effectively, thereby
ical precision and encoding, thus minimizing model size and enhancing their practicality for real-time applications. This
computation requirements. Such approaches facilitate the method allows for the retention of essential language under-
deployment of efficient speech recognition systems that can standing capabilities while significantly reducing the model’s
operate effectively in real-time environments [60]. Further size and computational requirements [65].
exploration of model compression in speech recognition has In summary, model compression in NLP is a dynamic
led to the development of methods that encompass pruning, field that addresses the critical need for efficient and effective
quantization, knowledge distillation, low-rank factorization, language processing models. Through various compression
and transfer learning. These comprehensive strategies are techniques, researchers have made significant strides in opti-
designed to enhance the performance of speech recognition mizing NLP models for improved deployment on devices
models, ensuring high accuracy and efficiency on platforms with limited computational resources, ensuring that advanced
with constrained computational capabilities [61]. Innova- language processing capabilities remain accessible and prac-
tive methods like self-distillation have been introduced to tical for a wide range of applications.
improve model efficiency in speech recognition. This tech-
nique allows high-accuracy models to be obtained directly 6.3.1 Compressing sparse pre-trained language models
without the need for an assistive large-scale model, thus (PLM)
simplifying the training process and enhancing model per-
formance. Self-distillation represents a significant advance In this particular instance [110], a novel approach for imple-
in model compression, enabling the deployment of highly menting sparse PLM training is presented. The method uses
efficient speech recognition systems [62]. weight pruning and knowledge distillation to create pre-
In conclusion, model compression has emerged as a cru- trained models that can be used in downstream tasks with
cial technology in the advancement of speech recognition minimal accuracy loss. The method is applied to BERT-
systems, enabling the development of efficient and accu- base, BERT-large and DistilBERT and fine-tuning the sparse
rate models suitable for deployment on resource-constrained models on downstream tasks such as Stanford question
devices. Through techniques such as pruning, quantization, answering dataset (SQuAD) and the general language under-
knowledge distillation, low-rank factorization, and self- standing evaluation (GLUE) benchmark. The results indi-
distillation, researchers have been able to optimize speech cate that the compressed sparse pre-trained models achieve
recognition models to meet the demands of real-time pro- SOTA compression-to-accuracy ratios and can even be fur-
cessing and limited computational capacity. ther compressed to 8-bit precision using quantization-aware
training.
6.3 Model compression in natural language The experimental setup involved using the English Wikip-
processing (NLP) edia dataset, which contains 2500 MM words. The data was
divided into two groups: train (95%) and validation (5%)
In NLP, model compression encompasses a range of tech- sets. The pre-trained models were evaluated on a range of
niques including pruning, quantization, knowledge distil- benchmarks for transfer learning, including SQuAD and
lation, low-rank factorization, and transfer learning. These text classification tasks from the GLUE benchmark – multi-
methods are employed to manage the extensive computa- genre natural language inference (MNLI), Quora question
tional requirements of NLP models without significantly pairs (QQP), question-answering natural language inference
impacting their performance. A study discusses various (QNLI) and Stanford sentiment treebank (SST-2).
A comprehensive review of model... 11827

The method worked better than others when trained at introduced a compiler-aware neural pruning search frame-
a higher level of sparsity. When comparing the presented work, optimizing 2D and 3D object detection. This approach
approach to other approaches, it was found that it yielded enabled real-time inference speeds with minimal accuracy
superior results at 85 and 95% sparsity ratios, respectively. loss on mobile devices, illustrating the effectiveness of
Also, the accuracy loss is less than 3% when you compare model compression in maintaining performance while reduc-
the compressed sparse models with their dense counterpart. ing computational demands [73]. Another critical aspect in
It also demonstrated that it is possible to significantly com- autonomous vehicles is the optimization of energy consump-
press the models using quantization-aware training in order tion while adhering to real-time latency constraints. Research
to achieve SOTA results in terms of compression-to-accuracy in this area has proposed strategies to achieve a balance
ratio. between edge and cloud computing for autonomous sys-
tems, highlighting the importance of model compression in
6.3.2 Knowledge distillation for bidirectional encoder enhancing energy efficiency and reducing latency for real-
representations from transformers (BERT) time processing [74]. For For vehicle-to-vehicle communica-
tion (V2V), lightweight CNN designs inspired by MobileNet
In this case [221], a new approach called patient knowl- and enhanced through knowledge distillation have proven
edge distillation for compressing large PLMs like BERT into effective. These models facilitate automatic scenario recog-
equally effective lightweight models is proposed. It uses two nition, demonstrating that compressed models can achieve
patient distillation schemes to enable the exploitation of rel- high performance while being suitable for the stringent com-
evant information in the large-scale model’s hidden layers. putational limitations of autonomous vehicles [75]. Address-
It also boosts the small-scale model to gain knowledge from ing the challenge of limited communication bandwidth in
and mimic the large-scale through a multi-layer distillation connected vehicles, ternary quantization-based model com-
process. pression methods have been explored. These methods aim to
The research methodology focused on using the pro- reduce the model parameter size, demonstrating that effec-
posed method in relation to four different NLP tasks: tive model compression can lead to more efficient data
sentiment analysis, paraphrase similarity correlation, NLP transmission and processing in the network of autonomous
inferring, and machine reading understanding. The datasets vehicles [76].
used for each task were SST-2, QQP, MNLI, QNLI, Microsoft In conclusion, model compression in autonomous vehicles
research paraphrase corpus (MRPC) and recognizing textual is essential for achieving the necessary balance between per-
entailment (RTE). The research focused on comparing the formance and computational efficiency. Through techniques
effects of the suggested approach with standard knowledge like neural pruning, energy optimization, lightweight CNN
distillation techniques. The preliminary findings suggest designs, and quantization, researchers are able to develop
that the presented approach results in superior efficiency advanced systems that meet the real-time, energy-efficient,
and better predictive power than the traditional knowledge and accurate processing requirements crucial for the safety
distillation methodologies, with notable gains in training and functionality of autonomous vehicles.
efficiency and space reduction, while still maintaining com-
parable model performance to the original model. The 6.5 Model compression in recommender systems
method achieved SOTA results on various benchmark data
sources and reduced the number of required parameters by Self-distillation techniques have been utilized to reduce
up to 90%. model size and computational demands, which is particularly
The compressed models achieved up to 4.3× speed up beneficial for recommender systems that must process large
in inference time and up to 4.7× reduction in model size volumes of data swiftly to generate timely recommendations.
while maintaining similar accuracy to the original model. In Such approaches enable the creation of efficient and com-
general, the tests and results show that using compression pact models that maintain or even improve recommendation
techniques can make LLM easier to use in real life by reduc- quality [62]. The development of compressed frameworks
ing the number of redundant parts. The authors concluded for sequential recommender systems addresses the chal-
that the selection of hyperparameters had a critical effect on lenge of deploying these systems on resource-constrained
the performance of the approach, and that careful tuning was devices. Research in this area has shown that it is possible
necessary to achieve optimal results. to maintain or improve accuracy compared to uncompressed
models, thus demonstrating the effectiveness of model com-
6.4 Model compression in autonomous vehicles pression in enhancing the scalability and responsiveness
of recommender systems [80]. Matrix factorization algo-
In autonomous vehicle applications, real-time inference is rithms, a core component of many recommender systems,
essential for timely decision-making and response. A study have been optimized through model compression to improve
11828 P. V. D antas et al.

efficiency, speed, and simplicity. These advancements allow 6.6.1 Memory- communication-aware on Internet of Things
recommender systems to deliver high-quality recommenda- (IoT)
tions more efficiently, reducing the computational load and
enabling smoother operation on platforms with limited pro- In this case [208], is introduced a transfer learning model
cessing power [81]. compression technique, called network of neural networks
In summary, model compression in recommender systems (NoNN) that yields higher performance than other baselines
is essential for handling the vast data volumes and real-time and similar accuracy as the associated large-scale model,
processing requirements inherent in personalized recom- while using less communication among small-scale models.
mendation tasks. Through innovative techniques like self- The experimental setup involved compressing various
distillation, compressed frameworks, and efficient matrix DNNs for five image classification tasks: CIFAR-10, CIFAR-
factorization, researchers are able to significantly improve 100, Scene, caltech-UCSD birds (CUB), and Imagenet. The
the performance and efficiency of recommender systems, Scene and CUB datasets are related to the transfer learning
ensuring they can operate effectively even in resource-limited domain, where a pre-trained NN is fine-tuned on a partic-
environments. ular problem. Furthermore, the parameters in NoNN were
compared to the parameters in knowledge distillation and
6.6 Model compression in Internet of Things (IoT) attention-transfer with knowledge distillation (ATKD) stan-
and non-application specific domains dard models. Moreover, they indicate that they outperformed
previously used models like Splitnet.
In the IoT domain, model compression strategies are tailored The results indicated that NoNN achieved close to large-
to meet the unique requirements of resource-constrained scale model’s precision with significantly lower memory
devices. One study introduces a model compression approach (2.5−24× gain) and computation (2−15× fewer FLOPSs).
in IoT that optimizes for low-end devices by combining quan- They also reported that NoNN demonstrated superior per-
tization and pruning, significantly reducing computational formance compared to other baselines, despite the lack of
demands and enabling efficient deployment [86]. Another communication among small-scale models until the final
study presents a CNN model compression framework tai- layer. The proposed NoNN compresses a pre-trained large-
lored for intelligent inspection within power IoT systems, scale model resulting in many disjoint and highly compressed
focusing on enhancing performance through pruning and small-scale modules, without loss of performance. This facil-
quantization [87]. The demand for low-power solutions in itates faster and more cost-effective computation on IoT
IoT has led to the development of specialized accelerators devices.
for CNNs, aimed at improving inference capabilities on For instance, NoNN boasts an overall accuracy of 91.5%
low-end devices. These solutions incorporate hybrid quanti- on the CIFAR-10 sample, which is higher than the accuracy
zation schemes and binary activation functions, using SVM of many baselines such as MobileNet and ShuffleNet. NoNN
techniques to streamline model execution while conserv- also achieves an average inference speed of 0.5 ms per image,
ing energy [88]. Beyond IoT-specific considerations, model which is faster than various baselines such as MobileNet and
compression addresses broader challenges across various ResNet-18. Additionally, NoNN achieves an average mem-
domains. A significant concern is the computational and ory utilization of 430 kB per student, which is within the
energy efficiency constraints faced by embedded general- memory budget of most IoT devices. Finally, NoNN achieves
purpose processors in handling advanced NN-based ML an average energy efficiency of 0.5 mJ per image, which is
techniques. Research in this area highlights the impor- lower than several baselines such as MobileNet and ResNet-
tance of developing effective compression strategies, such 18.
as those combining pruning and quantization, to make CNN
inferences more efficient on resource-constrained devices 6.7 Summary of model compression applications
[70]. In Table 4 is provided a concise summary of how model
In summary, model compression in IoT and non-application- compression techniques are applied across various vertical
specific domains focuses on creating efficient, compact applications, highlighting the unique challenges and require-
models suitable for deployment in environments with lim- ments in each case.
ited computational and energy resources. By implementing
tailored compression techniques like pruning, quantization,
and specialized accelerators, researchers are able to signifi- 7 Innovations in model compression and
cantly improve the operational efficiency of models across a performance enhancement
wide range of devices and platforms, demonstrating the ver-
satility and necessity of model compression in the expanding The rapid growth in model size and complexity in ML has
landscape of smart devices and general-purpose computing. prompted significant advancements in model compression
A comprehensive review of model... 11829

Table 4 Summary of model compression case studies and their application peculiarities
Case study Peculiarities of the considered vertical application

Image Classification Effective in reducing computational demands while maintaining accuracy. Essential for
real-time image processing and deployment on resource-constrained devices. Techniques
like pruning and quantization are commonly used to reduce model size and latency with-
out compromising performance. The balance between speed and accuracy is crucial for
applications in mobile devices and embedded systems [53, 56–58].
Speech Recognition Focuses on reducing model size and computational complexity while retaining high accu-
racy for real-time processing. Knowledge distillation and pruning are frequently used to
compress models, enabling deployment on devices with limited computational power.
Ensures low latency and high throughput, which are critical for user interaction and real-
time applications such as virtual assistants and transcription services [59–62].
NLP Manages extensive computational requirements without significantly impacting perfor-
mance. Optimizes pre-trained models like BERT and generative pre-trained transformer
(GPT) for real-time applications using techniques such as pruning, quantization, and
knowledge distillation. Ensures efficient memory usage and faster inference, crucial for
applications like chatbots, language translation, and sentiment analysis [63–65].
Autonomous Vehicles Ensures real-time inference and energy efficiency. Balances edge and cloud computing,
and addresses limited communication bandwidth. Model compression techniques like
pruning and quantization are used to enhance the deployment of vision and decision-
making models on embedded systems within vehicles. Critical for safety and performance
in dynamic environments [73–76].
Recommender Systems Handles vast data volumes and real-time processing requirements, maintaining or
improving recommendation quality on resource-constrained devices. Techniques such
as low-rank factorization and pruning are applied to manage computational load while
ensuring timely and relevant recommendations. Important for personalized content deliv-
ery in e-commerce and media streaming services [62, 80, 81].
Medical Image Analysis Maintains diagnostic accuracy while reducing model complexity for real-time medical
decision-making. Compression techniques like knowledge distillation, quantization, and
pruning are crucial to ensure models are efficient and reliable for deployment in healthcare
settings. Ensures fast and accurate analysis, which is critical for early diagnosis and
treatment planning [82–85, 220].
IoT and Non-application Specific Tailored for resource-constrained devices, improving efficiency and enabling deployment
in a wide range of environments. Techniques like pruning, quantization, and lightweight
model design are employed to enhance model performance on low-power devices. Essen-
tial for applications ranging from smart home systems to industrial IoT, where efficient
processing and low power consumption are critical [70, 86–88].
Predictive Maintenance in Smart Manufacturing Uses model compression to ensure real-time monitoring and prediction of equipment
failures. Techniques like low-rank factorization and knowledge distillation are employed
to deploy models on edge devices for continuous data analysis. Critical for reducing
downtime and maintenance costs in industrial settings [207].

and performance enhancement techniques. These innova- PyTorch’s memory management utilities, play a pivotal
tions are crucial for reducing computational demands and role in optimizing memory usage [222–225].
expanding GPU memory capacity, thereby facilitating the • Integration of ML with HPC: integrating ML with HPC
integration of ML with HPC. environments leverages the massive parallel processing
capabilities of supercomputers, enhancing model train-
• GPU memory expansion and computational demands: ing efficiency. Frameworks like TensorFlow and PyTorch
with the increasing complexity of DL models, GPU mem- now support distributed training across multiple nodes,
ory limitations have become a significant bottleneck. significantly speeding up the training process for large-
Techniques like memory swapping and tensor remateri- scale models [226–229].
alization are employed to alleviate this issue, enabling • Advanced techniques (DeepSpeed, ZeRO-Infinity):
larger models to fit within the limited GPU memory. DeepSpeed, an open-source DL optimization library,
Advanced software libraries, such as NVIDIA’s Apex and introduces ZeRO to reduce memory redundancy by parti-
11830 P. V. D antas et al.

tioning model states across data-parallel processes. This 8 Challenges, strategies, and future
approach enables training of models with up to a trillion directions
parameters, significantly enhancing computational effi-
ciency and model scalability. The ZeRO-Infinity exten- 8.1 Computational overhead and suitability
sion further optimizes memory usage, making it feasible
to train even larger models [230, 231]. Compressing ML models often requires additional com-
• Advanced pruning techniques: pruning techniques aim putational resources, particularly when utilizing gradient
to remove redundant or less critical neurons and con- descent algorithms for optimization. Although these tech-
nections in NNs, thereby reducing model size and niques introduce extra computational overhead, their suit-
computational load without compromising performance. ability for most applications makes this trade-off reasonable.
Structured pruning and lottery ticket hypothesis are Techniques like pruning and quantization reduce model
notable advancements, offering systematic approaches size, significantly lowering the computational burden during
to identify and eliminate unnecessary network compo- inference, leading to faster predictions and reduced mem-
nents [232–235]. ory requirements. This optimization is crucial for deploying
• Innovative quantization methods: quantization reduces models in resource-constrained environments [19]. By focus-
the precision of model parameters from floating-point to ing on essential features and reducing redundancy, pruning
lower bit-width representations (e.g., 8-bit integers), sig- and knowledge distillation help prevent overfitting and
nificantly decreasing memory and computational require- enhance generalization to unseen data, ultimately improving
ments. Techniques such as post-training quantization real-world performance [111].
and quantization-aware training ensure that performance While compressed models may require more computa-
degradation is minimal while achieving substantial effi- tional resources during training, the long-term benefits of
ciency gains [236–238]. faster inference, reduced memory footprint, and enhanced
• Enhanced low-rank factorization approaches: low- performance outweigh the initial costs, making the invest-
rank factorization decomposes weight matrices into prod- ment in extra computation justified [69]. For instance,
ucts of smaller matrices, effectively reducing the num- pruning works by eliminating unnecessary neurons and con-
ber of parameters and computational complexity. This nections in NNs, which reduces the overall complexity of the
method maintains model accuracy while enabling more model without significantly impacting its accuracy. An exam-
efficient inference and training processes. Applications ple of this is Google’s use of pruning in its neural machine
in transformer models and CNNs have demonstrated sig- translation system, which resulted in a 92% reduction in
nificant improvements in performance [239–241]. model size while maintaining translation quality [253]. Sim-
• Advanced knowledge distillation techniques: knowl- ilarly, quantization converts the weights and activations of a
edge distillation transfers the knowledge from a large, NN from higher precision (such as 32-bit floats) to lower pre-
complex model (teacher) to a smaller, more efficient cision (such as 8-bit integers), which can drastically reduce
model (student). Advanced techniques focus on opti- the memory footprint and speed up inference times. This
mizing the distillation process to maximize the transfer approach was successfully applied in the MobileNet archi-
of knowledge, resulting in student models that achieve tecture, making it highly efficient for mobile and embedded
comparable performance with reduced computational applications [217]. Knowledge distillation not only reduces
demands [242–246]. the size of the model but also often results in a model that per-
• Hybrid compression methods: combining multiple forms better on specific tasks due to the distilled knowledge
compression techniques, such as pruning, quantiza- from the teacher. An example of this is the use of knowl-
tion, and low-rank factorization, creates hybrid methods edge distillation in training compact BERT models, which
that leverage the strengths of each approach. These maintain high accuracy while being significantly smaller and
methods offer superior compression rates and perfor- faster than the original model [9]. These methods collectively
mance enhancements, making them highly effective for ensure that the models remain robust and effective even when
deploying large-scale models on resource-constrained deployed on devices with limited computational capabilities,
devices [247–252]. such as mobile phones and edge devices.
Industries where speed, memory efficiency, and accuracy
These innovations in model compression and performance are critical, such as healthcare, finance, and autonomous driv-
enhancement are pivotal in addressing the challenges posed ing, greatly benefit from the deployment of these efficient
by the growing complexity of ML models. They enable and optimized ML models. In healthcare, for example, faster
more efficient use of computational resources, facilitating inference times can lead to quicker diagnostics and treat-
the deployment of sophisticated models in real-world appli- ment decisions, which are vital for patient care. A notable
cations. example is the use of compressed ML models in real-time
A comprehensive review of model... 11831

magnetic resonance imaging (MRI) reconstruction, reduc- to ResNet-50, achieving a 90% reduction in parameters with
ing scan times from minutes to seconds. In finance, real-time only a 1% drop in accuracy on the ImageNet dataset. This
analysis and predictions can improve trading strategies and highlights the potential of iterative pruning to maintain high
risk management. High-frequency trading algorithms often performance even with significant compression [255].
rely on compressed models to process vast amounts of data Regularization techniques serve as another critical tool
rapidly and make split-second decisions. Autonomous driv- in model compression. These methods introduce additional
ing systems require real-time data processing to ensure safety constraints into the training process, often in the form of
and accuracy in navigation and obstacle detection, where penalties on the magnitude of parameters, to encourage the
models like you only look once (YOLO) have been pruned model to maintain only those weights that are genuinely influ-
and quantized to run efficiently on automotive-grade hard- ential in determining the output. L1 and L2 regularization
ware [7]. are prominent examples of such techniques, where L1 pro-
Moreover, the initial computational overhead incurred motes sparsity in the model weights, thereby facilitating their
during the training phase can often be mitigated by lever- subsequent removal during compression. Structured spar-
aging advanced hardware accelerators such as GPUs and sity learning elucidates how regularization can be effectively
TPUs, which are designed to handle large-scale computa- employed to enhance the compressibility of NNs, paving the
tions more efficiently. This makes the overall process of way for more efficient and compact models [256].
compressing models not only feasible but also cost-effective Moreover, recent advancements in model compression
in the long run [6]. Consequently, the extra computational have led to the development of more sophisticated approaches,
resources required for model compression during training such as the utilization of sparsity-inducing norms and
are a worthwhile investment for the significant performance sparsity-aware algorithms. These methods aim not only to
improvements they bring during deployment. reduce the model’s size but also to retain the model’s capacity
for accurate predictions on unseen data [257, 258]. Prun-
8.2 Over-pruning and regularization techniques ing CNNs using Taylor expansion provides insights into
how sparsity-aware techniques can be employed to identify
Over-pruning represents a critical challenge, often leading to and eliminate redundant parameters with minimal impact on
compromised model performance and generalization capa- model performance [259]. An example of this is the use
bilities. This phenomenon occurs when an excessive number of Taylor expansion-based pruning in VGG-16, where the
of parameters are eliminated during the compression process, model’s size was reduced by over 80% while maintaining
which, although beneficial for reducing the model’s size and accuracy within 1% of the original model. This approach
computational demands, can inadvertently strip away vital demonstrated that even complex architectures could be
information necessary for accurate predictions. The balance significantly compressed without substantial performance
between model size reduction and performance retention is degradation.
therefore a pivotal concern in the development and optimiza- These methods not only aim at reducing the model’s
tion of model compression techniques. Mitigating the risks size but also at retaining the model’s capacity for accu-
associated with over-pruning necessitates a well-planned and rate predictions on unseen data [257, 258]. Pruning CNNs
informed strategy for the implementation of model compres- using Taylor expansion provides insights into how sparsity-
sion. aware techniques can be employed to identify and eliminate
Over-pruning, characterized by the excessive removal of redundant parameters with minimal impact on model perfor-
model parameters, can significantly negatively affect the mance [259].
model’s performance and its ability to generalize. To coun-
teract these adverse effects, sophisticated methodologies 8.3 Trade-offs and impact on model architecture
have been developed and refined within the ML commu-
nity. One such approach is iterative pruning, a process that The primary goal of model compression is to reduce compu-
methodically eliminates less critical parameters over multi- tational demands while retaining performance. Our expanded
ple cycles, interspersed with phases of retraining to restore discussion contrasts various methods, such as pruning, quan-
and enhance model performance. The iterative nature of this tization, and knowledge distillation, by examining their
technique allows for a more gradual reduction in model impact on performance retention, computational efficiency,
size, minimizing the risk of removing essential informa- and their suitability for different ML tasks and scenarios.
tion, which demonstrated that iterative pruning could achieve Each of these methods offers a pathway to reduce the com-
substantial reductions in model size while maintaining, and putational footprint of models, yet they must be applied
sometimes even improving, model accuracy [254]. An exam- judiciously to avoid compromising the integrity and efficacy
ple of iterative pruning’s effectiveness can be seen in the of the models. In practice, the process of model compression
ResNet architecture. Researchers applied iterative pruning involves trade-offs.
11832 P. V. D antas et al.

For instance, while pruning and quantization effectively Regarding to emerging areas, the integration of model
reduce model size and accelerate inference times, they may compression techniques across various industrial applica-
introduce artifacts or errors that could degrade model per- tions can lead to significant improvements in efficiency,
formance [19, 59, 116]. Similarly, low-rank factorization speed, and accuracy. Whether through the deployment of dig-
aims to streamline models by reducing redundancy in weight ital twins, advanced NNs, data-driven health management
matrices, but if applied excessively, it might eliminate essen- systems, predictive maintenance strategies, or reinforce-
tial features, leading to poor performance on complex tasks. ment learning for supply chain optimization, the benefits
Knowledge distillation and transfer learning present innova- of model compression are evident. By reducing computa-
tive ways to leverage existing models to train more compact tional demands and enhancing model performance, these
and efficient versions, yet they depend heavily on the qual- techniques pave the way for more effective and scalable
ity and relevance of the original models and the alignment intelligent systems in industrial settings. Model compression
between tasks [130, 136]. techniques can significantly enhance the performance of dig-
The intricacies of model compression also extend to ital twins by reducing the computational burden required for
the interaction between compression techniques and model real-time analysis. This is particularly important for resource-
architecture. Certain architectures may be more amenable to constrained environments where computational power is lim-
specific compression methods, with variations in their sus- ited. Compressed models can facilitate faster data processing
ceptibility to information loss or performance degradation and more efficient anomaly detection, ultimately leading
post-compression. Therefore, a comprehensive understand- to more timely and accurate maintenance decisions [261,
ing of both the model’s structural nuances and the operational 262]. Model compression can further improve PIResNet’s
principles of compression methods is imperative for suc- efficiency by reducing the computational complexity with-
cessful model compression [70, 87, 116, 186]. Furthermore, out compromising the model’s diagnostic capabilities. By
the dynamic nature of technological advancement and the pruning redundant parameters and employing quantization
evolving landscape of ML applications necessitate continu- techniques, compressed PIResNet models can provide rapid
ous research and adaptation of model compression strategies. and reliable fault detection, making them more suitable
Innovations in hardware and software, along with advance- for real-time industrial applications [263, 264]. Data-driven
ments in ML algorithms, constantly reshape the boundaries approaches to bearing health management rely on extensive
and possibilities of model compression [69, 100, 173]. data collection and analysis to predict bearing failures. ML
models, particularly those that are heavily parameterized,
8.4 Future works and recommendations can benefit from compression techniques to manage large
datasets effectively [265, 266]. Compressed models can pro-
In the realm of future works and directions within model cess data more efficiently, leading to faster and more accurate
compression, several pivotal recommendations emerge. A health assessments. This is crucial for industries where down-
primary focus should be on the evolution of more sophisti- time due to bearing failures can be costly. By enhancing the
cated compression algorithms, which can dynamically mod- speed and accuracy of data-driven health management sys-
ulate efficiency and performance tailored to the diverse needs tems, model compression contributes to more reliable and
of applications. This approach underscores the significance cost-effective maintenance strategies.
of enhancing quantization-aware training and knowledge The application of model compression in predictive main-
distillation methods, aimed at refining model compression tenance in smart manufacturing context can significantly
capabilities without forfeiting crucial data, thus safeguard- reduce the computational resources required for training and
ing model integrity across varying compression extents inference [267, 268]. Techniques such as pruning and quan-
[207–209, 260]. Moreover, the development of advanced tization can make DL models more efficient, enabling their
pruning methodologies that accurately pinpoint and elimi- deployment on edge devices with limited processing power.
nate superfluous or non-critical model components, without This can lead to more widespread adoption of predictive
undermining the model’s decision-making proficiency, is maintenance solutions, enhancing operational efficiency and
anticipated to substantially advance the domain [90, 110]. reducing maintenance costs across the manufacturing sec-
The exploration into the synergistic effects of diverse com- tor. RL has shown great promise in optimizing supply chain
pression strategies could unveil novel pathways to strike an operations by learning optimal policies through interaction
optimal balance between model size, speed, and accuracy, with the environment [269, 270]. However, RL models can be
fostering innovations in model efficiency [56, 60]. Tack- computationally intensive and require substantial resources
ling the performance variability across different tasks and for training. Model compression techniques can alleviate this
environments remains a crucial endeavor, highlighting the by streamlining the RL models, making them more suit-
necessity for adaptive models that can fine-tune their com- able for real-time decision-making. Compressed RL models
pression strategies in alignment with the deployment context. can operate more efficiently, enabling faster responses to
A comprehensive review of model... 11833

dynamic supply chain conditions and improving overall sup- systems developed through these techniques uphold princi-
ply chain performance. ples of fairness, accountability, and transparency.
Future research would need to pivot towards the develop-
ment of hybrid compression techniques that synergistically
combine the strengths of existing methods to achieve superior 9 Discussion
compression rates without compromising model perfor-
mance. Additionally, there is a pressing need for more To actualize these advancements, ensuing research must
robust frameworks that can automatically select and apply concentrate on formulating algorithms that can adeptly com-
the most appropriate compression technique based on the press models by recognizing their distinct traits and the
model’s characteristics and the computational environment. specific demands of their respective applications. Envisag-
The exploration of model compression techniques presented ing models that self-optimize in terms of size, speed, and
in this review sets the stage for several key areas of future accuracy, these advanced algorithms are likely to leverage
research and development in the field of ML and AI. Build- ML techniques to ascertain the most efficacious compression
ing upon the insights gained from the current state of model strategy, potentially integrating reinforcement learning or
compression, the following recommendations and directions meta-learning to perpetually enhance their compression tac-
can guide future works: tics based on performance feedback [271, 272]. This segues
Sophisticated compression algorithms: there is a grow- into the importance of embedding compression considera-
ing need to prioritize the development of more advanced tions within the initial design phase of models, fostering
compression algorithms that can dynamically balance effi- the emergence of inherently efficient architectures. Such an
ciency and performance based on the specific requirements approach encourages the creation of high-performance mod-
of different applications. By focusing on creating algorithms els that are naturally predisposed to compression, thereby
that can adapt to varying computational environments and circumventing the typical trade-offs associated with post-
application scenarios, researchers can enhance the versatil- development compression [69, 105, 273].
ity and effectiveness of model compression techniques. As computational sustainability garners increasing atten-
Hybrid compression techniques: future research should tion, model compression techniques aimed at reducing the
explore the potential of hybrid compression techniques that energy demand of ML operations will become paramount.
combine the strengths of existing methods to achieve superior These energy-efficient compression methods, which are cru-
compression rates without compromising model perfor- cial for lowering operational expenses and fulfilling sustain-
mance. By integrating multiple compression approaches in ability objectives, will necessitate innovations that optimize
a synergistic manner, researchers can push the boundaries of computational pathways and minimize energy-intensive pro-
model efficiency and effectiveness, paving the way for more cesses [274]. In the context of federated learning, where
optimized AI systems. model training and data are disseminated across numerous
Automated compression frameworks: there is a press- devices, the imperative for efficient model compression is
ing need for the development of robust frameworks that can underscored. Future research endeavors should be dedicated
automatically select and apply the most suitable compression to crafting compression methodologies that enable swift and
technique based on the characteristics of the model and the effective model updates and sharing across devices, thus alle-
computational environment. By creating automated systems viating bandwidth and storage constraints while preserving
that can intelligently adapt compression strategies to specific model integrity and performance [86, 145, 275].
contexts, researchers can streamline the model compression With the advent of multi-modal ML, the demand for com-
process and enhance its scalability and applicability. pression techniques proficient across varied data types, such
Enhanced real-world applications: future works should as text, images, and audio, will intensify. These techniques
focus on expanding the application of model compression must adeptly navigate the complexities inherent to different
techniques to a wider range of real-world scenarios and data modalities, ensuring the effectiveness of the compressed
domains. By exploring how compression methods can be model across a spectrum of tasks, and paving the way for
tailored to specific industries and use cases, researchers can more adaptable compression techniques that broaden the
demonstrate the practical value and impact of efficient AI utility of ML models in a myriad of settings [276]. Addi-
deployment in diverse settings. tionally, the potential of quantum computing to transform
Ethical considerations and transparency: as model model compression is on the horizon. Probing into how quan-
compression techniques continue to evolve, it is essential tum algorithms could enhance the efficiency of large dataset
for researchers to prioritize ethical considerations and trans- and model compression processes may herald unprecedented
parency in their work. Future studies should emphasize the breakthroughs, offering superior compression capabilities
ethical implications of model compression, ensuring that AI beyond the scope of classical algorithms [277, 278].
11834 P. V. D antas et al.

Future models with self-healing attributes represent an increasingly pivotal. This paper’s exploration of model com-
exciting frontier, where models could autonomously detect pression strategies, their applications, and implications for
performance declines due to compression and adapt proac- future research serves as a base for ongoing and future explo-
tively. Employing mechanisms to self-adjust their structure rations in the field. By addressing the challenges posed by
or parameters would ensure sustained optimal performance, the paradox of progress and limitation, it points the way
even when faced with compression-related challenges [279, for a future where advanced ML models are not only tech-
280]. Lastly, the establishment of definitive benchmarks and nically feasible but also easy to deploy across diverse and
standards for model compression is imperative for a coher- resource-limited environments. The journey of model com-
ent and meaningful evaluation of various techniques. The pression is far from complete, and the insights generated
formulation of standardized metrics that accurately delineate from this research will inspire further innovations, driving
the interplay between compression rate, model performance, the evolution of efficient, scalable, and accessible AI technol-
and computational efficiency is essential for a more objec- ogies.
tive assessment of compression methodologies, propelling
the advancement of more efficacious techniques [281, 282].
These future directions in model compression not only high- Appendix A Comprehensive summary of the
light the field’s potential for significant advancements but references used in this paper
also underscore the interdisciplinary effort required to opti-
mize ML models for the next generation of applications. By In Table 5 is provided a comprehensive summary of the
pushing the boundaries of current methodologies and explor- references used in this paper, categorizing them by their
ing new paradigms, the research community can develop specific application areas such as compression techniques,
more sophisticated, efficient, and versatile model compres- CNNs, DL, generative models, hardware implementation,
sion techniques. and other domains. By organizing key publications, this
summary offers valuable insights into the foundational and
recent advancements in each area, highlighting the strengths
10 Conclusion and potential drawbacks of various methods. This compara-
tive analysis serves as a useful resource for researchers and
In this comprehensive review, we explored various model practitioners looking to implement or further develop model
compression techniques within ML environments, address- compression strategies in their work.
ing the critical challenges faced, and the strategies employed
to overcome them. Our investigation highlighted signifi- Table 5 Summary of references used in this paper, categorized by their
cant advancements in model compression methods, including specific application areas
quantization, pruning, knowledge distillation, transfer learn-
Field Reference
ing, and lightweight architectural designs. We found that each
technique offers unique benefits and limitations, contingent Compression techniques [19–23, 25–27, 56, 59, 62, 65, 66,
upon the specific application and ML model requirements. 69–72, 75, 82–84, 86, 89–94, 96–
101, 106–109, 111, 112, 120–123,
The significance of this review extends beyond the sum- 127, 131, 139, 141–145, 148, 169,
marization of model compression techniques; it underscores 170, 189, 195, 196, 202, 208, 209,
the need for efficient computational models in the era of big 212, 217–219, 232, 242, 243, 250,
data and AI. This exploration aids in the understanding of 251, 254, 255, 260, 283, 284]
how each technique can be optimized and applied to differ- CNNs [8, 24, 52, 52, 58, 95, 110, 134, 137,
149, 152, 162, 164, 174, 176, 180,
ent model architectures and tasks, broadening the scope of
181, 181, 187, 191, 203, 203, 238,
model compression research and application. Secondly, this 247, 249, 252, 285]
paper underscores the significance of performance evalua- DL [2, 33, 49]
tion criteria in assessing the efficacy of compressed models. Generative Models [1, 9, 60]
Moreover, through detailed case studies, the paper illustrates Hardware Implementation [102, 213, 248]
the practical implications and successes of model compres- ML Applications [4, 7, 11, 14, 17, 29, 30, 35–37, 40,
sion techniques in real-world scenarios. These examples 43–45, 55, 57, 67, 78, 81, 85, 114,
not only highlight the effectiveness of model compression 125, 129, 150, 158, 163, 165–167,
in enhancing computational efficiency and model ease of 171, 220, 231, 234, 269, 270, 274,
276, 278, 282, 286, 287]
deployment but also underscore the potential for further inno-
NLP [10, 12, 13, 42, 63, 64, 104, 105,
vation in the field.
128, 140, 157, 190, 192, 221, 235,
In conclusion, as the frontiers of ML and DL continue to 244, 253, 264, 288]
expand, the role of model compression techniques becomes
A comprehensive review of model... 11835

Table 5 continued in this article are included in the article’s Creative Commons licence,
unless indicated otherwise in a credit line to the material. If material
Field Reference
is not included in the article’s Creative Commons licence and your
Neural Architecture [74, 119, 151, 153–156, 172, 173, intended use is not permitted by statutory regulation or exceeds the
205, 206, 233, 236, 263, 268, 289, permitted use, you will need to obtain permission directly from the copy-
290] right holder. To view a copy of this licence, visit http://creativecomm
ons.org/licenses/by/4.0/.
Optimization [15, 124, 184, 222, 228, 230, 261,
277]
Recurrent neural networks [38, 39, 159, 267, 291]
Miscellaneous [3, 5, 6, 16, 18, 28, 31, 32, 34, 41,
46–48, 50, 51, 53, 54, 61, 68, 73, 76,
References
77, 79, 80, 87, 88, 103, 113, 115–
118, 126, 130, 132, 133, 135, 136, 1. Rosenblatt F (1958) The perceptron: A probabilistic model for
138, 146, 147, 160, 161, 168, 175, information storage and organization in the brain. Psychol Rev
175, 177–179, 182, 183, 185, 186, 65(6):386–408. https://doi.org/10.1037/h0042519
188, 193, 194, 197–201, 204, 207, 2. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn
210, 211, 214–216, 223–227, 229, 20(3):273–297. https://doi.org/10.1007/bf00994018
237, 239–241, 245, 246, 256–259, 3. Hochreiter S, Schmidhuber J (1997) Long short-term memory.
262, 265, 266, 271–273, 275, 279– Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.
281, 292, 293] 1997.9.8.1735
4. Lecun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-
This table provides a comprehensive overview of the key publications based learning applied to document recognition. Proc IEEE
that have been referenced throughout the study, offering insights into 86(11):2278–2324. https://doi.org/10.1109/5.726791
the foundational and recent advancements in each area 5. Ho TK (1995). Random decision forests. https://doi.org/10.1109/
icdar.1995.598994
6. Ho TK (1998) The random subspace method for constructing
Acknowledgements The work in this paper is partially funded by the decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–
Engineering and Physical Sciences Research Council (EPSRC) grants 844. https://doi.org/10.1109/34.709601
EP/T026995/1, EP/V000497/1, EP/X037290/1, and Soteria project 7. Hinton GE, Osindero S, Teh Y-W (2006) A fast learning algorithm
awarded by the UK Research and Innovation for the Digital Security for deep belief nets. Neural Comput 18(7):1527–1554. https://doi.
by Design (DSbD) Programme. org/10.1162/neco.2006.18.7.1527
8. Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classifi-
Author Contributions All authors contributed to the study conception cation with deep convolutional neural networks. Commun ACM
and design. Material preparation, literature review and analysis were 60(6):84–90. https://doi.org/10.1145/3065386
performed by all authors. The first draft of the manuscript was written 9. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley
by Pierre Dantas. Waldir Junior and Celso Carvalho commented on D et al (2020) Generative adversarial networks. Commun ACM
previous versions of the manuscript. All authors read and approved the 63(11):139–144. https://doi.org/10.1145/3422622
final manuscript. 10. Fields J, Chovanec K, Madiraju P (2024) A survey of text classifi-
cation with transformers: How wide? how large? how long? how
Data availability statement (DAS) No datasets were generated or ana- accurate? how expensive? how safe? IEEE Access 12:6518–6531.
lyzed during the current study. https://doi.org/10.1109/access.2024.3349952
11. Aftan S, Shah H (2023) A survey on bert and its applications.
Declarations In: IEEE (ed) 2023 20th Learning and Technology Conference
(L&T). https://doi.org/10.1109/lt58159.2023.10092289
12. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-
Ethical and informed consent for data used In adherence to ethical training of Deep Bidirectional Transformers for Language Under-
standards and principles of informed consent, we confirm that all data standing. arXiv. https://doi.org/10.48550/arXiv.1810.04805
used in this manuscript were collected and employed with due regard for 13. Vaswani A, Shazeer N, Parmar N, Uszkoreit J et al (2017) Atten-
ethical norms. The study was conducted transparently, and all authors tion Is All You Need. arXiv. https://doi.org/10.48550/arXiv.1706.
made substantial contributions as detailed in the Authors Contribution 03762
Statement section. We affirm our unwavering commitment to ethical 14. Sevilla J, Heim L, Ho A, Besiroglu T, Hobbhahn M, Villalobos
research practices and informed consent, assuring that this research P (2022) Compute trends across three eras of machine learn-
aligns with established ethical guidelines. ing. In: IEEE (ed) 2022 International Joint Conference on Neural
Networks (IJCNN), pp 1–8. https://doi.org/10.1109/ijcnn55064.
Competing interests All authors certify that they have no affiliations 2022.9891914
with or involvement in any organization or entity with any financial 15. Rasley J, Rajbhandari S, Ruwase O, He Y (2020) DeepSpeed:
interest or non-financial interest in the subject matter or materials dis- System optimizations enable training deep learning models with
cussed in this manuscript. over 100 billion parameters. In: ACM (ed) Proceedings of the 26th
ACM SIGKDD International Conference on Knowledge Discov-
Open Access This article is licensed under a Creative Commons ery and Data Mining. KDD’20. https://doi.org/10.1145/3394486.
Attribution 4.0 International License, which permits use, sharing, adap- 3406703
tation, distribution and reproduction in any medium or format, as 16. Duan Y, Edwards JS, Dwivedi YK (2019) Artificial intelligence
long as you give appropriate credit to the original author(s) and the for decision making in the era of big data - evolution, challenges
source, provide a link to the Creative Commons licence, and indi- and research agenda. Int J Inf Manag 48:63–71. https://doi.org/
cate if changes were made. The images or other third party material 10.1016/j.ijinfomgt.2019.01.021
11836 P. V. D antas et al.

17. Rajbhandari S, Ruwase O, Rasley J, Smith S, He Y (2021) ZeRO- Areas Commun 41(10):3034–3045. https://doi.org/10.1109/jsac.
infinity: breaking the GPU memory wall for extreme scale deep 2023.3310058
learning. In: ACM (ed) Proceedings of the International Confer- 33. Zhao L, Bi Z, Hawbani A, Yu K, Zhang Y, Guizani M (2022) Elite:
ence for High Performance Computing, Networking, Storage and An intelligent digital twin-based hierarchical routing scheme for
Analysis. SC’21. https://doi.org/10.1145/3458817.3476205 softwarized vehicular networks. IEEE Trans Mobile Comput 1–1.
18. Dwivedi YK, Hughes L, Ismagilova et al (2021) Artificial intel- https://doi.org/10.1109/tmc.2022.3179254
ligence (AI): Multidisciplinary perspectives on emerging chal- 34. Ni Q, Ji JC, Halkon B, Feng K, Nandi AK (2023) Physics-
lenges, opportunities, and agenda for research, practice and policy. informed residual network (piresnet) for rolling element bearing
Int J Inf Manag 57:101994. https://doi.org/10.1016/j.ijinfomgt. fault diagnostics. Mechanical Systems and Signal Processing
2019.08.002 200:110544. https://doi.org/10.1016/j.ymssp.2023.110544
19. Vadera S, Ameen S (2022) Methods for pruning deep neural net- 35. Shan T, Zeng J, Song X, Guo R, Li M, Yang F, Xu S (2023)
works. IEEE Access 10:63280–63300. https://doi.org/10.1109/ Physics-informed supervised residual learning for electromag-
access.2022.3182659 netic modeling. IEEE Trans Antennas Propag 71(4):3393–3407.
20. Yeom S-K, Seegerer P, Lapuschkin S, Binder A et al (2021) Prun- https://doi.org/10.1109/tap.2023.3245281
ing by explaining: A novel criterion for deep neural network 36. Bozkaya E, Bilen T, Erel-Özçevik M, Özçevik Y (2023) Energy-
pruning. Pattern Recogn 115:107899. https://doi.org/10.1016/j. aware task scheduling for digital twin edge networks in 6g. https://
patcog.2021.107899 doi.org/10.1109/smartnets58706.2023.10215892
21. Cheng Y, Wang D, Zhou P, Zhang T (2018) Model compres- 37. Zhao R, Yan R, Chen Z, Mao K, Wang P, Gao RX (2019)
sion and acceleration for deep neural networks: The principles, Deep learning and its applications to machine health monitoring.
progress, and challenges. IEEE Signal Process Mag 35(1):126– Mechanical Systems and Signal Processing 115:213–237. https://
136. https://doi.org/10.1109/msp.2017.2765695 doi.org/10.1016/j.ymssp.2018.05.050
22. Tian G, Chen J, Zeng X, Liu Y (2021) Pruning by training: A 38. Bajao NA, Sarucam J-a (2023) Threats detection in the internet
novel deep neural network compression framework for image pro- of things using convolutional neural networks, long short-term
cessing. IEEE Signal Process Lett 28:344–348. https://doi.org/10. memory, and gated recurrent units. Mesopotamian J Cyber Secur
1109/lsp.2021.3054315 22–29. https://doi.org/10.58496/mjcs/2023/005
23. Ji M, Peng G, Li S, Cheng F, Chen Z et al (2022) A neural 39. Yevnin Y, Chorev S, Dukan I, Toledo Y (2023) Short-term wave
network compression method based on knowledge-distillation forecasts using gated recurrent unit model. Ocean Engineering
and parameter quantization for the bearing fault diagnosis. Appl 268:113389. https://doi.org/10.1016/j.oceaneng.2022.113389
Soft Comput 127:109331. https://doi.org/10.1016/j.asoc.2022. 40. Mohan Raparthy Ea (2023) Predictive maintenance in IoT devices
109331 using time series analysis and deep learning. Dandao Xue-
24. Libano F, Wilson B, Wirthlin M, Rech P, Brunhaver J (2020) bao/Journal of Ballistics 35(3):01–10. https://doi.org/10.52783/
Understanding the impact of quantization, accuracy, and radiation dxjb.v35.113
on the reliability of convolutional neural networks on FPGAs. 41. Meriem H, Nora H, Samir O (2023) Predictive maintenance for
IEEE Trans Nucl Sci 67(7):1478–1484. https://doi.org/10.1109/ smart industrial systems: A roadmap. Procedia Computer Science
tns.2020.2983662 220:645–650. https://doi.org/10.1016/j.procs.2023.03.082
25. Haase P, Schwarz H, Kirchhoffer H, Wiedemann et al (2020) 42. Sang GM, Xu L, Vrieze P (2021) A predictive maintenance model
Dependent scalar quantization for neural network compression. for flexible manufacturing in the context of industry 4.0. Frontiers
In: IEEE (ed) 2020 IEEE International Conference on Image Pro- in Big Data 4. https://doi.org/10.3389/fdata.2021.663466
cessing (ICIP). https://doi.org/10.1109/icip40778.2020.9190955 43. Rolf B, Jackson I, Müller M, Lang S, Reggelin T, Ivanov D (2022)
26. Boo Y, Shin S, Sung W (2019) Memorization capacity of deep A review on reinforcement learning algorithms and applications
neural networks under parameter quantization. In: IEEE (ed) in supply chain management. Int J Prod Res 61(20):7151–7179.
ICASSP 2019 - 2019 IEEE International Conference on Acous- https://doi.org/10.1080/00207543.2022.2140221
tics, Speech and Signal Processing (ICASSP). https://doi.org/10. 44. Esteso A, Peidro D, Mula J, Díaz-Madroñero M (2022) Reinforce-
1109/icassp.2019.8682462 ment learning applied to production planning and control. Int J
27. Tadahal S, Bhogar G, S M M, Kulkarni U, Gurlahosur SV, Prod Res 61(16):5772–5789. https://doi.org/10.1080/00207543.
Vyakaranal SB (2022) Post-training 4-bit quantization of deep 2022.2104180
neural networks. In: IEEE (ed) 2022 3rd International Confer- 45. Li C, Zheng P, Yin Y, Wang B, Wang L (2023) Deep reinforcement
ence for Emerging Technology (INCET). https://doi.org/10.1109/ learning in smart manufacturing: A review and prospects. CIRP
incet54531.2022.9825213 J Manuf Sci Technol 40:75–101. https://doi.org/10.1016/j.cirpj.
28. Hu Z, Nie F, Wang R, Li X (2021) Low rank regularization: A 2022.11.003
review. Neural Networks 136:218–232. https://doi.org/10.1016/ 46. Institute of Electrical and Electronics Engineers (2024) IEEE
j.neunet.2020.09.021 Xplore Digital Library. https://ieeexplore.ieee.org. Accessed 23
29. He S, Li Z, Tang Y, Liao Z, Li F, Lim S-J (2020) Parameters Feb 2024
compressing in deep learning. Computers Materials and Continua 47. Elsevier BV (2024) ScienceDirect. https://www.sciencedirect.
62(1):321–336. https://doi.org/10.32604/cmc.2020.06130 com. Accessed 23 Feb 2024
30. Xu H, Wu J, Pan Q, Guan X, Guizani M (2023) A survey on digital 48. Google LLC (2024) Google Scholar. https://scholar.google.com.
twin for industrial internet of things: Applications, technolo- Accessed 23 Feb 2024
gies and tools. IEEE Commun Surv Tutorials 25(4):2569–2598. 49. Developers TensorFlow (2021). TensorFlow Zenodo. https://doi.
https://doi.org/10.1109/comst.2023.3297395 org/10.5281/ZENODO.4758419
31. Feng K, Ji JC, Zhang Y, Ni Q, Liu Z, Beer M (2023) Digital 50. Imambi S, Prakash KB, Kanagachidambaresan GR (2021) In:
twin-driven intelligent assessment of gear surface degradation. Publishing SI (ed) PyTorch, pp 87–104. https://doi.org/10.1007/
Mech Syst Signal Process 186:109896. https://doi.org/10.1016/j. 978-3-030-57077-4_10
ymssp.2022.109896 51. Manessi F, Rozza A, Bianco S, Napoletano P, Schettini R (2018)
32. Zhang Y, Hu J, Min G (2023) Digital twin-driven intelligent task Automated Pruning for Deep Neural Network Compression.
offloading for collaborative mobile edge computing. IEEE J Sel IEEE. https://doi.org/10.1109/icpr.2018.8546129
A comprehensive review of model... 11837

52. Demidovskij A, Smirnov E (2020) Effective Post-Training Quan- tion and Test (VLSI-DAT). https://doi.org/10.1109/vlsi-dat.2018.
tization Of Neural Networks For Inference on Low Power Neu- 8373246
ral Accelerator. IEEE. https://doi.org/10.1109/ijcnn48605.2020. 69. Deng BL, Li G, Han S, Shi L, Xie Y (2020) Model compression
9207281 and hardware acceleration for neural networks: A comprehen-
53. Zhang Y, Ding W, Liu C (2019) Summary of convolutional neural sive survey. Proc IEEE 108(4):485–532. https://doi.org/10.1109/
network compression technology. In: IEEE (ed) 2019 IEEE Inter- jproc.2020.2976475
national Conference on Unmanned Systems (ICUS). https://doi. 70. Russo E, Palesi M, Monteleone S, Patti D et al (2022) DNN
org/10.1109/icus48101.2019.8995969 model compression for IoT domain-specific hardware acceler-
54. Ma L, Cheng N, Wang X, Yin Z, Zhou H, Quan W (2023) Distilling ators. IEEE Internet Things J 9(9):6650–6662. https://doi.org/10.
Knowledge from Resource Management Algorithms to Neu- 1109/jiot.2021.3111723
ral Networks: A Unified Training Assistance Approach. IEEE. 71. Li Z, Li H, Meng L (2023) Model compression for deep neural
https://doi.org/10.1109/vtc2023-fall60731.2023.10333602 networks: A survey. Computers 12(3):60. https://doi.org/10.3390/
55. Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE computers12030060
Trans Knowl Data Eng 22(10):1345–1359. https://doi.org/10. 72. He H, Huang L, Huang Z, Yang T (2022) The compression
1109/tkde.2009.191 techniques applied on deep learning model. Highlights in Sci-
56. Dupuis E, Novo D, O’Connor I, Bosio A (2020) Sensitivity analy- ence, Engineering and Technology 4:325–331. https://doi.org/10.
sis and compression opportunities in DNNs using weight sharing. 54097/hset.v4i.920
In: IEEE (ed) 2020 23rd International Symposium on Design and 73. Zhao P, Yuan G, Cai Y, Niu W, Liu Q et al (2021) Neural pruning
Diagnostics of Electronic Circuits and Systems (DDECS). https:// search for real-time object detection of autonomous vehicles. In:
doi.org/10.1109/ddecs50862.2020.9095658 IEEE (ed) 2021 58th ACM/IEEE Design Automation Conference
57. Li J, Chen J, Cheng P, Yu Z, Yu L, Chi C (2023) A survey on deep- (DAC). https://doi.org/10.1109/dac18074.2021.9586163
learning-based real-time SAR ship detection. IEEE J Sel Topics 74. Malawade A, Odema M, Lajeunesse-degroot S, Al Faruque MA
Appl Earth Obs Remote Sens 16:3218–3247. https://doi.org/10. (2021) SAGE: A split-architecture methodology for efficient end-
1109/jstars.2023.3244616 to-end autonomous vehicle control. ACM Trans Embed Comput
58. Prasad KPSP (2021) Compressed MobilenetV3: an efficient CNN Syst 20(5s):1–22. https://doi.org/10.1145/3477006
for resource constrained platforms. https://doi.org/10.25394/ 75. Yang J, Wang Y, Zhao H, Gui G (2022) MobileNet and
PGS.14442710.V1 knowledge distillation-based automatic scenario recognition
59. Lu Y, Ni R, Wen J (2022) Model compression and acceleration: method in vehicle-to-vehicle systems. IEEE Trans Veh Technol
Lip recognition based on channel-level structured pruning. Appl 71(10):11006–11016. https://doi.org/10.1109/tvt.2022.3184994
Sci 12(20):10468. https://doi.org/10.3390/app122010468 76. Shen S, Yu C, Zhang K, Chen X, Chen H, Ci S (2021)
60. Tantawy D, Zahran M, Wassal A (2021) A survey on GAN accel- Communication-efficient federated learning for connected vehi-
eration using memory compression techniques. J Eng Appl Sci cles with constrained resources. In: IEEE (ed) 2021 International
68(1). https://doi.org/10.1186/s44147-021-00045-5 Wireless Communications and Mobile Computing (IWCMC).
61. Dupuis E, Novo D, O’Connor I, Bosio A (2020) On the auto- https://doi.org/10.1109/iwcmc51323.2021.9498677
matic exploration of weight sharing for deep neural network 77. Pinkham R, Berkovich A, Zhang Z (2021) Near-sensor distributed
compression. In: IEEE (ed) 2020 Design, automation and test DNN processing for augmented and virtual reality. IEEE J Emerg
in Europe conference and exhibition (DATE). https://doi.org/10. Sel Top Circ Syst 11(4):663–676. https://doi.org/10.1109/jetcas.
23919/date48585.2020.9116350 2021.3121259
62. Xu T-B, Liu C-L (2022) Deep neural network self-distillation 78. Fiala G, Ye Z, Steger C (2022) Pupil detection for augmented and
exploiting data representation invariance. IEEE Trans Neural virtual reality based on images with reduced bit depths. In: IEEE
Netw Learn Syst 33(1):257–269. https://doi.org/10.1109/tnnls. (ed) 2022 IEEE Sensors Applications Symposium (SAS). https://
2020.3027634 doi.org/10.1109/sas54819.2022.9881378
63. Gupta M, Agrawal P (2022) Compression of deep learning models 79. Wu D, Yang Z, Zhang P, Wang R, Yang B, Ma X (2023) Virtual-
for text: A survey. ACM Trans Knowl Discov Data 16(4):1–55. reality interpromotion technology for metaverse: A survey. IEEE
https://doi.org/10.1145/3487045 Internet Things J 10(18):15788–15809. https://doi.org/10.1109/
64. Lioutas V, Rashid A, Kumar K, Haidar MA, Rezagholizadeh M jiot.2023.3265848
(2020) Improving word embedding factorization for compression 80. Sun Y, Yuan F, Yang M, Wei G, Zhao Z, Liu D (2020) A generic
using distilled nonlinear neural decomposition. In: Computa- network compression framework for sequential recommender
tional Linguistics A (ed) Findings of the Association for Com- systems. In: ACM (ed) Proceedings of the 43rd International ACM
putational Linguistics: EMNLP 2020. https://doi.org/10.18653/ SIGIR Conference on Research and Development in Information
v1/2020.findings-emnlp.250 Retrieval. SIGIR ’20. https://doi.org/10.1145/3397271.3401125
65. Yuan F, Shou L, Pei J, Lin W, Gong M, Fu Y, Jiang D (2021) Rein- 81. Isinkaye FO (2021) Matrix factorization in recommender sys-
forced multi-teacher selection for knowledge distillation. Proc tems: Algorithms, applications, and peculiar challenges. IETE
AAAI Conf Artif Intell 35(16):14284–14291. https://doi.org/10. J Res 69(9):6087–6100. https://doi.org/10.1080/03772063.2021.
1609/aaai.v35i16.17680 1997357
66. Lyu Z, Yu T, Pan F, Zhang Y, Luo J et al (2023) A survey of model 82. Jo Y-Y et al (2021) Impact of image compression on deep learning-
compression strategies for object detection. Multimed Tools Appl. based mammogram classification. Sci Rep 11(1). https://doi.org/
https://doi.org/10.1007/s11042-023-17192-x 10.1038/s41598-021-86726-w
67. Chen Y, Zheng B, Zhang Z, Wang Q, Shen C, Zhang Q (2020) 83. Liu X, Zhang L, Guo Z, Han T, Ju M, Xu B, Liu H (2022) Medical
Deep learning on mobile and embedded devices: State-of-the-art, image compression based on variational autoencoder. Math Probl
challenges, and future directions. ACM Comput Surv 53(4):1–37. Eng 2022:1–12. https://doi.org/10.1155/2022/7088137
https://doi.org/10.1145/3398209 84. Fernandes FE, Yen GG (2021) Automatic searching and pruning
68. Chen C-J, Chen K-C, Martin-Kuo M-c (2018) Acceleration of of deep neural networks for medical imaging diagnostic. IEEE
neural network model execution on embedded systems. In: IEEE Trans Neural Netw Learn Syst 32(12):5664–5674. https://doi.org/
(ed.) 2018 International Symposium on VLSI Design, Automa- 10.1109/tnnls.2020.3027308
11838 P. V. D antas et al.

85. Tang H, Cen X (2021) A survey of transfer learning applied 101. Helal Uddin M, Baidya S (2023) Optimizing neural network effi-
in medical image recognition. In: IEEE (ed) 2021 IEEE ciency with hybrid magnitude-based and node pruning for energy-
International conference on advances in electrical engineering efficient computing in IoT. In: ACM (ed) Proceedings of the 8th
and computer applications (AEECA). https://doi.org/10.1109/ ACM/IEEE Conference on Internet of Things Design and Imple-
aeeca52519.2021.9574368 mentation. IoTDI’23. https://doi.org/10.1145/3576842.3589175
86. Prakash P, Ding J, Chen R, Qin X, Shu M et al (2022) IoT 102. Shabani H, Singh A, Youhana B, Guo X (2023) HIRAC: A hier-
device friendly and communication-efficient federated learning archical accelerator with sorting-based packing for SpGEMMs in
via joint model pruning and quantization. IEEE Internet Things J DNN applications. In: IEEE (ed) 2023 IEEE International Sym-
9(15):13638–13650. https://doi.org/10.1109/jiot.2022.3145865 posium on High-Performance Computer Architecture (HPCA).
87. Shang F, Lai J, Chen J, Xia W, Liu H (2021) A model compression https://doi.org/10.1109/hpca56546.2023.10070977
based framework for electrical equipment intelligent inspection 103. Ma X, Lin S, Ye S, He Z et al (2022) Non-structured DNN weight
on edge computing environment. In: IEEE (ed) 2021 IEEE 6th pruning—is it beneficial in any platform? IEEE Trans Neural
international conference on cloud computing and big data ana- Netw Learn Syst 33(9):4930–4944. https://doi.org/10.1109/tnnls.
lytics (ICCCBDA). https://doi.org/10.1109/icccbda51879.2021. 2021.3063265
9442600 104. Yu F, Xu Z, Liu C, Stamoulis D et al (2022) AntiDoteX:
88. Elgawi O, Mutawa AM (2020) Low power deep-learning architec- Attention-based dynamic optimization for neural network runtime
ture for mobile IoT intelligence. In: IEEE (ed) 2020 IEEE interna- efficiency. IEEE Trans Comput Aided Des Integr Circuits Syst
tional conference on informatics, IoT, and enabling technologies 41(11):4694–4707. https://doi.org/10.1109/tcad.2022.3144616
(ICIoT). https://doi.org/10.1109/iciot48696.2020.9089642 105. Liu Y, Lin Z, Yuan F (2021) ROSITA: Refined bert compres-
89. Han S, Mao H, Dally WJ (2015) Deep Compression: Com- sion with integrated techniques. Proc AAAI Conf Artif Intell
pressing Deep Neural Networks with Pruning, Trained Quantiza- 35(10):8715–8722. https://doi.org/10.1609/aaai.v35i10.17056
tion and Huffman Coding. arXiv. https://doi.org/10.48550/arXiv. 106. Zhang J, Chen X, Song M, Li T (2019) Eager pruning: algo-
1510.00149 rithm and architecture support for fast training of deep neural
90. Lee K, Hwangbo S, Yang D, Lee G (2023) Compression of deep- networks. In: ACM (ed) Proceedings of the 46th international
learning models through global weight pruning using alternating symposium on computer architecture. ISCA’19. https://doi.org/
direction method of multipliers. Int J Comput Intell Syst 16(1). 10.1145/3307650.3322263
https://doi.org/10.1007/s44196-023-00202-z 107. Huang G, Li H, Qin M, Sun F, Ding Y, Xie Y (2022) Shfl-bw:
91. Cai G, Li J, Liu X, Chen Z, Zhang H (2023) Learning and accelerating deep neural network inference with tensor-core aware
compressing: Low-rank matrix factorization for deep neural net- weight pruning. In: ACM (ed) Proceedings of the 59th ACM/IEEE
work compression. Appl Sci 13(4):2704. https://doi.org/10.3390/ design automation conference. DAC’22. https://doi.org/10.1145/
app13042704 3489517.3530588
92. Hsu Y-C, Hua T, Chang S, Lou Q, Shen Y, Jin H (2022) Language 108. Zhao X, Yao Y, Wu H, Zhang X (2021) Structural watermarking
model compression with weighted low-rank factorization. arXiv. to deep neural networks via network channel pruning. In: IEEE
https://doi.org/10.48550/arXiv.2207.00112 (ed) 2021 IEEE international workshop on information forensics
93. Suau X, Zappella u, Apostoloff N (2020) Filter distillation for net- and security (WIFS). https://doi.org/10.1109/wifs53200.2021.
work compression. In: IEEE (ed) 2020 IEEE Winter conference 9648376
on applications of computer vision (WACV). https://doi.org/10. 109. Hu P, Peng X, Zhu H, Aly MMS, Lin J (2022) OPQ: Compress-
1109/wacv45572.2020.9093546 ing Deep Neural Networks with One-shot Pruning-Quantization.
94. Prakosa SW, Leu J-S, Chen Z-H (2020) Improving the accuracy of arXiv. https://doi.org/10.48550/arXiv.2205.11141
pruned network using knowledge distillation. Pattern Anal Appl 110. Guo X, Hou B, Ren B, Ren Z, Jiao L (2022) Network pruning
24(2):819–830. https://doi.org/10.1007/s10044-020-00940-2 for remote sensing images classification based on interpretable
95. Howard AG, Zhu M, Chen B, Kalenichenko D et al (2017) CNNs. IEEE Trans Geosci Remote Sens 60:1–15. https://doi.org/
MobileNets: Efficient Convolutional neural networks for mobile 10.1109/tgrs.2021.3077062
vision applications. arXiv. https://doi.org/10.48550/arXiv.1704. 111. Song Q, Xia X (2022) A survey on pruning algorithm based
04861 on optimized depth neural network. Int J Comput Commun Eng
96. Iandola FN, Han S, Moskewicz MW, Ashraf K et al (2016) 11(2):10–23. https://doi.org/10.17706/ijcce.2022.11.2.10-23
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters 112. Ghosh S, Prasad K, Dai X, Zhang P et al (2023) Pruning Compact
and < 0.5MB model size. arXiv. https://doi.org/10.48550/arXiv. ConvNets for Efficient Inference. arXiv. https://doi.org/10.48550/
1602.07360 arXiv.2301.04502
97. Li M, Zhang X, Guo J, Li F (2023) Cloud–edge collaborative 113. Balasubramaniam S, Kavitha DV (2013) A survey on data retrieval
inference with network pruning. Electronics 12(17):3598. https:// techniques in cloud computing 8:15. https://api.semanticscholar.
doi.org/10.3390/electronics12173598 org/CorpusID:15715742
98. Meng J, Yang L, Peng X, Yu S, Fan D, Seo J-S (2021) Structured 114. Saqib E, Leal IS, Shallari I, Jantsch A, Krug S, O’Nils M (2023)
pruning of RRAM crossbars for efficient in-memory computing Optimizing the IoT performance: A case study on pruning a
acceleration of deep neural networks. IEEE Trans Circuits Syst distributed CNN. In: IEEE (ed) 2023 IEEE sensors applica-
II Express Briefs 68(5):1576–1580. https://doi.org/10.1109/tcsii. tions symposium (SAS). https://doi.org/10.1109/sas58821.2023.
2021.3069011 10254054
99. Liu J, Zhuang B, Zhuang Z, Guo Y et al (2021) Discrimination- 115. Touvron H et al (2023) Llama 2: Open Foundation and Fine-Tuned
aware network pruning for deep model compression. IEEE Trans Chat Models. arXiv. https://doi.org/10.48550/arXiv.2307.09288
Pattern Anal Mach Intell 1–1. https://doi.org/10.1109/tpami. 116. Kim J, Chang S, Kwak N (2021) PQK: Model compres-
2021.3066410 sion via pruning, quantization, and knowledge distillation.
100. Lee S-T, Lim S, Bae J-H, Kwon et al (2020) Pruning for hardware- In: ISCA (ed) Interspeech 2021. https://doi.org/10.21437/
based deep spiking neural networks using gated schottky diode interspeech.2021-248
as synaptic devices. J Nanosci Nanotechnol 20(11):6603–6608. 117. Long Y, Lee E, Kim D, Mukhopadhyay S (2020) Q-PIM: A genetic
https://doi.org/10.1166/jnn.2020.18772 algorithm based flexible DNN quantization method and applica-
tion to processing-in-memory platform. In: IEEE (ed) 2020 57th
A comprehensive review of model... 11839

ACM/IEEE design automation conference (DAC). https://doi.org/ 133. Long Z, Zhu C, Liu J, Comon P, Liu Y (2022) Trainable sub-
10.1109/dac18072.2020.9218737 spaces for low rank tensor completion: Model and analysis. IEEE
118. Liu F, Yang N, Jiang L (2023) PSQ: An automatic search frame- Transactions on Signal Processing 70:2502–2517. https://doi.org/
work for data-free quantization on pim-based architecture. In: 10.1109/tsp.2022.3173470
IEEE (ed) 2023 IEEE 41st international conference on computer 134. Chen W, Wilson J, Tyree S, Weinberger KQ, Chen Y (2016) Com-
design (ICCD). https://doi.org/10.1109/iccd58817.2023.00084 pressing convolutional neural networks in the frequency domain.
119. Guo K, Sui L, Qiu J, Yao S, Han S, Wang Y, Yang H (2016) From In: ACM (ed) Proceedings of the 22nd ACM SIGKDD interna-
model to FPGA: Software-hardware co-design for efficient neural tional conference on knowledge discovery and data mining. KDD
network acceleration. In: IEEE (ed) 2016 IEEE Hot Chips 28 Sym- ’16. https://doi.org/10.1145/2939672.2939839
posium (HCS). https://doi.org/10.1109/hotchips.2016.7936208 135. Chen S, Sun W, Huang L, Yang X, Huang J (2019) Compressing
120. Liu X, Li B, Chen Z, Yuan Y (2021) Exploring gradient flow based fully connected layers using kronecker tensor decomposition. In:
saliency for DNN model compression. In: ACM (ed) Proceedings IEEE (ed) 2019 IEEE 7th international conference on computer
of the 29th ACM international conference on multimedia. MM science and network technology (ICCSNT). https://doi.org/10.
’21. https://doi.org/10.1145/3474085.3475474 1109/iccsnt47585.2019.8962432
121. Jin H, Wu D, Zhang S, Zou X et al (2023) Design of a 136. Yu X, Liu T, Wang X, Tao D (2017) On compressing deep models
quantization-based DNN delta compression framework for model by low rank and sparse decomposition. In: IEEE (ed) 2017 IEEE
snapshots and federated learning. IEEE Trans Parallel Distrib Syst conference on computer vision and pattern recognition (CVPR).
34(3):923–937. https://doi.org/10.1109/tpds.2022.3230840 https://doi.org/10.1109/cvpr.2017.15
122. Gong C, Chen Y, Lu Y, Li T, Hao C, Chen D (2021) Vecq: 137. Lin S, Ji R, Chen C, Tao D, Luo J (2019) Holistic CNN compres-
Minimal loss DNN model compression with vectorized weight sion via low-rank decomposition with knowledge transfer. IEEE
quantization. IEEE Trans Comput 70(5):696–710. https://doi.org/ Trans Pattern Anal Mach Intell 41(12):2889–2905. https://doi.
10.1109/tc.2020.2995593 org/10.1109/tpami.2018.2873305
123. Zhao M, Tong X, Wu W, Wang Z, Zhou B, Huang X (2022) A novel 138. Li W, Wang Y, Liu N, Xiao C, Sun Z, Du Q (2023)
deep-learning model compression based on filter-stripe group Integrated spatio-spectral-temporal fusion via anisotropic spar-
pruning and its IoT application. Sensors 22(15):5623. https://doi. sity constrained low-rank tensor approximation. IEEE Trans
org/10.3390/s22155623 Geosci Remote Sens 61:1–16. https://doi.org/10.1109/tgrs.2023.
124. Suo J, Zhang X, Zhang S, Zhou W, Shi W (2021) Feasibility anal- 3284481
ysis of machine learning optimization on GPU-based low-cost 139. Yang Z, Zhang Y, Sui D, Ju Y, Zhao J, Liu K (2023) Explana-
edges. In: IEEE (ed) 2021 IEEE SmartWorld, ubiquitous intelli- tion guided knowledge distillation for pre-trained language model
gence and computing, advanced and trusted computing, scalable compression. ACM Trans Asian Low-Resource Lang Inf Process.
computing and communications, internet of people and smart https://doi.org/10.1145/3639364
city innovation (SmartWorld/SCALCOM/UIC/ATC/IOP/SCI). 140. Ji M, Heo B, Park S (2021) Show, attend and distill: Knowledge
https://doi.org/10.1109/swc50871.2021.00022 distillation via attention-based feature matching. Proc AAAI Conf
125. Manzano Sanchez RA, Naik K, Albasir A, Zaman M, Goel N Artif Intell 35(9):7945–7952. https://doi.org/10.1609/aaai.v35i9.
(2022) Detection of anomalous behavior of smartphone devices 16969
using changepoint analysis and machine learning techniques. Dig- 141. Li Y, Hu F, Liu Y, Ryan M, Wang R (2023) A hybrid model
ital Threats: Research and Practice 4(1):1–28. https://doi.org/10. compression approach via knowledge distillation for predict-
1145/3492327 ing energy consumption in additive manufacturing. Int J Prod
126. Liu J, Wang Q, Zhang D, Shen L (2021) Super-resolution model Res 61(13):4525–4547. https://doi.org/10.1080/00207543.2022.
quantized in multi-precision. Electronics 10(17):2176. https://doi. 2160501
org/10.3390/electronics10172176 142. Xu Q, Wu M, Li X, Mao K, Chen Z (2023) Contrastive distil-
127. Ma H, Qiu et al (2024) Quantization backdoors to deep learning lation with regularized knowledge for deep model compression
commercial frameworks. IEEE Trans Dependable Secure Comput on sensor-based human activity recognition. IEEE Trans Ind
1–18. https://doi.org/10.1109/tdsc.2023.3271956 Cyber-Physical Syst 1:217–226. https://doi.org/10.1109/ticps.
128. Wang Z, Li JB, Qu S, Metze F, Strubell E (2022) SQuAT: 2023.3320630
Sharpness- and Quantization-Aware Training for BERT. arXiv. 143. Tan S, Tam et al (2023) GKD: A General Knowledge Distillation
https://doi.org/10.48550/arXiv.2210.07171. arxiv:2210.07171 Framework for Large-scale Pre-trained Language Model. arXiv.
129. Lu H, Chen X, Shi J, Vaidya J, Atluri V, Hong Y, Huang W (2020) https://doi.org/10.48550/arXiv.2306.06629
Algorithms and applications to weighted rank-one binary matrix 144. Ravikumar D, Saha G, Aketi SA, Roy K (2023) Homogenizing
factorization. ACM Trans Manag Inf Syst 11(2):1–33. https://doi. Non-IID datasets via In-Distribution Knowledge Distillation for
org/10.1145/3386599 Decentralized Learning. arXiv. https://doi.org/10.48550/arXiv.
130. Goyal S, Roy Choudhury A, Sharma V (2019) Compression 2304.04326
of deep neural networks by combining pruning and low rank 145. Wu Z, Sun S, Wang Y, Liu M, Jiang X, Li R, Gao B (2023)
decomposition. In: IEEE (ed) 2019 IEEE International Parallel Survey of Knowledge Distillation in Federated Edge Learning.
and Distributed Processing Symposium Workshops (IPDPSW). arXiv. https://doi.org/10.48550/arXiv.2301.05849
https://doi.org/10.1109/ipdpsw.2019.00162 146. Wang R, Li Z, Yang J, Cao T et al (2023) Mutually-paced knowl-
131. Yin M, Sui Y, Liao S, Yuan B (2021) Towards efficient tensor edge distillation for cross-lingual temporal knowledge graph rea-
decomposition-based DNN model compression with optimiza- soning. In: ACM (ed) Proceedings of the ACM Web Conference
tion framework. In: IEEE (ed) 2021 IEEE/CVF conference on 2023. WWW ’23. https://doi.org/10.1145/3543507.3583407
computer vision and pattern recognition (CVPR). https://doi.org/ 147. Hou Y, Zhu X, Ma Y, Loy CC, Li Y (2022) Point-to-voxel
10.1109/cvpr46437.2021.01053 knowledge distillation for lidar semantic segmentation. In:
132. Xue J, Zhao Y, Huang S, Liao W et al (2022) Multilayer sparsity- IEEE (ed) 2022 IEEE/CVF conference on computer vision and
based tensor decomposition for low-rank tensor completion. IEEE pattern recognition (CVPR). https://doi.org/10.1109/cvpr52688.
Trans Neural Netw Learn Syst 33(11):6916–6930. https://doi.org/ 2022.00829
10.1109/tnnls.2021.3083931 148. Li Z, Xu P, Chang X, Yang L, Zhang Y, Yao L, Chen X (2023)
When object detection meets knowledge distillation: A survey.
11840 P. V. D antas et al.

IEEE Trans Pattern Anal Mach Intell 45(8):10555–10579. https:// 164. Alzubaidi L, Al-Shamma O, Fadhel MA, Farhan L, Zhang J, Duan
doi.org/10.1109/tpami.2023.3257546 Y (2020) Optimizing the performance of breast cancer classifica-
149. Dewan JH, Das R, Thepade SD, Jadhav H et al (2023) Image tion by employing the same domain transfer learning from hybrid
classification by transfer learning using pre-trained CNN mod- deep convolutional neural network model. Electronics 9(3):445.
els. In: IEEE (ed) 2023 International Conference on Recent https://doi.org/10.3390/electronics9030445
Advances in Electrical, Electronics, Ubiquitous Communication, 165. Askarizadeh M, Morsali A, Nguyen KK (2024) Resource-
and Computational Intelligence (RAEEUCCI). https://doi.org/10. constrained multisource instance-based transfer learning. IEEE
1109/raeeucci57140.2023.10134069 Trans Neural Netw Learn Syst 1–15. https://doi.org/10.1109/
150. Ullah N, Khan JA, Khan MS, Khan W et al (2022) An effective tnnls.2023.3327248
approach to detect and identify brain tumors using transfer learn- 166. Li W, Huang R, Li J, Liao Y, Chen Z et al (2022) A perspective
ing. Appl Sci 12(11):5645. https://doi.org/10.3390/app12115645 survey on deep transfer learning for fault diagnosis in industrial
151. Dar SUH, Özbey M, Çatlı AB, Çukur T (2020) A transfer-learning scenarios: Theories, applications and challenges. Mechanical Sys-
approach for accelerated MRI using deep neural networks. Magn tems and Signal Processing 167:108487. https://doi.org/10.1016/
Reson Med 84(2):663–685. https://doi.org/10.1002/mrm.28148 j.ymssp.2021.108487
152. Paymode AS, Malode VB (2022) Transfer learning for multi-crop 167. Aghbalou A, Staerman G (2023) Hypothesis Transfer Learn-
leaf disease image classification using convolutional neural net- ing with Surrogate Classification Losses: Generalization Bounds
work VGG. Artificial Intelligence in Agriculture 6:23–33. https:// through Algorithmic Stability. arXiv. https://doi.org/10.48550/
doi.org/10.1016/j.aiia.2021.12.002 arXiv.2305.19694
153. N K, Narasimha Prasad LV, Pavan Kumar CS, Subedi B et al 168. Chen Y, Liu L, Li J, Jiang H, Ding C, Zhou Z (2022) MetaLR:
(2021) Rice leaf diseases prediction using deep neural networks Meta-tuning of Learning Rates for Transfer Learning in Medical
with transfer learning. Environ Res 198:111275. https://doi.org/ Imaging. arXiv. https://doi.org/10.48550/arXiv.2206.01408
10.1016/j.envres.2021.111275 169. Li Y, Li Z, Zhang T, Zhou P, Feng S, Yin K (2021) Design of a novel
154. Vallabhajosyula S, Sistla V, Kolli VKK (2021) Transfer learning- neural network compression method for tiny machine learning. In:
based deep ensemble neural network for plant leaf disease ACM (ed) Proceedings of the 2021 5th International Conference
detection. J Plant Dis Prot 129(3):545–558. https://doi.org/10. on Electronic Information Technology and Computer Engineer-
1007/s41348-021-00465-8 ing. EITCE 2021. https://doi.org/10.1145/3501409.3501526
155. Chai C, Maceira M, Santos-Villalobos HJ et al (2020) Using a 170. Cai M, Su Y, Wang B, Zhang T (2023) Research on compres-
deep neural network and transfer learning to bridge scales for sion pruning methods based on deep learning. J Phys: Conf
seismic phase picking. Geophys Res Lett 47(16). https://doi.org/ Ser 2580(1):012060. https://doi.org/10.1088/1742-6596/2580/1/
10.1029/2020gl088651 012060
156. Glory Precious J, Angeline Kirubha SP, Keren Evangeline I (2022) 171. Hayder Z, He X, Salzmann M (2016) Learning to co-generate
Deployment of a mobile application using a novel deep neural object proposals with a deep structured network. In: IEEE (ed)
network and advanced pre-trained models for the identification 2016 IEEE Conference on Computer Vision and Pattern Recog-
of brain tumours. IETE Journal of Research 69(10):6902–6914. nition (CVPR). https://doi.org/10.1109/cvpr.2016.281
https://doi.org/10.1080/03772063.2022.2083027 172. Sze V, Chen Y-H, Yang T-J, Emer JS (2017) Efficient pro-
157. Han L, Gladkoff S, Erofeev G, Sorokina I, Galiano B, Nenadic cessing of deep neural networks: A tutorial and survey. Proc
G (2023) Neural Machine Translation of Clinical Text: An IEEE 105(12):2295–2329. https://doi.org/10.1109/jproc.2017.
Empirical Investigation into Multilingual Pre-Trained Language 2761740
Models and Transfer-Learning. arXiv. https://doi.org/10.48550/ 173. Gholami A, Kwon K, Wu B, Tai Z, Yue X et al (2018)
arXiv.2312.07250 SqueezeNext: Hardware-Aware Neural Network Design. arXiv.
158. Kora P, Ooi CP, Faust O, Raghavendra U et al (2022) Transfer https://doi.org/10.48550/arXiv.1803.10615
learning techniques for medical image analysis: A review. Bio- 174. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C (2018)
cybern Biomed Eng 42(1):79–107. https://doi.org/10.1016/j.bbe. MobileNetV2: Inverted residuals and linear bottlenecks. In: IEEE
2021.11.004 (ed) 2018 IEEE/CVF Conference on Computer Vision and Pattern
159. Sasikala S, Ramesh S, Gomathi S, Balambigai S, Anbumani V Recognition. https://doi.org/10.1109/cvpr.2018.00474
(2021) Transfer learning based recurrent neural network algorithm 175. Howard A, Sandler M, Chu G, Chen L-C, Chen B et al (2019)
for linguistic analysis. Concurr Comput Pract Experience 34(5). Searching for MobileNetV3. arXiv. https://doi.org/10.48550/
https://doi.org/10.1002/cpe.6708 arXiv.1905.02244
160. Akhauri S, Zheng LY, Lin MC (2020) Enhanced transfer learning 176. Tan M, Le QV (2019) EfficientNet: Rethinking model scaling for
for autonomous driving with systematic accident simulation. In: convolutional neural networks. https://doi.org/10.48550/arXiv.
IEEE (ed) 2020 IEEE/RSJ international conference on intelligent 1905.11946
robots and systems (IROS). https://doi.org/10.1109/iros45743. 177. Howard A, Sandler M, Chen et al (2019) Searching for
2020.9341538 MobileNetV3. https://doi.org/10.1109/iccv.2019.00140
161. Feng T, Narayanan S (2023) PEFT-SER: On the Use of Parame- 178. Tan M, Chen B, Pang R, Vasudevan V, Sandler M, Howard A, Le
ter Efficient Transfer Learning Approaches For Speech Emotion QV (2019) Mnasnet: Platform-aware neural architecture search
Recognition Using Pre-trained Speech Models. arXiv. https://doi. for mobile. https://doi.org/10.1109/cvpr.2019.00293
org/10.48550/arXiv.2306.05350 179. Aghera S, Gajera H, Mitra SK (2020). Mnasnet based lightweight
162. Salehi AW, Khan S, Gupta G, Alabduallah BI et al (2023) A study CNN for facial expression recognition. https://doi.org/10.1109/
of CNN and transfer learning in medical imaging: Advantages, isssc50941.2020.9358903
challenges, future scope. Sustainability 15(7):5930. https://doi. 180. Zhang X, Zhou X, Lin M, Sun J (2017) ShuffleNet: An Extremely
org/10.3390/su15075930 Efficient Convolutional Neural Network for Mobile Devices.
163. Noé IT, Costa LHL, Medeiros TH (2023) Masked faces: Over- arXiv. https://doi.org/10.48550/arXiv.1707.01083
coming recognition challenges with transfer learning in cnns. In: 181. Ma N, Zhang X, Zheng H-T, Sun J (2018) ShuffleNet V2: Practical
Computação - SBC SB (ed) Anais do XI Symposium on Knowl- Guidelines for Efficient CNN Architecture Design. arXiv. https://
edge Discovery, Mining and Learning (KDMiLe 2023). KDMiLe doi.org/10.48550/arXiv.1807.11164
2023. https://doi.org/10.5753/kdmile.2023.232907
A comprehensive review of model... 11841

182. Arun Y, Viknesh GS (2022). Leaf classification for plant recog- 199. Choi H, Bajic IV (2020) A lightweight model for deep frame
nition using EfficientNet architecture. https://doi.org/10.1109/ prediction in video coding. In: IEEE (ed.) 2020 54th Asilomar
icaecc54045.2022.9716637 Conference on Signals, Systems, and Computers. https://doi.org/
183. Mantha T, Eswara Reddy B (2021) A transfer learning method for 10.1109/ieeeconf51394.2020.9443427
brain tumor classification using EfficientNet-b3 model. https:// 200. Cheng J, He R, Yuepeng E, Wu Y, You J, Li T (2020) Real-time
doi.org/10.1109/csitss54238.2021.9683036 encrypted traffic classification via lightweight neural networks. In:
184. Tan M, Le QV (2021) EfficientNetV2: Smaller models and faster IEEE (ed) GLOBECOM 2020 - 2020 IEEE Global Communica-
training. https://doi.org/10.48550/arXiv.2104.00298 tions Conference. https://doi.org/10.1109/globecom42002.2020.
185. Zhang H, Wu C, Zhang Z, Zhu et al (2022) Resnest: Split-attention 9322309
networks. https://doi.org/10.1109/cvprw56347.2022.00309 201. Phan H-H, Ha CT, Nguyen TT (2020) Improving the efficiency of
186. Wang F, Pan C, Huang J (2022) Application of model compres- human action recognition using deep compression. In: IEEE (ed)
sion technology based on knowledge distillation in convolutional 2020 International Conference on Multimedia Analysis and Pat-
neural network lightweight. In: IEEE (ed) 2022 China Automa- tern Recognition (MAPR). https://doi.org/10.1109/mapr49794.
tion Congress (CAC). https://doi.org/10.1109/cac57257.2022. 2020.9237772
10055501 202. Kumar R, Chen GK, Ekin Sumbul H, Knag et al (2020) A 9.0-
187. Wang Z, Du L, Li Y (2021) Boosting lightweight cnns through TOPS/W hash-based deep neural network accelerator enabling
network pruning and knowledge distillation for SAR target 128× model compression in 10-nm FinFET CMOS. IEEE Solid-
recognition. IEEE J Sel Topics Appl Earth Obs Remote Sens State Circ Lett 3:338–341. https://doi.org/10.1109/lssc.2020.
14:8386–8397. https://doi.org/10.1109/jstars.2021.3104267 3019349
188. Zhu X, Jiang Z, Lou Y (2023) Real-time lightweight hand detec- 203. Tu C-H, Lee J-H, Chan Y-M, Chen C-S (2020) Pruning depthwise
tion model combined with network pruning. In: IEEE (ed) 2023 separable convolutions for MobileNet compression. In: IEEE (ed.)
IEEE/ACIS 23rd International Conference on Computer and 2020 international joint conference on neural networks (IJCNN).
Information Science (ICIS). https://doi.org/10.1109/icis57766. https://doi.org/10.1109/ijcnn48605.2020.9207259
2023.10210237 204. Zheng Y, Zhou Y, Zhao Z, Yu D (2021). Adaptive Tensor-Train
189. Chen Z-C, Jhong S-Y, Hsia C-H (2021) Design of a lightweight Decomposition for Neural Network Compression. https://doi.org/
palmf-vein authentication system based on model compression. 10.1007/978-3-030-69244-5_6
J Inf Sci Eng 37(4) . https://doi.org/10.6688/JISE.202107_37(4). 205. Hosseini M, Manjunath N, Kallakuri U, Mahmoodi H, Homayoun
0005 H, Mohsenin T (2021) Cyclic sparsely connected architectures:
190. Yasir M, Ullah I, Choi C (2023) Depthwise channel attention From foundations to applications. IEEE Solid-State Circuits Mag
network (DWCAN): An efficient and lightweight model for sin- 13(4):64–76. https://doi.org/10.1109/mssc.2021.3111431
gle image super-resolution and metaverse gaming. Expert Syst. 206. He C, Tan H, Huang S, Cheng R (2021) Efficient evolutionary neu-
https://doi.org/10.1111/exsy.13516 ral architecture search by modular inheritable crossover. Swarm
191. Zhou H, Liu A, Cui H, Bie Y, Chen X (2023) SleepNet-Lite: A Evol Comput 64:100894. https://doi.org/10.1016/j.swevo.2021.
novel lightweight convolutional neural network for single-channel 100894
EEG-based sleep staging. IEEE Sensors Letters 7(2):1–4. https:// 207. Lee J-G, Roh Y, Song H, Whang SE (2021) Machine learning
doi.org/10.1109/lsens.2023.3239343 robustness, fairness, and their convergence. In: ACM (ed.) Pro-
192. Abbas Q, Daadaa Y, Rashid U, Ibrahim MEA (2023) Assist- ceedings of the 27th ACM SIGKDD Conference on Knowledge
dermo: A lightweight separable vision transformer model for Discovery & Data Mining. KDD ’21. https://doi.org/10.1145/
multiclass skin lesion classification. Diagnostics 13(15):2531. 3447548.3470799
https://doi.org/10.3390/diagnostics13152531 208. Bhardwaj K, Lin C-Y, Sartor A, Marculescu R (2019) Memory-
193. Yu J, Yu X, Liu Y, Liu L, Peng X (2021) An 8-bit fixed point quanti- and communication-aware model compression for distributed
zation method for sparse MobileNetV2. In: IEEE (ed) 2021 China deep learning inference on IoT. ACM Trans Embed Comput Syst
Automation Congress (CAC). https://doi.org/10.1109/cac53003. 18(5s):1–22. https://doi.org/10.1145/3358205
2021.9727524 209. Qin L, Sun J (2023) Model compression for data compression:
194. Xiaowei G, Hui T, Zhongjian D (2021) Structured attention Neural network based lossless compressor made practical. In:
knowledge distillation for lightweight networks. In: IEEE (ed) IEEE (ed) 2023 Data Compression Conference (DCC). https://
2021 33rd Chinese Control and Decision Conference (CCDC). doi.org/10.1109/dcc55655.2023.00013
https://doi.org/10.1109/ccdc52312.2021.9601745 210. Dwivedi R, Dave D, Naik et al (2023) Explainable AI (XAI): Core
195. Crowley EJ, Gray G, Turner J, Storkey A (2021) Substitut- ideas, techniques, and solutions. ACM Comput Surv 55(9):1–33.
ing convolutions for neural network compression. IEEE Access https://doi.org/10.1145/3561048
9:83199–83213. https://doi.org/10.1109/access.2021.3086321 211. Pradhan B, Dikshit A, Lee S, Kim H (2023) An explain-
196. Wang P, He X, Chen Q, Cheng A, Liu Q, Cheng J (2021) Unsu- able AI (XAI) model for landslide susceptibility modeling.
pervised network quantization via fixed-point factorization. IEEE Applied Soft Computing 142:110324. https://doi.org/10.1016/j.
Trans Neural Netw Learn Syst 32(6):2706–2720. https://doi.org/ asoc.2023.110324
10.1109/tnnls.2020.3007749 212. Yan S, Natarajan S, Joshi S, Khardon R, Tadepalli P (2023)
197. Chen X, Pan R, Wang X, Tian F, Tsui C-Y (2023) Late breaking Explainable models via compression of tree ensembles. Mach
results: Weight decay is all you need for neural network sparsi- Learn 113(3):1303–1328. https://doi.org/10.1007/s10994-023-
fication. In: IEEE (ed) 2023 60th ACM/IEEE Design Automa- 06463-1
tion Conference (DAC). https://doi.org/10.1109/dac56929.2023. 213. Kim J, Ko G, Kim J-H, Lee C, Kim T, Youn C-H, Kim J-Y (2023)
10247950 A 26.55tops/w explainable AI processor with dynamic workload
198. Hu Y, Ye Q, Zhang Z, Lv J (2022) A layer-based sparsification allocation and heat map compression/pruning. https://doi.org/10.
method for distributed DNN training. In: IEEE (ed) 2022 IEEE 1109/cicc57935.2023.10121215
24th Int Conf on High Performance Computing and Commu- 214. Zee T, Lakshmana M, Nwogu I (2022) Towards understanding
nications (HPCC). https://doi.org/10.1109/hpcc-dss-smartcity- the behaviors of pretrained compressed convolutional models.
dependsys57074.2022.00209 In: 2022 26th International Conference on Pattern Recognition
(ICPR). https://doi.org/10.1109/icpr56361.2022.9956037
11842 P. V. D antas et al.

215. He X, Zhao K, Chu X (2021) AutoML: A survey of the state-of- 232. Liu B, Hu B-B, Zhao M, Peng S-L, Chang J-M (2023) Model
the-art. Knowledge-Based Systems 212:106622. https://doi.org/ compression algorithm via reinforcement learning and knowl-
10.1016/j.knosys.2020.106622 edge distillation. Mathematics 11(22):4589. https://doi.org/10.
216. McCoy T, Pavlick E, Linzen T (2019) Right for the wrong reasons: 3390/math11224589
Diagnosing syntactic heuristics in natural language inference. In: 233. Careem R, Md Johar MG, Khatibi A (2024) Deep neural networks
Computational Linguistics A (ed) Proceedings of the 57th Annual optimization for resource-constrained environments: techniques
Meeting of the Association for Computational Linguistics. https:// and models. Indones J Electr Eng Comput Sci 33(3):1843. https://
doi.org/10.18653/v1/p19-1334 doi.org/10.11591/ijeecs.v33.i3.pp1843-1854
217. Choudhary T, Mishra V, Goswami A, Sarangapani J (2020) 234. Abood MJK, Abdul-Majeed GH (2024) Enhancing multi-class
A comprehensive survey on model compression and accelera- ddos attack classification using machine learning techniques. J
tion. Artif Intell Rev 53(7):5113–5155. https://doi.org/10.1007/ Adv Res Appl Sci Eng Technol 43(2):75–92. https://doi.org/10.
s10462-020-09816-7 37934/araset.43.2.7592
218. Stoychev S, Gunes H (2022) The Effect of Model Compression 235. Hossain MB, Gong N, Shaban M (2024) A novel attention-based
on Fairness in Facial Expression Recognition. arXiv. https://doi. layer pruning approach for low-complexity convolutional neural
org/10.48550/arXiv.2201.01709 networks. Advanced Intelligent Systems. https://doi.org/10.1002/
219. Ishaque S, Khan N, Krishnan S (2022) Detecting stress through 2D aisy.202400161
ECG images using pretrained models, transfer learning and model 236. Xu X, Ma L, Zeng T, Huang Q (2023) Quantized graph neural net-
compression techniques. Mach Learn Appl 10:100395. https:// works for image classification. Mathematics 11(24):4927. https://
doi.org/10.1016/j.mlwa.2022.100395 doi.org/10.3390/math11244927
220. Choudhury A, Balasubramaniam S, Kumar AP, Kumar SNP 237. Zhang J, Liu X (2023) Design of low power LSTM neural net-
(2023) Psso: Political squirrel search optimizer-driven deep learn- work accelerator based on FPGA. IEEE. https://doi.org/10.1109/
ing for severity level detection and classification of lung cancer. iccc59590.2023.10507503
Int J Inf Technol Decis Making 1–34. https://doi.org/10.1142/ 238. Sui X, Lv Q, Zhi L, Zhu B, Yang Y, Zhang Y, Tan Z (2023)
s0219622023500189 A hardware-friendly high-precision CNN pruning method and
221. Sun S, Cheng Y, Gan Z, Liu J (2019) Patient Knowledge Distil- its FPGA implementation. Sensors 23(2):824. https://doi.org/10.
lation for BERT Model Compression. arXiv. https://doi.org/10. 3390/s23020824
48550/arXiv.1908.09355 239. Ai C, Yang H, Ding Y, Tang J, Guo F (2023) Low rank matrix
222. Shi X, Peng X, He L, Zhao Y, Jin H (2023) Waterwave: factorization algorithm based on multi-graph regularization for
A GPU memory flow engine for concurrent DNN training. detecting drug-disease association. IEEE/ACM Trans Com-
IEEE Trans Comput 72(10):2938–2950. https://doi.org/10.1109/ put Biol Bioinforma 1–11. https://doi.org/10.1109/tcbb.2023.
tc.2023.3278530 3274587
223. Aguado-Puig Q, Doblas et al (2023) Wfa-GPU: gap-affine pair- 240. Shcherbakova EM, Matveev SA, Smirnov AP, Tyrtyshnikov
wise read-alignment using gpus. Bioinformatics 39(12). https:// EE (2023) Study of performance of low-rank nonnegative ten-
doi.org/10.1093/bioinformatics/btad701 sor factorization methods. Russ J Numer Anal Math Model
224. Huang H, Li Y, Zhou X (2023) Accelerating Point Clouds Clas- 38(4):231–239. https://doi.org/10.1515/rnam-2023-0018
sification in Dynamic Graph CNN with GPU Tensor Core. IEEE. 241. Kokhazadeh M, Keramidas G, Kelefouras V, Stamoulis I (2024)
https://doi.org/10.1109/icpads60453.2023.00240 Denseflex: A Low Rank Factorization Methodology for Adaptable
225. Zeng H, Wang H, Zhang B (2024) A high-performance cellular Dense Layers in DNNs. ACM. https://doi.org/10.1145/3649153.
automata model for urban expansion simulation based on con- 3649183
volution and graphic processing unit. Trans GIS 28(4):947–968. 242. Latif SA, Sidek KA, Bakar EA, Hashim AHA (2024) Online mul-
https://doi.org/10.1111/tgis.13163 timodal compression using pruning and knowledge distillation for
226. Zhuang M-H, Shih C-Y, Lin H-C, Kang A, Wang Y-P (2024) High iris recognition. J Adv Res Appl Sci Eng Technol 37(2):68–81.
Speed Signal Design on Fan-Out RDL Interposer for Artificial https://doi.org/10.37934/araset.37.2.6881
Intelligence (AI) and Deep Neural Network (DNN) Chiplet Accel- 243. Pang C, Weng X, Wu J, Wang Q, Xia G-S (2024) Hicd: Change
erators Application. IEEE. https://doi.org/10.23919/icep61562. detection in quality-varied images via hierarchical correlation dis-
2024.10535433 tillation. IEEE Trans Geosci Remote Sens 62:1–16. https://doi.
227. Nagar P, Boruah S, Bhoi AK, Patel A, Sarda J, Darjij P org/10.1109/tgrs.2024.3367778
(2024) Emerging VLSI Technologies for High performance AI 244. Cao K, Zhang T, Huang J (2024) Advanced hybrid lstm-
and ML Applications. IEEE. https://doi.org/10.1109/assic60049. transformer architecture for real-time multi-task prediction in
2024.10507954 engineering systems. Sci Reports 14(1). https://doi.org/10.1038/
228. Chae H, Zhu K, Mutnury B, Wallace et al (2024) Isop+: Machine s41598-024-55483-x
learning-assisted inverse stack-up optimization for advanced 245. Zhang T (2024) Industrial Image Anomaly Localization Method
package design. IEEE Transactions on Computer-Aided Design based on Reverse Knowledge Distillation. IEEE. https://doi.org/
of Integrated Circuits and Systems 43(1):2–15. https://doi.org/10. 10.1109/iaeac59436.2024.10503620
1109/tcad.2023.3305934 246. Zhang S, Pei Z, Ren Z (2024) Super-resolution knowledge-
229. Tian L, Sedona R, Mozaffari A, Kreshpa E, Paris C, Riedel M, distillation-based low-resolution steel defect images classifica-
Schultz MG, Cavallaro G (2023) End-to-End Process Orchestra- tion. SPIE. https://doi.org/10.1117/12.3026364
tion of Earth Observation Data Workflows with Apache Airflow 247. Yang W, Jin L, Wang S, Cu Z, Chen X, Chen L (2019) Thinning
on High Performance Computing. IEEE. https://doi.org/10.1109/ of convolutional neural network with mixed pruning. IET Image
igarss52108.2023.10283416 Proc 13(5):779–784. https://doi.org/10.1049/iet-ipr.2018.6191
230. Rajbhandari S, Rasley J, Ruwase O, He Y (2019) ZeRO: Memory 248. Tan Z, Tan S-H, Lambrechts J-H, Zhang Y, Wu Y, Ma K (2021) A
Optimizations Toward Training Trillion Parameter Models. arXiv. 400MHz NPU with 7.8TOPS2/W High-PerformanceGuaranteed
https://doi.org/10.48550/ARXIV.1910.02054 Efficiency in 55nm for Multi-Mode Pruning and Diverse Quanti-
231. Rajbhandari S, Ruwase O, Rasley J, Smith S, He Y (2021) ZeRO- zation Using Pattern-Kernel Encoding and Reconfigurable MAC
Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Units. IEEE. https://doi.org/10.1109/cicc51472.2021.9431519
Learning. Zenodo. https://doi.org/10.5281/ZENODO.5156596
A comprehensive review of model... 11843

249. Chen X, Zhu J, Jiang J, Tsui C-Y (2023) Tight compression: 267. Essien A, Giannetti C (2020) A deep learning model for smart
Compressing CNN through fine-grained pruning and weight per- manufacturing using convolutional lstm neural network autoen-
mutation for efficient implementation. IEEE Trans Comput Aided coders. IEEE Trans Industr Inf 16(9):6069–6078. https://doi.org/
Des Integr Circuits Syst 42(2):644–657. https://doi.org/10.1109/ 10.1109/tii.2020.2967556
tcad.2022.3178047 268. Nordal H, El-Thalji I (2020) Modeling a predictive maintenance
250. Dettmers T, Lewis M, Shleifer S, Zettlemoyer L (2021) 8-bit management architecture to meet industry 4.0 requirements: A
Optimizers via Block-wise Quantization. arXiv. https://doi.org/ case study. Syst Eng 24(1):34–50. https://doi.org/10.1002/sys.
10.48550/ARXIV.2110.02861 21565
251. Ren S, Zhu KQ (2023) Low-Rank Prune-And-Factorize for 269. Yan Y, Chow AHF, Ho CP, Kuo Y-H, Wu Q, Ying C (2022) Rein-
Language Model Compression. arXiv. https://doi.org/10.48550/ forcement learning for logistics and supply chain management:
ARXIV.2306.14152 Methodologies, state of the art, and future opportunities. Trans-
252. Ding Y, Chen D-R (2023) Optimization based layer-wise portation Research Part E: Logistics and Transportation Review
pruning threshold method for accelerating convolutional neu- 162:102712. https://doi.org/10.1016/j.tre.2022.102712
ral networks. Mathematics 11(15):3311. https://doi.org/10.3390/ 270. Kegenbekov Z, Jackson I (2021) Adaptive supply chain: Demand–
math11153311 supply synchronization using deep reinforcement learning. Algo-
253. Wu Y, Schuster M, Chen et al (2016) Google’s Neural Machine rithms 14(8):240. https://doi.org/10.3390/a14080240
Translation System: Bridging the Gap between Human and 271. Xu D, Lu G, Yang R, Timofte R (2020) Learned image and video
Machine Translation. arXiv. https://doi.org/10.48550/ARXIV. compression with deep neural networks. IEEE. https://doi.org/10.
1609.08144 1109/vcip49819.2020.9301828
254. Ge L, Zhang W, Liang C, He Z (2020) Compressed neural 272. Kufa J, Budac A (2023) Quality comparison of 360 degrees
network equalization based on iterative pruning algorithm for 112- 8K images compressed by conventional and deep learning algo-
gbps vcsel-enabled optical interconnects. J Lightwave Technol rithms. IEEE. https://doi.org/10.1109/radioelektronika57919.
38(6):1323–1329. https://doi.org/10.1109/jlt.2020.2973718 2023.10109066
255. Cheng Y, Wang D, Zhou P, Zhang T (2017) A Survey of Model 273. Qassim H, Verma A, Feinzimer D (2018) Compressed residual-
Compression and Acceleration for Deep Neural Networks. arXiv. VGG16 CNN model for big data places image recognition. IEEE.
https://doi.org/10.48550/arXiv.1710.09282 https://doi.org/10.1109/ccwc.2018.8301729
256. Nasution MA, Chahyati D, Fanany MI (2017) Faster R-CNN with 274. Strubell E, Ganesh A, McCallum A (2020) Energy and policy
structured sparsity learning and Ristretto for mobile environment. considerations for modern deep learning research. Proceedings
IEEE. https://doi.org/10.1109/icacsis.2017.8355051 of the AAAI Conference on Artificial Intelligence 34(09):13693–
257. Nie F, Hu Z, Wang X, Li X, Huang H (2022) Iteratively re- 13696. https://doi.org/10.1609/aaai.v34i09.7123
weighted method for sparsity-inducing norms. IEEE Trans Knowl 275. Sharma M, Kaur P (2023) An Empirical study of Gradient Com-
Data Eng 1–1. https://doi.org/10.1109/tkde.2022.3179554 pression Techniques for Federated Learning. IEEE. https://doi.
258. Flores A, Lamare RC (2017) Sparsity-aware set-membership org/10.1109/ici60088.2023.10421660
adaptive algorithms with adjustable penalties. IEEE. https://doi. 276. Baltrusaitis T, Ahuja C, Morency L-P (2019) Multimodal machine
org/10.1109/icdsp.2017.8096110 learning: A survey and taxonomy. IEEE Trans Pattern Anal
259. Gaikwad AS, El-Sharkawy M (2018) Pruning convolution neu- Mach Intell 41(2):423–443. https://doi.org/10.1109/tpami.2018.
ral network (SqueezeNet) using taylor expansion-based criterion. 2798607
IEEE. https://doi.org/10.1109/isspit.2018.8705095 277. Jain S, Gandhi A, Singla S, Garg L, Mehla S (2022) Quan-
260. Zhou Z, Zhou Y, Jiang Z, Men A, Wang H (2022) An effi- tum Machine Learning and Quantum Communication Net-
cient method for model pruning using knowledge distillation with works: The 2030s and the Future. IEEE. https://doi.org/10.1109/
few samples. In: IEEE (ed) ICASSP 2022 - 2022 IEEE Interna- iccmso58359.2022.00025
tional Conference on Acoustics, Speech and Signal Processing 278. Kuppusamy P, Yaswanth Kumar N, Dontireddy J, Iwendi C (2022)
(ICASSP). https://doi.org/10.1109/icassp43922.2022.9746024 Quantum Computing and Quantum Machine Learning Classifi-
261. Hartmann D, Herz M, Wever U (2018) Model Order Reduction a cation – A Survey. IEEE. https://doi.org/10.1109/icccmla56841.
Key Technology for Digital Twins, pp 167–179. Springer Interna- 2022.9989137
tional Publishing. https://doi.org/10.1007/978-3-319-75319-5_8 279. Sujatha D, Raj.TF M, Ramesh G, Agoramoorthy M, S AA (2024)
262. Segovia M, Garcia-Alfaro J (2022) Design, modeling and imple- Neural Networks-Based Predictive Models for Self-Healing in
mentation of digital twins. Sensors 22(14):5396. https://doi.org/ Cloud Computing Environments. IEEE. https://doi.org/10.1109/
10.3390/s22145396 iitcee59897.2024.10467499
263. Raissi M, Perdikaris P, Karniadakis GE (2019) Physics-informed 280. Schneider C, Barker A, Dobson S (2014) A survey of self-healing
neural networks: A deep learning framework for solving forward systems frameworks. Wiley. https://doi.org/10.1002/spe.2250
and inverse problems involving nonlinear partial differential equa- 281. Hoffmann F, Bertram T, Mikut R, Reischl M, Nelles O (2019)
tions. J Comput Phys 378:686–707. https://doi.org/10.1016/j.jcp. Benchmarking in classification and regression. WIREs Data Min
2018.10.045 Knowl Disc 9(5). https://doi.org/10.1002/widm.1318
264. Anagnostopoulos SJ, Toscano JD, Stergiopulos N, Karniadakis 282. Ahmad R, Alsmadi I, Alhamdani W, Tawalbeh L (2022) A
GE (2024) Residual-based attention in physics-informed neural comprehensive deep learning benchmark for IoT IDS. Comput-
networks. Computer Methods in Applied Mechanics and Engi- ers & Security 114:102588. https://doi.org/10.1016/j.cose.2021.
neering 421:116805. https://doi.org/10.1016/j.cma.2024.116805 102588
265. Jieyang P, Kimmig A, Dongkun W, Niu Z, Zhi et al (2022) A 283. Sarridis I, Koutlis C, Kordopatis-Zilos G, Kompatsiaris I,
systematic review of data-driven approaches to fault diagnosis and Papadopoulos S (2022) InDistill: Information flow-preserving
early warning. J IntelManuf 34(8):3277–3304. https://doi.org/10. knowledge distillation for model compression. arXiv. https://doi.
1007/s10845-022-02020-0 org/10.48550/arXiv.2205.10003
266. Iunusova E, Gonzalez MK, Szipka K, Archenti A (2023) Early 284. Wu S, Chen H, Quan X, Wang Q, Wang R (2023) AD-KD:
fault diagnosis in rolling element bearings: comparative analysis Attribution-Driven Knowledge Distillation for Language Model
of a knowledge-based and a data-driven approach. J Intell Manuf Compression. arXiv. https://doi.org/10.48550/arXiv.2305.10010
35(5):2327–2347. https://doi.org/10.1007/s10845-023-02151-y
11844 P. V. D antas et al.

285. Mao H, Han S, Pool J, Li W, Liu X et al (2017) Exploring the 291. Ni Q, Ji JC, Feng K, Zhang Y, Lin D, Zheng J (2024) Data-
Regularity of Sparse Structure in Convolutional Neural Networks. driven bearing health management using a novel multi-scale fused
arXiv. https://doi.org/10.48550/arXiv.1705.08922 feature and gated recurrent unit. Reliability Engineering & System
286. S B, Syed MH, More NS, Polepally V (2023) Deep learning- Safety 242:109753. https://doi.org/10.1016/j.ress.2023.109753
based power prediction aware charge scheduling approach in 292. Qi Q, Tao F, Hu T, Anwer N, Liu A, Wei Y, Wang L, Nee AYC
cloud based electric vehicular network. Eng Appl Artif Intel (2021) Enabling technologies and tools for digital twin. J Manuf
121:105869. https://doi.org/10.1016/j.engappai.2023.105869 Syst 58:3–21. https://doi.org/10.1016/j.jmsy.2019.10.001
287. Paszke et al (2019) PyTorch: An Imperative Style, High- 293. Horvath S, Laskaridis S, Rajput S, Wang H (2023) Maestro:
Performance Deep Learning Library. arXiv. https://doi.org/10. Uncovering Low-Rank Structures via Trainable Decomposition.
48550/arXiv.1912.01703 arXiv. https://doi.org/10.48550/arXiv.2308.14929
288. Xu C, Zhou W, Ge T, Xu K, McAuley J, Wei F (2021) Beyond
Preserved Accuracy: Evaluating Loyalty and Robustness of BERT
Compression. arXiv. https://doi.org/10.48550/arXiv.2109.03228
Publisher’s Note Springer Nature remains neutral with regard to juris-
289. Hinton G, Vinyals O, Dean J (2015) Distilling the Knowledge
dictional claims in published maps and institutional affiliations.
in a Neural Network. arXiv. https://doi.org/10.48550/arXiv.1503.
02531
290. Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable
are features in deep neural networks? https://doi.org/10.48550/
arXiv.1411.1792

Authors and Aﬃliations

Pierre Vilar Dantas1 · Waldir Sabino da Silva Jr1 · Lucas Carvalho Cordeiro2 · Celso Barbosa Carvalho1

B Pierre Vilar Dantas 1 Center for R&D in Electronic and Information Technology
[email protected] (CETELI) and Department of Electronics and Computing
B Lucas Carvalho Cordeiro
(DTEC), Federal University of Amazonas (UFAM), Av.
General Rodrigo Octavio, 1200, Manaus 69067-005,
[email protected]
Amazonas, Brazil
Waldir Sabino da Silva Jr 2 The University of Manchester, Oxford Rd, M13 9PL
[email protected]
Manchester, UK
Celso Barbosa Carvalho
[email protected]

Deep Learning Notes
100% (1)
Deep Learning Notes
71 pages
Analytics Compendium
No ratings yet
Analytics Compendium
41 pages
ModelCompressionTechniquesinDeepLearning
No ratings yet
ModelCompressionTechniquesinDeepLearning
23 pages
A Comprehensive Survey On Model Compression and Acceleration
No ratings yet
A Comprehensive Survey On Model Compression and Acceleration
43 pages
Model Compression For Deep Neural Networks - A Survery 1689444018366
No ratings yet
Model Compression For Deep Neural Networks - A Survery 1689444018366
22 pages
Model Compression Is The Big ML Flavour of 2021
No ratings yet
Model Compression Is The Big ML Flavour of 2021
4 pages
Pattern Recognition Unit - 5
No ratings yet
Pattern Recognition Unit - 5
16 pages
Compression Survey Hal
No ratings yet
Compression Survey Hal
26 pages
Book
No ratings yet
Book
199 pages
Unit I
No ratings yet
Unit I
10 pages
The Little Book of Deep Learning François Fleuret download pdf
100% (3)
The Little Book of Deep Learning François Fleuret download pdf
55 pages
Deep Learning in Data Science Theoretical Foundati
No ratings yet
Deep Learning in Data Science Theoretical Foundati
6 pages
Deep Learning Module-01
No ratings yet
Deep Learning Module-01
17 pages
Deep Learning Module-01 Search Creators
No ratings yet
Deep Learning Module-01 Search Creators
17 pages
DEEP LEARNING
No ratings yet
DEEP LEARNING
22 pages
Model Compression and Efficient Inference For Large Language Models: A Survey
No ratings yet
Model Compression and Efficient Inference For Large Language Models: A Survey
47 pages
CIKM
No ratings yet
CIKM
173 pages
Unit-3
No ratings yet
Unit-3
16 pages
Efficient_Deep_Learning_in_Network_Compression_and
No ratings yet
Efficient_Deep_Learning_in_Network_Compression_and
21 pages
模型压缩综述
No ratings yet
模型压缩综述
72 pages
NNQuant4
No ratings yet
NNQuant4
20 pages
The Modern Mathematics of Deep Learning
No ratings yet
The Modern Mathematics of Deep Learning
78 pages
LBDL
No ratings yet
LBDL
142 pages
Paper 4
No ratings yet
Paper 4
3 pages
Deep Learning - Fundamentals, Theory and Applications 2019 PDF
100% (9)
Deep Learning - Fundamentals, Theory and Applications 2019 PDF
168 pages
2006.03669v2
No ratings yet
2006.03669v2
73 pages
The Little Book of Deep Learning François Fleuret download
100% (1)
The Little Book of Deep Learning François Fleuret download
55 pages
The - Little - Book - of - Deep Learning
No ratings yet
The - Little - Book - of - Deep Learning
140 pages
Efficient Deep Learning (First Early Release) (Gaurav Menghani Naresh Singh) (Z-Library)
No ratings yet
Efficient Deep Learning (First Early Release) (Gaurav Menghani Naresh Singh) (Z-Library)
69 pages
DeepLearning - 1NT22CS078 - I Shania Jone
No ratings yet
DeepLearning - 1NT22CS078 - I Shania Jone
4 pages
A Survey of Model Compression and Acceleration For Deep Neural Networks
No ratings yet
A Survey of Model Compression and Acceleration For Deep Neural Networks
10 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
140 pages
Deep Learning Applications,: M. Arif Wani Taghi M. Khoshgoftaar Vasile Palade Editors
No ratings yet
Deep Learning Applications,: M. Arif Wani Taghi M. Khoshgoftaar Vasile Palade Editors
307 pages
Deep Learning Note 21cs743
No ratings yet
Deep Learning Note 21cs743
96 pages
The Little Book of Deep Learning
100% (1)
The Little Book of Deep Learning
140 pages
1 - A Day in The Life of ChatGPT As A Researcher
No ratings yet
1 - A Day in The Life of ChatGPT As A Researcher
20 pages
Riemannian Low-Rank Model Compression for Federated Learning With Over-The-Air Aggregation
No ratings yet
Riemannian Low-Rank Model Compression for Federated Learning With Over-The-Air Aggregation
16 pages
Deep Learning Applications
No ratings yet
Deep Learning Applications
309 pages
cq02_vdthanh_ass3
No ratings yet
cq02_vdthanh_ass3
20 pages
LBDL
No ratings yet
LBDL
185 pages
50 Breakthrough AI Concepts in 500 Words Each: In 500 words, #17
From Everand
50 Breakthrough AI Concepts in 500 Words Each: In 500 words, #17
Nietsnie Trebla
No ratings yet
Machine Learning and Visual Perception 9783110595567 9783110595536
100% (1)
Machine Learning and Visual Perception 9783110595567 9783110595536
221 pages
UNIT I part 1 notes
No ratings yet
UNIT I part 1 notes
28 pages
Deep Learning: Nicholas G. Polson Vadim O. Sokolov
No ratings yet
Deep Learning: Nicholas G. Polson Vadim O. Sokolov
18 pages
Binary Neural Networks
No ratings yet
Binary Neural Networks
218 pages
Binary Neural Networks Algorithms Architectures And Applications Baochang Zhang pdf download
No ratings yet
Binary Neural Networks Algorithms Architectures And Applications Baochang Zhang pdf download
91 pages
Deep Learning Review and Discussion of Its Future
No ratings yet
Deep Learning Review and Discussion of Its Future
7 pages
The Little Book of Deep Learning François Fleuret - The full ebook with complete content is ready for download
100% (4)
The Little Book of Deep Learning François Fleuret - The full ebook with complete content is ready for download
77 pages
Little Book of Deep Learning
100% (1)
Little Book of Deep Learning
158 pages
LBDL
No ratings yet
LBDL
156 pages
TF Estimators KDD Paper
No ratings yet
TF Estimators KDD Paper
9 pages
Lec 1
No ratings yet
Lec 1
30 pages
Beyond Silicon
From Everand
Beyond Silicon
Piyush yadav
5/5 (1)
Deep Learning Applications In Image Analysis Sanjiban Sekhar Roy download
No ratings yet
Deep Learning Applications In Image Analysis Sanjiban Sekhar Roy download
80 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
143 pages
ML3
No ratings yet
ML3
7 pages
Deep Learning
No ratings yet
Deep Learning
5 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
168 pages
ANN Unit 3 Answers
No ratings yet
ANN Unit 3 Answers
12 pages
On The Complexity of Neural Network Classifiers: A Comparison Between Shallow and Deep Architectures
No ratings yet
On The Complexity of Neural Network Classifiers: A Comparison Between Shallow and Deep Architectures
13 pages
Oversampling Techniques Deep Belief Network DenseNets DNN
No ratings yet
Oversampling Techniques Deep Belief Network DenseNets DNN
23 pages
DY PYQ Math Linear Algebra 3
No ratings yet
DY PYQ Math Linear Algebra 3
2 pages
RNN Part1
No ratings yet
RNN Part1
12 pages
Predictive Accuracy: A Misleading Performance Measure For Highly Imbalanced Data
No ratings yet
Predictive Accuracy: A Misleading Performance Measure For Highly Imbalanced Data
12 pages
Question Bank: Aksheyaa College of Enginnering Chennai Department of B.E (Ece)
No ratings yet
Question Bank: Aksheyaa College of Enginnering Chennai Department of B.E (Ece)
27 pages
Automatic Detection of Pneumonia On Compressed Sensing Images Using Deep Learning
No ratings yet
Automatic Detection of Pneumonia On Compressed Sensing Images Using Deep Learning
4 pages
319 - MECE-001 - ENG D18 - Compressed PDF
No ratings yet
319 - MECE-001 - ENG D18 - Compressed PDF
3 pages
1.3.1 Logic Gates Workbook
No ratings yet
1.3.1 Logic Gates Workbook
44 pages
SPPUML5
No ratings yet
SPPUML5
4 pages
AI ML RL GenAI
No ratings yet
AI ML RL GenAI
37 pages
IDSUP MID SEM EXAM-2023
No ratings yet
IDSUP MID SEM EXAM-2023
2 pages
A Step by Step Perceptron Example
100% (1)
A Step by Step Perceptron Example
5 pages
Entropy: The Entropy Provides A Relation Between Heat and Temperature
No ratings yet
Entropy: The Entropy Provides A Relation Between Heat and Temperature
10 pages
A Mean-Risk Index Model For Uncertain Capital Budgeting
No ratings yet
A Mean-Risk Index Model For Uncertain Capital Budgeting
10 pages
Transmission Lin
No ratings yet
Transmission Lin
41 pages
Information Theory, IT Entropy Mutual Information Use in NLP
No ratings yet
Information Theory, IT Entropy Mutual Information Use in NLP
23 pages
Capitulo 1 Big data uc3m
No ratings yet
Capitulo 1 Big data uc3m
10 pages
Summary of 8 of Spe S Comparative Studies Cases
No ratings yet
Summary of 8 of Spe S Comparative Studies Cases
2 pages
Solution
No ratings yet
Solution
8 pages
Math Unit 3 M2
No ratings yet
Math Unit 3 M2
59 pages
Passenger Flow Prediction in Bus Transportation System Using Deep Learning
No ratings yet
Passenger Flow Prediction in Bus Transportation System Using Deep Learning
24 pages
Correlated Attention Based Transformer for Multivariate Time Series
No ratings yet
Correlated Attention Based Transformer for Multivariate Time Series
15 pages
Private-Key Cryptography
No ratings yet
Private-Key Cryptography
30 pages
Lecture16 CE72.12FEM - Meshing
No ratings yet
Lecture16 CE72.12FEM - Meshing
26 pages
LECTURE 1 - Algorithms Basics
No ratings yet
LECTURE 1 - Algorithms Basics
10 pages
2008 SAR ADC Algorithm With Redundancy
No ratings yet
2008 SAR ADC Algorithm With Redundancy
4 pages
Rahil Resume 1
No ratings yet
Rahil Resume 1
1 page
5-Modified Booths Algorithm-05-02-2024
No ratings yet
5-Modified Booths Algorithm-05-02-2024
23 pages
Presentation Mathematical Economics
No ratings yet
Presentation Mathematical Economics
15 pages
Acknowledgement of Submission of Contributed Paper 51 For MED 2023
No ratings yet
Acknowledgement of Submission of Contributed Paper 51 For MED 2023
5 pages