


Model Compression Techniques in Deep Neural
Networks⋆

Mubarek Mohammed Yesuf¹ [0000-0002-0145-1810] and Beakal Gizachew Assefa² [0000-0003-3065-3881]

¹ Information Network Security Agency and Addis Ababa Institute of Technology,
AAU, Addis Ababa, Ethiopia
[email protected], [email protected]
² Addis Ababa Institute of Technology, AAU, Addis Ababa, Ethiopia
[email protected]

⋆ Supported by the Ethiopian Artificial Intelligence Institute.

Abstract. With the current set-up, the success of Deep Neural Network models is highly tied to their size. Although this property might help them improve their performance, it makes them difficult to train, to deploy on resource-constrained machines, and to iterate on in experiments. There is also a growing concern about their environmental and economic impacts. Model compression is a set of techniques applied to reduce the size of models while maintaining performance. This survey presents state-of-the-art model compression methods, discusses how they are hybridized, highlights the changes they could cause beyond size reduction, and puts forward open research problems for further study.

Keywords: Model compression · Pruning · Knowledge Distillation · Quantization · Neural Networks · Deep Learning · Artificial Intelligence

1 Introduction

Breakthrough advances in AI are consequences of Neural Networks, the foundation of Deep Learning (DL), a family of Machine Learning (ML) algorithms behind successful Artificial Intelligence (AI) innovation in tasks like voice recognition, image classification [47], and human language understanding [19]. DL algorithms rest on the Universal Approximation Theorem [38], which guarantees that Neural Networks, arranged in some way, can compute any continuous function, and on the hierarchical representation of information inspired by the human brain [6]. On the other hand, the exact number of parameters a specific DL model needs for a certain problem is still a hyper-parameter, a choice of the designer. These ideas stimulated over-parameterization in an attempt to increase a model's representation capacity, and thus its performance, as the tasks models are applied to grow in complexity [8]. How complex a model is, is commonly measured by its total number of parameters, which indicates its memory consumption, and by floating-point operations per second (FLOPs), which is useful for computations that require floating-point calculations.
Most benchmark models are related to Computer Vision and Natural Language Processing, sub-fields of AI relating to human vision and language respectively. Computer-vision models are mostly based on Convolutional Neural Networks (CNNs), a type of DL architecture designed for spatial data such as images. Prominent examples are AlexNet [47] and VGG16 [73], variants of CNNs which won the ImageNet competition [47], a dataset with more than 14 million well-labeled images intended to serve as a benchmark in computer vision applications. Natural-language models have recently grown so large that they are now referred to as Large Language Models. State-of-the-art and benchmark architectures for language models are based on the transformer [79]; prominent examples are BERT, GPT, BART, and DistilBERT. Figure 1 below shows the evolution of some benchmark computer vision models and their computational burdens expressed in Giga FLoating-point OPerations (GFLOPs), along with their Top-1 (left) and Top-2 accuracy.

Fig. 1. Size of benchmark vision and language models over the past years [7].

However, there is a trade-off: large models perform well at the cost of increased complexity. Challenges of such big models include longer training, inference, and iteration time, difficulty deploying on resource-constrained and edge machines, and economic and environmental impact [66]. This raises the question of whether there is a way to work around the trade-off. This is where Model Compression comes in: the idea of making models small without losing performance. A review of model compression methods for deep learning is presented here.

Model Compression is a set of techniques [11] for reducing the size of large models without a notable performance loss. Its necessity is increasing in parallel with the advancement and complexity of models. Even though there are recent attempts to train smaller networks from the beginning [22], Model Compression is mostly applied after training a bigger model, as the model needs far fewer parameters for inference than for learning. Model Compression methods in the literature can be classified into four major parts: Pruning, Knowledge Distillation (KD), Quantization, and other methods. Pruning removes unwanted structures from a trained network. KD is a mechanism for transferring the knowledge of a bigger network onto a smaller one. Quantization reduces the number of bits required to represent parts of a model; it is similar to approximation. The remaining Model Compression methods, other than Pruning, KD, and Quantization, can be organized in one section and are referred to as other methods. Weight decomposition methods deserve their own section, but this arrangement makes the paper simpler for the reader.
The contributions of the paper are as follows:

– Conduct a review of deep learning Model Compression methods.
– Systematically compare and discuss each method along with its consequences beyond size reduction, their combinations, and selection considerations.
– Discuss open research problems for further study on the topic of Deep Neural Network Model Compression.

The rest of the paper is organized as follows. The next section, section two, reviews prominent works in Model Compression. Section three covers Hybrid Model Compression techniques, which combine existing methods for a better outcome. The effects of Model Compression beyond size reduction are discussed in section four. Discussions, factors affecting the choice of methods, and open research problems are presented in sections five, six, and seven respectively. Finally, a conclusion is presented.

Fig. 2. The taxonomy and categories of Deep Learning model compression methods: Pruning, Knowledge Distillation, Quantization, other methods, and Hybrid Model Compression.

2 Model Compression Methods

2.1 Generic paradigms

Throughout the development of Model Compression methods, certain key characteristics keep reappearing. These characteristics are generic and can be organized as paradigms of Model Compression. Understanding them gives a general, method-agnostic view of the approaches in Model Compression, assists in selection, and stimulates research questions. Unlike past works, here they are organized in one standalone subsection. Not all of them apply to every compression method, and several can be found in one method at the same time. They form important trade-offs wherever they appear and extend the default approach, which is usually static, deterministic, and post-training.

– Random or Stochastic. Sometimes certain aspects (parameters, hyper-parameters, etc.) of a model are selected at random. The stochasticity brings benefits analogous to those it brings to Stochastic Gradient Descent: it enables exploration in the space. In fact, random selection can sometimes be used as a benchmark for comparison against the other methods [9].
– Data-free, zero-shot, one-shot, or few-shot. The data the original models were trained on might not always be available, or it might be expensive to work with. But some other information, such as metadata, might be available. In such cases, data-free or data- and iteration-frugal paradigms are used [76], [15].
– Dynamic. The dynamic approach to model compression decides different parts of the process dynamically [49].
– Online. Compression of the model is done while it is being trained [41].
– AutoML. Instead of adjusting hyper-parameters by hand, this paradigm encourages using AutoML methods, where certain parts of the compression process are decided automatically by an external algorithm [30], [36].
– Differentiable. This paradigm turns hyper-parameters introduced by the compression method into learnable parameters and learns them during the process [43].

2.2 Pruning

The oldest, most studied, and most intuitive method of reducing model size is pruning. It literally removes a weight or a node from a network based on how important it is. The vanilla pruning pipeline has three main stages after initialization: training, pruning, and retraining (fine-tuning). Hyperparameters introduced by Pruning include the compression rate, the magnitude threshold when magnitude-based pruning is used, the type of saliency metric to use, etc. The approach taken towards these hyperparameters categorizes the pruning methods in the literature.
The earliest methods were mainly: saliency based, where a node or weight is checked for importance; penalty based [48], [57], where a penalty term is added to the loss (objective) function to induce weight decay [70]; and magnitude based, which simply removes the smallest link (weight) in each iteration. Prominent early works are Optimal Brain Damage (OBD) [48], Optimal Brain Surgeon [35], and Skeletonization [57].
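To make the magnitude-based variant concrete, the sketch below applies global, unstructured magnitude pruning to a PyTorch model; the sparsity level, the choice of prunable layer types, and the masking strategy are illustrative assumptions rather than the recipe of any cited work.

```python
# Minimal sketch of unstructured, global magnitude pruning (illustrative, PyTorch).
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float = 0.5) -> nn.Module:
    """Zero out the smallest-magnitude weights across all Linear/Conv2d layers."""
    prunable = [m for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    # A single global threshold determined by the requested sparsity.
    all_weights = torch.cat([m.weight.detach().abs().flatten() for m in prunable])
    threshold = torch.quantile(all_weights, sparsity)
    with torch.no_grad():
        for m in prunable:
            mask = (m.weight.abs() > threshold).to(m.weight.dtype)
            m.weight.mul_(mask)  # pruned weights are simply set to zero
    return model
```

In the iterative setting, a step like this alternates with fine-tuning until the target compression rate is reached.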
An intuitive direction of development for Pruning is determining what is important. There have been attempts to identify important structures with statistical methods, and workarounds that follow recent advances in Explainability, a notion trying to understand and explain the 'decisions' of a model [87].

Fig. 3. Iterative Pruning (left) [33] and Pruning (right) [33].

Modern approaches vary not just in the saliency metrics or other criteria they use, but also in the overall pruning pipeline. The 'lottery ticket hypothesis' [22], whose stronger version was supported in [55], states that a randomly initialized network contains a small subnetwork (the "winning ticket") that, when trained separately, can compete with the performance of the original model. A rather bold finding by [68] states that there is a randomly initialized subnetwork that, even without training, can perform as well as the original. The recent work in [84], termed pruning from scratch, questions the need for training the network and proposes a different pipeline: random initialization, pruning, and then training. The work in [50] raises the question of why train a big model to convergence when it is possible to attain better performance by training it for a few iterations and then pruning it.
Rather than manually designing the hyperparameters introduced by Pruning, Reinforcement Learning (RL) based methods are emerging that formulate the problem of removing a node or a weight as a Markov Decision Process [30]. Neural Architecture Search based methods formulate Pruning as a search in the space of sub-networks, an emerging area in the study of neural networks [21].

2.3 Knowledge Distillation


Knowledge Distillation is a relatively new compression procedure originally introduced in [11] and reinvented in [37]. A bigger, well-performing model, termed the teacher model, provides labels for a smaller network to train with. For each observation x and its label y in the dataset the original network was trained on, the smaller network uses the predictions of the bigger network as a label for x instead of just y. The predictions of the teacher model are termed soft labels, since they are not ones and zeros like hard labels; these soft labels are also termed the Dark Knowledge. The intuition is that an image of a dog has a larger chance of being mistaken for a cat than for a car.

Fig. 4. Knowledge Distillation (KD) [20].

KD uses a constant value named the Temperature, T, which magnifies the small probability values in the prediction (the soft labels). The output for the i-th class, with logit z_i and temperature T, is computed as:

p(z_i, T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}    (1)

Hyper-parameters introduced by KD include the architecture of the student network, the temperature T, and the distillation loss, which measures the divergence between the student network and the teacher network [37]. Different approaches and improvements around these hyper-parameters create variants of KD. There can be improvements in the loss [77], [92], and one can also add an assistant network, smaller than the teacher but larger than the student, to help the small student network [56].
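As a concrete illustration of these hyper-parameters, the following sketch combines a temperature-softened divergence term with the ordinary cross-entropy on hard labels; it assumes PyTorch, and the weighting factor alpha and the temperature T are illustrative values, not ones prescribed by [37].

```python
# Minimal sketch of a response-based distillation loss (cf. Eq. 1), in PyTorch.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft labels: the teacher's predictions softened by the temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # Divergence between the softened distributions; T*T rescales the gradients.
    kd_term = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * T * T
    # Ordinary cross-entropy on the hard labels y.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```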
A powerful adaptation of KD, FitNets [71], trains a student that is deeper and thinner than the teacher, unlike [37], using the intermediate representations (features) learned by the teacher as hints to improve the training process and the final generalization capacity of the student. This work opened a whole new sub-category of KD termed Feature Distillation, leading the original approach to be called response-based distillation [28]. Response-based distillation is also limited to supervised learning, as the logits, i.e. the soft labels, are probability distributions. Relationship distillation extends feature distillation by using outputs of specific layers in the teacher model and exploring the relationships between different layers or data samples [65].
At the core level, KD is a knowledge transfer mechanism [54]. It is therefore extremely versatile and model agnostic: any model can be used as teacher or student. Thus, it has uses beyond compression, including faster optimization [88], defence against adversarial attacks [62], explainable networks [42], [63], and even improving the original network itself [24].

2.4 Quantization
Quantization, in simple terms discretization, is a mapping from a space of continuous values to discrete values; a good example is rounding a number. The concept is old and has been used in various settings in information theory and other sciences, but in the NN setting it is about rounding the parameters of the network [26] and reducing the number of bits required for their representation, for example from 32-bit floating-point representations down to limited precision, commonly 8 bits or lower. Its foundation is rooted in how the human brain stores and encodes information in a discrete way [25].
In the literature, the variants of quantization differ according to which parameters to quantize, how to quantize, whether to do it at training time or after training, and other factors that correspond to the variables S and Z in the quantization equation. S itself is determined by the boundary values of the clipping range. Depending on whether the clipping range is calculated dynamically for each input or fixed statically, quantization can be dynamic or static. The work in [41] made sure only 8-bit integers are used in the network at inference time.
Normally, what is known as the clipping range, a range the individual values are made to lie in, is determined beforehand. The common formula is given by:
Q(r) = \mathrm{Int}\left(\frac{r}{S}\right) - Z    (2)
where Q is the quantization operator, r is a real-valued input (an activation or a weight), S is a real-valued scaling factor, and Z is an integer zero point. These variables are part of the hyper-parameters introduced by Quantization. Int is the function that maps its input to an integer, for example by rounding. An approximation of the original value can be recovered with the reverse operation, dequantization.
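The sketch below instantiates Eq. (2) and its inverse for a signed 8-bit representation; deriving S and Z from a pre-determined clipping range [r_min, r_max] as shown is one common convention and is given here only as an assumption.

```python
# Minimal sketch of uniform quantization (Eq. 2) and dequantization, in PyTorch.
import torch

def calibrate(r_min: float, r_max: float, num_bits: int = 8):
    """Derive the scale S and zero point Z from a pre-determined clipping range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    S = (r_max - r_min) / (qmax - qmin)
    Z = int(round(r_min / S)) - qmin  # integer zero point
    return S, Z

def quantize(r: torch.Tensor, S: float, Z: int, num_bits: int = 8) -> torch.Tensor:
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return torch.clamp(torch.round(r / S) - Z, qmin, qmax).to(torch.int8)

def dequantize(q: torch.Tensor, S: float, Z: int) -> torch.Tensor:
    # Recovers an approximation of r, up to rounding and clipping error.
    return S * (q.to(torch.float32) + Z)
```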
The basic method of quantization is to train a model under a normal setup and then quantize each parameter, activation, or even input, using some sort of function or mapping, sometimes referred to as a quantization operator [25]. When fine-tuning is needed after quantization, Quantization Aware Training (QAT) is used, where re-training is done with floating-point propagation but the weights are quantized back after each update. Post Training Quantization (PTQ) is quantization without retraining [5], [12]; it is preferred when QAT is too complex to implement.
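The core step of QAT is often realized as "fake quantization", sketched below under the uniform scheme of Eq. (2); this is an illustrative pattern, not the exact procedure of [41].

```python
# Minimal sketch of a fake-quantization step used during Quantization Aware Training.
import torch

def fake_quantize(w: torch.Tensor, S: float, Z: int, num_bits: int = 8) -> torch.Tensor:
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = torch.clamp(torch.round(w / S) - Z, qmin, qmax)   # quantize (Eq. 2) ...
    w_hat = S * (q + Z)                                    # ... and dequantize immediately
    # Straight-through trick: the forward pass uses w_hat, while the backward pass
    # treats the operation as identity, so gradients still reach the float weights w.
    return w + (w_hat - w).detach()
```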
The extreme form of Quantization is Binarization, making the parameters of a network binary [17] for memory and compute efficiency. In fact, the most appealing feature of this method, given it works properly, is that it might even completely remove the need for multiplication at inference time, as in [18]. This is made possible by replacing the dot product in conventional neural networks with bitwise operators [17] to train networks with binary weights. In [69], the researchers demonstrated CNNs with only binary weights, including a binary quantized version of AlexNet [47] that is 32 times smaller, with accuracy comparable to the original. Beyond binarization, an alternative is a ternary network, where the weights are limited to three values instead of two. Neither binary nor ternary weights are directly trainable with back-propagation and gradient descent, since the quantization function has zero gradient almost everywhere; in practice, approximations such as the straight-through estimator are used [17].
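A sketch of how binary weights can nonetheless be trained is shown below; it uses a straight-through estimator in the spirit of [17], [18], with the gradient clipping window chosen as an illustrative assumption.

```python
# Minimal sketch of weight binarization with a straight-through estimator (STE).
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)  # forward pass uses binary weights (0 only where w == 0)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Pass the gradient through where |w| <= 1, block it elsewhere.
        return grad_output * (w.abs() <= 1).to(grad_output.dtype)

# Usage inside a layer's forward pass (the weight stays full precision between updates):
#   binary_w = BinarizeSTE.apply(self.weight)
#   y = torch.nn.functional.linear(x, binary_w, self.bias)
```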
Recent advances introduce Differentiable Quantization in order to learn the hyper-parameters of Quantization [91], as well as mixed-precision representations [80].

2.5 Summary of Model Compression against generic paradigms

Once the basic Model Compression methods and the generic paradigms are established, they can be summarized as in the table below.

Table 1. Model Compression methods against the generic paradigms.

Paradigm                               Pruning       Quantization   KD
Random or Stochastic                   [76]          [31]           [53], [52]
Data-free, zero-, one-, or few-shot    [76], [15]    [34]           [53], [52], [59]
Dynamic                                [83]          [91]           [49]
Online                                 [48]          [41]           [93]
AutoML (RL, NAS)                       [30], [36]    [80]           [2]
Differentiable                         [43]          [91]           [29]

2.6 Other Compression Methods

In general, the goal of compression is intuitive: reduce size while maintaining accuracy. Thus, people have tried whatever they think would work to achieve that goal. Listed in this section are methods that are common, like Tensor Decomposition, and rare, like information-theoretic approaches.
Tensor decomposition is a popular way to reduce model size. A tensor is a higher-dimensional generalization of a matrix, and tensor decomposition is a scaled-up version of the common matrix decomposition in linear algebra; matrix decomposition methods like Singular Value Decomposition and Principal Component Analysis have their analogous versions. To compress, these methods arrange the weights of a model into a tensor. Tensor decomposition, the core of these methods, is an existing approach in other fields, and common decompositions such as Canonical Polyadic, Tensor Train, and Tucker are applied to compress models. The works in [60], [13], and [46] are iconic examples of these methods. They are almost always applied to Multi-Layer Perceptrons or to the fully-connected layers of CNNs. The large-scale matrix manipulation involved is their main disadvantage, and they tend to be applied to specific architectures.
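As a small illustration of the matrix case, the sketch below replaces a fully-connected layer with a truncated-SVD factorization; the rank is an illustrative hyper-parameter, and the code does not reproduce the Canonical Polyadic, Tensor Train, or Tucker variants of [60], [13], [46].

```python
# Minimal sketch of low-rank compression of a fully-connected layer via truncated SVD.
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    # W (out x in) is approximated as U_r diag(S_r) V_r^T and split into two layers.
    U, S, Vh = torch.linalg.svd(layer.weight.detach(), full_matrices=False)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = (torch.diag(S[:rank]) @ Vh[:rank]).contiguous()
    second.weight.data = U[:, :rank].contiguous()
    if layer.bias is not None:
        second.bias.data = layer.bias.detach().clone()
    return nn.Sequential(first, second)  # parameters: rank*(in + out) vs. in*out
```

The factorization reduces the parameter count from in × out to rank × (in + out), at the cost of an approximation error controlled by the discarded singular values.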
Weight sharing approaches are, in essence, ways of making CNN variants efficient. The works in [75], Inception [90], MobileNet [39], and SqueezeNet [40] are examples of these methods. One of the most influential works [40] achieved AlexNet [47] level accuracy with 50 times fewer parameters on the ImageNet dataset [47]. This family of compression methods consists more of careful design choices that turned out to be efficient.
A unique method presented in [85] used the then state-of-the-art lossless video encoding and compression technique to compress a model. It is an information-theoretic approach that formulates the model compression problem as an information source coding one. It used Context-Adaptive Binary Arithmetic Coding (CABAC), from the H.264/AVC video coding standard, to encode the quantized weights of the neural network, but the encoding takes a long time.

3 Hybrid Model Compression Methods


Fortunately, Model Compression methods are not disjoint: they can be combined. The resulting method is more complex than the individual ones, but in cases where that is not an issue, combining methods is beneficial. In fact, a symbiotic relationship, where the problems of one are addressed by the other, can be achieved. This section highlights prominent works that combine compression methods.

3.1 Pruning and Quantization


Pruning and Quantization are the most closely related pair, especially unstructured Pruning, which simply sets a parameter to zero. The most prominent work, not just in the hybrid paradigm but in the whole area of Model Compression, is the Deep Compression framework [32], where iterative Pruning, Quantization, and Encoding are applied one after the other. Its Pruning part is the same as [33].
The intuitive relationship between Pruning and Quantization is backed by recent advances that study differentiable hyperparameters in Quantization as a means to unify the two [82], [78].

Fig. 5. An example of hybrid Compression is the Deep Compression framework [32].

3.2 Pruning and Knowledge Distillation

There is also a positive relationship between Pruning and Knowledge Distillation, recognized relatively early in [45], where Pruning is applied to distilled student networks on the task of Neural Machine Translation. Recent works are more generic than applied. The work in [64] showed that, in the context of KD, a student can learn better from a pruned teacher, which serves as a better regularizer than an unpruned one. The converse has also been shown in [14], where the authors argue that the performance of a pruned network is better restored when it is fine-tuned with KD than with vanilla training. A good example of a symbiotic relationship between the two is [1], where unprunable parameters are compressed with KD and potential redundancy is addressed by Pruning [1].

3.3 Quantization and Knowledge Distillation

An early major work combining Quantization and Knowledge Distillation is the application of Quantization to the choice of a student model for KD [67], where the knowledge of a Quantized teacher is transferred to a Quantized student. In this setting, KD helps Quantization restore performance. Conversely, Quantization also benefits KD, as in [10], where it has been used to enable on-device KD by introducing noise that mimics the existence of a bigger teacher model. This addresses the problem of a missing teacher on edge devices for vanilla KD and might even call into question the need for the teacher in the first place. Their bond is strengthened by Quantization Aware KD [44], which aims to solve the problem of performance degradation in Quantized distillation [67].
As can be seen in the table, the methods can benefit each other when they are combined, but this comes at a cost: the number of hyper-parameters that have to be decided grows exponentially.
Table 2. Summary of Hybrid Model Compression.

Hybrid Compression                        Remark
Pruning and Quantization                  Applied together almost everywhere [32]; so similar that there are works attempting to unify them [82], [78].
Pruning and Knowledge Distillation        Mutually beneficial results; one solves the challenges of the other [1], [64], [45].
Quantization and Knowledge Distillation   Distillation restores performance damaged by Quantization [67]; Quantization helps mimic a non-existent teacher [10].

4 Model Compression Beyond Size Reduction


The purpose of compressing a model is to reduce its size. But the size reduction
can alter the behaviour of the network that can have both positive and negative
consequences. In the case where it benefits, it will be an extra advantage. In fact,
in some cases, like [48], the compression can even be done for the by-product.
But in the cases where it is disadvantageous there is a trade-off to be addressed
appropriately. This section highlights these issues.

4.1 Reducing Overfitting


Ideally, a model is expected to generalize , and not overfit. Overfitting is when a
model is trained on a training data to the extent that it negatively impacts its
accuracy on data beyond the training set. Dramatic size reduction of any kind
of compression impacts generalization negatively [71],[33]. But a certain level of
compression can help reduce overfiting. In general, compression methods can be
used as noise injection mechanisms which has a regularizing effect [37] for the
network by forcing parameters to learn representations independently. In fact, in
the beginning days of Artificial Intelligence, one objective of NN Pruning was to
reduce overfitting [48]. Recently, dropouts, randomly dropping out weights from
the network with a certain probability, [47] enabled for models to go deeper than
usual and reignited the consideration of compression as a cure for overfitting.

4.2 Explainability
Model Compression is also tied to Explainability or Interpretability, the effort to understand the decisions of models, which is getting more and more attention due to applications in high-stakes areas. The problem of Explainability can be addressed by KD, sometimes at a slight cost in accuracy, by transferring the knowledge of a high-performing neural network into a Decision Tree model, which is inherently explainable [23], [51]. But this is a double-edged phenomenon, because compressing a trained model impacts explanations, as can be seen in [42] and [63], where the authors suggested explanation-preserving methods. There is a lack of detailed work on the relationship between compression and explainability. For example, what model compression does to Mechanistic Interpretability, an effort to understand what happens inside Neural Networks, remains a mystery [61].

4.3 Neural Architecture Search

Neural Architecture Search is an attempt to find an optimal architecture in an educated way, since most novel architectures are purely the result of designer choice rather than a principled search. The relationship of Neural Architecture Search with model compression is also intuitive and well recognised in the literature [16]. The task of optimal compression can be seen as a search in the space of sub-architectures. For example, the task of Pruning can be taken as a search, using a certain algorithm, in the space of NN architectures that are sub-networks of the original network [86], [3].

4.4 Algorithmic Fairness

Algorithmic fairness has become increasingly vital due to the growing impact of AI on society. Particularly, as AI solutions are adopted in fields of high social importance or automate decision-making, the fairness of the algorithm is more critical now than ever. Since Model Compression is becoming a default component of ML deployment, its impact on fairness has to be examined. A recent work aimed at this problem reported that model compression methods, specifically Pruning and Weight Decomposition, exacerbate the bias in a network [74].

4.5 Security

Security issues regarding Neural Networks include Gradient Leakage Attacks,


where an attacker can gain access to private training data from a model’s gra-
dient, Adversarial Samples, where an attacker misleads a trained classifier with
carefully designed inputs [27], and Membership Inference Attacks, where an at-
tacker learns about the training data by making repeated inferences [81].
In general, neither Quantization nor Pruning helps mitigate adversarial attacks without costing accuracy [94]. Knowledge Distillation is extensively used as a defense mechanism against adversarial attacks [62]. Pruning can also be used to prevent Membership Inference Attacks [81].
Basically, every decision made about a network before compression can be questioned again after compression; only a few such aspects are mentioned here, as there is not yet enough work in the area. In some cases, it might also make sense to ask whether a particular behaviour observed in a network might affect compression performance.
Table 3. Summary of Model Compression and its consequences.

Network Behaviour            Relationship
Overfitting                  Slight compression can help [48], [37], but extreme compression of any kind damages generalization.
Explainability               Both Pruning and Knowledge Distillation damage Explainability (attribution), but Knowledge Distillation indirectly addresses Explainability by transferring knowledge to an interpretable model [42], [63].
Neural Architecture Search   Has a two-way positive relationship with Pruning and Knowledge Distillation [16], [3].
Algorithmic Bias             All compression methods exacerbate existing bias [74].
Security                     Quantization and Pruning do not help against adversarial attacks without loss of accuracy [94], but Knowledge Distillation does [62]. Pruning can help against Membership Inference Attacks [81].

5 Findings

At the core level, most Model Compression methods are related to some kind of natural phenomenon outside of AI. Quantization is related to how the human brain stores and encodes information in a discrete way [25]. The original KD paper discussed its similarity to the development of insects. Pruning is related to how the human brain prunes away neural connections as it matures from infancy to adulthood. This raises the question of whether other such ideas can also be exploited, or whether there is an even bigger underlying principle.
Another high-level observation is that the methods can be seen as micro level and macro level. KD and Weight Decomposition are macro-level compression methods: they work at the network level. Quantization and Pruning are relatively micro-level compression methods: they have mechanisms to deal with specific parameters [33]. The methods in the latter group share patterns across their techniques and across the trade-offs introduced by the generic paradigms. For example, in both Pruning and Quantization, the conventional approach is post-training compression; being more surgical by applying them during training is possible but carries an additional cost in complexity; and both have alternatives for dealing with parameters individually or in groups. In particular, unstructured Pruning and uniform Quantization are analogous, but Quantization has more difficulty being surgical. For this reason, structured Pruning and uniform Quantization are the variants heavily applied in practical settings. AutoML solutions such as RL and NAS can be utilized, but the number of hyper-parameters increases exponentially.
In general, Model Compression techniques can be seen in two broad categories. The methods in the first category, such as Pruning, Quantization, and Tensor Decomposition, transform a trained model into a smaller or optimized version of itself. The methods in the second category try to create a new but smaller model from scratch; Knowledge Distillation can be seen as creating a model from scratch, and any other design choice that reduces model size can also be placed in this second category.
Most compression methods are vision native: they are made for computer vision applications, especially CNN-based architectures. That is most likely because vision data is abundant, because inference is needed at edge devices where the camera resides, and because such architectures suit compression methods that remove whole structures at a time. It is only recently, with the development of Large Language Models, that the methods are being adapted for language models.

5.1 Pruning

The most important gap common to all Pruning methods is that it is hard to compare different variants due to the lack of a benchmark metric. The only benchmarking effort is ShrinkBench [9], a framework aiming to provide pruning performance comparisons.
There is a critical difference between unstructured and structured Pruning. Unstructured pruning is implemented by setting individual parameters to zero. This makes it almost surgical and more accurate in identifying what to remove, but it does not actually reduce computation time, since the GPU performs the operation anyway. Structured pruning, on the other hand, removes a complete layer, channel, or filter, and the remaining connections are re-wired afterwards. This improves efficiency both at pruning time and at inference time, but it has its own consequences, as it is not surgical: the removed layer or channel might still contain important weights that are needed for inference.
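To make the contrast concrete, the sketch below removes whole output channels of a convolutional layer by the L1 norm of their filters; the criterion and keep ratio are illustrative, and in a real network the following layer must also be re-wired to accept fewer input channels.

```python
# Minimal sketch of structured (channel-level) pruning of a convolutional layer.
import torch
import torch.nn as nn

def prune_out_channels(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    # Rank output channels by the L1 norm of their filters and keep the largest ones.
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    k = max(1, int(keep_ratio * conv.out_channels))
    keep = torch.topk(norms, k).indices
    pruned = nn.Conv2d(conv.in_channels, k, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned  # the next layer's in_channels must shrink to k as well
```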
Another common limitation, which may be caused by the lack of a common benchmark, is that the methods are architecture dependent. Almost all of them involve at least some amount of fine-tuning or retraining, which requires the original data the network was trained on, and this might not always be available.
Considering random weight dropping or magnitude-based pruning, it is justified to say that Pruning is the fastest compression method.

5.2 Knowledge Distillation

Knowledge Distillation methods construct a brand new, smaller model that tries to mimic the bigger model. This means training a new model from scratch, which can take as much time as training the teacher. The loss has to be a softmax-style loss whose output is a probability distribution. KD does not, by itself, give a particularly large size reduction, and some students might even need a teacher assistant to catch up to the teacher [56].
In Knowledge Distillation (KD), the added term, the distillation loss seen in Figure 4, does serve as a regularizer [72], but the exact part of the Dark Knowledge the student takes advantage of is still being studied [89], [92].

5.3 Quantization

Quantization methods make weights discrete, which can damage the precision of important weights. There are ways to mitigate this (dynamic or asymmetric Quantization, or Quantization Aware Training), but they come at the expense of computation, which is what compression was needed to reduce in the first place.

Table 4. Summary of model compression methods.

Method                   Description                                                                  Remark
Pruning                  Removing parameters by setting them to zero.                                 The oldest, fastest, and the one with the biggest compression rate.
Knowledge Distillation   Train a new, smaller model with the predictions of the bigger model.         Most versatile and model agnostic, but works only with certain types of loss functions.
Quantization             Reduce the number of bits required to represent the parameters.              Works well with pruning; effective for reducing memory size as well as computation.
Other methods            Methods that reduce model size with techniques other than the above three.   Efficient architecture designs, methods inspired by file compression, or matrix decomposition based methods.

6 Factors affecting the choice of methods

There is no one-size-fits-all method; one has to find the right one for the task at hand. These are the factors that can be considered:

– Trade-off: the paradigms discussed above introduce trade-offs that can influence the choice of a compression method.
– Area of application. How tolerant the task at hand is can affect the choice of a compression method in relation to alignment, accuracy, etc.
– Type of data (architecture). Computer vision and natural language related tasks can require architectures that suit their purpose, which in turn affects the choice of compression strategy. For example, structured Pruning can effectively remove channels and filters from CNNs; on the other hand, there are real-world cases where large language models are effectively compressed with Knowledge Distillation. Classical methods can have performance comparable to Neural Networks.
– Expertise. Some compression methods require expertise and a high level of sophistication to be carried out effectively.

7 Open Research Problems

There are numerous open issues in Model Compression that can be formulated
as valid research questions. They are presented in this section according to the
classification used in the paper.

7.1 Knowledge Distillation (KD)

Knowledge Distillation (KD) is one of the major compression methods. In its framework, as can be seen in Figure 4, the added term in the loss is the knowledge assumed to be transferred from the teacher, and it does serve as a regularizer [37]. But the performance, and even the need for the teacher network itself, especially in response-based KD, is being questioned repeatedly [10], [89], [72], [4]. What part of the dark knowledge the student takes advantage of is still an open problem.
There are also understudied variants of KD. An example is Subclass Distillation [58], which entails a powerful concept. In particular, its converse, extracting specialists from a bigger model, seems a fruitful direction. Zero-shot knowledge distillation methods, as discussed before, use random inputs to the trained model to generate pseudo data with which to train the small network. Studying how the result turns out for curated inputs is also a possible direction.

7.2 Hybrid Methods

As described in section three, combining compression methods can be beneficial if the added complexity is not an issue. In general, there can be combinations of two kinds: combining variants of the same method to take advantage of each, and combinations between different compression methods, as in [32]. Thus, a study on combining variants of the same compression method can be a good direction; an example is trying to synchronize unstructured and structured pruning. Such methods may become feasible with Reinforcement Learning (RL) based approaches as the algorithms get faster and more advanced. There is also a lack of exploration in combining the other compression methods. Additionally, a comprehensive survey of this specific area can be helpful for researchers.

7.3 Model Compression Beyond Size Reduction

Despite efforts to train smaller networks from scratch [22], most Model Compression methods are applied after a big model is trained. Thus, the compression is bound to have an effect on other aspects of the model. For example, as pointed out in [16], there is still work to be done to bridge the concept of a model's size (compression) and its Explainability. Some applications of Explainability are mentioned in the pruning section of this paper, but there are only a few works, save for [13], that even acknowledge the existence of a bridge between them. Thus, formalized future work in this direction can be fruitful.
Basically, one can pick many components of the network and ask how compression impacts them. Presented in section four are only a limited number of them, because there is still much work to be done in the area. Again, a comprehensive survey on the effects of Model Compression beyond size reduction can be extremely helpful for building Model Compression without ramifications, and beyond.

7.4 Miscellaneous
Some model compression methods are developed and experimented with only for specific types of architectures, which limits their applicability. In general, RL-based approaches may be suitable in the future for adapting compression techniques to a given model.
There is little innovation in methods other than pruning, quantization, and knowledge distillation, yet design choices can make a considerable difference [75], [90], [39], [40]. Since the aim of model compression is simple and intuitive, any model size reduction method can be called a model compression technique as long as it performs well.
The idea of ensembles has long been in use in the classical ML world. It is about using multiple machine learning models for a single task simultaneously, and it increases accuracy by taking advantage of the performance of the individual models. The concept has barely been mentioned in the context of large neural networks, as the complexity is overwhelming even to think about. But with parallel advances in model compression and hardware, it seems it will come within the reach of industry leaders, or whoever has the access and motive to try. Since that has most likely never been explored, it is worth the shot. Hardware issues, though they are not discussed in this survey, are highly related to compression and optimization.

8 Conclusion
A brief review of Model Compression techniques in Deep Learning, with their challenges and opportunities, has been presented. It started by defining Model Compression and the challenges involved. Then thirty years of work on Pruning and a number of works on Knowledge Distillation and Quantization were synthesized. How each method is combined with the others was also discussed, and a dedicated section magnified the relationship between model compression and the network's behaviour. Findings and patterns from each section were presented separately, and actionable future research directions have been put forward to assist interested researchers.

References
1. Aghli, N., Ribeiro, E.: Combining weight pruning and knowledge distillation for
cnn compression. 2021 IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW) pp. 3185–3192 (2021)

2. Ashok, A., Rhinehart, N., Beainy, F.N., Kitani, K.M.: N2n learning: Net-
work to network compression via policy gradient reinforcement learning. ArXiv
abs/1709.06030 (2017)
3. Ashok, A., Rhinehart, N., Beainy, F.N., Kitani, K.M.: N2n learning: Net-
work to network compression via policy gradient reinforcement learning. ArXiv
abs/1709.06030 (2018)
4. Bang, D., Lee, J., Shim, H.: Distilling from professors: Enhancing
the knowledge distillation of teachers. Information Sciences 576, 743–
755 (2021). https://doi.org/10.1016/j.ins.2021.08.020,
https://www.sciencedirect.com/science/article/pii/S0020025521008203
5. Banner, R., Nahshan, Y., Hoffer, E., Soudry, D.: Aciq: Analytical clipping for
integer quantization of neural networks. ArXiv abs/1810.05723 (2018)
6. Bengio, Y., et al.: Learning deep architectures for ai. Foundations and trends® in
Machine Learning 2(1), 1–127 (2009)
7. Bernstein, L., Sludds, A., Hamerly, R., Sze, V., Emer, J.S., Englund, D.: Freely
scalable and reconfigurable optical hardware for deep learning. Scientific Reports
11 (2020)
8. Bianco, S., Cadene, R., Celona, L., Napoletano, P.: Benchmark analysis of repre-
sentative deep neural network architectures. IEEE Access 6, 64270–64277 (2018).
https://doi.org/10.1109/ACCESS.2018.2877890
9. Blalock, D.W., Ortiz, J.J.G., Frankle, J., Guttag, J.V.: What is the state of neural
network pruning? ArXiv abs/2003.03033 (2020)
10. Boo, Y., Shin, S., Choi, J., Sung, W.: Stochastic precision ensemble: self-knowledge
distillation for quantized deep neural networks. In: Proceedings of the AAAI Con-
ference on Artificial Intelligence. vol. 35, pp. 6794–6802 (2021)
11. Bucila, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: KDD ’06
(2006)
12. Cai, Y., Yao, Z., Dong, Z., Gholami, A., Mahoney, M.W., Keutzer, K.: Zeroq: A
novel zero shot quantization framework. 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) pp. 13166–13175 (2020)
13. Calvi, G.G., Moniri, A., Mahfouz, M., Zhao, Q., Mandic, D.P.: Compression and
interpretability of deep neural networks via tucker tensor layer: From first principles
to tensor valued back-propagation. arXiv: Learning (2019)
14. Chen, L., Chen, Y., Xi, J., Le, X.: Knowledge from the original network: restore a
better pruned network with knowledge distillation. Complex & Intelligent Systems
(2021)
15. Chen, T., Ji, B., Ding, T., Fang, B., Wang, G., Zhu, Z., Liang, L., Shi, Y., Yi, S., Tu,
X.: Only train once: A one-shot neural network training and pruning framework.
In: Neural Information Processing Systems (2021)
16. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and
acceleration for deep neural networks. ArXiv abs/1710.09282 (2017)
17. Courbariaux, M., Bengio, Y.: Binarynet: Training deep neural networks with
weights and activations constrained to +1 or -1. ArXiv abs/1602.02830 (2016)
18. Courbariaux, M., Bengio, Y., David, J.P.: Binaryconnect: Training deep neural
networks with binary weights during propagations. In: NIPS (2015)
19. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
rectional transformers for language understanding. ArXiv abs/1810.04805 (2019)
20. Ding, X., Wang, Y., Xu, Z., Wang, Z.J., Welch, W.J.: Distilling and
transferring knowledge via cgan-generated samples for image clas-
sification and regression. Expert Systems with Applications 213,
119060 (2023). https://doi.org/10.1016/j.eswa.2022.119060,
https://www.sciencedirect.com/science/article/pii/S0957417422020784
21. Dong, X., Yang, Y.: Network pruning via transformable architecture search. In:
NeurIPS (2019)
22. Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable
neural networks. arXiv: Learning (2019)
23. Frosst, N., Hinton, G.E.: Distilling a neural network into a soft decision tree. ArXiv
abs/1711.09784 (2017)
24. Furlanello, T., Lipton, Z.C., Tschannen, M., Itti, L., Anandkumar, A.: Born again
neural networks. In: ICML (2018)
25. Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M.W., Keutzer, K.: A
survey of quantization methods for efficient neural network inference. ArXiv
abs/2103.13630 (2022)
26. Gong, Y., Liu, L., Yang, M., Bourdev, L.D.: Compressing deep convolutional net-
works using vector quantization. ArXiv abs/1412.6115 (2014)
27. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial
examples. CoRR abs/1412.6572 (2014)
28. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. ArXiv
abs/2006.05525 (2021)
29. Guan, Y., Zhao, P., Wang, B., Zhang, Y., Yao, C., Bian, K., Tang, J.: Differentiable
feature aggregation search for knowledge distillation. ArXiv abs/2008.00506
(2020)
30. Gupta, M., Aravindan, S., Kalisz, A., Chandrasekhar, V.R., Jie, L.: Learning to
prune deep neural networks via reinforcement learning. ArXiv abs/2007.04756
(2020)
31. Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with
limited numerical precision. In: International Conference on Machine Learning
(2015)
32. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural network
with pruning, trained quantization and huffman coding. arXiv: Computer Vision
and Pattern Recognition (2016)
33. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for
efficient neural network. Advances in neural information processing systems 28
(2015)
34. Haroush, M., Hubara, I., Hoffer, E., Soudry, D.: The knowledge within: Methods
for data-free model compression. 2020 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) pp. 8491–8499 (2019)
35. Hassibi, B., Stork, D., Wolff, G.: Optimal brain surgeon and general network prun-
ing. In: IEEE International Conference on Neural Networks. pp. 293–299 vol.1
(1993). https://doi.org/10.1109/ICNN.1993.298572
36. He, Y., Lin, J., Liu, Z., Wang, H., Li, L.J., Han, S.: Amc: Automl for model
compression and acceleration on mobile devices. In: ECCV (2018)
37. Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network.
ArXiv abs/1503.02531 (2015)
38. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are uni-
versal approximators. Neural networks 2(5), 359–366 (1989)
39. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., An-
dreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for
mobile vision applications. ArXiv abs/1704.04861 (2017)

40. Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.:
Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1MB model
size. ArXiv abs/1602.07360 (2016)
41. Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A.G., Adam, H.,
Kalenichenko, D.: Quantization and training of neural networks for efficient integer-
arithmetic-only inference. 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition pp. 2704–2713 (2018)
42. Joseph, V., Siddiqui, S.A., Bhaskara, A., Gopalakrishnan, G., Muralidharan, S.,
Garland, M., Ahmed, S., Dengel, A.R.: Going beyond classification accuracy met-
rics in model compression (2020)
43. Kim, J., youn Park, C., Jung, H.J., Choe, Y.: Differentiable pruning method for
neural networks. ArXiv abs/1904.10921 (2019)
44. Kim, J., Bhalgat, Y., Lee, J., Patel, C., Kwak, N.: Qkd: Quantization-aware knowl-
edge distillation. arXiv preprint arXiv:1911.12491 (2019)
45. Kim, Y., Rush, A.M.: Sequence-level knowledge distillation. In: Conference on
Empirical Methods in Natural Language Processing (2016)
46. Kossaifi, J., Lipton, Z.C., Khanna, A., Furlanello, T., Anandkumar, A.: Tensor
regression networks. ArXiv abs/1707.08308 (2020)
47. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. Communications of the ACM 60, 84 – 90 (2012)
48. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: NIPS (1989)
49. Li, L., Lin, Y., Ren, S., Li, P., Zhou, J., Sun, X.: Dynamic knowledge distillation
for pre-trained language models. ArXiv abs/2109.11295 (2021)
50. Li, Z., Wallace, E., Shen, S., Lin, K., Keutzer, K., Klein, D., Gonzalez, J.: Train
large, then compress: Rethinking model size for efficient training and inference of
transformers. ArXiv abs/2002.11794 (2020)
51. Liu, X., Wang, X., Matwin, S.: Improving the interpretability of deep neural net-
works with knowledge distillation. 2018 IEEE International Conference on Data
Mining Workshops (ICDMW) pp. 905–912 (2018)
52. Liu, Y., Zhang, W., Wang, J., Wang, J.: Data-free knowledge transfer: A survey.
ArXiv abs/2112.15278 (2021)
53. Lopes, R.G., Fenu, S., Starner, T.: Data-free knowledge distillation for deep neural
networks. ArXiv abs/1710.07535 (2017)
54. Lopez-Paz, D., Bottou, L., Schölkopf, B., Vapnik, V.N.: Unifying distillation and
privileged information. CoRR abs/1511.03643 (2015)
55. Malach, E., Yehudai, G., Shalev-Shwartz, S., Shamir, O.: Proving the lottery ticket
hypothesis: Pruning is all you need. In: ICML (2020)
56. Mirzadeh, S.I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., Ghasemzadeh,
H.: Improved knowledge distillation via teacher assistant. In: AAAI (2020)
57. Mozer, M.C., Smolensky, P.: Skeletonization: A technique for trimming the fat
from a network via relevance assessment. In: NIPS (1988)
58. Müller, R., Kornblith, S., Hinton, G.E.: Subclass distillation. ArXiv
abs/2002.03936 (2020)
59. Nayak, G.K., Mopuri, K.R., Shaj, V., Babu, R.V., Chakraborty, A.: Zero-shot
knowledge distillation in deep networks. ArXiv abs/1905.08114 (2019)
60. Novikov, A., Podoprikhin, D., Osokin, A., Vetrov, D.P.: Tensorizing neural net-
works. In: NIPS (2015)
61. Olah, C.: Mechanistic Interpretability, Variables, and the Importance of
Interpretable Bases. https://transformer-circuits.pub/2022/mech-interp-essay/index.html (Jun 22, 2022), [Online; accessed 2022-12-14]

62. Papernot, N., Mcdaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a defense
to adversarial perturbations against deep neural networks. 2016 IEEE Symposium
on Security and Privacy (SP) pp. 582–597 (2015)
63. Park, G., Yang, J.Y., Hwang, S.J., Yang, E.: Attribution preservation in network
compression for reliable network interpretation. Advances in Neural Information
Processing Systems 33, 5093–5104 (2020)
64. Park, J., No, A.: Prune your model before distill it. In: European Conference on
Computer Vision (2021)
65. Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. 2019
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
pp. 3962–3971 (2019)
66. Patterson, D., Gonzalez, J., Le, Q.V., Liang, C., Munguı́a, L.M., Rothchild, D., So,
D.R., Texier, M., Dean, J.: Carbon emissions and large neural network training.
ArXiv abs/2104.10350 (2021)
67. Polino, A., Pascanu, R., Alistarh, D.: Model compression via distillation and quan-
tization. arXiv preprint arXiv:1802.05668 (2018)
68. Ramanujan, V., Wortsman, M., Kembhavi, A., Farhadi, A., Rastegari, M.: What’s
hidden in a randomly weighted neural network? 2020 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) pp. 11890–11899 (2020)
69. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classifi-
cation using binary convolutional neural networks. In: ECCV (2016)
70. Reed, R.: Pruning algorithms-a survey. IEEE Transactions on Neural Networks
4(5), 740–747 (1993). https://doi.org/10.1109/72.248452
71. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets:
Hints for thin deep nets. CoRR abs/1412.6550 (2015)
72. Sau, B.B., Balasubramanian, V.N.: Deep model compression: Distilling knowledge
from noisy teachers. arXiv preprint arXiv:1610.09650 (2016)
73. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. CoRR abs/1409.1556 (2015)
74. Stoychev, S., Gunes, H.: The effect of model compression on fairness in facial
expression recognition. ArXiv abs/2201.01709 (2022)
75. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan,
D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) pp. 1–9 (2015)
76. Tang, J., Liu, M., Jiang, N., Cai, H., Yu, W., Zhou, J.: Data-
free network pruning for model compression. In: 2021 IEEE Interna-
tional Symposium on Circuits and Systems (ISCAS). pp. 1–5 (2021).
https://doi.org/10.1109/ISCAS51556.2021.9401109
77. Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. ArXiv
abs/1910.10699 (2020)
78. Van Baalen, M., Louizos, C., Nagel, M., Amjad, R.A., Wang, Y., Blankevoort, T.,
Welling, M.: Bayesian bits: Unifying quantization and pruning. Advances in neural
information processing systems 33, 5741–5752 (2020)
79. Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L., Polosukhin, I.: Attention is all you need. ArXiv abs/1706.03762 (2017)
80. Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S.: Haq: Hardware-aware automated quan-
tization with mixed precision. 2019 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) pp. 8604–8612 (2018)
81. Wang, Y., Wang, C., Wang, Z., Zhou, S., Liu, H., Bi, J., Ding, C., Rajasekaran,
S.: Against membership inference attack: Pruning is all you need. In: International
Joint Conference on Artificial Intelligence (2020)

82. Wang, Y., Lu, Y., Blankevoort, T.: Differentiable joint pruning and quantization
for hardware efficiency. ArXiv abs/2007.10463 (2020)
83. Wang, Y., Zhang, X., Hu, X., Zhang, B., Su, H.: Dynamic network pruning with
interpretable layerwise channel selection. In: AAAI (2020)
84. Wang, Y., Zhang, X., Xie, L., Zhou, J., Su, H., Zhang, B., Hu, X.: Pruning from
scratch. In: AAAI (2020)
85. Wiedemann, S., Kirchhoffer, H., Matlage, S., Haase, P., Marbán, A., Marinc, T.,
Neumann, D., Nguyen, T., Schwarz, H., Wiegand, T., Marpe, D., Samek, W.:
Deepcabac: A universal compression algorithm for deep neural networks. IEEE
Journal of Selected Topics in Signal Processing 14, 700–714 (2020)
86. Yang, Z., Wang, Y., Chen, X., Shi, B., Xu, C., Xu, C., Tian, Q., Xu, C.: Cars: Con-
tinuous evolution for efficient neural architecture search. 2020 IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (CVPR) pp. 1826–1835 (2020)
87. Yeom, S.K., Seegerer, P., Lapuschkin, S., Wiedemann, S., Müller, K.R., Samek,
W.: Pruning by explaining: A novel criterion for deep neural network pruning.
ArXiv abs/1912.08881 (2021)
88. Yim, J., Joo, D., Bae, J.H., Kim, J.: A gift from knowledge distillation: Fast op-
timization, network minimization and transfer learning. 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) pp. 7130–7138 (2017)
89. Yuan, L., Tay, F.E., Li, G., Wang, T., Feng, J.: Revisiting knowledge distillation
via label smoothing regularization. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. pp. 3903–3911 (2020)
90. Zhai, S., Cheng, Y., Zhang, Z., Lu, W.: Doubly convolutional neural networks. In:
NIPS (2016)
91. Zhang, Z., Shao, W., Gu, J., Wang, X., Ping, L.: Differentiable dynamic quan-
tization with mixed precision and adaptive resolution. ArXiv abs/2106.02295
(2021)
92. Zhao, B., Cui, Q., Song, R., Qiu, Y., Liang, J.: Decoupled knowledge distilla-
tion. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR) pp. 11943–11952 (2022)
93. Zhao, H., Sun, X., Dong, J., Chen, C., Dong, Z.: Highlight every step: Knowledge
distillation via collaborative teaching. IEEE Transactions on Cybernetics 52, 2070–
2081 (2019)
94. Zhao, Y., Shumailov, I., Mullins, R.D., Anderson, R.: To compress or not to com-
press: Understanding the interactions between adversarial attacks and neural net-
work compression. ArXiv abs/1810.00208 (2018)
