


Model Compression Techniques in Deep Neural
Networks⋆

Mubarek Mohammed Yesuf¹ [0000-0002-0145-1810] and Beakal Gizachew Assefa² [0000-0003-3065-3881]

¹ Information Network Security Agency and Addis Ababa Institute of Technology,
AAU, Addis Ababa, Ethiopia
[email protected], [email protected]
² Addis Ababa Institute of Technology, AAU, Addis Ababa, Ethiopia
[email protected]

⋆ Supported by the Ethiopian Artificial Intelligence Institute.

Abstract. With the current set-up, the success of Deep Neural Network models is highly tied to their size. Although this property might help them improve their performance, it makes them difficult to train, to deploy on resource-constrained machines, and to iterate on in experiments. There is also a growing concern about their environmental and economic impacts. Model compression is a set of techniques applied to reduce the size of models while maintaining performance. This survey presents state-of-the-art model compression methods, discusses how they are hybridized, highlights the changes they could cause beyond size reduction, and puts forward open research problems for further study.

Keywords: Model compression · Pruning · Knowledge Distillation · Quantization · Neural Networks · Deep Learning · Artificial Intelligence

1 Introduction

Breakthrough advances in AI are consequences of Neural Networks, the foundation of Deep Learning (DL), a family of Machine Learning (ML) algorithms behind successful Artificial Intelligence (AI) innovation in tasks like voice recognition, image classification [47], and human language understanding [19]. DL algorithms rest on the Universal Approximation Theorem [38], which guarantees that Neural Networks, arranged in some way, can compute any continuous function, and on the hierarchical representation of information inspired by the human brain [6]. On the other hand, the exact number of parameters a specific DL model needs for a certain problem is still a hyper-parameter, a choice of the designer. These ideas stimulated over-parameterization in an attempt to increase a model's representation capacity, and thus its performance, as the tasks models are applied to grow in complexity [8]. How complex a model is, is commonly measured by its total number of parameters, which indicates its memory consumption, and by floating-point operations per second (FLOPs), which is useful for computations that require floating-point calculations.
Most benchmark models are related to Computer Vision and Natural Language Processing, sub-fields of AI relating to human vision and language respectively. Computer-vision models are mostly based on Convolutional Neural Networks (CNNs), a type of DL architecture designed for spatial data such as images. Prominent examples are AlexNet [47] and VGG16 [73], variants of CNNs which won the ImageNet competition [47], a dataset with more than 14 million well-labeled images intended to serve as a benchmark in computer vision applications. Natural-language models have recently grown so large that they are now referred to as Large Language Models. State-of-the-art and benchmark architectures for language models are based on the transformer [79]; prominent examples are BERT, GPT, BART, and DistilBERT. Figure 1 below shows the evolution of some benchmark computer vision models and their computational burdens expressed in Giga FLoating-point OPerations (GFLOPs), along with their Top-1 (left) and Top-2 accuracy.

Fig. 1. Size of benchmark vision and language models over the past years [7].

However, there is a trade-off: large models perform well at the cost of increased complexity. Challenges of such big models include longer training, inference, and iteration time, difficulty deploying on resource-constrained and edge machines, and economic and environmental impact [66]. This raises the question of whether there is a way to work around the trade-off. This is where Model Compression comes in: the idea of making models small without losing performance. A review of model compression methods for deep learning is presented here.

Model Compression is a set of techniques [11] for reducing the size of large models without a notable performance loss. Its necessity is increasing in parallel with the advancement and complexity of models. Even though there are recent attempts to train smaller networks from the beginning [22], Model Compression is mostly applied after training a bigger model, as the model needs far fewer parameters for inference than for learning. Model Compression methods in the literature can be classified into four major parts: Pruning, Knowledge Distillation (KD), Quantization, and other methods. Pruning removes unwanted structures from a trained network. KD is a mechanism for transferring the knowledge of a bigger network onto a smaller one. Quantization reduces the number of bits required to represent parts of a model; it is similar to approximation. The remaining Model Compression methods, other than Pruning, KD, and Quantization, can be organized in one section and are referred to as other methods. Weight decomposition methods deserve their own section, but this arrangement makes the paper simpler for the reader.
The contributions of the paper are as follows:

– Conduct a review of deep learning Model Compression methods.
– Systematically compare and discuss each method along with its consequences beyond size reduction, their combinations, and selection considerations.
– Discuss open research problems for further study on the topic of Deep Neural Network Model Compression.

The rest of the paper is organized as follows. The next section, section two, reviews prominent works in Model Compression. Section three covers Hybrid Model Compression techniques, which combine existing methods for a better outcome. The effects of Model Compression beyond size reduction are discussed in section four. Discussions, factors affecting the choice of methods, and open research problems are presented in sections five, six, and seven respectively. Finally, a conclusion is presented.

Fig. 2. The taxonomy and categories of Deep Learning model compression methods: Pruning, Knowledge Distillation, Quantization, other methods, and Hybrid Model Compression.

2 Model Compression Methods

2.1 Generic paradigms

Throughout the development of Model Compression methods, certain key characteristics keep reappearing. These characteristics are generic and can be organized as paradigms of Model Compression. Understanding them gives a general, method-agnostic view of the approaches in Model Compression, assists in selection, and stimulates research questions. Unlike past works, here they are organized in one standalone subsection. Not all of them apply to every compression method, and several can be found in one method at the same time. They form important trade-offs wherever they appear and extend the default approach, which is usually static, deterministic, and post-training.

– Random or Stochastic. Sometimes certain aspects (parameters, hyper-parameters, etc.) of a model are selected at random. The stochasticity brings benefits analogous to those it brings to Stochastic Gradient Descent: it enables exploration in the space. In fact, random selection can sometimes be used as a benchmark for comparison against the other methods [9].
– Data-free, zero-shot, one-shot, or few-shot. The data the original models were trained on might not always be available, or it might be expensive to work with. But some other information, such as metadata, might be available. In such cases, data-free or data- and iteration-frugal paradigms are used [76], [15].
– Dynamic. The dynamic approach to model compression decides different parts of the process dynamically [49].
– Online. Compression of the model is done while it is being trained [41].
– AutoML. Instead of adjusting hyper-parameters by hand, this paradigm encourages using AutoML methods, where certain parts of the compression process are decided automatically by an external algorithm [30], [36].
– Differentiable. This paradigm turns hyper-parameters introduced by the compression method into learnable parameters and learns them during the process [43].

2.2 Pruning

The oldest, most studied, and most intuitive method of reducing model size is pruning. It literally removes a weight or a node from a network based on how important it is. The vanilla pruning pipeline has three main stages after initialization: training, pruning, and retraining (fine-tuning). Hyperparameters introduced by Pruning include the compression rate, the magnitude threshold when magnitude-based pruning is used, the type of saliency metric to use, etc. The approach taken towards these hyperparameters categorizes the pruning methods in the literature.
The earliest methods were mainly: saliency based, where a node or weight is checked for importance; penalty based [48], [57], where a penalty term is added to the loss (objective) function to induce weight decay [70]; and magnitude based, which simply removes the smallest link (weight) in each iteration. Prominent early works are Optimal Brain Damage (OBD) [48], Optimal Brain Surgeon [35], and Skeletonization [57].
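To make the magnitude-based variant concrete, the sketch below applies global, unstructured magnitude pruning to a PyTorch model; the sparsity level, the choice of prunable layer types, and the masking strategy are illustrative assumptions rather than the recipe of any cited work.

```python
# Minimal sketch of unstructured, global magnitude pruning (illustrative, PyTorch).
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float = 0.5) -> nn.Module:
    """Zero out the smallest-magnitude weights across all Linear/Conv2d layers."""
    prunable = [m for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    # A single global threshold determined by the requested sparsity.
    all_weights = torch.cat([m.weight.detach().abs().flatten() for m in prunable])
    threshold = torch.quantile(all_weights, sparsity)
    with torch.no_grad():
        for m in prunable:
            mask = (m.weight.abs() > threshold).to(m.weight.dtype)
            m.weight.mul_(mask)  # pruned weights are simply set to zero
    return model
```

In the iterative setting, a step like this alternates with fine-tuning until the target compression rate is reached.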
An intuitive direction of development for Pruning is determining what is important. There have been attempts to identify important structures with statistical methods, and workarounds that follow recent advances in Explainability, a notion trying to understand and explain the 'decisions' of a model [87].

Fig. 3. Iterative Pruning (left) [33] and Pruning (right) [33].

Modern approaches vary not just in the saliency metrics or other criteria they use, but also in the overall pruning pipeline. The 'lottery ticket hypothesis' [22], whose stronger version was supported in [55], states that a randomly initialized network contains a small subnetwork (the "winning ticket") that, when trained separately, can compete with the performance of the original model. A rather bold finding by [68] states that there is a randomly initialized subnetwork that, even without training, can perform as well as the original. The recent work in [84], termed pruning from scratch, questions the need for training the network and proposes a different pipeline: random initialization, pruning, and then training. The work in [50] raises the question of why train a big model to convergence when it is possible to attain better performance by training it for a few iterations and then pruning it.
Rather than manually designing the hyperparameters introduced by Pruning, Reinforcement Learning (RL) based methods are emerging that formulate the problem of removing a node or a weight as a Markov Decision Process [30]. Neural Architecture Search based methods formulate Pruning as a search in the space of sub-networks, an emerging area in the study of neural networks [21].

2.3 Knowledge Distillation


Knowledge Distillation is a relatively new compression procedure originally introduced in [11] and reinvented in [37]. A bigger, well-performing model, termed the teacher model, provides labels for a smaller network to train with. For each observation x and its label y in the dataset the original network was trained on, the smaller network uses the predictions of the bigger network as a label for x instead of just y. The predictions of the teacher model are termed soft labels, since they are not ones and zeros like hard labels; these soft labels are also termed the Dark Knowledge. The intuition is that an image of a dog has a larger chance of being mistaken for a cat than for a car.

Fig. 4. Knowledge Distillation (KD) [20].

KD uses a constant value named the Temperature, T, which magnifies the small probability values in the prediction (the soft labels). The output for the i-th class, with logit z_i and temperature T, is computed as:

p(z_i, T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}    (1)

Hyper-parameters introduced by KD include the architecture of the student network, the temperature T, and the distillation loss, which measures the divergence between the student network and the teacher network [37]. Different approaches and improvements around these hyper-parameters create variants of KD. There can be improvements in the loss [77], [92], and one can also add an assistant network, smaller than the teacher but larger than the student, to help the small student network [56].
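As a concrete illustration of these hyper-parameters, the following sketch combines a temperature-softened divergence term with the ordinary cross-entropy on hard labels; it assumes PyTorch, and the weighting factor alpha and the temperature T are illustrative values, not ones prescribed by [37].

```python
# Minimal sketch of a response-based distillation loss (cf. Eq. 1), in PyTorch.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft labels: the teacher's predictions softened by the temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # Divergence between the softened distributions; T*T rescales the gradients.
    kd_term = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * T * T
    # Ordinary cross-entropy on the hard labels y.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```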
A powerful adaptation of KD, FitNets [71], trains a student that is deeper and thinner than the teacher, unlike [37], using the intermediate representations (features) learned by the teacher as hints to improve the training process and the final generalization capacity of the student. This work opened a whole new sub-category of KD termed Feature Distillation, leading the original approach to be called response-based distillation [28]. Response-based distillation is also limited to supervised learning, as the logits, i.e. the soft labels, are probability distributions. Relationship distillation extends feature distillation by using outputs of specific layers in the teacher model and exploring the relationships between different layers or data samples [65].
At the core level, KD is a knowledge transfer mechanism [54]. It is therefore extremely versatile and model agnostic: any model can be used as teacher or student. Thus, it has uses beyond compression, including faster optimization [88], defence against adversarial attacks [62], explainable networks [42], [63], and even improving the original network itself [24].

2.4 Quantization
Quantization, in simple terms discretization, is a mapping from a space of continuous values to discrete values; a good example is rounding a number. The concept is old and has been used in various settings in information theory and other sciences, but in the NN setting it is about rounding the parameters of the network [26] and reducing the number of bits required for their representation, for example from 32-bit floating-point representations down to limited precision, commonly 8 bits or lower. Its foundation is rooted in how the human brain stores and encodes information in a discrete way [25].
In the literature, the variants of quantization differ according to which parameters to quantize, how to quantize, whether to do it at training time or after training, and other factors that correspond to the variables S and Z in the quantization equation. S itself is determined by the boundary values of the clipping range. Depending on whether the clipping range is calculated dynamically for each input or fixed statically, quantization can be dynamic or static. The work in [41] made sure only 8-bit integers are used in the network at inference time.
Normally, what is known as the clipping range, a range the individual values are made to lie in, is determined beforehand. The common formula is given by:
Q(r) = \mathrm{Int}\left(\frac{r}{S}\right) - Z    (2)
where Q is the quantization operator, r is a real-valued input (an activation or a weight), S is a real-valued scaling factor, and Z is an integer zero point. These variables are part of the hyper-parameters introduced by Quantization. Int is the function that maps its input to an integer, for example by rounding. An approximation of the original value can be recovered with the reverse operation, dequantization.
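The sketch below instantiates Eq. (2) and its inverse for a signed 8-bit representation; deriving S and Z from a pre-determined clipping range [r_min, r_max] as shown is one common convention and is given here only as an assumption.

```python
# Minimal sketch of uniform quantization (Eq. 2) and dequantization, in PyTorch.
import torch

def calibrate(r_min: float, r_max: float, num_bits: int = 8):
    """Derive the scale S and zero point Z from a pre-determined clipping range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    S = (r_max - r_min) / (qmax - qmin)
    Z = int(round(r_min / S)) - qmin  # integer zero point
    return S, Z

def quantize(r: torch.Tensor, S: float, Z: int, num_bits: int = 8) -> torch.Tensor:
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return torch.clamp(torch.round(r / S) - Z, qmin, qmax).to(torch.int8)

def dequantize(q: torch.Tensor, S: float, Z: int) -> torch.Tensor:
    # Recovers an approximation of r, up to rounding and clipping error.
    return S * (q.to(torch.float32) + Z)
```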
The basic method of quantization is to train a model under a normal setup and then quantize each parameter, activation, or even input, using some sort of function or mapping, sometimes referred to as a quantization operator [25]. When fine-tuning is needed after quantization, Quantization Aware Training (QAT) is used, where re-training is done with floating-point propagation but the weights are quantized back after each update. Post Training Quantization (PTQ) is quantization without retraining [5], [12]; it is preferred when QAT is too complex to implement.
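The core step of QAT is often realized as "fake quantization", sketched below under the uniform scheme of Eq. (2); this is an illustrative pattern, not the exact procedure of [41].

```python
# Minimal sketch of a fake-quantization step used during Quantization Aware Training.
import torch

def fake_quantize(w: torch.Tensor, S: float, Z: int, num_bits: int = 8) -> torch.Tensor:
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = torch.clamp(torch.round(w / S) - Z, qmin, qmax)   # quantize (Eq. 2) ...
    w_hat = S * (q + Z)                                    # ... and dequantize immediately
    # Straight-through trick: the forward pass uses w_hat, while the backward pass
    # treats the operation as identity, so gradients still reach the float weights w.
    return w + (w_hat - w).detach()
```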
The extreme form of Quantization is Binarization, making the parameters of a network binary [17] for memory and compute efficiency. In fact, the most appealing feature of this method, given it works properly, is that it might even completely remove the need for multiplication at inference time, as in [18]. This is made possible by replacing the dot product in conventional neural networks with bitwise operators [17] to train networks with binary weights. In [69], the researchers demonstrated CNNs with only binary weights, including a binary quantized version of AlexNet [47] that is 32 times smaller, with accuracy comparable to the original. Beyond binarization, an alternative is a ternary network, where the weights are limited to three values instead of two. Neither binary nor ternary weights are directly trainable with back-propagation and gradient descent, since the quantization function has zero gradient almost everywhere; in practice, approximations such as the straight-through estimator are used [17].
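A sketch of how binary weights can nonetheless be trained is shown below; it uses a straight-through estimator in the spirit of [17], [18], with the gradient clipping window chosen as an illustrative assumption.

```python
# Minimal sketch of weight binarization with a straight-through estimator (STE).
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)  # forward pass uses binary weights (0 only where w == 0)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Pass the gradient through where |w| <= 1, block it elsewhere.
        return grad_output * (w.abs() <= 1).to(grad_output.dtype)

# Usage inside a layer's forward pass (the weight stays full precision between updates):
#   binary_w = BinarizeSTE.apply(self.weight)
#   y = torch.nn.functional.linear(x, binary_w, self.bias)
```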
Recent advances introduce Differentiable Quantization in order to learn the hyper-parameters of Quantization [91], as well as mixed-precision representations [80].

2.5 Summary of Model Compression against generic paradigms

Once the basic Model Compression methods and the generic paradigms are established, they can be summarized as in the table below.

Table 1. Model Compression methods against the generic paradigms.

Paradigm                               Pruning       Quantization   KD
Random or Stochastic                   [76]          [31]           [53], [52]
Data-free, zero-, one-, or few-shot    [76], [15]    [34]           [53], [52], [59]
Dynamic                                [83]          [91]           [49]
Online                                 [48]          [41]           [93]
AutoML (RL, NAS)                       [30], [36]    [80]           [2]
Differentiable                         [43]          [91]           [29]

2.6 Other Compression Methods

In general, the goal of compression is intuitive: reduce size while maintaining accuracy. Thus, people have tried whatever they think would work to achieve that goal. Listed in this section are methods that are common, like Tensor Decomposition, and rare, like information-theoretic approaches.
Tensor decomposition is a popular way to reduce model size. A tensor is a higher-dimensional generalization of a matrix, and tensor decomposition is a scaled-up version of the common matrix decomposition in linear algebra; matrix decomposition methods like Singular Value Decomposition and Principal Component Analysis have their analogous versions. To compress, these methods arrange the weights of a model into a tensor. Tensor decomposition, the core of these methods, is an existing approach in other fields, and common decompositions such as Canonical Polyadic, Tensor Train, and Tucker are applied to compress models. The works in [60], [13], and [46] are iconic examples of these methods. They are almost always applied to Multi-Layer Perceptrons or to the fully-connected layers of CNNs. The large-scale matrix manipulation involved is their main disadvantage, and they tend to be applied to specific architectures.
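As a small illustration of the matrix case, the sketch below replaces a fully-connected layer with a truncated-SVD factorization; the rank is an illustrative hyper-parameter, and the code does not reproduce the Canonical Polyadic, Tensor Train, or Tucker variants of [60], [13], [46].

```python
# Minimal sketch of low-rank compression of a fully-connected layer via truncated SVD.
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    # W (out x in) is approximated as U_r diag(S_r) V_r^T and split into two layers.
    U, S, Vh = torch.linalg.svd(layer.weight.detach(), full_matrices=False)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = (torch.diag(S[:rank]) @ Vh[:rank]).contiguous()
    second.weight.data = U[:, :rank].contiguous()
    if layer.bias is not None:
        second.bias.data = layer.bias.detach().clone()
    return nn.Sequential(first, second)  # parameters: rank*(in + out) vs. in*out
```

The factorization reduces the parameter count from in × out to rank × (in + out), at the cost of an approximation error controlled by the discarded singular values.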
Weight sharing approaches are, in essence, ways of making CNN variants efficient. The works in [75], Inception [90], MobileNet [39], and SqueezeNet [40] are examples of these methods. One of the most influential works [40] achieved AlexNet [47] level accuracy with 50 times fewer parameters on the ImageNet dataset [47]. This family of compression methods consists more of careful design choices that turned out to be efficient.
A unique method presented in [85] used the then state-of-the-art lossless video encoding and compression technique to compress a model. It is an information-theoretic approach that formulates the model compression problem as an information source coding one. It used Context-Adaptive Binary Arithmetic Coding (CABAC), from the H.264/AVC video coding standard, to encode the quantized weights of the neural network, but the encoding takes a long time.

3 Hybrid Model Compression Methods


Fortunately, Model Compression methods are not disjoint: they can be combined. The resulting method is more complex than the individual ones, but in cases where that is not an issue, combining methods is beneficial. In fact, a symbiotic relationship, where the problems of one are addressed by the other, can be achieved. This section highlights prominent works that combine compression methods.

3.1 Pruning and Quantization


Pruning and Quantization are the most closely related pair, especially unstructured Pruning, which simply sets a parameter to zero. The most prominent work, not just in the hybrid paradigm but in the whole area of Model Compression, is the Deep Compression framework [32], where iterative Pruning, Quantization, and Encoding are applied one after the other. Its Pruning part is the same as [33].
The intuitive relationship between Pruning and Quantization is backed by recent advances that study differentiable hyperparameters in Quantization as a means to unify the two [82], [78].

Fig. 5. An example of hybrid Compression is the Deep Compression framework [32].

3.2 Pruning and Knowledge Distillation

There is also a positive relationship between Pruning and Knowledge Distillation, recognized relatively early in [45], where Pruning is applied to distilled student networks on the task of Neural Machine Translation. Recent works are more generic than applied. The work in [64] showed that, in the context of KD, a student can learn better from a pruned teacher, which serves as a better regularizer than an unpruned one. The converse has also been shown in [14], where the authors argue that the performance of a pruned network is better restored when it is fine-tuned with KD than with vanilla training. A good example of a symbiotic relationship between the two is [1], where unprunable parameters are compressed with KD and potential redundancy is addressed by Pruning [1].

3.3 Quantization and Knowledge Distillation

An early major work combining Quantization and Knowledge Distillation is the application of Quantization to the choice of a student model for KD [67], where the knowledge of a Quantized teacher is transferred to a Quantized student. In this setting, KD helps Quantization restore performance. Conversely, Quantization also benefits KD, as in [10], where it has been used to enable on-device KD by introducing noise that mimics the existence of a bigger teacher model. This addresses the problem of a missing teacher on edge devices for vanilla KD and might even call into question the need for the teacher in the first place. Their bond is strengthened by Quantization Aware KD [44], which aims to solve the problem of performance degradation in Quantized distillation [67].
As can be seen in the table, the methods can benefit each other when they are combined, but this comes at a cost: the number of hyper-parameters that have to be decided grows exponentially.
Table 2. Summary of Hybrid Model Compression.

Hybrid Compression                        Remark
Pruning and Quantization                  Applied together almost everywhere [32]; so similar that there are works attempting to unify them [82], [78].
Pruning and Knowledge Distillation        Mutually beneficial results; one solves the challenges of the other [1], [64], [45].
Quantization and Knowledge Distillation   Distillation restores performance damaged by Quantization [67]; Quantization helps mimic a non-existent teacher [10].

4 Model Compression Beyond Size Reduction


The purpose of compressing a model is to reduce its size. But the size reduction
can alter the behaviour of the network that can have both positive and negative
consequences. In the case where it benefits, it will be an extra advantage. In fact,
in some cases, like [48], the compression can even be done for the by-product.
But in the cases where it is disadvantageous there is a trade-off to be addressed
appropriately. This section highlights these issues.

4.1 Reducing Overfitting


Ideally, a model is expected to generalize , and not overfit. Overfitting is when a
model is trained on a training data to the extent that it negatively impacts its
accuracy on data beyond the training set. Dramatic size reduction of any kind
of compression impacts generalization negatively [71],[33]. But a certain level of
compression can help reduce overfiting. In general, compression methods can be
used as noise injection mechanisms which has a regularizing effect [37] for the
network by forcing parameters to learn representations independently. In fact, in
the beginning days of Artificial Intelligence, one objective of NN Pruning was to
reduce overfitting [48]. Recently, dropouts, randomly dropping out weights from
the network with a certain probability, [47] enabled for models to go deeper than
usual and reignited the consideration of compression as a cure for overfitting.

4.2 Explainability
Model Compression is also tied to Explainability or Interpretability, the effort to understand the decisions of models, which is getting more and more attention due to applications in high-stakes areas. The problem of Explainability can be addressed by KD, sometimes at a slight cost in accuracy, by transferring the knowledge of a high-performing neural network into a Decision Tree model, which is inherently explainable [23], [51]. But this is a double-edged phenomenon, because compressing a trained model impacts explanations, as can be seen in [42] and [63], where the authors suggested explanation-preserving methods. There is a lack of detailed work on the relationship between compression and explainability. For example, what model compression does to Mechanistic Interpretability, an effort to understand what happens inside Neural Networks, remains a mystery [61].

4.3 Neural Architecture Search

Neural Architecture Search is an attempt to find an optimal architecture in an educated way, since most novel architectures are purely the result of designer choice rather than a principled search. The relationship of Neural Architecture Search with model compression is also intuitive and well recognised in the literature [16]. The task of optimal compression can be seen as a search in the space of sub-architectures. For example, the task of Pruning can be taken as a search, using a certain algorithm, in the space of NN architectures that are sub-networks of the original network [86], [3].

4.4 Algorithmic Fairness

Algorithmic fairness has become increasingly vital due to the growing impact of AI on society. Particularly, as AI solutions are adopted in fields of high social importance or automate decision-making, the fairness of the algorithm is more critical now than ever. Since Model Compression is becoming a default component of ML deployment, its impact on fairness has to be examined. A recent work aimed at this problem reported that model compression methods, specifically Pruning and Weight Decomposition, exacerbate the bias in a network [74].

4.5 Security

Security issues regarding Neural Networks include Gradient Leakage Attacks,


where an attacker can gain access to private training data from a model’s gra-
dient, Adversarial Samples, where an attacker misleads a trained classifier with
carefully designed inputs [27], and Membership Inference Attacks, where an at-
tacker learns about the training data by making repeated inferences [81].
In general, neither Quantization nor Pruning helps mitigate adversarial attacks without costing accuracy [94]. Knowledge Distillation is extensively used as a defense mechanism against adversarial attacks [62]. Pruning can also be used to prevent Membership Inference Attacks [81].
Basically, every decision made about a network before compression can be questioned again after compression; only a few such aspects are mentioned here, as there is not yet enough work in the area. In some cases, it might also make sense to ask whether a particular behaviour observed in a network might affect compression performance.
Table 3. Summary of Model Compression and its consequences.

Network Behaviour            Relationship
Overfitting                  Slight compression can help [48], [37], but extreme compression of any kind damages generalization.
Explainability               Both Pruning and Knowledge Distillation damage Explainability (attribution), but Knowledge Distillation indirectly addresses Explainability by transferring knowledge to an interpretable model [42], [63].
Neural Architecture Search   Has a two-way positive relationship with Pruning and Knowledge Distillation [16], [3].
Algorithmic Bias             All compression methods exacerbate existing bias [74].
Security                     Quantization and Pruning do not help against adversarial attacks without loss of accuracy [94], but Knowledge Distillation does [62]. Pruning can help against Membership Inference Attacks [81].

5 Findings

At the core level, most Model Compression methods are related to some kind of natural phenomenon outside of AI. Quantization is related to how the human brain stores and encodes information in a discrete way [25]. The original KD paper discussed its similarity to the development of insects. Pruning is related to how the human brain prunes away neural connections as it matures from infancy to adulthood. This raises the question of whether other such ideas can also be exploited, or whether there is an even bigger underlying principle.
Another high-level observation is that the methods can be seen as micro level and macro level. KD and Weight Decomposition are macro-level compression methods: they work at the network level. Quantization and Pruning are relatively micro-level compression methods: they have mechanisms to deal with specific parameters [33]. The methods in the latter group share patterns across their techniques and across the trade-offs introduced by the generic paradigms. For example, in both Pruning and Quantization, the conventional approach is post-training compression; being more surgical by applying them during training is possible but carries an additional cost in complexity; and both have alternatives for dealing with parameters individually or in groups. In particular, unstructured Pruning and uniform Quantization are analogous, but Quantization has more difficulty being surgical. For this reason, structured Pruning and uniform Quantization are the variants heavily applied in practical settings. AutoML solutions such as RL and NAS can be utilized, but the number of hyper-parameters increases exponentially.
In general, Model Compression techniques can be seen in two broad categories. The methods in the first category, such as Pruning, Quantization, and Tensor Decomposition, transform a trained model into a smaller or optimized version of itself. The methods in the second category try to create a new but smaller model from scratch; Knowledge Distillation can be seen as creating a model from scratch, and any other design choice that reduces model size can also be placed in this second category.
Most compression methods are vision native: they are made for computer vision applications, especially CNN-based architectures. That is most likely because vision data is abundant, because inference is needed at edge devices where the camera resides, and because such architectures suit compression methods that remove whole structures at a time. It is only recently, with the development of Large Language Models, that the methods are being adapted for language models.

5.1 Pruning

The most important gap common to all Pruning methods is that it is hard to compare different variants due to the lack of a benchmark metric. The only benchmarking effort is ShrinkBench [9], a framework aiming to provide pruning performance comparisons.
There is a critical difference between unstructured and structured Pruning. Unstructured pruning is implemented by setting individual parameters to zero. This makes it almost surgical and more accurate in identifying what to remove, but it does not actually reduce computation time, since the GPU performs the operation anyway. Structured pruning, on the other hand, removes a complete layer, channel, or filter, and the remaining connections are re-wired afterwards. This improves efficiency both at pruning time and at inference time, but it has its own consequences, as it is not surgical: the removed layer or channel might still contain important weights that are needed for inference.
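To make the contrast concrete, the sketch below removes whole output channels of a convolutional layer by the L1 norm of their filters; the criterion and keep ratio are illustrative, and in a real network the following layer must also be re-wired to accept fewer input channels.

```python
# Minimal sketch of structured (channel-level) pruning of a convolutional layer.
import torch
import torch.nn as nn

def prune_out_channels(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    # Rank output channels by the L1 norm of their filters and keep the largest ones.
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    k = max(1, int(keep_ratio * conv.out_channels))
    keep = torch.topk(norms, k).indices
    pruned = nn.Conv2d(conv.in_channels, k, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned  # the next layer's in_channels must shrink to k as well
```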
Another common limitation, which may be caused by the lack of a common benchmark, is that the methods are architecture dependent. Almost all of them involve at least some amount of fine-tuning or retraining, which requires the original data the network was trained on, and this might not always be available.
Considering random weight dropping or magnitude-based pruning, it is justified to say that Pruning is the fastest compression method.

5.2 Knowledge Distillation

Knowledge Distillation methods construct a brand new, smaller model that tries to mimic the bigger model. This means training a new model from scratch, which can take as much time as training the teacher. The loss has to be a softmax-style loss whose output is a probability distribution. KD does not, by itself, give a particularly large size reduction, and some students might even need a teacher assistant to catch up to the teacher [56].
In Knowledge Distillation (KD), the added term, the distillation loss seen in Figure 4, does serve as a regularizer [72], but the exact part of the Dark Knowledge the student takes advantage of is still being studied [89], [92].

5.3 Quantization

Quantization methods make weights discrete, which can damage the precision of important weights. There are ways to mitigate this (dynamic or asymmetric Quantization, or Quantization Aware Training), but they come at the expense of computation, which is what compression was needed to reduce in the first place.

Table 4. Summary of model compression methods.

Method                   Description                                                                  Remark
Pruning                  Removing parameters by setting them to zero.                                 The oldest, fastest, and the one with the biggest compression rate.
Knowledge Distillation   Train a new, smaller model with the predictions of the bigger model.         Most versatile and model agnostic, but works only with certain types of loss functions.
Quantization             Reduce the number of bits required to represent the parameters.              Works well with pruning; effective for reducing memory size as well as computation.
Other methods            Methods that reduce model size with techniques other than the above three.   Efficient architecture designs, methods inspired by file compression, or matrix decomposition based methods.

6 Factors affecting the choice of methods

There is no one-size-fits-all method; one has to find the right one for the task at hand. These are the factors that can be considered:

– Trade-off: the paradigms discussed above introduce trade-offs that can influence the choice of a compression method.
– Area of application. How tolerant the task at hand is can affect the choice of a compression method in relation to alignment, accuracy, etc.
– Type of data (architecture). Computer vision and natural language related tasks can require architectures that suit their purpose, which in turn affects the choice of compression strategy. For example, structured Pruning can effectively remove channels and filters from CNNs; on the other hand, there are real-world cases where large language models are effectively compressed with Knowledge Distillation. Classical methods can have performance comparable to Neural Networks.
– Expertise. Some compression methods require expertise and a high level of sophistication to be carried out effectively.

7 Open Research Problems

There are numerous open issues in Model Compression that can be formulated
as valid research questions. They are presented in this section according to the
classification used in the paper.

7.1 Knowledge Distillation (KD)

Knowledge Distillation (KD) is one of the major compression methods. In its framework, as can be seen in Figure 4, the added term in the loss is the knowledge assumed to be transferred from the teacher, and it does serve as a regularizer [37]. But the performance, and even the need for the teacher network itself, especially in response-based KD, is being questioned repeatedly [10], [89], [72], [4]. What part of the dark knowledge the student takes advantage of is still an open problem.
There are also understudied variants of KD. An example is Subclass Distillation [58], which entails a powerful concept. In particular, its converse, extracting specialists from a bigger model, seems a fruitful direction. Zero-shot knowledge distillation methods, as discussed before, use random inputs to the trained model to generate pseudo data with which to train the small network. Studying how the result turns out for curated inputs is also a possible direction.

7.2 Hybrid Methods

As described in section three, combining compression methods can be beneficial if the added complexity is not an issue. In general, there can be combinations of two kinds: combining variants of the same method to take advantage of each, and combinations between different compression methods, as in [32]. Thus, a study on combining variants of the same compression method can be a good direction; an example is trying to synchronize unstructured and structured pruning. Such methods may become feasible with Reinforcement Learning (RL) based approaches as the algorithms get faster and more advanced. There is also a lack of exploration in combining the other compression methods. Additionally, a comprehensive survey of this specific area can be helpful for researchers.

7.3 Model Compression Beyond Size Reduction

Despite efforts to train smaller networks from scratch [22], most Model Compression methods are applied after a big model is trained. Thus, the compression is bound to have an effect on other aspects of the model. For example, as pointed out in [16], there is still work to be done to bridge the concept of a model's size (compression) and its Explainability. Some applications of Explainability are mentioned in the pruning section of this paper, but there are only a few works, save for [13], that even acknowledge the existence of a bridge between them. Thus, formalized future work in this direction can be fruitful.
Basically, one can pick many components of the network and ask how compression impacts them. Presented in section four are only a limited number of them, because there is still much work to be done in the area. Again, a comprehensive survey on the effects of Model Compression beyond size reduction can be extremely helpful for building Model Compression without ramifications, and beyond.

7.4 Miscellaneous
Some model compression methods are developed and experimented with only for specific types of architectures, which limits their applicability. In general, RL-based approaches may be suitable in the future for adapting compression techniques to a given model.
There is little innovation in methods other than pruning, quantization, and knowledge distillation, yet design choices can make a considerable difference [75], [90], [39], [40]. Since the aim of model compression is simple and intuitive, any model size reduction method can be called a model compression technique as long as it performs well.
The idea of ensembles has long been in use in the classical ML world. It is about using multiple machine learning models for a single task simultaneously, and it increases accuracy by taking advantage of the performance of the individual models. The concept has barely been mentioned in the context of large neural networks, as the complexity is overwhelming even to think about. But with parallel advances in model compression and hardware, it seems it will come within the reach of industry leaders, or whoever has the access and motive to try. Since that has most likely never been explored, it is worth the shot. Hardware issues, though they are not discussed in this survey, are highly related to compression and optimization.

8 Conclusion
A brief review of Model Compression techniques in Deep Learning, with their challenges and opportunities, has been presented. It started by defining Model Compression and the challenges involved. Then thirty years of work on Pruning and a number of works on Knowledge Distillation and Quantization were synthesized. How each method is combined with the others was also discussed, and a dedicated section magnified the relationship between model compression and the network's behaviour. Findings and patterns from each section were presented separately, and actionable future research directions have been put forward to assist interested researchers.

References
1. Aghli, N., Ribeiro, E.: Combining weight pruning and knowledge distillation for
cnn compression. 2021 IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW) pp. 3185–3192 (2021)

2. Ashok, A., Rhinehart, N., Beainy, F.N., Kitani, K.M.: N2n learning: Net-
work to network compression via policy gradient reinforcement learning. ArXiv
abs/1709.06030 (2017)
3. Ashok, A., Rhinehart, N., Beainy, F.N., Kitani, K.M.: N2n learning: Net-
work to network compression via policy gradient reinforcement learning. ArXiv
abs/1709.06030 (2018)
4. Bang, D., Lee, J., Shim, H.: Distilling from professors: Enhancing
the knowledge distillation of teachers. Information Sciences 576, 743–
755 (2021). https://doi.org/10.1016/j.ins.2021.08.020,
https://www.sciencedirect.com/science/article/pii/S0020025521008203
5. Banner, R., Nahshan, Y., Hoffer, E., Soudry, D.: Aciq: Analytical clipping for
integer quantization of neural networks. ArXiv abs/1810.05723 (2018)
6. Bengio, Y., et al.: Learning deep architectures for ai. Foundations and trends® in
Machine Learning 2(1), 1–127 (2009)
7. Bernstein, L., Sludds, A., Hamerly, R., Sze, V., Emer, J.S., Englund, D.: Freely
scalable and reconfigurable optical hardware for deep learning. Scientific Reports
11 (2020)
8. Bianco, S., Cadene, R., Celona, L., Napoletano, P.: Benchmark analysis of repre-
sentative deep neural network architectures. IEEE Access 6, 64270–64277 (2018).
https://doi.org/10.1109/ACCESS.2018.2877890
9. Blalock, D.W., Ortiz, J.J.G., Frankle, J., Guttag, J.V.: What is the state of neural
network pruning? ArXiv abs/2003.03033 (2020)
10. Boo, Y., Shin, S., Choi, J., Sung, W.: Stochastic precision ensemble: self-knowledge
distillation for quantized deep neural networks. In: Proceedings of the AAAI Con-
ference on Artificial Intelligence. vol. 35, pp. 6794–6802 (2021)
11. Bucila, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: KDD ’06
(2006)
12. Cai, Y., Yao, Z., Dong, Z., Gholami, A., Mahoney, M.W., Keutzer, K.: Zeroq: A
novel zero shot quantization framework. 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) pp. 13166–13175 (2020)
13. Calvi, G.G., Moniri, A., Mahfouz, M., Zhao, Q., Mandic, D.P.: Compression and
interpretability of deep neural networks via tucker tensor layer: From first principles
to tensor valued back-propagation. arXiv: Learning (2019)
14. Chen, L., Chen, Y., Xi, J., Le, X.: Knowledge from the original network: restore a
better pruned network with knowledge distillation. Complex & Intelligent Systems
(2021)
15. Chen, T., Ji, B., Ding, T., Fang, B., Wang, G., Zhu, Z., Liang, L., Shi, Y., Yi, S., Tu,
X.: Only train once: A one-shot neural network training and pruning framework.
In: Neural Information Processing Systems (2021)
16. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and
acceleration for deep neural networks. ArXiv abs/1710.09282 (2017)
17. Courbariaux, M., Bengio, Y.: Binarynet: Training deep neural networks with
weights and activations constrained to +1 or -1. ArXiv abs/1602.02830 (2016)
18. Courbariaux, M., Bengio, Y., David, J.P.: Binaryconnect: Training deep neural
networks with binary weights during propagations. In: NIPS (2015)
19. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
rectional transformers for language understanding. ArXiv abs/1810.04805 (2019)
20. Ding, X., Wang, Y., Xu, Z., Wang, Z.J., Welch, W.J.: Distilling and
transferring knowledge via cgan-generated samples for image clas-
sification and regression. Expert Systems with Applications 213,
119060 (2023). https://doi.org/10.1016/j.eswa.2022.119060,
https://www.sciencedirect.com/science/article/pii/S0957417422020784
21. Dong, X., Yang, Y.: Network pruning via transformable architecture search. In:
NeurIPS (2019)
22. Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable
neural networks. arXiv: Learning (2019)
23. Frosst, N., Hinton, G.E.: Distilling a neural network into a soft decision tree. ArXiv
abs/1711.09784 (2017)
24. Furlanello, T., Lipton, Z.C., Tschannen, M., Itti, L., Anandkumar, A.: Born again
neural networks. In: ICML (2018)
25. Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M.W., Keutzer, K.: A
survey of quantization methods for efficient neural network inference. ArXiv
abs/2103.13630 (2022)
26. Gong, Y., Liu, L., Yang, M., Bourdev, L.D.: Compressing deep convolutional net-
works using vector quantization. ArXiv abs/1412.6115 (2014)
27. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial
examples. CoRR abs/1412.6572 (2014)
28. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. ArXiv
abs/2006.05525 (2021)
29. Guan, Y., Zhao, P., Wang, B., Zhang, Y., Yao, C., Bian, K., Tang, J.: Differentiable
feature aggregation search for knowledge distillation. ArXiv abs/2008.00506
(2020)
30. Gupta, M., Aravindan, S., Kalisz, A., Chandrasekhar, V.R., Jie, L.: Learning to
prune deep neural networks via reinforcement learning. ArXiv abs/2007.04756
(2020)
31. Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with
limited numerical precision. In: International Conference on Machine Learning
(2015)
32. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural network
with pruning, trained quantization and huffman coding. arXiv: Computer Vision
and Pattern Recognition (2016)
33. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for
efficient neural network. Advances in neural information processing systems 28
(2015)
34. Haroush, M., Hubara, I., Hoffer, E., Soudry, D.: The knowledge within: Methods
for data-free model compression. 2020 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) pp. 8491–8499 (2019)
35. Hassibi, B., Stork, D., Wolff, G.: Optimal brain surgeon and general network prun-
ing. In: IEEE International Conference on Neural Networks. pp. 293–299 vol.1
(1993). https://doi.org/10.1109/ICNN.1993.298572
36. He, Y., Lin, J., Liu, Z., Wang, H., Li, L.J., Han, S.: Amc: Automl for model
compression and acceleration on mobile devices. In: ECCV (2018)
37. Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network.
ArXiv abs/1503.02531 (2015)
38. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are uni-
versal approximators. Neural networks 2(5), 359–366 (1989)
39. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., An-
dreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for
mobile vision applications. ArXiv abs/1704.04861 (2017)

40. Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.:
Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1MB model
size. ArXiv abs/1602.07360 (2016)
41. Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A.G., Adam, H.,
Kalenichenko, D.: Quantization and training of neural networks for efficient integer-
arithmetic-only inference. 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition pp. 2704–2713 (2018)
42. Joseph, V., Siddiqui, S.A., Bhaskara, A., Gopalakrishnan, G., Muralidharan, S.,
Garland, M., Ahmed, S., Dengel, A.R.: Going beyond classification accuracy met-
rics in model compression (2020)
43. Kim, J., youn Park, C., Jung, H.J., Choe, Y.: Differentiable pruning method for
neural networks. ArXiv abs/1904.10921 (2019)
44. Kim, J., Bhalgat, Y., Lee, J., Patel, C., Kwak, N.: Qkd: Quantization-aware knowl-
edge distillation. arXiv preprint arXiv:1911.12491 (2019)
45. Kim, Y., Rush, A.M.: Sequence-level knowledge distillation. In: Conference on
Empirical Methods in Natural Language Processing (2016)
46. Kossaifi, J., Lipton, Z.C., Khanna, A., Furlanello, T., Anandkumar, A.: Tensor
regression networks. ArXiv abs/1707.08308 (2020)
47. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. Communications of the ACM 60, 84 – 90 (2012)
48. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: NIPS (1989)
49. Li, L., Lin, Y., Ren, S., Li, P., Zhou, J., Sun, X.: Dynamic knowledge distillation
for pre-trained language models. ArXiv abs/2109.11295 (2021)
50. Li, Z., Wallace, E., Shen, S., Lin, K., Keutzer, K., Klein, D., Gonzalez, J.: Train
large, then compress: Rethinking model size for efficient training and inference of
transformers. ArXiv abs/2002.11794 (2020)
51. Liu, X., Wang, X., Matwin, S.: Improving the interpretability of deep neural net-
works with knowledge distillation. 2018 IEEE International Conference on Data
Mining Workshops (ICDMW) pp. 905–912 (2018)
52. Liu, Y., Zhang, W., Wang, J., Wang, J.: Data-free knowledge transfer: A survey.
ArXiv abs/2112.15278 (2021)
53. Lopes, R.G., Fenu, S., Starner, T.: Data-free knowledge distillation for deep neural
networks. ArXiv abs/1710.07535 (2017)
54. Lopez-Paz, D., Bottou, L., Schölkopf, B., Vapnik, V.N.: Unifying distillation and
privileged information. CoRR abs/1511.03643 (2015)
55. Malach, E., Yehudai, G., Shalev-Shwartz, S., Shamir, O.: Proving the lottery ticket
hypothesis: Pruning is all you need. In: ICML (2020)
56. Mirzadeh, S.I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., Ghasemzadeh,
H.: Improved knowledge distillation via teacher assistant. In: AAAI (2020)
57. Mozer, M.C., Smolensky, P.: Skeletonization: A technique for trimming the fat
from a network via relevance assessment. In: NIPS (1988)
58. Müller, R., Kornblith, S., Hinton, G.E.: Subclass distillation. ArXiv
abs/2002.03936 (2020)
59. Nayak, G.K., Mopuri, K.R., Shaj, V., Babu, R.V., Chakraborty, A.: Zero-shot
knowledge distillation in deep networks. ArXiv abs/1905.08114 (2019)
60. Novikov, A., Podoprikhin, D., Osokin, A., Vetrov, D.P.: Tensorizing neural net-
works. In: NIPS (2015)
61. Olah, C.: Mechanistic Interpretability, Variables, and the Importance of
Interpretable Bases. https://transformer-circuits.pub/2022/mech-interp-essay/index.html (Jun 22, 2022), [Online; accessed 2022-12-14]

62. Papernot, N., Mcdaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a defense
to adversarial perturbations against deep neural networks. 2016 IEEE Symposium
on Security and Privacy (SP) pp. 582–597 (2015)
63. Park, G., Yang, J.Y., Hwang, S.J., Yang, E.: Attribution preservation in network
compression for reliable network interpretation. Advances in Neural Information
Processing Systems 33, 5093–5104 (2020)
64. Park, J., No, A.: Prune your model before distill it. In: European Conference on
Computer Vision (2021)
65. Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. 2019
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
pp. 3962–3971 (2019)
66. Patterson, D., Gonzalez, J., Le, Q.V., Liang, C., Munguı́a, L.M., Rothchild, D., So,
D.R., Texier, M., Dean, J.: Carbon emissions and large neural network training.
ArXiv abs/2104.10350 (2021)
67. Polino, A., Pascanu, R., Alistarh, D.: Model compression via distillation and quan-
tization. arXiv preprint arXiv:1802.05668 (2018)
68. Ramanujan, V., Wortsman, M., Kembhavi, A., Farhadi, A., Rastegari, M.: What’s
hidden in a randomly weighted neural network? 2020 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) pp. 11890–11899 (2020)
69. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classifi-
cation using binary convolutional neural networks. In: ECCV (2016)
70. Reed, R.: Pruning algorithms-a survey. IEEE Transactions on Neural Networks
4(5), 740–747 (1993). https://doi.org/10.1109/72.248452
71. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets:
Hints for thin deep nets. CoRR abs/1412.6550 (2015)
72. Sau, B.B., Balasubramanian, V.N.: Deep model compression: Distilling knowledge
from noisy teachers. arXiv preprint arXiv:1610.09650 (2016)
73. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. CoRR abs/1409.1556 (2015)
74. Stoychev, S., Gunes, H.: The effect of model compression on fairness in facial
expression recognition. ArXiv abs/2201.01709 (2022)
75. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan,
D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) pp. 1–9 (2015)
76. Tang, J., Liu, M., Jiang, N., Cai, H., Yu, W., Zhou, J.: Data-
free network pruning for model compression. In: 2021 IEEE Interna-
tional Symposium on Circuits and Systems (ISCAS). pp. 1–5 (2021).
https://doi.org/10.1109/ISCAS51556.2021.9401109
77. Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. ArXiv
abs/1910.10699 (2020)
78. Van Baalen, M., Louizos, C., Nagel, M., Amjad, R.A., Wang, Y., Blankevoort, T.,
Welling, M.: Bayesian bits: Unifying quantization and pruning. Advances in neural
information processing systems 33, 5741–5752 (2020)
79. Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L., Polosukhin, I.: Attention is all you need. ArXiv abs/1706.03762 (2017)
80. Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S.: Haq: Hardware-aware automated quan-
tization with mixed precision. 2019 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) pp. 8604–8612 (2018)
81. Wang, Y., Wang, C., Wang, Z., Zhou, S., Liu, H., Bi, J., Ding, C., Rajasekaran,
S.: Against membership inference attack: Pruning is all you need. In: International
Joint Conference on Artificial Intelligence (2020)

82. Wang, Y., Lu, Y., Blankevoort, T.: Differentiable joint pruning and quantization
for hardware efficiency. ArXiv abs/2007.10463 (2020)
83. Wang, Y., Zhang, X., Hu, X., Zhang, B., Su, H.: Dynamic network pruning with
interpretable layerwise channel selection. In: AAAI (2020)
84. Wang, Y., Zhang, X., Xie, L., Zhou, J., Su, H., Zhang, B., Hu, X.: Pruning from
scratch. In: AAAI (2020)
85. Wiedemann, S., Kirchhoffer, H., Matlage, S., Haase, P., Marbán, A., Marinc, T.,
Neumann, D., Nguyen, T., Schwarz, H., Wiegand, T., Marpe, D., Samek, W.:
Deepcabac: A universal compression algorithm for deep neural networks. IEEE
Journal of Selected Topics in Signal Processing 14, 700–714 (2020)
86. Yang, Z., Wang, Y., Chen, X., Shi, B., Xu, C., Xu, C., Tian, Q., Xu, C.: Cars: Con-
tinuous evolution for efficient neural architecture search. 2020 IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (CVPR) pp. 1826–1835 (2020)
87. Yeom, S.K., Seegerer, P., Lapuschkin, S., Wiedemann, S., Müller, K.R., Samek,
W.: Pruning by explaining: A novel criterion for deep neural network pruning.
ArXiv abs/1912.08881 (2021)
88. Yim, J., Joo, D., Bae, J.H., Kim, J.: A gift from knowledge distillation: Fast op-
timization, network minimization and transfer learning. 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) pp. 7130–7138 (2017)
89. Yuan, L., Tay, F.E., Li, G., Wang, T., Feng, J.: Revisiting knowledge distillation
via label smoothing regularization. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. pp. 3903–3911 (2020)
90. Zhai, S., Cheng, Y., Zhang, Z., Lu, W.: Doubly convolutional neural networks. In:
NIPS (2016)
91. Zhang, Z., Shao, W., Gu, J., Wang, X., Ping, L.: Differentiable dynamic quan-
tization with mixed precision and adaptive resolution. ArXiv abs/2106.02295
(2021)
92. Zhao, B., Cui, Q., Song, R., Qiu, Y., Liang, J.: Decoupled knowledge distilla-
tion. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR) pp. 11943–11952 (2022)
93. Zhao, H., Sun, X., Dong, J., Chen, C., Dong, Z.: Highlight every step: Knowledge
distillation via collaborative teaching. IEEE Transactions on Cybernetics 52, 2070–
2081 (2019)
94. Zhao, Y., Shumailov, I., Mullins, R.D., Anderson, R.: To compress or not to com-
press: Understanding the interactions between adversarial attacks and neural net-
work compression. ArXiv abs/1810.00208 (2018)
