Model Compression Techniques in Deep Learning
Abstract. In the current set-up, the success of Deep Neural Network models is highly tied to their size. Although this property may improve their performance, it makes them difficult to train, to deploy on resource-constrained machines, and to iterate on in experiments. There is also a growing concern about their environmental and economic impacts. Model Compression is a set of techniques applied to reduce the size of models while maintaining performance. This survey presents state-of-the-art model compression methods, discusses how they are hybridized, highlights the changes they could cause beyond size reduction, and puts forward open research problems for further study.
1 Introduction
Fig. 1. Size of benchmark vision and language models over the past years [7].
Model Compression is a set of techniques [11] for reducing the size of large models without a notable performance loss. Its necessity is increasing in parallel with the advancement and complexity of models. Even though there are recent attempts to train smaller networks from the beginning [22], Model Compression is mostly applied after training a bigger model, as a model needs far fewer parameters for inference than for learning. Model Compression methods in the literature can be classified into four major parts: Pruning, Knowledge Distillation (KD), Quantization, and other methods. Pruning removes unimportant structures from a trained network. KD is a mechanism for transferring the knowledge of a bigger network onto a smaller network. Quantization reduces the number of bits required to represent parts of a model; it is similar to approximation. The remaining Model Compression methods other than Pruning, KD, and Quantization are organized in one section and referred to as Other methods. Weight decomposition methods deserve their own section, but this arrangement makes the paper simpler for the reader.
The contributions of the paper are as follows.
The rest of the paper is organized as follows. Section two reviews prominent works in Model Compression. Section three covers Hybrid Model Compression techniques, i.e., methods that combine existing ones for a better outcome. The effect of Model Compression beyond size reduction is discussed in section four. Discussions, factors affecting the choice of methods, and open research problems are presented in sections five, six, and seven respectively. Finally, a conclusion is presented.
Fig. 2. The taxonomy and categories of Deep Learning model compression methods.
2.2 Pruning
The oldest, most studied, and most intuitive method of reducing model size is pruning. It literally removes a weight or a node from a network based on how important it is. The vanilla pruning pipeline has three main stages: training an initialized network, pruning it, and retraining (fine-tuning). Hyperparameters introduced by Pruning include the compression rate, the magnitude threshold when pruning by weight magnitude, the type of saliency metric to use, etc. The approach towards these hyperparameters categorizes the pruning methods in the literature.
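For illustration, a minimal sketch of this vanilla pipeline (assuming PyTorch and its torch.nn.utils.prune utility; the toy model, data, and 50% compression rate are placeholders, not the setting of any particular paper):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model and data; stand-ins for a real network and dataset.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train(steps):                       # stages 1 and 3: (re)training
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

train(steps=100)                        # train the dense network

# Stage 2: magnitude (L1) unstructured pruning; 'amount' plays the role of
# the compression-rate hyperparameter discussed above.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

train(steps=50)                         # fine-tune with the pruning masks in place

# Fold the masks into the weight tensors, making the pruning permanent.
for module in model:
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```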
The earliest methods were mainly: saliency based, where a node or weight is checked for importance; penalty based [48], [57], where a penalty term is added to the loss (objective) function to induce weight decay [70]; and magnitude based, which simply removes the smallest link (weight) in each iteration.
Prominent early works are Optimal Brain Damage (OBD) [48], Optimal Brain
Surgeon [35] and Skeletonization [57].
An intuitive direction of development for Pruning is determining what is important. There have been attempts to identify important structures with statistical methods, and workarounds that follow recent advances in Explainability, a notion that tries to understand and explain the 'decision' of a model [87].
Modern approaches vary not just in the saliency metrics or other criteria they use, but also in the overall pruning pipeline. The 'lottery ticket hypothesis' [22], whose stronger version was supported in [55], states that a randomly initialized network contains a small subnetwork (the "winning ticket") that, when trained separately, can compete with the performance of the original model. A rather bold finding by [68] states that there is a randomly initialized subnetwork that can perform as well as the original even without training. The recent work [84], termed pruning from scratch, questions the need for training the network and proposes a different pipeline: random initialization, pruning, and then training. The work in [50] raises the question of why one should train a big model to convergence when it is possible to attain better performance by training it for fewer iterations and then pruning it.
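A minimal sketch of a lottery-ticket-style experiment along the lines described above (the toy model, data, retained fraction, and step counts are illustrative assumptions): train a dense network, build a magnitude mask, rewind the surviving weights to their initialization, and retrain only the sparse subnetwork.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(20, 2)                          # toy stand-in for a real network
init_state = copy.deepcopy(model.state_dict())    # keep the original initialization
x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
loss_fn = nn.CrossEntropyLoss()

def train(net, steps):
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(net(x), y).backward()
        opt.step()

train(model, 100)                                 # 1. train the dense network

# 2. magnitude mask: keep only the largest 20% of the trained weights
w = model.weight.detach().abs()
threshold = w.flatten().kthvalue(int(0.8 * w.numel())).values
mask = (w > threshold).float()

# 3. rewind the surviving weights to their initial values (the "winning ticket")
model.load_state_dict(init_state)
with torch.no_grad():
    model.weight.mul_(mask)

# 4. retrain the sparse subnetwork, re-applying the mask after each step
for _ in range(100):
    train(model, 1)
    with torch.no_grad():
        model.weight.mul_(mask)
```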
Rather than designing the hyperparameters introduced by Pruning manually, Reinforcement Learning (RL) based methods are emerging that formulate the problem of removing a node or a weight as a Markov Decision Process [30]. Neural Architecture Search (NAS) based methods formulate Pruning as a search in the space of sub-networks, an emerging area of study [21].
2.3 Knowledge Distillation
Given an input x from the data the original network was trained on, the smaller network will use the predictions of the bigger network as a label for x instead of just y. The predictions of the teacher model are termed soft labels, since they are not ones and zeros like hard labels are. These soft labels are also referred to as Dark Knowledge. The intuition is that an image of a dog has a larger chance of being mistaken for a cat than for a car.
p(z_i, T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}    (1)
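For illustration, a sketch of a distillation loss built around Eq. (1), assuming PyTorch; the temperature T, weighting alpha, and toy logits are placeholders rather than values from any cited work:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft labels: Eq. (1), the teacher's logits softened by temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the soft distributions, scaled by T^2 to keep
    # gradient magnitudes comparable across temperatures.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    # Ordinary cross-entropy against the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Usage with dummy logits for a 10-class problem.
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```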
The work in [56] introduces a teacher assistant, a network whose size is smaller than the teacher and larger than the student, to help out the small student network.
A powerful adaptation of KD, FitNets [71], trains a student that is deeper and thinner than the teacher, unlike [37], using the intermediate representations (features) learned by the teacher as hints to improve the training process and the final generalization capacity of the student. This work opened a whole new sub-category of KD termed Feature Distillation, causing the original formulation to be called response-based distillation [28]. Response-based distillation is also limited to supervised learning, as the logits, the soft labels, are probability distributions. Relation-based distillation extends feature distillation by using outputs of specific layers in the teacher model and exploiting the relationships between different layers or data samples [65].
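A minimal sketch of a FitNets-style hint loss (assuming PyTorch; the feature-map shapes and the 1x1 convolutional regressor are illustrative): the student's intermediate features are pushed, through a learned regressor, towards the teacher's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative intermediate ("hint") feature maps from teacher and student.
teacher_feat = torch.randn(8, 256, 14, 14)   # teacher hint-layer output
student_feat = torch.randn(8, 64, 14, 14)    # student guided-layer output

# Regressor matches the student's channel count to the teacher's.
regressor = nn.Conv2d(64, 256, kernel_size=1)

hint_loss = F.mse_loss(regressor(student_feat), teacher_feat)
# During training, hint_loss is added to the response-based distillation loss.
```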
At its core, KD is a knowledge transfer mechanism [54]. It is therefore extremely versatile and model agnostic: any model can be used as a teacher or a student. Hence, it has been used for other purposes including faster optimization [88], defence against adversarial attacks [62], explainable networks [42], [63], and even to improve the original network itself [24].
2.4 Quantization
Quantization, in simple terms discretization, is a mapping from the space of continuous values to discrete values; a good example is rounding a number. The concept is old and has been used in various settings in information theory and other sciences as well, but in the neural network setting it is about rounding the parameters of the network to reduce the number of bits required for their representation [26], for example from 32-bit floating point to limited precision, commonly 8 bits or lower. Its foundation is rooted in how the human brain stores and encodes information in a discrete way [25].
In the literature, the variants of quantization differ according to which parameters to quantize, how to quantize, whether to do it during or after training, and other factors which correspond to the variables S and Z in the quantization equation. S itself is determined by the boundary values of the clipping range. Depending on whether the clipping range is calculated dynamically for each input or statically beforehand, quantization can be dynamic or static. The work in [41] made sure only 8-bit integers are used in the network at inference time.
Normally, what is known as the clipping range, a range the individual weights are guaranteed to lie in, is determined beforehand. The common formula is given by:
Q(r) = \mathrm{Int}\left(\frac{r}{S}\right) - Z    (2)
where Q is the quantization operator, r is a real-valued input (an activation or a weight), S is a real-valued scaling factor, and Z is an integer offset. These variables are part of the hyperparameters introduced by Quantization. Int is the function that maps its input to an integer, for example by rounding. An approximation of the original number can be recovered with the reverse operation, dequantization.
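A sketch of Eq. (2) as asymmetric uniform quantization to 8 bits (plain numpy; choosing the clipping range from the tensor's own minimum and maximum is one common static choice, not the only one):

```python
import numpy as np

def quantize(r, num_bits=8):
    alpha, beta = float(r.min()), float(r.max())     # static clipping range [alpha, beta]
    qmin, qmax = 0, 2 ** num_bits - 1
    S = (beta - alpha) / (qmax - qmin)               # real-valued scaling factor
    Z = int(round(alpha / S)) - qmin                 # integer offset, as in Eq. (2)
    q = np.clip(np.round(r / S) - Z, qmin, qmax).astype(np.uint8)
    return q, S, Z

def dequantize(q, S, Z):
    # Reverse operation: approximately invert Eq. (2).
    return S * (q.astype(np.float32) + Z)

w = np.random.randn(4, 4).astype(np.float32)         # stand-in for a weight tensor
q, S, Z = quantize(w)
w_hat = dequantize(q, S, Z)
print(np.abs(w - w_hat).max())                       # quantization error
```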
Once the basic Model Compression methods and their generic paradigms are established, they can be summarized effectively in the table below.
In general, the goal of compression is intuitive: reduce size while keeping accuracy. Thus, people have tried whatever they think would work to achieve
that goal. Listed in this section are methods that are both common, like Tensor Decomposition, and rare, like information-theoretic approaches.
Tensor decomposition is a popular method to reduce model size. A tensor is a higher-dimensional generalization of a matrix, and tensor decomposition is a scaled-up version of the common matrix decomposition in linear algebra. Matrix decomposition methods like Singular Value Decomposition and Principal Component Analysis have their analogous tensor versions. To perform compression, these methods arrange the weights of a model into a tensor. Tensor decomposition, the core of these methods, is an existing approach in other fields, and common decompositions such as Canonical Polyadic, Tensor Train, and Tucker decomposition are applied to compress models. The works in [60], [13], and [46] are iconic examples of these methods. They are almost always applied to Multi-Layer Perceptrons or to the fully-connected layers of CNNs. The large-scale matrix manipulation involved is their main disadvantage, and they are applied only to specific architectures.
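A sketch of the matrix special case (numpy truncated SVD on a fully-connected weight; the layer size and target rank are illustrative): replacing one m x n weight matrix with two rank-k factors reduces the parameter count from m*n to k*(m + n).

```python
import numpy as np

m, n, k = 1024, 512, 32                          # layer dimensions and target rank
W = np.random.randn(m, n).astype(np.float32)     # stand-in for a trained weight matrix

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]                             # m x k factor (singular values folded in)
B = Vt[:k, :]                                    # k x n factor

# The original layer y = W x is replaced by two smaller layers y = A (B x).
x = np.random.randn(n).astype(np.float32)
print(np.linalg.norm(W @ x - A @ (B @ x)))       # approximation error
print(W.size, A.size + B.size)                   # 524288 vs 49152 parameters
```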
Weight sharing approaches are, in a sense, ways of making CNN variants efficient. Inception [75], doubly convolutional networks [90], MobileNet [39], and SqueezeNet [40] are examples of these methods. One of the most influential works, [40], achieved AlexNet-level [47] accuracy with 50 times fewer parameters on the ImageNet dataset [47]. This family of compression methods is more a matter of careful design choices that turned out to be efficient.
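As an illustration of such design choices, the following sketch (assuming PyTorch; the channel sizes are arbitrary) compares the parameter count of a standard convolution with a MobileNet-style depthwise-separable convolution:

```python
import torch.nn as nn

cin, cout, k = 128, 256, 3

standard = nn.Conv2d(cin, cout, kernel_size=k, padding=1, bias=False)

# Depthwise-separable convolution: per-channel spatial filtering followed by
# a 1x1 pointwise convolution that mixes channels.
separable = nn.Sequential(
    nn.Conv2d(cin, cin, kernel_size=k, padding=1, groups=cin, bias=False),
    nn.Conv2d(cin, cout, kernel_size=1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))   # 294912 vs 33920 parameters
```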
A unique method presented in [85] used the then state-of-the-art lossless video encoding and compression technique to compress a model. It is an information-theoretic approach that formulates the model compression problem as an information source coding one. It used Context-Adaptive Binary Arithmetic Coding (CABAC), an entropy coder from the H.264/AVC video coding standard, to encode the quantized weights of the neural network, but the encoding takes a long time.
4.2 Explainability
Model Compression is also tied to Explainability or Interpretability, an effort to understand the decisions of models, which is getting more and more attention due to their applications in high-stakes areas. The problem of Explainability can be addressed by KD, sometimes at the cost of a slight accuracy drop, by transferring the knowledge of a high-performing neural network into a Decision Tree model, which is inherently explainable [23], [51]. But this is a double-edged phenomenon, because compressing a trained model impacts explanations, as can be seen in [42] and [63], where the authors suggested explanation-preserving methods. There is a lack of detailed work on the interaction between compression and explainability.
Algorithmic fairness has become increasingly vital due to the increasing impact of AI in society. In particular, as AI solutions are adopted in fields of high social importance or automate decision-making, the question of how fair an algorithm is matters now more than ever. Since Model Compression is becoming a default component of ML deployment, its impact on fairness has to be examined. A recent work aimed at this problem reported that model compression methods, Pruning and Weight Decomposition, exacerbate the bias in a network [74].
4.5 Security
5 Findings
At their core, most Model Compression methods are somehow related to some kind of natural phenomenon outside of AI. Quantization is related to how the human brain stores and encodes information in a discrete way [25]. The original KD paper discussed its similarity to the development of insects. Pruning is related to synaptic pruning in the brain, where the number of neural connections decreases between infancy and adulthood. This begs the question of whether other such ideas can also be considered, or even whether there is a bigger mystery.
Another high-level observation is that the methods can be seen as micro-level and macro-level. KD and Weight Decomposition are macro-level compression methods: they work at the network level. Quantization and Pruning are relatively micro-level compression methods: they have mechanisms to deal with specific parameters [33]. Those in the latter group share patterns across their techniques and the trade-offs introduced by their generic paradigms. For example, in both Pruning and Quantization the conventional method is post-training compression; getting more surgical by applying them during training is possible, but it comes at an additional cost of complexity; and both have alternatives to deal with parameters individually or in groups. In particular, unstructured pruning has its analogue in quantization, but Quantization has more difficulty being surgical. For this reason, structured pruning and uniform quantization are heavily applied in practical settings. AutoML solutions such as RL and NAS can be utilized, but the number of hyper-parameters increases exponentially.
In general, Model Compression techniques can be seen in two broad cate-
gories. The methods in the first category, such as Pruning, Quantization and
Tensor Decomposition, transform a trained model into a smaller or optimized
version of itself. The methods in the second category try to create a new but
smaller model from scratch. Knowledge distillation can be seen as creating a
model from scratch. Any other design choice that can reduce model size falls into this second category.
Most compression methods are vision-native: they are made for computer vision applications, especially CNN-based architectures. That is most likely because vision data is abundant, because inference is often made at edge devices where the camera resides, or to take advantage of compression methods that remove whole structures at a time. It is only recently, with the development of Large Language Models, that the methods are being adapted for language models.
5.1 Pruning
The most important gap common to all the Pruning methods is that it is hard to compare different variants due to the lack of a benchmark metric. The only benchmarking effort is ShrinkBench [9], a framework aiming to provide pruning performance comparisons.
There is a critical difference between unstructured and structured Pruning. Unstructured pruning is implemented by setting individual parameters to zero. This makes unstructured pruning almost surgical and more accurate in identifying what to remove. But this implementation does not actually reduce computation time, since the GPU will perform the operation anyway. On the other hand, structured pruning removes a complete layer, channel, or filter, and the remaining connections are rewired afterwards. This improves efficiency in both pruning time and inference time. But it too has its own consequences, as it is not surgical: the layer or channel to be removed might still contain important weights that are needed for inference.
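The contrast can be seen in a small sketch (numpy; the tensor sizes and pruning fractions are arbitrary): unstructured pruning only zeroes individual entries and leaves the tensor shape untouched, whereas structured pruning removes entire filters and actually shrinks the tensor.

```python
import numpy as np

# Stand-in for a conv layer's weights: (out_channels, in_channels, kH, kW).
W = np.random.randn(16, 8, 3, 3).astype(np.float32)

# Unstructured: zero out the 50% smallest-magnitude individual weights.
threshold = np.quantile(np.abs(W), 0.5)
W_unstructured = np.where(np.abs(W) < threshold, 0.0, W)
print(W_unstructured.shape)        # (16, 8, 3, 3): same shape, just sparse

# Structured: drop the 4 output filters with the smallest L1 norm.
filter_norms = np.abs(W).reshape(16, -1).sum(axis=1)
keep = np.argsort(filter_norms)[4:]
W_structured = W[np.sort(keep)]
print(W_structured.shape)          # (12, 8, 3, 3): the tensor actually shrinks
```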
Another common limitation, which may be caused by the lack of a common benchmark, is that the methods are architecture dependent. Almost all of them involve at least some amount of fine-tuning or retraining, which requires having the original data the network was trained on, and this data might not always be available.
Considering variants as simple as randomly dropping weights or magnitude-based pruning, it is justified to say that Pruning is the fastest compression method.
5.3 Quantization
Quantization methods make weights discrete. In doing so, they might damage the precision of important weights. There are ways to mitigate this (dynamic or asymmetric quantization, or Quantization-Aware Training), but they come at the expense of computation, which is what compression was meant to reduce in the first place.
There is no one-size-fits-all method. One has to find the right one for the task at hand. The following are factors that can be considered:
– Trade-offs: the paradigms discussed above introduce trade-offs that can influence the choice of a compression method.
– Area of application: how tolerant the task at hand is can affect the choice of a compression method in relation to alignment, accuracy, etc.
– Type of data (architecture): computer vision and natural language related tasks can require architectures that suit their purpose, which in turn affects the choice of compression strategy. For example, structured Pruning can effectively remove channels and filters from CNNs; on the other hand, there are real-world cases where large language models are effectively compressed with Knowledge Distillation. Classical methods can have performance comparable to Neural Networks.
– Expertise: some of the compression methods require expertise to be implemented and a high level of sophistication to be carried out effectively.
There are numerous open issues in Model Compression that can be formulated
as valid research questions. They are presented in this section according to the
classification used in the paper.
Despite efforts to train smaller networks from scratch [22], most Model Compression methods are applied after a big model is trained. Thus, the compression ought to have an effect on other aspects of the model. For example, as pointed out in [16], there is still work remaining to bridge the concept of a model's size (compression) and its Explainability. Some applications of Explainability are mentioned in the pruning section of this work, but there are only a few works, save for [13], that even acknowledge the existence of a bridge between them. Thus, formalized future work in this direction can be fruitful.
Essentially, one can single out many aspects of a network and ask how compression impacts them. Presented in section four are only a limited number of them, because there is still much work to be done in the area. Again, a comprehensive survey on the effect of Model Compression beyond size reduction could be extremely helpful for building Model Compression without ramifications, and beyond.
7.4 Miscellaneous
Some model compression methods are developed and experimented with for specific types of architectures. This makes them limited in application. In general, RL based approaches will be suitable in the future to adapt compression techniques to a given model.
There is little innovation in methods other than pruning, quantization, and knowledge distillation. Design choices can make a considerable difference [75], [90], [39], [40]. Since the aim of model compression is simple and intuitive, any model size reduction method can be called a model compression technique as long as it performs well.
The idea of ensembles has long been in use in the classical ML world. It is about using multiple machine learning models for a single task simultaneously, and it increases accuracy by taking advantage of the performance of individual models. The concept has been far from even being mentioned in the context of neural networks, as the complexity is overwhelming even to think about. But with parallel advances in model compression and hardware, it seems it will be within the reach of industry leaders or whoever has the access and motive to try. Since that has most likely never been explored, it is worth the shot. Hardware issues, though they are not discussed in this survey, are highly related to compression and optimization.
8 Conclusion
A brief review of Model Compression techniques in Deep Learning, with their challenges and opportunities, has been presented. It started by defining Model Compression and the challenges it introduces. Then thirty years of work on Pruning and a number of works on Knowledge Distillation and Quantization were synthesized. How each method is combined with others was also discussed. A dedicated section was intended to magnify the relationship between model compression and the network's behavior. Findings and patterns from each section were presented in a separate section. Actionable future research directions have been put forward to assist interested researchers.
References
1. Aghli, N., Ribeiro, E.: Combining weight pruning and knowledge distillation for
cnn compression. 2021 IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW) pp. 3185–3192 (2021)
2. Ashok, A., Rhinehart, N., Beainy, F.N., Kitani, K.M.: N2n learning: Net-
work to network compression via policy gradient reinforcement learning. ArXiv
abs/1709.06030 (2017)
3. Ashok, A., Rhinehart, N., Beainy, F.N., Kitani, K.M.: N2n learning: Net-
work to network compression via policy gradient reinforcement learning. ArXiv
abs/1709.06030 (2018)
4. Bang, D., Lee, J., Shim, H.: Distilling from professors: Enhancing
the knowledge distillation of teachers. Information Sciences 576, 743–
755 (2021). https://doi.org/10.1016/j.ins.2021.08.020,
https://www.sciencedirect.com/science/article/pii/S0020025521008203
5. Banner, R., Nahshan, Y., Hoffer, E., Soudry, D.: Aciq: Analytical clipping for
integer quantization of neural networks. ArXiv abs/1810.05723 (2018)
6. Bengio, Y., et al.: Learning deep architectures for ai. Foundations and trends® in
Machine Learning 2(1), 1–127 (2009)
7. Bernstein, L., Sludds, A., Hamerly, R., Sze, V., Emer, J.S., Englund, D.: Freely
scalable and reconfigurable optical hardware for deep learning. Scientific Reports
11 (2020)
8. Bianco, S., Cadene, R., Celona, L., Napoletano, P.: Benchmark analysis of repre-
sentative deep neural network architectures. IEEE Access 6, 64270–64277 (2018).
https://doi.org/10.1109/ACCESS.2018.2877890
9. Blalock, D.W., Ortiz, J.J.G., Frankle, J., Guttag, J.V.: What is the state of neural
network pruning? ArXiv abs/2003.03033 (2020)
10. Boo, Y., Shin, S., Choi, J., Sung, W.: Stochastic precision ensemble: self-knowledge
distillation for quantized deep neural networks. In: Proceedings of the AAAI Con-
ference on Artificial Intelligence. vol. 35, pp. 6794–6802 (2021)
11. Bucila, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: KDD ’06
(2006)
12. Cai, Y., Yao, Z., Dong, Z., Gholami, A., Mahoney, M.W., Keutzer, K.: Zeroq: A
novel zero shot quantization framework. 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) pp. 13166–13175 (2020)
13. Calvi, G.G., Moniri, A., Mahfouz, M., Zhao, Q., Mandic, D.P.: Compression and
interpretability of deep neural networks via tucker tensor layer: From first principles
to tensor valued back-propagation. arXiv: Learning (2019)
14. Chen, L., Chen, Y., Xi, J., Le, X.: Knowledge from the original network: restore a
better pruned network with knowledge distillation. Complex & Intelligent Systems
(2021)
15. Chen, T., Ji, B., Ding, T., Fang, B., Wang, G., Zhu, Z., Liang, L., Shi, Y., Yi, S., Tu,
X.: Only train once: A one-shot neural network training and pruning framework.
In: Neural Information Processing Systems (2021)
16. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and
acceleration for deep neural networks. ArXiv abs/1710.09282 (2017)
17. Courbariaux, M., Bengio, Y.: Binarynet: Training deep neural networks with
weights and activations constrained to +1 or -1. ArXiv abs/1602.02830 (2016)
18. Courbariaux, M., Bengio, Y., David, J.P.: Binaryconnect: Training deep neural
networks with binary weights during propagations. In: NIPS (2015)
19. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
rectional transformers for language understanding. ArXiv abs/1810.04805 (2019)
20. Ding, X., Wang, Y., Xu, Z., Wang, Z.J., Welch, W.J.: Distilling and
transferring knowledge via cgan-generated samples for image clas-
sification and regression. Expert Systems with Applications 213,
40. Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.:
Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1MB model
size. ArXiv abs/1602.07360 (2016)
41. Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A.G., Adam, H.,
Kalenichenko, D.: Quantization and training of neural networks for efficient integer-
arithmetic-only inference. 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition pp. 2704–2713 (2018)
42. Joseph, V., Siddiqui, S.A., Bhaskara, A., Gopalakrishnan, G., Muralidharan, S.,
Garland, M., Ahmed, S., Dengel, A.R.: Going beyond classification accuracy met-
rics in model compression (2020)
43. Kim, J., youn Park, C., Jung, H.J., Choe, Y.: Differentiable pruning method for
neural networks. ArXiv abs/1904.10921 (2019)
44. Kim, J., Bhalgat, Y., Lee, J., Patel, C., Kwak, N.: Qkd: Quantization-aware knowl-
edge distillation. arXiv preprint arXiv:1911.12491 (2019)
45. Kim, Y., Rush, A.M.: Sequence-level knowledge distillation. In: Conference on
Empirical Methods in Natural Language Processing (2016)
46. Kossaifi, J., Lipton, Z.C., Khanna, A., Furlanello, T., Anandkumar, A.: Tensor
regression networks. ArXiv abs/1707.08308 (2020)
47. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. Communications of the ACM 60, 84 – 90 (2012)
48. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: NIPS (1989)
49. Li, L., Lin, Y., Ren, S., Li, P., Zhou, J., Sun, X.: Dynamic knowledge distillation
for pre-trained language models. ArXiv abs/2109.11295 (2021)
50. Li, Z., Wallace, E., Shen, S., Lin, K., Keutzer, K., Klein, D., Gonzalez, J.: Train
large, then compress: Rethinking model size for efficient training and inference of
transformers. ArXiv abs/2002.11794 (2020)
51. Liu, X., Wang, X., Matwin, S.: Improving the interpretability of deep neural net-
works with knowledge distillation. 2018 IEEE International Conference on Data
Mining Workshops (ICDMW) pp. 905–912 (2018)
52. Liu, Y., Zhang, W., Wang, J., Wang, J.: Data-free knowledge transfer: A survey.
ArXiv abs/2112.15278 (2021)
53. Lopes, R.G., Fenu, S., Starner, T.: Data-free knowledge distillation for deep neural
networks. ArXiv abs/1710.07535 (2017)
54. Lopez-Paz, D., Bottou, L., Schölkopf, B., Vapnik, V.N.: Unifying distillation and
privileged information. CoRR abs/1511.03643 (2015)
55. Malach, E., Yehudai, G., Shalev-Shwartz, S., Shamir, O.: Proving the lottery ticket
hypothesis: Pruning is all you need. In: ICML (2020)
56. Mirzadeh, S.I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., Ghasemzadeh,
H.: Improved knowledge distillation via teacher assistant. In: AAAI (2020)
57. Mozer, M.C., Smolensky, P.: Skeletonization: A technique for trimming the fat
from a network via relevance assessment. In: NIPS (1988)
58. Müller, R., Kornblith, S., Hinton, G.E.: Subclass distillation. ArXiv
abs/2002.03936 (2020)
59. Nayak, G.K., Mopuri, K.R., Shaj, V., Babu, R.V., Chakraborty, A.: Zero-shot
knowledge distillation in deep networks. ArXiv abs/1905.08114 (2019)
60. Novikov, A., Podoprikhin, D., Osokin, A., Vetrov, D.P.: Tensorizing neural net-
works. In: NIPS (2015)
61. Olah, C.: Mechanistic Interpretability, Variables, and the Importance of
Interpretable Bases. https://transformer-circuits.pub/2022/mech-interp-essay/index.html (June 22, 2022), [Online; accessed 2022-12-14]
62. Papernot, N., Mcdaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a defense
to adversarial perturbations against deep neural networks. 2016 IEEE Symposium
on Security and Privacy (SP) pp. 582–597 (2015)
63. Park, G., Yang, J.Y., Hwang, S.J., Yang, E.: Attribution preservation in network
compression for reliable network interpretation. Advances in Neural Information
Processing Systems 33, 5093–5104 (2020)
64. Park, J., No, A.: Prune your model before distill it. In: European Conference on
Computer Vision (2021)
65. Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. 2019
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
pp. 3962–3971 (2019)
66. Patterson, D., Gonzalez, J., Le, Q.V., Liang, C., Munguía, L.M., Rothchild, D., So,
D.R., Texier, M., Dean, J.: Carbon emissions and large neural network training.
ArXiv abs/2104.10350 (2021)
67. Polino, A., Pascanu, R., Alistarh, D.: Model compression via distillation and quan-
tization. arXiv preprint arXiv:1802.05668 (2018)
68. Ramanujan, V., Wortsman, M., Kembhavi, A., Farhadi, A., Rastegari, M.: What’s
hidden in a randomly weighted neural network? 2020 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) pp. 11890–11899 (2020)
69. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classifi-
cation using binary convolutional neural networks. In: ECCV (2016)
70. Reed, R.: Pruning algorithms-a survey. IEEE Transactions on Neural Networks
4(5), 740–747 (1993). https://doi.org/10.1109/72.248452
71. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets:
Hints for thin deep nets. CoRR abs/1412.6550 (2015)
72. Sau, B.B., Balasubramanian, V.N.: Deep model compression: Distilling knowledge
from noisy teachers. arXiv preprint arXiv:1610.09650 (2016)
73. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. CoRR abs/1409.1556 (2015)
74. Stoychev, S., Gunes, H.: The effect of model compression on fairness in facial
expression recognition. ArXiv abs/2201.01709 (2022)
75. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan,
D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) pp. 1–9 (2015)
76. Tang, J., Liu, M., Jiang, N., Cai, H., Yu, W., Zhou, J.: Data-
free network pruning for model compression. In: 2021 IEEE Interna-
tional Symposium on Circuits and Systems (ISCAS). pp. 1–5 (2021).
https://doi.org/10.1109/ISCAS51556.2021.9401109
77. Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. ArXiv
abs/1910.10699 (2020)
78. Van Baalen, M., Louizos, C., Nagel, M., Amjad, R.A., Wang, Y., Blankevoort, T.,
Welling, M.: Bayesian bits: Unifying quantization and pruning. Advances in neural
information processing systems 33, 5741–5752 (2020)
79. Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L., Polosukhin, I.: Attention is all you need. ArXiv abs/1706.03762 (2017)
80. Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S.: Haq: Hardware-aware automated quan-
tization with mixed precision. 2019 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) pp. 8604–8612 (2018)
81. Wang, Y., Wang, C., Wang, Z., Zhou, S., Liu, H., Bi, J., Ding, C., Rajasekaran,
S.: Against membership inference attack: Pruning is all you need. In: International
Joint Conference on Artificial Intelligence (2020)
82. Wang, Y., Lu, Y., Blankevoort, T.: Differentiable joint pruning and quantization
for hardware efficiency. ArXiv abs/2007.10463 (2020)
83. Wang, Y., Zhang, X., Hu, X., Zhang, B., Su, H.: Dynamic network pruning with
interpretable layerwise channel selection. In: AAAI (2020)
84. Wang, Y., Zhang, X., Xie, L., Zhou, J., Su, H., Zhang, B., Hu, X.: Pruning from
scratch. In: AAAI (2020)
85. Wiedemann, S., Kirchhoffer, H., Matlage, S., Haase, P., Marbán, A., Marinc, T.,
Neumann, D., Nguyen, T., Schwarz, H., Wiegand, T., Marpe, D., Samek, W.:
Deepcabac: A universal compression algorithm for deep neural networks. IEEE
Journal of Selected Topics in Signal Processing 14, 700–714 (2020)
86. Yang, Z., Wang, Y., Chen, X., Shi, B., Xu, C., Xu, C., Tian, Q., Xu, C.: Cars: Con-
tinuous evolution for efficient neural architecture search. 2020 IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (CVPR) pp. 1826–1835 (2020)
87. Yeom, S.K., Seegerer, P., Lapuschkin, S., Wiedemann, S., Müller, K.R., Samek,
W.: Pruning by explaining: A novel criterion for deep neural network pruning.
ArXiv abs/1912.08881 (2021)
88. Yim, J., Joo, D., Bae, J.H., Kim, J.: A gift from knowledge distillation: Fast op-
timization, network minimization and transfer learning. 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) pp. 7130–7138 (2017)
89. Yuan, L., Tay, F.E., Li, G., Wang, T., Feng, J.: Revisiting knowledge distillation
via label smoothing regularization. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. pp. 3903–3911 (2020)
90. Zhai, S., Cheng, Y., Zhang, Z., Lu, W.: Doubly convolutional neural networks. In:
NIPS (2016)
91. Zhang, Z., Shao, W., Gu, J., Wang, X., Ping, L.: Differentiable dynamic quan-
tization with mixed precision and adaptive resolution. ArXiv abs/2106.02295
(2021)
92. Zhao, B., Cui, Q., Song, R., Qiu, Y., Liang, J.: Decoupled knowledge distilla-
tion. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR) pp. 11943–11952 (2022)
93. Zhao, H., Sun, X., Dong, J., Chen, C., Dong, Z.: Highlight every step: Knowledge
distillation via collaborative teaching. IEEE Transactions on Cybernetics 52, 2070–
2081 (2019)
94. Zhao, Y., Shumailov, I., Mullins, R.D., Anderson, R.: To compress or not to com-
press: Understanding the interactions between adversarial attacks and neural net-
work compression. ArXiv abs/1810.00208 (2018)