IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 32, NO. 11, NOVEMBER 2021

Revisiting Internal Covariate Shift for Batch Normalization

Muhammad Awais, Md. Tauhid Bin Iqbal, Member, IEEE, and Sung-Ho Bae, Member, IEEE

Abstract— Despite the success of batch normalization (BatchNorm) and a plethora of its variants, the exact reasons for its success remain unclear. The original BatchNorm article explained it as a mechanism that reduces the internal covariate shift (ICS), i.e., the distribution shifts in the inputs of the layers during training. Recently, some articles expressed skepticism about this hypothesis and provided alternative explanations for the success of BatchNorm, such as the applicability of very high learning rates and the ability to smooth the optimization landscape. In this work, we counter these alternative arguments by demonstrating the importance of the reduction in ICS through an empirical approach. We demonstrate various ways to achieve the abovementioned alternative properties without any performance boost. In this light, we explore the importance of the different BatchNorm parameters (i.e., batch statistics and affine transformation parameters) by visualizing their effect on performance and analyzing their connection with ICS. Afterward, we present a different normalization scheme that fulfills all the alternative explanations except the reduction in ICS. Despite having all the alternative properties, it performs poorly, which weakens the alternative claims and instead signifies the importance of the ICS reduction. We also perform comprehensive experiments on many variants of BatchNorm, finding that all of them reduce ICS in a similar way.

Index Terms— Batch normalization (BatchNorm), deep learning, internal covariate shift (ICS).

Manuscript received November 5, 2019; revised May 9, 2020; accepted September 18, 2020. Date of publication October 23, 2020; date of current version October 28, 2021. This work was supported by the Technology Innovation Program (Industrial Strategic Technology Development Program) under Grant 10085646 (Memristor Fault-Aware Neuromorphic System for 3D Memristor Array) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea). (Corresponding author: Sung-Ho Bae.)
The authors are with the Department of Computer Science and Engineering, Kyung Hee University (Global Campus), Yongin 17104, South Korea (e-mail: [email protected]; [email protected]; [email protected]).
Digital Object Identifier 10.1109/TNNLS.2020.3026784

I. INTRODUCTION

Batch normalization (BatchNorm or BN) [1] was introduced to speed up the training of deep neural networks and has become a de facto choice in many tasks. Despite its widespread use and success, the reasons for its success are still debated. In the original BatchNorm article, Ioffe and Szegedy [2] attributed it to the reduction of internal covariate shift (ICS), i.e., the changes in the input distributions of the layers of a neural network during training. They considered each layer of a network as a subnetwork and argued that constant ICS (similar to covariate shift in any learning system) in every subnetwork makes it hard for the neural network to adapt during training. In particular, ICS slows the learning process and makes neural networks sensitive to initial conditions, hyperparameters, and many other factors. They proposed to fix the distribution of each subnetwork by normalizing its input with first-order statistics and an affine transformation as training progresses. This normalization reduced the alleged ICS and made convergence faster.

Recent work has challenged this ICS hypothesis and presented alternative explanations [3], [4]. Santurkar et al. [3] raised skepticism about the role of ICS alleviation in BatchNorm and empirically demonstrated that BatchNorm instead makes the optimization landscape of the network smoother, resulting in stable gradient updates. Bjorck et al. [4] likewise argued that the alleviation of ICS does not explain why BatchNorm works; they showed that BatchNorm corrects the activations during training, which helps the network use higher learning rates and escape sharp minima. As a consequence, both works suggest that BatchNorm enables a network to benefit from a higher learning rate under stable training behavior, thus resulting in faster convergence, and that the reduction in ICS is not the primary mechanism that explains why BatchNorm works.

Our main contributions are as follows.
1) We show that the abovementioned alternative benefits are not sufficient to explain the effectiveness of BatchNorm. Instead, we empirically show that the ICS hypothesis is crucial to understanding why BatchNorm works.
2) To nullify the alternative explanations, we devise a new normalization that has the abovementioned alternative properties while offering no performance improvement.
3) We perform experiments to understand the importance of the BatchNorm parameters. That is, we compare the importance of the first-order statistics and the affine transformation by observing how they affect the overall performance of the network. By introducing randomness into the first-order statistics, we also see how an increase in ICS can harm a network.
4) We demonstrate that variants of BatchNorm also reduce ICS in a manner very similar to BatchNorm.

II. BACKGROUND

A. Previous Work

1) Batch Normalization and Its Variants: BatchNorm was introduced to normalize each layer's input, aiming to stabilize and speed up the training of neural networks by alleviating ICS [2]. Since then, many variants of BatchNorm have been introduced to overcome its shortcomings and extend it to other domains.

For instance, a small batch size deteriorates the performance of BatchNorm, making it unusable in many tasks where a small batch size is crucial for training, such as sequential models or models that use large inputs. To overcome this issue, layer normalization and group normalization were introduced [5], [6], which compute statistics over different dimensions of the input. Many other variants, such as normalization propagation [7], batch renormalization [8], instance normalization (IN) [9], batch IN (BIN) [9], Kalman BatchNorm [10], decorrelated normalization [11], switch normalization [12], sparse switchable normalization [13], switchable whitening [14], and dynamic normalization [15], have also been introduced recently to address different issues. Similarly, efforts have been made to extend BatchNorm to other domains, such as [5], [6], and [16]–[19].

2) Understanding BatchNorm and Its Variants: While BatchNorm has become a crucial part of the deep learning toolkit, the reasons for its success are still not well understood. The authors of the original BatchNorm article [2] argued that it decreases ICS, thereby making optimization faster. Recent studies [3], [4] contradicted the ICS hypothesis and offered several alternative explanations.

Santurkar et al. [3] showed that BatchNorm makes the optimization landscape smoother by inducing stable and predictive behavior of the gradients. They showed that a network with BatchNorm has a better Lipschitz condition for both the gradients and the loss. They also argued that there could exist other normalization schemes that give similar benefits by possessing these properties. Similarly, Bjorck et al. [4] argued that BatchNorm corrects the activations so that they remain smaller and, hence, enables the use of higher learning rates. They showed that unnormalized networks have exploding activations in higher layers and large, heavy-tailed gradients at initialization, which makes training harder. They suggested that the primary reason for the success of BatchNorm is its ability to enable the use of higher learning rates, which makes it possible to benefit from the noise added by stochastic gradient descent (SGD) and thus helps to avoid sharp minima.

Many recent articles have also investigated BatchNorm from different perspectives [20]–[24]. The article [25] investigates its regularization effect, [22] provides a theoretical analysis of BatchNorm's role in autotuning the learning rate, [23] analyzes BatchNorm from the perspective of mean-field theory, and so on.

B. Formulation

Let X^l ∈ R^{n×c×h×w} be the input of the lth layer of a neural network, where n is the batch axis, c is the channel axis, h is the height, and w is the width of this input. In general, normalization is a function f : X^l → Y^l, where Y^l ∈ R^{n×c×h×w} is the normalized output calculated by

    Y^l = \left\{ \gamma_i^l \, \frac{X_i^l - \mu_i^l}{\sigma_i^l} + \beta_i^l \right\}_{i=0}^{N}    (1)

where i in X_i^l and N are the sample index and the total number of samples in the normalization, respectively, and are different depending on the normalization scheme. γ_i^l and β_i^l are the parameters of the affine transformation learned with SGD. μ_i^l and σ_i^l are the batch statistics (mean and standard deviation) computed by

    \mu_i^l = \sum_{a} \sum_{b} \sum_{c} x^l_{(i,a,b,c)}    (2)

    \sigma_i^l = \sqrt{ \sum_{a} \sum_{b} \sum_{c} \left( x^l_{(i,a,b,c)} - \mu_i^l \right)^2 }.    (3)

The difference among the major variants of BatchNorm is the axis along which these statistics are calculated, i.e., how X_i^l is chosen. In BatchNorm, X_i^l = X^l_{(n,i,h,w)}, i.e., the statistics are calculated for each channel over the batch and spatial axes, and N = c. Similarly, in LayerNorm [26], X_i^l = X^l_{(i,c,h,w)} and N = n. For GroupNorm (GN) [6], X_i^l = X^l_{(i,g,h,w)}, and N equals the number of groups g_n = c/g, where g and g_n are the number of channels in a group and the total number of groups, respectively. In InstanceNorm (IN) [9], instead of X_i^l, it becomes X_{i,j}^l = X^l_{(i,j,h,w)} and N = n × c.
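To make these axis conventions concrete, the following is a minimal PyTorch-style sketch of how the statistics behind (1)–(3) can be computed for BatchNorm, LayerNorm, GroupNorm, and InstanceNorm. It is our illustration rather than the authors' code: the function name `normalize` is ours, the affine parameters γ and β are omitted, and so are the running statistics that BatchNorm uses at test time.

```python
import torch

def normalize(x, scheme="batch", groups=4, eps=1e-5):
    """Normalize x of shape (n, c, h, w) with mean/std computed over the
    reduction axes that define each scheme; a sketch of Eqs. (1)-(3)."""
    n, c, h, w = x.shape
    if scheme == "group":                      # statistics per sample and channel group
        xg = x.view(n, groups, c // groups, h, w)
        mu = xg.mean(dim=(2, 3, 4), keepdim=True)
        sigma = ((xg - mu) ** 2).mean(dim=(2, 3, 4), keepdim=True).sqrt()
        return ((xg - mu) / (sigma + eps)).view(n, c, h, w)
    dims = {"batch": (0, 2, 3),                # per channel: reduce over batch and space
            "layer": (1, 2, 3),                # per sample: reduce over channels and space
            "instance": (2, 3)}[scheme]        # per sample and channel: reduce over space
    mu = x.mean(dim=dims, keepdim=True)
    sigma = ((x - mu) ** 2).mean(dim=dims, keepdim=True).sqrt()
    return (x - mu) / (sigma + eps)

# All schemes act on the same (n, c, h, w) tensor; only the reduction axes differ.
x = torch.randn(8, 16, 32, 32)
for s in ("batch", "layer", "instance", "group"):
    print(s, normalize(x, scheme=s).shape)
```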

C. Experimental Design

To understand the fundamental mechanism of BatchNorm, we designed three types of experiments as follows.
1) To understand the importance of the BatchNorm parameters and their connection to ICS, we perform experiments in which the BatchNorm parameters, i.e., the batch statistics and the affine transform parameters, are replaced.
2) To counter the existing alternative explanations for the effectiveness of BatchNorm (e.g., the correction of activations) in [3] and [4], we perform experiments with a specially designed normalization that fulfills the alternative properties except the reduction in ICS.
3) To see the effectiveness of ICS reduction in BatchNorm variants, we perform experiments to see how they reduce ICS compared with BatchNorm.

Following [4], we use a fairly simple and standard experimental setup, as our purpose is not to achieve state-of-the-art results but to understand why BatchNorm works. For the experiments, we use two versions of the CIFAR data set with ten and 100 classes (CIFAR10 and CIFAR100), where CIFAR10 is relatively simple and CIFAR100 is more complex and hard. We show the results on both data sets to increase confidence. We use ResNet20 as our backbone network for CIFAR10 because it is light enough to perform comprehensive experiments and is considered one of the representatives of modern deep neural architectures. All the networks are trained for 164 epochs, with an initial learning rate of 0.1 that is reduced by a factor of 10 after the 80th and 120th epochs. Networks that do not converge at higher learning rates (networks without any normalization) are trained with a learning rate of 0.001. In our experiments, we use a batch size of 128, the standard cross-entropy (CE) loss function, and SGD with momentum and weight decay [27] with the default settings of PyTorch. For more details and supplementary material, please visit the project page (http://awaisrauf.github.io/ics-bn.html).

III. IMPORTANCE OF BATCHNORM PARAMETERS

TABLE I. Summary of the experiments in Section III.

Fig. 1. Visual example of the distribution normalization performed by a BatchNorm layer for a particular channel i using the batch statistics, μ and σ², and the affine transform parameters, γ and β. The input of a layer is first normalized with the batch statistics to have zero mean and unit variance. This normalized input is then rescaled with the two learnable parameters of the affine transformation.

Fig. 2. Training and validation accuracy of networks with different BatchNorm variables. The results can be grouped into three categories, as shown in the figure's legend. Removing the affine transformation does not affect the performance of a network, whereas removing the batch statistics makes the network's performance equivalent to that of an unnormalized network. The removal of the batch statistics hurts performance, but the network still converges at a high learning rate (0.1). Replacing the batch statistics with random values makes the network equal to a randomly initialized network. For more context on these results, please see Section III.

Although BatchNorm is one of the most successful layers in modern neural network architectures, the reasons for its success are still disputed. The original BatchNorm article attributed it to the reduction in ICS, i.e., the stabilization of the input distribution of each layer of the network, under the assumption that the input distribution of a layer follows a normal distribution characterized by μ and σ². A change in a normal distribution can therefore be expressed as a change of these parameters. Due to the change in the weights during training, the input distribution of the lth layer, N_l(μ_l, σ_l²), can change to N̂_l(μ̂_l, σ̂_l²). Hence, ICS can be understood as a change of the parameters from (μ_l, σ_l²) to (μ̂_l, σ̂_l²). BatchNorm aims to correct this distribution change by: 1) normalizing the input using the minibatch mean and variance, i.e., making it standard normal, N_l(μ_l = 0, σ_l² = 1), and 2) projecting it to a new (learned) mean and variance by an affine transformation with two parameters, γ and β, as shown in Fig. 1. In this way, ICS may be reduced by these four parameters.

Since these parameters play an essential role in the reduction of ICS, we first need to understand their importance. We investigated the role of these parameters by analyzing the behavior of a network when they are perturbed in different ways. To this end, we designed the following four experiments and compared them with the standard BatchNormed and unnormalized networks.

1) No Learnable Parameters in the BatchNorm Layer (γ = 1 and β = 0): The BatchNorm layer becomes a standard normalization layer: the inputs of all layers have zero mean and unit variance. In this setting, we reduce ICS only by making the mean and variance of the input of each layer constant; therefore, we should expect comparable performance if the ICS hypothesis is sufficient to explain why BatchNorm works. This experiment also establishes the importance of the mean and variance compared with the affine parameters.
2) No Batch Statistics (μ = 0 and σ² = 1): BatchNorm in each layer becomes only an affine transformation. This experiment gives us a way to gauge the importance of the rescaling hypothesis.
3) Replacement of Batch Statistics With Random Noise (μ ∼ N(0, 1) and σ² ∼ |N(0, 1)|): This is one way to introduce extreme ICS in the layers, as we randomly change the mean and variance of the input of each layer. We keep the affine transformation so that, if possible, the network can learn to adjust the scale of the inputs.
4) Replacement of Batch Statistics With Uniform Random Scaling (μ ∼ U(min(X), max(X)) and σ² ∼ U(min(X), max(X))): This is another way of introducing a random distribution shift in the inputs.

From Fig. 2, the importance and necessity of the batch statistics are evident. We can divide the results of the experiments mentioned above into three groups: first, the learnable parameters do not affect the overall performance of the network (experiment 1); second, removing the batch statistics reduces the performance of the network to that of an unnormalized network or worse (experiment 2), although it may still enable convergence at a higher learning rate; third, any randomness in the batch statistics prevents the network from converging (experiments 3 and 4).

The performance of a network is similar to that of the standard BatchNorm network when we use only the batch statistics (91.6 versus 90.7 for CIFAR10 and 68.7 versus 68.1 for CIFAR100). However, the performance drops to that of an unnormalized network or worse if we remove the batch statistics (91.6 versus 80.2 for CIFAR10 and 68.7 versus 49.8 for CIFAR100). Note that the use of the affine transformation without batch statistics is enough to train networks with a high learning rate (0.1, compared with 0.001 for the unnormalized network). However, it fails to increase performance beyond that of an unnormalized network (an accuracy of 80.2 compared with 79.8 for an unnormalized network on CIFAR10, and 49.8 compared with 47.1 on CIFAR100). Similarly, the last two experiments show the effect of introducing a higher internal covariate shift: a network with a higher covariate shift fails to converge. These experiments show two crucial points: the batch statistics are necessary and sufficient for the BatchNorm layer to work, and randomness in the batch statistics destroys the network as it introduces extreme ICS.
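The four ablations above amount to small modifications of a standard BatchNorm layer. The following is a minimal PyTorch sketch of how they could be implemented; it is our illustration, not the authors' released code, and the class name and mode strings are ours. Experiment 4 would replace the random draws with uniform samples from [min(X), max(X)].

```python
import torch
import torch.nn as nn

class AblatedBatchNorm2d(nn.Module):
    """Sketch of the BatchNorm ablations of Section III:
    'full'         - batch statistics + affine transform,
    'stats_only'   - batch statistics only, gamma = 1, beta = 0   (experiment 1),
    'affine_only'  - no batch statistics, affine transform only   (experiment 2),
    'random_stats' - statistics replaced by random noise          (experiment 3).
    Running statistics for test time are omitted for brevity."""

    def __init__(self, num_features, mode="full", eps=1e-5):
        super().__init__()
        self.mode, self.eps = mode, eps
        self.gamma = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_features, 1, 1))

    def forward(self, x):
        if self.mode == "affine_only":            # mu = 0, sigma = 1
            x_hat = x
        elif self.mode == "random_stats":         # extreme ICS: mu ~ N(0,1), var ~ |N(0,1)|
            mu = torch.randn(1, x.size(1), 1, 1, device=x.device)
            var = torch.randn(1, x.size(1), 1, 1, device=x.device).abs()
            x_hat = (x - mu) / (var + self.eps).sqrt()
        else:                                     # per-channel minibatch statistics
            mu = x.mean(dim=(0, 2, 3), keepdim=True)
            var = ((x - mu) ** 2).mean(dim=(0, 2, 3), keepdim=True)
            x_hat = (x - mu) / (var + self.eps).sqrt()
        if self.mode == "stats_only":
            return x_hat                          # gamma and beta are not applied
        return self.gamma * x_hat + self.beta
```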

Fig. 3. Comparison of the training of a BatchNorm network with varying levels of noise injected into the batch statistics. The x-axis shows gradient descent steps, and the y-axis shows the accuracy of the network; the first plot shows training accuracy and the second shows test accuracy. As the noise level increases, the performance of the network decreases.

Fig. 4. Comparison of batch statistics and affine transformation in a fully trained network. The x-axis shows the channels, and the y-axis shows the corresponding values of the batch statistics (orange) and the learned affine transform parameters (blue). The affine transform parameters tend to remain close to their initial values.

Batch Statistics and Affine Transformation: We have seen that replacing the batch statistics with random values damages the network beyond recovery. However, it can be argued that replacing the batch statistics with completely random values is too extreme. To further understand the effect of ICS, we added normal random values with zero mean and variance σ_noise to the batch statistics: μ + N(0, σ_noise) and σ + N(0, σ_noise). Since we have shown that the batch statistics are sufficient, we did not use the affine transformation. We observed that, as the value of the noise (σ_noise) increases, performance decreases and learning slows down, and an increase in noise beyond a threshold cripples the network even further. The results in Fig. 3 show a downward trend as we increase the noise level. This indicates that the introduction of even a small distribution shift affects the network.

Similarly, in the last set of experiments, we saw that the affine transformation does not play a very crucial role and that the batch statistics alone are sufficient. If we assume the ICS hypothesis to be correct, we should expect the batch statistics to be sufficient, as they make the distribution zero mean and unit variance. It can also be argued that the affine transformation can be a source of ICS, as it projects N(0, 1) to an arbitrary mean and variance. To see how the affine transformation changes the mean and variance of the input after normalization with the batch statistics, we visualize the weight and bias of all the channels of a fully trained ResNet20. Interestingly, the weight and bias tend to have learned values very close to their initializations of 1 and 0, respectively. On the other hand, the batch statistics (μ and σ) are very diverse, as shown in Fig. 4. For instance, 95% of the values of β and γ lie in the intervals of −0.02∼−0.45 and 0.68∼0.93, respectively. This observation is in line with our experiments in Section III and shows that a decrease in ICS through the batch statistics is sufficient for a network to achieve performance comparable to the full BatchNorm layer.

IV. DISENTANGLING ICS REDUCTION AND OTHER BENEFITS OF BATCHNORM

Our experiments so far have exhibited the importance of the BatchNorm parameters and their crucial role in the success of BatchNorm. We now turn our attention to the alternative arguments presented in [3] and [4] for the success of BatchNorm. For the sake of understanding, let us divide the known reasons behind the success of BatchNorm into two parts: the reduction in ICS, and all the other benefits demonstrated in the original article [2] as well as in [3] and [4]. In this section, we demonstrate the importance of the ICS hypothesis for BatchNorm by disentangling the other benefits from it. Taking inspiration from the call in [3] to explore other normalization schemes, we introduce a normalization that has all the alternative benefits except the reduction in ICS. We show that, despite fulfilling all the alternative explanations, this alternative normalization performs inadequately.

A. MinMax Normalization Versus BatchNorm

Bjorck et al. [4] suggested that a network without BatchNorm diverges due to progressively larger activations as depth increases. They suggested that BatchNorm corrects the activations, hence allowing larger learning rates. There are many alternative ways to contain the activations and prevent them from exploding in higher layers. One naive option could be ℓ_p-norm-based scaling, as suggested by Santurkar et al. [3]. They first centered the activations, Y_1 = X − μ, making E[X] = 0, and then divided these centered activations by their norm: Y_2 = Y_1/‖X‖_2. We know that E[X²] = ‖X‖_2² (up to a constant factor) and that σ² = E[X²] − E[X]² = E[X²] for centered activations. Hence, the two are equivalent, except that BatchNorm is applied channelwise while this scaling is applied to the whole activation. For the abovementioned reasons, we decided to choose MinMax normalization (MinMaxNorm), which rescales the activations while preserving all other relationships in the data. To normalize the lth layer input X^l, MinMaxNorm is defined as follows:

    \mathrm{MMN}(X^l) = \left\{ \gamma_i^l \, \frac{X_i^l - \min X_i^l}{\max X_i^l - \min X_i^l} + \beta_i^l \right\}_{i=0}^{c}    (4)

where c is the number of channels. All the variables follow the definitions of Section II-B. Also note that the learnable parameters, γ_i^l and β_i^l, can restore the representation power of the network, as was argued in [2] for BatchNorm. In this way, after normalization, the network learns to transform the data into the optimal range.
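Below is a minimal PyTorch sketch of a per-channel MinMaxNorm layer following (4). It is our illustration of the idea rather than the authors' implementation; the class name MinMaxNorm2d is ours, and running estimates of the min and max for test time are omitted. The min and max are taken per channel over the whole minibatch, so the layer bounds the activations to a fixed range before the learnable affine transform, but it does not fix the mean and variance of the layer input.

```python
import torch
import torch.nn as nn

class MinMaxNorm2d(nn.Module):
    """Per-channel min-max normalization with a learnable affine transform, cf. Eq. (4)."""

    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_features, 1, 1))

    def forward(self, x):
        n, c, h, w = x.shape
        flat = x.permute(1, 0, 2, 3).reshape(c, -1)        # gather each channel
        x_min = flat.min(dim=1).values.view(1, c, 1, 1)
        x_max = flat.max(dim=1).values.view(1, c, 1, 1)
        x_hat = (x - x_min) / (x_max - x_min + self.eps)   # squeeze values into [0, 1]
        return self.gamma * x_hat + self.beta

# Usage sketch: drop-in replacement for nn.BatchNorm2d in the backbone.
x = torch.randn(8, 16, 32, 32)
print(MinMaxNorm2d(16)(x).shape)
```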

Fig. 5. Comparison of the MinMaxNormed network with the BatchNormed and unnormalized networks. The MinMaxNormed network converges at a high learning rate (0.1), but its performance is similar to that of an unnormalized network.

Fig. 6. Comparison of BatchNormed (blue), MinMaxNormed (orange), and unnormalized networks (blue) for the ℓ2-based magnitude of activations at initialization for ResNet-98. Note that the y-axis is in log scale. The figure shows the similarity of BatchNorm and MinMaxNorm in the scale of their activations.

As shown in Fig. 5, the MinMaxNormed network has performance comparable to the unnormalized model, which is much weaker than BatchNorm. Note that the MinMaxNormed model was trained with a learning rate of 0.1; the unnormalized network fails to converge at such high learning rates, so we have to train it with a small learning rate of 0.001. In the following sections, we explain how MinMax normalization fails to perform better than the unnormalized network despite fulfilling all the alternative properties. We then show that the primary reason for this failure is its inability to alleviate ICS. We first show how MinMaxNorm fulfills the alternative properties and then discuss its performance compared with BatchNorm.

B. MinMaxNorm Satisfies Alternative Properties

1) Activation Explosion and Higher Learning Rate: Bjorck et al. [4] argued that BatchNorm works because of its ability to contain the activations, thereby enabling higher learning rates. The higher learning rate also biases a neural network toward wider minima and, therefore, better generalization. In this section, we demonstrate that MinMaxNorm also enables a network to use a higher learning rate without giving any performance boost. This shows that, while rescaling alone can make the network take larger learning steps, it cannot explain why BatchNorm works.

MinMaxNorm is better at keeping activations from exploding, as it produces relatively smaller activations compared with BatchNorm. To show this, let us define the result of applying both normalizations to an activation X:

    Y_{BN} = \frac{X - \mu}{\sigma}, \qquad Y_{MMN} = \frac{X - \min(X)}{\max(X) - \min(X)}.

From Popoviciu's inequality [28], we know that σ² ≤ (1/4)(max(X) − min(X))² and min(X) ≤ μ. Hence, we can say that 2 Y_MMN ≤ Y_BN. MinMaxNorm also has an affine transformation similar to BatchNorm, so it can project the activations back to an optimal range.

We show this empirically in Fig. 6. Note the similarity of the ℓ2-norms of the activations after BatchNorm and MinMaxNorm. This means that the MinMaxNormalized network can take larger gradient steps. It also means that a network with MinMaxNorm can benefit from regularization by noise, as defined in [4, eq. (2)], similar to the BatchNormalized network. This is indeed the case in our experiments: an unnormalized network is unable to converge with a learning rate of 0.1, but a network with MinMaxNorm converges with such large gradient steps. This shows that rescaling or higher learning rates do not fully explain why BatchNorm works. We can train the network with higher learning rates by just rescaling the activations, without any gain in performance, as shown in Fig. 5.

2) Analysis of Networks at Initialization: Bjorck et al. [4] presented different properties of BatchNorm at initialization and showed that it enjoys better initial conditions compared with the unnormalized case. Here, we compare these properties for MinMaxNorm and BatchNorm and show their similarity.

First, we show the similarity of the activations of BatchNorm and MinMaxNorm at initialization. To this end, we initialized a ResNet98 and computed the ℓ2-norm of the activations of each channel for a randomly sampled batch. The log of the magnitude of the channels is shown in Fig. 6 for both the CIFAR10 and CIFAR100 data sets. Both BatchNorm and MinMaxNorm have a similar and relatively uniform magnitude of activations over the channels, while the magnitude of the unnormalized network varies a lot.

Second, we show the distribution of the gradients at initialization. Fig. 7 shows the gradients of layers from the initial, middle, and final parts of the network at initialization. Both the MinMax and BatchNorm networks have smaller and mean-centered gradients, while the unnormalized network has a very large, heavy-tailed gradient distribution.

Similarly, Bjorck et al. [4] computed the sum of the absolute values of the gradients and the absolute value of their sum for each layer. They noticed that, in an unnormalized network, both terms have a similar magnitude, indicating matching signs of the gradient values throughout and, hence, input-independent behavior. For the BatchNormed network, the magnitudes of the two terms are very different. To examine this, let a be the absolute value of the sum of the gradients and b the sum of their absolute values. In Fig. 9, we show a/b for each channel at initialization for both the CIFAR10 and CIFAR100 data sets. We have excluded some outliers for better visualization. Notice the stark similarity between BatchNorm and MinMaxNorm.
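These initialization diagnostics can be gathered with a few lines of PyTorch. The sketch below is our illustration (the helper names are ours, and it reports per-layer rather than per-channel quantities): it records the ℓ2 magnitude of each convolutional layer's output for one batch, and the ratio a/b between the absolute value of the summed gradients and the sum of their absolute values, which is close to 1 when the gradient signs match throughout.

```python
import torch
import torch.nn as nn

def activation_norms(model, x):
    """l2 norm of each convolutional layer's output for one forward pass."""
    norms, handles = {}, []
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d):
            handles.append(m.register_forward_hook(
                lambda mod, inp, out, name=name: norms.__setitem__(name, out.norm(p=2).item())))
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return norms

def grad_sign_ratio(model, x, y, loss_fn=nn.CrossEntropyLoss()):
    """a/b per parameter tensor: |sum of gradients| / sum of |gradients|."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return {n: (p.grad.sum().abs() / p.grad.abs().sum()).item()
            for n, p in model.named_parameters() if p.grad is not None}

# Example with a small stand-in network (the paper uses ResNet-98 on CIFAR).
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
                    nn.Flatten(), nn.Linear(16 * 32 * 32, 10))
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(activation_norms(net, x))
print(grad_sign_ratio(net, x, y))
```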

Fig. 7. Histograms of the gradients at the starting, middle, and final points of a ResNet-98 at initialization. MinMax and BatchNorm have smaller and normally distributed gradients, while an unnormalized network has very large, heavy-tailed gradients. Note the difference in the x-axis scales of these histograms.

Fig. 8. Comparison of BatchNormed, MinMaxNormed, and unnormalized networks for the ℓ2 difference of the gradients with respect to the loss as we move in the gradient direction, for both the CIFAR10 and CIFAR100 data sets. Both BatchNorm and MinMaxNorm make the gradient updates well behaved and more predictive compared with the unnormalized network.

Fig. 9. Similarity of the gradients of MinMaxNorm and BatchNorm at initialization. We compare the absolute value of the sum of the gradients and the sum of their absolute values for MinMaxNorm, BatchNorm, and unnormalized networks at initialization. Lower values of this factor mean matching signs of gradients throughout, making the behavior input-independent. Both MinMaxNorm and BatchNorm have large and similar values, showing their dependence on the input, which is a good sign.

3) Analysis of Gradients: It has been shown that the gradients of BatchNormed and unnormalized networks are different and, perhaps, play a crucial role in the success of BatchNorm [3], [4]. Here, we show that MinMaxNorm has gradients similar to those of BatchNorm.

In [3], it was shown that BatchNorm makes the loss landscape of the network smooth and the gradient updates stable. They demonstrated that the magnitude of the gradient of the loss, which is related to the Lipschitz condition, is smaller for the BatchNormalized network, and that the magnitudes of the gradients of the BatchNormed and unnormalized networks are related through an inverse scaling by the variance. From Popoviciu's inequality, we also know that the normalizing range max(X) − min(X) used by MinMaxNorm is always at least as large as BatchNorm's scaling, i.e., σ ≤ (1/2)(max(X) − min(X)). Hence, from [3, Ths. 4.1, 4.2, and 4.4], we can infer that MinMaxNorm enjoys similarly favorable Lipschitz conditions on the gradients and a relatively smooth optimization landscape.

Santurkar et al. [3] explored the optimization landscape and found that the gradient updates in a BatchNorm network tend to be stable and more predictable. Here, we compare the gradient predictiveness of BatchNorm and MinMaxNorm. Let the gradient of the lth convolutional layer at the tth gradient step be d_l^t. Then, we can define the gradient predictiveness of the lth layer as

    GP_l = \left\| d_l^t - d_l^{t+1} \right\|_2.

Similarly, the gradient predictiveness of a network with L layers can be computed as

    GP = \sum_{l=0}^{L} \left\| GP_l \right\|_2.

In Fig. 8, we show the GP of the whole network. From this figure, it is clear that the GP of MinMaxNorm is very similar to that of BatchNorm. For a detailed comparison, we plot GP_l for all the layers in Fig. 12 of the Appendix. Similarly, the gradients at initialization are also an important indication. Figs. 7 and 9 show the similarity of the gradients of BatchNorm and MinMaxNorm at initialization.
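Gradient predictiveness as defined above can be tracked with a simple helper during training. The sketch below is our illustration (the function names are ours); it compares the flattened gradients of the convolutional layers at consecutive SGD steps.

```python
import torch
import torch.nn as nn

def layer_gradients(model):
    """Flattened gradient of each convolutional layer's weight after backward()."""
    return {name: m.weight.grad.detach().flatten().clone()
            for name, m in model.named_modules()
            if isinstance(m, nn.Conv2d) and m.weight.grad is not None}

def gradient_predictiveness(prev_grads, curr_grads):
    """GP_l = ||d_l^t - d_l^{t+1}||_2 per layer, plus the sum over layers (GP)."""
    gp = {name: torch.linalg.norm(curr_grads[name] - prev_grads[name]).item()
          for name in curr_grads if name in prev_grads}
    return gp, sum(gp.values())

# Usage sketch inside a training loop:
# prev = None
# for x, y in loader:
#     optimizer.zero_grad(); loss_fn(model(x), y).backward()
#     curr = layer_gradients(model)
#     if prev is not None:
#         per_layer_gp, total_gp = gradient_predictiveness(prev, curr)
#     prev = curr
#     optimizer.step()
```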

Fig. 10. Comparison of BatchNorm and MinMaxNorm for the distribution of the inputs at different layers. To compute this, we forward passed a random batch of inputs, computed the activations at each epoch during training, and display them as histograms. The distribution of the activations is stable throughout training for a network normalized with BatchNorm; however, it is unstable for the same network normalized with MinMaxNorm. This shows that MinMaxNorm fails to reduce ICS compared with BatchNorm.

Fig. 11. Hellinger distance (the difference between two distributions) between subsequent distributions as we move in the direction of the gradient. The network with different variants of BatchNorm was trained for 12k SGD steps, and the f-divergence-based Hellinger distance was measured between successive updates. The results are shown for one layer each from the initial, intermediate, and final parts of the network, chosen randomly. The figure shows how different variants of BatchNorm reduce ICS in a very similar fashion compared with an unnormalized network, which fails to stabilize the distributions.

C. Comparison of Distributions and ICS Hypothesis

In the experiments so far, we have shown that MinMaxNorm fulfills all the alternative properties that are supposed to explain the success of BatchNorm. MinMaxNorm rescales the activations; it enables the network to learn with a higher learning rate, thereby giving the benefit of wider minima and better generalization; it has better initial conditions, similar to BatchNorm; and it makes the gradients stable and the landscape smoother. Here, we discuss how MinMaxNorm differs in terms of the reduction in ICS. ICS can be defined in terms of the stabilization of the distribution of a layer's input during training: how the distributions of the inputs differ as we move in the direction of the gradient. We trained the same network with BatchNorm and with MinMaxNorm and visualized each layer's input distribution during training using TensorBoard. Fig. 10 shows the histogram of the input of each layer of the network as training proceeds. Each histogram in the figure shows the distribution of the input at one particular epoch. Note how the distribution of each layer changes during training for the network with MinMaxNorm and with BatchNorm: the distribution under MinMaxNorm is very unstable, while BatchNorm yields relatively stable distributions as we train the network.
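These distribution plots can be produced by logging the inputs of the normalization layers for a fixed batch as training proceeds. The following is a minimal sketch using forward hooks and TensorBoard; it is our illustration rather than the authors' tooling, and the helper name is ours.

```python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

def log_layer_inputs(model, fixed_batch, writer, step):
    """Record a histogram of every normalization layer's input for a fixed batch."""
    handles = []
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):  # or a custom MinMaxNorm2d layer
            handles.append(m.register_forward_hook(
                lambda mod, inp, out, name=name:
                    writer.add_histogram(f"input/{name}", inp[0].detach(), step)))
    with torch.no_grad():
        model(fixed_batch)
    for h in handles:
        h.remove()

# Usage sketch: call once per epoch with the same random batch.
# writer = SummaryWriter("runs/ics")
# for epoch in range(num_epochs):
#     train_one_epoch(model, loader, optimizer)
#     log_layer_inputs(model, fixed_batch, writer, epoch)
```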

Fig. 12. A more detailed look at gradient predictiveness, shown for all the convolutional layers of ResNet-20. Comparison of BatchNormed, MinMaxNormed, and unnormalized networks for the ℓ2 difference of the gradients with respect to the loss as we move in the gradient direction. All the layers have similarly stable gradient updates for both BatchNorm and MinMaxNorm, whereas the unnormalized network has very noisy and unpredictable gradient updates.

To further understand the stability of the distributions during training, we show a different visualization of the same distributions in Fig. 13 of the Appendix. In this figure, the x-axis represents training steps, and the y-axis shows levels of values: the lowest line shows the minimum value of the distribution, the highest line shows the maximum value, and the middle lines show intermediate levels. Similarly, the shades of color show percentiles of the distribution; the light color covers 0%–25% and 75%–100%, while the dark part covers the remaining 50% of the distribution. Each line in this graph shows the trend of the distribution as training proceeds. From this figure, the instability of MinMaxNorm becomes even clearer.

The evolution of the training and test accuracy of the BatchNorm, MinMaxNorm, and unnormalized networks is shown in Fig. 5. The MinMaxNormed network was trained with a learning rate of 0.1 and, as we have shown, it fulfills the alternative properties suggested to be the reason behind the success of BatchNorm. However, MinMaxNorm fails to stabilize the input distribution of each layer, and it performs similar to an unnormalized network. This shows that, while rescaling, higher learning rates, better initial conditions, and stable gradient updates are important for a network, they do not explain why BatchNorm works. These experiments also show the importance of the reduction in ICS: a normalization that fulfills the alternative properties except the reduction in ICS fails to provide benefits similar to BatchNorm.

Fig. 13. Comparison of BatchNorm and MinMaxNorm for the distribution of the inputs at different layers. This is similar to Fig. 10 but visualized in a different way. Each shade in this graph shows the evolution of a different percentile of the values over training. The darker color shows a higher percentile, and the lighter color shows a smaller percentile. Note the unstable behavior of the MinMaxNormed network's input distributions.

V. ICS AND VARIANTS OF BATCHNORM

A plethora of variants of BatchNorm have been introduced recently. These variants promise to resolve many shortcomings of the original BatchNorm. Most of the research on understanding BatchNorm does not consider these variants, so it is not clear whether they reduce ICS and how we can compare them with BatchNorm. In this section, we empirically explore the stabilization behavior of some of the variants of BatchNorm. We observed that, like BatchNorm, its variants also reduce ICS and make the activations stable during training. Our main observation is that the variants of BatchNorm, while different in computation, share a similarity in how they reduce ICS. Apart from its relevance to our main findings, this observation is also important because it can guide the next variants of BatchNorm in terms of how they should reduce ICS more effectively.

Change in Distributions: One primary goal of BatchNorm and its variants is to alleviate ICS and stabilize the input distributions over time. Different variants of BatchNorm achieve this goal differently. One way to quantify this is to calculate an f-divergence (a function that measures the difference between two distributions) between the input distributions as they evolve during training. To this end, we calculate the Hellinger distance (an instance of f-divergence) [29] for a layer l as we move in the direction of the gradient during training. In this way, we measure the change in the input distribution as training evolves.

Let the distribution of the ith layer at gradient step t be p_t(X_i). We calculated the Hellinger distance HD(p_{t+1}(X_i) || p_t(X_i)) for three randomly selected layers from the initial, middle, and final parts of a network. The results are shown in Fig. 11. Although the changes in the distributions are arbitrary in the initial layer, a very similar reduction of ICS is evident in the higher layers. While the unnormalized network has a very noisy and increasing distance between the distributions, the distances for all the other normalizations are stable and very similar.

We also compare the evolution of the input distributions of different layers as we move in the direction of the gradient during training. We trained ResNet20 with IN [9], GN [6], BIN [10], and BN [2] for 90 epochs. At each epoch during training, we recorded the activation values for a random batch of data and show them as histograms in Fig. 14 of the Appendix. Note that, while the different normalizations produce different distributions, all of them share a similar stability of the distributions across training. This shows that the different variants of BatchNorm reduce ICS in a similar way, although they differ in computation.
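A rough sketch of this measurement is given below. It is our illustration, under two assumptions that are not spelled out above: the layer inputs are summarized by normalized histograms on a fixed support, and the discrete form of the Hellinger distance is used.

```python
import torch

def hellinger(p, q, eps=1e-12):
    """Discrete Hellinger distance between two normalized histograms."""
    return torch.sqrt(0.5 * torch.sum((torch.sqrt(p + eps) - torch.sqrt(q + eps)) ** 2))

def histogram(activations, bins=100, lo=-5.0, hi=5.0):
    """Normalized histogram of a layer's activations on a fixed support."""
    h = torch.histc(activations.detach().flatten(), bins=bins, min=lo, max=hi)
    return h / h.sum().clamp_min(1.0)

# Usage sketch: record the histogram of a chosen layer's input at successive
# SGD steps (e.g., captured with a forward hook) and measure HD(p_{t+1} || p_t).
# prev = None
# for step, (x, y) in enumerate(loader):
#     ...  # forward, backward, optimizer step
#     curr = histogram(layer_input)
#     if prev is not None:
#         print(step, hellinger(curr, prev).item())
#     prev = curr
```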

Fig. 14. Comparison of the distributions of different variants of BatchNorm at the initial, middle, and final points of a 20-layer ResNet. We trained a ResNet20 with different variants of BatchNorm. At each epoch, the activations for a random batch were recorded and are shown as histograms. All the distributions of the same layer look very similar both in shape and in magnitude (note the similarity of the x-axes). Also note how stable the distributions are over time.

VI. EVIDENCE FROM PREVIOUS WORK

Recent work has revealed that the concept of ICS in BatchNorm can be leveraged to extend the capabilities of neural networks. The basic hypothesis used is as follows: BatchNorm captures information about the underlying distribution of the data, which can be leveraged. This supports our finding that BatchNorm works by manipulating the underlying distribution of the data. We briefly summarize some of this work in the following.

In domain adaptation, we train a model on data sampled from a source distribution D_S and want it to work on a related but different distribution D_T with limited or no labeled data available for adaptation [30]. Li et al. [31] showed that we can solve this problem by simply re-estimating the batch statistics on the target distribution D_T. This shows that the statistics computed for the BatchNorm layer are not just scaling factors but encode the data distribution.

A significant problem of current deep learning models is their adversarial behavior: a network can be fooled into misclassifying by adding imperceptible but targeted noise to the input, producing so-called adversarial images. Adversarial training [32], [33] is employed to solve this problem, where a network is trained with both clean and adversarial images. Adversarial training causes a degradation in clean accuracy, and it has been shown that there may exist a compromise between robustness and generalization [34], [35]. Xie et al. [36] hypothesized that this degradation of accuracy is caused by a distribution mismatch: clean and adversarial images have different underlying distributions, and training on one distribution does not transfer well to the other. To mitigate this distribution mismatch, they propose to maintain two sets of batch statistics during training: one for clean images and one for adversarial images. This simple remedy not only bridged the gap between clean and adversarial accuracy but also improved the state-of-the-art clean accuracy results for several networks.

If we use BatchNorm and dropout together in a network, the performance of the network degrades. Li et al. [21] discovered that this is because of a variance shift in the BatchNorm layer. During training, dropout randomly turns off the neurons of a layer with probability p; at test time, no neuron is turned off. The authors suggested that this difference causes a variance shift between training and test and, thereby, an incorrect estimate of the batch statistics.

Quantization is often utilized to decrease the size of a neural network for inference. However, it requires the availability of the original data to refine the network. Haroush et al. [37] determined that it is possible to use the BatchNorm statistics to estimate the data distribution. They synthesized data by using the batch statistics and refined the network on it. The synthesis of data is an iterative process that starts from random data, which is then forward propagated through the network and updated to minimize the difference between the statistics calculated from it and the precalculated statistics of the BatchNorm layers. Similarly, Gholami et al. [38] also proposed a method to synthesize data from the statistics of trained BatchNorm layers. They used this synthetic data to measure the sensitivity of each layer to quantization; this sensitivity metric is then used to find the level of quantization for each layer.

VII. DISCUSSION AND RESULTS

In this work, we revisited the importance of the internal covariate shift hypothesis for BatchNorm and showed that the alternative hypotheses are not sufficient to explain why BatchNorm works. First, we devised experiments to understand the different parameters of BatchNorm. We found that the correction of the first-order statistics of the activations alone is enough to obtain a performance increase similar to BatchNorm, even without the affine transformation. This shows the importance of the batch statistics, and thereby of the ICS hypothesis, for understanding why BatchNorm works. To further demonstrate the effect of ICS, we grouped the benefits of BatchNorm into two categories: the reduction in ICS and all the other alternative benefits. Then, we introduced MinMaxNorm, which has all the alternative benefits, such as training with a higher learning rate, better initial conditions, stable gradients, and a smoother loss landscape, except the reduction in ICS. We showed that this normalization fails to increase the performance of a network beyond that of an unnormalized network trained with a very small learning rate. This shows the importance of the reduction in ICS and that the alternative benefits are not sufficient to explain the success of BatchNorm. We also showed that the variants of BatchNorm are successful because they reduce ICS in a manner similar to BatchNorm. Finally, we found that some recent research has used the concept of ICS and the distributions involved in BatchNorm to solve difficult problems and to explain different behaviors of neural networks.

APPENDIX

Figs. 12–14 are added for better visibility.

ACKNOWLEDGMENT

The authors are very grateful to Muhammad Asim, Fahad Shamshad, and Dr. Salman Khan for helpful discussions, and to Eunseop Shin for technical support.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
[2] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," 2015, arXiv:1502.03167. [Online]. Available: http://arxiv.org/abs/1502.03167
[3] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, "How does batch normalization help optimization?" in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 2483-2493.
[4] N. Bjorck, C. P. Gomes, B. Selman, and K. Q. Weinberger, "Understanding batch normalization," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 7694-7705.

[5] J. Bruce and D. Erman, "A probabilistic approach to systems of parameters and noether normalization," 2016, arXiv:1604.01704. [Online]. Available: http://arxiv.org/abs/1604.01704
[6] Y. Wu and K. He, "Group normalization," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3-19.
[7] D. Arpit, Y. Zhou, B. U. Kota, and V. Govindaraju, "Normalization propagation: A parametric technique for removing internal covariate shift in deep networks," 2016, arXiv:1603.01431. [Online]. Available: http://arxiv.org/abs/1603.01431
[8] S. Ioffe, "Batch renormalization: Towards reducing minibatch dependence in batch-normalized models," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1945-1953.
[9] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," 2016, arXiv:1607.08022. [Online]. Available: http://arxiv.org/abs/1607.08022
[10] G. Wang, J. Peng, P. Luo, X. Wang, and L. Lin, "Batch Kalman normalization: Towards training deep neural networks with micro-batches," 2018, arXiv:1802.03133. [Online]. Available: http://arxiv.org/abs/1802.03133
[11] L. Huang, D. Yang, B. Lang, and J. Deng, "Decorrelated batch normalization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 791-800.
[12] P. Luo, J. Ren, Z. Peng, R. Zhang, and J. Li, "Differentiable learning-to-normalize via switchable normalization," 2018, arXiv:1806.10779. [Online]. Available: http://arxiv.org/abs/1806.10779
[13] W. Shao et al., "SSN: Learning sparse switchable normalization via SparsestMax," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 443-451.
[14] X. Pan, X. Zhan, J. Shi, X. Tang, and P. Luo, "Switchable whitening for deep representation learning," 2019, arXiv:1904.09739. [Online]. Available: http://arxiv.org/abs/1904.09739
[15] P. Luo, P. Zhanglin, S. Wenqi, Z. Ruimao, R. Jiamin, and W. Lingyun, "Differentiable dynamic normalization for learning deep representation," in Proc. Int. Conf. Mach. Learn., 2019, pp. 4203-4211.
[16] T. Cooijmans, N. Ballas, C. Laurent, Ç. Gülçehre, and A. Courville, "Recurrent batch normalization," 2016, arXiv:1603.09025. [Online]. Available: http://arxiv.org/abs/1603.09025
[17] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2414-2423.
[18] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," 2018, arXiv:1802.05957. [Online]. Available: http://arxiv.org/abs/1802.05957
[19] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 901-909.
[20] J. Kohler, H. Daneshmand, A. Lucchi, M. Zhou, K. Neymeyr, and T. Hofmann, "Towards a theoretical understanding of batch normalization," Stat, vol. 1050, p. 27, May 2018.
[21] X. Li, S. Chen, X. Hu, and J. Yang, "Understanding the disharmony between dropout and batch normalization by variance shift," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2682-2690.
[22] S. Arora, Z. Li, and K. Lyu, "Theoretical analysis of auto rate-tuning by batch normalization," 2018, arXiv:1812.03981. [Online]. Available: http://arxiv.org/abs/1812.03981
[23] G. Yang, J. Pennington, V. Rao, J. Sohl-Dickstein, and S. S. Schoenholz, "A mean field theory of batch normalization," 2019, arXiv:1902.08129. [Online]. Available: http://arxiv.org/abs/1902.08129
[24] M. Awais, F. Shamshad, and S.-H. Bae, "Towards an adversarially robust normalization approach," 2020, arXiv:2006.11007. [Online]. Available: http://arxiv.org/abs/2006.11007
[25] P. Luo, X. Wang, W. Shao, and Z. Peng, "Towards understanding regularization in batch normalization," 2018, arXiv:1809.00846. [Online]. Available: http://arxiv.org/abs/1809.00846
[26] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," Stat, vol. 1050, p. 21, Jul. 2016.
[27] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, "On the importance of initialization and momentum in deep learning," in Proc. Int. Conf. Mach. Learn., 2013, pp. 1139-1147.
[28] T. Popoviciu, "Sur les équations algébriques ayant toutes leurs racines réelles," M.S. thesis, Mathematica, vol. 9, pp. 129-145, 1935. [Online]. Available: http://www.numdam.org/article/THESE_1933__146__1_0.pdf
[29] E. Hellinger, "Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen," J. für die reine und angewandte Mathematik, vol. 1909, no. 136, pp. 210-271, 1909. [Online]. Available: https://www.degruyter.com/view/journals/crll/1909/136/crll.1909.issue-136.xml
[30] M. Wang and W. Deng, "Deep visual domain adaptation: A survey," Neurocomputing, vol. 312, pp. 135-153, Oct. 2018.
[31] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, "Revisiting batch normalization for practical domain adaptation," 2016, arXiv:1603.04779. [Online]. Available: http://arxiv.org/abs/1603.04779
[32] A. Athalye, N. Carlini, and D. Wagner, "Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples," 2018, arXiv:1802.00420. [Online]. Available: http://arxiv.org/abs/1802.00420
[33] E. Wong, L. Rice, and J. Zico Kolter, "Fast is better than free: Revisiting adversarial training," 2020, arXiv:2001.03994. [Online]. Available: http://arxiv.org/abs/2001.03994
[34] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry, "Robustness may be at odds with accuracy," 2018, arXiv:1805.12152. [Online]. Available: http://arxiv.org/abs/1805.12152
[35] A. Raghunathan, S. Michael Xie, F. Yang, J. C. Duchi, and P. Liang, "Adversarial training can hurt generalization," 2019, arXiv:1906.06032. [Online]. Available: http://arxiv.org/abs/1906.06032
[36] C. Xie, M. Tan, B. Gong, J. Wang, A. Yuille, and Q. V. Le, "Adversarial examples improve image recognition," 2019, arXiv:1911.09665. [Online]. Available: http://arxiv.org/abs/1911.09665
[37] M. Haroush, I. Hubara, E. Hoffer, and D. Soudry, "The knowledge within: Methods for data-free model compression," 2019, arXiv:1912.01274. [Online]. Available: http://arxiv.org/abs/1912.01274
[38] A. Gholami, M. W. Mahoney, and K. Keutzer, "An integrated approach to neural network design, training, and inference," Univ. California, Berkeley, Berkeley, CA, USA, Tech. Rep., 2020.

Muhammad Awais received the B.S. degree from the University of Engineering and Technology (UET), Lahore, Pakistan, in 2016, and the M.S. degree from Information Technology University (ITU), Lahore, in 2019. He is currently pursuing the Ph.D. degree with the Machine Learning and Visual Computing (MLVC) Lab, Kyung Hee University, Yongin, South Korea.
From 2016 to 2019, he was a Research Associate with the Signal Processing and Information Decoding (SPIDER) Lab, ITU. He is currently an Assistant Researcher with the AI Theory Group, Noah's Ark Lab, Huawei, Hong Kong. His current research interests include deep learning model architectures, adversarial attacks/defenses, and transfer learning.

Md. Tauhid Bin Iqbal (Member, IEEE) received the bachelor's degree in information technology from the University of Dhaka, Dhaka, Bangladesh, in 2012, and the Ph.D. degree in computer science and engineering from Kyung Hee University, Yongin, South Korea, in 2019.
He is affiliated with the Machine Learning and Visual Computing (MLVC) Lab for his post-doctoral research at Kyung Hee University. His current research interests include deep learning interpretation, adversarial attack/defense, and facial analysis and recognition.

Sung-Ho Bae (Member, IEEE) received the B.S. degree from Kyung Hee University, Yongin, South Korea, in 2011, and the M.S. and Ph.D. degrees from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2012 and 2016, respectively.
From 2016 to 2017, he was a Post-Doctoral Associate with the Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, MA, USA. Since 2017, he has been an Assistant Professor with the Department of Computer Science and Engineering (CSE), Kyung Hee University. His current research interests include model compression, interpretation, adversarial attack/defense, and neural architecture search for deep neural networks.
