Normalization Techniques

The document explains the differences between normalization, standardization, batch normalization, layer normalization, and group normalization in neural networks. Normalization and standardization help to scale data to improve model stability and convergence speed, while batch normalization normalizes outputs across mini-batches, and layer normalization normalizes across features within a layer. Group normalization is introduced as an alternative that normalizes features within groups, which is beneficial for small batch sizes.


Firstly, What’s the Difference between Normalization and Standardization?

• Normalization: rescales the data to a fixed range, typically [0, 1] (e.g. values from 1 to 100 are mapped to 0 to 1): 𝑥 = (𝑥 − 𝑚𝑖𝑛) / (𝑚𝑎𝑥 − 𝑚𝑖𝑛)
• Standardization: 𝑥 = (𝑥 − 𝑚𝑒𝑎𝑛) / 𝑠𝑡𝑑 → the data will have a mean of 0 and a standard deviation of 1. (The two terms are sometimes referred to as the same thing.)
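
As a minimal sketch of both operations in PyTorch (the tensor values below are just an illustrative example):

```python
import torch

x = torch.tensor([1., 25., 50., 100.])  # raw feature values

# Normalization (min-max scaling): maps the values into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit standard deviation
x_std = (x - x.mean()) / x.std(unbiased=False)

print(x_norm)  # tensor([0.0000, 0.2424, 0.4949, 1.0000])
print(x_std)   # mean ≈ 0, std ≈ 1
```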

But why do we need to do this in the first place?

If you don’t, features with higher values will dominate those with lower values, making the lower-valued features effectively useless in the dataset. As a result, the network will be unstable, which increases the convergence time.
→ All our data should be on the same scale
If we have a simple neural network with two inputs: the first input varies from 0 to 1, while the second varies from 0 to 0.01. Through a series of linear combinations and nonlinear activations of a neural network, the parameters associated with each input will also exist on different scales. Therefore, the larger values will dominate and the smaller ones will be ignored in the weight updates.

Figure: without normalization, updates are dominated by the larger gradients; with normalization, all updates are equally proportional.

By normalizing all of our inputs to a standard scale, we're allowing the network to more quickly learn the optimal parameters
for each input node
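
As a rough illustration of putting both inputs on the same scale (a sketch; the value ranges follow the example above and the sample count of 100 is arbitrary):

```python
import torch

# Two input features on very different scales, as in the example above
x = torch.stack([torch.rand(100),           # feature 1: values in [0, 1]
                 torch.rand(100) * 0.01],   # feature 2: values in [0, 0.01]
                dim=1)                      # shape (100, 2)

# Standardize each feature (column) independently
x_scaled = (x - x.mean(dim=0)) / x.std(dim=0)
print(x_scaled.std(dim=0))  # both features now have roughly unit scale
```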
Batch Normalization
So we said that the inputs to the neural network are normalized.

Figure: normalized inputs X1, X2, X3, X4 feeding into the network.
What if one (or some) of the neurons have high values? They would also cause the neurons that depend on them to have large values, so the network would still be unstable. What would be the solution?
These activations are the inputs to the next layer, so let’s also normalize them.
• We normalize each feature (each neuron) in a layer according to the batch of samples. The mean and standard deviation are calculated across a batch of samples.

Figure: the first neuron in the output layer produces 2.3 for Sample 1, 1.2 for Sample 2, and 0.8 for Sample 3.

Consider the example of normalizing the first neuron in the output layer. We are normalizing the neuron across the output values of a batch for a single neuron (the batch size is 3 in this case), as sketched below.
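
A minimal sketch of this computation, using the three values from the figure (the ε, γ, and β terms anticipate the batch-norm formula discussed next):

```python
import torch

# Outputs of the first neuron for a batch of 3 samples (values from the figure)
z = torch.tensor([2.3, 1.2, 0.8])

eps = 1e-5                                 # small constant to avoid division by zero
mu = z.mean()                              # mean over the batch
var = z.var(unbiased=False)                # variance over the batch
z_hat = (z - mu) / torch.sqrt(var + eps)   # normalized outputs: zero mean, unit variance

# Learnable scale and shift (batch norm initializes them to 1 and 0)
gamma, beta = torch.tensor(1.0), torch.tensor(0.0)
y = gamma * z_hat + beta
```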
Batch Normalization
• Normalization of each output neuron in each hidden layer, rather than only for the input layer, for a
mini-batch of data. In other words, it reduces the amount that the hidden layer values shift around.

In addition to this, we have batch norm applied at chosen layers.

Figure: a network with a normalized input and Batch Norm blocks inserted after hidden layers.

The mean and standard deviation are calculated across a batch of samples.

We don’t always want to normalize our values to have zero mean and unit variance. This would perform poorly for some activation functions (e.g. the sigmoid, where zero-mean, unit-variance inputs would be confined to its roughly linear region). Thus, we learn the best distribution by scaling our normalized values by 𝛾 and shifting them by 𝛽.

Learnable parameters 𝛾 and 𝛽 (https://arxiv.org/abs/1502.03167):

𝑥̂ = (𝑥 − 𝜇) / √(𝜎² + 𝜖),   𝑦 = 𝛾 · 𝑥̂ + 𝛽        (𝜇, 𝜎²: mean and variance over the mini-batch)

𝜖: a very small number to prevent dividing by zero. If we set 𝛾 = 1 and 𝛽 = 0, we just have standardization.
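
In PyTorch this corresponds to the BatchNorm modules, where weight is 𝛾 and bias is 𝛽; a minimal sketch (the feature and batch sizes are arbitrary):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4, eps=1e-5)  # one gamma/beta pair per feature
x = torch.randn(8, 4)                          # batch of 8 samples, 4 features

y = bn(x)             # normalize each feature over the batch, then scale and shift
print(bn.weight)      # gamma, initialized to 1
print(bn.bias)        # beta, initialized to 0
print(y.mean(dim=0))  # ≈ 0 per feature, since gamma = 1 and beta = 0 at initialization
```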
Benefits of Batch Normalization
• Converges faster and reduces the need for dropout
• We've allowed the network to normalize a layer into whichever
distribution is most optimal for learning (since we are learning the
scaling and shifting parameters)
• Eliminates the need for a bias in a layer when batch normalization is applied to it, since we are already shifting the normalized values with the 𝛽 parameter (see the sketch after this list).
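
A small illustration of the last point (a sketch; the layer sizes are arbitrary):

```python
import torch.nn as nn

# The linear layer's bias is redundant: BatchNorm1d's beta already shifts the activations
block = nn.Sequential(
    nn.Linear(128, 64, bias=False),
    nn.BatchNorm1d(64),
    nn.ReLU(),
)
```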
Layer Normalization
Batch Normalization:
We normalize each feature (each neuron) in a layer according to the batch of samples. The mean and standard deviation are calculated across a batch of samples.

Figure: the batch direction — the first neuron’s outputs 2.3, 1.2, 0.8 for Sample 1, Sample 2, Sample 3 are normalized together.

Layer Normalization:
We normalize each feature (each neuron) in a layer according to the features in that layer. The mean and standard deviation are calculated across the features of a layer (across the outputs of a layer).
Layer Normalization

At the output of a feedforward layer, the mean and standard deviation are calculated across the layer outputs.

Mean over all neurons in the layer:   𝜇 = (1/m) Σᵢ 𝑎ᵢ        (m: the number of neurons in that layer)

Variance over all neurons in the layer:   𝜎² = (1/m) Σᵢ (𝑎ᵢ − 𝜇)²

For each neuron in the output layer:
• Normalize: subtract the mean and divide by the standard deviation:   𝑎̂ᵢ = (𝑎ᵢ − 𝜇) / √(𝜎² + 𝜖)
• Scale by a learnable value 𝜃1 and shift by a learnable value 𝜃2:   𝑦ᵢ = 𝜃1 · 𝑎̂ᵢ + 𝜃2

Learnable parameters: 𝜃1, 𝜃2
Worked example: a layer with four outputs −2.3, 1.9, 2.7, −3.4.

Mean:   𝜇 = (−2.3 + 1.9 + 2.7 − 3.4) / 4 = −0.275

Variance:   𝜎² = [(−2.3 + 0.275)² + (1.9 + 0.275)² + (2.7 + 0.275)² + (−3.4 + 0.275)²] / 4 = 6.861

Normalize each output and apply the learnable scale and shift (with 𝜖 = 10⁻⁹):
(−2.3 + 0.275) / √(6.861 + 10⁻⁹) = −0.77   →   −0.77·𝜃1 + 𝜃2
(1.9 + 0.275) / √(6.861 + 10⁻⁹) = 0.83   →   0.83·𝜃1 + 𝜃2
(2.7 + 0.275) / √(6.861 + 10⁻⁹) = 1.13   →   1.13·𝜃1 + 𝜃2
(−3.4 + 0.275) / √(6.861 + 10⁻⁹) = −1.19   →   −1.19·𝜃1 + 𝜃2
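
The same numbers can be reproduced in PyTorch, either manually or with nn.LayerNorm (a sketch; disabling the affine parameters corresponds to 𝜃1 = 1, 𝜃2 = 0):

```python
import torch
import torch.nn as nn

a = torch.tensor([-2.3, 1.9, 2.7, -3.4])   # the four layer outputs from the example

# Manual layer normalization over the outputs of the layer
mu = a.mean()                              # -0.275
var = a.var(unbiased=False)                # ≈ 6.861
a_hat = (a - mu) / torch.sqrt(var + 1e-9)
print(a_hat)                               # tensor([-0.7730,  0.8303,  1.1357, -1.1930])

# Equivalent with nn.LayerNorm; affine disabled corresponds to theta1 = 1, theta2 = 0
ln = nn.LayerNorm(4, eps=1e-9, elementwise_affine=False)
print(ln(a.unsqueeze(0)))                  # same values
```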
Differences
• All the normalizations are calculated using: 𝑥 = (𝑥 − 𝑚𝑒𝑎𝑛) / 𝑠𝑡𝑑
• Batch Normalization and Layer Normalization are performed in different directions (they differ only in how the mean and standard deviation are calculated).
• For batch normalization, input values of the same neuron from different images in one mini-batch are normalized. In layer normalization, input values for different neurons in the same layer are normalized, without consideration of the mini-batch (see the sketch below).

https://www.quora.com/What-are-the-practical-differences-between-batch-normalization-and-layer-normalization-in-deep-neural-networks
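
A small sketch of this direction difference on a (batch, neurons) activation matrix (the shape 32 × 10 is just illustrative):

```python
import torch

x = torch.randn(32, 10)  # activations: 32 samples in the batch, 10 neurons in the layer

# Batch norm direction: statistics per neuron, computed over the batch dimension
bn_mean = x.mean(dim=0)                          # shape (10,)
bn_std = x.std(dim=0, unbiased=False)
x_bn = (x - bn_mean) / bn_std

# Layer norm direction: statistics per sample, computed over the neurons of the layer
ln_mean = x.mean(dim=1, keepdim=True)            # shape (32, 1)
ln_std = x.std(dim=1, unbiased=False, keepdim=True)
x_ln = (x - ln_mean) / ln_std
```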
Group Normalization
Motivation
• From the HOG/SIFT feature extraction methods in image processing: the feature vector is group-wise. Each histogram is normalized independently, therefore the feature vector is normalized within each group.

Source: Group Normalization ECCV Presentation Video


• The deep features can have internal group-wise sub-structures → normalize within the group:

1. Split the channels into groups
2. Compute the mean and standard deviation for each group, and normalize each group to have zero mean and unit variance
Group Normalization

The input channels are separated into num_groups groups, each containing num_channels / num_groups channels. The mean and standard deviation are calculated separately over each group.

In the figure above for Group Norm: separate 6 channels into 2 groups.

If we put all 6 channels into a single group → equivalent to LayerNorm.


Example: if the number of groups is 2, compute 𝜇 and 𝜎 and normalize within the first group of channels, then compute 𝜇 and 𝜎 and normalize within the second group, as sketched below.
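
A minimal PyTorch sketch (the channel count of 6 and group count of 2 follow the example above; the batch and spatial sizes are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 6, 8, 8)  # (batch, channels, height, width)

# 6 channels split into 2 groups of 3; mean and std are computed per group, per sample
gn = nn.GroupNorm(num_groups=2, num_channels=6)
y = gn(x)

# A single group normalizes over all channels of a sample,
# which is equivalent to LayerNorm over (C, H, W)
gn_as_ln = nn.GroupNorm(num_groups=1, num_channels=6)
```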

When is GroupNorm more helpful than BatchNorm?


• In BatchNorm, the error rate increases when a small batch size is used, because the batch statistics become noisy.
• However, this problem is avoided in GroupNorm, since its statistics do not depend on the batch dimension.
