Normalization Techniques
• Normalization: rescales the data to a fixed range, e.g. values from 1 to 100 are mapped to 0 to 1.
• Standardization: x = (x − mean) / std. The data will have a mean of 0 and a standard deviation of 1. The two terms are sometimes referred to as the same thing.
Without normalization, updates are dominated by the features with larger gradients; with normalized inputs, all updates are equally proportional.
By normalizing all of our inputs to a standard scale, we're allowing the network to more quickly learn the optimal parameters
for each input node
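As a minimal sketch of this preprocessing step (the feature values and NumPy usage here are illustrative assumptions, not from the slides):

```python
import numpy as np

# Toy design matrix: 4 samples x 3 input features on very different scales
# (values are made up for illustration).
X = np.array([[1.0, 200.0, 0.002],
              [3.0, 150.0, 0.004],
              [2.0, 400.0, 0.001],
              [5.0, 100.0, 0.003]])

# Standardize each input feature (column): x = (x - mean) / std
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~0 for every feature
print(X_std.std(axis=0))   # ~1 for every feature
```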
Batch Normalization
So we said that the inputs to the neural network are normalized (the normalized inputs X1, X2, X3, X4 in the figure).
What if one (or some) of the neurons have high values? They would also cause the neurons depending on them to have large values, so the network would still be unstable. What would be the solution?
These activations are the inputs to the
next layer, so let’s also normalize them
• We normalize each feature (each neuron) in a layer according to the batch
of samples. The mean and standard deviation are calculated across a batch
of samples.
Consider the example of normalizing the first neuron in the output layer. We are normalizing the neuron across the output values of a batch for a single neuron (batch size is 3 in this case), with example outputs 2.3, 1.2, and 0.8 for Samples 1, 2, and 3, as shown in the sketch below.
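A minimal sketch of this single-neuron case, using the three example outputs above (the epsilon term is an assumption for numerical stability):

```python
import numpy as np

# Outputs of the first neuron for a batch of 3 samples (values from the example above).
neuron_outputs = np.array([2.3, 1.2, 0.8])

# Batch norm statistics are computed across the batch for this one neuron.
mu = neuron_outputs.mean()
var = neuron_outputs.var()
eps = 1e-9  # small constant assumed for numerical stability

normalized = (neuron_outputs - mu) / np.sqrt(var + eps)
print(normalized)  # zero mean, unit variance across the batch
```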
Batch Normalization
• Normalization of each output neuron in each hidden layer, rather than only for the input layer, for a
mini-batch of data. In other words, it reduces the amount that the hidden layer values shift around.
In the figure, a Batch Norm step is applied after each hidden layer, starting from the normalized input.
The mean and standard deviation are calculated across a batch of samples.
We don’t always want to normalize our values to have zero mean and unit variance; this would perform poorly for some activation functions (e.g., sigmoid). Thus, we learn the best distribution by scaling our normalized values by 𝛾 and shifting them by 𝛽.
Learnable Parameters
https://arxiv.org/abs/1502.03167
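A minimal training-time sketch of batch norm with the learnable 𝛾 (scale) and 𝛽 (shift); the batch shape, initial values, and epsilon are illustrative assumptions (roughly what layers such as torch.nn.BatchNorm1d do internally during training):

```python
import torch

activations = torch.randn(32, 4)   # mini-batch of 32 samples, 4 hidden-layer neurons (assumed shape)

# Per-neuron statistics, computed across the batch dimension.
mu = activations.mean(dim=0)
var = activations.var(dim=0, unbiased=False)

# Learnable parameters, one per neuron: gamma scales, beta shifts.
gamma = torch.ones(4, requires_grad=True)
beta = torch.zeros(4, requires_grad=True)

eps = 1e-5
x_hat = (activations - mu) / torch.sqrt(var + eps)  # zero mean, unit variance per neuron
y = gamma * x_hat + beta                            # learned distribution, not forced to N(0, 1)
```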
Layer Normalization:
We normalize each feature (each neuron) in a layer according to the features in that layer. The mean and standard deviation are calculated across the features of a layer (across the outputs of a layer).
Layer Normalization
The mean and standard deviation are calculated across the layer outputs.

At the output of a feedforward layer, suppose the four neuron outputs are −2.3, 1.9, 2.7, and −3.4.

Mean across the layer: (−2.3 + 1.9 + 2.7 − 3.4) / 4 = −0.275
Variance across the layer: ((−2.3 + 0.275)² + (1.9 + 0.275)² + (2.7 + 0.275)² + (−3.4 + 0.275)²) / 4 ≈ 6.861

Each output is then normalized, scaled by 𝜃1, and shifted by 𝜃2:

(−2.3 + 0.275) / √(6.861 + 10⁻⁹) ≈ −0.77  →  −0.77 𝜃1 + 𝜃2
(1.9 + 0.275) / √(6.861 + 10⁻⁹) ≈ 0.83  →  0.83 𝜃1 + 𝜃2
(2.7 + 0.275) / √(6.861 + 10⁻⁹) ≈ 1.13  →  1.13 𝜃1 + 𝜃2
(−3.4 + 0.275) / √(6.861 + 10⁻⁹) ≈ −1.19  →  −1.19 𝜃1 + 𝜃2
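The same worked example as a small sketch (the four output values and the 10⁻⁹ epsilon come from the slide; 𝜃1 and 𝜃2 are set to illustrative values):

```python
import numpy as np

# Outputs of one feedforward layer for a single sample (values from the example above).
layer_outputs = np.array([-2.3, 1.9, 2.7, -3.4])

# Layer norm statistics are computed across the neurons of the layer.
mu = layer_outputs.mean()    # -0.275
var = layer_outputs.var()    # ~6.861
eps = 1e-9

normalized = (layer_outputs - mu) / np.sqrt(var + eps)
print(normalized)  # approx. [-0.773, 0.830, 1.136, -1.193]

# Each normalized value is scaled and shifted by the learnable parameters
# (theta_1 and theta_2 in the slide, often called gamma and beta):
theta_1, theta_2 = 1.0, 0.0  # illustrative initial values
out = theta_1 * normalized + theta_2
```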
Differences
• All the normalization techniques are calculated using: x = (x − mean) / std
• Batch Normalization and Layer Normalization are performed in different directions (they differ only in how the mean and standard deviation are calculated).
• For batch normalization, input values of the same neuron from different images in one mini-batch are normalized together. In layer normalization, input values for different neurons in the same layer are normalized together, without consideration of the mini-batch, as sketched below.
https://www.quora.com/What-are-the-practical-differences-between-batch-normalization-and-layer-normalization-in-deep-neural-networks
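A minimal sketch of that directional difference, assuming the hidden-layer activations are arranged as a (batch, neurons) matrix:

```python
import numpy as np

# Hidden-layer activations: rows are samples in the mini-batch, columns are neurons.
acts = np.random.randn(3, 4)   # assumed shape: batch of 3, layer of 4 neurons
eps = 1e-9

# Batch norm: statistics per neuron, computed down the batch dimension (axis 0).
bn = (acts - acts.mean(axis=0)) / np.sqrt(acts.var(axis=0) + eps)

# Layer norm: statistics per sample, computed across the neurons (axis 1).
ln = (acts - acts.mean(axis=1, keepdims=True)) / np.sqrt(acts.var(axis=1, keepdims=True) + eps)
```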
Group Normalization
Motivation
• Inspired by the HOG/SIFT feature extraction methods in image processing:
1. Split into groups.
2. Compute the mean and standard deviation for each group, and normalize each group to have zero mean and unit variance.
Group Normalization
The input channels are separated into num_groups groups, each containing num_channels / num_groups channels. The mean and standard deviation are calculated separately over each group.
In the figure above for Group Norm: separate the 6 channels into 2 groups.
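A minimal sketch using torch.nn.GroupNorm, matching the 6-channel / 2-group split from the figure (batch size and spatial size are arbitrary assumptions):

```python
import torch
import torch.nn as nn

# Feature map: batch of 2, 6 channels, 8x8 spatial (assumed sizes).
x = torch.randn(2, 6, 8, 8)

# Separate the 6 channels into 2 groups of 3 channels each;
# mean and std are computed separately over each group, per sample.
group_norm = nn.GroupNorm(num_groups=2, num_channels=6)
y = group_norm(x)
```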