Misc ML Concepts
Batch Normalization
- During training, for each mini-batch of training examples, batch normalization normalizes the inputs of a specific layer by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. This normalization step ensures that the inputs to the layer have zero mean and unit variance.
- After normalization, batch normalization applies a scale and shift transformation to the normalized inputs. The scale parameter allows the network to learn the optimal scaling for the normalized inputs, while the shift parameter allows the network to learn the optimal bias. (A minimal code sketch of both steps follows this list.)
- By normalizing the inputs, batch normalization helps to keep the activations in a reasonable range throughout the network. This mitigates the vanishing gradient problem, since gradients are less likely to become extremely small as they propagate through the layers, enabling more stable and efficient training of deep neural networks.
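To make the two steps above concrete, here is a minimal NumPy sketch of the training-time forward pass. The function name, tensor shapes, and the eps constant are illustrative assumptions rather than any particular library's API; real frameworks additionally track running statistics so that fixed mean and variance estimates can be used at inference.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization forward pass for one mini-batch (training time).

    x:     (N, D) mini-batch of layer inputs
    gamma: (D,) learned scale parameter
    beta:  (D,) learned shift parameter
    eps:   small constant for numerical stability (illustrative value)
    """
    # Per-feature mini-batch statistics
    mu = x.mean(axis=0)    # mini-batch mean
    var = x.var(axis=0)    # mini-batch variance
    # Normalize to zero mean and unit variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Learned scale and shift restore representational flexibility
    return gamma * x_hat + beta

# Usage: a mini-batch of 4 examples with 3 features
x = np.random.randn(4, 3)
gamma = np.ones(3)   # initial scale
beta = np.zeros(3)   # initial shift
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0))  # approximately 0 per feature
print(y.std(axis=0))   # approximately 1 per feature
```

With gamma initialized to ones and beta to zeros, the output is simply the normalized input; during training the network adjusts both parameters so it can learn whatever scaling and bias serve the task, including undoing the normalization if that turns out to be optimal.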