Deep Neural Networks
Topics: the Vanishing/Exploding Gradients Problem, avoiding overfitting
through regularization, and Dropout Regularization.
Introduction:
• A neural network with two or more hidden layers can be called a Deep
Neural Network (DNN).
• When handling a complex problem, such as detecting hundreds of
types of objects in high-resolution images, you may need to train a
much deeper DNN, perhaps with 10 layers, each containing hundreds of
neurons and hundreds of thousands of connections between them.
• Training such a deep network leads to the problem of vanishing gradients.
Training a Neural Network
• Training a neural network involves updating its parameters (weights and biases) to
minimize a loss function based on the difference between predicted outputs and the
actual labels (in supervised learning).
• This is typically done using an optimization algorithm like gradient descent or one of its
variants.
• The general steps involved in training a neural network are outlined below:
1. Forward Propagation
2. Loss Function
3. Backpropagation
4. Gradient Descent
5. Evaluation
6. Stopping Criteria
1. Forward Propagation:
• In forward propagation, the input data is passed through the neural
network to generate the output (predictions).
• This involves calculating the weighted sum of inputs and applying an
activation function at each layer.
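As a rough sketch (assuming NumPy is available), the snippet below shows forward propagation through two dense layers: each layer computes a weighted sum of its inputs and applies a sigmoid activation. The layer sizes, weights, and input values are arbitrary examples, not values from the text.

import numpy as np

def sigmoid(z):
    # Squashes each value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    # layers is a list of (W, b) pairs; each layer computes sigmoid(W @ a + b)
    a = x
    for W, b in layers:
        z = W @ a + b          # weighted sum of the layer's inputs
        a = sigmoid(z)         # activation function applied element-wise
    return a

# Tiny example: 3 inputs -> 4 hidden neurons -> 2 outputs
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
x = np.array([0.5, -1.2, 3.0])
print(forward(x, layers))      # the network's predictions for this input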
2. Loss Function
• The loss function, L, measures how well the neural network’s
predictions match the actual target values.
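For illustration, two commonly used loss functions, mean squared error for regression and cross-entropy for classification, can be written directly from their definitions (assuming NumPy); the tiny example values below are arbitrary.

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error, typical for regression
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Cross-entropy for one-hot targets, typical for classification
    return -np.sum(y_true * np.log(y_pred + eps))

print(mse(np.array([1.0, 2.0]), np.array([1.1, 1.8])))                # 0.025
print(cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))  # ~0.357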
3. Back propagation
• Backpropagation is the essence of neural network training. It is the
method of fine-tuning the weights of a neural network based on the
error rate obtained in the previous epoch (i.e., iteration).
• Proper tuning of the weights allows you to reduce error rates and
make the model reliable by increasing its generalization.
• Backpropagation is short for "backward propagation of errors."
• It is the standard method of training artificial neural networks: it
calculates the gradient of the loss function with respect to all the
weights in the network.
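The following minimal sketch (assuming NumPy) applies the chain rule by hand to a one-hidden-layer network with a squared-error loss, to show how the error gradient flows backwards from the output to every weight. The network sizes and values are arbitrary illustrations.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny network: 2 inputs -> 3 hidden (sigmoid) -> 1 output (linear)
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, y = np.array([0.4, -0.7]), np.array([1.0])

# Forward pass (intermediate values are kept for the backward pass)
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y_hat = z2             # linear output
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: chain rule from the loss back to every weight
d_z2 = y_hat - y                          # dL/dz2
d_W2 = np.outer(d_z2, a1)                 # dL/dW2
d_b2 = d_z2
d_a1 = W2.T @ d_z2                        # error propagated to the hidden layer
d_z1 = d_a1 * a1 * (1 - a1)               # sigmoid'(z1) = a1 * (1 - a1)
d_W1 = np.outer(d_z1, x)                  # dL/dW1
d_b1 = d_z1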
4. Gradient Descent
• After computing the gradients of the loss function with respect to the
weights and biases, the parameters are updated using gradient
descent.
• Gradient descent minimizes the loss function by iteratively adjusting
the parameters in the opposite direction of the gradient.
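As a minimal illustration of the update rule itself, the sketch below runs plain gradient descent on a toy one-parameter loss L(w) = (w - 3)^2; the learning rate, starting point, and number of steps are arbitrary choices.

# Gradient descent on a toy loss L(w) = (w - 3)^2, whose gradient is 2*(w - 3)
w = 0.0                             # initial parameter value
learning_rate = 0.1

for step in range(50):
    grad = 2 * (w - 3)              # dL/dw
    w -= learning_rate * grad       # step in the opposite direction of the gradient

print(w)   # converges toward 3, the minimizer of the loss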
5. Evaluation
• After each epoch (a complete pass through the training data), the
performance of the neural network is evaluated on a validation set to
monitor generalization.
• Metrics like accuracy (for classification) or RMSE (for regression) are
used to check the performance.
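For example, both metrics can be computed in one line each (assuming NumPy); the prediction and label arrays below are made up for illustration.

import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of predictions that exactly match the labels (classification)
    return np.mean(y_true == y_pred)

def rmse(y_true, y_pred):
    # Root mean squared error (regression)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(accuracy(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])))   # 0.75
print(rmse(np.array([2.0, 4.0]), np.array([2.5, 3.5])))           # 0.5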
6. Stopping Criteria
• Training continues until a predefined stopping criterion is met.
Common stopping criteria include:
• A fixed number of epochs.
• Early stopping, where training is stopped when the performance on
the validation set stops improving (to avoid overfitting).
Vanishing Gradient
• The backpropagation algorithm works by going from the output layer to
the input layer, propagating the error gradient along the way.
• Gradients often get smaller and smaller as the algorithm progresses
down to the lower layers.
• As a result, the gradient descent updates leave the lower-layer weights
virtually unchanged.
• Consequently, training never converges to a good solution.
• This is called the vanishing gradients problem.
Causes of the Vanishing Gradient Problem
1. Activation Functions:
• Sigmoid and tanh functions can squash input values into a small range (0 to 1 for
sigmoid, -1 to 1 for tanh), leading to derivatives (gradients) that are small.
2. Weight Initialization:
• Poor initialization of weights can lead to outputs that are either very large or very
small, which when fed into activation functions, result in small gradients.
3. Deep Architectures:
• As the number of layers increases, the gradients become increasingly
likely to diminish exponentially.
• Each layer's small gradients compound across many layers, leading to
vanishing gradients (see the numeric sketch below).
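A quick numeric sketch of this compounding effect: the sigmoid's derivative never exceeds 0.25, so multiplying one such factor per layer drives the gradient toward zero as depth grows. The layer counts below are arbitrary.

import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

print(sigmoid_derivative(0.0))    # 0.25, the largest value the derivative can take

# Even in this best case, multiplying one such factor per layer makes the
# gradient shrink exponentially with depth:
for depth in (5, 10, 20):
    print(depth, 0.25 ** depth)   # 0.25 ** 20 is about 9e-13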
Exploding Gradients
• Sometimes the gradients can grow bigger and bigger, so many layers
get very large weight updates and the algorithm diverges.
• This is called exploding gradients, and it is most commonly seen in
recurrent neural networks.
Causes of the Exploding Gradient Problem
1. Weight Initialization:
• Poor initialization of weights can lead to very large gradient values if
the weights are not scaled appropriately.
2. Deep Architectures and Long Sequences:
• Deep neural networks and RNNs with many layers or long sequences
exacerbate the problem due to the compounding effect of gradients.
Solutions to Vanishing/Exploding Gradients
• Xavier/He Initialization: Proper weight initialization methods like Xavier
initialization (for sigmoid/tanh) or He initialization (for ReLU) can help
prevent the gradients from shrinking or exploding. They normalize the
variance of the inputs and outputs at each layer, making learning more
stable.
• Batch Normalization: This technique normalizes the input to each layer,
ensuring that the distribution of input values stays consistent across layers,
preventing gradients from either vanishing or exploding. Batch normalization
also helps speed up training and can have a slight regularization effect.
• Gradient Clipping: In cases where exploding gradients occur, gradient
clipping can be applied. This technique scales down the gradients if they
exceed a certain threshold to keep the updates under control.
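A rough sketch of two of these remedies in plain NumPy is given below: He initialization draws weights with variance 2/fan_in, and gradient clipping rescales a gradient whose L2 norm exceeds a threshold. The layer sizes and the threshold value are illustrative assumptions; real frameworks provide these features built in.

import numpy as np

rng = np.random.default_rng(42)

def he_init(fan_in, fan_out):
    # He initialization: zero-mean weights with variance 2 / fan_in (ReLU layers)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def clip_gradient(grad, threshold=1.0):
    # Gradient clipping: rescale the gradient if its L2 norm is too large
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

W = he_init(256, 128)                       # weights for a 256 -> 128 layer
g = clip_gradient(rng.normal(size=W.shape) * 10.0)
print(np.linalg.norm(g))                    # <= 1.0 after clipping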
Avoiding Overfitting Through Regularization
• Deep neural networks typically have tens of thousands of parameters.
• With so many parameters, the network is prone to overfitting the
training set.
• Overfitting can be reduced by using "regularization" techniques.
• Some of the popular regularization techniques are:
• Early Stopping
• Dropout
• Max-Norm Regularization and
• Data Augmentation.
Early Stopping
• A good solution to avoid overfitting the training set is early stopping:
interrupt training when the model's performance on the validation set
starts dropping.
• Evaluate the model on a validation set at regular intervals.
• If the performance has not improved for some number of intervals,
roll back to the previously saved best parameters and stop training.
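A minimal sketch of this early-stopping loop is shown below. The model, its training step, and its validation metric are hypothetical stand-ins (a single parameter, a random update, and a toy score), and the patience value is an arbitrary choice; only the stopping logic itself reflects the description above.

import copy
import random

# Hypothetical stand-ins for a real model and its training/validation routines
model_params = {"w": 0.0}
def train_one_epoch(params): params["w"] += random.uniform(-0.1, 0.1)
def evaluate(params): return -abs(params["w"] - 0.3)   # higher is better

best_score, best_params = float("-inf"), None
patience, stale_epochs, max_epochs = 5, 0, 100

for epoch in range(max_epochs):
    train_one_epoch(model_params)
    score = evaluate(model_params)
    if score > best_score:                         # validation performance improved
        best_score, stale_epochs = score, 0
        best_params = copy.deepcopy(model_params)  # snapshot the best model
    else:
        stale_epochs += 1
        if stale_epochs >= patience:               # no improvement for a while
            model_params = best_params             # roll back to the best snapshot
            break                                  # stop training early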
Dropout
• Arguably the most popular regularization technique for deep neural
networks is dropout.
• At every training step, every neuron has a probability p of being
temporarily "dropped out," meaning it will be entirely ignored during
this training step.
• It may be active again during the next step.
• The hyperparameter p is called the dropout rate and is typically set
to 50%.
• After training, neurons are no longer dropped.
• In practice, this simple technique has been found to work remarkably well.
Max-Norm Regularization
• Another regularization technique that is quite popular for neural
networks is called max-norm regularization.
• It constrains the weights w of the incoming connections such that
∥w∥₂ ≤ r, where r is the max-norm hyperparameter and ∥·∥₂ is the ℓ2
norm.
• It is typically implemented by computing ∥w∥₂ after each training
step and rescaling w if needed (w ← w · r / ∥w∥₂).
• Reducing r increases the amount of regularization and helps reduce
overfitting.
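A minimal sketch of the max-norm constraint, assuming the incoming weights of each neuron are stored as one row of a NumPy matrix; r is the max-norm hyperparameter from the text and the matrix size is arbitrary.

import numpy as np

def apply_max_norm(W, r=1.0):
    # Rescale each neuron's incoming weight vector (one row of W) so that
    # its L2 norm never exceeds r; applied after each training step.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, r / np.maximum(norms, 1e-12))
    return W * scale

W = np.random.default_rng(0).normal(size=(4, 8)) * 3.0
W = apply_max_norm(W, r=2.0)
print(np.linalg.norm(W, axis=1))   # every row norm is now <= 2.0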
Data Augmentation
• One last regularization technique is data augmentation.
• It consists of generating new training instances from existing ones,
artificially boosting the size of the training set.
• This reduces overfitting, which makes it a regularization technique.
• The trick is to generate realistic training instances: ideally, a human
should not be able to tell which instances were generated and which
ones were not.
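For image data, even very simple transformations such as horizontal flips and small shifts multiply the number of training instances. The sketch below uses plain NumPy; the 28x28 image size is an arbitrary example, and the wrap-around shift is used only to keep the code short.

import numpy as np

def augment(image, shift=2):
    # Return a few variants of one training image: the original,
    # a horizontal flip, and small vertical/horizontal shifts.
    variants = [image, np.fliplr(image)]
    variants.append(np.roll(image, shift, axis=0))   # shift down (wraps around)
    variants.append(np.roll(image, shift, axis=1))   # shift right (wraps around)
    return variants

image = np.random.default_rng(0).random((28, 28))    # e.g. one grayscale digit
augmented = augment(image)
print(len(augmented), "training instances from 1 original")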
3. Dropout Regularization
• Dropout is a popular and highly effective regularization technique
used to avoid overfitting in neural networks.
• It works by randomly "dropping out" (i.e., setting to zero) a fraction of
neurons in the network during each forward pass in training.
• The neurons that are dropped are chosen randomly, and they do not
participate in either the forward pass or the backward pass during that
iteration.
How Dropout Works:
During Training:
• Each neuron has a probability p (commonly p = 0.5) of being dropped.
• The network is effectively trained on a different architecture at each iteration,
forcing it to learn more robust features and preventing it from relying too
much on any particular neurons.
During Testing:
• No neurons are dropped out.
• Instead, the activations (or equivalently the outgoing weights) are scaled
by the keep probability 1 − p to account for the larger number of active
neurons.
• This ensures that the expected outputs remain consistent between training and
inference.
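A minimal sketch of dropout applied to one layer's activations, assuming NumPy. It uses the "inverted dropout" convention, scaling the surviving activations by 1/(1 − p) during training so that nothing needs rescaling at test time; this is a common, equivalent alternative to the test-time scaling described above. The array size, dropout rate, and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    # p is the probability of dropping each neuron (the dropout rate)
    if not training:
        return activations                      # no dropout at test time
    mask = rng.random(activations.shape) >= p   # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)       # inverted dropout scaling

a = rng.normal(size=10)
print(dropout(a, p=0.5, training=True))   # roughly half the units are zeroed
print(dropout(a, training=False))         # unchanged at inference time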
Advantages:
• Improved Generalization: By forcing the network to work with various
subsets of neurons, dropout helps the model generalize better to
unseen data.
• Efficient and Simple: Dropout is easy to implement and computationally
efficient. It has been found to be effective across a wide range of tasks.
Limitations:
• Extended Training Time: Since the model is effectively learning several
architectures simultaneously, it may require more epochs to converge
compared to a model without dropout.