Upload Unit 2
Lecture No. 10
Today's Agenda:
Detailed discussion of activation functions in artificial neural networks.
Activation Functions
Weights control the signal (or the strength of the connection) between two neurons. In
other words, a weight decides how much influence the input will have on the output.
Biases are an additional input into the next layer that always has the constant
value of 1. Bias units are not influenced by the previous layer (they do not
have any incoming connections), but they do have outgoing connections with their own
weights. The bias unit ensures that a neuron still receives a nonzero input signal,
and so can still activate, even when all the inputs are zero.
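As a minimal sketch of the description above, the following computes a single neuron's weighted input. The function names, the example weights, and the choice of a sigmoid activation are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def neuron_pre_activation(inputs, weights, bias_weight):
    """Weighted sum of the inputs plus the contribution of the bias unit."""
    bias_unit = 1.0  # the bias unit always outputs 1
    return float(np.dot(weights, inputs) + bias_weight * bias_unit)

def sigmoid(z):
    """One possible activation function (an assumption for this sketch)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights: each one scales the influence of its input.
weights = np.array([0.5, -0.25])
bias_weight = 0.1

# Ordinary inputs: 0.5*2.0 + (-0.25)*1.0 + 0.1 = 0.85
z_active = neuron_pre_activation(np.array([2.0, 1.0]), weights, bias_weight)

# All-zero inputs: only the bias contribution (0.1) remains.
z_zero = neuron_pre_activation(np.zeros(2), weights, bias_weight)

print(z_active, sigmoid(z_active))
print(z_zero, sigmoid(z_zero))
```

With all inputs at zero, the weighted sum equals the bias weight alone, which is the role the bias unit plays in the text.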
Summary
Today's Agenda:
Detailed discussion of loss functions and weight initialization in machine
learning.
In machine learning, loss functions and weight initialization play distinct but
interconnected roles in model training and optimization:
Loss Functions:
Measuring Error: Loss functions quantify the "distance" between the model's
predictions and the actual target values. They mathematically articulate how "wrong"
the model is, guiding model training towards better predictions.
Goal of Training: Minimizing the loss function is the primary objective of machine
learning algorithms. By iteratively adjusting model parameters to reduce loss, models
learn to capture patterns in the data and make more accurate predictions.
Common Loss Functions:
o Mean Squared Error (MSE): Often used for regression problems. Because errors are
squared, large errors are penalized disproportionately more than small ones.
o Cross-Entropy Loss: Popular for classification tasks, measuring the divergence
between predicted probabilities and true labels.
o Hinge Loss: Suitable for support vector machines, aiming to maximize the margin
between classes.
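The three losses listed above can be sketched in a few lines of NumPy. This is an illustrative sketch, not a library implementation: the function names are hypothetical, the cross-entropy shown is the binary form, and the hinge loss assumes labels encoded as -1 and +1.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared differences; large errors dominate the total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities to avoid log(0).
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

def hinge_loss(y_true, scores):
    """Hinge loss for labels in {-1, +1} and raw (unbounded) scores."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(np.maximum(0.0, 1.0 - y_true * scores)))

print(mean_squared_error([1, 2, 3], [1, 2, 5]))   # one error of 2 -> 4/3
print(binary_cross_entropy([1, 0], [0.5, 0.5]))   # maximally unsure -> ln 2
print(hinge_loss([1, -1], [2.0, 0.5]))            # second sample is on the wrong side
```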
Weight Initialization:
Interplay:
Guiding Updates: Loss functions provide the feedback signal that backpropagation
uses to update weights. By calculating gradients of the loss with respect to
weights, the algorithm determines how to adjust weights to minimize loss.
Impact on Convergence: The choice of loss function and weight initialization method
can affect the speed and stability of convergence.
Best Practices:
Summary
In essence, loss functions set the learning objective, while weight initialization sets
the starting point for the model's journey towards that objective. Understanding their
roles and interactions is crucial for effective model training and optimization in
machine learning.
Lecture No. 12
Today's Agenda:
Detailed discussion of gradient descent in machine learning.
The cost function represents the discrepancy between the predicted output of
the model and the actual output. The goal of gradient descent is to find the
set of parameters that minimizes this discrepancy and improves the model’s
performance.
The algorithm operates by calculating the gradient of the cost function, which
indicates the direction and magnitude of steepest ascent. However, since the
goal is to minimize the cost function, gradient descent moves in the opposite
direction, along the negative gradient.
The learning rate, a hyperparameter, determines the step size taken in each
iteration, influencing the speed and stability of convergence.
Let’s say you are playing a game where the players are at the top of a
mountain, and they are asked to reach the lowest point of the mountain.
The best way is to observe the ground and find where the land descends.
From that position, take a step in the descending direction and iterate this
process until the lowest point is reached. In the same way, to find a local
minimum with gradient descent we take steps proportional to the negative of the
gradient (move away from the gradient) of the function at the current point. If
we take steps proportional to the positive of the gradient (moving towards the
gradient), we approach a local maximum instead, and the procedure is called
Gradient Ascent.
Source: Clairvoyant
The goal of the gradient descent algorithm is to minimize the given function
(say cost function). To achieve this goal, it performs two steps iteratively:
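1. Compute the gradient (slope), the first order derivative of the function at that
point.
2. Take a step in the direction opposite to the gradient, that is, in the
opposite direction of slope increase from the current point, by alpha (the
learning rate) times the gradient at that point.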
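The two steps above can be sketched as a short loop. The function f(x) = (x - 3)^2, the starting point, and the value of alpha below are assumptions chosen only for illustration.

```python
def gradient_descent(grad, start, alpha=0.1, n_steps=100):
    """Repeat: compute the gradient, then step against it by alpha * gradient."""
    x = start
    for _ in range(n_steps):
        slope = grad(x)          # step 1: gradient at the current point
        x = x - alpha * slope    # step 2: move opposite to the gradient
    return x

def f(x):
    """Example function with its minimum at x = 3."""
    return (x - 3.0) ** 2

def grad_f(x):
    """First-order derivative of f."""
    return 2.0 * (x - 3.0)

x_min = gradient_descent(grad_f, start=0.0, alpha=0.1, n_steps=100)
print(x_min, f(x_min))
```

Replacing the minus sign in the update with a plus sign would turn this into gradient ascent, which climbs towards a maximum instead.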
Summary
Gradient descent is one of the most fundamental and widely used optimization
algorithms in machine learning. It's all about finding the best set of parameters for your
model by continuously minimizing a loss function. Essentially, it's like rolling a ball
down a hill where the bottom of the hill represents the minimum of the loss function
and the ball represents your model's parameters.
Here's a breakdown of the key concepts:
1. Cost Function (Loss Function): This function measures how "bad" your model's
predictions are. We want to minimize this function to train a better model.
2. Gradient: This is the direction and steepness of the slope of the cost function at the
current position of your model's parameters. It tells you how much and in which
direction to adjust your parameters to minimize the cost function.
3. Parameter Update: Based on the gradient, you adjust your model's parameters by
taking a small step in the direction of the negative gradient (steepest descent). The
size of this step is controlled by the learning rate.
4. Iteration: You repeat steps 2 and 3 iteratively until the cost function reaches a
minimum (or close enough!), and your model has learned the best parameters.
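To tie the four concepts to a model with more than one parameter, here is a hedged sketch that fits a straight line with gradient descent. The toy data, the function name fit_line, the learning rate, and the iteration count are all assumptions made for this example.

```python
import numpy as np

def fit_line(x, y, learning_rate=0.05, n_iters=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(n_iters):                 # 4. iteration
        error = (w * x + b) - y
        cost = float(np.mean(error ** 2))    # 1. cost function
        grad_w = 2.0 * np.mean(error * x)    # 2. gradient w.r.t. w
        grad_b = 2.0 * np.mean(error)        #    gradient w.r.t. b
        w -= learning_rate * grad_w          # 3. parameter update
        b -= learning_rate * grad_b
        history.append(cost)
    return w, b, history

# Noise-free toy data generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b, history = fit_line(x, y)
print(w, b, history[0], history[-1])
```

The recorded cost history makes it easy to check that the cost actually falls; a learning rate that is too large would make it grow instead.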
Lecture No. 13
Today's Agenda:
Detailed discussion of the multilayer perceptron (MLP) in artificial neural networks (ANNs).
A multilayer perceptron (MLP) has 3 layers, including one hidden layer. If it has more than
1 hidden layer, it is called a deep ANN. An MLP is a typical example of a feedforward
artificial neural network. In this figure, the i-th activation unit in the l-th layer is
denoted as a_i^(l). The number of layers and the number of neurons are referred to as
hyperparameters of a neural network, and these need tuning. Cross-validation techniques are
commonly used to find good values for them. The weights are adjusted during training via
backpropagation. Deeper neural networks can model more complex patterns in the data.
However, deeper layers can lead to vanishing gradient problems, and special techniques are
required to address this issue.
Notations
A simplified view of the multilayer perceptron is presented here. This image shows a fully
connected three-layer neural network with 3 input neurons and 3 output neurons. A bias term
is added to the input vector.
1. Starting with the input layer, propagate data forward to the output layer. This step is
the forward propagation.
2. Based on the output, calculate the error (the difference between the predicted and known
outcome). The error needs to be minimized.
3. Backpropagate the error: find its derivative with respect to each weight in the network,
and update the model.
Repeat the three steps given above over multiple epochs to learn ideal weights. Finally, the
output is taken via a threshold function to obtain the predicted class labels.
In the first step, calculate the activation unit a_1^(h) of the hidden layer.
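The forward propagation step can be sketched as follows. This is only an illustration under stated assumptions: the hidden layer size of 4, the sigmoid activation, the random weights, and the use of argmax as the final thresholding step are not taken from the lecture figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_h, b_h, W_out, b_out):
    """Propagate one input vector through the hidden layer to the output layer."""
    z_h = W_h @ x + b_h          # net input of the hidden layer
    a_h = sigmoid(z_h)           # hidden activations a^(h)
    z_out = W_out @ a_h + b_out  # net input of the output layer
    a_out = sigmoid(z_out)       # output activations
    return a_h, a_out

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])               # 3 input neurons
W_h = rng.normal(scale=0.5, size=(4, 3))     # hypothetical hidden layer with 4 units
b_h = np.zeros(4)
W_out = rng.normal(scale=0.5, size=(3, 4))   # 3 output neurons
b_out = np.zeros(3)

a_h, a_out = forward(x, W_h, b_h, W_out, b_out)
predicted_class = int(np.argmax(a_out))      # pick the most active output unit
print(a_h, a_out, predicted_class)
```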
Backpropagation
Backpropagation is the crucial learning algorithm that powers artificial neural networks
(ANNs), enabling them to learn complex relationships and improve their predictions over
time. It's like the engine in a car, propelling the ANN towards optimal performance. Here's
why it's so important:
1. Training ANNs: Without backpropagation, ANNs would be static, unable to adjust their
internal parameters based on new information. Backpropagation provides the mechanism for
fine-tuning the weights and biases in the network, allowing it to learn from its mistakes and
improve its predictions on future data.
2. Minimizing Loss: Backpropagation calculates the gradient of the loss function with
respect to each weight and bias in the network. This gradient tells us how much each
parameter contributes to the overall error in the network's predictions.
3. Updating Weights and Biases: Based on the calculated gradient, backpropagation updates
the weights and biases in the network iteratively, moving them in the direction that
minimizes the loss function. This process is akin to sculpting the network, gradually
shaping its internal structure to better represent the underlying data patterns.
5. Generalization: Backpropagation can help ANNs generalize from the training data to unseen
examples. By minimizing the loss function on the training data, the network learns to capture
the underlying patterns and relationships, which often lets it make accurate predictions on
new data points it hasn't encountered before, provided it does not overfit.
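The gradient computation and update described above can be sketched for a tiny network with one hidden layer. The network size, the sigmoid activations, the squared-error loss, the single training example, and the learning rate are all assumptions for this example. A finite-difference check is included because it is a common way to confirm that a hand-written backward pass is correct.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, y, W1, b1, W2, b2):
    """Forward pass, squared-error loss, and gradients via the chain rule."""
    a1 = sigmoid(W1 @ x + b1)                  # hidden activations
    a2 = sigmoid(W2 @ a1 + b2)                 # output activations
    loss = 0.5 * float(np.sum((a2 - y) ** 2))
    delta2 = (a2 - y) * a2 * (1.0 - a2)        # dLoss/d(net input of output layer)
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1) # error propagated back to hidden layer
    grads = {"W1": np.outer(delta1, x), "b1": delta1,
             "W2": np.outer(delta2, a1), "b2": delta2}
    return loss, grads

rng = np.random.default_rng(0)
x, y = np.array([0.5, -1.0]), np.array([1.0])
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

loss_before, grads = loss_and_grads(x, y, W1, b1, W2, b2)

# Finite-difference check of one weight's gradient.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric_grad = (loss_and_grads(x, y, W1_plus, b1, W2, b2)[0]
                - loss_and_grads(x, y, W1_minus, b1, W2, b2)[0]) / (2.0 * eps)

# Iteratively move every weight and bias against its gradient.
learning_rate = 0.5
for _ in range(200):
    _, g = loss_and_grads(x, y, W1, b1, W2, b2)
    W1 -= learning_rate * g["W1"]
    b1 -= learning_rate * g["b1"]
    W2 -= learning_rate * g["W2"]
    b2 -= learning_rate * g["b2"]

loss_after, _ = loss_and_grads(x, y, W1, b1, W2, b2)
print(grads["W1"][0, 0], numeric_grad, loss_before, loss_after)
```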
Summary
In conclusion, backpropagation is the vital cog in the ANN learning machine. It's what
transforms these models from static structures to powerful learning machines, capable of
solving complex problems and shaping the future of AI.
Lecture No. 14
Today's Agenda:
Detailed discussion of testing for unstable gradients and techniques to mitigate them.
Batch Normalization (BN): A popular technique that normalizes activations within each
mini-batch, addressing internal covariate shift and stabilizing gradients.
Weight Initialization: Choosing appropriate initialization schemes like Xavier or He
initialization can influence gradient flow and mitigate instability.
Clipping Gradients: Enforce hard or soft clipping thresholds on gradient magnitudes to
prevent them from exploding.
Adaptive Learning Rates: Employ algorithms like Adam or RMSprop that dynamically
adjust learning rates based on gradients, providing more stability.
Regularization Techniques: Techniques like L1 or L2 regularization can prevent model
overfitting, which can contribute to instability.
Hessian Analysis: Analyzes the curvature of the loss function around the current
parameters, providing insights into gradient stability and potential second-order issues.
Spectral Normalization: Normalizes weight matrices by their largest singular value,
addressing exploding gradients in certain network architectures.
Checkpointing: Saves the model state periodically during training, providing a
fallback point if training diverges excessively. (This differs from gradient
checkpointing, which recomputes activations during the backward pass to save memory.)
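Two of the techniques in the list above, batch normalization and gradient clipping, are short enough to sketch directly. The function names and example values are hypothetical. The batch normalization shown is the training-time forward pass only, with no running statistics for inference, and the clipping variant rescales by the global gradient norm, which is one of several possible clipping schemes.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their combined norm exceeds max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Mini-batch of 3 samples whose two features live on very different scales.
batch = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
normalized = batch_norm_forward(batch, gamma=np.ones(2), beta=np.zeros(2))

# Gradients with a combined norm of 13, clipped down to norm 1.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))

print(normalized)
print(norm_before, norm_after)
```

Rescaling all gradients by one common factor preserves the direction of the update, which is the usual reason for preferring norm-based clipping over clipping each value separately.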
Consider the specific model architecture, problem domain, and dataset characteristics.
Experiment with different combinations of testing and mitigation strategies.
Monitor performance metrics and choose the approach that demonstrably improves
training stability and optimizes model performance.
Summary
In conclusion, Batch normalization is a valuable tool in the machine learning toolbox,
offering significant improvements in training speed, stability, and model performance.
By understanding its workings, benefits, and considerations, you can leverage its
strengths to get the most out of your machine learning projects.
Lecture No. 15
Today's Agenda:
Detailed discussion of L1 and L2 regularization
When you have a large number of features in your data set, you may wish to create a less
complex, more parsimonious model. Two widely used regularization techniques for
addressing overfitting and performing feature selection are L1 and L2 regularization.
With L1 regularization (lasso), if lambda is zero, we get back OLS (ordinary least squares),
whereas a very large value will drive the coefficients to zero, which means the model will
underfit.
L2 Regularization: Ridge Regression
Ridge regression adds the “squared magnitude” of the coefficient as the penalty term to the
loss function. The highlighted part below represents the L2 regularization element.
Here, if lambda is zero, we get back OLS. However, if lambda is very large, the penalty
dominates the loss and leads to underfitting. The choice of lambda is therefore important.
With a well-chosen lambda, this technique works very well to avoid overfitting issues.
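The effect of lambda can be sketched with the closed-form ridge solution. The sketch assumes the loss is the sum of squared errors plus lambda times the sum of squared coefficients, with no intercept term. The data and function names are hypothetical.

```python
import numpy as np

def l1_penalty(w, lam):
    """Lasso penalty: lambda times the sum of absolute coefficient values."""
    return lam * float(np.sum(np.abs(w)))

def l2_penalty(w, lam):
    """Ridge penalty: lambda times the sum of squared coefficient values."""
    return lam * float(np.sum(np.asarray(w) ** 2))

def ridge_fit(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 via (X^T X + lam I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Noise-free toy data generated from the coefficients [3, -2].
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([3.0, -2.0])

w_ols = ridge_fit(X, y, lam=0.0)     # lambda = 0 recovers ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)  # moderate lambda shrinks the coefficients
w_huge = ridge_fit(X, y, lam=1e6)    # very large lambda pushes them towards zero

print(w_ols, w_ridge, w_huge)
print(l1_penalty(w_ols, 0.5), l2_penalty(w_ols, 0.5))
```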
Summary
The key difference between these techniques is that lasso shrinks the coefficients of
less important features all the way to zero, thus removing some features altogether,
whereas ridge only makes coefficients smaller. This makes L1 regularization well suited
to feature selection when we have a huge number of features.
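The difference can be made concrete in a special case. Assuming orthonormal features, with the loss written as half the squared error plus lambda times the L1 norm (lasso) or plus half of lambda times the squared L2 norm (ridge), both solutions are simple transformations of the OLS coefficients. The coefficient values below are made up for illustration.

```python
import numpy as np

def lasso_shrink(w_ols, lam):
    """Soft-thresholding: coefficients smaller than lam in magnitude become zero."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

def ridge_shrink(w_ols, lam):
    """Uniform rescaling: coefficients get smaller but never exactly zero."""
    return w_ols / (1.0 + lam)

w_ols = np.array([4.0, -0.3, 0.1, -2.5])   # hypothetical OLS coefficients
w_lasso = lasso_shrink(w_ols, lam=0.5)
w_ridge = ridge_shrink(w_ols, lam=0.5)

print(w_lasso)   # the two small coefficients are removed
print(w_ridge)   # all four coefficients survive, scaled down
```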
Lecture No. 16
Today's Agenda:
Detailed discussion of momentum.
The Problem:
Gradient descent typically takes small steps in the direction of the steepest descent
(negative gradient) to minimize the loss function. While this works, it can be slow,
especially when the landscape is bumpy or has shallow regions.
Benefits of Momentum:
Faster Convergence: By accumulating past gradients, momentum can significantly
accelerate the learning process.
Reduced Oscillations: Momentum smooths out the descent path, damping the
back-and-forth bouncing that occurs across steep, narrow valleys.
Improved Performance: Faster convergence and smoother trajectories often lead to
better final model performance.
Key Parameters:
Momentum coefficient (μ): This controls the amount of past gradients
considered. Higher values increase momentum, but too much can cause overshooting.
Initial velocity: This sets the starting direction of the ball's movement.
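The two parameters above appear directly in the classical momentum update, sketched here on a simple elongated quadratic bowl. The test function, the learning rate, and the step count are assumptions chosen to make the contrast with plain gradient descent visible.

```python
import numpy as np

def minimize(grad, start, learning_rate, momentum=0.0, n_steps=100,
             initial_velocity=None):
    """Gradient descent with classical momentum (momentum=0 gives plain descent)."""
    w = np.array(start, dtype=float)
    if initial_velocity is None:
        v = np.zeros_like(w)
    else:
        v = np.array(initial_velocity, dtype=float)
    for _ in range(n_steps):
        # Velocity keeps a decaying sum of past gradients.
        v = momentum * v - learning_rate * grad(w)
        w = w + v
    return w

def grad_f(w):
    """Gradient of f(w) = 0.5 * (w[0]**2 + 50 * w[1]**2), minimized at the origin."""
    return np.array([1.0, 50.0]) * w

start = [10.0, 1.0]
w_plain = minimize(grad_f, start, learning_rate=0.02, momentum=0.0)
w_momentum = minimize(grad_f, start, learning_rate=0.02, momentum=0.9)

# Distance from the minimum after the same number of steps.
print(np.linalg.norm(w_plain), np.linalg.norm(w_momentum))
```

In this setup plain descent is limited by the flat direction, where the gradient is small, while the accumulated velocity lets the momentum variant cover that distance in far fewer steps.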
Challenges and Considerations:
Choosing the right momentum coefficient: Finding the optimal value depends on the
problem and dataset. Experimentation is key.
Can overshoot the minimum: High momentum can cause the ball to zoom past the
minimum, requiring careful tuning.
Not a guaranteed solution: While effective in many cases, momentum may not always
be the best approach.
Variations of Momentum:
Nesterov momentum: Looks ahead slightly before taking the update, leading to improved
stability and accuracy.
AdaGrad, RMSProp, Adam: Adaptive algorithms that adjust the learning rate for each
parameter, often leading to better performance. Adam additionally combines this with momentum.
Summary
Momentum is a powerful tool that can significantly improve the efficiency and performance of
your machine learning models. By understanding its benefits, challenges, and variations, you
can leverage its capabilities to bring your models to the bottom of the loss landscape faster
and achieve better results.