
Lecture No. 10

Today's Agenda:
 Detail discussion on Activation Functions in Artificial Neural Network.
 Detail discussion on Weights and bias terminology in Artificial Neural Network.

Activation Functions

An activation function determines whether a neuron should be activated. It applies a
simple mathematical operation to decide whether the neuron's input to the network is
relevant to the prediction process. The purpose of the activation function is to
introduce non-linearity into an artificial neural network and to generate output from
the collection of input values fed to a layer.

Types of Activation functions

Activation functions can be divided into three types:

1. Linear Activation Function

2. Binary Step Function

3. Non-linear Activation Function
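
As a quick illustration, the three types can be sketched in Python (a minimal example using NumPy; the sigmoid and ReLU shown are common instances of the non-linear family, and all function names are illustrative):

```python
import numpy as np

def linear(x, a=1.0):
    """Linear activation: output is simply proportional to the input."""
    return a * x

def binary_step(x):
    """Binary step: the neuron fires (1) only if the input is non-negative."""
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    """Non-linear activation squashing any input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Non-linear activation: passes positive inputs, zeroes out the rest."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(linear(x))       # [-2.  -0.5  0.   0.5  2. ]
print(binary_step(x))  # [0. 0. 1. 1. 1.]
print(sigmoid(x))      # smooth values between 0 and 1
print(relu(x))         # [0.  0.  0.  0.5 2. ]
```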


Weights and biases (commonly referred to as w and b) are the learnable parameters of
some machine learning models, including neural networks.
Neurons are the basic units of a neural network. In an ANN, each neuron in a layer is
connected to some or all of the neurons in the next layer. When the inputs are
transmitted between neurons, the weights are applied to the inputs along with the bias.

Weights control the signal (or the strength of the connection) between two neurons. In
other words, a weight decides how much influence the input will have on the output.
Biases, which are constant, are an additional input into the next layer that will always
have the value of 1. Bias units are not influenced by the previous layer (they do not
have any incoming connections) but they do have outgoing connections with their own
weights. The bias unit guarantees that even when all the inputs are zero, there will still
be an activation in the neuron.
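
Concretely, a single neuron computes a weighted sum of its inputs plus the bias, then applies an activation. A minimal sketch (illustrative names and values):

```python
import numpy as np

def neuron_output(x, w, b):
    """One neuron: weighted sum of inputs plus bias, passed through a sigmoid."""
    z = np.dot(w, x) + b              # pre-activation: w . x + b
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

x = np.array([0.0, 0.0, 0.0])   # all inputs are zero
w = np.array([0.5, -0.3, 0.8])  # weights: strength of each connection
b = 1.2                         # bias

# Because of the bias, the neuron still produces an activation on zero input.
print(neuron_output(x, w, b))   # sigmoid(1.2) ≈ 0.77
```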

Summary

In artificial neural networks (ANNs), activation functions play a crucial role in
determining the output of neurons and introducing non-linearity into the network. They
are essential for:
 Capturing Complex Relationships: Real-world data often exhibits non-
linear relationships. Activation functions allow ANNs to model these
complexities, going beyond simple linear decision boundaries.
 Beyond Straight Lines: Without activation functions, ANNs would be
limited to linear mappings, unable to learn intricate patterns and solve
complex problems.
Lecture No. 11

Today's Agenda:
 Detail discussion on Loss functions and weight initialization in Machine
Learning.

In machine learning, loss functions and weight initialization play distinct but
interconnected roles in model training and optimization:

Loss Functions:

 Measuring Error: Loss functions quantify the "distance" between the model's
predictions and the actual target values. They mathematically articulate how "wrong"
the model is, guiding model training towards better predictions.
 Goal of Training: Minimizing the loss function is the primary objective of machine
learning algorithms. By iteratively adjusting model parameters to reduce loss, models
learn to capture patterns in the data and make more accurate predictions.
 Common Loss Functions:
o Mean Squared Error (MSE): Often used for regression problems, penalizing large
errors proportionally.
o Cross-Entropy Loss: Popular for classification tasks, measuring the divergence
between predicted probabilities and true labels.
o Hinge Loss: Suitable for support vector machines, aiming to maximize the margin
between classes.
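
The first two of these can be sketched in a few lines of NumPy (a minimal illustration; function names are arbitrary):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared prediction errors."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between true labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(mse(y_true, y_pred))                   # ≈ 0.047
print(binary_cross_entropy(y_true, y_pred))  # ≈ 0.228
```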

Weight Initialization:

 Starting Point: Weights in neural networks represent connections between
neurons, and their initial values significantly impact the learning process.
 Importance: Proper initialization can:
o Accelerate convergence to a good solution.
o Prevent vanishing or exploding gradients during backpropagation.
o Influence the model's ability to capture complex patterns.
 Common Techniques:
o Random Initialization: Assigning small random values to weights, usually from a
Gaussian or uniform distribution.
o Xavier Initialization: Designed for sigmoid or tanh activations, scaling weights based
on the number of inputs and outputs to a neuron.
o He Initialization: Better suited for ReLU activations, using a different scaling factor to
account for its properties.
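
These schemes can be sketched as follows (one standard formulation of each; a uniform Xavier and a Gaussian He variant are shown, and the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Xavier/Glorot: scale depends on both fan-in and fan-out.
    Suited to sigmoid/tanh activations."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_init(n_in, n_out):
    """He: scale depends on fan-in only, with a factor of 2 for ReLU."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

W1 = xavier_init(784, 128)  # e.g. a hidden layer for 28x28 images
W2 = he_init(128, 10)
print(W1.std(), W2.std())   # small spreads that depend on layer size
```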

Interplay:

 Guiding Updates: Loss functions provide the feedback signal that backpropagation
uses to update weights. By calculating gradients of the loss with respect to
weights, the algorithm determines how to adjust weights to minimize loss.
 Impact on Convergence: The choice of loss function and weight initialization method
can affect the speed and stability of convergence.

Best Practices:

 Experimentation: The optimal choice often depends on the specific model
architecture, dataset, and task. Experimenting with different combinations is
recommended.
 Monitor Learning: Track loss values during training to assess convergence and
potential issues like overfitting.

Summary

In essence, loss functions set the learning objective, while weight initialization sets
the starting point for the model's journey towards that objective. Understanding their
roles and interactions is crucial for effective model training and optimization in
machine learning.
Lecture No. 12

Today's Agenda:
 Detail discussion on gradient descent in Machine Learning.

Gradient Descent is an optimization algorithm used in machine learning to minimize the
cost function by iteratively adjusting parameters in the direction of the negative
gradient, aiming to find the optimal set of parameters.

The cost function represents the discrepancy between the predicted output of the model
and the actual output. The goal of gradient descent is to find the set of parameters
that minimizes this discrepancy and improves the model’s performance.

The algorithm operates by calculating the gradient of the cost function, which
indicates the direction and magnitude of steepest ascent. However, since the objective
is to minimize the cost function, gradient descent moves in the opposite direction of
the gradient, known as the negative gradient direction. By iteratively updating the
model’s parameters in the negative gradient direction, gradient descent gradually
converges towards the optimal set of parameters that yields the lowest cost.

The learning rate, a hyperparameter, determines the step size taken in each iteration,
influencing the speed and stability of convergence. Gradient descent can be applied to
various machine learning algorithms, including linear regression, logistic regression,
neural networks, and support vector machines. It provides a general framework for
optimizing models by iteratively refining their parameters based on the cost function.

Example of Gradient Descent:

Let’s say you are playing a game where the players are at the top of a mountain and are
asked to reach the lowest point of the mountain. Additionally, they are blindfolded.
Take a moment to think about this before you read on.

The best way is to observe the ground and find where the land descends. From that
position, take a step in the descending direction and iterate this process until we
reach the lowest point.

[Figure: Finding the lowest point in a hilly landscape. (Source: Fisseha Berhane)]

Gradient descent is an iterative optimization algorithm for finding the local minimum
of a function. To find the local minimum of a function using gradient descent, we must
take steps proportional to the negative of the gradient (moving away from the gradient)
of the function at the current point. If we instead take steps proportional to the
positive of the gradient (moving towards the gradient), we will approach a local
maximum of the function, and the procedure is called Gradient Ascent.

Gradient descent was originally proposed by Cauchy in 1847. It is also known as
steepest descent.

The goal of the gradient descent algorithm is to minimize the given function (say, a
cost function). To achieve this goal, it performs two steps iteratively:

1. Compute the gradient (slope), the first-order derivative of the function at the
current point.

2. Make a step (move) in the direction opposite to the gradient, i.e., move from the
current point by alpha times the gradient at that point, against the direction of
increasing slope.

Alpha is called the learning rate, a tuning parameter in the optimization process. It
decides the length of the steps.
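
These two steps translate directly into code. A minimal sketch minimizing a simple quadratic function (all names and values are illustrative):

```python
def gradient_descent(grad_fn, x0, alpha=0.1, n_iters=100):
    """Repeatedly step against the gradient to minimize a function."""
    x = x0
    for _ in range(n_iters):
        g = grad_fn(x)     # step 1: gradient (slope) at the current point
        x = x - alpha * g  # step 2: move opposite the gradient, scaled by alpha
    return x

# Example: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and minimum at x = 3.
grad = lambda x: 2.0 * (x - 3.0)
print(gradient_descent(grad, x0=0.0))  # ≈ 3.0
```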

Summary
Gradient descent is one of the most fundamental and widely used optimization
algorithms in machine learning. It's all about finding the best set of parameters for your
model by continuously minimizing a loss function. Essentially, it's like rolling a ball
down a hill where the bottom of the hill represents the minimum of the loss function
and the ball represents your model's parameters.
Here's a breakdown of the key concepts:
1. Cost Function (Loss Function): This function measures how "bad" your model's
predictions are. We want to minimize this function to train a better model.
2. Gradient: This is the direction and steepness of the slope of the cost function at the
current position of your model's parameters. It tells you how much and in which
direction to adjust your parameters to minimize the cost function.
3. Parameter Update: Based on the gradient, you adjust your model's parameters by
taking a small step in the direction of the negative gradient (steepest descent). This step
size is called the learning rate.
4. Iteration: You repeat steps 2 and 3 iteratively until the cost function reaches a
minimum (or close enough!), and your model has learned the best parameters.

Lecture No. 13

Today's Agenda:
 Detail discussion on MLP in ANN

 Detail discussion on Backpropagation


Multi-layer ANN
A fully connected multi-layer neural network is called a Multilayer Perceptron (MLP).

A basic MLP has three layers, including one hidden layer; if it has more than one hidden
layer, it is called a deep ANN. An MLP is a typical example of a feedforward artificial
neural network. The i-th activation unit in the l-th layer is denoted as a_i^(l). The
number of layers and the number of neurons are referred to as hyperparameters of a neural
network, and these need tuning; cross-validation techniques must be used to find ideal
values for them. The weight adjustment during training is done via backpropagation.
Deeper neural networks are better at processing data, but deeper layers can lead to
vanishing gradient problems, and special algorithms are required to address this issue.

Notations

In the representation below:

 a_i^(in) refers to the i-th value in the input layer

 a_i^(h) refers to the i-th unit in the hidden layer

 a_i^(out) refers to the i-th unit in the output layer

 a_0^(in) is simply the bias unit and is equal to 1; it will have the corresponding weight w_0

 The weight coefficient from layer l to layer l+1 is represented by w_{k,j}^(l)

A simplified view of such a network is a fully connected three-layer neural network with
3 input neurons and 3 output neurons, where a bias term is added to the input vector.

MLP Learning Procedure

The MLP learning procedure is as follows:

 Starting with the input layer, propagate data forward to the output layer. This step is the
forward propagation.

 Based on the output, calculate the error (the difference between the predicted and known
outcome). The error needs to be minimized.

 Backpropagate the error. Find its derivative with respect to each weight in the network,
and update the model.

Repeat the three steps given above over multiple epochs to learn ideal weights. Finally, the
output is taken via a threshold function to obtain the predicted class labels.

Forward Propagation in MLP

In the first step, calculate the activation a^(h) of the hidden layer from the input
activations. Using the notation above, the net input is z^(h) = a^(in) W^(h) and the
activation is a^(h) = φ(z^(h)), where φ is the activation function.
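
A minimal sketch of this forward pass in NumPy (assuming sigmoid activations throughout; all shapes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a_in, W_h, b_h, W_out, b_out):
    """One forward pass through a single-hidden-layer MLP."""
    z_h = a_in @ W_h + b_h       # net input of the hidden layer
    a_h = sigmoid(z_h)           # hidden activations a^(h)
    z_out = a_h @ W_out + b_out  # net input of the output layer
    a_out = sigmoid(z_out)       # output activations a^(out)
    return a_h, a_out

rng = np.random.default_rng(0)
a_in = rng.normal(size=(1, 3))                     # 3 input features
W_h, b_h = rng.normal(size=(3, 4)), np.zeros(4)    # 4 hidden units
W_out, b_out = rng.normal(size=(4, 2)), np.zeros(2)
_, y_hat = forward(a_in, W_h, b_h, W_out, b_out)
print(y_hat)  # predicted outputs, each in (0, 1)
```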

Backpropagation

Backpropagation is the crucial learning algorithm that powers artificial neural networks
(ANNs), enabling them to learn complex relationships and improve their predictions over
time. It's like the engine in a car, propelling the ANN towards optimal performance. Here's
why it's so important:

1. Training ANNs: Without backpropagation, ANNs would be static, unable to adjust their
internal parameters based on new information. Backpropagation provides the mechanism for
fine-tuning the weights and biases in the network, allowing it to learn from its mistakes and
improve its predictions on future data.
2. Minimizing Loss: Backpropagation calculates the gradient of the loss function with
respect to each weight and bias in the network. This gradient tells us how much each
parameter contributes to the overall error in the network's predictions.

3. Updating Weights and Biases: Based on the calculated gradient, backpropagation updates
the weights and biases in the network iteratively, moving them in the direction that
minimizes the loss function. This process is akin to sculpting the network, gradually
shaping its internal structure to better represent the underlying data patterns.

4. Enabling Non-Linearity: Backpropagation allows ANNs to learn non-linear
relationships between features and target variables. This is crucial for tackling real-world
problems where simple linear models often fall short. The ability to learn complex
relationships makes ANNs versatile and powerful tools for diverse applications.

5. Generalization: Backpropagation helps ANNs generalize from the training data to unseen
examples. By minimizing the loss function on the training data, the network learns to capture
the underlying patterns and relationships, enabling it to make accurate predictions on new
data points it hasn't encountered before.
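
To make points 2 and 3 concrete, here is a sketch of one backpropagation update for a single-hidden-layer network (assuming sigmoid activations and a squared-error loss, with biases omitted for brevity; a simplified illustration, not a full training loop):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W_h, W_out, lr=0.1):
    """One gradient step on both weight matrices."""
    # Forward pass
    a_h = sigmoid(x @ W_h)
    y_hat = sigmoid(a_h @ W_out)

    # Backward pass: gradients of 0.5 * (y_hat - y)^2 w.r.t. the weights
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)      # error signal at the output
    delta_h = (delta_out @ W_out.T) * a_h * (1 - a_h)  # error propagated backwards

    # Update: move each weight against its gradient
    W_out = W_out - lr * (a_h.T @ delta_out)
    W_h = W_h - lr * (x.T @ delta_h)
    return W_h, W_out

rng = np.random.default_rng(1)
x, y = rng.normal(size=(1, 3)), np.array([[1.0, 0.0]])
W_h, W_out = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
for _ in range(200):
    W_h, W_out = backprop_step(x, y, W_h, W_out)
print(sigmoid(sigmoid(x @ W_h) @ W_out))  # approaches [1, 0]
```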

Summary
In conclusion, backpropagation is the vital cog in the ANN learning machine. It's what
transforms these models from static structures into powerful learning machines, capable of
solving complex problems and shaping the future of AI.
Lecture No. 14

Today's Agenda:
 Detail discussion on testing for gradient instability,

 Detail discussion on batch normalization

Gradient instability is a major hurdle in machine learning, impacting training speed,
convergence, and overall model performance. Fortunately, various testing and mitigation
strategies can be employed to address this issue. Let's delve deeper into this topic:

Detecting Gradient Instability:

 Monitor Gradient Magnitudes: Track the absolute values of gradients throughout
training. Excessively high or low values indicate potential instability (see the sketch
after this list).
 Visualize Gradient Distributions: Utilize tools like TensorBoard to visualize the
distribution of gradients over layers and epochs. Look for sharp peaks or highly skewed
distributions.
 Analyze Loss Function Landscape: Sudden jumps or plateaus in the loss function could
signify instability.
 Monitor Training Metrics: Track convergence metrics like training accuracy and
loss. Stagnation or erratic fluctuations might indicate gradient issues.
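
As a sketch, the first of these checks can be automated after each backward pass (assuming a PyTorch model; the thresholds here are illustrative):

```python
import torch

def log_gradient_norms(model):
    """Print each parameter's gradient norm after loss.backward()."""
    for name, param in model.named_parameters():
        if param.grad is not None:
            norm = param.grad.norm().item()
            print(f"{name}: grad norm = {norm:.3e}")
            if norm > 1e3 or norm < 1e-7:  # illustrative warning thresholds
                print(f"  warning: possible exploding/vanishing gradient in {name}")

model = torch.nn.Sequential(
    torch.nn.Linear(10, 10), torch.nn.ReLU(), torch.nn.Linear(10, 1))
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
log_gradient_norms(model)
```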

Understanding the Roots of Instability:

 Vanishing Gradients: Gradients shrink exponentially as they propagate back through
deep networks, eventually reaching insignificant values and hindering learning in deeper
layers (illustrated numerically after this list).
 Exploding Gradients: Gradients accumulate and magnify throughout
backpropagation, leading to uncontrolled updates and model divergence.
 Internal Covariate Shift: Activations within a layer can shift distributions across
training batches, destabilizing gradients.
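
A tiny numerical illustration of the first two failure modes, as referenced above (repeatedly scaling a gradient signal by a per-layer factor; purely illustrative numbers):

```python
grad = 1.0
for _ in range(50):          # 50 layers
    grad *= 0.25             # 0.25 is the maximum slope of the sigmoid
print(f"vanishing: {grad:.2e}")  # ≈ 7.9e-31, effectively zero

grad = 1.0
for _ in range(50):
    grad *= 1.8              # a per-layer factor slightly above 1
print(f"exploding: {grad:.2e}")  # ≈ 5.8e+12, uncontrollably large
```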

Testing and Mitigation Strategies:

 Batch Normalization (BN): A popular technique that normalizes activations within each
mini-batch, addressing internal covariate shift and stabilizing gradients.
 Weight Initialization: Choosing appropriate initialization schemes like Xavier or He
initialization can influence gradient flow and mitigate instability.
 Clipping Gradients: Enforce hard or soft clipping thresholds on gradient magnitudes to
prevent them from exploding (a short example follows this list).
 Adaptive Learning Rates: Employ algorithms like Adam or RMSprop that dynamically
adjust learning rates based on gradients, providing more stability.
 Regularization Techniques: Techniques like L1 or L2 regularization can prevent model
overfitting, which can contribute to instability.
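
For example, gradient clipping and an adaptive learning rate can be combined in a standard PyTorch training step (clip_grad_norm_ is a built-in utility; the model and threshold are illustrative):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # adaptive learning rates
x, y = torch.randn(32, 10), torch.randn(32, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale gradients so their total norm never exceeds 1.0, preventing explosion.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```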

Advanced Testing Techniques:

 Hessian Analysis: Analyzes the curvature of the loss function around the current
parameters, providing insights into gradient stability and potential second-order issues.
 Spectral Normalization: Normalizes weights based on their singular values, addressing
exploding gradients in certain network architectures.
 Checkpointing: Periodically saves model and optimizer state during training,
providing a fallback point to restore if gradients diverge excessively.

Choosing the Right Approach:

 Consider the specific model architecture, problem domain, and dataset characteristics.
 Experiment with different combinations of testing and mitigation strategies.
 Monitor performance metrics and choose the approach that demonstrably improves
training stability and optimizes model performance.

BN normalizes the activations (outputs) of a layer within each mini-batch during
training. This essentially standardizes the distribution of activations around a mean of 0
and a standard deviation of 1. It proceeds in three steps:
 Calculate Mean and Variance: Within each mini-batch, the mean and variance of the
activations across all training samples are calculated for each feature.
 Normalize Activations: Each activation is then centered by subtracting the mean and
rescaled by dividing by the standard deviation.
 Scale and Shift: Two learnable parameters, gamma and beta, are applied to the
normalized activations to allow the model to recover any representational power lost
during normalization.
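
These three steps map directly onto a few lines of NumPy (a minimal training-time sketch; a real implementation also tracks running statistics for use at inference time):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch x of shape (batch_size, n_features)."""
    mean = x.mean(axis=0)                    # step 1: per-feature mean
    var = x.var(axis=0)                      # step 1: per-feature variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # step 2: normalize
    return gamma * x_hat + beta              # step 3: learnable scale and shift

x = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)
out = batch_norm(x, gamma, beta)
print(out.mean(axis=0).round(3))  # ≈ 0 for every feature
print(out.std(axis=0).round(3))   # ≈ 1 for every feature
```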
Benefits of Batch Normalization:
 Reduced Internal Covariate Shift: BN stabilizes the distribution of activations across
mini-batches, preventing the covariate shift that can occur during training and hinder
learning.
 Faster Convergence: Normalized activations lead to smoother gradient flow, allowing
the model to converge to a minimum of the loss function more quickly.
 Improved Generalization: BN can act as a regularizer, reducing overfitting and
improving the model's ability to perform well on unseen data.
 Higher Learning Rates: The stable gradients due to BN allow for using larger learning
rates without risking divergence or instability.

Challenges and Considerations:


 Increased Complexity: BN adds additional learnable parameters and
computational overhead to the model.
 Hyperparameter Tuning: Tuning the learnable scale and shift parameters (gamma
and beta) can be crucial for optimal performance.
 Not a cure-all: While effective in many cases, BN might not solve all instability
issues, and alternative techniques like layer normalization or weight normalization
might be needed in specific cases.

Applications of Batch Normalization:


 Deep Neural Networks: BN is particularly beneficial in deep networks where
vanishing or exploding gradients can be a major problem.
 Computer Vision: BN is widely used in convolutional neural networks for image
recognition and classification tasks.
 Natural Language Processing: BN can improve the performance of recurrent neural
networks used for language modeling and machine translation.

Summary
In conclusion, batch normalization is a valuable tool in the machine learning toolbox,
offering significant improvements in training speed, stability, and model performance.
By understanding its workings, benefits, and considerations, you can leverage its
strengths to get the most out of your machine learning projects.
Lecture No. 15

Today's Agenda:
 Detail discussion on L1 and L2 regularization

When you have a large number of features in your data set, you may wish to create a less
complex, more parsimonious model. Two widely used regularization techniques for
addressing overfitting and performing feature selection are L1 and L2 regularization.

L1 VS. L2 REGULARIZATION METHODS

 L1 Regularization, also called lasso regression, adds the “absolute value of
magnitude” of the coefficient as a penalty term to the loss function.
 L2 Regularization, also called ridge regression, adds the “squared magnitude” of the
coefficient as the penalty term to the loss function.

L1 Regularization: Lasso Regression


Lasso is an acronym for least absolute shrinkage and selection operator, and lasso
regression adds the “absolute value of magnitude” of the coefficient as a penalty term to
the loss function. For a linear model, the penalized loss takes the standard form

Loss = Σ (y_i − ŷ_i)² + λ Σ |β_j|

If lambda is zero, we get back OLS (ordinary least squares), whereas a very large value
will shrink coefficients to zero, which means the model will underfit.
L2 Regularization: Ridge Regression
Ridge regression adds the “squared magnitude” of the coefficient as the penalty term to
the loss function. For a linear model, the penalized loss takes the standard form

Loss = Σ (y_i − ŷ_i)² + λ Σ β_j²

where the second term is the L2 regularization element. Here, if lambda is zero, we get
back OLS. However, if lambda is very large, the penalty dominates and leads to
underfitting. That said, how we choose lambda is important. This technique works very
well to avoid overfitting issues.
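
Both penalties are easy to compute directly (an illustrative NumPy sketch of the two penalized losses; the data and names are arbitrary):

```python
import numpy as np

def penalized_loss(X, y, beta, lam, kind="l2"):
    """Squared-error loss plus an L1 (lasso) or L2 (ridge) penalty."""
    data_loss = np.sum((y - X @ beta) ** 2)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(beta))  # lasso: absolute magnitudes
    else:
        penalty = lam * np.sum(beta ** 2)     # ridge: squared magnitudes
    return data_loss + penalty

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 5)), rng.normal(size=20)
beta = rng.normal(size=5)
print(penalized_loss(X, y, beta, lam=0.0))            # lam = 0 recovers plain OLS
print(penalized_loss(X, y, beta, lam=1.0, kind="l1"))
print(penalized_loss(X, y, beta, lam=1.0, kind="l2"))
```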

Summary

The key difference between these techniques is that lasso shrinks the less important
features’ coefficients to zero, thus removing some features altogether. In other
words, L1 regularization works well for feature selection when we have a huge
number of features.
Lecture No. 16

Today's Agenda:
 Detail discussion on Momentum,

 Detail discussion on hyperparameter tuning

Momentum is a powerful optimization technique in machine learning, often used alongside
gradient descent, that helps your model navigate the training landscape more efficiently.
Imagine rolling a ball down a hill: momentum accelerates the ball, helping it overcome
bumps and reach the bottom (the minimum of the loss function) faster. Here's a deeper
look at how it works:

The Problem:
Gradient descent typically takes small steps in the direction of the steepest descent
(negative gradient) to minimize the loss function. While this works, it can be slow,
especially when the landscape is bumpy or has shallow regions.

The Solution: Momentum

Momentum acts like a rolling ball's inertia. It "remembers" past gradients and adds
them to the current gradient, creating a larger update vector. This pushes the ball
downhill with extra force, helping it overcome shallow areas and reach the minimum
faster.
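
In update-rule form, one common formulation keeps a running velocity: v ← μ·v − α·∇L(θ), then θ ← θ + v. A minimal sketch on the same quadratic used earlier (illustrative values):

```python
def momentum_descent(grad_fn, x0, alpha=0.1, mu=0.9, n_iters=100):
    """Gradient descent with momentum: velocity accumulates past gradients."""
    x, v = x0, 0.0
    for _ in range(n_iters):
        v = mu * v - alpha * grad_fn(x)  # blend old velocity with the new gradient
        x = x + v                        # move by the accumulated velocity
    return x

# f(x) = (x - 3)^2, gradient 2 * (x - 3), minimum at x = 3.
grad = lambda x: 2.0 * (x - 3.0)
print(momentum_descent(grad, x0=0.0))  # ≈ 3.0, typically in fewer steps
```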

Benefits of Momentum:
 Faster Convergence: By accumulating past gradients, momentum can significantly
accelerate the learning process.
 Reduced Oscillations: Momentum smooths out the descent path, preventing the ball
from bouncing back and forth in shallow regions.
 Improved Performance: Faster convergence and smoother trajectories often lead to
better final model performance.
Key Parameters:
 Momentum coefficient (μ): This controls the amount of past gradients
considered. Higher values increase momentum, but too much can cause overshooting.
 Initial velocity: This sets the starting direction of the ball's movement.
Challenges and Considerations:
 Choosing the right momentum coefficient: Finding the optimal value depends on the
problem and dataset. Experimentation is key.
 Can overshoot the minimum: High momentum can cause the ball to zoom past the
minimum, requiring careful tuning.
 Not a guaranteed solution: While effective in many cases, momentum may not always
be the best approach.

Variations of Momentum:

 Nesterov momentum: Looks ahead slightly before taking the update, leading to improved
stability and accuracy.
 AdaGrad, RMSProp, Adam: Adaptive momentum-based algorithms adjust the learning
rate for different parameters, often leading to better performance.

Summary
Momentum is a powerful tool that can significantly improve the efficiency and performance
of your machine learning models. By understanding its benefits, challenges, and
variations, you can leverage its capabilities to move your models down the loss landscape
faster and achieve optimal results.
