Unit 3 Self Made
2. Architecture:
An ANN typically consists of three types of layers:
1. Input Layer: Receives the input features.
2. Hidden Layers: Perform intermediate computations, where the
"learning" happens.
3. Output Layer: Produces the final prediction or classification result.
The number of layers and nodes defines the network's complexity.
3. Learning Process:
ANNs use weights and biases associated with connections to learn patterns.
The learning process involves:
1. Forward Propagation: Data flows through the network to make
predictions.
2. Backward Propagation (Backpropagation): Errors are propagated
backward to adjust weights using optimization algorithms like Gradient
Descent.
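As a rough illustration of forward propagation, the sketch below pushes one input through a small fully connected network. The 2-3-1 layer sizes, the hand-picked weights, and the helper names `dense` and `forward` are assumptions made for this example, not part of the notes.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass the inputs through each (weights, biases) pair in order."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations

# Hypothetical 2-3-1 network with hand-picked (untrained) parameters.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # hidden layer
    ([[0.7, -0.5, 0.2]], [0.05]),                                # output layer
]
prediction = forward([1.0, 2.0], layers)
print(prediction)  # a single value between 0 and 1
```

Backward propagation would then compare this prediction with the target and adjust the weights and biases.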
4. Activation Functions:
Non-linear activation functions enable ANNs to model complex relationships.
Common activation functions:
o Sigmoid
o ReLU (Rectified Linear Unit)
o Tanh
o Softmax (for classification tasks)
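The four functions listed above can be written in a few lines each. This is a minimal sketch using only the standard library; the function names are chosen for this example, and the softmax subtracts the maximum score first, a common trick to avoid overflow.

```python
import math

def sigmoid(z):
    """Maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Passes positive values through and clips negatives to zero."""
    return max(0.0, z)

def tanh(z):
    """Maps any real number into (-1, 1)."""
    return math.tanh(z)

def softmax(scores):
    """Turns a list of scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0), relu(-2.0), tanh(0.0), softmax([0.0, 0.0]))
```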
5. Applications:
ANNs are widely used across industries for tasks like:
o Image recognition (e.g., face detection).
o Natural language processing (e.g., sentiment analysis).
o Time series forecasting (e.g., stock price prediction).
o Autonomous vehicles (e.g., driving decisions).
Structure of HebbNet
Input Layer: Accepts the input features.
Output Layer: Produces a response, often a pattern or a class label.
Weights: Initialized randomly and updated according to Hebbian Learning.
Unlike traditional neural networks, HebbNet does not rely on error-based learning
methods like backpropagation but instead updates weights directly based on the
co-activation of neurons.
Algorithm for HebbNet
1. Initialize Weights:
o Start with small random weights between neurons.
4. Repeat:
Continue presenting patterns and updating weights until the network
stabilizes or reaches a stopping criterion.
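A minimal sketch of Hebbian training follows. It assumes the classic Hebb rule, in which each weight changes by the product of its input and the target output (w_i = w_i + x_i * y, b = b + y), bipolar (+1/-1) patterns, and a single pass over the data. The AND example and the names `hebb_train` and `predict` are illustrative choices, not taken from the notes.

```python
import random

def hebb_train(samples, weights, bias, epochs=1):
    """Apply the Hebb rule w_i += x_i * y and b += y for each (x, y) pair."""
    weights = list(weights)
    for _ in range(epochs):
        for x, y in samples:
            for i, xi in enumerate(x):
                weights[i] += xi * y  # strengthen when input and output agree
            bias += y
    return weights, bias

def predict(x, weights, bias):
    """Bipolar step output: +1 if the net input is non-negative, else -1."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net >= 0 else -1

# Bipolar AND: the output is +1 only when both inputs are +1.
and_samples = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]

# Step 1: start with small random weights.
random.seed(0)
init_w = [random.uniform(-0.01, 0.01) for _ in range(2)]
init_b = random.uniform(-0.01, 0.01)

weights, bias = hebb_train(and_samples, init_w, init_b)
outputs = [predict(x, weights, bias) for x, _ in and_samples]
print(outputs)  # [1, -1, -1, -1]
```

Because the plain Hebb rule only ever adds to the weights, repeating it for many epochs makes them grow without bound, which is why this sketch stops after one pass.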
Applications of HebbNet
1. Pattern Association:
o Learning to associate one pattern with another, such as mapping an
input vector to a target vector.
2. Unsupervised Learning:
o Used in scenarios where labeled data is unavailable, and the network
learns correlations in the input data.
3. Feature Extraction:
o Can identify dominant patterns in data, similar to Principal Component
Analysis (PCA).
4. Biological Modeling:
o Helps simulate brain-like learning for understanding neural processes.
What is a Perceptron?
Types of Hyperparameters
1. Model Hyperparameters
These decide the structure or shape of the model.
Examples:
o Number of layers: In a neural network, how many layers are there?
o Tree depth: For decision trees, how many levels of splits are allowed?
2. Training Hyperparameters
These control the learning process during training.
Examples:
o Learning Rate: How big are the steps the model takes while learning?
o Batch Size: How many data samples are used in each training step?
o Number of Epochs: How many times does the model go through the
entire dataset?
3. Regularization Hyperparameters
These prevent the model from overfitting (memorizing the data).
Examples:
o Dropout: Temporarily turns off some parts of the model during training
to make it more general.
o L1/L2 Regularization: Adds a penalty to large model weights, forcing
the model to simplify.
4. Optimization Hyperparameters
These control how optimization algorithms (like gradient descent) work.
Examples:
o Momentum: Carries over part of the previous update so the model keeps
moving in a consistent direction and oscillates less during training.
o Learning Rate Decay: Reduces the learning rate as training progresses.
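To make the four groups concrete, the sketch below collects one hypothetical setting per hyperparameter in a dictionary and shows two of them in action. The values, the inverse-time decay schedule, and the function names are assumptions chosen for illustration; real libraries offer several decay schedules and momentum variants.

```python
# Hypothetical settings, grouped the same way as the list above.
hyperparameters = {
    "model": {"num_layers": 3},
    "training": {"learning_rate": 0.1, "batch_size": 32, "epochs": 20},
    "regularization": {"dropout_rate": 0.5, "l2_lambda": 0.001},
    "optimization": {"momentum": 0.9, "decay_rate": 1.0},
}

def decayed_learning_rate(initial_lr, decay_rate, epoch):
    """Inverse-time decay: the rate shrinks as the epoch count grows."""
    return initial_lr / (1.0 + decay_rate * epoch)

def momentum_step(weight, velocity, gradient, learning_rate, momentum):
    """Classical momentum: blend the previous velocity with the new gradient step."""
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity

lr0 = hyperparameters["training"]["learning_rate"]
decay = hyperparameters["optimization"]["decay_rate"]
for epoch in range(3):
    print(epoch, decayed_learning_rate(lr0, decay, epoch))
```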
Gradient Descent
Gradient Descent is one of the most commonly used iterative optimization
algorithms for training machine learning and deep learning models. It searches
for a local minimum of a function, which in training is usually the loss function.
The direction of each step determines whether the method approaches a local
minimum or a local maximum of the function:
o If we step in the direction of the negative gradient (away from the gradient)
at the current point, we move toward a local minimum of the function. This is
gradient descent.
o If we step in the direction of the positive gradient at the current point, we
move toward a local maximum of the function. This is gradient ascent.
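A small worked example may help here. The sketch below minimizes f(x) = (x - 3)^2, whose gradient is 2(x - 3), by repeatedly stepping against the gradient. The function, the starting point, and the learning rate are assumptions chosen for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)  # move toward lower function values
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)  # very close to 3
```

Replacing `-=` with `+=` would turn this into gradient ascent, which climbs toward a local maximum instead.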
Based on how many training examples are used to compute each update, Gradient
Descent can be divided into batch gradient descent, stochastic gradient descent,
and mini-batch gradient descent. Let's understand these different types of
gradient descent:
1. Batch Gradient Descent:
Batch gradient descent (BGD) computes the error for every point in the training
set and updates the model only after all training examples have been evaluated.
One such full pass over the training set is called a training epoch. In simple
words, each update requires summing the gradient over all examples.
Advantages of Batch gradient descent:
o It produces less noisy updates than the other gradient descent variants.
o It produces stable gradient descent convergence.
o It is computationally efficient per epoch, since the gradient over all training
samples can be computed in a single pass.
2. Stochastic gradient descent
Stochastic gradient descent (SGD) is a type of gradient descent that processes one
training example per iteration. In other words, within each epoch it visits the
examples one at a time and updates the model's parameters after each one.
Because it needs only one training example at a time, it is easy to fit in
memory. However, it loses some computational efficiency compared with batch
gradient descent, because its many single-example updates cannot be computed
together in one pass. The frequent updates also make the gradient noisy. That
noise can sometimes be helpful, since it may let the search escape a local
minimum and move toward the global minimum.
Advantages of Stochastic gradient descent:
In Stochastic gradient descent (SGD), learning happens on every example, which
gives it a few advantages over the other gradient descent variants.
o It is easier to fit in allocated memory.
o Each update is faster to compute than in batch gradient descent.
o It is more efficient for large datasets.
3. Mini-Batch Gradient Descent:
Mini-batch gradient descent combines batch gradient descent and stochastic
gradient descent. It divides the training dataset into small batches and performs
an update on each batch separately. Splitting the training dataset into smaller
batches strikes a balance between the computational efficiency of batch gradient
descent and the speed of stochastic gradient descent. The result is a form of
gradient descent with higher computational efficiency than SGD and less noisy
updates.
Advantages of Mini Batch gradient descent:
o It is easier to fit in allocated memory.
o It is computationally efficient.
o It produces stable gradient descent convergence.
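The three variants differ only in how many examples feed each update, so one training loop can cover all of them. The sketch below fits y = w * x + b on a small noiseless dataset; the data, the learning rate, the epoch count, and the function name `train` are assumptions made for illustration.

```python
import random

def train(data, batch_size, learning_rate=0.1, epochs=2000, seed=0):
    """Fit y = w * x + b by gradient descent on mean squared error.

    batch_size == len(data) -> batch gradient descent
    batch_size == 1         -> stochastic gradient descent
    anything in between     -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new example order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the gradient over the examples in this batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Noiseless samples of y = 2x + 1 with x between 0 and 0.9.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(10)]

for name, size in [("batch", len(data)), ("stochastic", 1), ("mini-batch", 4)]:
    print(name, train(data, batch_size=size))  # each should approach w = 2, b = 1
```

On this tiny dataset all three settings reach nearly the same answer; the differences in noise, memory use, and speed described above show up on larger, noisier data.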
Backpropagation
Backpropagation is a supervised learning algorithm used to train neural networks
by optimizing their weights and biases. It minimizes the error between predicted
and actual outputs by propagating the error backward through the network.
Error Gradient Calculation: Systematically computes how each network
weight contributes to the overall prediction error, using the chain rule of calculus to
determine precise weight adjustments.
Backward Learning Mechanism: Moves from output layer to input layer,
distributing error gradients and updating weights to improve future predictions by
understanding each neuron's error contribution.
Automated Weight Optimization: Automatically adjusts network weights by
computing partial derivatives, allowing the neural network to learn complex
patterns and reduce prediction errors across multiple layers.
Gradient Descent Implementation: Uses an iterative approach to minimize the
loss function by incrementally updating weights in the direction that reduces
network error.
Computational Efficiency: Enables efficient learning in deep neural networks
by calculating gradients for all weights in a single backward pass, making it
scalable for complex machine learning tasks.
Steps in Backpropagation
1. Forward Pass:
o Input data flows through the network.
o Compute the output of each neuron layer by layer.
o Calculate the final output and loss (using a loss function like Mean
Squared Error or Cross-Entropy).
2. Backward Pass (Error Propagation):
o Compute the gradient of the loss function with respect to the output.
o Propagate the error back through the layers using the chain rule of
differentiation.
o Compute the gradients for weights and biases layer by layer.
3. Weight and Bias Updates:
o Update parameters (weights and biases) using the Gradient Descent
formula:
$$\text{Weight}_{\text{new}} = \text{Weight}_{\text{current}} - \text{Learning Rate} \times \text{Gradient}$$
4. Repeat:
o Perform forward and backward passes iteratively for multiple epochs
until the network converges.
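The four steps above can be sketched on the smallest possible network: one input, one hidden sigmoid neuron, and one sigmoid output, trained on a single example with squared error. The network shape, the starting parameters, the target, and the function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Step 1: forward pass, returning hidden and output activations."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * h + b2)
    return h, y_hat

def loss(params, x, y):
    """Squared error for one example."""
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2

def backward(params, x, y):
    """Step 2: chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    d_out = (y_hat - y) * y_hat * (1.0 - y_hat)  # gradient at the output neuron
    d_hidden = d_out * w2 * h * (1.0 - h)        # gradient at the hidden neuron
    return [d_hidden * x, d_hidden, d_out * h, d_out]  # for w1, b1, w2, b2

def train(params, x, y, learning_rate=0.5, epochs=1000):
    """Steps 3 and 4: update every parameter, then repeat."""
    for _ in range(epochs):
        grads = backward(params, x, y)
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params

initial = [0.5, -0.3, 0.8, 0.1]  # hypothetical starting w1, b1, w2, b2
trained = train(initial, x=1.0, y=0.9)
print(loss(initial, 1.0, 0.9), loss(trained, 1.0, 0.9))  # the loss should shrink
```

Real networks apply the same chain-rule pattern to whole layers at once, using matrix operations instead of single numbers.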
Variants of Backpropagation
1. Standard Backpropagation:
o The traditional method where weights and biases are updated after
computing gradients over the full training set.
2. Stochastic Backpropagation:
o Updates parameters after every individual data point (like Stochastic
Gradient Descent).
3. Mini-Batch Backpropagation:
o Divides the data into mini-batches and updates weights after processing
each mini-batch.
4. Online Backpropagation:
o Performs immediate updates to weights after processing each data
point.
5. Backpropagation Through Time (BPTT):
o Used in Recurrent Neural Networks (RNNs) to propagate errors through
time steps for sequential data.
Applications:
Training neural networks for tasks like image recognition, language
translation, and recommendation systems.
Widely used in deep learning for optimizing complex architectures like CNNs
and RNNs.
Avoiding Overfitting Through Regularization
Overfitting occurs when a model learns noise and specific details from the training
data, causing poor performance on unseen data. Regularization techniques help
prevent overfitting by simplifying the model or introducing constraints.
Common Regularization Techniques:
1. L1 Regularization (Lasso):
o Adds the absolute value of weights ($|w|$) as a penalty to the loss
function.
o Encourages sparsity, making some weights zero and simplifying the
model.
2. L2 Regularization (Ridge):
o Adds the squared value of weights ($w^2$) as a penalty to the loss
function.
o Helps reduce the magnitude of weights, leading to a smoother model.
3. Dropout:
o Randomly drops neurons (along with their connections) during training.
o Prevents over-reliance on specific neurons and improves generalization.
4. Early Stopping:
o Monitors the validation loss during training.
o Stops training when the validation loss stops decreasing, preventing
overfitting to the training data.
5. Data Augmentation:
o Expands the training dataset by applying transformations (e.g.,
rotations, flips, noise).
o Helps the model generalize better by learning from varied examples.
6. Batch Normalization:
o Normalizes the inputs of each layer over a mini-batch to have zero mean
and unit variance, then applies a learned scale and shift.
o Was introduced to reduce internal covariate shift and also acts as a mild
form of regularization.
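Several of these techniques are short enough to sketch directly. The code below shows the L1 and L2 penalty terms, inverted dropout, and a patience-based early stopping check. The penalty strength `lam`, the `patience` setting, the inverted-dropout rescaling, and all function names are assumptions chosen for illustration.

```python
import random

def l1_penalty(weights, lam):
    """L1 (Lasso): lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 (Ridge): lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`
    and rescale the survivors so the expected total stays the same."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    waited = 0
    for epoch, val in enumerate(val_losses):
        if val < best:
            best, waited = val, 0  # improvement: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran to the end

weights = [1.0, -2.0, 0.0]
print(l1_penalty(weights, 0.1), l2_penalty(weights, 0.1))
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=random.Random(0)))
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.9]))  # 4
```

In training, the chosen penalty is added to the data loss before gradients are computed, and dropout is applied only during training, not at prediction time.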