Unit 3
A deep neural network is a neural network with at least two hidden layers. Deep neural networks use sophisticated mathematical modeling to process data in different ways. Whereas traditional machine learning algorithms are largely linear, deep learning algorithms are stacked in a hierarchy of increasing complexity and abstraction.
Deep learning creates many layers of neurons, attempting to learn structured representations of the data layer by layer.
A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.
These models are called feedforward because information flows through the function being evaluated from
x, through the intermediate computations
used to define f, and finally to the output y. There are no feedback connections in which outputs of the
model are fed back into itself.
When feedforward neural networks are extended to include feedback connections, they are called recurrent
neural networks.
Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network.
Feedforward neural networks are called networks because they are typically
represented by composing together many different functions. The model is associated with a directed
acyclic graph describing how the functions are composed together.
For example:
we might have three functions f(1), f(2), and f(3) connected in a chain to form f(x) = f(3)(f(2)(f(1)(x))).
This chain structure is the most commonly used structure for neural networks. In this case, f(1) is called the first layer of the network (the input layer, which receives the input); f(2) is called the second layer (a hidden layer, where the learned transformations take place); and so on. The final layer of a feedforward network is called the output layer, which produces the network's output. The overall length of the chain gives the depth of the model, and the width of the model is given by the number of neurons (units) in its layers. It is from this terminology that the name “deep learning” arises.
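As a purely illustrative sketch of this chain structure, the Python snippet below composes three layer functions; the layer sizes, ReLU activation, and random weights are assumptions chosen for the example, not anything specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(W, b):
    """Return a function computing one layer: an affine map followed by ReLU."""
    return lambda x: np.maximum(0.0, W @ x + b)

# Three layers f1, f2, f3 with arbitrary (assumed) sizes 4 -> 5 -> 3 -> 2
f1 = layer(rng.normal(size=(5, 4)), np.zeros(5))
f2 = layer(rng.normal(size=(3, 5)), np.zeros(3))
f3 = layer(rng.normal(size=(2, 3)), np.zeros(2))

x = rng.normal(size=4)          # input vector
y = f3(f2(f1(x)))               # f(x) = f3(f2(f1(x))) -- a chain of depth 3
print(y)
```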
A deep learning neural network learns to map a set of inputs to a set of outputs from training data. We cannot calculate the perfect weights for a neural network analytically; instead, they must be learned through an iterative optimization procedure such as gradient descent.
Gradient descent is an iterative optimization algorithm for finding the minimum of a function.
To find the minimum of a function using gradient descent, one takes steps proportional to the negative of
the gradient of the function at the current point.
The “gradient” in gradient descent refers to an error gradient. The model with a given set of weights is used
to make predictions and the error for those predictions is calculated.
The gradient is given by the slope of the tangent to the cost curve at the current weight value (say w0 = 0.2), and the magnitude of the step is controlled by a parameter called the learning rate. The larger the learning rate, the bigger the step we take; the smaller the learning rate, the smaller the step. We then take the step and move to the next weight value, w1.
When choosing the learning rate we have to be careful, as a large learning rate can lead to big steps that overshoot and miss the minimum.
On the other hand, a small learning rate results in very small steps, causing the algorithm to take a long time to find the minimum point.
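A minimal sketch of this procedure, assuming a simple one-dimensional quadratic loss, the starting weight w0 = 0.2 used above, and learning rates chosen purely for demonstration:

```python
def loss(w):
    return (w - 3.0) ** 2          # simple quadratic loss with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)         # derivative of the loss (the "error gradient")

def gradient_descent(w0, learning_rate, steps=50):
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)   # step in the negative gradient direction
    return w

print(gradient_descent(w0=0.2, learning_rate=0.1))    # converges close to 3
print(gradient_descent(w0=0.2, learning_rate=1.1))    # too large: overshoots and diverges
print(gradient_descent(w0=0.2, learning_rate=0.001))  # too small: barely moves in 50 steps
```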
In deep learning, hidden units (or hidden neurons) are the computational nodes in the hidden layers of a
neural network that process and transform input data into more abstract representations before it reaches the
output layer. The number and arrangement of these hidden units significantly impact the network's learning
capacity, efficiency, and performance.
Hidden units detect patterns and features in the input data that are not directly visible at the output.
Each unit applies a weight to the inputs it receives, sums them, and then applies an activation
function to introduce non-linearity.
By stacking multiple hidden layers with many hidden units, a network can learn increasingly
abstract representations. For instance, in image classification, earlier layers may detect edges, while
deeper layers can detect objects or entire scenes.
Purpose of Activation Functions: Activation functions (e.g., ReLU, sigmoid, tanh) are applied to
each hidden unit’s output, allowing the network to model complex, non-linear relationships.
Common Activation Functions:
ReLU (Rectified Linear Unit): Sets negative values to zero and is commonly used in
hidden layers because it reduces issues like vanishing gradients and speeds up training.
Sigmoid and Tanh: Earlier-used functions that squash the outputs between fixed ranges (0
to 1 for sigmoid, -1 to 1 for tanh). However, they can suffer from vanishing gradients in
deeper networks.
Leaky ReLU, Swish, GELU: Variants that avoid issues like the “dying ReLU” problem
and improve gradient flow through layers.
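The common activation functions above are easy to express directly. The NumPy sketch below is an illustrative implementation (the leaky-ReLU slope of 0.01 is a typical but assumed default), applied to a single hidden unit that weights its inputs, sums them, and then passes the result through a non-linearity:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)                  # sets negative values to zero

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)       # small slope instead of zero

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # squashes outputs to (0, 1)

def tanh(z):
    return np.tanh(z)                          # squashes outputs to (-1, 1)

# A single hidden unit: weighted sum of inputs plus bias, then an activation
x = np.array([0.5, -1.2, 3.0])                 # inputs (assumed values)
w = np.array([0.4, 0.1, -0.7])                 # weights (assumed values)
b = 0.2
z = w @ x + b                                  # pre-activation
print(relu(z), sigmoid(z), tanh(z), leaky_relu(z))
```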
More Units Increases Capacity: Increasing hidden units allows the network to capture more
complex patterns but also increases the risk of overfitting, where the network learns specific noise
in the training data rather than general patterns.
Too Few Units Limits Learning: With insufficient hidden units, the network may lack the capacity
to capture intricate data structures, leading to underfitting and lower accuracy.
Rule of Thumb for Selection: Selecting the optimal number of hidden units often involves
experimentation, typically starting with a manageable number and tuning based on performance.
Cross-validation and regularization techniques can help guide the selection.
The term hidden layers refers to the layers themselves, while hidden units refer to the neurons
within each layer.
Deeper Networks (more hidden layers) can learn hierarchical representations and are ideal for
tasks where different levels of abstraction are needed, like image and language processing.
Wider Networks (more hidden units per layer) can be beneficial for tasks requiring complex
pattern recognition within a single level of abstraction.
5. Distributed Representations
In networks with many hidden units, the hidden layers often develop distributed representations,
where the features learned are represented as combinations of the outputs of multiple units.
These representations are more flexible than single-feature representations, allowing networks to
generalize better across varied inputs.
6. Dropout in Hidden Units
Purpose: Dropout is a regularization technique where random hidden units are "dropped out" (set
to zero) during training, forcing the network to learn robust patterns that don’t rely on specific units.
Application: By randomly deactivating hidden units, dropout helps prevent overfitting, particularly
in deep networks.
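A minimal sketch of dropout using PyTorch (the layer sizes and the drop probability of 0.5 are assumptions): during training, random hidden units are zeroed out, while at evaluation time dropout is disabled and all units are used.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of hidden activations during training
    nn.Linear(64, 10),
)

x = torch.randn(8, 20)

model.train()            # dropout active: different hidden units are dropped each pass
train_out = model(x)

model.eval()             # dropout disabled: all hidden units contribute
eval_out = model(x)
```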
Shallow Networks: Few hidden layers and hidden units, effective for simple tasks (e.g., basic
classification problems) but limited in handling complex data.
Deep Networks with Many Hidden Units: Often used in tasks like image recognition (e.g.,
ResNet, Inception), speech processing, and NLP (e.g., Transformers), where the data requires both
depth (for abstraction) and width (for complexity).
CNNs: Hidden units are organized into convolutional and pooling layers, where each unit in the
convolutional layer acts on a localized region of the input. These units focus on spatial hierarchies,
detecting features like edges, shapes, and patterns.
RNNs: Hidden units in RNNs are recurrent, meaning they process sequences and maintain memory
of previous inputs. Each unit’s output is influenced by both the current input and the previous unit
state, allowing them to capture temporal dependencies.
Transformers: Hidden units are part of self-attention mechanisms, where each unit attends to other
tokens to capture long-range dependencies. Transformers tend to have wider layers and more
hidden units due to the computational efficiency of parallel processing.
Computational Cost: Increasing hidden units raises the number of parameters and computational
requirements. In very deep networks, this can lead to longer training times and higher memory
usage.
Overfitting and Generalization: More hidden units increase the risk of overfitting. Techniques
like dropout, L2 regularization, and early stopping are essential in balancing model complexity and
generalization.
Vanishing/Exploding Gradients: In very deep networks with many hidden units, gradients may
become very small or large as they pass through layers, slowing down or halting training. Solutions
include better initialization, normalization techniques (e.g., batch normalization), and alternative
architectures like residual connections (e.g., ResNet).
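As a hedged illustration of residual connections combined with batch normalization (layer sizes are assumed), the PyTorch sketch below adds the block's input back to its output, giving gradients a direct path through the network:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A simple fully connected residual block: output = x + F(x)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),   # normalizes activations to stabilize training
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)     # skip connection preserves gradient flow

block = ResidualBlock(32)
out = block(torch.randn(16, 32))   # batch of 16 samples, feature size 32
```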
Hidden units are a critical component of deep learning architectures. Their configuration should be carefully
chosen to balance the network’s capacity to learn complex patterns with the need for generalization and
computational efficiency.
The architecture design of deep learning models involves creating and organizing the structure of layers,
neurons, and connections to best solve a specific task. This design is critical in determining the network’s
effectiveness, efficiency, and ability to generalize across data. Here's an overview of key architectural types
and design principles:
1. Basic Feedforward Neural Networks (FNN)
Description: FNNs are the simplest type of neural network, with information flowing in one
direction, from the input to the output through a series of hidden layers.
Structure: An input layer, one or more hidden layers, and an output layer, with each layer fully connected to the next.
Use Cases: Suitable for simple classification and regression tasks on structured data.
2. Convolutional Neural Networks (CNNs)
Key Components:
1. Convolutional Layers: Use filters to detect local patterns. Each filter slides over the input, extracting spatial features, which allows for translation invariance.
2. Pooling Layers: Reduce spatial dimensions by selecting the maximum (max pooling) or average (average pooling) values, preserving important features while lowering computational costs.
3. Fully Connected Layers: Often added at the end to integrate the features learned by convolutional layers (a small code sketch follows this subsection).
Popular Architectures:
1. AlexNet: Popularized deep CNNs for image classification with multiple convolutional layers.
3. ResNet: Introduced residual connections to allow for very deep networks by mitigating the vanishing gradient problem.
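A small, illustrative CNN in PyTorch showing the three kinds of layers described above; the channel counts, kernel sizes, and 28x28 grayscale input are assumptions, not taken from any particular architecture.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer: local filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer: 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                   # fully connected layer for classification
)

logits = cnn(torch.randn(4, 1, 28, 28))          # batch of 4 grayscale 28x28 images
print(logits.shape)                              # torch.Size([4, 10])
```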
3. Recurrent Neural Networks (RNNs)
Purpose: Suited for sequential data, such as time series, audio, and text, RNNs can learn dependencies in data over time by retaining information from previous steps.
Key Components:
Recurrent Layers: Enable each neuron to connect back to itself, allowing information to
persist across steps.
Memory Cells (LSTM and GRU): LSTMs (Long Short-Term Memory) and GRUs (Gated
Recurrent Units) improve traditional RNNs by adding gates that control the flow of
information, enabling the network to retain or forget information selectively.
Limitations: RNNs can struggle with very long sequences due to gradient issues.
Applications: Text generation, language translation, time series forecasting, speech recognition.
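A brief, illustrative PyTorch sketch of a recurrent layer built from LSTM memory cells; the input size, hidden size, and sequence length are assumed values.

```python
import torch
import torch.nn as nn

# LSTM: 8 input features per time step, 32 hidden units, batch-first tensors
lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)

x = torch.randn(4, 20, 8)              # batch of 4 sequences, 20 time steps each
outputs, (h_n, c_n) = lstm(x)          # hidden state h_n and cell state c_n carry memory

print(outputs.shape)                   # torch.Size([4, 20, 32]) - one hidden state per step
print(h_n.shape)                       # torch.Size([1, 4, 32])  - final hidden state
```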
4. Transformers
Purpose: Originally developed for natural language processing, Transformers are now used across
tasks, including image and multi-modal tasks, due to their efficiency and scalability.
Key Components:
Self-Attention Mechanism: Each token in the sequence can focus on relevant parts of the
entire sequence, capturing dependencies over long distances without recurrence.
Multi-Head Attention: Provides multiple attention layers to capture different types of
relationships within the data.
Positional Encoding: Adds information about the order of tokens in the sequence, as
Transformers lack inherent sequence-processing structures.
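To make the self-attention idea above concrete, here is a hedged NumPy sketch of scaled dot-product attention for a single head; the sequence length and embedding size are assumptions, and real Transformers add learned projections, multiple heads, and positional encodings on top of this.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each token's query attends to every token's key; values are mixed accordingly."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # pairwise similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                                       # assumed sizes
X = rng.normal(size=(seq_len, d_model))                        # token embeddings
out = scaled_dot_product_attention(X, X, X)                    # self-attention: Q = K = V = X
print(out.shape)                                               # (5, 16)
```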
Popular Models:
5. Autoencoders
Purpose: Used for unsupervised learning, Autoencoders learn compressed representations of data,
useful for tasks like dimensionality reduction and anomaly detection.
Key Components: An encoder that compresses the input into a low-dimensional latent representation (the bottleneck), and a decoder that reconstructs the input from that representation.
Variants:
6. Generative Adversarial Networks (GANs)
Purpose: GANs are used to generate realistic data samples by learning the underlying data distribution through adversarial training.
Key Components: A generator network that produces synthetic samples from random noise, and a discriminator network that tries to distinguish real samples from generated ones; the two are trained in competition.
7. Graph Neural Networks (GNNs)
Purpose: Designed for data represented as graphs, like social networks, molecular structures, or knowledge graphs.
Key Components:
Graph Convolutional Layers: Extend convolutional operations to graph structures, enabling the
network to learn based on relationships among nodes.
Message Passing Mechanism: Allows each node to aggregate information from its neighbors,
capturing relationships and dependencies.
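A minimal sketch of one graph-convolution / message-passing step in NumPy, assuming a small made-up undirected graph: each node averages its neighbors' features together with its own before a linear transform.

```python
import numpy as np

# Adjacency matrix of a small assumed graph (4 nodes) with self-loops included
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
deg = A.sum(axis=1, keepdims=True)
A_norm = A / deg                        # row-normalize: mean over each node's neighborhood

H = np.random.default_rng(0).normal(size=(4, 8))   # node features (4 nodes, 8 features)
W = np.random.default_rng(1).normal(size=(8, 16))  # learnable weights (assumed shape)

H_next = np.maximum(0.0, A_norm @ H @ W)           # aggregate neighbors, transform, ReLU
print(H_next.shape)                                # (4, 16)
```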
8. Hybrid Architectures
CNN-RNN Hybrids: Useful for video analysis and image captioning, where a CNN extracts spatial features, and an RNN handles sequential data.
Multi-Modal Transformers: Recently popular for models that work with multiple data
types (e.g., text and images), like OpenAI’s CLIP, which learns joint text-image
embeddings.
Applications: Tasks that require both spatial and sequential information, such as video processing,
robotics, and multi-modal learning.
9. Key Design Principles
Layer Depth: Deeper networks can capture more complex patterns, but they also risk vanishing or exploding gradients. Techniques like residual connections (ResNet) or batch normalization mitigate this risk.
Layer Width: Wider networks (more neurons per layer) can capture a greater diversity of features,
but they come with higher computational costs and increased risk of overfitting.
Regularization: Techniques like dropout, L2 regularization, and batch normalization help prevent
overfitting, making the model more robust.
Residual and Skip Connections: By bypassing one or more layers, residual connections help
preserve gradient flow, allowing for much deeper networks.
Attention Mechanisms: Attention is now a key component for many architectures, allowing
models to focus on relevant parts of the input, especially beneficial for sequence-based tasks.
Normalization Techniques: Batch normalization and layer normalization help stabilize and speed
up training by normalizing inputs to layers.
10. Hyperparameter Tuning
Learning Rate: Controls the step size in gradient descent. Finding the right learning rate is critical,
as too high a value can cause the model to overshoot minima, and too low a value can slow down
training.
Batch Size: Larger batches make training more stable but require more memory, while smaller
batches can lead to faster convergence but noisier updates.
Early Stopping: Stops training when the model’s performance on a validation set stops improving,
reducing the risk of overfitting.
The architecture design in deep learning is highly task-specific and requires careful balancing of layer depth,
width, and regularization to optimize model performance while avoiding overfitting and maintaining
computational efficiency.
Backpropagation and other differential algorithms are central to how deep learning models learn by
adjusting their weights to minimize error. These methods involve calculating gradients to update parameters
efficiently during training. Here’s an overview of backpropagation, followed by a look at related
optimization techniques and advancements in gradient-based learning algorithms.
1. Backpropagation
Purpose: Backpropagation (short for "backward propagation of errors") is a method to compute the
gradient of the loss function with respect to each weight in the network. It enables efficient training
of deep neural networks by propagating error backward through the layers.
Process:
1. Forward Pass: Input data passes through the network to compute the output and loss (error).
2. Backward Pass: The network calculates gradients of the loss with respect to each parameter (weight) using the chain rule of calculus. These gradients show how each weight affects the loss.
3. Weight Update: Using the computed gradients, each weight is updated in the opposite direction of the gradient (usually scaled by a learning rate) to minimize the loss.
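These three steps can be traced on a tiny assumed example: the sketch below uses a one-hidden-layer network with a sigmoid activation and a squared-error loss, computes the gradients by hand via the chain rule, and applies one weight update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0                # one training example (assumed sizes)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4) # hidden layer: 3 inputs -> 4 units
W2, b2 = rng.normal(size=4), 0.0              # output layer: 4 units -> 1 output
lr = 0.1

# 1. Forward pass: compute the prediction and the loss
h = sigmoid(W1 @ x + b1)
y_hat = W2 @ h + b2
loss = 0.5 * (y_hat - y) ** 2

# 2. Backward pass: chain rule from the loss back to each weight
d_yhat = y_hat - y                            # dL/dy_hat
dW2, db2 = d_yhat * h, d_yhat                 # gradients for the output layer
d_h = d_yhat * W2                             # dL/dh
d_z1 = d_h * h * (1 - h)                      # back through the sigmoid
dW1, db1 = np.outer(d_z1, x), d_z1            # gradients for the hidden layer

# 3. Weight update: step opposite to the gradient, scaled by the learning rate
W1, b1 = W1 - lr * dW1, b1 - lr * db1
W2, b2 = W2 - lr * dW2, b2 - lr * db2
```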
Challenges in Backpropagation:
1. Vanishing/Exploding Gradients: As errors are propagated backward through many layers, gradients can become extremely small or extremely large, making training slow or unstable.
2. Local Minima and Saddle Points: Backpropagation may get stuck in local minima or saddle points, slowing down convergence.
Solutions:
1. Activation Functions: Functions like ReLU (Rectified Linear Unit) help mitigate vanishing gradients by maintaining larger gradients for positive inputs.
2. Batch Normalization: Normalizes inputs to each layer to stabilize and speed up training.
3. Residual Connections: Enable gradients to flow through the network more directly, making very deep networks like ResNet possible.
2. Gradient Descent Variants
Stochastic Gradient Descent (SGD): Updates weights based on a single sample per iteration. It introduces randomness, which can help avoid local minima but may be noisy.
Mini-Batch Gradient Descent: Updates weights using a subset (mini-batch) of the data. It strikes a
balance between the efficiency of batch gradient descent and the noise-reducing benefit of
averaging multiple samples.
Batch Gradient Descent: Uses the entire dataset to compute gradients before each update, which is
stable but computationally intensive and slow for large datasets.
Momentum:
Description: Accelerates gradient descent by adding a fraction of the previous update to the
current update, helping to smooth the trajectory and speed up convergence, especially in
areas with small gradients.
Formula: v_t = γ v_{t−1} + η ∇L(w), where v_t is the update direction, γ is the momentum coefficient, and η is the learning rate.
Nesterov Accelerated Gradient (NAG):
Description: Improves upon momentum by looking ahead, calculating gradients not at the current position but at a position moved in the direction of the previous momentum.
Advantage: Helps avoid overshooting by incorporating a form of anticipation into the
updates.
Adagrad:
Description: Adapts the learning rate based on the magnitude of gradients in each dimension, giving each parameter its own learning rate that decreases as it accumulates more updates.
Use Case: Works well for sparse data and cases where features have different frequencies.
RMSProp:
Description: An improvement on Adagrad, RMSProp also scales learning rates for each
parameter but with a moving average of past gradients. This approach prevents learning
rates from decaying too fast, as in Adagrad.
Application: Often used in RNNs and other networks with large amounts of sequential
data.
Adam (Adaptive Moment Estimation):
Description: Combines momentum and RMSProp, adjusting learning rates based on both first (momentum) and second (variance) moments of the gradient.
Popular Parameters: Typically, β1 = 0.9 (for momentum) and β2 = 0.999 (for variance).
Advantage: Adapts well to both sparse and dense data, making it widely popular for a
variety of deep learning applications.
AdamW:
Description: A variant of Adam that decouples weight decay from the gradient update,
resulting in better generalization by more precisely regulating regularization during weight
updates.
AdaMax: An extension of Adam using infinity-norm, making it more robust in some cases.
Nadam: Combines Adam with Nesterov momentum, often improving convergence rates.
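A hedged usage sketch: in PyTorch, these optimizers are available in torch.optim and differ only in how they turn gradients into weight updates. The model and most hyperparameter values below are assumptions chosen for illustration (the betas match the typical values mentioned above).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)    # any model's parameters can be handed to an optimizer

sgd      = torch.optim.SGD(model.parameters(), lr=0.01)                    # plain SGD
momentum = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)      # with momentum
nesterov = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
adagrad  = torch.optim.Adagrad(model.parameters(), lr=0.01)
rmsprop  = torch.optim.RMSprop(model.parameters(), lr=0.001)
adam     = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
adamw    = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# One generic training step (the same pattern works for every optimizer above)
x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
adam.zero_grad()
loss.backward()
adam.step()
```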
3. Gradient-Free Optimization Techniques
While gradient-based methods are most common, gradient-free techniques are useful when calculating gradients is computationally infeasible or when the optimization surface is non-differentiable.
Evolutionary Algorithms:
Bayesian Optimization:
Simulated Annealing:
Purpose: Increases learning rates for large batch sizes, adjusting rates based on each layer’s
magnitude, often used in large-scale training.
Fixed Schedules: Decrease learning rate at fixed intervals (e.g., every few epochs).
Adaptive Schedules: Reduce learning rate when performance plateaus, such as the
ReduceLROnPlateau in Keras.
Cosine Annealing: Adjusts the learning rate in a periodic, cosine-shaped curve, allowing
for more aggressive exploration in early stages.
Gradient Clipping: Caps the magnitude of gradients during backpropagation (by value or by norm) to prevent exploding gradients, which is particularly useful in recurrent networks.
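An illustrative PyTorch sketch of learning rate scheduling and gradient clipping; the model, optimizer settings, epoch count, and the max-norm threshold of 1.0 are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Cosine annealing: the learning rate follows a cosine-shaped curve over 50 epochs.
# (ReduceLROnPlateau is the adaptive alternative: it lowers the rate when a
#  validation metric stops improving and is stepped as scheduler.step(val_loss).)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping: rescale gradients whose overall norm exceeds 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```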
Second-Order Methods:
Concept: Use second-order derivatives (the Hessian) to better approximate the curvature of the loss surface, allowing for more accurate steps towards the minima.
Limitation: Calculating the Hessian is computationally intensive, making these methods impractical for large networks.
Limited-memory BFGS (L-BFGS): A quasi-Newton method that approximates second-order curvature information from a limited history of past gradients, reducing memory requirements.
Automatic Differentiation:
Description: A key feature of modern deep learning frameworks like TensorFlow and
PyTorch. It automatically computes the gradients needed for backpropagation by creating a
computational graph and applying the chain rule efficiently.
Reverse Mode (Backpropagation): Used in deep learning to efficiently compute gradients
for large models by moving from the output back to the input.
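A short, hedged example of reverse-mode automatic differentiation in PyTorch: the framework records the computational graph during the forward pass and applies the chain rule when backward() is called. The function being differentiated here is an arbitrary assumed example.

```python
import torch

w = torch.tensor([1.5, -0.3], requires_grad=True)   # parameters to differentiate
x = torch.tensor([2.0, 4.0])

loss = ((w * x).sum() - 1.0) ** 2    # forward pass builds the computational graph
loss.backward()                      # reverse mode: chain rule from the output back to w

print(w.grad)                        # dloss/dw, computed automatically
```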
Backpropagation: The fundamental gradient-based method used in almost all deep learning
models.
Gradient Descent Variants (SGD, Momentum, Adam, RMSProp, etc.): Provide efficient ways
to update parameters by adjusting learning rates and applying momentum.
Gradient-Free Techniques: Used in cases where differentiability isn’t guaranteed or is too
computationally costly, including evolutionary algorithms and Bayesian optimization.
Advanced Techniques: Learning rate scheduling, gradient clipping, and second-order methods
refine the optimization process, improving convergence and stability for specific tasks and network
architectures.
In summary, back-propagation is the backbone of gradient-based learning in deep neural networks, while a
variety of optimization techniques further refine and adapt weight updates, allowing deep learning models
to converge faster and generalize better across different applications.