Deep Learning
Deep learning is a subset of machine learning that utilizes deep neural networks to model
and solve problems requiring abstraction and representation. It is inspired by the structure
and function of the human brain, aiming to automatically learn feature hierarchies from raw
data.
Training deep neural networks involves optimizing the weights of neurons to minimize a loss
function. This process, while conceptually simple, encounters several difficulties:
1. Overfitting
● Issue: DNNs often have a large number of parameters, leading to a high capacity for
memorizing the training data instead of generalizing to unseen data.
● Impact: Poor performance on test data despite excellent performance on training
data.
● Solutions:
○ Regularization techniques like L1/L2 regularization (see the sketch after this list).
○ Dropout: Randomly deactivating neurons during training to reduce
co-dependencies.
○ Data augmentation: Expanding the training dataset by applying
transformations to the existing data.
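To make these remedies concrete, here is a minimal PyTorch sketch combining dropout and L2 regularization; the layer sizes, dropout rate, and weight_decay value are illustrative assumptions, not recommended settings:

```python
import torch
import torch.nn as nn

# A small classifier with dropout between layers (sizes are illustrative).
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly deactivates half the activations during training
    nn.Linear(256, 10),
)

# L2 regularization is applied through the optimizer's weight_decay term.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()  # enables dropout; call model.eval() to disable it at test time
```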
2. Computational Complexity
● Issue: Training DNNs requires substantial computational power and memory due to
their high number of parameters and operations.
● Impact: Long training times and high resource costs.
● Solutions:
○ Use specialized hardware like GPUs or TPUs (see the sketch below).
○ Implement efficient algorithms and libraries (e.g., TensorFlow, PyTorch).
○ Leverage parallel processing and distributed training.
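As a minimal sketch of the hardware point, assuming PyTorch: the same code can run on a CPU or a GPU depending on availability (the model and batch here are placeholders):

```python
import torch
import torch.nn as nn

# Pick a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(784, 10).to(device)  # move the parameters to the device
x = torch.randn(32, 784).to(device)    # keep the data on the same device
logits = model(x)

# For multiple GPUs, PyTorch offers data-parallel wrappers, e.g.:
# model = nn.DataParallel(model)
```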
3. Lack of Interpretability
● Issue: DNNs are often considered "black boxes," making it difficult to understand
how decisions are made.
● Impact: Reduced trust and challenges in debugging models.
● Solutions:
○ Use visualization tools (e.g., saliency maps; a sketch follows this list).
○ Apply explainability techniques like SHAP or LIME.
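As a rough sketch of one such visualization, a gradient-based saliency map measures how sensitive the winning class score is to each input feature; the tiny model and input shape below are placeholder assumptions:

```python
import torch
import torch.nn as nn

# Placeholder model: a single linear layer standing in for a trained network.
model = nn.Sequential(nn.Linear(784, 10))
model.eval()

x = torch.randn(1, 784, requires_grad=True)  # one input, gradients enabled
scores = model(x)
scores[0, scores.argmax()].backward()        # backprop the top class score
saliency = x.grad.abs()                      # large values = influential inputs
print(saliency.shape)                        # torch.Size([1, 784])
```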
While deep neural networks have transformed numerous fields, their training remains a
challenging yet rewarding task. Continuous advancements in techniques and tools are
helping mitigate these difficulties, making DNNs more accessible and powerful for real-world
applications.
An activation function is like a switch or filter in a neural network that decides what
information to pass forward. It takes the input coming into a neuron, applies a rule or
condition, and produces an output that goes to the next layer. This helps the network decide
which signals are important and which to ignore, allowing it to learn and solve complex
problems.
It can also be defined as a transformation that maps a neuron's input signals into
the output signals the network needs.
1. Tanh (Hyperbolic Tangent)
○ What it does: Squashes input into a range between -1 and 1.
○ Good for: Hidden layers, since its output is zero-centered.
○ Issues: Like sigmoid, gradients vanish at extreme input values.
2. Sigmoid Activation Function
○ What it does: Converts input into a range between 0 and 1.
○ Good for: Simple problems or the output layer for probabilities.
○ Issues:
■ Not zero-centered, so neurons can only send positive signals, causing
zig-zag behavior during training.
■ Vanishing Gradient Problem: At extreme values (close to 0 or 1),
gradients almost vanish, slowing or stopping learning.
3. ReLU (Rectified Linear Unit)
○ What it does: Outputs the input if it's positive; otherwise, outputs 0.
○ Good for: Speeding up training and avoiding the vanishing gradient problem.
○ Issues: Can face the Dying ReLU Problem, where some neurons stop
working if weights are updated poorly or learning rates are too high.
○ Advantages:
■ Simple and fast to compute.
■ Creates sparse activations (only some neurons activate at a time).
■ Works well for deep networks.
4. Leaky ReLU
○ What it does: Similar to ReLU, but allows a small negative slope when input
is less than 0 (e.g., outputs 0.01 × input).
○ Good for: Fixing the Dying ReLU Problem, ensuring neurons always
contribute a little.
○ Advantages: Speeds up training and avoids "dead neurons" (see the sketch below).
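A small NumPy sketch of these functions; the 0.01 slope for Leaky ReLU is a common but arbitrary choice:

```python
import numpy as np

def sigmoid(x):
    # squashes input into (0, 1); gradients vanish at the extremes
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # passes positive values through unchanged, zeroes out the rest
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # like ReLU, but keeps a small slope for negative inputs
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(np.tanh(x))     # [-1, 1]
print(sigmoid(x))     # (0, 1)
print(relu(x))        # [0, inf)
print(leaky_relu(x))  # small negative slope below zero
```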
Differences Between Activation Functions
Feature | Tanh | Sigmoid | ReLU | Leaky ReLU
Output Range | [-1, 1] | [0, 1] | [0, ∞) | Same as ReLU for x > 0; small slope for x ≤ 0
1. Parameters
● Definition: Parameters are the values that the deep learning model learns
automatically during training by optimizing the loss function.
● Examples in Deep Learning:
○ Weights: Connections between neurons in different layers.
○ Biases: Additional values added to neuron outputs to allow flexibility in
predictions.
● Characteristics:
○ Learned through training using algorithms like gradient descent.
○ Directly impact how the model processes data and makes predictions.
2. Hyperparameters
● Definition: Hyperparameters are values set manually before training that control
how the model learns; they are not learned from the data.
● Examples in Deep Learning:
○ Learning rate, batch size, number of epochs, number of layers and neurons
per layer.
● Characteristics:
○ Chosen by the practitioner, often through experimentation or search.
○ Govern the training process rather than being adjusted by it.
Key Differences
Aspect | Parameters | Hyperparameters
Role | Define the model's behavior for predictions | Control the training process
In summary, parameters define the model's structure and outputs, while hyperparameters
determine how the model learns.
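A minimal PyTorch sketch of the distinction; the hyperparameter values are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hyperparameters: chosen by hand before training begins.
learning_rate = 0.001
batch_size = 64
num_epochs = 10

# Parameters: the weights and biases inside the layers, learned during training.
model = nn.Linear(784, 10)
print(model.weight.shape)  # torch.Size([10, 784])
print(model.bias.shape)    # torch.Size([10])

# The optimizer updates the parameters; the learning rate stays as we set it.
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
```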
Greedy layer-wise training is a training approach used in deep neural networks to train one
layer at a time in a sequential manner. This method helps initialize the network effectively,
especially in deep architectures, and mitigates challenges like the vanishing gradient
problem.
How It Works
1. Train One Layer at a Time:
Each layer is trained independently, starting from the first layer and moving upward,
without training the entire network all at once.
2. Freeze Lower Layers:
After training a layer, its weights are frozen (fixed), and the next layer is trained
based on the output of the previously trained layer.
3. Stack Layers Gradually:
Layers are added and trained one by one, forming a deeper network step by step.
4. Fine-Tuning (Optional):
Once all layers are trained, the entire network can be fine-tuned end-to-end using
backpropagation to adjust weights across all layers together (see the sketch after
this list).
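A simplified sketch of the procedure, using small PyTorch autoencoders to train each layer; the layer sizes, epoch count, and random stand-in data are illustrative assumptions:

```python
import torch
import torch.nn as nn

data = torch.randn(256, 784)  # stand-in for real training inputs
layer_sizes = [784, 256, 64]
trained_layers = []

inputs = data
for in_dim, out_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
    # Train this layer as an autoencoder on the output of the layers below.
    encoder = nn.Linear(in_dim, out_dim)
    decoder = nn.Linear(out_dim, in_dim)
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
    )
    for _ in range(100):
        recon = decoder(torch.relu(encoder(inputs)))
        loss = nn.functional.mse_loss(recon, inputs)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Freeze the trained layer and feed its output to the next one.
    for p in encoder.parameters():
        p.requires_grad_(False)
    trained_layers.append(encoder)
    inputs = torch.relu(encoder(inputs)).detach()

# Optional fine-tuning: stack the layers and train end-to-end
# (re-enable requires_grad on the parameters first).
stack = nn.Sequential(*(nn.Sequential(l, nn.ReLU()) for l in trained_layers))
```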
Advantages
● Efficient Initialization:
Helps initialize weights in deep networks, avoiding poor starting points caused by
random initialization.
● Reduces Vanishing Gradient Issues:
Since each layer is trained independently, gradients don't diminish as they do in deep
backpropagation.
● Improves Stability:
Training one layer at a time is computationally simpler and reduces the chances of
unstable updates.
● Historical Context:
Used in early deep learning methods like deep belief networks (DBNs) and
autoencoders.
Applications
● Pretraining deep belief networks (DBNs) and stacked autoencoders.
● Settings where labeled data or compute is limited and unsupervised layer-wise
pretraining provides useful starting weights.
Limitations
● Time-Consuming:
Training layers sequentially can take longer compared to end-to-end training.
● Modern Alternatives:
Techniques like better weight initializations (e.g., Xavier, He initialization) and
advanced optimizers (e.g., Adam) often make greedy layer-wise training
unnecessary in modern neural networks.
In summary, greedy layer-wise training builds deep networks layer by layer, stabilizing the
learning process and making it easier to train deep architectures, especially when
computational resources or data are limited.
Recurrent Neural Networks (RNNs)
A Recurrent Neural Network (RNN) is a neural network designed for sequential data: it
processes inputs one time step at a time while passing a hidden state forward.
Key Features
1. Memory: RNNs maintain a hidden state that carries information from previous time
steps, giving the network a memory of the sequence so far.
2. Shared Weights: The weights of the network remain the same for every time step,
making RNNs efficient for sequential tasks.
Structure
1. Repeating Unit: The core component of an RNN is a small neural network (often a
fully connected layer) that is repeated for each time step.
2. Unfolding Over Time: An RNN processes a sequence by "unrolling" itself, where
each time step has its own copy of the repeating unit but shares the same weights
(see the sketch after this list).
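A minimal NumPy sketch of this unrolling: the same weights W_xh, W_hh, and bias b are reused at every time step, and the hidden state h carries information forward (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 8, 5

# Shared weights: the same matrices are applied at every time step.
W_xh = 0.1 * rng.normal(size=(hidden_size, input_size))
W_hh = 0.1 * rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

xs = rng.normal(size=(seq_len, input_size))  # the input sequence
h = np.zeros(hidden_size)                    # initial hidden state ("memory")

for x_t in xs:  # unrolling over time
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)

print(h)  # the final hidden state summarizes the whole sequence
```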
Advantages
● Sequential Understanding: RNNs are great for tasks where the order of data
matters, such as language translation or stock price prediction.
● Memory: They can carry information forward across time steps, allowing them to
understand context in sequences.
Challenges
● Vanishing/Exploding Gradients: When errors are backpropagated over many time
steps, gradients can shrink toward zero or blow up, making long-term
dependencies hard to learn.
● Short Memory: Basic RNNs struggle to retain information across long sequences.
Applications
● Language translation and language modeling.
● Speech recognition.
● Time-series forecasting, such as stock price prediction.
To address the limitations of basic RNNs, advanced architectures like LSTMs (Long
Short-Term Memory) and GRUs (Gated Recurrent Units) were developed to better
capture long-term dependencies.
Backpropagation Through Time (BPTT)
What is BPTT?
BPTT is a method used to train Recurrent Neural Networks (RNNs). Unlike regular neural
networks where data flows in one direction, RNNs process sequences, meaning the output
at a certain time depends not just on the input at that time but also on previous time steps.
BPTT works by "unfolding" the RNN across time steps, turning it into a series of
interconnected layers (like a deep neural network) with shared weights. Errors are
backpropagated through this unfolded structure, and weights are updated using techniques
like gradient descent.
This process helps the RNN learn patterns over time, making it good at tasks like predicting
the next value in a sequence.
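A minimal PyTorch sketch of this idea: unroll an RNN cell over a toy sequence, compute a loss at the end, and let autograd backpropagate the error through every time step (shapes and data are illustrative):

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=4, hidden_size=8)
xs = torch.randn(5, 1, 4)   # (time steps, batch, features)
target = torch.randn(1, 8)  # stand-in training target

h = torch.zeros(1, 8)
for x_t in xs:              # forward: unfold across time with shared weights
    h = cell(x_t, h)

loss = nn.functional.mse_loss(h, target)
loss.backward()             # BPTT: the error flows back through all steps

# The shared weights receive one gradient accumulated over every time step.
print(cell.weight_hh.grad.shape)  # torch.Size([8, 8])
```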
Challenges of BPTT
● Vanishing/Exploding Gradients: Backpropagating through many time steps can
make gradients shrink toward zero or grow uncontrollably.
● High Memory Needs: Activations from every time step must be stored for the
backward pass, which is costly for long sequences.
● Mitigation: Truncated BPTT backpropagates through only a fixed window of recent
time steps, trading some long-range learning for memory and speed.
Summary
BPTT helps RNNs learn from sequential data by "rewinding" through time to adjust weights.
While it's powerful for tasks involving patterns over time, it has limitations like high memory
needs and gradient issues. With improvements like truncated BPTT and specialized
architectures (e.g., LSTMs), these challenges can be mitigated.
What is LSTM?
LSTM (Long Short-Term Memory) is a type of Recurrent Neural Network (RNN) designed to
handle long-term dependencies. It was introduced in 1997 by Sepp Hochreiter and Jürgen
Schmidhuber. Unlike traditional RNNs, which struggle with problems like vanishing and
exploding gradients, LSTMs can efficiently learn and remember important information over
extended sequences of data.
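A minimal usage sketch with PyTorch's built-in nn.LSTM; the sizes and random input are illustrative:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(3, 7, 10)  # (batch, sequence length, features)

output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([3, 7, 20]) - hidden state at every time step
print(h_n.shape)     # torch.Size([1, 3, 20]) - final hidden state
print(c_n.shape)     # torch.Size([1, 3, 20]) - final cell state (long-term memory)
```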
LSTM (Long Short-Term Memory) is a type of neural network, but there are variations to suit
different needs. Let’s simplify each variant:
1. Peephole Connections
● In normal LSTMs, gates (like forget and input gates) decide what information to keep
or discard. However, they don't directly "peek" at the current memory state (C_t).
● With peephole connections, gates are allowed to "look" at the current memory
state while making decisions.
● Think of it as giving the gate extra context:
○ It can see the entire notebook (cell state) before deciding what to add or
erase (see the equations below).
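In one common formulation of the peephole variant (notation follows the standard LSTM gate equations; W and b are the gate weights and biases), each gate's input gains a cell-state term:

```latex
f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)  % forget gate sees C_{t-1}
i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)  % input gate sees C_{t-1}
o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)      % output gate sees C_t
```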
2. Coupled Forget and Input Gates
● Normally, the forget gate and the input gate work separately:
○ Forget gate decides what to erase.
○ Input gate decides what new information to add.
● In coupled gates, these decisions happen together:
○ Forget something only if you’re replacing it with something new.
○ If you’re not adding anything new, you keep the old information intact.
● Think of it like: "I'll clean the space (forget) only if I'm putting new notes there
(input)." The equation below makes this precise.
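In the usual formulation of this variant, the cell-state update becomes a single blend, so whatever fraction is forgotten is exactly the fraction replaced by new content:

```latex
C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t  % keep old where f_t is high
```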
3. GRU (Gated Recurrent Unit)
● A popular simplification of the LSTM: it merges the forget and input gates into a
single update gate and combines the cell state and hidden state into one.
● With fewer gates and parameters, GRUs are simpler and often faster to train than
LSTMs while performing comparably on many tasks.
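In the standard formulation (W_z, W_r, and W are the update-gate, reset-gate, and candidate weights; biases are omitted for brevity), the GRU update is:

```latex
z_t = \sigma(W_z \cdot [h_{t-1}, x_t])                 % update gate
r_t = \sigma(W_r \cdot [h_{t-1}, x_t])                 % reset gate
\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])  % candidate state
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t  % blend old and new
```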
Summary of Variants
In short: peephole connections let the gates see the cell state, coupled gates tie
forgetting to adding new information, and the GRU folds both ideas into a simpler unit
with a single update gate and no separate cell state.