RLDL End Sem
https://yashnote.notion.site/Reinforcement-Learning-and-Deep-Learning-1150e70e8a0f800caaf8fb8967b3e7f4?pvs=4
Reinforcement Learning and Deep Learning
Unit - 1
Reinforcement Learning Foundation: Introduction, Terms, Features, and Elements
Defining Reinforcement Learning Framework and Markov Decision Process (MDP)
MDP
What Is the Markov Decision Process?
Markov Decision Process Terminology
What Is the Markov Property?
Markov Process Explained
Markov Reward Process (MRP)
Markov Decision Process (MDP)
Return (G_t)
Discount (γ)
Policy (π)
Value Functions
State Value Function for MRP
Bellman Expectation Equation for Markov Reward Process (MRP)
State Value Function for Markov Decision Process (MDP)
Action Value Function for Markov Decision Process (MDP)
Bellman Expectation Equation (for MDP)
Markov Decision Process Optimal Value Functions
Bellman Optimality Equation
Exploration vs. Exploitation in Reinforcement Learning
Exploration
Exploitation
The Exploration-Exploitation Dilemma
Factors Affecting the Exploration-Exploitation Trade-off
Techniques for Balancing Exploration and Exploitation
Technical Considerations
1. Agent: The learner or decision-maker that interacts with the environment and
takes actions.
2. Environment: The external system with which the agent interacts. The
environment responds to the agent’s actions by providing new states and
rewards.
3. State (S): A representation of the current situation of the environment as observed by the agent.
4. Action (A): The set of all possible moves the agent can make in a given state.
5. Reward (R): A feedback signal received by the agent after taking an action in
the environment. It can be positive or negative, encouraging or discouraging
certain behaviors.
6. Policy (π): A strategy used by the agent to decide the next action based on
the current state. It can be deterministic or probabilistic.
7. Value Function (V): It estimates the expected reward for each state, helping
the agent evaluate how good or bad a state is.
8. Q-Value or Action-Value Function (Q): Similar to the value function but also
considers the specific action taken from a state. It tells how good it is to take a
particular action in a given state.
Model-Free RL: The agent learns directly from experience without building
a model of the environment (e.g., Q-Learning).
1. Neural Networks:
Input Layer: Takes input features (e.g., pixel values for images, words
for text).
2. Activation Functions:
ReLU (Rectified Linear Unit): Most widely used, it outputs the input if
it’s positive, otherwise zero.
3. Loss Functions:
Loss functions measure how far the model’s predictions are from the
actual results. The goal of training is to minimize the loss.
4. Backpropagation:
Recurrent Neural Networks (RNNs): Suitable for sequence data like time
series or language, as they maintain an internal state to capture temporal
dependencies. A popular variant is the LSTM (Long Short-Term Memory)
network.
Regularization Techniques:
Unit - 1
Reinforcement Learning Foundation: Introduction,
Terms, Features, and Elements
Introduction to Reinforcement Learning:
Reinforcement Learning (RL) is a subfield of machine learning concerned with
how agents should take actions in an environment to maximize cumulative
rewards. Unlike supervised learning, where a model learns from a labeled dataset,
RL involves an agent learning from its own experience through trial and error. This
framework is used in situations where the agent needs to make a series of
decisions over time, with each decision influencing future outcomes. RL can be
applied in various domains, such as robotics, game playing, autonomous vehicles,
and resource management.
In RL, the agent interacts with an environment and learns to make decisions by
receiving feedback in the form of rewards or penalties. Over time, the agent
refines its actions based on this feedback, aiming to develop a policy that
maximizes long-term rewards.
1. Agent:
2. Environment:
The external system with which the agent interacts. It provides the agent
with observations or states and responds to actions by providing rewards
and new states.
3. State (S):
4. Action (A):
The choices or decisions that the agent can make at any given state. The
set of all possible actions is called the action space.
5. Reward (R):
A scalar feedback signal that tells the agent how good or bad its action
was in a particular state. The goal of the agent is to maximize cumulative
rewards over time.
6. Policy (π):
The strategy used by the agent to determine its next action based on the
current state. A policy can be deterministic (mapping each state to a
specific action) or stochastic (mapping states to probabilities of actions).
7. Value Function (V): Estimates the expected cumulative reward obtainable from a state, indicating how good that state is for the agent.
8. Q-Value (Q-Function):
Similar to the value function, but it also considers the specific action taken
from a state. The Q-value function estimates the expected reward for
taking action a in state s and then following the policy π.
Exploration involves the agent trying new actions to discover more about
the environment, while exploitation involves selecting actions that are
known to yield high rewards based on past experience. Balancing
exploration and exploitation is critical in RL.
1. Episode:
A sequence of states, actions, and rewards that ends when the agent reaches
a terminal state. In episodic tasks, learning happens over multiple episodes.
2. Delayed Rewards:
In RL, the rewards for certain actions may not be immediate. The agent
may need to perform a series of actions before receiving feedback. The
challenge is to learn which actions contribute to long-term success.
The agent must explore different strategies to discover the best one while
also exploiting the knowledge it has already gained to maximize rewards.
This balance between exploration and exploitation is a core feature of RL.
Unlike supervised learning, where the model learns from labeled data, RL
agents learn directly from interacting with their environment. The agent
does not need a dataset in advance but instead learns as it acts in the
environment.
1. Agent:
The agent is the core learning system. It decides which actions to take,
learns from the environment, and updates its knowledge or policy based
on the rewards it receives.
2. Environment:
3. Reward Signal:
The reward signal is the feedback that the agent uses to evaluate its
actions. It defines the goal of the RL problem by providing feedback on the
immediate success or failure of the agent’s actions. The agent’s objective
is to maximize the cumulative reward it receives over time.
4. Policy (π):
The policy defines the behavior of the agent. It maps states to actions and
can be either deterministic or stochastic. The policy guides the agent’s
decision-making process, determining which action to take in each state.
The policy is what the agent seeks to improve through learning.
Core RL Algorithms:
1. Model-Free Algorithms:
These algorithms learn the optimal policy without explicitly modeling the
environment. Common examples include Q-Learning, SARSA, and Deep Q-
Networks (DQN). These algorithms directly learn from experience and do
not require a model of the environment’s dynamics.
2. Model-Based Algorithms:
Example:
Consider an agent (robot) that needs to learn to navigate through a maze to reach
the goal. The agent does not know the layout of the maze initially, but it learns
over time by moving through it and receiving rewards (e.g., +1 for reaching the
goal, -1 for hitting walls). The agent’s task is to develop a policy that helps it
consistently find the shortest path to the goal.
Policy (π): A strategy mapping the robot's position to the next action (e.g.,
always move toward the goal).
Value Function (V): The expected cumulative reward from each position.
Through trial and error, the agent learns the best path to reach the goal while
maximizing the cumulative reward.
1. Agent:
The learner or decision-maker that interacts with the environment to choose
actions that will maximize cumulative rewards over time.
2. Environment:
The external system or situation in which the agent operates. The environment
provides states to the agent and updates based on the actions the agent
takes. The environment also provides rewards or penalties that guide the
agent’s learning.
3. State (S):
4. Action (A):
The agent can take one of several possible actions at each time step,
depending on the current state. The set of all possible actions is called the
action space. The agent's goal is to choose actions that maximize long-term
rewards.
5. Reward (R):
After taking an action, the environment provides feedback in the form of a
reward (or penalty), which is a numerical value that measures the immediate
effect of the action on the agent’s performance. The agent’s objective is to
maximize the cumulative sum of rewards over time.
6. Policy (π):
The policy defines the agent’s behavior by mapping states to actions. A policy
can be deterministic, where the agent always chooses the same action for a
given state, or stochastic, where actions are chosen according to a probability
distribution over possible actions.
The environment transitions to a new state and provides a reward to the agent.
This process continues over time, and the agent learns the best policy to
maximize cumulative rewards.
Transition Probability (P): The probability that the robot will move to the
intended adjacent cell (with some chance of slipping to adjacent cells due to
slippery terrain).
Reward (R): +10 for reaching the goal, −1 for each step taken, and −100 for hitting an obstacle.
Discount Factor (γ): 0.9 to ensure the robot focuses on reaching the goal
quickly while avoiding penalties.
The robot learns a policy that guides it through the grid by balancing the
immediate and long-term rewards using MDP principles.
MDP
The Markov decision process (MDP) is a mathematical framework used for
modeling decision-making problems where the outcomes are partly random and
partly controllable. It’s a framework that can address most reinforcement
learning (RL) problems.
3. State: The state defines the current situation of the agent. This can be the
exact position of the robot in the house, the alignment of its two legs or its
current posture. It all depends on how you address the problem.
4. Action: The choice that the agent makes at the current time step. For example,
the robot can move its right or left leg, raise its arm, lift an object or turn
right/left, etc. We know the set of actions (decisions) that the agent can
perform in advance.
We’ll start with the basic idea of Markov Property and then continue to layer more
complexity.
A Markov process is defined by (S, P) where S are the states, and P is the state-
transition probability. It consists of a sequence of random states S₁, S₂ , … where
all the states obey the Markov property.
The state transition probability, or P_ss', is the probability of jumping to a state s' from the current state s.
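In symbols, the standard definition is:
P_ss' = P[S_{t+1} = s' | S_t = s]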
Return (G_t)
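The return G_t is the total discounted reward from time step t onward (standard definition):
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}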
Discount (γ)
The variable γ ∈ [0, 1] in the formula above is the discount factor. The intuition behind using a discount (γ) is that there is no certainty about future rewards. While it is important to consider future rewards to increase the Return, it is equally important to limit their contribution, since we can never be completely certain about the future. Using a discount is also mathematically convenient.
Policy (π)
Value Functions
A value function is the long-term value of a state or an action. In other words, it’s
the expected Return over a state or an action. This is something that we are
actually interested in optimizing.
Solution for the Bellman equation of the state value function. | Image: Rohan
Jagtap
Dot to circle: The environment acts on the agent and sends it to a state based
on the transition probability. Continuing the chess-playing agent example, this
is the part of the transition where the opponent makes a move. After both
moves, we call it a complete state transition. The agent can’t control this part
as it can’t control how the environment acts, only its own behavior.
This decomposition satisfies the Bellman equation, and the same holds for the action value function:
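In standard form, the Bellman expectation equations are:
v_π(s) = Σ_a π(a|s) [ R(s, a) + γ Σ_{s'} P(s'|s, a) v_π(s') ]
q_π(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) Σ_{a'} π(a'|s') q_π(s', a')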
Imagine if we obtained the value for all the states/actions of an MDP for all
possible patterns of actions that can be picked, then we could simply pick the
policy with the highest value for the states and actions. The equations above
represent this exact thing. If we obtain q∗(s, a) , the problem is solved.
We can simply assign probability 1 for the action that has the max value for q∗ and
0 for the rest of the actions for all given states.
Bellman optimality equation for optimal action value function. | Image: Rohan
Jagtap
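In standard form:
q*(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) max_{a'} q*(s', a')
v*(s) = max_a q*(s, a)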
Nothing much can change for this equation, as this is the part of the transition
where the environment acts; the agent cannot control it. However, since we are
following the optimal policy, the state value function will be the optimal one.
We can address most RL problems as MDPs like we did for the robot and the
chess-playing agent. Just identify the set of states and actions.
In the previous section, I said, “Imagine if we obtained the value for all the
states/actions of an MDP for all possible patterns of actions that can be picked….”
However, this is practically infeasible. There can be millions of possible patterns of
transitions, and we cannot evaluate all of them. However, we only discussed the
formulation of any RL problem into an MDP, and the evaluation of the agent in the
context of an MDP. We did not explore solving for the optimal values/policy.
There are many iterative solutions to obtain the optimal solution for the MDP.
Some strategies to keep in mind as you move forward include:
1. Value iteration.
2. Policy iteration.
3. SARSA.
4. Q-learning.
Exploration
Definition: Exploration involves the agent trying new actions and states to
discover potentially better rewards.
Purpose: To identify new opportunities and avoid getting stuck in local optima.
Strategies:
Exploitation
Definition: Exploitation involves the agent choosing actions that it believes will
maximize its reward based on its current knowledge.
Strategies:
Learning algorithm: The RL algorithm used can impact how the agent
balances exploration and exploitation.
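A common way to balance the two is the ε-greedy rule. A minimal sketch in Python (q_values is assumed to hold the current action-value estimates for a single state):

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon explore (random action), otherwise exploit (greedy action)
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))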
Technical Considerations
Q-values: The Q-value function represents the expected future reward for
taking a particular action in a given state.
Learning rate: The learning rate determines how quickly the agent updates its
Q-values based on new experiences.
Example:
class Agent:
    def __init__(self):
        # Initialize parameters (e.g., value estimates, learning rate)
        pass

class Environment:
    def __init__(self):
        # Set up environment (e.g., state space, reward structure)
        pass
3. Documenting Code:
Example:

def select_action(state):  # hypothetical example function illustrating docstring style
    """Select an action for the given state.

    Args:
        state (np.ndarray): The current state of the environment.

    Returns:
        int: The index of the selected action.
    """
    pass
Debug code with tools like pdb for step-by-step inspection of the RL
agent's behavior.
5. Version Control:
Use version control systems like Git to track changes and ensure
collaborative development.
6. Config Files:
learning_rate: 0.001
discount_factor: 0.99
Implement logging using the Python logging module to record metrics and
errors during training.
1. TensorFlow:
TensorFlow is a deep learning framework widely used for building RL models,
particularly for neural networks used in deep reinforcement learning (DRL). It
provides the flexibility to build complex computational graphs and supports
hardware acceleration via GPUs.
Key Features:
TensorFlow 2.x: Encourages the use of Keras API for ease of use and
debugging.
import tensorflow as tf
from tensorflow.keras import layers
2. Keras:
Keras is an easy-to-use, high-level API built into TensorFlow that simplifies the
construction of neural networks. It's popular for rapid prototyping in RL
algorithms and is commonly used in conjunction with TensorFlow for Deep Q-
Networks (DQN), Deep Deterministic Policy Gradient (DDPG), and other DRL
methods.
Key Features:
Functional API: More flexibility for creating complex networks with shared
layers and multiple inputs/outputs.
3. OpenAI Gym:
OpenAI Gym is a toolkit for developing and comparing RL algorithms. It
provides various
environments (simulations) for testing RL algorithms, such as classic control
tasks (CartPole, MountainCar), Atari games, and more complex simulations.
Key Features:
import gym

env = gym.make("CartPole-v1")
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # Take a random action
    next_state, reward, done, _ = env.step(action)
env.close()
import gym
from stable_baselines3 import PPO  # assuming Stable-Baselines3, whose PPO API matches this snippet

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)
Key Features:
1. Hyperparameter Tuning:
Use tools like Optuna or Ray Tune for automated hyperparameter tuning
(e.g., learning rate, discount factor, exploration rate).
2. Custom Environments:
Use gym to create custom environments with unique state spaces, actions,
and reward structures.
3. Replay Buffers:
4. Exploration vs Exploitation:
import gym
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_model(state_shape, n_actions):
    # Q-network builder (reconstructed from the truncated snippet)
    model = Sequential([
        Dense(24, activation='relu', input_shape=(state_shape,)),
        Dense(24, activation='relu'),
        Dense(n_actions, activation='linear')
    ])
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
    return model

# Training loop
for episode in range(1000):
    state = env.reset()
    state = np.reshape(state, [1, state_shape])
Tabular Methods in RL
Tabular methods are primarily used when the state-action space is small enough
that we can store values in a table (a matrix) without memory issues. In these
methods, each state-action pair’s value is stored explicitly, and the agent updates
this value based on its interactions with the environment.
The most common tabular method is Q-learning, which estimates the Q-value for
each state-action pair and uses this to derive an optimal policy.
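Its update rule, in the standard form used in the code example below, is:
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
where α is the learning rate and γ the discount factor.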
Computes an optimal policy by solving the Bellman equations for every state
and action.
Not feasible for very large state spaces due to high computational complexity.
Requires complete episodes for value estimation, making it less suitable for
ongoing, non-terminating tasks.
import numpy as np
import gym

# Initialize environment
env = gym.make('FrozenLake-v1', is_slippery=False)
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 0.1    # Exploration rate
episodes = 10000

for episode in range(episodes):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])
        next_state, reward, done, _ = env.step(action)

        # Q-Learning update
        best_next_action = np.argmax(q_table[next_state])
        q_table[state, action] += alpha * (reward + gamma * q_table[next_state, best_next_action] - q_table[state, action])
        state = next_state
Conclusion
Tabular methods like Q-learning are efficient when the state-action space is
small.
Dynamic Programming (DP) is useful for planning with known models but
becomes impractical in large environments.
Assume the agent is in state s, takes action a, and transitions to state s' while receiving reward r.
The agent then selects action a' from state s'.
SARSA updates the Q-value Q(s, a) based on the reward r and the value of Q(s', a').
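In standard form, the SARSA update is:
Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]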
Summary of Techniques
Technique Key Feature Benefits
Conclusion
Deep Q-Networks and their extensions, including Dueling DQN, Double DQN, and
Prioritized Experience Replay, have significantly improved the ability of agents to
learn optimal policies in complex environments. These techniques combine the
power of deep learning with the principles of reinforcement learning, enabling
agents to effectively handle high-dimensional state spaces while maintaining
stable and efficient learning.
Unit - 2
Policy Optimization in Reinforcement Learning
1. Policy Function (π): A function that defines the agent's behavior, specifying
the probability of taking each action in each state.
3. Objective Function: The goal is to find the optimal policy that maximizes the
expected cumulative reward:
J(θ) = E_π_θ[Σ_t γ^t R_t]
where γ is the discount factor and R_t is the reward at time t.
4. Policy Gradient: The gradient of the objective function with respect to the
policy parameters, used to update the policy in the direction of higher
expected rewards.
Disadvantages:
Key Equations:
2. Log-likelihood Gradient:
∇_θ log π_θ(a|s) = ∇_θ π_θ(a|s) / π_θ(a|s)
Variants and Improvements:
1. Baseline Subtraction: Subtract a baseline (often the state value function) from
the return to reduce variance:
θ ← θ + α (G_t - b(S_t)) ∇_θ log π_θ(A_t|S_t)
2. Actor-Critic Methods: Combine policy gradient with value function estimation
to further reduce variance and improve sample efficiency.
3. Natural Policy Gradient: Use the natural gradient instead of the standard
gradient to achieve more stable and efficient updates.
Cons:
Sample inefficient
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, output_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=-1)  # action probabilities

# Usage
state_dim = 4
action_dim = 2
policy_net = PolicyNetwork(state_dim, action_dim)
This example shows a simple policy network for a problem with 4-dimensional
state space and 2 possible actions. The network outputs action probabilities using
a softmax activation.
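A minimal, hypothetical REINFORCE-style update step using the network above (the observation and the return G are placeholders):

import torch

optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

state = torch.rand(1, state_dim)           # placeholder observation
probs = policy_net(state)                  # action probabilities from the softmax head
dist = torch.distributions.Categorical(probs)
action = dist.sample()
G = 1.0                                    # placeholder return for the sampled action

loss = -dist.log_prob(action) * G          # maximize log-probability weighted by the return
optimizer.zero_grad()
loss.backward()
optimizer.step()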
REINFORCE Algorithm
Algorithm Steps:
Simple to implement
Disadvantages:
Update Equations:
Algorithm Steps:
Disadvantages:
3. May suffer from instability if actor and critic are not well-balanced
import torch
import torch.nn as nn
import torch.optim as optim

class ActorCritic(nn.Module):
    def __init__(self, input_dim, n_actions):
        super(ActorCritic, self).__init__()
        self.actor = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
            nn.Softmax(dim=-1),        # action probabilities
        )
        self.critic = nn.Sequential(   # state-value head
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

# Training step (loss combines the actor and critic terms)
optimizer.zero_grad()
loss.backward()
optimizer.step()
3. Multiple Epochs: Allows for multiple optimization steps on the same data.
Algorithm Steps:
Advantages of PPO:
3. Line Search: Ensures that the policy update improves the objective while
satisfying constraints.
TRPO Objective:
maximize_θ E[π_θ(a|s) / π_θ_old(a|s) * A_π_θ_old(s,a)]
subject to E[KL[π_θ_old(·|s) || π_θ(·|s)]] ≤ δ
Where:
Algorithm Steps:
5. Perform line search to find the largest step size that improves the objective
and satisfies the KL constraint
Advantages of TRPO:
Disadvantages:
3. Target Networks: Slowly-updating copies of the actor and critic for stability.
Algorithm Steps:
Update critic by minimizing the loss: L = 1/N Σ_i (y_i - Q(s_i, a_i|θ^Q))^2
Advantages of DDPG:
Disadvantages:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)
        self.max_action = max_action

    def forward(self, state):
        x = F.relu(self.l1(state))
        x = F.relu(self.l2(x))
        return self.max_action * torch.tanh(self.l3(x))  # bounded continuous action

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, 1)

    def forward(self, state, action):
        x = F.relu(self.l1(torch.cat([state, action], dim=1)))
        x = F.relu(self.l2(x))
        return self.l3(x)  # Q-value estimate

# Usage
state_dim = 10
action_dim = 2
max_action = 1.0
This example provides a basic implementation of the actor and critic networks for
DDPG in PyTorch. The full implementation would include the replay buffer, target
networks, and training loop.
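One of those pieces, the target-network update, is typically a soft update of the form θ' ← τθ + (1 − τ)θ'. A minimal sketch (tau is an assumed small constant):

def soft_update(target, source, tau=0.005):
    # Move each target parameter a small step toward the corresponding source parameter
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.data.copy_(tau * s_param.data + (1.0 - tau) * t_param.data)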
1. Introduction to Model-Based RL
Model-Based Reinforcement Learning is an approach where the agent learns an
explicit model of the environment and uses it for planning and decision-making.
This is in contrast to model-free methods, which learn a policy or value function
directly from experience without explicitly modeling the environment.
Key Components:
1. Environment Model: A function that predicts the next state and reward given
the current state and action.
3. Model Learning: The process of learning the environment model from data.
b. Stochastic Models:
Predict a distribution over next states and rewards
b. Probabilistic Models:
Learn probability distributions over next states and rewards
c. Uncertainty-Aware Models:
Capture uncertainty in predictions
c. Trajectory Optimization:
Optimize a sequence of actions to maximize cumulative reward
d. Dyna-style Algorithms:
Alternate between model-free RL updates and simulated experiences from the
model
5. Advantages of Model-Based RL
1. Sample Efficiency: Can learn from fewer real-world interactions by using the
model to generate simulated experiences.
2. Transfer Learning: The learned model can potentially be reused for different
tasks in the same environment.
3. Interpretability: The explicit model can provide insights into the environment
dynamics.
4. Exploration: Can use the model for directed exploration (e.g., curiosity-driven
exploration).
6. Challenges in Model-Based RL
1. Model Bias: Errors in the learned model can lead to suboptimal policies.
import numpy as np
from sklearn.neural_network import MLPRegressor

class SimpleMBRL:
    def __init__(self, state_dim, action_dim):
        self.model = MLPRegressor(hidden_layer_sizes=(64, 64))
        self.state_dim = state_dim
        self.action_dim = action_dim

    def train_model(self, states, actions, next_states, rewards):
        # Learn to predict (next_state, reward) from (state, action)
        inputs = np.hstack([states, actions])
        targets = np.hstack([next_states, rewards.reshape(-1, 1)])
        self.model.fit(inputs, targets)

    def predict(self, state, action):
        pred = self.model.predict(np.hstack([state, action]).reshape(1, -1))[0]
        return pred[:self.state_dim], pred[self.state_dim]

    def plan(self, initial_state, horizon=10, num_sequences=100):
        # Random-shooting planner: sample action sequences and simulate them with the model
        best_reward, best_sequence = -np.inf, None
        for _ in range(num_sequences):
            state = initial_state
            total_reward = 0
            action_sequence = []
            for _ in range(horizon):
                action = np.random.rand(self.action_dim)  # Random action for simplicity
                next_state, reward = self.predict(state, action)
                total_reward += reward
                action_sequence.append(action)
                state = next_state
            if total_reward > best_reward:
                best_reward, best_sequence = total_reward, action_sequence
        return best_sequence
# Usage
import gym

env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
mbrl = SimpleMBRL(state_dim, action_dim)

# Collect transitions with random actions
states, actions, next_states, rewards = [], [], [], []
state = env.reset()
for _ in range(1000):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    states.append(state)
    actions.append(np.eye(action_dim)[action])  # one-hot encode the discrete action
    next_states.append(next_state)
    rewards.append(reward)
    state = env.reset() if done else next_state

mbrl.train_model(np.array(states), np.array(actions),
                 np.array(next_states), np.array(rewards))
1. Task Distribution: A set of related tasks from which training and test tasks are
drawn.
4. Few-Shot Learning: The ability to learn from only a few examples or trials.
Uses a recurrent policy that receives past rewards and actions as input.
b. Gradient-Based Meta-Learning
1. Algorithm Steps:
Initialize meta-parameters θ
Update θ ← θ − β ∇_θ Σ_i L_{T_i}(f_{θ'_i})
2. Application to RL:
The outer loop updates the initial policy to improve performance across
tasks.
c. Memory-Based Meta-Learning
These approaches use external memory to store information about past
experiences, allowing for quick adaptation to new tasks.
Example: SNAIL (Simple Neural Attentive Meta-Learner, Mishra et al., 2018)
Adapts quickly to new tasks by inferring the posterior over the context
variables.
Learns a meta-policy that can quickly adapt its importance weights for new
tasks.
Applications of Meta-Learning in RL
2. Game AI: Learning general strategies that can be applied to new game
scenarios.
import torch
import torch.nn as nn
import torch.optim as optim

class PolicyNetwork(nn.Module):
    ...  # network definition as in the earlier PolicyNetwork example

class MAML:
    def __init__(self, input_dim, output_dim, alpha=0.01, beta=0.001):
        self.policy = PolicyNetwork(input_dim, output_dim)
        self.meta_optimizer = optim.Adam(self.policy.parameters(), lr=beta)
        self.alpha = alpha
for _ in range(num_steps):
state = task.reset()
done = False
total_reward = 0
loss = -total_reward
task_optimizer.zero_grad()
loss.backward()
task_optimizer.step()
return task_params
meta_loss += -total_reward
self.meta_optimizer.zero_grad()
meta_loss.backward()
self.meta_optimizer.step()
# Usage
input_dim = 4 # Example: CartPole state dimension
output_dim = 2 # Example: CartPole action dimension
maml = MAML(input_dim, output_dim)
# Meta-training
maml.outer_loop(tasks)
c. Partial Observability
e. Non-Stationarity
g. Real-Time Constraints
b. Hybrid Approaches
c. Safe Exploration
f. Hierarchical RL
4. Real-World Applications of RL
a. Robotics and Automation
Example: J.P. Morgan's LOXM system uses RL for optimal trade execution in
equity markets.
d. Healthcare
Example: Researchers have used RL for adaptive clinical trials and personalized
treatment plans in cancer therapy.
e. Recommender Systems
2. RL Formulation:
3. Challenges Addressed:
4. Implementation Strategy:
5. Results:
10. Prepare for Long-Term Maintenance: Plan for ongoing updates, retraining,
and system maintenance.
Unit 3
Deep Learning
Introduction to Deep Learning
Deep learning is a subset of machine learning in artificial intelligence (AI) that focuses on building and training neural networks with multiple layers. These networks learn hierarchical representations of data, which makes them effective in domains such as the following.
3. Healthcare
4. Autonomous Vehicles
5. Recommendation Systems
6. Finance
2. Speech Recognition
3. Game AI
5. NLP Applications
2. Scalability
2. Hidden Layers:
3. Output Layer:
6. Autoencoders
3. Healthcare
4. Finance
7. Recommender Systems
Limitations
The entire learning process can be divided into three main parts:
These calculations occur throughout the entire network. After completing the
calculations in the output layer node(s), we get the final output of the forward
propagation part in the first iteration.
In the forward propagation, calculations are made from the input layer to the
output layer (left to right) through the network.
The loss function computes a score called the loss score between the predicted
values and ground truth values. This is also known as the error of the model. The
loss function captures how well the model performs in each iteration. We use the
loss score as a feedback signal to update parameters in the backpropagation part.
The ideal value of the loss function is zero (0). Our goal is to minimize the loss
function close to 0 in each iteration so that the model will make better predictions
that are close to ground truth values.
Backward propagation
In the first iteration, the predicted values are far from the ground truth values and
the distance score will be high. This is because we initially assigned arbitrary
values to the network’s parameters (weights and biases). Those values are not
optimal values. So, we need to update the values of these parameters in order to
minimize the loss function. The process of updating network parameters is
called parameter learning or optimization which is done using an optimization
algorithm (optimizer) that implements backpropagation.
The objective of the optimization algorithm is to find the global minima where the
loss function has its minimum value. However, it is a real challenge for an
optimization algorithm to find the global minimum of a complex loss function by
avoiding all the local minima. If the algorithm is stopped at a local minimum, we’ll
not get the minimum value for the loss function. Therefore, our model will not
perform well.
Gradient Descent
Adam
Adagrad
Adadelta
Adamax
Nadam
Ftrl
The derivative of the loss function is its slope, which gives us the direction in which we should update (change) the values of the model parameters.
The neural network libraries in Keras provide automatic differentiation. This
means, after you define the neural network architecture, the libraries automatically
calculate all of the derivatives needed for backpropagation.
In the backward propagation, calculations are made from the output layer to the
input layer (right to left) through the network.
Keras: An Overview
What is Keras?
Features of Keras
1. Ease of Use
3. Modularity
4. Extensibility
5. Integration
6. Cross-Platform
2. Layers
4. Loss Functions
5. Metrics
3. Time-Series Analysis
4. Reinforcement Learning
5. Generative Models
model = Sequential([
    Dense(128, activation='relu', input_shape=(input_size,)),
    Dense(num_classes, activation='softmax')  # output layer (assumes num_classes target classes)
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Limitations
Applications of Keras
1. Image and Video Processing: Object recognition, video summarization.
Perceptron
Introduction
The perceptron is the simplest type of artificial neural network.
Components of a Perceptron
1. Input Values: The set of feature values from the dataset used to predict the output. They are also described as the dataset's features.
2. Weights: A real-valued parameter associated with each input feature. It indicates the importance of that feature in predicting the final value.
3. Bias: The activation function is shifted towards the left or right using bias. You
may understand it simply as the y-intercept in the line equation.
4. Summation Function: The summation function binds the weights and inputs
together. It is a function to find their sum.
Mathematical Representation
Output, y: y = Activation(Σ(wᵢ · xᵢ) + b)
Introduction
An extension of the perceptron with multiple layers, making it capable of
solving non-linear problems.
Comprises:
Structure of MLP
1. Multiple Layers:
2. Backpropagation Algorithm:
3. Activation Functions:
Introduction
A Deep Neural Network is an advanced version of MLP with many hidden
layers.
Structure
1. Input Layer:
2. Hidden Layers:
3. Output Layer:
Key Characteristics
1. Feature Learning:
2. Scalability:
3. Powerful Representations:
Uses of DNN
1. Computer Vision:
3. Speech Recognition:
Speech-to-text systems.
4. Autonomous Vehicles:
5. Healthcare:
6. Recommendation Systems:
Layers — Perceptron: single layer; MLP: input, hidden, output; DNN: input, many hidden, output.
Training Algorithm — Perceptron: perceptron rule; MLP: backpropagation; DNN: advanced backpropagation.
The neural network is one of the most widely used machine learning algorithms.
The successful applications of neural networks in fields such as image
classification, time series forecasting, and many others have paved the way for its
adoption in business and research. It is fair to say that the neural network is one of
the most important machine learning algorithms. A clear understanding of the
algorithm will come in handy in diagnosing issues and also in understanding other
advanced deep learning algorithms. The goal of this article is to explain the
workings of a neural network. We will do a step-by-step examination of the
algorithm and also explain how to set up a simple neural network in PyTorch. We
will also compare the results of our calculations with the output from PyTorch.
The coefficients -1.75, -0.1, 0.172, and 0.15 have been arbitrarily chosen for
illustrative purposes. Next, we define two new functions a₁ and a₂ that are
functions of z₁ and z₂ respectively:
The function f(x) = 1 / (1 + e^(−x)) used above is called the sigmoid function. It is an S-shaped curve. The function
f(x) has a special role in a neural network. We will discuss it in more detail in a
subsequent section. For now, we simply apply it to construct functions a₁ and a₂.
Once again, the coefficients 0.25, 0.5, and 0.2 are arbitrarily chosen. Figure 1
shows a plot of the three functions a₁, a₂, and z₃.
We can see from Figure 1 that the linear combination of the functions a₁ and a₂ is a
more complex-looking curve. In other words, by linearly combining curves, we
can create functions that are capable of capturing more complex variations. We
can extend the idea by applying the sigmoid function to z₃ and linearly combining
it with another similar function to represent an even more complex function. In
theory, by combining enough such functions we can represent extremely complex
variations in values. The coefficients in the above equations were selected
arbitrarily. What if we could change the shapes of the final resulting function by
adjusting the coefficients? That would allow us to fit our final function to a very
complex dataset. This is the basic idea behind a neural network. The neural
network provides us a framework to combine simpler functions to construct a
complex function that is capable of representing complicated variations in data.
Let us now examine the framework of a neural network.
Table 1 shows three common activation functions. The plots of each activation
function and its derivatives are also shown. While the sigmoid and the tanh are
smooth functions, the ReLU has a kink at x=0. The choice of the activation function
depends on the problem we are trying to solve. There are applications of neural
networks where it is desirable to have a continuous derivative of the activation
function. For such applications, functions with continuous derivatives are a good
choice. The tanh and the sigmoid activation functions have larger derivatives in
the vicinity of the origin. Therefore, if we are operating in this region these
functions will produce larger gradients leading to faster convergence. In contrast,
away from the origin, the tanh and sigmoid functions have very small derivative
values which will lead to very small changes in the solution. We will discuss the
computation of gradients in a subsequent section. There are many other activation functions as well.
z₁ and z₂ are obtained by linearly combining the input x with w₁ and b₁ and w₂ and
b₂ respectively. a₁ and a₂ are the outputs from applying the ReLU activation
function to z₁ and z₂ respectively. z₃ and z₄ are obtained by linearly combining a₁
and a₂ from the previous layer with w₃, w₅, b₃, and w₄, w₆, b₄ respectively. Finally,
the output yhat is obtained by combining a₃ and a₄ from the previous layer with
w₇, w₈, and b₅. In practice, the functions z₁, z₂, z₃, and z₄ are obtained through a
matrix-vector multiplication as shown in figure 4.
The final step in the forward pass is to compute the loss. Since we have a single
data point in our example, the loss L is the square of the difference between the
output value yhat and the known value y. In general, for a regression problem, the
loss is the average sum of the square of the difference between the network
output value and the known value for each data point. It is called the mean
squared error.
4.0 Setting up the simple neural network in PyTorch:
Our aim here is to show the basics of setting up a neural network in PyTorch using
our simple network example. It is assumed here that the user has installed
PyTorch on their machine. We will use the torch.nn module to set up our network.
We start by importing the nn module as follows:
The nn.Linear class is used to apply a linear combination of weights and biases.
There are two arguments to the Linear class. The first one specifies the number of
nodes that feed the layer. The number of nodes in the layer is specified as the
second argument. For example, the (1,2) specification in the input layer implies
that it is fed by a single input node and the layer has two nodes. The hidden layer
is fed by the two nodes of the input layer and has two nodes. It is important to
note that the number of output nodes of the previous layer has to match the
number of input nodes of the current layer. The (2,1) specification of the output
layer tells PyTorch that we have a single output node. The activation function is
specified in between the layers. As discussed earlier, we use the ReLU function.
Using this simple recipe, we can construct as deep and as wide a network as is
appropriate for the task at hand. The output from the network is obtained by
supplying the input value as follows:
t_u1 is the single x value in our case. To compute the loss, we first define the loss
function. The inputs to the loss function are the output from the neural network
and the known value.
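The actual code lines are not shown in this export; a minimal sketch consistent with the description above (the (1,2), (2,2), (2,1) layers with ReLU, the single input t_u1, and an MSE loss; the tensor values are placeholders):

import torch
import torch.nn as nn

seq_model = nn.Sequential(
    nn.Linear(1, 2),   # input layer: one input node feeding two nodes
    nn.ReLU(),
    nn.Linear(2, 2),   # hidden layer: two nodes in, two nodes out
    nn.ReLU(),
    nn.Linear(2, 1),   # output layer: a single output node
)

t_u1 = torch.tensor([[0.5]])      # the single x value (placeholder)
y_known = torch.tensor([[1.0]])   # the known value (placeholder)

y_hat = seq_model(t_u1)           # forward pass through the network
loss_fn = nn.MSELoss()            # mean squared error loss
loss = loss_fn(y_hat, y_known)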
The weights and biases of a neural network are the unknowns in our model. We
wish to determine the values of the weights and biases that achieve the best fit for
our dataset. The best fit is achieved when the losses (i.e., errors) are minimized.
Note the loss L (see figure 3) is a function of the unknown weights and biases.
Imagine a multi-dimensional space where the axes are the weights and the biases.
The loss function is a surface in this space. At the start of the minimization
process, the neural network is seeded with random weights and biases, i.e., we
start at a random point on the loss surface. To reach the lowest point on the
surface we start taking steps along the direction of the steepest downward slope.
This is what the gradient descent algorithm achieves during each training epoch
or iteration. At any nth iteration the weights and biases are updated as follows:
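The update rule has the standard gradient-descent form:
wᵢ ← wᵢ − η ∂L/∂wᵢ, for i = 1, …, m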
In this update, m is the total number of weights and biases in the network. Note that here we
are using wᵢ to represent both weights and biases. The learning rate η determines
the size of each step. The partial derivatives of the loss with respect to each of
the weights/biases are computed in the back propagation step.
The process starts at the output node and systematically progresses backward
through the layers all the way to the input layer and hence the name
backpropagation. The chain rule for computing derivatives is used at each step.
We now compute these partial derivatives for our simple neural network.
Here we have used the equation for yhat from figure 6 to compute the partial derivative of yhat with respect to w₇. The partial derivatives with respect to w₈ and b₅ are computed
similarly.
Refer to Figure 8 for the partial derivatives wrt w₄, w₆, and b₄:
PyTorch performs all these computations via a computational graph. The gradient
of the loss wrt weights and biases is computed as follows in PyTorch:
First, we reset (zero out) all the gradient terms. optL is the optimizer. Calling .backward() triggers the computation of the gradients in PyTorch.
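A minimal sketch of the calls being described (optL and loss as named in the text; the final parameter update is the usual next call):

optL.zero_grad()   # reset all gradient terms to zero
loss.backward()    # compute gradients of the loss w.r.t. the weights and biases
optL.step()        # update the weights and biases using the computed gradients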
1. Sigmoid Function
Formula: f(x) = 1 / (1 + e^(−x))
Characteristics:
1. Shape:
S-shaped curve.
2. Purpose:
3. Gradient:
Use Cases:
Binary classification tasks.
2. Tanh (Hyperbolic Tangent) Function
Formula: f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Characteristics:
1. Shape:
2. Purpose:
3. Gradient:
Use Cases:
Hidden layers in neural networks, especially when data is centered around
zero.
3. ReLU (Rectified Linear Unit)
Formula: f(x) = max(0, x)
Characteristics:
1. Shape:
2. Purpose:
3. Gradient:
May suffer from "dying ReLU" problem, where neurons output 0 for all
inputs.
Use Cases:
4. Leaky ReLU
Formula: f(x) = x if x > 0; αx if x ≤ 0
Characteristics:
1. Shape:
2. Purpose:
Use Cases:
Hidden layers where ReLU is not performing well.
5. Softmax Function
Formula: f(xᵢ) = e^(xᵢ) / Σ_{j=1}^{n} e^(xⱼ)
Characteristics:
1. Purpose:
2. Gradient:
Use Cases:
6. Swish
Formula: f(x) = x · sigmoid(x)
Output Range: (−∞, ∞)
Characteristics:
1. Purpose:
2. Gradient:
Use Cases:
Gaining popularity in advanced architectures like EfficientNet.
ReLU — Output range: [0, ∞); Advantage: no vanishing gradients; Limitation: dying ReLU for negative values.
Output Layers:
Introduction
A Feedforward Neural Network (FNN) is a type of artificial neural network
where information flows in one direction: from the input layer to the output
layer through hidden layers.
It does not have feedback loops or connections between neurons within the
same layer.
Structure
1. Input Layer:
2. Hidden Layers:
3. Output Layer:
z = Σ_{i=1}^{n} wᵢ · xᵢ + b
where wᵢ are the weights, xᵢ are the inputs, and b is the bias.
2. Prediction:
The output layer produces the final prediction (e.g., probabilities for
classification tasks).
Cost Function
Backpropagation Algorithm
What is Backpropagation?
Steps in Backpropagation
1. Forward Pass:
Calculate the network's output by passing the input data through the
layers.
Evaluate the difference between the predicted output and the actual target
using the cost function.
3. Gradients:
Advantages of Backpropagation
1. Efficient training of deep neural networks.
Limitations
1. Prone to vanishing gradients in deep networks.
Gradient Descent
2. Exploding Gradient:
2. Deep Learning:
3. Reinforcement Learning:
Policy optimization.
Regularization
What is Regularization?
Regularization is a technique used to reduce overfitting in machine learning
models by penalizing large weights in the cost function. This helps in creating
simpler models that generalize better to unseen data.
What is Dropout?
Dropout is a regularization technique used in neural networks to prevent overfitting by randomly "dropping out" neurons (setting their output to 0) during training.
2. Testing Phase:
During testing, no neurons are dropped, but their outputs are scaled down
to account for the dropout during training.
Advantages of Dropout
1. Reduces overfitting.
Drawbacks
1. Increases training time.
Batch Normalization
2. Faster Convergence:
3. Regularization Effect:
Comparison of Techniques
The simplest type of neural network where data flows in one direction
(input → hidden → output).
Use Cases:
Classification
Regression
Example:
Key Features:
Use Cases:
Object detection
Facial recognition
Video processing
Example:
Outputs from previous steps are fed as inputs to the current step.
Key Features:
Speech recognition
Example:
Key Features:
Use Cases:
Text generation
Speech-to-text conversion
Sentiment analysis
Example:
Key Features:
Use Cases:
Machine translation
Example:
6. Autoencoders
Description:
Key Features:
Use Cases:
Anomaly detection
Feature extraction
Image compression
Example:
Key Features:
Use Cases:
Image generation
Data augmentation
Video synthesis
Example:
Deepfake technology.
Key Features:
Use Cases:
Function approximation
Classification
Time-series prediction
Key Features:
Use Cases:
Classification
Regression
Pattern recognition
Key Features:
Use Cases:
Dimensionality reduction
Feature extraction
Key Features:
Use Cases:
Machine translation
Text summarization
Example:
Key Features:
Use Cases:
Robotics
Real-time systems
Low-power AI systems
Comparison Table
Type Key Feature Primary Use Case
CNN
https://developersbreach.com/convolution-neural-network-deep-learning/
1. What is CNN ?
A Convolutional Neural Network has an input layer, an output layer, many hidden layers, and millions of parameters, which give it the ability to learn complex objects and patterns. The input is sub-sampled by convolution and pooling operations and passed through activation functions; these partially connected layers make up the hidden layers, and at the end a fully connected layer produces the output layer. The output retains a shape related to the original input image dimensions.
1.1 Convolution
Convolution is an operation that combines two functions to produce a third function. In CNNs, the input image is subjected to convolution with filters to produce feature maps.
2. Convolution Layer
Convolutions occur in the convolution layer, which is the building block of a CNN. In this layer, each neuron is connected only to a subset of the input image (unlike a fully connected neural network, where every neuron is connected to all inputs). A filter of a chosen dimension slides over these subsets of the input data. Multiple filters are present in a CNN, where each filter moves over the entire image and learns a different portion of the input image.
Parameter Sharing
Parameter sharing is the sharing of weights by all neurons in a particular feature map. Because they all share the same weights, this is called parameter sharing.
Batch normalization allows higher learning rates, which can reduce training time and give better performance. It lets each layer learn more independently of the other layers. Dropout, which is also a regularization technique, is less effective at regularizing convolution layers.
Stride in CNN
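The stride is the number of pixels by which the filter shifts over the input at each step. For an input of size W, filter size F, padding P, and stride S, the output size follows the standard relation:
Output size = ⌊(W − F + 2P) / S⌋ + 1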
ReLU Layer
y = max(0, x)
For example, in a given matrix (M):
[ [ 0, 19, 5 ], [ 7, 0, 12 ], [ 4, 0, 17 ] ]
Average pooling : It involves average calculation for each patch of the feature
map.
Pooling layer
Why pooling is important ?
There can be any number of convolution, ReLU, and pooling layers. The initial convolution layers learn generic information and the last layers learn more specific/complex features. After the final convolution layer, ReLU, and pooling layer, the output feature map (matrix) is converted into a vector (one-dimensional array). This is called the flatten layer.
6. Dropout
Dropout is an approach used for regularization in neural networks. It is a technique where randomly chosen nodes are ignored in the network during the training phase at each stage.
The dropout rate is usually 0.5; it can be tuned to produce the best results and also improves training speed. This method of regularization reduces node-to-node interactions in the network, which leads to the learning of more robust features and helps the model generalize to new data better.
7. Soft-Max Layer
Soft-max is an activation layer normally applied to the last layer of network that
acts as a classifier. Classification of given input into distinct classes takes place at
this layer. The soft max function is used to map the non-normalized output of a
network to a probability distribution.
The output from last layer of fully connected layer is directed to soft max layer,
which converts it into probabilities.
For binary classification problems, the logistic (sigmoid) function is used; for multi-class classification, softmax is used.
2. Filters are generated that perform convolutions over the entire image and train the network to identify and learn features from the image, which are converted to matrices.
4. The convolutions are performed until good accuracy is attained and maximum feature extraction is done.
7. After the final convolution, the input matrix is converted to feature vector. This
feature vector is the flattened layer.
8. Feature vector serves as input to next layer(fully connected layer), where all
features are collectively transferred into this network. Dropout of random
nodes occurs during training to reduce overfitting in this layer.
9. Finally, the raw values which are predicted output by network are converted to
probabilistic values with use of soft max function.
What is a CNN?
A Convolutional Neural Network (CNN) is a specialized type of neural network
designed for processing structured data like images and videos. It is particularly
effective for image-related tasks due to its ability to automatically and adaptively
learn spatial hierarchies of features.
1. Convolutional Layers
The core building block of CNNs.
How it Works:
The kernel (a smaller matrix) scans the input image to produce feature maps.
Mathematical Operation:
2. Pooling Layers
Reduce the spatial dimensions of the feature maps while retaining the most
important information.
Workflow of CNN
1. Input Layer:
2. Convolutional Layer:
3. Activation Function:
4. Pooling Layer:
5. Flattening:
The same kernel is used across the input, reducing the number of
parameters.
2. Sparse Connectivity:
Advantages of CNNs
1. Efficient Feature Extraction:
2. Reduces Parameters:
Applications of CNNs
1. Image Recognition:
2. Face Detection:
3. Medical Imaging:
4. Autonomous Vehicles:
5. Video Analysis:
What is Flattening?
Flattening is the process of converting a multi-dimensional array (e.g., 2D
feature maps) into a 1D vector.
Purpose of Flattening:
Converts the extracted spatial features into a format suitable for processing by
fully connected (dense) layers.
Example in Python:
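The original example is not shown here; a minimal sketch using Keras, with an assumed 4×4×8 feature map:

import numpy as np
from tensorflow.keras import layers

feature_maps = np.random.rand(1, 4, 4, 8).astype("float32")  # batch of one 4x4x8 feature map
flattened = layers.Flatten()(feature_maps)
print(flattened.shape)  # (1, 128) -> 4 * 4 * 8 values per example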
Purpose of FC Layers:
Aggregates features learned by previous layers to make predictions
(classification or regression).
Key Properties:
1. Connections:
Each neuron receives input from all neurons in the previous layer.
Every connection has a weight, and each neuron has a bias term.
3. Activation Function:
2. Hidden Layers:
3. Output Layer:
# Summary
model.summary()
1. Multi-Class Classification:
# Summary
model.summary()
1. Regression:
Steps:
1. Prepare the Data:
Code Example:
Flatten — input shape (H, W, C) → output shape (H * W * C)
RNN
The input layer X processes the initial input and passes it to the middle layer A.
The middle layer consists of multiple hidden layers, each with its activation functions, weights, and biases. These parameters are standardized across the hidden layers, meaning each hidden layer shares the same weights and biases across time steps.
generating image captions.
While training the model, a CNN uses simple backpropagation, whereas an RNN uses backpropagation through time to calculate the loss gradients.
An RNN has no restriction on the length of its inputs and outputs, but a CNN has fixed-size (finite) inputs and outputs.
A CNN is a feedforward network, while an RNN uses loops (recurrent connections) to handle sequential data.
CNNs are typically used for video and image processing, while RNNs are primarily used for speech and text analysis.
Limitations of RNN
Simple RNN models usually run into two major issues. These issues are related to the gradient, which is the slope of the loss function.
1. Vanishing Gradient problem occurs when the gradient becomes so small that
updating parameters becomes insignificant; eventually the algorithm stops
learning.
2. Exploding Gradient problem occurs when the gradient becomes too large,
which makes the model unstable. In this case, larger error gradients
accumulate, and the model weights become too large. This issue can cause
longer training times and poor model performance.
The simple solution to these issues is to reduce the number of hidden layers
within the neural network, which will reduce some complexity in RNNs. These
issues can also be solved by using advanced RNN architectures such as LSTM
and GRU.
GRU Cell
What is an RNN?
A Recurrent Neural Network (RNN) is a type of neural network designed to
process sequential data by using its internal memory. Unlike traditional
feedforward neural networks, RNNs are capable of learning temporal
dependencies by retaining information from previous steps in the sequence.
Suitable for time-series data, text, speech, and other sequential data.
2. Internal Memory:
3. Recurrent Connections:
Each neuron in an RNN is connected to itself in the next time step, creating
loops in the network architecture.
Limitations of RNN
1. Vanishing Gradient Problem:
2. Exploding Gradients:
3. Memory Constraints:
Each layer receives the hidden states from the previous layer as input.
2. Layered Memory:
Lower layers capture simple patterns, while higher layers capture more
abstract patterns in the sequence.
2. Improved Performance:
3. Flexibility:
Variants of RNNs
1. Long Short-Term Memory (LSTM):
Uses gates (input, forget, and output) to control the flow of information.
Combines the forget and input gates into a single update gate.
3. Bidirectional RNN:
4. Deep RNN:
Sentiment analysis.
2. Time-Series Prediction:
Weather forecasting.
3. Speech Recognition:
Voice-to-text conversion.
Video captioning.
# Summary
model.summary()
1. Deep RNN:
# Summary
model.summary()
Conclusion
RNNs are powerful for processing sequential data but have limitations like
vanishing gradients.
For longer dependencies and better gradient flow, LSTMs or GRUs are
preferred.
What is LSTM?
LSTMs use gates to control the flow of information, allowing them to retain
important information for longer periods.
3. Versatility:
Applications of LSTM
1. Natural Language Processing (NLP):
2. Time-Series Forecasting:
3. Speech Recognition:
4. Video Analysis:
# Summary
model.summary()
What is GRU?
GRU is a simplified version of LSTM that also addresses the vanishing
gradient problem.
Unlike LSTM, GRU combines the forget and input gates into a single update
gate and removes the cell state.
2. Efficient:
Applications of GRU
1. Speech Recognition:
Voice-to-text systems.
2. Time-Series Analysis:
3. NLP Tasks:
Parameters — LSTM: more (complex architecture); GRU: fewer (simpler architecture)
Use GRU:
Transfer Learning
2. Fine-Tuning:
3. Feature Extraction:
2. Performance:
3. Knowledge Transfer:
Add layers specific to your task (e.g., classification for a different number
of classes).
Fine-tune the added layers while optionally freezing the base layers.
2. Fine-Tuning:
3. Zero-Shot Learning:
4. Domain Adaptation:
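As a concrete illustration of the workflow described above (add task-specific layers, freeze the base, fine-tune), a minimal Keras sketch; ResNet50, the input shape, and num_classes are assumptions for the example:

import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 5  # assumed number of target classes

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(num_classes, activation="softmax"),  # task-specific head
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])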
3. Medical Imaging:
4. Autonomous Vehicles:
3. Improved Accuracy:
The pre-trained model may not perform well if the new task domain is too
different.
2. Overfitting:
Unit 4
Introduction to Natural Language Processing (NLP)
What is NLP?
Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that
focuses on enabling computers to understand, interpret, and respond to human
language in a meaningful way. It combines computational linguistics, machine
learning, and deep learning techniques to process and analyze text and speech
data.
2. Semantics:
3. Pragmatics:
4. Morphology:
2. Tokenization:
5. Sentiment Analysis:
6. Machine Translation:
7. Text Summarization:
8. Speech Recognition:
9. Language Generation:
Applications of NLP
1. Chatbots and Virtual Assistants:
2. Search Engines:
Google and Bing use NLP for understanding queries and ranking results.
3. Sentiment Analysis:
4. Translation Tools:
Tools like Google Translate rely on NLP for accurate language translation.
5. Healthcare:
6. Document Summarization:
Approaches to NLP
1. Rule-Based Methods:
2. Machine Learning:
3. Deep Learning:
Challenges in NLP
1. Ambiguity:
2. Context Understanding:
4. Domain-Specific Language:
5. Low-Resource Languages:
2. spaCy:
4. Gensim:
5. TextBlob:
Future of NLP
1. Improved Contextual Understanding:
More advanced models like GPT-4 and BERT improve context handling.
2. Multilingual NLP:
4. Ethical NLP:
Dimensions are typically derived from features like terms, context, or co-
occurrence frequencies.
2. Semantic Similarity:
3. Applications:
Information retrieval.
Document clustering.
Vector Creation:
Vector representation:
"cat" → [1, 0, 1]
"dog" → [1, 0, 1]
"fish" → [0, 1, 0]
2. Document Representation:
Two documents:
Term frequency:
D1: [1, 1, 1]
D2: [0, 1, 1]
Advantages of VSM
1. Simple and Effective:
2. Language Agnostic:
Limitations of VSM
1. High Dimensionality:
2. No Contextual Understanding:
3. Assumes Independence:
2. Contextual Models:
2. Document Clustering:
3. Semantic Analysis:
4. Recommender Systems:
Objective
The Skip-Gram model, introduced as part of Word2Vec, aims to predict the
context (surrounding words) given a target word.
Architecture
Input: A single target word (e.g., "dog").
Core Idea: Words that appear in similar contexts will have similar vector
representations.
Training Steps
1. Input Representation:
2. Projection Layer:
3. Output Layer:
4. Optimization:
Advantages
Captures semantic similarity well.
Objective
The CBOW model predicts a target word based on its context words.
Architecture
Input: Context words (a set of surrounding words).
Core Idea: Words in similar contexts are likely to have similar meanings.
Training Steps
1. Input Representation:
2. Projection Layer:
3. Output Layer:
4. Optimization:
Advantages
Faster to train than Skip-Gram.
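A minimal gensim sketch (assuming gensim 4.x and a toy tokenized corpus) showing how the Skip-Gram and CBOW variants are selected:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> Skip-Gram
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # sg=0 -> CBOW

print(skipgram.wv.most_similar("cat"))  # nearest neighbours in the embedding space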
Objective
GloVe is a count-based method that constructs word vectors using the co-
occurrence statistics of words in a corpus.
Core Idea
Words that co-occur frequently in a corpus will have similar representations. For
example:
Key Features
Matrix Construction:
Matrix Factorization:
Objective Function:
Advantages
Combines local (context-based) and global (corpus-wide) information.
c. Downstream Tasks
Evaluate embeddings based on their performance in tasks like:
Text classification.
Sentiment analysis.
5. Applications
b. Analogy Reasoning
Knowledge extraction: Identify relationships in large datasets.
c. Sentiment Analysis
Represent words in sentiment analysis models to classify text polarity.
d. Machine Translation
Word embeddings help align representations of similar words across
languages.
Comparison of Methods
Feature Skip-Gram CBOW GloVe
1. Image Segmentation
Types:
1. Semantic Segmentation:
2. Instance Segmentation:
2. U-Net:
3. Mask R-CNN:
4. DeepLab:
2. Autonomous Vehicles:
3. Satellite Imagery:
model = unet()
model.summary()
2. Object Detection
Combines region proposal networks (RPNs) with CNNs for faster object
detection.
2. Retail:
3. Healthcare:
2. Attention Mechanism:
Allows the model to focus on specific parts of the image while generating
each word.
3. Vision-Language Transformers:
Models like CLIP and BLIP utilize transformers for improved image-text
understanding.
2. Social Media:
3. E-Commerce:
Comparison of Tasks
Image Captioning — Generate textual descriptions for images; Output: sentences or phrases; Techniques: Encoder-Decoder, Attention, Transformers.
Conclusion
Deep learning has enabled significant advancements in computer vision tasks like
image segmentation, object detection, and automatic image captioning. These
tasks find applications in autonomous vehicles, healthcare, e-commerce, and
accessibility technologies. Modern architectures, including transformers, continue
to push the boundaries of these applications.
1. Generator:
2. Discriminator:
2. Synthetic Image:
4. Feedback:
Applications of GANs
1. Image Generation:
3. Super-Resolution:
4. Text-to-Image:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Reshape, Flatten, Conv2D, Conv2DTranspose, LeakyReLU
from tensorflow.keras.models import Sequential

# Generator model
def build_generator():
    model = Sequential([
        Dense(256, input_dim=100),
        LeakyReLU(0.2),
        Dense(512),
        LeakyReLU(0.2),
        Dense(1024),
        LeakyReLU(0.2),
        Dense(28 * 28 * 1, activation="tanh"),
        Reshape((28, 28, 1))
    ])
    return model

# Discriminator model
def build_discriminator():
    model = Sequential([
        Flatten(input_shape=(28, 28, 1)),
        Dense(512),
        LeakyReLU(0.2),
        Dense(256),
        LeakyReLU(0.2),
        Dense(1, activation="sigmoid")  # real/fake probability
    ])
    return model

# Compile GAN
generator = build_generator()
discriminator = build_discriminator()
discriminator.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
Training GANs:
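A minimal sketch of the usual alternating procedure, assuming the generator and discriminator built above and an assumed batch real_images of training images scaled to [-1, 1]:

import numpy as np
from tensorflow.keras.models import Sequential

# Combined model: generator followed by a (frozen) discriminator
discriminator.trainable = False
gan = Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

batch_size = 64
noise = np.random.normal(0, 1, (batch_size, 100))
fake_images = generator.predict(noise)

# 1) Train the discriminator on real (label 1) and fake (label 0) images
discriminator.trainable = True
discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))  # real_images: assumed real batch
discriminator.train_on_batch(fake_images, np.zeros((batch_size, 1)))

# 2) Train the generator through the combined model to fool the discriminator
discriminator.trainable = False
gan.train_on_batch(noise, np.ones((batch_size, 1)))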
How It Works
1. Feature Extraction:
Use a CNN (e.g., ResNet, Inception) to extract spatial features from video
frames.
3. Text Generation:
2. Feature Extraction:
3. Sequence Processing:
4. Caption Generation:
Applications of Video-to-Text
1. Video Summarization:
2. Accessibility:
3. Content Recommendation:
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, TimeDistributed

video_to_text_model = build_video_to_text_model(vocab_size=10000)
Challenges in Video-to-Text
1. Temporal Dependencies:
2. Dataset Complexity:
Conclusion
GANs excel in generating realistic images and videos, finding applications in
data augmentation, content creation, and super-resolution tasks.
2. Channel Attention:
3. Temporal Attention:
4. Self-Attention:
1. Self-Attention
Computes attention scores between every pair of input elements (see the scaled dot-product form after this list).
2. Spatial Attention
Focuses on specific spatial regions of an image.
3. Channel Attention
Determines which feature maps (channels) are important.
4. Multi-Head Attention
Divides the input into multiple subspaces and computes attention for each
subspace.
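All of these variants build on the scaled dot-product attention operation; in standard form:
Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
where Q, K, and V are the query, key, and value matrices and d_k is the key dimension.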
How it Works:
Applications:
How it Works:
Applications:
How it Works:
Applications:
How it Works:
Applications:
5. Attention U-Net
Overview:
How it Works:
Applications:
2. Object Detection:
3. Image Segmentation:
5. Super-Resolution:
6. Anomaly Detection:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten, LayerNormalization, MultiHeadAttention, Dropout
from tensorflow.keras.models import Model

class VisionTransformer(Model):
    def __init__(self, num_patches, projection_dim, num_heads, transformer_units, num_classes):
        super(VisionTransformer, self).__init__()
        self.num_patches = num_patches
        self.projection_dim = projection_dim
        self.class_token = self.add_weight(shape=(1, 1, projection_dim), initializer="random_normal")
        self.position_embedding = self.add_weight(shape=(1, num_patches + 1, projection_dim), initializer="random_normal")
3. Versatility:
2. Large Datasets:
Conclusion
Attention mechanisms have significantly advanced computer vision, enabling
state-of-the-art performance in tasks like image classification, object detection,
and segmentation. While models like Vision Transformers and DETR lead the
way, hybrid approaches combining CNNs with attention mechanisms (e.g., CBAM,
SENet) continue to be effective for resource-constrained applications.