Unit 4

The document discusses Semi-Supervised Learning (SSL) and Reinforcement Learning (RL) as machine learning techniques that utilize both labeled and unlabeled data for model training and decision-making, respectively. It covers various concepts within RL, including the Markov Decision Process, Bellman Equation, Monte Carlo policy evaluation, Q-learning, and SARSA, highlighting their applications and methodologies. Additionally, it introduces Model-Based Reinforcement Learning, which enhances learning efficiency by predicting future states and rewards based on a model of the environment.


Semi-Supervised Learning

 Semi-supervised learning (SSL) is a machine learning technique that uses both labeled and unlabeled data to train AI models for classification and regression.

 SSL is a combination of supervised and unsupervised learning, and it uses techniques that incorporate unlabeled data into model training.
Examples of Semi-Supervised Learning
 Text Classification

 Image Classification

 Anomaly Detection
Reinforcement Learning

 Reinforcement learning (RL) is a machine learning (ML) technique that trains software to make decisions that achieve optimal results.

 Reinforcement learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing the results of those actions.

 Reinforcement learning is a form of machine learning that teaches a model to choose the best course of action while solving a problem.

 It mimics the trial-and-error learning process that humans use to achieve their goals.

 Reinforcement Learning (RL) is the science of decision making. It is about learning the optimal behavior in an environment to obtain maximum reward.
 In reinforcement learning, the agent learns automatically from feedback, without any labeled data, unlike supervised learning.

 Since there is no labeled data, the agent is bound to learn from its experience alone.

 How a robotic dog learns the movement of its arms is an example of reinforcement learning.
Types of Reinforcement:

 There are two types of Reinforcement:


1. Positive: Positive reinforcement occurs when an event, triggered by a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.

Advantages of positive reinforcement:

1. Maximizes performance
2. Sustains change for a long period of time

Drawback: too much reinforcement can lead to an overload of states, which can diminish the results.
Types of Reinforcement:

2. Negative: Negative reinforcement is the strengthening of a behavior because a negative condition is stopped or avoided.

Advantages of negative reinforcement:

1. Increases the desired behavior
2. Helps enforce a minimum standard of performance

Drawback: it only provides enough motivation to meet the minimum required behavior.
• Example: Suppose there is an AI agent within a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing actions, and based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback.
• The agent keeps doing these three things (take an action, change state or remain in the same state, and get feedback), and by doing so it learns and explores the environment.
• The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalties. For a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.
Markov Decision Process

 The Markov Decision Process, or MDP, is used to formalize reinforcement learning problems.

 If the environment is completely observable, then its dynamics can be modeled as a Markov Process.

 In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
 A State is a set of tokens that represents every situation the agent can be in.

 An MDP is a tuple of four elements (S, A, Pa, Ra):

• A finite set of states S
• A finite set of actions A
• A transition probability Pa: the probability of moving from state S to state S' when action a is taken
• A reward Ra: the reward received after transitioning from state S to state S' due to action a
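
As a minimal illustration of this tuple (the states, actions, probabilities, and rewards below are made-up values, not taken from the slides), an MDP can be written down as plain Python dictionaries:

# Illustrative MDP: P[s][a] maps to {next_state: probability}, R maps (s, a, s') to a reward.
states = ["s1", "s2", "s3"]
actions = ["left", "right"]

P = {
    "s1": {"left": {"s1": 1.0}, "right": {"s2": 1.0}},
    "s2": {"left": {"s1": 1.0}, "right": {"s3": 1.0}},
    "s3": {"left": {"s2": 1.0}, "right": {"s3": 1.0}},
}

R = {("s2", "right", "s3"): 1.0}  # only the move from s2 into s3 is rewarded

def reward(s, a, s_next):
    # All unlisted transitions default to a reward of 0.
    return R.get((s, a, s_next), 0.0)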
Markov Property

 MDP relies on the Markov property, so to better understand MDP we need to understand this property first.

 "If the agent is in the current state s1, performs an action a1 and moves to state s2, then the state transition from s1 to s2 depends only on the current state and action; it does not depend on past actions, rewards, or states."

 Example: in a chess game, the players only focus on the current state and do not need to remember past actions or states.

 A Markov Process is a memoryless process with a sequence of random states S1, S2, ....., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state set S and a transition function P.
Bellman Equation

 The Bellman equation was introduced by the mathematician Richard Ernest Bellman in 1953, and hence it is called the Bellman equation.

 It is associated with dynamic programming and is used to calculate the value of a decision problem at a certain point by including the values of the states that follow it:

V(s) = max [R(s,a) + γV(s`)]

Where,
V(s) = the value calculated at a particular state.
R(s,a) = the reward obtained at state s by performing action a.
γ = the discount factor (gamma).
V(s`) = the value of the next state s`.
 In the maze image on the slide, the agent starts at the very first block of the maze. The maze consists of an S6 block, which is a wall, S8, a fire pit, and S4, a diamond block.

 The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward point. It can take four actions: move up, move down, move left, and move right.

 The agent can take any path to reach the final point, but it needs to do so in as few steps as possible. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward point.
 For the 1st block:
 V(s3) = max [R(s,a) + γV(s`)], here V(s`) = 0 because there is no further state to move to.
 V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1

 For the 2nd block:
 V(s2) = max [R(s,a) + γV(s`)], here γ = 0.9 (say), V(s`) = 1, and R(s,a) = 0, because there is no reward at this state.
 V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9

 For the 3rd block:
 V(s1) = max [R(s,a) + γV(s`)], here γ = 0.9, V(s`) = 0.9, and R(s,a) = 0, because there is no reward at this state either.
 V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81

 For the 4th block:
 V(s5) = max [R(s,a) + γV(s`)], here γ = 0.9, V(s`) = 0.81, and R(s,a) = 0, because there is no reward at this state either.
 V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73

 For the 5th block:
 V(s9) = max [R(s,a) + γV(s`)], here γ = 0.9, V(s`) = 0.73, and R(s,a) = 0, because there is no reward at this state either.
 V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66

 Using the Bellman equation, the agent calculates the value of every block except the diamond and the fire state (V = 0); these cannot have values since they are the end of the maze.
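
As a minimal sketch of this backward calculation (the path and γ = 0.9 follow the example above; the variable names are my own), the values can be computed in Python:

# Compute V(s) = max[R(s,a) + gamma * V(s')] backwards along the path S9-S5-S1-S2-S3,
# where only the final move (out of s3, into the diamond block) yields a reward of +1.
gamma = 0.9
path = ["s9", "s5", "s1", "s2", "s3"]

values = {}
next_value = 0.0   # there is no further state beyond the diamond
reward = 1.0       # reward for the last step; earlier steps have no reward
for state in reversed(path):
    values[state] = reward + gamma * next_value
    next_value = values[state]
    reward = 0.0

print(values)  # approx: s3=1.0, s2=0.9, s1=0.81, s5=0.729, s9=0.656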
Policy Evaluation using Monte Carlo

 Monte Carlo policy evaluation is a technique within the field of reinforcement learning that estimates the effectiveness of a policy, i.e., a strategy for making decisions in an environment.

 It is a bit like learning the rules of a game by playing it many times, rather than studying its manual.

 This approach does not require a pre-built model of the environment; instead, it learns exclusively from the outcomes of the episodes it experiences.
How Monte Carlo Policy Evaluation Works?
 The method works by running simulations or episodes (repeated random sampling) in which an agent interacts with the environment until it reaches a terminal state.

 At the end of each episode, the algorithm looks back at the states visited and the rewards received to calculate what is known as the "return": the cumulative reward starting from a specific state until the end of the episode.
 Monte Carlo policy evaluation repeatedly simulates episodes, tracking the total rewards that follow each state and then calculating the average.

 These averages give an estimate of the state value under the policy being followed.

 By aggregating the results over many episodes, the method converges to the true value of each state when following the policy.

 These values are useful because they help us understand which states are more valuable and thus guide the agent toward better decision-making in the future.
 Over time, as the agent learns the value of different states, it can
refine its policy, favoring actions that lead to higher rewards.

 In Monte Carlo policy evaluation, the value V of a state s under a policy π is estimated by the average return G following that state. The return is the cumulative reward obtained after visiting state s:

V(s) = (1 / N(s)) Σ Gi, for i = 1, ..., N(s)

where N(s) is the number of times state s is visited across episodes, and Gi is the return from the i-th episode after visiting state s.
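
A minimal sketch of this averaging in Python (the episode format, γ value, and function name here are illustrative assumptions, not from the slides):

from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.9):
    # First-visit Monte Carlo evaluation.
    # episodes: list of episodes, each a list of (state, reward) pairs in visit order.
    # Returns a dict mapping state -> estimated V(s).
    returns_sum = defaultdict(float)   # total return G observed per state
    visit_count = defaultdict(int)     # N(s): number of first visits to s

    for episode in episodes:
        G = 0.0
        first_visit = {}  # state -> return at its earliest visit in this episode
        # Walk the episode backwards, accumulating the discounted return.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            first_visit[state] = G  # overwritten until only the earliest visit remains
        for state, G_first in first_visit.items():
            returns_sum[state] += G_first
            visit_count[state] += 1

    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}

# Illustrative usage with two tiny made-up episodes:
episodes = [[("s1", 0.0), ("s2", 0.0), ("s3", 1.0)],
            [("s1", 0.0), ("s3", 1.0)]]
print(mc_policy_evaluation(episodes))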
Conclusion

 Monte Carlo policy evaluation is like learning through full experience. It is a hands-on way to measure how effective certain actions are, based on the rewards they yield over many trials.
Q Learning

 Q-learning is a basic form of reinforcement learning that uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.

 Q-learning is a popular model-free reinforcement learning algorithm used in machine learning and artificial intelligence applications.

 It falls under the category of temporal difference learning techniques, in which an agent picks up new information by observing results, interacting with the environment, and getting feedback in the form of rewards.
 In the slide's illustration, there is an agent that has three value options, V(s1), V(s2), and V(s3). As this is an MDP, the agent only cares about the current state and the future state.

 The agent can go in any direction (up, left, or right), so it needs to decide where to go to follow the optimal path.

 Here the agent will take a move on a probability basis and change its state. But if we want exact moves, we need to work in terms of Q-values.
 Q represents the quality of the actions at each state. So instead of using a value at each state, we use a pair of state and action, i.e., Q(s, a).

 The Q-value specifies which action is more lucrative than the others, and according to the best Q-value, the agent takes its next move.

 The Bellman equation can be used to derive the Q-value:

Q(s,a) = Q(s,a) + α [R(s,a) + γ max Q(s`,a`) - Q(s,a)]

The above formula is used to estimate the Q-values in Q-Learning, where α is the learning rate and max Q(s`,a`) is the highest Q-value over the actions available in the next state s`.
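
A minimal sketch of this update rule (the state names, α, γ, and ε below are illustrative assumptions, not from the slides):

import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1     # learning rate, discount factor, exploration rate
actions = ["up", "down", "left", "right"]
Q = defaultdict(float)                     # Q[(state, action)], defaults to 0

def choose_action(state):
    # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    # Q-learning update: move Q(s,a) toward R + gamma * max over a' of Q(s', a').
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Illustrative single update: taking "right" in "s2" reaches "s3" with reward +1.
q_update("s2", "right", 1.0, "s3")
print(Q[("s2", "right")])  # 0.1 after one update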
SARSA

 State-Action-Reward-State-Action

 Consider teaching a computer to play a game, operate a car, or manage resources.

 SARSA is a reinforcement learning algorithm that teaches computers how to make good decisions by interacting with an environment.

 It helps computers learn from their experiences to determine the best actions.
Explanation of SARSA:

 Assume you're teaching a robot to navigate a maze. The robot begins at a specific location (the "State" - where it is), and you want it to discover the best path to the maze's finish.

 The robot can proceed in numerous directions at each step (these are the "Actions" - what it does). As it travels, the robot receives feedback in the form of rewards - positive or negative numbers indicating its performance.
Explanation of SARSA:

 The amazing thing about SARSA is that it doesn't need a map of the maze or explicit instructions on what to do.

 It learns by trial and error, discovering which actions work best in different situations. This way, SARSA helps computers learn to make decisions in various scenarios, from games to driving cars to managing resources efficiently.
Equation

Q(s,a) = Q(s,a) + α [r + γ Q(s`,a`) - Q(s,a)]

Here, the update equation for SARSA depends on the current state s, the current action a, the reward obtained r, the next state s`, and the next action a`; α is the learning rate and γ is the discount factor.
Code Snippet
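The slide's original code and output are not reproduced here; below is a minimal SARSA sketch in Python following the update equation above (the state names, α, γ, and ε are illustrative assumptions):

import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = ["up", "down", "left", "right"]
Q = defaultdict(float)  # Q[(state, action)], defaults to 0

def epsilon_greedy(state):
    # The behavior policy: mostly greedy, with occasional random exploration.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(state, action, reward, next_state, next_action):
    # On-policy update: uses the action actually chosen in the next state,
    # unlike Q-learning, which uses the best possible next action.
    td_target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

# Illustrative single step of the State-Action-Reward-State-Action loop:
s, a = "s2", "right"
r, s_next = 1.0, "s3"            # pretend the environment returned these
a_next = epsilon_greedy(s_next)  # choose the next action with the same policy
sarsa_update(s, a, r, s_next, a_next)
print(Q[(s, a)])  # 0.1 after one update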
Model-Based Reinforcement Learning
 The model is used for planning, which means it provides a way to choose a course of action by considering future situations before actually experiencing them.

 Model-based reinforcement learning uses a model that mimics the behavior of the environment.

 The approaches for solving RL problems with the help of a model are termed model-based approaches.

 With the help of the model, one can make inferences (ideas or conclusions) about how the environment will behave.

 For example, if a state and an action are given, then the model can predict the next state and reward.

 Model-based reinforcement learning (MBRL) is an approach within the field of reinforcement learning (RL) that incorporates a model of the environment to improve the efficiency and effectiveness of the learning process.
 In MBRL, an agent not only learns from interactions with the
environment but also builds and utilizes a model of the environment.

 This model can predict the next state and reward given the current
state and action. It helps the agent to simulate future states and
outcomes without direct interaction.
Approach in MBRL

 1. Model Learning:
 Implicit - indirectly learning models, often through latent variable representations or embeddings.
 Explicit - directly learning the dynamics of the environment (e.g., using neural networks or Gaussian processes).

 2. Planning Algorithms:
 Planning involves using the model to simulate multiple future scenarios, enabling the agent to choose actions that maximize long-term rewards (a small sketch follows this list).

 3. Hybrid Approach:
 Combining MBRL with model-free methods can leverage the strengths of both approaches.
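
As a minimal sketch of explicit model learning followed by one-step planning (the transition data, value estimates, and function names below are illustrative assumptions, not from the slides):

# 1. Model learning (explicit, tabular): estimate the next state and reward per (state, action)
#    from observed transitions. Real MBRL would typically fit e.g. a neural network instead.
model = {}  # (state, action) -> (predicted next state, predicted reward)

def learn_model(transitions):
    # transitions: list of (state, action, reward, next_state) tuples from experience
    for s, a, r, s_next in transitions:
        model[(s, a)] = (s_next, r)  # deterministic model: the last observation wins

# 2. Planning: simulate each action with the learned model and pick the most rewarding one.
def plan(state, actions, values, gamma=0.9):
    def simulated_return(a):
        if (state, a) not in model:
            return 0.0  # unknown transition, assume no value
        s_next, r = model[(state, a)]
        return r + gamma * values.get(s_next, 0.0)
    return max(actions, key=simulated_return)

# Illustrative usage with made-up experience and state-value estimates:
learn_model([("s1", "right", 0.0, "s2"), ("s2", "right", 1.0, "s3")])
values = {"s2": 0.9, "s3": 1.0}  # e.g. produced by a separate value-estimation step
print(plan("s1", ["left", "right"], values))  # "right"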
