
AI (UNIT-5)

Reinforcement Learning
What is Reinforcement Learning?
 Reinforcement Learning is a feedback-based Machine learning technique in which an
agent learns to behave in an environment by performing the actions and seeing the
results of actions. For each good action, the agent gets positive feedback, and for each
bad action, the agent gets negative feedback or penalty.
 In Reinforcement Learning, the agent learns automatically from feedback without any
labeled data, unlike supervised learning.
 Since there is no labeled data, the agent is bound to learn from its experience only.
 RL solves a specific type of problem where decision making is sequential, and the goal
is long-term, such as game-playing, robotics, etc.
 The agent interacts with the environment and explores it by itself. The primary goal of
an agent in reinforcement learning is to improve the performance by getting the
maximum positive rewards.
 The agent learns through a process of trial and error, and based on this experience, it learns
to perform the task in a better way. Hence, we can say that "Reinforcement learning
is a type of machine learning method where an intelligent agent (computer
program) interacts with the environment and learns to act within it." How a
robotic dog learns the movement of its limbs is an example of reinforcement
learning.
 For example, in the Mario video game, if a character takes a random action (e.g.
moving left), based on that action, it may receive a reward. After taking the action, the
agent (Mario) is in a new state, and the process repeats until the game character
reaches the end of the stage or dies. This episode will repeat multiple times until
Mario learns to navigate the environment by maximizing the rewards.
Example:
Suppose there is an AI agent present within a maze environment, and its goal is to find
the diamond. The agent interacts with the environment by performing some actions, and
based on those actions, the state of the agent gets changed, and it also receives a reward
or penalty as feedback.
The agent continues doing these three things (take action, change state/remain in
the same state, and get feedback), and by doing these actions, it learns about and explores
the environment.
The agent learns which actions lead to positive feedback (rewards) and which
actions lead to negative feedback (penalties). As a positive reward, the agent gets a positive
point, and as a penalty, it gets a negative point.

Terms used in Reinforcement Learning


 Agent(): An entity that can perceive/explore the environment and act upon it.
 Environment(): A situation in which an agent is present or surrounded by. In RL, we
assume the stochastic environment, which means it is random in nature.
 Action(): Actions are the moves taken by an agent within the environment.
 State(): State is a situation returned by the environment after each action taken by the
agent.
 Reward(): A feedback returned to the agent from the environment to evaluate the
action of the agent.
 Policy(): Policy is a strategy applied by the agent for the next action based on the
current state.
 Value(): The expected long-term return with the discount factor, as opposed to the
short-term reward.
 Q-value(): It is mostly similar to Value, but it takes the current action (a) as an
additional parameter.
Key Features of Reinforcement Learning
 In RL, the agent is not instructed about the environment and what actions need
to be taken.
 It is based on a trial-and-error process.
 The agent takes the next action and changes states according to the feedback
of the previous action.
 The agent may get a delayed reward.
 The environment is stochastic, and the agent needs to explore it in order to obtain
the maximum positive reward.

Approaches to implement Reinforcement Learning


There are mainly three ways to implement reinforcement learning in ML, which are:
1. Value-based
2. Policy-based
3. Model-based

1. Value-based: The value-based approach aims to find the optimal value function, which
gives the maximum value achievable at a state under any policy. Here, the agent expects the
long-term return at any state s under policy π.
2. Policy-based: The policy-based approach aims to find the optimal policy for the maximum
future reward without using the value function. In this approach, the agent tries to apply a
policy such that the action performed at each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy (a short sketch contrasting the two follows this list of approaches):
 Deterministic: The same action is produced by the policy (π) at any given state.
 Stochastic: In this policy, a probability distribution determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no particular
solution or algorithm for this approach because the model representation is different for each
environment.
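To make the deterministic/stochastic distinction concrete, here is a minimal Python sketch; the state and action names are purely hypothetical and chosen for illustration.

```python
import random

# Deterministic policy: a = pi(s) -- each state maps to exactly one action.
deterministic_policy = {"s1": "right", "s2": "right", "s3": "up"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: pi(a | s) -- each state maps to a probability
# distribution over actions, and the action is sampled from it.
stochastic_policy = {
    "s1": {"right": 0.8, "up": 0.2},
    "s2": {"right": 0.6, "up": 0.4},
}

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s1"))  # always 'right'
print(act_stochastic("s1"))     # 'right' about 80% of the time, 'up' about 20%
```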
Elements of Reinforcement Learning: There are four main elements of Reinforcement
Learning, which are given below:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
1) Policy: A policy can be defined as a way how an agent behaves at a given time. It
maps the perceived states of the environment to the actions taken on those states. A
policy is the core element of the RL as it alone can define the behavior of the agent.
For deterministic policy: a = π(s)
For stochastic policy: π(a | s) = P[At = a | St = s]
2) Reward Signal: The goal of reinforcement learning is defined by the reward signal.
At each state, the environment sends an immediate signal to the learning agent, and this
signal is known as a reward signal. These rewards are given according to the good and
bad actions taken by the agent. The agent's main objective is to maximize the total
reward it receives for good actions. The reward signal can also change the policy: if
an action selected by the agent leads to a low reward, the policy may change to select
other actions in the future.
3) Value Function: The value function gives information about how good the situation
and action are and how much reward an agent can expect. A reward indicates the
immediate signal for each good and bad action, whereas a value function specifies the
good state and action for the future. The value function depends on the reward as,
without reward, there could be no value. The goal of estimating values is to achieve more
rewards.
4) Model: The last element of reinforcement learning is the model, which mimics the
behavior of the environment. With the help of the model, one can make inferences about
how the environment will behave. Such as, if a state and an action are given, then a
model can predict the next state and reward.
The model is used for planning, which means it provides a way to take a course of action
by considering all future situations before actually experiencing those situations.
 The approaches that solve RL problems with the help of a model are
termed model-based approaches. Comparatively, an approach without
using a model is called a model-free approach.

How does Reinforcement Learning Work?


To understand the working process of the RL, we need to consider two main things:
o Environment: It can be anything such as a room, maze, football ground, etc.
o Agent: An intelligent agent such as an AI robot.
Let's take an example of a maze environment that the agent needs to explore. Consider
the below image:

In the above image, the agent is at the very first block of the maze. The maze
contains an S6 block, which is a wall, an S8 block, which is a fire pit, and an S4 block, which holds the diamond.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block,
it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take
four actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in as few
steps as possible. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1
reward point. The agent will try to remember the preceding steps that it has taken to reach
the final step. To memorize the steps, it assigns a value of 1 to each previous step. Consider
the below step:
Now, the agent has successfully stored the previous steps by assigning the value 1 to each
previous block. But what will the agent do if it starts moving from a block that has
blocks with value 1 on both sides? Consider the below diagram:

It will be difficult for the agent to decide whether it should go up or down, as each
block has the same value. So, the above approach is not suitable for the agent to reach the
destination. Hence, to solve this problem, we will use the Bellman equation, which is the
main concept behind reinforcement learning.

The Bellman Equation


The Bellman equation was introduced by the mathematician Richard Ernest Bellman in
the year 1953, and hence it is called the Bellman equation. It is associated with dynamic
programming and is used to calculate the value of a decision problem at a certain point
by including the values of the states that follow it.
It is a way of calculating value functions in dynamic programming, and it
leads to modern reinforcement learning.
The key elements used in the Bellman equation are:
 The action performed by the agent, referred to as "a"
 The state reached by performing the action, "s"
 The reward/feedback obtained for each good and bad action, "R"
 The discount factor, Gamma "γ"
The Bellman equation can be written as:
V(s) = max [R(s,a) + γV(s′)]
Where,
 V(s) = the value calculated at a particular state.
 R(s,a) = the reward obtained at state s by performing action a.
 γ = the discount factor.
 V(s′) = the value of the next state.
In the above equation, we take the maximum over the possible actions because the
agent always tries to find the optimal solution.
So now, using the Bellman equation, we will find value at each state of the given
environment. We will start from the block, which is next to the target block.

For 1st block:


V(s3) = max [R(s,a) + γV(s′)],
here V(s′) = 0 because there is no further state to move to.
V(s3) = max[R(s,a)]
V(s3) = max[1]
V(s3) = 1

For 2nd block:


V(s2) = max [R(s,a) + γV(s′)],
here γ = 0.9 (let's say), V(s′) = 1, and R(s,a) = 0, because there is no reward at this state.
V(s2) = max[0.9(1)]
V(s2) = max[0.9]
V(s2) = 0.9
For 3rd block:
V(s1) = max [R(s,a) + γV(s′)],
here γ = 0.9, V(s′) = 0.9, and R(s,a) = 0, because there is no reward at this state either.
V(s1) = max[0.9(0.9)]
V(s1) = max[0.81]
V(s1) = 0.81

For 4th block:


V(s5) = max [R(s,a) + γV(s′)]
here γ = 0.9, V(s′) = 0.81, and R(s,a) = 0, because there is no reward at this state either.
V(s5) = max[0.9(0.81)]
V(s5) = max[0.729]
V(s5) = 0.729 ≈ 0.73

For 5th block:


V(s9) = max [R(s,a) + γV(s′)]
here γ = 0.9, V(s′) = 0.73, and R(s,a) = 0, because there is no reward at this state either.
V(s9) = max[0.9(0.73)]
V(s9) = max[0.657]
V(s9) = 0.657 ≈ 0.66
Consider the below image:
Now, we will move further to the 6th block, and here the agent may change its route because
it always tries to find the optimal path. So now, let's consider the block next to the
fire pit.

Now, the agent has three options to move: if it moves to the blue box, it will feel a
bump; if it moves to the fire pit, it will get the -1 reward. But here we are considering
only positive rewards, so it will move upwards only. The complete block
values can be calculated using this formula. Consider the below image:
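The hand calculation above can be reproduced with a few lines of Python. This is a minimal sketch that only backs values up along the single path S9-S5-S1-S2-S3 with γ = 0.9 and a +1 reward at the block next to the diamond; it is not a full value-iteration implementation.

```python
# Bellman backup V(s) = R(s,a) + gamma * V(s') applied along the path,
# starting from the block next to the diamond and moving outward.
gamma = 0.9
path = ["s3", "s2", "s1", "s5", "s9"]          # s3 is next to the goal
reward = {"s3": 1.0, "s2": 0.0, "s1": 0.0, "s5": 0.0, "s9": 0.0}

V = {}
next_value = 0.0                               # V(s') = 0 beyond the goal
for state in path:
    V[state] = reward[state] + gamma * next_value
    next_value = V[state]

for state in path:
    print(f"V({state}) = {V[state]:.2f}")
# V(s3) = 1.00, V(s2) = 0.90, V(s1) = 0.81, V(s5) = 0.73, V(s9) = 0.66
```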

Types of Reinforcement learning


There are mainly two types of reinforcement learning, which are:
1. Positive Reinforcement
2. Negative Reinforcement
1. Positive Reinforcement: Positive reinforcement means adding
something to increase the tendency that the expected behavior will occur again. It impacts
the behavior of the agent positively and increases the strength of the behavior.
This type of reinforcement can sustain changes for a long time, but too much positive
reinforcement may lead to an overload of states, which can diminish the results.
2. Negative Reinforcement: Negative reinforcement is the opposite of positive
reinforcement: it increases the tendency that a specific behavior will occur
again by avoiding a negative condition.
It can be more effective than positive reinforcement depending on the situation and
behavior, but it provides reinforcement only up to the minimum required behavior.

How to represent the agent state?


We can represent the agent state using the Markov state, which contains all the required
information from the history. The state St is a Markov state if it satisfies the following
condition:
P[St+1 | St] = P[St+1 | S1, ..., St]
The Markov state follows the Markov property, which says that the future is
independent of the past given the present. Here we assume a fully
observable environment, in which the agent can observe the current state and act to reach a
new state. The complete process is known as the Markov Decision Process, which is
explained below:

Markov Decision Process


The Markov Decision Process, or MDP, is used to formalize reinforcement learning
problems. If the environment is completely observable, then its dynamics can be modeled
as a Markov process. In an MDP, the agent constantly interacts with the environment and
performs actions; for each action, the environment responds and generates a new state.
MDP is used to describe the environment for RL, and almost all RL problems can
be formalized using an MDP.
An MDP is a tuple of four elements (S, A, Pa, Ra):
 A finite set of states S
 A finite set of actions A
 A reward function Ra: the reward received after transitioning from state s to state s′ due to action a
 A transition probability Pa: the probability that action a in state s leads to state s′
MDP uses Markov property, and to better understand the MDP, we need to learn about
it.
Markov Property:
It says that if the agent is in the current state s1, performs an action a1, and
moves to the state s2, then the state transition from s1 to s2 depends only on the current
state and action; it does not depend on past actions, rewards, or states.

Or, in other words, as per Markov Property, the current state transition does not depend
on any past action or state. Hence, MDP is an RL problem that satisfies the Markov
property. Such as in a Chess game, the players only focus on the current state and do
not need to remember past actions or states.
Finite MDP:
A finite MDP is when there are finite states, finite rewards, and finite actions. In RL, we
consider only the finite MDP.
Markov Process: A Markov process is a memoryless process with a sequence of
random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also
known as a Markov chain, which is a tuple (S, P) of a state space S and a transition function P.
These two components (S and P) can define the dynamics of the system.
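As a concrete illustration, here is a minimal sketch of a two-state Markov chain (S, P) in Python; the state names and probabilities are hypothetical. The next state is sampled only from the current state's transition row, which is exactly the Markov property.

```python
import random

# Transition function P: for each state, a probability distribution over next states.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state):
    next_states, probs = zip(*P[state].items())
    return random.choices(next_states, weights=probs, k=1)[0]

state = "sunny"
trajectory = [state]
for _ in range(10):
    state = step(state)        # depends only on the current state
    trajectory.append(state)
print(trajectory)
```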

Difference between Reinforcement Learning and Supervised Learning


Reinforcement learning and supervised learning are both part of machine
learning, but the two types of learning are quite different from each other. RL agents
interact with the environment, explore it, take actions, and get rewarded, whereas
supervised learning algorithms learn from a labeled dataset and, on the basis of that
training, predict the output.

Difference between RL and Supervised learning


The main differences between reinforcement learning and supervised learning are:
 RL works by interacting with the environment, whereas supervised learning works on an existing dataset.
 The RL algorithm works the way the human brain works when making decisions, whereas supervised learning works the way a human learns things under the supervision of a guide.
 In RL, no labeled dataset is present; in supervised learning, a labeled dataset is present.
 In RL, no previous training is provided to the learning agent; in supervised learning, training is provided to the algorithm so that it can predict the output.
 RL takes decisions sequentially, whereas in supervised learning decisions are made when the input is given.

Reinforcement Learning Applications

1. Robotics: RL is used in Robot navigation, Robo-soccer, walking, juggling, etc.


2. Control: RL can be used for adaptive control, such as factory processes,
admission control in telecommunications, and helicopter piloting.
3. Game Playing: RL can be used in Game playing such as tic-tac-toe, chess, etc.
4. Chemistry: RL can be used for optimizing the chemical reactions.
5. Business: RL is now used for business strategy planning.
6. Manufacturing: In various automobile manufacturing companies, the robots use
deep reinforcement learning to pick goods and put them in some containers.
7. Finance Sector: The RL is currently used in the finance sector for evaluating
trading strategies.
Advantages and Disadvantages of Reinforcement Learning
Advantages of Reinforcement learning
1. Reinforcement learning can be used to solve very complex problems that cannot
be solved by conventional techniques.
2. The model can correct the errors that occurred during the training process.
3. In RL, training data is obtained via the direct interaction of the agent with the
environment
4. Reinforcement learning can handle environments that are non-deterministic,
meaning that the outcomes of actions are not always predictable. This is useful in
real-world applications where the environment may change over time or is
uncertain.
5. Reinforcement learning can be used to solve a wide range of problems, including
those that involve decision making, control, and optimization.
6. Reinforcement learning is a flexible approach that can be combined with other
machine learning techniques, such as deep learning, to improve performance.
Disadvantages of Reinforcement learning
1. Reinforcement learning is not preferable for solving simple problems.
2. Reinforcement learning needs a lot of data and a lot of computation.
3. Reinforcement learning is highly dependent on the quality of the reward function.
If the reward function is poorly designed, the agent may not learn the desired
behavior.
4. Reinforcement learning can be difficult to debug and interpret. It is not always
clear why the agent is behaving in a certain way, which can make it difficult to
diagnose and fix problems.

Q-Learning
What is Q-Learning?
Q-learning is a popular model-free reinforcement learning algorithm used in machine
learning and artificial intelligence applications. Q-learning is based on the Bellman
equation. It falls under the category of temporal difference learning techniques, in
which an agent picks up new information by observing results, interacting with the
environment, and getting feedback in the form of rewards. Q-learning is a model-free,
value-based, off-policy algorithm that will find the best series of actions based on the
agent's current state. The “Q” stands for quality. Quality represents how valuable the
action is in maximizing future rewards. The value of Q-learning can be derived from
the Bellman equation.

Key Components of Q-learning


1. Q-Values or Action-Values: Q-values are defined for states and actions. Q(S, A)
is an estimate of how good it is to take the action A at the state S. This
estimate of Q(S, A) is computed iteratively using the TD-update rule,
which we will see below.
2. Rewards and Episodes: At every transition step, the agent takes an action from a state,
observes a reward from the environment, and then transitions to another
state. If at any point the agent ends up in one of the terminating states,
no further transitions are possible. This is said to be the completion
of an episode.
3. Temporal Difference or TD-Update: The Temporal Difference or TD-update
rule can be represented as follows:
Q(S, A) ← Q(S, A) + α [R + γ Q(S′, A′) − Q(S, A)]
This update rule to estimate the value of Q is applied at every time step of the
agent’s interaction with the environment. The terms used are explained below:
 S – Current State of the agent.
 A – Current Action Picked according to some policy.
 S’ – Next State where the agent ends up.
 A’ – Next best action to be picked using current Q-value estimation, i.e. pick
the action with the maximum Q-value in the next state.
 R – Current reward observed from the environment in response to the current
action.
 γ (>0 and <=1) : Discounting Factor for Future Rewards. Future rewards are
less valuable than current rewards so they must be discounted. Since Q-value
is an estimation of expected rewards from a state, discounting rule applies
here as well.
 α: Step length taken to update the estimation of Q(S, A).
4. Selecting the Course of Action with ϵ-greedy policy: A simple method for
selecting an action based on the current Q-value estimates is the ϵ-greedy
policy: with probability ϵ the agent explores a random action, and with probability
1 − ϵ it exploits the action with the highest estimated Q-value (a short sketch of the
update rule and ϵ-greedy selection follows this list).
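As a hedged illustration of the two pieces above, here is a minimal Python sketch of the TD update and ϵ-greedy selection on a hypothetical toy problem; the state labels, action set, and hyperparameter values are assumptions made for the example, not part of the original material.

```python
import random

actions = ["up", "down", "left", "right"]
Q = {}                                   # (state, action) -> value, default 0.0
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # step size, discount factor, exploration rate

def q(state, action):
    return Q.get((state, action), 0.0)

def epsilon_greedy(state):
    """With probability epsilon explore a random action, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q(state, a))

def td_update(s, a, r, s_next):
    """Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a' Q(S',a') - Q(S,A)]"""
    best_next = max(q(s_next, a_next) for a_next in actions)
    Q[(s, a)] = q(s, a) + alpha * (r + gamma * best_next - q(s, a))

# One interaction step (labels are placeholders):
a = epsilon_greedy("s1")
td_update("s1", a, 0.0, "s2")
```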

How does Q-Learning Work?


Q-learning models engage in an iterative process where various components collaborate
to train the model. This iterative procedure encompasses the agent exploring the
environment and continuously updating the model based on this exploration. The key
components of Q-learning include:
1. Agents: Entities that operate within an environment, making decisions and taking
actions.
2. States: Variables that identify an agent’s current position in the environment.
3. Actions: Operations undertaken by the agent in specific states.
4. Rewards: Positive or negative responses provided to the agent based on its
actions.
5. Episodes: Instances where an agent concludes its actions, marking the end of an
episode.
6. Q-values: Metrics used to evaluate actions at specific states.
There are two methods for determining Q-values:
1. Temporal Difference: Calculated by comparing the current state and action
values with the previous ones.
2. Bellman’s Equation: A recursive formula invented by Richard Bellman in 1957
for calculating the value of a given state in a Markov Decision Process (MDP) and
determining its optimal value. It is particularly influential in the context of Q-learning
and optimal decision-making.
Q-Table
The agent will use a Q-table to take the best possible action based on the expected reward
for each state in the environment. In simple words, a Q-table is a data structure of sets of
actions and states, and we use the Q-learning algorithm to update the values in the table.
Q-Function
The Q-function uses the Bellman equation and takes the state (s) and action (a) as input. The
function simplifies the computation of state values and state-action values.
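To make the Q-table idea concrete, here is a minimal sketch using NumPy; the environment size (9 states, 4 actions) is a hypothetical choice for illustration.

```python
import numpy as np

n_states, n_actions = 9, 4                  # rows = states, columns = actions
q_table = np.zeros((n_states, n_actions))   # usually initialized to zeros

# The Q-function is just a lookup into the table:
def q_function(state, action):
    return q_table[state, action]

# After training, the best action for a state is the column with the largest Q-value:
state = 3
best_action = int(np.argmax(q_table[state]))
```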

Q-learning Advantages and Disadvantages


Advantages:
 Long-term outcomes, which are exceedingly challenging to accomplish, are best
achieved with this strategy.
 This learning paradigm closely resembles how people learn. Consequently, it is
almost ideal.
 The model has the ability to fix mistakes made during training.
 Once a model has fixed a mistake, there is very little chance that it will
happen again.
 It can produce the ideal model to address a certain issue.
Disadvantages:
 It has the drawback of needing real samples. Consider robot learning, for
instance: the hardware for robots is typically quite expensive, subject to
deterioration, and in need of meticulous upkeep, and the expense of repairing a robot
system is high.
 Instead of abandoning reinforcement learning altogether, we can combine it with
other techniques to alleviate many of its difficulties. Deep learning and
reinforcement learning are one common combo.
Q-learning Applications
Applications for Q-learning, a reinforcement learning algorithm, can be found in many
different fields. Here are a few noteworthy instances:
1. Playing Games:
 Atari Games: Classic Atari 2600 games can now be played with Q-learning. In
games like Space Invaders and Breakout, Deep Q Networks (DQN), an extension
of Q-learning that makes use of deep neural networks, has demonstrated
superhuman performance.
2. Automation:
 Robot Control: Q-learning is used in robotics to perform tasks like navigation
and robot control. With Q-learning algorithms, robots can learn to navigate
through environments, avoid obstacles, and maximise their movements.
3. Driverless Automobiles:
 Traffic Management: Autonomous vehicle traffic management systems use Q-
learning. It lessens congestion and enhances traffic flow overall by optimising
route planning and traffic signal timings.
4. Finance:
 Algorithmic Trading: The use of Q-learning to make trading decisions has been
investigated in algorithmic trading. It makes it possible for automated agents to
pick up the best strategies from past market data and adjust to shifting market
conditions.
5. Health Care:
 Personalized Treatment Plans: To make treatment plans more unique, Q-
learning is used in the medical field. Through the use of patient data, agents are
able to recommend personalized interventions that account for individual
responses to various treatments.
6. Energy Management:
 Smart Grids: Energy management systems for smart grids employ Q-learning. It
aids in maximizing energy use, achieving supply and demand equilibrium, and
enhancing the effectiveness of energy distribution.
7. Education:
 Adaptive Learning Systems: Adaptive learning systems make use of Q-learning.
These systems adjust the educational material and level of difficulty according to
each student’s performance and learning style using Q-learning algorithms.
8. Recommendation Systems:
 Content Recommendation: To customise content recommendations,
recommendation systems use Q-learning. To increase user satisfaction, agents
pick up on user preferences and modify recommendations accordingly.
9. Resource Management:
 Network Resource Allocation: Allocating bandwidth in communication
networks is one example of how network resource management uses Q-learning.
It aids in resource allocation optimisation for improved network performance.
10. Space Travel:
 Satellite Control: Autonomous satellite control is possible with Q-learning.
Agents are trained in the best movements and activities for satellite operations in
orbit.

What is meant by passive and active reinforcement learning, and how do we compare the two?
Both active and passive reinforcement learning are types of reinforcement learning. In
passive RL, the agent’s policy is fixed, which means that it is told what to do. In
contrast, in active RL the agent needs to decide what to do, as there is no fixed
policy that it can act on. Therefore, the goal of a passive RL agent is to execute a fixed
policy (sequence of actions) and evaluate it, while that of an active RL agent is to act and
learn an optimal policy.
What are some common active and passive RL techniques?
Passive Learning
As the goal of the agent is to evaluate how good the fixed policy is, the agent needs to
learn the expected utility Uπ(s) for each state s. This can be done in three ways.
1. Direct Utility Estimation
2. Adaptive Dynamic Programming (ADP)
3. Temporal Difference Learning (TD)

1. Direct Utility Estimation: In this method, the agent executes a sequence of trials or
runs (sequences of state-action transitions that continue until the agent reaches the
terminal state). Each trial gives a sample value, and the agent estimates the utility based
on the sampled values, which can be computed as running averages.
The main drawback is that this method wrongly assumes that state
utilities are independent, while in reality they are Markovian. It is also slow to converge.
Suppose we have a 4x3 grid as the environment, in which the agent can move Left,
Right, Up, or Down (the set of available actions). In an example run that starts at (1,1)
and ends in the +1 terminal state, the total reward observed starting at (1,1) is 0.72
(a short sketch of the running-average estimate follows).
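Below is a minimal Python sketch of direct utility estimation as running averages of sampled returns. The trial shown is hypothetical: it assumes the usual 4x3-grid setup with a -0.04 per-step reward and a +1 terminal state, which makes the return observed at (1,1) equal to 0.72.

```python
from collections import defaultdict

gamma = 1.0   # undiscounted, as in the 4x3 grid example

def returns_from_trial(trial):
    """trial is a list of (state, reward) pairs in visit order; yields (state, return-to-go)."""
    g = 0.0
    for state, reward in reversed(trial):
        g = reward + gamma * g
        yield state, g

totals, counts = defaultdict(float), defaultdict(int)

def update_estimates(trial):
    for state, g in returns_from_trial(trial):
        totals[state] += g
        counts[state] += 1

# One hypothetical trial ending in the +1 terminal state:
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
         ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
update_estimates(trial)
U = {s: totals[s] / counts[s] for s in totals}
print(round(U[(1, 1)], 2))   # 0.72 -- the sample return observed from (1,1)
```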


2. Adaptive Dynamic Programming (ADP): ADP is a smarter method than direct
utility estimation, as it runs trials to learn a model of the environment and estimates the
utility of a state as the sum of the reward for being in that state and the expected
discounted utility of the next state:

Uπ(s) = R(s) + γ Σs′ P(s′ | s, π(s)) Uπ(s′)

Where R(s) = reward for being in state s, P(s′ | s, π(s)) = transition model, γ = discount
factor, and Uπ(s′) = utility of being in the next state s′.
It can be solved using the value-iteration algorithm. The algorithm converges fast but can
become quite costly to compute for large state spaces. ADP is a model-based approach
and requires the transition model of the environment; a short sketch of the update is given
below. A model-free approach is Temporal Difference Learning.
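Here is a minimal sketch of the ADP utility update under a learned model, using simple iterative sweeps. The states, rewards, policy, and transition probabilities are hypothetical values chosen for illustration; in a real ADP agent the model P would be estimated from observed transitions.

```python
gamma = 0.9
states = ["s1", "s2", "s3"]
R = {"s1": 0.0, "s2": 0.0, "s3": 1.0}
policy = {"s1": "right", "s2": "right", "s3": "stay"}   # fixed policy pi(s)
# P[(s, a)] maps next states to probabilities (estimated from experience in ADP).
P = {
    ("s1", "right"): {"s2": 0.9, "s1": 0.1},
    ("s2", "right"): {"s3": 0.9, "s2": 0.1},
    ("s3", "stay"):  {"s3": 1.0},
}

U = {s: 0.0 for s in states}
for _ in range(100):   # repeated sweeps of U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s')
    U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in P[(s, policy[s])].items())
         for s in states}
print({s: round(u, 2) for s, u in U.items()})
```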
3. Temporal Difference Learning (TD): TD learning does not require the agent to learn
the transition model. The update occurs between successive states, and the agent only
updates states that are directly affected:

Uπ(s) ← Uπ(s) + α [R(s) + γ Uπ(s′) − Uπ(s)]

Where α = the learning rate, which determines the convergence to the true utilities.


While ADP adjusts the utility of s with all its successor states, TD learning adjusts it
with that of a single successor state s’. TD is slower in convergence but much simpler in
terms of computation.
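A minimal sketch of this TD utility update applied to a single observed transition s → s′ with reward r is shown below; the state names, current estimates, and hyperparameters are hypothetical.

```python
alpha, gamma = 0.1, 0.9
U = {"s1": 0.0, "s2": 0.5}     # current utility estimates (placeholders)

def td_utility_update(s, r, s_next):
    # U(s) <- U(s) + alpha * [R(s) + gamma * U(s') - U(s)]
    U[s] = U[s] + alpha * (r + gamma * U[s_next] - U[s])

td_utility_update("s1", 0.0, "s2")
print(U["s1"])                 # 0.045: nudged toward R(s) + gamma * U(s2)
```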

Active Learning: This can be done in two ways.


1. ADP with exploration function
2. Q-Learning
1. ADP with exploration function
As the goal of an active agent is to learn an optimal policy, the agent needs to learn the
expected utility of each state and update its policy. This can be done using a passive ADP
agent and then, using value iteration or policy iteration, learning the optimal actions. But this
approach results in a greedy agent. Hence, we use an approach that gives higher
weights to unexplored actions and lower weights to actions with lower utilities:

U+(s) ← R(s) + γ maxa f( Σs′ P(s′ | s, a) U+(s′), N(s, a) )
Where f(u, n) is the exploration function, which increases with the expected value u and
decreases with the number of tries n, and N(s, a) is the number of times action a has been
tried in state s. A minimal sketch of one common choice of f follows.
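The sketch below shows one common (optimistic) exploration function of this form; the constants R_plus and N_e are assumptions for illustration.

```python
R_plus = 2.0   # optimistic estimate of the best possible reward (assumed value)
N_e = 5        # try each state-action pair at least this many times

def f(u, n):
    """Exploration function: optimistic until an action has been tried N_e times."""
    return R_plus if n < N_e else u
```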

2. Q-Learning
Q-learning is a TD learning method which does not require the agent to learn the
transition model; instead, it learns the Q-value function Q(s, a). Each term inside the Bellman
equation corresponds to a Q-value, so that
V(s) = max [R(s,a) + γV(s′)] = max Q(s, a), with the maximum taken over the actions a.
