AI Unit3 Part 1
3. Reinforcement Learning:
3.1 Introduction
3.2 Passive Reinforcement Learning
3.3 Active Reinforcement Learning
3.4 Generalization in Reinforcement Learning
3.5 Policy Search
3.6 Applications of Reinforcement Learning
Reinforcement learning
3.1 Introduction:
In reinforcement learning, the agent does not know the transition model or the reward function in advance; thus, the agent faces an unknown Markov decision process. We will consider three of the
agent designs first introduced in Chapter 2:
• A utility-based agent learns a utility function on states and uses it to select actions that
maximize the expected outcome utility.
• A Q-learning agent learns an action-utility function, or Q-function, giving the expected utility of
taking a given action in a given state.
• A reflex agent learns a policy that maps directly from states to actions.
We begin with passive learning, where the agent’s policy is fixed and the task is to learn the utilities of
states (or state–action pairs); this could also involve learning a model of the environment.
We then turn to active learning, where the agent must also learn what to do. The principal issue is exploration: an agent
must experience as much as possible of its environment in order to learn how to behave in it.
3.2 Passive Reinforcement Learning:
In passive learning, the agent executes a fixed policy π and learns the expected utility of each state, defined as the expected sum of discounted rewards obtained by following that policy:
U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(S_t) ],
where R(s) is the reward for a state, S_t (a random variable) is the state reached at time t when
executing policy π, and S_0 = s. We will include a discount factor γ in all of our equations,
but for the 4×3 world we will set γ = 1.
3.2.1 Direct utility estimation
Direct utility estimation treats the observed reward-to-go of a state in a trial (the total reward from that state onward) as a training example for that state's utility, and averages these samples over many trials. By ignoring the connections between states, however, direct utility estimation misses opportunities for
learning.
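To make the idea concrete, here is a minimal Python sketch of direct utility estimation, assuming γ = 1 and a step reward of −0.04 as in the 4×3 world; the function name and the trial data are illustrative, not taken from the text.

from collections import defaultdict

def direct_utility_estimate(trials, gamma=1.0):
    """Average the observed reward-to-go of each state over a set of trials."""
    totals = defaultdict(float)   # sum of observed reward-to-go per state
    counts = defaultdict(int)     # number of observations per state
    for trial in trials:
        reward_to_go = 0.0
        # Walk the trial backwards so reward-to-go accumulates correctly.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Hypothetical trial in the 4x3 world: -0.04 per step, +1 at the goal square (4,3).
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
print(direct_utility_estimate([trial]))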
3.2.2 Adaptive dynamic programming
An adaptive dynamic programming (or ADP) agent takes advantage of the constraints among the
utilities of states by learning the transition model that connects them and solving the corresponding
Markov decision process using a dynamic programming method.
The process of learning the model itself is easy, because the environment is fully observable.
This means that we have a supervised learning task where the input is a state–action
pair and the output is the resulting state. In the simplest case, we can represent the transition
model as a table of probabilities. We keep track of how often each action outcome
occurs and estimate the transition probability P(s′ | s, a) from the frequency with which s′ is reached when executing a in s.
For example, in the three trials given on page 832, Right is executed three times in (1,3) and two out of
three times the resulting state is (2,3), so P((2,3) | (1,3), Right) is estimated to be 2/3.
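The frequency estimate can be sketched in a few lines of Python; the helper names and the third outcome below are hypothetical, chosen only to reproduce the 2/3 estimate from the example.

from collections import defaultdict

outcome_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
action_counts = defaultdict(int)                        # (s, a) -> total executions

def record_transition(s, a, s_next):
    """Record one observed outcome of executing action a in state s."""
    outcome_counts[(s, a)][s_next] += 1
    action_counts[(s, a)] += 1

def transition_prob(s_next, s, a):
    """Estimate P(s' | s, a) from observed frequencies."""
    if action_counts[(s, a)] == 0:
        return 0.0
    return outcome_counts[(s, a)][s_next] / action_counts[(s, a)]

record_transition((1, 3), "Right", (2, 3))
record_transition((1, 3), "Right", (2, 3))
record_transition((1, 3), "Right", (1, 2))   # hypothetical third outcome
print(transition_prob((2, 3), (1, 3), "Right"))  # 2/3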
3.2.3 Temporal-difference learning
Another way is to use the observed
transitions to adjust the utilities of the observed states so that they agree with the constraint
equations. Consider, for example, the transition from (1,3) to (2,3) in the second trial on
page 832. Suppose that, as a result of the first trial, the utility estimates are U^π(1,3) = 0.84
and U^π(2,3) = 0.92. Now, if this transition occurred all the time, we would expect the utilities
to obey the equation
U^π(1,3) = −0.04 + U^π(2,3),
so U^π(1,3) would be 0.88. Thus, its current estimate of 0.84 might be a little low and should
be increased. More generally, when a transition occurs from state s to state s′, we apply the following update to U^π(s):
U^π(s) ← U^π(s) + α ( R(s) + γ U^π(s′) − U^π(s) ).
Here, α is the learning rate parameter. Because this update rule uses the difference in utilities
between successive states, it is often called the temporal-difference, or TD, equation.
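As a sketch, the TD update can be written directly from the equation above, using the numbers of the worked example; the learning-rate value α = 0.1 is an assumption for illustration.

def td_update(U, s, s_next, reward, alpha=0.1, gamma=1.0):
    """One temporal-difference update: U(s) <- U(s) + alpha*(R(s) + gamma*U(s') - U(s))."""
    U[s] = U[s] + alpha * (reward + gamma * U[s_next] - U[s])
    return U

U = {(1, 3): 0.84, (2, 3): 0.92}
td_update(U, (1, 3), (2, 3), reward=-0.04)
print(U[(1, 3)])   # moves from 0.84 toward the target value 0.88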
3.3 Active Reinforcement Learning:
Q-functions may seem like just another way of storing utility information, but they have a
very important property: a TD agent that learns a Q-function does not need a model of the
form P(s′ | s, a), either for learning or for action selection. For this reason, Q-learning is
called a model-free method. As with utilities, we can write a constraint equation that must
hold at equilibrium when the Q-values are correct:
Q(s, a) = R(s) + γ Σ_{s′} P(s′ | s, a) max_{a′} Q(s′, a′).
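A model-free Q-learning update, sketched below in Python, needs only the observed transition (s, a, r, s′); the tabular dictionary, action set, and parameter values are assumptions for illustration.

from collections import defaultdict

ACTIONS = ["Up", "Down", "Left", "Right"]
Q = defaultdict(float)   # (state, action) -> estimated Q-value

def q_update(s, a, reward, s_next, alpha=0.1, gamma=1.0):
    """TD update for Q-learning: Q(s,a) <- Q(s,a) + alpha*(R(s) + gamma*max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

# Note that no transition model P(s'|s,a) appears anywhere: the update uses
# only the experienced transition.
q_update((1, 3), "Right", -0.04, (2, 3))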
3.4 Generalization in Reinforcement Learning:
So far, we have assumed that the utility functions and Q-functions learned by the agents are
represented in tabular form with one output value for each input tuple. Such an approach
works reasonably well for small state spaces, but the time to convergence and (for ADP) the
time per iteration increase rapidly as the space gets larger. With carefully controlled, approximate ADP
methods, it might be possible to handle 10,000 states or more.
One way to handle such problems is to use function approximation, which simply means using any sort
of representation for the Q-function other than a lookup table. For example, we described an evaluation function for chess that is
represented as a weighted linear function of a set of features (or basis functions) f1, . . . , fn:
Û_θ(s) = θ1 f1(s) + θ2 f2(s) + · · · + θn fn(s).
A reinforcement learning algorithm can learn values for the parameters θ = θ1, . . . , θn such
that the evaluation function Û_θ approximates the true utility function.
For the 4×3 world, the features of the squares are just their x and y coordinates, so we have
Û_θ(x, y) = θ0 + θ1 x + θ2 y.
For reinforcement learning, it makes sense to update the parameters after each trial, using the rule
θi ← θi + α ( u_j(s) − Û_θ(s) ) ∂Û_θ(s)/∂θi,
where u_j(s) is the observed total reward from state s onward in the j-th trial. This is called the Widrow–Hoff rule, or the delta rule, for online least-squares. For the linear function approximator Û_θ(x, y) above, we get three simple update rules:
θ0 ← θ0 + α ( u_j(s) − Û_θ(s) ),
θ1 ← θ1 + α ( u_j(s) − Û_θ(s) ) x,
θ2 ← θ2 + α ( u_j(s) − Û_θ(s) ) y.
We can apply these rules to the example where Û_θ(1,1) is 0.8 and u_j(1,1) is 0.4: θ0, θ1,
and θ2 are all decreased by 0.4α, which reduces the error for (1,1).
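The worked example can be reproduced with a short sketch of the delta rule; the initial parameter values below are hypothetical, chosen so that Û_θ(1,1) = 0.8, and α = 0.1 is assumed.

theta = [0.5, 0.2, 0.1]          # hypothetical parameters giving U_hat(1,1) = 0.8

def u_hat(theta, x, y):
    """Linear approximator U_hat(x, y) = theta0 + theta1*x + theta2*y."""
    return theta[0] + theta[1] * x + theta[2] * y

def delta_rule_update(theta, x, y, u_observed, alpha=0.1):
    """One Widrow-Hoff (delta rule) update of all three parameters."""
    error = u_observed - u_hat(theta, x, y)        # u_j(s) - U_hat(s)
    features = [1.0, x, y]                         # gradient of U_hat w.r.t. theta
    return [t + alpha * error * f for t, f in zip(theta, features)]

theta = delta_rule_update(theta, 1, 1, u_observed=0.4)
# error = 0.4 - 0.8 = -0.4, so each theta_i decreases by 0.4 * alpha = 0.04 here.
print(theta)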
Remember that what matters for linear function approximation
is that the function be linear in the parameters—the features themselves can be arbitrary
nonlinear functions of the state variables. Hence, we can include a term such as
θ3 f3(x, y) = θ3 √((x − x_g)² + (y − y_g)²)
that measures the distance to the goal.
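A brief sketch of how such a nonlinear feature fits into the same linear-in-parameters form; the goal coordinates (4, 3) correspond to the +1 square of the 4×3 world, and the parameter values are placeholders.

from math import sqrt

def features(x, y, goal=(4, 3)):
    # f0 = 1 (bias), f1 = x, f2 = y, f3 = Euclidean distance to the goal square
    return [1.0, x, y, sqrt((x - goal[0]) ** 2 + (y - goal[1]) ** 2)]

def u_hat(theta, x, y):
    """Still linear in theta, even though f3 is nonlinear in x and y."""
    return sum(t * f for t, f in zip(theta, features(x, y)))

# The same delta-rule update applies: theta_i <- theta_i + alpha * error * f_i.
theta = [0.0, 0.0, 0.0, 0.0]   # hypothetical initial parameters
print(u_hat(theta, 1, 1))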