AI Unit3 Part 1

UNIT -3

3. Reinforcement Learning:
3.1 Introduction
3.2 Passive Reinforcement Learning
3.3 Active Reinforcement Learning
3.4 Generalization in Reinforcement Learning
3.5 Policy Search
3.6 Applications of Reinforcement Learning

Reinforcement learning

3.1. Introduction:

Reinforcement learning might be considered to encompass all of AI: an agent is placed in an environment and must learn to behave successfully therein. Because the agent is typically not told the transition model or the reward function in advance, it faces an unknown Markov decision process. We will consider three of the agent designs first introduced in Chapter 2:
• A utility-based agent learns a utility function on states and uses it to select actions that
maximize the expected outcome utility.
• A Q-learning agent learns an action-utility function, or Q-function, giving the expected utility of
taking a given action in a given state.
• A reflex agent learns a policy that maps directly from states to actions.

We begin with passive learning, where the agent’s policy is fixed and the task is to learn the utilities of
states (or state–action pairs); this could also involve learning a model of the environment.
We then turn to active learning, where the agent must also learn what to do. The principal issue is exploration: an agent
must experience as much as possible of its environment in order to learn how to behave in it.

3.2 Passive Reinforcement Learning:


In passive learning, the agent’s policy π is fixed: in state s, it always executes the action π(s). Its goal is
simply to learn how good the policy is, that is, to learn the utility function U^π(s). We will use as our
example the 4×3 world.
Clearly, the passive learning task is similar to the policy evaluation task, part of the policy iteration
algorithm described in Section 17.3. The main difference is that the passive learning agent does not know
the transition model P(s′ | s, a), which specifies the probability of reaching state s′ from state s after
doing action a; nor does it know the reward function R(s), which specifies the reward for each state.
Note that each state percept is subscripted with the reward received. The object is to use the
information about rewards to learn the expected utility U^π(s) associated with each nonterminal
state s. The utility is defined to be the expected sum of (discounted) rewards obtained if policy π is
followed. As in Equation (17.2) on page 650, we write

U^π(s) = E[ Σ_{t=0}^∞ γ^t R(S_t) ],

where R(s) is the reward for a state, S_t (a random variable) is the state reached at time t when
executing policy π, and S_0 = s. We will include a discount factor γ in all of our equations,
but for the 4×3 world we will set γ = 1.

3.2.1 Direct utility estimation

The idea of direct utility estimation is that the utility of a state is the expected total reward from that
state onward (the expected reward-to-go), and each trial provides a sample of this quantity for each state
visited; the agent keeps a running average of these samples for each state. Direct utility estimation thus
succeeds in reducing the reinforcement learning problem to an inductive learning problem, about which
much is known. Unfortunately, it misses a very important source of information, namely, the fact that the
utilities of states are not independent! The utility of each state equals its own reward plus the expected
utility of its successor states. That is, the utility values obey the Bellman equations for a fixed policy:

U^π(s) = R(s) + γ Σ_{s′} P(s′ | s, π(s)) U^π(s′)

By ignoring the connections between states, direct utility estimation misses opportunities for
learning.
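
As an illustration, here is a minimal sketch of direct utility estimation in Python. The trial format, the −0.04 step reward, and all helper names are assumptions made for this sketch, not part of the text.

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U^pi(s) as the average observed reward-to-go of s across trials.

    Each trial is a list of (state, reward) pairs ending at a terminal state.
    """
    totals = defaultdict(float)   # sum of observed returns per state
    counts = defaultdict(int)     # number of samples per state
    for trial in trials:
        # Compute the reward-to-go for every position, working backwards.
        return_to_go = 0.0
        rewards_to_go = []
        for _, reward in reversed(trial):
            return_to_go = reward + gamma * return_to_go
            rewards_to_go.append(return_to_go)
        rewards_to_go.reverse()
        for (state, _), g in zip(trial, rewards_to_go):
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# One hypothetical 4x3-world trial: -0.04 per step, +1 at the terminal state (4,3).
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), +1.0)]
print(direct_utility_estimation([trial]))
```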

3.2.2 Adaptive dynamic programming

An adaptive dynamic programming (or ADP) agent takes advantage of the constraints among the
utilities of states by learning the transition model that connects them and solving the corresponding
Markov decision process using a dynamic programming method.
The process of learning the model itself is easy, because the environment is fully observable.
This means that we have a supervised learning task where the input is a state–action
pair and the output is the resulting state. In the simplest case, we can represent the transition
model as a table of probabilities. We keep track of how often each action outcome
occurs and estimate the transition probability P(s′ | s, a) from the frequency with which s′ is reached when executing a in s.
For example, in the three trials given on page 832, Right is executed three times in (1,3) and two out of
three times the resulting state is (2,3), so P((2, 3) | (1, 3), Right) is estimated to be 2/3.
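
A minimal sketch of this table-based transition-model estimate is given below. The count tables and function names are assumptions for illustration; the third outcome of Right in (1,3) is also assumed, since the text does not specify it.

```python
from collections import defaultdict

# N_sa[(s, a)] counts how often action a was tried in state s;
# N_s_sa[(s, a)][s2] counts how often that led to outcome state s2.
N_sa = defaultdict(int)
N_s_sa = defaultdict(lambda: defaultdict(int))

def record_transition(s, a, s2):
    """Update the outcome counts after observing s --a--> s2."""
    N_sa[(s, a)] += 1
    N_s_sa[(s, a)][s2] += 1

def estimated_P(s2, s, a):
    """Estimate P(s2 | s, a) as a relative frequency."""
    if N_sa[(s, a)] == 0:
        return 0.0
    return N_s_sa[(s, a)][s2] / N_sa[(s, a)]

# The example from the text: Right executed three times in (1,3),
# reaching (2,3) twice, so P((2,3) | (1,3), Right) is estimated as 2/3.
record_transition((1, 3), "Right", (2, 3))
record_transition((1, 3), "Right", (2, 3))
record_transition((1, 3), "Right", (1, 3))   # assumed third outcome
print(estimated_P((2, 3), (1, 3), "Right"))  # 0.666...
```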
3.2.3 Temporal-difference learning
Another way is to use the observed
transitions to adjust the utilities of the observed states so that they agree with the constraint
equations. Consider, for example, the transition from (1,3) to (2,3) in the second trial on
page 832. Suppose that, as a result of the first trial, the utility estimates are U_(1, 3)=0.84
and U_(2, 3)=0.92. Now, if this transition occurred all the time, we would expect the utilities
to obey the equation

so U_(1, 3) would be 0.88. Thus, its current estimate of 0.84 might be a little low and should
be increased. More generally, when a transition occurs from state s to state s
′, we apply the following update to U_(s):

Here, _ is the learning rate parameter. Because this update rule uses the difference in utilities
between successive states, it is often called the temporal-difference, or TD, equation.
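
A minimal sketch of this TD update in Python follows; the dictionary representation of U and the learning-rate value are assumptions for illustration.

```python
def td_update(U, s, reward, s_next, alpha=0.1, gamma=1.0):
    """Apply U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))."""
    U[s] = U[s] + alpha * (reward + gamma * U[s_next] - U[s])
    return U

# The example from the text: after the first trial, U(1,3) = 0.84 and U(2,3) = 0.92.
U = {(1, 3): 0.84, (2, 3): 0.92}
td_update(U, (1, 3), -0.04, (2, 3))   # nudges U(1,3) towards -0.04 + 0.92 = 0.88
print(U[(1, 3)])
```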

3.3 Active Reinforcement Learning:


An active agent must decide what actions to take. Let us begin with the adaptive dynamic programming
agent and consider how it must be modified to handle this new freedom.
3.3.1 Exploration
Consider an ADP agent that, at each step, follows the recommendation of the optimal policy for its current learned model. The
agent does not learn the true utilities or the true optimal policy! What happens instead is that, in the 39th
trial, it finds a policy that reaches the +1 reward along the lower route via (2,1), (3,1), (3,2), and (3,3).
(See Figure 21.6(b).) After experimenting with minor variations, from the 276th trial onward it sticks to
that policy, never learning the utilities of the other states and never finding the optimal route via (1,2),
(1,3), and (2,3). We call this agent the greedy agent.
By improving the model, the agent will receive greater rewards in the future. An agent therefore must
make a trade-off between exploitation to maximize its reward, as reflected in its current utility estimates,
and exploration to maximize its long-term well-being.
Technically, any such scheme needs to be greedy in the limit of infinite exploration, or GLIE. A GLIE
scheme must try each action in each state an unbounded number of times to avoid having a finite
probability that an optimal action is missed because of an unusually bad series of outcomes.
Suppose we are using value iteration in an ADP learning agent; then we need to rewrite the update
equation (Equation (17.6) on page 652) to incorporate the optimistic estimate U⁺(s) of the utility. The following equation
does this:

U⁺(s) ← R(s) + γ max_a f( Σ_{s′} P(s′ | s, a) U⁺(s′), N(s, a) )

Here, N(s, a) is the number of times action a has been tried in state s, and f(u, n) is the exploration function, which
trades off greed (preference for high values of u) against curiosity (preference for actions that have been tried few times).
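
Below is a minimal sketch of one simple exploration function and the corresponding optimistic update. The optimistic reward R_PLUS, the trial threshold N_E, and the dictionary layout of the model are all assumed values chosen only for illustration.

```python
R_PLUS = 2.0   # optimistic estimate of the best possible reward (assumed)
N_E = 5        # try each state-action pair at least this many times (assumed)

def exploration_f(u, n):
    """Be optimistic while (s, a) is under-explored, then trust the estimate u."""
    return R_PLUS if n < N_E else u

def optimistic_update(s, actions, R, P, U_plus, N, gamma=1.0):
    """One optimistic value-iteration step:
    U+(s) <- R(s) + gamma * max_a f( sum_s' P(s'|s,a) U+(s'), N(s,a) ).
    P[(s, a)] is assumed to be a dict mapping outcome states to probabilities."""
    return R[s] + gamma * max(
        exploration_f(sum(p * U_plus[s2] for s2, p in P[(s, a)].items()),
                      N[(s, a)])
        for a in actions)
```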

3.3.2 Learning an action-utility function:


There is an alternative TD method, called Q-learning, which learns an action-utility
representation instead of learning utilities. We will use the notation Q(s, a) to denote the
value of doing action a in state s. Q-values are directly related to utility values as follows:

U(s) = max_a Q(s, a)
Q-functions may seem like just another way of storing utility information, but they have a
very important property: a TD agent that learns a Q-function does not need a model of the
form P(s′ | s, a), either for learning or for action selection. For this reason, Q-learning is
called a model-free method. As with utilities, we can write a constraint equation that must
hold at equilibrium when the Q-values are correct:

Q(s, a) = R(s) + γ Σ_{s′} P(s′ | s, a) max_{a′} Q(s′, a′)
3.4 Generalization in Reinforcement Learning:

So far, we have assumed that the utility functions and Q-functions learned by the agents are
represented in tabular form with one output value for each input tuple. Such an approach
works reasonably well for small state spaces, but the time to convergence and (for ADP) the
time per iteration increase rapidly as the space gets larger. With carefully controlled, approximate ADP
methods, it might be possible to handle 10,000 states or more.
One way to handle such problems is to use function approximation, which simply means using any sort of compact
representation for the utility or Q-function, rather than a table with one entry per input tuple. For example, we described an evaluation function for chess that is
represented as a weighted linear function of a set of features (or basis functions) f1, . . . , fn:

Û_θ(s) = θ_1 f_1(s) + · · · + θ_n f_n(s)
A reinforcement learning algorithm can learn values for the parameters θ = θ_1, . . . , θ_n such
that the evaluation function Û_θ approximates the true utility function.
For the 4×3 world, the features of the squares are just their x and y coordinates, so we have

Û_θ(x, y) = θ_0 + θ_1 x + θ_2 y

Given a trial, we obtain an observed total reward u_j(s) for each state visited, and we can adjust the parameters
to reduce the error between the prediction Û_θ(s) and this observed value:

θ_i ← θ_i + α ( u_j(s) − Û_θ(s) ) ∂Û_θ(s)/∂θ_i

This is called the Widrow–Hoff rule, or the delta rule, for online least-squares. For the linear approximator above,
the partial derivatives are 1, x, and y, giving three simple update rules for θ_0, θ_1, and θ_2.

We can apply these rules to the example where Û_θ(1, 1) is 0.8 and u_j(1, 1) is 0.4. θ_0, θ_1,
and θ_2 are all decreased by 0.4α, which reduces the error for (1,1).
Remember that what matters for linear function approximation
is that the function be linear in the parameters; the features themselves can be arbitrary
nonlinear functions of the state variables. Hence, we can include a term such as
θ_3 √((x − x_g)² + (y − y_g)²) that measures the distance to the goal (x_g, y_g).
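
A minimal sketch of the linear approximator Û_θ(x, y) = θ_0 + θ_1 x + θ_2 y together with the Widrow–Hoff update follows. The initial parameter values and the learning rate are assumptions chosen so that the example from the text (Û_θ(1,1) = 0.8, u_j(1,1) = 0.4) can be reproduced.

```python
def U_hat(theta, x, y):
    """Linear-in-parameters utility estimate with features (1, x, y)."""
    return theta[0] + theta[1] * x + theta[2] * y

def widrow_hoff_update(theta, x, y, u_observed, alpha=0.1):
    """theta_i <- theta_i + alpha * (u_j(s) - U_hat(s)) * dU_hat/dtheta_i,
    where the partial derivatives for features (1, x, y) are 1, x and y."""
    error = u_observed - U_hat(theta, x, y)
    theta[0] += alpha * error * 1
    theta[1] += alpha * error * x
    theta[2] += alpha * error * y
    return theta

# Example from the text: U_hat(1,1) = 0.8 but the observed return is 0.4,
# so theta0, theta1, theta2 are each decreased by 0.4 * alpha.
theta = [0.5, 0.2, 0.1]    # assumed initial parameters giving U_hat(1,1) = 0.8
print(widrow_hoff_update(theta, 1, 1, 0.4))
```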

3.5 Policy Search:


The final approach we will consider for reinforcement learning problems is called policy
search.
Let us begin with the policies themselves. Remember that a policy π is a function that
maps states to actions. We are interested primarily in parameterized representations of π that
have far fewer parameters than there are states in the state space (just as in the preceding
section). For example, we could represent π by a collection of parameterized Q-functions,
one for each action, and take the action with the highest predicted value:

π(s) = argmax_a Q̂_θ(s, a)

Each Q-function could be a linear function of the parameters θ, as in Equation (21.10),
or it could be a nonlinear function such as a neural network. Policy search will then adjust
the parameters θ to improve the policy. Notice that if the policy is represented by Q-functions,
then policy search results in a process that learns Q-functions.
One problem with policy representations of the kind given in Equation (21.14) is that
the policy is a discontinuous function of the parameters when the actions are discrete.
For this reason, policy search methods often use a stochastic policy representation π_θ(s, a), which
specifies the probability of selecting action a in state s. One popular representation is the softmax
function:

π_θ(s, a) = e^{Q̂_θ(s, a)} / Σ_{a′} e^{Q̂_θ(s, a′)}