AI Unit3 Part 1
3. Reinforcement Learning:
3.1 Introduction
3.2 Passive Reinforcement Learning
3.3 Active Reinforcement Learning
3.4 Generalization in Reinforcement Learning
3.5 Policy Search
3.6 Applications of Reinforcement Learning
Reinforcement learning
3.1 Introduction:
In reinforcement learning, the agent does not know the transition model or the reward function in advance; thus, the agent faces an unknown Markov decision process. We will consider three of the
agent designs first introduced in Chapter 2:
• A utility-based agent learns a utility function on states and uses it to select actions that
maximize the expected outcome utility.
• A Q-learning agent learns an action-utility function, or Q-function, giving the expected utility of
taking a given action in a given state.
• A reflex agent learns a policy that maps directly from states to actions.
We begin with passive learning, where the agent’s policy is fixed and the task is to learn the utilities of
states (or state–action pairs); this could also involve learning a model of the environment.
We then turn to active learning, where the agent must also learn what to do. The principal issue is exploration: an agent
must experience as much as possible of its environment in order to learn how to behave in it.
3.2 Passive Reinforcement Learning:
In passive learning, the agent executes a fixed policy π and learns the expected utility of each state, defined as the expected sum of discounted rewards obtained by following that policy:
U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(S_t) ],
where R(s) is the reward for a state, S_t (a random variable) is the state reached at time t when
executing policy π, and S_0 = s. We will include a discount factor γ in all of our equations,
but for the 4×3 world we will set γ = 1.
3.2.1 Direct utility estimation
Direct utility estimation treats the observed reward-to-go of a state in a trial (the total reward from that state onward) as a training example for that state's utility, and averages these samples over many trials. By ignoring the connections between states, however, direct utility estimation misses opportunities for
learning.
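To make the idea concrete, here is a minimal Python sketch of direct utility estimation, assuming γ = 1 and a step reward of −0.04 as in the 4×3 world; the function name and the trial data are illustrative, not taken from the text.

from collections import defaultdict

def direct_utility_estimate(trials, gamma=1.0):
    """Average the observed reward-to-go of each state over a set of trials."""
    totals = defaultdict(float)   # sum of observed reward-to-go per state
    counts = defaultdict(int)     # number of observations per state
    for trial in trials:
        reward_to_go = 0.0
        # Walk the trial backwards so reward-to-go accumulates correctly.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Hypothetical trial in the 4x3 world: -0.04 per step, +1 at the goal square (4,3).
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
print(direct_utility_estimate([trial]))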
3.2.2 Adaptive dynamic programming
An adaptive dynamic programming (or ADP) agent takes advantage of the constraints among the
utilities of states by learning the transition model that connects them and solving the corresponding
Markov decision process using a dynamic programming method.
The process of learning the model itself is easy, because the environment is fully observable.
This means that we have a supervised learning task where the input is a state–action
pair and the output is the resulting state. In the simplest case, we can represent the transition
model as a table of probabilities. We keep track of how often each action outcome
occurs and estimate the transition probability P(s′ | s, a) from the frequency with which s′ is reached when executing a in s.
For example, in the three trials given on page 832, Right is executed three times in (1,3) and two out of
three times the resulting state is (2,3), so P((2,3) | (1,3), Right) is estimated to be 2/3.
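The frequency estimate can be sketched in a few lines of Python; the helper names and the third outcome below are hypothetical, chosen only to reproduce the 2/3 estimate from the example.

from collections import defaultdict

outcome_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
action_counts = defaultdict(int)                        # (s, a) -> total executions

def record_transition(s, a, s_next):
    """Record one observed outcome of executing action a in state s."""
    outcome_counts[(s, a)][s_next] += 1
    action_counts[(s, a)] += 1

def transition_prob(s_next, s, a):
    """Estimate P(s' | s, a) from observed frequencies."""
    if action_counts[(s, a)] == 0:
        return 0.0
    return outcome_counts[(s, a)][s_next] / action_counts[(s, a)]

record_transition((1, 3), "Right", (2, 3))
record_transition((1, 3), "Right", (2, 3))
record_transition((1, 3), "Right", (1, 2))   # hypothetical third outcome
print(transition_prob((2, 3), (1, 3), "Right"))  # 2/3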
3.2.3 Temporal-difference learning
Another way is to use the observed
transitions to adjust the utilities of the observed states so that they agree with the constraint
equations. Consider, for example, the transition from (1,3) to (2,3) in the second trial on
page 832. Suppose that, as a result of the first trial, the utility estimates are U^π(1,3) = 0.84
and U^π(2,3) = 0.92. Now, if this transition occurred all the time, we would expect the utilities
to obey the equation
U^π(1,3) = −0.04 + U^π(2,3),
so U^π(1,3) would be 0.88. Thus, its current estimate of 0.84 might be a little low and should
be increased. More generally, when a transition occurs from state s to state s′, we apply the following update to U^π(s):
U^π(s) ← U^π(s) + α ( R(s) + γ U^π(s′) − U^π(s) ).
Here, α is the learning rate parameter. Because this update rule uses the difference in utilities
between successive states, it is often called the temporal-difference, or TD, equation.
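As a sketch, the TD update can be written directly from the equation above, using the numbers of the worked example; the learning-rate value α = 0.1 is an assumption for illustration.

def td_update(U, s, s_next, reward, alpha=0.1, gamma=1.0):
    """One temporal-difference update: U(s) <- U(s) + alpha*(R(s) + gamma*U(s') - U(s))."""
    U[s] = U[s] + alpha * (reward + gamma * U[s_next] - U[s])
    return U

U = {(1, 3): 0.84, (2, 3): 0.92}
td_update(U, (1, 3), (2, 3), reward=-0.04)
print(U[(1, 3)])   # moves from 0.84 toward the target value 0.88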
3.3 Active Reinforcement Learning:
Q-functions may seem like just another way of storing utility information, but they have a
very important property: a TD agent that learns a Q-function does not need a model of the
form P(s′ | s, a), either for learning or for action selection. For this reason, Q-learning is
called a model-free method. As with utilities, we can write a constraint equation that must
hold at equilibrium when the Q-values are correct:
Q(s, a) = R(s) + γ Σ_{s′} P(s′ | s, a) max_{a′} Q(s′, a′).
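A model-free Q-learning update, sketched below in Python, needs only the observed transition (s, a, r, s′); the tabular dictionary, action set, and parameter values are assumptions for illustration.

from collections import defaultdict

ACTIONS = ["Up", "Down", "Left", "Right"]
Q = defaultdict(float)   # (state, action) -> estimated Q-value

def q_update(s, a, reward, s_next, alpha=0.1, gamma=1.0):
    """TD update for Q-learning: Q(s,a) <- Q(s,a) + alpha*(R(s) + gamma*max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

# Note that no transition model P(s'|s,a) appears anywhere: the update uses
# only the experienced transition.
q_update((1, 3), "Right", -0.04, (2, 3))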
3.4 Generalization in Reinforcement Learning:
So far, we have assumed that the utility functions and Q-functions learned by the agents are
represented in tabular form with one output value for each input tuple. Such an approach
works reasonably well for small state spaces, but the time to convergence and (for ADP) the
time per iteration increase rapidly as the space gets larger. With carefully controlled, approximate ADP
methods, it might be possible to handle 10,000 states or more.
One way to handle such problems is to use function approximation, which simply means using any sort
of representation for the Q-function other than a lookup table. For example, we described an evaluation function for chess that is
represented as a weighted linear function of a set of features (or basis functions) f1, . . . , fn:
Û_θ(s) = θ1 f1(s) + θ2 f2(s) + · · · + θn fn(s).
A reinforcement learning algorithm can learn values for the parameters θ = θ1, . . . , θn such
that the evaluation function Û_θ approximates the true utility function.
For the 4×3 world, the features of the squares are just their x and y coordinates, so we have
Û_θ(x, y) = θ0 + θ1 x + θ2 y.
For reinforcement learning, it makes sense to update the parameters after each trial, using the rule
θi ← θi + α ( u_j(s) − Û_θ(s) ) ∂Û_θ(s)/∂θi,
where u_j(s) is the observed total reward from state s onward in the j-th trial. This is called the Widrow–Hoff rule, or the delta rule, for online least-squares. For the linear function approximator Û_θ(x, y) above, we get three simple update rules:
θ0 ← θ0 + α ( u_j(s) − Û_θ(s) ),
θ1 ← θ1 + α ( u_j(s) − Û_θ(s) ) x,
θ2 ← θ2 + α ( u_j(s) − Û_θ(s) ) y.
We can apply these rules to the example where Û_θ(1,1) is 0.8 and u_j(1,1) is 0.4: θ0, θ1,
and θ2 are all decreased by 0.4α, which reduces the error for (1,1).
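The worked example can be reproduced with a short sketch of the delta rule; the initial parameter values below are hypothetical, chosen so that Û_θ(1,1) = 0.8, and α = 0.1 is assumed.

theta = [0.5, 0.2, 0.1]          # hypothetical parameters giving U_hat(1,1) = 0.8

def u_hat(theta, x, y):
    """Linear approximator U_hat(x, y) = theta0 + theta1*x + theta2*y."""
    return theta[0] + theta[1] * x + theta[2] * y

def delta_rule_update(theta, x, y, u_observed, alpha=0.1):
    """One Widrow-Hoff (delta rule) update of all three parameters."""
    error = u_observed - u_hat(theta, x, y)        # u_j(s) - U_hat(s)
    features = [1.0, x, y]                         # gradient of U_hat w.r.t. theta
    return [t + alpha * error * f for t, f in zip(theta, features)]

theta = delta_rule_update(theta, 1, 1, u_observed=0.4)
# error = 0.4 - 0.8 = -0.4, so each theta_i decreases by 0.4 * alpha = 0.04 here.
print(theta)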
Remember that what matters for linear function approximation
is that the function be linear in the parameters—the features themselves can be arbitrary
nonlinear functions of the state variables. Hence, we can include a term such as
θ3 f3(x, y) = θ3 √((x − x_g)² + (y − y_g)²)
that measures the distance to the goal.
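A brief sketch of how such a nonlinear feature fits into the same linear-in-parameters form; the goal coordinates (4, 3) correspond to the +1 square of the 4×3 world, and the parameter values are placeholders.

from math import sqrt

def features(x, y, goal=(4, 3)):
    # f0 = 1 (bias), f1 = x, f2 = y, f3 = Euclidean distance to the goal square
    return [1.0, x, y, sqrt((x - goal[0]) ** 2 + (y - goal[1]) ** 2)]

def u_hat(theta, x, y):
    """Still linear in theta, even though f3 is nonlinear in x and y."""
    return sum(t * f for t, f in zip(theta, features(x, y)))

# The same delta-rule update applies: theta_i <- theta_i + alpha * error * f_i.
theta = [0.0, 0.0, 0.0, 0.0]   # hypothetical initial parameters
print(u_hat(theta, 1, 1))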