PDF Unit-5 (Full Unit)
Fundamentals
Intelligent Agents
• The stepping-stone crossing challenge described earlier illustrates the intelligent agent
approach that underpins reinforcement learning. We can consider the scout to be an
intelligent agent (or simply agent) attempting to complete a task within an
environment. The goal of the agent is to complete the task as successfully as possible.
Each attempt at the task is referred to as an episode. At any point in time, t, the agent
observes the current state of its environment, o_t; considers these observations to select
an action, a_t; and takes this action, receiving immediate feedback, r_t, from the
environment about whether this was a good or bad action to take. We use r_t to refer to
feedback because in reinforcement learning feedback is more commonly referred to as
reward (where reward can be either positive or negative). This gives a series of
discrete steps that make up an episode:

o_1, a_1, r_1, o_2, a_2, r_2, ..., o_e, a_e, r_e    (11.1)
• where the episode proceeds through time-steps t = 1, ..., e. At each time-step the
agent makes an observation, o_t, of the environment, takes an action, a_t, and receives a
reward, r_t, based on that action. This cycle is illustrated in Figure 11.1; a minimal code
sketch of the same loop is given below.
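To make the cycle in Figure 11.1 concrete, here is a minimal Python sketch of the observe-act-reward loop. The ToyEnvironment and the random action-selection function are illustrative stand-ins, not part of the original example.

```python
import random

# Minimal sketch of the observe-act-reward cycle described above.
# The toy environment and the random action selector are illustrative only.

class ToyEnvironment:
    """An environment whose episodes end after a fixed number of steps."""
    def __init__(self, length=10):
        self.length = length
        self.t = 0

    def observe(self):
        return self.t                      # o_t: here simply the step index

    def step(self, action):
        self.t += 1
        return random.choice([-1.0, 1.0])  # r_t: immediate reward (positive or negative)

    def done(self):
        return self.t >= self.length


def run_episode(env, choose_action):
    history = []                           # H_t: the (o_t, a_t, r_t) triples so far
    while not env.done():
        o_t = env.observe()                # observation at time-step t
        a_t = choose_action(o_t)           # action selected from the observation
        r_t = env.step(a_t)                # reward returned by the environment
        history.append((o_t, a_t, r_t))
    return history


history = run_episode(ToyEnvironment(), lambda o: random.choice(["left", "right"]))
```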
The sequence of observations, actions, and rewards that precede any time-step, t, is
referred to as a history, H_t. The job of the agent in the environment is to make decisions
at each time-step, t, about what action to take next on the basis of its current
observations of the environment, o_t, and the history, H_t. Maintaining long histories of
actions, rewards, and observations (which are possibly only very slightly different from
one iteration to the next) is not a very efficient way to reason about the world,
particularly as episodes might cover hundreds or thousands of time-steps.
Instead, we collapse this information into a single representation, referred to as a state.
• The state at time-step t, s_t, should contain all the important information about the
environment at that time-step, any important information about what has been
happening in the environment at preceding time-steps, and any important
information about the internal composition of the agent. For example, for a robot
deployed within a hospital to deliver equipment to operating theaters, the state
might include the robot’s position in the environment, the positions of people
nearby, whether the robot is on the way to collect items or to deliver them, and the
current levels of the robot’s batteries. In Figure 11.1 we show how the observations
made about the environment at time-step t are converted into a state, s_t, using a state
generation function. In many cases, if the environment is fully observable, this
function is a simple identity function because the observation fully defines the state.
It is also possible, however, for this function to be more elaborate when the
observations over multiple time-steps are accumulated into a state. Using states
instead of observations, Equation (11.1) can be restated as

s_1, a_1, r_1, s_2, a_2, r_2, ..., s_e, a_e, r_e

A sketch of both kinds of state generation function is given below.
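The two cases just described can be sketched as follows; the function and class names are illustrative, not from the text. In the fully observable case the state generation function is the identity, while the more elaborate case folds the last k observations into the state.

```python
from collections import deque

def identity_state(observation):
    """Fully observable case: the observation itself is the state."""
    return observation

class AccumulatingState:
    """Partially observable case: fold the last k observations into one state."""
    def __init__(self, k=4):
        self.window = deque(maxlen=k)

    def __call__(self, observation):
        self.window.append(observation)
        return tuple(self.window)   # s_t built from the k most recent observations

state_fn = AccumulatingState(k=3)
print(state_fn("o1"), state_fn("o2"), state_fn("o3"), state_fn("o4"))
# ('o1',) ('o1', 'o2') ('o1', 'o2', 'o3') ('o2', 'o3', 'o4')
```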
• That intelligent behavior can be driven by the singular goal of maximizing return is
a bold statement; it is often argued that it is very ambitious to expect sophisticated,
long-term behavior to emerge from the simple accumulation of instantaneous rewards.
Reward is often delayed, and the real value of an action is not reflected immediately
but rather by the fact that the action takes us toward a later state that will ultimately
allow the agent to earn a reward. For example, early moves in a game of chess do not
lead to large positive rewards but set the ground for later high-reward moves.
Rewards can also often be somewhat contradictory, and an action that gives an
immediate positive reward may turn out to be a bad one in the longer term. For
example, eating cake almost always seems like a good idea in the moment, but in
terms of long-term health it is probably not always a strong choice. It has been shown
repeatedly, however, that it is in fact possible to learn sophisticated, long-term
behaviors using the maximization of cumulative reward alone (a simple accumulation
of rewards is sketched below). This introduces the
second art of reinforcement learning: the design of effective reward functions.
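As a tiny illustration of "accumulation of instantaneous rewards", the return for an episode is simply the sum of the rewards received at each time-step; the numbers below are made up.

```python
# Sketch: the return for an episode is the accumulation of the instantaneous
# rewards received at each time-step (rewards may be negative).
def episode_return(rewards):
    return sum(rewards)

# An action with a low or negative immediate reward can still belong to the
# best episode overall, e.g. a sacrifice early in a chess game:
print(episode_return([-1.0, -0.5, 0.0, +10.0]))   # 8.5
```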
• The policy can be thought of as a simple lookup table that records the action that should
be taken in every state, and reinforcement learning problems can be framed as an effort to
learn this table directly. Policies can also be encoded as a rule used to choose an action
from those available in a particular state, and this is the approach we focus on in this
chapter. For example, we might use a greedy action selection policy that says the agent
should always take the action that will give it the highest immediate reward. This would,
however, ignore the fact that sometimes reward is delayed and that taking an action that
gives a low immediate reward can be a good idea if it leads the agent to a state that could
give it large positive rewards later on. This suggests the need for a more sophisticated measure
of the value of taking an action in a given state and leads to the final fundamental
component of a reinforcement learning agent: a value function. A sketch of both views of a
policy is given below.
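The two views of a policy described above can be sketched as follows. The stepping-stone state names and the immediate_reward function are hypothetical placeholders, not from the text.

```python
# (a) Policy as a lookup table: state -> action. All entries are made up.
policy_table = {
    "start_bank": "step_to_stone_1",
    "stone_1":    "step_to_stone_2",
    "stone_2":    "step_to_far_bank",
}

def table_policy(state):
    return policy_table[state]

# (b) Policy as a rule: greedy selection over the estimated immediate reward
#     of each available action (which, as noted above, ignores delayed reward).
def greedy_policy(state, available_actions, immediate_reward):
    return max(available_actions, key=lambda a: immediate_reward(state, a))
```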
Markov Decision Processes
• Markov decision processes (MDPs) are an attractive mathematical framework
within which to reason about decision-making scenarios in which outcomes are
partly under the control of a decision maker but also partly random. This has made
them an attractive framework for applications ranging from financial modeling to
robot control to modeling the flow of human conversation. This also makes them
ideal for reasoning about reinforcement learning.
• A Markov process, a more basic framework than an MDP that does not include
decision making, can be used to model a discrete random process that transitions
through a finite set of states, S. For example, we could use a Markov process to
model how infection progresses in an individual when a disease epidemic breaks
out. Individuals can belong to one of three states: SUSCEPTIBLE, INFECTED, or
RECOVERED (these are often referred to as S-I-R models). An individual can
belong to only one of these states at a time and moves between them according to a
Markov process. Figure 11.2(a) shows these states and how an individual can move
between them
• Markov processes are built on the Markov assumption that the probability of
transitioning to a particular state at the next time-step relies only on the current
state, and does not require any knowledge of the history of states that came before
it, or

P(S_{t+1} | S_t, S_{t-1}, ..., S_1) = P(S_{t+1} | S_t)

• where S_t and S_{t+1} are random variables to which the states at times t and t+1 are
assigned.
• The full dynamics of a Markov process can be captured in a transition matrix, in which
each row gives the probability of moving from one state to every other state. A sketch of
such a matrix for the S-I-R example is given below.
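The following sketch uses the S-I-R states named above; the transition probabilities are illustrative assumptions, as the text does not give actual values. Each row gives P(next state | current state) and sums to 1.

```python
import random

states = ["SUSCEPTIBLE", "INFECTED", "RECOVERED"]

# Illustrative transition matrix (made-up probabilities).
transition_matrix = {
    "SUSCEPTIBLE": {"SUSCEPTIBLE": 0.90, "INFECTED": 0.10, "RECOVERED": 0.00},
    "INFECTED":    {"SUSCEPTIBLE": 0.00, "INFECTED": 0.70, "RECOVERED": 0.30},
    "RECOVERED":   {"SUSCEPTIBLE": 0.00, "INFECTED": 0.00, "RECOVERED": 1.00},
}

def next_state(current):
    # The Markov assumption: the next state depends only on the current state.
    row = transition_matrix[current]
    return random.choices(list(row.keys()), weights=list(row.values()))[0]

# Simulate one individual for 20 time-steps.
s = "SUSCEPTIBLE"
trajectory = [s]
for _ in range(20):
    s = next_state(s)
    trajectory.append(s)
print(trajectory)
```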
• At each step the agent’s experience, made up of the current state, the action taken, the
reward received, and the resulting next state, is added to a replay memory, D. After taking
the action, instead of performing a single step of stochastic gradient descent, the agent
selects a random sample of b instances from the replay memory and performs an iteration
of mini-batch gradient descent using this sample as the mini-batch. The target feature
values for the instances in the mini-batch are generated as described in the naive neural
Q-learning algorithm. This means that the training process uses its experience of the
environment much more efficiently, because each step is used in network training multiple
times. Furthermore, the correlations between consecutive instances are broken because
mini-batches are randomly selected from the replay memory. The replay memory is given a
maximum size, N (usually greater than 10,000), and when it reaches this size the oldest
instances are dropped as new ones are added. Figure 11.9 illustrates this process; a sketch
of a simple replay memory follows.
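A sketch of a replay memory along the lines described above; the class and method names are illustrative, not taken from Algorithm 16.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, max_size=10_000):      # N: the maximum size
        self.memory = deque(maxlen=max_size)  # oldest instances dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, b):
        # Random sampling breaks the correlations between consecutive instances.
        return random.sample(self.memory, b)

    def __len__(self):
        return len(self.memory)

# During training, each step's experience is added to the memory and a random
# mini-batch of b instances is drawn for one iteration of mini-batch gradient descent:
#     memory.add(s_t, a_t, r, s_next, done)
#     if len(memory) >= b:
#         batch = memory.sample(b)
#         ...one gradient-descent step on the batch...
```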
• In the naive approach described, the network being trained is also being used to
generate target feature values. This can cause the network training process to
become unstable, as small changes in the outputs of the action-value network can
lead to sudden changes in the policy when a different action is suddenly favored in a
particular type of state. Target network freezing is used to address this. Two different
networks are used in the training process: an action-value behavior network that is
used to predict the values of actions for making decisions, and an action-value target
network that is used to predict the value of taking subsequent actions in subsequent
states when generating target feature values. The action-value target network is
frozen and not updated at each iteration of the algorithm. It does, however, need to
be updated occasionally, because otherwise the estimated values used in the loss
function will be inaccurate. Therefore, after every C steps the current action-value
target network is replaced with a copy of the action-value behavior network. This is
also illustrated in Figure 11.9. Target network freezing makes the training process
more stable and leads to faster convergence. A pseudocode description of the deep
Q network algorithm is given in Algorithm 16; a short sketch of the freezing
mechanism follows.
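A sketch of target network freezing, assuming PyTorch and an arbitrary small network (the 8-input, 4-output sizes and the value of C are placeholders). Every C steps the target network is overwritten with a copy of the behavior network; it is never trained directly.

```python
import copy
import torch.nn as nn

# Behavior network: used to predict action values when making decisions.
q_behavior = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))

# Target network: a frozen copy used only to generate target feature values.
q_target = copy.deepcopy(q_behavior)
for p in q_target.parameters():
    p.requires_grad = False

C = 10_000  # steps between target-network updates (placeholder value)

def maybe_update_target(step):
    # Every C steps, replace the target network with a copy of the behavior network.
    if step > 0 and step % C == 0:
        q_target.load_state_dict(q_behavior.state_dict())
```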
• The deep Q network algorithm can be used with any state representation that can be
input into a neural network, and can use different neural network architectures. The
simplest version of this would be a numeric state vector input into a multi-layer
perceptron feedforward network. The algorithm was first proposed, however, as an
approach to playing video games in which the only inputs were screenshots of the
game. To best handle image inputs, a convolutional neural network was used. A
single screenshot of a game does not contain sufficient information about the state of
an environment and an agent for the environment to be considered fully observable,
and so the Markov assumption does not hold. For example, in the single screenshot
of the Lunar Lander environment in Figure 11.7, it is not possible to tell at what
velocity the spaceship is moving. To overcome this, sequences of the last k
screenshots stacked together can be used as the state representation.
• This is an example of using a state generation function. Usually small stacks of
screenshots (e.g., k = 4) provide enough information to capture the state. It is difficult
to provide a detailed worked example of the DQN algorithm because the number of
weights to be learned and the number of steps required for anything interesting are too
large for clear presentation. Instead, to illustrate the DQN algorithm, we will examine at
a higher level how an automated player of the Lunar Lander game can be trained. As
mentioned before, this game has four actions available to the agent: None, Up, Left,
and Right. The state can be represented as a stack of the last 4 frames in the game. This
is illustrated in Figure 11.9; a sketch of such a frame-stacking state generation function
follows.
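A sketch of a frame-stacking state generation function. The FrameStack class is illustrative; k = 4 and the 84 × 84 frame size are taken from the description in this section.

```python
import numpy as np
from collections import deque

class FrameStack:
    """State generation function: the state is the stack of the last k screenshots."""
    def __init__(self, k=4, frame_shape=(84, 84)):
        self.k = k
        self.frames = deque(maxlen=k)
        self.frame_shape = frame_shape

    def reset(self, first_frame):
        # At the start of an episode the stack is filled with the first frame.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def add(self, frame):
        self.frames.append(frame)
        return self.state()

    def state(self):
        return np.stack(self.frames, axis=0)   # shape (k, 84, 84)

# Stacking consecutive frames recovers information, such as velocity,
# that a single screenshot cannot provide.
stack = FrameStack()
s = stack.reset(np.zeros((84, 84), dtype=np.float32))
s = stack.add(np.ones((84, 84), dtype=np.float32))
print(s.shape)   # (4, 84, 84)
```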
There are two ways that an episode can end: an agent can either land successfully
or crash. The agent earns a reward of +100 for landing successfully and a reward of -100
for crashing. During landing the agent receives a reward of +10 each time one of its legs
touches the ground gently. For every step that the agent is firing one of its thrusters it
receives a reward of -0.3. A sketch of this reward structure is given below.
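A sketch of this reward structure as a function; the boolean event flags are hypothetical names for the events described above, not an actual Lunar Lander API.

```python
def reward(landed, crashed, legs_touching_gently, thruster_firing):
    r = 0.0
    if landed:
        r += 100.0                       # successful landing
    if crashed:
        r -= 100.0                       # crash ends the episode with a large penalty
    r += 10.0 * legs_touching_gently     # +10 per leg touching the ground gently
    if thruster_firing:
        r -= 0.3                         # small cost for every step a thruster fires
    return r
```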
A convolutional neural network was used as the action-value network. Input images
were scaled to 84 × 84, and the network contained hidden convolutional layers with 32, 64,
and 64 units. Filter sizes were 8 × 8 (stride 4), 4 × 4 (stride 3), and 3 × 3 (stride 1).
Rectified linear activation functions were used in all hidden layer units. A final hidden
layer flattened the outputs of the previous convolutional layer and contained 512 fully
connected units with rectified linear activations. The output layer was a fully connected
layer with 4 outputs (one per action) using linear activations. Figure 11.9 illustrates
this architecture; a sketch of it in code is given below. The behavior policy used was
ε-greedy, but linear annealing was also used. Linear annealing allows the value of ε used
in the ε-greedy policy to change over time. At the beginning, a large value (ε = 0.9) is
used, and this slowly moves down toward a small value (ε = 0.05). During DQN training
the size of the replay memory was 50,000 and the target action-value network was
replaced every 10,000 steps.
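A sketch of the described action-value network and the linear ε-annealing schedule, assuming PyTorch. The layer sizes, filter sizes, and strides follow the text above; the number of annealing steps is an illustrative assumption, as the text does not specify it.

```python
import torch.nn as nn

# Input: a stack of four 84 x 84 frames; output: one value per action.
q_network = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=3), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(512), nn.ReLU(),   # flattened, fully connected hidden layer
    nn.Linear(512, 4),               # linear outputs: one per action
)

# Linear annealing of epsilon for the epsilon-greedy behavior policy:
# epsilon starts at 0.9 and decays linearly toward 0.05.
def epsilon(step, start=0.9, end=0.05, anneal_steps=100_000):
    # anneal_steps is an assumed value; the text does not give one.
    fraction = min(step / anneal_steps, 1.0)
    return start + fraction * (end - start)
```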