
Chapter 2

• Methods exist to find the optimal policy without a model, by querying the
environment. These methods are named model-free methods.
• Value-based model-free methods are the most basic learning approach of
reinforcement learning.
• They work well in problems with deterministic environments and discrete action
spaces, such as mazes and games.
• Model-free learning makes few demands on the environment, building up the
policy function 𝜋(𝑠) → 𝑎 by sampling the environment.

• Reinforcement learning is a natural way of learning the optimal route as we go, by
trial and error, from the effects of the actions that we take in our environment.
• There is an agent (you), an environment (the city), there are states (your location
at different points in time), actions (assuming a Manhattan-style grid, moving a
block left, right, forward, or back), there are trajectories (the routes to the
supermarket that you tried), there is a policy (that tells which action you will take
at a particular location), there is a concept of cost/reward (the length of your
current path), we see exploration of new routes, exploitation of old routes, a
trade-off between them, and your notebook in which you have been sketching a
map of the city (your local transition model).

Sequential Decision Problems

• In a sequential decision problem, the agent has to make a sequence of decisions
in order to solve a problem.
• Solving means finding the sequence with the highest (expected cumulative
future) reward.
• The solver is called the agent, and the problem is called the environment (or
sometimes the world).

Grid Worlds

• These environments consist of a rectangular grid of squares, with a start square
and a goal square. The aim is for the agent to find the sequence of actions that it
must take (up, down, left, right) to arrive at the goal square.
• In fancier versions a “loss” square is added, which scores minus points, or a “wall”
square, which is impenetrable to the agent.
• Grid world is a simple environment that is well-suited for manually playing
around with reinforcement learning algorithms, to build up intuition of what the
algorithms do. A minimal sketch of such an environment follows below.
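To make the grid-world setting concrete, here is a minimal sketch of such an environment in Python. The class name GridWorld, the grid layout, and the reward values are illustrative assumptions, not taken from any particular library.

```python
# A minimal deterministic grid world (illustrative layout and rewards).
class GridWorld:
    def __init__(self, width=4, height=3, goal=(3, 2), loss=(3, 1),
                 walls=frozenset({(1, 1)})):
        self.width, self.height = width, height
        self.goal, self.loss, self.walls = goal, loss, walls
        self.state = (0, 0)  # start square

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        # actions: 0 = up, 1 = down, 2 = left, 3 = right
        x, y = self.state
        dx, dy = [(0, 1), (0, -1), (-1, 0), (1, 0)][action]
        nx, ny = x + dx, y + dy
        # stay in place when hitting a wall square or the border
        if (nx, ny) in self.walls or not (0 <= nx < self.width and 0 <= ny < self.height):
            nx, ny = x, y
        self.state = (nx, ny)
        if self.state == self.goal:
            return self.state, 1.0, True    # goal square: reward, episode ends
        if self.state == self.loss:
            return self.state, -1.0, True   # "loss" square: minus points, episode ends
        return self.state, 0.0, False       # ordinary square: no reward
```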
Mazes and Box Puzzles

• Trajectory planning algorithms play a central role in robotics; there is a long
tradition of using 2D and 3D mazes for path-finding problems in reinforcement
learning.
• The challenge in Sokoban is that boxes can only be pushed, not pulled.
• The action space of these puzzles and mazes is discrete.
• Small versions of the mazes can be solved exactly by planning; larger instances
are only suitable for approximate planning or learning methods.
• Solving these planning problems exactly is NP-hard or PSPACE-hard; as a
consequence, the computational time required to solve problem instances
exactly grows exponentially with the problem size and quickly becomes
infeasible for all but the smallest problems.

Agent and Environment

• By repeatedly sampling the environment and observing the rewards, the agent
can find the best action for each state.

• In reinforcement learning the environment gives us only a number as an
indication of the quality of an action that we performed, and we are left to derive
the correct action policy from that.
• On the other hand, reinforcement learning allows us to generate as many
action-reward pairs as we need, without a large hand-labeled dataset.
Markov Decision Process

• Sequential decision problems can be modelled as Markov decision processes
(MDPs).
• Markov decision problems have the Markov property: the next state depends
only on the current state and the actions available in it (no historical memory of
previous states or information from elsewhere influences the next state).
• The no-memory property is important because it makes reasoning about future
states possible using only the information present in the current state.
• If previous histories influenced the current state, and all of these had to be
taken into account, then reasoning about the current state would be much
harder or even infeasible.

State Representation

• The state 𝑠 contains the information to uniquely represent the configuration of
the environment.
• For the supermarket example, each identifying location is a state.
• For chess, this can be the location of all pieces on the board (plus information for
the fifty-move and repetition rules, castling rights, and en passant state).
• For robotics this can be the orientation of all joints of the robot, and the location
of the limbs of the robot. For Atari, the state comprises the values of all screen
pixels.
• How the environment reacts to the action is defined by the transition model
𝑇𝑎(𝑠, 𝑠′) that is internal to the environment, which the agent does not know.
• The environment returns the new state 𝑠′, as well as a reward value 𝑟′ for the new
state.
Deterministic and Stochastic Environment

• In discrete deterministic environments the transition function defines a one-step
transition, as each action (from a certain old state) deterministically leads to a
single new state. This is the case in Grid worlds, Sokoban, and in games such as
chess and checkers, where a move action deterministically leads to one new
board position.
• An example of a non-deterministic situation is the movement of a robot in a
physical environment.
• The outcome of the action is unknown beforehand by the agent, and depends on
elements in the environment that are not known to the agent.

Irreversible Environment Action

• When the agent is in state 𝑠, it chooses an action 𝑎 to perform, based on its
current behavior policy 𝜋(𝑎|𝑠).
• An action changes the state of the environment irreversibly.
• In the reinforcement learning paradigm, there is no undo operator for the
environment (nor is there in the real world).
• When the environment has performed a state transition, it is final.
• The new state is communicated to the agent, together with a reward value.
• The actions that the agent performs in the environment are also known as its
behavior, just as the actions of a human in the world constitute the human’s
behavior.

Discrete or Continuous Action Space

• The actions in board games, and choosing a direction in a navigation task in a
grid, are discrete.
• In contrast, arm and joint movements of robots, and bet sizes in certain games,
are continuous.
• Applying algorithms to continuous or very large action spaces either requires
discretization of the continuous space (into buckets) or the development of a
different kind of algorithm.
• Value-based methods work well for discrete action spaces, and policy-based
methods work well for both types of action space.
• For the supermarket example we can actually choose between modeling our
actions as discrete or continuous. From every state, we can move any number of
steps, small or large, integer or fractional, in any direction. We can even walk a
curvy path. So, strictly speaking, the action space is continuous. However, if, as
in some cities, the streets are organized in a rectangular Manhattan pattern,
then it makes sense to discretize the continuous space, and to only consider
discrete actions that take us to the next street corner. Then, our action space has
become discrete, by using extra knowledge of the problem structure.

Transition 𝑻𝒂

• The transition function 𝑇𝑎 determines how the state changes after an action has
been selected.
• In model-free reinforcement learning the transition function is implicit to the
solution algorithm: the environment has access to the transition function, and
uses it to compute the next state 𝑠′, but the agent does not.
• In model-based reinforcement learning, the agent has its own transition
function, an approximation of the environment’s transition function, which is
learned from the environment feedback.

Graph View of the State Space

• The dynamics of the MDP are modelled by transition function 𝑇𝑎 (·) and reward
function 𝑅𝑎 (·).
• The imaginary space of all possible states is called the state space.
• The two functions define a two-step transition from state 𝑠 to 𝑠′, via action
𝑎: 𝑠 → 𝑎 → 𝑠′.

Trial and Error, Down and Up

• A graph such as the one in the center and right panel of Fig. 2.5, where child
nodes have only one parent node and there are no cycles, is known as a tree.
• The root node at the top is state 𝑠.
• As actions are performed and states and rewards are returned back up the tree, a
learning process is taking place in the agent.
• In the tree of Fig. 2.5 an action selection moves downward, towards the leaves.
At the deeper states, we find the rewards, which we propagate upwards to the
parent states.
• Reward learning is learning by backpropagation.
• Action selection moves down, reward learning flows up.
• Reinforcement learning is learning by trial and error.
• Trial is selecting an action down (using the behavior policy) to perform in the
environment.
• Error is moving up the tree, receiving a feedback reward from the environment,
and reporting that back up the tree to the state to update the current behavior
policy.
• The downward selection policy chooses which actions to explore, and the
upward propagation of the error signal performs the learning of the policy.

Reward 𝑹𝒂

• Rewards are associated with single states, indicating their quality.
• However, we are most often interested in the quality of a full decision making
sequence from root to leaves.
• The reward of such a full sequence is called the return, sometimes confusingly
denoted as 𝑅, the same symbol as the reward.
• The expected cumulative discounted future reward of a state is called the value
function 𝑉^𝜋(𝑠).
• The value function 𝑉^𝜋(𝑠) is the expected cumulative reward of 𝑠 where actions
are chosen according to policy 𝜋.

Discount Factor 𝜸

• We distinguish between two types of tasks: (1) continuous, long-running tasks,
and (2) episodic tasks—tasks that end.
• In continuous and long running tasks it makes sense to discount rewards from
far in the future in order to more strongly value current information at the present
time. To achieve this a discount factor 𝛾 is used in our MDP that reduces the
impact of far away rewards.
• Many continuous tasks use discounting, 𝛾 ≠ 1.
• Both the supermarket example and the game of chess are episodic, and
discounting does not make sense in these problems, 𝛾 = 1.

Policy 𝝅

• The policy 𝜋 is a conditional probability distribution that for each possible state
specifies the probability of each possible action.
• The function 𝜋 is a mapping from the state space to a probability distribution
over the action space:
𝜋 : 𝑆 → 𝑝(𝐴)
where 𝑝(𝐴) can be a discrete or continuous probability distribution.
• For a particular probability (density) from this distribution we write
𝜋(𝑎|𝑠)
• A special case of a policy is a deterministic policy, denoted by 𝜋(𝑠) where 𝜋:𝑆→𝐴.
• A deterministic policy selects a single action in every state.
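As an informal illustration of the difference between a stochastic and a deterministic policy, here is a small Python sketch. The state names, actions, and probabilities are made up for the example.

```python
import random

# A stochastic policy pi(a|s), stored as a table of per-state action
# probabilities (states, actions, and numbers are illustrative).
stochastic_pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(state):
    # draw an action according to the distribution pi(.|s)
    actions, probs = zip(*stochastic_pi[state].items())
    return random.choices(actions, weights=probs)[0]

def deterministic_pi(state):
    # a deterministic policy pi(s) selects a single action in every state
    return max(stochastic_pi[state], key=stochastic_pi[state].get)
```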

Trace 𝝉

• A trace 𝜏𝑡 is the sequence of states, actions, and rewards encountered from
time step 𝑡 onwards. Here, 𝑛 denotes the length of the trace 𝜏. In practice, we
often assume 𝑛 = ∞, which means that we run the trace until the domain
terminates.
• Traces are a single full rollout of a sequence from the sequential decision
problem.
• They are also called trajectory, episode, or simply sequence.
• Since both the policy and the transition dynamics can be stochastic, we will not
always get the same trace from the start state. Instead, we will get a distribution
over traces.
• The distribution of traces from the start state (distribution) is denoted by 𝑝(𝜏0).
• The probability of each possible trace from the start is given by the product of
the probabilities of the individual transitions in the trace:
𝑝(𝜏) = 𝑝(𝑠0) ∏𝑡 𝜋(𝑎𝑡 |𝑠𝑡) · 𝑇𝑎𝑡 (𝑠𝑡, 𝑠𝑡+1)
• Policy-based reinforcement learning depends heavily on traces.


• Value-based reinforcement learning uses single transition steps.
Return 𝑹

• The sum of the future reward of a trace is known as the return.
• The return of trace 𝜏𝑡 is the discounted sum of its rewards:
𝑅(𝜏𝑡) = 𝑟𝑡 + 𝛾 · 𝑟𝑡+1 + 𝛾^2 · 𝑟𝑡+2 + . . . + 𝛾^𝑛 · 𝑟𝑡+𝑛
• Note that if we used an infinite-horizon return (Eq. 2.2) with 𝛾 = 1.0, then the
cumulative reward may become unbounded. Therefore, in continuous problems,
we use a discount factor close to 1.0, such as 𝛾 = 0.99.
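The return is easy to compute by folding the rewards backwards over a trace. A small sketch; the function name and the example rewards are illustrative.

```python
def discounted_return(rewards, gamma=0.99):
    """Return of a trace: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    g = 0.0
    for r in reversed(rewards):     # accumulate from the end of the trace
        g = r + gamma * g
    return g

# Example: only the final step is rewarded, two steps in the future.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.99))   # 0.9801
```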

State Value 𝑽

• The environment can be stochastic, and so can our policy, and for a given policy
we do not always get the same trace.
• We are actually interested in the expected cumulative reward that a certain
policy achieves.
• The expected cumulative discounted future reward of a state is better known as
the value of that state.

• Every policy 𝜋 has one unique associated value function 𝑉^𝜋(𝑠). We often omit 𝜋
to simplify notation, simply writing 𝑉(𝑠), knowing that a state value is always
conditioned on a certain policy.
• The state value is defined for every possible state 𝑠 ∈ 𝑆.
• Finally, the state value of a terminal state is by definition zero.
State-Action Value 𝑸

• Every policy 𝜋 has only one unique associated state-action value function
𝑄^𝜋(𝑠, 𝑎).
• We often omit 𝜋 to simplify notation.
• The state-action value of a terminal state is by definition zero.

Reinforcement Learning Objective

• The objective of reinforcement learning is to find the policy that achieves the
highest expected return from the start state: 𝜋^★ = arg max𝜋 E𝜏0∼𝑝(𝜏0) [𝑅(𝜏0)].
Bellman Equation

• Bellman showed that discrete optimization problems can be described as a


recursive backward induction problem.
• He introduced the term dynamic programming to recursively traverse the states
and actions.
• The so-called Bellman equation shows the relationship between the value
function in state 𝑠 and the future child state 𝑠′, when we follow the transition
function:
𝑉^𝜋(𝑠) = ∑𝑎 𝜋(𝑎|𝑠) ∑𝑠′ 𝑇𝑎(𝑠, 𝑠′) [𝑅𝑎(𝑠, 𝑠′) + 𝛾𝑉^𝜋(𝑠′)]
• Note the recursion on the value function, and that for the Bellman equation the
transition and reward functions must be known for all states by the agent.
• Together, the transition and reward model are referred to as the dynamics model
of the environment.
• The dynamics model is often not known by the agent, and model-free methods
have been developed to compute the value function and policy function without
them.

MDP Solution Methods

• The Bellman equation is a recursive equation: it shows how to calculate the
value of a state from the values obtained by applying the same equation to its
successor states.
• Dynamic programming uses the principle of divide and conquer: it begins with a
start state whose value is to be determined by searching a large subtree, which it
does by going down into the recursion, finding the values of sub-states that are
closer to the terminals.
• At terminals the reward values are known, and these are then used in the
construction of the parent values, as it goes up, back out of the recursion, and
ultimately arrives at the root value itself.
• A simple dynamic programming method to iteratively traverse the state space to
calculate Bellman’s equation is value iteration (VI).
• Value iteration converges to the optimal value function by iteratively improving
the estimate of 𝑉 (𝑠).
• The value function 𝑉(𝑠) is first initialized to random values. Value iteration
repeatedly updates 𝑄(𝑠,𝑎) and 𝑉 (𝑠) values, looping over the states and their
actions, until convergence occurs (when the values of 𝑉 (𝑠) stop changing much).
• Value iteration works with a finite set of actions.
• It has been proven to converge to the optimal values, but it does so quite
inefficiently, essentially by repeatedly enumerating the entire state space in a
triply nested loop, traversing the state space many times.
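The value-iteration loop described above can be sketched compactly in Python. The tabular layout of the transition model (a list of probability, next-state, reward triples per state-action pair) is an assumption made for this illustration, not a fixed interface from the text.

```python
import numpy as np

def value_iteration(num_states, num_actions, T, gamma=1.0, theta=1e-6):
    # T[s][a] is a list of (probability, next_state, reward) triples.
    V = np.zeros(num_states)          # initial estimate (the text uses random values)
    while True:
        delta = 0.0
        for s in range(num_states):
            # Q(s, a) for every action, using the current estimate of V
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])
                 for a in range(num_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                # V(s) = max_a Q(s, a)
        if delta < theta:              # stop when the values stop changing much
            return V
```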

Model-Free Learning

• Frequently, we are in a situation when the transition probabilities are not known
to the agent, and we need other methods to compute the policy function. For
this situation, model-free algorithms have been developed.
• We will see how, when the agent does not know the transition function, an
optimal policy can be learned by sampling rewards from the environment.

• These algorithms are based on a few principles. First, we will discuss how the
principle of sampling can be used to construct a value function.
• Second, we will discuss which mechanisms for action selection exist, where we
will encounter the exploration/exploitation trade-off.
• Third, we will discuss how to learn from the rewards of the selected actions.
• Finally, we will discuss two full algorithms in which all these concepts come
together: SARSA and Q-learning.

Monte Carlo Sampling

• A straightforward way to sample rewards is to generate a random episode, and
use its return to update the value function at the visited states.
• This approach, of randomly sampling full episodes, has become known as the
Monte Carlo approach.
• The Monte Carlo approach is a basic building block of value-based reinforcement
learning.
• An advantage of the approach is its simplicity. A disadvantage is that a full
episode has to be sampled before the reward values are used, and sample
efficiency may be low.
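A sketch of Monte Carlo evaluation with a random behavior policy is shown below. It assumes an environment with the reset()/step() interface of the grid-world sketch earlier in these notes; the function name and hyperparameters are illustrative.

```python
import random
from collections import defaultdict

def mc_evaluate(env, num_actions, episodes=1000, gamma=1.0):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(episodes):
        # generate one full random episode (trial)
        state, done, trace = env.reset(), False, []
        while not done:
            action = random.randrange(num_actions)
            next_state, reward, done = env.step(action)
            trace.append((state, reward))
            state = next_state
        # propagate the return back up over the visited states (every-visit MC)
        g = 0.0
        for state, reward in reversed(trace):
            g = reward + gamma * g
            returns_sum[state] += g
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```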
Temporal Difference Learning

• Bootstrapping is the process of subsequent refinement by which old estimates
of a value are refined with new updates.
• Bootstrapping solves the problem of computing a final value when we only know
how to compute step-by-step intermediate values.
• Bellman’s recursive computation is a form of bootstrapping.
• A bootstrapping method that can be used to process the samples, and to refine
them to approximate the final state values, is temporal difference learning.
• The temporal difference in the name refers to the difference in values between
two time steps, which is used to calculate the value at the new time step:
𝑉(𝑠) ← 𝑉(𝑠) + 𝛼 [𝑟′ + 𝛾𝑉(𝑠′) − 𝑉(𝑠)]
• Note the introduction of 𝛼, the learning rate, which controls how fast the
algorithm learns (bootstraps).
• It is an important parameter; setting the value too high can be detrimental since
the last value then dominates the bootstrap process too much.
• The last term −𝑉(𝑠) subtracts the value of the current state, to compute the
temporal difference.
• Another way to write this update rule is 𝑉(𝑠) ← 𝑉(𝑠) + 𝛼 · 𝛿, with the temporal
difference error 𝛿 = [𝑟′ + 𝛾𝑉(𝑠′)] − 𝑉(𝑠) as the difference between the new
temporal difference target and the old value.
• Note the absence of the transition model 𝑇 in the formula; temporal difference
learning is a model-free update formula.
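A tabular TD(0) evaluation sketch, again assuming the reset()/step() environment interface used above; the random behavior policy and the hyperparameter values are illustrative choices.

```python
import random
from collections import defaultdict

def td0_evaluate(env, num_actions, episodes=1000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = random.randrange(num_actions)        # behavior policy (random here)
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])       # V(s) <- V(s) + alpha * TD error
            state = next_state
    return V
```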

Bias-Variance Trade-off

• A crucial difference between the Monte Carlo method and the temporal
difference method is the use of bootstrapping to calculate the value function.
• The use of bootstrapping has an important consequence: it trades off bias and
variance.
• Monte Carlo does not use bootstrapping. It performs a full episode with many
random action choices before it uses the reward. As such, its action choices are
unbiased (they are fully random); they are not influenced by previous reward
values.
• However, the fully random choices also cause a high variance of returns between
episodes.
• Monte Carlo is a low-bias/high-variance algorithm.
• In contrast, temporal difference bootstraps the 𝑄-function with the values of the
previous steps, refining the function values with the rewards after each single
step.
• Thus, because of bootstrapping, TD is a high-bias/low-variance method.
• Both approaches have their uses in different circumstances. In fact, we can think
of situations where a middle ground (of medium bias/medium variance) might be
useful.
• This is the idea behind the so-called n-step approach: do not sample a full
episode, and also not a single step, but sample a few steps at a time before using
the reward values (see the sketch below).
• The n-step algorithm has medium bias and medium variance.
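A minimal sketch of an n-step target: use the first n sampled rewards and then bootstrap from the current value estimate of the state reached after n steps. The function name is illustrative.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """rewards: the first n rewards r_t .. r_{t+n-1};
    bootstrap_value: the current estimate V(s_{t+n})."""
    g = bootstrap_value
    for r in reversed(rewards):        # fold the sampled rewards around the bootstrap value
        g = r + gamma * g
    return g

# With n = 1 this reduces to the temporal difference target; with n equal to the
# episode length (and bootstrap_value = 0) it becomes the Monte Carlo return.
```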

Find Policy by Value-based Learning

• In the value-based approach we know the value functions 𝑉(𝑠) or 𝑄(𝑠, 𝑎). How
can that help us to find action 𝑎?
• In a discrete action space, there is at least one discrete action with the highest
value. Thus, if we have the optimal state-value 𝑉*, then the optimal policy can
be found by finding the action with that value.
• This relationship is given by 𝜋★(𝑠) = arg max𝑎 𝑄★(𝑠, 𝑎).
• In this way the optimal policy 𝜋★(𝑠), the sequence of best actions, can be
recovered from the values, hence the name value-based method.
• A full reinforcement learning algorithm consists of a rule for the selection part
(downward) and a rule for the learning part (upward).
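Recovering the greedy policy from a learned Q-function is a one-liner; the Q[state, action] array layout is an assumption for this sketch.

```python
import numpy as np

def greedy_policy_from_q(Q):
    # Q is a 2-D array indexed as Q[state, action]
    return np.argmax(Q, axis=1)        # pi(s) = argmax_a Q(s, a) for every state
```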
Exploration

• Since there is no local transition function, model-free methods perform their
state changes directly in the environment.
• The sampling policy should choose promising actions to reduce the number of
samples as much as possible, and not waste any actions.
• What behavior policy should we use? It is tempting to favor at each state the
actions with the highest Q-value, since then we would be following what is
currently thought to be the best policy.
• This approach is called the greedy approach. It appears attractive, but is
short-sighted and risks settling for local maxima.
• Indeed, the greedy approach is high bias, using values based on few samples.
• We run the risk of circular reinforcement if we update the same behavior policy
that we use to choose our samples.
• In addition to exploiting known good actions, a certain amount of exploration of
unknown actions is necessary.
• Smart sampling strategies use a mix of the current behavior policy (exploitation)
and randomness (exploration) to select which action to perform in the
environment.

Bandit Theory

• The exploration/exploitation trade-off, the question of how to get the most
reliable information at the least cost, has been studied extensively in the
literature for single-step decision problems.
• A bandit in this context refers to a casino slot machine, with not one arm, but
many arms, each with a different and unknown payout probability.

𝝐-greedy Exploration

• A popular pragmatic exploration/exploitation approach is to use a fixed ratio of
exploration versus exploitation.
• This approach is known as the 𝜖-greedy approach: mostly try the (greedy) action
that currently has the highest policy value, except for an 𝜖 fraction of the time,
when a randomly selected other action is explored (see the sketch at the end of
this section).
• If 𝜖 = 0.1 then 90% of the time the currently-best action is taken, and 10% of the
time a random other action.
• The algorithmic choice between greedily exploiting known information and
exploring unknown actions to gain new information is called the
exploration/exploitation trade-off.
• A second approach is to use an adaptive 𝜖-ratio, that changes over time, or over
other statistics of the learning process.
• Other popular approaches to add exploration are to add Dirichlet-noise or to use
Thompson sampling.
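A minimal 𝜖-greedy action selection over a tabular Q-function; the Q[(state, action)] dictionary layout is an illustrative assumption.

```python
import random

def epsilon_greedy(Q, state, num_actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(num_actions)                         # explore: random action
    return max(range(num_actions), key=lambda a: Q[(state, a)])      # exploit: greedy action
```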

Off-Policy Learning

• The question is whether the agent should perform updates strictly on-policy—
only learning from its most recent action—or allow off-policy updates, learning
from all available information.
• In on-policy learning, the learning takes place by using the value of the action
that was selected by the policy. The policy determines the action to take, and the
value of that action is used to update the value of the policy function: the
learning is on-policy.
• In off-policy methods, the learning takes place by backing up values of another
action, not necessarily the one selected by the behavior policy.
• This method makes sense when the agent explores.
• When the behavior policy explores, it selects a non-optimal action.
• The difference between on-policy and off-policy learning only shows up when the
non-greedy, explorative action is selected.
• In the case of exploration, off-policy learning can be more efficient, by not
stubbornly backing up the value of the action selected by the behavior policy, but
the value of an older, better action.
• An important point is that the convergence behavior of on-policy and off-policy
learning is different.
• In general, tabular reinforcement learning methods have been proven to converge
when the policy is greedy in the limit with infinite exploration.
• Off-policy methods learn from the greedy rewards and thus converge to the
optimal policy, after having sampled enough states.
• However, on-policy methods with a fixed 𝜖 do not converge to the optimal policy,
since they keep selecting explorative actions.
• When we use a variable-𝜖 policy in which the value of 𝜖 goes to zero, then on-
policy methods do converge, since then they choose, in the limit, the greedy
action.

On-Policy SARSA
• On-policy learning selects an action, evaluates it in the environment, and follows
the actions, guided by the behavior policy.
• On-policy learning samples the state space following the behavior policy, and
improves the policy by backing up values of the selected actions.
• SARSA updates its Q-values using the Q-value of the next state 𝑠𝑡+1 and the
action 𝑎𝑡+1 that the behavior policy actually selects there (Eq. 2.9):
𝑄(𝑠𝑡, 𝑎𝑡) ← 𝑄(𝑠𝑡, 𝑎𝑡) + 𝛼 [𝑟𝑡+1 + 𝛾𝑄(𝑠𝑡+1, 𝑎𝑡+1) − 𝑄(𝑠𝑡, 𝑎𝑡)]
• The primary advantage of on-policy learning is its predictive behavior.
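A tabular SARSA sketch with an 𝜖-greedy behavior policy. The reset()/step() interface matches the earlier grid-world sketch; the function name and hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict

def sarsa(env, num_actions, episodes=5000, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)

    def select(state):                 # epsilon-greedy behavior policy
        if random.random() < epsilon:
            return random.randrange(num_actions)
        return max(range(num_actions), key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        action = select(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = select(next_state)           # the action the policy will actually take
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])   # on-policy backup
            state, action = next_state, next_action
    return Q
```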

Off-Policy Q-Learning

• Off-policy learning is more complicated; it may learn its policy from actions that
are different from the one just taken.

• The only difference from on-policy learning is that the 𝛾𝑄(𝑠𝑡+1, 𝑎𝑡+1) term from
Eq. 2.9 has been replaced by 𝛾 max𝑎 𝑄(𝑠𝑡+1, 𝑎), giving the Q-learning update
𝑄(𝑠𝑡, 𝑎𝑡) ← 𝑄(𝑠𝑡, 𝑎𝑡) + 𝛼 [𝑟𝑡+1 + 𝛾 max𝑎 𝑄(𝑠𝑡+1, 𝑎) − 𝑄(𝑠𝑡, 𝑎𝑡)].
• Indeed, the term temporal difference learning is sometimes used for the Q-
learning algorithm. Note that Q-learning uses bootstrapping.
• The reason that Q-learning is called off-policy is that it updates its Q-values
using the Q-value of the next state 𝑠𝑡+1 and the greedy action (not necessarily
the behavior policy’s action—it is learning off the behavior policy).
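A tabular Q-learning sketch: the backup uses max𝑎 𝑄(𝑠′, 𝑎) regardless of which action the 𝜖-greedy behavior policy takes next. The interface and hyperparameters are the same illustrative assumptions as before.

```python
import random
from collections import defaultdict

def q_learning(env, num_actions, episodes=5000, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:                                  # explore
                action = random.randrange(num_actions)
            else:                                                          # exploit
                action = max(range(num_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in range(num_actions))
            target = reward + (0.0 if done else gamma * best_next)         # off-policy (greedy) backup
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```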

Sparse Rewards and Reward Shaping

• Environments in which a reward exists in each state are said to have a dense
reward structure.
• For other environments rewards may exist for only some of the states. For
example, in chess, rewards only exist at terminal board positions where there is a
win or a draw. In all other states the return depends on the future states and
must be calculated by the agent by propagating reward values from future states
up towards the root state 𝑠0. Such an environment is said to have a sparse
reward structure.
• Finding a good policy is more complicated when the reward structure is sparse.
• A graph of the landscape of such a sparse reward function would show a flat
landscape with a few sharp mountain peaks.
• Finding the optimum in a flat landscape where the gradient is zero, is hard.
• Reward shaping can make all the difference when no solution can be found with
a naive reward function.
• It is a way of incorporating heuristic knowledge into the MDP. The use of
heuristics on board games such as chess and checkers can also be regarded as
reward shaping.

Tuning your Learning Rate

• A choice close to 1 for the discount factor 𝛾 is usually a good start, and a choice
close to 0 for the learning rate 𝛼 is a good start.

Conclusion

• Model-free methods use actions that are irreversible for the agent.
• The backup rule for learning is based on bootstrapping, and can follow the
rewards of the actions on-policy, including the value of the occasional
explorative action, or off-policy, always using the value of the best action.

Summary

• Games and robotics are two important fields of application. Fields of application
can be episodic (they end—such as a game of chess) or continuous (they do not
end—a robot remains in the world).
• In continuous problems it often makes sense to discount behavior that is far
from the present; episodic problems typically do not bother with a discount
factor—a win is a win.
• Environments can be deterministic (many board games are deterministic—
boards don’t move) or stochastic (many robotic worlds are stochastic—the
world around a robot moves).

• The optimal policy can be found by finding the maximal value of a state. The
value function 𝑉 (𝑠) returns the expected reward for a state.
• When the transition function 𝑇𝑎 (𝑠, 𝑠′) is present, the agent can use Bellman’s
equation, or a dynamic programming method to recursively traverse the behavior
space. Value iteration is one such dynamic programming method.
• Value iteration traverses all actions of all states, backing up reward values, until
the value function stops changing. The state-action value 𝑄(𝑠, 𝑎) determines the
value of an action of a state.
• Bellman’s equation calculates the value of a state by calculating the value of
successor states. Accessing successor states (by following the action and
transition) is also called expanding a successor state.
• In a tree diagram successor states are called child nodes, and expanding is a
downward action. Backpropagating the reward values to the parent node is a
movement upward in the tree.
• Methods where the agent makes use of the transition model are called model-
based methods. When the agent does not use the transition model, they are
model-free methods.
• Value-based model-free methods can find an optimal policy by using only
irreversible actions, sampling the environment to find the value of the actions.
• A major determinant in model-free reinforcement learning is the
exploration/exploitation trade-off, or how much of the information that has been
learned from the environment is used in choosing actions to sample.
• A well-known exploration/exploitation method is 𝜖-greedy, where the greedy
(best) action of the behavior policy is followed, except an 𝜖 fraction of the time,
when random exploration is performed. Always following the policy’s best action
runs the risk of getting stuck in a cycle. Exploring random nodes allows breaking
free of such cycles.
• How should we process the rewards that are found at nodes? Here we
introduced another fundamental element of reinforcement learning:
bootstrapping, or finding a value by refining a previous value. Temporal difference
learning uses the principle of bootstrapping to find the value of a state by adding
appropriately discounted future reward values to the state value function.

Book Questions

• In reinforcement learning the agent can choose which training examples are
generated. Why is this beneficial? What is a potential problem?

• Beneficial because it allows for more efficient learning by focusing on relevant
experiences. A potential problem is that it might lead to biased exploration of the
state space.

• What is Grid world?

• A simple environment used in reinforcement learning, represented as a grid with
states and actions.

• Which five elements does an MDP have to model reinforcement learning problems?

• States, actions, rewards, transitions, and the discount factor (γ).


• In a tree diagram, is successor selection of behavior up or down?

• Down.

• In a tree diagram, is learning values through backpropagation up or down?

• Up.

• What is 𝜏?

• 𝜏 (tau) represents a trace, which is a sequence or rollout of state-action-reward
tuples in reinforcement learning.

• What is 𝜋(𝑠)?

• The policy function that maps states to actions.

• What is 𝑉(𝑠)?

• The value function that estimates the value of a state.

• What is 𝑄(𝑠, 𝑎)?

• The action-value function that estimates the value of taking a specific action in a state.

• What is dynamic programming?

• A method for solving complex problems by breaking them down into simpler
subproblems.

• What is recursion?

• The process of solving a problem where the solution involves solving smaller
instances of the same problem.

• Do you know a dynamic programming method to determine the value of a state?

• Value iteration.

• Is an action in an environment reversible for the agent?

• Actions are generally not reversible; they lead to new states.

• Mention two typical application areas of reinforcement learning.

• Games and robotics.


• Is the action space of games typically discrete or continuous?

• Discrete.

• Is the action space of robots typically discrete or continuous?

• Continuous.

• Is the environment of games typically deterministic or stochastic?

• Deterministic.

• Is the environment of robots typically deterministic or stochastic?

• Stochastic.

• What is the goal of reinforcement learning?

• To learn a policy that maximizes the cumulative reward.

• Which of the five MDP elements is not used in episodic problems?

• Gamma (γ), the discount factor, is irrelevant in episodic problems.

• Which model or function is meant when we say “model-free” or “model-based”?

• "Model-free" means no model of the environment's dynamics is used; "model-based"


means such a model is used, specifically the transition model .

• What type of action space and what type of environment are suited for value-based
methods?

• Discrete action space and deterministic environment.

• Why are value-based methods used for games and not for robotics?

• Because games typically have discrete and finite action spaces.

• Name two basic Gym environments.

• CartPole and Mountain Car.

