Unit 1-RL
Chapter-1: Introduction
Consider the familiar child's game of tic-tac-toe. Two players take turns playing on a
three-by-three board. One player plays Xs and the other Os until one player wins by
placing three marks in a row, horizontally, vertically, or diagonally, as the X player
has in this game.
If the board fills up with neither player getting three in a row, the game is a draw.
Because a skilled player can play so as never to lose, let us assume that we are
playing against an imperfect player, one whose play is sometimes incorrect and
allows us to win. For the moment, in fact, let us consider draws and losses to be
equally bad for us. How might we construct a player that will find the imperfections
in its opponent's play and learn to maximize its chances of winning?
An evolutionary approach to this problem would directly search the space of possible
policies for one with a high probability of winning against the opponent. Here, a
policy is a rule that tells the player what move to make for every state of the game--
every possible configuration of X s and Os on the three-by-three board. For each
policy considered, an estimate of its winning probability would be obtained by
playing some number of games against the opponent. This evaluation would then
direct which policy or policies were considered next. A typical evolutionary method
would hill-climb in policy space, successively generating and evaluating policies in an
attempt to obtain incremental improvements. Or, perhaps, a genetic-style algorithm
could be used that would maintain and evaluate a population of policies. Literally
hundreds of different optimization methods could be applied. By directly searching
the policy space we mean that entire policies are proposed and compared on the basis
of scalar evaluations.
The approach using a value function is different: we set up a table of numbers, one for each possible state of the game, each number being our latest estimate of the probability of winning from that state. We then play many games against the opponent. To select our moves we examine the
states that would result from each of our possible moves (one for each blank space on
the board) and look up their current values in the table. Most of the time we
move greedily, selecting the move that leads to the state with greatest value, that is,
with the highest estimated probability of winning. Occasionally, however, we select
randomly from among the other moves instead. These are called exploratory moves
because they cause us to experience states that we might otherwise never see. A
sequence of moves made and considered during a game can be diagrammed as in
Figure 1.1.
Figure 1.1: A sequence of tic-tac-toe moves. The solid lines represent the moves
taken during a game; the dashed lines represent moves that we (our
reinforcement learning player) considered but did not make. Our second move
was an exploratory move, meaning that it was taken even though another sibling
move was ranked higher. Exploratory moves do not result
in any learning, but each of our other moves does, causing backups as suggested
by the curved arrows and detailed in the text.
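A minimal sketch of this greedy/exploratory move selection, assuming board positions are encoded as hashable objects and the value table is a plain Python dict (all names here are illustrative, not from the text):

import random

def select_state(afterstates, values, epsilon=0.1, default=0.5):
    """Choose which state to move to from the states reachable in one move.

    `values` maps a hashable board encoding to its estimated probability
    of winning; states we have never seen default to `default`.
    With probability `epsilon` we make an exploratory (random) choice;
    otherwise we move greedily to the highest-valued state.
    """
    if random.random() < epsilon:
        return random.choice(afterstates)                        # exploratory move
    return max(afterstates, key=lambda s: values.get(s, default))  # greedy move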
While we are playing, we change the values of the states in which we find ourselves
during the game. We attempt to make them more accurate estimates of the
probabilities of winning. To do this, we "back up" the value of the state after each
greedy move to the state before the move, as suggested by the arrows in Figure 1.1.
More precisely, the current value of the earlier state is adjusted to be closer to the
value of the later state. This can be done by moving the earlier state's value a fraction
of the way toward the value of the later state. If we let St denote the state before the
greedy move, and St+1 the state after the move, then the update to the estimated value
of St, denoted V(St), can be written as

V(St) ← V(St) + α[V(St+1) − V(St)]

where α is a small positive fraction called the step-size parameter, which influences
the rate of learning. This update rule is an example of a temporal-difference learning
method, so called because its changes are based on a difference, V(St+1) − V(St),
between estimates at two different times.
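A minimal sketch of this backup under the same dict-based value table as above (again, the names are illustrative):

def td_backup(values, state_before, state_after, alpha=0.1, default=0.5):
    """Temporal-difference backup after a greedy move:
        V(St) <- V(St) + alpha * [V(St+1) - V(St)]
    Unseen states are given the initial estimate `default`.
    """
    v_before = values.get(state_before, default)
    v_after = values.get(state_after, default)
    values[state_before] = v_before + alpha * (v_after - v_before)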
Chapter-2: Multi-Armed Bandits
Each of the k actions has a mean reward that we call the value of the action (q-value).
q∗(a)≐E[Rt|At=a]
If we knew the value of each action, it would be trivial to solve the problem. But we don't have
them, so we need estimates. We call Qt(a) the estimated value of action a at time t.
greedy action: take the action with the highest current estimate. That’s exploitation
nongreedy actions allow us to explore and build better estimates
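A rough sketch of the simplest way to trade these off, ε-greedy selection over the current estimates (the list-of-estimates layout and names are assumptions for illustration):

import random

def epsilon_greedy(Q, epsilon=0.1):
    """Return an action index: random (exploratory) with probability
    epsilon, otherwise the greedy action with the highest estimate."""
    if random.random() < epsilon:
        return random.randrange(len(Q))            # explore
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit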
One easy way to estimate the action values is to use a constant step size, so that Qn+1 becomes a
weighted average of the past rewards and the initial estimate Q1:

Qn+1 ≐ Qn + α[Rn − Qn]

which is equivalent to:

Qn+1 = (1−α)^n Q1 + Σ_{i=1}^{n} α(1−α)^{n−i} Ri
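A sketch of this constant-step-size update, assuming the estimates Q are kept in a list indexed by action (names are illustrative):

def update_estimate(Q, action, reward, alpha=0.1):
    """Q[a] <- Q[a] + alpha * (R - Q[a]).

    With a constant alpha, Q[a] becomes an exponentially weighted
    average of past rewards plus a decaying share of the initial estimate.
    """
    Q[action] += alpha * (reward - Q[action])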
Upper-Confidence-Bound (UCB) action selection:

At ≐ argmax_a [ Qt(a) + c √( ln t / Nt(a) ) ]
The square-root term is a measure of the uncertainty or variance in the estimate of a’s value. The
quantity being max’ed over is thus a sort of upper bound on the possible true value of action a,
with c determining the confidence level.
Each time a is selected, the uncertainty is reduced (Nt(a) increments and the uncertainty
term decreases).
Each time another action is selected, ln t increases but Nt(a) does not, so the uncertainty
increases.
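A minimal sketch of UCB selection under these definitions; treating never-tried actions (Nt(a) = 0) as maximally uncertain is spelled out explicitly here, and all names are illustrative:

import math

def ucb_select(Q, N, t, c=2.0):
    """Pick argmax_a [ Q[a] + c * sqrt(ln t / N[a]) ].

    Actions that have never been tried (N[a] == 0) are treated as
    maximally uncertain and selected first.
    """
    best_action, best_score = None, float("-inf")
    for a in range(len(Q)):
        if N[a] == 0:
            return a                                    # untried: maximal uncertainty
        score = Q[a] + c * math.sqrt(math.log(t) / N[a])
        if score > best_score:
            best_action, best_score = a, score
    return best_action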
2.8. Gradient Bandit Algorithms
another way to select actions
so far we estimate action values (q-values) and use these to select actions
here we compute a numerical preference for each action a, which we denote Ht(a), and we select
the action with a softmax (introducing the πt(a) notation):

πt(a) ≐ e^{Ht(a)} / Σ_b e^{Ht(b)}
Algorithm based on stochastic gradient ascent: on each step, after selecting action At and
receiving reward Rt, the action preferences are updated by:

Ht+1(At) ≐ Ht(At) + α(Rt − R̄t)(1 − πt(At))

and for all a ≠ At:

Ht+1(a) ≐ Ht(a) − α(Rt − R̄t)πt(a)

where α > 0 is the step size.
The R̄t term, the average of the rewards observed so far, serves as a baseline with which the
reward is compared. If the reward is higher than the baseline, then the probability of taking At
in the future is increased, and if the reward is below the baseline, then the probability is
decreased. The non-selected actions move in the opposite direction.
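A sketch of one step of this update, with the softmax from above and the baseline passed in as the running average reward (all names are illustrative, not a fixed API):

import math

def softmax(H):
    """pi_t(a) = exp(H[a]) / sum_b exp(H[b]), computed stably."""
    m = max(H)
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit_step(H, action, reward, baseline, alpha=0.1):
    """After taking `action` and observing `reward`:
        H[At] <- H[At] + alpha * (R - baseline) * (1 - pi[At])
        H[a]  <- H[a]  - alpha * (R - baseline) * pi[a]      for a != At
    """
    pi = softmax(H)
    for a in range(len(H)):
        if a == action:
            H[a] += alpha * (reward - baseline) * (1 - pi[a])
        else:
            H[a] -= alpha * (reward - baseline) * pi[a]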
2.10. Summary
one key topic is balancing exploration and exploitation.
Exploitation is straightforward: we select the action with the highest estimated value
(we exploit our current knowledge)
Exploration is trickier, and we have seen several ways to deal with it:
o ε-greedy: occasionally choose an action at random
o UCB: favors the more uncertain actions
o gradient bandit: estimates a preference for each action instead of action values
Which one is best (evaluated on the 10-armed testbed)?