Reinforcement Learning: A Short Cut
Krishna C Podila N V
and Exploration techniques for enhanced performance.

Exploitation is a euphemism for selecting greedy actions. An action a* is chosen based on the best value that every possible action can provide:

a* = argmax_a V(a)    (2)

Always choosing the greedy action may not be the optimal path to the goal; an unexplored path might be more promising than the current best policy.

To strike a balance, the ϵ-greedy method was created. Epsilon (ϵ) is the fraction of time a random action is selected; the rest of the time a greedy action is selected. Initially ϵ can be set high and then reduced gradually over time to attain better results. During the random selection, however, all possible actions are given equal importance irrespective of their effect.

Using a Gibbs/Boltzmann temperature, we can instead select each action with a probability mapped to the estimated reward that the action can fetch, so better actions are more likely to be chosen than worse ones.
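A minimal Python sketch of these two selection rules (the value array, ϵ and temperature below are illustrative placeholders, not values from the paper):

import math
import random

def epsilon_greedy(values, epsilon=0.1):
    # With probability epsilon explore uniformly, otherwise take the greedy action.
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

def boltzmann(values, temperature=1.0):
    # Gibbs/Boltzmann selection: better-valued actions get higher probability.
    prefs = [math.exp(v / temperature) for v in values]
    total = sum(prefs)
    threshold = random.random() * total
    cumulative = 0.0
    for action, p in enumerate(prefs):
        cumulative += p
        if threshold <= cumulative:
            return action
    return len(values) - 1

With a high temperature the Boltzmann choice is nearly uniform; as the temperature is lowered it approaches the greedy choice, which mirrors the idea of starting with heavy exploration and reducing it over time.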
V^∏(s) = E_∏[ ∑_{t=0}^{∞} γ^t r_{t+1} | s_0 = s ]

In the above equation, ∏ is a policy: the rule by which actions such as a are selected in states such as s. The time step t runs from 0 to infinity, and E_∏ is the expected value obtained if policy ∏ is followed. Thus the value of a state is computed. But how good an action is can only be identified if the goodness of that action from a particular state is computed.

Bellman Equation

The goodness of a state-action pair can be computed by obtaining its value in the following way [Pen92]:

V^∏(s) = ∑_a ∏(s,a) ∑_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^∏(s') ]    (6)

Here ∑_a ∏(s,a) sums over the actions the policy may take from the current state s, s' is the state reached from s due to action a, P^a_{ss'} is the probability of transitioning from s to s' under action a, and R^a_{ss'} is the expected reward for that transition. This equation is known as the Bellman equation [Ric98].
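As a small worked example of this backup (the two successor states, probabilities and rewards are invented purely for illustration): suppose the policy always takes action a in state s, and a leads to s1 with probability 0.8 and reward 1, or to s2 with probability 0.2 and reward 0, with γ = 0.9, V^∏(s1) = 2 and V^∏(s2) = 5. Then

V^∏(s) = 0.8·(1 + 0.9·2) + 0.2·(0 + 0.9·5) = 0.8·2.8 + 0.2·4.5 = 3.14

so the state's value is a probability-weighted average of the immediate reward plus the discounted value of the successor states.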
Once this feedback mechanism is implemented and value functions can be computed, techniques for maximising the reward (in layman's terms, choosing the best possible path to the goal), such as Dynamic Programming, evolved.

Dynamic Programming (DP)

Dynamic programming sweeps every state and gets the agent the best path. Consider the diagram below.

Fig2: State & Action Representation (circles such as S1, S2 and S6 are states; edges such as a1 and a3 are actions)

In the above figure, let circles be states and edges be actions. In dynamic programming (see Fig2), all possible actions from S1 (a1, a2, a3) are checked individually. The transition from S1 to S2 is taken as a case, and the possible value of S2 is computed using actions a4 and a5. Likewise, all actions and all states are swept. The actions yielding the maximum value are then chosen and executed. An example of pseudocode for basic Dynamic Programming is shown below.

Get policy ∏
Do
    iDiff = 0
    For s = 0 to allStates
        V = V(s)
        V(s) = ∑_a ∏(s,a) ∑_{s'} P^a_{ss'} [ R^a_{ss'} + γ V(s') ]
        iDiff = max(iDiff, abs(V − V(s)))
While (iDiff > iThreshold)
Execute ∏    // based on V(s)

Fig3: Dynamic Programming pseudocode

The code above depicts a way to compute the values and follow the policy to reach the goal. A small tweak in the technique can maximise the reward:

∏'(s) = argmax_a Q^∏(s,a)    (7)

As per the above equation, the optimal reward can be obtained easily. This is known as policy improvement: the agent follows the policy, but when an action proves more valuable than the one chosen by the policy, ∏ is modified to take the better action, and ∏'(s) is the modified (improved) policy [Bus10].

The improvement discussed above is the greedy improvement method, and with it an optimal policy can be obtained using dynamic programming: by evaluating the policy (computing its values) and improving it in alternating iterations, we attain the optimal policy.

Fig4: Policy Improvement (evaluate ∏0 over Ω to obtain V^∏0, improve it to ∏1, evaluate ∏1 to obtain V^∏1, and so on)
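A compact Python sketch of this evaluate/improve loop on a small tabular MDP (the transition structure P, discount and threshold below are illustrative placeholders, not taken from the paper):

# P[s][a] is a list of (probability, next_state, reward) triples.
def policy_iteration(P, gamma=0.9, theta=1e-6):
    n_states = len(P)
    policy = [0] * n_states              # arbitrary initial deterministic policy
    V = [0.0] * n_states

    def q_value(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: sweep states until the values stop changing much.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_old = V[s]
                V[s] = q_value(s, policy[s])
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the current values.
        stable = True
        for s in range(n_states):
            best = max(range(len(P[s])), key=lambda a: q_value(s, a))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V

Each pass of the outer loop corresponds to one Evaluate/Improve step in Fig4, and the loop stops once greedy improvement no longer changes the policy.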
Dynamic Programming can prove expensive when the state space Ω is large and the actions available in every state are numerous, so a way to stop the sweep was required: the sweep is stopped when the change in a state's value is too small to have an impact on decision making. The dependence of Dynamic Programming on the MDP is a big drawback when the environment is new and there are no predefined rewards or transition probabilities.

This complete dependence on MDPs led scientists to a different method, Monte-Carlo, which does not need to know the probabilities of every action from a state.

Monte-Carlo Method (MC)

Monte-Carlo methods discover optimal policies by trying many different ways to reach the goal. They assume knowledge of neither the probability of reaching one state from the current one nor the reward obtained after a few states are passed. These methods follow sample sequences according to a policy and determine the best one over iterations, and in most cases the best policy converges to the optimal one.
Each sample sequence is called an episode. At the end of every episode, the total reward obtained is averaged over the states encountered, and all encountered states are updated with this value. The policy should be written so as to cover all possible states n number of times [Les96].

When a state is visited multiple times, it is assigned a value by either the "First Visit" or the "Every Visit" Monte-Carlo method. In First Visit MC, the weighted value (see formula (4)) of the goal from the first time a state is visited is assigned to that state. In the Every Visit method, the value is averaged over all visits to the state encountered in the process of achieving the goal.

Episode = GenerateEpisode(∏)
For x = 0 to allStates(Episode)
    aReturns(allStates(x)->StateID) += allStates(x)->Value
    aVisited(allStates(x)->StateID)++
End
For x = 0 to iTotalStates
    aReturns(x) /= aVisited(x)
End
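A brief Python sketch of the First Visit variant (the generate_episode(policy) helper, which returns a list of (state, reward) pairs, is hypothetical and stands in for whatever simulator is available):

from collections import defaultdict

def first_visit_mc(generate_episode, policy, n_episodes=1000, gamma=0.9):
    # Average the discounted return from the first visit to each state.
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = {}
    for _ in range(n_episodes):
        episode = generate_episode(policy)        # list of (state, reward) pairs
        # Discounted return from each time step onward, computed backwards.
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            returns_at[t] = G
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:                 # first visit to this state only
                seen.add(state)
                returns_sum[state] += returns_at[t]
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V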
The On-Policy method is fairly intuitive. After an episode is executed, the policy in each state is made greedy with respect to the values, except for an ϵ fraction of the time (see (8)):

∏(s,a) = 1 − ϵ + ϵ/|a(s)|  if a = a*,  and  ϵ/|a(s)|  otherwise    (8)

Here a* is the greedy choice, a(s) is the set of actions available from state s, and ϵ is the fraction of time the policy is not greedy.

The Off-Policy method follows a behaviour policy while estimating another policy based on weighted returns. More of this is explained with the Q-Learning method.

One major disadvantage of Monte-Carlo is having to wait for an episode to finish, which might take a very long time for large state spaces. This can be solved by updating earlier states with the estimate of the current state, otherwise known as Temporal Difference learning.

Temporal Difference Learning (TD)

The agent learns directly from experience: once V(S_{t+1}) is attained, V(S_t) is updated with the newly estimated value. Consider TD(0):

V(S_t) = V(S_t) + α [ r_{t+1} + γ V^∏(S_{t+1}) − V(S_t) ]    (9)
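A minimal sketch of this TD(0) update applied while an agent steps through an environment (the env.reset/env.step interface, step size α and discount γ are assumptions for illustration):

def td0_episode(env, policy, V, alpha=0.1, gamma=0.9):
    # Run one episode, applying the update of equation (9) at every step.
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        target = reward + (0.0 if done else gamma * V.get(next_state, 0.0))
        # V(S_t) <- V(S_t) + alpha * [ r_{t+1} + gamma * V(S_{t+1}) - V(S_t) ]
        V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
        state = next_state
    return V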
This off-policy technique updates the estimation policy with the maximum-value path while following the behaviour policy, which can lead to the shortest path to the goal.
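A minimal sketch of an off-policy update of this kind, in the style of tabular Q-learning (the env.actions/env.step interface and the parameter values are assumptions, not taken from the paper):

import random

def q_learning_step(Q, env, state, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Behave epsilon-greedily (behaviour policy), update towards the greedy target.
    actions = env.actions(state)
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q.get((state, a), 0.0))
    next_state, reward, done = env.step(action)
    # The estimation (target) policy is greedy, hence the max over next actions.
    best_next = 0.0 if done else max(
        Q.get((next_state, a), 0.0) for a in env.actions(next_state))
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * (
        reward + gamma * best_next - Q.get((state, action), 0.0))
    return next_state, done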
I am going to explain Function Approximation and SMDPs with the help of RoboCup soccer [Pet05].

Function Approximation
V_t(s) = γ max_a [ … ]
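As a generic illustration of function approximation (not the RoboCup formulation itself): instead of a table, the value function can be represented by a parameterised function, for example a linear combination of state features, and the TD update then adjusts the weights. The feature vectors and parameters below are placeholders.

def linear_td_update(w, features, reward, next_features, alpha=0.01, gamma=0.9, done=False):
    # TD(0) update for a linear value function V(s) = w . phi(s).
    v = sum(wi * fi for wi, fi in zip(w, features))
    v_next = 0.0 if done else sum(wi * fi for wi, fi in zip(w, next_features))
    td_error = reward + gamma * v_next - v
    # The gradient of a linear approximator is just the feature vector.
    return [wi + alpha * td_error * fi for wi, fi in zip(w, features)]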
CONCLUSION
The importance of Reinforcement Learning is indisputable, and this paper has supported that stance from the Introduction onwards. The paper depicts how Reinforcement Learning algorithms have evolved gradually, showing primitive RL techniques and their development into complex, real-time algorithms. As a survey, it highlights the pitfalls of each method and how later methods were derived to solve them; some issues that were fatal to RL are shown along with the ways in which they were handled. In conclusion, this paper covers only a part of what RL algorithms have been, how far they have come, the problems faced and solved, and, implicitly, why RL is the future of Artificial Intelligence.
Bibliography
Busoniu, L., Babuska, R., Schutter, B. D., & Ernst, D. (2010). Reinforcement Learning and Dynamic
Programming using Function Approximators. Taylor & Francis CRC Press.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of
Artificial Intelligence Research, 237-285.
Monahan, G. E. (1982). A survey of partially observable Markov Decision Processes: Theory, models and algorithms. Management Science.
Roy, N., & Gordon, G. (2005). Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT
Press.
Syafiie, S., & Tadeo, F. (2004). Softmax and ε-greedy policies applied to process control. IFAC Workshop on Adaptive and Learning Systems.