
Evolution of Reinforcement Learning Algorithms

Krishna C Podila N V

ABSTRACT

This paper shows why agents need to learn and why reinforcement learning (RL) is a good way for them to do so. It outlines the issues posed to reinforcement learning and how they were tackled. The main part of the paper shows how reinforcement learning algorithms have evolved over time, and it discusses how a few threats to traditional RL algorithms were decomposed and handled.

INTRODUCTION

Why learning?

Robots have traditionally performed predefined tasks. The major limitations of this approach are that every new task needs a new program and that the system is not fail-safe against disturbances. Robots need to learn by themselves to overcome these problems.

What is learning?

Firstly, supervised learning:

• Needs training data similar to the task.
• Possible variations have to be trained explicitly.

Now, unsupervised learning:

• Needs only to be told what the goal is.
• Learns its own way towards the goal.
• Adapts to changes (compliance).
• Needs little, if any, human interaction.

Reinforcement learning is unsupervised.

Reinforcement Learning

RL lets agents learn from their own experience. Every movement gets the agent a reward. Generally these rewards are scalar (e.g. -1, 0, 1). Mistakes punish the agent with a negative reward, while reaching goals earns a positive one.

State, action, reward and policy are the terms we encounter most often:

• State: the situation of the environment with respect to the agent.
• Action: the immediate step taken by the agent.
• Reward: punishment or accolade for an action.
• Policy: the plan the agent follows.

    State: Agent in place 29
    Agent: Take left, move forward for 4 places
    State: Hit the wall at 26, reward -1
    Agent: Take left, move 6 spaces
    State: Subgoal 3 reached at 86, reward +5
    .........

Fig 1: State, action, reward (Kaelbling et al., 1996)

The states and actions described above are interlinked. Every state has a set of actions, and every action leads the agent to a new state. These links are part of Markov decision processes.

Markov Decision Processes (MDP)

The algorithms we deal with in the rest of the paper assume we know the state space either completely or partially. The effect of an action is known either with certainty or with a probability.

An MDP (Markov decision process) is a way to represent how states are connected by actions. An MDP incorporates rewards for states and transition probabilities for actions (Sutton & Barto, 1998).

We will also see how problems in RL algorithms have influenced the evolution of MDPs. POMDPs (partially observable MDPs), SMDPs (semi-MDPs) and similar variants were formed to overcome issues left unhandled by traditional MDPs (White, 1993).
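To make this concrete, a minimal sketch of how an MDP's transition probabilities and rewards could be stored is shown below; the states, actions and numbers are invented for illustration and are not taken from the paper.

    # A minimal, hypothetical MDP: P[s][a] is a list of (next_state, probability, reward)
    # triples. The state names "s0".."s2" and the action names are illustrative only.
    P = {
        "s0": {"left":  [("s1", 1.0, 0.0)],
               "right": [("s2", 0.8, 1.0), ("s0", 0.2, -1.0)]},
        "s1": {"left":  [("s0", 1.0, 0.0)],
               "right": [("s2", 1.0, 5.0)]},
        "s2": {},  # terminal state: no actions
    }

    def successors(state, action):
        """Return the possible (next_state, probability, reward) outcomes."""
        return P[state][action]

    print(successors("s0", "right"))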
DISCUSSION

In this section we look at how RL is accomplished in general. The algorithms developed to achieve reinforcement learning are the main point of discussion. We will see how these algorithms have evolved from the very basic idea of "evaluative feedback".

Evaluative feedback works by taking the results of the action taken into consideration. Every possible action from a specific state is given a scalar value relative to the other actions. The system learns the best possible sequence of actions to reach the ultimate goal by learning the best action from each state. This method can be termed trial and error, and it is refined using exploitation and exploration techniques for better performance.
Exploitation is acting by greed, while exploration is finding all possible ways. Exploitation is always used to maximise the reward (of course, only over known actions). Every action taken during exploitation aims to achieve the best possible reward from the next state. Exploration, on the other hand, focuses on finding other ways to achieve the goal, which may or may not be profitable. For an agent to decide what action to take, it needs the promising values an action can fetch.

State Value

Now let us see how the values are computed. Assume that the agent is in state S1. When an action a1 is chosen to be executed from S1, the value V of action a1 is the reward r1 attained from a1, added to the rewards r2, r3, ..., rx obtained by the actions taken from the subsequently attained states, divided by the total number of actions taken (Sutton & Barto, 1998):

    V(a1) = (r1 + r2 + r3 + ... + rx) / x    (1)

Exploration and Exploitation

Exploitation is a euphemism for selecting greedy actions. An action a* is chosen as the one with the best value among all possible actions:

    a* = argmax_a V(a)    (2)

Always choosing the greedy action might not be the optimal path to the goal. An unexplored path might be more promising than the current best policy.

To strike a balance, the ϵ-greedy method was created. Epsilon (ϵ) is the fraction of the time a random action is selected; the rest of the time a greedy action is selected. Initially ϵ can be set high and then reduced gradually over time to attain better results. But during the random selection, all possible actions are given equal importance irrespective of their effect.

Using a Gibbs/Boltzmann temperature, we can instead select each action with a probability. Here the probability is mapped to the estimated reward an action can fetch, so better actions are more likely to be chosen than worse ones:

    P(a) = exp(V(a) / τ) / Σ_x exp(V(x) / τ)    (3)

In the above equation, a is the action in question and x ranges over all the actions. This method is called softmax (Syafiie et al., 2004).
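A minimal sketch of these two selection rules follows; the value table, ϵ and temperature are invented for illustration.

    import math
    import random

    values = {"left": 0.2, "right": 1.0, "forward": 0.5}   # hypothetical action values

    def epsilon_greedy(values, epsilon=0.1):
        # With probability epsilon explore uniformly, otherwise exploit the best-valued action.
        if random.random() < epsilon:
            return random.choice(list(values))
        return max(values, key=values.get)

    def softmax(values, tau=0.5):
        # Selection probability is proportional to exp(V(a)/tau), as in equation (3).
        weights = {a: math.exp(v / tau) for a, v in values.items()}
        total = sum(weights.values())
        r, acc = random.random() * total, 0.0
        for action, w in weights.items():
            acc += w
            if r <= acc:
                return action
        return action  # fallback for floating-point edge cases

    print(epsilon_greedy(values), softmax(values))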

Weighted Averaging

While we are looking to maximise the reward we obtain by taking an action, we should also consider how far into the future we should look. It is possible to lose sight of near and short-term goals. To tackle this problem, we can assign weights to the rewards based on the time at which they occur. We can keep a discount factor γ below 1 and use it to weight each reward; in the equation below, the first reward has higher precedence than the next, and so on:

    R = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ...    (4)

So the value of a state can be rewritten as

    V^π(s) = E_π { Σ_{x=0}^∞ γ^x r_{t+x+1} | s_t = s }    (5)

In the above equation, π is the policy followed, from which actions like a are selected in states like s. The index x runs from 0 to infinity, and E_π is the expected value if policy π is followed. Thus the value of a state is computed. But how good an action is can be identified only if the goodness of that action from a particular state is computed.
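As a small worked example of equation (4), take γ = 0.9 and the invented reward sequence 1, 0, 0, 5: the discounted return is 1 + 0.9*0 + 0.81*0 + 0.729*5 = 4.645, so the later reward still counts, but less than an immediate one would. In code:

    # Discounted return of equation (4) for an illustrative reward sequence.
    gamma = 0.9
    rewards = [1.0, 0.0, 0.0, 5.0]             # hypothetical rewards r_{t+1}, r_{t+2}, ...
    R = sum(gamma**k * r for k, r in enumerate(rewards))
    print(R)                                   # 1 + 0.729 * 5 = 4.645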
Bellman Equation

The goodness of a state-action pair can be computed by obtaining its value in the following way (Peng, 1992):

    V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]    (6)

Here π(s, a) is the probability of taking action a from the current state s under policy π, s' is the state reached from s due to action a, P^a_{ss'} is the probability of the transition from s to s' under a, and R^a_{ss'} is the expected reward for that transition. This equation is also known as the Bellman equation (Sutton & Barto, 1998).


Once the feedback method is implemented and the computation of value functions is in place, techniques for maximising the reward (in layman's terms, choosing the best possible path to the goal), such as dynamic programming, evolved.

Dynamic Programming (DP)

Dynamic programming sweeps through every state and gets the agent the best path. Consider the diagram below.

Fig 2: State and action representation (circles such as S1, S2 and S6 are states; edges such as a1 and a3 are actions)

In the figure, the circles are states and the edges are actions. In dynamic programming (see Fig 2), all possible actions from S1 (a1, a2, a3) are checked individually. The transition from S1 to S2 is taken as one case, and the possible value of S2 is computed using its own actions (a4 and a5). Likewise, all actions and all states are swept. The actions yielding the maximum value are then chosen and executed. Pseudocode for basic dynamic programming (iterative policy evaluation) is shown below.

    Get policy π
    Do
        iDiff = 0
        For s = 0 to allStates
            v = V(s)                                              // remember the old value
            V(s) = Σ_a π(s,a) Σ_s' P^a_{ss'} [R^a_{ss'} + γ V(s')]  // Bellman backup, equation (6)
            iDiff = max(iDiff, abs(v - V(s)))                     // largest change in this sweep
    While (iDiff > iThreshold)
    Execute π                                                     // act based on V(s)

Fig 3: Dynamic programming pseudocode

The code above depicts a way to compute the values and follow the policy to reach the goal. A little tweak in this technique can maximise the reward:

    π'(s) = argmax_a Q^π(s, a)    (7)

As per the above equation, the best available action can be obtained easily. This is known as policy improvement. The agent follows the policy, but when an action proves more valuable than the one chosen by the policy, π is modified to use the better action instead; π' is the modified, improved policy (Busoniu et al., 2010).

The improvement discussed above is the greedy improvement method. It is possible to obtain the optimal policy with this method using dynamic programming. Thus, by evaluating the policy (computing its values) and improving it in alternating iterations, we can attain the optimal policy.

Fig 4: Policy improvement: π0 over the state space Ω is evaluated to give V^π0, improved to π1, evaluated to give V^π1, and so on.
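To make the evaluate/improve loop of Fig 3 and Fig 4 concrete, here is a small self-contained policy-iteration sketch; the tiny MDP, discount and threshold are invented, so this illustrates the idea rather than any particular implementation.

    # Policy iteration on a tiny, made-up MDP.
    # P[s][a] lists (next_state, probability, reward) outcomes, as in equation (6).
    GAMMA, THRESHOLD = 0.9, 1e-6
    P = {
        "A": {"go": [("B", 1.0, 0.0)], "stay": [("A", 1.0, -1.0)]},
        "B": {"go": [("goal", 1.0, 5.0)], "stay": [("B", 1.0, -1.0)]},
        "goal": {},
    }

    def evaluate(policy, V):
        # Sweep until the largest change in V is below the threshold (Fig 3).
        while True:
            diff = 0.0
            for s, actions in P.items():
                if not actions:
                    continue
                old = V[s]
                V[s] = sum(p * (r + GAMMA * V[s2])
                           for s2, p, r in actions[policy[s]])
                diff = max(diff, abs(old - V[s]))
            if diff < THRESHOLD:
                return V

    def improve(V):
        # Greedy improvement, equation (7).
        return {s: max(actions, key=lambda a: sum(p * (r + GAMMA * V[s2])
                                                  for s2, p, r in actions[a]))
                for s, actions in P.items() if actions}

    policy = {"A": "stay", "B": "stay"}
    V = {s: 0.0 for s in P}
    for _ in range(10):                    # evaluate/improve until the policy is stable
        V = evaluate(policy, V)
        new_policy = improve(V)
        if new_policy == policy:
            break
        policy = new_policy
    print(policy, V)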
Dynamic programming can prove expensive when the state space Ω is large and the actions for every state are numerous, so a way to stop the sweep was required. The sweep is stopped when the change in a state's value is too small to affect decision making. The dependence of dynamic programming on the MDP is a huge drawback where the environment is new and there are no predefined rewards or transition probabilities.

This complete dependence on MDPs led researchers to a different method, called Monte-Carlo. This method does not need to know the probabilities of every action from a state.

Monte-Carlo Method (MC)

Monte-Carlo methods discover optimal policies by trying many different ways to reach the goal. They assume knowledge neither of the probability of reaching a state from the current one nor of the reward obtained a few states later. These methods follow sample sequences according to the policy and determine the best one over many iterations; in most cases the best policy found converges to the optimal one.

Each sample sequence is called an episode. At the end of every episode, the total reward obtained is averaged over the states encountered, and all encountered states are updated with this value. The policy should be written so that it covers every possible state a number of times (Kaelbling et al., 1996).


When a state is visited multiple times, it is assigned a value by either the "First Visit" or the "Every Visit" Monte-Carlo method. In First Visit MC, the discounted return (see equation (4)) from the first time a state is visited is assigned to that state. In the Every Visit method, the value is averaged over all the visits encountered on the way to the goal.

    Episode = GenerateEpisode(π)
    For x = 0 to allStates(Episode)                      // accumulate returns per state
        aReturns(allStates(x)->StateID) += allStates(x)->Value
        aVisited(allStates(x)->StateID)++
    End
    For x = 0 to iTotalStates                            // average over the number of visits
        aReturns(x) /= aVisited(x)
    End

Fig 5: Every-Visit Monte-Carlo pseudocode
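To complement Fig 5, a first-visit Monte-Carlo evaluation sketch is given below; it assumes episodes are already collected as lists of (state, reward) pairs, and the episodes shown are invented.

    from collections import defaultdict

    GAMMA = 0.9

    def first_visit_mc(episodes):
        """Average, for each state, the return following its first visit in each episode."""
        returns, visits = defaultdict(float), defaultdict(int)
        for episode in episodes:                      # episode = [(state, reward), ...]
            seen = set()
            for t, (state, _) in enumerate(episode):
                if state in seen:
                    continue                          # first-visit: skip later occurrences
                seen.add(state)
                G = sum(GAMMA**k * r for k, (_, r) in enumerate(episode[t:]))
                returns[state] += G
                visits[state] += 1
        return {s: returns[s] / visits[s] for s in returns}

    # Hypothetical episodes: two short trajectories ending near a goal reward.
    episodes = [[("s0", 0.0), ("s1", 0.0), ("s2", 5.0)],
                [("s0", 0.0), ("s2", 5.0)]]
    print(first_visit_mc(episodes))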


The greatest advantage is that the computational complexity of dynamic programming's recursive sweeps is completely avoided in Monte-Carlo. Transitions are made only according to the policy; learning is done through samples, updating the visited states and giving us the values of all states. Once it converges, it is easy to obtain the optimal path.

The disadvantage of this method is that improving the policy needs thousands of iterations to be completed. This was eased using the Exploring Starts algorithm.

Exploring Starts is a method where episode generation is followed by its execution, then its evaluation, and then a change of policy based on the results. This speeds up policy improvement. Even though the improvement looks promising, sampling forever or getting to know all states is almost impossible.

Exploration is the backbone of the Monte-Carlo method, as it handles unknown environments. Two method families were created: "on-policy" (where ϵ-greedy is used) and "off-policy" (where one policy is used for exploration while another tries to determine the optimal way to achieve the goal).

The on-policy method is fairly intuitive. After an episode is executed, each state's policy is made greedy with respect to the current values, except for an ϵ share of the time (see equation (8)):

    π(s, a) = 1 - ϵ + ϵ/|A(s)|  if a = a*,  and  π(s, a) = ϵ/|A(s)|  otherwise    (8)

Here a* is the greedy choice, A(s) is the set of actions available from state s, and ϵ is the fraction of the time the policy is not greedy.

The off-policy method follows a behaviour policy and an estimation policy, weighting the returns accordingly. More of this is explained with the Q-learning method.

One major disadvantage of Monte-Carlo is waiting for an episode to finish, which might take an infinitely long time in large state spaces. This is solved by updating states with the estimate of the current state instead, otherwise known as temporal difference learning.

Temporal Difference Learning (TD)

Here the agent learns from experience directly. Once V(S_{t+1}) is available, V(S_t) is updated with the newly estimated value. Consider TD(0):

    V(S_t) = V(S_t) + α [ r_{t+1} + γ V(S_{t+1}) - V(S_t) ]    (9)

Here V(S_t) on the right is the previously estimated value of the state. It is adjusted using the estimated value of the next state and the exact reward obtained on reaching it, which gives a more exact value. The update is moderated by a constant step size α.
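A minimal sketch of the TD(0) update of equation (9); the states, rewards, step size and discount are invented.

    # One TD(0) update per observed transition (state, reward, next_state).
    ALPHA, GAMMA = 0.1, 0.9
    V = {"s0": 0.0, "s1": 0.0, "goal": 0.0}

    def td0_update(V, s, r, s_next):
        # Equation (9): move V(s) toward the bootstrapped target r + gamma * V(s').
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

    # Hypothetical experience stream.
    for s, r, s_next in [("s0", 0.0, "s1"), ("s1", 5.0, "goal"), ("s0", 0.0, "s1")]:
        td0_update(V, s, r, s_next)
    print(V)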
By modifying values on the fly (bootstrapping), TD is similar to DP. TD does not sweep all possible actions and states, but follows a sample path like MC. Thus TD is a hybrid of DP and MC, yet it converges faster thanks to bootstrapping and is not as computationally complex as DP.

Because of the advantages discussed above, on-policy and off-policy TD techniques were developed for use in different situations. The on-policy method is called SARSA, while the off-policy method is Q-learning (Sutton & Barto, 1998).

SARSA

State-action pair values are computed to obtain a greedy behaviour policy; for learning, an ϵ-greedy method can be used. The method is called SARSA because a reward (R) is obtained for moving from one state-action (SA) pair to the next.
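A sketch of the SARSA update; the Q-table, constants and the single step of experience are invented. The point is that the next action a' used in the target is the one actually chosen by the ϵ-greedy behaviour policy.

    import random

    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
    ACTIONS = ["left", "right"]
    Q = {}                                             # Q[(state, action)] -> value, default 0

    def q(s, a):
        return Q.get((s, a), 0.0)

    def epsilon_greedy(s):
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: q(s, a))

    def sarsa_update(s, a, r, s_next, a_next):
        # On-policy target: r + gamma * Q(s', a') where a' was actually selected.
        Q[(s, a)] = q(s, a) + ALPHA * (r + GAMMA * q(s_next, a_next) - q(s, a))

    # One hypothetical step of experience.
    s, a = "s0", epsilon_greedy("s0")
    r, s_next = 1.0, "s1"                              # assumed environment response
    a_next = epsilon_greedy(s_next)
    sarsa_update(s, a, r, s_next, a_next)
    print(Q)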
Q-Learning


This off-policy technique updates the estimation policy toward the maximum-value path while following the behaviour policy. This can lead to the shortest path to the goal.

    While(1)
        s = StartState()
        While(s != Goal)
            a = BehaviourPolicy(Q, s)                   // ϵ-greedy behaviour policy
            r, s' = Execute(a)
            Q(s, a) = Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ]
            s = s'

Fig 6: Q-learning pseudocode
As the value of a state increases, the possibility of that state being selected next time increases. But this has pitfalls if the nearby states are disastrous.

Q-learning can be used where the shortest path is desired and an occasional negative effect on the system does not have a huge impact. SARSA is the safest way to learn, but it might not be optimal all the time.

All the methods discussed above are feasible with a limited number of states. In the real world, where robots have to interact, the state space can be enormous, and states and actions may need to handle continuous time. To solve this, RL has evolved to tackle states in a different way, which can be seen in function approximation.

STATE SPACES

The MDP is the only style of state space we have discussed in this paper, but state spaces are a lot more complicated. One possibility is that the working area of a robot is too large and the traditional state space does not work, for which function approximation is needed. Sometimes an agent might not get a full view of the MDP, for which partially observable MDPs (POMDPs) have to be handled. There are also situations where performing a single step and re-planning is too costly, which demands the use of semi-MDPs (SMDPs). These are a few views of how efficient use of RL demands evolution in the structures used to handle problems.

I am going to explain function approximation and SMDPs with the help of RoboCup soccer (Stone et al., 2005).

Function Approximation

Function approximation works by estimating a model of the state space: the huge state space is reduced to a manageable set of states (or features) using generalisation techniques (Kostiadis & Hu, n.d.). Once the model is constructed, reinforcement learning algorithms like Q-learning and SARSA are applied on top of it.

Let us see how function approximation is put to use in RoboCup soccer. There is a 30 m x 30 m region, and it is difficult for agents to treat that huge arena as a discrete state space. So the state space is generated using a model: people use the distance between the agent and the ball, the distances between opposing agents and the ball, the angle between two opposing agents with respect to the ball, possession of the ball and so on as state variables. Thus every agent has only a few states to think about.
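As an illustration of the idea (a sketch only, not the keepaway implementation of Stone et al.; the feature names and weights are invented), an action value can be approximated as a weighted sum of such features instead of one table entry per discrete state.

    # Linear function approximation: Q(s, a) ~ w_a . features(s).
    # The features below (distances, angle, possession) are illustrative stand-ins
    # for the kind of state variables described above for RoboCup soccer.
    features = {"dist_to_ball": 3.2, "dist_opponent_to_ball": 5.0,
                "angle_between_opponents": 0.7, "has_possession": 1.0}

    weights = {                                  # one weight vector per action (hypothetical)
        "hold": {"dist_to_ball": -0.5, "dist_opponent_to_ball": 0.3,
                 "angle_between_opponents": 0.1, "has_possession": 2.0},
        "pass": {"dist_to_ball": -0.2, "dist_opponent_to_ball": 0.8,
                 "angle_between_opponents": 0.9, "has_possession": 1.0},
    }

    def q_value(action, feats):
        """Approximate Q(s, a) as a dot product of weights and features."""
        return sum(weights[action][f] * v for f, v in feats.items())

    best = max(weights, key=lambda a: q_value(a, features))
    print(best, q_value(best, features))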
Now that the state space is conveniently mapped, an agent has to act based on it. But since these states are functions of distances and angles rather than adjoining locations or grid cells, the traditional way of moving one step ahead will not work. These kinds of problems demanded the generation of actions that are internally composed of sets of actions, otherwise known as SMDP actions.

Semi-Markov Decision Process (SMDP)

An SMDP lets the agent tackle the modelling of changes in the system in real time (Rasanen, 2006). It lets an agent change its plan every time the system changes its state. The value of the state reached by an SMDP action depends on time; it can be estimated from the probability that the agent reaches its next decision within a particular time and the probability that the final goal is achieved within a certain time. An SMDP action can also be seen as a macro.

An SMDP action can be broken down into multiple actions done over some time units, for example a robot stealing the ball from an opponent (Stone et al., 2005).

Such an SMDP action might be called StealBall from location X. That action can be split into the agent turning to the proper angle to get to the ball, travelling to the probable location of the ball, adjusting its position to the ball, and moving against the direction of the opposing agent.
If the agent discovers that the SMDP action cannot possibly be completed, the old action can be terminated.
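A rough sketch of such a macro action; the sub-action names and the feasibility test are invented for illustration and are not from the RoboCup work.

    # A macro (SMDP-style) action as a sequence of primitive steps plus a termination test.
    def steal_ball_from(x):
        """Hypothetical macro action: returns the primitive steps it is composed of."""
        return ["turn_towards_ball", f"travel_to({x})", "adjust_to_ball", "block_opponent"]

    def run_macro(steps, still_feasible):
        """Execute primitive steps one by one; abandon the macro if it becomes infeasible."""
        for step in steps:
            if not still_feasible():
                return "terminated early"
            print("executing", step)          # stand-in for sending the command to the robot
        return "completed"

    # Example: feasibility flips to False after two steps, so the macro is cut short.
    feasible_checks = iter([True, True, False, False])
    print(run_macro(steal_ball_from("X"), lambda: next(feasible_checks)))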
But this assumes that the field is fully observable. There are situations where not all states are observable to the robot, and POMDPs are designed to handle such situations.

Partially Observable Markov Decision Process (POMDP)

As a POMDP is a partially observable MDP, the agent works on it with beliefs, i.e. probabilities or confidences over states (Monahan, 1982). These are generally maintained as a posterior distribution over the states, and the value function is defined over beliefs:

    V_t(b) = max_a [ R(b, a) + γ Σ_o P(o | b, a) V_{t-1}(b') ]

Here the belief b is the probability that the agent is in a particular state, given that it has taken action a from its past state s (Roy et al., 2005).
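A small sketch of how such a belief can be maintained with a generic Bayes-filter update; the two states, transition table and sensor model are invented, and this is an illustration rather than anything from the cited POMDP papers.

    # Belief update for a two-state toy POMDP: b'(s') ∝ O(o | s') * Σ_s T(s' | s, a) * b(s).
    STATES = ["near_wall", "open_space"]
    T = {("near_wall", "forward"): {"near_wall": 0.3, "open_space": 0.7},   # hypothetical dynamics
         ("open_space", "forward"): {"near_wall": 0.2, "open_space": 0.8}}
    O = {"bump": {"near_wall": 0.9, "open_space": 0.1},                     # hypothetical sensor model
         "clear": {"near_wall": 0.1, "open_space": 0.9}}

    def update_belief(belief, action, observation):
        new_belief = {}
        for s_next in STATES:
            predicted = sum(T[(s, action)][s_next] * belief[s] for s in STATES)
            new_belief[s_next] = O[observation][s_next] * predicted
        total = sum(new_belief.values())
        return {s: p / total for s, p in new_belief.items()}    # normalise to a distribution

    belief = {"near_wall": 0.5, "open_space": 0.5}
    print(update_belief(belief, "forward", "bump"))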

CONCLUSION
The importance of reinforcement learning is indisputable, and this paper has supported that stance from the introduction onwards. The paper depicts how reinforcement learning algorithms have evolved gradually, showing primitive RL techniques and their growth into complex, real-time algorithms. It is a survey that highlights the pitfalls of each method and how further methods were derived to solve them. Some issues that were fatal to RL were shown, and the way in which they were handled was explained. In conclusion, this paper covers only a part of what RL algorithms have been, how far they have come, the problems faced, the problems solved and, implicitly, how RL is the future of artificial intelligence.

Bibliography
Busoniu, L., Babuska, R., Schutter, B. D., & Ernst, D. (2010). Reinforcement Learning and Dynamic
Programming using Function Approximators. Taylor & Francis CRC Press.

Conn, K. G. (2005). Supervised-Reinforcement Learning For A Mobile Robot In A Real-World. Nashville, Tennessee: Vanderbilt University.

Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of
Artificial Intelligence Research, 237-285.

Kostiadis, K., & Hu, H. (n.d.). KaBaGe-RL: Kanerva-based Generalisation and Reinforcement Learning for Possession Football.

Monahan, G. E. (1982). A Survey of Partially Observable Markov Decision Processes: Theory, Models and Algorithms. Management Science.

Roy, N., Gordon, G., & Thrun, S. (2005). Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research, 23, 1-40.

Stone, P., Sutton, R. S., & Kuhlmann, G. (2005). Reinforcement Learning for RoboCup-Soccer Keepaway. Adaptive Behavior, 13(3).

Rasanen, O. (2006). Semi-Markov Decision Processes. Seminar on MDP.

Peng, S. (1992). A generalized dynamic programming principle and Hamilton-Jacobi-Bellman equation. Stochastics: An International Journal of Probability.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT
Press.

Syafiie, S., & Tadeo, F. (2004). Softmax and ε-greedy policies applied to process control. IFAC Workshop on Adaptive and Learning Systems.

White, D. J. (1993). Markov Decision Processes. Manchester: Wiley.
