5.4-Reinforcement Learning-Part2-Learning-Algorithms
A Passive Agent executes a fixed policy and evaluates it. The agent
simply watches the world going by and tries to learn the utilities of
being in various states.
Active learning
In the Planning DP case an agent could conduct off-policy planning, that is,
formulate a policy without needing to interact with the world.
In off-policy methods, the agent's policy used to choose its actions is called the behavior policy, which may be unrelated to the policy that is evaluated and improved, called the estimation policy.
Exploitation actions: prefer past actions that have been found to be effective at producing reward, thereby exploiting what is already known. The Planning DP case lies entirely in this category.
Exploration actions: prefer trying untested actions, to discover new and potentially more reward-producing actions. Exploration is typical of the true reinforcement learning cases.
Managing the trade-off between exploration and exploitation in its policies is a critical issue in
RL algorithms.
• One guideline could be to explore more when knowledge is weak and exploit more when
we have gained more knowledge.
• One method: ε-greedy. With probability 1−ε the agent exploits, i.e. chooses the action currently believed to be best, and with probability ε it explores, i.e. chooses an action at random. ε is a hyper-parameter controlling the balance between exploitation and exploration; a minimal sketch is given below.
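A minimal sketch of ε-greedy action selection in Python (the dictionary Q mapping (state, action) pairs to value estimates, and the list of available actions, are assumptions of this sketch):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Epsilon-greedy action selection.
    With probability epsilon: explore (pick a random action).
    Otherwise: exploit (pick the action with the highest current Q-value)."""
    if random.random() < epsilon:
        return random.choice(actions)                           # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploitation
```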
Model-free versus Model-based Reinforcement Learning
When a complete model of the environment is NOT available to the agent, i.e. the transition model T(s'|s,a) and/or the reward function R(s'|s,a) are unknown, RL offers two different approaches.
Model-based RL algorithms.
One approach is to try to learn an adequate model of the environment and then fall back on the planning techniques used when a complete model is available in order to find a policy.
That is, if the agent is currently in state s, takes action a, and then observes the environment transition to state s' with reward r, that observation can be used to improve its estimates of T(s'|s,a) and R(s'|s,a) through supervised learning techniques.
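A minimal sketch of such model learning by counting (all names here are assumptions of the sketch): T(s'|s,a) is estimated as the relative frequency of observed transitions and R(s'|s,a) as the mean observed reward.

```python
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
reward_sums = defaultdict(float)                           # (s, a, s') -> summed rewards
reward_counts = defaultdict(int)                           # (s, a, s') -> number of samples

def record(s, a, s_next, r):
    """Update the empirical model after observing (s, a) -> s_next with reward r."""
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a, s_next)] += r
    reward_counts[(s, a, s_next)] += 1

def T_hat(s, a, s_next):
    """Empirical estimate of T(s'|s,a): relative frequency of s_next after (s, a)."""
    total = sum(transition_counts[(s, a)].values())
    return transition_counts[(s, a)][s_next] / total if total else 0.0

def R_hat(s, a, s_next):
    """Empirical estimate of R(s'|s,a): mean reward observed for (s, a, s')."""
    n = reward_counts[(s, a, s_next)]
    return reward_sums[(s, a, s_next)] / n if n else 0.0
```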
Model-free algorithms.
However, a model of the environment is not necessary for finding a good policy.
One of the classic examples is Q-learning, which directly estimates the optimal so-called Q-values of each action in each state (closely related to the utility of taking that action in that state), from which a policy may be derived by choosing the action with the highest Q-value in the current state.
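A minimal sketch of the Q-learning update rule as described above (the learning rate alpha, discount gamma and the defaultdict representation of Q are assumptions of this sketch):

```python
from collections import defaultdict

Q = defaultdict(float)   # (state, action) -> estimated Q-value
alpha, gamma = 0.1, 0.9  # learning rate and discount factor (assumed values)

def q_update(s, a, r, s_next, actions):
    """One Q-learning update after observing the transition (s, a, r, s_next):
    move Q(s, a) toward the target r + gamma * max_a' Q(s_next, a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

A greedy policy is then derived by picking, in the current state, the action with the highest Q-value.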
Solving a Reinforcement Learning Problem
Model-based approaches
Model-free approaches
The so-called 'reward-to-go' of a state s is the sum of the (discounted) rewards from that state until a terminal state is reached. The estimated value of the state is based on the observed rewards-to-go.
Direct Utility Estimation keeps a running average of the observed rewards-to-go for each state s:
Value(s) = average of the rewards-to-go of s over all episodes e in the sample
As the number of trials goes to infinity, the sample average converges to the true utility for state s.
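A minimal sketch of Direct Utility Estimation under these definitions (representing an episode as a list of (state, reward) pairs is an assumption of this sketch):

```python
from collections import defaultdict

returns = defaultdict(list)   # state -> list of observed rewards-to-go
V = {}                        # state -> current value estimate (running average)

def direct_utility_update(episode, gamma=1.0):
    """episode: list of (state, reward) pairs in visiting order.
    Record the discounted reward-to-go of every visit and re-average."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G            # reward-to-go from this state
        returns[state].append(G)
        V[state] = sum(returns[state]) / len(returns[state])
```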
The strategy in ADP (Adaptive Dynamic Programming) is to first complete the partially known MDP model and then treat it as complete, applying the Dynamic Programming technique as in the planning case with complete knowledge.
V(s) := ∑_{s'} T(s'|s,a) * ( R(s'|s,a) + γ * V(s') )
(The transition model T and the reward function R are the components to be learned.)
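A minimal sketch of one such Dynamic Programming sweep over the estimated model (it re-uses the hypothetical T_hat and R_hat from the model-learning sketch above; states, policy and gamma are assumed inputs):

```python
def adp_sweep(V, states, policy, gamma=0.9):
    """One Bellman-update sweep for a fixed policy, treating the learned
    model estimates T_hat and R_hat as if they were the true model.
    V: dict mapping each state to its current value estimate."""
    for s in states:
        a = policy[s]
        V[s] = sum(T_hat(s, a, s2) * (R_hat(s, a, s2) + gamma * V[s2])
                   for s2 in states)
    return V
```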
Monte Carlo (MC) simulation applies Monte Carlo methods, i.e. estimation from repeated random sampling, to the learning problem.
MC takes the mean return of each state over the sampled episodes. The value estimate of a state is set to the iterative mean of all empirical returns (not expected returns) observed for that state.
First-visit MC: average returns only for the first time s is visited in an episode.
Every-visit MC: average returns for every time s is visited in an episode.
Algorithm for first-visit Monte Carlo
1. Initialize the policy and the state-value function arbitrarily; set Returns(s) ← empty list for every state s.
2. Generate an episode E by following the policy.
3. For each state s in E
Begin
4. If this is the first occurrence of s in E, add the return received from s onwards to Returns(s).
5. Calculate the iterative mean (average) over all returns in Returns(s).
6. Set the value of s to that computed average.
End
Repeat steps 2-6 until convergence.
It is convenient to convert the mean return into an incremental update, so that the mean can be updated after each episode and the progress made with each episode can be followed.
V(S_t) is incrementally updated for each visited state S_t using its return G_t:
V(S_t) ← V(S_t) + (1/N(S_t)) * (G_t − V(S_t))
where N(S_t) is the number of (first) visits to S_t so far.
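A minimal sketch of first-visit MC with this incremental update (the (state, reward) episode representation and the counter N are assumptions of this sketch):

```python
from collections import defaultdict

V = defaultdict(float)   # state -> value estimate
N = defaultdict(int)     # state -> number of first visits counted so far

def first_visit_mc_update(episode, gamma=1.0):
    """episode: list of (state, reward) pairs.
    Apply V(S_t) <- V(S_t) + (1/N(S_t)) * (G_t - V(S_t)) at each first visit."""
    # Compute the return G_t following every time step.
    G, returns_in_episode = 0.0, []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        returns_in_episode.append((state, G))
    returns_in_episode.reverse()
    # Update each state only at its first visit in the episode.
    seen = set()
    for state, G in returns_in_episode:
        if state not in seen:
            seen.add(state)
            N[state] += 1
            V[state] += (G - V[state]) / N[state]
```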
Example of Monte Carlo Simulation
An undiscounted Markov Reward Process with two states A and B
Transition matrix and reward function are unknown
Two sample episodes, E1 and E2, are given, and the values of states A and B are then computed with both first-visit MC and every-visit MC.
Temporal difference (TD) learning is a class of model-free reinforcement learning methods which learn
by bootstrapping from the current estimate of the value function.
TD algorithms:
- sample from the environment, like Monte Carlo simulations and
- perform updates based on current estimates like adaptive dynamic programming methods.
While Monte Carlo methods only adjust their estimates once the final outcome is known, TD methods adjust predictions before the final outcome is known. Typically, TD algorithms adjust the estimated utility value of the current state s based on its immediate reward and the estimated value of the next state s'. The term 'temporal' is motivated by the temporal relation between states s and s'.
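A minimal sketch of the TD(0) update this describes (the learning rate alpha and discount gamma are assumed hyper-parameter values; V is a dictionary of value estimates):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update: move the estimate for s toward the bootstrapped
    target r + gamma * V(s'), before the final outcome of the episode is known."""
    v_s, v_next = V.get(s, 0.0), V.get(s_next, 0.0)
    V[s] = v_s + alpha * (r + gamma * v_next - v_s)
    return V
```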