ML Mod 6
4. The goal is to find a suitable action model that increases the total reward of the agent.
EXAMPLE:
2. The agent is supposed to find the best possible path to reach the reward.
[Figure 8.1: Example of Reinforcement Learning - showing a grid with robot,
diamond and fire]
4. The goal of the robot is to get the reward.
7. After learning, the robot chooses a path that gives it the reward with the fewest hurdles.
10. The total reward is calculated when the robot reaches the final reward, i.e. the diamond.
1. A policy
2. A reward function
3. A value function
4. A model (of the environment)
5. Policy:
The policy defines the agent's way of behaving at a given time; it maps perceived states of the environment to the actions to be taken in those states
1. Reward Function:
The reward function defines what the good and bad events are for the
agent
1. Value Function:
The value of a state is the total amount of reward an agent can expect to
collect over the future, starting from that state
1. Model:
The model mimics the behaviour of the environment; for example, given a state and an action, it can predict the resultant next state and the next reward
2. These models are parameterized with a fixed number of parameters, which does not change as the amount of data grows.
4. For example, if you assume that a set of data {(x₁, …, xₙ)} you are given follows a linear model y = f(x; w, b), where w ∈ ℝᵈ and b ∈ ℝ, then the model has d + 1 parameters, where d is the dimension of each data point, irrespective of n.
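A minimal sketch of this idea with made-up synthetic data: the linear model below has exactly d + 1 parameters (w and b), no matter how large n gets.

```python
import numpy as np

# Sketch with made-up synthetic data: a linear model y = w.x + b has d + 1
# parameters (w and b), regardless of how many data points n we fit it on.
rng = np.random.default_rng(0)
n, d = 1000, 3                                   # n data points of dimension d
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3         # assumed "true" w and b

A = np.hstack([X, np.ones((n, 1))])              # append a 1s column so b is fitted too
params, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares fit
w_hat, b_hat = params[:d], params[d]
print(len(params))                               # d + 1 = 4, independent of n
```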
4. We can directly solve for the optimal policy using dynamic programming
6. When we have the optimal value function, the optimal policy is to choose the action that maximizes the value of the next state, as follows:
π*(sₜ) = argmaxₐ [ r(sₜ, a) + γ Σ_{sₜ₊₁} P(sₜ₊₁ | sₜ, a) V*(sₜ₊₁) ]
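Below is a minimal sketch of this on a made-up 3-state, 2-action MDP (the transition matrix P, rewards r and γ are all assumed values): value iteration computes V* by dynamic programming, and the optimal policy is then read off with the argmax above.

```python
import numpy as np

# Minimal sketch on a made-up MDP: 3 states, 2 actions.
# P[a, s, s'] = transition probability, r[s, a] = immediate reward (all assumed values).
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # action 1
])
r = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 5.0]])
gamma = 0.9

# Value iteration: V(s) <- max_a [r(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
V = np.zeros(3)
for _ in range(1000):
    Q = r + gamma * np.einsum("asn,n->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

# Optimal policy: the action that maximizes the expected value of the next state.
pi_star = Q.argmax(axis=1)
print(V, pi_star)
```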
BENEFITS:
2. Allow for incorporation of prior knowledge
1. They can be used to predict returns - the total reward expected over the
future
2. Estimates are usually "bootstrapped": they are updated towards a target return that is itself built from current estimates, which becomes a more accurate target over time
Different TD algorithms:
1. TD(0) algorithm
2. TD(λ) algorithm
3. TD(n) algorithm
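As a concrete illustration of the bootstrapped target, here is a minimal sketch of the TD(0) update from the list above; the state names, value estimates and the single transition are all made up.

```python
# Minimal sketch of the TD(0) update (tabular values, made-up numbers).
alpha, gamma = 0.1, 0.9
V = {"s0": 0.0, "s1": 0.5}              # hypothetical state-value estimates

def td0_update(V, s, r, s_next):
    # Bootstrapped target: r + gamma * V(s'); move V(s) a step alpha toward it.
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V

td0_update(V, "s0", r=1.0, s_next="s1")
print(V["s0"])                          # 0.0 + 0.1 * (1.0 + 0.9*0.5 - 0.0) = 0.145
```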
5. The policy is not strict: it does not always choose the action that gives the most reward
6. On-policy algorithms cannot separate exploration from control
2. Again, the behaviour policies are usually "soft" to ensure that sufficient exploration takes place
I) ε-greedy:
1. Most of the time, the action with the highest estimated reward is chosen; this is called the greedy action
4. This method ensures that, if enough trials are done, each action will be tried an infinite number of times
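A minimal sketch of the ε-greedy selection just described (the Q-values and ε are assumed numbers): with probability 1−ε the greedy action is chosen, otherwise a uniformly random action.

```python
import numpy as np

# Sketch of epsilon-greedy selection over made-up Q-values.
def epsilon_greedy(q_values, epsilon=0.1, rng=np.random.default_rng()):
    if rng.random() < epsilon:                 # explore: uniformly random action
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))            # exploit: the greedy action

action = epsilon_greedy(np.array([0.2, 0.8, 0.5]), epsilon=0.1)   # usually returns 1
```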
II) Softmax:
4. An action is selected with respect to the weight associated with each action
6. This is a good approach to take when the worst actions are very
unfavorable
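A minimal sketch of softmax (Boltzmann) action selection; the Q-values and temperature τ are assumed numbers. Because an action's probability shrinks with its Q-value, the worst actions are almost never picked, which is why this suits the case above.

```python
import numpy as np

# Sketch of softmax (Boltzmann) selection over made-up Q-values.
def softmax_action(q_values, tau=1.0, rng=np.random.default_rng()):
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()                         # for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()  # weight of each action
    return int(rng.choice(len(q_values), p=probs))

action = softmax_action([0.2, 0.8, 0.5], tau=0.5)   # action 1 is most likely
```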
ADVANTAGES OF TD METHODS:
7. An experience ⟨s, a, r, s'⟩ provides one data point for the value of Q(s, a)
8. The data point is that the agent received the future value of r + γV(s'), where V(s') = maxₐ' Q(s', a')
9. This is the actual current reward plus the discounted estimated future value
11. The agent can use the temporal difference equation to update its estimate
for Q(s, a):
Q(s, a) ← Q(s, a) + α [r + γ maxₐ' Q(s', a') − Q(s, a)]
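A minimal sketch of this single update on a made-up tabular Q (the states, actions, α, γ and the transition are all assumptions):

```python
# Sketch of one Q-learning update on a made-up tabular Q.
alpha, gamma = 0.5, 0.9
Q = {"s": {"left": 0.0, "right": 0.0}, "s'": {"left": 1.0, "right": 2.0}}

def q_update(Q, s, a, r, s_next):
    td_error = r + gamma * max(Q[s_next].values()) - Q[s][a]   # r + gamma*max_a' Q(s',a') - Q(s,a)
    Q[s][a] += alpha * td_error
    return Q[s][a]

q_update(Q, "s", "right", r=1.0, s_next="s'")   # 0 + 0.5*(1 + 0.9*2 - 0) = 1.4
```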
13. It can be proven that given sufficient training under any ε-soft policy, the
algorithm converges with probability 1 to a close approximation of the
action-value function for an arbitrary target policy
14. Q-Learning learns the optimal policy even when actions are selected
according to a more exploratory or even random policy
Q-TABLE:
3. Basically, this table will guide us to the best action at each state
4. In the Q-Table, the columns are the actions and the rows are the states
5. Each Q-table score will be the maximum expected future reward that the
agent will get if it takes that action at that state
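A minimal sketch of such a table as a 2-D array (the state and action counts are assumed): rows index states, columns index actions, and each entry holds the current estimate of the maximum expected future reward for that state-action pair.

```python
import numpy as np

# Sketch of a Q-table: rows = states, columns = actions (sizes are assumed).
n_states, n_actions = 5, 4
q_table = np.zeros((n_states, n_actions))        # typically initialised to zeros

state = 2
best_action = int(np.argmax(q_table[state]))     # best action at this state
best_value = q_table[state, best_action]         # its expected future reward
```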
2. Setting the learning rate, α, to 0 means that the Q-values are never updated; hence nothing is learned
3. Setting a high value such as 0.9 means that learning can occur quickly
2. The discount factor, γ, models the fact that future rewards are worth less than immediate rewards
3. Mathematically, the discount factor needs to be set to less than 1 for the algorithm to converge
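A small worked example of discounting (the rewards and γ are made up): with γ = 0.9 the reward of 10 received three steps in the future only contributes 7.29 to the return, and keeping γ < 1 makes the geometrically weighted sum finite.

```python
# Worked example of a discounted return with made-up rewards.
gamma = 0.9
rewards = [1.0, 1.0, 1.0, 10.0]             # rewards at steps t, t+1, t+2, t+3

# G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)                                    # 1 + 0.9 + 0.81 + 7.29 = 10.0
```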
PROCEDURAL APPROACH:
3. Choose an action, a, for that state based on one of the action selection
policies (i.e. soft, ε-greedy or Softmax)
4. Take the action, and observe the reward, r, as well as the new state, s'
5. Update the Q-value for the state using the observed reward and the
maximum reward possible for the next state. (The updating is done
according to the formula and parameters described above.)
6. Set the state to the new state, and repeat the process until a terminal state
is reached
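A minimal sketch of this procedure as a training loop. The environment interface (env.reset() returning a state and env.step(a) returning (s', r, done)) and the hyperparameters are assumptions for illustration, not something from these notes.

```python
import numpy as np

# Sketch of the procedure above, assuming a hypothetical discrete environment
# `env` with reset() -> s and step(a) -> (s', r, done).
def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # step 3: epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            # step 4: take the action, observe r and s'
            s_next, r, done = env.step(a)
            # step 5: update Q(s,a) towards r + gamma * max_a' Q(s',a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            # step 6: move to the new state; stop at a terminal state
            s = s_next
    return Q
```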
EXPLORATION:
5. However, due to the lack of algorithms that properly scale well with the
number of states, simple exploration methods are the most practical
6. One such method is ε-greedy, where the agent chooses the action that it believes has the best long-term effect with probability 1−ε, and a random action with probability ε
DELAYED REWARDS:
2. It also determines (at least probabilistically) the next state of the environment
6. The agent must be able to learn which of its actions are desirable based on rewards that can take place arbitrarily far in the future
7. It can also be done with eligibility traces, which weight the most recent action most heavily
8. The action before that a little less, the action before that even less, and so on; however, this takes a lot of computational time
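A minimal sketch of the eligibility-trace idea for state values (a TD(λ)-style update; the sizes and α, γ, λ are assumed): at every step all traces decay by γλ and the current state's trace is bumped, so the most recently visited states receive the largest share of each update.

```python
import numpy as np

# Sketch of an eligibility-trace update for state values (TD(lambda)-style).
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    delta = r + gamma * V[s_next] - V[s]   # TD error for this transition
    e *= gamma * lam                       # older states are weighted less and less
    e[s] += 1.0                            # the most recent state is weighted most
    V += alpha * delta * e                 # spread the credit over recent states

V = np.zeros(5)                            # value estimates (assumed 5 states)
e = np.zeros(5)                            # eligibility traces
td_lambda_step(V, e, s=2, r=1.0, s_next=3)
```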
1. In certain applications, the agent does not know the state exactly
2. It is equipped with sensors that return an observation using which the agent
should estimate the state
4. The robot may not know its exact location in the room, or what else is in the room
5. The robot may have a camera with which sensory observations are
recorded
6. This does not tell the robot its state exactly but gives inputs as to its likely
state
7. For example, the robot may only know that there is a wall to its right
8. The setting is like a Markov decision process, except that after taking action aₜ in state sₜ
9. The new state sₜ₊₁ is not known; the agent only receives an observation oₜ₊₁, which is a stochastic function of sₜ and aₜ: p(oₜ₊₁ | sₜ, aₜ)
10. This is called a partially observable state.
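A minimal sketch of this setting (the states, actions and all probabilities are made up): the agent cannot query the state directly, it only samples an observation from an assumed p(o | s, a).

```python
import numpy as np

# Sketch of partial observability: the agent only sees o ~ p(o | s, a), not s itself.
rng = np.random.default_rng(0)
observations = ["wall_right", "wall_left", "open_space"]

# p_obs[s][a] is a distribution over observations (all values made up).
p_obs = {
    0: {0: [0.7, 0.1, 0.2], 1: [0.6, 0.2, 0.2]},
    1: {0: [0.1, 0.7, 0.2], 1: [0.2, 0.6, 0.2]},
}

def observe(s, a):
    return rng.choice(observations, p=p_obs[s][a])

o = observe(s=0, a=1)   # e.g. "wall_right" with probability 0.6
```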
SUMS ON Q1