7- Reinforcement Learning
• Robot soccer
• Investing in shares
• An MDP is specified by a set of states S, a set of actions A, a reward function R(s, a), and a transition model T(s, a, s'); r(s, a, s') denotes the reward received when action a in state s leads to the next state s'.
• A policy is a mapping from states to actions, π : S → A.
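A minimal Python sketch of these components is given below; the container types and field names are illustrative assumptions, not something defined in these notes.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    states: List[str]                             # S: the set of states
    actions: Dict[str, List[str]]                 # A(s): actions available in each state
    transition: Callable[[str, str, str], float]  # T(s, a, s'): probability of reaching s'
    reward: Callable[[str, str, str], float]      # r(s, a, s'): reward for that transition
    gamma: float = 0.9                            # discount factor (illustrative default)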
MDP - Example I
• Consider the graph below, and find the shortest path from the start node S to the goal node G.
• Set of states: {S, T, U, V}
• Action – Traversal from one state to another state along an edge
• Reward – Traversing an edge provides its "edge length" in dollars
• Policy – The path followed to reach the destination, e.g. {S → T → V}
[Figure: graph with nodes S, T, U, V and goal node G; the edge weights shown include 14, 51, 25, 15, -22, and -5.]
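To make the mapping concrete, the sketch below encodes a shortest-path problem of this kind in Python. The edge set and the cost values are hypothetical placeholders, since the figure above is not reproduced in full.

# Hypothetical edge list: (state, next_state) -> edge "length" in dollars.
# The actual edges and values come from the figure and are assumed here.
edges = {
    ("S", "T"): 14, ("S", "U"): 25,
    ("T", "V"): -22, ("T", "G"): 51,
    ("U", "V"): 15, ("V", "G"): -5,
}

states = {"S", "T", "U", "V", "G"}   # states, with G as the goal

def actions(s):
    """Traversals available from state s."""
    return [t for (u, t) in edges if u == s]

def reward(s, t):
    """Dollars received for traversing the edge (s, t)."""
    return edges[(s, t)]

# A policy is then a choice of path, e.g. S -> T -> V -> G.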
Q-Learning
• Q-Learning is a value-based reinforcement learning algorithm that uses Q-values (action values) to iteratively improve the behavior of the learning agent.
• The goal is to maximize the Q-value and thereby find the optimal action-selection policy.
• The Q-table helps to find the best action for each state and maximize the expected reward (a minimal sketch of a Q-table appears after this list).
• Q-Values / Action-Values: Q-values are defined for state-action pairs.
• Q(s, a) denotes an estimate of how good it is to take action a in state s.
• This estimate of Q(s, a) is computed iteratively using the TD-Update rule.
• Reward: At every transition, the agent observes a reward from the environment for the action taken, and then transitions to another state.
• Episode: An episode is complete when the agent reaches one of the terminating states, i.e. a state from which no further transitions are possible.
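The following is a minimal sketch of a Q-table as a 2-D array (one row per state, one column per action); the sizes chosen here are illustrative assumptions.

import numpy as np

n_states, n_actions = 6, 6           # assumed sizes, for illustration only
Q = np.zeros((n_states, n_actions))  # Q[s, a]: estimated value of action a in state s

# The greedy choice in a given state is the action with the largest Q-value.
best_action = int(np.argmax(Q[2]))   # best action from state 2 (ties broken arbitrarily)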
Q-Learning
• Initially, the agent explores the environment and updates the Q-table. Once the Q-table is ready, the agent starts to exploit the environment and take better actions.
• Q-learning is an off-policy control algorithm, i.e. the policy being learned (the target policy) differs from the behavior policy used to select actions. It estimates the return of future actions and updates the value of the new state without requiring the behavior policy to be greedy.
Temporal Difference or TD-Update:
• The estimate of Q is updated at every time step of the agent's interaction with the environment.
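In its standard form, the TD update uses a learning rate Alpha (0 < Alpha ≤ 1):

Q(s, a) ← Q(s, a) + Alpha * [ r + Gamma * max[Q(s', all actions)] − Q(s, a) ]

Setting Alpha = 1 recovers the simplified update rule used later in these notes, Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)].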
Advantage:
• Converges to an optimal policy in both deterministic and nondeterministic
MDPs.
Disadvantage:
• Tabular Q-learning is only practical for small problems (small state-action spaces).
Understanding Q-Learning
• The environment is a building with 5 rooms connected by doors.
• The rooms are numbered 0 to 4, and the area outside the building is numbered 5.
• Doors from rooms 1 and 4 lead outside the building (to 5).
• Problem: The agent can be placed in any one of the rooms (0, 1, 2, 3, 4). The agent's goal is to reach the outside of the building (state 5).
Understanding Q-Learning
• Represent the rooms as a graph: each room number is a state, and each door is an edge.
Understanding Q-Learning
• Assign a reward value to each door.
• Doors that lead immediately to the target are assigned an instant reward of 100.
• Other doors, not directly connected to the target room, have zero reward.
• Because doors are two-way (for example, 0 leads to 4, and 4 leads back to 0), two edges are assigned between each pair of connected rooms.
• Each edge carries an instant reward value (see the reward-matrix sketch below).
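A possible reward matrix R for this environment is sketched below in Python. The exact room connectivity comes from the figure, which is not reproduced in these notes, so the door layout used here (0–4, 1–3, 1–5, 2–3, 3–4, 4–5) is an assumption.

import numpy as np

# Rows = current state (room), columns = action (the room moved to).
#  100 -> door leading directly to the goal (state 5)
#    0 -> door exists but does not lead directly to the goal
#   -1 -> no door between the two rooms (invalid action)
R = np.array([
    #  0    1    2    3    4    5
    [ -1,  -1,  -1,  -1,   0,  -1],   # room 0: door to 4
    [ -1,  -1,  -1,   0,  -1, 100],   # room 1: doors to 3 and outside (5)
    [ -1,  -1,  -1,   0,  -1,  -1],   # room 2: door to 3
    [ -1,   0,   0,  -1,   0,  -1],   # room 3: doors to 1, 2 and 4
    [  0,  -1,  -1,   0,  -1, 100],   # room 4: doors to 0, 3 and outside (5)
    [ -1,   0,  -1,  -1,   0, 100],   # state 5 (outside): the goal
])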
Understanding Q-Learning
• Consider an agent that starts from state (room) 2.
• The agent's movement from one state to another is an action a.
• The agent traverses from state 2 to state 5 (the target):
– Initial state = current state, i.e. state 2
– Transition: state 2 → state 3
– Transition: state 3 → state 2, 1, or 4
– Transition: state 4 → state 5
Understanding Q-Learning: Prepare matrix Q
• Matrix Q is the memory of the agent, in which information learned from experience is stored.
• Each row denotes the current state of the agent.
• Each column denotes a possible action leading to the next state.
Compute Q matrix:
Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
• Gamma is the discount factor for future rewards. Its range is 0 to 1, i.e. 0 < Gamma < 1.
• Future rewards are less valuable than current rewards, so they must be discounted.
• If Gamma is closer to 0, the agent will tend to consider only immediate rewards.
• If Gamma is closer to 1, the agent will give greater weight to future rewards.
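As a worked instance of this formula, assume Gamma = 0.8, the reward matrix sketched earlier, and a Q matrix still initialized to zero. Updating the entry for moving from room 1 through its door to the goal state 5 gives:

Q(1, 5) = R(1, 5) + 0.8 * max[Q(5, 1), Q(5, 4), Q(5, 5)]
        = 100 + 0.8 * 0
        = 100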
Q-Learning Algorithm
• Set the gamma parameter
• Set environment rewards in matrix R
• Initialize matrix Q as Zero
– Select random initial (source) state
• Set initial state s = current state
– Select one action a among all possible actions using an exploratory policy
• Take this action a, moving to the next state s'
• Observe reward r
– Get the maximum Q value of the next state s' over all possible actions
• Compute:
– Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
• Repeat the above steps until the goal state is reached, i.e. current state = goal state
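A minimal Python sketch of this algorithm, applied to the room environment and the assumed reward matrix R from earlier, might look like the following; Gamma, the number of episodes, and the random seed are illustrative choices.

import numpy as np

# Reward matrix from the earlier sketch (assumed connectivity).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

GAMMA = 0.8                          # discount factor
GOAL = 5                             # goal state: outside the building
EPISODES = 1000                      # number of training episodes

Q = np.zeros_like(R, dtype=float)    # initialize matrix Q as zero
rng = np.random.default_rng(0)

for _ in range(EPISODES):
    state = int(rng.integers(0, 5))                 # random initial room (0-4)
    while state != GOAL:                            # repeat until the goal is reached
        valid = np.where(R[state] >= 0)[0]          # actions allowed from this state
        action = int(rng.choice(valid))             # exploratory (random) action
        next_state = action                         # here the action is the room moved to
        # Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
        Q[state, action] = R[state, action] + GAMMA * Q[next_state].max()
        state = next_state

# After training, following the largest Q-value in each row leads to the goal.
print((Q / Q.max() * 100).round())                  # normalized Q matrix

With these assumptions, the largest entry in each row of the learned Q matrix points along a path toward state 5.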
Example: Q-Learning
[Table: matrix with states 0–5 as rows and actions 0–5 as columns; the cell values were not recovered from the original slide.]