Markov Decision Processes
REINFORCEMENT LEARNING WITH GYMNASIUM IN PYTHON
Fouad Trad
Machine Learning Engineer
Markov Decision Process (MDP)
Models RL environments mathematically
print(env.action_space)
Discrete(4)
print(env.observation_space)
Discrete(16)
print("Number of actions:", env.action_space.n)
Number of actions: 4
print("Number of states:", env.observation_space.n)
Number of states: 16
print(env.unwrapped.P[state][action])
[
(probability_1, next_state_1, reward_1, is_terminal_1),
(probability_2, next_state_2, reward_2, is_terminal_2),
etc.
]
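For instance, a minimal sketch using the FrozenLake-v1 environment (named here purely for illustration; the slide itself does not say which environment produced the outputs above):

import gymnasium as gym

env = gym.make("FrozenLake-v1")   # 16 states, 4 actions
state, action = 6, 1              # arbitrary state and action indices
# each tuple: (transition probability, next state, reward, terminated)
print(env.unwrapped.P[state][action])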
Policies
RL objective → formulate effective policies
Specify which action to take in each state to maximize return
Example environment:
Nine states
Deterministic movements
Rewards: -2 in mountain states, -1 in all other states
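As a sketch of what following a deterministic policy looks like with the Gymnasium API, assuming policy is a dict mapping each state index to an action:

state, _ = env.reset()
terminated = False
while not terminated:
    action = policy[state]   # deterministic lookup: the policy fixes one action per state
    state, reward, terminated, truncated, _ = env.step(action)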
compute_state_value() computes the value of a state: the expected return when starting there and following the policy.
# gamma (discount factor), terminal_state and num_states are assumed to be defined
def compute_state_value(state):
    if state == terminal_state:   # terminal state: no further rewards
        return 0
    action = policy[state]
    _, next_state, reward, _ = env.unwrapped.P[state][action][0]
    return reward + gamma * compute_state_value(next_state)

V = {state: compute_state_value(state)
     for state in range(num_states)}
print(V)
{0: 1, 1: 8, 2: 9,
3: 2, 4: 7, 5: 10,
6: 3, 7: 5, 8: 0}
Comparing with a second policy:

State values under the first policy:
{0: 1, 1: 8, 2: 9,
 3: 2, 4: 7, 5: 10,
 6: 3, 7: 5, 8: 0}

State values under the second policy:
{0: 7, 1: 8, 2: 9,
 3: 7, 4: 9, 5: 10,
 6: 8, 7: 10, 8: 0}

Every state value is at least as high under the second policy, so the second policy is the better one.
Action-value functions (Q-values)
Expected return of:
Starting at a state s
Taking action a
Then following the policy (see the sketch below)
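A sketch of how the Q dictionary could be built, mirroring compute_state_value() above (num_actions, like num_states, is assumed to be the size of the corresponding space):

num_actions = env.action_space.n

def compute_q_value(state, action):
    if state == terminal_state:
        return None   # no action is taken in the terminal state
    _, next_state, reward, _ = env.unwrapped.P[state][action][0]
    return reward + gamma * compute_state_value(next_state)

Q = {(state, action): compute_q_value(state, action)
     for state in range(num_states)
     for action in range(num_actions)}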
print(Q)
Acting greedily on the Q-values, choosing in each state the action with the highest Q-value, yields an improved policy that can be compared with the old one.
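A minimal sketch of how improved_policy could be derived from the Q dictionary built above (the terminal state, assumed to be the last one, is skipped):

improved_policy = {}
for state in range(num_states - 1):
    # greedy choice: the action with the highest Q-value in this state
    max_action = max(range(num_actions), key=lambda a: Q[(state, a)])
    improved_policy[state] = max_action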
print(improved_policy)
{0: 2, 1: 2, 2: 1,
3: 2, 4: 2, 5: 1,
6: 2, 7: 2}
Policy iteration
Iterative process to find the optimal policy: alternate policy evaluation and policy improvement until the policy stops changing.
def policy_iteration():
    policy = initial_policy                       # some starting policy, assumed defined
    while True:
        V = policy_evaluation(policy)             # evaluate the current policy
        improved_policy = policy_improvement(V)   # act greedily on those values
        if improved_policy == policy:             # policy is stable: it is optimal
            break
        policy = improved_policy
    return policy, V
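The two helpers referenced in the loop are not shown on the slide. A minimal sketch, reusing the same one-step lookups on env.unwrapped.P as before (like compute_state_value(), policy_evaluation() assumes every trajectory eventually reaches the terminal state):

def policy_evaluation(policy):
    # value of every state when following the given deterministic policy
    def value(state):
        if state == terminal_state:
            return 0
        _, next_state, reward, _ = env.unwrapped.P[state][policy[state]][0]
        return reward + gamma * value(next_state)
    return {state: value(state) for state in range(num_states)}

def policy_improvement(V):
    # act greedily: for each state, pick the action with the best one-step lookahead value
    improved_policy = {}
    for state in range(num_states - 1):
        q_values = {}
        for action in range(num_actions):
            _, next_state, reward, _ = env.unwrapped.P[state][action][0]
            q_values[action] = reward + gamma * V[next_state]
        improved_policy[state] = max(q_values, key=q_values.get)
    return improved_policy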
Optimal policy returned by policy_iteration():
{0: 2, 1: 2, 2: 1,
 3: 1, 4: 2, 5: 1,
 6: 2, 7: 2}

Corresponding state values:
{0: 7, 1: 8, 2: 9,
 3: 7, 4: 9, 5: 10,
 6: 8, 7: 10, 8: 0}
Value iteration

Combines evaluation and improvement in a single loop; starting from V and policy initialized to zeros, it updates the state values and the policy together:

while True:
    new_V = {state: 0 for state in range(num_states)}
    for state in range(num_states - 1):
        max_action, max_q_value = get_max_action_and_value(state, V)
        new_V[state] = max_q_value
        policy[state] = max_action
    # stop once the values stop changing (threshold: a small tolerance, assumed defined)
    if all(abs(new_V[state] - V[state]) < threshold for state in V):
        break
    V = new_V
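The helper assumed by this loop is not shown on the slide; a minimal sketch, using a one-step lookahead on env.unwrapped.P and the current value estimates V (deterministic transitions, so only the first entry is used):

def get_max_action_and_value(state, V):
    # Q-value of each action: immediate reward plus discounted value of the next state
    q_values = {}
    for action in range(num_actions):
        _, next_state, reward, _ = env.unwrapped.P[state][action][0]
        q_values[action] = reward + gamma * V[next_state]
    max_action = max(q_values, key=q_values.get)
    return max_action, q_values[max_action]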
Value iteration reaches the same optimal policy:
{0: 2, 1: 2, 2: 1,
 3: 1, 4: 2, 5: 1,
 6: 2, 7: 2}

and the same state values:
{0: 7, 1: 8, 2: 9,
 3: 7, 4: 9, 5: 10,
 6: 8, 7: 10, 8: 0}