Dr. Ch. Balaram Murthy
The goal of an RL algorithm is to learn a policy that maximizes the expected cumulative reward (the expected return).
For instance, imagine putting your
little brother in front of a video game
he never played, giving him a
controller, and leaving him alone.
Your brother will interact with the environment (the video game) by
pressing the right button (action). He gets a coin: that's a +1 reward.
It's positive, so he just understood that in this game he must collect
the coins.
But then he presses the right button again and touches
an enemy. He has just died, so that's a -1 reward.
Without any supervision, the child will get better and better at
playing the game.
The RL Process
• Our Agent receives the first state S0 from the Environment (for
example, the first frame of the game).
• Based on that state S0, the Agent takes action A0 — our Agent
will move to the right.
• The Environment transitions to a new state S1 (a new frame).
• The Environment gives some reward R1 to the Agent — we’re not
dead (Positive Reward +1).
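To make this loop concrete, here is a minimal sketch of the state/action/reward cycle using the Gymnasium API (this assumes the gymnasium package and its built-in CartPole-v1 environment; the randomly sampled action is just a placeholder for a real policy):

import gymnasium as gym

# Create the environment and receive the initial state S0
env = gym.make("CartPole-v1")
state, info = env.reset()

done = False
total_reward = 0.0

while not done:
    # Based on the current state, the Agent picks an action
    # (here a random placeholder instead of a learned policy)
    action = env.action_space.sample()

    # The Environment returns the next state and a reward
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

env.close()
print("Return for this episode:", total_reward)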
The Markov Property implies that our agent needs only the current
state to decide what action to take, and not the history of all the
states and actions it took before.
Observations/States Space
Observations/states are the information our agent gets from the environment. A state is a complete description of the world (for example, a chess board), while an observation is a partial description of the state (for example, a single frame of the game).
The cumulative reward equals the sum of all rewards in the sequence:
R(\tau) = r_{t+1} + r_{t+2} + r_{t+3} + \dots
Which is equivalent to:
R(\tau) = \sum_{k=0}^{\infty} r_{t+k+1}
Let’s say your agent is this tiny mouse that can move one tile each
time step, and your opponent is the cat (that can move too). The
mouse’s goal is to eat the maximum amount of cheese before
being eaten by the cat.
From the figure, it is clear that it is more likely we will eat the cheese
near us than the cheese close to the cat (the closer we are to the cat,
the more dangerous it is).
1. We define a discount rate called gamma, with a value between 0 and 1.
• The larger the gamma, the smaller the discount. This means our
agent cares more about the long-term reward.
• Conversely, the smaller the gamma, the larger the discount. This
means our agent cares more about the short-term reward (the
nearest cheese).
2. Then, each reward is discounted by gamma raised to the power of
the time step. As the time step increases, the cat gets closer to us,
so the future reward is less and less likely to happen. The
discounted cumulative reward is therefore:
R(\tau) = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}
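As a small sketch (the reward list and gamma values below are made-up numbers, not from these notes), this discounted sum can be computed in Python:

def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by gamma**k where k is the time step."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example: three pieces of cheese worth +1 each, found on successive time steps
print(discounted_return([1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], gamma=0.5))   # 1 + 0.5 + 0.25 = 1.75

With the larger gamma the later rewards keep most of their value; with the smaller gamma they shrink quickly, which matches the bullets above.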
Episodic task
In this case, we have a starting point and an ending point (a terminal
state). This creates an episode: a list of states, actions, rewards, and
new states. When the agent dies or reaches the goal, the environment
resets and a new episode begins.
Continuing task
These are tasks that continue forever (no terminal state). In this
case, the agent must learn how to choose the best actions while
simultaneously interacting with the environment.
For instance, consider an agent that does automated stock trading. For
this task, there is no starting point or terminal state; the agent keeps
running until we decide to stop it.
The Exploration/Exploitation trade-off
Exploration means trying random actions in order to find more information about the environment. Exploitation means using the information we already have to maximize the reward. The trade-off: if we only exploit, we may miss better options we never tried; if we only explore, we never use what we have learned.
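A common way to balance the two is an epsilon-greedy rule: with probability epsilon the agent explores (random action), otherwise it exploits the action with the highest estimated value. This is only a small illustrative sketch; the Q-value list below is made up:

import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # exploration: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploitation: best action so far

# Example: estimated values for 4 actions (made-up numbers)
print(epsilon_greedy([0.2, 0.5, 0.1, 0.4], epsilon=0.1))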
The Policy π is the brain of our Agent: it is the function that tells us
what action to take given the state we are in. So it defines the
agent’s behavior at a given time.
This Policy is the function we want to learn, our goal is to find the
optimal policy π*, the policy that maximizes expected return when
the agent acts according to it. We find this π* through training.
Two approaches to train our agent to find this optimal policy π*:
• Policy-Based Methods: directly teach the agent which action to take given the current state.
• Value-Based Methods: indirectly, teach the agent which state is more valuable, so it takes the action that leads to the more valuable states.
In Policy-Based methods, the policy function defines a mapping from each
state to the best corresponding action. Alternatively, it can define a
probability distribution over the set of possible actions at that state:
• Deterministic: a policy at a given state will always return the same action.
action = policy(state)
• Stochastic: outputs a probability distribution over actions.
policy(action | state) = probability distribution over the set of actions given the current state
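As a small sketch of the difference (the states, actions, and probabilities below are invented for illustration):

import random

# Deterministic: the same state always maps to the same action
deterministic_policy = {"s0": "right", "s1": "jump"}
action = deterministic_policy["s0"]   # always "right"

# Stochastic: a probability distribution over actions for each state
stochastic_policy = {"s0": {"right": 0.8, "left": 0.2}}
dist = stochastic_policy["s0"]
action = random.choices(list(dist), weights=list(dist.values()))[0]   # "right" about 80% of the time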
The value of a state is the expected discounted return the agent can
get if it starts in that state, and then acts according to our policy.
“Act according to our policy” just means that our policy is “going to
the state with the highest value”.
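Written as a formula (standard notation, not spelled out in these notes), the value of a state s under policy π is the expected discounted return starting from s:

V_{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots \mid S_t = s \right]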
Here we see that our value function defines a value for each possible
state.
Thanks to our value function, at each step our policy will select the
state with the biggest value defined by the value function: -7, then
-6, then -5, and so on, until we reach the goal.
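A tiny sketch of this value-greedy behaviour (the state values and neighbour lists below are invented to mimic the -7, -6, -5 example):

# Hypothetical state values (e.g. negative number of steps to the goal)
state_values = {"A": -7, "B": -6, "C": -5, "goal": 0}
neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B", "goal"]}

def greedy_step(state):
    """Move to the neighbouring state with the highest value."""
    return max(neighbours[state], key=lambda s: state_values[s])

state = "A"
while state != "goal":
    state = greedy_step(state)
    print("moved to", state, "with value", state_values[state])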
Model-based RL algorithms build a model of the environment by
sampling the states, taking actions, and observing the rewards. For
every state and possible action, the model predicts the expected
reward and the expected next state.
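As a rough sketch (a toy tabular model, not a specific algorithm from these notes), such a model can be as simple as a table from (state, action) pairs to the average observed reward and the most frequently seen next state:

from collections import defaultdict

# Learned model: (state, action) -> list of observed (reward, next_state) samples
samples = defaultdict(list)

def record(state, action, reward, next_state):
    """Store one observed transition while interacting with the environment."""
    samples[(state, action)].append((reward, next_state))

def predict(state, action):
    """Predict the expected reward and the most frequently seen next state."""
    data = samples[(state, action)]
    expected_reward = sum(r for r, _ in data) / len(data)
    next_states = [s for _, s in data]
    likely_next = max(set(next_states), key=next_states.count)
    return expected_reward, likely_next

# Example usage with invented transitions
record("s0", "right", 1.0, "s1")
record("s0", "right", 0.0, "s1")
print(predict("s0", "right"))   # (0.5, 's1')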