
INTRODUCTION TO Machine Learning

CHAPTER 18: Reinforcement Learning
Machine Learning

What is reinforcement learning?


Reinforcement learning is a machine learning training method based on rewarding desired behaviors and punishing undesired ones.

In general, a reinforcement learning agent -- the entity being trained -- is able to:
• perceive and interpret its environment,
• take actions, and
• learn through trial and error.
Machine Learning

What is reinforcement learning?


What makes this approach important is that it empowers an agent, whether it's a feature in a video game or a robot in an industrial setting, to learn to navigate the complexities of the environment it was created for.

Over time, through a feedback system that typically includes rewards and punishments, the agent learns from its environment and optimizes its behaviors.
Machine Learning

How does reinforcement learning work?


In reinforcement learning, developers devise a method of rewarding desired
behaviors and punishing negative behaviors.

This method assigns:
• positive values to desired actions, to encourage them, and
• negative values to undesired behaviors, to discourage them.

This programs the agent to seek long-term and maximum overall rewards to
achieve an optimal solution.

These long-term goals help prevent the agent from getting stuck on less
important goals.

With time, the agent learns to avoid the negative and seek the positive.
Machine Learning

How does reinforcement learning work?


The Markov decision process serves as the basis for reinforcement
learning systems.

In this process, an agent exists in a specific state inside an environment;


it must select the best possible action from multiple potential actions it can
perform in its current state.

Certain actions offer rewards that motivate the agent.

In its next state, new rewarding actions become available to it.

Over time, the cumulative reward is the sum of the rewards the agent receives from the actions it chooses to perform.
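A minimal sketch of this interaction loop, assuming a hypothetical env and agent with reset/step/select_action/learn methods (the names are illustrative, not from any specific library):

```python
# Minimal agent-environment loop: the agent acts, observes a reward,
# updates itself, and accumulates the total reward over the episode.
def run_episode(env, agent, max_steps=100):
    state = env.reset()                      # agent starts in an initial state
    cumulative_reward = 0.0                  # sum of rewards received so far
    for _ in range(max_steps):
        action = agent.select_action(state)             # choose an action
        next_state, reward, done = env.step(action)     # environment responds
        agent.learn(state, action, reward, next_state)  # trial-and-error update
        cumulative_reward += reward
        state = next_state
        if done:
            break
    return cumulative_reward
```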
Machine Learning

Applications and examples of reinforcement learning

• Gaming.

• Resource management.

• Personalized recommendations.

• Robotics.

Reinforcement learning is also used in:


• operations research,
• information theory,
• game theory,
• control theory,
• simulation-based optimization,
• multi-agent systems,
• swarm intelligence,
• statistics,
• genetic algorithms and
• ongoing industrial automation efforts.
Machine Learning

Challenges of applying reinforcement learning


One of the barriers for deployment of this type of machine learning is its reliance
on exploration of the environment.

For example, if you were to deploy a robot that relied on reinforcement learning to
navigate a complex physical environment, it would seek new states and take different
actions as it moves. With this type of reinforcement learning problem, however, it's
difficult to consistently take the best actions in a real-world environment because of how
frequently the environment changes.

The time required to ensure the learning is done properly through this method
can limit its usefulness and be intensive on computing resources. As the
training environment grows more complex, so too do the demands on time and
compute resources.

If the proper amount of data is available, supervised learning can deliver faster, more
efficient results to companies than reinforcement learning, as it can be employed with
fewer resources.
Machine Learning

Common reinforcement learning algorithms

Common algorithms differ in the strategies they use to explore their environments:

• State-action-reward-state-action (SARSA). This reinforcement learning algorithm starts by
giving the agent what's known as a policy. Determining the optimal policy requires looking
at the probability of certain actions resulting in rewards, or beneficial states, to guide
its decision-making.

• Q-learning. This algorithm takes the opposite approach. The agent receives no policy and
learns an action's value through exploration of its environment. This approach isn't
model-based but is instead more self-directed. Real-world implementations of Q-learning
are often written in Python (a minimal sketch of the update rule follows this list).

• Deep Q-networks. Combined with deep Q-learning, these algorithms use neural networks
in addition to reinforcement learning techniques. They are also referred to as deep
reinforcement learning and use reinforcement learning's self-directed environment
exploration approach. As part of the learning process, these networks base future actions
on a random sample of past beneficial actions.
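As referenced in the Q-learning item above, here is a minimal sketch of the tabular Q-learning update with epsilon-greedy exploration; the hyperparameters epsilon, alpha and gamma are illustrative:

```python
import random
from collections import defaultdict

# Q[(state, action)] -> estimated value of taking `action` in `state`.
Q = defaultdict(float)

def choose_action(state, actions, epsilon=0.1):
    if random.random() < epsilon:                      # explore occasionally
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])   # otherwise exploit

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(next_state, a)] for a in actions)
    # Move Q(s, a) toward the observed reward plus the discounted best future value.
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```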
Machine Learning

Markov Decision Process (MDP)

We can formally describe an MDP as m = (S, A, P, R, γ), where:

• S represents the set of all states.
• A represents the set of possible actions.
• P represents the transition probabilities.
• R represents the rewards.
• γ (gamma) is known as the discount factor (more on this later).
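One way to hold these five components in code (a minimal sketch with illustrative field names):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                                # S
    actions: List[str]                               # A
    transitions: Dict[Tuple[str, str, str], float]   # P[(s, a, s')] -> probability
    rewards: Dict[str, float]                        # R[s] -> reward
    gamma: float                                     # discount factor
```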

Machine Learning

Markov Decision Process (MDP)


• The goal of the MDP m is to find a policy, often denoted as π, that
yields the optimal long-term reward.
• Policies are simply a mapping of each state s to a distribution over
actions a.
• For each state s, the agent should take action a with a certain probability.
• Alternatively, policies can also be deterministic (i.e. the agent always takes
action a in state s).
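A sketch of both kinds of policy (the state and action names match the example on the next slide; the probabilities here are illustrative):

```python
# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_pi = {
    "S1": {"a": 0.9, "b": 0.1},   # in S1, take action a with probability 0.9
    "S2": {"a": 1.0},
}

# Deterministic policy: each state maps to a single action.
deterministic_pi = {
    "S1": "a",
    "S2": "a",
}
```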

Machine Learning

MDP

[Figure: transition diagram with states S1–S5, rewards R(S1)=0, R(S2)=0, R(S3)=5, R(S4)=0, R(S5)=2, and edges labeled with actions and probabilities; notation T(S, a, S') = P]

State   Action   Probability   Next State
S1      a        0.9           S2
S1      a        0.1           S4
S1      b        1             S5
S2      a        1             S3
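The same transition table and rewards, encoded as nested dictionaries (a sketch; T[s][a] maps each next state to its probability):

```python
# Transition probabilities T(s, a, s') from the table above.
T = {
    "S1": {"a": {"S2": 0.9, "S4": 0.1},
           "b": {"S5": 1.0}},
    "S2": {"a": {"S3": 1.0}},
}

# Rewards read off the diagram.
R = {"S1": 0, "S2": 0, "S3": 5, "S4": 0, "S5": 2}
```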
Machine Learning

Markov Decision Process (MDP)


The Bellman Equation is central to Markov Decision Processes.
It outlines a framework for determining the optimal expected reward at a
state s by answering the question: "what is the maximum reward an
agent can receive if it takes the optimal action now and at every future
decision?"

Machine Learning

Bellman Equation

The discount factor γ (gamma) lies between 0 and 1 (inclusive).

If γ is set to 0, the future-value term V(s') is completely cancelled out and the model only
cares about the immediate reward.

If γ is set to 1, the model weights potential future rewards just as much as it
weights immediate rewards.

The optimal value of γ is usually somewhere in between, such that
farther-out rewards have a diminishing effect.
Machine Learning

Bellman Equation

V(s) = r0 + γ·r1 + γ²·r2 + …

Machine Learning

Bellman Equation

[Figure: states S1, S2, S3 connected by action a, with rewards R(S1)=0, R(S2)=0, R(S3)=5; a state S5 with R(S5)=1]

V = 0 + 0.9 · 5 + (0.9)² · 5 + …
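The same computation in code: summing discounted rewards with γ = 0.9. The reward sequence 0, 5, 5, … follows the slide's example, truncated after a few terms:

```python
def discounted_return(rewards, gamma):
    # V = r0 + gamma*r1 + gamma^2*r2 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# First few terms of the example above: 0 + 0.9*5 + 0.9^2*5 + ...
print(discounted_return([0, 5, 5, 5, 5], gamma=0.9))
```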
Machine Learning
Q-learning: Markov Decision Process + Reinforcement
Learning

Machine Learning
Q-learning: Markov Decision Process + Reinforcement
Learning
Maze Example: Utility

• Define the reward of being in a state:


– R(s) = -0.04 if s is an empty state
– R(4,3) = +1 (maximum reward when the goal is reached)
– R(4,2) = -1 (avoid (4,2) as much as possible)
• Define the utility of a sequence of states:
– U(s0, …, sN) = R(s0) + R(s1) + … + R(sN)
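A sketch of these definitions in code, with grid cells written here as (column, row) tuples:

```python
def maze_reward(s):
    # Rewards from the maze example above.
    if s == (4, 3):
        return +1       # goal state
    if s == (4, 2):
        return -1       # state to avoid
    return -0.04        # any empty state

def utility(states):
    # U(s0, ..., sN) = R(s0) + R(s1) + ... + R(sN)  (no discounting yet)
    return sum(maze_reward(s) for s in states)
```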
Machine Learning

Maze Example: No Uncertainty

• States: locations in maze grid


• Actions: Moves up/left/down/right
• If no uncertainty: Find sequence of actions from current state
to goal (+1) that maximizes utility

Machine Learning

What we are looking for: Policy


• Policy = mapping from states to actions: π(s) = a
-> Which action should I take in each state?
• In the maze example, π(s) associates a motion to a particular
location on the grid
• For any state s, we define the utility U(s) of s as the sum of
discounted rewards of the sequence of states starting at state s
generated by using the policy π

U(s) = R(s) + γ R(s1) + γ² R(s2) + …

• where we move from s to s1 by action π(s),
• we move from s1 to s2 by action π(s1),
• …etc.
Machine Learning

Maze Example: No Uncertainty

Optimal Policy = the policy π* that maximizes the expected
utility U(s) of the sequence of states generated by π*, starting at s
• π*((1,1)) = UP
• π*((1,3)) = RIGHT
• π*((4,1)) = LEFT

Machine Learning

Maze Example: With Uncertainty

• The robot may not execute exactly the action that is
commanded -> the outcome of an action is no longer deterministic

• Uncertainty:
– We know in which state we are (fully observable)
– But we are not sure that the commanded action will be executed
exactly
Machine Learning

Uncertainty
• No uncertainty:
– An action a deterministically causes a transition from a
state s to another state s’

• With uncertainty:
– An action a causes a transition from a state s to another
state s’ with some probability T(s,a,s’)
– T(s,a,s’) is called the transition probability from state s to
state s’ through action a
– In general, we need |S|² × |A| numbers to store all the
transition probabilities
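For example, the full transition model can be stored as a 3-D array of shape (|S|, |A|, |S|) (a sketch using NumPy; the probabilities come from the earlier S1–S5 example):

```python
import numpy as np

n_states, n_actions = 5, 2          # |S| = 5, |A| = 2
# T[s, a, s2] = probability of reaching s2 from s via action a;
# the array holds |S| * |A| * |S| = |S|^2 * |A| numbers in total.
T = np.zeros((n_states, n_actions, n_states))
T[0, 0, 1] = 0.9    # S1 --a--> S2 with probability 0.9
T[0, 0, 3] = 0.1    # S1 --a--> S4 with probability 0.1
```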

Machine Learning

Maze Example: With Uncertainty

• We can no longer find a unique sequence of actions, but

• we can look for a policy that tells us which action to
take from each state, where the policy now maximizes the
expected utility.

Machine Learning

Maze Example: Utility Revisited

U(s) = Expected reward of future states starting at s

How to compute U after one step?

Machine Learning

Maze Example: Utility Revisited

Suppose s = (1,1) and we choose action Up.

Machine Learning

Maze Example: Utility Revisited (Same, with Discount)

Suppose s = (1,1) and we choose action Up.

Machine Learning

More General Expression

If we choose action a at state s, expected future rewards are:

U(s) = R(s) + γ Σs’ T(s,a,s’) U(s’)
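In code, the right-hand side for a single action looks like this (a sketch reusing the nested T and R dictionaries from the earlier example):

```python
def expected_utility(s, a, U, T, R, gamma=0.9):
    # U(s) = R(s) + gamma * sum over s' of T(s, a, s') * U(s')
    successors = T.get(s, {}).get(a, {})      # {} if (s, a) has no transitions
    return R[s] + gamma * sum(p * U[s2] for s2, p in successors.items())
```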

Machine Learning

More General Expression

If we are using policy π, we choose action a = π(s) at state s, and the
expected future rewards are:

Uπ(s) = R(s) + γ Σs' T(s, π(s), s') Uπ(s')
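A sketch of evaluating a fixed policy by repeatedly applying this update until the utilities stop changing (it assumes the nested T and R dictionaries used in the earlier sketches):

```python
def evaluate_policy(pi, states, T, R, gamma=0.9, tol=1e-6):
    # U_pi(s) = R(s) + gamma * sum_s' T(s, pi(s), s') * U_pi(s')
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            out = T.get(s, {}).get(pi.get(s), {})   # {} for terminal states
            new_u = R[s] + gamma * sum(p * U[s2] for s2, p in out.items())
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:                              # converged
            return U
```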

