
INTRODUCTION TO Machine Learning

CHAPTER 18: Reinforcement Learning
Machine Learning

What is reinforcement learning?


Reinforcement learning is a machine learning training method based on rewarding desired behaviors and punishing undesired ones.

In general, a reinforcement learning agent -- the entity being trained -- is able to:
• perceive and interpret its environment,
• take actions, and
• learn through trial and error.
Machine Learning

What is reinforcement learning?


What makes this approach important is that it empowers an agent, whether it's a feature in a video game or a robot in an industrial setting, to learn to navigate the complexities of the environment it was created for.

Over time, through a feedback system that typically includes rewards and punishments, the agent learns from its environment and optimizes its behaviors.
Machine Learning

How does reinforcement learning work?


In reinforcement learning, developers devise a method of rewarding desired
behaviors and punishing negative behaviors.

This method assigns:
• positive values to desired actions, to encourage them, and
• negative values to undesired behaviors, to discourage them.

This programs the agent to seek long-term and maximum overall rewards to
achieve an optimal solution.

These long-term goals help prevent the agent from getting stuck on less
important goals.

With time, the agent learns to avoid the negative and seek the positive.
Machine Learning

How does reinforcement learning work?


The Markov decision process serves as the basis for reinforcement
learning systems.

In this process, an agent exists in a specific state inside an environment;


it must select the best possible action from multiple potential actions it can
perform in its current state.

Certain actions offer rewards that motivate the agent.

In its next state, new rewarding actions become available to it.

Over time, the cumulative reward is the sum of the rewards the agent receives from the actions it chooses to perform.
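A minimal sketch of this interaction loop, assuming a hypothetical env and agent with reset/step/select_action/learn methods (the names are illustrative, not from any specific library):

```python
# Minimal agent-environment loop: the agent acts, observes a reward,
# updates itself, and accumulates the total reward over the episode.
def run_episode(env, agent, max_steps=100):
    state = env.reset()                      # agent starts in an initial state
    cumulative_reward = 0.0                  # sum of rewards received so far
    for _ in range(max_steps):
        action = agent.select_action(state)             # choose an action
        next_state, reward, done = env.step(action)     # environment responds
        agent.learn(state, action, reward, next_state)  # trial-and-error update
        cumulative_reward += reward
        state = next_state
        if done:
            break
    return cumulative_reward
```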
Machine Learning

Applications and examples of reinforcement learning

• Gaming.

• Resource management.

• Personalized recommendations.

• Robotics.

Reinforcement learning is also used in:


• operations research,
• information theory,
• game theory,
• control theory,
• simulation-based optimization,
• multi-agent systems,
• swarm intelligence,
• statistics,
• genetic algorithms and
• ongoing industrial automation efforts.
Machine Learning

Challenges of applying reinforcement learning


One of the barriers for deployment of this type of machine learning is its reliance
on exploration of the environment.

For example, if you were to deploy a robot that relied on reinforcement learning to
navigate a complex physical environment, it would seek new states and take different
actions as it moves. With this type of reinforcement learning problem, however, it's
difficult to consistently take the best actions in a real-world environment because of how
frequently the environment changes.

The time required to ensure the learning is done properly through this method
can limit its usefulness and be intensive on computing resources. As the
training environment grows more complex, so too do the demands on time and
compute resources.

If the proper amount of data is available, supervised learning can deliver faster, more
efficient results to companies than reinforcement learning, as it can be employed with
fewer resources.
Machine Learning

Common reinforcement learning algorithms

Common algorithms differ in the strategies they use to explore their environments:

• State-action-reward-state-action (SARSA). This reinforcement learning algorithm starts by
giving the agent what's known as a policy. Determining the optimal policy requires looking
at the probability of certain actions resulting in rewards, or beneficial states, to guide
its decision-making.

• Q-learning. This algorithm takes the opposite approach. The agent receives no policy and
learns an action's value through exploration of its environment. This approach isn't
model-based but is instead more self-directed. Real-world implementations of Q-learning
are often written in Python (a minimal sketch of the update rule follows this list).

• Deep Q-networks. Combined with deep Q-learning, these algorithms use neural networks
in addition to reinforcement learning techniques. They are also referred to as deep
reinforcement learning and use reinforcement learning's self-directed environment
exploration approach. As part of the learning process, these networks base future actions
on a random sample of past beneficial actions.
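As referenced in the Q-learning item above, here is a minimal sketch of the tabular Q-learning update with epsilon-greedy exploration; the hyperparameters epsilon, alpha and gamma are illustrative:

```python
import random
from collections import defaultdict

# Q[(state, action)] -> estimated value of taking `action` in `state`.
Q = defaultdict(float)

def choose_action(state, actions, epsilon=0.1):
    if random.random() < epsilon:                      # explore occasionally
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])   # otherwise exploit

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(next_state, a)] for a in actions)
    # Move Q(s, a) toward the observed reward plus the discounted best future value.
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```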
Machine Learning

Markov Decision Process (MDP)

We can formally describe an MDP as m = (S, A, P, R, γ), where:

• S represents the set of all states.
• A represents the set of possible actions.
• P represents the transition probabilities.
• R represents the rewards.
• γ (gamma) is known as the discount factor (more on this later).
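One way to hold these five components in code (a minimal sketch with illustrative field names):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                                # S
    actions: List[str]                               # A
    transitions: Dict[Tuple[str, str, str], float]   # P[(s, a, s')] -> probability
    rewards: Dict[str, float]                        # R[s] -> reward
    gamma: float                                     # discount factor
```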

Machine Learning

Markov Decision Process (MDP)


• The goal of the MDP m is to find a policy, often denoted as π, that
yields the optimal long-term reward.
• Policies are simply a mapping of each state s to a distribution over
actions a.
• For each state s, the agent should take action a with a certain probability.
• Alternatively, policies can also be deterministic (i.e. the agent always takes
action a in state s).
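A sketch of both kinds of policy (the state and action names match the example on the next slide; the probabilities here are illustrative):

```python
# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_pi = {
    "S1": {"a": 0.9, "b": 0.1},   # in S1, take action a with probability 0.9
    "S2": {"a": 1.0},
}

# Deterministic policy: each state maps to a single action.
deterministic_pi = {
    "S1": "a",
    "S2": "a",
}
```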

Machine Learning

MDP

[Figure: transition diagram with states S1–S5, rewards R(S1)=0, R(S2)=0, R(S3)=5, R(S4)=0, R(S5)=2, and edges labeled with actions and probabilities; notation T(S, a, S') = P]

State   Action   Probability   Next State
S1      a        0.9           S2
S1      a        0.1           S4
S1      b        1             S5
S2      a        1             S3
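The same transition table and rewards, encoded as nested dictionaries (a sketch; T[s][a] maps each next state to its probability):

```python
# Transition probabilities T(s, a, s') from the table above.
T = {
    "S1": {"a": {"S2": 0.9, "S4": 0.1},
           "b": {"S5": 1.0}},
    "S2": {"a": {"S3": 1.0}},
}

# Rewards read off the diagram.
R = {"S1": 0, "S2": 0, "S3": 5, "S4": 0, "S5": 2}
```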
Machine Learning

Markov Decision Process (MDP)


The Bellman Equation is central to Markov Decision Processes.
It outlines a framework for determining the optimal expected reward at a
state s by answering the question: "what is the maximum reward an
agent can receive if it takes the optimal action now and at every future
decision?"

Machine Learning

Bellman Equation

The discount factor γ (gamma) lies between 0 and 1 (inclusive).

If γ is set to 0, the future-value term V(s') is completely cancelled out and the model only
cares about the immediate reward.

If γ is set to 1, the model weights potential future rewards just as much as it
weights immediate rewards.

The optimal value of γ is usually somewhere in between, such that
farther-out rewards have a diminishing effect.
Machine Learning

Bellman Equation

V(s) = r0 + γ·r1 + γ²·r2 + …

Machine Learning

Bellman Equation

[Figure: states S1, S2, S3 connected by action a, with rewards R(S1)=0, R(S2)=0, R(S3)=5; a state S5 with R(S5)=1]

V = 0 + 0.9 · 5 + (0.9)² · 5 + …
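The same computation in code: summing discounted rewards with γ = 0.9. The reward sequence 0, 5, 5, … follows the slide's example, truncated after a few terms:

```python
def discounted_return(rewards, gamma):
    # V = r0 + gamma*r1 + gamma^2*r2 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# First few terms of the example above: 0 + 0.9*5 + 0.9^2*5 + ...
print(discounted_return([0, 5, 5, 5, 5], gamma=0.9))
```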
Machine Learning
Q-learning: Markov Decision Process + Reinforcement
Learning

Machine Learning
Q-learning: Markov Decision Process + Reinforcement
Learning
Maze Example: Utility

• Define the reward of being in a state:


– R(s) = -0.04 if s is an empty state
– R(4,3) = +1 (maximum reward when the goal is reached)
– R(4,2) = -1 (avoid (4,2) as much as possible)
• Define the utility of a sequence of states:
– U(s0, …, sN) = R(s0) + R(s1) + … + R(sN)
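A sketch of these definitions in code, with grid cells written here as (column, row) tuples:

```python
def maze_reward(s):
    # Rewards from the maze example above.
    if s == (4, 3):
        return +1       # goal state
    if s == (4, 2):
        return -1       # state to avoid
    return -0.04        # any empty state

def utility(states):
    # U(s0, ..., sN) = R(s0) + R(s1) + ... + R(sN)  (no discounting yet)
    return sum(maze_reward(s) for s in states)
```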
Machine Learning

Maze Example: No Uncertainty

• States: locations in maze grid


• Actions: Moves up/left/down/right
• If no uncertainty: Find sequence of actions from current state
to goal (+1) that maximizes utility

Machine Learning

What we are looking for: Policy


• Policy = mapping from states to actions: π(s) = a
-> Which action should I take in each state?
• In the maze example, π(s) associates a motion to a particular
location on the grid
• For any state s, we define the utility U(s) of s as the sum of
discounted rewards of the sequence of states starting at state s
generated by using the policy π

U(s) = R(s) + γ R(s1) + γ² R(s2) + …

• where we move from s to s1 by action π(s),
• we move from s1 to s2 by action π(s1),
• …etc.
Machine Learning

Maze Example: No Uncertainty

Optimal Policy = the policy π* that maximizes the expected
utility U(s) of the sequence of states generated by π*, starting at s
• π*((1,1)) = UP
• π*((1,3)) = RIGHT
• π*((4,1)) = LEFT

Machine Learning

Maze Example: With Uncertainty

• The robot may not execute exactly the action that is
commanded -> the outcome of an action is no longer deterministic

• Uncertainty:
– We know in which state we are (fully observable)
– But we are not sure that the commanded action will be executed
exactly
Machine Learning

Uncertainty
• No uncertainty:
– An action a deterministically causes a transition from a
state s to another state s’

• With uncertainty:
– An action a causes a transition from a state s to another
state s’ with some probability T(s,a,s’)
– T(s,a,s’) is called the transition probability from state s to
state s’ through action a
– In general, we need |S|² × |A| numbers to store all the
transition probabilities
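For example, the full transition model can be stored as a 3-D array of shape (|S|, |A|, |S|) (a sketch using NumPy; the probabilities come from the earlier S1–S5 example):

```python
import numpy as np

n_states, n_actions = 5, 2          # |S| = 5, |A| = 2
# T[s, a, s2] = probability of reaching s2 from s via action a;
# the array holds |S| * |A| * |S| = |S|^2 * |A| numbers in total.
T = np.zeros((n_states, n_actions, n_states))
T[0, 0, 1] = 0.9    # S1 --a--> S2 with probability 0.9
T[0, 0, 3] = 0.1    # S1 --a--> S4 with probability 0.1
```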

Machine Learning

Maze Example: With Uncertainty

• We can no longer find a unique sequence of actions, but

• we can look for a policy that tells us which action to
take from each state, where the policy now maximizes the
expected utility.

Machine Learning

Maze Example: Utility Revisited

U(s) = Expected reward of future states starting at s

How to compute U after one step?

Machine Learning

Maze Example: Utility Revisited

Suppose s = (1,1) and we choose action Up.

Machine Learning

Maze Example: Utility Revisited (Same, with Discount)

Suppose s = (1,1) and we choose action Up.

Machine Learning

More General Expression

If we choose action a at state s, expected future rewards are:

U(s) = R(s) + γ Σs’ T(s,a,s’) U(s’)
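In code, the right-hand side for a single action looks like this (a sketch reusing the nested T and R dictionaries from the earlier example):

```python
def expected_utility(s, a, U, T, R, gamma=0.9):
    # U(s) = R(s) + gamma * sum over s' of T(s, a, s') * U(s')
    successors = T.get(s, {}).get(a, {})      # {} if (s, a) has no transitions
    return R[s] + gamma * sum(p * U[s2] for s2, p in successors.items())
```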

Machine Learning

More General Expression

If we are using policy π, we choose action a = π(s) at state s, and the
expected future rewards are:

Uπ(s) = R(s) + γ Σs' T(s, π(s), s') Uπ(s')
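A sketch of evaluating a fixed policy by repeatedly applying this update until the utilities stop changing (it assumes the nested T and R dictionaries used in the earlier sketches):

```python
def evaluate_policy(pi, states, T, R, gamma=0.9, tol=1e-6):
    # U_pi(s) = R(s) + gamma * sum_s' T(s, pi(s), s') * U_pi(s')
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            out = T.get(s, {}).get(pi.get(s), {})   # {} for terminal states
            new_u = R[s] + gamma * sum(p * U[s2] for s2, p in out.items())
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:                              # converged
            return U
```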

