Reinforcement Learning: Nguyen Do Van, PhD

This document provides an overview of reinforcement learning. It begins by introducing reinforcement learning and applications such as robotics and gaming. It then discusses key concepts, including agents and environments, rewards, states, and the trade-off between exploration and exploitation. It introduces Markov decision processes and how they model reinforcement learning problems. Finally, it overviews dynamic programming techniques (iterative policy evaluation, policy iteration, and value iteration) for finding optimal policies.


REINFORCEMENT LEARNING

Nguyen Do Van, PhD


Reinforcement Learning

§ Introduction
§ Markov Decision Process
§ Dynamic Programming

REINFORCEMENT LEARNING INTRODUCTION
Intelligent agents learning and acting
Sequences of decisions and rewards
Reinforcement learning: What is it?

§ Making good decisions on new tasks: a fundamental challenge in AI and ML
§ Learn to make a good sequence of decisions
§ Intelligent agents learning and acting
q Learning by trial and error, in real time
q Improving with experience
q Inspired by psychology:
• Agent + environment
• The agent selects actions to maximize cumulative reward
Characteristics of Reinforcement Learning

§ What makes reinforcement learning different from other machine learning paradigms?
q There is no supervisor, only a reward signal
q Feedback is delayed, not instantaneous
q Time really matters (sequential, non-i.i.d. data)
q The agent's actions affect the subsequent data it receives
RL Applications

§ Application areas at the Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2017):
q Robotics
q Video games
q Conversational systems
q Medical intervention
q Algorithm improvement
q Improvisational theatre
q Autonomous driving
q Prosthetic arm control
q Financial trading
q Query completion

Robotics

https://www.youtube.com/watch?v=ZBFwe1gF0FU
Gaming

RL vs supervised and unsupervised learning

Practical and technical challenges:
- Need access to the environment
- Jointly learning and planning from correlated samples
- The data distribution changes with the action choices
Rewards

§ A reward Rt is a scalar feedback signal
§ It indicates how well the agent is doing at step t
§ The agent's job is to maximize cumulative reward
§ Examples:
q Robot navigation: (-) crashing into a wall, (+) reaching the target…
q Controlling a power station: (+) producing power, (-) exceeding safety thresholds
q Games: (+) winning the game, killing an enemy, collecting health items; (-) hitting a mine
Agent and Environment

§ At each step t, the agent:
q Executes action At
q Receives observation Ot
q Receives scalar reward Rt
§ The environment:
q Receives action At
q Emits observation Ot+1
q Emits scalar reward Rt+1
§ t increments at each environment step (this loop is sketched in code below)
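The interaction loop above translates directly into code. Below is a minimal sketch; the toy coin-flip environment and its reset/step interface are illustrative assumptions, not from the slides or any specific library:

    import random

    class CoinFlipEnv:
        """Toy environment: the agent earns +1 for guessing a coin flip."""
        def reset(self):
            self.t = 0
            return 0                                   # initial observation O_1
        def step(self, action):
            self.t += 1
            coin = random.randint(0, 1)
            reward = 1.0 if action == coin else 0.0    # scalar reward R_{t+1}
            return coin, reward, self.t >= 10          # O_{t+1}, R_{t+1}, done

    def run_episode(env, policy):
        obs, total, done = env.reset(), 0.0, False
        while not done:
            action = policy(obs)                       # agent executes A_t
            obs, reward, done = env.step(action)       # environment responds
            total += reward
        return total

    # A random policy ignores the observation and guesses uniformly.
    print(run_episode(CoinFlipEnv(), lambda obs: random.randint(0, 1)))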
History and State

§ The history is the sequence of observations, actions and rewards:
Ht = O1, R1, A1, ..., At-1, Ot, Rt
§ State: the information used to determine what happens next in a trajectory
q St = f(Ht)
q Environment state: the environment's private representation
q Agent state: the agent's internal representation
q Information state (Markov property): all useful information from the history
Fully and Partially Observable Environments

§ Full observability:
q The agent fully observes the environment state
q Agent state = environment state = information state
q This is a Markov Decision Process (details later)
§ Partial observability: the agent indirectly or partially observes the environment
q e.g., a robot with first-person cameras
q The agent state differs from the environment state
q The agent must construct its own state representation
Major Components of an RL Agent

§ Policy - maps the current state to an action
§ Value function - a prediction of value for each state and action
§ Model - the agent's representation of the environment
Policy

§ Policy: the agent's behavior, i.e., how it acts in the environment
§ A map from state to action
§ Deterministic policy: a = π(s)
§ Stochastic policy: π(a|s) = P[At = a | St = s] (both forms are sketched below)
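As a rough illustration (the states, actions and probabilities below are made-up), both policy types can be represented with simple tables:

    import random

    # Deterministic policy: a = pi(s), one fixed action per state
    pi_det = {"s0": "left", "s1": "right"}

    # Stochastic policy: pi(a|s), a distribution over actions for each state
    pi_sto = {"s0": {"left": 0.8, "right": 0.2},
              "s1": {"left": 0.1, "right": 0.9}}

    def sample_action(policy, state):
        """Sample an action from a stochastic policy's distribution."""
        actions, probs = zip(*policy[state].items())
        return random.choices(actions, weights=probs, k=1)[0]

    print(pi_det["s0"])                  # always "left"
    print(sample_action(pi_sto, "s0"))   # "left" about 80% of the time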
Value Function

§ Value function: a prediction of future reward (how much future reward the agent expects)
§ Used to evaluate the goodness or badness of states
§ The agent selects actions leading to the best states according to the value function (maximizing expected reward)
Model

§ A model predicts what the environment will do next
§ P predicts the next state: T(s,a,s') = p(s'|s,a)
§ R predicts the immediate (not future) reward: R(s,a) = E[Rt+1 | St = s, At = a] (a table-based sketch follows)
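A sketch of what such a model can look like for a small discrete environment: empirical tables for transitions and immediate rewards, updated incrementally from experience. The two-state setup and update rule are assumptions for illustration:

    import numpy as np

    n_states, n_actions = 2, 2
    P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] estimates p(s'|s, a)
    R = np.zeros((n_states, n_actions))            # R[s, a] estimates E[R_{t+1}|s, a]
    counts = np.zeros((n_states, n_actions))

    def update_model(s, a, r, s_next):
        """Incrementally average observed transitions and rewards."""
        counts[s, a] += 1
        P[s, a] += (np.eye(n_states)[s_next] - P[s, a]) / counts[s, a]
        R[s, a] += (r - R[s, a]) / counts[s, a]

    update_model(0, 1, 1.0, 1)
    print(P[0, 1], R[0, 1])   # -> [0. 1.] and 1.0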
Maze Example
Rewards: -1 per time-step
Actions: N, E, S, W
States: the agent's location

[Figure slides: the maze's policy, value function and model]
Categorizing Reinforcement Learning Agents

§ By how the agent acts:
q Value-based: value function, no explicit policy
q Policy-based: policy, no value function
q Actor-critic: both a policy and a value function
§ By how the environment is treated:
q Model-free: interacts directly with the environment
q Model-based: learns a model of the environment
Learning and Planning

Two settings for sequential decision making:
q Reinforcement learning
• The environment is initially unknown
• The agent interacts with the environment
• The agent improves its policy
q Planning
• A model of the environment is known
• The agent computes with the model, without external interaction
• The agent improves its policy
Exploration and Exploitation

§ RL solves problems by trial-and-error learning
§ Agents must learn good policies
§ Agents learn by acting in their environments
§ Reward may not arrive at every step; it may come only at the end of a game
§ Exploration: discovering more about the environment
§ Exploitation: using what is already known to maximize reward
§ There is a trade-off between exploration and exploitation (an epsilon-greedy sketch follows)
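A classic way to trade off the two is epsilon-greedy action selection. Here is a minimal sketch on a made-up three-armed bandit (the reward means are assumptions):

    import random

    means = [0.2, 0.5, 0.8]        # true (unknown) mean reward per action, assumed
    estimates = [0.0, 0.0, 0.0]    # the agent's running reward estimates
    counts = [0, 0, 0]
    epsilon = 0.1                  # fraction of steps spent exploring

    for step in range(10000):
        if random.random() < epsilon:
            a = random.randrange(len(means))                        # explore
        else:
            a = max(range(len(means)), key=lambda i: estimates[i])  # exploit
        r = random.gauss(means[a], 1.0)                             # noisy reward
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]   # incremental mean

    print(estimates)   # roughly approaches [0.2, 0.5, 0.8]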
Recap on RL introduction

§ Sequences of decisions and rewards
§ State; full and partial observability
§ Main components: policy, value function, model
§ Categorizing RL agents
§ Learning and planning
MARKOV DECISION PROCESS
Markov decision process: a model of a finite-state environment
Bellman equation
Dynamic programming
Markov Decision Process
(Model of the environment)

§ Terminology: an MDP is a tuple (S, A, P, R, γ), where
q S is a finite set of states
q A is a finite set of actions
q P is the state-transition model, T(s,a,s') = p(s'|s,a)
q R is the reward function
q γ ∈ [0, 1] is the discount factor
Markov Decision Process

• Markov property: the distribution over future states depends only on the present state and action, not on any earlier event:
p(St+1 | St, At) = p(St+1 | S1, A1, ..., St, At)
• Goal: maximize the return Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + ...
• Episodic task: the return is taken over a finite horizon (e.g. games, a maze).
• Continuing task: the return is taken over an infinite horizon (e.g. juggling, balancing); discounting with γ < 1 keeps it finite.
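The return can be computed recursively from the end of an episode, since Gt = Rt+1 + γ Gt+1. A small sketch:

    def discounted_return(rewards, gamma=0.9):
        """Return G_1 for a finite episode, using G_t = r + gamma * G_{t+1}."""
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    print(discounted_return([0.0, 0.0, 1.0]))   # 0 + 0.9*(0 + 0.9*1) ≈ 0.81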
How do we get good decisions?

§ Defining behavior: the policy
q A policy defines the action-selection strategy at every state
• Goal: find the policy that maximizes expected total reward
Value functions

§ The expected return of a policy from a state is called its value function
• A naive strategy to find the optimal policy:
• Enumerate the space of all policies
• Estimate the expected return of each one
• Keep the policy with the maximum expected return

Gridworld example:
- Reward for stepping off the grid: -1
- Reward for staying on the grid: 0
- Special rewards at states A and B
Value functions

§ Value of a policy (one-step expansion):
Vπ(s) = Σa π(a|s) Σs' T(s,a,s') [ R(s,a,s') + γ Vπ(s') ]
Note: T(s,a,s') = p(s'|s,a); a numeric backup example follows.
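To make the notation concrete, here is a sketch of one backup of this equation for a single state, with made-up numbers for π, T, R and the current estimate of Vπ:

    gamma = 0.9
    pi_s = {"left": 0.5, "right": 0.5}                 # pi(a|s), assumed
    T = {("left", "s1"): 1.0, ("right", "s2"): 1.0}    # T(s, a, s'), assumed
    R = {("left", "s1"): 0.0, ("right", "s2"): 1.0}    # R(s, a, s'), assumed
    V = {"s1": 2.0, "s2": 3.0}                         # current estimate of V_pi

    # V_pi(s) = sum_a pi(a|s) * sum_s' T(s,a,s') * (R(s,a,s') + gamma * V_pi(s'))
    v_s = sum(pi_s[a] * p * (R[(a, s2)] + gamma * V[s2])
              for (a, s2), p in T.items())
    print(v_s)   # 0.5*(0 + 0.9*2) + 0.5*(1 + 0.9*3) = 2.75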
Bellman’s equation

§ State value function (for a fixed policy π, with discount γ):
Vπ(s) = Σa π(a|s) Σs' T(s,a,s') [ R(s,a,s') + γ Vπ(s') ]
• State-action value function (Q-function):
Qπ(s,a) = Σs' T(s,a,s') [ R(s,a,s') + γ Vπ(s') ]
• When S is a finite set of states, this is a system of linear equations (one per state)
• Bellman's equation in matrix form: Vπ = Rπ + γ Pπ Vπ (solved directly in the sketch below)
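Since the matrix form is linear, a small MDP can be evaluated exactly with one linear solve, Vπ = (I - γPπ)⁻¹ Rπ. A sketch with an assumed two-state chain under a fixed policy:

    import numpy as np

    gamma = 0.9
    P = np.array([[0.8, 0.2],     # P_pi[s, s'] under the fixed policy, assumed
                  [0.3, 0.7]])
    R = np.array([1.0, -1.0])     # R_pi[s], expected immediate reward, assumed

    # Solve (I - gamma * P) V = R rather than forming the inverse explicitly.
    V = np.linalg.solve(np.eye(2) - gamma * P, R)
    print(V)   # the exact value function of the fixed policy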
Optimal Value, Q and policy

§ Optimal V: the highest possible value of each state s under any possible policy
§ It satisfies the Bellman optimality equation:
V*(s) = maxa Σs' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
§ Optimal Q-function:
Q*(s,a) = Σs' T(s,a,s') [ R(s,a,s') + γ maxa' Q*(s',a') ]
§ Optimal policy (extraction sketched below):
π*(s) = argmaxa Q*(s,a)
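Once Q* is known (or approximated), extracting the optimal policy is just a row-wise argmax. A sketch with made-up Q values:

    import numpy as np

    Q = np.array([[0.5, 1.2],    # Q[s, a] for 2 states x 2 actions, assumed
                  [0.9, 0.1]])
    pi = Q.argmax(axis=1)        # pi*(s) = argmax_a Q*(s, a)
    print(pi)                    # -> [1 0]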
Dynamic Programming (DP)

§ Assumes full knowledge of the Markov Decision Process
§ Used for planning in an MDP
§ For prediction:
q Input: MDP (S,A,P,R,γ) and policy π
q Output: value function vπ
§ For control:
q Input: MDP (S,A,P,R,γ)
q Output: optimal value function v* and optimal policy π*
DP: Iterative Policy Evaluation

§ Main idea of dynamic programming: turn Bellman equations, e.g. Vπ = Rπ + γ Pπ Vπ, into update rules
§ Problem: evaluate a given policy π
§ Iterative policy evaluation: fix the policy and iterate the backup Vk+1 = Rπ + γ Pπ Vk until it converges to Vπ (sketched below)
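A direct sketch of that update rule, reusing an assumed two-state chain under a fixed policy; it converges to the same answer as the exact linear solve:

    import numpy as np

    gamma = 0.9
    P = np.array([[0.8, 0.2], [0.3, 0.7]])   # P_pi, assumed
    R = np.array([1.0, -1.0])                # R_pi, assumed

    V = np.zeros(2)
    while True:
        V_new = R + gamma * P @ V            # Bellman expectation backup
        if np.max(np.abs(V_new - V)) < 1e-8: # stop once the update is tiny
            break
        V = V_new
    print(V_new)   # matches np.linalg.solve(np.eye(2) - gamma * P, R)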
DP: Improving a Policy

§ Finding a good policy: policy iteration
q Evaluate the current policy π to get vπ
q Improve the policy by acting greedily with respect to vπ: π' = greedy(vπ)
q Repeat until the policy stops changing (see the sketch below)
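A compact policy iteration sketch on a made-up two-state, two-action MDP; the transition tensor T[s,a,s'] and rewards R[s,a] are assumptions, and each evaluation step is done exactly with a linear solve:

    import numpy as np

    gamma = 0.9
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # T[s, a, s'] from state 0, assumed
                  [[0.7, 0.3], [0.1, 0.9]]])  # ... and from state 1
    R = np.array([[1.0, 0.0],                 # R[s, a], assumed
                  [0.5, 2.0]])

    pi = np.zeros(2, dtype=int)               # start from an arbitrary policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly
        P_pi = T[np.arange(2), pi]
        R_pi = R[np.arange(2), pi]
        V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V
        Q = R + gamma * T @ V                 # Q[s, a] by one-step lookahead
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):        # stable policy => optimal
            break
        pi = pi_new
    print(pi, V)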
Gridworld example

[Two figure slides illustrating policy iteration on the gridworld]
DP: Value Iteration

§ Finding a good policy: value iteration
q Drawback of policy iteration: the evaluation step itself requires an inner iteration
q Main idea: turn the Bellman optimality equation into an iterative update rule, in the same way as policy evaluation (a sketch follows):
Vk+1(s) = maxa Σs' T(s,a,s') [ R(s,a,s') + γ Vk(s') ]
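A minimal value iteration sketch on the same assumed two-state MDP as in the policy iteration example:

    import numpy as np

    gamma = 0.9
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # T[s, a, s'], assumed
                  [[0.7, 0.3], [0.1, 0.9]]])
    R = np.array([[1.0, 0.0], [0.5, 2.0]])    # R[s, a], assumed

    V = np.zeros(2)
    while True:
        Q = R + gamma * T @ V                 # Bellman optimality backup
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new
    print(V_new, Q.argmax(axis=1))            # optimal values, greedy policy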
DP: Pros and Cons

§ Dynamic programming is rarely used in real applications:
q It requires access to the environment model - full observability and complete knowledge of the environment
q It is hard to extend to continuous actions and states
§ However, it is:
q Mathematically exact, expressible and analyzable
q A good fit for small problems
q Stable, simple and fast
Visualization and Codes

§ https://cs.stanford.edu/people/karpathy/reinforcejs/index.html
Recap on Reinforcement Learning

§ Introduction to RL
q Intelligent agents learning and acting
q Sequences of decisions and rewards
§ Markov Decision Process
q Model of a finite-state environment
q Bellman equation
q Dynamic programming
§ Next:
q Online learning
Questions?

THANK YOU!
