Reinforcement Learning: Nguyen Do Van, PhD

This document provides an overview of reinforcement learning. It begins by introducing reinforcement learning and applications such as robotics and gaming. It then discusses key concepts, including agents and environments, rewards, states, and the trade-off between exploration and exploitation. It introduces Markov decision processes and how they model reinforcement learning problems. Finally, it overviews dynamic programming techniques (iterative policy evaluation, policy iteration, and value iteration) for finding optimal policies.


REINFORCEMENT LEARNING

Nguyen Do Van, PhD


Reinforcement Learning

§ Introduction
§ Markov Decision Process
§ Dynamic Programming

REINFORCEMENT LEARNING INTRODUCTION
Intelligent agents learning and acting
Sequences of decisions and rewards
Reinforcement learning: What is it?

§ Making good decisions on new tasks: a fundamental challenge in AI and ML
§ Learn to make a good sequence of decisions
§ Intelligent agents learning and acting
q Learning by trial and error, in real time
q Improving with experience
q Inspired by psychology:
• Agent + environment
• The agent selects actions to maximize cumulative reward
Characteristics of Reinforcement Learning

§ What makes reinforcement learning different from other machine learning paradigms?
q There is no supervisor, only a reward signal
q Feedback is delayed, not instantaneous
q Time really matters (sequential, non-i.i.d. data)
q The agent's actions affect the subsequent data it receives
RL Applications

§ Application areas at the Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2017):
q Robotics
q Video games
q Conversational systems
q Medical intervention
q Algorithm improvement
q Improvisational theatre
q Autonomous driving
q Prosthetic arm control
q Financial trading
q Query completion

Robotics

https://www.youtube.com/watch?v=ZBFwe1gF0FU
Gaming

RL vs supervised and unsupervised learning

Practical and technical challenges:
- Need access to the environment
- Jointly learning and planning from correlated samples
- The data distribution changes with the action choices
Rewards

§ A reward Rt is a scalar feedback signal
§ It indicates how well the agent is doing at step t
§ The agent's job is to maximize cumulative reward
§ Examples:
q Robot navigation: (-) crashing into a wall, (+) reaching the target…
q Controlling a power station: (+) producing power, (-) exceeding safety thresholds
q Games: (+) winning the game, killing an enemy, collecting health items; (-) hitting a mine
Agent and Environment

§ At each step t, the agent:
q Executes action At
q Receives observation Ot
q Receives scalar reward Rt
§ The environment:
q Receives action At
q Emits observation Ot+1
q Emits scalar reward Rt+1
§ t increments at each environment step (this loop is sketched in code below)
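The interaction loop above translates directly into code. Below is a minimal sketch; the toy coin-flip environment and its reset/step interface are illustrative assumptions, not from the slides or any specific library:

    import random

    class CoinFlipEnv:
        """Toy environment: the agent earns +1 for guessing a coin flip."""
        def reset(self):
            self.t = 0
            return 0                                   # initial observation O_1
        def step(self, action):
            self.t += 1
            coin = random.randint(0, 1)
            reward = 1.0 if action == coin else 0.0    # scalar reward R_{t+1}
            return coin, reward, self.t >= 10          # O_{t+1}, R_{t+1}, done

    def run_episode(env, policy):
        obs, total, done = env.reset(), 0.0, False
        while not done:
            action = policy(obs)                       # agent executes A_t
            obs, reward, done = env.step(action)       # environment responds
            total += reward
        return total

    # A random policy ignores the observation and guesses uniformly.
    print(run_episode(CoinFlipEnv(), lambda obs: random.randint(0, 1)))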
History and State

§ The history is the sequence of observations, actions and rewards:
Ht = O1, R1, A1, ..., At-1, Ot, Rt
§ State: the information used to determine what happens next in a trajectory
q St = f(Ht)
q Environment state: the environment's private representation
q Agent state: the agent's internal representation
q Information state (Markov property): all useful information from the history
Fully and Partially Observable Environments

§ Full observability:
q The agent fully observes the environment state
q Agent state = environment state = information state
q This is a Markov Decision Process (details later)
§ Partial observability: the agent indirectly or partially observes the environment
q e.g., a robot with first-person cameras
q The agent state differs from the environment state
q The agent must construct its own state representation
Major Components of an RL Agent

§ Policy - maps the current state to an action
§ Value function - a prediction of value for each state and action
§ Model - the agent's representation of the environment
Policy

§ Policy: the agent's behavior, i.e., how it acts in the environment
§ A map from state to action
§ Deterministic policy: a = π(s)
§ Stochastic policy: π(a|s) = P[At = a | St = s] (both forms are sketched below)
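As a rough illustration (the states, actions and probabilities below are made-up), both policy types can be represented with simple tables:

    import random

    # Deterministic policy: a = pi(s), one fixed action per state
    pi_det = {"s0": "left", "s1": "right"}

    # Stochastic policy: pi(a|s), a distribution over actions for each state
    pi_sto = {"s0": {"left": 0.8, "right": 0.2},
              "s1": {"left": 0.1, "right": 0.9}}

    def sample_action(policy, state):
        """Sample an action from a stochastic policy's distribution."""
        actions, probs = zip(*policy[state].items())
        return random.choices(actions, weights=probs, k=1)[0]

    print(pi_det["s0"])                  # always "left"
    print(sample_action(pi_sto, "s0"))   # "left" about 80% of the time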
Value Function

§ Value function: a prediction of future reward (how much future reward the agent expects)
§ Used to evaluate the goodness or badness of states
§ The agent selects actions leading to the best states according to the value function (maximizing expected reward)
Model

§ A model predicts what the environment will do next
§ P predicts the next state: T(s,a,s') = p(s'|s,a)
§ R predicts the immediate (not future) reward: R(s,a) = E[Rt+1 | St = s, At = a] (a table-based sketch follows)
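A sketch of what such a model can look like for a small discrete environment: empirical tables for transitions and immediate rewards, updated incrementally from experience. The two-state setup and update rule are assumptions for illustration:

    import numpy as np

    n_states, n_actions = 2, 2
    P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] estimates p(s'|s, a)
    R = np.zeros((n_states, n_actions))            # R[s, a] estimates E[R_{t+1}|s, a]
    counts = np.zeros((n_states, n_actions))

    def update_model(s, a, r, s_next):
        """Incrementally average observed transitions and rewards."""
        counts[s, a] += 1
        P[s, a] += (np.eye(n_states)[s_next] - P[s, a]) / counts[s, a]
        R[s, a] += (r - R[s, a]) / counts[s, a]

    update_model(0, 1, 1.0, 1)
    print(P[0, 1], R[0, 1])   # -> [0. 1.] and 1.0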
Maze Example
Rewards: -1 per time-step
Actions: N, E, S, W
States: the agent's location

[Figure slides: the maze's policy, value function and model]
Categorizing Reinforcement Learning Agents

§ By how the agent acts:
q Value-based: value function, no explicit policy
q Policy-based: policy, no value function
q Actor-critic: both a policy and a value function
§ By how the environment is treated:
q Model-free: interacts directly with the environment
q Model-based: learns a model of the environment
Learning and Planning

Two settings for sequential decision making:
q Reinforcement learning
• The environment is initially unknown
• The agent interacts with the environment
• The agent improves its policy
q Planning
• A model of the environment is known
• The agent computes with the model, without external interaction
• The agent improves its policy
Exploration and Exploitation

§ RL solves problems by trial-and-error learning
§ Agents must learn good policies
§ Agents learn by acting in their environments
§ Reward may not arrive at every step; it may come only at the end of a game
§ Exploration: discovering more about the environment
§ Exploitation: using what is already known to maximize reward
§ There is a trade-off between exploration and exploitation (an epsilon-greedy sketch follows)
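A classic way to trade off the two is epsilon-greedy action selection. Here is a minimal sketch on a made-up three-armed bandit (the reward means are assumptions):

    import random

    means = [0.2, 0.5, 0.8]        # true (unknown) mean reward per action, assumed
    estimates = [0.0, 0.0, 0.0]    # the agent's running reward estimates
    counts = [0, 0, 0]
    epsilon = 0.1                  # fraction of steps spent exploring

    for step in range(10000):
        if random.random() < epsilon:
            a = random.randrange(len(means))                        # explore
        else:
            a = max(range(len(means)), key=lambda i: estimates[i])  # exploit
        r = random.gauss(means[a], 1.0)                             # noisy reward
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]   # incremental mean

    print(estimates)   # roughly approaches [0.2, 0.5, 0.8]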
Recap on RL introduction

§ Sequences of decisions and rewards
§ State; full and partial observability
§ Main components: policy, value function, model
§ Categorizing RL agents
§ Learning and planning
MARKOV DECISION PROCESS
Markov decision process: a model of a finite-state environment
Bellman equation
Dynamic programming
Markov Decision Process
(Model of the environment)

§ Terminology: an MDP is a tuple (S, A, P, R, γ), where
q S is a finite set of states
q A is a finite set of actions
q P is the state-transition model, T(s,a,s') = p(s'|s,a)
q R is the reward function
q γ ∈ [0, 1] is the discount factor
Markov Decision Process

• Markov property: the distribution over future states depends only on the present state and action, not on any earlier event:
p(St+1 | St, At) = p(St+1 | S1, A1, ..., St, At)
• Goal: maximize the return Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + ...
• Episodic task: the return is taken over a finite horizon (e.g. games, a maze).
• Continuing task: the return is taken over an infinite horizon (e.g. juggling, balancing); discounting with γ < 1 keeps it finite.
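The return can be computed recursively from the end of an episode, since Gt = Rt+1 + γ Gt+1. A small sketch:

    def discounted_return(rewards, gamma=0.9):
        """Return G_1 for a finite episode, using G_t = r + gamma * G_{t+1}."""
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    print(discounted_return([0.0, 0.0, 1.0]))   # 0 + 0.9*(0 + 0.9*1) ≈ 0.81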
How do we get good decisions?

§ Defining behavior: the policy
q A policy defines the action-selection strategy at every state
• Goal: find the policy that maximizes expected total reward
Value functions

§ The expected return of a policy from a state is called its value function
• A naive strategy to find the optimal policy:
• Enumerate the space of all policies
• Estimate the expected return of each one
• Keep the policy with the maximum expected return

Gridworld example:
- Reward for stepping off the grid: -1
- Reward for staying on the grid: 0
- Special rewards at states A and B
Value functions

§ Value of a policy (one-step expansion):
Vπ(s) = Σa π(a|s) Σs' T(s,a,s') [ R(s,a,s') + γ Vπ(s') ]
Note: T(s,a,s') = p(s'|s,a); a numeric backup example follows.
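To make the notation concrete, here is a sketch of one backup of this equation for a single state, with made-up numbers for π, T, R and the current estimate of Vπ:

    gamma = 0.9
    pi_s = {"left": 0.5, "right": 0.5}                 # pi(a|s), assumed
    T = {("left", "s1"): 1.0, ("right", "s2"): 1.0}    # T(s, a, s'), assumed
    R = {("left", "s1"): 0.0, ("right", "s2"): 1.0}    # R(s, a, s'), assumed
    V = {"s1": 2.0, "s2": 3.0}                         # current estimate of V_pi

    # V_pi(s) = sum_a pi(a|s) * sum_s' T(s,a,s') * (R(s,a,s') + gamma * V_pi(s'))
    v_s = sum(pi_s[a] * p * (R[(a, s2)] + gamma * V[s2])
              for (a, s2), p in T.items())
    print(v_s)   # 0.5*(0 + 0.9*2) + 0.5*(1 + 0.9*3) = 2.75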
Bellman’s equation

§ State value function (for a fixed policy π, with discount γ):
Vπ(s) = Σa π(a|s) Σs' T(s,a,s') [ R(s,a,s') + γ Vπ(s') ]
• State-action value function (Q-function):
Qπ(s,a) = Σs' T(s,a,s') [ R(s,a,s') + γ Vπ(s') ]
• When S is a finite set of states, this is a system of linear equations (one per state)
• Bellman's equation in matrix form: Vπ = Rπ + γ Pπ Vπ (solved directly in the sketch below)
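Since the matrix form is linear, a small MDP can be evaluated exactly with one linear solve, Vπ = (I - γPπ)⁻¹ Rπ. A sketch with an assumed two-state chain under a fixed policy:

    import numpy as np

    gamma = 0.9
    P = np.array([[0.8, 0.2],     # P_pi[s, s'] under the fixed policy, assumed
                  [0.3, 0.7]])
    R = np.array([1.0, -1.0])     # R_pi[s], expected immediate reward, assumed

    # Solve (I - gamma * P) V = R rather than forming the inverse explicitly.
    V = np.linalg.solve(np.eye(2) - gamma * P, R)
    print(V)   # the exact value function of the fixed policy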
Optimal Value, Q and policy

§ Optimal V: the highest possible value of each state s under any possible policy
§ It satisfies the Bellman optimality equation:
V*(s) = maxa Σs' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
§ Optimal Q-function:
Q*(s,a) = Σs' T(s,a,s') [ R(s,a,s') + γ maxa' Q*(s',a') ]
§ Optimal policy (extraction sketched below):
π*(s) = argmaxa Q*(s,a)
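Once Q* is known (or approximated), extracting the optimal policy is just a row-wise argmax. A sketch with made-up Q values:

    import numpy as np

    Q = np.array([[0.5, 1.2],    # Q[s, a] for 2 states x 2 actions, assumed
                  [0.9, 0.1]])
    pi = Q.argmax(axis=1)        # pi*(s) = argmax_a Q*(s, a)
    print(pi)                    # -> [1 0]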
Dynamic Programming (DP)

§ Assumes full knowledge of the Markov Decision Process
§ Used for planning in an MDP
§ For prediction:
q Input: MDP (S,A,P,R,γ) and policy π
q Output: value function vπ
§ For control:
q Input: MDP (S,A,P,R,γ)
q Output: optimal value function v* and optimal policy π*
DP: Iterative Policy Evaluation

§ Main idea of dynamic programming: turn Bellman equations, e.g. Vπ = Rπ + γ Pπ Vπ, into update rules
§ Problem: evaluate a given policy π
§ Iterative policy evaluation: fix the policy and iterate the backup Vk+1 = Rπ + γ Pπ Vk until it converges to Vπ (sketched below)
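A direct sketch of that update rule, reusing an assumed two-state chain under a fixed policy; it converges to the same answer as the exact linear solve:

    import numpy as np

    gamma = 0.9
    P = np.array([[0.8, 0.2], [0.3, 0.7]])   # P_pi, assumed
    R = np.array([1.0, -1.0])                # R_pi, assumed

    V = np.zeros(2)
    while True:
        V_new = R + gamma * P @ V            # Bellman expectation backup
        if np.max(np.abs(V_new - V)) < 1e-8: # stop once the update is tiny
            break
        V = V_new
    print(V_new)   # matches np.linalg.solve(np.eye(2) - gamma * P, R)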
DP: Improving a Policy

§ Finding a good policy: policy iteration
q Evaluate the current policy π to get vπ
q Improve the policy by acting greedily with respect to vπ: π' = greedy(vπ)
q Repeat until the policy stops changing (see the sketch below)
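A compact policy iteration sketch on a made-up two-state, two-action MDP; the transition tensor T[s,a,s'] and rewards R[s,a] are assumptions, and each evaluation step is done exactly with a linear solve:

    import numpy as np

    gamma = 0.9
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # T[s, a, s'] from state 0, assumed
                  [[0.7, 0.3], [0.1, 0.9]]])  # ... and from state 1
    R = np.array([[1.0, 0.0],                 # R[s, a], assumed
                  [0.5, 2.0]])

    pi = np.zeros(2, dtype=int)               # start from an arbitrary policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly
        P_pi = T[np.arange(2), pi]
        R_pi = R[np.arange(2), pi]
        V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V
        Q = R + gamma * T @ V                 # Q[s, a] by one-step lookahead
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):        # stable policy => optimal
            break
        pi = pi_new
    print(pi, V)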
Gridworld example

[Two figure slides illustrating policy iteration on the gridworld]
DP: Value Iteration

§ Finding a good policy: value iteration
q Drawback of policy iteration: the evaluation step itself requires an inner iteration
q Main idea: turn the Bellman optimality equation into an iterative update rule, in the same way as policy evaluation (a sketch follows):
Vk+1(s) = maxa Σs' T(s,a,s') [ R(s,a,s') + γ Vk(s') ]
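A minimal value iteration sketch on the same assumed two-state MDP as in the policy iteration example:

    import numpy as np

    gamma = 0.9
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # T[s, a, s'], assumed
                  [[0.7, 0.3], [0.1, 0.9]]])
    R = np.array([[1.0, 0.0], [0.5, 2.0]])    # R[s, a], assumed

    V = np.zeros(2)
    while True:
        Q = R + gamma * T @ V                 # Bellman optimality backup
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new
    print(V_new, Q.argmax(axis=1))            # optimal values, greedy policy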
DP: Pros and Cons

§ Dynamic programming is rarely used in real applications:
q It requires access to the environment model - full observability and complete knowledge of the environment
q It is hard to extend to continuous actions and states
§ However, it is:
q Mathematically exact, expressible and analyzable
q A good fit for small problems
q Stable, simple and fast
Visualization and Codes

§ https://cs.stanford.edu/people/karpathy/reinforcejs/index.html
Recap on Reinforcement Learning

§ Introduction to RL
q Intelligent agents learning and acting
q Sequences of decisions and rewards
§ Markov Decision Process
q Model of a finite-state environment
q Bellman equation
q Dynamic programming
§ Next:
q Online learning
Questions?

THANK YOU!
