Reinforcement Learning

Reinforcement learning is a general machine learning framework in which an agent learns to achieve a goal by interacting with its environment. It is one of three main types of machine learning, alongside supervised and unsupervised learning. The slides describe it as more general than the other two because the agent learns by trial and error from interactions with an uncertain environment. The agent's objective is to maximize the total reward received over time. Markov decision processes provide a mathematical framework for defining reinforcement learning problems and finding optimal policies. Algorithms for solving reinforcement learning problems include dynamic programming, Monte Carlo methods, and temporal-difference learning.

Uploaded by

Abhay thakur B-2 13

Reinforcement Learning

Learning Algorithms
• Supervised learning
– classification, regression
• Unsupervised learning
– clustering
• Reinforcement learning
– more general than supervised/unsupervised learning
– learn from interaction w/ environment to achieve a goal
[Figure: agent-environment loop. The agent sends an action to the environment; the environment returns a reward and a new state.]
Reinforcement Learning

Training Info = evaluations (“rewards” / “penalties”)

Inputs → RL System → Outputs (“actions”)

Objective: get as much reward as possible

Key Features of RL
• Learner is not told which actions to take
• Trial-and-Error search
• Possibility of delayed reward (sacrifice short-term
gains for greater long-term gains).
• The need to explore and exploit
• Considers the whole problem of a goal-directed
agent interacting with an uncertain environment.
Elements of RL

[Figure: Policy, Reward, Value, Model of environment]

• Policy: what to do
• Reward: what is good
• Value: what is good because it predicts reward
• Model: what follows what
Outline
• Examples

• Defining an RL problem
– Markov Decision Processes

• Solving an RL problem
– Dynamic Programming
– Monte Carlo methods
– Temporal-Difference learning
Example: Tic-Tac-Toe

[Figure: game tree of tic-tac-toe positions; successive levels alternate between x’s move and o’s move]

Assume an imperfect opponent: he/she sometimes makes mistakes
An RL Approach to Tic-Tac-Toe
1. Make a table with one entry per state:

State | V(s) – estimated probability of winning
undecided position | .5 (?)
three x’s in a row | 1 (win)
three o’s in a row | 0 (loss)
full board, no line | 0 (draw)

2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states.

Just pick the next state with the highest estimated prob. of winning, i.e. the largest V(s); a greedy move.

But 10% of the time pick a move at random; an exploratory move.
RL Learning Rule for Tic-Tac-Toe

[Figure: sequence of game states, with one move marked as an “exploratory” move]

s – the state before our greedy move

s’ – the state after our greedy move

We increment each V(s) toward V(s’) – a backup:

$$V(s) \leftarrow V(s) + \alpha \left[ V(s') - V(s) \right]$$

α – a small positive fraction, e.g., α = .1
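The backup rule on this slide can be sketched in a few lines of Python. This is only an illustration: the dictionary `V` and the state labels `"before"` and `"after_win"` are hypothetical placeholders for real board positions, and the initial values (0.5 for an undecided state, 1.0 for a won state) follow the table on the previous slide.

```python
def backup(V, s, s_next, alpha=0.1):
    """Move V[s] a fraction alpha of the way toward V[s_next]."""
    V[s] = V[s] + alpha * (V[s_next] - V[s])
    return V[s]

# Hypothetical labels standing in for two board positions.
V = {"before": 0.5, "after_win": 1.0}
backup(V, "before", "after_win")
print(V["before"])  # slightly above 0.5: the estimate moved toward the winning state
```

Repeating the backup over many games is what gradually propagates the win/loss values back to earlier positions.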

Robot in a Room
actions: UP, DOWN, LEFT, RIGHT
UP: 80% move UP, 10% move LEFT, 10% move RIGHT
[Figure: grid world with a START cell, a +1 cell, and a -1 cell]

• reward +1 at [4,3], -1 at [4,2]

• reward -0.04 for each step
• what’s the strategy to achieve max reward?
• what if the actions were deterministic?
Other examples
• Pole-balancing
• TD-Gammon [Gerry Tesauro]
• Helicopter [Andrew Ng]
• No teacher who would say “good” or “bad”
– is reward “10” good or bad?
– rewards could be delayed
• Similar to control theory
– more general, fewer constraints
• Explore the environment and learn from
experience
– not just blind search, try to be smart about it
Resource allocation in datacenters

[Figure: a load balancer in front of application A, application B, application C]

• A Hybrid Reinforcement Learning Approach to Autonomic Resource Allocation
– Tesauro, Jong, Das, Bennani (IBM)
– ICAC 2006
Robot in a room
actions: UP, DOWN, LEFT, RIGHT
UP: 80% move UP, 10% move LEFT, 10% move RIGHT
reward +1 at [4,3], -1 at [4,2]
reward -0.04 for each step
[Figure: grid world with a START cell, a +1 cell, and a -1 cell]

• states
• actions
• rewards
• what is the solution?
Is this a solution?
[Figure: grid with the +1 and -1 cells and a candidate solution drawn on it]

• only if actions deterministic

– not in this case (actions are stochastic)

• solution/policy
– mapping from each state to an action
Optimal policy
[Figure: optimal policy on the grid with the +1 and -1 cells]
[Figures: the optimal policy for different per-step rewards, one grid each: -2, -0.1, -0.04, -0.01, +0.01]
Markov Decision Process (MDP)
[Figure: agent-environment loop: action from agent to environment; reward and new state from environment to agent]
• set of states S, set of actions A, initial state S0
• transition model P(s,a,s’)
– P( [1,1], up, [1,2] ) = 0.8
• reward function r(s)
– r( [4,3] ) = +1
• goal: maximize cumulative reward in the long run

• policy: mapping from S to A

– π(s) or π(s,a) (deterministic vs. stochastic)
• reinforcement learning
– transitions and rewards usually not available
– how to change the policy based on experience
– how to explore the environment
Computing return from rewards
• episodic (vs. continuing) tasks
– “game over” after N steps
– optimal policy depends on N; harder to analyze

• additive rewards
– V(s0, s1, …) = r(s0) + r(s1) + r(s2) + …
– infinite value for continuing tasks

• discounted rewards
– V(s0, s1, …) = r(s0) + γ*r(s1) + γ^2*r(s2) + …
– value bounded if rewards bounded (and γ < 1)
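As a small worked example of the discounted sum, the sketch below computes r(s0) + γ*r(s1) + γ^2*r(s2) + … for a finite reward sequence. The function name `discounted_return` and the sample reward list are illustrative assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Hypothetical trajectory: two steps at -0.04 each, then the +1 cell.
rewards = [-0.04, -0.04, 1.0]
print(discounted_return(rewards, gamma=1.0))  # plain additive return
print(discounted_return(rewards, gamma=0.9))  # later rewards count for less
```

With gamma = 1.0 this reduces to the additive case on the slide; any gamma below 1 shrinks the weight of late rewards geometrically, which is what keeps the sum bounded on continuing tasks.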
Value functions
• state value function: V^π(s)
– expected return when starting in s and following π

• state-action value function: Q^π(s,a)

– expected return when starting in s, performing a, and following π
• useful for finding the optimal policy
– can estimate from experience
– pick the best action using Q^π(s,a)
[Figure: backup diagram from state s through action a and reward r to next state s’]

• Bellman equation
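The equation itself did not survive in the text of this slide. One standard form, written with the notation used earlier in the deck and assuming state rewards r(s), discount factor γ, transition model P(s,a,s’), and a deterministic policy π(s), would be:

```latex
V^{\pi}(s) = r(s) + \gamma \sum_{s'} P(s, \pi(s), s') \, V^{\pi}(s')

Q^{\pi}(s, a) = r(s) + \gamma \sum_{s'} P(s, a, s') \, V^{\pi}(s')
```

Each value is the immediate reward plus the discounted expected value of the successor state, which is why these equations are linear in the unknowns V^π(s).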
Optimal value functions
• there’s a set of optimal policies
– V^π defines partial ordering on policies
– they share the same optimal value function

• Bellman optimality equation

– system of n non-linear equations
– solve for V*(s)
– easy to extract the optimal policy
[Figure: backup diagram from state s through action a and reward r to next state s’]

• having Q*(s,a) makes it even simpler
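The optimality equation is not preserved in the extracted text. Under the same assumptions as the rest of the deck (state rewards r(s), discount γ, transition model P(s,a,s’)), a standard way to write it and the policy extraction step is:

```latex
V^{*}(s) = r(s) + \gamma \max_{a} \sum_{s'} P(s, a, s') \, V^{*}(s')

\pi^{*}(s) = \arg\max_{a} \sum_{s'} P(s, a, s') \, V^{*}(s')

Q^{*}(s, a) = r(s) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q^{*}(s', a')

\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
```

The max over actions is what makes the system non-linear. Extracting a policy from V* still needs the transition model, while extracting it from Q* is a plain argmax over actions.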

Dynamic programming
• main idea
– use value functions to structure the search for good
policies
– need a perfect model of the environment

• two main components

– policy evaluation: compute V^π from π
– policy improvement: improve π based on V^π

– start with an arbitrary policy

– repeat evaluation/improvement until convergence
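The evaluation/improvement loop can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: the two-state MDP (`"low"`, `"high"`), its actions (`"stay"`, `"go"`), the nested-dict layout `P[s][a] = {next_state: probability}`, and the state-reward convention r(s) with a discount of 0.9.

```python
def evaluate_policy(P, r, policy, gamma=0.9, tol=1e-10):
    """Iterate V(s) <- r(s) + gamma * sum_s' P(s, pi(s), s') V(s') until stable."""
    V = {s: 0.0 for s in P}
    while True:
        V_new = {s: r[s] + gamma * sum(p * V[s2] for s2, p in P[s][policy[s]].items())
                 for s in P}
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new

def improve_policy(P, V):
    """Greedy policy w.r.t. V; r(s) and gamma do not change the argmax over actions."""
    return {s: max(P[s], key=lambda a: sum(p * V[s2] for s2, p in P[s][a].items()))
            for s in P}

# Hypothetical two-state MDP: only the "high" state pays a reward.
P = {"low":  {"stay": {"low": 1.0}, "go": {"high": 0.8, "low": 0.2}},
     "high": {"stay": {"high": 1.0}, "go": {"low": 1.0}}}
r = {"low": 0.0, "high": 1.0}

policy = {"low": "stay", "high": "stay"}  # arbitrary starting policy
while True:
    V = evaluate_policy(P, r, policy)
    new_policy = improve_policy(P, V)
    if new_policy == policy:              # no change means the policy is optimal
        break
    policy = new_policy
print(policy)
```

The outer loop stops as soon as improvement leaves the policy unchanged, which is the convergence test described on the slide.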
Policy evaluation/improvement
• policy evaluation: π -> V^π
– Bellman eqn’s define a system of n eqn’s
– could solve, but will use iterative version

– start with an arbitrary value function V0, iterate until Vk converges

• policy improvement: V^π -> π’

– π’ either strictly better than π, or π’ is optimal (if π = π’)
Policy/Value iteration
• Policy iteration

– two nested iterations; too slow

– don’t need to converge to V^πk (the value of the current policy πk)
• just move towards it

• Value iteration

– use Bellman optimality equation as an update

– converges to V*
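A minimal value-iteration sketch for the robot-in-a-room example follows. Several details are assumptions rather than facts from the slides: a 4x3 grid with no interior walls, the +1 and -1 cells treated as terminal, a move into the border leaving the robot in place, slips being perpendicular to the intended direction for every action, and a discount of 0.99.

```python
ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
# Assumed slip directions (10% each), perpendicular to the intended move.
SLIPS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
         "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4)]  # assumed 4x3, no walls
STEP_REWARD = -0.04

def move(s, a):
    """Intended effect of action a; bumping into the border leaves s unchanged."""
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s

def transitions(s, a):
    """(probability, next_state) pairs for the 80/10/10 model."""
    left, right = SLIPS[a]
    return [(0.8, move(s, a)), (0.1, move(s, left)), (0.1, move(s, right))]

def expected_value(V, s, a):
    return sum(p * V[s2] for p, s2 in transitions(s, a))

def value_iteration(gamma=0.99, tol=1e-9):
    """Apply the Bellman optimality update in place until values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                new_v = TERMINALS[s]
            else:
                new_v = STEP_REWARD + gamma * max(expected_value(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

def greedy_policy(V):
    """Extract a policy by picking the action with the best expected next value."""
    return {s: max(ACTIONS, key=lambda a: expected_value(V, s, a))
            for s in STATES if s not in TERMINALS}

V = value_iteration()
policy = greedy_policy(V)
print(policy[(3, 3)])  # the cell next to +1
```

Changing `STEP_REWARD` to the other per-step rewards mentioned in the deck (for example -2 or +0.01) is a quick way to see how the resulting policy shifts.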
Using DP
• need complete model of the environment and
rewards
– robot in a room
• state space, action space, transition model

• can we use DP to solve

– robot in a room?
– backgammon?
– helicopter?
Outline
• examples

• defining an RL problem
– Markov Decision Processes
• solving an RL problem
– Dynamic Programming
– Monte Carlo methods
– Temporal-Difference learning
• miscellaneous
– state representation
– function approximation
– rewards
Monte Carlo methods
• don’t need full knowledge of environment
– just experience, or
– simulated experience

• but similar to DP
– policy evaluation, policy improvement

• averaging sample returns

– defined only for episodic tasks
Monte Carlo policy evaluation
• want to estimate V^π(s)
= expected return starting from s and following π
– estimate as average of observed returns in state s
• first-visit MC
– average returns following the first visit to state s
[Figure: sample episodes starting from s0; one episode is annotated with rewards +1, -2, 0, +1, -3, +5. Observed returns: R1(s) = +2, R2(s) = +1, R3(s) = -5, R4(s) = +4]

V(s) ≈ (2 + 1 – 5 + 4)/4 = 0.5

Monte Carlo control
• V not enough for policy improvement
– need exact model of environment

• estimate Q(s,a)

• MC control
– update after each episode

• non-stationary environment

• a problem
– greedy policy won’t explore all actions
Maintaining exploration
• deterministic/greedy policy won’t explore all
actions
– don’t know anything about the environment at the
beginning
– need to try all actions to find the optimal one
• maintain exploration
– use soft policies instead: π(s,a) > 0 (for all s,a)
• ε-greedy policy
– with probability 1-ε perform the optimal/greedy
action
– with probability ε perform a random action
– will keep exploring the environment
– slowly move it towards greedy policy: ε -> 0
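An ε-greedy choice takes only a few lines. In this sketch the Q-table is assumed to be a dictionary keyed by `(state, action)` with missing entries read as 0.0, and the function name and example values are illustrative.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Hypothetical Q-values for a single state.
Q = {("s", "LEFT"): 0.2, ("s", "RIGHT"): 0.7}
print(epsilon_greedy(Q, "s", ["LEFT", "RIGHT"], epsilon=0.0))  # RIGHT
```

Because the random branch can return any action, every action keeps a probability of at least ε divided by the number of actions, so the policy stays soft. Shrinking `epsilon` over time moves it toward the greedy policy.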
Simulated experience
• 5-card draw poker
– s0: A, A, 6, A, 2
– a0: discard 6, 2
– s1: A, A, A, A, 9 + dealer takes 4 cards
– return: +1 (probably)
• DP
– list all states, actions, compute P(s,a,s’)
• P( [A,A,6,A,2], [6,2], [A,9,4] ) = 0.00192
• MC
– all you need are sample episodes
– let MC play against a random policy, or itself, or
another algorithm
Summary of Monte Carlo
• don’t need model of environment
– averaging of sample returns
– only for episodic tasks
• learn from sample episodes or simulated
experience
• can concentrate on “important” states
– don’t need a full sweep
• need to maintain exploration
– use soft policies
Temporal Difference Learning
• combines ideas from MC and DP
– like MC: learn directly from experience (don’t need a
model)
– like DP: learn from values of successors
– works for continuing tasks, usually faster than MC

• constant-alpha MC:
– have to wait until the end of episode to update

• simplest TD
– update after every step, based on the successor
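The simplest TD update, usually called TD(0), can be sketched as follows. The dictionary-based value table, the `terminal` flag, and the default step size and discount are assumptions for illustration.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Move V[s] toward the one-step target r + gamma * V[s_next]."""
    target = r + (0.0 if terminal else gamma * V.get(s_next, 0.0))
    v = V.get(s, 0.0)
    V[s] = v + alpha * (target - v)
    return V[s]

# Hypothetical values: one observed step from A to B with reward 0.
V = {"A": 0.0, "B": 0.75}
td0_update(V, "A", 0.0, "B")
print(V["A"])  # moved a fraction alpha of the way toward V["B"]
```

Unlike constant-alpha MC, the target here is available right after the step, because it uses the current estimate of the successor instead of the full return.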
MC vs. TD
• observed the following 8 episodes:
A – 0, B – 0
B – 1 (six episodes)
B – 0

• MC and TD agree on V(B) = 3/4

• MC: V(A) = 0
– converges to values that minimize the error on training data
• TD: V(A) = 3/4
– converges to ML estimate
[Figure: Markov model of the data: A goes to B 100% of the time with r=0; from B the episode ends with r=1 75% of the time and with r=0 25% of the time]
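The two answers can be reproduced numerically on the eight episodes. This is a sketch under stated assumptions: episodes are lists of `(state, reward)` pairs, there is no discounting, and the TD estimate is obtained by batch TD(0), meaning the same data is swept repeatedly with a small step size until the values settle.

```python
def batch_mc(episodes):
    """Average the observed returns per state (each state appears once per episode)."""
    returns = {}
    for ep in episodes:
        G = 0.0
        for state, r in reversed(ep):
            G += r
            returns.setdefault(state, []).append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}

def batch_td0(episodes, alpha=0.01, sweeps=5000):
    """Repeated sweeps of TD(0); updates are accumulated and applied once per sweep."""
    V = {s: 0.0 for ep in episodes for s, _ in ep}
    for _ in range(sweeps):
        delta = {s: 0.0 for s in V}
        for ep in episodes:
            for t, (s, r) in enumerate(ep):
                v_next = V[ep[t + 1][0]] if t + 1 < len(ep) else 0.0
                delta[s] += alpha * (r + v_next - V[s])
        for s in V:
            V[s] += delta[s]
    return V

episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)] for _ in range(6)] + [[("B", 0)]]
mc = batch_mc(episodes)
td = batch_td0(episodes)
print(mc["A"], mc["B"])
print(round(td["A"], 3), round(td["B"], 3))
```

MC only sees that the single episode through A ended with total reward 0. TD instead bootstraps from V(B), so the estimate for A inherits the 3/4 learned from all eight visits to B.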
Sarsa
• again, need Q(s,a), not just V(s)
[Figure: trajectory st, at, rt, st+1, at+1, rt+1, st+2, at+2]

• control
– start with a random policy
– update Q and π after each step
– again, need ε-soft policies
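The name Sarsa comes from the five quantities used in one update: s, a, r, s’, a’. A minimal sketch of that update, assuming a dictionary Q-table keyed by `(state, action)` and illustrative defaults for the step size and discount:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: the target uses the action actually chosen in s_next."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (target - q)
    return Q[(s, a)]

# Hypothetical one-step experience: (s0, RIGHT) -> reward 0 -> (s1, RIGHT).
Q = {("s1", "RIGHT"): 1.0}
sarsa_update(Q, "s0", "RIGHT", 0.0, "s1", "RIGHT")
print(Q[("s0", "RIGHT")])
```

In a control loop, `a_next` would come from the current ε-soft policy, and the policy would be re-derived from Q after every step.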
Q-learning
• before: on-policy algorithms
– start with a random policy, iteratively improve
– converge to optimal

• Q-learning: off-policy
– use any policy to estimate Q

– Q directly approximates Q* (Bellman optimality eqn)

– independent of the policy being followed
– only requirement: keep updating each (s,a) pair

• Sarsa
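The Q-learning update differs from Sarsa in one place: the target takes the maximum over next actions instead of the action the behavior policy actually picked. A minimal sketch, assuming a dictionary Q-table keyed by `(state, action)` and illustrative parameter defaults:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: the target uses the best action available in s_next."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (r + gamma * best_next - q)
    return Q[(s, a)]

# Hypothetical Q-values: RIGHT is the better action in s1.
Q = {("s1", "LEFT"): 0.0, ("s1", "RIGHT"): 1.0}
q_learning_update(Q, "s0", "RIGHT", 0.0, "s1", ["LEFT", "RIGHT"])
print(Q[("s0", "RIGHT")])
```

The update never asks which action was taken in `s_next`, which is why the learned values do not depend on the policy being followed, as long as every (s,a) pair keeps being updated.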
State representation
• pole-balancing
– move car left/right to keep the pole balanced
• state representation
– position and velocity of car
– angle and angular velocity of pole
• what about Markov property?
– would need more info
– noise in sensors, temperature, bending of pole
• solution
– coarse discretization of 4 state variables
• left, center, right
– totally non-Markov, but still works
Function approximation
• represent Vt as a parameterized function
– linear regression, decision tree, neural net, …
– linear regression:

• update parameters instead of entries in a table

– better generalization
• fewer parameters and updates affect “similar” states as well

• TD update
– treat as one data point (x, y) for regression
– want method that can learn on-line (update after each step)
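With a linear approximator the value is a weighted sum of features, and the TD update adjusts the weights instead of a table entry. The sketch below assumes V(s) = w · x(s) with plain Python lists for the weight vector `w` and feature vectors `x`; the function name and defaults are illustrative.

```python
def linear_td0_update(w, x, r, x_next, alpha=0.1, gamma=0.9):
    """One on-line TD(0) step for V(s) = w . x(s); returns the new weights."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    td_error = r + gamma * v_next - v
    # The gradient of w . x with respect to w is x, so each weight moves by its feature.
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]

# Hypothetical two-feature example: reward 1 observed on the step from x to x_next.
w = linear_td0_update([0.0, 0.0], x=[1.0, 0.0], r=1.0, x_next=[0.0, 1.0])
print(w)
```

Only weights whose features are active in the current state change, so states that share features with it are updated as a side effect, which is the generalization the slide refers to.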
Features
• tile coding, coarse coding
– binary features

• radial basis functions

– typically a Gaussian
– between 0 and 1
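A Gaussian radial basis feature can be sketched for a one-dimensional state variable. The centers, the shared `width`, and the example input are assumptions made for illustration.

```python
import math

def rbf_features(x, centers, width=1.0):
    """One Gaussian feature per center; each value lies in (0, 1], peaking at 1."""
    return [math.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centers]

# Hypothetical centers spread over the range of the state variable.
features = rbf_features(0.0, centers=[-1.0, 0.0, 1.0])
print(features)
```

Compared with the binary features of tile coding, each feature here varies smoothly with distance from its center.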
Splitting and aggregation
• want to discretize the state space
– learn the best discretization during training
• splitting of state space
– start with a single state
– split a state when different parts of that state have
different values

• state aggregation
– start with many states
– merge states with similar values
Designing rewards
• robot in a maze
– episodic task, not discounted, +1 when out, 0 for each step
• chess
– GOOD: +1 for winning, -1 losing
– BAD: +0.25 for taking opponent’s pieces
• high reward even when lose
• rewards
– rewards indicate what we want to accomplish
– NOT how we want to accomplish it
• shaping
– positive reward often very “far away”
– rewards for achieving subgoals (domain knowledge)
– also: adjust initial policy or initial value function
Case study: Backgammon
• rules
– 30 pieces, 24 locations
– roll 2, 5: move 2, 5
– hitting, blocking
– branching factor: 400

• implementation
– use TD(λ) and neural nets
– 4 binary features for each position on board (# white pieces)
– no BG expert knowledge

• results
– TD-Gammon 0.0: trained against itself (300,000 games)
• as good as best previous BG computer program (also by Tesauro)
• lot of expert input, hand-crafted features
– TD-Gammon 1.0: add special features
– TD-Gammon 2 and 3 (2-ply and 3-ply search)
• 1.5M games, beat human champion
Summary
• Reinforcement learning
– use when you need to make decisions in an uncertain environment
• solution methods
– dynamic programming
• need complete model
– Monte Carlo
– temporal-difference learning (Sarsa, Q-learning)
• where most of the work goes
– the algorithms are simple
– need to design features, state representation, rewards
