Lec 04 Reinforcement Learning
4.1 Markov Decision Processes
Reinforcement Learning
So far:
I Supervised learning: lots of expert demonstrations required
I Use of auxiliary, short-term loss functions
I Imitation learning: per-frame loss on action
I Direct perception: per-frame loss on affordance indicators
Now:
I Learning of models based on the loss that we actually care about, e.g.:
I Minimize time to target location
I Minimize number of collisions
I Minimize risk
I Maximize comfort
I etc.
Unsupervised Learning:
I Dataset: {xi} with xi = data (no labels). Goal: Discover the structure underlying the data
I Examples: Clustering, dimensionality reduction, feature learning, etc.
Reinforcement Learning:
I An agent interacts with an environment which provides numeric reward signals
I Goal: Learn how to take actions in order to maximize reward
I Examples: Learning manipulation or control tasks (any task that involves interaction)
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017. 5
Introduction to Reinforcement Learning
[Figure: agent-environment loop — at each time step t the agent observes state st, takes action at, and the environment returns reward rt and next state st+1]
https://gym.openai.com/envs/#classic_control
Example: Robot Locomotion
http://blog.openai.com/roboschool/
https://gym.openai.com/envs/#mujoco
Example: Atari Games
http://blog.openai.com/gym-retro/
https://gym.openai.com/envs/#atari
Example: Go
www.deepmind.com/research/alphago/
Example: Self-Driving
https://gym.openai.com/envs/CarRacing-v0/
Reinforcement Learning: Overview
[Figure: agent-environment loop — the agent observes state st, takes action at, and receives reward rt and next state st+1 from the environment]
Markov Decision Process
A Markov Decision Process (MDP) models the environment and is defined by the tuple
(S, A, R, P, γ)
with
I S : set of possible states
I A: set of possible actions
I R(rt |st , at ): distribution of current reward given (state,action) pair
I P (st+1 |st , at ): distribution over next state given (state,action) pair
I γ: discount factor (determines value of future rewards)
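To make the tuple concrete, here is a minimal sketch of a toy MDP written as plain Python data structures (a hypothetical two-state example, not one from the lecture; rewards are given as expected values rather than the full distribution R(rt|st, at)):

# MDP (S, A, R, P, γ) for a hypothetical two-state problem
S = ["cool", "overheated"]          # set of possible states
A = ["slow", "fast"]                # set of possible actions
gamma = 0.9                         # discount factor

# P[s][a] = list of (probability, next_state) pairs, i.e. P(s_{t+1} | s_t, a_t)
P = {
    "cool":       {"slow": [(1.0, "cool")],
                   "fast": [(0.8, "cool"), (0.2, "overheated")]},
    "overheated": {"slow": [(1.0, "overheated")],
                   "fast": [(1.0, "overheated")]},
}

# R[s][a] = expected immediate reward for the (state, action) pair
R = {
    "cool":       {"slow": 1.0, "fast": 2.0},
    "overheated": {"slow": -1.0, "fast": -1.0},
}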
Policy
A policy π specifies what action to take in each state (for a deterministic policy, a mapping from S to A):
I A policy fully defines the behavior of an agent
I Deterministic policy: a = π(s)
I Stochastic policy: π(a|s) = P (at = a|st = s)
Remark:
I MDP policies depend only on the current state and not the entire history
I However, the current state may include past observations
Exploration vs. Exploitation
Answer: We need to explore the state/action space. Thus RL combines two tasks:
I Exploration: Try a novel action a in state s , observe reward rt
I Discovers more information about the environment, but sacrifices total reward
I Game-playing example: Play a novel experimental move
19
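One common way to balance the two is ε-greedy action selection; a minimal sketch (the Q-table layout and ε = 0.1 are assumptions for illustration):

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability ε explore (random action), otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(actions)                    # exploration: try a novel action
    return max(actions, key=lambda a: Q[(state, a)])     # exploitation: best known action

# Example with a small Q-table:
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}
action = epsilon_greedy(Q, "s0", ["left", "right"])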
4.2 Bellman Optimality and Q-Learning
Value Functions
How good is a state?
The state-value function V^π(s_t) is the expected cumulative discounted reward when starting in s_t and following policy π:
V^π(s_t) = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, π ]
I The discount factor γ < 1 determines the present value of future rewards at the current time t
I Weights immediate reward higher than future reward
(e.g., γ = 1/2 ⇒ γ^k = 1, 1/2, 1/4, 1/8, 1/16, ...)
I Determines the agent's far-/short-sightedness
I Avoids infinite returns in cyclic Markov processes
Value Functions
How good is a state-action pair?
The action-value function Q^π(s_t, a_t) is the expected cumulative discounted reward when taking action a_t in state s_t and following policy π thereafter:
Q^π(s_t, a_t) = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, a_t, π ]
I The discount factor γ ∈ [0, 1] determines the present value of future rewards at the current time t
I Weights immediate reward higher than future reward
(e.g., γ = 1/2 ⇒ γ^k = 1, 1/2, 1/4, 1/8, 1/16, ...; a short worked example follows below)
I Determines the agent's far-/short-sightedness
I Avoids infinite returns in cyclic Markov processes
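As a tiny worked example of the discounted return Σ_{k≥0} γ^k r_{t+k} (the reward sequence is arbitrary):

def discounted_return(rewards, gamma=0.5):
    """Σ_k γ^k r_{t+k}: later rewards are weighted down geometrically."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# γ = 0.5 weights the rewards by 1, 1/2, 1/4, 1/8, ...
print(discounted_return([1.0, 1.0, 1.0, 1.0]))   # 1 + 0.5 + 0.25 + 0.125 = 1.875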
Optimal Value Functions
The optimal state-value function V ∗ (st ) is the best V π (st ) over all policies π:
V^*(s_t) = max_π V^π(s_t),   where   V^π(s_t) = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, π ]
The optimal action-value function Q∗ (st , at ) is the best Qπ (st , at ) over all policies π:
Q^*(s_t, a_t) = max_π Q^π(s_t, a_t),   where   Q^π(s_t, a_t) = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, a_t, π ]
I The optimal value functions specify the best possible performance in the MDP
I However, searching over all possible policies π is computationally intractable
Optimal Policy
The optimal policy π^* acts greedily with respect to the optimal action-value function: π^*(s) = argmax_a Q^*(s, a)
A Simple Grid World Example
I States: the cells of the grid, including terminal states (marked cells)
I Actions = {right, left, up, down}
I Reward: r = −1 for each transition
Objective: Reach one of the terminal states (marked cells) in the least number of actions
I Penalty (negative reward) given for every transition made
A tabular sketch of this grid world is given below.
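A minimal tabular sketch of such a grid world, solved by repeatedly applying the Bellman optimality backup Q(s, a) ← r + γ max_{a′} Q(s′, a′); the 4×4 size and the two terminal corner cells are assumptions, and γ = 1 for this episodic task:

N = 4                                        # assumed 4x4 grid
terminals = {(0, 0), (N - 1, N - 1)}         # assumed terminal cells (marked in the figure)
actions = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0)}
states = [(i, j) for i in range(N) for j in range(N)]
Q = {(s, a): 0.0 for s in states for a in actions}

def step(state, action):
    """Deterministic transition: move if possible, stay otherwise; reward is always -1."""
    i, j = state
    di, dj = actions[action]
    ni = min(max(i + di, 0), N - 1)
    nj = min(max(j + dj, 0), N - 1)
    return (ni, nj), -1.0

# Bellman optimality backups until convergence (50 sweeps are plenty for a 4x4 grid)
for _ in range(50):
    for (s, a) in Q:
        if s in terminals:
            continue                                   # terminal states keep Q = 0
        s_next, r = step(s, a)
        v_next = 0.0 if s_next in terminals else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] = r + v_next                         # γ = 1

# Greedy policy: in every state, pick the action with the highest Q-value
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}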
A Simple Grid World Example
[Figure: a random policy on the grid world]
I The arrows indicate equal probability of moving into each of the directions
Solving for the Optimal Policy
Bellman Optimality Equation:
Q^*(s_t, a_t) = E[ r_t + γ max_{a′} Q^*(s_{t+1}, a′) | s_t, a_t ]
Deep Q-Learning: use a deep neural network with parameters θ as a function approximator of the optimal action-value function (a sketch follows below):
Q(s, a; θ) ≈ Q^*(s, a)
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 36
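A minimal PyTorch sketch of such a Q-network for a vector-valued state (layer sizes and state/action dimensions are illustrative assumptions, not the convolutional architecture of Mnih et al.):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action: Q(s, ·; θ)."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)

# Greedy action selection: a = argmax_a Q(s, a; θ)
q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)
action = q_net(state).argmax(dim=1).item()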
Training the Q Network
Forward Pass:
The loss function is the mean-squared error in Q-values:
L(θ) = E[ ( r_t + γ max_{a′} Q(s_{t+1}, a′; θ) − Q(s_t, a_t; θ) )^2 ]
Backward Pass:
Gradient update with respect to the Q-function parameters θ:
∇_θ L(θ) = ∇_θ E[ ( r_t + γ max_{a′} Q(s_{t+1}, a′; θ) − Q(s_t, a_t; θ) )^2 ]
Optimize objective end-to-end with stochastic gradient descent (SGD) using ∇θ L(θ).
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 37
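A hedged sketch of one such gradient step in PyTorch, reusing the hypothetical q_net from the sketch above (here the same network also computes the target, which is treated as a constant; the fixed-target variant follows two slides later):

import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    """Mean-squared TD error: ( r + γ max_a' Q(s', a'; θ) − Q(s, a; θ) )^2."""
    s, a, r, s_next, done = batch                              # mini-batch tensors
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s_t, a_t; θ)
    with torch.no_grad():                                      # target treated as a constant
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

# One gradient step:
# optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
# loss = dqn_loss(q_net, batch); optimizer.zero_grad(); loss.backward(); optimizer.step()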
Experience Replay
To speed up training we would like to train on mini-batches:
I Problem: Learning from consecutive samples is inefficient
I Reason: Strong correlations between consecutive samples
I Solution: Store transitions (st, at, rt, st+1) in a replay memory D and sample mini-batches uniformly at random, which breaks these correlations (see the sketch below)
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 38
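A minimal replay-memory sketch (capacity and batch size are arbitrary assumptions):

import random
from collections import deque

class ReplayMemory:
    """Stores transition tuples (s_t, a_t, r_t, s_{t+1}, done) and samples random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)              # old transitions are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)    # uniform sampling breaks correlations
        return list(zip(*batch))                          # columns: (states, actions, rewards, ...)

    def __len__(self):
        return len(self.buffer)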
Fixed Q Targets
Problem: Non-stationary targets
I As the policy changes, so do our targets: rt + γ max_{a′} Q(st+1, a′; θ)
I This may lead to oscillation or divergence
I Solution: Compute the targets with a separate target network whose parameters θ− are held fixed and only periodically copied from θ (see the sketch below)
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 39
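A sketch of such a fixed target network with periodic hard updates (the copy interval is an assumption; DDPG's soft variant appears later):

import copy
import torch

target_net = copy.deepcopy(q_net)          # θ− starts as a copy of θ
target_net.requires_grad_(False)           # no gradients flow into the target network

def td_target(r, s_next, done, gamma=0.99):
    # r_t + γ max_a' Q(s_{t+1}, a'; θ−), computed with the frozen target network
    return r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

def maybe_update_target(step, every=10_000):
    if step % every == 0:
        target_net.load_state_dict(q_net.state_dict())   # θ− ← θ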
Putting it together
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 40
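A compact sketch of how the pieces fit together (ε-greedy exploration, replay memory, fixed targets); it assumes a Gymnasium-style environment with vector observations and reuses the hypothetical QNetwork and ReplayMemory from the sketches above, so it is an illustration rather than the exact procedure of Mnih et al.:

import random
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, target_net, memory, steps=100_000, gamma=0.99,
              eps=0.1, batch_size=32, lr=1e-4, target_every=10_000):
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    s, _ = env.reset()
    for step in range(steps):
        # ε-greedy behavior policy
        if random.random() < eps:
            a = env.action_space.sample()
        else:
            s_t = torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)
            a = q_net(s_t).argmax(dim=1).item()
        s_next, r, terminated, truncated, _ = env.step(a)
        memory.push(s, a, r, s_next, float(terminated))
        s = env.reset()[0] if (terminated or truncated) else s_next

        if len(memory) >= batch_size:
            states, actions, rewards, next_states, dones = memory.sample(batch_size)
            bs = torch.as_tensor(states, dtype=torch.float32)
            ba = torch.as_tensor(actions, dtype=torch.int64)
            br = torch.as_tensor(rewards, dtype=torch.float32)
            bs_next = torch.as_tensor(next_states, dtype=torch.float32)
            bdone = torch.as_tensor(dones, dtype=torch.float32)
            q_sa = q_net(bs).gather(1, ba.unsqueeze(1)).squeeze(1)
            with torch.no_grad():   # fixed target: r + γ max_a' Q(s', a'; θ−)
                target = br + gamma * (1.0 - bdone) * target_net(bs_next).max(dim=1).values
            loss = F.mse_loss(q_sa, target)
            optimizer.zero_grad(); loss.backward(); optimizer.step()

        if step % target_every == 0:
            target_net.load_state_dict(q_net.state_dict())   # θ− ← θ

# Usage sketch (assumption): env = gymnasium.make("CartPole-v1"), q_net = QNetwork(4, 2),
# target_net = copy.deepcopy(q_net), memory = ReplayMemory()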
Case Study: Playing Atari Games
[Figure: DQN on Atari — the state is a stack of recent raw game frames, the actions are the discrete joystick/button commands, and the reward is the change in game score]
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 42
Case Study: Playing Atari Games
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 43
Deep Q-Learning Shortcomings
I Q-learning requires a max/argmax over actions, so it only handles discrete, low-dimensional action spaces
I Continuous control tasks (e.g., steering angles, torques) need a different approach (see DDPG below)
Deep Deterministic Policy Gradients
DDPG addresses the problem of continuous action spaces.
Problem: Finding a continuous action requires optimization at every timestep.
Solution: Use two networks, an actor (deterministic policy) and a critic.
I Actor: a deterministic policy µ(s; θ^µ) that maps a state s to a continuous action a = µ(s; θ^µ)
I Critic: an action-value network Q(s, a; θ^Q) that evaluates the action chosen by the actor
Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016. 45
Deep Deterministic Policy Gradients
The actor is trained with the deterministic policy gradient, backpropagating through the critic:
∇_{θ^µ} E_{(s_t,a_t,r_t,s_{t+1})∼D}[ Q(s_t, µ(s_t; θ^µ); θ^Q) ] = E[ ∇_a Q(s_t, a; θ^Q)|_{a=µ(s_t; θ^µ)} ∇_{θ^µ} µ(s_t; θ^µ) ]
I Remark: No maximization over actions is required, as action selection is now performed by the learned actor µ(·) (see the sketch below)
Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016. 46
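In code, this chain rule comes for free from automatic differentiation: maximize the critic's value of the actor's action. A minimal sketch with hypothetical actor/critic modules following the interfaces named above:

import torch

def ddpg_actor_update(actor, critic, actor_optim, states):
    """Ascend E[ Q(s, µ(s; θ^µ); θ^Q) ] by descending its negative."""
    actions = actor(states)                         # a = µ(s; θ^µ), differentiable w.r.t. θ^µ
    actor_loss = -critic(states, actions).mean()    # gradient flows through the critic into the actor
    actor_optim.zero_grad()
    actor_loss.backward()                           # implements ∇_a Q · ∇_{θ^µ} µ via backprop
    actor_optim.step()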
Deep Deterministic Policy Gradients
Experience replay and target networks are again used to stabilize training:
I Replay memory D stores transition tuples (st , at , rt , st+1 )
I Target networks are updated using “soft” target updates
I Weights are not directly copied but slowly adapted:
θ^{Q−} ← τ θ^Q + (1 − τ) θ^{Q−}
θ^{µ−} ← τ θ^µ + (1 − τ) θ^{µ−}
where 0 < τ ≪ 1 controls the tradeoff between speed and stability of learning (see the sketch below)
I Exploration is handled by adding noise from a random process N to the actor's output: a = µ(s; θ^µ) + N
Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016. 47
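A sketch of the soft target update and of exploration noise (τ and the Gaussian noise are illustrative choices; Lillicrap et al. use an Ornstein-Uhlenbeck process for N):

import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.001):
    """θ_target ← τ θ_online + (1 − τ) θ_target, applied parameter-wise."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p_o)

def noisy_action(actor, state, noise_std=0.1):
    """Exploration: a = µ(s; θ^µ) + N (Gaussian noise here as a simple stand-in)."""
    with torch.no_grad():
        a = actor(state)
        return a + noise_std * torch.randn_like(a)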
Prioritized Experience Replay
Transitions are not sampled uniformly from the replay memory but with a probability that grows with the magnitude of their temporal-difference (TD) error
δ = r_t + γ max_{a′} Q(s_{t+1}, a′; θ^{Q−}) − Q(s_t, a_t; θ^Q)
so that surprising transitions are replayed more often (a simplified sketch follows below).
Kendall, Hawke, Janz, Mazur, Reda, Allen, Lam, Bewley and Shah: Learning to Drive in a Day. ICRA, 2019. 49
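A simplified sketch of proportional prioritization based on this TD error (the exponent α and constant ε follow the common formulation and are assumptions here):

import numpy as np

def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) ∝ (|δ_i| + ε)^α: larger TD error → higher replay probability."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

# Example: sample a mini-batch of indices according to these probabilities
td_errors = np.array([0.1, 2.0, 0.5, 0.05])
probs = sampling_probabilities(td_errors)
batch_idx = np.random.choice(len(td_errors), size=2, p=probs, replace=False)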
Learning to Drive in a Day
Kendall, Hawke, Janz, Mazur, Reda, Allen, Lam, Bewley and Shah: Learning to Drive in a Day. ICRA, 2019. 50
Other flavors of Deep RL
Asynchronous Deep Reinforcement Learning
Mnih et al.: Asynchronous Methods for Deep Reinforcement Learning. ICML, 2016. 52
Bootstrapped DQN
Bootstrapping for efficient exploration:
I Approximate a distribution over Q-values via K bootstrapped "heads"
I At the start of each epoch, a single head Qk is selected uniformly at random
I After training, all heads can be combined into a single ensemble policy
[Figure: a shared network body θ_shared computes features of the state s and feeds K separate Q-value heads Q_1(·; θ^{Q_1}), ..., Q_K(·; θ^{Q_K}); a sketch follows below]
Osband et al.: Deep Exploration via Bootstrapped DQN. NIPS, 2016. 53
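A minimal sketch of such a multi-head Q-network (layer sizes and K are assumptions):

import torch
import torch.nn as nn

class BootstrappedQNetwork(nn.Module):
    """Shared body θ_shared feeding K independent Q-value heads."""
    def __init__(self, state_dim, num_actions, num_heads=10, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, num_actions) for _ in range(num_heads)])

    def forward(self, state, head=None):
        features = self.shared(state)
        if head is not None:                  # act with the currently selected head
            return self.heads[head](features)
        return torch.stack([h(features) for h in self.heads])  # all heads, e.g. for the ensemble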
Double Q-Learning
I Decouple the Q-function used to select actions from the one used to evaluate them, to avoid Q-value overestimation and stabilize training. Targets (both computed in the sketch below):
DQN: r_t + γ max_{a′} Q(s_{t+1}, a′; θ−)
Double DQN: r_t + γ Q(s_{t+1}, argmax_{a′} Q(s_{t+1}, a′; θ); θ−)
I Online network with weights θ is used to determine greedy policy
I Target network with weights θ− is used to determine corresponding action value
I Improves performance on Atari benchmarks
van Hasselt et al.: Deep Reinforcement Learning with Double Q-learning. AAAI, 2016. 54
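A sketch of the two target computations (online net θ, target net θ−; batched Q-value tensors of shape [batch, num_actions] are assumed):

import torch

@torch.no_grad()
def dqn_target(r, s_next, done, q_target, gamma=0.99):
    # r_t + γ max_a' Q(s_{t+1}, a'; θ−): select and evaluate with the target network
    return r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values

@torch.no_grad()
def double_dqn_target(r, s_next, done, q_online, q_target, gamma=0.99):
    a_star = q_online(s_next).argmax(dim=1, keepdim=True)            # select with the online network θ
    q_eval = q_target(s_next).gather(1, a_star).squeeze(1)           # evaluate with the target network θ−
    return r + gamma * (1.0 - done) * q_eval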
Deep Recurrent Q-Learning
Add recurrence to a deep Q-network to handle partial observability of states:
[Figure: DRQN architecture — convolutional layers process individual frames, an LSTM integrates information over time, and a fully-connected output layer (FC-Out) produces the Q-values; a sketch follows below]
Hausknecht and Stone: Deep Recurrent Q-Learning for Partially Observable MDPs. AAAI, 2015 55
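A minimal recurrent Q-network sketch in the same spirit (a feature encoder followed by an LSTM and a Q-value head; sizes and the non-convolutional encoder are assumptions, not the architecture from the paper):

import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Q-values conditioned on the observation history via an LSTM."""
    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, num_actions)     # FC-Out (Q-values)

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: [batch, time, obs_dim]; the LSTM integrates over the time dimension
        features = self.encoder(obs_seq)
        out, hidden_state = self.lstm(features, hidden_state)
        return self.q_head(out), hidden_state            # Q-values per timestep + recurrent state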
Faulty Reward Functions
https://blog.openai.com/faulty-reward-functions/
Summary