Reinforcement Learning Toolbox™
User's Guide
R2022b
Getting Started
1
Reinforcement Learning Toolbox Product Description . . . . . . . . . . . . . . . 1-2
Create Environments
2
Create MATLAB Reinforcement Learning Environments . . . . . . . . . . . . . . 2-2
Action and Observation Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
Predefined MATLAB Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3
Custom MATLAB Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3
Load Predefined Control System Environments . . . . . . . . . . . . . . . . . . . . 2-23
Cart-Pole Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-23
Double Integrator Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-25
Simple Pendulum Environments with Image Observation . . . . . . . . . . . . 2-27
Create Agents
3
Reinforcement Learning Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
Built-In Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
Choose Agent Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
Model-Based Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Extract Policy Objects from Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Custom Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Deep Q-Network (DQN) Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
Critic Function Approximator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
Agent Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24
Training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24
Target Update Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-25
Actor and Critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-70
Required Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-70
Optional Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
Create Custom Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
Train Reinforcement Learning Agent for Simple Contextual Bandit Problem . . . . . . . . . . . . 5-30
Train DDPG Agent to Swing Up and Balance Pendulum with Bus Signal . . . . . . . . . . . . . . 5-109
Create Agent Using Deep Network Designer and Train Using Image Observations . . . . . . 5-138
Train DDPG Agent for Path-Following Control . . . . . . . . . . . . . . . . . . . . 5-219
Train DQN Agent for Lane Keeping Assist Using Parallel Computing . 5-227
Train DQN Agent with LSTM Network to Control House Heating System . . . . . . . . . . . . . 5-363
Deploy Trained Policies
6
Deploy Trained Reinforcement Learning Policies . . . . . . . . . . . . . . . . . . . . 6-2
Generate Code Using GPU Coder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
Generate Code Using MATLAB Coder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
1
Getting Started

Reinforcement Learning Toolbox Product Description
Reinforcement Learning Toolbox™ provides an app, functions, and a Simulink® block for training
policies using reinforcement learning algorithms, including DQN, PPO, SAC, and DDPG. You can use
these policies to implement controllers and decision-making algorithms for complex applications such
as resource allocation, robotics, and autonomous systems.
The toolbox lets you represent policies and value functions using deep neural networks or look-up
tables and train them through interactions with environments modeled in MATLAB® or Simulink. You
can evaluate the single- or multi-agent reinforcement learning algorithms provided in the toolbox or
develop your own. You can experiment with hyperparameter settings, monitor training progress, and
simulate trained agents either interactively through the app or programmatically. To improve training
performance, simulations can be run in parallel on multiple CPUs, GPUs, computer clusters, and the
cloud (with Parallel Computing Toolbox™ and MATLAB Parallel Server™).
Through the ONNX™ model format, existing policies can be imported from deep learning frameworks
such as TensorFlow™ Keras and PyTorch (with Deep Learning Toolbox™). You can generate optimized
C, C++, and CUDA® code to deploy trained policies on microcontrollers and GPUs. The toolbox
includes reference examples to help you get started.
What Is Reinforcement Learning?
The goal of reinforcement learning is to train an agent to complete a task within an unknown
environment. The agent receives observations and a reward from the environment and sends actions
to the environment. The reward is a measure of how successful an action is with respect to
completing the task goal.
The agent contains two components: a policy and a learning algorithm.

• The policy is a mapping that selects actions based on the observations from the environment.
Typically, the policy is a function approximator with tunable parameters, such as a deep neural
network.
• The learning algorithm continuously updates the policy parameters based on the actions,
observations, and reward. The goal of the learning algorithm is to find an optimal policy that
maximizes the cumulative reward received during the task.
In other words, reinforcement learning involves an agent learning the optimal behavior through
repeated trial-and-error interactions with the environment without human involvement.
As an example, consider the task of parking a vehicle using an automated driving system. The goal of
this task is for the vehicle computer (agent) to park the vehicle in the correct position and
orientation. To do so, the controller uses readings from cameras, accelerometers, gyroscopes, a GPS
receiver, and lidar (observations) to generate steering, braking, and acceleration commands
(actions). The action commands are sent to the actuators that control the vehicle. The resulting
observations depend on the actuators, sensors, vehicle dynamics, road surface, wind, and many other
less-important factors. All these factors, that is, everything that is not the agent, make up the
environment in reinforcement learning.
To learn how to generate the correct actions from the observations, the computer repeatedly tries to
park the vehicle using a trial-and-error process. To guide the learning process, you provide a signal
that is one when the car successfully reaches the desired position and orientation and zero otherwise
(reward). During each trial, the computer selects actions using a mapping (policy) initialized with
some default values. After each trial, the computer updates the mapping to maximize the reward
(learning algorithm). This process continues until the computer learns an optimal mapping that
successfully parks the car.
1 Formulate problem — Define the task for the agent to learn, including how the agent interacts
with the environment and any primary and secondary goals the agent must achieve.
2 Create environment — Define the environment within which the agent operates, including the
interface between agent and environment and the environment dynamic model. For more
information, see “Create MATLAB Reinforcement Learning Environments” on page 2-2 and
“Create Simulink Reinforcement Learning Environments” on page 2-8.
3 Define reward — Specify the reward signal that the agent uses to measure its performance
against the task goals and how to calculate this signal from the environment. For more
information, see “Define Reward Signals” on page 2-14.
4 Create agent — Create the agent, which includes defining a policy approximator (actor) and a
value function approximator (critic), and configuring the agent learning algorithm. For more
information, see “Create Policies and Value Functions” on page 4-2 and “Reinforcement
Learning Agents” on page 3-2.
5 Train agent — Train the agent approximators using the defined environment, reward, and agent
learning algorithm. For more information, see “Train Reinforcement Learning Agents” on page 5-
3.
6 Validate agent — Evaluate the performance of the trained agent by simulating the agent and
environment together. For more information, see “Train Reinforcement Learning Agents” on page
5-3.
7 Deploy policy — Deploy the trained policy approximator using, for example, generated GPU
code. For more information, see “Deploy Trained Reinforcement Learning Policies” on page 6-
2.
Training an agent using reinforcement learning is an iterative process. Decisions and results in later
stages can require you to return to an earlier stage in the learning workflow. For example, if the
training process does not converge to an optimal policy within a reasonable amount of time, you
might have to update some of the following before retraining the agent:
• Training settings
• Learning algorithm configuration
• Policy and value function (actor and critic) approximators
• Reward signal definition
• Action and observation signals
• Environment dynamics
See Also
More About
• “Reinforcement Learning for Control Systems Applications” on page 1-6
• “Create Simulink Environment and Train Agent” on page 1-20
Reinforcement Learning for Control Systems Applications
The environment in a control system application can include elements such as:

• Measurement noise
• Disturbance signals
• Filters
• Analog-to-digital and digital-to-analog converters

Reinforcement learning terms map to control system concepts as follows.

Observation — Any measurable value from the environment that is visible to the agent. For example, a feedback controller can observe the error signal from the environment. You can also create agents that observe, for example, the reference signal, measurement signal, and measurement signal rate of change.

Action — Manipulated variables or control actions.

Reward — Function of the measurement, error signal, or some other performance metric. For example, you can implement reward functions that minimize the steady-state error while minimizing control effort. When control specifications such as cost and constraint functions are available, you can use generateRewardFunction to generate a reward function from an MPC object or model verification blocks. You can then use the generated reward function as a starting point for reward design, for example by changing the weights or penalty functions.

Learning Algorithm — Adaptation mechanism of an adaptive controller.
Many control problems encountered in areas such as robotics and automated driving require
complex, nonlinear control architectures. Techniques such as gain scheduling, robust control, and
nonlinear model predictive control (MPC) can be used for these problems, but often require
significant domain expertise from the control engineer. For example, gains and parameters are
difficult to tune. The resulting controllers can pose implementation challenges, such as the
computational intensity of nonlinear MPC.
You can use deep neural networks, trained using reinforcement learning, to implement such complex
controllers. These systems can be self-taught without intervention from an expert control engineer.
Also, once the system is trained, you can deploy the reinforcement learning policy in a
computationally efficient way.
You can also use reinforcement learning to create an end-to-end controller that generates actions
directly from raw data, such as images. This approach is attractive for video-intensive applications,
such as automated driving, since you do not have to manually define and select image features.
See Also
More About
• “What Is Reinforcement Learning?” on page 1-3
Train Reinforcement Learning Agent in MDP Environment
This example shows how to train a Q-learning agent to solve a generic Markov decision process
(MDP) environment. For more information on these agents, see “Q-Learning Agents” on page 3-17.
Create an MDP model with eight states and two actions ("up" and "down").
MDP = createMDP(8,["up";"down"]);
To model the transitions from the above graph, modify the state transition matrix and reward matrix
of the MDP. By default, these matrices contain zeros. For more information on creating an MDP model
and the properties of an MDP object, see createMDP.
Specify the state transition and reward matrices for the MDP. For example, in the following
commands:
• The first two lines specify the transition from state 1 to state 2 by taking action 1 ("up") and a
reward of +3 for this transition.
• The next two lines specify the transition from state 1 to state 3 by taking action 2 ("down") and a
reward of +1 for this transition.
MDP.T(1,2,1) = 1;
MDP.R(1,2,1) = 3;
MDP.T(1,3,2) = 1;
MDP.R(1,3,2) = 1;
Similarly, specify the state transitions and rewards for the remaining rules in the graph.
% State 2 transition and reward
MDP.T(2,4,1) = 1;
MDP.R(2,4,1) = 2;
MDP.T(2,5,2) = 1;
MDP.R(2,5,2) = 1;
% State 3 transition and reward
MDP.T(3,5,1) = 1;
MDP.R(3,5,1) = 2;
MDP.T(3,6,2) = 1;
MDP.R(3,6,2) = 4;
% State 4 transition and reward
MDP.T(4,7,1) = 1;
MDP.R(4,7,1) = 3;
MDP.T(4,8,2) = 1;
MDP.R(4,8,2) = 2;
% State 5 transition and reward
MDP.T(5,7,1) = 1;
MDP.R(5,7,1) = 1;
MDP.T(5,8,2) = 1;
MDP.R(5,8,2) = 9;
% State 6 transition and reward
MDP.T(6,7,1) = 1;
MDP.R(6,7,1) = 5;
MDP.T(6,8,2) = 1;
MDP.R(6,8,2) = 1;
% State 7 transition and reward
MDP.T(7,7,1) = 1;
MDP.R(7,7,1) = 0;
MDP.T(7,7,2) = 1;
MDP.R(7,7,2) = 0;
% State 8 transition and reward
MDP.T(8,8,1) = 1;
MDP.R(8,8,1) = 0;
MDP.T(8,8,2) = 1;
MDP.R(8,8,2) = 0;
Create the reinforcement learning MDP environment for this process model.
env = rlMDPEnv(MDP);
To specify that the initial state of the agent is always state 1, specify a reset function that returns the
initial agent state. This function is called at the start of each training episode and simulation. Create
an anonymous function handle that sets the initial state to 1.
env.ResetFcn = @() 1;
Fix the random generator seed for reproducibility.

rng(0)
To create a Q-learning agent, first create a Q table using the observation and action specifications
from the MDP environment. Set the learning rate of the representation to 1.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
qTable = rlTable(obsInfo, actInfo);
qFunction = rlQValueFunction(qTable, obsInfo, actInfo);
qOptions = rlOptimizerOptions("LearnRate",1);
Next, create a Q-learning agent using this table representation, configuring the epsilon-greedy
exploration. For more information on creating Q-learning agents, see rlQAgent and
rlQAgentOptions.
agentOpts = rlQAgentOptions;
agentOpts.DiscountFactor = 1;
agentOpts.EpsilonGreedyExploration.Epsilon = 0.9;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.01;
agentOpts.CriticOptimizerOptions = qOptions;
qAgent = rlQAgent(qFunction,agentOpts); %#ok<NASGU>
To train the agent, first specify the training options. For this example, use the following options:
• Train for at most 500 episodes, with each episode lasting at most 50 time steps.
• Stop training when the agent receives an average cumulative reward greater than 13 over 30
consecutive episodes.
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 50;
trainOpts.MaxEpisodes = 500;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 13;
trainOpts.ScoreAveragingWindowLength = 30;
Train the agent using the train function. This may take several minutes to complete. To save time
while running this example, load a pretrained agent by setting doTraining to false. To train the
agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(qAgent,env,trainOpts); %#ok<UNRCH>
else
% Load pretrained agent for the example.
load('genericMDPQAgent.mat','qAgent');
end
To validate the training results, simulate the agent in the training environment using the sim
function. The agent successfully finds the optimal path, which results in a cumulative reward of 13.
Data = sim(qAgent,env);
cumulativeReward = sum(Data.Reward)
cumulativeReward = 13
Since the discount factor is set to 1, the values in the Q table of the trained agent match the
undiscounted returns of the environment.
QTable = getLearnableParameters(getCritic(qAgent));
QTable{1}
ans = 8×2
12.9874 7.0759
-7.6425 9.9990
10.7193 0.9090
5.9128 -2.2466
6.7830 8.9988
7.5928 -5.5053
0 0
0 0
TrueTableValues = [13,12;5,10;11,9;3,2;1,9;5,1;0,0;0,0]
TrueTableValues = 8×2
13 12
5 10
11 9
3 2
1 9
5 1
0 0
0 0
See Also
rlMDPEnv | createMDP
More About
• “Reinforcement Learning Agents” on page 3-2
• “Train Reinforcement Learning Agents” on page 5-3
Train Reinforcement Learning Agent in Basic Grid World
This example shows how to solve a grid world environment using reinforcement learning by training
Q-learning and SARSA agents. For more information on these agents, see “Q-Learning Agents” on
page 3-17 and “SARSA Agents” on page 3-20.
This grid world environment has the following configuration and rules:
1 The grid world is 5-by-5 and bounded by borders, with four possible actions (North = 1, South =
2, East = 3, West = 4).
2 The agent begins from cell [2,1] (second row, first column).
3 The agent receives a reward +10 if it reaches the terminal state at cell [5,5] (blue).
4 The environment contains a special jump from cell [2,4] to cell [4,4] with a reward of +5.
5 The agent is blocked by obstacles (black cells).
6 All other actions result in –1 reward.
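This example uses the predefined basic grid world environment. A sketch of its creation follows (see “Load Predefined Grid World Environments” on page 2-17); the full example may configure the environment further.

% Create the predefined basic grid world environment.
env = rlPredefinedEnv("BasicGridWorld");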
To specify that the initial state of the agent is always [2,1], create a reset function that returns the
state number for the initial agent state. This function is called at the start of each training episode
and simulation. States are numbered starting at position [1,1]. The state number increases as you
move down the first column and then down each subsequent column. Therefore, create an anonymous
function handle that sets the initial state to 2.
env.ResetFcn = @() 2;
Fix the random generator seed for reproducibility.

rng(0)
To create a Q-learning agent, first create a Q table using the observation and action specifications
from the grid world environment. Set the learning rate of the optimizer to 0.01.
To approximate the Q-value function within the agent, create a rlQValueFunction approximator
object, using the table and the environment information.
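A sketch of the table and Q-value function creation just described, mirroring the MDP example earlier in this chapter (the variable name qFcnAppx matches the code that follows):

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Q table spanning all observation-action pairs, wrapped in a Q-value function.
qTable = rlTable(obsInfo,actInfo);
qFcnAppx = rlQValueFunction(qTable,obsInfo,actInfo);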
qAgent = rlQAgent(qFcnAppx);
Configure agent options such as the epsilon-greedy exploration and the learning rate for the function
approximator.
qAgent.AgentOptions.EpsilonGreedyExploration.Epsilon = .04;
qAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 0.01;
For more information on creating Q-learning agents, see rlQAgent and rlQAgentOptions.
To train the agent, first specify the training options. For this example, use the following options:
• Train for at most 200 episodes. Specify that each episode lasts for at most 50 time steps.
• Stop training when the agent receives an average cumulative reward greater than 11 over 30
consecutive episodes.
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 50;
trainOpts.MaxEpisodes= 200;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 11;
trainOpts.ScoreAveragingWindowLength = 30;
Train the Q-learning agent using the train function. Training can take several minutes to complete.
To save time while running this example, load a pretrained agent by setting doTraining to false.
To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(qAgent,env,trainOpts);
else
% Load the pretrained agent for the example.
load('basicGWQAgent.mat','qAgent')
end
The Episode Manager window opens and displays the training progress.
To validate the training results, simulate the agent in the training environment.
Before running the simulation, visualize the environment and configure the visualization to maintain a
trace of the agent states.
plot(env)
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
sim(qAgent,env)
The agent trace shows that the agent successfully finds the jump from cell [2,4] to cell [4,4].
To create a SARSA agent, use the same Q value function and epsilon-greedy configuration as for the
Q-learning agent. For more information on creating SARSA agents, see rlSARSAAgent and
rlSARSAAgentOptions.
sarsaAgent = rlSARSAAgent(qFcnAppx);
sarsaAgent.AgentOptions.EpsilonGreedyExploration.Epsilon = .04;
sarsaAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 0.01;
Train the SARSA agent using the train function. Training can take several minutes to complete. To
save time while running this example, load a pretrained agent by setting doTraining to false. To
train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(sarsaAgent,env,trainOpts);
else
% Load the pretrained agent for the example.
load('basicGWSarsaAgent.mat','sarsaAgent')
end
To validate the training results, simulate the agent in the training environment.
plot(env)
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
sim(sarsaAgent,env)
The SARSA agent finds the same grid world solution as the Q-learning agent.
See Also
rlMDPEnv | createGridWorld
More About
• “Reinforcement Learning Agents” on page 3-2
• “Train Reinforcement Learning Agents” on page 5-3
Create Simulink Environment and Train Agent
This example shows how to convert the PI controller in the watertank Simulink® model to a
reinforcement learning deep deterministic policy gradient (DDPG) agent. For an example that trains a
DDPG agent in MATLAB®, see “Train DDPG Agent to Control Double Integrator System” on page 5-
77.
The original model for this example is the water tank model. The goal is to control the level of the
water in the tank. For more information about the water tank model, see “watertank Simulink Model”
(Simulink Control Design).
The resulting model is rlwatertank.slx. For more information on this model and the changes, see
“Create Simulink Reinforcement Learning Environments” on page 2-8.
open_system('rlwatertank')
Creating an environment model includes defining the following:

• Action and observation signals that the agent uses to interact with the environment. For more
information, see rlNumericSpec and rlFiniteSetSpec.
• Reward signal that the agent uses to measure its success. For more information, see “Define
Reward Signals” on page 2-14.
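The environment object env used below is not created in this excerpt. A sketch of how it is typically created for this model follows; the observation channel contents, signal names, and agent block path are assumptions.

% Observation: vector of three signals.
obsInfo = rlNumericSpec([3 1]);
obsInfo.Name = "observations";

% Action: scalar flow-rate command.
actInfo = rlNumericSpec([1 1]);
actInfo.Name = "flow";

% Create the environment interface for the model and its RL Agent block.
env = rlSimulinkEnv("rlwatertank","rlwatertank/RL Agent",obsInfo,actInfo);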
Set a custom reset function that randomizes the reference values for the model.
env.ResetFcn = @(in)localResetFcn(in);
Specify the simulation time Tf and the agent sample time Ts in seconds.
Ts = 1.0;
Tf = 200;
Fix the random generator seed for reproducibility.

rng(0)
Given observations and actions, a DDPG agent approximates the long-term reward using a value
function approximator as a critic.
Create a deep neural network to approximate the value function within the critic. To create a network
with two inputs, the observation and action, and one output, the value, use three different paths, and
specify each path as a row vector of layer objects. You can obtain the dimension of the observation
and action spaces from the obsInfo and actInfo specifications.
statePath = [
featureInputLayer(obsInfo.Dimension(1),Name="netObsIn")
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(25,Name="CriticStateFC2")];
actionPath = [
featureInputLayer(actInfo.Dimension(1),Name="netActIn")
fullyConnectedLayer(25,Name="CriticActionFC1")];
commonPath = [
additionLayer(2,Name="add")
reluLayer
fullyConnectedLayer(1,Name="CriticOutput")];
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
figure
plot(criticNetwork)
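The network summary shown next implies that the separate paths are connected to the addition layer and the layer graph is converted to a dlnetwork object. A sketch of those assumed intermediate steps:

% Connect the state and action paths to the addition layer, then convert the
% layer graph to a dlnetwork object and display its summary.
criticNetwork = connectLayers(criticNetwork,"CriticStateFC2","add/in1");
criticNetwork = connectLayers(criticNetwork,"CriticActionFC1","add/in2");
criticNetwork = dlnetwork(criticNetwork);
summary(criticNetwork)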
Initialized: true
Inputs:
1 'netObsIn' 3 features
2 'netActIn' 1 features
Create the critic approximator object using the specified deep neural network, the environment
specification objects, and the names of the network inputs to be associated with the observation and
action channels.
critic = rlQValueFunction(criticNetwork,obsInfo,actInfo, ...
ObservationInputNames="netObsIn", ...
ActionInputNames="netActIn");
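The output that follows suggests the critic is evaluated with random observation and action inputs. A sketch of such a check (the exact call is not shown in this excerpt):

% Check the critic with a random observation and a random action.
getValue(critic,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})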
ans = single
-0.1631
For more information on creating critics, see “Create Policies and Value Functions” on page 4-2.
Given observations, a DDPG agent decides which action to take using a deterministic policy, which is
implemented by an actor.
Create a deep neural network to approximate the policy within the actor. To create a network with
one input, the observation, and one output, the action, use a row vector of layer objects. You can
obtain the dimension of the observation and action spaces from the obsInfo and actInfo specifications.
actorNetwork = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(3)
tanhLayer
fullyConnectedLayer(actInfo.Dimension(1))
];
actorNetwork = dlnetwork(actorNetwork);
summary(actorNetwork)
Initialized: true
Number of learnables: 16
Inputs:
1 'input' 3 features
Create the actor approximator object using the specified deep neural network, the environment
specification objects, and the name of the network input to be associated with the observation
channel.
actor = rlContinuousDeterministicActor(actorNetwork,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
For more information on creating actors, see “Create Policies and Value Functions” on page 4-2.
Create the DDPG agent using the specified actor and critic approximator objects.
agent = rlDDPGAgent(actor,critic);
Specify options for the agent, the actor, and the critic using dot notation.
agent.AgentOptions.SampleTime = Ts;
agent.AgentOptions.TargetSmoothFactor = 1e-3;
agent.AgentOptions.DiscountFactor = 1.0;
agent.AgentOptions.MiniBatchSize = 64;
agent.AgentOptions.ExperienceBufferLength = 1e6;
agent.AgentOptions.NoiseOptions.Variance = 0.3;
agent.AgentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-03;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-04;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
Alternatively, you can specify the agent options using an rlDDPGAgentOptions object.
getAction(agent,{rand(obsInfo.Dimension)})
Train Agent
To train the agent, first specify the training options. For this example, use the following options:
• Run each training for at most 5000 episodes. Specify that each episode lasts for at most
ceil(Tf/Ts) (that is 200) time steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option to false).
• Stop training when the agent receives an average cumulative reward greater than 800 over 20
consecutive episodes. At this point, the agent can control the level of water in the tank.
trainOpts = rlTrainingOptions(...
MaxEpisodes=5000, ...
MaxStepsPerEpisode=ceil(Tf/Ts), ...
ScoreAveragingWindowLength=20, ...
Verbose=false, ...
Plots="training-progress",...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=800);
Train the agent using the train function. Training is a computationally intensive process that takes
several minutes to complete. To save time while running this example, load a pretrained agent by
setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained agent for the example.
load("WaterTankDDPG.mat","agent")
end
Validate the learned agent against the model by running a simulation.

simOpts = rlSimulationOptions(MaxSteps=ceil(Tf/Ts),StopOnError="on");
experiences = sim(env,agent,simOpts);
Local Function
function in = localResetFcn(in)
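% The body of this helper is omitted in this excerpt. In the full example it
% randomizes the reference (desired water level) used by the model, as
% described above.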
end
See Also
train
More About
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Simulink Reinforcement Learning Environments” on page 2-8
2
Create Environments
Create MATLAB Reinforcement Learning Environments

A MATLAB reinforcement learning environment definition includes the following:

• Action and observation signals that the agent uses to interact with the environment.
• A reward signal that the agent uses to measure its success. For more information, see “Define Reward Signals” on page 2-14.
• The environment initial condition and its dynamic behavior.
What signals you select as actions and observations depends on your application. For example, for
control system applications, the integrals (and sometimes derivatives) of error signals are often
useful observations. Also, for reference-tracking applications, having a time-varying reference signal
as an observation is helpful.
When you define your observation signals, ensure that all the environment states (or their estimation)
are included in the observation vector. This is a good practice because the agent is often a static
function which lacks internal memory or state, and so it might not be able to successfully reconstruct
the environment state internally.
For example, an image observation of a swinging pendulum has position information but does not
have enough information, by itself, to determine the pendulum velocity. In this case, you can measure
or estimate the pendulum velocity as an additional entry in the observation vector.
Predefined MATLAB Environments

Reinforcement Learning Toolbox provides predefined grid world and control system MATLAB environments for which the actions, observations, rewards, and dynamics are already defined. For more information, see “Load Predefined Grid World Environments” on page 2-17 and “Load Predefined Control System Environments” on page 2-23.
Custom MATLAB Environments

You can also create your own MATLAB environments, such as custom grid worlds, environments defined by custom functions, and environments created from a template class. Once you create a custom environment object, you can train an agent in the same manner as in a predefined environment. For more information on training agents, see “Train Reinforcement Learning Agents” on page 5-3.
You can create custom grid worlds of any size with your own custom reward, state transition, and
obstacle configurations. To create a custom grid world environment:
1 Create a grid world model using the createGridWorld function. For example, create a grid
world named gw with ten rows and nine columns.
gw = createGridWorld(10,9);
2 Configure the grid world by modifying the properties of the model. For example, specify the
terminal state as the location [7,9]
gw.TerminalStates = "[7,9]";
3 A grid world needs to be included in a Markov decision process (MDP) environment. Create an
MDP environment for this grid world, which the agent uses to interact with the grid world model.
env = rlMDPEnv(gw);
For more information on custom grid worlds, see “Create Custom Grid World Environments” on page
2-36.
For simple environments, you can define a custom environment object by creating an
rlFunctionEnv object and specifying your own custom reset and step functions.
• At the beginning of each training episode, the agent calls the reset function to set the environment
initial condition. For example, you can specify known initial state values or place the environment
into a random initial state.
• The step function defines the dynamics of the environment, that is, how the state changes as a
function of the current state and the agent action. At each training time step, the state of the
model is updated using the step function.
For more information, see “Create MATLAB Environment Using Custom Functions” on page 2-41.
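The following is a minimal sketch of an environment built from custom functions. The dynamics, reward, and variable names are illustrative assumptions, not part of this guide; the custom reset and step functions follow the signatures that rlFunctionEnv expects.

% One continuous observation channel and two discrete actions.
obsInfo = rlNumericSpec([1 1]);
actInfo = rlFiniteSetSpec([-1 1]);

% Reset function: return the initial observation and initialize LoggedSignals.
resetFcn = @() deal(0,struct('State',0));

% Create the environment from the custom step and reset functions.
env = rlFunctionEnv(obsInfo,actInfo,@myStepFunction,resetFcn);

function [nextObs,reward,isDone,loggedSignals] = myStepFunction(action,loggedSignals)
% Simple integrator dynamics (assumed): the action increments the state.
loggedSignals.State = loggedSignals.State + action;
nextObs = loggedSignals.State;
reward = -abs(nextObs);       % encourage keeping the state near zero
isDone = abs(nextObs) > 10;   % end the episode if the state drifts too far
end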
For more complex environments, you can define a custom environment by creating and modifying a
template environment. To create a custom environment:
For more information, see “Create Custom MATLAB Environment from Template” on page 2-48.
See Also
rlPredefinedEnv | rlFunctionEnv | rlCreateEnvTemplate
More About
• “What Is Reinforcement Learning?” on page 1-3
• “Create MATLAB Environments for Reinforcement Learning Designer” on page 2-5
• “Create Simulink Reinforcement Learning Environments” on page 2-8
Create MATLAB Environments for Reinforcement Learning Designer
When training an agent using the Reinforcement Learning Designer app, you can create a
predefined MATLAB environment from within the app or import a custom environment.
To use a custom environment, you must first create the environment at the MATLAB command line
and then import the environment into Reinforcement Learning Designer. For more information on
creating such an environment, see “Create MATLAB Reinforcement Learning Environments” on page
2-2.
Once you create a custom environment using one of the methods described in the preceding section,
import the environment into Reinforcement Learning Designer. On the Reinforcement Learning
tab, click Import. Then, under Select Environment, select the environment.
Once you have created or imported an environment, the app adds the environment to the
Environments pane.
Once you have created an environment, you can create an agent to train in that environment. For
more information, see “Create Agents Using Reinforcement Learning Designer” on page 3-9.
See Also
Reinforcement Learning Designer
Related Examples
• “Create MATLAB Reinforcement Learning Environments” on page 2-2
• “Create Simulink Environments for Reinforcement Learning Designer” on page 2-11
• “Create Agents Using Reinforcement Learning Designer” on page 3-9
• “Design and Train Agent Using Reinforcement Learning Designer” on page 5-12
Create Simulink Reinforcement Learning Environments

A Simulink reinforcement learning environment definition includes the following:
• Action and observation signals that the agent uses to interact with the environment.
• Reward signal that the agent uses to measure its success. For more information, see “Define
Reward Signals” on page 2-14.
• Environment dynamic behavior.
What signals you select as actions and observations depends on your application. For example, for
control system applications, the integrals (and sometimes derivatives) of error signals are often
useful observations. Also, for reference-tracking applications, having a time-varying reference signal
as an observation is helpful.
When you define your observation signals, ensure that all the system states are observable through
the observations. For example, an image observation of a swinging pendulum has position
information but does not have enough information to determine the pendulum velocity. In this case,
you can specify the pendulum velocity as a separate observation.
For more information, see “Load Predefined Simulink Environments” on page 2-30.
For the action and observation signals, you must create specification objects using rlNumericSpec
for continuous signals and rlFiniteSetSpec for discrete signals. For bus signals, create
specifications using bus2RLSpec.
For the reward signal, construct a scalar signal in the model and connect this signal to the RL Agent
block. For more information, see “Define Reward Signals” on page 2-14.
After configuring the Simulink model, create an environment object for the model using the
rlSimulinkEnv function.
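For example, a minimal sketch follows; the model name, agent block path, and signal dimensions are placeholders, not from this guide.

% Specifications for one continuous observation channel and one discrete action channel.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);

% Environment interface for a model that contains an RL Agent block.
mdl = "myModel";
env = rlSimulinkEnv(mdl,mdl + "/RL Agent",obsInfo,actInfo);

% Optionally randomize a workspace variable at the start of each episode.
env.ResetFcn = @(in) setVariable(in,"x0",0.1*randn);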
If you have a reference model with an appropriate action input port, observation output port, and
scalar reward output port, you can automatically create a Simulink model that includes this reference
model and an RL Agent block. For more information, see createIntegratedEnv. This function
returns the environment object, action specifications, and observation specifications for the model.
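A sketch of one possible call, using placeholder model names:

% Create a new model named "myIntegratedModel" that wraps the reference model
% "myRefModel" together with an RL Agent block, and return the environment.
env = createIntegratedEnv("myRefModel","myIntegratedModel");

% Retrieve the specifications from the resulting environment.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);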
Your environment can include third-party functionality. For more information, see “Integrate with
Existing Simulation or Environment” (Simulink).
See Also
rlPredefinedEnv | rlSimulinkEnv | createIntegratedEnv
More About
• “What Is Reinforcement Learning?” on page 1-3
• “Create Simulink Environments for Reinforcement Learning Designer” on page 2-11
Create Simulink Environments for Reinforcement Learning Designer
When training an agent using the Reinforcement Learning Designer app, you can create a
predefined Simulink environment from within the app or import a custom environment.
To use a custom environment, you must first create the environment at the MATLAB command line
and then import the environment into Reinforcement Learning Designer. For more information on
creating a Simulink environment, see “Create Simulink Reinforcement Learning Environments” on
page 2-8.
For training and simulating Simulink environments, you must define all variables necessary for
running the Simulink model in the MATLAB workspace.
Once you create a custom environment using one of the methods described in the preceding section,
import the environment into Reinforcement Learning Designer. On the Reinforcement Learning
tab, click Import. Then, under Select Environment, select the environment.
Once you have created or imported an environment, the app adds the environment to the
Environments pane.
Once you have created an environment, you can create an agent to train in that environment. For
more information, see “Create Agents Using Reinforcement Learning Designer” on page 3-9.
See Also
Reinforcement Learning Designer
Related Examples
• “Create Simulink Reinforcement Learning Environments” on page 2-8
• “Create MATLAB Environments for Reinforcement Learning Designer” on page 2-5
• “Create Agents Using Reinforcement Learning Designer” on page 3-9
• “Design and Train Agent Using Reinforcement Learning Designer” on page 5-12
Define Reward Signals
In general, you provide a positive reward to encourage certain agent actions and a negative reward
(penalty) to discourage other actions. A well-designed reward signal guides the agent to maximize the
expectation of the long-term reward. What constitutes a well-designed reward depends on your
application and the agent goals.
For example, when an agent must perform a task for as long as possible, a common strategy is to
provide a small positive reward for each time step that the agent successfully performs the task and a
large penalty when the agent fails. This approach encourages longer training episodes while heavily
discouraging episodes that fail. For an example that uses this approach, see “Train DQN Agent to
Balance Cart-Pole System” on page 5-50.
If your reward function incorporates multiple signals, such as position, velocity, and control effort,
you must consider the relative sizes of the signals and scale their contributions to the reward signal
accordingly.
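For example, a sketch of a weighted quadratic reward; the weights are illustrative assumptions chosen so that each scaled term contributes a comparable magnitude.

function r = computeReward(position,velocity,effort)
% Weighted quadratic reward combining several signals. Choose the weights so
% that the scaled terms are of similar size.
r = -(1.0*position.^2 + 0.1*velocity.^2 + 0.01*effort.^2);
end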
You can specify either continuous or discrete reward signals. In either case, you must provide a
reward signal that provides rich information when the action and observation signals change.
For applications where control system specifications like cost functions and constraints are already
available, you can also generate reward functions from such specifications.
Continuous Rewards
A continuous reward function varies continuously with changes in the environment observations and
actions. In general, continuous reward signals improve convergence during training and can lead to
simpler network structures.
An example of a continuous reward is the quadratic regulator (QR) cost function, where the long-term
reward can be expressed as
$$ J_i = -\left( s_\tau^T Q_\tau s_\tau + \sum_{j=i}^{\tau} \left( s_j^T Q_j s_j + a_j^T R_j a_j + 2 s_j^T N_j a_j \right) \right) $$
Here, Qτ, Q, R, and N are the weight matrices. Qτ is the terminal weight matrix, applied only at the
end of the episode. Also, s is the observation vector, a is the action vector, and τ is the terminal
iteration of the episode. The instantaneous reward for this cost function is

$$ r_i = -\left( s_i^T Q_i s_i + a_i^T R_i a_i + 2 s_i^T N_i a_i \right) $$
This QR reward structure encourages driving s to zero with minimal action effort. A QR-based reward
structure is a good reward to choose for regulation or stationary point problems, such as pendulum
swing-up or regulating the position of the double integrator. For training examples that use a QR
reward, see “Train DQN Agent to Swing Up and Balance Pendulum” on page 5-88 and “Train DDPG
Agent to Control Double Integrator System” on page 5-77.
Smooth continuous rewards, such as the QR regulator, are good for fine-tuning parameters and can
provide policies similar to optimal controllers (LQR/MPC).
Discrete Rewards
A discrete reward function varies discontinuously with changes in the environment observations or
actions. These types of reward signals can make convergence slower and can require more complex
network structures. Discrete rewards are usually implemented as events that occur in the
environment—for example, when an agent receives a positive reward if it exceeds some target value
or a penalty when it violates some performance constraint.
While discrete rewards can slow down convergence, they can also guide the agent toward better
reward regions in the state space of the environment. For example, a region-based reward, such as a
fixed reward when the agent is near a target location, can emulate final-state constraints. Also, a
region-based penalty can encourage an agent to avoid certain areas of the state space.
Mixed Rewards
In many cases, providing a mixed reward signal that has a combination of continuous and discrete
reward components is beneficial. The discrete reward signal can be used to drive the system away
from bad states, and the continuous reward signal can improve convergence by providing a smooth
reward near target states. For example, in “Train DDPG Agent to Control Flying Robot” on page 5-
156, the reward function has three components: r1, r2, and r3.
$$ r_1 = 10\left( \left( x_t^2 + y_t^2 + \theta_t^2 \right) < 0.5 \right) $$

$$ r_2 = -100\left( \left| x_t \right| \ge 20 \;\;\text{or}\;\; \left| y_t \right| \ge 20 \right) $$

$$ r_3 = -\left( 0.2\left( R_{t-1} + L_{t-1} \right)^2 + 0.3\left( R_{t-1} - L_{t-1} \right)^2 + 0.03 x_t^2 + 0.03 y_t^2 + 0.02 \theta_t^2 \right) $$

$$ r = r_1 + r_2 + r_3 $$
Here:
• r1 is a region-based continuous reward that applies only near the target location of the robot.
• r2 is a discrete signal that provides a large penalty when the robot moves far from the target
location.
• r3 is a continuous QR penalty that applies for all robot states.
You can use the generateRewardFunction function to generate a reward function from the following types of control specifications:

• Cost and constraint specifications defined in an mpc or nlmpc controller object. This feature
requires Model Predictive Control Toolbox™ software.
• Performance constraints defined in Simulink Design Optimization™ model verification blocks.
In both cases, when constraints are violated, a negative reward is calculated using penalty functions
such as exteriorPenalty (default), hyperbolicPenalty or barrierPenalty functions.
Starting from the generated reward function, you can tune the cost and penalty weights, use a
different penalty function, and then use the resulting reward function within an environment to train
an agent.
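For example, a sketch of evaluating one of these penalty functions directly; the bounds and method shown are illustrative.

% Penalty is zero when the signal is inside [-1,1] and grows quadratically outside.
p = exteriorPenalty(2.5,-1,1,"quadratic");

% Use the negative penalty as a reward contribution.
rewardTerm = -p;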
See Also
Functions
generateRewardFunction | exteriorPenalty | hyperbolicPenalty | barrierPenalty
More About
• “Create MATLAB Reinforcement Learning Environments” on page 2-2
• “Create Simulink Reinforcement Learning Environments” on page 2-8
• “Generate Reward Function from a Model Predictive Controller for a Servomotor” on page 5-
279
• “Generate Reward Function from a Model Verification Block for a Water Tank System” on page
5-289
Load Predefined Grid World Environments
You can load the following predefined MATLAB grid world environments using the
rlPredefinedEnv function.
For more information on the properties of grid world environments, see “Create Custom Grid World
Environments” on page 2-36.
You can also load predefined MATLAB control system environments. For more information, see “Load
Predefined Control System Environments” on page 2-23.
To create a basic grid world environment, use the rlPredefinedEnv function. This function creates
an rlMDPEnv object representing the grid world.
env = rlPredefinedEnv('BasicGridWorld');
You can visualize the grid world environment using the plot function.
• Agent location is a red circle. By default, the agent starts in state [1,1].
• Terminal location is a blue square.
• Obstacles are black squares.
plot(env)
Actions
The agent can move in one of four possible directions (north, south, east, or west).
Rewards

In the basic grid world, the agent receives a reward of +10 for reaching the terminal state, +5 for using the jump from cell [2,4] to cell [4,4], and –1 for every other action.
To create a deterministic waterfall grid world, use the rlPredefinedEnv function. This function
creates an rlMDPEnv object representing the grid world.
env = rlPredefinedEnv('WaterFallGridWorld-Deterministic');
As with the basic grid world, you can visualize the environment, where the agent is a red circle and
the terminal location is a blue square.
plot(env)
Actions
The agent can move in one of four possible directions (north, south, east, or west).
Rewards
Waterfall Dynamics
In this environment, a waterfall pushes the agent toward the bottom of the grid.
The intensity of the waterfall varies between the columns, as shown at the top of the preceding
figure. When the agent moves into a column with a nonzero intensity, the waterfall pushes it
downward by the indicated number of squares. For example, if the agent goes east from state [5,2], it
reaches state [7,3].
To create a stochastic waterfall grid world, use the rlPredefinedEnv function. This function
creates an rlMDPEnv object representing the grid world.
env = rlPredefinedEnv('WaterFallGridWorld-Stochastic');
As with the basic grid world, you can visualize the environment, where the agent is a red circle and
the terminal location is a blue square.
plot(env)
Actions
The agent can move in one of four possible directions (north, south, east, or west).
Rewards
Waterfall Dynamics
In this environment, a waterfall pushes the agent towards the bottom of the grid with a stochastic
intensity. The baseline intensity matches the intensity of the deterministic waterfall environment.
However, in the stochastic waterfall case, the agent has an equal chance of experiencing the
indicated intensity, one level above that intensity, or one level below that intensity. For example, if the
agent goes east from state [5,2], it has an equal chance of reaching state [6,3], [7,3], or [8,3].
See Also
rlPredefinedEnv | train | rlMDPEnv
More About
• “Create MATLAB Reinforcement Learning Environments” on page 2-2
Load Predefined Control System Environments
You can load the following predefined MATLAB control system environments using the
rlPredefinedEnv function.
You can also load predefined MATLAB grid world environments. For more information, see “Load
Predefined Grid World Environments” on page 2-17.
Cart-Pole Environments
The goal of the agent in the predefined cart-pole environments is to balance a pole on a moving cart
by applying horizontal forces to the cart. The pole is considered successfully balanced if both of the
following conditions are satisfied:
• The pole angle remains within a given threshold of the vertical position, where the vertical
position is zero radians.
• The magnitude of the cart position remains below a given threshold.
There are two cart-pole environment variants, which differ by the agent action space.
• Discrete — Agent can apply a force of either Fmax or -Fmax to the cart, where Fmax is the MaxForce
property of the environment.
• Continuous — Agent can apply any force within the range [-Fmax,Fmax].
• Discrete action space

env = rlPredefinedEnv('CartPole-Discrete');
• Continuous action space
env = rlPredefinedEnv('CartPole-Continuous');
You can visualize the cart-pole environment using the plot function. The plot displays the cart as a
blue square and the pole as a red rectangle.
plot(env)
To visualize the environment during training, call plot before training and keep the visualization
figure open.
For an example showing how to train an agent in a cart-pole environment, see “Train DQN Agent to Balance Cart-Pole System” on page 5-50.

Environment Properties

The State property of the environment is a column vector with the following state variables:
• Cart position
• Derivative of cart position
• Pole angle
• Derivative of pole angle
Actions
In the cart-pole environments, the agent interacts with the environment using a single action signal,
the horizontal force applied to the cart. The environment contains a specification object for this
action signal. For the environment with a discrete action space, the action specification is an rlFiniteSetSpec object; for a continuous action space, it is an rlNumericSpec object.
For more information on obtaining action specifications from an environment, see getActionInfo.
Observations
In the cart-pole system, the agent can observe all the environment state variables in env.State. For
each state variable, the environment contains an rlNumericSpec observation specification. All the
states are continuous and unbounded.
Reward
• A positive reward for each time step that the pole is balanced, that is, the cart and pole both
remain within their specified threshold ranges. This reward accumulates over the entire training
episode. To control the size of this reward, use the RewardForNotFalling property of the
environment.
• A one-time negative penalty if either the pole or cart moves outside of their threshold range. At
this point, the training episode stops. To control the size of this penalty, use the
PenaltyForFalling property of the environment.
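For example, a sketch of adjusting these reward-related properties; the values shown are illustrative, not defaults.

env = rlPredefinedEnv("CartPole-Discrete");

% Reward accrued at each step while balancing, and one-time penalty for failure.
env.RewardForNotFalling = 1;
env.PenaltyForFalling = -10;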
Training episodes for these environments end when either of the following events occurs:

• The pole angle moves outside of its threshold range.
• The magnitude of the cart position exceeds its threshold.
Double Integrator Environments
There are two double integrator environment variants, which differ by the agent action space.
• Discrete — Agent can apply a force of either Fmax or -Fmax to the mass, where Fmax is the MaxForce
property of the environment.
• Continuous — Agent can apply any force within the range [-Fmax,Fmax].
• Discrete action space

env = rlPredefinedEnv('DoubleIntegrator-Discrete');
• Continuous action space
env = rlPredefinedEnv('DoubleIntegrator-Continuous');
You can visualize the double integrator environment using the plot function. The plot displays the
mass as a red rectangle.
plot(env)
To visualize the environment during training, call plot before training and keep the visualization
figure open.
For an example showing how to train an agent in a double integrator environment, see “Train DDPG Agent to Control Double Integrator System” on page 5-77.

Environment Properties

State — Environment state, specified as a column vector [0 0]' with the following state variables:
• Mass position
• Derivative of mass position
Actions
In the double integrator environments, the agent interacts with the environment using a single action
signal, the force applied to the mass. The environment contains a specification object for this action
signal. For the environment with a discrete action space, the action specification is an rlFiniteSetSpec object; for a continuous action space, it is an rlNumericSpec object.
For more information on obtaining action specifications from an environment, see getActionInfo.
Observations
In the double integrator system, the agent can observe both of the environment state variables in
env.State. For each state variable, the environment contains an rlNumericSpec observation
specification. Both states are continuous and unbounded.
Reward
The reward signal for this environment is the discrete-time equivalent of the following continuous-
time reward, which is analogous to the cost function of an LQR controller.
This reward is the episodic reward, that is, the cumulative reward across the entire training episode.
Simple Pendulum Environments with Image Observation
There are two simple pendulum environment variants, which differ by the agent action space.
For examples showing how to train an agent in this environment, see the following:
• “Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation” on page 5-130
• “Create Agent Using Deep Network Designer and Train Using Image Observations” on page 5-138
Environment Properties

The environment state consists of the following state variables:

• Pendulum angle
• Pendulum angular velocity

Q — Weight matrix for the observation component of the reward signal: [1 0;0 0.1]

R — Weight for the action component of the reward signal: 1e-3
Actions
In the simple pendulum environments, the agent interacts with the environment using a single action
signal, the torque applied at the base of the pendulum. The environment contains a specification
object for this action signal. For the environment with a discrete action space, the action specification is an rlFiniteSetSpec object; for a continuous action space, it is an rlNumericSpec object.
For more information on obtaining action specifications from an environment, see getActionInfo.
Observations
In the simple pendulum environment, the agent receives the following observation signals:
For each observation signal, the environment contains an rlNumericSpec observation specification.
All the observations are continuous and unbounded.
Reward
The reward signal for this environment is

$$ r_t = -\left( \theta_t^2 + 0.1\,\dot{\theta}_t^2 + 0.001\,u_{t-1}^2 \right) $$

Here, θt is the pendulum angle measured from the upright position, θ̇t is the pendulum angular velocity, and ut−1 is the torque applied at the previous time step.
See Also
rlPredefinedEnv | train
More About
• “Create MATLAB Reinforcement Learning Environments” on page 2-2
• “Load Predefined Grid World Environments” on page 2-17
• “Train Reinforcement Learning Agents” on page 5-3
Load Predefined Simulink Environments
You can load the following predefined Simulink environments using the rlPredefinedEnv function.
For predefined Simulink environments, the environment dynamics, observations, and reward signal
are defined in a corresponding Simulink model. The rlPredefinedEnv function creates a
SimulinkEnvWithAgent object that the train function uses to interact with the Simulink model.
Simple Pendulum Simulink Model

The model for this environment is defined in the rlSimplePendulumModel Simulink model.

open_system('rlSimplePendulumModel')
There are two simple pendulum environment variants, which differ by the agent action space.
• Discrete — Agent can apply a torque of either Tmax, 0, or -Tmax to the pendulum, where Tmax is the
max_tau variable in the model workspace.
• Continuous — Agent can apply any torque within the range [-Tmax,Tmax].
• Discrete action space

env = rlPredefinedEnv('SimplePendulumModel-Discrete');
• Continuous action space
env = rlPredefinedEnv('SimplePendulumModel-Continuous');
For examples that train agents in the simple pendulum environment, see:
Actions
In the simple pendulum environments, the agent interacts with the environment using a single action
signal, the torque applied at the base of the pendulum. The environment contains a specification
object for this action signal. For the environment with a discrete action space, the action specification is an rlFiniteSetSpec object; for a continuous action space, it is an rlNumericSpec object.
For more information on obtaining action specifications from an environment, see getActionInfo.
Observations
In the simple pendulum environment, the agent receives the following three observation signals,
which are constructed within the create observations subsystem.
For each observation signal, the environment contains an rlNumericSpec observation specification.
All the observations are continuous and unbounded.
Reward
The reward signal for this environment, which is constructed in the calculate reward subsystem, is
$$ r_t = -\left( \theta_t^2 + 0.1\,\dot{\theta}_t^2 + 0.001\,u_{t-1}^2 \right) $$

Here, θt is the pendulum angle measured from the upright position, θ̇t is the pendulum angular velocity, and ut−1 is the torque applied at the previous time step.

Cart-Pole Simscape Model

The goal of the agent in this cart-pole environment is to balance a pole on a moving cart by applying horizontal forces to the cart. The pole is considered successfully balanced if both of the following conditions are satisfied:
• The pole angle remains within a given threshold of the vertical position, where the vertical
position is zero radians.
• The magnitude of the cart position remains below a given threshold.
The model for this environment is defined in the rlCartPoleSimscapeModel Simulink model. The
dynamics of this model are defined using Simscape Multibody™.
open_system('rlCartPoleSimscapeModel')
In the Environment subsystem, the model dynamics are defined using Simscape components and the
reward and observation are constructed using Simulink blocks.
open_system('rlCartPoleSimscapeModel/Environment')
There are two cart-pole environment variants, which differ by the agent action space.
• Discrete action space

env = rlPredefinedEnv('CartPoleSimscapeModel-Discrete');
• Continuous action space
env = rlPredefinedEnv('CartPoleSimscapeModel-Continuous');
For an example that trains an agent in this cart-pole environment, see “Train DDPG Agent to Swing
Up and Balance Cart-Pole System” on page 5-102.
Actions
In the cart-pole environments, the agent interacts with the environment using a single action signal,
the force applied to the cart. The environment contains a specification object for this action signal.
For the environment with a discrete action space, the action specification is an rlFiniteSetSpec object; for a continuous action space, it is an rlNumericSpec object.
For more information on obtaining action specifications from an environment, see getActionInfo.
Observations
In the cart-pole environment, the agent receives the following five observation signals.
For each observation signal, the environment contains an rlNumericSpec observation specification.
All the observations are continuous and unbounded.
Reward
The reward signal for this environment is the sum of three components (r = rqr + rn + rp), including the following penalty for the cart moving out of bounds:

$$ r_p = -100\left( \left| x \right| \ge 3.5 \right) $$

Here, x is the cart position.
See Also
Blocks
RL Agent
Functions
rlPredefinedEnv | train
More About
• “Create Simulink Reinforcement Learning Environments” on page 2-8
• “Train Reinforcement Learning Agents” on page 5-3
Create Custom Grid World Environments
Reinforcement Learning Toolbox lets you create custom MATLAB grid world environments for your
own applications. To create a custom grid world environment:
1 Create a grid world model using the createGridWorld function.
2 Configure the grid world model, for instance by setting its reward transition matrix, obstacle states, and terminal states.
3 Create a Markov decision process environment from the grid world model using rlMDPEnv.
CurrentState: The current state of the grid world. The agent starts from CurrentState when you call the reset function on the rlMDPEnv environment object.
States (read-only): A string vector containing the state names of the grid world. For instance, for a 2-by-2 grid world model GW, specify the following:
GW.States = ["[1,1]";
"[2,1]";
"[1,2]";
"[2,2]"];
Actions (read-only): A string vector containing the list of possible actions that the agent can use. You can set the actions when you create the grid world model by using the moves argument:
GW = createGridWorld(m,n,moves)
moves          GW.Actions
'Standard'     ['N';'S';'E';'W']
'Kings'        ['N';'S';'E';'W';'NE';'NW';'SE';'SW']
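For instance, the following sketch creates a hypothetical 5-by-5 grid world with king moves and displays the resulting action list:

GW = createGridWorld(5,5,'Kings');
GW.Actions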
T: The state transition matrix, which specifies the probability of moving from state s to state s′ for a given action a. T can be denoted as

T(s,s′,a) = probability(s′ | s,a)

For instance, to view the transition matrix for the first action in GW.Actions:

northStateTransition = GW.T(:,:,1)
R: The reward transition matrix, which specifies the reward the agent receives for moving from state s to state s′ using action a. R can be denoted as

r = R(s,s′,a)

Set up R so that the agent receives a reward after every action. For instance, you can set up a positive reward if the agent transitions over obstacle states and when it reaches the terminal state. You can also set up a default reward of -1 for all actions the agent takes, independent of the current state and next state. For an example that shows how to set up the reward transition matrix, see "Train Reinforcement Learning Agent in Basic Grid World" on page 1-14.
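As a sketch (the reward values and the terminal state name "[5,5]" are illustrative, not prescribed), you can build R from a default reward and then override the transitions into the terminal state:

nS = numel(GW.States);
nA = numel(GW.Actions);
GW.R = -1*ones(nS,nS,nA);               % default reward of -1 for every action
GW.R(:,state2idx(GW,"[5,5]"),:) = 10;   % reward for reaching the terminal state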
ObstacleStates: States that cannot be reached in the grid world, specified as a string vector. Consider the following 5-by-5 grid world model GW. The black cells are obstacle states, and you can specify them using the following syntax:

GW.ObstacleStates = ["[3,3]";"[3,4]";"[3,5]";"[4,3]"];
TerminalStates: Terminal states in the grid world, specified as a string vector. For instance:

GW.TerminalStates = "[5,5]";
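After you configure the grid world model, you can create the environment from it. A minimal sketch (the reset behavior shown, always starting from the state with index 2, is only an example):

env = rlMDPEnv(GW);
env.ResetFcn = @() 2;   % reset to the state with index 2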
For more information, see rlMDPEnv and “Train Reinforcement Learning Agent in Basic Grid World”
on page 1-14.
See Also
createGridWorld | rlMDPEnv | rlPredefinedEnv
More About
• “Train Reinforcement Learning Agent in Basic Grid World” on page 1-14
Create MATLAB Environment Using Custom Functions
This example shows how to create a cart-pole environment by supplying custom dynamic functions in
MATLAB®.
Using the rlFunctionEnv function, you can create a MATLAB reinforcement learning environment
from an observation specification, an action specification, and user-defined step and reset
functions. You can then train a reinforcement learning agent in this environment. The necessary step
and reset functions are already defined for this example.
Creating an environment using custom functions is useful for environments with less complex
dynamics, environments with no special visualization requirements, or environments with interfaces
to third-party libraries. For more complex environments, you can create an environment object using
a template class. For more information, see “Create Custom MATLAB Environment from Template” on
page 2-48.
For more information on creating reinforcement learning environments, see “Create MATLAB
Reinforcement Learning Environments” on page 2-2 and “Create Simulink Reinforcement Learning
Environments” on page 2-8.
The cart-pole environment is a pole attached to an unactuated joint on a cart, which moves along a
frictionless track. The training goal is to make the pendulum stand upright without falling over.
• The upward balanced pendulum position is 0 radians, and the downward hanging position is pi
radians.
• The pendulum starts upright with an initial angle that is between –0.05 and 0.05.
• The force action signal from the agent to the environment is from –10 to 10 N.
• The observations from the environment are the cart position, cart velocity, pendulum angle, and
pendulum angle derivative.
• The episode terminates if the pole is more than 12 degrees from vertical, or if the cart moves
more than 2.4 m from the original position.
• A reward of +1 is provided for every time step that the pole remains upright. A penalty of –10 is
applied when the pendulum falls.
For more information on this model, see “Load Predefined Control System Environments” on page 2-
23.
The observations from the environment are the cart position, cart velocity, pendulum angle, and
pendulum angle derivative.
The environment has a discrete action space where the agent can apply one of two possible force
values to the cart: -10 or 10 N.
For more information on specifying environment actions and observations, see rlNumericSpec and
rlFiniteSetSpec.
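For reference, the following sketch shows specification objects consistent with this example; the names and description strings are assumptions, but the dimensions match the four-element state and the two-valued force described above:

% Observations: cart position, cart velocity, pendulum angle, and angle derivative
ObservationInfo = rlNumericSpec([4 1]);
ObservationInfo.Name = 'CartPole States';
ObservationInfo.Description = 'x, dx, theta, dtheta';

% Action: horizontal force on the cart, either -10 N or 10 N
ActionInfo = rlFiniteSetSpec([-10 10]);
ActionInfo.Name = 'CartPole Action';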
To define a custom environment, first specify the custom step and reset functions. These functions
must be in your current working folder or on the MATLAB path.
The custom reset function sets the default state of the environment. This function must have the
following signature.
[InitialObservation,LoggedSignals] = myResetFunction()
To pass information from one step to the next, such as the environment state, use LoggedSignals.
For this example, LoggedSignals contains the states of the cart-pole environment: the position and
velocity of the cart, the pendulum angle, and the pendulum angle derivative. The reset function sets
the pendulum angle to a random value each time the environment is reset.
For this example, use the custom reset function defined in myResetFunction.m.
type myResetFunction.m
% Theta (randomize)
T0 = 2 * 0.05 * rand() - 0.05;
% Thetadot
Td0 = 0;
% X
X0 = 0;
% Xdot
Xd0 = 0;
LoggedSignal.State = [X0;Xd0;T0;Td0];
InitialObservation = LoggedSignal.State;
end
The custom step function specifies how the environment advances to the next state based on a given
action. This function must have the following signature.
[Observation,Reward,IsDone,LoggedSignals] = myStepFunction(Action,LoggedSignals)
To get the new state, the environment applies the dynamic equation to the current state stored in
LoggedSignals, which is similar to giving an initial condition to a differential equation. The new
state is stored in LoggedSignals and returned as an output.
For this example, use the custom step function defined in myStepFunction.m. For implementation
simplicity, this function redefines physical constants, such as the cart mass, each time it executes.
type myStepFunction.m
State = LoggedSignals.State;
XDot = State(2);
Theta = State(3);
ThetaDot = State(4);
% Get reward.
if ~IsDone
Reward = RewardForNotFalling;
else
Reward = PenaltyForFalling;
end
end
Construct the custom environment using the defined observation specification, action specification,
and function names.
env = rlFunctionEnv(ObservationInfo,ActionInfo,'myStepFunction','myResetFunction');
You can also define custom functions that have additional input arguments beyond the minimum
required set. For example, to pass the additional arguments arg1 and arg2 to both the step and reset functions, use the following code.
[InitialObservation,LoggedSignals] = myResetFunction(arg1,arg2)
[Observation,Reward,IsDone,LoggedSignals] = myStepFunction(Action,LoggedSignals,arg1,arg2)
To use these functions with rlFunctionEnv, you must use anonymous function handles.
ResetHandle = @()myResetFunction(arg1,arg2);
StepHandle = @(Action,LoggedSignals) myStepFunction(Action,LoggedSignals,arg1,arg2);
Using additional input arguments can create a more efficient environment implementation. For
example, myStepFunction2.m contains a custom step function that takes the environment
constants as an input argument (envConstants). By doing so, this function avoids redefining the
environment constants at each step.
type myStepFunction2.m
% Get reward.
if ~IsDone
Reward = EnvConstants.RewardForNotFalling;
else
Reward = EnvConstants.PenaltyForFalling;
end
end
Create an anonymous function handle to the custom step function, passing envConstants as an
additional input argument. Because envConstants is available at the time that StepHandle is
created, the function handle includes those values. The values persist within the function handle even
if you clear the variables.
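For instance, you might collect the constants in a structure before creating the handle. The RewardForNotFalling and PenaltyForFalling fields match the fields referenced in the fragment above; the remaining field names and values are assumptions based on the environment description:

% Environment constants (illustrative values following the earlier description)
envConstants.Gravity = 9.8;
envConstants.MassCart = 1.0;
envConstants.MassPole = 0.1;
envConstants.Length = 0.5;
envConstants.MaxForce = 10;
envConstants.Ts = 0.02;
envConstants.ThetaThresholdRadians = 12 * pi/180;
envConstants.XThreshold = 2.4;
envConstants.RewardForNotFalling = 1;
envConstants.PenaltyForFalling = -10;

StepHandle = @(Action,LoggedSignals) myStepFunction2(Action,LoggedSignals,envConstants);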
Use the same reset function, specifying it as a function handle rather than by using its name.
env2 = rlFunctionEnv(ObservationInfo,ActionInfo,StepHandle,ResetHandle);
Before you train an agent in your environment, the best practice is to validate the behavior of your
custom functions. To do so, you can initialize your environment using the reset function and run one
simulation step using the step function. For reproducibility, set the random generator seed before
validation.
rng(0);
InitialObs = reset(env)
InitialObs = 4×1
0
0
0.0315
0
[NextObs,Reward,IsDone,LoggedSignals] = step(env,10);
NextObs
NextObs = 4×1
0
0.1947
0.0315
-0.2826
rng(0);
InitialObs2 = reset(env2)
InitialObs2 = 4×1
0
0
0.0315
0
[NextObs2,Reward2,IsDone2,LoggedSignals2] = step(env2,10);
NextObs2
NextObs2 = 4×1
0
0.1947
0.0315
-0.2826
Both environments initialize and simulate successfully, producing the same state values in NextObs.
See Also
rlFunctionEnv
More About
• “Create MATLAB Reinforcement Learning Environments” on page 2-2
• “Create Custom MATLAB Environment from Template” on page 2-48
Create Custom MATLAB Environment from Template
For more information about creating MATLAB classes, see “User-Defined Classes”.
You can create less complex custom reinforcement learning environments using custom functions, as
described in “Create MATLAB Environment Using Custom Functions” on page 2-41.
rlCreateEnvTemplate("MyEnvironment")
The software creates and opens the template class file. The template class is a subclass of the
rl.env.MATLABEnvironment abstract class, as shown in the class definition at the start of the
template file. This abstract class is the same one used by the other MATLAB reinforcement learning
environment objects.
By default, the template class implements a simple cart-pole balancing model similar to the cart-pole
predefined environments described in “Load Predefined Control System Environments” on page 2-23.
To define your environment dynamics, modify the template class by specifying the following:
• Environment properties
• Required environment methods
• Optional environment methods
Environment Properties
In the properties section of the template, specify any parameters necessary for creating and
simulating the environment. These parameters can include:
• Physical constants — The sample environment defines the acceleration due to gravity (Gravity).
• Environment geometry — The sample environment defines the cart and pole masses (CartMass
and PoleMass) and the half-length of the pole (HalfPoleLength).
• Environment constraints — The sample environment defines the pole angle and cart distance
thresholds (AngleThreshold and DisplacementThreshold). The environment uses these
values to detect when a training episode is finished.
• Variables required for evaluating the environment — The sample environment defines the state
vector (State) and a flag for indicating when an episode is finished (IsDone).
• Constants for defining the actions or observation spaces — The sample environment defines the
maximum force for the action space (MaxForce).
• Constants for calculating the reward signal — The sample environment defines the constants
RewardForNotFalling and PenaltyForFalling.
properties
% Specify and initialize the necessary properties of the environment
% Acceleration due to gravity in m/s^2
Gravity = 9.8
% Sample time
Ts = 0.02
% The template also defines the cart and pole masses, the pole half-length,
% MaxForce, the angle and displacement thresholds, and the reward constants
% described above.
end
properties
% Initialize system state [x,dx,theta,dtheta]'
State = zeros(4,1)
end
properties(Access = protected)
% Initialize internal flag to indicate episode termination
IsDone = false
end
Required Functions
A reinforcement learning environment requires the following functions to be defined. The
getObservationInfo, getActionInfo, sim, and validateEnvironment functions are already
defined in the base abstract class. To create your environment, you must define the constructor,
reset, and step functions.
Function Description
getObservationInfo Return information about the environment observations
getActionInfo Return information about the environment actions
sim Simulate the environment with an agent
validateEnvironment Validate the environment by calling the reset function and
simulating the environment for one time step using step
reset Initialize the environment state and clean up any visualization
step Apply an action, simulate the environment for one step, and
output the observations and rewards; also, set a flag
indicating whether the episode is complete
Constructor function A function with the same name as the class that creates an
instance of the class
The constructor function performs the following actions.
• Defining the action and observation specifications. For more information about creating these
specifications, see rlNumericSpec and rlFiniteSetSpec.
• Calling the constructor of the base abstract class.
function this = MyEnvironment()
% Initialize observation settings
ObservationInfo = rlNumericSpec([4 1]);
ObservationInfo.Name = 'CartPole States';
ObservationInfo.Description = 'x, dx, theta, dtheta';
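The rest of the constructor typically defines the action specification and calls the constructor of the base abstract class. A minimal sketch of those remaining lines (the specific action values and names are assumptions, not necessarily the template defaults):

% Initialize action settings
ActionInfo = rlFiniteSetSpec([-1 1]);
ActionInfo.Name = 'CartPole Action';

% Call the constructor of the base abstract class
this = this@rl.env.MATLABEnvironment(ObservationInfo,ActionInfo);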
This sample constructor function does not include any input arguments. However, you can add input
arguments for your custom constructor.
The sample cart-pole reset function sets the initial condition of the model and returns the initial
values of the observations. It also generates a notification that the environment has been updated by
calling the envUpdatedCallback function, which is useful for updating the environment
visualization.
% Reset environment to initial state and return initial observation
function InitialObservation = reset(this)
% Theta (+- .05 rad)
T0 = 2 * 0.05 * rand - 0.05;
% Thetadot
Td0 = 0;
% X
X0 = 0;
% Xdot
Xd0 = 0;
InitialObservation = [X0;Xd0;T0;Td0];
this.State = InitialObservation;

% Notify the environment that it has been updated (this triggers envUpdatedCallback)
notifyEnvUpdated(this);
end
% Get action
Force = getForce(this,Action);
% Euler integration
Observation = this.State + this.Ts.*[XDot;XDotDot;ThetaDot;ThetaDotDot];
% Get reward
Reward = getReward(this);
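The step method then typically updates the stored state, evaluates the termination condition, and notifies the environment that it has been updated. A minimal sketch of that remainder, assuming the threshold property names listed earlier:

% Update the system state
this.State = Observation;

% Check the terminal condition
x = Observation(1);
theta = Observation(3);
IsDone = abs(x) > this.DisplacementThreshold || abs(theta) > this.AngleThreshold;
this.IsDone = IsDone;

% Notify the environment that it has been updated
notifyEnvUpdated(this);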
Optional Functions
You can define any other functions in your template class as required. For example, you can create
helper functions that are called by either step or reset. The cart-pole template model implements a
getReward function for computing the reward at each time step.
function Reward = getReward(this)
    if ~this.IsDone
        Reward = this.RewardForNotFalling;
    else
        Reward = this.PenaltyForFalling;
    end
end
Environment Visualization
You can add a visualization to your custom environment by implementing the plot function. In the
plot function:
• Create a figure or an instance of a visualizer class of your own implementation. For this example,
you create a figure and store a handle to the figure within the environment object.
• Call the envUpdatedCallback function.
function plot(this)
% Initiate the visualization
this.Figure = figure('Visible','on','HandleVisibility','off');
ha = gca(this.Figure);
ha.XLimMode = 'manual';
ha.YLimMode = 'manual';
ha.XLim = [-3 3];
ha.YLim = [-1 2];
hold(ha,'on');
% Update the visualization
envUpdatedCallback(this)
end
For this example, store the handle to the figure as a protected property of the environment object.
properties(Access = protected)
% Initialize internal flag to indicate episode termination
IsDone = false
% Handle to figure
Figure
end
In the envUpdatedCallback, plot the visualization to the figure or use your custom visualizer
object. For example, check if the figure handle has been set. If it has, then plot the visualization.
function envUpdatedCallback(this)
if ~isempty(this.Figure) && isvalid(this.Figure)
% Set visualization figure as the current figure
ha = gca(this.Figure);
cartplot = findobj(ha,'Tag','cartplot');
poleplot = findobj(ha,'Tag','poleplot');
if isempty(cartplot) || ~isvalid(cartplot) ...
|| isempty(poleplot) || ~isvalid(poleplot)
% Initialize the cart plot
x = this.State(1);   % cart position
cartpoly = polyshape([-0.25 -0.25 0.25 0.25],[-0.125 0.125 0.125 -0.125]);
cartpoly = translate(cartpoly,[x 0]);
cartplot = plot(ha,cartpoly,'FaceColor',[0.8500 0.3250 0.0980]);
cartplot.Tag = 'cartplot';
The environment calls the envUpdatedCallback function, and therefore updates the visualization,
whenever the environment is updated.
To create an instance of your custom environment class, call its constructor.

env = MyEnvironment;
If your constructor has input arguments, specify them after the class name. For example,
MyEnvironment(arg1,arg2).
After you create your environment, the best practice is to validate the environment dynamics. To do
so, use the validateEnvironment function, which prints an error to the command window if your
environment implementation has any issues.
validateEnvironment(env)
After validating the environment object, you can use it to train a reinforcement learning agent. For
more information on training agents, see “Train Reinforcement Learning Agents” on page 5-3.
See Also
rlCreateEnvTemplate | train
More About
• “Create MATLAB Reinforcement Learning Environments” on page 2-2
• “Create MATLAB Environment Using Custom Functions” on page 2-41
• “Define Reward Signals” on page 2-14
Water Tank Reinforcement Learning Environment Model
This example shows how to create a water tank reinforcement learning Simulink® environment that
contains an RL Agent block in the place of a controller for the water level in a tank. To simulate this
environment, you must create an agent and specify that agent in the RL Agent block. For an example
that trains an agent using this environment, see “Create Simulink Environment and Train Agent” on
page 1-20.
mdl = 'rlwatertank';
open_system(mdl)
This model already contains an RL Agent block, which is connected to the action output signal and to the observation, reward, and stop (isdone) input signals.
A reinforcement learning environment receives action signals from the agent and generates
observation signals in response to these actions. To create and train an agent, you must create action
and observation specification objects.
The action signal for this environment is the flow rate control signal that is sent to the plant. To
create a specification object for this continuous action signal, use the rlNumericSpec function.
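A minimal sketch of such a specification (the signal name is an assumption):

actInfo = rlNumericSpec([1 1]);
actInfo.Name = 'flow';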
If the action signal takes one of a discrete set of possible values, create the specification using the
rlFiniteSetSpec function.
For this environment, there are three observation signals sent to the agent, specified as a vector signal. The observation vector is [∫e dt, e, h]ᵀ, where h is the height of the water in the tank, e = r − h is the error, and r is the reference height.
Create a three-element vector of observation specifications. Specify a lower bound of 0 for the water
height, leaving the other observation signals unbounded.
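A sketch of such a specification, assuming the bounds described above:

obsInfo = rlNumericSpec([3 1],...
    'LowerLimit',[-inf -inf 0]',...
    'UpperLimit',[inf inf inf]');
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, and measured height';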
If the actions or observations are represented by bus signals, create specifications using the
bus2RLSpec function.
Reward Signal
Construct a scalar reward signal. For this example, specify the following reward.
The reward is positive when the magnitude of the error is below 0.1 and negative otherwise. Also, a large reward penalty is applied when the water height is outside the 0 to 20 range.
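Expressed as a MATLAB expression (for instance inside a MATLAB Function block), such a reward could look like the following sketch; the exact gains used in the shipped model may differ:

reward = 10*(abs(e) < 0.1) - 1*(abs(e) >= 0.1) - 100*(h <= 0 || h >= 20);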
Stop Signal
To terminate training episodes and simulations, specify a logical signal to the isdone input port of
the block. For this example, terminate the episode if h ≤ 0 or h ≥ 20.
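With the specifications defined, you can create the environment object from the model and the path to the RL Agent block. A sketch assuming the specification variables obsInfo and actInfo from above:

agentBlk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);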
Reset Function
You can also create a custom reset function that randomizes parameters, variables, or states of the
model. In this example, the reset function randomizes the reference signal and the initial water
height and sets the corresponding block parameters.
env.ResetFcn = @(in)localResetFcn(in);
Local Function
function in = localResetFcn(in)
% Randomize the reference signal (desired water level), keeping it in the 0 to 20 range
blk = sprintf('rlwatertank/Desired \nWater Level');   % path to the reference block in the model
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
in = setBlockParameter(in,blk,'Value',num2str(h));
% The initial water height is randomized in a similar way by setting its block parameter.
end
See Also
rlSimulinkEnv
More About
• “Create Simulink Reinforcement Learning Environments” on page 2-8
Create Agents

Reinforcement Learning Agents

A reinforcement learning agent contains two components: a policy and a learning algorithm.
• The policy is a mapping from the current environment observation to a probability distribution of
the actions to be taken. Within an agent, the policy is implemented by a function approximator
with tunable parameters and a specific approximation model, such as a deep neural network.
• The learning algorithm continuously updates the policy parameters based on the actions,
observations, and rewards. The goal of the learning algorithm is to find an optimal policy that
maximizes the expected cumulative long-term reward received during the task.
Depending on the learning algorithm, an agent maintains one or more parameterized function
approximators for training the policy. Approximators can be used in two ways.
• Critics — For a given observation and action, a critic returns the predicted discounted value of
the cumulative long-term reward.
• Actor — For a given observation, an actor returns as output the action that (often) maximizes the
predicted discounted cumulative long-term reward.
Agents that use only critics to select their actions rely on an indirect policy representation. These agents are also referred to as value-based, and they use an approximator to represent a value function or Q-value function. In general, these agents work better with discrete action spaces but can become computationally expensive for continuous action spaces.
Agents that use only actors to select their actions rely on a direct policy representation. These agents
are also referred to as policy-based. The policy can be either deterministic or stochastic. In general,
these agents are simpler and can handle continuous action spaces, though the training algorithm can
be sensitive to noisy measurement and can converge on local minima.
Agents that use both an actor and a critic are referred to as actor-critic agents. In these agents,
during training, the actor learns the best action to take using feedback from the critic (instead of
using the reward directly). At the same time, the critic learns the value function from the rewards so
that it can properly criticize the actor. In general, these agents can handle both discrete and
continuous action spaces.
Built-In Agents
Reinforcement Learning Toolbox software provides the following built-in agents. You can train these
agents in environments with either continuous or discrete observation spaces and the following
action spaces.
The built-in agents differ in the action spaces they support and in the approximators they use. For each agent, the observation space can be discrete, continuous, or mixed. Depending on the agent, the actor and critic are created using the following approximator objects:
• Q-value function critics Q(S,A) and vector Q-value function critics Q(S), which you can create using rlQValueFunction and rlVectorQValueFunction
• Deterministic policy actors π(S), which you can create using rlContinuousDeterministicActor
• Stochastic (Multinoulli) policy actors π(S), for discrete action spaces, which you can create using rlDiscreteCategoricalActor
• Stochastic (Gaussian) policy actors π(S), for continuous action spaces, which you can create using rlContinuousGaussianActor
Agent with default networks — All agents except Q-learning and SARSA agents support default
networks for actors and critics. You can create an agent with a default actor and critic based on the
observation and action specifications from the environment. To do so, at the MATLAB command line,
perform the following steps.
1 Create observation specifications for your environment. If you already have an environment
interface object, you can obtain these specifications using getObservationInfo.
2 Create action specifications for your environment. If you already have an environment interface
object, you can obtain these specifications using getActionInfo.
3 If needed, specify the number of neurons in each learnable layer or whether to use an LSTM
layer. To do so, create an agent initialization option object using
rlAgentInitializationOptions.
4 If needed, specify agent options by creating an options object set for the specific agent. This
option object in turn includes rlOptimizerOptions objects that specify optimization objects
for the agent actor or critic.
5 Create the agent using the corresponding agent creation function. The resulting agent contains
the appropriate actor and critics listed in the table above. The actor and critic use default agent-
specific deep neural networks as internal approximators.
For more information on creating actor and critic function approximators, see “Create Policies and
Value Functions” on page 4-2.
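A minimal sketch of this workflow for a hypothetical environment env, using a PPO agent (the agent type and option values are only an example):

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Optional: request larger hidden layers and a recurrent (LSTM) network
initOpts = rlAgentInitializationOptions('NumHiddenUnit',256,'UseRNN',true);

% Optional: agent options, including optimizer options for the actor and critic
agentOpts = rlPPOAgentOptions('SampleTime',0.1);
agentOpts.ActorOptimizerOptions.LearnRate = 1e-3;
agentOpts.CriticOptimizerOptions.LearnRate = 1e-3;

agent = rlPPOAgent(obsInfo,actInfo,initOpts,agentOpts);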
You can use the Reinforcement Learning Designer app to import an existing environment and
interactively design DQN, DDPG, PPO, or TD3 agents. The app allows you to train and simulate the
agent within your environment, analyze the simulation results, refine the agent parameters, and
export the agent to the MATLAB workspace for further use and deployment. For more information,
see “Create Agents Using Reinforcement Learning Designer” on page 3-9.
• Discrete action and observation spaces — For environments with discrete action and
observation spaces, the Q-learning and SARSA agents are the simplest compatible agents, followed by DQN, PPO, and TRPO.
• Discrete action space and continuous observation space — For environments with a discrete
action space and a continuous observation space, DQN is the simplest compatible agent followed
by PPO and then TRPO.
• Continuous action space — For environments with both a continuous action and observation
space, DDPG is the simplest compatible agent, followed by TD3, PPO, and SAC, which are then
followed by TRPO. For such environments, try DDPG first. In general:
• TRPO is a more complex version of PPO that is more robust for deterministic environments
with fewer observations.
Model-Based Policy Optimization

A model-based policy optimization (MBPO) agent uses an internal model of the environment in addition to a model-free base agent. During training, the MBPO agent generates real experiences by interacting with the environment.
These experiences are used to train the internal environment model, which is used to generate
additional experiences. The training algorithm then uses both the real and generated experiences to
update the agent policy.
An MBPO agent can be more sample efficient than model-free agents because the model can generate
large sets of diverse experiences. However, MBPO agents require much more computational time
than model-free agents, because they must train the environment model and generate samples in
addition to training the base agent.
For more information, see “Model-Based Policy Optimization Agents” on page 3-62.
Custom Agents
You can also train policies using other learning algorithms by creating a custom agent. To do so, you
create a subclass of a custom agent class, and define the agent behavior using a set of required and
optional methods. For more information, see “Create Custom Reinforcement Learning Agents” on
page 3-68. For more information about custom training loops, see “Train Reinforcement Learning
Policy Using Custom Training Loop” on page 5-388.
See Also
rlQAgent | rlSARSAAgent | rlDQNAgent | rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent
| rlSACAgent | rlPPOAgent | rlTRPOAgent | rlMBPOAgent
More About
• “What Is Reinforcement Learning?” on page 1-3
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Agents Using Reinforcement Learning Designer” on page 3-9
Create Agents Using Reinforcement Learning Designer
The Reinforcement Learning Designer app supports DQN, DDPG, TD3, and PPO agents.
To train an agent using Reinforcement Learning Designer, you must first create or import an
environment. For more information, see “Create MATLAB Environments for Reinforcement Learning
Designer” on page 2-5 and “Create Simulink Environments for Reinforcement Learning Designer” on
page 2-11.
Create Agent
To create an agent, on the Reinforcement Learning tab, in the Agent section, click New.
The Reinforcement Learning Designer app creates agents with actors and critics based on default
deep neural networks. You can specify the following options for the default networks.
• Number of hidden units — Specify number of units in each fully-connected or LSTM layer of the
actor and critic networks.
• Use recurrent neural network — Select this option to create actor and critic with recurrent
neural networks that contain an LSTM layer.
The app adds the new default agent to the Agents pane and opens a document for editing the agent
options.
Import Agent
You can also import an agent from the MATLAB workspace into Reinforcement Learning Designer.
To do so, on the Reinforcement Learning tab, click Import. Then, under Select Agent, select the
agent to import.
The app adds the new imported agent to the Agents pane and opens a document for editing the
agent options.
• Agent Options — Agent options, such as the sample time and discount factor. Specify these
options for all supported agent types.
• Exploration Model — Exploration model options. PPO agents do not have an exploration model.
• Target Policy Smoothing Model — Options for target policy smoothing, which is supported for
only TD3 agents.
For more information on these options, see the corresponding agent options object.
You can import agent options from the MATLAB workspace. To create options for each type of agent,
use one of the preceding objects. You can also import options that you previously exported from the
Reinforcement Learning Designer app.
To import the options, on the corresponding Agent tab, click Import. Then, under Options, select an
options object. The app lists only compatible options objects from the MATLAB workspace.
The app configures the agent options to match those in the selected options object.
You can also import actors and critics from the MATLAB workspace. For more information on creating
actors and critics, see “Create Policies and Value Functions” on page 4-2. You can also import
actors and critics that you previously exported from the Reinforcement Learning Designer app.
To import an actor or critic, on the corresponding Agent tab, click Import. Then, under either Actor
or Critic, select an actor or critic object with action and observation specifications that are
compatible with the specifications of the agent.
The app replaces the existing actor or critic in the agent with the selected one. If you import a critic
for a TD3 agent, the app replaces the network for both critics.
To import a deep neural network, on the corresponding Agent tab, click Import. Then, under either
Actor Neural Network or Critic Neural Network, select a network with input and output layers
that are compatible with the observation and action specifications of the agent.
The app replaces the deep neural network in the corresponding actor or critic. If you import a critic
network for a TD3 agent, the app replaces the network for both critics.
From Reinforcement Learning Designer, you can export any of the following.
• Agent
• Agent options
• Actor or critic
• Deep neural network in the actor or critic
To export an agent or agent component, on the corresponding Agent tab, click Export. Then, select
the item to export.
The app saves a copy of the agent or agent component in the MATLAB workspace.
See Also
Reinforcement Learning Designer | analyzeNetwork
Related Examples
• “Reinforcement Learning Agents” on page 3-2
• “Create MATLAB Environments for Reinforcement Learning Designer” on page 2-5
• “Create Simulink Environments for Reinforcement Learning Designer” on page 2-11
• “Design and Train Agent Using Reinforcement Learning Designer” on page 5-12
• “Deep Q-Network (DQN) Agents” on page 3-23
• “Deep Deterministic Policy Gradient (DDPG) Agents” on page 3-31
• “Twin-Delayed Deep Deterministic Policy Gradient Agents” on page 3-35
• “Proximal Policy Optimization Agents” on page 3-44
Q-Learning Agents
The Q-learning algorithm is a model-free, online, off-policy reinforcement learning method. A Q-
learning agent is a value-based reinforcement learning agent that trains a critic to estimate the
return or future rewards. For a given observation, the agent selects and outputs the action for which
the estimated return is greatest.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents” on page 3-2.
Q-learning agents can be trained in environments with the following observation and action spaces.

Observation Space: Continuous or discrete
Action Space: Discrete

Q-learning agents use the following critic.

Critic: Q-value function critic Q(S,A), which you create using rlQValueFunction or rlVectorQValueFunction
Actor: Q-learning agents do not use an actor.
During training, the agent explores the action space using epsilon-greedy exploration. During each control interval, the agent selects a random action with probability ϵ; otherwise, with probability 1−ϵ, it selects the action for which the value function is greatest.
For critics that use table-based value functions, the parameters in ϕ are the actual Q(S,A) values in
the table.
For more information on creating critics for value function approximation, see “Create Policies and
Value Functions” on page 4-2.
During training, the agent tunes the parameter values in ϕ. After training, the parameters remain at
their tuned value and the trained value function approximator is stored in critic Q(S,A).
Agent Creation
To create a Q-learning agent:
1 Create a critic using an rlQValueFunction object, for example based on a table model created with rlTable.
2 Specify agent options using an rlQAgentOptions object.
3 Create the agent using an rlQAgent object.
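A minimal sketch of these steps for an environment env with finite observation and action specifications (the epsilon value is illustrative):

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

qTable = rlTable(obsInfo,actInfo);
critic = rlQValueFunction(qTable,obsInfo,actInfo);

opt = rlQAgentOptions;
opt.EpsilonGreedyExploration.Epsilon = 0.05;

agent = rlQAgent(critic,opt);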
Training Algorithm
Q-learning agents use the following training algorithm. To configure the training algorithm, specify
options using an rlQAgentOptions object.
a For the current observation S, select a random action A with probability ϵ. Otherwise,
select the action for which the critic value function is greatest.
A = argmax_A Q(S,A;ϕ)

y = R + γ max_A Q(S′,A;ϕ)

ΔQ = y − Q(S,A;ϕ)
e Update the critic using the learning rate α. Specify the learning rate when you create the
critic by setting the LearnRate option in the rlCriticOptimizerOptions property
within the agent options object.
• For table-based critics, update the corresponding Q(S,A) value in the table.
Q(S,A) = Q(S,A;ϕ) + α⋅ΔQ
• For all other types of critics, compute the gradients Δϕ of the loss function with
respect to the parameters ϕ. Then, update the parameters based on the computed
gradients. In this case, the loss function is the square of ΔQ.
Δϕ = (1/2) ∇ϕ (ΔQ)²
ϕ = ϕ + α ⋅ Δϕ
f Set the observation S to S'.
See Also
rlQAgent | rlQAgentOptions
More About
• “Reinforcement Learning Agents” on page 3-2
SARSA Agents
The SARSA algorithm is a model-free, online, on-policy reinforcement learning method. A SARSA
agent is a value-based reinforcement learning agent that trains a critic to estimate the return or
future rewards. For a given observation, the agent selects and outputs the action for which the
estimated return is greatest.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents” on page 3-2.
SARSA agents can be trained in environments with the following observation and action spaces.

Observation Space: Continuous or discrete
Action Space: Discrete

SARSA agents use the following critic.

Critic: Q-value function critic Q(S,A), which you create using rlQValueFunction or rlVectorQValueFunction
Actor: SARSA agents do not use an actor.
During training, the agent explores the action space using epsilon-greedy exploration. During each control interval, the agent selects a random action with probability ϵ; otherwise, with probability 1−ϵ, it selects the action for which the value function is greatest.
For critics that use table-based value functions, the parameters in ϕ are the actual Q(S,A) values in
the table.
For more information on creating critics for value function approximation, see “Create Policies and
Value Functions” on page 4-2.
During training, the agent tunes the parameter values in ϕ. After training, the parameters remain at
their tuned value and the trained value function approximator is stored in critic Q(S,A).
Agent Creation
To create a SARSA agent:
1 Create a critic using an rlQValueFunction object, for example based on a table model created with rlTable.
2 Specify agent options using an rlSARSAAgentOptions object.
3 Create the agent using an rlSARSAAgent object.
Training Algorithm
SARSA agents use the following training algorithm. To configure the training algorithm, specify
options using an rlSARSAAgentOptions object.
A = argmax_A Q(S,A;ϕ)

y = R + γ Q(S′,A′;ϕ)

ΔQ = y − Q(S,A;ϕ)
e Update the critic using the learning rate α. Specify the learning rate when you create the
critic by setting the LearnRate option in the rlCriticOptimizerOptions property
within the agent options object.
• For table-based critics, update the corresponding Q(S,A) value in the table.
Q(S,A) = Q(S,A;ϕ) + α⋅ΔQ
• For all other types of critics, compute the gradients Δϕ of the loss function with
respect to the parameters ϕ. Then, update the parameters based on the computed
gradients. In this case, the loss function is the square of ΔQ.
Δϕ = (1/2) ∇ϕ (ΔQ)²
ϕ = ϕ + α ⋅ Δϕ
f Set the observation S to S'.
g Set the action A to A'.
See Also
rlSARSAAgent | rlSARSAAgentOptions
More About
• “Reinforcement Learning Agents” on page 3-2
• “Create Policies and Value Functions” on page 4-2
• “Train Reinforcement Learning Agents” on page 5-3
• “Train Reinforcement Learning Agent in Basic Grid World” on page 1-14
Deep Q-Network (DQN) Agents

The deep Q-network (DQN) algorithm is a model-free, online, off-policy reinforcement learning method. A DQN agent is a value-based reinforcement learning agent that trains a critic to estimate the return or future rewards.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents” on page 3-2.
DQN agents can be trained in environments with the following observation and action spaces.

Observation Space: Continuous or discrete
Action Space: Discrete

DQN agents use the following critic.

Critic: Q-value function critic Q(S,A), which you create using rlQValueFunction or rlVectorQValueFunction
Actor: DQN agents do not use an actor.
To estimate the value function, a DQN agent maintains two function approximators.
• Critic Q(S,A;ϕ) — The critic, with parameters ϕ, takes observation S and action A as inputs and
returns the corresponding expectation of the long-term reward.
• Target critic Qt(S,A;ϕt) — To improve the stability of the optimization, the agent periodically
updates the target critic parameters ϕt using the latest critic parameter values.
Both Q(S,A;ϕ) and Qt(S,A;ϕt) have the same structure and parameterization.
For more information on creating critics for value function approximation, see “Create Policies and
Value Functions” on page 4-2.
During training, the agent tunes the parameter values in ϕ. After training, the parameters remain at
their tuned value and the trained value function approximator is stored in critic Q(S,A).
Agent Creation
You can create and train DQN agents at the MATLAB command line or using the Reinforcement
Learning Designer app. For more information on creating agents using Reinforcement Learning
Designer, see “Create Agents Using Reinforcement Learning Designer” on page 3-9.
At the command line, you can create a DQN agent with a critic based on the observation and action
specifications from the environment. To do so, perform the following steps.
1 Create observation specifications for your environment. If you already have an environment
interface object, you can obtain these specifications using getObservationInfo.
2 Create action specifications for your environment. If you already have an environment interface
object, you can obtain these specifications using getActionInfo.
3 If needed, specify the number of neurons in each learnable layer or whether to use an LSTM
layer. To do so, create an agent initialization option object using
rlAgentInitializationOptions.
4 If needed, specify agent options using an rlDQNAgentOptions object.
5 Create the agent using an rlDQNAgent object.
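A minimal sketch of these steps (the option values shown are illustrative, not the defaults):

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

initOpts = rlAgentInitializationOptions('NumHiddenUnit',128);

agentOpts = rlDQNAgentOptions(...
    'UseDoubleDQN',true,...
    'TargetSmoothFactor',1e-3,...
    'ExperienceBufferLength',1e6);

agent = rlDQNAgent(obsInfo,actInfo,initOpts,agentOpts);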
Alternatively, you can create actor and critic and use these objects to create your agent. In this case,
ensure that the input and output dimensions of the actor and critic match the corresponding action
and observation specifications of the environment.
DQN agents support critics that use recurrent deep neural networks as functions approximators.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
Training Algorithm
DQN agents use the following training algorithm, in which they update their critic model at each time
step. To configure the training algorithm, specify options using an rlDQNAgentOptions object.
• Initialize the critic Q(s,a;ϕ) with random parameter values ϕ, and initialize the target critic
parameters ϕt with the same values. ϕt = ϕ.
• For each training time step:
1 For the current observation S, select a random action A with probability ϵ. Otherwise, select
the action for which the critic value function is greatest.
A = argmax_A Q(S,A;ϕ)
To set the discount factor γ, use the DiscountFactor option. To use double DQN, set the
UseDoubleDQN option to true.
6 Update the critic parameters by one-step minimization of the loss L across all sampled
experiences.
L = (1/M) ∑_{i=1}^{M} ( yi − Q(Si,Ai;ϕ) )²
7 Update the target critic parameters depending on the target update method. For more
information, see “Target Update Methods” on page 3-25.
8 Update the probability threshold ϵ for selecting a random action based on the decay rate you
specify in the EpsilonGreedyExploration option.
• Smoothing — Update the target parameters at every time step using smoothing factor τ. To
specify the smoothing factor, use the TargetSmoothFactor option.
ϕt = τϕ + (1 − τ)ϕt
• Periodic — Update the target parameters periodically without smoothing (TargetSmoothFactor
= 1). To specify the update period, use the TargetUpdateFrequency parameter.
• Periodic Smoothing — Update the target parameters periodically with smoothing.
To configure the target update method, create an rlDQNAgentOptions object, and set the TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table.

Update method           TargetUpdateFrequency    TargetSmoothFactor
Smoothing (default)     1                        Less than 1
Periodic                Greater than 1           1
Periodic smoothing      Greater than 1           Less than 1
References
[1] Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan
Wierstra, and Martin Riedmiller. “Playing Atari with Deep Reinforcement Learning.”
ArXiv:1312.5602 [Cs], December 19, 2013. https://arxiv.org/abs/1312.5602.
See Also
rlDQNAgent | rlDQNAgentOptions
More About
• “Reinforcement Learning Agents” on page 3-2
• “Create Policies and Value Functions” on page 4-2
• “Train Reinforcement Learning Agents” on page 5-3
Policy Gradient Agents

The policy gradient (PG) algorithm is a model-free, online, on-policy reinforcement learning method. A PG agent is a policy-based reinforcement learning agent that uses the REINFORCE algorithm to directly compute an optimal policy that maximizes the long-term reward.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents” on page 3-2.
PG agents can be trained in environments with the following observation and action spaces.

Observation Space: Continuous or discrete
Action Space: Discrete or continuous

During training, a PG agent:
• Estimates probabilities of taking each action in the action space and randomly selects actions
based on the probability distribution.
• Completes a full training episode using the current policy before learning from the experience and
updating the policy parameters.
If the UseExplorationPolicy option of the agent is set to false, the action with maximum likelihood is always used in sim and generatePolicyFunction. As a result, the simulated agent and generated policy behave deterministically.

If UseExplorationPolicy is set to true, the agent selects its actions by sampling its probability distribution. As a result, the policy is stochastic and the agent explores its observation space.
This option affects only simulation and deployment; it does not affect training.
Depending on the action space, the actor produces the following outputs.
• Discrete action space — The probability of taking each discrete action. The sum of these
probabilities across all actions is 1.
• Continuous action space — The mean and standard deviation of the Gaussian probability
distribution for each continuous action.
To reduce the variance during gradient estimation, PG agents can use a baseline value function,
which is estimated using a critic function approximator, V(S;ϕ) with parameters ϕ. The critic
computes the value function for a given observation state.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
During training, the agent tunes the parameter values in θ. After training, the parameters remain at
their tuned value and the trained actor function approximator is stored in π(A|S).
Agent Creation
You can create a PG agent with default actor and critic based on the observation and action
specifications from the environment. To do so, perform the following steps.
1 Create observation specifications for your environment. If you already have an environment
interface object, you can obtain these specifications using getObservationInfo.
2 Create action specifications for your environment. If you already have an environment interface
object, you can obtain these specifications using getActionInfo.
3 If needed, specify the number of neurons in each learnable layer or whether to use an LSTM
layer. To do so, create an agent initialization option object using
rlAgentInitializationOptions.
4 If needed, specify agent options using an rlPGAgentOptions object.
5 Create the agent using an rlPGAgent object.
Alternatively, you can create actor and critic and use these objects to create your agent. In this case,
ensure that the input and output dimensions of the actor and critic match the corresponding action
and observation specifications of the environment.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
Training Algorithm
PG agents use the REINFORCE (Monte Carlo policy gradient) algorithm either with or without a
baseline. To configure the training algorithm, specify options using an rlPGAgentOptions object.
REINFORCE Algorithm
Here, St is a state observation, At is an action taken from that state, St+1 is the next state, and Rt+1 is the reward received for moving from St to St+1.
3 For each state in the episode sequence, that is, for t = 1, 2, …, T-1, calculate the return Gt, which
is the discounted future reward.
Gt = ∑_{k=t}^{T} γ^(k−t) Rk
4 Accumulate the gradients for the actor network by following the policy gradient to maximize the
expected discounted reward. If the EntropyLossWeight option is greater than zero, then
additional gradients are accumulated to minimize the entropy loss function.
dθ = ∑_{t=1}^{T−1} Gt ∇θ ln π(St;θ)
5 Update the actor parameters by applying the gradients.
θ = θ + αdθ
Here, α is the learning rate of the actor. Specify the learning rate when you create the actor by
setting the LearnRate option in the rlActorOptimizerOptions property within the agent
options object. For simplicity, this step shows a gradient update using basic stochastic gradient
descent. The actual gradient update method depends on the optimizer you specify using in the
rlOptimizerOptions object assigned to the rlActorOptimizerOptions property.
6 Repeat steps 2 through 5 for each training episode until training is complete.
δt = Gt − V(St;ϕ)
5 Accumulate the gradients for the critic network.
dϕ = ∑_{t=1}^{T−1} δt ∇ϕ V(St;ϕ)
6 Accumulate the gradients for the actor network. If the EntropyLossWeight option is greater
than zero, then additional gradients are accumulated to minimize the entropy loss function.
dθ = ∑_{t=1}^{T−1} δt ∇θ ln π(St;θ)
7 Update the critic parameters ϕ.
ϕ = ϕ + βdϕ
Here, β is the learning rate of the critic. Specify the learning rate when you create the critic by
setting the LearnRate option in the rlCriticOptimizerOptions property within the agent
options object.
8 Update the actor parameters θ.
θ = θ + αdθ
9 Repeat steps 3 through 8 for each training episode until training is complete.
For simplicity, the actor and critic updates in this algorithm show a gradient update using basic
stochastic gradient descent. The actual gradient update method depends on the optimizer you specify
using in the rlOptimizerOptions object assigned to the rlCriticOptimizerOptions property.
References
[1] Williams, Ronald J. “Simple Statistical Gradient-Following Algorithms for Connectionist
Reinforcement Learning.” Machine Learning 8, no. 3–4 (May 1992): 229–56. https://doi.org/10.1007/BF00992696.
See Also
rlPGAgent | rlPGAgentOptions
More About
• “Reinforcement Learning Agents” on page 3-2
• “Create Policies and Value Functions” on page 4-2
• “Train Reinforcement Learning Agents” on page 5-3
Deep Deterministic Policy Gradient (DDPG) Agents

The deep deterministic policy gradient (DDPG) algorithm is a model-free, online, off-policy reinforcement learning method. A DDPG agent is an actor-critic reinforcement learning agent that searches for an optimal policy that maximizes the expected cumulative long-term reward.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents” on page 3-2.
DDPG agents can be trained in environments with the following observation and action spaces.

Observation Space: Continuous or discrete
Action Space: Continuous

DDPG agents use the following actor and critic.

Critic: Q-value function critic Q(S,A), which you create using rlQValueFunction
Actor: Deterministic policy actor π(S), which you create using rlContinuousDeterministicActor

During training, a DDPG agent:
• Updates the actor and critic properties at each time step during learning.
• Stores past experiences using a circular experience buffer. The agent updates the actor and critic
using a mini-batch of experiences randomly sampled from the buffer.
• Perturbs the action chosen by the policy using a stochastic noise model at each training step.
To estimate the policy and value function, a DDPG agent maintains four function approximators.
• Actor π(S;θ) — The actor, with parameters θ, takes observation S and returns the corresponding
action that maximizes the long-term reward.
• Target actor πt(S;θt) — To improve the stability of the optimization, the agent periodically updates
the target actor parameters θt using the latest actor parameter values.
• Critic Q(S,A;ϕ) — The critic, with parameters ϕ, takes observation S and action A as inputs and
returns the corresponding expectation of the long-term reward.
• Target critic Qt(S,A;ϕt) — To improve the stability of the optimization, the agent periodically
updates the target critic parameters ϕt using the latest critic parameter values.
Both Q(S,A;ϕ) and Qt(S,A;ϕt) have the same structure and parameterization, and both π(S;θ) and
πt(S;θt) have the same structure and parameterization.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
During training, the agent tunes the parameter values in θ. After training, the parameters remain at
their tuned value and the trained actor function approximator is stored in π(S).
Agent Creation
You can create and train DDPG agents at the MATLAB command line or using the Reinforcement
Learning Designer app. For more information on creating agents using Reinforcement Learning
Designer, see “Create Agents Using Reinforcement Learning Designer” on page 3-9.
At the command line, you can create a DDPG agent with default actor and critics based on the
observation and action specifications from the environment. To do so, perform the following steps.
1 Create observation specifications for your environment. If you already have an environment
interface object, you can obtain these specifications using getObservationInfo.
2 Create action specifications for your environment. If you already have an environment interface
object, you can obtain these specifications using getActionInfo.
3 If needed, specify the number of neurons in each learnable layer or whether to use an LSTM
layer. To do so, create an agent initialization option object using
rlAgentInitializationOptions.
4 If needed, specify agent options using an rlDDPGAgentOptions object.
5 Create the agent using an rlDDPGAgent object.
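A minimal sketch of these steps (the noise and sample-time values are illustrative):

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

initOpts = rlAgentInitializationOptions('NumHiddenUnit',256);

agentOpts = rlDDPGAgentOptions('SampleTime',0.05,...
    'TargetSmoothFactor',1e-3,...
    'ExperienceBufferLength',1e6);
agentOpts.NoiseOptions.StandardDeviation = 0.3;
agentOpts.NoiseOptions.StandardDeviationDecayRate = 1e-5;

agent = rlDDPGAgent(obsInfo,actInfo,initOpts,agentOpts);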
Alternatively, you can create actor and critic and use these objects to create your agent. In this case,
ensure that the input and output dimensions of the actor and critic match the corresponding action
and observation specifications of the environment.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
Training Algorithm
DDPG agents use the following training algorithm, in which they update their actor and critic models
at each time step. To configure the training algorithm, specify options using an
rlDDPGAgentOptions object.
• Initialize the critic Q(S,A;ϕ) with random parameter values ϕ, and initialize the target critic
parameters ϕt with the same values: ϕt = ϕ.
• Initialize the actor π(S;θ) with random parameter values θ, and initialize the target actor
parameters θt with the same values: θt = θ.
• For each training time step:
1 For the current observation S, select action A = π(S;θ) + N, where N is stochastic noise from
the noise model. To configure the noise model, use the NoiseOptions option.
2 Execute action A. Observe the reward R and next observation S'.
3 Store the experience (S,A,R,S') in the experience buffer. The length of the experience buffer is
specified in the ExperienceBufferLength property of the rlDDPGAgentOptions object.
The value function target is the sum of the experience reward Ri and the discounted future
reward. To specify the discount factor γ, use the DiscountFactor option.
To compute the cumulative reward, the agent first computes a next action by passing the next
observation S'i from the sampled experience to the target actor. The agent finds the
cumulative reward by passing the next action to the target critic.
6 Update the critic parameters by minimizing the loss L across all sampled experiences.
L = (1/M) ∑_{i=1}^{M} ( yi − Q(Si,Ai;ϕ) )²
7 Update the actor parameters using the following sampled policy gradient to maximize the
expected discounted reward.
∇θJ ≈ (1/M) ∑_{i=1}^{M} Gai Gπi

Gai = ∇A Q(Si,A;ϕ), evaluated at A = π(Si;θ)

Gπi = ∇θ π(Si;θ)
Here, Gai is the gradient of the critic output with respect to the action computed by the actor
network, and Gπi is the gradient of the actor output with respect to the actor parameters. Both
gradients are evaluated for observation Si.
8 Update the target actor and critic parameters depending on the target update method. For
more information see “Target Update Methods” on page 3-33.
For simplicity, the actor and critic updates in this algorithm show a gradient update using basic
stochastic gradient descent. The actual gradient update method depends on the optimizer you specify
using in the rlOptimizerOptions object assigned to the rlCriticOptimizerOptions property.
• Smoothing — Update the target parameters at every time step using smoothing factor τ. To
specify the smoothing factor, use the TargetSmoothFactor option.
ϕt = τϕ + (1 − τ)ϕt (critic parameters)
θt = τθ + (1 − τ)θt (actor parameters)
• Periodic — Update the target parameters periodically without smoothing (TargetSmoothFactor
= 1). To specify the update period, use the TargetUpdateFrequency parameter.
• Periodic Smoothing — Update the target parameters periodically with smoothing.
To configure the target update method, create an rlDDPGAgentOptions object, and set the TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table.

Update method           TargetUpdateFrequency    TargetSmoothFactor
Smoothing (default)     1                        Less than 1
Periodic                Greater than 1           1
Periodic smoothing      Greater than 1           Less than 1
References
[1] Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,
David Silver, and Daan Wierstra. “Continuous Control with Deep Reinforcement Learning.”
ArXiv:1509.02971 [Cs, Stat], September 9, 2015. https://arxiv.org/abs/1509.02971.
See Also
rlDDPGAgent | rlDDPGAgentOptions
More About
• “Reinforcement Learning Agents” on page 3-2
• “Create Policies and Value Functions” on page 4-2
• “Train Reinforcement Learning Agents” on page 5-3
Twin-Delayed Deep Deterministic Policy Gradient Agents

The twin-delayed deep deterministic (TD3) policy gradient algorithm is a model-free, online, off-policy reinforcement learning method. A TD3 agent is an actor-critic reinforcement learning agent that searches for an optimal policy that maximizes the expected cumulative long-term reward.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents” on page 3-2.
The TD3 algorithm is an extension of the DDPG algorithm. DDPG agents can overestimate value
functions, which can produce suboptimal policies. To reduce value function overestimation, the TD3
algorithm includes the following modifications of the DDPG algorithm.
1 A TD3 agent learns two Q-value functions and uses the minimum value function estimate during
policy updates.
2 A TD3 agent updates the policy and targets less frequently than the Q functions.
3 When updating the policy, a TD3 agent adds noise to the target action, which makes the policy
less likely to exploit actions with high Q-value estimates.
You can use a TD3 agent to implement one of the following training algorithms, depending on the
number of critics you specify.
• TD3 — Train the agent with two Q-value functions. This algorithm implements all three of the
preceding modifications.
• Delayed DDPG — Train the agent with a single Q-value function. This algorithm trains a DDPG
agent with target policy smoothing and delayed policy and target updates.
TD3 agents can be trained in environments with the following observation and action spaces.

Observation Space: Continuous or discrete
Action Space: Continuous

TD3 agents use the following actor and critics.

Critics: One or more Q-value function critics Q(S,A), which you create using rlQValueFunction
Actor: Deterministic policy actor π(S), which you create using rlContinuousDeterministicActor

During training, a TD3 agent:
• Updates the actor and critic properties at each time step during learning.
• Stores past experiences using a circular experience buffer. The agent updates the actor and critic
using a mini-batch of experiences randomly sampled from the buffer.
• Perturbs the action chosen by the policy using a stochastic noise model at each training step.
To estimate the policy and value function, a TD3 agent maintains the following function approximators.
• Deterministic actor π(S;θ) — The actor, with parameters θ, takes observation S and returns the
corresponding action that maximizes the long-term reward.
• Target actor πt(S;θt) — To improve the stability of the optimization, the agent periodically updates
the target actor parameters θt using the latest actor parameter values.
• One or two Q-value critics Qk(S,A;ϕk) — The critics, each with different parameters ϕk, take
observation S and action A as inputs and returns the corresponding expectation of the long-term
reward.
• One or two target critics Qtk(S,A;ϕtk) — To improve the stability of the optimization, the agent
periodically updates the target critic parameters ϕtk using the latest corresponding critic
parameter values. The number of target critics matches the number of critics.
Both π(S;θ) and πt(S;θt) have the same structure and parameterization.
For each critic, Qk(S,A;ϕk) and Qtk(S,A;ϕtk) have the same structure and parameterization.
When using two critics, Q1(S,A;ϕ1) and Q2(S,A;ϕ2), each critic can have a different structure, though
TD3 works best when the critics have the same structure. When the critics have the same structure,
they must have different initial parameter values.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
During training, the agent tunes the parameter values in θ. After training, the parameters remain at
their tuned value and the trained actor function approximator is stored in π(S).
Agent Creation
You can create and train TD3 agents at the MATLAB command line or using the Reinforcement
Learning Designer app. For more information on creating agents using Reinforcement Learning
Designer, see “Create Agents Using Reinforcement Learning Designer” on page 3-9.
At the command line, you can create a TD3 agent with default actor and critics based on the
observation and action specifications from the environment. To do so, perform the following steps.
1 Create observation specifications for your environment. If you already have an environment
interface object, you can obtain these specifications using getObservationInfo.
2 Create action specifications for your environment. If you already have an environment interface
object, you can obtain these specifications using getActionInfo.
3 If needed, specify the number of neurons in each learnable layer or whether to use an LSTM
layer. To do so, create an agent initialization option object using
rlAgentInitializationOptions.
4 If needed, specify agent options using an rlTD3AgentOptions object.
5 Create the agent using an rlTD3Agent object.
Alternatively, you can create actor and critics and use these objects to create your agent. In this case,
ensure that the input and output dimensions of the actor and critics match the corresponding action
and observation specifications of the environment.
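For example, the following commands sketch the default-agent creation steps listed above, assuming an existing environment object env; the option values are illustrative.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);   % neurons per learnable layer
agentOpts = rlTD3AgentOptions(DiscountFactor=0.99);
agent = rlTD3Agent(obsInfo,actInfo,initOpts,agentOpts);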
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
Training Algorithm
TD3 agents use the following training algorithm, in which they update their actor and critic models at
each time step. To configure the training algorithm, specify options using an rlTD3AgentOptions
object. Here, K = 2 is the number of critics and k is the critic index.
• Initialize each critic Qk(S,A;ϕk) with random parameter values ϕk, and initialize each target critic
with the same random parameter values: ϕtk = ϕk.
• Initialize the actor π(S;θ) with random parameter values θ, and initialize the target actor with the
same parameter values: θt = θ.
• For each training time step:
1 For the current observation S, select action A = π(S;θ) + N, where N is stochastic noise from
the noise model. To configure the noise model, use the ExplorationModel option.
2 Execute action A. Observe the reward R and next observation S'.
3 Store the experience (S,A,R,S') in the experience buffer.
4 Sample a random mini-batch of M experiences (Si,Ai,Ri,S'i) from the experience buffer. To
specify M, use the MiniBatchSize option.
5 If S'i is a terminal state, set the value function target yi to Ri. Otherwise, set it to
yi = Ri + γ min_k Qtk(S'i, clip(πt(S'i;θt) + ε); ϕtk)
The value function target is the sum of the experience reward Ri and the minimum discounted
future reward from the critics. To specify the discount factor γ, use the DiscountFactor
option.
To compute the cumulative reward, the agent first computes a next action by passing the next
observation S'i from the sampled experience to the target actor. Then, the agent adds noise ε
to the computed action using the TargetPolicySmoothModel, and clips the action based on
the upper and lower noise limits. The agent finds the cumulative rewards by passing the next
action to the target critics.
6 At every training time step, update the parameters of each critic by minimizing the loss Lk
across all sampled experiences.
Lk = (1/M) ∑i=1..M (yi − Qk(Si,Ai;ϕk))²
7 Every D1 steps, update the actor parameters using the following sampled policy gradient to
maximize the expected discounted reward. To set D1, use the PolicyUpdateFrequency
option.
∇θJ ≈ (1/M) ∑i=1..M Gai Gπi
Gai = ∇A min_k Qk(Si,A;ϕk), where A = π(Si;θ)
Gπi = ∇θ π(Si;θ)
Here, Gai is the gradient of the minimum critic output with respect to the action computed by
the actor network, and Gπi is the gradient of the actor output with respect to the actor
parameters. Both gradients are evaluated for observation Si.
8 Every D2 steps, update the target actor and critics depending on the target update method. To
specify D2, use the TargetUpdateFrequency option. For more information, see “Target
Update Methods” on page 3-38.
For simplicity, the actor and critic updates in this algorithm show a gradient update using basic
stochastic gradient descent. The actual gradient update method depends on the optimizer that you
specify in the rlOptimizerOptions object assigned to the CriticOptimizerOptions property.
• Smoothing — Update the target parameters at every time step using smoothing factor τ. To
specify the smoothing factor, use the TargetSmoothFactor option.
ϕtk = τϕk + (1 − τ)ϕtk (critic parameters)
θt = τθ + (1 − τ)θt (actor parameters)
• Periodic Smoothing — Update the target parameters periodically with smoothing.
To configure the target update method, create an rlTD3AgentOptions object, and set the
TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table.
References
[1] Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing Function Approximation Error in
Actor-Critic Methods". ArXiv:1802.09477 [Cs, Stat], 22 October 2018. https://arxiv.org/abs/
1802.09477.
See Also
rlTD3Agent | rlTD3AgentOptions
More About
• “Reinforcement Learning Agents” on page 3-2
• “Create Policies and Value Functions” on page 4-2
• “Train Reinforcement Learning Agents” on page 5-3
• “Train Biped Robot to Walk Using Reinforcement Learning Agents” on page 5-235
Actor-Critic Agents
You can use the actor-critic (AC) agent, which uses a model-free, online, on-policy reinforcement
learning method, to implement actor-critic algorithms, such as A2C and A3C. The goal of this agent is
to optimize the policy (actor) directly and train a critic to estimate the return or future rewards [1].
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents” on page 3-2.
AC agents can be trained in environments with the following observation and action spaces.
Critic: Value function critic V(S), which you create using rlValueFunction.
Actor: Stochastic policy actor π(S), which you create using rlDiscreteCategoricalActor (for discrete action spaces) or rlContinuousGaussianActor (for continuous action spaces).
• Estimates probabilities of taking each action in the action space and randomly selects actions
based on the probability distribution.
• Interacts with the environment for multiple steps using the current policy before updating the
actor and critic properties.
If the UseExplorationPolicy option of the agent is set to false, the action with maximum
likelihood is always used in sim and generatePolicyFunction. As a result, the simulated agent
and generated policy behave deterministically.
If the UseExplorationPolicy option is set to true, the agent selects its actions by sampling from its
probability distribution. As a result, the policy is stochastic and the agent explores its observation
space.
This option affects only simulation and deployment; it does not affect training.
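For example, assuming agent is an existing rlACAgent object, the following command enforces the deterministic behavior described above during simulation and deployment.
agent.UseExplorationPolicy = false;   % greedy (maximum-likelihood) actions in sim and generatePolicyFunction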
• Actor π(A|S;θ) — The actor, with parameters θ, outputs the conditional probability of taking each
action A when in state S as one of the following:
• Discrete action space — The probability of taking each discrete action. The sum of these
probabilities across all actions is 1.
• Continuous action space — The mean and standard deviation of the Gaussian probability
distribution for each continuous action.
• Critic V(S;ϕ) — The critic, with parameters ϕ, takes observation S and returns the corresponding
expectation of the discounted long-term reward.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
During training, the agent tunes the parameter values in θ. After training, the parameters remain at
their tuned value and the trained actor function approximator is stored in π(A|S).
Agent Creation
You can create an AC agent with default actor and critics based on the observation and action
specifications from the environment. To do so, perform the following steps.
1 Create observation specifications for your environment. If you already have an environment
interface object, you can obtain these specifications using getObservationInfo.
2 Create action specifications for your environment. If you already have an environment interface
object, you can obtain these specifications using getActionInfo.
3 If needed, specify the number of neurons in each learnable layer or whether to use an LSTM
layer. To do so, create an agent initialization option object using
rlAgentInitializationOptions.
4 If needed, specify agent options using an rlACAgentOptions object.
5 Create the agent using an rlACAgent object.
Alternatively, you can create actor and critic and use these objects to create your agent. In this case,
ensure that the input and output dimensions of the actor and critic match the corresponding action
and observation specifications of the environment.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
Training Algorithm
AC agents use the following training algorithm. To configure the training algorithm, specify options
using an rlACAgentOptions object.
Here, St is a state observation, At is an action taken from that state, St+1 is the next state, and Rt
+1 is the reward received for moving from St to St+1.
When in state St, the agent computes the probability of taking each action in the action space
using π(A|St;θ) and randomly selects action At based on the probability distribution.
ts is the starting time step of the current set of N experiences. At the beginning of the training
episode, ts = 1. For each subsequent set of N experiences in the same training episode, ts = ts +
N.
For each training episode that does not contain a terminal state, N is equal to the
NumStepsToLookAhead option value. Otherwise, N is less than NumStepsToLookAhead and SN
is the terminal state.
4 For each episode step t = ts+1, ts+2, …, ts+N, compute the return Gt, which is the sum of the
reward for that step and the discounted future reward. If Sts+N is not a terminal state, the
discounted future reward includes the discounted state value function, computed using the critic
network V.
Gt = ∑k=t..ts+N γ^(k−t) Rk + b γ^(N−t+1) V(Sts+N;ϕ)
Here, b is 0 if Sts+N is a terminal state and 1 otherwise.
5 Compute the advantage function Dt.
Dt = Gt − V(St;ϕ)
6 Accumulate the gradients for the actor network by following the policy gradient to maximize the
expected discounted reward.
dθ = ∑t=1..N ∇θ lnπ(At|St;θ) ⋅ Dt
7 Accumulate the gradients for the critic network by minimizing the mean squared error loss
between the estimated value function V (St;ϕ) and the computed target return Gt across all N
experiences. If the EntropyLossWeight option is greater than zero, then additional gradients
are accumulated to minimize the entropy loss function.
dϕ = ∑t=1..N ∇ϕ (Gt − V(St;ϕ))²
8 Update the actor parameters by applying the gradients.
θ = θ + αdθ
Here, α is the learning rate of the actor. Specify the learning rate by setting the LearnRate option
of the rlOptimizerOptions object assigned to the ActorOptimizerOptions property within the agent
options object.
9 Update the critic parameters by applying the gradients.
ϕ = ϕ + βdϕ
Here, β is the learning rate of the critic. Specify the learning rate by setting the LearnRate option
of the rlOptimizerOptions object assigned to the CriticOptimizerOptions property within the agent
options object.
10 Repeat steps 3 through 9 for each training episode until training is complete.
For simplicity, the actor and critic updates in this algorithm description show a gradient update using
basic stochastic gradient descent. The actual gradient update method depends on the optimizer that
you specify in the rlOptimizerOptions object assigned to the CriticOptimizerOptions property.
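For example, the following commands sketch how to set the actor and critic learning rates α and β through the optimizer options; the numeric values are illustrative only.
acOpts = rlACAgentOptions(NumStepsToLookAhead=32);
acOpts.ActorOptimizerOptions.LearnRate = 1e-3;    % actor learning rate (alpha)
acOpts.CriticOptimizerOptions.LearnRate = 2e-3;   % critic learning rate (beta)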
References
[1] Mnih, Volodymyr, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim
Harley, David Silver, and Koray Kavukcuoglu. “Asynchronous Methods for Deep
Reinforcement Learning.” ArXiv:1602.01783 [Cs], February 4, 2016. https://arxiv.org/abs/
1602.01783.
See Also
rlACAgent | rlACAgentOptions
More About
• “Reinforcement Learning Agents” on page 3-2
• “Create Policies and Value Functions” on page 4-2
• “Train Reinforcement Learning Agents” on page 5-3
Proximal Policy Optimization Agents
PPO is a simplified version of TRPO. TRPO is more computationally expensive than PPO, but TRPO
tends to be more robust than PPO if the environment dynamics are deterministic and the observation
is low dimensional. For more information on TRPO agents, see “Trust Region Policy Optimization
Agents” on page 3-50.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents” on page 3-2.
PPO agents can be trained in environments with the following observation and action spaces.
Critic: Value function critic V(S), which you create using rlValueFunction.
Actor: Stochastic policy actor π(S), which you create using rlDiscreteCategoricalActor (for discrete action spaces) or rlContinuousGaussianActor (for continuous action spaces).
• Estimates probabilities of taking each action in the action space and randomly selects actions
based on the probability distribution.
• Interacts with the environment for multiple steps using the current policy before using mini-
batches to update the actor and critic properties over multiple epochs.
If the UseExplorationPolicy option of the agent is set to false, the action with maximum
likelihood is always used in sim and generatePolicyFunction. As a result, the simulated agent
and generated policy behave deterministically.
If the UseExplorationPolicy option is set to true, the agent selects its actions by sampling from its
probability distribution. As a result, the policy is stochastic and the agent explores its observation
space.
This option affects only simulation and deployment; it does not affect training.
• Actor π(A|S;θ) — The actor, with parameters θ, outputs the conditional probability of taking each
action A when in state S as one of the following:
• Discrete action space — The probability of taking each discrete action. The sum of these
probabilities across all actions is 1.
• Continuous action space — The mean and standard deviation of the Gaussian probability
distribution for each continuous action.
• Critic V(S;ϕ) — The critic, with parameters ϕ, takes observation S and returns the corresponding
expectation of the discounted long-term reward.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
During training, the agent tunes the parameter values in θ. After training, the parameters remain at
their tuned value and the trained actor function approximator is stored in π(A|S).
Agent Creation
You can create and train PPO agents at the MATLAB command line or using the Reinforcement
Learning Designer app. For more information on creating agents using Reinforcement Learning
Designer, see “Create Agents Using Reinforcement Learning Designer” on page 3-9.
At the command line, you can create a PPO agent with default actor and critic based on the
observation and action specifications from the environment. To do so, perform the following steps.
1 Create observation specifications for your environment. If you already have an environment
interface object, you can obtain these specifications using getObservationInfo.
2 Create action specifications for your environment. If you already have an environment interface
object, you can obtain these specifications using getActionInfo.
3 If needed, specify the number of neurons in each learnable layer or whether to use an LSTM
layer. To do so, create an agent initialization option object using
rlAgentInitializationOptions.
4 Specify agent options using an rlPPOAgentOptions object.
5 Create the agent using an rlPPOAgent object.
Alternatively, you can create actor and critic and use these objects to create your agent. In this case,
ensure that the input and output dimensions of the actor and critic match the corresponding action
and observation specifications of the environment.
PPO agents support actors and critics that use recurrent deep neural networks as function
approximators.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
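For example, the following commands sketch creating a default PPO agent with a typical set of the options referenced in this section; all numeric values are illustrative rather than recommended.
ppoOpts = rlPPOAgentOptions( ...
    ExperienceHorizon=512, ...
    MiniBatchSize=64, ...
    NumEpoch=3, ...
    AdvantageEstimateMethod="gae", ...
    GAEFactor=0.95, ...
    EntropyLossWeight=0.01, ...
    DiscountFactor=0.99);
agent = rlPPOAgent(obsInfo,actInfo,ppoOpts);   % obsInfo and actInfo assumed to exist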
Training Algorithm
PPO agents use the following training algorithm. To configure the training algorithm, specify options
using an rlPPOAgentOptions object.
Here, St is a state observation, At is an action taken from that state, St+1 is the next state, and Rt
+1 is the reward received for moving from St to St+1.
When in state St, the agent computes the probability of taking each action in the action space
using π(A|St;θ) and randomly selects action At based on the probability distribution.
ts is the starting time step of the current set of N experiences. At the beginning of the training
episode, ts = 1. For each subsequent set of N experiences in the same training episode, ts ← ts +
N.
For each experience sequence that does not contain a terminal state, N is equal to the
ExperienceHorizon option value. Otherwise, N is less than ExperienceHorizon and SN is
the terminal state.
4 For each episode step t = ts+1, ts+2, …, ts+N, compute the return and advantage function using
the method specified by the AdvantageEstimateMethod option.
• Finite Horizon (AdvantageEstimateMethod = "finite-horizon") — Compute the return Gt,
which is the sum of the reward for that step and the discounted future reward, and the
advantage function Dt.
Gt = ∑k=t..ts+N γ^(k−t) Rk + b γ^(N−t+1) V(Sts+N;ϕ)
Here, b is 0 if Sts+N is a terminal state and 1 otherwise. That is, if Sts+N is not a terminal state,
the discounted future reward includes the discounted state value function, computed using
the critic network V.
Dt = Gt − V(St;ϕ)
• Generalized Advantage Estimator (AdvantageEstimateMethod = "gae") — Compute
the advantage function Dt, which is the discounted sum of temporal difference errors [3].
Dt = ∑k=t..ts+N−1 (γλ)^(k−t) δk
δk = Rk + b γ V(Sk+1;ϕ) − V(Sk;ϕ)
Here, b is 0 if Sts+N is a terminal state and 1 otherwise. λ is a smoothing factor specified using
the GAEFactor option.
Gt = Dt + V(St;ϕ)
To specify the discount factor γ for either method, use the DiscountFactor option.
5 Learn from mini-batches of experiences over K epochs. To specify K, use the NumEpoch option.
For each learning epoch:
a Sample a random mini-batch data set of size M from the current set of experiences. To
specify M, use the MiniBatchSize option. Each element of the mini-batch data set contains
a current experience and the corresponding return and advantage function values.
b Update the critic parameters by minimizing the loss Lcritic across all sampled mini-batch data.
Lcritic(ϕ) = (1/M) ∑i=1..M (Gi − V(Si;ϕ))²
c Normalize the advantage values Di based on recent unnormalized advantage values.
• If the NormalizedAdvantageMethod option is 'current', normalize the advantage
values based on the unnormalized advantages in the current mini-batch.
d Update the actor parameters by minimizing the actor loss function Lactor across all sampled
mini-batch data.
Lactor(θ) = −(1/M) ∑i=1..M min( ri(θ)·Di, ci(θ)·Di )
ri(θ) = π(Ai|Si;θ) / π(Ai|Si;θold)
ci(θ) = max( min( ri(θ), 1 + ε ), 1 − ε )
Here:
• Di and Gi are the advantage function and return value for the ith element of the mini-
batch, respectively.
• π(Ai|Si;θ) is the probability of taking action Ai when in state Si, given the updated policy
parameters θ.
• π(Ai|Si;θold) is the probability of taking action Ai when in state Si, given the previous policy
parameters θold from before the current learning epoch.
• ε is the clip factor, which you specify using the ClipFactor option.
Entropy Loss
To promote agent exploration, you can add an entropy loss term wℋi(θ,Si) to the actor loss function,
where w is the entropy loss weight and ℋi(θ,Si) is the entropy.
The entropy value is higher when the agent is more uncertain about which action to take next.
Therefore, maximizing the entropy loss term (minimizing the negative entropy loss) increases the
agent uncertainty, thus encouraging exploration. To promote additional exploration, which can help
the agent move out of local optima, you can specify a larger entropy loss weight.
For a discrete action space, the agent uses the following entropy value. In this case, the actor outputs
the probability of taking each possible discrete action.
ℋi(θ,Si) = −∑k=1..P π(Ak|Si;θ) lnπ(Ak|Si;θ)
Here:
• P is the number of possible discrete actions.
• π(Ak|Si;θ) is the probability of taking action Ak when in state Si, following the current policy.
For a continuous action space, the agent uses the following entropy value. In this case, the actor
outputs the mean and standard deviation of the Gaussian distribution for each continuous action.
ℋi(θ,Si) = (1/2) ∑k=1..C ln(2π ⋅ e ⋅ σk,i²)
Here:
• C is the number of continuous actions output by the actor.
• σk,i is the standard deviation of the Gaussian distribution for the kth action when in state Si.
References
[1] Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal Policy
Optimization Algorithms.” ArXiv:1707.06347 [Cs], July 19, 2017. https://arxiv.org/abs/
1707.06347.
[2] Mnih, Volodymyr, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim
Harley, David Silver, and Koray Kavukcuoglu. “Asynchronous Methods for Deep
Reinforcement Learning.” ArXiv:1602.01783 [Cs], February 4, 2016. https://arxiv.org/abs/
1602.01783.
[3] Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. “High-
Dimensional Continuous Control Using Generalized Advantage Estimation.”
ArXiv:1506.02438 [Cs], October 20, 2018. https://arxiv.org/abs/1506.02438.
See Also
rlPPOAgent | rlPPOAgentOptions
More About
• “Reinforcement Learning Agents” on page 3-2
• “Create Policies and Value Functions” on page 4-2
• “Train Reinforcement Learning Agents” on page 5-3
Trust Region Policy Optimization Agents
PPO is a simplified version of TRPO. TRPO is more computationally expensive than PPO, but TRPO
tends to be more robust than PPO if the environment dynamics are deterministic and the number of
observations is low. For more information on PPO agents, see “Proximal Policy Optimization Agents”
on page 3-44.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents” on page 3-2.
TRPO agents can be trained in environments with the following observation and action spaces.
Critic: Value function critic V(S), which you create using rlValueFunction.
Actor: Stochastic policy actor π(S), which you create using rlDiscreteCategoricalActor (for discrete action spaces) or rlContinuousGaussianActor (for continuous action spaces).
• Estimates probabilities of taking each action in the action space and randomly selects actions
based on the probability distribution.
• Interacts with the environment for multiple steps using the current policy before using mini-
batches to update the actor and critic properties over multiple epochs.
If the UseExplorationPolicy option of the agent is set to false, the action with maximum
likelihood is always used in sim and generatePolicyFunction. As a result, the simulated agent
and generated policy behave deterministically.
If the UseExplorationPolicy option is set to true, the agent selects its actions by sampling from its
probability distribution. As a result, the policy is stochastic and the agent explores its observation
space.
This option affects only simulation and deployment; it does not affect training.
• Actor π(A|S;θ) — The actor, with parameters θ, outputs the conditional probability of taking each
action A when in state S as one of the following:
• Discrete action space — The probability of taking each discrete action. The sum of these
probabilities across all actions is 1.
• Continuous action space — The mean and standard deviation of the Gaussian probability
distribution for each continuous action.
• Critic V(S;ϕ) — The critic, with parameters ϕ, takes observation S and returns the corresponding
expectation of the discounted long-term reward.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
During training, the agent tunes the parameter values in θ. After training, the parameters remain at
their tuned value and the trained actor function approximator is stored in π(A|S).
Agent Creation
You can create and train TRPO agents at the MATLAB command line or using the Reinforcement
Learning Designer app. For more information on creating agents using Reinforcement Learning
Designer, see “Create Agents Using Reinforcement Learning Designer” on page 3-9.
At the command line, you can create a TRPO agent with default actor and critic based on the
observation and action specifications from the environment. To do so, perform the following steps.
1 Create observation specifications for your environment. If you already have an environment
interface object, you can obtain these specifications using getObservationInfo.
2 Create action specifications for your environment. If you already have an environment interface
object, you can obtain these specifications using getActionInfo.
3 If needed, specify the number of neurons in each learnable layer. To do so, create an agent
initialization options object using rlAgentInitializationOptions.
4 Specify agent options using an rlTRPOAgentOptions object.
5 Create the agent using an rlTRPOAgent object.
Alternatively, you can create actor and critic and use these objects to create your agent. In this case,
ensure that the input and output dimensions of the actor and critic match the corresponding action
and observation specifications of the environment.
TRPO agents do not support actors and critics that use recurrent deep neural networks as function
approximators. TRPO agents also do not support deep neural networks that use a quadraticLayer.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
Here:
• DKL(θold,θ,Si) is the Kullback-Leibler (KL) divergence between the old policy π(A|Si;θold) and current
policy π(A|Si;θ). DKL measures how much the probability distributions of the old and new policies
differ. DKL is zero when the two distributions are identical.
• δ is the limit for DKL and controls how much the new policy can deviate from the old policy.
For agents with discrete action spaces, DKL is computed as follows, where P is the number of actions.
DKL(θold,θ,Si) = ∑k=1..P π(Ak|Si;θold) ln( π(Ak|Si;θold) / π(Ak|Si;θ) )
For agents with continuous action spaces, DKL is computed as follows.
DKL(θold,θ,Si) = (1/P) ∑k=1..P ( lnσθ,k − lnσθold,k + (σθold,k² + (μθold,k − μθ,k)²) / (2σθ,k²) − 0.5 )
Here:
• μθ,k and σθ,k are the mean and standard deviation for the kth action output by the current actor
policy π(Ak|Si;θ).
• μθold,k and σθold,k are the mean and standard deviation for the kth action output by the old policy
π(Ak|Si;θold).
To approximate this optimization problem, the TRPO agent uses a linear approximation of Lactor(θ) and
a quadratic approximation of DKL(θold,θ,Si). The approximations are computed by taking the Taylor
series expansion around θ.
θ = θold + α √( 2δ / (xT H−1 x) ) x
Here, x=H-1g and α is a coefficient for ensuring that the policy improves and satisfies the constraint.
Training Algorithm
TRPO agents use the following training algorithm. To configure the training algorithm, specify
options using an rlTRPOAgentOptions object.
Here, St is a state observation, At is an action taken from that state, St+1 is the next state, and Rt
+1 is the reward received for moving from St to St+1.
When in state St, the agent computes the probability of taking each action in the action space
using π(A|St;θ) and randomly selects action At based on the probability distribution.
ts is the starting time step of the current set of N experiences. At the beginning of the training
episode, ts = 1. For each subsequent set of N experiences in the same training episode, ts ← ts +
N.
For each experience sequence that does not contain a terminal state, N is equal to the
ExperienceHorizon option value. Otherwise, N is less than ExperienceHorizon and SN is
the terminal state.
4 For each episode step t = ts+1, ts+2, …, ts+N, compute the return and advantage function using
the method specified by the AdvantageEstimateMethod option.
• Finite Horizon (AdvantageEstimateMethod = "finite-horizon") — Compute the return Gt,
which is the sum of the reward for that step and the discounted future reward, and the
advantage function Dt.
Gt = ∑k=t..ts+N γ^(k−t) Rk + b γ^(N−t+1) V(Sts+N;ϕ)
Here, b is 0 if Sts+N is a terminal state and 1 otherwise. That is, if Sts+N is not a terminal state,
the discounted future reward includes the discounted state value function, computed using
the critic network V.
Dt = Gt − V(St;ϕ)
• Generalized Advantage Estimator (AdvantageEstimateMethod = "gae") — Compute
the advantage function Dt, which is the discounted sum of temporal difference errors [3].
Dt = ∑k=t..ts+N−1 (γλ)^(k−t) δk
δk = Rk + b γ V(Sk+1;ϕ) − V(Sk;ϕ)
Here, b is 0 if Sts+N is a terminal state and 1 otherwise. λ is a smoothing factor specified using
the GAEFactor option.
Gt = Dt + V(St;ϕ)
To specify the discount factor γ for either method, use the DiscountFactor option.
5 Learn from mini-batches of experiences over K epochs. To specify K, use the NumEpoch option.
For each learning epoch:
a Sample a random mini-batch data set of size M from the current set of experiences. To
specify M, use the MiniBatchSize option. Each element of the mini-batch data set contains
a current experience and the corresponding return and advantage function values.
b Update the critic parameters by minimizing the loss Lcritic across all sampled mini-batch data.
Lcritic(ϕ) = (1/M) ∑i=1..M (Gi − V(Si;ϕ))²
c Normalize the advantage values Di based on recent unnormalized advantage values.
• If the NormalizedAdvantageMethod option is 'current', normalize the advantage
values based on the unnormalized advantages in the current mini-batch.
ii Apply the conjugate gradient (CG) method to find an approximate solution to the
following equation, where H is the Hessian of the KL-divergence between the old and
new policies.
x ≈ H−1g
iii Perform a line search, updating the actor parameters as
θ = θold + α √( 2δ / (xT H−1 x) ) x
and searching for a value of α that both decreases the actor loss,
Lactor(θ) − Lactor(θold) < 0
and satisfies the KL-divergence constraint
(1/M) ∑i=1..M DKL(θold,θ,Si) ≤ δ
The candidate values are α ∈ {1, 1/2, 1/2², …, 1/2^(n−1)}.
Here, δ is the KL-divergence limit, which you set using the KLDivergenceLimit
option. n is the number of line search iterations, which you set using the
NumIterationsLineSearch option.
iv If a valid value of α exists, update the parameters of the actor network to θ. If a valid
value of α does not exist, do not update the actor parameters.
6 Repeat steps 3 through 5 until the training episode reaches a terminal state.
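For example, the following commands sketch how to set the options named in this algorithm; the numeric values are illustrative only.
trpoOpts = rlTRPOAgentOptions( ...
    ExperienceHorizon=512, ...
    MiniBatchSize=64, ...
    KLDivergenceLimit=0.01, ...        % delta
    NumIterationsLineSearch=10);       % n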
Entropy Loss
To promote agent exploration, you can add an entropy loss term wℋi(θ,Si) to the actor loss function,
where w is the entropy loss weight and ℋi(θ,Si) is the entropy.
The entropy value is higher when the agent is more uncertain about which action to take next.
Therefore, maximizing the entropy loss term (minimizing the negative entropy loss) increases the
agent uncertainty, thus encouraging exploration. To promote additional exploration, which can help
the agent move out of local optima, you can specify a larger entropy loss weight.
For a discrete action space, the agent uses the following entropy value. In this case, the actor outputs
the probability of taking each possible discrete action.
ℋi(θ,Si) = −∑k=1..P π(Ak|Si;θ) lnπ(Ak|Si;θ)
Here:
• P is the number of possible discrete actions.
• π(Ak|Si;θ) is the probability of taking action Ak when in state Si, following the current policy.
For a continuous action space, the agent uses the following entropy value. In this case, the actor
outputs the mean and standard deviation of the Gaussian distribution for each continuous action.
ℋi(θ,Si) = (1/2) ∑k=1..C ln(2π ⋅ e ⋅ σk,i²)
Here:
• C is the number of continuous actions output by the actor.
• σk,i is the standard deviation of the Gaussian distribution for the kth action when in state Si.
References
[1] Schulman, John, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. "Trust Region
Policy Optimization." Proceedings of the 32nd International Conference on Machine Learning,
pp. 1889-1897. 2015.
[2] Mnih, Volodymyr, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim
Harley, David Silver, and Koray Kavukcuoglu. “Asynchronous Methods for Deep
Reinforcement Learning.” ArXiv:1602.01783 [Cs], February 4, 2016. https://arxiv.org/abs/
1602.01783.
[3] Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. “High-
Dimensional Continuous Control Using Generalized Advantage Estimation.”
ArXiv:1506.02438 [Cs], October 20, 2018. https://arxiv.org/abs/1506.02438.
See Also
rlTRPOAgent | rlTRPOAgentOptions
More About
• “Reinforcement Learning Agents” on page 3-2
• “Create Policies and Value Functions” on page 4-2
• “Train Reinforcement Learning Agents” on page 5-3
Soft Actor-Critic Agents
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents” on page 3-2.
The implementation of the SAC agent in Reinforcement Learning Toolbox software uses two Q-value
function critics, which prevents overestimation of the value function. Other implementations of the
SAC algorithm use an additional value function critic.
SAC agents can be trained in environments with the following observation and action spaces.
Critics: Q-value function critics Q(S,A), which you create using rlQValueFunction.
Actor: Stochastic policy actor π(S), which you create using rlContinuousGaussianActor.
• Updates the actor and critic properties at regular intervals during learning.
• Estimates the mean and standard deviation of a Gaussian probability distribution for the
continuous action space, then randomly selects actions based on the distribution.
• Updates an entropy weight term that balances the expected return and the entropy of the policy.
• Stores past experience using a circular experience buffer. The agent updates the actor and critic
using a mini-batch of experiences randomly sampled from the buffer.
If the UseExplorationPolicy option of the agent is set to false, the action with maximum
likelihood is always used in sim and generatePolicyFunction. As a result, the simulated agent
and generated policy behave deterministically.
If the UseExplorationPolicy option is set to true, the agent selects its actions by sampling from its
probability distribution. As a result, the policy is stochastic and the agent explores its observation
space.
This option affects only simulation and deployment; it does not affect training.
• Stochastic actor π(A|S;θ) — The actor, with parameters θ, outputs the mean and standard
deviation of the conditional Gaussian probability of taking each continuous action A when in state S.
• One or two Q-value critics Qk(S,A;ϕk) — The critics, each with parameters ϕk, take observation S
and action A as inputs and return the corresponding expectation of the value function, which
includes both the long-term reward and entropy.
• One or two target critics Qtk(S,A;ϕtk) — To improve the stability of the optimization, the agent
periodically sets the target critic parameters ϕtk to the latest corresponding critic parameter
values. The number of target critics matches the number of critics.
When you use two critics, Q1(S,A;ϕ1) and Q2(S,A;ϕ2), each critic can have different structures. When
the critics have the same structure, they must have different initial parameter values.
Each critic Qk(S,A;ϕk) and corresponding target critic Qtk(S,A;ϕtk) must have the same structure and
parameterization.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
During training, the agent tunes the parameter values in θ. After training, the parameters remain at
their tuned value and the trained actor function approximator is stored in π(A|S).
Action Generation
The actor in a SAC agent generates mean and standard deviation outputs. To select an action, the
actor first randomly selects an unbounded action from a Gaussian distribution with these parameters.
During training, the SAC agent uses the unbounded probability distribution to compute the entropy of
the policy for the given observation.
If the action space of the SAC agent is bounded, the actor generates bounded actions by applying
tanh and scaling operations to the unbounded action.
Agent Creation
You can create and train SAC agents at the MATLAB command line or using the Reinforcement
Learning Designer app. For more information on creating agents using Reinforcement Learning
Designer, see “Create Agents Using Reinforcement Learning Designer” on page 3-9.
At the command line, you can create a SAC agent with default actor and critic based on the
observation and action specifications from the environment. To do so, perform the following steps.
1 Create observation specifications for your environment. If you already have an environment
interface object, you can obtain these specifications using getObservationInfo.
2 Create action specifications for your environment. If you already have an environment interface
object, you can obtain these specifications using getActionInfo.
3 If needed, specify the number of neurons in each learnable layer or whether to use a recurrent
neural network. To do so, create an agent initialization option object using
rlAgentInitializationOptions.
4 If needed, specify agent options using an rlSACAgentOptions object.
5 Create the agent using an rlSACAgent object.
Alternatively, you can create actor and critics and use these objects to create your agent. In this case,
ensure that the input and output dimensions of the actor and critic match the corresponding action
and observation specifications of the environment.
1 Create a stochastic actor using an rlContinuousGaussianActor object. For SAC agents, in
order to properly scale the mean values to the desired action range, the actor network must not
contain a tanhLayer and scalingLayer as last two layers in the output path for the mean
values. Similarly, in order to ensure non-negativity of the standard deviation values, the actor
network must not contain a reluLayer as a last layer in the output path for the standard
deviation values.
2 Create one or two critics using rlQValueFunction objects.
3 Specify agent options using an rlSACAgentOptions object.
4 Create the agent using an rlSACAgent object.
SAC agents do not support actors and critics that use recurrent deep neural networks as function
approximators.
For more information on creating actors and critics for function approximation, see “Create Policies
and Value Functions” on page 4-2.
Training Algorithm
SAC agents use the following training algorithm, in which they periodically update their actor and
critic models and entropy weight. To configure the training algorithm, specify options using an
rlSACAgentOptions object. Here, K = 2 is the number of critics and k is the critic index.
• Initialize each critic Qk(S,A;ϕk) with random parameter values ϕk, and initialize each target critic
with the same random parameter values: ϕtk = ϕk.
• Initialize the actor π(S;θ) with random parameter values θ.
• Perform a warm start by taking a sequence of actions following the initial random policy in π(S).
For each action, store the experience in the experience buffer. To specify the number of warm up
actions, use the NumWarmStartSteps option.
• For each training time step:
1 For the current observation S, select action A using the policy in π(S;θ).
2 Execute action A. Observe the reward R and next observation S'.
3 Store the experience (S,A,R,S') in the experience buffer.
4 Sample a random mini-batch of M experiences (Si,Ai,Ri,S'i) from the experience buffer. To
specify M, use the MiniBatchSize option.
5 Every DC time steps, update the parameters of each critic by minimizing the loss Lk across all
sampled experiences. To specify DC, use the CriticUpdateFrequency option.
Lk = (1/M) ∑i=1..M (yi − Qk(Si,Ai;ϕk))²
If S'i is a terminal state, the value function target yi is equal to the experience reward Ri.
Otherwise, the value function target is the sum of Ri, the minimum discounted future reward
from the critics, and the weighted entropy.
Here:
• A'i is the bounded action derived from the unbounded output of the actor π(S'i).
• γ is the discount factor, which you specify using the DiscountFactor option.
• −αlnπ S; θ is the weighted policy entropy for the bounded output of the actor when in
state S. α is the entropy loss weight, which you specify using the EntropyLossWeight
option.
6 Every DA time steps, update the actor parameters by minimizing the following objective
function. To set DA, use the PolicyUpdateFrequency option.
Jπ = (1/M) ∑i=1..M ( −min_k Qtk(Si,Ai;ϕtk) + α lnπ(Si;θ) )
7 Every DA time steps, also update the entropy weight by minimizing the following loss function.
Lα = (1/M) ∑i=1..M ( −α lnπ(Si;θ) − α ℋ )
Here, ℋ is the target entropy, which you specify using the EntropyWeightOptions.TargetEntropy
option.
• Smoothing — Update the target critic parameters at every time step using smoothing factor τ. To
specify the smoothing factor, use the TargetSmoothFactor option.
ϕtk = τϕk + (1 − τ)ϕtk
To configure the target update method, create an rlSACAgentOptions object, and set the
TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table.
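For example, the following commands sketch one way to set the SAC options referenced in this section; the numeric values are illustrative, and the entropy settings assume the EntropyWeightOptions property of rlSACAgentOptions.
sacOpts = rlSACAgentOptions( ...
    MiniBatchSize=128, ...
    TargetSmoothFactor=1e-3, ...
    TargetUpdateFrequency=1);
sacOpts.EntropyWeightOptions.EntropyWeight = 0.2;   % initial entropy weight (alpha)
sacOpts.EntropyWeightOptions.TargetEntropy = -2;    % target entropy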
References
[1] Haarnoja, Tuomas, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash
Kumar, et al. "Soft Actor-Critic Algorithms and Applications." Preprint, submitted January 29,
2019. https://arxiv.org/abs/1812.05905.
See Also
rlSACAgent | rlSACAgentOptions
More About
• “Reinforcement Learning Agents” on page 3-2
• “Create Policies and Value Functions” on page 4-2
• “Train Reinforcement Learning Agents” on page 5-3
Model-Based Policy Optimization Agents
The following figure shows the components and behavior of an MBPO agent. The agent samples real
experience data through environmental interaction and trains a model of the environment using this
experience. Then, the agent updates the policy parameters of its base agent using the real experience
data and experience generated from the environment model.
MBPO agents can be trained in environments with the following observation and action spaces.
You can use the following off-policy agents as the base agent in an MBPO agent.
MBPO agents use an environment model that you define using an rlNeuralNetworkEnvironment
object, which contains the following components. In general, these components use a deep neural
network to learn the environment behavior during training.
• One or more transition functions that predict the next observation based on the current
observation and action. You can define deterministic transition functions using
rlContinuousDeterministicTransitionFunction objects or stochastic transition functions
using rlContinuousGaussianTransitionFunction objects.
• A reward function that predicts the reward from the environment based on a combination of the
current observation, current action, and next observation. You can define a deterministic reward
function using an rlContinuousDeterministicRewardFunction object or a stochastic
reward function using an rlContinuousGaussianRewardFunction object. You can also define
a known reward function using a custom function.
• An is-done function that predicts the termination signal based on a combination of the current
observation, current action, and next observation. You can also define a known termination signal
using a custom function.
• Updates the environment model at the beginning of each episode by training the transition
functions, reward function, and is-done function
• Generates samples using the trained environment model and stores the samples in a circular
experience buffer
• Stores real samples from the interaction between the agent and the environment using a separate
circular experience buffer within the base agent
• Updates the actor and critic of the base agent using a mini-batch of experiences randomly
sampled from both the generated experience buffer and the real experience buffer
Training Algorithm
MBPO agents use the following training algorithm, in which they periodically update the environment
model and the base off-policy agent. To configure the training algorithm, specify options using an
rlMBPOAgentOptions object.
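For example, the following commands sketch setting the MBPO options named in this section and creating the agent; baseAgent and envModel are assumed to be a previously created off-policy agent and rlNeuralNetworkEnvironment object, and the numeric values are illustrative.
mbpoOpts = rlMBPOAgentOptions;
mbpoOpts.NumEpochForTrainingModel = 5;            % model-training epochs per episode
mbpoOpts.ModelRolloutOptions.NumRollout = 2000;   % roll-out trajectories per update
agent = rlMBPOAgent(baseAgent,envModel,mbpoOpts);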
a For each model-training epoch, perform the following steps. To specify the number of
epochs, use the NumEpochForTrainingModel option.
w0 = 1 / ∑i=1..M (1 − Ti),   w1 = 1 / ∑i=1..M Ti
Loss = −(1/M) ∑i=1..M ( w0 Ti lnYi + w1 (1 − Ti) ln(1 − Yi) )
Here, M is the mini-batch size, Ti is a target, and Yi is the output from the is-done
network for the ith sample in the batch. Ti = 1 when isdone is 1 and Ti = 0 when
isdone is 0.
b Generate samples using the trained environment model. The following figure shows an
example of two roll-out trajectories with a horizon of two.
i Increase the horizon based on the horizon update settings defined in the
ModelRolloutOptions object.
ii Randomly sample a batch of NR observations from the real experience buffer. To specify
NR, use the ModelRolloutOptions.NumRollout option.
iii For each horizon step:
• Sample Nreal = ⌈M·R⌉ samples from the real experience buffer. To specify R, use the
RealRatio option.
• Sample Nmodel = M − Nreal samples from the generated experience buffer.
b Train the base agent using the sampled mini-batch of data by following the update rule of the
base agent. For more information, see the corresponding SAC on page 3-57, TD3 on page 3-
35, DDPG on page 3-31, or DQN on page 3-23 training algorithm.
Tips
• MBPO agents can be more sample-efficient than model-free agents because the model can
generate large sets of diverse experiences. However, MBPO agents require much more
computational time than model-free agents, because they must train the environment model and
generate samples in addition to training the base agent.
• To overcome modeling uncertainty, best practice is to use multiple environment transition models.
• If they are available, it is best to use known ground-truth reward and is-done functions.
• It is better to generate a large number of trajectories (thousands or tens of thousands). Doing so
generates many samples, which reduces the likelihood of selecting the same sample multiple
times in a training episode.
• Since modeling errors can accumulate, it is better to use a shorter horizon when generating
samples. A shorter horizon is usually enough to generate diverse experiences.
• In general, an agent created using rlMBPOAgent is not suitable for environments with image
observations.
• When using a SAC base agent, taking more gradient steps (defined by the
NumGradientStepsPerUpdate SAC agent option) makes the MBPO agent more sample-efficient.
However, doing so increases the computational time.
• The MBPO implementation in rlMBPOAgent is based on the algorithm in the original MBPO paper
[1] but with the differences shown in the following table.
References
[1] Janner, Michael, Justin Fu, Marvin Zhang, and Sergey Levine. “When to Trust Your Model: Model-
Based Policy Optimization.” In Proceedings of the 33rd International Conference on Neural
Information Processing Systems, 12519–30. 1122. Red Hook, NY, USA: Curran Associates
Inc., 2019.
See Also
rlMBPOAgent | rlMBPOAgentOptions | rlNeuralNetworkEnvironment
Related Examples
• “Train MBPO Agent to Balance Cart-Pole System” on page 5-419
Create Custom Reinforcement Learning Agents
addpath(fullfile(matlabroot,'examples','rl','main'));
edit LQRCustomAgent.m
After saving the class to your own working folder, you can remove the example files from the path.
rmpath(fullfile(matlabroot,'examples','rl','main'));
This class has the following class definition, which indicates the agent class name and the associated
abstract agent.
classdef LQRCustomAgent < rl.agent.CustomAgent
• Agent properties
• Constructor function
• A critic that estimates the discounted long-term reward (if required for learning)
• An actor that selects an action based on the current observation (if required for learning)
• Required agent methods
• Optional agent methods
Agent Properties
In the properties section of the class file, specify any parameters necessary for creating and
training the agent. These parameters can include:
For more information on potential agent properties, see the option objects for the built-in
Reinforcement Learning Toolbox agents.
The rl.agent.CustomAgent class already includes properties for the agent sample time
(SampleTime) and the action and observation specifications (ActionInfo and ObservationInfo,
respectively).
properties
    % Q matrix of the LQR cost function
    Q
    % R matrix of the LQR cost function
    R
    % Feedback gain
    K
    % Discount factor
    Gamma = 0.95
    % Critic
    Critic
    % Buffer for K
    KBuffer
    % Number of updates for K
    KUpdate = 1
end
Constructor Function
To create your custom agent, you must define a constructor function that:
• Defines the action and observation specifications. For more information about creating these
specifications, see rlNumericSpec and rlFiniteSetSpec.
• Creates actor and critic as required by your training algorithm. For more information, see “Create
Policies and Value Functions” on page 4-2.
• Configures agent properties.
• Calls the constructor of the base abstract class.
For example, the LQRCustomAgent constructor defines continuous action and observation spaces
and creates a critic. The createCritic function is an optional helper function that defines the critic.
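The following sketch shows one possible constructor along these lines; it is a simplified illustration, not the shipped LQRCustomAgent code, and omits details such as experience buffer initialization.
function obj = LQRCustomAgent(Q,R,InitialK)
    % Call the constructor of the abstract base class
    obj = obj@rl.agent.CustomAgent();
    % Store the cost matrices and the initial feedback gain
    obj.Q = Q;
    obj.R = R;
    obj.K = InitialK;
    % Define continuous observation and action specifications
    obj.ObservationInfo = rlNumericSpec([size(Q,1) 1]);
    obj.ActionInfo = rlNumericSpec([size(R,1) 1]);
    % Create the critic using the helper function
    obj.Critic = createCritic(obj);
end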
For example, the custom LQR agent uses a critic, stored in its Critic property, and no actor. The
critic creation is implemented in the createCritic helper function, which is called from the
LQRCustomAgent constructor.
function critic = createCritic(obj)
    % Dimension of the quadratic basis
    nQ = size(obj.Q,1);
    nR = size(obj.R,1);
    n = nQ + nR;
    % Initial weights of the custom quadratic basis function
    w0 = 0.1*ones(0.5*(n+1)*n,1);
    % Q-value critic based on the custom basis function
    critic = rlQValueFunction({@(x,u) computeQuadraticBasis(x,u,n),w0}, ...
        getObservationInfo(obj),getActionInfo(obj));
    critic.Options.GradientThreshold = 1;
end
In this case, the critic is an rlQValueFunction object. To create this object, you must specify the
handle to a custom basis function, in this case the computeQuadraticBasis function. For more
information, see “Train Custom LQR Agent” on page 5-373.
Required Functions
To create a custom reinforcement learning agent you must define the following implementation
functions. To call these functions in your own code, use the wrapper methods from the abstract base
class. For example, to call getActionImpl, use getAction. The wrapper methods have the same
input and output arguments as the implementation methods.
Function Description
getActionImpl Selects an action by evaluating the agent policy for a given
observation
getActionWithExplorationImpl Selects an action using the exploration model of the agent
learnImpl Learns from the current experiences and returns an action
with exploration
Within your implementation functions, to evaluate your actor and critic, you can use the getValue,
getAction, and getMaxQValue functions.
• To evaluate an rlValueFunction critic, you need only the observation input, and you can obtain
the value of the current observation V using the following syntax.
V = getValue(Critic,Observation);
• To evaluate an rlQValueFunction critic you need both observation and action inputs, and you
can obtain the value of the current state-action Q using the following syntax.
Q = getValue(Critic,[Observation,Action]);
• To evaluate an rlVectorQValueFunction critic you need only the observation input, and you
can obtain the value of the current observation Q for all possible discrete actions using the
following syntax.
Q = getValue(Critic,Observation);
• For a discrete action space rlQValueFunction critic, obtain the maximum Q state-action value
function Q for all possible discrete actions using the following syntax.
[MaxQ,MaxActionIndex] = getMaxQValue(Critic,Observation);
• To evaluate an actor, obtain the action A using the following syntax.
A = getAction(Actor,Observation);
For each of these cases, if your actor or critic network uses a recurrent neural network, the functions
can also return the current values of the network state after obtaining the corresponding network
output.
getActionImpl Function
The getActionImpl function evaluates the policy of your agent and selects an action. This
function must have the following signature, where obj is the agent object, Observation is the
current observation, and action is the selected action.
function action = getActionImpl(obj,Observation)
For the custom LQR agent, you select an action by applying the u=-Kx control law.
function action = getActionImpl(obj,Observation)
% Given the current state of the system, return an action
action = -obj.K*Observation{:};
end
getActionWithExplorationImpl Function
This function must have the following signature, where obj is the agent object, Observation is the
current observation, and action is the selected action.
For the custom LQR agent, the getActionWithExplorationImpl function adds random white
noise to an action selected using the current agent policy.
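A minimal sketch of such a function is shown here; the noise level (0.1) is an arbitrary illustrative value, and the shipped example may differ in detail.
function action = getActionWithExplorationImpl(obj,Observation)
    % Greedy LQR action plus exploratory white noise
    action = -obj.K*Observation{:};
    action = action + 0.1*randn(size(action));
end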
learnImpl Function
The learnImpl function defines how the agent learns from the current experience. This function
implements the custom learning algorithm of your agent by updating the policy parameters and
selecting an action with exploration. This function must have the following signature, where obj is
the agent object, exp is the current agent experience, and action is the selected action.
For the custom LQR agent, the critic parameters are updated every N steps.
% Update the critic parameters using a least-squares solution
H_buf = obj.HBuffer;
y_buf = obj.YBuffer;
theta = (H_buf'*H_buf)\H_buf'*y_buf;
% Set the new parameter values in the critic
obj.Critic = setLearnableParameters(obj.Critic,{theta});
Optional Functions
Optionally, you can define how your agent is reset at the start of training by specifying a resetImpl
function with the following function signature, where obj is the agent object. Using this function, you
can set the agent into a known or random condition before training.
function resetImpl(obj)
Also, you can define any other helper functions in your custom agent class as required. For example,
the custom LQR agent defines a createCritic function for creating the critic and a getNewK
function that derives the feedback gain matrix from the trained critic parameters.
Q = [10,3,1;3,5,4;1,4,9];
R = 0.5*eye(3);
K0 = place(A,B,[0.4,0.8,0.5]);
agent = LQRCustomAgent(Q,R,K0);
After validating the environment object, you can use it to train a reinforcement learning agent. For an
example that trains the custom LQR agent, see “Train Custom LQR Agent” on page 5-373.
See Also
train
More About
• “Reinforcement Learning Agents” on page 3-2
• “Create Policies and Value Functions” on page 4-2
Create Policies and Value Functions
Reinforcement learning agents use parametrized policies and value functions, which are implemented
by function approximators called actors and critics, respectively. During training, an agent updates
the parameters of its actor and critic to maximize the expected cumulative long-term reward.
Before creating a non-default agent, you must create the actor and critic using approximation models
such as deep neural networks, linear basis functions, or lookup tables. The type of function
approximator and model you can use depends on the type of agent that you want to create.
You can also create policy objects from agents, actors, or critics. You can train these objects using
custom loops and deploy them in applications.
For more information on agents, see “Reinforcement Learning Agents” on page 3-2.
• V(S|θV) — Critics that estimate the expected cumulative long-term reward based on a given
observation S. You can create these critics using rlValueFunction.
• Q(S,A|θQ) — Critics that estimate the expected cumulative long-term reward for a given discrete
action A and a given observation S. You can create these critics using rlQValueFunction.
• Qi(S,Ai|θQ) — Multi-output critics that estimate the expected cumulative long-term reward for all
possible discrete actions Ai given the observation S. You can create these critics using
rlVectorQValueFunction.
• π(S|θπ) — Actors with a continuous action space that select an action deterministically based on a
given observation S. You can create these actors using rlContinuousDeterministicActor.
• π(S|θπ) — Actors that select an action stochastically (the action is sampled from a probability
distribution) based on a given observation S. You can create these actors using either
rlDiscreteCategoricalActor (for discrete action spaces) or rlContinuousGaussianActor
(for continuous action spaces).
Each approximator uses a set of parameters (θV, θQ, θπ), which are computed during the learning
process.
For systems with a limited number of discrete observations and discrete actions, you can store value
functions in a lookup table. For systems that have many discrete observations and actions, and for observation and action spaces that are continuous, storing a separate value for every observation and action is impractical. For such systems, you can represent your actors and critics using deep neural networks
or custom (linear in the parameters) basis functions.
The following table summarizes the way in which you can use the six approximator objects available
with Reinforcement Learning Toolbox software, depending on the action and observation spaces of
your environment, and on the approximation model and agent that you want to use.
You can configure the actor and critic optimization options using the rlOptimizerOptions object
within an agent option object.
Specifically, you can create an agent options object and set its CriticOptimizerOptions and
ActorOptimizerOptions properties to appropriate rlOptimizerOptions objects. Then you pass
the agent options object to the function that creates the agent.
Alternatively, you can create the agent and then use dot notation to access the optimization options
for the agent actor and critic, for example:
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 0.1;
For more information on agents, see “Reinforcement Learning Agents” on page 3-2.
Policy Objects
You can extract a policy object from an agent or create it from an actor or critic. You can then use
getAction to generate deterministic or stochastic actions from the policy, given an input
observation. Differently from function approximator objects like actors and critics, policy objects do
not have functions that you can use to easily calculate gradients with respect to parameters.
Therefore, policy objects are more tailored toward application deployment, rather than training. The
following table describes the available policy objects.
Each one of the stochastic policy objects has an option to enable deterministic behavior, thereby
disabling exploration. Except for rlEpsilonGreedyPolicy and rlAdditiveNoisePolicy, you
can use generatePolicyBlock and generatePolicyFunction to generate a Simulink block or a
function that evaluates the policy, returning an action, for a given observation input. You can then use
the generated function or block to generate code for application deployment. For more information,
see “Deploy Trained Reinforcement Learning Policies” on page 6-2.
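For example, a minimal sketch that builds an epsilon-greedy policy from an existing critic and evaluates it for a random observation; here, critic is assumed to be an rlVectorQValueFunction for a discrete action space, and the 4-by-1 observation size is an assumption.
policy = rlEpsilonGreedyPolicy(critic);   % stochastic exploration policy from the critic
obs = {rand(4,1)};                        % one observation, wrapped in a cell array
act = getAction(policy,obs);              % compute an action for this observation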
Table Models
Value function approximators (critics) based on lookup tables models are appropriate for
environments with a limited number of discrete observations and actions. You can create two types of
lookup tables:
To create a table based critic, first create a value table or Q-table using the rlTable function. Then
use the table object as input argument for either rlValueFunction or rlQValueFunction to
create the approximator object.
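For example, a minimal sketch of a table-based Q-value critic, assuming obsInfo and actInfo are rlFiniteSetSpec objects obtained from the environment:
qTable = rlTable(obsInfo,actInfo);                  % Q-table over all observation-action pairs
critic = rlQValueFunction(qTable,obsInfo,actInfo);  % table-based Q-value critic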
The dimensions of the network input and output layers for your actor and critic must match the
dimension of the corresponding environment observation and action channels, respectively. To obtain
the action and observation specifications from the environment env, use the getActionInfo and
getObservationInfo functions, respectively.
actInfo = getActionInfo(env);
obsInfo = getObservationInfo(env);
Access the Dimension property of each channel. For example, get the size of the first environment observation channel and of the first action channel:
actSize = actInfo(1).Dimension;
obsSize = obsInfo(1).Dimension;
In general, actSize and obsSize are row vectors whose elements are the lengths of the corresponding dimensions. For example, if the first observation channel is a 256-by-256 RGB image, obsSize is the vector [256 256 3]. To calculate the total number of elements of a channel, use prod. For example, assuming the environment has only one observation channel:
obsDimensions = prod(obsInfo.Dimension);
Networks for value function critics (such as the ones used in AC, PG, PPO or TRPO agents) must take
only observations as inputs and must have a single scalar output. For these networks, the dimensions
of the input layers must match the dimensions of the environment observation channels. For more
information, see rlValueFunction.
Networks for single-output Q-value function critics (such as the ones used in Q, DQN, SARSA, DDPG,
TD3, and SAC agents) must take both observations and actions as inputs, and must have a single
scalar output. For these networks, the dimensions of the input layers must match the dimensions of
the environment channels for both observations and actions. For more information, see
rlQValueFunction.
Networks for multi-output Q-value function critics (such as those used in Q, DQN, and SARSA agents)
take only observations as inputs and must have a single output layer with output size equal to the
number of possible discrete actions. For these networks the dimensions of the input layers must
match the dimensions of the environment observations channels. For more information, see
rlVectorQValueFunction.
For actor networks, the dimensions of the input layers must match the dimensions of the environment
observation channels and the dimension of the output layer must be as follows.
• Networks used in actors with a discrete action space (such as the ones in PG, AC, and PPO agents)
must have a single output layer with an output size equal to the number of possible discrete
actions. For more information, see rlDiscreteCategoricalActor.
• Networks used in deterministic actors with a continuous action space (such as the ones in DDPG
and TD3 agents) must have a single output layer with an output size matching the dimension of
the action space defined in the environment action specification. For more information, see
rlContinuousDeterministicActor.
• Networks used in stochastic actors with a continuous action space (such as the ones in PG, AC,
PPO, and SAC agents) must have two output layers, each with as many elements as the
dimension of the action space, as defined in the environment specification. One output layer must
produce the mean values (which must be scaled to the output range of the action), and the other
must produce the standard deviations of the actions (which must be non-negative). For more
information, see rlContinuousGaussianActor.
Deep neural networks consist of a series of interconnected layers. You can specify a deep neural
network as one of the following:
Note Among the different network objects, dlnetwork is preferred, since it has built-in validation
checks and supports automatic differentiation. If you pass another network object as an input
argument, it is internally converted to a dlnetwork object. However, best practice is to convert other
network objects to dlnetwork explicitly before using it to create a critic or an actor for a
reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
neural network object from the Deep Learning Toolbox. The resulting dlnet is the dlnetwork object
that you use for your critic or actor. This practice allows a greater level of insight and control for
cases in which the conversion is not straightforward and might require additional specifications.
Typically, you build your neural network by stacking together a number of layers in an array of Layer
objects, possibly adding these arrays to a layerGraph object, and then converting the final result to
a dlnetwork object.
For agents that need multiple input or output layers, you create an array of Layer objects for each
input path (observations or actions) and for each output path (estimated rewards or actions). You
then add these arrays to a layerGraph object and connect the paths together using the
connectLayers function.
You can also create your deep neural network using the Deep Network Designer app. For an
example, see “Create Agent Using Deep Network Designer and Train Using Image Observations” on
page 5-138.
The following table lists some common deep learning layers used in reinforcement learning
applications. For a full list of available layers, see “List of Deep Learning Layers”.
Layer Description
featureInputLayer Inputs feature data and applies normalization
imageInputLayer Inputs vectors and 2-D images and applies normalization.
sigmoidLayer Applies a sigmoid function to the input such that the output
is bounded in the interval (0,1).
tanhLayer Applies a hyperbolic tangent activation layer to the input.
reluLayer Sets any input values that are less than zero to zero.
fullyConnectedLayer Multiplies the input vector by a weight matrix, and adds a bias vector.
softmaxLayer Applies a softmax function layer to the input, normalizing it
to a probability distribution.
convolution2dLayer Applies sliding convolutional filters to the input.
additionLayer Adds the outputs of multiple layers together.
concatenationLayer Concatenates inputs along a specified dimension.
sequenceInputLayer Provides sequence input data to a network.
lstmLayer Applies a Long Short-Term Memory layer to the input.
Supported for DQN and PPO agents.
The bilstmLayer and batchNormalizationLayer layers are not supported for reinforcement
learning.
The Reinforcement Learning Toolbox software provides the following layers, which contain no
tunable parameters (that is, parameters that change during training).
Layer Description
scalingLayer Applies a linear scale and bias to an input array. This layer
is useful for scaling and shifting the outputs of nonlinear
layers, such as tanhLayer and sigmoidLayer.
quadraticLayer Creates a vector of quadratic monomials constructed from
the elements of the input array. This layer is useful when
you need an output that is some quadratic function of its
inputs, such as for an LQR controller.
softplusLayer Implements the softplus activation Y = log(1 + eX), which
ensures that the output is always positive. This function is a
smoothed version of the rectified linear unit (ReLU).
You can also create your own custom layers. For more information, see “Define Custom Deep
Learning Layers”.
When you create a deep neural network, it is good practice to specify names for the first layer of each
input path and the final layer of the output path.
The following code creates and connects the following input and output paths:
• An observation input path, observationPath, with the first layer named 'observation'.
• An action input path, actionPath, with the first layer named 'action'.
• An estimated value function output path, commonPath, which takes the outputs of
observationPath and actionPath as inputs. The final layer of this path is named 'output'.
observationPath = [
featureInputLayer(4,'Normalization','none','Name','observation')
fullyConnectedLayer(24,'Name','CriticObsFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(24,'Name','CriticObsFC2')];
actionPath = [
featureInputLayer(1,'Normalization','none','Name','action')
fullyConnectedLayer(24,'Name','CriticActFC1')];
commonPath = [
additionLayer(2,'Name','add')
reluLayer('Name','CriticCommonRelu')
fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(observationPath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticObsFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActFC1','add/in2');
criticNetwork = dlnetwork(criticNetwork);
For each observation and action input path, specify a featureInputLayer as the first layer in the path, with an input size equal to the number of elements of the corresponding environment channel.
You can view the structure of your deep neural network using the plot function.
plot(layerGraph(criticNetwork))
Determining the number, type, and size of layers for your deep neural network can be difficult and is
application dependent. However, the most critical component in deciding the characteristics of the
function approximator is whether it is able to approximate the optimal policy or discounted value
function for your application, that is, whether it has layers that can correctly learn the features of
your observation, action, and reward signals.
• For continuous action spaces, bound actions with a tanhLayer followed by a scalingLayer to
scale the action to desired values, if necessary.
• Deep dense networks with reluLayer layers can be fairly good at approximating many different
functions. Therefore, they are often a good first choice.
• Start with the smallest possible network that you think can approximate the optimal policy or
value function.
• When you approximate strong nonlinearities or systems with algebraic constraints, adding more
layers is often better than increasing the number of outputs per layer. In general, the ability of the
approximator to represent more complex (compositional) functions grows only polynomially in the
size of the layers, but grows exponentially with the number of layers. In other words, more layers
allow approximating more complex and nonlinear compositional functions, although this generally
requires more data and longer training times. Given a total number of neurons and comparable
approximation tasks, networks with fewer layers can require exponentially more units to
successfully approximate the same class of functions, and might fail to learn and generalize
correctly.
• For on-policy agents (the ones that learn only from experience collected while following the
current policy), such as AC and PG agents, parallel training works better if your networks are
large (for example, a network with two hidden layers with 32 nodes each, which has a few
hundred parameters). On-policy parallel updates assume each worker updates a different part of
the network, such as when they explore different areas of the observation space. If the network is
small, the worker updates can correlate with each other and make training unstable.
To create a critic from your deep neural network, use an rlValueFunction, rlQValueFunction or
(whenever possible) an rlVectorQValueFunction object. To create a deterministic actor for a
continuous action space from your deep neural network, use an
rlContinuousDeterministicActor object. To create a stochastic actor from your deep neural
network use either an rlDiscreteCategoricalActor or an rlContinuousGaussianActor
object. To configure the learning rate and optimization used by the actor or critic, use an optimizer
object within an agent option object.
For example, create a Q-value function object for the critic network criticNetwork. Then create the critic optimizer object criticOpts, specifying a learning rate of 0.02 and a gradient threshold of
1.
critic = rlQValueFunction(criticNetwork,obsInfo,actInfo,...
'ObservationInputNames',{'observation'},'ActionInputNames',{'action'});
criticOpts = rlOptimizerOptions('LearnRate',0.02,...
'GradientThreshold',1);
Then create an agent option object, and set the CriticOptimizerOptions property of the agent
option object to criticOpts. When finally you create the agent, pass the agent option object as a
last input argument to the agent constructor function.
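For example, a sketch of this workflow for a DDPG agent, assuming that an appropriate actor object already exists for the same environment:
agentOpts = rlDDPGAgentOptions('CriticOptimizerOptions',criticOpts);
agent = rlDDPGAgent(actor,critic,agentOpts);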
When you create your deep neural network and configure your actor or critic, consider using the
following approach as a starting point.
1 Start with the smallest possible network and a high learning rate (0.01). Train this initial
network to see if the agent converges quickly to a poor policy or acts in a random manner. If either of these issues occurs, resize the network by adding more layers or more outputs on each
layer. Your goal is to find a network structure that is just big enough, does not learn too fast, and
shows signs of learning (an improving trajectory of the reward graph) after an initial training
period.
2 Once you settle on a good network architecture, a low initial learning rate can allow you to see if
the agent is on the right track, and help you check that your network architecture is satisfactory
for the problem. A low learning rate makes tuning parameters easier, especially for difficult
problems.
Also, consider the following tips when configuring your deep neural network agent.
• Be patient with DDPG and DQN agents, since they might not learn anything for some time during
the early episodes, and they typically show a dip in cumulative reward early in the training
process. Eventually, they can show signs of learning after the first few thousand episodes.
• For DDPG and DQN agents, promoting exploration of the agent is critical.
• For agents with both actor and critic networks, set the initial learning rates of both actor and
critic to the same value. However, for some problems, setting the critic learning rate to a higher
value than that of the actor can improve learning results.
When creating actors or critics for use with any agent except Q and SARSA, you can use recurrent
neural networks (RNN). These networks are deep neural networks with a sequenceInputLayer
input layer and at least one layer that has hidden state information, such as an lstmLayer. They can
be especially useful when the environment has states that cannot be included in the observation
vector.
For agents that have both actor and critic, you must either use an RNN for both of them, or not use
an RNN for any of them. You cannot use an RNN only for the critic or only for the actor.
When using PG agents, the learning trajectory length for the RNN is the whole episode. For an AC
agent, the NumStepsToLookAhead property of its options object is treated as the training trajectory
length. For a PPO agent, the trajectory length is the MiniBatchSize property of its options object.
For DQN, DDPG, SAC, and TD3 agents, you must specify the training trajectory length as an
integer greater than one in the SequenceLength property of their options object.
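For example, a minimal sketch of a recurrent vector Q-value critic for a DQN agent; the observation size, number of hidden units, number of actions, and sequence length are illustrative assumptions, and obsInfo and actInfo come from the environment.
rnnLayers = [
    sequenceInputLayer(4,'Name','obsIn')        % assumed 4-element observation channel
    lstmLayer(8,'Name','lstm')                  % layer with hidden state
    fullyConnectedLayer(2,'Name','qOut')];      % assumed two possible discrete actions
criticNet = dlnetwork(layerGraph(rnnLayers));
critic = rlVectorQValueFunction(criticNet,obsInfo,actInfo);
agentOpts = rlDQNAgentOptions('SequenceLength',20);   % training trajectory length for the RNN
agent = rlDQNAgent(critic,agentOpts);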
Note that code generation is not supported for continuous action space PG, AC, PPO and TRPO
agents, and SAC agents using a recurrent neural network (RNN), or for any agent having multiple
input paths and containing an RNN in any of the paths.
For more information and examples on policies and value functions, see rlValueFunction,
rlQValueFunction, rlVectorQValueFunction, rlContinuousDeterministicActor,
rlDiscreteCategoricalActor, and rlContinuousGaussianActor.
Custom Basis Function Models
You can also create actors and critics based on a custom (linear in the parameters) basis function. For these approximators, the output is f = W'B, where B is the column vector returned by your custom basis function for a given input (the observation, and possibly the action), and W is an array of learnable parameters.
For value function critics (such as the ones used in AC, PG, or PPO agents), f is a scalar value, so W must be a column vector with the same length as B, and B must be a function of the observation. For more information and examples, see rlValueFunction.
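For example, a minimal sketch of a value function critic based on a custom quadratic basis; the basis function, its assumed 2-element observation input, and the initial weights are illustrative.
% Custom basis: quadratic monomials of a 2-element observation vector
myBasisFcn = @(obs) [obs(1); obs(2); obs(1)^2; obs(2)^2; obs(1)*obs(2)];
W0 = zeros(5,1);                                  % initial weights, same length as the basis output
critic = rlValueFunction({myBasisFcn,W0},obsInfo);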
For single-output Q-value function critics, (such as the ones used in Q, DQN, SARSA, DDPG, TD3, and
SAC agents), f is a scalar value, so W must be a column vector with the same length as B, and B must
be a function of both the observation and action. For more information and examples, see
rlQValueFunction.
For multi-output Q-value function critics with discrete action spaces, (such as those used in Q, DQN,
and SARSA agents), f is a vector with as many elements as the number of possible actions. Therefore
W must be a matrix with as many columns as the number of possible actions and as many rows as the
length of B. B must be only a function of the observation. For more information and examples, see
rlVectorQValueFunction.
• For deterministic actors with a continuous action space (such as the ones in DDPG, and TD3
agents), the dimensions of f must match the dimensions of the agent action specification, which is
either a scalar or a column vector. For more information and examples, see
rlContinuousDeterministicActor.
• For stochastic actors with a discrete action space (such as the ones in PG, AC, and PPO agents), f
must be column vector with length equal to the number of possible discrete actions. The output of
the actor is softmax(f), which represents the probability of selecting each possible action. For
more information and examples, see rlDiscreteCategoricalActor.
• Stochastic actors with continuous action spaces cannot rely on custom basis functions (they
can only use neural network approximators, due to the need to enforce positivity for the standard
deviations). For more information and examples, see rlContinuousGaussianActor.
For any actor, W must have as many columns as the number of elements in f, and as many rows as the
number of elements in B. B must be only a function of the observation.
For an example that trains a custom agent that uses a linear basis function, see “Train Custom LQR
Agent” on page 5-373.
Create an Agent
Once you create your actor and critic, you can create a reinforcement learning agent that uses them.
For example, create a PG agent using a given actor and critic (baseline) network.
agentOpts = rlPGAgentOptions('UseBaseline',true);
agent = rlPGAgent(actor,baseline,agentOpts);
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents” on page 3-2.
You can obtain the actor and critic from an existing agent using getActor and getCritic,
respectively.
You can also set the actor and critic of an existing agent using setActor and setCritic,
respectively. The input and output layers of the actor and critic must match the observation and
action specifications of the original agent.
See Also
More About
• “Reinforcement Learning Agents” on page 3-2
• “Import Neural Network Models” on page 4-15
Import Neural Network Models
If you have an existing deep neural network architecture, you can import it and use it to create an actor or critic. You can import the following types of networks:
• Open Neural Network Exchange (ONNX) models, which require the Deep Learning Toolbox
Converter for ONNX Model Format support package software. For more information, see importONNXLayers.
• TensorFlow-Keras networks, which require Deep Learning Toolbox Converter for TensorFlow
Models support package software. For more information, see importKerasLayers.
• Caffe convolutional networks, which require Deep Learning Toolbox Importer for Caffe Models
support package software. For more information, see importCaffeLayers.
After you import a deep neural network, you can create an actor or critic object, such as
rlQValueFunction or rlDiscreteCategoricalActor.
When you import deep neural network architectures, consider the following.
• The dimensions of the imported network architecture input and output layers must match the
dimensions of the corresponding action, observation, or reward dimensions for your environment.
• After importing the network architecture, you must set the names of the input and output layers to
match the names of the corresponding action and observation specifications.
For more information on the deep neural network architectures supported for reinforcement
learning, see “Create Policies and Value Functions” on page 4-2.
For example, assume that you have the following network architectures to import:
• A deep neural network architecture for the critic with a 50-by-50 image input layer and a scalar output layer, which is saved in the ONNX format (criticNetwork.onnx).
• A deep neural network architecture for the actor with a 50-by-50 image input layer and a scalar output layer, which is saved in the ONNX format (actorNetwork.onnx).
To import the critic and actor networks, use the importONNXLayers function without specifying an
output layer.
criticNetwork = importONNXLayers('criticNetwork.onnx');
actorNetwork = importONNXLayers('actorNetwork.onnx');
These commands generate a warning, which states that the network is trainable until an output layer
is added. When you use an imported network to create an actor or critic, Reinforcement Learning
Toolbox software automatically adds an output layer for you.
After you import the networks, create the actor and critic function approximators. To do so, first
obtain the observation and action specifications from the environment.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create the critic, specifying the name of the input layer of the critic network as the observation name. Since the critic network has a single observation input and a single scalar output (and no action input), use a value-function critic.
critic = rlValueFunction(criticNetwork,obsInfo,...
'ObservationInputNames',{criticNetwork.Layers(1).Name});
Create the actor, specifying the name of the input layer of the actor network as the observation name. Since the actor network has a single scalar output, use a continuous deterministic actor.
actor = rlContinuousDeterministicActor(actorNetwork,obsInfo,actInfo,...
'ObservationInputNames',{actorNetwork.Layers(1).Name});
You can then do either of the following:
• Create an agent using this actor and critic. For more information, see “Reinforcement Learning
Agents” on page 3-2.
• Set the actor and critic in an existing agent using setActor and setCritic, respectively.
See Also
More About
• “Create Policies and Value Functions” on page 4-2
• “Reinforcement Learning Agents” on page 3-2
Train Reinforcement Learning Agents
For example, create a training option set that runs at most 1000 episodes, with at most 1000 steps per episode, and stops training when the average reward reaches 480. Then train the agent.
opt = rlTrainingOptions(...
'MaxEpisodes',1000,...
'MaxStepsPerEpisode',1000,...
'StopTrainingCriteria',"AverageReward",...
'StopTrainingValue',480);
trainResults = train(agent,env,opt);
If env is a multi-agent environment created with rlSimulinkEnv, specify the agent argument as an
array. The order of the agents in the array must match the agent order used to create env. Multi-
agent training is not supported for MATLAB environments.
For more information on creating agents, see “Reinforcement Learning Agents” on page 3-2. For
more information on creating environments, see “Create MATLAB Reinforcement Learning
Environments” on page 2-2 and “Create Simulink Reinforcement Learning Environments” on page 2-8.
Note train updates the agent as training progresses. This is possible because each agent is a handle object. To preserve the original agent parameters for later use, save the agent to a MAT-file:
save("initialAgent.mat","agent")
If you copy the agent into a new variable, the new variable will also always point to the most recent
agent version with updated parameters. For more information about handle objects, see “Handle
Object Behavior”.
Training terminates automatically when the conditions you specify in the StopTrainingCriteria
and StopTrainingValue options of your rlTrainingOptions object are satisfied. You can also
terminate training before any termination condition is reached by clicking Stop Training in the
Reinforcement Learning Episode Manager.
When training terminates the training statistics and results are stored in the trainResults object.
Because train updates the agent at the end of each episode, and because trainResults stores the
last training results, along with data to correctly recreate the training scenario and update the
episode manager, you can later resume training from the exact point at which it stopped. To do so, at
the command line, type:
trainResults = train(agent,env,trainResults);
This starts the training from the last values of the agent parameters and training results object
obtained after the previous train call.
The trainResults object contains, as one of its properties, the rlTrainingOptions object opt
specifying the training option set. Therefore, to restart the training with updated training options,
first change the training options in trainResults using dot notation. If the maximum number of
episodes was already reached in the previous training session, you must increase the maximum
number of episodes.
For example, disable displaying the training progress on Episode Manager, enable the Verbose
option to display training progress at the command line, change the maximum number of episodes to
2000, and then restart the training, returning a new trainResults object as output.
trainResults.TrainingOptions.MaxEpisodes = 2000;
trainResults.TrainingOptions.Plots = "none";
trainResults.TrainingOptions.Verbose = 1;
trainResultsNew = train(agent,env,trainResults);
Note When training terminates, each agent reflects its state at the end of the final training episode. The rewards obtained by the final agents are not necessarily the highest achieved
during the training process, due to continuous exploration. To save agents during training, create an
rlTrainingOptions object specifying the SaveAgentCriteria and SaveAgentValue properties and
pass it to train as a trainOpts argument.
Training Algorithm
In general, training performs the following steps.
1 Initialize the agent.
2 For each episode:
a Reset the environment.
b Get the initial observation s0 from the environment.
c Compute the initial action a0 = μ(s0), where μ(s) is the current policy.
d Set the current action to the initial action (a←a0), and set the current observation to the
initial observation (s←s0).
e While the episode is not finished or terminated, perform the following steps.
i Apply action a to the environment and obtain the next observation s' and the reward r.
ii Learn from the experience set (s,a,r,s').
iii Compute the next action a' = μ(s').
iv Update the current action with the next action (a←a') and update the current
observation with the next observation (s←s').
v Terminate the episode if the termination conditions defined in the environment are met.
3 If the training termination condition is met, terminate training. Otherwise, begin the next
episode.
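The following sketch illustrates this loop for a MATLAB environment object and an agent in the workspace. It is a conceptual outline only; train performs the full algorithm, including the learning update, internally.
% Conceptual outline of one episode (not the actual internals of train)
obs = reset(env);                                 % reset and get initial observation
act = getAction(agent,{obs});                     % initial action from the current policy
isDone = false;
while ~isDone
    [nextObs,reward,isDone] = step(env,act{1});   % apply the action to the environment
    % learning from the experience (s,a,r,s') is handled by train
    act = getAction(agent,{nextObs});             % compute the next action
    obs = nextObs;                                % advance the current observation
end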
The specifics of how the software performs these steps depend on the configuration of the agent and
environment. For instance, resetting the environment at the start of each episode can include
randomizing initial state values, if you configure your environment to do so. For more information on
agents and their training algorithms, see “Reinforcement Learning Agents” on page 3-2. To use
parallel processing and GPUs to speed up training, see “Train Agents Using Parallel Computing and
GPUs” on page 5-8.
Episode Manager
By default, calling the train function opens the Reinforcement Learning Episode Manager, which
lets you visualize the training progress.
The Episode Manager plot shows the reward for each episode (EpisodeReward) and a running
average reward value (AverageReward).
For agents with a critic, Episode Q0 is the estimate of the discounted long-term reward at the start
of each episode, given the initial observation of the environment. As training progresses, if the critic
is well designed and learns successfully, Episode Q0 approaches on average the true discounted long-term reward, which may be offset from the EpisodeReward value because of discounting. For a well-designed critic using an undiscounted reward (DiscountFactor equal to 1), Episode Q0 approaches on average the true episode reward, as shown in the preceding figure.
The Episode Manager also displays various episode and training statistics. You can also use the
train function to return episode and training information. To turn off the Reinforcement Learning
Episode Manager, set the Plots option of rlTrainingOptions to "none".
train stores saved agents in a MAT-file in the folder you specify using the SaveAgentDirectory
option of rlTrainingOptions. Saved agents can be useful, for instance, to test candidate agents
generated during a long-running training process. For details about saving criteria and saving
location, see rlTrainingOptions.
After training is complete, you can save the final trained agent from the MATLAB workspace using
the save function. For example, save the trained agent to the file finalAgent.mat in the folder specified by the SaveAgentDirectory training option.
save(opt.SaveAgentDirectory + "/finalAgent.mat",'agent')
By default, when DDPG and DQN agents are saved, the experience buffer data is not saved. If you
plan to further train your saved agent, you can start training with the previous experience buffer as a
starting point. In this case, set the SaveExperienceBufferWithAgent option to true. For some
agents, such as those with large experience buffers and image-based observations, the memory
required for saving the experience buffer is large. In these cases, you must ensure that enough
memory is available for the saved agents.
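For example, assuming a DQN agent object named agent, you can enable this option using dot notation.
agent.AgentOptions.SaveExperienceBufferWithAgent = true;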
When validating your agent, consider checking how your agent handles the following:
• Changes to simulation initial conditions — To change the model initial conditions, modify the reset
function for the environment. For example reset functions, see “Create MATLAB Environment
Using Custom Functions” on page 2-41, “Create Custom MATLAB Environment from Template” on
page 2-48, and “Create Simulink Reinforcement Learning Environments” on page 2-8.
• Mismatches between the training and simulation environment dynamics — To check such
mismatches, create test environments in the same way that you created the training environment,
modifying the environment behavior.
As with parallel training, if you have Parallel Computing Toolbox software, you can run multiple
parallel simulations on multicore computers. If you have MATLAB Parallel Server software, you can
run multiple parallel simulations on computer clusters or cloud resources. For more information on
configuring your simulation to use parallel computing, see UseParallel and
ParallelizationOptions in rlSimulationOptions.
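For example, a sketch that runs ten parallel simulations of at most 500 steps each (the episode and step counts are illustrative):
simOpts = rlSimulationOptions('NumSimulations',10,'MaxSteps',500,'UseParallel',true);
experiences = sim(env,agent,simOpts);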
Environment Visualization
If your training environment implements the plot method, you can visualize the environment
behavior during training and simulation. If you call plot(env) before training or simulation, where
env is your environment object, then the visualization updates during training to allow you to
visualize the progress of each episode or simulation.
Environment visualization is not supported when training or simulating your agent using parallel
computing.
For custom environments, you must implement your own plot method. For more information on creating a custom environment with a plot function, see “Create Custom MATLAB Environment
from Template” on page 2-48.
See Also
train
More About
• “Reinforcement Learning Agents” on page 3-2
Train Agents Using Parallel Computing and GPUs
Note that parallel training and simulation of agents using recurrent neural networks, or agents within
multi-agent environments, is not supported.
Regardless of which devices you use to simulate or train the agent, once the agent has been
trained, you can generate code to deploy the optimal policy on a CPU or GPU. This is explained in
more detail in “Deploy Trained Reinforcement Learning Policies” on page 6-2.
To explicitly start a parallel pool with N workers, use the parpool function.
pool = parpool(N);
If you do not create a parallel pool using parpool, the train function automatically creates one
using your default parallel pool preferences. For more information on specifying these preferences,
see “Specify Your Parallel Preferences” (Parallel Computing Toolbox). Note that using a parallel pool
of thread workers, such as pool = parpool("threads"), is not supported.
To train an agent using multiple processes you must pass to the train function an
rlTrainingOptions object in which UseParallel is set to true.
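For example, a minimal sketch that enables experience-based parallelization with asynchronous workers, assuming a supported off-policy agent:
trainOpts = rlTrainingOptions('UseParallel',true);
trainOpts.ParallelizationOptions.Mode = "async";
trainResults = train(agent,env,trainOpts);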
For more information on configuring your training to use parallel computing, see the UseParallel
and ParallelizationOptions options in rlTrainingOptions. For an example on how to
configure options for asynchronous advantage actor-critic (A3C) agent training, see the last example
in rlTrainingOptions.
For an example that trains an agent using parallel computing in MATLAB, see “Train AC Agent to
Balance Cart-Pole System Using Parallel Computing” on page 5-151. For an example that trains an
agent using parallel computing in Simulink, see “Train DQN Agent for Lane Keeping Assist Using
Parallel Computing” on page 5-227 and “Train Biped Robot to Walk Using Reinforcement Learning
Agents” on page 5-235.
When training AC and PG agents in parallel, both the environment simulation and gradient
computations are done by the workers. Specifically, workers simulate the agent against the
environment, compute the gradients from experiences, and send the gradients to the client. The
client averages the gradients, updates the network parameters and sends the updated parameters
back to the workers so they can continue simulating the agent with the new parameters.
This type of parallel training is also known as gradient-based parallelization, and allows you to
achieve, in principle, a speed improvement which is nearly linear in the number of workers. However,
this option requires synchronous training (that is, the Mode property of the ParallelizationOptions option within the rlTrainingOptions object that you pass to the train function must be set to "sync"). This means that each worker must
pause execution until all workers are finished, and as a result the training only advances as fast as
the slowest worker allows.
When training DQN, DDPG, PPO, TD3, and SAC agents in parallel, the environment simulation is done
by the workers and the learning is done by the client. Specifically, the workers simulate the agent
against the environment, and send experience data (observation, action, reward, next observation,
and a termination signal) to the client. The client then computes the gradients from experiences,
updates the network parameters and sends the updated parameters back to the workers, which
continue to simulate the agent with the new parameters.
This type of parallel training is also known as experience-based parallelization, and can run using
asynchronous training (that is, the Mode property of the ParallelizationOptions option within the rlTrainingOptions object that you pass to the train function can be set to "async").
Experience-based parallelization can reduce training time only when the computational cost of
simulating the environment is high compared to the cost of optimizing network parameters.
Otherwise, when the environment simulation is fast enough, the workers lie idle waiting for the client
to learn and send back the updated parameters.
To sum up, experience-based parallelization can improve sample efficiency (defined as the number
of samples an agent can process within a given time) only when the ratio R between the environment
step complexity and the learning complexity is large. If both environment simulation and learning are
similarly computationally expensive, experience-based parallelization is unlikely to improve sample
efficiency. However, in this case, for off-policy agents that are supported in parallel (DQN, DDPG, TD3,
and SAC) you can reduce the mini-batch size to make R larger, thereby improving sample efficiency.
To enforce contiguity in the experience buffer when training DQN, DDPG, TD3, or SAC agents in
parallel, set the NumStepsToLookAhead property of the corresponding agent option object to 1. A
different value causes an error when parallel training is attempted.
Using GPUs
You can speed up training by performing representation operations (such as gradient computation
and prediction), on a local GPU rather than a CPU. To do so, when creating a critic or actor, set its
UseDevice option to "gpu" instead of "cpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA®
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
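For example, assuming a critic object critic already exists, set its UseDevice property using dot notation.
critic.UseDevice = "gpu";   % run gradient computation and prediction on the local GPU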
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Using GPUs is likely to be beneficial when you have a deep neural network in the actor or critic which
has large batch sizes or needs to perform operations such as multiple convolutional layers on input
images.
For an example on how to train an agent using the GPU, see “Train DDPG Agent to Swing Up and
Balance Pendulum with Image Observation” on page 5-130.
For gradient-based parallelization, (which must run in synchronous mode) the environment
simulation is done by the workers, which use their local GPU to calculate the gradients and perform a
prediction step. The gradients are then sent back to the parallel pool client process which calculates
the averages, updates the network parameters and sends them back to the workers so they continue
to simulate the agent, with the new parameters, against the environment.
For experience-based parallelization, (which can run in asynchronous mode), the workers simulate
the agent against the environment, and send experiences data back to the parallel pool client. The
client then uses its local GPU to compute the gradients from the experiences, then updates the
network parameters and sends the updated parameters back to the workers, which continue to
simulate the agent, with the new parameters, against the environment.
Note that when using both parallel processing and GPU to train PPO agents, the workers use their
local GPU to compute the advantages, and then send processed experience trajectories (which
include advantages, targets and action probabilities) back to the client.
See Also
train | rlTrainingOptions | rlRepresentationOptions
Related Examples
• “Train AC Agent to Balance Cart-Pole System Using Parallel Computing” on page 5-151
• “Train DQN Agent for Lane Keeping Assist Using Parallel Computing” on page 5-227
• “Train Biped Robot to Walk Using Reinforcement Learning Agents” on page 5-235
Design and Train Agent Using Reinforcement Learning Designer
To open the Reinforcement Learning Designer app, at the MATLAB command line, enter:
reinforcementLearningDesigner
When using the Reinforcement Learning Designer, you can import an environment from the
MATLAB workspace or create a predefined environment. For more information, see “Create MATLAB
Environments for Reinforcement Learning Designer” on page 2-5 and “Create Simulink Environments
for Reinforcement Learning Designer” on page 2-11.
For this example, use the predefined discrete cart-pole MATLAB environment. To import this
environment, on the Reinforcement Learning tab, in the Environments section, select New >
Discrete Cart-Pole.
In the Environments pane, the app adds the imported Discrete CartPole environment. To
rename the environment, click the environment text. You can also import multiple environments in
the session.
To view the dimensions of the observation and action space, click the environment text. The app
shows the dimensions in the Preview pane.
This environment has a continuous four-dimensional observation space (the positions and velocities of
both the cart and pole) and a discrete one-dimensional action space consisting of two possible forces,
–10N or 10N. This environment is used in the “Train DQN Agent to Balance Cart-Pole System” on
page 5-50 example. For more information on predefined control system environments, see “Load
Predefined Control System Environments” on page 2-23.
To create an agent, on the Reinforcement Learning tab, in the Agent section, click New. In the
Create agent dialog box, specify the agent name, the environment, and the training algorithm. The
default agent configuration uses the imported environment and the DQN algorithm. For this example,
change the number of hidden units from 256 to 24. For more information on creating agents, see
“Create Agents Using Reinforcement Learning Designer” on page 3-9.
Click OK.
The app adds the new agent to the Agents pane and opens a corresponding agent1 document.
For a brief summary of DQN agent features and to view the observation and action specifications for
the agent, click Overview.
When you create a DQN agent in Reinforcement Learning Designer, the agent uses a default deep
neural network structure for its critic. To view the critic network, on the DQN Agent tab, click View
Critic Model.
The Deep Learning Network Analyzer opens and displays the critic structure.
Train Agent
To train your agent, on the Train tab, first specify options for training the agent. For information on specifying training options, see “Specify Training Options in Reinforcement Learning Designer” on page 5-16.
For this example, specify the maximum number of training episodes by setting Max Episodes to
1000. For the other training options, use their default values. The default criterion for stopping is
when the average number of steps per episode (over the last 5 episodes) is greater than 500.
During training, the app opens the Training Session tab and displays the training progress in the
Training Results document.
Here, the training stops when the average number of steps per episode is 500. Clear the Show Episode Q0 option to better visualize the episode and average rewards.
To accept the training results, on the Training Session tab, click Accept. In the Agents pane, the
app adds the trained agent, agent1_Trained.
To simulate the trained agent, on the Simulate tab, first select agent1_Trained in the Agent drop-
down list, then configure the simulation options. For this example, use the default number of episodes
(10) and maximum episode length (500). For more information on specifying simulation options, see
“Specify Simulation Options in Reinforcement Learning Designer” on page 5-21.
The app opens the Simulation Session tab. After the simulation is completed, the Simulation
Results document shows the reward for each episode as well as the reward mean and standard
deviation.
In the Simulation Data Inspector you can view the saved signals for each simulation episode. For
more information, see Simulation Data Inspector (Simulink).
The following image shows the first and third states of the cart-pole system (cart position and pole
angle) for the sixth simulation episode. The agent is able to successfully balance the pole for 500
steps, even though the cart position undergoes moderate swings. You can modify some DQN agent
options such as BatchSize and TargetUpdateFrequency to promote faster and more robust
learning. For more information, see “Train DQN Agent to Balance Cart-Pole System” on page 5-50.
To accept the simulation results, on the Simulation Session tab, click Accept.
In the Results pane, the app adds the simulation results structure, experience1.
To export the trained agent to the MATLAB workspace for additional simulation, on the
Reinforcement Learning tab, under Export, select the trained agent.
To save the app session, on the Reinforcement Learning tab, click Save Session. In the future, to
resume your work where you left off, you can open the session in Reinforcement Learning
Designer.
To simulate the agent at the MATLAB command line, first load the cart-pole environment.
env = rlPredefinedEnv("CartPole-Discrete");
The cart-pole environment has an environment visualizer that allows you to see how the system
behaves during simulation and training.
Plot the environment and perform a simulation using the trained agent that you previously exported
from the app.
plot(env)
xpr2 = sim(env,agent1_Trained);
During the simulation, the visualizer shows the movement of the cart and pole. The trained agent is
able to stabilize the system.
sum(xpr2.Reward)
ans =
500
See Also
Reinforcement Learning Designer | analyzeNetwork
Related Examples
• “Create MATLAB Environments for Reinforcement Learning Designer” on page 2-5
• “Create Simulink Environments for Reinforcement Learning Designer” on page 2-11
Specify Training Options in Reinforcement Learning Designer
To configure the training of an agent in the Reinforcement Learning Designer app, specify
training options on the Train tab.
Option Description
Max Episodes Maximum number of episodes to train the agent, specified as a positive
integer.
Max Episode Length Maximum number of steps to run per episode, specified as a positive
integer.
Stopping Criteria Training termination condition, specified as one of the following values.
In the More Training Options dialog box, you can specify the following options.
Option Description
Save agent criteria Condition for saving agents during training, specified as one of the
following values.
To train your agent using parallel computing, on the Train tab, click Use Parallel. Training agents using
parallel computing requires Parallel Computing Toolbox software. For more information, see “Train
Agents Using Parallel Computing and GPUs” on page 5-8.
To specify options for parallel training, select Use Parallel > Parallel training options.
In the Parallel Training Options dialog box, you can specify the following training options.
Option Description
Parallel computing mode Parallel computing mode, specified as one of the following values.
• sync — Use parpool to run synchronous training on the available
workers. The parallel pool client (the process that starts the training)
updates the parameters of its actor and critic, based on the results
from all the workers, and sends the updated parameters to all
workers. In this case, workers must pause execution until all workers
are finished, and as a result the training only advances as fast as the
slowest worker allows.
• async — Use parpool to run asynchronous training on the available
workers. In this case, workers send their data back to the client as
soon as they finish and receive updated parameters from the client.
The workers then continue with their task.
Transfer workspace variables to workers Select this option to send model and workspace variables to parallel workers. When you select this option, the parallel pool client (the process that starts the training) sends variables used in models and defined in the MATLAB workspace to the workers.
Option Description
Random seed for workers Randomizer initialization for workers, specified as one of the following values.
The following figure shows an example parallel training configuration that uses the following files and functions.
See Also
Reinforcement Learning Designer
Related Examples
• “Design and Train Agent Using Reinforcement Learning Designer” on page 5-12
• “Specify Simulation Options in Reinforcement Learning Designer” on page 5-21
• “Create Agents Using Reinforcement Learning Designer” on page 3-9
Specify Simulation Options in Reinforcement Learning Designer
To configure the simulation of an agent in the Reinforcement Learning Designer app, specify
simulation options on the Simulate tab.
Option Description
Number of Episodes Number of episodes to simulate the agent, specified as a positive integer.
At the start of each simulation episode, the app resets the environment.
Max Episode Length Number of steps to run the simulation, specified as a positive integer. In
general, you define episode termination conditions in the environment.
This value is the maximum number of steps to run in the simulation if
those termination conditions are not met.
Stop on Error Select this option to stop simulation when an error occurs during an
episode.
To simulate your agent using parallel computing, on the Simulate tab, click Use Parallel. Simulating agents
using parallel computing requires Parallel Computing Toolbox software. For more information, see
“Train Agents Using Parallel Computing and GPUs” on page 5-8.
To specify options for parallel simulation, select Use Parallel > Parallel training options.
In the Parallel Simulation Options dialog box, you can specify the following options.
Option Description
Transfer workspace variables to workers Select this option to send model and workspace variables to parallel workers. When you select this option, the parallel pool client (the process that starts the training) sends variables used in models and defined in the MATLAB workspace to the workers.
Random seed for workers Randomizer initialization for workers, specified as one of the following values.
The following figure shows an example parallel training configuration that uses the following files and functions.
See Also
Reinforcement Learning Designer
Related Examples
• “Design and Train Agent Using Reinforcement Learning Designer” on page 5-12
• “Specify Training Options in Reinforcement Learning Designer” on page 5-16
• “Create Agents Using Reinforcement Learning Designer” on page 3-9
Log Training Data To Disk
This example shows how to log data to disk when using the train function to train agents in the
Reinforcement Learning Toolbox™.
Overview
To log data to disk during training, first create a FileLogger object using the rlDataLogger function.
fileLogger = rlDataLogger()
fileLogger =
  FileLogger with properties:
Specify options to log data such as the logging directory and the frequency (in number of episodes) at
which the data logger writes data to disk. This step is optional.
% Specify a logging directory. You must have write access for this
% directory.
logDir = fullfile(pwd,"myDataLog");
fileLogger.LoggingOptions.LoggingDirectory = logDir;
% Specify a naming rule for files. The naming rule episode<id> saves files
% as episode001.mat, episode002.mat and so on.
fileLogger.LoggingOptions.FileNameRule = "episode<id>";
% Set the frequency (in number of episodes) at which the data logger writes data to disk
fileLogger.LoggingOptions.DataWriteFrequency = 1;
Training data of interest is generated at different stages of training, for example, experience data is
available after the completion of an episode. Configure the logger object with callback functions to
log at these stages. The callback functions are:
• EpisodeFinishedFcn - callback function to log data such as experiences, logged Simulink signals,
or initial observation. The function is executed after the completion of a training episode. A
template for the function is shown below.
function dataToLog = myEpisodeLoggingFcn(data)
% data is a structure that contains the following fields:
end
• AgentStepFinishedFcn - callback function to log data such as the state of exploration. The
function is executed after the completion of an agent step within an episode. A template for the
function is shown below.
function dataToLog = myAgentStepLoggingFcn(data)
% data is a structure that contains the following fields:
% EpisodeCount: The current episode number.
% AgentStepCount: The cumulative number of steps taken by the agent.
% SimulationTime: The current simulation time in the environment.
% Agent: Agent object.
%
% dataToLog is a structure containing the data to be logged to disk.
end
• AgentLearnFinishedFcn - callback function to log data such as the actor and critic training
losses after the completion of the learn subroutine. A template for the function is shown below.
function dataToLog = myAgentLearnLoggingFcn(data)
% data is a structure that contains the following fields:
% EpisodeCount: The current episode number.
% AgentStepCount: The cumulative number of steps taken by the agent.
% AgentLearnCount: The cumulative number of learning steps taken by the agent.
% EnvModelTrainingInfo: A structure containing the fields TransitionFcnLoss, RewardFcnLoss, IsDone
% Agent: Agent object.
% ActorLoss: Training loss of actor function.
% CriticLoss: Training loss of critic function.
%
% dataToLog is a structure containing the data to be logged to disk.
end
For this example, configure only the AgentLearnFinishedFcn callback. The function
logTrainingLoss logs the actor and critic training losses and is provided at the end of this script.
fileLogger.AgentLearnFinishedFcn = @logTrainingLoss;
Run Training
Specify training options to train the agent for 100 episodes without visualization in the Episode
Manager.
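A sketch of such a training options object, showing only the values implied by the text:
trainOpts = rlTrainingOptions('MaxEpisodes',100,'Plots',"none",'Verbose',true);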
Train the agent using the train function and specifying the fileLogger object in the Logger
name-value option.
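A sketch of the corresponding call, assuming agent, env, and trainOpts exist in the workspace:
trainResult = train(agent,env,trainOpts,"Logger",fileLogger);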
Episode: 1/100 | Episode reward: -4.25 | Episode steps: 47 | Average reward: -4.25 | St
Episode: 2/100 | Episode reward: -20.08 | Episode steps: 31 | Average reward: -12.17 | St
Episode: 3/100 | Episode reward: -40.08 | Episode steps: 11 | Average reward: -21.47 | St
Episode: 4/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -26.62 | St
Episode: 5/100 | Episode reward: -40.12 | Episode steps: 11 | Average reward: -29.32 | St
Episode: 6/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -36.88 | St
Episode: 7/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.28 | St
Episode: 8/100 | Episode reward: -42.06 | Episode steps: 9 | Average reward: -41.68 | St
Episode: 9/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.48 | St
Episode: 10/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.87 | St
Episode: 11/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.87 | St
Episode: 12/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.87 | St
Episode: 13/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 14/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 15/100 | Episode reward: -43.04 | Episode steps: 8 | Average reward: -41.88 | St
Episode: 16/100 | Episode reward: -41.09 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 17/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.49 | St
Episode: 18/100 | Episode reward: -41.09 | Episode steps: 10 | Average reward: -41.49 | St
Episode: 19/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.68 | St
Episode: 20/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.29 | St
Episode: 21/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.48 | St
Episode: 22/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.68 | St
Episode: 23/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.87 | St
Episode: 24/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.87 | St
Episode: 25/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.87 | St
Episode: 26/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.87 | St
Episode: 27/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 28/100 | Episode reward: -41.09 | Episode steps: 10 | Average reward: -41.48 | St
Episode: 29/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.29 | St
Episode: 30/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.29 | St
Episode: 31/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.09 | St
Episode: 32/100 | Episode reward: -40.13 | Episode steps: 11 | Average reward: -40.90 | St
Episode: 33/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.10 | St
Episode: 34/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.29 | St
Episode: 35/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.49 | St
Episode: 36/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.68 | St
Episode: 37/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -42.07 | St
Episode: 38/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -42.07 | St
Episode: 39/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -42.07 | St
Episode: 40/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.87 | St
Episode: 41/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 42/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.68 | St
Episode: 43/100 | Episode reward: -41.09 | Episode steps: 10 | Average reward: -41.49 | St
Episode: 44/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.29 | St
Episode: 45/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.48 | St
Episode: 46/100 | Episode reward: -41.09 | Episode steps: 10 | Average reward: -41.48 | St
Episode: 47/100 | Episode reward: -42.06 | Episode steps: 9 | Average reward: -41.48 | St
Episode: 48/100 | Episode reward: -42.06 | Episode steps: 9 | Average reward: -41.68 | St
Episode: 49/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 50/100 | Episode reward: -43.05 | Episode steps: 8 | Average reward: -41.87 | St
Episode: 51/100 | Episode reward: -40.13 | Episode steps: 11 | Average reward: -41.68 | St
Episode: 52/100 | Episode reward: -41.09 | Episode steps: 10 | Average reward: -41.49 | St
Episode: 53/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.29 | St
Episode: 54/100 | Episode reward: -43.04 | Episode steps: 8 | Average reward: -41.68 | St
Episode: 55/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.29 | St
Episode: 56/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.49 | St
Episode: 57/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.68 | St
Episode: 58/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 59/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.29 | St
Episode: 60/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.29 | St
Episode: 61/100 | Episode reward: -43.05 | Episode steps: 8 | Average reward: -41.68 | St
Episode: 62/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.49 | St
Episode: 63/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.49 | St
Episode: 64/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.68 | St
Episode: 65/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 66/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.29 | St
Episode: 67/100 | Episode reward: -43.05 | Episode steps: 8 | Average reward: -41.68 | St
Episode: 68/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 69/100 | Episode reward: -40.13 | Episode steps: 11 | Average reward: -41.29 | St
Episode: 70/100 | Episode reward: -43.05 | Episode steps: 8 | Average reward: -41.68 | St
Episode: 71/100 | Episode reward: -43.05 | Episode steps: 8 | Average reward: -42.07 | St
Episode: 72/100 | Episode reward: -41.09 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 73/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 74/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -42.07 | St
Episode: 75/100 | Episode reward: -41.09 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 76/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.48 | St
Episode: 77/100 | Episode reward: -41.09 | Episode steps: 10 | Average reward: -41.48 | St
Episode: 78/100 | Episode reward: -43.05 | Episode steps: 8 | Average reward: -41.87 | St
Episode: 79/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.87 | St
Episode: 80/100 | Episode reward: -42.06 | Episode steps: 9 | Average reward: -42.07 | St
Episode: 81/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -42.07 | St
Episode: 82/100 | Episode reward: -43.05 | Episode steps: 8 | Average reward: -42.46 | St
Episode: 83/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -42.26 | St
Episode: 84/100 | Episode reward: -40.12 | Episode steps: 11 | Average reward: -41.87 | St
Episode: 85/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 86/100 | Episode reward: -42.06 | Episode steps: 9 | Average reward: -41.68 | St
Episode: 87/100 | Episode reward: -40.13 | Episode steps: 11 | Average reward: -41.09 | St
Episode: 88/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -40.90 | St
Episode: 89/100 | Episode reward: -41.09 | Episode steps: 10 | Average reward: -41.10 | St
Episode: 90/100 | Episode reward: -42.06 | Episode steps: 9 | Average reward: -41.29 | St
Episode: 91/100 | Episode reward: -41.09 | Episode steps: 10 | Average reward: -41.09 | St
Episode: 92/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.48 | St
Episode: 93/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.48 | St
Episode: 94/100 | Episode reward: -43.05 | Episode steps: 8 | Average reward: -41.87 | St
Episode: 95/100 | Episode reward: -41.09 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 96/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.68 | St
Episode: 97/100 | Episode reward: -41.10 | Episode steps: 10 | Average reward: -41.49 | St
Episode: 98/100 | Episode reward: -42.07 | Episode steps: 9 | Average reward: -41.68 | St
Episode: 99/100 | Episode reward: -43.05 | Episode steps: 8 | Average reward: -41.68 | St
Episode: 100/100 | Episode reward: -43.05 | Episode steps: 8 | Average reward: -42.07 | St
To view a summary of the generated MAT files, run the following code.
dirInfo = dir(logDir)
You can import the data into the MATLAB® workspace by loading the MAT files individually, or by
writing a script.
For this example, import the data using a FileDatastore object. This object loads data using the read function loadTrainingLoss provided at the end of this script. For more information, see fileDatastore.
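The datastore creation and read calls are not shown in this text; a sketch of the import described above is:
% Create a FileDatastore object that reads each logged MAT file using the
% loadTrainingLoss read function.
fds = fileDatastore(logDir,"ReadFcn",@loadTrainingLoss);
% Read the contents of all logged files into a cell array.
lossData = readall(fds);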
% Create a figure
f = figure();
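The plotting code that follows the figure creation is not shown above. A minimal sketch, assuming the lossData cell array from the datastore sketch and assuming each logged file contains the ActorLoss and CriticLoss fields set up in the logging callback, is:
% Concatenate the losses logged at each agent learning step.
actorLoss  = [];
criticLoss = [];
for ct = 1:numel(lossData)
    actorLoss  = [actorLoss;  lossData{ct}.ActorLoss(:)];
    criticLoss = [criticLoss; lossData{ct}.CriticLoss(:)];
end
% Plot the actor and critic training losses.
subplot(2,1,1)
plot(actorLoss)
title("Actor Loss")
xlabel("Learning Step")
subplot(2,1,2)
plot(criticLoss)
title("Critic Loss")
xlabel("Learning Step")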
Local Functions
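The local functions themselves are not reproduced in this text. Minimal sketches consistent with how they are used above (assumed implementations) are:
function dataToLog = logTrainingLoss(data)
    % Callback for AgentLearnFinishedFcn: log the actor and critic losses.
    dataToLog.ActorLoss  = data.ActorLoss;
    dataToLog.CriticLoss = data.CriticLoss;
end

function data = loadTrainingLoss(fileName)
    % Read function for the FileDatastore object: load one logged MAT file.
    data = load(fileName);
end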
Train Reinforcement Learning Agent for Simple Contextual Bandit Problem
This example shows how to solve a contextual bandit problem [1] using reinforcement learning by
training DQN and Q agents. For more information on these agents, see “Deep Q-Network (DQN)
Agents” on page 3-23 and “Q-Learning Agents” on page 3-17.
In contextual bandit problems, an agent selects an action given an initial observation (the context), receives a reward, and the episode then terminates. Hence, the agent action does not affect the next observation.
Contextual bandits can be used for various applications such as hyperparameter tuning, recommender systems, medical treatment, and 5G communication.
The following figure describes the difference between reinforcement learning, multi-armed bandits,
and contextual bandits.
Environment
At the start of each episode, the environment samples the context (initial observation) s randomly:
Pr(s = 1) = 0.5
Pr(s = 2) = 0.5
Reward:
Rewards in this environment are stochastic. The probability of each observation and action pair is
defined below.
1. s = 1, a = 1: Pr(r = 5 | s = 1, a = 1) = 0.3 and Pr(r = 2 | s = 1, a = 1) = 0.7
2. s = 1, a = 2: Pr(r = 10 | s = 1, a = 2) = 0.1 and Pr(r = 1 | s = 1, a = 2) = 0.9
3. s = 1, a = 3: Pr(r = 3.5 | s = 1, a = 3) = 1
4. s = 2, a = 1: Pr(r = 10 | s = 2, a = 1) = 0.2 and Pr(r = 2 | s = 2, a = 1) = 0.8
5. s = 2, a = 2: Pr(r = 3 | s = 2, a = 2) = 1
6. s = 2, a = 3: Pr(r = 5 | s = 2, a = 3) = 0.5 and Pr(r = 0.5 | s = 2, a = 3) = 0.5
Is-Done signal: This is a contextual bandit problem, and each episode has only one step. Hence, the
Is-Done signal is always 1.
env = ToyContextualBanditEnvironment;
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
rng(1)
agentOpts = rlDQNAgentOptions(...
UseDoubleDQN = false, ...
TargetSmoothFactor = 1, ...
TargetUpdateFrequency = 4, ...
MiniBatchSize = 64);
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.0005;
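The code that creates the DQN agent (the variable DQNagent used below) is not shown in this text. A minimal sketch that builds the agent with a default critic network from the environment specifications and the options defined above is:
% Assumed sketch: create a DQN agent with a default critic network.
DQNagent = rlDQNAgent(obsInfo,actInfo,agentOpts);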
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
Train the agent using the train function. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
MaxEpisodes = 3000;
trainOpts = rlTrainingOptions(...
MaxEpisodes = MaxEpisodes, ...
MaxStepsPerEpisode = 1, ...
Verbose = false, ...
Plots = "training-progress",...
StopTrainingCriteria = "EpisodeCount",...
StopTrainingValue = MaxEpisodes);
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(DQNagent,env,trainOpts);
else
% Load the pre-trained agent for the example.
load("ToyContextualBanditDQNAgent.mat","DQNagent")
end
Assume that you know the distribution of the rewards, and you can compute the optimal actions.
Validate the agent's performance by comparing these optimal actions with the actions selected by the
agent. First, compute the true expected rewards with the true distributions.
For s = 1:
If a = 1, E[R] = 0.3*5 + 0.7*2 = 2.9
If a = 2, E[R] = 0.1*10 + 0.9*1 = 1.9
If a = 3, E[R] = 3.5
For s = 2:
If a = 1, E[R] = 0.2*10 + 0.8*2 = 3.6
If a = 2, E[R] = 3.0
If a = 3, E[R] = 0.5*5 + 0.5*0.5 = 2.75
With enough sampling, the learned Q-values should be close to the true expected rewards. Visualize the true expected rewards.
ExpectedRewards = zeros(2,3);
ExpectedRewards(1,1) = 0.3*5 + 0.7*2;
ExpectedRewards(1,2) = 0.1*10 + 0.9*1;
ExpectedRewards(1,3) = 3.5;
ExpectedRewards(2,1) = 0.2*10 + 0.8*2;
ExpectedRewards(2,2) = 3.0;
ExpectedRewards(2,3) = 0.5*5 + 0.5*0.5;
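The plotting call itself is not shown in this text; presumably it uses the local plotting function defined at the end of this example (the title string below is assumed):
% Visualize the true expected rewards (assumed call).
localPlotQvalues(ExpectedRewards,"Expected Rewards")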
Now, validate whether the DQN agent learns the optimal behavior.
observation = 1;
getAction(DQNagent,observation)
observation = 2;
getAction(DQNagent,observation)
The agent selects the optimal action for both contexts. Thus, the DQN agent has learned the optimal behavior.
Next, compare the Q-Value function to the true expected reward when selecting the optimal action.
% Get critic
figure(1)
DQNcritic = getCritic(DQNagent);
QValues = zeros(2,3);
for s = 1:2
QValues(s,:) = getValue(DQNcritic, {s});
end
% Visualize Q values
localPlotQvalues(QValues, "Q values")
The learned Q-values are close to the true expected rewards computed above.
Next, train a Q-learning agent. To create a Q-learning agent, first create a table using the observation and action specifications from the environment.
rng(1); % For reproducibility
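The table and critic creation code is not shown in this text. A sketch consistent with the critic variable used in the rlQAgent call below (the name qTable is assumed) is:
% Create a Q-value table from the observation and action specifications,
% and use it as a table-based critic (assumed sketch).
qTable = rlTable(obsInfo,actInfo);
critic = rlQValueFunction(qTable,obsInfo,actInfo);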
opt = rlQAgentOptions;
opt.EpsilonGreedyExploration.Epsilon = 1;
opt.EpsilonGreedyExploration.EpsilonDecay = 0.0005;
Qagent = rlQAgent(critic,opt);
To save time while running this example, load a pre-trained agent by setting doTraining to false.
To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(Qagent,env,trainOpts);
else
% Load the pre-trained agent for the example.
% (The MAT file name below is assumed; use the file provided with the example.)
load("ToyContextualBanditQAgent.mat","Qagent")
end
Validate the trained Q-learning agent by querying its action for each context.
observation = 1;
getAction(Qagent,observation)
observation = 2;
getAction(Qagent,observation)
The agent also selects the optimal action. Hence, the Q-learning agent has learned the optimal
behavior.
Next, compare the Q-Value function to the true expected reward when selecting the optimal action.
% Get critic
figure(2)
Qcritic = getCritic(Qagent);
QValues = zeros(2,3);
for s = 1:2
for a = 1:3
QValues(s,a) = getValue(Qcritic, {s}, {a});
end
end
% Visualize Q values
localPlotQvalues(QValues, "Q values")
Again, the learned Q-values are close to the true expected rewards computed above. The Q-values for the deterministic rewards, Q(s=1, a=3) and Q(s=2, a=2), are the same as the true expected rewards. Note that the corresponding Q-values learned by the DQN agent, while close, are not identical to the true values. This happens because the DQN agent uses a neural network instead of a table as its internal function approximator.
Local Function
function localPlotQvalues(QValues, titleText)
% Visualize Q values
figure;
imagesc(QValues,[1,4])
colormap("autumn")
title(titleText)
colorbar
set(gca,'Xtick',1:3,'XTickLabel',{"a=1", "a=2", "a=3"})
set(gca,'Ytick',1:2,'YTickLabel',{"s=1", "s=2"})
end
Reference
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. Cambridge, MA: The MIT Press, 2018.
Train Reinforcement Learning Agent Using Parameter Sweeping
This example shows how to train a reinforcement learning agent with the water tank reinforcement
learning Simulink® environment by sweeping parameters. You can use this example as a template for
tuning parameters when training reinforcement learning agents.
Open a preconfigured project that has all the required files added as project dependencies. Opening the project also starts the Experiment Manager app.
TrainAgentUsingParameterSweepingStart
Note that it is best practice to add any Simulink models and supporting files as dependencies to your
project.
In this section you tune the agent parameters to search for an optimal training policy.
Open Experiment
% Create options for the reinforcement learning agent. You can assign
% values from the params structure for sweeping parameters.
agentOpts = rlDDPGAgentOptions();
agentOpts.MiniBatchSize = 64;
agentOpts.TargetSmoothFactor = 1e-3;
agentOpts.SampleTime = Ts;
agentOpts.DiscountFactor = params.DiscountFactor;
agentOpts.ActorOptimizerOptions.LearnRate = params.ActorLearnRate;
agentOpts.CriticOptimizerOptions.LearnRate = params.CriticLearnRate;
agentOpts.ActorOptimizerOptions.GradientThreshold = 1;
agentOpts.CriticOptimizerOptions.GradientThreshold = 1;
agentOpts.NoiseOptions.Variance = 0.3;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;
maxepisodes = 200;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
MaxEpisodes=maxepisodes, ...
MaxStepsPerEpisode=maxsteps, ...
ScoreAveragingWindowLength=20, ...
Verbose=false, ...
Plots="none",...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=800);
end
Run Experiment
When you run the experiment, Experiment Manager executes the training function multiple times.
Each trial uses one combination of hyperparameter values. By default, Experiment Manager runs one
trial at a time. If you have Parallel Computing Toolbox, you can run multiple trials at the same time or
offload your experiment as a batch job in a cluster.
• To run one trial at a time, under Mode, select Sequential, and click Run.
• To run multiple trials simultaneously, under Mode, select Simultaneous, and click Run. This
requires a Parallel Computing Toolbox license.
• To offload the experiment as a batch job, under Mode, select Batch Sequential or Batch Simultaneous, specify your Cluster and Pool Size, and click Run. Note that you will need to configure the cluster with the files necessary for this example. This step also requires a Parallel Computing Toolbox license.
Note that your cluster needs to be configured with files necessary for this experiment when running
in the Batch Sequential or Batch Simultaneous modes. To configure your cluster:
• Open the Cluster Profile Manager and under Properties, click Edit.
• Under the AttachedFiles option, click Add and specify the files rlwatertank.slx and
loadWaterTankParams.m.
• Click Done.
• Select a trial row from the table of results, and under the toolstrip, click Training Plot. This
shows the episode and average reward plots for that trial.
• Select the row corresponding to "trial 7" which received the maximum average reward, and under
the toolstrip, click Export. This exports the results of the trial to a base workspace variable.
• Name the variable as agentParamSweepTrainingOutput.
In this section you tune the environment's reward function parameters to search for an optimal
training policy.
Open Experiment
% Specify a reset function for the environment. You can tune environment
% parameters such as reward or initial condition within this function.
env.ResetFcn = @(in) localResetFcn(in, params);
% Create options for the reinforcement learning agent. You can assign
% values from the params structure for sweeping parameters.
agentOpts = rlDDPGAgentOptions();
agentOpts.MiniBatchSize = 64;
agentOpts.TargetSmoothFactor = 1e-3;
agentOpts.SampleTime = Ts;
agentOpts.DiscountFactor = 0.99;
agentOpts.ActorOptimizerOptions.LearnRate = 1e-3;
agentOpts.CriticOptimizerOptions.LearnRate = 1e-3;
agentOpts.ActorOptimizerOptions.GradientThreshold = 1;
agentOpts.CriticOptimizerOptions.GradientThreshold = 1;
agentOpts.NoiseOptions.Variance = 0.3;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;
maxepisodes = 200;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
MaxEpisodes=maxepisodes, ...
MaxStepsPerEpisode=maxsteps, ...
ScoreAveragingWindowLength=20, ...
Verbose=false, ...
Plots="none",...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=800);
end
end
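The reset function referenced above is part of the project and is not reproduced here. A sketch of what such a function might look like (the parameter name RewardGain below is hypothetical) is:
function in = localResetFcn(in, params)
    % Pass a swept reward parameter to the Simulink model through the
    % SimulationInput object (hypothetical variable name RewardGain).
    in = setVariable(in,"RewardGain",params.RewardGain);
end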
Run Experiment
When you run the experiment, Experiment Manager executes the training function multiple times.
Each trial uses one combination of hyperparameter values. By default, Experiment Manager runs one
trial at a time. If you have Parallel Computing Toolbox, you can run multiple trials at the same time or
offload your experiment as a batch job in a cluster.
• To run one trial at a time, under Mode, select Sequential, and click Run.
• To run multiple trials simultaneously, under Mode, select Simultaneous, and click Run. This
requires a Parallel Computing Toolbox license.
• To offload the experiment as a batch job, under Mode, select Batch Sequential or Batch Simultaneous, specify your Cluster and Pool Size, and click Run. This step also requires a Parallel Computing Toolbox license.
Note that your cluster needs to be configured with files necessary for this experiment when running
in the Batch Sequential or Batch Simultaneous modes. To configure your cluster:
• Open the Cluster Profile Manager and under Properties, click Edit.
• Under the AttachedFiles option, click Add and specify the files rlwatertank.slx and
loadWaterTankParams.m.
• Click Done.
• Select a trial row from the table of results, and under the toolstrip, click Training Plot. This
shows the episode and average reward plots for that trial.
• Select the row corresponding to "trial 4" which received the maximum average reward, and under
the toolstrip, click Export. This exports the results of the trial to a base workspace variable.
• Name the variable as envParamSweepTrainingOutput.
Execute the following code in MATLAB after exporting the agents from the above experiments. This
simulates the agent with the environment and displays the performance in the Scope blocks.
open_system('rlwatertank');
simOpts = rlSimulationOptions(MaxSteps=200);
close(prj);
Train DQN Agent to Balance Cart-Pole System
This example shows how to train a deep Q-learning network (DQN) agent to balance a cart-pole
system modeled in MATLAB®.
For more information on DQN agents, see “Deep Q-Network (DQN) Agents” on page 3-23. For an
example that trains a DQN agent in Simulink®, see “Train DQN Agent to Swing Up and Balance
Pendulum” on page 5-88.
The reinforcement learning environment for this example is a pole attached to an unactuated joint on
a cart, which moves along a frictionless track. The training goal is to make the pole stand upright
without falling over.
• The upward balanced pole position is 0 radians, and the downward hanging position is pi radians.
• The pole starts upright with an initial angle between –0.05 and 0.05 radians.
• The force action signal from the agent to the environment is either –10 or 10 N.
• The observations from the environment are the position and velocity of the cart, the pole angle,
and the pole angle derivative.
• The episode terminates if the pole is more than 12 degrees from vertical or if the cart moves more
than 2.4 m from the original position.
• A reward of +1 is provided for every time step that the pole remains upright. A penalty of –5 is
applied when the pole falls.
For more information on this model, see “Load Predefined Control System Environments” on page 2-
23.
env = rlPredefinedEnv("CartPole-Discrete")
env =
CartPoleDiscreteAction with properties:
Gravity: 9.8000
MassCart: 1
MassPole: 0.1000
Length: 0.5000
MaxForce: 10
Ts: 0.0200
ThetaThresholdRadians: 0.2094
XThreshold: 2.4000
RewardForNotFalling: 1
PenaltyForFalling: -5
State: [4x1 double]
The interface has a discrete action space where the agent can apply one of two possible force values
to the cart, –10 or 10 N.
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "CartPole States"
Description: "x, dx, theta, dtheta"
Dimension: [4 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlFiniteSetSpec with properties:
rng(0)
DQN agents can use vector Q-value function critics, which are generally more efficient than comparable single-output critics. A vector Q-value function critic has observations as inputs and state-action values as outputs. Each output element represents the expected cumulative long-term reward for taking the corresponding discrete action from the state indicated by the observation inputs. For more information on creating value functions, see “Create Policies and Value Functions” on page 4-2.
To approximate the Q-value function within the critic, use a neural network with one input channel
(the 4-dimensional observed state vector) and one output channel with two elements (one for the 10
N action, another for the –10 N action). Define the network as an array of layer objects, and get the
dimension of the observation space and the number of possible actions from the environment
specification objects.
net = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(20)
reluLayer
fullyConnectedLayer(length(actInfo.Elements))];
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'input' 4 features
plot(net)
Create the critic approximator using net and the environment specifications. For more information,
see rlVectorQValueFunction.
critic = rlVectorQValueFunction(net,obsInfo,actInfo);
Check the critic with a random observation input using getValue.
getValue(critic,{rand(obsInfo.Dimension)})
ans = 2×1 single
   -0.2257
    0.4299
Create the DQN agent using critic. For more information, see rlDQNAgent.
agent = rlDQNAgent(critic);
Specify the DQN agent options, including training options for the critic. Alternatively, you can use
rlDQNAgentOptions and rlOptimizerOptions objects.
agent.AgentOptions.UseDoubleDQN = false;
agent.AgentOptions.TargetSmoothFactor = 1;
agent.AgentOptions.TargetUpdateFrequency = 4;
agent.AgentOptions.ExperienceBufferLength = 1e5;
agent.AgentOptions.MiniBatchSize = 256;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
Train Agent
To train the agent, first specify the training options. For this example, use the following options:
• Run one training session containing at most 1000 episodes, with each episode lasting at most 500
time steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option to false).
• Stop training when the agent receives a moving average cumulative reward greater than 480. At this point, the agent can balance the cart-pole system in the upright position.
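The rlTrainingOptions call is not shown in this text; a sketch consistent with the options listed above is:
% Training options sketch based on the settings listed above.
trainOpts = rlTrainingOptions(...
    MaxEpisodes=1000, ...
    MaxStepsPerEpisode=500, ...
    Verbose=false, ...
    Plots="training-progress", ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=480);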
You can visualize the cart-pole system by using the plot function during training or simulation.
plot(env)
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained agent for the example.
load("MATLABCartpoleDQNMulti.mat","agent")
end
To validate the performance of the trained agent, simulate it within the cart-pole environment. For
more information on agent simulation, see rlSimulationOptions and sim. The agent can balance
the cart-pole even when the simulation time increases to 500 steps.
simOptions = rlSimulationOptions(MaxSteps=500);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)
totalReward = 500
See Also
train
More About
• “Deep Q-Network (DQN) Agents” on page 3-23
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Policies and Value Functions” on page 4-2
Train PG Agent to Balance Cart-Pole System
This example shows how to train a policy gradient (PG) agent to balance a cart-pole system modeled
in MATLAB®. For more information on PG agents, see “Policy Gradient Agents” on page 3-27.
For an example that trains a PG agent with a baseline, see “Train PG Agent with Baseline to Control
Double Integrator System” on page 5-70.
The reinforcement learning environment for this example is a pole attached to an unactuated joint on
a cart, which moves along a frictionless track. The training goal is to make the pendulum stand
upright without falling over.
• The upward balanced pendulum position is 0 radians, and the downward hanging position is pi
radians.
• The pendulum starts upright with an initial angle between –0.05 and 0.05 radians.
• The force action signal from the agent to the environment is either –10 or 10 N.
• The observations from the environment are the position and velocity of the cart, the pendulum
angle, and the pendulum angle derivative.
• The episode terminates if the pole is more than 12 degrees from vertical or if the cart moves more
than 2.4 m from the original position.
• A reward of +1 is provided for every time step that the pole remains upright. A penalty of –5 is
applied when the pendulum falls.
For more information on this model, see “Load Predefined Control System Environments” on page 2-
23.
env = rlPredefinedEnv("CartPole-Discrete")
env =
CartPoleDiscreteAction with properties:
Gravity: 9.8000
MassCart: 1
MassPole: 0.1000
Length: 0.5000
MaxForce: 10
Ts: 0.0200
ThetaThresholdRadians: 0.2094
XThreshold: 2.4000
RewardForNotFalling: 1
PenaltyForFalling: -5
State: [4x1 double]
The interface has a discrete action space where the agent can apply one of two possible force values
to the cart, –10 or 10 N.
Obtain the observation and action information from the environment interface.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
rng(0)
Create PG Agent
For policy gradient agents, the actor executes a stochastic policy, which for discrete action spaces is
approximated by a discrete categorical actor. This actor must take the observation signal as input and
return a probability for each action.
To approximate the policy within the actor, use a deep neural network. Define the network as an array
of layer objects, and get the dimension of the observation space and the number of possible actions
from the environment specification objects. For more information on creating a deep neural network
policy representation, see “Create Policies and Value Functions” on page 4-2.
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(10)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
softmaxLayer
];
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Number of learnables: 72
Inputs:
1 'input' 4 features
Create the actor representation using the specified deep neural network and the environment
specification objects. For more information, see rlDiscreteCategoricalActor.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
To return the probability distribution of the possible actions as a function of a random observation,
and given the current network weights, use evaluate.
prb = evaluate(actor,{rand(obsInfo.Dimension)});
prb{1}
0.7229
0.2771
Create the agent using the actor. For more information, see rlPGAgent.
agent = rlPGAgent(actor);
Specify training options for the actor. Alternatively, you can use rlPGAgentOptions and
rlOptimizerOptions objects.
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run each training episode for at most 1000 episodes, with each episode lasting at most 500 time
steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option to false).
• Stop training when the agent receives an average cumulative reward greater than 480 over 100
consecutive episodes. At this point, the agent can balance the pendulum in the upright position.
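The rlTrainingOptions call is not shown in this text; a sketch consistent with the options listed above is:
% Training options sketch based on the settings listed above.
trainOpts = rlTrainingOptions(...
    MaxEpisodes=1000, ...
    MaxStepsPerEpisode=500, ...
    Verbose=false, ...
    Plots="training-progress", ...
    ScoreAveragingWindowLength=100, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=480);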
You can visualize the cart-pole system by using the plot function during training or simulation.
plot(env)
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained agent for the example.
load("MATLABCartpolePG.mat","agent");
end
Simulate PG Agent
To validate the performance of the trained agent, simulate it within the cart-pole environment. For
more information on agent simulation, see rlSimulationOptions and sim. The agent can balance
the cart-pole system even when the simulation time increases to 500 steps.
simOptions = rlSimulationOptions(MaxSteps=500);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)
totalReward = 500
See Also
train
More About
• “Policy Gradient Agents” on page 3-27
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Policies and Value Functions” on page 4-2
Train AC Agent to Balance Cart-Pole System
This example shows how to train an actor-critic (AC) agent to balance a cart-pole system modeled in
MATLAB®.
For more information on AC agents, see “Actor-Critic Agents” on page 3-40. For an example showing
how to train an AC agent using parallel computing, see “Train AC Agent to Balance Cart-Pole System
Using Parallel Computing” on page 5-151.
The reinforcement learning environment for this example is a pole attached to an unactuated joint on
a cart, which moves along a frictionless track. The training goal is to make the pendulum stand
upright without falling over.
• The upward balanced pendulum position is 0 radians, and the downward hanging position is pi
radians.
• The pendulum starts upright with an initial angle between –0.05 and 0.05 rad.
• The force action signal from the agent to the environment is either –10 or 10 N.
• The observations from the environment are the position and velocity of the cart, the pendulum
angle, and the pendulum angle derivative.
• The episode terminates if the pole is more than 12 degrees from vertical or if the cart moves more
than 2.4 m from the original position.
• A reward of +1 is provided for every time step that the pole remains upright. A penalty of –5 is
applied when the pendulum falls.
For more information on this model, see “Load Predefined Control System Environments” on page 2-
23.
env = rlPredefinedEnv("CartPole-Discrete")
env =
CartPoleDiscreteAction with properties:
Gravity: 9.8000
MassCart: 1
MassPole: 0.1000
Length: 0.5000
MaxForce: 10
Ts: 0.0200
ThetaThresholdRadians: 0.2094
XThreshold: 2.4000
RewardForNotFalling: 1
PenaltyForFalling: -5
State: [4x1 double]
env.PenaltyForFalling = -10;
The interface has a discrete action space where the agent can apply one of two possible force values
to the cart, –10 or 10 N.
Obtain the observation and action information from the environment interface.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
rng(0)
Create AC Agent
An AC agent approximates the discounted cumulative long-term reward using a value-function critic.
A value-function critic must accept an observation as input and return a single scalar (the estimated
discounted cumulative long-term reward) as output.
To approximate the value function within the critic, use a neural network. Define the network as an
array of layer objects, and get the dimension of the observation space and the number of possible
actions from the environment specification objects. For more information on creating a deep neural
network value function representation, see “Create Policies and Value Functions” on page 4-2.
criticNet = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(1)];
criticNet = dlnetwork(criticNet);
summary(criticNet)
Initialized: true
Inputs:
1 'input' 4 features
Create the critic approximator object using criticNet, and the observation specification. For more
information, see rlValueFunction.
critic = rlValueFunction(criticNet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.3590
An AC agent decides which action to take using a stochastic policy, which for discrete action spaces is
approximated by a discrete categorical actor. This actor must take the observation signal as input and
return a probability for each action.
To approximate the policy function within the actor, use a deep neural network. Define the network as
an array of layer objects, and get the dimension of the observation space and the number of possible
actions from the environment specification objects.
actorNet = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
softmaxLayer];
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor approximator object using actorNet and the observation and action specifications.
For more information, see rlDiscreteCategoricalActor.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
To return the probability distribution of the possible actions as a function of a random observation,
and given the current network weights, use evaluate.
prb = evaluate(actor,{rand(obsInfo.Dimension)})
prb{1}
0.4414
0.5586
Create the agent using the actor and critic. For more information, see rlACAgent.
agent = rlACAgent(actor,critic);
getAction(agent,{rand(obsInfo.Dimension)})
Specify agent options, including training options for the actor and critic, using dot notation.
Alternatively, you can use rlACAgentOptions and rlOptimizerOptions objects before creating
the agent.
agent.AgentOptions.EntropyLossWeight = 0.01;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-2;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-2;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run each training episode for at most 1000 episodes, with each episode lasting at most 500 time
steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option to false).
• Stop training when the agent receives an average cumulative reward greater than 480 over 10
consecutive episodes. At this point, the agent can balance the pendulum in the upright position.
trainOpts = rlTrainingOptions(...
MaxEpisodes=1000,...
MaxStepsPerEpisode=500,...
Verbose=false,...
Plots="training-progress",...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=480,...
ScoreAveragingWindowLength=10);
You can visualize the cart-pole system during training or simulation using the plot function.
plot(env)
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained agent for the example.
load("MATLABCartpoleAC.mat","agent");
end
Simulate AC Agent
To validate the performance of the trained agent, simulate it within the cart-pole environment. For
more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions(MaxSteps=500);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)
totalReward = 500
See Also
train
More About
• “Actor-Critic Agents” on page 3-40
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Policies and Value Functions” on page 4-2
Train PG Agent with Baseline to Control Double Integrator System
This example shows how to train a policy gradient (PG) agent with baseline to control a second-order
dynamic system modeled in MATLAB®.
For more information on the basic PG agent with no baseline, see the example “Train PG Agent to
Balance Cart-Pole System” on page 5-57.
The reinforcement learning environment for this example is a second-order double integrator system
with a gain. The training goal is to control the position of a mass in the second-order system by
applying a force input.
The reward provided at every time step is r(t) = -(x(t)'*Q*x(t) + u(t)'*R*u(t)).
Here, x is the state vector of the mass (position and velocity), u is the force applied to the mass, and Q and R are the weights of the cost function on the states and the action, respectively.
For more information on this model, see “Load Predefined Control System Environments” on page 2-
23.
env = rlPredefinedEnv("DoubleIntegrator-Discrete")
env =
DoubleIntegratorDiscreteAction with properties:
Gain: 1
Ts: 0.1000
MaxDistance: 5
GoalThreshold: 0.0100
Q: [2x2 double]
R: 0.0100
MaxForce: 2
State: [2x1 double]
The interface has a discrete action space where the agent can apply one of three possible force
values to the mass: -2, 0, or 2 N.
Obtain the observation and action information from the environment interface.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
rng(0)
For policy gradient agents, the actor executes a stochastic policy, which for discrete action spaces is
approximated by a discrete categorical actor. This actor must take the observation signal as input and
return a probability for each action.
To approximate the policy within the actor, use a neural network. Define the network as an array of
layer objects with one input (the observation) and one output (the action), and get the dimension of
the observation space and the number of possible actions from the environment specification objects.
For more information on creating a deep neural network policy representation, see “Create Policies and Value Functions” on page 4-2.
actorNet = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(numel(actInfo.Elements))];
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Number of learnables: 9
Inputs:
1 'input' 2 features
Specify training options for the actor. For more information, see rlOptimizerOptions.
Alternatively, you can change agent (including actor and critic) options using dot notation after the
agent is created.
actorOpts = rlOptimizerOptions( ...
LearnRate=5e-3, ...
GradientThreshold=1);
Create the actor representation using the neural network and the environment specification objects.
For more information, see rlDiscreteCategoricalActor.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
To return the probability distribution of the possible actions as a function of a random observation,
and given the current network weights, use evaluate.
prb = evaluate(actor,{rand(obsInfo.Dimension)})
prb{1}
0.4994
0.3770
0.1235
In the PG agent algorithm (also known as REINFORCE), the returns can be compared to a baseline that depends on the state. This comparison can reduce the variance of the expected value of the update and thus improve the speed of learning. A possible choice for the baseline is an estimate of the state value function [1].
A value-function approximator object must accept an observation as input and return a single scalar
(the estimated discounted cumulative long-term reward) as output. Use a neural network as
approximation model. Define the network as an array of layer objects, and get the dimension of the
observation space and the number of possible actions from the environment specification objects.
baselineNet = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(8)
reluLayer
fullyConnectedLayer(1)];
Create the baseline value function approximator using baselineNet, and the observation
specification. For more information, see rlValueFunction.
baseline = rlValueFunction(baselineNet,obsInfo);
getValue(baseline,{rand(obsInfo.Dimension)})
ans = single
0.2152
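The baselineOpts object used below is not defined in this text; a sketch, assuming the same settings as the actor options, is:
% Optimizer options for the baseline (critic) — assumed values.
baselineOpts = rlOptimizerOptions( ...
    LearnRate=5e-3, ...
    GradientThreshold=1);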
To create the PG agent with baseline, specify the PG agent options using rlPGAgentOptions and set the UseBaseline option to true.
agentOpts = rlPGAgentOptions(...
UseBaseline=true, ...
ActorOptimizerOptions=actorOpts, ...
CriticOptimizerOptions=baselineOpts);
Then create the agent using the specified actor representation, baseline representation, and agent
options. For more information, see rlPGAgent.
agent = rlPGAgent(actor,baseline,agentOpts);
getAction(agent,{rand(obsInfo.Dimension)})
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run at most 1000 episodes, with each episode lasting at most 200 time steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option).
• Stop training when the agent receives a moving average cumulative reward greater than –45. At
this point, the agent can control the position of the mass using minimal control effort.
trainOpts = rlTrainingOptions(...
MaxEpisodes=1000, ...
MaxStepsPerEpisode=200, ...
Verbose=false, ...
Plots="training-progress",...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=-43);
You can visualize the double integrator system using the plot function during training or simulation.
plot(env)
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained parameters for the example.
load("DoubleIntegPGBaseline.mat");
end
Simulate PG Agent
To validate the performance of the trained agent, simulate it within the double integrator
environment. For more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions(MaxSteps=500);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)
totalReward = -39.9140
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning Series. Cambridge, MA: The MIT Press, 2018.
See Also
rlPGAgent
More About
• “Policy Gradient Agents” on page 3-27
• “Train PG Agent to Balance Cart-Pole System” on page 5-57
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Policies and Value Functions” on page 4-2
Train DDPG Agent to Control Double Integrator System
This example shows how to train a deep deterministic policy gradient (DDPG) agent to control a
second-order linear dynamic system modeled in MATLAB®. The example also compares the DDPG
agent to an LQR controller.
For more information on DDPG agents, see “Deep Deterministic Policy Gradient (DDPG) Agents” on
page 3-31. For an example showing how to train a DDPG agent in Simulink®, see “Train DDPG Agent
to Swing Up and Balance Pendulum” on page 5-95.
The reinforcement learning environment for this example is a second-order double-integrator system
with a gain. The training goal is to control the position of a mass in the second-order system by
applying a force input.
The reward provided at every time step is r(t) = -(x(t)'*Q*x(t) + u(t)'*R*u(t)). Here, x is the state vector of the mass (position and velocity), u is the force applied to the mass, and Q and R are the weights of the cost function on the states and the action, respectively.
For more information on this model, see “Load Predefined Control System Environments” on page 2-
23.
For this example the environment is a linear dynamical system, the environment state is observed
directly, and the reward is a quadratic function of the observation and action. Therefore the problem
of finding the sequence of actions that minimizes the cumulative long-term reward is a discrete-time
linear-quadratic optimal control problem, for which the optimal action is known to be a linear
function of the system states. This problem can also be solved using Linear-Quadratic Regulator
(LQR) design, and in the last part of the example you can compare the agent to an LQR controller.
env = rlPredefinedEnv("DoubleIntegrator-Continuous")
env =
DoubleIntegratorContinuousAction with properties:
Gain: 1
Ts: 0.1000
MaxDistance: 5
GoalThreshold: 0.0100
Q: [2x2 double]
R: 0.0100
MaxForce: Inf
State: [2x1 double]
The interface has a continuous action space where the agent can apply force values from -Inf to Inf
to the mass. The sample time is stored in env.Ts, while the continuous time cost function matrices
are stored in env.Q and env.R respectively.
Obtain the observation and action information from the environment interface.
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
x0 = 2×1
4
0
rng(0)
A DDPG agent approximates the discounted cumulative long-term reward using a Q-value-function
critic. A Q-value function critic must accept an observation and an action as inputs and return a
scalar (the estimated discounted cumulative long-term reward) as output. To approximate the Q-value
function within the critic, use a neural network. Since the value function of the optimal policy is known to be quadratic, use a network with a quadratic layer (which outputs a vector of quadratic monomials, as described in quadraticLayer) and a fully connected layer (which provides a linear combination of its inputs).
Define each network path as an array of layer objects and get the dimension of the observation and
action spaces from the environment specification objects. Assign names to the network input layers,
so you can connect them to the output path and later explicitly associate them with the appropriate
environment channel. Since there is no need for a bias term, set the bias term to zero (Bias=0) and
prevent it from changing (BiasLearnRateFactor=0).
For more information on creating value function approximators, see “Create Policies and Value
Functions” on page 4-2.
% Common path
commonPath = [
concatenationLayer(1,2,Name="concat")
quadraticLayer
fullyConnectedLayer(1,Name="value", ...
BiasLearnRateFactor=0,Bias=0)
];
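The observation and action input paths, and the assembly of the layer graph, are not shown in this text. A sketch consistent with the connectLayers calls that follow (the layer names obsIn and actIn are taken from those calls; the path variable names are assumed) is:
% Observation and action input paths (assumed reconstruction).
obsPath = featureInputLayer(obsInfo.Dimension(1),Name="obsIn");
actPath = featureInputLayer(actInfo.Dimension(1),Name="actIn");

% Assemble the layer graph before connecting the layers.
criticNet = layerGraph();
criticNet = addLayers(criticNet,obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);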
% Connect layers
criticNet = connectLayers(criticNet,"obsIn","concat/in1");
criticNet = connectLayers(criticNet,"actIn","concat/in2");
figure
plot(criticNet)
criticNet = dlnetwork(criticNet);
summary(criticNet)
Initialized: true
Number of learnables: 7
Inputs:
1 'obsIn' 2 features
2 'actIn' 1 features
Create the critic approximator object using criticNet, the environment observation and action
specifications, and the names of the network input layers to be connected with the environment
observation and action channels, respectively. For more information, see rlQValueFunction.
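The creation call itself is not shown in this text; a sketch consistent with the description above is:
% Create the Q-value function critic, mapping the network input layers to
% the environment observation and action channels (assumed sketch).
critic = rlQValueFunction(criticNet,obsInfo,actInfo, ...
    ObservationInputNames="obsIn", ...
    ActionInputNames="actIn");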
getValue(critic,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.3977
Define the network as an array of layer objects, and get the dimension of the observation and action
spaces from the environment specification objects. Since there is no need for a bias term, as done for
the critic, set the bias term to zero (Bias=0) and prevent it from changing
(BiasLearnRateFactor=0). For more information on actors, see “Create Policies and Value
Functions” on page 4-2.
actorNet = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(actInfo.Dimension(1), ...
BiasLearnRateFactor=0,Bias=0)
];
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Number of learnables: 3
Inputs:
1 'input' 2 features
Create the actor using actorNet and the observation and action specifications. For more
information, see rlContinuousDeterministicActor.
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);
Create the DDPG agent using the actor and critic. For more information, see rlDDPGAgent.
agent = rlDDPGAgent(actor,critic);
Specify options for the agent, including training options for the critic, using dot notation.
Alternatively, you can use rlDDPGAgentOptions, and rlOptimizerOptions objects before
creating the agent.
agent.AgentOptions.SampleTime = env.Ts;
agent.AgentOptions.ExperienceBufferLength = 1e6;
agent.AgentOptions.MiniBatchSize = 32;
agent.AgentOptions.NoiseOptions.Variance = 0.3;
agent.AgentOptions.NoiseOptions.VarianceDecayRate = 1e-7;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-4;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
The policy implemented by the actor is u = K1x1 + K2x2 = Kx, where the feedback gains K1 and K2 are
the two weights of the actor network. It can be shown that the closed loop system is stable if these
gains are negative, therefore, initializing them to negative values can speed up convergence.
Similarly, the critic implements the Q-value function
Q(x,u) = W1*x1^2 + W2*x1*x2 + W3*x2^2 + W4*x1*u + W5*x2*u + W6*u^2
where Wi are the weights of the fully connected layer. Alternatively, in matrix form, Q(x,u) = [x1 x2 u]*W*[x1; x2; u], where W is the symmetric matrix
W = [ W1    W2/2  W4/2
      W2/2  W3    W5/2
      W4/2  W5/2  W6   ]
For a fixed policy u = Kx, the cumulative long-term reward (that is, the value of the policy) becomes
V(x) = Q(x,Kx) = x'*[I K']*W*[I; K]*x = x'*P*x,   where P = [I K']*W*[I; K]
Since the rewards are always negative, to properly approximate the cumulative reward both P and W must be negative definite. Therefore, to speed up convergence, initialize the critic network weights Wi so that W is negative definite.
The logical index matrix idx, used later to map the six critic weights into the upper triangle of W, selects the upper-triangular entries.
idx = triu(true(3))
idx = 3×3 logical array
   1   1   1
   0   1   1
   0   0   1
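The code that performs this initialization is not shown in this text. One way it could be done, as an assumed sketch using getLearnableParameters and setLearnableParameters (the specific initial values are illustrative only), is:
% Assumed sketch: set the actor gains to negative values and the critic
% weights so that W = -eye(3), which is negative definite.
actor = getActor(agent);
actorParams = getLearnableParameters(actor);
actorParams{1} = -single([1 1]);            % negative feedback gains K1, K2
actor = setLearnableParameters(actor,actorParams);
agent = setActor(agent,actor);

critic = getCritic(agent);
criticParams = getLearnableParameters(critic);
criticParams{1} = -single([1 0 1 0 0 1]);   % maps to W = -eye(3) through idx
critic = setLearnableParameters(critic,criticParams);
agent = setCritic(agent,critic);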
getAction(agent,{rand(obsInfo.Dimension)})
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run at most 5000 episodes in the training session, with each episode lasting at most 200 time
steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (Verbose option).
• Stop training when the agent receives a moving average cumulative reward greater than –66. At
this point, the agent can control the position of the mass using minimal control effort.
trainOpts = rlTrainingOptions(...
MaxEpisodes=5000, ...
MaxStepsPerEpisode=200, ...
Verbose=false, ...
Plots="training-progress",...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=-66);
You can visualize the double integrator environment by using the plot function during training or
simulation.
plot(env)
Train the agent using train. Training this agent is a computationally intensive process that takes
several hours to complete. To save time while running this example, load a pretrained agent by
setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained agent for the example.
load("DoubleIntegDDPG.mat","agent");
end
To validate the performance of the trained agent, simulate it within the double integrator
environment. For more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions(MaxSteps=500);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)
totalReward = -65.9849
The function lqrd (Control System Toolbox) solves a discretized LQR problem, like the one presented
in this example. This function calculates the optimal discrete-time gain matrix Klqr, together with
the solution of the Riccati equation Plqr. When Klqr is connected via negative state feedback to the
plant input (force), the discrete-time equivalent of the cost function specified by env.Q and env.R is
minimized going forward. Furthermore, the cumulative cost from the initial time to infinity, starting from
an initial state x0, is equal to x0'*Plqr*x0.
Here, [0 1;0 0] and [0;env.Gain] are the continuous-time transition and input gain matrices,
respectively, of the double integrator system.
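If Control System Toolbox is installed, a call along the following lines computes Klqr and Plqr. This is a sketch; it assumes the environment exposes the sample time, state weight, and input weight as env.Ts, env.Q, and env.R.

A = [0 1; 0 0];                              % continuous-time transition matrix
B = [0; env.Gain];                           % continuous-time input gain matrix
[Klqr,Plqr] = lqrd(A,B,env.Q,env.R,env.Ts);  % discrete-time LQR gain and Riccati solution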
If Control System Toolbox™ is not installed, use the solution for the default example values.
Klqr = [17.8756 8.2283];
Plqr = [4.1031 0.3376; 0.3376 0.1351];
If the actor policy u = Kx successfully approximates the optimal policy, then the resulting K must be
close to −Klqr (the minus sign is due to the fact that Klqr is calculated assuming a negative feedback
connection).
If the critic learns a good approximation of the optimal value function, then the resulting P, as
defined before, must be close to −Plqr (the minus sign is due to the fact that the reward is defined as
the negative of the cost).
Extract the parameters (weights) of the actor and critic within the agent.
par = getLearnableParameters(agent);
The actor weights are the two elements of the feedback gain K:

   -15.4622   -7.2252

Note that these gains are close to those of the optimal solution -Klqr:

-Klqr

ans = 1×2

  -17.8756   -8.2283
Recreate the matrices W and P defining the Q-value and value functions, respectively. First, re-
initialize W to zero.
W = zeros(3);
W(idx) = par.Critic{1};
W = (W + W')/2
W = 3×3
P = [eye(2) K']*W*[eye(2);K]
-4.1008 -0.3772
-0.3772 -0.1633
Note that the matrix P is close to the negated solution of the Riccati equation, -Plqr.
-Plqr
ans = 2×2
-4.1031 -0.3376
-0.3376 -0.1351
Obtain an initial state from the environment.

x0 = reset(env);
The value function is the estimate of future cumulative long-term reward when using the policy
enacted by the actor. Calculate the value function at the initial state, according to the critic weights.
This is the same value displayed in the training window as Episode Q0.
q0 = x0'*P*x0
q0 = single
-65.6130
Note that the value is very close to the actual reward obtained in the validation simulation,
totalReward, suggesting that the critic learns a good approximation of the value function for the
policy enacted by the actor.
Calculate the value of the initial state, following the true optimal policy enacted by the LQR
controller.
-x0'*Plqr*x0
ans = -65.6494
This value is also very close to the value obtained in the validation simulation, confirming that the
policy learned and enacted by the actor is a good approximation of the true optimal policy.
See Also
train
More About
• “Deep Deterministic Policy Gradient (DDPG) Agents” on page 3-31
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Policies and Value Functions” on page 4-2
Train DQN Agent to Swing Up and Balance Pendulum
This example shows how to train a deep Q-learning network (DQN) agent to swing up and balance a
pendulum modeled in Simulink®.
For more information on DQN agents, see “Deep Q-Network (DQN) Agents” on page 3-23. For an
example that trains a DQN agent in MATLAB®, see “Train DQN Agent to Balance Cart-Pole System”
on page 5-50.
The reinforcement learning environment for this example is a simple frictionless pendulum that
initially hangs in a downward position. The training goal is to make the pendulum stand upright
without falling over using minimal control effort.
mdl = 'rlSimplePendulumModel';
open_system(mdl)
• The upward balanced pendulum position is 0 radians, and the downward hanging position is pi
radians.
• The torque action signal from the agent to the environment is from –2 to 2 N·m.
• The observations from the environment are the sine of the pendulum angle, the cosine of the
pendulum angle, and the pendulum angle derivative.
• The reward rt, provided at every time step, is

$$r_t = -\left(\theta_t^2 + 0.1\,\dot{\theta}_t^{\,2} + 0.001\,u_{t-1}^2\right)$$

Here, θt is the angle of displacement from the upright position, θ̇t is the derivative of the displacement angle, and ut−1 is the control effort from the previous time step.
For more information on this model, see “Load Predefined Simulink Environments” on page 2-30.
env = rlPredefinedEnv('SimplePendulumModel-Discrete')

env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on
The interface has a discrete action space where the agent can apply one of three possible torque
values to the pendulum: –2, 0, or 2 N·m.
To define the initial condition of the pendulum as hanging downward, specify an environment reset
function using an anonymous function handle. This reset function sets the model workspace variable
theta0 to pi.
env.ResetFcn = @(in)setVariable(in,'theta0',pi,'Workspace',mdl);
Get the observation and action specification information from the environment.
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "observations"
Description: [0x0 string]
Dimension: [3 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlFiniteSetSpec with properties:
Specify the simulation time Tf and the agent sample time Ts in seconds.
Ts = 0.05;
Tf = 20;
rng(0)
A DQN agent approximates the long-term reward, given observations and actions, using a value
function critic.
Since DQN has a discrete action space, it can rely on a multi-output critic approximator, which is
generally a more efficient option than relying on a comparable single-output approximator. A multi-
output approximator has only the observation as input and an output vector having as many elements
as the number of possible discrete actions. Each output element represents the expected cumulative
long-term reward following from the observation given as input, when the corresponding discrete
action is taken.
To create the critic, first create a deep neural network with an input vector of three elements (for the
sine, cosine, and derivative of the pendulum angle) and an output vector of three elements (one for each
of the –2, 0, and 2 N·m actions). For more information on creating a deep neural network value function
representation, see “Create Policies and Value Functions” on page 4-2.
dnn = [
featureInputLayer(3,'Normalization','none','Name','state')
fullyConnectedLayer(24,'Name','CriticStateFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(48,'Name','CriticStateFC2')
reluLayer('Name','CriticCommonRelu')
fullyConnectedLayer(3,'Name','output')];
dnn = dlnetwork(dnn);
figure
plot(layerGraph(dnn))
criticOpts = rlOptimizerOptions('LearnRate',0.001,'GradientThreshold',1);
Create the critic representation using the specified deep neural network and options. You must also
specify observation and action info for the critic. For more information, see
rlVectorQValueFunction.
critic = rlVectorQValueFunction(dnn,obsInfo,actInfo);
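As a quick check, you can evaluate the untrained critic on a random observation. The getValue function returns one Q-value per element of the discrete action set.

getValue(critic,{rand(obsInfo.Dimension)})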
To create the DQN agent, first specify the DQN agent options using rlDQNAgentOptions.
agentOptions = rlDQNAgentOptions(...
'SampleTime',Ts,...
'CriticOptimizerOptions',criticOpts,...
'ExperienceBufferLength',3000,...
'UseDoubleDQN',false);
Then, create the DQN agent using the specified critic representation and agent options. For more
information, see rlDQNAgent.
agent = rlDQNAgent(critic,agentOptions);
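As a quick check, similar to the double integrator example earlier in this chapter, you can evaluate the untrained agent on a random observation; the returned action is one of the three discrete torque values.

getAction(agent,{rand(obsInfo.Dimension)})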
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run each training for at most 1000 episodes, with each episode lasting at most 500 time steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option to false).
• Stop training when the agent receives an average cumulative reward greater than –1100 over five
consecutive episodes. At this point, the agent can quickly balance the pendulum in the upright
position using minimal control effort.
• Save a copy of the agent for each episode where the cumulative reward is greater than –1100.
trainingOptions = rlTrainingOptions(...
'MaxEpisodes',1000,...
'MaxStepsPerEpisode',500,...
'ScoreAveragingWindowLength',5,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-1100,...
'SaveAgentCriteria','EpisodeReward',...
'SaveAgentValue',-1100);
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainingOptions);
else
% Load the pretrained agent for the example.
load('SimulinkPendulumDQNMulti.mat','agent');
end
To validate the performance of the trained agent, simulate it within the pendulum environment. For
more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
See Also
rlDQNAgent
More About
• “Deep Q-Network (DQN) Agents” on page 3-23
• “Create Simulink Reinforcement Learning Environments” on page 2-8
Train DDPG Agent to Swing Up and Balance Pendulum
This example shows how to train a deep deterministic policy gradient (DDPG) agent to swing up and
balance a pendulum modeled in Simulink®.
For more information on DDPG agents, see “Deep Deterministic Policy Gradient (DDPG) Agents” on
page 3-31. For an example that trains a DDPG agent in MATLAB®, see “Train DDPG Agent to Control
Double Integrator System” on page 5-77.
The reinforcement learning environment for this example is a simple frictionless pendulum that
initially hangs in a downward position. The training goal is to make the pendulum stand upright
without falling over using minimal control effort.
mdl = 'rlSimplePendulumModel';
open_system(mdl)
• The upward balanced pendulum position is 0 radians, and the downward hanging position is pi
radians.
• The torque action signal from the agent to the environment is from –2 to 2 N·m.
• The observations from the environment are the sine of the pendulum angle, the cosine of the
pendulum angle, and the pendulum angle derivative.
• The reward rt, provided at every time step, is

$$r_t = -\left(\theta_t^2 + 0.1\,\dot{\theta}_t^{\,2} + 0.001\,u_{t-1}^2\right)$$

Here, θt is the angle of displacement from the upright position, θ̇t is the derivative of the displacement angle, and ut−1 is the control effort from the previous time step.
For more information on this model, see “Load Predefined Simulink Environments” on page 2-30.
env = rlPredefinedEnv('SimplePendulumModel-Continuous')
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on
The interface has a continuous action space where the agent can apply torque values between –2 and 2
N·m to the pendulum.
Set the observations of the environment to be the sine of the pendulum angle, the cosine of the
pendulum angle, and the pendulum angle derivative.
numObs = 3;
set_param('rlSimplePendulumModel/create observations','ThetaObservationHandling','sincos');
To define the initial condition of the pendulum as hanging downward, specify an environment reset
function using an anonymous function handle. This reset function sets the model workspace variable
theta0 to pi.
env.ResetFcn = @(in)setVariable(in,'theta0',pi,'Workspace',mdl);
Specify the simulation time Tf and the agent sample time Ts in seconds.
Ts = 0.05;
Tf = 20;
rng(0)
A DDPG agent approximates the long-term reward, given observations and actions, using a critic
value function representation. To create the critic, first create a deep neural network with two inputs
(the state and action) and one output. For more information on creating a deep neural network value
function representation, see “Create Policies and Value Functions” on page 4-2.
statePath = [
featureInputLayer(numObs,'Normalization','none','Name','observation')
fullyConnectedLayer(400,'Name','CriticStateFC1')
reluLayer('Name', 'CriticRelu1')
fullyConnectedLayer(300,'Name','CriticStateFC2')];
actionPath = [
featureInputLayer(1,'Normalization','none','Name','action')
fullyConnectedLayer(300,'Name','CriticActionFC1','BiasLearnRateFactor',0)];
commonPath = [
additionLayer(2,'Name','add')
reluLayer('Name','CriticCommonRelu')
fullyConnectedLayer(1,'Name','CriticOutput')];
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
criticNetwork = dlnetwork(criticNetwork);
figure
plot(layerGraph(criticNetwork))
criticOpts = rlOptimizerOptions('LearnRate',1e-03,'GradientThreshold',1);
Create the critic representation using the specified deep neural network and options. You must also
specify the action and observation info for the critic, which you obtain from the environment
interface. For more information, see rlQValueFunction.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
critic = rlQValueFunction(criticNetwork,obsInfo,actInfo,...
    'ObservationInputNames','observation','ActionInputNames','action');
A DDPG agent decides which action to take given observations using an actor representation. To
create the actor, first create a deep neural network with one input, the observation, and one output,
the action.
Construct the actor in a manner similar to the critic. For more information, see
rlContinuousDeterministicActor.
actorNetwork = [
featureInputLayer(numObs,'Normalization','none','Name','observation')
fullyConnectedLayer(400,'Name','ActorFC1')
reluLayer('Name','ActorRelu1')
fullyConnectedLayer(300,'Name','ActorFC2')
reluLayer('Name','ActorRelu2')
fullyConnectedLayer(1,'Name','ActorFC3')
tanhLayer('Name','ActorTanh')
scalingLayer('Name','ActorScaling','Scale',max(actInfo.UpperLimit))];
actorNetwork = dlnetwork(actorNetwork);
actorOpts = rlOptimizerOptions('LearnRate',1e-04,'GradientThreshold',1);
actor = rlContinuousDeterministicActor(actorNetwork,obsInfo,actInfo);
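As a quick check, you can evaluate the untrained actor on a random observation; the returned action is a torque value within the scaled output range.

getAction(actor,{rand(obsInfo.Dimension)})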
To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.
agentOpts = rlDDPGAgentOptions(...
'SampleTime',Ts,...
'CriticOptimizerOptions',criticOpts,...
'ActorOptimizerOptions',actorOpts,...
'ExperienceBufferLength',1e6,...
'DiscountFactor',0.99,...
'MiniBatchSize',128);
agentOpts.NoiseOptions.Variance = 0.6;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;
Then create the DDPG agent using the specified actor representation, critic representation, and
agent options. For more information, see rlDDPGAgent.
agent = rlDDPGAgent(actor,critic,agentOpts);
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run the training for at most 5000 episodes, with each episode lasting at most ceil(Tf/Ts) time
steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option to false).
• Stop training when the agent receives an average cumulative reward greater than –740 over five
consecutive episodes. At this point, the agent can quickly balance the pendulum in the upright
position using minimal control effort.
• Save a copy of the agent for each episode where the cumulative reward is greater than –740.
maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'ScoreAveragingWindowLength',5,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-740,...
'SaveAgentCriteria','EpisodeReward',...
'SaveAgentValue',-740);
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several hours to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained agent for the example.
load('SimulinkPendulumDDPG.mat','agent')
end
To validate the performance of the trained agent, simulate it within the pendulum environment. For
more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
See Also
rlDDPGAgent | rlSimulinkEnv | train
More About
• “Create Simulink Reinforcement Learning Environments” on page 2-8
• “Deep Deterministic Policy Gradient (DDPG) Agents” on page 3-31
Train DDPG Agent to Swing Up and Balance Cart-Pole System
This example shows how to train a deep deterministic policy gradient (DDPG) agent to swing up and
balance a cart-pole system modeled in Simscape™ Multibody™.
For more information on DDPG agents, see “Deep Deterministic Policy Gradient (DDPG) Agents” on
page 3-31. For an example showing how to train a DDPG agent in MATLAB®, see “Train DDPG Agent
to Control Double Integrator System” on page 5-77.
The reinforcement learning environment for this example is a pole attached to an unactuated joint on
a cart, which moves along a frictionless track. The training goal is to make the pole stand upright
without falling over using minimal control effort.
mdl = 'rlCartPoleSimscapeModel';
open_system(mdl)
• The upward balanced pole position is 0 radians, and the downward hanging position is pi radians.
• The force action signal from the agent to the environment is from –15 to 15 N.
• The observations from the environment are the position and velocity of the cart, and the sine,
cosine, and derivative of the pole angle.
• The episode terminates if the cart moves more than 3.5 m from the original position.
• The reward rt, provided at every time step, is

$$r_t = -0.1\left(5\theta_t^2 + x_t^2 + 0.05\,u_{t-1}^2\right) - 100B$$

Here, θt is the angle of displacement of the pole from the upright position, xt is the position of the cart, ut−1 is the control effort from the previous time step, and B is a flag (1 or 0) that indicates whether the cart is out of bounds.
For more information on this model, see “Load Predefined Simulink Environments” on page 2-30.
env = rlPredefinedEnv('CartPoleSimscapeModel-Continuous')
env =
SimulinkEnvWithAgent with properties:
Model : rlCartPoleSimscapeModel
The interface has a continuous action space where the agent can apply force values from –15 to 15 N
to the cart.
Obtain the observation and action information from the environment interface.
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
Specify the simulation time Tf and the agent sample time Ts in seconds.
Ts = 0.02;
Tf = 25;
rng(0)
A DDPG agent approximates the long-term reward, given observations and actions, using a critic
value function representation. To create the critic, first create a deep neural network with two inputs
(the state and action) and one output. The action path has a single input element, since the agent
applies the action to the environment as one force value. For more information on creating a deep
neural network value function representation, see “Create Policies and Value Functions” on page 4-2.
statePath = [
featureInputLayer(numObservations,'Normalization','none','Name','observation')
fullyConnectedLayer(128,'Name','CriticStateFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(200,'Name','CriticStateFC2')];
actionPath = [
featureInputLayer(1,'Normalization','none','Name','action')
fullyConnectedLayer(200,'Name','CriticActionFC1','BiasLearnRateFactor',0)];
commonPath = [
additionLayer(2,'Name','add')
reluLayer('Name','CriticCommonRelu')
fullyConnectedLayer(1,'Name','CriticOutput')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
criticNetwork = dlnetwork(criticNetwork);
figure
plot(layerGraph(criticNetwork))
criticOptions = rlOptimizerOptions('LearnRate',1e-03,'GradientThreshold',1);
Create the critic representation using the specified deep neural network and options. You must also
specify the action and observation information for the critic, which you already obtained from the
environment interface. For more information, see rlQValueFunction.
critic = rlQValueFunction(criticNetwork,obsInfo,actInfo,...
'ObservationInputNames','observation','ActionInputNames','action');
A DDPG agent decides which action to take, given observations, using an actor representation. To
create the actor, first create a deep neural network with one input (the observation) and one output
(the action).
Construct the actor in a similar manner to the critic. For more information, see
rlContinuousDeterministicActor.
actorNetwork = [
featureInputLayer(numObservations,'Normalization','none','Name','observation')
fullyConnectedLayer(128,'Name','ActorFC1')
reluLayer('Name','ActorRelu1')
fullyConnectedLayer(200,'Name','ActorFC2')
reluLayer('Name','ActorRelu2')
fullyConnectedLayer(1,'Name','ActorFC3')
tanhLayer('Name','ActorTanh1')
scalingLayer('Name','ActorScaling','Scale',max(actInfo.UpperLimit))];
actorNetwork = dlnetwork(actorNetwork);
actorOptions = rlOptimizerOptions('LearnRate',5e-04,'GradientThreshold',1);
actor = rlContinuousDeterministicActor(actorNetwork,obsInfo,actInfo);
To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.
agentOptions = rlDDPGAgentOptions(...
'SampleTime',Ts,...
'ActorOptimizerOptions',actorOptions,...
'CriticOptimizerOptions',criticOptions,...
'ExperienceBufferLength',1e6,...
'MiniBatchSize',128);
agentOptions.NoiseOptions.Variance = 0.4;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
Then, create the agent using the specified actor representation, critic representation and agent
options. For more information, see rlDDPGAgent.
agent = rlDDPGAgent(actor,critic,agentOptions);
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run the training for at most 2000 episodes, with each episode lasting at most ceil(Tf/Ts) time
steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option to false).
• Stop training when the agent receives an average cumulative reward greater than –400 over five
consecutive episodes. At this point, the agent can quickly balance the pole in the upright position
using minimal control effort.
• Save a copy of the agent for each episode where the cumulative reward is greater than –400.
maxepisodes = 2000;
maxsteps = ceil(Tf/Ts);
trainingOptions = rlTrainingOptions(...
'MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'ScoreAveragingWindowLength',5,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-400,...
'SaveAgentCriteria','EpisodeReward',...
'SaveAgentValue',-400);
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several hours to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainingOptions);
else
% Load the pretrained agent for the example.
load('SimscapeCartPoleDDPG.mat','agent')
end
To validate the performance of the trained agent, simulate it within the cart-pole environment. For
more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
bdclose(mdl)
See Also
rlDDPGAgent | rlSimulinkEnv | train
More About
• “Create Simulink Reinforcement Learning Environments” on page 2-8
• “Deep Deterministic Policy Gradient (DDPG) Agents” on page 3-31
Train DDPG Agent to Swing Up and Balance Pendulum with Bus Signal
This example shows how to convert a simple frictionless pendulum Simulink® model to a
reinforcement learning environment interface and train a deep deterministic policy gradient
(DDPG) agent in this environment.
For more information on DDPG agents, see “Deep Deterministic Policy Gradient (DDPG) Agents” on
page 3-31. For an example showing how to train a DDPG agent in MATLAB®, see “Train DDPG Agent
to Control Double Integrator System” on page 5-77.
The starting model for this example is a simple frictionless pendulum. The training goal is to make
the pendulum stand upright without falling over using minimal control effort.
• The upward balanced pendulum position is 0 radians, and the downward hanging position is pi
radians.
• The torque action signal from the agent to the environment is from –2 to 2 N·m.
• The observations from the environment are the sine of the pendulum angle, the cosine of the
pendulum angle, and the pendulum angle derivative.
• Both the observation and action signals are Simulink buses.
• The reward rt, provided at every time step, is

$$r_t = -\left(\theta_t^2 + 0.1\,\dot{\theta}_t^{\,2} + 0.001\,u_{t-1}^2\right)$$

Here, θt is the angle of displacement from the upright position, θ̇t is the derivative of the displacement angle, and ut−1 is the control effort from the previous time step.
The model used in this example is similar to the simple pendulum model described in “Load
Predefined Simulink Environments” on page 2-30. The difference is that the model in this example
uses Simulink buses for the action and observation signals.
The environment interface from a Simulink model is created using rlSimulinkEnv, which requires
the name of the Simulink model, the path to the agent block, and observation and action
reinforcement learning data specifications. For models that use bus signals for actions or
observations, you can create the corresponding specifications using the bus2RLSpec function.
obsBus = Simulink.Bus();
obs(1) = Simulink.BusElement;
obs(1).Name = 'sin_theta';
obs(2) = Simulink.BusElement;
obs(2).Name = 'cos_theta';
obs(3) = Simulink.BusElement;
obs(3).Name = 'dtheta';
obsBus.Elements = obs;
actBus = Simulink.Bus();
act(1) = Simulink.BusElement;
act(1).Name = 'tau';
act(1).Min = -2;
act(1).Max = 2;
actBus.Elements = act;
Create the action and observation specification objects using the Simulink buses.
obsInfo = bus2RLSpec('obsBus','Model',mdl);
actInfo = bus2RLSpec('actBus','Model',mdl);
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
To define the initial condition of the pendulum as hanging downward, specify an environment reset
function using an anonymous function handle. This reset function sets the model workspace variable
theta0 to pi.
env.ResetFcn = @(in)setVariable(in,'theta0',pi,'Workspace',mdl);
Specify the simulation time Tf and the agent sample time Ts in seconds.
Ts = 0.05;
Tf = 20;
rng(0)
A DDPG agent decides which action to take, given observations, using an actor representation. To
create the actor, first create a deep neural network with three inputs (the observations) and one
output (the action). The three observations can be combined using a concatenationLayer.
For more information on creating a deep neural network value function representation, see “Create
Policies and Value Functions” on page 4-2.
sinThetaInput = featureInputLayer(1,'Normalization','none','Name','sin_theta');
cosThetaInput = featureInputLayer(1,'Normalization','none','Name','cos_theta');
dThetaInput = featureInputLayer(1,'Normalization','none','Name','dtheta');
commonPath = [
concatenationLayer(1,3,'Name','concat')
fullyConnectedLayer(400, 'Name','ActorFC1')
reluLayer('Name','ActorRelu1')
fullyConnectedLayer(300,'Name','ActorFC2')
reluLayer('Name','ActorRelu2')
fullyConnectedLayer(1,'Name','ActorFC3')
tanhLayer('Name','ActorTanh1')
scalingLayer('Name','ActorScaling1','Scale',max(actInfo.UpperLimit))];
actorNetwork = layerGraph(sinThetaInput);
actorNetwork = addLayers(actorNetwork,cosThetaInput);
actorNetwork = addLayers(actorNetwork,dThetaInput);
actorNetwork = addLayers(actorNetwork,commonPath);
actorNetwork = connectLayers(actorNetwork,'sin_theta','concat/in1');
actorNetwork = connectLayers(actorNetwork,'cos_theta','concat/in2');
actorNetwork = connectLayers(actorNetwork,'dtheta','concat/in3');
actorNetwork = dlnetwork(actorNetwork);
figure
plot(layerGraph(actorNetwork))
actorOptions = rlOptimizerOptions('LearnRate',1e-4,'GradientThreshold',1);
Create the actor representation using the specified deep neural network and options. You must also
specify the action and observation info for the actor, which you obtained from the environment
interface. For more information, see rlContinuousDeterministicActor.
actor = rlContinuousDeterministicActor(actorNetwork,obsInfo,actInfo,...
"ObservationInputNames",["sin_theta","cos_theta","dtheta"]);
A DDPG agent approximates the long-term reward given observations and actions using a critic value
function representation. To create the critic, first create a deep neural network with two inputs, the
observation and action, and one output, the state action value.
Construct the critic in a similar manner to the actor. For more information, see rlQValueFunction.
statePath = [
concatenationLayer(1,3,'Name','concat')
fullyConnectedLayer(400,'Name','CriticStateFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(300,'Name','CriticStateFC2')];
actionPath = [
featureInputLayer(1,'Normalization','none','Name', 'action')
fullyConnectedLayer(300,'Name','CriticActionFC1','BiasLearnRateFactor', 0)];
commonPath = [
additionLayer(2,'Name','add')
reluLayer('Name','CriticCommonRelu')
fullyConnectedLayer(1,'Name','CriticOutput')];
criticNetwork = layerGraph(sinThetaInput);
criticNetwork = addLayers(criticNetwork,cosThetaInput);
criticNetwork = addLayers(criticNetwork,dThetaInput);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'sin_theta','concat/in1');
criticNetwork = connectLayers(criticNetwork,'cos_theta','concat/in2');
criticNetwork = connectLayers(criticNetwork,'dtheta','concat/in3');
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
criticOptions = rlOptimizerOptions('LearnRate',1e-03,'GradientThreshold',1);
critic = rlQValueFunction(criticNetwork,obsInfo,actInfo,...
    "ObservationInputNames",["sin_theta","cos_theta","dtheta"],"ActionInputNames","action");
To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.
agentOpts = rlDDPGAgentOptions(...
'SampleTime',Ts,...
'ActorOptimizerOptions',actorOptions,...
'CriticOptimizerOptions',criticOptions,...
'ExperienceBufferLength',1e6,...
'MiniBatchSize',128);
agentOpts.NoiseOptions.Variance = 0.6;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;
Then create the DDPG agent using the specified actor representation, critic representation, and
agent options. For more information, see rlDDPGAgent.
agent = rlDDPGAgent(actor,critic,agentOpts);
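As a quick check, you can evaluate the untrained agent on random observations, passing one cell element per observation channel (each channel is a scalar here, matching the bus elements).

getAction(agent,{rand(1),rand(1),rand(1)})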
Train Agent
To train the agent, first specify the training options. For this example, use the following options:
• Run each training for at most 5000 episodes, with each episode lasting at most ceil(Tf/Ts)
time steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option to false).
• Stop training when the agent receives an average cumulative reward greater than –740 over five
consecutive episodes. At this point, the agent can quickly balance the pendulum in the upright
position using minimal control effort.
• Save a copy of the agent for each episode where the cumulative reward is greater than –740.
maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'ScoreAveragingWindowLength',5,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-740);
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several hours to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained agent for the example.
load('SimulinkPendBusDDPG.mat','agent')
end
To validate the performance of the trained agent, simulate it within the pendulum environment. For
more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
See Also
rlDDPGAgent | rlSimulinkEnv | train | bus2RLSpec
More About
• “Create Simulink Reinforcement Learning Environments” on page 2-8
• “Deep Deterministic Policy Gradient (DDPG) Agents” on page 3-31
Train Reinforcement Learning Agents to Control Quanser QUBE Pendulum
This example trains reinforcement learning (RL) agents to swing up and control a Quanser QUBE™-
Servo 2 inverted pendulum system.
In this example, the pendulum system is modeled in Simulink® using Simscape™ Electrical™ and
Simscape Multibody™ components.
Load the parameters for this example using the loadQubeParameters helper script.
loadQubeParameters
mdl = "rlQubeServo";
open_system(mdl)
Control Structure
In this example, the RL agents generate reference trajectories, which are then passed to an inner
control loop.
Outer-Loop Components
The outer loop of the control architecture injects the pendulum angle reference signal to the inner
loop. It consists of the following reinforcement learning agents.
• Swing-up agent — A soft actor-critic (SAC) agent that computes reference angles for swinging up
the pendulum arm.
• Mode-select agent — A proximal policy optimization (PPO) agent that performs a mode switching
operation when the pendulum angle is close to the upright position (π ± π/6 radians).
The mode-switching action (0 or 1) switches the outer-loop reference signal between the trajectory
generated by the swing-up agent action and π radians.
Inner-Loop Components
The inner-loop components compute the low-level control input umotor (voltage) to stabilize the
pendulum at the upright equilibrium point where the system is linearizable. Two proportional-derivative (PD) controllers form the inner-loop control system, as shown in the following figure. For
this example, the gains Pθ, Pdθ, Pφ, and Pdφ were tuned to 0.1620, 0.0356, 40, and 2, respectively.
The swing-up agent in this example is modeled using the soft actor-critic (SAC) algorithm. For this
agent:
• The environment is the pendulum system with the low-level controller. The mode-selection signal
is always set to 1.
• The observation is the vector [sin θ, cos θ, sin φ, cos φ, θ̇, φ̇].
• The reward signal is

$$r = -\theta^2 - 0.1\,\varphi^2 - \dot{\theta}^{\,2} - 0.1\,\dot{\varphi}^{\,2} + F$$

where the bonus term F is

$$F = \begin{cases} 100 & \theta \in \pi \pm \pi/6 \text{ rad and } \varphi \in \pm\pi \text{ rad} \\ 0 & \text{otherwise} \end{cases}$$
Open the Outer Loop Control block and set the Design mode parameter to Swing up. Doing so sets
the mode-selection action to 1, which configures the Switch block to pass the swing-up reference to
the inner-loop controllers.
Alternatively, you can set this parameter using the following command.
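(In the following sketch, the mask parameter name DesignMode is a placeholder; check the actual parameter name of the Outer Loop Control block before using it.)

set_param(mdl + "/Outer Loop Control","DesignMode","Swing up");  % hypothetical parameter name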
rng(0)
swingObsInfo = rlNumericSpec([6,1]);
swingActInfo = rlNumericSpec([1,1],"LowerLimit",-1,"UpperLimit",1);
swingAgentBlk = ...
mdl + "/Outer Loop Control/RL_Swing_Up/RL_agent_swing_up";
swingEnv = rlSimulinkEnv(mdl,swingAgentBlk,swingObsInfo,swingActInfo);
The agent trains from an experience buffer of maximum capacity 1e6 by randomly selecting mini-
batches of size 128. A discount factor of 0.99, being close to 1, favors long-term reward more strongly
than a smaller value would. For a full list of SAC hyperparameters and their descriptions, see
rlSACAgent. Specify the agent hyperparameters for training.
swingAgentOpts = rlSACAgentOptions(...
"SampleTime",Ts,...
"TargetSmoothFactor",1e-3,...
"ExperienceBufferLength",1e6,...
"DiscountFactor",0.99,...
"MiniBatchSize",128);
The actor and critic neural networks of the swing-up agent are updated by the Adam (adaptive
moment estimation) optimizer with the following configuration. Specify the optimizer options.
swingAgentOpts.ActorOptimizerOptions.Algorithm = "adam";
swingAgentOpts.ActorOptimizerOptions.LearnRate = 1e-4;
swingAgentOpts.ActorOptimizerOptions.GradientThreshold = 1;
for ct = 1:2
swingAgentOpts.CriticOptimizerOptions(ct).Algorithm = "adam";
swingAgentOpts.CriticOptimizerOptions(ct).LearnRate = 1e-3;
swingAgentOpts.CriticOptimizerOptions(ct).GradientThreshold = 1;
swingAgentOpts.CriticOptimizerOptions(ct).L2RegularizationFactor = 2e-4;
end
initOptions = rlAgentInitializationOptions("NumHiddenUnit",300);
swingAgent = rlSACAgent(swingObsInfo,swingActInfo,initOptions,swingAgentOpts);
To train the agent, first specify the training options using rlTrainingOptions. For this example,
use the following options:
• Run each training for at most 1000 episodes, with each episode lasting at most floor(Tf/Ts)
time steps.
• Stop training when the agent receives an average cumulative reward greater than 7500 over 50
consecutive episodes.
swingTrainOpts = rlTrainingOptions(...
"MaxEpisodes",1000,...
"MaxStepsPerEpisode",floor(Tf/Ts),...
"ScoreAveragingWindowLength",50,...
"Plots","training-progress",...
"StopTrainingCriteria","AverageReward",...
"StopTrainingValue",7500);
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doSwingTraining to false. To train the agent yourself, set doSwingTraining to
true.
doSwingTraining = false;
if doSwingTraining
swingTrainResult = train(swingAgent,swingEnv,swingTrainOpts);
else
load("rlQubeServoAgents.mat","swingAgent");
end
The mode-select agent in this example is modeled using the proximal policy optimization (PPO)
algorithm. For this agent:
• The environment is the pendulum system with the low-level controller and the swing-up agent.
• The observation is the vector [sin θ, cos θ, sin φ, cos φ, θ̇, φ̇].
• The action is 0 or 1, which determines which reference signal is sent to the inner-loop PD controllers.
• The reward signal is

$$r = -\theta^2 + G$$

where the bonus term G is

$$G = \begin{cases} 1 & \theta \in \pi \pm \pi/6 \text{ rad} \\ 0 & \text{otherwise} \end{cases}$$
Open the Outer Loop Control block and set the Design mode parameter to Mode select.
Alternatively, you can set this parameter using the following command.
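(As before, in this sketch the mask parameter name DesignMode is a placeholder; check the actual parameter name of the Outer Loop Control block.)

set_param(mdl + "/Outer Loop Control","DesignMode","Mode select");  % hypothetical parameter name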
The mode-select agent trains by first collecting trajectories up to the experience horizon of
floor(Tf/Ts) steps. It then learns from the trajectory data using a mini-batch size of 500. The
discount factor of 0.99 favors long-term reward, and an entropy loss weight of 1e-4 facilitates
exploration during training.
Specify the hyperparameters for the agent. For more information on PPO agent options, see
rlPPOAgentOptions.
modeAgentOpts = rlPPOAgentOptions(...
"SampleTime",Ts,...
"DiscountFactor",0.99,...
"ExperienceHorizon",floor(Tf/Ts), ...
"MiniBatchSize",500, ...
"EntropyLossWeight",1e-4);
The actor and critic neural networks of the mode-select agent are updated by the Adam optimizer
with the following configuration. Specify the optimizer options.
modeAgentOpts.ActorOptimizerOptions.LearnRate = 1e-4;
modeAgentOpts.ActorOptimizerOptions.GradientThreshold = 1;
modeAgentOpts.CriticOptimizerOptions.LearnRate = 1e-4;
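For reference, a minimal sketch of creating the mode-select environment and agent follows; the agent block path, the initialization options, and the discrete action set {0, 1} are assumptions based on the description of the mode-select agent above.

modeObsInfo = rlNumericSpec([6,1]);
modeActInfo = rlFiniteSetSpec([0 1]);
modeAgentBlk = mdl + "/Outer Loop Control/RL_Mode_Select/RL_agent_mode_select";  % hypothetical block path
modeEnv = rlSimulinkEnv(mdl,modeAgentBlk,modeObsInfo,modeActInfo);
modeInitOpts = rlAgentInitializationOptions("NumHiddenUnit",300);                % assumed network size
modeAgent = rlPPOAgent(modeObsInfo,modeActInfo,modeInitOpts,modeAgentOpts);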
To train the agent, first specify the training options using rlTrainingOptions. For this example,
use the following options:
• Run each training for at most 10000 episodes, with each episode lasting at most floor(Tf/Ts)
time steps.
• Stop training when the agent receives an average cumulative reward greater than 430 over 50
consecutive episodes.
modeTrainOpts = rlTrainingOptions(...
"MaxEpisodes",10000,...
"MaxStepsPerEpisode",floor(Tf/Ts),...
"ScoreAveragingWindowLength",50,...
"Plots","training-progress",...
"StopTrainingCriteria","AverageReward",...
"StopTrainingValue",430);
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doModeTraining to false. To train the agent yourself, set doModeTraining to
true.
doModeTraining = false;
if doModeTraining
modeTrainResult = train(modeAgent,modeEnv,modeTrainOpts);
else
load("rlQubeServoAgents.mat","modeAgent");
end
Simulation
rng(0)
modeAgent.UseExplorationPolicy = false;
swingAgent.UseExplorationPolicy = false;
Ensure that the Outer Loop Control block is configured for mode-selection. Then, simulate the model.
View the performance of the agents in the Simulation Data Inspector. To open the Simulation Data
Inspector, in the Simulink model window, on the Simulation tab, in the Review Results gallery, click
Data Inspector.
In the plots:
• The measured values for θ (sensor(1)) and φ (sensor(2)) are stabilized at 0 and π radians
respectively. The pendulum is stabilized at the upright equilibrium position.
• The action_RL_select_mode signal shows the mode switching operation and the
action_RL_swing_up signal shows the swing up reference angles.
• The low-level control input is shown by the volt signal.
References
[1] Cazzolato, Benjamin Seth, and Zebb Prime. ‘On the Dynamics of the Furuta Pendulum’. Journal of
Control Science and Engineering 2011 (2011): 1–8.
Run SIL and PIL Verification for Reinforcement Learning
This example shows how to perform software-in-the-loop (SIL) and processor-in-the-loop (PIL)
verification workflows for reinforcement learning agents in Simulink®.
To run the PIL simulation in this example, you need the following hardware:

• Raspberry Pi hardware.
• Wi-Fi dongle or an Ethernet cable.
• Power source connected to a micro USB cable.
You must also download and install the following support packages using the Add-Ons Explorer.
• Simulink Support Package for Raspberry Pi Hardware. Follow the “Getting Started with Simulink
Support Package for Raspberry Pi Hardware” (Simulink Support Package for Raspberry Pi
Hardware) example to set up Raspberry Pi hardware.
• MATLAB® Coder™ Interface for Deep Learning. This will install the Intel® MKL-DNN and Arm®
Compute libraries.
Simulink Environment
The environment for this example is a Quanser QUBE™-Servo 2 pendulum swing-up model. The
swing-up and balancing actions are performed by a combination of proportional-derivative (PD)
controllers and reinforcement learning (RL) agents. In this example, you will simulate the controllers
in software-in-the-loop (SIL) and processor-in-the-loop (PIL) verification modes and compare the
results with normal simulation. For more information, see “SIL and PIL Simulations” (Embedded
Coder).
loadQubeParameters
The top level model consists of the controller model reference and the pendulum environment. Open
the model.
mdl = "rlQubeServo_SIL_PIL";
open_system(mdl)
open_system("Controller")
The overall control system consists of two RL Agents in the outer loop computing high level reference
angles. The reference angles are sent to a low-level controller that stabilizes the pendulum system by
computing the motor voltage. For more information on the controller design and training, see “Train
Reinforcement Learning Agents to Control Quanser QUBE Pendulum” on page 5-116.
Policy Evaluation
To generate code and deploy reinforcement learning policies to hardware, you can use one of the
following methods.
1 Evaluate the policy using the Deep Learning Toolbox Predict block.
2 Evaluate the policy by generating MATLAB® code using generatePolicyFunction. This
option is used in this example.
load("rlQubeServoSILPILAgents.mat","swingAgent","modeAgent");
Open the Outer Loop Control subsystem to choose the model evaluation method and select RL from
the dropdown menu. Doing so activates policy execution using generated MATLAB code.
Alternatively, you can set this parameter using the following command.
set_param("Controller/Outer Loop Control","VChoice","RL");
Software-in-the-Loop Simulation
You can analyze code generation performance using software-in-the-loop (SIL) simulation. A SIL
simulation generates and builds code on your development computer and then simulates the system
using the generated code. You can then compare the results with the ones obtained from a Normal
mode simulation.
The following steps show how to configure code generation settings in Simulink for SIL simulation.
You can skip these steps and use preconfigured settings with the following command, which sets the
appropriate configuration reference for SIL simulation.
setActiveConfigSet("Controller","configSILReferenceRL");
Optionally, you can view the generated code for the controller from the C-code perspective.
• In the Simulink model window, on the Apps tab, in the gallery, click Embedded Coder.
• To generate code and display it in the Code panel, on the C Code tab, click Build.
• Ensure that there are no errors in this process. You can reconfigure the code generation settings
to optimize the generated code.
Open the top-level model rlQubeServo_SIL_PIL.slx and specify the simulation mode for the
Controller model reference.
To configure the simulation mode of the Controller model reference, right-click the Controller
subsystem and select Block Parameters. Then, in the Block Parameters dialog box, set the
Simulation mode parameter to Software-in-the-loop (SIL).
To run the SIL simulation, on the Apps tab, click SIL/PIL Manager.
On the SIL/PIL tab, in the System Under Test drop-down menu, select Model blocks in
SIL/PIL mode. Then, in the Top Model Mode, select Normal.
To simulate the model, generate and run the code in a SIL simulation, and compare the results, click
Run Verification. The results are shown in the Simulink Data Inspector.
A comparison of controller output values is shown between Normal and SIL simulations. The error
tolerances are acceptable for this example.
Processor-in-the-Loop Simulation
In a processor-in-the-loop (PIL) simulation, you can generate code for the target hardware (in this
case the Raspberry Pi), and deploy and run the code from the hardware. The results of the PIL
simulation are transferred to Simulink to verify the numerical equivalence of the simulation and the
code generation results. The PIL verification process is an important part of the design cycle to
ensure that the behavior of the deployment code matches the design.
Follow the configuration steps from the SIL simulation workflow. You can alternatively use
preconfigured settings with the following command.
setActiveConfigSet("Controller","configPILReference");
In addition to these settings, in the Configuration Parameters dialog box, in the Hardware
Implementation section, set the Hardware board parameter to Raspberry Pi and enter the
board parameters. Set the Device Address, Username, and Password parameters to appropriate
values.
To configure the top-level model for PIL simulation, first open the top-level model
rlQubeServo_SIL_PIL.slx.
Then, right-click the Controller subsystem and select Block Parameters. In the Block Parameters
dialog box, set the Simulation mode parameter to Processor-in-the-loop (PIL).
To run the PIL simulation, on the Apps tab, click SIL/PIL Manager.
On the SIL/PIL tab, in the System Under Test drop-down menu, select Model blocks in
SIL/PIL mode. Then, in the Top Model Mode, select Normal.
To simulate the model, generate and run the code on the hardware, and compare the results, click
Run Verification. The results are shown in the Simulink Data Inspector.
A comparison of controller output values is shown between Normal and PIL simulations. The error
tolerances are acceptable for this example.
Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation
This example shows how to train a deep deterministic policy gradient (DDPG) agent to swing up and
balance a pendulum with an image observation modeled in MATLAB®.
For more information on DDPG agents, see “Deep Deterministic Policy Gradient (DDPG) Agents” on
page 3-31.
The reinforcement learning environment for this example is a simple frictionless pendulum that
initially hangs in a downward position. The training goal is to make the pendulum stand upright
without falling over using minimal control effort.
• The upward balanced pendulum position is 0 radians, and the downward hanging position is pi
radians.
• The torque action signal from the agent to the environment is from –2 to 2 N·m.
• The observations from the environment are an image indicating the location of the pendulum mass
and the pendulum angular velocity.
• The reward rt, provided at every time step, is

$$r_t = -\left(\theta_t^2 + 0.1\,\dot{\theta}_t^{\,2} + 0.001\,u_{t-1}^2\right)$$

Here, θt is the angle of displacement from the upright position, θ̇t is the derivative of the displacement angle, and ut−1 is the control effort from the previous time step.
For more information on this model, see “Load Predefined Control System Environments” on page 2-
23.
env = rlPredefinedEnv('SimplePendulumWithImage-Continuous')

env =
SimplePendlumWithImageContinuousAction with properties:
Mass: 1
RodLength: 1
RodInertia: 0
Gravity: 9.8100
DampingRatio: 0
MaximumTorque: 2
Ts: 0.0500
The interface has a continuous action space where the agent can apply a torque between –2 and 2 N·m to the pendulum.
Obtain the observation and action specification from the environment interface.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
rng(0)
A DDPG agent approximates the long-term reward, given observations and actions, using a critic
value function representation. To create the critic, first create a deep convolutional neural network
(CNN) with three inputs (the image, angular velocity, and action) and one output. For more
information on creating representations, see “Create Policies and Value Functions” on page 4-2.
hiddenLayerSize1 = 400;
hiddenLayerSize2 = 300;
imgPath = [
imageInputLayer(obsInfo(1).Dimension,'Normalization','none','Name',obsInfo(1).Name)
convolution2dLayer(10,2,'Name','conv1','Stride',5,'Padding',0)
reluLayer('Name','relu1')
fullyConnectedLayer(2,'Name','fc1')
concatenationLayer(3,2,'Name','cat1')
fullyConnectedLayer(hiddenLayerSize1,'Name','fc2')
reluLayer('Name','relu2')
fullyConnectedLayer(hiddenLayerSize2,'Name','fc3')
additionLayer(2,'Name','add')
reluLayer('Name','relu3')
fullyConnectedLayer(1,'Name','fc4')
];
dthetaPath = [
imageInputLayer(obsInfo(2).Dimension,'Normalization','none','Name',obsInfo(2).Name)
fullyConnectedLayer(1,'Name','fc5','BiasLearnRateFactor',0,'Bias',0)
];
actPath =[
imageInputLayer(actInfo(1).Dimension,'Normalization','none','Name','action')
fullyConnectedLayer(hiddenLayerSize2,'Name','fc6','BiasLearnRateFactor',0,'Bias',zeros(hidden
];
criticNetwork = layerGraph(imgPath);
criticNetwork = addLayers(criticNetwork,dthetaPath);
criticNetwork = addLayers(criticNetwork,actPath);
criticNetwork = connectLayers(criticNetwork,'fc5','cat1/in2');
criticNetwork = connectLayers(criticNetwork,'fc6','add/in2');
figure
plot(criticNetwork)
criticOptions = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);
Uncomment the following line to use the GPU to accelerate training of the critic CNN. For more
information on supported GPUs, see “GPU Computing Requirements” (Parallel Computing Toolbox).
% criticOptions.UseDevice = 'gpu';
Create the critic representation using the specified neural network and options. You must also specify
the action and observation info for the critic, which you obtain from the environment interface. For
more information, see rlQValueRepresentation.
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
'Observation',{'pendImage','angularRate'},'Action',{'action'},criticOptions);
A DDPG agent decides which action to take given observations using an actor representation. To
create the actor, first create a deep convolutional neural network (CNN) with two inputs (the image
and angular velocity) and one output (the action).
imgPath = [
imageInputLayer(obsInfo(1).Dimension,'Normalization','none','Name',obsInfo(1).Name)
convolution2dLayer(10,2,'Name','conv1','Stride',5,'Padding',0)
reluLayer('Name','relu1')
fullyConnectedLayer(2,'Name','fc1')
concatenationLayer(3,2,'Name','cat1')
fullyConnectedLayer(hiddenLayerSize1,'Name','fc2')
reluLayer('Name','relu2')
fullyConnectedLayer(hiddenLayerSize2,'Name','fc3')
reluLayer('Name','relu3')
fullyConnectedLayer(1,'Name','fc4')
tanhLayer('Name','tanh1')
scalingLayer('Name','scale1','Scale',max(actInfo.UpperLimit))
];
dthetaPath = [
imageInputLayer(obsInfo(2).Dimension,'Normalization','none','Name',obsInfo(2).Name)
fullyConnectedLayer(1,'Name','fc5','BiasLearnRateFactor',0,'Bias',0)
];
actorNetwork = layerGraph(imgPath);
actorNetwork = addLayers(actorNetwork,dthetaPath);
actorNetwork = connectLayers(actorNetwork,'fc5','cat1/in2');
actorOptions = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);
Uncomment the following line to use the GPU to accelerate training of the actor CNN.
% actorOptions.UseDevice = 'gpu';
Create the actor representation using the specified neural network and options. For more
information, see rlDeterministicActorRepresentation.
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'pendImage','angularRate'},'Action',{'scale1'},actorOptions);
figure
plot(actorNetwork)
To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.
agentOptions = rlDDPGAgentOptions(...
'SampleTime',env.Ts,...
'TargetSmoothFactor',1e-3,...
'ExperienceBufferLength',1e6,...
'DiscountFactor',0.99,...
'MiniBatchSize',128);
agentOptions.NoiseOptions.Variance = 0.6;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-6;
Then create the agent using the specified actor representation, critic representation, and agent
options. For more information, see rlDDPGAgent.
agent = rlDDPGAgent(actor,critic,agentOptions);
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run each training for at most 5000 episodes, with each episode lasting at most 400 time steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option).
• Stop training when the agent receives a moving average cumulative reward greater than -740
over ten consecutive episodes. At this point, the agent can quickly balance the pendulum in the
upright position using minimal control effort.
maxepisodes = 5000;
maxsteps = 400;
trainingOptions = rlTrainingOptions(...
'MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-740);
You can visualize the pendulum by using the plot function during training or simulation.
plot(env)
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several hours to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainingOptions);
else
% Load pretrained agent for the example.
load('SimplePendulumWithImageDDPG.mat','agent')
end
To validate the performance of the trained agent, simulate it within the pendulum environment. For
more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
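As in the other pendulum examples, you can check the total reward accumulated over the simulation. For instance:
totalReward = sum(experience.Reward)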
See Also
train
More About
• “Deep Deterministic Policy Gradient (DDPG) Agents” on page 3-31
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Policies and Value Functions” on page 4-2
Create Agent Using Deep Network Designer and Train Using Image Observations
This example shows how to create a deep Q-learning network (DQN) agent that can swing up and
balance a pendulum modeled in MATLAB®. In this example, you create the DQN agent using Deep
Network Designer. For more information on DQN agents, see “Deep Q-Network (DQN) Agents” on
page 3-23.
The reinforcement learning environment for this example is a simple frictionless pendulum that
initially hangs in a downward position. The training goal is to make the pendulum stand upright
without falling over using minimal control effort.
• The upward balanced pendulum position is 0 radians, and the downward hanging position is pi
radians.
• The torque action signal from the agent to the environment is from –2 to 2 N·m.
• The observations from the environment are the simplified grayscale image of the pendulum and
the pendulum angle derivative.
The reward rt, provided at every time step, is
rt = −(θt² + 0.1 θ̇t² + 0.001 ut−1²)
Here, θt is the angle of displacement from the upright position, θ̇t is the derivative of the displacement angle, and ut−1 is the control effort from the previous time step.
For more information on this model, see “Train DDPG Agent to Swing Up and Balance Pendulum with
Image Observation” on page 5-130.
env = rlPredefinedEnv('SimplePendulumWithImage-Discrete');
The interface has two observations. The first observation, named "pendImage", is a 50-by-50
grayscale image.
obsInfo = getObservationInfo(env);
obsInfo(1)
ans =
rlNumericSpec with properties:
LowerLimit: 0
UpperLimit: 1
Name: "pendImage"
Description: [0x0 string]
Dimension: [50 50]
DataType: "double"
The second observation, named "angularRate", is the angular velocity of the pendulum.
obsInfo(2)
ans =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "angularRate"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
The interface has a discrete action space where the agent can apply one of five possible torque values
to the pendulum: –2, –1, 0, 1, or 2 N·m.
actInfo = getActionInfo(env)
actInfo =
rlFiniteSetSpec with properties:
Elements: [-2 -1 0 1 2]
Name: "torque"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
Fix the random generator seed for reproducibility.
rng(0)
A DQN agent approximates the long-term reward, given observations and actions, using a critic value
function representation. For this environment, the critic is a deep neural network with three inputs
(two observations and one action), and one output. For more information on creating a deep neural
network value function representation, see “Create Policies and Value Functions” on page 4-2.
You can construct the critic network interactively by using the Deep Network Designer app. To do so,
you first create separate input paths for each observation and action. These paths learn lower-level
features from their respective inputs. You then create a common output path that combines the
outputs from the input paths.
To create the image observation path, first drag an imageInputLayer from the Layer Library pane
to the canvas. Set the layer InputSize to 50,50,1 for the image observation, and set Normalization
to none.
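In code, this corresponds to the image input layer that appears in the generated network later in this example:
imageInputLayer([50 50 1],'Normalization','none','Name','pendImage')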
Second, drag a convolution2DLayer to the canvas and connect the input of this layer to the output
of the imageInputLayer. Create a convolution layer with 2 filters (NumFilters property) that have a
height and width of 10 (FilterSize property), and use a stride of 5 in the horizontal and vertical
directions (Stride property).
Finally, complete the image path network with two sets of reluLayer and fullyConnectedLayer layers. The output sizes of the first and second fullyConnectedLayer layers are 400 and 300, respectively.
Construct the other input paths and output path in a similar manner. For this example, use the
following options.
Output path:
• additionLayer — Connect the output of all input paths to the input of this layer.
• reluLayer
• fullyConnectedLayer — Set OutputSize to 1 for the scalar value function.
To export the network to the MATLAB workspace, in Deep Network Designer, click Export. Deep
Network Designer exports the network as a new variable containing the network layers. You can
create the critic representation using this layer network variable.
Alternatively, to generate equivalent MATLAB code for the network, click Export > Generate Code.
lgraph = layerGraph();
tempLayers = [
imageInputLayer([1 1 1],"Name","angularRate","Normalization","none")
fullyConnectedLayer(400,"Name","dtheta_fc1")
reluLayer("Name","dtheta_relu1")
fullyConnectedLayer(300,"Name","dtheta_fc2")];
lgraph = addLayers(lgraph,tempLayers);
tempLayers = [
imageInputLayer([1 1 1],"Name","torque","Normalization","none")
fullyConnectedLayer(300,"Name","torque_fc1")];
lgraph = addLayers(lgraph,tempLayers);
tempLayers = [
imageInputLayer([50 50 1],"Name","pendImage","Normalization","none")
convolution2dLayer([10 10],2,"Name","img_conv1","Padding","same","Stride",[5 5])
reluLayer("Name","relu_1")
fullyConnectedLayer(400,"Name","critic_theta_fc1")
reluLayer("Name","theta_relu1")
fullyConnectedLayer(300,"Name","critic_theta_fc2")];
lgraph = addLayers(lgraph,tempLayers);
tempLayers = [
additionLayer(3,"Name","addition")
reluLayer("Name","relu_2")
fullyConnectedLayer(1,"Name","stateValue")];
lgraph = addLayers(lgraph,tempLayers);
lgraph = connectLayers(lgraph,"torque_fc1","addition/in3");
lgraph = connectLayers(lgraph,"critic_theta_fc2","addition/in1");
lgraph = connectLayers(lgraph,"dtheta_fc2","addition/in2");
figure
plot(lgraph)
criticOpts = rlOptimizerOptions('LearnRate',1e-03,'GradientThreshold',1);
Create the critic representation using the specified deep neural network lgraph and options. You
must also specify the action and observation info for the critic, which you obtain from the
environment interface. For more information, see rlQValueFunction.
net = dlnetwork(lgraph);
critic = rlQValueFunction(net,obsInfo,actInfo,...
"ObservationInputNames",["pendImage","angularRate"],"ActionInputNames","torque");
To create the DQN agent, first specify the DQN agent options using rlDQNAgentOptions.
agentOpts = rlDQNAgentOptions(...
'UseDoubleDQN',false,...
'CriticOptimizerOptions',criticOpts,...
'ExperienceBufferLength',1e6,...
'SampleTime',env.Ts);
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-5;
Then, create the DQN agent using the specified critic representation and agent options. For more
information, see rlDQNAgent.
agent = rlDQNAgent(critic,agentOpts);
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run each training for at most 5000 episodes, with each episode lasting at most 500 time steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option to false).
• Stop training when the agent receives an average cumulative reward greater than –1000 over the
default window length of five consecutive episodes. At this point, the agent can quickly balance
the pendulum in the upright position using minimal control effort.
trainOpts = rlTrainingOptions(...
'MaxEpisodes',5000,...
'MaxStepsPerEpisode',500,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-1000);
You can visualize the pendulum system during training or simulation by using the plot function.
plot(env)
Train the agent using the train function. This is a computationally intensive process that takes
several hours to complete. To save time while running this example, load a pretrained agent by
setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load pretrained agent for the example.
load('MATLABPendImageDQN.mat','agent');
end
To validate the performance of the trained agent, simulate it within the pendulum environment. For
more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)
totalReward = -888.9802
See Also
Deep Network Designer | rlDQNAgent
More About
• “Train DQN Agent to Swing Up and Balance Pendulum” on page 5-88
Train AC Agent to Balance Cart-Pole System Using Parallel Computing
This example shows how to train an actor-critic (AC) agent to balance a cart-pole system modeled in
MATLAB® by using asynchronous parallel training. For an example that shows how to train the agent
without using parallel training, see “Train AC Agent to Balance Cart-Pole System” on page 5-63.
When you use parallel computing with AC agents, each worker generates experiences from its copy of
the agent and the environment. After every N steps, the worker computes gradients from the
experiences and sends the computed gradients back to the client agent (the agent associated with the
MATLAB® process which starts the training). The client agent updates its parameters as follows.
• For asynchronous training, the client agent applies the received gradients without waiting for all
workers to send gradients, and sends the updated parameters back to the worker that provided
the gradients. Then, the worker continues to generate experiences from its environment using the
updated parameters.
• For synchronous training, the client agent waits to receive gradients from all of the workers and
updates its parameters using these gradients. The client then sends updated parameters to all the
workers at the same time. Then, all workers continue to generate experiences using the updated
parameters.
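Both strategies are selected through the training options, as shown later in this example. A minimal sketch:
% Enable parallel training and select the gradient-update strategy.
opts = rlTrainingOptions;
opts.UseParallel = true;
opts.ParallelizationOptions.Mode = "async";   % use "sync" for synchronous updates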
For more information about synchronous versus asynchronous parallelization, see “Train Agents
Using Parallel Computing and GPUs” on page 5-8.
Create a predefined environment interface for the cart-pole system. For more information on this
environment, see “Load Predefined Control System Environments” on page 2-23.
env = rlPredefinedEnv("CartPole-Discrete");
env.PenaltyForFalling = -10;
Obtain the observation and action information from the environment interface.
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
Create AC Agent
An AC agent approximates the long-term reward, given observations and actions, using a critic value
function representation. To create the critic, first create a deep neural network with one input (the
observation) and one output (the state value). The input size of the critic network is 4 since the
environment provides 4 observations. For more information on creating a deep neural network value
function representation, see “Create Policies and Value Functions” on page 4-2.
criticNetwork = [
featureInputLayer(4,'Normalization','none','Name','state')
fullyConnectedLayer(32,'Name','CriticStateFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(1, 'Name', 'CriticFC')];
criticOpts = rlOptimizerOptions('LearnRate',1e-2,'GradientThreshold',1);
critic = rlValueFunction(criticNetwork,obsInfo);
An AC agent decides which action to take, given observations, using an actor representation. To
create the actor, create a deep neural network with one input (the observation) and one output (the
action). The output size of the actor network is 2 since the agent can apply 2 force values to the
environment, –10 and 10.
actorNetwork = [
featureInputLayer(4,'Normalization','none','Name','state')
fullyConnectedLayer(32, 'Name','ActorStateFC1')
reluLayer('Name','ActorRelu1')
fullyConnectedLayer(2,'Name','action')];
actorOpts = rlOptimizerOptions('LearnRate',1e-2,'GradientThreshold',1);
actor = rlDiscreteCategoricalActor(actorNetwork,obsInfo,actInfo);
To create the AC agent, first specify the AC agent options using rlACAgentOptions.
agentOpts = rlACAgentOptions(...
'ActorOptimizerOptions',actorOpts,...
'CriticOptimizerOptions',criticOpts,...
'EntropyLossWeight',0.01);
Then create the agent using the specified actor representation and agent options. For more
information, see rlACAgent.
agent = rlACAgent(actor,critic,agentOpts);
To train the agent, first specify the training options. For this example, use the following options.
• Run each training for at most 1000 episodes, with each episode lasting at most 500 time steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option).
• Stop training when the agent receives an average cumulative reward greater than 500 over 10
consecutive episodes. At this point, the agent can balance the pendulum in the upright position.
trainOpts = rlTrainingOptions(...
'MaxEpisodes',1000,...
'MaxStepsPerEpisode', 500,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',500,...
'ScoreAveragingWindowLength',10);
You can visualize the cart-pole system during training or simulation using the plot function.
plot(env)
To train the agent using parallel computing, specify the following training options.
trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.Mode = "async";
Train Agent
Train the agent using the train function. Training the agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true. Due to
randomness in the asynchronous parallel training, you can expect different training results from the
following training plot. The plot shows the result of training with six workers.
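Training with six workers assumes a parallel pool of that size is available. If Parallel Computing Toolbox is installed, you can open one explicitly before calling train; otherwise a pool based on the default profile is typically started automatically.
% Open a six-worker parallel pool (optional).
parpool(6);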
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained agent for the example.
load('MATLABCartpoleParAC.mat','agent');
end
Simulate AC Agent
You can visualize the cart-pole system with the plot function during simulation.
plot(env)
To validate the performance of the trained agent, simulate it within the cart-pole environment. For
more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)
totalReward = 500
References
[1] Mnih, Volodymyr, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim
Harley, David Silver, and Koray Kavukcuoglu. ‘Asynchronous Methods for Deep Reinforcement
Learning’. ArXiv:1602.01783 [Cs], 16 June 2016. https://arxiv.org/abs/1602.01783.
See Also
train | rlTrainingOptions
Related Examples
• “Train AC Agent to Balance Cart-Pole System” on page 5-63
• “Train DQN Agent for Lane Keeping Assist Using Parallel Computing” on page 5-227
• “Train Biped Robot to Walk Using Reinforcement Learning Agents” on page 5-235
More About
• “Train Reinforcement Learning Agents” on page 5-3
• “Train Agents Using Parallel Computing and GPUs” on page 5-8
Train DDPG Agent to Control Flying Robot
This example shows how to train a deep deterministic policy gradient (DDPG) agent to generate
trajectories for a flying robot modeled in Simulink®. For more information on DDPG agents, see
“Deep Deterministic Policy Gradient (DDPG) Agents” on page 3-31.
The reinforcement learning environment for this example is a flying robot with its initial condition
randomized around a ring of radius 15 m. The orientation of the robot is also randomized. The robot
has two thrusters mounted on the side of the body that are used to propel and steer the robot. The
training goal is to drive the robot from its initial condition to the origin facing east.
mdl = 'rlFlyingRobotEnv';
open_system(mdl)
theta0 = 0;
x0 = -15;
y0 = 0;
Ts = 0.4;
Tf = 30;
The reward rt provided at every time step is
rt = r1 + r2 + r3
where r1 rewards the robot for approaching the goal, and
r2 = −100 (|xt| ≥ 20 or |yt| ≥ 20)
r3 = −(0.2 (Rt−1 + Lt−1)² + 0.3 (Rt−1 − Lt−1)² + 0.03 xt² + 0.03 yt² + 0.02 θt²)
Here, xt and yt are the positions of the robot along the x and y axes, θt is its orientation, and Lt−1 and Rt−1 are the thrust values from the previous time step. The term r2 penalizes the robot for leaving the workspace bounds, and r3 penalizes control effort and deviation from the goal position and orientation.
To train an agent for the FlyingRobotEnv model, use the createIntegratedEnv function to
automatically generate an integrated model with the RL Agent block that is ready for training.
integratedMdl = 'IntegratedFlyingRobot';
[~,agentBlk,observationInfo,actionInfo] = createIntegratedEnv(mdl,integratedMdl);
Before creating the environment object, specify names for the observation and action specifications,
and bound the thrust actions between -1 and 1.
The observation signals for this environment are observation = [x y ẋ ẏ sin θ cos θ θ̇]ᵀ.
numObs = prod(observationInfo.Dimension);
observationInfo.Name = 'observations';
The action signals for this environment are action = [TR TL]ᵀ.
numAct = prod(actionInfo.Dimension);
actionInfo.LowerLimit = -ones(numAct,1);
actionInfo.UpperLimit = ones(numAct,1);
actionInfo.Name = 'thrusts';
Create an environment interface for the flying robot using the integrated model.
env = rlSimulinkEnv(integratedMdl,agentBlk,observationInfo,actionInfo);
Reset Function
Create a custom reset function that randomizes the initial position of the robot along a ring of radius
15 m and the initial orientation. For details on the reset function, see flyingRobotResetFcn.
env.ResetFcn = @(in) flyingRobotResetFcn(in);
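The details of flyingRobotResetFcn are in the function file itself. A minimal sketch of this kind of reset function, assuming the initial pose variables x0, y0, and theta0 are overridden through the Simulink.SimulationInput object (the shipped function may differ in detail):
function in = exampleFlyingRobotResetFcn(in)
    % Hypothetical reset function: random point on a ring of radius 15 m
    % and a random orientation.
    t = 2*pi*rand;
    in = setVariable(in,'x0',-15*cos(t));
    in = setVariable(in,'y0',-15*sin(t));
    in = setVariable(in,'theta0',pi*(2*rand-1));
end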
A DDPG agent approximates the long-term reward given observations and actions by using a critic
value function representation. To create the critic, first create a deep neural network with two inputs
(the observation and action) and one output. For more information on creating a neural network value
function representation, see “Create Policies and Value Functions” on page 4-2.
observationPath = [
featureInputLayer(numObs,'Normalization','none','Name','observation')
fullyConnectedLayer(hiddenLayerSize,'Name','fc1')
reluLayer('Name','relu1')
fullyConnectedLayer(hiddenLayerSize,'Name','fc2')
additionLayer(2,'Name','add')
reluLayer('Name','relu2')
fullyConnectedLayer(hiddenLayerSize,'Name','fc3')
reluLayer('Name','relu3')
fullyConnectedLayer(1,'Name','fc4')];
actionPath = [
featureInputLayer(numAct,'Normalization','none','Name','action')
fullyConnectedLayer(hiddenLayerSize,'Name','fc5')];

% Assemble the critic network by connecting the action path to the
% addition layer of the observation path.
criticNetwork = layerGraph(observationPath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = connectLayers(criticNetwork,'fc5','add/in2');
criticNetwork = dlnetwork(criticNetwork);
Create the critic representation using the specified neural network and options. You must also specify
the action and observation specification for the critic. For more information, see
rlQValueFunction.
critic = rlQValueFunction(criticNetwork,observationInfo,actionInfo,...
'ObservationInputNames','observation','ActionInputNames','action');
A DDPG agent decides which action to take given observations by using an actor representation. To
create the actor, first create a deep neural network with one input (the observation) and one output
(the action).
Construct the actor similarly to the critic. For more information, see
rlContinuousDeterministicActor.
actorNetwork = [
featureInputLayer(numObs,'Normalization','none','Name','observation')
fullyConnectedLayer(hiddenLayerSize,'Name','fc1')
reluLayer('Name','relu1')
fullyConnectedLayer(hiddenLayerSize,'Name','fc2')
reluLayer('Name','relu2')
fullyConnectedLayer(hiddenLayerSize,'Name','fc3')
reluLayer('Name','relu3')
fullyConnectedLayer(numAct,'Name','fc4')
tanhLayer('Name','tanh1')];
actorNetwork = dlnetwork(actorNetwork);
actorOptions = rlOptimizerOptions('LearnRate',1e-04,'GradientThreshold',1);
actor = rlContinuousDeterministicActor(actorNetwork,observationInfo,actionInfo);
To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.
agentOptions = rlDDPGAgentOptions(...
'SampleTime',Ts,...
'ActorOptimizerOptions',actorOptions,...
'CriticOptimizerOptions',criticOptions,...
'ExperienceBufferLength',1e6 ,...
'MiniBatchSize',256);
agentOptions.NoiseOptions.Variance = 1e-1;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-6;
Then, create the agent using the specified actor representation, critic representation, and agent
options. For more information, see rlDDPGAgent.
agent = rlDDPGAgent(actor,critic,agentOptions);
Train Agent
To train the agent, first specify the training options. For this example, use the following options:
• Run each training for at most 20000 episodes, with each episode lasting at most ceil(Tf/Ts)
time steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option to false).
• Stop training when the agent receives an average cumulative reward greater than 415 over 10
consecutive episodes. At this point, the agent can drive the flying robot to the goal position.
• Save a copy of the agent for each episode where the cumulative reward is greater than 415.
maxepisodes = 20000;
maxsteps = ceil(Tf/Ts);
trainingOptions = rlTrainingOptions(...
'MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'StopOnError',"on",...
'Verbose',false,...
'Plots',"training-progress",...
'StopTrainingCriteria',"AverageReward",...
'StopTrainingValue',415,...
'ScoreAveragingWindowLength',10,...
'SaveAgentCriteria',"EpisodeReward",...
'SaveAgentValue',415);
Train the agent using the train function. Training is a computationally intensive process that takes
several hours to complete. To save time while running this example, load a pretrained agent by
setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainingOptions);
else
% Load the pretrained agent for the example.
end
To validate the performance of the trained agent, simulate the agent within the environment. For
more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',maxsteps);
experience = sim(env,agent,simOptions);
See Also
train | rlDDPGAgent
More About
• “Train Reinforcement Learning Agents” on page 5-3
Train PPO Agent to Land Rocket
This example shows how to train a proximal policy optimization (PPO) agent with a discrete action
space to land a rocket on the ground. For more information on PPO agents, see “Proximal Policy
Optimization Agents” on page 3-44.
Environment
The environment in this example is a 3-DOF rocket represented by a circular disc with mass. The
rocket has two thrusters for forward and rotational motion. Gravity acts vertically downwards, and
there are no aerodynamic drag forces. The training goal is to make the rocket land on the ground at a specified location.
• Motion of the rocket is bounded in X (horizontal axis) from -100 to 100 meters and Y (vertical axis)
from 0 to 120 meters.
• The goal position is at (0,0) meters and the goal orientation is 0 radians.
• The maximum thrust applied by each thruster is 8.5 N.
• The sample time is 0.1 seconds.
• The observations from the environment are the rocket position (x, y), orientation θ, velocity (ẋ, ẏ), angular velocity θ̇, and a sensor reading that detects rough landing (-1), soft landing (1), or airborne (0) conditions. The observations are normalized between -1 and 1.
• At the beginning of every episode, the rocket starts from a random initial x position and
orientation. The altitude is always reset to 100 meters.
• The reward rt provided at time step t is as follows.
rt = (st − st−1) − 0.1 θt² − 0.01 (Lt² + Rt²) + 500 c
st = 1 − dt + vt/2
c = (yt ≤ 0) && (ẏt ≥ −0.5) && (|ẋt| ≤ 0.5)
Here:
• xt, yt, ẋt, and ẏt are the positions and velocities of the rocket along the x and y axes.
• dt = √(xt² + yt²)/dmax is the normalized distance of the rocket from the goal position.
• vt = √(ẋt² + ẏt²)/vmax is the normalized speed of the rocket.
• dmax and vmax are the maximum distances and speeds.
• θt is the orientation with respect to the vertical axis.
• Lt and Rt are the action values for the left and right thrusters.
• c is a sparse reward for soft-landing with horizontal and vertical velocities less than 0.5 m/s.
Create a MATLAB environment for the rocket lander using the RocketLander class.
env = RocketLander;
actionInfo = getActionInfo(env);
observationInfo = getObservationInfo(env);
numObs = observationInfo.Dimension(1);
numAct = numel(actionInfo.Elements);
Ts = 0.1;
rng(0)
The PPO agent in this example operates on a discrete action space. At every time step, the agent
selects one of the following discrete action pairs.
L, L − do nothing
L, M − fire right med
L, H − fire right high
M, L − fire left med
M, M − fire left med + right med
M, H − fire left med + right high
H, L − fire left high
H, M − fire left high + right med
H, H − fire left high + right high
To estimate the policy and value function, the agent maintains function approximators for the actor
and critic, which are modeled using deep neural networks. The training can be sensitive to the initial
network weights and biases, and results can vary with different sets of values. The network weights
are randomly initialized to small values in this example.
Create the critic deep neural network with six inputs and one output. The output of the critic network
is the discounted long-term reward for the input observations.
criticNetwork = [
featureInputLayer(numObs,'Normalization','none','Name','observation')
fullyConnectedLayer(criticLayerSizes(1),'Name','CriticFC1', ...
'Weights',sqrt(2/numObs)*(rand(criticLayerSizes(1),numObs)-0.5), ...
'Bias',1e-3*ones(criticLayerSizes(1),1))
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(criticLayerSizes(2),'Name','CriticFC2', ...
'Weights',sqrt(2/criticLayerSizes(1))*(rand(criticLayerSizes(2),criticLayerSizes(1))-0.5), ...
'Bias',1e-3*ones(criticLayerSizes(2),1))
reluLayer('Name','CriticRelu2')
fullyConnectedLayer(1,'Name','CriticOutput', ...
'Weights',sqrt(2/criticLayerSizes(2))*(rand(1,criticLayerSizes(2))-0.5), ...
'Bias',1e-3)];
criticNetwork = dlnetwork(criticNetwork);
criticOpts = rlOptimizerOptions('LearnRate',1e-4);
critic = rlValueFunction(criticNetwork,observationInfo);
Create the actor using a deep neural network with six inputs and two outputs. The outputs of the
actor network are the probabilities of taking each possible action pair. Each action pair contains
normalized action values for each thruster. The environment step function scales these values to
determine the actual thrust values.
actorNetwork = [featureInputLayer(numObs,'Normalization','none','Name','observation')
fullyConnectedLayer(actorLayerSizes(1),'Name','ActorFC1', ...
'Weights',sqrt(2/numObs)*(rand(actorLayerSizes(1),numObs)-0.5), ...
'Bias',1e-3*ones(actorLayerSizes(1),1))
reluLayer('Name','ActorRelu1')
fullyConnectedLayer(actorLayerSizes(2),'Name','ActorFC2', ...
'Weights',sqrt(2/actorLayerSizes(1))*(rand(actorLayerSizes(2),actorLayerSizes(1))-0.5), ...
'Bias',1e-3*ones(actorLayerSizes(2),1))
reluLayer('Name', 'ActorRelu2')
fullyConnectedLayer(numAct,'Name','Action', ...
'Weights',sqrt(2/actorLayerSizes(2))*(rand(numAct,actorLayerSizes(2))-0.5), ...
'Bias',1e-3*ones(numAct,1))
softmaxLayer('Name','actionProb')];
actorNetwork = dlnetwork(actorNetwork);
• The agent collects experiences until it reaches the experience horizon of 600 steps or episode
termination and then trains from mini-batches of 128 experiences for 3 epochs.
• For improving training stability, use an objective function clip factor of 0.02.
• A discount factor value of 0.997 encourages long term rewards.
• Variance in critic output is reduced by using the Generalized Advantage Estimate method with a
GAE factor of 0.95.
• The EntropyLossWeight term of 0.01 enhances exploration during training.
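A sketch of agent options consistent with the settings above follows. The actor object and its optimizer options are assumptions, created here in the same way as the critic ones; criticOpts, critic, and Ts are the objects defined earlier in this example.
% Assumed actor function and optimizer options.
actor = rlDiscreteCategoricalActor(actorNetwork,observationInfo,actionInfo);
actorOpts = rlOptimizerOptions('LearnRate',1e-4);   % learning rate assumed

% Agent options matching the settings listed above.
agentOpts = rlPPOAgentOptions(...
    'ExperienceHorizon',600,...
    'ClipFactor',0.02,...
    'EntropyLossWeight',0.01,...
    'MiniBatchSize',128,...
    'NumEpoch',3,...
    'AdvantageEstimateMethod','gae',...
    'GAEFactor',0.95,...
    'SampleTime',Ts,...
    'DiscountFactor',0.997,...
    'ActorOptimizerOptions',actorOpts,...
    'CriticOptimizerOptions',criticOpts);
agent = rlPPOAgent(actor,critic,agentOpts);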
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run the training for at most 20000 episodes, with each episode lasting at most 600 time steps.
• Stop the training when the average reward over 100 consecutive episodes is 430 or more.
• Save a copy of the agent for each episode where the episode reward is 700 or more.
trainOpts = rlTrainingOptions(...
'MaxEpisodes',20000,...
'MaxStepsPerEpisode',600,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',430,...
'ScoreAveragingWindowLength',100,...
'SaveAgentCriteria',"EpisodeReward",...
'SaveAgentValue',700);
Train the agent using the train function. Due to the complexity of the environment, training process
is computationally intensive and takes several hours to complete. To save time while running this
example, load a pretrained agent by setting doTraining to false.
doTraining = false;
if doTraining
trainingStats = train(agent,env,trainOpts);
else
load('rocketLanderAgent.mat');
end
An example training session is shown below. The actual results may vary because of randomness in
the training process.
Simulate
plot(env)
Simulate the trained agent within the environment. For more information on agent simulation, see
rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',600);
simOptions.NumSimulations = 5; % simulate the environment 5 times
experience = sim(env,agent,simOptions);
See Also
train | rlPPOAgent
More About
• “Train Reinforcement Learning Agents” on page 5-3
Train Multiple Agents to Perform Collaborative Task
This example shows how to set up a multi-agent training session on a Simulink® environment. In the
example, you train two agents to collaboratively perform the task of moving an object.
The environment in this example is a frictionless two dimensional surface containing elements
represented by circles. A target object C is represented by the blue circle with a radius of 2 m, and
robots A (red) and B (green) are represented by smaller circles with radii of 1 m each. The robots
attempt to move object C outside a circular ring of a radius 8 m by applying forces through collision.
All elements within the environment have mass and obey Newton's laws of motion. In addition,
contact forces between the elements and the environment boundaries are modeled as spring and
mass damper systems. The elements can move on the surface through the application of externally
applied forces in the X and Y directions. There is no motion in the third dimension and the total
energy of the system is conserved.
Set the random seed and create the set of parameters required for this example.
rng(10)
rlCollaborativeTaskParams
mdl = "rlCollaborativeTask";
open_system(mdl)
• The 2-dimensional space is bounded from –12 m to 12 m in both the X and Y directions.
• The contact spring stiffness and damping values are 100 N/m and 0.1 N/m/s, respectively.
• The agents share the same observations for positions, velocities of A, B, and C and the action
values from the last time step.
• The simulation terminates when object C moves outside the circular ring.
• At each time step, the agents receive the following reward:
rA = rglobal + rlocal,A
rB = rglobal + rlocal,B
rglobal = 0.001 dC
rlocal,A = −0.005 dAC − 0.008 uA²
rlocal,B = −0.005 dBC − 0.008 uB²
Here, dC is the distance of object C from the center of the ring, dAC and dBC are the distances of robots A and B from object C, and uA and uB are the force values applied by robots A and B.
Environment
To create a multi-agent environment, specify the block paths of the agents using a string array. Also,
specify the observation and action specification objects using cell arrays. The order of the
specification objects in the cell array must match the order specified in the block path array. When
agents are available in the MATLAB workspace at the time of environment creation, the observation
and action specification arrays are optional. For more information on creating multi-agent
environments, see rlSimulinkEnv.
Create the I/O specifications for the environment. In this example, the agents are homogeneous and
have the same I/O specifications.
% Number of observations
numObs = 16;
% Number of actions
numAct = 2;
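A sketch of how the specifications and the two-agent environment can be assembled follows; the agent block names and the action limits are assumptions.
% Observation and action specifications (shared by both agents).
obsSpec = rlNumericSpec([numObs 1]);
actSpec = rlNumericSpec([numAct 1],'LowerLimit',-1,'UpperLimit',1);   % limits assumed

% Block paths and environment interface (block names assumed).
blks = mdl + ["/Agent A","/Agent B"];
env = rlSimulinkEnv(mdl,blks,{obsSpec,obsSpec},{actSpec,actSpec});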
Specify a reset function for the environment. The reset function resetRobots ensures that the
robots start from random initial positions at the beginning of each episode.
Agents
This example uses two Proximal Policy Optimization (PPO) agents with continuous action spaces. The
agents apply external forces on the robots that result in motion. To learn more about PPO agents, see
“Proximal Policy Optimization Agents” on page 3-44.
The agents collect experiences until the experience horizon (600 steps) is reached. After trajectory
completion, the agents learn from mini-batches of 300 experiences. An objective function clip factor
of 0.2 is used to improve training stability and a discount factor of 0.99 is used to encourage long-
term rewards.
agentOptions = rlPPOAgentOptions(...
"ExperienceHorizon",600,...
"ClipFactor",0.2,...
"EntropyLossWeight",0.01,...
"MiniBatchSize",300,...
"NumEpoch",4,...
"AdvantageEstimateMethod","gae",...
"GAEFactor",0.95,...
"SampleTime",Ts,...
"DiscountFactor",0.99);
agentOptions.ActorOptimizerOptions.LearnRate = 1e-4;
agentOptions.CriticOptimizerOptions.LearnRate = 1e-4;
Create the agents using the default agent creation syntax. For more information see rlPPOAgent.
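With the default creation syntax, the actor and critic are built automatically from the specifications. A sketch, reusing the specification objects from the earlier sketch and the agent options defined above:
agentA = rlPPOAgent(obsSpec,actSpec,agentOptions);
agentB = rlPPOAgent(obsSpec,actSpec,agentOptions);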
Training
To train multiple agents, you can pass an array of agents to the train function. The order of agents
in the array must match the order of agent block paths specified during environment creation. Doing
so ensures that the agent objects are linked to their appropriate I/O interfaces in the environment.
You can train multiple agents in a decentralized or centralized manner. In decentralized training,
agents collect their own set of experiences during the episodes and learn independently from those
experiences. In centralized training, the agents share the collected experiences and learn from them
together. The actor and critic functions are synchronized between the agents after trajectory
completion.
To configure a multi-agent training, you can create agent groups and specify a learning strategy for
each group through the rlMultiAgentTrainingOptions object. Each agent group may contain
unique agent indices, and the learning strategy can be "centralized" or "decentralized". For example, you can configure training for three agent groups with different learning strategies, in which the agents with indices [1,2] and [3,4] learn in a centralized manner while a remaining agent learns in a decentralized manner.
You can perform decentralized or centralized training by running one of the following sections using
the Run Section button.
1. Decentralized Training
• Automatically assign agent groups using the AgentGroups=auto option. This allocates each
agent in a separate group.
• Specify the "decentralized" learning strategy.
• Run the training for at most 1000 episodes, with each episode lasting at most 600 time steps.
• Stop the training of an agent when its average reward over 30 consecutive episodes is –10 or
more.
trainOpts = rlMultiAgentTrainingOptions(...
"AgentGroups","auto",...
"LearningStrategy","decentralized",...
"MaxEpisodes",1000,...
"MaxStepsPerEpisode",600,...
"ScoreAveragingWindowLength",30,...
"StopTrainingCriteria","AverageReward",...
"StopTrainingValue",-10);
Train the agents using the train function. Training can take several hours to complete depending on
the available computational power. To save time, load the MAT file decentralizedAgents.mat
which contains a set of pretrained agents. To train the agents yourself, set doTraining to true.
doTraining = false;
if doTraining
decentralizedTrainResults = train([agentA,agentB],env,trainOpts);
else
load("decentralizedAgents.mat");
end
The following figure shows a snapshot of decentralized training progress. You can expect different
results due to randomness in the training process.
2. Centralized Training
• Allocate both agents (with indices 1 and 2) in a single group. You can do this by specifying the
agent indices in the "AgentGroups" option.
• Specify the "centralized" learning strategy.
• Run the training for at most 1000 episodes, with each episode lasting at most 600 time steps.
• Stop the training of an agent when its average reward over 30 consecutive episodes is –10 or
more.
trainOpts = rlMultiAgentTrainingOptions(...
"AgentGroups",{[1,2]},...
"LearningStrategy","centralized",...
"MaxEpisodes",1000,...
"MaxStepsPerEpisode",600,...
"ScoreAveragingWindowLength",30,...
"StopTrainingCriteria","AverageReward",...
"StopTrainingValue",-10);
Train the agents using the train function. Training can take several hours to complete depending on
the available computational power. To save time, load the MAT file centralizedAgents.mat which
contains a set of pretrained agents. To train the agents yourself, set doTraining to true.
doTraining = false;
if doTraining
centralizedTrainResults = train([agentA,agentB],env,trainOpts);
else
load("centralizedAgents.mat");
end
The following figure shows a snapshot of centralized training progress. You can expect different
results due to randomness in the training process.
Simulation
Once the training is finished, simulate the trained agents with the environment.
simOptions = rlSimulationOptions("MaxSteps",300);
exp = sim(env,[agentA agentB],simOptions);
See Also
train | rlSimulinkEnv
More About
• “Train Reinforcement Learning Agents” on page 5-3
Train Multiple Agents for Area Coverage
This example demonstrates a multi-agent collaborative-competitive task in which you train three
proximal policy optimization (PPO) agents to explore all areas within a grid-world environment.
Multi-agent training is supported for Simulink® environments only. As shown in this example, if you
define your environment behavior using a MATLAB® System object, you can incorporate it into a
Simulink environment using a MATLAB System (Simulink) block.
Create Environment
The environment in this example is a 12x12 grid world containing obstacles, with unexplored cells
marked in white and obstacles marked in black. There are three robots in the environment
represented by the red, green, and blue circles. Three proximal policy optimization agents with
discrete action spaces control the robots. To learn more about PPO agents, see “Proximal Policy
Optimization Agents” on page 3-44.
The agents provide one of five possible movement actions (WAIT, UP, DOWN, LEFT, or RIGHT) to their
respective robots. The robots decide whether an action is legal or illegal. For example, an action of
moving LEFT when the robot is located next to the left boundary of the environment is deemed
illegal. Similarly, actions for colliding against obstacles and other agents in the environment are
illegal actions and draw penalties. The environment dynamics are deterministic, which means the
robots execute legal and illegal actions with 100% and 0% probabilities, respectively. The overall goal
is to explore all cells as quickly as possible.
At each time step, an agent observes the state of the environment through a set of four images that
identify the cells with obstacles, current position of the robot that is being controlled, position of
other robots, and cells that have been explored during the episode. These images are combined to
create a 4-channel 12x12 image observation set. The following figure shows an example of what the
agent controlling the green robot observes for a given time step.
At each time step, each agent receives a reward for exploring previously unexplored cells and a penalty for illegal actions, such as collisions with obstacles or with the other robots.
Define the locations of obstacles within the grid using a matrix of indices. The first column contains
the row indices, and the second column contains the column indices.
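An illustrative obstacle matrix is shown below; the actual obstacle layout used by the shipped example may differ.
% Obstacle locations as [row column] indices (illustrative values).
obsMat = [4 3; 5 3; 6 3; 7 3; 4 11; 5 11; 6 11; 7 11];
The vectors sA0, sB0, and sC0 below are the initial [row column] positions of robots A, B, and C.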
sA0 = [2 2];
sB0 = [11 4];
sC0 = [3 12];
s0 = [sA0; sB0; sC0];
Specify the sample time, simulation time, and maximum number of steps per episode.
Ts = 0.1;
Tf = 100;
maxsteps = ceil(Tf/Ts);
The GridWorld block is a MATLAB System block representing the training environment. The System
object for this environment is defined in GridWorld.m.
In this example, the agents are homogeneous and have the same observation and action
specifications. Create the observation and action specifications for the environment. For more
information, see rlNumericSpec and rlFiniteSetSpec.
% Define observation specifications.
obsSize = [12 12 4];
oinfo = rlNumericSpec(obsSize);
oinfo.Name = 'observations';
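A sketch of the matching action specification follows; the numeric encoding of the five actions is an assumption.
% Define action specifications (WAIT, UP, DOWN, LEFT, RIGHT).
numAct = 5;
ainfo = rlFiniteSetSpec(1:numAct);
ainfo.Name = 'actions';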
Create the environment interface, specifying the same observation and action specifications for all
three agents.
env = rlSimulinkEnv(mdl,blks,{oinfo,oinfo,oinfo},{ainfo,ainfo,ainfo});
Specify a reset function for the environment. The reset function resetMap ensures that the robots
start from random initial positions at the beginning of each episode. The random initialization makes
the agents robust to different starting positions and improves training convergence.
env.ResetFcn = @(in) resetMap(in, obsMat);
Create Agents
The PPO agents in this example operate on a discrete action space and rely on actor and critic
functions to learn the optimal policies. The agents maintain deep neural network-based function
approximators for the actors and critics with similar network structures (a combination of
convolution and fully connected layers). The critic outputs a scalar value representing the state value
V(s). The actor outputs the probabilities π(a|s) of taking each of the five actions WAIT, UP, DOWN, LEFT, or RIGHT.
Create the actor and critic functions using the following steps.
1 Create the actor and critic deep neural networks.
2 Create the actor function objects using the rlDiscreteCategoricalActor command.
3 Create the critic function objects using the rlValueFunction command.
Use the same network structure and representation options for all three agents.
for idx = 1:3
% Create actor deep neural network.
actorNetWork = [
imageInputLayer(obsSize,'Normalization','none','Name','observations')
convolution2dLayer(8,16,'Name','conv1','Stride',1,'Padding',1,'WeightsInitializer','he')
reluLayer('Name','relu1')
convolution2dLayer(4,8,'Name','conv2','Stride',1,'Padding','same','WeightsInitializer','he')
reluLayer('Name','relu2')
fullyConnectedLayer(256,'Name','fc1','WeightsInitializer','he')
reluLayer('Name','relu3')
fullyConnectedLayer(128,'Name','fc2','WeightsInitializer','he')
reluLayer('Name','relu4')
fullyConnectedLayer(64,'Name','fc3','WeightsInitializer','he')
reluLayer('Name','relu5')
fullyConnectedLayer(numAct,'Name','output')
softmaxLayer('Name','action')];
actorNetWork = dlnetwork(actorNetWork);
reluLayer('Name','relu4')
fullyConnectedLayer(64,'Name','fc3','WeightsInitializer','he')
reluLayer('Name','relu5')
fullyConnectedLayer(1,'Name','output')];
criticNetwork = dlnetwork(criticNetwork);
Specify the agent options using rlPPOAgentOptions. Use the same options for all three agents.
During training, agents collect experiences until they reach the experience horizon of 128 steps and
then train from mini-batches of 64 experiences. An objective function clip factor of 0.2 improves
training stability, and a discount factor value of 0.995 encourages long-term rewards.
opt = rlPPOAgentOptions(...
'ActorOptimizerOptions',actorOpts,...
'CriticOptimizerOptions',criticOpts,...
'ExperienceHorizon',128,...
'ClipFactor',0.2,...
'EntropyLossWeight',0.01,...
'MiniBatchSize',64,...
'NumEpoch',3,...
'AdvantageEstimateMethod','gae',...
'GAEFactor',0.95,...
'SampleTime',Ts,...
'DiscountFactor',0.995);
Create the agents using the defined actors, critics, and options.
agentA = rlPPOAgent(actor(1),critic(1),opt);
agentB = rlPPOAgent(actor(2),critic(2),opt);
agentC = rlPPOAgent(actor(3),critic(3),opt);
Train Agents
In this example, the agents are trained independently in decentralized manner. Specify the following
options for training the agents.
• Automatically assign agent groups using the AgentGroups=auto option. This allocates each
agent in a separate group.
• Specify the "decentralized" learning strategy.
• Run the training for at most 1000 episodes, with each episode lasting at most 5000 time steps.
• Stop the training of an agent when its average reward over 100 consecutive episodes is 80 or
more.
trainOpts = rlMultiAgentTrainingOptions(...
"AgentGroups","auto",...
"LearningStrategy","decentralized",...
'MaxEpisodes',1000,...
'MaxStepsPerEpisode',maxsteps,...
'Plots','training-progress',...
'ScoreAveragingWindowLength',100,...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',80);
To train the agents, specify an array of agents to the train function. The order of the agents in the
array must match the order of agent block paths specified during environment creation. Doing so
ensures that the agent objects are linked to the appropriate action and observation specifications in
the environment.
Training is a computationally intensive process that takes several minutes to complete. To save time
while running this example, load pretrained agent parameters by setting doTraining to false. To
train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
result = train([agentA,agentB,agentC],env,trainOpts);
else
load('rlAreaCoverageAgents.mat');
end
The following figure shows a snapshot of the training progress. You can expect different results due
to randomness in the training process.
Simulate Agents
Simulate the trained agents within the environment. For more information on agent simulation, see
rlSimulationOptions and sim.
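A minimal simulation call, following the pattern of the previous multi-agent example:
simOptions = rlSimulationOptions('MaxSteps',maxsteps);
experience = sim(env,[agentA,agentB,agentC],simOptions);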
See Also
train | rlSimulinkEnv
More About
• “Train Reinforcement Learning Agents” on page 5-3
Train Multiple Agents for Path Following Control
This example shows how to train multiple agents to collaboratively perform path-following control
(PFC) for a vehicle. The goal of PFC is to make the ego vehicle travel at a set velocity while
maintaining a safe distance from a lead car by controlling longitudinal acceleration and braking, and
also while keeping the vehicle travelling along the centerline of its lane by controlling the front
steering angle. For more information on PFC, see Path Following Control System (Model Predictive
Control Toolbox).
Overview
An example that trains a reinforcement learning agent to perform PFC is shown in “Train DDPG
Agent for Path-Following Control” on page 5-219. In that example, a single deep deterministic policy
gradient (DDPG) agent is trained to control both the longitudinal speed and lateral steering of the
ego vehicle. In this example, you train two reinforcement learning agents: a DDPG agent provides continuous acceleration values for the longitudinal control loop, and a deep Q-network (DQN) agent provides discrete steering angle values for the lateral control loop.
The trained agents perform PFC through cooperative behavior and achieve satisfactory results.
Create Environment
The environment for this example includes a simple bicycle model for the ego car and a simple
longitudinal model for the lead car. The training goal is to make the ego car travel at a set velocity
while maintaining a safe distance from lead car by controlling longitudinal acceleration and braking,
while also keeping the ego car travelling along the centerline of its lane by controlling the front
steering angle.
multiAgentPFCParams
mdl = "rlMultiAgentPFC";
open_system(mdl)
In this model, the two reinforcement learning agents (RL Agent1 and RL Agent2) provide longitudinal
acceleration and steering angle signals, respectively.
• The reference velocity for the ego car V ref is defined as follows. If the relative distance is less than
the safe distance, the ego car tracks the minimum of the lead car velocity and driver-set velocity.
In this manner, the ego car maintains some distance from the lead car. If the relative distance is
greater than the safe distance, the ego car tracks the driver-set velocity. In this example, the safe
distance is defined as a linear function of the ego car longitudinal velocity V, that is, tgap * V + Ddefault. The safe distance determines the tracking velocity for the ego car.
• The observations from the environment contain the longitudinal measurements: the velocity error
eV = V ref − V, its integral ∫e, and the ego car longitudinal velocity V .
• The action signal consists of continuous acceleration values between -3 and 2 m/s^2.
• The reward rt, provided at every time step t, penalizes the velocity error and the acceleration input at−1 from the previous time step, and rewards the agent when the velocity error is small.
• The observations from the environment contain the lateral measurements: the lateral deviation e1,
relative yaw angle e2, their derivatives ė1 and ė2, and their integrals ∫e1 and ∫e2.
• The action signal consists of discrete steering angle actions which take values from -15 degrees
(-0.2618 rad) to 15 degrees (0.2618 rad) in steps of 1 degree (0.0175 rad).
• The reward rt, provided at every time step t, penalizes the lateral deviation, the steering input ut−1 from the previous time step, and the acceleration input at−1 from the previous time step, and rewards the agent when the lateral error is small.
The logical terms in the reward functions (Ft, Mt, and Ht) penalize the agents if the simulation
terminates early, while encouraging the agents to make both the lateral error and velocity error
small.
Create the observation and action specifications for longitudinal control loop.
obsInfo1 = rlNumericSpec([3 1]);
actInfo1 = rlNumericSpec([1 1],'LowerLimit',-3,'UpperLimit',2);
Create the observation and action specifications for lateral control loop.
obsInfo2 = rlNumericSpec([6 1]);
actInfo2 = rlFiniteSetSpec((-15:15)*pi/180);
Create a Simulink environment interface, specifying the block paths for both agent blocks. The order
of the block paths must match the order of the observation and action specification cell arrays.
blks = mdl + ["/RL Agent1", "/RL Agent2"];
env = rlSimulinkEnv(mdl,blks,{obsInfo1,obsInfo2},{actInfo1,actInfo2});
Specify a reset function for the environment using the ResetFcn property. The function
pfcResetFcn randomly sets the initial poses of the lead and ego vehicles at the beginning of every
episode during training.
env.ResetFcn = @pfcResetFcn;
Create Agents
For this example you create two reinforcement learning agents. First, fix the random seed for
reproducibility.
rng(0)
Both agents operate at the same sample time in this example. Set the sample time value (in seconds).
Ts = 0.1;
Longitudinal Control
The agent for the longitudinal control loop is a DDPG agent. A DDPG agent approximates the long-
term reward given observations and actions using a critic value function representation and selects
actions using an actor policy representation. For more information on creating deep neural network
value function and policy representations, see “Create Policies and Value Functions” on page 4-2.
Use the createACCAgent function to create a DDPG agent for longitudinal control. The structure of this agent is similar to the “Train DDPG Agent for Adaptive Cruise Control” on page 5-193 example.
agent1 = createACCAgent(obsInfo1,actInfo1,Ts);
Lateral Control
The agent for the lateral control loop is a DQN agent. A DQN agent approximates the long-term
reward given observations and actions using a critic value function representation.
Use the createLKAAgent function to create a DQN agent for lateral control. The structure of this
agent is similar to the “Train DQN Agent for Lane Keeping Assist” on page 5-201 example.
agent2 = createLKAAgent(obsInfo2,actInfo2,Ts);
Train Agents
Specify the training options. For this example, use the following options.
• Run the training for at most 5000 episodes, with each episode lasting at most maxsteps time steps.
• Display the training progress in the Episode Manager dialog box (set the Verbose and Plots
options).
• Stop training the DDPG and DQN agents when they receive an average reward greater than 480
and 1195, respectively. When one agent reaches its stop criteria, it simulates its own policy
without learning while the other agent continues training.
Train the agents using the train function. Training these agents is a computationally intensive
process that takes several minutes to complete. To save time while running this example, load a
pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to
true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train([agent1,agent2],env,trainingOpts);
else
% Load pretrained agents for the example.
load('rlPFCAgents.mat')
end
The following figure shows a snapshot of the training progress for the two agents.
Simulate Agents
To validate the performance of the trained agents, simulate the agents within the Simulink
environment by uncommenting the following commands. For more information on agent simulation,
see rlSimulationOptions and sim.
% simOptions = rlSimulationOptions('MaxSteps',maxsteps);
% experience = sim(env,[agent1, agent2],simOptions);
To demonstrate the trained agent using deterministic initial conditions, simulate the model in
Simulink.
e1_initial = -0.4;
e2_initial = 0.1;
x0_lead = 80;
sim(mdl)
The following plots show the results when the lead car is 70 m ahead of the ego car at the beginning
of simulation.
• The lead car changes speed from 24 m/s to 30 m/s periodically (top-right plot). The ego car
maintains a safe distance throughout the simulation (bottom-right plot).
• From 0 to 30 seconds, the ego car tracks the set velocity (top-right plot) and experiences some
acceleration (top-left plot). After that, the acceleration is reduced to 0.
• The bottom-left plot shows the lateral deviation. As shown in the plot, the lateral deviation is
greatly decreased within 1 second. The lateral deviation remains less than 0.1 m.
More About
• “Train Reinforcement Learning Agents” on page 5-3
Train DDPG Agent for Adaptive Cruise Control
This example shows how to train a deep deterministic policy gradient (DDPG) agent for adaptive
cruise control (ACC) in Simulink®. For more information on DDPG agents, see “Deep Deterministic
Policy Gradient (DDPG) Agents” on page 3-31.
Simulink Model
The reinforcement learning environment for this example is the simple longitudinal dynamics for an
ego car and lead car. The training goal is to make the ego car travel at a set velocity while
maintaining a safe distance from lead car by controlling longitudinal acceleration and braking. This
example uses the same vehicle model as the “Adaptive Cruise Control System Using Model Predictive
Control” (Model Predictive Control Toolbox) example.
Specify the initial position and velocity for the two vehicles.
Specify standstill default spacing (m), time gap (s) and driver-set velocity (m/s).
D_default = 10;
t_gap = 1.4;
v_set = 30;
To simulate the physical limitations of the vehicle dynamics, constrain the acceleration to the range [–3,2] m/s^2.
amin_ego = -3;
amax_ego = 2;
Ts = 0.1;
Tf = 60;
mdl = 'rlACCMdl';
open_system(mdl)
agentblk = [mdl '/RL Agent'];
• The acceleration action signal from the agent to the environment is from –3 to 2 m/s^2.
• The reference velocity for the ego car V ref is defined as follows. If the relative distance is less than
the safe distance, the ego car tracks the minimum of the lead car velocity and driver-set velocity.
In this manner, the ego car maintains some distance from the lead car. If the relative distance is
greater than the safe distance, the ego car tracks the driver-set velocity. In this example, the safe
distance is defined as a linear function of the ego car longitudinal velocity V; that is, tgap * V + Ddefault. The safe distance determines the reference tracking velocity for the ego car.
• The observations from the environment are the velocity error e = V − V
ref ego, its integral ∫e, and
the ego car longitudinal velocity V .
• The simulation is terminated when longitudinal velocity of the ego car is less than 0, or the
relative distance between the lead car and ego car becomes less than 0.
• The reward rt, provided at every time step t, is
rt = −(0.1 et² + (ut−1)²) + Mt
where ut−1 is the control input from the previous time step. The logical value Mt = 1 if the velocity
error satisfies et² ≤ 0.25; otherwise, Mt = 0.
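The environment interface created below requires observation and action specifications. One possible definition, consistent with the description above (three observation signals and a scalar acceleration command limited to the range [–3,2] m/s^2), is the following; the signal names are placeholders.
observationInfo = rlNumericSpec([3 1]);
observationInfo.Name = 'observations';
actionInfo = rlNumericSpec([1 1],'LowerLimit',-3,'UpperLimit',2);
actionInfo.Name = 'acceleration';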
env = rlSimulinkEnv(mdl,agentblk,observationInfo,actionInfo);
To define the initial condition for the position of the lead car, specify an environment reset function
using an anonymous function handle. The reset function localResetFcn, which is defined at the end
of the example, randomizes the initial position of the lead car.
env.ResetFcn = @(in)localResetFcn(in);
rng('default')
A DDPG agent approximates the long-term reward given observations and actions using a critic value
function representation. To create the critic, first create a deep neural network with two inputs, the
state and action, and one output. For more information on creating a neural network value function
representation, see “Create Policies and Value Functions” on page 4-2.
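The critic construction below references a layer array statePath and a layer width L. If they are not already defined in your workspace, one possible definition, consistent with the connectLayers command that routes the action path into the second input of an addition layer named 'add', is the following sketch; the layer width is an arbitrary choice.
L = 48; % number of neurons per hidden layer (arbitrary choice)
statePath = [
    featureInputLayer(3,'Normalization','none','Name','observation')
    fullyConnectedLayer(L,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(L,'Name','fc2')
    additionLayer(2,'Name','add')
    reluLayer('Name','relu2')
    fullyConnectedLayer(L,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(1,'Name','fc4')];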
actionPath = [
featureInputLayer(1,'Normalization','none','Name','action')
fullyConnectedLayer(L, 'Name', 'fc5')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork, actionPath);
criticNetwork = connectLayers(criticNetwork,'fc5','add/in2');
criticNetwork = dlnetwork(criticNetwork);
plot(layerGraph(criticNetwork))
criticOptions = rlOptimizerOptions('LearnRate',1e-3,'GradientThreshold',1,'L2RegularizationFactor
Create the critic representation using the specified neural network and options. You must also specify
the action and observation info for the critic, which you obtain from the environment interface. For
more information, see rlQValueFunction.
critic = rlQValueFunction(criticNetwork,observationInfo,actionInfo,...
'ObservationInputNames','observation','ActionInputNames','action');
A DDPG agent decides which action to take given observations by using an actor representation. To
create the actor, first create a deep neural network with one input, the observation, and one output,
the action.
Construct the actor similarly to the critic. For more information, see
rlContinuousDeterministicActor.
actorNetwork = [
featureInputLayer(3,'Normalization','none','Name','observation')
fullyConnectedLayer(L,'Name','fc1')
reluLayer('Name','relu1')
fullyConnectedLayer(L,'Name','fc2')
reluLayer('Name','relu2')
fullyConnectedLayer(L,'Name','fc3')
reluLayer('Name','relu3')
fullyConnectedLayer(1,'Name','fc4')
tanhLayer('Name','tanh1')
scalingLayer('Name','ActorScaling1','Scale',2.5,'Bias',-0.5)];
actorNetwork = dlnetwork(actorNetwork);
actorOptions = rlOptimizerOptions('LearnRate',1e-4,'GradientThreshold',1,'L2RegularizationFactor'
actor = rlContinuousDeterministicActor(actorNetwork,observationInfo,actionInfo);
To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.
agentOptions = rlDDPGAgentOptions(...
'SampleTime',Ts,...
'ActorOptimizerOptions',actorOptions,...
'CriticOptimizerOptions',criticOptions,...
'ExperienceBufferLength',1e6);
agentOptions.NoiseOptions.Variance = 0.6;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
Then, create the DDPG agent using the specified actor representation, critic representation, and
agent options. For more information, see rlDDPGAgent.
agent = rlDDPGAgent(actor,critic,agentOptions);
Train Agent
To train the agent, first specify the training options. For this example, use the following options:
• Run the training for at most 5000 episodes, with each episode lasting at most 600 time
steps.
• Display the training progress in the Episode Manager dialog box.
• Stop training when the agent receives an episode reward greater than 260.
maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainingOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','EpisodeReward',...
'StopTrainingValue',260);
Train the agent using the train function. Training is a computationally intensive process that takes
several minutes to complete. To save time while running this example, load a pretrained agent by
setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainingOpts);
else
% Load a pretrained agent for the example.
load('SimulinkACCDDPG.mat','agent')
end
To validate the performance of the trained agent, simulate the agent within the Simulink environment
by uncommenting the following commands. For more information on agent simulation, see
rlSimulationOptions and sim.
% simOptions = rlSimulationOptions('MaxSteps',maxsteps);
% experience = sim(env,agent,simOptions);
To demonstrate the trained agent using deterministic initial conditions, simulate the model in
Simulink.
x0_lead = 80;
sim(mdl)
The following plots show the simulation results when the lead car is 70 m ahead of the ego car.
• In the first 28 seconds, the relative distance is greater than the safe distance (bottom plot), so the
ego car tracks set velocity (middle plot). To speed up and reach the set velocity, acceleration is
positive (top plot).
• From 28 to 60 seconds, the relative distance is less than the safe distance (bottom plot), so the
ego car tracks the minimum of the lead velocity and set velocity. From 28 to 36 seconds, the lead
velocity is less than the set velocity (middle plot). To slow down and track the lead car velocity,
acceleration is negative (top plot). From 36 to 60 seconds, the ego car adjusts its acceleration to
track the reference velocity closely (middle plot). Within this time interval, the ego car tracks the
set velocity from 43 to 52 seconds and tracks lead velocity from 36 to 43 seconds and 52 to 60
seconds.
bdclose(mdl)
Reset Function
function in = localResetFcn(in)
% Reset the initial position of the lead car.
in = setVariable(in,'x0_lead',40+randi(60,1,1));
end
See Also
train
More About
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Policies and Value Functions” on page 4-2
Train DQN Agent for Lane Keeping Assist
This example shows how to train a deep Q-learning network (DQN) agent for lane keeping assist
(LKA) in Simulink®. For more information on DQN agents, see “Deep Q-Network (DQN) Agents” on
page 3-23.
The reinforcement learning environment for this example is a simple bicycle model for ego vehicle
dynamics. The training goal is to keep the ego vehicle traveling along the centerline of the lanes by
adjusting the front steering angle. This example uses the same vehicle model as in “Lane Keeping
Assist System Using Model Predictive Control” (Model Predictive Control Toolbox). The ego car
dynamics are specified by the following parameters.
Ts = 0.1;
T = 15;
The output of the LKA system is the front steering angle of the ego car. To simulate the physical
limitations of the ego car, constrain the steering angle to the range [-0.5,0.5] rad.
u_min = -0.5;
u_max = 0.5;
The curvature of the road is defined by a constant 0.001 m⁻¹. The initial value for the lateral
deviation is 0.2 m and the initial value for the relative yaw angle is –0.1 rad.
rho = 0.001;
e1_initial = 0.2;
e2_initial = -0.1;
mdl = 'rlLKAMdl';
open_system(mdl);
agentblk = [mdl '/RL Agent'];
• The steering-angle action signal from the agent to the environment is from –15 degrees to 15
degrees.
• The observations from the environment are the lateral deviation e1, the relative yaw angle e2,
their derivatives ė1 and ė2, and their integrals ∫e1 and ∫e2.
• The simulation is terminated when the lateral deviation |e1| > 1.
• The reward rt, provided at every time step t, is
rt = −(10 e1² + 5 e2² + 2 u² + 5 ė1² + 5 ė2²)
Create a reinforcement learning environment interface for the ego vehicle. To do so, first create the
observation and action specifications.
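One possible definition of these specifications, consistent with the six observation signals and the 31 discrete steering angles described below, is the following; the signal names are placeholders.
observationInfo = rlNumericSpec([6 1]);
observationInfo.Name = 'observations';
actionInfo = rlFiniteSetSpec((-15:15)*pi/180); % 31 steering angles in radians
actionInfo.Name = 'steering';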
env = rlSimulinkEnv(mdl,agentblk,observationInfo,actionInfo);
The interface has a discrete action space where the agent can apply one of 31 possible steering
angles from –15 degrees to 15 degrees. The observation is the six-dimensional vector containing
lateral deviation, relative yaw angle, as well as their derivatives and integrals with respect to time.
To define the initial condition for lateral deviation and relative yaw angle, specify an environment
reset function using an anonymous function handle. This reset function randomizes the initial values
for the lateral deviation and relative yaw angle.
env.ResetFcn = @(in)localResetFcn(in);
rng(0)
A DQN agent approximates the long-term reward, given observations and actions, using a value
function critic representation.
DQN agents can use multi-output Q-value critic approximators, which are generally more efficient. A
multi-output approximator has observations as inputs and state-action values as outputs. Each output
element represents the expected cumulative long-term reward for taking the corresponding discrete
action from the state indicated by the observation inputs.
To create the critic, first create a deep neural network with one input (the six-dimensional observed
state) and one output vector with 31 elements (evenly spaced steering angles from -15 to 15
degrees). For more information on creating a deep neural network value function representation, see
“Create Policies and Value Functions” on page 4-2.
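The network layers below use the sizes nI, nL, and nO. If they are not already defined, you can derive the input and output sizes from the specifications; the hidden layer width is an arbitrary choice.
nI = observationInfo.Dimension(1);   % number of network inputs (6)
nO = numel(actionInfo.Elements);     % number of network outputs (31)
nL = 120;                            % number of neurons per hidden layer (arbitrary choice)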
dnn = [
featureInputLayer(nI,'Normalization','none','Name','state')
fullyConnectedLayer(nL,'Name','fc1')
reluLayer('Name','relu1')
fullyConnectedLayer(nL,'Name','fc2')
reluLayer('Name','relu2')
fullyConnectedLayer(nO,'Name','fc3')];
dnn = dlnetwork(dnn);
figure
plot(layerGraph(dnn))
criticOptions = rlOptimizerOptions('LearnRate',1e-4,'GradientThreshold',1,'L2RegularizationFactor
Create the critic representation using the specified deep neural network and options. You must also
specify the action and observation info for the critic, which you obtain from the environment
interface. For more information, see rlVectorQValueFunction.
critic = rlVectorQValueFunction(dnn,observationInfo,actionInfo);
To create the DQN agent, first specify the DQN agent options using rlDQNAgentOptions.
agentOptions = rlDQNAgentOptions(...
'SampleTime',Ts,...
'UseDoubleDQN',true,...
'CriticOptimizerOptions',criticOptions,...
'ExperienceBufferLength',1e6,...
'MiniBatchSize',64);
Then, create the DQN agent using the specified critic representation and agent options. For more
information, see rlDQNAgent.
agent = rlDQNAgent(critic,agentOptions);
Train Agent
To train the agent, first specify the training options. For this example, use the following options:
• Run the training for at most 5000 episodes, with each episode lasting at most
ceil(T/Ts) time steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option to
training-progress) and disable the command line display (set the Verbose option to false).
• Stop training when the episode reward reaches –1.
• Save a copy of the agent for each episode where the cumulative reward is greater than –2.5.
maxepisodes = 5000;
maxsteps = ceil(T/Ts);
trainingOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','EpisodeReward',...
'StopTrainingValue',-1,...
'SaveAgentCriteria','EpisodeReward',...
'SaveAgentValue',-2.5);
Train the agent using the train function. Training is a computationally intensive process that takes
several hours to complete. To save time while running this example, load a pretrained agent by
setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainingOpts);
else
% Load the pretrained agent for the example.
load('SimulinkLKADQNMulti.mat','agent')
end
To validate the performance of the trained agent, uncomment the following two lines and simulate the
agent within the environment. For more information on agent simulation, see
rlSimulationOptions and sim.
% simOptions = rlSimulationOptions('MaxSteps',maxsteps);
% experience = sim(env,agent,simOptions);
To demonstrate the trained agent on deterministic initial conditions, simulate the model in Simulink.
e1_initial = -0.4;
e2_initial = 0.2;
sim(mdl)
As the plots show, the lateral error (top plot) and relative yaw angle (middle plot) are both driven
close to zero. The vehicle starts from off the centerline (–0.4 m) and with a nonzero yaw angle error
(0.2 rad). The lane keeping assist makes the ego car travel along the centerline after about 2.5
seconds. The steering angle (bottom plot) shows that the controller reaches steady state after about 2
seconds.
if ~doTraining
%bdclose(mdl)
end
Reset Function
function in = localResetFcn(in)
% reset
in = setVariable(in,'e1_initial', 0.5*(-1+2*rand)); % random value for lateral deviation
in = setVariable(in,'e2_initial', 0.1*(-1+2*rand)); % random value for relative yaw angle
end
See Also
train
More About
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Policies and Value Functions” on page 4-2
Train PPO Agent for Automatic Parking Valet
This example demonstrates the design of a hybrid controller for an automatic search and parking
task. The hybrid controller uses model predictive control (MPC) to follow a reference path in a
parking lot and a trained reinforcement learning (RL) agent to perform the parking maneuver.
The automatic parking algorithm in this example executes a series of maneuvers while
simultaneously sensing and avoiding obstacles in tight spaces. It switches between an adaptive MPC
controller and an RL agent to complete the parking maneuver. The MPC controller moves the vehicle
at a constant speed along a reference path while an algorithm searches for an empty parking spot.
When a spot is found, the RL Agent takes over and executes a pretrained parking maneuver. Prior
knowledge of the environment (the parking lot) including the locations of the empty spots and parked
vehicles is available to the controllers.
Parking Lot
The parking lot is represented by the ParkingLot class, which stores information about the ego
vehicle, empty parking spots, and static obstacles (parked cars). Each parking spot has a unique
index number and an indicator light that is either green (free) or red (occupied). Parked vehicles are
represented in black.
freeSpotIdx = 7;
map = ParkingLot(freeSpotIdx);
Specify an initial pose (X0, Y0, θ0) for the ego vehicle. The target pose is determined based on the first
available free spot as the vehicle navigates the parking lot.
egoInitialPose = [20, 15, 0];
Compute the target pose for the vehicle using the createTargetPose function. The target pose
corresponds to the location in freeSpotIdx.
egoTargetPose = createTargetPose(map,freeSpotIdx)
egoTargetPose = 1×3
Sensor Modules
The parking algorithm uses camera and lidar sensors to gather information from the environment.
Camera
The field of view of a camera mounted on the ego vehicle is represented by the area shaded in green
in the following figure. The camera has a field of view φ bounded by ±60 degrees and a maximum
measurement depth dmax of 10 m.
As the ego vehicle moves forward, the camera module senses the parking spots that fall within the
field of view and determines whether a spot is free or occupied. For simplicity, this action is
implemented using geometrical relationships between the spot locations and the current vehicle
pose. A parking spot is within the camera range if di ≤ dmax and φmin ≤ φi ≤ φmax, where di is the
distance to the parking spot and φi is the angle to the parking spot.
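This visibility check can be written as a small helper function. The function below is a sketch for illustration only; its name and inputs are not part of the example model.
function tf = isSpotVisible(d,phi,dMax,phiMin,phiMax)
% Return true if a parking spot at distance d (m) and angle phi (rad)
% lies within the camera field of view.
tf = (d <= dMax) && (phi >= phiMin) && (phi <= phiMax);
end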
Lidar
The reinforcement learning agent uses lidar sensor readings to determine the proximity of the ego
vehicle to other vehicles in the environment. The lidar sensor in this example is also modeled using
geometrical relationships. Lidar distances are measured along 12 line segments that radially emerge
from the center of the ego vehicle. When a lidar line intersects an obstacle, it returns the distance of
the obstacle from the vehicle. The maximum measurable lidar distance along any line segment is 6 m.
The parking valet model, including the controllers, ego vehicle, sensors, and parking lot, is
implemented in Simulink®.
autoParkingValetParams
mdl = 'rlAutoParkingValet';
open_system(mdl)
The ego vehicle dynamics in this model are represented by a single-track bicycle model with two
inputs: vehicle speed v (m/s) and steering angle δ (radians). The MPC and RL controllers are placed
within Enabled Subsystem blocks that are activated by signals representing whether the vehicle has
to search for an empty spot or execute a parking maneuver. The enable signals are determined by the
Camera algorithm within the Vehicle Mode subsystem. Initially, the vehicle is in search mode and the
MPC controller tracks the reference path. When a free spot is found, park mode is activated and the
RL agent executes the parking maneuver.
Create the adaptive MPC controller object for reference trajectory tracking using the
createMPCForParking script. For more information on adaptive MPC, see “Adaptive MPC” (Model
Predictive Control Toolbox).
createMPCForParking
The environment for training the RL agent is the region shaded in red in the following figure. Due to
symmetry in the parking lot, training within this region is sufficient for the policy to adjust to other
regions after applying appropriate coordinate transformations to the observations. Using this smaller
training region significantly reduces training time compared to training over the entire parking lot.
• The training region is a 22.5 m x 20 m space with the target spot at its horizontal center.
• The observations are the position errors Xe and Y e of the ego vehicle with respect to the target
pose, the sine and cosine of the true heading angle θ, and the lidar sensor readings.
• The vehicle speed during parking is a constant 2 m/s.
• The action signals are discrete steering angles that range between +/- 45 degrees in steps of 15
degrees.
• The vehicle is considered parked if the errors with respect to target pose are within specified
tolerances of +/- 0.75 m (position) and +/-10 degrees (orientation).
• The episode terminates if the ego vehicle goes out of the bounds of the training region, collides
with an obstacle, or parks successfully.
• The reward rt, provided at time t, is:
rt = 2e^−(0.05 Xe² + 0.04 Ye²) + 0.5e^(−40 θe²) − 0.05 δ² + 100 ft − 50 gt
Here, Xe, Ye, and θe are the position and heading angle errors of the ego vehicle from the target pose,
and δ is the steering angle. ft (0 or 1) indicates whether the vehicle has parked and gt (0 or 1)
indicates whether the vehicle has collided with an obstacle at time t.
The coordinate transformations on vehicle pose X, Y, θ observations for different parking spot
locations are as follows:
• Spots 1–14: no transformation
• Spots 15–22: X̄ = Y, Ȳ = −X, θ̄ = θ − π/2
• Spots 23–36: X̄ = 100 − X, Ȳ = 60 − Y, θ̄ = θ − π
• Spots 37–40: X̄ = 60 − Y, Ȳ = X, θ̄ = θ − 3π/2
• Spots 41–52: X̄ = 100 − X, Ȳ = 30 − Y, θ̄ = θ + π
• Spots 53–64: X̄ = X, Ȳ = Y − 28, θ̄ = θ
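The agent and environment below also require an observation specification and a sample time. One possible definition, consistent with the 16 network inputs used later in this example (two position errors, the sine and cosine of the heading angle, and 12 lidar readings), is the following; the sample time value is a placeholder.
numObservations = 16;
observationInfo = rlNumericSpec([numObservations 1]);
observationInfo.Name = "observations";
Ts = 0.1; % agent sample time (s) -- placeholder value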
steerMax = pi/4;
discreteSteerAngles = -steerMax : deg2rad(15) : steerMax;
actionInfo = rlFiniteSetSpec(num2cell(discreteSteerAngles));
actionInfo.Name = "actions";
numActions = numel(actionInfo.Elements);
Create the Simulink environment interface, specifying the path to the RL Agent block.
blk = [mdl '/RL Controller/RL Agent'];
env = rlSimulinkEnv(mdl,blk,observationInfo,actionInfo);
Specify a reset function for training. The autoParkingValetResetFcn function resets the initial
pose of the ego vehicle to random values at the start of each episode.
env.ResetFcn = @autoParkingValetResetFcn;
Create Agent
The RL agent in this example is a proximal policy optimization (PPO) agent with a discrete action
space. PPO agents rely on actor and critic representations to learn the optimal policy. The agent
maintains deep neural network-based function approximators for the actor and critic. To learn more
about PPO agents, see “Proximal Policy Optimization Agents” on page 3-44.
To create the critic representations, first create a deep neural network with 16 inputs and one output.
The output of the critic network is the state value function for a particular observation.
criticNetwork = [
featureInputLayer(numObservations, ...
Normalization="none", ...
Name="observations")
fullyConnectedLayer(128,Name="fc1")
reluLayer(Name="relu1")
fullyConnectedLayer(128,Name="fc2")
reluLayer(Name="relu2")
fullyConnectedLayer(128,Name="fc3")
reluLayer(Name="relu3")
fullyConnectedLayer(1,Name="fc4")];
criticNetwork = dlnetwork(criticNetwork);
Create the critic for the PPO agent. For more information, see rlValueFunction and
rlOptimizerOptions.
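One possible way to create the critic and its optimizer options is shown below; the learning rate and gradient threshold values are placeholders.
criticOptions = rlOptimizerOptions( ...
    LearnRate=1e-3, ...
    GradientThreshold=1);
critic = rlValueFunction(criticNetwork,observationInfo);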
The output of the actor network is the probability of taking each possible steering action when the
vehicle is in a certain state. Create the actor deep neural network.
actorNetwork = [
featureInputLayer(numObservations, ...
Normalization="none", ...
Name="observations")
fullyConnectedLayer(128,Name="fc1")
reluLayer(Name="relu1")
fullyConnectedLayer(128,Name="fc2")
reluLayer(Name="relu2")
fullyConnectedLayer(numActions,Name="out")
softmaxLayer(Name="actionProb")];
actorNetwork = dlnetwork(actorNetwork);
Create a discrete categorical actor for the PPO agent. For more information, see
rlDiscreteCategoricalActor.
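One possible way to create the actor and its optimizer options is shown below; the learning rate and gradient threshold values are placeholders.
actorOptions = rlOptimizerOptions( ...
    LearnRate=2e-4, ...
    GradientThreshold=1);
actor = rlDiscreteCategoricalActor(actorNetwork,observationInfo,actionInfo);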
Specify the agent options and create the PPO agent. For more information on PPO agent options, see
rlPPOAgentOptions.
agentOpts = rlPPOAgentOptions(...
SampleTime=Ts,...
ActorOptimizerOptions=actorOptions,...
CriticOptimizerOptions=criticOptions,...
ExperienceHorizon=200,...
ClipFactor=0.2,...
EntropyLossWeight=0.01,...
MiniBatchSize=64,...
NumEpoch=3,...
AdvantageEstimateMethod="gae",...
GAEFactor=0.95,...
DiscountFactor=0.998);
agent = rlPPOAgent(actor,critic,agentOpts);
During training, the agent collects experiences until it reaches the experience horizon of 200 steps, or
until the episode terminates, and then trains on mini-batches of 64 experiences for three epochs. An
objective function clip factor of 0.2 improves training stability, and a discount factor of 0.998
encourages long-term rewards. Variance in the critic output is reduced by using the generalized
advantage estimate method with a GAE factor of 0.95.
Train Agent
For this example, you train the agent for a maximum of 10000 episodes, with each episode lasting a
maximum of 200 time steps. The training terminates when the maximum number of episodes is
reached or when the average reward over 200 consecutive episodes exceeds 80.
trainOpts = rlTrainingOptions(...
MaxEpisodes=10000,...
MaxStepsPerEpisode=200,...
ScoreAveragingWindowLength=200,...
Plots="training-progress",...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=80);
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
trainingStats = train(agent,env,trainOpts);
else
load('rlAutoParkingValetAgent.mat','agent');
end
Simulate Agent
Simulate the model to park the vehicle in the free parking spot. To simulate the vehicle parking in
different locations, change the free spot location in the following code.
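For instance, the following commands re-create the parking lot with a chosen free spot and run the simulation; the spot index is a placeholder that you can change.
freeSpotIdx = 7;             % index of the free parking spot (change to park elsewhere)
map = ParkingLot(freeSpotIdx);
sim(mdl)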
The vehicle reaches the target pose within the specified error tolerances of +/- 0.75 m (position) and
+/-10 degrees (orientation).
To view the ego vehicle position and orientation, open the Ego Vehicle Pose scope.
See Also
train
More About
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Policies and Value Functions” on page 4-2
Train DDPG Agent for Path-Following Control
This example shows how to train a deep deterministic policy gradient (DDPG) agent for path-
following control (PFC) in Simulink®. For more information on DDPG agents, see “Deep
Deterministic Policy Gradient (DDPG) Agents” on page 3-31.
Simulink Model
The reinforcement learning environment for this example is a simple bicycle model for the ego car
and a simple longitudinal model for the lead car. The training goal is to make the ego car travel at a
set velocity while maintaining a safe distance from the lead car by controlling longitudinal acceleration
and braking, and also while keeping the ego car travelling along the centerline of its lane by
controlling the front steering angle. For more information on PFC, see “Path Following Control System”
(Model Predictive Control Toolbox). The ego car dynamics are specified by the following parameters.
m = 1600; % total vehicle mass (kg)
Iz = 2875; % yaw moment of inertia (mNs^2)
lf = 1.4; % longitudinal distance from center of gravity to front tires (m)
lr = 1.6; % longitudinal distance from center of gravity to rear tires (m)
Cf = 19000; % cornering stiffness of front tires (N/rad)
Cr = 33000; % cornering stiffness of rear tires (N/rad)
tau = 0.5; % longitudinal time constant
Specify the initial position and velocity for the two vehicles.
x0_lead = 50; % initial position for lead car (m)
v0_lead = 24; % initial velocity for lead car (m/s)
x0_ego = 10; % initial position for ego car (m)
v0_ego = 18; % initial velocity for ego car (m/s)
Specify the standstill default spacing (m), time gap (s), and driver-set velocity (m/s).
D_default = 10;
t_gap = 1.4;
v_set = 28;
To simulate the physical limitations of the vehicle dynamics, constrain the acceleration to the range
[–3,2] m/s^2 and the steering angle to the range [–0.2618,0.2618] rad, that is, –15 to 15 degrees.
amin_ego = -3;
amax_ego = 2;
umin_ego = -0.2618; % -15 deg
umax_ego = 0.2618; % +15 deg
The curvature of the road is defined by a constant 0.001 m⁻¹. The initial value for the lateral deviation
is 0.2 m and the initial value for the relative yaw angle is –0.1 rad.
rho = 0.001;
e1_initial = 0.2;
e2_initial = -0.1;
mdl = 'rlPFCMdl';
open_system(mdl)
agentblk = [mdl '/RL Agent'];
• The action signal consists of acceleration and steering angle actions. The acceleration action
signal takes values between –3 and 2 m/s^2. The steering action signal takes values between –15
degrees (–0.2618 rad) and 15 degrees (0.2618 rad).
• The reference velocity for the ego car, Vref, is defined as follows. If the relative distance is less than
the safe distance, the ego car tracks the minimum of the lead car velocity and the driver-set velocity.
In this manner, the ego car maintains some distance from the lead car. If the relative distance is
greater than the safe distance, the ego car tracks the driver-set velocity. In this example, the safe
distance is defined as a linear function of the ego car longitudinal velocity V, that is,
t_gap * V + D_default. The safe distance determines the tracking velocity for the ego car.
• The observations from the environment contain the longitudinal measurements: the velocity error
eV = Vref − Vego, its integral ∫e, and the ego car longitudinal velocity V. In addition, the
observations contain the lateral measurements: the lateral deviation e1, relative yaw angle e2,
their derivatives ė1 and ė2, and their integrals ∫e1 and ∫e2.
• The simulation terminates when the lateral deviation |e1| > 1, when the longitudinal velocity
Vego < 0.5, or when the relative distance between the lead car and ego car Drel < 0.
• The reward rt, provided at every time step t, depends on the velocity error, the lateral deviation, the previous control inputs, and three logical terms.
In the reward, ut−1 is the steering input from the previous time step and at−1 is the acceleration input
from the previous time step. The three logical terms encourage the agent to make both the lateral
error and the velocity error small, and they penalize the agent if the simulation terminates early.
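The environment interface below requires a sample time, a simulation duration, and observation and action specifications. One possible definition, consistent with the nine observation signals and two bounded actions described above, is the following; the sample time, duration, and signal names are placeholders.
Ts = 0.1;  % sample time (s) -- placeholder value
Tf = 60;   % simulation duration (s) -- placeholder value
observationInfo = rlNumericSpec([9 1]);
observationInfo.Name = 'observations';
actionInfo = rlNumericSpec([2 1], ...
    'LowerLimit',[-3;-0.2618],'UpperLimit',[2;0.2618]);
actionInfo.Name = 'accel;steer';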
env = rlSimulinkEnv(mdl,agentblk,observationInfo,actionInfo);
To define the initial conditions, specify an environment reset function using an anonymous function
handle. The reset function localResetFcn, which is defined at the end of the example, randomizes
the initial position of the lead car, the lateral deviation, and the relative yaw angle.
env.ResetFcn = @(in)localResetFcn(in);
rng(0)
A DDPG agent approximates the long-term reward given observations and actions by using a critic
value function representation. To create the critic, first create a deep neural network with two inputs,
the state and action, and one output. For more information on creating a deep neural network value
function representation, see “Create Policies and Value Functions” on page 4-2.
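The critic construction below references a layer array statePath and a layer width L. If they are not already defined, one possible definition, consistent with the nine-dimensional observation and with the connectLayers command that routes the action path into the second input of an addition layer named 'add', is the following sketch; the layer width is an arbitrary choice.
L = 100; % number of neurons per hidden layer (arbitrary choice)
statePath = [
    featureInputLayer(9,'Normalization','none','Name','observation')
    fullyConnectedLayer(L,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(L,'Name','fc2')
    additionLayer(2,'Name','add')
    reluLayer('Name','relu2')
    fullyConnectedLayer(L,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(1,'Name','fc4')];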
actionPath = [
featureInputLayer(2,'Normalization','none','Name','action')
fullyConnectedLayer(L,'Name','fc5')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = connectLayers(criticNetwork,'fc5','add/in2');
criticNetwork = dlnetwork(criticNetwork);
figure
plot(layerGraph(criticNetwork))
criticOptions = rlOptimizerOptions('LearnRate',1e-3,'GradientThreshold',1,'L2RegularizationFactor
Create the critic function using the specified deep neural network. You must also specify the action
and observation info for the critic, which you obtain from the environment interface. For more
information, see rlQValueFunction.
critic = rlQValueFunction(criticNetwork,observationInfo,actionInfo,...
'ObservationInputNames','observation','ActionInputNames','action');
A DDPG agent decides which action to take given observations by using an actor representation. To
create the actor, first create a deep neural network with one input, the observation, and one output,
the action.
Construct the actor similarly to the critic. For more information, see
rlContinuousDeterministicActor.
actorNetwork = [
featureInputLayer(9,'Normalization','none','Name','observation')
fullyConnectedLayer(L,'Name','fc1')
reluLayer('Name','relu1')
fullyConnectedLayer(L,'Name','fc2')
reluLayer('Name','relu2')
fullyConnectedLayer(L,'Name','fc3')
reluLayer('Name','relu3')
fullyConnectedLayer(2,'Name','fc4')
tanhLayer('Name','tanh1')
scalingLayer('Name','ActorScaling1','Scale',[2.5;0.2618],'Bias',[-0.5;0])];
actorNetwork = dlnetwork(actorNetwork);
actorOptions = rlOptimizerOptions('LearnRate',1e-4,'GradientThreshold',1,'L2RegularizationFactor'
actor = rlContinuousDeterministicActor(actorNetwork,observationInfo,actionInfo);
To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.
agentOptions = rlDDPGAgentOptions(...
'SampleTime',Ts,...
'ActorOptimizerOptions',actorOptions,...
'CriticOptimizerOptions',criticOptions,...
'ExperienceBufferLength',1e6);
agentOptions.NoiseOptions.Variance = [0.6;0.1];
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
Then, create the DDPG agent using the specified actor representation, critic representation, and
agent options. For more information, see rlDDPGAgent.
agent = rlDDPGAgent(actor,critic,agentOptions);
Train Agent
To train the agent, first specify the training options. For this example, use the following options:
• Run the training for at most 10000 episodes, with each episode lasting at most
maxsteps time steps.
• Display the training progress in the Episode Manager dialog box (set the Verbose and Plots
options).
• Stop training when the agent receives a cumulative episode reward greater than 1700.
maxepisodes = 1e4;
maxsteps = ceil(Tf/Ts);
trainingOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','EpisodeReward',...
'StopTrainingValue',1700);
Train the agent using the train function. Training is a computationally intensive process that takes
several minutes to complete. To save time while running this example, load a pretrained agent by
setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainingOpts);
else
% Load a pretrained agent for the example.
load('SimulinkPFCDDPG.mat','agent')
end
To validate the performance of the trained agent, simulate the agent within the Simulink environment
by uncommenting the following commands. For more information on agent simulation, see
rlSimulationOptions and sim.
% simOptions = rlSimulationOptions('MaxSteps',maxsteps);
% experience = sim(env,agent,simOptions);
To demonstrate the trained agent using deterministic initial conditions, simulate the model in
Simulink.
e1_initial = -0.4;
e2_initial = 0.1;
x0_lead = 80;
sim(mdl)
The following plots show the simulation results when the lead car is 70 m ahead of the ego car.
• In the first 35 seconds, the relative distance is greater than the safe distance (bottom-right plot),
so the ego car tracks the set velocity (top-right plot). To speed up and reach the set velocity, the
acceleration is mostly nonnegative (top-left plot).
• From 35 to 42 seconds, the relative distance is mostly less than the safe distance (bottom-right
plot), so the ego car tracks the minimum of the lead velocity and set velocity. Because the lead
velocity is less than the set velocity (top-right plot), to track the lead velocity, the acceleration
becomes nonzero (top-left plot).
• From 42 to 58 seconds, the ego car tracks the set velocity (top-right plot) and the acceleration
remains zero (top-left plot).
• From 58 to 60 seconds, the relative distance becomes less than the safe distance (bottom-right
plot), so the ego car slows down and tracks the lead velocity.
• The bottom-left plot shows the lateral deviation. As shown in the plot, the lateral deviation is
greatly decreased within 1 second. The lateral deviation remains less than 0.05 m.
Reset Function
function in = localResetFcn(in)
in = setVariable(in,'x0_lead',40+randi(60,1,1)); % random value for initial position of lead car
in = setVariable(in,'e1_initial', 0.5*(-1+2*rand)); % random value for lateral deviation
in = setVariable(in,'e2_initial', 0.1*(-1+2*rand)); % random value for relative yaw angle
end
See Also
train
More About
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Policies and Value Functions” on page 4-2
Train DQN Agent for Lane Keeping Assist Using Parallel Computing
This example shows how to train a deep Q-learning network (DQN) agent for lane keeping assist
(LKA) in Simulink® using parallel training. For an example that shows how to train the agent without
using parallel training, see “Train DQN Agent for Lane Keeping Assist” on page 5-201.
For more information on DQN agents, see “Deep Q-Network (DQN) Agents” on page 3-23. For an
example that trains a DQN agent in MATLAB®, see “Train DQN Agent to Balance Cart-Pole System”
on page 5-50.
When you train a DQN agent in parallel, each worker generates new experiences from its own copy of
the agent and the environment. After every N steps, the worker sends experiences to the host agent.
The host agent updates its parameters as follows.
• For asynchronous training, the host agent learns from received experiences without waiting for all
workers to send experiences, and sends the updated parameters back to the worker that provided
the experiences. Then, the worker continues to generate experiences from its environment using
the updated parameters.
• For synchronous training, the host agent waits to receive experiences from all of the workers and
learns from these experiences. The host then sends updated parameters to all the workers at the
same time. Then, all workers continue to generate experiences using the updated parameters.
The reinforcement learning environment for this example is a simple bicycle model for ego vehicle
dynamics. The training goal is to keep the ego vehicle traveling along the centerline of the lanes by
adjusting the front steering angle. This example uses the same vehicle model as “Train DQN Agent
for Lane Keeping Assist” on page 5-201.
Ts = 0.1;
T = 15;
The output of the LKA system is the front steering angle of the ego car. To simulate the physical
steering limits of the ego car, constrain the steering angle to the range [–0.5,0.5] rad.
u_min = -0.5;
u_max = 0.5;
The curvature of the road is defined by a constant 0.001 m⁻¹. The initial value for the lateral
deviation is 0.2 m and the initial value for the relative yaw angle is –0.1 rad.
rho = 0.001;
e1_initial = 0.2;
e2_initial = -0.1;
mdl = 'rlLKAMdl';
open_system(mdl)
agentblk = [mdl '/RL Agent'];
• The steering-angle action signal from the agent to the environment is from –15 degrees to 15
degrees.
• The observations from the environment are the lateral deviation e1, relative yaw angle e2, their
derivatives ė1 and ė2, and their integrals ∫e1 and ∫e2.
• The simulation is terminated when the lateral deviation |e1| > 1.
• The reward rt, provided at every time step t, is
rt = −(10 e1² + 5 e2² + 2 u² + 5 ė1² + 5 ė2²)
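The environment interface below also requires an observation specification. One possible definition for the six-dimensional observation vector is the following; the signal name is a placeholder.
observationInfo = rlNumericSpec([6 1]);
observationInfo.Name = 'observations';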
actionInfo = rlFiniteSetSpec((-15:15)*pi/180);
actionInfo.Name = 'steering';
env = rlSimulinkEnv(mdl,agentblk,observationInfo,actionInfo);
The interface has a discrete action space where the agent can apply one of 31 possible steering
angles from –15 degrees to 15 degrees. The observation is the six-dimensional vector containing
lateral deviation, relative yaw angle, as well as their derivatives and integrals with respect to time.
To define the initial condition for the lateral deviation and relative yaw angle, specify an environment
reset function using an anonymous function handle. localResetFcn, which is defined at the end of
this example, randomizes the initial lateral deviation and relative yaw angle.
env.ResetFcn = @(in)localResetFcn(in);
rng(0)
DQN agents can use multi-output Q-value critic approximators, which are generally more efficient. A
multi-output approximator has observations as inputs and state-action values as outputs. Each output
element represents the expected cumulative long-term reward for taking the corresponding discrete
action from the state indicated by the observation inputs.
To create the critic, first create a deep neural network with one input (the six-dimensional observed
state) and one output vector with 31 elements (evenly spaced steering angles from -15 to 15
degrees). For more information on creating a deep neural network value function representation, see
“Create Policies and Value Functions” on page 4-2.
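The network layers below use the sizes nI, nL, and nO. You can derive the input and output sizes from the specifications; the hidden layer width is an arbitrary choice.
nI = observationInfo.Dimension(1);   % number of network inputs (6)
nO = numel(actionInfo.Elements);     % number of network outputs (31)
nL = 120;                            % number of neurons per hidden layer (arbitrary choice)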
dnn = [
featureInputLayer(nI,'Normalization','none','Name','state')
fullyConnectedLayer(nL,'Name','fc1')
reluLayer('Name','relu1')
fullyConnectedLayer(nL,'Name','fc2')
reluLayer('Name','relu2')
fullyConnectedLayer(nO,'Name','fc3')];
dnn = dlnetwork(dnn);
figure
plot(layerGraph(dnn))
criticOptions = rlOptimizerOptions('LearnRate',1e-4,'GradientThreshold',1,'L2RegularizationFactor
Create the critic representation using the specified deep neural network and options. You must also
specify the action and observation info for the critic, which you obtain from the environment
interface. For more information, see rlVectorQValueFunction.
critic = rlVectorQValueFunction(dnn,observationInfo,actionInfo);
To create the DQN agent, first specify the DQN agent options using rlDQNAgentOptions.
agentOpts = rlDQNAgentOptions(...
'SampleTime',Ts,...
'UseDoubleDQN',true,...
'CriticOptimizerOptions',criticOptions,...
'ExperienceBufferLength',1e6,...
'MiniBatchSize',256);
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-4;
Then create the DQN agent using the specified critic representation and agent options. For more
information, see rlDQNAgent.
agent = rlDQNAgent(critic,agentOpts);
Training Options
To train the agent, first specify the training options. For this example, use the following options.
• Run the training for at most 10000 episodes, with each episode lasting at most ceil(T/Ts)
time steps.
• Disable both the Episode Manager display and the command-line display (set the Plots option to
'none' and the Verbose option to false).
• Stop training when the episode reward reaches -1.
• Save a copy of the agent for each episode where the cumulative reward is greater than 100.
maxepisodes = 10000;
maxsteps = ceil(T/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes, ...
'MaxStepsPerEpisode',maxsteps, ...
'Verbose',false,...
'Plots','none',...
'StopTrainingCriteria','EpisodeReward',...
'StopTrainingValue', -1,...
'SaveAgentCriteria','EpisodeReward',...
'SaveAgentValue',100);
trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.Mode = "async";
Train Agent
Train the agent using the train function. Training the agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true. Due to
randomness of the parallel training, you can expect different training results from the plot below. The
plot shows the result of training with four workers.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load pretrained agent for the example.
load('SimulinkLKADQNParallel.mat','agent')
end
To validate the performance of the trained agent, uncomment the following two lines and simulate the
agent within the environment. For more information on agent simulation, see
rlSimulationOptions and sim.
% simOptions = rlSimulationOptions('MaxSteps',maxsteps);
% experience = sim(env,agent,simOptions);
To demonstrate the trained agent using deterministic initial conditions, simulate the model in
Simulink.
e1_initial = -0.4;
e2_initial = 0.2;
sim(mdl)
As shown below, the lateral error (middle plot) and relative yaw angle (bottom plot) are both driven to
zero. The vehicle starts off the centerline (–0.4 m) with a nonzero yaw angle error (0.2 rad). The lane
keeping assist makes the ego car travel along the centerline after about 2.5 seconds. The steering
angle (top plot) shows that the controller reaches steady state after about 2 seconds.
Local Function
function in = localResetFcn(in)
% reset
in = setVariable(in,'e1_initial', 0.5*(-1+2*rand)); % random value for lateral deviation
in = setVariable(in,'e2_initial', 0.1*(-1+2*rand)); % random value for relative yaw angle
end
See Also
train
More About
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Policies and Value Functions” on page 4-2
• “Train Agents Using Parallel Computing and GPUs” on page 5-8
See Also
Related Examples
• “Train DQN Agent for Lane Keeping Assist” on page 5-201
• “Train AC Agent to Balance Cart-Pole System Using Parallel Computing” on page 5-151
• “Train Biped Robot to Walk Using Reinforcement Learning Agents” on page 5-235
Train Biped Robot to Walk Using Reinforcement Learning Agents
This example shows how to train a biped robot to walk using both a deep deterministic policy
gradient (DDPG) agent and a twin-delayed deep deterministic policy gradient (TD3) agent. In the
example, you also compare the performance of these trained agents. The robot in this example is
modeled in Simscape™ Multibody™.
For more information on these agents, see “Deep Deterministic Policy Gradient (DDPG) Agents” on
page 3-31 and “Twin-Delayed Deep Deterministic Policy Gradient Agents” on page 3-35.
For comparison, this example trains both agents on the biped robot environment with the same model
parameters. The example also configures the agents to share common settings wherever possible.
The reinforcement learning environment for this example is a biped robot. The training goal is to
make the robot walk in a straight line using minimal control effort.
robotParametersRL
mdl = 'rlWalkingBipedRobot';
open_system(mdl)
• In the neutral 0 rad position, both of the legs are straight and the ankles are flat.
• The foot contact is modeled using the Spatial Contact Force (Simscape Multibody) block.
• The agent can control 3 individual joints (ankle, knee, and hip) on both legs of the robot by
applying torque signals from -3 to 3 N·m. The actual computed action signals are normalized
between -1 and 1.
The observations from the environment are:
• Y (lateral) and Z (vertical) translations of the torso center of mass. The translation in the Z
direction is normalized to a similar range as the other observations.
• X (forward), Y (lateral), and Z (vertical) translation velocities.
• Yaw, pitch, and roll angles of the torso.
• Yaw, pitch, and roll angular velocities of the torso.
• Angular positions and velocities of the three joints (ankle, knee, hip) on both legs.
• Action values from the previous time step.
The episode terminates if either of the following conditions occurs.
• The robot torso center of mass is less than 0.1 m in the Z direction (the robot falls) or more than 1
m in either Y direction (the robot moves too far to the side).
• The absolute value of the roll, pitch, or yaw angle is greater than 0.7854 rad.
The following reward function rt, which is provided at every time step, is inspired by [2].
rt = vx − 3y² − 50z² + 25 Ts/Tf − 0.02 ∑i (ui,t−1)²
Here:
• vx is the translation velocity in the X direction (forward toward goal) of the robot.
• y is the lateral translation displacement of the robot from the target straight line trajectory.
• z is the normalized vertical translation displacement of the robot center of mass.
• ui,t−1 is the torque from joint i from the previous time step.
• Ts is the sample time of the environment.
• Tf is the final simulation time of the environment.
This reward function encourages the agent to move forward by providing a positive reward for
positive forward velocity. It also encourages the agent to avoid episode termination by providing a
constant reward (25 Ts/Tf) at every time step. The other terms in the reward function are penalties for
substantial changes in lateral and vertical translations, and for the use of excess control effort.
numAct = 6;
actInfo = rlNumericSpec([numAct 1],'LowerLimit',-1,'UpperLimit',1);
actInfo.Name = 'foot_torque';
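The train command later in this example also requires an observation specification and an environment interface, which are not created by the preceding commands. The observation list above totals 29 signals (2 + 3 + 3 + 3 + 12 + 6), so one possible definition is the following sketch; the agent block path is a placeholder.
numObs = 29;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = 'observations';
agentblk = [mdl '/RL Agent'];   % block path is a placeholder
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo);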
This example provides the option to train the robot using either a DDPG or a TD3 agent. To
simulate the robot with the agent of your choice, set the AgentSelection flag accordingly.
AgentSelection = 'TD3';
switch AgentSelection
case 'DDPG'
agent = createDDPGAgent(numObs,obsInfo,numAct,actInfo,Ts);
case 'TD3'
agent = createTD3Agent(numObs,obsInfo,numAct,actInfo,Ts);
otherwise
disp('Enter DDPG or TD3 for AgentSelection')
end
The createDDPGAgent and createTD3Agent helper functions create the corresponding agents, including their actor and critic networks and agent options.
DDPG Agent
A DDPG agent approximates the long-term reward given observations and actions using a critic value
function representation. A DDPG agent decides which action to take given observations by using an
actor representation. The actor and critic networks for this example are inspired by [1].
For details on creating the DDPG agent, see the createDDPGAgent helper function. For
information on configuring DDPG agent options, see rlDDPGAgentOptions.
For more information on creating a deep neural network value function representation, see “Create
Policies and Value Functions” on page 4-2. For an example that creates neural networks for DDPG
agents, see “Train DDPG Agent to Control Double Integrator System” on page 5-77.
TD3 Agent
A TD3 agent approximates the long-term reward given observations and actions using two critic value
function representations. A TD3 agent decides which action to take given observations using an actor
representation. The structure of the actor and critic networks used for this agent are the same as the
ones used for DDPG agent.
A DDPG agent can overestimate the Q value. Since the agent uses the Q value to update its policy
(actor), the resultant policy can be suboptimal and accumulating training errors can lead to divergent
behavior. The TD3 algorithm is an extension of DDPG with improvements that make it more robust by
preventing overestimation of Q values [3].
• Two critic networks — TD3 agents learn two critic networks independently and use the minimum
value function estimate to update the actor (policy). Doing so prevents accumulation of error in
subsequent steps and overestimation of Q values.
• Addition of target policy noise — Adding clipped noise to the target actions smooths the Q-function
values over similar actions. Doing so prevents the policy from exploiting an incorrect sharp peak in a
noisy value estimate.
• Delayed policy and target updates — For a TD3 agent, delaying the actor network update allows
more time for the Q function to reduce error (get closer to the required target) before updating
the policy. Doing so reduces variance in the value estimates and results in a higher-quality policy
update.
For details on creating the TD3 agent, see the createTD3Agent helper function. For information
on configuring TD3 agent options, see rlTD3AgentOptions.
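These mechanisms map to properties of the rlTD3AgentOptions object. The following sketch shows how you might configure them; the values are placeholders, not the settings used by the createTD3Agent helper.
td3Opts = rlTD3AgentOptions('SampleTime',Ts); % Ts is the agent sample time used elsewhere in this example
% Delayed policy and target updates: update the actor and the targets
% less frequently than the critics (values are placeholders).
td3Opts.PolicyUpdateFrequency = 2;
td3Opts.TargetUpdateFrequency = 2;
% The clipped noise added to the target actions is configured through
% the TargetPolicySmoothModel property of the options object.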
For this example, the training options for the DDPG and TD3 agents are the same.
• Run each training session for 2000 episodes with each episode lasting at most maxSteps time
steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and
disable the command line display (set the Verbose option).
• Terminate the training only when it reaches the maximum number of episodes (maxEpisodes).
Doing so allows the comparison of the learning curves for multiple agents over the entire training
session.
maxEpisodes = 2000;
maxSteps = floor(Tf/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxEpisodes,...
'MaxStepsPerEpisode',maxSteps,...
'ScoreAveragingWindowLength',250,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','EpisodeCount',...
'StopTrainingValue',maxEpisodes,...
'SaveAgentCriteria','EpisodeCount',...
'SaveAgentValue',maxEpisodes);
To train the agent in parallel, specify the following training options. Training in parallel requires
Parallel Computing Toolbox™. If you do not have Parallel Computing Toolbox software installed, set
UseParallel to false.
trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.Mode = 'async';
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 32;
trainOpts.ParallelizationOptions.DataToSendFromWorkers = 'Experiences';
Train the agent using the train function. This process is computationally intensive and takes several
hours to complete for each agent. To save time while running this example, load a pretrained agent
by setting doTraining to false. To train the agent yourself, set doTraining to true. Due to
randomness in the parallel training, you can expect different training results from the plots that
follow. The pretrained agents were trained in parallel using four workers.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load a pretrained agent for the selected agent type.
if strcmp(AgentSelection,'DDPG')
load('rlWalkingBipedRobotDDPG.mat','agent')
else
load('rlWalkingBipedRobotTD3.mat','agent')
end
end
For the preceding training curves, the average time per training step for the DDPG and TD3 agents is
0.11 and 0.12 seconds, respectively. The TD3 agent takes more training time per step because it
updates two critic networks compared to the single critic used for DDPG.
rng(0)
To validate the performance of the trained agent, simulate it within the biped robot environment. For
more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',maxSteps);
experience = sim(env,agent,simOptions);
For the following agent comparison, each agent was trained five times using a different random seed
each time. Due to the random exploration noise and the randomness in the parallel training, the
learning curve for each run is different. Since the training of agents for multiple runs takes several
days to complete, this comparison uses pretrained agents.
For the DDPG and TD3 agents, plot the average and standard deviation of the episode reward (top
plot) and the episode Q0 value (bottom plot). The episode Q0 value is the critic estimate of the
discounted long-term reward at the start of each episode given the initial observation of the
environment. For a well-designed critic, the episode Q0 value approaches the true discounted long-
term reward.
comparePerformance('DDPGAgent','TD3Agent')
• The DDPG agent appears to pick up learning faster (around episode number 600 on average) but
hits a local minimum. TD3 starts slower but eventually achieves higher rewards than DDPG as it
avoids overestimation of Q values.
• The TD3 agent shows a steady improvement in its learning curve, which suggests improved
stability when compared to the DDPG agent.
• For the TD3 agent, the critic estimate of the discounted long-term reward (for 2000 episodes) is
lower compared to the DDPG agent. This difference is because the TD3 algorithm takes a
conservative approach in updating its targets by using a minimum of two Q functions. This
behavior is further enhanced because of delayed updates to the targets.
• Although the TD3 estimate for these 2000 episodes is low, the TD3 agent shows a steady increase
in the episode Q0 values, unlike the DDPG agent.
In this example, the training was stopped at 2000 episodes. For a larger training period, the TD3
agent with its steady increase in estimates shows the potential to converge to the true discounted
long-term reward.
For another example on how to train a humanoid robot to walk using a DDPG agent, see “Train
Humanoid Walker” (Simscape Multibody). For an example on how to train a quadruped robot to walk
using a DDPG agent, see “Quadruped Robot Locomotion Using DDPG Agent” on page 5-246.
References
[1] Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,
David Silver, and Daan Wierstra. "Continuous Control with Deep Reinforcement Learning." Preprint,
submitted July 5, 2019. https://arxiv.org/abs/1509.02971.
[2] Heess, Nicolas, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa,
et al. "Emergence of Locomotion Behaviours in Rich Environments." Preprint, submitted July 10,
2017. https://arxiv.org/abs/1707.02286.
[3] Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing Function Approximation Error in
Actor-Critic Methods." Preprint, submitted October 22, 2018. https://arxiv.org/abs/1802.09477.
See Also
train
More About
• “Reinforcement Learning Agents” on page 3-2
• “Train Reinforcement Learning Agents” on page 5-3
• “Define Reward Signals” on page 2-14
• “Train Agents Using Parallel Computing and GPUs” on page 5-8
See Also
Related Examples
• “Train AC Agent to Balance Cart-Pole System Using Parallel Computing” on page 5-151
• “Quadruped Robot Locomotion Using DDPG Agent” on page 5-246
Quadruped Robot Locomotion Using DDPG Agent
This example shows how to train a quadruped robot to walk using a deep deterministic policy
gradient (DDPG) agent. The robot in this example is modeled using Simscape™ Multibody™. For
more information on DDPG agents, see "Deep Deterministic Policy Gradient (DDPG) Agents".
Initialize the robot parameters in the MATLAB workspace by running the initializeRobotParameters script.
initializeRobotParameters
The environment for this example is a quadruped robot, and the training goal is to make the robot
walk in a straight line using minimal control effort.
mdl = "rlQuadrupedRobot";
open_system(mdl)
The robot is modeled using Simscape Multibody with its main structural components consisting of
four legs and a torso. The legs are connected to the torso through revolute joints that enable rotation
of the legs with respect to the torso. The joints are actuated by torque signals provided by the RL
Agent.
Observations
The robot environment provides 44 observations to the agent, each normalized between –1 and 1.
For all four legs, the initial values for the hip and knee joint angles are set to –0.8234 and 1.6468
radians, respectively. The neutral positions of the joints are set at 0 radian. The legs are in neutral
position when they are stretched to their maximum and are aligned perpendicularly to the ground.
Actions
The agent generates eight actions normalized between –1 and 1. After multiplying with a scaling
factor, these correspond to the eight joint torque signals for the revolute joints. The overall joint
torque bounds are +/– 10 N·m for each joint.
Reward
The following reward is provided to the agent at each time step during training. This reward function encourages the agent to move forward by providing a positive reward for positive forward velocity. It also encourages the agent to avoid early termination by providing a constant reward (25 Ts/Tf) at each time step. The remaining terms in the reward function are penalties that discourage unwanted states, such as large deviations from the desired height and orientation or the use of excessive joint torques.
rt = vx + 25 (Ts/Tf) − 50 y² − 20 θ² − 0.02 Σi (ui,t−1)²
where vx is the forward velocity of the torso, Ts and Tf are the sample time and the final simulation time, y is the deviation of the torso height from the desired height, θ is the orientation deviation of the torso, and ui,t−1 is the torque applied to joint i in the previous time step.
Episode Termination
During training or simulation, the episode terminates if any of the following situations occur.
• The height of the torso center of mass from the ground is below 0.5 m (fallen).
• The head or tail of the torso is below the ground.
• Any knee joint is below the ground.
• Roll, pitch, or yaw angles are outside bounds (+/– 0.1745, +/– 0.1745, and +/– 0.3491 radians,
respectively).
During training, the reset function introduces random deviations into the initial joint angles and
angular velocities.
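A minimal sketch of creating the environment interface for this model follows, assuming normalized observation and action specifications and an RL Agent block at the top level of the model; the example's actual code may differ.
obsInfo = rlNumericSpec([44 1],"LowerLimit",-1,"UpperLimit",1);
actInfo = rlNumericSpec([8 1],"LowerLimit",-1,"UpperLimit",1);
blk = mdl + "/RL Agent";
env = rlSimulinkEnv(mdl,blk,obsInfo,actInfo);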
env.ResetFcn = @quadrupedResetFcn;
The DDPG agent approximates the long-term reward given observations and actions by using a critic
value function representation. The agent also decides which action to take given the observations,
using an actor representation. The actor and critic networks for this example are inspired by [2].
For more information on creating a deep neural network value function representation, see “Create
Policies and Value Functions” on page 4-2. For an example that creates neural networks for DDPG
agents, see “Train DDPG Agent to Control Double Integrator System” on page 5-77.
rng(0)
Create the networks in the MATLAB workspace using the createNetworks helper function.
createNetworks
You can also create your actor and critic networks interactively using the Deep Network Designer
app.
plot(criticNetwork)
plot(actorNetwork)
Specify the DDPG agent options.
agentOptions = rlDDPGAgentOptions();
agentOptions.SampleTime = Ts;
agentOptions.DiscountFactor = 0.99;
agentOptions.MiniBatchSize = 128;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.TargetSmoothFactor = 1e-3;
agentOptions.NoiseOptions.MeanAttractionConstant = 0.15;
agentOptions.NoiseOptions.Variance = 0.1;
agentOptions.ActorOptimizerOptions.Algorithm = "adam";
agentOptions.ActorOptimizerOptions.LearnRate = 1e-4;
agentOptions.ActorOptimizerOptions.GradientThreshold = 1;
agentOptions.ActorOptimizerOptions.L2RegularizationFactor = 1e-5;
agentOptions.CriticOptimizerOptions.Algorithm = "adam";
agentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
agentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agentOptions.CriticOptimizerOptions.L2RegularizationFactor = 2e-4;
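% A minimal sketch of creating the actor and critic objects from the
% networks built by createNetworks. The specification variables (obsInfo,
% actInfo) and the automatic input assignment are assumptions.
critic = rlQValueFunction(criticNetwork,obsInfo,actInfo);
actor = rlContinuousDeterministicActor(actorNetwork,obsInfo,actInfo);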
agent = rlDDPGAgent(actor,critic,agentOptions);
To train the agent, first specify the following training options using rlTrainingOptions.
• Run the training for at most 10,000 episodes, with each episode lasting at most floor(Tf/Ts) time steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option).
• Stop training when the agent receives an average cumulative reward greater than 210 over 250 consecutive episodes.
• Save a copy of the agent for each episode where the cumulative reward is greater than 220.
trainOpts = rlTrainingOptions(...
MaxEpisodes=10000,...
MaxStepsPerEpisode=floor(Tf/Ts),...
ScoreAveragingWindowLength=250,...
Plots="training-progress",...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=210,...
SaveAgentCriteria="EpisodeReward",...
SaveAgentValue=220);
To train the agent in parallel, specify the following training options. Training in parallel requires
Parallel Computing Toolbox™ software. If you do not have Parallel Computing Toolbox™ software
installed, set UseParallel to false.
trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.Mode = "async";
trainOpts.ParallelizationOptions.DataToSendFromWorkers = "Experiences";
Train Agent
Train the agent using the train function. Due to the complexity of the robot model, this process is
computationally intensive and takes several hours to complete. To save time while running this
example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set
doTraining to true. Due to the randomness of parallel training, you can expect different training
results from the plot below.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load pretrained agent parameters for the example.
load("rlQuadrupedAgentParams.mat","params")
setLearnableParameters(agent, params);
end
rng(0)
To validate the performance of the trained agent, simulate it within the robot environment. For more
information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions(MaxSteps=floor(Tf/Ts));
experience = sim(env,agent,simOptions);
References
[1] Heess, Nicolas, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, et al. "Emergence of Locomotion Behaviours in Rich Environments." Preprint, submitted July 10, 2017. https://arxiv.org/abs/1707.02286.
[2] Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. "Continuous Control with Deep Reinforcement Learning." Preprint, submitted July 5, 2019. https://arxiv.org/abs/1509.02971.
See Also
train
More About
• “Reinforcement Learning Agents” on page 3-2
• “Train Reinforcement Learning Agents” on page 5-3
• “Define Reward Signals” on page 2-14
Related Examples
• “Train Biped Robot to Walk Using Reinforcement Learning Agents” on page 5-235
Train SAC Agent for Ball Balance Control
This example shows how to train a soft actor-critic (SAC) reinforcement learning agent to control a
robot arm for a ball-balancing task.
Introduction
The robot arm in this example is a Kinova Gen3 robot, which is a seven degree-of-freedom (DOF)
manipulator. The arm is tasked to balance a ping pong ball at the center of a flat surface (plate)
attached to the robot gripper. Only the final two joints are actuated and contribute to motion in the
pitch and roll axes as shown in the following figure. The remaining joints are fixed and do not
contribute to motion.
Open the Simulink® model to view the system. The model contains a Kinova Ball Balance subsystem
connected to an RL Agent block. The agent applies an action to the robot subsystem and receives the
resulting observation, reward, and is-done signals.
open_system("rlKinovaBallBalance")
In this model:
• The physical components of the system (manipulator, ball, and plate) are modeled using
Simscape™ Multibody™ components.
If you have the Robotics System Toolbox Robot Library Data support package, you can view a 3-D animation of the manipulator in the Mechanics Explorer. To do so, open the 7 DOF Manipulator subsystem and set its Visualization parameter to 3D Mesh. If you do not have the support package installed, set the Visualization parameter to None. To download and install the support package, use the Add-On Explorer. For more information, see "Get and Manage Add-Ons".
Create the parameters for the example by running the kinova_params script included with this
example. When you have the Robotics System Toolbox Robot Library Data support package installed,
this script also adds the necessary mesh files to the MATLAB® path.
kinova_params
Define Environment
To train a reinforcement learning agent, you must define the environment with which it will interact.
For the ball balancing environment:
• The observations are represented by a 22 element vector that contains information about the
positions (sine and cosine of joint angles) and velocities (joint angle derivatives) of the two
actuated joints, positions (x and y distances from plate center) and velocities (x and y derivatives)
of the ball, orientation (quaternions) and velocities (quaternion derivatives) of the plate, joint
torques from the last time step, ball radius, and mass.
• The actions are normalized joint torque values.
• The sample time is Ts = 0.01 s, and the simulation time is Tf = 10 s.
• The simulation terminates when the ball falls off the plate.
• The reward rt at time step t combines three terms: rball, a reward for the ball moving closer to the center of the plate; rplate, a penalty for plate orientation; and raction, a penalty for control effort. Here, ϕ, θ, and ψ are the respective roll, pitch, and yaw angles of the plate in radians, and τ1 and τ2 are the joint torques.
Create the observation and action specifications for the environment using continuous observation
and action spaces.
Create the Simulink environment interface using the observation and action specifications. For more
information on creating Simulink environments, see rlSimulinkEnv.
mdl = "rlKinovaBallBalance";
blk = mdl + "/RL Agent";
env = rlSimulinkEnv(mdl,blk,obsInfo,actInfo);
Specify a reset function for the environment using the ResetFcn parameter.
env.ResetFcn = @kinovaResetFcn;
This reset function (provided at the end of this example) randomly initializes the initial x and y
positions of the ball with respect to the center of the plate. For more robust training, you can also
randomize other parameters inside the reset function, such as the mass and radius of the ball.
Ts = 0.01;
Tf = 10;
Create Agent
The agent in this example is a soft actor-critic (SAC) agent. SAC agents have critics that approximate
the expectation of the value function given the states and actions and an actor that models a
stochastic policy. The agent selects an action based on this policy. For more information on SAC
agents, see “Soft Actor-Critic Agents” on page 3-57.
The SAC agent in this example uses two critics to learn the optimal Q-value function. Using two
critics helps avoid overfitting when learning the Q-function. To create the critics, first create a deep
neural network with two inputs (the observation and action) and one output. For more information on
creating deep neural networks for reinforcement learning agents, see “Create Policies and Value
Functions” on page 4-2.
When using two critics, a SAC agent requires them to have different initial parameters. Create and
initialize two dlnetwork objects.
criticdlnet = dlnetwork(criticNetwork,'Initialize',false);
criticdlnet1 = initialize(criticdlnet);
criticdlnet2 = initialize(criticdlnet);
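A minimal sketch of creating the two critic objects from these initialized networks follows; the specification variables (obsInfo, actInfo) and the network input layer names ("observation" and "action") are assumptions.
critic1 = rlQValueFunction(criticdlnet1,obsInfo,actInfo, ...
    "ObservationInputNames","observation","ActionInputNames","action");
critic2 = rlQValueFunction(criticdlnet2,obsInfo,actInfo, ...
    "ObservationInputNames","observation","ActionInputNames","action");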
The actor function in a SAC agent is a stochastic actor with a continuous action space, which you define as an rlContinuousGaussianActor object. Create a deep neural network to model the actor policy.
% Create the actor network layers.
anet = [
featureInputLayer(numObs,"Normalization","none","Name","observation")
fullyConnectedLayer(128,"Name","fc1")
reluLayer("Name","relu1")
fullyConnectedLayer(64,"Name","fc2")
reluLayer("Name","relu2")];
meanPath = [
fullyConnectedLayer(32,"Name","meanFC")
reluLayer("Name","relu3")
fullyConnectedLayer(numAct,"Name","mean")];
stdPath = [
fullyConnectedLayer(numAct,"Name","stdFC")
reluLayer("Name","relu4")
softplusLayer("Name","std")];
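% Assemble the complete actor network by connecting the mean and standard
% deviation paths to the common path. This assembly is a sketch; the
% connection points follow the layer names defined above.
actorNet = layerGraph(anet);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,stdPath);
actorNet = connectLayers(actorNet,"relu2","meanFC");
actorNet = connectLayers(actorNet,"relu2","stdFC");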
"ObservationInputNames","observation", ...
"ActionMeanOutputNames","mean", ...
"ActionStandardDeviationOutputNames","std");
The SAC agent in this example trains from an experience buffer of maximum capacity 1e6 by randomly selecting mini-batches of size 128. The discount factor of 0.99 is close to 1 and therefore favors long-term rewards. For a full list of SAC hyperparameters and their descriptions, see rlSACAgentOptions.
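A minimal sketch of an options object consistent with these values follows (the example's actual call may set additional properties).
agentOpts = rlSACAgentOptions( ...
    "SampleTime",Ts, ...
    "DiscountFactor",0.99, ...
    "ExperienceBufferLength",1e6, ...
    "MiniBatchSize",128);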
For this example the actor and critic neural networks are updated using the Adam algorithm with a
learn rate of 1e-4 and gradient threshold of 1. Specify the optimizer parameters.
agentOpts.ActorOptimizerOptions.Algorithm = "adam";
agentOpts.ActorOptimizerOptions.LearnRate = 1e-4;
agentOpts.ActorOptimizerOptions.GradientThreshold = 1;
for ct = 1:2
agentOpts.CriticOptimizerOptions(ct).Algorithm = "adam";
agentOpts.CriticOptimizerOptions(ct).LearnRate = 1e-4;
agentOpts.CriticOptimizerOptions(ct).GradientThreshold = 1;
end
agent = rlSACAgent(actor,[critic1,critic2],agentOpts);
Train Agent
To train the agent, first specify the training options using rlTrainingOptions. For this example,
use the following options:
• Run the training for at most 5000 episodes, with each episode lasting at most floor(Tf/Ts) time steps.
• Stop training when the agent receives an average cumulative reward greater than 675 over 100 consecutive episodes.
• To speed up training, set the UseParallel option to true. Doing so is optional and requires Parallel Computing Toolbox™ software.
trainOpts = rlTrainingOptions(...
"MaxEpisodes", 5000, ...
"MaxStepsPerEpisode", floor(Tf/Ts), ...
"ScoreAveragingWindowLength", 100, ...
"Plots", "training-progress", ...
"StopTrainingCriteria", "AverageReward", ...
"StopTrainingValue", 675, ...
"UseParallel", false);
For parallel training, specify a list of supporting files. These files are required to model the Kinova
robot in the parallel workers when the CAD geometry rendering option is selected.
if trainOpts.UseParallel
trainOpts.ParallelizationOptions.AttachedFiles = [pwd,filesep] + ...
["bracelet_with_vision_link.STL";
"half_arm_2_link.STL";
"end_effector_link.STL";
"shoulder_link.STL";
"base_link.STL";
"forearm_link.STL";
"spherical_wrist_1_link.STL";
"bracelet_no_vision_link.STL";
"half_arm_1_link.STL";
"spherical_wrist_2_link.STL"];
end
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
stats = train(agent,env,trainOpts);
else
load("kinovaBallBalanceAgent.mat")
end
A snapshot of training progress is shown in the following figure. You can expect different results due
to randomness in the training process.
Define an arbitrary initial position for the ball with respect to the plate center. To view the agent
performance in different situations, change this to other locations on the plate.
ball.x0 = 0.10;
ball.y0 = -0.10;
in = Simulink.SimulationInput(mdl);
in = setVariable(in,"ball",ball);
in = setPostSimFcn(in,@animatedPath);
out = sim(in);
View the trajectory of the ball using the Ball Position scope block.
function in = kinovaResetFcn(in)
% Ball parameters
ball.radius = 0.02; % m
ball.mass = 0.0027; % kg
ball.shell = 0.0002; % m
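% Randomize the initial x and y position of the ball with respect to the
% plate center, as described above. The ranges here are illustrative
% assumptions rather than the example's exact values.
ball.x0 = -0.1 + 0.2*rand; % m
ball.y0 = -0.1 + 0.2*rand; % m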
in = setVariable(in,"ball",ball);
% Animation
in = setPostSimFcn(in, @animatedPath);
end
Automatic Parking Valet with Unreal Engine Simulation
This example shows the design of a hybrid controller for an automatic searching and parking task.
You will learn how to combine adaptive model predictive control (MPC) with reinforcement learning
(RL) to perform a parking maneuver. The path of the vehicle is visualized using the Unreal Engine®
simulation environment.
The control objective is to park the vehicle in an empty spot after starting from an initial pose. The
control algorithm executes a series of maneuvers while sensing and avoiding obstacles in tight
spaces. It switches between an adaptive MPC controller and a reinforcement learning agent to
complete the parking maneuver. The adaptive MPC controller moves the vehicle at a constant speed
along a reference path while searching for an empty parking spot. When a spot is found, the
reinforcement learning agent takes over and executes a pretrained parking maneuver. Prior
knowledge of the environment (the parking lot) including the locations of the empty spots and parked
vehicles is available to both the controllers.
Parking Lot
The parking lot environment used in this example is a subsection of the Large Parking Lot
(Automated Driving Toolbox) scene. The parking lot is represented by the ParkingLotEnvironment
object, which stores information on the ego vehicle, empty parking spots, and static obstacles (parked
cars and boundaries). Each parking spot has a unique index number and an indicator light that is
either green (free) or red (occupied). Parked vehicles are represented in black while other boundaries
are outlined in green.
Specify a sample time Ts (seconds) for the controllers and a simulation time Tf (seconds).
Ts = 0.1;
Tf = 50;
Create a reference path for the ego vehicle using the createReferenceTrajectory helper
function included with this example. The reference path starts from the south-east corner of the
parking lot and ends in the west as displayed with the dashed pink line.
xRef = createReferenceTrajectory(Ts,Tf);
Create a ParkingLotEnvironment object with a free spot at index 32 and the specified reference
path xRef.
freeSpotIndex = 32;
map = ParkingLotEnvironment(freeSpotIndex,"Route",xRef);
Specify an initial pose (X0, Y0, θ0) for the ego vehicle. The units of X0 and Y0 are meters, and θ0 is in radians.
Compute the target pose for the vehicle using the createTargetPose function. The target pose corresponds to the parking location at freeSpotIndex.
egoTargetPose = createTargetPose(map,freeSpotIndex);
Sensor Modules
The parking algorithm uses geometric approximations of a camera and a lidar sensor to gather
information from the parking lot environment.
Camera
In this example, the field of view of a camera mounted on the ego vehicle is represented by the area
shaded in green in the following figure. The camera has a field of view φ bounded by ±π/3 radians
and a maximum measurement depth dmax of 10 m. As the ego vehicle moves forward, the camera
module senses the parking spots within the field of view and determines whether a spot is free or
occupied. For simplicity, this action is implemented using geometrical relationships between the spot
locations and the current vehicle pose. A parking spot is within the camera range if di ≤ dmax and
φmin ≤ φi ≤ φmax, where di is the distance to the parking spot and φi is the angle to the parking spot.
Lidar
The lidar sensor in this example is modeled using radial line segments emerging from the geometric
center of the vehicle. Distances to obstacles are measured along these line segments. The maximum
measurable lidar distance along any line segment is 6 m. The reinforcement learning agent uses
these readings to determine the proximity of the ego vehicle to other vehicles in the environment.
Create the parameters for the example by running the autoParkingValetParams3D script.
autoParkingValetParams3D
The parking valet model, including the controllers, ego vehicle, sensors, and parking lot, is
implemented in a Simulink model. Open the model.
mdl = "rlAutoParkingValet3D";
open_system(mdl)
In this model:
• The vehicle dynamics are modeled in the Ego Vehicle Model subsystem. The dynamics are represented by a single-track bicycle kinematics model with two input signals: vehicle speed v (m/s) and steering angle δ (radians).
• The adaptive MPC and RL agent blocks are found in the MPC Tracking Controller and RL
Controller subsystems, respectively.
• Mode switching between the controllers is handled by the Vehicle Mode subsystem, which outputs
Search and Park signals. Initially, the vehicle is in search mode and the adaptive MPC controller
tracks the reference path. When a free spot is found, the Park signal activates the RL Agent to
perform the parking maneuver.
• The Visualization Subsystem handles animation of the environment. Double-click this subsystem to
specify the visualization options.
If you have Automated Driving Toolbox™ software installed, you can set the Unreal Engine
Visualization parameter to On to display the vehicle animation in the Unreal Engine environment.
Enabling the Unreal Engine simulation can degrade simulation performance. Therefore, set the
parameter to Off when training the agent.
Create the adaptive MPC controller object for reference trajectory tracking using the
createMPCForParking script. For more information on adaptive MPC, see “Adaptive MPC” (Model
Predictive Control Toolbox).
createMPCForParking3D;
To train the reinforcement learning agent, you must create an environment interface and an agent
object.
Create Environment
The environment for training is the region shaded in red in the following figure. Due to symmetry in
the parking lot, training within this region is sufficient for the policy to adjust to other regions after
coordinate transformations are applied to the observations. Constraining the training to this region
also significantly reduces training duration when compared to training over the entire parking lot
space.
• The training region is a 13.625 m x 12.34 m space with the target spot at its horizontal center.
• The observations are the position errors Xe and Y e of the ego vehicle with respect to the target
pose, the sine and cosine of the true heading angle θ, and the lidar sensor readings.
• The vehicle speed during parking is a constant 2 m/s.
• The action signal is the steering angle, which ranges between +/- π/4 radians.
• The vehicle is considered parked if the errors with respect to target pose are within specified
tolerances of +/- 0.75 m (position) and +/-10 degrees (orientation).
• The episode terminates if the ego vehicle goes out of the bounds of the training region, collides
with an obstacle, or parks successfully.
• The reward rt provided at time t is:
rt = 2 e^(−(0.05 Xe² + 0.04 Ye²)) + 0.5 e^(−40 θe²) − 0.05 δ² + 100 ft − 50 gt
Here, Xe, Ye, and θe are the position and heading angle errors of the ego vehicle from the target pose, while δ is the steering angle. ft (0 or 1) indicates whether the vehicle has parked, and gt (0 or 1) indicates whether the vehicle has collided with an obstacle or left the training region at time t.
The coordinate transformations on the vehicle pose (X, Y, θ) observations for different parking spot locations are as follows:
• Parking spots 1-14: X̄ = X, Ȳ = Y + 20.41, θ̄ = θ
• Parking spots 15-28: X̄ = 41 − X, Ȳ = −64.485 − Y, θ̄ = θ − π
• Parking spots 29-37: no transformation
• Parking spots 38-46: X̄ = 41 − X, Ȳ = −84.48 − Y, θ̄ = θ − π
nObs = 16;
nAct = 1;
observationInfo = rlNumericSpec([nObs 1]);
observationInfo.Name = "observations";
actionInfo = rlNumericSpec([nAct 1],"LowerLimit",-1,"UpperLimit",1);
actionInfo.Name = "actions";
Create the Simulink environment interface, specifying the path to the RL Agent block.
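A minimal sketch of this step follows, assuming the RL Agent block resides in the RL Controller subsystem of the model.
blk = mdl + "/RL Controller/RL Agent";
env = rlSimulinkEnv(mdl,blk,observationInfo,actionInfo);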
Specify a reset function for training. The autoParkingValetResetFcn function resets the initial
pose of the ego vehicle to random values at the start of each episode.
env.ResetFcn = @autoParkingValetResetFcn3D;
Create Agent
The agent in this example is a twin-delayed deep deterministic policy gradient (TD3) agent. TD3
agents rely on actor and critic functions to learn the optimal policy. To learn more about TD3 agents,
see “Twin-Delayed Deep Deterministic Policy Gradient Agents” on page 3-35.
rng(0)
To create the critic function, first create a deep neural network with two inputs (the 16-element observation and the scalar action) and one output. The output of the critic network is the state-action value function for taking a given action from a given observation.
cnet = [
featureInputLayer(nObs,"Normalization","none","Name","State")
fullyConnectedLayer(128, "Name", "fc1")
concatenationLayer(1,2,"Name","concat")
reluLayer("Name","relu1")
fullyConnectedLayer(128, "Name","fc3")
reluLayer("Name","relu2")
fullyConnectedLayer(1,"Name","CriticOutput")];
actionPath = [
featureInputLayer(nAct,"Normalization","none","Name","Action")
fullyConnectedLayer(128,"Name","fc2")];
criticNet = layerGraph(cnet);
criticNet = addLayers(criticNet, actionPath);
criticNet = connectLayers(criticNet,"fc2","concat/in2");
Create the Q-value function for the critics. For more information see rlQValueFunction.
criticdlnet = dlnetwork(criticNet);
critic1 = rlQValueFunction(criticNet,observationInfo,actionInfo,...
"ObservationInputNames","State","ActionInputNames","Action");
critic2 = rlQValueFunction(criticNet,observationInfo,actionInfo,...
"ObservationInputNames","State","ActionInputNames","Action");
Create the actor neural network. The output of the actor network is the steering angle.
anet = [featureInputLayer(nObs,"Normalization","none","Name","State")
fullyConnectedLayer(128, "Name","actorFC1")
reluLayer("Name","relu1")
fullyConnectedLayer(128,"Name","actorFC2")
reluLayer("Name","relu2")
fullyConnectedLayer(nAct,"Name","Action")
tanhLayer("Name","tanh1")];
actorNet = layerGraph(anet);
Create the actor function for the TD3 agent. For more information see
rlContinuousDeterministicActor.
actordlnet = dlnetwork(actorNet);
actor = rlContinuousDeterministicActor(actordlnet,observationInfo,actionInfo,...
"ObservationInputNames","State");
Specify the agent options and create the TD3 agent. For more information on TD3 agent options, see
rlTD3AgentOptions.
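A minimal sketch of creating the options object before setting its exploration properties follows, using the controller sample time Ts defined earlier; other values keep their defaults.
agentOpts = rlTD3AgentOptions("SampleTime",Ts);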
agentOpts.ExplorationModel.StandardDeviation = 0.1;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-4;
agentOpts.ExplorationModel.StandardDeviationMin = 0.01;
Set the optimizer parameters. For this example, set the actor and critic learn rates to 1e-3 and 2e-3, respectively, and set a gradient threshold of 1 to limit the gradients during training.
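A minimal sketch consistent with these values follows; the option names are those of rlTD3AgentOptions, and the agent is then created from the actor and critics defined above.
agentOpts.ActorOptimizerOptions.LearnRate = 1e-3;
agentOpts.ActorOptimizerOptions.GradientThreshold = 1;
for ct = 1:2
    agentOpts.CriticOptimizerOptions(ct).LearnRate = 2e-3;
    agentOpts.CriticOptimizerOptions(ct).GradientThreshold = 1;
end
agent = rlTD3Agent(actor,[critic1,critic2],agentOpts);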
Train Agent
The agent is trained for a maximum of 10000 episodes with each episode lasting a maximum of 200
time steps. The training terminates when the maximum number of episodes is reached or the average
reward over 200 episodes reaches the value of 120 or more. Specify the options for training using the
rlTrainingOptions function.
trainOpts = rlTrainingOptions(...
"MaxEpisodes",10000,...
"MaxStepsPerEpisode",200,...
"ScoreAveragingWindowLength",200,...
"Plots","training-progress",...
"StopTrainingCriteria","AverageReward",...
"StopTrainingValue",120);
Train the agent using the train function. Fully training this agent is a computationally intensive
process that may take several hours to complete. To save time while running this example, load a
pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to
true.
doTraining = false;
if doTraining
trainingResult = train(agent,env,trainOpts);
else
load("rlAutoParkingValetAgent.mat","agent");
end
To validate the trained agent, simulate the model and observe the parking maneuver.
sim(mdl);
The vehicle tracks the reference path using the MPC controller before switching to the RL controller
when the target spot is detected. The vehicle then completes the parking maneuver.
Turn on the Unreal Engine visualization by opening the Visualization block and setting the Unreal Engine Visualization parameter to On. Initializing the Unreal Engine simulation environment can take a few seconds. Then, simulate the model again with a different free parking spot.
freeSpotIndex = 20;
sim(mdl)
Generate Reward Function from a Model Predictive Controller for a Servomotor
This example shows how to automatically generate a reward function from cost and constraint
specifications defined in a model predictive controller object. You then use the generated reward
function to train a reinforcement learning agent.
Introduction
You can use the generateRewardFunction function to generate a reward function for reinforcement learning, starting from the cost and constraints specified in a model predictive controller. The resulting reward signal is the negative of the sum of the cost (as defined by the objective function) and of penalties for constraint violations, evaluated at the current state of the environment.
This example is based on the "DC Servomotor with Constraint on Unmeasured Output" (Model Predictive Control Toolbox) example, in which you design a model predictive controller for a DC
servomechanism under voltage and shaft torque constraints. Here, you will convert the cost and
constraints specifications defined in the mpc object into a reward function and use it to train an agent
to control the servomotor.
Open the Simulink model for this example, which is based on the above MPC example but has been modified for reinforcement learning.
open_system('rl_motor')
Create the open-loop dynamic model of the motor, plant, and the maximum admissible torque, tau, using a helper function.
[plant,tau] = mpcmotormodel;
Specify input and output signal types for the MPC controller. The shaft angular position is measured as the first output. The second output, torque, is unmeasured.
plant = setmpcsignals(plant,'MV',1,'MO',1,'UO',2);
MV = struct('Min',-220,'Max',220,'ScaleFactor',440);
Impose torque constraints during the first three prediction steps, and specify scale factors for both the shaft position and the torque.
OV = struct('Min',{-Inf, [-tau;-tau;-tau;-Inf]},...
'Max',{Inf, [tau;tau;tau;Inf]},...
'ScaleFactor',{2*pi, 2*tau});
Specify weights for the quadratic cost function to achieve angular position tracking. Set to zero the
weight for the torque, thereby allowing it to float within its constraint.
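A minimal sketch of the weight structure follows, using values consistent with the controller display shown below (the example's actual code may differ).
Weights = struct('ManipulatedVariables',0, ...
    'ManipulatedVariablesRate',0.1, ...
    'OutputVariables',[0.1 0]);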
Create an MPC controller for the plant model with a sample time of 0.1 s, a prediction horizon of 10 steps, and a control horizon of 2 steps, using the previously defined structures for the weights, manipulated variables, and output variables.
mpcobj = mpc(plant,0.1,10,2,Weights,MV,OV);
mpcobj
Plant Model:
--------------
1 manipulated variable(s) -->| 4 states |
| |--> 1 measured output(s)
0 measured disturbance(s) -->| 1 inputs |
| |--> 1 unmeasured output(s)
0 unmeasured disturbance(s) -->| 2 outputs |
--------------
Indices:
(input vector) Manipulated variables: [1 ]
(output vector) Measured outputs: [1 ]
Unmeasured outputs: [2 ]
Weights:
ManipulatedVariables: 0
ManipulatedVariablesRate: 0.1000
OutputVariables: [0.1000 0]
ECR: 10000
Constraints:
-220 <= MV1 (V) <= 220, MV1/rate (V) is unconstrained, MO1 (rad) is unconstrained
-78.54 <= UO1 (Nm)(t+1) <= 78.54
-78.54 <= UO1 (Nm)(t+2) <= 78.54
-78.54 <= UO1 (Nm)(t+3) <= 78.54
UO1 (Nm)(t+4) is unconstrained
The controller operates on a plant with 4 states, 1 input (voltage) and 2 output signals (angle and
torque) and has the following specifications:
• The cost function weights for the manipulated variable, manipulated variable rate and output
variables are 0, 0.1 and [0.1 0] respectively.
• The manipulated variable is constrained between -220V and 220V.
• The manipulated variable rate is unconstrained.
• The first output variable (angle) is unconstrained but the second (torque) is constrained between
-78.54 Nm and 78.54 Nm in the first three prediction time steps and unconstrained in the fourth
step.
Note that, for reinforcement learning, only the constraint specifications from the first prediction time step are used, since the reward is computed for a single time step.
Generate the reward function code from specifications in the mpc object using
generateRewardFunction. The code is displayed in the MATLAB Editor.
generateRewardFunction(mpcobj)
The generated reward function is a starting point for reward design. You can modify the function by choosing different penalty functions and by tuning the weights. For this example, make the following changes to the generated code:
• Scale the original cost weights Qy, Qmv and Qmvrate by a factor of 100.
• The default exterior penalty function method is step. Change the method to quadratic.
After you make changes, the cost and penalty specifications should be as follows:
Qy = [10 0];
Qmv = 0;
Qmvrate = 10;
Py = Wy * exteriorPenalty(y,ymin,ymax,'quadratic');
Pmv = Wmv * exteriorPenalty(mv,mvmin,mvmax,'quadratic');
Pmvrate = Wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'quadratic');
For this example, the modified code has been saved in the MATLAB function file
rewardFunctionMpc.m. Display the generated reward function.
type rewardFunctionMpc.m
function reward = rewardFunctionMpc(y,refy,mv,refmv,lastmv)
%#codegen
%% Compute cost
dy = (refy(:)-y(:)) ./ Sy';
dmv = (refmv(:)-mv(:)) ./ Smv';
dmvrate = (mv(:)-lastmv(:)) ./ Smv';
Jy = dy' * diag(Qy.^2) * dy;
Jmv = dmv' * diag(Qmv.^2) * dmv;
Jmvrate = dmvrate' * diag(Qmvrate.^2) * dmvrate;
Cost = Jy + Jmv + Jmvrate;
%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
%
% Set Pmv value to 0 if the RL agent action specification has
% appropriate 'LowerLimit' and 'UpperLimit' values.
Py = Wy * exteriorPenalty(y,ymin,ymax,'quadratic');
Pmv = Wmv * exteriorPenalty(mv,mvmin,mvmax,'quadratic');
Pmvrate = Wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'quadratic');
Penalty = Py + Pmv + Pmvrate;
%% Compute reward
reward = -(Cost + Penalty);
end
To integrate this reward function, open the MATLAB Function block in the Simulink model.
open_system('rl_motor/Reward Function')
Add the following line of code at the end of the function and save the model.
r = rewardFunctionMpc(y,refy,mv,refmv,lastmv);
The MATLAB Function block will now execute rewardFunctionMpc.m during simulation.
For this example, the MATLAB Function block has already been modified and saved.
The environment dynamics are modeled in the Servomechanism subsystem. For this environment,
• The observations are the reference and actual output variables (angle and torque) from the last 8
time steps.
• The action is the voltage V applied to the servomotor.
• The sample time is Ts = 0.1 s.
• The total simulation time is Tf = 20 s.
Tf = 20;
Ts = 0.1;
numObs = 32;
numAct = 1;
oinfo = rlNumericSpec([numObs 1]);
ainfo = rlNumericSpec([numAct 1],'LowerLimit',-220,'UpperLimit',220);
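A minimal sketch of creating the environment interface follows; the RL Agent block path inside rl_motor is an assumption.
env = rlSimulinkEnv('rl_motor','rl_motor/RL Agent',oinfo,ainfo);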
The agent in this example is a Twin Delayed Deep Deterministic Policy Gradient (TD3) agent.
Specify the agent options using rlTD3AgentOptions. The agent trains from an experience buffer of
maximum capacity 1e6 by randomly selecting mini-batches of size 256. The discount factor of 0.995
favors long-term rewards.
agentOpts = rlTD3AgentOptions("SampleTime",Ts, ...
"DiscountFactor", 0.995, ...
"ExperienceBufferLength",1e6, ...
"MiniBatchSize",256);
The exploration model in this TD3 agent is Gaussian. The noise model adds a zero-mean Gaussian random value to the action during training. Set the standard deviation of the noise to 100. The standard deviation decays at the rate of 1e-5 every agent step until it reaches the minimum value of 0.005.
agentOpts.ExplorationModel.StandardDeviationMin = 0.005;
agentOpts.ExplorationModel.StandardDeviation = 100;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-5;
Create the TD3 agent using the actor and critic representations. For more information on TD3 agents,
see rlTD3Agent.
agent = rlTD3Agent(actor,[critic1,critic2],agentOpts);
To train the agent, first specify the training options using rlTrainingOptions. For this example,
use the following options:
• Run the training for at most 2000 episodes, with each episode lasting at most ceil(Tf/Ts) time steps.
• Stop the training when the agent receives an average cumulative reward greater than -2 over 20
consecutive episodes. At this point, the agent can track the reference signal.
trainOpts = rlTrainingOptions(...
'MaxEpisodes',2000, ...
'MaxStepsPerEpisode',ceil(Tf/Ts), ...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-2,...
'ScoreAveragingWindowLength',20);
Train the agent using the train function. Training this agent is a computationally intensive process
that may take several minutes to complete. To save time while running this example, load a
pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to
true.
doTraining = false;
if doTraining
trainingStats = train(agent,env,trainOpts);
else
load('rlDCServomotorTD3Agent.mat')
end
A snapshot of the training progress is shown in the following figure. You can expect different results
due to inherent randomness in the training process.
To validate the performance of the trained agent, simulate the model and view the response in the
Scope blocks. The reinforcement learning agent is able to track the reference angle while satisfying
the constraints on torque and voltage.
sim('rl_motor');
close_system('rl_motor')
Generate Reward Function from a Model Verification Block for a Water Tank System
This example shows how to automatically generate a reward function from performance requirements defined in a Simulink® Design Optimization™ model verification block. You then use the generated reward function to train a reinforcement learning agent.
Introduction
You can use the generateRewardFunction function to generate a reward function for reinforcement learning, starting from performance constraints specified in a Simulink Design Optimization model verification block. The resulting reward signal is the negative of a weighted sum of penalties on constraint violations by the current state of the environment.
In this example, you convert the constraint specifications defined in a Check Step Response Characteristics block for a water tank system into a reward function. You then use this reward function to train an agent to control the water tank.
% Watertank parameters
a = 2;
b = 5;
A = 20;
The original model for this example is the “watertank Simulink Model” (Simulink Control Design).
open_system('rlWatertankStepInput')
The model in this example has been modified for reinforcement learning. The goal is to control the
level of the water in the tank using a reinforcement learning agent, while satisfying the response
characteristics defined in the Check Step Response Characteristics block. Open the block to view
the desired step response specifications.
blk = 'rlWatertankStepInput/WaterLevelStepResponse';
open_system(blk)
Generate the reward function code from specifications in the WaterLevelStepResponse block using
generateRewardFunction. The code is displayed in the MATLAB Editor.
generateRewardFunction(blk)
The generated reward function is a starting point for reward design. You can modify the function by choosing different penalty functions and tuning the penalty weights. For this example, modify the generated code so that the weight and penalty specifications are as follows:
Weight = 10;
Penalty = sum(exteriorPenalty(x,Block1_xmin,Block1_xmax,'quadratic'));
For this example, the modified code has been saved in the MATLAB function file
rewardFunctionVfb.m. Display the generated reward function.
type rewardFunctionVfb.m
function reward = rewardFunctionVfb(x,t)
%
% x : Input of watertank_stepinput_rl/WaterLevelStepResponse
% t : Simulation time (s)
%#codegen
if t >= Block1_StepTime
if Block1_InitialValue <= Block1_FinalValue
Block1_UpperBoundTimes = [0,5; 5,max(5+1,t+1)];
Block1_UpperBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettlin
Block1_LowerBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
Block1_LowerBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,
else
Block1_UpperBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
Block1_UpperBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,
Block1_LowerBoundTimes = [0,5; 5,max(5+1,t+1)];
Block1_LowerBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettlin
end
Block1_xmax = zeros(1,size(Block1_UpperBoundTimes,1));
for idx = 1:numel(Block1_xmax)
tseg = Block1_UpperBoundTimes(idx,:);
xseg = Block1_UpperBoundAmplitudes(idx,:);
Block1_xmax(idx) = interp1(tseg,xseg,t,'linear',NaN);
end
if all(isnan(Block1_xmax))
Block1_xmax = Inf;
else
Block1_xmax = max(Block1_xmax,[],'omitnan');
end
Block1_xmin = zeros(1,size(Block1_LowerBoundTimes,1));
for idx = 1:numel(Block1_xmin)
tseg = Block1_LowerBoundTimes(idx,:);
xseg = Block1_LowerBoundAmplitudes(idx,:);
Block1_xmin(idx) = interp1(tseg,xseg,t,'linear',NaN);
end
if all(isnan(Block1_xmin))
Block1_xmin = -Inf;
else
Block1_xmin = max(Block1_xmin,[],'omitnan');
end
else
Block1_xmin = -Inf;
Block1_xmax = Inf;
end
%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
Penalty = sum(exteriorPenalty(x,Block1_xmin,Block1_xmax,'quadratic'));
%% Compute reward
reward = -Weight * Penalty;
end
To integrate this reward function in the water tank model, open the MATLAB Function block under
the Reward Subsystem.
open_system('rlWatertankStepInput/Reward/Reward Function')
Add the following line of code at the end of the function and save the model.
r = rewardFunctionVfb(x,t);
The MATLAB Function block will now execute rewardFunctionVfb.m for computing rewards.
For this example, the MATLAB Function block has already been modified and saved.
The environment dynamics are modeled in the Water-Tank Subsystem. For this environment,
• The observations are the reference height ref from the last 5 time steps and the height error err = ref − H.
• The action is the voltage V applied to the pump.
• The sample time Ts is 0.1 s.
rng(100)
The agent in this example is a Twin Delayed Deep Deterministic Policy Gradient (TD3) agent.
Specify the agent options using rlTD3AgentOptions. The agent trains from an experience buffer of
maximum capacity 1e6 by randomly selecting mini-batches of size 256. The discount factor of 0.99
favors long-term rewards.
agentOpts = rlTD3AgentOptions("SampleTime",Ts, ...
"DiscountFactor",0.99, ...
"ExperienceBufferLength",1e6, ...
"MiniBatchSize",256);
The exploration model in this TD3 agent is Gaussian. The noise model adds a zero-mean Gaussian random value to the action during training. Set the standard deviation of the noise to 0.5. The standard deviation decays at the rate of 1e-5 every agent step until it reaches the minimum value of 0.
agentOpts.ExplorationModel.StandardDeviation = 0.5;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-5;
agentOpts.ExplorationModel.StandardDeviationMin = 0;
Create the TD3 agent using the actor and critic representations. For more information on TD3 agents,
see rlTD3Agent.
agent = rlTD3Agent(actor,[critic1,critic2],agentOpts);
To train the agent, first specify the training options using rlTrainingOptions. For this example,
use the following options:
• Run training for at most 100 episodes, with each episode lasting at most ceil(Tf/Ts) time
steps, where the total simulation time Tf is 10 s.
• Stop the training when the agent receives an average cumulative reward greater than -5 over 20
consecutive episodes. At this point, the agent can track the reference height.
trainOpts = rlTrainingOptions(...
'MaxEpisodes',100, ...
'MaxStepsPerEpisode',ceil(Tf/Ts), ...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-5,...
'ScoreAveragingWindowLength',20);
Train the agent using the train function. Training this agent is a computationally intensive process
that may take several minutes to complete. To save time while running this example, load a
pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to
true.
doTraining = false;
if doTraining
trainingStats = train(agent,env,trainOpts);
else
load('rlWatertankTD3Agent.mat')
end
A snapshot of the training progress is shown in the following figure. You can expect different results
due to inherent randomness in the training process.
Simulate the model to view the closed loop step response. The reinforcement learning agent is able to
track the reference height while satisfying the step response constraints.
sim('rlWatertankStepInput');
close_system('rlWatertankStepInput')
Train TD3 Agent for PMSM Control
This example demonstrates speed control of a permanent magnet synchronous motor (PMSM) using a
twin delayed deep deterministic policy gradient (TD3) agent.
The goal of this example is to show that you can use reinforcement learning as an alternative to linear
controllers, such as PID controllers, in speed control of PMSM systems. Outside their regions of
linearity, linear controllers often do not produce good tracking performance. In such cases,
reinforcement learning provides a nonlinear control alternative.
model: 'BoostXL-DRV8305'
sn: 'INV_XXXX'
V_dc: 24
I_trip: 10
Rds_on: 0.0020
Rshunt: 0.0070
CtSensAOffset: 2295
CtSensBOffset: 2286
CtSensCOffset: 2295
ADCGain: 1
EnableLogic: 1
invertingAmp: 1
ISenseVref: 3.3000
ISenseVoltPerAmp: 0.0700
ISenseMax: 21.4286
R_board: 0.0043
CtSensOffsetMax: 2500
CtSensOffsetMin: 1500
model: 'LAUNCHXL-F28379D'
sn: '123456'
CPU_frequency: 200000000
PWM_frequency: 5000
PWM_Counter_Period: 20000
ADC_Vref: 3
ADC_MaxCount: 4095
SCI_baud_rate: 12000000
V_base: 13.8564
I_base: 21.4286
N_base: 3476
T_base: 1.0249
P_base: 445.3845
mdl = 'mcb_pmsm_foc_sim_RL';
open_system(mdl)
In a linear control version of this example, you can use PI controllers in both the speed and current
control loops. An outer-loop PI controller can control the speed while two inner-loop PI controllers
control the d-axis and q-axis currents. The overall goal is to track the reference speed in the
Speed_Ref signal. This example uses a reinforcement learning agent to control the currents in the
inner control loop while a PI controller controls the outer loop.
The environment in this example consists of the PMSM system, excluding the inner-loop current
controller, which is the reinforcement learning agent. To view the interface between the
reinforcement learning agent and the environment, open the Closed Loop Control subsystem.
The Reinforcement Learning subsystem contains an RL Agent block, the creation of the observation
vector, and the reward calculation.
• The observations are the outer-loop reference speed Speed_ref, speed feedback Speed_fb, d-
axis and q-axis currents and errors (id, iq, iderror and iqerror), and the error integrals.
• The actions from the agent are the voltages vd_rl and vq_rl.
• The sample time of the agent is 2e-4 seconds. The inner-loop control occurs at a different sample
time than the outer loop control.
• The simulation runs for 5000 time steps unless it is terminated early when the iqref signal is
saturated at 1.
• The reward at each time step is:
rt = −(Q1 · iderror² + Q2 · iqerror² + R · Σj uj,t−1²) − 100 d
Here, Q1 = Q2 = 5 and R = 0.1 are constants, iderror is the d-axis current error, iqerror is the q-axis current error, uj,t−1 are the actions from the previous time step, and d is a flag that is equal to 1 when the simulation is terminated early.
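The following MATLAB sketch illustrates this computation. In the example, the reward is computed inside the Reinforcement Learning subsystem; the function and variable names here are assumptions for illustration only.
function rt = computeReward(iderror,iqerror,uPrev,d)
% Illustrative per-step reward: penalize current tracking errors and
% control effort, with a large penalty on early termination.
Q1 = 5; Q2 = 5; R = 0.1;
rt = -(Q1*iderror^2 + Q2*iqerror^2 + R*sum(uPrev.^2)) - 100*d;
end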
Create the observation and action specifications for the environment. For information on creating
continuous specifications, see rlNumericSpec.
% Create observation specifications.
numObservations = 8;
observationInfo = rlNumericSpec([numObservations 1],"DataType",dataType);
observationInfo.Name = 'observations';
observationInfo.Description = 'Information on error and reference signal';
numActions = 2;
actionInfo = rlNumericSpec([numActions 1],"DataType",dataType);
actionInfo.Name = 'vqdRef';
Create the Simulink environment interface using the observation and action specifications. For more
information on creating Simulink environments, see rlSimulinkEnv.
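A minimal sketch of this step follows; the RL Agent block path is an assumption, so use the actual path of the RL Agent block inside the Reinforcement Learning subsystem.
blk = [mdl '/Closed Loop Control/Reinforcement Learning/RL Agent'];
env = rlSimulinkEnv(mdl,blk,observationInfo,actionInfo);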
Provide a reset function for this environment using the ResetFcn parameter. At the beginning of
each training episode, the resetPMSM function randomly initializes the final value of the reference
speed in the SpeedRef block to 695.4 rpm (0.2 pu), 1390.8 rpm (0.4 pu), 2086.2 rpm (0.6 pu), or
2781.6 rpm (0.8 pu).
env.ResetFcn = @resetPMSM;
Create Agent
The agent used in this example is a twin-delayed deep deterministic policy gradient (TD3) agent. A
TD3 agent approximates the long-term reward given the observations and actions using two critics.
For more information on TD3 agents, see “Twin-Delayed Deep Deterministic Policy Gradient Agents”
on page 3-35.
To create the critics, first create a deep neural network with two inputs (the observation and action)
and one output. For more information on creating a value function representation, see “Create
Policies and Value Functions” on page 4-2.
statePath = [featureInputLayer(numObservations,'Normalization','none','Name','State')
fullyConnectedLayer(64,'Name','fc1')];
actionPath = [featureInputLayer(numActions, 'Normalization', 'none', 'Name','Action')
fullyConnectedLayer(64, 'Name','fc2')];
commonPath = [additionLayer(2,'Name','add')
reluLayer('Name','relu2')
fullyConnectedLayer(32, 'Name','fc3')
reluLayer('Name','relu3')
fullyConnectedLayer(16, 'Name','fc4')
fullyConnectedLayer(1, 'Name','CriticOutput')];
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'fc1','add/in1');
criticNetwork = connectLayers(criticNetwork,'fc2','add/in2');
Create the critic representation using the specified neural network and options. You must also specify
the action and observation specification for the critic. For more information, see
rlQValueRepresentation.
criticOptions = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
critic1 = rlQValueRepresentation(criticNetwork,observationInfo,actionInfo,...
'Observation',{'State'},'Action',{'Action'},criticOptions);
critic2 = rlQValueRepresentation(criticNetwork,observationInfo,actionInfo,...
'Observation',{'State'},'Action',{'Action'},criticOptions);
A TD3 agent decides which action to take given the observations using an actor representation. To
create the actor, first create a deep neural network and construct the actor in a similar manner to the
critic. For more information, see rlDeterministicActorRepresentation.
actorNetwork = [featureInputLayer(numObservations,'Normalization','none','Name','State')
fullyConnectedLayer(64, 'Name','actorFC1')
reluLayer('Name','relu1')
fullyConnectedLayer(32, 'Name','actorFC2')
reluLayer('Name','relu2')
fullyConnectedLayer(numActions,'Name','Action')
tanhLayer('Name','tanh1')];
actorOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1,'L2RegularizationFa
actor = rlDeterministicActorRepresentation(actorNetwork,observationInfo,actionInfo,...
'Observation',{'State'},'Action',{'tanh1'},actorOptions);
To create the TD3 agent, first specify the agent options using an rlTD3AgentOptions object. The
agent trains from an experience buffer of maximum capacity 2e6 by randomly selecting mini-batches
of size 512. Use a discount factor of 0.995 to favor long-term rewards. TD3 agents maintain time-
delayed copies of the actor and critics known as the target actor and critics. Configure the targets to
update every 10 agent steps during training with a smoothing factor of 0.005.
Ts_agent = Ts;
agentOptions = rlTD3AgentOptions("SampleTime",Ts_agent, ...
"DiscountFactor", 0.995, ...
"ExperienceBufferLength",2e6, ...
"MiniBatchSize",512, ...
"NumStepsToLookAhead",1, ...
"TargetSmoothFactor",0.005, ...
"TargetUpdateFrequency",10);
During training, the agent explores the action space using a Gaussian action noise model. Set the
noise variance and decay rate using the ExplorationModel property. The noise variance decays at
the rate of 2e-4, which favors exploration towards the beginning of training and exploitation in later
stages. For more information on the noise model, see rlTD3AgentOptions.
agentOptions.ExplorationModel.Variance = 0.05;
agentOptions.ExplorationModel.VarianceDecayRate = 2e-4;
agentOptions.ExplorationModel.VarianceMin = 0.001;
The agent also uses a Gaussian action noise model for smoothing the target policy updates. Specify
the variance and decay rate for this model using the TargetPolicySmoothModel property.
agentOptions.TargetPolicySmoothModel.Variance = 0.1;
agentOptions.TargetPolicySmoothModel.VarianceDecayRate = 1e-4;
Create the agent using the specified actor, critics, and options.
agent = rlTD3Agent(actor,[critic1,critic2],agentOptions);
Train Agent
To train the agent, first specify the training options using rlTrainingOptions. For this example,
use the following options.
• Run the training for at most 1000 episodes, with each episode lasting at most ceil(T/Ts_agent) time steps.
• Stop training when the agent receives an average cumulative reward greater than -190 over 100
consecutive episodes. At this point, the agent can track the reference speeds.
T = 1.0;
maxepisodes = 1000;
maxsteps = ceil(T/Ts_agent);
trainingOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes, ...
'MaxStepsPerEpisode',maxsteps, ...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-190,...
'ScoreAveragingWindowLength',100);
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
trainingStats = train(agent,env,trainingOpts);
else
load('rlPMSMAgent.mat')
end
A snapshot of the training progress is shown in the following figure. You can expect different results
due to randomness in the training process.
Simulate Agent
To validate the performance of the trained agent, simulate the model and view the closed-loop
performance through the Speed Tracking Scope block.
sim(mdl);
You can also simulate the model at different reference speeds. Set the reference speed in the
SpeedRef block to a different value between 0.2 and 1.0 per-unit and simulate the model again.
set_param('mcb_pmsm_foc_sim_RL/SpeedRef','After','0.6')
sim(mdl);
The following figure shows an example of closed-loop tracking performance. In this simulation, the
reference speed steps through values of 695.4 rpm (0.2 per-unit) and 1738.5 rpm (0.5 pu). The PI and
reinforcement learning controllers track the reference signal changes within 0.5 seconds.
Although the agent was trained to track the reference speed of 0.2 per-unit and not 0.5 per-unit, it
was able to generalize well.
The following figure shows the corresponding current tracking performance. The agent was able to
track the id and iq current references with steady-state error less than 2%.
See Also
rlTD3Agent | train
More About
• “Train Reinforcement Learning Agents” on page 5-3
• “Twin-Delayed Deep Deterministic Policy Gradient Agents” on page 3-35
Water Distribution System Scheduling Using Reinforcement Learning
This example shows how to learn an optimal pump scheduling policy for a water distribution system
using reinforcement learning (RL).
Here:
• QSupply is the amount of water supplied to the water tank from a reservoir.
• QDemand is the amount of water flowing out of the tank to satisfy usage demand.
The objective of the reinforcement learning agent is to schedule the number of pumps running to
both minimize the energy usage of the system and satisfy the usage demand (h > 0). The dynamics of
the tank system are governed by the following equation.
A · dh/dt = QSupply(t) − QDemand(t)
Here, A = 40 m² and hmax = 7 m. The demand over a 24-hour period is a function of time, given as
QDemand(t) = μ(t) + η(t)
where μ(t) is the expected demand and η(t) represents the demand uncertainty, which is sampled from a uniform random distribution.
QSupply(t) = Q(a) = {0, 164, 279, 344} m³/hr for a = {0, 1, 2, 3}, respectively.
To simplify the problem, power consumption is defined as the number of pumps running, a.
The following function is the reward for this environment. To avoid overflowing or emptying the tank,
an additional cost is added if the water height is close to the maximum or minimum water levels, hmax
or hmin, respectively.
r(h, a) = −10 (h ≥ hmax − 0.1) − 10 (h ≤ 0.1) − a
where each inequality term evaluates to 1 when the condition is true and 0 otherwise.
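For reference, a minimal MATLAB sketch of this reward computation (the function name and signature are illustrative, not part of the model):
function r = waterTankReward(h,a,hmax)
% Penalize water levels near the tank limits and penalize the number of running pumps.
r = -10*(h >= hmax - 0.1) - 10*(h <= 0.1) - a;
end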
To generate a water demand profile based on the number of days considered, use the generateWaterDemand function defined at the end of this example.
plot(WaterDemand)
mdl = "watertankscheduling";
open_system(mdl)
In addition to the reinforcement learning agent, a simple baseline controller is defined in the Control
law MATLAB Function block. This controller activates a certain number of pumps depending on the
water level.
h0 = 3; % m
SampleTime = 0.2;
H_max = 7; % Max tank height (m)
A_tank = 40; % Area of tank (m^2)
To create an environment interface for the Simulink model, first define the action and observation
specifications, actInfo and obsInfo, respectively. The agent action is the selected number of
pumps. The agent observation is the water height, which is measured as a continuous-time signal.
actInfo = rlFiniteSetSpec([0,1,2,3]);
obsInfo = rlNumericSpec([1,1]);
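The environment object itself is not shown in this excerpt. A minimal sketch of its creation, assuming the agent block in the model is named RL Agent:
agentBlk = mdl + "/RL Agent"; % block name is an assumption
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);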
Specify a custom reset function, which is defined at the end of this example, that randomizes the
initial water height and the water demand. Doing so allows the agent to be trained on different initial
water levels and water demand functions for each episode.
env.ResetFcn = @(in)localResetFcn(in);
A DQN agent approximates the long-term reward, given observations and actions, using a critic Q-
value function representation. To create the critic, first create a deep neural network. For more
information on creating a deep neural network value function representation, see “Create Policies
and Value Functions” on page 4-2.
Create a deep neural network for the critic. For this example, use a non-recurrent neural network. To
use a recurrent neural network, set useLSTM to true.
useLSTM = false;
if useLSTM
layers = [
sequenceInputLayer(obsInfo.Dimension(1),"Name","state","Normalization","none")
fullyConnectedLayer(32,"Name","fc_1")
reluLayer("Name","relu_body1")
lstmLayer(32,"Name","lstm")
fullyConnectedLayer(32,"Name","fc_3")
reluLayer("Name","relu_body3")
fullyConnectedLayer(numel(actInfo.Elements),"Name","output")];
else
layers = [
featureInputLayer(obsInfo.Dimension(1),"Name","state","Normalization","none")
fullyConnectedLayer(32,"Name","fc_1")
reluLayer("Name","relu_body1")
fullyConnectedLayer(32,"Name","fc_2")
reluLayer("Name","relu_body2")
fullyConnectedLayer(32,"Name","fc_3")
reluLayer("Name","relu_body3")
fullyConnectedLayer(numel(actInfo.Elements),"Name","output")];
end
dnn = dlnetwork(layers);
criticOpts = rlOptimizerOptions('LearnRate',0.001,'GradientThreshold',1);
Create a critic using rlVectorQValueFunction with the defined deep neural network. The optimizer options are passed to the agent later through its CriticOptimizerOptions property.
critic = rlVectorQValueFunction(dnn,obsInfo,actInfo);
To create an agent, first specify agent options. If you are using an LSTM network, set the sequence
length to 20.
opt = rlDQNAgentOptions('SampleTime',SampleTime);
if useLSTM
opt.SequenceLength = 20;
else
opt.SequenceLength = 1;
end
opt.DiscountFactor = 0.995;
opt.ExperienceBufferLength = 1e6;
opt.EpsilonGreedyExploration.EpsilonDecay = 1e-5;
opt.EpsilonGreedyExploration.EpsilonMin = .02;
opt.CriticOptimizerOptions = criticOpts;
Create the agent using the defined options and critic representation.
agent = rlDQNAgent(critic,opt);
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run training for 1000 episodes, with each episode lasting at most ceil(T_max/SampleTime) time steps (see the sketch after this list).
• Display the training progress in the Episode Manager dialog box (set the Plots option).
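The rlTrainingOptions call itself is not shown in this excerpt. A sketch consistent with the options listed above (any values not stated above are assumptions):
trainOpts = rlTrainingOptions(...
'MaxEpisodes',1000, ...
'MaxStepsPerEpisode',ceil(T_max/SampleTime), ...
'Plots','training-progress', ...
'Verbose',false);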
Although this example does not do so, you can save agents during the training process. For example, the following options save every agent with an episode reward greater than or equal to -42.
% Save agents using SaveAgentCriteria if necessary.
trainOpts.SaveAgentCriteria = 'EpisodeReward';
trainOpts.SaveAgentValue = -42;
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several hours to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
% Connect the RL Agent block by toggling the manual switch block
set_param(mdl+"/Manual Switch",'sw','0');
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained agent for the example.
load('SimulinkWaterDistributionDQN.mat','agent')
end
To validate the performance of the trained agent, simulate it in the water-tank environment. For more
information on agent simulation, see rlSimulationOptions and sim.
To simulate the agent performance, connect the RL Agent block by toggling the manual switch block.
set_param(mdl+"/Manual Switch",'sw','0');
Set the maximum number of steps for each simulation and the number of simulations. For this
example, run 30 simulations. The environment reset function sets a different initial water height and water demand in each simulation.
NumSimulations = 30;
simOptions = rlSimulationOptions('MaxSteps',T_max/SampleTime,...
'NumSimulations', NumSimulations);
To compare the agent with the baseline controller under the same conditions, reset the initial random
seed used in the environment reset function.
env.ResetFcn("Reset seed");
experienceDQN = sim(env,agent,simOptions);
To compare the DQN agent with the baseline controller, you must simulate the baseline controller using
the same simulation options and initial random seed for the reset function.
set_param(mdl+"/Manual Switch",'sw','1');
To compare the agent with the baseline controller under the same conditions, reset the random seed
used in the environment reset function.
env.ResetFcn("Reset seed");
experienceBaseline = sim(env,agent,simOptions);
Initialize cumulative reward result vectors for both the agent and baseline controller. Then calculate the cumulative reward for each simulation.
resultVectorDQN = zeros(NumSimulations,1);
resultVectorBaseline = zeros(NumSimulations,1);
for ct = 1:NumSimulations
resultVectorDQN(ct) = sum(experienceDQN(ct).Reward);
resultVectorBaseline(ct) = sum(experienceBaseline(ct).Reward);
end
plot([resultVectorDQN resultVectorBaseline],'o')
set(gca,'xtick',1:NumSimulations)
xlabel("Simulation number")
ylabel('Cumulative Reward')
legend('DQN','Baseline','Location','NorthEastOutside')
The cumulative reward obtained by the agent is consistently around –40. This value is much greater
than the average reward obtained by the baseline controller. Therefore, the DQN agent consistently
outperforms the baseline controller in terms of energy savings.
Local Functions
Water Demand Function
t = 0:(num_days*24)-1; % hr
T_max = t(end);
Demand_mean = [28, 28, 28, 45, 55, 110, 280, 450, 310, 170, 160, 145, 130, ...
150, 165, 155, 170, 265, 360, 240, 120, 83, 45, 28]'; % m^3/hr
Demand = repmat(Demand_mean,1,num_days);
Demand = Demand(:);
WaterDemand.TimeInfo.Units = 'hours';
end
Reset Function
function in = localResetFcn(in)
% Use a persistent random seed value to evaluate the agent and the baseline
% controller under the same conditions.
persistent randomSeed
if isempty(randomSeed)
randomSeed = 0;
end
if strcmp(in,"Reset seed")
randomSeed = 0;
return
end
randomSeed = randomSeed + 1;
rng(randomSeed)
in = setBlockParameter(in,blk,'Value',num2str(h0));
end
See Also
Imitate MPC Controller for Lane Keeping Assist
This example shows how to train, validate, and test a deep neural network that imitates the behavior
of a model predictive controller for an automotive lane keeping assist system. In the example, you
also compare the behavior of the deep neural network with that of the original controller.
If the training of the network sufficiently traverses the state-space for the application, you can create
a reasonable approximation of the controller behavior. You can then deploy the network for your
control application. You can also use the network as a warm starting point for training the actor
network of a reinforcement learning agent. For an example, see “Train DDPG Agent with Pretrained
Actor Network” on page 5-325.
Design an MPC controller for lane keeping assist. To do so, first create a dynamic model for the
vehicle.
[sys,Vx] = createModelForMPCImLKA;
Create and design the MPC controller object mpcobj. Also, create an mpcstate object for setting the
initial controller state. For details on the controller design, type edit createMPCobjImLKA.
[mpcobj,initialState] = createMPCobjImLKA(sys);
For more information on designing model predictive controllers for lane keeping assist applications,
see “Lane Keeping Assist System Using Model Predictive Control” (Model Predictive Control Toolbox)
and “Lane Keeping Assist with Lane Detection” (Model Predictive Control Toolbox).
Load the input data from InputDataFileImLKA.mat. The columns of the data set follow:
1 Lateral velocity Vy
2 Yaw angle rate r
3 Lateral deviation e1
4 Relative yaw angle e2
5 Previous steering angle (control variable) u
6 Measured disturbance (road yaw rate: longitudinal velocity * curvature (ρ))
7 Cost function value
8 MPC iterations
9 Steering angle computed by MPC controller: u*
The data in InputDataFileImLKA.mat was created by computing the MPC control action for
randomly generated states, previous control actions, and measured disturbances. To generate your
own training data, use the collectDataImLKA function.
dataStruct = load('InputDataFileImLKA.mat');
data = dataStruct.Data;
Divide the input data into training, validation, and testing data. First, determine the number of
validation data rows based on a given percentage.
totalRows = size(data,1);
validationSplitPercent = 0.1;
numValidationDataRows = floor(validationSplitPercent*totalRows);
testSplitPercent = 0.05;
numTestDataRows = floor(testSplitPercent*totalRows);
Randomly extract validation and testing data from the input data set. To do so, first randomly extract
enough rows for both data sets.
randomIdx = randperm(totalRows,numValidationDataRows + numTestDataRows);
randomData = data(randomIdx,:);
validationData = randomData(1:numValidationDataRows,:);
testData = randomData(numValidationDataRows + 1:end,:);
trainDataIdx = setdiff(1:totalRows,randomIdx);
trainData = data(trainDataIdx,:);
numTrainDataRows = size(trainData,1);
shuffleIdx = randperm(numTrainDataRows);
shuffledTrainData = trainData(shuffleIdx,:);
Extract the network inputs (the first six columns) and the output (the MPC steering command in column 9) from the training and validation data for use with trainNetwork.
numObservations = 6;
numActions = 1;
trainInput = shuffledTrainData(:,1:6);
trainOutput = shuffledTrainData(:,9);
validationInput = validationData(:,1:6);
validationOutput = validationData(:,9);
validationCellArray = {validationInput,validationOutput};
testDataInput = testData(:,1:6);
testDataOutput = testData(:,9);
Create the deep neural network that will imitate the MPC controller after training.
imitateMPCNetwork = [
featureInputLayer(numObservations,'Normalization','none','Name','InputLayer')
fullyConnectedLayer(45,'Name','Fc1')
reluLayer('Name','Relu1')
fullyConnectedLayer(45,'Name','Fc2')
reluLayer('Name','Relu2')
fullyConnectedLayer(45,'Name','Fc3')
reluLayer('Name','Relu3')
fullyConnectedLayer(numActions,'Name','OutputLayer')
tanhLayer('Name','Tanh1')
scalingLayer('Name','Scale1','Scale',1.04)
regressionLayer('Name','RegressionOutput')
];
plot(layerGraph(imitateMPCNetwork))
Train the deep neural network. To view detailed training information in the Command Window, set
the Verbose training option to true.
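The trainingOptions object used below is not shown in this excerpt. A minimal sketch, in which the solver and all hyperparameter values are illustrative assumptions rather than the original settings:
options = trainingOptions('adam', ...
'Verbose',false, ...
'Plots','training-progress', ...
'Shuffle','every-epoch', ...
'MiniBatchSize',512, ...
'ValidationData',validationCellArray, ...
'InitialLearnRate',1e-3, ...
'MaxEpochs',30);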
imitateMPCNetObj = trainNetwork(trainInput,trainOutput,imitateMPCNetwork,options);
Training of the deep neural network stops after the final iteration.
The training and validation loss are nearly the same for each mini-batch, which indicates that the
trained network does not overfit.
Check that the trained deep neural network returns steering angles similar to the MPC controller
control actions given the test input data. Compute the network output using the predict function.
predictedTestDataOutput = predict(imitateMPCNetObj,testDataInput);
Calculate the root mean squared error (RMSE) between the network output and the testing data.
The small RMSE value indicates that the network outputs closely reproduce the MPC controller
outputs.
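A one-line sketch of this calculation (the variable name testRMSE is illustrative):
testRMSE = sqrt(mean((predictedTestDataOutput - testDataOutput).^2));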
To compare the performance of the MPC controller and the trained deep neural network, run closed-
loop simulations using the vehicle plant model.
Generate random initial conditions for the vehicle that are not part of the original input data set.
rng(5e7)
[x0,u0,rho] = generateRandomDataImLKA(data);
Set the initial plant state and control action in the mpcstate object.
initialState.Plant = x0;
initialState.LastMove = u0;
Extract the sample time from the MPC controller. Also, set the number of simulation steps.
Ts = mpcobj.Ts;
Tsteps = 30;
A = sys.A;
B = sys.B;
Initialize the state and input trajectories for the MPC controller simulation.
xHistoryMPC = repmat(x0',Tsteps+1,1);
uHistoryMPC = repmat(u0',Tsteps,1);
Run a closed-loop simulation of the MPC controller and the plant using the mpcmove function.
for k = 1:Tsteps
% Obtain the plant output measurements, which correspond to the plant states.
xk = xHistoryMPC(k,:)';
% Compute the next control action using the MPC controller.
uk = mpcmove(mpcobj,initialState,xk,zeros(1,4),Vx*rho);
% Store the control action.
uHistoryMPC(k,:) = uk;
% Update the state using the control action.
xHistoryMPC(k+1,:) = (A*xk + B*[uk;Vx*rho])';
end
Initialize the state and input trajectories for the deep neural network simulation.
xHistoryDNN = repmat(x0',Tsteps+1,1);
uHistoryDNN = repmat(u0',Tsteps,1);
lastMV = u0;
Run a closed-loop simulation of the trained network and the plant. The neuralnetLKAmove function
computes the deep neural network output using the predict function.
for k = 1:Tsteps
% Obtain the plant output measurements, which correspond to the plant states.
xk = xHistoryDNN(k,:)';
% Predict the next move using the trained deep neural network.
uk = neuralnetLKAmove(imitateMPCNetObj,xk,lastMV,rho);
% Store the control action and update the last MV for the next step.
uHistoryDNN(k,:) = uk;
lastMV = uk;
% Update the state using the control action.
xHistoryDNN(k+1,:) = (A*xk + B*[uk;Vx*rho])';
end
Plot the results to compare the MPC controller and trained deep neural network (DNN) trajectories.
plotValidationResultsImLKA(Ts,xHistoryDNN,uHistoryDNN,xHistoryMPC,uHistoryMPC);
The deep neural network successfully imitates the behavior of the MPC controller. The vehicle state
and control action trajectories for the controller and the deep neural network closely align.
See Also
trainNetwork | predict | mpcmove
More About
• “Lane Keeping Assist System Using Model Predictive Control” (Model Predictive Control
Toolbox)
• “Lane Keeping Assist with Lane Detection” (Model Predictive Control Toolbox)
Train DDPG Agent with Pretrained Actor Network
This example shows how to train a deep deterministic policy gradient (DDPG) agent for lane keeping
assist (LKA) in Simulink. To make training more efficient, the actor of the DDPG agent is initialized
with a deep neural network that was previously trained using supervised learning. This actor network is trained in the “Imitate MPC Controller for Lane Keeping Assist” on page 5-317 example.
For more information on DDPG agents, see “Deep Deterministic Policy Gradient (DDPG) Agents” on
page 3-31.
Simulink Model
The training goal for the lane-keeping application is to keep the ego vehicle traveling along the centerline of a lane by adjusting the front steering angle. This example uses the same ego vehicle
dynamics and sensor dynamics as the “Train DQN Agent for Lane Keeping Assist” on page 5-201
example.
Ts = 0.1;
T = 15;
The output of the LKA system is the front steering angle of the ego vehicle. Considering the physical
limitations of the ego vehicle, constrain its steering angle to the range [-60,60] degrees. Specify the
constraints in radians.
u_min = -1.04;
u_max = 1.04;
rho = 0.001;
Set initial values for the lateral deviation (e1_initial) and the relative yaw angle (e2_initial).
During training, these initial conditions are set to random values for each training episode.
e1_initial = 0.2;
e2_initial = -0.1;
mdl = 'rlActorLKAMdl';
open_system(mdl)
Create Environment
Create a reinforcement learning environment interface for the ego vehicle. To do so, first define the
observation and action specifications. These observations and actions are the same as the features for
supervised learning used in “Imitate MPC Controller for Lane Keeping Assist” on page 5-317.
The six observations for the environment are the lateral velocity vy, yaw rate ψ̇, lateral deviation e1,
relative yaw angle e2, steering angle at previous step u0, and curvature ρ.
observationInfo = rlNumericSpec([6 1]);
observationInfo.Name = "observations";
The action for the environment is the front steering angle. Specify the steering angle constraints
when creating the action specification object.
actionInfo = rlNumericSpec([1 1], ...
LowerLimit=u_min, ...
UpperLimit=u_max);
actionInfo.Name = "steering";
In the model, the Signal Processing for LKA block creates the observation vector signal, computes the
reward function, and calculates the stop signal.
The reward rt, provided at every time step t, is as follows, where u is the control input from the previous time step t − 1.
rt = −(10 e1² + 5 e2² + 2 u² + 5 ė1² + 5 ė2²)
To define the initial condition for lateral deviation and relative yaw angle, specify an environment
reset function using an anonymous function handle. The localResetFcn function, which is defined
at the end of the example, sets the initial lateral deviation and relative yaw angle to random values.
env.ResetFcn = @(in)localResetFcn(in);
rng(0)
A DDPG agent approximates the long-term reward given observations and actions using a critic value
function representation. To create the critic, first create a deep neural network with two inputs, the
state and action, and one output. For more information on creating a deep neural network value
function representation, see “Create Policies and Value Functions” on page 4-2.
A DDPG agent decides which action to take given observations using an actor representation. To
create the actor, first create a deep neural network with one input (the observation) and one output
(the action).
These initial actor and critic networks have random initial parameter values.
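The actorOptions and criticOptions objects referenced in the agent options below are not shown in this excerpt. They can be created with rlOptimizerOptions; for example (the learning rates are assumptions):
actorOptions = rlOptimizerOptions(LearnRate=1e-4,GradientThreshold=1);
criticOptions = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);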
To create the DDPG agent, first specify the DDPG agent options.
agentOptions = rlDDPGAgentOptions(...
SampleTime=Ts,...
ActorOptimizerOptions=actorOptions,...
CriticOptimizerOptions=criticOptions,...
ExperienceBufferLength=1e6);
agentOptions.NoiseOptions.Variance = 0.3;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
Create the DDPG agent using the specified actor representation, critic representation, and agent
options. For more information, see rlDDPGAgent.
agent = rlDDPGAgent(actor,critic,agentOptions);
Train Agent
As a baseline, train the agent with an actor that has random initial parameters. To train the agent,
first specify the training options. For this example, use the following options.
• Run training for at most 50000 episodes, with each episode lasting at most 150 time steps.
• Display the training progress in the Episode Manager dialog box.
• Stop training when the episode reward reaches –1.
• Save a copy of the agent for each episode where the cumulative reward is greater than –2.5.
maxepisodes = 50000;
maxsteps = T/Ts;
trainingOpts = rlTrainingOptions(...
MaxEpisodes=maxepisodes,...
MaxStepsPerEpisode=maxsteps,...
Verbose=false,...
Plots="training-progress",...
StopTrainingCriteria="EpisodeReward",...
StopTrainingValue=-1,...
SaveAgentCriteria="EpisodeReward",...
SaveAgentValue=-2.5);
Train the agent using the train function. Training is a computationally intensive process that takes
several hours to complete. To save time while running this example, load a pretrained agent by
setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainingOpts);
else
% Load pretrained agent for the example.
load('ddpgFromScratch.mat');
end
You can set the actor network of your agent to a deep neural network that has been previously
trained. For this example, use the deep neural network from the “Imitate MPC Controller for Lane
Keeping Assist” on page 5-317 example. This network was trained to imitate a model predictive
controller using supervised learning.
Check that the network used by supervisedActor is the same one that was loaded. To do so,
evaluate both the network and the agent using the same random input observation.
testData = rand(6,1);
evaluateRLRep = getAction(supervisedActor,{testData});
error = single
2.2352e-08
agent = rlDDPGAgent(supervisedActor,critic,agentOptions);
Reduce the maximum number of training episodes and train the agent using the train function. To
save time while running this example, load a pretrained agent by setting doTraining to false. To
train the agent yourself, set doTraining to true.
trainingOpts.MaxEpisodes = 5000;
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainingOpts);
else
% Load pretrained agent for the example.
load('ddpgFromPretrained.mat');
end
By using the pretrained actor network, the training of the DDPG agent is more efficient. Both the total training time and the total number of training steps are reduced by approximately 20%. Also,
the number of episodes for the training to approach the neighborhood of the optimal result decreased
from approximately 4500 to approximately 3500.
To validate the performance of the trained agent, uncomment the following two lines and simulate it
within the environment. For more information on agent simulation, see rlSimulationOptions and
sim.
% simOptions = rlSimulationOptions(MaxSteps=maxsteps);
% experience = sim(env,agent,simOptions);
To check the performance of the trained agent within the Simulink model, simulate the model using
the previously defined initial conditions (e1_initial = 0.2 and e2_initial = -0.1).
sim(mdl)
As shown below, the lateral error (middle plot) and relative yaw angle (bottom plot) are both driven to
zero. The vehicle starts with a lateral deviation from the centerline (0.2 m) and a nonzero yaw angle
error (-0.1 rad). The lane-keeping controller makes the ego vehicle travel along the centerline after
around two seconds. The steering angle (top plot) shows that the controller reaches steady state after
about two seconds.
bdclose(mdl)
Local Functions
function in = localResetFcn(in)
% Set random value for lateral deviation.
in = setVariable(in,'e1_initial', 0.5*(-1+2*rand));
% Set random value for relative yaw angle (the scale factor here is an assumption).
in = setVariable(in,'e2_initial', 0.1*(-1+2*rand));
end
See Also
train | rlDDPGAgent
More About
• “Imitate MPC Controller for Lane Keeping Assist” on page 5-317
• “Train DQN Agent for Lane Keeping Assist” on page 5-201
• “Deep Deterministic Policy Gradient (DDPG) Agents” on page 3-31
Imitate Nonlinear MPC Controller for Flying Robot
This example shows how to train, validate, and test a deep neural network (DNN) that imitates the
behavior of a nonlinear model predictive controller for a flying robot. It then compares the behavior
of the deep neural network with that of the original controller. To train the deep neural network, this
example uses the data aggregation (DAgger) approach as in [1].
Nonlinear model predictive control (NLMPC) solves a constrained nonlinear optimization problem in
real time based on the current state of the plant. Since NLMPC solves its optimization problem in an
open-loop fashion, there is the potential to replace the controller with a trained DNN. Doing so is an
appealing option, since evaluating a DNN can be more computationally efficient than solving a
nonlinear optimization problem in real-time.
If the trained DNN reasonably approximates the controller behavior, you can then deploy the network
for your control application. You can also use the network as a warm starting point for training the
actor network of a reinforcement learning agent. For an example that does so with a DNN trained for
an MPC application, see “Train DDPG Agent with Pretrained Actor Network” on page 5-325.
Design a nonlinear MPC controller for a flying robot. The dynamics for the flying robot are the same
as in “Trajectory Optimization and Control of Flying Robot Using Nonlinear MPC” (Model Predictive
Control Toolbox) example. First, define the limit for the control variables, which are the robot thrust
levels.
umax = 3;
Create the nonlinear MPC controller object nlobj. To reduce command-window output, disable the
MPC update messages.
mpcverbosity off;
nlobj = createMPCobjImFlyingRobot(umax);
Load the input data from DAggerInputDataFileImFlyingRobot.mat. The columns of the data set
contain:
fileName = 'DAggerInputDataFileImFlyingRobot.mat';
DAggerData = load(fileName);
data = DAggerData.data;
existingData = data;
numCol = size(data,2);
The deep neural network architecture uses the following types of layers.
Create the deep neural network that will imitate the NLMPC controller after training.
numObservations = numCol-2;
numActions = 2;
hiddenLayerSize = 256;
imitateMPCNetwork = [
featureInputLayer(numObservations,'Normalization','none','Name','observation')
fullyConnectedLayer(hiddenLayerSize,'Name','fc1')
reluLayer('Name','relu1')
fullyConnectedLayer(hiddenLayerSize,'Name','fc2')
reluLayer('Name','relu2')
fullyConnectedLayer(hiddenLayerSize,'Name','fc3')
reluLayer('Name','relu3')
fullyConnectedLayer(hiddenLayerSize,'Name','fc4')
reluLayer('Name','relu4')
fullyConnectedLayer(hiddenLayerSize,'Name','fc5')
reluLayer('Name','relu5')
fullyConnectedLayer(hiddenLayerSize,'Name','fc6')
reluLayer('Name','relu6')
fullyConnectedLayer(numActions,'Name','fcLast')
tanhLayer('Name','tanhLast')
scalingLayer('Name','ActorScaling','Scale',umax)
regressionLayer('Name','routput')];
plot(layerGraph(imitateMPCNetwork))
One approach to learning an expert policy using supervised learning is the behavior cloning method.
This method divides the expert demonstrations (NLMPC control actions in response to observations)
into state-action pairs and applies supervised learning to train the network.
You can train the behavior cloning neural network by dividing the collected data into observation and action arrays and calling trainNetwork, as sketched below.
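A minimal sketch of this behavior cloning step. The column layout (observations first, the two NLMPC actions last) and the training options are assumptions, not the original settings:
% Split the aggregated data into network inputs and targets.
obsData = existingData(:,1:numObservations);
actData = existingData(:,numObservations+1:numObservations+numActions);
bcOptions = trainingOptions('adam', ...
'MaxEpochs',30,'MiniBatchSize',512, ...
'Shuffle','every-epoch','Verbose',false);
% Train the imitation network on the state-action pairs.
imitateMPCNetBehaviorCloningObj = trainNetwork(obsData,actData,imitateMPCNetwork,bcOptions);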
Training a DNN is a computationally intensive process. To save time, load a pretrained neural
network object.
load('behaviorCloningMPCImDNNObject.mat');
imitateMPCNetBehaviorCloningObj = behaviorCloningNNObj.imitateMPCNetObj;
The training of the DNN using behavior cloning reduces the gap between the DNN and NLMPC
performance. However, the behavior cloning neural network fails to imitate the behavior of the
NLMPC controller correctly on some randomly generated data.
To improve the performance of the DNN, you can learn the policy using an interactive demonstrator
method. DAgger is an iterative method where the DNN is run in the closed-loop environment. The
expert, in this case the NLMPC controller, outputs actions based on the states visited by the DNN. In
this manner, more training data is aggregated and the DNN is retrained for improved performance.
For more information, see [1].
Train the deep neural network using the DAggerTrainNetwork function. It creates a DAggerImFlyingRobotDNNObj.mat file that contains the results of the DAgger training.
First, create and initialize the parameters for training. Use the network trained using behavior
cloning (imitateMPCNetBehaviorCloningObj) as the starting point for the DAgger training.
[dataStruct,nlmpcStruct,tuningParamsStruct,neuralNetStruct] = loadDAggerParameters(existingData, ...
numCol,nlobj,umax,options,imitateMPCNetBehaviorCloningObj);
To save time, load a pretrained neural network by setting doTraining to false. To perform the DAgger training yourself, set doTraining to true.
doTraining = false;
if doTraining
DAgger = DAggerTrainNetwork(nlmpcStruct,dataStruct,neuralNetStruct,tuningParamsStruct);
else
load('DAggerImFlyingRobotDNNObj.mat');
end
DNN = DAgger.finalPolicy;
As an alternative, you can train the neural network with a modified policy update rule using the
DAggerModifiedTrainNetwork function. In this function, after every 20 training iterations, the DNN is set to the best-performing configuration from the previous 20 iterations. To run this example with a neural network object trained with the modified DAgger approach, use the
DAggerModifiedImFlyingRobotDNNObj.mat file.
To compare the performance of the NLMPC controller and the trained DNN, run closed-loop
simulations with the flying robot model.
Set the initial conditions for the states of the flying robot (x, y, θ, ẋ, ẏ, θ̇) and its control variables (ul, ur).
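The initial values are not shown in this excerpt; a sketch with illustrative values (not the original ones):
% Initial state [x; y; theta; xdot; ydot; thetadot] and initial thrusts (values are assumptions).
x0 = [-2; 2; 0; 0; 0; 0];
u0 = [0; 0];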
% Duration
Tf = 15;
% Sample time
Ts = nlobj.Ts;
% Simulation steps
Tsteps = Tf/Ts+1;
% Run NLMPC in closed loop.
tic
[xHistoryMPC,uHistoryMPC] = simModelMPCImFlyingRobot(x0,u0,nlobj,Tf);
toc
tic
[xHistoryDNN,uHistoryDNN] = simModelDAggerImFlyingRobot(x0,u0,DNN,Ts,Tf);
toc
Plot the results, and compare the NLMPC and trained DNN trajectories.
plotSimResultsImFlyingRobot(nlobj,xHistoryMPC,uHistoryMPC,xHistoryDNN,uHistoryDNN,umax,Tf)
The DAgger neural network successfully imitates the behavior of the NLMPC controller. The flying
robot states and control action trajectories for the controller and the DAgger deep neural network
closely align. The closed-loop simulation time for the DNN is significantly less than that of the
NLMPC controller.
To validate the performance of the trained DNN, animate the flying robot with data from the DNN
closed-loop simulation. The flying robot lands at the origin successfully.
Lx = 5;
Ly = 5;
for ct = 1:Tsteps
x = xHistoryDNN(ct,1);
y = xHistoryDNN(ct,2);
theta = xHistoryDNN(ct,3);
tL = uHistoryDNN(ct,1);
tR = uHistoryDNN(ct,2);
rl.env.viz.plotFlyingRobot(x,y,theta,tL,tR,Lx,Ly);
pause(0.05);
end
References
[1] Osa, Takayuki, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan
Peters. ‘An Algorithmic Perspective on Imitation Learning’. Foundations and Trends in Robotics 7, no.
1–2 (2018): 1–179. https://doi.org/10.1561/2300000053.
See Also
train | rlDDPGAgent
More About
• “Imitate MPC Controller for Lane Keeping Assist” on page 5-317
• “Train DQN Agent for Lane Keeping Assist” on page 5-201
• “Deep Deterministic Policy Gradient (DDPG) Agents” on page 3-31
Tune PI Controller Using Reinforcement Learning
This example shows how to tune a PI controller using the twin-delayed deep deterministic policy
gradient (TD3) reinforcement learning algorithm. The performance of the tuned controller is
compared with that of a controller tuned using the Control System Tuner app. Using the Control
System Tuner app to tune controllers in Simulink® requires Simulink Control Design™ software.
For relatively simple control tasks with a small number of tunable parameters, model-based tuning
techniques can get good results with a faster tuning process compared to model-free RL-based
methods. However, RL methods can be more suitable for highly nonlinear systems or adaptive
controller tuning.
To facilitate the controller comparison, both tuning methods use a linear quadratic Gaussian (LQG)
objective function.
This example uses a reinforcement learning (RL) agent to compute the gains for a PI controller. For
an example that replaces the PI controller with a neural network controller, see “Create Simulink
Environment and Train Agent” on page 1-20.
Environment Model
The environment model for this example is a water tank model. The goal of this control system is to
maintain the level of water in a tank to match a reference value.
open_system('watertankLQG')
To maintain the water level while minimizing control effort u, the controllers in this example use the
following LQG criterion.
J = lim(T→∞) E[ (1/T) ∫₀ᵀ ( (Href − y(t))² + 0.01 u²(t) ) dt ]
To simulate the controller in this model, you must specify the simulation time Tf and the controller
sample time Ts in seconds.
Ts = 0.1;
Tf = 10;
For more information about the water tank model, see “watertank Simulink Model” (Simulink Control
Design).
To tune a controller in Simulink using Control System Tuner, you must specify the controller block
as a tuned block and define the goals for the tuning process. For more information on using Control
System Tuner, see “Tune a Control System Using Control System Tuner” (Simulink Control Design).
For this example, open the saved session ControlSystemTunerSession.mat using Control
System Tuner. This session specifies the PID Controller block in the watertankLQG model as a
tuned block and contains an LQG tuning goal.
controlSystemTuner("ControlSystemTunerSession")
The tuned proportional and integral gains are approximately 9.8 and 1e-6, respectively.
Kp_CST = 9.80199999804512;
Ki_CST = 1.00019996230706e-06;
To define the model for training the RL agent, modify the water tank model using the following steps.
mdl = 'rlwatertankPIDTune';
open_system(mdl)
Create the environment interface object. To do so, use the localCreatePIDEnv function defined at
the end of this example.
[env,obsInfo,actInfo] = localCreatePIDEnv(mdl);
numObservations = obsInfo.Dimension(1);
numActions = prod(actInfo.Dimension);
rng(0)
Given observations, a TD3 agent decides which action to take using an actor representation. To
create the actor, first create a deep neural network with the observation input and the action output.
For more information, see rlContinuousDeterministicActor.
You can model a PI controller as a neural network with one fully-connected layer with error and error
integral observations.
u = [∫e dt   e] [Ki  Kp]ᵀ = Ki ∫e dt + Kp e
Here:
• e = Href − h(t), where h(t) is the height of the tank and Href is the reference height.
Gradient descent optimization can drive the weights to negative values. To avoid negative weights, replace the normal fullyConnectedLayer with a fullyConnectedPILayer. This layer ensures that the weights are positive by implementing the function Y = abs(WEIGHTS)*X. This layer is defined in
fullyConnectedPILayer.m. For more information on defining custom layers, see “Define Custom
Deep Learning Layers”.
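The actor construction itself is not shown in this excerpt. A sketch, assuming the custom layer constructor takes an initial weight vector and a layer name:
initialGains = single([1e-3 2]); % initial [Ki Kp] guess (values are assumptions)
actorNetwork = [
featureInputLayer(numObservations,'Normalization','none','Name','state')
fullyConnectedPILayer(initialGains,'ActionPI')];
actor = rlContinuousDeterministicActor(dlnetwork(actorNetwork),obsInfo,actInfo);
actorOptions = rlOptimizerOptions('LearnRate',1e-3,'GradientThreshold',1);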
A TD3 agent approximates the long-term reward given observations and actions using two critic
value-function representations. To create the critics, first create a deep neural network with two
inputs, the observation and action, and one output. For more information on creating a deep neural
network value function representation, see “Create Policies and Value Functions” on page 4-2.
To create the critics, use the localCreateCriticNetwork function defined at the end of this
example. Use the same network structure for both critic representations.
criticNetwork = localCreateCriticNetwork(numObservations,numActions);
criticOpts = rlOptimizerOptions('LearnRate',1e-3,'GradientThreshold',1);
critic1 = rlQValueFunction(dlnetwork(criticNetwork),obsInfo,actInfo,...
'ObservationInputNames','state','ActionInputNames','action');
critic2 = rlQValueFunction(dlnetwork(criticNetwork),obsInfo,actInfo,...
'ObservationInputNames','state','ActionInputNames','action');
critic = [critic1 critic2];
agentOpts = rlTD3AgentOptions(...
'SampleTime',Ts,...
'MiniBatchSize',128, ...
'ExperienceBufferLength',1e6,...
'ActorOptimizerOptions',actorOptions,...
'CriticOptimizerOptions',criticOpts);
agentOpts.TargetPolicySmoothModel.StandardDeviation = sqrt(0.1);
Create the TD3 agent using the specified actor representation, critic representation, and agent
options. For more information, see rlTD3AgentOptions.
agent = rlTD3Agent(actor,critic,agentOpts);
Train Agent
• Run each training for at most 1000 episodes, with each episode lasting at most 100 time steps.
• Display the training progress in the Episode Manager (set the Plots option) and disable the
command-line display (set the Verbose option).
• Stop training when the agent receives an average cumulative reward greater than -355 over 100
consecutive episodes. At this point, the agent can control the level of water in the tank.
maxepisodes = 1000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes, ...
'MaxStepsPerEpisode',maxsteps, ...
'ScoreAveragingWindowLength',100, ...
'Verbose',false, ...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-355);
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load pretrained agent for the example.
load('WaterTankPIDtd3.mat','agent')
end
simOpts = rlSimulationOptions('MaxSteps',maxsteps);
experiences = sim(env,agent,simOpts);
The integral and proportional gains of the PI controller are the absolute weights of the actor
representation. To obtain the weights, first extract the learnable parameters from the actor.
actor = getActor(agent);
parameters = getLearnableParameters(actor);
Ki = abs(parameters{1}(1))
Ki = single
0.3958
Kp = abs(parameters{1}(2))
Kp = single
8.0822
Apply the gains obtained from the RL agent to the original PI controller block and run a step-
response simulation.
mdlTest = 'watertankLQG';
open_system(mdlTest);
set_param([mdlTest '/PID Controller'],'P',num2str(Kp))
set_param([mdlTest '/PID Controller'],'I',num2str(Ki))
sim(mdlTest)
Extract the step response information, LQG cost, and stability margin for the simulation. To compute
the stability margin, use the localStabilityAnalysis function defined at the end of this example.
rlStep = simout;
rlCost = cost;
rlStabilityMargin = localStabilityAnalysis(mdlTest);
Apply the gains obtained using Control System Tuner to the original PI controller block and run a
step-response simulation.
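The corresponding code is not shown in this excerpt. Mirroring the RL case above, it would look roughly like this (the cstStep and cstCost variable names are taken from their later use):
set_param([mdlTest '/PID Controller'],'P',num2str(Kp_CST))
set_param([mdlTest '/PID Controller'],'I',num2str(Ki_CST))
sim(mdlTest)
cstStep = simout;
cstCost = cost;
cstStabilityMargin = localStabilityAnalysis(mdlTest);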
figure
plot(cstStep)
hold on
plot(rlStep)
grid on
legend('Control System Tuner','RL','Location','southeast')
title('Step Response')
rlStepInfo = stepinfo(rlStep.Data,rlStep.Time);
cstStepInfo = stepinfo(cstStep.Data,cstStep.Time);
stepInfoTable = struct2table([cstStepInfo rlStepInfo]);
stepInfoTable = removevars(stepInfoTable,{...
'SettlingMin','SettlingMax','Undershoot','PeakTime'});
stepInfoTable.Properties.RowNames = {'Control System Tuner','RL'};
stepInfoTable
stepInfoTable is a 2×5 table with rows Control System Tuner and RL and columns RiseTime, TransientTime, SettlingTime, Overshoot, and Peak. stabilityMarginTable is a 2×3 table with rows Control System Tuner and RL and columns GainMargin, PhaseMargin, and Stable.
Compare the cumulative LQG cost for the two controllers. The RL-tuned controller produces a slightly
more optimal solution.
rlCumulativeCost = sum(rlCost.Data)
rlCumulativeCost = -375.9135
cstCumulativeCost = sum(cstCost.Data)
cstCumulativeCost = -376.9373
Both controllers produce stable responses, with the controller tuned using Control System Tuner
producing a faster response. However, the RL tuning method produces a higher gain margin and a
more optimal solution.
Local Functions
% Set a custom reset function that randomizes the reference values for the model.
env.ResetFcn = @(in)localResetFcn(in,mdl);
end
Function to randomize the reference signal and initial height of the water tank at the beginning of
each episode.
function in = localResetFcn(in,mdl)
end
Function to linearize and compute stability margins of the SISO water tank system.
margin = allmargin(linsys);
end
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'fc1','concat/in1');
criticNetwork = connectLayers(criticNetwork,'fc2','concat/in2');
end
See Also
train | rlTD3Agent
More About
• “Twin-Delayed Deep Deterministic Policy Gradient Agents” on page 3-35
• “Define Custom Deep Learning Layers”
Train Reinforcement Learning Agent with Constraint Enforcement
This example shows how to train a reinforcement learning (RL) agent with actions constrained using
the Constraint Enforcement block. This block computes modified control actions that are closest to
the actions output by the agent subject to constraints and action bounds. Training reinforcement
learning agents requires Reinforcement Learning Toolbox™ software.
In this example, the goal of the agent is to bring a green ball as close as possible to the changing
target position of a red ball [1].
The dynamics for the green ball from velocity v to position x are governed by Newton's law with a
small damping coefficient τ.
1 / (s(τs + 1))
The feasible region for the ball position is 0 ≤ x ≤ 1, and the velocity of the green ball is limited to the range [−1, 1].
The position of the target red ball is uniformly random across the range [0, 1]. The agent can observe
only a noisy estimate of this target position.
In this example, a constraint function is represented using a trained deep neural network. To train
the network, you must first collect training data from the environment.
To do so, first create an RL environment using the rlBallOneDim model. This model applies random
external actions through an RL Agent block to the environment.
mdl = 'rlBallOneDim';
open_system(mdl)
• Applies the input velocity to the environment model and generates the resulting output observations
• Computes the training reward r = (1 − 10(x − xr)²)⁺, where xr denotes the position of the red ball
• Sets the termination signal isDone to true if the ball position violates the constraint 0 ≤ x ≤ 1
For this model, the observations from the environment include the position and velocity of the green
ball and the noisy measurement of the red ball position. Define a continuous observation space for
these three values.
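The specification code is not shown in this excerpt; a minimal sketch (the Name value is an assumption):
obsInfo = rlNumericSpec([3 1]);
obsInfo.Name = 'observations';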
The action that the agent applies to the green ball is its velocity. Create a continuous action space
and apply the required velocity limits.
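Again a minimal sketch, assuming the agent block in the model is named RL Agent:
actInfo = rlNumericSpec([1 1],'LowerLimit',-1,'UpperLimit',1);
actInfo.Name = 'velocity';
agentblk = [mdl '/RL Agent']; % block name is an assumption
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo);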
Specify a reset function, which randomly initializes the environment at the start of each training
episode or simulation.
env.ResetFcn = @(in)localResetFcn(in);
Next, create a DDPG reinforcement learning agent, which supports continuous actions and
observations, using the createDDPGAgentBall helper function. This function creates critic and
actor representations based on the action and observation specifications and uses the representations
to create a DDPG agent.
agent = createDDPGAgentBall(Ts,obsInfo,actInfo);
In the rlBallOneDim model, the RL Agent block does not generate actions. Instead, it is configured
to pass a random external action to the environment. The purpose for using a data-collection model
with an inactive RL Agent block is to ensure that the environment model, action and observation
signal configurations, and model reset function used during data collection match those used during
subsequent agent training.
In this example, the ball position signal xk+1 must satisfy 0 ≤ xk+1 ≤ 1. To allow for some slack, the constraint is set to be 0.1 ≤ xk+1 ≤ 0.9. The dynamic model from velocity to position has a very small damping constant, so it can be approximated by xk+1 ≈ xk + h(xk)·uk. Therefore, the constraints for the green ball are given by the following equation.
[xk; −xk] + [h(xk); −h(xk)]·uk ≤ [0.9; −0.1]
The Constraint Enforcement block accepts constraints of the form fx + gx·u ≤ c. For the above equation, the coefficients of this constraint function are as follows.
fx = [xk; −xk],  gx = [h(xk); −h(xk)],  c = [0.9; −0.1]
The function h(xk) is approximated by a deep neural network that is trained on data collected by simulating the RL agent within the environment. To learn the unknown function h(xk), the RL agent passes a random external action to the environment that is uniformly distributed in the range [−1, 1].
To collect data, use the collectDataBall helper function. This function simulates the environment
and agent and collects the resulting input and output data. The resulting training data has three
columns: xk, uk, and xk + 1.
For this example, load precollected training data. To collect the data yourself, set collectData to
true.
collectData = false;
if collectData
count = 1050;
data = collectDataBall(env,agent,count);
else
load trainingDataBall data
end
Train a deep neural network to approximate the constraint function using the
trainConstraintBall helper function. This function formats the data for training then creates and
trains a deep neural network. Training a deep neural network requires Deep Learning Toolbox™
software.
For this example, to ensure reproducibility, load a pretrained network. To train the network yourself,
set trainConstraint to true.
trainConstraint = false;
if trainConstraint
network = trainConstraintBall(data);
else
load trainedNetworkBall network
end
Validate the trained neural network using the validateNetworkBall helper function. This function
processes the input training data using the trained deep neural network. It then compares the
network output with the training output and computes the root mean-squared error (RMSE).
validateNetworkBall(data,network)
The small RMSE value indicates that the network successfully learned the constraint function.
To train the agent with constraint enforcement, use the rlBallOneDimWithConstraint model.
This model constrains the actions from the agent before applying them to the environment.
mdl = 'rlBallOneDimWithConstraint';
open_system(mdl)
To view the constraint implementation, open the Constraint subsystem. Here, the trained deep neural
network approximates h(xk), and the Constraint Enforcement block enforces the constraint function
and velocity bounds.
For this example the following Constraint Enforcement block parameter settings are used.
• Number of constraints — 2
• Number of actions — 1
• Constraint bound — [0.9;-0.1]
Create an RL environment using this model. The observation and action specifications match those
used for the previous data collection environment.
agentblk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo);
env.ResetFcn = @(in)localResetFcn(in);
Specify options for training the agent. Train the RL agent for 300 episodes with 300 steps per
episode.
trainOpts = rlTrainingOptions(...
'MaxEpisodes',300, ...
'MaxStepsPerEpisode', 300, ...
'Verbose', false, ...
'Plots','training-progress');
Train the agent. Training is a time-consuming process. For this example, load a pretrained agent
using the loadAgentParams helper function. To train the agent yourself, set trainAgent to true.
trainAgent = false;
if trainAgent
trainingStats = train(agent,env,trainOpts);
else
loadAgentParams(agent,'rlAgentBallParams')
end
The following figure shows the training results. The training process converges to a good agent
within 20 episodes.
Since Total Number of Steps equals the product of Episode Number and Episode Steps, each
training episode runs to the end without early termination. Therefore, the Constraint Enforcement
block ensures that the ball position x never violates the constraint 0 ≤ x ≤ 1.
simWithTrainedAgentBall(env,agent)
To see the benefit of training an agent with constraint enforcement, you can train the agent without
constraints and compare the training results to the constraint enforcement case.
To train the agent without constraints, use the rlBallOneDimWithoutConstraint model. This
model applies the actions from the agent directly to the environment.
mdl = 'rlBallOneDimWithoutConstraint';
open_system(mdl)
Create a new DDPG agent to train. This agent has the same configuration as the agent used in the
previous training.
agent = createDDPGAgentBall(Ts,obsInfo,actInfo);
Train the agent using the same training options as in the constraint enforcement case. For this
example, as with the previous training, load a pretrained agent. To train the agent yourself, set
trainAgent to true.
trainAgent = false;
if trainAgent
trainingStats2 = train(agent,env,trainOpts);
else
loadAgentParams(agent,'rlAgentBallCompParams')
end
The following figure shows the training results. The training process converges to a good agent after
50 episodes. After that point, the agent has poor performance for some episodes, for example around
episodes 140 and 230.
Since Total Number of Steps is less than the product of Episode Number and Episode Steps, the
training includes episodes that terminated early due to constraint violations.
simWithTrainedAgentBall(env,agent)
The agent successfully tracks the position of the red ball with more steady-state offset than the agent
trained with constraints.
Conclusion
In this example, training an RL agent with the Constraint Enforcement block ensures that the actions
applied to the environment never produce a constraint violation. As a result, the training process
converges to a good agent quickly. Training the same agent without constraints produces slower
convergence and poorer performance.
bdclose('rlBallOneDim')
bdclose('rlBallOneDimWithConstraint')
bdclose('rlBallOneDimWithoutConstraint')
close('Ball One Dim')
References
[1] Dalal, Gal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. "Safe Exploration in Continuous Action Spaces." Preprint, submitted January 26, 2018. https://arxiv.org/abs/1801.08757.
See Also
Blocks
RL Agent | Constraint Enforcement
Related Examples
• “Constraint Enforcement for Control Design” (Simulink Control Design)
• “Train RL Agent for Adaptive Cruise Control with Constraint Enforcement” (Simulink Control
Design)
• “Train RL Agent for Lane Keeping Assist with Constraint Enforcement” (Simulink Control
Design)
Train DQN Agent with LSTM Network to Control House Heating System
This example shows how to train a deep Q-learning network (DQN) agent with a Long Short-Term Memory (LSTM) network to control a house heating system modeled in Simscape®. For more information on DQN agents, see “Deep Q-Network (DQN) Agents” on page 3-23.
The reinforcement learning (RL) environment for this example uses a model from the “House Heating
System” (Simscape). The model in this RL example contains a heater, a thermostat controlled by an
RL agent, a house, outside temperatures, and a reward function. Heat is transferred between the
outside environment and the interior of the home through the walls, windows, and roof. Weather
station data from the MathWorks® campus in Natick MA is used to simulate the outside temperature
between March 21st and April 15th, 2022. ThingSpeak™ was used to obtain the data. The data,
"temperatureMar21toApr15_20022.mat", is located in this example folder. For more information
about the data acquisition, see “Compare Temperature Data from Three Different Days”
(ThingSpeak).
The training goal for the agent is to minimize the energy cost and maximize the comfort of the room
by turning the heater on and off. The house is comfortable when the room temperature Troom is between TcomfortMin and TcomfortMax.
• The observation is a 6-dimensional column vector that consists of the room temperature (°C), outside temperature (°C), max comfort temperature (°C), min comfort temperature (°C), last action, and price per kWh (USD). The max comfort temperature, the min comfort temperature, and the price per kWh do not change over time in this example, so they are not strictly needed to train the agent. However, you can extend this example by varying these values over time.
• The action is discrete: the agent either turns the heater on or turns it off, A = {0, 1}, where 0 is off and 1 is on (see the specification sketch after this list).
• The reward consists of three parts: an energy cost, a comfort-level reward, and a switching penalty. These three terms have different units, and you are expected to balance them, especially the energy cost and the comfort level, by changing the coefficients of these terms. The reward function is inspired by [1].
• The IsDone signal is always 0, which means that there is no early termination condition.
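The observation and action specification code is not shown in this excerpt. A minimal sketch consistent with the description above (the Name values are assumptions):
obsInfo = rlNumericSpec([6 1]);
obsInfo.Name = "observations";
actInfo = rlFiniteSetSpec([0 1]);
actInfo.Name = "heater command";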
A DQN agent approximates the discounted cumulative long-term reward using a vector Q-value
function critic. To approximate the Q-value function within the critic, the DQN agent in this example
uses an LSTM network, which can capture the effect of previous observations. By setting the UseRNN option in rlAgentInitializationOptions, you can create a default DQN agent with an LSTM network. Alternatively, you can configure the LSTM network manually. For an example that creates an LSTM network for a DQN agent manually, see “Water Distribution System Scheduling Using Reinforcement Learning” on page 5-307. For more information about LSTM layers, see “Long Short-Term Memory Networks”.
Note that you must set SequenceLength greater than 1 in rlDQNAgentOptions. This option is
used in training to determine the length of the minibatch used to calculate the gradient.
criticOpts = rlOptimizerOptions( ...
LearnRate=0.001, ...
GradientThreshold=1);
agentOpts = rlDQNAgentOptions(...
UseDoubleDQN = false, ...
TargetSmoothFactor = 1, ...
TargetUpdateFrequency = 4, ...
ExperienceBufferLength = 1e6, ...
CriticOptimizerOptions = criticOpts, ...
MiniBatchSize = 64);
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.0001;
useRNN = true;
initOpts = rlAgentInitializationOptions( ...
UseRNN=useRNN, ...
NumHiddenUnit=64);
if useRNN
agentOpts.SequenceLength = 20;
end
agent = rlDQNAgent(obsInfo, actInfo, initOpts, agentOpts);
agent.SampleTime = sampleTime;
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run the training for at most 150 episodes, with each episode lasting 1000 time steps.
• Set the Plots option to "training-progress", which displays training progress in the Reinforcement Learning Episode Manager.
• Set the Verbose option to false to disable the command-line display.
• Stop training when the agent receives an average cumulative reward greater than 85 over 5 consecutive episodes.
maxEpisodes = 150;
trainOpts = rlTrainingOptions(...
MaxEpisodes = maxEpisodes, ...
MaxStepsPerEpisode = maxStepsPerEpisode, ...
ScoreAveragingWindowLength = 5,...
Verbose = false, ...
Plots = "training-progress",...
StopTrainingCriteria = "AverageReward",...
StopTrainingValue = 85);
Train the agent using the train function. Training this agent is a computationally-intensive process
that takes several hours to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained agent for the example.
load("HeatControlDQNAgent.mat","agent")
end
To validate the performance of the trained agent, simulate it within the house heating system. For
more information on agent simulation, see rlSimulationOptions and sim.
We first evaluate the agent's performance using the temperature data from March 21st, 2022. The
agent didn't use this temperature data during training.
Use the localPlotResults function provided at the end of the script to analyze the performance.
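The first validation run mirrors the April 15 validation shown below. The following is a minimal sketch of that run; the variable names temperatureMarch21 and simOptions are assumed to be defined earlier in the example.
% First validation run (sketch; temperatureMarch21 and simOptions are
% assumed to be defined earlier in the example).
validationTemperature = temperatureMarch21;
experience1 = sim(env,agent,simOptions);
localPlotResults( ...
    experience1, ...
    maxSteps, ...
    comfortMax, ...
    comfortMin, ...
    sampleTime,1)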
Next, evaluate the agent's performance using the temperature data from April 15th, 2022. The agent did not use this temperature data during training.
% Validate agent using the data from April 15
validationTemperature = temperatureApril15;
experience2 = sim(env,agent,simOptions);
localPlotResults( ...
experience2, ...
maxSteps, ...
comfortMax, ...
comfortMin, ...
sampleTime,2)
Finally, evaluate the agent's performance when the outside temperature is mild. To create the mild-temperature data, add eight degrees to the temperature data from April 15th.
validationTemperature = temperatureApril15 + 8; % mild-temperature data (reconstructed)
experience3 = sim(env,agent,simOptions);
localPlotResults(experience3,maxSteps,comfortMax,comfortMin, ...
    sampleTime,3)
Local Function
function localPlotResults(experience,maxSteps,comfortMax,comfortMin,sampleTime,figNum)
% (function signature reconstructed from the calls above)
% Cost of energy
totalCosts = experience.SimulationInfo(1).househeat_output{1}.Values;
totalCosts.Time = totalCosts.Time/60;
totalCosts.TimeInfo.Units='minutes';
totalCosts.Name = "Total Energy Cost";
finalCost = experience.SimulationInfo(1).househeat_output{1}.Values.Data(end);
% Plot results
fig = figure(figNum);
% Change the size of the figure;
fig.Position = fig.Position + [0, 0, 0, 200];
% Temperatures
layoutResult = tiledlayout(3,1);
nexttile
plot(minutes, ...
reshape(experience.Observation.obs1.Data(1,:,:), ...
[1,length(experience.Observation.obs1.Data)]),'k')
hold on
plot(minutes, ...
reshape(experience.Observation.obs1.Data(2,:,:), ...
[1,length(experience.Observation.obs1.Data)]),'g')
yline(comfortMin,'b')
yline(comfortMax,'r')
lgd = legend("T_{room}", "T_{outside}","T_{comfortMin}", ...
"T_{comfortMax}","location","northoutside");
lgd.NumColumns = 4;
title('Temperatures')
ylabel("Temperature")
xlabel('Time (minutes)')
hold off
% Total cost
nexttile
plot(totalCosts)
title('Total Cost')
ylabel("Energy cost")
References
[1] Y. Du, F. Li, K. Kurte, J. Munk, and H. Zandi, "Demonstration of Intelligent HVAC Load Management With Deep Reinforcement Learning: Real-World Experience of Machine Learning in Demand Control," IEEE Power and Energy Magazine, vol. 20, no. 3, pp. 42-53, May-June 2022, doi: 10.1109/MPE.2022.3150825.
Train Custom LQR Agent
This example shows how to train a custom linear quadratic regulation (LQR) agent to control a
discrete-time linear system modeled in MATLAB®.
The reinforcement learning environment for this example is a discrete-time linear system. The
dynamics for the system are given by
$$x_{t+1} = A x_t + B u_t$$
$$u_t = -K x_t$$
where the system matrices are
$$A = \begin{bmatrix} 1.05 & 0.05 & 0.05 \\ 0.05 & 1.05 & 0.05 \\ 0 & 0.05 & 1.05 \end{bmatrix}, \quad B = \begin{bmatrix} 0.1 & 0 & 0.2 \\ 0.1 & 0.5 & 0 \\ 0 & 0 & 0.5 \end{bmatrix}$$
A = [1.05,0.05,0.05;0.05,1.05,0.05;0,0.05,1.05];
B = [0.1,0,0.2;0.1,0.5,0;0,0,0.5];
The quadratic cost matrices are
$$Q = \begin{bmatrix} 10 & 3 & 1 \\ 3 & 5 & 4 \\ 1 & 4 & 9 \end{bmatrix}, \quad R = \begin{bmatrix} 0.5 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 0.5 \end{bmatrix}$$
Q = [10,3,1;3,5,4;1,4,9];
R = 0.5*eye(3);
For this environment, the reward at time $t$ is $r_t = -x_t' Q x_t - u_t' R u_t$, which is the negative of the quadratic cost. Therefore, maximizing the reward minimizes the cost. The initial conditions are set randomly by the reset function.
Create the MATLAB environment interface for this linear system and reward. The myDiscreteEnv
function creates an environment by defining custom step and reset functions. For more information
on creating such a custom environment, see “Create MATLAB Environment Using Custom Functions”
on page 2-41.
env = myDiscreteEnv(A,B,Q,R);
For the LQR problem, the Q-function for a given control gain $K$ can be defined as
$$Q_K(x,u) = \begin{bmatrix} x \\ u \end{bmatrix}' H_K \begin{bmatrix} x \\ u \end{bmatrix}$$
where $H_K = \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix}$ is a symmetric, positive definite matrix.
The control law that maximizes $Q_K$ is $u = -H_{uu}^{-1} H_{ux} x$, and the feedback gain is $K = H_{uu}^{-1} H_{ux}$.
The matrix $H_K$ contains $m = \frac{1}{2} n (n+1)$ distinct element values, where $n$ is the sum of the number of states and the number of inputs. Denote by $\theta$ the vector corresponding to these $m$ elements, in which the off-diagonal elements of $H_K$ are multiplied by two.
The LQR agent starts with a stabilizing controller K0. To get an initial stabilizing controller, place the
poles of the closed-loop system A − BK0 inside the unit circle.
K0 = place(A,B,[0.4,0.8,0.5]);
To create a custom agent, you must create a subclass of the rl.agent.CustomAgent abstract class.
For the custom LQR agent, the defined custom subclass is LQRCustomAgent. For more information,
see “Create Custom Reinforcement Learning Agents” on page 3-68. Create the custom LQR agent
using Q, R, and K0. The agent does not require information on the system matrices A and B.
agent = LQRCustomAgent(Q,R,K0);
For this example, set the agent discount factor to one. To use a discounted future reward, set the
discount factor to a value less than one.
agent.Gamma = 1;
Because the linear system has three states and three inputs, the total number of learnable
parameters is m = 21. To ensure satisfactory performance of the agent, set the number of parameter
estimates Np to be greater than twice the number of learnable parameters. In this example, the value
is Np = 45.
agent.EstimateNum = 45;
To get good estimation results for $\theta$, you must apply a persistently exciting exploration signal to the system. In this example, exploration is encouraged by adding white noise to the controller output: $u_t = -K x_t + e_t$. In general, the exploration model depends on the system model.
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run the training for at most 10 episodes, with each episode lasting at most 50 time steps.
• Enable the command-line display (set the Verbose option to true) and disable the training progress display in the Episode Manager dialog box (set the Plots option to "none").
To validate the performance of the trained agent, simulate it within the MATLAB environment. For
more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',20);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)
totalReward = -30.6482
You can compute the optimal solution for the LQR problem using the dlqr function.
[Koptimal,P] = dlqr(A,B,Q,R);
x0 = experience.Observation.obs1.getdatasamples(1);
Joptimal = -x0'*P*x0;
Compute the error in the reward between the trained LQR agent and the optimal LQR solution.
rewardError = totalReward - Joptimal
rewardError = 5.0439e-07
View the history of the 2-norm of error in the gains between the trained LQR agent and the optimal
LQR solution.
% number of gain updates
len = agent.KUpdate;
err = zeros(len,1);
for i = 1:len
% norm of error in the gain
err(i) = norm(agent.KBuffer{i}-Koptimal);
end
plot(err,'b*-')
% Final gain error (the line computing gainError is reconstructed).
gainError = norm(agent.KBuffer{end} - Koptimal)
gainError = 3.3210e-12
Overall, the trained agent finds an LQR solution that is close to the true optimal LQR solution.
See Also
train
More About
• “Create Custom Reinforcement Learning Agents” on page 3-68
• “Train Reinforcement Learning Agents” on page 5-3
Generate Policy Block for Deployment
This example shows how you can generate a Policy block ready for deployment from an agent object.
The policy will be generated from the “Train TD3 Agent for PMSM Control” on page 5-299 example.
The policy is simulated to validate performance. If Embedded Coder® is installed, a software-in-the-
loop (SIL) simulation is run to validate the generated code of the policy.
In general, the workflow for deploying a reinforcement learning policy through a Simulink® model consists of the following steps:
1 Train the agent (see “Train TD3 Agent for PMSM Control” on page 5-299)
2 Generate a Policy block from the trained agent
3 Replace the RL Agent block with the Policy block
4 Configure the model for code generation
5 Simulate the policy and verify policy performance
6 Generate code for the policy, simulate the generated code, and verify policy performance
7 Deploy to hardware for testing
Load the motor parameters along with the trained TD3 agent.
sim_data;
model: 'BoostXL-DRV8305'
sn: 'INV_XXXX'
V_dc: 24
I_trip: 10
Rds_on: 0.0020
Rshunt: 0.0070
CtSensAOffset: 2295
CtSensBOffset: 2286
ADCGain: 1
EnableLogic: 1
invertingAmp: 1
ISenseVref: 3.3000
ISenseVoltPerAmp: 0.0700
ISenseMax: 21.4286
R_board: 0.0043
CtSensOffsetMax: 2500
CtSensOffsetMin: 1500
model: 'LAUNCHXL-F28379D'
sn: '123456'
CPU_frequency: 200000000
PWM_frequency: 5000
PWM_Counter_Period: 20000
ADC_Vref: 3
ADC_MaxCount: 4095
SCI_baud_rate: 12000000
V_base: 13.8564
I_base: 21.4286
N_base: 3476
T_base: 1.0249
P_base: 445.3845
load("rlPMSMAgent.mat","agent");
Open the Simulink® model used for training the TD3 agent.
mdl_rl = "mcb_pmsm_foc_sim_RL";
open_system(mdl_rl);
To create a deployable model, the RL Agent block will be replaced with a Policy block.
Set the agent's UseExplorationPolicy property to false so the generated policy will take the
greedy action at each time step. Generate the policy block using generatePolicyBlock and specify
the name of the MAT file containing the policy data for the block.
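A minimal sketch of this step follows; the exact call, including the MATFileName argument, is an assumption, and the file name PMSMPolicyBlockData.mat is taken from the description later in this example.
% Take greedy actions and generate the Policy block (sketch).
agent.UseExplorationPolicy = false;
generatePolicyBlock(agent,MATFileName="PMSMPolicyBlockData.mat");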
Alternatively, you can generate the policy block by clicking Generate greedy policy block on the RL Agent block mask. Use open_system(agentblk) to open the RL Agent block mask, or simply double-click the block.
For this example, the Policy block has already replaced the RL Agent block inside of the
pmsm_current_control model. This model has been configured for code generation and the Policy
block loads the trained policy from the MAT file PMSMPolicyBlockData.mat generated above.
mdl_current_ctrl = "pmsm_current_control";
open_system(mdl_current_ctrl);
Use open_system(policyblk) to open the Policy block mask, or simply double-click the block.
% Simulate the model with the current controller run in normal mode
in = Simulink.SimulationInput(mdl_policy);
in = setBlockParameter(in,current_ctrl_blk,"SimulationMode","Normal");
out_sim = sim(in);
plotOnSubPlot(speedSim ,1,1,true);
plotOnSubPlot(speedRefSim,1,1,true);
plotOnSubPlot(idSim ,2,1,true);
plotOnSubPlot(idRefSim,2,1,true);
plotOnSubPlot(iqSim ,3,1,true);
plotOnSubPlot(iqRefSim,3,1,true);
Simulink.sdi.view;
If Embedded Coder® is installed, the current controller model reference can be run in SIL mode.
Running the current controller in SIL mode will generate code for the current controller model,
including the policy block.
% Simulate the model with the current controller run in SIL mode
if ispc
in = setBlockParameter(in,current_ctrl_blk,"SimulationMode","Software-in-the-loop");
out_sil = sim(in);
% Get the SDI run
runSIL = Simulink.sdi.Run.getLatest;
% Compare the SIL response to the simulated response. The SIL response
% should be close to the response simulated in normal mode.
speedSim.AbsTol = 1e-3;
cr = Simulink.sdi.compareSignals(speedSim.ID,speedSIL.ID);
Simulink.sdi.view;
else
disp("The model ""pmsm_current_control"" is configured for SIL simulations on a Windows syste
end
### Using toolchain: MinGW64 | gmake (64-bit Windows)
### Building 'pmsm_current_control_rtwlib'
### Created: ./pmsm_current_control_rtwlib.lib
### Successfully generated all binary outputs.
### Successfully updated the model reference code generation target for: pmsm_current_control
Build Summary
### Created: ./pmsm_current_control_ca.lib
### Building 'pmsm_current_control'
### Created: ./pmsm_current_control.exe
### Successfully generated all binary outputs.
### Starting SIL simulation for component: pmsm_current_control
### Application stopped
### Stopping SIL simulation for component: pmsm_current_control
Once satisfied with the performance of the policy in simulation, you can use an appropriate target/
hardware support package to deploy the policy to hardware.
See Also
Policy | generatePolicyBlock | generatePolicyFunction | train | dlnetwork
Related Examples
• “Generate Policy Block for Deployment” on page 5-377
More About
• “Create Policies and Value Functions” on page 4-2
Train Reinforcement Learning Policy Using Custom Training Loop
This example shows how to define a custom training loop for a reinforcement learning policy. You can
use this workflow to train reinforcement learning policies with your own custom training algorithms
rather than using one of the built-in agents from the Reinforcement Learning Toolbox™ software.
Using this workflow, you can train policies that use any of the Reinforcement Learning Toolbox policy and value function approximator objects.
In this example, a discrete actor policy with a discrete action space is trained using the REINFORCE
algorithm (with no baseline). For more information on the REINFORCE algorithm, see “Policy
Gradient Agents” on page 3-27.
For more information on the functions you can use for custom training, see Functions for Custom
Training on page 5-393.
Environment
For this example, a reinforcement learning policy is trained in a discrete cart-pole environment. The
objective in this environment is to balance the pole by applying forces (actions) on the cart. Create
the environment using the rlPredefinedEnv function.
env = rlPredefinedEnv("CartPole-Discrete");
For more information on this environment, see “Load Predefined Control System Environments” on
page 2-23.
Policy
The policy in this example is a stochastic policy for a discrete action space, represented by a deep neural network containing featureInputLayer, fullyConnectedLayer, reluLayer, and softmaxLayer layers. This network outputs probabilities for each discrete action given the current observations. The softmaxLayer ensures that the actor outputs probability values in the range [0 1] and that all probabilities sum to 1.
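Before creating the network, obtain the environment specifications and the number of observations used below. These lines are a minimal sketch that follows the pattern used in the other examples in this chapter.
% Get observation and action specifications and the observation size.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);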
actorNetwork = [featureInputLayer(numObs)
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(2)
softmaxLayer];
Convert to dlnetwork.
actorNetwork = dlnetwork(actorNetwork)
actorNetwork =
dlnetwork with properties:
Create the actor using the network and the environment specifications, accelerate its gradient computation, and create an optimizer for updating its parameters.
actor = rlDiscreteCategoricalActor(actorNetwork,obsInfo,actInfo);
actor = accelerate(actor,true);
actorOpts = rlOptimizerOptions(LearnRate=1e-2);
actorOptimizer = rlOptimizer(actorOpts);
Training Setup
• Set up the training to last at most 5000 episodes, with each episode lasting at most 250 steps.
• To calculate the discounted reward, choose a discount factor of 0.995.
• Terminate the training after the maximum number of episodes is reached or when the average
reward across 100 episodes reaches the value of 220.
numEpisodes = 5000;
maxStepsPerEpisode = 250;
discountFactor = 0.995;
aveWindowSize = 100;
trainingTerminationValue = 220;
Create a vector for storing the cumulative reward for each training episode.
episodeCumulativeRewardVector = [];
Create a figure for training visualization using the hBuildFigure on page 5-395 helper function.
[trainingPlot,lineReward,lineAveReward] = hBuildFigure;
The algorithm for the custom training loop is as follows. For each episode, generate a trajectory by stepping the environment with actions sampled from the policy, compute the discounted cumulative reward, compute the gradient of the loss function with respect to the policy parameters, and update the policy using the optimizer.
% Train the policy for the maximum number of episodes or until the average
% reward indicates that the policy is sufficiently trained.
for episodeCt = 1:numEpisodes
episodeReward = zeros(maxStepsPerEpisode,1);
episodeReward(stepCt) = reward;
obs = nextObs;
end
end
Simulation
Enable the environment visualization, which is updated each time the environment step function is
called.
plot(env)
1 Get the action by sampling from the policy using the getAction function.
2 Step the environment using the obtained action value.
3 Terminate if a terminal condition is reached.
obs = nextObs;
end
To obtain actions and value functions for given observations from Reinforcement Learning Toolbox policy and value function approximators, you can use functions such as getValue, getMaxQValue, getAction, and evaluate.
If your policy or value function approximator is a recurrent neural network, that is, a neural network with at least one layer that has hidden state information, the preceding functions can also return the current network state, and you can get and set the state of your approximator using the getState and setState functions.
You can get and set the learnable parameters of your approximator using the getLearnableParameters and setLearnableParameters functions, respectively.
In addition to these functions, you can use the gradient, optimize, and syncParameters
functions to set parameters and compute gradients for your policy and value function approximators.
gradient
The gradient function computes the gradients of the approximator loss function. You can compute several different gradients. For example, to compute the gradient of the approximator outputs with respect to its inputs, use the following syntax.
grad = gradient(actor,"output-input",inputData)
Here, actor is the approximator object and inputData contains the inputs to the approximator, such as an observation.
syncParameters
The syncParameters function updates the learnable parameters of one policy or value function approximator based on those of another approximator. This function is useful for updating a target actor or critic approximator, as is done for DDPG agents. To synchronize parameter values between two approximators, use the following syntax.
newTargetApproximator = syncParameters(oldTargetApproximator,sourceApproximator,smoothFactor)
Here, smoothFactor is the smoothing factor for the update; a value of 1 copies the source parameters directly into the target approximator.
Loss Function
The loss function in the REINFORCE algorithm is the product of the discounted reward and the log of the policy, summed across all time steps. The discounted reward calculated in the custom training loop must be resized to make it compatible for multiplication with the policy. The first input argument of the loss function must be a cell array like the one returned from evaluating a function approximator object. For more information, see the description of outData in evaluate. The second, optional, input argument contains additional data that might be needed for the gradient calculation.
policy = policy{1};
end
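For reference, the following is a minimal sketch of a loss function of this form. The function name, the field names of lossData (discountedReturn and actionIndicationMatrix), and the clipping of small probabilities are assumptions, not the exact listing used by this example.
% Sketch of a REINFORCE loss: negative sum of discounted return times
% log-probability of the taken action.
function loss = reinforceLoss(policy,lossData)
    % policy is a cell array returned by evaluate; extract the dlarray of
    % action probabilities (one column per time step).
    policy = policy{1};
    % Avoid log(0) by clipping very small probabilities.
    policy = max(policy,eps);
    % Select the probability of the action actually taken at each step.
    actionProb = sum(policy .* lossData.actionIndicationMatrix,1);
    % Weight the log-probabilities by the discounted returns and sum.
    loss = -sum(lossData.discountedReturn .* log(actionProb));
end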
Helper Function
function [trainingPlot, lineReward, lineAveReward] = hBuildFigure()
% (signature reconstructed from the call site above; the figure creation
% line is an assumption)
trainingPlot = figure();
ax = gca(trainingPlot);
lineReward = animatedline(ax);
lineAveReward = animatedline(ax,Color="r",LineWidth=3);
xlabel(ax,"Episode");
ylabel(ax,"Reward");
legend(ax,"Cumulative Reward", ...
"Average Reward", ...
Location="northwest")
title(ax,"Training Progress");
end
See Also
train
More About
• “Create Custom Reinforcement Learning Agents” on page 3-68
• “Train Reinforcement Learning Agents” on page 5-3
• “Custom Training Loop with Simulink Action Noise” on page 5-397
Custom Training Loop with Simulink Action Noise
This example shows how to tune a controller for vehicle platooning applications using a custom
reinforcement learning (RL) training loop. For this application, action noise is generated in the
Simulink® model to promote exploration during training.
For an example on tuning a PID-based vehicle platooning system, see “Design Controller for Vehicle
Platooning” (Simulink Control Design).
A vehicle platooning control system has the following control objectives.
• Individual vehicle stability — Spacing error for each following vehicle converges to zero if the
preceding vehicle is traveling at constant speed.
• String stability — Spacing errors do not amplify as they propagate towards the tail of the vehicle
string.
In this example, there are five vehicles in the platoon. Every vehicle is modeled as a truck-trailer
system with the following parameters. All lengths are in meters.
L1 = 6; % Truck length
L2 = 10; % Trailer length
M1 = 1; % Hitch length
L = L1 + L2 + M1 + 5; % Desired front-to-front vehicle spacing
The lead vehicle follows a given acceleration profile. Each trailing vehicle has a controller that
controls its acceleration.
mdl = "fiveVehiclePlatoonEnv";
open_system(mdl)
The model contains an RL Agent block with its last action input port enabled. This input port allows
the specification of custom noise in the Simulink model for off-policy RL agents, such as deep
deterministic policy gradient (DDPG) agents.
Controller Structure
In this example, each trailing vehicle (ego vehicle) uses the same continuous-time controller structure and parameterization, of the form
$$a_{ego} = K_1 a_{front} + K_2 (v_{front} - v_{ego}) - K_3 (x_{ego} - x_{front} + L)$$
Here:
• aego, vego, and xego are the respective acceleration, velocity, and position of the ego vehicle.
• afront, vfront, and xfront are the respective acceleration, velocity, and position of the vehicle directly
in front of the ego vehicle.
Each vehicle has full access to its own velocity and position states but can only access the
acceleration, velocity, and position of the vehicle directly in front using wireless communication.
The controller minimizes the velocity error vfront − vego using the velocity gain K2 and minimizes the
spacing error xego − xfront + L using the spacing gain K3. The feedforward gain K1 is used to improve
tracking of the front vehicle.
$$a_{lead} = A \sin(\omega t)$$
Here, $A$ and $\omega$ are the amplitude and frequency of the lead vehicle acceleration profile; the reset function perturbs their nominal values at the start of each training episode.
The objective of the agent is to compute adaptive gains so that each vehicle can track the desired
spacing with respect to the vehicle immediately in front. Therefore, the model is configured such
that:
• The action signal consists of the gains $K = [K_1\;\; K_2\;\; K_3]$ shared by all vehicles except the lead vehicle. Each gain has a lower bound of 0 and upper bounds of 1, 20, and 20, respectively. The agent calculates new gains once per second. To encourage exploration during training, the gains are perturbed by zero-mean Gaussian noise, $K_{noisy} = K + N(0,\sigma^2)$, where the variances are $\sigma^2 = [0.02,\; 0.1,\; 0.1]$.
• The observation signal consists of the vehicle spacings (diff(x)) minus the target spacing L, the vehicle velocities (v), and the vehicle accelerations (a).
• The reward calculated at every time step $t$ is
$$r_t = \sum \frac{1}{1 + (s_t - L)^2} - 0.2 \sum (K_{t-1} - K_{t-2})^2 - 1.0\,\mathrm{maxOvershoot}(s_t, L) - 10\,c_t$$
where $s_t$ denotes the vehicle spacings at time step $t$, $K_t$ is the gain vector, and $c_t$ indicates whether a collision has occurred.
The first term in the reward function encourages the vehicle spacing to match L. The second term
penalizes large changes in gain between time steps. The third term penalizes overshooting the target
spacing (getting too close to the front vehicle). Finally, the fourth term penalizes collisions.
For this example, to accommodate the custom noise specified in the model, you implement a custom
DDPG training loop.
Define the training and simulation parameters that remain fixed during training.
Ts = 1; % Sample time (seconds)
Tf = 100; % Simulation length (seconds)
Define the parameters that change every training episode. The values for these parameters are
updated in the environment reset function resetFunction.
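As a reference, the nominal values of the lead-vehicle acceleration amplitude and frequency can be set as shown in the following sketch. The names LeadA and LeadF match the reset function shown later in this example; treating them as the only episode-varying parameters is an assumption.
% Nominal lead-vehicle acceleration amplitude and frequency (perturbed
% every episode by resetFunction).
LeadA = 2; % Amplitude
LeadF = 1; % Frequency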
Create Environment
To do so, first define the observation and action specifications for the environment.
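A sketch of these specifications follows. The observation dimension (4 spacing errors, 5 velocities, and 5 accelerations for the five-vehicle platoon) and the block path agentBlk are assumptions based on the description above.
% Observation and action specifications (dimensions are assumptions).
obsInfo = rlNumericSpec([14 1]);   % [spacing errors; velocities; accelerations]
actInfo = rlNumericSpec([3 1], ...
    LowerLimit=zeros(3,1), ...
    UpperLimit=[1;20;20]);         % gains K1, K2, K3
agentBlk = mdl + "/RL Agent";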
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
Set the environment reset function to the local function resetFunction included with this example.
This function varies the training conditions for each episode.
env.ResetFcn = @resetFunction;
Noise in the model is specified using the Random Number (Simulink) block. Each block has its own random number generator and thus its own starting seed parameter. To ensure that the noise stream varies across episodes, the seed variables are updated using resetFunction.
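For illustration, a seed might be refreshed inside the reset function as in the following sketch; the seed variable name NoiseSeed1 is hypothetical.
% Sketch of refreshing a noise seed inside resetFunction (the variable
% name NoiseSeed1 is hypothetical).
in = setVariable(in,"NoiseSeed1",randi(1e6));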
Create actor and critic function approximators for the agent using the local function
createNetworks included with this example.
[critic,actor] = createNetworks(obsInfo,actInfo);
Create optimizer objects for updating the actor and critic. Use the same options for both optimizers.
optimizerOpt = rlOptimizerOptions(...
LearnRate=1e-3, ...
GradientThreshold=1, ...
L2RegularizationFactor=1e-3);
criticOptimizer = rlOptimizer(optimizerOpt);
actorOptimizer = rlOptimizer(optimizerOpt);
policy = rlDeterministicActorPolicy(actor);
policy.SampleTime = Ts;
Create an experience buffer for the agent with a maximum length of 1e6.
replayMemory = rlReplayMemory(obsInfo,actInfo,1e6);
To update the actor and critic during training, the runEpisode function processes each experience
as it is received from the environment. For this example, the processing function is the
processExperienceFcn local function.
This function requires additional data to perform its processing. Create a structure to store this
additional data.
processExpData.Critic = critic;
processExpData.TargetCritic = critic;
processExpData.Actor = actor;
processExpData.TargetActor = actor;
processExpData.ReplayMemory = replayMemory;
processExpData.CriticOptimizer = criticOptimizer;
processExpData.ActorOptimizer = actorOptimizer;
processExpData.MiniBatchSize = 128;
processExpData.DiscountFactor = 0.99;
processExpData.TargetSmoothFactor = 1e-3;
Each episode, the processExperienceFcn function updates the critics, actors, replay memory, and
optimizers. The updated data is used as the input for the next episode.
Training Loop
To train the agent, the custom training loop simulates the agent in the environment for a maximum of
maxEpisodes episodes.
maxEpisodes = 1000;
Compute the maximum number of steps per episode using the simulation time and sample time.
maxSteps = ceil(Tf/Ts);
• The runEpisode function simulates the agent in the environment for one episode.
• Experiences are processed as they are received from the environment using the
processExperienceFcn function.
• Experiences are not logged by runEpisode since the experiences are processed as they are
received.
• To speed up training, when calling runEpisode, the CleanupPostSim option is set to false.
Doing so keeps the model compiled between episodes.
• The PlatooningTrainingCurvePlotter object is a helper object to plot training data while the
training is running.
• You can stop the training using a Stop button in the training plot.
• After all the episodes are complete, the cleanup function cleans up the environment and
terminates the model compilation.
Training the policy is a computationally intensive process that can take several minutes to hours to
complete. To save time while running this example, load a pretrained agent by setting doTraining
to false. To train the policy yourself, set doTraining to true.
doTraining = false;
if doTraining
% Create plotting helper object.
plotObj = PlatooningTrainingCurvePlotter();
% Training loop
for i = 1:maxEpisodes
Validate the learned policy by running five simulations with random initial conditions specified by the
reset function.
UseParamNoise = 0;
N = 5;
simOpts = rlSimulationOptions(...
MaxSteps=maxSteps, ...
NumSimulations=N);
experiences = sim(env,policy,simOpts);
Plot the vehicle spacing error, gains, and reward from the experiences output structure.
From the plots, you can see that the trained policy generates adaptive gains that adequately track the
desired spacing for all vehicles.
Local Functions
The process experience function is called every time an experience is processed by the RL Agent
block. Here, processExperienceFcn appends the experience to the replay memory, samples a mini-
batch of experiences from the replay memory, and updates the critic, actor, and target networks.
function [policy,processExpData] = processExperienceFcn(exp,episodeInfo,policy,processExpData)
% Append the experience to the replay memory.
append(processExpData.ReplayMemory,exp);
% Sample a mini-batch of experiences from the replay memory.
miniBatch = sample(processExpData.ReplayMemory,processExpData.MiniBatchSize);
if ~isempty(miniBatch)
% Update network parameters using the mini-batch.
[processExpData,actorParams] = learnFcn(processExpData,miniBatch);
The learnFcn function updates the critic, actor, and target networks given a sampled mini-batch of experiences.
The actor gradient is computed to maximize the expected value of an observation-action pair given
the policy parameters. Here, the negative sign is used to maximize Q with respect to θ.
$$-\frac{d}{d\theta}\,\frac{1}{N}\sum Q_\phi(s,a) = -\frac{d}{d\theta}\,\frac{1}{N}\sum Q_\phi\bigl(s,\pi_\theta(s)\bigr)$$
Here, $Q_\phi$ is the critic, $\pi_\theta$ is the actor with parameters $\theta$, and $N$ is the mini-batch size.
% Compute: q = Q(s,a)
q = predict(gradData.CriticNet,observation{:},action);
% Compute -sum(q)/N as the negative mean over the mini-batch
% (this line is reconstructed).
qsum = -mean(q,"all");
% Compute: d(-sum(q)/N)/dActorParams
dQdTheta = dlgradient(qsum,actorNet.Learnables);
end
The environment reset function varies the initial conditions, reference trajectory, and noise seeds for
every episode.
function in = resetFunction(in)
% Perturb the nominal reference amplitude and frequency.
LeadA = max(2 + 0.1*randn,0.1);
LeadF = max(1 + 0.1*randn,0.1);
hiddenLayerSize = 64;
numObs = prod(obsInfo.Dimension);
numAct = prod(actInfo.Dimension);
% Scale and bias map the tanh output onto the action bounds (these
% definitions and the input layer are reconstructed assumptions).
scale = (actInfo.UpperLimit - actInfo.LowerLimit)/2;
bias = (actInfo.UpperLimit + actInfo.LowerLimit)/2;
% Actor network: observations in, bounded gains out.
obsPath = [
    featureInputLayer(numObs,Name="obsInputLayer")
    fullyConnectedLayer(hiddenLayerSize,Name="fc1")
reluLayer(Name="relu1")
fullyConnectedLayer(numAct,Name="fc2")
reluLayer(Name="relu2")
fullyConnectedLayer(numAct,Name="fc3")
tanhLayer(Name="tanh1")
scalingLayer(Scale=scale,...
Bias=bias,...
Name=actInfo.Name)];
net = layerGraph;
net = addLayers(net,obsPath);
net = dlnetwork(net);
References
[1] Rajamani, Rajesh. Vehicle Dynamics and Control. 2. ed. Mechanical Engineering Series. New York,
NY Heidelberg: Springer, 2012.
See Also
Functions
runEpisode | setup | cleanup
Blocks
RL Agent
Related Examples
• “Train Reinforcement Learning Policy Using Custom Training Loop” on page 5-388
Create Agent for Custom Reinforcement Learning Algorithm
This example shows how to create a custom agent for your own custom reinforcement learning algorithm. Doing so allows you to leverage built-in functionality from the Reinforcement Learning Toolbox™ software, such as the train and sim functions.
In this example, you convert a custom REINFORCE training loop into a custom agent class. For more
information on the REINFORCE custom train loop, see “Train Reinforcement Learning Policy Using
Custom Training Loop” on page 5-388. For more information on writing custom agent classes, see
“Create Custom Reinforcement Learning Agents” on page 3-68.
rng(0)
Create Environment
Create the same training environment used in the “Train Reinforcement Learning Policy Using
Custom Training Loop” on page 5-388 example. The environment is a cart-pole balancing
environment with a discrete action space. Create the environment using the rlPredefinedEnv
function.
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numAct = numel(actInfo.Elements);
For more information on this environment, see “Load Predefined Control System Environments” on
page 2-23.
Define Policy
actorNetwork = [featureInputLayer(numObs,'Normalization','none','Name','state')
fullyConnectedLayer(24,'Name','fc1')
reluLayer('Name','relu1')
fullyConnectedLayer(24,'Name','fc2')
reluLayer('Name','relu2')
fullyConnectedLayer(2,'Name','output')
softmaxLayer('Name','actionProb')];
actorNetwork = dlnetwork(actorNetwork)
actorNetwork =
dlnetwork with properties:
actor = rlDiscreteCategoricalActor(actorNetwork,obsInfo,actInfo);
actor = accelerate(actor,true);
actorOpts = rlOptimizerOptions('LearnRate',1e-3);
To define your custom agent, first create a class that is a subclass of the rl.agent.CustomAgent
class. The custom agent class for this example is defined in CustomReinforceAgent.m.
The CustomReinforceAgent class has the following class definition, which indicates the agent class
name and the associated abstract agent.
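Based on the class and base-class names stated above, the class definition line has the following form.
classdef CustomReinforceAgent < rl.agent.CustomAgent
Within the class, you define the following components.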
• Agent properties
• Constructor function
• Critic approximator that estimates the discounted long-term reward (if required for learning)
• Actor that selects an action based on the current observation (if required for learning)
• Required agent methods
• Optional agent methods
Agent Properties
In the properties section of the class file, specify any parameters necessary for creating and
training the agent.
The rl.agent.CustomAgent class already includes properties for the agent sample time (SampleTime) and the action and observation specifications (ActionInfo and ObservationInfo, respectively).
The custom REINFORCE agent defines the following additional agent properties.
properties
% Actor
Actor
ActorOptimizer
% Agent options
Options
% Experience buffer
ObservationBuffer
ActionBuffer
RewardBuffer
end
Constructor Function
To create your custom agent, you must define a constructor function. The constructor function
performs the following actions.
• Defines the action and observation specifications. For more information about creating these
specifications, see rlNumericSpec and rlFiniteSetSpec.
• Sets the agent properties.
• Calls the constructor of the base abstract class.
• Defines the sample time (required for training in Simulink environments).
For example, the CustomReinforceAgent constructor defines action and observation spaces based on the input actor.
obj.SampleTime = -1;
Required Functions
To create a custom reinforcement learning agent you must define the following implementation
functions.
To call these functions in your own code, use the wrapper methods from the abstract base class. For
example, to call getActionImpl, use getAction. The wrapper methods have the same input and
output arguments as the implementation methods.
getActionImpl Function
The getActionImpl function is used to evaluate the policy of your agent and select an action when
simulating the agent using the sim function. This function must have the following signature, where
obj is the agent object, Observation is the current observation, and Action is the selected action.
function Action = getActionImpl(obj,Observation)
For the custom REINFORCE agent, you select an action by calling the getAction function for the actor. The discrete categorical actor generates a discrete distribution from an observation and samples an action from this distribution.
function Action = getActionImpl(obj,Observation)
% Compute an action using the policy given the current
% observation.
Action = getAction(obj.Actor,Observation);
end
getActionWithExplorationImpl Function
For the custom REINFORCE agent, the getActionWithExplorationImpl function is the same as
getActionImpl. By default, stochastic actors always explore, that is, they always select an action
based on a probability distribution.
learnImpl Function
The learnImpl function defines how the agent learns from the current experience. This function
implements the custom learning algorithm of your agent by updating the policy parameters and
selecting an action with exploration for the next state. This function must have the following
signature, where obj is the agent object, Experience is the current agent experience, and Action
is the selected action.
For the custom REINFORCE agent, replicate steps 2 through 7 of the custom training loop in “Train
Reinforcement Learning Policy Using Custom Training Loop” on page 5-388. You omit steps 1, 8, and
9 since you will use the built-in train function to train your agent.
obj.ObservationBuffer(:,:,obj.Counter) = Obs{1};
obj.ActionBuffer(:,:,obj.Counter) = Action{1};
obj.RewardBuffer(:,obj.Counter) = Reward;
if ~IsDone
% Choose an action for the next state.
The actor computes the gradient of the loss with respect to its parameters. The loss function is implemented as the local function lossFunction in CustomReinforceAgent.m.
policy = policy{1};
% Create the action indication matrix.
batchSize = lossData.batchSize;
Z = repmat(lossData.actInfo.Elements',1,batchSize);
actionIndicationMatrix = lossData.actionBatch(:,:) == Z;
end
Optional Functions
Optionally, you can define how your agent is reset at the start of training by specifying a resetImpl
function with the following function signature, where obj is the agent object.
function resetImpl(obj)
Using this function, you can set the agent into a known or random condition before training.
function resetImpl(obj)
% (Optional) Define how the agent is reset before training.
resetBuffer(obj);
obj.Counter = 1;
end
Also, you can define any other helper functions in your custom agent class as required. For example,
the custom REINFORCE agent defines a resetBuffer function for reinitializing the experience
buffer at the beginning of each training episode.
function resetBuffer(obj)
% Reinitialize all experience buffers.
obj.ObservationBuffer = zeros(obj.NumObservation,1,obj.Options.MaxStepsPerEpisode);
obj.ActionBuffer = zeros(obj.NumAction,1,obj.Options.MaxStepsPerEpisode);
obj.RewardBuffer = zeros(1,obj.Options.MaxStepsPerEpisode);
end
Once you have defined your custom agent class, create an instance of it in the MATLAB workspace.
To create the custom REINFORCE agent, first specify the agent options.
options.MaxStepsPerEpisode = 250;
options.DiscountFactor = 0.995;
options.OptimizerOptions = actorOpts;
Then, using the options and the previously defined actor, call the custom agent constructor function.
agent = CustomReinforceAgent(actor,options);
• Set up the training to last at most 5000 episodes, with each episode lasting at most 250 steps.
• Terminate the training after the maximum number of episodes is reached or when the average reward across 100 episodes reaches a value of 220.
numEpisodes = 5000;
aveWindowSize = 100;
trainingTerminationValue = 220;
trainOpts = rlTrainingOptions(...
'MaxEpisodes',numEpisodes,...
'MaxStepsPerEpisode',options.MaxStepsPerEpisode,...
'ScoreAveragingWindowLength',aveWindowSize,...
'StopTrainingValue',trainingTerminationValue);
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainStats = train(agent,env,trainOpts);
else
% Load pretrained agent for the example.
load('CustomReinforce.mat','agent');
end
Enable the environment visualization, which is updated each time the environment step function is
called.
plot(env)
To validate the performance of the trained agent, simulate it within the cart-pole environment. For
more information on agent simulation, see rlSimulationOptions and sim.
simOpts = rlSimulationOptions('MaxSteps',options.MaxStepsPerEpisode);
experience = sim(env,agent,simOpts);
See Also
train
More About
• “Create Custom Reinforcement Learning Agents” on page 3-68
• “Train Reinforcement Learning Agents” on page 5-3
Train MBPO Agent to Balance Cart-Pole System
This example shows how to train a model-based policy optimization (MBPO) agent to balance a cart-
pole system modeled in MATLAB®. For more information on MBPO agents, see “Model-Based Policy
Optimization Agents” on page 3-62.
MBPO agents use an environment model to generate more experiences while training a base agent.
In this example, the base agent is a soft actor-critic (SAC) agent.
The built-in MBPO agent is based on the model-based policy optimization algorithm in [1]. The original MBPO algorithm trains an ensemble of stochastic models. In contrast, this example trains an ensemble of deterministic models.
The following figure summarizes the algorithm used in this example. During training, the MBPO agent collects real experiences resulting from interactions with the environment. The MBPO agent uses these experiences to train its internal environment model. Then, it uses this model to generate experiences without interacting with the actual environment. Finally, the MBPO agent uses the real experiences and the generated experiences to train the SAC base agent.
For this example, the reinforcement learning environment is a pole attached to an unactuated revolute joint on a cart. The cart has an actuated prismatic joint connected to a one-dimensional frictionless track. The training goal in this environment is to balance the pole by applying forces (actions) to the prismatic joint.
• The upward balanced pendulum position is 0 radians and the downward hanging position is pi
radians.
• The pendulum starts upright with an initial angle between –0.05 radians and 0.05 radians.
• The force action signal from the agent to the environment is from –10 N to 10 N.
• The observations from the environment are the position and velocity of the cart, the pendulum
angle, and the pendulum angle derivative.
• The episode terminates if the pole is more than 12 degrees from vertical or if the cart moves more
than 2.4 m from the original position.
• A reward of +0.5 is provided for every time-step that the pole remains upright. An additional
reward is provided based on the distance between the cart and the origin. A penalty of –50 is
applied when the pendulum falls.
For more information on this model, see “Load Predefined Control System Environments” on page 2-
23.
env = rlPredefinedEnv("CartPole-Continuous");
The interface has a continuous action space where the agent can apply one force value ranging from
–10 N to 10 N.
Obtain the observation and action specifications from the environment interface.
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
rng(0)
An MBPO agent decides which action to take given observations using a base off-policy agent. The MBPO agent trains both the base agent and an environment model. The environment model consists of transition functions, a reward function, and an is-done function. This model is used to create more samples without interacting with the environment. To construct the MBPO agent, this example first creates a SAC base agent, then defines the environment model, and finally combines them.
Create a SAC base agent with a default network structure. For more information on SAC agents, see
“Soft Actor-Critic Agents” on page 3-57. For an environment with a continuous action space, you can
also use a DDPG or TD3 base agent. For discrete environments, you can use a DQN base agent.
agentOpts = rlSACAgentOptions;
agentOpts.MiniBatchSize = 256;
initOpts = rlAgentInitializationOptions("NumHiddenUnit",64);
baseagent = rlSACAgent(obsInfo,actInfo,initOpts,agentOpts);
baseagent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-4;
baseagent.AgentOptions.CriticOptimizerOptions(1).LearnRate = 1e-4;
baseagent.AgentOptions.CriticOptimizerOptions(2).LearnRate = 1e-4;
baseagent.AgentOptions.NumGradientStepsPerUpdate = 5;
To model the environment, an MBPO agent trains one or more transition models. To model an
environment effectively, you must consider two kinds of uncertainty: statistical uncertainty and
modeling uncertainty. A stochastic transition function can model the statistical uncertainty better
than a deterministic transition function. In this example, since the cart-pole environment is
deterministic, you use deterministic transition functions.
It is challenging to have a perfect model, and a trained model usually has modeling uncertainty. One
common approach to overcoming modeling uncertainty is to use multiple transition models. The
original MBPO paper uses seven models [1]. For this example, to reduce computational cost, you use
three models. The MBPO agent generates experiences using all three transition models. The
following figure shows how an ensemble of transition models generates samples without interacting
with the environment. In this figure, the models generate two trajectories with horizon = 2.
Create three deterministic transition functions. To do so, create a deep neural network using the
createDeterministicTransitionNetwork helper function. Then, use the neural network to
create an rlContinuousDeterministicTransitionFunction object. When creating a transition
function object, you must specify the action and observation input/output names for the neural
network.
net1 = createDeterministicTransitionNetwork(4,1);
transitionFcn = rlContinuousDeterministicTransitionFunction(net1,...
obsInfo,...
actInfo,...
ObservationInputNames="state",...
ActionInputNames="action",...
NextObservationOutputNames="nextObservation");
net2 = createDeterministicTransitionNetwork(4,1);
transitionFcn2 = rlContinuousDeterministicTransitionFunction(net2,...
obsInfo,...
actInfo,...
ObservationInputNames="state",...
ActionInputNames="action",...
NextObservationOutputNames="nextObservation");
net3 = createDeterministicTransitionNetwork(4,1);
transitionFcn3 = rlContinuousDeterministicTransitionFunction(net3,...
obsInfo,...
actInfo,...
ObservationInputNames="state",...
ActionInputNames="action",...
NextObservationOutputNames="nextObservation");
An MBPO agent also contains a reward model for the environment. If you know a ground-truth reward function, you can specify it using a custom function. In this example, the ground-truth reward function is defined in the cartPoleRewardFunction helper function. To use this reward function, set useGroundTruthReward to true.
You can also specify a neural-network-based reward function that the MBPO agent can train. In this example, you can use such a reward function by setting useGroundTruthReward to false. The deep neural network for the reward function is defined in the createRewardNetworkActionNextObs helper function. To define a reward function using the neural network, create an rlContinuousDeterministicRewardFunction object.
useGroundTruthReward = true;
if useGroundTruthReward
rewardFcn = @cartPoleRewardFunction;
else
% This neural network uses action and next observation as inputs.
rewardnet = createRewardNetworkActionNextObs(4,1);
rewardFcn = rlContinuousDeterministicRewardFunction(rewardnet,...
obsInfo,...
actInfo, ...
ActionInputNames="action",...
NextObservationInputNames="nextState");
end
An MBPO agent also contains an is-done model for computing the termination signal for the environment. If you know a ground-truth termination signal, you can specify it using a custom function. In this example, the ground-truth termination signal is defined in the cartPoleIsDoneFunction helper function. To use this termination function, set useGroundTruthIsDone to true.
You can also specify a neural-network-based is-done function that the MBPO agent can train. In this
example, you can use such an is-done function by setting useGroundTruthIsDone to false. The
deep neural network for the is-done function is defined in the createIsDoneNetwork helper
function. To define an is-done function using the neural network, create an rlIsDoneFunction
object.
useGroundTruthIsDone = true;
if useGroundTruthIsDone
isdoneFcn = @cartPoleIsDoneFunction;
else
% This neural network uses only the next observation as input.
isdoneNet = createIsDoneNetwork(4);
isdoneFcn = rlIsDoneFunction(isdoneNet,...
obsInfo,...
actInfo,...
NextObservationInputNames="nextState");
end
Define a neural network environment using the transition, reward, and is-done functions. To do so,
create an rlNeuralNetworkEnvironment object.
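A minimal sketch of this step follows. The variable name generativeEnv matches its later use when creating the MBPO agent; passing the three transition functions as an array is an assumption.
% Create the neural network environment from the model functions (sketch).
generativeEnv = rlNeuralNetworkEnvironment(obsInfo,actInfo, ...
    [transitionFcn,transitionFcn2,transitionFcn3],rewardFcn,isdoneFcn);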
Define an MBPO agent using the base off-policy agent and environment model. To do so, first create
an MBPO agent options object.
MBPOAgentOpts = rlMBPOAgentOptions;
Specify options for training the environment model. Train the model for 1 epoch at the beginning of
each episode and use 15 mini-batches of size 256.
MBPOAgentOpts.NumEpochForTrainingModel = 1;
MBPOAgentOpts.NumMiniBatches = 15;
MBPOAgentOpts.MiniBatchSize = 256;
MBPOAgentOpts.ModelExperienceBufferLength = 60000;
Specify the ratio of real and generated experience used to train the base SAC agent. For this
example, 20% of samples are from the real experience buffer and 80% of samples are from model
experience buffer.
MBPOAgentOpts.RealSampleRatio = 0.2;
Specify options for generating experiences using the environment model, including the number of model rollouts and a piecewise schedule that increases the rollout horizon from 1 to a maximum of 3 during training.
MBPOAgentOpts.ModelRolloutOptions.NumRollout = 20000;
MBPOAgentOpts.ModelRolloutOptions.HorizonUpdateSchedule = "piecewise";
MBPOAgentOpts.ModelRolloutOptions.HorizonUpdateFrequency = 100;
MBPOAgentOpts.ModelRolloutOptions.Horizon = 1;
MBPOAgentOpts.ModelRolloutOptions.HorizonMax = 3;
Specify optimizer options for training the transition models. Use the same optimizer options for all
three transition models.
transitionOptimizerOptions1 = rlOptimizerOptions(...
LearnRate=1e-4,...
GradientThreshold=1.0);
transitionOptimizerOptions2 = rlOptimizerOptions(...
LearnRate=1e-4,...
GradientThreshold=1.0);
transitionOptimizerOptions3 = rlOptimizerOptions(...
LearnRate=1e-4,...
GradientThreshold=1.0);
MBPOAgentOpts.TransitionOptimizerOptions = ...
[transitionOptimizerOptions1,...
transitionOptimizerOptions2,...
transitionOptimizerOptions3];
Specify optimizer options for training the reward model. If you use a custom ground-truth reward
function, the agent ignores these options.
rewardOptimizerOptions = rlOptimizerOptions(...
LearnRate=1e-4,...
GradientThreshold=1.0);
MBPOAgentOpts.RewardOptimizerOptions = rewardOptimizerOptions;
Specify optimizer options for training the is-done model. If you use a custom ground-truth is-done function, the agent ignores these options.
isdoneOptimizerOptions = rlOptimizerOptions(...
LearnRate=1e-4,...
GradientThreshold=1.0);
MBPOAgentOpts.IsDoneOptimizerOptions = isdoneOptimizerOptions;
Create the MBPO agent, specifying the base agent, environment model, and options.
agent = rlMBPOAgent(baseagent,generativeEnv,MBPOAgentOpts);
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
• Run the training for at most 500 episodes, with each episode lasting at most 500 time steps.
• Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).
• Save a copy of the agent when the episode reward is greater than or equal to 470.
• Stop training when the agent receives an average cumulative reward greater than 470 over 5 consecutive episodes. At this point, the agent can balance the pole in the upright position.
trainOpts = rlTrainingOptions(...
MaxEpisodes=500, ...
MaxStepsPerEpisode=500, ...
Verbose=false, ...
Plots="training-progress",...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=470,...
ScoreAveragingWindowLength=5,...
SaveAgentCriteria="EpisodeReward",...
SaveAgentValue=470);
You can visualize the cart-pole system by using the plot function during training or simulation.
plot(env)
Train the agent using the train function. Training this agent is a computationally intensive process
that takes several minutes to complete. To save time while running this example, load a pretrained
agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained agent for the example.
load("MATLABCartpoleMBPO.mat","agent");
end
To validate the performance of the trained agent, simulate it within the cart-pole environment. For more information on agent simulation, see rlSimulationOptions and sim. Exploration during validation is not necessary in this example. Therefore, to use deterministic actions during the simulation, set the UseExplorationPolicy property of the agent to false.
rng(1)
agent.UseExplorationPolicy = false; % disable exploration during sim
simOptions = rlSimulationOptions(MaxSteps=500);
experience = sim(env,agent,simOptions);
totalReward_MBPO = sum(experience.Reward)
totalReward_MBPO = 460.7233
Instead of simulating the MBPO agent, you can simulate the base agent. If you use the same random
seed, you get the same result as simulating the MBPO agent.
rng(1)
experience = sim(env,agent.BaseAgent,simOptions);
totalReward_SAC = sum(experience.Reward)
totalReward_SAC = 460.7233
To validate the trained environment transition models, you can check whether they are able to
correctly predict the next observations. Similarly, you can validate the performance of the reward and
is-done functions. To make a prediction based on the environment model, use the step function.
rng(1)
agent.UseExplorationPolicy = true; % enable exploration during sim to create diverse data for model validation
simOptions = rlSimulationOptions(MaxSteps=500);
experience = sim(env,agent,simOptions);
For this example, evaluate the performance of the first transition model.
agent.EnvModel.TransitionModelNum = 1;
numSteps = length(experience.Reward.Data);
nextObsPrediction = zeros(4,1,numSteps);
rewardPrediction = zeros(1,numSteps);
isdonePrediction = zeros(1,numSteps);
nextObsGroundTruth = zeros(4,1,numSteps);
rewardGroundTruth = zeros(1,numSteps);
isdoneGroundTruth = zeros(1,numSteps);
for stepCt = 1:numSteps
% Extract the actual next observation, reward, and is-done value.
nextObsGroundTruth(:,:,stepCt) = ...
experience.Observation.CartPoleStates.Data(:,:,stepCt+1);
rewardGroundTruth(:, stepCt) = experience.Reward.Data(stepCt);
isdoneGroundTruth(:, stepCt) = experience.IsDone.Data(stepCt);
% Predict the next observation, reward, and is-done value using the
% environment model.
obs = experience.Observation.CartPoleStates.Data(:,:,stepCt);
agent.EnvModel.Observation = {obs};
action = experience.Action.CartPoleAction.Data(:,:,stepCt);
[nextObs,reward,isdone] = step(agent.EnvModel,{action});
nextObsPrediction(:,:,stepCt) = nextObs{1};
rewardPrediction(:,stepCt) = reward;
isdonePrediction(:,stepCt) = isdone;
end
Plot the ground truth and prediction of each dimension of the observations.
figure
for obsDimensionIndex = 1:4
subplot(2,2,obsDimensionIndex)
plot(reshape(nextObsGroundTruth(obsDimensionIndex,:,:),1,numSteps))
hold on
plot(reshape(nextObsPrediction(obsDimensionIndex,:,:),1,numSteps))
hold off
xlabel('Step')
ylabel('Observation')
if obsDimensionIndex == 1
legend('GroundTruth','Prediction','Location','southwest')
end
end
References
[1] Janner, Michael, Justin Fu, Marvin Zhang, and Sergey Levine. “When to Trust Your Model: Model-Based Policy Optimization.” In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 12519–30. Red Hook, NY, USA: Curran Associates Inc., 2019.
See Also
rlMBPOAgent | rlMBPOAgentOptions | rlNeuralNetworkEnvironment
Related Examples
• “Model-Based Policy Optimization Agents” on page 3-62
Model-Based Reinforcement Learning Using Custom Training Loop
This example shows how to define a custom training loop for a model-based reinforcement learning
(MBRL) algorithm. You can use this workflow to train an MBRL policy with your custom training
algorithm using policy and value function representations from Reinforcement Learning Toolbox™
software.
For an example that uses the built-in model-based policy optimization (MBPO) agent, see “Train MBPO Agent to Balance Cart-Pole System”. For an overview of built-in MBPO agents, see “Model-Based Policy Optimization Agents” on page 3-62.
In this example, you use transition models to generate more experiences while training a custom
DQN [2] agent in a cart-pole environment. The algorithm used in this example is based on an MBPO
algorithm [1]. The original MBPO algorithm trains an ensemble of stochastic models and a soft actor-
critic (SAC) agent in tasks with continuous actions. In contrast, this example trains three
deterministic models and a DQN agent in a task with discrete actions. The following figure
summarizes the algorithm used in this example.
The agent generates real experiences by interacting with the environment. These experiences are
used to train a set of transition models, which are used to generate additional experiences. The
training algorithm then uses both the real and generated experiences to update the agent policy.
Create Environment
For this example, a reinforcement learning policy is trained in a discrete cart-pole environment. The
objective in this environment is to balance the pole by applying forces (actions) on the cart. Create
the environment using the rlPredefinedEnv function. Fix the random generator seed for
reproducibility. For more information on this environment, see “Load Predefined Control System
Environments” on page 2-23.
clear
clc
rngSeed = 1;
rng(rngSeed);
env = rlPredefinedEnv('CartPole-Discrete');
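The code that follows uses obsInfo, actInfo, numObservations, and numContinuousActions, which are not defined in this excerpt. A minimal sketch of how they are typically obtained, with the variable names matching the later code (for this environment, the observation is a four-element vector and the action is a scalar force):
% Get the observation and action specifications from the environment.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObservations = obsInfo.Dimension(1);         % 4 observation elements
numContinuousActions = prod(actInfo.Dimension); % 1 action channel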
Critic Construction
DQN is a value-based reinforcement learning algorithm that estimates the discounted cumulative
reward using a critic. In this example, the critic network contains featureInputLayer, fullyConnectedLayer, and reluLayer layers.
qNetwork = [
featureInputLayer(obsInfo.Dimension(1),'Name','state')
fullyConnectedLayer(24,'Name','CriticStateFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(24, 'Name','CriticStateFC2')
reluLayer('Name','CriticCommonRelu')
fullyConnectedLayer(length(actInfo.Elements),'Name','output')];
qNetwork = dlnetwork(qNetwork);
Create the critic using the specified neural network. For more information, see rlVectorQValueFunction.
critic = rlVectorQValueFunction(qNetwork,obsInfo,actInfo);
Create optimizer objects for updating the critic. For more information, see rlOptimizerOptions.
optimizerOpt = rlOptimizerOptions(...
LearnRate=1e-3, ...
GradientThreshold=1);
criticOptimizer = rlOptimizer(optimizerOpt);
Create a greedy policy object based on the critic. For more information, see rlMaxQPolicy.
policy = rlMaxQPolicy(critic);
Model-based reinforcement learning uses transition models of the environment. The model usually
consists of a transition function, a reward function, and a terminal state function.
• The transition function predicts the next observation given the current observation and the action.
• The reward function predicts the reward given the current observation, the action, and the next
observation.
• The terminal state function predicts the terminal state given the observation.
As shown in the following figure, this example uses three transition functions as an ensemble of
transition models to generate samples without interacting with the environment. The true reward
function and the true terminal state function are given in this example.
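As an illustration, the following is a hypothetical sketch of a ground-truth reward function for this task, assuming the reward structure of the predefined cart-pole environment (+1 for every step the pole stays upright, -5 when a terminal state is reached); the true reward function provided with this example is not shown here and may differ.
function reward = cartPoleRewardSketch(~,~,nextObs)
    % Hypothetical ground-truth reward for the cart-pole task.
    x = nextObs(1);        % cart position (m)
    theta = nextObs(3);    % pole angle (rad)
    isTerminal = abs(x) > 2.4 || abs(theta) > 12*pi/180;
    if isTerminal
        reward = -5;       % penalty when the pole falls or the cart leaves the track
    else
        reward = 1;        % reward for keeping the pole upright
    end
end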
Define three neural networks for the transition models. Each neural network predicts the difference between the next observation and the current observation.
numModels = 3;
transitionNetwork1 = ...
createTransitionNetwork(numObservations, numContinuousActions);
transitionNetwork2 = ...
createTransitionNetwork(numObservations, numContinuousActions);
transitionNetwork3 = ...
createTransitionNetwork(numObservations, numContinuousActions);
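The createTransitionNetwork helper is not shown in this excerpt. For reference, the following is a minimal sketch of a deterministic transition network with the same role: it takes the observation and action as inputs and outputs the predicted observation difference. The layer sizes and names here are illustrative assumptions; the actual helper may be structured differently.
function net = createTransitionNetworkSketch(numObservations,numActions)
    % Hypothetical two-input network that predicts the change in observation.
    obsPath = [
        featureInputLayer(numObservations,Name="state")
        fullyConnectedLayer(64,Name="obsFC")];
    actPath = [
        featureInputLayer(numActions,Name="action")
        fullyConnectedLayer(64,Name="actFC")];
    commonPath = [
        additionLayer(2,Name="add")
        reluLayer(Name="relu1")
        fullyConnectedLayer(64,Name="fc1")
        reluLayer(Name="relu2")
        fullyConnectedLayer(numObservations,Name="nextObsDiff")];
    lg = layerGraph(obsPath);
    lg = addLayers(lg,actPath);
    lg = addLayers(lg,commonPath);
    lg = connectLayers(lg,"obsFC","add/in1");
    lg = connectLayers(lg,"actFC","add/in2");
    net = dlnetwork(lg);
end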
Create an experience buffer for storing agent experiences (observation, action, next observation,
reward, and isDone).
myBuffer.bufferSize = 1e5;
myBuffer.bufferIndex = 0;
myBuffer.currentBufferLength = 0;
myBuffer.observation = zeros(numObservations,myBuffer.bufferSize);
myBuffer.nextObservation = ...
zeros(numObservations,myBuffer.bufferSize);
myBuffer.action = ...
zeros(numContinuousActions,1,myBuffer.bufferSize);
myBuffer.reward = zeros(1,myBuffer.bufferSize);
myBuffer.isDone = zeros(1,myBuffer.bufferSize);
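For illustration, a minimal sketch of how a single experience could be appended to a circular buffer with this layout (the buffer-update code used inside the actual training loop is not shown in this excerpt, so the helper name and logic below are assumptions):
function buffer = appendExperienceSketch(buffer,obs,action,nextObs,reward,isDone)
    % Hypothetical circular-buffer update.
    buffer.bufferIndex = mod(buffer.bufferIndex,buffer.bufferSize) + 1;
    buffer.observation(:,buffer.bufferIndex) = obs;
    buffer.action(:,1,buffer.bufferIndex) = action;
    buffer.nextObservation(:,buffer.bufferIndex) = nextObs;
    buffer.reward(buffer.bufferIndex) = reward;
    buffer.isDone(buffer.bufferIndex) = isDone;
    buffer.currentBufferLength = ...
        min(buffer.currentBufferLength + 1,buffer.bufferSize);
end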
Create a model experience buffer for storing the experiences generated by the models.
myModelBuffer.bufferSize = 1e5;
myModelBuffer.bufferIndex = 0;
myModelBuffer.currentBufferLength = 0;
myModelBuffer.observation =...
zeros(numObservations,myModelBuffer.bufferSize);
myModelBuffer.nextObservation =...
zeros(numObservations,myModelBuffer.bufferSize);
myModelBuffer.action = ...
zeros(numContinuousActions,myModelBuffer.bufferSize);
myModelBuffer.reward = zeros(1,myModelBuffer.bufferSize);
myModelBuffer.isDone = zeros(1,myModelBuffer.bufferSize);
Configure Training
Specify the parameters used to train the agent and the transition models, including the number of training episodes, mini-batch size, roll-out horizon, and optimizer settings.
numEpisodes = 250;
maxStepsPerEpisode = 500;
discountFactor = 0.99;
aveWindowSize = 10;
trainingTerminationValue = 480;
warmStartSamples = 2000;
numEpochs = 1;
miniBatchSize = 256;
horizonLength = 2;
epsilonMinModel = 0.1;
numGenerateSampleIteration = 20;
sampleGenerationOptions.horizonLength = horizonLength;
sampleGenerationOptions.numGenerateSampleIteration = ...
numGenerateSampleIteration;
sampleGenerationOptions.miniBatchSize = miniBatchSize;
sampleGenerationOptions.numObservations = numObservations;
sampleGenerationOptions.epsilonMinModel = epsilonMinModel;
% optimizer options
velocity1 = [];
velocity2 = [];
velocity3 = [];
decay = 0.01;
momentum = 0.9;
learnRate = 0.0005;
Specify the following options for updating the agent.
• Use the epsilon-greedy algorithm with an initial epsilon value of 1, a minimum value of 0.01, and a decay rate of 0.005 (see the sketch after the following code).
• Update the target network every 4 steps.
• Set the ratio of real experiences to generated experiences to 0.2:0.8 by setting realRatio to 0.2. Setting realRatio to 1.0 is equivalent to model-free DQN.
• Take 5 gradient steps at each environment step.
epsilon = 1;
epsilonMin = 0.01;
epsilonDecay = 0.005;
targetUpdateFrequency = 4;
realRatio = 0.2; % Set to 1 to run a standard DQN
numGradientSteps = 5;
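For illustration, here is a minimal sketch of the epsilon-greedy selection and decay that these parameters control, assuming a current observation obs; the corresponding code inside the actual training loop is not fully shown in this excerpt, and the decay formula below is an assumption.
% Epsilon-greedy action selection (sketch).
if rand < epsilon
    % Explore: pick a random action from the discrete action set.
    action = actInfo.Elements(randi(numel(actInfo.Elements)));
else
    % Exploit: pick the greedy action from the policy.
    action = getAction(policy,{obs});
    action = action{1};
end
% Decay epsilon toward its minimum value.
epsilon = max(epsilonMin,epsilon*(1 - epsilonDecay));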
Create a vector for storing the cumulative reward for each training episode.
episodeCumulativeRewardVector = [];
Create a figure for model training visualization using the hBuildFigureModel helper function.
[trainingPlotModel, ...
lineLossTrain1, ...
lineLossTrain2, ...
lineLossTrain3, ...
axModel] = hBuildFigureModel();
Create a figure for model validation visualization using the hBuildFigureModelTest helper
function.
[testPlotModel, lineLossTest1, axModelTest] ...
= hBuildFigureModelTest();
Create a figure for DQN agent training visualization using the hBuildFigure helper function.
[trainingPlot,lineReward,lineAveReward, ax] = hBuildFigure;
Train Agent
Train the agent using a custom training loop. The training loop uses the following algorithm. For each
episode:
1 Train the transition models.
2 Generate experiences using the transition models and store the samples in the model experience
buffer.
3 Generate a real experience. To do so, generate an action using the policy, apply the action to the
environment, and obtain the resulting observation, reward, and is-done values.
4 Create a mini-batch by sampling experiences from both the experience buffer and the model
experience buffer.
5 Compute the target Q value.
6 Compute the gradient of the loss function with respect to the critic representation parameters.
7 Update the critic representation using the computed gradients.
8 Update the training visualization.
9 Terminate training if the critic is sufficiently trained.
Training the policy is a computationally intensive process. To save time while running this example,
load a pretrained agent by setting doTraining to false. To train the policy yourself, set
doTraining to true.
doTraining = false;
if doTraining
targetCritic = critic;
modelTrainedAtleastOnce = false;
totalStepCt = 0;
start = tic;
set(trainingPlotModel,Visible = "on");
set(testPlotModel,Visible = "on");
set(trainingPlot,Visible = "on");
%----------------------------------------------
% 2. Generate experience using models.
%----------------------------------------------
% Create numGenerateSampleIteration x
% horizonLength x numModels x miniBatchSize
% ex) 20 x 2 x 3 x 256 = 30720 samples
myModelBuffer = generateSamples(myBuffer,...
myModelBuffer,...
transitionNetworkVector,policy,actInfo,...
epsilon,sampleGenerationOptions);
end
end
%----------------------------------------------
% Interact with environment and train agent.
%----------------------------------------------
% Reset the environment at the start of the episode
observation = reset(env);
episodeReward = zeros(maxStepsPerEpisode,1);
errorPrediction = zeros(maxStepsPerEpisode,1);
% Check prediction
dx = predict(transitionNetworkVector(1),...
dlarray(observation,'CB'),dlarray(action,'CB'));
predictedNextObservation = observation + dx;
errorPrediction(stepCt) = ...
sqrt(sum((nextObservation - ...
predictedNextObservation).^2));
episodeReward(stepCt) = reward;
observation = nextObservation;
%----------------------------------------------
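% 4. Create a mini-batch by sampling experiences
%    from both the real and model experience buffers.
%----------------------------------------------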
[sampledObservation,...
sampledAction,...
sampledNextObservation,...
sampledReward,...
sampledIsdone] ...
= sampleMinibatch(...
modelTrainedAtleastOnce,...
realRatio,...
miniBatchSize,...
myBuffer,myModelBuffer);
%----------------------------------------------
% 5. Compute target Q value.
%----------------------------------------------
% Compute the maximum Q value over actions for the next observations.
[maxQValues, ~] = getMaxQValue(targetCritic, ...
{reshape(sampledNextObservation,...
[numObservations,1,miniBatchSize])});
% Compute the target Q values: reward plus discounted maximum
% Q value for nonterminal samples.
targetQValues = sampledReward + ...
discountFactor.*maxQValues.*(1 - sampledIsdone);
lossData.batchSize = miniBatchSize;
lossData.actInfo = actInfo;
lossData.actionBatch = sampledAction;
lossData.targetQValues = targetQValues;
%----------------------------------------------
% 6. Compute gradients.
%----------------------------------------------
criticGradient = ...
gradient(critic,...
@criticLossFunction, ...
{reshape(sampledObservation,...
[numObservations,1,miniBatchSize])},...
lossData);
%----------------------------------------------
% 7. Update the critic network using gradients.
%----------------------------------------------
[critic, criticOptimizer] = update(...
criticOptimizer, critic,...
criticGradient);
end
% Update target critic periodically
if mod(totalStepCt, targetUpdateFrequency)==0
targetCritic = critic;
end
%---------------------------------------------------------
% 8. Update the training visualization.
%---------------------------------------------------------
episodeCumulativeReward = sum(episodeReward);
episodeCumulativeRewardVector = cat(2,...
episodeCumulativeRewardVector,episodeCumulativeReward);
movingAveReward = movmean(episodeCumulativeRewardVector,...
aveWindowSize,2);
addpoints(lineReward,episodeCt,episodeCumulativeReward);
addpoints(lineAveReward,episodeCt,movingAveReward(end));
title(ax, "Training Progress - Episode: " + episodeCt + ...
", Total Step: " + string(totalStepCt) + ...
", epsilon:" + string(epsilon))
drawnow;
errorPrediction = errorPrediction(1:stepCt);
%---------------------------------------------------------
% 9. Terminate training
% if the network is sufficiently trained.
%---------------------------------------------------------
if max(movingAveReward) > trainingTerminationValue
break
end
end
else
load("cartPoleModelBasedCustomLoopPolicy.mat");
end
Simulate Agent
To see how the trained policy performs, simulate the agent in the environment. First, reset the environment and store the initial observation.
obs0 = reset(env);
obs = obs0;
Enable the environment visualization, which is updated each time the environment step function is
called.
plot(env)
Run the simulation loop for at most maxStepsPerEpisode steps. At each step:
1 Get the action by sampling from the policy using the getAction function.
2 Step the environment using the obtained action value.
3 Terminate if a terminal condition is reached.
actionVector = zeros(1,maxStepsPerEpisode);
obsVector = zeros(numObservations,maxStepsPerEpisode+1);
obsVector(:,1) = obs0;
for stepCt = 1:maxStepsPerEpisode
    % Sample an action from the policy and step the environment.
    action = getAction(policy,{obs});
    action = action{1};
    [nextObs,~,isdone] = step(env,action);
    obsVector(:,stepCt+1) = nextObs;
    actionVector(1,stepCt) = action;
    obs = nextObs;
    if isdone, break, end % stop at a terminal condition
end
lastStepCt = stepCt;
Test Model
Test one of the models by predicting a next observation given a current observation and an action.
modelID = 3;
predictedObsVector = zeros(numObservations,lastStepCt);
obs = dlarray(obsVector(:,1),'CB');
predictedObsVector(:,1) = obs;
for stepCt = 1:lastStepCt
obs = dlarray(obsVector(:,stepCt),'CB');
action = dlarray(actionVector(1,stepCt),'CB');
dx = predict(transitionNetworkVector(modelID),obs, action);
predictedObs = obs + dx;
predictedObsVector(:,stepCt+1) = predictedObs;
end
predictedObsVector = predictedObsVector(:, 1:lastStepCt);
figure(5)
layOut = tiledlayout(4,1, "TileSpacing", "compact");
for i = 1:4
nexttile;
errorPrediction = abs(predictedObsVector(i,1:lastStepCt) - ...
obsVector(i,1:lastStepCt));
line1 = plot(errorPrediction,"DisplayName", "Absolute Error");
title("observation "+num2str(i));
end
title(layOut,"Prediction Absolute Error")
The small absolute prediction error shows that the model is successfully trained to predict the next
observation.
References
[1] Volodymyr Minh, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan
Wierstra, and Martin Riedmiller. “Playing Atari with Deep Reinforcement Learning.” ArXiv:1312.5602
[Cs]. December 19, 2013. https://arxiv.org/abs/1312.5602.
[2] Janner, Michael, Justin Fu, Marvin Zhang, and Sergey Levine. "When to trust your model: Model-
based policy optimization." ArXiv:1907.08253 [Cs, Stat], November 5, 2019. https://arxiv.org/abs/
1906.08253.
See Also
Related Examples
• “Train Reinforcement Learning Agents” on page 5-3
• “Create Custom Reinforcement Learning Agents” on page 3-68
• “Create Agent for Custom Reinforcement Learning Algorithm” on page 5-410
Deploy Trained Policies
6
Deploy Trained Reinforcement Learning Policies
Once you train a reinforcement learning agent, you can generate code to deploy the optimal policy. You can generate:
• CUDA code for deep neural network policies using GPU Coder
• C/C++ code for table, deep neural network, or linear basis function policies using MATLAB Coder
Code generation is supported for agents using feedforward neural networks in any of the input paths, provided that all the used layers are supported. Code generation is not supported for continuous-action PG, AC, PPO, and SAC agents that use a recurrent neural network (RNN).
For more information on training reinforcement learning agents, see “Train Reinforcement Learning
Agents” on page 5-3.
To generate a policy evaluation function that selects an action based on a given observation, use
generatePolicyFunction. You can generate code to deploy this policy function using GPU Coder
or MATLAB Coder.
To generate a Simulink policy evaluation block that selects an action based on a given observation,
use generatePolicyBlock. You can generate code to deploy this policy block using Simulink
Coder.
Generate Code Using GPU Coder
Not all deep neural network layers support GPU code generation. For a list of supported layers, see “Supported Networks, Layers, and Classes” (GPU Coder). For more information and examples on GPU code generation, see “Deep Learning with GPU Coder” (GPU Coder).
As an example, generate GPU code for the policy gradient agent trained in “Train PG Agent to
Balance Cart-Pole System” on page 5-57.
load('MATLABCartpolePG.mat','agent')
generatePolicyFunction(agent)
This command creates the evaluatePolicy.m file, which contains the policy function, and the
agentData.mat file, which contains the trained deep neural network actor. For a given observation,
the policy function evaluates a probability for each potential action using the actor network. Then,
the policy function randomly selects an action based on these probabilities.
You can generate code for this network using GPU Coder. For example, you can generate a CUDA-compatible MEX function.
Configure the codegen function to create a CUDA-compatible C++ MEX function.
cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
Set an example input value for the policy evaluation function. To find the observation dimension, use
the getObservationInfo function. In this case, the observations are in a four-element vector.
argstr = '{ones(4,1)}';
codegen('-config','cfg','evaluatePolicy','-args',argstr,'-report');
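If code generation succeeds, codegen produces a MEX function, by default named evaluatePolicy_mex, that you can call like the original MATLAB function. For example, a hypothetical call with a sample observation:
obs = [0; 0; 0.05; 0];            % example four-element observation
action = evaluatePolicy_mex(obs)  % returns the selected action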
Generate Code Using MATLAB Coder
Using MATLAB Coder, you can generate:
• C/C++ code for policies that use Q tables, value tables, or linear basis functions. For more information on general C/C++ code generation, see “Generating Code” (MATLAB Coder).
• C++ code for policies that use deep neural networks. Note that code generation is not supported for continuous-action PG, AC, PPO, and SAC agents that use a recurrent neural network (RNN). For a list of supported layers, see “Networks and Layers Supported for Code Generation” (MATLAB Coder). For more information, see “Prerequisites for Deep Learning with MATLAB Coder” (MATLAB Coder) and “Deep Learning with MATLAB Coder” (MATLAB Coder).
Generate C Code for Deep Neural Network Policy without using any Third-Party Library
As an example, generate C code without dependencies on third-party libraries for the policy gradient
agent trained in “Train PG Agent to Balance Cart-Pole System” on page 5-57.
load('MATLABCartpolePG.mat','agent')
generatePolicyFunction(agent)
This command creates the evaluatePolicy.m file, which contains the policy function, and the
agentData.mat file, which contains the trained deep neural network actor. For a given observation,
the policy function evaluates a probability for each potential action using the actor network. Then,
the policy function randomly selects an action based on these probabilities.
Configure the codegen function to generate code suitable for building a MEX file.
cfg = coder.config('mex');
On the configuration object, set the target language to C, and set DeepLearningConfig to 'none'. This option generates code without using any third-party library.
cfg.TargetLang = 'C';
cfg.DeepLearningConfig = coder.DeepLearningConfig('none');
Set an example input value for the policy evaluation function. To find the observation dimension, use
the getObservationInfo function. In this case, the observations are in a four-element vector.
argstr = '{ones(4,1)}';
codegen('-config','cfg','evaluatePolicy','-args',argstr,'-report');
This command generates the C code for the policy gradient agent containing the deep neural network actor.
Generate C++ Code for Deep Neural Network Policy using Third-Party Libraries
As an example, generate C++ code for the policy gradient agent trained in “Train PG Agent to
Balance Cart-Pole System” on page 5-57 using the Intel Math Kernel Library for Deep Neural
Networks (MKL-DNN).
load('MATLABCartpolePG.mat','agent')
generatePolicyFunction(agent)
This command creates the evaluatePolicy.m file, which contains the policy function, and the
agentData.mat file, which contains the trained deep neural network actor. For a given observation,
the policy function evaluates a probability for each potential action using the actor network. Then,
the policy function randomly selects an action based on these probabilities.
Configure the codegen function to generate code suitable for building a MEX file.
cfg = coder.config('mex');
On the configuration object, set the target language to C++, and set DeepLearningConfig to the
target library 'mkldnn'. This option generates code using the Intel Math Kernel Library for Deep
Neural Networks (Intel MKL-DNN).
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('mkldnn');
Set an example input value for the policy evaluation function. To find the observation dimension, use
the getObservationInfo function. In this case, the observations are in a four-element vector.
argstr = '{ones(4,1)}';
codegen('-config','cfg','evaluatePolicy','-args',argstr,'-report');
This command generates the C++ code for the policy gradient agent containing the deep neural
network actor.
As an example, generate C code for the Q-learning agent trained in “Train Reinforcement Learning
Agent in Basic Grid World” on page 1-14.
load('basicGWQAgent.mat','qAgent')
generatePolicyFunction(qAgent)
This command creates the evaluatePolicy.m file, which contains the policy function, and the
agentData.mat file, which contains the trained Q table value function. For a given observation, the
policy function looks up the value function for each potential action using the Q table. Then, the
policy function selects the action for which the value function is greatest.
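The greedy table lookup described above simply takes the action with the largest stored Q value for the current observation. The following hypothetical sketch shows that logic; the code that generatePolicyFunction actually emits may differ.
function action = tableLookupPolicySketch(observation,qTable,actionSet)
    % Hypothetical greedy policy over a Q table stored as a
    % numObservations-by-numActions matrix.
    [~,actionIndex] = max(qTable(observation,:));
    action = actionSet(actionIndex);
end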
Set an example input value for the policy evaluation function. To find the observation dimension, use the getObservationInfo function. In this case, there is a single one-dimensional observation (belonging to a discrete set of possible values).
argstr = '{[1]}';
Configure the codegen function to generate embeddable C code suitable for targeting a static
library, and set the output folder to buildFolder.
cfg = coder.config('lib');
outFolder = 'buildFolder';
codegen('-c','-d',outFolder,'-config','cfg',...
'evaluatePolicy','-args',argstr,'-report');
See Also
generatePolicyFunction | generatePolicyBlock
More About
• “Reinforcement Learning Agents” on page 3-2
• “Train Reinforcement Learning Agents” on page 5-3