Reinforcement Learning Toolbox™ Release Notes

R2022b

R2022a
    RL Agent Block: Learn from last action applied to the environment
    Reinforcement Learning Designer App: Support for SAC and TRPO agents
    Training Parallelization Options: DataToSendFromWorkers and StepsUntilDataIsSent properties are no longer active
    Code generated by generatePolicyFunction now uses policy objects

R2021b

R2021a
    Deterministic Exploitation: Create PG, AC, PPO, and SAC agents that use deterministic actions during simulation and in generated policy functions
    New Examples: Train agent with constrained actions and use DQN agent for optimal scheduling

R2020b

R2020a
    Continuous Action Spaces: Train AC, PG, and PPO agents in environments with continuous action spaces
    Recurrent Neural Networks: Train DQN and PPO agents with recurrent deep neural network policies and value functions
    Softplus Layer: Create deep neural network layer using the softplus activation function
    Parallel Processing: Improved memory usage and performance

R2019b

R2019a
    Interoperability: Import policies from Keras and the ONNX model format
    Reference Examples: Implement controllers using reinforcement learning for automated driving and robotics applications

R2022b
Version: 2.3
New Features
Bug Fixes
You can automatically generate and configure a Policy block using either the new function
generatePolicyBlock or the new button on the RL Agent block mask.
For more information, see generatePolicyBlock and the Policy block. For an example, see
“Generate Policy Block for Deployment”.
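For example, a minimal sketch of the programmatic workflow (assuming a trained agent object named agent already exists in the workspace):

% Generate a Policy block for the policy of an existing agent.
generatePolicyBlock(agent);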
For more information, see rlDataLogger. For an example, see “Log Training Data To Disk”.
Prioritized experience replay does not support agents that use recurrent neural networks.
R2022a
Version: 2.2
New Features
Bug Fixes
Compatibility Considerations
You can represent actor and critic functions using six new approximator objects. These objects
replace the previous representation objects and improve efficiency, readability, scalability, and
flexibility.
When creating a critic or an actor, you can now select and update optimization options using the new
rlOptimizerOptions object, instead of using the older rlRepresentationOptions object.
Specifically, you can create an agent options object and set its CriticOptimizerOptions and
ActorOptimizerOptions properties to suitable rlOptimizerOptions objects. Then you pass the
agent options object to the function that creates the agent.
Alternatively, you can create the agent and then use dot notation to access the optimization options
for the agent actor and critic, for example:
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 0.1;.
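For example, a minimal sketch of this workflow for an AC agent (assuming you have already created actor and critic approximator objects named actor and critic):

% Create optimizer options for the critic and the actor.
criticOpts = rlOptimizerOptions("LearnRate",1e-3,"GradientThreshold",1);
actorOpts  = rlOptimizerOptions("LearnRate",1e-4,"GradientThreshold",1);

% Assign them to the agent options, then create the agent.
agentOpts = rlACAgentOptions;
agentOpts.CriticOptimizerOptions = criticOpts;
agentOpts.ActorOptimizerOptions  = actorOpts;
agent = rlACAgent(actor,critic,agentOpts);

% You can also adjust the options after creating the agent.
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 0.1;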
To implement a customized agent, you can instantiate a policy using the following new policy objects.
• rlMaxQPolicy — This object implements a policy that selects the action that maximizes a
discrete state-action value function.
• rlEpsilonGreedyPolicy — This object implements a policy that selects the action that
maximizes a discrete state-action value function with probability 1-Epsilon, otherwise selects a
random action.
• rlDeterministicActorPolicy — This object implements a policy that you can use to
implement custom agents with a continuous action space.
• rlAdditiveNoisePolicy — This object is similar to rlDeterministicActorPolicy but noise
is added to the output according to an internal noise model.
• rlStochasticActorPolicy — This object implements a policy that you can use to implement
custom agents with a continuous action space.
For more information on these policy objects, at the MATLAB® command line, type help followed by
the policy object name.
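For example, a minimal sketch of wrapping an existing critic in policy objects for a custom agent (assuming critic is a Q-value function approximator for a discrete action space and obs is an observation from the environment):

% Greedy policy: always selects the action that maximizes the Q-value function.
greedyPolicy = rlMaxQPolicy(critic);

% Epsilon-greedy policy: selects the maximizing action with probability 1-Epsilon.
explorePolicy = rlEpsilonGreedyPolicy(critic);

% Sample an action from a policy for a given observation.
action = getAction(explorePolicy,obs);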
You can use the new rlReplayMemory object to append, store, save, sample and replay experience
data. Doing so makes it easier to implement custom training loops and your own reinforcement
learning algorithms.
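For example, a minimal sketch (assuming obsInfo and actInfo are continuous observation and action specifications for your environment):

% Create a replay memory buffer that holds up to 100,000 experiences.
buffer = rlReplayMemory(obsInfo,actInfo,1e5);

% Append one experience to the buffer.
experience.Observation     = {rand(obsInfo.Dimension)};
experience.Action          = {rand(actInfo.Dimension)};
experience.Reward          = 1;
experience.NextObservation = {rand(obsInfo.Dimension)};
experience.IsDone          = 0;
append(buffer,experience);

% Sample a random mini-batch of experiences for a custom learning update.
miniBatch = sample(buffer,64);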
When creating a customized training loop or agent, you can also access optimization features using
the objects created by the new rlOptimizer function. Specifically, create an optimizer algorithm
object using rlOptimizer, and optionally use dot notation to modify its properties. Then, create a
structure and set its CriticOptimizer or ActorOptimizer field to the optimizer object. When you
call runEpisode, pass the structure as an input parameter. The runEpisode function can then use
the update method of the optimizer object to update the learnable parameters of your actor or critic.
For more information, see Custom Training Loop with Simulink Action Noise and Train
Reinforcement Learning Policy Using Custom Training Loop.
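For example, a minimal sketch of the optimizer part of this workflow (the subsequent runEpisode call and experience-processing function are omitted):

% Create optimizer algorithm objects from optimizer options.
criticOptimizer = rlOptimizer(rlOptimizerOptions("LearnRate",1e-3));
actorOptimizer  = rlOptimizer(rlOptimizerOptions("LearnRate",1e-4));

% Package the optimizers in the structure passed to runEpisode, as described above.
processingData.CriticOptimizer = criticOptimizer;
processingData.ActorOptimizer  = actorOptimizer;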
Compatibility Considerations
The following representation objects are no longer recommended:
• rlValueRepresentation
• rlQValueRepresentation
• rlDeterministicActorRepresentation
• rlStochasticActorRepresentation
For more information on how to update your code to use the new objects, see “Representation objects
are not recommended”.
You can now use a neural network environment to do either of the following.
• Create an internal environment model for a model-based policy optimization (MBPO) agent. For
more information on MBPO agents, see Model-Based Policy Optimization Agents.
• Create an environment for training other types of reinforcement learning agents. You can identify
the state-transition network using experimental or simulated data. Depending on your application,
using a neural network environment as an approximation of a more complex first-principle
environment can speed up your simulation and training.
For more information on creating MBPO agents, see Model-Based Policy Optimization Agents.
Centralized learning boosts exploration and facilitates learning in applications where the agents
perform a collaborative (or the same) task.
For more information on creating training options set for multiple agents, see
rlMultiAgentTrainingOptions.
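For example, a minimal sketch of a centralized training options set for two agents (the AgentGroups and LearningStrategy property names and values shown here are assumptions based on this workflow):

% Train agents 1 and 2 as a single centralized group.
trainOpts = rlMultiAgentTrainingOptions( ...
    "AgentGroups",{[1 2]}, ...           % assumed property: group both agents together
    "LearningStrategy","centralized");   % assumed property: centralized learning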
For more information and examples, see the train reference page.
For more information, see the SampleTime property of any agent options object. For more
information on conditionally executed subsystems, see Conditionally Executed Subsystems Overview
(Simulink).
In such cases, to improve learning results, you can now enable an input port to connect the last
action signal applied to the environment.
For more information on creating agents using Reinforcement Learning Designer, see Create
Agents Using Reinforcement Learning Designer.
• Train Reinforcement Learning Agents To Control Quanser QUBE™ Pendulum — Train a SAC agent
to generate a swing-up reference trajectory for an inverted pendulum and a PPO agent as a
mode-selection controller.
• Run SIL and PIL Verification for Reinforcement Learning — Perform software-in-the-loop and
processor-in-the-loop verification of trained reinforcement learning agents.
• Train SAC Agent for Ball Balance Control — Control a Kinova robot arm to balance a ball on a
plate using a SAC agent.
• Automatic Parking Valet with Unreal Engine Simulation — Implement a hybrid reinforcement
learning and model predictive control system that searches a parking lot and parks in an open
space.
Functions to create representation objects are no longer recommended. Depending on the type of
actor or critic being created, use one of the following objects instead.
Specifically, you can create an agent options object and set its CriticOptimizerOptions and
ActorOptimizerOptions properties to suitable rlOptimizerOptions objects. Then you pass the
agent options object to the function that creates the agent. This workflow is shown in the following
table.
rlRepresentationOptions: Not Recommended

crtOpts = rlRepresentationOptions(...
    'GradientThreshold',1);

rlOptimizerOptions: Recommended

criticOpts = rlOptimizerOptions(...
    'GradientThreshold',1);
agent = rlACAgent(actor,critic,agentOpts)
Alternatively, you can create the agent and then use dot notation to access the optimization options
for the agent actor and critic, for example:
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;.
The following table shows some typical uses of the representation objects to create neural
network-based critics and actors, and how to update your code with one of the new function
approximator objects instead.
The following table shows some typical uses of the representation objects to express table-based
critics with discrete observation and action spaces, and how to update your code with one of the new
objects instead.
The following table shows some typical uses of the representation objects to create critics and actors
which use a (linear in the learnable parameters) custom basis function, and how to update your code
with one of the new objects instead. In these function calls, the first input argument is a two-element
cell array containing both the handle to the custom basis function and the initial weight vector or
matrix.
For more information on the new approximator objects, see rlTable, rlValueFunction,
rlQValueFunction, rlVectorQValueFunction, rlContinuousDeterministicActor,
rlDiscreteCategoricalActor, and rlContinuousGaussianActor.
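For example, a minimal sketch (assuming obsInfo and actInfo are environment specifications, criticNet maps observations to a scalar value, and actorNet maps observations to continuous actions):

% Create a value-function critic and a deterministic actor from existing networks.
critic = rlValueFunction(criticNet,obsInfo);
actor  = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);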
The train function now returns an object or an array of objects as the output. The properties of the
object match the fields of the structure returned in previous versions. Therefore, the code based on
dot notation works in the same way.
trainStats = train(agent,env,trainOptions);
When training terminates, either because a termination condition is reached or because you click
Stop Training in the Reinforcement Learning Episode Manager, trainStats is returned as an
rlTrainingResult object.
The rlTrainingResult object contains the same training statistics previously returned in a
structure along with data to correctly recreate the training scenario and update the episode manager.
You can use trainStats as the third argument for another train call, which (when executed with the
same agents and environment) will cause training to resume from the exact point at which it stopped.
For more information and examples, see train and “Training: Stop and resume agent training”. For
more information on training agents, see Train Reinforcement Learning Agents.
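For example, a minimal sketch of stopping and resuming a training session:

% First training run; stop it manually or let a stop condition trigger.
trainStats = train(agent,env,trainOptions);

% Resume training later from the point at which it stopped.
trainStats = train(agent,env,trainStats);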
Attempting to set either of these properties will cause a warning. For more information, see
barrierPenalty.
The code generated by generatePolicyFunction now loads a deployable policy object from a
reinforcement learning agent. The results from running the generated policy function remain the
same.
R2021b
Version: 2.1
New Features
Bug Fixes
Compatibility Considerations
• Cost and constraint specifications defined in an mpc (Model Predictive Control Toolbox) or nlmpc
(Model Predictive Control Toolbox) controller object. This feature requires Model Predictive
Control Toolbox™ software.
• Performance constraints defined in Simulink Design Optimization™ model verification blocks.
Compatibility Considerations
• getModel now returns a dlnetwork object.
• Due to numerical differences in the network calculations, previously trained agents might behave
differently. If this happens, you can retrain your agents.
• To use Deep Learning Toolbox™ functions that do not support dlnetwork, you must convert the
network to layerGraph. For example, to use deepNetworkDesigner, replace
deepNetworkDesigner(network) with deepNetworkDesigner(layerGraph(network)).
For more information on creating TRPO agents, see rlTRPOAgent and rlTRPOAgentOptions.
PPO Agents: Improve agent performance by normalizing advantage
function
In some environments, you can improve PPO agent performance by normalizing the advantage
function during training. The agent normalizes the advantage function by subtracting the mean
advantage value and scaling by the standard deviation.
• "current" — Normalize the advantage function using the advantage function mean and standard
deviation for the current mini-batch of experiences.
• "moving" — Normalize the advantage function using the advantage function mean and standard
deviation for a moving window of recent experiences. To specify the window size, set the
AdvantageNormalizingWindow option.
For example, configure the agent options to normalize the advantage function using the mean and
standard deviation from the last 500 experiences.
opt = rlPPOAgentOptions;
opt.NormalizedAdvantageMethod = "moving";
opt.AdvantageNormalizingWindow = 500;
For more information on PPO agents, see Proximal Policy Optimization Agents.
The built-in agents now use dlnetwork objects as actor and critic representations. In most cases this
allows for a speedup of about 30%.
R2021a
Version: 2.0
New Features
Bug Fixes
Compatibility Considerations
To open the Reinforcement Learning Designer app, at the MATLAB command line, enter the
following:
reinforcementLearningDesigner
RNNs are deep neural networks with a sequenceInputLayer input layer and at least one layer that
has hidden state information, such as an lstmLayer. These networks can be especially useful when
the environment has states that are not included in the observation vector.
For more information on creating agents with RNNs, see rlDQNAgent, rlPPOAgent, and the
Recurrent Neural Networks section in Create Policy and Value Function Representations.
For more information on creating policies and value functions, see rlValueRepresentation,
rlQValueRepresentation, rlDeterministicActorRepresentation, and
rlStochasticActorRepresentation.
You can also use this input port to override the agent action for safe learning applications.
inspectTrainingResult Function: Plot training information from a
previous training session
You can now plot the saved training information from a previous reinforcement learning training
session using the inspectTrainingResult function.
By default, the train function shows the training progress and results in the Episode Manager. If you
configure training to not show the Episode Manager or you close the Episode Manager after training,
you can view the training results using the inspectTrainingResult function, which opens the
Episode Manager.
Deterministic Exploitation: Create PG, AC, PPO, and SAC agents that
use deterministic actions during simulation and in generated policy
functions
PG, AC, PPO, and SAC agents generate stochastic actions during training. By default, these agents
also use stochastic actions during simulation and deployment. You can now configure these agents to use
deterministic actions during simulations and in generated policy function code.
To enable deterministic exploitation, in the corresponding agent options object, set the
UseDeterministicExploitation property to true. For more information, see
rlPGAgentOptions, rlACAgentOptions, rlPPOAgentOptions, or rlSACAgentOptions.
For more information on simulating agents and generating policy functions, see sim and
generatePolicyFunction, respectively.
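For example, a minimal sketch (assuming agent is an existing SAC agent and env is its environment):

% Use deterministic actions during simulation and in generated policy code.
agent.AgentOptions.UseDeterministicExploitation = true;
experience = sim(env,agent);
generatePolicyFunction(agent);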
New Examples: Train agent with constrained actions and use DQN
agent for optimal scheduling
This release includes the following new reference examples.
• Water Distribution System Scheduling Using Reinforcement Learning — Train a DQN agent to
learn an optimal pump scheduling policy for a water distribution system.
• Train Reinforcement Learning Agent with Constraint Enforcement — Train an agent with critical
constraints enforced on its actions.
The properties defining the probability distribution of the Gaussian action noise model have changed.
This noise model is used by TD3 agents for exploration and target policy smoothing.
When a GaussianActionNoise noise object saved from a previous MATLAB release is loaded, the
value of VarianceDecayRate is copied to StandardDeviationDecayRate, while the square root
of the values of Variance and VarianceMin are copied to StandardDeviation and
StandardDeviationMin, respectively.
The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the Gaussian action noise model, use the new
property names instead.
Update Code
This table shows how to update your code to use the new property names for rlTD3AgentOptions
object td3opt.
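As a sketch of the update described above (assuming the Gaussian exploration noise model of td3opt is stored in its ExplorationModel property):

% Old (still works, not recommended): td3opt.ExplorationModel.Variance = 0.05;
% New: specify the standard deviation, that is, the square root of the variance.
td3opt.ExplorationModel.StandardDeviation = sqrt(0.05);
td3opt.ExplorationModel.StandardDeviationDecayRate = 1e-4;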
The properties defining the probability distribution of the Ornstein-Uhlenbeck (OU) noise model have
been renamed. DDPG and TD3 agents use OU noise for exploration.
The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the OU noise model, use the new property
names instead.
Update Code
This table shows how to update your code to use the new property names for rlDDPGAgentOptions
object ddpgopt and rlTD3AgentOptions object td3opt.
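Similarly, a sketch for the DDPG options object (assuming its OU noise model is stored in the NoiseOptions property):

% Old (still works, not recommended): ddpgopt.NoiseOptions.Variance = 0.1;
ddpgopt.NoiseOptions.StandardDeviation = sqrt(0.1);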
R2020b
Version: 1.3
New Features
Bug Fixes
Compatibility Considerations
For examples on training multiple agents, see Train Multiple Agents to Perform Collaborative Task,
Train Multiple Agents for Area Coverage, and Train Multiple Agents for Path Following Control.
You can create a SAC agent using the rlSACAgent function. You can also create a SAC-specific
options object with the rlSACAgentOptions function.
Default agents are available for DQN, DDPG, TD3, PPO, PG, AC, and SAC agents. For each agent, you
can call the agent creation function, passing in the observation and action specifications from the
environment. The function creates the required actor and critic representations using deep neural
network approximators.
You can specify initialization options (such as the number of hidden units for each layer, or whether to
use a recurrent neural network) for the default representations using an
rlAgentInitializationOptions object.
After creating a default agent, you can then access its properties and change its actor and critic
representations.
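For example, a minimal sketch of default agent creation (assuming env is an existing environment and that NumHiddenUnit is the initialization option controlling the hidden layer size):

% Create a default SAC agent directly from the environment specifications.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
initOpts = rlAgentInitializationOptions('NumHiddenUnit',128);
agent = rlSACAgent(obsInfo,actInfo,initOpts);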
getModel and setModel Functions: Access computational model used
by actor and critic representations
You can now access the computational model used by the actor and critic representations in a
reinforcement learning agent using the following new functions.
Using these functions, you can modify the computational model in a representation object without
recreating the representation.
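For example, a minimal sketch (assuming critic is an existing critic representation):

% Extract the computational model, modify it, and put it back.
net = getModel(critic);
% ... modify net, for example by editing layers or retraining it ...
critic = setModel(critic,net);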
• Create Agent for Custom Reinforcement Learning Algorithm — Create a custom agent for your
own custom reinforcement learning algorithm.
• Tune PI Controller using Reinforcement Learning — Tune a PI controller using the twin-delayed
deep deterministic policy gradient (TD3) reinforcement learning algorithm.
• Train PPO Agent for Automatic Parking Valet — Train a PPO agent to automatically search for a
parking space and park.
• Train DDPG Agent for PMSM Control — Train a DDPG agent to control the speed of a permanent
magnet synchronous motor.
For AC agents, the default value of the NumStepsToLookAhead option is now 32.
To use the previous default value instead, create an rlACAgentOptions object and set the option
value to 1.
opt = rlACAgentOptions;
opt.NumStepsToLookAhead = 1;
R2020a
Version: 1.2
New Features
Bug Fixes
Compatibility Considerations
These objects allow you to easily implement custom training loops for your own reinforcement
learning algorithms. For more information, see Train Reinforcement Learning Policy Using Custom
Training Loop.
Compatibility Considerations
The rlRepresentation function is no longer recommended. Use one of the four new objects
instead. For more information, see “rlRepresentation is not recommended” on page 6-3.
Recurrent Neural Networks: Train DQN and PPO agents with recurrent
deep neural network policies and value functions
You can now train DQN and PPO agents using recurrent neural network policy and value function
representations. For more information, see rlDQNAgent, rlPPOAgent, and Create Policy and Value
Function Representations.
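For example, a minimal sketch of a recurrent Q-value critic for a DQN agent with a four-dimensional observation and two discrete actions (the specifications and layer sizes are placeholders):

obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([1 2]);

% Recurrent network: sequenceInputLayer plus an lstmLayer, one Q-value per action.
criticNet = [
    sequenceInputLayer(4,'Name','state')
    fullyConnectedLayer(64,'Name','fc1')
    reluLayer('Name','relu1')
    lstmLayer(32,'OutputMode','sequence','Name','lstm')
    fullyConnectedLayer(2,'Name','qvals')];

critic = rlQValueRepresentation(criticNet,obsInfo,actInfo,'Observation',{'state'});
agent = rlDQNAgent(critic);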
Softplus Layer: Create deep neural network layer using the softplus
activation function
You can now use the new softplusLayer layer when creating deep neural networks. This layer
implements the softplus activation function Y = log(1 + e^X), which ensures that the output is always
positive. This activation function is a smooth continuous version of reluLayer.
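For example, a sketch of a standard-deviation output path for a stochastic (Gaussian) actor network, where the layer names are placeholders:

% softplusLayer keeps the standard-deviation output positive.
stdPath = [
    fullyConnectedLayer(1,'Name','std_fc')
    softplusLayer('Name','std_out')];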
• Train PPO Agent to Land Rocket — Train a PPO agent to land a rocket in an environment with a
discrete action space.
• Train DDPG Agent with Pretrained Actor Network — Train a DDPG agent using an actor network
that has been previously trained using supervised learning.
• Imitate Nonlinear MPC Controller for Flying Robot — Train a deep neural network to imitate a
nonlinear MPC controller.
The following table shows some typical uses of the rlRepresentation function to create neural
network-based critics and actors, and how to update your code with one of the new objects instead.
The following table shows some typical uses of the rlRepresentation function to express
table-based critics with discrete observation and action spaces, and how to update your code with
one of the new objects instead.
Table-Based Representations: Not Recommended

rep = rlRepresentation(tab), with tab containing a Q-value table with as many rows as the possible
observations and as many columns as the possible actions.

Table-Based Representations: Recommended

rep = rlQValueRepresentation(tab,obsInfo,actInfo). Use this syntax to create a single-output
state-action value representation for a critic that takes both observation and action as input,
such as a critic for an rlDQNAgent or rlDDPGAgent agent.
The following table shows some typical uses of the rlRepresentation function to create critics and
actors which use a custom basis function, and how to update your code with one of the new objects
instead. In the recommended function calls, the first input argument is a two-element cell array
containing both the handle to the custom basis function and the initial weight vector or matrix.
Target update method settings for DQN agents have changed. The following changes require updates
to your code:
• The TargetUpdateMethod option has been removed. Now, DQN agents determine the target
update method based on the TargetUpdateFrequency and TargetSmoothFactor option
values.
• The default value of TargetUpdateFrequency has changed from 4 to 1.
To use one of the following target update methods, set the TargetUpdateFrequency and
TargetSmoothFactor properties as indicated.
The default target update configuration, which is a smoothing update with a TargetSmoothFactor
value of 0.001, remains the same.
Update Code
This table shows some typical uses of rlDQNAgentOptions and how to update your code to use the
new option configuration.
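For example, a sketch that reproduces the previous default behavior, a periodic update every four learning steps (assuming that a TargetSmoothFactor of 1 corresponds to a pure periodic update):

opt = rlDQNAgentOptions;
opt.TargetUpdateFrequency = 4;   % update the target network every 4 steps
opt.TargetSmoothFactor = 1;      % full, non-smoothed update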
Target update method settings for DDPG agents have changed. The following changes require
updates to your code:
• The TargetUpdateMethod option has been removed. Now, DDPG agents determine the target
update method based on the TargetUpdateFrequency and TargetSmoothFactor option
values.
• The default value of TargetUpdateFrequency has changed from 4 to 1.
To use one of the following target update methods, set the TargetUpdateFrequency and
TargetSmoothFactor properties as indicated.
Update Method                                TargetUpdateFrequency    TargetSmoothFactor
Periodic smoothing (new method in R2020a)    Greater than 1           Less than 1
The default target update configuration, which is a smoothing update with a TargetSmoothFactor
value of 0.001, remains the same.
Update Code
This table shows some typical uses of rlDDPGAgentOptions and how to update your code to use the
new option configuration.
R2019b
Version: 1.1
New Features
Bug Fixes
For more information on PPO agents, see Proximal Policy Optimization Agents.
R2019a
Version: 1.0
New Features
You can train policies using the following reinforcement learning algorithms.
• Q-learning
• SARSA
• Deep Q-networks (DQN)
• Deep deterministic policy gradients (DDPG)
• Policy gradient (PG)
• Advantage actor-critic (A2C)
You can also train policies using other algorithms by creating a custom agent.
For more information on creating and training agents, see Reinforcement Learning Agents and Train
Reinforcement Learning Agents.
An environment model includes the following elements.
• Action and observation signals that the agent uses to interact with the environment.
• Reward signal that the agent uses to measure its success.
• Environment dynamic behavior.
You can model your environment using MATLAB and Simulink. For more information, see Create
MATLAB Environments for Reinforcement Learning and Create Simulink Environments for
Reinforcement Learning.
Interoperability: Import policies from Keras and the ONNX model
format
You can import existing deep neural network policies and value functions from other deep learning
frameworks, such as Keras and the ONNX™ format. For more information, see Import Policy and
Value Function Representations.
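For example, a minimal sketch of importing a Keras critic network and wrapping it in a representation using the R2019a-era rlRepresentation API (the file name, layer name, and exact rlRepresentation call are assumptions that depend on your network):

% Requires the Deep Learning Toolbox importer support package for Keras.
criticLayers = importKerasLayers('criticNetwork.h5','ImportWeights',true);
critic = rlRepresentation(criticLayers,obsInfo,'Observation',{'observation'});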
You can speed up training and simulation using parallel computing resources. If you have:
• Parallel Computing Toolbox software, you can run parallel simulations on multicore computers
• MATLAB Parallel Server software, you can run parallel simulations on computer clusters or cloud
resources
You can also speed up deep neural network training and inference with high-performance NVIDIA®
GPUs.
You can deploy trained policies as C/C++ shared libraries, Microsoft® .NET Framework assemblies,
Java® classes, and Python® packages.