
Under review as submission to TMLR

Offline Reinforcement Learning with Bayesian Flow Networks

Anonymous authors
Paper under double-blind review

Abstract

This paper presents a novel approach to reinforcement learning (RL) utilizing Bayesian flow
networks for sequence generation, enabling effective planning in both discrete and contin-
uous domains by conditioning on returns and current states. We explore two conditioning
strategies: state inpainting and a classifier-free method. Experimental results demonstrate
the robustness of our method across various environments. It navigated gridworld environments in discrete settings without sacrificing performance in continuous tasks compared to the current state of the art. The results highlight our approach's ability to effectively cap-
ture spatial and temporal dependencies through a specialized neural network architecture
combining 2D convolutions with a temporal u-net.

1 Introduction

Offline reinforcement learning (RL), also known as batch RL, is a powerful paradigm that leverages pre-
viously collected data to learn effective policies. Unlike online RL, where agents interact directly with an
environment, offline RL operates in a safer mode, utilizing historical data without risking real-time ex-
ploration. This safety advantage is particularly crucial in domains like autonomous driving and medical
applications, where explorative policy collection can be hazardous. By drawing from pre-existing datasets,
offline RL enables more efficient learning, making it a promising approach for real-world applications where
data collection can be costly, time-consuming, or impractical. In recent years, there has been a surge of
interest in offline RL due to its potential to address the challenges of sample inefficiency and exploration in
traditional RL settings. However, offline RL also poses unique challenges, such as distributional shift and
data quality issues (Agarwal et al., 2020; Levine et al., 2020).
Recent advancements in conditional generative modeling offer an alternative approach to traditional offline
RL methods. Sequence modeling, in particular, has gained prominence (Janner et al., 2021; Ajay et al.,
2023; Chen et al., 2021). By viewing RL as a sequence modeling problem, we can leverage the power of
generative models to learn effective policies from already collected datasets. This perspective offers several
advantages, such as the ability to capture temporal dependencies and complex interactions within the data.
Moreover, conditional generative models allow for the generation of counterfactual trajectories, enabling
robust policy evaluation and exploration of alternative decision-making strategies conditioned on return,
desired goal state, or other desired behaviour. However, handling high-dimensional action or state spaces
can be computationally intensive and may necessitate innovative approaches to maintain tractability. Despite
this challenge, the integration of conditional generative modeling and RL holds promise for addressing the
limitations of traditional methods and advancing the state-of-the-art in offline reinforcement learning.

2 Preliminaries

This section introduces all the necessary background to follow the related works and method sections.


2.1 Reinforcement Learning

Reinforcement learning (RL) is a framework for learning to make decisions in an environment (Sutton &
Barto, 2018). The interactions with the environment are modelled as a Markov decision process (MDP),
which is a tuple (S, A, P, R, γ), where S is the state space, A is the action space, P is the transition
function, R is the reward function, and γ is the discount factor. In an environment with state s, the next
state, s′ ∼ P(s, a), is only dependent on the current state and action, not the history of previous states and
actions. In other words, it has the Markov property. The goal of RL is to learn a policy π : S → A that maximizes the expected return $\mathbb{E}[R_t]$, where $R_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$ is the discounted cumulative reward and $r_t$ is the reward received at time t.
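As a small, concrete illustration of the return defined above, the following Python snippet (our own, purely illustrative) computes the discounted cumulative reward for a finite reward sequence:

def discounted_return(rewards, gamma):
    # Compute R_t = sum_i gamma^i * r_{t+i} for a finite list of rewards,
    # where gamma is the discount factor of the MDP tuple (S, A, P, R, gamma).
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# Example: three rewards of 1 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))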
The exploration-exploitation trade-off is a fundamental challenge in reinforcement learning, typically asso-
ciated with online learning scenarios where agents iteratively interact with an environment to learn optimal
policies. Exploration involves sampling actions to gather information about the environment, potentially
leading to the discovery of better strategies, while exploitation entails leveraging known information to
maximize immediate rewards. Much of the research in online RL is dedicated to striking a balance be-
tween exploration and exploitation, devising algorithms that effectively navigate this trade-off to converge
to optimal or near-optimal policies.

2.2 Offline RL

In the realm of offline reinforcement learning, the primary objective is to learn effective policies from a
static dataset, without the need for online interactions (Levine et al., 2020). In this context, where agents
learn from a fixed dataset without interacting with the environment, the exploration aspect is inherently
absent. Instead, the focus shifts towards effectively utilizing the available dataset to optimize policies.
Traditionally, RL has been concerned with estimating stationary policies or single-step models, leveraging
the Markov property to factorize problems in time. However, applying standard RL methods to offline
settings is challenging: methods relying on value function estimation often suffer from overvaluing out-of-distribution states and actions. Various methods have been proposed to address this issue, including constraining the policy to be close to the data distribution (Peters et al., 2010) or using a conservative value function (Kumar et al., 2020).
An intriguing perspective emerges when we view RL through the lens of sequence modeling. Instead of
treating it as a specialized domain, we can consider RL as a generic sequence modeling problem. The crux
of this viewpoint lies in producing a sequence of actions that leads to a sequence of high rewards. Earlier
work has solved this by conditioning the model on returns such that trajectories with high return can be
generated in online settings (Ajay et al., 2023; Janner et al., 2021). By adopting this perspective, we can
simplify design decisions and dispense with many components commonly found in offline RL algorithms. This
approach not only demonstrates flexibility across various tasks such as long-horizon dynamics prediction,
imitation learning, goal-conditioned RL, and offline RL but also yields state-of-the-art planners in sparse-
reward, long-horizon scenarios (Janner et al., 2022; Ajay et al., 2023).

2.3 Denoising Diffusion Probabilistic Models

Denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020) are a class of generative models inspired by non-equilibrium thermodynamics. The forward process slowly adds Gaussian noise to data, and the reverse process amounts to learning to iteratively denoise the noisy data. Diffusion models have primarily been used for image generation, but have also shown state-of-the-art performance on video generation and 3D model generation (Ho et al., 2022; Luo & Hu, 2021).
Given data $x_0 \sim q(x)$, we define the forward process to produce a sequence of noisy samples $x_1, \ldots, x_K$,

$$ q(x_k \mid x_{k-1}) = \mathcal{N}\big(x_k;\ \sqrt{1-\beta_k}\, x_{k-1},\ \beta_k I\big), \qquad (1) $$

where $\{\beta_k \in (0,1)\}_{k=1}^{K}$ is a carefully chosen variance schedule. A nice property of the forward process is that we can directly sample $x_k$ at any step $k$. The distribution $q(x_k \mid x_0)$ can be derived using the property that


a sum of uncorrelated normally distributed random variables is itself normally distributed. Let $a_k = 1 - \beta_k$ and $\bar{a}_k = \prod_{i=1}^{k} a_i$, then

$$ q(x_k \mid x_0) = \mathcal{N}\big(x_k;\ \sqrt{\bar{a}_k}\, x_0,\ (1-\bar{a}_k) I\big). \qquad (2) $$

Note also that

$$ q(x_{k-1} \mid x_k, x_0) = \mathcal{N}\big(x_{k-1};\ \tilde{\mu}(x_k, x_0),\ \tilde{\beta}_k I\big), \qquad (3) $$

where

$$ \tilde{\mu}(x_k, x_0) = \frac{\sqrt{\bar{a}_{k-1}}\,\beta_k}{1-\bar{a}_k}\, x_0 + \frac{\sqrt{a_k}\,(1-\bar{a}_{k-1})}{1-\bar{a}_k}\, x_k, \qquad \tilde{\beta}_k = \frac{1-\bar{a}_{k-1}}{1-\bar{a}_k}\,\beta_k. \qquad (4) $$

While the forward process creates noisy representations of the data, the reverse process aims to iteratively recreate samples from noise by modelling and then sampling from $q(x_{k-1} \mid x_k)$. Let $p_\theta$ be a parameterized approximation of $q$. This gives

$$ p_\theta(x_{k-1} \mid x_k) = \mathcal{N}\big(x_{k-1};\ \mu_\theta(x_k, k),\ \Sigma_\theta(x_k, k)\big). \qquad (5) $$
Ho et al. (2020) chose to fix the variance term $\Sigma_\theta(x_k, k)$ as a constant $\sigma_k^2 = \tilde{\beta}_k$, see Eq. (4). Although Nichol & Dhariwal (2021) have shown improved results by learning a parameterization of $\Sigma_\theta(x_k, k)$, we will only look at how $\mu_\theta(x_k, k)$ is trained. First, we consider the reparameterization

$$ \tilde{\mu}_k = \frac{1}{\sqrt{a_k}}\left(x_k - \frac{1-a_k}{\sqrt{1-\bar{a}_k}}\,\epsilon_k\right), $$

where $\epsilon_k \sim \mathcal{N}(0, I)$. Since $x_k$ is known during training, we can choose to predict $\epsilon_k$ rather than $\tilde{\mu}_k$ directly. Empirically, this has shown better results. Let us define $\epsilon_\theta(x_k, k)$ as a model that predicts the noise, $\epsilon_k$, added to the input. This means that we can define

$$ \mu_\theta(x_k, k) = \frac{1}{\sqrt{a_k}}\left(x_k - \frac{1-a_k}{\sqrt{1-\bar{a}_k}}\,\epsilon_\theta(x_k, k)\right). $$

Ho et al. (2020) derive the following loss function to minimize the difference between $\mu_\theta$ and $\tilde{\mu}$:

$$ L(\theta) = \mathbb{E}_{k\sim[1,K],\, x_0,\, \epsilon_k}\left[\frac{\beta_k^2}{2\sigma_k^2\, a_k\, (1-\bar{a}_k)}\, \big\|\epsilon_k - \epsilon_\theta(x_k, k)\big\|^2\right]. \qquad (6) $$

They also present the following simplified loss function that turns out to give better empirical results:

$$ L(\theta) = \mathbb{E}_{k\sim[1,K],\, x_0,\, \epsilon_k}\left[\big\|\epsilon_k - \epsilon_\theta(x_k, k)\big\|^2\right]. \qquad (7) $$
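For concreteness, a minimal PyTorch sketch of the simplified objective in Eq. (7) is given below; the noise-prediction network eps_model and all names are illustrative assumptions rather than any particular implementation.

import torch

def ddpm_simple_loss(eps_model, x0, betas):
    # Simplified DDPM objective of Eq. (7): predict the noise added to x0.
    # betas is a 1-D tensor holding the variance schedule beta_1, ..., beta_K.
    K = betas.shape[0]
    a_bar = torch.cumprod(1.0 - betas, dim=0)          # cumulative products \bar{a}_k

    # Sample a step k uniformly for every element of the batch.
    k = torch.randint(0, K, (x0.shape[0],), device=x0.device)
    a_bar_k = a_bar[k].view(-1, *([1] * (x0.dim() - 1)))

    # Forward-process sample x_k via Eq. (2).
    eps = torch.randn_like(x0)
    x_k = torch.sqrt(a_bar_k) * x0 + torch.sqrt(1.0 - a_bar_k) * eps

    # Mean squared error between true and predicted noise.
    return ((eps - eps_model(x_k, k)) ** 2).mean()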

2.4 Guided Diffusion

There are two main ways diffusion models can condition on variables. One, called classifier-guided diffusion (Dhariwal & Nichol, 2021), uses the gradient of a classifier's predicted log-likelihood for a conditioning class y, taken with respect to the input, to alter the noise prediction toward the conditioning information. This method has the advantage that the diffusion model does not have to be trained on the conditioning variable. A noise predictor $\epsilon_\theta$, guided by a classifier $h(y \mid x_k, k)$ meant to estimate the probability that the noisy datapoint $x_k$ belongs to class $y$, would assume the following form:

$$ \epsilon_\theta(x_k, k, y) = \epsilon_\theta(x_k, k) - w\,\sigma_k\, \nabla_{x_k} \log h(y \mid x_k, k), \qquad (8) $$

where $w$ is a parameter controlling the strength of the guidance.
The second way, called classifier-free guidance (Ho & Salimans, 2021), plugs the conditioning variable directly into the denoising network as an auxiliary input variable during training. At test time, the auxiliary variable can be set to the conditioning value. In this setting, the model predictor would take the following form:

$$ \tilde{\epsilon}(x_k, k, y) = (w+1)\,\epsilon_\theta(x_k, k, y) - w\,\epsilon_\theta(x_k, k). \qquad (9) $$

Classifier-free guidance has shown better practical performance than classifier-guided diffusion (Ho & Salimans, 2021).
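As a brief illustration, a hedged sketch of the guidance combination in Eq. (9) follows; the convention that eps_model accepts y=None for the unconditional prediction is our own assumption.

def classifier_free_guidance(eps_model, x_k, k, y, w):
    # Combine conditional and unconditional noise predictions as in Eq. (9).
    # w controls the guidance strength; w = 0 recovers the purely conditional model.
    eps_cond = eps_model(x_k, k, y)        # epsilon_theta(x_k, k, y)
    eps_uncond = eps_model(x_k, k, None)   # epsilon_theta(x_k, k)
    return (w + 1.0) * eps_cond - w * eps_uncond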
Beyond the two primary methods of conditioning, we can also employ inpainting (Lugmayr et al., 2022) as a
technique to condition on partial observations. In the context of image generation, this implies conditioning
on some pixels within the image. Consider an image $x_0$ divided into known pixels $x_0^{\text{known}}$ and unknown pixels $x_0^{\text{unknown}}$, and a mask $m$ defining the known pixels. During the reverse process, we define:

$$ x_{k-1}^{\text{known}} = \sqrt{\bar{a}_k}\, x_0 + \sqrt{1 - \bar{a}_k}\, z, \qquad z \sim \mathcal{N}(0, I). \qquad (10) $$


The unknown pixels at step $k$, $x_{k-1}^{\text{unknown}}$, are computed in standard fashion:

$$ x_{k-1}^{\text{unknown}} = \frac{1}{\sqrt{a_k}}\left(x_k - \frac{\beta_k}{\sqrt{1-\bar{a}_k}}\,\epsilon_\theta(x_k, k)\right) + \sigma_k z, \qquad z \sim \mathcal{N}(0, I). \qquad (11) $$

Finally, we have:

$$ x_{k-1} = m \odot x_{k-1}^{\text{known}} + (1-m) \odot x_{k-1}^{\text{unknown}}, \qquad (12) $$

where ⊙ is elementwise multiplication. This is optimised for continuous data, and would likely not work well
for environments with discrete states. This is due to a multitude of reasons, one being that diffusion models
rely on smooth interpolation between states, which discrete data lack.
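To make the composition of Eqs. (10)-(12) concrete, here is a minimal PyTorch sketch of one reverse step with inpainting; the function and argument names are illustrative and not taken from any particular codebase.

import torch

def inpaint_reverse_step(eps_model, x_k, k, x0_known, mask, betas, a_bar, sigma):
    # mask is 1 where pixels are known and 0 elsewhere; x0_known holds the known
    # pixel values. betas, a_bar, and sigma hold beta_k, \bar{a}_k, and sigma_k.
    a_k, a_bar_k = 1.0 - betas[k], a_bar[k]
    z = torch.randn_like(x_k) if k > 0 else torch.zeros_like(x_k)

    # Known part: diffuse the ground-truth pixels to the current noise level (Eq. 10).
    x_known = torch.sqrt(a_bar_k) * x0_known + torch.sqrt(1.0 - a_bar_k) * z

    # Unknown part: the usual denoising update (Eq. 11).
    eps = eps_model(x_k, k)
    x_unknown = (x_k - betas[k] / torch.sqrt(1.0 - a_bar_k) * eps) / torch.sqrt(a_k)
    x_unknown = x_unknown + sigma[k] * torch.randn_like(x_k)

    # Compose the two parts with the mask (Eq. 12).
    return mask * x_known + (1.0 - mask) * x_unknown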

2.5 Bayesian Flow Networks

Bayesian flow networks (Graves et al., 2024) are a novel class of generative models capable of generating continuous, discrete, and discretized data. They resemble diffusion models in that they generate data in an iterative process. Unlike diffusion models, however, generation is not a reverse process starting from noisy data; it instead starts from a prior distribution and iteratively updates that distribution conditioned on a noisy version of the previous distribution. Also unlike diffusion models, Bayesian flow networks perform well on discrete data, and are therefore a more natural choice for planning in discrete state spaces. Bayesian flow networks have been shown to perform well on discretised image and text generation.
A comprehensive description of Bayesian Flow Networks is beyond the scope of this paper, but we aim to give
the reader a clear understanding of how they differ from diffusion models. Bayesian Flow Networks (BFNs)
are a class of probabilistic models designed to offer a flexible and scalable approach to modeling complex
distributions. In BFNs, the inputs to the neural network are parameters of distributions, which remain
continuous even for categorical distributions. This characteristic allows BFNs to adapt well to discrete data.
A flowchart of the Bayesian flow network algorithm for a single categorical variable is shown in Figure 1.
For a basic description of the discrete case, consider data represented as a $D$-dimensional vector $x = \big(x^{(1)}, \ldots, x^{(D)}\big) \in \{1, \ldots, A\}^D$, where $A$ is the number of classes and $\{1, \ldots, A\}$ is the set of integers from 1 to $A$. We will model this as a categorical distribution. First we will define the four different distributions,
the input distribution, the output distribution, the sender distribution, and the receiver distribution, shown
in Figure 1.


Figure 1: Chart shows one step in the Bayesian flow network process for one categorical variable. Figure is
adapted from Figure 1 in Graves et al. (2024).


Input distribution  For discrete data, the input distribution is modeled as a factorized categorical with parameters $\theta = \big(\theta^{(1)}, \ldots, \theta^{(D)}\big)$, where each $\theta^{(d)}$ comprises the $A$ probabilities of the categorical distribution for variable $d$. Specifically, $\theta^{(d)}_a$ represents the probability assigned to class $a$ for variable $d$:

$$ p_I(x \mid \theta) = \prod_{d=1}^{D} \theta^{(d)}_{x^{(d)}}. \qquad (13) $$

Initially, the input distribution is uniform, meaning $\theta_0 = \big(\tfrac{1}{A}, \ldots, \tfrac{1}{A}\big)$.

Output distribution  $\Psi(\theta, k)$ is a neural network that takes as input a $D$-dimensional parameter vector $\theta$, where each element holds the parameters of a categorical distribution. The output is of the same type. The output distribution for discrete data is defined based on the data $x$, the model inputs $\theta$ and $k$, and the resulting model outputs $\Psi(\theta, k) = \big(\Psi^{(1)}(\theta, k), \ldots, \Psi^{(D)}(\theta, k)\big) \in \mathbb{R}^{AD}$. The network inputs $\theta$ represent the parameters of the factorized categorical distribution $p_I(x \mid \theta)$, while $k$ serves as an additional input that represents the process time:

$$ p_O(x \mid \theta, k) = \prod_{d=1}^{D} \Psi^{(d)}(\theta, k)_{x^{(d)}}. \qquad (14) $$

Here, $\Psi^{(d)}(\theta, k)$ denotes the $A$ components of the network output corresponding to the parameters $\big(\theta^{(d)}_1, \ldots, \theta^{(d)}_A\big)$ of the categorical distribution for the $d$-th observation.

Sender distribution  A sample from the sender distribution is used to update the parameters of the input distribution. The accuracy of these samples is controlled by an accuracy parameter $\alpha \in \mathbb{R}^+$. When $\alpha$ is low, the samples provide limited information about $x$. As $\alpha$ increases, the samples become increasingly informative about $x$. For $y = \big(y^{(1)}, \ldots, y^{(D)}\big) \in \mathcal{Y}^D$, the sender distribution is defined as

$$ p_S(y \mid x; \alpha) = \mathcal{N}\big(y \mid \alpha\,(A e_x - 1),\ \alpha A I\big), \qquad (15) $$

where $e_x$ is a one-hot vector of length $A$ whose $x$-th element is 1.

Receiver distribution  The receiver distribution is defined in terms of the output distribution $p_O$ and the sender distribution $p_S$, and takes the form

$$ p_R(y \mid \theta; k, \alpha) = \mathbb{E}_{p_O(x' \mid \theta; k)}\big[p_S(y \mid x'; \alpha)\big]. \qquad (16) $$

In essence, this integrates over all $x' \in \{1, \ldots, A\}^D$, weighting the contribution of each possible $x'$ by its likelihood under the output distribution $p_O(x \mid \theta, k)$, effectively combining all potential sender distributions into a single receiver distribution.
The objective at each step is to minimize the KL-divergence from the receiver distribution to the sender
distribution across all variables. Additionally, after all steps are complete, the objective is to maximize the
likelihood of sampling the data from the distribution pO .
We would like a loss function that can be evaluated at any step k without going through all the previous
steps. To do this, we need to know the distribution of the parameters θ given only the prior θ0 and the step
k. Graves et al. (2024) call this the Bayesian flow distribution, pF . This distribution is based on two terms
we will introduce now, namely the Bayesian update distribution, and the accuracy schedule.
The Bayesian update function, used to update the parameters of the input distribution at each step as shown in Figure 1, is given by


$$ h(\theta_{i-1}, y) = \frac{e^{y} \odot \theta_{i-1}}{\sum_{a=1}^{A} e^{y_a}\,(\theta_{i-1})_a}. \qquad (17) $$
This function updates the parameters θi−1 using new samples y resulting in θi ← h (θi−1 , y). We slightly
abuse notation and use ey to mean the elementwise exponentiation of y.
Given a multivariate Dirac delta distribution, $\delta(\cdot)$, the Bayesian update distribution is defined as

$$ p_U(\theta \mid \theta_{i-1}, x; \alpha) = \mathbb{E}_{\mathcal{N}(y \mid \alpha(A e_x - 1),\, \alpha A I)}\big[\delta\big(\theta - h(\theta_{i-1}, y)\big)\big]. \qquad (18) $$

Next, we define the accuracy schedule β(k) as the integral of the accuracy rate α(k) over time:

$$ \beta(k) = \int_0^k \alpha(k')\, dk'. \qquad (19) $$

In the discrete case, Graves et al. (2024) use the schedule $\beta(k) = \beta(1)k^2$, where β(1) is a hyperparameter
to be determined empirically for each experiment. Essentially, we calculate the cumulative accuracy up to
time k, so that we later can perform a single Bayesian update from the prior to time k.
Combining the accuracy schedule with the Bayesian update distribution, we obtain the Bayesian flow distribution:

$$ p_F(\theta \mid x, k) = \mathbb{E}_{\mathcal{N}(y \mid \beta(k)(A e_x - 1),\, \beta(k) A I)}\big[\delta\big(\theta - h(\theta_0, y)\big)\big]. \qquad (20) $$

Graves et al. (2024) define two types of loss functions: the discrete-time loss Ln and the continuous-time loss
L∞ . The discrete-time loss Ln corresponds to n generation steps, while L∞ represents the loss as n → ∞.
One advantage of the continuous-time loss is that the number of generation steps can be determined at
inference time rather than when training the model. This is the loss we will utilize in our method. Given

$$ \hat{e}^{(d)}(\theta, k) = \sum_{a=1}^{A} p_O^{(d)}(a \mid \theta; k)\, e_a, \qquad (21) $$

the continuous time loss is defined as

$$ L_\infty(x) = \mathbb{E}_{k \sim U(0,1),\, \theta \sim p_F(\cdot \mid x, k)}\Big[\, k\, \big\|e_x - \hat{e}(\theta, k)\big\|^2 \Big]. \qquad (22) $$

This completes the definition of the continuous time loss function for Bayesian flow networks with discrete
data.
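As a rough illustration of how Eq. (22) can be computed in practice, the sketch below implements it in PyTorch for a batch of discrete data under the quadratic schedule β(k) = β(1)k²; the network interface net(theta, k) and all names are our assumptions.

import torch
import torch.nn.functional as F

def bfn_discrete_loss(net, x, n_classes, beta_1):
    # x: (batch, D) integer class labels; net(theta, k) is assumed to return
    # class logits of shape (batch, D, n_classes).
    e_x = F.one_hot(x, n_classes).float()                    # one-hot targets (batch, D, A)

    # Sample the process time k and the cumulative accuracy beta(k) = beta(1) k^2.
    k = torch.rand(x.shape[0], 1, 1, device=x.device)
    beta = beta_1 * k ** 2

    # Sample theta from the Bayesian flow distribution (Eq. 20): with a uniform
    # prior, the Bayesian update h(theta_0, y) reduces to a softmax over y.
    y = beta * (n_classes * e_x - 1.0) + torch.sqrt(beta * n_classes) * torch.randn_like(e_x)
    theta = torch.softmax(y, dim=-1)

    # Expected one-hot prediction \hat{e}(theta, k) from the output distribution.
    e_hat = torch.softmax(net(theta, k.reshape(-1)), dim=-1)

    # Eq. (22): k-weighted squared error between e_x and \hat{e}.
    return (k * (e_x - e_hat) ** 2).sum(dim=(1, 2)).mean()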
The inference process begins with initial parameters θ0 and proceeds through n steps, each characterized by specific accuracies α1, . . . , αn and corresponding time points $k_i = \frac{i}{n}$. At each step i, the parameters θi are updated recursively as follows:

1. Sample x from pO (· | θi−1 , ki−1 )

2. Generate y from the sender distribution pS (· | x, αi ).

3. Update the parameters θi = h(θi−1 , y).

Notice that the sender distribution is now conditioned on the sample from the output distribution and not
the data. After completing n steps, with the final parameters θn , one last step is performed, and the final
sample is drawn from pO (· | θn , 1).
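The three-step loop above can be written out as a minimal PyTorch sketch for a single sequence of D categorical variables, using the quadratic accuracy schedule; the network interface net(theta, t) and all names are our own assumptions.

import torch

def bfn_discrete_sample(net, D, n_classes, beta_1, n_steps):
    theta = torch.full((D, n_classes), 1.0 / n_classes)        # uniform prior theta_0
    for i in range(1, n_steps + 1):
        t = (i - 1) / n_steps
        # 1. Sample x from the output distribution p_O(. | theta_{i-1}, k_{i-1}).
        probs = torch.softmax(net(theta, t), dim=-1)
        x = torch.multinomial(probs, 1).squeeze(-1)
        # 2. Generate y from the sender distribution p_S(. | x, alpha_i).
        alpha = beta_1 * (2 * i - 1) / n_steps ** 2
        e_x = torch.eye(n_classes)[x]
        y = alpha * (n_classes * e_x - 1.0) + (alpha * n_classes) ** 0.5 * torch.randn(D, n_classes)
        # 3. Update the parameters theta_i = h(theta_{i-1}, y) via Eq. (17),
        #    written in a numerically stable softmax form.
        theta = torch.softmax(torch.log(theta) + y, dim=-1)
    # One last step: draw the final sample from p_O(. | theta_n, 1).
    probs = torch.softmax(net(theta, 1.0), dim=-1)
    return torch.multinomial(probs, 1).squeeze(-1)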


There are no obvious factors stopping Bayesian Flow Networks from utilizing the same conditioning tech-
niques as diffusion models, but the authors are not aware of any implementations of inpainting (Lugmayr
et al., 2022) or classifier guidance at the time of writing. Graves et al. (2024) specifically state that BFNs
pave the way for gradient-based sample guidance in discrete domains, however.

3 Related Work

Offline RL as a sequence modeling problem has been explored in several recent works.

3.1 Diffuser

The Diffuser model (Janner et al., 2022) uses an unconditional diffusion model to model state-action se-
quences. Since this model is unconditional, a differentiable reward function trained on noisy state-action
pairs is necessary to guide the model. A u-net architecture with 1-dimensional local receptive fields is used to model the diffusion process. With this architecture, the model is able to model sequences of arbitrary length, and can be conditioned on a desired return at test time. Additionally, the local receptive fields make the model learn local consistency that can evolve into global consistency over many diffusion steps. At
test-time, the model can be conditioned on the current state by using inpainting, and the desired return by
using classifier-free guidance. They further show that the inpainting technique can be used to condition on
desired end states, effectively allowing the model to solve planning problems it was not specifically trained
for.

3.2 Decision Diffuser

The Decision Diffuser (Ajay et al., 2023) is similar to the Diffuser method, but differs in two main ways: how it models actions, and how it conditions on rewards. First, the Decision Diffuser leverages an inverse dynamics model to capture the relationship between states and actions. This inverse dynamics model estimates actions conditioned on states, effectively predicting the action that brought the environment from one state to the next. This lets the diffusion model only model the state sequences rather than state-action sequences. They show empirically that using an inverse dynamics model is advantageous in deterministic environments, but that as more stochasticity is introduced into the environment the performance degrades to the same level as the Diffuser. The second way in which the Decision Diffuser differs from the Diffuser is that it conditions on return-to-go in a classifier-free manner. What this means is that the return is fed into the model during training so that the model learns which sequences to associate with that return. The desired return can again be fed into the model when generating a sequence of future states.

4 Method

We propose a sequence-generating approach to reinforcement learning based on Bayesian flow networks, capable of planning in both discrete and continuous domains. From now on we will refer to our method as BFN-RL. Like the Decision Diffuser (Ajay et al., 2023), we model only the state sequences, conditioned on the current state and future return, and utilize a second inverse-dynamics network to model the actions conditioned on the states. We opt for this method, as it showed superior performance compared to modeling state-action pairs with a single diffusion model (Janner et al., 2022). We expect the same benefits when using Bayesian flow networks as the generative model, but future work may reveal differently.
In our method, the Bayesian flow network is trained to generate sequences conditioned on the return and
current state. The return is the sum of all discounted future rewards, and is therefore a measure of the
quality of a sequence. The network learns to model the distribution of sequences with both high and low
return, and can at test time be conditioned on a desired return. The network is not specifically trained to generate sequences with high return; rather, at test time it can be conditioned on a high return value and produce plans that outperform any seen in the dataset.


4.1 Conditioning on Return

There are two obvious ways we can condition on return: the Diffuser (Janner et al., 2022) way, or the Decision Diffuser (Ajay et al., 2023) way. Considering that the Decision Diffuser approach is significantly easier to implement, showed better performance, and does not involve training an extra model, we opted to condition directly on return in a classifier-free manner, as was done in the Decision Diffuser (Ajay et al., 2023). This is shown in Figure 2, where the neural network in each BFN step receives the current parameters of the state distribution factorized over each timestep, as well as the return and the step k. In the figure, s_t refers to the state of the environment at time t, and ŝ_{t+1,k} refers to the noisy state estimate at time t + 1 after k BFN steps.
By directly conditioning on return during the generation process, we ensure that the generated trajectories
are biased towards high-reward regions of the state-action space, promoting the discovery of effective policies
that maximize cumulative rewards. Moreover, leveraging the Decision Diffuser framework allows for seamless
integration with existing generative modeling architectures, such as the temporal u-net architecture used in
Decision Diffuser (Ajay et al., 2023).

[Figure 2 diagram: the current state s_t, the noisy state estimates ŝ_{t+1,k}, ŝ_{t+2,k}, ŝ_{t+3,k}, ŝ_{t+4,k}, ..., the return, and the step k are fed into a BFN step, which outputs ŝ_{t+1,k+1}, ŝ_{t+2,k+1}, ŝ_{t+3,k+1}, ŝ_{t+4,k+1}, ....]

Figure 2: Each BFN step is conditioned on st , return, and step k to generate a sequence of future states.

4.2 Conditioning on Current State

When the Diffuser (Janner et al., 2022) and Decision Diffuser (Ajay et al., 2023) condition on the current state, they apply an inpainting technique specific to diffusion models. During the reverse diffusion process, the known part of the sequence, the first state, is at each timestep replaced by the true value diffused the appropriate amount for that timestep. We have adapted a similar technique for Bayesian flow networks. For continuous data, the Bayesian update at each step for the conditioned variable is made in the correct direction. Similarly, for discrete data, the probabilities of the categorical distribution are set to the appropriate values for that step. Algorithm 1 shows this method implemented for discrete data. The alterations to BFN sampling (Algorithm 9 in Graves et al. (2024)) are marked with a "⊳ conditioning" comment.
A different way of conditioning on the current state is to implement it directly into the model, in the same way we condition on return. We call this method direct conditioning. When using direct conditioning, the model must be given the conditioning variables at training time as well. Rather than modify the network to accept a separate input for the condition, we choose to let the first step in the sequence denote the conditioning variable. This has the advantage that the conditioning variable more directly influences the beginning of the sequence than the end. Algorithm 2 shows how direct conditioning changes the sampling algorithm for discrete variables. The mask, m, masks out the first step in the sequence so that we condition on this step rather than generate it. The condition, c, is the current state. The marked lines are those added on top of the regular BFN sampling algorithm. In Algorithm 1, the mask m indicates which variables should be sampled from the sender distribution and which should be derived from the data. Algorithm 2, on the other hand, changes the input distribution such that the data is used directly for the conditioned variable. Preliminary experiments showed that this gave significantly better results for discrete variables, and similar or slightly better results in continuous experiments. Based on this, we proceed with the method presented in Algorithm 2.


Algorithm 1 Inpainting Conditioning for Discrete Random Variables

Require: β(1) ∈ R^+, number of steps n ∈ N, number of classes K, mask m, condition c
  θ ← 1/K
  for i = 1 to n do
      t ← (i − 1)/n
      k ∼ discrete_output_distribution(θ, t)
      α ← β(1)(2i − 1)/n^2
      y ∼ N(α(K e_k − 1), αKI)
      y_c ← α(K e_c − 1)                        ⊳ conditioning
      y ← (1 − m) ⊙ y_c + m ⊙ y                 ⊳ conditioning
      θ′ ← e^y ⊙ θ
      θ ← θ′ / Σ_k θ′_k
  end for
  k ∼ discrete_output_distribution(θ, 1)

Algorithm 2 Direct Conditioning for Discrete Random Variables

Require: β(1) ∈ R^+, number of steps n ∈ N, number of classes K, mask m, condition c
  θ ← 1/K
  for i = 1 to n do
      t ← (i − 1)/n
      θ ← (1 − m) ⊙ e_c + m ⊙ θ                 ⊳ conditioning
      k ∼ discrete_output_distribution(θ, t)
      α ← β(1)(2i − 1)/n^2
      y ∼ N(α(K e_k − 1), αKI)
      θ′ ← e^y ⊙ θ
      θ ← θ′ / Σ_k θ′_k
  end for
  θ ← (1 − m) ⊙ e_c + m ⊙ θ                     ⊳ conditioning
  k ∼ discrete_output_distribution(θ, 1)
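A minimal PyTorch sketch of Algorithm 2 is given below; the network interface net(theta, t), the tensor shapes, and all names are our assumptions rather than the exact implementation.

import torch

def direct_conditioned_sample(net, cond_state, mask, n_classes, beta_1, n_steps):
    # mask is 1 for positions to generate and 0 for conditioned positions;
    # cond_state holds class indices (values at generated positions are ignored).
    D = mask.shape[0]
    e_c = torch.eye(n_classes)[cond_state]                   # one-hot condition e_c
    m = mask.unsqueeze(-1).float()                           # (D, 1) broadcast mask
    theta = torch.full((D, n_classes), 1.0 / n_classes)      # uniform prior

    for i in range(1, n_steps + 1):
        t = (i - 1) / n_steps
        theta = (1.0 - m) * e_c + m * theta                  # conditioning step
        probs = torch.softmax(net(theta, t), dim=-1)
        k = torch.multinomial(probs, 1).squeeze(-1)          # sample from p_O
        alpha = beta_1 * (2 * i - 1) / n_steps ** 2
        e_k = torch.eye(n_classes)[k]
        y = alpha * (n_classes * e_k - 1.0) + (alpha * n_classes) ** 0.5 * torch.randn(D, n_classes)
        theta = torch.softmax(torch.log(theta) + y, dim=-1)  # Bayesian update, Eq. (17)

    theta = (1.0 - m) * e_c + m * theta                      # final conditioning
    probs = torch.softmax(net(theta, 1.0), dim=-1)
    return torch.multinomial(probs, 1).squeeze(-1)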

4.3 Inverse Dynamics Model

Just like the Decision Diffuser (Ajay et al., 2023), we opt to only model the state sequence with the generative model, and use an inverse dynamics model to predict the action that takes the environment from one state to the next. Preliminary experiments showed that this approach also performs better than generating state-action pairs when Bayesian flow networks are used as the generative model.
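As an illustration, a minimal PyTorch sketch of such an inverse dynamics model follows; the two-layer MLP with 512 units and ReLU activations mirrors the hyperparameters listed in the appendix, while the class name and interface are our own.

import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    # Predict the action a_t that takes the environment from state s_t to s_{t+1}.
    def __init__(self, state_dim, action_dim, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, s_t, s_next):
        # Concatenate consecutive states and regress the connecting action.
        return self.net(torch.cat([s_t, s_next], dim=-1))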

5 Experiments

We evaluate our method on two sets of tasks, one with discrete action and state spaces, and one with continuous action and state spaces. In the discrete case we use gridworld environments. For the continuous case we use the D4RL (Fu et al., 2020) datasets with the Gym-MuJoCo suite of environments. We compare our method to the Decision Diffuser (Ajay et al., 2023) and other state-of-the-art offline RL methods.

5.1 Gridworld

Our discrete environment experiments look at the method’s ability to plan in both stochastic and determin-
istic environments. The first environment, SingleRoomUndirected, is a simple grid world with a player and
a goal in a 6 × 6 grid surrounded by walls. There are 4 actions that can take the player up, down, left, or
right.


[Figure 3 diagram: stacks of 2D convolutions applied to each frame feed a temporal 1D convolutional u-net, mirrored by 2D transposed convolutions on the output side.]

Figure 3: Specialized u-net architecture used for GridWorld problems. A series of 2D convolutions is performed before the 1D temporal u-net, and a series of 2D transposed convolutions is performed after the temporal u-net. This ensures that the model can take advantage of the 2D structure of the GridWorld problems and generalize better to unseen sequences.

This environment serves as a testbed for evaluating the method’s performance in a controlled setting, allowing
us to assess its ability to navigate and reach the goal efficiently. Additionally, we introduce variations of
this environment to explore the method’s robustness to stochasticity. This comparative analysis provides
valuable insights into the method’s reliability and effectiveness in discrete state-space environments.
DynamicObstaclesUndirected is a different environment, also with a 6×6 grid surrounded by walls and a goal square. Additionally, it has a number of obstacles that move around the grid by randomly going up, down, left, or right at each timestep. We test the agent against 1, 2, or 4 obstacles to see how it performs in an increasingly stochastic environment. A score of 1 is given if the agent reaches the goal state, a score of −1 if it hits any obstacle, and a score of 0 if the episode terminates after 32 steps without the agent reaching the goal or hitting an obstacle. The dataset is collected by a random agent over 100k episodes.
We propose a novel neural network architecture tailored specifically for gridworld environments. Leveraging the inherent structure of gridworlds, our architecture starts by applying 2D convolutional operations to each frame of the sequence. This approach allows the network to capture spatial relationships within the grid. Subsequently, the feature maps are flattened and fed into a temporal u-net, enabling the network to learn temporal dependencies across frames. This design choice facilitates generalization to unseen transitions within the gridworld while mitigating memory consumption compared to a 3D convolutional approach. The temporal u-net part is similar to the one used in Diffuser and Decision Diffuser, and consists of temporal convolutions, group normalization, and Mish activation functions.
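A simplified PyTorch sketch of this frame-wise design is shown below; skip connections, group normalization, Mish activations, and the transposed-convolution decoder are omitted for brevity, and all channel sizes and names are illustrative assumptions.

import torch
import torch.nn as nn

class FrameEncoderTemporalNet(nn.Module):
    # Apply a shared 2D convolutional encoder to every frame of the sequence,
    # then process the flattened frame features with temporal (1D) convolutions.
    def __init__(self, in_channels, grid_size, feat_dim=256):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.project = nn.Linear(64 * grid_size * grid_size, feat_dim)
        self.temporal = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2),
        )

    def forward(self, x):
        # x: (batch, time, channels, height, width)
        B, T, C, H, W = x.shape
        frames = self.frame_encoder(x.reshape(B * T, C, H, W))  # per-frame 2D convs
        feats = self.project(frames.reshape(B * T, -1))         # flatten each frame
        feats = feats.reshape(B, T, -1).transpose(1, 2)          # (batch, feat, time)
        return self.temporal(feats)                              # temporal 1D convs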
In the discrete case we opt to use an analytic inverse dynamics model. Given two subsequent states, we look
at what direction the player moved in. If no action can move the player to the new state, a new plan is
generated from the current state. This assumes that player movement is deterministic.
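A small sketch of this analytic inverse dynamics check, under our own assumptions about how positions and actions are encoded, could look as follows; returning None signals that no single action explains the transition and a new plan should be generated.

def analytic_inverse_dynamics(player_pos, next_player_pos):
    # Recover the action from two subsequent (row, column) player positions.
    moves = {
        (-1, 0): "up",
        (1, 0): "down",
        (0, -1): "left",
        (0, 1): "right",
    }
    delta = (next_player_pos[0] - player_pos[0], next_player_pos[1] - player_pos[1])
    # No matching action: the caller should trigger replanning from the current state.
    return moves.get(delta)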
Table 1 shows the method's ability to create valid state sequences in a simple deterministic environment. We observe that as the number of BFN steps increases, a larger portion of the generated state sequences becomes valid. A state sequence is considered valid if there exists a policy π such that following this policy from the conditioned start state may generate the given state sequence. Furthermore, we observe that all the valid state sequences that were generated were of the desired length.
Table 2 shows the performance of BFN-RL on DynamicObstaclesUndirected with 0, 1, 2, and 4 dynamic obstacles. Surprisingly, the results for 4 obstacles are significantly better than for 2 obstacles. In general, though, the results suggest that our method struggles with more stochastic domains. Another possible explanation for the reduced performance with more obstacles is that the model did not have the capacity to capture the increased complexity of the environment.


Steps Fraction of plans that are valid Fraction of valid plans that produce the correct score
1 0.01 1.00
5 0.46 1.00
10 0.70 1.00
100 0.74 1.00

Table 1: This table shows how many plans in the discrete environment SingleRoomUndirected were valid
for the entire horizon, and how many of those gave the correct score.

Dynamic Obstacles Random BFN-RL


0 0.36 1.00
1 −0.06 0.90
2 −0.34 −0.05
4 −0.60 0.45

Table 2: This table shows the return on the DynamicObstaclesUndirected environment with a varying number
of obstacles. Random refers to the expected return of a random agent calculated from 10k simulations.
Returns for BFN-RL are averaged over 20 seeds.

5.2 Continuous Control

While the focus of our method lies in addressing challenges in discrete state spaces, something the Diffuser (Janner et al., 2022) and Decision Diffuser (Ajay et al., 2023) cannot easily do, we also aim to demonstrate its effectiveness in continuous settings. To this end, we evaluate our method on the D4RL MuJoCo dataset.
By extending our method to continuous environments, we aim to showcase its adaptability and versatility
across different problem domains. The D4RL MuJoCo dataset provides a comprehensive benchmark for
evaluating algorithms in continuous control tasks, offering a diverse range of simulated environments with
varying complexities. Through these evaluations, we aim to demonstrate that our method is not limited to
discrete settings but can also effectively handle continuous environments.
Table 3 shows the performance of BFN-RL compared to state-of-the-art algorithms. The table shows that
BFN-RL is competitive on most datasets, but clearly struggles on the hopper environment. The method
is generally very sensitive to hyperparameters, and we believe that results on the Hopper datasets could
likely be improved with further tuning specific to Hopper. HalfCheetah and Walker2d have the same 17-
dimensional input space, whereas Hopper has an 11-dimensional input space. We hypothesize that this
discrepancy, or something related to the dynamics of the agent, could require different hyperparameters for
BFN to effectively learn the environment dynamics. To make results with Decision Diffuser comparable, we
opted not to specifically tune hyperparameters for each environment.

6 Conclusion

In this work, we introduced a novel approach to reinforcement learning that leverages Bayesian flow net-
works (Graves et al., 2024) for sequence generation. Our method is capable of planning in both discrete and
continuous domains by conditioning on returns and current states. We explored two strategies for condition-
ing: conditioning on the current state using inpainting as seen in the Decision Diffuser (Ajay et al., 2023),
and conditioning on current state in a classifier-free manner. Our framework simplifies the training pipeline
and reduces computational overhead by eliminating the need for an additional return classifier.
Our experiments demonstrated the effectiveness of our approach across various environments. In the dis-
crete setting, our method successfully navigated gridworld environments, showcasing its ability to generate
valid plans and achieve desired outcomes. The specialized neural network architecture, which combines 2D
convolutions with a temporal u-net, effectively captured spatial and temporal dependencies, leading to ro-
bust performance. In continuous environments, our method’s adaptability was evident from its competitive
performance on the D4RL MuJoCo datasets, highlighting its versatility across different problem domains.


Dataset Environment BC CQL IQL DT TT MOReL Diffuser DD BFN-RL


Med-Expert HalfCheetah 55.2 91.6 86.7 86.8 95 53.3 79.8 90.6 93.4 ± 1.3
Med-Expert Hopper 52.5 105.4 91.5 107.6 110.0 108.7 107.2 111.8 93.4 ± 3.0
Med-Expert Walker2d 107.5 108.8 109.6 108.1 101.9 95.6 108.4 108.8 106.6 ± 0.2
Medium HalfCheetah 42.6 44.0 47.4 42.6 46.9 42.1 44.2 49.1 45.8 ± 0.4
Medium Hopper 52.9 58.5 66.3 67.6 61.1 95.4 58.5 79.3 45.1 ± 1.8
Medium Walker2d 75.3 72.5 78.3 74.0 79 77.8 79.7 82.5 75.3 ± 2.3
Med-Replay HalfCheetah 36.6 45.5 44.2 36.6 41.9 40.2 42.2 39.3 35.6 ± 1.3
Med-Replay Hopper 18.1 95 94.7 82.7 91.5 93.6 96.8 100 42.1 ± 3.4
Med-Replay Walker2d 26.0 77.2 73.9 66.6 82.6 49.8 61.2 75 54.7 ± 2.8
Average 51.9 77.6 77 74.7 78.9 72.9 75.3 81.8 65.8

Table 3: Offline Reinforcement Learning. The table summarizes the test performance of BFN-RL and various other methods on continuous control. The results indicate that BFN-RL matches state-of-the-art performance on most datasets. We report mean and standard error over 3 random seeds. All numbers except for BFN-RL are taken from Ajay et al. (2023).

The results from both discrete and continuous tasks indicate that our approach can generalize well to diverse
reinforcement learning challenges.

References
Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline rein-
forcement learning. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International
Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 104–114.
PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/agarwal20c.html.

Anurag Ajay, Yilun Du, Abhi Gupta, Joshua B. Tenenbaum, Tommi S. Jaakkola, and Pulkit Agrawal.
Is conditional generative modeling all you need for decision making? In The Eleventh International
Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=sP1fo2K9DFG.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel,
Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence
modeling. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan
(eds.), Advances in Neural Information Processing Systems, volume 34, pp. 15084–15097. Curran
Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/
7f489f642a0ddb10272b5c31057f0663-Paper.pdf.

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural
information processing systems, 34:8780–8794, 2021.

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-
driven reinforcement learning, 2020.

Alex Graves, Rupesh Kumar Srivastava, Timothy Atkinson, and Faustino Gomez. Bayesian flow networks,
2024.

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep
Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=
qw8AKxfYbI.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural
information processing systems, 33:6840–6851, 2020.

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet.
Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.


Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence
modeling problem. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman
Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 1273–1286. Cur-
ran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/
099fe6b0b444c23836c4a5d07346082b-Paper.pdf.

Michael Janner, Yilun Du, Joshua B. Tenenbaum, and Sergey Levine. Planning with diffusion for flexible
behavior synthesis. In International Conference on Machine Learning, 2022.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980,
2014. URL https://api.semanticscholar.org/CorpusID:6628106.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for of-
fline reinforcement learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin
(eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1179–1191. Curran
Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/
0d2b2061826a5df3221116a5085a6052-Paper.pdf.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review,
and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint:
Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pp. 11461–11471, 2022.

Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2837–2845, 2021.

Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. CoRR,
abs/2102.09672, 2021. URL https://arxiv.org/abs/2102.09672.

Jan Peters, Katharina Mulling, and Yasemin Altun. Relative entropy policy search. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 24, pp. 1607–1612, 2010.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.

Appendix

Hyperparameters

Here we present hyperparameters used in the experiments. For the discrete experiments, we used:

• β(1) = 3.

• Planning horizon H = 32 for SingleRoomUndirected and H = 4 for DynamicObstaclesUndirected.

• Learning rate 2e−4, Adam optimizer (Kingma & Ba, 2014) with (β1, β2) = (0.9, 0.999).

• K = 100 steps were used for sample generation.

Most hyperparameters and model architectures for continuous experiments that are not specific to Bayesian
Flow Networks are similar to those used in the official Decision Diffuser implementation. Hyperparameters
used for the continuous experiments:

• Inverse dynamics model is an MLP with two layers with 512 units and ReLU activations.


• ε_θ and f_φ are trained for 2e6 steps using the Adam optimiser (Kingma & Ba, 2014) with a batch size of 64, a learning rate of 2e−4, and (β1, β2) = (0.9, 0.98).
• We use a planning horizon H of 20.
• For testing we used an exponential moving average of the weights with decay α = 0.999
• K = 200 steps were used for sample generation.
• σ1 = 0.01.

Algorithms

Algorithm 3 and 4 show an implementation of Algorithm 1 and 2 for continuous data.

Algorithm 3 Inpainting Conditioning for Continuous Random Variables

Require: σ_1 ∈ R^+, number of steps n ∈ N, mask m, condition c
  μ ← 0
  ρ ← 0
  for i = 1 to n do
      t ← (i − 1)/n
      x̂(θ, t) ← cts_output_distribution(μ, t, 1 − σ_1^{2t})
      α ← σ_1^{−2i/n} (1 − σ_1^{2/n})
      y ∼ N(x̂(θ, t), α^{−1} I)
      y_c ← (1 − σ_1^{2t}) c                    ⊳ conditioning
      y ← (1 − m) ⊙ y_c + m ⊙ y                 ⊳ conditioning
      μ ← (ρμ + αy)/(ρ + α)
      ρ ← ρ + α
  end for
  x̂(θ, 1) ← cts_output_distribution(μ, 1, 1 − σ_1^2)

Algorithm 4 Direct Conditioning for Continuous Random Variables

Require: σ_1 ∈ R^+, number of steps n ∈ N, mask m, condition c
  μ ← 0
  ρ ← 0
  for i = 1 to n do
      t ← (i − 1)/n
      x̂(θ, t) ← cts_output_distribution(μ, t, 1 − σ_1^{2t})
      α ← σ_1^{−2i/n} (1 − σ_1^{2/n})
      y ∼ N(x̂(θ, t), α^{−1} I)
      μ ← (1 − m) ⊙ c + m ⊙ μ                   ⊳ conditioning
      μ ← (ρμ + αy)/(ρ + α)
      ρ ← ρ + α
  end for
  μ ← (1 − m) ⊙ c + m ⊙ μ                       ⊳ conditioning
  x̂(θ, 1) ← cts_output_distribution(μ, 1, 1 − σ_1^2)

Discrete Experiments

Figure 4 shows two plans generated in the SingleRoomUndirected environment, one with 10 steps and one with 16 steps.


Figure 4: Plans generated by BFN-RL in SingleRoomUndirected. (a) is conditioned to generate a plan of length 10, and (b) is conditioned to generate a plan of length 16.
