Confidence-Guided Human-AI Collaboration: Reinforcement Learning With Distributional Proxy Value Propagation For Autonomous Driving
Abstract—Autonomous driving promises significant advancements in mobility, road safety, and traffic efficiency, yet reinforcement learning and imitation learning face safe-exploration and […]

[…] safety, increased traffic efficiency, and reduced environmental impacts. These advancements have garnered significant attention from both academia and industry [1]–[4]. Recent […]
[…] without environment interaction, further reducing risk [20]–[25], while IRL infers a reward function from demonstrations to guide desirable behaviors [26]. Despite these strengths, IL faces significant challenges. Distributional shift can cause compounding errors, pushing the agent away from its training distribution and resulting in control failures [27], [28]. Moreover, IL's success heavily depends on the quality of demonstrations, leaving it vulnerable to task variations [29]. Rare or atypical scenarios, often underrepresented in training data, can lead to erratic policy responses, especially in complex environments where expert data may be scarce or suboptimal [30]. Consequently, IL-based methods demand solutions that mitigate distributional shift, cope with suboptimal or limited demonstrations, and adapt to diverse driving conditions.

Human-AI collaboration (HAC) leverages human expertise to enhance safety and efficiency. Methods such as DAgger [31] and its extensions [32]–[34] periodically request expert demonstrations to correct compounding errors in IL, while expert intervention learning (EIL) [35] and intervention weighted regression (IWR) [36] rely on human operators to intervene during exploration for safer state transitions. Other approaches integrate human evaluative feedback [37]–[39] or partial demonstrations with limited interventions, as in HACO [40], to guide the learning process while reducing human effort. Proxy value propagation (PVP) [41] uses active human input to learn a proxy value function encoding human intentions, demonstrating robust performance across diverse tasks and action spaces. Despite these advances, many HAC methods assume near-optimal human guidance, which can be prohibitively expensive or suboptimal in practice [42]. Humans may adopt conservative strategies in ambiguous scenarios, such as decelerating behind a slower vehicle for safety, even when a more effective approach (e.g., overtaking) might exist. Furthermore, continuous oversight places a heavy burden on human operators, hindering large-scale deployment [19], while insufficient utilization of autonomous exploration data constrains overall learning efficiency [43]. Balancing human involvement, agent safety, and learning efficiency thus remains a central challenge in HAC research.

To reduce reliance on human guidance while ensuring efficient and safe policy learning, we propose a confidence-guided human-AI collaboration (C-HAC) strategy in this paper. It operates in two stages: a human-guided learning phase and a subsequent RL enhancement phase. In the former, distributional proxy value propagation (D-PVP) encodes human intentions within the distributional soft actor-critic (DSAC) algorithm. In the latter, the agent refines its policy independently. A shared control mechanism combines the learned human-guided policy with a self-learning policy, enabling exploration beyond human demonstrations. Additionally, a policy confidence evaluation algorithm leverages DSAC's return distributions to switch between human-guided and self-learning policies, ensuring safety and reliable performance. Extensive experiments show that C-HAC quickly acquires robust driving skills with minimal human input and continues improving without ongoing human intervention, outperforming conventional RL, IL, and HAC methods in safety, efficiency, and overall results. The main contributions of this paper are as follows:

• Distributional proxy value propagation (D-PVP): A novel method integrating PVP into the DSAC framework is introduced. D-PVP enables the agent to learn effective driving policies with relatively few human-guided interactions, achieving competent performance within a short training duration.
• Shared control mechanism: A mechanism combining the learned human-guided policy with a self-learning policy is proposed. The self-learning policy is designed to maximize cumulative rewards, allowing the agent to explore independently and continuously improve its performance beyond human guidance.
• Policy confidence evaluation algorithm: An algorithm leveraging DSAC's return distribution networks to facilitate dynamic switching between human-guided and self-learning policies via an intervention function is developed. This ensures the agent can pursue optimal policies while maintaining guaranteed safety and performance levels.

II. METHODOLOGY

This section presents the proposed C-HAC framework, which integrates D-PVP, a shared control mechanism, and policy confidence evaluation. As illustrated in Fig. 1, the framework consists of two phases. Initially, the agent employs D-PVP to learn from human demonstrations, allowing it to acquire safe and efficient driving policies through expert interventions. Subsequently, the agent continues to refine its policy via reinforcement learning, leveraging the shared control mechanism and policy confidence evaluation for safe and reliable exploration. To provide a comprehensive understanding, we first offer a brief overview of conventional RL and HAC, and then present detailed explanations of D-PVP, the shared control mechanism combined with policy confidence evaluation, and the entire training process.

A. Human-AI Collaboration

The policy learning of end-to-end autonomous driving is essentially a continuous action space problem in RL, which can be formulated as a Markov decision process (MDP). An MDP is defined by the tuple M = ⟨S, A, P, R, γ⟩, where S is the state space, A is the action space, P is the transition probability, R is the reward function, and γ is the discount factor. The goal in standard RL is to learn a policy π : S → A that maximizes the expected cumulative reward R_t = Σ_{t=0}^{∞} γ^t r_t, where r_t is the reward at time t. In this study, we employ an entropy-augmented objective function [44], incorporating policy entropy into the reward term:

    J_π = E_{(s_{i≥t}, a_{i≥t}) ∼ ρ_π} [ Σ_{i=t}^{∞} γ^{i−t} [r_i + α H(π(· | s_i))] ],   (1)

where α is the temperature coefficient, and the policy entropy H has the form

    H(π(· | s)) = E_{a ∼ π(·|s)} [ −log π(a | s) ].   (2)
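As a concrete illustration of (1) and (2), the short sketch below computes a Monte-Carlo estimate of the soft return for one recorded trajectory, using −log π(a_i | s_i) as a single-sample estimate of the entropy term. The function name, hyperparameter values, and toy numbers are illustrative assumptions, not part of the paper's implementation.

```python
# Minimal sketch of the entropy-augmented (soft) return in Eq. (1), assuming a
# recorded trajectory of rewards and action log-probabilities; -log pi(a_i|s_i)
# serves as a one-sample estimate of the entropy H(pi(.|s_i)) in Eq. (2).

def soft_return(rewards, log_probs, alpha=0.2, gamma=0.99):
    ret = 0.0
    # accumulate backwards so each step adds r_i + alpha * (-log pi) and discounts the tail
    for r, logp in zip(reversed(rewards), reversed(log_probs)):
        ret = (r + alpha * (-logp)) + gamma * ret
    return ret

# toy usage: a three-step trajectory
print(soft_return(rewards=[1.0, 0.5, 0.0], log_probs=[-1.2, -0.8, -0.5]))
```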
[Fig. 1. Overview of the C-HAC framework: exploratory transitions are stored in a novice buffer and intervention data in a human buffer; a reward-free human-guided policy and a reward-required self-learning policy are combined by the shared control mechanism, with confidence evaluation and a confidence-based intervention function selecting which policy acts in the environment.]
[…] where G(s) = ∫_{a′∈A} I(s, a′) π^g(a′ | s) da′ is the probability of human intervention.

B. Distributional Proxy Value Propagation

HAC, while effective in guiding agents to exhibit human-like behaviors, is heavily reliant on human intervention. To alleviate this burden on human operators and enhance […]

[…] From (3), it follows that Q^π(s, a) = E[Z^π(s, a)]. To represent the distribution of the random variable Z^π(s, a), let Z^π(Z^π(s, a) | s, a) denote the mapping from (s, a) to a probability distribution over Z^π(s, a). Consequently, the distributional version of the soft Bellman operator in (4) becomes

    T^π_D Z(s, a) =_D r + γ (Z(s′, a′) − α log π(a′ | s′)),   (8)
where s′ ∼ p, a′ ∼ π, and A =_D B indicates that two random variables A and B share identical probability laws. For further usage, a reward-free distributional soft Bellman operator is defined as

    T̂^π_D Z(s, a) =_D γ (Z(s′, a′) − α log π(a′ | s′)),   (9)

and the return distribution is updated via

    Z_new = arg min_Z E_{(s,a) ∼ ρ_π} [ D_KL(T^π_D Z_old(· | s, a), Z(· | s, a)) ],   (10)

where D_KL is the Kullback-Leibler (KL) divergence.

D-PVP relies on two value-distribution networks and one stochastic policy, parameterized by Z^g_θ(· | s, a), Z^c_ζ(· | s, a), and π^g_ϕ(· | s). The parameters θ, ζ, and ϕ govern these networks. Specifically, the distribution Z^g_θ is designed for proxy value propagation, while Z^c_ζ supports policy confidence evaluation. The policy π^g_ϕ is guided by human actions. All three networks are modeled as diagonal Gaussians, outputting a mean and a standard deviation.

Fig. 2 illustrates D-PVP. During training, a human subject supervises the agent-environment interactions. The exploratory transitions generated by the agent are stored in the novice buffer B_g = {(s, a_g, s′, r)}. At any time, the human subject can intervene in the agent's free exploration by taking control using the device. During human involvement, both the human and novice actions are recorded into the human buffer B_h = {(s, a_g, a_h, s′, r)}. Meanwhile, the novice policy π^g_ϕ(· | s) is updated following the D-PVP procedure.

As illustrated in Fig. 2(b), to emulate human behavior and minimize intervention, D-PVP samples data (s, a_g, a_h) from the human buffer and labels the value distribution of the human action a_h with δ_1(·) and that of the novice action a_g with δ_{−1}(·), where δ_1(·) and δ_{−1}(·) denote the Dirac delta distributions centered at 1 and −1. This labeling fits Z^g_θ(· | s, a) via the following proxy value (PV) loss:

    J^{PV}_{Z^g}(θ) = J^H_{Z^g}(θ) + J^N_{Z^g}(θ) I(s, a_g),   (11)

where

    J^H_{Z^g}(θ) = E_{(s, a_h, a_g) ∼ B_h} [ D_KL(δ_1(·), Z^g_θ(· | s, a_h)) ],   (12)

    J^N_{Z^g}(θ) = E_{(s, a_h, a_g) ∼ B_h} [ D_KL(δ_{−1}(·), Z^g_θ(· | s, a_g)) ].   (13)

Since Z^g_θ(· | s, a) is Gaussian, it is represented as N(Q_θ(s, a), σ_θ(s, a)^2), where Q_θ(s, a) and σ_θ(s, a) are the mean and standard deviation of the return distribution. The update gradients of J^H_{Z^g}(θ) and J^N_{Z^g}(θ) are:

    ∇_θ J^H_{Z^g}(θ) = E[ ∇_θ (1 − Q_θ(s, a_h))^2 / (2σ_θ(s, a_h)^2) + η ∇_θ σ_θ(s, a_h) / σ_θ(s, a_h) ]
                     = E[ −((1 − Q_θ(s, a_h)) / σ_θ(s, a_h)^2) ∇_θ Q_θ(s, a_h)
                          − η (((1 − Q_θ(s, a_h))^2 − σ_θ(s, a_h)^2) / σ_θ(s, a_h)^3) ∇_θ σ_θ(s, a_h) ],   (14)

and

    ∇_θ J^N_{Z^g}(θ) = E[ ∇_θ (−1 − Q_θ(s, a_g))^2 / (2σ_θ(s, a_g)^2) + η ∇_θ σ_θ(s, a_g) / σ_θ(s, a_g) ]
                     = E[ ((1 + Q_θ(s, a_g)) / σ_θ(s, a_g)^2) ∇_θ Q_θ(s, a_g)
                          − η (((1 + Q_θ(s, a_g))^2 − σ_θ(s, a_g)^2) / σ_θ(s, a_g)^3) ∇_θ σ_θ(s, a_g) ],   (15)

where η modulates the variance convergence rate.

The transitions stored in the novice buffer, though devoid of direct human intervention, still encapsulate valuable insights regarding forward dynamics and human preferences. Rather than discarding these data, D-PVP propagates proxy values to these states through a reward-free TD update. As depicted in Fig. 2(b), the reward-free TD loss admits

    J^{TD}_{Z^g}(θ) = E_{(s,a) ∼ B} [ D_KL(T̂^{π^g_ϕ̄}_D Z^g_θ̄(· | s, a), Z^g_θ(· | s, a)) ],   (16)

where θ̄ and ϕ̄ are the target-network parameters, and B refers to B_g ∪ B_h. Since T̂^{π^g_ϕ̄}_D Z^g_θ̄(· | s, a) is unknown, a sample-based version of (16) is applied:

    J^{TD}_{Z^g}(θ) = −E_{(s,a,s′) ∼ B, a′ ∼ π^g_ϕ̄, Z^g(s′,a′) ∼ Z^g_θ̄(·|s′,a′)} [ log P(ŷ_z | Z^g_θ(· | s, a)) ],   (17)

with the reward-free target value

    ŷ_z = γ (Z^g(s′, a′) − α log π^g_ϕ̄(a′ | s′)).   (18)

The corresponding update gradient is

    ∇_θ J^{TD}_{Z^g}(θ) = E[ ∇_θ (ŷ_z − Q_θ(s, a))^2 / (2σ_θ(s, a)^2) + η ∇_θ σ_θ(s, a) / σ_θ(s, a) ]
                        = E[ −((ŷ_z − Q_θ(s, a)) / σ_θ(s, a)^2) ∇_θ Q_θ(s, a)
                             − η (((ŷ_z − Q_θ(s, a))^2 − σ_θ(s, a)^2) / σ_θ(s, a)^3) ∇_θ σ_θ(s, a) ].   (19)

Hence, the final value loss for Z^g_θ(· | s, a) integrates both the PV loss and the TD loss:

    J_{Z^g}(θ) = J^{PV}_{Z^g}(θ) + J^{TD}_{Z^g}(θ).   (20)

For policy improvement, the actor π^g_ϕ(· | s) is optimized by maximizing the return distribution

    J_{π^g}(ϕ) = E_{s ∼ B, a ∼ π^g_ϕ} [ E_{Z^g(s,a) ∼ Z^g_θ(·|s,a)}[Z^g(s, a)] − α log π^g_ϕ(a | s) ]
               = E_{s ∼ B, a ∼ π^g_ϕ} [ Q_θ(s, a) − α log π^g_ϕ(a | s) ],   (21)

while the temperature α is updated to balance exploration and exploitation:

    α ← α − E_{s ∼ B, a ∼ π^g_ϕ} [ −log π^g_ϕ(a | s) − H ].   (22)
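Because the return distributions are diagonal Gaussians, the sample-based objectives behind (12), (13), and (17) amount (up to constants) to Gaussian negative log-likelihoods against the proxy labels ±1 and the reward-free target ŷ_z. The PyTorch-style sketch below is a minimal illustration of these value losses under that parameterization; the tensor arguments, function names, and hyperparameter values are assumptions for illustration, not the paper's code. For η = 1 the autograd gradients of these losses coincide with (14), (15), and (19); the paper's general η rescales the standard-deviation part of the gradient.

```python
import torch

def gaussian_nll(target, q, sigma):
    # negative log-likelihood of `target` under N(q, sigma^2), up to a constant
    return (target - q) ** 2 / (2.0 * sigma ** 2) + torch.log(sigma)

def proxy_value_loss(q_h, sigma_h, q_g, sigma_g):
    # Eqs. (11)-(13): on human-buffer data, pull Z^g(s, a_h) toward the Dirac
    # label +1 and Z^g(s, a_g) toward -1
    return (gaussian_nll(1.0, q_h, sigma_h) + gaussian_nll(-1.0, q_g, sigma_g)).mean()

def reward_free_td_loss(q, sigma, z_next, logp_next, alpha=0.05, gamma=0.99):
    # Eqs. (17)-(18): reward-free target y_z = gamma * (Z^g(s', a') - alpha * log pi(a'|s')),
    # computed from target-network outputs and therefore detached from the graph
    with torch.no_grad():
        y_z = gamma * (z_next - alpha * logp_next)
    return gaussian_nll(y_z, q, sigma).mean()

def dpvp_value_loss(q_h, sigma_h, q_g, sigma_g, q, sigma, z_next, logp_next):
    # Eq. (20): total critic loss J_{Z^g} = PV loss + reward-free TD loss
    return (proxy_value_loss(q_h, sigma_h, q_g, sigma_g)
            + reward_free_td_loss(q, sigma, z_next, logp_next))
```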
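The actor and temperature updates in (21) and (22) follow the familiar soft actor-critic pattern applied to the mean of Z^g_θ. The sketch below is one possible rendering under stated assumptions: `actor`, `critic`, and `target_entropy` are hypothetical stand-ins for π^g_ϕ, Z^g_θ, and the entropy target H, not the paper's implementation.

```python
import torch

def actor_and_alpha_losses(actor, critic, states, log_alpha, target_entropy):
    # reparameterized sample a ~ pi^g_phi(.|s) together with its log-probability
    actions, log_probs = actor.sample(states)
    # mean of Z^g_theta(.|s,a), i.e. Q_theta(s,a); the std head is unused here
    q_mean, _sigma = critic(states, actions)
    alpha = log_alpha.exp()
    # Eq. (21): maximize E[Q_theta(s,a) - alpha * log pi(a|s)]  <=>  minimize its negative
    actor_loss = (alpha.detach() * log_probs - q_mean).mean()
    # Eq. (22): move alpha so the policy entropy tracks the target entropy H
    alpha_loss = -(alpha * (log_probs + target_entropy).detach()).mean()
    return actor_loss, alpha_loss
```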
The expected cumulative probability of failure V_{π^b} of the behavior policy π^b in D-PVP can be upper-bounded by:

    V_{π^b} ≤ (1 / (1 − γ)) ϵ + κ (γ ϵ^2 / (1 − γ)) + K′,   (23)

where ϵ < 1 and κ < 1 are small constants representing, respectively, the probability that the human expert provides an unsafe action and the probability that the expert fails to intervene when the agent takes an unsafe action. The term K′ ≥ 0 captures the human expert's tolerance, corresponding to the measure of the action set in which the expert can intervene. For further details and the derivation of this upper bound, please refer to [45].

C. Shared Control Mechanism with Policy Confidence Evaluation

Exclusive reliance on human guidance can lead to suboptimal policy outcomes. Therefore, it is essential to enable agents to explore independently and continuously improve their performance beyond human guidance. This can be achieved by incorporating a reward signal into the learning process, allowing the agent to learn from the environment. However, directly adding the reward signal throughout the entire training can destabilize the learning process due to discrepancies between proxy values and reward signals. Alternatively, introducing the reward signal only after the agent has learned a basic driving policy from human demonstrations may result in significant performance degradation, because the native reward function might not align with human preferences, hindering the retention of previously learned human policies.

To address these challenges, our C-HAC framework puts forward a shared control mechanism that combines the learned human-guided policy with a self-learning policy. The shared control mechanism is defined as:

    π^b(a | s) = π^r_ψ(a | s) (1 − T_c(s)) + π^g_ϕ(a | s) T_c(s).   (24)

Here, T_c denotes the confidence-based intervention function. The policy π^g_ϕ(a | s) represents the human-guided policy learned from human demonstrations through D-PVP, while π^r_ψ(a | s) is the self-learning policy derived from the human-guided policy but optimized to maximize cumulative rewards. The self-learning policy is improved by maximizing the return distribution:

    J_{π^r}(ψ) = E_{s ∼ B, a ∼ π^r_ψ} [ E_{Z^c(s,a) ∼ Z^c_ζ(·|s,a)}[Z^c(s, a)] − α log π^r_ψ(a | s) ]
               = E_{s ∼ B, a ∼ π^r_ψ} [ Q_ζ(s, a) − α log π^r_ψ(a | s) ].   (25)

The distribution Z^c_ζ(· | s, a) is updated using the sample-based TD loss:

    J^{TD}_{Z^c}(ζ) = −E_{(s,a,r,s′) ∼ B, a′ ∼ π^r_ψ̄, Z^c(s′,a′) ∼ Z^c_ζ̄(·|s′,a′)} [ log P(y^r_z | Z^c_ζ(· | s, a)) ],   (26)

where the target value y^r_z is defined as:

    y^r_z = r + γ (Z^c(s′, a′) − α log π^r_ψ̄(a′ | s′)).   (27)

This mechanism enables the agent to exploit potentially superior reward-based strategies while retaining the essential safety and efficiency derived from human demonstrations. A critical component of this mechanism is the policy confidence evaluation, which determines which policy to follow at each step. The confidence evaluation utilizes the return distribution Z^c_ζ(s, a). By updating with the value target in (27), Z^c_ζ(s, a) estimates the distribution of cumulative returns given states and actions. Within the shared control mechanism, for each state-action pair (s, a_g) associated with the human-guided policy and (s, a_r) associated with the self-learning policy, Z^c_ζ(s, a) outputs the distributions of cumulative rewards as:

    Z^c_ζ(s, a_r) → N(Q_ζ(s, a_r), σ_ζ(s, a_r)^2),
    Z^c_ζ(s, a_g) → N(Q_ζ(s, a_g), σ_ζ(s, a_g)^2).   (28)

Here, Q_ζ(s, a_r) and Q_ζ(s, a_g) represent the expected cumulative returns for the reward-guided and human-guided policies, respectively. The standard deviations σ_ζ(s, a_r) and σ_ζ(s, a_g) indicate the uncertainties in the cumulative reward estimates for each policy. For simplicity, Q_ζ(s, a_r) and Q_ζ(s, a_g) are denoted as Q^r_ζ and Q^g_ζ, and σ_ζ(s, a_r) and σ_ζ(s, a_g) as σ^r_ζ and σ^g_ζ.

Using these outputs, the confidence that the self-learning policy outperforms the human-guided policy in state s is calculated as:

    P(Q^r_ζ > Q^g_ζ) = 1 − Φ( (Q^g_ζ − Q^r_ζ) / √((σ^r_ζ)^2 + (σ^g_ζ)^2) ),   (29)

where Φ(·) is the cumulative distribution function (CDF) of the standard normal distribution.

Based on the confidence evaluation, the confidence-based intervention function T_c determines which policy to follow in a given state s:

    T_c(s) = 1 if P(Q^r_ζ > Q^g_ζ) ≤ 1 − δ, and 0 otherwise.   (30)

If the confidence probability exceeds the threshold 1 − δ, the reward-guided policy π^r_ψ is selected; otherwise, the human-guided policy π^g_ϕ is used.

Assumption 1 (Bounded variance of return distributions). To simplify the analysis, the variances of the return distributions for both the human-guided policy π^g and the reward-guided policy π^r are bounded by a constant σ_max; specifically, σ_g^2 ≤ σ_max^2 and σ_r^2 ≤ σ_max^2.

Theorem 1. With the confidence-based intervention function T_c(s) defined in (30), the return of the behavior policy π^b is guaranteed to be no worse than:

    J(π^b) ≥ J(π^g) − (1 − β) · (√2 · σ_max · Φ^{−1}(δ)) / (1 − γ),   (31)
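For illustration, the sketch below evaluates (29) and (30) with the standard-normal CDF built from math.erf; the function names, the threshold value, and the toy numbers are assumptions, not the paper's implementation. Per (24), T_c(s) = 1 routes control to the human-guided policy π^g_ϕ, and T_c(s) = 0 to the self-learning policy π^r_ψ.

```python
import math

def self_learning_confidence(q_r, sigma_r, q_g, sigma_g):
    # Eq. (29): P(Q^r > Q^g) = 1 - Phi((Q^g - Q^r) / sqrt(sigma_r^2 + sigma_g^2))
    std_cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return 1.0 - std_cdf((q_g - q_r) / math.sqrt(sigma_r ** 2 + sigma_g ** 2))

def confidence_based_intervention(q_r, sigma_r, q_g, sigma_g, delta=0.05):
    # Eq. (30): keep the human-guided policy (T_c = 1) unless the self-learning
    # policy is better with probability above the 1 - delta threshold
    return 1 if self_learning_confidence(q_r, sigma_r, q_g, sigma_g) <= 1.0 - delta else 0

# toy check: a small advantage under large uncertainty keeps the human-guided policy
print(confidence_based_intervention(q_r=10.2, sigma_r=3.0, q_g=10.0, sigma_g=3.0))  # -> 1
```

In the toy call above, the estimated advantage of the self-learning policy is small relative to the pooled uncertainty, so the confidence stays below 1 − δ and control remains with the human-guided policy.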
Here, σ^c_mean < ϑ_c indicates that the variance of the policy […]