

Confidence-Guided Human-AI Collaboration: Reinforcement Learning with Distributional Proxy Value Propagation for Autonomous Driving
Zeqiao Li, Yijing Wang, Haoyu Wang, Zheng Li, Peng Li, Zhiqiang Zuo, Senior Member, IEEE, Chuan Hu

arXiv:2506.03568v2 [cs.RO] 5 Jun 2025

This work was supported in part by the National Natural Science Foundation of China under Grant 62173243, Grant 61933014, Grant 62403348, and the Young Scientists Fund of the National Natural Science Foundation of Tianjin, China, under Grant 23JCQNJC01780, and the Postdoctoral Fellowship Program of CPSF under Grant GZC20241208, Grant 2024M762357, and the Foundation of Key Laboratory of System Control and Information Processing, Ministry of Education, P.R. China, under Grant Scip20240116.
The authors are with the Tianjin Key Laboratory of Intelligent Unmanned Swarm Technology and System, School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China; Haoyu Wang is also with the Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240 (email: [email protected]; [email protected]; [email protected]; [email protected]; lipeng [email protected]; [email protected]). Chuan Hu is with the School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]).

Abstract—Autonomous driving promises significant advancements in mobility, road safety and traffic efficiency, yet reinforcement learning and imitation learning face safe-exploration and distribution-shift challenges. Although human-AI collaboration alleviates these issues, it often relies heavily on extensive human intervention, which increases costs and reduces efficiency. This paper develops a confidence-guided human-AI collaboration (C-HAC) strategy to overcome these limitations. First, C-HAC employs a distributional proxy value propagation method within the distributional soft actor-critic (DSAC) framework. By leveraging return distributions to represent human intentions, C-HAC achieves rapid and stable learning of human-guided policies with minimal human interaction. Subsequently, a shared control mechanism is activated to integrate the learned human-guided policy with a self-learning policy that maximizes cumulative rewards. This enables the agent to explore independently and continuously enhance its performance beyond human guidance. Finally, a policy confidence evaluation algorithm capitalizes on DSAC's return distribution networks to facilitate dynamic switching between human-guided and self-learning policies via a confidence-based intervention function. This ensures the agent can pursue optimal policies while maintaining safety and performance guarantees. Extensive experiments across diverse driving scenarios reveal that C-HAC significantly outperforms conventional methods in terms of safety, efficiency, and overall performance, achieving state-of-the-art results. The effectiveness of the proposed method is further validated through real-world road tests in complex traffic conditions. The videos and code are available at: https://github.com/lzqw/C-HAC.

Index Terms—Autonomous driving, Human-AI collaboration, Deep reinforcement learning.

I. INTRODUCTION

AUTONOMOUS driving technology is revolutionizing transportation by offering enhanced mobility, improved safety, increased traffic efficiency, and reduced environmental impacts. These advancements have garnered significant attention from both academia and industry [1]–[4]. Recent advances in artificial intelligence (AI) have propelled learning-based policies to the forefront of autonomous driving research, providing an alternative to traditional modular approaches that segment the driving system into distinct components such as perception, localization, planning, and control [5], [6]. Learning-based policies, particularly end-to-end approaches, have demonstrated potential in enhancing the perception and decision-making capabilities of AI systems in autonomous vehicles (AVs) [7]. To further enhance learning-based approaches in autonomous driving, reinforcement learning (RL), imitation learning (IL) and human-AI collaboration (HAC) learning have been extensively explored.

RL is a powerful paradigm for training autonomous systems through trial-and-error interactions with the environment. It enables agents to establish causal relationships among observations, actions, and outcomes [8]. A reward function guides the learning process and encodes desired behaviors, allowing extensive exploration of the environment to maximize cumulative rewards. This approach often yields robust policies capable of handling complex tasks [9]–[11]. Nevertheless, several challenges restrict the applicability of RL in safety-critical scenarios. Designing a reward function that aligns with diverse human preferences is nontrivial, and flawed reward functions may lead to biased, misguided, or undesirable behaviors [12]–[14]. Moreover, the exploratory nature of RL frequently induces unsafe situations during both training and testing. The low sample efficiency in agent-environment interactions further exacerbates computational costs and prolongs training [15], [16]. In addition, although policies trained exclusively with RL can be technically safe, they often lack the natural, human-like behaviors critical for coordinating with other vehicles in real-world driving scenarios [17]. Therefore, there is a pressing need for approaches that mitigate these limitations—particularly unsafe exploration, reward design pitfalls, and the lack of human-like behavior—to fully realize RL's potential in autonomous driving.

IL is a promising approach for learning driving policies by replicating human driving behavior through demonstrated actions. Prominent methods include behavior cloning (BC) [18] and inverse reinforcement learning (IRL) [19]. Leveraging expert demonstrations allows novice agents to avoid hazardous interactions during training, enhancing safety. Techniques such as BC and offline RL train agents on pre-existing datasets

without environment interaction, further reducing risk [20]–[25], while IRL infers a reward function from demonstrations to guide desirable behaviors [26]. Despite these strengths, IL faces significant challenges. Distributional shift can cause compounding errors, pushing the agent away from its training distribution and resulting in control failures [27], [28]. Moreover, IL's success heavily depends on the quality of demonstrations, leaving it vulnerable to task variations [29]. Rare or atypical scenarios, often underrepresented in training data, can lead to erratic policy responses, especially in complex environments where expert data may be scarce or suboptimal [30]. Consequently, IL-based methods demand solutions that mitigate distributional shift, cope with suboptimal or limited demonstrations, and adapt to diverse driving conditions.

Human-AI collaboration (HAC) leverages human expertise to enhance safety and efficiency. Methods such as DAgger [31] and its extensions [32]–[34] periodically request expert demonstrations to correct compounding errors in IL, while expert intervention learning (EIL) [35] and intervention weighted regression (IWR) [36] rely on human operators to intervene during exploration for safer state transitions. Other approaches integrate human evaluative feedback [37]–[39] or partial demonstrations with limited interventions, as in HACO [40], to guide the learning process while reducing human effort. Proxy value propagation (PVP) [41] uses active human input to learn a proxy value function encoding human intentions, demonstrating robust performance across diverse tasks and action spaces. Despite these advances, many HAC methods assume near-optimal human guidance, which can be prohibitively expensive or suboptimal in practice [42]. Humans may adopt conservative strategies in ambiguous scenarios, such as decelerating behind a slower vehicle for safety, even when a more effective approach (e.g., overtaking) might exist. Furthermore, continuous oversight places a heavy burden on human operators, hindering large-scale deployment [19], while insufficient utilization of autonomous exploration data constrains overall learning efficiency [43]. Balancing human involvement, agent safety, and learning efficiency thus remains a central challenge in HAC research.

To reduce reliance on human guidance while ensuring efficient and safe policy learning, we propose a confidence-guided human-AI collaboration (C-HAC) strategy in this paper. It operates in two stages: a human-guided learning phase and a subsequent RL enhancement phase. In the former, distributional proxy value propagation (D-PVP) encodes human intentions in the distributional soft actor-critic (DSAC) algorithm. In the latter, the agent refines its policy independently. A shared control mechanism combines the learned human-guided policy with a self-learning policy, enabling exploration beyond human demonstrations. Additionally, a policy confidence evaluation algorithm leverages DSAC's return distributions to switch between human-guided and self-learning policies, ensuring safety and reliable performance. Extensive experiments show that C-HAC quickly acquires robust driving skills with minimal human input and continues improving without ongoing human intervention, outperforming conventional RL, IL, and HAC methods in safety, efficiency, and overall results. The main contributions of this paper are as follows:

• Distributional proxy value propagation (D-PVP): A novel method integrating PVP into the DSAC framework is suggested. D-PVP enables the agent to learn effective driving policies with relatively few human-guided interactions, achieving competent performance within a short training duration.
• Shared control mechanism: A mechanism combining the learned human-guided policy with a self-learning policy is proposed. The self-learning policy is designed to maximize cumulative rewards, allowing the agent to explore independently and continuously improve its performance beyond human guidance.
• Policy confidence evaluation algorithm: An algorithm leveraging DSAC's return distribution networks to facilitate dynamic switching between human-guided and self-learning policies via an intervention function is developed. This ensures the agent can pursue optimal policies while maintaining guaranteed safety and performance levels.

II. METHODOLOGY

This section presents the proposed C-HAC framework, which integrates D-PVP, a shared control mechanism and policy confidence evaluation. As illustrated in Fig. 1, the framework consists of two phases. Initially, the agent employs D-PVP to learn from human demonstrations, allowing it to acquire safe and efficient driving policies through expert interventions. Subsequently, the agent continues to refine its policy via reinforcement learning, leveraging the shared control mechanism and policy confidence evaluation for safe and reliable exploration. To provide a comprehensive understanding, we will first offer a brief overview of conventional RL and HAC. We will then delve into detailed explanations of D-PVP, the shared control mechanism combined with policy confidence evaluation, and the entire training process.

A. Human-AI Collaboration

The policy learning of end-to-end autonomous driving is essentially a continuous action space problem in RL, which can be formulated as a Markov decision process (MDP). An MDP is defined by the tuple M = ⟨S, A, P, R, γ⟩, where S is the state space, A is the action space, P is the transition probability, R is the reward function, and γ is the discount factor. The goal in standard RL is to learn a policy π : S → A that maximizes the expected cumulative reward R_t = Σ_{t=0}^∞ γ^t r_t, where r_t is the reward at time t. In this study, we employ an entropy-augmented objective function [44], incorporating policy entropy into the reward term:

J_\pi = \mathbb{E}_{(s_{i \ge t}, a_{i \ge t}) \sim \rho_\pi}\left[ \sum_{i=t}^{\infty} \gamma^{i-t} \big( r_i + \alpha \mathcal{H}(\pi(\cdot \mid s_i)) \big) \right],    (1)

where α is the temperature coefficient, and the policy entropy H has the form

\mathcal{H}(\pi(\cdot \mid s)) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ -\log \pi(a \mid s) \big].    (2)
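To make the entropy term in (1) and (2) concrete, here is a minimal Python sketch (assuming PyTorch; the diagonal Gaussian policy head and the helper itself are illustrative assumptions, not the paper's code) that estimates H(pi(.|s)) by Monte Carlo, i.e. the sample mean of -log pi(a|s):

import torch
from torch.distributions import Normal, Independent

def policy_entropy_mc(mean: torch.Tensor, std: torch.Tensor, n_samples: int = 64) -> torch.Tensor:
    """Monte-Carlo estimate of H(pi(.|s)) = E_{a~pi}[-log pi(a|s)] for a
    diagonal Gaussian policy with parameters (mean, std) of shape [batch, act_dim]."""
    dist = Independent(Normal(mean, std), 1)   # joint density over action dimensions
    actions = dist.rsample((n_samples,))       # [n_samples, batch, act_dim]
    log_probs = dist.log_prob(actions)         # [n_samples, batch]
    return (-log_probs).mean(dim=0)            # per-state entropy estimate

The entropy-augmented objective in (1) then simply adds alpha times this estimate to the reward at every step.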

Fig. 1. Overall framework of C-HAC. [Figure: a learn-from-demonstration phase based on distributional proxy value propagation (human buffer, novice buffer, reward-free updates) followed by an RL continuous enhancement phase with the shared control mechanism, policy confidence evaluation, and the confidence-based intervention function.]

The soft Q value is given by

Q^\pi(s_t, a_t) = r_t + \gamma\, \mathbb{E}_{(s_{i>t}, a_{i>t}) \sim \rho_\pi}\left[ \sum_{i=t+1}^{\infty} \gamma^{i-t-1} \big( r_i - \alpha \log \pi(a_i \mid s_i) \big) \right],    (3)

which delineates the expected soft return for choosing a_t at state s_t under policy π. The soft Q value can be updated using the soft Bellman operator T^π:

\mathcal{T}^\pi Q^\pi(s, a) = r + \gamma\, \mathbb{E}_{s' \sim p,\, a' \sim \pi}\big[ Q^\pi(s', a') - \alpha \log \pi(a' \mid s') \big].    (4)

Meanwhile, the policy π is updated by maximizing the entropy-augmented objective (1), that is,

\pi_{\mathrm{new}} = \arg\max_\pi J_\pi = \arg\max_\pi\, \mathbb{E}_{s \sim \rho_\pi,\, a \sim \pi}\big[ Q^{\pi_{\mathrm{old}}}(s, a) - \alpha \log \pi(a \mid s) \big].    (5)

HAC extends standard RL by integrating a human expert into the learning process. During training, the human expert monitors the agent-environment interactions and can intervene in the agent's exploration by providing guidance. Specifically, if the agent encounters a risky situation or makes a suboptimal decision, the human expert can overwrite the agent action a_g with their own action a_h. The action applied to the environment can thus be expressed as â = I(s, a_g) a_h + (1 − I(s, a_g)) a_g, where I(s, a_g) is a Boolean intervention indicator. Thus, with the human policy denoted by π^h, the shared policy π^b, which generates the actual trajectory, is defined as

\pi^b(a \mid s) = \pi^g(a \mid s)\,(1 - I(s, a)) + \pi^h(a \mid s)\, G(s),    (6)

where G(s) = \int_{a' \in \mathcal{A}} I(s, a')\, \pi^g(a' \mid s)\, \mathrm{d}a' is the probability of human intervention.

B. Distributional Proxy Value Propagation

HAC, which is effective in guiding agents to exhibit human-like behaviors, is heavily reliant on human intervention. To alleviate this burden on human operators and enhance training efficiency, D-PVP is introduced. D-PVP builds upon PVP, which manipulates the Q value to promote desired behaviors. Specifically, PVP assigns a value of +1 to human actions and −1 to novice actions, thereby encouraging agents to learn human-aligned strategies. The original PVP, as presented in [41], is based on the twin delayed deep deterministic policy gradient (TD3) algorithm, which employs a deterministic policy gradient. To enable a more general and efficient learning process, and to provide an estimate of policy confidence, PVP is extended to the DSAC framework. In this extended version, the distribution of soft state-action returns in DSAC is leveraged to induce the desired behaviors.

Fig. 2. Illustration of Distributional Proxy Value Propagation. (a) Human oversees the agent-environment interactions. (b) Label value distribution through buffers.

Firstly, we define the soft state-action return as

Z^\pi(s_t, a_t) := r_t + \gamma \sum_{i=t+1}^{\infty} \gamma^{i-t-1} \big( r_i - \alpha \log \pi(a_i \mid s_i) \big).    (7)

From (3), it follows that Q^\pi(s, a) = \mathbb{E}[Z^\pi(s, a)]. To represent the distribution of the random variable Z^\pi(s, a), let \mathcal{Z}^\pi(Z^\pi(s, a) \mid s, a) denote the mapping from (s, a) to a probability distribution over Z^\pi(s, a). Consequently, the distributional version of the soft Bellman operator in (4) becomes

\mathcal{T}_D^\pi Z(s, a) \overset{D}{=} r + \gamma \big( Z(s', a') - \alpha \log \pi(a' \mid s') \big),    (8)

where s' \sim p, a' \sim \pi, and A \overset{D}{=} B indicates that two random variables A and B share identical probability laws.
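The takeover rule behind (6), together with the buffer bookkeeping used by D-PVP in this section, can be sketched in a few lines of Python. Everything here is illustrative: the gym-style env, the human_interface object reporting the intervention indicator I(s, a_g) and the human action a_h, and the policy and buffer methods are assumptions, not the paper's implementation.

def rollout_step(env, state, novice_policy, human_interface, novice_buffer, human_buffer):
    """One environment step under the shared policy of Section II-A."""
    a_g = novice_policy.sample(state)            # agent (novice) proposal
    intervened, a_h = human_interface.read()     # I(s, a_g) and the human action
    action = a_h if intervened else a_g          # a_hat = I * a_h + (1 - I) * a_g
    next_state, reward, done, info = env.step(action)
    if intervened:
        # Human-involved transitions keep both actions for the PV loss (11)-(13).
        human_buffer.add((state, a_g, a_h, next_state, reward))
    else:
        # Free-exploration transitions only carry the agent action.
        novice_buffer.add((state, a_g, next_state, reward))
    return next_state, done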

For further usage, a reward-free distributional soft Bellman operator is defined as

\hat{\mathcal{T}}_D^\pi Z(s, a) \overset{D}{=} \gamma \big( Z(s', a') - \alpha \log \pi(a' \mid s') \big),    (9)

and the return distribution is updated via

Z_{\mathrm{new}} = \arg\min_Z\, \mathbb{E}_{(s,a) \sim \rho_\pi}\big[ D_{\mathrm{KL}}\big( \mathcal{T}_D^\pi Z_{\mathrm{old}}(\cdot \mid s, a),\, Z(\cdot \mid s, a) \big) \big],    (10)

where D_KL is the Kullback-Leibler (KL) divergence.

D-PVP relies on two value-distribution networks and one stochastic policy, parameterized by Z_θ^g(· | s, a), Z_ζ^c(· | s, a) and π_φ^g(· | s). The parameters θ, ζ and φ govern these networks. Specifically, the distribution Z_θ^g is designed for proxy value propagation, while Z_ζ^c supports policy confidence evaluation. The policy π_φ^g is guided by human actions. All three networks are modeled as diagonal Gaussians, outputting a mean and a standard deviation.

Fig. 2 illustrates the D-PVP. During training, a human subject supervises the agent-environment interactions. The exploratory transitions generated by the agent are stored in the novice buffer B_g = {(s, a_g, s', r)}. At any time, the human subject can intervene in the agent's free exploration by taking control using the driving device. During human involvement, both the human and novice actions are recorded into the human buffer B_h = {(s, a_g, a_h, s', r)}. Meanwhile, the novice policy π_φ^g(· | s) is updated following the D-PVP procedure.

As illustrated in Fig. 2(b), to emulate human behavior and minimize intervention, D-PVP samples data (s, a_g, a_h) from the human buffer and labels the value distribution of the human action a_h with δ_1(·) and that of the novice action a_g with δ_{−1}(·), where δ_1(·) and δ_{−1}(·) denote the Dirac delta distributions centered at 1 and −1. This labeling fits Z_θ^g(· | s, a) via the following proxy value (PV) loss:

J^{PV}_{Z^g}(\theta) = J^{H}_{Z^g}(\theta) + J^{N}_{Z^g}(\theta)\, I(s, a_g),    (11)

where

J^{H}_{Z^g}(\theta) = \mathbb{E}_{(s, a_h, a_g) \sim B_h}\big[ D_{\mathrm{KL}}\big( \delta_1(\cdot),\, Z^g_\theta(\cdot \mid s, a_h) \big) \big],    (12)

J^{N}_{Z^g}(\theta) = \mathbb{E}_{(s, a_h, a_g) \sim B_h}\big[ D_{\mathrm{KL}}\big( \delta_{-1}(\cdot),\, Z^g_\theta(\cdot \mid s, a_g) \big) \big].    (13)

Since Z_θ^g(· | s, a) is Gaussian, it is represented as N(Q_θ(s, a), σ_θ(s, a)^2), where Q_θ(s, a) and σ_θ(s, a) are the mean and standard deviation of the return distribution. The update gradients of J^H_{Z^g}(θ) and J^N_{Z^g}(θ) are

\nabla_\theta J^{H}_{Z^g}(\theta) = \mathbb{E}\left[ \nabla_\theta \frac{(1 - Q_\theta(s, a_h))^2}{2\sigma_\theta(s, a_h)^2} + \eta \frac{\nabla_\theta \sigma_\theta(s, a_h)}{\sigma_\theta(s, a_h)} \right] = \mathbb{E}\left[ -\frac{1 - Q_\theta(s, a_h)}{\sigma_\theta(s, a_h)^2} \nabla_\theta Q_\theta(s, a_h) - \eta \frac{(1 - Q_\theta(s, a_h))^2 - \sigma_\theta(s, a_h)^2}{\sigma_\theta(s, a_h)^3} \nabla_\theta \sigma_\theta(s, a_h) \right]    (14)

and

\nabla_\theta J^{N}_{Z^g}(\theta) = \mathbb{E}\left[ \nabla_\theta \frac{(-1 - Q_\theta(s, a_g))^2}{2\sigma_\theta(s, a_g)^2} + \eta \frac{\nabla_\theta \sigma_\theta(s, a_g)}{\sigma_\theta(s, a_g)} \right] = \mathbb{E}\left[ \frac{1 + Q_\theta(s, a_g)}{\sigma_\theta(s, a_g)^2} \nabla_\theta Q_\theta(s, a_g) - \eta \frac{(1 + Q_\theta(s, a_g))^2 - \sigma_\theta(s, a_g)^2}{\sigma_\theta(s, a_g)^3} \nabla_\theta \sigma_\theta(s, a_g) \right],    (15)

where η modulates the variance convergence rate.

The transitions stored in the novice buffer, though devoid of direct human intervention, still encapsulate valuable insights regarding forward dynamics and human preferences. Rather than discarding these data, D-PVP propagates proxy values to these states through a reward-free TD update. As depicted in Fig. 2(b), the reward-free TD loss admits

J^{TD}_{Z^g}(\theta) = \mathbb{E}_{(s,a) \sim B}\big[ D_{\mathrm{KL}}\big( \hat{\mathcal{T}}_D^{\pi^g_{\bar\phi}} Z^g_{\bar\theta}(\cdot \mid s, a),\, Z^g_\theta(\cdot \mid s, a) \big) \big],    (16)

where \bar\theta and \bar\phi are the target-network parameters, and B refers to B_g ∪ B_h. Since \hat{\mathcal{T}}_D^{\pi^g_{\bar\phi}} Z^g_{\bar\theta}(\cdot \mid s, a) is unknown, a sample-based version of (16) is applied:

J^{TD}_{Z^g}(\theta) = -\mathbb{E}_{(s,a,s') \sim B,\; a' \sim \pi^g_{\bar\phi},\; Z^g(s',a') \sim Z^g_{\bar\theta}(\cdot \mid s',a')}\big[ \log P\big( \hat{y}_z \mid Z^g_\theta(\cdot \mid s, a) \big) \big],    (17)

with the reward-free target value

\hat{y}_z = \gamma \big( Z^g(s', a') - \alpha \log \pi^g_{\bar\phi}(a' \mid s') \big).    (18)

The corresponding update gradient is

\nabla_\theta J^{TD}_{Z^g}(\theta) = \mathbb{E}\left[ \nabla_\theta \frac{(\hat{y}_z - Q_\theta(s, a))^2}{2\sigma_\theta(s, a)^2} + \eta \frac{\nabla_\theta \sigma_\theta(s, a)}{\sigma_\theta(s, a)} \right] = \mathbb{E}\left[ -\frac{\hat{y}_z - Q_\theta(s, a)}{\sigma_\theta(s, a)^2} \nabla_\theta Q_\theta(s, a) - \eta \frac{(\hat{y}_z - Q_\theta(s, a))^2 - \sigma_\theta(s, a)^2}{\sigma_\theta(s, a)^3} \nabla_\theta \sigma_\theta(s, a) \right].    (19)

Hence, the final value loss for Z_θ^g(· | s, a) integrates both the PV loss and the TD loss:

J_{Z^g}(\theta) = J^{PV}_{Z^g}(\theta) + J^{TD}_{Z^g}(\theta).    (20)

For policy improvement, the actor π_φ^g(· | s) is optimized by maximizing the return distribution

J_{\pi^g}(\phi) = \mathbb{E}_{s \sim B,\, a \sim \pi^g_\phi}\left[ \mathbb{E}_{Z^g(s,a) \sim Z^g_\theta(\cdot \mid s,a)}\big[ Z^g(s, a) \big] - \alpha \log \pi^g_\phi(a \mid s) \right] = \mathbb{E}_{s \sim B,\, a \sim \pi^g_\phi}\big[ Q_\theta(s, a) - \alpha \log \pi^g_\phi(a \mid s) \big],    (21)

while the temperature α is updated to balance exploration and exploitation:

\alpha \leftarrow \alpha - \mathbb{E}_{s \sim B,\, a \sim \pi^g_\phi}\big[ -\log \pi^g_\phi(a \mid s) - \mathcal{H} \big].    (22)
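As a concrete reading of (11)–(20), the following PyTorch-style sketch combines the proxy-value labels with the reward-free TD target. The network interfaces are assumptions for illustration: z_net(s, a) and z_target_net(s, a) return the Gaussian mean and standard deviation, and policy_target.sample_with_logprob returns an action and its log-probability. Writing the per-sample objective as a Gaussian negative log-likelihood reproduces the gradient directions in (14), (15) and (19) when η = 1.

import torch

def gaussian_nll(label, q, sigma, eta=1.0):
    """-log N(label; q, sigma^2) up to a constant; its gradients w.r.t. (q, sigma)
    match the update directions in (14)/(15)/(19) when eta = 1."""
    return (label - q) ** 2 / (2 * sigma ** 2) + eta * torch.log(sigma)

def dpvp_value_loss(z_net, z_target_net, policy_target, human_batch, mixed_batch,
                    gamma=0.99, alpha=0.2):
    # --- proxy-value (PV) loss on the human buffer: label a_h with +1, a_g with -1.
    # Within B_h every transition was intervened, so I(s, a_g) = 1 in (11).
    s, a_g, a_h = human_batch["s"], human_batch["a_g"], human_batch["a_h"]
    q_h, sig_h = z_net(s, a_h)
    q_g, sig_g = z_net(s, a_g)
    pv_loss = gaussian_nll(1.0, q_h, sig_h).mean() + gaussian_nll(-1.0, q_g, sig_g).mean()

    # --- reward-free TD loss on B = B_g U B_h with target y = gamma*(Z' - alpha*log pi'), cf. (17)-(18).
    s, a, s_next = mixed_batch["s"], mixed_batch["a"], mixed_batch["s_next"]
    with torch.no_grad():
        a_next, logp_next = policy_target.sample_with_logprob(s_next)
        q_next, sig_next = z_target_net(s_next, a_next)
        z_next = q_next + sig_next * torch.randn_like(q_next)   # sample Z^g(s', a')
        y = gamma * (z_next - alpha * logp_next)                # reward-free target (18)
    q, sig = z_net(s, a)
    td_loss = gaussian_nll(y, q, sig).mean()

    return pv_loss + td_loss                                    # total value loss (20)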

The expected cumulative probability of failure V_{π^b} of the behavior policy π^b in D-PVP can be upper-bounded by

V_{\pi^b} \le \frac{1}{1-\gamma}\left( \epsilon + \kappa\, \frac{\gamma \epsilon^2}{1-\gamma} + K' \right),    (23)

where ε < 1 and κ < 1 are small constants representing, respectively, the probability that the human expert provides an unsafe action and the probability that the expert fails to intervene when the agent takes an unsafe action. The term K' ≥ 0 captures the human expert's tolerance, corresponding to the measure of the action set in which the expert can intervene. For further details and the derivation of this upper bound, please refer to [45].

C. Shared Control Mechanism with Policy Confidence Evaluation

Exclusive reliance on human guidance can lead to suboptimal policy outcomes. Therefore, it is essential to enable agents to explore independently and continuously improve their performance beyond human guidance. This can be achieved by incorporating a reward signal into the learning process, allowing the agent to learn from the environment. However, directly adding the reward signal throughout the entire training can destabilize the learning process due to discrepancies between proxy values and reward signals. Alternatively, introducing the reward signal after the agent has learned a basic driving policy from human demonstrations may result in significant performance degradation. This occurs because the native reward function might not align with human preferences, hindering the retention of previously learned human policies.

To address these challenges, our C-HAC framework puts forward a shared control mechanism that combines the learned human-guided policy with a self-learning policy. The shared control mechanism is defined as

\pi^b(a \mid s) = \pi^r_\psi(a \mid s)\,(1 - T_c(s)) + \pi^g_\phi(a \mid s)\, T_c(s).    (24)

Here, T_c denotes the confidence-based intervention function. The policy π_φ^g(a | s) represents the human-guided policy learned from human demonstrations through D-PVP, while π_ψ^r(a | s) is the self-learning policy derived from the human-guided policy but optimized to maximize cumulative rewards. The self-learning policy is improved by maximizing the return distribution:

J_{\pi^r}(\psi) = \mathbb{E}_{s \sim B,\, a \sim \pi^r_\psi}\left[ \mathbb{E}_{Z^c(s,a) \sim Z^c_\zeta(\cdot \mid s,a)}\big[ Z^c(s, a) \big] - \alpha \log \pi^r_\psi(a \mid s) \right] = \mathbb{E}_{s \sim B,\, a \sim \pi^r_\psi}\big[ Q_\zeta(s, a) - \alpha \log \pi^r_\psi(a \mid s) \big].    (25)

The distribution Z_ζ^c(· | s, a) is updated using the sample-based TD loss

J^{TD}_{Z^c}(\zeta) = -\mathbb{E}_{(s,a,r,s') \sim B,\; a' \sim \pi^r_{\bar\psi},\; Z^c(s',a') \sim Z^c_{\bar\zeta}(\cdot \mid s',a')}\big[ \log P\big( y^r_z \mid Z^c_\zeta(\cdot \mid s, a) \big) \big],    (26)

where the target value y_z^r is defined as

y^r_z = r + \gamma \big( Z^c(s', a') - \alpha \log \pi^r_{\bar\psi}(a' \mid s') \big).    (27)

This mechanism enables the agent to exploit potentially superior reward-based strategies while retaining the essential safety and efficiency derived from human demonstrations. A critical component of this mechanism is the policy confidence evaluation, which determines which policy to follow at each step. The confidence evaluation utilizes the return distribution Z_ζ^c(s, a). By updating with the value target in (27), Z_ζ^c(s, a) estimates the distribution of cumulative returns given states and actions. Within the shared control mechanism, for each state-action pair (s, a_g) associated with the human-guided policy and (s, a_r) associated with the self-learning policy, Z_ζ^c(s, a) outputs the distributions of cumulative rewards as

Z^c_\zeta(s, a_r) \rightarrow \mathcal{N}\big( Q_\zeta(s, a_r),\, \sigma_\zeta(s, a_r)^2 \big), \qquad Z^c_\zeta(s, a_g) \rightarrow \mathcal{N}\big( Q_\zeta(s, a_g),\, \sigma_\zeta(s, a_g)^2 \big).    (28)

Here, Q_ζ(s, a_r) and Q_ζ(s, a_g) represent the expected cumulative returns for the reward-guided and human-guided policies, respectively. The standard deviations σ_ζ(s, a_r) and σ_ζ(s, a_g) indicate the uncertainties in the cumulative reward estimates for each policy. For simplicity, Q_ζ(s, a_r) and Q_ζ(s, a_g) are denoted as Q_ζ^r and Q_ζ^g, and σ_ζ(s, a_r) and σ_ζ(s, a_g) as σ_ζ^r and σ_ζ^g.

Using these outputs, the confidence that the self-learning policy outperforms the human-guided policy in state s is calculated as

P\big( Q^r_\zeta > Q^g_\zeta \big) = 1 - \Phi\!\left( \frac{Q^g_\zeta - Q^r_\zeta}{\sqrt{(\sigma^r_\zeta)^2 + (\sigma^g_\zeta)^2}} \right),    (29)

where Φ(·) is the cumulative distribution function (CDF) of the standard normal distribution.

Based on the confidence evaluation, the confidence-based intervention function T_c determines which policy to follow in a given state s:

T_c(s) = \begin{cases} 1, & \text{if } P(Q^r_\zeta > Q^g_\zeta) \le 1 - \delta, \\ 0, & \text{otherwise.} \end{cases}    (30)

If the confidence probability exceeds the threshold 1 − δ, the reward-guided policy π_ψ^r will be selected. Otherwise, the human-guided policy π_φ^g is used.
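The confidence evaluation in (29) and the intervention rule in (30) amount to a standard-normal tail comparison. The sketch below (plain Python; function and argument names are illustrative) uses math.erf for the standard-normal CDF and δ = 0.15, the value used later in the simulation study.

import math

def self_learning_confidence(q_r, sigma_r, q_g, sigma_g):
    """P(Q_r > Q_g) for the two Gaussians in (28), as computed in (29)."""
    z = (q_g - q_r) / math.sqrt(sigma_r ** 2 + sigma_g ** 2)
    standard_normal_cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 1.0 - standard_normal_cdf

def confidence_intervention(q_r, sigma_r, q_g, sigma_g, delta=0.15):
    """T_c(s) in (30): 1 selects the human-guided policy, 0 the self-learning policy."""
    return 1 if self_learning_confidence(q_r, sigma_r, q_g, sigma_g) <= 1.0 - delta else 0

# Example: Q_r = 12.0, sigma_r = 3.0, Q_g = 10.0, sigma_g = 2.0 gives
# P(Q_r > Q_g) ~ 0.71 < 0.85, so T_c = 1 and the human-guided policy is kept.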

Assumption 1 (Bounded variance of return distributions). To simplify the analysis, the variances of the return distributions for both the human-guided policy π^g and the reward-guided policy π^r are bounded by a constant σ_max, i.e., σ_g^2 ≤ σ_max^2 and σ_r^2 ≤ σ_max^2.

Theorem 1. With the confidence-based intervention function T_c(s) defined in (30), the return of the behavior policy π^b is guaranteed to be no worse than

J(\pi^b) \ge J(\pi^g) - (1-\beta)\, \frac{\sqrt{2}\, \sigma_{\max}\, \Phi^{-1}(\delta)}{1-\gamma},    (31)

where β is the expected intervention rate weighted by the policy discrepancy.

Proof. To prove Theorem 1, several useful lemmas are introduced.

Lemma 1. For the behavior policy π^b deduced by the human-guided policy π^g, the reward-guided policy π^r and an intervention function T_c(s), the state distribution discrepancy between π^b and π^r is bounded by (Theorem 3.2 in [46])

\| d_{\pi^b} - d_{\pi^r} \|_1 \le \frac{\beta\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d_{\pi^b}} \| \pi^g(\cdot \mid s) - \pi^r(\cdot \mid s) \|_1,    (32)

where β = \mathbb{E}_{s \sim d_{\pi^b}}\big[ T_c(s)\, \| \pi^g(\cdot \mid s) - \pi^r(\cdot \mid s) \|_1 \big] \big/ \mathbb{E}_{s \sim d_{\pi^b}} \| \pi^g(\cdot \mid s) - \pi^r(\cdot \mid s) \|_1 \in [0, 1] is the expected intervention rate weighted by the policy discrepancy.

Lemma 2. The performance difference between two policies in terms of the advantage function can be expressed as (Equation 3 in [47]):

J(\pi) = J(\pi') + \mathbb{E}_{s_t, a_t \sim \tau_\pi}\left[ \sum_{t=0}^{\infty} \gamma^t A_{\pi'}(s_t, a_t) \right].    (33)

With Lemma 2, the proof of Theorem 1 is given as follows:

J(\pi^b) - J(\pi^g) = \mathbb{E}_{s_n, a_n \sim \tau_{\pi^b}}\left[ \sum_{n=0}^{\infty} \gamma^n A^g(s_n, a_n) \right]
= \mathbb{E}_{s_n, a_n \sim \tau_{\pi^b}}\left[ \sum_{n=0}^{\infty} \gamma^n \big( Q^g(s_n, a_n) - V^g(s_n) \big) \right]
= \mathbb{E}_{s_n \sim \tau_{\pi^b}}\left[ \sum_{n=0}^{\infty} \gamma^n \big( \mathbb{E}_{a \sim \pi^b(\cdot \mid s_n)} Q^g(s_n, a) - V^g(s_n) \big) \right]
= \mathbb{E}_{s_n \sim \tau_{\pi^b}}\left[ \sum_{n=0}^{\infty} \gamma^n \big( T_c(s_n)\, \mathbb{E}_{a \sim \pi^g(\cdot \mid s_n)} Q^g(s_n, a) + (1 - T_c(s_n))\, \mathbb{E}_{a \sim \pi^r(\cdot \mid s_n)} Q^g(s_n, a) - V^g(s_n) \big) \right]
= \mathbb{E}_{s_n \sim \tau_{\pi^b}}\left[ \sum_{n=0}^{\infty} \gamma^n (1 - T_c(s_n)) \big( \mathbb{E}_{a \sim \pi^r(\cdot \mid s_n)} Q^g(s_n, a) - V^g(s_n) \big) \right]
= (1 - \beta)\, \mathbb{E}_{s_n \sim \tau_{\pi^b}}\left[ \sum_{n=0}^{\infty} \gamma^n \big( \mathbb{E}_{a \sim \pi^r(\cdot \mid s_n)} Q^g(s_n, a) - V^g(s_n) \big) \right],    (34)

where the second-to-last equality uses V^g(s_n) = \mathbb{E}_{a \sim \pi^g(\cdot \mid s_n)} Q^g(s_n, a).

Note that the choice of π^r is guided by the confidence probability P(Q^g(s, a_r) > Q^g(s, a_g)). The expected advantage of the reward-guided policy over the human-guided policy can be expressed as

\mathbb{E}_{a \sim \pi^r(\cdot \mid s_n)} Q^g(s_n, a) - V^g(s_n) = \mathbb{E}_{a \sim \pi^r(\cdot \mid s_n)} Q^g(s_n, a) - \mathbb{E}_{a \sim \pi^g(\cdot \mid s_n)} Q^g(s_n, a).    (35)

Using the confidence-based condition for selecting the human-guided policy, P(Q^g(s, a_r) > Q^g(s, a_g)) \le 1 - \delta, we have

\mathbb{E}_{a \sim \pi^r(\cdot \mid s_n)} Q^g(s_n, a) - \mathbb{E}_{a \sim \pi^g(\cdot \mid s_n)} Q^g(s_n, a) \le \sqrt{\sigma_r^2 + \sigma_g^2}\; \Phi^{-1}(\delta),    (36)

where Φ^{-1}(δ) is the inverse CDF of the standard normal distribution, and σ_r^2, σ_g^2 are the variances of the value distributions for π^r and π^g, respectively.

Substituting this bound on the Q-value difference into the performance difference yields

J(\pi^b) - J(\pi^g) \ge -(1-\beta)\, \mathbb{E}_{s_n \sim \tau_{\pi^b}}\left[ \sum_{n=0}^{\infty} \gamma^n\, \sqrt{\sigma_r^2 + \sigma_g^2}\; \Phi^{-1}(\delta) \right].    (37)

By applying the geometric series formula and the variance bound on the value distributions, the performance difference becomes

J(\pi^b) - J(\pi^g) \ge -(1-\beta)\, \frac{\sqrt{2}\, \sigma_{\max}\, \Phi^{-1}(\delta)}{1-\gamma}.

Thus, the final lower bound for the mixed policy satisfies

J(\pi^b) \ge J(\pi^g) - (1-\beta)\, \frac{\sqrt{2}\, \sigma_{\max}\, \Phi^{-1}(\delta)}{1-\gamma}.

D. Overall Training Process

During the human demonstration stage, the agent employs two value distribution networks, Z_θ^g(· | s, a) and Z_ζ^c(· | s, a), as well as the policy network π_φ^g(· | s). Specifically, in each training iteration, two equally sized batches, b_g and b_h, are sampled from B_g and B_h, respectively. The agent updates the value distribution network Z_θ^g(· | s, a) by minimizing the value loss in (20) and the policy network π_φ^g(· | s) by maximizing the objective in (21). As mentioned earlier, the update of Z_θ^g(· | s, a) is reward-free. Concurrently, the value distribution network Z_ζ^c(· | s, a) is updated by minimizing the loss

J^{TD}_{Z^c}(\zeta) = -\mathbb{E}_{(s, a_g, r, s') \sim B_g,\; a' \sim \pi^g_{\bar\phi},\; Z^c(s',a') \sim Z^c_{\bar\zeta}(\cdot \mid s',a')}\big[ \log P\big( y^g_z \mid Z^c_\zeta(\cdot \mid s, a) \big) \big],    (38)

where the target value y_z^g is defined as

y^g_z = r + \gamma \big( Z^c(s', a') - \alpha \log \pi^g_{\bar\phi}(a' \mid s') \big).    (39)

Once the agent has developed sufficient confidence in its policy evaluation capability and has learned a basic driving policy, it transitions to the RL continuous enhancement stage. The transition conditions are as follows:

\sigma^c_{\mathrm{mean}} < \vartheta_c, \qquad \log \pi^g_\phi(a \mid s) < \kappa, \qquad n_{\mathrm{steps}} > N_g.    (40)

Here, σ^c_mean < ϑ_c indicates that the variance of the policy evaluation network has dropped below a predefined threshold, signifying sufficient confidence in policy evaluation. The condition log π_φ^g(a | s) < κ implies that the agent has learned a basic driving policy. N_g is a predefined threshold that specifies the minimum number of training steps required before the agent can transition to the second stage.
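The transition test in (40) is a single predicate over three running statistics. A minimal sketch follows; the threshold argument names are illustrative, since concrete values of ϑ_c, κ and N_g are not given in the text.

def ready_for_stage_two(sigma_mean_c, mean_log_prob, n_steps,
                        var_threshold, logp_threshold, min_steps):
    """Transition conditions of (40): confident value estimates, a basic
    driving policy, and a minimum number of human-demonstration steps."""
    return (sigma_mean_c < var_threshold          # sigma^c_mean < vartheta_c
            and mean_log_prob < logp_threshold    # log pi^g_phi(a|s) < kappa
            and n_steps > min_steps)              # n_steps > N_g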


Upon entering the RL continuous enhancement stage, the self-learning policy π_ψ^r(a | s) is initialized with the weights copied from the human-guided policy π_φ^g(a | s), and the shared control mechanism in (24) is activated. During the training process, the agent samples a batch b_g from the novice buffer B_g, updates the value distribution network Z_ζ^c(· | s, a) by minimizing the TD loss in (26), and updates the policy network π_ψ^r(a | s) by maximizing the objective in (25). The temperature parameter α is updated according to (22), and the confidence-based intervention function T_c(s) is calculated using (30). The agent then alternates between the human-guided policy and the self-learning policy until the training process is complete.

Algorithm 1 gives the detailed steps of the proposed C-HAC approach.

Algorithm 1 Confidence-guided human-AI collaboration (C-HAC)
1: Initialize: π_φ^g, π_ψ^r, Z_θ^g, Z_ζ^c, B_g, B_h
2: Set: Temperature parameter α
3: Stage 1: Human Demonstration Learning
4: while Not ready to transition do
5:   Execute action a_g ∼ π_φ^g(·|s) in environment
6:   Observe next state s′, reward r
7:   Store transition (s, a_g, s′, r) in B_g
8:   if Human intervention occurs then
9:     Store (s, a_g, a_h, s′, r) in B_h
10:  end if
11:  Sample mini-batches b_g ∼ B_g, b_h ∼ B_h
12:  Update Z_θ^g via Proxy Value Loss (20)
13:  Update π_φ^g via Policy Improvement (21)
14:  Update Z_ζ^c via TD Loss (38)
15: end while
16: Stage 2: RL Continuous Enhancement
17: Initialize π_ψ^r with parameters from π_φ^g
18: while Training is not complete do
19:  Action Selection:
20:    With probability 1 − T_c(s), choose a ∼ π_ψ^r(·|s)
21:    With probability T_c(s), choose a ∼ π_φ^g(·|s)
22:  Execute action a, observe s′, r
23:  Store transition (s, a, s′, r) in B_g
24:  Sample mini-batch b_g ∼ B_g
25:  Update Z_ζ^c via TD Loss (26)
26:  Update π_ψ^r via Return Maximization (25)
27:  Update temperature α via (22)
28:  Compute T_c(s) via Confidence Intervention (30)
29: end while

Fig. 3. Simulation environment and human interfaces. [Figure: MetaDrive safety benchmark scene, the human subject with a Logitech G29 steering wheel, and the agent exchanging actions and states with the environment.]

III. SIMULATION VERIFICATION

A. Experimental Setting

To evaluate the proposed C-HAC approach, we conduct experiments using the MetaDrive safety benchmark [48]. We generate a diverse set of driving scenarios, with each training session consisting of 20 different scenarios. As shown in Fig. 3, the map of each scenario is composed of various typical block types, including straight, ramp, roundabout, T-intersection, and intersection. Each scenario also includes randomly placed obstacles, such as moving traffic vehicles, stationary traffic cones, and triangular warning signs. The definitions of the observation space, action space, environmental reward, and environmental cost are as follows:

Observation Space: The observation space is defined as a continuous space with the following elements:
• Current state: including the target vehicle's steering, heading, and velocity.
• Surrounding information: represented by a vector of 240 LIDAR-like distance measurements from nearby vehicles and obstacles.
• Navigation data: including the relative positions toward future checkpoints and the destination.

Action Space: The action space is defined as a continuous space with the acceleration and the steering angle.

Reward: The reward function consists of four parts:

R = c_{\mathrm{disp}} R_{\mathrm{disp}} + c_{\mathrm{speed}} R_{\mathrm{speed}} + c_{\mathrm{collision}} R_{\mathrm{collision}} + R_{\mathrm{term}}    (41)

• R_disp: Encourages forward movement, defined as R_disp = d_t − d_{t−1}, where d_t and d_{t−1} are the longitudinal movements at the current and previous time steps.
• R_speed: Encourages maintaining a reasonable speed, defined as R_speed = v_t / v_max, where v_t and v_max denote the current speed and the maximum allowed speed.
• R_collision: Penalizes collisions, defined as R_collision = −5 if a collision with a vehicle, human, or object occurs, and 0 otherwise.
• R_term: This reward is assigned only at the last time step. If the vehicle reaches the destination, we choose R_term = +10 (success). If the vehicle drives out of the road, R_term = −5.

Cost: Each collision with traffic vehicles, obstacles, or parked vehicles incurs a cost of 1. The environmental cost is utilized for testing the safety of the trained policies and measuring the occurrence of dangerous situations during the training process.
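A minimal sketch of the four-part reward in (41), following the component definitions above; the coefficients c_disp, c_speed and c_collision are placeholders because their numerical values are not specified in the text.

def metadrive_style_reward(d_t, d_prev, v_t, v_max, collided, arrived, out_of_road,
                           c_disp=1.0, c_speed=1.0, c_collision=1.0):
    """R = c_disp*R_disp + c_speed*R_speed + c_collision*R_collision + R_term, per (41)."""
    r_disp = d_t - d_prev                    # forward progress along the route
    r_speed = v_t / v_max                    # encourage a reasonable speed
    r_collision = -5.0 if collided else 0.0  # collision penalty
    r_term = 0.0                             # terminal reward, last step only
    if arrived:
        r_term = 10.0
    elif out_of_road:
        r_term = -5.0
    return c_disp * r_disp + c_speed * r_speed + c_collision * r_collision + r_term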

As shown in Fig. 3, human subjects can take over control using the Logitech G29 racing wheel and monitor the training process through the visualization of environments on the screen. The user interface displays several key indicators: the 'Speed' flag shows the real-time speed of the target vehicle; the 'Takeover' flag is triggered when a human takeover occurs; the 'Total step' flag indicates the number of steps taken during training; the 'Total time' flag represents the total time spent on training; the 'Takeover rate' flag displays the frequency of human intervention; the 'Stage' flag indicates the current training stage; and the 'Reward policy' flag is triggered when the self-learning policy is adopted instead of the human-guided policy.

B. Baseline Methods

The following baseline methods are used for comparison:
• RL: Standard RL approaches including PPO, SAC, and DSAC; safe RL approaches including PPO-Lag [49] and SAC-Lag [50].
• Offline RL and IL: Offline RL approach using conservative Q-learning (CQL) [51]; IL approaches using behavior cloning (BC) and generative adversarial imitation learning (GAIL) from human demonstrations.
• HAC: HAC approaches including PVP, D-PVP, HG-DAgger and IWR.

These baseline methods are implemented using RLlib. The training of baseline methods is conducted through five concurrent trials on Nvidia GeForce RTX 4080 GPUs. Each trial utilizes 2 CPUs with 6 parallel rollout workers. All RL and IL baseline experiments are repeated five times using different random seeds. The experiments for C-HAC and PVP are conducted on a local computer and repeated three times.

The evaluation metrics are divided into training and testing phases. During the training phase, we focus on data usage and total safety cost. The total safety cost represents the number of collisions during training, reflecting the potential dangers. In the testing phase, the primary metrics are episodic return, episodic safety cost, and success rate. The episodic safety cost is the average number of crashes in one episode, and the success rate is the ratio of episodes in which agents reach the destination to the total number of test episodes. For HAC methods, the total amount of human data used and the overall intervention rate are also reported. The overall intervention rate is the ratio of human data usage to total data usage, indicating the effort required from humans to teach the agents.
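Referring to the evaluation metrics just defined, the following sketch computes them from logged test episodes; the episode-record format is a hypothetical assumption, not the authors' logging code.

def evaluation_metrics(episodes):
    """Episodic return, episodic safety cost and success rate over test episodes,
    where each record carries 'return', 'crashes' and 'reached_destination'."""
    n = len(episodes)
    episodic_return = sum(ep["return"] for ep in episodes) / n
    episodic_safety_cost = sum(ep["crashes"] for ep in episodes) / n
    success_rate = sum(ep["reached_destination"] for ep in episodes) / n
    return episodic_return, episodic_safety_cost, success_rate

def overall_intervention_rate(human_steps, total_steps):
    """Ratio of human data usage to total data usage, reported for HAC methods."""
    return human_steps / total_steps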

C. Simulation Results

The performance of the different baselines is listed in Table I, and Fig. 4 gives the learning curves.

TABLE I
THE PERFORMANCE OF DIFFERENT BASELINES IN THE METADRIVE SIMULATOR.
(Training columns: Human Data, Total Data, Safety Cost. Testing columns: Episodic Return, Episodic Safety Cost, Success Rate.)

Method         | Human Data   | Total Data    | Safety Cost     | Episodic Return | Episodic Safety Cost | Success Rate
SAC [44]       | -            | 1M            | 8.86K ± 4.91K   | 359.18 ± 18.51  | 1.00 ± 0.30          | 0.74 ± 0.17
PPO [52]       | -            | 1M            | 45.12K ± 21.11K | 278.65 ± 35.07  | 3.92 ± 1.91          | 0.44 ± 0.14
SAC-Lag [50]   | -            | 1M            | 8.73K ± 4.14K   | 346.05 ± 20.57  | 0.64 ± 0.13          | 0.62 ± 0.17
PPO-Lag [49]   | -            | 1M            | 26.55K ± 9.97K  | 222.15 ± 49.66  | 0.88 ± 0.23          | 0.26 ± 0.07
DSAC [53]      | -            | 1M            | 7.44K ± 3.59K   | 349.35 ± 22.15  | 0.47 ± 0.08          | 0.77 ± 0.09
Human Demo.    | 50K          | -             | 23              | 377.52          | 0.39                 | 0.97
CQL [51]       | 50K (1.0)    | -             | -               | 93.12 ± 16.31   | 1.45 ± 0.15          | 0.09 ± 0.05
BC [54]        | 50K (1.0)    | -             | -               | 59.13 ± 8.92    | 0.12 ± 0.03          | 0 ± 0
GAIL [55]      | 50K (1.0)    | -             | -               | 34.78 ± 3.92    | 1.07 ± 0.13          | 0 ± 0
HG-DAgger [33] | 34.9K (0.70) | 0.05M         | 56.13           | 142.35          | 2.1                  | 0.30
IWR [36]       | 37.1K (0.74) | 0.05M         | 48.78           | 329.97          | 4.00                 | 0.70
PVP [41]       | 15K          | 0.05M         | 35.67 ± 4.32    | 338.28 ± 9.72   | 0.898 ± 0.15         | 0.81 ± 0.04
D-PVP          | 15K          | 0.05M         | 32.12 ± 4.68    | 353.39 ± 12.34  | 0.31 ± 0.03          | 0.83 ± 0.05
C-HAC          | 15K          | 0.05M + 0.95M | 176.00 ± 31.82  | 392.92 ± 15.70  | 0.16 ± 0.02          | 0.91 ± 0.02

Fig. 4. Performance of different baselines. (a) Test-time reward comparison. (b) Train-time cost comparison. (c) Takeover rate comparison. (d) Test-time reward comparison. [Curves for SAC, SAC-Lag, PPO, PPO-Lag, DSAC, C-HAC and PVP over environmental steps; the stage transition of C-HAC is marked.]

Comparison with RL approaches. Fig. 4(a) highlights the training and testing performance of the proposed C-HAC compared to standard RL and safe RL algorithms. In the MetaDrive environment, C-HAC achieves an average return of 392.92 with a safety cost of 0.16, significantly outperforming SAC, PPO, DSAC, SAC-Lag, and PPO-Lag. It also maintains higher success rates across all scenarios. During the learning-from-demonstration stage, C-HAC realizes a return of 353.39 and an 83% success rate within 50,000 steps, completing training in approximately one hour. In the subsequent RL enhancement stage, the agent further improves its return, demonstrating the benefits of reward-guided policy refinement. Moreover, C-HAC ensures safety, recording an average of 35.67 safety violations in the demonstration stage and 140.33 in the enhancement stage, while maintaining rapid convergence.

Comparison with Offline RL and IL methods. A dataset of 100 episodes (50K steps) of human driving data was collected for training the offline RL and IL baselines. The dataset achieved a 97% success rate, an average return of 377.52, and a safety cost of 0.39. Using this data, CQL, BC, and GAIL were trained. As presented in Table I, C-HAC significantly outperforms these methods, with test success rates for the baselines remaining below 10%. Notably, GAIL achieves a 0% success rate due to its reliance on strictly matching the expert data distribution, leading to poor generalization in unseen scenarios. Additionally, the episodic returns of BC, CQL, and GAIL are far below those achieved by C-HAC. Unlike IL methods, which optimize the agent to mimic expert actions at each timestep, C-HAC employs trajectory-based learning, promoting actions that maximize future rewards rather than simple imitation. Furthermore, C-HAC collects expert data online, effectively mitigating the distribution shift problem in offline RL and achieving superior performance.

Fig. 5. Performance comparison of C-HAC with HAC methods. [Curves of test episodic reward and test episodic cost versus sampled steps for HG-DAgger, IWR and C-HAC.]

Fig. 6. Impact of the shared control mechanism and policy confidence evaluation on policy performance. [Curves of test episodic reward and total training cost versus environmental steps for C-HAC (No Share), C-HAC (No Confidence) and C-HAC (Full).]

Comparison with HAC baselines. During the warm-up phase, both HG-DAgger and IWR utilize a dataset of 30K pre-collected samples, which are later added to their replay buffers during training. Both methods also undergo four rounds of training using BC, with at least 5K human-guided samples in each round. As shown in Fig. 5, only IWR achieves a satisfactory success rate due to its emphasis on prioritizing human intervention data. This approach enables the agent to learn critical maneuvers and avoid compounding errors. In contrast, HG-DAgger struggles with limited demonstrations, as it lacks a re-weighting mechanism for human-guided samples. The proposed C-HAC consistently performs better than these baselines, demonstrating superior performance during both the initial guidance phase and the final evaluation. These results highlight the method's ability to effectively leverage limited human input for robust policy learning. C-HAC was further evaluated against the PVP approach. As shown in Fig. 4 and Table I, C-HAC exhibits faster convergence during the human-guided phase compared to PVP. It achieves a more rapid reduction in the average intervention probability and obtains a higher final test return while utilizing a comparable amount of human guidance data. Furthermore, during the subsequent RL continuous enhancement stage, C-HAC demonstrates superior performance, achieving higher returns than PVP.

To evaluate the effectiveness of the shared control mechanism and the policy confidence evaluation algorithm, we conducted the following comparative experiments. First, a baseline method (denoted as C-HAC (No Share)) excludes the shared control mechanism and directly trains the policy to maximize cumulative rewards after the learning-from-demonstration stage. Second, another baseline (denoted as C-HAC (No Confidence)) excludes the policy confidence evaluation algorithm, instead comparing the mean expected returns of the policies without considering the confidence factor in (30); this method is equivalent to the case of setting δ = 0.5. Finally, our proposed approach (denoted as C-HAC (Full)) incorporates both the shared control mechanism and the policy confidence evaluation algorithm, with δ set to 0.15.
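For completeness, the equivalence claimed for C-HAC (No Confidence) follows directly from (29) and (30): with δ = 0.5,

P(Q^r_\zeta > Q^g_\zeta) \le 1 - \delta = 0.5 \;\Longleftrightarrow\; \Phi\!\left( \frac{Q^g_\zeta - Q^r_\zeta}{\sqrt{(\sigma^r_\zeta)^2 + (\sigma^g_\zeta)^2}} \right) \ge 0.5 \;\Longleftrightarrow\; Q^g_\zeta \ge Q^r_\zeta,

so the intervention function degenerates to a comparison of the mean expected returns.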

Fig. 6 depicts the test episodic reward and total training cost for these three methods. C-HAC (No Share) shows a 70% decline in policy performance during the RL continuous enhancement stage, with significantly higher training costs. C-HAC (No Confidence), which employs the shared control mechanism, reduces the decline to 50% and incurs a training cost of 200. C-HAC (Full), using the policy confidence evaluation algorithm, achieves the smallest decline of 20% and the lowest training cost. These results demonstrate that the shared control mechanism and policy confidence evaluation algorithm significantly enhance the stability and safety of policy learning.

D. Visualization

Fig. 7. The action sequences generated by C-HAC and PVP agents in the same MetaDrive map.

In Fig. 7, the action sequences of agents trained with C-HAC and PVP are visualized. The angle and length of each arrow represent the steering angle and acceleration, respectively, with the human subject's actions highlighted in yellow. Compared to PVP, the action sequences generated by C-HAC are notably smoother and align more closely with those of human drivers.

Fig. 8. Comparative visualizations of D-PVP and C-HAC across two driving scenarios. [Scenario 1: safe passage versus crash; Scenario 2: decelerate versus accelerate and overtake; shown in bird's-eye and third-person views.]

In Fig. 8, Scenario 1 compares the trajectories of an agent trained with D-PVP during the learn-from-demonstration stage to those refined by C-HAC in the RL continuous enhancement stage. When the D-PVP policy leads to collisions, C-HAC leverages its shared control mechanism to identify and correct these shortcomings. This capability explains C-HAC's higher success rate relative to D-PVP.

In Fig. 8, Scenario 2 depicts a congested environment. The D-PVP policy chooses to decelerate and wait for surrounding vehicles, whereas C-HAC accelerates at the right moment to overtake. Consequently, C-HAC completes the scenario more quickly and achieves higher episodic rewards.

Fig. 9. Routes for training and testing in real-world experiments.

Fig. 10. Architecture of the UGV setup with human intervention. [Camera, INS, radar and LiDAR observations are fused into the state; the value and policy modules produce the action, which is sent to the BAU; a human can intervene at any time.]

E. Real-World Driving Experiment

As shown in Fig. 9, the real-world training process takes place on the campus roads of Tianjin University. Each route consists of multiple checkpoints that specify both position and driving commands. Route 1 is used for training, while Route 2 is designated for generalization testing. As illustrated in Fig. 10, the UGV localizes via an integrated navigation system (INS) and perceives its surroundings using LiDAR, camera, and radar. An Nvidia Jetson AGX Orin is responsible for the perception tasks. Training is performed on a separate GPU, and the resulting control commands are sent to the base adapter unit (BAU). Random pedestrians, bicycles, and vehicles introduce natural variability, making the training conditions more challenging. The definitions of the observation space and action space are given below:

Fig. 11. Real-world driving performance on Route 1 and Route 2. (a) UGV executes a left turn while avoiding a pedestrian. (b) UGV slows down and stops to yield to a crossing pedestrian. (c) UGV executes a sharp turn. (d) UGV maneuvers around an obstacle. (e) UGV navigates past a stationary vehicle. (f) UGV manages an intersection with heavy traffic.

Observation space: This is a continuous space comprising (a) the UGV's current state, including speed, lateral offset from the lane center, and heading angle relative to the lane center; (b) surrounding information, which is derived by fusing detection results from LiDAR, camera, and radar, then converting these into a 240-dimensional LiDAR-like distance vector for nearby vehicles and obstacles—following a representation approach similar to MetaDrive; and (c) navigation data, consisting of the next 30 checkpoints and driving instructions such as go straight, turn left or turn right.

Action space: The action space is continuous, with two components: acceleration and steering angle.

Fig. 12. Action outputs and speeds of the vehicle in real-world scenarios. (a) Action outputs on Route 1. (b) Action outputs on Route 2. (c) Speed on Route 1. (d) Speed on Route 2. [Steering angle (degrees), throttle/brake, and speed (km/h) plotted over time.]

During training, the UGV navigates checkpoints on Route 1 while actively avoiding obstacles and other vehicles. A human operator may intervene at any time by pressing the autopilot mode switch button or using the steering wheel and throttle/brake pedals to manually override the system. The UGV continues driving under supervision until reaching a predefined total of 100,000 steps, at a policy execution frequency of 10 Hz, resulting in roughly two hours of training.

In testing on Route 1, the UGV successfully completes the entire route. As shown in Fig. 11(a), it executes a left turn while avoiding a pedestrian, and in Fig. 11(b), it slows down and stops to yield to a crossing pedestrian. Additionally, as depicted in Fig. 11(c), the UGV executes a sharp turn, with the corresponding actions and speed presented in Fig. 12(a) and Fig. 12(c). For generalization testing, the UGV is evaluated on Route 2, with action and speed outputs shown in Fig. 12(b) and Fig. 12(d). In Fig. 11(d), it maneuvers around an obstacle, while Fig. 11(e) shows the UGV navigating past a stationary vehicle. Finally, as illustrated in Fig. 11(f), it effectively manages an intersection with heavy traffic.

IV. CONCLUSION

This paper proposes a confidence-guided human-AI collaboration method that enhances autonomous driving policy learning through the integration of distributional proxy value propagation and the distributional soft actor-critic. Such a treatment overcomes the limitations of purely human-guided or purely self-learning strategies by providing a structured two-stage procedure comprising learning from demonstration and RL continuous enhancement. Experimental results on the MetaDrive benchmark demonstrate substantial improvements in both safety and overall performance compared to conventional RL, safe RL, IL, and other HAC methods. By incorporating a shared control mechanism and a policy confidence evaluation algorithm, this method efficiently

reduces human supervision while preserving essential human-guided behaviors. These findings underscore the potential of unifying human expertise and autonomous exploration within a single framework, thereby offering a more scalable and robust pathway toward real-world autonomous driving systems.

REFERENCES

[1] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J. M. Allen, V. D. Lam, A. Bewley, and A. Shah, "Learning to drive in a day," in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 8248–8254.
[2] M. Bojarski, D. W. del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, "End to end learning for self-driving cars," ArXiv, vol. abs/1604.07316, 2016.
[3] S. Feng, H. Sun, X. Yan, H. Zhu, Z. Zou, S. Shen, and H. X. Liu, "Dense reinforcement learning for safety validation of autonomous vehicles," Nature, vol. 615, no. 7953, pp. 620–627, 2023.
[4] S. Aradi, "Survey of deep reinforcement learning for motion planning of autonomous vehicles," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 2, pp. 740–759, 2022.
[5] Z. Zhu and H. Zhao, "A survey of deep RL and IL for autonomous driving policy learning," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 4043–4065, 2022.
[6] X. Di and R. Shi, "A survey on autonomous vehicle control in the era of mixed-autonomy: From physics-based to AI-guided driving policy learning," Transportation Research Part C: Emerging Technologies, vol. 125, pp. 3008–3048, 2021.
[7] K. Muhammad, A. Ullah, J. Lloret, J. D. Ser, and V. H. C. de Albuquerque, "Deep learning for safe autonomous driving: Current challenges and future directions," IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 7, pp. 4316–4336, 2021.
[8] S. Aradi, "Survey of deep reinforcement learning for motion planning of autonomous vehicles," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 2, pp. 740–759, 2022.
[9] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[10] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. A. Sallab, S. Yogamani, and P. Pérez, "Deep reinforcement learning for autonomous driving: A survey," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 4909–4926, 2022.
[11] Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool, "End-to-end urban driving by imitating a reinforcement learning coach," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 5202–5212.
[12] W. B. Knox, A. Allievi, H. Banzhaf, F. Schmitt, and P. Stone, "Reward (mis)design for autonomous driving," Artificial Intelligence, vol. 316, no. 103829, Mar. 2023.
[13] J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, and S. Legg, "Scalable agent alignment via reward modeling: A research direction," arXiv preprint arXiv:1811.07871, 2018.
[14] A. Bondarenko, D. Volk, D. Volkov, and J. Ladish, "Demonstrating specification gaming in reasoning models," arXiv preprint arXiv:2502.13295, 2025.
[15] J. Wu, Z. Huang, Z. Hu, and C. Lv, "Toward human-in-the-loop AI: Enhancing deep reinforcement learning via real-time human guidance for autonomous driving," Engineering, vol. 21, pp. 75–91, 2023.
[16] A. Harutyunyan, W. Dabney, T. Mesnard, M. Gheshlaghi Azar, B. Piot, N. Heess, H. P. van Hasselt, G. Wayne, S. Singh, D. Precup et al., "Hindsight credit assignment," in Advances in Neural Information Processing Systems, vol. 32, 2019, pp. 167–175.
[17] W. Saunders, G. Sastry, A. Stuhlmueller, and O. Evans, "Trial without error: Towards safe reinforcement learning via human intervention," arXiv preprint arXiv:1707.05173, 2017.
[18] L. Le Mero, D. Yi, M. Dianati, and A. Mouzakitis, "A survey on imitation learning techniques for end-to-end autonomous vehicles," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 4128–4147, 2022.
[19] Z. Huang, H. Liu, J. Wu, and C. Lv, "Conditional predictive behavior
[20] B. Widrow, "Pattern recognition and adaptive control," IEEE Transactions on Applications and Industry, vol. 83, no. 74, pp. 269–277, 1964.
[21] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, "An algorithmic perspective on imitation learning," ArXiv, vol. abs/1811.06711, 2018.
[22] J. Huang, S. Xie, J. Sun, Q. Ma, C. Liu, D. Lin, and B. Zhou, "Learning a decision module by imitating driver's control behaviors," in Proceedings of the 2020 Conference on Robot Learning, J. Kober, F. Ramos, and C. Tomlin, Eds., 2021, pp. 1–10.
[23] H. Bharadhwaj, A. Kumar, N. Rhinehart, S. Levine, F. Shkurti, and A. Garg, "Conservative safety critics for exploration," ArXiv, vol. abs/2010.14497, 2020.
[24] Y. Wu, G. Tucker, and O. Nachum, "Behavior regularized offline reinforcement learning," ArXiv, vol. abs/1911.11361, 2019.
[25] S. Fujimoto, D. Meger, and D. Precup, "Off-policy deep reinforcement learning without exploration," in Proceedings of the 36th International Conference on Machine Learning, vol. 97, 2019, pp. 2052–2062.
[26] J. Sun, L. Yu, P. Dong, B. Lu, and B. Zhou, "Adversarial inverse reinforcement learning with self-attention dynamics model," IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1880–1886, 2021.
[27] S. Ross and D. Bagnell, "Efficient reductions for imitation learning," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2010, pp. 661–668.
[28] F. Codevilla, E. Santana, A. Lopez, and A. Gaidon, "Exploring the limitations of behavior cloning for autonomous driving," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 9328–9337.
[29] R. Camacho and D. Michie, "Behavioral cloning: A correction," AI Magazine, vol. 16, pp. 92–101, 1995.
[30] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis, "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play," Science, vol. 362, pp. 1140–1144, 2018.
[31] S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in International Conference on Artificial Intelligence and Statistics, 2011, pp. 627–635.
[32] J. Zhang and K. Cho, "Query-efficient imitation learning for end-to-end autonomous driving," ArXiv, vol. abs/1605.06450, 2016.
[33] M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer, "HG-DAgger: Interactive imitation learning with human experts," in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 8077–8083.
[34] R. Hoque, A. Balakrishna, E. Novoseller, A. Wilcox, D. S. Brown, and K. Goldberg, "ThriftyDAgger: Budget-aware novelty and risk gating for interactive imitation learning," in Proceedings of the 5th Conference on Robot Learning, vol. 164, 2022, pp. 598–608.
[35] J. Spencer, S. Choudhury, M. Barnes, M. Schmittle, M. Chiang, P. J. Ramadge, and S. S. Srinivasa, "Expert intervention learning," Autonomous Robots, vol. 46, pp. 99–113, 2021.
[36] A. Mandlekar, D. Xu, R. Martín-Martín, Y. Zhu, F. F. Li, and S. Savarese, "Human-in-the-loop imitation learning using remote teleoperation," ArXiv, vol. abs/2012.06733, 2020.
[37] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, "Deep reinforcement learning from human preferences," in Advances in Neural Information Processing Systems, 2017, pp. 4299–4307.
[38] E. Biyik and D. Sadigh, "Batch active preference-based learning of reward functions," in Proceedings of The 2nd Conference on Robot Learning, vol. 87, 2018, pp. 519–528.
[39] M. Palan, N. C. Landolfi, G. Shevchuk, and D. Sadigh, "Learning reward functions by integrating human demonstrations and preferences," ArXiv, vol. abs/1906.08928, 2019.
[40] Q. Li, Z. Peng, and B. Zhou, "Efficient learning of safe driving policy via human-AI copilot optimization," in International Conference on Learning Representations, 2022, pp. 1–19.
[41] Z. Peng, W. Mo, C. Duan, Q. Li, and B. Zhou, "Learning from active human involvement through proxy value propagation," in Advances in Neural Information Processing Systems, 2023, pp. 7969–7992.
[42] T. Yu, Z. He, D. Quillen, R. Julian, K. Hausman, C. Finn, and S. Levine, "Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning," 2019. [Online]. Available: https://arxiv.org/abs/1910.10897
planning with inverse reinforcement learning for human-like autonomous [43] R. Krishna, D. Lee, L. Fei-Fei, and M. S. Bernstein, “Socially situated
driving,” IEEE Transactions on Intelligent Transportation Systems, artificial intelligence enables learning from human interaction,” vol. 119,
vol. 24, no. 7, pp. 7244–7258, 2023. no. 39, 2022, pp. 1157–1169.
[44] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning, 2018, pp. 1861–1870.
[45] Z. Huang, Z. Sheng, C. Ma, and S. Chen, “Human as AI mentor: Enhanced human-in-the-loop reinforcement learning for safe and efficient autonomous driving,” Communications in Transportation Research, vol. 4, pp. 100–127, 2024.
[46] Z. Xue, Z. Peng, Q. Li, Z. Liu, and B. Zhou, “Guarded policy optimization with imperfect online demonstrations,” 2023.
[47] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” 2017.
[48] Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou, “MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3461–3475, 2023.
[49] A. Stooke, J. Achiam, and P. Abbeel, “Responsive safety in reinforcement learning by PID Lagrangian methods,” in International Conference on Machine Learning, 2020, pp. 9133–9143.
[50] S. Ha, P. Xu, Z. Tan, S. Levine, and J. Tan, “Learning to walk in the real world with minimal human effort,” arXiv preprint arXiv:2002.08550, 2020.
[51] A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative Q-learning for offline reinforcement learning,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 1179–1191.
[52] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[53] J. Duan, Y. Guan, S. E. Li, Y. Ren, Q. Sun, and B. Cheng, “Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6584–6598, 2022.
[54] M. Bain and C. Sammut, “A framework for behavioural cloning,” in Machine Intelligence, 1999, pp. 103–129.
[55] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, Eds., vol. 29, 2016, pp. 1–9.

Zeqiao Li received the B.S. degree in intelligent science and technology from Hebei University of Technology, Tianjin, China, in 2023. He is currently pursuing the Ph.D. degree with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China.
His research interests include reinforcement learning, optimal control, and self-driving decision-making.

Yijing Wang received the M.S. degree in control theory and control engineering from Yanshan University, Qinhuangdao, China, in 2000, and the Ph.D. degree in control theory from Peking University, Beijing, China, in 2004.
In 2004, she joined the School of Electrical and Information Engineering, Tianjin University, Tianjin, China, where she is a Full Professor. Her research interests are the analysis and control of switched/hybrid systems and robust control.

Haoyu Wang received the B.S. degree in automation and the Ph.D. degree in control theory and control engineering from Tianjin University, Tianjin, China, in 2018 and 2023, respectively.
In 2023, he joined the School of Electrical and Information Engineering, Tianjin University, where he is a Postdoctoral Researcher. His research interests include active disturbance rejection control, motion control, and disturbance observer design with application to intelligent vehicles.

Zheng Li received the M.Eng. degree in control engineering from the School of Electrical and Information Engineering, Tianjin University, in 2021. He is currently pursuing the Ph.D. degree with the Tianjin Key Laboratory of Intelligent Unmanned Swarm Technology and System, Tianjin University. From January 2024 to December 2024, he was a visiting Ph.D. student with the Department of Mechanical Engineering, University of Victoria, Victoria, BC, Canada.
His research interests include interactive multi-task prediction, maneuver decision-making, trajectory planning, automatic control, and vehicular simulation and verification for autonomous vehicles.

Peng Li received the Ph.D. degree in control science and engineering from Tianjin University, China, in 2024. He is currently a Research Associate with the School of Electrical and Information Engineering, Tianjin University, China. His research interests include swarm energy systems, networked control systems, wheeled mobile robots, and finite frequency analysis.

Zhiqiang Zuo (Senior Member, IEEE) received the M.S. degree in control theory and control engineering from Yanshan University, Qinhuangdao, China, in 2001, and the Ph.D. degree in control theory from Peking University, Beijing, China, in 2004.
In 2004, he joined the School of Electrical and Information Engineering, Tianjin University, Tianjin, China, where he is a Full Professor. From 2008 to 2010, he was a Research Fellow with the Department of Mathematics, City University of Hong Kong, Hong Kong. From 2013 to 2014, he was a Visiting Scholar with the University of California at Riverside, Riverside, CA, USA. His research interests include nonlinear control, robust control, and multiagent systems, with application to intelligent vehicles.

Chuan Hu received the B.S. degree in automotive engineering from Tsinghua University, Beijing, China, in 2010, the M.S. degree in vehicle operation engineering from the China Academy of Railway Sciences, Beijing, in 2013, and the Ph.D. degree in mechanical engineering from McMaster University, Hamilton, ON, Canada, in 2017. He has been a tenure-track Associate Professor with the School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China, since July 2022. His research interests include the perception, decision-making, path planning, and motion control of intelligent and connected vehicles (ICVs), autonomous driving, eco-driving, human–machine trust and cooperation, shared control, and machine-learning applications in ICVs.