We are getting to the point where there’s one last piece of the system that needs to
be a neural net which is the planning and control function.
Elon Musk, 2023 Tesla Annual Shareholder Meeting
Vehicle control is the final piece of the Tesla FSD AI puzzle. That will drop >300k
lines of C++ control code by 2 orders of magnitude. It is training as I write this.
Elon Musk, Twitter, August 2023
1 Introduction
Autonomous driving (AD) [18,29,44], especially urban driving, requires the vehi-
cles to engage with dense and diverse traffic participants [22,24,25] and adapt to
ture that manages to handle quasi-realistic scenarios with techniques such as the resetting technique, automated scenario generation, termination-priority replay strategy, and steering cost function. 2) We propose a new and balanced metric that evaluates performance by route completion, infraction number, and scenario density. 3) Experimental results on CARLA v2 and the proposed CornerCaseRepo benchmark show the superiority of our approach. A demo of our proficient planner on CARLA v2 test routes is available at https://thinklab-sjtu.github.io/CornerCaseRepo/
2 Related Works
3 Methodology
s_t \leftarrow F_\theta^{Enc}(x_t), \quad a_t \leftarrow \pi(s_t), \quad s_{t+1} \leftarrow F_\theta^{Pre}(s_t, a_t)   (1)
By rolling out in the latent state space s_t of the world model, the planner can think and learn efficiently without interacting with the heavy physical simulator. We use DreamerV3 [16]'s structure and objectives to train the world model and the planner model. Note that our main novelty lies in the first successful adoption of a latent world model for AD.
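For illustration, here is a minimal PyTorch-style sketch of the imagination loop described by Eq. (1); the `LatentRollout` class and its linear `encoder`, `policy`, and `dynamics` modules are placeholders standing in for F^Enc_θ, π, and F^Pre_θ, not the actual networks used in this work.

```python
import torch
import torch.nn as nn

class LatentRollout(nn.Module):
    """Minimal sketch of Eq. (1): encode once, then imagine entirely in latent space."""

    def __init__(self, obs_dim: int, state_dim: int, action_dim: int):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, state_dim)                   # F^Enc: x_t -> s_t
        self.policy = nn.Linear(state_dim, action_dim)                 # pi: s_t -> a_t
        self.dynamics = nn.Linear(state_dim + action_dim, state_dim)   # F^Pre: (s_t, a_t) -> s_{t+1}

    def imagine(self, x0: torch.Tensor, horizon: int):
        """Roll out `horizon` steps purely in latent space, never calling the simulator."""
        s = self.encoder(x0)
        trajectory = []
        for _ in range(horizon):
            a = torch.tanh(self.policy(s))                  # placeholder continuous action
            s = self.dynamics(torch.cat([s, a], dim=-1))    # predicted next latent state
            trajectory.append((s, a))
        return trajectory

# usage: imagine 15 steps from a dummy observation
model = LatentRollout(obs_dim=64, state_dim=32, action_dim=2)
traj = model.imagine(torch.randn(1, 64), horizon=15)
```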
World Model Learning. It has four components in line with [16]:
\operatorname{RSSM} \begin{cases} \text{Sequence model:} & h_t = f_\theta(h_{t-1}, z_{t-1}, a_{t-1}) \\ \text{Encoder:} & z_t \sim q_\theta(z_t \mid h_t, x_t) \\ \text{Dynamics predictor:} & \hat{z}_t \sim p_\theta(\hat{z}_t \mid h_t) \end{cases}
\text{Reward predictor:} \quad \hat{r}_t \sim p_\theta(\hat{r}_t \mid h_t, z_t)
\text{Termination predictor:} \quad \hat{c}_t \sim p_\theta(\hat{c}_t \mid h_t, z_t)
\text{Decoder:} \quad \hat{x}_t \sim p_\theta(\hat{x}_t \mid h_t, z_t)   (2)
\begin{aligned} \mathcal{L}_{\text{pred}}(\theta) &\doteq -\ln p_\theta(x_t \mid z_t, h_t) - \ln p_\theta(r_t \mid z_t, h_t) - \ln p_\theta(c_t \mid z_t, h_t) \\ \mathcal{L}_{\text{dyn}}(\theta) &\doteq \max\left(1, \operatorname{KL}\left[\operatorname{fz}(q_\theta(z_t \mid h_t, x_t)) \,\|\, p_\theta(z_t \mid h_t)\right]\right) \\ \mathcal{L}_{\text{rep}}(\theta) &\doteq \max\left(1, \operatorname{KL}\left[q_\theta(z_t \mid h_t, x_t) \,\|\, \operatorname{fz}(p_\theta(z_t \mid h_t))\right]\right) \end{aligned}   (3)
where the prediction loss L_pred trains the decoder and, via binary cross-entropy, the termination predictor; the symlog loss [16] is utilized to train the reward predictor. By minimizing the KL divergence to the prior p_θ(z_t | h_t), the dynamics loss L_dyn trains the sequence model to predict the next representation, while the representation loss L_rep makes the representations easier to predict. The two losses differ only in the position of the parameter-freeze operation fz(·).
Given a rollout of observations x_{1:T}, actions a_{1:T}, rewards r_{1:T}, and termination flags c_{1:T} from the records, the overall loss is:
\mathcal{L}(\theta) \doteq \mathrm{E}_{q_\theta}\left[\sum_{t=1}^{T} \left(\beta_{\text{pred}}\, \mathcal{L}_{\text{pred}}^t(\theta) + \beta_{\text{dyn}}\, \mathcal{L}_{\text{dyn}}^t(\theta) + \beta_{\text{rep}}\, \mathcal{L}_{\text{rep}}^t(\theta)\right)\right]   (4)
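As a hedged sketch of how the terms in Eq. (3) and Eq. (4) combine, the snippet below uses Gaussian latents for brevity (DreamerV3 uses categorical latents), with `detach` playing the role of the freeze operator fz(·) and illustrative β weights; it is not the exact implementation.

```python
import torch
import torch.distributions as D

def world_model_loss(post_mean, post_std, prior_mean, prior_std,
                     pred_logprob_x, pred_logprob_r, pred_logprob_c,
                     beta_pred=1.0, beta_dyn=0.5, beta_rep=0.1):
    """Sketch of Eq. (3)-(4) with Gaussian latents. The free-bits clamp max(1, KL)
    and the stop-gradient (detach) placement follow the text; the beta values are
    illustrative, not the paper's settings."""
    sg = lambda t: t.detach()  # fz(.) in the paper: block gradients

    post = D.Normal(post_mean, post_std)          # q(z_t | h_t, x_t)
    prior = D.Normal(prior_mean, prior_std)       # p(z_t | h_t)

    # Prediction loss: negative log-likelihood of observation, reward, termination.
    loss_pred = -(pred_logprob_x + pred_logprob_r + pred_logprob_c)

    # Dynamics loss: train the prior to match a frozen posterior.
    loss_dyn = D.kl_divergence(D.Normal(sg(post_mean), sg(post_std)), prior).sum(-1).clamp(min=1.0)

    # Representation loss: regularize the posterior toward a frozen prior.
    loss_rep = D.kl_divergence(post, D.Normal(sg(prior_mean), sg(prior_std))).sum(-1).clamp(min=1.0)

    return (beta_pred * loss_pred + beta_dyn * loss_dyn + beta_rep * loss_rep).mean()
```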
\mathcal{L}_{\text{critic}}(\psi) \doteq -\sum_{t=1}^{T} y_t^\top \ln p_\psi(\cdot \mid s_t)   (5)
where the softmax distribution p_ψ(· | s_t) over equally spaced buckets is the output of the critic. The return expectation term for actor training is normalized by moving statistics [15, 40]:
\mathcal{L}(\eta) \doteq \sum_{t=1}^{T} \left( -\,\mathrm{E}_{\pi_\eta, p_\theta}\left[\frac{\operatorname{fz}(R_t^\lambda)}{\max(1, S)}\right] - \beta_{en}\, \mathrm{H}\left[\pi_\eta(a_t \mid s_t)\right] \right)   (6)
where S is an exponentially decaying mean of the range between the 5th and 95th batch percentiles of the returns. More details can be found in [16].
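The following sketch illustrates the two ingredients above: a two-hot target over equally spaced buckets, which is one common way to realize the categorical critic target y_t in Eq. (5), and the percentile-based return normalization of Eq. (6). Bucket layout, decay rate, and function names are assumptions, not the paper's code.

```python
import torch

def two_hot(returns: torch.Tensor, bins: torch.Tensor) -> torch.Tensor:
    """Encode scalar returns as two-hot targets over fixed, equally spaced buckets."""
    returns = returns.clamp(min=float(bins[0]), max=float(bins[-1]))
    idx = (torch.searchsorted(bins, returns) - 1).clamp(0, len(bins) - 2)
    lower, upper = bins[idx], bins[idx + 1]
    weight_upper = (returns - lower) / (upper - lower)
    target = torch.zeros(returns.shape[0], len(bins))
    target.scatter_(1, idx.unsqueeze(1), (1 - weight_upper).unsqueeze(1))
    target.scatter_(1, (idx + 1).unsqueeze(1), weight_upper.unsqueeze(1))
    return target

def normalize_returns(lambda_returns: torch.Tensor, running_scale: torch.Tensor, decay=0.99):
    """Divide returns by a decaying mean S of the 5th-to-95th percentile range,
    but never amplify small returns (max(1, S)), as in Eq. (6)."""
    low, high = torch.quantile(lambda_returns, torch.tensor([0.05, 0.95]))
    running_scale = decay * running_scale + (1 - decay) * (high - low)
    return lambda_returns / torch.clamp(running_scale, min=1.0), running_scale

# illustrative usage
bins = torch.linspace(-20.0, 20.0, steps=41)
targets = two_hot(torch.tensor([3.2, -1.5]), bins)
scaled, S = normalize_returns(torch.randn(16).abs() * 10, running_scale=torch.tensor(1.0))
```

The critic loss of Eq. (5) is then the cross-entropy between this two-hot target and the critic's softmax output.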
Table 1: Results on the CornerCaseRepo benchmark (mean ± std over 3 runs).

Input | Method | Driving score | Weighted DS | Route completion | Infraction penalty | Collision pedestrians | Collision vehicles | Collision layout | Red light infraction | Stop sign infraction | Agent blocked
Privileged information | Roach [45] | 57.5±9 | 54.8±0.5 | 96.4±1.1 | 0.59±0.28 | 0.85±0.56 | 8.42±4.65 | 0.85±0.51 | 0.56±0.45 | 0.49±0.44 | 0.78±0.31
Privileged information | Think2Drive (Ours) | 83.8±1 | 89.0±0.2 | 99.6±0.1 | 0.84±0.01 | 0.16±0.01 | 1.2±0.5 | 0.29±0.02 | 0.14±0.01 | 0.03±0.01 | 0.08±0.01
Raw sensors | Think2Drive+TCP [42] | 36.40±12.23 | 29.6±0.2 | 85.88±8.26 | 0.41±0.32 | 0.46±0.32 | 9.92±5.12 | 6.75±3.08 | 3.27±1.64 | 5.03±3.82 | 6.18±4.62
Table 2: Success rate of Think2Drive in each scenario of CARLA Leaderboard v2.

ParkingExit 0.89 | HazardAtSideLane 0.75 | VinillaTurn 0.99 | InvadingTurn 0.90
SignalizedLeftTurn 0.95 | SignalizedRightTurn 0.76 | OppositeVehicleTakingPriority 0.89 | OppositeVehicleRunningRedLight 0.85
Accident 0.81 | AccidentTwoWays 0.61 | CrossingBicycleFlow 0.83 | HighwayCutIn 1.0
Construction 0.84 | ConstructionTwoWays 0.72 | InterurbanActorFlow 0.83 | InterurbanAdvancedActorFlow 0.8
BlockedIntersection 0.80 | EnterActorFlow 0.65 | NonSignalizedRightTurn 0.75 | NonSignalizedJunctionLeftTurnEnterFlow 0.67
MergerIntoSlowTraffic 0.67 | MergerIntoSlowTrafficV2 0.87 | HighwayExit 0.83 | NonSignalizedJunctionLeftTurn 0.79
SignalizedJunctionLeftTurnEnterFlow 0.86 | VehicleTurningRoute 0.78 | VehicleTurningRoutePedestrian 0.75 | PedestrianCrossing 0.91
YieldToEmergencyVehicle 0.92 | HardBrake 1.0 | ParkingCrossingPedestrian 0.98 | DynamicObjectCrossing 0.94
VehiclesDooropenTwoWays 0.78 | HazardAtSideLaneTwoWays 0.92 | ParkedObstacle 0.90 | ParkedObstacleTwoWays 0.91
StaticCutIn 0.85 | ParkingCutIn 0.90 | ControlLoss 0.78
4 Experiment
4.1 CARLA Leaderboard v2
CARLA v2 is built on CARLA simulator versions later than 0.9.13 (v1 uses 0.9.10); we evaluate our planner with CARLA 0.9.14. The CARLA team initially proposed Leaderboard v1, which is composed of basic tasks such as lane following, turning, and collision avoidance. Then, to facilitate quasi-realistic urban driving, CARLA v2 was released, encompassing a multitude of complex scenarios previously absent in v1. These scenarios pose serious challenges.
Since the release of CARLA v2, no team has managed to earn a spot on the leaderboard by tackling these scenarios, despite the availability of perfect logs scoring 100% on each scenario, provided by the official CARLA platform to aid related research. In our analysis, there are four primary reasons for the difficulty of v2: 1) Extended route lengths: in CARLA v2, the routes extend from 7 to 10 kilometers, a substantial increase from the roughly 1-kilometer routes in v1. 2) Complex and abundant scenarios: each route contains around 60 scenarios, which requires driving methods to handle complex road conditions and perform subtle control. 3) Exponentially decaying scoring rules: the leaderboard penalizes infractions by multiplying penalty factors smaller than 1, so over extended routes with a multitude of scenarios, models struggle to attain high scores (see the illustration after this list). 4) Limited data: the CARLA team provides only a set of 90 training routes coupled with scenarios, and there is no official API support for placing scenarios along routes randomly generated by researchers.
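To illustrate point 3), the multiplicative scoring compounds quickly; the penalty factor below is an example value, not an official CARLA constant.

```python
# Illustrative only: with a per-infraction penalty factor of 0.7, a long route with
# many scenarios and just 10 infractions already collapses the driving score.
route_completion = 1.0          # assume the route is fully completed
penalty_factor = 0.7            # example penalty factor (< 1)
num_infractions = 10
driving_score = route_completion * penalty_factor ** num_infractions
print(f"{driving_score:.4f}")   # ~0.0282, i.e. below 3% despite full completion
```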
\text{WDS} = \text{RC} \cdot \prod_{i=1}^{m} \text{penalty}_i^{\,n_i}   (7)
where RC denotes the route completion rate, m is the total number of infraction types considered, penalty_i is the penalty factor for infraction type i officially defined in CARLA, and n_i = (number of infractions of type i) / (scenario density). When there is no scenario along the route, we set n_i to the raw number of infractions, in which case the Weighted Driving Score equals the Driving Score. The Weighted Driving Score effectively balances route completion, the number of infractions, and scenario density, providing a measure of the average infractions encountered by the ego vehicle over the routes.
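A minimal sketch of Eq. (7) under the definitions above; the function name, infraction types, penalty factors, and counts are placeholders.

```python
def weighted_driving_score(route_completion, infractions, penalties, scenario_density):
    """Sketch of Eq. (7): WDS = RC * prod_i penalty_i ** n_i, with
    n_i = (#infractions of type i) / scenario_density. When there is no scenario,
    n_i falls back to the raw count and WDS reduces to the Driving Score."""
    wds = route_completion
    for kind, count in infractions.items():
        n_i = count / scenario_density if scenario_density > 0 else count
        wds *= penalties[kind] ** n_i
    return wds

# placeholder numbers, for illustration only
print(weighted_driving_score(
    route_completion=0.95,
    infractions={"collision_vehicle": 3, "red_light": 1},
    penalties={"collision_vehicle": 0.6, "red_light": 0.7},
    scenario_density=2.0,
))
```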
4.4 Performance
Tab. 1 and Fig. 1 show the results on CornerCaseRepo. The overall training time on one A6000 GPU with an AMD EPYC 7542 CPU (128 logical cores) is 3 days. For the baseline expert model, we implement Roach [45], where we replace our model-based RL model with model-free PPO [36] and keep all other techniques the same. Both experts are trained on 1600 routes and evaluated on the other 390 routes of the CornerCaseRepo benchmark for 3 runs. Think2Drive outperforms Roach by a large margin, showing the advantages of model-based RL.
Fig. 1: Comparison of Roach, Think2Drive, and Think2Drive+TCP on metrics including Weighted DS, IP (%), stop sign infractions, and agent blocked.
We also choose and train an end-to-end baseline, TCP [42], a lightweight yet competitive student model on CARLA Leaderboard v1, as the imitation learning agent. We train TCP with 200K frames collected by the Think2Drive expert under different weather conditions. We observe that TCP, as a student model with only raw sensor inputs, exhibits a large performance gap compared to both expert models, caused by the difficulty of perception as well as of the imitation process.
We analyze the performance of Think2Drive in all scenarios, and give the success
rate of the scenarios in Tab. 2.
The scenarios Accident, Construction, and HazardAtSideLane, along with their respective TwoWays versions, belong to the category of RouteObstacles scenarios. In these scenarios, the ego vehicle is required to perform lane changes to maneuver around obstacles, and in the TwoWays versions it even needs to switch to the opposite lane. Such scenarios require the ego vehicle to acquire a sophisticated lane negotiation policy, especially in the TwoWays scenarios, where the ego vehicle must execute the lane change, maneuver around the obstacles, and return to its original lane within a short time window. Failure cases in these scenarios typically result from collisions with an oncoming car while returning to the original lane after bypassing the obstacles. In the TwoWays scenarios, CARLA Leaderboard v2 randomly generates the oncoming traffic flow with speeds and intervals within [8, 18] and [15, 50], respectively (typical values that may vary with specific road conditions), which can leave a significantly constrained time window (e.g., less than 1 second) for bypassing obstacles when the speed is high and the interval is small. Consequently, the ego vehicle is required to accelerate rapidly from standstill to its maximum speed, and it usually runs at high speed and close to the oncoming car when returning to its original lane, which leads to a high risk of collision.
The scenarios SignalizedLeftTurn, CrossingBicycleFlow, SignalizedRightTurn, and BlockedIntersection belong to the JunctionNegotiate type. In these scenarios, the ego vehicle has to cut across the dense oncoming car or bicycle flow, merge into the dense traffic flow, and stop at the junction to await road clearance. In CARLA Leaderboard v2, the traffic flow in these scenarios is configured to be very aggressive, meaning it does not proactively yield to the ego vehicle. The ego vehicle therefore needs to maintain a reasonable distance from other vehicles to avoid collisions. For instance, in the SignalizedRightTurn scenario, it is expected to merge into traffic with intervals within [15, 25] meters and speeds within [12, 20] m/s. With a vehicle length of approximately 3 meters, the ego vehicle must not only accelerate rapidly to match the traffic speed in a short time but also maintain a safe following distance from other vehicles.
The world model is capable of imagining observation transitions and future rewards based on the agent's actions, and it can decode them back into interpretable BEV masks. Fig. 2 visualizes the initial input and the predicted BEV masks up to timestep 50. We observe that the world model generates plausible future states, demonstrating one advantage of adopting model-based RL for AD: the transition function is usually easy to learn.
Fig. 2: Prediction by the world model using the first 5 frames (success and failure rollouts, timesteps T = 0 to 50). It can predict reasonable future frames. In the failure case, the planner runs a red light and the world model terminates the episode, so the subsequent predictions are randomly generated.
Fig. 3: Ablation of the different bricks devised in the paper (success rate over training steps, in units of 10K; curves for Ours and for w/o Brick 1, 3, 4, 5, and 6).
5 Implementation Details
For the input representation, we utilize BEV semantic segmentation masks i_RL ∈ {0, 1}^{H×W×C} as image input, where each channel denotes the occupancy of a certain type of object. The input is generated from the privileged information obtained from the simulator and consists of C masks of size H × W. Among these C masks, the route, lanes, and lane markings are static and can thus each be represented by a single mask, while dynamic objects (e.g., vehicles and pedestrians) have T masks each, one per history time step. Additionally, we feed the speed, control action, and relative height of the ego vehicle at the previous time steps as an input vector v_RL ∈ R^K. More specific details are given in Appendix B, Tab. 4, and Tab. 5.
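As an illustration of the input layout described above, the sketch below stacks static and dynamic masks into the i_RL tensor; the split into 6 static and 7 dynamic object types is an assumption consistent with C = 34 and T = 4 (see Tab. 4 and Tab. 5), not a statement of the exact channel assignment.

```python
import numpy as np

H, W = 128, 128            # BEV resolution
C_STATIC = 6               # assumed number of static object types (one mask each)
C_DYNAMIC_TYPES = 7        # assumed number of dynamic object types
T_HISTORY = 4              # history steps kept for each dynamic object type
C = C_STATIC + C_DYNAMIC_TYPES * T_HISTORY   # 6 + 7 * 4 = 34 channels

def build_bev_input(static_masks: np.ndarray, dynamic_masks: np.ndarray) -> np.ndarray:
    """Stack static masks (H, W, C_STATIC) and dynamic masks
    (H, W, C_DYNAMIC_TYPES, T_HISTORY) into i_RL of shape (H, W, C)."""
    dynamic_flat = dynamic_masks.reshape(H, W, -1)
    return np.concatenate([static_masks, dynamic_flat], axis=-1).astype(np.uint8)

i_rl = build_bev_input(
    static_masks=np.zeros((H, W, C_STATIC), dtype=np.uint8),
    dynamic_masks=np.zeros((H, W, C_DYNAMIC_TYPES, T_HISTORY), dtype=np.uint8),
)
assert i_rl.shape == (H, W, C)
```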
For the output representation, we discretize the continuous action space into 30 discrete actions to reduce complexity; the discretized actions are presented in Tab. 6.
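A hedged sketch of what such a discretization could look like; the actual throttle, steer, and brake values in Tab. 6 are defined by the authors, while the levels below are placeholders chosen only to yield 30 actions.

```python
import itertools

# Placeholder discretization: a grid of throttle/steer levels plus a few braking actions,
# trimmed to 30 entries in total. The real values are listed in Tab. 6.
THROTTLE_LEVELS = [0.0, 0.3, 0.7]
STEER_LEVELS = [-0.7, -0.3, -0.1, 0.0, 0.1, 0.3, 0.7]
ACTIONS = [(t, s, 0.0) for t, s in itertools.product(THROTTLE_LEVELS, STEER_LEVELS)]  # 21 driving actions
ACTIONS += [(0.0, s, 1.0) for s in STEER_LEVELS]   # 7 braking actions
ACTIONS += [(0.0, 0.0, 0.5), (1.0, 0.0, 0.0)]      # soft brake and full throttle
assert len(ACTIONS) == 30

def decode_action(index: int) -> tuple:
    """Map a discrete policy output to a (throttle, steer, brake) control triple."""
    return ACTIONS[index]
```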
Reward Shaping. The reward is shaped to encourage the planner to drive safely and to complete as much of the route as possible. It consists of four parts: 1) The speed reward r_speed trains the ego vehicle to keep a safe speed, depending on the distance to other objects and their types. 2) The travel reward r_travel is the distance traveled along the target route at each tick of CARLA; it encourages the ego vehicle to complete more of the target route. 3) The deviation penalty p_deviation is the negative distance between the ego vehicle and the lane center, normalized by the maximum deviation threshold D_max. 4) The steering cost c_steer makes the ego vehicle drive more smoothly; we set it to the difference between the current steering value and the previous one. The overall reward is given by:
r = r_{speed} + \alpha_{tr}\, r_{travel} + \alpha_{de}\, p_{deviation} + \alpha_{st}\, c_{steer}   (8)
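A minimal sketch of Eq. (8); the coefficients, the maximum deviation threshold, and the shape of the speed reward are placeholders, and the steering cost is taken here as the negative absolute steering change.

```python
def compute_reward(speed_reward: float, distance_traveled: float, lane_deviation: float,
                   current_steer: float, last_steer: float,
                   d_max: float = 2.0, a_tr: float = 1.0, a_de: float = 0.5,
                   a_st: float = 0.1) -> float:
    """Sketch of Eq. (8): r = r_speed + a_tr * r_travel + a_de * p_deviation + a_st * c_steer.
    All coefficient values here are illustrative."""
    r_travel = distance_traveled                  # progress along the target route this tick
    p_deviation = -lane_deviation / d_max         # negative, normalized distance to lane center
    c_steer = -abs(current_steer - last_steer)    # penalize abrupt steering changes
    return speed_reward + a_tr * r_travel + a_de * p_deviation + a_st * c_steer
```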
6 Conclusion
We have proposed a purely learning-based planner, Think2Drive, for quasi-
realistic traffic scenarios. Benefiting from the model-based RL paradigm, it can drive proficiently in CARLA Leaderboard v2 with all 39 scenarios within 3 days
of training on a single GPU. We also devise tailored bricks such as resetting tech-
nique, automated scenario generation, termination-priority replay strategy, and
steering cost function to address the obstacles associated with applying model-
based RL to autonomous driving tasks. Think2Drive highlights and validates a
feasible approach, model-based RL, for quasi-realistic autonomous driving. Our
model can also serve as a data collection model, providing expert driving data
for end-to-end autonomous driving models.
References
1. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Pro-
ceedings of the 26th annual international conference on machine learning. pp. 41–48
(2009)
2. Brooks, F.P.: The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley Longman Publishing Co., Inc., USA, 1st edn. (1978)
3. CARLA: Carla autonomous driving leaderboard (2022),
https://leaderboard.carla.org/
4. Chekroun, R., Toromanoff, M., Hornauer, S., Moutarde, F.: Gri: General reinforced
imitation and its application to vision-based autonomous driving. Robotics 12(5),
127 (2023)
5. Chen, J., Yuan, B., Tomizuka, M.: Model-free deep reinforcement learning for urban
autonomous driving. In: 2019 IEEE Intelligent Transportation Systems Conference
(ITSC) (Oct 2019). https://doi.org/10.1109/itsc.2019.8917306
6. Chitta, K., Prakash, A., Jaeger, B., Yu, Z., Renz, K., Geiger, A.: Transfuser: Imita-
tion with transformer-based sensor fusion for autonomous driving. Pattern Analysis
and Machine Intelligence (PAMI) (2023)
7. Diehl, C., Sievernich, T., Krüger, M., Hoffmann, F., Bertram, T.: Umbrella:
Uncertainty-aware model-based offline reinforcement learning leveraging planning.
arXiv preprint arXiv:2111.11097 (2021)
8. Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: Carla: An open
urban driving simulator. In: Conference on robot learning. pp. 1–16. PMLR (2017)
9. Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in
actor-critic methods. arXiv preprint (2018)
10. Ha, D., Schmidhuber, J.: World models. arXiv preprint arXiv:1803.10122 (2018)
11. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: Off-policy max-
imum entropy deep reinforcement learning with a stochastic actor. arXiv preprint (2018)
12. Hafner, D., Lillicrap, T., Ba, J., Norouzi, M.: Dream to control: Learning behaviors
by latent imagination. arXiv preprint arXiv:1912.01603 (2019)
13. Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., Davidson, J.:
Learning latent dynamics for planning from pixels. In: Chaudhuri, K., Salakhutdi-
nov, R. (eds.) Proceedings of the 36th International Conference on Machine Learn-
ing. Proceedings of Machine Learning Research, vol. 97, pp. 2555–2565. PMLR
(2019), https://proceedings.mlr.press/v97/hafner19a.html
14. Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., Davidson, J.:
Learning latent dynamics for planning from pixels. In: International conference on
machine learning. pp. 2555–2565. PMLR (2019)
15. Hafner, D., Lillicrap, T., Norouzi, M., Ba, J.: Mastering atari with discrete world
models. arXiv preprint arXiv:2010.02193 (2020)
16. Hafner, D., Pasukonis, J., Ba, J., Lillicrap, T.: Mastering diverse domains through
world models (2023)
17. Henaff, M., Canziani, A., LeCun, Y.: Model-predictive policy learning with uncer-
tainty regularization for driving in dense traffic. arXiv preprint arXiv:1901.02705
(2019)
18. Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T.,
Wang, W., Lu, L., Jia, X., Liu, Q., Dai, J., Qiao, Y., Li, H.: Planning-oriented autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
33. Nikishin, E., Schwarzer, M., D’Oro, P., Bacon, P.L., Courville, A.: The primacy bias
in deep reinforcement learning. In: International conference on machine learning.
pp. 16828–16847. PMLR (2022)
34. Rhinehart, N., McAllister, R., Levine, S.: Deep imitative models for flexible infer-
ence, planning, and control. arXiv preprint (2018)
35. Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S.,
Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al.: Mastering atari, go, chess
and shogi by planning with a learned model. Nature 588(7839), 604–609 (2020)
36. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
37. Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for
reinforcement learning with function approximation. Advances in neural informa-
tion processing systems 12 (1999)
38. Toromanoff, M., Wirbel, E., Moutarde, F.: End-to-end model-free reinforcement
learning for urban driving using implicit affordances. In: CVPR. pp. 7153–7162
(2020)
39. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double
q-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence (2016). https://doi.org/10.1609/aaai.v30i1.10295
40. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine learning 8, 229–256 (1992)
41. Wu, P., Chen, L., Li, H., Jia, X., Yan, J., Qiao, Y.: Policy pre-training for au-
tonomous driving via self-supervised geometric modeling. In: International Con-
ference on Learning Representations (2023)
42. Wu, P., Jia, X., Chen, L., Yan, J., Li, H., Qiao, Y.: Trajectory-guided control pre-
diction for end-to-end autonomous driving: A simple yet strong baseline. Advances
in Neural Information Processing Systems 35, 6119–6132 (2022)
43. Wu, P., Escontrela, A., Hafner, D., Abbeel, P., Goldberg, K.: Daydreamer: World
models for physical robot learning. In: Conference on Robot Learning. pp. 2226–
2240. PMLR (2023)
44. Yang, Z., Jia, X., Li, H., Yan, J.: Llm4drive: A survey of large language models for
autonomous driving (2023)
45. Zhang, Z., Liniger, A., Dai, D., Yu, F., Van Gool, L.: End-to-end urban driving
by imitating a reinforcement learning coach. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision (ICCV) (2021)
3. ParkingCutIn
7. Construction
The ego vehicle encounters a construction site blocking its lane and must perform a lane change into traffic moving in the same direction to avoid it. Compared to ParkedObstacle, the construction occupies more of the lane's width. The ego vehicle has to temporarily deviate completely from its task route to bypass the construction zone.
8. ConstructionTwoWays
9. Accident
The ego vehicle encounters multiple accident cars blocking part of the lane and must perform a lane change into traffic moving in the same direction to avoid them. Compared to ParkedObstacle and Construction, the accident cars occupy more length along the lane, so the ego vehicle has to deviate completely from its task route for a longer time to bypass the accident zone.
10. AccidentTwoWays
The 'TwoWays' version of Accident. Compared to ParkedObstacleTwoWays and ConstructionTwoWays, there is a much shorter time window for the ego vehicle to bypass the route obstacles (i.e., the accident cars).
11. HazardAtSideLane
The ego vehicle encounters a slow-moving hazard
blocking part of the lane. The ego vehicle must brake
or maneuver next to a lane of traffic moving in the
same direction to avoid it.
12. HazardAtSideLaneTwoWays
The ego vehicle encounters a slow-moving hazard
blocking part of the lane. The ego vehicle must brake
or maneuver to avoid it next to a lane of traffic moving
in the opposite direction.
13. VehiclesDooropenTwoWays
14. DynamicObjectCrossing
4. BlockedIntersection
5. SignalizedJunctionLeftTurn
6. SignalizedJunctionLeftTurnEnterFlow
The ego vehicle is performing an unprotected left turn at an intersection,
merging into opposite traffic.
7. NonSignalizedJunctionLeftTurn
8. NonSignalizedJunctionLeftTurnEnterFlow
Non-signalized version of SignalizedJunctionLeftTurnEnterFlow.
9. SignalizedJunctionRightTurn
13. MergerIntoSlowTraffic
14. MergerIntoSlowTrafficV2
16. InterurbanAdvancedActorFlow
The ego vehicle merges onto the interurban road by turning left, first crossing a fast traffic flow and then merging into another one.
17. HighwayCutIn
The ego vehicle encounters a vehicle merging into its
lane from a highway on-ramp. The ego vehicle must
decelerate, brake, or change lanes to avoid a collision.
18. CrossingBicycleFlow
21. VinillaTurn
Table 4: Composition of the C = 34 BEV input channels. Each static object type uses Cs = 1 mask and each dynamic object type uses Cd = 4 masks (one per history step). Object types: road, route, ego, lane, yellow line, white line, vehicle, walker, emergency car, obstacle, green traffic light, yellow&red traffic light, stop sign.

Table 5: Input dimensions: H = 128, W = 128, C = 34, T = 4, M = 30.
Table 6: The discretized actions. The continuous action space is decomposed into 30 discrete actions, each corresponding to specific values of throttle, steer, and brake. Each action is reasonable and legitimate.
Fig. 4: Procedure of the wrapped RL environment. Multiple environments run in parallel: some are resetting while others are running, and the agent interacts with the running environments asynchronously (sending actions, receiving observations and rewards). The right side shows the workflow of each reset: load a task and the town, generate the route, and instantiate the scenarios; a failed step triggers another reset. In large maps like Town12 and Town13, the entire reset workflow takes about 1 minute.
D Hyper-parameters
Table 7: Hyper-parameters of the neural network. The CNN encoder maps the 128 × 128 BEV input to a 4 × 4 feature map using 4 × 4 convolutional kernels with stride 2. The flattened feature maps, concatenated with the state features output by the MLP encoder, are then fed into the RSSM, which consists of GRU cells and a few dense layers. The decoder mirrors the encoder and outputs the reconstructed mask and state vector. Please refer to DreamerV3 [16] for details about the network.

GRU recurrent units: 512 | CNN multiplier: 96 | Dense hidden units: 512 | MLP layers: 5 | Parameters: 104M
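As a rough sketch of the encoder geometry described in the caption of Tab. 7: five stride-2 convolutions with 4 × 4 kernels map the 128 × 128 input to a 4 × 4 feature map. The channel progression derived from the multiplier 96 is an assumption; see DreamerV3 [16] for the actual architecture.

```python
import torch
import torch.nn as nn

def make_cnn_encoder(in_channels: int = 34, multiplier: int = 96) -> nn.Sequential:
    """Five 4x4 stride-2 convolutions: 128 -> 64 -> 32 -> 16 -> 8 -> 4.
    Channel widths double each stage starting from the CNN multiplier (assumed)."""
    layers, channels = [], in_channels
    for stage in range(5):
        out_channels = multiplier * (2 ** stage)
        layers += [nn.Conv2d(channels, out_channels, kernel_size=4, stride=2, padding=1),
                   nn.SiLU()]
        channels = out_channels
    return nn.Sequential(*layers)

encoder = make_cnn_encoder()
features = encoder(torch.zeros(1, 34, 128, 128))
assert features.shape[-2:] == (4, 4)
```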