A Review of Reward Functions For Reinforcement Learning in The Context of Autonomous Driving

arXiv:2405.01440v1 [cs.RO] 12 Apr 2024

Abstract— Reinforcement learning has emerged as an important approach for autonomous driving. A reward function is used in reinforcement learning to establish the learned skill objectives and guide the agent toward the optimal policy. Since autonomous driving is a complex domain with partly conflicting objectives with varying degrees of priority, developing a suitable reward function represents a fundamental challenge. This paper aims to highlight the gap in such function design by assessing different proposed formulations in the literature and dividing individual objectives into Safety, Comfort, Progress, and Traffic Rules compliance categories. Additionally, the limitations of the reviewed reward functions are discussed, such as objectives aggregation and indifference to driving context. Furthermore, the reward categories are frequently inadequately formulated and lack standardization. This paper concludes by proposing future research that potentially addresses the observed shortcomings in rewards, including a reward validation framework and structured rewards that are context-aware and able to resolve conflicts.

I. INTRODUCTION

Since human error accounts for 94% of traffic accidents [1]–[3], autonomous vehicles are being pursued as an efficient and safe replacement for manually driven vehicles to reduce the high number of injuries and traffic accidents [4]. In addition to increasing road safety, autonomous vehicles have the potential to address two major challenges in road transportation, namely the efficiency of road infrastructure and the efficiency of fuel consumption and its associated emissions [5].

Autonomous driving software architecture can be divided into modular and End-to-End (E2E) learning approaches [6]. Modular approaches divide the complex task of driving into multiple subtasks such as perception, localization, planning, and control to compute vehicle actions from sensor data [6]. These approaches have high dependencies between different modules and suffer from error propagation through individual modules [7]. These drawbacks have led the research community to pay more attention to E2E approaches. Compared to modular approaches, E2E approaches are more sustainable and have fewer components, as they learn a driving policy directly from input sensor data [7] without any intermediate explicit tasks.

Reinforcement learning (RL) has been one of the dominant approaches in E2E driving. RL is a learning framework based on the process of repeated interaction between an agent and its environment, in contrast to data-focused approaches such as supervised learning [8]. The main goal of an RL agent is to maximize the cumulative reward, which is the discounted total of the rewards received at each time step [9]. The reward is a singular value based on the aggregation of a set of objectives, namely the reward function, computed from the action executed by the agent and the updated environment state [9]. To account for the complexity and size of the state and action spaces, Deep Reinforcement Learning (DRL) has been introduced in state-of-the-art approaches using a deep neural network [10].

RL has recently been applied in autonomous driving at different motion planning levels. In behavior planning, RL is utilized to learn a policy of high-level decisions based on the environment state [11]. As demonstrated in [12]–[15], turning decisions and lane changes are the most common behavior planning applications of reinforcement learning. Another planning level where RL has been applied is trajectory planning, in both Cartesian and Frenet coordinate systems [16]–[20]. Finally, RL can also be formulated to directly compute low-level control commands of the vehicle, such as steering, acceleration, and deceleration, as illustrated in [21]–[23].

The objective of this research is to bring attention to the diverse challenges associated with designing a suitable reward system for autonomous driving and to conduct a comprehensive analysis of the rewards employed in state-of-the-art RL approaches. Individual objectives in the recently proposed formulations are analyzed in depth and divided into different categories. Additionally, a comprehensive discussion of the strengths and limitations of the reviewed reward functions is presented, with the aim of uncovering potential areas for refinement. Lastly, this research puts forward suggestions for enhancing the structure of the reward function, with the overarching goal of improving the safety and efficiency of RL approaches and addressing the need for validation frameworks for such systems.

II. CHALLENGES OF REWARD DESIGN IN THE AUTONOMOUS DRIVING DOMAIN

A pivotal aspect of autonomous driving is the conversion of its intricate multi-objective nature into a performance metric, as discussed by [24]. State-of-the-art approaches employ objective functions or reward functions to encode competing objectives like safety and efficiency. Although this paper focuses on reward functions due to the prevalence of reinforcement learning in recent driving approaches, it is crucial to recognize that the limitations highlighted in this review extend to the design of objective functions within modular architectures as well.

The reward function, a fundamental element in reinforcement learning, serves the purpose of evaluating a specific action within the current environment state. It is therefore necessary for the reward function to facilitate a rational decision among multiple actions by representing the critical performance indicators of the task [25].
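As context for the discussion that follows, the cumulative discounted reward that the agent maximizes, as described in the introduction, can be sketched in a few lines of Python. The reward sequence and discount factor below are illustrative values, not taken from any reviewed approach.

```python
# Sketch: the discounted return an RL agent maximizes,
# G = sum_t gamma^t * r_t, over an episode of per-step rewards.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # iterate backwards so each step accumulates r_t + gamma * (tail return)
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Illustrative episode: small progress rewards, then a collision penalty.
episode = [0.1, 0.1, 0.1, -10.0]
print(round(discounted_return(episode, gamma=0.9), 4))  # -> -7.019
```

The sketch makes one point concrete: because later rewards are discounted, a large terminal penalty can be partially "hidden" behind earlier positive terms, which is exactly why the balance between reward terms matters.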
Inadequately formulated reward functions, particularly in complex environments, fail to accurately reflect the agent's objectives and may result in policies that are suboptimal in relation to the agent's actual task [26].

Insufficient emphasis is placed on the design phase of reward functions in current autonomous driving approaches, resulting in notable limitations and significant opportunities for enhancement in the overall design of reward functions [24]. The process of designing reward functions for autonomous driving is confronted with numerous challenges. Autonomous driving, being a multi-objective problem, necessitates a reward function capable of not only encompassing diverse objectives but also addressing the complex task of combining these objectives effectively and resolving their conflicts [24]. Furthermore, autonomous driving is dependent on context such as the region, weather conditions, and driving environment (urban environment vs. highway) [24]. Additionally, the absence of robust performance metrics for evaluating specific reward functions, coupled with delayed rewards, presents further design difficulties [24]. Ill-defined reward attributes may lead to undesirable or even safety-compromising driving behaviors, emphasizing the need for careful consideration during the reward function design phase.

III. CATEGORIES OF OBJECTIVES IN THE REWARD FUNCTION DESIGN

In order to establish comparability and facilitate additional analysis, the reward functions are broken down into individual components. These components are then assigned to a predefined set of categories. The selected categories, namely safety, progress, comfort, traffic rules conformance, and terms related to model performance, were chosen due to their prevalence in the approaches under investigation and their intrinsic significance as fundamental metrics for driving performance assessment. Additionally, the analysis delved into the specific use case, considering constraints, strengths, and weaknesses of the objective formulations. The suitability of existing industry standards for the attribute categories of autonomous driving was examined, and missing standards were highlighted. Further, the different categories are discussed with their various formulations and their strengths and limitations.

Certain publications use reward terms that do not fit into the chosen categories. Typically, the purpose of these terms is to improve model performance, either by implicitly encouraging exploration [27]–[29] or through explicit methods [30]. Given that these terms do not capture aspects related to autonomous driving objectives, they are not considered for further analysis.

A. Safety

The primary goal of driving is ensuring safety, which can be assessed from different perspectives. Safety encompasses hardware elements like brake performance, software design, and driving behavior that prioritizes avoiding collisions. This study specifically emphasizes safe and collision-free driving, as it aligns most closely with the reward function and RL objectives. A number of standards, including ISO 26262 [31] and SOTIF [32], address other aspects of safety; however, none of them explicitly provide a definition of safety that can be applied to the development of reward or objective functions for autonomous driving. Such a standard, which is lacking in the literature to date, is essential to defining reward functions that ensure RL agents' safety.

To tackle the absence of a specific definition, this work delves into key findings from the literature relevant to formulating such a definition. The safety objective should promote safe driving and penalize risky behavior, such as actions that result in collisions. Maintaining a safe distance from other road users and driving at the proper speed are part of safe driving behavior. Two general approaches can be distinguished in the current implementations of safety in reward functions. The direct method penalizes collisions to reduce the likelihood of accidents in the learned agent behavior. The second, a more situational approach, seeks to enhance hazard perception by estimating the risk of potential collisions and allocating rewards based on the level of risk.

The simplest implementation of safety is a conditional function, wherein a negative reward is applied if a collision occurs; otherwise, the reward is set to zero, as exemplified in previous works [21], [28], [33]–[42]. The mathematical formulation of a safety conditional reward is illustrated in equation (1), where x is a manually defined negative collision penalty. One limitation of this approach is its failure to consider the severity of the collision. For example, a minor collision at low speed is penalized the same as a collision with pedestrians at high speed.

$$ r_{safety} = \begin{cases} x & \text{if there is a collision} \\ 0 & \text{otherwise} \end{cases} \quad (1) $$

Compared to the proposed definition of safety, only the aspect of punishing unsafe driving behavior is covered. Even if this conditional function is interpretable, a manually defined penalty x is often difficult to interpret and depends heavily on the magnitude assigned to other rewards and penalties. Various levels of severity can be defined in the safety reward to impose distinct penalties for different behaviors during collisions, enhancing interpretability. Moreover, the safety reward can be tailored to distinguish between accidents involving pedestrians, vehicles, and static obstacles. A realization of such a formulation is illustrated in [43] and [28], where distinctions between different collision severities are made, and lower or higher penalties are assigned based on the entity of the traffic participant involved in a collision. Other approaches use collision damage to calculate the penalty [30], [44]. While these formulations are applicable in simulation, collision damage is extremely difficult to calculate in real scenarios.

While the aforementioned approaches enhance transparency and interpretability, a significant challenge arises in accurately attributing collisions to specific types of traffic participants, severity levels, or collision damage.
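A conditional safety reward of the kind in equation (1), extended with the collision-severity levels and actor types discussed above, could be sketched as follows. The actor categories, penalty magnitudes, and speed scaling are illustrative assumptions, not values from the cited works.

```python
# Sketch of a conditional safety reward (cf. equation (1)) extended with
# collision severity: the penalty depends on the type of actor involved and
# on the ego speed at impact. All magnitudes are illustrative assumptions.
ACTOR_PENALTY = {"pedestrian": -100.0, "vehicle": -50.0, "static": -20.0}

def safety_reward(collided, actor_type=None, ego_speed=0.0):
    if not collided:
        return 0.0
    base = ACTOR_PENALTY.get(actor_type, -20.0)
    # scale the penalty with impact speed so severe collisions cost more
    return base * (1.0 + ego_speed / 10.0)

print(safety_reward(False))                               # -> 0.0
print(safety_reward(True, "static", ego_speed=0.0))       # -> -20.0
print(safety_reward(True, "pedestrian", ego_speed=10.0))  # -> -200.0
```

The last two calls illustrate the severity argument made in the text: a low-speed brush with a static obstacle and a high-speed collision with a pedestrian receive penalties an order of magnitude apart instead of the single flat value x.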
A feasible strategy for assessing collision severity involves incorporating the speed of the autonomous vehicle at the time of the collision into the reward, as suggested in [45]. The penalty for the collision increases with higher collision speeds, providing a justifiable basis for certain penalty values, as shown in equation (2). Additionally, the speed can be introduced exponentially in the formula to impose even more severe penalties for higher speeds [45].

$$ r_{safety} = -x \times (velocity^2 + 0.5) \quad (2) $$

All the previously mentioned formulations penalize unsafe driving behavior, specifically collisions, but they lack any provision for rewarding safe driving behavior relative to the assumed risk. In this regard, trajectories exhibiting near-collision behavior are assessed identically to trajectories exhibiting safer driving behavior. An extension of the conditional reward function to incorporate collisions and near collisions is utilized in a number of publications [27], [46]–[48].

Alternatively, the safety objective can be modeled using a continuous and dense reward rather than a conditional sparse function. In order to integrate collision risk into the reward function and represent the current situation's riskiness with a single continuous value, research efforts, as outlined in [22], [29], [49], focus on factors such as the distance to the nearest vehicles or obstacles. Recognizing the dynamic nature of other vehicles, a more fitting approach than distance-based evaluation involves heuristics such as time-to-collision (TTC) [50]–[52]. The formulation of a TTC-based reward is clarified in equation (3), where the function f describes the penalty according to the current value of TTC. f can either be a constant or, for a more realistic representation, the inverse of TTC. In the latter case, the penalty increases as TTC decreases [50]. The parameter t is defined as the critical threshold below which the TTC is considered risky.

$$ r_{safety} = \begin{cases} f(TTC) & \text{if } TTC < t \\ 0 & \text{otherwise} \end{cases} \quad (3) $$

Apart from TTC, other metrics like headway [50] and distance to other vehicles [22] can also be taken into account when estimating the current risk level. It is important to highlight that, as indicated by [53], TTC is considered more suitable than the other utilized risk heuristics. This preference arises from the fact that a small TTC value is distinctive to risky situations demanding an immediate response, whereas a small headway or distance to other vehicles does not necessarily signify a critical scenario.

This work recommends a formulation for a safety reward that includes both a sparse penalty for collisions and a continuous dense term that penalizes risky driving behavior and promotes safer driving on the road, based on metrics such as TTC or headway. The collision penalty can be dynamically assigned based on different collision severity levels and actor types.

Finally, recently introduced sophisticated mathematical decision models, such as NVIDIA's Safety Force Field (SFF) [54] and Responsibility-Sensitive Safety (RSS) [55], could play a pivotal role in developing a transparent and interpretable definition for safety. SFF aims to make decisions that minimize the intersection between the autonomous vehicle's and other road users' claimed sets [54], thereby averting critical scenarios. The claimed set represents the union of trajectories computed via a safety procedure proposed in [54]. On the other hand, RSS functions as an additional layer on top of the autonomous vehicle planning module, ensuring the avoidance of critical situations in longitudinal and lateral directions, as well as during yielding, by calculating the worst-case TTC for these maneuvers. To the best of our knowledge, none of the state-of-the-art RL approaches incorporate the decision models mentioned earlier. The sole occurrence of SFF usage is proposed in [56], where a neural network is trained to predict the claimed sets of actors in the Carla simulation [57].

B. Progress

The progress objective motivates the RL agent to advance toward a predetermined goal from its current location; this objective is also referred to as efficiency. There is no universally standardized definition of progress within the context of reward functions and autonomous driving tasks. A basic formulation of this objective involves receiving a delayed reward upon reaching the goal [35], [40], [41], [58]. The reward function discussed in Paleja et al. [51] additionally rewards attaining 40% and 60% of total progress. Additionally, other works simultaneously apply a penalty for each time step elapsed since the beginning of the episode [27], [47] or penalize a speed of zero [21], [34], incentivizing the agent to complete the task as fast as possible.

In order to provide a dense reward, progress is often proxied by the distance traveled [30], [35], [48], [49], [52] or the velocity [21], [28], [34], [36]–[38], [43], [51] in the current time step. Another approach to modeling progress involves defining a specific target velocity or distance to be covered within a single time step. Deviations from these targets are then penalized, with larger deviations incurring higher penalties. However, existing approaches often use a fixed desired velocity, typically based on the road speed limit [29], [42], [46], [47], [58], without considering other factors, such as traffic density and weather conditions, that may affect the appropriate velocity. A more suitable formulation could involve dynamically calculating the target velocity based on all relevant factors, though this might pose challenges in real-world applications. Moreover, as an alternative formulation of progress, rewarding acceleration is put forth in [30], [44].

One major problem with the dense formulations of progress mentioned earlier is that the agent can move in the opposite direction of the goal and still receive a progress reward.
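The pitfall just described, where a speed- or displacement-based progress term pays out regardless of the direction of travel, can be made concrete with a toy comparison. The goal position and reward scale below are assumptions for illustration only.

```python
import math

GOAL = (100.0, 0.0)  # illustrative goal position in the plane

def progress_reward_speed(speed):
    # dense progress proxy via current speed: blind to direction of travel
    return 0.1 * speed

def progress_reward_to_goal(prev_pos, pos):
    # reward the *reduction* in distance to the goal instead
    d_prev = math.dist(prev_pos, GOAL)
    d_now = math.dist(pos, GOAL)
    return 0.1 * (d_prev - d_now)

# The agent drives 5 m *away* from the goal at 5 m/s:
print(progress_reward_speed(5.0))                        # -> 0.5 (still rewarded)
print(progress_reward_to_goal((10.0, 0.0), (5.0, 0.0)))  # -> -0.5 (penalized)
```

Note that the distance-to-goal variant only fixes the direction problem; as discussed next, a reward based on the road distance traveled along the planned route would replace the straight-line distance used here.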
Two papers [28], [44] determine the reward by calculating the distance to the goal [44] or the Euclidean distance to the goal [28], which seems to solve the problem at first glance. However, a reward based on the road distance traveled on the route to the destination is a more appropriate approach, as suggested in [20], since the path to the agent's destination is not necessarily a straight line and is defined by the road topology and traffic rules.

Finally, overtaking is rewarded in [35], [42], [46] to encourage progress in situations where overtaking a slow vehicle in front could lead to faster progress at the expense of a minimal safety risk.

Although the reviewed formulations of progress can seem reasonable in hindsight, they can lead to conflicts and accidents. As illustrated by Knox et al. [24], such progress formulations are ineffective even in a simple situation where a static obstacle blocks the agent's path. Unlike a human driver in this situation, the agent might choose to crash into the obstacle rather than remain idle, since the cumulative progress penalty over an extended waiting period can be greater than the collision penalty. This irrational decision is a consequence of the flawed formulation and aggregation of the individual objectives and is discussed further in section IV-A.

In conclusion, a significant challenge with the progress objective lies in its overlap with safety. The moment a vehicle departs from a stationary position, the associated risk relatively increases. However, to reach the predefined goal, movement is essential and, therefore, must be incentivized by the reward function.

C. Comfort

The degree to which passengers find their ride comfortable and agreeable is a determining factor in the overall success and widespread adoption of autonomous vehicles, in addition to their technical capabilities such as the safety and progress objectives. The comfort objective has no established standard applicable to the reward function of autonomous driving. The prevailing industry standard, as outlined in [59], defines comfort by evaluating vibrations affecting the spine of passengers. This comfort standard is predominantly passenger-focused, influenced by variables like exposure duration, age, and height of the passenger [59].

Applying such a standard to autonomous driving proves impractical, given scenarios where vehicles operate without passengers or experience frequent changes in passengers. This dynamic change in passengers makes it difficult to assess comfort levels during a journey, as conventional methods such as user studies are not scalable. One of the most appropriate and complete definitions of comfort in the context of autonomous driving, to the best of our knowledge, is illustrated in [60]. The right-of-way, acceleration, and the derivative of acceleration (jerk) in both longitudinal and lateral directions are taken into account in the suggested formulation. However, it disregards the rate of change of the steering angle and lateral acceleration in its formulation.

It should be emphasized that while some aspects of the previously mentioned formulation are considered in the reviewed literature for this work, no research offered a complete coverage of the formulation. Numerous methods impose penalties on high acceleration and deceleration [22], [33], [36], [45], [49], [52], while others focus on penalizing the rate of change of acceleration (jerk) [22], [47], [50]. When assessing comfort, only one paper takes headway into account [50], and two of the examined publications introduce penalties for hard braking [41], [51], potentially conflicting with the safety objective of the reward function.

One aspect that is not among those previously defined in [60] is steering smoothness, which is used as an evaluation measure for comfort [21], [23], [36], [43]. In detail, steering smoothness is achieved by penalizing high angles of steering [21], [36] or counter steering [23], [43].

Across several publications, we observed a consistent pattern of leaving out the comfort attribute entirely from the reward function design [27], [28], [30], [34], [35], [40], [44], [46], [58], which has a negative impact on the agent's learned policy.

D. Traffic rules conformance

The goal of the traffic rules conformance objective is to motivate the agent to comply with various traffic regulations. Detaching traffic regulations from the concept of safety allows for a more contextualized approach, since the laws may change depending on the situation. The majority of the reviewed publications primarily address the basic applications of traffic regulations. The specific rules currently implemented include rewarding adherence to staying in the lane [21], [23], [28], [34], [38], imposing penalties for surpassing the speed limit [21], [36], [37], [52], and for undercutting the required minimum headway [50]. Additionally, in specific scenarios, regulations involve rewarding compliance with the right-of-way [33] and remaining in the correct lane while turning [33].

A prevalent limitation found in the current literature is the absence of a mechanism for simultaneous compliance with multiple traffic laws and the use of rule relaxations. For example, none of the publications strictly enforce speed limits; rather, they penalize speeding based on the degree of deviation from the speed limit.

IV. GENERAL LIMITATIONS

This section outlines the main limitations of current reward functions, highlighting areas that could be addressed in future research to improve autonomous driving models. Unlike the previous section, the focus here is on the overall structure of the reward function rather than specific reward terms. In particular, the focus is on comprehending how the different objectives are combined to shape the overall reward function and how the reward's lack of adaptability to different driving contexts restricts its generalizability.

A. Aggregation of attributes

1) Summation: A major limitation is the aggregation of attributes. Most of the reviewed publications use summation to combine the different reward terms to obtain the final reward [21], [22], [27], [35], [39]–[41], [43], [45], [47], [51], [52], [61].
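The static-obstacle failure case described by Knox et al. [24] can be reproduced with a toy summation of reward terms; the penalty values below are illustrative assumptions chosen only to make the effect visible.

```python
# Sketch: plain summation of reward terms (cf. equation (4)) can make
# crashing "cheaper" than waiting. All numbers are illustrative assumptions.
COLLISION_PENALTY = -50.0   # one-off penalty for hitting the obstacle
IDLE_PENALTY = -0.5         # per-step penalty for zero progress

def episode_return_wait(steps):
    # the agent waits behind a static obstacle, paying the idle penalty each step
    return IDLE_PENALTY * steps

def episode_return_crash():
    # the agent crashes immediately: a single collision penalty
    return COLLISION_PENALTY

# Beyond 100 idle steps, crashing yields the higher summed return:
print(episode_return_wait(150))   # -> -75.0
print(episode_return_crash())     # -> -50.0
```

Under these illustrative numbers, a reward-maximizing agent blocked for more than 100 steps prefers the collision, which is precisely the irrational behavior the summation structure permits.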
Nevertheless, this simple formulation does not encode any priority or distinction between different objectives and fails to handle conflicts between them. As depicted in equation (4), safety is assigned the same weight as progress.

$$ r = r_{safety} + r_{progress} + r_{comfort} + r_{rules} \quad (4) $$

2) Weighted summation: The literature has explored the incorporation of a mechanism to assign priorities in the final reward calculation, considering the different levels of objective significance within autonomous driving.

Several works utilize a switching mechanism to condition the RL agent on particular features of the current environment state and assign different final reward values accordingly [34], [37], [38], [48], [58]. While the approach appears effective in simplified scenarios, its scalability to realistic settings is hindered by the growing complexity of conditions in a real-world environment. Additionally, this design relies on a substantial amount of expert knowledge. These limitations motivate a need for more adaptive and scalable approaches that can handle the conflicts and trade-offs between objectives.

More sophisticated methodologies have been introduced, incorporating the use of weights to assign varying degrees of importance to different reward terms. This approach involves assigning a weight per attribute, as detailed in equation (5). The introduced weights can be manually tuned, as demonstrated in the reviewed papers [28]–[30], [36], [42], [44], [46], [49], [50]. Additionally, Inverse Reinforcement Learning (IRL) approaches are proposed to learn, from recorded expert driving, either the complete reward value at a given state [62], [63], the individual objectives' reward contributions [64], or the weights per attribute [65], [66].

$$ r = w_{safety} \times r_{safety} + w_{progress} \times r_{progress} + w_{comfort} \times r_{comfort} + w_{rules} \times r_{rules} \quad (5) $$

Manually fine-tuning weights in the complex setting of autonomous driving poses challenges, given the lack of intuitive guidelines on determining the appropriate balance between its diverse objectives. Conversely, the application of IRL demands substantial computational resources and a diverse dataset to ensure effective generalization, as noted in previous research [33]. Additionally, a notable drawback of weight-based approaches lies in their lack of adaptability, since the optimal weights may vary based on context or the type of maneuver.

3) Lexicographic ordering: Using a lexicographic order is one way to eliminate the need for explicit weighting procedures and address some of the associated disadvantages. In a lexicographic ordering approach, objectives are strictly ordered based on their importance, and the decision is made by considering each objective sequentially, without assigning specific weights, as demonstrated in [33]. The paper introduces a lexicographic deep Q-network that incorporates four prioritized goals, namely correct lane changes, safety aspects, traffic rule compliance, and comfort, each associated with assigned thresholds [33].

Although this approach mitigates the challenges associated with traditional weighting processes, it introduces new ones. The method relies on a strict ordering of objectives to calculate the reward and is unable to manage formulations where multiple objectives hold the same level of importance. Additionally, the introduction of a threshold for each objective requires manual tuning. Section V-A introduces Rulebooks as an alternative threshold-free ordering approach with better trade-off handling.

B. Use Case Specific Reward Design with Lack of Context Awareness

The driving context plays a pivotal role in establishing criteria for desired behavior in autonomous vehicles. This context significantly shapes the design of the reward function, which aims to mathematically represent this behavior [24]. Typically, the reward function is customized to excel in the specific driving context, incorporating terms specific to the use case in focus.

While urban driving is the most prevalent use case [21], [28], [33], [36], [39], [43]–[45], other scenarios such as merging procedures [22], [41], [42], [49], lane change maneuvers [47], [48], or even racing track simulations [27], [34] are occasionally studied. A drawback of use-case-specific reward functions is their limited generalization to diverse or unexpected scenarios and their limited versatility in a different driving context or use case.

On the other hand, relying on a single, all-encompassing reward function might prove insufficient in a dynamic domain such as autonomous driving [24]. A universal and versatile reward function for autonomous driving can be established using a hybrid strategy. This entails combining use-case-specific terms with more broadly applicable objectives while maintaining awareness of the vehicle's driving context. However, none of the reviewed papers made modifications to the reward function to incorporate context awareness.

A limitation of the suggested reward formulation is that specific terms are tailored for each use case, necessitating a transition mechanism between different use cases during driving. As of now, no such mechanism has been developed in the field of autonomous driving. An option for handling these transitions is the use of Reward machines, which can be viewed as an extension or specialization of finite state machines in the context of RL. Further details about Reward machines are explored in section V-B.

C. Economic aspects disregarded

The examined reward functions have predominantly emphasized the objectives outlined in section III, yet consistently overlook economic aspects. Economic aspects in driving encompass factors such as fuel efficiency and cost optimization. While safety and regulations typically take precedence in social and ethical priorities, economic aspects should receive additional attention due to their significant financial and environmental impact.
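The Reward-machine idea mentioned in section IV-B, a finite state machine that switches the active reward terms when the driving context changes, can be sketched as follows. The contexts, triggering events, and per-context reward terms are illustrative assumptions rather than a published design.

```python
# Sketch of a reward-machine-style context switcher: a finite state machine
# whose states are driving contexts and whose transitions are triggered by
# environment events. Contexts, events, and reward terms are illustrative.
TRANSITIONS = {
    ("highway", "exit_taken"): "urban",
    ("urban", "on_ramp_entered"): "merging",
    ("merging", "merge_completed"): "highway",
}

# per-context reward terms (hypothetical target speeds and gap, for illustration)
REWARD_TERMS = {
    "highway": lambda s: -0.1 * abs(s["speed"] - 120.0),
    "urban":   lambda s: -0.1 * abs(s["speed"] - 50.0),
    "merging": lambda s: -1.0 * max(0.0, 2.0 - s["gap"]),
}

def step(context, event, state):
    # switch context if the event triggers a transition, then score the state
    context = TRANSITIONS.get((context, event), context)
    return context, REWARD_TERMS[context](state)

ctx, r = step("highway", "exit_taken", {"speed": 55.0, "gap": 3.0})
print(ctx, r)   # -> urban -0.5
```

The same vehicle state is scored differently once the machine transitions from "highway" to "urban", which is exactly the context sensitivity the reviewed single-formula rewards lack.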
V. PROPOSALS FOR FUTURE WORK

In this section, potential solutions for the identified limitations in current reward functions are explored, with the intention of advancing the autonomous driving reward design process. The suggested approaches center on alternative ways to aggregate individual objectives through Rulebooks and on incorporating driving context into reward functions. Additionally, the section examines the absence of frameworks for validating and evaluating reward functions.

A. Rulebooks

When examining the primary issues with existing reward functions, attention is drawn to the challenges of weighting reward terms for different objectives. As an alternative to the current reward function formulation, the Rulebook emerges as a distinctive approach.

A Rulebook can be defined as a tuple 〈R, ≤〉 consisting of a set of rules R and a pre-defined order ≤ amongst those rules to account for different priorities [67]. This tuple can be represented as a directed graph. The primary benefit lies in the elimination of manual weight assignment, replaced by the establishment of rule priorities within the Rulebook. Furthermore, the rule priority can be acquired either entirely or partially from data, as highlighted in the work by Censi et al. [67].

A driving action or trajectory can be evaluated based on its violation of the Rulebook rules, given their priority [67]. Subsequent studies apply an iterative technique to search for an action with minimum rule violations [68]. Accordingly, a Rulebook can serve as a foundational framework for developing a more robust reward function that can handle complex situations in which certain rules must be disregarded or conflict with other rules. Rulebooks have been applied successfully in the planning modules of autonomous driving systems, showing effectiveness even when defined with only a few critical rules [67], [69].

B. Context and Reward Machines

One drawback of reward functions is their lack of context awareness, as outlined in Section IV-B. Relying solely on a single reward function without considering context may prove inadequate. To address this issue and improve generalizability, an effective approach is to incorporate context as input and enable learning across various contexts [24].

The utilization of reward machines presents a viable approach for encoding driving context. A reward machine is an extension of finite state machines to RL. Distinguished by their expressive capability, reward machines excel at hierarchically decomposing intricate tasks into discrete subtasks [70]. Each subtask is associated with a unique reward, and a transition mechanism governs the transitions between these subtasks. This formulation enhances the adaptability of reward machines to changes in environmental conditions over time [71].

In the context of autonomous driving, a subtask can be a different driving context or maneuver type, with the requirement of transition mechanisms to switch between these subtasks. Despite the drawback of relying on transition mechanisms and the potential for over-engineering when decomposing driving tasks into specific contexts or maneuvers, reward machines serve as a starting point in the development of context-aware rewards.

C. A Framework for Reward Function Validation

Given the significant limitations and possible ill-definitions of reward function design identified in this review, we contend that the establishment of an automatic framework for the validation of reward functions is imperative. Such a framework is crucial for guaranteeing the safety and reliability of reinforcement learning agents developed for autonomous driving scenarios.

Following an extensive examination of the existing literature, no instances of a comprehensive framework designed for the validation of reward functions were identified, with the exception of a manual approach discussed in [24]. This approach introduces eight sanity checks intended to reveal potential issues within a given reward function. However, a notable challenge associated with this method is the non-trivial evaluation of these sanity checks. For example, one sanity check involves assessing whether the reward function unintentionally encourages undesired behavior [24], a task that proves impractical to verify manually within complex domains like autonomous driving.

In recent publications, several works have addressed automatic critical scenario generation. These studies focus on either generating or modifying scenarios to produce adversarial examples, intending to assess the agent's behavior in critical situations that can lead to accidents [72]–[75]. From the perspective of this work, critical scenario generation approaches can serve as an initial step toward automatic validation frameworks for reward functions in autonomous driving.

VI. CONCLUSION

In this study, an examination of reward functions applied to state-of-the-art autonomous driving RL models was conducted. The review process included breaking down the reward functions into individual terms and assigning them to a predefined set of categories. This approach enables a comprehensive analysis both in a broad sense and in terms of specific categories. At the category level, a prominent challenge observed was the absence of concrete industry standards for these categories, resulting in diverse and often ill-defined formulations.

Furthermore, general limitations of reward function design have been discussed. These include the simplistic aggregation of conflicting attributes, the lack of awareness of the driving context, and limited generalization due to use-case-specific formulations.

Several directions for enhancing reward functions are discussed as future work. The combination of objectives can be addressed through Rulebooks, and the encoding of context can be enhanced through reward machines. Future research
recommendations also address the absence of a framework for the validation of reward functions.

ACKNOWLEDGMENT

The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Climate Action within the project “KI Wissen” (project number: 19A20020L). The authors would like to thank the consortium for the successful cooperation.

REFERENCES

[1] A. S. Mueller, J. B. Cicchino, and D. S. Zuby, “What humanlike errors do autonomous vehicles need to avoid to maximize safety?” Journal of Safety Research, vol. 75, pp. 310–318, 2020.
[2] M. Cunningham and M. A. Regan, “Autonomous vehicles: human factors issues and future research,” in Proceedings of the 2015 Australasian Road Safety Conference, vol. 14, 2015.
[3] S. Singh, “Critical reasons for crashes investigated in the national motor vehicle crash causation survey,” Tech. Rep., 2015.
[4] World Health Organization, Global Status Report on Road Safety 2018. Genève, Switzerland: World Health Organization, Jan. 2019.
[5] S. E. Shladover, “Cooperative (rather than autonomous) vehicle-highway automation systems,” IEEE Intelligent Transportation Systems Magazine, vol. 1, no. 1, pp. 10–19, 2009.
[6] L. Liu, S. Lu, R. Zhong, B. Wu, Y. Yao, Q. Zhang, and W. Shi, “Computing systems for autonomous driving: State of the art and challenges,” IEEE Internet of Things Journal, vol. 8, no. 8, pp. 6469–6486, 2020.
[7] D. Coelho and M. Oliveira, “A review of end-to-end autonomous driving in urban environments,” IEEE Access, vol. 10, pp. 75296–75311, 2022.
[8] R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction,” Robotica, vol. 17, no. 2, pp. 229–235, 1999.
[9] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, J. Pineau et al., “An introduction to deep reinforcement learning,” Foundations and Trends® in Machine Learning, vol. 11, no. 3-4, pp. 219–354, 2018.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[11] A. Sadat, M. Ren, A. Pokrovsky, Y.-C. Lin, E. Yumer, and R. Urtasun, “Jointly learnable behavior and trajectory planning for self-driving vehicles,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 3949–3956.
[12] C.-J. Hoel, K. Wolff, and L. Laine, “Automated speed and lane change decision making using deep reinforcement learning,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2148–2155.
[13] A. R. Fayjie, S. Hossain, D. Oualid, and D.-J. Lee, “Driverless car: Autonomous driving using deep reinforcement learning in urban environment,” in 2018 15th International Conference on Ubiquitous Robots (UR). IEEE, 2018, pp. 896–901.
[14] Z. Cao, D. Yang, S. Xu, H. Peng, B. Li, S. Feng, and D. Zhao, “Highway exiting planner for automated vehicles using reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, 2021.
[15] H. Wang, S. Yuan, M. Guo, C.-Y. Chan, X. Li, and W. Lan, “Tactical driving decisions of unmanned ground vehicles in complex highway environments: A deep reinforcement learning approach,” Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering, vol. 235, no. 4, pp. 1113–1127, 2021.
[16] M. Zhu, X. Wang, and Y. Wang, “Human-like autonomous car-following model with deep reinforcement learning,” Transportation Research Part C: Emerging Technologies, vol. 97, pp. 348–368, 2018.
[17] Á. Fehér, S. Aradi, F. Hegedüs, T. Bécsi, and P. Gáspár, “Hybrid DDPG approach for vehicle motion planning,” in International Conference on Informatics in Control, Automation and Robotics (ICINCO), 2019.
[18] C. Paxton, V. Raman, G. D. Hager, and M. Kobilarov, “Combining neural networks and tree search for task and motion planning in challenging environments,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 6059–6066.
[19] J. Coad, Z. Qiao, and J. M. Dolan, “Safe trajectory planning using reinforcement learning for self driving,” arXiv preprint arXiv:2011.04702, 2020.
[20] M. Moghadam, A. Alizadeh, E. Tekin, and G. H. Elkaim, “An end-to-end deep reinforcement learning approach for the long-term short-term planning on the Frenet space,” arXiv preprint arXiv:2011.13098, 2020.
[21] J. Chen, B. Yuan, and M. Tomizuka, “Model-free deep reinforcement learning for urban autonomous driving,” in 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE, 2019, pp. 2765–2771.
[22] P. Wang, H. Li, and C.-Y. Chan, “Quadratic Q-network for learning continuous control for autonomous vehicles,” arXiv preprint arXiv:1912.00074, 2019.
[23] P. Wolf, C. Hubschneider, M. Weber, A. Bauer, J. Härtl, F. Dürr, and J. M. Zöllner, “Learning how to drive in a real world simulation with deep Q-networks,” in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017, pp. 244–250.
[24] W. B. Knox, A. Allievi, H. Banzhaf, F. Schmitt, and P. Stone, “Reward (mis)design for autonomous driving,” Artificial Intelligence, vol. 316, p. 103829, 2023.
[25] J. Eschmann, “Reward function design in reinforcement learning,” Reinforcement Learning Algorithms: Analysis and Applications, pp. 25–33, 2021.
[26] R. T. Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, “Reward machines: Exploiting reward function structure in reinforcement learning,” Journal of Artificial Intelligence Research, vol. 73, pp. 173–208, 2022.
[27] L.-C. Wu, Z. Zhang, S. Haesaert, Z. Ma, and Z. Sun, “Risk-aware reward shaping of reinforcement learning agents for autonomous driving,” arXiv preprint arXiv:2306.03220, 2023.
[28] A. Sharif and D. Marijan, “Evaluating the robustness of deep reinforcement learning for autonomous policies in a multi-agent urban driving environment,” in 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS). IEEE, 2022, pp. 785–796.
[29] J. Wu, Z. Huang, and C. Lv, “Uncertainty-aware model-based reinforcement learning: Methodology and application in autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 194–203, 2022.
[30] P. Palanisamy, “Multi-agent connected autonomous driving using deep reinforcement learning,” in 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020, pp. 1–7.
[31] ISO Central Secretary, “Road vehicles - functional safety - part 1: Vocabulary,” International Organization for Standardization, Geneva, CH, Standard ISO/PAS 21448:2019, 2018. [Online]. Available: https://www.iso.org/standard/62711.html
[32] ——, “Road vehicles - safety of the intended functionality (SOTIF),” International Organization for Standardization, Geneva, CH, Standard ISO/PAS 21448:2019, 2019. [Online]. Available: https://www.iso.org/standard/62711.html
[33] C. Li and K. Czarnecki, “Urban driving with multi-objective deep reinforcement learning,” arXiv preprint arXiv:1811.08586, 2018.
[34] A. Yu, R. Palefsky-Smith, and R. Bedi, “Deep reinforcement learning for simulated autonomous vehicle control,” Course Project Reports: Winter, vol. 2016, 2016.
[35] L. Wang, J. Liu, H. Shao, W. Wang, R. Chen, Y. Liu, and S. L. Waslander, “Efficient reinforcement learning for autonomous driving with parameterized skills and priors,” arXiv preprint arXiv:2305.04412, 2023.
[36] J. Chen, S. E. Li, and M. Tomizuka, “Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 5068–5078, 2021.
[37] L. Anzalone, P. Barra, S. Barra, A. Castiglione, and M. Nappi, “An end-to-end curriculum learning approach for autonomous driving scenarios,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp. 19817–19826, 2022.
[38] X. Pan, Y. You, Z. Wang, and C. Lu, “Virtual to real reinforcement learning for autonomous driving,” arXiv preprint arXiv:1704.03952, 2017.
[39] A. R. Fayjie, S. Hossain, D. Oualid, and D.-J. Lee, “Driverless car: Autonomous driving using deep reinforcement learning in urban environment,” in 2018 15th International Conference on Ubiquitous Robots (UR). IEEE, 2018, pp. 896–901.
[40] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura, “Navigating occluded intersections with autonomous vehicles using deep reinforcement learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 2034–2039.
[41] Y. Hu, A. Nakhaei, M. Tomizuka, and K. Fujimura, “Interaction-aware decision making with adaptive strategies under merging scenarios,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 151–158.
[42] M. Bouton, A. Nakhaei, D. Isele, K. Fujimura, and M. J. Kochenderfer, “Reinforcement learning with iterative reasoning for merging in dense traffic,” in 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2020, pp. 1–6.
[43] X. Liang, T. Wang, L. Yang, and E. Xing, “CIRL: Controllable imitative reinforcement learning for vision-based self-driving,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 584–599.
[44] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Conference on Robot Learning. PMLR, 2017, pp. 1–16.
[45] P. Cai, Y. Luo, A. Saxena, D. Hsu, and W. S. Lee, “LeTS-Drive: Driving in a crowd by learning from tree search,” arXiv preprint arXiv:1905.12197, 2019.
[46] A. Karalakou, D. Troullinos, G. Chalkiadakis, and M. Papageorgiou, “Deep reinforcement learning reward function design for autonomous driving in lane-free traffic,” Systems, vol. 11, no. 3, p. 134, 2023.
[47] F. Ye, X. Cheng, P. Wang, C.-Y. Chan, and J. Zhang, “Automated lane change strategy using proximal policy optimization-based deep reinforcement learning,” in 2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2020, pp. 1746–1752.
[48] C.-J. Hoel, K. Wolff, and L. Laine, “Automated speed and lane change decision making using deep reinforcement learning,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2148–2155.
[49] P. Wang and C.-Y. Chan, “Formulation of deep reinforcement learning architecture toward autonomous driving for on-ramp merge,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 1–6.
[50] M. Zhu, Y. Wang, Z. Pu, J. Hu, X. Wang, and R. Ke, “Safe, efficient, and comfortable velocity control based on reinforcement learning for autonomous driving,” Transportation Research Part C: Emerging Technologies, vol. 117, p. 102662, 2020.
[51] R. Paleja, L. Chen, Y. Niu, A. Silva, Z. Li, S. Zhang, C. Ritchie, S. Choi, K. C. Chang, H. E. Tseng et al., “Interpretable reinforcement learning for robotics and continuous control,” arXiv preprint arXiv:2311.10041, 2023.
[52] Z. Zhang, “Autonomous car driving based on deep reinforcement learning,” in 2022 International Conference on Economics, Smart Finance and Contemporary Trade (ESFCT 2022). Atlantis Press, 2022, pp. 835–842.
[53] K. Vogel, “A comparison of headway and time to collision as safety indicators,” Accident Analysis & Prevention, vol. 35, no. 3, pp. 427–433, 2003.
[54] D. Nistér, H.-L. Lee, J. Ng, and Y. Wang, “The safety force field,” NVIDIA White Paper, 2019.
[55] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “On a formal model of safe and scalable self-driving cars,” arXiv preprint arXiv:1708.06374, 2017.
[56] H. Suk, T. Kim, H. Park, P. Yadav, J. Lee, and S. Kim, “Rationale-aware autonomous driving policy utilizing safety force field implemented on CARLA simulator,” arXiv preprint arXiv:2211.10237, 2022.
[57] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Conference on Robot Learning. PMLR, 2017, pp. 1–16.
[58] C.-J. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J. Kochenderfer, “Combining planning and deep reinforcement learning in tactical decision making for autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 5, no. 2, pp. 294–305, 2019.
[59] ISO Central Secretary, “Mechanical vibration and shock — evaluation of human exposure to whole-body vibration — part 1: General requirements,” International Organization for Standardization, Geneva, CH, Standard ISO 2631-1:1997, 1997. [Online]. Available: https://www.iso.org/standard/62711.html
[60] H. Bellem, B. Thiel, M. Schrauf, and J. F. Krems, “Comfort in automated driving: An analysis of preferences for different automated driving styles and their dependence on personality traits,” Transportation Research Part F: Traffic Psychology and Behaviour, vol. 55, pp. 90–100, 2018.
[61] Z. Cao, D. Yang, S. Xu, H. Peng, B. Li, S. Feng, and D. Zhao, “Highway exiting planner for automated vehicles using reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 2, pp. 990–1000, 2020.
[62] Z. Zhao, Z. Wang, K. Han, R. Gupta, P. Tiwari, G. Wu, and M. J. Barth, “Personalized car following for autonomous driving with inverse reinforcement learning,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 2891–2897.
[63] X. Wen, S. Jian, and D. He, “Modeling the effects of autonomous vehicles on human driver car-following behaviors using inverse reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, 2023.
[64] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the Twenty-First International Conference on Machine Learning, 2004, p. 1.
[65] Z. Wu, L. Sun, W. Zhan, C. Yang, and M. Tomizuka, “Efficient sampling-based maximum entropy inverse reinforcement learning with application to autonomous driving,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5355–5362, 2020.
[66] S. Rosbach, V. James, S. Großjohann, S. Homoceanu, and S. Roth, “Driving with style: Inverse reinforcement learning in general-purpose planning for automated driving,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 2658–2665.
[67] A. Censi, K. Slutsky, T. Wongpiromsarn, D. Yershov, S. Pendleton, J. Fu, and E. Frazzoli, “Liability, ethics, and culture-aware behavior specification using rulebooks,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8536–8542.
[68] W. Xiao, N. Mehdipour, A. Collin, A. Y. Bin-Nun, E. Frazzoli, R. D. Tebbens, and C. Belta, “Rule-based optimal control for autonomous driving,” in Proceedings of the ACM/IEEE 12th International Conference on Cyber-Physical Systems, 2021, pp. 143–154.
[69] B. Helou, A. Dusi, A. Collin, N. Mehdipour, Z. Chen, C. Lizarazo, C. Belta, T. Wongpiromsarn, R. D. Tebbens, and O. Beijbom, “The reasonable crowd: Towards evidence-based and interpretable models of driving behavior,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 6708–6715.
[70] R. T. Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, “Reward machines: Exploiting reward function structure in reinforcement learning,” Journal of Artificial Intelligence Research, vol. 73, pp. 173–208, 2022.
[71] A. Camacho, R. T. Icarte, T. Q. Klassen, R. A. Valenzano, and S. A. McIlraith, “LTL and beyond: Formal languages for reward function specification in reinforcement learning,” in IJCAI, vol. 19, 2019, pp. 6065–6073.
[72] A. Wachi, “Failure-scenario maker for rule-based agent using multi-agent adversarial reinforcement learning and its application to autonomous driving,” arXiv preprint arXiv:1903.10654, 2019.
[73] D. Rempe, J. Philion, L. J. Guibas, S. Fidler, and O. Litany, “Generating useful accident-prone driving scenarios via a learned traffic prior,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17305–17315.
[74] N. Hanselmann, K. Renz, K. Chitta, A. Bhattacharyya, and A. Geiger, “KING: Generating safety-critical driving scenarios for robust imitation via kinematics gradients,” in European Conference on Computer Vision. Springer, 2022, pp. 335–352.
[75] W. Ding, B. Chen, M. Xu, and D. Zhao, “Learning to collide: An adaptive safety-critical scenarios generating method,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 2243–2250.