Best Practices for Armv8-R Cortex-R52 Whitepaper
Introduction
Vehicle electrical/electronic (E/E) architectures are evolving towards
the centralization of compute resources. This initially happened in domain
controllers before moving to zonal and centralized approaches.
The new system hardware and software must simultaneously meet the
requirements of every individual workload hosted on a device. These include:
— For workloads derived from pre-existing (legacy) applications, a desirable
option is to integrate the workload with minimal adaptation. When integrating
workloads that were designed for standalone system hardware, software
must protect against any behaviours with side effects that affect other
applications in the system.
— For workloads which relate to regulated applications, it may be necessary
to obtain certification (for example, in the case of On-Board Diagnostics
(OBD)-relevant applications). To avoid the need for re-certification every
time another workload changes, it is important to demonstrate that
the other workloads do not interfere with the certified workload.
A system that provides this level of isolation between workloads also brings
the advantage of allowing each workload to be developed (and debugged)
in isolation from other workloads. This is especially important if the workloads
are coming from different suppliers.
System hardware and software provide isolation mechanisms that can be used
to meet these requirements; the key mechanisms are described in the sections
that follow.
Cortex-R Outline
Arm has a portfolio of processors that are designed to address a wide range
of computing, from the smallest, lowest-power microcontrollers to ultra-high
performance server-class computing. The Cortex-R processors have been
developed to enable applications where there are demands for real-time
processing and are applicable to a range of different use cases, not least
in automotive applications, where systems must respond in short and
deterministic timeframes to successfully meet the requirements of the system
deadlines. In many cases, these applications also include functional safety
(and security) requirements that add to the challenges faced by system
integrators and developers. Cortex-R processors, like the Cortex-R52+,
can be used in standalone microcontrollers (MCUs) or as additional cores
in a System on Chip (SoC) design, for example as a safety island.
The first Cortex-R processors, such as Cortex-R5, were built on the Armv7-R
architecture. However, since then the architecture has evolved, with Arm’s
Cortex-R52 and Cortex-R52+ processors implementing the Armv8-R
architecture which helps address the increasing complexity of automotive
real-time software and the transition from discrete dedicated controllers
to those where functions are centralized and combined. The Armv8-R
architecture adds features that enable better control of software within
a single processor, isolating code and providing reproducible and
understandable behaviour, including virtualization in a real-time processor.
The Cortex-R52 and Cortex-R52+ processors are highly configurable and
can be tailored to suit the implementor's application requirements. Some
of this configurability is described in Table 1.
Table 1: Example configuration parameters for Cortex-R52/Cortex-R52+

  Config Parameter     Cortex-R52/Cortex-R52+ Example
  Cores per cluster    Configurable, 1-4
  Stage 1 MPU          8, 16, 20 or 24 regions
  SIMD                 NEON
Together with the new Exception Level comes the addition of a two-stage
Memory Protection Unit (MPU), which enforces the accesses the processor
makes to different resources. The Operating System controls the MPU for
its resources at EL1, but the processors can be implemented with an
additional second MPU stage, which is configurable only from EL2,
where the hypervisor runs.
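To make this concrete, the following minimal sketch shows how EL1 software
might program one stage-1 MPU region through the PRSELR/PRBAR/PRLAR system
registers. The coprocessor encodings and field layouts are assumptions taken
from the Armv8-R AArch32 register descriptions and should be verified against
the Cortex-R52(+) TRM; a complete setup would also program the MAIR
memory-attribute registers and enable the MPU via SCTLR.

/* Minimal sketch: program one EL1 (stage-1) MPU region on a Cortex-R52.
 * Register encodings are assumed from the Armv8-R AArch32 descriptions;
 * verify against the Cortex-R52(+) TRM. Compile as A32. */
#include <stdint.h>

static inline void mpu_set_region(uint32_t region,
                                  uint32_t prbar,  /* base | SH | AP | XN */
                                  uint32_t prlar)  /* limit | AttrIndx | EN */
{
    /* Select which region the PRBAR/PRLAR accesses below refer to. */
    __asm volatile("mcr p15, 0, %0, c6, c2, 1" :: "r"(region)); /* PRSELR */
    __asm volatile("isb");
    /* Region base address plus shareability/access-permission bits. */
    __asm volatile("mcr p15, 0, %0, c6, c3, 0" :: "r"(prbar));  /* PRBAR */
    /* Region limit address, attribute index and enable bit. */
    __asm volatile("mcr p15, 0, %0, c6, c3, 1" :: "r"(prlar));  /* PRLAR */
    __asm volatile("dsb");
    __asm volatile("isb");
}

The same pattern applies at EL2 with the HPRSELR/HPRBAR/HPRLAR registers,
which is what allows a hypervisor to constrain a whole VM independently
of the guest's own EL1 MPU programming.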
Access to resources is managed with software running at the new higher
Exception Level 2. Application tasks can request access to the required
resources through this software, which enforces access with the two-level
Memory Protection Units (MPUs). This approach is not limited to two different
criticality levels, but can support many different contexts with differing
protections. Unlike a Memory Management Unit (MMU), an MPU allows accesses
from the Cortex-R processors to system resources to be managed without
introducing additional, potentially schedule-breaking, delays to search
for and load page tables from memory. Such delays are hard to manage, and
it is difficult to evaluate and guarantee their timely completion.
The Cortex-R52+ provides information outside the core to enable the system
to establish and maintain control of accesses based on the running software.
This is achieved by propagating the virtual Machine ID (VMID) for device
transactions to enable the system to manage access to those resources.
In the case of Cortex-R52+, this is further extended by supporting buffers
and memory transactions and requests made directly from a hypervisor at EL2.

Fig. 1: Armv8-R Exception Levels
These Cortex-R processors integrate their own Generic Interrupt Controller
(GIC) shared by all CPUs within the cluster to deliver low latency interrupts
from the system. This can flexibly assign and prioritise Shared Peripheral
Interrupts (SPI) to any of the cores in the cluster. The GIC supports the ability
to signal both physical and virtual interrupts and can trap interrupt accesses
to EL2 to virtualize interrupts.
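As a simple illustration of direct GIC use, the sketch below sets the
priority of one SPI and enables it through the memory-mapped Distributor.
The register offsets follow the GIC architecture specification, but the
base address is platform specific and shown here as a hypothetical value.

/* Minimal sketch: configure and enable one Shared Peripheral Interrupt
 * (SPI) via the memory-mapped GIC Distributor. */
#include <stdint.h>

#define GICD_BASE       0xAF000000u  /* hypothetical, platform specific */
#define GICD_ISENABLER  ((volatile uint32_t *)(GICD_BASE + 0x100))
#define GICD_IPRIORITYR ((volatile uint8_t  *)(GICD_BASE + 0x400))

static void gic_enable_spi(unsigned int intid, uint8_t priority)
{
    GICD_IPRIORITYR[intid] = priority;               /* lower = more urgent */
    GICD_ISENABLER[intid / 32] = 1u << (intid % 32); /* write-1-to-enable */
}

In a virtualized system, guest accesses like these are exactly what the
hypervisor traps or para-virtualizes, as described in the following sections.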
Software Integration Mechanisms
As the amount of software in a vehicle increases, progressively more
applications are being integrated onto one microcontroller. This can be seen
particularly in domain/zonal controllers that provide the bridge between the
very powerful central vehicle computers (typically using Arm Cortex-A cores
and running a POSIX based operating system and Adaptive AUTOSAR), and
the simpler ECUs on the mechatronic rim, which typically use Arm Cortex-R
and Cortex-M cores.
To a greater or lesser extent, a hypervisor creates the illusion to the guest
software running inside a VM that it is running on its own microcontroller
and not sharing the microcontroller devices with other guest software
in other VMs.
Note that the Armv8-R Cortex-R processors (and similar devices) do not provide
Memory Management Units (MMUs). A hypervisor running on a Cortex-A
device can use its MMU to present each VM with a completely separate virtual
address space. For example, the guest software running inside each VM
can be linked to run at the same address and use the same range of memory
addresses for data. The Memory Protection Unit (MPU) provided by the
Cortex-R52+ allows a hypervisor to protect one VM’s memory from another
VM but does not allow each VM to have a separate virtual address space.
One physical processor core can host multiple virtual cores by context
switching between the virtual cores in the same way that operating systems
context switch between processes. A virtual core’s context is the values
of the general-purpose registers, floating-point registers, some system
configuration registers and the configuration of the EL1 MPU.
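A minimal sketch of such a virtual-core context is shown below. The exact
register list is an assumption for illustration; the point is that the EL1
MPU configuration and the VMID are part of the per-virtual-core state,
alongside the general-purpose and floating-point registers.

/* Minimal sketch of the state saved/restored when switching virtual
 * cores. The register selection is illustrative, not exhaustive. */
#include <stdint.h>

#define NUM_EL1_MPU_REGIONS 16          /* configurable: 8, 16, 20 or 24 */

struct el1_mpu_region {
    uint32_t prbar;                     /* base, shareability, permissions */
    uint32_t prlar;                     /* limit, attribute index, enable */
};

struct vcpu_context {
    uint32_t gpr[13];                   /* r0-r12 */
    uint32_t sp, lr, spsr;              /* banked per guest mode in practice */
    uint32_t elr_hyp;                   /* guest return address for VM entry */
    uint64_t fpr[16];                   /* floating-point/NEON registers */
    uint32_t fpscr;
    uint32_t sctlr, vbar, mair0, mair1; /* selected EL1 system registers */
    uint32_t vmid;                      /* loaded into VSCTLR.VMID on switch */
    struct el1_mpu_region mpu[NUM_EL1_MPU_REGIONS];
};

On a switch, the hypervisor (running at EL2) saves the outgoing virtual
core's context, loads the incoming one, writes the new VMID, and performs
an exception return to EL1.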
Where legacy software is being run inside a VM, we want the VM to look
as much as possible like a real microcontroller to avoid the need to change
the legacy software other than by re-linking so that the guest software
running in each VM uses separate memory.
Using SMPUs and Peripheral Protection Mechanisms
Microcontrollers that contain Cortex-R cores will normally include a
system-level memory protection unit (SMPU). The primary role of an SMPU
is to control which bus managers (e.g., DMA controllers) can access which
memory addresses. Cortex-R processor cores and other microcontroller
components, such as Cortex-M cores and some peripherals, can be bus managers.
Typically, an SMPU will have a collection of regions. Each region has a
configurable start address and size, and is assigned to one or more bus
managers (or, in more advanced designs, to one or more VMs using a VM
identifier stored in the Cortex-R52+'s VSCTLR.VMID register). A bus manager
(or VM) can only access memory in regions assigned to it.
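Because SMPUs are vendor specific, the register layout in the following
sketch is hypothetical; it only illustrates the underlying model of a region
as an address range plus a set of permitted bus managers and VMIDs.

/* Hypothetical SMPU programming model: each region grants a set of bus
 * managers and VMIDs access to one address range. */
#include <stdint.h>

struct smpu_region {                  /* hypothetical register layout */
    volatile uint32_t start;          /* region start address */
    volatile uint32_t end;            /* region end address (inclusive) */
    volatile uint32_t manager_mask;   /* bit n set: bus manager n allowed */
    volatile uint32_t vmid_mask;      /* bit n set: VMID n allowed */
};

#define SMPU ((struct smpu_region *)0x50000000u)  /* hypothetical base */

static void smpu_assign(unsigned int region, uint32_t start, uint32_t end,
                        uint32_t managers, uint32_t vmids)
{
    SMPU[region].start        = start;
    SMPU[region].end          = end;
    SMPU[region].manager_mask = managers; /* e.g., DMA controller + cluster */
    SMPU[region].vmid_mask    = vmids;    /* VMs allowed to use this range */
}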
Fig. 2: Partitioning Memory with an SMPU
An advantage of using SMPUs, rather than just core MPUs, is that they allow
us to create VMs that include not just Cortex-R52+ cores but also other
DMA-capable components that may be in the microcontroller and are connected
to the same memory bus. For example, microcontrollers may include clusters
of Cortex-R52+ cores and some special purpose Cortex-M cores.
The HVC (hypervisor call) instruction can be used by code running at EL1
to make a request to a hypervisor in the same way that the SVC (supervisor
call) instruction can be used by application software to make a request
to an operating system. When software running at EL1 executes an HVC
instruction, the Cortex-R52+ core switches to EL2 and takes a Hyp-mode
entry exception. The hypervisor handles this exception and then returns
to the guest software at EL1.
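A minimal sketch of a guest-side hypercall wrapper is shown below. The
calling convention (function code in r0, arguments in r1-r2, result back
in r0) is a hypothetical agreement between guest and hypervisor, not
something the architecture mandates.

/* Minimal sketch: para-virtualized hypercall from EL1 guest code. */
#include <stdint.h>

static inline uint32_t hvc_call(uint32_t func, uint32_t arg0, uint32_t arg1)
{
    register uint32_t r0 __asm("r0") = func;
    register uint32_t r1 __asm("r1") = arg0;
    register uint32_t r2 __asm("r2") = arg1;
    /* HVC switches the core to EL2; the hypervisor reads the request
     * from the guest's registers and writes the result into r0. */
    __asm volatile("hvc #0" : "+r"(r0) : "r"(r1), "r"(r2) : "memory");
    return r0;
}

/* Hypothetical use: ask the hypervisor to enable one GIC interrupt. */
/* uint32_t ok = hvc_call(HVC_GIC_ENABLE_IRQ, intid, 0); */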
Fig. 3: Para-virtualization
Para-virtualization can also be used to allow peripheral sharing and creation
of virtual peripherals. A peripheral, such as an Ethernet controller, can be shared
in much the same way as the GIC. Completely virtual peripherals can also
be created. For example, one might create a virtual Ethernet controller used
for communication between VMs running on the same microcontroller. In both
cases, the hypervisor would contain an EL2 device driver that either managed
access to the shared peripheral or implemented the virtual peripheral. This
is analogous to the way that an Operating System uses device drivers
to manage access to peripherals shared by multiple processes or tasks.
Using EL2 for Trap-and-emulate
In some cases, para-virtualizing guest software may not be possible.
In these cases, trap-and-emulate can be used.
When code running at EL1 or EL0 makes a memory access prohibited by the
EL2 MPU, the Cortex-R52+ processor switches to EL2 and takes a Hyp-mode
entry exception. This feature can be used by a hypervisor to allow emulated
access to peripherals with memory mapped registers. The EL2 MPU is configured
to prohibit access to the registers. When guest software reads or writes
a register, a Hyp-mode entry exception occurs at EL2. The hypervisor works
out which register the guest software was reading or writing by examining
the Cortex-R52+'s Hyp Syndrome Register (HSR), which contains details of
why an exception occurred, and Hyp Data Fault Address Register (HDFAR),
which contains the memory address being accessed when the exception
occurred. It then either emulates access to the register itself or
delegates to an EL2 device driver.
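The following sketch outlines that EL2 data-abort path. The HSR and HDFAR
encodings and field positions are assumptions based on the Armv8-R AArch32
register descriptions and should be verified against the Cortex-R52(+) TRM;
emulate_mmio() is a hypothetical hook into an EL2 device driver.

/* Minimal sketch of an EL2 trap-and-emulate data-abort handler. */
#include <stdint.h>

extern void emulate_mmio(uint32_t addr, int is_write, uint32_t hsr);

static inline uint32_t read_hsr(void)
{
    uint32_t v;
    __asm volatile("mrc p15, 4, %0, c5, c2, 0" : "=r"(v));  /* HSR */
    return v;
}

static inline uint32_t read_hdfar(void)
{
    uint32_t v;
    __asm volatile("mrc p15, 4, %0, c6, c0, 0" : "=r"(v));  /* HDFAR */
    return v;
}

void hyp_data_abort_handler(uint32_t *guest_return_address)
{
    uint32_t hsr  = read_hsr();
    uint32_t addr = read_hdfar();

    /* EC field (HSR[31:26]) == 0x24: data abort taken from EL1/EL0. */
    if (((hsr >> 26) & 0x3Fu) != 0x24u)
        return;                                /* not an emulated access */

    int is_write = (hsr >> 6) & 1;             /* ISS.WnR bit */
    emulate_mmio(addr, is_write, hsr);         /* complete the access */

    /* Step the guest past the trapped instruction (IL bit: 32/16-bit). */
    *guest_return_address += ((hsr >> 25) & 1) ? 4 : 2;
}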
Trap-and-emulate can be used to access the shared GIC distributor. The EL2
MPU is configured so that guest software access to GIC memory mapped
registers causes an exception. When the exception occurs, the hypervisor
carries out the GIC register access having first checked that the register
access will not interfere with another VM.
Trap-and-emulate has the advantage that guest software does not need
to be modified to run in a VM. However, para-virtualization is usually more
performant because:
— The hypervisor does not have to work out what the guest software was doing
when the Hyp-mode entry exception occurs.
Interrupt Virtualization
EL2 alone does not allow us to share or virtualize interrupt driven peripherals.
The Armv8-R architecture defines that normally when an interrupt occurs it
interrupts the currently running code at the current privilege level. For example,
if an IRQ occurs while code is running at EL1 then the interrupt will be taken
at EL1 using the IRQ entry in the EL1 vector table, but if the IRQ occurs while
code is running at EL2 then the interrupt will be taken at EL2 using the IRQ
entry in the EL2 vector table.
Fig. 4: Interrupt Virtualization
A hypervisor can, however, configure the core so that all physical interrupts
are taken at EL2, regardless of the Exception Level at which the interrupted
code was running. Since all interrupts are then initially handled by the
hypervisor, the hypervisor can decide if an interrupt should be handled by
the hypervisor itself, by an EL2 device driver, or should be virtualized and
injected into a VM. This allows interrupt-driven shared/virtual peripherals
to be handled. EL2 device drivers can also inject virtual interrupts into
VMs if needed.
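A minimal sketch of the injection logic follows. The list-register write is
hidden behind a hypothetical helper, because the exact layout of the GIC
virtual-interface list registers depends on the GIC implementation; the
software queueing model is the part being illustrated.

/* Minimal sketch: inject a virtual interrupt, queueing it in software
 * if the target virtual core is not currently running. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_PENDING 32

struct virq_queue {
    uint32_t intid[MAX_PENDING];
    unsigned int head, tail;      /* drained into list registers on switch */
};

struct vcpu {
    bool running;
    struct virq_queue pending;
};

/* Hypothetical helper: program one GIC virtual-interface list register
 * so the interrupt appears as a pending virtual IRQ to the guest. */
extern void gic_write_list_register(unsigned int lr, uint32_t intid);

void inject_virq(struct vcpu *vcpu, uint32_t intid)
{
    if (vcpu->running) {
        gic_write_list_register(0, intid);     /* deliver immediately */
    } else {
        struct virq_queue *q = &vcpu->pending; /* deliver when it next runs */
        q->intid[q->tail % MAX_PENDING] = intid;
        q->tail++;
    }
}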
Of course, nothing comes for free, and interrupt virtualization adds to the total
time taken to process an interrupt. The exact overhead depends on many
factors, including the arrival pattern of interrupts and how many different
GIC interrupts are being used. There are two timing concerns to be aware
of related to interrupt virtualization. Consider two interrupts, A and B,
where A has a higher priority than B:

i. If both interrupts occur at the same time, the EL2 handling
for both interrupts will occur before any EL1 handling, and
the guest software will see a double increase in latency for
A, but no increase in latency for B.

ii. Now imagine that A occurs and reaches the EL1 handler
in the guest software and then B occurs. The EL2 handling
of B will pre-empt the EL1 handling of A even though A is
higher priority.
To help quantify this GIC virtualization behaviour, we include the results
of some experiments carried out with ETAS’ RTA-HVR hypervisor. In these
experiments, we compared the interrupt latency of Cortex-R52+ cores
running the RTA-OS Operating System without a hypervisor to the same
Cortex-R52+ cores running RTA-OS as guest software inside a VM. Two
cores inside the same Cortex-R cluster were used. The first core triggered
interrupts in the second core by setting bits in the GIC ISPENDR registers
(Interrupt Set Pending Registers that are used by software to trigger interrupts).
The latency is the number of timer ticks between the interrupt being triggered
and the start of the (fully Operating System managed, i.e., AUTOSAR Category
2) ISR. The timer was configured to run at the same speed as the processor
clock. The exact value of the latency will depend on the type of memory used
for code/data and the cache configuration, so in the following results
it is important to focus on the comparison between the hypervisor and
non-hypervisor cases rather than the absolute values.
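For reference, the triggering mechanism looks like the sketch below. The
ISPENDR offset (0x200) comes from the GIC architecture specification; the
Distributor base address is platform specific and hypothetical here.

/* Minimal sketch: trigger an interrupt by setting its pending bit in
 * the GIC Distributor's Interrupt Set-Pending Registers. */
#include <stdint.h>

#define GICD_BASE     0xAF000000u   /* hypothetical, platform specific */
#define GICD_ISPENDR  ((volatile uint32_t *)(GICD_BASE + 0x200))

static void trigger_irq(unsigned int intid)
{
    GICD_ISPENDR[intid / 32] = 1u << (intid % 32);
}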
Table 2: Interrupt latency (CPU cycles) per interrupt number, without and
with a hypervisor, and the hypervisor/no-hypervisor latency ratio.
Table 2 shows what happens when the first core triggers four different
interrupts with a large delay between interrupts, so that all interrupt
handling has completed on the second core before the next interrupt is
triggered. In this case, we see an increase of around 80 percent in
interrupt latency. Here, the Operating System configuration does not
contain any untrusted code (i.e., all application code is running at EL1).
Table 3: Interrupt latency (CPU cycles) per interrupt number, without and
with a hypervisor, and the hypervisor/no-hypervisor latency ratio, for a
configuration containing untrusted code.
Table 3 contains results from the same setup as Table 2, except that the Operating
System configuration includes untrusted tasks and ISRs. Here we see that
the additional work that must be done by RTA-OS to manage untrusted code
means the work done at EL2 to virtualize interrupts is a smaller proportion
of overall interrupt latency.
Table 4: Interrupt latency (CPU cycles) and the latency delta between
interrupts N-1 and N, per interrupt number, without and with a hypervisor.
Table 4 shows what happens if the four interrupts are triggered at the same time.
The lower the interrupt number, the higher its priority. For the case without
a hypervisor, we see the expected behaviour given interrupt prioritization. The ISR
for interrupt number 1 runs first and blocks the other interrupts until it has been
completed. The ISR for interrupt number 2 then runs and blocks the other
interrupts until it has been completed. And so on. The time between the ISR for
interrupt N-1 starting and the ISR for interrupt number N starting is approximately
the same (the ISRs executed very little code). However, with a hypervisor present
we see quite different behaviour. The EL2 handling for all four interrupts occurs
before interrupt number 1 is handled by its EL1 ISR. Therefore, we see a much
larger interrupt latency for interrupt number 1 than for the subsequent interrupts.
Virtual Processor Cores
We can also take advantage of interrupt virtualization to support virtual
processor cores. For example, a timer interrupt can be handled by the
hypervisor and used to drive a virtual core scheduler that decides when
to context switch between virtual cores. Since guest software cannot block
interrupts being taken at EL2, broken or malicious guest software cannot
deny processor time to other guest software. Interrupts that arrive for
a virtual core that is not currently running can be virtualized, queued
in software, and injected into the virtual core when it next runs.
Fig. 5: Virtual Cores
If multiple virtual cores are hosted by a single physical core, then consideration
must be given to how virtual cores are scheduled. The simplest approach
is to use a static TDMA (Time-Division Multiple Access) algorithm. A TDMA
algorithm has a very low run-time overhead, is easy to understand, and makes
it easy to work out when a virtual core will run in wall-clock time. The disadvantage
of a purely static algorithm is that it can lead to long latencies when handling
asynchronous events (e.g., interrupts). It may be possible to avoid long
latencies through the careful construction of the static VM schedule to ensure
that an interrupt never has to wait for too long before the VM that handles
it runs. However, this may require detailed understanding of interrupt
worst-case execution times.
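A static TDMA scheduler can be very small, as the sketch below suggests.
The slot table contents and the switch_to()/set_timer() interfaces are
hypothetical; the point is that the schedule is fixed offline and that the
scheduling interrupt is taken at EL2, where guests cannot mask it.

/* Minimal sketch: static TDMA scheduling of virtual cores, driven from
 * an EL2 timer interrupt. */
#include <stdint.h>

struct tdma_slot {
    unsigned int vcpu_id;     /* virtual core that owns this slot */
    uint32_t     ticks;       /* slot length in timer ticks */
};

/* Offline-constructed frame: VM0 gets two slots so that its interrupts
 * never wait more than roughly half a frame. */
static const struct tdma_slot schedule[] = {
    { 0, 500 }, { 1, 250 }, { 0, 500 }, { 2, 250 },
};

static unsigned int slot;

extern void switch_to(unsigned int vcpu_id);  /* hypothetical interface */
extern void set_timer(uint32_t ticks);        /* hypothetical interface */

/* Called from the EL2 timer interrupt handler. Because guests cannot
 * mask EL2 interrupts, a broken VM cannot hold on to the physical core. */
void tdma_tick(void)
{
    slot = (slot + 1) % (sizeof schedule / sizeof schedule[0]);
    set_timer(schedule[slot].ticks);
    switch_to(schedule[slot].vcpu_id);
}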
Using virtual processor cores gives the system designer flexibility:
— A VM can be created that contains more cores than would be available if only
physical cores were used. The extra cores might make structuring software
easier, in the same way that threads are used in an operating system.
Using virtual cores affects interrupt latency. With a static scheduling algorithm,
an interrupt that arrives for a virtual core that is not currently running will not
be handled until the virtual core next runs. With a dynamic scheduling algorithm
that automatically switches to the virtual core that handles an interrupt, the
virtual core context switch time will be added to the interrupt latency.
Software Integration Recommendations
Unfortunately, there is no “one size fits all” approach to integrating multiple
applications into a microcontroller. The previous section has outlined several
mechanisms that can be used to enable integration. Which mechanisms
are appropriate will depend on the types of application being integrated.
Use Core-local and Cluster-local Resources
It is better to use resources like memory and peripherals that are “close”
to the core that uses them. Each Cortex-R core has TCMs that are much
faster to access than other types of RAM. Often microcontrollers have
cluster local Flash and RAM, and in some cases peripherals (e.g., CAN and
LIN controllers) can be assigned to a cluster. As well as usually being faster,
using cluster local resources often results in less memory bus contention
because a core accessing cluster local resources may not have to contend
with cores in other clusters for access to the memory bus.
Note that using resources close to a core may limit any migration of virtual
processor cores between physical processor cores at run-time (if this
is supported). For example, if a virtual core is linked to use Flash local
to Cortex-R core cluster 0, the virtual core would run more slowly if migrated
to cluster 1.
Ultra-low Latency Hard Real-time Applications
Here we are considering applications that require very short and predictable
interrupt latencies. In essence, we want the “bare metal” behaviour. In these
cases, interrupt virtualization and hosting multiple virtual cores on a physical
core are more challenging, because of the increased and less predictable interrupt
latencies that result from using these techniques.
The same argument applies to using EL2 device drivers for sharing
peripherals and creating virtual peripherals. The para-virtualization
or trap-and-emulate required is relatively slow, but when it occurs
it is under the control of the application.
Best-effort Applications
For best-effort (non-real-time) applications, multiplexing multiple virtual
cores onto a single physical core, together with shared and virtual device
support, can allow far more efficient use of microcontroller resources.
In some domains, many applications perform functions that have no hard
real-time constraints; these applications often perform their function in
response to a stimulus and are then quiescent until the stimulus occurs
again. Such systems are amenable to hosting multiple applications on a
single physical core.
If the system needs to handle asynchronous events with short latencies, then
a dynamic virtual core scheduling algorithm may be needed. However, if there
is no need to handle asynchronous events with short latencies, a simple static
scheduling algorithm will have a lower run-time overhead. The discussion on
virtualizing GIC access and using EL2 device drivers in the section on
ultra-low latency hard real-time applications is also applicable here.
Legacy Software
When integrating legacy applications, one wants to minimize changes to the
software. If the applications do not have hard real-time constraints, then
most of the above mechanisms can be used except for para-virtualization.
Trap-and-emulate would allow legacy peripherals to be emulated using EL2
device drivers.
Recommendations for Future Microcontrollers
Provide Fine Grained Assignment of Peripherals to VMs
The need for EL2 device drivers can be reduced if peripherals can be assigned
to VMs at a fine grain. For example, it would be useful to be able to assign
individual Controller Area Network (CAN) channels, or even individual
General-Purpose Input/Output (GPIO) pins, to a VM. This reduces the need
for para-virtualization or trap-and-emulate with the consequent improvement
in performance. While full peripheral virtualization would be ideal (see below),
a compromise would be to use para-virtualization/trap-and-emulate to carry
out initialization and configuration of peripherals, but allow direct access
for the data plane.
Ensure that DMA is Virtualization Aware
When a VM uses a DMA transfer, or uses a peripheral that uses a DMA transfer,
the DMA transfer must not allow the VM to read or write from memory
addresses that would normally be prohibited by the SMPU or core MPUs.
The ideal would be for the DMA module/channel to automatically inherit the
identity of the VM that configured the DMA module/channel or of the VM that
configured the peripheral that uses the DMA module/channel. The VM identifier
would then be checked by at least the SMPU on a DMA transfer, and the DMA
transfer blocked if the VM did not have permission to read or write the memory
involved in the DMA transfer. To support such behaviour, the Armv8-R VMID
should be distributed to peripherals and DMA controllers.
Some work has already been done in this area for the Armv8-R architecture,
and a good example can be found in the Arm “Device virtualization principles
for real time systems” paper.
Summary/Conclusion
The evolution of E/E architectures, including zonal controllers, demands
new solutions for real-time software integration. Classic AUTOSAR is a
de-facto standard in the automotive real-time software world, but further
integration options, such as for legacy software, are a must moving forward.
The Armv8-R architecture, with its EL2 separation, is a good basis for
enabling intelligent integration. How this is used depends strongly on the
application, and specific application demands will determine which form
of integration is most suitable.
Glossary
© Arm Ltd. 2022. All brand names or product names are the property of their respective holders. Neither the whole nor any part of the information contained in, or the
product described in, this document may be adapted or reproduced in any material form except with the prior written permission of the copyright holder. The product described in
this document is subject to continuous developments and improvements. All particulars of the product and its use contained in this document are given in good faith. All warranties
implied or expressed, including but not limited to implied warranties of satisfactory quality or fitness for purpose are excluded. This document is intended only to provide information
to the reader about the product. To the extent permitted by local laws Arm shall not be liable for any loss or damage arising from the use of any information in this document or any
error or omission in such information.