CPU Concepts-2
Cryptography Extension
The Cortex-A53 Cryptography Extension adds instructions that accelerate:
• The Secure Hash Algorithm (SHA) functions SHA-1, SHA-224, and SHA-256.
• Finite field arithmetic used in algorithms such as Galois/Counter Mode and Elliptic
Curve Cryptography.
The Cortex-A53 processor supports a range of debug and trace features including:
• Debug ROM.
The Cortex-A53 processor has an Advanced Peripheral Bus version 3 (APBv3)
debug interface that is CoreSight compliant. This permits system access to debug
resources, for example, the setting of watchpoints and breakpoints.
CACHE MEMORY
Cache memory is a chip-based computer component that makes retrieving data from
the computer's memory more efficient. It acts as a temporary storage area that the
computer's processor can retrieve data from easily. This temporary storage area,
known as a cache, is more readily available to the processor than the computer's main
memory source, typically some form of DRAM.
Cache memory is sometimes called CPU (central processing unit) memory because it
is typically integrated directly into the CPU chip or placed on a separate chip that has
a separate bus interconnect with the CPU. Therefore, it is more accessible to the
processor, and able to increase efficiency, because it's physically close to the
processor.
In order to be close to the processor, cache memory needs to be much smaller than
main memory. Consequently, it has less storage space. It is also more expensive than
main memory, as it is a more complex chip that yields higher performance.
What it sacrifices in size and price, it makes up for in speed. Cache memory operates
10 to 100 times faster than RAM, requiring only a few nanoseconds to
respond to a CPU request.
The name of the actual hardware that is used for cache memory is high-speed static
random access memory (SRAM). The name of the hardware that is used in a
computer's main memory is dynamic random access memory (DRAM).
Cache memory is not to be confused with the broader term cache. Caches are
temporary stores of data that can exist in both hardware and software. Cache memory
refers to the specific hardware component that allows computers to create caches at
various levels of the network.
Types of cache memory
Cache memory is fast and expensive. Traditionally, it is categorized as "levels" that
describe its closeness and accessibility to the microprocessor. There are three general
cache levels:
L1 cache, or primary cache, is extremely fast but relatively small, and is usually
embedded in the processor chip as CPU cache.
L2 cache, or secondary cache, is often more capacious than L1. L2 cache may be
embedded on the CPU, or it can be on a separate chip or coprocessor and have a
high-speed alternative system bus connecting the cache and CPU. That way it doesn't
get slowed by traffic on the main system bus.
In the past, L1, L2 and L3 caches have been created using combined processor and
motherboard components. Recently, the trend has been toward consolidating all three
levels of memory caching on the CPU itself. That's why the primary means for
increasing cache size has begun to shift from the acquisition of a specific
motherboard with different chipsets and bus architectures to buying a CPU with the
right amount of integrated L1, L2 and L3 cache.
• Direct mapped cache has each block mapped to exactly one cache memory location.
Conceptually, a direct mapped cache is like rows in a table with three columns: the
cache block that contains the actual data fetched and stored, a tag with all or part of
the address of the data that was fetched, and a valid flag bit that shows whether the
row entry holds valid data (a sketch of this layout follows the list below).
• Fully associative cache mapping is similar to direct mapping in structure but allows
a memory block to be mapped to any cache location rather than to a prespecified
cache memory location as is the case with direct mapping.
• Set associative cache mapping can be viewed as a compromise between direct
mapping and fully associative mapping in which each block is mapped to a subset of
cache locations. It is sometimes called N-way set associative mapping, which
provides for a location in main memory to be cached to any of "N" locations in the
L1 cache.
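As a rough illustration of the mapping schemes above, here is a minimal C sketch of a direct-mapped lookup. The cache geometry (LINE_SIZE, NUM_LINES) and the function name are illustrative assumptions, not any particular CPU's organization.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative parameters, not tied to any specific CPU. */
#define LINE_SIZE 64                      /* bytes per cache block             */
#define NUM_LINES 256                     /* number of lines in the cache      */

typedef struct {
    bool     valid;                       /* flag bit: entry holds valid data  */
    uint64_t tag;                         /* upper address bits of the block   */
    uint8_t  data[LINE_SIZE];             /* the cached block itself           */
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Returns true on a hit in a direct-mapped cache: the address selects
 * exactly one line (the index), and the stored tag must match. */
bool direct_mapped_lookup(uint64_t addr)
{
    uint64_t block = addr / LINE_SIZE;    /* strip the byte offset             */
    uint64_t index = block % NUM_LINES;   /* exactly one possible line         */
    uint64_t tag   = block / NUM_LINES;   /* remaining upper bits              */

    return cache[index].valid && cache[index].tag == tag;
}
```

In an N-way set associative cache the same index would select a set of N lines instead of a single line, and all N stored tags would be compared against the address tag.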
Data writing policies
Data can be written to memory using a variety of techniques, but the two main ones
involving cache memory are:
• Write-through. Data is written to both the cache and main memory at the same time.
• Write-back. Data is only written to the cache initially. Data may then be written to
main memory, but this does not need to happen and does not inhibit the interaction
from taking place.
The way data is written to the cache impacts data consistency and efficiency. For
example, when using write-through, more writing needs to happen, which causes
latency upfront. When using write-back, operations may be more efficient, but data
may not be consistent between the main and cache memories.
One way a computer determines data consistency is by examining the dirty bit in
memory. The dirty bit is an extra bit included in memory blocks that indicates
whether the information has been modified. If data reaches the processor's register
file with an active dirty bit, it means that it is not up to date and there are more recent
versions elsewhere. This scenario is more likely to happen in a write-back scenario,
because the data is written to the two storage areas asynchronously.
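To make the write-back policy and the dirty bit concrete, here is a minimal sketch; the structure and function names are illustrative assumptions rather than a real controller design.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64

typedef struct {
    bool     valid;
    bool     dirty;               /* set when the cached copy is newer than memory */
    uint64_t tag;
    uint8_t  data[LINE_SIZE];
} wb_line_t;

/* Write-back: update only the cache and mark the line dirty.
 * Main memory is updated later, when the line is evicted. */
void write_back_store(wb_line_t *line, unsigned offset, uint8_t value)
{
    line->data[offset] = value;
    line->dirty = true;           /* main memory is now stale for this block */
}

/* On eviction, a dirty line must be flushed to main memory first. */
void evict(wb_line_t *line, uint8_t *main_memory, uint64_t block_addr)
{
    if (line->valid && line->dirty)
        memcpy(&main_memory[block_addr], line->data, LINE_SIZE);
    line->valid = false;
    line->dirty = false;
}
```

With a write-through policy, write_back_store would also update main memory immediately, and no dirty bit would be needed.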
Translation lookaside buffers (TLBs) are also specialized memory caches whose
function is to record virtual address to physical address translations.
Still other caches are not, technically speaking, memory caches at all. Disk caches,
for instance, can use DRAM or flash memory to provide data caching similar to what
memory caches do with CPU instructions. If data is frequently accessed from the
disk, it is cached into DRAM or flash-based silicon storage technology for faster
access time and response.
Specialized caches are also available for applications such as web browsers,
databases, network address binding and client-side Network File System protocol
support. These types of caches might be distributed across multiple networked hosts
to provide greater scalability or performance to an application that uses them.
Figure: A depiction of the memory hierarchy and how it functions.
Locality
The ability of cache memory to improve a computer's performance relies on the
concept of locality of reference. Locality describes various situations that make a
system more predictable. Cache memory takes advantage of these situations to create
a pattern of memory access that it can rely upon.
There are several types of locality. Two key ones for cache are:
• Temporal locality. This is when the same resources are accessed repeatedly in a
short amount of time.
• Spatial locality. This refers to accessing various data or resources that are near each
other.
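A small C example of both kinds of locality: the row-major loop below enjoys spatial and temporal locality, while the column-major variant gives up spatial locality on a row-major array. The 1024x1024 matrix size is arbitrary.

```c
/* Summing a matrix row by row touches consecutive addresses (spatial
 * locality) and reuses the accumulator `sum` on every iteration
 * (temporal locality), so most accesses hit in the cache. */
long sum_rows(int a[1024][1024])
{
    long sum = 0;
    for (int i = 0; i < 1024; i++)
        for (int j = 0; j < 1024; j++)
            sum += a[i][j];       /* a[i][j] and a[i][j+1] share a cache line */
    return sum;
}

/* Swapping the loops (column-major traversal of a row-major array)
 * destroys spatial locality: successive accesses land in different lines. */
long sum_cols(int a[1024][1024])
{
    long sum = 0;
    for (int j = 0; j < 1024; j++)
        for (int i = 0; i < 1024; i++)
            sum += a[i][j];
    return sum;
}
```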
Types of Cache Misses
Compulsory Miss
A compulsory miss, also known as a cold miss, occurs when data is accessed for the
first time. Since the data has not been requested before, it is not present in the cache,
leading to a miss. This type of miss is unavoidable as it is inherent in the first
reference to the data. The only way to eliminate compulsory misses would be to have
an infinite prefetch of data, which is not feasible in real-world systems.
Capacity Miss
A capacity miss happens when the cache cannot contain all the data needed by the
system. This type of miss occurs when the working set (the set of data that a program
accesses frequently) is larger than the cache size. When the cache is filled to capacity
and a new data item is referenced, existing data must be evicted to accommodate the
new data, leading to a miss. Capacity misses can be reduced by increasing the cache
size or optimizing the program to decrease the size of the working set.
Conflict Miss
Conflict misses, also known as collision misses, occur when multiple data items,
which are accessed in a sequence, map to the same cache location, known as a cache
set. This type of miss is a result of the cache’s organization. In a set-associative or
direct-mapped cache, different data items may be mapped to the same set, leading to
conflicts. When a new item is loaded into a filled set, another item must be evicted,
leading to a miss if the evicted item is accessed again. Conflict misses can be
mitigated by improving the cache’s mapping function or by increasing the cache’s
associativity.
Coherence Miss
A coherence miss occurs in multiprocessor systems when a cache line that this core
would otherwise still hold has been invalidated by another core’s write as part of
keeping the caches coherent. The next access to that line misses even though the
cache had room for it.
PIPELINING
The objectives of this module are to discuss the various hazards associated with
pipelining.
We discussed the basics of pipelining and the MIPS pipeline implementation in the
previous module. We made the following observations about pipelining:
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Multiple tasks operate simultaneously
• Pipeline rate is limited by the slowest pipeline stage
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
• We execute billions of instructions, so throughput is what matters
• Data path design – Ideally we expect a CPI value of 1
• What is desirable in instruction sets for pipelining?
• Variable length instructions vs. all instructions same length?
• Memory operands part of any operation vs. memory operands only in loads or stores?
• Register operand many places in instruction format vs. registers located in same
place?
Ideally we expect a CPI value of 1 and a speedup equal to the number of stages in the
pipeline. But, there are a number of factors that limit this. The problems that occur in
the pipeline are called hazards. Hazards that arise in the pipeline prevent the next
instruction from executing during its designated clock cycle. There are three types of
hazards:
• Structural hazards: Hardware cannot support certain combinations of instructions
(two instructions in the pipeline require the same resource).
• Data hazards: Instruction depends on result of prior instruction still in the pipeline
• Control hazards: Caused by delay between the fetching of instructions and decisions
about changes in control flow (branches and jumps).
One way to resolve a structural hazard is to replicate the contended resource:
o Replicate resource
§ good performance
§ increases cost (+ maybe interconnect delay)
§ useful for cheap or divisible resources
Figure 11.1 shows one possibility of a structural hazard in the MIPS pipeline.
Instruction 3 is accessing memory for an instruction fetch and instruction 1 is accessing
memory for a data access (load/store). These two are conflicting requirements and
give rise to a hazard. We should either stall one of the operations, as shown in Figure
11.2, or have two separate memories for code and data. Structural hazards have to be
handled at design time itself.
Next, we shall discuss data dependences and the associated hazards. There are
two types of data dependence – true data dependences and name dependences.
• A name dependence occurs when two instructions use the same register or memory
location, called a name, but there is no flow of data between the instructions associated
with that name
– An output dependence occurs when instruction i and instruction j write the same
register or memory location. The ordering between the instructions must be preserved.
– Since this is not a true dependence, renaming can be more easily done for register
operands, where it is called register renaming
• A control dependence determines the ordering of an instruction with respect to a branch:
– An instruction that is control dependent on its branch cannot be moved before the
branch so that its execution is no longer controlled by the branch.
– An instruction that is not control dependent on its branch cannot be moved after the
branch so that its execution is controlled by the branch.
Having introduced the various types of data dependences and control dependence, let
us discuss how these dependences cause problems in the pipeline. Dependences are
properties of programs and whether the dependences turn out to be hazards and cause
stalls in the pipeline are properties of the pipeline organization.
Data hazards may be classified as one of three types, depending on the order of read
and write accesses in the instructions:
• RAW (Read After Write)
• Corresponds to a true data dependence
• Considering two instructions i and j, instruction j should read the data only after
instruction i has written it
Add modifies R1 and then Sub should read it. If this order is changed, there is a RAW
hazard.
• WAW (Write After Write)
• Corresponds to an output dependence
• Occurs when there are multiple write stages, when a short integer pipeline coexists
with a longer floating-point pipeline, or when an instruction proceeds while a previous
instruction is stalled
• This is caused by a name dependence. There is no actual data transfer. It is the same
name that causes the problem
• Considering two instructions i and j, instruction j should write after instruction i has
written the data
Instruction i has to modify register r1 first, and then j has to modify it. Otherwise, there
is a WAW hazard. The problem exists only because of R1; if some other register had been
used, there would not be a problem
• Solution is register renaming, that is, use some other register. The hardware can do the
renaming or the compiler can do the renaming
• WAR (Write After Read)
• Arises from an anti dependence
• Cannot occur in most static issue pipelines
• Occurs either when there are early writes and late reads, or when instructions are re-
ordered
• There is no actual data transfer. It is the same name that causes the problem
• Considering two instructions i and j, instruction j should write after instruction i has
read the data.
Instruction i has to read register r1 first, and then j has to modify it. Otherwise, there is
a WAR hazard. The problem exists only because of R1; if some other register had been used,
there would not be a problem
• Solution is register renaming, that is, use some other register. The hardware can do
the renaming or the compiler can do the renaming
Figure 11.3 gives a situation of having true data dependences. The use of the result of
the ADD instruction in the next three instructions causes a hazard, since the register is
not written until after those instructions read it. The write back for the ADD instruction
happens only in the fifth clock cycle, whereas the next three instructions read the
register values before that, and hence will read the wrong data. This gives rise to RAW
hazards.
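As a rough C-level analogue of the dependence pattern in Figure 11.3 (the figure itself shows assembly; this is only an illustration), the second statement below must read r1 after the first writes it, which is exactly the RAW case that forwarding or stalling has to resolve.

```c
/* Each statement depends on the result of the one before it, mirroring
 * the ADD/SUB pattern: the second operation may read r1 only after the
 * first has written it (RAW). Without forwarding, a pipeline would have
 * to stall the consumer until the producer's write-back completes. */
int raw_example(int r2, int r3, int r5)
{
    int r1 = r2 + r3;   /* producer: writes r1                    */
    int r4 = r1 - r5;   /* consumer: reads the r1 written above   */
    return r4;
}
```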
A control hazard is when we need to find the destination of a branch, and can’t fetch
any new instructions until we know that destination. Figure 11.4 illustrates a control
hazard. The first instruction is a branch and it gets resolved only in the fourth clock
cycle. So, the next three instructions fetched may be correct, or wrong, depending on
the outcome of the branch. This is an example of a control hazard.
Now, having discussed the various dependences and the hazards that they might lead
to, we shall see what hazards can happen in our simple MIPS pipeline.
• Structural hazard
• Conflict for use of a resource
Let us look at the speedup equation with stalls and look at an example problem.
Assume:
• Ideal CPI = 1 for both machines
• Loads are 40% of the instructions executed
• Machine A has no load stalls; Machine B stalls one cycle on each load, but its
pipelined clock is 1.05 times faster
SpeedupA = Pipeline depth / (1 + 0) x (clockunpipe / clockpipe)
= Pipeline depth
SpeedupB = (Pipeline depth / (1 + 0.4 x 1)) x (clockunpipe / (clockunpipe / 1.05))
= (Pipeline depth / 1.4) x 1.05
= 0.75 x Pipeline depth
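A tiny C check of the arithmetic above, assuming a 5-stage pipeline purely for illustration:

```c
#include <stdio.h>

/* Plugging the numbers from the example into
 *   speedup = depth / (1 + stalls per instruction) * (clock_unpipe / clock_pipe)
 * with an assumed pipeline depth of 5. */
int main(void)
{
    double depth     = 5.0;                                   /* assumed           */
    double speedup_a = depth / (1.0 + 0.0);                   /* no load stalls    */
    double speedup_b = (depth / (1.0 + 0.4 * 1.0)) * 1.05;    /* 1.05x faster clock */

    printf("A: %.2f  B: %.2f\n", speedup_a, speedup_b);       /* A: 5.00  B: 3.75  */
    return 0;
}
```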
Micro TLB
The first level of caching for the translation table information is a micro TLB of ten
entries that is implemented on each of the instruction and data sides.
All main TLB related maintenance operations affect both the instruction and data
micro TLBs, causing them to be flushed.
Main TLB
A unified main TLB handles misses from the micro TLBs. This is a 512-entry, 4-way,
set-associative structure. The main TLB supports all VMSAv8 block sizes, except
1GB. If a 1GB block is fetched, it is split into 512MB blocks and the appropriate
block for the lookup is stored.
Accesses to the main TLB take a variable number of cycles.
The Intermediate Physical Address (IPA) cache RAM holds mappings between
intermediate physical addresses and physical addresses. Only Non-secure EL1 and
EL0 stage 2 translations use this cache. The IPA cache RAM is updated when a stage 2
translation completes, and is checked whenever a stage 2 translation is required.
Similarly to the main TLB, the IPA cache RAM can hold entries for different sizes.
The walk cache RAM holds the result of a stage 1 translation up to but not including
the last level. If the stage 1 translation results in a section or larger mapping then
nothing is placed in the walk cache. The walk cache holds entries fetched from
Secure and Non-secure state.
Memory is one of the most important host resources. For workloads to access global
system memory, we need to make sure virtual memory addresses are mapped to the
physical addresses. There are several components working together to perform these
translations as efficiently as possible. This section covers the basics of how
virtual memory addresses are translated.
Memory Translations
The physical address space is your system RAM, the memory modules inside your
ESXi hosts, also referred to as the global system memory. When talking about virtual
memory, we are talking about the memory that is controlled by an operating system,
or a hypervisor like vSphere ESXi. Whenever workloads access data in memory, the
system needs to look up the physical memory address that matches the virtual
address. This is what we refer to as memory translations or mappings.
To map virtual memory addresses to physical memory addresses, page tables are
used. A page table consists of numerous page table entries (PTE).
One memory page in a PTE contains data structures consisting of different sizes of
‘words’. Each type of word contains multiple bytes of data (WORD (16 bits/2 bytes),
DWORD (32 bits/4 bytes) and QWORD (64 bits/8 bytes)). Executing memory
translations for every possible word, or virtual memory page, into a physical memory
address is not very efficient, as this could potentially mean billions of PTEs. We need
PTEs to find the physical address space in the system’s global memory, so there is
no way around them.
To make memory translations more efficient, we use page tables to group chunks of
memory addresses in one mapping. Looking at an example of a DWORD entry of 4
bytes: a page table covers 4 kilobytes instead of just the 4 bytes of data in a single
page entry. For example, using a page table, we can translate virtual address space 0
to 4095 and say this is found in physical address space 4096 to 8191. Now we no
longer need to map all the PTEs separately, and we are far more efficient by using page
tables.
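A toy C sketch of the page-granularity arithmetic described above; the single-level page table is an illustrative simplification of real multi-level tables.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u   /* 4 KB pages, as in the example above */

/* A toy single-level page table: entry i holds the physical page
 * number for virtual page i. Real page tables are multi-level, but
 * the address arithmetic is the same. */
uint64_t translate(const uint64_t *page_table, uint64_t vaddr)
{
    uint64_t vpn    = vaddr / PAGE_SIZE;   /* virtual page number      */
    uint64_t offset = vaddr % PAGE_SIZE;   /* byte offset within page  */
    uint64_t ppn    = page_table[vpn];     /* one PTE covers 4 KB      */

    return ppn * PAGE_SIZE + offset;       /* physical address         */
}
```

For example, if page_table[0] holds physical page 1, then virtual addresses 0 to 4095 translate to physical addresses 4096 to 8191, matching the example above.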
The page tables are managed by a Memory Management Unit (MMU). All the
physical memory references are passed through the MMU. The MMU is responsible
for the translation between virtual memory addresses and physical memory
addresses. With vSphere ESXi, a virtual machine’s vCPU calls out to MMU
functionality provided by the Virtual Machine Monitor (VMM) process, or to a
hardware MMU supported by vendor-specific CPU offloading instructions.
The Memory Management Unit (MMU) works with the Translation Lookaside
Buffer (TLB) to map the virtual memory addresses to the physical memory layer. The
page table always resides in physical memory, and having to look up the memory
pages directly in physical memory can be a costly exercise for the MMU as it
introduces latency. That is where the TLB comes into play.
TLB in Detail
The TLB acts as a cache for the MMU that is used to reduce the time taken to access
physical memory. The TLB is a part of the MMU. Depending on the make and model
of a CPU, there is more than one TLB, or even multiple levels of TLB, much like
memory caches, to avoid TLB misses and to keep memory latency as low as
possible.
Now that we covered the basics on memory translation, let’s take a look at some
example scenarios for the TLB.
TLB hit
A virtual memory address comes in, and needs to be translated to the physical
address. The first step is always to dissect the virtual address into a virtual page
number, and the page offset. The offset consists of the last bits of the virtual address.
The offset bits are not translated and passed through to the physical memory address.
The offset contains bits that can represent all the memory addresses in a page.
So, the offset is directly mapped to the physical memory layer, and the virtual page
number matches a tag already in the TLB. The MMU now immediately knows what
physical memory page to access without the need to look into the global memory.
In the example provided in the above diagram, the virtual page number is found in
the TLB, and immediately translated to the physical page number.
1. The virtual address is dissected in the virtual page number and the page offset.
2. The page offset is passed through as it is not translated.
3. The virtual page number is looked up in the TLB, looking for a tag with the
corresponding number.
4. There is an entry in the TLB (hit), meaning we immediately can translate the virtual
to the physical address.
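A minimal C sketch of the four steps above; the TLB size, the linear tag search, and the fully associative organization are illustrative assumptions (real TLBs compare tags in parallel in hardware).

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE   4096u
#define TLB_ENTRIES 64          /* illustrative size */

typedef struct {
    bool     valid;
    uint64_t vpn;               /* the "tag": virtual page number   */
    uint64_t ppn;               /* cached translation               */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Steps 1-4 above: split the address, pass the offset through, and
 * look for a matching tag in the TLB. Returns true on a TLB hit. */
bool tlb_lookup(uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpn    = vaddr / PAGE_SIZE;     /* step 1: virtual page number */
    uint64_t offset = vaddr % PAGE_SIZE;     /* step 2: not translated      */

    for (unsigned i = 0; i < TLB_ENTRIES; i++) {       /* step 3: tag search */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = tlb[i].ppn * PAGE_SIZE + offset;  /* step 4: hit        */
            return true;
        }
    }
    return false;   /* miss: the page table in memory must be walked */
}
```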
TLB miss
What happens when a virtual page number is not found in the TLB, also referred to
as a TLB miss? The TLB needs to consult the system’s global memory to understand
what physical page number is used. Reaching out to physical memory means higher
latency compared to a TLB hit. If the TLB is full and a TLB miss occurs, the least
recently used TLB entry is flushed, and the new entry is placed instead of it. In the following
example, the virtual page number is not found in the TLB, and the TLB needs to look
into memory to get the page number.
1. The virtual address is dissected in the virtual page number and the page offset.
2. The page offset is passed through as it is not translated.
3. The virtual page number is looked up in the TLB, looking for a tag with a
corresponding number. In this example, the TLB does not yet have a valid entry.
4. TLB reaches out to memory to find page number 3 (because of the tag, derived from
the virtual page number). Page number 3 is retrieved in memory with value 0x0006.
5. The memory translation is done and the entry is now cached in the TLB.
A TLB miss is not ideal, but the worst-case scenario is data that is not residing in
memory but on storage media (flash or disk). Where we are talking nanoseconds to
retrieve data in caches or global memory, getting data from storage media will
quickly run into milliseconds or seconds depending on the media used.
1. The virtual address is dissected in the virtual page number and the page offset.
2. The page offset is passed through as it is not translated.
3. The virtual page number is looked up in the TLB, looking for a tag with a
corresponding number. In this example, the TLB does not yet have a valid entry.
4. TLB reaches out to memory to find page number 0 (because of the tag, derived from
the virtual page number). Page number 0 is retrieved in memory, but the data does not
reside in memory; it is on storage. A page fault is triggered, because we cannot
translate memory pages for data that is not in memory. We need to wait for the data
from storage.
CPU DEBUG LOGIC:
This section gives an overview of debug and describes the debug components.
The processor forms one component of a debug system.
Several methods of debugging an Arm processor-based SoC exist.
The diagram shows that the GIC has two independent AXI interfaces, each with its
own base address.
9.3 Function Description
This GIC architecture splits logically into a Distributor block and one CPU
interface block, as Figure 12-1 shows.
Distributor
This performs interrupt prioritization and distribution to the CPU interface that
connects to the processor in the system.
CPU interface
The CPU interface performs priority masking and preemption handling for a
connected processor in the system.
9.3.1 The Distributor
The Distributor centralizes all interrupt sources, determines the priority of each
interrupt, and, for the CPU interface, dispatches the interrupt with the highest priority
to the interface for priority masking and preemption handling.
The Distributor provides a programming interface for:
• Globally enabling the forwarding of interrupts to the CPU interface
• Enabling or disabling each interrupt
• Setting the priority level of each interrupt
• Setting the target processor list of each interrupt
• Setting each peripheral interrupt to be level-sensitive or edge-triggered
• If the GIC implements the Security Extensions, setting each interrupt as either
Secure or Non-secure
• Sending a Software-generated interrupt (SGI) to a processor.
• Visibility of the state of each interrupt
• A mechanism for software to set or clear the pending state of a peripheral interrupt.
Interrupt ID
Interrupts from sources are identified using ID numbers. The CPU interface can see
up to 160 interrupts.
The GIC assigns ID numbers to interrupts as follows:
• Interrupt numbers ID32-ID127 are used for SPIs(shared peripheral interrupts).
• ID0-ID15 are used for SGI.
• ID16-ID31 are used for Private peripheral interrupt (PPI).
• The GIC architecture reserves interrupt ID numbers 1022-1023 for special purposes.
ID1022
The GIC returns this value to a processor in response to an interrupt acknowledge
only when specific Security Extensions conditions apply (see the effect of the
Security Extensions on interrupt acknowledgement, later in this section).
This section describes the different types of interrupt that the GIC-500 handles.
SGIs (group 0)
SGIs are inter-processor interrupts, that is, interrupts generated from one core and
sent to other cores. Activating an SGI on one core does not affect the same interrupt
ID on another core. Therefore when an SGI is sent to all cores it is handled
independently on each core. The settings for each SGI are also independent between
cores.
You can generate SGIs using System registers in the generating core, or, in legacy
software, by writing to the Software Generated Interrupt Register, GICD_SGIR.
There are 16 independent SGIs, ID0-ID15, that are recorded separately for every
target core. In backwards compatibility mode, the number of the generating core is
also recorded.
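A hedged sketch of the legacy path mentioned above: writing GICD_SGIR to send an SGI. The field layout (target list filter in bits [25:24], CPU target list in bits [23:16], SGI ID in bits [3:0]) follows the GICv2 definition of GICD_SGIR at distributor offset 0xF00, but the distributor base address here is purely an assumption for the example and must come from the SoC memory map.

```c
#include <stdint.h>

/* Hypothetical distributor base address for this SoC. */
#define GICD_BASE   0x2F000000u
#define GICD_SGIR   (*(volatile uint32_t *)(GICD_BASE + 0xF00))

/* Legacy-mode SGI generation through GICD_SGIR (GICv2-style layout:
 * [25:24] TargetListFilter, [23:16] CPUTargetList, [3:0] SGI ID).
 * Sends SGI `sgi_id` (0-15) to the cores named in `target_mask`. */
void send_sgi(uint8_t sgi_id, uint8_t target_mask)
{
    GICD_SGIR = (0u << 24)                      /* forward to the listed cores */
              | ((uint32_t)target_mask << 16)
              | (sgi_id & 0xFu);
}
```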
PPIs (group 1)
PPIs are typically used for peripherals that are tightly coupled to a particular core.
Interrupts connected to the PPI inputs associated with one core are only sent to that
core. Activating a PPI on one core does not affect the same interrupt ID on another
core. The settings for each PPI are also independent between cores.
A PPI is an interrupt that is specific to a single core and is generated by a wire input.
PPI signals are active-LOW level-sensitive, by default, but can also be programmed
to be triggered on a rising edge.
SPIs (group 1)
SPIs are typically used for peripherals that are not tightly coupled to a specific core.
You can program each SPI to target either a particular core or any core. Activating an
SPI on one core activates the SPI for all cores. That is, the GIC-500 allows at most
one core to activate an SPI. The settings for each SPI are also shared between all
cores.
SPIs are generated either by wire inputs or by writes to the AXI4 slave programming
interface. The GIC-500 can support up to 960 SPIs corresponding to the external
spi[991:32] signal. The number of SPIs available depends on the implemented
configuration. The permitted values are 32-960, in steps of 32. The first SPI has an
ID number of 32. You can configure whether each SPI is triggered on a rising edge or
is active-HIGH level-sensitive.
When enabled, the CPU interface takes the highest priority pending interrupt for its
connected processor and determines whether the interrupt has sufficient
priority for it to signal the interrupt request to the processor.
To determine whether to signal the interrupt request to the processor the CPU
interface considers the interrupt priority mask and the preemption settings for
the processor. At any time, the connected processor can read the priority of its
highest priority active interrupt from a CPU interface register.
The processor acknowledges the interrupt request by reading the CPU interface
Interrupt Acknowledge register. The CPU interface returns either the interrupt ID of
the highest priority pending interrupt or, when no pending interrupt has sufficient
priority, a special ID indicating a spurious interrupt.
When the processor acknowledges the interrupt at the CPU interface, the
Distributor changes the status of the interrupt from pending to either active, or
active and pending. At this point the CPU interface can signal another interrupt
to the processor, to preempt interrupts that are active on the processor. If there
is no pending interrupt with sufficient priority for signaling to the processor, the
interface de-asserts the interrupt request signal to the processor.
When the interrupt handler on the processor has completed the processing of an
interrupt, it writes to the CPU interface to indicate interrupt completion. When
this happens, the Distributor changes the status of the interrupt either:
• From active to inactive, or
• From active and pending to pending.
Group 0 Interrupts:
• Group 0 interrupts typically consist of critical or time-sensitive events that require
immediate attention from the processor.
• Timer interrupts: These are generated by timer peripherals (e.g., system timer,
watchdog timer) when a timer reaches its predefined value.
• Group 0 interrupts are often associated with higher priority levels and are not subject
to the same interrupt handling mechanisms as Group 1 interrupts. They are usually
delivered directly to the processor cores without involvement from the GIC's
interrupt controller.
Group 1 Interrupts:
• External interrupts: These interrupts may originate from external sources outside the
processor, such as external interrupt controllers, hardware accelerators, or
co-processors.
• Group 1 interrupts are managed by the GIC's interrupt controller, which prioritizes
and routes them to the appropriate processor core based on their priority levels and
configuration settings. They may be subject to interrupt masking, priority adjustment,
and other control mechanisms provided by the GIC.
The specific conditions and sources of Group 0 and Group 1 interrupts can vary
widely depending on the system architecture, hardware configuration, and the
operating environment of the ARM-based system.
The Distributor maintains a state machine for each supported interrupt on each CPU
interface. The following figure shows an instance of this state machine, and the
possible state transitions.
Transition C
If the interrupt is enabled and of sufficient priority to be signalled to the
processor, occurs when software reads from the ICCIAR.
Transition D
For an SGI, occurs if the associated SGI is enabled and the Distributor forwards
it to the CPU interface at the same time that the processor reads the ICCIAR to
acknowledge a previous instance of the SGI. Whether this transition occurs
depends on the timing of the read of the ICCIAR relative to the reforwarding of
the SGI.
For an SPI:
Occurs if all the following apply:
• The interrupt is enabled.
• Software reads from the ICCIAR. This read adds the active state to the interrupt.
• For a level-sensitive interrupt, the interrupt signal remains asserted. This is usually
the case, because the peripheral does not deassert the interrupt until the processor has
serviced the interrupt.
• For an edge-triggered interrupt, whether this transition occurs depends on the timing
of the read of the ICCIAR relative to the detection of the reassertion of the interrupt.
Otherwise the read of the ICCIAR causes transition C, possibly followed by
transition A2.
2. For each enabled interrupt that is pending, the Distributor determines the targeted
processor.
3. For each processor, the Distributor determines the highest priority pending interrupt,
based on the priority information it holds for each interrupt, and forwards the
interrupt to the CPU interface.
4. The CPU interface compares the interrupt priority with the current interrupt
priority for the processor, determined by a combination of the Priority Mask Register,
the current preemption settings, and the highest priority active interrupt for the
processor. If the interrupt has sufficient priority, the GIC signals an interrupt
exception request to the processor.
5. When the processor takes the interrupt exception, it reads the ICCIAR in its CPU
interface to acknowledge the interrupt. This read returns an Interrupt ID that the
processor uses to select the correct interrupt handler. When it recognizes this read,
the GIC changes the state of the interrupt:
• If the pending state of the interrupt persists when the interrupt becomes active, or if
the interrupt is generated again, from pending to active and pending.
• Otherwise, from pending to active
6. When the processor has completed handling the interrupt, it signals this completion
by writing to the ICCEOIR in the GIC.
Generating an SGI
A processor generates an SGI by writing to an ICDSGIR.
In this product, the GIC implements 64 priority levels, so only the highest 6 bits of the
priority field are valid; the lower 2 bits read as zero.
In the GIC prioritization scheme, lower numbers have higher priority, that is, the
lower the assigned priority value the higher the priority of the interrupt. The highest
interrupt priority always has priority field value 0.
The ICDIPRs hold the priority value for each supported interrupt. To determine the
number of priority bits implemented, write 0xFF to an ICDIPR priority field and read
back the value stored.
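A minimal C sketch of that probe. The byte access to the first ICDIPR priority field at distributor offset 0x400 follows the usual GIC distributor map, but the base address is an assumed placeholder.

```c
#include <stdint.h>

/* Hypothetical distributor base; the ICDIPR (priority) registers start
 * at offset 0x400 in the GIC distributor register map. */
#define GICD_BASE    0x2F000000u
#define ICDIPR(n)    (*(volatile uint8_t *)(GICD_BASE + 0x400 + (n)))

/* Write 0xFF to one priority field and read it back: with 64 priority
 * levels only the top 6 bits stick, so the value read back is 0xFC. */
unsigned priority_bits(void)
{
    ICDIPR(0) = 0xFF;
    uint8_t v = ICDIPR(0);

    unsigned bits = 0;
    while (v & 0x80) {            /* count the implemented (leading) bits */
        bits++;
        v <<= 1;
    }
    return bits;
}
```

On this product the read-back value would be 0xFC, so the function returns 6.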
Preemption
A CPU interface supports forwarding of higher priority pending interrupts to a target
processor before an active interrupt completes. A pending interrupt is only forwarded
if it has a higher priority than all of:
• the priority of the highest priority active interrupt on the target processor, the running
priority for the processor, see Running Priority Register (ICCRPR) .
• The priority mask, see Priority masking.
• The priority group, see Priority grouping.
Preemption occurs at the time when the processor acknowledges the new interrupt,
and starts to service it in preference to the previously active interrupt or the currently
running process. When this occurs, the initial active interrupt is said to have been
preempted. Starting to service an interrupt while another interrupt is still active is
sometimes described as interrupt nesting.
Priority masking
The ICCPMR for a CPU interface defines a priority threshold for the target
processor, see Interrupt Priority Mask Register. The GIC only signals pending
interrupts with a higher priority than this threshold value to the target processor. A
value of zero, the register reset value, masks all interrupts to the associated processor.
The GIC always masks an interrupt that has the largest supported priority field value.
This provides an additional means of preventing an interrupt being signalled to any
processor.
Priority grouping
Priority grouping splits each priority value into two fields, the group priority and the
subpriority fields. The GIC uses the group priority field to determine whether a
pending interrupt has sufficient priority to preempt a currently active interrupt.
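A small sketch of that split, assuming the GICv2 binary point convention in which a binary point value of bpr places the group priority in bits [7:bpr+1] and the subpriority in bits [bpr:0].

```c
#include <stdint.h>

/* Split an 8-bit priority value into group priority and subpriority.
 * The placement of the split (bits [7:bpr+1] vs [bpr:0]) is assumed
 * per the GICv2 binary point convention. */
void split_priority(uint8_t priority, unsigned bpr,
                    uint8_t *group, uint8_t *sub)
{
    uint8_t sub_mask = (uint8_t)((1u << (bpr + 1)) - 1);

    *group = priority & (uint8_t)~sub_mask;  /* used for preemption decisions        */
    *sub   = priority & sub_mask;            /* orders interrupts within one group   */
}
```

Only the group priority takes part in the preemption comparison; the subpriority merely orders pending interrupts that share the same group priority.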
9.6.3 The effect of the Security Extensions on interrupt handling
If a GIC CPU interface implements the Security Extensions, it provides two interrupt
output signals, IRQ and FIQ:
• The CPU interface always uses the IRQ exception request for Non-secure interrupts
• Software can configure the CPU interface to use either IRQ or FIQ exception
requests for Secure interrupts.
Security Extensions support
Software can detect support for the Security Extensions by reading the
ICDICTR.SecurityExtn bit, see Interrupt Controller Type Register (ICDICTR).
Secure software makes Secure writes to the ICDISRs to configure each interrupt as
Secure or Non-secure, see Interrupt Security Registers (ICDISRn).
In addition:
• The banking of registers provides independent control of Secure and Non-secure
interrupts.
• The Secure copy of the ICCICR has additional fields to control the processing of
Secure and Non-secure interrupts, see CPU Interface Control Register (ICCICR)
These fields are:
❖ the SBPR bit, that affects the preemption of Non-secure interrupts.
❖ the FIQEn bit, that controls whether the interface signals Secure interrupt to the
processor using the IRQ or FIQ interrupt exception requests.
❖ the AckCtl bit, that affects the acknowledgment of Non-secure interrupts.
❖ the EnableNS bit, that controls whether Non-secure interrupts are signaled to the
processor, and is an alias of the Enable bit in the Non-secure ICCICR.
• The Non-secure copy of the ICCBPR is aliased as the ICCABPR, see Aliased
Binary Point Register (ICCABPR). This is a Secure register, meaning it is only
accessible by Secure accesses.
Effect of the Security Extensions on interrupt acknowledgement
If the highest priority pending interrupt is a Secure interrupt, the processor must make
a Secure read of the ICCIAR to acknowledge it.
If the read of the ICCIAR does not match the security of the interrupt, taking account
of the AckCtl bit value for a Non-secure interrupt, the ICCIAR read does not
acknowledge any interrupt and returns the value:
• 1022 for a Secure read when the highest priority interrupt is Non -secure
• 1023 for a Non-secure read when the highest priority interrupt is Secure.
Here are some common features and uses of generic timers in CPUs:
• Interrupt Generation: Generic timers can be used to generate interrupts at regular
intervals. This feature is often utilized in operating systems for tasks like scheduling
tasks or preempting the CPU to handle higher-priority tasks.
• Performance Monitoring: Some CPUs include generic timers that can be used for
performance monitoring purposes, such as measuring instruction execution time,
cache misses, or other performance-related metrics.
• Power Management: Generic timers may also play a role in power management by
enabling the CPU to enter low-power states or adjust its operating frequency based
on certain timing criteria.
• Timekeeping: In some cases, generic timers are used for basic timekeeping functions
within the CPU, providing a reference for measuring time intervals or tracking
system uptime.
• System Synchronization: Generic timers can also be used for system
synchronization purposes, helping coordinate actions between different components
or subsystems within the CPU or the broader system.
TIMERS:
Here are some common types of timer registers and their typical functions (a register-layout sketch follows the list):
• Control Register: This register is used to configure the operating mode of the timer,
such as whether it counts up or down, whether it operates in periodic or one-shot
mode, and whether interrupts are enabled.
• Interval Register: Also known as the Load Register or Period Register, this register
is used to set the initial value of the timer's count, which determines the interval at
which the timer will trigger an interrupt or other event.
• Counter Register: This register holds the current value of the timer's count. The
timer increments or decrements this value based on its operating mode and clock
source. When the count reaches a certain threshold (e.g., zero or a maximum value),
the timer may trigger an interrupt or perform some other action.
• Status Register: This register provides status information about the timer, such as
whether an interrupt has occurred, whether the timer is currently running, or whether
it has reached its terminal count.
• Control/Configuration Registers: These registers may include additional
configuration options for the timer, such as clock source selection, prescaler settings,
and interrupt masking.
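The register set above can be pictured as a memory-mapped block. The following C sketch is hypothetical: the base address, register order, and bit assignments are assumptions for illustration, not any specific timer IP.

```c
#include <stdint.h>

/* Hypothetical memory-mapped timer block laid out with the registers
 * described above; base address and bit positions are illustrative. */
typedef struct {
    volatile uint32_t control;   /* mode, direction, interrupt enable      */
    volatile uint32_t load;      /* interval/load value                    */
    volatile uint32_t counter;   /* current count (read-only in hardware)  */
    volatile uint32_t status;    /* e.g. bit 0 = interrupt pending         */
} timer_regs_t;

#define TIMER0 ((timer_regs_t *)0x40001000u)   /* assumed base address */

#define CTRL_ENABLE     (1u << 0)
#define CTRL_PERIODIC   (1u << 1)
#define CTRL_IRQ_EN     (1u << 2)

/* Program a periodic timer that raises an interrupt every `ticks` counts. */
void timer_start_periodic(uint32_t ticks)
{
    TIMER0->load    = ticks;                   /* interval register */
    TIMER0->control = CTRL_ENABLE | CTRL_PERIODIC | CTRL_IRQ_EN;
}

/* Interrupt handler side: acknowledge by clearing the status flag. */
void timer_irq_handler(void)
{
    TIMER0->status = 1u;    /* write-one-to-clear, as in many timer designs */
}
```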
SYSTEM COUNTERS:
The SoC implementer is responsible for the design of the System Counter. Usually,
the System Counter requires some initialization when a system boots up. Arm
provides a recommended register interface for the System Counter, but you should
check with your SoC implementer for details of a specific implementation.
The System Counter measures real time. This means that it cannot be affected by
power management techniques like Dynamic Voltage and Frequency Scaling (DVFS)
or putting cores into a lower power state. The count must continue to increment at its
fixed frequency. In practice, this requires the System Counter to be in an always-on
power domain.
To save power, the System Counter can vary the rate at which it updates the count.
For example, the System Counter could update the count by 10 every 10th tick of the
clock. This can be useful when the connected cores are all in low power state. The
system count still needs to reflect time advancing, but power can be saved by
broadcasting fewer counter updates.
Generic Timers:
❖ Generic timers are timers that are used for general-purpose timing functions within an
embedded system.
❖ They are typically used for tasks such as scheduling events, generating delays, or
measuring time intervals.
❖ These timers are usually programmed and controlled by software, allowing
developers to tailor their behavior according to the specific requirements of the
application.
❖ Generic timers are not typically designed to handle system failures or faults.
Watchdog Timers:
❖ Watchdog timers, on the other hand, are specifically designed to monitor the
operation of a system and take corrective action in the event of a malfunction or
system crash.
❖ The primary function of a watchdog timer is to reset the system or trigger an alarm if
the software or hardware fails to periodically "feed" or reset the watchdog timer.
❖ Watchdog timers help ensure the reliability and robustness of embedded systems by
providing a mechanism to recover from faults or errors that could otherwise cause the
system to hang or become unresponsive.
❖ These timers are often used in safety-critical applications or in systems where
continuous operation is essential.
❖ Bark Register:
➢ The "bark" register is often associated with the watchdog timer's ability to alert or
"bark" before taking action. It serves as a pre-warning mechanism to indicate that the
system is about to reset or perform a specific action due to a timeout or fault
condition.
➢ When the watchdog timer is enabled, it starts counting down from a preset value.
Before reaching the end of the countdown, it may trigger an interrupt or set a flag in
the bark register to warn the system that it needs attention.
➢ The bark signal is typically used by software to perform diagnostics, log events, or
take corrective actions before the watchdog timer proceeds to its "bite" phase
(resetting the system or taking a drastic action).
❖ Bite Register:
➢ The "bite" register, on the other hand, represents the watchdog timer's final action or
"bite" when a critical condition is not resolved within the timeout period indicated by
the timer.
➢ When the watchdog timer reaches its timeout value without being reset or serviced by
software (after the bark warning), it enters the "bite" phase. In this phase, it can
initiate a system reset, halt the processor, or trigger other emergency actions
depending on the system's design.
➢ The "bite" phase is often considered the last resort to prevent the system from
entering an unrecoverable state due to software or hardware faults.
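The following C sketch ties the bark and bite behaviour together. Every register name, offset, and bit here is a hypothetical illustration of the general scheme, not a real watchdog's programming model.

```c
#include <stdint.h>

/* Hypothetical watchdog register block with separate bark and bite
 * thresholds; names, layout, and semantics are illustrative only. */
typedef struct {
    volatile uint32_t bark;       /* count at which the early-warning IRQ fires */
    volatile uint32_t bite;       /* count at which the system is reset         */
    volatile uint32_t kick;       /* writing here restarts the count ("feeds")  */
    volatile uint32_t control;    /* bit 0 = enable                             */
} wdog_regs_t;

#define WDOG ((wdog_regs_t *)0x40002000u)   /* assumed base address */

/* Arm the watchdog so the bark interrupt fires before the bite reset. */
void wdog_start(uint32_t bark_ticks, uint32_t bite_ticks)
{
    WDOG->bark    = bark_ticks;    /* bark_ticks < bite_ticks */
    WDOG->bite    = bite_ticks;
    WDOG->control = 1u;
}

/* Called periodically from healthy code paths; as long as this happens
 * before bark_ticks elapse, neither bark nor bite is ever reached. */
void wdog_feed(void)
{
    WDOG->kick = 1u;
}

/* Bark interrupt handler: last chance to log diagnostics and recover
 * (by feeding) before the hardware proceeds to the bite reset. */
void wdog_bark_irq(void)
{
    /* dump state, flush logs, then either wdog_feed() or allow the reset */
}
```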
• Processor interface: This block monitors the behavior of the processor and
generates P0 elements that are essentially executed instructions and exceptions traced
in program order.
• Trace generation: The trace generation block generates various trace packets based
on P0 elements.
• Filtering and triggering resources: You can limit the amount of trace data generated
by the ETM, through the process of filtering. For example, generating trace only in a
certain address range. More complicated logic analyzer style filtering options are also
available. The ETM trace unit can also generate a trigger that is a signal to the trace
capture device to stop capturing trace.
• FIFO: The trace generated by the ETM trace unit is in a highly-compressed form.
The FIFO enables trace bursts to be flattened out. When the FIFO becomes full, the
FIFO signals an overflow. The trace generation logic does not generate any new trace
until the FIFO is emptied. This causes a gap in the trace when viewed in the
debugger.
• Trace out: Trace from FIFO is output on the synchronous AMBA ATB interface.
• Syncbridge: The ATB interface from the trace out block goes through an ATB
synchronous bridge.
• Reset: The reset for ETM trace unit is the same as a cold reset for the processor. The
ETM trace unit is not reset when warm reset is applied to the processor so that tracing
through warm processor reset is possible. If the ETM trace unit is reset, tracing stops
until the ETM trace unit is reprogrammed and re-enabled. However, if the processor
is reset using warm reset, the last few instructions provided by the processor before
the reset might not be traced.
Purpose:
❖ ETM (Embedded Trace Macrocell): ETM is primarily used for tracing program
execution flow and capturing trace data related to program behavior. It helps in
understanding how the program is executing and identifying performance bottlenecks
or bugs.
❖ ITM (Instrumentation Trace Macrocell): ITM is used for inserting custom trace
messages into the trace stream. It allows developers to add their own trace events or
debug information into the trace output without affecting the program's execution.
Functionality:
❖ ETM: ETM captures a detailed trace of program execution, including instruction
addresses, data accesses, and control flow information. It provides a comprehensive
view of how the processor is executing instructions.
❖ ITM: ITM is more focused on providing debug and trace information specific to the
developer's needs. It allows inserting printf-style debug messages, timestamps, or
other custom information into the trace stream.
Trace Interface:
❖ ETM: ETM typically uses a dedicated trace port or interface to stream trace data to
an external trace capture device or debugger. It generates a rich trace stream
containing information about executed instructions and events.
❖ ITM: ITM is usually integrated into the processor core and communicates with the
debugger or trace capture unit through the CoreSight debug and trace architecture. It
provides a flexible way to add custom trace messages without requiring an additional
trace port.
Usage:
❖ ETM: ETM is commonly used in performance analysis, debugging complex
software, and optimizing code for better execution efficiency. It is especially valuable
in understanding real-time system behavior.
❖ ITM: ITM is used for debugging and tracing at a higher level of abstraction, allowing
developers to insert trace messages or markers in the code to track specific events or
conditions during program execution.
RISC-V Architecture:
❖ RISC-V is an open-source instruction set architecture (ISA) that provides a modular
and extensible framework for designing processors. It defines a base ISA along with
optional extensions for various functionalities.
Multicore Configuration:
❖ In a multicore RISC-V subsystem, multiple RISC-V processor cores are integrated
onto a single chip or within a system-on-chip (SoC) design. These cores can operate
independently and can execute instructions concurrently, allowing for parallel
processing.
Benefits:
❖ Parallelism: Multicore architectures offer parallelism, enabling multiple tasks or
threads to run simultaneously. This can lead to improved performance and scalability
for applications that can be parallelized.
❖ Fault Tolerance: Multicore systems can provide fault tolerance by allowing tasks to
be distributed across multiple cores. If one core fails or experiences issues, the
system can continue functioning with the remaining cores.
❖ Resource Utilization: By distributing workloads across cores, multicore systems can
utilize resources more efficiently, optimizing power consumption and overall system
throughput.
Interconnect and Communication:
❖ Efficient communication and synchronization mechanisms are crucial in multicore
systems. Inter-core communication can be facilitated through shared memory,
message passing, or dedicated interconnects depending on the system's design and
requirements.
❖ Synchronization primitives such as locks, semaphores, and barriers are used to
coordinate access to shared resources and ensure correct behavior in concurrent
execution scenarios, as the sketch below shows.
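As a concrete example of such a primitive, here is a small POSIX threads program in which a mutex protects a shared counter; it is a generic host-side illustration, not RISC-V-specific code.

```c
#include <pthread.h>
#include <stdio.h>

/* Two threads (possibly on different cores) incrementing a shared
 * counter; the mutex serializes access so no increments are lost. */
static long counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* acquire before touching shared data */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* always 2000000 with the lock */
    return 0;
}
```

Build with -pthread; without the lock, the two threads race and the final count is usually less than 2000000.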
Software Support:
❖ Operating systems (OS) and programming models play a vital role in leveraging
multicore architectures effectively. Multicore-aware OS kernels can schedule tasks
across cores, manage thread synchronization, and optimize resource allocation.
❖ Parallel programming frameworks and languages (e.g., OpenMP, MPI, pthreads)
provide abstractions and APIs for developing parallel software that can utilize
multiple cores efficiently.
Scalability:
❖ Multicore RISC-V subsystems can be designed with scalability in mind, allowing for
configurations ranging from a few cores to many cores depending on the target
application and performance requirements.
❖ Scalability considerations include power efficiency, thermal management, memory
hierarchy design, and software scalability to harness the full potential of the
multicore architecture.
Processor Cores:
❖ Choose the RISC-V processor cores based on the target application and performance
requirements. Common choices include cores based on the RV32I or RV64I base
ISA with optional extensions such as M (integer multiplication and division), A
(atomic instructions), F (single-precision floating-point), D (double-precision
floating-point), and C (compressed instructions).
❖ Determine the number of cores based on the desired level of parallelism and
workload distribution. Common configurations include dual-core, quad-core, octa-
core, or more depending on the scalability needs.
Memory Hierarchy:
❖ Design the memory hierarchy to support multiple cores efficiently. This includes the
instruction cache, data cache, shared L2 cache (if applicable), and system memory
(RAM).
❖ Implement cache coherence protocols such as MESI (Modified, Exclusive, Shared,
Invalid) or MOESI (Modified, Owned, Exclusive, Shared, Invalid) to maintain data
consistency across multiple cores sharing the same memory regions; a minimal MESI
sketch follows this list.
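A minimal sketch of the MESI idea, covering only the snoop-side transitions with the bus operations reduced to a read/write flag; a real protocol also covers the requesting core's own transitions and the write-back of Modified data.

```c
/* The four MESI states and the snoop-side transition a core applies
 * when it observes another core's bus transaction for a line it holds. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state_t;

typedef enum { BUS_READ, BUS_WRITE } bus_op_t;

mesi_state_t snoop_transition(mesi_state_t current, bus_op_t op)
{
    if (op == BUS_WRITE)
        return INVALID;          /* another writer: our copy is now stale        */

    /* BUS_READ: a Modified line must be written back (or supplied) first,
     * then every holder keeps the line in the Shared state. */
    if (current == MODIFIED || current == EXCLUSIVE)
        return SHARED;
    return current;              /* SHARED stays SHARED, INVALID stays INVALID   */
}
```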
Interconnect Fabric:
❖ Choose an appropriate interconnect fabric to connect the processor cores, caches,
memory controllers, and other peripherals within the multicore subsystem.
❖ Considerations include bandwidth, latency, scalability, and support for cache
coherency protocols. Common interconnect technologies include AXI (Advanced
eXtensible Interface), AHB (Advanced High-performance Bus), or custom on-chip
interconnect designs.
Memory Controllers:
❖ Design memory controllers for interfacing with system memory (RAM) and
managing memory accesses from multiple cores.
❖ Ensure memory controller performance matches the bandwidth requirements of the
multicore system and supports features like out-of-order memory accesses,
interleaving, and error correction (ECC).
Peripheral Interfaces:
❖ Include interfaces for peripherals such as UART (Universal Asynchronous
Receiver-Transmitter), SPI (Serial Peripheral Interface), I2C (Inter-Integrated Circuit), GPIO
(General-Purpose Input/Output), timers, and interrupt controllers.
❖ Design peripheral controllers or use standard IP blocks compatible with the RISC-V
ecosystem to enable communication with external devices and peripherals.
Power Management:
❖ Implement power management features to optimize energy consumption in the
multicore subsystem. This includes dynamic voltage and frequency scaling (DVFS),
clock gating, power domains, and sleep modes for idle cores.
❖ Consider thermal management strategies to prevent overheating in high-performance
multicore designs.
Debugging and Trace:
❖ Integrate debugging and trace features to facilitate software development, debugging,
and performance analysis on the multicore system.
❖ Utilize on-chip debug interfaces such as JTAG (Joint Test Action Group) or the
RISC-V Debug specification to enable debugging capabilities across multiple cores
simultaneously.
Software Support:
❖ Develop or port an operating system (OS) with multicore support, such as Linux with
SMP (Symmetric Multiprocessing) or a real-time operating system (RTOS) tailored
for multicore architectures.
❖ Provide software libraries, drivers, and tools that enable parallel programming, thread
synchronization, and efficient utilization of multicore resources.
Multi-core snoop filters are hardware structures used in multicore processor systems
to optimize cache coherence protocols, particularly in snooping-based coherence
schemes. These filters help reduce the overhead associated with snooping on the
system bus by selectively filtering and processing coherence-related transactions
based on the caching state of individual cores. Here's an explanation of multi-core
snoop filters:
Purpose:
❖ The primary purpose of multi-core snoop filters is to improve the efficiency of cache
coherence in multicore systems. They aim to minimize the amount of unnecessary
coherence-related traffic on the system bus, thereby reducing latency and power
consumption associated with snooping.
Operation:
❖ Each core in a multicore system is associated with a snoop filter. The snoop filter
monitors the coherence transactions happening on the system bus and selectively
filters out transactions that are irrelevant to the core's cache state.
❖ When a coherence transaction occurs (e.g., a write to a memory location), the snoop
filter of each core determines whether the transaction is relevant to that core's cache.
If it is, the core takes appropriate action (e.g., updating its cache). If not, the
transaction is ignored, reducing unnecessary bus traffic.
Filtering Mechanisms:
❖ Multi-core snoop filters use various mechanisms to filter coherence transactions
efficiently:
1) Bloom Filters: Bloom filters are probabilistic data structures used to quickly test
whether an element is a member of a set. Each core's snoop filter maintains a Bloom
filter representing the cache state of that core. When a coherence transaction occurs,
the Bloom filter is consulted to determine whether the transaction is relevant to the
core's cache (a minimal sketch follows this list).
2) Tag-Based Filtering: Each coherence transaction carries a tag indicating the memory
address being accessed and the operation being performed (e.g., read or write). The
snoop filter compares this tag with the tags of cached data in the core's cache to
determine relevance.
3) Cache Coherence State Encoding: The snoop filter may encode the coherence state
of cached lines (e.g., MESI or MOESI states) to quickly determine whether a
coherence transaction affects the core's cache.
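A minimal C sketch of the Bloom filter idea from point 1 above; the filter size and hash functions are arbitrary illustrations chosen for brevity rather than realism.

```c
#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS 1024u    /* illustrative filter size */

typedef struct {
    uint8_t bits[FILTER_BITS / 8];
} bloom_t;

/* Two simple hash functions over the block address; real snoop filters
 * use hardware-friendly hashes, these are just for illustration. */
static unsigned h1(uint64_t addr) { return (unsigned)((addr * 2654435761u) % FILTER_BITS); }
static unsigned h2(uint64_t addr) { return (unsigned)((addr ^ (addr >> 13)) % FILTER_BITS); }

static void set_bit(bloom_t *f, unsigned i)       { f->bits[i / 8] |= (uint8_t)(1u << (i % 8)); }
static bool get_bit(const bloom_t *f, unsigned i) { return f->bits[i / 8] & (1u << (i % 8)); }

/* Record that this core now caches the block at `addr`. */
void bloom_insert(bloom_t *f, uint64_t addr)
{
    set_bit(f, h1(addr));
    set_bit(f, h2(addr));
}

/* Snoop check: "definitely not cached" (false) lets the filter drop the
 * transaction; "possibly cached" (true) forces a real tag lookup. */
bool bloom_may_contain(const bloom_t *f, uint64_t addr)
{
    return get_bit(f, h1(addr)) && get_bit(f, h2(addr));
}
```

Because Bloom filters can produce false positives but never false negatives, a "possibly cached" answer only costs an extra tag check, while a "definitely not cached" answer lets the snoop be filtered out safely.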
Benefits:
❖ Reduced Bus Traffic: By filtering out irrelevant coherence transactions, multi-core
snoop filters reduce the amount of unnecessary bus traffic, improving overall system
performance and reducing power consumption.
❖ Lower Latency: Filtering coherence transactions at the core level reduces the latency
associated with snooping, allowing cores to respond more quickly to relevant
coherence events.
❖ Scalability: Snoop filters help maintain scalability in large multicore systems by
limiting the overhead of coherence protocols as the number of cores increases.
Implementation:
❖ Snoop filters are typically implemented as hardware structures integrated into each
core's cache coherence logic. They require efficient hardware designs to perform
filtering operations quickly and accurately.
RISC-V ISA
❖ Base Integer ISA: The base integer ISA (RV32I, RV64I, RV128I) provides essential
instructions for integer arithmetic, logical operations, memory access, control flow,
and system interaction.
❖ Standard Extensions: RISC-V includes several standard extensions that add
functionality beyond the base integer ISA. Some notable extensions include:
➢ M: Integer Multiplication and Division
➢ A: Atomic Instructions
➢ F: Single-Precision Floating-Point
➢ D: Double-Precision Floating-Point
➢ C: Compressed Instructions
➢ Zicsr: Control and Status Register (CSR) Instructions
➢ Zifencei: Instruction-Fetch Fence
INSTRUCTION FORMATS:
R-Type Instructions (Register-Register Arithmetic/Logic):
❖ These instructions perform operations between two source registers and store the
result in a destination register.
❖ Format: opcode rd, rs1, rs2
➢ opcode: Specifies the operation (e.g., add, subtract, bitwise AND).
➢ rd: Destination register.
➢ rs1, rs2: Source registers.
Examples:
• add (addition): add x3, x1, x2 adds the contents of registers x1 and x2 and stores the
result in register x3.
• sub (subtraction): sub x4, x5, x6 subtracts the contents of register x6 from x5 and
stores the result in register x4.
• and (bitwise AND): and x7, x8, x9 performs a bitwise AND operation between
registers x8 and x9 and stores the result in register x7.
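For completeness, here is a small C helper that packs the R-type fields into their 32-bit encoding (funct7, rs2, rs1, funct3, rd, opcode); the two example encodings correspond to the add and sub instructions shown above.

```c
#include <stdint.h>
#include <stdio.h>

/* Encode a RISC-V R-type instruction. The 32-bit layout is:
 *   funct7 [31:25] | rs2 [24:20] | rs1 [19:15] | funct3 [14:12] | rd [11:7] | opcode [6:0]
 */
uint32_t encode_r_type(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                       uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

int main(void)
{
    /* add x3, x1, x2 : opcode 0x33 (OP), funct3 0x0, funct7 0x00 */
    uint32_t add_x3_x1_x2 = encode_r_type(0x00, 2, 1, 0x0, 3, 0x33);

    /* sub x4, x5, x6 : same opcode/funct3, but funct7 0x20 selects subtraction */
    uint32_t sub_x4_x5_x6 = encode_r_type(0x20, 6, 5, 0x0, 4, 0x33);

    printf("add x3,x1,x2 = 0x%08X\n", add_x3_x1_x2);
    printf("sub x4,x5,x6 = 0x%08X\n", sub_x4_x5_x6);
    return 0;
}
```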